Directory services support browsing, and a combination of browsing with search over a limited set of attributes, for the content managed or aggregated by the site. When domain information is captured, a host of people at each company providing directory services classifies new and old Web pages to ensure the quality of the domain search results.
This is an extremely human-intensive process. The human catalogers or editors use hundreds of classification or keyword terms that are mostly proprietary to that company. NorthernLight uses a mostly automated classification apparatus that classifies newly found content based on comparison with a large set of subject terms. Several Web sites have classified their assets into domains and attributes. Customers looking for videos can search mgm. Video indexing systems such as Excalibur allow a company to segment its video assets and to enter and search by an arbitrary number of user-specified attributes.
Unfortunately, this powerful search is restricted to one particular Web site only. No large-scale attribute search for all kinds of documents has been available for the whole Internet. While WebCrawlers can reach and scan documents in the farthest locations, the classification of structurally very different documents has been the main obstacle to building a metabase that allows the desired comprehensive attribute search against heterogeneous data.
The context of a search request is necessary to resolve ambiguities in the search terms that the user enters. Due to the unstructured and heterogeneous nature of Web resources, every Web site uses a different terminology to describe similar things. A semantic mapping of terms is then necessary to ensure that the system serves documents within the same context in which the user searched. The Context Interchange Network (COIN) developed at MIT presents a system that translates requests into different contexts as required by a search against disparate data sources.
Its support for semantics is very limited, primarily dealing with unit differences and functions for mapping values. No domain modeling is supported. In contrast, in the present invention the context of digital media is determined before metadata is inserted into the metabase. Current manual or automated content acquisition may use metatags that are part of an HTML page, but these are proprietary and have no contextual meaning for general search applications.
However, this would require widespread adoption of this possible future standard, and appropriate use of DAML by page and site creators, before suitable agents can be written. The concept of a Semantic Web is an important step forward in supporting higher precision, relevance and timeliness in using Web-accessible content.
Some of the current use of this term does not reflect the use of various components that support broad and important aspects of semantics, including context, domain modeling, and knowledge, and primarily focuses on terminological and ontological components as further described in R.
Research in heterogeneous database management and information systems has addressed the issues of syntax, structure and semantics, and has developed techniques to integrate data from multiple databases and data sources. Large-scale deployment and the associated automation have, however, not been achieved in the past. One key issue in supporting semantics is that of understanding and modeling context. Currently, syntax and structure-based methods pervade the entire Web—both in its creation and the applications realized over it. The challenge has been to include semantics in creating physical or virtual organizations of the Web and its applications—all without imposing new standards and protocols as required by current proposals for the Semantic Web.
These advantages and others are realized by the present invention. The present invention is directed to software, a system and a method for creating a database of metadata (a metabase) for a variety of digital media content and data sets, including TV and radio content potentially delivered on the Internet. The data sets may be accessed locally or remotely via a suitable communications channel such as the Internet. This semantic-based method captures and enhances domain or subject specific metadata of digital media content, including the specific meaning and intended use of original content.
The digital media content can be semi-structured text, audio, video, animations, etc. To support semantics, the present invention uses a WorldModel that includes specific domain knowledge, ontologies as well as a set of rules relevant to the original content. The metabase is also dynamic because it may track changes to locally or remotely accessible content, including live and archival TV and radio programming.
Because these tasks would be labor intensive if performed manually, two methods and apparatus have been designed and implemented. First, a distributed method and apparatus to quickly produce agents which automatically create and manage digital media metadata. Second, a WorldModel that embodies the essence of semantics that is used by the agents and captured in the metadata they produce.
The WorldModel cooperates with an associated Knowledgebase that uses semantics to enhance relevant information that may not be present in the original source. Assets, profile and personalization information, as well as advertisement and e-commerce information, are correlated through the WorldModel. The metabase created by this system represents a unique and proprietary map of a part of the Web, and achieves one unique form of the realization of a Semantic Web.
Semantic search is the first application of a Semantic Web and consists of a set of methods for browsing and searching. Accordingly, a variety of interfaces and look-and-feel methods have been developed. Additional methods have been developed to utilize semantics related to a user's current interest and information need, as well as historical information, to achieve semantics-based personalization and profiling (Semantic Profiling) and semantics-based targeted advertisements (Semantic Advertising).
Semantics may be exchanged and utilized between partners, including content owners (or content syndicators or distributors), destination sites (the sites visited by users), and advertisers (or advertisement distributors or syndicators), to improve the value of content ownership, advertisement space (impressions), and advertisement charges. A one-click-media-play option is provided that utilizes content rights and ownership information to enhance the user experience as well as bring increased revenues through better customization and targeting of advertisement and e-commerce for partners involved in Semantic Web applications.
Additional advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate one embodiment of the invention and together with the description, serve to explain the principles of the invention.
A preferred embodiment of the invention is now described in detail. Referring to the drawings, like numbers indicate like parts throughout the views. In the following discussion, the following terms will have the following definitions unless the context clearly dictates otherwise. XML: a standard that allows definition, transmission, validation, and interpretation of data between applications and between organizations. Asset: a representable, exchangeable or tradable unit of content that is described by metadata (a URL on the Internet is an adequate substitute for the files, documents, or Web pages themselves).
Blended Semantic Browsing and Querying (BSBQ): a method of combining browsing and querying to specify a search for information that also utilizes semantics, especially the domain context provided by browsing and the presentation of relevant domain-specific attributes for specifying queries. Content or Media Content: a text document, image, audio, video, or original radio or television programming. Domain: a comprehensive modeling of information, including digital media and all data or information such as those accessible on the Web, with the broadest variety of metadata possible.
Metadata: Data, Information or Assets are described by additional data about them, termed metadata.
A specific metadata tag, value, or text (as on a Web page, in HTML, or in a database) by itself constitutes syntax only. One-Click Media Play: a novel method and apparatus for displaying the original source of the asset, and playing the asset itself, subject to business rules including content ownership and partnership considerations.
Ontology: a universe of subjects or terms (also, categories and attributes) and relationships between them, often organized in a tree or forest structure; includes a commitment to uniformly use the terms in a discourse in which the ontology is subscribed to or used. Search result (or hits): a listing of results provided by a state-of-the-art search engine, typically consisting of a title, a very short (usually 2-line) description of a document or Web page, and a URL for the Web page or document.
Syntax: the syntax of data or a message is the use of words, without the associated meaning or use. Syntactic approaches do consider differences in machine-readable aspects of data representation, or formatting. The use of an understanding of placement (as in a title versus a paragraph of the text or data value), or the description and use of an information structure (called a data or information model, or schemata), are aspects of structure.
Structure: Structure implies the representation or organization of data and information, as for example supported by HTML or another markup language. It may also involve recognition of links (as in a hypertext reference) or of data typically consisting of a set of metadata attributes (or names) and their values.
Subject or Category: a limited form of domain representation, namely a term descriptive of a domain. Semantics: Semantics implies the meaning and use of data, or relevant information that is typically needed for decision making. Domain modeling (including directory structures, classifications and categorizations that organize information), ontologies that represent relationships and associations between related terms, context, and knowledge are important components of representing and reasoning about semantics.
Analysis of syntax and structure can also lead to semantics, but only partially. Since the term semantics has been used in many different ways, its use herein is directed to those cases that at a minimum involve domain-specific information or context. Semantic Web: the concept that Web-accessible content can be organized semantically, rather than through syntactic and structural methods. Semantic Search: allowing users to use semantics, including domain-specific attributes, in formulating and specifying a search, and utilizing context and other semantic information in processing the search request.
It is also an application of the Semantic Web. Semantic Profiling: capture and management of user interests and usage patterns utilizing a semantics-based organization such as the WorldModel. Semantic Advertising: utilizing semantics to target advertising to users, utilizing semantics-based information such as that available from semantic search, semantic profiling and the WorldModel. The present invention is an extension of the concept of a Semantic Web that requires no prior standards or participation on the part of providers of digital media content. This is made possible by two essential components of the present invention: the WorldModel and the apparatus by which metadata is acquired and enhanced.
The data sets may be acquired either locally or remotely such as via the Internet or other suitable communications channel. By applying semantics at this early stage of content acquisition, the task of applying semantics to searching and advertising becomes simpler. One component of the present invention is the WorldModel. The WorldModel starts with a hierarchy of domains.
Each domain is divided into subcategories that can have further subcategories. Attributes of parent categories are inherited. Current directory services offer similar hierarchies. However, the WorldModel is much more than the domain hierarchy and lists of documents with optional keyword or category tags that are supported by directories. It is a comprehensive infrastructure for creating the Semantic Web from the existing Web and for realizing the applications of the Semantic Web, including Semantic Search, Semantic Profiling, and Semantic Advertisement. It includes more comprehensive domain modeling (including domain-specific metadata and attributes), knowledge of sources, rules for mapping (including syntactical, structural and semantic mappings), an organizational basis for search, profiling and advertisement-specific information, rules and knowledge, etc.
In one embodiment, the WorldModel manifests itself in the form of a collection of XML documents, but could also be implemented as tables in a relational database. The WorldModel is somewhat similar to a data dictionary in that it describes in more precise terms what kind of information is stored in a database. However, the WorldModel not only stores data types, descriptions, and constraints on the items stored in the present invention's metabase, but it also captures the semantics of the data in the metabase.
It also has a vital role in the creation of metabase records, the management of the extraction process, and the customization of results for different customers. In its simplest form the WorldModel is a tree structure of media classification, also called a domain model. At its root is a list of attributes common to most digital media, as shown in FIG. Referring to FIG., similar to the way child classes in object-oriented programming can inherit 22 the characteristics of their parent classes, several domains 23 extend the generic Asset description. Such domains include Travel, News, Entertainment, Sports, and so on.
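The attribute inheritance just described can be sketched as a small Python model. This is a minimal illustration only; the class and attribute names below are assumptions, not the actual representation used by the system (which the text says may be XML documents or relational tables).

```python
# Illustrative sketch of a domain hierarchy with attribute inheritance.
# All names here are hypothetical examples, not the real WorldModel schema.

class Domain:
    def __init__(self, name, attributes, parent=None):
        self.name = name
        self.own_attributes = list(attributes)
        self.parent = parent

    def attributes(self):
        """Attributes of this domain plus everything inherited from ancestors."""
        inherited = self.parent.attributes() if self.parent else []
        return inherited + self.own_attributes

# Root: attributes common to most digital media.
asset = Domain("Asset", ["title", "url", "media_type", "duration"])

# Domains extend the generic Asset description with domain-dependent attributes.
news = Domain("News", ["reporter", "location", "event_date"], parent=asset)
travel = Domain("Travel", ["destination", "season"], parent=asset)
```

A News asset thus carries both the generic Asset attributes and its own domain-dependent ones, while a Travel asset does not see the News attributes.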
A media Asset belonging to one of these domains would likely have a set of attributes that media belonging to another domain would not have. For instance, a news Asset usually has a reporter or a location associated with a news event. These are attributes that have no meaning when the media is a movie trailer or a song.
Such attributes are called domain-dependent attributes. Each of these attributes has a set of attribute properties. These properties contain such things as the expected data type of the attribute, formatting constraints for values of the attribute, and the names of associated mapping functions. Mapping functions are applied when the value of the attribute as obtained by the extractor does not match some canonical form.
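A mapping function of the kind described above might, for example, normalize dates. The following sketch is a hypothetical instance; the specific formats and the ISO target form are assumptions, since the text only describes mapping functions abstractly.

```python
# Hypothetical mapping function: map site-specific date strings to a
# canonical form (here assumed to be ISO YYYY-MM-DD).
import datetime

def map_event_date(raw):
    """Try several common date formats; return None if no format matches."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # value could not be canonicalized
```

The Metabase Agent would invoke such a function whenever an extracted value fails the attribute's formatting constraint.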
Also included in an attribute's property list is a textual description of the attribute (what the attribute means). These descriptions are similar to the description given in FIG. Each attribute also has information concerning its display in a default query interface. However, it is usually sufficient to have only three or four levels of sub-domains. In addition to its list of attributes, every domain has a set of properties. Domain attributes are characteristics of assets that belong to a particular domain, whereas domain properties are a list of characteristics of the domain itself.
Such properties include default scheduling information for extractors that create assets belonging to a domain. If extraction times are not specifically assigned to an extractor, it will be scheduled automatically according to its domain. For instance, since news Web sites are updated frequently, a newly created news extractor would, by default, be run more frequently than a travel video extractor. Also contained in the list of domain properties are rules concerning the default ordering of search results. For instance, it is useful to sort news assets by the date the event occurred, while travel videos are best sorted by location.
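The default result ordering carried in domain properties can be sketched as follows. The sort keys, the "newest first" choice for news, and the data layout are illustrative assumptions.

```python
# Sketch of domain-default result ordering: each domain's properties
# name a sort key for its search results. Keys and sample data are invented.
DOMAIN_SORT_KEY = {"News": "event_date", "Travel": "location"}

def order_results(domain, assets):
    key = DOMAIN_SORT_KEY[domain]
    reverse = (key == "event_date")  # assumption: most recent news first
    return sorted(assets, key=lambda a: a[key], reverse=reverse)

news_hits = [
    {"title": "A", "event_date": "2001-06-01"},
    {"title": "B", "event_date": "2001-06-05"},
]
```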
Associated with each domain is a list of extracted sites belonging to that domain; the sites for extraction, as previously mentioned, may be local or remote and may be characterized by a URL. For each site in this list, information including the following is stored, as depicted in FIG. Another aspect of the present invention is its ability to acquire metadata content from many sources, typically identified by a URL, with a minimum of human involvement. This process is performed while preserving the original context of the metadata.
In order to keep the metabase up-to-date with the latest news or movie trailers, metadata extraction from various sources must be scheduled. For instance, it is desirable to check a single news site once an hour for any breaking news. On the other hand, the extraction of trailers for newly released movies is less time-critical and could be performed once a week. This heuristically gained information is stored in the WorldModel.
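The fallback from extractor-specific schedules to domain defaults described above can be sketched as follows. The interval values and dictionary layout are illustrative assumptions, not the stored form of the WorldModel.

```python
# Sketch of domain-default extraction scheduling: an extractor with no
# explicit schedule falls back to its domain's default interval.
# Intervals here are invented examples matching the text's heuristics.

DOMAIN_DEFAULT_INTERVAL_HOURS = {
    "News": 1,             # breaking news: check roughly hourly
    "Entertainment": 168,  # e.g. movie trailers: roughly weekly
}

def extraction_interval(extractor):
    """An explicitly assigned schedule wins; otherwise use the domain default."""
    if extractor.get("interval_hours") is not None:
        return extractor["interval_hours"]
    return DOMAIN_DEFAULT_INTERVAL_HOURS[extractor["domain"]]
```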
Five modules contribute to this task. The process starts by identifying a source containing digital media data sets. (The foregoing discussion refers to a Web site as an example source of digital media for retrieval; other potential sources, such as a local drive or another remote resource such as FTP, GOPHER, etc., may also be used.) Several factors determine whether a source is actually worth extracting. These include the volume of media files and the quality of the available metadata about those media files. Once a Web site is deemed worthy of undergoing this process, the assets that will be acquired from the site are assigned to a particular domain in the WorldModel.
Next, crawling and extraction rules are written that specify where and how to retrieve the metadata from the Web site. For each attribute of a particular asset type (domain), an extractor writer provides a rule that specifies where to find a value of this attribute. Crawling and extraction rules are input to WebCrawler and Extractor programs that traverse Web sites and retrieve digital media metadata from selected pages. These generated assets contain values for each attribute name belonging to the domain of that Web site. Once created, the assets are sent to a Metabase Agent that is in charge of enhancing and inserting them into a database of records.
In order to enhance the assets, the Metabase Agent uses information stored in the WorldModel as well as a Knowledgebase. The Knowledgebase is a collection of tables containing domain-specific information and relationships. After insertion into the metabase, the assets are then ready to be searched.
A WebCrawler 3 is a piece of software, invoked on a remote or local host, which begins reading pages from a particular site and determines which of these pages are extractable, guided by the crawling rules written for that site. These rules dictate where (on which page) the crawler should begin its search, which directories the crawler must remain within, and define the characteristics of an extractable page.
Without such rules, a WebCrawler would likely find a link off of the site it was assigned to crawl and begin aimlessly reading the entire Web. A crawler can often recognize an extractable page by simply examining the URL.
If the URL of the page being examined follows a certain pattern, there is a good chance that the page contains a link to an audio or video asset. When the WebCrawler determines that it has found an extractable page, it sends the contents of that page 6 on to another remote piece of software (an Extractor) that retrieves the valuable metadata from the page. Any number of WebCrawlers can run on any host that is set up to do so. Extractors 7 are programs that are designed to find information about digital media from a Web page. Their ability to retrieve values for domain-specific attributes is critical to capturing the semantics of the information on the source site.
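The URL-based recognition of extractable pages described above might look like the following sketch. The site, directory layout, and pattern are invented for illustration.

```python
# Sketch of a crawling rule that classifies pages as extractable by URL
# pattern alone. The hypothetical site places video story pages under
# /video/ with an .html suffix.
import re

EXTRACTABLE_URL = re.compile(r"^https?://www\.example\.com/video/.+\.html$")

def is_extractable(url):
    """True when the URL matches the site's extractable-page pattern."""
    return bool(EXTRACTABLE_URL.match(url))
```

Pages passing this test would be forwarded to an Extractor; all others are merely followed for further links within the permitted directories.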
By assuming that Web pages from a particular site follow a pattern or have some recognizable structure, rules can be written that reliably retrieve small sections of text from those pages. Normally, these small pieces of text have some relationship to a media file whose link is on the same page. In order for this information to be meaningful, these pieces of text need to be mapped to domain-specific attributes in the WorldModel.
When a WebCrawler finds an extractable page, it will send this page, as well as the name of the site on which it is crawling, to one extractor in a large pool of extractors running on another machine. An extractor is designed to work on only one of these pages at a time. Once an extractor receives a page, it looks up a set of extraction rules 8 associated with the site from which the page came. These rules list the metadata attributes for the type of media that this site contains, as well as rules that describe where to find values for these attributes within the page.
The set of attributes associated with, for example, a news video includes reporter, location, event date, etc. The extractor scans the Web page content for pieces of text that match the patterns specified by the extraction rules. After the extractor has attempted to find every attribute in its list of extraction rules, it creates an XML document containing attribute-value pairs 9.
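An extractor pass of the kind just described can be sketched as follows. The page markup, the rule patterns, and the XML layout are all illustrative assumptions; the real toolkit expresses rules in an extended PERL regular expression language.

```python
# Sketch of one extractor pass: apply a per-attribute regex rule to a
# page and emit an XML document of attribute-value pairs.
# The rules and sample page are invented for illustration.
import re
from xml.etree import ElementTree as ET

extraction_rules = {
    "title":    r"<h1>(.*?)</h1>",
    "reporter": r"By ([A-Z][a-z]+ [A-Z][a-z]+)",
}

page = "<h1>Storm hits coast</h1> <p>By Jane Smith</p>"

asset = ET.Element("asset", {"type": "News"})
for attr, pattern in extraction_rules.items():
    m = re.search(pattern, page)
    if m:  # attributes whose pattern does not match are simply omitted
        ET.SubElement(asset, attr).text = m.group(1)

xml_doc = ET.tostring(asset, encoding="unicode")
```

The resulting document, carrying the asset type and the attribute-value pairs, is what gets sent on to the Metabase Agent.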
This document is sent to the Metabase Agent. Extraction rules for a site can usually be created in five minutes to an hour, depending on the complexity and irregularity of the site, using an Extractor Toolkit such as that shown in FIG. This toolkit allows a relatively non-technical person to write extraction rules, such as those shown in FIG. The ability to retrieve text that matches a particular pattern is provided by the PERL regular expression language.
To this language, the present invention adds several important features. The basic procedure for writing an extraction rule begins with choosing an attribute to extract. A list of possible attributes is displayed in the toolkit for a given category of asset (i.e., Baseball, News, Technology). The extractor author then tries to visually determine a pattern that describes either the location or the style of the text to be retrieved.
For example, the desired text may be the first bold text on the page, the text after a particular picture, or the last URL link on the page. The author then examines the HTML source of several similar pages to verify that such rules will consistently retrieve the correct text. Finally, the author types the rule for an attribute and can test it against several different pages. An example of a typical extraction rule is given in FIG. Each extraction rule contains potentially three components. The first component designates the name of the attribute. The second component indicates whether multiple assets are generated from a single data set subject to extraction, and whether the attribute will be found in the common shared text rather than the text belonging to the individual assets.
The third component designates the pattern for which the extractor should search to locate the value for the attribute designated in the first component. Referring back to FIG., the XML document that the Metabase Agent 10 receives contains the asset type and a list of attribute-value pairs. As an example, consider a particular asset extracted from CNN. In order to improve the quality of the data for this asset and enable higher recall, the Metabase Agent must consult the WorldModel.
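The enhancement step performed by the Metabase Agent can be sketched as a Knowledgebase lookup. The table contents and attribute names below are invented for illustration; the text only says the Knowledgebase holds domain-specific information and relationships.

```python
# Sketch of Metabase Agent enhancement: consult a domain-specific
# Knowledgebase table to add related information that is absent from
# the extracted asset. Table contents are hypothetical examples.

KNOWLEDGEBASE = {
    # hypothetical location table: city -> (state, country)
    "locations": {"Atlanta": ("Georgia", "USA")},
}

def enhance(asset):
    """Return a copy of the asset with knowledge-derived attributes added."""
    enhanced = dict(asset)
    loc = KNOWLEDGEBASE["locations"].get(asset.get("location"))
    if loc:
        enhanced["state"], enhanced["country"] = loc
    return enhanced

raw_asset = {"type": "News", "location": "Atlanta", "reporter": "Jane Smith"}
```

Adding such derived attributes is what enables the higher recall mentioned above: a search for news from "USA" can now match an asset whose source page only said "Atlanta".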