Semantic 1 : of or relating to meaning in language
That’s the dictionary definition of “semantic”. Applied to the Web, it means relating content by its meaning rather than by its exact wording. Let us take the example of a keyword search on Google. I type in “Blog”, take a snapshot of the results, and then key in “Weblog”. Only one result appears in the top 10 of both samples.
Blog and Weblog: don’t we use these interchangeably? Don’t they mean the same thing? Semantically, to a human, YES; to the search engine indexing the web content, NO. That is exactly the vision of the Semantic Web: search engines, and information retrieval in general, extracting data the way humans do.
Well, in the above example of “Blog” vs. “Weblog”, it’s not the search engine’s fault for failing to index the content in a desirable manner. To some extent the problem also lies in the HTML page that uses the terms “Blog” and “Weblog”. What if the HTML page header declared that all the terms in the page conform to a certain taxonomy? This is not uncommon; it is exactly what we do in a DTD or an XML Schema document. Take, for example, the <P> tag. The tag is defined in the HTML DTD and is well understood by the browser’s parsing and rendering engine. A browser semantically understands this tag as “the text which comes after this tag is a paragraph and should be rendered as such”. In the case of HTML the vocabulary is limited: a P tag is always a P tag. However, in the case of the English language, a “Blog” is a “Weblog”, which is an “Online Journal”, which is… the list continues.
Establishing these relationships is not trivial. A well-defined set of terms, related to one another as peers and parent-child nodes and carrying attributes: essentially this is an Ontology, a way of representing and conceptualizing knowledge.
One very good example of where this association works is a robot programmed to recognize fruits. The robot’s master writes the word “Mango” on the whiteboard. The robot quickly scans his Ontology (assuming that the robot in our example uses an Ontology for knowledge representation) for a match. He finds an exact match for the word M-A-N-G-O. Then he traverses: Mango -> Mangifera indica (attribute of type Scientific Name) -> Fruit (parent node). The robot then thinks, “Mango is a Fruit”. But how does he find out whether the fruit is sweet or sour, is grown in a tropical climate, has a large seed, grows on trees, or is rich in Vitamin C, Folate, Selenium and Pantothenic Acid? The answer lies within the Ontology, which could represent this extended knowledge as well.
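To make the robot’s traversal a little more concrete, here is a minimal sketch in Python. It assumes the Ontology is just a dictionary of nodes, each with one parent and a bag of attributes. The node names, the attribute names (`taste`, `climate`, `grows_on`, `edible`) and the helper functions are all made up for illustration; a real Ontology would be far richer and would more likely live in a dedicated format than in a dictionary.

```python
# Toy ontology. Every node, attribute name, and value here is illustrative.
ONTOLOGY = {
    "Mango": {
        "parent": "Fruit",
        "attributes": {
            "scientific_name": "Mangifera indica",
            "taste": "sweet",
            "climate": "tropical",
            "grows_on": "tree",
        },
    },
    "Fruit": {"parent": "Food", "attributes": {"edible": True}},
    "Food": {"parent": None, "attributes": {}},
}


def ancestors(term):
    """Return the chain of parent nodes above `term`, nearest first."""
    chain = []
    parent = ONTOLOGY[term]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return chain


def is_a(term, category):
    """True if `category` appears anywhere above `term` in the hierarchy."""
    return category in ancestors(term)


def lookup(term, attribute):
    """Find an attribute on the term itself, else inherit it from an ancestor."""
    for node in [term] + ancestors(term):
        attrs = ONTOLOGY[node]["attributes"]
        if attribute in attrs:
            return attrs[attribute]
    return None  # the ontology does not know


if __name__ == "__main__":
    print(is_a("Mango", "Fruit"))      # the "Mango is a Fruit" step
    print(lookup("Mango", "taste"))    # extended knowledge stored on the node
    print(lookup("Mango", "edible"))   # knowledge inherited from Fruit
```

The point of the sketch is that the “extended knowledge” is just more attributes, either on the Mango node itself or inherited from its parents; answering a new kind of question means adding data, not new logic.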
Going back to the search example, there are a couple of ways to solve this problem:
- While indexing the page, instead of indexing the terms themselves, index the generic id retrieved from a “super” Ontology. The hard part is locating the Ontology.
- Let web page authors expose the terms with some metadata around them. For (a hypothetical) example:
<p>This is my <so:onto id="757893" contextid="222">Weblog</so:onto></p>
- Convert the search term itself. For example, if I search for “Weblog”, two queries are made, one for “Blog” and one for “Weblog”, and the search results are de-duped and presented.
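The last option, converting the search term, is the easiest one to sketch. The Python below assumes a hand-written synonym group and a fake in-memory index standing in for a real search engine; `SYNONYM_GROUPS`, `FAKE_INDEX`, the example URLs, and both function names are hypothetical. In practice the synonyms would come from an Ontology rather than from a hard-coded set.

```python
# Hypothetical synonym groups; a real system would derive these from an ontology.
SYNONYM_GROUPS = [
    {"blog", "weblog", "online journal"},
]

# Hypothetical stand-in for a search index: query term -> ranked list of URLs.
FAKE_INDEX = {
    "blog": ["a.example/1", "b.example/2", "c.example/3"],
    "weblog": ["c.example/3", "d.example/4"],
}


def expand(term):
    """Return the term plus its synonyms, with the original term first."""
    key = term.lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return [key] + sorted(group - {key})
    return [key]


def search(term):
    """Run one query per expanded term and merge the results without duplicates."""
    seen = set()
    merged = []
    for query in expand(term):
        for url in FAKE_INDEX.get(query, []):
            if url not in seen:  # de-dupe across the separate queries
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    print(search("Weblog"))
    print(search("Blog"))
```

With this sketch, searching for “Blog” and searching for “Weblog” return the same set of pages, though not in the same order: results for the term actually typed come first, and the lists are simply concatenated. A real engine would have to re-rank the merged list, which is where most of the difficulty lies.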
Some work is already being done in the TAP Project. TAP is a successor to Alpiri, which was founded by R.V. Guha and Rob McCool, the same people behind TAP.