Machine-Readable, Structured Data With Meaningful Annotations

Until recently, software agents could not handle many kinds of information that could have been associated with files. Although file structure and extensions provided some information about files, much information could not be expressed. For example, a file with a .jpg extension has always represented a JPEG image, but it provided no information about the shutter speed, exposure program, F-stop, aperture, ISO speed rating, or focal length until the introduction of metadata formats such as Exif and XMP. However, embedding metadata in binary files is still not the most efficient way to share it, especially when the metadata is more generic than such format-specific properties. In the digital era, electronic files are being sold (e-books, MP3 files, and so on) that might be retrieved or played on many types of devices. A variety of metadata technologies can be used to express arbitrary information and represent any kind of knowledge associated with electronic documents in a machine-readable format.

Machine-readable data (automated data) is data stored in a machine-readable format, making it possible for automated software agents to access and process it without human intervention. To browsers, conventional web documents consisted of human-readable data only. In fact, information was not distinguished from the documents that contained it. In contrast to the conventional Web (the “Web of documents”), the Semantic Web is the “Web of data.” The Semantic Web provides machine-processable data, making it possible for software agents to “understand” the meaning of information (in other words, semantics) presented by web documents. This capability can be used in a variety of domains, such as museums, community sites, or podcasting.

Note that the word semantic is used on the Web in other contexts as well. For example, HTML5 supports semantic (in other words, meaningful) structuring elements, but this expression refers to the “meaning” of elements. In this context, the word semantic contrasts the “meaning” of elements, such as that of section (a thematic grouping), with the generic elements of older HTML versions, such as the “meaningless” div. The semantics of markup elements should not be confused with the semantics (in other words, machine-processability) of metadata annotations and web ontologies used on the Semantic Web. The latter can provide far more sophisticated data than the meaning of a markup element.

Conventional web documents can be extended with additional data that add meaning to them rather than structure alone. The Semantic Web is an approach intended to change how information on the Web is published and consumed. As early as 2001, Tim Berners-Lee described the rationale for the Semantic Web. On the Semantic Web, data can be retrieved automatically from seemingly unrelated fields in order to combine them, find relations, and make discoveries. The Semantic Web should be considered an extension of the conventional Web, not a replacement for it.

Two terms are frequently associated with the Semantic Web, although neither of them has a clear definition: Web 2.0 and Web 3.0. Web 2.0 is an umbrella term used for a collection of technologies that form the second generation of the Web, such as Extensible Markup Language (XML), Asynchronous JavaScript and XML (Ajax), Really Simple Syndication (RSS), and Session Initiation Protocol (SIP). They are the underlying technologies and standards behind instant messaging, Voice over IP, wikis, blogs, forums, and syndication. The next generation of web services is increasingly referred to as Web 3.0, an umbrella term that usually covers customization, semantic content, and more sophisticated web applications that move toward Artificial Intelligence (AI), including computer-generated content.

The Semantic Web is a major aspect of Web 2.0 and Web 3.0. Web 3.0 can be considered a superset of the Semantic Web that features social connections and personalization. Several technologies contribute to the sharing of structured information rather than web pages alone, and the number of Semantic Web applications is constantly increasing.

On the Semantic Web, there is a variety of structured data, usually expressed in, or based on, the Resource Description Framework (RDF). Similar to conventional conceptual modeling approaches, such as class diagrams and entity-relationship models, the RDF data model is based on statements that describe and characterize resources, especially web resources, in the form of subject-predicate-object expressions. The subject corresponds to the resource being described. The predicate expresses a relationship between the subject and the object. The object is the value or resource to which the subject is related. Such expressions are called triples. For example, the statement “The sky is blue” can be expressed in an RDF triple as follows:

  • Subject: “The sky”
  • Predicate: “is”
  • Object: “blue”
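
The subject-predicate-object structure above can be modeled with very little code. The following is a minimal sketch, not an RDF implementation: the `Triple` class, the `graph` set, and the `objects_of` helper are illustrative names, and plain strings stand in for the identifiers that real RDF would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    object: str


# The example statement "The sky is blue" as a single triple.
sky_is_blue = Triple(subject="The sky", predicate="is", object="blue")

# A tiny in-memory "graph" is simply a set of triples.
# The second statement is made up for illustration.
graph = {sky_is_blue, Triple("The sky", "is above", "the ground")}


def objects_of(graph, subject, predicate):
    """Return every object related to `subject` through `predicate`."""
    return {t.object for t in graph
            if t.subject == subject and t.predicate == predicate}


blue_things = objects_of(graph, "The sky", "is")
```

Because every statement has the same three-part shape, a query such as `objects_of` only needs to match on two positions and read the third.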

RDF is an abstract model that has several serialization formats. Consequently, the syntax of the triple varies from format to format. Keep in mind that RDF is a concept, not a syntax.
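
To make the difference between model and syntax concrete, the sketch below writes one and the same triple in two serialization formats, N-Triples and Turtle. The `http://example.org/` namespace, the `hasColor` predicate, and the helper function names are assumptions chosen for illustration. The helpers handle only plain string literals and are not a complete serializer.

```python
def escape_literal(text):
    """Escape backslashes and double quotes inside a string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(subject_iri, predicate_iri, literal):
    """N-Triples: full IRIs in angle brackets, one statement per line."""
    return f'<{subject_iri}> <{predicate_iri}> "{escape_literal(literal)}" .'


def to_turtle(prefix, namespace, subject_name, predicate_name, literal):
    """Turtle: a prefix declaration lets the statement use short names."""
    return (f"@prefix {prefix}: <{namespace}> .\n"
            f'{prefix}:{subject_name} {prefix}:{predicate_name} '
            f'"{escape_literal(literal)}" .')


EX = "http://example.org/"  # hypothetical namespace for this example

ntriples = to_ntriples(EX + "sky", EX + "hasColor", "blue")
turtle = to_turtle("ex", EX, "sky", "hasColor", "blue")

print(ntriples)
print(turtle)
```

Both strings describe the same statement; only the notation differs.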

The authors of the “conventional” Web usually publish unstructured data, because they do not know about the power of structured data, find RDF too complex, or do not know how to create and publish RDF in any of its serialization formats. The following technologies address this problem by adding structured data to conventional (X)HTML markup, which can then be extracted by appropriate software and converted to RDF:

  • Microformats, which reuse existing markup attributes (such as class and rel)
  • Microdata, which extends HTML5 markup with structured metadata
  • RDFa (RDF in attributes), which expresses RDF in markup attributes that are not part of (X)HTML vocabularies
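
As an illustration of how software can extract such annotations, the following sketch collects Microdata `itemprop` values from a small HTML fragment using only Python's standard `html.parser` module. The `ItempropExtractor` class and the sample markup are assumptions made for this example. The sketch handles only properties whose value is the plain text of a single element, whereas a real extractor must also deal with nested items, attribute-based values, and conversion to RDF.

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect (property name, text value) pairs from itemprop attributes."""

    def __init__(self):
        super().__init__()
        self.properties = []   # extracted (name, value) pairs, in order
        self._current = None   # property whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name is not None and self._current is None:
            self._current = name
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the current property.
        if self._current is not None:
            value = "".join(self._buffer).strip()
            self.properties.append((self._current, value))
            self._current = None


markup = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Editor</span>
</div>
"""

parser = ItempropExtractor()
parser.feed(markup)
properties = parser.properties
```

Each extracted pair already has the shape of a predicate and an object; the enclosing item supplies the subject.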

All data controlled by conventional web applications are kept by the applications themselves, making a significant share of data and their relationships virtually unavailable for automated processing. Semantic Web applications, on the other hand, can access this data through the general web architecture and transfer structured data between applications and web sites. Semantic Web technologies can be widely applied in a variety of areas, such as web search, data integration, resource discovery and classification, cataloging, intelligent software agents, content rating, and intellectual property rights descriptions. A much wider range of tasks can be performed on Semantic Web pages than on conventional ones; for example, relationships between data and even sentences can be automatically processed. Additionally, such processing can be far more efficient. For example, a very promising approach provides direct mapping of relational data to RDF, making it possible to share data of relational databases on the Semantic Web. Since relational databases are extremely popular in computing, databases that have been stored on local hard drives up to now can be shared on the Semantic Web. Commercial RDF database software packages are already available on the market (5Store, AllegroGraph, BigData, Oracle, OWLIM, Talis Platform, Virtuoso, and so on). Semantic tools can also be used in a variety of other areas, including business process modeling or diagnostic applications.
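
The idea of mapping relational data directly to RDF can be sketched as follows: each row becomes a subject, each column becomes a predicate, and each non-NULL cell value becomes an object. This is a simplified illustration that uses an in-memory SQLite database. The base IRI, the `People` table, and the `rows_to_triples` function are assumptions made for the example, and the IRI patterns are only loosely modeled on the direct-mapping idea rather than on any normative rule set. Literal escaping and datatypes are deliberately omitted.

```python
import sqlite3

BASE = "http://example.org/db/"  # hypothetical base IRI


def rows_to_triples(conn, table, key_column):
    """Turn every row of `table` into (subject, predicate, object) triples."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        # One subject IRI per row, built from the key column's value.
        subject = f"<{BASE}{table}/{key_column}={record[key_column]}>"
        for column, value in record.items():
            if value is None:
                continue  # a NULL cell produces no triple
            predicate = f"<{BASE}{table}#{column}>"
            triples.append((subject, predicate, f'"{value}"'))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People (ID INTEGER PRIMARY KEY, fname TEXT, city TEXT)"
)
conn.execute("INSERT INTO People VALUES (7, 'Bob', NULL)")

triples = rows_to_triples(conn, "People", "ID")
for triple in triples:
    print(" ".join(triple), ".")
```

Skipping NULL values reflects the fact that a triple asserts something; an absent value is represented by the absence of a statement.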