
The Machine Readable Web

The vast majority of the Web is intended for human readers. The goal has been to create an online experience for human beings: an open and ever-growing body of information.

This is all great, but it does present some problems. There is just too much there. We aren’t sure what information to trust. We can get lost in the Web and waste a lot of time. So we need some software tools to help us, but the information itself is not structured in a way that software can easily deal with. Enter the machine readable Web.

The most basic way for software to deal with information on the Web is to simply read the HTML of the pages and “analyze” it. This is what search engines do. They have software agents called spiders that crawl the Web and index its pages, then use various techniques to give us the “best” pages for the search queries we enter.
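To make that concrete, here is a minimal sketch, in Python, of what a spider does: fetch a page, pull out its links and its words, and add the words to an index. The start URL is just a placeholder, and real crawlers add politeness rules, scheduling, and far more sophisticated ranking.

```python
# A minimal sketch of what a search-engine spider does: fetch a page,
# pull out its links and words, and add the words to an index.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())

def crawl(start_url, max_pages=5):
    index, queue, seen = {}, [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word.lower(), set()).add(url)
        queue.extend(urljoin(url, link) for link in parser.links)
    return index

index = crawl("https://example.com")   # placeholder start page
print(sorted(index.get("example", ())))
```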

This is helpful and essential, but you still have to visit the pages (many pages) and try to find what you want. And you need to know when to go back for updated information. You may even know that a page has the information you want and that it will be updated regularly, but you don’t want to return again and again just to pull that one bit of information off the page.

There are tools called “screen scrapers” or Web page extractors that can read the pages and pull out just the information you want, but the pages are unstructured and changing. The rules you write for extracting the information can be complex and may break when the page changes.
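A tiny sketch of why such extraction rules are fragile: the page markup and the rule below are invented for illustration, but the failure mode is typical.

```python
import re

# A hypothetical extraction rule: pull the temperature out of one
# specific <span> on a weather page. The markup is invented.
page_v1 = '<div class="wx"><span id="temp">72&deg;F</span></div>'
rule = re.compile(r'<span id="temp">(.*?)</span>')
print(rule.search(page_v1).group(1))   # 72&deg;F

# After a redesign the same data is still there, but the rule
# silently stops matching and the scraper breaks.
page_v2 = '<div class="weather"><b class="temp-now">72&deg;F</b></div>'
print(rule.search(page_v2))            # None
```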

And content providers often don’t want you to use their page that way. They want you to look at the whole page, so that you will get the other messages they have on the page (like marketing messages), not just the bit you want. They try to put up a “no droids allowed” sign, in this case, “no robots, we want human eyeballs only”.
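In practice the “no robots” sign is usually spelled robots.txt. A polite agent can check it before fetching; the site URL and user-agent string below are placeholders.

```python
# Check a site's robots.txt before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

if rp.can_fetch("MyAggregator/1.0", "https://example.com/headlines"):
    print("allowed to fetch")
else:
    print("the provider has asked robots to stay out")
```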

Some content providers realize that you can’t always come to their site, and that if they give you a useful summary of what is there, you might come more often to see the details (and the other material you don’t really want, but live with to get the content you do want). A very useful way of doing this is RSS feeds. RSS (Really Simple Syndication) provides the summary in an XML file that a software agent can easily process. RSS news readers or information aggregators fetch the summary for you, and you can then decide whether to click through to the details. (See http://www.w3schools.com/rss/default.asp for more on RSS.)
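As a sketch of what an aggregator does with a feed, the snippet below parses a small, made-up RSS 2.0 document with the Python standard library; the element names (title, link, description) are the standard RSS 2.0 ones.

```python
# Parse a feed and list each item's title, link, and summary.
import xml.etree.ElementTree as ET

feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>http://example.com/</link>
    <description>What is new on the example site</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
      <description>A short summary of the first story.</description>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(feed).find("channel")
for item in channel.findall("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
    print("   ", item.findtext("description"))
```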

RSS is the first really successful example of the machine readable Web. The RSS XML file has a well-known structure and is easy to produce and to process. It has also succeeded because it is a win-win for content providers and consumers: consumers get the summary information they want, making their Web browsing more effective and enjoyable, and providers get what they want, more traffic to their sites.
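The flip side, producing a feed, is little more than writing that same structure out. A sketch, again with invented content:

```python
# Generate a minimal RSS 2.0 feed from a list of items.
import xml.etree.ElementTree as ET

items = [
    {"title": "First headline",
     "link": "http://example.com/first",
     "description": "A short summary of the first story."},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example Site"
ET.SubElement(channel, "link").text = "http://example.com/"
ET.SubElement(channel, "description").text = "What is new on the example site"

for it in items:
    item = ET.SubElement(channel, "item")
    for field in ("title", "link", "description"):
        ET.SubElement(item, field).text = it[field]

print(ET.tostring(rss, encoding="unicode"))
```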

Content providers had started down this path by offering HTML fragments that Web site authors could add to their own sites, and a few tools let individuals use these fragments as well. Commented HTML can make it easier for Web page extractors to pull out dynamic HTML fragments. Like RSS feeds, HTML fragments are useful for consumers of information and help content providers attract traffic.
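Here is a sketch of the commented-HTML idea: the provider brackets the reusable fragment with well-known comment markers (the marker names are invented), and an extractor cuts between them instead of guessing at the surrounding markup.

```python
# Extract a provider-marked fragment by cutting between comment markers.
page = """
<html><body>
<h1>Example Site</h1>
<!-- BEGIN headlines -->
<ul><li><a href="/first">First headline</a></li></ul>
<!-- END headlines -->
<p>Everything else on the page...</p>
</body></html>
"""

start = page.index("<!-- BEGIN headlines -->") + len("<!-- BEGIN headlines -->")
end = page.index("<!-- END headlines -->")
fragment = page[start:end].strip()
print(fragment)   # just the <ul> block, ready to drop into another page
```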

This brings us to a fundamental point. Content providers need a relatively easy way to provide the machine readable content, and it has to fit in with their mission. And consumers won’t use it unless they get something useful from it. So we need the win-win for the machine readable Web to get off the ground. At this point, according to a Pew Research report (http://www.pewinternet.org/PPF/r/144/report_display.asp), 5% of internet users are using RSS. Most of these people are classic early adopters, but RSS seems to be moving quickly toward wider adoption.

But even this relatively simple standard was not easy to get to. There was a lot of conflict between the “keep it simple” crowd and the “more features” crowd (see http://diveintomark.org/archives/2002/09/06). And RSS is just scratching the surface. After all, it provides only a title, a link, and a short summary for each item. Richer information will require a richer structure.

There is a community of researchers looking to provide the approach for this richer structure under the banner of the “semantic Web”. This is largely a vision and research project at this point; see http://www.w3.org/2001/sw/. One criticism of this work is that it is “too complex”, and if you try to read some of it, you might be forgiven for developing a similar opinion. Of course, tools will be provided to hide the complexity from users, but the issue is whether it is too complex for the typical content provider and for tool developers. See http://www.snipsnap.org/space/RDF+too+complex if you are interested in exploring this issue.
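To give a feel for the richer structure, here is a small sketch using the third-party rdflib package (an assumption; it is not part of the standard library). It describes the same kind of headline item as a handful of RDF triples in Turtle syntax, using Dublin Core terms.

```python
# Parse a few RDF triples describing a headline, then list them.
from rdflib import Graph  # assumes rdflib is installed

turtle = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.com/first>
    dc:title "First headline" ;
    dc:description "A short summary of the first story." ;
    dc:creator "Example Site" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```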

A more near-term approach is “Web services”. This uses the Web infrastructure for application-to-application communications. It is not as easy as RSS, but it builds on a similar structure, with XML as the data format. At this point it is mostly used business-to-business, and there are hardly any Web services that provide public information. And, as a content provider, you need to define an interface for each kind of information you want to serve. See http://www.w3.org/2002/ws/ if you are interested in exploring this.
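A sketch of the application-to-application style: fetch XML from a service endpoint and pick out the fields its interface defines. The endpoint URL and element names below are invented; a real service publishes its own interface definition.

```python
# Call a hypothetical XML-over-HTTP service and read a few fields.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

url = "https://example.com/services/quote?symbol=ACME"   # hypothetical endpoint
with urlopen(url) as response:
    doc = ET.fromstring(response.read())

print(doc.findtext("symbol"), doc.findtext("price"), doc.findtext("asOf"))
```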

Another intriguing use of the Web is machine-to-machine. The idea here is that many machines have embedded computers, and if they could hook into the Web, they could provide a lot of useful information. Some might be sensors of various kinds; others might be cars, toasters, or washing machines. Wireless companies are interested in providing devices similar to cell phones to make these machines accessible over the Web. See http://itpapers.zdnet.com/whitepaper.aspx?scname=GSM&docid=97767.
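As a sketch of the machine-to-machine idea, an embedded device could post a reading to a collection service over plain HTTP. The endpoint and payload format here are invented for illustration.

```python
# A device reports one reading to a hypothetical collection service.
from urllib.request import Request, urlopen

reading = b"""<?xml version="1.0"?>
<reading>
  <device>washer-042</device>
  <metric>water-temp-c</metric>
  <value>41.5</value>
</reading>"""

request = Request(
    "https://example.com/m2m/readings",        # hypothetical collector
    data=reading,                              # POST body
    headers={"Content-Type": "application/xml"},
)
with urlopen(request) as response:
    print(response.status)
```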

So a machine readable Web is starting to become a reality with RSS and Web services, and may progress even further with something like machine-to-machine or the semantic Web. Early adopters are beginning to embrace the idea via RSS. The key will be for content providers to adopt a richer set of machine readable formats, as they have started to do with RSS, while keeping them as simple as possible so that a wide variety of software developers can provide tools for end users. This may be the key to making the Web even more useful.

Ron Tower is the President of Sugarloaf Software and is the developer of Personal Watchkeeper, an information aggregator supporting a variety of ways to summarize the Web. http://www.sugarloafsw.com
