Thanks to my current day job, I haven’t written any serious (Java) code for a long time. A couple of weeks ago, I was mulling over how to overcome this loss of habit, or rather how to refresh the skills. So I started writing a crawler and a harvester for RSS/Atom feeds. It may not be a big deal, but I had never written one before. It became interesting once I fired up Eclipse. The harvester returned 280 blogs within my second degree of separation. If your weblog stats have shown a sharp uptrend over the last 10 days, you know whom to thank.
I picked ROME as the feed parser of choice. ROME has a couple of bugs, which I’ll be reporting to the developers. The nastiest one is its inability to parse Atom feeds from blogspot.com. I may know the solution, but I need to go through the ROME source code before I can fix it or suggest a fix.
The harvester/crawler uses plain java.net.HttpURLConnection for HTTP transfers. The next step is to make it work over Java NIO to improve network I/O performance for frequent updates and a large set of feeds.
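For the curious, the HttpURLConnection part can be sketched roughly as below. This is not the harvester's actual source, only a minimal illustration using the JDK alone; the class name FeedFetcher, the timeouts and the User-Agent string are my own assumptions, and the body is assumed to be UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches one feed document over plain HTTP.
public class FeedFetcher {

    // Drains a stream into a String. Assumes UTF-8; a real harvester would
    // honour the charset declared by the server or by the XML prolog.
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Issues a blocking GET and returns the response body as text.
    public static String fetch(String feedUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(feedUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5_000);   // assumed values, tune as needed
        conn.setReadTimeout(10_000);
        conn.setRequestProperty("User-Agent", "feed-harvester-sketch/0.1");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + feedUrl);
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FeedFetcher <feed-url>");
            return;
        }
        String xml = fetch(args[0]);
        System.out.println("Fetched " + xml.length() + " characters");
    }
}
```

Each call blocks one thread per feed, which is exactly why a non-blocking approach looks attractive once the feed list grows.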
Writing the code was easy once the hands got dirty. The big task is to figure out what to do with this code. How about a me-too of Bloglines, Technorati or Kinja?
Hello to the world of raw structured data and its various formats, viz. RSS 0.90/0.91/0.92/0.93/0.94/1.0/2.0 and Atom 0.3 (including the standard and proposed RSS modules)!
Archive for February, 2005
Writing A Feed Crawler in Java
Sunday, February 27th, 2005
Reference Web vs. the Incremental Web: How the current discovery methods will break
Wednesday, February 16th, 2005
Google searches the reference Internet. Users come to Google with a specific query, and search a vast corpus of largely static information. This is a very valuable and lucrative service to provide: it’s the Yellow Pages.
On the other hand, Weblogs (which look like just another HTML page) are chronologically organized. The posts are structured data, well tagged and easy to discover. Ranking and indexing become easier in the case of a weblog. A search engine may assign a higher rank to keywords appearing in the <dc:subject> or <title> tags than to those in the <description> tag. Thanks to this tagging, the ranking scheme mostly stops being somebody’s personal algorithm. Compare this to how Google assigns its magic rank on the unstructured web: more weight is given to words appearing in the HTML <title> tag or in the text of links in the <a> tag (an oversimplification; Google does a tad more). The same scheme is applied to <H1> and <B> tags. The logic of doing this is obvious.
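To make the tag-weighting idea concrete, here is a toy sketch. It is not how any real search engine ranks; the class name FeedEntryScorer and the weights (3 for <title> and <dc:subject>, 1 for <description>) are numbers I made up purely for illustration.

```java
import java.util.Locale;

// Hypothetical scorer: a keyword hit counts for more in <title> or
// <dc:subject> than in <description>.
public class FeedEntryScorer {

    // Assumed weights, chosen only to show the principle.
    static final int TITLE_WEIGHT = 3;
    static final int SUBJECT_WEIGHT = 3;
    static final int DESCRIPTION_WEIGHT = 1;

    // Counts whole-word, case-insensitive occurrences of keyword in text.
    static int countOccurrences(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return 0;
        }
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    // Weighted sum of keyword hits across the three tagged fields.
    public static int score(String title, String subject,
                            String description, String keyword) {
        return TITLE_WEIGHT * countOccurrences(title, keyword)
                + SUBJECT_WEIGHT * countOccurrences(subject, keyword)
                + DESCRIPTION_WEIGHT * countOccurrences(description, keyword);
    }

    public static void main(String[] args) {
        int s = score("Java NIO tips", "java",
                "Some notes on java and more java", "java");
        System.out.println("score = " + s);
    }
}
```

With these weights, a post that mentions the keyword once in its title outranks a post that only mentions it once in the body, which is the whole point of trusting the tags.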
At the outset, the difference between regular HTML pages and Weblogs is not much. However, HTML can be read only with a browser, while Weblogs can be read with a browser as well as with client-based (NewsMonster, Gush, etc.) and web-based (Bloglines, Feedster, etc.) applications. Thanks to a standardized delivery medium like RSS or Atom, a Weblog can be read on any custom software or device.
Google works best on the reference web: the web that is primarily made up of HTML pages, whose content is not tagged beyond what is required for rendering the HTML markup. Try searching Google for the latest conversation on Java. The top site is from Sun. For a different twist, try searching for help on formatting/parsing a java.util.Date object. The search results reference discussions around the deprecated APIs. This is the reference web. Here the content does not say what it is or what it refers to. It is the search engine’s algorithm that decides how to cut, chop and present.
Contrast this with the incremental web: the content says what it is, what categories it belongs to and when it was published.
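Here is a rough sketch of what "the content says what it is" looks like from code, using only the JDK's DOM parser. It assumes an RSS 2.0 document whose items carry <title>, <dc:subject> and an RFC 1123 <pubDate>; the class name FeedItemReader and its shape are my own invention, not part of any feed library.

```java
import java.io.StringReader;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader: pulls the self-describing fields out of the first <item>.
public class FeedItemReader {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    public static final class Item {
        public final String title;              // what it is
        public final List<String> subjects;     // what categories it belongs to
        public final OffsetDateTime published;  // when it was published (may be null)

        Item(String title, List<String> subjects, OffsetDateTime published) {
            this.title = title;
            this.subjects = subjects;
            this.published = published;
        }
    }

    public static Item firstItem(String rssXml) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed to match dc:subject by namespace
            // Refuse DOCTYPEs so untrusted feeds cannot pull in external entities.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(rssXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable feed document", e);
        }
        Element item = (Element) doc.getElementsByTagName("item").item(0);
        if (item == null) {
            throw new IllegalArgumentException("no <item> element found");
        }
        String title = firstText(item.getElementsByTagName("title"));
        List<String> subjects = new ArrayList<>();
        NodeList dc = item.getElementsByTagNameNS(DC_NS, "subject");
        for (int i = 0; i < dc.getLength(); i++) {
            subjects.add(dc.item(i).getTextContent().trim());
        }
        String pubDate = firstText(item.getElementsByTagName("pubDate"));
        OffsetDateTime published = pubDate == null ? null
                : OffsetDateTime.parse(pubDate, DateTimeFormatter.RFC_1123_DATE_TIME);
        return new Item(title, subjects, published);
    }

    // Trimmed text of the first node in the list, or null if the list is empty.
    private static String firstText(NodeList nodes) {
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

No guessing from markup is involved here: the title, the categories and the timestamp are all read straight from tags that exist for exactly that purpose.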
I think this is an immense opportunity, some of which is being addressed by Topix, Technorati, Feedster, etc. But Weblog searching is still in its infancy. With traditional search techniques, the wheat (the blog entries I want to read) and the chaff (the blog entries I want to avoid) go through the grinder together.
In the grand scheme of things, I think we are on the path to the Semantic Web.
Total Internal Refraction
Friday, February 11th, 2005
Interestingly, Paris Las Vegas and Bellagio are diagonally opposite each other (look at the map below). The picture was shot looking towards Bellagio, from behind the protective glass shield on a pedestrian crossing over ‘The Strip’ in Las Vegas.