I’ve always enjoyed building domain-specific languages. The most advanced language I’ve developed so far comes in handy whenever I need to extract information from a web page or document. I call this system the “Content Retrieval Engine” (CRE). You can execute CRE scripts in a single thread to open web pages, extract information from them, and use the retrieved information as input to another document.
Here is an example that collects the result URLs for a Google search query:
url = "http://www.google.co.uk/search?hl=en&q=${1}&btnG=Search&meta=&aq=1";
query = "erdinc yilmazel";
echo "opening " + url(query.urlEncode());
open url(query.urlEncode()) as input;
read input {
    locate "<h2 class=hd>Search Results</h2>";
    do {
        locate "<li class=g>";
        locate "<h3 ";
        locate "<a href=\"";
        capture as link until "\"";
        echo link;
        links += link;
    } until "</ol> </div> <!--z-->";
}
The script above prints every URL it finds and also collects them in a list named links (internally an ArrayList in Java). At the time of writing, executing the script gives the following output:
opening http://www.google.co.uk/search?hl=en&q=erdinc+yilmazel&btnG=Search&meta=&aq=1
http://tr.linkedin.com/in/erdincyilmazel
http://www.facebook.com/erdincyilmazel
http://blog.erdincyilmazel.com/2010/01/05/open-sourcing-some-projects/
http://developer.amazonwebservices.com/connect/message.jspa?messageID=147347
http://markmail.org/message/x6irtpux7euhnptj
http://www.vclcomponents.com/PHP/Development_Tools/Libraries_and_Classes/MyObjects-info.html
http://www.aboutus.org/erdinc.yilmazel.com
http://groups.google.com/group/javawug/browse_thread/thread/6e0a171748a8b735
http://vaadin.com/forum/-/message_boards/message/89339
http://www.myobjects.org/
The nice thing about the Content Retrieval Engine is that it doesn’t have to load the whole document into memory before extracting data. It does everything in a single pass while reading from the input stream, which makes it fast and memory efficient. You can extract information from huge documents, even ones that are several gigabytes in size.
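To give a feel for the single-pass idea, here is a minimal Java sketch. It is not the actual CRE code: the class name StreamScanner and the methods locate, captureUntil and extractLinks are hypothetical names that I chose to mirror the script keywords, and the matching strategy (a Knuth-Morris-Pratt style scan) is just one reasonable way to do it. The point is that the reader is consumed one character at a time, and only the marker's failure table and the text being captured are held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of single-pass extraction; not the real CRE engine.
final class StreamScanner {
    private final Reader in;

    StreamScanner(Reader in) {
        this.in = in;
    }

    // Consumes input up to and including the next occurrence of marker.
    // Returns false if the stream ends first. Uses KMP so no backtracking
    // over already-consumed input is ever needed.
    boolean locate(String marker) throws IOException {
        if (marker.isEmpty()) return true;
        int[] fail = failureTable(marker);
        int matched = 0;
        int c;
        while ((c = in.read()) != -1) {
            while (matched > 0 && marker.charAt(matched) != c) {
                matched = fail[matched - 1];
            }
            if (marker.charAt(matched) == c) matched++;
            if (matched == marker.length()) return true;
        }
        return false;
    }

    // Returns the text before the next occurrence of a non-empty terminator
    // and consumes the terminator. Returns null if the stream ends first.
    String captureUntil(String terminator) throws IOException {
        StringBuilder captured = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            captured.append((char) c);
            int start = captured.length() - terminator.length();
            if (start >= 0 && captured.indexOf(terminator, start) == start) {
                return captured.substring(0, start);
            }
        }
        return null;
    }

    // Standard KMP failure table: fail[i] is the length of the longest
    // proper prefix of p[0..i] that is also a suffix of it.
    private static int[] failureTable(String p) {
        int[] fail = new int[p.length()];
        for (int i = 1, k = 0; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Rough equivalent of a locate/capture loop: collect every href value.
    static List<String> extractLinks(Reader in) throws IOException {
        StreamScanner scanner = new StreamScanner(in);
        List<String> links = new ArrayList<>();
        while (scanner.locate("<a href=\"")) {
            String link = scanner.captureUntil("\"");
            if (link == null) break;
            links.add(link);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<h3><a href=\"http://example.com/a\">A</a></h3>"
                    + "<h3><a href=\"http://example.com/b\">B</a></h3>";
        for (String link : extractLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

In real use you would pass a buffered reader over a network or file stream instead of a StringReader. Nothing in the sketch depends on the total size of the input, only on the length of the markers and of each captured value.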
So what is tinymashup.com? I am planning it as a hosted service for running CRE scripts. You will be able to write your scripts, run them against any web page, and get the output in whatever format you like, such as XML, JSON, or plain text. You will also be able to share your scripts with others. I am also planning on licensing the technology to people who need it. (I may open source it too; I haven’t decided yet.)
