
Building a search engine

I started working at IBCN, a research group at the University of Ghent. I was looking to get back to the challenging world of high-performance and large-scale (web) applications, but I also wanted something more conceptual and researchy, rather than the highly hands-on dev and ops work I've been doing for a few years now.
The Bom-vl project is pretty broad: it aims to make Flemish cultural heritage media more usable by properly digitizing, archiving and making public the (currently mostly analog) archives from providers such as TV stations.

Currently, I believe there's more than 100TB of media in our cluster (mostly from VRT, afaik), along with associated textual descriptions/metadata, with more to follow. The application is currently open to a selected audience only, but the goal is to make it public in the near future. I'm part of the search engine team. We aim to provide users with the most relevant hits for their queries, by using existing technology (think Lucene, Hadoop, etc) or devising our own where needed. As I'm charged with a similarity search problem ("other videos which might also interest you"), I'm studying information retrieval topics such as index and algorithm design and various vector models. Starting next week, I'll probably start implementing and testing some approaches.
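
To give an idea of what a vector model looks like in practice, here is a minimal sketch of one textbook approach: TF-IDF weighted term vectors compared with cosine similarity. This is not what the project uses (nothing is implemented yet), just an illustration. The function names, the naive whitespace tokenizer and the sample descriptions are all made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would do much more."""
    return text.lower().split()


def build_tfidf_vectors(docs):
    """Map each document id to a sparse {term: tf-idf weight} vector."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        # Terms that occur in every document get weight 0 (log(1) == 0).
        vectors[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, vectors, k=3):
    """Return the k other documents closest to doc_id, best first."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical metadata descriptions, purely for illustration.
docs = {
    "news_1": "evening news report about the elections in flanders",
    "news_2": "morning news report about the elections results",
    "quiz_1": "comedy quiz show for teenagers",
}
vectors = build_tfidf_vectors(docs)
print(most_similar("news_1", vectors, k=2))
```

Comparing every pair of documents like this obviously doesn't scale to a large archive, which is exactly why index design is part of the problem.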


If you encounter any Buiten de Zone or W817 episodes, be sure to digitize them properly on the web ;)
At least for the 'You might also be interested in...', you'll want to check and -> certainly worth a look!
Thanks for the tip, but with Introduction to Information Retrieval (also by Manning, and others), and with the resources and papers I can find online, I have enough material... for now :P




