Re: lucene in combination with pattern recognition...
Simon Courtenage, 2006-06-22, 21:08
You might also check out an old paper by Kruger, Giles, Lawrence et al. on
a search engine called Deadliner
(see http://clgiles.ist.psu.edu/papers/CIKM-2000-deadliner.pdf).
Deadliner crawled for Calls for Papers for conferences, using Support Vector
Machines trained to recognise relevant pages, and then applied sets of
regular expressions to extract information from the CFP pages. Lawrence is
now with Google, I believe.

Hope this helps,
Simon

Bob Carpenter wrote:
> Check out Andrew McCallum's paper:
>
> http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
>
> It mentions this very problem. There are
> also some more technical presentations around.
>
> He was part of the Whiz-Bang team that took
> on the problem. The fact that the company's
> out of business is a testament to how hard
> this problem is in general.
>
> - Bob Carpenter
> Alias-i
>
>>
>> i'm looking at a problem and i can't figure out how to "easily" solve
>> it...
>>
>> basically, i'm trying to figure out if there's a way to use lucene/nutch
>> with some form of pattern matching to extract course information from a
>> College/Registrar's course section...
>>
>> Assume I can point to a Registrar's section of a College site.
>> Assume I can then crawl through the section, and capture
>> all the underlying information, including the Course
>> information...
>> Is there a way to somehow use pattern matching/recognition
>> to somehow interpret the DOM to pull out the class schedule
>> information. I'm pretty sure there's no vanilla approach,
>> so I'd even consider some kind of solution where I might
>> have to initially evaluate/analyze the site, to tell it
>> what DOM elements are "important"...
>>
>> anyone done any work/projects like this...
>> any research/papers/sample apps i could look at...
>> any thoughts/comments/etc....
>>
>> i could brute force this by writing a bunch of perl
>> scripts, with each script tied to a given registrar site,
>> but i'd like a more generalizable solution if one exists..
>>
>> thanks
>>
>> -bruce
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]

--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED]
Web: http://users.cscs.wmin.ac.uk/~courtes | http://www.sse.wmin.ac.uk
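
P.S. To make the regular-expression step mentioned above a bit more concrete, here is a minimal sketch in Python of pulling a couple of fields out of CFP-style text. It is only an illustration, not the code from the Deadliner paper: the field names, the patterns, and the `extract_fields` helper are all assumptions, and the SVM page-classification step is left out entirely. The same idea (several alternative patterns per field, first match wins) could in principle be pointed at course listings instead of CFPs.

```python
import re

# Hypothetical patterns for illustration only; real pages vary widely
# and would need many more alternatives per field.
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = rf"{MONTH}\s+\d{{1,2}},\s+\d{{4}}"

# Each field maps to a list of alternative patterns, tried in order.
FIELD_PATTERNS = {
    "submission_deadline": [
        re.compile(rf"(?:submission|paper)\s+deadline\s*:?\s*({DATE})",
                   re.IGNORECASE),
        re.compile(rf"papers?\s+due\s*:?\s*({DATE})", re.IGNORECASE),
    ],
    "notification": [
        re.compile(rf"notification(?:\s+of\s+acceptance)?\s*:?\s*({DATE})",
                   re.IGNORECASE),
    ],
}


def extract_fields(text):
    """Return the first matching date for each field, or None if none match."""
    result = {}
    for field, patterns in FIELD_PATTERNS.items():
        result[field] = None
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                result[field] = match.group(1)
                break
    return result


if __name__ == "__main__":
    sample = (
        "Call for Papers\n"
        "Submission deadline: March 15, 2007\n"
        "Notification of acceptance: May 1, 2007\n"
    )
    print(extract_fields(sample))
```

Keeping the patterns in a table per field means a new site layout can often be handled by adding a pattern rather than writing a new script, which is roughly the trade-off bruce was asking about.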