Re: Trickl-Crawler - Significant Fork and Extension of Droids Framework
Richard Frovarp 2011-12-22, 19:34
On 12/13/2011 12:29 PM, Tim Gee wrote:
> I've just released a significant fork and extension of the Apache Droids
> framework, which I've been using for my own purposes for a while.
> I've released it under the ASL and the intent is that any useful code
> might be integrated into the official trunk of droids in the future. I've
> taken a rather brutal, but pragmatic approach to using the framework -
> where the design hasn't met my needs I've duplicated and revised code from
> the framework. So, for example, you will see that significant chunks of the
> API I have copied and changed and are available under
> com.trickl.crawler.api. Obviously, in a perfect world, I would work with
> your development team to discuss changes and find sensible workarounds, but
> sadly I didn't have the time for that so I just rushed ahead and made
> changes where I needed them to my modified implementation.
> So there will be conflicts in design and perhaps philosophy about some of
> my core changes, many of which you might regard as unnecessary. However,
> hopefully, there will still be a significant chunk of code that is useful
> and perhaps some design changes were indeed worthwhile.
I've been meaning to take some time to look through your release, but
sadly I have not had the opportunity yet. Thank you for using AL2,
and I hope we can incorporate some of your changes back into the core.
You made some interesting changes and additions. Being able to process
those different content types (JSON, for example) might not make sense
for a crawler, but it is quite useful. In my implementation, I'm doing
all sorts of status code handling and recording the results in a
database. I figure if I'm crawling my material, I might as well know
what is broken and where the redirects are. So that functionality is
certainly very useful when going over a certain set of content.
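The status code bookkeeping described above could look roughly like the following. This is only a minimal sketch under assumptions: `CrawlStatusLog`, `Outcome`, and `record` are hypothetical names, not part of Droids or Trickl-Crawler, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Droids or Trickl-Crawler.
class CrawlStatusLog {

    enum Outcome { OK, REDIRECT, BROKEN, OTHER }

    static final class Entry {
        final String url;
        final int statusCode;
        final String redirectTarget; // Location header value, or null
        final Outcome outcome;

        Entry(String url, int statusCode, String redirectTarget, Outcome outcome) {
            this.url = url;
            this.statusCode = statusCode;
            this.redirectTarget = redirectTarget;
            this.outcome = outcome;
        }
    }

    // Stand-in for a database table; a real crawler would persist these rows.
    private final List<Entry> entries = new ArrayList<>();

    // Map an HTTP status code to a coarse outcome.
    static Outcome classify(int statusCode) {
        switch (statusCode) {
            case 301: case 302: case 303: case 307: case 308:
                return Outcome.REDIRECT;
            default:
                if (statusCode >= 200 && statusCode < 300) return Outcome.OK;
                if (statusCode >= 400 && statusCode < 600) return Outcome.BROKEN;
                return Outcome.OTHER; // e.g. 1xx, 304, or anything unexpected
        }
    }

    // Record one fetched URL; the redirect target is kept only for redirects.
    void record(String url, int statusCode, String locationHeader) {
        Outcome outcome = classify(statusCode);
        String target = (outcome == Outcome.REDIRECT) ? locationHeader : null;
        entries.add(new Entry(url, statusCode, target, outcome));
    }

    // Return all recorded entries with the given outcome, in insertion order.
    List<Entry> withOutcome(Outcome wanted) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.outcome == wanted) {
                result.add(e);
            }
        }
        return result;
    }
}
```

Calling `withOutcome(Outcome.BROKEN)` after a crawl would list the dead links, and `withOutcome(Outcome.REDIRECT)` would list each redirecting URL together with its target.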
You mentioned that you've tried to use other HTML parsers. How does
HtmlCleaner vary from JTidy? Do you have any feeling of how those and/or
Neko compare to the ones from Tika? I've got a few pages that Tika blows
How did you handle Spring? I see you aren't using the droids-spring
module. I know so very little about Spring, that I don't know where any
of the deficiencies are.
It would be nice to merge some of your functionality in. I hope I have
time to look at it soon. Obviously patches are always welcome, and may
work well for the low-hanging fruit. The more significant changes would require