Dear Nutch users,
I am developing an app that needs to crawl and index images. To fetch
dynamic content, such as images in galleries, I started using the
protocol-selenium plugin. It worked at first with a single URL in seed.txt
(though I had to install a very outdated version of Firefox, 31.x), but the
crawler crashed when I tried to crawl multiple sites, which is the standard
scenario in the app.
This was, of course, the result of Nutch starting a queue for every
different host and Selenium being unable to open several Firefox instances
in local mode.
I then tried to switch to Selenium Grid, following
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
I used selenium-server-standalone 3.4.0. However, when I started the hub and
started crawling, the hub didn't register any attempts to connect to
it. I think nutch-site.xml was properly configured, though I didn't set
the grid.binary.location. I also tried upgrading the lib-selenium and the
server, with little luck. I dis
Does anyone know what the issue is here? Has anyone succeeded in
configuring protocol-selenium with a grid and made it work with multiple
URLs from different hosts in the seed.txt?
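
For reference, this is roughly the kind of grid setup I mean in
nutch-site.xml. It is only a sketch, not a copy of my actual file: the
property names are the ones I understand the plugin to read
(selenium.driver, selenium.hub.*, selenium.grid.*), and the host, port and
path values are assumed defaults for a hub running on the same machine.

```xml
<!-- Sketch only: values are assumptions for a hub running locally -->
<property>
  <name>selenium.driver</name>
  <!-- 'remote' should make the plugin talk to a hub instead of a local browser -->
  <value>remote</value>
</property>
<property>
  <name>selenium.hub.protocol</name>
  <value>http</value>
</property>
<property>
  <name>selenium.hub.host</name>
  <value>localhost</value>
</property>
<property>
  <name>selenium.hub.port</name>
  <value>4444</value>
</property>
<property>
  <name>selenium.hub.path</name>
  <value>/wd/hub</value>
</property>
<property>
  <name>selenium.grid.driver</name>
  <!-- browser the grid node is expected to launch -->
  <value>firefox</value>
</property>
<!-- selenium.grid.binary (path to the browser binary on the node) left unset -->
```

With values like these I would expect the fetcher to contact
http://localhost:4444/wd/hub, which is where I am not seeing any connection
attempts.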
Thanks in advance,