Crawling: i2p2.i2p recursive source loops
Opened 5 years ago
Last modified 5 years ago
#1781 assigned defect
Reported by: k1773r
Owned by: str4d
Priority: minor
Milestone: undecided
Component: www/i2p
Version: 0.9.24
Keywords:
Cc:
Parent Tickets:
Sensitive: no
Description
While crawling www.i2p2.i2p, I get recursive links that lead to a "page not found" page, even though the HTTP status is 200. Those pages contain further nested links, and the cycle starts all over again. Eventually the crawler hits a real 404 (as shown below).
Crawler logs:
In each entry, the first URL is the page crawled and the second URL is where the link came from.
2016-04-06T**:19:40.798Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html text/html #044 20160406**1940424+346 sha1:66374BVL4IQZ3HBJXFVOAYAZBWU6VGEQ - -
2016-04-06T**:19:39.700Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html text/html #018 20160406**1939082+603 sha1:JWWJX7KEBMZCBJSEZW6C3TQPEEA6VG32 - -
2016-04-06T**:19:38.583Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html text/html #047 20160406**1938203+365 sha1:TNDZLJEXSFWTE3UZ3FX4BHELNBQSAW3F - -
2016-04-06T**:19:37.853Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html text/html #029 20160406**1937490+336 sha1:UIIBTTZBEW2LHC5TIWALY33YBZPQ4Y5C - -
2016-04-06T**:19:37.081Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html text/html #018 20160406**1936671+397 sha1:P6IKCGRG77YEY3U3QGET6JQICO2M274M - -
2016-04-06T**:19:36.201Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html text/html #047 20160406**1935726+448 sha1:GWBZFXTRMUQZQIPJ4EKA3FW4ERRRLYHS - -
2016-04-06T**:19:35.361Z 404 22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html text/html #040 20160406**1934995+353 sha1:M56A3Y62E7AJYUEURZ224EEEYXS3GYCP - -
2016-04-06T**:19:34.526Z 404 22318 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html text/html #048 20160406**1934130+372 sha1:MAW4ZNR2RB4RFR6XG2UECOZCKQFT4TFW - -
The crawler would detect the loop after several levels of nesting, but for now I have just created an exclude regex.
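The exclude regex itself is not included in this ticket. As a rough sketch of the idea, the Python below shows two ways such a filter could look. Both `STATIC_LOOP_RE` and `looks_like_loop` are hypothetical names, and the thresholds are assumptions chosen to match the URLs in the log above, not the rule actually used.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical exclude pattern (not the regex from this ticket): matches a
# path in which "/_static", optionally followed by "/styles", occurs three
# or more times in a row.
STATIC_LOOP_RE = re.compile(r"(?:/_static(?:/styles)?){3,}")


def looks_like_loop(url: str, max_repeats: int = 2) -> bool:
    """Return True if any single path segment occurs more than max_repeats times.

    This is a generic alternative to the regex: it does not depend on the
    segment being named "_static". The default threshold is an assumption.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(count > max_repeats for count in Counter(segments).values())


if __name__ == "__main__":
    looping = (
        "https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/"
        "styles/_static/_static/styles/_static/styles/_static/_static/index.html"
    )
    for url in (looping, "https://geti2p.net/en/download"):
        excluded = bool(STATIC_LOOP_RE.search(urlsplit(url).path)) or looks_like_loop(url)
        print(url, "->", "exclude" if excluded else "crawl")
```

A filter like this only keeps the crawler out of the loop; the pages would still be served with status 200 instead of 404.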