Lib/robotparser.py doesn't accept setting a user agent string, instead it uses the default.

Created on 18:36 by dualbus, last changed 14:57 by admin.

dualbus, lotti, karlcow, orsenthil, rhettinger, terry.reedy, tshepang

Author: Eduardo A.

I found that … returns 403 if the provided user agent is in a specific blacklist. And since robotparser doesn't provide a mechanism to change the default user agent used by the opener, it becomes unusable for that site (and sites that have a similar policy).

I think the user should have the possibility to set a specific user agent string, to better identify their bot.

I attach a patch that allows the user to change the opener used by RobotFileParser, in case the need for some specific behavior arises. I also attach a simple example of how it solves the issue, at least with Wikipedia.

robotparser.py patch (against the mercurial 2.7 branch).

Enhancements can only be targeted at 3.4, where robotparser is now urllib.robotparser. I wonder if documenting the simple solution would be sufficient.

… = 'MyVersion'

In any case, a doc change *could* go in 2.7 and 3.3/2.

I'm not sure what's the best approach here.

1. Avoid changes in the Lib, and document a work-around, which involves installing an opener with the specific User-agent. … Modifies the behaviour of urlopen() globally, so that change affects any …
2. Revert to the old way, using an instance of a FancyURLopener (or URLopener), … This requires a modification of the Lib, but allows us to modify only the behaviour of that specific instance of … The user could sub-class FancyURLopener, set the appropriate …

I attach a script, tested against the ``default`` branch of the mercurial …

I forgot to mention that I ran a nc process in parallel, to see what data is …

Parsing the robots.txt file, the module responsible for fetching it … Could urllib. …
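The two approaches weighed in the thread can be sketched roughly as follows for Python 3, where the module is urllib.robotparser. This is not the attached patch or script: the names `install_global_user_agent`, `UserAgentRobotFileParser`, `build_request` and the agent string `MyBot/1.0` are hypothetical, and the overridden `read()` is assumed to mirror the standard library's error handling (401/403 means disallow all, other 4xx means allow all).

```python
import urllib.error
import urllib.request
import urllib.robotparser


def install_global_user_agent(user_agent):
    """Approach 1: install a process-wide opener.

    No change to Lib is needed, but every later urlopen() call in the
    process is affected, not only the robots.txt fetch.
    """
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener


class UserAgentRobotFileParser(urllib.robotparser.RobotFileParser):
    """Approach 2 (hypothetical subclass): a per-instance User-Agent.

    Only this parser's fetch carries the custom header; global urlopen()
    behaviour is left alone.
    """

    def __init__(self, url="", user_agent="MyBot/1.0"):
        super().__init__(url)
        self.user_agent = user_agent

    def build_request(self):
        # A Request object carries its own headers, independent of any
        # globally installed opener defaults.
        return urllib.request.Request(
            self.url, headers={"User-Agent": self.user_agent}
        )

    def read(self):
        """Fetch robots.txt with the custom header and feed it to the parser."""
        try:
            with urllib.request.urlopen(self.build_request()) as response:
                raw = response.read()
        except urllib.error.HTTPError as err:
            # Assumed to match the stock RobotFileParser.read() policy.
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(raw.decode("utf-8").splitlines())
```

With the subclass, a caller would construct the parser with the robots.txt URL and an agent string, call `read()`, and then query `can_fetch()` as usual. The global variant is shorter but changes behaviour for unrelated code, which is the drawback raised in the thread.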