Python mechanize: detect downloaded file extension
I'm trying to retrieve websites and save them to local disk using Python mechanize. The problem is that many websites redirect to links other than html/asp/php. Is there an accurate method to detect what extension a URL has, and what type of file I retrieve?
For instance: http://www.yahoo.com should be saved as an HTML file.
http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as a .exe file, because it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable method to guess.
How can I accurately detect a file's extension the way browsers do while saving a file?
Thanks heaps
"http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as a .exe file, because it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable method to guess."
That's not quite correct. It doesn't use an HTTP redirect. The problem is that Microsoft uses JavaScript to cause the browser to download the file. The actual file is:
http://download.microsoft.com/download/4/4/9/449b0038-ac27-4b24-bf11-dd8ebdf5cca6/sonar_setup.exe
Since mechanize can't run JavaScript for you, you'll have to resort to parsing the HTML and JavaScript files for links. That might be reasonable if you're scraping one site that downloads all files the same way. If you're looking for a general method, you'll have to find another way entirely.
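As a rough illustration of the parsing approach, here is a minimal sketch, assuming Python 3 and that the download URL appears as a plain absolute URL somewhere in the page source. The function name find_download_links, the regular expression, and the default extension list are all hypothetical choices for this example, not part of mechanize.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical pattern: absolute http(s) URLs, stopping at quotes,
# whitespace, angle brackets, backslashes, or a closing parenthesis.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\)]+""")


def find_download_links(page_source, extensions=(".exe", ".msi", ".zip")):
    """Return URLs in page_source whose path ends in one of the extensions."""
    links = []
    for url in URL_PATTERN.findall(page_source):
        # Look only at the path, so query strings do not confuse the check.
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext in extensions and url not in links:
            links.append(url)
    return links
```

The page source would be whatever text the response body contains. This only works when the link is written out literally; a URL assembled piece by piece in JavaScript would be missed.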
The ways a browser can know what a downloaded file is:
- Check the Content-Type
- Check the path extension (I'm not sure whether browsers do this second one.)
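The two checks above could be combined roughly as follows. This is a sketch under stated assumptions (Python 3, standard library only), not how any particular browser behaves: the name guess_extension, the preference for ".html", the treatment of application/octet-stream as uninformative, and the ".bin" fallback are all choices made for this example.

```python
import mimetypes
import posixpath
from urllib.parse import urlparse

# Assumption: pin text/html to ".html" rather than rely on the platform table.
PREFERRED = {"text/html": ".html"}
# Assumption: these values say nothing useful about the file type.
GENERIC_TYPES = {"", "application/octet-stream"}


def guess_extension(final_url, content_type):
    """Guess a file extension from the Content-Type, then the URL path."""
    # Drop parameters such as "; charset=utf-8".
    mime = (content_type or "").split(";")[0].strip().lower()
    path_ext = posixpath.splitext(urlparse(final_url).path)[1].lower()

    # Check 1: a specific Content-Type wins.
    if mime not in GENERIC_TYPES:
        ext = PREFERRED.get(mime) or mimetypes.guess_extension(mime)
        if ext:
            return ext

    # Check 2: fall back to the extension in the URL path, if any.
    return path_ext or ".bin"
```

With mechanize, the inputs would presumably come from the response object: response.geturl() for the final URL and the Content-Type header from response.info().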