Python mechanize: detect downloaded file extension

- July 15, 2013

I'm trying to retrieve websites and save them to local disk using Python mechanize. The problem is that many websites redirect to links other than html/asp/php. Is there an accurate method to detect which extension a URL has and which type of file I retrieve?

For instance, http://www.yahoo.com should be saved as an HTML file.

http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as an .exe file, because it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable basis for guessing.

How can I accurately detect file extensions the way browsers do when saving a file?

Thanks heaps.

"http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as an .exe file, because it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable basis for guessing."

That's not quite correct. That page doesn't use an HTTP redirect. The problem is that Microsoft uses JavaScript to make the browser download the file. The actual file is:

http://download.microsoft.com/download/4/4/9/449b0038-ac27-4b24-bf11-dd8ebdf5cca6/sonar_setup.exe

Since mechanize can't run JavaScript for you, you'll have to resort to parsing the HTML and JavaScript files for links. That might be reasonable if you're scraping one site that always downloads files the same way. If you're looking for a general method, you'll have to find another way entirely.
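As a rough sketch of that parsing approach, the snippet below scans already-fetched page text (HTML or JavaScript) for absolute URLs whose path ends in a download-like extension. The helper name `find_download_links` and the default extension list are assumptions made for illustration, not part of mechanize. It is written for Python 3 and uses only the standard library.

```python
import re

# Matches absolute http(s) URLs up to the next quote, bracket, backslash,
# closing parenthesis, or whitespace character.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\)]+""")


def find_download_links(text, extensions=(".exe", ".msi", ".zip")):
    """Hypothetical helper: return URLs in `text` whose path ends in one of
    `extensions`, in order of first appearance and without duplicates."""
    links = []
    for url in URL_PATTERN.findall(text):
        # Ignore the query string and fragment when checking the extension.
        path = url.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(tuple(extensions)) and url not in links:
            links.append(url)
    return links
```

With mechanize, `text` would be the body you read from the response, plus the contents of any script files the page references. A regex like this only finds URLs that appear literally in the source; links assembled at runtime from string fragments will be missed.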

The only ways a browser can know what a downloaded file is are:

1. Check the Content-Type.
2. Check the path extension. (I'm not sure whether browsers do number 2.)
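The two checks above can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name `guess_extension` and the `CONTENT_TYPE_EXTENSIONS` table are made up for this example and cover only a few types. It is written for Python 3 and uses only the standard library.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative mapping only (an assumption); extend it for the types you expect.
CONTENT_TYPE_EXTENSIONS = {
    "text/html": ".html",
    "application/pdf": ".pdf",
    "application/zip": ".zip",
    "application/x-msdownload": ".exe",
}


def guess_extension(final_url, content_type=None):
    """Hypothetical helper: guess a file extension for a fetched resource."""
    # 1. Check the Content-Type header, ignoring parameters such as charset.
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in CONTENT_TYPE_EXTENSIONS:
            return CONTENT_TYPE_EXTENSIONS[mime]
    # 2. Fall back to the extension in the URL path, if there is one.
    ext = posixpath.splitext(urlsplit(final_url).path)[1]
    return ext.lower() or None
```

With mechanize, the inputs would typically come from the response object, for example `response.geturl()` for the final URL after redirects and the `Content-Type` entry of `response.info()`. For the Microsoft confirmation page this returns `.html`, which matches what the server actually sends: the page really is HTML, and the .exe only shows up once the direct download URL is requested.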
