Python mechanize: detect downloaded file extension


I'm trying to retrieve websites and save them to local disk using Python mechanize. The problem is that many websites redirect to links other than html/asp/php. Is there an accurate method to detect which extension the URL has and what type of file I retrieve?

For instance: http://www.yahoo.com should be saved as an html file.

http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as an .exe file, as it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable method to guess with.

How can I accurately detect file extensions the way browsers do while saving a file?

Thanks heaps.

"http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as an .exe file, as it redirects and downloads an exe file. The Content-Type is declared as text/html, so it is not a reliable method to guess with."

That's not quite correct. It doesn't use an HTTP redirect. The problem is that Microsoft uses JavaScript to cause the browser to download the file. The actual file is:

http://download.microsoft.com/download/4/4/9/449b0038-ac27-4b24-bf11-dd8ebdf5cca6/sonar_setup.exe

Since mechanize can't run JavaScript for you, you'll have to resort to parsing the HTML and JavaScript files for links. That might be reasonable if you're scraping one site that downloads all its files the same way. If you're looking for a general method, you'll have to find another way entirely.
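As a rough illustration of the "parse the page for links" approach, here is a minimal sketch. The function name `find_download_links` and the list of extensions are my own assumptions, not anything mechanize provides, and a plain regular expression like this will miss URLs that the JavaScript builds up from pieces.

```python
import re

# Hypothetical pattern: absolute http(s) URLs ending in one of a few
# extensions we happen to care about. Adjust the list for your site.
LINK_RE = re.compile(
    r"""https?://[^\s"'<>]+?\.(?:exe|msi|zip)\b""",
    re.IGNORECASE,
)


def find_download_links(page_source):
    """Return candidate download URLs found anywhere in the page text.

    Works on raw HTML or JavaScript source, keeps first-seen order,
    and drops duplicates.
    """
    links = []
    for url in LINK_RE.findall(page_source):
        if url not in links:
            links.append(url)
    return links
```

The text you pass in would be whatever you read from the mechanize response, decoded to a string. For a single site that always embeds the real URL in the page, something like this may be enough.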

The ways a browser can know what a downloaded file is:

  1. Check the Content-Type
  2. Check the path extension (I'm not sure if browsers do 2.)
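The two checks above could be combined along these lines. This is only a sketch using the standard library: `guess_extension` and `GENERIC_TYPES` are names I made up, and treating `application/octet-stream` as "no information" is my own assumption rather than documented browser behaviour. It expects the URL you ended up at and the Content-Type header value from the response.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit

# Assumption: these Content-Type values tell us nothing useful,
# so we fall back to the URL path for them.
GENERIC_TYPES = {"", "application/octet-stream"}


def guess_extension(final_url, content_type):
    """Guess a file extension from Content-Type, then from the URL path.

    Returns a string such as ".exe", or "" when neither check helps.
    """
    # 1. Check the Content-Type, ignoring parameters like "; charset=utf-8".
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime not in GENERIC_TYPES:
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext

    # 2. Check the path extension of the URL (query string excluded).
    path = urlsplit(final_url).path
    return posixpath.splitext(path)[1].lower()
```

Note that for the Microsoft confirmation page itself this would still report an HTML extension, because that page really is HTML. You need the URL of the actual file before either check can say anything about the .exe.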
