Tuesday, July 21, 2009

ncftp and wget

As a bioinformatics researcher, I often need to download large data sets from various servers. The sites I download from most often are NCBI and EBI. While most of the raw data can be found at NCBI, EBI hosts more heavily curated data. One nightmare I often face is updating the InterProScan database. The data files are around 9GB and take a long time to download, often exceeding the data limit for the server that downloads them. Most irritating of all is when the download times out or fails with "the message list is too long". A good way to handle this trouble is to use "wget" or "ncftp".

Wget is simple to use and user friendly, and it is reliable and robust. It is designed to cope with slow or unstable network connections, which helps if you are working from home on an unreliable network. If a download does not complete due to a network problem, Wget will automatically try to continue the download from where it left off, and repeat this until the whole file has been retrieved. It was one of the first clients to make use of the then-new Range HTTP header to support this feature.

If you are using HTTP or HTTPS with wget, the command will look something like this (note that --no-check-certificate skips SSL certificate verification, so use it only with servers you trust):
wget --no-check-certificate https://login:passwd@site-address/path
or, for FTP:
wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
More information on the command options can be found by running "man wget".
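For very large files it can help to tell wget explicitly to resume and keep retrying. The following is only a minimal sketch: the function name fetch_with_resume, the WGET override variable, and the particular timeout values are my own assumptions for illustration, while the wget options themselves (-c, --tries, --timeout, --waitretry) are standard ones.

```shell
#!/bin/sh
# Hypothetical helper: download one URL with wget, resuming partial files.
# Set WGET=echo to preview the arguments without downloading anything.
fetch_with_resume() {
    url="$1"
    # -c              continue a partially downloaded file instead of restarting
    # --tries=0       retry without limit (0 means infinite)
    # --timeout=60    treat a connection stalled for 60 seconds as failed
    # --waitretry=30  back off between retries, waiting up to 30 seconds
    "${WGET:-wget}" -c --tries=0 --timeout=60 --waitretry=30 "$url"
}
```

It would be called as: fetch_with_resume ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz. If the transfer is interrupted, running the same call again picks up from the partial file rather than starting over.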

While wget is really cool, ncftp is another FTP client that is sometimes much better than other existing tools if you must use FTP for data download. A typical ncftp command could be:

ncftp -u login -p pass ftp://ftp.hostname.edu

Then use the "get" command at the ncftp prompt to fetch the files of interest.
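For scripted updates where nobody is sitting at the prompt, the ncftp package also ships a non-interactive companion, ncftpget, which takes the host, a local directory, and the remote file on one command line. The sketch below is a rough illustration only: the function name, the FTP_USER, FTP_PASS and NCFTPGET variables, and the example path are all hypothetical and not from any particular server.

```shell
#!/bin/sh
# Hypothetical helper: fetch one remote file with ncftpget.
# Arguments: host, remote file path, local directory (defaults to ".").
# Set NCFTPGET=echo to preview the arguments without connecting.
fetch_with_ncftpget() {
    host="$1"
    remote_file="$2"
    dest_dir="${3:-.}"
    # -u / -p  login name and password, taken from FTP_USER and FTP_PASS
    #          (falling back to the placeholders "login" and "pass")
    "${NCFTPGET:-ncftpget}" -u "${FTP_USER:-login}" -p "${FTP_PASS:-pass}" \
        "$host" "$dest_dir" "$remote_file"
}
```

A call might look like: fetch_with_ncftpget ftp.hostname.edu /pub/data/file.tar.gz /tmp, where the remote path is made up. Keep in mind that a password given on the command line can be seen by other users on the same machine.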
