Download specific file from tar.gz - tarfile

I want to extract and download a single file from a tar.gz archive hosted on a web server, without first downloading the entire archive, because it is large (about 3 GB). I am using a Unix-like environment. How can I achieve this with either a shell command or a Python module?
Related tools:
5 Ways to Preview ZIP and Download Selected Files in Archive, On Windows
Related questions:
Download specific folder from tar.gz file using wget command
How to download only a single file from an online ZIP archive via Powershell?
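One possible approach with the standard-library tarfile module is sketched below. A gzip stream cannot be read at random offsets, so everything stored before the wanted member still has to be downloaded; the saving comes from stopping as soon as the member has been read. The function names `read_member` and `fetch_member` are hypothetical helpers for illustration, and the archive URL and member path are whatever applies to your server.

```python
import tarfile
import urllib.request


def read_member(stream, member_name):
    """Return the bytes of one regular file from a gzip-compressed tar stream."""
    # "r|gz" reads the archive strictly front to back, so no seeking is needed
    # and nothing has to be saved to disk first.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and member.name == member_name:
                return archive.extractfile(member).read()
    return None  # the whole stream was read without finding the member


def fetch_member(url, member_name):
    """Open the URL and stop reading once the member has been found."""
    with urllib.request.urlopen(url) as response:
        return read_member(response, member_name)
```

If the wanted file sits near the end of the archive, this still transfers almost all of it, so the benefit depends on where the member is stored.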

Related

Problem with R downloading zipped ACS data files

I have been having a problem trying to download a zipped file with American Community Survey (ACS) data. The file is a zipped folder that contains zipped sub-folders within it. I want to download and unzip the file leaving the individual zipped sub-folders. The code I am using is:
ACS.url<-"https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_entire_sf/Tracts_Block_Groups_Only.zip"
dir<-getwd()
zip.file<-"CTrctACS19.zip"
zip.combine<-as.character(paste(dir,zip.file,sep="/"))
download.file(ACS.url,destfile=zip.combine,mode="wb")
unzip(zip.file)
After running the code I get what appears to be a correct download, but the unzip does not work. I get the following error message:
In unzip(zip.file) : error 1 in extracting from zip file
The downloaded file is only about half the size of the zip file I am trying to download (the original is 3.7 GB on the Census website, but I have about 1.8 GB), so I think it is not downloading the data correctly. I tried to open the downloaded file with WinZip and that did not work either. I can download files from that website if they are smaller and don't include zipped subfolders. Any help would be appreciated.
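A quick way to confirm the truncation suspicion is to check the file before unzipping. The sketch below is in Python rather than R, and `download_looks_complete` is a hypothetical helper name. It relies on the fact that a zip file keeps its index at the end of the file, so a download that was cut short normally fails the `zipfile.is_zipfile` check.

```python
import os
import zipfile


def download_looks_complete(path, expected_bytes=None):
    """Heuristic check that a downloaded zip file was not cut short."""
    # If the expected size is known (for example from the download page),
    # compare it with the size on disk first.
    if expected_bytes is not None and os.path.getsize(path) != expected_bytes:
        return False
    # is_zipfile looks for the end-of-archive record, which a truncated
    # download normally lacks.
    return zipfile.is_zipfile(path)
```

A `False` result would point to the download step rather than to the nested zip files inside the archive.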

Download only *.Rmd files from a github repository using R or Rmd

I would like to download all of the *.Rmd files in a github repository.
For a simple example, say I wanted to use R or an Rmd file to download all of the *.Rmd files in this repo:
https://github.com/maelle/rmd-blogging-course
I tried using a bash chunk in my Rmd file and wget, but wasn't able to get the Rmd files:
#\```{bash}
wget -r -k --accept *.Rmd https://github.com/maelle/rmd-blogging-course
#\```
I've seen this previous question on how to download an entire repo, but I'm after only the files of a certain extension.
How to download entire repository from Github using R?
You should use Git to clone the repository, or if you only need one revision, you can download a tarball or a zip file, the latter of which you can access from the button that says “Code”. As for downloading just the *.Rmd files, GitHub doesn't provide a way to recursively download a large number of files without cloning or downloading a tarball or zip file.
While there are raw file endpoints, they won't work with wget --recursive because there are no directories. Trying to do so anyway would likely cause you to get rate-limited and possibly flagged, since those endpoints aren't intended for bulk download. A tarball or zip file will likely be much faster as well.
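Once the tarball is on disk, picking out only the *.Rmd files is straightforward. The sketch below uses Python rather than R, and `read_files_with_suffix` is a hypothetical helper name; it assumes the tarball has already been downloaded and is gzip-compressed.

```python
import tarfile


def read_files_with_suffix(tarball_path, suffix=".Rmd"):
    """Map member name -> file bytes for every regular file ending in suffix."""
    found = {}
    with tarfile.open(tarball_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links and files with other extensions.
            if member.isfile() and member.name.endswith(suffix):
                found[member.name] = archive.extractfile(member).read()
    return found
```

The returned names include the top-level folder that the tarball wraps around the repository contents, so strip that prefix if you write the files back out.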

I've downloaded a file via git clone to my notebook on google colab, how do I determine that file's path now?

To be clear this file was NOT imported from Google Drive, instead it was downloaded directly.
Use %pwd to show the current directory, %cd to switch directories, and !ls to list a directory's contents. (Or use the file browser GUI on the left-hand side.)
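If you would rather search than browse, a plain-Python sketch like the following can locate the file. `find_file` is a hypothetical helper name, and it assumes the clone landed somewhere under the notebook's current working directory.

```python
from pathlib import Path


def find_file(filename, root=None):
    """Return absolute paths of every file named filename below root."""
    # Default to the current working directory, which is what %pwd prints.
    root = Path(root) if root is not None else Path.cwd()
    return sorted(p.resolve() for p in root.rglob(filename) if p.is_file())
```

Calling `find_file("README.md")` in a notebook cell would list every match below the working directory, including the one inside the cloned repository.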

Changing directories in Spyder 3.5

I just updated my copy of Spyder on my Windows XP desktop to Spyder 3.5. How do I change the working directory to the one (a subfolder of My Documents) that holds my Python script and .txt data files (e.g. for running a regression)?
I had the same problem. I'm using the latest Spyder. I wanted my files to be in other than my home directory. I tried setting the current directory in Preferences. My files still wound up in my home directory. I tried the "cd" command. It said that the current directory was the one I wanted, but still the files were written to my home directory.
So I saved my script file in the directory I wanted. Now my files go where I would like.
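A way to avoid depending on the working directory at all is to build file paths relative to the script itself. This is a minimal sketch, not a Spyder feature: `beside_script` is a hypothetical helper name, and it assumes the script is run as a file so that `__file__` is defined.

```python
from pathlib import Path


def beside_script(script_file, filename):
    """Return the path of filename inside the folder that holds script_file."""
    return Path(script_file).resolve().parent / filename


# Typical use inside a script that is run as a file:
#     data_file = beside_script(__file__, "data.txt")
#     with open(data_file) as handle:
#         ...
```

With this, input and output files end up next to the script no matter which directory the console happens to be in.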

How to save Jupyter notebooks from GitHub

When I download an ipynb file using the RAW button in GitHub it displays the text (json) in the browser.
Should I just copy this text into a file and name it xxx.ipynb? What's the best way to do it?
First click on Raw
Then press Ctrl+S to save it as .ipynb (note that you'll have to manually type '.ipynb' after the file name to make this work, as files from GitHub are saved as text files by default).
Open jupyter notebook
Go to location where you saved .ipynb file
Open file, you will see the code
Hope this helps
Here is the Lifesaver extension I developed for both Chrome and Firefox.
The project is open-sourced here.
The extension not only opens github hosted notebooks in Colab but also in nbviewer!
And you can open the github repo from Colab and nbviewer
And go to nbviewer from Colab and github
Works all 3 ways!!
A new feature of opening new notebooks in one-click is already developed in the master branch, just need to push it to the extension platforms :)
Firefox extension
Chrome extension
The following steps worked for me:
Click on Raw in git repository.
Save the file. The file was saved as *.ipynb.txt format for me.
Then, in the Jupyter directory tree (not in the local directory), I selected the file, removed the .txt at the end, and renamed it to *.ipynb.
Finally I was able to run the file as a Jupyter notebook.
Note that when I tried to rename the *.txt file to *.ipynb in the local directory, it did not work. The file had to be renamed in the directory tree in Jupyter itself.
True to 2020:
Click Download
Wait for the JSON to finish loading in your browser
Ctrl+S (save as .txt file)
remove .txt extension
Run locally
I saved the file following the instructions from this post. My destination, however, was a folder on Google Drive. I opened Google Drive in my browser and located the file. From there, I renamed my file by just removing the .txt extension, leaving the .ipynb extension. That worked for me.
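The rename step described in these answers can also be scripted. The sketch below is only an illustration: `fix_notebook_extension` is a hypothetical helper name, and the check for a top-level "cells" key assumes a notebook in the nbformat 4 layout.

```python
import json
from pathlib import Path


def fix_notebook_extension(path):
    """Rename 'name.ipynb.txt' to 'name.ipynb' after a basic sanity check."""
    path = Path(path)
    if path.suffixes[-2:] != [".ipynb", ".txt"]:
        raise ValueError(f"{path.name} does not end in .ipynb.txt")
    # A notebook is a JSON document; fail early if the browser saved
    # something else, such as an HTML page.
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    if "cells" not in notebook:
        raise ValueError(f"{path.name} does not look like a notebook")
    # with_suffix("") drops only the trailing ".txt".
    return path.rename(path.with_suffix(""))
```

The function returns the new path, which can then be opened from the Jupyter file browser as usual.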
