Read a zipped .csv file in R

I have been trying hard to solve this, but I cannot get my head around how to read zipped .csv files in R. I could first unzip the files and then read them, but since the amount of unzipped data is around 22GB, I guess it is more practical to handle zipped files.
I basically have many .csv files, which I compressed one by one into individual .7z files. Every file is named like file1.csv, file2.csv, etc., which once compressed became file1.csv.7z, file2.csv.7z, etc., respectively.
If I use the following command:
data <- read.table(unz("substn-20100101.csv.7z", "substn-20100101.csv"), nrows=10, header=T, quote="\"", sep=",")
I get the message:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") : cannot open zip file 'substn-20100101.7z'
Any help would be much appreciated, thank you in advance.

First of all, if your problem is RAM, compressed files won't solve it: after read.table, for example, the whole file is loaded into memory, and as you said the unzipped data comes to around 22GB. If you are using these files for some kind of modelling, I advise you to look at the ff and bigmemory packages.
Another solution is Revolution R, which has an academic licence you can use for free. Revolution R provides big-data capabilities, and you can manage files like these easily with packages such as RevoScaleR.
Yet another solution is Postgres + MADlib + PivotalR. After ingesting the data into Postgres, use the PivotalR package to access it and build models with the MADlib library, directly from the R console.
BUT, if you are planning something that can be done with chunks of data (a summary, for example), you can use the iterators package. I will provide a use case to show how this can be done. Get the Airlines data for 1988 and follow this code:
> install.packages('iterators')
> library(iterators)
> con <- bzfile('1988.csv.bz2', 'r')
OK, now you have a connection to your file. Let's create an iterator:
> it <- ireadLines(con, n=1) ## read just one line from the connection (n=1)
Just to test:
> nextElem(it)
and you will see something like:
1 "1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI,273,NA,NA,0,NA,0,NA,NA,NA,NA,NA"
> nextElem(it)
and you will see the next line, and so on. Be aware that you are reading a line at a time, so you are not loading all the file to RAM.
If you want to read line by line until the end of the file, you can use, for example:
> tryCatch(expr=nextElem(it), error=function(e) return(FALSE))
When the file ends, it returns a logical FALSE.
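Putting that together, a minimal sketch of a read-to-end-of-file loop (the per-line processing is just a placeholder):
repeat {
  line <- tryCatch(nextElem(it), error = function(e) FALSE)
  if (identical(line, FALSE)) break   # the iterator signals end of file
  fields <- strsplit(line, ",")[[1]]  # e.g. split the csv line into fields
}
close(con)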

If I understand the question correctly, at least on Windows OS, you could use 7-Zip Command-Line.
For the sake of simplicity, put 7za.exe in your R working directory (along with your 7zip files) and create a .bat file with the following text in it:
7za e *.7z -y
...then in R you run the following code:
my_batch <- "your_bat_file_name.bat"
shell.exec(shQuote(my_batch, type = "cmd"))
Then you just read.table()...
It works for me.
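Alternatively, a minimal sketch that skips the .bat file entirely (it assumes 7za.exe is in the working directory or on the PATH, and reuses the file names from the question):
system("7za e substn-20100101.csv.7z -y")
data <- read.table("substn-20100101.csv", nrows=10, header=TRUE, quote="\"", sep=",")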

According to the readr package documentation, readr::read_csv and friends will automatically decompress files ending in .gz, .bz2, .xz, or .zip. Although .7z is not mentioned, a possible solution is to switch to one of those compression formats and then use readr (which also offers a number of other benefits). If your data is compressed with zip, your code would be:
library(readr)
data <- read_csv("substn-20100101.csv.zip", n_max=10)
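As a hedged sketch of the conversion step (this assumes the .csv has already been extracted from the .7z archive, and that a zip utility is available to utils::zip):
zip("substn-20100101.csv.zip", "substn-20100101.csv")  # re-compress as .zip for readr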


Error in file(file, "rt") : invalid 'description' argument when reading csv files

I have seen posts about this issue before, but none exactly like the one I am having. This code has been working for me with previous versions of R. I recently updated to R 4.2.1 and RStudio Desktop 2022.07.1+554, and now I get the subject error when I try to read in my data files. The data files all have the same filenames. I point to the top-level directory, and the code then goes down through the folder structure and pulls out all of the data files to be used by the rest of the program.
Also want to mention that I am not that well versed in R, so I may not be doing everything in the best manner. Any suggestions that anyone can provide would be most appreciated.
Here is my code to select the top-level folder, search through those folders, and then read the files; it is the reading step that generates the error.
wd <<- choose.dir(caption = "Select top level folder where your data is located")
setwd(wd)
#List the full path and filename of all files in the working directory and sub-directories that starts
#with "DINum" and ends with ".csv"
out_files <- list.files(pattern = "^DINum(.*)csv$", recursive = TRUE)
# initialise list to store csv files
list.data <- NULL
# create a loop to read in data
for (i in 1:length(out_files)) {
  list.data[[i]] <- read.csv(out_files[i], check.names = TRUE)
}
I found a solution. I had been using read.csv from base R. I installed the readr package and tried read_csv, and it is working fine. Not sure why read.csv no longer works, but from what I read online, read_csv is a better choice anyway for large data files. Thanks to everyone who tried to help me. I appreciate your time!
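A minimal sketch of that fix, assuming the same out_files vector as above:
library(readr)
list.data <- lapply(out_files, read_csv)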

Confusion while uploading the csv file in R [duplicate]

I have an Excel file that I want to open in R. I tried both of these commands after saving the Excel file as a csv file or a text file:
read.table() or read.csv()
I think part of the problem is where the file is located. I have it saved on the desktop. What am I missing here?
Here is the R output
In file(file, "rt") :
cannot open file 'Rtrial.csv': No such file or directory
> help.search("read.csv")
> read.csv("Rtrial.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'Rtrial.csv': No such file or directory
> read.table("tab")
To throw out another option: why not set the working directory (preferably via a script) to the desktop using setwd('C:/John/Desktop') and then read the files using just their file names.
Try
f <- file.choose()
to choose the file interactively and save the name in f.
Then run read.csv on the saved filename
d <- read.csv(f)
Sounds like you just have an issue with the path. Include the full path; if you use backslashes they need to be escaped: "C:\\folder\\folder\\Desktop\\file.csv" or "C:/folder/folder/Desktop/file.csv".
myfile = read.csv("C:/folder/folder/Desktop/file.csv") # or read.table()
It may also be wise to avoid spaces and symbols in your file names, though I'm fairly certain spaces are OK.
I had to combine Maiasaura's and Svun's answers to get it to work: using setwd and escaping all the backslashes (spaces do not need escaping):
setwd('C:\\Users\\firstname lastname\\Desktop\\folder1\\folder2\\folder3')
data = read.csv("file.csv")
data
This solved the issue for me.
Here is one way to do it. It uses R's ability to construct file paths based on the platform, and hence works on both Mac OS and Windows. Moreover, you don't need to convert your xls file to csv, as there are several R packages that read xls directly (e.g. the gdata package).
# get the user's home directory (note: setwd() returns the *previous*
# directory, so query the environment variable directly)
home <- Sys.getenv("HOME")
# construct the path to the file
fpath <- file.path(home, "Desktop", "RTrial.xls")
# load the gdata library to read xls files
library(gdata)
# read the xls file
Rtrial <- read.xls(fpath)
Let me know if this works.
Save As in Excel keeps the file open and locks it, so R can't read it. Close the Excel file first, or you won't be able to use it in R.
Give the full path and escape the backslashes: read.csv("c:\\users\\JoeUser\\Desktop\\JoesData.csv")
In my experience this error occurs when you move the Excel file somewhere other than where your .r file is located, or move your .r file somewhere other than where your Excel file is located.
Good practice:
Keep your .r and .csv files in the same directory.
Open your .r file by going into its directory, rather than through RStudio's Open File option.
You also have the Import Dataset option in RStudio's Environment pane; just click there, install the required packages, and from then on use that option to read datasets. You will not get this error again.
I also appreciate the answers provided above.
Another way of reading Excel files, including the new xlsx format, is the speedR package (https://r-forge.r-project.org/projects/speedr/). It is an interactive, visual data importer. Besides importing, you can filter (subset) existing objects from the R workspace.
My issue was very simple, the working directory was not the "Source" directory that was printed when the file ran. To fix this, you can use getwd() and setwd() to get your relative links working, or just use a full path when opening the csv.
print(getwd()) # Where does the code think it is?
setwd("~/Documents") # Where do I want my code to be?
dat = read.csv("~/Documents/Data Visualization/expDataAnalysis/one/ac1_survey.csv") #just make it work!
Mac OS: It happened to me as well. I simply chose Misc from the R toolbar and then Change Working Directory, and selected the directory the .csv file was saved in. When I went back to the command line and typed getwd(), the directory was updated and correct, and the read.csv function finally worked.
I had the same problem, and when I checked the file's properties in File Explorer, it showed me the following message:
"Security: This file came from another computer and might be blocked to help protect this computer"
Click the "Unblock" button and you can access the file from R without any problem, just using the read.csv() function from the directory specified as your working directory, even if it is not the same as the directory of the file you are accessing.
I just had this problem and I first switched to another directory and then switched back and the problem was fixed.
This worked for me, accessing the data via an absolute path. Use double backslashes in the address:
dataset = read.csv('C:\\Users\\Desktop\\Machine Learning\\Data.csv')
Kindly check whether the file name already includes the .csv extension, for example abc.csv (with extensions hidden, a file displayed as abc.csv may really be named abc.csv.csv); if so, remove the extra .csv so the actual name is abc.csv.
Set the working directory to the folder containing the file (~), then:
data <- read.csv("abc.csv")
Your data has been read into the data object.
In my case this very problem was caused by wrong capitalisation: a lower-case 'c:' instead of an upper-case 'C:' in the path. I corrected the spelling and the problem vanished.
You can use the absolute path to the file:
heisenberg <- read.csv(file="C:/Users/tiago/Desktop/sample_100000.csv")
If you really want to run something like
heisenberg <- read.csv(file="sample_100000.csv")
then you'll have to change the working directory to the folder that contains the .csv file.
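A minimal sketch of that second approach (the path is the hypothetical one from above):
setwd("C:/Users/tiago/Desktop")  # folder that contains the file
heisenberg <- read.csv(file="sample_100000.csv")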

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
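For example, a couple of quick ways to explore the returned list (assuming the dTemp object from above):
length(dTemp)              # how many tables were found
str(dTemp, max.level = 1)  # skim the structure without printing everything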
3) Extract the relevant list element, e.g.
df = dTemp[[2]]
(note the double brackets, which return the data frame itself rather than a one-element list)
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
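For instance, a hedged sketch with WriteXLS (it assumes the df extracted above, and that the package's Perl dependency is installed):
library(WriteXLS)
WriteXLS("df", ExcelFileName = "result.xls")  # writes df out as an old-style .xls file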
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. This needs an added setting to tell R that it is a binary file so it is handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, to use the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")

Editing a .r file from within another .r file

I am trying to make my current project reproducible, and so am creating a master document (eventually a .rmd file) that will be used to call and execute several other documents. This way myself and other investigators only need to open and run one file.
There are three layers to the current setup: master file, 2 read-in files, 2 databases. The master file calls the read-in files using source(), and the read-in files parse the .csv databases and apply labels.
The read-in files and the databases are generated automatically with the data management software I'm currently using (REDCap) each time I download the updated data.
However, the read-in files have a line of code that removes all of the objects in my environment. I would like to edit the read-in files directly from the master file so that I do not have to open the read-in files individually each time I run my report. Specifically, since all the read-in files are the same, I would like to remove line #2 in each.
I've tried searching Google, and tried file.edit(), but have been unable to find anything. Not even sure it is possible, but figured I would ask. Let me know if I can improve this question or if you need any additional code to answer it. Thanks!
Current relevant master code (edited for generality):
source("read-in1")
source("read-in2")
Current relevant read-in file code (same in each file, except for the database name):
#Clear existing data and graphics
rm(list=ls())
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
[read-in code truncated]
Additional details:
OS: Windows 7 Professional x86
R version: 3.1.3
R Studio version: 0.99.441
You might try readLines() and something like the following (which was simplified greatly by a suggestion from @Hong Ooi below):
eval(parse(text = readLines("read-in1.R")[-2]))
My original solution which was much more pedantic:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
for (l in t[-2]) { eval(parse(text=l)) }
The for() loop just parses and evaluates each line from the text file except the second one (that's what the -2 index does). If you're reading and writing longer files, then the following will be much faster than the second option, though still less preferable than @Hong Ooi's:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
f <- file("out.R", open="w")
writeLines(t[-2], f)
close(f)
source("out.R")
Sorry I'm so late in noticing this question, but you may want to investigate getting access to the REDCap API and using either the redcapAPI package or the REDCapR package. Both packages will let you export the data from REDCap directly into R without having to use the download scripts. redcapAPI will even apply all the formats and dates (REDCapR might do this now too; it was in the plan, but I haven't used it in a while).
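A minimal hedged sketch with redcapAPI (the URL and token are placeholders for your project's values):
library(redcapAPI)
rcon <- redcapConnection(url = "https://redcap.example.edu/api/", token = "YOUR_API_TOKEN")
data <- exportRecords(rcon)  # pulls the records straight into R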
You could try this. It just calls some shell commands: (1) renames the file, then (2) copies all lines not containing rm(list=ls()) to a new file with the same name as the original file, then (3) removes the copy.
files_to_change <- c("read-in1.R", "read-in2.R")
for (f in files_to_change) {
old <- paste0(f, ".old")
system(paste("cmd.exe /c ren", f, old))
system(paste("cmd.exe /c findstr /v rm(list=ls())", old, ">", f))
system(paste("cmd.exe /c rm", old))
}
After calling this loop you should have
#Clear existing data and graphics
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
in your read-in*.R files. You could put this in a batch script
@echo off
ren "%~f1" "%~nx1.old"
findstr /v "rm(list=ls())" "%~f1.old" > "%~f1"
del "%~f1.old"
say, "example.bat", and call that in the same way using system.

Reading an Excel file into an R dataframe from a zipped folder

I have an Excel file (.xls extension) that is inside a zipped folder that I would like to read as a dataframe into R. I loaded the gdata library and set up my working directory to the folder that houses the zipped folder.
When I type in the following syntax:
data_frame1 <- read.xls( unz("./Data/Project1.zip","schools.xls"))
I get the following error messages:
Error in path.expand(xls) : invalid 'path' argument
Error in file.exists(tfn) : invalid 'file' argument
I'm guessing that I'm missing some arguments in the syntax, but I'm not entirely sure what else needs to be included.
Thanks for your help! This R newbie really appreciates it!
Unfortunately, after a quick survey of all the xls-reading functions I know, there is none that can recognize the unz output (I would love to be proven wrong here). If it were a csv it would work fine. As it stands, until such a function is written, you must do the loading in two steps: extraction, then loading.
To give you a little more control, you can specify which file to unzip as well as the directory to place the files with unzip.
# default exdir is current directory
unzip(zipfile="./Data/Project1.zip", files = "schools.xls", exdir=".")
dataframe_1 <- read.xls("schools.xls")
Sadly, this also means that you must do cleanup afterwards if you don't want the 'xls' file hanging around.
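A minimal sketch that keeps things tidy by extracting into a temporary directory (file names as in the question, gdata loaded as above):
tmp <- tempdir()
unzip(zipfile="./Data/Project1.zip", files="schools.xls", exdir=tmp)
dataframe_1 <- read.xls(file.path(tmp, "schools.xls"))
file.remove(file.path(tmp, "schools.xls"))  # clean up the extracted copy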
