Newbie here. I have 1000 compressed CSV files that I need to read and row bind. My problem is similar to this one, but with two differences:
a) File names are of different lengths and not sequential, in this form:
"members_[name of company]_[state code].csv"`
I have two vectors, company and states, with the required codes. So I've built a vector of all the files I need with this code:
combinations <- expand.grid(company, states)
csvfiles <- paste0("members_" ,
combinations$Var1, "_",
combinations$Var2,".csv" )
so it has all the filenames I need (20 companies X 50 states). But I am lost as to how to cycle through all zip files. There are 10 other CSVs inside those zip files, but I only need the ones described above.
b) When decompressed, the files expand to a directory structure such as this:
/files/member_database/members/state/members_[name of company]_[state code].csv
but when I try to read the CSV from the zip file using
data <- read.csv(unz("members_GE_FL.zip", "members_GE_FL.csv"), header=F, sep=":")
it returns the 'cannot open connection' message. Adding the path such as ./files/member_database/members/state/members_GE_FL.csv doesn't work either.
I'm also not sure whether a command like read.csv(unz(csvfiles, ...)) would pick up the file names from my csvfiles vector, or whether it fails because of the path problem above or because the command is wrong altogether.
Any help is appreciated -- insights, docs I should look at, etc. Again, I'm NOT trying to get people to do my work. As I type, I have 37 tabs open (many from SO), and have already spent 22 hours on this thing alone. I've learned from this post and others how to read a file within a ZIP, and from this post how to extract and import data. Still, I can't piece it all together. I only started with R a few months ago, and have no prior experience as a programmer.
I suspect all that was missing was the correct path to the file in the archive: neither "members_GE_FL.csv" nor "./files/member_database/members/state/members_GE_FL.csv" will work.
But "files/member_database/members/state/members_GE_FL.csv" (without the initial dot) should.
For the sake of completeness, here is a complete example:
Let's create some dummy data, three files named out-1.csv, out-2.csv, out-3.csv and zip them in dummy-archive.zip:
if (!dir.exists("data")) dir.create("data")
if (!dir.exists("data/dummy-files")) dir.create("data/dummy-files")
for (i in 1:3)
write.csv(data.frame(foo = 1:2, bar = 7:8), paste0("data/dummy-files/out-", i, ".csv"), row.names = FALSE)
zip("data/dummy-archive.zip", "data/dummy-files")
Now let's assume we're looking for 3 other files, two of which are in the archive, one is not:
files_to_find <- c("out-2.csv", "out-3.csv", "out-4.csv")
List the files in the archive, and name them for the sake of clarity:
files_in_archive <- unzip("data/dummy-archive.zip", list = TRUE)$Name
files_in_archive <- setNames(files_in_archive, basename(files_in_archive))
# dummy-files out-2.csv
# "data/dummy-files/" "data/dummy-files/out-2.csv"
# out-3.csv out-1.csv
# "data/dummy-files/out-3.csv" "data/dummy-files/out-1.csv"
Find the indices of files we're looking for in the archive, and read them like you intended to (with read.csv(unz(....))):
i <- basename(files_in_archive) %in% files_to_find
res <- lapply(files_in_archive[i], function(f) read.csv(unz("data/dummy-archive.zip", f)))
# $`out-2.csv`
# foo bar
# 1 1 7
# 2 2 8
#
# $`out-3.csv`
# foo bar
# 1 1 7
# 2 2 8
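And if, as in your original question, the end goal is to row bind everything into one data frame, a minimal follow-up (assuming all the CSVs share the same columns):
# Stack all data frames in the list; the list names become row-name prefixes
all_data <- do.call(rbind, res)
# Or keep the source file as an explicit column first
res_tagged <- Map(function(d, f) { d$source_file <- f; d }, res, names(res))
all_data <- do.call(rbind, res_tagged)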
Clean-up:
unlink(c("data/dummy-files/", "data/dummy-archive.zip"), recursive = TRUE)
I have 50 text files, all beginning with NEW. I want to loop through each text file/data frame, run the same functions on each of these files, and then output the results via the write.table function. So for each of the 50 files the same functions are applied, and 50 independent output files should be created, each containing the original name with the word 'output' at the end.
Here is my code. I have used the list.files function to read in the 50 files, but how can I adapt my code below so that the R commands/functions run for each of the 50 files independently and then output the corresponding 50 files?
file_list <- list.files("/data/genome/relevantfiles/AL/*NEW*.") #reading in the list of 50 files in my directory all starting with NEW
df <- file_list #each file *NEW* is a df #this does not work - how can I read each file into a data frame?
##code for each dataframe. The code/function is shown more simply below because it works
#running a function on the dataframe
df_output_results <- coloc.susie(dataset1 = list(beta=df$BETA, varbeta=df$Varbeta, N=1597, type="quant" ...)
#printing out results for each dataframe
final_results <- print(df_output_results$summary)
#outputting results
write.table(final_results, file = paste0(fileName,"_output.txt"), quote = F, sep = "\t")
I am unsure how to adapt the code so that each file is put into a list, the code in the block above is applied to each file, and the results are output into a separate file ... repeated for 50 files. I am assuming I need to use lapply but am not sure how to use it? The code in the block works, so there is no issue with the code itself.
From what I understand you want to import 50 files from a folder and store each file in a list. Then you want to loop a function across that list, then export those results somewhere.
I created an example folder on my desktop ("Desktop/SO Example") and put five CSV files in there. You didn't specify what format your files were in, but you can change the below code to whatever import command you need (see ?read.delim). The data in the CSVs are identical and made using:
library(stringr) # provides the 'words' vector used below
ex_df <- data.frame(A = LETTERS[1:5], B = 1:5, C = words[1:5])
And look like this:
  A B        C
1 A 1        a
2 B 2     able
3 C 3    about
4 D 4 absolute
5 E 5   accept
I imported these and stored them in a list using lapply. Then I made a simple example function to loop through each data frame in the list and perform some operation (using lapply again). Lastly, I exported those results as CSV files back into the same folder using sapply.
Hopefully this helps!
# Define file path to desired folder
file_path <- "~/Desktop/SO Example/"
# Get file names in the folder
file_list <- list.files(path = file_path)
# Use lapply() with read.csv (if they are CSV files) to store data in a list
list_data <- lapply(file_list, function(x) read.csv(paste0(file_path, x)))
# Define some function
somefunction <- function(x){
paste(x[,1], x[,2], x[,3])
}
# Run the function across your list data using lapply()
results <- lapply(list_data, somefunction)
# Output to the same folder using sapply
sapply(1:length(results), function(x)
write.csv(results[x],
paste0(file_path, "results_output_", x, ".csv"),
row.names = FALSE))
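If you would rather have the output files carry the original file names with "output" appended (as described in the question), here is a small variation, sketched with tools::file_path_sans_ext to strip the extension:
# Build output names from the input names, e.g. "NEW_foo.txt" -> "NEW_foo_output.txt"
out_names <- paste0(tools::file_path_sans_ext(file_list), "_output.txt")
# Write one tab-separated output file per input, back into the same folder
invisible(Map(function(res, out) {
  write.table(res, file = paste0(file_path, out),
              quote = FALSE, sep = "\t", row.names = FALSE)
}, results, out_names))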
I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_4 <- read.csv(file = 'values_04.csv', header=FALSE) # read CSV in as its own df
values_4$howMany3s <- apply(values_4, 1, function(x) length(which(x==3))) # compute number of 3's
values_4$exactly4 <- apply(values_4[50], 1, function(x) length(which(x==4))) # show 1/0 on each column that has exactly four 3's
values_4 # print new matrix
I am then copying and pasting this code over and over, changing the "4" to a 5, 6, etc., and noting the values. This seems wildly inefficient to me, but I'm not experienced enough at R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
files,
function(f) {
values <- read.csv(f, header=FALSE)
apply(values, 1, function(x) length(which(x==3)))
}
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values04.csv"]]
I am trying to identify which types of CSV files will not be modified.
There are 540 CSV files in one folder, and only 518 are modified. Basically, I wrote code to read and prepare these files to be modified by a Java application, which is run from the terminal on Linux.
This is what terminal shows:
data_3_5.csv
Error in mapmatching or profiling!
No edge matches found for path. Too short? Sequence size 2
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
file_names <-list.files(directory)
predict(file_names, model, filename="", fun=predict, ext=NULL,
const=NULL, index=1, na.rm=TRUE)
I think it only fails for files that are too short? Maybe I should just apply code that calculates the length of all columns in every CSV file and flags the ones smaller than some n?
Welcome, and good job posting some code. You're pretty close; the predict function is used in modelling, though. Try this:
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
## let's take out a little bit of protection to ensure we are only getting csvs
file_names <-list.files(directory, pattern = ".csv", full.names = TRUE)
## ^ ok so the above gives us all the filenames, but we haven't read them in yet...
## so let's create a function that reads the files in and counts how many columns in each.
library(tidyverse)
## if the above fails, run install.packages("tidyverse")
## let's create a function that will open the csv file and read the number of columns for each.
openerFun <- function(x){ ## here x is the input, or the path
openedFile <- read.csv(x, stringsAsFactors = FALSE) ## open the file
numCols <- ncol(openedFile) ## Count columns
tibble(name = x, numCols = numCols) ## output the file with the # columns
}
## and now let's call it with map, but map_dfr it's better cause we have a nice dataframe!
map_dfr(file_names, openerFun)
Once you have that, you can use it to compare against which files failed... hopefully that will help!
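For instance, a minimal sketch of that comparison, assuming you keep a (hypothetical) character vector failed_files with the names the Java step rejected:
## Column counts for every CSV in the folder
file_info <- map_dfr(file_names, openerFun)
## Hypothetical: the file names that were not modified, taken from the terminal log
failed_files <- c("data_3_5.csv")
file_info %>%
  mutate(failed = basename(name) %in% failed_files) %>%
  arrange(numCols) ## short/odd files float to the top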
I have this huge database from a telescope at the institute where I am currently working. The telescope saves one file per day, recording a value for each of its 8 channels every 10 seconds; each day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for a single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in a single file and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 database formats: one has a sort of header occupying the first 5 lines of the file, the other doesn't.
Every month has its own folder containing its days, and these folders are saved in the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than I've used before, and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focussing on only .sn1 files to begin with
files_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$path[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
dat <- fread(fname, sep = " ", skip = 5)
dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
dat # return the data, not just the file name
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
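A rough sketch of that wrapper, reusing read_func from above and assuming all_directories$paths holds valid folder paths:
read_directory <- function(dir_path){
  # All .sn1 files in this folder, with their full paths
  f_names <- dir(path = dir_path, pattern = ".sn1", full.names = TRUE)
  rbindlist(lapply(f_names, read_func), use.names = TRUE, fill = TRUE)
}
# One big data.table per directory, collected in a list
data_by_dir <- lapply(all_directories$paths, read_directory)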
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
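One small caveat: in a regular expression the dot matches any character, so ".sn1" would also match names like "xsn1". If that ever becomes a problem, escaping the dot and anchoring the pattern is slightly safer:
ptrns <- paste0("\\.sn", 1:5, "$", collapse = "|")
# "\\.sn1$|\\.sn2$|\\.sn3$|\\.sn4$|\\.sn5$"
dir(all_directories$paths[1], pattern = ptrns)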
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
ptrns <- paste(sapply(1:5, function(q) paste(".sn",q,sep = "")), collapse = "|")
inter <- dir(z, pattern = ptrns)
return(paste(z,inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1   Title
Row 2   StudentID   Var1   Var2        Var3   Var4      Var5
Row 3   11234       1      9/8/2011    343    159-167   32
Row 4   11235       2      9/16/2011   112    152-160   12
Row 5   11236       1      9/8/2011    325    164-171   44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import data from excel. Using the XLSX package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the excel file names in the folder, and then try to merge them in one statement using the a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment as an answer (with bdemerast picking up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR to read.xlsx2.
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
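To then stack everything into one data frame with a column recording the source file (the stated goal), a possible follow-up, sketched under the assumption that the xlsx package is loaded and all files share the same columns:
library(xlsx)
# Read each file, tag it with its file name, then row bind
all_data <- do.call(rbind, lapply(filenames, function(x) {
  d <- read.xlsx2(file = x, sheetIndex = 1, startRow = 2, header = TRUE)
  d$source_file <- x
  d
}))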