I have a folder with more than 100 different csv files, and every file is very large and takes a long time to open. How can I write code that just checks whether each file opens correctly and, if not, tells me which files are problematic? I've tried this, but it did not work.
library(data.table)
setwd("Working dir")
files <- list.files(pattern = "\\.csv$")
numfiles <- length(files)
for (i in 1:numfiles) {
  files[i] <- paste(".\\", files[i], sep = "")
  # read each file and assign it to an object named after the file
  assign(gsub("[.]csv$", "", files[i]), fread(files[i], header = FALSE))
}
Thank you for your help
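One way to get at "which files are problematic" (a minimal sketch: each fread() call is wrapped in tryCatch() so a file that fails is recorded instead of stopping the loop):
library(data.table)
files <- list.files(pattern = "\\.csv$")
bad_files <- character(0)
for (f in files) {
  res <- tryCatch(fread(f, header = FALSE),
                  error   = function(e) e,
                  warning = function(w) w)
  if (inherits(res, "error") || inherits(res, "warning")) {
    bad_files <- c(bad_files, f)   # remember files that did not read cleanly
  }
}
bad_files   # the problematic files, if any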
Related
I want to read multiple csv files where I only read two columns from each. So my code is this:
library(data.table)
files <- list.files(pattern="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\*.csv")
temp <- lapply(files, function(x) fread(x, select = c("screenNames", "retweetUserScreenName")))
data <- rbindlist(temp)
This yields character(0). However, when I move those csv files out to where my script is and change the files line to this:
files <- list.files(pattern="*.csv")
#....
My dir() output is this:
[1] "adjaceny_list.R" "cleanusrnms_firstbatch"
[3] "RawCSV_firstBatch" "username_cutter.py"
everything gets read. Could you help me track down what exactly is going on, please? The folder that contains these csv files is in the same directory as the script. So even if I use pattern="RawCSV_firstBatch\\*.csv", I get the same problem.
EDIT:
also did:
files <- list.files(path="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\",pattern="*.csv")
#and
files <- list.files(pattern="C:/Users/XYZ/PROJECT/NAME/venv/RawCSV_firstBatch/*.csv")
Both yielded an empty data frame.
@NelsonGon mentioned a workaround:
Do something like list.files("./path/folder", pattern="*.csv$"). Use .. or . as required (not sure about using the actual path). Can also utilise ~.
So that works, thank you. (Sorry, there is a 2-day limit before I can tick this as the answer.)
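For reference, a minimal sketch of the working pattern (assuming RawCSV_firstBatch sits next to the script): pass the folder via path=, keep pattern= as a regular expression, and set full.names=TRUE so fread() receives paths it can actually open:
library(data.table)
files <- list.files(path = "RawCSV_firstBatch",   # folder relative to the script
                    pattern = "\\.csv$",          # pattern is a regex, not a glob
                    full.names = TRUE)            # return "RawCSV_firstBatch/x.csv", not "x.csv"
temp <- lapply(files, function(x) fread(x, select = c("screenNames", "retweetUserScreenName")))
data <- rbindlist(temp)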
UPDATE
Thanks for the suggestions. This is how far I got, but I still can't figure out how to get the loop variable to work inside the file path.
setwd("//tsclient/C/Users/xxx")
folders <- list.files("TEST")
--> This gives me a list of my folder names
for(f in folders){
setwd("//tsclient/C/xxx/[f]")
files[f] <- list.files("//tsclient/C/Users/xxx/TEST/[f]", pattern="*.TXT")
mergedfile[f] <- do.call(rbind, lapply(files[f], read.table))
write.table(mergedfile[f], "//tsclient/C/Users/xxx/[f].txt", sep="\t")
}
I have around 100 folders, each containing multiple txt files. I want to create 1 merged file per folder and save that elsewhere. However, I do not want to manually adapt the folder name in my code for each folder.
I created the following code to load in all files from a single folder (which works) and merge these files.
setwd("//tsclient/C/xxx")
files <- list.files("//tsclient/C/Users/foldername", pattern="*.TXT")
file.list <- lapply(files, read.table)
setattr(file.list, "names", files)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
write.table(masterfilesales, "//tsclient/C/Users/xxx/datasets/foldername.txt", sep="\t")
If I did this manually, I would have to adapt "foldername" every time. The folder names are numeric: around 100 values between 2500 and 5000 (always 4 digits).
I looked into repeat loops, but I could not get them to work inside a file path.
If anyone could direct me in a good direction, I would be very grateful.
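A minimal sketch of one way to do this (assuming the folders live under //tsclient/C/Users/xxx/TEST and the merged files should land in //tsclient/C/Users/xxx/datasets; both paths are taken from the snippets above): build each path with file.path() instead of writing the folder name into the string by hand.
library(data.table)
base_dir <- "//tsclient/C/Users/xxx/TEST"
out_dir  <- "//tsclient/C/Users/xxx/datasets"
folders  <- list.files(base_dir)                  # e.g. "2500", "2501", ...
for (f in folders) {
  files <- list.files(file.path(base_dir, f), pattern = "\\.TXT$", full.names = TRUE)
  file.list <- lapply(files, read.table)
  setattr(file.list, "names", basename(files))
  merged <- rbindlist(file.list, idcol = "id")[, id := substr(id, 1, 4)]
  write.table(merged, file.path(out_dir, paste0(f, ".txt")), sep = "\t")
}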
I need to download a few hundred Excel files and import them into R each day. Each one should become its own data frame. I have a .csv file with all the addresses (the addresses remain static).
The .csv file looks like this:
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%b
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a0
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%aa11
etc.....
I can do it with a single file like this:
library(XLConnect)
my.url <- "http://www.somehomepage.com/chartserver/hometolotsoffiles%a"
loc.download <- "C:/R/lotsofdata/" # each files probably needs to have their own name here?
download.file(my.url, loc.download, mode="wb")
df.import.x1 = readWorksheetFromFile("loc.download", sheet=2))
# This kind of import works on all the files, if you ran them individually
But I have no idea how to download each file, place each one in its own folder, and then import them all into R as individual data frames.
It's hard to answer your question as you haven't provided a reproducible example and it isn't clear exactly what you want. Anyway, the code below should point you in the right direction.
You have a list of urls you want to visit:
urls = c("http://www/chartserver/hometolotsoffiles%a",
"http://www/chartserver/hometolotsoffiles%b")
In your example, you load this from a csv file.
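(A sketch of that loading step, assuming the addresses sit one per line in a file called urls.csv; the file name is hypothetical:)
urls <- readLines("urls.csv")   # one address per line, no header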
Next we download each file and put it in a separate directory (you mentioned that in your question).
for (url in urls) {
  split_url <- strsplit(url, "/")[[1]]
  ## Extract final part of URL
  dir <- split_url[length(split_url)]
  ## Create a directory named after it
  dir.create(dir)
  ## Download the file into that directory
  download.file(url, file.path(dir, basename(url)), mode = "wb")
}
Then we loop over the directories and files and store the results in a list.
## Read in files
l <- list(); i <- 1
dirs <- list.dirs(".", recursive = FALSE)   # the directories created in the previous step
for (dir in dirs) {
  file <- list.files(dir, full.names = TRUE)
  ## Do something?
  ## Perhaps store sheets as a list
  l[[i]] <- readWorksheetFromFile(file, sheet = 2)
  i <- i + 1
}
We could of course combine steps two and three into a single loop. Or drop the loops and use sapply.
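For instance, a minimal sketch of the loop-free variant (same assumed layout of one downloaded workbook per directory, read with XLConnect's readWorksheetFromFile()):
library(XLConnect)
dirs <- list.dirs(".", recursive = FALSE)
sheets <- lapply(dirs, function(d) {
  f <- list.files(d, full.names = TRUE)[1]   # the single downloaded file in each directory
  readWorksheetFromFile(f, sheet = 2)
})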
I am a new user of R.
I have some txt.gz files on the web, each of approximate size 9x500000.
I'm trying to uncompress a file and read it straight into R with read.table().
I have used this code (url censored):
LoadData <- function() {
  con <- gzcon(url("http://"))
  raw <- textConnection(readLines(con, n = 25000))
  close(con)
  dat <- read.table(raw, skip = 2, na.strings = "99.9")
  close(raw)
  return(dat)
}
The problem is that if I read more lines with readLines, the program takes much more time to do what it should.
How can I do this in reasonable time?
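One possible simplification (a sketch; the URL is the censored placeholder from the question): read.table() accepts a connection directly, so the readLines()/textConnection() round trip can be dropped; whether this helps enough will depend on the connection speed.
LoadData <- function() {
  con <- gzcon(url("http://"))   # censored URL, as in the question
  # an unopened connection is opened and closed by read.table() itself
  dat <- read.table(con, skip = 2, na.strings = "99.9")
  return(dat)
}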
You can make a temporary file like this:
tmpfile <- tempfile(tmpdir = getwd())   # temporary file in the working directory
file.create(tmpfile)
download.file(url, tmpfile)             # url is the address of the .gz file
# do your stuff with tmpfile here
file.remove(tmpfile)                    # delete the tmpfile
Don't do this.
Each time you want to access the file, you'll have to re-download it, which is both time-consuming for you and costly for the file host.
It is better practice to download the file (see download.file) and then read in a local copy in a separate step.
You can read a gzipped text file directly via gzfile(), or, if it is a tar archive, unpack it with untar(..., compressed = "gzip").
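A minimal sketch of that two-step approach (the URL placeholder is the censored one from the question; gzfile() is base R and reads .gz files directly, so no explicit decompression step is needed):
local_copy <- "data.txt.gz"                     # keep the local copy, so no re-download next time
download.file("http://", local_copy, mode = "wb")
dat <- read.table(gzfile(local_copy), skip = 2, na.strings = "99.9")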
I have an R script that loads multiple text files from a directory and saves the data as compressed .rda. It looks like this:
#!/usr/bin/Rscript --vanilla
args <- commandArgs(TRUE)
## args[1] is the folder name
outname <- paste(args[1], ".rda", sep = "")
files <- list.files(path = args[1], pattern = ".txt", full.names = TRUE)
tmp <- list()
if (file.exists(outname)) {
  message("found ", outname)
  load(outname)
  tmp <- get(args[1])                  # previously read stuff
  files <- setdiff(files, names(tmp))
}
if (length(files) == 0) {              # setdiff() returns character(0), not NULL
  message("no new files")
} else {
  ## read the files into a list of matrices
  results <- plyr::llply(files, read.table, .progress = "text")
  names(results) <- files
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list = args[1], file = outname)
}
message("all done!")
message("all done!")
The files are quite large (typically 50 of them at 15 MB each), so running this script takes up to a few minutes, a substantial part of which is spent writing the .rda results.
I often update the directory with new data files, so I would like to append them to the previously saved and compressed data. That is what I do above, by checking whether there is already an output file with that name. The last step, saving the .rda file, is still pretty slow.
Is there a smarter way to go about this in some package, keeping track of which files have been read, and saving faster?
I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but this function is not documented so I'm not sure where it makes sense to use it.
For intermediate files that I need to read (or write) often, I use
save(..., compress = FALSE)
which speeds things up considerably.
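Applied to the script above, that is just a change to its final save() call (a sketch; saveRDS() with compress = FALSE would be an equivalent option for a single object):
## instead of: save(list = args[1], file = outname)
save(list = args[1], file = outname, compress = FALSE)
## or, for a single object, an uncompressed .rds file:
## saveRDS(get(args[1]), file = sub("\\.rda$", ".rds", outname), compress = FALSE)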