CSV to disk.frame with multiple CSVs - R

I'm getting this error when trying to import CSVs using this code:
some.df = csv_to_disk.frame(list.files("some/path"))
Error in split_every_nlines(name_in = normalizePath(file, mustWork = TRUE), :
  Expecting a single string value: [type=character; extent=3].
I found a temporary workaround with a for loop that iterated over each of the files and then rbinded all the resulting disk.frames together.
I pulled the code from the ingesting-data documentation.

This seems to be an error triggered by the bigreadr package, which disk.frame can use to split large CSVs into chunks. Do you have a reproducible example?
Alternatively, try a different chunk reader:
csv_to_disk.frame(..., chunk_reader = "data.table")
Also, if all else fails (CSV reading is hard to get right), reading the files in a loop and then appending them would work as well.
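For example, a minimal sketch of that loop-then-append approach (assuming rbindlist.disk.frame() is available in your version of disk.frame):
# hypothetical fallback: convert each CSV to its own disk.frame, then append them
files <- list.files("some/path", pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(files, csv_to_disk.frame)
some.df <- rbindlist.disk.frame(dfs)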
Perhaps you also need to tell list.files to return only CSVs, with their full paths, like
list.files("some/path", pattern = ".csv", full.names = TRUE)
Otherwise, it normally works:
library(disk.frame)
setup_disk.frame()
tmp = tempdir()
# write ten copies of nycflights13::flights out as CSVs
sapply(1:10, function(x) {
  data.table::fwrite(nycflights13::flights, file.path(tmp, sprintf("tmp%s.csv", x)))
})
some.df = csv_to_disk.frame(list.files(tmp, pattern = "*.csv", full.names = TRUE))
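If that runs, a quick sanity check on the result (the row count should be ten times the rows of flights):
nrow(some.df) # expected: 10 * nrow(nycflights13::flights)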

Related

Import Multiple CSV Files as Data Frames in R

I want to import multiple CSV files as data frames. I tried the code below, but the elements of my list are still character strings. Thanks for your help!
new_seg <- list.files(path = csv, pattern = "^new.*?\\.csv", recursive = T)
for (i in 1:length(new_seg))
  assign(new_seg[i], data.frame(read.csv(new_seg[i])))
new_seg
[1] "new_ Seg_grow_1mm.csv" "new_ Seg_grow_3mm.csv" "new_ Seg_resample.csv"
class('new_ Seg_grow_1mm.csv')
[1] "character"
You need to use full.names = T in the list.files function. Then I typically use lapply to load the files in. Also, in my code below I use pattern = "\\.csv" because that's what I needed for this to work with my files.
csv <- getwd()
new_seg <- (list.files(path=csv, pattern="\\.csv", recursive = T, full.names = T))
new_seg_dfs <- lapply(new_seg, read.csv)
Now, new_seg_dfs is a list of data frames.
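If you also want the list elements named after their source files, or everything stacked into a single data frame (assuming all the files share the same columns), something along these lines works:
names(new_seg_dfs) <- basename(new_seg) # name each element after its file
all_segs <- do.call(rbind, new_seg_dfs) # or stack them into one data frame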
P.S. It seems that you set your working directory beforehand, since your files are showing up, but it's always good practice to show every step you took in examples like these.

Issues with user function/map to read in and combine DBF files in R

I have written a function to read in a set of dbf files. Unfortunately, these files are very large, and I wouldn't want anyone to have to run them on my behalf.
readfun_dbf = function(path) {
  test = read.dbf(path, as.is = TRUE) # don't convert to factors
  test
}
dbfiles identifies the list of file names. map_dfr applies my function to the list of files and row binds them together. I've used very similar code to read in some text files, so I know the logic works.
dbfiles = list.files(pattern = "assign.dbf", full.names = F, recursive = T)
dbf_combined <- map_dfr(dbfiles, readfun_dbf)
When I run this, I get the error:
Error: Column `ASN_PCT` can't be converted from integer to character
So I ran the read.dbf command on all the files individually and noticed that some dbf files were being read in with all their fields as characters, and some were being read in with a mix of integers and characters. I figured that map_dfr needs the fields to be of the same type to bind them, so I added the mutate_all command to my function, but it's still throwing the same error.
readfun_dbf = function(path) {
  test = read.dbf(path, as.is = TRUE) # don't convert to factors
  mutate_all(test, as.character)      # <- the line I added
  test
}
Do you think the mixed field types are the issue? Or could it be something else? Any suggestions would be great!
mutate_all() returns a new data frame rather than modifying test in place, so assign the value back to the object.
readfun_dbf = function(path) {
  test = read.dbf(path, as.is = TRUE)
  test <- dplyr::mutate_all(test, as.character)
  return(test)
}
and then try :
dbf_combined <- purrr::map_dfr(dbfiles, readfun_dbf)
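Side note: mutate_all() is superseded in current dplyr, so if you prefer, the same conversion can be written with across(); a sketch (assuming read.dbf comes from the foreign package):
readfun_dbf = function(path) {
  test = foreign::read.dbf(path, as.is = TRUE) # don't convert to factors
  dplyr::mutate(test, dplyr::across(dplyr::everything(), as.character))
}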

Loop over all subdirectories and read in a file in each subdirectory

I have an output directory from dbcans with each sample's output in a subdirectory. I need to loop over each subdirectory and read into R a file called overview.csv.
for (subdir in list.dirs(recursive = FALSE)){
  data = read.csv(file.path(~\\subdir, "overview.csv"))
}
I am unsure how to deal with the changing file path in read.csv for each subdir. Any help would be appreciated.
Up front, the ~\\subdir (not as a string) is obviously problematic. Since subdir is already a string, using file.path is correct, but with just the variable. If you are concerned about relative versus absolute paths, you can always normalize them with normalizePath(list.dirs()), though that does not really change things here.
A few things to consider.
Constantly reassigning to the same variable doesn't help, so either you need to assign to an element of a list or something else (e.g., lapply, below). (I also think data as a variable name is problematic. While it works just fine "now", if you ever run part of your script without assigning to data, you will be referencing the function, resulting in possibly confusing errors such as Error in data$a : object of type 'closure' is not subsettable; since a closure is really just a function with its enclosing namespace/environment, this is just saying "you tried to do something to a function".)
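As a tiny illustration of that pitfall, in a fresh session where nothing named data has been assigned:
data$a
# Error in data$a : object of type 'closure' is not subsettable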
I think both pattern= and full.names= are useful here, switching from list.dirs to list.files, such as:
datalist <- list()
# I hope recursion doesn't go too deep here
filelist <- list.files(pattern = "overview.csv", full.names = TRUE, recursive = TRUE)
for (ind in seq_along(filelist)) {
  datalist[[ind]] <- read.csv(filelist[ind])
}
# perhaps combine into one frame
data1 <- do.call(rbind, datalist)
Reading in lots of files and doing the same thing to all of them suggests lapply. This is a slightly more compact version of number 2:
filelist <- list.files(pattern = "overview.csv", recursive = TRUE, full.names = TRUE)
datalist <- lapply(filelist, read.csv)
data1 <- do.call(rbind, datalist)
Note: if you really only need precisely one level of subdirs, you can work around that with:
filelist <- list.files(list.dirs(somepath, recursive = FALSE),
                       pattern = "overview.csv", full.names = TRUE)
or you can limit to no more than some depth, perhaps with list.dirs.depth.n from https://stackoverflow.com/a/48300309.
I think it should be this.
for (subdir in list.dirs(recursive = FALSE)){
  data = read.csv(file.path(subdir, "overview.csv"))
}

Create dataframe from list in Rproj

I have an issue that really bugs me: I've tried to switch to using an Rproj lately, because I would like to make my data and scripts available at some point. But with one of my scripts, I get an error that, I think, should not occur. Here is the tiny piece of code that gives me so much trouble; the Rproj is available at: https://github.com/fredlm/mockup.
library(readxl)
list <- list.files(path = "data", pattern = "file.*.xls") #List excel files
#Aggregate all excel files
df <- lapply(list, read_excel)
for (i in 1:length(df)){
  df[[i]] <- cbind(df[[i]], list[i])
}
df <- do.call("rbind", df)
It gives me the following error right after "df <- lapply(list, read_excel)":
Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, :
  path[1]="file_1.xls": No such file or directory
Do you know why? When I do it old school, i.e. using 'setwd' before creating 'list', everything works just fine. So it looks like lapply does not know where to look for the files when used in an Rproj, which seems very odd...
What did I miss?
Thanks :)
Thanks to a non-stackoverflower, a solution was found. It's silly, but 'list' was missing the directory part of the path, so lapply couldn't find the files to aggregate. The following works just fine:
list <- paste("data/", list.files(path = "data", pattern = "file.*.xls"), sep = "") # list Excel files
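An alternative (assuming the same folder layout) is to let list.files build the full paths for you:
list <- list.files(path = "data", pattern = "file.*.xls", full.names = TRUE) # list Excel files with their paths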

R: Exporting potentially infinite function outputs to a csv file

I have a script that takes raw CSV files in a folder, transforms the data with a method described in a function of the filename called "analyze", and spits out values to the console. When I attempt to write.csv these values, it only writes the last value produced by the function. If there were a set number of files per folder, I would just run each specific CSV file through the program, say [1:5], and lapply/set a matrix into write.csv. However, there is potentially an unlimited number of files drawn from the directory, so this will not work (I think?). How would I export a potentially unlimited number of function outputs to a CSV file? I have listed below my final steps after the function definition. It lists all the files in the folder and applies the function "analyze" to each file in the folder.
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
for (f in filename) {
  print(f)
  analyze(f)
}
Best,
Evan
It's hard to tell without a reproducible example, but I think you have to assign the output of analyze to a vector or a data frame (instead of just printing it to the console).
Something along these lines:
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
results <- vector() #empty vector
for (f in filename) {
  print(f)
  results[which(filename == f)] <- analyze(f) # assign output to the vector
}
write.csv(results, file = xxx) # write the csv file when the loop is finished
I hope this answers your question, but it really depends on the format of the output of the analyze function.
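For instance, if analyze() happens to return a one-row data frame per file (an assumption; the question doesn't say), collecting the results in a list and row-binding is a more flexible pattern than a vector:
results_list <- lapply(filename, analyze)  # one result per file
results_df <- do.call(rbind, results_list) # stack into a single data frame
write.csv(results_df, file = "results.csv", row.names = FALSE) # placeholder output filename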

Resources