I need to summarise a .csv file in R that has 12GB of data. I divided the .csv into multiple files so I can load them, but to do this I have to read each file and free it from memory before reading the next one. How can I do it? Is the rm() function inside the loop enough?
EDIT:
The solution I thought of was this:
files <- list.files(path = "/path", pattern = ".csv")
for (i in seq_along(files)) {
  temp <- read.csv(files[i], sep = ";", header = TRUE)
  dosomething(temp)
  rm(temp)
}
But I don't know if this will actually free the RAM used by each file before I load the next one.
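A minimal sketch of that loop, assuming dosomething() is your own summarising function (collecting the summaries in a list is my addition): rm() removes the binding to the large data frame, and an explicit gc() call asks R to return the freed memory before the next file is read, so the files are not all held in RAM at once.
files <- list.files(path = "/path", pattern = "\\.csv$", full.names = TRUE)
results <- vector("list", length(files))
for (i in seq_along(files)) {
  temp <- read.csv(files[i], sep = ";", header = TRUE)
  results[[i]] <- dosomething(temp)  # keep only the (small) summary
  rm(temp)                           # drop the large data frame
  gc()                               # ask R to release the memory now
}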
I am batch converting a lot of .csv files to xlsx format using write.xlsx from the openxlsx package.
I use the following to convert the list of files (there are over 200).
The reason I need to do this is for upload to a database; it will only accept xlsx files.
filenames <- list.files("C:/split files", pattern = "*.csv", full.names = TRUE)
for (i in filenames) {
  a <- read.csv(i)
  new_name <- sub('.csv', '.xlsx', i, fixed = TRUE)
  write.xlsx(a, new_name, row.names = F)
}
The problem I have is that the headers, which used to have spaces in their names (again, a required format for the database), now have "." where the spaces used to be. Is there a simple way to add to the above code and replace the "." with " "?
Try
read.csv(i, check.names = F)
You get those "." characters because R checks and converts your column names to syntactically valid ones when reading the csv file. You can preserve the original names by disabling that check.
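Applied to your conversion loop above, a sketch could look like this (the rest of the loop is unchanged):
filenames <- list.files("C:/split files", pattern = "*.csv", full.names = TRUE)
for (i in filenames) {
  a <- read.csv(i, check.names = FALSE)  # keep the original headers, spaces included
  new_name <- sub('.csv', '.xlsx', i, fixed = TRUE)
  write.xlsx(a, new_name, row.names = F)
}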
I'm getting this error when trying to import CSVs using this code:
some.df = csv_to_disk.frame(list.files("some/path"))
Error in split_every_nlines(name_in = normalizePath(file, mustWork = TRUE), : Expecting a single string value: [type=character; extent=3].
I got a temporary solution with a for loop that iterated through each of the files and then I rbinded all the disk frames together.
I pulled the code from the ingesting data doc.
This seems to be an error triggered by the bigreadr package. I wonder if you have a way to reproduce the chunks.
Or maybe try a different chunk reader:
csv_to_disk.frame(..., chunk_reader = "data.table")
Also, if all else fails (since CSV reading is hard), reading them in a loop and then appending them would work as well.
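A rough sketch of that fallback, along the lines of the workaround you describe (one disk.frame per file, then combined); this assumes the files live under some/path and uses rbindlist.disk.frame to do the combining:
library(disk.frame)
setup_disk.frame()

files <- list.files("some/path", pattern = "\\.csv$", full.names = TRUE)

# convert each CSV to its own disk.frame, then bind them into one
dfs <- lapply(files, csv_to_disk.frame)
some.df <- rbindlist.disk.frame(dfs)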
Perhaps you need to specify that only CSVs should be read, like this:
list.files("some/path", pattern=".csv", full.names=TRUE)
Otherwise, it normally works:
library(disk.frame)

# write ten sample CSVs into a temporary directory
tmp <- tempdir()
sapply(1:10, function(x) {
  data.table::fwrite(nycflights13::flights, file.path(tmp, sprintf("tmp%s.csv", x)))
})

setup_disk.frame()
some.df <- csv_to_disk.frame(list.files(tmp, pattern = "*.csv", full.names = TRUE))
I have a folder (folder 1) containing multiple csv: "x.csv", "y.csv", "z.csv"...
I want to extract the 3rd column of each file and then write new csv files in a new folder (folder 2). Hence, folder 2 must contain "x.csv", "y.csv", "z.csv"...(but with just the 3rd column).
I tried this:
dfiles <- list.files(pattern =".csv") #if you want to read all the files in working directory
lst2 <- lapply(dfiles, function(x) (read.csv(x, header=FALSE)[,3]))
But I got this error:
Error in `[.data.frame`(read.csv(x, header = FALSE), , 3) :
undefined columns selected
Moreover, I don't know how to write multiple csv.
However, if I do this with one file, it works properly, although the output ends up in the same folder:
essai <-read.csv("x.csv", header = FALSE, sep = ",")[,3]
write.csv (essai, file = "x.csv")
Any help would be appreciated.
So here's how I would do it. There may be a nicer and more efficient way, but it should still work pretty well.
setwd("~/stackexchange") #set your main folder. Best way to do this is actually the here() package. But that's another topic.
library(tools) #for file extension tinkering
folder1 <- "folder1" #your original folder
folder2 <- "folder2" #your new folder
#I setup a function and loop over it with lapply.
write_to <- function(file.name){
file.name <- paste0(tools::file_path_sans_ext(basename(file.name)), ".csv")
essai <-read.csv(paste(folder1, file.name, sep = "/"), header = FALSE, sep = ",")[,3]
write.csv(essai, file = paste(folder2, file.name, sep="/"))
}
# get file names from folder 1
dfiles <- list.files(path=folder1, pattern ="*.csv") #if you want to read all the csv files in folder1 directory
lapply(X = paste(folder1, dfiles, sep="/"), write_to)
Have fun!
Btw: if you have many files, you could use data.table::fread and data.table::fwrite, which improve csv reading/writing speed by a lot.
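For what it's worth, here is a minimal sketch of that variant under the same folder1/folder2 setup as above; write_to_dt is just an illustrative name, and fread's select argument pulls only the 3rd column while reading:
library(data.table)

write_to_dt <- function(file.name) {
  file.name <- paste0(tools::file_path_sans_ext(basename(file.name)), ".csv")
  essai <- data.table::fread(file.path(folder1, file.name), header = FALSE, select = 3)
  data.table::fwrite(essai, file.path(folder2, file.name))
}

lapply(file.path(folder1, dfiles), write_to_dt)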
First of all, from the error message it seems that some of the csv files have fewer than 3 columns. Check that you are reading the correct files and that all of them are supposed to have at least 3 columns.
Once you do that, you can use the code below to read each csv file, select the 3rd column, and write the csv file to 'folder2'.
lapply(dfiles, function(x) {
df <- read.csv(x, header = FALSE)
write.csv(subset(df, select = 3), paste0('folder2/', x), row.names = FALSE)
})
For the "write" portion of this question, I had some luck using map2() in purrr. I'm not sure this is the most elegant solution but here it goes:
listofessais # your .csv files, read in together as a named list of tibbles
map2(listofessais, names(listofessais), ~ write_csv(.x, glue("FilePath/{.y}.csv")))
That should give you all your .csv files exported in that folder, and named with the same names they were given in the list.
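For completeness, a sketch with the packages loaded; purrr's imap() passes each element and its name in one go, which does the same job here (FilePath is a placeholder for your output folder):
library(purrr)
library(readr)
library(glue)

# .x is each tibble, .y is its name in the list
imap(listofessais, ~ write_csv(.x, glue("FilePath/{.y}.csv")))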
I have a script that takes raw csv files in a folder, transforms the data with a function(filename) called "analyze", and prints values to the console. When I attempt to write.csv these values, it only gives the last value of the function. If there were a set number of files per folder, I would just run each specific csv file through the program, say [1:5], and lapply/set a matrix into write.csv. However, there is potentially an unlimited number of files drawn from the directory, so this will not work (I think?). How would I export a potentially unlimited number of function outputs to a csv file? I have listed below my final steps after the function definition. It lists all the files in the folder and applies the function "analyze" to each of them.
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
for (f in filename) {
  print(f)
  analyze(f)
}
Best,
Evan
It's hard to tell without a reproducible example, but I think you have to assign the output of analyze to a vector or a data frame (instead of spitting it out to the console).
Something along these lines:
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)
results <- vector() # empty vector
for (f in filename) {
  print(f)
  results[which(filename == f)] <- analyze(f) # assign the output to the vector
}
write.csv(results, file=xxx) # write the csv file when the loop is finished
I hope this answers your question, but it really depends on the format of the output of the analyze function.
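If analyze() returns a single value per file (an assumption on my part, since I don't know its output format), the loop can also collapse to one call:
filename <- list.files(path = "VCDATA", pattern = ".csv", full.names = TRUE)

# sapply simplifies the per-file results into a vector named by file path
results <- sapply(filename, analyze)
write.csv(results, file = "results.csv")  # output file name is just an example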
I'm trying to do some distance calculation based on the Geolife Trajectory Dataset, which is in .plt format. Currently I can read one .plt file at a time using the code below.
trajectory = read.table("C:/Users/User/Desktop/20081023025304.plt", header = FALSE, quote = "\"", skip = 6, sep = ",")
My question is: how can I read all the .plt files into R with a single command? I have tried the command below but it does not work.
file_list <- list.files("C:/Users/User/Desktop/Geolife Trajectories 1.3/Data/000/Trajecotry")
The Geolife dataset path is:
Geolife Trajectories 1.3/Data/000/Trajectory/
Inside the Data folder there are a total of 82 folders, numbered 000 to 081.
Thank you for your help.
It's very basic R. list.files lists all the files in a specified directory. read.table reads a specified file into R. You need to apply read.table to each file listed in the directory.
file_list <- list.files("C:/Users/User/Desktop/Geolife Trajectories 1.3/Data/000/Trajecotry", full=T)
file_con <- lapply(file_list, function(x){
return(read.table(x, head=F, quote = "\"", skip = 6, sep = ","))
})
file_con_df <- do.call(rbind, file_con)
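To cover all 82 numbered folders (000 to 081) in one go, list.files can search recursively; this sketch assumes every .plt file shares the same 6 header lines:
# collect every .plt file under Data/, across all the numbered folders
file_list <- list.files("C:/Users/User/Desktop/Geolife Trajectories 1.3/Data",
                        pattern = "\\.plt$", recursive = TRUE, full.names = TRUE)

file_con <- lapply(file_list, function(x) {
  read.table(x, header = FALSE, quote = "\"", skip = 6, sep = ",")
})
file_con_df <- do.call(rbind, file_con)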