I need to read ~20,000 CSV files (~500 GB total), filter the data, and bind the results together. My code works when I read only ~15,000 files, but the session crashes with 'R session aborted' when I read ~20,000 files.
library(data.table)
library(dplyr)

memory.limit(80000)  # Windows-only: raise the memory cap (size in MB)

ReadCustomer <- function(x) {
  fread(x, encoding = "UTF-8", select = c("customer_sysno", "event_cat2")) %>%
    filter(event_cat2 == "***") %>%
    select(customer_sysno) %>%
    rename(CustomerSysNo = customer_sysno) %>%
    mutate(CustomerSysNo = as.numeric(CustomerSysNo)) %>%
    filter(CustomerSysNo > 0)
}

CustomerData <- rbindlist(lapply(FileList, ReadCustomer))
I tried replacing fread(x, encoding = "UTF-8", select = c("customer_sysno", "event_cat2")) with spark_read_csv(sc, "Data", x), but sparklyr still didn't work.
How can I read all the files? Will Rcpp help?
Do you know how many rows you get back from each file? You don't say.
You're essentially posing this problem as a straightforward filtering exercise; you want only the customer_sysno column where certain conditions are met. What you then want to do with this will influence whether you even want to merge them all together.
I propose opening an output file and appending each new output to it. Then you've got a local file containing all your desired customer_sysno values. You can then walk through or sample that as suits your use case.
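A minimal sketch of that approach, reusing the fread() call from the question (the output file name and the data.table-style filter are illustrative):

library(data.table)

out_file <- "customer_sysno_filtered.csv"

for (f in FileList) {
  dt <- fread(f, encoding = "UTF-8",
              select = c("customer_sysno", "event_cat2"))
  # keep only the positive customer IDs for the matching category
  dt <- dt[event_cat2 == "***" & as.numeric(customer_sysno) > 0,
           .(CustomerSysNo = as.numeric(customer_sysno))]
  # append each file's result so nothing accumulates in RAM;
  # append = FALSE on the first pass writes the header once
  fwrite(dt, out_file, append = file.exists(out_file))
  rm(dt); gc()
}

You can then fread() the single output file, or sample from it, as your use case requires.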
If the rows where your event_cat2 condition is met are actually a small subset of each file, and each file is big, then another approach would be to readLines your way through them, maybe in conjunction with appending results to an output file. This is basically asking R to do the kind of job that (g)awk excels at, so awk might be a useful preprocessing step to get you the desired data, as in the sketch below.
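If you want to try the awk route from R, fread() can read from a shell command via its cmd argument. A hedged sketch, assuming customer_sysno and event_cat2 are the first and second columns and that no field contains an embedded comma (the file name is a placeholder):

library(data.table)

# let awk filter the rows before the data ever reaches R
one_file <- fread(cmd = paste0(
  "awk -F',' '$2 == \"***\" { print $1 }' ", shQuote("some_file.csv")
))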
I have an issue where I'm reading in large (500+ MB) CSV files and then want to verify that all data has been read in correctly. To do so, I have been comparing the length() of readLines() with the nrow() of read.csv2.
Here is my R code:
df <- readFileFromServer(HOST, KEY,
                         paste0(SERVER_PATH, SERVER_FOLDER),
                         FILENAME,
                         FUN = read.csv2,
                         sep = ";",
                         quote = "", encoding = "UTF-8", skipNul = TRUE)

df_check <- readFileFromServer(HOST, KEY,
                               paste0(SERVER_PATH, SERVER_FOLDER),
                               FILENAME,
                               FUN = readLines, skipNul = TRUE)
Then I verify that all data was loaded, by checking:
if(nrow(df) != (length(df_check) - dif)){
stop("some error msg")
}
dif is set to 1 to account for the header row in the CSV files.
This check has been working as intended up until now, but it fails for one particular CSV file, and I cannot fully understand why.
The CSV file that fails the check has "NULL" in the data, which I believe readLines interprets as a delimiter, producing an extra line and making the check fail, but I'm really not sure.
I tried passing different parameters to my read functions, but the issue persists.
I expect readLines() and read.csv2() to agree, i.e. length() - 1 should equal nrow(), as shown in my code snippet.
This is not a proper answer, but it was too long for a comment. This would be my debug strategy here.
Pick a file that fails. Slurp it with readLines.
Save the file locally using writeLines.
Your first job is to make sure that the check also fails when the file is loaded from disk. My first thought would be that the two transfers, the first time you ran readFileFromServer and the second, were not precisely identical.
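A minimal sketch of those first two steps, reusing the readFileFromServer() call from the question (local_copy.csv is a placeholder):

raw <- readFileFromServer(HOST, KEY,
                          paste0(SERVER_PATH, SERVER_FOLDER),
                          FILENAME,
                          FUN = readLines, skipNul = TRUE)
writeLines(raw, "local_copy.csv")

# repeat the original check against the local copy
df  <- read.csv2("local_copy.csv", sep = ";", quote = "",
                 encoding = "UTF-8", skipNul = TRUE)
chk <- readLines("local_copy.csv", skipNul = TRUE)
nrow(df) == length(chk) - 1  # FALSE means the check also fails locally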
Now, if the problem persists for the given file when you read it locally with read.csv2 (a different number of rows than lines in the readLines output), your job becomes much easier (and probably faster) to solve.
First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.
Since a CSV file is just comma-separated text, you can compare each line read from the CSV file with readLines against the line as it should be, based on the table you have read using read.csv. How to do this depends on what your original CSV file looks like (whether you need to insert quotes, etc.). Basically, you need to figure out a way of restoring the lines of the CSV file from the data in your data frame, and then look for the first line that differs.
Here is some code to give you an idea what I mean:
## first, prepare data – for this example only!
f <- file("test.csv", "w")
writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
close(f)
## actual test
## first, read the file with readLines
f <- file("test.csv", "r")
rl <- readLines(f)
close(f)
## then, read it with read.csv
csv <- read.csv("test.csv")
## third, prepare the lines as they should look based on the CSV
rl_sim <- do.call(paste, c(csv, sep=","))
## find the first mismatch
for (i in seq_along(rl_sim)) {
  if (rl_sim[i] != rl[i + 1]) {  # rl[1] is the header, hence the offset
    message("Problems start at line ", i, "\n", rl_sim[i], "\n", rl[i + 1])
    break
  }
}
I have a list of 15 txt files in a folder that are each 1.5 - 2 GB. Is there a way to query each file for specific rows based on a condition and then load the results into a list of data frames in R? Currently, I am loading a few files in at once using
temp <- list.files(pattern = "*.txt")
data_list <- lapply(temp, read.delim)
names(data_list) <- temp
and then applying a custom filter function to each of the data frames in the list using lapply.
Due to RAM limitations, I cannot load entire files into my R environment and then query. I'm looking for some code to perhaps automatically read in one file, perform the query, add the data frame result to a list, free up the memory, and then repeat. Thank you!
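A minimal sketch of that loop, sticking with read.delim() and borrowing the COL1 == "ABC" condition from the edit below:

library(dplyr)

temp <- list.files(pattern = "*.txt")

filtered_list <- lapply(temp, function(f) {
  df <- read.delim(f)   # only one full file in memory at a time
  res <- filter(df, COL1 == "ABC")
  rm(df); gc()          # release the full table before the next file
  res
})
names(filtered_list) <- temp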
Edit:
It seems like I should use vroom instead of read.delim:
temp <- list.files(pattern = "*.txt")
data_list <- lapply(temp, vroom)
names(data_list) <- temp
I get a few warning messages, and when I run problems() I get:
Error in vroom_materialize(x, replace = FALSE) : argument "x" is missing, with no default
Is this an issue?
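That error likely just means problems() was called without an argument; it expects the vroom result itself. A minimal check, assuming the vroom output is stored in data_list:

probs <- problems(data_list[[1]])  # or lapply(data_list, problems)
probs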
Each of the files has a different number of columns, so do I need to use map as described here?
Lastly, on each of the data frames in the list, I would like to run
filtered_list = lapply(data_list, filter, COL1 == "ABC")
Does doing so essentially read in each of the files, negating the benefits of vroom? When I run this lapply, R takes a very long time.
I have a number of large data files (.csv) on my local drive that I need to read in R, filter rows/columns, and then combine. Each file has about 33,000 rows and 575 columns.
I read this post: Quickly reading very large tables as dataframes and decided to use "sqldf".
This is the short version of my code:
Housing <- file("file location on my disk")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing', file.format = list(eol = "/n"))  # I am using Windows
I see "Housing_filtered" data.frame is created with Var1, but zero observations. This is my very first experience with sqldf. I am not sure why zero observations are returned.
I also used "read.csv.sql" and still I see zero observations.
Housing_filtered <- read.csv.sql(file = "file location on my disk",
sql = "select Var01 from file",
eol = "/n",
header = TRUE, sep = ",")
You never really imported the file as a data.frame like you think.
You've opened a connection to a file. You mentioned that it is a CSV. Your code should look something like this if it is a normal CSV file:
Housing <- read.csv("my_file.csv")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing')
If there's something non-standard about this CSV file please mention what it is and how it was created.
Also, to another point made in the comments: if you do for some reason need to manually specify the line breaks, use \n where you were using /n. That change alone is not causing the error; rather, you're getting past one problem and on to another, probably due to improperly handled missing data, spaces, or commas in text fields.
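For what it's worth, a sketch of the read.csv.sql() call with the corrected eol, assuming the same file and column as in the question:

library(sqldf)

Housing_filtered <- read.csv.sql(file = "file location on my disk",
                                 sql = "select Var01 from file",
                                 eol = "\n",
                                 header = TRUE, sep = ",")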
If there are still data errors can you please use R code to create a small file that is reflective of the relevant characteristics of your data and which produces the same error when you import it? This may help.
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " on the end of the Central-Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files. ie.
test <- read.csv("pollresults_resultatsbureau11001.csv", header = TRUE, sep = ",",
                 fileEncoding = "latin1", as.is = TRUE, quote = "")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into two variables. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to read this data in? I have >300 files that I need to load (each with ~1,000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  # insert the missing opening quote after the 5-digit code
  # (T0[-1] skips the header line)
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  final <- read.csv(text = T0, header = TRUE)
  final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly for @AnandaMahto (see the comments to the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma plus a quotation mark. This works because the first field is always 5 characters long: the regex captures those 5 digits plus the comma that follows them, then inserts the missing opening quote. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
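To see the effect, here is that substitution applied to a shortened sample line from the question:

x <- '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299'
gsub('^(.{6})(.*)$', '\\1\\"\\2', x)
## [1] "12002,\"Central Nova\",\"Nova-Centre\",\" 1\",\"River John\",N,N,\"\",1,299"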
Penultimately, write over the original file.
fileConn <- file("pollresults_resultatsbureau13001.csv")
writeLines(temp, fileConn)
close(fileConn)
Finally, simply read the data back into R:
data <- read.csv(file = "pollresults_resultatsbureau13001.csv", header = TRUE, sep = ",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
I have a data frame object with a large number of rows and columns. I wish to write it to a file, so I do this:
> write.table(object, file="file.txt")
But for some reason this gives me an empty file. I thought it might be because write.table cannot handle such large data (800 columns and 450,000 rows), so I tried the following.
> write.table(object[1:4,1:5], file="file.txt")
But I still get an empty file. I checked my object; it does contain all the data I need.
Can anyone help me understand why I might be getting an empty file? Is there any other way to get my object's data into a file?
I am sorry for the trouble, but I just realised what the problem was. I was working with R through a server, and it was running out of memory for my data. So I deleted a few files and ran the write.table command again, and now it works fine. Thank you for your help though! :)
I am not sure, but you can try converting your list into a data frame. Then you can write the data frame out as a CSV file:
df_last<-as.data.frame(do.call(rbind, object))
write.table(df_last, file = "foo.csv", sep = ",")
Try this:
object <- data.frame(a = I("a \" quote"), b = pi)
write.table(object, file = "foo.csv", sep = ",", col.names = NA,
qmethod = "double")
Do you get a foo.csv file created?