I'm trying to work on a very large dataset in R. It's currently saved as a CSV and I'm using read.csv to import it. Unfortunately, one of the fields is Address, which naturally contains commas, and R is reading these as separators. One positive: all of the commas within the addresses are followed by spaces.
So here's the question. Is there any way of telling read.csv that ", " is not a separator but "," is?
If not, is there any way I can import a CSV into R as one long string and then do a find and replace? Or, as a last resort, could you point me in the direction of a decent text editor that I can get away with installing on my work laptop?
Thanks, James
My first attempt would be to get the file recreated using either quotes or a | separator.
If that weren't an option, I would then try to use readLines and gsub,
assuming I can read all of those character strings into memory.
library(dplyr)
File <-
paste("Name,Address,Occupation,Hair Color",
"Mary,123 Lamb St, Ruralsville, KY,Shepherd,Blonde",
"Jim,17 Elm St., Urban Center, TN,Butler,Brown",
sep = "\n")
tmp <- tempfile(fileext = ".csv")  # temporary file to hold the example data
write(File, tmp)
DF <- readLines(tmp) %>%
gsub(", ", "_;_", .) %>%
textConnection() %>%
read.csv() %>%
mutate(Address = gsub("_;_", ", ", Address))
unlink(tmp)
If the file were too large to read into memory, I would likely attempt to write a loop
that reads 100,000 lines at a time, performs the above code on each segment, and
writes it to a new CSV. The new CSV will be properly quoted from R and should read in
fairly well.
(I haven't tried to write this loop yet. I'm hoping for your sake it doesn't come to that.)
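For what it's worth, here is a rough sketch of what that loop could look like. It is untested against real data; fix_address_csv() is a hypothetical helper name, and it assumes the header line itself contains no ", " and that the column is literally called Address.

```r
# Hypothetical helper (not part of the answer above): rewrite a CSV whose
# Address field contains unquoted ", " into a properly quoted CSV, chunk by chunk.
fix_address_csv <- function(infile, outfile, chunk_size = 100000L) {
  con <- file(infile, open = "r")
  on.exit(close(con))
  # Assumption: the header has no embedded commas, so a plain split is safe
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  first <- TRUE
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Hide the in-address commas behind a placeholder, parse, then restore them
    lines <- gsub(", ", "_;_", lines, fixed = TRUE)
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      check.names = FALSE, colClasses = "character")
    chunk$Address <- gsub("_;_", ", ", chunk$Address, fixed = TRUE)
    # Write the header only with the first chunk, append afterwards
    write.table(chunk, outfile, sep = ",", quote = TRUE, qmethod = "double",
                row.names = FALSE, col.names = first, append = !first)
    first <- FALSE
  }
  invisible(outfile)
}

# Usage with the small example from above
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeLines(c("Name,Address,Occupation,Hair Color",
             "Mary,123 Lamb St, Ruralsville, KY,Shepherd,Blonde",
             "Jim,17 Elm St., Urban Center, TN,Butler,Brown"), src)
fix_address_csv(src, dst, chunk_size = 1L)
fixed <- read.csv(dst, check.names = FALSE, stringsAsFactors = FALSE)
```

Reading everything as character keeps the chunks consistent with each other; column types can be converted once the corrected file is read back in.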
I have an issue where I'm reading in big (500+ MB) CSV files and then want to verify that all data has been read in correctly. To do so, I have been comparing length() of readLines() with nrow() of read.csv2().
The following is my R-code:
df <- readFileFromServer(HOST, KEY,
paste0(SERVER_PATH, SERVER_FOLDER),
FILENAME,
FUN = read.csv2,
sep = ";",
quote = "", encoding = "UTF-8", skipNul = TRUE)
df_check <- readFileFromServer(HOST, KEY,
paste0(SERVER_PATH, SERVER_FOLDER),
FILENAME,
FUN = readLines, skipNul = TRUE)
Then I verify that all data was loaded, by checking:
if(nrow(df) != (length(df_check) - dif)){
stop("some error msg")
}
dif is set to 1, to account for header in the CSV-files.
This check is the part that fails for a given CSV-file.
This has worked as intended until now, but the check has started to fail and I cannot fully understand why.
The one CSV file that fails the check has "NULL" in the data, which I believe readLines interprets as a delimiter, thus producing an extra line and failing the check, but I'm really not sure.
I tried passing different parameters to my read functions, but the issue persists.
I expect length() of the readLines() result minus 1 to equal nrow() of the read.csv2() result, as shown in my code snippet.
This is not a proper answer, but it was too long for a comment. This would be my debug strategy here.
Pick a file that fails. Slurp it with readLines.
Save the file locally using writeLines.
Your first job is to make sure that the check also fails when the file
is loaded from disk. My first thought would be that the two transfers (the first and second calls to readFileFromServer) were not precisely identical.
If the problem persists for the given file when you read it locally with read.csv (a different number of rows than the number of lines in the readLines output), it becomes much easier (and probably faster) to solve.
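A minimal sketch of that first step, using a few made-up lines in place of the real readLines result (raw_lines and local_copy are placeholder names, and the read.csv2 arguments are copied from the question):

```r
# Stand-in for the lines slurped from the server with readLines
raw_lines <- c("id;name;value", "1;alpha;10", "2;beta;20")

# Save them locally so both readers see exactly the same bytes
local_copy <- tempfile(fileext = ".csv")
writeLines(raw_lines, local_copy)

# Re-run both reads against the local copy
df_local    <- read.csv2(local_copy, sep = ";", quote = "",
                         encoding = "UTF-8", skipNul = TRUE)
lines_local <- readLines(local_copy, skipNul = TRUE)

# Same check as in the question, with 1 subtracted for the header
check_ok <- nrow(df_local) == length(lines_local) - 1
```

If check_ok is TRUE for the local copy of a file that failed when read from the server, the transfer is the more likely culprit than the parsing.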
First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.
Since a CSV file is just comma-separated text, you can compare each line read with readLines against the line as it should look based on the table you read with read.csv. How to do this depends on what your original CSV file looks like (whether you need to insert quotes, etc.). Basically, you need to find a way of reconstructing the lines of the CSV file from the data in your data frame, and then look for the first line that differs.
Here is some code to give you an idea what I mean:
## first, prepare data – for this example only!
f <- file("test.csv", "w")
writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
close(f)
## actual test
## first, read the file with readLines
f <- file("test.csv", "r")
rl <- readLines(f)
close(f)
## then, read it with read.csv
csv <- read.csv("test.csv")
## third, prepare the lines as they should look based on the CSV
rl_sim <- do.call(paste, c(csv, sep=","))
## find the first mismatch
for(i in seq_along(rl_sim)) {
  ## rl[i + 1] skips the header line
  if(rl_sim[i] != rl[i + 1]) {
    message("Problems start at data row ", i,
            "\nexpected: ", rl_sim[i], "\nfound:    ", rl[i + 1])
    break
  }
}
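A cheaper first pass, sketched here on made-up data, is to count the separators on every raw line and flag the lines that differ from the header. This assumes ";" never appears inside a field, which matches reading with quote = "". Note also that read.csv2 skips blank lines by default (blank.lines.skip = TRUE) while readLines keeps them, so a blank line alone is enough to make the two counts disagree.

```r
# Made-up sample: one blank line and one short line
path <- tempfile(fileext = ".csv")
writeLines(c("id;name;value", "1;alpha;10", "", "2;beta", "3;gamma;30"), path)

lines <- readLines(path)

# Number of ";" on each physical line (0 for lines without any)
n_sep <- lengths(regmatches(lines, gregexpr(";", lines, fixed = TRUE)))

# Lines whose separator count differs from the header's
suspect <- which(n_sep != n_sep[1])

# For comparison: the blank line is dropped here, so the counts disagree
df <- read.csv2(path, quote = "")
mismatch <- nrow(df) != length(lines) - 1
```

Inspecting lines[suspect] usually shows quickly whether the culprit is a blank line, a truncated record, or a stray separator.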
I export my CSV file with python, numbers are wrapped as ="10000000000" in cells, for example:
name,price
"something expensive",="10000000000",
To display the number correctly, I wrap large numbers or strings of digits (such as an order ID) in this format, so that someone can open the file directly without reformatting the column.
This displays correctly in Excel or Numbers, but when I import the file into R with read.csv, the cell values show up as =10000000000.
Is there any solution to this?
Thank you
How about:
yourcsv <- read.csv("yourcsv.csv")
# strip the "=" from the price column only, keeping the rest of the data frame
yourcsv$price <- gsub("=", "", yourcsv$price, fixed = TRUE)
Also, in my experience read_csv() from the readr package (part of the tidyverse) reads data much faster than read.csv(), and I think it also has more logic built in for non-ideal cases, so it may be worth trying.
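Another option, sketched here on a made-up two-line sample (without the trailing comma from the question), is to clean the raw text before parsing: turn every ="..." into a plain quoted field and read the column as character so that long IDs are not converted to numbers. The regular expression assumes that =" never occurs inside an ordinary text field.

```r
# Hypothetical sample mirroring the layout in the question
raw <- c('name,price',
         '"something expensive",="10000000000"')

# Rewrite ="..." as "..." so the parser sees an ordinary quoted field
clean <- gsub('="([^"]*)"', '"\\1"', raw)

# colClasses = "character" keeps the digits exactly as written
orders <- read.csv(text = clean, colClasses = "character")
orders$price
```

For a real file, raw would come from readLines("yourcsv.csv") instead of the inline vector.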
I have run into some problems while importing a pipe delimited file. The file consistently delimits but something is getting in the way of R reading some of the delimiters while parsing. R reads in 10 columns when there should be 11, even though the appropriate number of pipes are in place.
A very small sample of the data can be found here: https://drive.google.com/file/d/1ek6-H5EWKCaPfDTfB2muqYBjJz1fM3pf/view
dat <- read_delim("~/Desktop/foo.txt", delim = "|", col_names = TRUE)
I've tried playing around with how R treats the quotes... quote = "\"" did nothing to help, and ignoring the quotes with quote = "" made an even bigger mess of the import.
Any thoughts on how to fix the problem?
You could use fread() from the data.table package, as below.
library(data.table)
FOO3 <- fread("~/Downloads/foo.txt", sep = "|", fill = TRUE)
Hi, I'm trying to import data from the URL https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data but it always imports it as a single line. I split the data by "\t" but it's still not working. My R code:
bostonHousing <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data",
col.names= c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"),
dec=",",sep = "\t")
The file isn't tab-separated, it's whitespace-separated. By default, read.table assumes columns are separated by one or more whitespace characters (tab or space). Specifying tab-delimiters (or using read.delim()) is only really necessary when columns are tab-delimited and the data columns may contain embedded spaces ...
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
bostonHousing <- read.table(url)
This seems to work fine. (dec="," is also a bad idea: the file uses "." as the decimal mark.)
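To see the difference without hitting the network, here is a small sketch with two illustrative rows laid out like the file (space-padded, no tabs). The column names are the ones from the question.

```r
cols <- c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX",
          "PTRATIO","B","LSTAT","MEDV")

# Two illustrative rows in the file's whitespace-padded layout
sample_lines <- c(
  " 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00",
  " 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60"
)

# With sep = "\t" there is nothing to split on: each row is one big field
as_tab <- read.table(text = sample_lines, sep = "\t")

# With the default sep = "" any run of spaces or tabs separates the columns
as_ws <- read.table(text = sample_lines, col.names = cols)

ncol(as_tab)
ncol(as_ws)
```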
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of Central Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split it into two variables. The data can't be read in, since R tries to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to read this data in? I have >300 files to load (each with ~1000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|\\.csv$", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
T0 <- readLines(temp[x])
T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
final <- read.csv(text = T0, header = TRUE)
final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly addressed to @AnandaMahto (see the comments on the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma+quotation. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
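As a quick sanity check of this substitution, here is a minimal sketch applied to the problem line from the question, held in a string rather than a file. It uses sub() with a simplified pattern that I am assuming is equivalent for this layout (a five-character first field), and header = FALSE because the string has no header row.

```r
# The problem line from the question, with the opening quote missing
bad_line <- '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'

# Insert the missing quote right after the first six characters ("12002,")
fixed_line <- sub('^(.{6})', '\\1"', bad_line)

# With balanced quotes, the embedded comma stays inside its field
parsed <- read.csv(text = fixed_line, header = FALSE, stringsAsFactors = FALSE)

ncol(parsed)
parsed$V5
```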
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.