R: respect quotes around numbers (treat as character) with read.csv()?

I have a .csv file with account codes in the form of 00xxxxx and I need them to stay that way for use with other programs which use the account codes in this format. I was just working on an R script to reconcile account charges on Friday and swore that as.is = T was working for me. Now, it doesn't seem to be. Here's some example data:
test <- data.frame(col1 = c("apple", "banana", "carrot"),
                   col2 = c(100, 200, 300),
                   col3 = c("00234", "00345", "00456"))
My write.table strategy:
write.table(test, file = "C:/path/test.csv", quote = T,
            sep = ",", row.names = F)
Remove the old data.frame and re-read:
rm(test)
test <- read.csv("C:/path/test.csv")
test
    col1 col2 col3
1  apple  100  234
2 banana  200  345
3 carrot  300  456
In case it's not clear, it should look like the original data.frame we created:
test
    col1 col2  col3
1  apple  100 00234
2 banana  200 00345
3 carrot  300 00456
I also tried the following, after perusing the available read.table options, with the results all the same as above:
test <- read.csv("C:/path/test.csv", quote = '"')
test <- read.csv("C:/path/test.csv", as.is = T)
test <- read.csv("C:/path/test.csv", as.is = T, quote = '"')
stringsAsFactors didn't seem to be relevant in this case (and it sounds like as.is does the same thing anyway).
When I open the file in Emacs, col3 is, indeed, surrounded by quotes, so I'd expect it to be treated like text instead of converted to a number.
Most of the other questions are simply about not treating things like factors, or getting numbers not to be recognized as characters, usually the result of an overlooked character string in that column.
I see I can pursue the colClasses argument from questions like this one, but I'd prefer not to; my "colClasses" are built into the data :) Quoted = character, not quoted = numeric.

After pinging a couple of friends who are R users, they both suggested using colClasses. I was relieved to find that I didn't need to specify each class, since my data is ~25 columns. SO confirmed this (once I knew what I was looking for) in another question.
test <- read.csv("C:/path/test.csv", colClasses = c(col3 = "character"))
test
    col1 col2  col3
1  apple  100 00234
2 banana  200 00345
3 carrot  300 00456
As it currently stands, the question is a duplicate of the other with respect to the solution. The difference is that I was looking for ways other than colClasses (since as.is sounds like such a likely candidate), while that question was about how to use colClasses.
I'll reiterate that I don't actually like this solution, even though it's pretty simple. Quotes denote text fields in a .csv, and they don't seem to be respected in this case. The LibreOffice .csv import has a checkbox for "Treat quoted fields as text," which I'd think is analogous to as.is = T in R. Obviously not! #end_rant
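For what it's worth, the mechanism appears to be that read.csv strips the quotes while tokenizing and then hands the bare strings to type.convert(), which happily parses "00234" as the number 234; as.is only controls the character-versus-factor decision, not character-versus-numeric. A minimal sketch of that behavior:
# type.convert() is what read.table()/read.csv() use to guess column types
type.convert("00234", as.is = TRUE)
# [1] 234
type.convert(c("00234", "x1"), as.is = TRUE)
# [1] "00234" "x1"   # one non-numeric value keeps the whole column character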

I have this issue too. Of course you can manually specify colClasses, but why is this necessary when data is quoted? I agree with the OP's 'rant' in the answer posted to his own question:
Quotes denote text fields in a .csv, and they don't seem to be respected in this case.
Anyway, I elected to use data.table's fread() which doesn't have this issue. Still annoying behaviour for read.csv though.
# here's a data frame with chr and int columns
my_df <- data.frame(chars = letters[1:5],
                    nums = 1:5,
                    txt_nums = sprintf('%02d', 1:5),
                    stringsAsFactors = FALSE)
# all looks as it should
lapply(my_df, class)
# $chars
# [1] "character"
#
# $nums
# [1] "integer"
#
# $txt_nums
# [1] "character"
But now, write to csv, read it back in, and the third column is coerced to int!
# quote=T redundant since that's the default, but just to be sure
write.csv(my_df, 'my_df.csv', row.names=F, quote=T)
my_df2 <- read.csv('my_df.csv')
lapply(my_df2, class)
# even with as.is=TRUE, same issue
my_df2 <- read.csv('my_df.csv', as.is=T)
lapply(my_df2, class)
# data.table's fread doesn't have this issue, at least
library(data.table)
my_dt <- fread('my_df.csv')
lapply(my_dt, class)

I expect there's a better method, but one option would be to use quote=""
test <- read.csv("C:/path/test.csv", as.is = TRUE, quote = "")
This would make the quotes part of the values, giving you:
test
#    col1 col2    col3
#1  apple  100 "00234"
#2 banana  200 "00345"
#3 carrot  300 "00456"
You could then either keep them in that format, or use something like gsub to remove them:
test$col3 <- gsub('"', '', test$col3)
test
#    col1 col2  col3
#1  apple  100 00234
#2 banana  200 00345
#3 carrot  300 00456
You can use some kind of apply-type function to do the gsub on the whole data frame at once:
test <- as.data.frame(sapply(test, gsub, pattern = '"', replacement = ""))
sapply code taken from: R - how to replace parts of variable strings within data frame
Obviously, this method will only be useful to you if you don't need the quotes elsewhere for other reasons.

The popular "readr" package also respects the quotes in .csv files.
test <- read_csv("C:/path/test.csv")
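And if read_csv's type guessing ever disagrees with you, you can pin the column type explicitly; a readr sketch of the colClasses-style fix, reusing the col3 name from the example above:
test <- read_csv("C:/path/test.csv",
                 col_types = cols(col3 = col_character()))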
I couldn't agree more that the base R read.csv() behavior is unacceptable.

Related

Using R Separate_Rows doesn't work with a "|"

I have a CSV file with a column containing a variable-length list of items separated by a |.
I use the code below:
violations <- inspections %>% head(100) %>%
  select(`Inspection ID`, Violations) %>%
  separate_rows(Violations, sep = "|")
but this only creates a new row for each character in the field (including spaces).
What am I missing here on how to separate this column?
It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing, though I think it has the same behavior. In regular expressions, | has special meaning, so it must be escaped as \\|.
library(tidyr)
library(tibble)
df <- tibble(
  Inspection_ID = c(1, 2, 3),
  Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
  Inspection_ID Violations
          <dbl> <chr>
1             1 A
2             2 A
3             2 B
4             3 A
5             3 B
6             3 C
Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
Using sep = '\\|' with the separate_rows function allowed me to separate pipe-delimited values.

R-Studio csv import separated by commas, but containing semicolons in some strings

I am trying to import a csv file into R-Studio. The columns are separated by commas, but one column contains a string, and that string sometimes consists only of letters and sometimes contains semicolons (like "abcdefg33;asbfsk2ala;shcjd22l"). In either case the string should not be split; the semicolons are not separators.
What happens is that for the lines where this column contains semicolons, nothing is separated at all.
The other lines work fine.
The result looks like this:
Column1 Column2 Column3
a            12   abc12
b           222 bbbb222
c,333,abcdefg33;asbfsk2ala;shcjd22l
d           282 ddbb232
To import the data I tried the following, but in both cases I get the result above.
data <- read.csv("Test.csv")
and
data <- read.csv("Test.csv", sep = ",", strip.white = TRUE)
Does anybody know how I can fix it?
Thank you!
I can simulate your result only if I explicitly add the double quotes in the csv file (e.g. with Notepad++):
a,12,1bc12
b,222,bbbb222
"c,333,abcdefg33;asbfsk2ala;shcjd22l"
d,282,ddbb232
In this case the resulting data frame looks like yours:
> data
                                   V1  V2      V3
1                                   a  12   1bc12
2                                   b 222 bbbb222
3 c,333,abcdefg33;asbfsk2ala;shcjd22l  NA
4                                   d 282 ddbb232
My suggestion would be to ensure that your csv file does not contain the quotes.
Otherwise, you could use readLines to read the object line by line and then use e.g. regex to get rid of the quotes.
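A minimal sketch of that readLines idea, assuming the double quotes never protect real separators and can simply be deleted before re-parsing (header = FALSE matches the V1/V2/V3 names above):
lines <- readLines("Test.csv")
lines <- gsub('"', '', lines)  # drop every double quote
data <- read.csv(text = lines, header = FALSE, strip.white = TRUE)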
fread from data.table may help you:
library(data.table)
data4 <- fread("data_62871591.csv", sep = ",", quote = "")
Reads this file as follows:
> data4
   V1  V2      V3
1:  a  12   1bc12
2:  b 222 bbbb222
3: "c 333 abcdefg33;asbfsk2ala;shcjd22l"
4:  d 282 ddbb232
And as you can see there is still some post processing required to get rid of the quotes on row 3, columns V1 and V3.
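A hedged way to finish that post-processing, assuming the stray double quotes can simply be deleted from the character columns:
char_cols <- names(data4)[vapply(data4, is.character, logical(1))]
data4[, (char_cols) := lapply(.SD, function(x) gsub('"', '', x)),
      .SDcols = char_cols]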

Define row and column separators for a data frame

I imported a dataset that unfortunately did not have any separators defined, neither for columns nor for rows. I have looked for an option to define a specific row separator but could not find one applicable to this situation.
df1 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
df2 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
newdf <- rbind(df1,df2)
This is a model of the data that I am currently struggling with. Ideally, the row separators in this case would have to be defined as "},{" and the column separators as ",". I tried substituting this pattern with a tab and defining a different separator, but this either returned an error (tried with separate_rows from tidyr) or simply did nothing.
Hope you guys can help
This looks like incomplete (incorrect) JSON, so I suggest you bring it up-to-spec and then parse it with known tools. Some problems, easily mitigated:
sqk should have a comma-separator, perhaps a copy/paste issue. This might be generalized as any "number-letter" progression depending on your process. (Edit: your update seems to have resolved this issue, so I'll remove it. If you still need it, I recommend you go with a very literal gsub("([^,])sqk:", "\\1,sqk:", s).)
Labels (e.g., lat, alt, sqk) should all be double-quoted.
Non-numeric data needs to be quoted, specifically the dates.
Exception to 3: null should remain unquoted.
There are multiple "dicts" that need to be within a "list", i.e., from {...},{...} to [{...},{...}].
Side note with your data: I read them in with stringsAsFactors=FALSE, since we don't need factors.
fixjson <- function(s) {
  gsub(",+", ",",
       paste(
         gsub('"sqk":([^,]+)', '"sqk":"\\1"',
              gsub("\\s*\\b([A-Za-z]+)\\s*(?=:)", '"\\1"',  # note 2
                   gsub('(?<=:)"(-?[0-9.]+|null)"', "\\1",  # notes 3, 4
                        gsub("(?<=:)([^,]+)\\b", "\"\\1\"", # quote all data
                             s, perl = TRUE), perl = TRUE), perl = TRUE)),
         collapse = ","))
}
fixjson(df1$V1)
# [1] "{\"lat\":45.493,\"lng\":-76.4886,\"alt\":22400,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":328,\"spd\":null,\"postime\":\"2019-01-15 16:10:39\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":23425,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"}"
From here, we use a well-defined json parser (from either jsonlite or RJSONIO, both use similar APIs):
jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]", sep=""))
#       lat      lng   alt  call  icao registration  sqk trak spd             postime
# 1 45.4930 -76.4886 22400 COFPQ C056P       X-VLMP 6232  328  NA 2019-01-15 16:10:39
# 2 45.5049 -76.5285 23425 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50
# 3 45.5049 -76.5285 24000 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50
# 4 45.5049 -76.5285 24000 COFPQ C056P       X-VLMP 6232  321  NA 2019-01-15 16:11:50
From here, rbind as needed. (Note that the null literal was translated into R's NA, which is "as it should be" in my opinion.)
Follow-on suggestion: you can use as.POSIXct directly on your postime column; I hope you are certain all of your data are in the same timezone since the field contains no hint.
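For example (assuming the timestamps are UTC; substitute whatever timezone actually applies):
out <- jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]"))
out$postime <- as.POSIXct(out$postime, tz = "UTC")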
Lastly, you mentioned something about non-ASCII characters gumming up the works. My recent edit included a little added robustness for spaces introduced from the use of iconv (e.g., the use of \\s*), so the following might suffice for you:
jsonlite::fromJSON( paste("[", fixjson(iconv(df2$V1, "latin1", "ASCII", sub="")), "]") )
(Use of iconv suggested by https://stackoverflow.com/a/9935242/3358272)

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
  mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
  mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
  mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If I understood you correctly, you want to remove all the white space from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:
apply(myData, 2, function(x) gsub('\\s+', '', x))
Hope this works. Note, however, that this returns a matrix; if you want to change it back to a data frame, do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT in 2020:
Using lapply with the trimws function (its default is which = "both") removes leading and trailing spaces but not spaces inside a string. Since the OP provided no input data, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
                 val1 = c("klm ", "gdfs", "123"),
                 num = 1:3,
                 num1 = 2:4,
                 stringsAsFactors = FALSE)
# Situation 1 (base R): when we want to remove spaces only at the leading and trailing ends, NOT inside the string values, we can use trimws.
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# Situation 2 (base R): when we want to remove spaces everywhere in the data frame's character columns (inside the strings as well as at the leading and trailing ends).
(This was the initial solution proposed using apply; note that an apply-based solution seems to work but would be very slow. It is also not entirely clear from the question whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation 1:
    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation 2:
    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4
Note the difference between the two outputs in row 2: with trimws we remove only the leading and trailing blanks, but with the regex solution we remove every blank. I hope this helps, thanks!
One possibility involving just dplyr could be:
data %>%
  mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
  mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
  mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
  mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
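A quick readxl sketch (hypothetical file name; trim_ws = TRUE is the default and is spelled out only for emphasis):
library(readxl)
dat <- read_excel("my_data.xlsx", trim_ws = TRUE)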
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x),
           stringsAsFactors = FALSE)
It works for mixed numeric/character data frames and manipulates only the character columns.
You could use the trimws function (available since R 3.2.0) on all the columns.
myData[, 1] <- trimws(myData[, 1])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
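A minimal sketch of that loop, restricted to character columns so numeric columns are not coerced to character by trimws:
for (j in seq_along(myData)) {
  if (is.character(myData[[j]])) myData[[j]] <- trimws(myData[[j]])
}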
If you're dealing with large data sets like this, you could really benefit form the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = trimws(df[[j]]))
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
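If the data frame mixes types, a hedged variant that touches only the character columns:
for (j in names(df)) {
  if (is.character(df[[j]])) set(df, j = j, value = trimws(df[[j]]))
}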
R is simply not the right tool for such file sizes. However, you have two options:
Use ffdfdply and ffbase
Use the ff and ffbase packages:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file = your_file, header = TRUE, VERBOSE = TRUE,
                   first.rows = 1e4, next.rows = 5e4)
# 'splits' is the number of chunks you choose to process in
x$split <- as.ff(rep(seq(splits), each = nrow(x)/splits))
ffdfdply(x, x$split, BATCHBYTES = 0, function(myData)
  apply(myData, 2, function(x) gsub('\\s+', '', x)))
Use sed (my preference)
sed -r -i "s/(\S)\s+(\S)/\1\2/g; s/^\s+//; s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame, you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the white space off only the columns of class factor or character (maintaining your data classes):
for (i in names(mydata)) {
  if (class(mydata[, i]) %in% c("factor", "character")) {
    mydata[, i] <- trimws(mydata[, i])
  }
}
I think a simple approach with sapply also works, given a df like:
dat <- data.frame(S = LETTERS[1:10],
                  M = LETTERS[11:20],
                  X = c(rep("A:A", 3), "?", "A:A ", rep("G:G", 5)),
                  Y = c(rep("T:T", 4), "T:T ", rep("C:C", 5)),
                  Z = c(rep("T:T", 4), "T:T ", rep("C:C", 5)),
                  N = c(1:3, '4 ', '5 ', 6:10),
                  stringsAsFactors = FALSE)
You will notice that dat$N becomes class character due to '4 ' and '5 ' (you can check with class(dat$N)).
To get rid of the spaces in the numeric column, simply convert it with as.numeric or as.integer.
dat$N <- as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b <- as.data.frame(sapply(dat, trimws), stringsAsFactors = FALSE)
And again use as.numeric on column N (because sapply will convert it to character):
dat.b$N <- as.numeric(dat.b$N)

Replace strings in text based on dictionary

I am new to R and need suggestions.
I have a dataframe with 1 text field in it. I need to fix the misspelled words in that text field. To help with that, I have a second file (dictionary) with 2 columns - the misspelled words and the correct words to replace them.
How would you recommend doing it? I wrote a simple for loop, but the performance is an issue.
The file has ~120K rows and the dictionary has ~5K rows, and the program has been running for hours. The text can have a maximum of 2000 characters.
Here is the code:
output <- source_file$MEMO_MANUAL_TXT
for (i in 1:nrow(fix_file)) { # dictionary file
  target <- paste0(" ", fix_file$change_to_target[i], " ")
  replace <- paste0(" ", fix_file$target[i], " ")
  output <- gsub(target, replace, output, fixed = TRUE)
}
I would try agrep. I'm not sure how well it scales though.
Eg.
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
Also check out pmatch and charmatch although I feel they won't be as useful to you.
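As a side note: if exact (non-fuzzy) matching is enough, a vectorized dictionary replacement is usually far faster than a gsub loop. A sketch with stringr, reusing the question's column names and its space-padding trick (fixed() is used so the dictionary words are not treated as regular expressions):
library(stringr)
# names are the strings to search for, values are the replacements
dict <- setNames(paste0(" ", fix_file$target, " "),
                 paste0(" ", fix_file$change_to_target, " "))
output <- str_replace_all(source_file$MEMO_MANUAL_TXT, fixed(dict))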
Here is an example, to illustrate @joran's comment, using a data.table left join. It is very fast (instantaneous here).
library(data.table)
n1 <- 120e3
n2 <- 1e3
set.seed(1)
## create vocab
tt <- outer(letters, letters, paste0)
vocab <- as.vector(outer(tt, tt, paste0))
## create the dictionary
dict <- data.table(miss = sample(vocab, n2, rep = FALSE),
                   good = sample(letters, n2, rep = TRUE),
                   key = 'miss')
## the text table
orig <- data.table(miss = sample(vocab, n1, rep = TRUE), key = 'miss')
orig[dict]
orig[dict]
       miss good
   1:  aakq    v
   2:  adac    t
   3:  adxj    r
   4:  aeye    t
   5:  afji    g
  ---
1027:  zvia    d
1028:  zygp    p
1029:  zyjm    x
1030:  zzak    t
1031:  zzvs    q
