Parsing a CSV with irregular quoting rules using readr - r

I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:
name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451
If all of the rows were like first one below the columns row – two character columns followed by an integer column – this would be easy to parse with read_csv:
df <- read_csv("data.csv")
However, some rows are formatted like the second one, in that the second column ("info") contains a string, part of which is enclosed by double quotes and part of which is not. This makes it so read_csv doesn't read the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.
A solution for this kind of problem is to pass FALSE to the escape_double argument in read_delim(), like so:
df <- read_delim("data.csv", delim = ",", escape_double = FALSE)
This works for the second row, but gets killed by the third, where the second column contains a string enclosed by double quotes which itself contains nested double quotes and a comma.
I have read the readr documentation but have as yet found no solution that would parse both types of rows.

Here is what worked for me with the example specified.
Used read.csv rather than read_csv.
This means I am using a dataframe rather than a tibble.
#Read the csv, just turned the table you had as an example to a csv.
#That resulted as a csv with one column
a <- read.csv(file = "Book1.csv", header=T)
#Replace the comma in the third(!) line with just space
a[,1] <- str_replace_all(as.vector(a[,1]), ", ", " ")
#Use seperate from the tidyer package to split the column to three columns
#and convert to a tibble
a <- a %>% separate(name.info.amount_spent, c("name", "info", "spent"), ",")%>%
as_tibble(a)
glimpse(a)
$name <chr> "John Doe", "Jane Doe", "Senator Sally Doe"
$info <chr> "Is a good guy", "\"Jan Doe\" is cool", "\"Sally \"Sal\" Doe is from New York NY\""
$spent <chr> "5412030", "3159", "4451"

You could use a regular expression which splits on the comma in question (using (*SKIP)(*FAIL)):
input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')
lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = T)
(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))
This yields
name info amount_spent
1 John Doe Is a good guy 5412030
2 Jane Doe "Jan Doe" is cool 3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY" 4451
See a demo for the expression on regex101.com.

Related

Extract words from text and create a vector from them

Suppose, I have a txt file that contains such text:
Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
apple,
passion fruit,
mango
Documents: NDA
Export: 2.10
I read this file with readLines function.
Then, I want to get a vector that looks like this:
x <- c(fruits, apple, passion fruit, mango)
So, I want to extract the word after "Type:" and all words between "Products:" and "Documents:".
How can I do this? Thanks!
If it's not subject to change, it looks close to yaml format e.g. using package of the same name
library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))
You will get the other entries as list elements in info of the required name e.g. info$Type
Maybe there is a more elegant solution, in case you can try this, if you got a vector like this:
vec <- readLines("path\\file.txt")
And in the file there is the text you posted, you can try this:
# replace biggest spaces
gsub(" "," ",
# replace the first space
sub(" ",", ",
# pattern to extract words
gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
# collapse in one vector
paste0(vec, collapse = " "))))
[1] "fruits, apple, passion fruit, mango"
If you dput(vec) to make code reproducible:
c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
"Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
"Export: 2.10")

splitting surnames from fullnames

I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
" [471] "KRUEGER" "MARCUS" "
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.
Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here might be a solution:
Assuming that Invnames is a vector of where for every first name there is (exactly) one last name, you could do the following
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe" "John"
[3,] "Tatum" "Jayson"
Here is some sample data and a suggested solution. Data modified from #Rui Barradas' answer:
Invnames <- c("Krueger.$Markus","Doe.John","Tatum.Jayson")
sapply(strsplit(Invnames,"\\W"),"[")
Again using data from an earlier answer with dplyr this time
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
separate(Invnames, c('Surname', 'FirstName'), sep=" ")
Surname FirstName
1 Krueger Markus
2 Doe John
3 Tatum Jayson
With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
If only names were so straightforward! If there were few complications between strings then yes the answers below are good options. In my experience with name lists we get hyphenated names (both in "first" and "last"), "Middle" names, Titles and shortened name versions (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
df <- as.data.frame(Invnames, Invnames = Invnames) %>%
mutate(Invnames2 = gsub("- ","-",Invnames)) %>%
mutate(Invnames2 = gsub(" -","-",Invnames2)) %>%
mutate(surname = gsub(" .*", "", Invnames2))

R: Separate string based on set of regular expressions

I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:
foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%
Now an example for the vectors of characters:
head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"
I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. food.df$candidate1[1] would be Peter Paul Smith). I was trying to approach this with grepl but it doesn't work as grepl only uses the first the first entry from candidates. Any idea how this could be done efficiently?
You can use the regex OR character, |, with paste and regmatches/regexpr.
candidates <- scan(what = character(), text = '
"Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"')
var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
foo.df <- data.frame(var1)
pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"
foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))

Extract last word in string before the first comma

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to sort to first name, last names and titles. With this data, titles always start after the first comma, if there is a title.
I am trying to sort the list into:
FirstName LastName Titles
Mark Owens M.D.,M.P.H
Lara Kraft -
Dale Good C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving. His title starts after the last comma, and the last name begins from P. You can tell that I referred to Extract last word in string in R, Extract 2nd to last word in string and Extract last word in a string after comma if there are multiple words else the first word
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,", "", namelist)
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
firstnames lastnames titles
1 Mark Owens M.D., M.P.H.
2 Dale Good C.P.A
3 Lara Kraft -
4 Roland Bass III
This should do the trick, at least on test data:
x=strsplit(namelist,split = ",")
x=rapply(object = x,function(x) gsub(pattern = "^ ",replacement = "",x = x),how="replace")
names=sapply(x,function(y) y[[1]])
titles=sapply(x,function(y) if(length(unlist(y))>1){
paste(na.omit(unlist(y)[2:length(unlist(y))]),collapse = ",")
}else{""})
names=strsplit(names,split=" ")
firstnames=sapply(names,function(y) y[[1]])
lastnames=sapply(names,function(y) y[[3]])
names <- data.frame(firstnames, lastnames, titles )
names
In cases like this, when the structure of strings is always the same, it is easier to use functions like strsplit() to extract desired parts

Gsub apostrophe in data frame R

I need to remove all apostrophes from my data frame but as soon as I use....
textDataL <- gsub("'","",textDataL)
The data frame gets ruined and the new data frame only contains values and NAs, when I am only looking to remove any apostrophes from any text that might be in there? Am I missing something obvious with apostrophes and data frames?
To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=F)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather
You don't want to apply gsub directly on a data frame, but column-wise instead, e.g.
apply(textDataL, 2, gsub, pattern = "'", replacement = "")

Resources