Gsub apostrophe in data frame in R

I need to remove all apostrophes from my data frame, but as soon as I use
textDataL <- gsub("'", "", textDataL)
the data frame gets ruined: the new object only contains values and NAs. I only want to remove apostrophes from any text that might be in there. Am I missing something obvious about apostrophes and data frames?

To keep the structure intact:
dat1 <- data.frame(Col1 = c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
                   Col2 = c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
                   stringsAsFactors = FALSE)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
#                Col1                 Col2
# 1      a womans hat     the classs hours
# 2    the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house     the canvass size
# 4           Mr Cool        Texas weather
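For what it's worth, the same column-wise idea works with dplyr as well; a minimal sketch, assuming dplyr 1.0+ for across():
library(dplyr)
dat1 <- dat1 %>%
  mutate(across(everything(), ~ gsub("'", "", .x)))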

You don't want to apply gsub directly to a data frame (it coerces its argument with as.character(), which mangles the structure); apply it column-wise instead, e.g.
apply(textDataL, 2, gsub, pattern = "'", replacement = "")
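Note that apply() returns a character matrix rather than a data frame, so if you need the data frame structure back, a minimal sketch:
# apply() gives a matrix; coerce back to a data frame
textDataL <- as.data.frame(
  apply(textDataL, 2, gsub, pattern = "'", replacement = ""),
  stringsAsFactors = FALSE
)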

Is there a way to show the matching element of a specific case using the grepl function in R?

I checked whether the brands in the data frame "df1"
  brands
1   Nike
2 Adidas
3    D&G
are to be found in the elements of the following column of the data frame "df2":
           statements
1         I love Nike
2 I don't like Adidas
3         I hate Puma
For this I use the code:
subset_df2 <- df2[grepl(paste(df1$brands, collapse="|"), ignore.case=TRUE, df2$statements), ]
The code works and I get a subset of df2 containing only the lines with the desired brands:
           statements
1         I love Nike
2 I don't like Adidas
Is there also a way to display which element of the cells from df2$statements exactly matches df1$brands? For instance, a vector like c("Nike", "Adidas"). So I only want to get the Nike and Adidas elements as my output, not the whole statement.
Many thanks in advance!
brands <- c("nike", "adidas", "d&g") # lower-case here
text <- c("I love Nike", "I love Adidas")
ptns <- paste(brands, collapse = "|")
ptns
# [1] "nike|adidas|d&g"
text2 <- text[NA]
text2[grepl(ptns, text, ignore.case=TRUE)] <- gsub(paste0(".*(", ptns, ").*"), "\\1", text, ignore.case = TRUE)
text2
# [1] "Nike" "Adidas"
The pre-assignment text2 <- text[NA] is because gsub makes no change if the pattern is not found. I'm using text[NA], but we could also use rep(NA_character_, length(text)); the effect is the same.
If you need multiple matches per text, then perhaps
brands <- c("Nike", "Adidas", "d&g")
text <- c("I love nike", "I love Adidas and Nike")
ptns <- paste(brands, collapse = "|")
gre <- gregexpr(ptns, text, ignore.case = TRUE)
sapply(regmatches(text, gre), paste, collapse = ";")
# [1] "nike" "Adidas;Nike"

Recognizing synonyms in left_join in R

I have several quite large data tables containing character strings, which I would like to join with the entries in my database. The spelling is often not quite right, so joining directly is not possible.
I know there is no way around creating a synonym table to replace some misspelled strings. But is there a way to automatically detect certain anomalies (see example below)?
My data tables look similar to this:
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
The characters in my database are similar to this:
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
Currently, if I perform a left_join, only "Apple" will join:
data <- data %>%
  left_join(characters.database, by = c('products'))
Result:
products        ID
potatoe Chips   NA
potato Chips    NA
potato chips    NA
Potato-chips    NA
apple           NA
Apple            3
Appl            NA
Apple Gala      NA
Is it possible to automatically ignore case, spaces (" "), "-", and an "e" at the end of a word during the left_join?
This would be the table I would like:
products        ID
potatoe Chips    1
potato Chips     1
potato chips     1
Potato-chips     1
apple            1
Apple            3
Appl             1
Apple Gala      NA
Any Ideas?
If I were you, I'd do a few things:
I'd strip all special characters, lower case all characters, remove spaces, etc. That'd help a bunch (i.e. potato chips, Potato Chips, and Potato-chips all go to "potatochips" which you can then join on).
There's a package called fuzzyjoin that will let you join on regular expressions, by edit distance, etc. That'll help with Apple vs Apple Gala and misspellings, etc. (see the sketch after the code below).
You can strip special characters (only keep letters) + lowercase with something like:
library(stringr)
library(magrittr)
string %>%
  str_remove_all("[^A-Za-z]+") %>%
  tolower()
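For the fuzzyjoin route, a minimal sketch using stringdist_left_join on the question's data; the max_dist = 2 threshold is an assumption to tune:
library(fuzzyjoin)
# join on edit distance so close misspellings like "Appl" can still match
stringdist_left_join(data, characters.database,
                     by = "products", max_dist = 2)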
Thanks Matt Kaye for your suggestion; I did something similar now.
As I need the correct spelling in the database, and some of my strings contain symbols and numbers which are relevant, I did the following:
# data
data <- data.table(products = c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
characters.database <- data.table(products = c("Potato Chips", "Potato Chips Paprika", "Apple"), ID = c("1", "2", "3"))
# remove spaces and capital letters in data
data <- data %>%
  mutate(products = tolower(products)) %>%
  mutate(products = gsub(" ", "", products))
# add ID to database
characters.database <- characters.database %>%
  dplyr::mutate(ID = row_number())
# remove spaces and capital letters in database product names
characters.database_syn <- characters.database %>%
  mutate(products = tolower(products)) %>%
  mutate(products = gsub(" ", "", products))
# join and add correct spelling from database
data <- data %>%
  left_join(characters.database_syn, by = c('products')) %>%
  select(product_syn = products, 'ID') %>%
  left_join(characters.database, by = c('ID'))
# other synonyms have to be corrected manually or with the help of a synonym table
# (as in MY data special characters are relevant!)
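For reference, the synonym table mentioned in the last comment could be applied to the cleaned product names before the join; a minimal sketch (entries and column names are illustrative, not from the question):
library(dplyr)
# hypothetical synonym table: known misspellings -> cleaned database spelling
synonyms <- data.frame(products     = c("appl", "potatoechips"),
                       products_fix = c("apple", "potatochips"),
                       stringsAsFactors = FALSE)
# replace misspelled names, keeping the original where no synonym exists
data <- data %>%
  left_join(synonyms, by = "products") %>%
  mutate(products = coalesce(products_fix, products)) %>%
  select(-products_fix)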

Extract words from text and create a vector from them

Suppose I have a txt file that contains text like this:
Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
 apple,
 passion fruit,
 mango
Documents: NDA
Export: 2.10
I read this file with the readLines function.
Then, I want to get a vector that looks like this:
x <- c(fruits, apple, passion fruit, mango)
So, I want to extract the word after "Type:" and all words between "Products:" and "Documents:".
How can I do this? Thanks!
If the file format is not subject to change, it looks close to YAML, e.g. using the package of the same name:
library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))
You will get the other entries as list elements of info under the corresponding names, e.g. info$Type.
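To get exactly the vector asked for, combine the Type entry with the products (assuming the file parsed as above):
x <- c(info$Type, out)
x
# [1] "fruits"        "apple"         "passion fruit" "mango"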
Maybe there is a more elegant solution, but you can try this. Say you have read the file into a vector like this:
vec <- readLines("path\\file.txt")
where the file contains the text you posted. Then you can try:
# replace double spaces with single spaces
gsub("  ", " ",
     # replace the first space with ", "
     sub(" ", ", ",
         # pattern to extract the words of interest
         gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
              # collapse into one string
              paste0(vec, collapse = " "))))
[1] "fruits, apple, passion fruit, mango"
For reproducibility, dput(vec) gives:
c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
"Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
"Export: 2.10")

Splitting surnames from full names

I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
" [471] "KRUEGER" "MARCUS" "
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.
Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
#  Surnames Firstnames
#1  Krueger     Markus
#2      Doe       John
#3    Tatum     Jayson
As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here might be a solution:
Assuming that Invnames is a vector where for every first name there is (exactly) one last name, you could do the following:
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
     lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe"     "John"
[3,] "Tatum"   "Jayson"
Here is some sample data and a suggested solution. Data modified from @Rui Barradas' answer:
Invnames <- c("Krueger.$Markus", "Doe.John", "Tatum.Jayson")
# split on non-word characters and take the first piece (the surname)
sapply(strsplit(Invnames, "\\W"), "[", 1)
Again using data from an earlier answer, with tidyr's separate() this time:
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
  separate(Invnames, c('Surname', 'FirstName'), sep = " ")
  Surname FirstName
1 Krueger    Markus
2     Doe      John
3   Tatum    Jayson
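If some names contained more than one space, separate() would warn and drop the extra pieces; tidyr's extra argument keeps them in the first-name column instead:
Invnames %>%
  separate(Invnames, c('Surname', 'FirstName'), sep = " ", extra = "merge")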
With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
#  Surnames Firstnames
#1  Krueger     Markus
#2      Doe       John
#3    Tatum     Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
If only names were so straightforward! If there are few complications between strings, then yes, the other answers are good options. In my experience with name lists we get hyphenated names (in both "first" and "last"), middle names, titles, and shortened name versions (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity):
Invnames <- c("Krueger Markus", "Doe John", "Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
df <- data.frame(Invnames) %>%
  mutate(Invnames2 = gsub("- ", "-", Invnames)) %>%
  mutate(Invnames2 = gsub(" -", "-", Invnames2)) %>%
  mutate(surname = gsub(" .*", "", Invnames2))

Getting a split out of a string into a new column

I'm working on a data.frame, trying to extract the part of a string between , and . and put it into a new column. I would like to use dplyr.
library(dplyr)
name <- c("Cumings, Mrs. John Bradley","Heikkinen, Miss. Laina","Moran, Mr. James","Allen, Mr. William Henry","Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("female","female","male","male","female")
age <- c(22,23,24,37,42)
data <- data.frame(name,sex,age)
So I want to extract Mrs, Miss, Mr and so on into their own column.
data %>%
  mutate(title = strsplit(name, split = "[,.]")) %>%
  select(name, title)
Delete everything outside the , and . using gsub(".*, |\\..*", "", name):
library(dplyr)
data %>% mutate(title = gsub(".*, |\\..*", "", name))
gsub(".*, ", "", name): deletes everything before ,, , itself and space after.
gsub("\\..*", "", name): deletes . and everything after it.
| combines two gsub patterns.
str_extract will retrieve the first instance within each string:
library(dplyr)
library(stringr)
data <- data.frame(name, sex, age) %>%
  mutate(title = str_extract(name, ",.+\\."),
         title = str_replace_all(title, "([[:punct:]]| )", ""))
A slightly more efficient solution:
data %>%
  mutate(title = str_trim(str_extract(name, regex("(?<=,).*?(?=\\.)"))))
The (?<=,) says to look after a comma, the (?=\\.) says to look before the period, and the .*? says to grab everything in between. The str_trim() removes the leading and trailing white space.
I don't have an answer to the dplyr problem. I just wanted to mention that this way of splitting the salutation from a name will probably encounter multiple errors when used on real-world data.
A better (but still error-prone) way to do this is by creating a lookup table of common salutations and matching with regex, as sketched below.
The advantage over splitting the data lies in the fact that if there is no hit in the regex, the value remains empty (NA) and can easily be fixed manually, but does not create inconsistent data in the first step.
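A minimal sketch of that lookup-table idea, using the question's data; the salutation list is illustrative, not exhaustive:
library(stringr)
# assumed lookup table of common salutations; extend as needed
salutations <- c("Mr", "Mrs", "Miss", "Ms", "Dr")
ptn <- paste0("\\b(", paste(salutations, collapse = "|"), ")\\b")
data$title <- str_extract(data$name, ptn)
# names with no recognized salutation stay NA and can be fixed manually
data$title
# [1] "Mrs"  "Miss" "Mr"   "Mr"   "Mrs"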
Without using any external package
data$title <- with(data, sub("^[^,]+,\\s*(\\S+).*", "\\1", name))
data$title
#[1] "Mrs." "Miss." "Mr." "Mr." "Mrs."
Similar to @Benjamin's answer (base R's equivalent of str_extract_all), here's how to do it using regmatches + gregexpr + positive lookahead:
library(dplyr)
data %>%
  mutate(title = regmatches(name, gregexpr("\\b[[:alpha:]]+(?=[.])",
                                           name, perl = TRUE))) %>%
  select(name, title)
Result:
                                          name title
1                   Cumings, Mrs. John Bradley   Mrs
2                       Heikkinen, Miss. Laina  Miss
3                             Moran, Mr. James    Mr
4                     Allen, Mr. William Henry    Mr
5 Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs
\\b matches a "word boundary", which in this case is a space. perl = TRUE is needed to utilize positive lookahead (?=[.]), which essentially says "only if the pattern is followed by a ."
