Matching two data frames with some characters in R

I have the following data frames
df1 <- data.frame(
Description=c("How are you- doing?", "will do it tomorrow otherwise: next week", "I will work hard to complete it for nextr week1 or tomorrow", "I am HAPPY with this situation now","Utilising this approach can helpα'x-ray", "We need to use interseting <U+0452> books to solve the issue", "Not sure if we could do it appropriately.", "The schools and Universities are closed in f -blook for a week", "Things are hectic here and we are busy"))
and I want to get the following table:
d <- data.frame(
Description=c("Utilising this approach can helpa'x-ray", "How are you- doing", " We need to use interseting <U+0452> books to solve the issue ", " will do it tomorrow otherwise: next week ", " Things are hectic here and we are busy ", "I will work hard to complete it for nextr week1 or tomorrow ", "The schools and Universities are closed in f -blook for a week", " I am HAPPY with this situation now "," I will work hard to complete it for nextr week1 or tomorrow"))
f2<- read.table(text="B12 B6 B9
No Yes Yes
12 6 9
No No Yes
No No Yes
No No Yes
Yes No Yes
11 No Yes
12 11 P
No No Yes
", header=TRUE)
df3<-cbind(d,f2)
As you can see, the Description strings differ in whitespace and punctuation (spaces, colons, and so on); the 1 after "week" is a subscript and I was unable to fix it. I want to match df1 with df3 based on "Description". Can we do this in R for this case?

We can use stringdist joins from the fuzzyjoin package to match the data on 'Description'. We use na.omit to remove the NA (unmatched) rows from the final data frame.
na.omit(fuzzyjoin::stringdist_left_join(df1, df3, by = 'Description'))
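If fuzzyjoin is unavailable, the same nearest-match idea can be sketched in base R with adist, which computes an edit-distance matrix. The two toy data frames below are hypothetical stand-ins for df1 and df3:

```r
# Base-R sketch: match each string in a to its closest string in b
# by Levenshtein distance. adist() is base R, no packages needed.
a <- data.frame(Description = c("How are you- doing?", "I am HAPPY now"),
                stringsAsFactors = FALSE)
b <- data.frame(Description = c(" I am HAPPY now ", "How are you- doing"),
                B12 = c("No", "Yes"), stringsAsFactors = FALSE)
d   <- adist(a$Description, b$Description)  # edit-distance matrix
idx <- apply(d, 1, which.min)               # nearest b-row for each a-row
merged <- cbind(a, b[idx, -1, drop = FALSE])
```

adist scales quadratically with table size, so for large data fuzzyjoin's stringdist backend remains the better fit.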

Related

R - All or Partial String Matching?

I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
"Williams says that Steve Austin will miss the rest of the week",
"Weird times: Jeff Smith just got thrown out attempting to steal home",
"Rest day for Austin today",
"Jeff says he expects to bat leadoff", "Jeff", "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data shown, every row except the last (which mentions neither name) should be TRUE.
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"),
                  fixed = TRUE), collapse = "|")
tweets %>%
  mutate(ind = str_detect(text, pat))
-output
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
Not a beauty, but it works.
# example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
# split each full name into its components
namelist_spreaded <- strsplit(namelist, split = ' ')
# build "(Full Name) | (First) | (Last)" from one name's parts
f <- function(x) {
  paste0('(',
         paste(x, collapse = ' '),
         ') | (',
         paste(x, collapse = ') | ('),
         ')')
}
lapply(namelist_spreaded, f)
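A usage sketch for the helper above: collapse the per-name patterns into one regex and test strings with grepl. Note that the literal spaces around | become part of each alternative, so a lone name at the very start or end of a string can slip through — a known quirk of this pattern style.

```r
# Build "(Full) | (First) | (Last)" per name, then join all names with " | ".
namelist <- c("Jeff Smith", "Kevin Arnold")
f <- function(x) paste0("(", paste(x, collapse = " "), ") | (",
                        paste(x, collapse = ") | ("), ")")
pats <- vapply(strsplit(namelist, " "), f, character(1))
pat  <- paste(pats, collapse = " | ")
grepl(pat, "Weird times: Jeff Smith just got thrown out")  # name present
grepl(pat, "No reference to either person")                # no match
```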

How do you replace words that repeat themselves one after another in R?

I want to substitute all the strings that have words that repeat themselves one after another with words that have a single occurrence.
My strings go something like that:
text_strings <- c("We have to extract these numbers 12, 47, 48", "The integers numbers are also interestings: 189 2036 314",
"','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456", "We like to to offer you 7890$ per month in order to complete this task... we are joking", "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.", "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.", "you can also extract exotic stuff like a456 gb67 and 45678911ghth", "Writing 1 example is not funny, please consider that 66% is validation+testing", "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]", "Who loves arrays more than me?", "{366,78,90,5}Yes, there are only 4 numbers inside", "Integers are fine but sometimes you like 99 cents after the 99 dollars", "100€ are better than 99€", "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]", "Ok ok 1 2 3 4 5 and the last one is 6", "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando")
I tried:
gsub("\b(?=\\w*(\\w)\1)\\w+", "\\w", text_strings, perl = TRUE)
But nothing happened (the output remained the same).
How can I remove the repeating words such as in
text_strings[9]
#[1] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
Thank you!
You can use gsub with a regular expression: \\b(\\w+) captures a word, \\W+\\1 matches the same word repeated after one or more non-word characters, and the replacement keeps a single copy.
gsub("\\b(\\w+)\\W+\\1", "\\1", text_strings, ignore.case=TRUE, perl=TRUE)
[1] "We have to extract these numbers 12, 47, 48"
[2] "The integers numbers are also interestings: 189 2036 314"
[3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
[4] "We like to offer you 7890$ per month in order to complete this task... we are joking"
[5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
[6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
[7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth"
[8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
[9] "You are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
[10] "Who loves arrays more than me?"
[11] "{366,78,90,5}Yes, there are only 4 numbers inside"
[12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
[13] "100€ are better than 99€"
[14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
[15] "Ok 1 2 3 4 5 and the last one is 6"
[16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
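The pattern above removes one repetition per match; if a word can repeat three or more times in a row, quantifying the repeated group collapses the whole run in one pass. A small sketch on hypothetical input:

```r
# (\W+\1\b)+ greedily consumes every consecutive repeat of the captured word.
x <- c("Ok ok ok 1 2 3", "the the the end")
gsub("\\b(\\w+)(\\W+\\1\\b)+", "\\1", x, ignore.case = TRUE, perl = TRUE)
```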

Replace text in one data-frame using a look up to another data-frame

I have the task of searching through text, replacing peoples names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment and use the df_name_comment$text_names to look up the corresponding nickname from df_name_nick and replace it with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I’m thinking this will require a cunning combination of gsubs, matches and apply functions (either mapply, sapply, etc)
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We paste the nicknames together for each name, collapsed by |, so the result can be passed as a regex to gsub to replace the matched words of each comment with XXX. We use mapply to do that after we merge our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note 1: Replacing NA in nicknames with 0 is needed because NA (the default fill in merge for unmatched elements) would turn the comment string into NA as well when passed to gsub.
Note 2: The order also changes because of merge, but you can sort as you wish as per usual.
Note 3: It is better to have your variables as characters rather than factors. So either read the data frames with stringsAsFactors = FALSE or convert via,
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
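The EDIT variant above, restated as a self-contained sketch (a trimmed three-row version of the question's data; the "0" fill is the same dummy-pattern trick as in the answer):

```r
names <- c("Thomas", "Thomas", "Abigail", "Abigail", "Abigail")
nicknames <- c("Tom", "Tommy", "Abi", "Abby", "Abbey")
df_name_nick <- data.frame(names, nicknames, stringsAsFactors = FALSE)
df_name_comment <- data.frame(
  text_names   = c("Abigail", "Thomas", "Colin"),
  text_comment = c("Tommy sits next to Abbey",
                   "Tom is a handsome man",
                   "Tom is friends with Colin and Abi"),
  stringsAsFactors = FALSE)

# one "nick1|nick2|..." pattern per name, aligned to the comment rows
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = "|")
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- "0"  # dummy pattern: matches nothing in the comments
df_name_comment$text_comment <- mapply(function(p, s) gsub(p, "XXX", s),
                                       v1, df_name_comment$text_comment,
                                       USE.NAMES = FALSE)
```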
l <- apply(df_name_comment, 1, function(x)
  ifelse(length(df_name_nick[df_name_nick$names == x["text_names"], "nicknames"]) > 0,
         gsub(paste(df_name_nick[df_name_nick$names == x["text_names"], "nicknames"],
                    collapse = "|"), 'XXX', x["text_comment"]),
         x["text_comment"]))
df_name_comment$text_comment <- unname(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In my initial solution I manually checked with grepl whether a nickname was present, and then gsubbed with one of the matching patterns. I knew the '|' operator worked with grepl, but not that it also works with gsub. So credit to Sotos for that idea.
df <- df_name_comment
for (i in 1:nrow(df)) {
  matching_nicknames <- df_name_nick$nicknames[df_name_nick$names == df$text_names[i]]
  if (length(matching_nicknames) > 0) {
    df$text_comment[i] <- sub(paste(paste0("\\b", matching_nicknames, "\\b"),
                                    collapse = "|"),
                              "XXX", df$text_comment[i])
  }
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!

Need help processing multiple response strings from a google form using R

I'm trying to process results from a Google Form in R and have hit a wall in dealing with string data.
The question can be seen here:
Google returns the results in a single column with a comma separating each response.
They end up looking like
ID | Type of Research
=====================
1 | Policy analysis, Review of other research
2 | Bla
3 | Review of other research, Original empirical research
4 | Policy analysis, Theoretical
5 | Review of other research
I've used grepl to create logical columns and a data.frame for the three pre-selected responses.
Private$ResearchTypeOriginal <- grepl("Original", Private$ResearchType)
Private$ResearchTypeReview <- grepl("Review", Private$ResearchType)
Private$ResearchTypePolicy <- grepl("Policy", Private$ResearchType)
ResearchTypeGrid <- data.frame(Private$ResearchTypeOriginal, Private$ResearchTypeReview, Private$ResearchTypePolicy)
This works great. However, I also need to pull out the "other"s. I was using
ResearchTypeOther <- subset(Private, !grepl("Original", Private$ResearchType) & !grepl("Review", Private$ResearchType) & !grepl("Policy", Private$ResearchType), select=c(ID, ResearchType, PubLang, Reviewer))
ResearchTypeOther <- na.omit(ResearchTypeOther)
but just realized that if a response has both a pre-selected response AND a open-ended one, that's lost using this method. It works fine for giving me the "Bla" responses, but only the ones that are exclusively "other."
In other words, this produces
ID | Type of Research
=======================
2 | Bla
But what I'd like is
ID | Type of Research
======================
2 | Bla
4 | Policy analysis, Theoretical
This is my first time posting on SO, and I'm obviously new at R, so please excuse any mistakes in how I'm asking the question. I'm sorry if I'm not phrasing this very well. I have ~20 other questions with the same problem, so I need a flexible solution.
Thanks for any help.
You could "regex your way through" in the veins of
doc <- readLines(n = 5)
1 | Policy analysis, Review of other research
2 | Bla
3 | Review of research, Original empirical research
4 | Policy analysis, Theoretical
5 | Review of other research
items <- c("Review of other research",
"Original empirical research",
"Policy analysis")
(others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "",
sub(".*\\|\\s(.*)", "\\1", doc)))
# [1] "" "Bla" "Review of research"
# [4] "Theoretical " ""
sub(sprintf("(,\\s)?(%s)(,\\s)?", paste(others[others != ""], collapse = "|")), "", doc)
# [1] "1 | Policy analysis, Review of other research"
# [2] "2 | "
# [3] "3 | Original empirical research"
# [4] "4 | Policy analysis"
# [5] "5 | Review of other research"
Got it thanks to Luke. Not elegant at all, but this worked:
items <- c("Review of other research",
"Original empirical research",
"Policy analysis")
ResearchTypeOther <- data.frame((others <- gsub(sprintf("(,\\s)?(%s)(,\\s)?", paste(items, collapse = "|")), "",
sub(".*\\|\\s(.*)", "\\1", Private$ResearchType))))
ResearchTypeOther[ResearchTypeOther==""] <- NA
ResearchTypeOther <- na.omit(ResearchTypeOther)
You could try: (using doc and items from #lukeA)
library(stringr)
doc[sapply(strsplit(doc, "\\d +\\||,"), function(x) {
x1 <- str_trim(x)
x2 <- x1[x1!='']
indx <- x2 %in% items
!(any(indx) & tail(indx,1))})]
#[1] "2 | Bla"                        "4 | Policy analysis, Theoretical"
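Another base-R reading of the same task, sketched with strsplit and setdiff: split each cell on commas, drop the pre-selected items, and keep the rows where anything remains (doc and items as in the answers above, inlined here so the snippet is self-contained):

```r
doc <- c("Policy analysis, Review of other research", "Bla",
         "Policy analysis, Theoretical", "Review of other research")
items <- c("Review of other research", "Original empirical research",
           "Policy analysis")
parts  <- lapply(strsplit(doc, ",\\s*"), trimws)  # split multi-response cells
others <- vapply(parts, function(x) length(setdiff(x, items)) > 0, logical(1))
doc[others]  # rows containing at least one free-text "other" response
```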

Getting a list of Means by a parameter R

I have data that looks like this:
> head(chf)
Admit.Day.of.Week Type.of.Admission Patient.Disposition
1 SAT Emergency Skilled Nursing Home
2 FRI Elective Home or Self Care
3 FRI Emergency Home w/ Home Health Services
4 MON Emergency Skilled Nursing Home
5 THU Emergency Home or Self Care
6 WED Emergency Skilled Nursing Home
mean_los_dispo
1 8.553525
2 4.224193
3 5.789052
4 8.553525
5 4.224193
6 8.553525
I use the following command to get the column labeled mean_los_dispo:
# Mean LOS for each patient disposition
chf$mean_los_dispo <- ave(chf$Length.of.Stay, chf$Patient.Disposition,
FUN = mean)
What I want to do is set a variable to hold the value of the mean_los_dispo for each of the four different dispositions, for example
SNH = 8.553525
HSC = 4.224193
...
How would I go about doing this? I want to be able to eventually use paste or something similar to put the information in the title of a graph.
You can use paste. So for example, I created two variables, one with numbers (so your means) and another with characters (so your dispositions), and then I used paste to concatenate them:
a<-c(1,2,3,4,5)
b<-c("a","b","c","d","e")
strs<-paste(b," = ",as.character(a),sep="")
This produces:
[1] "a = 1" "b = 2" "c = 3" "d = 4" "e = 5"
In your case you could do something like the following:
unique(paste(chf$Patient.Disposition," = ",as.character(chf$mean_los_dispo),sep=""))
The unique will get rid of all of the duplicates.
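If the goal is one scalar per disposition (rather than a repeated column), tapply returns a named vector you can index by disposition and drop straight into a plot title. A sketch with hypothetical data, using the column names from the question:

```r
chf <- data.frame(
  Patient.Disposition = c("Skilled Nursing Home", "Home or Self Care",
                          "Skilled Nursing Home"),
  Length.of.Stay      = c(8, 4, 9))
# one mean per disposition, indexed by name
means <- tapply(chf$Length.of.Stay, chf$Patient.Disposition, mean)
main_title <- paste0("Mean LOS, Skilled Nursing Home: ",
                     round(means["Skilled Nursing Home"], 2))
```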
