Using group_by to replace different strings by group - r

I'm trying to replace every instance of an author's name in a data.frame with a different string, but only when that author was the one speaking. For instance, if we have the data:
test <- data.frame(author = c("jon", "mike", "sam"), text = rep("jon and mike mike and sam sam sam", 3))
I'd like to replace every instance of "sam" with some other text when author=="sam".
I've tried using do and str_replace_all to do this, but haven't gotten it to work:
test %>% group_by(author) %>% do(mutate(., text2 = str_replace(.$text, eval(parse(text = .$author)), "yay")))

str_replace_all is Vectorised over string, pattern and replacement. (see ?str_replace_all), so you can just use the author column as pattern:
test %>% mutate(new_text = str_replace_all(text, author, 'yay'))
# author text new_text
#1 jon jon and mike mike and sam sam sam yay and mike mike and sam sam sam
#2 mike jon and mike mike and sam sam sam jon and yay yay and sam sam sam
#3 sam jon and mike mike and sam sam sam jon and mike mike and yay yay yay

Related

Expand data.table so one row per pattern match of each ID

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I have managed to subset the table so it shows text that matches at least two of the patterns (relevant question here).
I now want to be able to have one row per match, with an additional column that identifies the match - so rows where there are multiple matches will be duplicates apart from that column.
It feels like this shouldn't be too hard but I'm struggling! My vague thoughts are around maybe counting the number of pattern matches, then duplicating the rows that many times...but then I'm not entirely sure how to get the label for each different pattern...(and also not sure that is very efficient anyway).
Thanks for your help!
Example data
library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5),
text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
# Filtering the table to just the IDs with at least two pattern matches
text_table_multiples <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
Ideal output
required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
text = c("lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"),
person = c("lucy", "sarah", "paul or john", "lucy", "lucy", "sarah", "paul or john", "paul or john"))
A way to do that is to create a variable for each indicator and melt:
library(stringi)
text_table[, lucy := stri_detect_regex(text, 'lucy')][ ,
sarah := stri_detect_regex(text, 'sarah')
][ ,`paul or john` := stri_detect_regex(text, 'paul|john')
]
melt(text_table, id.vars = c("ID", "text"))[value == T][, -"value"]
## ID text variable
## 1: 1 lucy, sarah and paul live on the same street lucy
## 2: 2 lucy has only moved here recently lucy
## 3: 3 lucy and sarah are cousins lucy
## 4: 1 lucy, sarah and paul live on the same street sarah
## 5: 3 lucy and sarah are cousins sarah
## 6: 1 lucy, sarah and paul live on the same street paul or john
## 7: 4 john is also new to the area paul or john
## 8: 5 paul and john have known each other a long time paul or john
A tidy way of doing the same procedure is:
library(tidyverse)
text_table %>%
mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
gather(value = value, key = person, - c(ID, text)) %>%
filter(value) %>%
select(-value)
DISCLAIMER: this is not an idiomatic data.table solution
I would build a helper function like the following, that take a single row and an input and returns a new dt with Nrows:
library(data.table)
library(tidyverse)
new_rows <- function(dtRow, patterns = text_patterns){
res <- map(text_patterns, function(word) {
textField <- grep(x = dtRow[1, text], pattern = word, value = TRUE) %>%
ifelse(is.character(.), ., NA)
personField <- str_extract(string = dtRow[1, text], pattern = word) %>%
ifelse( . == "paul" | . == "john", "paul or john", .)
idField <- ifelse(is.na(textField), NA, dtRow[1, ID])
data.table(ID = idField, text = textField, person = personField)
}) %>%
rbindlist()
res[!is.na(text), ]
}
And I will execute it:
split(text_table, f = text_table[['ID']]) %>%
map_df(function(r) new_rows(dtRow = r))
The answer is:
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john
which looks like your required_table (duplicated IDs included)
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john

Multiple criteria lookup in R

I have data like this:
ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats
And I want to add columns indicating teams A and B by matching the row ID of that row in the ID column, and by matching one of the names in one of the "a" columns of that row in the "Name" column (for Team A), and doing the same for Team B using one of the names in one of the "b" columns of that row:
ID 1a 2a 3a 1b 2b 3b Name Team Team A Team B
cb128c James John Bill Jeremy Ed Simon Simon Wolves Tigers Wolves
cb128c John James Randy Simon David Ben John Tigers Tigers Wolves
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows Wildcats Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats Wildcats Sparrows
In row 1, we know Team A is Tigers because we match the ID of row 1, cb128c, in the ID column, and one of the "a" names of row 1 (either James, John or Bill) in the Name column. In this case, Row 2 has that ID, cb128c, and has "John" in the Name column. The Team in row 2 is "Tigers." Therefore, Row 1's Team A is Tigers. Team B is the Wolves because we match row 1's ID, still cb128c, and one of the "b" names in row 1 (either Jeremy, Ed or Simon) in the Name column. In this case, row 1 itself has the data we're looking for since one of the "b" names appears in the "Name" column of that row (Simon). The "Team" listed in each row will always either be the Team A or the Team B for that row.
Further down, we know Team A for row 3 is Wildcats because we match row 3's ID, ko351u and one of row 3's "a" names (either Adam, Alex or Jacob) in the "Name" column. Row 4 has that ID and "Adam" in the Name column. So the Team in Row 4 is Team A for Row 3.
Also notice that David switched teams in Row 3. In Row 2, David was on Simon's team, which we know is the Wolves (as explained above), but when we match Row 3's ID and one of Row 3's "b" names (Bob, Oscar or David), we get the Sparrows (like Row 1, one of the "b" names appears in the name column of that same row, so the Team B is the Team listed in that row).
How can I get this done in R?
df = read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats", header = T)
# convert to character
df[] = lapply(df, as.character)
library(tidyr)
library(dplyr)
The following code 1. gathers to long format, 2. creates "Team_A" and "Team_B" out of the a or b suffix, 3. matches names to fill in the A/B Team Name, 4. removes missing values (no match), 5. gets rid of unnecessary columns, 6. converts back to wide format, 7. joins the A and B teams to the original data.
I'd encourage you to step through the code line by line to understand what's going on. I'll leave reordering the columns to you.
result = gather(df, key = "key", value = "value", starts_with("X")) %>%
mutate(ab = paste0("Team_", toupper(substr(key, start = nchar(key), stop = nchar(key)))),
team = ifelse(Name == value, Team, NA)) %>%
filter(!is.na(team)) %>%
select(ID, ab, team) %>%
spread(key = ab, value = team) %>%
right_join(df)
result
# ID Team_A Team_B X1a X2a X3a X1b X2b X3b Name Team
# 1 cb128c Tigers Wolves James John Bill Jeremy Ed Simon Simon Wolves
# 2 cb128c Tigers Wolves John James Randy Simon David Ben John Tigers
# 3 ko351u Wildcats Sparrows Adam Alex Jacob Bob Oscar David Oscar Sparrows
# 4 ko351u Wildcats Sparrows Adam Matt Sam Fred Frank Harry Adam Wildcats

How to Restructure R Data Frame in R [duplicate]

This question already has answers here:
reshape wide to long with character suffixes instead of numeric suffixes
(3 answers)
Closed 5 years ago.
I have data in this format:
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
and I would like it in this format:
boss employee
1 wil james
2 wil andy
3 james dean
4 james bert
5 billy herb
6 billy collin
7 tony mike
8 tony david
I have searched the forums, but I have not yet found anything that helps. I have tried using dplyr and some others, but I am still pretty new to R.
If this question has been answered and you could give me a link that would be greatly appreciated.
Thanks,
Wil
Here is a solution that uses tidyr. Specifically, the gather function is used to combine the two employee columns. This also generates a column bsaed on the column headers (employee1 and employee2) which is called key. We remove that with select from dplyr.
library(tidyr)
library(dplyr)
df <- read.table(
text = "boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david",
header = TRUE,
stringsAsFactors = FALSE
)
df2 <- df %>%
gather(key, employee, -boss) %>%
select(-key)
> df2
boss employee
1 wil james
2 james dean
3 billy herb
4 tony mike
5 wil andy
6 james bert
7 billy collin
8 tony david
I would be shocked if there isn't a slicker, base solution but this should work for you.
Using base R:
df1 <- df[, 1:2]
df2 <- df[, c(1, 3)]
names(df1)[2] <- names(df2)[2] <- "employee"
rbind(df1, df2)
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 11 wil andy
# 21 james bert
# 31 billy collin
# 41 tony david
Using dplyr:
df %>%
select(boss, employee1) %>%
rename(employee = employee1) %>%
bind_rows(df %>%
select(boss, employee2) %>%
rename(employee = employee2))
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 5 wil andy
# 6 james bert
# 7 billy collin
# 8 tony david
Data:
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)

separate different combinations of names to first and last using dplyr, tidyr, and regex

Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
name first.name last.name
1 Smith John Michael John Smith
2 Smith, John Michael John Smith
3 Smith John, Michael Michael Smith John
4 Smith-John Michael Michael Smith-John
5 Smith-John, Michael Michael Smith-John
The rules are: if there is a comma in the string, then anything before is the last name. the first word following the comma is first name. If no comma in string, first word is last name, second word is last name. hyphenated words are one word. I would rather acheive this with dplyr and regex but I'll take any solution. Thanks for the help
You can achieve your desired result using strsplit switching between splitting by "," or " " based on whether there is a comma or not in name. Here, we define two functions to make the presentation clearer. You can just as well inline the code within the functions.
get.last.name <- function(name) {
lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}
The result of strsplit is a list. The lapply(...,'[[',1) loops through this list and extracts the first element from each list element, which is the last name.
get.first.name <- function(name) {
d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}
This function is similar except we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any starting spaces using gsub, and we split again with " " to extract the first element from each list element returned by that strsplit as the first name.
Putting it all together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
## name first.name last.name
## 1 Smith John Michael John Smith
## 2 Smith, John Michael John Smith
## 3 Smith John, Michael Michael Smith John
## 4 Smith-John Michael Michael Smith-John
## 5 Smith-John, Michael Michael Smith-John
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael
I am not sure if this is any better than aichao's answer but I gave it a shot anyway. I gives the right output.
df1 <- df %>%
filter(grepl(",",name)) %>%
separate(name, c("last.name","first.middle.name"), sep = "\\,", remove=F) %>%
mutate(first.middle.name = trimws(first.middle.name)) %>%
separate(first.middle.name, c("first.name","middle.name"), sep="\\ ",remove=T) %>%
select(-middle.name)
df2 <- df %>%
filter(!grepl(",",name)) %>%
separate(name, c("last.name","first.name"), sep = "\\ ", remove=F)
df<-rbind(df1,df2)

Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R -
if I want to remove duplicates within a csv file based on multiple entries - first name and last name that are stored in separate columns
Then I can code: file[(duplicated(file$First.Name),]
but that only looks at the first name, I want it to look at the last same simultaneously.
If this is my starting file:
Steve Jones
Eric Brown
Sally Edwards
Steve Jones
Eric Davis
I want the output to be
Steve Jones
Eric Brown
Sally Edwards
Eric Davis
Only removing names of first and last name matching.
You can use
file[!duplicated(file[c("First.Name", "Last.Name")]), ]
Here is the solution for better performance (using data.table assuming First Name and Last Name are stored in separate columns):
> df <- read.table(text = 'Steve Jones
+ Eric Brown
+ Sally Edwards
+ Steve Jones
+ Eric Davis')
> colnames(df) <- c("First.Name","Last.Name")
> df
First.Name Last.Name
1 Steve Jones
2 Eric Brown
3 Sally Edwards
4 Steve Jones
5 Eric Davis
Here is where data.table specific code begins
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
First.Name Last.Name
1: Steve Jones
2: Eric Brown
3: Sally Edwards
4: Eric Davis
If there is a single column, use sub to remove the substring (i.e. first name) followed by space, get the logical vector (!duplicated(..) based on that to subset the rows of the dataset.
df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
# Col1
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
If it is based on two columns and the dataset have two columns, just do duplicated directly on the dataset to get the logical vector, negate it and subset the rows.
df1[!duplicated(df1), , drop=FALSE]
# first.name second.name
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
try:
!duplicated(paste(File$First.Name,File$Last.Name))

Resources