I have a dataset with 4 variables. They are id, last_name, first_name, and email. Lets say similar ids, last names and first names can be repeated across multiple rows and the way to find out if its the same or different person is by comparing emails. Lets consider the following code and dataset:
id <- c("12-A", "12-A", "12-A", "12-A")
last_name <- c("Mike", "Mike", "Mike", "Mike")
first_name <- c("Tom", "Tom", "Tom", "Tom")
email <- c("tom#yahoo.com", "tom#yahoo.com", "tom#yahoo.com", "tommike#yahoo.com")
ds <- data.frame(id, last_name, first_name, email)
The following is the output:
id last_name first_name email
1 12-A Mike Tom tom#yahoo.com
2 12-A Mike Tom tom#yahoo.com
3 12-A Mike Tom tom#yahoo.com
4 12-A Mike Tom tommike#yahoo.com
Now I want to create an another variable (identity) that input value Yes if given the same id, last_name, first_name, emails are the same, if the email is different as in row # 4, I want the input values to say "No". How can I do that especially using dplyr::mutate with case_when
Thanks
How about if you just concatenate all the variables that you want to constitute an ID and then coerce it to a factor? Would this be ok?
ds$identity <- paste(ds$id, ds$last_name, ds$first_name, ds$email, sep = "-")
ds$identity <- as.numeric(as.factor(ds$identity))
ds
id last_name first_name email identity
1 12-A Mike Tom tom#yahoo.com 1
2 12-A Mike Tom tom#yahoo.com 1
3 12-A Mike Tom tom#yahoo.com 1
4 12-A Mike Tom tommike#yahoo.com 2
Related
I have a list of names from different sources in one data set: one set is organized by FirstName LastName; the other has FullName. I want to see if the first name or the last name is within the full name column, and create a flag. Two questions:
First, I used this solution, but the resulting data doesn't have the right amount of rows, and I'm not sure how to get it to make a flag. I tried to turn it into an ifelse statement, but got another error. How do I fix this so if FirstName is in FullName, I flag True (or 1), otherwise I flag False (or 0)?
Second, I have a few million names, is this an efficient way to do things?
FirstName = c("mary", "paul", "mother", "john", "red", "little", "king")
LastName = c("berry", "hollywood", "theresa", "jones", "rover", "tim", "arthur")
FullName = c("mary berry", "anthony horrowitz", "jennifer lawrence", "john jones", "red rover", "mick jagger", "king arthur")
df = data.frame(FirstName, LastName, FullName)
#attempt 1 and error
df$match_firstname <- df[mapply(grepl, df$FirstName, df$FullName), ]
Error in `$<-.data.frame`(`*tmp*`, match_firstname, value = list(FirstName = c("mary", :
replacement has 4 rows, data has 7
#attempt 2 and error
df$match_firstname <- ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0)
Error in ifelse(df[mapply(grepl, df$FirstName, df$FullName), ], 1, 0) :
'list' object cannot be coerced to type 'logical'
Instead we could use str_detect which is vectorized for both pattern and string whereas in the Map/mapply code, it is looping over each row and thus could be less efficient
library(dplyr)
library(stringr)
df %>%
filter(str_detect(FullName, FirstName))
-output
FirstName LastName FullName
1 mary berry mary berry
2 john jones john jones
3 red rover red rover
4 king arthur king arthur
If we want to add a new binary column, instead of filtering, convert the logical to binary with as.integer or +
df <- df %>%
mutate(match_firstname = +(str_detect(FullName, FirstName)))
-output
FirstName LastName FullName match_firstname
1 mary berry mary berry 1
2 paul hollywood anthony horrowitz 0
3 mother theresa jennifer lawrence 0
4 john jones john jones 1
5 red rover red rover 1
6 little tim mick jagger 0
7 king arthur king arthur 1
The error in the OP's code is because we are assigning a subset of data into a new column in the original dataset which obviously result in length difference
df[mapply(grepl, df$FirstName, df$FullName), ]
FirstName LastName FullName
1 mary berry mary berry
4 john jones john jones
5 red rover red rover
7 king arthur king arthur
Similar to the previous solution, use +
df$match_firstname <- +(mapply(grepl, df$FirstName, df$FullName))
I'm trying to replace every instance of an author's name in a data.frame with a different string, but only when that author was the one speaking. For instance, if we have the data:
test <- data.frame(author = c("jon", "mike", "sam"), text = rep("jon and mike mike and sam sam sam", 3))
I'd like to replace every instance of "sam" with some other text when author=="sam".
I've tried using do and str_replace_all to do this, but haven't gotten it to work:
test %>% group_by(author) %>% do(mutate(., text2 = str_replace(.$text, eval(parse(text = .$author)), "yay")))
str_replace_all is Vectorised over string, pattern and replacement. (see ?str_replace_all), so you can just use the author column as pattern:
test %>% mutate(new_text = str_replace_all(text, author, 'yay'))
# author text new_text
#1 jon jon and mike mike and sam sam sam yay and mike mike and sam sam sam
#2 mike jon and mike mike and sam sam sam jon and yay yay and sam sam sam
#3 sam jon and mike mike and sam sam sam jon and mike mike and yay yay yay
I have a data frame of individuals and their spouses with some personal information (i.e. last names) that I have randomized with plyr::mapvalues in order to protect identities. Here is a reproducible example of how it looked before and after changing the surnames:
# before
d <- data.frame(id = c(1:6),
first_name = c('Jeff', 'Marilyn', 'Gwyn',
'Alice', 'Sam', 'Sarah'),
surname = c('Goldbloom', 'Monroe', 'Paltrow', 'Goldbloom',
'Smith', 'Silverman'),
spouse_id = c(2, 1, 1, 5, 4, "NA"),
spouse = c('Marilyn Monroe', 'Jeff Goldbloom', 'Jeff Goldbloom',
'Sam Smith', 'Alice Goldbloom', 'NA'))
d
> id first_name surname spouse_id spouse
1 Jeff Goldbloom 2 Marilyn Monroe
2 Marilyn Monroe 1 Jeff Goldbloom
3 Gwyn Paltrow 1 Jeff Goldbloom
4 Alice Goldbloom 5 Sam Smith
5 Sam Smith 4 Alice Goldbloom
6 Sarah Silverman NA NA
# replacement names to serve as surnames (doesn't matter what they are, just
that the ratios remain the same as before; mapvalues takes care of this)
repnames <- c("Arman" , "Clovis" , "Garner" , "Casey" , "Birch")
s <- unique(d$surname)
d$surname <- plyr::mapvalues(d$surname, from = s, to = repnames) #replace surnames
# After replacement, the dataframe looks like:
d
> id first_name surname spouse_id spouse
1 Jeff Arman 2 Marilyn Monroe
2 Marilyn Clovis 1 Jeff Goldbloom
3 Gwyn Garner 1 Jeff Goldbloom
4 Alice Arman 5 Sam Smith
5 Sam Casey 4 Alice Goldbloom
6 Sarah Birch NA NA
Each person has his or her own id number, but not all people have spouses. If a person does have a spouse, their spouse's individual id is reflected in the spouse_id column. I did this so that I could filter individuals and their spouses separately later using something like dplyr::filter(d, spouse %in% spouse_id).
My question is, how can I use the relational id and spouse_id columns to re-populate the spouse column so that it reflects the new, randomized surnames? i.e. the final expected output would be:
id first_name surname spouse_id spouse
1 Jeff Arman 2 Marilyn Clovis
2 Marilyn Clovis 1 Jeff Arman
3 Gwyn Garner 1 Jeff Arman
4 Alice Arman 5 Sam Casey
5 Sam Casey 4 Alice Arman
6 Sarah Birch NA NA
...So some concatenation will be involved on the first_name and surname columns. I've never done something quite so conditional in R - in Excel I guess it would be nested VLOOKUP functions...
Thanks, sorry it's so specific but hopefully it presents a fun challenge to someone out there.
Assuming that your NAs are actual NAs, then
d$spouse <- paste(d$first_name, d$surname)[d$spouse_id]
d$spouse
#[1] "Marilyn Clovis" "Jeff Arman" "Jeff Arman" "Sam Casey" "Alice Arman" NA
I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA
Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'))
> df1
name
1 John Doe
2 Jane F. Doe
3 Mark Smith Simpson
4 Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
family size
1 Doe 2
2 Smith 6
I want to merge both data frames in order to get this:
name family size
1 John Doe Doe 2
2 Jane F. Doe Doe 2
3 Mark Smith Simpson Smith 6
4 Sam Lee <NA> NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which is becoming very messy with my real data, which has over 100 "family names":
> df3 <- within(df1, {
family <- ifelse(test = grepl('Doe', name),
yes = 'Doe',
no = ifelse(test = grepl('Smith', name),
yes = 'Smith',
no = NA))
})
> merge(df3, df2, all.x = TRUE)
family name size
1 Doe John Doe 2
2 Doe Jane F. Doe 2
3 Smith Mark Smith Simpson 6
4 <NA> Sam Lee NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.
Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'),
stringsAsFactors = FALSE)
df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
str_df <- function(x) {
ss <- strsplit(unlist(x)," ")
data.frame(family = unlist(ss),stringsAsFactors = FALSE)
}
splitnames <- df1 %>%
group_by(name) %>%
do(str_df(.))
splitnames
name family
1 Jane F. Doe Jane
2 Jane F. Doe F.
3 Jane F. Doe Doe
4 John Doe John
5 John Doe Doe
6 Mark Smith Simpson Mark
7 Mark Smith Simpson Smith
8 Mark Smith Simpson Simpson
9 Sam Lee Sam
10 Sam Lee Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2,splitnames)
Joining by: "family"
family size name
1 Doe 2 Jane F. Doe
2 Doe 2 John Doe
3 Smith 6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!
Here is one strategy, you could use lapply with grep match over all the family names. This will find them at any position. First let me define a helper function
transindex<-function(start=1) {
function(x) {
start<<-start+1
ifelse(x, start-1, NA)
}
}
and I will also be using the function coalesce.R to make things a bit simpler. Here the code i'd run to match up df2 to df1
idx<-do.call(coalesce, lapply(lapply(as.character(df2$family),
function(x) grepl(paste0("\\b", x, "\\b"), as.character(df1$name))),
transindex()))
Starting on the inside and working out, i loop over all the family names in df2 and grep for those values (adding "\b" to the pattern so i match entire words). grepl will return a logical vector (TRUE/FALSE). I then apply the above helper function transindex() to change those vector to be either the index of the row in df2 that matched, or NA. Since it's possible that a row may match more than one family, I simply choose the first using the coalesce helper function.
Not that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, size=df2[idx,])
name family size
# 1 John Doe Doe 2
# 1.1 Jane F. Doe Doe 2
# 2 Mark Smith Simpson Smith 6
# NA Sam Lee <NA> NA
Another apporoach that looks valid, at least with the sample data:
df1name = as.character(df1$name)
df1name
#[1] "John Doe" "Jane F. Doe" "Mark Smith Simpson" "Sam Lee"
regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = T) <- ""
df1name
#[1] "Doe" "Doe" "Smith" ""
cbind(df1, df2[match(df1name, df2$family), ])
# name family size
#1 John Doe Doe 2
#1.1 Jane F. Doe Doe 2
#2 Mark Smith Simpson Smith 6
#NA Sam Lee <NA> NA