Merge dataframe with a key value that is contained within a string in a separate dataframe - r

employee <- c('John','Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah')
salary <- c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020')
emp <- data.frame(employee, salary)
benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')
ben <- data.frame(benefit, benefit_id)
Above we have two dataframes: one contains names and a unique ID, the other contains a category and a space-separated list of unique IDs.
What is the most efficient way to merge the ben dataframe with the emp dataframe such that we get the appropriate benefit assigned to each employee?

tidyverse
library(dplyr)
library(tidyr) # for unnest
ben %>%
  mutate(benefit_id = strsplit(benefit_id, "\\s+")) %>%
  unnest(benefit_id) %>%
  left_join(emp, ., by = c(salary = "benefit_id"))
# employee salary benefit
# 1 John VT020 Health
# 2 Peter VT126 Health
# 3 Peter VT126 Bonus
# 4 Gynn VT027 <NA>
# 5 Jolie VT667 Health
# 6 Jolie VT667 Time
# 7 Jolie VT667 Bonus
# 8 Hope VC120 <NA>
# 9 Sue VT000 Bonus
# 10 Jane VA120 <NA>
# 11 Sarah VA020 Health
Depending on your needs, you may also prefer a different join. For instance, use a full_join if you want all pairings, where NA in employee indicates a benefit sans employee.
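For example, the same pipeline ending in a full_join (a sketch; with this particular sample data every benefit_id also occurs as a salary, so the result happens to match the left join):
ben %>%
  mutate(benefit_id = strsplit(benefit_id, "\\s+")) %>%
  unnest(benefit_id) %>%
  full_join(emp, ., by = c(salary = "benefit_id"))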
FYI: if you are running R before 4.0, then you might have factors in your data. To fix that, just convert the factor columns with as.character first. (This can be determined with sapply(ben, inherits, "factor").)
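A minimal sketch of that conversion:
fct_cols <- sapply(ben, inherits, "factor")           # which columns are factors?
ben[fct_cols] <- lapply(ben[fct_cols], as.character)  # convert them to character
(and likewise for emp).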

data.table
library(data.table)
setDT(emp)
ben_long <- setDT(ben)[, list(benefit_id = unlist(strsplit(x = benefit_id, split = " "))), by = benefit]
merge(x = emp, y = ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
salary employee benefit
1: VA020 Sarah Health
2: VA120 Jane <NA>
3: VC120 Hope <NA>
4: VT000 Sue Bonus
5: VT020 John Health
6: VT027 Gynn <NA>
7: VT126 Peter Health
8: VT126 Peter Bonus
9: VT667 Jolie Health
10: VT667 Jolie Time
11: VT667 Jolie Bonus
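Equivalently, a sketch using data.table's bracket-join syntax: ben_long[emp, on = ...] keeps every row of emp, repeats rows where there are multiple matches, and fills NA where there is none (note the join column is then named benefit_id rather than salary):
ben_long[emp, on = .(benefit_id = salary)]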

Related

Add multiple new columns to the dataset, based on another dataset's elements

I have the following products list
> products
# A tibble: 311 x 1
value
<fct>
1 NA
2 Alternativ Economy
3 Ambulant Balance
4 Ambulant Economy
5 Ambulant Premium
6 Ambulant 2
7 Ambulant 3
8 Ambulant 1
9 COMPLETA
10 HOSPITAL ECO
# ... with 301 more rows
and the following df
> df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
+ salary = c(21000, 23400, 26800),
+ startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
> df
employee salary startdate
1 John Doe 21000 2010-11-01
2 Peter Gynn 23400 2008-03-25
3 Jolie Hope 26800 2007-03-14
Now, I want to add the elements of the former (i.e. products) as variables of the latter (i.e. the df). I use
cbind(df, setNames(lapply(products, function(x) x = NA), products))
but I get an error. Can you suggest another way of doing this? What is wrong with my solution? thanks in advance
Here is one solution.
df <- data.frame(employee = c('John Doe','Peter Gynn','Jolie Hope'),
                 salary = c(21000, 23400, 26800),
                 startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14')))
products <- data.frame(value = c(NA, "Alternativ Economy", "COMPLETA"))
#products$value <- ifelse(is.na(products$value), "not_available", as.character(products$value))
cbind(df, `colnames<-`(data.frame(matrix(ncol = nrow(products), nrow = nrow(df))), products$value))
employee salary startdate NA Alternativ Economy COMPLETA
1 John Doe 21000 2010-11-01 NA NA NA
2 Peter Gynn 23400 2008-03-25 NA NA NA
3 Jolie Hope 26800 2007-03-14 NA NA NA
I question the wisdom of having NAs as column names, so I'd uncomment that one line of code in there to replace NAs with some character string instead.
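For example, running the commented line first (a sketch; "not_available" is just a placeholder string):
products$value <- ifelse(is.na(products$value), "not_available", as.character(products$value))
cbind(df, `colnames<-`(data.frame(matrix(ncol = nrow(products), nrow = nrow(df))), products$value))
# the first added column is now named "not_available" instead of NA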

Create unique list of names

I have a list of actors:
name <- c('John Doe','Peter Gynn','Jolie Hope')
age <- c(26 , 32, 56)
postcode <- c('4011', '5600', '7700')
actors <- data.frame(name, age, postcode)
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
I also have an edge list of relations:
from <- c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope')
to <- c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
edge <- data.frame(from, to)
from to
1 John Doe John Doe
2 John Doe John Doe
3 John Doe Peter Gynn
4 Peter Gynn Jolie Hope
5 Peter Gynn Peter Gynn
6 Jolie Hope Frank Smith
First, I want to eliminate self references in my edge list i.e. rows 1,2,5 in my 'edge' dataframe.
non.self.ref <- edge[!(edge$from == edge$to),]
does not produce the desired result.
Second, edge includes a name not in the 'actor' dataframe ('Frank Smith'). I want to add 'Frank Smith' to my 'actor' dataframe, even though I do not have age or postcode data for 'Frank Smith'. For example:
name age postcode
1 John Doe 26 4011
2 Peter Gynn 32 5600
3 Jolie Hope 56 7700
4 Frank Smith NA NA
I would be grateful for a tidy solution!
Here is a tidyverse solution to both parts, though in general try not to ask multiple questions per question.
The first part is fairly simple. filter allows a very intuitive syntax that just specifies you want to keep rows where from isn't equal to to.
The second part is a little more complicated. First we gather up the from and to columns, so all the actors are in one column. Then we use distinct to leave a one-column tibble of unique actor names. Finally, we can use full_join to combine the tables. A full_join keeps all rows and columns from both tables, matching on the shared name column by default, and fills in NA where there is no data (as there isn't for Frank).
library(tidyverse)
actors <- tibble(
  name = c('John Doe','Peter Gynn','Jolie Hope'),
  age = c(26, 32, 56),
  postcode = c('4011', '5600', '7700')
)
edge <- tibble(
  from = c('John Doe','John Doe','John Doe', 'Peter Gynn', 'Peter Gynn', 'Jolie Hope'),
  to = c('John Doe', 'John Doe', 'Peter Gynn', 'Jolie Hope', 'Peter Gynn', 'Frank Smith')
)
edge %>%
  filter(from != to)
#> # A tibble: 3 x 2
#> from to
#> <chr> <chr>
#> 1 John Doe Peter Gynn
#> 2 Peter Gynn Jolie Hope
#> 3 Jolie Hope Frank Smith
edge %>%
  gather("to_from", "name", from, to) %>%
  distinct(name) %>%
  full_join(actors)
#> Joining, by = "name"
#> # A tibble: 4 x 3
#> name age postcode
#> <chr> <dbl> <chr>
#> 1 John Doe 26.0 4011
#> 2 Peter Gynn 32.0 5600
#> 3 Jolie Hope 56.0 7700
#> 4 Frank Smith NA <NA>
Created on 2018-03-02 by the reprex package (v0.2.0).
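Note that since tidyr 1.0.0, gather() has been superseded by pivot_longer(); an equivalent sketch for the second part would be:
edge %>%
  pivot_longer(c(from, to), names_to = "to_from", values_to = "name") %>%
  distinct(name) %>%
  full_join(actors)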
I discovered that by including stringsAsFactors = FALSE, e.g.
edge <- data.frame(from, to, stringsAsFactors = FALSE)
then:
non.self.ref <- edge[!(edge$from == edge$to),]
works! (With the default factors, from and to have different level sets, so == cannot compare them directly.)
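Equivalently, without rebuilding edge, a small sketch that compares the columns as characters:
non.self.ref <- edge[as.character(edge$from) != as.character(edge$to), ]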
An option with dplyr would be to filter the rows by comparing 'from' and 'to' (this gives the first output; it is not needed if we are interested only in the second output), then unlist, get the unique values, convert to a tibble and do a left_join:
library(dplyr)
edge %>%
  filter(from != to) %>% # get the results for the first question
  unlist %>%
  unique %>%
  tibble(name = .) %>%
  left_join(actors) # second output
# A tibble: 4 x 3
# name age postcode
# <chr> <dbl> <fctr>
#1 John Doe 26.0 4011
#2 Peter Gynn 32.0 5600
#3 Jolie Hope 56.0 7700
#4 Frank Smith NA <NA>

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
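The same steps can also be collapsed into one grouped assignment (a sketch, reusing dt from above and assuming missing teams are stored as NA or ''):
dt[, TeamNo := ifelse(is.na(TeamName) | TeamName == "", NA_integer_, .N), by = TeamName]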
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
  unite(Var, Name, Surname) %>%              # paste the columns together
  group_by(TeamName) %>%                     # group by TeamName
  mutate(TeamNo = n_distinct(Var)) %>%       # create the TeamNo column
  separate(Var, into = c('Name', 'Surname')) # split the 'Var' column
Or if it is just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and, if needed, use an ifelse condition to give NA for 'TeamName' values that are '' or NA.
df %>%
  group_by(TeamName) %>%
  mutate(TeamNo = ifelse(is.na(TeamName) | TeamName == '', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose there are '' and NA values: I would first convert the '' to NA and then use ave with FUN = length to get the group sizes, grouping by that column. It will give NA for NA values. For example:
v1 <- c(df$TeamName, NA) # appending an NA to the example to show the case
is.na(v1) <- v1 == ''    # convert the '' to NA
as.numeric(ave(v1, v1, FUN = length))
#[1] 3 2 3 3 2 NA NA
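To attach this as a column of df itself, a minimal sketch along the same lines (assuming TeamName is converted to character first if it is a factor):
v <- as.character(df$TeamName)
is.na(v) <- v == '' # treat blank team names as NA
df$TeamNo <- as.numeric(ave(v, v, FUN = length))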
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

How to rbind when only some of the columns match

I have about 18 dataframes which are essentially frequency counts of the elements stored in the column Rptnames. They all have some elements in common in the Rptnames column and some that differ, so they look like this
dataframe called GroupedTableProportiondelAll
Rptname freq
bob 4324234
jane 433
ham 4324
tim 22
dataframe called GroupedTableProportiondelLUAD
Rptname freq
bob 987
jane 223
jonny 12
jim 98092
I am trying to set up a table so that the Rptname becomes the column and each row is the frequencies. This is so that I can combine all the dataframes.
I have tried the following
GroupedTableProportiondelAll_T <- as.data.frame(t(GroupedTableProportiondelAll))
GroupedTableProportiondelLUAD_T <- as.data.frame(t(GroupedTableProportiondelLUAD))
total <- rbind(GroupedTableProportiondelLUAD_T, GroupedTableProportiondelAll_T)
but I get the error
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
So the question is
a) how can I do rbind (cbind would also do without transposing I suppose) so that the bind can happen without needing to match.
b) would merge be better here
c) in either is there a way to enter zero for empty values
d) Perhaps there's a better way to do this, like matrices, which I'm not really familiar with? I know it's 4 questions but the central question's the same: how to bind when not all the rows or columns are matching.
An alternative to the rbind + dcast technique that would use the tidyverse.
Use pipes (%>%) to first use bind_rows() to bind all your dataframes together while simultaneously creating a dataframe id column (in this case I just called the variable "df"). Then use spread() to move unique "Rptname" values to become column names and spreading the values of "freq" across the new columns. "Rptname" is the key and "freq" is the value in this case.
It would look like this:
Input:
GTP_A
Rptname freq
1 bob 4324234
2 jane 433
3 ham 4324
4 tim 22
GTP_LUAD
Rptname freq
1 bob 987
2 jane 223
3 jonny 12
4 jim 98092
Code:
library(dplyr) # bind_rows
library(tidyr) # spread / pivot_wider
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
  spread(Rptname, freq)
Output:
GroupTable
df bob ham jane jim jonny tim
1 1 4324234 4324 433 NA NA 22
2 2 987 NA 223 98092 12 NA
UPDATE:
As of the release of tidyr 1.0.0 on 2019/09/13, spread() and gather() have been retired and replaced by pivot_wider() and pivot_longer(), respectively. In the release notes, Hadley Wickham states "spread() and gather() won’t go away, but they’ve been retired which means that they’re no longer under active development."
In order to get the same output as above, you will now need to first arrange() by Rptname then use pivot_wider(). If you do not arrange first you will get a similar output but the column order will not be the same as the output from spread().
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
  arrange(Rptname) %>%
  pivot_wider(names_from = Rptname, values_from = freq)
You could first rbind the dataframes after adding a column to identify each data.frame, then use the dcast function from the reshape2 package.
rpt1
## Rptname freq df
## 1 bob 4324234 rpt1
## 2 jane 433 rpt1
## 3 ham 4324 rpt1
## 4 tim 22 rpt1
rpt2
## Rptname freq df
## 1 bob 987 rpt2
## 2 jane 223 rpt2
## 3 jonny 12 rpt2
## 4 jim 98092 rpt2
library(reshape2)
rpt1$df <- "rpt1"
rpt2$df <- "rpt2"
rpt <- rbind(rpt1, rpt2)
dcast(data = rpt, df ~ Rptname, value.var = "freq")
## df bob ham jane tim jim jonny
## 1 rpt1 4324234 4324 433 22 NA NA
## 2 rpt2 987 NA 223 NA 98092 12
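As an aside on parts (a) and (c) of the question: when two frames genuinely share only some column names, both data.table::rbindlist(..., fill = TRUE) and dplyr::bind_rows() stack them and fill the missing columns with NA. A hedged sketch with two hypothetical frames df_a and df_b:
library(data.table)
total <- rbindlist(list(df_a, df_b), fill = TRUE) # columns absent from one frame become NA
Here, though, the transposed frames carry generic positional column names rather than the Rptname values, so the reshape-based answers above remain the better fit.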

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>%
  group_by(person) %>%
  mutate(inc = c(FALSE, diff(level) > 0)) %>%
  summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If NA also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
dd2 = data.frame(unique(dat[!(person %in% dd$person),]$person), NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using a data frame df with columns Person, Date and Level (Person stored as a factor):
sapply(levels(df$Person), function(p) {
  s <- df[df$Person == p, ]
  i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
  ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
  data.frame(Date = sapply(levels(df$Person), function(p) {
    s <- df[df$Person == p, ]
    i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
    ifelse(is.na(i), NA, as.character(s$Date[i]))
  }))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
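Note this assumes Person is a factor (levels() returns NULL for a plain character column, so the sapply would loop over nothing). With character columns, the default from R 4.0 onward, a hedged tweak is to iterate over the unique names instead:
sapply(unique(as.character(df$Person)), function(p) {
  s <- df[df$Person == p, ]
  i <- 1 + nrow(s) - match(TRUE, rev(diff(s$Level) > 0))
  ifelse(is.na(i), NA, as.character(s$Date[i]))
})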
