Match tables using 2 criteria in R

I have just begun coding in R and am trying to manipulate data, but I have run into the following issue.
I have 2 different tables (simplified).
The first one (player_df) is as follows:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
...
The second table is the salary by club and experience, in millions (salary_df):
experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18
...
I would like to add a new column to the first table, named, say, salary_estimation, which takes 2 variables into account, here experience and club.
For example, for "luc", who plays for "FCB" and has "2" years of experience, the output should be "2.5".
In Excel this is an INDEX/MATCH lookup, but I don't know which function to use in R.
How should I approach the problem?

Data:
df1 <- read.table(text = 'name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18', header = TRUE, stringsAsFactors = FALSE)
Code:
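library('data.table')
# Chelsea is read as integer; convert it so every salary column is numeric before melting
setDT(df2)[, Chelsea := as.numeric(Chelsea)]
# reshape to long format: one row per (experience, Club) pair
df2 <- melt(df2, id.vars = "experience", variable.name = "Club", value.name = "Salary" )
# look up each player's salary; players without a match get NA
df2[df1, on = c("experience", "Club"), nomatch = NA]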
Output:
# experience Club Salary name age Position
# 1: 2 FCB 2.5 luc 18 Goalkeeper
# 2: 9 Real 20.0 jean 26 midfielder
# 3: 14 FCB NA ronaldo 32 Goalkeeper
# 4: 9 Real 20.0 jean 26 midfielder
# 5: 11 Liverpool NA messi 35 midfielder
# 6: 6 Chelsea 10.0 tevez 27 Attack
# 7: 9 Juve 15.0 inzaghi 34 Defender
# 8: 17 Bayern NA kwfni 40 Attack
# 9: 9 Real 20.0 Blabla 25 midfielder
# 10: 11 Liverpool NA wdfood 33 midfielder
# 11: 7 Chelsea 15.0 player2 28 Attack
# 12: 10 Juve 17.0 player3 34 Defender
# 13: 17 Bayern NA fgh 40 Attack

One possible solution is to join the first table (say, player_df) with a long-format version of the second table, salary_df, using experience and club as keys. You can do this with the tidyverse package.
library(tidyverse)
player_df %>%
mutate(Club = str_to_title(Club)) %>%
left_join(
salary_df %>%
pivot_longer(-experience, names_to = "Club", values_to = "salary_estimation") %>%
mutate(Club = str_to_title(Club)) )
# Joining, by = c("experience", "Club")
# # A tibble: 13 x 6
# name experience Club age Position salary_estimation
# <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 luc 2 Fcb 18 Goalkeeper 2.5
# 2 jean 9 Real 26 midfielder 20
# 3 ronaldo 14 Fcb 32 Goalkeeper NA
# 4 jean 9 Real 26 midfielder 20
# 5 messi 11 Liverpool 35 midfielder NA
# 6 tevez 6 Chelsea 27 Attack 10
# 7 inzaghi 9 Juve 34 Defender 15
# 8 kwfni 17 Bayern 40 Attack NA
# 9 Blabla 9 Real 25 midfielder 20
# 10 wdfood 11 Liverpool 33 midfielder NA
# 11 player2 7 Chelsea 28 Attack 15
# 12 player3 10 Juve 34 Defender 17
# 13 fgh 17 Bayern 40 Attack NA
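For readers coming from Excel, the INDEX/MATCH idea can also be sketched in base R with match() and matrix indexing. This is only an illustrative sketch on a cut-down copy of the two tables (three players, three experience levels, two clubs). It assumes club names are spelled identically in both tables, which does not hold for Bayern/BAYERN in the full data.

```r
# Cut-down copies of the two tables from the question (illustrative subset only)
player_df <- data.frame(
  name = c("luc", "jean", "messi"),
  experience = c(2, 9, 11),
  Club = c("FCB", "Real", "Liverpool"),
  stringsAsFactors = FALSE
)
salary_df <- data.frame(
  experience = c(1, 2, 9),
  FCB = c(1.5, 2.5, 14.5),
  Real = c(4, 5, 20)
)

# MATCH step: find the row (experience) and column (club) position for each player
row_idx <- match(player_df$experience, salary_df$experience)
col_idx <- match(player_df$Club, names(salary_df))

# INDEX step: pick one cell per player; a missing row or column yields NA
salary_mat <- as.matrix(salary_df)
player_df$salary_estimation <- salary_mat[cbind(row_idx, col_idx)]

print(player_df)
```

Unlike the join-based answers, this keeps salary_df in its wide layout, at the cost of relying on exact column-name matches.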

Related

How best to do this join in R?

Below is the sample data. I know that I have to do a left join. The question is how to have it only return values that match (indcodelist = indcodelist2) but with the highest codetype value.
indcodelist <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000)
estemp <- c(11,21,31,41,51,61,55,21,22,874)
projemp <- c(15,25,36,45,52,61,31,29,31,899)
nchg <- c(4,4,5,4,1,0,-24,8,9,25)
firsttable <- data.frame(indcodelist,estemp,projemp,nchg)
indcodelist2 <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000,110000,111000,112000,113000)
codetype <- c(18,18,18,18,18,18,18,18,18,18,10,10,10,10)
codetitle <- c("Accountant","Doctor","Lawyer","Teacher","Economist","Financial Analyst","Meteorologist","Dentist", "Editor","Veterinarian","Accounting Technician","Doctor","Lawyer","Teacher")
secondtable <- data.frame(indcodelist2,codetype,codetitle)
tried <- left_join(firsttable,secondtable, by =c(indcodelist = "indcodelist2"))
Desired Result
indcodelist estemp projemp nchg codetitle
110000 11 15 4 Accountant
111000 21 25 4 Doctor
If you only want values that match in both tables, inner_join might be what you’re looking for. You can see this answer to understand different types of joins.
To get the highest codetype, you can use dplyr::slice_max(). Be aware the default behavior is to return values that tie. If there is more than one codetitle at the same codetype, they’ll all be returned.
library(tidyverse)
firsttable %>%
inner_join(., secondtable, by = c("indcodelist" = "indcodelist2")) %>%
group_by(indcodelist) %>%
slice_max(codetype)
#> # A tibble: 10 × 6
#> # Groups: indcodelist [10]
#> indcodelist estemp projemp nchg codetype codetitle
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
Created on 2022-09-15 by the reprex package (v2.0.1)
You might use {powerjoin} :
library(powerjoin)
power_inner_join(
firsttable,
secondtable |> summarize_by_keys(dplyr::across()[which.max(codetype),]),
by = c("indcodelist" = "indcodelist2")
)
#> indcodelist estemp projemp nchg codetype codetitle
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
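If exactly one row per code is required even when several titles share the top codetype, one option is to reduce the second table first and join afterwards. The sketch below uses a shortened version of the sample data and assumes dplyr 1.0.0 or later for slice_max(); which of the tied rows survives with with_ties = FALSE is not something the question specifies.

```r
library(dplyr)

# Shortened sample data (two codes only, for illustration)
firsttable <- data.frame(indcodelist = c(110000, 111000), estemp = c(11, 21))
secondtable <- data.frame(
  indcodelist2 = c(110000, 111000, 110000, 111000),
  codetype = c(18, 18, 10, 10),
  codetitle = c("Accountant", "Doctor", "Accounting Technician", "Doctor")
)

# Keep a single row per code: the one with the highest codetype
best <- secondtable %>%
  group_by(indcodelist2) %>%
  slice_max(codetype, n = 1, with_ties = FALSE) %>%
  ungroup()

# Join the reduced lookup table onto the first table
result <- inner_join(firsttable, best, by = c("indcodelist" = "indcodelist2"))
print(result)
```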

Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.
Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if some fields contain NA (which happens quite often)?
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm    Year  Cow  Duck  Chicken  Sheep  Horse
Farm 1  2020  22   12    100      30     25
Farm 1  2020  NA   12    120      20     20
Farm 1  2019  16   6     80       10     16
Farm 1  2019  12   NA    50       NA     11
Farm 1  2018  8    NA    NA       16     NA
Farm 1  2018  NA   NA    10       13     12
Farm 2  2020  31   28    27       10     14
Farm 2  2020  NA   13    31       20     NA
Farm 2  2019  3    31    NA       20     43
Farm 2  2019  20   50    43       17     42
Farm 2  2018  39   33    NA       48     10
Farm 2  2018  34   20    28       12     12
Farm 3  2020  27   NA    37       30     42
Farm 3  2020  50   9     NA       NA     NA
Farm 3  2019  NA   19    NA       20     16
Farm 3  2019  NA   2     NA       NA     7
Farm 3  2018  NA   NA    5        27     NA
Farm 3  2018  NA   7     43       49     42
If I were to use aggregate(.~Farms + Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.
Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e.
2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 + 20 + 15) / (3) = 15.
The ideal output would be as follows:
Farm    Year  Cow                                 Duck                   Chicken  Sheep  Horse
Farm 1  2020  22 (avg = 22/1 as one entry is NA)  12                     110      25     22.5
Farm 1  2019  14                                  6                      65       10     13.5
Farm 1  2018  8                                   N.A. (as it's all NA)  10       14.5   12
Farm 2  2020  31                                  20.5                   29       15     14
Farm 2  2019  11.5                                40.5                   43       18.5   42.5
Farm 2  2018  36.5                                26.5                   28       30     11
Farm 3  2020  ...                                 ...                    ...      ...    ...
Farm 3  2019  ...                                 ...                    ...      ...    ...
Farm 3  2018  ...                                 ...                    ...      ...    ...
Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.
library(dplyr)
Using aggregate
Data %>%
aggregate(.~Year+Farms,., FUN=mean, na.rm=T, na.action=NULL) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
group_by(Year, Farms) %>%
summarize(MeanCow = mean(Cow, na.rm=T),
MeanDuck = mean(Duck, na.rm=T),
MeanChicken = mean(Chicken, na.rm=T),
MeanSheep = mean(Sheep, na.rm=T),
MeanHorse = mean(Horse, na.rm=T)) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Solution for both
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0
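The same idea can be written without listing every animal by hand. The following is a sketch only, assuming dplyr 1.0.0 or later for across(); mean_or_na is a hypothetical helper name, and the data is a small subset of Farm 1 used for illustration.

```r
library(dplyr)

# Small illustrative subset of the question's data (Farm 1, two years, two animals)
Data <- data.frame(
  Farms = rep("Farm 1", 4),
  Year = c(2020, 2020, 2018, 2018),
  Cow = c(22, NA, 8, NA),
  Duck = c(12, 12, NA, NA)
)

# Hypothetical helper: mean of the non-missing values, or NA when every value is missing
# (mean(x, na.rm = TRUE) alone would return NaN for an all-NA group)
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

result <- Data %>%
  group_by(Farms, Year) %>%
  summarise(across(c(Cow, Duck), mean_or_na), .groups = "drop") %>%
  arrange(Farms, desc(Year))

print(result)
```

Returning NA directly from the helper avoids the separate step that replaces NaN after aggregation.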

Slide along data frame rows and compare rows with next rows

I guess something similar has been asked before, but I could only find answers for Python and SQL. Please let me know in the comments if this has already been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value over n positions and subtracting from it the average value of the next n positions; let's say n=5 for now.
What I tried
I am currently using the method below, but when I apply it to a bigger data frame it takes a huge amount of time, so I wonder whether there is a faster method.
calc <- function( pos ) {
this.five <- df %>% slice(pos:(pos+4))
next.five <- df %>% slice((pos+5):(pos+9))
differ = mean(this.five$value)- mean(next.five$value)
data.frame(dif= differ)
}
df %>%
group_by(position) %>%
do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
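For comparison, the same windowed difference can be sketched in base R using cumulative sums, so that each window mean costs a single subtraction. This is an illustrative sketch, not a benchmarked replacement; roll_mean_left is a hypothetical helper name, the values are typed in from the table in the question, and the helper assumes the vector has at least n elements.

```r
# Values copied from the question's data frame
value <- c(27, 37, 57, 89, 20, 86, 97, 62, 58, 6,
           19, 16, 61, 34, 67, 43, 88, 83, 32, 63)
n <- 5

# Hypothetical helper: mean of x[i:(i + n - 1)] for each i, NA where the window is incomplete
roll_mean_left <- function(x, n) {
  cs <- c(0, cumsum(x))
  starts <- seq_len(length(x) - n + 1)
  out <- rep(NA_real_, length(x))
  out[starts] <- (cs[starts + n] - cs[starts]) / n
  out
}

roll <- roll_mean_left(value, n)

# Mean of the following window: shift the rolling means up by n positions
next_roll <- c(roll[-seq_len(n)], rep(NA_real_, n))

dif <- roll - next_roll
print(data.frame(position = seq_along(value), value, roll, dif))
```

As in the data.table answer, positions whose next window is incomplete come out as NA rather than as a partial mean.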

Creating a column of unique identifiers and sequencing along subjects

I have a data set where I am attempting to sequence along subject observations and then create a column that provides their birth year. The data looks like this:
Name <- c("Joe Smith", "Joe Smith","Joe Smith","Joe Smith", "Tom Watson", "Tom Watson", "Tom Watson", "Carl Nelle", "Carl Nelle", "Carl Nelle", "Carl Nelle", "Joe Smith", "Joe Smith", "Joe Smith", "Joe Smith")
Year <- c(2001, 2002, 2003, 2004, 2014, 2015, 2016, 2006, 2007, 2008, 2009, 1997, 1998, 1999, 2000)
Var1 <- round(rnorm(n = Name, mean = 10, sd = 2),1)
Var2 <- round(rnorm(n = Name, mean = 30, sd = 10),0)
data <- data.frame(Name, Year, Var1, Var2)
data
Name Year Var1 Var2
1 Joe Smith 2001 8.9 23
2 Joe Smith 2002 9.8 45
3 Joe Smith 2003 11.1 43
4 Joe Smith 2004 11.7 63
5 Tom Watson 2014 11.7 47
6 Tom Watson 2015 13.2 28
7 Tom Watson 2016 9.5 30
8 Carl Nelle 2006 9.5 44
9 Carl Nelle 2007 11.2 32
10 Carl Nelle 2008 12.2 24
11 Carl Nelle 2009 5.6 15
12 Joe Smith 1997 10.5 38
13 Joe Smith 1998 10.3 14
14 Joe Smith 1999 9.2 27
15 Joe Smith 2000 7.1 49
I used the dplyr package to create my sequence of each observation for the subjects like so:
data <- data %>%
group_by(Name) %>%
mutate(id = row_number())
Name Year Var1 Var2 id
1 Joe Smith 2001 8.9 23 1
2 Joe Smith 2002 9.8 45 2
3 Joe Smith 2003 11.1 43 3
4 Joe Smith 2004 11.7 63 4
5 Tom Watson 2014 11.7 47 1
6 Tom Watson 2015 13.2 28 2
7 Tom Watson 2016 9.5 30 3
8 Carl Nelle 2006 9.5 44 1
9 Carl Nelle 2007 11.2 32 2
10 Carl Nelle 2008 12.2 24 3
11 Carl Nelle 2009 5.6 15 4
12 Joe Smith 1997 10.5 38 5
13 Joe Smith 1998 10.3 14 6
14 Joe Smith 1999 9.2 27 7
15 Joe Smith 2000 7.1 49 8
My first problem with this is that the second Joe Smith doesn't get his own id number. This is a problem as several people in the dataset can have the same name. Is there a way to correct this?
The second issue is that I need to create a column called "Birth.Year", defined as the first year in which the person appears in the database. So it would look like this:
Name Year Var1 Var2 id Birth.Year
1 Joe Smith 2001 8.9 23 1 2001
2 Joe Smith 2002 9.8 45 2 2001
3 Joe Smith 2003 11.1 43 3 2001
4 Joe Smith 2004 11.7 63 4 2001
5 Tom Watson 2014 11.7 47 1 2014
6 Tom Watson 2015 13.2 28 2 2014
7 Tom Watson 2016 9.5 30 3 2014
8 Carl Nelle 2006 9.5 44 1 2006
9 Carl Nelle 2007 11.2 32 2 2006
10 Carl Nelle 2008 12.2 24 3 2006
11 Carl Nelle 2009 5.6 15 4 2006
12 Joe Smith 1997 10.5 38 5 1997
13 Joe Smith 1998 10.3 14 6 1997
14 Joe Smith 1999 9.2 27 7 1997
15 Joe Smith 2000 7.1 49 8 1997
Is there a way to accomplish these tasks in dplyr or do I need to write a specific function?
Here's a way using the lag function. Note that we need to replace the first instance (which is NA) with FALSE. The use of the lag function allows us to check if the Name matches the previous Name or not.
This solution assumes that if the Names aren't grouped together, they're different people.
data <- data.frame(Name, Year, Var1, Var2, stringsAsFactors = FALSE)
data %>%
mutate(Foo1 = Name != lag(Name),
Foo2 = cumsum(ifelse(is.na(Foo1), FALSE, Foo1))) %>%
group_by(Name, Foo2) %>%
mutate(id = row_number(),
BirthYear = min(Year))
Name Year Var1 Var2 Foo1 Foo2 id BirthYear
<chr> <dbl> <dbl> <dbl> <lgl> <int> <int> <dbl>
1 Joe Smith 2001 9.0 30 NA 0 1 2001
2 Joe Smith 2002 11.8 47 FALSE 0 2 2001
3 Joe Smith 2003 6.9 23 FALSE 0 3 2001
4 Joe Smith 2004 8.6 37 FALSE 0 4 2001
5 Tom Watson 2014 10.7 35 TRUE 1 1 2014
6 Tom Watson 2015 9.4 30 FALSE 1 2 2014
7 Tom Watson 2016 7.5 25 FALSE 1 3 2014
8 Carl Nelle 2006 10.7 32 TRUE 2 1 2006
9 Carl Nelle 2007 6.6 25 FALSE 2 2 2006
10 Carl Nelle 2008 10.9 34 FALSE 2 3 2006
11 Carl Nelle 2009 13.5 18 FALSE 2 4 2006
12 Joe Smith 1997 10.1 34 TRUE 3 1 1997
13 Joe Smith 1998 12.0 34 FALSE 3 2 1997
14 Joe Smith 1999 7.3 40 FALSE 3 3 1997
15 Joe Smith 2000 10.8 26 FALSE 3 4 1997
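The run-detection step can also be sketched in base R without helper columns. Like the answer above, this assumes that a change of name between consecutive rows marks a new person; the shortened vectors below are illustrative only.

```r
# Shortened illustrative data: the second "Joe" block is treated as a different person
Name <- c("Joe", "Joe", "Tom", "Tom", "Tom", "Joe")
Year <- c(2001, 2002, 2014, 2015, 2016, 1997)

# A new run starts at the first row and wherever the name differs from the previous row
run <- cumsum(c(TRUE, Name[-1] != Name[-length(Name)]))

# Sequence within each run, and the earliest year of each run
id <- ave(seq_along(Name), run, FUN = seq_along)
Birth.Year <- ave(Year, run, FUN = min)

print(data.frame(Name, Year, run, id, Birth.Year))
```

The run column also serves as a person identifier, which addresses the concern about different people sharing a name.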

Using mutate() to efficiently create data frame

I have this local data frame:
Source: local data frame [792 x 3]
team player_name g
1 Anaheim PERRY_COREY 31
2 Anaheim GETZLAF_RYAN 22
3 Dallas BENN_JAMIE 25
4 Pittsburgh CROSBY_SIDNEY 20
5 Toronto KESSEL_PHIL 27
6 Edmonton HALL_TAYLOR 16
7 Dallas SEGUIN_TYLER 24
8 Montreal VANEK_THOMAS 19
9 Colorado LANDESKOG_GABRIEL 18
10 Chicago SHARP_PATRICK 22
.. ... ... ..
I want to be able to rank the teams based on their average number of goals (g) per player. Here is what I did (really feels suboptimal):
library(dplyr)
d1 <- select(df, team, g, player_name)
c1 <- count(d1, team, wt = g)
c2 <- count(d1, team, wt = n_distinct(player_name))
c3 <- cbind(c1, c2[,2])
c4 <- c3[,2] / c3[,3]
c5 <- cbind(c3, c4)
colnames(c5) <- c("team", "ttgpt", "ttnp", "agpp")
c6 <- mutate(c5, rank = row_number(desc(c4)))
c7 <- filter(c6, rank <=10)
c8 <- arrange(c7, rank)
And here is the result of c8:
team ttgpt ttnp agpp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY_Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San_Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St._Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
I would like to recreate this table with consistent use of %>%
See CSV for a reproducible example: playerstats.csv
Ok from what you said:
library(dplyr)
df <- read.csv("../Downloads/playerstats.csv", header = TRUE, sep = ",")
# total goals, number of distinct players, and average goals per player, by team
df %>%
group_by(Team) %>%
summarise(ttgp = sum(G), ttnp = n_distinct(Player.Name), agp = sum(G) / n_distinct(Player.Name)) %>%
mutate(rank = rank(desc(agp))) %>%
filter(rank <= 10) %>%
arrange(rank)
Source: local data frame [10 x 5]
Team ttgp ttnp agp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St. Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
Note that I am not sure what you mean by ttgpt and ttnp, so I had to guess.
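Since playerstats.csv is not included here, the pipeline can be tried on a toy data frame. This is a sketch under assumptions: the rows below are invented purely for illustration, the column names follow the question (team, player_name, g), and min_rank() is swapped in for rank() so that ties would share the lower rank.

```r
library(dplyr)

# Invented toy data, only to make the pipeline runnable
stats <- data.frame(
  team = c("A", "A", "B", "B", "B"),
  player_name = c("p1", "p2", "p3", "p4", "p5"),
  g = c(10, 20, 6, 6, 6)
)

ranking <- stats %>%
  group_by(team) %>%
  summarise(
    ttgpt = sum(g),                  # total goals per team
    ttnp = n_distinct(player_name),  # number of distinct players per team
    agpp = ttgpt / ttnp              # average goals per player
  ) %>%
  mutate(rank = min_rank(desc(agpp))) %>%
  filter(rank <= 10) %>%
  arrange(rank)

print(ranking)
```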

Resources