I have this local data frame:
Source: local data frame [792 x 3]
team player_name g
1 Anaheim PERRY_COREY 31
2 Anaheim GETZLAF_RYAN 22
3 Dallas BENN_JAMIE 25
4 Pittsburgh CROSBY_SIDNEY 20
5 Toronto KESSEL_PHIL 27
6 Edmonton HALL_TAYLOR 16
7 Dallas SEGUIN_TYLER 24
8 Montreal VANEK_THOMAS 19
9 Colorado LANDESKOG_GABRIEL 18
10 Chicago SHARP_PATRICK 22
.. ... ... ..
I want to be able to rank the teams based on their average number of goals (g) per player. Here is what I did (really feels suboptimal):
library(dplyr)
d1 <- select(df, team, g, player_name)
c1 <- count(d1, team, wt = g)
c2 <- count(d1, team, wt = n_distinct(player_name))
c3 <- cbind(c1, c2[,2])
c4 <- c3[,2] / c3[,3]
c5 <- cbind(c3, c4)
colnames(c5) <- c("team", "ttgpt", "ttnp", "agpp")
c6 <- mutate(c5, rank = row_number(desc(c4)))
c7 <- filter(c6, rank <=10)
c8 <- arrange(c7, rank)
And here is the result of c8:
team ttgpt ttnp agpp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY_Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San_Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St._Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
I would like to recreate this table with consistent use of %>%
See CSV for a reproducible example: playerstats.csv
Ok from what you said:
df<-read.csv("../Downloads/playerstats.csv",header=T,sep=",")
# Each %>% must end a line; a line that starts with %>% is a syntax error
df %>%
  group_by(Team) %>%
  summarise(ttgp = sum(G), ttnp = n_distinct(Player.Name), agp = sum(G) / n_distinct(Player.Name)) %>%
  mutate(rank = rank(desc(agp))) %>%
  filter(rank <= 10) %>%
  arrange(rank)
Source: local data frame [10 x 5]
Team ttgp ttnp agp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St. Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
Note that I am not sure what you mean by ttgpt and ttnp, so I guessed from your code: ttgp is the team's total goals and ttnp is its number of distinct players.
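For readers without the CSV, the same pipeline can be sketched on a tiny made-up data frame. The `toy` data and its values are invented for illustration. `min_rank()` is used here instead of `rank()` so that tied teams share the lowest rank rather than an averaged one; that is a choice, not something the question asked for.

```r
library(dplyr)

# Hypothetical toy data; not taken from playerstats.csv
toy <- data.frame(
  team        = c("A", "A", "B", "B", "B", "C"),
  player_name = c("p1", "p2", "p3", "p4", "p5", "p6"),
  g           = c(10, 20, 6, 6, 6, 12)
)

top_teams <- toy %>%
  group_by(team) %>%
  summarise(
    ttgpt = sum(g),                  # total goals per team
    ttnp  = n_distinct(player_name), # number of distinct players
    agpp  = ttgpt / ttnp             # average goals per player
  ) %>%
  mutate(rank = min_rank(desc(agpp))) %>%
  filter(rank <= 2) %>%              # keep the top 2 teams of this toy set
  arrange(rank)

print(top_teams)
```

Changing `rank <= 2` to `rank <= 10` gives the top-ten table from the question.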
Below is the sample data. I know that I have to do a left join. The question is how to have it only return values that match (indcodelist = indcodelist2) but with the highest codetype value.
indcodelist <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000)
estemp <- c(11,21,31,41,51,61,55,21,22,874)
projemp <- c(15,25,36,45,52,61,31,29,31,899)
nchg <- c(4,4,5,4,1,0,-24,8,9,25)
firsttable <- data.frame(indcodelist,estemp,projemp,nchg)
indcodelist2 <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000,110000,111000,112000,113000)
codetype <- c(18,18,18,18,18,18,18,18,18,18,10,10,10,10)
codetitle <- c("Accountant","Doctor","Lawyer","Teacher","Economist","Financial Analyst","Meteorologist","Dentist", "Editor","Veterinarian","Accounting Technician","Doctor","Lawyer","Teacher")
secondtable <- data.frame(indcodelist2,codetype,codetitle)
tried <- left_join(firsttable,secondtable, by =c(indcodelist = "indcodelist2"))
Desired Result
indcodelist estemp projemp nchg codetitle
110000 11 15 4 Accountant
111000 21 25 4 Doctor
If you only want values that match in both tables, inner_join might be what you’re looking for. You can see this answer to understand different types of joins.
To get the highest codetype, you can use dplyr::slice_max(). Be aware the default behavior is to return values that tie. If there is more than one codetitle at the same codetype, they’ll all be returned.
library(tidyverse)
firsttable %>%
inner_join(., secondtable, by = c("indcodelist" = "indcodelist2")) %>%
group_by(indcodelist) %>%
slice_max(codetype)
#> # A tibble: 10 × 6
#> # Groups: indcodelist [10]
#> indcodelist estemp projemp nchg codetype codetitle
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
Created on 2022-09-15 by the reprex package (v2.0.1)
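If exactly one row per code is wanted even when several titles share the top codetype, `slice_max()` accepts `with_ties = FALSE`. Here is a minimal sketch on made-up data; the `lookup` table and its column names are hypothetical and only there to force a tie.

```r
library(dplyr)

# Hypothetical lookup table: code 1 has two titles tied at codetype 18
lookup <- data.frame(
  code     = c(1, 1, 1, 2),
  codetype = c(18, 18, 10, 18),
  title    = c("Accountant", "Auditor", "Clerk", "Doctor")
)

best <- lookup %>%
  group_by(code) %>%
  slice_max(codetype, n = 1, with_ties = FALSE) %>%  # one row per code
  ungroup()

print(best)
```

Which of the tied rows survives depends on row order, so sort the table first if a particular title should win.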
You might use {powerjoin} :
library(powerjoin)
power_inner_join(
firsttable,
secondtable |> summarize_by_keys(dplyr::across()[which.max(codetype),]),
by = c("indcodelist" = "indcodelist2")
)
#> indcodelist estemp projemp nchg codetype codetitle
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
1             56      1        5
2             42      12       16
3             6523    23       56
4             123     23       102
5             56      23       64
6             245623  56       87
7             546     25       540
8             54566   253      560

Product 2019  Number  Minimum  Maximum
1             56      32       53
2             642     423      620
3             56423   432      560
4             3       431      802
5             2       2        6
6             4523    43       68
7             555     23       54
8             55646   3        6

Product 2020  Number  Minimum  Maximum
1             23      2        5
2             342     4        16
3             223     3        5
4             13      4        12
5             2       4        7
6             223     7        8
7             5       34       50
8             46      3        6

Product 2021  Number  Minimum  Maximum
1             234     3        5
2             3242    4        16
3             2423    43       56
4             123     43       102
5             24      4        6
6             2423    4        18
7             565     234      540
8             5646    23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The products are the same for all years, so I would like a data frame that contains the number for each year as a column and combines the minimum and maximum columns into one.
Any help is welcome!
How about something like this? You are trying to join several data frames by a single column, which is relatively straightforward using full_join. The difficulty is that you are also trying to extract information from the column names and combine several columns at the same time. I would map out everything you want to do to each data frame and then reduce the list of data frames at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product\\s*", "", names(.x)[1])) %>% # also drop the space so names become Number_2018, not "Number_ 2018"
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
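The snippet above assumes df1 to df4 already exist. As a self-contained sketch, here are two of the question's tables typed in by hand and pushed through the same bind_rows / pivot_wider idea. The names `p2018`, `p2019` and `wide` are made up for this example, and `names_vary` needs a reasonably recent tidyr (I believe 1.2.0 or later).

```r
library(dplyr)
library(tidyr)
library(purrr)

# Two of the question's tables, typed in by hand
p2018 <- tibble(
  `Product 2018` = 1:8,
  Number  = c(56, 42, 6523, 123, 56, 245623, 546, 54566),
  Minimum = c(1, 12, 23, 23, 23, 56, 25, 253),
  Maximum = c(5, 16, 56, 102, 64, 87, 540, 560)
)
p2019 <- tibble(
  `Product 2019` = 1:8,
  Number  = c(56, 642, 56423, 3, 2, 4523, 555, 55646),
  Minimum = c(32, 423, 432, 431, 2, 43, 23, 3),
  Maximum = c(53, 620, 560, 802, 6, 68, 54, 6)
)

wide <- list(p2018, p2019) %>%
  map(function(d) {
    # The year is whatever digits appear in the first column name
    yr <- gsub("\\D", "", names(d)[1])
    d %>%
      rename(Product = 1) %>%   # rename the first column by position
      mutate(Year = yr)
  }) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(
    id_cols     = Product,
    names_from  = Year,
    values_from = c(Number, Min_Max),
    names_vary  = "slowest"     # keep each year's columns next to each other
  )

print(wide)
```

Adding the 2020 and 2021 tables to the list extends the result with two more column pairs.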
I have a problem with the humans here; they're giving me Citizen Science data in spreadsheets formatted to be attractive and legible. I figured out the right sequence of pivots _longer and _wider to get it into an analyzable format but first I had to do a whole bunch of hand edits to make the column labels usable. I've just been given a corrected spreadsheet so now I have to do the same hand edits all over. Can I avoid this?
reprex <- read_csv("reprex.csv", col_names = FALSE)
gives:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 2014 NA NA 2015 NA NA 2016 NA
2 NA Total F M Total F M Total F M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB NA NA NA 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want column labels like "2014 Total", "2014 F", ... like so:
Location `2014 Total` `2014 F` `2014 M` `2015 Total` `2015 F` `2015 M` `2016 Total` `2016 F` `2016 M`
1 SiteA 180 92 88 134 40 94 34 20 14
2 SiteB NA NA NA 247 143 104 8 8 0
3 SiteC 237 194 43 220 95 125 62 45 17
...which would allow me to twist it up until I get to something like:
Location date Total F M
1 SiteA 2014 180 92 88
2 SiteB 2014 NA NA NA
3 SiteC 2014 237 194 43
4 SiteA 2015 134 40 94
5 SiteB 2015 247 143 104
6 SiteC 2015 220 95 125
7 SiteA 2016 34 20 14
8 SiteB 2016 8 8 0
9 SiteC 2016 62 45 17
The part from the second table to the third I've got; the problem is in how to get from the first table to the second. It would seem like you could pivot the first and then fill in the missing dates with fill(.direction="updown") except that the dates are the grouping value you need to be following.
For this example, we could do it like this (df here is the data frame read in with col_names = FALSE):
library(tidyverse)
df_helper <- df %>%
slice(1:2) %>%
pivot_longer(cols= everything()) %>%
fill(value, .direction = "up") %>%
mutate(x = lead(value, 11)) %>%
drop_na() %>%
unite("name", c(value, x), sep = " ", remove = FALSE) %>%
pivot_wider(names_from = name)
df %>%
setNames(names(df_helper)) %>%
rename(Location = x) %>%
slice(-c(1:2))
Location 2014 Total 2014 F 2014 M 2015 Total 2015 F 2015 M 2016 Total 2016 F 2016 M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB <NA> <NA> <NA> 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
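An alternative sketch builds the labels directly from the two header rows in base R. It rests on two assumptions that may not hold for the real spreadsheet: every year heads a block of exactly three columns (Total, F, M), and the first column holds the site name. The `raw` data frame is typed in to mirror part of the reprex; nothing here reads the actual file.

```r
# Stand-in for read_csv("reprex.csv", col_names = FALSE); all columns are text
raw <- data.frame(
  X1 = c(NA, NA, "SiteA", "SiteB", "SiteC"),
  X2 = c(NA, "Total", "180", NA, "237"),
  X3 = c("2014", "F", "92", NA, "194"),
  X4 = c(NA, "M", "88", NA, "43"),
  X5 = c(NA, "Total", "134", "247", "220"),
  X6 = c("2015", "F", "40", "143", "95"),
  X7 = c(NA, "M", "94", "104", "125"),
  stringsAsFactors = FALSE
)

years  <- unlist(raw[1, -1])   # first header row, minus the site column
labels <- unlist(raw[2, -1])   # second header row: Total / F / M

# Assumption: each year spans three consecutive columns
years_filled <- rep(years[!is.na(years)], each = 3)

clean <- raw[-(1:2), ]                              # drop the two header rows
names(clean) <- c("Location", paste(years_filled, labels))
clean[-1] <- lapply(clean[-1], as.numeric)          # counts back to numbers
rownames(clean) <- NULL

print(clean)
```

Because the year sits over the middle column of each block, repeating the non-missing years sidesteps the fill direction problem described in the question.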
I have just started coding in R. I am trying to manipulate some data and have run into the following issue.
I have 2 different tables (simplified).
The first one (player_df) is as follows:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
...
The second table (salary_df) gives the salary, in millions, by club and experience:
experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18
...
I would like to add a new column to the first table, named, say, salary_estimation, whose value depends on 2 variables, here the experience and the club.
For example, for "luc", who plays for "FCB" and has "2" years of experience, the output should be "2.5".
In Excel this would be an INDEX/MATCH formula, but I don't know which function to use in R.
How should I approach the problem?
Data:
df1 <- read.table(text = 'name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18', header = TRUE, stringsAsFactors = FALSE)
Code:
library('data.table')
setDT(df2)[, Chelsea := as.numeric(Chelsea)]
df2 <- melt(df2, id.vars = "experience", variable.name = "Club", value.name = "Salary" )
df2[df1, on = c("experience", "Club"), nomatch = NA]
Output:
# experience Club Salary name age Position
# 1: 2 FCB 2.5 luc 18 Goalkeeper
# 2: 9 Real 20.0 jean 26 midfielder
# 3: 14 FCB NA ronaldo 32 Goalkeeper
# 4: 9 Real 20.0 jean 26 midfielder
# 5: 11 Liverpool NA messi 35 midfielder
# 6: 6 Chelsea 10.0 tevez 27 Attack
# 7: 9 Juve 15.0 inzaghi 34 Defender
# 8: 17 Bayern NA kwfni 40 Attack
# 9: 9 Real 20.0 Blabla 25 midfielder
# 10: 11 Liverpool NA wdfood 33 midfielder
# 11: 7 Chelsea 15.0 player2 28 Attack
# 12: 10 Juve 17.0 player3 34 Defender
# 13: 17 Bayern NA fgh 40 Attack
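One possible solution is to join the first table (call it player_df) with a long-format version of the second table (salary_df), using experience and Club as keys. You can do this with the tidyverse packages. Club names are passed through str_to_title() on both sides so that spellings such as BAYERN and Bayern match.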
library(tidyverse)
player_df %>%
mutate(Club = str_to_title(Club)) %>%
left_join(
salary_df %>%
pivot_longer(-experience, names_to = "Club", values_to = "salary_estimation") %>%
mutate(Club = str_to_title(Club)) )
# Joining, by = c("experience", "Club")
# # A tibble: 13 x 6
# name experience Club age Position salary_estimation
# <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 luc 2 Fcb 18 Goalkeeper 2.5
# 2 jean 9 Real 26 midfielder 20
# 3 ronaldo 14 Fcb 32 Goalkeeper NA
# 4 jean 9 Real 26 midfielder 20
# 5 messi 11 Liverpool 35 midfielder NA
# 6 tevez 6 Chelsea 27 Attack 10
# 7 inzaghi 9 Juve 34 Defender 15
# 8 kwfni 17 Bayern 40 Attack NA
# 9 Blabla 9 Real 25 midfielder 20
# 10 wdfood 11 Liverpool 33 midfielder NA
# 11 player2 7 Chelsea 28 Attack 15
# 12 player3 10 Juve 34 Defender 17
# 13 fgh 17 Bayern 40 Attack NA
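Since the question compares this to Excel's INDEX/MATCH, a closer base-R analogue is matrix indexing combined with `match()`. This is only a sketch on a cut-down copy of the two tables; the object names `players`, `salary` and `sal_mat` are made up here.

```r
# Cut-down copies of the question's tables
players <- data.frame(
  name       = c("luc", "jean", "ronaldo", "messi"),
  experience = c(2, 9, 14, 11),
  Club       = c("FCB", "Real", "FCB", "Liverpool"),
  stringsAsFactors = FALSE
)
salary <- data.frame(
  experience = c(1, 2, 9),
  FCB  = c(1.5, 2.5, 14.5),
  Real = c(4, 5, 20)
)

# Salary grid as a matrix: rows = experience levels, columns = clubs
sal_mat <- as.matrix(salary[-1])

row_idx <- match(players$experience, salary$experience)  # like MATCH on rows
col_idx <- match(players$Club, colnames(sal_mat))        # like MATCH on columns

# Indexing with a two-column matrix picks one cell per player (like INDEX);
# a pair that contains NA yields NA
players$salary_estimation <- sal_mat[cbind(row_idx, col_idx)]

print(players)
```

Players whose club or experience is missing from the grid get NA, the same outcome as in the join-based answers.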
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return the maximum value of the batting average ('average') column among rows where the at-bats ('AB') are greater than 100. The average column also contains 'NaN' values.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only rows where AB is greater than or equal to 100, and the second keeps the row(s) where avg equals its maximum. If you only want the maximum value itself, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
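For comparison, here is a base-R sketch of the same idea that uses the question's column names and its strict `AB > 100` condition. The `batting` data frame is a made-up stand-in, not the real table. It relies on `which.max()` discarding NA and NaN values.

```r
# Made-up stand-in for the batting table
batting <- data.frame(
  playerID = c("a01", "b01", "c01", "d01"),
  AB       = c(716, 90, 150, 300),
  average  = c(0.296, 0.400, NaN, 0.326)
)

eligible <- batting[batting$AB > 100, ]               # strictly more than 100 at-bats
best_row <- eligible[which.max(eligible$average), ]   # which.max() skips NA/NaN

print(best_row)
```

Note that `which.max()` returns only the first maximum, so tied players would need the `filter(avg == max(...))` form shown earlier.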