I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.
Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if the field of NA exist (which does happen quite often)?
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm
Year
Cow
Duck
Chicken
Sheep
Horse
Farm 1
2020
22
12
100
30
25
Farm 1
2020
NA
12
120
20
20
Farm 1
2019
16
6
80
10
16
Farm 1
2019
12
NA
50
NA
11
Farm 1
2018
8
NA
NA
16
NA
Farm 1
2018
NA
NA
10
13
12
Farm 2
2020
31
28
27
10
14
Farm 2
2020
NA
13
31
20
NA
Farm 2
2019
3
31
NA
20
43
Farm 2
2019
20
50
43
17
42
Farm 2
2018
39
33
NA
48
10
Farm 2
2018
34
20
28
12
12
Farm 3
2020
27
NA
37
30
42
Farm 3
2020
50
9
NA
NA
NA
Farm 3
2019
NA
19
NA
20
16
Farm 3
2019
NA
2
NA
NA
7
Farm 3
2018
NA
NA
5
27
NA
Farm 3
2018
NA
7
43
49
42
If I were to use aggregate(.~Farms + Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.
Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e.
2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 + 20 + 15) / (3) = 15.
The ideal output will be as follow:
Farm
Year
Cow
Duck
Chicken
Sheep
Horse
Farm 1
2020
22 (avg = 22/1 as one entry is NA)
12
110
25
22.5
Farm 1
2019
14
6
65
10
13.5
Farm 1
2018
8
N.A. (as it's all NA)
10
14.5
12
Farm 2
2020
31
20.5
29
15
14
Farm 2
2019
11.5
40.5
43
18.5
42.5
Farm 2
2018
36.5
26.5
28
30
11
Farm 3
2020
...
...
...
...
...
Farm 3
2019
...
...
...
...
...
Farm 3
2018
...
...
...
...
...
Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.
library(dplyr)
Using aggregate
Data %>%
aggregate(.~Year+Farms,., FUN=mean, na.rm=T, na.action=NULL) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
group_by(Year, Farms) %>%
summarize(MeanCow = mean(Cow, na.rm=T),
MeanDuck = mean(Duck, na.rm=T),
MeanChicken = mean(Chicken, na.rm=T),
MeanSheep = mean(Sheep, na.rm=T),
MeanHorse = mean(Horse, na.rm=T)) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Solution for both
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0
I just begin coding in R , I am trying to manipulate data but I have an issue which is the following:
I have 2 different tables (simplified )
the first one (player_df) is as follows:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
...
The second table is the salary by club and experience in million (salary_df)
*experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18
..*
I would like to add a new column to my data in the first table named let say salary_estimation, and which takes into consideration 2 variables for example here experience and the club.
For example for "luc" who plays in "FCB" and has "2" years experience the output should be "2.5"
In excel its an index / match function, but in R I don't know which function should I use .
How should I approach the problem ?
Data:
df1 <- read.table(text = 'name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18', header = TRUE, stringsAsFactors = FALSE)
Code:
library('data.table')
setDT(df2)[, Chelsea := as.numeric(Chelsea)]
df2 <- melt(df2, id.vars = "experience", variable.name = "Club", value.name = "Salary" )
df2[df1, on = c("experience", "Club"), nomatch = NA]
Output:
# experience Club Salary name age Position
# 1: 2 FCB 2.5 luc 18 Goalkeeper
# 2: 9 Real 20.0 jean 26 midfielder
# 3: 14 FCB NA ronaldo 32 Goalkeeper
# 4: 9 Real 20.0 jean 26 midfielder
# 5: 11 Liverpool NA messi 35 midfielder
# 6: 6 Chelsea 10.0 tevez 27 Attack
# 7: 9 Juve 15.0 inzaghi 34 Defender
# 8: 17 Bayern NA kwfni 40 Attack
# 9: 9 Real 20.0 Blabla 25 midfielder
# 10: 11 Liverpool NA wdfood 33 midfielder
# 11: 7 Chelsea 15.0 player2 28 Attack
# 12: 10 Juve 17.0 player3 34 Defender
# 13: 17 Bayern NA fgh 40 Attack
One of possible solutions is by joining first table (let say it is player_df) with "long format" of second table salary_df using experience and club as keys. You can do it by using tidyverse package.
library(tidyverse)
player_df %>%
mutate(Club = str_to_title(Club)) %>%
left_join(
salary_df %>%
pivot_longer(-experience, names_to = "Club", values_to = "salary_estimation") %>%
mutate(Club = str_to_title(Club)) )
# Joining, by = c("experience", "Club")
# # A tibble: 13 x 6
# name experience Club age Position salary_estimation
# <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 luc 2 Fcb 18 Goalkeeper 2.5
# 2 jean 9 Real 26 midfielder 20
# 3 ronaldo 14 Fcb 32 Goalkeeper NA
# 4 jean 9 Real 26 midfielder 20
# 5 messi 11 Liverpool 35 midfielder NA
# 6 tevez 6 Chelsea 27 Attack 10
# 7 inzaghi 9 Juve 34 Defender 15
# 8 kwfni 17 Bayern 40 Attack NA
# 9 Blabla 9 Real 25 midfielder 20
# 10 wdfood 11 Liverpool 33 midfielder NA
# 11 player2 7 Chelsea 28 Attack 15
# 12 player3 10 Juve 34 Defender 17
# 13 fgh 17 Bayern 40 Attack NA
I would like to filter a dataframe based on its date column. I would like to keep the rows where I have at least 3 consecutive days. I would like to do this as effeciently and quickly as possible, so if someone has a vectorized approached it would be good.
I tried to inspire myself from the following link, but it didn't really go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop, I managed to put an indicator on the dates who are not consecutive, but it didn't give me the desired result, because it keeps all dates that are in a row even if they are less than 3 in a row.
tf is my dataframe
for(i in 2:(nrow(tf)-1)){
if(tf$Date[i] != tf$Date[i+1] %m+% days(-1)){
if(tf$Date[i] != tf$Date[i-1] %m+% days(1)){
tf$Date[i] = as.Date(0)
}
}
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
df %>%
mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
diff = c(0, diff(Date))) %>%
group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
filter(n() >= 3) %>%
ungroup() %>%
select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
rles <- rle(cumsum(c(1,diff(DF$Date)!=1)))
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
Similar approach in dplyr
DF%>%
mutate(Date = as.Date(Date))%>%
add_count(IDs = cumsum(c(1, diff(Date) !=1)))%>%
filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3
This question already has answers here:
Filter rows in R based on values in multiple rows
(2 answers)
Closed 5 years ago.
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009 but then I want to remove the rows of those countries in 2017 as well. I would like to get the following results:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can do any after grouping by 'country'
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
I have this local data frame:
Source: local data frame [792 x 3]
team player_name g
1 Anaheim PERRY_COREY 31
2 Anaheim GETZLAF_RYAN 22
3 Dallas BENN_JAMIE 25
4 Pittsburgh CROSBY_SIDNEY 20
5 Toronto KESSEL_PHIL 27
6 Edmonton HALL_TAYLOR 16
7 Dallas SEGUIN_TYLER 24
8 Montreal VANEK_THOMAS 19
9 Colorado LANDESKOG_GABRIEL 18
10 Chicago SHARP_PATRICK 22
.. ... ... ..
I want to be able to rank the teams based on their average number of goals (g) per player. Here is what I did (really feels suboptimal):
library(dplyr)
d1 <- select(df, team, g, player_name)
c1 <- count(d1, team, wt = g)
c2 <- count(d1, team, wt = n_distinct(player_name))
c3 <- cbind(c1, c2[,2])
c4 <- c3[,2] / c3[,3]
c5 <- cbind(c3, c4)
colnames(c5) <- c("team", "ttgpt", "ttnp", "agpp")
c6 <- mutate(c5, rank = row_number(desc(c4)))
c7 <- filter(c6, rank <=10)
c8 <- arrange(c7, rank)
And here is the result of c8:
team ttgpt ttnp agpp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY_Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San_Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St._Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
I would like to recreate this table with consistent use of %>%
See CSV for reproductible example: playerstats.csv
Ok from what you said:
df<-read.csv("../Downloads/playerstats.csv",header=T,sep=",")
df %>% group_by(Team)
%>% summarise(ttgp=sum(G),ttnp=n_distinct(Player.Name),agp=sum(G)/n_distinct(Player.Name))
%>% mutate(rank=rank(desc(agp)))
%>% filter(rank<=10)
%>% arrange(rank)
Source: local data frame [10 x 5]
Team ttgp ttnp agp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St. Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
Note that I am not sure what you mean with ttgpt and ttnp. Therefore, I tried to guess it.