Aggregating two rows based on condition of different ID in R - r

I am dealing with a dataset of player statistics for a sport. There is an error in the data: in one week, a player who doesn't exist has been credited with data that belongs to a real player. I need to aggregate the two players' data and delete the false player's row.
I need to adjust my preprocessing code to handle this so that when I scrape future weeks' data I don't need to make manual adjustments.
df <- data.frame(Name = c("Bob","Ben","Bill"),
                 Team = c("Dogs","Cats","Birds"),
                 Runs = c(6, 4, 2))
I'd like to do something along the lines of aggregating the two rows based on their df$Name, e.g. when df$Name is "Bob" or "Bill", aggregate columns [3:40]. These are my columns with numeric statistics; columns [1:2] hold df$Name and df$Team.

It would depend on the type of aggregation you are trying to do. This looks like a good use of group_by() from the dplyr package. Consider the built-in CO2 data set.
library(dplyr)
CO2 %>%
group_by(Plant) %>%
summarise(
n = n(), #Calculate number of rows in each group
meanUptake = mean(uptake) # Aggregate data and take mean for each group
) %>%
ungroup()
Here we summarise within each group; in your case the grouping variable would be Name. If you wish to keep extra information (like Team) in the result, include it within summarise() or add it to group_by().
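Applied to the data frame in the question, the same idea could look like the sketch below. It assumes that "Bill" is the non-existent player and that his statistics belong to "Bob" (the question does not say which name is which), and it assumes dplyr 1.0 or later for across(). The false name and team are first recoded to the real player's, after which the two rows fall into the same group and are summed.

```r
library(dplyr)

df <- data.frame(Name = c("Bob", "Ben", "Bill"),
                 Team = c("Dogs", "Cats", "Birds"),
                 Runs = c(6, 4, 2),
                 stringsAsFactors = FALSE)

# Assumption for illustration: "Bill" is the false player, "Bob" the real one
false_name <- "Bill"
real_name  <- "Bob"

fixed <- df %>%
  # Give the false row the real player's team first, then the real name
  mutate(Team = if_else(Name == false_name, Team[match(real_name, Name)], Team),
         Name = if_else(Name == false_name, real_name, Name)) %>%
  group_by(Name, Team) %>%
  # Sum every numeric statistic column within each player
  summarise(across(where(is.numeric), sum), .groups = "drop")

fixed
```

Because the recoding lives in the pipeline, the same code can be rerun on future weeks' scrapes; if the false name is absent in a given week, the mutate() step simply changes nothing.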

Related

Create new dataframe column in R that conditions on row values without iterating?

So let's say I have the following dataframe "df":
names <- c("Bob","Mary","Ben","Lauren")
number <- c(1:4)
age <- c(20,33,34,45)
df <- data.frame(names,number,age)
Let's say I have another dataframe ("df2") with thousands of people and I want to sum the income of people in that other dataframe that have the given name, number and age of each row in "df". That is, for each row "i" of "df", I want to create a fourth column "TotalIncome" that is the sum of the income of all the people with the given name, age and number in dataframe "df2". In other words, for each row "i":
df$TotalIncome[i] <- sum(
  df2$Income[df2$Name == df$names[i] &
             df2$Numbers == df$number[i] &
             df2$Age == df$age[i]], na.rm=TRUE)
Is there a way to do this without having to iterate in a for loop for each row "i" and perform the above code? Is there a way to use apply() to calculate this for the entire vector rather than only iterating each line individually? The actual dataset I am working with is huge and iterating takes quite a while and I am hoping there is a more efficient way to do this in R.
Thanks!
Have you considered using the dplyr package? Its SQL-style grammar makes this job quick and easy.
The code will be something like
library(dplyr)
# Column names follow the question: df has names/number/age, df2 has Name/Numbers/Age/Income
df %>%
  left_join(df2, by = c("names" = "Name", "number" = "Numbers", "age" = "Age")) %>%
  group_by(names, number, age) %>%
  summarize(TotalIncome = sum(Income, na.rm = TRUE))
I suggest you look at the cheat sheets available on the dplyr site or the Wickham and Grolemund book.
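As a self-contained illustration, here is a minimal sketch that summarises df2 first and then joins the totals onto df, which keeps the join small when df2 has thousands of rows. The contents of df2 are made up for the example, and its column names (Name, Numbers, Age, Income) are taken from the loop in the question.

```r
library(dplyr)

df <- data.frame(names  = c("Bob", "Mary", "Ben", "Lauren"),
                 number = 1:4,
                 age    = c(20, 33, 34, 45),
                 stringsAsFactors = FALSE)

# Hypothetical df2: the question does not show its contents
df2 <- data.frame(Name    = c("Bob", "Bob", "Mary", "Zed"),
                  Numbers = c(1L, 1L, 2L, 9L),
                  Age     = c(20, 20, 33, 50),
                  Income  = c(100, 50, NA, 70),
                  stringsAsFactors = FALSE)

# One row per (Name, Numbers, Age) combination with its summed income
totals <- df2 %>%
  group_by(Name, Numbers, Age) %>%
  summarise(TotalIncome = sum(Income, na.rm = TRUE), .groups = "drop")

result <- df %>%
  left_join(totals, by = c("names" = "Name", "number" = "Numbers", "age" = "Age")) %>%
  # Rows of df without any match get 0, as sum() over an empty set would give
  mutate(TotalIncome = coalesce(TotalIncome, 0))

result
```

The left join keeps every row of df in its original order, so the result lines up with what the row-by-row loop would have produced.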

how to average a set of columns and exclude other specific columns in R using the summarise command?

I'm racking my brain here with academic work. I have a data.frame with several numeric columns. I am using summarise and group_by in R to calculate averages over my data frame.
I tried summarise(across(where(is.numeric), mean), -c(Mes, year_date)), but it averages every numeric column of the data.frame and, in addition, creates a new column named -c(Mes, year_date). I would like some numeric columns to be excluded from the mean calculation but to remain in the data.frame.
Note that I tried -c(Mes, year_date) to exclude these two columns from the average calculation, but it didn't work.
I tried
library(tidyr)
library(dplyr)
library(lubridate)
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
wind_speed<-c(3.001,6.332,9.321,10.9091,6.38,10.5882,10.5,10.4348,10.3846,10.3448,10.3125,8.35,10.2632,10.2439,10.2273,10.2128,10.2,10.1887,10.1786,12,10.1613,10.1538,10.1471,10.1408,10.1351,10.1299,10.125,2.36,10.1163,10.1124,10.1087,11.2,10.102,10.099,10.0962,10.0935,10.0909)
esp<-c(11.6,11.3,11,10.7,10.4,10.1,9.8,9.5,9.2,8.9,8.6,8.3,8,11.2,10.9,10.6,10.3,10,12.8,12.5,12.2,11.9,11.6,11.3,11,4.36,4.06,3.76,3.46,3.16,2.86,2.56,2.26,1.96,1.66,1.36,23)
volum<-c(300,300,300,300,300,300,300,300,250,250,250,250,250,250,400,400,400,400,400,105,105,105,105,105,105,105,105,105,105,81,81,81,81,81,81,81,81)
df<-data.frame(sample_station, Date_dmy, temperature, wind_speed, esp, volum)%>%
mutate(Date_dmy = dmy(Date_dmy)) %>%
mutate(year_date = floor_date(Date_dmy,'year'))%>%
mutate(Ano=year(Date_dmy))%>%
mutate(Mes=month(Date_dmy))%>%
mutate(Epoca = ifelse(Mes %in% 4:9,'dry','rainy'))%>%
group_by(sample_station, Epoca, Ano)%>%
summarise(across(where(is.numeric), mean), -c(Mes, year_date))
I have several columns that I don't want to be averaged (even if they are numeric), for example the columns esp and volum.
Update: expected output
Because you are summarising only part of the data, you need to specify which values (rows) of the un-summarised columns you want to keep. In your example, you don't want to summarise Mes and year_date, but each group (sample_station, Epoca, Ano) contains multiple values of Mes and year_date.
Which values of these unsummarised columns do you want to keep?
If you want to keep all values of the unsummarised columns, you may want to include Mes and year_date inside group_by(sample_station, Epoca, Ano) before summarising.
Alternatively, you may use mutate() rather than summarise() to get summary values in a new column for each row of the original dataframe, then choose your rows from there.
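A minimal sketch of the mutate() alternative, on a small made-up data frame rather than the question's data: every original row is kept, and the group mean is repeated on each row of its group.

```r
library(dplyr)

# Hypothetical toy data, not the question's data set
toy <- data.frame(station     = c("A", "A", "B", "B"),
                  Mes         = c(1, 2, 8, 9),
                  temperature = c(10, 20, 30, 50),
                  stringsAsFactors = FALSE)

with_means <- toy %>%
  group_by(station) %>%
  # mutate() adds the per-group mean as a new column without collapsing rows
  mutate(mean_temperature = mean(temperature)) %>%
  ungroup()

with_means
```

From here you can pick whichever rows you want to keep, for example with filter() or slice().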
Update:
Again, with your edited post including the desired output, what values do you expect for Mes? For example, when sample_station == 'A', Epoca == 'rainy' and Ano == 2000, Mes takes the values 1 and 2 with the same year_date, but summarise() wants to calculate one single summary value for this group.
You can use across(c(where(is.numeric), -Mes), mean). Note that year_date is not included in the calculation because it is not of class numeric and also because it is included in group_by.
You can also combine multiple mutate statements into one.
If you want to exclude certain columns from the average calculation but keep them in the dataframe, you need to decide which value you want to keep. For example, to keep the first value you can use first().
library(dplyr)
library(lubridate)
data.frame(sample_station, Date_dmy, temperature, wind_speed)%>%
mutate(Date_dmy = dmy(Date_dmy),
year_date = floor_date(Date_dmy,'year'),
Ano=year(Date_dmy),
Mes=month(Date_dmy),
Epoca = ifelse(Mes %in% 4:9,'dry','rainy')) %>%
group_by(sample_station, year_date, Epoca) %>%
summarise(across(c(where(is.numeric), -Mes), mean),
across(Mes, first))
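The same pattern extends to the esp and volum columns mentioned in the question. The sketch below uses a small made-up data frame (the column names esp and volum come from the question, the values do not) and keeps the first value of each excluded column, which is only one of several reasonable choices.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(station     = c("A", "A", "B", "B"),
                  temperature = c(10, 20, 30, 50),
                  esp         = c(1, 2, 3, 4),
                  volum       = c(300, 250, 400, 105),
                  stringsAsFactors = FALSE)

out <- toy %>%
  group_by(station) %>%
  summarise(
    # Average every numeric column except esp and volum
    across(c(where(is.numeric), -esp, -volum), mean),
    # Keep the excluded columns by taking their first value in each group
    across(c(esp, volum), first),
    .groups = "drop"
  )

out
```

Replacing first with last, min or max changes which value of the excluded columns survives in each group.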

Dplyr solution for difference in row values based on two factor levels in separate columns

I am trying to use dplyr to calculate the difference between two row values based on factor levels in a large data frame. In practical terms, I want the vote distance between each pair of groups for each party within each country. For the data below, I would like to end up with a data frame whose rows give the difference between the vote values for each group pair, for each party level within each country level. The lag function does not seem to work with my data because the number of factor levels varies by country, meaning each country has a different total number of groups and parties. A small sample of the setup is below.
df1 <- data.frame(id = c(1:12),
country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66))
This is how I would like the end product to look.
df2 <- data.frame(id= c(1:12),
country = c("a","a","a","b","b","b","a","a","a","b","b","b"),
group1 = c("x","x","y","x","x","y","x","x","y","x","x","y"),
group2 = c("y","z","z","y","z","z","y","z","z","y","z","z"),
party = c("d","d","d","d","d","d","e","e","e","e","e","e"),
dist = c(.13,-.5,-.68,.14,.38,.24,-.1,.28,.38,.06,.17,.11))
I have tried dcast previously, and if I fill with the column I want, it doesn't line up and produces NA or 0 where there should be values. The lag function doesn't work in my case because the number of parties and groups is different for each country and not fixed. Whenever I have tried different intervals for the lag, the values are in some instances compared across countries or across parties rather than across groups.
I have found solutions outside of dplyr but for parsimony in presenting code I am wondering if there is a way in dplyr. Also, the code I have is incredibly long and clunky, and uses six or seven packages just for this problem.
Thanks
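We can use combn to compute the pairwise differences within each country and party. This fits inside mutate() here only because each group has three rows and therefore also three pairs.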
library(dplyr)
df1 %>%
group_by(country, party) %>%
mutate(dist = combn(vote, 2, FUN = function(x) x[1] - x[2]))
Another way is to use a self-join:
library(tidyverse)
df1 %>%
  left_join(df1 %>% select(-id), by = c("country", "party"), suffix = c("1", "2")) %>%
  filter(group1 < group2) %>% # keep each pair of groups once (x-y, x-z, y-z)
  mutate(dist = vote1 - vote2)
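When the number of groups differs between countries, the number of pairs no longer equals the number of rows, and the group labels are needed next to each distance. One possible sketch, assuming every country/party combination has at least two rows, applies combn() to row indices inside group_modify() so that labels and differences stay aligned:

```r
library(dplyr)

df1 <- data.frame(id = 1:12,
                  country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
                  group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
                  party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
                  vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66),
                  stringsAsFactors = FALSE)

pairs <- df1 %>%
  group_by(country, party) %>%
  group_modify(~ {
    # Each column of idx is one pair of row positions within the group
    idx <- combn(nrow(.x), 2)
    data.frame(group1 = .x$group[idx[1, ]],
               group2 = .x$group[idx[2, ]],
               dist   = .x$vote[idx[1, ]] - .x$vote[idx[2, ]],
               stringsAsFactors = FALSE)
  }) %>%
  ungroup()

pairs
```

The result has one row per group pair, party and country, with the pair labelled in group1 and group2.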

How can I create subsets from these data frame?

I want to aggregate my data. The goal is to have one point per time interval in a diagram. I have a data frame with 2 columns: the first column is a timestamp, the second is a value. I want to evaluate each time period, that is, add up all the values that fall within the period (for example, 1 second).
I don't know how to do this with the aggregate function, because it doesn't seem to support time values.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the time stamp, or is the second column the time stamp?).
I'm trying to understand. Are you trying to:
Split your original data.frame in to multiple data.frame's?
View a specific sub-set of your data? Effectively, you want to filter your data?
Group your data.frame in to specific increments of a set time-interval to then aggregate the results?
Assuming that you have named the variables on your dataframe as time and value, I've addressed these three examples below.
#Set Data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180,0.000500,0.000005),num,TRUE),
value = sample(1:100,num,TRUE))
#Example 1: Split your data in to multiple dataframes (using base functions)
temp1 <- tempdf[ tempdf$time>0.0003 , ]
temp2 <- tempdf[ tempdf$time>0.0003 & tempdf$time<0.0004 , ]
#Example 2: Filter your data (using dplyr::filter() function)
dplyr::filter(tempdf, time>0.0003 & time<0.0004)
#Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
mutate(group = floor(time*10000)/10000) %>%
group_by(group) %>%
summarise(avg = mean(value),
num = n())
I hope that helps?
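Since the question asks for the values to be added up within each period, a variant of Example 3 that sums instead of averaging might look like the sketch below. The first three rows are the sample values from the question; the remaining rows and the 1-second width are assumptions for illustration.

```r
library(dplyr)

# First three rows come from the question; the rest are made up
obs <- data.frame(time  = c(0.000180, 0.000185, 0.000474, 1.2, 1.7, 2.05),
                  value = c(8, 8, 32, 5, 7, 1))

width <- 1  # interval length in the same unit as time (assumed to be seconds)

per_interval <- obs %>%
  # Label each row with the start of the interval it falls into
  mutate(interval = floor(time / width) * width) %>%
  group_by(interval) %>%
  summarise(total = sum(value), n = n())

per_interval
```

Each row of per_interval is then one point for the diagram, with interval on the x axis and total on the y axis.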

Data frame subset according to matching values in R

I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on their best racing times and the age they were when they achieved it. In the "Total" row, the racing time included is their best one. So, I need to figure out how to tell R that, when the race time in Total row is equal to whenever it is they actually achieved that time, include the age they were then in the new data frame. I am super new to R so I have no idea where to even begin doing this, I've tried some stuff I've seen on other questions but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total number of horses I have) with columns Name, Competition.year=="Total", Time.record.auto.start, Time.record.volt.start and Competition.age
Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
# "Total" rows: one row per horse holding its best times
df1 <- travdata %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record.auto.start, Time.record.volt.start)
# Yearly rows, which carry Competition.age
df2 <- travdata %>% filter(Competition.year != "Total")
# Match each horse's best times back to the year in which they were set
df3 <- inner_join(
  df1,
  df2,
  by = c("Name", "Time.record.auto.start", "Time.record.volt.start")
)
The dataframe df3 should return what you were after.
