R: adding rows of data and summarizing them by group

After looking at my notes from a recent R course and through the Q&As here, the most likely functions for what I need seem to be colSums and group_by, but I have no idea how to use them here. Can you help me out?
(First I tried to look into summarize and group_by but did not get far.)
What I Have
player year team rbi
a      2001 NYY   56
b      2001 NYY   22
c      2001 BOS   55
d      2002 DET   77
Results wanted
year team rbi
2001 NYY   78
2001 BOS   55
2002 DET   77
The player's name is lost. Why?
I want to add up the RBI for each team for each year using the individual players' RBIs.
So for each year there should be, let's say, 32 teams, and for each of these teams there should be an RBI number which is the sum over all the players that batted for that team that particular year.
Thank you

As per @bunk's comment, you can use the aggregate function:
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per @akrun's comment, to keep the column names as they are, you can use:
aggregate(rbi ~ team + year, data = df, sum)
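For completeness, here is a self-contained sketch that rebuilds the example data frame from the question (the df the calls above assume) and runs the named-column version:

```r
# Rebuild the question's example data so the aggregate() call is runnable
df <- data.frame(player = c("a", "b", "c", "d"),
                 year   = c(2001, 2001, 2001, 2002),
                 team   = c("NYY", "NYY", "BOS", "DET"),
                 rbi    = c(56, 22, 55, 77))

res <- aggregate(rbi ~ team + year, data = df, sum)
res
#   team year rbi
# 1  BOS 2001  55
# 2  NYY 2001  78
# 3  DET 2002  77
```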

A data.table approach would be to convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouping by 'year' and 'team', take the sum of 'rbi':
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.

Assuming df contains your player data, you can get the result you want with:
library(dplyr)
df %>%
  group_by(year, team) %>%
  summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.

Thank you for your help resolving my issue. It could have been done more easily in a popular spreadsheet program, but I decided to do it in R; I love this program and its libraries, although they come with a learning curve.
There were 4 proposals to resolve my question, and 3 of them worked fine. I evaluated each answer by the number of rows the final result had, because I know what the answer should be from a related dataframe.
1) Arun's proposal worked fine, and it uses a library that was new to me, data.table. I read a little more on this library and it looks interesting:
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Alex's proposal worked fine too; it was:
library(dplyr)
df %>%
  group_by(year, team) %>%
  summarise(rbi = sum(rbi))
3) Akrun's solution was also good. This is the one I liked the most, because the team column came already in alphabetical order: the result arrived sorted by year and team, while with the previous two solutions you need to specify that you want it sorted by year and then team:
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4) The solution by Ronak almost worked: out of the 2775 rows the result had to have, this solution only brought back 2761. The code was:
aggregate(rbi ~ team + year, data = df, sum)
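A plausible explanation for the missing rows (an assumption, since the full data isn't shown): the formula interface of aggregate() silently drops rows whose rbi is NA (its default na.action is na.omit), so any team/year group made up entirely of NA rows disappears, while the list interface keeps the group and returns NA for it. A minimal illustration with made-up data:

```r
# One BOS row with a missing rbi
df <- data.frame(team = c("NYY", "NYY", "BOS"),
                 year = c(2001, 2001, 2001),
                 rbi  = c(56, 22, NA))

n_formula <- nrow(aggregate(rbi ~ team + year, data = df, sum))   # BOS group dropped -> 1 row
n_list    <- nrow(aggregate(df$rbi, list(df$team, df$year), sum)) # BOS kept, sum is NA -> 2 rows
```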
Thanks again to everybody
Javier

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset describes an id variable that identifies a person and the date when his or her unemployment benefits starts.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year column is a dummy variable equal to one if someone built up unemployment benefit rights in that particular year (i.e. if someone worked), and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets. We see that person R005 worked in the years 2010 and 2011. In 2012 this person filed for Unemployment insurance. Thereafter person R005 works again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he got unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014 as he only filed for unemployment benefits in December of that year. Therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits and that is something I do not want. Does anyone have an efficient way how to combine the information from the two above datasets and calculate the unemployment history?
I appreciate any help.
base R
You can use Reduce with accumulate = TRUE:
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
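To see what the Reduce(any, ..., accumulate = TRUE) part is doing, here is a step-by-step sketch for R005's work history:

```r
x <- c(1, 1, 0, 1, 1)                            # worked2010 ... worked2014 for R005
flags <- Reduce(any, x == 0, accumulate = TRUE)  # "has a zero been seen so far?"
flags                                            # FALSE FALSE TRUE TRUE TRUE
years_before_gap <- sum(!flags)                  # count the years before the first zero
years_before_gap                                 # 2
```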
dplyr
Or use the built-in dplyr::cumany function (pivot_longer comes from tidyr):
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Conditional calculation based on other columns lagged values

Newbie: I have a dataset where I want to calculate the y-o-y growth of a company's sales. The dataset contains approx. 1000 companies, each with a different number of years listed on a public stock exchange. The data looks like this:
# gvkey fyear at company name
#22 17436 2010 59393 BASF SE
#23 17436 2011 61175 BASF SE
#24 17436 2012 64327 BASF SE
...
#30 17436 2018 86556 BASF SE
#31 17828 1989 62737 DAIMLER AG
#32 17828 1990 67339 DAIMLER AG
#33 17828 1991 75714 DAIMLER AG
...
#60 17828 2018 281619 DAIMLER AG
I would like to create a new column growth where I calculate the percentage increase of at for e.g. BASF SE (gvkey 17436) from 2010 to 2011, to 2012 and so on. In row #31 a conditional statement is supposed to ensure that the increase is not calculated from values that belong to BASF, but instead yields an NA value. Therefore the next value in this new column growth, in row #32, would be the percentage increase of DAIMLER (gvkey 17828) from 62737 to 67339.
So far I tried:
if TA$gvkey == lag(TA$gvkey) {mutate(TA, growth = (at - lag(at))/lag(at))} else {NULL}
Basically I tried to condition the calculation on the change of the gvkey identifier as this makes the most sense to me. I believe there is a nicer way of maybe running a loop until the gvkey changes and the continue with the next set of values - but I simply don't know how to code that.
I am very new to R and quite lost. I would appreciate every support! Thank you, guys :)
I do not see a way to do this in one line. Assuming your data is called data, you may try:
for(i in unique(data$gvkey)){
  a <- subset(data, data$gvkey == i) # a now contains the data of one company
  # calculate the pairwise relative difference (assumes the years are sorted!)
  rel_diff <- diff(a$at) / head(a$at, -1) # diff gives pairwise differences; head(a$at, -1) drops the last element
  a$growth <- c(NA, rel_diff) # extend the data frame by the result; the first year has no prior value
  # output to somewhere
}
This is a base-R solution. There might be more efficient ways, but it is easy to understand.
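The loop overwrites a on every pass; a sketch (with made-up data) of one way to keep the per-company results is to collect the pieces in a list and bind them at the end:

```r
data <- data.frame(gvkey = c(111, 111, 111, 222, 222),
                   fyear = c(2010, 2011, 2012, 2010, 2011),
                   at    = c(100, 110, 121, 50, 60))
pieces <- list()
for (i in unique(data$gvkey)) {
  a <- subset(data, data$gvkey == i)
  a <- a[order(a$fyear), ]                       # make sure the years are sorted
  a$growth <- c(NA, diff(a$at) / head(a$at, -1)) # NA for each company's first year
  pieces[[as.character(i)]] <- a
}
result <- do.call(rbind, pieces)
result$growth  # NA 0.10 0.10 NA 0.20
```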
In this case the group_by() function in dplyr is a good tool to use. By group_by()ing your gvkey column, your mutate() call will be applied separately for each distinct value of gvkey. Here is a quick example I made with some dummy data and your same column names:
library(dplyr)
dummyData = data.frame(gvkey = c(111, 111, 111, 222, 222, 222),
                       fyear = c(2010, 2012, 2011, 2010, 2011, 2013),
                       at = c(2, 4, 2, 4, 5, 10))
dummyDataTransformed = dummyData %>%
  group_by(gvkey) %>%
  arrange(fyear) %>% # to make sure we are chronologically in order
  mutate(growth = at / lag(at, 1) - 1) %>% # subtract 1 to get the year-over-year change
  ungroup() # I like to ungroup just to make sure I'm not bugging out any calculations I might add further down the line

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved; for that I am seeking tips.
library(dplyr)
df %>% filter(teamID == 'NYN') %>%
  select(playerID, yearID, G, `2B`, `3B`, HR) %>%
  group_by(playerID, yearID) %>%
  summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
  arrange(desc(xbh))
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x      1975   10  2  8  20
y      1980    5  5  5  15
z      2000    9  0  4  13
and so on.
I would like the XBH to be in descending order.
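Since no answer is shown above, here is a hedged sketch with made-up, season-level data in the Lahman Batting layout (playerID, yearID, teamID, 2B, 3B, HR are real Lahman column names; the values are invented). Note that Lahman stores season totals only, so a true "through game 95" cutoff would need game-level data (e.g. Retrosheet); the sketch restricts to each player's first (rookie) season and ranks by XBH:

```r
library(dplyr)

# Invented season totals in the Lahman Batting layout; backticks are needed
# because the column names start with a digit
df <- data.frame(playerID = c("x", "x", "y", "z"),
                 yearID   = c(1975, 1976, 1980, 2000),
                 teamID   = "NYN",
                 `2B` = c(10, 12, 5, 9),
                 `3B` = c(2, 1, 5, 0),
                 HR   = c(8, 9, 5, 4),
                 check.names = FALSE)

rookies <- df %>%
  filter(teamID == "NYN") %>%
  group_by(playerID) %>%
  filter(yearID == min(yearID)) %>%  # keep each player's first season only
  ungroup() %>%
  mutate(XBH = `2B` + `3B` + HR) %>%
  arrange(desc(XBH))
rookies$XBH  # 20 15 13
```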

Duplicating rows based on a multiplier

Using R and having some trouble manipulating my data. I've identified bee-collected pollen to type and relative volume ("adjusted_volume" below: how much pollen is on a slide). I'm now trying to calculate average pollen usage by bees at each of my 14 sites. My data looks like this:
head(pollen)
site treatment hive_code pollen_type adjusted_volume
A conventional 4 alnus_spp 248.5
B conventional 4 alnus_spp 71.0
B conventional 7 alnus_spp 35.5
My plan was to dcast and gather to get the amount of each pollen type per site...
data1 <- dcast(pollen, site + treatment ~ pollen_type, length)
data2 <- gather(data1, pollen_type, count, alnus_spp:vaccinium_corymbosum, factor_key=TRUE, na.rm=TRUE)
But that doesn't account for the differences in volume for each entry. I might be thinking about this the wrong way, but is there a way to multiply each row by the adjusted_volume number in the dcast function? So the first row would count as 248.5 alnus_spp at site A instead of just 1 record?
Thanks for your help in advance! And sorry if I'm going about this in a ridiculous way!
Edit:
This worked! Thanks all!
x <- ddply(pollen, .(site, pollen_type, treatment, hive_code), summarise, tot_pollen = sum(adjusted_volume))
head(x)
#   site     pollen_type    treatment hive_code tot_pollen
#      A       alnus_spp conventional         1      497.0
#      A       alnus_spp conventional         5      142.0
#      A graminaceae_spp conventional         1       29.0
I think something like this might get at what you are looking for:
ddply(pollen, .(site, treatment, pollen_type), summarise, tot_pollen = sum(adjusted_volume))
This should summarize the volumes of pollen by site, treatment, and pollen_type.
Good luck!
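On the original dcast question ("can I multiply each row by adjusted_volume?"): rather than multiplying, you can make dcast aggregate that column directly by passing sum as fun.aggregate and adjusted_volume as value.var instead of counting rows with length. A sketch assuming the reshape2 dcast used above:

```r
library(reshape2)

pollen <- data.frame(site = c("A", "B", "B"),
                     treatment = "conventional",
                     hive_code = c(4, 4, 7),
                     pollen_type = "alnus_spp",
                     adjusted_volume = c(248.5, 71.0, 35.5))

# sum the volumes per site/treatment instead of counting records
data1 <- dcast(pollen, site + treatment ~ pollen_type,
               value.var = "adjusted_volume", fun.aggregate = sum)
data1$alnus_spp  # 248.5 106.5
```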

Repeat sqldf over different values of a variable

Just a little background: I got into programming through statistics, and I don't have much formal programming experience; I just know how to make things work. I'm open to any suggestions to come at this from a different direction, but I'm currently using multiple sqldf queries to get my desired data. I originally started statistical programming in SAS, and one of the things I used on a regular basis was its macro programming ability.
For a simplistic example say that I have my table A as the following:
Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12
The select statement that I'm currently using is of the form:
sqldf("SELECT AVG(A), SUM(B) FROM A WHERE DateAdded >= '2012-01-01' AND DateAdded <= '2012-01-31'")
Now I'd like to run this same query on the entries where DateAdded is in the month of February. From my experience with SAS, you would create macro variables for the values of DateAdded. I've considered running this as a (very very slow) for loop, but I'm not sure how to pass an R variable into sqldf, or whether that's even possible. In my table, I'm using the same query over years' worth of data; any way to streamline my code would be much appreciated.
Read in the data, convert the DateAdded column to Date class, add a yearmon (year/month) column and then use sqldf or aggregate to aggregate by year/month:
Lines <- "Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12"
DF <- read.table(text = Lines, header = TRUE)
# convert DateAdded column to Date class
DF$DateAdded <- as.Date(DF$DateAdded, format = "%m/%d/%y")
# add a year/month column using zoo
library(zoo)
DF$yearmon <- as.yearmon(DF$DateAdded)
Now that we have the data and it's in the right form, the answer is just one line of code. Here are two ways:
# 1. using sqldf
library(sqldf)
sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
# 2. using aggregate
aggregate(cbind(A, B) ~ yearmon, DF, mean)
The result of the last two lines is:
> sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
yearmon avg(A) avg(B)
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
>
> # 2. using aggregate
> aggregate(cbind(A, B) ~ yearmon, DF, mean)
yearmon A B
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
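On the part of the question the answer doesn't touch, passing an R variable into sqldf: one simple route is to build the query string yourself (sprintf or paste) and hand the result to sqldf(); another is fn$sqldf from the gsubfn package (loaded by sqldf), which interpolates $variables into the query text. A sketch of the string-building route:

```r
# Build the query text for any date window, then pass it to sqldf()
month_start <- "2012-02-01"
month_end   <- "2012-02-29"
qry <- sprintf(
  "SELECT AVG(A), SUM(B) FROM DF WHERE DateAdded >= '%s' AND DateAdded <= '%s'",
  month_start, month_end)
qry
# sqldf(qry)
# or, with interpolation:
# fn$sqldf("SELECT AVG(A), SUM(B) FROM DF WHERE DateAdded >= '$month_start' AND DateAdded <= '$month_end'")
```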
EDIT:
Regarding your question of doing it by week see the nextfri function in the zoo quick reference vignette.
