Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets player's doubles, triples, and home runs for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved -- for that I am seeking tips.
library(dplyr)
df %>%
  filter(teamID == 'NYN') %>%
  # column names starting with a digit (2B, 3B) must be wrapped in backticks
  select(playerID, yearID, G, `2B`, `3B`, HR) %>%
  group_by(playerID, yearID) %>%
  summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
  arrange(desc(xbh))
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.
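For the rookie-season part, here is a minimal sketch under two assumptions: df is Lahman's Batting table, and a player's rookie season is simply his first yearID (the helper name rookies is made up for illustration). Note that Lahman's Batting data is season-level, so the 95-game cutoff itself would need a game-level source such as Retrosheet game logs; this only narrows things down to each player's first season:
library(dplyr)
# hypothetical sketch: find each player's first season, then total
# extra-base hits for Mets players in that season
rookies <- df %>%
  group_by(playerID) %>%
  summarise(rookie_year = min(yearID))
df %>%
  inner_join(rookies, by = "playerID") %>%
  filter(teamID == 'NYN', yearID == rookie_year) %>%
  group_by(playerID, yearID) %>%
  summarise(`2B` = sum(`2B`), `3B` = sum(`3B`), HR = sum(HR)) %>%
  mutate(XBH = `2B` + `3B` + HR) %>%
  arrange(desc(XBH))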

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is represented by a dummy variable that equals one if the person built up unemployment benefit rights in that year (i.e. if he or she worked) and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
To summarize the information in the two datasets above: person R005 worked in the years 2010 and 2011. In 2012 this person filed for unemployment insurance. Thereafter, person R005 worked again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on his work history before he became unemployed; hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014, as he only filed for unemployment benefits in December of that year; therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but that also includes work history after the year in which someone filed for unemployment benefits, which is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You should use Reduce with accumulate = TRUE.
# for each row, flag every year from the first zero onwards,
# then count the years before that first gap
df2$employment_history <- apply(df2[, -1], 1, function(x)
  sum(!Reduce(any, x == 0, accumulate = TRUE)))
merge(df1, df2[c("id", "employment_history")])
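To see what the Reduce call is doing, here is a small illustration on R005's row (my own example values, matching df2 above):
x <- c(1, 1, 0, 1, 1)  # R005's worked2010..worked2014 flags
Reduce(any, x == 0, accumulate = TRUE)
# FALSE FALSE TRUE TRUE TRUE  -- turns TRUE at the first zero and stays TRUE
sum(!Reduce(any, x == 0, accumulate = TRUE))
# 2  -- years worked before the first gap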
dplyr
Or use dplyr's built-in cumany function (note that pivot_longer() below comes from tidyr):
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  # keep only the years before the first zero, then count them
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Speeding up an extremely slow for-loop

This is my first question on stackoverflow, so feel free to criticize the question.
For every row in a data set, I would like to sum the rows that:
have identical 'team', 'season' and 'simulation_ID'.
have 'match_ID' smaller than (and not equal to) the current 'match_ID'.
such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).
I am having trouble implementing the second condition without using an extremely slow for-loop.
The data looks like this:
match_ID season    simulation_ID home_team team      match_result team_points
2084     2020-2021 1             TRUE      Liverpool Away win     0
2084     2020-2021 2             TRUE      Liverpool Draw         1
2084     2020-2021 3             TRUE      Liverpool Away win     0
2084     2020-2021 4             TRUE      Liverpool Away win     0
2084     2020-2021 5             TRUE      Liverpool Home win     3
2084     2020-2021 1             FALSE     Burnley   Home win     0
2084     2020-2021 2             FALSE     Burnley   Draw         1
My current solution is:
simulation$accumulated_points <- 0
for (row in 1:nrow(simulation)) {
simulation$accumulated_points[row] <-
sum(simulation$team_points[simulation$season==simulation$season[row] &
simulation$match_ID<simulation$match_ID[row] &
simulation$simulation_ID==simulation$simulation_ID[row] &
simulation$team==simulation$team[row]], na.rm = TRUE)
}
This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?
For-loops are typically slow in interpreted languages like R and are best avoided where possible. Instead, use "vectorized operations", which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages often rely on optimized C++ code and linear algebra libraries under the hood, so these operations run much faster than an R-level loop; the CPU can also process many vector elements at a time rather than going one by one as in a for-loop. You can find more information about vectorization in this question.
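As a toy illustration (my own example, not from the question), compare an element-wise loop with the vectorized +:
x <- runif(1e6)
y <- runif(1e6)
# loop version: one element at a time at the R level
z1 <- numeric(length(x))
for (i in seq_along(x)) z1[i] <- x[i] + y[i]
# vectorized version: the whole addition happens in optimized C code
z2 <- x + y
identical(z1, z2)  # TRUE, but the vectorized form is dramatically faster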
In your specific example, you could for example use dplyr to transform your data:
library(dplyr)
df %>%
  # you want to perform the same operation for each of the groups
  group_by(team, season, simulation_ID) %>%
  # within each group, order the data by match_ID (ascending)
  arrange(match_ID, .by_group = TRUE) %>%
  # cumulative team_points minus the current match's points, so that only
  # matches with a strictly smaller match_ID are counted, as in the loop;
  # note that cumsum() propagates NA, unlike the loop's na.rm = TRUE, so
  # replace NAs in team_points first if the data contains any
  mutate(points = cumsum(team_points) - team_points)
The code above essentially decomposes the team_points column into one vector per group, then applies a single, highly optimized operation to each of them.
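For reference, a base R sketch of the same logic using ave() (my own variant, assuming the data frame is called simulation as in the question):
# order by match_ID so the cumulative sum respects match order
simulation <- simulation[order(simulation$match_ID), ]
# as with cumsum() above, NAs in team_points should be replaced first
simulation$accumulated_points <- ave(
  simulation$team_points,
  simulation$team, simulation$season, simulation$simulation_ID,
  FUN = function(p) cumsum(p) - p
)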

How to use group_by() to display earliest date

I have a tibble called master_table that is 488 rows by 9 variables. The two relevant variables to the problem are wrestler_name and reign_begin. There are multiple repeats of certain wrestler_name values. I want to reduce the rows to only be one instance of each unique wrestler_name value, decided by the earliest reign_begin date value. Sample tibble is linked below:
So, in this slice of the tibble, the end goal would be to have just five rows instead of eleven, and the single Ric Flair row, for example, would have a reign_begin date of 9/17/1981, the earliest of the four Ric Flair reign_begin values.
I thought that group_by would make the most sense, but the more I think about it and try to use it, the more I think it might not be the right path. Here are some things I tried:
group_by(wrestler_name) %>%
tibbletime::filter_time(reign_begin, 'start' ~ 'start')
#Trying to get it to just filter the first date it finds for each wrestler_name group, but did not work
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin)
#I know that this would not work, but its the place I'm basically stuck
edit: Per request, here is the head(master_table), which contains slightly different data, but it still expresses the issue:
1 Ric Flair NWA World Heavyweight Championship 40 8 69 1991-01-11 1991-03-21
2 Sting NWA World Heavyweight Championship 39 1 188 1990-07-07 1991-01-11
3 Ric Flair NWA World Heavyweight Championship 38 7 426 1989-05-07 1990-07-07
4 Ricky Steamboat NWA World Heavyweight Championship 37 1 76 1989-02-20 1989-05-07
5 Ric Flair NWA World Heavyweight Championship 36 6 452 1987-11-26 1989-02-20
6 Ronnie Garvin NWA World Heavyweight Championship 35 1 62 1987-09-25 1987-11-26
city_state country
1 East Rutherford, New Jersey USA
2 Baltimore, Maryland USA
3 Nashville, Tennessee USA
4 Chicago, Illinois USA
5 Chicago, Illinois USA
6 Detroit, Michigan USA
The common way to do this for databases involves a join:
earliest <- master_table %>%
  group_by(wrestler_name) %>%
  summarise(reign_begin = min(reign_begin))
master_table_2 <- master_table %>%
  inner_join(earliest, by = c("wrestler_name", "reign_begin"))
No filter is required, as an inner join keeps only the matching rows.
The above approach is often required for databases because of how they calculate summaries. But as #Martin_Gal suggests, R can handle this differently because it stores the data in memory:
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin == min(reign_begin))
You may also find the lubridate package helpful when working with dates.
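A further option, assuming dplyr 1.0.0 or later: slice_min() keeps the row with the smallest value per group, and with_ties = FALSE guarantees a single row per wrestler even if two reigns share the same start date:
master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  slice_min(reign_begin, n = 1, with_ties = FALSE) %>%
  ungroup()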

Is there a tidyverse equivalent to SELECT...COUNT(*)...GROUP BY...?

I want to understand how to accomplish "group by" and "count" functionality in the tidyverse. I looked at quite a few posts without finding quite what I wanted; if there is an answer to this already posted, I would appreciate the link.
For example, I am looking for outliers in data; I want to know which places received the most "bad" measures:
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=T)
df = data.frame(place, measure, rating)
> df
place measure rating
1 AL meas1 good
2 AK meas1 good
3 AZ meas1 good
4 AR meas1 bad
5 CA meas1 bad
6 CO meas1 bad
7 CT meas1 bad
8 DE meas1 good
9 FL meas1 good
10 GA meas1 good
....(etc).....
I want to understand how to do this using the tidyverse. This approach using sqldf gives me what I want, i.e. tells me which places had the most "bad" ratings, and ranks the places by their "bad-ness"
library(sqldf)
sqldf("SELECT place, rating, COUNT(*) AS Count FROM df GROUP BY place, rating ORDER BY rating, count DESC").
place rating Count
1 CA bad 3
2 AK bad 2
3 AR bad 1
4 CO bad 1
5 CT bad 1
6 DE bad 1
7 FL bad 1
8 GA bad 1
9 AL good 4
10 AZ good 4
11 HI good 4
....(etc)....
Is there a way to do get similar results in the tidyverse?
For an introduction to these basic operations in the tidyverse, I'd suggest reading Wickham and Grolemund's excellent R for Data Science in the first instance: http://r4ds.had.co.nz/
You can use dplyr and magrittr packages to do the following in an easy to follow way:
# Load the tidyverse
library(tidyverse)
# Create data
place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=T)
df = data.frame(place, measure, rating)
# Do some analysis
df %>%
group_by(place) %>%
summarise(mean_score = mean(rating == "good"), n = n()) %>%
arrange(desc(mean_score))
Here, we "group by" restaurant name "then" "summarise" each grouping by the mean number of 'good' ratings it received (creating a new variable), "then" "arrange" the output in descending order by this 'mean_score'.
We also create the new 'n' variable in the summarise function which counts the number of ratings that each mean is based on (i.e. so that if we see that one restaurant only had 2 ratings we would know that the mean may not be representative: see http://www.evanmiller.org/how-not-to-sort-by-average-rating.html for a comprehensive example of this).
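For a more literal translation of the SQL in the question, dplyr's count() is shorthand for group_by() plus summarise(n = n()):
df %>%
  count(place, rating, name = "Count") %>%
  arrange(rating, desc(Count))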

R adding rows of data and summarize them by group

After looking at my notes from a recent R course and through the Q&As here, the most likely functions I need would seem to be colSums and group_by, but I have no idea how to use them; can you help me out?
( first I tried to look into summarize and group by but did not get far )
What I Have
player year team rbi
a 2001 NYY 56
b 2001 NYY 22
c 2001 BOS 55
d 2002 DET 77
Results wanted
year team rbi
2001 NYY 78
2001 BOS 55
2002 DET 77
The player's name is lost; why?
I want to add up the RBIs for each team for each year, using the individual players' RBIs.
So for each year there should be, let's say, 32 teams, and for each of these teams there should be an RBI number which is the sum over all the players that batted for that team that particular year.
Thank you
As per #bunk's comment, you can use the aggregate function:
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per #akrun's comment, to keep the column names as they are, you can use
aggregate(rbi ~ team + year, data = df, sum)
A data.table approach would be to convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year' and 'team', we get the sum of 'rbi'.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.
Assume df contains your player data, then you can get the result you want by
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.
Thank you for your help resolving my issue. It is something that could have been done more easily in a popular spreadsheet program, but I decided to do it in R; I love this language and its libraries, although there is a learning curve.
There were 4 proposals to resolve my question, and 3 of them worked fine when I evaluated each answer by the number of rows in the final result, because I know what the answer should be from a related dataframe.
1) Arun's proposal worked fine, and it uses the data.table library; I read a little more on this library and it looks interesting.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Alex's proposal also worked fine. It was:
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
3) Akrun's solution was also good. This is the one I liked the most, because the output came already sorted by year and team, with the team column in alphabetical order, while with the previous two solutions you need to specify that you want the result sorted by year and then team:
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4) The solution by Ronak almost worked: out of the 2775 rows that the result should have had, this solution only produced 2761 (likely because aggregate's formula interface drops rows containing NA by default). The code was:
aggregate(rbi ~ team + year, data = df, sum)
Thanks again to everybody
Javier
