How to use group_by() to display earliest date - r

I have a tibble called master_table that is 488 rows by 9 variables. The two relevant variables to the problem are wrestler_name and reign_begin. There are multiple repeats of certain wrestler_name values. I want to reduce the rows to only be one instance of each unique wrestler_name value, decided by the earliest reign_begin date value. Sample tibble is linked below:
So, in this slice of the tibble, the end goal would be to have just five rows instead of eleven, and the single Ric Flair row, for example, would have a reign_begin date of 9/17/1981, the earliest of the four Ric Flair reign_begin values.
I thought that group_by would make the most sense, but the more I think about it and try to use it, the more I think it might not be the right path. Here are some things I tried:
group_by(wrestler_name) %>%
tibbletime::filter_time(reign_begin, 'start' ~ 'start')
#Trying to get it to just filter the first date it finds for each wrestler_name group, but did not work
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin)
#I know that this would not work, but it's the place I'm basically stuck
edit: Per request, here is the head(master_table), which contains slightly different data, but it still expresses the issue:
1 Ric Flair NWA World Heavyweight Championship 40 8 69 1991-01-11 1991-03-21
2 Sting NWA World Heavyweight Championship 39 1 188 1990-07-07 1991-01-11
3 Ric Flair NWA World Heavyweight Championship 38 7 426 1989-05-07 1990-07-07
4 Ricky Steamboat NWA World Heavyweight Championship 37 1 76 1989-02-20 1989-05-07
5 Ric Flair NWA World Heavyweight Championship 36 6 452 1987-11-26 1989-02-20
6 Ronnie Garvin NWA World Heavyweight Championship 35 1 62 1987-09-25 1987-11-26
city_state country
1 East Rutherford, New Jersey USA
2 Baltimore, Maryland USA
3 Nashville, Tennessee USA
4 Chicago, Illinois USA
5 Chicago, Illinois USA
6 Detroit, Michigan USA

The common way to do this for databases involves a join:
earliest <- master_table %>%
group_by(wrestler_name) %>%
summarise(reign_begin = min(reign_begin))
master_table_2 <- master_table %>%
inner_join(earliest, by = c("wrestler_name", "reign_begin"))
No filter is required, as an inner join only keeps rows that match in both tables.
The above approach is often required for databases because of how they calculate summaries. But as @Martin_Gal suggests, R can handle this a different way because it stores the data in memory.
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin == min(reign_begin))
You may also find the lubridate package helpful for working with dates.
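If two rows for the same wrestler could ever share the same earliest date, the filter() version keeps both of them. A minimal sketch of one alternative, assuming dplyr 1.0.0 or later and using only the two relevant columns from the head() output in the question (the other columns are left out here):

```r
library(dplyr)

# Cut-down stand-in for master_table, with values taken from the question's head() output
master_table <- tibble(
  wrestler_name = c("Ric Flair", "Sting", "Ric Flair",
                    "Ricky Steamboat", "Ric Flair", "Ronnie Garvin"),
  reign_begin = as.Date(c("1991-01-11", "1990-07-07", "1989-05-07",
                          "1989-02-20", "1987-11-26", "1987-09-25"))
)

# Keep exactly one row per wrestler: the one with the earliest reign_begin
master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  slice_min(reign_begin, n = 1, with_ties = FALSE) %>%
  ungroup()

master_table_2
```

Setting with_ties = FALSE is what guarantees a single row per group; drop it if you would rather see every row tied for the earliest date.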

Related

RStudio: Separate YYYY-MM-DD into Individual Columns

I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (e.g., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried Lubridate and followed instructions to other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate
library(dplyr)
library(lubridate)
df <- df %>%
mutate(occurred = as.Date(occurred),
year = year(occurred), month = month(occurred), day = day(occurred))
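If you would rather not load any packages, the same split can probably be done in base R with format(). This is only a sketch on the first three rows of the sample data, and it assumes 'occurred' is stored as text or as a Date:

```r
# Small stand-in for the crime data (first three rows of the sample)
df <- data.frame(
  dr_no = c(1307355, 11401303, 70309629),
  occurred = c("2010-02-20", "2010-09-12", "2010-08-09"),
  stringsAsFactors = FALSE
)

# Convert to Date first so that format() can pull out each component
df$occurred <- as.Date(df$occurred)
df$year  <- as.integer(format(df$occurred, "%Y"))
df$month <- as.integer(format(df$occurred, "%m"))
df$day   <- as.integer(format(df$occurred, "%d"))

df
```

Note that the new columns only exist if the result is assigned back (here via df$...), which is a common reason for columns "not being created".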

Creating new columns from information contained in other row/column combinations

I've got a data frame with match up information (Team, Opponent), as well as the betting spread of that game in different sportsbooks. I've got one row for each of the Teams, so there will be two rows for each game. As an example, see the below data frame:
example <- data.frame(Team = c("Tennessee","Vanderbilt"),
Opponent = c("Vanderbilt","Tennessee"),
PointsBet = c(-13, 13),
DraftKings = c(-12.5, 12.5))
Team Opponent PointsBet DraftKings
1 Tennessee Vanderbilt -13 -12.5
2 Vanderbilt Tennessee 13 12.5
What I'm trying to do is create "Opponent_PointsBet" and "Opponent_DraftKings" columns. So for each row, not only do we have the Team's betting spread, but we also have the betting spread of the Opponent. In a small example like this, it's easy to manually do this, but my actual data set has hundreds of rows and about 25 other columns, each of which I'd like to copy.
Is it possible to take one row of data for a specific "Team", and apply those columns as new columns in the row of data where that team is being identified as the "Opponent"? My output would look like this:
Team Opponent PointsBet DraftKings Opp_PointsBet Opp_DraftKings
1 Tennessee Vanderbilt -13 -12.5 13 12.5
2 Vanderbilt Tennessee 13 12.5 -13 -12.5
Also one thing to note, the columns I'd like to duplicate aren't always going to be opposites, so I can't simply just multiply the value by -1 to get the Opp_ column.
We can create the two columns in base R. Create a position index to match the 'Team' with 'Opponent' and use that to rearrange the column values in 'PointsBet' and 'DraftKings' to create new columns
nm1 <- names(example)[3:4]
i1 <- with(example,match(Team, Opponent))
example[paste0("Opp_", nm1)] <- lapply(example[nm1], function(x) x[i1])
example
# Team Opponent PointsBet DraftKings Opp_PointsBet Opp_DraftKings
#1 Tennessee Vanderbilt -13 -12.5 13 12.5
#2 Vanderbilt Tennessee 13 12.5 -13 -12.5
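One caveat: matching on the team name alone finds the first row where that team appears, so it may pick the wrong game once a team plays more than once. A sketch of one way around that, assuming each (Team, Opponent) pairing appears at most once in the data; the Florida rows below are made up purely for illustration:

```r
example <- data.frame(
  Team       = c("Tennessee", "Vanderbilt", "Tennessee", "Florida"),
  Opponent   = c("Vanderbilt", "Tennessee", "Florida", "Tennessee"),
  PointsBet  = c(-13, 13, -3, 3.5),
  DraftKings = c(-12.5, 12.5, -2.5, 3)
)

# Build a key for each row and the key of its mirrored (opponent's) row
key    <- paste(example$Team, example$Opponent, sep = "|")
mirror <- paste(example$Opponent, example$Team, sep = "|")
i1 <- match(mirror, key)  # position of the opponent's row for the same game

# Copy every spread column over from the opponent's row
nm1 <- c("PointsBet", "DraftKings")
example[paste0("Opp_", nm1)] <- lapply(example[nm1], function(x) x[i1])

example
```

If the same pairing can occur several times (for example, across seasons), a game identifier or date would need to be added to both keys.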

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved -- for that I am seeking tips.
library(dplyr)
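# `2B` and `3B` are non-syntactic names and need backticks
# (read.csv() would rename them to X2B and X3B instead)
df %>% filter(teamID == 'NYN') %>%
select(playerID, yearID, G, `2B`, `3B`, HR) %>%
group_by(playerID, yearID) %>%
summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
arrange(desc(xbh))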
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.

R adding rows of data and summarize them by group

After looking at my notes from a recent R course and at the Q&As here, the most probable functions I need would seem to be colsum and groupby, but I have no idea how to use them. Can you help me out?
(First I tried to look into summarize and group by, but did not get far.)
What I Have
player year team rbi
a 2001 NYY 56
b 2001 NYY 22
c 2001 BOS 55
d 2002 DET 77
Results wanted
year team rbi
2001 NYY 78
2001 BOS 55
2002 DET 77
The player's name is lost. Why?
I want to add up the RBI for each team for each year using the individual players' RBIs.
So for each year there should be, let's say, 32 teams, and for each of these teams there should be an RBI number which is the sum over all the players that batted for that team in that particular year.
Thank you
As per @bunk's comment, you can use the aggregate function
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per @akrun's comment, to keep the column names as they are, you can use
aggregate(rbi ~ team + year, data = df, sum)
A data.table approach would be to convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year' and 'team', we get the sum of 'rbi'.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.
Assume df contains your player data, then you can get the result you want by
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.
Thank you for your help resolving my issue. It could have been done more easily in a popular spreadsheet program, but I decided to do it in R. I love this program and its libraries, although they come with a learning curve.
There were 4 proposals to resolve my question and 3 of them worked fine. I evaluated each answer by the number of rows in its result, because I know what the answer should be from a related data frame.
1) Arun's proposal worked fine, and it uses a library that is new to me, data.table. I read a little more about this library and it looks interesting.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Alex's proposal worked fine too. It was:
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
3) Akrun's solution was also good. This is the one I liked the most because the result came already sorted by year and team, with the team column in alphabetical order, while with the previous two solutions you need to specify that you want it sorted by year and then team.
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4) The solution by Ronak almost worked: of the 2775 rows the result should have, it returned only 2761. The code was:
aggregate(rbi ~ team + year, data = df, sum)
Thanks again to everybody
Javier
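A possible explanation for the missing rows in solution 4, offered as a guess since the full data is not shown: the formula interface of aggregate() defaults to na.action = na.omit, so any team/year whose rbi values are all NA disappears from the result, whereas the list interface keeps that group with an NA sum. A small sketch with a made-up fifth player whose rbi is missing:

```r
# Toy data: the first four rows are from the question; player "e" is hypothetical
df <- data.frame(
  player = c("a", "b", "c", "d", "e"),
  year   = c(2001, 2001, 2001, 2002, 2002),
  team   = c("NYY", "NYY", "BOS", "DET", "CHA"),
  rbi    = c(56, 22, 55, 77, NA)
)

# Formula interface: rows with NA are dropped before grouping
by_formula <- aggregate(rbi ~ team + year, data = df, FUN = sum)

# List interface: the CHA/2002 group is kept, with an NA sum
by_list <- aggregate(list(rbi = df$rbi),
                     list(team = df$team, year = df$year), sum)

# Formula interface again, but told to keep the NA rows
by_formula_keep <- aggregate(rbi ~ team + year, data = df, FUN = sum,
                             na.action = na.pass)

nrow(by_formula)       # 3
nrow(by_list)          # 4
nrow(by_formula_keep)  # 4
```

If that is what happened, counting the team/year groups with no non-missing rbi in the real data should give the 14-row difference.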

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.
The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a library simply for it.
Your state, region column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason to build in a function to do it automatically for this specific scenario. You could do it without creating the column though
tapply(testset$outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() call strips the comma and everything after it, leaving just the state name to use as the index.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
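If a data frame is wanted as the result rather than a named vector, aggregate() can also be made to work once the state is pulled out into its own column. A minimal sketch, with the outcome values typed in from the printed data frame in the question rather than regenerated with rpois():

```r
regions <- c("Michigan, Eastern", "Michigan, Western", "Minnesota",
             "Mississippi, Northern", "Mississippi, Southern",
             "Missouri, Eastern", "Missouri, Western")
outcome <- c(10, 11, 17, 12, 12, 17, 13)  # copied from the question's output
testset <- data.frame(regions, outcome)

# Derive the state by dropping the comma and everything after it
testset$state <- sub(",.*$", "", testset$regions)

# Sum the outcome within each state; swap sum for mean or max as needed
by_state <- aggregate(outcome ~ state, data = testset, FUN = sum)
by_state
```

This assumes every region is written as "State, Region" or just "State"; a state name that itself contains a comma would need a different pattern.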
