Grouping Like Variables in Tibbles for R - r

Working on my second project for R. I'm trying to create some variable groups using dplyr, but I'm not sure what the heck I'm doing here.
I'm working with financial data and among the categories, there are several different forms of travel, listed as such:
Travel - Gas, Travel - Airfare, Travel - Subway...
I want to create a new tibble that groups all the Travel subtypes as one Travel subgroup. Is there a good way to do this?
I've been trying to use the dplyr filter function to no effect so far.
Sorry, I was really tired and forgot to put an example up
I have data that's like this:
Month - Year - Category - Amount
01 - 2016 - "Travel- Air" - 247.02
01 - 2016 - "Travel- Car" - 29.04
01 - 2016 - "Retail" - 45.00
03 - 2017 - "Travel - Air" - 253.60
I'm trying to group things so that all the transactions of one type in a particular month/year are summed together like this:
Total_Category_Transactions_Month <- Total_Transactions %>%
group_by(Month, Year, Category) %>%
summarize(monthly = sum(Amount))
But after looking at my data, there are just way too many things that are grouped up as "Travel - foo." I'd like to keep that detail for later to analyze, but for the big scale picture, I want to see if I can lump all those Travel expenses as one thing each month.
The Output should end up being:
Month - Year - Category - Amount
01 - 2016 - "Travel" - 276.06
01 - 2016 - "Retail" - 45.00
03 - 2017 - "Travel" - 253.60
Where all the subtypes of category Travel_Foo from the same month and year are added together into one category just called Travel

One option is to use tidyr::separate
df %>%
separate(Category, into = c("Category"), extra = "drop") %>%
group_by(Month, Year, Category) %>%
summarise(Amount = sum(Amount)) %>%
ungroup() %>%
as.data.frame()
# Month Year Category Amount
#1 1 2016 Retail 45.00
#2 1 2016 Travel 276.06
#3 3 2017 Travel 253.60
Note that the as.data.frame() is not really necessary here. I've only included it to show that the resulting Amounts are the ones from your expected output (the tibbles don't print the same number of decimal places).
Sample data
df <- read.table(text =
"Month Year Category Amount
01 2016 'Travel- Air' 247.02
01 2016 'Travel- Car' 29.04
01 2016 'Retail' 45.00
03 2017 'Travel - Air' 253.60", header = TRUE)
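Since the question mentions wanting to keep the detailed subtypes for later analysis, a variant that leaves Category untouched and writes the lumped label to a new column may be useful. This is only a sketch in base R: the column name Group and the regular expression are my own choices, and it assumes every subtype is written as a top-level name followed by a hyphen.

```r
# Sample data from the answer above
df <- read.table(text =
"Month Year Category Amount
01 2016 'Travel- Air' 247.02
01 2016 'Travel- Car' 29.04
01 2016 'Retail' 45.00
03 2017 'Travel - Air' 253.60", header = TRUE)

# Hypothetical new column: drop everything from the first hyphen onwards,
# including any whitespace directly before the hyphen
df$Group <- sub("[[:space:]]*-.*$", "", df$Category)

# Sum Amount per month, year, and top-level group;
# the original Category column is still available in df
totals <- aggregate(Amount ~ Month + Year + Group, data = df, FUN = sum)
totals
```

Grouping on Group gives the big-picture totals, while grouping on Category still gives the per-subtype breakdown from the same data frame.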

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset describes an id variable that identifies a person and the date when his or her unemployment benefits starts.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is a dummy variable that is equal to unity if someone built up unemployment benefit rights in that particular year (i.e. if someone worked). If this is not the case, the variable is equal to zero.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
To summarize the information in the two datasets above: person R005 worked in the years 2010 and 2011. In 2012 this person filed for unemployment insurance. Thereafter, person R005 worked again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he became unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014, as he only filed for unemployment benefits in December of that year; therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits, and that is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You should use Reduce with accumulate = TRUE.
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
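To make the Reduce step easier to follow, here is a small sketch of what it does for a single person. The vector is R005's row from df2; the variable names are mine. Note that this approach counts the years worked before the first zero and does not look at start_UI, which matches the expected output for the sample data but is an assumption about the real data.

```r
# Work indicators for one person (R005 in the question), 2010 to 2014
worked <- c(1, 1, 0, 1, 1)

# Running flag: has a zero been seen at or before this position?
seen_zero <- Reduce(any, worked == 0, accumulate = TRUE)

# Years worked before the first zero
employment_history <- sum(!seen_zero)

# An equivalent formulation with cumsum, shown only for comparison
seen_zero_alt <- cumsum(worked == 0) > 0
```

Once the flag turns TRUE it stays TRUE, so any work after the first gap is ignored.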
dplyr
Or use the built-in dplyr::cumany function:
library(dplyr)
library(tidyr) # for pivot_longer
df2 %>%
pivot_longer(-id) %>%
group_by(id) %>%
summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Adding missing panel dates by group as rows using data.table

I'm having difficulty using data.table operations to correctly manipulate my data. The goal is to create, for each group, a number of rows based on the values of two date columns. I've changed my data here in order to protect it, but it gets the idea across.
head(my_data_table, 6)
team_name play_name first_detected last_detected PlayID
1: Baltimore Play Action 2016 2017 41955-58
2: Washington Four Verticals 2018 2020 54525-52
3: Dallas O1 Trap 2019 2019 44795-17
4: Dallas Play Action 2020 2020 41955-58
5: Dallas Power Zone 2020 2020 54782-29
6: Dallas Bubble Screen 2018 2018 52923-70
The goal is to turn it into this
team_name play_name year PlayID
1: Baltimore Play Action 2016 41955-58
2: Baltimore Play Action 2017 41955-58
3: Washington Four Verticals 2018 54525-52
4: Washington Four Verticals 2019 54525-52
5: Washington Four Verticals 2020 54525-52
6: Dallas O1 Trap 2019 44795-17
...
n: Dallas Bubble Screen 2018 52923-70
The code I tried for this purpose is the following:
my_data_table[,.(PlayID, year = seq(first_detected,last_detected,by=1)), by = .(team_name, play_name)]
When I run this code, I get:
Error in seq.default(first_detected_ever, last_detected_ever, by = 1) :
'from' must be of length 1
Two other attempts also failed
my_data_table[,.(PlayID, year = seq(min(first_detected),max(last_detected),by=1)), by = .(team_name, play_name)]
my_data_table[,.(PlayID, year = list(seq(min(first_detected),max(last_detected),by=1))), by = .(team_name, play_name)]
which both result in something that looks like
by year PlayID
1: Baltimore Washington Dallas Play Action 2011, 2012, 2013, 2014, 2015, 2016 ... 41955-58
...
In as.data.table.list(jval, .named = NULL) :
Item 3 has 2 rows but longest item has 38530489; recycled with remainder.
I haven't found any clear answers on why this is happening. It seems that, when I pass "first_detected" and "last_detected", they are somehow interpreted as the entire range of the column's values, even though I pass by = .(team_name, play_name), which always results in one distinct row (I have verified this). Within each "by" group there should be only one value of first_detected and one of last_detected. I've done something similar before, but the difference was that I wasn't using a "by = .(x,y,z,...)" grouping and applied the operation to each row. Could anyone help me understand why I am unable to get the desired output with this data.table method?
Despite struggling with this for hours, I managed to solve my own question only a short while later.
The code
my_data_table[,.(PlayID, year = first_detected:last_detected), by = .(team_name, play_name)]
produces the desired result: for each group, one row per year from first_detected to last_detected inclusive, so long as first_detected and last_detected are integers.
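For reference, a self-contained sketch of that expansion on the first three rows of the sample table. The table name plays is made up for the example, and it assumes each (team_name, play_name) group has exactly one row. Keep in mind that `:` only uses the first element of each operand, so unlike seq() it would not stop with the 'from' must be of length 1 error if a group happened to contain more than one row.

```r
library(data.table)

# First three rows of the sample data from the question
plays <- data.table(
  team_name      = c("Baltimore", "Washington", "Dallas"),
  play_name      = c("Play Action", "Four Verticals", "O1 Trap"),
  first_detected = c(2016L, 2018L, 2019L),
  last_detected  = c(2017L, 2020L, 2019L),
  PlayID         = c("41955-58", "54525-52", "44795-17")
)

# One row per year in each group; the single PlayID is recycled
expanded <- plays[, .(PlayID, year = first_detected:last_detected),
                  by = .(team_name, play_name)]
expanded
```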

Need some tips on how R can handle this situation

I am using the csv version of the Lahman 2018 database found here: http://www.seanlahman.com/baseball-archive/statistics/.
In R, I would like to identify how many extra-base hits all Mets rookies have hit in their rookie seasons by game 95. I want to find out which Met rookie hit the most extra-base hits by game 95.
I have been experimenting with dplyr functions including select, filter, and summarize.
The main thing I am uncertain about is how to get only each Mets players' doubles, triples, and homers for the first 95 games of his first season.
This code shows more of what I have done than how I think my problem can be solved; for that I am seeking tips.
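library(dplyr)
# 2B and 3B are not syntactic names in R, so they need backticks.
# (If the csv was read with read.csv(), they will have been renamed X2B and X3B.)
df %>% filter(teamID == 'NYN') %>%
select(playerID, yearID, G, `2B`, `3B`, HR) %>%
group_by(playerID, yearID) %>%
summarise(xbh = sum(`2B`) + sum(`3B`) + sum(HR)) %>%
arrange(desc(xbh))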
Here is how I would like the output to appear:
Player Season 2B 3B HR XBH
x 1975 10 2 8 20
y 1980 5 5 5 15
z 2000 9 0 4 13
and so on.
I would like the XBH to be in descending order.

R adding rows of data and summarize them by group

After looking at my notes from a recent R course and at the Q&As here, the functions I most probably need seem to be colSums and group_by, but I have no idea how to use them. Can you help me out?
(First I tried to look into summarize and group_by but did not get far.)
What I Have
player year team rbi
a 2001 NYY 56
b 2001 NYY 22
c 2001 BOS 55
d 2002 DET 77
Results wanted
year team rbi
2001 NYY 78
2001 BOS 55
2002 DET 77
The player's name is lost. Why?
I want to add up the RBI for each team for each year using the individual players' RBIs.
So for each year there should be, let's say, 32 teams, and for each of these teams there should be an RBI number which is the sum over all the players that batted for that team in that particular year.
Thank you
As per @bunk's comment, you can use the aggregate function
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per @akrun's comment, to keep the column names as they are, you can use
aggregate(rbi ~ team + year, data = df, sum)
A data.table approach would be to convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year' and 'team', we get the sum of 'rbi'.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.
Assume df contains your player data, then you can get the result you want by
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.
Thank you for your help resolving my issue. This could have been done more easily in a popular spreadsheet program, but I decided to do it in R; I love this program and its libraries, although there is a learning curve.
There were 4 proposals to resolve my question and 3 of them worked fine. I evaluated each answer by the number of rows in the final result, because I know what the answer should be from a related data frame.
1) Arun's proposal worked fine, and it's using a novel library (data.table). I read a little more on this library and it looks interesting.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Alex's proposal worked fine too. It was
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
3) Akrun's solution was also good. This is the one I liked the most because the result came already sorted by year and team, with the team column in alphabetical order, while with the previous two solutions you need to specify that you want the result sorted by year and then team.
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4) The solution by Ronak almost worked: out of the 2775 rows that the result had to have, this solution only returned 2761. The code was:
aggregate(rbi ~ team + year, data = df, sum)
Thanks again to everybody
Javier
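One possible explanation for the 14 missing rows, offered as a guess since the real data is not shown: the formula interface of aggregate defaults to na.action = na.omit, so rows with a missing rbi are discarded before grouping, and any team/year combination made up only of such rows disappears from the result. The other approaches keep those groups and report NA for them. The sketch below reproduces the effect on the small sample from the question, with one rbi value replaced by NA purely for illustration.

```r
# Sample data from the question, with BOS 2001 set to NA (hypothetical)
df <- data.frame(
  team = c("NYY", "NYY", "BOS", "DET"),
  year = c(2001, 2001, 2001, 2002),
  rbi  = c(56, 22, NA, 77)
)

# Default formula interface: the NA row is dropped, so BOS 2001 vanishes
dropped <- aggregate(rbi ~ team + year, data = df, FUN = sum)

# Keeping NA rows preserves the group; its sum is reported as NA
kept <- aggregate(rbi ~ team + year, data = df, FUN = sum,
                  na.action = na.pass)

nrow(dropped)
nrow(kept)
```

If this is indeed the cause, passing na.action = na.pass (optionally together with na.rm = TRUE for sum) should bring the row count back in line with the other solutions.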

How to form linear model from two data frames?

MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join two datasets together?
It says that you can join using the combined key of Month and Year.
But I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
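A self-contained sketch of the merge route, using the four rows from the question and the same assumption as above that the data frame labels were swapped (so Amount is the business-license count). The value 1000 passed to predict is an arbitrary number chosen for illustration, and four observations are of course far too few for a meaningful fit.

```r
# Data from the question, under the swapped-labels assumption
BusinessLicen <- data.frame(
  Year   = 2011,
  Month  = c("Jan", "Feb", "Mar", "Apr"),
  Amount = c(742, 796, 1210, 1376)
)
MarriageLicen <- data.frame(
  Month             = c("Jan", "Feb", "Mar", "Apr"),
  Year              = 2011,
  MARRIAGE_LICENSES = c(754, 2706, 2689, 738)
)

# Join on the combined key of Month and Year
my.df <- merge(BusinessLicen, MarriageLicen, by = c("Month", "Year"))

# Marriage licenses (Y) as a linear function of business licenses (X)
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data = my.df)
coef(my.model)

# Predicted marriage licenses for a hypothetical month with 1000 business licenses
prediction <- predict(my.model, newdata = data.frame(Amount = 1000))
```

Fitting with data = my.df rather than with MarriageLicen$... terms is what allows predict to accept new values of Amount.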
