Aggregating data by two variables using R and dplyr

I am using NFL play-by-play data from the 2013 season and I am looking to measure catch success rate by Wide Receivers. Essentially, I have four variables of interest: Targeted Receiver, Pass Distance, Target and Reception. I would like to obtain a data set broken down by Targeted Receiver and Pass Distance, with Targets and Receptions summarized (just a simple count) for each of the two Targeted Receiver and Pass Distance combinations (i.e. Receiver 1 Short, Receiver 1 Long).
Thank you for your help,
CLR

First, take the table df and keep only the columns that are relevant (Targeted Receiver, Pass Distance, Target, and Reception).
library(dplyr)
df <- select(df, `Targeted Receiver`, `Pass Distance`, `Target`, `Reception`)
Then, remove the rows where there is no receiver (e.g. a running play).
df <- df[!is.na(df$`Targeted Receiver`), ]
After that, use group_by from dplyr so that your data are grouped at the Targeted Receiver and Pass Distance level.
grouped <- group_by(df, `Targeted Receiver`, `Pass Distance`)
Finally, use the summarise function to create the count of Target and the sum of Reception.
per_rec <- summarise(grouped, Target = n(), Reception = sum(Reception))
The data will look like this:
  Targeted Receiver Pass Distance Target Reception
              (chr)         (chr)  (int)     (dbl)
1        A.J. Green          Deep     50        21
2        A.J. Green         Short    128        77
3      A.J. Jenkins          Deep      6         2
4      A.J. Jenkins         Short     11         6
5      Aaron Dobson          Deep     23         6
6      Aaron Dobson         Short     49        31
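Since the stated goal is a catch success rate, the same steps can be chained and extended with a rate column. This is only a minimal sketch: the toy rows below are invented for illustration (they are not real 2013 data), `plays` and `catch_rate` are names made up here, and it assumes Reception is coded 0/1.

```r
library(dplyr)

# Toy play-by-play rows, invented for illustration only
plays <- data.frame(
  "Targeted Receiver" = c("A", "A", "A", "B", "B", NA),
  "Pass Distance"     = c("Short", "Short", "Deep", "Short", "Short", NA),
  Reception           = c(1, 0, 1, 1, 1, NA),
  check.names = FALSE   # keep the spaces in the column names
)

per_rec <- plays %>%
  filter(!is.na(`Targeted Receiver`)) %>%             # drop plays with no receiver
  group_by(`Targeted Receiver`, `Pass Distance`) %>%
  summarise(
    Target     = n(),                   # one row per target
    Reception  = sum(Reception),        # assumes Reception is 0/1
    catch_rate = Reception / Target,    # uses the summarised values above
    .groups    = "drop"
  )

print(per_rec)
```

The `.groups = "drop"` argument assumes dplyr 1.0 or later; on older versions, calling `ungroup()` afterwards should have the same effect.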

Related

Specify multiple conditions in long form data in R

How do I index the rows I need based on specific conditions?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
1 & 2: m01time & m12time: measure the number of years elapsed before becoming a level 1 manager (manage1), and then from manage1 to manage2, regardless of whether or not it's at the same job (numeric, in years).
3: change: capture whether or not they experienced a job change between manage1 and manage2, i.e. whether 'left' happens somewhere in between manage1 and manage2 (0 or 1).
4 & 5: m1p & m2p: capture the position held before becoming manage1 and manage2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
   id m01time m12time change    m1p     m2p
1  65       4      NA     NA    reg    <NA>
2 900      NA       5      0   <NA> manage1
3 211       1      NA     NA    reg    <NA>
4  45       3       9      1 intern     reg
I tried to use ifelse with lag() and lead() to capture some conditions, but some parts seem to need a for loop (such as capturing a "left" somewhere in between), and I am not sure how to handle those.
I'd calculate the first three variables differently from m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
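library(data.table)
mydt <- data.table(mydf)
# match() gives the position of the first manage1/manage2 row within each id
# (.I holds global row numbers, so it cannot index the per-group stat); [1] turns "no previous row" into NA
mydt[,.(m1p=stat[match("manage1",stat)-1][1],
m2p=stat[match("manage2",stat)-1][1]),by=id]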
The other variables are more conveniently calculated in a wide data format:
dt <- dcast(unique(mydt,by=c("id","stat")),
formula=id~stat,value.var="age")
dt[,.(m01time = manage1-intern,
m12time = manage2-manage1,
change = manage1<left & left<manage2)]
Two caveats:
reshaping might be quite costly for larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat
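Because of the second caveat, the wide-format `change` can be off when an id has several 'left' rows. One way to check for a 'left' strictly between the first manage1 and the first manage2 without reshaping is sketched below in base R. The helper name `left_between` is made up for this example, and it assumes the rows of each id are already in chronological order, as in the question's data.

```r
# Toy data copied from the question
id   <- c(rep(65, 5), rep(900, 6), rep(211, 7), rep(45, 7))
stat <- c('intern','reg','manage1','left','reg',
          'manage1','manage2','left','reg','reg','left',
          'intern','left','intern','reg','left','reg','manage1',
          'reg','left','intern','manage1','left','reg','manage2')

# Hypothetical helper: 1 if 'left' occurs between the first manage1 and the
# first manage2, 0 if not, NA if either promotion is missing
left_between <- function(s) {
  i1 <- match("manage1", s)   # NA when there is no manage1
  i2 <- match("manage2", s)   # NA when there is no manage2
  if (is.na(i1) || is.na(i2) || i2 < i1) return(NA_integer_)
  as.integer(any(s[i1:i2] == "left"))
}

# One value per id; the result is named by id
change <- tapply(stat, id, left_between)
print(change)
```

On the question's data this gives 1 for id 45, 0 for id 900, and NA for ids 65 and 211, which matches the `change` column in the desired output.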

Using "shift" function in R to subtract one row from another by group

I have a data.table that looks like this:
dt
   id month balance
1:  1     4     100
2:  1     5      50
3:  2     4     200
4:  2     5     135
5:  3     4     100
6:  3     5     100
7:  4     5     300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column that gives the difference between a client's balance in month 5 and in month 4, so that I know the net transactions carried out from one month to the next.
This new variable should tell me that Client 1 withdrew 50, Client 2 withdrew 65 and Client 3 did nothing in aggregate terms between April and May. Client 4 is a new client who joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1, "up")
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most), the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[,transaction := balance - shift(balance,1), by = id]
Big thanks to @Ryan and @Onyambu for helping.
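For reference, here is a small runnable sketch of both variants on the table from the question. The column name `from_first` is made up here so that the two results can sit side by side, and the sort step is an added assumption to make sure each id's rows are in month order.

```r
library(data.table)

# Data copied from the question
dt <- data.table(
  id      = c(1, 1, 2, 2, 3, 3, 4),
  month   = c(4, 5, 4, 5, 4, 5, 5),
  balance = c(100, 50, 200, 135, 100, 100, 300)
)

setorder(dt, id, month)   # make sure rows are in time order within each id

# shift() defaults to type = "lag": each row sees the previous row of its own id,
# so the first row of every id (including the new Client 4) becomes NA
dt[, transaction := balance - shift(balance, 1), by = id]

# Difference from the first balance of each id: no NAs, but only equal to the
# month-to-month change when there are at most two rows per id
dt[, from_first := balance - first(balance), by = id]

print(dt)
```

With more than two months per client, the `shift()` version keeps giving month-to-month changes, while the `first()` version gives the cumulative change since the client's first observation.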

Finding Specific Means and Medians in R

I am working on a school project in R that looks at swimming data compiled from 8 different teams, covering each of the 13 events over 6 years. I have over 8700 rows of appended data and am trying to work out how to extract the specific means I am looking for. For example, I would like to look at the progression of mean times for team 1 in event 3 for men. Thanks!
You can subset your data-frame to only include those variables, e.g.
ss = subset(df, team == 1 & event == 3)
mean(ss$times)
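To see a progression rather than a single number, the subset can be summarised per year. The sketch below uses base R's `aggregate()`. The column names `sex` and `year` and all the values are assumptions made up for illustration, since the question does not show the actual columns.

```r
# Toy data; column names and values are hypothetical
swim <- data.frame(
  team  = c(1, 1, 1, 1, 2, 1),
  event = c(3, 3, 3, 3, 3, 5),
  sex   = c("M", "M", "M", "F", "M", "M"),
  year  = c(2010, 2010, 2011, 2010, 2010, 2010),
  times = c(60, 62, 58, 70, 55, 120)
)

# Keep only team 1, event 3, men
ss <- subset(swim, team == 1 & event == 3 & sex == "M")

# One mean (and one median) per year for that subset
mean_by_year   <- aggregate(times ~ year, data = ss, FUN = mean)
median_by_year <- aggregate(times ~ year, data = ss, FUN = median)

print(mean_by_year)
```

Using `times ~ team + event + sex + year` on the full data frame instead would give the same summary for every combination at once.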

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
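The plots show the structure, but the question asks for a group label per name. One way to get that is to cut the hierarchical tree with `cutree()`. To keep this sketch free of extra packages it swaps in base R's `adist()` (generalized Levenshtein distance) for the LCS matrix used above; `k = 5` is an arbitrary choice for illustration and would need tuning on real data.

```r
CompanyName <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa',
                         'GM', 'general motors', 'the dow chemical company', 'Dow'))

# Pairwise Levenshtein distances (base R stand-in for stringdistmatrix)
d <- adist(CompanyName)
dimnames(d) <- list(CompanyName, CompanyName)

# Hierarchical clustering, then cut the tree into k groups
hc     <- hclust(as.dist(d))
groups <- cutree(hc, k = 5)   # named integer vector: one group id per name

# Names collected by group
print(split(names(groups), groups))
```

With the k-medoids fit above, the analogous labels are available as `pam(sdm_dist, 5)$clustering`. Pure edit distance is unlikely to put 'gm' with 'general motors' or 'dow' with 'the dow chemical company', so abbreviations would probably need a separate lookup step.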

getting the max() of a data frame under certain conditions

I have a rather large dataframe with 13 variables. Here is the first line just to give an idea:
  prov_code nuts1 nuts1name nuts2 nuts2name prov_geoorder  prov_name NUTS_ID EDAD year ORDER graphs       value   prov_geo
1        15     1        NW    11   Galicia             1 La Corunna   ES111   11 1975     1      1 0.000000000 La Corunna
I would like to obtain the maximum for a certain set of variables according to a combination of variables year ORDER and prov_code (ie, f_all being my data.frame: f_all[(f_all$year==1975)&(f_all$ORDER==1)&(f_all$prov_code=="1"),] ). The goal is to repeat the operation in order to obtain a new data frame containing all the maximum values for each year, ORDER, prov_code.
Is there a simple and quick way to do this?
Thanks for any suggestion on the matter,
There are several ways of doing this, for example the one @James mentions. I want to suggest using plyr:
library(plyr)
ddply(f_all, .(year, ORDER, prov_code), summarise, mx_value = max(value))
Alternatively, if you have a lot of data, data.table provides similar functionality, but is much much faster in that case.
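A minimal sketch of the data.table version, assuming the same column names as in the question. The rows below are invented for illustration, and `f_dt` and `mx` are names made up here.

```r
library(data.table)

# Toy stand-in for f_all; values are invented
f_all <- data.frame(
  year      = c(1975, 1975, 1975, 1976),
  ORDER     = c(1, 1, 2, 1),
  prov_code = c("1", "1", "1", "1"),
  value     = c(0.0, 0.4, 0.2, 0.9)
)

f_dt <- as.data.table(f_all)

# One row per (year, ORDER, prov_code) combination with its maximum value
mx <- f_dt[, .(mx_value = max(value)), by = .(year, ORDER, prov_code)]

print(mx)
```

Groups come back in order of first appearance; using `keyby` instead of `by` would also sort the result by the grouping columns.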
