Creating lag variables for matched factors - r

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset that has many players across 2002-2012. Obviously I only want lag variables within the same player, to build a career arc for predicting the current stat. For example, I want to use lag 1 Average (2003) and lag 2 Average (2004) to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is row n-1), checks if the name is the same, and if so grabs the value from the previous row.
Here is my loop:
for (i in 2:nrow(TS)) {  # start at 2, since row 1 has no previous row
  if (TS$name[i] == TS$name[i-1]) {
    TS$runvalueL1[i] <- TS$Run_Value[i-1]
  } else {
    TS$runvalueL1[i] <- NA  # first season for this player
  }
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Sample data won't help a bunch, but here is some:
edit: The generated sample data wasn't producing usable results, so I just attached the first 10 rows of my dataset. Thanks!
TS[1:10, c('name','Season','Run_Value')]
                  name Season Run_Value
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!

Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := shift(Run_value), by = names]
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA

An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
There is probably also a nice possibility with plyr, but as you did not provide a reproducible example, that is all for a start.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
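Here `your_lag_function` is just a placeholder. A minimal sketch, assuming TS is sorted by name and then year as stated in the question: split the Run_Value column (rather than the whole data frame) by name, lag each piece, and flatten back into a column:

```r
# shift a vector down by one position, padding the front with NA
lag1 <- function(x) c(NA, x[-length(x)])

# because TS is sorted by name, unlist() restores the original row order
TS$runvalueL1 <- unlist(lapply(split(TS$Run_Value, TS$name), lag1))
```

Splitting just the column avoids handing a whole data frame to a function that expects a vector.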

This is obviously not a problem where you want to create a matrix with cbind, so a data frame is the better data structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
FUN= function(x) c(NA, x[-length(x)] ) )
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct with NA, since that helps prevent logic errors that using 0 for the prior year in year 1 would not alert you to.
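For completeness: if the dplyr package is available, the same per-name lag can be written with a grouped lag() call (a sketch on the full data frame built above; the rows must already be sorted by name and year):

```r
library(dplyr)

full <- full %>%
  group_by(names) %>%
  mutate(Lag1 = lag(Run_value)) %>%  # previous value within each name; NA for the first row
  ungroup()
```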

Related

R: Create new dataframe using levels from 2 factors

Is it possible to create a new dataframe using levels of two factors?
E.g. I have two factors that look like this:
table(df$year,df$name)
BENJAMIN RALPH GEORGE
2001 0 70 40
2002 16 91 0
2003 84 0 33
I am trying to create a new dataframe that I can populate with coefficients from model outputs where each year is examined separately and name is a factor in the models for each year
i.e. I would like it to look like this:
YEAR NAME Coefficient
2001 RALPH <blank>
2001 GEORGE <blank>
2002 BENJAMIN <blank>
2002 RALPH <blank>
2003 BENJAMIN <blank>
2003 GEORGE <blank>
Many thanks for any help!
You can use duplicated to keep only unique combinations of the year and name columns, and transform to create a Coefficient column filled with NA.
cols <- c('year', 'name')
result <- transform(df[!duplicated(df[cols]), cols], Coefficient = NA)
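A quick sketch of how that behaves, using a made-up df in the long format the question implies (one row per observation; the duplicated row 2 gets dropped):

```r
df <- data.frame(year = c(2001, 2001, 2001, 2002, 2002, 2003, 2003),
                 name = c("RALPH", "RALPH", "GEORGE", "BENJAMIN", "RALPH",
                          "BENJAMIN", "GEORGE"))

cols <- c('year', 'name')
result <- transform(df[!duplicated(df[cols]), cols], Coefficient = NA)
result
#   year     name Coefficient
# 1 2001    RALPH          NA
# 3 2001   GEORGE          NA
# 4 2002 BENJAMIN          NA
# 5 2002    RALPH          NA
# 6 2003 BENJAMIN          NA
# 7 2003   GEORGE          NA
```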

succinct code for repetitive subsetting in R

I am an R beginner and am having trouble finding a better way to recode an element of a dataframe. I have data with a column for the year it was sampled (assessed), but I want to run some tests on biennial subsets (not annual as it is formatted). Therefore I want two consecutive years to be identified by the assessment year. I think I could run something like:
ddd$Assessment[ddd$Assessment==1997 | ddd$Assessment==1998] <- 1998
but feel there must be a better way (I know I don't need the second half of the code above but just left it in for clarity) especially as I have a lot of data spanning 23 years.
Any help would be very much appreciated
If your assessment year is consistently every other year, here is one way to create your biennial column by using the properties of the ceiling function.
ddd <- data.frame(Assessment = 1997:2006)
ddd$biennial <- ceiling(ddd$Assessment/2)*2
ddd
# Assessment biennial
#1 1997 1998
#2 1998 1998
#3 1999 2000
#4 2000 2000
#5 2001 2002
#6 2002 2002
#7 2003 2004
#8 2004 2004
#9 2005 2006
#10 2006 2006
To code biennial years and make sure that no future user of the data-set is mistaken in what this column actually represents, I'd rather use cut:
ddd <- data.frame(Assessment = 1997:2006)
ddd$biennial <- cut(ddd$Assessment, breaks = seq(1996, 2008, by=2), right = FALSE)
ddd
# Assessment biennial
#1 1997 [1996,1998)
#2 1998 [1998,2000)
#3 1999 [1998,2000)
#4 2000 [2000,2002)
#5 2001 [2000,2002)
#6 2002 [2002,2004)
#7 2003 [2002,2004)
#8 2004 [2004,2006)
#9 2005 [2004,2006)
#10 2006 [2006,2008)

Avoid For-Loops in R

I'm sure this question has been posed before, but would like some input on my specific question. In return for your help, I'll use an interesting example.
Sean Lahman provides giant datasets of MLB baseball statistics, available free on his website (http://www.seanlahman.com/baseball-archive/statistics/).
I'd like to use this data to answer the following question: What is the average number of home runs per game recorded for each decade in the MLB?
Below I've pasted all relevant script:
teamdata = read.csv("Teams.csv", header = TRUE)
decades = c(1870,1880,1890,1900,1910,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010,2020)
meanhomers = c()
for (i in 1:length(decades)) {
  meanhomers[i] = mean(teamdata$HR[teamdata$yearID >= decades[i] & teamdata$yearID < decades[i+1]])
}
My primary question is, how could this answer have been determined without resorting to the dreaded for-loop?
Side question: What simple script would have generated the decades vector for me?
(For those interested in the answer to the baseball question, see below.)
meanhomers
[1] 4.641026 23.735849 34.456522 20.421053 25.755682 61.837500 84.012500
[8] 80.987500 130.375000 132.166667 120.093496 126.700000 148.737410 173.826667
[15] 152.973333 NaN
Edit for clarity: Turns out I answered the wrong question; the answer provided above indicates the number of home runs per team per year, not per game. A little fix of the denominator would get the correct result.
Here's a data.table example (note that teamdata must be converted to a data.table first). Because others showed how to use cut, I took another route for splitting the data into decades:
library(data.table)
setDT(teamdata)
teamdata[, list(HRperYear = mean(HR)), by = 10*floor(yearID/10)]
However, the original question mentions average HRs per game, not per year (though the code and answers clearly deal with HRs per year).
Here's how you could compute average HRs per game (and average games per team per year):
teamdata[,list(HRperYear=mean(HR),HRperGame=sum(HR)/sum(G),games=mean(G)),by=10*floor(yearID/10)]
floor HRperYear HRperGame games
1: 1870 4.641026 0.08911866 52.07692
2: 1880 23.735849 0.21543555 110.17610
3: 1890 34.456522 0.25140108 137.05797
4: 1900 20.421053 0.13686067 149.21053
5: 1910 25.755682 0.17010657 151.40909
6: 1920 61.837500 0.40144445 154.03750
7: 1930 84.012500 0.54593453 153.88750
8: 1940 80.987500 0.52351325 154.70000
9: 1950 130.375000 0.84289640 154.67500
10: 1960 132.166667 0.81977946 161.22222
11: 1970 120.093496 0.74580935 161.02439
12: 1980 126.700000 0.80990313 156.43846
13: 1990 148.737410 0.95741873 155.35252
14: 2000 173.826667 1.07340167 161.94000
15: 2010 152.973333 0.94427984 162.00000
(The low average game totals in the 1980's and 1990's are due to the 1981 and 1994-5 player strikes).
PS: Nicely-written question, but it would be extra nice for you to provide a fully reproducible example so that I don't have to go and download the CSV to answer your question. Making dummy data is OK.
You can use seq to generate sequences.
decades <- seq(1870, 2020, by=10)
You can use cut to split up numeric variables into intervals.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, dig.lab=4)
Basically it creates a factor with one level for each decade (as specified by the breaks). The dig.lab=4 is just so it prints the years as e.g. "1870" not "1.87e+03".
See ?cut for further configuration (e.g. is '1980' included in this decade or the next one, & so on. You can even configure the labels if you think you'll use them.)
Then to do something for each decade, use the plyr package (data.table and dplyr are other options, but I think plyr has the easiest learning curve, and your data does not seem large enough to need data.table).
library(plyr)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
decade meanhomers
1 (1870,1880] 4.930233
2 (1880,1890] 25.409091
3 (1890,1900] 35.115702
4 (1900,1910] 20.068750
5 (1910,1920] 27.284091
6 (1920,1930] 67.681250
7 (1930,1940] 84.050000
8 (1940,1950] 84.125000
9 (1950,1960] 130.718750
10 (1960,1970] 133.349515
11 (1970,1980] 117.745968
12 (1980,1990] 127.584615
13 (1990,2000] 155.053191
14 (2000,2010] 170.226667
15 (2010,2020] 152.775000
Mine is a little different from yours because my intervals are (, ] whereas yours are [, ). You can adjust cut to switch these around.
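For example, passing right=FALSE gives the [, ) intervals of the original loop, so the per-decade means then match its left-closed binning (a sketch reusing the decades vector and plyr from above):

```r
# left-closed intervals: 1880 now falls in the 1880s, not the 1870s
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, dig.lab=4)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
```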
You can also use the sqldf package in order to use SQL queries on the data.
Here is the code:
library(sqldf)
sqldf("select floor(yearID/10)*10 as decade, avg(HR) as count
       from teamdata
       group by decade;")
decade count
1 1870 4.641026
2 1880 23.735849
3 1890 34.456522
4 1900 20.421053
5 1910 25.755682
6 1920 61.837500
7 1930 84.012500
8 1940 80.987500
9 1950 130.375000
10 1960 132.166667
11 1970 120.093496
12 1980 126.700000
13 1990 148.737410
14 2000 173.826667
15 2010 152.973333
aggregate is handy for this sort of thing. You can use your decades object with findInterval to put the years into bins:
aggregate(HR ~ findInterval(yearID, decades), data=teamdata, FUN=mean)
## findInterval(yearID, decades) HR
## 1 1 4.641026
## 2 2 23.735849
## 3 3 34.456522
## 4 4 20.421053
## 5 5 25.755682
## 6 6 61.837500
## 7 7 84.012500
## 8 8 80.987500
## 9 9 130.375000
## 10 10 132.166667
## 11 11 120.093496
## 12 12 126.700000
## 13 13 148.737410
## 14 14 173.826667
## 15 15 152.973333
Note that the intervals used are left-closed, as you desire. Also note that the intervals need not be regular. Yours are, which leads to the "side question" of how to produce the decades vector: don't even compute it. Instead, directly compute which decade each year falls in:
aggregate(HR ~ I(10 * (yearID %/% 10)), data=teamdata, FUN=mean)
## I(10 * (yearID%/%10)) HR
## 1 1870 4.641026
## 2 1880 23.735849
## 3 1890 34.456522
## 4 1900 20.421053
## 5 1910 25.755682
## 6 1920 61.837500
## 7 1930 84.012500
## 8 1940 80.987500
## 9 1950 130.375000
## 10 1960 132.166667
## 11 1970 120.093496
## 12 1980 126.700000
## 13 1990 148.737410
## 14 2000 173.826667
## 15 2010 152.973333
I usually prefer the formula interface to aggregate as used above, but you can get better names directly by using the non-formula interface. Here's the example for each of the above:
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=findInterval(yearID,decades)), FUN=mean))
## Decade mean.HR
## 1 1 4.641026
## ...
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=10 * (yearID %/% 10)), FUN=mean))
## Decade mean.HR
## 1 1870 4.641026
## ...
dplyr::group_by, mixed with cut is a good option here, and avoids looping. The decades vector is just a stepped sequence.
decades <- seq(1870,2020,by=10)
cut breaks the data into categories, which I've labelled by the decades themselves for clarity.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, labels=decades[1:(length(decades)-1)])
Then dplyr handles the grouped summarise as neatly as you could hope
library(dplyr)
teamdata %>% group_by(decade) %>% summarise(meanhomers=mean(HR))
# decade meanhomers
# (fctr) (dbl)
# 1 1870 4.641026
# 2 1880 23.735849
# 3 1890 34.456522
# 4 1900 20.421053
# 5 1910 25.755682
# 6 1920 61.837500
# 7 1930 84.012500
# 8 1940 80.987500
# 9 1950 130.375000
# 10 1960 132.166667
# 11 1970 120.093496
# 12 1980 126.700000
# 13 1990 148.737410
# 14 2000 173.826667
# 15 2010 152.973333

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are sub sampled), so I can't just create a normal average, I would like to bootstrap samples.
The bootstrap should take out 5 random values of weight at an age, create a mean value, repeat this 1000 times, and then create an average of those means. Sampling should be with replacement. This should be done for each age at every AreaCode for every year. Dependent factors: Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, every age does not exist for every location in every year. I don't know if this would be a problem; it just means the script shouldn't rely on references to particular row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I maybe should use replicate, with apply or another plyr function. I've tried to understand the boot function, but I don't really know whether my arguments would go in statistic, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the ifelse() statement with its last argument.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  if (Age[1] > 2) {  # old fish: plain mean
    mean(IndWgt)
  } else {           # young fish: bootstrap the mean
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you didn't provide a fully reproducible example, I haven't tested this thoroughly. You should get your first step with the following code; if you wrap it in replicate, you get draws that you can average for your end result.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = function(x) {
  mean(sample(x, size = 5, replace = TRUE))
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
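Wrapped in replicate, the whole thing might look something like this (a sketch, untested against the real data; it relies on aggregate returning the groups in the same order on every call, which it does for fixed input):

```r
# 1000 bootstrap replicates: each column holds one set of per-group means
reps <- replicate(1000,
                  aggregate(IndWgt ~ Year + AreaCode + Age, data = df,
                            FUN = function(x) mean(sample(x, 5, replace = TRUE)))$IndWgt)

# attach the averaged bootstrap means to one copy of the grouping keys
result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = mean)
result$IndWgt <- NULL                 # drop the plain mean; keep only the keys
result$boot_mean <- rowMeans(reps)    # average across the 1000 replicates
```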

How to take the mean of last 10 values in a column before a missing value using R?

I am new to R and having trouble figuring out how to go about this. I have data on tree growth rates from dead trees, organized by year. So, my first column is year and the columns to the right are growth rates for individual trees, ending in the year each tree died. After the tree died, the values are "NA" for the remaining years in the dataset. I need to take the mean growth for the 10 years preceding each tree's death, but each tree died in a different year. Does anyone have an idea for how to do this? Here is an example of what a dataset might look like:
Year Tree1 Tree2 Tree3
1989 53.00 84.58 102.52
1990 63.68 133.16 146.07
1991 90.37 103.10 233.58
1992 149.24 127.61 245.69
1993 96.20 54.78 417.96
1994 230.64 60.92 125.31
1995 150.81 60.98 100.43
1996 124.25 42.73 75.43
1997 173.42 67.20 50.34
1998 119.60 73.40 32.43
1999 179.97 61.24 NA
2000 114.88 67.43 NA
2001 82.23 55.23 NA
2002 49.40 NA NA
2003 93.46 NA NA
2004 104.67 NA NA
2005 44.14 NA NA
2006 88.40 NA NA
So, the averages I need to calculate are:
Tree1: mean(1997-2006) = 105.01
Tree2: mean(1992-2001) = 67.15
Tree3: mean(1989-1998) = 152.98
Since I need to do this for a large number of trees, it would be helpful to have a method of automating the calculation. Thank you very much for any help! Katie
You can use sapply and tail together with na.omit as follows:
sapply(mydf[-1], function(x) mean(tail(na.omit(x), 10)))
# Tree1 Tree2 Tree3
# 105.017 67.152 152.976
mydf[-1] says to drop the first column. tail has an argument, n, that lets you specify how many values you want from the end (tail) of your data. Here, we've set it to 10 since you want the last 10 values. Then, assuming that there are no NA values in your actual data from the years while the trees were alive, you can safely use na.omit on your data.
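One caveat worth guarding against: if a tree had stray NA values during its lifetime, na.omit would splice non-adjacent years together. A sketch that instead anchors the 10-year window at the last non-NA value (the year of death); for the sample data above it gives the same results:

```r
mean_before_death <- function(x, n = 10) {
  last <- max(which(!is.na(x)))             # row of the tree's final recorded year
  mean(x[max(1, last - n + 1):last], na.rm = TRUE)
}
sapply(mydf[-1], mean_before_death)
```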
