Grouping and Std. Dev in R

I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the standard deviation of sales over the previous four years. Where four years of prior data are not available, as in 2006, 2005, and 2004, I want to get NA. How can I create a new column with the value corresponding to each year? The new data would look like this:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit :
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the suggested rollapply function in this manner:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting following error.
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.

You can use rollapply from package zoo:
require(zoo)
rollapply(dt$Sale, 4, sd, fill=NA, align="right")
[edit] I used your data frame as sorted by year. If you have it in original order, you will probably need to use align="left"

This is how I solved the problem:
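library(sqldf); library(TTR) # sqldf() and runSD()
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt <- sqldf("select GVKEY, FYEAR, IBC from dt")
# SD of the previous four years within each GVKEY; NA where fewer than four earlier years exist
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = function(x) {
if (length(x) > 3) c(NA, head(runSD(x, 4), -1)) else rep(NA_real_, length(x))})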
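For comparison, here is a minimal base-R sketch of the same idea that does not need runSD or sqldf. The helper name prev_sd and its argument k are made up for illustration. The sample rows are the GVKEY data from the question, and the result column is the standard deviation of the four values before each row, computed separately per GVKEY.

```r
# Hypothetical helper (not from any package): SD of the k values that
# come before each position; NA where fewer than k earlier values exist.
prev_sd <- function(x, k = 4) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    for (i in (k + 1):n) out[i] <- sd(x[(i - k):(i - 1)])
  }
  out
}

# Sample rows copied from the question
dt <- data.frame(
  GVKEY = 1004,
  FYEAR = 2003:2007,
  IBC   = c(3.504, 18.572, 35.163, 59.447, 75.745)
)

# Sort so that "previous rows" means "previous years" inside each GVKEY
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = prev_sd)
dt
```

Only the 2007 row gets a value here, because it is the only row with four earlier years in its group.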

Related

Got an error using ifelse inside mutate inside the for loop

I have a list of 244 data frames which looks like the following:
The name of the list is datas.
datas[[1]]
year sal
2000 10000
2000 15000
2005 10000
2005 9000
2005 12000
2010 15000
2010 12000
2010 20000
2013 25000
2013 15000
2015 20000
I would like to make a new column called fix.sal by multiplying sal by a different value for each year. For example, sal values in rows with year 2000 are multiplied by 2. In the same way, the multiplier is 1.8 for 2005, 1.5 for 2010, 1.2 for 2013, and 1 for 2015. So the result should look like this:
Year sal fix.sal
2000 10000 20000
2000 15000 30000
2005 10000 18000
2005 9000 16200
2005 12000 21600
2010 15000 22500
2010 12000 18000
2010 20000 30000
2013 25000 30000
2013 15000 18000
2015 20000 20000
I managed to do this for one data frame by using ifelse inside mutate from the dplyr package.
library(dplyr)
datas[[1]]<-mutate(datas[[1]], fix.sal=
ifelse(datas[[1]]$Year==2000,datas[[1]]$sal*2,
ifelse(datas[[1]]$Year==2005,datas[[1]]$sal*1.8,
ifelse(datas[[1]]$Year==2010,datas[[1]]$sal*1.5,
ifelse(datas[[1]]$Year==2013,datas[[1]]$sal*1.2,
datas[[1]]$sal*1)))))
But I have to do this operation to the 244 data frames in the list datas.
So I tried to do it using the for loop like this;
for(i in 1:244){
datas[[i]]<-mutate(datas[[i]], fix.sal=
ifelse(datas[[i]]$Year==2000,datas[[i]]$sal*2,
ifelse(datas[[i]]$Year==2005,datas[[i]]$sal*1.8,
ifelse(datas[[i]]$Year==2010,datas[[i]]$sal*1.5,
ifelse(datas[[i]]$Year==2013,datas[[i]]$sal*1.2,
datas[[i]]$sal*1)))))
}
Then there came an error;
Error: invalid subscript type 'integer'
How can I solve this...?
Any comments will be greatly appreciated! :)
Please don't force yourself to use ifelse for this. Instead, create a vector with your multipliers, then use the year to select from the vector. The vector will look something like this:
multiplier <-
c("2005" = 1.2
, "2006" = 1.05
, "2007" = 0.9)
Fill in whatever your multiplier is for each year in your data. Then, here is some sample data (all the same, but that doesn't matter):
datas <-
lapply(1:3, function(idx){
data.frame(
Year = 2005:2007
, sal = c(10, 20, 30)
)
})
Finally, we can then use lapply to loop through the list more efficiently. Each time through, it uses the Year to pick a value from the multipliers vector (note the use of as.character, otherwise it will pick, e.g., the 2005th entry, instead of the one named "2005").
lapply(datas, function(x){
mutate(x, fix.sal = sal*multiplier[as.character(Year)])
})
returns:
[[1]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[2]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[3]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
For more compact code, you can use:
lapply(datas, mutate, fix.sal = sal*multiplier[as.character(Year)])
but that makes it slightly less clear to me what is happening.
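As a sketch of how the lookup-vector approach might be applied to the multipliers stated in the question (2 for 2000, 1.8 for 2005, 1.5 for 2010, 1.2 for 2013, 1 for 2015), here is a base-R version that does not need dplyr. It assumes the column is named year, as in the printed data. The two-element list stands in for the real list of 244 data frames.

```r
# Multipliers taken from the question, named by year
multiplier <- c("2000" = 2, "2005" = 1.8, "2010" = 1.5, "2013" = 1.2, "2015" = 1)

# Stand-in for the real list: two copies of the question's example data
df <- data.frame(
  year = c(2000, 2000, 2005, 2005, 2005, 2010, 2010, 2010, 2013, 2013, 2015),
  sal  = c(10000, 15000, 10000, 9000, 12000, 15000, 12000, 20000, 25000, 15000, 20000)
)
datas <- list(df, df)

# Look up each row's multiplier by the year converted to a name
datas <- lapply(datas, function(x) {
  x$fix.sal <- x$sal * unname(multiplier[as.character(x$year)])
  x
})
datas[[1]]
```

A year that is missing from multiplier would give NA in fix.sal, which makes gaps in the lookup table easy to spot.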
Here's a simple solution using ifelse and lapply:
# Creating the list
df <- data.frame(year=c(rep(2000,2),rep(2005,3),rep(2010,3),rep(2013,2),2015),
sal=c(10000,15000,10000,9000,12000,15000,12000,20000,25000,15000,20000))
datas <- list(df,df)
# Applying the function with ifelse
lapply(datas,function(x){
# use x (the current list element), not the global df
outp <- ifelse(x$year==2000,x$sal*2,
ifelse(x$year==2005,x$sal*1.8,
ifelse(x$year==2010,x$sal*1.5,
ifelse(x$year==2013,x$sal*1.2,x$sal*1))))
return(outp)
})
You'll get the result for each df inside the list.

can't get year factor to point on x axis

I ran this:
> x<-tapply(positive$Emissions, as.factor(positive$year), sum)
> x
1999 2002 2005 2008
7332967 5635780 5454703 3464206
Then ran:
plot(x)
And I keep getting this:
I would like the x axis to show the year, not a numeric scale. Less importantly, I'd like the y axis to not show engineering numbers, but something more readable. I know I can divide by 1000 and get it to print a regular number. But showing the year is more important. I'm constrained to using the base plot functions.
The positive$year column is originally an integer. The positive$Emissions column is numeric.
How do I force this? I've had other plots that do this automatically, but were not operating off of tapply results. I'm willing to pursue something besides the tapply function to get results, but previous attempts failed.
I tried this:
> plot(as.factor(positive$year),sum(positive$Emissions),ylab="Annual Emmissions in tons")
Error in model.frame.default(formula = y ~ x) :
variable lengths differ (found for 'x')
and understand the error, but don't know how to work around it, i.e. don't know how to get positive$year down to 4 values to match the 4 sums.
Data looks like this:
> head(positive)
Emissions year
4 15.714 1999
8 234.178 1999
12 0.128 1999
16 2.036 1999
20 0.388 1999
24 1.490 1999
6 million rows, with 4 year categories.
Any pointers please.
positive <- data.frame(Emissions =rnorm(30), year=c(1999,2000,2001,2002,2003))
positive
# Solution #1
x<-tapply(positive$Emissions, as.character(positive$year), sum)
x
plot(x=x,y=names(x),ylab="year",xlab="Emissions")
# Solution #2
x <- aggregate(positive,by=list(positive$year),sum)
plot(x=x$Emissions, y=x$Group.1) # same plot as above
Your problem is probably that x$year is a factor. If it is numeric, you shouldn't have this issue:
x <- read.table(text=" 1999 2002 2005 2008
7332967 5635780 5454703 3464206", header=F)
x <- t(as.matrix(x))
rownames(x) <- NULL
colnames(x) <- c("year", "emissions")
x <- as.data.frame(x)
x
# year emissions
# 1 1999 7332967
# 2 2002 5635780
# 3 2005 5454703
# 4 2008 3464206
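Since the question asks for the years on the x axis using only base plotting, here is one possible sketch: suppress the default axes and draw them with axis(), using the names of the tapply result as labels. The simulated positive data frame is an assumption standing in for the real 6 million rows.

```r
set.seed(1)
# Simulated stand-in for the real data: four year categories
positive <- data.frame(
  Emissions = runif(40, 0, 1e6),
  year      = rep(c(1999, 2002, 2005, 2008), each = 10)
)

# One total per year; the names of the result are the years
totals <- tapply(positive$Emissions, positive$year, sum)
years  <- as.numeric(names(totals))

# Draw without default axes, then add axes with explicit labels
plot(years, as.numeric(totals), type = "b", xaxt = "n", yaxt = "n",
     xlab = "Year", ylab = "Total emissions")
axis(1, at = years, labels = names(totals))
ticks <- axTicks(2)
axis(2, at = ticks, labels = format(ticks, scientific = FALSE, big.mark = ","))
```

Formatting the y tick labels with scientific = FALSE is one way to avoid exponent notation without rescaling the data.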

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "year") and around 6,400,000 rows. It holds data points for pollutant levels of "PM2.5" for the years 1999, 2002, 2005 and 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
The image shows the first 20 rows of the data.frame.
Since the column "year" is sorted, it shows only 1999.
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
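Two other base-R spellings of the same per-year sum, shown as a small sketch on made-up values: the formula interface of aggregate, which keeps the original column names, and tapply, which returns a named vector.

```r
# Small made-up sample with the same column names as the question
my_data <- data.frame(
  year  = factor(c("1999", "2002", "2002", "2005", "2005", "2008")),
  PM2.5 = c(5.5, 5.5, 4.5, 6.5, 5.0, 6.0)
)

# Formula interface: result columns are named year and PM2.5
totals <- aggregate(PM2.5 ~ year, data = my_data, FUN = sum)

# tapply: a named vector with one entry per year
by_year <- tapply(my_data$PM2.5, my_data$year, sum)

totals
by_year
```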

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just take a normal average. I would like to bootstrap the samples instead.
The bootstrap should draw 5 random weights at a given age, compute their mean, repeat this 1000 times, and then average the means. Values should be drawn with replacement. This should be done for each age in every AreaCode for every year. Dependent factors: Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, not every age exists for every location and every year. I don't know if this would be a problem, as long as the script isn't based on references to specific row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I maybe should use replicate, and apply or another plyr function. I've tried to understand the boot function but I don't really know if I would write my arguments under statistics, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just use the young-fish branch for every age.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap<-function(Age,IndWgt){
if (Age[1]>2) { # Age is constant within each Year/AreaCode/Age group
mean(IndWgt) # old fish: plain mean
} else {
mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))) # young fish: bootstrap
}
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = data, FUN = function(x) {
rws <- length(x)
get.em <- sample(x, size = 5, replace = TRUE)
out <- mean(get.em)
out
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
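Putting the pieces together, here is a base-R sketch of the full bootstrap per Year/AreaCode/Age group. The helper name boot_mean and its arguments size and reps are invented for illustration. The data frame is the example from the question, named cod here. The formula interface of aggregate drops rows with NA weights by default and only returns combinations that actually occur, so missing year/age/location combinations are simply absent from the result.

```r
set.seed(42)
# Example data from the question (shorter columns are recycled to 18 rows)
cod <- data.frame(
  Year     = rep(2000:2008, 2),
  AreaCode = c("39G4", "38G5", "40G5"),
  Age      = 0:8,
  IndWgt   = rnorm(18, mean = 5, sd = 3)
)

# Hypothetical helper: average of `reps` bootstrap means,
# each computed from `size` weights drawn with replacement.
boot_mean <- function(w, size = 5, reps = 1000) {
  w <- w[!is.na(w)]
  # Sampling indices avoids sample()'s special case when w is a single number
  mean(replicate(reps, mean(w[sample.int(length(w), size, replace = TRUE)])))
}

weight_at_age <- aggregate(IndWgt ~ Year + AreaCode + Age,
                           data = cod, FUN = boot_mean)
weight_at_age
```

With the example data each group holds two fish, so every bootstrap mean falls between those two weights.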

Refer to relative rows in R

I know this answer must be out there, but I can't figure out how to word the question.
I'd like to calculate the differences between values in my data.frame.
from this:
f <- data.frame(year=c(2004, 2005, 2006, 2007), value=c(8565, 8745, 8985, 8412))
year value
1 2004 8565
2 2005 8745
3 2006 8985
4 2007 8412
to this:
year value diff
1 2004 8565 NA
2 2005 8745 180
3 2006 8985 240
4 2007 8412 -573
(ie value of current year minus value of previous year)
But I don't know how to have a result in one row that is created from another row. Any help?
Thanks,
Tom
There are many different ways to do this, but here's one:
f[, "diff"] <- c(NA, diff(f$value))
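More generally, if you want to refer to relative rows, you can use a lag function such as dplyr::lag() (base R's lag() only shifts the time index of a time series, not the values of a plain vector) or do it directly with indexes: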
f[-1,"diff"] <- f[-1, "value"] - f[-nrow(f), "value"]
Use the diff function
f <- cbind(f, diff = c(NA, diff(f[,2])))
If year column isn't sorted then you could use match:
f$diff <- f$value - f$value[match(f$year-1, f$year)]
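A short sketch of the match approach on deliberately shuffled rows, using the values from the question. For each row, match finds the position of the previous year, and a year with no predecessor gives NA.

```r
# Same values as the question, but with the rows out of order
f <- data.frame(
  year  = c(2006, 2004, 2007, 2005),
  value = c(8985, 8565, 8412, 8745)
)

# Position of the previous year for every row (NA if it is absent)
prev <- match(f$year - 1, f$year)
f$diff <- f$value - f$value[prev]

# Sort only for display; the differences are already correct
f <- f[order(f$year), ]
f
```

Unlike diff, this version also gives NA across gaps in the years instead of subtracting non-adjacent years.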

Resources