Aggregate using a certain value - r

I'm trying to use the aggregate function in R to get the mean EMISSIONS, organized by YEAR, but only for rows where FIPS is equal to 24510. The following code gives me the right result, but in addition it also adds the overall EMISSIONS, summed across all FIPS values. What am I missing here?
This is the function I'm using:
sum <- aggregate(NEI$Emissions, list(Year = NEI$year, NEI$fips == 24510), sum);
This is the output:
Year Group.2 x
1 1999 FALSE 7329692.557
2 2002 FALSE 5633326.582
3 2005 FALSE 5451611.723
4 2008 FALSE 3462343.556
5 1999 TRUE 3274.180
6 2002 TRUE 2453.916
7 2005 TRUE 3091.354
8 2008 TRUE 1862.282
This is the output that I would like:
Year x
1 1999 3274.180
2 2002 2453.916
3 2005 3091.354
4 2008 1862.282
Should I be using subset separately or can this be done with aggregate alone?

Using this sample
set.seed(15)
NEI <- data.frame(year=2000:2004, fips=rep(c(24510,57399), each=5), Emissions=rnorm(10))
you could use the command
mysum <- aggregate(Emissions~year, subset(NEI, fips == 24510), sum);
to get
year Emissions
1 2000 0.2588229
2 2001 1.8311207
3 2002 -0.3396186
4 2003 0.8971982
5 2004 0.4880163
(also, don't save a value to a variable named sum -- that will conflict with the base function sum())

Related

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frame and what I am trying to do is that for each year, I want to check data frame 2 (in the same year) and multiply a column from data frame 1 by the found number. So for example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes inflation conversion rate (code from #akrun)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data<-inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do is that for example for the first row of data frame 1, it's the year 2001, so I go and found a conversion for the year 2001 from data frame 2 which is 1.201035 and then multiply the price in a data frame 1 by this found conversion rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
is there any way to do this without using else and if commands?
We can do a join on the 'year' with 'ref_year' and create the new column by assigning (:=) the output of product of 'price' and 'final_inf'
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df

Adding data points in a column by factors in R

The data.frame my_data consists of two columns("PM2.5" & "years") & around 6400000 rows. The data.frame has various data points for pollutant levels of "PM2.5" for years 1999, 2002, 2005 & 2008.
This is what i have done to the data.drame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e sum of all data points under PM2.5) according to different year. How can I do it.
!The image shows the first 20 rows of the data.frame.
Since the column "years" is arranged, it is showing only 1999
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134

efficient date comparison in data table

I have a data frame (actually a data table) that looks like
id hire.date survey.year
1 15-04-2003 2003
2 16-07-2001 2001
3 06-06-1980 2002
4 17-08-1981 2001
I need to check if hire.date is less than say 31st March of survey.year. So I would end up with something like
id hire.date survey.year emp31mar
1 15-04-2003 2003 FALSE
2 16-07-2001 2001 FALSE
3 06-06-1980 2002 TRUE
4 17-08-1981 2001 TRUE
I could always create an object holding March 31st of survey.year and then make the appropriate comparison like so
mar31 = as.Date(paste0("31-03-", as.character(myData$survey.year)), "%d-%m-%Y")
myData$emp31 = myData$hiredate < mar31
but creating the object mar31 is consuming too much time because myData is large-ish (think tens of millions of rows).
I wonder if there is a more efficient way of doing this -- a way that doesn't involve creating an object such as mar31?
You could try the data.table methods for creating the column.
library(data.table)
setDT(df1)[, emp31mar:= as.Date(hire.date, '%d-%m-%Y') <
paste(survey.year, '03-31', sep="-")][]
# id hire.date survey.year emp31mar
#1: 1 15-04-2003 2003 FALSE
#2: 2 16-07-2001 2001 FALSE
#3: 3 06-06-1980 2002 TRUE
#4: 4 17-08-1981 2001 TRUE

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want do this for each location at each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are sub sampled), so I can't just create a normal average, I would like to bootstrap samples.
The bootstrap should take out 5 random values of weight at an age, create a mean value and repeat this a 1000 times, and then create an average of the means. The values should be able to be used again (replace). This should be done for each age at every AreaCode for every year. Dependent factors: Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, every age does not exist for every location and every year. Don't know if this would be a problem, just so the scripts isn't based on references to certain row number. There are some NA values in the weight column, but I could just remove them before hand.
I was thinking that I maybe should use replicate, and apply or another plyr function. I've tried to understand the boot function but I don't really know if I would write my arguments under statistics, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the ifelse() statement with its last argument.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap<-function(Age,IndWgt){
ifelse(Age>2, # treat differently for old/young fish
res<-mean(IndWgt), # old fish mean
res<-mean(replicate(1000,sample(IndWgt,5,replace = TRUE))) # young fish bootstrap
)
return(res)
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = data, FUN = function(x) {
rws <- length(x)
get.em <- sample(x, size = 5, replace = TRUE)
out <- mean(get.em)
out
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.

writing the outcome of a nested loop to a vector object in R

I have the following data read into R as a data frame named "data_old":
yes year month
1 15 2004 5
2 9 2005 6
3 15 2006 3
4 12 2004 5
5 14 2005 1
6 15 2006 7
. . ... .
. . ... .
I have written a small loop which goes through the data and sums up the yes variable for each month/year combination:
year_f <- c(2004:2006)
month_f <- c(1:12)
for (i in year_f){
for (j in month_f){
x <- subset(data_old, month == j & year == i, select="yes")
if (nrow(x) > 0){
print(sum(x))
}
else{print("Nothing")}
}
}
My question is this: I can print the sum for each month/year combination in the terminal, but how do i store it in a vector? (the nested loop is giving me headaches trying to figure this out).
Thomas
Another way,
library(plyr)
ddply(data_old,.(year,month),function(x) sum(x[1]))
year month V1
1 2004 5 27
2 2005 1 14
3 2005 6 9
4 2006 3 15
5 2006 7 15
Forget the loops, you want to use an aggregation function. There's a recent discussion of them in this SO question.
with(data_old, tapply(yes, list(year, month), sum))
is one of many solutions.
Also, you don't need to use c() when you aren't concatenating anything. Plain 1:12 is fine.
Just to add a third option:
aggregate(yes ~ year + month, FUN=sum, data=data_old)

Resources