Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "year") and around 6,400,000 rows. The data.frame has various data points for levels of the pollutant "PM2.5" for the years 1999, 2002, 2005, and 2008.
This is what I have done to the data.frame:
my_data <- arrange(my_data, year)   # arrange() is from the plyr/dplyr package
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
[Image: the first 20 rows of the data.frame. Since the column "year" is sorted, only 1999 is visible.]

Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
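For what it's worth, the same grouped sums can be obtained without aggregate(); two equivalent sketches follow, one with base R's tapply() and one with dplyr (assumed to be installed, and not required by the rest of this answer):
# base R: returns a named vector of per-year sums
with(my_data, tapply(PM2.5, year, sum))
# dplyr: returns a small data frame of per-year sums
library(dplyr)
my_data %>%
    group_by(year) %>%
    summarise(total_PM2.5 = sum(PM2.5))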


can't get year factor to point on x axis

I ran this:
> x<-tapply(positive$Emissions, as.factor(positive$year), sum)
> x
1999 2002 2005 2008
7332967 5635780 5454703 3464206
Then ran:
plot(x)
And I keep getting this (plot not shown):
I would like the x axis to show the year, not a numeric scale. Less importantly, I'd like the y axis not to show numbers in scientific notation, but something more readable. I know I can divide by 1000 and get it to print a regular number, but showing the year is more important. I'm constrained to using the base plot functions.
The positive$year column is originally an integer. The positive$Emissions column is numeric.
How do I force this? I've had other plots that do this automatically, but were not operating off of tapply results. I'm willing to pursue something besides the tapply function to get results, but previous attempts failed.
I tried this:
> plot(as.factor(positive$year),sum(positive$Emissions),ylab="Annual Emmissions in tons")
Error in model.frame.default(formula = y ~ x) :
variable lengths differ (found for 'x')
and understand the error, but don't know how to work around it, i.e. don't know how to get positive$year down to 4 values to match the 4 sums.
Data looks like this:
> head(positive)
Emissions year
4 15.714 1999
8 234.178 1999
12 0.128 1999
16 2.036 1999
20 0.388 1999
24 1.490 1999
6 million rows, with 4 year categories.
Any pointers please.
positive <- data.frame(Emissions = rnorm(30), year = c(1999, 2000, 2001, 2002, 2003))
positive
# Solution #1
x <- tapply(positive$Emissions, as.character(positive$year), sum)
x
plot(x = x, y = names(x), ylab = "year", xlab = "Emissions")
# Solution #2
x <- aggregate(positive, by = list(positive$year), sum)
plot(x = x$Emissions, y = x$Group.1)  # same plot as above
Your problem is probably that x$year is a factor. If it is numeric, you shouldn't have this issue:
x <- read.table(text=" 1999 2002 2005 2008
7332967 5635780 5454703 3464206", header=F)
x <- t(as.matrix(x))
rownames(x) <- NULL
colnames(x) <- c("year", "emissions")
x <- as.data.frame(x)
x
# year emissions
# 1 1999 7332967
# 2 2002 5635780
# 3 2005 5454703
# 4 2008 3464206
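From there, a minimal base-graphics sketch that forces the years onto the x axis; dividing by 1000 and the axis labels are my assumptions about what "more readable" means here:
# suppress the automatic x axis, then draw tick marks at the actual years
plot(x$year, x$emissions / 1000, xaxt = "n", pch = 19,
     xlab = "Year", ylab = "Annual emissions (thousands of tons)")
axis(1, at = x$year, labels = x$year)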

Translating Stata code into R

General newbie when it comes to time series data analysis in R. I am having trouble translating a bit of Stata code into R code for a replication project I am doing.
The intent of the Stata code, and the Stata code itself (from the original analysis), are the following:
#### Delete extra yearc observations with different wartypes #####
drop if yearc==yearc[_n+1] & wartype!="CIVIL"
drop if yearc==yearc[_n-1] & wartype!="CIVIL"
So, translated, I keep the rows in which the country is having a civil war and delete the rows in which there is an interstate war during the same years.
I have named the data object (i.e., the data set) mywar in R.
I am assuming I somehow do a conditional ifelse statement, or something similar, such as:
invisible(mywar$yearc <- ifelse(mywar$yearc==n-1 | mywar$yearc==n+1 | mywar$wartype!=civil, NA,
mywar$yearc)) # I am assuming I cannot condition ifelse statements like this; but, this is how I imagine it
mywar <- mywar[!is.na(mywar$yearc),]
EDIT:
So perhaps an example:
> b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
> c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
> df <- data.frame(b,c)
> df$j <- ifelse(df$b==n-1 & df$b==n+1 & df$c!="civil", NA, df$b)
> df
b c j
1 1970 inter 1970
2 1970 civil 1970
3 1970 intra 1970
4 1971 civil 1971
5 1982 civil 1982
6 1999 inter 1999
7 1999 civil 1999
8 2000 civil 2000
9 2001 civil 2001
10 2002 civil 2002
So, what I was trying to do was create NAs for rows 1, 3, and 6, as they are duplicate years in my logistic regression on the onset of civil war (I am not interested in inter and intra wars, however defined), so that I can delete these rows from my data set. Here, I just recreated column b. (Note: what is missing from this made-up data are the country ids, but assume that these ten entries represent the same country, for instance Somalia.) So, I am interested in how to delete these types of rows in a data set with 28,000 rows.
dplyr is also a good way — you just need to "keep" instead of "drop"
library(dplyr)
filter(df, (yearc != lead(yearc, 1) & yearc != lag(yearc, 1)) | wartype == "CIVIL")
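One subtlety worth noting: lead() and lag() return NA at the last and first rows, and filter() drops rows whose condition evaluates to NA, so a non-civil row at either end of the data can vanish even when it is not a duplicate year. Supplying a default value is one way around that (a sketch, not part of the original answer):
library(dplyr)
# default = -1 makes the boundary comparisons TRUE (no neighbour means no duplicate) rather than NA
filter(mywar,
       (yearc != lead(yearc, 1, default = -1) &
        yearc != lag(yearc, 1, default = -1)) | wartype == "CIVIL")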
You're focusing on Stata's if qualifier, but it sounds like you simply want to subset the data frame--hence your use of the drop command in Stata. I also learned Stata before R and was confused since I relied so heavily on the if qualifier in Stata and immediately pursued ifelse in R. But, I later realized that the more relevant technique in R revolved around subsetting. There is a subset() command, but most people prefer subsetting by using brackets (see code below).
In your original question you ask how to do two things:
1. how to delete observations (i.e. rows) that are coded "inter" or "intra" on column C, and
2. how to mark them as missing.
Sample Data
b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b,c)
df
b c
1 1970 inter
2 1970 civil
3 1970 intra
4 1971 civil
5 1982 civil
6 1999 inter
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
1. Dropping Observations
If you want to delete observations that are not "civil" in column C, you can subset the data frame to only keep those cases that are "civil":
df2 <- df[df$c=="civil",]
df2
b c
2 1970 civil
4 1971 civil
5 1982 civil
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
The above code creates a new data frame, df2, that is a subset of df, but you can also completely overwrite the original data frame:
df <- df[df$c=="civil",]
Or, you can generate a new one and then remove the old one, if you don't like your workspace cluttered with lots of data frames:
df2 <- df[df$c=="civil",]
rm(df)
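For reference, the subset() command mentioned earlier gives the same result as the bracket form; a one-line sketch:
df2 <- subset(df, c == "civil")   # equivalent to df[df$c == "civil", ]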
2. Marking Observations as Missing
If you want to mark observations that are not "civil" in column C, you can do that by overwriting them as NA:
df$c[df$c != "civil"] <- NA
df
b c
1 1970 <NA>
2 1970 civil
3 1970 <NA>
4 1971 civil
5 1982 civil
6 1999 <NA>
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
You could then use listwise deletion (see the na.omit() command) to remove the cases from whatever analyses you're doing.
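A minimal sketch of that listwise deletion on the marked-up data frame:
df_complete <- na.omit(df)   # drops every row containing an NA; same rows as df2 from part 1
df_complete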
Side Note
Your original Stata code seeks to subset when column b is a duplicate and column c is "inter" or "intra". However, the way your sample data were presented, this seemed to be a redundant concern, which is why my solution above only looks at column c. However, if you want to match your Stata code as closely as possible, you can do that by
df <- df[order(df$b, df$c),]
df$duplicate <- duplicated(df$b)
df2 <- df[df$c=="civil" & df$duplicate==FALSE,]
which:
1. orders the data chronologically by year and then alphabetically by war type,
2. creates a new variable that specifies whether column b is a duplicate year, and
3. subsets the data frame to remove the undesirable cases.
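And if you want a literal base-R rendering of the two Stata drop conditions, here is a hedged sketch of my own; note it applies both conditions at once (like the dplyr answer above), whereas Stata's second drop runs on the already-reduced data, and it assumes mywar is sorted the same way it was in Stata:
n <- nrow(mywar)
same_as_next <- c(mywar$yearc[-n] == mywar$yearc[-1], FALSE)   # yearc == yearc[_n+1]
same_as_prev <- c(FALSE, mywar$yearc[-1] == mywar$yearc[-n])   # yearc == yearc[_n-1]
mywar <- mywar[!((same_as_next | same_as_prev) & mywar$wartype != "CIVIL"), ]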
Try changing your | operator to &.
Here is some made up data:
R> b <- c(rep(1:4, each=3))
R> c <- 1:length(b)
R> df <- data.frame(c,b)
R> df$j <- ifelse(df$b != 2 & df$b != 3 & df$b != 1, NA, df$b)
R> df
c b j
1 1 1 1
2 2 1 1
3 3 1 1
4 4 2 2
5 5 2 2
6 6 2 2
7 7 3 3
8 8 3 3
9 9 3 3
10 10 4 NA
11 11 4 NA
12 12 4 NA
That last line of your code, mywar <- mywar[!is.na(mywar$yearc),], should work fine as well.

How to calculate time-weighted average and create lags

I have searched the forum, but found nothing that could answer or provide hint on how to do what I wish to on the forum.
I have yearly measurements of exposure data from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution of January 2002 (annual average is 18) and February to December 2001 (annual average is 19). The time weighted average for this person would be (1/12*18) + (11/12*19). The two year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will include a contribution of 11 months in 2004 and one month in 2003; his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1, 2 and 5 year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to leave what I tried. I first arranged the data frame. I wanted to identify which year would be the key year for each subject, so I created id; variable comes from the column names (e.g., pol_2000) in your original data set; entryYear and entryMonth come from entry in your data; check was created in order to identify which year is the base year for each participant. In my next step, I extracted six rows for each participant using getMyRows in the SOfun package. Then I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total values by the number of years (2 or 5). I was not sure how the final output should look, so I decided to use the base year for each subject and added three columns to it.
library(stringi)
library(SOfun)
library(reshape2)   # melt() below comes from reshape2
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
    mutate(id = 1:n()) %>%
    melt(id.vars = c("id", "entry")) %>%
    arrange(id) %>%
    mutate(variable   = as.numeric(gsub("^.*_", "", variable)),
           entryYear  = as.numeric(stri_extract_last(entry, regex = "\\d+")),
           entryMonth = mo2Num(substr(entry, 3, 5)) - 1,
           check      = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x) {
    x$one <- ((x[6, 6] / 12) * x[6, 4]) + (((12 - x[5, 6]) / 12) * x[5, 4])
    x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x) {
    x$two <- (((x[6, 6] / 12) * x[6, 4]) + x[5, 4] + (((12 - x[4, 6]) / 12) * x[4, 4])) / 2
    x
})
two <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get five-year average
cathy <- lapply(bob, function(x) {
    x$five <- (((x[6, 6] / 12) * x[6, 4]) + x[5, 4] + x[4, 4] + x[3, 4] + x[2, 4] + (((12 - x[2, 6]) / 12) * x[1, 4])) / 5
    x
})
five <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
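To make the arithmetic explicit, here is a tiny stand-alone sketch of the one-year weighting for the first subject; avg_by_year and entry_month are hypothetical names of my own, not columns from the linked data:
avg_by_year <- c("2001" = 19, "2002" = 18)   # annual averages from the question's first example
entry_month <- 2                             # subject 1 joined in February 2002
m <- entry_month - 1                         # full months contributed by the entry year (January only)
(m / 12) * avg_by_year["2002"] + ((12 - m) / 12) * avg_by_year["2001"]
# 18.91667, matching the 18.916667 reported for subject 1 above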

Grouping and Std. Dev in R

I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the std. dev. of sales over the past four years. In case there are not four years of data, as in 2006, 2005, and 2004, I want to get NA. How can I create a new column with the values corresponding to each year? The new data would look like:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit:
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the mentioned rollapply function in this manner:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting the following error.
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.
You can use rollapply from package zoo:
require(zoo)
rollapply(df$Sale, 4, sd, fill=NA, align="right")
[edit] I used your data frame as sorted by year. If you have it in original order, you will probably need to use align="left"
This is how I solved the problem:
library(sqldf)   # for sqldf()
library(TTR)     # for runSD()
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt <- sqldf("select GVKEY, FYEAR, IBC from dt")
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = function(x) {
    if (length(x) > 3) c(NA, head(runSD(x, 4), -1)) else rep(NA_real_, length(x))
})
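As an alternative (a sketch of my own, not the answerer's code): the SD of the four previous years can also be built per GVKEY from zoo's rollapplyr(), returning NA for groups with fewer than four observations; such short groups are what typically trigger the "wrong sign in 'by' argument" error seen above:
library(zoo)
prev4_sd <- function(v, width = 4) {
    if (length(v) < width) return(rep(NA_real_, length(v)))
    # trailing SD over the current + previous 3 years, shifted down one row
    # so each year gets the SD of the four years *before* it
    c(NA, head(rollapplyr(v, width, sd, fill = NA), -1))
}
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = prev4_sd)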

calculate differences in dataframe

I have a dataframe that looks like this:
set.seed(50)
data.frame(distance = c(rep("long", 5), rep("short", 5)),
           year = rep(2002:2006),
           mean.length = rnorm(10))
distance year mean.length
1 long 2002 0.54966989
2 long 2003 -0.84160374
3 long 2004 0.03299794
4 long 2005 0.52414971
5 long 2006 -1.72760411
6 short 2002 -0.27786453
7 short 2003 0.36082844
8 short 2004 -0.59091244
9 short 2005 0.97559055
10 short 2006 -1.44574995
I need to calculate the difference in mean.length between long and short in each year. What's the fastest way of doing this?
Here's one way using plyr:
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))
library(plyr)
aggregation.fn <- function(df) {
    data.frame(year = df$year[1],
               diff = (df$mean.length[df$distance == "long"] -
                       df$mean.length[df$distance == "short"]))
}
new.df <- ddply(df, "year", aggregation.fn)
Gives you
> new.df
year diff
1 2002 0.8275344
2 2003 -1.2024322
3 2004 0.6239104
4 2005 -0.4514408
5 2006 -0.2818542
A second way
df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]
all(new.df.2 == new.df) # True
Use tapply() and apply() like this:
apply(
    with(x, tapply(mean.length, list(year, distance), FUN = mean)),
    1,
    diff
)
2002 2003 2004 2005 2006
-0.8275344 1.2024322 -0.6239104 0.4514408 0.2818542
This works because tapply creates a tabular summary by year and distance:
with(x, tapply(mean.length, list(year, distance), FUN=mean))
long short
2002 0.54966989 -0.2778645
2003 -0.84160374 0.3608284
2004 0.03299794 -0.5909124
2005 0.52414971 0.9755906
2006 -1.72760411 -1.4457499
Since you seem to have paired values and the data.frame is ordered, you can do this:
res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)
# 2002 2003 2004 2005 2006
#0.8275344 -1.2024322 0.6239104 -0.4514408 -0.2818542
This should be quite fast, but it is not as safe as the other answers, as it relies on those assumptions (paired values and ordering).
You've received some good answers for computing the specific question at hand. It may make sense for you to consider reshaping your data into a wide format. Here are two options:
reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
year mean.length.long mean.length.short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
#package reshape2 is probably easier to use.
library(reshape2)
dcast(year ~ distance, data = df)
#---
year long short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
You can easily compute your new statistics now.
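For example, a short sketch on the reshape2 wide format, assuming you save the dcast() result (wide is my own name here):
library(reshape2)
wide <- dcast(df, year ~ distance, value.var = "mean.length")
wide$diff <- wide$long - wide$short
wide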
