Refer to relative rows in R

I know this answer must be out there, but I can't figure out how to word the question.
I'd like to calculate the differences between values in my data.frame.
from this:
f <- data.frame(year=c(2004, 2005, 2006, 2007), value=c(8565, 8745, 8985, 8412))
year value
1 2004 8565
2 2005 8745
3 2006 8985
4 2007 8412
to this:
year value diff
1 2004 8565 NA
2 2005 8745 180
3 2006 8985 240
4 2007 8412 -573
(i.e., the value of the current year minus the value of the previous year)
But I don't know how to compute a result in one row from the value in another row. Any help?
Thanks,
Tom

There are many different ways to do this, but here's one:
f[, "diff"] <- c(NA, diff(f$value))
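More generally, if you want to refer to relative rows, you can use a lag function (for example dplyr::lag(); note that base R's lag() is designed for time series objects and does not shift a plain vector) or do it directly with indexes: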
f[-1,"diff"] <- f[-1, "value"] - f[-nrow(f), "value"]
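As a minimal sketch of the index-based idea in base R, a small helper can shift a vector down by n rows so that each position holds the value from n rows earlier. The name shift_down is hypothetical (it is not a base R function), and the sketch assumes n is at least 1.

```r
# Hypothetical helper: value from n rows earlier (n >= 1), padded with NA at the top
shift_down <- function(x, n = 1) c(rep(NA, n), head(x, -n))

f <- data.frame(year = c(2004, 2005, 2006, 2007),
                value = c(8565, 8745, 8985, 8412))

# Current value minus the previous row's value; the first row has no predecessor
f$diff <- f$value - shift_down(f$value)
```

Keeping the shift in one helper makes it easy to reuse for other relative-row calculations, such as a difference against the value two rows back.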

Use the diff function (naming the new column so it doesn't get an auto-generated name):
f <- cbind(f, diff = c(NA, diff(f[,2])))

If the year column isn't sorted, you could use match:
f$diff <- f$value - f$value[match(f$year-1, f$year)]
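To see what match does here, a small sketch with the rows deliberately shuffled may help. The row order and the intermediate variable prev are made up for illustration.

```r
# Same values as the question, but with the rows in arbitrary order
f <- data.frame(year = c(2006, 2004, 2007, 2005),
                value = c(8985, 8565, 8412, 8745))

# For each row, the index of the row holding the previous year (NA if there is none)
prev <- match(f$year - 1, f$year)

# Indexing with NA yields NA, so years without a predecessor get NA automatically
f$diff <- f$value - f$value[prev]
```

Because the lookup is by year rather than by position, a missing year in the middle of the data also produces NA instead of a difference across the gap.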

Related

Got an error using ifelse inside mutate inside the for loop

I have a list of 244 data frames which looks like the following:
The name of the list is datas.
datas[[1]]
year sal
2000 10000
2000 15000
2005 10000
2005 9000
2005 12000
2010 15000
2010 12000
2010 20000
2013 25000
2013 15000
2015 20000
I would like to make a new column called fix.sal by multiplying sal by a factor that depends on the year: 2 for rows where the year is 2000, 1.8 for 2005, 1.5 for 2010, 1.2 for 2013, and 1 for 2015. The result should look like this:
Year sal fix.sal
2000 10000 20000
2000 15000 30000
2005 10000 18000
2005 9000 16200
2005 12000 21600
2010 15000 22500
2010 12000 18000
2010 20000 30000
2013 25000 30000
2013 15000 18000
2015 20000 20000
I managed to do this using ifelse inside mutate from the dplyr package.
library(dplyr)
datas[[1]]<-mutate(datas[[1]], fix.sal=
ifelse(datas[[1]]$Year==2000,datas[[1]]$sal*2,
ifelse(datas[[1]]$Year==2005,datas[[1]]$sal*1.8,
ifelse(datas[[1]]$Year==2010,datas[[1]]$sal*1.5,
ifelse(datas[[1]]$Year==2013,datas[[1]]$sal*1.2,
datas[[1]]$sal*1)))))
But I have to apply this operation to all 244 data frames in the list datas.
So I tried to do it with a for loop like this:
for(i in 1:244){
datas[[i]]<-mutate(datas[[i]], fix.sal=
ifelse(datas[[i]]$Year==2000,datas[[i]]$sal*2,
ifelse(datas[[i]]$Year==2005,datas[[i]]$sal*1.8,
ifelse(datas[[i]]$Year==2010,datas[[i]]$sal*1.5,
ifelse(datas[[i]]$Year==2013,datas[[i]]$sal*1.2,
datas[[i]]$sal*1)))))
}
Then I got an error:
Error: invalid subscript type 'integer'
How can I solve this...?
Any comments will be greatly appreciated! :)
Please don't force yourself to use ifelse for this. Instead, create a vector with your multipliers, then use the year to select from the vector. The vector will look something like this:
multiplier <-
c("2005" = 1.2
, "2006" = 1.05
, "2007" = 0.9)
Use whatever multiplier applies to each year in your data. Then, here is some sample data (every data frame is identical, but that doesn't matter):
datas <-
lapply(1:3, function(idx){
data.frame(
Year = 2005:2007
, sal = c(10, 20, 30)
)
})
Finally, we can then use lapply to loop through the list more efficiently. Each time through, it uses the Year to pick a value from the multipliers vector (note the use of as.character, otherwise it will pick, e.g., the 2005th entry, instead of the one named "2005").
lapply(datas, function(x){
mutate(x, fix.sal = sal*multiplier[as.character(Year)])
})
returns:
[[1]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[2]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[3]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
For more compact code, you can use:
lapply(datas, mutate, fix.sal = sal*multiplier[as.character(Year)])
but that makes it slightly less clear to me what is happening.
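Applied to the multipliers from the question, the same lookup idea might look like the sketch below. It uses base R only (no dplyr), and the six-row data frame is a shortened, made-up stand-in for the real list elements.

```r
# Multipliers from the question, named by year
multiplier <- c("2000" = 2, "2005" = 1.8, "2010" = 1.5, "2013" = 1.2, "2015" = 1)

# Shortened stand-in for one element of the list
df <- data.frame(Year = c(2000, 2000, 2005, 2010, 2013, 2015),
                 sal = c(10000, 15000, 10000, 15000, 25000, 20000))
datas <- list(df, df)

# Look the multiplier up by name; a year missing from the vector would give NA
datas <- lapply(datas, function(x) {
  x$fix.sal <- x$sal * unname(multiplier[as.character(x$Year)])
  x
})
```

Assigning the result of lapply back to datas replaces the for loop over list indexes entirely.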
Here's a simple solution using ifelse and lapply:
# Creating the list
df <- data.frame(year=c(rep(2000,2),rep(2005,3),rep(2010,3),rep(2013,2),2015),
sal=c(10000,15000,10000,9000,12000,15000,12000,20000,25000,15000,20000))
datas <- list(df,df)
# Applying the function with ifelse (use the list element x, not the global df)
lapply(datas, function(x){
  outp <- ifelse(x$year==2000, x$sal*2,
          ifelse(x$year==2005, x$sal*1.8,
          ifelse(x$year==2010, x$sal*1.5,
          ifelse(x$year==2013, x$sal*1.2, x$sal*1))))
  return(outp)
})
You'll get the result for each df inside the list.

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "year") and around 6,400,000 rows. It holds PM2.5 pollutant levels for the years 1999, 2002, 2005 and 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e., the sum of all data points under PM2.5) for each year. How can I do it?
(Image: the first 20 rows of the data.frame.)
Since the data are sorted by year, the image shows only 1999.
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
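Two other base R spellings of the same per-year sum are sketched below on a small fixed data set (the numbers are made up so the result is reproducible). tapply returns a named vector, while the formula interface of aggregate returns a data frame with readable column names.

```r
# Small fixed sample so the sums are easy to check by hand
my_data <- data.frame(
  year  = factor(c("1999", "2002", "2002", "2005", "2008", "2008")),
  PM2.5 = c(5.5, 4.0, 3.5, 6.0, 1.0, 2.5)
)

# Named vector of sums, one entry per year
totals <- tapply(my_data$PM2.5, my_data$year, sum)

# Data frame with columns year and PM2.5 (the per-year sum)
totals_df <- aggregate(PM2.5 ~ year, data = my_data, FUN = sum)
```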

Filter data frame by lowest common overlap in categorical variable in R

I have the following data frame:
input<-data.frame(
site=c("1","2","3","1","2","3","4","1","2"),
year=c(rep("2006",3),rep("2010",4),rep("2014",2)
))
site year
1 1 2006
2 2 2006
3 3 2006
4 1 2010
5 2 2010
6 3 2010
7 4 2010
8 1 2014
9 2 2014
I would like to return a list of sites surveyed in 2006, 2010, and 2014; so in the example above only site 1 and 2 would be in the list as they are the only sites that were surveyed in 2006, 2010, and 2014.
Any advice is most appreciated.
You can use ddply to check, for each site, whether all of your years of interest are present, and then pull the sites that have all three.
library(plyr)
res <- ddply(.data = input, .variables = .(site),
summarize, allthree = all(c("2006","2010","2014") %in% year))
res$site[res$allthree]
If your data may contain other years, this solution should work:
yearsneeded <- c("2006","2010","2014")
names(which(tapply(input$year, input$site, function(x) all(yearsneeded %in% x))))
It may be most straightforward to first cross-tabulate year and site using table(), and to then "apply" the function all to each of the table's rows to find which ones have all non-zero entries, like so:
(tb <- table(input))
# year
# site 2006 2010 2014
# 1 1 1 1
# 2 1 1 1
# 3 1 1 0
# 4 0 1 0
rownames(tb)[apply(tb,1,all)]
# [1] "1" "2"
Or, if you really just care that there should be at least one presence in each of 2006, 2010, and 2014 (even if your data might contain other years), try this:
rownames(tb)[apply(tb[,c("2006", "2010", "2014")], 1, all)]
# [1] "1" "2"
This is another approach (updated). It also works if the original input data frame has more than the 3 years in the example:
years <- c(2006,2010,2014) # vector of required years
df <- input[input$year %in% years,] # data frame containing only the required years
sites <- names(which(rowSums(table(df) > 0) == length(years))) # sites that fulfill the criteria (by name, not by table position)
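Since the question is really a set intersection, another way to phrase it is to split the sites by year and intersect the resulting sets. This is only a sketch; the names sites_by_year and common_sites are illustrative, and it assumes every required year actually occurs in the data.

```r
input <- data.frame(
  site = c("1", "2", "3", "1", "2", "3", "4", "1", "2"),
  year = c(rep("2006", 3), rep("2010", 4), rep("2014", 2)),
  stringsAsFactors = FALSE
)
yearsneeded <- c("2006", "2010", "2014")

# One vector of sites per required year (assumes each year is present in the data)
sites_by_year <- split(input$site, input$year)[yearsneeded]

# Sites that appear in every one of those vectors
common_sites <- Reduce(intersect, sites_by_year)
```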

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group can be found that meets given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) in the same group that was made during the six years before the focal year?"
Ideally the data frame should look like this:
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically, for each row we should look into the subset for its group and check whether any(dat$year == conditions). This is very easy to do with a for loop, but that is of no use here: the data frame is massive (several million rows) and a loop would take forever.
I am looking for an efficient way using vectorized functions or a fast package.
Thanks!
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count(), which is also a plyr function:
9M rows took ~4 sec:
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with the data.table package: the idea is to merge the data.table with itself using its fast merge function. This gives every possible combination between a given year of a group and all other years from the same group.
Then apply an ifelse to every row with the condition you're looking for.
Finally, aggregate everything with a sum to find out how many times each year has another year of the same group within the given timespan.
On my computer it took a few milliseconds, instead of the hours that plyr would probably have taken:
library(data.table)
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the last six years are given a "1":
group year x
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
Here's another possibility also using data.table but including diff().
dat <- data.table(group = rep(c("a","b","c"), each = 3),
year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),
key = "group")
valid_case <- subset(dat[,list(valid_case = diff(year)), by=key(dat)],
abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think that avoiding them altogether helps. There's probably a more idiomatic way to express the condition in the ifelse statement using data.table joins. That could potentially speed things up, although in my experience %in% has never been the limiting factor.
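For a row-level flag without a self-join, one possible sketch in base R is to sort by group and year and compare each year with the previous one in its group: if the nearest earlier year is more than six years back, no other earlier year can be within six. This assumes no duplicate years within a group (duplicates would need to be collapsed first), and the names ord, gap and six are just for illustration.

```r
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))

# Row order that sorts by group, then by year within group
ord <- order(dat$group, dat$year)

# Gap to the previous (earlier) year within the same group; NA for a group's first year
gap <- ave(dat$year[ord], dat$group[ord], FUN = function(y) c(NA, diff(y)))

# 1 if the nearest earlier year lies within the six preceding years, else 0;
# writing through ord puts the flags back in the original row order
six <- integer(nrow(dat))
six[ord] <- as.integer(!is.na(gap) & gap > 0 & gap <= 6)
dat$six_years <- six
```

The work is dominated by one sort, so it should scale to millions of rows far better than a per-row loop, though I have not benchmarked it against the data.table approaches.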

Grouping and Std. Dev in R

I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the standard deviation of sales over the past four years. Where four years of data are not available, as in 2006, 2005, and 2004, I want to get NA. How can I create a new column with the values corresponding to each year? The new data would look like this:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit :
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the suggested rollapply function in this manner:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting following error.
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.
You can use rollapply from package zoo:
require(zoo)
rollapply(dt$Sale, 4, sd, fill=NA, align="right")
[edit] I assumed your data frame is sorted by year in ascending order. If it is in the original order, you will probably need to use align="left".
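If the window should cover only the four years before the current one (as in the table in the question) rather than include the current year, a base R sketch without zoo could look like this. It assumes one row per year with no gaps between years; the column name std follows the question.

```r
dt <- data.frame(Year = 2009:2004, Sale = c(6, 3, 4, 5, 12, 3))

# Sort ascending so "previous rows" means "previous years"
dt <- dt[order(dt$Year), ]

# sd of the four preceding rows; NA when fewer than four earlier years exist
dt$std <- vapply(seq_len(nrow(dt)), function(i) {
  if (i <= 4) return(NA_real_)
  sd(dt$Sale[(i - 4):(i - 1)])
}, numeric(1))
```

For grouped data such as the GVKEY example, the same function could be applied per group, for instance through ave().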
This is how I solved the problem:
library(sqldf) # sqldf()
library(TTR) # runSD()
dt <- dt[order(dt$GVKEY,dt$FYEAR),]
dt <- sqldf("select GVKEY, FYEAR, IBC from dt")
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = function(x) {if(length(x)>3) c(NA,head(runSD(x,4),-1)) else rep(NA_real_,length(x))})

Resources