Lagging "cycled" variables in R

So I just received a dataset wherein one column of the data frame is "cycled." This column is actually a cycle of years (in my case, 1984-2007). In another column, there are corresponding dollar amounts (actually, "funding levels") for each of those years. My job is to create a lag variable for these funding levels. But here is the trick: each time the year cycle starts over, a new "variable" has begun. Thus, the lag variable I am looking for is not simply a shift backward of the entire funding column. Instead, I need to create a funding lag for each sub-cycle of the data. To be more concrete, my data looks a little bit like this:
X Y
1 7
2 8
3 9
1 4
2 6
3 5
1 2
2 4
3 3
And I need it to look like this:
X Y
1 NA
2 7
3 8
1 NA
2 4
3 6
1 NA
2 2
3 4
How would I go about doing this? Thank you so much for your help!
-JMC

This should work. (I often forget to name the FUN argument and ave then complains with a cryptic error message.)
#Wrong dfrm$Y <- ave( dfrm$Y, dfrm$X, FUN=function(x) c(NA, x) )
Lacking a proper grouping factor to mark distinct categories of time sequences, I decided to cue off X==1:
dfrm$Y <- ave( dfrm$Y, cumsum(dfrm$X==1), FUN=function(x) c(NA, x[-length(x)]) )
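To see the idea run end to end, here is a minimal sketch on the sample data from the question. The name dfrm follows the answer above; the column name Y_lag is an assumption made here so that the original Y stays available for comparison.

```r
# Sample data from the question: X cycles 1..3, Y holds the funding levels
dfrm <- data.frame(X = rep(1:3, times = 3),
                   Y = c(7, 8, 9, 4, 6, 5, 2, 4, 3))

# cumsum(X == 1) increases by one each time the cycle restarts,
# so it labels the sub-cycles 1, 1, 1, 2, 2, 2, 3, 3, 3
cycle_id <- cumsum(dfrm$X == 1)

# Within each sub-cycle, shift Y down by one position and pad with NA
dfrm$Y_lag <- ave(dfrm$Y, cycle_id,
                  FUN = function(x) c(NA, x[-length(x)]))

print(dfrm)
```

This relies on every sub-cycle starting with X == 1; if a cycle can start at another value, a different grouping variable would be needed.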

Related

Apply the same calculation in different data frames in R

I am trying to loop over many data frames in R and I feel like this is a rather basic question. However, I only found similar questions that were solved with specific functions that don't match my problem (like calculating means or medians, changing column names, ...). I hope to find a more general solution that can be applied for any change or calculation in various data frames here.
I have a lot (about 500) of data frames that look somewhat like this (very simplified):
df0100
a b c d
1 4 3 5 NA
2 2 5 4 NA
3 4 4 3 NA
...
df0130
a b c d
1 3 2 3 NA
2 4 5 3 NA
3 4 3 2 NA
...
For each of them, I want to calculate a new value (also simplified here) from the values of a and c in the first row and insert that value into every row of column d. It works fine like this for a single data frame:
df0100$d <- df0100[1,1]*(df0100[1,3]+13.5)/(3*exp(df0100[1,3]))/100
which leads to
df0100
a b c d
1 4 3 5 36.60858
2 2 5 4 36.60858
3 4 4 3 36.60858
....
Since I don't want to do this for every single one of the 500 data frames, I saved them as a list and tried to loop over them as follows. I thought the easiest way would be to replace the former 'df0100' with each data frame name, but neither version worked. Can anyone tell me what I have to change?
my_files <- list.files(pattern=".csv")
my_data <- lapply(my_files, read.csv)
Version 1:
for (n in my_data)
{
n$d <- n[1,1]*(n[1,3]+13.5)/(3*exp(n[1,3]))/100
}
Version 2:
my_data <- lapply(my_data, function(n){
n$d <- n[1,1]*(n[1,3]+13.5)/(3*exp(n[1,3]))/100
})
This is my first question here, I hope it makes sense to you.
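For what it's worth, a likely reason neither version works: in the for loop, n is a copy of each list element, so the change is never written back to my_data; in the lapply version, the function ends with an assignment, so it returns the computed number instead of the data frame. A minimal sketch of a fix, assuming the frames are already in a list (the two small frames below are hypothetical stand-ins for the CSV files):

```r
# Hypothetical stand-ins for the data frames read from the CSV files
my_data <- list(
  df0100 = data.frame(a = c(4, 2, 4), b = c(3, 5, 4), c = c(5, 4, 3), d = NA),
  df0130 = data.frame(a = c(3, 4, 4), b = c(2, 5, 3), c = c(3, 3, 2), d = NA)
)

my_data <- lapply(my_data, function(n) {
  # Value computed from columns 1 and 3 of the first row, recycled into d
  n$d <- n[1, 1] * (n[1, 3] + 13.5) / (3 * exp(n[1, 3])) / 100
  n  # return the whole modified data frame, not just the assigned value
})

str(my_data)
```

The same function body works for any per-frame change, as long as the last expression is the data frame itself.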

Creating a new variable in a data frame and changing its values in one step [duplicate]

I have a column which is part of a data frame, df. It is full of integers. Let's say it is the number of houses sold in a day by a realty company. Let's call it df$houses. I want to make a second column called df$quant where the number of houses is categorized, with 0 being 0-2 houses sold in a day, 1 being 3-5 houses, 2 being 6-9 houses and 3 being 10 or more houses. I could do this in two steps.
1) Create the new column df$quant from df$houses:
df$quant <- df$houses
2) Change the values of df$quant:
df$quant[which(df$quant <= 2)] <- 0
etc.
I would like to do this in one step though, making the new variable and filling it with the proper values. Mostly, so I don't have to worry about getting the order of the lines of code in the second step right. It would be more robust.
Could this be done with an if statement?
Thanks a lot.

I would do something like this: (using cut)
x <- 1:11
df <- data.frame(x)
myFunction <- function(x) as.integer(cut(x, c(-1, 2, 5, 9, Inf))) - 1
df$new <- myFunction(df$x)
df
x new
1 1 0
2 2 0
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
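The same idea applied directly to the column names from the question might look like the sketch below. The sample values in df$houses are made up for illustration, and using labels = FALSE (so that cut returns integer codes rather than a factor) is a choice made here, not part of the answer above.

```r
# Made-up daily sales counts, for illustration only
df <- data.frame(houses = c(0, 2, 3, 5, 6, 9, 10, 25))

# Intervals are right-closed: (-Inf,2], (2,5], (5,9], (9,Inf]
# labels = FALSE returns the interval index 1..4; subtract 1 to get 0..3
df$quant <- cut(df$houses,
                breaks = c(-Inf, 2, 5, 9, Inf),
                labels = FALSE) - 1L

print(df)
```

Because the new column is created and filled in a single assignment, there is no ordering of replacement steps to get wrong.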

Reformatting data in order to plot 2D continuous heatmap

I have data stored in a data.frame that I would like to plot as a continuous heat map. I have tried using the interp function from akima package, but as the data can be very large (2 million rows) I would like to avoid this if possible as it takes a very long time. Here is the format of my data
l1 <- c(1,2,3)
grid1 <- expand.grid(l1, l1)
lprobdens <- c(0,2,4,2,8,10,4,8,2)
df <- cbind(grid1, lprobdens)
colnames(df) <- c("age1", "age2", "probdens")
age1 age2 probdens
1 1 0
2 1 2
3 1 4
1 2 2
2 2 8
3 2 10
1 3 4
2 3 8
3 3 2
I would like to format it as a length(unique(df$age1)) x length(unique(df$age2)) matrix. I gather that once it is formatted in this manner I would be able to use basic functions such as image to plot a continuous 2D heat map similar to that created using the akima package. Here is how I think the transformed data should look. Please correct me if I am wrong.
1 2 3
1 0 2 4
2 2 8 8
3 4 10 2
It seems as though ldply might do this, but I can't seem to sort out how it works.
I forgot to mention, the $age information is always continuous and regular, such that the list age1 is equal to age2 but age1 >= age2. I guess this means that it may be classed as continuous data as it stands and doesn't require the interp function.

OK, I think I get what you want. It is just a matter of reshaping the data with reshape2's dcast function. The value.var argument is only there to avoid the message R prints when it has to guess which column holds the values. The result does not change if you omit it.
library(reshape2)
as.matrix(dcast(df, age1 ~ age2, value.var = "probdens")[-1])
1 2 3
[1,] 0 2 4
[2,] 2 8 8
[3,] 4 10 2
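If an extra package is not wanted, a base R sketch with tapply gives the same 3 x 3 layout. This is an alternative to the dcast call, not the answerer's method; it assumes each (age1, age2) pair occurs exactly once, so that sum simply passes the single value through.

```r
# Rebuild the example data from the question
l1 <- c(1, 2, 3)
grid1 <- expand.grid(l1, l1)
lprobdens <- c(0, 2, 4, 2, 8, 10, 4, 8, 2)
df <- cbind(grid1, lprobdens)
colnames(df) <- c("age1", "age2", "probdens")

# Rows are indexed by age1, columns by age2
mat <- tapply(df$probdens, list(df$age1, df$age2), sum)
print(mat)

# The matrix can then be handed to image(), for example:
# image(x = sort(unique(df$age1)), y = sort(unique(df$age2)), z = mat)
```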

increase in one variable nested within another column in R + setting 0 as starting value

I'm trying to use the diff function to calculate the increase in a variable ("damage") in this dataset (df). I want to fill the column "damage_new" with this new variable. The values that you see now are the values I would like to have.
df = data.frame(id=c(1,1,1,2,2), trial=c(1,3,4,1,2), damage=c(1,NA,3,1,5))
df
ID TRIAL DAMAGE DAMAGE_NEW
1 1 1 0
1 3 NA NA
1 4 3 NA
2 1 1 0
2 2 5 4
If I run
diff(df$damage) it will calculate the difference in the whole dataset.
Two things that I haven't managed are:
-How to nest the difference within the values of another column? Specifically, I want to calculate the damage increase for the whole dataset, but separately within each individual (ID), for which I have repeated measurements.
-I would also like the damage_new column to be the same length as the rest of the dataset (so I can attach it) and, for each individual, to have the first value of damage_new set to 0, since the first measurement has no reference.
-To further describe the dataset, I have NAs in the "damage" column, which I suspect will lead to more NAs in the damage_new column, but I would like to keep them (and I wonder how the function deals with them?). I also don't have the same number of measurements per individual (they will have a different number of trials, with some missing in between).
Thanks a lot for the always fast and efficient answers!

The dplyr package is great for this kind of thing:
library(dplyr)
df %>% group_by(id) %>% mutate(damage_new=c(0,diff(damage)))
Source: local data frame [5 x 4]
Groups: id
id trial damage damage_new
1 1 1 1 0
2 1 3 NA NA
3 1 4 3 NA
4 2 1 1 0
5 2 2 5 4
You can read more about dplyr usage here
Update
If you'd like to go with the base R, you could do:
df$damage_new <- ave(df$damage,df$id,FUN=function(v) c(0,diff(v)))
which will produce the same df.
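As a self-contained sketch of the base R route, the block below rebuilds the example data and applies ave per id. Sorting by id and trial first is an assumption added here so that diff always compares consecutive trials; the example data happen to be sorted already.

```r
df <- data.frame(id     = c(1, 1, 1, 2, 2),
                 trial  = c(1, 3, 4, 1, 2),
                 damage = c(1, NA, 3, 1, 5))

# Make sure trials are in order within each individual
df <- df[order(df$id, df$trial), ]

# Per id: first value is 0, the rest are successive differences.
# Any difference that involves an NA is itself NA.
df$damage_new <- ave(df$damage, df$id,
                     FUN = function(v) c(0, diff(v)))

print(df)
```

Note that for id 1 the value at trial 4 is NA because its predecessor (trial 3) is missing, even though damage at trial 4 is known.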

Library data.table is your friend there:
> library(data.table)
> setDT(df)
> setkey(df, id, trial)
> df[,new_damage:=c(0,diff(damage)),by=id]
> df
id trial damage new_damage
1: 1 1 1 0
2: 1 3 NA NA
3: 1 4 3 NA
4: 2 1 1 0
5: 2 2 5 4
On how diff handles NA: any subtraction that involves an NA gives NA, so each NA in the input produces NA in the two differences that touch it:
> diff(c(1,3,4,NA,5,7))
[1] 2 1 NA NA 2

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!

As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
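The same merge idea can also be sketched without plyr. The version below is an alternative written for illustration, not the answerer's code: it first builds every combination of group and timestamp with a cross join (merge with by = NULL), then left-joins the observed data onto that grid. The helper names groups and full are made up here.

```r
test <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  t = c(0, 2, 3, 4, 7, 3, 4, 6, 7, 8, 9),
  x = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 3))

# Every (a, b) group that occurs in the data
groups <- unique(test[, c("a", "b")])

# Cross join: each group paired with every timestamp 0..9
full <- merge(groups, data.frame(t = as.numeric(0:9)), by = NULL)

# Left join the observed values; unobserved timestamps get NA in x
test.expanded <- merge(full, test, by = c("a", "b", "t"), all.x = TRUE)

# Replace the NAs with zeroes and sort by group and time
test.expanded$x[is.na(test.expanded$x)] <- 0
test.expanded <- test.expanded[order(test.expanded$a,
                                     test.expanded$b,
                                     test.expanded$t), ]
rownames(test.expanded) <- NULL

print(test.expanded)
```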

This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656