Reformatting data in order to plot 2D continuous heatmap - r

I have data stored in a data.frame that I would like to plot as a continuous heat map. I have tried using the interp function from akima package, but as the data can be very large (2 million rows) I would like to avoid this if possible as it takes a very long time. Here is the format of my data
l1 <- c(1,2,3)
grid1 <- expand.grid(l1, l1)
lprobdens <- c(0,2,4,2,8,10,4,8,2)
df <- cbind(grid1, lprobdens)
colnames(df) <- c("age1", "age2", "probdens")
age1 age2 probdens
1 1 0
2 1 2
3 1 4
1 2 2
2 2 8
3 2 10
1 3 4
2 3 8
3 3 2
I would like to format it in a length(df$age1) x length(df$age2) matrix. I gather that once it is formatted in this manner I would be able to use basic functions such as image to plot a 2D histogram continuous heat map similar to that created using the akima package. Here is how I think the transformed data should look. Please correct me if I am wrong.
1 2 3
1 0 2 4
2 2 8 8
3 4 10 2
It seems as though ldply might be able to do this, but I can't seem to sort out how it works.
I forgot to mention, the $age information is always continuous and regular, such that the list age1 is equal to age2 but age1 >= age2. I guess this means that it may be classed as continuous data as it stands and doesn't require the interp function.

OK, I think I get what you want. It is just a matter of reshaping the data with the dcast function from the reshape2 package. The value.var argument is only there to avoid the message printed when dcast has to guess which column holds the values; the result does not change if you omit it.
library(reshape2)
as.matrix(dcast(df, age1 ~ age2, value.var = "probdens")[-1])
1 2 3
[1,] 0 2 4
[2,] 2 8 8
[3,] 4 10 2
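For a very large data frame, the same matrix can probably also be filled directly in base R with matrix indexing, which avoids the reshape step. The sketch below is only an illustration on the toy data from the question; the names ages, x, y and z are made up here, and it assumes each (age1, age2) pair occurs at most once. Pairs that are absent from the data (for example when only age1 >= age2 is recorded) stay NA.

```r
# Toy data in the same layout as the question.
ages <- c(1, 2, 3)
df <- cbind(expand.grid(ages, ages), c(0, 2, 4, 2, 8, 10, 4, 8, 2))
colnames(df) <- c("age1", "age2", "probdens")

# Axis values: one row per distinct age1, one column per distinct age2.
x <- sort(unique(df$age1))
y <- sort(unique(df$age2))

# Start with an all-NA matrix and fill it in a single vectorised assignment.
# Each row of the two-column index matrix is a (row, column) position.
z <- matrix(NA_real_, nrow = length(x), ncol = length(y))
z[cbind(match(df$age1, x), match(df$age2, y))] <- df$probdens

# image() draws z[i, j] at the point (x[i], y[j]).
if (interactive()) image(x, y, z, xlab = "age1", ylab = "age2")
```

Because the fill is one vectorised assignment, it should scale to millions of rows far better than a loop, although that has not been measured here.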


Extract data from data.frame based on coordinates in another data.frame

So here is my problem. I have a really big data.frame with two columns: the first one represents x coordinates (rows) and the other one y coordinates (columns), for example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column in first data.frame with data from second data.frame based on coordinates from first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big, for loops are too slow. I think there is a way to do this with the apply family of functions, but I can't find how. Thanks in advance (and sorry for the ugly message layout; this is my first post here and I don't know how to produce the nice layout with code and proper data.frames that I see in other questions).
This is a simple indexing question. There is no need for external packages or *apply loops; just do
df1$z <- df2[as.matrix(df1)]
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
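To make the indexing step explicit: when a matrix is indexed by a two-column numeric matrix, each row of the index is read as one (row, column) pair. The sketch below rebuilds the example data from the question and spells this out; the intermediate names m and idx are only illustrative.

```r
# Example data from the question.
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
df2 <- data.frame(a = c(8, 1, 5, 7, 1),
                  b = c(7, 2, 4, 8, 5),
                  c = c(8, 3, 7, 9, 2),
                  d = c(1, 4, 8, 7, 3))

m   <- as.matrix(df2)        # values as a plain numeric matrix
idx <- cbind(df1$x, df1$y)   # one (row, column) pair per row of df1

# Indexing by a two-column matrix picks m[x[i], y[i]] for every i at once.
df1$z <- m[idx]
df1
```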
A base R solution: (df1 and df2 are coordinates and numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo, since you don't have 5 columns in the second data frame.
Here's how I would do this.
First, use data.table for fast merging; then convert your data frames (I'll call them dt1 with coordinates and vals with values) to data.tables.
Second, put vals into a new data.table with coordinates:
Now merge:
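The code for these steps is not shown above. As a rough sketch of the same idea (put the values into a long table keyed by coordinates, then merge), here is a base R version in which merge() stands in for the data.table merge. The names dt1 and vals follow the answer; long and merged are made up for illustration, and this is not necessarily what the original code looked like.

```r
# Example data from the question.
dt1  <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
vals <- data.frame(a = c(8, 1, 5, 7, 1),
                   b = c(7, 2, 4, 8, 5),
                   c = c(8, 3, 7, 9, 2),
                   d = c(1, 4, 8, 7, 3))

# Long table: one row per cell of vals, with its row and column index.
m <- as.matrix(vals)
long <- data.frame(x = as.vector(row(m)),
                   y = as.vector(col(m)),
                   z = as.vector(m))

# Left join on the coordinates. Note that merge() sorts by the key columns,
# so the rows of the result are not in the original order of dt1.
merged <- merge(dt1, long, by = c("x", "y"), all.x = TRUE)
merged
```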
You can also try the data.table package and update df1 by reference
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

Lagging "cycled" variables in R

So I just received a dataset wherein one column of the data frame is "cycled." This column is actually a cycle of years (in my case, 1984-2007). In another column, there are corresponding dollar amounts (actually, "funding levels") for each of those years. My job is to create a lag variable for these funding levels. But here is the trick: each time the year cycle starts over, a new "variable" has begun. Thus, the lag variable I am looking for is not simply a shift backward of the entire funding column. Instead, I need to create a funding lag for each sub-cycle of the data. To be more concrete, my data looks a little bit like this:
1 7
2 8
3 9
1 4
2 6
3 5
1 2
2 4
3 3
And I need it to look like this:
1 NA
2 7
3 8
1 NA
2 4
3 6
1 NA
2 2
3 4
How would I go about doing this? Thank you so much for your help!
This should work. (I often forget to name the FUN argument and ave then complains with a cryptic error message.)
#Wrong dfrm$Y <- ave( dfrm$Y, dfrm$X, FUN=function(x) c(NA, x) )
Lacking a proper grouping factor to mark distinct categories of time sequences, I decided to cue off X==1:
dfrm$Y <- ave( dfrm$Y, cumsum(dfrm$X==1), FUN=function(x) c(NA, x[-length(x)]) )
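As a worked illustration on the example data from the question, the sketch below builds the grouping variable first and writes the lag to a separate column so the original values are kept. The column names X and Y follow the answer; grp and Y_lag are hypothetical names chosen here.

```r
# Example data: X is the cycling "year", Y the funding level.
dfrm <- data.frame(X = rep(1:3, times = 3),
                   Y = c(7, 8, 9, 4, 6, 5, 2, 4, 3))

# A new group starts every time X returns to 1.
grp <- cumsum(dfrm$X == 1)

# Within each group, shift the values down by one and pad with NA.
dfrm$Y_lag <- ave(dfrm$Y, grp, FUN = function(x) c(NA, head(x, -1)))
dfrm
```

This relies on every cycle starting at X == 1; if a cycle can start at another value, a different grouping rule would be needed.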

recursive replacement in R

I am trying to clean some data and would like to replace zeros with values from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- temp[which(temp==0)-1]
It gives
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
Which I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and usually used to fill data gaps. It's a common operation for time series and thus implemented in package zoo:
library(zoo)
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
You could use essentially your same logic except you'll want to apply it to the values vector that results from using rle
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
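If neither an extra package nor rle is wanted, the carry-forward can also be sketched with cummax on the positions of the non-zero entries. This is only an illustration, and it assumes the first element is non-zero (a leading zero has nothing to carry forward). The names last_nonzero and filled are made up here.

```r
temp <- c(1, 2, 4, 5, 0, 0, 6, 7)

# Position of each element if it is non-zero, 0 otherwise;
# cummax then turns this into "position of the most recent non-zero value".
last_nonzero <- cummax(seq_along(temp) * (temp != 0))

# Look up the value at that position for every element.
filled <- temp[last_nonzero]
filled
```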

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use head(mass, 1) in place of mass[1] and tail(mass, 1) in place of mass[length(mass)].
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
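t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and mass_f as the values, provided each individual's final mass is also its largest mass (tapply with max returns the maximum, not the last observation).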
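For comparison, here is a minimal base R sketch that adds both columns without plyr, using ave on the example data from the question. It sorts by individual and time first, so the first and last elements of each group are the initial and final observations; it is one possible approach, not the method used in the answers above.

```r
# Example data from the question.
test <- data.frame(indv = c(1, 2, 1, 2, 2, 1),
                   time = c(10, 5, 5, 4, 14, 15),
                   mass = c(7, 3, 1, 4, 14, 15))

# Sort so that, within each individual, rows are in time order.
test <- test[order(test$indv, test$time), ]

# ave() applies FUN per individual and recycles the single result
# across all rows of that individual, whatever the group size.
test$mass_i <- ave(test$mass, test$indv, FUN = function(m) m[1])
test$mass_f <- ave(test$mass, test$indv, FUN = function(m) m[length(m)])
test
```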

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library(plyr)
test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x),"x"] <- 0
  DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
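The same fill-in idea can be sketched in base R alone by building the full grid of groups and time points first and then doing a single left merge. The toy data below is copied from the table printed later in this thread, since the original data.frame() call is truncated; the names groups and full are made up for illustration.

```r
# Toy data: two groups (a = 1, b = 1 and a = 1, b = 2) with sparse t.
test <- data.frame(a = 1,
                   b = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                   t = c(0, 2, 3, 4, 7, 3, 4, 6, 7, 8, 9),
                   x = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 3))

# Every group crossed with every time point. merge() on two data frames
# with no columns in common returns their Cartesian product.
groups <- unique(test[, c("a", "b")])
full   <- merge(groups, data.frame(t = 0:9))

# Left join the observed values, then turn the gaps into zeros.
test.expanded <- merge(full, test, all.x = TRUE)
test.expanded$x[is.na(test.expanded$x)] <- 0
test.expanded <- test.expanded[order(test.expanded$a, test.expanded$b,
                                     test.expanded$t), ]
test.expanded
```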
This is convoluted but works fine:
test <- data.frame(
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
