I have a question and hope that some of you can help me. The issue is this: for a given data frame that includes a vector y of length n and a factor f with k different levels, I want to add a new variable to the data frame whose values are taken from a length-k vector z according to the level of f.
Example:
df <- data.frame(y=rnorm(12), f=rep(1:3, length.out=12))
z <- c(-1,0,5)
Note that my real z has been constructed to correspond to the unique factor levels, which is why length(z) == length(unique(df$f)). I now want to create a vector of length n = 12 that contains, for each row, the value of z corresponding to that row's factor level f. (Note: my real factor values are not ordered like in the above example, so simply repeating the vector z won't work.)
Now, an obvious solution would be to create a vector f outside the data frame, combine it with z in a small lookup data frame, and then use merge. For instance,
newdf <- data.frame(z=z, f=c(1,2,3))
df <- merge(df, newdf, by="f")
However, I need to repeat this procedure several thousand times, and the merge solution feels like shooting at microbes with cannons. Hence my question: there is almost surely an easier and more efficient way to do this, but I just don't know how. Could anyone point me in the right direction? I am looking for something like the "inverse" of aggregate or by.
Assuming that the values in z correspond to the levels of f, in order:
df <- data.frame(y=rnorm(12), f=factor(sample(c("a","b","c"), 12, replace=TRUE)))
z <- c(-1,0,5)
df$newz <- z[df$f]
In case this is not clear: this works because factors are stored under the covers as integers. When you index z with the factor, you are effectively indexing with its underlying integer codes, which pick out the right z value for each factor level.
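A quick way to see those integer codes at work (using the df and z defined in this answer):
levels(df$f)                             # the factor levels, in order
as.integer(df$f)                         # the underlying integer code for each row
identical(z[df$f], z[as.integer(df$f)])  # TRUE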
I'm having trouble with this challenge. Given the following data, I would like to take the mean of the x values from rows that share an identical value in both of the other two columns (id and z), and store the result in a new variable.
id <- c(32,32,36,36,40)
z<- c(1,1,2,3,4)
x <- c(10,5,15,10,10)
y <- c(8,4,12,6,15)
data <- data.frame(id,z,x,y)
In this example, the repeated id is 32 and the repeated value in z is 1. So, I would like to have this result:
id <- c(32,32,36,36,40)
z<- c(1,1,2,3,4)
x <- c(10,5,15,10,10)
y <- c(8,4,12,6,15)
new <- c(7.5,7.5,15,10,10)
data <- data.frame(id,z,x,y,new)
Note that the rows with id equal to 36 are not averaged because their z values are not equal. In my original dataset, the variable z would be the time. I sincerely hope that this question is straightforward and that someone can help me.
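One possible approach (a minimal sketch, not from the original thread) is ave(), which computes a statistic per group and repeats it for every row of that group:
data$new <- ave(data$x, data$id, data$z, FUN=mean)
data
#   id z  x  y  new
# 1 32 1 10  8  7.5
# 2 32 1  5  4  7.5
# 3 36 2 15 12 15.0
# 4 36 3 10  6 10.0
# 5 40 4 10 15 10.0
Rows sharing both id and z (here the two rows with id 32 and z 1) get the mean of their x values; every other row simply keeps its own x.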
I am working in R with a series of data values that have an x position (distance along a transect) and a z position (distance from the ground at that x position). There is not a measured value at every x, z coordinate, and for the analysis I need to perform I have to code a 0 wherever a value is missing. Here is a short example; the real data is usually 14,000-20,000 rows. In Matlab we solve this by creating an empty matrix and filling it. I need an x, z matrix whose dimensions run up to max(x) and max(z). So in the sample below, max z is 8 and max x is 4, and I need a 4 x 8 matrix where 0 is entered wherever no value is present. I'm just not sure of the best, most efficient way to do this in R.
x <- c(1,1,1,1,1,2,2,3,3,4,4,4)
z <- c(1,4,5,6,7,1,4,2,8,1,2,5)
value <- c(9,9,9,9,9,9,9,9,9,9,9,9)
data.frame(x,z, value)
Thanks ahead of time!
In R you would do it much the same way as you describe in Matlab. First, create a matrix with all zeroes:
df <- data.frame(x, z, value)
mat <- matrix(0, nrow = max(df$x), ncol = max(df$z))  # 4 x 8 for this sample
Then comes the slightly tricky part, where you select the elements to fill by indexing with a matrix of positions and assign the values in one step:
mat[cbind(df$x, df$z)] <- df$value
What cbind does here is create a two-column matrix of (row, column) positions; indexing mat with it selects exactly those elements, and each one is assigned the corresponding value.
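For the sample data above the result can be checked directly:
head(cbind(df$x, df$z))  # the (row, column) pairs used as the index
mat[1, 4]                # 9, the value recorded at x = 1, z = 4
mat[2, 3]                # 0, no measurement at x = 2, z = 3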
I have a data frame of 15 columns where the first column is integer and the others are numeric. I have to generate a one-line summary containing the sum of every column except the last one, and the mean of the last column. So I am doing something like this:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then shows values with decimal places even for the integer column (the first one), and I have been trying the round function to fix this. I can understand it when different types are added, e.g. 1 + 1.0, but in this case shouldn't the summation maintain the data type?
Please let me know what I am missing.
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with #Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
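As for why the original one-liner shows decimals for the integer column: sum() itself keeps the integer type, but c() coerces everything to the most general type present, so the integer sum becomes a double the moment it is combined with a mean. A quick illustration with the df above:
class(sum(df$int))                     # "integer"
class(c(sum(df$int), mean(df$num2)))   # "numeric" -- c() coerces to double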
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))
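In case the division trick is unclear: it takes the column sums of everything and then divides only the last one by the number of rows, turning that sum into a mean. Written out more explicitly, the same result is:
c(colSums(iris[, 1:3]), mean = mean(iris[, 4]))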
Say I have loaded a csv file into R with two columns (column A and column B, say) containing real-valued entries. Call the data frame df. Is there a way of speeding up the following code:
dfm <- df[floor(A) = x & floor(B) = y,]
x <- 2
y <- 2
dfm
I am hoping there is something akin to a function, e.g.
dfm <- function(x,y) {df[floor(A) = x & floor(B) = y,]}
so that I can type, for example, dfm(2,2).
Any help much appreciated.
The way that's written right now won't work for a few reasons:
You need to assign values to x and y before you assign dfm. In other words, the lines x <- 2 and y <- 2 must come before the dfm <- ... line.
R doesn't know what A and B are, even if you put them inside the brackets of the dataframe that contains them. You need to write df$A and df$B.
= is the assignment operator, but you're looking for the logical operator ==. Right now your code is saying "Assign the value x to floor(A)" (which doesn't really make sense). You want to tell it "Only choose rows where floor(A) equals x", or floor(A)==x.
So what you want is:
dfm.create <- function(x,y) {df[floor(df$A)==x & floor(df$B)==y,]}
dfm <- dfm.create(2,2)
Note that if you want the data frame to be called dfm, you don't want to give the function the same name, or the later assignment will simply overwrite the function.
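If the repeated subsetting itself turns out to be slow (an assumption; the answer above does not address speed), computing the floored columns once and reusing them avoids calling floor() on every lookup. fA, fB and dfm.create2 are just illustrative names:
df$fA <- floor(df$A)   # computed once, reused by every call
df$fB <- floor(df$B)
dfm.create2 <- function(x, y) df[df$fA == x & df$fB == y, ]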
I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status for each pixel into a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the apply function I'm exploding each string again using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one result per pixel (column)
out <- numeric(n)
for(i in 1:n) {
out[i] <- snow_free(doy_stack[,i], snow_stack[,i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, decide whether the operation between the matrices should run row-wise or column-wise. If it is row-wise, you need to transpose X and Y and then cast them to data.frame. This is because a data.frame is a list with the columns as its elements, and mapply() iterates over the elements of the lists it is given. In this example I'll compute the correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (they don't have to be exactly the same dimensions). For instance, one could run a statistical test row-wise even though the two matrices have differing numbers of columns.
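For example, a row-wise two-sample test where the second matrix has a different number of columns (W and the use of t.test here are purely illustrative):
set.seed(3)
W <- matrix(runif(150), 10, 15)   # 10 rows, 15 columns
res.t <- mapply(function(x, w) t.test(x, w)$p.value,
                as.data.frame(t(X)), as.data.frame(t(W)))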
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (e.g. ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
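A minimal sketch of that idea (assuming the raster package is installed; the random matrices stand in for real tiles):
library(raster)
r1 <- raster(matrix(runif(100), 10, 10))
r2 <- raster(matrix(runif(100), 10, 10))
r3 <- r1^2 + r2                        # cell-wise arithmetic on whole rasters
extract(r3, cbind(x = 0.5, y = 0.5))   # cell value at an XY coordinate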
apply can work on higher-dimensional arrays, so if you first bind the two stacks into a 3-D array, something like this might be what you are looking for (the x passed to the function is the length-2 vector holding one value from each stack):
arr <- simplify2array(list(doy_stack, snow_stack))  # dims: nrow x ncol x 2
apply(arr, c(1, 2), function(x) r_part(x[1], x[2]))