I have a vector of numeric samples. I have calculated a smaller vector of breaks that group the values. I would like to create a boxplot that has one box for every interval, with the width of each box coming from a third vector, the same length as the breaks vector.
Here is some sample data. Please note that my real data has thousands of samples and at least tens of breaks:
v <- c(seq.int(5), seq.int(7) * 2, seq.int(4) * 3)
v1 <- c(1, 6, 13) # breaks
v2 <- c(5, 10, 2) # relative widths
This is how I might make separate boxplots, ignorant of the widths:
boxplot(v[v1[1]:(v1[2] - 1)])
boxplot(v[v1[2]:(v1[3] - 1)])
boxplot(v[v1[3]:length(v)])
I would like a solution that does a single boxplot() call without excessive data conditioning. For example, putting the vector in a data frame and adding a column for region/break number seems inelegant, but I'm not yet "thinking in R", so perhaps that is best.
Base R is preferred, but I will take what I can get.
Thanks.
Try this:
v1 <- c(v1, length(v) + 1)  # append a sentinel break just past the last sample
a01 <- unlist(mapply(rep, 1:(length(v1) - 1), diff(v1)))  # group index for every sample
boxplot(v ~ a01, names = paste0("g", 1:(length(v1) - 1)))
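Since the question also asks to take the box widths from v2, note that boxplot() has a width argument (a vector of relative box widths, one per group). Assuming the a01 grouping built above, the widths can be folded into the same single call:
boxplot(v ~ a01, width = v2, names = paste0("g", 1:(length(v1) - 1)))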
So let's say you have a vector
a = c(1:10)
But I only want to plot elements 2, 5 and 7, and at indices 2, 5 and 7, not y values 2, 5 and 7 at x values 1, 2, 3.
I can use:
plot_subset_ind <- c(2, 5, 7)
plot(plot_subset_ind, a[plot_subset_ind])
However, for the function matplot(), when plotting a matrix, I don't know how to do this:
original:
matplot(t(max_invest_year_zero_matrix/1000))
not working, because all the data is shifted by one index:
matplot(t(max_invest_year_zero_matrix[,plot_subset_ind]/1000))
Maybe I should replace the non-plotted values with NaN values.
It's not clear whether you want to plot some columns, or just some rows of all columns.
See the difference between the two plots below, and note that t() isn't used in either.
max_invest_year_zero_matrix <- matrix(1:64, ncol = 8)
plot_subset_ind <- c(2, 5, 7)
matplot(max_invest_year_zero_matrix[plot_subset_ind, ])
matplot(max_invest_year_zero_matrix[, plot_subset_ind])
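If the goal from the question is to keep the subset at its original x positions (2, 5 and 7) rather than have it re-indexed as 1, 2, 3, two sketches using the example objects above may help: matplot() accepts an explicit x vector as its first argument, and the NA idea from the question also works, since NA cells are simply not drawn.
matplot(plot_subset_ind, max_invest_year_zero_matrix[plot_subset_ind, ])  # rows 2, 5, 7 plotted at x = 2, 5, 7
m_na <- max_invest_year_zero_matrix
m_na[-plot_subset_ind, ] <- NA  # blank out the rows you do not want
matplot(m_na)  # remaining points keep their original row positions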
So I've been pondering how to do this without a for loop and I couldn't come up with a good answer. Here is an example of what I mean:
sampleData <- matrix(rnorm(25, 0, 1), 5, 5)
meanVec <- vector(length = length(sampleData[, 1]))
for (i in 1:length(sampleData[, 1])) {
  subMat <- sampleData[1:i, ]
  ifelse(i == 1, sumVec <- sum(subMat), sumVec <- apply(subMat, 2, sum))
  meanVec[i] <- mean(sumVec)
}
meanVec
The actual matrix I want to do this to is reasonably large, and to be honest, for this application it won't make a huge difference in speed, but it's a question I think should be answered:
How can I get rid of that for loop and replace with some *ply call?
Edit: In the example given, I generate sample data and define a vector whose length equals the number of rows in the matrix.
The for loop does the following steps:
1) takes a submatrix from row 1 to row i
2) if i is 1, it just sums up the values in that vector
3) if i is not 1, it gets the sum of each row, then takes the mean of those sums and stores that in position i of the vector meanVec.
Finally, meanVec is printed out.
This does what you describe:
cumsum(rowSums(sampleData))/seq_len(nrow(sampleData))
However, your code doesn't do the same.
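For reference, a vectorised version of what the posted loop actually computes (the mean of the column sums of the growing submatrix, with the i == 1 case handled separately) might look like this:
loopEquiv <- cumsum(rowSums(sampleData)) / ncol(sampleData)  # mean of colSums(sampleData[1:i, ])
loopEquiv[1] <- sum(sampleData[1, ])  # the i == 1 branch sums the first row without dividing
loopEquiv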
x <- c("a", 2, 3, 1.0)
y <- c("b", 1, 6, 7.9)
z <- c("c", 1, 8, 2.0)
p <- c("d", 2, 9, 3.3)
df1 <- data.frame(x,y,z,p)
Here is a quick example data set, but it doesn't mirror exactly what I'm trying to do. Say I wanted to take 50 random samples from each level of the factor in row 2 (in this case we only have 2 levels of the factor). How would I go about coding that efficiently? I have a version working in a loop, but it feels needlessly complex.
Edit: when I say I want to take 50 random samples, I mean take 50 columns from each level of the factor.
You will need to extract a factor (assuming that the 2nd row is a factor).
fact <- as.factor(as.matrix(df1[2,]))
Then work with the second row, which you want to treat as the factor. For example, to select all columns for the first level of the factor:
df1[, df1[2, ] == levels(fact)[1]]
Or for getting exactly 50:
df1[, df1[2, ] == levels(fact)[1]][1:50]  # the first 50 matching columns
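If you want 50 columns drawn at random from each level rather than simply the first 50, one base R sketch (assuming, as in the question, that each level really has at least 50 columns) would be:
idx1 <- which(as.vector(as.matrix(df1[2, ])) == levels(fact)[1])  # columns belonging to the first level
df1[, sample(idx1, 50)]  # 50 random columns from that level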
Maybe you're looking to do something like this:
x1 <- df1[,sample(c(1,4),50,replace = TRUE)]
x2 <- df1[,sample(c(2,3),50,replace = TRUE)]
...but your question is very confusing. "factor" refers to something very specific in R: a type of variable that is generally stored in a column of a data frame, never a row. Additionally, you appear to be forcing all your columns themselves to be factors (or characters possibly), which seems an odd way to store the value 3.3.
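To illustrate that point, here is a purely hypothetical re-arrangement of the example data (not the poster's actual layout) in which the grouping variable is a real factor stored in a column and the numeric values stay numeric:
df2 <- data.frame(id = c("a", "b", "c", "d"),
                  group = factor(c(2, 1, 1, 2)),
                  v1 = c(3, 6, 8, 9),
                  v2 = c(1.0, 7.9, 2.0, 3.3))
df2[df2$group == levels(df2$group)[1], ]  # all rows in the first level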
I have a set of observations on an irregular grid. I want to have them on a regular grid with a resolution of 5. This is an example:
d <- data.frame(x=runif(1e3, 0, 30), y=runif(1e3, 0, 30), z=runif(1e3, 0, 30))
## interpolate xy grid to change irregular grid to regular
library(akima)
d2 <- with(d,interp(x, y, z, xo=seq(0, 30, length = 500),
yo=seq(0, 30, length = 500), duplicate="mean"))
How can I get d2 into the SpatialPixelsDataFrame class, which has 3 columns: the coordinates and the interpolated values?
You can use code like this (thanks to the comment by @hadley):
d3 <- data.frame(x = d2$x[row(d2$z)],
                 y = d2$y[col(d2$z)],
                 z = as.vector(d2$z))
The idea here is that a matrix in R is just a vector with a bit of extra information about its dimensions. The as.vector call drops that information, turning the 500x500 matrix into a linear vector of length 500*500=250000. The subscript operator [ does the same, so although row and col originally return a matrix, that is treated as a linear vector as well. So in total, you have three matrices, turn them all to linear vectors with the same order, use two of them to index the x and y vectors, and combine the results into a single data frame.
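A tiny illustration of that flattening, using a throwaway matrix rather than the interpolation output:
m <- matrix(0, nrow = 2, ncol = 3)
row(m)  # a 2 x 3 matrix of row numbers
as.vector(row(m))  # 1 2 1 2 1 2, flattened in column-major order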
My original solution didn't use row and col, but instead used rep to build the x and y columns. It is a bit more difficult to understand and remember, but might be a bit more efficient, and can give you some insight useful for more difficult applications.
d3 <- data.frame(x = rep(d2$x, times = 500),
                 y = rep(d2$y, each = 500),
                 z = as.vector(d2$z))
For this formulation, you have to know that a matrix in R is stored in column-major order. The second element of the linearized vector therefore is d2$z[2,1], so the row number changes between two subsequent values, while the column number remains the same for a whole column. Consequently, you want to repeat the x vector as a whole, but repeat each element of y by itself. That's what the two rep calls do.
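To go from this plain data frame to the SpatialPixelsDataFrame the question asks about, one sketch using the sp package (assuming the regular 500 x 500 grid produced by interp() above) is:
library(sp)
coordinates(d3) <- ~ x + y  # promote the data frame to a SpatialPointsDataFrame
gridded(d3) <- TRUE  # snap the points to the regular grid: now a SpatialPixelsDataFrame
class(d3)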
I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the apply function I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
n <- ncol(doy_stack)  # one result per pixel (column)
out <- numeric(n)
for (i in 1:n) {
  out[i] <- snow_free(doy_stack[, i], snow_stack[, i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (i.e. they don't have to have exactly the same dimensions). For instance, one might want a row-wise statistical test between matrices with differing numbers of columns.
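For example, here is a hypothetical sketch reusing the same mapply pattern for a row-wise Welch t-test between matrices with different numbers of columns:
A <- matrix(rnorm(50), 10, 5)
B <- matrix(rnorm(80), 10, 8)
pvals <- mapply(function(a, b) t.test(a, b)$p.value,
                as.data.frame(t(A)), as.data.frame(t(B)))  # one p-value per row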
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (e.g. ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
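A minimal sketch of that idea, with toy rasters standing in for the real Landsat stacks:
library(raster)
r1 <- raster(matrix(runif(100), 10, 10))  # e.g. a day-of-year layer
r2 <- raster(matrix(runif(100), 10, 10))  # e.g. a snow layer
r3 <- r1^2 + r2  # whole-raster arithmetic
extract(r3, cbind(0.5, 0.5))  # cell value at an XY coordinate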
apply can also work across the margins of a higher-dimensional array, so another option is to bind the two matrices into a single 3-D array and apply over the first two margins. Not sure how your data is set up, but something like this might be what you are looking for:
both <- array(c(doy_stack, snow_stack), dim = c(dim(doy_stack), 2))
apply(both, c(1, 2), function(x) r_part(x[1], x[2]))