I have a data frame and a vector of unequal lengths. They do not share an id.
df <- data.frame(
id = factor(rep(1:24, each = 10)),
x = runif(20)*100
)
a <- sort(runif(100*100))
Now, I would really like to run over each row of the data frame and find the location in the vector (a) of the closest corresponding value for each id.
For a single value, this is just:
which.min(abs(df[1, 2] - a))
So, if I did it "manually" it would be:
a.location <- c(
which.min(abs(df[1, 2] - a)),
which.min(abs(df[2, 2] - a)),
....,
which.min(abs(df[24, 2] - a))
)
But I simply can't wrap my head around how to do this in a function, since I can't merge the data frame and the vector. I've looked at mapply, but that doesn't handle unequal lengths well, and at rowwise from dplyr, but I haven't had much luck with that either.
You can use a rolling join from the data.table package:
library(data.table)
setkey(setDT(df), x)                          # convert df to a data.table, keyed on x
df1 <- data.table(x = a, id1 = 1:length(a))   # id1 = position of each value within a
setkey(df1, x)
df1[df, roll = "nearest"]                     # match each df$x to the nearest value of a
The id1 column will give you the desired result.
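A minimal hedged usage sketch (the name res is mine; note that setkey() has reordered df by x, so the joined rows follow that sorted order):

res <- df1[df, roll = "nearest"]
res$id1   # position in a of the nearest value for each (sorted) row of df

An equivalent base-R one-liner, offered as an alternative rather than as part of this answer:

a.location <- vapply(df$x, function(v) which.min(abs(v - a)), integer(1))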
I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both data frames; however, the numeric values differ, and the correct versions are in df2 but need to end up in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without referencing every column).
Here some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help, but I will try to summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their row order already matches, then you can achieve this easily with the following subset notation:
df1[,c({column names df1}), drop = FALSE] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c, and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[,c("b","c"), drop = FALSE] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument is just added protection when subsetting a data.frame, to ensure you get a data frame back rather than a vector.
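Applying this to the question's df1/df2 (same column names, same row order), a hedged sketch that avoids typing each column reference by hand:

cols <- intersect(names(df1), names(df2))   # here: c("var1", "var2")
df1[, cols] <- df2[, cols]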
If you do NOT know that the order is correct, or one data frame may have a different size than the other, BUT there is a unique identifier shared between the two data.frames, then I would personally use a function designed for merging two data frames. Depending on your preference, you can use merge from base R or the *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
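For completeness, a base-R merge() equivalent would look something like the following sketch (same assumption that a and x are unique, matchable keys):

new_df <- merge(df1, df2, by.x = "a", by.y = "x", all.x = TRUE)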
When trying to find the maximum values of a split list, I run into serious performance issues.
Is there a way I can optimize the following code:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind(y, x)
my_data <- data.frame(my_data)
# This is the critical part I would like to optimize
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
I want to get the rows where a given column hits its maximum for a given group (it should be easier to understand from the code).
I know that splitting into a list is probably the reason for the slow performance, but I don't know how to circumvent it.
This may not be immediately clear to you.
There is a base R function max.col that does something similar, except that it finds the position of the maximum along a matrix row (not column). So if you transpose your original matrix x, you will be able to use this function.
Complexity steps in when you want to do max.col by group; the split-lapply convention is needed. But if, after the transpose, we convert the matrix to a data frame, we can use split.default. (Note that it is not split or split.data.frame: here the data frame is treated as a list (vector), so the split happens among the data frame columns.) Finally, we use sapply to apply max.col by group and cbind the results into a matrix.
tx <- data.frame(t(x))            # transpose: columns of tx are the original rows of x
tx.group <- split.default(tx, y)  # note the `split.default`, not `split`
pos <- sapply(tx.group, max.col)  # 9000 x 100 look-up table of within-group positions
The resulting pos is something like a look-up table: it has 9000 rows and 100 columns (groups). The entry pos[i, j] gives the within-group index you want for the i-th column (of your original, non-transposed matrix) and the j-th group. So your final extraction for the 50th column and all groups is:
max_values <- Map("[[", tx.group, pos[50, ])
You generate the look-up table once and can make arbitrary extractions at any time.
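For example, a later extraction for a different column (say the 100th) reuses the same table; max_values_100 is just an illustrative name:

max_values_100 <- Map("[[", tx.group, pos[100, ])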
Disadvantage of this method:
After the split, data in each group are stored in a data frame rather than a matrix. That is, for example, tx.group[[1]] is a 9000 x 9 data frame. But max.col expects a matrix so it will convert this data frame into a matrix internally.
Thus, the major performance / memory overhead includes:
initial matrix transposition;
matrix to data frame conversion;
data frame to matrix conversion (per group).
I am not sure whether we can eliminate all of the above with functions from the matrixStats package. I look forward to seeing a solution along those lines.
But anyway, this answer is already much faster than what the OP originally does.
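If only a single column (say the 50th of x) is ever needed, here is a hedged base-R sketch that avoids the transpose and the data-frame conversions entirely; it is an alternative added for comparison, not part of the approach above:

grp_rows <- split(seq_len(nrow(x)), y)   # row indices of x, grouped by y
best <- vapply(grp_rows, function(i) i[which.max(x[i, 50])], integer(1))
max_values_direct <- x[best, ]           # one row per group, in group order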
A solution using {dplyr}:
# Generate data for this MWE
x <- matrix(runif(900 * 9000), nrow = 900, ncol = 9000)
y <- rep(1:100, each = 9)
my_data <- cbind.data.frame(y, x)
# This is the critical part I would like to optimize
system.time({
my_data_split <- split(my_data, y)
max_values <- lapply(my_data_split, function(x) x[which.max(x[ , 50]), ])
})
# Using {dplyr} is 9 times faster, but you get results in a slightly different format
library(dplyr)
system.time({
max_values2 <- my_data %>%
group_by(y) %>%
do(max_values = .[which.max(.[[50]]), ])
})
all.equal(max_values[[1]], max_values2$max_values[[1]], check.attributes = FALSE)
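With more recent dplyr (1.0 and later), slice_max() expresses the same grouped "row at the maximum" idea more directly; a hedged sketch, assuming column 50 of my_data is still the target:

max_values3 <- my_data %>%
  group_by(y) %>%
  slice_max(order_by = .data[[names(my_data)[50]]], n = 1, with_ties = FALSE) %>%
  ungroup()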
I have the following data frame df2 and a vector n. How can I create a new data frame containing only the columns of df2 whose names appear in the vector n?
df2 <- data.frame(x1=c(1,6,3),x2=c(4,3,1),x3=c(5,4,6),x4=c(7,6,7))
n<-c("x1","x4")
Any of these would work:
df2[n]
df2[, n] # see note below for caveat
subset(df2, select = n)
Note that for the second one, if n is of length one, i.e. a single column, it returns a vector rather than a data frame; if you want it to always return a data frame, you would instead need:
df2[, n, drop = FALSE]
df3 <- subset(df2, select=c("x1", "x4"))
df3
Hope it helps.
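A hedged dplyr alternative, if you prefer that style (all_of() requires dplyr 1.0 or later):

library(dplyr)
df3 <- select(df2, all_of(n))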
I have a list of data frames x and I want to find the mean of each element across the data frames. I found an elegant solution online courtesy of Dimitris Rizopoulos.
x.mean = Reduce("+", x) / length(x)
However, this doesn't really work when the data frames contain NA values. Is there a good way to accomplish this?
Here is an approach that uses data.table
The steps are: (1) coerce each data.frame element in x to a data.table, with a column (called rn) identifying the rownames; (2) on the large combined data.table, calculate the mean of each column by rowname (with na.rm = TRUE dealing with NA values); (3) remove the rn column.
library(data.table)
results <- rbindlist(lapply(x, data.table, keep.rownames = TRUE))[,
  lapply(.SD, mean, na.rm = TRUE), by = rn][, rn := NULL]
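A quick check on a small hypothetical list (example data made up for illustration, not taken from the question):

x <- list(
  data.frame(a = c(1, NA, 3), b = c(4, 5, 6)),
  data.frame(a = c(7, 8, 9),  b = c(NA, 2, 3))
)
# the data.table approach above then gives, per rowname:
#   a: 4, 8, 6    b: 4, 3.5, 4.5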
An alternative would be to coerce each element to a matrix, "simplify" to a 3-dimensional array, and then apply a mean over the appropriate margins:
# for example
results <- as.data.frame(apply(simplify2array(lapply(x, as.matrix)), 1:2, mean, na.rm = TRUE))
I like #mnel's solution better, but as an educational exercise here's how you can modify your expression to work with NA values while keeping the same type of logic:
Reduce(function(y,z) {y[is.na(y)] <- 0; z[is.na(z)] <- 0; y + z}, x) /
Reduce('+', lapply(x, function(y) !is.na(y)))
I have a large data.frame, and I'd like to be able to reduce it by using a quantile subset by one of the variables. For example:
x <- rep(1:10, 10)
df <- data.frame(x,rnorm(100))
df2 <- subset(df, df$x == 1)
df3 <- subset(df2, df2[2] > quantile(df2$rnorm.100.,0.8))
What I would like to end up with is a data.frame that contains the rows above the quantile for each of x = 1, 2, 3, ..., 10.
Is there a way to do this with ddply?
You could try:
ddply(df, .(x), subset, rnorm.100. > quantile(rnorm.100., 0.8))
And off topic: you could use df <- data.frame(x, y = rnorm(100)) to name the column on the fly.
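A hedged dplyr equivalent of the same idea (assuming the column has been named y as suggested above):

library(dplyr)
df %>%
  group_by(x) %>%
  filter(y > quantile(y, 0.8)) %>%
  ungroup()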
Here's a different approach with the little-used ave() function (very fast to calculate this way).
Make a new column that contains the quantile calculation for each level of x:
df$quantByX <- ave(df$rnorm.100., df$x, FUN = function (x) quantile(x,0.8))
Select the x column and the new column, keeping only the unique rows:
df2 <- unique(df[,c(1,3)])
The result is one data frame with the unique items in the x column and the calculated quantile for each level of x.
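If you instead want the rows above each group's quantile (as in the original subset call), the new column can also be used as a filter; a hedged sketch with the column names above:

df4 <- df[df$rnorm.100. > df$quantByX, ]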