Related
I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.
For example, given the below data.frame:
# Example data.frame
df <- data.frame(id = 1:10,
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10))
I'd like to have some function that converts it into this:
# Partitioning function
foo(df, partitions = 3)
# Expected output
list(data.frame(id = df$id, x1 = df[ ,2]),
data.frame(id = df$id, x2 = df[ ,3]),
data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]),
Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.
columns_per_element <- c(1,1,2)
foo(df, columns_per_element)
It is actually easier to define a function with the splitting sequence. The key functions here are repand split.default i.e.
f2 <- function(df, n, split){
i1 <- rep(seq(n), split)
res_list <- split.default(df[-1], i1)
return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}
f2(df, 3, c(1, 1, 2))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2
1 1 0.8529416
2 2 0.4555094
3 3 -0.3620756
4 4 1.4779813
5 5 -1.6484066
6 6 -0.5697431
7 7 -0.2139384
8 8 0.1619074
9 9 -0.5390306
10 10 -0.2228809
$`3`
ID x3 x4
1 1 -0.2579865 1.185526074
2 2 -0.0519554 -0.388179976
3 3 2.5350092 -0.675504829
4 4 -1.7051955 0.073448252
5 5 0.6207733 -0.637220508
6 6 0.3015831 -1.324024114
7 7 -0.5647717 0.969025962
8 8 0.1404714 -1.575383604
9 9 1.3049560 -1.846413101
10 10 -0.6716643 0.008675125
f2(df, 3, c(1, 2, 1))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2 x3
1 1 0.8529416 -0.2579865
2 2 0.4555094 -0.0519554
3 3 -0.3620756 2.5350092
4 4 1.4779813 -1.7051955
5 5 -1.6484066 0.6207733
6 6 -0.5697431 0.3015831
7 7 -0.2139384 -0.5647717
8 8 0.1619074 0.1404714
9 9 -0.5390306 1.3049560
10 10 -0.2228809 -0.6716643
$`3`
ID x4
1 1 1.185526074
2 2 -0.388179976
3 3 -0.675504829
4 4 0.073448252
5 5 -0.637220508
6 6 -1.324024114
7 7 0.969025962
8 8 -1.575383604
9 9 -1.846413101
10 10 0.008675125
Here is solution with two parameters in the function with a vectorized column select. note this assumes the first column is id and is called id. second if the sum of the vector is greater than ncol(df)-1 (this will be your input df) it will throw an error.
f2 <- function(x,y){
#keep id
id <- x[,"id" , drop = FALSE]
#keep all other variables
df2 <- x[,-1]
#get sequence for columns
y2 <- lapply(cumsum(y), function(x){sequence(x)})
#grab correct columns
y3 <- c(y2[1],mapply(dplyr::setdiff ,y2[2:length(y2)],y2[1:2]))
#recreate df
lapply(y3,
function(x){
cbind.data.frame(id, df2[,x, drop = FALSE])
})
}
f2(df, c(1,1,2))
I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly selecting 3 elements from the third and fourth column and replace their values by 0. So the manipulated dataframe could be like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw from here Random number selection from a data-frame that it could be easier if I convert the data frame into matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3,4)
n <- nrow(mat)*length(cols)
v <- sample( x=1:n, size=3 )
m <- matrix(FALSE, ncol=length(cols), nrow=nrow(mat))
m[v] <- TRUE
for( i in seq_along(cols) ) {
mat[ m[,i], cols[i] ] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
Here is one way using replace
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15
I have a list of dataframes (and parameters for sensitivity analyses for a study), and I want to mutate each dataframe in the same way. The expected output is generated by the code below (a new column x2). Is there a way to assign the resulting dataframes (newdfs) to the list without using a for loop?
models <- list(m1 = list('params' = list('start'='2014-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(4,5,6))),
m2 = list('params' = list('start'='2017-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(7,8,9))))
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
z$data})
# Can I do this without "for"?
for(x in 1:length(models)) models[[x]]$data <- newdfs[[x]]
You can try this:
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
return(z)})
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
Revise the function in lapply() to return z instead of z$data:
lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1 ; z})
To make this question complete, here are two purrr solutions:
library(purrr)
map() + map_at()
map(models, map_at, "data", transform, x2 = x1 + 1)
transpose() + map()
models %>%
transpose %>%
`[[<-`(., "data", map(.$data, transform, x2 = x1 + 1)) %>%
transpose
Output
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
library(dplyr)
I have a set of vectors:
Sp_A <- c("A",1,2,3,4,5,6,7,8)
Sp_B <- c("B",9,10,11,12,13,14,15,16)
Sp_C <- c("C",17,18,19,20,21,22,23,24)
which I have made into a list of vectors:
list <- ls(pattern = "Sp_")
I want to use this list to loop over each vector in the list and make it into a data frame . I currently do this for one vector using this:
A_df <- select(data.frame(rep(Sp_A[1], each = 4), c(Sp_A[c(2,4,6,8)]), c(Sp_A[c(3,5,7,9)])), name = 1, var1 = 2, var2 = 3)
I have tried to make this operation into a for loop like this:
for(i in list) {
test[i] <- select(A_df <- data.frame(rep(i[1], each = 4),
c(i[c(2,4,6,8)]),
c(i[c(3,5,7,9)]),
name = 1, var1 = 2, var2 = 3))
}
but to no avail.
I have heard that I might be able to use apply() for this sort of thing but I don't know how.
Maybe this:
lapply(list,function(x) data.frame(name=get(x)[1],matrix(get(x)[-1],ncol = 2)))
[[1]]
name X1 X2
1 A 1 5
2 A 2 6
3 A 3 7
4 A 4 8
[[2]]
name X1 X2
1 B 9 13
2 B 10 14
3 B 11 15
4 B 12 16
[[3]]
name X1 X2
1 C 17 21
2 C 18 22
3 C 19 23
4 C 20 24
Or a simple for loop to assign the dataframes to objects:
for (x in 1:length(list)){
assign(paste0("test",x),data.frame(name=get(list[x])[1],matrix(get(list[x])[-1],ncol = 2)))
}
The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.
As a simply solution, do:
VARS <- colnames(DF)[-1]
y <- VARS[apply(DF[, -1], MARGIN = 1, FUN = which.min)]
DF$y <- y
The function which.min returns the index of the minimum. If the minimum is not unique it returns the first one. Since you guarantee that there is no tie, this is not an issue here.
Finally, you should be familiar with apply, right? MARGIN = 1 means applying function FUN row-wise, while MARGIN = 2 means applying FUN column-wise. This is an useful function to avoid the need for a for loop when dealing with matrix. Since your data frame only contains numerical/integer values, it is like a matrix hence we can use apply.
Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
3 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2
a data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx.
Please notice that data.table is particularily fast in large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to #eipi10 in a question I started out of this problem (see here):
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer, when applyed to a 1e6 rows data frame (about 17 secs in my sony laptop).