split function not working properly for simulated data? - r

I simulated some data which I wanted to split into a list of data based on id but it seems that the split() function is not working properly?
set.seed(323)
#simulate some data
tsfunc2 <- function () {
x1 = rnorm(25, mean = 3, sd = 1)
x2.sample = rnorm(5, mean = 2, sd = 0.5)
x2 = rep(x2.sample, each = 5)
mu = rnorm(25, mean = 10, sd = 2)
y=as.numeric(mu + x1 + x2)
data.frame(id=rep(1:5, each=5), time=1:5, x1=x1, x2=x2, y=y)
}
set.seed(63)
#create a dataset in which the simulated data are randomly sampled in order to create imbalanced panel data
fd <- function() {
df <- tsfunc2()[sample(nrow(tsfunc2()), 20), ]
ds <- df[with(df, order(id, time)),]
return(ds)
}
set.seed(124)
split(fd(), fd()$id) #it seems that data are not properly split based on id (e.g., the first row of id2)
$`1`
id time x1 x2 y
1 1 1 1.614929 1.900059 13.43994
3 1 3 2.236970 1.900059 14.49136
4 1 4 3.212306 1.900059 15.08736
$`2`
id time x1 x2 y
5 1 5 4.425538 1.900059 15.53696 #this row is supposed to be in id1
7 2 2 3.700229 2.027456 17.48522
8 2 3 2.770645 2.027456 15.20741
9 2 4 3.197094 2.027456 13.44979
$`3`
id time x1 x2 y
12 3 2 1.576201 1.658917 16.40684
13 3 3 2.594909 1.658917 14.34763
14 3 4 3.995387 1.658917 16.36730
15 3 5 3.958818 1.658917 15.37498
$`4`
id time x1 x2 y
16 4 1 3.918088 1.636148 15.48205
17 4 2 2.849030 1.636148 12.52288
18 4 3 1.776931 1.636148 12.54456
19 4 4 2.131176 1.636148 13.63235
20 4 5 1.957515 1.636148 15.55745
$`5`
id time x1 x2 y
21 5 1 1.896362 1.569048 12.54131
22 5 2 3.444185 1.569048 14.56303
23 5 3 2.795049 1.569048 12.67120
25 5 5 2.868678 1.569048 13.88765
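A likely explanation, judging only from the code above: split(fd(), fd()$id) calls fd() twice, and each call draws a fresh random sample. The grouping vector therefore comes from a different data frame than the one being split, so split() itself is behaving as documented. A minimal sketch of a fix, assuming the intent is to split one single draw (the names full, d and parts are made up for this illustration):

```r
# Simulate a balanced panel: 5 ids x 5 time points, as in the question.
tsfunc2 <- function() {
  x1 <- rnorm(25, mean = 3, sd = 1)
  x2 <- rep(rnorm(5, mean = 2, sd = 0.5), each = 5)
  mu <- rnorm(25, mean = 10, sd = 2)
  data.frame(id = rep(1:5, each = 5), time = 1:5,
             x1 = x1, x2 = x2, y = mu + x1 + x2)
}

# Keep 20 of the 25 rows to make the panel unbalanced.
# tsfunc2() is called once here, so the rows and the row count
# come from the same simulated data set.
fd <- function() {
  full <- tsfunc2()
  df <- full[sample(nrow(full), 20), ]
  df[order(df$id, df$time), ]
}

set.seed(124)
d <- fd()                # draw the sample exactly once
parts <- split(d, d$id)  # group by the id column of that same draw
```

Storing the result of fd() in a variable before splitting guarantees that the data and the grouping factor refer to the same rows.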

Related

Creating a list with column-wise partitions of a data.frame

I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.
For example, given the below data.frame:
# Example data.frame
df <- data.frame(id = 1:10,
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10))
I'd like to have some function that converts it into this:
# Partitioning function
foo(df, partitions = 3)
# Expected output
list(data.frame(id = df$id, x1 = df[ ,2]),
data.frame(id = df$id, x2 = df[ ,3]),
data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]))
Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.
columns_per_element <- c(1,1,2)
foo(df, columns_per_element)
It is actually easier to define a function with the splitting sequence. The key functions here are rep and split.default, i.e.
f2 <- function(df, n, split){
i1 <- rep(seq(n), split)
res_list <- split.default(df[-1], i1)
return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}
f2(df, 3, c(1, 1, 2))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2
1 1 0.8529416
2 2 0.4555094
3 3 -0.3620756
4 4 1.4779813
5 5 -1.6484066
6 6 -0.5697431
7 7 -0.2139384
8 8 0.1619074
9 9 -0.5390306
10 10 -0.2228809
$`3`
ID x3 x4
1 1 -0.2579865 1.185526074
2 2 -0.0519554 -0.388179976
3 3 2.5350092 -0.675504829
4 4 -1.7051955 0.073448252
5 5 0.6207733 -0.637220508
6 6 0.3015831 -1.324024114
7 7 -0.5647717 0.969025962
8 8 0.1404714 -1.575383604
9 9 1.3049560 -1.846413101
10 10 -0.6716643 0.008675125
f2(df, 3, c(1, 2, 1))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2 x3
1 1 0.8529416 -0.2579865
2 2 0.4555094 -0.0519554
3 3 -0.3620756 2.5350092
4 4 1.4779813 -1.7051955
5 5 -1.6484066 0.6207733
6 6 -0.5697431 0.3015831
7 7 -0.2139384 -0.5647717
8 8 0.1619074 0.1404714
9 9 -0.5390306 1.3049560
10 10 -0.2228809 -0.6716643
$`3`
ID x4
1 1 1.185526074
2 2 -0.388179976
3 3 -0.675504829
4 4 0.073448252
5 5 -0.637220508
6 6 -1.324024114
7 7 0.969025962
8 8 -1.575383604
9 9 -1.846413101
10 10 0.008675125
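To see why rep and split.default do the work in the answer above, it may help to look at the grouping vector on its own. This is only a sketch on a small made-up data frame; sizes, grp and parts are illustrative names, not part of the original answer:

```r
# Small illustrative data frame: one id column plus four value columns.
df <- data.frame(id = 1:3, x1 = 1:3, x2 = 4:6, x3 = 7:9, x4 = 10:12)

sizes <- c(1, 1, 2)                          # columns per list element
grp <- rep(seq_along(sizes), times = sizes)  # 1 2 3 3: one group label per non-id column

# split.default treats the data frame as a list of columns,
# so it partitions columns (not rows) according to grp.
parts <- split.default(df[-1], grp)

# Re-attach the id column to every partition.
res <- lapply(parts, function(p) cbind.data.frame(ID = df$id, p))
```

Because the number of groups is just length(sizes), the separate n argument of f2 could be dropped in a variant like this one.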
Here is a solution whose function takes two parameters and selects the columns in a vectorized way. Note that it assumes the first column is the identifier and is named id. If the sum of the vector is greater than ncol(df) - 1 (where df is your input data frame), it will throw an error.
f2 <- function(x, y){
  #keep id
  id <- x[, "id", drop = FALSE]
  #keep all other variables (drop = FALSE keeps a data frame even with one column)
  df2 <- x[, -1, drop = FALSE]
  #get cumulative column sequences, e.g. 1, 1:2, 1:4 for y = c(1, 1, 2)
  y2 <- lapply(cumsum(y), sequence)
  #grab correct columns: remove from each sequence the columns covered by the previous one
  #(SIMPLIFY = FALSE keeps a list even when all pieces have the same length)
  y3 <- c(y2[1], mapply(setdiff, y2[-1], y2[-length(y2)], SIMPLIFY = FALSE))
  #recreate df
  lapply(y3,
         function(x){
           cbind.data.frame(id, df2[, x, drop = FALSE])
         })
}
f2(df, c(1,1,2))

R: Sample n elements in certain columns in a dataframe/matrix and replace their values

I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly select 3 elements from the third and fourth columns and replace their values with 0. So the manipulated dataframe could look like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw in "Random number selection from a data-frame" that it could be easier if I converted the data frame into a matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3,4)
n <- nrow(mat)*length(cols)
v <- sample( x=1:n, size=3 )
m <- matrix(FALSE, ncol=length(cols), nrow=nrow(mat))
m[v] <- TRUE
for( i in seq_along(cols) ) {
mat[ m[,i], cols[i] ] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
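The suggestion above comes without code, so here is one possible reading of it, offered as a sketch rather than the answerer's exact method. It builds a two-column matrix of (row, column) positions restricted to columns 3 and 4, samples three of those positions, and uses them for matrix indexing. The names idx and pick are made up for this example:

```r
mat <- as.matrix(data.frame(rbind(1:5, 6:10, 11:15)))
cols <- 3:4

# One row per candidate cell: every row index paired with every target column.
idx <- as.matrix(expand.grid(row = seq_len(nrow(mat)), col = cols))

set.seed(1)
# Sample 3 distinct cells; drop = FALSE keeps the two-column shape.
pick <- idx[sample(nrow(idx), 3), , drop = FALSE]

# Indexing a matrix with a two-column integer matrix addresses individual cells.
mat[pick] <- 0
```

This also shows what went wrong in the question's attempt: sample(mat_matrix[, 3:4], 3) samples the cell values, which are then used as positions, instead of sampling positions.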
Here is one way using replace
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15

Mutate dataframes in a nested list without for loop

I have a list of dataframes (and parameters for sensitivity analyses for a study), and I want to mutate each dataframe in the same way. The expected output is generated by the code below (a new column x2). Is there a way to assign the resulting dataframes (newdfs) to the list without using a for loop?
models <- list(m1 = list('params' = list('start'='2014-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(4,5,6))),
m2 = list('params' = list('start'='2017-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(7,8,9))))
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
z$data})
# Can I do this without "for"?
for(x in 1:length(models)) models[[x]]$data <- newdfs[[x]]
You can try this:
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
return(z)})
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
Revise the function in lapply() to return z instead of z$data:
lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1 ; z})
To make this question complete, here are two purrr solutions:
library(purrr)
map() + map_at()
map(models, map_at, "data", transform, x2 = x1 + 1)
transpose() + map()
models %>%
transpose %>%
`[[<-`(., "data", map(.$data, transform, x2 = x1 + 1)) %>%
transpose
Output
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10

Combine vectors into data frame, using vector name as a column

library(dplyr)
I have a set of vectors:
Sp_A <- c("A",1,2,3,4,5,6,7,8)
Sp_B <- c("B",9,10,11,12,13,14,15,16)
Sp_C <- c("C",17,18,19,20,21,22,23,24)
which I have made into a list of vectors:
list <- ls(pattern = "Sp_")
I want to use this list to loop over each vector in the list and make it into a data frame. I currently do this for one vector using this:
A_df <- select(data.frame(rep(Sp_A[1], each = 4), c(Sp_A[c(2,4,6,8)]), c(Sp_A[c(3,5,7,9)])), name = 1, var1 = 2, var2 = 3)
I have tried to make this operation into a for loop like this:
for(i in list) {
test[i] <- select(A_df <- data.frame(rep(i[1], each = 4),
c(i[c(2,4,6,8)]),
c(i[c(3,5,7,9)]),
name = 1, var1 = 2, var2 = 3))
}
but to no avail.
I have heard that I might be able to use apply() for this sort of thing but I don't know how.
Maybe this:
lapply(list,function(x) data.frame(name=get(x)[1],matrix(get(x)[-1],ncol = 2)))
[[1]]
name X1 X2
1 A 1 5
2 A 2 6
3 A 3 7
4 A 4 8
[[2]]
name X1 X2
1 B 9 13
2 B 10 14
3 B 11 15
4 B 12 16
[[3]]
name X1 X2
1 C 17 21
2 C 18 22
3 C 19 23
4 C 20 24
Or a simple for loop to assign the dataframes to objects:
for (x in 1:length(list)){
assign(paste0("test",x),data.frame(name=get(list[x])[1],matrix(get(list[x])[-1],ncol = 2)))
}
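Note that matrix(..., ncol = 2) fills column by column, so the output above pairs 1 to 4 with 5 to 8, whereas the A_df in the question takes alternating elements (positions 2, 4, 6, 8 for var1 and 3, 5, 7, 9 for var2). If the alternating pairing is what is wanted, a sketch along these lines may be closer; to_df, vecs, dfs and combined are hypothetical names, and the conversion to numeric is an assumption about the intended column types:

```r
Sp_A <- c("A", 1, 2, 3, 4, 5, 6, 7, 8)
Sp_B <- c("B", 9, 10, 11, 12, 13, 14, 15, 16)
Sp_C <- c("C", 17, 18, 19, 20, 21, 22, 23, 24)

# Collect the vectors themselves (not just their names) into a named list.
vecs <- mget(ls(pattern = "^Sp_"))

# First element is the name; the rest are (var1, var2) pairs in sequence.
to_df <- function(v) {
  m <- matrix(as.numeric(v[-1]), ncol = 2, byrow = TRUE)
  data.frame(name = v[1], var1 = m[, 1], var2 = m[, 2])
}

dfs <- lapply(vecs, to_df)       # one data frame per vector
combined <- do.call(rbind, dfs)  # optionally stack them into a single data frame
```

Keeping the data frames in a list, rather than creating test1, test2, ... with assign(), usually makes later steps such as rbind or further lapply calls simpler.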

Create a new variable from the minimum in R

The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.
As a simple solution, do:
VARS <- c("x1", "x2", "x3")
y <- VARS[apply(DF[, VARS], MARGIN = 1, FUN = which.min)]
DF$y <- y
Selecting x1, x2 and x3 by name keeps the minX column created above out of the comparison. The function which.min returns the index of the minimum. If the minimum is not unique, it returns the first one. Since you guarantee that there are no ties, this is not an issue here.
A note on apply: MARGIN = 1 applies the function FUN row-wise, while MARGIN = 2 applies it column-wise. This is a useful way to avoid a for loop when working with a matrix. Since the selected columns contain only numeric values, apply can treat them as a matrix.
Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
# 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2
A data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx.
Please note that data.table is particularly fast on large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to @eipi10, from a question I opened about this problem:
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer when applied to a data frame with 1e6 rows (about 17 seconds on my Sony laptop).
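Both timed solutions call a function once per row through apply. As a hedged alternative, the max.col idea from the earlier answer can be used in plain base R without any per-row call; whether it is actually faster on a given machine is not something measured here. The names vars and m are illustrative, and ties.method = "first" is chosen so the result is deterministic even if ties did occur:

```r
DF <- data.frame(id = 1:10,
                 x1 = c(2, 4, 5, 3, 6, 4, 3, 6, 7, 7),
                 x2 = c(0, 1, 2, 6, 7, 6, 0, 8, 2, 2),
                 x3 = c(5, 3, 4, 5, 8, 3, 4, 2, 5, 6))

vars <- c("x1", "x2", "x3")
m <- as.matrix(DF[vars])

# Row-wise minimum, computed in one vectorised call.
DF$minX <- do.call(pmin, unname(DF[vars]))

# max.col on the negated matrix gives the column position of each row minimum.
DF$y <- vars[max.col(-m, ties.method = "first")]
```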
