Suppose I would like to create a data frame in R with two variables, like the following two examples from the projpred package.
The first example is:
projpred::df_gaussian
> str(df_gaussian)
'data.frame': 100 obs. of 2 variables:
$ x: num [1:100, 1:20] 0.274 2.245 -0.125 -0.544 -1.459 ...
$ y: num -1.275 1.843 0.459 0.564 1.873 ...
The second example is
projpred::df_binom
> str(df_binom)
'data.frame': 100 obs. of 2 variables:
$ x: num [1:100, 1:30] -0.619 1.094 -0.357 -2.469 0.567 ...
$ y: int 0 1 1 0 1 0 0 0 1 1 ...
Here 'x' is a matrix (100 x 20 in the first example, 100 x 30 in the second) and 'y' is a vector of length 100. When I do something like the following:
> x<- matrix(rnorm(49,0,1),ncol=7,nrow=7)
> x
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7475965 0.25087585 -0.5454202 1.1080362 0.772668671 0.1541041 -0.18822798
[2,] 0.1156593 0.01525141 -1.7016563 -1.4725411 -1.103412611 -0.5244481 -1.35857198
[3,] -2.0756020 0.76945330 2.1603842 -0.7884491 -0.058697197 1.7486573 -0.22084569
[4,] -0.7190079 0.02477635 -0.1113622 0.2430216 -0.002865642 0.8650818 0.01232973
[5,] -0.9197059 0.88796240 -0.7654234 -1.3388553 -1.323093057 -0.6983747 1.20201014
[6,] 1.4298535 0.04451137 1.2678596 0.3640843 -0.046717376 -2.2444299 1.80306550
[7,] -0.3876859 0.62635356 -0.3490285 -0.9496578 1.150150174 0.4247856 -0.97021264
> y<- rnorm(7,5,1)
> y
[1] 6.456016 4.984491 5.209759 7.183303 4.461657 5.005530 5.052837
> z<-data.frame(x,y)
I get the result below, which is not what I want: the matrix is split into seven separate columns.
> z
X1 X2 X3 X4 X5 X6 X7 y
1 0.7475965 0.25087585 -0.5454202 1.1080362 0.772668671 0.1541041 -0.18822798 6.456016
2 0.1156593 0.01525141 -1.7016563 -1.4725411 -1.103412611 -0.5244481 -1.35857198 4.984491
3 -2.0756020 0.76945330 2.1603842 -0.7884491 -0.058697197 1.7486573 -0.22084569 5.209759
4 -0.7190079 0.02477635 -0.1113622 0.2430216 -0.002865642 0.8650818 0.01232973 7.183303
5 -0.9197059 0.88796240 -0.7654234 -1.3388553 -1.323093057 -0.6983747 1.20201014 4.461657
6 1.4298535 0.04451137 1.2678596 0.3640843 -0.046717376 -2.2444299 1.80306550 5.005530
7 -0.3876859 0.62635356 -0.3490285 -0.9496578 1.150150174 0.4247856 -0.97021264 5.052837
> str(z)
'data.frame': 7 obs. of 8 variables:
$ X1: num 0.748 0.116 -2.076 -0.719 -0.92 ...
$ X2: num 0.2509 0.0153 0.7695 0.0248 0.888 ...
$ X3: num -0.545 -1.702 2.16 -0.111 -0.765 ...
$ X4: num 1.108 -1.473 -0.788 0.243 -1.339 ...
$ X5: num 0.77267 -1.10341 -0.0587 -0.00287 -1.32309 ...
$ X6: num 0.154 -0.524 1.749 0.865 -0.698 ...
$ X7: num -0.1882 -1.3586 -0.2208 0.0123 1.202 ...
$ y : num 6.46 4.98 5.21 7.18 4.46 ...
Wrap the matrix in I() ("as is"); otherwise data.frame() converts the matrix into separate columns. This is documented in ?I:
In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.
z <- data.frame(x = I(x), y)
> str(z)
'data.frame': 7 obs. of 2 variables:
$ x: 'AsIs' num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num 5.12 4.58 5.41 4.91 6.43 ...
> is.matrix(z$x)
[1] TRUE
If we need to change the class from "AsIs":
> class(z$x) <- c("matrix", "array")
> str(z)
'data.frame': 7 obs. of 2 variables:
$ x: 'matrix' num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num 5.12 4.58 5.41 4.91 6.43 ...
Another option is a tibble, which keeps the matrix as a single column:
library(tibble)
z1 <- tibble(x, y)
str(z1)
tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
$ x: num [1:7, 1:7] -0.178 -1.37 -0.682 1.166 0.437 ...
$ y: num [1:7] 5.12 4.58 5.41 4.91 6.43 ...
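A self-contained sketch of the I() approach, plus one alternative that is not in the answer above: assigning the matrix with `$<-` after the data frame exists, which base R also stores as a single matrix column, without the "AsIs" class. The object names `z_asis` and `z_assign` are illustrative only.

```r
set.seed(1)
x <- matrix(rnorm(49), nrow = 7, ncol = 7)
y <- rnorm(7, mean = 5, sd = 1)

# Option 1: protect the matrix with I() so data.frame() keeps it as one column
z_asis <- data.frame(x = I(x), y = y)

# Option 2 (assumption: not from the answer above): add the matrix afterwards
z_assign <- data.frame(y = y)
z_assign$x <- x

dim(z_asis$x)          # 7 7
is.matrix(z_assign$x)  # TRUE
```

Both data frames have two variables; the second avoids the extra class(z$x) step if the "AsIs" class is unwanted.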
Related
I've been using R for a while, but lists perplex me.
For some reason in some cases my function outputs a data frame of lists:
str() returns something like:
'data.frame': 4683 obs. of 6 variables:
$ f1:List of 4683
..$ : num -0.196
..$ : num -0.205
..$ : num -0.209
..$ : num -0.218
..$ : num -0.197
..$ : num -0.136
..$ : num -0.22
instead of
'data.frame': 4683 obs. of 6 variables:
$ f1: num -0.197 -0.205 -0.208 -0.218 -0.197 ...
$ f2: num -0.13 -0.139 -0.136 -0.137 -0.126 ...
$ f3: num -0.216 -0.221 -0.214 -0.209 -0.203 ...
$ f4: num 0.00625 -0.04806 -0.04888 -0.02979 -0.03813 ...
$ f5: num -0.15 -0.178 -0.173 -0.207 -0.154 ...
$ f6: num -0.191 -0.224 -0.25 -0.183 -0.209 ...
...
like I'd expect. Is there some simple way to convert df from the first case to the second?
I have tried manually casting the columns to vectors, which not only doesn't work but would also be inefficient.
When we have a data frame like this
df
# 1 1, 2, 3 1, 2, 3
# 2 4, 5, 6 4, 5, 6
where
df |> str()
# 'data.frame': 2 obs. of 2 variables:
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
we can do
r <- do.call(data.frame, df)
r
# X1.3 X4.6 X1.3.1 X4.6.1
# 1 1 4 1 4
# 2 2 5 2 5
# 3 3 6 3 6
where
str(r)
# 'data.frame': 3 obs. of 4 variables:
# $ X1.3 : int 1 2 3
# $ X4.6 : int 4 5 6
# $ X1.3.1: int 1 2 3
# $ X4.6.1: int 4 5 6
Explanation: do.call constructs a data.frame() call, passing the columns of df (which is a "data.frame" as well as a "list") as the ... arguments. Each list column of length 2 expands into 2 columns, so in our df with two such lists we get a resulting data frame with 4 columns.
By the way, you can use Reduce(.) just as well.
Data:
df <- structure(list(list(1:3, 4:6), list(1:3, 4:6)), names = c("",
""), class = "data.frame", row.names = c(NA, -2L))
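For the layout in the question, where every list element holds a single number, a different route may be closer to what was asked: unlist each column in place. This is a minimal sketch under that assumption (one scalar per list element); the small `df_list` below is made up for illustration and is not the asker's data.

```r
# Hypothetical data frame with list columns holding one number per row
df_list <- data.frame(id = 1:3)
df_list$f1 <- list(-0.196, -0.205, -0.209)
df_list$f2 <- list(-0.130, -0.139, -0.136)

# Flatten every column; df_list[] keeps the data.frame structure and names
df_flat <- df_list
df_flat[] <- lapply(df_list, unlist)

str(df_flat)
```

If any list element holds more or fewer than one value, unlist() changes the column length and the assignment fails, which is a useful signal that the data needs a closer look first.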
I am currently turning a for loop into a foreach loop so I can parallelize the calculations of the inner loop. I am able to do that and get the correct output. However, I want the format of the output from foreach to match the format I had in my old for loop. Here's a reproducible example of what I want to do:
FOR LOOP:
x <- c(1:10)
y <- c(5)
z <- c(3)
result_list <- list()
for(i in length(x)){
  for(j in 1:y){
    for(k in 1:z){
      result <- rnorm(n = 250)
      result_list[[paste(i, j, k)]] <- result
    }
  }
}
That's the output I get and what I want to achieve:
library(dplyr)
glimpse(result_list)
List of 15
$ 10 1 1: num [1:250] -0.382 -0.156 -0.747 1.139 -1.824 ...
$ 10 1 2: num [1:250] -0.181 0.526 0.255 -0.369 -1.517 ...
$ 10 1 3: num [1:250] 0.3621 1.3634 -0.0507 -0.3943 -1.2183 ...
$ 10 2 1: num [1:250] -1.453 0.731 2.761 -0.586 -1.631 ...
$ 10 2 2: num [1:250] -0.663 -0.641 -1.852 1.58 0.133 ...
$ 10 2 3: num [1:250] -1.334 0.803 -0.116 0.618 -1.339 ...
$ 10 3 1: num [1:250] 0.158 -1.296 -0.947 -0.515 -0.208 ...
$ 10 3 2: num [1:250] -0.604 1.956 0.127 1.846 -0.549 ...
$ 10 3 3: num [1:250] 0.365 -0.467 -0.589 -1.251 0.491 ...
$ 10 4 1: num [1:250] -1.138 0.883 1 0.729 -1.566 ...
$ 10 4 2: num [1:250] -0.0461 2.3096 -1.5347 0.3722 0.3125 ...
$ 10 4 3: num [1:250] 0.127 -0.728 0.402 1.783 -1.457 ...
$ 10 5 1: num [1:250] 1.855 2.224 1.301 0.166 -0.28 ...
$ 10 5 2: num [1:250] 0.463 -1.011 1.067 -1.305 -0.51 ...
$ 10 5 3: num [1:250] 1.937 0.651 -0.424 -0.714 -0.225 ...
That's the code to the FOREACH LOOP:
library(foreach)
result_list <- list()
resultado_foreach <- foreach::foreach(i = 1:10, .inorder = TRUE) %do% {
  foreach::foreach(j = 1:5, .inorder = TRUE) %do% {
    foreach::foreach((k = 1:3)) %dopar% {
      result <- rnorm(n = 250)
      result_list[[paste(i, j, k)]] <- result
    }
  }
}
However, I get a list of lists of lists (a doubly nested list):
glimpse(resultado_foreach)
List of 10
$ :List of 5
..$ :List of 3
.. ..$ : num [1:250] 0.911 0.594 -0.453 0.651 2.303 ...
.. ..$ : num [1:250] -0.664 -0.696 0.741 -2.78 -0.992 ...
.. ..$ : num [1:250] -0.6877 -1.1266 -1.7784 0.473 0.0185 ...
..$ :List of 3
.. ..$ : num [1:250] 1.273 0.129 0.902 2.47 0.177 ...
.. ..$ : num [1:250] 0.705 0.519 1.219 -1.682 -0.355 ...
.. ..$ : num [1:250] 1.138 0.422 -1.025 -0.237 0.418 ...
..$ :List of 3
.. ..$ : num [1:250] -1.636 -1.297 -1.115 -0.138 0.174 ...
.. ..$ : num [1:250] 0.56 -1.311 0.641 0.861 -0.601 ...
.. ..$ : num [1:250] 0.198 -1.197 0.781 -0.571 -0.141 ...
..$ :List of 3
.. ..$ : num [1:250] -0.355 -0.649 -1.046 -0.717 -0.97 ...
.. ..$ : num [1:250] -1.086 0.912 -0.996 0.303 1.418 ...
.. ..$ : num [1:250] 0.8827 -0.0761 1.3504 -0.5301 0.2267 ...
..$ :List of 3
.. ..$ : num [1:250] -1.826 1.286 1.585 -0.359 -0.955 ...
.. ..$ : num [1:250] -0.588 -0.314 -0.223 -0.779 0.569 ...
.. ..$ : num [1:250] 1.047 -0.242 -0.345 0.27 -0.158 ...
The foreach output is much longer than what I have put here.
I have already tried many combinations and passed many functions to the .combine argument of foreach. So, how can I obtain the output in the same format as the for loop?
Currently in your for loop you have
for(i in length(x)) {...}
but length(x) is 10 so you are just doing for(i in 10) which only loops once. That's not the same as for (i in 1:10). A safer alternative is for (i in seq_along(x)).
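The difference is easy to check in base R. The sketch below records which values of i each loop visits; the empty-vector case is included because it is the usual reason seq_along() is preferred over 1:length(x). Variable names are illustrative.

```r
x <- 1:10

# for (i in length(x)) iterates over the single value 10
visited_len <- integer(0)
for (i in length(x)) visited_len <- c(visited_len, i)

# for (i in seq_along(x)) iterates over 1, 2, ..., 10
visited_seq <- integer(0)
for (i in seq_along(x)) visited_seq <- c(visited_seq, i)

# With an empty vector, 1:length(empty) is c(1, 0), so the body runs twice
empty <- integer(0)
n_colon <- 0
for (i in 1:length(empty)) n_colon <- n_colon + 1
n_seq <- 0
for (i in seq_along(empty)) n_seq <- n_seq + 1

visited_len  # 10
n_colon      # 2
n_seq        # 0
```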
So let's say that you expect 10 outer loops, each with 5 middle loops, and each with 3 inner loops. That should be 10x5x3 = 150 total results.
If you want to use nested loops with foreach, you should use the %:% operator. You'll also want to use .combine=c to concatenate the inner loops into a single list at the end. These options are discussed on the ?foreach help page.
A better version would be
resultado_foreach <-
  foreach::foreach(i = 1:10, .inorder = TRUE, .combine=c) %:%
  foreach::foreach(j = 1:5, .inorder = TRUE, .combine=c) %:%
  foreach::foreach(k = 1:3) %dopar% {
    rnorm(n = 250)
  }
This will return a list of length 150, where each element is a vector of length 250.
Note that you shouldn't modify global variables when using foreach. Each block should return a value and foreach will collect and combine those values for you.
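The flat list has no names, whereas the for loop used paste(i, j, k) as names. One way to add them afterwards is sketched below, assuming the results come back in loop order (i outermost, k innermost), as they should with .inorder = TRUE. To keep the snippet runnable without a parallel backend, `flat` is a stand-in built with replicate() rather than the real foreach output.

```r
# Stand-in for the flat list of 150 vectors returned by the nested foreach
set.seed(1)
flat <- replicate(150, rnorm(250), simplify = FALSE)

# expand.grid varies its first argument fastest, so list k first, then j, then i
grid <- expand.grid(k = 1:3, j = 1:5, i = 1:10)
names(flat) <- paste(grid$i, grid$j, grid$k)

head(names(flat), 4)  # "1 1 1" "1 1 2" "1 1 3" "1 2 1"
```

With the real output, replace the stand-in with resultado_foreach and keep the two naming lines.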
Below is the code I have now
set.seed(20)
test_list <- list("1" = matrix(rnorm(100), 10, 10),
"2" = NA,
"3" = NA,
"4" = NA,
"5" = NA,
"6" = matrix(rnorm(100), 10, 10),
"7" = NA,
"8" = NA)
I would like to find a way to fill each list element that contains an NA with the contents of the most recent element that does not, so that every list element is filled in. Elements 1-5 should contain the matrix in element 1, and elements 6-8 should contain the matrix in element 6. I can set up this problem without using NAs as the placeholder elements (if using NULL or something similar helps the solution).
Thank you in advance for any advice.
is.na can handle "list"s in exactly the way needed here: it returns TRUE only for an element that is a single NA:
is.na(test_list)
# 1 2 3 4 5 6 7 8
#FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
This behaviour of is.na also makes testing on "list"s efficient, as it avoids checking any elements with (length != 1).
Building NA "last observation carried forward" (LOCF) functionality on top of this:
cummax((!is.na(test_list)) * seq_along(test_list))
#1 2 3 4 5 6 7 8
#1 1 1 1 1 6 6 6
we subset test_list:
test_list[cummax((!is.na(test_list)) * seq_along(test_list))]
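Put together as a runnable sketch: the index vector picks, for each position, the last non-NA position seen so far. Because subsetting with repeated indices also repeats the names, the original names are restored at the end (a step not shown above). The names `idx` and `filled` are illustrative.

```r
set.seed(20)
test_list <- list("1" = matrix(rnorm(100), 10, 10),
                  "2" = NA, "3" = NA, "4" = NA, "5" = NA,
                  "6" = matrix(rnorm(100), 10, 10),
                  "7" = NA, "8" = NA)

# Position of the last non-NA element at or before each position
idx <- cummax((!is.na(test_list)) * seq_along(test_list))
idx  # 1 1 1 1 1 6 6 6

# Fill down, then restore the original names
filled <- test_list[idx]
names(filled) <- names(test_list)

# Note: a leading NA would give idx 0 there, and a 0 index drops the
# element, so this assumes the first element is not NA.
```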
You can use the Reduce function with accumulate = TRUE (may not work well on very big data):
str(test_list)
#List of 8
# $ 1: num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ 2: logi NA
# $ 3: logi NA
# $ 4: logi NA
# $ 5: logi NA
# $ 6: num [1:10, 1:10] 0.548 -0.226 1.217 0.701 0.912 ...
# $ 7: logi NA
# $ 8: logi NA
fill_list <- Reduce(function(x, y) if(all(is.na(y))) x else y, test_list, accumulate = TRUE)
str(fill_list)
#List of 8
# $ : num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ : num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ : num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ : num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ : num [1:10, 1:10] 1.163 -0.586 1.785 -1.333 -0.447 ...
# $ : num [1:10, 1:10] 0.548 -0.226 1.217 0.701 0.912 ...
# $ : num [1:10, 1:10] 0.548 -0.226 1.217 0.701 0.912 ...
# $ : num [1:10, 1:10] 0.548 -0.226 1.217 0.701 0.912 ...
I have the following data frame.
> str(df)
'data.frame': 98444 obs. of 25 variables:
$ count : int 361 362 363 364 365 366 367 368 369 370 ...
$ time : num 3.01 3.02 3.02 3.03 3.04 ...
$ H_Rx : num -164 -164 -164 -164 -164 ...
$ H_Ry : num -10.7 -10.7 -10.7 -10.7 -10.7 ...
$ H_Rz : num -174 -174 -174 -174 -174 ...
$ H_Tx : num -0.00137 -0.00137 -0.00136 -0.00135 -0.00134 ...
$ H_Ty : num 1.67 1.67 1.67 1.67 1.67 ...
$ H_Tz : num -0.194 -0.194 -0.194 -0.194 -0.194 ...
$ C_Rx : num -13.4 -13.4 -13.5 -13.5 -13.6 ...
$ C_Ry : num -14.7 -14.6 -14.5 -14.4 -14.4 ...
$ C_Rz : num 7.7 7.69 7.69 7.68 7.67 ...
$ C_Tx : num 0.00914 0.00914 0.00914 0.00914 0.00914 ...
$ C_Ty : num 1.21 1.21 1.21 1.21 1.21 ...
$ C_Tz : num -0.0466 -0.0466 -0.0466 -0.0465 -0.0465 ...
$ D_Rx : num -32.6 -32.6 -32.6 -32.6 -32.6 ...
$ D_Ry : num -49 -49 -49 -49 -49 ...
$ D_Rz : num 1.91 1.91 1.91 1.92 1.92 ...
$ D_Tx : num -0.0403 -0.0403 -0.0403 -0.0402 -0.0402 ...
$ D_Ty : num 1.63 1.63 1.63 1.63 1.63 ...
$ D_Tz : num 0.0214 0.0214 0.0214 0.0214 0.0215 ...
$ part : chr "P2" "P2" "P2" "P2" ...
$ freq : chr "100Hz" "100Hz" "100Hz" "100Hz" ...
$ device : chr "A1" "A1" "A1" "A1" ...
$ act : chr "Nod" "Nod" "Nod" "Nod" ...
$ trial : chr "Rest" "Rest" "Rest" "Rest" ...
- attr(*, "na.action")=Class 'omit' Named int [1:133] 469 470 471 472 473 474 475 476 477 478 ...
.. ..- attr(*, "names")= chr [1:133] "469" "470" "471" "472" ...
And I also have a list of matrices.
> str(listofmatrix)
List of 98444
$ : num [1:4, 1] 0.0807 0.0165 -0.2062 1
$ : num [1:4, 1] 0.0807 0.0165 -0.2062 1
[list output truncated]
I extracted the first three elements from each matrix in listofmatrix, placing them in new columns in df, using a for loop:
for (i in 1:nrow(df)) {
df$D_Txnew[i] <- listofmatrix[[i]][1, 1]
df$D_Tynew[i] <- listofmatrix[[i]][2, 1]
df$D_Tznew[i] <- listofmatrix[[i]][3, 1]
}
It worked as intended, but the processing speed was less than desirable.
What are the different approaches to speed things up?
Instead of assigning row by row, one option is to extract the first three elements from each matrix in 'listofmatrix' (each has only a single column), which returns a list of vectors, rbind them, and assign the result to new columns in the data frame (here 'df1', a copy of 'df').
df1[paste0("D_T", c("xnew", "ynew", "znew"))] <- do.call(rbind,
lapply(listofmatrix, `[`, 1:3))
Comparing with the result of running the OP's code on 'df':
identical(df, df1)
#[1] TRUE
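A variation on the same idea, offered as a sketch rather than as part of the answer above: vapply() declares that each extraction returns exactly three numbers, so a malformed matrix raises an error instead of silently producing a ragged result. The names `vals` and `new_cols` are illustrative.

```r
set.seed(24)
listofmatrix <- lapply(1:5, function(i) matrix(rnorm(4), ncol = 1))
df <- data.frame(count = 1:5, act = LETTERS[1:5])

# vapply returns a 3 x n matrix (one column per list element); t() makes it n x 3
vals <- t(vapply(listofmatrix, function(m) m[1:3, 1], numeric(3)))

# Assign the three matrix columns to three new data frame columns
new_cols <- paste0("D_T", c("xnew", "ynew", "znew"))
df[new_cols] <- vals

str(df)
```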
Benchmarks
Here are some benchmarks on a slightly bigger dataset
set.seed(142)
listofmatrix <- lapply(1:1e4, function(i) matrix(rnorm(4), ncol=1))
df <- data.frame(count = 1:1e4, act= sample(LETTERS, 1e4, replace=TRUE))
df1 <- df
system.time({
for (i in 1:nrow(df)) {
df$D_Txnew[i] <- listofmatrix[[i]][1, 1]
df$D_Tynew[i] <- listofmatrix[[i]][2, 1]
df$D_Tznew[i] <- listofmatrix[[i]][3, 1]
}
})
#user system elapsed
# 1.94 0.00 1.94
system.time({
df1[paste0("D_T", c("xnew", "ynew", "znew"))] <- do.call(rbind,
lapply(listofmatrix, `[`, 1:3))
})
# user system elapsed
# 0.02 0.00 0.02
data
set.seed(24)
listofmatrix <- lapply(1:5, function(i) matrix(rnorm(4), ncol=1))
df <- data.frame(count = 1:5, act= LETTERS[1:5])
df1 <- df
I have a big .csv data set with columns $B1 through $B34. They are all numeric, which is fine, but I would like the last column, DEC, to be a factor. Its values consist of only 1 and 0.
How can I make the last column a factor?
mydata<-read.csv(file.choose(),header=T)
str(mydata)
'data.frame': 1024 obs. of 35 variables:
$ B1 : num 90.8 113.2 100.4 144.5 131.6 ...
$ B2 : num 0.133 0.139 0.144 0.147 0.141 ...
-----------
-----------
$ B32 : num 0.216 0.27 0.309 0.259 0.304 ...
$ B33 : num 0.526 0.407 0.286 0.129 0.37 ...
$ B34 : num 4.33 5.61 4.81 7.32 6.83 ...
$ DEC : int 1 1 1 1 1 1 1 1 1 1 ...
You can use the as.factor() function to convert any column to factor. For example:
mydata <- read.csv("data.csv")        # Read the data
mydata$DEC <- as.factor(mydata$DEC)   # Convert the column to a factor
class(mydata$DEC)                     # Check that it worked
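A small runnable sketch of the same conversion, using inline text in place of the real file (the two-column data below is made up for illustration). It also shows an alternative that is not in the answer above: asking read.csv to read the column as a factor directly through a named colClasses entry.

```r
csv_text <- "B1,DEC\n90.8,1\n113.2,0\n100.4,1"

# Convert after reading
mydata <- read.csv(text = csv_text)
mydata$DEC <- as.factor(mydata$DEC)

# Or declare the class while reading (named colClasses match column names)
mydata2 <- read.csv(text = csv_text, colClasses = c(DEC = "factor"))

class(mydata$DEC)    # "factor"
levels(mydata$DEC)   # "0" "1"
```

factor(mydata$DEC, levels = c(0, 1), labels = c("no", "yes")) is another option when readable level names are wanted; the labels here are placeholders.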