I've been using R for a while, but lists perplex me.
For some reason, my function sometimes outputs a data frame whose columns are lists.
str() returns something like:
'data.frame': 4683 obs. of 6 variables:
$ f1:List of 4683
..$ : num -0.196
..$ : num -0.205
..$ : num -0.209
..$ : num -0.218
..$ : num -0.197
..$ : num -0.136
..$ : num -0.22
instead of
'data.frame': 4683 obs. of 6 variables:
$ f1: num -0.197 -0.205 -0.208 -0.218 -0.197 ...
$ f2: num -0.13 -0.139 -0.136 -0.137 -0.126 ...
$ f3: num -0.216 -0.221 -0.214 -0.209 -0.203 ...
$ f4: num 0.00625 -0.04806 -0.04888 -0.02979 -0.03813 ...
$ f5: num -0.15 -0.178 -0.173 -0.207 -0.154 ...
$ f6: num -0.191 -0.224 -0.25 -0.183 -0.209 ...
...
like I'd expect. Is there some simple way to convert df from the first case to the second?
I have tried manually casting the columns to vectors, which not only doesn't work but also wouldn't be very efficient.
When we have a data frame like this
df
# 1 1, 2, 3 1, 2, 3
# 2 4, 5, 6 4, 5, 6
where
df |> str()
# 'data.frame': 2 obs. of 2 variables:
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
we can do
r <- do.call(data.frame, df)
r
# X1.3 X4.6 X1.3.1 X4.6.1
# 1 1 4 1 4
# 2 2 5 2 5
# 3 3 6 3 6
where
str(r)
# 'data.frame': 3 obs. of 4 variables:
# $ X1.3 : int 1 2 3
# $ X4.6 : int 4 5 6
# $ X1.3.1: int 1 2 3
# $ X4.6.1: int 4 5 6
Explanation: do.call constructs a data.frame() call with df (which is a "data.frame" as well as a "list") as ... arguments. So in our df with two lists of length 2, we get two data frames with 2 columns, i.e. a resulting data frame with 4 columns in this case.
By the way, you can use Reduce(.) just as well.
Data:
df <- structure(list(list(1:3, 4:6), list(1:3, 4:6)), names = c("",
""), class = "data.frame", row.names = c(NA, -2L))
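When every cell of a list column holds exactly one value, as in the str() output in the question, flattening each list column with unlist() is a more direct route. The following is a minimal sketch under that assumption; the small data frame and its column names are made up for illustration.

```r
# Hypothetical data frame with list columns, one number per cell
df <- data.frame(id = 1:3)
df$f1 <- list(-0.196, -0.205, -0.209)
df$f2 <- list(-0.13, -0.139, -0.136)

# Flatten only the list columns; other columns are left untouched
is_list_col <- vapply(df, is.list, logical(1))
df[is_list_col] <- lapply(df[is_list_col], unlist)

str(df)
```

This only keeps rows aligned if each cell has length 1; cells holding longer vectors or NULL would change the length of the unlisted column.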
I have a list of data.frame objects that is the output of a function I implemented. I would like to build new lists in which the corresponding data.frame objects from the different lists are put together. I tried several ways to get my expected output, but none of them was very elegant. Does anyone know a useful trick for doing this manipulation efficiently? Is there an elegant solution for this task?
This is a mini example:
savedList <- list(
foo_saved = data.frame(v1=c(1,6,16), v2=c(4,12,23)),
bar_saved = data.frame(v1=c(7,19,31), v2=c(16,28,41)),
cat_saved = data.frame(v1=c(5,13,26), v2=c(11,21,42))
)
dropedList <- list(
foo_droped = data.frame(v1=c(4,9,20), v2=c(7,15,29)),
bar_droped = data.frame(v1=c(14,26,35), v2=c(21,30,47)),
cat_droped = data.frame(v1=c(18,29,39), v2=c(25,36,48))
)
This is my expected output:
foo <- list(
foo_saved = data.frame(v1=c(1,6,16), v2=c(4,12,23)),
foo_droped = data.frame(v1=c(4,9,20), v2=c(7,15,29))
)
bar <- list(
bar_saved = data.frame(v1=c(7,19,31), v2=c(16,28,41)),
bar_droped = data.frame(v1=c(14,26,35), v2=c(21,30,47))
)
cat <- list(
cat_saved = data.frame(v1=c(5,13,26), v2=c(11,21,42)),
cat_droped = data.frame(v1=c(18,29,39), v2=c(25,36,48))
)
I tried some existing solutions, but I am not satisfied with them. How can I get my desired output easily? Is there an efficient, compatible solution for this? Thanks a lot.
You could combine the two lists, then split on the common part of the names. split() is not the most efficient function ever, but the code for this is very simple.
x <- c(savedList, dropedList)
split(x, sub("_.*", "", names(x)))
This gives the following:
List of 3
$ bar:List of 2
..$ bar_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 7 19 31
.. ..$ v2: num [1:3] 16 28 41
..$ bar_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 14 26 35
.. ..$ v2: num [1:3] 21 30 47
$ cat:List of 2
..$ cat_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 5 13 26
.. ..$ v2: num [1:3] 11 21 42
..$ cat_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 18 29 39
.. ..$ v2: num [1:3] 25 36 48
$ foo:List of 2
..$ foo_saved :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 1 6 16
.. ..$ v2: num [1:3] 4 12 23
..$ foo_droped:'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 4 9 20
.. ..$ v2: num [1:3] 7 15 29
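The expected output in the question shows separate objects foo, bar and cat rather than one nested list. The sketch below shows one way to get from the split() result to named objects; it uses a shortened version of the example data and a dedicated environment (an assumption made here so that nothing in the workspace, such as base R's cat(), gets masked).

```r
# Shortened stand-in for the example lists from the question
savedList <- list(foo_saved = data.frame(v1 = c(1, 6)),
                  bar_saved = data.frame(v1 = c(7, 19)))
dropedList <- list(foo_droped = data.frame(v1 = c(4, 9)),
                   bar_droped = data.frame(v1 = c(14, 26)))

x <- c(savedList, dropedList)
grouped <- split(x, sub("_.*", "", names(x)))

# Individual groups can be pulled out by name
grouped$foo

# Or turned into separate objects inside their own environment
env <- new.env()
invisible(list2env(grouped, envir = env))
ls(env)
```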
You can use mapply for this; it will iterate through both lists in parallel and make a list from each pair of items:
res <- mapply(list, savedList, dropedList, SIMPLIFY = FALSE)
str(res)
List of 3
$ foo_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 1 6 16
.. ..$ v2: num [1:3] 4 12 23
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 4 9 20
.. ..$ v2: num [1:3] 7 15 29
$ bar_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 7 19 31
.. ..$ v2: num [1:3] 16 28 41
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 14 26 35
.. ..$ v2: num [1:3] 21 30 47
$ cat_saved:List of 2
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 5 13 26
.. ..$ v2: num [1:3] 11 21 42
..$ :'data.frame': 3 obs. of 2 variables:
.. ..$ v1: num [1:3] 18 29 39
.. ..$ v2: num [1:3] 25 36 48
My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance
Here is one option to standardize only the numeric columns:
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
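If only certain columns should be standardised by position, as with columns 8 to 12 in the question, the same idea can be restricted to those columns. A minimal sketch with a made-up three-column data frame standing in for my.data; as.numeric() is added here so that each result is a plain vector rather than the one-column matrix that scale() returns.

```r
# Hypothetical stand-in for the asker's data
my.data <- data.frame(site  = factor(c("a", "b", "c", "d")),
                      flow  = c(10, 12, 9, 15),
                      depth = c(1.2, 0.8, 1.5, 1.1))

cols <- c(2, 3)  # the question would use 8:12 here

scaled <- my.data
scaled[cols] <- lapply(scaled[cols], function(x) as.numeric(scale(x)))

str(scaled)
```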
You can use the dplyr package to do this:
library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)
Here are some options to consider, although it is answered late:
# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
# memory.limit(size = 64935)  # Windows-only; a no-op stub since R 4.2.0
# Set working directory
setwd("path")
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
"Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
"Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
"Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
"Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
"Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
stringsAsFactors = TRUE)  # needed in R >= 4.0 to get the factor columns shown below
Let us check the structure of df:
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).
Let us scale just the numeric variables using only base R:
1) Option: (slight modification of what akrun has proposed here)
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
2) Option: (akrun's approach)
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
3) Option:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
4) Option (using tidyverse and invoking dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
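The timings above come from single Sys.time() differences, which are noisy at this scale (a few hundredths of a second on ten rows). A rough sketch of a steadier comparison in base R is to repeat each approach many times inside system.time(); the helper names, the small data frame and the repetition count below are assumptions for illustration only.

```r
# Small stand-in data frame
df <- data.frame(Age    = c(21, 19, 25, 34, 45),
                 Gender = factor(c("F", "F", "M", "F", "M")),
                 Height = c(172, 166, 191, 169, 179))

# Manual standardisation of the numeric columns (as in option 1)
scale_manual <- function(d) {
  d[] <- lapply(d, function(x) if (is.numeric(x)) (x - mean(x)) / sd(x) else x)
  d
}

# scale() applied to the numeric columns only (as in option 3)
scale_builtin <- function(d) {
  num <- vapply(d, is.numeric, logical(1))
  d[num] <- lapply(d[num], scale)
  d
}

n_rep <- 200
t_manual  <- system.time(for (i in seq_len(n_rep)) scale_manual(df))
t_builtin <- system.time(for (i in seq_len(n_rep)) scale_builtin(df))

# Average elapsed seconds per call
t_manual["elapsed"] / n_rep
t_builtin["elapsed"] / n_rep
```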
Which option to pick depends on the output structure you need and on speed. Suppose your data is unbalanced and you want to balance it before classification, after scaling the numeric variables. In that case the one-column matrix structure of the scaled variables (Age, Salary, Height and Weight) will cause problems. For example:
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
The ROSE package (which balances data), for example, does not accept column types other than int, factor and num, so it will throw an error.
To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:
library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ as.vector(scale(.)))
end_time4 <- Sys.time()
end_time4 - start_time4
which gives
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
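If a data frame has already been scaled with plain scale(), as in options 3 and 4, the matrix columns can also be detected and flattened afterwards in base R. A minimal sketch on a small made-up data frame:

```r
# Small stand-in data frame, scaled the same way as in option 3
df <- data.frame(Age    = c(21, 19, 25, 34, 45),
                 Gender = factor(c("F", "F", "M", "F", "M")),
                 Height = c(172, 166, 191, 169, 179))
num <- vapply(df, is.numeric, logical(1))
df3 <- df
df3[num] <- lapply(df3[num], scale)

# Which columns ended up as matrices?
is_mat <- vapply(df3, is.matrix, logical(1))
is_mat

# as.vector() drops the dim and the "scaled:*" attributes
df3[is_mat] <- lapply(df3[is_mat], as.vector)

str(df3)
```

Note that this discards the "scaled:center" and "scaled:scale" attributes, so save them first if the values need to be transformed back later.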
I have the output from a coxph function, which is estimated by strata. I would like to transform this output from a list into a data frame. The code I ran for coxph is below:
k <- coxph(Surv(cum.goodp, dlq.next) ~ rpc.length + cluster(itemcode) + strata(sector), data = nr.sample)
m <- summary(survfit(k))
There are twenty different strata used to estimate the coxph. Here is the structure of the list
List of 16
$ n : int [1:20] 870 843 2278 603 6687 8618 15155 920 2598 654 ...
$ time : num [1:870] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:870] 870 592 448 361 320 286 232 214 196 186 ...
$ n.event : num [1:870] 246 126 77 34 33 25 18 18 8 6 ...
$ n.censor : num [1:870] 32 18 10 7 1 29 0 0 2 0 ...
$ strata : Factor w/ 20 levels "sector=11","sector=21",..: 1 1 1 1 1 1 1 1 1 1 ...
$ surv : num [1:870] 0.725 0.571 0.471 0.425 0.379 ...
$ type : chr "right"
$ cumhaz : num [1:870] 0.322 0.561 0.754 0.856 0.971 ...
$ std.err : num [1:870] 0.015 0.017 0.0174 0.0174 0.0173 ...
$ upper : num [1:870] 0.755 0.605 0.506 0.46 0.414 ...
$ lower : num [1:870] 0.696 0.538 0.438 0.392 0.347 ...
$ conf.type: chr "log"
$ conf.int : num 0.95
$ call : language survfit(formula = k)
$ table : num [1:20, 1:7] 870 843 2278 603 6687 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20] "sector=11" "sector=21" "sector=22" "sector=23" ...
.. ..$ : chr [1:7] "records" "n.max" "n.start" "events" ...
- attr(*, "class")= chr "summary.survfit"
I have done this before, but without strata. When I did not have strata I used the following approach:
col <- lapply(c(1 : 7), function(x) m[x])
tbl <- do.call(data.frame, col)
However, when I try that approach here, I get the familiar error:
cannot coerce class "c("survfit.cox", "survfit")" to a data.frame
All columns have the same name, but they are of different length. If possible, I would like to add a column to the final data frame that contains the particular strata that the results are for. Is there a way to do this? It doesn't have to be in base R. Any help would be much appreciated. Thanks so much.
This problem can be solved via the tidy function in the broom package. For the example above, the code is:
library(broom)
n <- survfit(k)
df <- tidy(n)
The tidy function produces a data frame with a variable "strata". It does not, however, provide the median and mean, but they can be estimated from the data frame df if one were so inclined. If the survfit object has multiple strata, glance() cannot provide the median or mean either.
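Without broom, a similar data frame can be assembled in base R from the components of summary(survfit(...)) that share one length. The following is a minimal sketch, not the asker's model: it assumes the lung data shipped with the survival package as a stand-in for nr.sample and strata(sex) as a stand-in for strata(sector). The component names are the ones shown in the str() output in the question.

```r
library(survival)

# Stand-in model: lung and strata(sex) replace the asker's nr.sample and strata(sector)
fit <- coxph(Surv(time, status) ~ age + strata(sex), data = lung)
s <- summary(survfit(fit))

# Keep only the per-time-point components; intersect() guards against
# components that a given survival version might not return
cols <- intersect(c("time", "n.risk", "n.event", "n.censor", "surv",
                    "std.err", "lower", "upper"), names(s))
tbl <- data.frame(strata = s$strata, unclass(s)[cols])

head(tbl)

# Per-stratum summaries are stored in the `table` component
strata_tbl <- as.data.frame(s$table)
strata_tbl
```

The scalar components (type, conf.type, conf.int, call) are left out on purpose, since they describe the whole fit rather than individual time points.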