Trying to use dplyr to group_by and apply scale() - r

Trying to use dplyr to group_by the stud_ID variable in the following data frame, as in this SO question:
> str(df)
'data.frame': 4136 obs. of 4 variables:
$ stud_ID : chr "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
$ behavioral_scale: num 3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
$ cognitive_scale : num 3.5 3 3 3 3.5 2 NA NA 1 1 ...
$ affective_scale : num 2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...
I tried the following to obtain scale scores by student (rather than scale scores for observations across all students):
scaled_data <-
df %>%
group_by(stud_ID) %>%
mutate(behavioral_scale_ind = scale(behavioral_scale),
cognitive_scale_ind = scale(cognitive_scale),
affective_scale_ind = scale(affective_scale))
Here is the result:
> str(scaled_data)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4136 obs. of 7 variables:
$ stud_ID : chr "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
$ behavioral_scale : num 3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
$ cognitive_scale : num 3.5 3 3 3 3.5 2 NA NA 1 1 ...
$ affective_scale : num 2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...
$ behavioral_scale_ind: num [1:12, 1] 0.64 1.174 0.64 0.107 0.64 ...
..- attr(*, "scaled:center")= num 2.9
..- attr(*, "scaled:scale")= num 0.937
$ cognitive_scale_ind : num [1:12, 1] 1.17 0.64 0.64 0.64 1.17 ...
..- attr(*, "scaled:center")= num 2.4
..- attr(*, "scaled:scale")= num 0.937
$ affective_scale_ind : num [1:12, 1] 0 1.28 0.64 0.64 0 ...
..- attr(*, "scaled:center")= num 2.5
..- attr(*, "scaled:scale")= num 0.782
The three scaled variables (behavioral_scale, cognitive_scale, and affective_scale) have only 12 observations - the same number of observations for the first student, ABB112292.
What's going on here? How can I obtain scaled scores by individual?

The problem seems to be in the base scale() function, which expects a matrix. Try writing your own.
scale_this <- function(x){
(x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}
Then this works:
library("dplyr")
# reproducible sample data
set.seed(123)
n = 1000
df <- data.frame(stud_ID = sample(LETTERS, size=n, replace=TRUE),
behavioral_scale = runif(n, 0, 10),
cognitive_scale = runif(n, 1, 20),
affective_scale = runif(n, 0, 1) )
scaled_data <-
df %>%
group_by(stud_ID) %>%
mutate(behavioral_scale_ind = scale_this(behavioral_scale),
cognitive_scale_ind = scale_this(cognitive_scale),
affective_scale_ind = scale_this(affective_scale))
Or, if you're open to a data.table solution:
library("data.table")
setDT(df)
cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")
df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)]

This was a known problem in dplyr, a fix has been merged to the development version, which you can install via
# install.packages("devtools")
devtools::install_github("hadley/dplyr")
In the stable version, the following should work, too:
scale_this <- function(x) as.vector(scale(x))

df <- df %>% mutate(across(is.numeric, ~ as.numeric(scale(.))))

Related

How to flatten nested list keeping inner structure

Let's say I have the following list (result), which is a nested list that captures information from a model: parameters (betas) and their standard errors (sd), additionally with some information regarding the global model (method) and the number of observations (n).
I want to flatten the lists betas and sd while distinguishing where each value of x1 and x2 comes from (i.e. if they are from betas or sd).
Please gently consider the following example:
result<- list(n = 100,
method = "tree",
betas = list(x1 = 1.47,
x2 = -2.85),
sd = list(x1 = 0.55,
x2 = 0.25))
str(result)
# List of 4
# $ n : num 100
# $ iterations: num 50
# $ betas :List of 2
# ..$ x1: num 1.47
# ..$ x2: num -2.85
# $ sd :List of 2
# ..$ x1: num 0.55
# ..$ x2: num 0.25
First attempt: flatten(). [Spoiler(!): I lose the precedence of each value]
## I can't distinguish between betas and sd.
flatten(result)
# $n
# [1] 100
#
# $iterations
# [1] 50
#
# $x1
# [1] 1.47
#
# $x2
# [1] -2.85
#
# $x1
# [1] 0.55
#
# $x2
# [1] 0.25
Second attempt: unlist(). [Spoiler(!), I need a list, not an atomic vector]
#I need a list
unlist(result)
# n iterations betas.x1 betas.x2 sd.x1 sd.x2
# 100.00 50.00 1.47 -2.85 0.55 0.25
Desired Output.
list(n = 100,
method = "tree",
betas.x1 = 1.47,
betas.x2 = -2.85,
sd.x1 = 0.55,
sd.x2 = 0.25)
# List of 6
# $ n : num 100
# $ method : chr "tree"
# $ betas.x1: num 1.47
# $ betas.x2: num -2.85
# $ sd.x1 : num 0.55
# $ sd.x2 : num 0.25
as.data.frame will flatten for you. From ?as.data.frame:
Arrays with more than two dimensions are converted to matrices by
'flattening' all dimensions after the first and creating suitable
column labels.
Which does a poor job of explaining that it operates on nested lists as well, not just arrays. (In other words, I think the docs do not discuss this feature on non-arrays.)
str(as.data.frame(result))
# 'data.frame': 1 obs. of 6 variables:
# $ n : num 100
# $ method : chr "tree"
# $ betas.x1: num 1.47
# $ betas.x2: num -2.85
# $ sd.x1 : num 0.55
# $ sd.x2 : num 0.25
If you don't want/need a list, just as.list it next:
str(as.list(as.data.frame(result)))
# List of 6
# $ n : num 100
# $ method : chr "tree"
# $ betas.x1: num 1.47
# $ betas.x2: num -2.85
# $ sd.x1 : num 0.55
# $ sd.x2 : num 0.25

R - Aggregate function creating sub-lists

I'm using the aggregate function to summarise some data. The data is loans data, I have the ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of Loans and Average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I lookat the results of the aggregate function, it looks ok:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at the strucutre of it, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you
Just do do.call(data.frame, ...) on aggr.data to unnest the matrices.
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2

Subsetting and replacing in a list variable nested in a dataframe

Here is my dataframe example. It includes a column variable, named "dta" which is a single list of n values I want to keep for each of my scenario:
set.seed(777)
df <- data.frame(theo = numeric(),
size = numeric(),
dta = I(list()))
df[ 1: 5,"theo"] <- qlnorm(0.1, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 6:10,"theo"] <- qlnorm(0.2, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 1: 5,"size"] <- 10
df[ 6:10,"size"] <- 20
for(i in 1:10){
df$dta[i] <- list(rlnorm(df$size[i], meanlog = 0, sdlog = 1))
}
df
str(df)
This should give a df like:
theo size dta
1 0.2776062 10 1.631967....
2 0.2776062 10 0.737667....
3 0.2776062 10 0.131252....
4 0.2776062 10 1.937334....
5 0.2776062 10 0.739868....
6 0.4310112 20 4.631176....
7 0.4310112 20 2.610180....
8 0.4310112 20 0.175918....
9 0.4310112 20 3.501670....
10 0.4310112 20 0.588178....
or:
'data.frame': 10 obs. of 4 variables:
$ theo: num 0.278 0.278 0.278 0.278 0.278 ...
$ size: num 10 10 10 10 10 20 20 20 20 20
$ dta :List of 10
..$ : num 1.632 0.671 1.667 0.671 5.148 ...
..$ : num 0.738 1.056 0.152 0.967 10.089 ...
..$ : num 0.131 1.256 0.457 3.574 4.211 ...
..$ : num 1.937 2.359 3.496 0.297 4.587 ...
..$ : num 0.74 0.66 0.481 0.434 1.874 ...
..$ : num 4.631 0.298 10.28 0.933 1.286 ...
..$ : num 2.61 0.472 0.251 1.61 0.303 ...
..$ : num 0.176 0.566 2.156 0.407 3.52 ...
..$ : num 3.502 1.748 1.283 0.648 1.359 ...
..$ : num 0.588 0.392 2.447 1.926 0.86 ...
..- attr(*, "class")= chr "AsIs"
Now, I want to subset that list in such a way that:
for each list, each value is compared with the fixed value "theo" stored in the dataframe
when that value is below or equal to "theo", then recode that value NA
Here is a working code and gives me exactly what I want:
df$dta2 <- df$dta
for(i in 1:10){
df$dta2[[i]] [ df$dta2[[i]] <= df$theo[i] ] <- NA
}
However I was wondering is there is a way to get the same result with a single line of code and no "for loop" to proceed with a conditional replacement of values contained in a list which is nested in a dataframe?
We can use Map
df$dta3 <- Map(function(x,y) replace(x, x<=y, NA), df$dta, df$theo)
all.equal(df$dta2, df$dta3, check.attributes=FALSE)
#[1] TRUE

R: perform parameter sweep and collect results in long data frame

I am looking the right R idiom to run a function over a set of parameters and create a long data frame from the results. Imagine that you have the following toy function:
fun <- function(sd, mean, foobar = "foobar") {
list(random = rnorm(10) * sd + mean + 1:10, foobar = foobar)
}
Now you want to run fun over different values of sd and mean:
par_sd <- rep(1:5, 3)
par_mean <- rep(0:2, each = 5)
pars <- data.frame(sd = par_sd, mean = par_mean)
I want to run fun for the parameters in each row of pars, and collect the results in a data frame with columns sd, mean, pos, value. Here is a rather clumsy solution:
set.seed(42)
## Run fun
res <- lapply(seq_len(nrow(pars)), function(x) {
do.call(fun, as.list(pars[x, ]))
})
## Select the result we need
res <- lapply(res, "[[", "random")
## Make it a single data frame
res <- do.call(rbind, res)
## Together with the parameters
res <- as.data.frame(cbind(sd = par_sd, mean = par_mean, res))
colnames(res) <- c("sd", "mean", 1:10)
## Make it a long data frame
res <- reshape2::melt(res, id.vars=c("sd", "mean"),
variable.name = "pos", value.name="value")
## Done
res[1:5,]
#> sd mean pos value
#> 1 1 0 1 2.37095845
#> 2 2 0 1 3.60973931
#> 3 3 0 1 0.08008422
#> 4 4 0 1 2.82180049
#> 5 5 0 1 2.02999300
Is there a simpler way to do this? Anyone knows a package that does things like this? My quick search did not give any good results...
If you're willing to amend fun() to return a data.frame, I find the most elegant solution is plyr's mdply.
fun <- function(sd, mean, foobar = "foobar") {
data.frame(random = rnorm(10) * sd + mean + 1:10, foobar = foobar)
}
par_sd <- rep(1:5, 3)
par_mean <- rep(0:2, each = 5)
pars <- data.frame(sd = par_sd, mean = par_mean)
results = mdply(pars, fun, foobar = "stuff")
str(results)
mapply would seem a good fit:
> str(with(pars, mapply(fun, sd=sd, mean=mean) ) )
List of 30
$ : num [1:10] 3.16 2.28 2.84 1.49 3.43 ...
$ : chr "foobar"
$ : num [1:10] 3.429 0.157 0.583 1.542 6.485 ...
$ : chr "foobar"
$ : num [1:10] -4.56 -1.51 -1.33 7.16 3.21 ...
$ : chr "foobar"
$ : num [1:10] -2.275 2.225 4.196 0.962 15.739 ...
$ : chr "foobar"
$ : num [1:10] 6.23 10.08 2.85 6.81 4.51 ...
$ : chr "foobar"
$ : num [1:10] 1.65 3.15 5.62 5.91 6.14 ...
$ : chr "foobar"
$ : num [1:10] 4.26 1.95 7.33 2.72 6.29 ...
$ : chr "foobar"
$ : num [1:10] 7.53 6.74 3.6 6.43 3.08 ...
$ : chr "foobar"
$ : num [1:10] -0.4181 -0.0584 5.5812 1.038 8.2482 ...
$ : chr "foobar"
$ : num [1:10] 0.2377 4.8557 5.2177 -0.0706 2.0434 ...
$ : chr "foobar"
$ : num [1:10] 2.95 4.3 5.26 8.58 5.81 ...
$ : chr "foobar"
$ : num [1:10] -0.85 4.83 8.19 5.17 6.58 ...
$ : chr "foobar"
$ : num [1:10] 3.59 11.46 6.29 6.57 2.97 ...
$ : chr "foobar"
$ : num [1:10] 0.117 3.142 10.473 10.196 5.56 ...
$ : chr "foobar"
$ : num [1:10] 13.03 2.64 -1.07 5.29 1.97 ...
$ : chr "foobar"
- attr(*, "dim")= int [1:2] 2 15
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "random" "foobar"
..$ : NULL
By default mapply will attempt to simplify and if you wanted to keep them as separate objects you could negate that default:
> str(with(pars, mapply(fun, sd=sd, mean=mean, SIMPLIFY=FALSE) ) )
List of 15
$ :'data.frame': 10 obs. of 2 variables:
..$ random: num [1:10] 1.08 0.68 3.16 3.38 5.96 ...
..$ foobar: Factor w/ 1 level "foobar": 1 1 1 1 1 1 1 1 1 1
$ :'data.frame': 10 obs. of 2 variables:
..$ random: num [1:10] 0.0927 5.1506 -1.0109 2.7136 2.1263 ...
..$ foobar: Factor w/ 1 level "foobar": 1 1 1 1 1 1 1 1 1 1
$ :'data.frame': 10 obs. of 2 variables:
..$ random: num [1:10] -0.331 2.9 -1.705 5.471 4.712 ...
..$ foobar: Factor w/ 1 level "foobar": 1 1 1 1 1 1 1 1 1 1
snipped
And if you need them in one stacked dataframe, it's just:
> str(do.call( rbind, with(pars, mapply(fun, sd=sd, mean=mean, SIMPLIFY=FALSE) ) ))
'data.frame': 150 obs. of 2 variables:
$ random: num 1 3.34 2.5 4.72 4.25 ...
$ foobar: Factor w/ 1 level "foobar": 1 1 1 1 1 1 1 1 1 1 ...
If you want these "labeled" with the sd and mean values, just this modification of the constructor function:
fun <- function(sd, mean, foobar = "foobar") {
data.frame(random = rnorm(10) * sd + mean + 1:10,
sd=sd, mean=mean, foobar = foobar)
}
str(do.call( rbind, with(pars,
mapply(fun, sd=sd, mean=mean, SIMPLIFY=FALSE) ) ))
#---------------
'data.frame': 150 obs. of 4 variables:
$ random: num 1.42 1.13 3.73 4.5 5.63 ...
$ sd : int 1 1 1 1 1 1 1 1 1 1 ...
$ mean : int 0 0 0 0 0 0 0 0 0 0 ...
$ foobar: Factor w/ 1 level "foobar": 1 1 1 1 1 1 1 1 1 1 ...

Building a list in a loop in R - getting item names correct

I have a function which contains a loop over two lists and builds up some calculated data. I would like to return these data as a lists of lists, indexed by some value, but I'm getting the assignment wrong.
A minimal example of what I'm trying to do, and where i'm going wrong would be:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- append(mybiglist, tmp)
}
If you run this and look at the output mybiglist, you will see that something is going very wrong in the way each item is being named.
Any ideas on how I might achieve what I actually want?
Thanks
ps. I know that in R there is a sense in which one has failed if one has to resort to loops, but in this case I do feel justified ;-)
It works if you don't use the append command:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- tmp
}
# List of 5
# $ item:1:List of 3
# ..$ uniform : num [1:10] 0.737 0.987 0.577 0.814 0.452 ...
# ..$ normal : num [1:16] -0.403 -0.104 2.147 0.32 1.713 ...
# ..$ binomial: num [1:8] 0 0 0 0 1 0 0 1
# $ item:2:List of 3
# ..$ uniform : num [1:10] 0.61 0.62 0.49 0.217 0.862 ...
# ..$ normal : num [1:16] 0.945 -0.154 -0.5 -0.729 -0.547 ...
# ..$ binomial: num [1:8] 1 2 2 0 2 1 0 2
# $ item:3:List of 3
# ..$ uniform : num [1:10] 0.66 0.094 0.432 0.634 0.949 ...
# ..$ normal : num [1:16] -0.607 0.274 -1.455 0.828 -0.73 ...
# ..$ binomial: num [1:8] 2 2 3 1 1 1 2 0
# $ item:4:List of 3
# ..$ uniform : num [1:10] 0.455 0.442 0.149 0.745 0.24 ...
# ..$ normal : num [1:16] 0.0994 -0.5332 -0.8131 -1.1847 -0.8032 ...
# ..$ binomial: num [1:8] 2 3 1 1 2 2 2 1
# $ item:5:List of 3
# ..$ uniform : num [1:10] 0.816 0.279 0.583 0.179 0.321 ...
# ..$ normal : num [1:16] -0.036 1.137 0.178 0.29 1.266 ...
# ..$ binomial: num [1:8] 3 4 3 4 4 2 2 3
Change
mybiglist[[name]] <- append(mybiglist, tmp)
to
mybiglist[[name]] <- tmp
To show that an explicit for loop is not required
unif_norm <- replicate(5, list(uniform = runif(10),
normal = rnorm(16)), simplify=F)
binomials <- lapply(seq_len(5)/10, function(prob) {
list(binomial = rbinom(n = 5 ,size = 8, prob = prob))})
biglist <- setNames(mapply(c, unif_norm, binomials, SIMPLIFY = F),
paste0('item:',seq_along(unif_norm)))
In general if you go down the for loop path it is better to preassign the list beforehand. This is more memory efficient.
mybiglist <- vector('list', 5)
names(mybiglist) <- paste0('item:', seq_along(mybiglist))
for(i in seq_along(mybiglist)){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[i]] <- tmp
}

Resources