Vectorizing with R instead of a for loop

I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which is basically a set of percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
The resulting dataframe must look like the following, with an additional column that matches each score to its percentile bucket from the lookup list for the corresponding Big_group:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much

You can create a function like this:
myFun <- function(x) {
  names(Look_Up_List[[as.character(x[2])]])[
    findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply:
apply(mydf, 1, myFun)
# [1] "A" "B" "A"
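findInterval() returns the index of the last breakpoint that is less than or equal to the score, which is why indexing names() with that value picks the right bucket. A quick illustration on the first lookup vector (brks is just an illustrative name):
brks <- c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655)
findInterval(0.4, brks)                # 2
names(brks)[findInterval(0.4, brks)]   # "B"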

# reproducible input data
Look_Up_List <- list('1' = c(A=0.000, B=0.370, C=0.544, D=0.698, E=9.655),
                     '2' = c(A=0.000, B=0.506, C=0.649, D=0.774, E=1.192))
Current <- data.frame(Score = c(0.1, 0.4, 0.3),
                      Big_group = c(1, 1, 2))
# Solution 1: take the last bucket whose breakpoint the score exceeds, within the row's Big_group
Current$Sub_Group <- sapply(1:nrow(Current), function(i) {
  brks <- Look_Up_List[[as.character(Current$Big_group[i])]]
  max(names(brks[Current$Score[i] > brks])) })
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(1:nrow(Current), function(i) {
  brks <- Look_Up_List[[as.character(Current$Big_group[i])]]
  names(brks)[findInterval(Current$Score[i], brks)] })
# show result
Current
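If you prefer to avoid indexing by row number, the same lookup can be written with mapply(), which walks the Score and Big_group columns in parallel. A minimal sketch using the reproducible data above (still a loop under the hood):
Current$Sub_Group <- mapply(function(score, grp) {
  brks <- Look_Up_List[[as.character(grp)]]   # breakpoints for this row's Big_group
  names(brks)[findInterval(score, brks)]
}, Current$Score, Current$Big_group)
Current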

Related

Add value to a dataframe without using a for loop in R

I am writing in R, and I want to add a new column WITHOUT using a for loop.
Here's the thing I want to do:
I want to calculate the mean from the first value to the current value.
If I use for loop, I will do in this way:
for (i in c(1:nrow(data))){
  data$Xn_bar[i] = mean(data$Xn[1:i])
}
Is there another way (e.g. map)?
Here's the data:
a = data.frame(
  n = c(1:10),
  Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117, 0.319, -0.582, 0.715, -0.825, -0.360)
)
You can do this with dplyr::cummean() or calculate it in base R by dividing the cumulative sum by the number of values so far:
cumsum(a$Xn) / seq_along(a$Xn) # base R
dplyr::cummean(a$Xn) # dplyr
# Output in both cases
# [1] -0.50200000 -0.18500000 -0.14966667 0.10950000 0.11100000 0.14566667 0.04171429
# [8] 0.12587500 0.02022222 -0.01780000
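Since the question asks for a new column, either expression can be assigned directly; for example, with the base R form:
a$Xn_bar <- cumsum(a$Xn) / seq_along(a$Xn)
a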
Here is one solution using dplyr's row_number() and the mapply() function:
library(dplyr)
df <- data.frame(n = c(1, 2, 3, 4, 5),
                 Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117))
# add a row_index column containing the row number of the current row
df <- df %>%
  mutate(row_index = row_number())
# add column Xxn: the rounded mean of Xn from the first row up to the current row
df$Xxn <- mapply(function(x, y) round(mean(df$Xn[1:y]), 3),
                 df$Xn,
                 df$row_index,
                 USE.NAMES = FALSE)
# now remove the row_index column
df <- df %>% select(-row_index)
df
# > df
# n Xn Xxn
# 1 1 -0.502 -0.502
# 2 2 0.132 -0.185
# 3 3 -0.079 -0.150
# 4 4 0.887 0.110
# 5 5 0.117 0.111
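The helper column and the unused x argument are not strictly needed; the same running mean can be written with a single sapply() over the row positions. A minimal sketch:
df$Xxn <- sapply(seq_len(nrow(df)), function(i) round(mean(df$Xn[1:i]), 3))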

How to use lapply to transform specific values in a list of dataframes

I'm looking for help to transform a for loop into an lapply or similar function.
I have a list of similar data.frames, each containing
an indicator column ('a')
a value column ('b')
I want to invert the values in column b for each data frame, but only for specific indicators. For example, invert all values in 'b' that have an indicator of 2 in column a.
Here are some sample data:
x = data.frame(a = c(1, 2, 3, 2), b = seq(from = .1, to = 1, by = .25))
y = data.frame(a = c(1, 2, 3, 2), b = seq(from = 1, to = .1, by = -.25))
my_list <- list(x = x, y = y)
my_list
$x
a b
1 1 0.10
2 2 0.35
3 3 0.60
4 2 0.85
$y
a b
1 1 1.00
2 2 0.75
3 3 0.50
4 2 0.25
My desired output looks like this:
my_list
$x
a b
1 1 0.10
2 2 0.65
3 3 0.60
4 2 0.15
$y
a b
1 1 1.00
2 2 0.25
3 3 0.50
4 2 0.75
I can achieve the desired output with the following for loop.
for(i in 1:length(my_list)){
  my_list[[i]][my_list[[i]]['a'] == 2, 'b'] <-
    1 - my_list[[i]][my_list[[i]]['a'] == 2, 'b']
}
BUT. When I try to roll this into lapply form like so:
invertfun <- function(inputDF){
  inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
}
resultList <- lapply(X = my_list, FUN = invertfun)
I get a list with only the inverted values:
resultList
$x
[1] 0.65 0.15
$y
[1] 0.25 0.75
What am I missing here? I've tried to apply (pun intended) the insights from:
how to use lapply instead of a for loop, to perform a calculation on a list of dataframes in R
I'd appreciate any insights or alternative solutions. I'm trying to take my R skills to the next level, and apply() and similar functions seem to be the key.
We could use lapply to loop over each list element and change the b column based on the value in the a column.
my_list[] <- lapply(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
my_list
#[[1]]
# a b
#1 1 0.10
#2 2 0.65
#3 3 0.60
#4 2 0.15
#[[2]]
# a b
#1 1 1.00
#2 2 0.25
#3 3 0.50
#4 2 0.75
The same could be done using map from purrr:
library(purrr)
map(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
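purrr also accepts a formula shorthand for the anonymous function, where .x stands for the current list element:
map(my_list, ~ transform(.x, b = ifelse(a == 2, 1 - b, b)))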
See Ronak's answer above for a fairly elegant solution using transform() or map(), but for those who are following in my footsteps, my original solution would work if I added a line in the custom function to return the full data frame like so:
invertfun <- function(inputDF){
  inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
  return(inputDF)
}
resultList <- lapply(X = my_list, FUN = invertfun)
UPDATE - On further testing, this solution throws Error in x[[jj]][iseq] <- vjj : replacement has length zero when the desired 'a' value doesn't exist in one of the data frames, so it's best not to go down this road; use the accepted answer above instead.
lapply is typically not the best way to iteratively modify a list. lapply is going to generate a loop internally in any case, so it is usually easier to read if you do something more explicit:
for (i in seq_along(my_list)) {
  my_list[[i]] <- within(my_list[[i]], {
    b[a == 2] <- 1 - b[a == 2]
  })
}
If we replace within with with in the example above, we get the output from your initial solution, i.e. lapply(X = my_list, FUN = invertfun).
That is, instead of modifying the list in place the latter solutions replace the list elements with new vectors.
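To see the difference concretely, compare the return values on the data frame x from the question; a minimal sketch:
within(x, b[a == 2] <- 1 - b[a == 2])  # returns the whole data frame with b updated
with(x, b[a == 2] <- 1 - b[a == 2])    # returns (invisibly) only the vector of inverted values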

R: using if statements in apply instead of a for loop

I am trying to go through each value in a data frame and based on that value extract information from another data frame. I have code that works for doing nested for loops but I am working with large datasets that run far too long for that to be feasible.
To simplify, I will provide sample data with initially only one row:
ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
# V01 V02 V03
#1 pp pq pq
I also have this data frame:
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
# stratum locus allele value
#1 A V01 p 0.8
#2 A V01 q 0.2
#3 B V01 p 0.6
#4 B V01 q 0.4
#5 C V01 p 0.3
#6 C V01 q 0.7
There are two allele values for each locus and there are three values for stratum for every locus as well, thus there are six different values for each locus. The column name of ind_1 corresponds to the locus column in df. For each entry in ind_1, I want to return a list of values which are extracted from the value column in df based on the locus(column name in ind_1) and the data entry (pp or pq). For each entry in ind_1 there will be three returned values in the list, one for each of the stratum in df.
My attempted code is as follows:
library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
  p <- df %>%
    filter(locus == colnames(ind_1), allele == "p")
  p <- as.numeric(as.character(p$value))
  if (ind_1 == "pp") {
    prob <- (2 * p * (1 - p))
    return(prob)
  } else if (ind_1 == "pq") {
    prob <- (p^2)
    return(prob)
  }
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )
This code provides a matrix with incorrect values:
V01 V02 V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42
As well as the warning messages:
# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used
Ideally, I would have the following output:
> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25
I've been trying to figure out how to NOT use for loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that, I can work on applying this to a data frame such as ind_1 that has multiple rows.
Thank you all; please let me know if the example data are not clear.
EDIT
Here's my code that works with a for loop:
pop.prob.for <- function(df, ind_1){
  prob.list <- list()
  for (i in 1:length(ind_1)){
    p <- df %>%
      filter(locus == colnames(ind_1[i]), allele == "p")
    p <- as.numeric(as.character(p$value))
    if (ind_1[i] == "pp") {
      prob <- (2 * p * (1 - p))
    } else if (ind_1[i] == "pq") {
      prob <- (p^2)
    }
    prob.list[[i]] <- prob
  }
  return(prob.list)
}
pop.prob.for(df, ind_1)
For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1 and save each of the resulting lists as an .rdata file.
There are two issues with your code. One is that your apply function is operating on the wrong object, and the other is that you can't access the name of an element through sapply.
Right now sapply(ind_1, function(x) {pop.prob(df, ind_1)}) is saying "for each element of ind_1, do pop.prob using df and all of ind_1", hence the incorrect matrix output. To operate element-wise on ind_1 you would write sapply(ind_1, function(x) {pop.prob(df, x)})
This change doesn't work because you extract a column name in your function, and "pp" (the first element) has no column name. To use your function as written, you would need to write:
test <- sapply(1:dim(ind_1)[2], function(x) {pop.prob(df, ind_1[x])})
This way you're iterating in the same manner as your for loop. Note also that you're getting a matrix because sapply attempts to coerce the lapply-style output to a vector or a matrix. If you want a list, simply use lapply.
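A minimal sketch of that lapply version, naming the results by locus so the output has the shape requested in the question:
test <- lapply(1:dim(ind_1)[2], function(x) pop.prob(df, ind_1[x]))
names(test) <- colnames(ind_1)
test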
Here's a vectorised data.table solution. Should be much faster than the apply or for versions. Not to mention far more succinct.
library(data.table)
setDT(df)[, value := as.numeric(as.character(value))]
df[allele=='p',
.(prob = {if (ind_1[.GRP]=='pp') 2*value*(1-value) else value^2}),
by = locus]
# locus prob
# 1: V01 0.32
# 2: V01 0.48
# 3: V01 0.42
# 4: V02 0.25
# 5: V02 0.36
# 6: V02 0.04
# 7: V03 0.16
# 8: V03 0.49
# 9: V03 0.25
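If you want the named-list shape shown in the question, the grouped result can be split by locus afterwards; a minimal sketch (res is just an illustrative name for the result above):
res <- df[allele == 'p',
          .(prob = {if (ind_1[.GRP] == 'pp') 2*value*(1-value) else value^2}),
          by = locus]
split(res$prob, res$locus)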

Unroll R data.frame list column retaining the other values in the row [duplicate]

This question already has answers here:
Unlisting columns by groups
(3 answers)
Closed 7 years ago.
I need to efficiently "unroll" a list column in an R data.frame. For example, if I have a data.frame defined as:
dbt <- data.frame(values = c(1, 1, 1, 1, 2, 3, 4),
                  parm1 = c("A", "B", "C", "A", "B", "C", "B"),
                  parm2 = c("d", "d", "a", "b", "c", "a", "a"))
Then, assume an analysis that generates one column as a list, similar to the following output:
agg <- aggregate(values ~ parm1 + parm2, data = dbt,
                 FUN = function(x) {return(list(x))})
The aggregated data.frame looks like (where class(agg$values) == "list"):
parm1 parm2 values
1 B a 4
2 C a 1, 3
3 A b 1
4 B c 2
5 A d 1
6 B d 1
I'd like to unroll the "values" column, repeating the parm1 and parm2 values (adding more rows) in an efficient manner for each element of the list over all the data.frame rows.
At the top level I wrote a function that does the unroll in a for loop, called from an apply. It's really inefficient (the aggregated data.frame takes about an hour to create and nearly 24 hours to unroll; the fully unrolled data has ~500k records). The top-level call I'm using is:
unrolled.data <- do.call(rbind, apply(agg, 1, FUN=unroll.data))
The function just calls unlist() on the values element and then builds the returned data.frame row by row in a for loop.
The environment is somewhat restricted: the tidyr, data.table and splitstackshape libraries are unavailable to me, and the solution must not only use functions found in base:: but also be limited to those available in R 3.1.1 and earlier. Thus the answers to that (not really a duplicate) question do not apply.
Any suggestions on something faster?
Thanks!
With base R, you could try
with(agg, {
  data.frame(
    lapply(agg[,1:2], rep, times=lengths(values)),
    values=unlist(values)
  )
})
# parm1 parm2 values
# 1.2 B a 4
# 1.31 C a 1
# 1.32 C a 3
# 2.1 A b 1
# 3.2 B c 2
# 4.1 A d 1
# 4.2 B d 1
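One caveat for the R 3.1.1 constraint in the question: lengths() was only added to base R in 3.2.0, so on older versions you can substitute sapply(values, length). A minimal sketch of the same idea:
n <- sapply(agg$values, length)
data.frame(agg[rep(seq_len(nrow(agg)), n), c("parm1", "parm2")],
           values = unlist(agg$values))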
Timings for an alternative (thanks #thelatemail)
library(dplyr)
agg %>%
sample_n(1e7, replace=T) -> bigger
system.time(
with(bigger, { data.frame(lapply(bigger[,1:2], rep, times=lengths(values)), values=unlist(values)) })
)
# user system elapsed
# 3.78 0.14 3.93
system.time(
with(bigger, { data.frame(bigger[rep(rownames(bigger), lengths(values)), 1:2], values=unlist(values)) })
)
# user system elapsed
# 11.30 0.34 11.64

GNU R: Rename variable in loop

I would like to create a loop in order to create 15 crosstables from one data.frame (var1), which consists of 15 variables, and another variable (var2); see the data, which can be downloaded here.
The code is now able to give results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec=",")
var2 <- sample(1:43)
i <- 1
while(i <= ncol(var1)) {
  mytable[[i]] <- table(var2, var1[,i])
  assocstats(mytable[[i]])
  print(mytable[[i]])
  i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a list of objects is actively discouraged when using R. Collecting all in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[ , c(2, 8:11)] ##OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ##store output in a list
##a `for` loop looks simpler than `while`
for(i in 1:ncol(var1)){
  mytable[[i]] <- table(var2, var1[ , i])
  myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
   1 4 2
   2 6 6
   3 1 1
   4 2 3
   5 5 2
> myassoc[[2]]
                    X^2 df P(> X^2)
Likelihood Ratio 1.7079  4  0.78928
Pearson          1.6786  4  0.79460

Phi-Coefficient   : NA
Contingency Coeff.: 0.223
Cramer's V        : 0.229
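If you would rather refer to results by variable name than by numeric position, you can also name both lists after the columns of var1; a minimal sketch:
names(mytable) <- names(myassoc) <- colnames(var1)
mytable[["vs"]]   # e.g. the same table as mytable[[2]] above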
