I am trying to go through each value in a data frame and, based on that value, extract information from another data frame. I have working code that uses nested for loops, but my datasets are large and the loops run far too long to be feasible.
To simplify, I will provide sample data with initially only one row:
ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
# V01 V02 V03
#1 pp pq pq
I also have this data frame:
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
# stratum locus allele value
#1 A V01 p 0.8
#2 A V01 q 0.2
#3 B V01 p 0.6
#4 B V01 q 0.4
#5 C V01 p 0.3
#6 C V01 q 0.7
There are two allele values for each locus and there are three values for stratum for every locus as well, thus there are six different values for each locus. The column name of ind_1 corresponds to the locus column in df. For each entry in ind_1, I want to return a list of values which are extracted from the value column in df based on the locus (column name in ind_1) and the data entry (pp or pq). For each entry in ind_1 there will be three returned values in the list, one for each stratum in df.
My attempted code is as follows:
library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
p <- df %>%
filter( locus == colnames(ind_1), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1 == "pp") {
prob <- (2 * p * (1-p))
return(prob)
} else if ( ind_1 == "pq") {
prob <- (p^2)
return(prob)
}
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )
This code provides a matrix with incorrect values:
V01 V02 V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42
As well as the warning messages:
# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used
Ideally, I would have the following output:
> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25
I've been trying to figure out how to NOT use for loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that I can work on applying this to a data frame such as ind_1 that has multiple rows
Thank you all, please let me know if the example data are not clear
EDIT
Here's my code that works with a for loop:
pop.prob.for <- function(df, ind_1){
prob.list <- list()
for( i in 1:length(ind_1)){
p <- df %>%
filter( locus == colnames(ind_1[i]), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1[i] == "pp") {
prob <- (2 * p * (1-p))
} else if ( ind_1[i] == "pq") {
prob <- (p^2)
}
prob.list[[i]] <- prob
}
return(prob.list)
}
pop.prob.for(df, ind_1)
For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1 and save each of the iterations of lists produced as an .rdata file
There are two issues with your code. One is that your apply function is operating on the wrong object, and the other is that you can't access the name of an element through sapply.
Right now sapply(ind_1, function(x) {pop.prob(df, ind_1)}) is saying "for each element of ind_1 do pop.prob using df and all of ind_1", hence the incorrect matrix output. To operate element-wise on ind_1 you would write sapply(ind_1, function(x) {pop.prob(df, x)})
This change doesn't work because you extract a column name in your function, and "pp" (the first element) has no column name. To use your function as written, you would need to write:
test <- sapply(1:dim(ind_1)[2], function(x) {pop.prob(df, ind_1[x])})
This way you're iterating in the same manner as your for loop. Note also that you're getting a matrix because sapply attempts to coerce lapply output to a vector or a matrix. If you want a list, simply use lapply
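As a minimal sketch of the same idea in base R, the column name can be passed to the function explicitly instead of being recovered inside it. The helper name pop_prob_one is hypothetical, value is stored as numeric here (unlike the cbind() version in the question), and the two formulas are copied from the question as written.

```r
# Sample data from the question, with value kept numeric
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus   <- rep(c("V01", "V02", "V03"), each = 6)
allele  <- rep(c("p", "q"), 9)
value   <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- data.frame(stratum, locus, allele, value, stringsAsFactors = FALSE)
ind_1 <- data.frame(V01 = "pp", V02 = "pq", V03 = "pq",
                    stringsAsFactors = FALSE)

# Hypothetical helper: takes the locus name and the genotype as separate
# arguments, so it never needs to look up a column name itself
pop_prob_one <- function(df, locus_name, genotype) {
  p <- df$value[df$locus == locus_name & df$allele == "p"]
  if (genotype == "pp") {
    2 * p * (1 - p)
  } else if (genotype == "pq") {
    p^2
  } else {
    stop("unexpected genotype: ", genotype)
  }
}

# Map() walks the column names and the genotypes in parallel and
# returns a list named after the loci
test <- Map(function(nm, g) pop_prob_one(df, nm, g),
            names(ind_1), unlist(ind_1[1, ]))
```

Map() returns a list rather than a matrix, so each locus keeps its own vector of three per-stratum values.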
Here's a vectorised data.table solution. Should be much faster than the apply or for versions. Not to mention far more succinct.
library(data.table)
setDT(df)[, value := as.numeric(as.character(value))]
df[allele=='p',
.(prob = {if (ind_1[.GRP]=='pp') 2*value*(1-value) else value^2}),
by = locus]
# locus prob
# 1: V01 0.32
# 2: V01 0.48
# 3: V01 0.42
# 4: V02 0.25
# 5: V02 0.36
# 6: V02 0.04
# 7: V03 0.16
# 8: V03 0.49
# 9: V03 0.25
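For the multi-row case mentioned in the question, one possible base R sketch is to reshape the "p" frequencies into a stratum-by-locus matrix once and then apply the formulas to the whole matrix per individual. The two-row table ind is hypothetical, and the formulas are again taken from the question as written.

```r
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus   <- rep(c("V01", "V02", "V03"), each = 6)
allele  <- rep(c("p", "q"), 9)
value   <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- data.frame(stratum, locus, allele, value, stringsAsFactors = FALSE)

# Hypothetical genotype table with two individuals (one per row)
ind <- data.frame(V01 = c("pp", "pq"), V02 = c("pq", "pp"),
                  V03 = c("pq", "pq"), stringsAsFactors = FALSE)

# Stratum x locus matrix of the "p" allele frequencies
p_rows <- df[df$allele == "p", ]
p_mat <- tapply(p_rows$value, list(p_rows$stratum, p_rows$locus),
                function(v) v[1])
p_mat <- p_mat[, names(ind), drop = FALSE]  # align columns with ind

geno <- as.matrix(ind)  # individuals x loci, character

# One stratum x locus probability matrix per individual.
# Note: anything other than "pp" is treated like "pq" here.
probs <- lapply(seq_len(nrow(geno)), function(i) {
  is_pp <- matrix(geno[i, ] == "pp", nrow = nrow(p_mat),
                  ncol = ncol(p_mat), byrow = TRUE)
  out <- ifelse(is_pp, 2 * p_mat * (1 - p_mat), p_mat^2)
  dimnames(out) <- dimnames(p_mat)
  out
})
```

Here probs[[1]][, "V01"] holds the three per-stratum values for the first individual at locus V01, so the only remaining loop is over individuals.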
My data contains consecutive columns 1,2,...,2000. I want to apply a function that returns 3 variables for each group of 100 columns, for each row.
The data look like this:
1 2 3 ..... 2000
0.01 0.0 0.002 0.03
0.005 0.002 0.011 0.04
0.001 0.003 0.004 0.0
Here is the code I tried:
prep_data <- function(df){
  # Create column names: grp1_1, grp1_2, grp1_3, ..., grp20_3
  colnms <- c()
  for(i in seq(1, 20, 1)){
    for(j in seq(1, 3, 1)){
      f <- paste0("grp", i, "_", j)
      colnms <- c(colnms, f)
    }
  }
  # Pre-allocate the 60-column result
  trans <- data.frame(matrix(ncol = 60, nrow = NROW(df)))
  colnames(trans) <- colnms
  # Looping over every row
  for (i in 1:NROW(df)){
    X <- c()
    # Looping over each group of 100 columns (start columns 1, 101, ..., 1901)
    for(j in seq(1, 1901, 100)){
      end <- j + 99
      tmp <- subset(df[i, ], select = j:end)
      # Here I apply the function over the 100 columns for the current row to get 3 values
      X <- c(X, MY_FUNC(tmp))
    }
    # Append the current row
    trans[i, ] <- X
  }
  return(trans)
}
The expected output (A dataframe of 60 columns) is as follows:
grp1_1 grp1_2 grp1_3 ..... grp20_3
0.01 0.0 0.002 0.03
0.005 0.002 0.011 0.04
0.001 0.003 0.004 0.0
My code runs but it's too slow, probably because all the loops make it inefficient.
Thanks in advance
Here is one approach:
Let d be your 3 rows x 2000 columns frame, with column names as.character(1:2000) (see below for generation of fake data). We add a row identifier using .I, then melt the data long, adding grp, a column-group identifier (i.e. identifying the 20 sets of 100). Then apply your function myfunc (see below for a stand-in function for this example), by row and group, and swing wide. (I used stringr::str_pad to add 0 to the front of the group number.)
# add row identifier
d[, row:=.I]
# melt and add col group identifier
dm = melt(d,id.vars = "row",variable.factor = F)[,variable:=as.numeric(variable)][order(variable,row), grp:=rep(1:20, each=300)]
# get the result (180 rows long), applying myfunc to each set of columns, by row
result = dm[, myfunc(value), by=.(row,grp)][,frow:=rep(1:3,times=60)]
# swing wide (3 rows long, 60 columns wide)
dcast(
result[,v:=paste0("grp",stringr::str_pad(grp,2,pad = "0"),"_",row)],
frow~v,value.var="V1"
)[, frow:=NULL][]
Output: (first six columns only)
grp01_1 grp01_2 grp01_3 grp02_1 grp02_2 grp02_3
<num> <num> <num> <num> <num> <num>
1: 0.54187168 0.47650694 0.48045694 0.51278399 0.51777319 0.46607845
2: 0.06671367 0.08763655 0.08076939 0.07930063 0.09830116 0.07807937
3: 0.25828989 0.29603471 0.28419957 0.28160367 0.31353016 0.27942687
Input:
d = data.table()
alloc.col(d,2000)
set.seed(123)
for(c in 1:2000) set(d,j=as.character(c), value=runif(3))
myfunc Function (toy example for this answer):
myfunc <- function(x) c(mean(x), var(x), sd(x))
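For comparison, here is a base R sketch of the layout the question asks for (one output row per input row, columns grp1_1 ... grp20_3). It is not the data.table approach above: it assumes the data fits in a numeric matrix m, reuses the toy myfunc, and loops only over the 20 column groups.

```r
set.seed(123)
m <- matrix(runif(3 * 2000), nrow = 3)   # stand-in for the 3 x 2000 data

myfunc <- function(x) c(mean(x), var(x), sd(x))

n_grp    <- 20    # number of column groups
grp_size <- 100   # columns per group

# For each group, apply myfunc to every row of that 100-column block.
# apply() returns a 3 x nrow matrix, so transpose to nrow x 3.
res <- do.call(cbind, lapply(seq_len(n_grp), function(g) {
  cols <- ((g - 1) * grp_size + 1):(g * grp_size)
  t(apply(m[, cols, drop = FALSE], 1, myfunc))
}))

colnames(res) <- paste0("grp", rep(seq_len(n_grp), each = 3),
                        "_", rep(1:3, times = n_grp))
res <- as.data.frame(res)
```

The loop count drops from rows x groups to just groups, which is usually where most of the time in the original function goes.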
I'm looking for help to transform a for loop into an lapply or similar function.
I have a list of similar data.frames, each containing
an indicator column ('a')
a value column ('b')
I want to invert the values in column b for each data frame, but only for specific indicators. For example, invert all values in 'b' that have an indicator of 2 in column a.
Here are some sample data:
x = data.frame(a = c(1, 2, 3, 2), b = (seq(from = .1, to = 1, by = .25)))
y = data.frame(a = c(1, 2, 3, 2), b = (seq(from = 1, to = .1, by = -.25)))
my_list <- list(x = x, y = y)
my_list
$x
a b
1 1 0.10
2 2 0.35
3 3 0.60
4 2 0.85
$y
a b
1 1 1.00
2 2 0.75
3 3 0.50
4 2 0.25
My desired output looks like this:
my_list
$x
a b
1 1 0.10
2 2 0.65
3 3 0.60
4 2 0.15
$y
a b
1 1 1.00
2 2 0.25
3 3 0.50
4 2 0.75
I can achieve the desired output with the following for loop.
for(i in 1:length(my_list)){
my_list[[i]][my_list[[i]]['a'] == 2, 'b'] <-
1 - my_list[[i]][my_list[[i]]['a'] == 2, 'b']
}
But when I try to roll this into lapply form like so:
invertfun <- function(inputDF){
inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
}
resultList <- lapply(X = my_list, FUN = invertfun)
I get a list with only the inverted values:
resultList
$x
[1] 0.65 0.15
$y
[1] 0.25 0.75
What am I missing here? I've tried to apply (pun intended) the insights from:
how to use lapply instead of a for loop, to perform a calculation on a list of dataframes in R
I'd appreciate any insights or alternative solutions. I'm trying to take my R skills to the next level and apply and similar functions seem to be the key.
We could use lapply to loop over each list and change the b column based on value in a column.
my_list[] <- lapply(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
my_list
#$x
# a b
#1 1 0.10
#2 2 0.65
#3 3 0.60
#4 2 0.15
#$y
# a b
#1 1 1.00
#2 2 0.25
#3 3 0.50
#4 2 0.75
The same could be done using map from purrr
library(purrr)
map(my_list, function(x) transform(x, b = ifelse(a==2, 1-b, b)))
See Ronak's answer above for a fairly elegant solution using transform() or map(), but for those who are following in my footsteps, my original solution would work if I added a line in the custom function to return the full data frame like so:
invertfun <- function(inputDF){
inputDF[inputDF['a'] == 2, 'b'] <- 1 - inputDF[inputDF['a'] == 2, 'b']
return(inputDF)
}
resultList <- lapply(X = my_list, FUN = invertfun)
UPDATE - On further testing, this solution throws an Error in x[[jj]][iseq] <- vjj : replacement has length zero when the desired 'a' value doesn't exist in one of the data frames. So best not to go down this road and use the accepted answer above.
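One way to keep the custom-function approach while avoiding that error is to index column b with a plain logical vector, since assigning a zero-length replacement to zero selected elements is allowed. This is a sketch only; invert_b and its target argument are hypothetical names.

```r
x <- data.frame(a = c(1, 2, 3, 2), b = seq(from = .1, to = 1, by = .25))
y <- data.frame(a = c(1, 2, 3, 2), b = seq(from = 1, to = .1, by = -.25))
my_list <- list(x = x, y = y)

# Hypothetical helper: invert b where a equals target, return the whole frame
invert_b <- function(dat, target = 2) {
  hit <- dat$a == target          # plain logical vector, may be all FALSE
  dat$b[hit] <- 1 - dat$b[hit]    # no-op when nothing matches
  dat
}

resultList <- lapply(my_list, invert_b)

# A data frame without the target indicator comes back unchanged
no_match <- invert_b(data.frame(a = c(1, 3), b = c(0.2, 0.4)))
```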
lapply is typically not the best way to iteratively modify a list. lapply is going to generate a loop internally in any case, so usually easier to read if you do something more explicit:
for (i in seq_along(my_list)) {
my_list[[i]] <- within(my_list[[i]], {
b[a==2] <- 1 - b[a==2]
})}
If we replace within with with in the example above, we get the output from your initial solution, i.e. lapply(X = my_list, FUN = invertfun).
That is, instead of modifying the list in place the latter solutions replace the list elements with new vectors.
I have the following data
Frequency = 260
[1] -9.326550e-03
[2] -4.422175e-03
[3] 9.003794e-03
[4] -1.778217e-03
[5] -4.676712e-03
[6] 1.242704e-02
[7] 5.759863e-03
And I want to count how many of these are in between these:
Frequency = 260
[,1] [,2]
[1] NA NA
[2] 0.010363147 -0.010363147
[3] 0.010072569 -0.010072569
[4] 0.010018997 -0.010018997
[5] 0.009700522 -0.009700522
[6] 0.009476024 -0.009476024
[7] 0.009748085 -0.009748085
I have to do this in r, but I'm a beginner.
Thanks in advance!
Unless I misunderstand -- you want the number of times the j-th element of your first object is between the two elements of the j-th row of the second? If so,
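sum((data1 < data2[,1]) & (data1 > data2[,2]), na.rm = TRUE)
will do it, given that column 1 of your second object holds the upper bounds and column 2 the lower bounds; na.rm = TRUE skips the NA row. Divide the result by length(data1) if you want the proportion rather than the count.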
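A small runnable sketch with the seven values shown in the question, assuming the rows of the two objects line up one to one in the order printed and that the first column of the bounds is the upper limit (data1, data2 and upper are illustrative names):

```r
data1 <- c(-9.326550e-03, -4.422175e-03, 9.003794e-03, -1.778217e-03,
           -4.676712e-03, 1.242704e-02, 5.759863e-03)

upper <- c(NA, 0.010363147, 0.010072569, 0.010018997,
           0.009700522, 0.009476024, 0.009748085)
data2 <- cbind(upper = upper, lower = -upper)

# Element-wise comparison: is the j-th value inside the j-th interval?
inside <- data1 < data2[, "upper"] & data1 > data2[, "lower"]

# Count of values inside their interval; the NA row is skipped
n_inside <- sum(inside, na.rm = TRUE)
```

Because the comparison is already vectorised, no loop or apply call is needed for this element-wise case.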
Here's one approach using foverlaps from the package data.table, with the following toy data sets:
library(data.table)
##
set.seed(123)
ts1 <- data.table(
ts(rnorm(50, sd = .1), frequency = 260))[
,V2 := V1]
##
ts2 <- cbind(
ts(rnorm(50,-0.1,.5), frequency=260)
,ts(rnorm(50,0.1,.5), frequency=260))
ts2 <- data.table(
t(apply(ts2, 1, sort)))[
1, c("V1", "V2") := NA]
setkeyv(ts2, c("V1","V2"))
Since foverlaps needs two columns from each of the input data.tables, we just duplicate the first column in ts1 (this is the convention, as far as I'm aware).
fts <- foverlaps(
x = ts1, y = na.omit(ts2)
,type = "within")[
,list(Freq = .N)
,by = "V1,V2"]
This joins ts1 on ts2 for every occurrence of a ts1 value that falls within each of ts2's [V1, V2] intervals - and then aggregates to get a count by interval. Since it is feasible that some of ts2's intervals will contain zero ts1 values (which is the case with this sample data), you can left join the aggregate data back on the original ts2 object, and derive the corresponding proportions:
(merge(x = ts2, y = fts, all.x=TRUE)[
is.na(Freq), Freq := 0][
,Inside := Freq/nrow(ts1)][
,Outside := 1 - Inside])[1:10,]
##
# V1 V2 Freq Inside Outside
# 1: NA NA 0 0.00 1.00
# 2: -1.2545844 -0.37373731 0 0.00 1.00
# 3: -0.9266236 -0.21024328 1 0.02 0.98
# 4: -0.8743764 -0.29245223 0 0.00 1.00
# 5: -0.7339710 0.19230687 50 1.00 0.00
# 6: -0.7103589 0.13898042 50 1.00 0.00
# 7: -0.7089414 -0.26660369 0 0.00 1.00
# 8: -0.7007681 0.58032622 50 1.00 0.00
# 9: -0.6860721 0.01936587 35 0.70 0.30
# 10: -0.6573338 -0.41395304 0 0.00 1.00
I think @nrussell's answer is just fine, but you can accomplish this much more simply using base R, so I'll document it here for you since you said you're a beginner. I've commented it as well to hopefully help you learn what's going on:
## Set a seed so simulated data can be duplicated:
set.seed(2001)
## Simulate your data to be counted:
d <- rnorm(50)
## Simulate your ranges:
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)
## Sum up the values of d falling inside each row of ranges. The apply
## function takes each row of r, and compares the values of d to the
## bounds of your ranges (lower in the first column, upper in the second)
## and the resulting logical vector is then summed, where TRUEs are equal
## to 1, thus counting the number of values in d falling between each
## set of bounds:
sums <- apply(r, MARGIN=1, FUN=function(x) { sum( d > x[1] & d < x[2] ) })
## Each item of the sums vector refers to the corresponding
## row of ranges in the r object...
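If every value has to be checked against every range, the same counts can also be obtained without apply by building the full comparison matrix with outer(). This is a sketch on the same simulated data; it trades memory (a 50 x 10 logical matrix here) for the absence of an R-level loop.

```r
set.seed(2001)
d <- rnorm(50)
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)

# inside[i, j] is TRUE when d[i] lies strictly between r[j, 1] and r[j, 2]
inside <- outer(d, r[, 1], ">") & outer(d, r[, 2], "<")

# Column sums give the number of values of d inside each range
counts <- colSums(inside)
```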
I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which are basically percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
Resulting dataframe must look like the following with an additional column. It matches the score in the percentile bucket from the lookup list in the corresponding Big_Group:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much
You can create a function like this:
myFun <- function(x) {
names(Look_Up_List[[as.character(x[2])]])[
findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply:
apply(mydf, 1, myFun)
# [1] "A" "B" "A"
# reproducible input data
Look_Up_List <- list('1' = c(A=0.000, B=0.370, C=0.544, D=0.698, E=9.655),
  '2' = c(A=0.000, B=0.506, C=0.649, D=0.774, E=1.192))
Current <- data.frame(Score=c(0.1, 0.4, 0.3),
  Big_group=c(1,1,2))
# Solution 1: use the lookup vector of each row's own Big_group and take
# the last bucket whose lower bound is below the score
Current$Sub_Group <- sapply(1:nrow(Current), function(i) {
  lu <- Look_Up_List[[as.character(Current$Big_group[i])]]
  max(names(lu[Current$Score[i] > lu]))
})
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(1:nrow(Current), function(i) {
  lu <- Look_Up_List[[as.character(Current$Big_group[i])]]
  names(lu)[findInterval(Current$Score[i], lu)]
})
# show result
Current
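Since findInterval() is already vectorised over the scores, the per-row sapply can be replaced by a loop over the groups only. The following is a sketch under that assumption; assign_sub_group is a hypothetical helper name, and scores below the first break are returned as NA.

```r
Look_Up_List <- list(
  `1` = c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655),
  `2` = c(A = 0.000, B = 0.506, C = 0.649, D = 0.774, E = 1.192)
)
Current <- data.frame(Score = c(0.1, 0.4, 0.3), Big_group = c(1, 1, 2))

# Hypothetical helper: one findInterval() call per group, not per row
assign_sub_group <- function(score, big_group, lookup) {
  out <- rep(NA_character_, length(score))
  for (g in unique(big_group)) {
    breaks <- lookup[[as.character(g)]]
    rows <- big_group == g
    idx <- findInterval(score[rows], breaks)
    idx[idx == 0] <- NA            # score below the first break
    out[rows] <- names(breaks)[idx]
  }
  out
}

Current$Sub_Group <- assign_sub_group(Current$Score, Current$Big_group,
                                      Look_Up_List)
```

With many rows and few groups, this does far fewer function calls than iterating row by row.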
Consider the following dataframe:
df <- data.frame(Asset = c("A", "B", "C"), Historical = c(0.05,0.04,0.03), Forecast = c(0.04,0.02,NA))
# Asset Historical Forecast
#1 A 0.05 0.04
#2 B 0.04 0.02
#3 C 0.03 NA
as well as the variable x. x is set by the user at the beginning of the R script, and can take two values: either x = "Forecast" or x = "Historical".
If x = "Forecast", I would like to return the following: for each asset, if a forecast is available, return the appropriate number from the column "Forecast", otherwise, return the appropriate number from the column "Historical". As you can see below, both A and B have a forecast value which is returned below. C is missing a forecast value, so the historical value is returned.
Asset Return
1 A 0.04
2 B 0.02
3 C 0.03
If, however, x = "Historical", simply return the Historical column:
Asset Historical
1 A 0.05
2 B 0.04
3 C 0.03
I can't come up with an easy way of doing it, and brute force is very inefficient if you have a large number of rows. Any ideas?
Thanks!
First, pre-process your data:
df2 <- transform(df, Forecast = ifelse(!is.na(Forecast), Forecast, Historical))
Then extract the two columns of choice:
df2[c("Asset", x)]
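The two steps can also be wrapped in a small function, sketched here with the hypothetical name get_returns. Unlike the outputs shown in the question, it names the result column Return in both cases.

```r
df <- data.frame(Asset = c("A", "B", "C"),
                 Historical = c(0.05, 0.04, 0.03),
                 Forecast = c(0.04, 0.02, NA))

# Hypothetical wrapper: x selects the preferred column; missing forecasts
# fall back to the historical value
get_returns <- function(df, x = c("Forecast", "Historical")) {
  x <- match.arg(x)
  ret <- if (x == "Forecast") {
    ifelse(is.na(df$Forecast), df$Historical, df$Forecast)
  } else {
    df$Historical
  }
  data.frame(Asset = df$Asset, Return = ret)
}

fc <- get_returns(df, "Forecast")
hist_only <- get_returns(df, "Historical")
```

ifelse() is vectorised, so this stays a single pass over the columns regardless of the number of rows.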