Sum vector with number by dynamic intervals without looping - r

I have dynamic intervals in a data frame, generated by calculating percentages of my data, like below:
Start Finish
0.00 0.86
0.87 0.89
0.90 0.98
0.99 1.00
I have a vector with about 3000 numbers, and I want to count how many of them fall in each interval without using a loop, because a loop is too slow.
Numbers<-c(0.1,0.2,0.3,0.7,0.8,0.9,0.91,0.99)
Expected result in this case: 5,0,2,1....

You can use apply() to go through your start-finish data.frame row by row, check which numbers are between the start and finish values, and sum up the logical vector returned by data.table's between() function.
Numbers<-c(0.1,0.2,0.3,0.7,0.8,0.9,0.91,0.99)
sf <-
read.table(text =
"Start Finish
0.00 0.86
0.87 0.89
0.90 0.98
0.99 1.00",
header = TRUE
)
apply(sf, 1, function(x) {
sum(data.table::between(Numbers, x[1], x[2]))
})
This will return:
5 0 2 1
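For comparison, here is a minimal base R sketch of the same count that avoids apply() altogether. It assumes the intervals are closed on both ends, which is what between() does by default; the names above_start, below_finish and counts are only illustrative.

```r
Numbers <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.91, 0.99)
sf <- data.frame(Start  = c(0.00, 0.87, 0.90, 0.99),
                 Finish = c(0.86, 0.89, 0.98, 1.00))

# outer() builds an 8 x 4 logical matrix per bound:
# one row per number, one column per interval.
above_start  <- outer(Numbers, sf$Start,  ">=")
below_finish <- outer(Numbers, sf$Finish, "<=")

# A number belongs to an interval when both conditions hold;
# summing each column counts the members of that interval.
counts <- colSums(above_start & below_finish)
counts
# [1] 5 0 2 1
```

This builds a length(Numbers) by nrow(sf) matrix, which is small for about 3000 numbers and a handful of intervals but could become memory-hungry for much larger inputs.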

We can use foverlaps from data.table (here df is the Start/Finish data frame):
library(data.table)
setDT(df)
dfN <- data.table(Start = Numbers, Finish = Numbers)
setkeyv(df, names(df))
setkeyv(dfN, names(dfN))
foverlaps(df, dfN, which = TRUE, type = "any")[, sum(!is.na(yid)), xid]$V1
#[1] 5 0 2 1

Related

R using if statements in apply instead of for loop

I am trying to go through each value in a data frame and based on that value extract information from another data frame. I have code that works for doing nested for loops but I am working with large datasets that run far too long for that to be feasible.
To simplify, I will provide sample data with initially only one row:
ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
# V01 V02 V03
#1 pp pq pq
I also have this data frame:
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
# stratum locus allele value
#1 A V01 p 0.8
#2 A V01 q 0.2
#3 B V01 p 0.6
#4 B V01 q 0.4
#5 C V01 p 0.3
#6 C V01 q 0.7
There are two allele values for each locus and there are three values for stratum for every locus as well, thus there are six different values for each locus. The column name of ind_1 corresponds to the locus column in df. For each entry in ind_1, I want to return a list of values which are extracted from the value column in df based on the locus(column name in ind_1) and the data entry (pp or pq). For each entry in ind_1 there will be three returned values in the list, one for each of the stratum in df.
My attempted code is as follows:
library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
p <- df %>%
filter( locus == colnames(ind_1), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1 == "pp") {
prob <- (2 * p * (1-p))
return(prob)
} else if ( ind_1 == "pq") {
prob <- (p^2)
return(prob)
}
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )
This code provides a matrix with incorrect values:
V01 V02 V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42
As well as the warning messages:
# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used
Ideally, I would have the following output:
> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25
I've been trying to figure out how to NOT use for loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that I can work on applying this to a data frame such as ind_1 that has multiple rows
Thank you all, please let me know if the example data are not clear
EDIT
Here's my code that works with a for loop:
pop.prob.for <- function(df, ind_1){
prob.list <- list()
for( i in 1:length(ind_1)){
p <- df %>%
filter( locus == colnames(ind_1[i]), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1[i] == "pp") {
prob <- (2 * p * (1-p))
} else if ( ind_1[i] == "pq") {
prob <- (p^2)
}
prob.list[[i]] <- prob
}
return(prob.list)
}
pop.prob.for(df, ind_1)
For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1 and save each of the iterations of lists produced as an .rdata file
There are two issues with your code. One is that your apply function is operating on the wrong object, and the other is that you can't access the name of an element through sapply.
Right now sapply(ind_1, function(x) {pop.prob(df, ind_1)}) is saying "for each element of ind_1 do pop.prob using df and all of ind_1", hence the incorrect matrix output. To operate element-wise on ind_1 you would write sapply(ind_1, function(x) {pop.prob(df, x)})
This change alone doesn't work because you extract a column name in your function, and "pp" (the first element) has no column name. To use your function as written, you would need to write:
test <- sapply(1:dim(ind_1)[2], function(x) {pop.prob(df, ind_1[x])})
This way you're iterating in the same manner as your for loop. Note also that you're getting a matrix because sapply attempts to coerce lapply output to a vector or a matrix. If you want a list, simply use lapply.
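As a rough sketch of the list-returning variant, the function can instead take the locus name and the genotype as explicit arguments, so it no longer relies on colnames(). The helper pop_prob_one is hypothetical (it is not in the question), the formulas are copied unchanged from the question's pop.prob(), and df is rebuilt with data.frame() so that value stays numeric.

```r
# Sample data from the question, built so that `value` stays numeric.
ind_1 <- data.frame(V01 = "pp", V02 = "pq", V03 = "pq",
                    stringsAsFactors = FALSE)
df <- data.frame(
  stratum = rep(c("A", "A", "B", "B", "C", "C"), 3),
  locus   = rep(c("V01", "V02", "V03"), each = 6),
  allele  = rep(c("p", "q"), 9),
  value   = rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: locus name and genotype are passed explicitly.
pop_prob_one <- function(df, locus_name, genotype) {
  p <- df$value[df$locus == locus_name & df$allele == "p"]
  # Formulas copied unchanged from the question's pop.prob().
  if (genotype == "pp") {
    2 * p * (1 - p)
  } else if (genotype == "pq") {
    p^2
  } else {
    rep(NA_real_, length(p))
  }
}

# sapply(..., simplify = FALSE) behaves like lapply() but keeps the names.
test <- sapply(names(ind_1),
               function(nm) pop_prob_one(df, nm, ind_1[[nm]][1]),
               simplify = FALSE)
test$V02
# [1] 0.25 0.36 0.04
```

Iterating over names(ind_1) rather than over ind_1 itself is what makes the locus name available inside the function.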
Here's a vectorised data.table solution. Should be much faster than the apply or for versions. Not to mention far more succinct.
library(data.table)
setDT(df)[, value := as.numeric(as.character(value))]
df[allele=='p',
.(prob = {if (ind_1[.GRP]=='pp') 2*value*(1-value) else value^2}),
by = locus]
# locus prob
# 1: V01 0.32
# 2: V01 0.48
# 3: V01 0.42
# 4: V02 0.25
# 5: V02 0.36
# 6: V02 0.04
# 7: V03 0.16
# 8: V03 0.49
# 9: V03 0.25

How to perform a looped function on a list of data frames where the function requires 2 inputs from the data frame list?

I am still fairly new to the more advanced parts of R coding and would like help with writing loops.
I have multiple data frames, and I need to apply the same function to each of them.
Df1 <- data.frame(Col_1=c("A","B","C"), Col_2=c(1:3))
Df2 <- data.frame(Col_1=c("D","E","F"), Col_2=c(4:6))
Df3 <- data.frame(Col_1=c("G","H","I"), Col_2=c(7:9))
Df4 <- data.frame(Col_1=c("J","K","L"), Col_2=c(10:12))
DfList <- list(Df1,Df2,Df3,Df4)
So the data frame has the following format
>print(Df1)
Col_1 Col_2
1 A 1
2 B 2
3 C 3
The function in question requires 2 inputs (2 different data frames contained within the list DfList):
example_function <- function(Dataframe_x,Dataframe_y){
X_Sum_col_2 <- sum(Dataframe_x$Col_2)
Y_Sum_col_2 <- sum(Dataframe_y$Col_2)
ratio <- X_Sum_col_2/Y_Sum_col_2
}
>print(example_function(Df1,Df2))
0.4
My aim is to loop through all possible comparisons of the DfList using the example_function to produce a data frame with the results, something akin to a similarity matrix. Like so:
Df1 Df2 Df3 Df4
Df1 1 2.5 4 5.5
Df2 0.40 1 1.6 2.2
Df3 0.25 0.63 1 1.38
Df4 0.18 0.45 0.73 1
Whenever I attempt this it either tells me I haven't assigned the second variable (not sure how to do it):
>lapply(DfList,function(Dataframe_x,Dataframe_y){
X_Sum_col_2 <- sum(Dataframe_x$Col_2)
Y_Sum_col_2 <- sum(Dataframe_y$Col_2)
ratio <- X_Sum_col_2/Y_Sum_col_2
})
Error in FUN(X[[i]], ...) :
argument "Dataframe_y" is missing, with no default
Or it gives me this error when attempting a for loop:
>for(i in 1:4(DfList)){
example_function(i,i)
}
Error: attempt to apply non-function
Any and all help regarding this problem is appreciated.
Thank you
We can use nested sapply calls:
res <- sapply(DfList, function(x) sapply(DfList, function(y) example_function(x, y)))
nm1 <- paste0("Df", 1:4)
dimnames(res) <- list(nm1, nm1)
round(res, 2)
# Df1 Df2 Df3 Df4
#Df1 1.00 2.50 4.00 5.50
#Df2 0.40 1.00 1.60 2.20
#Df3 0.25 0.62 1.00 1.38
#Df4 0.18 0.45 0.73 1.00
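Because example_function only depends on the two column sums, a possible alternative is to compute each sum once and let outer() form all pairwise ratios. This is only a sketch for this particular function; it would not apply to a function that needs the full data frames. The name col_sums is illustrative.

```r
Df1 <- data.frame(Col_1 = c("A", "B", "C"), Col_2 = 1:3)
Df2 <- data.frame(Col_1 = c("D", "E", "F"), Col_2 = 4:6)
Df3 <- data.frame(Col_1 = c("G", "H", "I"), Col_2 = 7:9)
Df4 <- data.frame(Col_1 = c("J", "K", "L"), Col_2 = 10:12)
DfList <- list(Df1 = Df1, Df2 = Df2, Df3 = Df3, Df4 = Df4)

# Each column sum is needed several times, so compute it once per data frame.
col_sums <- vapply(DfList, function(d) sum(d$Col_2), numeric(1))

# res[i, j] = sum of Df_j divided by sum of Df_i, the same orientation
# as the nested sapply() result above.
res <- outer(col_sums, col_sums, function(a, b) b / a)
dimnames(res) <- list(names(DfList), names(DfList))
round(res, 2)
```

Naming the list elements lets the row and column labels come from names(DfList) instead of being pasted together by hand.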

Split data.frame into groups by column name

I'm new to R. I have a data frame with column names of such type:
file_001 file_002 block_001 block_002 red_001 red_002 ....etc'
0.05 0.2 0.4 0.006 0.05 0.3
0.01 0.87 0.56 0.4 0.12 0.06
I want to split them into groups by the column name, to get a result like this:
group_file
file_001 file_002
0.05 0.2
0.01 0.87
group_block
block_001 block_002
0.4 0.006
0.56 0.4
group_red
red_001 red_002
0.05 0.3
0.12 0.06
...etc'
My file is huge. I don't have a certain number of groups.
It needs to be just by the column name's start.
In base R, you can use sub and split.default like this to return a list of data.frames:
myDfList <- split.default(dat, sub("_\\d+", "", names(dat)))
this returns
myDfList
$block
block_001 block_002
1 0.40 0.006
2 0.56 0.400
$file
file_001 file_002
1 0.05 0.20
2 0.01 0.87
$red
red_001 red_002
1 0.05 0.30
2 0.12 0.06
split.default will split data.frames by variable according to its second argument. Here, we use sub and the regular expression "_\d+" to remove the underscore and all numeric values following it in order to return the splitting values "block", "file", and "red".
As a side note, it is typically a good idea to keep these data.frames in a list and work with them through functions like lapply. See gregor's answer to this post for some motivating examples.
Thank you lmo,
after using your code, it didn't work quite as I wanted, but I came up with a solution thanks to your guidance.
So, in order to split a data frame into a list by column-name prefix:
myDfList <- split.default(dat, sub(x = as.character(names(dat)), pattern = "\\_.*", ""))
hope it'll help people in the future!
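To make the two patterns easy to try, here is a small self-contained sketch. The data frame dat is rebuilt from the sample values in the question, and the names by_digits and by_prefix are just illustrative.

```r
dat <- data.frame(
  file_001  = c(0.05, 0.01), file_002  = c(0.2, 0.87),
  block_001 = c(0.4, 0.56),  block_002 = c(0.006, 0.4),
  red_001   = c(0.05, 0.12), red_002   = c(0.3, 0.06)
)

# Pattern from the first answer: strip an underscore followed by digits.
by_digits <- sub("_\\d+", "", names(dat))
# Pattern from the follow-up: strip everything from the first underscore on.
by_prefix <- sub("_.*", "", names(dat))

# On these column names both patterns give the same grouping labels.
identical(by_digits, by_prefix)
# [1] TRUE

# split.default() treats the data frame as a list of columns and
# groups the columns by label, returning a list of data frames.
myDfList <- split.default(dat, by_prefix)
names(myDfList)
# [1] "block" "file"  "red"
```

The two patterns only differ when a name contains more than one underscore: for a column such as file_a_001, the first keeps "file_a" while the second keeps "file".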

How to calculate the ratio of data in and outside of an interval in R?

I have the following data
Frequency = 260
[1] -9.326550e-03
[2] -4.422175e-03
[3] 9.003794e-03
[4] -1.778217e-03
[5] -4.676712e-03
[6] 1.242704e-02
[7] 5.759863e-03
And I want to count how many of these are in between these:
Frequency = 260
[,1] [,2]
[1] NA NA
[2] 0.010363147 -0.010363147
[3] 0.010072569 -0.010072569
[4] 0.010018997 -0.010018997
[1] 0.009700522 -0.009700522
[5] 0.009476024 -0.009476024
[7] 0.009748085 -0.009748085
I have to do this in R, but I'm a beginner.
Thanks in advance!
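Unless I misunderstand, you want the proportion of cases where the j-th element of your first object lies between the two elements of the j-th row of the second? If so, note that in your data the first column holds the upper bound and the second the lower bound, and that the first row is NA, so
sum((data1 < data2[,1]) & (data1 > data2[,2]), na.rm = TRUE)/length(data1)
will do it.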
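A small worked sketch with the seven values from the question, assuming the rows of the bounds are paired with the data in the order printed and that the first column is the upper bound. The names upper and inside are illustrative.

```r
data1 <- c(-9.326550e-03, -4.422175e-03, 9.003794e-03, -1.778217e-03,
           -4.676712e-03, 1.242704e-02, 5.759863e-03)
upper <- c(NA, 0.010363147, 0.010072569, 0.010018997,
           0.009700522, 0.009476024, 0.009748085)
data2 <- cbind(upper, -upper)

# TRUE where the value lies strictly between lower and upper bound;
# the first comparison is NA because its bounds are missing.
inside <- data1 < data2[, 1] & data1 > data2[, 2]

sum(inside, na.rm = TRUE)
# [1] 5

# Share of values inside their interval, ignoring the row with NA bounds.
mean(inside, na.rm = TRUE)
```

Whether the NA row should count in the denominator is a modelling choice: mean(..., na.rm = TRUE) divides by 6 here, while dividing by length(data1) divides by 7.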
Here's one approach using foverlaps from the package data.table, with the following toy data sets:
library(data.table)
##
set.seed(123)
ts1 <- data.table(
ts(rnorm(50, sd = .1), frequency = 260))[
,V2 := V1]
##
ts2 <- cbind(
ts(rnorm(50,-0.1,.5), frequency=260)
,ts(rnorm(50,0.1,.5), frequency=260))
ts2 <- data.table(
t(apply(ts2, 1, sort)))[
1, c("V1", "V2") := NA]
setkeyv(ts2, c("V1","V2"))
Since foverlaps needs two columns from each of the input data.tables, we just duplicate the first column in ts1 (this is the convention, as far as I'm aware).
fts <- foverlaps(
x = ts1, y = na.omit(ts2)
,type = "within")[
,list(Freq = .N)
,by = "V1,V2"]
This joins ts1 on ts2 for every occurrence of a ts1 value that falls within each of ts2's [V1, V2] intervals - and then aggregates to get a count by interval. Since it is feasible that some of ts2's intervals will contain zero ts1 values (which is the case with this sample data), you can left join the aggregate data back on the original ts2 object, and derive the corresponding proportions:
(merge(x = ts2, y = fts, all.x=TRUE)[
is.na(Freq), Freq := 0][
,Inside := Freq/nrow(ts1)][
,Outside := 1 - Inside])[1:10,]
##
# V1 V2 Freq Inside Outside
# 1: NA NA 0 0.00 1.00
# 2: -1.2545844 -0.37373731 0 0.00 1.00
# 3: -0.9266236 -0.21024328 1 0.02 0.98
# 4: -0.8743764 -0.29245223 0 0.00 1.00
# 5: -0.7339710 0.19230687 50 1.00 0.00
# 6: -0.7103589 0.13898042 50 1.00 0.00
# 7: -0.7089414 -0.26660369 0 0.00 1.00
# 8: -0.7007681 0.58032622 50 1.00 0.00
# 9: -0.6860721 0.01936587 35 0.70 0.30
# 10: -0.6573338 -0.41395304 0 0.00 1.00
I think @nrussell's answer is just fine, but you can accomplish this much more simply using base R, so I'll document it here since you said you're a beginner. I've commented it as well to hopefully help you learn what's going on:
## Set a seed so simulated data can be duplicated:
set.seed(2001)
## Simulate your data to be counted:
d <- rnorm(50)
## Simulate your ranges:
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)
## Sum up the values of d falling inside each row of ranges. The apply
## function takes each row of r, and compares the values of d to the
## bounds of your ranges (lower in the first column, upper in the second)
## and the resulting logical vector is then summed, where TRUEs are equal
## to 1, thus counting the number of values in d falling between each
## set of bounds:
sums <- apply(r, MARGIN=1, FUN=function(x) { sum( d > x[1] & d < x[2] ) })
## Each item of the sums vector refers to the corresponding
## row of ranges in the r object...
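Since the question asks for a ratio rather than a count, the counts can be turned into proportions in one more step. This sketch repeats the simulated data so it runs on its own; the names inside and outside are illustrative.

```r
set.seed(2001)
d <- rnorm(50)
r <- rnorm(10)
r <- cbind(r - 0.1, r + 0.1)

# Count of values in d falling inside each row of r, as above.
sums <- apply(r, MARGIN = 1, FUN = function(x) sum(d > x[1] & d < x[2]))

# Proportion of d inside each range, and the complement outside it.
inside  <- sums / length(d)
outside <- 1 - inside
cbind(lower = r[, 1], upper = r[, 2], inside, outside)
```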

Complex subsetting of dataframe

Consider the following dataframe:
df <- data.frame(Asset = c("A", "B", "C"), Historical = c(0.05,0.04,0.03), Forecast = c(0.04,0.02,NA))
# Asset Historical Forecast
#1 A 0.05 0.04
#2 B 0.04 0.02
#3 C 0.03 NA
as well as the variable x. x is set by the user at the beginning of the R script, and can take two values: either x = "Forecast" or x = "Historical".
If x = "Forecast", I would like to return the following: for each asset, if a forecast is available, return the appropriate number from the column "Forecast", otherwise, return the appropriate number from the column "Historical". As you can see below, both A and B have a forecast value which is returned below. C is missing a forecast value, so the historical value is returned.
Asset Return
1 A 0.04
2 B 0.02
3 C 0.03
If, however, x = "Historical", simply return the Historical column:
Asset Historical
1 A 0.05
2 B 0.04
3 C 0.03
I can't come up with an easy way of doing it, and brute force is very inefficient if you have a large number of rows. Any ideas?
Thanks!
First, pre-process your data:
df2 <- transform(df, Forecast = ifelse(!is.na(Forecast), Forecast, Historical))
Then extract the two columns of choice:
df2[c("Asset", x)]
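Put together, the two steps can be wrapped in a small function. This is only a sketch: get_returns is a hypothetical name, and the check on x is an added assumption about which values are allowed.

```r
df <- data.frame(Asset = c("A", "B", "C"),
                 Historical = c(0.05, 0.04, 0.03),
                 Forecast = c(0.04, 0.02, NA))

# Hypothetical wrapper around the two steps above.
get_returns <- function(df, x) {
  stopifnot(x %in% c("Forecast", "Historical"))
  # Fill missing forecasts with the historical value (vectorised).
  df$Forecast <- ifelse(is.na(df$Forecast), df$Historical, df$Forecast)
  # Select the asset column plus whichever column the user asked for.
  df[c("Asset", x)]
}

get_returns(df, "Forecast")
#   Asset Forecast
# 1     A     0.04
# 2     B     0.02
# 3     C     0.03
```

Because ifelse() works on whole columns at once, this avoids any row-by-row loop even for a large number of assets.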
