Consider the following dataframe:
df <- data.frame(Asset = c("A", "B", "C"), Historical = c(0.05,0.04,0.03), Forecast = c(0.04,0.02,NA))
# Asset Historical Forecast
#1 A 0.05 0.04
#2 B 0.04 0.02
#3 C 0.03 NA
as well as the variable x. x is set by the user at the beginning of the R script, and can take two values: either x = "Forecast" or x = "Historical".
If x = "Forecast", I would like to return the following: for each asset, if a forecast is available, return the appropriate number from the column "Forecast", otherwise, return the appropriate number from the column "Historical". As you can see below, both A and B have a forecast value which is returned below. C is missing a forecast value, so the historical value is returned.
Asset Return
1 A 0.04
2 B 0.02
3 C 0.03
If, however, x = "Historical", simply return the Historical column:
Asset Historical
1 A 0.05
2 B 0.04
3 C 0.03
I can't come up with an easy way of doing it, and brute force is very inefficient if you have a large number of rows. Any ideas?
Thanks!
First, pre-process your data:
df2 <- transform(df, Forecast = ifelse(!is.na(Forecast), Forecast, Historical))
Then extract the two columns of choice:
df2[c("Asset", x)]
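The same idea can be wrapped in a small helper that leaves df untouched. This is only a sketch: the function name pick_return is hypothetical, and it assumes the output column should be called "Return" for both settings of x (the question labels the historical case "Historical").

```r
df <- data.frame(Asset = c("A", "B", "C"),
                 Historical = c(0.05, 0.04, 0.03),
                 Forecast = c(0.04, 0.02, NA))

# Hypothetical helper: returns Asset plus one "Return" column.
pick_return <- function(df, x) {
  ret <- if (x == "Forecast") {
    # Vectorised fallback: use Historical wherever Forecast is NA
    ifelse(is.na(df$Forecast), df$Historical, df$Forecast)
  } else {
    df$Historical
  }
  data.frame(Asset = df$Asset, Return = ret)
}

forecast_view <- pick_return(df, "Forecast")
historical_view <- pick_return(df, "Historical")
print(forecast_view)
```

Because ifelse() works on whole columns at once, no row-by-row loop is needed even for a large number of rows.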
I'm working through the options for formatting all numeric values in a data frame, and not just selected columns. I start with the following base data frame called "c" when running the code beneath it:
> c
A B
1 3.412324 2.234200
2 3.245236 4.234234
Related code:
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c <- data.frame(A=a, B=b, stringsAsFactors = FALSE)
Next, I round all the numbers in the above "c" data frame to 2 decimal places, resulting in data frame "d" shown below with the related code immediately underneath:
> d
A B
[1,] 3.41 2.23
[2,] 3.25 4.23
Related code:
d <- as.data.frame(lapply(c, formatC, decimal.mark =".", format = "f", digits = 2))
d <- sapply(d[,],as.numeric)
Last step, I'd like to express the above data frame "d" in percentages, in a new data frame called "e". I get the below results as a list, using the code shown beneath it.
> e
X.341.0.. X.325.0.. X.223.0.. X.423.0..
1 341.0% 325.0% 223.0% 423.0%
Related code:
e <- as.data.frame(lapply(d*100, sprintf, fmt = "%.1f%%"))
How do I modify the code, in an efficient manner, to leave the data frame structure intact in deriving data frame "e", the way it does when generating data frame "d"? It would be most helpful to see a solution in both base R and dplyr.
I'm pretty sure the issue lies in my use of lapply() in creating data frame "e" (yes, by now I know lapply() spits out lists), but it worked fine in maintaining the data frame structure in creating data frame "d"!
All values in the data frame are formatted the same, so there's no need to subset the columns etc.
The thing with lapply is that it returns a list. A data.frame is a special case of a list, but to make lapply modify a data frame of a particular structure while maintaining that structure, the easiest way to do it is df[] = lapply(df, ...). The extra [] preserves the existing structure.
## d better base version
## `round` is friendlier than `sprintf` - it returns numerics
d = c
d[] = lapply(d, round, 2)
d
# A B
# 1 3.41 2.23
# 2 3.25 4.23
## dplyr version
library(dplyr)
d_dplyr = c %>%
  mutate(across(everything(), ~ round(.x, 2)))
d_dplyr
# A B
# 1 3.41 2.23
# 2 3.25 4.23
## e base
e = d
e[] = lapply(e * 100, sprintf, fmt = "%.1f%%")
e
# A B
# 1 341.0% 223.0%
# 2 325.0% 423.0%
## e dplyr
## in `tidyverse`, we'll use `scales::percent_format`
## which generates a percent conversion function according to specification
## the default already multiplies by 100 and adds the `%` sign
## we just need to specify the accuracy.
## (note that we can easily start from `c` again)
e_dplyr = c %>%
  mutate(across(everything(), scales::percent_format(accuracy = 0.1)))
e_dplyr
# A B
# 1 341.2% 223.4%
# 2 324.5% 423.4%
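To see why the original "e" came out with four columns, it may help to compare the two routes side by side. This is a minimal base R sketch; the names c_df, d_mat, cells, d_df and e_df are made up for the illustration (c_df stands in for the question's "c"). The point it shows is that sapply() simplifies its result to a matrix, and lapply() over a matrix visits every cell, whereas the `[] <-` assignment keeps the data frame shape.

```r
c_df <- data.frame(A = c(3.412324463, 3.2452364),
                   B = c(2.2342, 4.234234))

# sapply() simplifies equal-length numeric results to a matrix
d_mat <- sapply(c_df, round, 2)
# lapply() over a matrix iterates over each cell, not each column
cells <- lapply(d_mat * 100, sprintf, fmt = "%.1f%%")
length(cells)

# Assigning into d_df[] keeps the data frame structure
d_df <- c_df
d_df[] <- lapply(c_df, round, 2)
e_df <- d_df
e_df[] <- lapply(d_df * 100, sprintf, fmt = "%.1f%%")
print(e_df)
```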
I am trying to go through each value in a data frame and based on that value extract information from another data frame. I have code that works for doing nested for loops but I am working with large datasets that run far too long for that to be feasible.
To simplify, I will provide sample data with initially only one row:
ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
# V01 V02 V03
#1 pp pq pq
I also have this data frame:
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
# stratum locus allele value
#1 A V01 p 0.8
#2 A V01 q 0.2
#3 B V01 p 0.6
#4 B V01 q 0.4
#5 C V01 p 0.3
#6 C V01 q 0.7
There are two allele values for each locus and three stratum values for every locus as well, so there are six different values for each locus. The column names of ind_1 correspond to the locus column in df. For each entry in ind_1, I want to return a list of values extracted from the value column in df based on the locus (the column name in ind_1) and the data entry (pp or pq). For each entry in ind_1 there will be three returned values in the list, one for each stratum in df.
My attempted code is as follows:
library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
  p <- df %>%
    filter( locus == colnames(ind_1), allele == "p")
  p <- as.numeric(as.character(p$value))
  if( ind_1 == "pp") {
    prob <- (2 * p * (1-p))
    return(prob)
  } else if ( ind_1 == "pq") {
    prob <- (p^2)
    return(prob)
  }
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )
This code provides a matrix with incorrect values:
V01 V02 V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42
As well as the warning messages:
# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used
Ideally, I would have the following output:
> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25
I've been trying to figure out how to NOT use for loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that I can work on applying this to a data frame such as ind_1 that has multiple rows.
Thank you all, please let me know if the example data are not clear.
EDIT
Here's my code that works with a for loop:
pop.prob.for <- function(df, ind_1){
  prob.list <- list()
  for( i in 1:length(ind_1)){
    p <- df %>%
      filter( locus == colnames(ind_1[i]), allele == "p")
    p <- as.numeric(as.character(p$value))
    if( ind_1[i] == "pp") {
      prob <- (2 * p * (1-p))
    } else if ( ind_1[i] == "pq") {
      prob <- (p^2)
    }
    prob.list[[i]] <- prob
  }
  return(prob.list)
}
pop.prob.for(df, ind_1)
For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1 and save each of the iterations of lists produced as an .rdata file.
There are two issues with your code. One is that your apply function is operating on the wrong object, and the other is that you can't access the name of an element through sapply.
Right now sapply(ind_1, function(x) {pop.prob(df, ind_1)}) is saying "for each element of ind_1, do pop.prob using df and all of ind_1", hence the incorrect matrix output. To operate element-wise on ind_1 you would write sapply(ind_1, function(x) {pop.prob(df, x)}).
This change doesn't work because you extract a column name in your function, and "pp" (the first element) has no column name. To use your function as written, you would need to write:
test <- sapply(1:dim(ind_1)[2], function(x) {pop.prob(df, ind_1[x])})
This way you're iterating in the same manner as your for loop. Note also that you're getting a matrix because sapply attempts to simplify the lapply output to a vector or a matrix. If you want a list, simply use lapply.
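As a rough illustration of the list-returning variant, the genotype and its column name can be passed together with Map(), so the function never has to recover a name from a bare value. This is a sketch only: pop_prob_one and genotypes are hypothetical names, the value column is built as numeric from the start (the question's cbind() turns it into text), and the two formulas are copied from the question as written.

```r
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- data.frame(stratum, locus, allele, value)  # value stays numeric here
ind_1 <- data.frame(V01 = "pp", V02 = "pq", V03 = "pq")

# Hypothetical helper: one genotype plus the name of its locus
pop_prob_one <- function(df, genotype, locus_name) {
  p <- df$value[df$locus == locus_name & df$allele == "p"]
  if (genotype == "pp") 2 * p * (1 - p) else p^2
}

# Named character vector: names are the loci, values the genotypes
genotypes <- vapply(ind_1[1, ], as.character, character(1))
test <- Map(function(g, nm) pop_prob_one(df, g, nm),
            genotypes, names(genotypes))
print(test)
```

Map() keeps the names of its first argument, so the result is a list with elements V01, V02 and V03.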
Here's a vectorised data.table solution. Should be much faster than the apply or for versions. Not to mention far more succinct.
library(data.table)
setDT(df)[, value := as.numeric(as.character(value))]
df[allele == 'p',
   .(prob = {if (ind_1[.GRP] == 'pp') 2*value*(1-value) else value^2}),
   by = locus]
# locus prob
# 1: V01 0.32
# 2: V01 0.48
# 3: V01 0.42
# 4: V02 0.25
# 5: V02 0.36
# 6: V02 0.04
# 7: V03 0.16
# 8: V03 0.49
# 9: V03 0.25
I have this dataset, foo:
foo
assets.198 assets.attributable assets.current
0.98 0.98 0.98
I am trying to create a dataframe that has the names, such as assets.calculated, as the row names, and the numbers such as 0.96 as the column values. I've tried:
foo1 <- data.frame(words = names(foo), correlation = )
What should I put in as equal to correlation?
I'm beginning with R, and am still trying to figure out how to inspect foo and see what type of data I have as well.
Thanks!
Assuming that foo is a named vector, you can use enframe() from the tidyverse:
library(tidyverse)
bar <- enframe(foo)
bar
# A tibble: 3 x 2
name value
<chr> <dbl>
1 assets.198 0.98
2 assets.attributable 0.98
3 assets.current 0.98
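For the base R call in the question, a possible completion is sketched below, again assuming foo is a named numeric vector (it is rebuilt here from the printed values). str() is one way to check what kind of object foo is.

```r
# Assumed reconstruction of foo from the printed output
foo <- c(assets.198 = 0.98, assets.attributable = 0.98, assets.current = 0.98)

# Inspect the object: a named numeric vector shows up as "Named num"
str(foo)

# unname() drops the names so they do not become row names as well
foo1 <- data.frame(words = names(foo),
                   correlation = unname(foo),
                   stringsAsFactors = FALSE)
print(foo1)
```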
I have dynamic intervals in a data frame, generated by calculating percentages of my data, like below:
Start Finish
0.00 0.86
0.87 0.89
0.90 0.98
0.99 1.00
I have a vector with about 3000 numbers, and I want to count how many numbers fall into each interval without using a loop, because that is too slow.
Numbers<-c(0.1,0.2,0.3,0.7,0.8,0.9,0.91,0.99)
Expected result in this case: 5,0,2,1....
You can use apply() to go through your start-finish data.frame, check whether the numbers are between the start and finish values, and sum up the logical vector returned by data.table's between() function.
Numbers<-c(0.1,0.2,0.3,0.7,0.8,0.9,0.91,0.99)
sf <-
read.table(text =
"Start Finish
0.00 0.86
0.87 0.89
0.90 0.98
0.99 1.00",
header = TRUE
)
apply(sf, 1, function(x) {
sum(data.table::between(Numbers, x[1], x[2]))
})
This will return:
5 0 2 1
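If you would rather avoid the data.table dependency, the same counts can be sketched in base R with outer(), which compares every number against every interval bound in one vectorised step. The name in_interval is made up for this example, and both bounds are treated as inclusive, which is an assumption about the intended interval semantics.

```r
Numbers <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.91, 0.99)
sf <- data.frame(Start = c(0.00, 0.87, 0.90, 0.99),
                 Finish = c(0.86, 0.89, 0.98, 1.00))

# Logical matrix: one row per number, one column per interval
in_interval <- outer(Numbers, sf$Start, ">=") & outer(Numbers, sf$Finish, "<=")

# Column sums give the count of numbers inside each interval
counts <- colSums(in_interval)
print(counts)
```

The intermediate matrix has length(Numbers) times nrow(sf) cells, which is small for about 3000 numbers and a handful of intervals.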
We can use foverlaps (here df is the Start/Finish data frame):
library(data.table)
setDT(df)
dfN <- data.table(Start = Numbers, Finish = Numbers)
setkeyv(df, names(df))
setkeyv(dfN, names(dfN))
foverlaps(df, dfN, which = TRUE, type = "any")[, sum(!is.na(yid)), xid]$V1
#[1] 5 0 2 1
I created a dataframe that looks like this:
# Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000146396 0.20 431801 3
ENSMUSG00000089809 ENSMUST00000161516 0.23 354036 2
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000117098 0.05 4400 2
ENSMUSG00000044681 ENSMUST00000141196 0.10 1118 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
Now I would like to select, for each GeneID, the TrID that has the highest PSI value, along with its respective Ranking. At the end the output will be like this:
# Desired Output Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
After that, I will create a distribution of the ranking values and check which PSI value each rank corresponds to. I will permute the Length values and the TrID values in order to perform a control of the distribution.
You can use base R and do:
byGeneID = split(seq_len(nrow(Dataframe)), Dataframe$GeneID)
whichTopPsi = sapply(byGeneID, function(i) i[which.max(Dataframe[i, 'PSI'])])
Dataframe[whichTopPsi,]
You could also use ddply, which is more general.
require(plyr)
ddply(Dataframe, "GeneID", function(d) d[which.max(d[,'PSI']),,drop=FALSE])
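Another base R option, sketched here on the sample data from the question, uses ave() to compute the per-gene maximum PSI and then filters on it in a single vectorised comparison. Unlike which.max(), this keeps every row that ties for the maximum within a gene, so treat that as a design choice to check against your data. The names group_max and top are made up for the example.

```r
Dataframe <- read.table(text = "
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000146396 0.20 431801 3
ENSMUSG00000089809 ENSMUST00000161516 0.23 354036 2
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000117098 0.05 4400 2
ENSMUSG00000044681 ENSMUST00000141196 0.10 1118 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
", header = TRUE, stringsAsFactors = FALSE)

# ave() returns, for every row, the maximum PSI within that row's GeneID
group_max <- ave(Dataframe$PSI, Dataframe$GeneID, FUN = max)

# Keep the rows whose PSI equals their group's maximum
top <- Dataframe[Dataframe$PSI == group_max, ]
print(top)
```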