Evaluate different logical conditions from string for each row - r

I have a data.frame like this:
value condition
1 0.46 value > 0.5
2 0.96 value == 0.79
3 0.45 value <= 0.65
4 0.68 value == 0.88
5 0.57 value < 0.9
6 0.10 value > 0.01
7 0.90 value >= 0.6
8 0.25 value < 0.91
9 0.04 value > 0.2
structure(list(value = c(0.46, 0.96, 0.45, 0.68, 0.57, 0.1, 0.9,
0.25, 0.04), condition = c("value > 0.5", "value == 0.79", "value <= 0.65",
"value == 0.88", "value < 0.9", "value > 0.01", "value >= 0.6",
"value < 0.91", "value > 0.2")), class = "data.frame", row.names = c(NA,
-9L))
I would like to evaluate the strings in the condition column for every row.
So the result would look like this.
value condition goal
1 0.46 value > 0.5 FALSE
2 0.96 value == 0.79 FALSE
3 0.45 value <= 0.65 TRUE
4 0.68 value == 0.88 FALSE
5 0.57 value < 0.9 TRUE
6 0.10 value > 0.01 TRUE
7 0.90 value >= 0.6 TRUE
8 0.25 value < 0.91 TRUE
9 0.04 value > 0.2 FALSE
I suppose there is a handy NSE solution within the dplyr framework. I have experimented with !! and expr() and others. I got some promising results when trying to subset by condition using
result <- df[0,]
for(i in 1:nrow(df)) {
result <- rbind(result, filter_(df[i,], bquote(.(df$condition[i]))))
}
But I don't like the solution and it's not exactly what I'm after.
I hope someone can help.
UPDATE: I'm trying to avoid eval(parse(..)).

I'm not entirely sure this is what you are looking for, but you can also use lazy_eval() from lazyeval:
df %>%
rowwise() %>%
mutate(res = lazy_eval(sub("value", value, condition)))
value condition res
<dbl> <chr> <lgl>
1 0.46 value > 0.5 FALSE
2 0.96 value == 0.79 FALSE
3 0.45 value <= 0.65 TRUE
4 0.68 value == 0.88 FALSE
5 0.570 value < 0.9 TRUE
6 0.1 value > 0.01 TRUE
7 0.9 value >= 0.6 TRUE
8 0.25 value < 0.91 TRUE
9 0.04 value > 0.2 FALSE
Although it is very close to eval(parse(...)), another possibility is parse_expr() from rlang:
df %>%
rowwise() %>%
mutate(res = eval(rlang::parse_expr(condition)))

One straightforward solution is eval(parse(...)):
library(dplyr)
df %>%
rowwise() %>%
mutate(goal = eval(parse(text = condition)))
# A tibble: 9 x 3
# value condition goal
# <dbl> <chr> <lgl>
#1 0.46 value > 0.5 FALSE
#2 0.96 value == 0.79 FALSE
#3 0.45 value <= 0.65 TRUE
#4 0.68 value == 0.88 FALSE
#5 0.570 value < 0.9 TRUE
#6 0.1 value > 0.01 TRUE
#7 0.9 value >= 0.6 TRUE
#8 0.25 value < 0.91 TRUE
#9 0.04 value > 0.2 FALSE
However, I would recommend reading up on the drawbacks of eval(parse(...)) before using it.

Using match.fun:
# get function, and the value
myFun <- lapply(strsplit(df1$condition, " "), function(i){
list(f = match.fun(i[ 2 ]),
v = as.numeric(i[ 3 ]))
})
df1$goal <- mapply(function(x, y){
x[[ "f" ]](y, x[[ "v" ]]) # [[ ]] extracts the number; [ ] would pass a one-element list
}, x = myFun, y = df1$value)
# value condition goal
# 1 0.46 value > 0.5 FALSE
# 2 0.96 value == 0.79 FALSE
# 3 0.45 value <= 0.65 TRUE
# 4 0.68 value == 0.88 FALSE
# 5 0.57 value < 0.9 TRUE
# 6 0.10 value > 0.01 TRUE
# 7 0.90 value >= 0.6 TRUE
# 8 0.25 value < 0.91 TRUE
# 9 0.04 value > 0.2 FALSE
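A hedged variation on the same idea, written against the df from the question: the operator is checked against a whitelist before match.fun() is called, so an unexpected string in condition stops with an error instead of being looked up as a function. The names allowed, op and bound are illustrative, and the sketch assumes every condition has the form "value <operator> <number>" separated by single spaces.

```r
df <- data.frame(
  value = c(0.46, 0.96, 0.45, 0.68, 0.57, 0.1, 0.9, 0.25, 0.04),
  condition = c("value > 0.5", "value == 0.79", "value <= 0.65",
                "value == 0.88", "value < 0.9", "value > 0.01",
                "value >= 0.6", "value < 0.91", "value > 0.2"),
  stringsAsFactors = FALSE
)

# Split each condition into its three space-separated parts.
parts <- strsplit(df$condition, " ", fixed = TRUE)
op    <- vapply(parts, `[`, character(1), 2)              # comparison operator
bound <- as.numeric(vapply(parts, `[`, character(1), 3))  # right-hand side

# Only these operators are accepted (assumption: no others are needed).
allowed <- c(">", ">=", "<", "<=", "==")
stopifnot(all(op %in% allowed))

# Look up each operator as a function and apply it row by row.
df$goal <- mapply(function(o, v, b) match.fun(o)(v, b),
                  op, df$value, bound, USE.NAMES = FALSE)
df
```

Nothing in condition is parsed as R code here; only the operator name is looked up.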

If you want to avoid eval(parse(...)), you can try this:
library(tidyverse)
df %>% mutate(bound = as.numeric(str_extract(condition, "[0-9 \\.]*$")),
goal = case_when(grepl("==", condition) ~ value == bound,
grepl(">=", condition) ~ value >= bound,
grepl("<=", condition) ~ value <= bound,
grepl(">", condition) ~ value > bound,
grepl("<", condition) ~ value < bound,
TRUE ~ NA)) # TRUE rather than T, which can be reassigned
value condition bound goal
1 0.46 value > 0.5 0.50 FALSE
2 0.96 value == 0.79 0.79 FALSE
3 0.45 value <= 0.65 0.65 TRUE
4 0.68 value == 0.88 0.88 FALSE
5 0.57 value < 0.9 0.90 TRUE
6 0.10 value > 0.01 0.01 TRUE
7 0.90 value >= 0.6 0.60 TRUE
8 0.25 value < 0.91 0.91 TRUE
9 0.04 value > 0.2 0.20 FALSE

Related

How to use an IF function to update columns in a data frame?

predict <- read.table(header=TRUE, text="
0 1
0.44 0.55
0.76 0.24
0.71 0.29
0.75 0.24
0.25 0.75
")
I have attached a sample data frame with 2 columns titled '0' and '1'. I want to use an IF function so that if the value in the '0' column is bigger than 0.7, the cell is updated to 0. Likewise, if the value in the '1' column is bigger than 0.7, the cell is updated to 1. Finally, if neither the '0' nor the '1' value is bigger than 0.7, I would like both cells to be set to -99. Below is an example of what my sample would look like after this IF function was applied.
predict <- read.table(header=TRUE, text="
0 1
-99 -99
0 0.24
0 0.29
0 0.24
0.25 1
")
The code I have attempted is;
if(predict[,1] > 0.7 ){predict[,1] == '0' }
if(predict[,1] > 0.7 ){predict[,2] == '1' }
If you could advise me on the best way to update this IF function that would be really appreciated.
Update
Based on AniGoyal's input (many thanks!), I combined the two answers to produce the exact output the OP asked for:
Code:
predict %>%
as_tibble %>%
mutate(a = case_when(X0 > 0.7 ~ 0,
TRUE ~ ifelse(X0 < 0.7 & X1 < 0.7, -99, X0)),
b = case_when(X1 > 0.7 ~ 1,
TRUE ~ ifelse(X1 < 0.7 & X0 < 0.7, -99, X1))
) %>%
select(X0 = a, X1=b)
Output:
X0 X1
<dbl> <dbl>
1 -99 -99
2 0 0.24
3 0 0.290
4 0 0.24
5 0.25 1
We could use case_when from the dplyr package. mutate changes columns X0 and X1 depending on the case_when condition.
library(dplyr)
predict %>%
mutate(X0 = case_when(X0 > 0.7 ~ 0,
TRUE ~ -99),
X1 = case_when(X1 > 0.7 ~ 1,
TRUE ~ -99)
)
Output:
X0 X1
1 -99 -99
2 0 -99
3 0 -99
4 0 -99
5 -99 1
ifelse
Or we could use ifelse https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse
predict$X0 <- ifelse(predict$X0 > 0.7, 0, -99)
predict$X1 <- ifelse(predict$X1 > 0.7, 1, -99)
predict
Note - numeric names for columns are less desirable ("0" and "1"). Here they are renamed to "X0" and "X1".
One approach with base R is to subset your data for your 3 circumstances, first checking to see if neither are greater than .7 (and set both to -99), then checking the 0 column (set to 0), then checking the 1 column (set to 1):
predict[!(predict$X0 > .7 | predict$X1 > .7), c("X0", "X1")] <- -99
predict[predict$X0 > .7, "X0"] <- 0
predict[predict$X1 > .7, "X1"] <- 1
predict
Output
X0 X1
1 -99.00 -99.00
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00
This is just another way using dplyr:
library(dplyr)
predict %>%
as_tibble() %>%
mutate(X0 = ifelse(X0 > 0.7, 0, X0),
X1 = ifelse(X1 > 0.7, 1, X1)) %>%
mutate(across(X0:X1, ~ ifelse((X0 < 0.7 & X0 != 0) & (X1 < 0.7 & X0 != 0), -99, .)))
X0 X1
<dbl> <dbl>
1 -99 -99
2 0 0.24
3 0 0.290
4 0 0.24
5 0.25 1
There are two errors in the code you are trying:
First, base R's if/else is not vectorised: it expects a single TRUE or FALSE. To check each element of a vector, you have to use it inside a loop.
Second, == is used for assignment. == is a comparison operator, not an assignment operator. Use <- (or =) for assignment.
If you still want to do it in base R's if/else style:
for(i in 1:nrow(predict)){
if(predict[i, 1] > 0.7){
predict[i, 1] = 0
}
if(predict[i,2] > 0.7){
predict[i, 2] = 1
}
if(predict[i, 1] < 0.7 & predict[i, 2] < 0.7 & predict[i, 1] >0){
predict[i, 1] = -99
predict[i, 2] = -99
}
}
> predict
X0 X1
1 -99.00 -99.00
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00
You may also consider using replace like this:
predict[, 1] <- replace(predict[,1], predict[,1] > 0.7, 0)
predict[, 2] <- replace(predict[,2], predict[,2] > 0.7, 1)
predict[, 1] <- replace(predict[, 1], predict[, 2] < 0.7 & predict[, 1] < 0.7 & predict[, 1] > 0, -99)
predict[, 2] <- replace(predict[, 2], predict[, 2] < 0.7 & predict[, 1] < 0.7 & predict[, 1] > 0, -99)
> predict
X0 X1
1 -99.00 0.55
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00
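In the output above, X1 in row 1 stays at 0.55: by the time the fourth replace() runs, X0 in that row is already -99, so the predict[, 1] > 0 part of its condition is FALSE. A minimal sketch of one way around this, assuming the same data, is to compute all three logical masks from the original values before changing anything. The names hi0, hi1 and neither are illustrative.

```r
predict <- data.frame(X0 = c(0.44, 0.76, 0.71, 0.75, 0.25),
                      X1 = c(0.55, 0.24, 0.29, 0.24, 0.75))

# Compute every condition on the untouched data first.
hi0     <- predict$X0 > 0.7
hi1     <- predict$X1 > 0.7
neither <- !hi0 & !hi1

# Then apply the updates; the masks no longer depend on the new values.
predict$X0 <- replace(predict$X0, hi0, 0)
predict$X1 <- replace(predict$X1, hi1, 1)
predict[neither, c("X0", "X1")] <- -99
predict
```

Because the masks are fixed up front, the order of the three updates no longer matters.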

Storing output for nested loop

I'm trying to do a nested loop for logistic regression.
I'm trying to run a loop for the discretization value and for each class.
Here's the code so far... I'm unable to get an output for each different iteration.
class <- c(1,2,3,4,5)
discretization_value <- seq(0.25, 0.75, by =0.05)
output<-data.frame(matrix(nrow=500, ncol=5))
names(output)=c("discretization_value", "class", "var1_coef", "var2_coef", "var3_coef")
for (i in discretization_value){
for (j in class) {
df$discretization_value <- ifelse(df$score >= i,1,0)
result <- (glm(discretization_value ~
var1 + var2 + var3,
data = df[df$class == j,], family= "binomial"))
output[i,1] <- i
output[i,2] <- j
output[i,3] <- coef(summary(result))[c("var1"),c("Estimate")]
output[i,4] <- coef(summary(result))[c("var2"),c("Estimate")]
output[i,5] <- coef(summary(result))[c("var3"),c("Estimate")]
}
}
a snippet of my df
class score var1 var2 var3
1 0.3 0.18 0.33 356
1 0.5 0.22 0.55 33
1 0.6 0.77 0.44 35
2 0.9 0.99 0.55 2
3 0 0 0 0
3 0.4 0.5 0.11 5
4 0 0.6 0 7
4 0 0.6 0 9
4 0.6 0.2 0.1 6
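The subsetting itself looks fine:
data = df[df$class == j,], family= "binomial"))
The comma before the closing square bracket is what selects rows, so it should stay. A more likely problem is that output is indexed by i, which takes fractional values such as 0.25. R truncates these to 0 when they are used as an index, so nothing is stored. Use a separate integer row counter instead.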
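A minimal sketch of the row-counter idea, assuming simulated data in place of the real df (the column names follow the question; the values are made up). k is a hypothetical integer counter, so each (cutoff, class) pair gets its own row instead of being indexed by the fractional i.

```r
# Simulated stand-in for the real data (assumption: same column names).
set.seed(1)
df <- data.frame(
  class = rep(1:5, each = 40),
  score = runif(200),
  var1  = runif(200),
  var2  = runif(200),
  var3  = rpois(200, 20)
)

classes <- 1:5
cutoffs <- seq(0.25, 0.75, by = 0.05)

# One row per (cutoff, class) combination.
output <- data.frame(matrix(NA_real_,
                            nrow = length(cutoffs) * length(classes),
                            ncol = 5))
names(output) <- c("discretization_value", "class",
                   "var1_coef", "var2_coef", "var3_coef")

k <- 0  # integer row counter
for (i in cutoffs) {
  for (j in classes) {
    k <- k + 1
    d_j <- df[df$class == j, ]              # the comma selects rows
    d_j$y <- ifelse(d_j$score >= i, 1, 0)   # binary response for this cutoff
    fit <- glm(y ~ var1 + var2 + var3, data = d_j, family = "binomial")
    output[k, ] <- c(i, j, unname(coef(fit)[c("var1", "var2", "var3")]))
  }
}
head(output)
```

Sizing output from the two loop vectors avoids the unused rows left by a fixed nrow = 500.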

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to change a data frame such that I only include those columns where the first value of the row is the nth largest.
For example, here let's assume I want to only include the columns where the top value in row 1 is the 2nd largest (top 2 largest).
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
Columns a and d should be removed, because 0.1 and NA are not among the two largest values in row 1; the values 0.6 and 0.12 in columns b and c are larger.
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the returning data.frame won't be sorted.
get_nth <- function(df, n) {
df[] <- lapply(df, as.numeric) # edit
cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
cols <- cols[seq(n)]
df <- df[names(df) %in% cols]
return(df)
}
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top nth columns:
> r <- rank(-dat1[1,], na.last=T)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
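A hedged variant that combines the ideas above: unlist() turns the first row into a plain named vector, which avoids relying on how sort() and rank() treat a one-row data frame. The function name keep_top_n is hypothetical. Columns are selected with a logical mask, so their original order is kept.

```r
dat1 <- data.frame(a = c(0.1, 0.2, 0.3, 0.4, 0.5),
                   b = c(0.6, 0.7, 0.8, 0.9, 0.10),
                   c = c(0.12, 0.13, 0.14, 0.15, 0.16),
                   d = c(NA, NA, NA, NA, 0.5))

keep_top_n <- function(df, n) {
  first <- unlist(df[1, ])                       # first row as a named vector
  r <- rank(-first, na.last = "keep", ties.method = "min")
  # Keep columns whose first-row value ranks among the n largest; NA never qualifies.
  df[, !is.na(r) & r <= n, drop = FALSE]
}

res <- keep_top_n(dat1, 2)
res
```

With ties.method = "min", tied values share the better rank, so a tie at the cutoff can return more than n columns.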

Rearranging each row from largest value to smallest value in R

My data frame is set as follows:
Black White Red Blue
0.8 0.1 0.07 0.03
0.3 0.6 0 0.1
0.1 0.6 0.25 0.05
I wanted my data frame to look like this:
Black White Red Blue Color1 Color2 Color3 Color4
0.8 0.1 0.07 0.03 0.8 0.1 0.07 0.03
0.3 0.6 0 0.1 0.6 0.3 0.1 0
0.1 0.6 0.25 0.05 0.6 0.25 0.1 0.05
In which Color1 represents the largest value for each row, Color2 represents the second largest value, Color3 represents the third largest, and Color4 represents the smallest value for each row.
So far, I've used this function to obtain what I wanted, which is the result above:
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
df$Color1 <- apply(df, 1, max)
df$Color2 <- apply(df, 1, function(x)x[maxn(3)(x)])
df$Color3 <- apply(df, 1, function(x)x[maxn(4)(x)])
df$Color4 <- apply(df, 1, function(x)x[maxn(5)(x)])
Is there a more concise way for me to arrange my dataset?
Additionally, a bit off-topic: I'm not sure if it's because this is a CSV file that I'm working with that whenever I use the function
df$Color2 <- apply(df, 1, function(x)x[maxn(2)(x)])
It will return the same result as the function
apply(df, 1, max)
AND
apply(df, 1, function(x)x[maxn(1)(x)])
One option is to use sort with apply, transpose and then cbind with data frame as:
cbind(df, t(apply(df, 1, sort, decreasing = TRUE)))
# Black White Red Blue 1 2 3 4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
Updated: Based on a suggestion from @dww, column names can be assigned as:
df[paste0('color',1:4)] = t(apply(df, 1, sort, decreasing = TRUE))
# Black White Red Blue color1 color2 color3 color4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
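One thing to keep in mind: apply(df, 1, ...) runs over every column currently in df, so once Color1 has been added it is sorted along with the original columns. That would explain why maxn(2) in the question returned the row maximum again. A minimal sketch that restricts the sort to the original columns, assuming the data from the question (cols is an illustrative name):

```r
df <- data.frame(Black = c(0.8, 0.3, 0.1),
                 White = c(0.1, 0.6, 0.6),
                 Red   = c(0.07, 0, 0.25),
                 Blue  = c(0.03, 0.1, 0.05))

cols <- c("Black", "White", "Red", "Blue")   # columns to sort, fixed up front

# Sort only the original columns, so re-running does not pick up Color1..Color4.
sorted <- t(apply(df[cols], 1, sort, decreasing = TRUE))
df[paste0("Color", seq_along(cols))] <- sorted
df
```

Because cols is fixed, running the last two lines again gives the same result.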
It's quite a bit more complex but a speedier solution if you're dealing with a large number of rows is to only do the sorting/ordering once and re-insert it into a matrix shape:
matrix(x[order(-row(x), x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
Some timings:
x <- matrix(rnorm(300000*5), nrow=300000, ncol=5)
system.time(t(apply(x, 1, sort, decreasing=TRUE)))
# user system elapsed
# 14.13 0.00 14.13
system.time(
matrix(x[order(-row(x),x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
)
# user system elapsed
# 0.10 0.00 0.09
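To see why the single order() call works, here is a small check on a hypothetical 4 x 5 matrix. Ordering by -row(x) and then x, both decreasing, lists row 1 first with its values largest to smallest, then row 2, and so on. Filling a matrix by row then restores the shape.

```r
set.seed(42)
x <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Row-by-row sort with apply().
by_apply <- t(apply(x, 1, sort, decreasing = TRUE))

# Single global ordering: primary key is the row index, secondary key is the value.
by_order <- matrix(x[order(-row(x), x, decreasing = TRUE)],
                   nrow = nrow(x), ncol = ncol(x), byrow = TRUE)

all.equal(by_apply, by_order)
```

Negating row(x) is what turns decreasing = TRUE into ascending row order while the values still sort in descending order.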

Subsetting a dataframe based on another dataframe in R

df:
y x
F T
F F
T T
T F
df1:
y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100
df and df1 are the two data frames. And for each row of df, I want to select the matching rows in df1, add the values in column “new”, and store output in a new data frame like this.
df_res:
y x new
F T .104
F F .312
T T .4
T F .2
Kindly help me out! I have been toiling over this for a long time now. The table headers will change according to the variables, so please do not hard-code the table headers.
Thanks.
I don't know how large your data is, but this is one approach.
df<- read.table(text="y x
F T
F F
T T
T F",header=T,sep="")
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
df$yx <- paste0(df$y,df$x)
df1$yx <- paste0(df1$y, df1$x)
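# Sum "new" over the rows of df1 whose key matches each row of df
new <- numeric(nrow(df)) # initialise the result vector before the loop
for (i in seq_len(nrow(df))){
new[i] <- sum(df1[which(df1[,7]==df[i,3]),6])
}
df$new <- new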
df
y x yx new
1 FALSE TRUE FALSETRUE 0.104
2 FALSE FALSE FALSEFALSE 0.312
3 TRUE TRUE TRUETRUE 0.400
4 TRUE FALSE TRUEFALSE 0.200
Using sapply
new <- sapply(1:4, function(x) sum(df1[which(df1[,7]==df[x,3]),6]))
If all you want is the sum for each F/T combination of y and x, this works. Otherwise, please describe the requirement more clearly.
text=" y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100"
df<-read.table(text=text, header=T)
df_res<-aggregate(data=df, new~interaction(y,x),sum)
interaction(y, x) new
1 FALSE.FALSE 0.312
2 TRUE.FALSE 0.200
3 FALSE.TRUE 0.104
4 TRUE.TRUE 0.400
Here's an answer using merge and plyr.
Read in your example data.frame:
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
If I understand correctly, there are 2 steps to what you're asking. The first is to select rows in df1 that match patterns in df. That can be done with merge. The df you gave has all combinations of True and False for x and y. Let's leave one out so we can see the effect:
df <- read.table(text="y x
F T
T T
T F",header=T,sep="")
df_merged <- merge(df, df1, all.y=F)
The result is a new data.frame that omits the rows where both x and y are F. Because merge() defaults to all = FALSE, this is equivalent to an inner join in a SQL database.
y x z probs.x probs.y new
1 FALSE TRUE FALSE 0.08 0.4 0.032
2 FALSE TRUE TRUE 0.12 0.6 0.072
3 TRUE FALSE FALSE 0.20 0.5 0.100
4 TRUE FALSE TRUE 0.20 0.5 0.100
5 TRUE TRUE FALSE 0.40 0.5 0.200
6 TRUE TRUE TRUE 0.40 0.5 0.200
The second part of the question is to group the data and apply a sum to the groups. Plyr is a great tool for this kind of data manipulation:
library(plyr)
ddply(df_merged, .(y,x), function(df) c(new=sum(df$new)))
The dd means we are giving a data.frame and want a data.frame as a result. The next argument .(y,x) is a quoted expression and names the variables we're grouping by. The result is this:
y x new
1 FALSE TRUE 0.104
2 TRUE FALSE 0.200
3 TRUE TRUE 0.400

Resources