How can I vectorize this task in R?

For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n){
if (curr_cut != pred[i]) {
j <- j + 1
curr_cut <- pred[i]
}
if (grp[i] == levels(grp)[1])
class_1_c <- class_1_c + 1
else
class_2_c <- class_2_c + 1
class_1[j] <- class_1_c
class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute, for each unique value in pred, the cumulative number of times each level of grp has appeared. For example, I get the following output for the code above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have a few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!

You can start by getting the unique group/pred counts using a table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
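As a minimal sketch of how the table approach maps back onto the vectors from the question: putting pred first in the table lets cumsum run down each column, so no transpose is needed. The names cum and cuts are only illustrative, and this assumes (as in the example) that pred is sorted, since table() orders the unique values itself.

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Count each (pred, grp) pair, then accumulate down the pred values
# (one column per level of grp)
cum <- apply(table(pred, grp), 2, cumsum)

cuts <- as.numeric(rownames(cum))   # unique pred values, in sorted order
class_1 <- unname(cum[, 1])         # running count for the first level of grp
class_2 <- unname(cum[, 2])         # running count for the second level of grp

cat("index:", cuts, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
```

This should print the same three lines as the loop in the question.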


ifelse with sorted values by row

I have a dataframe (example data):
id <- c(1, 2, 3)
ex1 <- c(0.8, 0.2, 0.3)
ex2 <- c(0.1, 0.4, 0.04)
ex3 <- c(0.04, 0.3, 0.5)
ex <- c(1, 1, 1)
ran <- c(0.5, 0.7, 0.6)
dat <- data.frame(id, ex1, ex2, ex3, ex, ran)
dat
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 1 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
I want to change the values of "ex" with an if-else condition. "ex" should change to 5 (an arbitrary value) when "ran" is smaller than or equal to the highest of the ex$ variables and also greater than the sum of the other two ex$ variables (the second-largest value added to the smallest value). Here are examples for all ids, beginning with id 1:
dat$ex <- ifelse(dat$ran <= dat$ex1 & dat$ran > dat$ex2 + dat$ex3, 5, dat$ex)
Here, ex1 is the largest value, followed by ex2 and then ex3.
For id 2, it should be:
dat$ex <- ifelse(dat$ran <= dat$ex2 & dat$ran > dat$ex3 + dat$ex1, 5, dat$ex)
Here, ex2 is the largest value, followed by ex3 and then ex1.
For id 3:
dat$ex <- ifelse(dat$ran <= dat$ex3 & dat$ran > dat$ex1 + dat$ex2, 5, dat$ex)
Here, ex3 is the largest value, followed by ex1 and then ex2.
Now to the problem: How to generalize the ifelse-statement? Note: It is important that the summation of the two smaller values is performed as implemented in the examples. I need to identify the sorted values for ex1, ex2 and ex3 within ifelse by id.
Here is one way to achieve the task using dplyr and tidyr:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(
cols = ex1:ex3
) %>%
arrange(id, desc(value)) %>%
group_by(id) %>%
mutate(ex = ifelse(ran <= value[1] & ran > sum(value[2], value[3]), 5, ex)) %>%
pivot_wider(
names_from=name
)
output:
id ex ran ex1 ex2 ex3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 0.5 0.8 0.1 0.04
2 2 1 0.7 0.2 0.4 0.3
3 3 1 0.6 0.3 0.04 0.5
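We may use pmax to get the row-wise maximum of the ex1 to ex3 columns. Note that this checks only the first condition (ran <= largest value), not the sum of the two smaller values: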
library(dplyr)
library(purrr)
dat %>%
mutate(ex= case_when(ran <=invoke(pmax, across(matches('^ex\\d+'))) ~ 5,
TRUE ~ ex))
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 5 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
exes <- t(apply(subset(dat, select = grep("^ex.+", names(dat))), 1, function(z) c(max(z), sum(z[-which.max(z)]))))
exes
# [,1] [,2]
# 1 0.8 0.14
# 2 0.4 0.50
# 3 0.5 0.34
ifelse(dat$ran <= exes[,1] & dat$ran > exes[,2], 5, dat$ex)
# 1 2 3
# 5 1 1
Walk-through:
subset(dat, select = ...) extracts columns from a frame regardless of its type (e.g., data.frame, tbl_df, or data.table) and without the risk of dropping the frame to a plain vector (compare mtcars[,2], which is no longer a frame); there are other ways to do this, some in base R, some in packages such as dplyr or data.table.
apply(..., 1, ...) calls the function on each row of the selected columns; with MARGIN=1 (the second argument), each row's result becomes a column of the output, so we need t(.) to transpose it back into one row per id.
exes is now a matrix whose first column contains the max of the ex# variables and whose second column contains the sum of the non-max ex# variables.
From here, I think the use of exes is the "general" solution you were looking for.
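The same two quantities can also be computed without apply. The following is only a sketch, assuming the ex# columns are numeric with no NAs; the names ex_cols, top and rest are illustrative. The row-wise maximum comes from pmax, and the sum of the remaining values is the row sum minus that maximum.

```r
dat <- data.frame(id = c(1, 2, 3),
                  ex1 = c(0.8, 0.2, 0.3),
                  ex2 = c(0.1, 0.4, 0.04),
                  ex3 = c(0.04, 0.3, 0.5),
                  ex = c(1, 1, 1),
                  ran = c(0.5, 0.7, 0.6))

# Select ex1, ex2, ex3 but not ex itself
ex_cols <- grep("^ex[0-9]+$", names(dat), value = TRUE)

# Row-wise maximum across the selected columns
top <- do.call(pmax, unname(as.list(dat[ex_cols])))

# Sum of the non-max values: total of the row minus its maximum
rest <- unname(rowSums(dat[ex_cols])) - top

dat$ex <- ifelse(dat$ran <= top & dat$ran > rest, 5, dat$ex)
dat$ex
```

Only id 1 satisfies both conditions here, so ex becomes 5 for that row and stays 1 for the others.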

Syntax in R for prediction

How do I perform this loop in RStudio using an R library? My dataset has over 100,000 rows, so I need efficient syntax that produces the same result as this for loop.
The goal is to use the previous row's value to predict column d for all following rows where data is not available.
# df is a dataframe with columns b,c,d,p.
d = c(1, 2, 4, NA, NA)
b = c(1,1,1,2,2)
c=c(1,1,1,1,1)
df= data.frame(cbind(b,c,d))
df$p <- c(0.1,0.2,0.1,0.1,0.3)
for(i in 1:(nrow(df)-1)) {
if (df$b[i + 1] > df$c[i + 1]) {
df$d[i + 1] = df$d[i] * (1 - df$p[i + 1])
} else{
df$d[i + 1] = df$d[i+1]
}
}
This vectorized code is much faster than a loop. Note that it stores its result in a new column a and always reads the previous row's original d, so its output differs from the question's loop as written, which updates d in place.
inx <- seq_len(nrow(df))[-1]
b_greater <- df$b[inx] > df$c[inx]
df$a[inx] <- df$d[inx - 1]
df$a[inx][b_greater] <- df$d[inx - 1][b_greater] * (1 - df$p[inx][b_greater])
df
# b c d p a
#1 1 1 1 0.1 NA
#2 1 1 2 0.2 1.0
#3 1 1 4 0.1 2.0
#4 2 1 NA 0.1 3.6
#5 2 1 NA 0.3 NA
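The loop in the question updates d in place, so every overwritten row depends on the (possibly already overwritten) row before it. A minimal sketch of a vectorized equivalent follows, assuming the loop is the intended behavior; fill_forward is a hypothetical helper name. Each run of overwritten rows is the last untouched d value multiplied by a running product of (1 - p).

```r
df <- data.frame(b = c(1, 1, 1, 2, 2),
                 c = c(1, 1, 1, 1, 1),
                 d = c(1, 2, 4, NA, NA),
                 p = c(0.1, 0.2, 0.1, 0.1, 0.3))

# Hypothetical helper: same result as running the question's loop over d
fill_forward <- function(d, b, cc, p) {
  upd <- c(FALSE, (b > cc)[-1])     # rows the loop overwrites (never row 1)
  run <- cumsum(!upd)               # each run starts at a row that keeps its d
  shrink <- ifelse(upd, 1 - p, 1)   # per-row multiplier, 1 for untouched rows
  anchor <- d[!upd][run]            # last kept d value, repeated along its run
  anchor * ave(shrink, run, FUN = cumprod)
}

df$d <- fill_forward(df$d, df$b, df$c, df$p)
df$d
# d is now 1, 2, 4, 3.6, 2.52 (up to floating-point rounding)
```

Because cumsum, ifelse and ave with cumprod all work on whole vectors, this avoids the row-by-row loop while keeping the dependence on earlier rows.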

How to call up different dataframes in loops in r and add to them different computational results?

How can I automate the steps below?
I have the following example of what I would like to do - in the end get a dataframe made up of smaller dataframes that are generated automatically in earlier steps. These smaller dataframes need also calculations done in them before they are aggregated. I can do all manually with a long script, but can't seem to figure out how to combine properly list(), apply() or for() loops to get the result I wanted (not sure those are the best option here).
Please advise.
Thank you!
########### MY QUESTION IN DETAILED CODE
# DATASET
a <- c(2.0, 2.4, 2.1, 2.2, 2.3)
b <- c(4.0, 0, 4.5, 4.4, 4.8)
c <- c(0.3, 0.2, 2.0, 2.1, 2.3)
d <- c(5.0, 4.8, 4.8, 4.9, 5.0)
test.data <- data.frame(rbind(a,b,c,d))
#STEP 1: create separate dfs and do different calculations by column in each
#LONG WAY, MANUAL
# calculates % difference between each value with respect to first value in row
# in df1, then second value in row for df2, etc.
nc <- ncol(test.data)
df1 <- (test.data[,1:nc] - test.data[[1]])/(test.data[[1]])*100
df2 <- (test.data[,1:nc] - test.data[[2]])/(test.data[[2]])*100
df3 <- (test.data[,1:nc] - test.data[[3]])/(test.data[[3]])*100
df4 <- (test.data[,1:nc] - test.data[[4]])/(test.data[[4]])*100
df5 <- (test.data[,1:nc] - test.data[[5]])/(test.data[[5]])*100
# some results from above give Inf (since divided by zero), so set those to NA
df1[df1==Inf] <- NA
df2[df2==Inf] <- NA
df3[df3==Inf] <- NA
df4[df4==Inf] <- NA
df5[df5==Inf] <- NA
#next will filter each calculated %-value by the specified percent difference filter
# and save the results in separate associated dataframes.
percent.diff <- 30
df.A1 <- data.frame(ifelse(df1 > -percent.diff & df1 < percent.diff, 1, 0))
df.A2 <- data.frame(ifelse(df2 > -percent.diff & df2 < percent.diff, 1, 0))
df.A3 <- data.frame(ifelse(df3 > -percent.diff & df3 < percent.diff, 1, 0))
df.A4 <- data.frame(ifelse(df4 > -percent.diff & df4 < percent.diff, 1, 0))
df.A5 <- data.frame(ifelse(df5 > -percent.diff & df5 < percent.diff, 1, 0))
#next add ID columns to each of the newly created dataframes
obs <- 4
#add row and df ID variables to each of the above
df.A1["df.cat"] <- 1
df.A1["row"] <- 1:obs
df.A2["df.cat"] <- 2
df.A2["row"] <- 1:obs
df.A3["df.cat"] <- 3
df.A3["row"] <- 1:obs
df.A4["df.cat"] <- 4
df.A4["row"] <- 1:obs
df.A5["df.cat"] <- 5
df.A5["row"] <- 1:obs
#combine the individual dataframes with IDs into a single dataframe.
Combo.df <-list(df.A1, df.A2, df.A3, df.A4, df.A5)
All.df <- Reduce(rbind, Combo.df)
FINAL OUTPUT SHOULD LOOK LIKE THIS (only first few rows shown)
X1 X2 X3 X4 X5 df.cat row
a 1 1 1 1 1 1 1
b 1 0 1 1 1 1 2
c 1 0 0 0 0 1 3
d 1 1 1 1 1 1 4
a1 1 1 1 1 1 2 1
b1 1 1 1 1 1 2 2
c1 0 1 0 0 0 2 3
d1 1 1 1 1 1 2 4
a2 1 1 1 1 1 3 1
b2 1 0 1 1 1 3 2
c2 0 0 1 1 1 3 3
d2 1 1 1 1 1 3 4
FAILED ATTEMPT TO TRY TO AUTOMATE ABOVE STEPS
# a) created the number of dataframes I will need
num.reps <- 5
obs <- 4
n.cols <- 5
lst <- replicate(num.reps,data.frame(matrix(NA, nrow = obs, ncol = n.cols)), simplify=FALSE)
names(lst) <- paste0('df', 1:num.reps)
list2env(lst, envir = .GlobalEnv)
# b) fill dataframes (not sure how to call up dataframe by sequential names in loop)
# THIS DOES NOT WORK
f.diff.calc <- function(i)
{df[[i]] <-(df[,1:nc] - df[[i]])/(df[[i]])*100}
diff.calc.list <- replicate(5, f.diff.calc(list))
#Error in `[.data.frame`(df, , 1:nc) : undefined columns selected
This is a simplification of your code and as far as I can see it does what you want.
fun1 <- function(col, DF = test.data){
res <- 100*(DF - DF[[col]])/DF[[col]]
is.na(res) <- is.infinite(as.matrix(res))
res
}
fun2 <- function(DF, percent.diff = 30){
data.frame(ifelse(-percent.diff < DF & DF < percent.diff, 1, 0))
}
df_list <- lapply(seq_len(ncol(test.data)), fun1)
names(df_list) <- paste0("df", seq_along(df_list))
#next will filter each calculated %-value by the specified percent difference filter
# and save the results in a list of dataframes.
percent.diff <- 30
df.A_list <- lapply(df_list, fun2)
#next add ID columns to each of the newly created dataframes
tmp <- names(df.A_list)
df.A_list <- lapply(seq_along(df.A_list), function(i){
df.A_list[[i]][["df.cat"]] <- i
df.A_list[[i]][["row"]] <- seq_len(nrow(df.A_list[[i]]))
df.A_list[[i]]
})
names(df.A_list) <- tmp
# combine the results in one dataframe
All.df <- do.call(rbind, df.A_list)
Well I sincerely think with a bit more research you could have solved it. Also I cannot recreate the exact output you were getting, but I was able to match the output I was getting using your code.
Here is the automated version of the code.
a <- c(2.0, 2.4, 2.1, 2.2, 2.3)
b <- c(4.0, 0, 4.5, 4.4, 4.8)
c <- c(0.3, 0.2, 2.0, 2.1, 2.3)
d <- c(5.0, 4.8, 4.8, 4.9, 5.0)
test.data <- data.frame(rbind(a,b,c,d))
# calc() post-processes one % difference dataframe: Inf -> NA, apply the
# percent.diff filter, then add the df.cat and row ID columns.
# The first loop below builds df1..df5 (the % difference with respect to column i);
# the second loop runs calc() on each and stacks the results.
nc <- ncol(test.data)
calc<-function(x,percent.diff=30,i){
x[x==Inf] <- NA
obs<-4
x.A<- data.frame(ifelse(x > -percent.diff & x < percent.diff, 1, 0))
x.A$df.cat<-i
x.A$row<-1:obs
return(x.A)
}
output<-data.frame()
for(i in 1:5){
assign(paste('df',i,sep=""),(test.data[,1:nc] - test.data[[i]])/(test.data[[i]])*100)
}
for(i in 1:5){
output<-rbind.data.frame(output,calc(x = get(paste('df',i,sep="")),percent.diff = 30,i=i))
}

Create a vector of counts

I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
> 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough, because it omits values that never occur in x (here, 1).
Or with tabulate which is faster
tabulate(factor(x, levels = min(x):max(x)))
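A small sketch comparing the two approaches over a fixed range. It assumes the range of interest (0 to 3, held in the illustrative variable vals) is known in advance rather than taken from min(x) and max(x), so a value missing from x at either end still gets a zero count.

```r
x <- c(3, 0, 2, 0, 0)
vals <- 0:3   # the integers to count, fixed up front

# table() on a factor keeps one entry per level, including zero counts
counts_table <- table(factor(x, levels = vals))

# tabulate() counts positive integer bins, so shift x so that min(vals) maps to bin 1
counts_tab <- tabulate(x - min(vals) + 1, nbins = length(vals))

counts_table
counts_tab
```

Both give the counts 3, 0, 1, 1 for the values 0, 1, 2 and 3; the table version also carries the values as names.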
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
#> data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
#x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))

Data frame subset with specified sum of elements

Having a data frame like this:
df <- data.frame(a=c(31, 18, 0, 1, 20, 2),
b=c(1, 0, 0, 3, 1, 1),
c=c(12, 0, 9, 8, 10, 3))
> df
a b c
1 31 1 12
2 18 0 0
3 0 0 9
4 1 3 8
5 20 1 10
6 2 1 3
How can I take a random subset so that the sum over the selected rows and columns is equal to a given value, e.g., 100?
As I understand your question, you're trying to sample a subset of the rows and columns of your data frame so that the selected cells sum to a target value.
You can use integer optimization to accomplish this. You'll have a binary decision variable for each row, column, and cell, and constraints to force the cell values to be equal to the product of the row and column values. I'll use the lpSolve package to do this, because it has a convenient mechanism to get multiple optimal solutions. We can then use the sample function to select between them:
library(lpSolve)
get.subset <- function(dat, target) {
nr <- nrow(dat)
nc <- ncol(dat)
nvar <- nr + nc + nr*nc
# Cells upper bounded by row and column variable values (r and c) and lower bounded by r+c-1
mat <- as.matrix(do.call(rbind, apply(expand.grid(seq(nr), seq(nc)), 1, function(x) {
r <- x[1]
c <- x[2]
pos <- nr + nc + (r-1)*nc + c
ltc <- rep(0, nvar)
ltc[nr + c] <- 1
ltc[pos] <- -1
ltr <- rep(0, nvar)
ltr[r] <- 1
ltr[pos] <- -1
gtrc <- rep(0, nvar)
gtrc[nr + c] <- 1
gtrc[r] <- 1
gtrc[pos] <- -1
return(as.data.frame(rbind(ltc, ltr, gtrc)))
})))
dir <- rep(c(">=", ">=", "<="), nr*nc)
rhs <- rep(c(0, 0, 1), nr*nc)
# Sum of selected cells equals target
mat <- rbind(mat, c(rep(0, nr+nc), as.vector(t(dat))))
dir <- c(dir, "=")
rhs <- c(rhs, target)
res <- lp(objective.in=rep(0, nvar), # Feasibility problem
const.mat=mat,
const.dir=dir,
const.rhs=rhs,
all.bin=TRUE,
num.bin.solns=100 # Number of feasible solutions to get
)
if (res$status != 0) {
return(list(rows=NA, cols=NA, subset=NA, num.sol=0))
}
sol.num <- sample(res$num.bin.solns, 1)
vals <- res$solution[seq((sol.num-1)*nvar+1, sol.num*nvar)]
rows <- which(vals[seq(nr)] >= 0.999)
cols <- which(vals[seq(nr+1, nr+nc)] >= 0.999)
return(list(rows=rows, cols=cols, subset=dat[rows,cols], num.sol=res$num.bin.solns))
}
The function returns the randomly selected subset together with num.sol, the number of subsets found with that sum (limited by the num.bin.solns=100 argument):
set.seed(144)
get.subset(df, 1)
# $rows
# [1] 1
# $cols
# [1] 2
# $subset
# [1] 1
# $num.sol
# [1] 14
get.subset(df, 100)
# $rows
# [1] 1 2 4 5
# $cols
# [1] 1 3
# $subset
# a c
# 1 31 12
# 2 18 0
# 4 1 8
# 5 20 10
# $num.sol
# [1] 2
get.subset(df, 10000)
# $rows
# [1] NA
# $cols
# [1] NA
# $subset
# [1] NA
# $num.sol
# [1] 0
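For a frame this small, the result can be cross-checked without a solver. The sketch below is a plain brute-force enumeration, not the integer-programming method above, and brute.subsets is a hypothetical helper name. It tries every non-empty combination of rows and columns, so it is only practical for a handful of rows and columns (the number of combinations doubles with each one added).

```r
df <- data.frame(a = c(31, 18, 0, 1, 20, 2),
                 b = c(1, 0, 0, 3, 1, 1),
                 c = c(12, 0, 9, 8, 10, 3))

# Hypothetical helper (not part of lpSolve): list every non-empty row/column
# selection whose cells sum to `target`
brute.subsets <- function(dat, target) {
  m <- as.matrix(dat)
  nr <- nrow(m)
  nc <- ncol(m)
  hits <- list()
  for (rmask in seq_len(2^nr - 1)) {
    # Decode the bits of rmask into the selected row indices
    rows <- which((rmask %/% 2^(seq_len(nr) - 1)) %% 2 == 1)
    for (cmask in seq_len(2^nc - 1)) {
      # Decode the bits of cmask into the selected column indices
      cols <- which((cmask %/% 2^(seq_len(nc) - 1)) %% 2 == 1)
      if (sum(m[rows, cols]) == target) {
        hits[[length(hits) + 1]] <- list(rows = rows, cols = cols)
      }
    }
  }
  hits
}

hits <- brute.subsets(df, 100)
pick <- hits[[sample(length(hits), 1)]]   # random choice among the matches
df[pick$rows, pick$cols, drop = FALSE]
```

The selection of rows 1, 2, 4, 5 with columns a and c shown above is among the matches, and an unreachable target such as 10000 yields an empty list.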
