Strange "changeable" results in loop in R


Update2
The second set.seed(i) should be replaced with set.seed(i+1), or any other seed that yields a different random series. Otherwise random1 and random2 are identical, so s3 <- sum(data$gene == 0 & data$cancer == 1) will always be 0: every number smaller than 0.08 is also smaller than 0.39.
I didn't correct the code in my original question because this issue is unrelated to the core question of this post.
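A minimal sketch of why reusing the seed forces s3 to 0: with the same seed, the two runif() calls return identical vectors, so any index whose value is below 0.08 is necessarily below 0.39 as well. The seed values 1 and 2 are arbitrary choices for illustration.

```r
# Same seed before both draws: the two series are identical
set.seed(1)
random1 <- runif(500, min = 0, max = 1)
set.seed(1)
random2 <- runif(500, min = 0, max = 1)
same_series <- identical(random1, random2)

gene   <- as.integer(random1 < 0.39)
cancer <- as.integer(random2 < 0.08)
s3_same_seed <- sum(gene == 0 & cancer == 1)   # cannot be anything but 0

# Different seed for the second draw: the two series no longer coincide
set.seed(2)
random2 <- runif(500, min = 0, max = 1)
cancer <- as.integer(random2 < 0.08)
s3_new_seed <- sum(gene == 0 & cancer == 1)    # no longer forced to 0
```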
Update
set.seed(i) is called twice because there are two random number generations, i.e., random1 and random2. However, the results still change between operations, which is strange.
Background:
The code below is about odds ratios, but the statistics are not the focus here. Instead, I find that the results change (!) between operations that should be the same (I suppose they are in fact not, but I cannot figure out why).
Code:
gene <- vector(length = 500, mode = "integer")
cancer <- vector(length = 500, mode = "integer")
data <- data.frame(gene, cancer)
odd_withMutate <- vector(length = 20, mode = "numeric")
odd_noMutate <- vector(length = 20, mode = "numeric")
result <- data.frame(odd_withMutate, odd_noMutate)
for (i in 1:20) {
# set.seed(12)
# set.seed(16)
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
for (j in 1:500) {
if (random1[j] < 0.39){
data[j,1] <- 1
}
if (random2[j] < 0.08){
data[j,2] <- 1
}
}
s1 <- sum(data$gene == 1 & data$cancer == 1) # has the mutated gene & has cancer
s2 <- sum(data$gene == 1 & data$cancer == 0)
s3 <- sum(data$gene == 0 & data$cancer == 1)
s4 <- sum(data$gene == 0 & data$cancer == 0)
result[i,]$odd_withMutate <- s1/s2
result[i,]$odd_noMutate <- s3/s4
}
Different operations:
Operation #1:
If I run the code above, the 12th row of odd_noMutate in result is 0 and the 16th is NaN. To see what happened, I used set.seed(12) or set.seed(16) directly (Operations #2 and #3). But the 0 and NaN disappeared! That is, in Operation #2 the result 0.1638418 0 is not 1.5075377 0, and in Operation #3 the result 0.2830189 0 is not 2.4013605 NaN.
Operation #2:
the changed part of code is:
set.seed(12) #odd_noMutate = 0
# set.seed(16) #odd_noMutate = NaN
# set.seed(i)
random1 <- runif(500, min = 0, max = 1)
set.seed(12)
# set.seed(16)
# set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
Operation #3:
# set.seed(12) #odd_noMutate = 0
set.seed(16) #odd_noMutate = NaN
# set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
set.seed(16)
# set.seed(i) # add this instruction
Operation #4:
I find that even changing the range of i in my code makes the results totally different (shouldn't they be a subset of the original results?). This is Operation #4. Specifically, 0.3092105 0 is not 1.5075377 0; 0.7562724 0 is not 2.4013605 NaN.
for (i in 10:20) {
# set.seed(12) #odd_noMutate = 0
# set.seed(16) #odd_noMutate = NaN
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
The results among these operations are shown below:

The problem is that values written to data in earlier iterations remain and are reused. Recreating data in every iteration of the outer for loop (moving data <- data.frame(gene, cancer) inside the loop) should solve it.
gene <- vector(length = 500, mode = "integer")
cancer <- vector(length = 500, mode = "integer")
# data <- data.frame(gene, cancer)
odd_withMutate <- vector(length = 20, mode = "numeric")
odd_noMutate <- vector(length = 20, mode = "numeric")
result <- data.frame(odd_withMutate, odd_noMutate)
for (i in 1:20) {
data <- data.frame(gene, cancer) # remaking data every time
# set.seed(12)
# set.seed(16)
set.seed(i)
random1 <- runif(500, min = 0, max = 1)
# set.seed(12)
# set.seed(16)
set.seed(i) # add this instruction
random2 <- runif(500, min = 0, max = 1)
for (j in 1:500) {
if (random1[j] < 0.39){
data[j,1] <- 1
}
if (random2[j] < 0.08){
data[j,2] <- 1
}
}
s1 <- sum(data$gene == 1 & data$cancer == 1) # has the mutated gene & has cancer
s2 <- sum(data$gene == 1 & data$cancer == 0)
s3 <- sum(data$gene == 0 & data$cancer == 1)
s4 <- sum(data$gene == 0 & data$cancer == 0)
result[i,]$odd_withMutate <- s1/s2
result[i,]$odd_noMutate <- s3/s4
}
[ADDITION]
Unlike a function, a for loop does not have its own environment.
So assignments inside the loop directly modify objects in the global environment, such as your data.
Your if statements overwrote only part of data, and the leftover values were still there in the next iteration.
Here is a simple example:
data <- data.frame(gene = vector(length = 5, mode = "integer"))
keep_of_process <- list()
for(i in 1:2) {
set.seed(i)
random_val <- runif(5, 0, 1)
for(j in 1:5) {
if(random_val[j] < 0.39) {
data[j, 1] <- 1
}
keep_of_process[[i]] <- data.frame(random = random_val,
gene = data$gene)
}
}
do.call("cbind", keep_of_process) # just to merge process to show
# left is i = 1 and right is i = 2
random gene random gene
1 0.2655087 1 0.1848823 1
2 0.3721239 1 0.7023740 1
3 0.5728534 0 0.5733263 0
4 0.9082078 0 0.1680519 1
5 0.2016819 1 0.9438393 1
Look at row 2: for i = 2, random is 0.7023740 but gene is still 1 (the value from the previous iteration is retained).
So to get what you want (as I understand it), you need to either recreate data (my answer above) or overwrite every element in the if statement, for example:
if(random_val[j] < 0.39) {
data[j, 1] <- 1
} else {
data[j, 1] <- 0
}
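As a further sketch (not part of the original answer), the inner loop and the else branch can both be avoided by recomputing each column from a logical comparison, so every element is reassigned on every pass. It uses set.seed(i + 1) for the second series, as suggested in Update2; the names n_runs and row12 are only illustrative.

```r
n_runs <- 20
result <- data.frame(odd_withMutate = numeric(n_runs),
                     odd_noMutate   = numeric(n_runs))

for (i in seq_len(n_runs)) {
  set.seed(i)
  random1 <- runif(500)
  set.seed(i + 1)              # second series differs, as suggested in Update2
  random2 <- runif(500)

  # Every element is assigned on every pass, so nothing is carried over
  gene   <- as.integer(random1 < 0.39)
  cancer <- as.integer(random2 < 0.08)

  result$odd_withMutate[i] <- sum(gene == 1 & cancer == 1) / sum(gene == 1 & cancer == 0)
  result$odd_noMutate[i]   <- sum(gene == 0 & cancer == 1) / sum(gene == 0 & cancer == 0)
}

# Re-running a single seed in isolation now reproduces the corresponding row
set.seed(12); g <- as.integer(runif(500) < 0.39)
set.seed(13); k <- as.integer(runif(500) < 0.08)
row12 <- c(sum(g == 1 & k == 1) / sum(g == 1 & k == 0),
           sum(g == 0 & k == 1) / sum(g == 0 & k == 0))
```

With this form, fixing the seed by hand or changing the range of i should give the same values as the matching rows of the full run, because no state survives between iterations.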

Related

Error in while (e_i$X1 < 12 | e_i$X2 < 12) { : argument is of length zero

In an earlier question (R: Logical Conditions Not Being Respected), I learned how to make the following simulation :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated until it took for Step 1 to be completed
Step 3: Repeat Step 1 and Step 2 100 times
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- 1
while(a < 12 | b < 12) {
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- i + 1
}
x <- c(a,b,i)
res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
Question: Now, I am trying to make a slight modification to the above code - Instead of "a" and "b" being produced separately, I want them to be produced "together" (in math terms: "a" and "b" were being produced from two independent univariate normal distributions, now I want them to come from a bivariate normal distribution).
I tried to modify this code myself:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- 1
while(e_i$X1 < 12 | e_i$X2 < 12) {
e_i = data.frame(mvrnorm(n = 1, c(10,10), Sigma))
e_i$i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
But this is producing the following error:
Error in while (e_i$X1 < 12 | e_i$X2 < 12) { : argument is of length
zero

If I understand your code correctly you are trying to see how many samples occur before both values are >=12 and doing that for 100 trials? This is the approach I would take:
library(MASS)
for(i in 1:100){
n <- 1
while(any((x <- mvrnorm(1, mu=c(10,10), Sigma=diag(0.5, nrow=2)+0.5))<12)) n <- n+1
if(i==1) res <- data.frame("a"=x[1], "b"=x[2], n)
else res <- rbind(res, data.frame("a"=x[1], "b"=x[2], n))
}
Here I am assigning the result of mvrnorm() to x within the while() call. In that same call, any() checks whether either value is less than 12. While that evaluates to TRUE, n (the counter) is increased and the process is repeated. Once it is FALSE, the values are appended to your data.frame and execution goes back to the start of the for-loop.
Regarding your code, the mvrnorm() function is returning a vector, not a matrix, when n=1 so both values go into a single variable in the data.frame:
data.frame(mvrnorm(n = 1, c(10,10), Sigma))
Returns:
mvrnorm.n...1..c.10..10...Sigma.
1 9.148089
2 10.605546
The matrix() function within your data.frame() calls, along with some tweaks to your use of i, will fix your code:
library(MASS)
Sigma = matrix(
c(1,0.5, 0.5, 1), # the data elements
nrow=2, # number of rows
ncol=2, # number of columns
byrow = TRUE) # fill matrix by rows
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:10){
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- 1
while(e_i$X1[1] < 12 | e_i$X2[1] < 12) {
e_i = data.frame(matrix(mvrnorm(n = 1, c(10,10), Sigma), ncol=2))
i <- i + 1
}
x <- c(e_i$X1, e_i$X2 ,i)
res <- rbind(res, x)
}
res = data.frame(res)
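Another way to sidestep the data.frame column-name problem is to keep each draw as a plain numeric vector and index it by position. The following is only a sketch under that assumption; the helper name draw_until is hypothetical, and set.seed(1) is there just to make the run reproducible.

```r
library(MASS)

Sigma <- matrix(c(1, 0.5,
                  0.5, 1), nrow = 2, byrow = TRUE)

# Draw bivariate normal pairs until both components reach `threshold`;
# return the accepted pair and the number of draws it took.
draw_until <- function(threshold, mu = c(10, 10), Sigma) {
  n_draws <- 1
  x <- mvrnorm(n = 1, mu = mu, Sigma = Sigma)   # numeric vector of length 2
  while (x[[1]] < threshold || x[[2]] < threshold) {
    x <- mvrnorm(n = 1, mu = mu, Sigma = Sigma)
    n_draws <- n_draws + 1
  }
  c(a = x[[1]], b = x[[2]], n = n_draws)
}

set.seed(1)
# replicate() returns a 3 x 100 matrix; transpose it to one row per trial
res <- as.data.frame(t(replicate(100, draw_until(12, Sigma = Sigma))))
head(res)
```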

R how to vectorize a function with multiple if else conditions

Hi, I am new to vectorizing functions in R. I have code similar to the following.
library(truncnorm)
library(microbenchmark)
num_obs=10000
Observation=seq(1,num_obs)
Obs_Type=sample(1:4, num_obs, replace=T)
Upper_bound = runif(num_obs,0,1)
Lower_bound=runif(num_obs,2,4)
mean = runif(num_obs,10,15)
df1= data.frame(Observation,Obs_Type,Upper_bound,Lower_bound,mean)
df1$draw_value = 0
Trial_func=function(df1){
for (i in 1:nrow(df1)){
if (df1[i,"Obs_Type"] ==1){
#If Type == 1; then a=-Inf, b = Upper_Bound
df1[i,"draw_value"] = rtruncnorm(1,a=-Inf,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
} else if (df1[i,"Obs_Type"] ==2){
#If Type == 2; then a=-10, b = Upper_Bound
df1[i,"draw_value"] = rtruncnorm(1,a=-10,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
} else if(df1[i,"Obs_Type"] ==3){
#If Type == 3; then a=Lower_bound, b = Inf
df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=Inf,mean= df1[i,"mean"],sd=1)
} else {
#If Type == 4; then a=Lower_bound, b = 10
df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=10,mean= df1[i,"mean"],sd=1)
}
}
return(df1)
}
#Benchmarking
mbm=microbenchmark(Trial_func(df1=df1),times = 10)
summary(mbm)
#For obtaining the new data
New_data=Trial_func(df1=df1)
In the above I am creating a dataframe called df1 initially. I then create a function which takes a dataset (df1). Each observation in the dataset (df1), can be one of four types. This is given by df1$Obs_Type. What I want to do is that based on the Obs_Type, I want to draw values from a truncated normal distribution with a given upper and lower points.
The rules are:
a) When Obs_Type =1; a=-Inf, b = Upper_bound value of observation i.
b) When Obs_Type =2; a=-10, b = Upper_bound value of observation i.
c) When Obs_Type =3; a=Lower_bound value of observation i, b = Inf.
d) When Obs_Type =4; a=Lower_bound value of observation i, b = 10.
Where a = lower bound, b = upper bound;
Additionally, mean of observation i is given by df1$mean and sd = 1.
I am not familiar with vectorizing and was wondering if someone could help me with this a bit. I tried looking at some other examples on SO (for eg. this) but could not figure out what to do when I have multiple conditions.
My original dataset has about 10 million observations and other additional conditions (eg. instead of 4 types, my data has 16 types and the means changes with each type), but I used a simpler example here.
Please let me know if any part of the question requires any additional clarification.

The case_when function in the dplyr package is handy to vectorize this type of multiple if else statements.
Instead of passing the individual values to the "if" statements one can pass the entire vector for a very substantial performance improvement.
Also the case_when improves the readability of the script.
library(dplyr)
Trial_func <- function(df1) {
df1[,"draw_value"] <- case_when(
df1$Obs_Type == 1 ~ rtruncnorm(1,a=-Inf,b=df1[,"Upper_bound"],mean= df1[,"mean"], sd=1),
df1$Obs_Type == 2 ~ rtruncnorm(1,a=-10,b=df1[,"Upper_bound"],mean= df1[,"mean"],sd=1),
df1$Obs_Type == 3 ~ rtruncnorm(1,a=df1[,"Lower_bound"],b=Inf,mean= df1[,"mean"],sd=1),
df1$Obs_Type == 4 ~ rtruncnorm(1,a=df1[,"Lower_bound"],b=10,mean= df1[,"mean"],sd=1)
)
df1
}
Trial_func(df1)

Here is a vectorized way. It creates logical vectors i1, i2, i3 and i4 corresponding to the 4 conditions. Then it assigns the new values to the positions indexed by them.
Trial_func2 <- function(df1){
i1 <- df1[["Obs_Type"]] == 1
i2 <- df1[["Obs_Type"]] == 2
i3 <- df1[["Obs_Type"]] == 3
i4 <- df1[["Obs_Type"]] == 4
#If Type == 1; then a=-Inf, b = Upper_Bound
df1[i1, "draw_value"] <- rtruncnorm(sum(i1), a =-Inf,
b = df1[i1, "Upper_bound"],
mean = df1[i1, "mean"], sd = 1)
#If Type == 2; then a=-10, b = Upper_Bound
df1[i2, "draw_value"] <- rtruncnorm(sum(i2), a = -10,
b = df1[i2 , "Upper_bound"],
mean = df1[i2, "mean"], sd = 1)
#If Type == 3; then a=Lower_bound, b = Inf
df1[i3,"draw_value"] <- rtruncnorm(sum(i3),
a = df1[i3, "Lower_bound"],
b = Inf, mean = df1[i3, "mean"],
sd = 1)
#If Type == 4; then a=Lower_bound, b = 10
df1[i4, "draw_value"] <- rtruncnorm(sum(i4),
a = df1[i4, "Lower_bound"],
b = 10,
mean = df1[i4,"mean"],
sd = 1)
df1
}
In the speed test I have named @Dave2e's answer Trial_func3.
mbm <- microbenchmark(
loop = Trial_func(df1 = df1),
vect = Trial_func2(df1 = df1),
cwhen = Trial_func3(df1 = df1),
times = 10)
print(mbm, order = "median")
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# vect 4.349444 4.371169 4.40920 4.401384 4.450024 4.487453 10 a
# cwhen 13.458946 13.484247 14.16045 13.528792 13.787951 19.363104 10 a
# loop 2125.665690 2138.792497 2211.20887 2157.185408 2201.391083 2453.658767 10 b
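With many more types (the question mentions 16), one subset assignment per type gets repetitive. A possible variation, sketched here on a small made-up data frame, is to build the per-row bound vectors a and b first and then make a single rtruncnorm() call. This assumes rtruncnorm() accepts vector-valued a and b, as the subset calls above already rely on; the call is guarded so the sketch still runs when truncnorm is not installed.

```r
set.seed(42)
n <- 8
df <- data.frame(
  Obs_Type    = c(1, 2, 3, 4, 1, 2, 3, 4),
  Upper_bound = runif(n, 0, 1),
  Lower_bound = runif(n, 2, 4),
  mean        = runif(n, 10, 15)
)

type <- df$Obs_Type
# Lower bound a: -Inf for type 1, -10 for type 2, Lower_bound for types 3 and 4
a <- ifelse(type == 1, -Inf, ifelse(type == 2, -10, df$Lower_bound))
# Upper bound b: Inf for type 3, 10 for type 4, Upper_bound for types 1 and 2
b <- ifelse(type == 3, Inf, ifelse(type == 4, 10, df$Upper_bound))

# One draw per row, using the per-row bounds and means
if (requireNamespace("truncnorm", quietly = TRUE)) {
  df$draw_value <- truncnorm::rtruncnorm(n, a = a, b = b, mean = df$mean, sd = 1)
}
```

Additional types then only need extra branches in the two bound vectors, not extra sampling calls.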

library(truncnorm)
library(microbenchmark)
num_obs=10000
Observation=seq(1,num_obs)
Obs_Type=sample(1:4, num_obs, replace=T)
Upper_bound = runif(num_obs,0,1)
Lower_bound=runif(num_obs,2,4)
mean = runif(num_obs,10,15)
df1= data.frame(Observation,Obs_Type,Upper_bound,Lower_bound,mean)
df1$draw_value = 0
###########################
# Your example
###########################
Trial_func=function(df1, seed=NULL){
if (!is.null(seed)) set.seed(seed)
for (i in 1:nrow(df1)){
if (df1[i,"Obs_Type"] ==1){
#If Type == 1; then a=-Inf, b = Upper_Bound
df1[i,"draw_value"] = rtruncnorm(1,a=-Inf,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
} else if (df1[i,"Obs_Type"] ==2){
#If Type == 2; then a=-10, b = Upper_Bound
df1[i,"draw_value"] = rtruncnorm(1,a=-10,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
} else if(df1[i,"Obs_Type"] ==3){
#If Type == 3; then a=Lower_bound, b = Inf
df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=Inf,mean= df1[i,"mean"],sd=1)
} else {
#If Type == 4; then a=Lower_bound, b = 10
df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=10,mean= df1[i,"mean"],sd=1)
}
}
return(df1)
}
#############################
# Vectorized version
#############################
# for each row-elements define a function
truncated_normal <- function(obs_type, lower_bound, upper_bound, mean, sd) {
if (obs_type == 1) {
rtruncnorm(1, a=-Inf, b=upper_bound, mean=mean, sd=sd)
} else if (obs_type == 2){
rtruncnorm(1, a=-10, b=upper_bound, mean=mean, sd=sd)
} else if(obs_type == 3){
rtruncnorm(1, a=lower_bound, b=Inf, mean=mean, sd=sd)
} else {
rtruncnorm(1, a=lower_bound, b=10, mean=mean, sd=sd)
}
}
# vectorize it
truncated_normal <- Vectorize(truncated_normal)
Trial_func_vec <- function(df, res_col="draw_value", seed=NULL) {
if (!is.null(seed)) set.seed(seed)
df[, res_col] <- truncated_normal(
obs_type=df[, "Obs_Type"],
lower_bound=df[, "Lower_bound"],
upper_bound=df[, "Upper_bound"],
mean=df[,"mean"],
sd=1)
df
}
#Benchmarking
set.seed(1)
mbm=microbenchmark(Trial_func(df=df1),times = 10)
summary(mbm)
set.seed(1)
mbm_vec=microbenchmark(Trial_func_vec(df=df1),times = 10)
summary(mbm_vec)
## vectorization roughly 3x faster!
#For obtaining the new data
set.seed(1) # important so that randomization is reproducible
new_data=Trial_func(df=df1)
set.seed(1) # important so that randomization is reproducible
vec_data=Trial_func_vec(df=df1)
# since in both cases the random number generator is invoked
# exactly once per row in the order of the rows,
# resulting df should be absolutely identical.
all(new_data == vec_data) ## TRUE! They are absolutely identical.
# proving that your code does - in principle - exactly the same
# like my vectorized code
The Benchmarking results
# @Rui Barradas' function
Trial_func2 <- function(df1){
i1 <- df1[["Obs_Type"]] == 1
i2 <- df1[["Obs_Type"]] == 2
i3 <- df1[["Obs_Type"]] == 3
i4 <- df1[["Obs_Type"]] == 4
#If Type == 1; then a=-Inf, b = Upper_Bound
df1[i1, "draw_value"] <- rtruncnorm(sum(i1), a =-Inf,
b = df1[i1, "Upper_bound"],
mean = df1[i1, "mean"], sd = 1)
#If Type == 2; then a=-10, b = Upper_Bound
df1[i2, "draw_value"] <- rtruncnorm(sum(i2), a = -10,
b = df1[i2 , "Upper_bound"],
mean = df1[i2, "mean"], sd = 1)
#If Type == 3; then a=Lower_bound, b = Inf
df1[i3,"draw_value"] <- rtruncnorm(sum(i3),
a = df1[i3, "Lower_bound"],
b = Inf, mean = df1[i3, "mean"],
sd = 1)
#If Type == 4; then a=Lower_bound, b = 10
df1[i4, "draw_value"] <- rtruncnorm(sum(i4),
a = df1[i4, "Lower_bound"],
b = 10,
mean = df1[i4,"mean"],
sd = 1)
df1
}
# @Dave2e's function
library(dplyr)
Trial_func_dplyr <- function(df1) {
df1[,"draw_value"] <- case_when(
df1$Obs_Type == 1 ~ rtruncnorm(1,a=-Inf,b=df1[,"Upper_bound"],mean= df1[,"mean"], sd=1),
df1$Obs_Type == 2 ~ rtruncnorm(1,a=-10,b=df1[,"Upper_bound"],mean= df1[,"mean"],sd=1),
df1$Obs_Type == 3 ~ rtruncnorm(1,a=df1[,"Lower_bound"],b=Inf,mean= df1[,"mean"],sd=1),
df1$Obs_Type == 4 ~ rtruncnorm(1,a=df1[,"Lower_bound"],b=10,mean= df1[,"mean"],sd=1)
)
df1
}
#Benchmarking
set.seed(1)
mbm <- microbenchmark(
loop = Trial_func(df1=df1),
ruy_vect = Trial_func2(df1=df1),
my_vect = Trial_func_vec(df=df1),
cwhen = Trial_func_dplyr(df1=df1),
times=10)
print(mbm, order = "median")
# > print(mbm, order = "median")
# Unit: milliseconds
# expr min lq mean median uq max
# ruy_vect 7.583821 7.879766 11.59954 8.815835 10.33289 36.60468
# cwhen 22.563190 23.103670 25.13804 23.965722 26.49628 30.63777
# my_vect 1326.771297 1373.415302 1413.75328 1410.995177 1484.28449 1506.11063
# loop 4149.424632 4269.475169 4486.41376 4423.527566 4742.96651 4911.31992
# neval cld
# 10 a
# 10 a
# 10 b
# 10 c
# @Rui's vectorized version beats the loop by almost three orders of magnitude!!

How to create combinatory variable in R data.frame?

I have a data.frame that has several variables with zero values. I need to construct an extra variable that would return the combination of variables that are not zero for each observation. E.g.
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
A = c(0, 0, 0, 1, 2),
B = c(0, 1, 0, 42, 0),
C = c(1, 1, 0, 0, 0))
Now I would like to generate the new variable:
df$varCombination <- c("C", "B-C", NA, "A-B", "A")
I thought up something like this, which obviously did not work:
for (i in 1:nrow(df)){
df$varCombination[i] <- paste(names(df[i,2:ncol(df) & > 0]), collapse = "-")
}

This could probably be solved easily using apply(df, 1, fun), but here is an attempt to solve it column-wise instead of row-wise for performance's sake (I once saw something similar done by @alexis_laz but can't find it right now)
## Create a logical matrix
tmp <- df[-1] != 0
## or tmp <- sapply(df[-1], `!=`, 0)
## Preallocate result
res <- rep(NA, nrow(tmp))
## Run per column instead of per row
for(j in colnames(tmp)){
res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}
## Remove the pre-allocated `NA` values from non-NA entries
gsub("NA-", "", res, fixed = TRUE)
# [1] "C" "B-C" NA "A-B" "A"
Some benchmarks on a bigger data set
set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))
library(microbenchmark)
MM <- function(df) {
var_names <- names(df)[-1]
res <- character(nrow(df))
for (i in 1:nrow(df)){
non_zero_names <- var_names[df[i, -1] > 0]
res[i] <- paste(non_zero_names, collapse = '-')
}
res
}
ZX <- function(df) {
res <-
apply(df[,2:ncol(df)]>0, 1,
function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
res[res == ""] <- NA
res
}
DA <- function(df) {
tmp <- df[-1] != 0
res <- rep(NA, nrow(tmp))
for(j in colnames(tmp)){
res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}
gsub("NA-", "", res, fixed = TRUE)
}
microbenchmark(MM(BigDF), ZX(BigDF), DA(BigDF))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# MM(BigDF) 239.36704 248.737408 253.159460 252.177439 255.144048 289.340528 100 c
# ZX(BigDF) 35.83482 37.617473 38.295425 38.022897 38.357285 76.619853 100 b
# DA(BigDF) 1.62682 1.662979 1.734723 1.735296 1.761695 2.725659 100 a

Using apply:
# paste column names
df$varCombination <-
apply(df[,2:ncol(df)]>0, 1,
function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
# convert blank to NA
df$varCombination[df$varCombination == ""] <- NA
# result
df
# firm A B C varCombination
# 1 firm1 0 0 1 C
# 2 firm2 0 1 1 B-C
# 3 firm3 0 0 0 <NA>
# 4 firm4 1 42 0 A-B
# 5 firm5 2 0 0 A

You had the right idea but the logical comparison in your loop wasn't correct.
I've attempted to keep the code fairly similar to what you had before, this should work:
var_names <- names(df)[-1]
df$varCombination <- character(nrow(df))
for (i in 1:nrow(df)){
non_zero_names <- var_names[df[i, -1] > 0]
df$varCombination[i] <- paste(non_zero_names, collapse = '-')
}
> df
firm A B C varCombination
1 firm1 0 0 1 C
2 firm2 0 1 1 B-C
3 firm3 0 0 0
4 firm4 1 42 0 A-B
5 firm5 2 0 0 A
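Note that the loop above leaves an empty string for firm3, whereas the desired output has NA there. One possible variant, sketched with vapply() and returning NA_character_ when no variable is non-zero:

```r
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

var_names <- names(df)[-1]
non_zero  <- as.matrix(df[-1]) != 0     # logical matrix, one row per firm

df$varCombination <- vapply(seq_len(nrow(df)), function(i) {
  hits <- var_names[non_zero[i, ]]
  # No non-zero variable in this row: return NA instead of ""
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = "-")
}, character(1))

df
```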

R: recode previous/following n observations

I have a dataframe of 0/1 dummy variables. Each dummy variable takes the value 1 only once. For each column, I want to set the n observations preceding/following the observation with the value 1 to a particular value (say 1).
So for single vector, with n=1:
c(0, 0, 1, 0, 0)
I would want to get
c(0, 1, 1, 1, 0)
What would be a good general approach with n columns and allowing for a different number of preceding/following observations to replace (e.g n-1 before & n after)?
Thanks for help!

x<-c(0,0,1,0,0)
ind<-which(x==1)
x[(ind-1):(ind+1)]<-1

Another option:
f <- function(x, pre, post) {
idx <- which.max(x)
x[max(1, (idx-pre)):min(length(x), (idx+post))] <- 1
x
}
Sample data:
df <- data.frame(x = c(0, 0, 1, 0, 0), y = c(0, 1, 0, 0, 0))
Application:
df[] <- lapply(df, f, pre=2, post=1)
#df
# x y
#1 1 1
#2 1 1
#3 1 1
#4 1 0
#5 0 0

What you can do is the following:
vec <- c(0, 0, 1, 0, 0)
sapply(1:length(vec), function(i) {
minval <- max(0, i - 1)
maxval <- min(i + 1, length(vec))
return(sum(vec[minval:maxval]))
})
# [1] 0 1 1 1 0
Or to put it in a function (same code but a bit more compact)
f <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
f(vec)
# [1] 0 1 1 1 0
Speedtest
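To compare the different solutions, I quickly ran a benchmark using microbenchmark. At first glance the winner is clearly @Shenglin's code, but note that the call below passes fShenglin without arguments, so that row times only the lookup of the function object rather than a call to it (hence the warning about non-positive execution times). Among the solutions that were actually executed, fHeroka is much faster than my fDavid. Always nice to see simple solutions (as well as to see how complicated some (my) solutions can be).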
fDavid <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
fHeroka <- function(vec){
res <- vec
test <- which(vec==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
}
fShenglin <- function(vec){
ind<-which(vec==1)
vec[(ind-1):(ind+1)]<-1
}
vect <- sample(0:1, size = 1000, replace = T)
library(microbenchmark)
microbenchmark(fHeroka(vect), fDavid(vect), fShenglin)
# # Unit: nanoseconds
# expr min lq mean median uq max
# fHeroka(vect) 38929 42999 54422.57 49546 61755.5 145451
# fDavid(vect) 2463805 2577935 2875024.99 2696844 2849548.5 5994596
# fShenglin 0 0 138.63 1 355.0 1063
# neval cld
# 100 a
# 100 b
# 100 a
# Warning message:
# In microbenchmark(fHeroka(vect), fDavid(vect), fShenglin) :
# Could not measure a positive execution time for 30 evaluations.

This might be a start:
myv <- c(0, 0, 1, 0, 0)
#make a copy
res <- myv
#check where the ones are
test <- which(myv==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
res
> res
[1] 0 1 1 1 0

This could be a solution:
dat<-data.frame(x=c(0,0,1,0,0,0),y=c(0,0,0,1,0,0),z=c(0,1,0,0,0,0))
which_to_change<-data.frame(prev=c(2,2,1),foll=c(1,1,3))
for(i in 1:nrow(which_to_change)){
dat[(which(dat[,i]==1)-which_to_change[i,1]):(which(dat[,i]==1)+which_to_change[i,2]),i]<-1
}
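The loop above does not guard against windows that run past the first or last row. A small sketch that applies a different window to each column and clamps it to the vector's bounds, using Map(); the helper name widen_ones is hypothetical, and each column is assumed to contain exactly one 1.

```r
dat <- data.frame(x = c(0, 0, 1, 0, 0, 0),
                  y = c(0, 0, 0, 1, 0, 0),
                  z = c(0, 1, 0, 0, 0, 0))
pre  <- c(2, 2, 1)   # observations to recode before the 1, per column
post <- c(1, 1, 3)   # observations to recode after the 1, per column

# Hypothetical helper: set `pre` values before and `post` values after the
# single 1 in `col` to 1, clamping the window to the vector's bounds.
widen_ones <- function(col, pre, post) {
  idx <- which(col == 1)[1]
  col[max(1, idx - pre):min(length(col), idx + post)] <- 1
  col
}

# Map() walks the columns and the two window vectors in parallel
dat[] <- Map(widen_ones, dat, pre, post)
dat
```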

Nested for loop in R

I wrote the following code, and I need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7 ) {
x[i] <- rexp(1, rate = mean[i])
u[i] <- (1/1.2)*runif(1, min=0, max=1)
y1 <- min(x,u)
if (y1 == min(x)) {
delta1 <- 1
}
else {
delta1 <- 0
}
if (delta1 == 0)
{
w1 <- NULL
}
else {
if(y1== x[[1]])
{
w1 <- "x1"
}
}
}
output <- cbind(delta1,w1)
output
I want the final output to be 100 rows* 3 columns matrix representing run number, delta1, and w1.
Any thought will be truly appreciated.

Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
# Generating u doesn't require a for loop. Additionally, the (1/1.2) factor
# can be folded into max, since (1/1.2) * U(0, 1) is U(0, 5/6).
u <- runif(7, min=0, max=5/6)
# Generating x doesn't need a loop either. It's better to use apply functions
# when you can!
x <- sapply(rates, function(x) { rexp(1, rate=x) })
y1 <- min(x, u)
# Now we can store the output
output[i, "number"] <- y1
# Two things here:
# 1) use all.equal instead of == to compare floating point numbers
# 2) We initialised the data frame to assume they always came from x.
# So we only need to overwrite it where it comes from u.
if (isTRUE(all.equal(y1, min(u)))) {
output[i, "delta1"] <- 0
output[i, "w1"] <- NA # Can't use NULL in a character vector.
}
}
output

Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
x <- rexp(n = length(Mean), rate = Mean)
u <- runif(n = length(Mean), min = 0, max = 1/1.2)
mx <- min(x)
delta1 <- mx <= min(u)
w1 <- delta1 & mx == x[1]
c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>
