R - Creating and updating a reference value in a for loop

I have the following (simulated) dataset:
m <- 500
n <- 8
df <- data.frame(matrix(sample(0:1, m*n, replace = TRUE), m, n))
df$ID <- c(1:20)
df <- df[order(df$ID), ]
df$round <- c(1:25)
df$payoff <- runif(n = 500, min = 1e-12, max = .9999999999)
First, I want a for loop that compares each row with the one before, so that the output takes value 1 if the payoff of a row is greater than the payoff of the previous row. Then, I want the row with the highest payoff found so far to serve as the reference for the following rows, so that the output now takes value 1 if the payoff of the next row is greater than the payoff of that reference row. The reference needs to be updated progressively, as soon as a new highest value is found.
I managed to build a loop for the first step:
df_split <- split(df, df$ID)
y<-data.frame("ID"=NULL, "round"=NULL, "feedback"=NULL)
for (i in 1:length(df_split)) {
  myvector <- as.matrix(df_split[[i]][-1:-10])
  for (j in 2:nrow(myvector)) {
    feedb <- ifelse(myvector[j, ] > myvector[j-1, ], 1, 0)
    df2 <- data.frame("ID" = i, "round" = j, "feedback" = feedb)
    y <- rbind(y, df2)
  }
}
Now I want to add the second step to the loop: designate the row with the highest payoff found so far as the reference, and compare the next row with that row. As mentioned above, the reference needs to be updated whenever a new highest value is found.
Does anybody have a solution?
Thank you for all your help!
EDIT:
Thank you both @r2evans and @Jon_Spring for your suggestions!
The reason why I am using a loop is that I need to calculate the output for each ID independently (sorry, I forgot to mention).
This is also why I am splitting the original dataframe into 20 dataframes (one per ID).
If I understand your solutions correctly, when the code reaches, for instance, ID = 2, the highest payoff found so far is still the one associated with ID = 1. The same happens for ID = 3, ID = 4, and so forth. The resulting output is then not correct, because the calculation should restart for each ID.
I didn't know about the cummax function, thank you again! I'll try to integrate it into the logic of my loop, which also gives an output column in the form I need.

I don't think you need any loops.
Up front, for reproducibility, I set my random seed with set.seed(1) before generating the frame above. This allows you to see the "exact same" frame as I'm creating below.
head(within(df, {
  isbetter <- c(TRUE, diff(payoff) > 0)
  maxsofar <- cummax(df$payoff)
  maxsofar <- c(0, maxsofar[-length(maxsofar)])
  isbestsofar <- as.integer(payoff > maxsofar)
}), n=20)
# X1 X2 X3 X4 X5 X6 X7 X8 ID round payoff isbestsofar maxsofar isbetter
# 1 0 1 1 0 1 1 1 1 1 1 0.18776846 1 0.0000000 TRUE
# 21 1 1 0 0 0 0 0 1 1 2 0.50475902 1 0.1877685 TRUE
# 41 1 0 0 0 0 1 0 0 1 3 0.02728685 0 0.5047590 FALSE
# 61 1 1 0 0 0 0 1 0 1 4 0.49629785 0 0.5047590 TRUE
# 81 0 0 0 0 1 1 1 0 1 5 0.94735171 1 0.5047590 TRUE
# 101 1 1 1 0 1 1 0 1 1 6 0.38118213 0 0.9473517 FALSE
# 121 1 1 0 1 0 0 1 0 1 7 0.69821373 0 0.9473517 TRUE
# 141 1 1 0 0 1 0 1 1 1 8 0.68876581 0 0.9473517 FALSE
# 161 0 0 0 0 1 0 0 0 1 9 0.47773068 0 0.9473517 FALSE
# 181 0 1 0 1 1 0 0 1 1 10 0.27334761 0 0.9473517 FALSE
# 201 0 1 0 1 1 0 1 0 1 11 0.75691633 0 0.9473517 TRUE
# 221 0 0 1 1 1 0 1 0 1 12 0.24753206 0 0.9473517 FALSE
# 241 0 0 0 1 0 1 1 0 1 13 0.52133948 0 0.9473517 TRUE
# 261 1 1 0 0 1 0 0 0 1 14 0.61284324 0 0.9473517 TRUE
# 281 0 1 0 1 1 0 1 0 1 15 0.09504998 0 0.9473517 FALSE
# 301 1 1 1 0 0 1 0 0 1 16 0.56575876 0 0.9473517 TRUE
# 321 1 0 1 1 0 1 1 1 1 17 0.01687416 0 0.9473517 FALSE
# 341 1 1 0 1 0 1 0 1 1 18 0.19987888 0 0.9473517 TRUE
# 361 0 0 1 1 1 0 0 1 1 19 0.41758380 0 0.9473517 TRUE
# 381 0 0 1 0 1 1 0 0 1 20 0.20550609 0 0.9473517 FALSE
I use within for simple creation/processing of columns within the data.frame; this could just as easily be done directly (df$isbetter <- c(TRUE, diff(df$payoff) > 0)), with dplyr, with data.table, or likely in other ways too. Take your pick; the logic and the outcome should be effectively the same (other than column order, perhaps).
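The question's edit asks for the calculation to restart for each ID. A sketch of one way to adapt the above, staying in base R with ave(), which applies a function within groups (my adaptation, not part of the original answer):
df$maxsofar <- ave(df$payoff, df$ID, FUN = function(p) c(0, head(cummax(p), -1)))  # running max before each row, per ID
df$isbestsofar <- as.integer(df$payoff > df$maxsofar)                              # 1 if this row sets a new best within its ID
df$isbetter <- as.integer(ave(df$payoff, df$ID, FUN = function(p) c(1, diff(p) > 0)))  # 1 if better than the previous row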

df$cummax = cummax(df$payoff)
df$new_max = df$payoff==df$cummax
Edit: added group_by, dplyr pipe
library(dplyr)
df2 <- df %>%
  group_by(ID) %>%
  mutate(cummax = cummax(payoff),
         new_max = payoff == cummax) %>%
  ungroup()
Output, showing what happens when we get to a new ID:
> df2[20:30,]
# A tibble: 11 x 13
X1 X2 X3 X4 X5 X6 X7 X8 ID round payoff cummax new_max
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <lgl>
1 0 0 1 0 1 1 0 0 1 20 0.206 0.947 FALSE
2 1 1 0 1 0 0 1 0 1 21 0.377 0.947 FALSE
3 0 0 1 0 0 0 1 0 1 22 0.0765 0.947 FALSE
4 0 0 1 1 0 0 0 0 1 23 0.145 0.947 FALSE
5 0 0 0 1 0 0 1 0 1 24 0.554 0.947 FALSE
6 1 0 0 0 1 1 1 1 1 25 0.662 0.947 FALSE
7 0 1 1 1 1 0 0 1 2 1 0.736 0.736 TRUE
8 0 1 1 1 1 0 0 0 2 2 0.376 0.736 FALSE
9 1 1 0 0 0 0 0 0 2 3 0.869 0.869 TRUE
10 0 0 1 1 1 0 1 1 2 4 0.795 0.869 FALSE
11 1 1 0 1 1 1 0 1 2 5 0.822 0.869 FALSE
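For comparison, the same grouped logic in data.table, a sketch not taken from the original answer:
library(data.table)
dt <- as.data.table(df)
# compute the running max and the new-max flag within each ID
dt[, c("cummax", "new_max") := .(cummax(payoff), payoff == cummax(payoff)), by = ID]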

Related

Binary Variables Combinations Analysis in R

I have a data set, which has a lot of binary variables. For the ease of illustration, here is a smaller version with only 4 variables:
set.seed(5)
my_data <- data.frame("Slept Well" = sample(c(0,1), 10, TRUE),
                      "Had Breakfast" = sample(c(0,1), 10, TRUE),
                      "Worked out" = sample(c(0,1), 10, TRUE),
                      "Meditated" = sample(c(0,1), 10, TRUE))
In the above, each row corresponds to an observation. I am interested in analysing the frequency of each unique combination of the variables. For example, how many observations said that they both slept well and meditated, but neither had breakfast nor worked out?
I would like to be able to rank the unique combinations from most frequently occurring to the least frequently occurring. What is the best way to go about coding that up?
You can use aggregate.
x <- aggregate(list(n=rep(1, nrow(my_data))), my_data, length)
#x <- aggregate(list(n=my_data[,1]), my_data, length) #Alternative
x[order(-x$n),]
# Slept.Well Had.Breakfast Worked.out Meditated n
#4 0 1 1 0 2
#1 0 0 0 0 1
#2 1 1 0 0 1
#3 0 0 1 0 1
#5 0 0 0 1 1
#6 1 0 0 1 1
#7 0 1 0 1 1
#8 0 0 1 1 1
#9 0 1 1 1 1
What about a dplyr solution:
library(dplyr)
my_data %>%
  # group it
  group_by_all() %>%
  # frequencies
  summarise(freq = n()) %>%
  # order decreasing
  arrange(-freq)
# A tibble: 9 x 5
Slept.Well Had.Breakfast Worked.out Meditated freq
<chr> <chr> <chr> <chr> <int>
1 0 1 1 0 2
2 0 0 0 0 1
3 0 0 0 1 1
4 0 0 1 0 1
5 0 0 1 1 1
6 0 1 0 1 1
7 0 1 1 1 1
8 1 0 0 1 1
9 1 1 0 0 1
Or with data.table:
library(data.table)
res <- setorder(data.table(my_data)[, .(freq = .N), by = names(my_data)], -freq)
res
Slept.Well Had.Breakfast Worked.out Meditated freq
1: 0 1 1 0 2
2: 1 0 0 1 1
3: 0 0 1 0 1
4: 0 0 0 0 1
5: 0 1 0 1 1
6: 0 1 1 1 1
7: 0 0 1 1 1
8: 0 0 0 1 1
9: 1 1 0 0 1
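For completeness, recent dplyr versions (1.0 or later, where across() is available) collapse the grouping, counting, and sorting into a single call; a sketch:
library(dplyr)
# count each unique combination of all columns, sorted by frequency
my_data %>% count(across(everything()), sort = TRUE)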

R dplyr nested dummy coding

I need to recode a data set of test responses for use in another application (a program called BLIMP that imputes missing values). Specifically, I need to represent the test items and subscale assignments with dummy codes.
Here I create a data frame that holds the responses to a 10-item test for two persons in a nested format. These data are a simplified version of the actual input table.
library(tidyverse)
df <- tibble(
  person = rep(101:102, each = 10),
  item = as.factor(rep(1:10, 2)),
  response = sample(1:4, 20, replace = T),
  scale = as.factor(rep(rep(1:2, each = 5), 2))
) %>% mutate(
  scale_last = case_when(
    as.integer(scale) != lead(as.integer(scale)) | is.na(lead(as.integer(scale))) ~ 1,
    TRUE ~ NA_real_
  )
)
The columns of df contain:
person: ID numbers for the persons (10 rows for each person)
item: test items 1-10 for each person. Note how the items are nested within each person.
response: score for each item
scale: the test has two subscales. Items 1-5 are assigned to subscale 1, and items 6-10 are assigned to subscale 2.
scale_last: a code of 1 in this column indicates that the item is the last item in its assigned subscale. This characteristic becomes important below.
I then create dummy codes for the items using the recipes package.
library(recipes)
dum <- df %>%
  recipe(~ .) %>%
  step_dummy(item, one_hot = T) %>%
  prep(training = df) %>%
  bake(new_data = df)
print(dum, width = Inf)
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5 item_X6 item_X7
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0 0 0
# 3 101 3 1 NA 0 0 1 0 0 0 0
# 4 101 1 1 NA 0 0 0 1 0 0 0
# 5 101 1 1 1 0 0 0 0 1 0 0
# 6 101 1 2 NA 0 0 0 0 0 1 0
# 7 101 3 2 NA 0 0 0 0 0 0 1
# 8 101 4 2 NA 0 0 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0 0 0
#13 102 2 1 NA 0 0 1 0 0 0 0
#14 102 3 1 NA 0 0 0 1 0 0 0
#15 102 2 1 1 0 0 0 0 1 0 0
#16 102 1 2 NA 0 0 0 0 0 1 0
#17 102 4 2 NA 0 0 0 0 0 0 1
#18 102 2 2 NA 0 0 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0 0 0
#20 102 3 2 1 0 0 0 0 0 0 0
# item_X8 item_X9 item_X10
# <dbl> <dbl> <dbl>
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 0 0
# 6 0 0 0
# 7 0 0 0
# 8 1 0 0
# 9 0 1 0
#10 0 0 1
#11 0 0 0
#12 0 0 0
#13 0 0 0
#14 0 0 0
#15 0 0 0
#16 0 0 0
#17 0 0 0
#18 1 0 0
#19 0 1 0
#20 0 0 1
The output shows the item dummy codes represented in the columns with the item_ prefix. For downstream processing, I need a further level of recoding. Within each subscale, the items must be dummy-coded relative to the last item of the subscale. Here’s where the scale_last variable comes into play; this variable identifies the rows in the output that need to be recoded.
For example, the first of these rows is row 5, the row for the last item (item 5) in subscale 1 for person 101. In this row the value of column item_X5 needs to be recoded from 1 to 0. In the next row to be recoded (row 10), it is the value of item_X10 that needs to be recoded from 1 to 0. And so on.
I’m struggling for the right combination of dplyr verbs to accomplish this. What’s tripping me up is the need to isolate specific cells within specific rows to be recoded.
Thanks in advance for any help!
We can use mutate_at to replace the values in the "item" columns with 0 where scale_last == 1:
library(dplyr)
dum %>% mutate_at(vars(starts_with("item")), ~replace(., scale_last == 1, 0))
# A tibble: 20 x 14
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0
# 3 101 1 1 NA 0 0 1 0 0
# 4 101 1 1 NA 0 0 0 1 0
# 5 101 3 1 1 0 0 0 0 0
# 6 101 4 2 NA 0 0 0 0 0
# 7 101 4 2 NA 0 0 0 0 0
# 8 101 3 2 NA 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0
#13 102 4 1 NA 0 0 1 0 0
#14 102 4 1 NA 0 0 0 1 0
#15 102 4 1 1 0 0 0 0 0
#16 102 3 2 NA 0 0 0 0 0
#17 102 4 2 NA 0 0 0 0 0
#18 102 1 2 NA 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0
#20 102 4 2 1 0 0 0 0 0
# … with 5 more variables: item_X6 <dbl>, item_X7 <dbl>, item_X8 <dbl>,
# item_X9 <dbl>, item_X10 <dbl>
In base R, we can use lapply
cols <- grep("^item", names(dum))
dum[cols] <- lapply(dum[cols], function(x) replace(x, dum$scale_last == 1, 0))
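mutate_at has since been superseded in dplyr; the same replacement with across() would look like this (a sketch, assuming dplyr 1.0 or later):
# replace the flagged rows with 0 in every item_ column
dum %>% mutate(across(starts_with("item"), ~ replace(.x, scale_last == 1, 0)))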

Correlation with multiple variables and their multiple combinations

Here is an example of a data set for which I want to calculate the correlation between O_data and possible multiple combinations of the M_data columns.
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
I can calculate the correlation between O_data and the individual M_data columns:
correlation <- matrix(NA, ncol = ncol(M_data))
for (i in 1:length(correlation)) {
  correlation[, i] <- cor(O_data, M_data[, i])
}
In addition to this, how can I get the correlation between O_data and possible multiple combinations of M_data set?
Let me clarify the combinations:
cor_M_ab=cor((M_a+M_b),O_data)
cor_M_abc=cor((M_a+M_b+M_c),O_data)
cor_M_abcd=...
cor_M_abcde=...
...
....
cor_M_bcd=..
..
cor_M_eab=...
....
...
I don't want combinations of non-adjacent columns like M_a and M_c; I want the combinations on a contiguous basis, like ab, bc, bcd, abcde, ea, eab, and so on.
Generate the data using set.seed so you can reproduce:
set.seed(42)
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
The tricky part is just keeping things organized. Since you didn't specify, I made a matrix with 5 rows and 31 columns. The rows get the names of the variables in your M_data. Here's the matrix (motivated by: All N Combinations of All Subsets)
M_grid <- t(do.call(expand.grid, replicate(5, 0:1, simplify = FALSE))[-1,])
rownames(M_grid) <- names(M_data)
M_grid
#> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> M_a 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#> M_b 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> M_c 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0
#> M_d 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
#> M_e 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
#> 28 29 30 31 32
#> M_a 1 0 1 0 1
#> M_b 1 0 0 1 1
#> M_c 0 1 1 1 1
#> M_d 1 1 1 1 1
#> M_e 1 1 1 1 1
Now when I do a matrix multiplication of M_data and any column of my M_grid I get a sum of the columns in M_data corresponding to which rows of M_grid have 1's. For example:
as.matrix(M_data) %*% M_grid[, "4"]
gives me the sum of M_a and M_b (note the column is selected by its name, "4"; the positional M_grid[, 4] would pick the column named "5"). I can calculate the correlation between O_data and any of these sums. Putting it all together in one line:
(final <- cbind(t(M_grid), apply(as.matrix(M_data) %*% M_grid, 2, function(x) cor(O_data, x))))
#> M_a M_b M_c M_d M_e
#> 2 1 0 0 0 0 0.066499681
#> 3 0 1 0 0 0 -0.343839423
#> 4 1 1 0 0 0 -0.255957896
#> 5 0 0 1 0 0 0.381614222
#> 6 1 0 1 0 0 0.334916617
#> 7 0 1 1 0 0 0.024198743
#> 8 1 1 1 0 0 0.059297654
#> 9 0 0 0 1 0 0.180676146
#> 10 1 0 0 1 0 0.190656099
#> 11 0 1 0 1 0 -0.140666930
#> 12 1 1 0 1 0 -0.094245439
#> 13 0 0 1 1 0 0.363591787
#> 14 1 0 1 1 0 0.363546012
#> 15 0 1 1 1 0 0.111435827
#> 16 1 1 1 1 0 0.142772457
#> 17 0 0 0 0 1 0.248640472
#> 18 1 0 0 0 1 0.178471959
#> 19 0 1 0 0 1 -0.117930168
#> 20 1 1 0 0 1 -0.064838097
#> 21 0 0 1 0 1 0.404258155
#> 22 1 0 1 0 1 0.348609692
#> 23 0 1 1 0 1 0.114267433
#> 24 1 1 1 0 1 0.131731971
#> 25 0 0 0 1 1 0.241561478
#> 26 1 0 0 1 1 0.229693510
#> 27 0 1 0 1 1 0.001390233
#> 28 1 1 0 1 1 0.030884234
#> 29 0 0 1 1 1 0.369212761
#> 30 1 0 1 1 1 0.354971839
#> 31 0 1 1 1 1 0.166132390
#> 32 1 1 1 1 1 0.182368955
The final column is the correlation of O_data with all 31 possible sums of columns in M_data. You can tell which column is included by seeing which has a 1 under it for that row.
I try not to resort to matrices too much but this was the first thing I thought of.
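If, as the question's edit suggests, only runs of adjacent columns are wanted (ab, bc, bcd, ...), one way is to keep only the grid columns whose 1s form a single contiguous block; a sketch building on the M_grid above (circular combinations like "ea" would need extra handling):
# keep only the combinations whose 1s are one contiguous run
is_contig <- apply(M_grid, 2, function(x) sum(rle(x)$values == 1) == 1)
M_contig <- M_grid[, is_contig, drop = FALSE]
# correlations of O_data with each contiguous column sum
cor_contig <- apply(as.matrix(M_data) %*% M_contig, 2, function(x) cor(O_data, x))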

Calculating means in R, via case/row from multiple variables; count and exclude NA values

I'm trying to calculate participant average scores with the following scheme:
1. Take a series of values from multiple variables (test items),
2. Calculate an average score only for items answered Yes or No,
3. Omit NA values from the mean calculation, yet count their frequency and get the coordinates of all NA values,
4. Store the newfound mean value in a new variable.
I need to do this with binary questions (1 = Yes, 0 = No, -99 = Missing / NA), such as below:
id var1 var2 var3 var4 var5
1 1 0 0 0 0
2 1 1 0 1 1
3 1 0 0 1 0
4 1 0 0 1 0
5 1 0 0 0 0
6 1 1 0 0 1
7 1 1 0 0 1
8 1 1 0 0 0
9 1 0 1 0 1
10 1 0 0 -99 1
11 1 1 0 1 0
12 1 0 0 1 0
13 1 0 0 -99 0
14 1 -99 0 1 1
15 1 0 0 1 0
16 1 0 0 0 1
17 1 0 0 1 0
18 1 0 -99 0 1
19 1 0 0 1 0
20 1 0 0 1 1
21 1 0 0 1 0
22 1 0 0 1 1
23 1 0 0 1 0
24 1 0 0 0 1
25 1 0 0 0 0
26 1 0 0 1 0
27 1 0 0 0 0
28 1 1 0 1 1
And with Likert scale questions (0 = Strongly Disagree / 6 = Strongly Agree, -99 = Missing / NA).
var10 var11 var12 var13 var14
1 1 1 1 0
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 1
2 1 1 1 0
1 1 1 1 0
1 1 1 1 1
2 1 1 1 1
1 1 1 1 0
4 1 1 1 1
4 1 1 1 1
-99 1 1 1 1
1 1 2 1 1
1 4 2 2 0
4 1 1 1 1
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 0
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
Any ideas of how to go about this? I'm sure it can be done by selecting individual columns or by indicating a range of columns from which to draw data. However, I'm inexperienced in writing such a complex, multi-stepped function in R so I'm hoping to get a veteran's advice.
Thanks in advance.
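No answer was recorded here, but a minimal base-R sketch of the usual approach, assuming the binary items sit in columns var1 to var5 of a data frame called dat (the names dat and items are placeholders):
items <- c("var1", "var2", "var3", "var4", "var5")
# recode the missing marker -99 to NA so it cannot distort the mean
dat[items][dat[items] == -99] <- NA
# frequency and (row, column) coordinates of the missing values
na_count  <- sum(is.na(dat[items]))
na_coords <- which(is.na(dat[items]), arr.ind = TRUE)
# per-participant mean over the answered items only
dat$item_mean <- rowMeans(dat[items], na.rm = TRUE)
The Likert items work the same way once -99 is recoded to NA.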

Function ignoring my if condition statement

I have a tree outlined in a data frame as:
number conc knot neg pick
1 1 0 0 1
2 1 0 0 1
3 1 0 0 1
4 3 164 0 1
5 1 0 0 1
6 1 0 0 1
7 3 159 1 1
8 0 0 0 0
9 0 0 0 0
10 3 208 1 1
11 3 181 1 1
12 3 1 1 1
13 3 95 0 1
14 0 0 0 0
15 0 0 0 0
I'm traversing the tree with a recursive function:
printtree <- function(number, tree) {
  if (!is.na(tree[number, 5] != 0)) {
    letssee <- c(tree[number, 1], tree[number, 2], tree[number, 3], tree[number, 4], tree[number, 5])
    print(letssee)
  }
  left <- tree[number, 1]
  if (!is.na(left)) printtree(tree[left, 1] * 2, tree)
  right <- tree[number, 1]
  if (!is.na(right)) printtree(tree[right, 1] * 2 + 1, tree)
}
My if condition should omit lines where the pick column = 0, but they are still printed and I can't figure out why.
Here's the output:
[1] 1 1 0 0 1
[1] 2 1 0 0 1
[1] 4 3 164 0 1
[1] 8 0 0 0 0
[1] 9 0 0 0 0
[1] 5 1 0 0 1
[1] 10 3 208 1 1
[1] 11 3 181 1 1
[1] 3 1 0 0 1
[1] 6 1 0 0 1
[1] 12 3 1 1 1
[1] 13 3 95 0 1
[1] 7 3 159 1 1
[1] 14 0 0 0 0
[1] 15 0 0 0 0
Is it ignoring my if statement because of is.na()? If I don't have the is.na check I get an error for "missing value where TRUE/FALSE needed" so it has to be there.
If tree[number, 5] happens to equal zero, then the inner test
tree[number, 5] != 0
will return FALSE. FALSE is not an NA value, so !is.na(FALSE) will always return TRUE. In other words, the condition as you've written it evaluates to TRUE even when tree[number, 5] is zero.
Maybe try:
if (!is.na(X) & X !=0) {...}
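Applied to the function in the question, the fixed condition might look like this, a sketch that keeps the original traversal logic and uses && so the second test is skipped whenever the value is NA:
printtree <- function(number, tree) {
  pick <- tree[number, 5]
  # print only when the row exists AND pick is non-zero
  if (!is.na(pick) && pick != 0) {
    print(c(tree[number, 1], tree[number, 2], tree[number, 3], tree[number, 4], tree[number, 5]))
  }
  left <- tree[number, 1]
  if (!is.na(left)) printtree(tree[left, 1] * 2, tree)
  right <- tree[number, 1]
  if (!is.na(right)) printtree(tree[right, 1] * 2 + 1, tree)
}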
