How to repeat code in R while fulfilling a condition across repeats

I need to repeat a code 24 times (for 24 different participants), making sure that overall, for each Scene2 in each Trial and Route, I have the same number of 1 and 0 across the columns Random of each participant (i.e., Part.1, Part.2, Part.3, etc.) when the Target is equal to 0.
Here is the code I am using:
Scene2 = rep(c(1:10), times=9)
myDF2 <- data.frame(Scene2)
myDF2$Target <- rep(0,10, each=9)
myDF2$Target[myDF2$Scene2==7] <- 1
myDF2$Trial <- rep(c(1:9),each=10)
myDF2$Route <- rep(LETTERS[1:6], each=10, length=nrow(myDF2))
library(dplyr)
myDF3 <- myDF2 %>% group_by(Trial, Route) %>% mutate(Random = ifelse(myDF2$Target==0,sample(c(rep(0,5),rep(1,5))),1)) %>% as.data.frame()
I need to obtain something like this:
Scene2 Target Trial Route Part.1 Part.2 Part.3 Part.4 … Part.24 Tot.1 Tot.0
1 0 1 A 0 1 1 0 0 12 12
2 0 1 A 1 0 1 0 0 12 12
3 0 1 A 1 0 0 0 0 12 12
4 0 1 A 0 1 0 1 0 12 12
5 0 1 A 1 0 1 1 0 12 12
6 0 1 A 1 0 0 0 1 12 12
7 1 1 A 1 1 1 1 1 24 0
8 0 1 A 0 0 1 1 1 12 12
9 0 1 A 0 1 1 1 1 12 12
10 0 1 A 0 1 0 0 1 12 12
How to achieve this? Any suggestion would be very much appreciated.

Since there's some conditional logic here that needs to meet particular specifications, I think this is easier to do with a function.
Scene2 = rep(c(1:10), times=9)
myDF2 <- data.frame(Scene2)
myDF2$Target <- rep(0,10, each=9)
myDF2$Target[myDF2$Scene2==7] <- 1
myDF2$Trial <- rep(c(1:9),each=10)
myDF2$Route <- rep(LETTERS[1:6], each=10, length=nrow(myDF2))
library(tidyverse)
fill_random_columns <- function(df, reps) {
  # Balanced bag of 0s and 1s to shuffle for each row (reps must be even)
  bag <- rep(c(0, 1), each = reps / 2)
  # One row per row of df: all 1s when Target == 1, otherwise a shuffled bag
  row_vector <- as.data.frame(t(sapply(df$Target, function(v) {
    if (v == 1) rep(1, reps) else sample(bag)
  })))
  # Name the participant columns
  names(row_vector) <- paste0("Part.", seq_len(reps))
  # Count the 1s in each row
  row_vector$Total <- rowSums(row_vector)
  # Attach to original data frame
  bind_cols(df, row_vector)
}
myDF3 <- myDF2 %>%
group_by(Trial, Route) %>%
fill_random_columns(., 24)
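The balance condition can be checked directly. Below is a minimal base R sketch of the same idea, assuming an even number of participants; the names parts, tot1 and tot0 are made up for illustration and the seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
reps <- 24                       # number of participants (assumed even)
# Target pattern from the question: Scene2 == 7 is the target, 9 trials
target <- rep(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0), times = 9)
# Balanced bag that is shuffled independently for each non-target row
bag <- rep(c(0, 1), each = reps / 2)
parts <- t(sapply(target, function(v) if (v == 1) rep(1, reps) else sample(bag)))
colnames(parts) <- paste0("Part.", seq_len(reps))
# Row totals of 1s and 0s across the participant columns
tot1 <- rowSums(parts)
tot0 <- reps - tot1
```

Because every non-target row is a permutation of the same bag, tot1 and tot0 are both 12 on those rows by construction, whatever the seed.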

Related

How to perform the equivalent of Excel countifs with multiple conditions in dplyr?

In the below reproducible code, I would like to add a column for CountIfs as detailed in the below image, whereby the Excel countifs() formula in the image has multiple conditions with the tops of the specified ranges "anchored". Any recommendations for how to do the same in dplyr? I'm sure it requires grouping but unsure of how to handle the multiple conditions. The blue below shows the current reproducible code output, the yellow shows what I would like to add, and the non-highlighted shows the underlying countifs() formula.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Name = c("R","R","T","R","N","N","T"),
Group = c(0,0,0,0,1,1,0),
Code = c(0,2,2,0,3,0,4)
)
myDataRender <- myData %>%
group_by(Group) %>%
mutate(CodeGrp = if_else(Group == 0, 0, max(Code)))
print.data.frame(myDataRender)
Posting alternative dataframe and Excel countifs() example:
> myData1
Name Group Code1 Code2
1 B 0 0 0
2 R 1 1 0
3 R 1 1 1
4 R 2 3 0
5 R 2 3 1
6 B 0 0 0
7 A 3 1 0
8 A 3 1 1
9 A 0 0 0
10 A 0 0 0
11 A 0 0 0
myData1 <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A","A","A"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
Code1 = c(0,1,1,3,3,0,1,1,0,0,0),
Code2 = c(0,0,1,0,1,0,0,1,0,0,0)
)
Desired output rendered in XLS:
Small change to Rui Barrada's solution to make his for-loop code exactly reflect the COUNTIFS() formula in the XLS example:
fun_aux <- function(x,y) {
out <- integer(length(x))
for(i in seq_along(x)) {
cond1 <- y[1:i] > 0
cond2 <- x[1:i] == x[i] # the y in this line in Rui's code is changed to x (for x[i]) to correspond with XLS countifs() example
out[i] <- sum(cond1*cond2)
}
out
}
Equivalent of above for-loop function using base sapply():
myData1 %>% mutate(CountIfs = sapply(seq_len(n()), function(x) sum(Code2[1:x] > 0 & Code1[1:x] == Code1[x])))
Try the following.
Vectorized code like what dplyr or the tidyverse uses is not suited for rolling functions, where the current result depends on the previous ones. So I have written an auxiliary function fun_aux to take care of computing the expected result.
suppressPackageStartupMessages(
library(dplyr)
)
myData1 <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A","A","A"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
Code1 = c(0,1,1,3,3,0,1,1,0,0,0),
Code2 = c(0,0,1,0,1,0,0,1,0,0,0)
)
fun_aux <- function(x, y) {
out <- integer(length(x))
for(i in seq_along(x)) {
cond1 <- y[1:i] > 0
cond2 <- x[1:i] == y[i]
out[i] <- sum(cond1*cond2)
}
out
}
myDataRender <- myData1 %>%
mutate(CountIfs = fun_aux(Code1, Code2))
print.data.frame(myDataRender)
#> Name Group Code1 Code2 CountIfs
#> 1 B 0 0 0 0
#> 2 R 1 1 0 0
#> 3 R 1 1 1 1
#> 4 R 2 3 0 0
#> 5 R 2 3 1 1
#> 6 B 0 0 0 0
#> 7 A 3 1 0 0
#> 8 A 3 1 1 2
#> 9 A 0 0 0 0
#> 10 A 0 0 0 0
#> 11 A 0 0 0 0
Created on 2022-08-21 by the reprex package (v2.0.1)
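For the COUNTIFS-style variant from the question (the one that compares Code1[1:i] with Code1[i]), the loop can probably be avoided altogether. This is a sketch in base R, assuming the condition really is "rows so far with the same Code1 and Code2 > 0"; it uses ave() with cumsum and is not taken from either post.

```r
# Data from the question (Group is not needed for the count)
myData1 <- data.frame(
  Name  = c("B","R","R","R","R","B","A","A","A","A","A"),
  Code1 = c(0,1,1,3,3,0,1,1,0,0,0),
  Code2 = c(0,0,1,0,1,0,0,1,0,0,0)
)
# Running count, within each Code1 value, of the rows seen so far with Code2 > 0
myData1$CountIfs <- ave(as.integer(myData1$Code2 > 0), myData1$Code1, FUN = cumsum)
myData1
```

Note that this follows the x[i] version of fun_aux, so row 7 differs from the reprex output above, which was produced with the y[i] comparison.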

flag the 10 highest and 10 lowest values from a column in r

I'm looking for a command in R with which I can flag the 10 (or n) highest and 10 lowest values. I found this post, which does something very similar to what I'm trying to do. The post suggests that is.max could do what I want, but I couldn't find that command in the R documentation.
Has it been replaced by another command? Is there any other command in tidyverse or dplyr I could try?
Thanks!
library(tidyverse)
# generate data
set.seed(666)
rnorm(20) %>% as.data.frame() -> x
# choose breakpoint (e.g. top 10)
n <- 10
x %>% arrange(x) %>%
mutate(rnum = row_number()) %>%
mutate(bottom_n = ifelse(rnum %in% seq(1, n), 1, 0)) %>%
mutate(top_n = ifelse(rnum %in% seq( n()-n+1, n()), 1, 0)) %>%
select(-rnum)
Here, we first order the values in ascending order, and create a helper column for row numbers. bottom_n=1 are rows that have their row number between 1 and n; top_n are the rows that have their row number between n()-n+1 and n(), where n() is the length of the input vector.
Output:
. bottom_n top_n
1 -2.21687445 1 0
2 -1.79224083 1 0
3 -1.77023084 1 0
4 -1.72015590 1 0
5 -1.30618526 1 0
6 -0.80251957 1 0
7 -0.58245269 1 0
8 -0.35513446 1 0
9 -0.07582656 1 0
10 -0.04203245 1 0
11 0.13412567 0 1
12 0.34490035 0 1
13 0.75331105 0 1
14 0.75839618 0 1
15 0.78617038 0 1
16 0.85830054 0 1
17 0.86465359 0 1
18 2.01435467 0 1
19 2.02816784 0 1
20 2.15004262 0 1
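If sorting the data is not wanted, the same flags can be sketched in base R with rank(). The example data and the names values and flags are made up for illustration; ties are broken by position here, which is an assumption about the desired behaviour.

```r
values <- c(5, 1, 9, 3, 7)   # made-up example data
n <- 2                       # how many extremes to flag at each end
flags <- data.frame(
  value    = values,
  # rank 1 is the smallest value, so ranks 1..n are the n lowest
  bottom_n = as.integer(rank(values, ties.method = "first") <= n),
  # ranking the negated values puts the largest value at rank 1
  top_n    = as.integer(rank(-values, ties.method = "first") <= n)
)
flags
```

Unlike the arrange() approach, the rows stay in their original order.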

How to do a running calculation across vectors in R?

I created this data frame:
Count <- c(1:10)
Give <- c(0,0,5,0,0,5,0,5,0,5)
X <- c(rep(0,10))
Y <- c(rep(0,10))
Z <- c(rep(0,10))
X_Target <- 5
Y_Target <- 10
Z_Target <- 5
Basically I have 3 vectors (X, Y, Z) and a target for each one of them.
I want a new calculation for X, Y and Z that is based on the vector Give.
Whenever the number in Give is bigger than 0, it needs to be added to vector X until X equals X_Target. Then the calculation needs to move to the next vector (Y) and do the same, and then to the next vector, and so on.
The output should be like the following:
Count Give X Y Z
1 0 0 0 0
2 0 0 0 0
3 5 5 0 0
4 0 5 0 0
5 0 5 0 0
6 5 5 5 0
7 0 5 5 0
8 5 5 10 0
9 0 5 10 0
10 5 5 10 5
In this example I have only 3 vectors, but please keep in mind that I'll have at least 60 vectors, so I need it to be as automatic as it can be.
I hope I managed to explain myself :)
Thanks!
It's ugly, but it gives the desired result.
tab1 = data.frame(
Count = c(1:10),
Give = c(0,0,5,0,0,5,0,5,0,5),
X = c(rep(0,10)),
Y = c(rep(0,10)),
Z = c(rep(0,10))
)
targets <- c(5,10,5)
tab2 <- tab1
start <- 2
for(col in 3:ncol(tab2)) {
target <- targets[col-2]
for(row in start:nrow(tab2)) {
if(tab2[row, 2] > 0 && tab2[row, col] < target) {
# add the Give amount (column 2) to the previous row's value, capped at the target
tab2[row, col] <- pmin(tab2[row - 1, col] + tab2[row, 2], target)
} else {
tab2[row, col] <- tab2[row - 1, col]
}
}
start <- which(tab2[, 2] > 0 & tab2[, col] == target)[2]
}
> tab2
Count Give X Y Z
1 1 0 0 0 0
2 2 0 0 0 0
3 3 5 5 0 0
4 4 0 5 0 0
5 5 0 5 0 0
6 6 5 5 5 0
7 7 0 5 5 0
8 8 5 5 10 0
9 9 0 5 10 0
10 10 5 5 10 5
Turn it into a data frame:
tab1 = data.frame(
Count = c(1:10),
Give =c(0,0,5,0,0,5,0,5,0,5),
X = c(rep(0,10)),
Y = c(rep(0,10)),
Z = c(rep(0,10))
)
# create a list of targets for looping
targets = c(X_Target, Y_Target, Z_Target)
Without using data.table you can just put the whole thing in a loop. It will work, but be much slower.
# loop through each column
for(col in seq(1,length(targets))){
print(col)
# loop through each row
for(row in seq(1, dim(tab1[2+col])[1])){
# condition
while(tab1[row,(2+col)] < targets[col] & tab1[row,2]>0){
tab1[row,(2+col)] = tab1[row,(2+col)] +tab1[row,2]
}
}
}
Here is something else to try, using tidyverse.
Put your data into long form, and include targets with a join.
In a loop through Count, find the first row for a given Count that is below target. For current and following rows that have matching names (X, Y, or Z), add Give amount.
In the end, put result back into wide form.
library(tidyverse)
df <- data.frame(Count, Give, X, Y, Z) %>%
pivot_longer(cols = X:Z) %>%
left_join(data.frame(X_Target, Y_Target, Z_Target) %>%
pivot_longer(cols = everything(),
names_to = c("name", ".value"),
names_pattern = "(\\w+)_(\\w+)"))
for (i in seq_along(Count)) {
below_target <- min(which(df$Count == i & df$value < df$Target))
name_rows <- which(df$name == df[below_target, "name", drop = T])
rows_to_change <- name_rows[name_rows >= below_target]
df[rows_to_change, "value"] <- df[rows_to_change, "value"] + df[below_target, "Give", drop = T]
}
df %>%
pivot_wider(id_cols = Count)
Output
Count X Y Z
<int> <dbl> <dbl> <dbl>
1 1 0 0 0
2 2 0 0 0
3 3 5 0 0
4 4 5 0 0
5 5 5 0 0
6 6 5 5 0
7 7 5 5 0
8 8 5 10 0
9 9 5 10 0
10 10 5 10 5
My approach was to make use of the cumulative sums of the Give and then track if that exceeds the targeted values for the columns. Then do some cleaning up.
targets <- c(X_Target, Y_Target, Z_Target)
targets_0 <- c(0, targets)
csum_give <- cumsum(Give)
# from cumsum give take off sum of previous targets
result <- sapply(1:length(targets),
function(x) csum_give - sum(targets_0[1:x]))
# Set max value to target max of column
sapply(1:length(targets),
function(x) result[result[, x] > targets[x], x] <<- targets[x])
# set min value to zero
result[which(result < 0)] <- 0
result
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 0 0 0
# [3,] 5 0 0
# [4,] 5 0 0
# [5,] 5 0 0
# [6,] 5 5 0
# [7,] 5 5 0
# [8,] 5 10 0
# [9,] 5 10 0
# [10,] 5 10 5
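The cumulative-sum idea can also be written without `<<-`, by clamping with pmax() and pmin(). This is a sketch under the assumption that targets are filled strictly in order and Give is never negative; fill_to_targets is a hypothetical helper name, not something from the posts above.

```r
# Hypothetical helper: spread the running total of `give` over the targets in order
fill_to_targets <- function(give, targets) {
  # Amount that must already be handed out before each column starts filling
  offsets <- cumsum(c(0, head(targets, -1)))
  total <- cumsum(give)
  # For each column: subtract its offset, then clamp to the range [0, target]
  out <- sapply(seq_along(targets), function(j) {
    pmin(pmax(total - offsets[j], 0), targets[j])
  })
  colnames(out) <- names(targets)
  out
}

Give <- c(0, 0, 5, 0, 0, 5, 0, 5, 0, 5)
result <- fill_to_targets(Give, c(X = 5, Y = 10, Z = 5))
result
```

Since it only needs the vector of targets, the same call should scale to 60 or more columns without further changes.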

use a string character-location identity to create a new variable

So I have been able to achieve my desired output, but I am sure that one can use a string to achieve a much more efficient code.
Let's play with this data:
set.seed(123)
A <- 1:100
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A, type.a, type.b, type.c, type.d)
Now we want to create a new variable for df1 that identifies whether any of type.a to type.d begins with the digit 1. So I have used this code:
df1$Type_1 <- with(df1, ifelse((type.a < 2000 & type.a > 999)|(type.b < 2000 & type.b > 999)|
(type.c < 2000 & type.c > 999)|(type.d < 2000 & type.d > 999), 1,0))
Or similarly, this:
df1$type_1 <- with(df1, ifelse(type.a < 2000 & type.a > 999, 1,
ifelse(type.b < 2000 & type.b > 999, 1,
ifelse(type.c < 2000 & type.c > 999, 1,
ifelse(type.d < 2000 & type.d > 999, 1,0)))))
Now my question has two parts:
How can you use a string that looks at only the first digit of type(a:d) to test whether it equals our constraint (in this instance, 1)?
Secondly, I have more than four columns of data, so I don't think it is efficient to specify column names each time. Can [,x:y] be used?
The code would then be used to create 9 new columns of data (i.e. type_1, type_2, ..., type_9), as the first digit of our type(a:d) ranges over 1:9.
We can use substr to extract the first character of a string. As there are four columns that start with type, we can use grep to get the numeric index of columns, we loop the columns with lapply, check whether the 1st character is equal to 1. If we want to know whether there is at least one value that meets the condition, we can wrap it with any. Using lapply returns a list output with a length of 1 for each list element. As we need a binary (0/1) instead of logical (FALSE/TRUE), we can wrap with + to coerce the logical to binary representation.
indx <- grep('^type', colnames(df1))
lapply(df1[indx], function(x) +(any(substr(x, 1, 1)==1)))
If we need a vector output
vapply(df1[indx], function(x) +(any(substr(x, 1, 1)==1)), 1L)
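If the flag is needed per row rather than per column, as in the question's Type_1 variable, the same substr idea can be applied across the type columns. The following is a small sketch with made-up values (the data frame df and the name first_digit are illustrative), assuming all values are positive so the first character is a digit.

```r
# Made-up example with two type columns
df <- data.frame(A = 1:3,
                 type.a = c(1500, 4200, 2100),
                 type.b = c(5300, 1999.5, 7400))
# Columns whose names start with "type."
type_cols <- grep("^type\\.", names(df))
# Character matrix holding the first digit of every value (rows x type columns)
first_digit <- sapply(df[type_cols], function(x) substr(as.character(x), 1, 1))
# One 0/1 column per digit: does any type column in this row start with d?
for (d in 1:9) {
  df[[paste0("type_", d)]] <- +(rowSums(first_digit == as.character(d)) > 0)
}
df
```

With a single-row data frame sapply would return a vector rather than a matrix, so that case would need separate handling.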
Great and elegant answer by @akrun. I was interested in the 2nd part of your question. Specifically about how you're going to use the first part to create the new 9 columns you mention. I don't know if I'm missing something, but instead of checking each time if the first element matches 1,2,3, etc. you can just simply capture that first element. Something like this:
library(dplyr)
library(tidyr)
set.seed(123)
A <- 1:100
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A, type.a, type.b, type.c, type.d)
df1 %>%
group_by(A) %>%
mutate_each(funs(substr(.,1,1))) %>% # keep first digit
ungroup %>%
gather(variable, type, -A) %>% # create combinations of rows and digits
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,type) %>%
summarise(value = sum(value)) %>% # count how many times the row belongs to each type
ungroup %>%
spread(type, value, fill=0) %>% # create the new columns
inner_join(df1, by="A") %>% # join back initial info
select(A, starts_with("type."), starts_with("type_")) # order columns
# A type.a type.b type.c type.d type_1 type_2 type_3 type_4 type_5 type_6 type_7 type_8 type_9
# 1 1 4196.838 3987.671 7473.662 4118.106 0 0 1 2 0 0 1 0 0
# 2 2 4670.156 5366.059 6476.465 4071.935 0 0 0 2 1 1 0 0 0
# 3 3 7233.629 4648.464 4701.712 3842.782 0 0 1 2 0 0 1 0 0
# 4 4 5101.039 4504.752 5611.093 3702.251 0 0 1 1 2 0 0 0 0
# 5 5 5185.269 3643.944 4533.868 4460.982 0 0 1 2 1 0 0 0 0
# 6 6 7457.688 4935.835 4464.222 5408.344 0 0 0 2 1 0 1 0 0
# 7 7 5660.493 3881.511 4112.822 2516.478 0 1 1 1 1 0 0 0 0
# 8 8 3187.167 2623.183 4331.056 5261.372 0 1 1 1 1 0 0 0 0
# 9 9 4015.740 4458.177 6857.271 6524.820 0 0 0 2 0 2 0 0 0
# 10 10 4361.366 6309.570 4939.218 7512.329 0 0 0 2 0 1 1 0 0
# .. .. ... ... ... ... ... ... ... ... ... ... ... ... ...
Example when we have column A and B in the beginning:
library(dplyr)
library(tidyr)
set.seed(123)
A <- 1:100
B <- 101:200
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A,B, type.a, type.b, type.c, type.d)
# work by grouping on A and B
df1 %>%
group_by(A,B) %>%
mutate_each(funs(substr(.,1,1))) %>%
ungroup %>%
gather(variable, type, -c(A,B)) %>%
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,B,type) %>%
summarise(value = sum(value)) %>%
ungroup %>%
spread(type, value, fill=0) %>%
inner_join(df1, by=c("A","B")) %>%
select(A,B, starts_with("type."), starts_with("type_"))
# A B type.a type.b type.c type.d type_1 type_2 type_3 type_4 type_5 type_6 type_7 type_8 type_9
# 1 1 101 4196.838 3987.671 7473.662 4118.106 0 0 1 2 0 0 1 0 0
# 2 2 102 4670.156 5366.059 6476.465 4071.935 0 0 0 2 1 1 0 0 0
# 3 3 103 7233.629 4648.464 4701.712 3842.782 0 0 1 2 0 0 1 0 0
# 4 4 104 5101.039 4504.752 5611.093 3702.251 0 0 1 1 2 0 0 0 0
# 5 5 105 5185.269 3643.944 4533.868 4460.982 0 0 1 2 1 0 0 0 0
# 6 6 106 7457.688 4935.835 4464.222 5408.344 0 0 0 2 1 0 1 0 0
# 7 7 107 5660.493 3881.511 4112.822 2516.478 0 1 1 1 1 0 0 0 0
# 8 8 108 3187.167 2623.183 4331.056 5261.372 0 1 1 1 1 0 0 0 0
# 9 9 109 4015.740 4458.177 6857.271 6524.820 0 0 0 2 0 2 0 0 0
# 10 10 110 4361.366 6309.570 4939.218 7512.329 0 0 0 2 0 1 1 0 0
# .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
However, in this case you should notice that you have one A value for each line. So, B isn't really needed in order to define your rows (in a unique way). Therefore, you can work exactly as before (when B wasn't there) and just join B to your result:
df1 %>%
select(-B) %>%
group_by(A) %>%
mutate_each(funs(substr(.,1,1))) %>%
ungroup %>%
gather(variable, type, -A) %>%
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,type) %>%
summarise(value = sum(value)) %>% # count how many times the row belongs to each type
ungroup %>%
spread(type, value, fill=0) %>%
inner_join(df1, by="A") %>%
mutate(B=B) %>%
select(A,B, starts_with("type."), starts_with("type_"))

Aggregating every 10 columns in a binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = signif(years[1], 3), to = signif(tail(years, 1), 3), by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to colSums with something like function(cols) sapply(cols, f), where f is the aggregating function, e.g., max.
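When the input is an actual matrix, the decade grouping can also be sketched with base R's rowsum(), which sums rows by group. The small matrix m below is made up for illustration, and the year labels are assumed to be valid four-digit column names.

```r
# Made-up binary matrix: 2 rows, years 1878 to 1882 as column names
m <- matrix(c(1, 0,  1, 1,  0, 1,  0, 0,  1, 1), nrow = 2,
            dimnames = list(c("A", "B"), as.character(1878:1882)))
# Decade start for each column, e.g. 1878 -> 1870
decade <- as.numeric(colnames(m)) %/% 10 * 10
# rowsum() groups rows, so transpose, sum per decade, and transpose back
counts <- t(rowsum(t(m), group = decade))
colnames(counts) <- paste(colnames(counts), as.numeric(colnames(counts)) + 9, sep = "-")
# Binary version: 1 if at least one event happened in the decade
binary <- +(counts > 0)
counts
```

Integer division by 10 avoids the rounding that signif() applies, so a start year such as 1876 still lands in the 1870 decade.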
