I am trying to use the tidyverse (purrr) package to run a for loop across my dataset. I want to check whether some number of conditions are true across certain columns along the dataset. Note, I am trying to become more familiar with tidyverse and its functions rather than rely on Base R.
Here is the code that I want to write a for loop for.
nrow(subset(data, flwr_clstr1>1 & bud_clstr1==0))
nrow(subset(data, flwr_clstr2>1 & bud_clstr2==0))
nrow(subset(data, flwr_clstr3>1 & bud_clstr3==0))
I have columns of data (in this case, it would be flwr_clstr) that are similar, but differ by the last digit. Also, if there is another way to use tidyverse to check these 'conditions', that would be great too.
Here is my attempt at the for loop.
check1 <- vector("double", ncol(data_phen))
for (i in seq_along(data_phen)) {
check[[i]] <- nrow(subset(data, flwr_clstr[[i]]>1 & bud_clstr[[i]]==0))
}
It would be easier to help if you provided a reproducible example; however, I created a sample of what your data might look like based on my understanding.
We can use map2_int from purrr, since we are counting the rows that satisfy the condition for each pair of columns.
library(dplyr)
library(purrr)
map2_int(data %>% select(starts_with("flwr_clstr")),
data %>% select(starts_with("bud_clstr")),
~sum(.x > 1 & .y == 0)) %>% unname()
#[1] 2 3 1
However, base R isn't that bad either. This can be solved using mapply
col1 <- grep("^flwr_clstr", names(data))
col2 <- grep("^bud_clstr", names(data))
mapply(function(x, y) sum(x > 1 & y == 0), data[col1], data[col2])
data
Assuming you have equal number of columns for both "flwr_clstr.." and "bud_clstr.."
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0), flwr_clstr2 = c(2, 2, 2, 1, 0),
flwr_clstr3 = c(1, 1, 2, 1, 1), bud_clstr1 = 0, bud_clstr2 = 0,bud_clstr3 = 0)
which looks like
data
# flwr_clstr1 flwr_clstr2 flwr_clstr3 bud_clstr1 bud_clstr2 bud_clstr3
#1 2 2 1 0 0 0
#2 1 2 1 0 0 0
#3 2 2 2 0 0 0
#4 1 1 1 0 0 0
#5 0 0 1 0 0 0
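For comparison, the for loop from the question can be made to work by building each column name as a string and indexing with [[ ]]. This is only a sketch against the sample data above; the hard-coded 1:3 assumes there are exactly three column pairs.

```r
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0), flwr_clstr2 = c(2, 2, 2, 1, 0),
                   flwr_clstr3 = c(1, 1, 2, 1, 1),
                   bud_clstr1 = 0, bud_clstr2 = 0, bud_clstr3 = 0)

check <- integer(3)  # one count per column pair
for (i in 1:3) {
  # Build the column names for this iteration, e.g. "flwr_clstr1"
  flwr <- data[[paste0("flwr_clstr", i)]]
  bud  <- data[[paste0("bud_clstr", i)]]
  # Count the rows where both conditions hold
  check[i] <- sum(flwr > 1 & bud == 0)
}
check
#> [1] 2 3 1
```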
I am struggling with an issue concerned with nested for loops and calculation with conditions.
Let's say I have a data frame like this:
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
With this df, I want to compare each element between vectors with a condition: if the absolute difference between two elements is lower than 2 (so a difference of 0 or 1), I'd like to append 1 to a newly created vector, while if the absolute difference is >= 2, I'd like to append 0.
For example, for a calculation between the vector "a" and the other vectors "b", "c", "d", I want this result: 0 0 1. The first 0 is assigned based on the difference of 2 between a1 and b1; the second 0 is based on the difference of 3 between a1 and c1; the 1 is based on the difference of 1 between a1 and d1. So I tried to write a nested for loop to apply the same procedure to the elements in the following rows as well.
So my first trial was like this:
list_all = list(df$a, df$b, df$c, df$d)
v0<-c()
for (i in list_all)
for (j in list_all)
if (i != j) {
if(abs(i-j)<2) {
v0<-c(v0, 1)
} else {
v0<-append(v0, 0)
}} else {
next}
The result is like this :
v0
[1] 0 0 1 0 1 1 0 1 0 1 1 0
But it seems that the calculation has been made only among the first elements but not among the following elements.
So my second trial was like this:
list = list(df$b, df$c, df$d)
v1<-c()
for (i in df$a){
for (j in list){
if(abs(i-j)<2) {
v1<-append(v1, 1)
} else {
v1<-append(v1, 0)
}
}
}
v1
[1] 0 0 1 1 0 1 1 0 1 1 1 1
It seems like the calculations were made between all elements of df$a and ONLY the first elements of the others. So this is not what I needed, either.
When I put df$b instead of list in the nested for loop, the result is even more messy.
v2<-c()
for (i in df$a){
for (j in df$b){
if(abs(i-j)<2) {
v2<-append(v2, 1)
} else {
v2<-append(v2, 0)
}
}
}
v2
[1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
It seems like the calculation has not been made between the corresponding elements (in the same rows), but between all vectors regardless of the place.
Could anyone tell me how to fix the problem? I don't understand why the nested for loop works only for the first elements.
Thank you in advance.
I'm not sure if I understood it all correctly, but how about this?
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
as.vector(apply(df, 1, \(x) ifelse(abs(x[1] - x[2:4]) < 2, 1, 0)))
#> [1] 0 0 1 1 0 1 1 1 1 1 1 0
I think you're making life unnecessarily complicated for yourself. If I understand you correctly, you can do what you want without nesting loops at all.
The key thing to remember is that R is vectorised by default. That means that R operates on all elements of a vector at the same time, so there's no need to loop. For example, if a is a vector with values 1 and 2 and I write a + 1, the result will be a vector with values 2 and 3.
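A minimal illustration of that behaviour, using made-up vectors x and y (hypothetical names that mirror columns a and b of the question's data):

```r
a <- c(1, 2)
a + 1                      # adds 1 to every element: 2 3

# The same holds for comparisons: each element of x is paired
# with the element of y in the same position.
x <- c(2, 3, 3, 4)
y <- c(4, 4, 4, 4)
is_close <- abs(x - y) < 2 # FALSE TRUE TRUE TRUE
as.integer(is_close)       # 0 1 1 1
```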
Applying this logic to your case, you can write:
df$diffB <- ifelse(abs(df$a-df$b) < 2, 1, 0)
df$diffC <- ifelse(abs(df$a-df$c) < 2, 1, 0)
df$diffD <- ifelse(abs(df$a-df$d) < 2, 1, 0)
df
Giving
a b c d diffB diffC diffD
1 2 4 5 3 0 0 1
2 3 4 5 4 1 0 1
3 3 4 4 4 1 1 1
4 4 4 4 2 1 1 0
You can write a loop to loop over columns if you wish, and Aron has given you one option to do this in his answer.
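As a rough sketch of such a loop in base R (the diffB/diffC/diffD naming simply follows the columns created above), one loop over the column names is enough, because the row-wise comparison is already vectorised:

```r
df <- data.frame(a = c(2, 3, 3, 4),
                 b = c(4, 4, 4, 4),
                 c = c(5, 5, 4, 4),
                 d = c(3, 4, 4, 2))

for (col in c("b", "c", "d")) {
  new_name <- paste0("diff", toupper(col))   # "diffB", "diffC", "diffD"
  # Compare column a with the current column, row by row
  df[[new_name]] <- as.integer(abs(df$a - df[[col]]) < 2)
}
df
```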
Personally, I find that using the tidyverse results in code that's easier to understand than code written in base R. This is because I can read tidyverse code from left to right, whereas base R code (often) needs to be read from the inside out. The tidyverse's syntax is more consistent than base R's as well.
Here's how I would solve your problem using the tidyverse:
library(tidyverse)
df %>%
mutate(
diffB=ifelse(abs(a-b) < 2, 1, 0),
diffC=ifelse(abs(a-c) < 2, 1, 0),
diffD=ifelse(abs(a-d) < 2, 1, 0)
)
And the "loop over columns" becomes
df %>%
mutate(
across(
c(b, c, d),
~ifelse(abs(a-.x) < 2, 1, 0),
.names="diff{.col}"
)
)
I've been trying unsuccessfully to replicate in R the following Stata loop:
forvalues i=1/10 {
replace var`i'= a if other_var`i'==b
}
So far I've got this as the closest attempt:
for(i in 1:10) {
df <- df %>%
mutate(get(paste("var",i,sep="")) =
ifelse(get(paste("other_var",i,sep=""))==b
,a
,get(paste("var",i,sep=""))))
}
But I get the following error:
Error: unexpected '=' in:
"survey_data <- survey_data %>%
mutate(paste("offer",i,"_accepted",sep="") ="
If I change the variable to be mutated to a simple variable name, it works, so I'm guessing my code is OK for the "right-hand side of the mutation", but for some reason it's not OK for the "left-hand side".
This solution is very inelegant, but I think does exactly what you want.
var1 <- "x"
var2 <- "y"
var3 <- "z"
other_var1 <- 1
other_var2 <- 0
other_var3 <- 1
df <- data.frame(var1, other_var1, var2, other_var2, var3, other_var3)
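for(i in 1:3){
  var_name <- paste("var", i, sep = "")
  other_var_name <- paste("df$other_var", i, sep = "")
  # df has a single row here, so the condition is a single TRUE/FALSE
  if (eval(parse(text = other_var_name)) == 1){
    # assign("df$var1", "a") would create a new variable literally named `df$var1`,
    # so index the column by its name instead
    df[[var_name]] <- "a"
  }
}
There are three key ingredients here. First, the paste() function creates the names of the variables for the current iteration of the loop. Second, the eval(parse(text = foo)) combo references the actual variable whose name is stored as a string in foo. Third, df[[var_name]] <- assigns into the column whose name is stored in var_name; assign() is not suitable here because it would create a separate variable named "df$var1" instead of changing the column.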
This looks like FAQ 7.21.
The most important part of that answer is at the end, where it says to use a list instead.
Trying to work on a group of global variables in R leads to complicated code that is hard to read and even harder to debug.
If you instead put those variables into a single list, then you can access them by name or position and use tools like lapply or the purrr package (part of tidyverse) to process everything in the list (or some of the things in the list using map_at or map_if from purrr).
If you tell us more about what you are trying to accomplish, we may be able to give a much simpler example of how to do it.
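In the meantime, here is a small sketch of the list idea in base R. The lists vars and flags and the replacement value "a" are made up for illustration; Map() walks over both lists in parallel.

```r
# Keep related values together in lists instead of separate global variables
vars  <- list(var1 = "x", var2 = "y", var3 = "z")
flags <- list(other_var1 = 1, other_var2 = 0, other_var3 = 1)

# Replace a value with "a" wherever the matching flag equals 1
result <- Map(function(v, flag) if (flag == 1) "a" else v, vars, flags)
unlist(result)
```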
You can do something like the following:
df <- structure(list(var1 = c(1, 2, 3, 4),
var2 = c(1, 2, 3, 4),
var3 = c(1,2, 3, 4),
var4 = c(1, 2, 3, 4),
other_var1 = c(1, 0, 1, 0),
other_var2 = c(0,1, 0, 1),
other_var3 = c(1, 1, 0, 0),
other_var4 = c(0, 0, 1,1)),
class = "data.frame",
row.names = c(NA, -4L))
# var1 var2 var3 var4 other_var1 other_var2 other_var3 other_var4
# 1 1 1 1 1 1 0 1 0
# 2 2 2 2 2 0 1 1 0
# 3 3 3 3 3 1 0 0 1
# 4 4 4 4 4 0 1 0 1
## Values to replace based on OP original question
a <- 777
b <- 1
## Iterate over all four variables available in df
for (i in 1:4) {
df <- within(df, {
assign(paste0("var",i), ifelse(get(paste0("other_var",i)) %in% c(b), ## Condition
a, ## Value if Positive
get(paste0("var",i)) )) ## Value if Negative
})
}
which results in the following output:
# var1 var2 var3 var4 other_var1 other_var2 other_var3 other_var4
# 1 777 1 777 1 1 0 1 0
# 2 2 777 777 2 0 1 1 0
# 3 777 3 3 777 1 0 0 1
# 4 4 777 4 777 0 1 0 1
The solution doesn't look like a one-liner, but it actually is one, albeit a quite dense one; so let's see how it works by looking at its components.
within(): I don't want to repeat what other people have explained excellently, so for the usage of within(), I refer you here.
The assign(paste0("var",i), X) part:
Here I am following the idea in @han-tyumi's answer: build the variable names with paste0(), then give them the value X (explained below) using the assign() function.
Let's talk about X.
Above I referenced assign(paste0("var",i), X), where X is equal to ifelse(get(paste0("other_var",i)) %in% c(b), a, get(paste0("var",i)) ).
Inside the ifelse():
The condition:
First, I recover the values of the variable other_var(i) (with i = 1,2,3,4) by combining get() with paste0() while looping. Then, I use the %in% operator to check, element by element, whether each value of other_var(i) equals the value assigned to b (in my example, the number 1); this generates TRUE or FALSE for each row, depending on whether the condition is met.
The TRUE part of the ifelse() function:
This is the simplest part: if the condition is met, the result is a (which in my example is equal to 777).
The FALSE part of the ifelse() function:
get(paste0("var",i)): this is the value of the variable itself (meaning that, if the condition is not met, the variable is left unaltered).
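For comparison, here is a shorter base R sketch of the same replacement that indexes columns by name with [[ ]] instead of using within() and assign(). It assumes the same df, a and b as above and that other_var(i) contains no NA values.

```r
df <- data.frame(var1 = c(1, 2, 3, 4), var2 = c(1, 2, 3, 4),
                 var3 = c(1, 2, 3, 4), var4 = c(1, 2, 3, 4),
                 other_var1 = c(1, 0, 1, 0), other_var2 = c(0, 1, 0, 1),
                 other_var3 = c(1, 1, 0, 0), other_var4 = c(0, 0, 1, 1))
a <- 777
b <- 1

for (i in 1:4) {
  v <- paste0("var", i)        # column to modify
  o <- paste0("other_var", i)  # column holding the condition
  # Overwrite only the rows where other_var<i> equals b
  df[[v]][df[[o]] == b] <- a
}
df
```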
Sorry if this is a trivial question or doesn't make sense, this is my first post. I'm coming from Excel where I've worked with if statements and index match functions and am trying to do something similar in R to pull data from two columns but not necessarily the same row to get a value in a third column, my example is this
df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0))
desired output: df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0),C=c(0,0,0,0,1))
What I want is to create a third column "C" that essentially follows this format:
Ifelse(A[ID]=1 & B[ID+1]=1 , C[ID]=1 , C[ID]=0)
Essentially if A=1 in ID "x" and B=1 in ID "x+1" then in the new column C in ID "x" =1 otherwise =0. I could order everything by ID if that makes things easier but doing it by the ID column would be ideal.
So far I've tried ifelse statements but I imagine there is probably a better way of doing this
Using dplyr, we can use lead to get next element after arranging the data by ID.
library(dplyr)
df %>%
arrange(ID) %>%
mutate(C = as.integer(A == 1 & lead(B) == 1))
# ID A B C
#1 1 1 0 0
#2 2 1 0 0
#3 3 1 0 1
#4 4 1 1 0
#5 5 0 0 0
In base R, we can do
df1 <- df[order(df$ID),]
df1$C <- with(df1, c(A[-nrow(df1)] == 1 & tail(B, -1) == 1, 0))
Without arranging the data, we can probably do
transform(df, C = as.integer(A == 1 & B[match(ID + 1, ID)] %in% 1))
Using the lead function (from dplyr), I got this to work
library(dplyr)
df <- df[order(df$ID), ]
df$C <- ifelse(df$A == 1 & lead(df$B) == 1, 1, 0)
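If dplyr is not available, the "next element" that lead() returns can be sketched in base R by dropping the first value and padding the end with NA. The name next_B is made up for this example.

```r
df <- data.frame(ID = c(1, 5, 4, 2, 3), A = c(1, 0, 1, 1, 1), B = c(0, 0, 1, 0, 0))

df1 <- df[order(df$ID), ]        # sort so that "next row" means ID + 1
next_B <- c(df1$B[-1], NA)       # B shifted up by one; the last row has no successor
# %in% treats the trailing NA as "no match", so the last row gets 0 rather than NA
df1$C <- as.integer(df1$A == 1 & next_B %in% 1)
df1
```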
I got 2 identical variables due to allowing multiple responses.
Let's say, variables deal about hobbies: 1 = football, 2 = ice hockey, 3 = I have no hobbies
Thus, one can have two hobbies: football PLUS ice hockey.
hobby1<-c(1,2,3)
hobby1<-factor(hobby1,labels=c("football", "ice hockey", "I have no hobbies"))
hobby2<-c(1,2,3)
hobby2<-factor(hobby2,labels=c("football", "ice hockey", "I have no hobbies"))
Now I am trying to extract the number of hobbies, ranging from 0 to 2.
I already tried:
sum(hobby1<2, hobby2<2)
How can this be done, given that the sum function does not work for factors?
Plus, my solution would not take into account the third category: no hobbies.
Should I possibly change my data arrangement, e.g. to dummy coding (football yes/no, ...)?
Dummy coding could be an easier approach since once you transform the data into a factor you can't use sum or the < operations easily. This approach works in base R:
df <- data.frame(football = c(0, 1, 1, 0),
ice_hockey = c( 1, 1, 0, 0))
df$num_hobbies <- rowSums(df[, 1:2])
df
# football ice_hockey num_hobbies
# 0 1 1
# 1 1 2
# 1 0 1
# 0 0 0
Or using dplyr to take advantage of column names a little more easily:
library(dplyr)
df <- data.frame(football = c(0, 1, 1, 0),
ice_hockey = c( 1, 1, 0, 0)) %>%
mutate(num_hobbies = football + ice_hockey)
df
# football ice_hockey num_hobbies
# 0 1 1
# 1 1 2
# 1 0 1
# 0 0 0
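If you would rather keep the two factor columns, one possible sketch is to count, per respondent, how many answers differ from the "no hobbies" label. The values of hobby2 below are invented so that the respondents differ, and the sketch assumes nobody names the same hobby twice.

```r
lab <- c("football", "ice hockey", "I have no hobbies")
hobby1 <- factor(c(1, 2, 3), levels = 1:3, labels = lab)
hobby2 <- factor(c(2, 3, 3), levels = 1:3, labels = lab)  # made-up second answers

none <- "I have no hobbies"
# Each comparison yields TRUE/FALSE; adding them counts the real hobbies
num_hobbies <- (as.character(hobby1) != none) + (as.character(hobby2) != none)
num_hobbies
```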
I'm getting a bit confused. I've got data like this in a data frame
index times
1 1 56.60
2 1 150.75
3 1 204.41
4 2 44.71
5 2 98.03
6 2 112.20
and I know that the times indexed 1 are biased, whereas the times indexed otherwise are not. I need to create a copy of that data frame removing the bias from the samples indexed 1. I've been trying several combinations of apply, by, and the likes. The closest I got was with
by(lct, lct$index, function(x) { if(x$index == 1) x$times = x$times-50 else x$times = x$times } )
which returned an object of class by, which is unusable for me. I need to write the data back to a csv file in the same format (index, times) of the original file. Ideas?
Something like this should work:
df$times[df$index == 1] <- df$times[df$index == 1] - 50
The trick here is to take the subset of df$times that fits your filter, and realize that R can also assign to a subset.
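A short end-to-end sketch with the data from the question, including writing the result back to a csv file. The bias of 50 is taken from the question's own attempt, and tempfile() stands in for the real output path.

```r
df <- data.frame(index = c(1, 1, 1, 2, 2, 2),
                 times = c(56.60, 150.75, 204.41, 44.71, 98.03, 112.20))

biased <- df$index == 1                    # logical filter for the biased rows
df$times[biased] <- df$times[biased] - 50  # assign only to that subset

# Write back in the original (index, times) layout, without row names
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```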
Alternatively, you can use ifelse:
df$times = ifelse(df$index == 1, df$times - 50, df$times)
and use it in dplyr:
library(dplyr)
df = data.frame(index = sample(1:5, 100, replace = TRUE),
value = runif(100)) %>% arrange(index)
df %>% mutate(value = ifelse(index == 1, value - 50, value))
# index value
#1 1 -49.95827
#2 1 -49.98104
#3 1 -49.44015
#4 1 -49.37316
#5 1 -49.76286
#6 1 -49.22133
#etc
How about,
index <- c(1, 1, 1, 2, 2, 2)
times <- c(56.60, 150.75, 204.41, 44.71, 98.03, 112.20)
df <- data.frame(index, times)
df$times <- ifelse(df$index == 1, df$times - 50, df$times)
> df
#index times
#1 1 6.60
#2 1 100.75
#3 1 154.41
#4 2 44.71
#5 2 98.03
#6 2 112.20