Calculation between vectors in nested for loops - r

I am struggling with an issue concerned with nested for loops and calculation with conditions.
Let's say I have a data frame like this:
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
With this df, I want to compare the elements of the vectors row by row under one condition: if the absolute difference between two elements is lower than 2 (that is, a difference of 0 or 1), I'd like to append 1 to a newly created vector, whereas if the absolute difference is >= 2, I'd like to append 0.
For example, for a calculation between the first element of vector "a" and the first elements of the other vectors "b", "c", "d", I want this result: 0 0 1. The first 0 is assigned based on the difference of 2 between a1 and b1; the second 0 is based on the difference of 3 between a1 and c1; the 1 is based on the difference of 1 between a1 and d1. So I tried to write a nested for loop to apply the same procedure to the elements in the following rows as well.
So my first trial was like this:
list_all = list(df$a, df$b, df$c, df$d)
v0<-c()
for (i in list_all)
for (j in list_all)
if (i != j) {
if(abs(i-j)<2) {
v0<-c(v0, 1)
} else {
v0<-append(v0, 0)
}} else {
next}
The result is like this :
v0
[1] 0 0 1 0 1 1 0 1 0 1 1 0
But it seems that the calculation has been made only among the first elements but not among the following elements.
So my second trial was like this:
list = list(df$b, df$c, df$d)
v1<-c()
for (i in df$a){
for (j in list){
if(abs(i-j)<2) {
v1<-append(v1, 1)
} else {
v1<-append(v1, 0)
}
}
}
v1
[1] 0 0 1 1 0 1 1 0 1 1 1 1
It seems like the calculations were made between all elements of df$a and ONLY the first elements of the others. So this is not what I needed, either.
When I put df$b instead of list in the nested for loop, the result is even more messy.
v2<-c()
for (i in df$a){
for (j in df$b){
if(abs(i-j)<2) {
v2<-append(v2, 1)
} else {
v2<-append(v2, 0)
}
}
}
v2
[1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
It seems like the calculation has not been made between the corresponding elements (in the same rows), but between all vectors regardless of the place.
Could anyone tell me how to fix the problem? I don't understand why the nested for loop works only for the first elements.
Thank you in advance.

I'm not sure if I understood it all correctly, but how about this?
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
as.vector(apply(df, 1, \(x) ifelse(abs(x[1] - x[2:4]) < 2, 1, 0)))
#> [1] 0 0 1 1 0 1 1 1 1 1 1 0

I think you're making life unnecessarily complicated for yourself. If I understand you correctly, you can do what you want without nesting loops at all.
The key thing to remember is that R is vectorised by default. That means an operation is applied to every element of a vector at once, so there's no need to loop. For example, if a is a vector with values 1 and 2 and I write a + 1, the result is a vector with values 2 and 3.
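As a quick illustration of that point, here is a minimal sketch with two made-up vectors (a and b below are illustrative values only, not the columns of your data frame). The comparison is evaluated for every pair of corresponding elements in one go:

```r
# Illustrative vectors (assumed values, not taken from the question's df)
a <- c(1, 2)
b <- c(4, 2)

plus_one <- a + 1                      # adds 1 to every element of a
flags <- as.integer(abs(a - b) < 2)    # element-wise test: |1 - 4| < 2, |2 - 2| < 2

plus_one   # 2 3
flags      # 0 1
```

No loop is needed because abs(), - and < all work element by element.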
Applying this logic to your case, you can write:
df$diffB <- ifelse(abs(df$a-df$b) < 2, 1, 0)
df$diffC <- ifelse(abs(df$a-df$c) < 2, 1, 0)
df$diffD <- ifelse(abs(df$a-df$d) < 2, 1, 0)
df
Giving
  a b c d diffB diffC diffD
1 2 4 5 3     0     0     1
2 3 4 5 4     1     0     1
3 3 4 4 4     1     1     1
4 4 4 4 2     1     1     0
You can write a loop to loop over columns if you wish, and Aron has given you one option to do this in his answer.
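If you want to stay in base R for the column loop, one possible sketch uses sapply over every column except a. The names others and flags are just illustrative, and I'm assuming a is always the reference column:

```r
df <- data.frame(a = c(2, 3, 3, 4),
                 b = c(4, 4, 4, 4),
                 c = c(5, 5, 4, 4),
                 d = c(3, 4, 4, 2))

# every column except the reference column "a"
others <- setdiff(names(df), "a")

# one column of 0/1 flags per compared column; each comparison is vectorised over rows
flags <- sapply(df[others], function(col) as.integer(abs(df$a - col) < 2))
flags

# flatten row by row to get one long vector in the order b1, c1, d1, b2, c2, d2, ...
as.vector(t(flags))
# [1] 0 0 1 1 0 1 1 1 1 1 1 0
```

The result is a matrix with one row per row of df, so it can be bound back onto the data frame with cbind if needed.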
Personally, I find that using the tidyverse results in code that's easier to understand than code written in base R. This is because I can read tidyverse code from left to right, whereas base R code (often) needs to be read from the inside out. The tidyverse's syntax is more consistent than base R's as well.
Here's how I would solve your problem using the tidyverse:
library(tidyverse)
df %>%
  mutate(
    diffB=ifelse(abs(a-b) < 2, 1, 0),
    diffC=ifelse(abs(a-c) < 2, 1, 0),
    diffD=ifelse(abs(a-d) < 2, 1, 0)
  )
And the "loop over columns" becomes
df %>%
  mutate(
    across(
      c(b, c, d),
      ~ifelse(abs(a-.x) < 2, 1, 0),
      .names="diff{.col}"
    )
  )

Related

Logical vector across many columns

I am trying to run a logical or statement across many columns in data.table but I am having trouble coming up with the code. My columns have a pattern like the one shown in the table below. I could use a regular logical vector if needed, but I was wondering if I could figure out a way to iterate across a1, a2, a3, etc. as my actual dataset has many "a" type columns.
Thanks in advance.
library(data.table)
x <- data.table(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))
# this works but does not work for lots of a1, a2, a3 colnames
# because code is too long and unwieldy
x[a1 == 1 | a2 == 1 , b:= 1]
# this is broken and returns the following error
x[colnames(x)[grep("a", names(x))] == 1, b := 1]
Error in `[.data.table`(x, colnames(x)[grep("a", names(x))] == 1, `:=`(b, :
i evaluates to a logical vector length 2 but there are 4 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Output looks like below:
   a1 a2  z  b
1:  1  2  9  1
2:  4  4 10 NA
3:  5  1 12  1
4:  6 10 12 NA
Try using a mask:
x$b <- 0
x[rowSums(ifelse(x[, list(a1, a2)] == 1, 1, 0)) > 0, b := 1]
Now imagine you have 100 a columns and they are the first 100 columns in your data table. Then you can select the columns using:
x[rowSums(ifelse(x[, c(1:100)] == 1, 1, 0)) > 0, b := 1]
ifelse(x[, list(a1, a2)] == 1, 1, 0) returns a matrix that has the value 1 wherever there is a 1 in the a columns and 0 elsewhere. Then I used rowSums to sum horizontally; if a row's sum is > 0, there was a 1 in at least one of the a columns of that row, so I simply selected those rows and set b to 1.
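If the a columns are not conveniently the first columns, the same mask can be built by selecting them by name pattern. The sketch below uses a plain data.frame as a stand-in so it runs without any package, and the pattern "^a[0-9]+$" is my assumption about how the columns are named. The names a_cols and hit are illustrative:

```r
# plain data.frame stand-in for the data.table in the question
x <- data.frame(a1 = c(1, 4, 5, 6), a2 = c(2, 4, 1, 10), z = c(9, 10, 12, 12))

# pick the "a" columns by name (assumed naming pattern: "a" followed by digits)
a_cols <- grep("^a[0-9]+$", names(x), value = TRUE)

# x[a_cols] == 1 is a logical matrix; a row sum > 0 means at least one 1 in that row
hit <- unname(rowSums(x[a_cols] == 1) > 0)

x$b <- ifelse(hit, 1, NA)
x
```

With a data.table, the same logical vector should be usable in i, as in x[hit, b := 1].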

Using a for loop across columns with similar names

I am trying to use the tidyverse (purrr) package to run a for loop across my dataset. I want to check whether some number of conditions are true across certain columns along the dataset. Note, I am trying to become more familiar with tidyverse and its functions rather than rely on Base R.
Here is the code that I want to write a for loop for.
nrow(subset(data, flwr_clstr1>1 & bud_clstr1==0))
nrow(subset(data, flwr_clstr2>1 & bud_clstr2==0))
nrow(subset(data, flwr_clstr3>1 & bud_clstr3==0))
I have columns of data (in this case, it would be flwr_clstr) that are similar, but differ by the last digit. Also, if there is another way to use tidyverse to check these 'conditions', that would be great too.
Here is my attempt at the for loop.
check1 <- vector("double", ncol(data_phen))
for (i in seq_along(data_phen)) {
check[[i]] <- nrow(subset(data, flwr_clstr[[i]]>1 & bud_clstr[[i]]==0))
}
It would be easier to help if you could provide a reproducible example, however I created a sample of what your data might look like based on my understanding.
We can use map2_int from purrr since we are trying to count number of rows in each pair of columns
library(dplyr)
library(purrr)
map2_int(data %>% select(starts_with("flwr_clstr")),
data %>% select(starts_with("bud_clstr")),
~sum(.x > 1 & .y == 0)) %>% unname()
#[1] 2 3 1
However, base R isn't that bad either. This can be solved using mapply
col1 <- grep("^flwr_clstr", names(data))
col2 <- grep("^bud_clstr", names(data))
mapply(function(x, y) sum(x > 1 & y == 0), data[col1], data[col2])
data
Assuming you have equal number of columns for both "flwr_clstr.." and "bud_clstr.."
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0), flwr_clstr2 = c(2, 2, 2, 1, 0),
flwr_clstr3 = c(1, 1, 2, 1, 1), bud_clstr1 = 0, bud_clstr2 = 0,bud_clstr3 = 0)
which looks like
data
#  flwr_clstr1 flwr_clstr2 flwr_clstr3 bud_clstr1 bud_clstr2 bud_clstr3
#1           2           2           1          0          0          0
#2           1           2           1          0          0          0
#3           2           2           2          0          0          0
#4           1           1           1          0          0          0
#5           0           0           1          0          0          0
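As a further option, the whole check can probably be done without any explicit iteration, because comparing a data frame against a number gives a logical matrix. This is only a sketch under the same assumption as above (the flwr_clstr and bud_clstr columns pair up in the same order). The name counts is illustrative:

```r
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0), flwr_clstr2 = c(2, 2, 2, 1, 0),
                   flwr_clstr3 = c(1, 1, 2, 1, 1), bud_clstr1 = 0, bud_clstr2 = 0,
                   bud_clstr3 = 0)

col1 <- grep("^flwr_clstr", names(data))
col2 <- grep("^bud_clstr", names(data))

# both comparisons yield 5 x 3 logical matrices; & combines them cell by cell,
# and colSums counts the TRUE cells per column pair
counts <- colSums(data[col1] > 1 & data[col2] == 0)
unname(counts)
# [1] 2 3 1
```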

how to fill in values in a vector?

I have vectors in R containing a lot of 0's, and a few non-zero numbers. Each vector starts with a non-zero number.
For example <1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0>
I would like to set all of the zeros equal to the most recent non-zero number.
I.e. this vector would become <1,1,1,1,1,1,2,2,2,2,2,2,4,4,4,4>
I need to do this for about 100 vectors containing around 6 million entries each. Currently I am using a for loop:
for(k in 1:length(vector)){
if(vector[k] == 0){
vector[k] <- vector[k-1]
}
}
Is there a more efficient way to do this?
Thanks!
One option would be to replace those 0 with NA, then use zoo::na.locf:
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
x[x == 0] <- NA
zoo::na.locf(x) ## you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
Thanks to Richard for showing me how to use replace:
zoo::na.locf(replace(x, x == 0, NA))
You could try this:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
or, for another case where cummax would not be appropriate because the non-zero values are not increasing:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
Logic:
I am keeping "track" of the indices of the vector elements that are non-zero with which(k != 0). Let's denote this new vector as x: x = c(1, 7, 13).
Next I am going to "sample" this new vector. How? From k I create a new vector that increments every time there is a non-zero element, cumsum(k != 0). Let's denote this new vector as y: y = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3).
I "sample" from vector x with x[y], i.e. I take the first element of x 6 times, then the second element 6 times and the third element 4 times. Let's denote this new vector as z: z = c(1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 13, 13, 13, 13).
Finally I "sample" from vector k with k[z], i.e. I take the first element 6 times, then the 7th element 6 times, then the 13th element 4 times.
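The steps above can be followed one at a time in the console. This is just the one-liner split into its intermediate pieces, using the same names x, y and z as in the explanation (filled is an illustrative name for the final result):

```r
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)

x <- which(k != 0)    # positions of the non-zero elements: 1 7 13
y <- cumsum(k != 0)   # running count of non-zero elements seen so far
z <- x[y]             # for every position, the index of the latest non-zero element
filled <- k[z]        # the value found at that index

filled
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
```

Because everything is done with vector indexing, this should scale much better than the element-by-element loop, although I have not timed it on 6 million entries.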
Adding to @李哲源's answer:
If you need to replace the leading NAs with the nearest non-NA value, and the other NAs with the last non-NA value, the code can be:
x <- c(0,0,1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
zoo::na.locf(zoo::na.locf(replace(x, x == 0, NA),na.rm=FALSE),fromLast=TRUE)
# you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4

Identifying relative size of overlapping groups based on information in 2 vectors

I am working with very messy family data, in that it is possible for kids to be grouped with multiple families. The data is structured as follows:
famid <- c("A","A","B","C","C","D","D")
kidid <- c("1","2","1","3","4","4","5")
df <- as.data.frame(cbind(famid, kidid))
I want to identify which families I can drop, based on the criteria that all of the kids in that family are grouped together in another, larger, family.
For example, Family A contains Kid 1 and Kid 2. Family B contains Kid 1. Because Family B is entirely contained within Family A, I want to drop Family B.
Alternatively, Family C contains Kid 3 and Kid 4. Family D contains Kid 4 and Kid 5. Neither family is entirely contained within the other, so I do not want to drop either for the time being.
In my data there can be up to 6 families per kid and up to 8 kids per family. There are thousands of families and thousands of kids.
I have tried addressing this by creating a very wide data.frame with one row per student, with columns for each family the kid is associated with, each sibling in each family that the kid is associated with, and an additional column (sibgrp) for each associated family that concatenates all siblings together. But when I tried to search for individual siblings within the concatenated string, I found I didn't know how to do this -- grepl won't take a vector as the pattern argument.
I then started to look into intersect and similar functions, but those compare entire vectors to each other, not observations within a vector to other observations within that vector. (Meaning -- I can't look for the intersections between character string df[1,2] and character string df[1,3]. Intersect instead identifies the intersections between df[2] and df[3]).
I tried to change my thinking to accommodate this approach, so that I could compare vectors of siblings to each other, assuming that I know already that at least one sibling is shared. I could not figure out how to even begin doing this, given how many different families there are, and how many are not related to each other by even one shared kid.
What am I missing here? Would very much appreciate any feedback. Thank you!
This function can also be used for doing the task. It returns a character vector containing the names of the families that can be removed.
test_function <- function(dataset){
  ## split the kidid on the basis of famid
  kids_family <- split.default(dataset[['kidid']], f = dataset[['famid']])
  family <- names(kids_family)
  ## This function generates all the possible combinations if we select any two families from family
  combn_family <- combn(family, 2)
  family_removed <- character(0)
  apply(combn_family, MARGIN = 2, function(x){
    if (length(setdiff(kids_family[[x[1]]], kids_family[[x[2]]])) == 0)
      family_removed <<- c(family_removed, x[1])
    else if (length(setdiff(kids_family[[x[2]]], kids_family[[x[1]]])) == 0)
      family_removed <<- c(family_removed, x[2])
  })
  return(family_removed)
}
> df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
+ kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
> test_function(df)
[1] "B" "F"
I tried working with setdiff without success, so I am posting this laborious solution in the hope that there is a better way.
# dependencies for melting tables and handling data.frames
require(reshape2)
require(dplyr)
# I have added two more cases to your data.frame
# kidid is passed as numeric (quoted values would have been converted to factor by default)
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
# let's have a look to it
df
   famid kidid
1      A     1
2      A     2
3      B     1
4      C     3
5      C     4
6      D     4
7      D     5
8      E     7
9      E     8
10     E     9
11     F     7
12     F     9
# we build a contingency table
m <- table(df$famid, df$kidid)
# a family A only contains a family B, if A has all the elements of B,
# and at least one that B doesn't have
m
    1 2 3 4 5 7 8 9
  A 1 1 0 0 0 0 0 0
  B 1 0 0 0 0 0 0 0
  C 0 0 1 1 0 0 0 0
  D 0 0 0 1 1 0 0 0
  E 0 0 0 0 0 1 1 1
  F 0 0 0 0 0 1 0 1
# a helper function to implement that and return a friendly data.frame
family_contained <- function(m){
  res <- list()
  for (i in 1:nrow(m))
    # for each line in m, we calculate the difference to all other lines
    res[[i]] <- t(apply(m[-i, ], 1, function(row) m[i, ] - row))
  # here we test if all values are 0+ (ie if the selected family has all elements of the other)
  # and if at least one is >=1 (ie if the selected family has at least one element that the other doesn't have)
  tab <- sapply(res, function(m) apply(m, 1, function(x) all(x>=0) & any(x>=1)))
  # we format it as a table to have nice names
  tab %>% as.table() %>%
    # we melt it into a data.frame
    melt() %>%
    # only select TRUE and get rid of this column
    filter(value) %>% select(-value) %>%
    # to make things clear we name columns
    `colnames<-`(c("this_family_is_contained", "this_family_contains"))
}
family_contained(m)
# this_family_is_contained this_family_contains
# 1 B A
# 2 F E
# finally you can filter them with
filter(df, !(famid %in% family_contained(m)$this_family_is_contained))
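For comparison, here is a compact base R sketch of the same containment rule (a family can be dropped if all of its kids appear in another, strictly larger family). It is not taken from either answer above. The names kids, is_contained and droppable are illustrative, and it compares every family with every other one, so I have not checked how it behaves with thousands of families:

```r
df <- data.frame(
  famid = c("A","A","B","C","C","D","D","E","E","E","F","F"),
  kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9)
)

# one vector of (unique) kid ids per family
kids <- lapply(split(df$kidid, df$famid), unique)

# a family is contained if some other family holds all of its kids and is larger
is_contained <- sapply(names(kids), function(f) {
  others <- setdiff(names(kids), f)
  any(vapply(others, function(g) {
    all(kids[[f]] %in% kids[[g]]) && length(kids[[g]]) > length(kids[[f]])
  }, logical(1)))
})

droppable <- names(is_contained)[is_contained]
droppable
# [1] "B" "F"
```

Requiring the other family to be strictly larger means that two families with exactly the same kids are both kept, which may or may not be what you want.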

R - work on data frame rows based on condition

I'm trying to understand how can I work on the rows of a data frame based on a condition.
Having a data frame like this
> d<-data.frame(x=c(0,1,2,3), y=c(1,1,1,0))
> d
  x y
1 0 1
2 1 1
3 2 1
4 3 0
how can I add +1 to all rows that contain a value of zero? (note that zeros can be found in any column), so that the result would look like this:
  x y
1 1 2
2 1 1
3 2 1
4 4 1
The following code seems to do part of the job, but is just printing the rows where the action was taken, the number of times it was taken (2)...
> for(i in 1:nrow(d)){
+ d[d[i,]==0,]<-d[i,]+1
+ }
> d
  x y
1 1 2
2 4 1
3 1 2
4 4 1
I'm sure there is a simple solution for this, maybe an apply function?, but I'm not getting there.
Thanks.
Some possibilities:
# 1
idx <- which(d == 0, arr.ind = TRUE)[, 1]
d[idx, ] <- d[idx, ] + 1
# 2
t(apply(d, 1, function(x) x + any(x == 0)))
# 3
d + apply(d == 0, 1, max)
The usage of which for vectors, e.g. which(1:3 > 2), is quite common, whereas it is used less for matrices: by specifying arr.ind = TRUE what we get is array indices, i.e. coordinates of every 0:
which(d == 0, arr.ind = TRUE)
     row col
[1,]   1   1
[2,]   4   2
Since we are interested only in rows where zeros occur, I take the first column of which(d == 0, arr.ind = TRUE) and add 1 to all the elements in these rows by d[idx, ] <- d[idx, ] + 1.
Regarding the second approach, apply(d, 1, function(x) x) would be simply going row by row and returning the same row without any modifications. By any(x == 0) we check whether there are any zeros in a particular row and get TRUE or FALSE. However, by writing x + any(x == 0) we transform TRUE or FALSE to 1 or 0, respectively, as required.
Now the third approach. d == 0 is a logical matrix, and we use apply to go over its rows. Then when applying max to a particular row, we again transform TRUE, FALSE to 1, 0 and find a maximal element. This element is 1 if and only if there are any zeros in that row. Hence, apply(d == 0, 1, max) returns a vector of zeros and ones. The final point is that when we write A + b, where A is a matrix and b is a vector, the addition is column-wise. In this way, by writing d + apply(d == 0, 1, max) we add apply(d == 0, 1, max) to every column of d, as needed.
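A variation on the third approach, as a minimal sketch: rowSums on the logical matrix d == 0 counts the zeros per row, so the per-row flag can be computed without apply. The names has_zero and res are illustrative:

```r
d <- data.frame(x = c(0, 1, 2, 3), y = c(1, 1, 1, 0))

# TRUE for rows that contain at least one zero (rows 1 and 4 here)
has_zero <- unname(rowSums(d == 0) > 0)

# the 0/1 vector is added to every column, so flagged rows go up by 1 everywhere
res <- d + as.numeric(has_zero)
res
```

This should print the same result as the desired output in the question (rows 1 and 4 incremented, rows 2 and 3 unchanged).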
