How to skip iteration in for loop if condition is met - r

I have code to turn the upper triangle of a matrix into a vector and store the values from this vector along with their original coordinates from the matrix into a data frame.
How do I skip the for loop if the element in the vector is zero?
I have tried else statements and other attempts.
v <- matrix(sample(0:1, 10, replace = TRUE),9,9)
t <- v[upper.tri(v,diag=T)]
tful <- t[t!=0]
df <- data.frame(FP1=rep(0,length(t)),FP2=rep(0,length(t)),tanimoto=rep(0,length(t)))
for (i in 1:length(t)){
if (t[i]==0) next
else {
col_num <- floor(sqrt(2*i-7/4)+.5)
row_num <- i-(.5*col_num^2-.5*col_num+1)+1
df$FP1[i] <- row_num
df$FP2[i] <- col_num
df$tanimoto[i] <- v[row_num,col_num]
}
}
I don't want any zeros in my data frame, and I want the loop to skip these values.
I understand the data frame needs to have fewer rows, but I am using this as an example.

Your next is working fine to skip the current iteration of the loop.
You still get 0s in the final result because all values of df were initialized to 0. When you skip an iteration, the corresponding row is not changed, so it remains 0. If you change the initialization to NA values, you'll see that no 0s are added.
df <- data.frame(FP1=rep(NA,length(t)),FP2=rep(NA,length(t)),tanimoto=rep(NA,length(t)))
for (i in 1:length(t)){
if (t[i]==0) next
else {
col_num <- floor(sqrt(2*i-7/4)+.5)
row_num <- i-(.5*col_num^2-.5*col_num+1)+1
df$FP1[i] <- row_num
df$FP2[i] <- col_num
df$tanimoto[i] <- v[row_num,col_num]
}
}
df
# FP1 FP2 tanimoto
# 1 1 1 1
# 2 1 2 1
# 3 2 2 1
# 4 1 3 1
# 5 2 3 1
# 6 3 3 1
# 7 NA NA NA
# 8 2 4 1
# 9 3 4 1
# 10 4 4 1
# 11 NA NA NA
# ...
A simple modification would be to filter your data frame as a last step: df = df[df$tanimoto != 0, ], or if you switch to NA, df = na.omit(df).
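To make the point about skipped iterations concrete, here is a minimal sketch on a made-up three-element vector (vals, out and compact are illustrative names, not from the question). It shows that next leaves the preallocated slot untouched, and that the leftover NA can be dropped afterwards:

```r
# Toy input: the middle element is zero and should be skipped
vals <- c(1, 0, 1)

# Preallocate with NA instead of 0, so skipped slots are easy to spot
out <- rep(NA, length(vals))

for (i in seq_along(vals)) {
  if (vals[i] == 0) next      # nothing is written to out[i] on this iteration
  out[i] <- vals[i] * 10
}

print(out)                    # the second slot is still NA

# Drop the untouched slots as a final step
compact <- out[!is.na(out)]
print(compact)
```

The same idea carries over to the data frame in the question: the loop only decides which rows get filled, and a separate filtering step decides which rows are kept.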
We could also create a non-looping solution:
v1 = v != 0
df2 = data.frame(FP1 = row(v)[v1], FP2 = col(v)[v1], tanimoto = v[v1])
df2 = subset(df2, FP1 <= FP2)
df2
# FP1 FP2 tanimoto
# 1 1 1 1
# 7 1 2 1
# 8 2 2 1
# 13 1 3 1
# 14 2 3 1
# 15 3 3 1
# 20 2 4 1
# 21 3 4 1
# 22 4 4 1
# 27 3 5 1
# 28 4 5 1
# 29 5 5 1
# 33 1 6 1
# 34 4 6 1
# 35 5 6 1
# ...
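Another non-looping variant, offered only as a sketch: which() with arr.ind = TRUE returns the row and column of every TRUE cell, so the upper-triangle test and the non-zero test can be combined into one mask. The seed and the names keep, idx and df3 are assumptions made for the example, and the matrix is filled with 81 sampled values rather than 10 recycled ones.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
v <- matrix(sample(0:1, 81, replace = TRUE), 9, 9)

# TRUE only for non-zero cells on or above the diagonal
keep <- upper.tri(v, diag = TRUE) & v != 0

# One row per TRUE cell, with columns named "row" and "col"
idx <- which(keep, arr.ind = TRUE)

df3 <- data.frame(FP1 = idx[, "row"],
                  FP2 = idx[, "col"],
                  tanimoto = v[idx])   # matrix indexing by (row, col) pairs
head(df3)
```

Because which() walks the matrix column by column, the rows of df3 come out in the same order as v[upper.tri(v, diag = TRUE)].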

Related

how to subset every 6 rows in R?

I have to subset the data of 6 rows every time. How to do that in R?
data:
col1 : 1,2,3,4,5,6,7,8,9,10
col2 : a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
I want to do subset of 6 rows every time. First subset of the rows will have 1:6 ,next subset of the rows will have 7:nrow(data). I have tried using seq function.
seqData <- seq(1,nrow(data),6)
Output: this gives only rows 1 and 7, but I want rows 1 to 6 first, then rows 7 to nrow(data).
How can I get output like that?
Will this work:
set.seed(1)
dat <- data.frame(c1 = sample(1:5,12,T),
c2 = sample(1:5,12,T))
dat
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
split(dat, rep(1:ceiling(nrow(dat)/6), each = 6, length.out = nrow(dat)))
$`1`
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
$`2`
c1 c2
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
The function below creates a numeric vector of integers that increases by 1 every n rows, and uses this vector to split the data as needed.
data <- data.frame(col1 = 1:10, col2 = paste0("a", 1:10))
split_nrows <- function(x, n){
f <- c(1, rep(0, n - 1))
f <- rep(f, length.out = NROW(x))
f <- cumsum(f)
split(x, f)
}
split_nrows(data, 6)
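To see what the grouping vector looks like, here is a small trace of the same construction for 10 rows and n = 6 (nr, f and grp are names chosen for this sketch). It also compares the result with an integer-division shortcut, which is an alternative I am suggesting, not part of the answer above.

```r
n  <- 6    # rows per chunk
nr <- 10   # number of rows in the example data

# A 1 marks the first row of each chunk, 0 everywhere else
f <- rep(c(1, rep(0, n - 1)), length.out = nr)
print(f)

# The running total turns the markers into chunk numbers
grp <- cumsum(f)
print(grp)

# Alternative: integer division of the zero-based row index
grp2 <- (seq_len(nr) - 1) %/% n + 1
print(all(grp == grp2))
```

Either vector can be passed as the second argument of split().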
Here's a simple example with mtcars that yields a list of 6 subset dfs.
nrows <- nrow(mtcars)
breaks <- seq(1, nrows, 6)
listdfs <- lapply(breaks, function(x) mtcars[x:(x+5), ]) # increment by 5 not 6
listdfs[[6]] <- listdfs[[6]][1:2, ] #last df: remove 4 NA rows (36 - 32)
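A possible refinement of the mtcars example, as a sketch: capping the end index with min() avoids creating the NA rows in the first place, so the last chunk needs no manual clean-up. The names starts and chunks are mine.

```r
nrows  <- nrow(mtcars)
starts <- seq(1, nrows, by = 6)

# Each chunk runs from its start to start + 5, but never past the last row
chunks <- lapply(starts, function(s) mtcars[s:min(s + 5, nrows), ])

sapply(chunks, nrow)   # rows per chunk
```

With the 32 rows of mtcars this gives five chunks of 6 rows and a final chunk of 2 rows.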

R - New column based on previous columns, for multiple similar variables

This question is similar to previous questions (based on my search) but with a twist. I hope to use [s,l,v]apply to perform this action for efficiency.
df <- data.frame(id = c(1,2,3,1,2), var1_dose_v1 = c(2,4,NA,1,NA),
var1_dose_v2 = c(NA,NA,4,NA,3),
var2_dose_v1 = c(NA,4,2,3,5),
var2_dose_v2 = c(1,NA,NA,NA,NA),
var3_dose_v1 = c(NA,NA,2,3,5),
var3_dose_v2 = c(1,4,NA,NA,NA))
Which looks like this below
id var1_dose_v1 var1_dose_v2 var2_dose_v1 var2_dose_v2 var3_dose_v1 var3_dose_v2
1 2 NA NA 1 NA 1
2 4 NA 4 NA NA 4
3 NA 4 2 NA 2 NA
1 1 NA 3 NA 3 NA
2 NA 3 5 NA 5 NA
I want to create a new feature that amalgamates the information from version 1 (v1) and version 2 (v2) of each var#, producing the output below.
id var1_dose var2_dose var3_dose
1 2 1 1
2 4 4 4
3 4 2 2
4 1 3 3
5 3 5 5
It's important for me to use apply since there are thousands of var#s.
Thanks for your help!
Try this:
df[is.na(df)] <- 0
new_df <- sapply(seq(1:((ncol(df)-1)/2)), function(x)
{
df[, paste0("var",x,"_dose_v1")] + df[, paste0("var",x,"_dose_v2")]
})
To have a solution that is general for any number of variables or doses, there's a new function from dplyr called 'coalesce' built for this:
library(dplyr)
grps <- unique(sub("_v.*$?", "", names(df)[-1]))
mat <- sapply(grps, function(g) {
do.call("coalesce", unname(as.list(df[grep(g, names(df))])))
})
df2 <- data.frame(id=df$id, mat)
# id var1_dose var2_dose var3_dose
# 1 1 2 1 1
# 2 2 4 4 4
# 3 3 4 2 2
# 4 1 1 3 3
# 5 2 3 5 5
func <- function(i){
col <- paste0("var",i,"_dose")
xx <- colnames(df)[grep(col, colnames(df))]
yy <- rowSums(df[xx], na.rm = TRUE)
}
l = lapply(1:((dim(df)[2]-1)/2) , func)
df1 = as.data.frame(l)
colnames(df1) <- paste0("var",1:((dim(df)[2]-1)/2),"_dose")
# > df1
# var1_dose var2_dose var3_dose
# 1 2 1 1
# 2 4 4 4
# 3 4 2 2
# 4 1 3 3
# 5 3 5 5
If the 2 versions are always going to be side by side, then a more concise version of my code could be:
l = lapply(1:((dim(df)[2]-1)/2),
function(i) rowSums(df[colnames(df)[c(i*2,i*2+1)]], na.rm = T))
df1 = as.data.frame(l)
colnames(df1) <- paste0("var",1:((dim(df)[2]-1)/2),"_dose")
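For comparison, a base R sketch that keeps NA as NA instead of summing: it takes v1 where present and falls back to v2 otherwise. It assumes every variable has exactly the two columns <name>_v1 and <name>_v2, as in the example data; prefixes, merged and out are names I made up.

```r
df <- data.frame(id = c(1,2,3,1,2),
                 var1_dose_v1 = c(2,4,NA,1,NA),
                 var1_dose_v2 = c(NA,NA,4,NA,3),
                 var2_dose_v1 = c(NA,4,2,3,5),
                 var2_dose_v2 = c(1,NA,NA,NA,NA),
                 var3_dose_v1 = c(NA,NA,2,3,5),
                 var3_dose_v2 = c(1,4,NA,NA,NA))

# "var1_dose_v1" -> "var1_dose", one entry per variable
prefixes <- unique(sub("_v[0-9]+$", "", names(df)[-1]))

merged <- lapply(prefixes, function(p) {
  v1 <- df[[paste0(p, "_v1")]]
  v2 <- df[[paste0(p, "_v2")]]
  ifelse(is.na(v1), v2, v1)   # prefer v1, fall back to v2
})
names(merged) <- prefixes

out <- data.frame(id = df$id, merged)
out
```

Unlike the rowSums versions, this does not add the two values together if both versions happen to be filled in, and a row with both missing stays NA rather than becoming 0.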

How can I subset a dataframe according to group membership?

I want to write a function so that a (potentially large) dataframe can be subsetted according to group membership, where a 'group' is a unique combination of a set of column values.
For example, I would like to subset the following data frame according to unique combination of the first two columns (Loc1 and Loc2).
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF=data.frame(Loc1,Loc2,Dat1,Dat2,Dat3)
Loc1 Loc2 Dat1 Dat2 Dat3
1 A a 1 1 2
2 A a 1 2 2
3 A b 1 1 4
4 A b 1 2 4
5 B a 1 1 6
6 B a 1 2 5
7 B b 1 2 3
I want to return (i) the number of groups (i.e. 4), (ii) the number in each group (i.e. c(2,2,2,1)), and (iii) to relabel the rows so that I can further analyse the data frame according to group membership (e.g. for ANOVA and MANOVA) (i.e.
Group<-as.factor(c(1,1,2,2,3,3,4))
Data <- cbind(Group,DF[,-1:-2])
Group Dat1 Dat2 Dat3
1 1 1 1 2
2 1 1 2 2
3 2 1 1 4
4 2 1 2 4
5 3 1 1 6
6 3 1 2 5
7 4 1 2 3
).
So far all I have managed is to get the number of groups, and I'm suspicious that there's a better way to do even this:
nrow(unique(DF[,1:2]))
I was hoping to avoid for-loops as I am concerned about the function being slow.
I have tried converting to a data matrix so that I could concatenate the row values but I couldn't get that to work either.
Many thanks
You could try:
Create Group column by using unique level combination of Loc1 and Loc2.
indx <- paste(DF[,1], DF[,2])
DF$Group <- as.numeric(factor(indx, unique(indx))) #query No (iii)
DF1 <- DF[-(1:2)][,c(4,1:3)]
# Group Dat1 Dat2 Dat3
#1 1 1 1 2
#2 1 1 2 2
#3 2 1 1 4
#4 2 1 2 4
#5 3 1 1 6
#6 3 1 2 5
#7 4 1 2 3
table(DF$Group) #(No. ii)
#1 2 3 4
#2 2 2 1
length(unique(DF$Group)) #(i)
#[1] 4
Then, if you need to subset the dataset by group, you could split it on Group to create a list of 4 data frames:
split(DF1, DF1$Group)
Update
If you have multiple columns, you could still try:
ColstoGroup <- 1:2
indx <- apply(DF[,ColstoGroup], 1, paste, collapse="")
as.numeric(factor(indx, unique(indx)))
#[1] 1 1 2 2 3 3 4
You could create a function:
fun1 <- function(dat, GroupCols){
FactGroup <- dat[, GroupCols]
if(length(GroupCols)==1){
dat$Group <- as.numeric(factor(FactGroup, levels=unique(FactGroup)))
}
else {
indx <- apply(FactGroup, 1, paste, collapse="")
dat$Group <- as.numeric(factor(indx, unique(indx)))
}
dat
}
fun1(DF, "Loc1")
fun1(DF, c("Loc1", "Loc2"))
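One caveat worth sketching: pasting with collapse="" can merge different combinations into the same key, for example ("A", "ab") and ("Aa", "b") both become "Aab". The hypothetical helper below (group_id is my name, not part of the answer) inserts a separator that is unlikely to occur in real data; the choice of "\r" is an assumption and should be changed if your values can contain it.

```r
# Hypothetical variant of fun1's key building, with a separator between columns
group_id <- function(dat, group_cols, sep = "\r") {
  cols <- unname(as.list(dat[group_cols]))
  key  <- do.call(paste, c(cols, sep = sep))
  as.numeric(factor(key, levels = unique(key)))
}

# Made-up data where concatenation without a separator is ambiguous
tricky <- data.frame(a = c("A", "Aa", "A"),
                     b = c("ab", "b", "ab"),
                     stringsAsFactors = FALSE)

no_sep <- apply(tricky, 1, paste, collapse = "")
print(unname(no_sep))                   # all three keys are "Aab"

ids <- group_id(tricky, c("a", "b"))
print(ids)                              # rows 1 and 3 share a group, row 2 does not
```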
This gets all three of your queries.
Begin with a table of the first two columns and then work with that data.
> (tab <- table(DF$Loc1, DF$Loc2))
#
# a b
# A 2 2
# B 2 1
#
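> (ct <- c(t(tab))) ## (ii) counts in row-major order (A-a, A-b, B-a, B-b), matching the row order of DF
# [1] 2 2 2 1
> sum(tab > 0) ## (i) number of non-empty Loc1/Loc2 combinations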
# [1] 4
> cbind(Group = rep(seq_along(ct), ct), DF[-c(1,2)]) ## (iii)
# Group Dat1 Dat2 Dat3
# 1 1 1 1 2
# 2 1 1 2 2
# 3 2 1 1 4
# 4 2 1 2 4
# 5 3 1 1 6
# 6 3 1 2 5
# 7 4 1 2 3
Borrowing a bit from this answer and using some dplyr idioms:
library(dplyr)
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)
emitID <- local({
idCounter <- -1L
function(){
idCounter <<- idCounter + 1L
}
})
DF %>% group_by(Loc1, Loc2) %>% mutate(Group=emitID())
## Loc1 Loc2 Dat1 Dat2 Dat3 Group
## 1 A a 1 1 2 0
## 2 A a 1 2 2 0
## 3 A b 1 1 4 1
## 4 A b 1 2 4 1
## 5 B a 1 1 6 2
## 6 B a 1 2 5 2
## 7 B b 1 2 3 3
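A base R sketch that answers all three queries from a single factor, using interaction(). The names key, n_groups, group_sizes and Data are mine. Note one difference from the answers above: the group numbers follow the sorted order of the levels, not the order of first appearance (the two coincide for this example).

```r
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)

# One level per combination that actually occurs; lex.order = TRUE
# makes Loc1 vary slowest (A.a, A.b, B.a, B.b)
key <- interaction(DF$Loc1, DF$Loc2, drop = TRUE, lex.order = TRUE)

n_groups    <- nlevels(key)              # (i)
group_sizes <- as.vector(table(key))     # (ii)
Data <- cbind(Group = factor(as.integer(key)), DF[-(1:2)])   # (iii)

n_groups
group_sizes
Data
```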

Conditional calculation of means of different columns in data.table with R

A previous question discussed calculating the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x? The calculation should skip values that are 0 or NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y,z,t) should not be used. And in the result, the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; the formula (mean(y)+mean(z))/(mean(z)-mean(t)) still needs to be added:
data <- data.table(dataframe)
bar <- data[,.N,by=x]
foo <- data[ ,list(mean.y =mean(y, na.rm = T),
mean.z=mean(z, na.rm = T),
mean.t=mean(t,na.rm = T)),
by=x]
In this code for calculating means all rows are used, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y or z or t equal to zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't subset NA by default (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter out for your conditions using a nice little hack TRUE | NA = TRUE and FALSE | NA = NA and NA | NA = NA (you can test these out in your R session).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
HTH
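The three identities used in i can be checked in plain R without data.table. A minimal sketch on a made-up vector (y, mask and rows are illustrative names):

```r
# The identities the trick relies on
print(TRUE  | NA)   # TRUE
print(FALSE | NA)   # NA
print(NA    | NA)   # NA

# Numbers are coerced to logical first: non-zero -> TRUE, zero -> FALSE
y <- c(1, 0, NA, 3)
mask <- y | NA
print(mask)

# Only positions that are neither zero nor NA survive
rows <- which(mask)
print(rows)
```

So the mask is TRUE exactly where the value is non-zero and non-NA, and NA everywhere else. Dropping the NA rows is what turns it into the filter the question asks for.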
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
group_by(x) %>%
summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.
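On the "simpler way" to drop rows with NAs and zeroes: as far as I know, dplyr's filter() already drops rows where the condition evaluates to NA, so the separate is.na() step should not be needed. The same effect can be sketched in base R with which(), which discards both FALSE and NA. The names keep, sub and ans are mine.

```r
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)

# which() keeps only positions where the test is TRUE, so NA rows go too
keep <- which(df$y != 0 & df$z != 0 & df$t != 0)
sub  <- df[keep, ]

# One value per x: (mean(y) + mean(z)) / (mean(z) - mean(t))
ans <- sapply(split(sub, sub$x), function(g) {
  (mean(g$y) + mean(g$z)) / (mean(g$z) - mean(g$t))
})
ans
```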

Count and label observations per participant using loop

I have repeated-measures data.
I need to create a loop that will incrementally count each observation, within a participant, and label it.
I am new to writing loops. My logic was to say, for each item in the list of unique ids, count each row in that, and apply some function to that row.
Could someone point out what I am doing wrong?
data$Ob <- 0
for (i in unique(data$id)) {
count <- 1
for (u in data[data$id == i,]) {
data[data$id ==u,]$Ob <- count
count <- count + 1
print(count)
}
}
Thanks!
Justin
You can also use ave:
set.seed(1)
data <- data.frame(id = sample(4, 10, TRUE))
data$Ob = ave(data$id, data$id, FUN=seq_along)
data
id Ob
1 2 1
2 2 2
3 3 1
4 4 1
5 1 1
6 4 2
7 4 3
8 3 2
9 3 3
10 1 2
# Generate some dummy data
data <- data.frame(Ob=0, id=sample(4,20,TRUE))
# Go through every id value
for(i in unique(data$id)){
# Label observations
data$Ob[data$id == i] = 1:sum(data$id == i)
}
Be aware though that for loops are notoriously slow in R. In this simple case they work fine, but should you have millions and millions of rows in your data frame you'd better do something purely vectorized.
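As for what goes wrong in the original nested loop: iterating over a data frame with for visits its columns, not its rows, so u is a whole column rather than one observation. A sketch of the nested loop rewritten to walk over row indices instead, on a small made-up data frame (n_visits and r are names I introduced):

```r
data <- data.frame(id = c(2, 2, 3, 1, 3, 2))

# for over a data frame runs once per column (here: one column, one visit)
n_visits <- 0
for (u in data) n_visits <- n_visits + 1
print(n_visits)

data$Ob <- 0
for (i in unique(data$id)) {
  count <- 1
  for (r in which(data$id == i)) {   # row numbers belonging to participant i
    data$Ob[r] <- count
    count <- count + 1
  }
}
data
```

This keeps the counting logic from the question intact. The vectorized answers are still the better choice for large data.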
But you don't need a loop...
data <- data.frame (id = sample (4, 10, TRUE))
## id
## 1 3
## 2 4
## 3 1
## 4 3
## 5 3
## 6 4
## 7 2
## 8 1
## 9 1
## 10 4
data$Ob [order (data$id)] <- sequence (table (data$id))
## id Ob
## 1 3 1
## 2 4 1
## 3 1 1
## 4 3 2
## 5 3 3
## 6 4 2
## 7 2 1
## 8 1 2
## 9 1 3
## 10 4 3
(works also with character or factor IDs)
(isn't R just cool!?)
