Loop through matrix in R and calculate measurement differences between all users

I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
1. Subtract each user's measurements from their own measurements (across columns);
2. Subtract each user's measurements from every other user's measurements (all included, across rows);
3. Create a final matrix that counts the number of "matches" (comparisons within the threshold) each user has against themselves and against others.
Using a threshold of 2.0 units, I have tried to compare each user's measurements against their own and against other users' measurements by computing the differences with a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something like this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, out of all the pairwise differences among their 4 measurements, 2 were less than 2.0 units. User 1 also has 8 matches with User 2: each of User 1's measurements was subtracted in turn from each of User 2's measurements (stored as an absolute value), and the differences below 2.0 were counted as "matches."
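To make the counting rule concrete, here is a quick sketch that reproduces the two U1 cells by hand from the values above:

u1 <- c(148.2, 148.4, 155.6, 155.7)
u2 <- c(149.5, 150.1, 150.1, 153.9)
# U1 vs U2: every measurement of U1 against every measurement of U2
sum(abs(outer(u1, u2, "-")) < 2)   # 8 matches
# U1 vs U1: each unordered pair of U1's own measurements
sum(abs(combn(u1, 2, diff)) < 2)   # 2 matches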
I have tried using the following nested for-loop; however, I believe it is only counting the number of elements in my matrix rather than counting the matching differences.
# Set the time_threshold.
time_threshold <- 2.000

# Create an empty matrix the same dimensions as the number of users present.
matrix_a <- matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))

# Use a nested for-loop to calculate the intra-user
# and inter-user time differences, adding values below
# the threshold up for those user-comparisons.
for (i in 1:nrow(clean_data)) {
  for (j in 1:nrow(clean_data)) {
    matrix_a[i, j] <-
      round(sum(!is.na(abs((clean_data[i, 2:dim(clean_data)[2]]) -
                           (clean_data[j, 2:dim(clean_data)[2]])
                           ) <= time_threshold)) / 2)
  }
}
# Dividing by 2 and rounding shows that this code only counts the number of
# comparisons that are not NA, not the number of differences below time_threshold (2.000).
Is there a way to calculate the differences I outlined above that is also more efficient than a nested for-loop?
Note: The structure of these data is only relevant insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculation. Alternatively, I have set them to -0.01, which still did not change the outcome of my for-loop.

You could write a function to do the loop for you:
fun <- function(index, dat){
  i <- index[1]
  j <- index[2]
  m <- if (i == j) {
    combn(dat[i, ], 2, function(x) diff(x))
  } else {
    do.call("-", expand.grid(dat[i, ], dat[j, ]))
  }
  sum(abs(m) < 2, na.rm = TRUE)
}
dist_fun <- function(dat){
  dat <- as.matrix(dat)
  result <- diag(0, nrow(dat))
  mat_index <- which(lower.tri(result, TRUE), TRUE)
  result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
  result[mat_index[, 2:1]] <- result[mat_index]
  result
}
dist_fun(df)   # df holds just the numeric measurement columns (no User column)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
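If you prefer, the same per-pair counts can also be written with outer() instead of combn()/expand.grid(). This is just a sketch of an equivalent helper (fun_outer is my own name, not part of the answer above):

# Sketch: outer()-based equivalent of fun(), all pairwise differences at once.
fun_outer <- function(index, dat, threshold = 2) {
  i <- index[1]
  j <- index[2]
  d <- abs(outer(dat[i, ], dat[j, ], "-"))  # matrix of absolute differences between the two rows
  if (i == j) {
    # within one user, count each unordered pair of measurements once
    sum(d[upper.tri(d)] < threshold, na.rm = TRUE)
  } else {
    # across users, every combination of measurements counts
    sum(d < threshold, na.rm = TRUE)
  }
}

Swapping fun for fun_outer inside dist_fun() should give the same matrix.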

Here's one tidyverse approach. I convert the data to longer format, then join it to itself by User (across) and by time point (down), each time counting the number of matches. Then I combine the two and convert to wide format again.
library(tidyverse)

my_data2 <- my_data %>% pivot_longer(-User)

left_join(my_data2, my_data2, by = "User") %>%
  filter(name.x < name.y, abs(value.y - value.x) <= 2) %>%   # EDIT
  count(User) %>%
  select(User.x = User, User.y = User, n) -> compare_across

my_data3 <- my_data2 %>% mutate(dummy = 1)                   # EDIT
inner_join(my_data3, my_data3, by = "dummy") %>%             # EDIT
  filter(abs(value.x - value.y) <= 2, User.x != User.y) %>%
  count(User.x, User.y) -> compare_down

bind_rows(compare_across, compare_down) %>%
  arrange(User.x, User.y) %>%
  pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
source data:
my_data <- data.frame(
  stringsAsFactors = FALSE,
  User = c("U1", "U2", "U3", "U4", "U5"),
  M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
  M2 = c(148.4, 150.1, 154.2, NA, 156.9),
  M3 = c(155.6, 150.1, NA, NA, 157.1),
  M4 = c(155.7, 153.9, NA, NA, NA)
)

Related

How to change specific values in a dataframe

Could anyone explain how to change the negative values in the data frame below?
We have been asked to create a data structure that gives the following output.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I have tried many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letter column.
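For reference, a data frame matching the printed output can be built like this (an assumed reconstruction; note that column x is character):

df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))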
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find out the numeric columns of the dataframe and multiply the negative values by 10.
cols <- sapply(df, is.numeric)
# Multiply negative values by 10 and positive values by 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))

Correlation between two matrices of different dimensions

I'm very new to R. I have two matrices of different dimensions, C (3 rows, 79 columns) and T (3 rows, 215 columns). I want my code to calculate the Spearman correlation between the first column of C and all the columns of T, and return the maximum correlation along with the index of the matching column. Then the second column of C against all the columns of T, and so on. In fact, I want to find the columns of the two matrices that are most correlated. I hope that is clear.
What I did was a nested for-loop, but the result is not what I am looking for.
for (i in 1:79){
  for (j in 1:215){
    print(max(cor(C[, i], T[, j], method = c("spearman"))))
  }
}
You don't have to loop over the columns.
x <- cor(C, T, method = c("spearman"))
out <- data.frame(MaxCorr = apply(x, 1, max),
                  T_ColIndex = apply(x, 1, which.max),
                  C_ColIndex = 1:nrow(x))
head(out)
gives,
MaxCorr T_ColIndex C_ColIndex
1 1 8 1
2 1 1 2
3 1 2 3
4 1 1 4
5 1 11 5
6 1 4 6
Fake Data:
C <- matrix(rnorm(3*79),nrow=3)
T <- matrix(rnorm(3*215),nrow=3)
Maybe something like the function below can solve the problem.
pairwise_cor <- function(x, y, method = "spearman"){
  ix <- seq_len(ncol(x))
  iy <- seq_len(ncol(y))
  t(sapply(ix, function(i){
    m <- sapply(iy, function(j) cor(x[, i], y[, j], method = method))
    setNames(c(i, which.max(m), max(m)), c("col_x", "col_y", "max"))
  }))
}
set.seed(2021)
C <- matrix(rnorm(3*5), nrow=3)
T <- matrix(rnorm(3*7), nrow=3)
pairwise_cor(C, T)
# col_x col_y max
#[1,] 1 1 1.0
#[2,] 2 2 1.0
#[3,] 3 2 1.0
#[4,] 4 3 0.5
#[5,] 5 5 1.0

Remove rows with zero-variance in R

I have a dataframe of survey responses (rows = participants, columns = question responses). Participants respond to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 across all 50 questions, as they have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R.
Many thanks,
R
If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
                 col2 = 1:10,
                 col3 = rep(1:2, 5))
You could calculate the variance of each column and keep only those columns where the variance is not 0, or is greater than or equal to a certain threshold; the latter is close to what nearZeroVar() does:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
idx <- which(apply(survey[,-1], 1, function(x) all(x == 5)) == T)
survey[-idx,]
This will remove rows where all values equal 5.
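With the survey data above, idx flags the two all-fives responders, so rows 3 and 5 are dropped:

idx
# [1] 3 5

(If no row matched, idx would be empty and survey[-idx, ] would drop every row, so it can be worth checking length(idx) before subsetting.)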
# Dummy data:
df <- data.frame(
  matrix(
    sample(1:5, 100000, replace = TRUE),
    ncol = 5
  )
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
likert1 likert2 likert3 likert4 likert5 id
1 1 2 4 4 5 1
2 5 4 2 2 1 2
3 2 1 2 1 5 3
4 5 1 3 3 2 4
5 4 3 3 5 1 5
6 1 3 3 2 3 6
dim(df)
[1] 20000 6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
[1] 19995
Stealing @AshOfFire's data, with a small modification since you say you only have answers in the columns, not participants:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey==survey[[1]],1,all),]
# q1 q2 q3
# 1 1 1 3
# 4 5 5 4
# 6 1 1 5
# 10 2 3 5
The equality test builds a data frame filled with Booleans; then, with apply(), we keep the rows that aren't all TRUE.
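As a sketch of the intermediate objects with the survey data above:

# logical table: is each answer equal to the q1 answer in the same row?
survey == survey[[1]]
# rows where not every entry is TRUE are the ones we keep
which(!apply(survey == survey[[1]], 1, all))
# [1]  1  4  6 10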

How to make this code more efficient in R?

I know this is a basic question, but I'm frustrated with my code because it takes so much time. Here is one part of it.
Basically, I have a matrix called "distance":
a b c
1 2 5 7
2 6 8 4
3 9 2 3
and then let's say I have a data frame whose columns contain values from {a, b, c}:
c1 c2 c3
c  .. ..
a  .. ..
a  .. ..
b  .. ..
c  .. ..
(c2 and c3 are just other columns of the same kind)
So I want to do a match: I want to make another matrix with ncol = nrow(distance) and nrow = nrow(c1), where each factor value is replaced with its distance value. Here's an example for the first column of the matrix I'm going to make:
a will be replaced by 2
b will be replaced by 5
c will be replaced by 7
and for the second column, I will take row number 2 from the distance matrix, and so on, so the result will look like this:
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just a small example. I'm running the code below, but when it has to deal with many iterations it becomes painfully slow.
for (l in 1:ncol(d.cat)) {
  get.unique = sort(unique(d.cat[, l]))
  for (j in 1:nrow(d.cat)) {
    value = as.character(d.cat[j, l])
    index = which(get.unique == value)
    d2[j, l] = (d[[l]][i, index])
  }
}
d.cat is the categorical data, and d[[...]] is a list of distance matrices, one for every column of d.cat.
Try storing the indices and doing the updating in one go. Let's say your distance matrix is dmat, your data frame is df, and you want to create a matrix named newmat:
a.ind <- which(df$c1 == "a")
b.ind <- which(df$c1 == "b")
c.ind <- which(df$c1 == "c")
newmat <- matrix(0, nrow = length(df$c1), ncol = 3)
# every row matching a level gets the corresponding column of dmat
newmat[a.ind, ] <- matrix(dmat[, 1], nrow = length(a.ind), ncol = 3, byrow = TRUE)
newmat[b.ind, ] <- matrix(dmat[, 2], nrow = length(b.ind), ncol = 3, byrow = TRUE)
newmat[c.ind, ] <- matrix(dmat[, 3], nrow = length(c.ind), ncol = 3, byrow = TRUE)
Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
[,1] [,2] [,3]
a 1 2 3
c 7 8 9
b 4 5 6
c 7 8 9
c 7 8 9
a 1 2 3
b 4 5 6
c 7 8 9
b 4 5 6
b 4 5 6
Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace = TRUE), nrow = 4)) %>%
  setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate through columns of df and columns of tmat
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.
Here is my attempt using the tidyverse:
library(tidyverse)

# Let's create some example data
distance <- data_frame(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- data_frame(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))

# First rearrange your data a little to make it more tidy
distance2 <- distance %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)
c2 <- c1 %>%
  mutate(i = seq_len(n())) %>%
  gather(col, value, -i)

# Now just join the data and spread it again
c2 %>%
  left_join(distance2, by = c("i", "value" = "col")) %>%
  select(i, col, value.y) %>%
  spread(col, value.y)

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the variance would be var(15, 32, 16), and for ID 4 it would again return NA because there are no non-zero values at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len){
  # get all nonzero values in the ith row of the data into a vector nonzerodat here
  currentvar <- var(nonzerodat)
  variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)

variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
