I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
To subtract each user's measurements from their own measurements (across columns);
To subtract each user's measurements from every other user's measurements (across rows);
To create a final matrix that counts the number of "matches" (comparisons within a threshold) each user has against themselves and others.
Using a threshold of 2.0 units, I have tried to compare each user's measurements against their own and against every other user's by obtaining the absolute differences with a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something like this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, among all pairwise differences of their own 4 measurements, 2 were less than 2.0 units. User 1 also has 8 matches with User 2: each of User 1's measurements was subtracted from each of User 2's measurements (stored as an absolute value), and every difference below 2.0 units was counted as a "match."
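For instance, the U1-vs-U2 count can be reproduced directly with outer() (a minimal sketch; the u1/u2 variable names are mine, taken from the rows above):
u1 <- c(148.2, 148.4, 155.6, 155.7)  # User 1's measurements
u2 <- c(149.5, 150.1, 150.1, 153.9)  # User 2's measurements
# All pairwise absolute differences, then count those under the threshold
sum(abs(outer(u1, u2, "-")) < 2, na.rm = TRUE)
## [1] 8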
I have tried the following nested for-loop; however, I believe it is only counting the number of elements in my matrix instead of counting the matches.
# Set the time threshold.
time_threshold <- 2.000

# Create an empty matrix with one row and one column per user.
matrix_a <- matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))

# Use a nested for-loop to calculate the intra-user and inter-user
# differences, counting values below the threshold for each user comparison.
for (i in 1:nrow(clean_data)) {
  for (j in 1:nrow(clean_data)) {
    matrix_a[i, j] <-
      round(sum(!is.na(abs(clean_data[i, 2:ncol(clean_data)] -
                           clean_data[j, 2:ncol(clean_data)]) <= time_threshold)) / 2)
  }
}
# Dividing by 2 and rounding shows that this code only counts the
# number of elements that are not NA, not the differences below the time_threshold (2.000).
Is there a way to calculate the differences I outlined above that is also more efficient than a nested for-loop?
Note: The structure of these data is relevant only insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculations. Alternatively, I have set them to -0.01, which still has not changed the outcome of my for-loop.
You could write a function to do the loop for you:
fun <- function(index, dat){
  i <- index[1]
  j <- index[2]
  # Within a user (i == j), difference every unordered pair of measurements;
  # between users, difference every cross-pair via expand.grid.
  m <- if (i == j) {
    combn(dat[i, ], 2, function(x) diff(x))
  } else {
    do.call("-", expand.grid(dat[i, ], dat[j, ]))
  }
  # Count absolute differences under the threshold, ignoring NAs.
  sum(abs(m) < 2, na.rm = TRUE)
}
dist_fun <- function(dat){
  dat <- as.matrix(dat)
  result <- diag(0, nrow(dat))
  # Compute only the lower triangle (diagonal included), then mirror it,
  # since the comparison counts are symmetric.
  mat_index <- which(lower.tri(result, TRUE), TRUE)
  result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
  result[mat_index[, 2:1]] <- result[mat_index]
  result
}
dist_fun(df)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
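Here df is assumed to hold only the numeric measurement columns; with the my_data frame defined under the next answer, it could be built as:
df <- my_data[ , -1]  # drop the User column, keep M1..M4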
Here's one tidyverse approach. I convert the data to longer format, then join it to itself twice: once by User (for within-user comparisons) and once as a full cross-join (for between-user comparisons), each time counting the number of matches. Then I combine the two counts and convert to wide format again.
library(tidyverse)

my_data2 <- my_data %>% pivot_longer(-User)

# Within-user comparisons: join by User and keep each unordered pair of
# measurements once (name.x < name.y).
left_join(my_data2, my_data2, by = "User") %>%
  filter(name.x < name.y, abs(value.y - value.x) <= 2) %>%
  count(User) %>%
  select(User.x = User, User.y = User, n) -> compare_across

# Between-user comparisons: cross-join via a dummy key, then drop self-pairs.
my_data3 <- my_data2 %>% mutate(dummy = 1)
inner_join(my_data3, my_data3, by = "dummy") %>%
  filter(abs(value.x - value.y) <= 2, User.x != User.y) %>%
  count(User.x, User.y) -> compare_down

bind_rows(compare_across, compare_down) %>%
  arrange(User.x, User.y) %>%
  pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
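(With dplyr 1.1.0 or later, the dummy-column trick can likely be replaced by cross_join(my_data2, my_data2); the version above also works on older releases.)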
source data:
my_data <- data.frame(
stringsAsFactors = FALSE,
User = c("U1", "U2", "U3", "U4", "U5"),
M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
M2 = c(148.4, 150.1, 154.2, NA, 156.9),
M3 = c(155.6, 150.1, NA, NA, 157.1),
M4 = c(155.7, 153.9, NA, NA, NA)
)
I have the following data frame, where "x" is a grouping variable and "y" some values:
dat <- data.frame(x = c(1, 2, 3, 3, 2, 1), y = c(3, 4, 4, 5, 2, 5))
I want to create a new column where each "y" value is divided by the sum of "y" within each group defined by "x". E.g. the result for the first row is 3 / (3 + 5) = 0.375, where the denominator is the sum of "y" values for group 1 (x = 1).
There are various ways of solving this; here's one:
with(dat, ave(y, x, FUN = function(x) x/sum(x)))
## [1] 0.3750000 0.6666667 0.4444444 0.5555556 0.3333333 0.6250000
Here's another possibility
library(data.table)
setDT(dat)[, z := y/sum(y), by = x]
dat
# x y z
# 1: 1 3 0.3750000
# 2: 2 4 0.6666667
# 3: 3 4 0.4444444
# 4: 3 5 0.5555556
# 5: 2 2 0.3333333
# 6: 1 5 0.6250000
Here's a third one
library(dplyr)
dat %>%
group_by(x) %>%
mutate(z = y/sum(y))
# Source: local data frame [6 x 3]
# Groups: x
#
# x y z
# 1 1 3 0.3750000
# 2 2 4 0.6666667
# 3 3 4 0.4444444
# 4 3 5 0.5555556
# 5 2 2 0.3333333
# 6 1 5 0.6250000
Here are some base R solutions:
1) prop.table Use the base prop.table function with ave like this:
transform(dat, z = ave(y, x, FUN = prop.table))
giving:
x y z
1 1 3 0.3750000
2 2 4 0.6666667
3 3 4 0.4444444
4 3 5 0.5555556
5 2 2 0.3333333
6 1 5 0.6250000
2) sum This also works:
transform(dat, z = y / ave(y, x, FUN = sum))
And of course there's a way for people thinking in SQL; it's very wordy in this case, but it generalises nicely to all sorts of similar problems:
library(sqldf)
dat <- sqldf("
with sums as (
select
x
,sum(y) as sy
from dat
group by x
)
select
d.x
,d.y
,d.y/s.sy as z
from dat d
inner join sums s
on d.x = s.x
")
I'm looking to create a hybrid of cumsum() and TTR::runSum(), where cumsum() runs up until a pre-specified number of data points, at which point it acts more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum),cumsum,runSum)
Is there a way to get to DesiredResult that doesn't require finagling with NAs?
That is what the partial = TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of one value is NA, and we chose IQR since it is a measure of spread that can be calculated for scalars, although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6
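A fourth, package-free possibility (a sketch): take the cumulative sum and subtract a zero-padded, lagged copy of itself.
cs <- cumsum(data)
cs - c(rep(0, n), head(cs, -n))
# [1] 1 3 6 6 6 6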
I have a matrix with a large number of duplicates and would like to obtain a matrix with the unique rows and a frequency count to each unique row.
The example shown below solves this problem but is painfully slow.
# Count how many rows of tbl are equal to the given row.
rowsInTbl <- function(tbl, row){
  sum(apply(tbl, 1, function(x) all(x == row)))
}

# Build a matrix of the unique rows plus a Frequency column.
colFrequency <- function(tblall){
  tbl <- unique(tblall)
  results <- matrix(nrow = nrow(tbl), ncol = ncol(tbl) + 1)
  results[, 1:ncol(tbl)] <- as.matrix(tbl)
  dimnames(results) <- list(rownames(tbl), c(colnames(tbl), "Frequency"))
  # For each unique row, count its occurrences in the full matrix.
  freq <- apply(tbl, 1, function(x) rowsInTbl(tblall, x))
  results[, "Frequency"] <- freq
  return(results)
}
m <- matrix(c(1, 2, 3, 4, 3, 4, 1, 2, 3, 4), ncol = 2, byrow = TRUE)
dimnames(m) <- list(letters[1:nrow(m)], c("c1", "c2"))
print("Matrix")
print(m)
[1] "Matrix"
c1 c2
a 1 2
b 3 4
c 3 4
d 1 2
e 3 4
print("Duplicate frequency table")
print(colFrequency(m))
[1] "Duplicate frequency table"
c1 c2 Frequency
a 1 2 2
b 3 4 3
Here are the speed measurements of the answers of @Heroka and @m0h3n compared to my example. The matrix shown above was repeated 1000 times; data.table is clearly the fastest solution.
[1] "Duplicate frequency table - my example"
user system elapsed
0.372 0.000 0.371
[1] "Duplicate frequency table - data.table"
user system elapsed
0.008 0.000 0.008
[1] "Duplicate frequency table - aggregate"
user system elapsed
0.092 0.000 0.089
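(A sketch of how such timings can be reproduced; the row-replication step and the system.time() wrapper are my assumptions about the setup:)
big <- m[rep(seq_len(nrow(m)), 1000), ]  # repeat the example matrix 1000 times
system.time(colFrequency(big))           # likewise for the other approaches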
Looks like a job for data.table, as you need something that can aggregate quickly.
library(data.table)
m <- matrix(c(1, 2, 3, 4, 3, 4, 1, 2, 3, 4), ncol = 2, byrow = TRUE)
mdt <- as.data.table(m)
# .N counts the rows within each unique combination of all columns.
res <- mdt[, .N, by = names(mdt)]
res
# > res
# V1 V2 N
# 1: 1 2 2
# 2: 3 4 3
How about this using base R for extracting unique rows:
mat <- matrix(c(2, 5, 3, 5, 2, 3, 4, 2, 3, 5, 4, 2, 1, 5, 3, 5), ncol = 2, byrow = TRUE)
mat[!duplicated(mat), ]
# [,1] [,2]
# [1,] 2 5
# [2,] 3 5
# [3,] 2 3
# [4,] 4 2
# [5,] 1 5
Extracting unique rows along with their frequencies:
m <- as.data.frame(mat)
# Group m by all of its own columns; length() returns each group's size.
# Keep the grouping columns plus the first count column.
aggregate(m, by = m, length)[1:(ncol(m) + 1)]
# V1 V2 V1.1
# 1 4 2 2
# 2 2 3 1
# 3 1 5 1
# 4 2 5 1
# 5 3 5 3
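For comparison, the same unique-rows-plus-counts idea in dplyr (a sketch; V1 and V2 are the default column names as.data.frame assigns):
library(dplyr)
as.data.frame(mat) %>% count(V1, V2)  # one row per unique combination, plus a count n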