I am having trouble using R to compare every pair of columns in a matrix. My approach is to compare two whole columns at once, tabulate the TRUE/FALSE results with table(), convert the number of TRUEs to a numeric value, and store that value in the corresponding cell of an incidence matrix.
For example, I have data in this type of format:
//Example state matrix - I am attempting to compare c1 with c2, then c1 with c3, then c1 with c4 and so on and so forth
c1 c2 c3 c4
r1 2 6 3 2
r2 1 1 6 5
r3 3 1 3 6
And I am trying to instead put it into this format
//Example incidence matrix - Which is how many times c1 equaled c2 in the above matrix
c1 c2 c3 c4
c1 3 1 1 1
c2 1 3 0 0
c3 1 0 3 0
c4 1 0 0 3
Here is the code I have come up with so far; however, I keep getting this warning:
Warning message:
In IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances) :
  number of items to replace is not a multiple of replacement length
rawData = read.table("5-14-2014streamW636PPstate.txt")
colnames = names(rawData) #the column names in R
df <- data.frame(rawData)
rats = ncol(rawData)
instances = nrow(rawData)
IncidenceMat = matrix(rep(0, rats), nrow = rats, ncol = rats)
for(rat in rats)
for(r in rats)
if(rat == r){rawData[instance][rat] == rawData[instance][r] something like this would work in C++ if I attempted,
IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances)
} else{
count = df[colnames[rat]] == df[colnames[r]]
c = table(count)
TotTrue = as.numeric(c[2][1])
IncidenceMat[rat][r] = IncidenceMat[rat][r] + TotTrue #count would go here #this should work like a charm as well
}
Any help would be greatly appreciated. I have also looked at several related resources, but I am still stumped.
How about this (note the incidence matrix is symmetric)?
df
c1 c2 c3 c4
r1 2 6 3 2
r2 1 1 6 5
r3 3 1 3 6
incidence <- matrix(rep(0, ncol(df)*ncol(df)), nrow=ncol(df))
diag(incidence) <- nrow(df)
for (i in 1:(ncol(df)-1)) {
for (j in (i+1):ncol(df)) {
incidence[i,j] = incidence[j,i] = sum(df[,i] == df[,j])
}
}
incidence
[,1] [,2] [,3] [,4]
[1,] 3 1 1 1
[2,] 1 3 0 0
[3,] 1 0 3 0
[4,] 1 0 0 3
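For comparison, here is a minimal sketch of the same count without explicit loops. It also sidesteps the two things that most likely tripped up the original attempt: `for (rat in rats)` iterates over a single number rather than `1:rats`, and a matrix cell is addressed as `m[i, j]`, not `m[i][j]`. The object name `state` and the use of `outer()` with `Vectorize()` are my own choices for illustration, not part of the answer above.

```r
# Example state matrix from the question (3 rows, 4 columns)
state <- matrix(c(2, 1, 3,
                  6, 1, 1,
                  3, 6, 3,
                  2, 5, 6),
                nrow = 3,
                dimnames = list(paste0("r", 1:3), paste0("c", 1:4)))

n <- ncol(state)

# For every pair of column indices (i, j), count the rows where the columns agree.
# outer() passes whole index vectors, so the scalar function is wrapped in Vectorize().
count_equal <- Vectorize(function(i, j) sum(state[, i] == state[, j]))
incidence <- outer(seq_len(n), seq_len(n), count_equal)
dimnames(incidence) <- list(colnames(state), colnames(state))

incidence
```

The diagonal equals the number of rows because each column always matches itself, so no special case is needed.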
I have a data frame with several columns. I want to run a function [pmax() in this case] over all columns whose name is stored in a vector except one, and store the result in new separate columns. At the end, I would also like to store the names of all new columns in a separate vector. A minimal example would be:
Name <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
C1 <- c(1, 0, 1, 1, 0)
C2 <- c(0, 1, 1, 1, 0)
C3 <- c(0, 1, 0, 0, 0)
C4 <- c(1, 1, 0, 1, 0)
Data <- data.frame(Name, C1, C2, C3, C4)
var.min <- function(data, col.names){
new.df <- data
# This is how I would do it outside a function and without loop:
new.df$max.def.col.exc.1 <- pmax(new.df$C2, new.df$C3)
new.df$max.def.col.exc.2 <- pmax(new.df$C1, new.df$C3)
new.df$max.def.col.exc.3 <- pmax(new.df$C1, new.df$C2)
new.columns <- c("max.def.col.exc.1", "max.def.col.exc.2", "max.def.col.exc.3")
return(new.df)
}
new.df <- var.min(Data,
col.names= c("C1", "C2", "C3"))
The result should look like:
Name C1 C2 C3 C4 max.def.col.exc.1 max.def.col.exc.2 max.def.col.exc.3
1 Case 1 1 0 0 1 0 1 1
2 Case 2 0 1 1 1 1 1 1
3 Case 3 1 1 0 0 1 1 1
4 Case 4 1 1 0 1 1 1 1
5 Case 5 0 0 0 0 0 0 0
Anyone with an idea? Many thanks in advance!
Here is a base R solution with combn. It gets all pairwise combinations of the column names and calls a function computing pmax.
Note that the order of the expected output columns is the same as the one output by the code below. If the columns vector is c("C1", "C2", "C3"), the order will be different.
Note also that the function is now a one-liner and accepts combinations of any number of columns, 2, 3 or more.
var.min <- function(cols, data) Reduce(pmax, data[cols])
cols <- c("C3", "C2", "C1")
combn(cols, 2, var.min, data = Data)
# [,1] [,2] [,3]
#[1,] 0 1 1
#[2,] 1 1 1
#[3,] 1 1 1
#[4,] 1 1 1
#[5,] 0 0 0
Now it's just a matter of assigning column names and cbinding with the input data.
tmp <- combn(cols, 2, var.min, data = Data)
colnames(tmp) <- paste0("max.def.col.exc.", seq_along(cols))
Data <- cbind(Data, tmp)
rm(tmp) # final clean-up
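The question also asks for the names of the new columns to be kept in a separate vector. A minimal sketch of that, assuming the same data and column order as above; `row_max`, `new.columns` and `result` are illustrative names, and the number of names is taken from the result of combn rather than from the length of `cols`, so it still works when the two differ:

```r
Name <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
C1 <- c(1, 0, 1, 1, 0)
C2 <- c(0, 1, 1, 1, 0)
C3 <- c(0, 1, 0, 0, 0)
C4 <- c(1, 1, 0, 1, 0)
Data <- data.frame(Name, C1, C2, C3, C4)

# Row-wise maximum over an arbitrary set of columns
row_max <- function(cols, data) Reduce(pmax, data[cols])

cols <- c("C3", "C2", "C1")
tmp <- combn(cols, 2, row_max, data = Data)   # one column per pair of names

# Build the names from the number of combinations actually produced
new.columns <- paste0("max.def.col.exc.", seq_len(ncol(tmp)))
colnames(tmp) <- new.columns

result <- cbind(Data, tmp)
new.columns
```

With three input columns there happen to be three pairs, which is why `seq_along(cols)` also works in the answer; with four columns there would be six pairs.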
I want to create a new column based on multiple columns of different data types
Names  1    2    3
A      000  NA   030
B      100  DDD  NA
C      XXX  000  050
Based on columns 1-3, I want to add another column: 1 if any value is >= 30, otherwise 0.
Output will be:
Names  1    2    3    4
A      000  NA   030  1
B      100  DDD  NA   1
C      XXX  000  015  0
Note: There are 36 such columns (1-36) to which I want to apply the condition before creating the new column.
Some more details:
These variables are extracted from one long string such as "030060000XXX010", which is split into 030, 060, 000, XXX, 010. Using an ifelse condition, if any of the number-like values is >= 30 the result should be 1, otherwise 0.
Consider using if_any. It loops over the columns other than 'Names', converts each one to integer and builds a logical condition, replaces NA with FALSE, and the unary + then coerces the logical output of if_any to binary.
library(dplyr)
library(tidyr)
df1 %>%
mutate(new = +(if_any(-Names, ~ replace_na(as.integer(.) >= 30, FALSE) ) ))
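If you would rather avoid the dplyr/tidyr dependency, the same idea can be sketched in base R. The data frame below is my reconstruction of the question's table, with columns named X1 to X3 and one extra hypothetical row "D" added so that a 0 result is visible; neither is from the original post.

```r
# Assumed reconstruction of the question's data; row "D" is hypothetical
df1 <- data.frame(
  Names = c("A", "B", "C", "D"),
  X1 = c("000", "100", "XXX", "010"),
  X2 = c(NA, "DDD", "000", "XXX"),
  X3 = c("030", NA, "050", NA),
  stringsAsFactors = FALSE
)

# Convert every value column to integer; non-numeric strings become NA
# (the coercion warnings are expected, so they are suppressed)
vals <- suppressWarnings(sapply(df1[-1], as.integer))

# 1 if any value in the row is >= 30, otherwise 0; NA never counts as a match
df1$new <- +(rowSums(vals >= 30, na.rm = TRUE) > 0)
df1
```

Because the column selection is just `df1[-1]`, the same lines should apply unchanged when there are 36 value columns instead of 3.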
Since you want to group by 3, one way is to split.default the columns by 3, operate on one three-pack at a time, then combine them later.
I'll demonstrate on the data but repeating the three data columns so that we can show the iteration.
dat <- structure(list(Names = c("A", "B", "C"), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L)), class = "data.frame", row.names = c(NA, -3L))
split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3)
# $`0`
# X1 X2 X3
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
# $`1`
# X1.1 X2.1 X3.1
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
With this, we'll work on one three-pack at a time.
func <- function(x, lim = 30) {
  x <- as.matrix(x)
  x <- `dim<-`(suppressWarnings(as.numeric(x)), dim(x))
  # flag rows where any value is at or above the limit (the question asks for >= 30)
  cbind(x, +(rowSums(x >= lim, na.rm = TRUE) > 0))
}
lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)
# $`0`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 1
# [3,] NA 0 50 1
# $`1`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 1
# [3,] NA 0 50 1
Now we just need to recombine them all again:
do.call(cbind, c(list(dat[,1,drop=FALSE]), lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)))
# Names 0.1 0.2 0.3 0.4 1.1 1.2 1.3 1.4
# 1 A 0 NA 30 1 0 NA 30 1
# 2 B 100 NA NA 1 100 NA NA 1
# 3 C NA 0 50 1 NA 0 50 1
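If the values are still in the concatenated form the question describes ("030060000XXX010"), the flag can also be computed straight from the string. This is only a sketch: the fixed width of 3 characters is an assumption inferred from the example, and `s`, `chunks`, `vals` and `flag` are illustrative names.

```r
s <- "030060000XXX010"

# Cut the string into consecutive 3-character pieces (width is assumed)
starts <- seq(1, nchar(s), by = 3)
chunks <- substring(s, starts, starts + 2)

# Non-numeric pieces such as "XXX" become NA; the warning is expected
vals <- suppressWarnings(as.integer(chunks))

# 1 if any numeric piece is >= 30, otherwise 0
flag <- +any(vals >= 30, na.rm = TRUE)

chunks
flag
```

Both `nchar()` and `substring()` are vectorised, but with a vector of strings of different lengths you would probably wrap these steps in a function and use `sapply()` instead.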
I'm very new to R. I have two matrices of different dimensions, C (3 rows, 79 columns) and T (3 rows, 215 columns). I want my code to calculate the Spearman correlation between the first column of C and every column of T and return the maximum correlation together with the indexes of the two columns, then do the same for the second column of C, and so on. In other words, I want to find the pairs of columns across the two matrices that are most correlated. I hope that is clear.
What I did was a nested for loop, but the result is not what I am looking for.
for (i in 1:79){
for(j in 1:215){
print(max(cor(C[,i],T[,j],method = c("spearman"))))
}
}
You don't have to loop over the columns.
x <- cor(C,T,method = c("spearman"))
out <- data.frame(MaxCorr = apply(x,1,max), T_ColIndex=apply(x,1,which.max),C_ColIndex=1:nrow(x))
head(out)
gives,
MaxCorr T_ColIndex C_ColIndex
1 1 8 1
2 1 1 2
3 1 2 3
4 1 1 4
5 1 11 5
6 1 4 6
Fake Data:
C <- matrix(rnorm(3*79),nrow=3)
T <- matrix(rnorm(3*215),nrow=3)
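For reference, the nested loop from the question can be made to work by storing each correlation and taking the maximum once per column of C, instead of printing inside the inner loop. The sketch below uses smaller fake matrices with 10 rows (my choice, not the question's dimensions) and the illustrative name `res`. With only 3 rows and no ties, a Spearman correlation can only be -1, -0.5, 0.5 or 1, which is probably why MaxCorr is 1 everywhere in the output above and why several columns of T tie for the maximum.

```r
set.seed(42)
C <- matrix(rnorm(10 * 4), nrow = 10)   # 4 columns, illustrative size
T <- matrix(rnorm(10 * 6), nrow = 10)   # 6 columns; note that T masks the TRUE shorthand

res <- data.frame(C_col = seq_len(ncol(C)),
                  T_col = NA_integer_,
                  max_cor = NA_real_)

for (i in seq_len(ncol(C))) {
  r <- numeric(ncol(T))
  for (j in seq_len(ncol(T))) {
    r[j] <- cor(C[, i], T[, j], method = "spearman")
  }
  res$T_col[i] <- which.max(r)   # first column of T that reaches the maximum
  res$max_cor[i] <- max(r)
}

res
```

The matrix call `cor(C, T, method = "spearman")` should give the same numbers in one step, so the loop is mainly useful for seeing what that call computes.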
Maybe something like the function below can solve the problem.
pairwise_cor <- function(x, y, method = "spearman"){
ix <- seq_len(ncol(x))
iy <- seq_len(ncol(y))
t(sapply(ix, function(i){
m <- sapply(iy, function(j) cor(x[,i], y[,j], method = method))
setNames(c(i, which.max(m), max(m)), c("col_x", "col_y", "max"))
}))
}
set.seed(2021)
C <- matrix(rnorm(3*5), nrow=3)
T <- matrix(rnorm(3*7), nrow=3)
pairwise_cor(C, T)
# col_x col_y max
#[1,] 1 1 1.0
#[2,] 2 2 1.0
#[3,] 3 2 1.0
#[4,] 4 3 0.5
#[5,] 5 5 1.0
I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
1. To subtract the user's measurements from their own measurements (across columns);
2. To subtract the user's measurement from other users' measurement points (all included, across rows);
3. To create a final matrix that counts the number of "matches" (comparisons) each user has against themselves and others.
Within a threshold of 2.0 units, I have tried to measure each user's measurement against their own measurement and other users by obtaining the difference with a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something akin to this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, among the pairwise differences between their 4 measurements, 2 were less than 2.0 units. User 1 also has 8 matches with User 2. Each of User 1's measurements was subtracted in turn from each of User 2's measurements (stored as an absolute value), and those differences that were below a value of 2 were considered a "match."
I have tried using the following nested for-loop, however I believe it is only counting the number of elements in my matrix instead of adding the differences.
# Set the time_threshold.
time_threshold <- 2.000
# Create an empty matrix the same dimensions as the number of users present.
matrix_a<-matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))
# Use a nested for-loop to calculate the intra-user
# and inter-user time differences, adding values below
# the threshold up for those user-comparisons.
for (i in 1:nrow(clean_data)) {
for (j in 1:nrow(clean_data)) {
matrix_a[i, j] <-
round(sum(!is.na(abs((clean_data[i, 2:dim(clean_data)[2]]) -
(clean_data[j, 2:dim(clean_data)[2]])
) <= time_threshold)) / 2)
}
}
# Dividing by 2 and rounding has proven that this code only counts the
# number of vectors that are not NA, not the values below by time_threshold (2.000).
Is there a way that can calculate the differences I outlined above, and is also more efficient than a nested for-loop?
Note: The structure of these data is relevant only insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculation. I have also tried setting them to -0.01, which did not change the outcome of my for-loop.
You could write a function to do the loop for you:
fun <- function(index, dat){
i <- index[1]
j <- index[2]
m <- if(i==j) combn(dat[i,],2, function(x)diff(x))
else do.call("-", expand.grid(dat[i, ], dat[j, ]))
sum(abs(m)<2, na.rm = TRUE)
}
dist_fun <- function(dat){
dat <- as.matrix(dat)
result <- diag(0, nrow(dat))
mat_index <- which(lower.tri(result, TRUE), TRUE)
result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
result[mat_index[,2:1]] <- result[mat_index]
result
}
dist_fun(df)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
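To make the counting rule easier to follow, here is a minimal sketch of the same calculation built on `outer()`. The matrix is typed in from the question's example, and `count_matches`, `threshold` and `matches` are illustrative names. A match is a strict `< 2`, as in the function above, and for a user compared with themselves each pair of measurements is counted once.

```r
clean_data <- matrix(c(148.2, 148.4, 155.6, 155.7,
                       149.5, 150.1, 150.1, 153.9,
                       148.4, 154.2,    NA,    NA,
                       154.5,    NA,    NA,    NA,
                       151.1, 156.9, 157.1,    NA),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(paste0("U", 1:5), paste0("M", 1:4)))
threshold <- 2

count_matches <- function(a, b, same_user) {
  # all pairwise absolute differences between the two users' measurements
  close <- abs(outer(a, b, "-")) < threshold
  # within one user, keep each unordered pair once and drop self-comparisons
  if (same_user) close <- close[upper.tri(close)]
  sum(close, na.rm = TRUE)   # comparisons involving NA are ignored
}

n <- nrow(clean_data)
matches <- matrix(0L, n, n,
                  dimnames = list(rownames(clean_data), rownames(clean_data)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    matches[i, j] <- count_matches(clean_data[i, ], clean_data[j, ], i == j)
  }
}
matches
```

The loop over user pairs is kept for readability; the per-pair work is vectorised, which is usually where most of the time goes.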
Here's one tidyverse approach. I convert the data to longer format, then join it to itself by User (across) and by time point (down), each time counting the number of matches. Then I combine the two and convert to wide format again.
library(tidyverse)
my_data2 <- my_data %>% pivot_longer(-User)
left_join(my_data2, my_data2, by = "User") %>%
filter(name.x < name.y, abs(value.y - value.x) <= 2) %>% # EDIT
count(User) %>%
select(User.x = User, User.y = User, n) -> compare_across
my_data3 <- my_data2 %>% mutate(dummy = 1) # EDIT
inner_join(my_data3, my_data3, by = "dummy") %>% # EDIT
filter(abs(value.x - value.y) <=2, User.x != User.y) %>%
count(User.x, User.y) -> compare_down
bind_rows(compare_across, compare_down) %>%
arrange(User.x, User.y) %>%
pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
source data:
my_data <- data.frame(
stringsAsFactors = FALSE,
User = c("U1", "U2", "U3", "U4", "U5"),
M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
M2 = c(148.4, 150.1, 154.2, NA, 156.9),
M3 = c(155.6, 150.1, NA, NA, 157.1),
M4 = c(155.7, 153.9, NA, NA, NA)
)
I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values in a particular row from multiple other rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing a blank value subtracted from it, where the blank is the one that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from the other rows (A, B, C, and D), using the c5 factor to define which Blank values are used (aa or bb). The "Blank" values should be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the blanks and the samples in a single data structure. Then apply the function:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing. But this suffices. I'd also argue that there's no reason to keep the labels "aa" v "bb" in the initial data structure either, which would make this simpler; but it's your choice.
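One possible vectorized version, as a sketch only: look up each sample's blank row with `match()` on c5 and subtract all value columns at once. The data frame is typed in from the question's table; `value_cols`, `blanks`, `samples` and `idx` are illustrative names, and the sketch assumes exactly one Blank row per c5 level.

```r
df <- data.frame(
  c1 = c("A", "B", "C", "D", "Blank", "Blank"),
  c2 = c(1, 2, 3, 4, 2, 3),
  c3 = c(2, 3, 4, 1, 3, 4),
  c4 = c(3, 4, 5, 6, 4, 5),
  c5 = c("aa", "bb", "aa", "bb", "aa", "bb"),
  stringsAsFactors = FALSE
)

value_cols <- c("c2", "c3", "c4")   # with >5000 columns, build this vector programmatically

blanks  <- df[df$c1 == "Blank", ]
samples <- df[df$c1 != "Blank", ]

# For each sample row, the position of the blank row with the same c5 value
idx <- match(samples$c5, blanks$c5)

# Subtract the matching blank from every value column in one step
samples[value_cols] <- samples[value_cols] - blanks[idx, value_cols]
samples
```

If a c5 level has no Blank row, `match()` returns NA and the corresponding sample values become NA, which at least makes the gap visible.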