R - looping through column differentials

R - looping through column differentials - r

let's assume we have 4 vectors
a <- c(200,204,209,215)
b <- c(215,220,235,245)
c <- c(230,236,242,250)
d <- c(240,242,243,267)
I basically want to create a loop which creates the differentials between each pair, and then calculate the Z scores for those differentials. So something like scale(d-a). How do I create the loop that basically goes scale(b-a), then scale(c-a), scale(d-a) etc? many thanks.

Single named variables don't lend themselves too well to "looping".
Let's use a list() of vectors instead:
vecs <- list(
a = c(200,204,209,215),
b = c(215,220,235,245),
c = c(230,236,242,250),
d = c(240,242,243,267)
)
This allows us to apply a function to all pairs using combn
scale_diff <- function(subset) {
z <- scale(subset[[1]] - subset[[2]])
colnames(z) <- paste(names(subset), collapse = " - ")
z
}
z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)
Now z_scores is a list of 6 matrices (column vectors). The column names show you which vectors were subtracted before scaling.

We can place it in a list and use combn to get the combinations and then apply the difference
lst1 <- list(a = a, b = b, c = c, d = d)
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x))[,1])
colnames(out) <- combn(names(lst1), 2, FUN = paste, collapse='_')
out
# a_b a_c a_d b_c b_d c_d
#[1,] 0.9108601 1.2009612 0.1290994 -0.7643506 -0.753390 -0.2219686
#[2,] 0.7759179 0.2401922 0.3872983 -0.9441978 -0.360317 0.3699477
#[3,] -0.5735045 -0.2401922 0.9036961 0.6744270 1.474024 1.1098432
#[4,] -1.1132735 -1.2009612 -1.4200939 1.0341214 -0.360317 -1.2578222
As #AlexR mentioned in the comments, if the attributes are important, then remove [,1] and keep it as a matrix of 1 column
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x)), simplify = FALSE)

Related

Colnames loop in R, combining characters with variables

I'm new to R (and pretty new to programing in general) an am struggeling to write a loop to rename columns.
My goal is to rename the colums (f1w1, f1w2, f1w3, f2w1...)
Here's what I've got:
f <- 1
w <- 1
x <- 1
while(x<ncol(wdbc)){
while (w < 4) {
x <- (f-1)*3+w
print (x)
w <- w + 1
colnames(wdbc)[x] <- ("f" f " w" w)
}
w <- 1
f <- f + 1
}
f & w are the variables for the column names, x is the column I want to rename. R doesn't like how I tried to combine the variables with text in the name. I played around with "pasteO()" and "cat()" but I don't get it to work.
Thanks!

You can do this without a loop. For example with outer :
f <- 1:4
w <- 1:3
nm <- c(t(outer(paste0('f', f), paste0('w', w), paste0)))
nm
#[1] "f1w1" "f1w2" "f1w3" "f2w1" "f2w2" "f2w3" "f3w1" "f3w2" "f3w3" "f4w1" "f4w2" "f4w3"
You can then assign these names as column names directly.
colnames(wdbc) <- nm

You can also use sapply function to get the necessary column names.
f <- 1:4
w <- 1:3
new_col <- sapply(f, function(x) paste0("f",x,"w",w))
new_col
# [,1] [,2] [,3] [,4]
#[1,] "f1w1" "f2w1" "f3w1" "f4w1"
#[2,] "f1w2" "f2w2" "f3w2" "f4w2"
#[3,] "f1w3" "f2w3" "f3w3" "f4w3"

Perform same operation to multiple variables, assigning result

I'm working on a project where I have to apply the same transformation to multiple variables. For example
a <- a + 1
b <- b + 1
d <- d + 1
e <- e + 1
I can obviously perform the operations in sequence using
for (i in c(a, b, d, e)) i <- i + 1
However, I can't actually assign the result to each variable this way, since i is a copy of each variable, not a reference.
Is there a way to do this? Obviously, it'd be easier if the variables were merged in a data.frame or something, but that's not possible.

Usually if you find yourself doing the same thing to multiple objects, they should be stored / thought-of as single object with sub-components. You say that storing these as a data.frame is not possible, so you can use a list instead. This allows you to use lapply/sapply to apply a function to each element of the list in one step.
a <- c(1, 2, 3)
b <- c(1, 4)
c <- 5
d <- rnorm(10)
e <- runif(5)
lstt <- list(a = a, b = b, c = c, d = d, e = e)
lstt$a
# [1] 1 2 3
lstt <- lapply(lstt, '+', 1)
lstt$a
# [1] 2 3 4

The question states that the variables to increment cannot be in a larger structure but then in the comments it is stated that that is not so after all so we will assume they are in a list L.
L <- list(a = 1, b = 2, d = 3, e = 4) # test data
for(nm in names(L)) L[[nm]] <- L[[nm]] + 1
# or
L <- lapply(L, `+`, 1)
# or
L <- lapply(L, function(x) x + 1)
Scalars
If they are all scalars then they can be put in an ordinary vector:
v <- c(a = 1, b = 2, d = 3, e = 4)
v <- v + 1
Vectors
If they are all vectors of the same length they can be put in data frame or if they are also of the same type they can be put in a matrix in which case we can also add 1 to it.
Environment
If the variables do have to be free in an environment then if nms is a vector of the variable names then we can iterate over the names and use those names to subscript the environment env. If the names follow some pattern we may be able to use nms <- ls(pattern = "...", envir = env) or if they are the only variables in that environment we can use nms <- ls(env).
a <- b <- d <- e <- 1 # test data
env <- .GlobalEnv # can change this if not being done in global envir
nms <- c("a", "b", "d", "e")
for(nm in nms) env[[nm]] <- env[[nm]] + 1
a;b;d;e # check
## [1] 2
## [1] 2
## [1] 2
## [1] 2

Find combination of n vectors across k dataframes with highest correlation

Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n number of vectors in each data frame, that have the highest correlation among each other. For this simple example, let's say want to find the n = 1 vectors in each of the k = 4 data frames, that show the overall strongest, positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but the correlation between data frames, since i wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setC$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) where (... for data frames) which would return a list of the highest correlating vector names.

Base on your example I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach I think gets what you want.
So base on your 4 data frames a creates the list_df and then in the function I just generate all the possible combinations of variables an calculate their correlation. At the end I select the n combinations with highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}

You could use comb function:
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3]#Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949

Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define function to help with aggregating the correlations.
# We will call them for different combinations of selected columns from each df later
# cmbns: given a df corresponding columns to be selected each data frame
# (i-th row corresponds to i-th df),
# return the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggragation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3

Compare every n rows and show boolean vector

I have similar issue like in this questions Compare every 2 rows and show mismatches in R
I would like to compare not only 2 rows but for example 3, 4, etc.
I have a data.table here:
DT <- data.table(A = rep(1:2, 2), B = rep(1:4, 2),
C = rep(1:2, 1), key = "A")
Then I use
dfs <- split(DT, DT$A)
comp <- function(x) sapply(x, function(u) u[1]==u[2])
matches <- sapply(dfs, comp)
For 3 rows :
comp <- function(x) sapply(x, function(u) u[1]==u[2] & u[1]==u[3])
Is that accurate? How can I generalize it in more elegant way?

try this:
comp2 <- function(dt, i, rws){
k <- length(rws)
tmp <- as.numeric(dt[i])
tmp <- as.data.table(matrix(rep(tmp, k), nrow = k, byrow = TRUE, dimnames = list(NULL, colnames(dt))))
ans <- (dt[rws] == tmp)
ans
}
this function takes three arguments:
-> dt your data.table (or sub-data.tables obtained from splitting your original one, up to you)
-> i -- row you want to compare
-> rws -- vector of row numbers you want to compare i with (e.g. c(2,3,4) would compare i with rows 2, 3 and 4
it then creates a new data.table that consists of row i stacked k times, so a data.frame to data.frame comparison is possible.
example:
comp2(DT, 1, c(2, 3, 4))
# A B C
#[1,] TRUE FALSE TRUE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
compares row 1 of your data.table DT to rows 2, 3 and 4.
if you want your output to tell you whether your chosen row differs from at least one of the rows you are comparing it to, then you need an extra operation colSums(ans) == k instead of ans.

R apply function to selected pairs or rows

I have a matrix M and a matrix L that contains 'pair of rows indexes' that i need to select in M, in order to apply a function. The function returns a matrix with 2 rows and the same number of columns of M:
set.seed(1)
# M has even number of rows
M = matrix(runif(24), ncol = 3)
# each row of L is a pair of row indexes that will be selected in M
# so the first 'pair' is M[1,] M[3,], the second is M[2,] M[4,] and so on
L = matrix(c(1,3,2,4,6,7), ncol = 2, byrow = T)
The function f is:
f = function(r1, r2)
{
matrix(c(r1+r2, r1-r2), nrow = 2, byrow = T)
}
The thing is that a need to loop over L, apply f for each 'pair' and append the results to another matrix. So, for the code above the final result would be:
#first row of L
res1 = f(M[1,], M[3,])
#second row of L
res2 = f(M[2,], M[4,])
#third row of L
res3 = f(M[6,], M[7,])
#append everything
RES = rbind(res1, res2, res3)
How can i vectorize this operation? The rows indexes in L are random, and the row order of the final result didn't matter.
Thanks for any help!

What if you wrap your f function in something that takes the matrix M as an additional argument:
fm <- function(rowVector, mat) {
f(mat[rowVector[1],], mat[rowVector[2],])
}
then call it with apply:
apply(L, 1, fm, mat=M)
[,1] [,2] [,3]
[1,] 0.8383620 1.2803317 1.84306495
[2,] -0.3073447 -0.5360839 -0.04628558
[3,] 0.8350886 0.2383430 1.15394514
[4,] 0.4231395 -0.1147705 -0.38573770
[5,] 1.0976537 1.7693513 0.86381629
[6,] 0.3375833 0.2144609 -0.43953124

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R - looping through column differentials - r

Related

Colnames loop in R, combining characters with variables

Perform same operation to multiple variables, assigning result

Find combination of n vectors across k dataframes with highest correlation

Compare every n rows and show boolean vector

R apply function to selected pairs or rows

Categories

Resources