Pairwise operation on columns in an R data frame

I have a dataframe like this
sample <- data.frame(x1=c(1,2),y1=c(2,1), x2=c(2,4),y2=c(3,4),x3=c(5,2),y3=c(1,6))
How can I operate pairwise (sum up x1&y1, x2&y2, x3&y3, 2 columns at a time) to create 3 new columns sum1, sum2, sum3? Thanks

Method 1
Since it is pairwise, you can do it with mapply.
Create two vectors referring to the pairs
p1 <- seq(1, 6, by = 2)
p2 <- seq(2, 6, by = 2)
Then use mapply to apply pairwise summation for the columns desired:
mapply(x = p1, y = p2, function(x, y) sample[[x]] + sample[[y]])
Result:
     [,1] [,2] [,3]
[1,]    3    5    6
[2,]    3    8    8
Method 2
If you need the pairwise results added to the sample table itself, I also like to use the dplyr and wrapr packages together.
require(dplyr)
require(wrapr)
newcols <- paste0(names(sample)[seq(1, 6, by = 2)], names(sample)[seq(2, 6, by = 2)])
for (i in 1:3) {
  ## pair i is made of columns 2i - 1 and 2i
  wrapr::let(list(RES = newcols[i],
                  COL1 = names(sample)[2 * i - 1],
                  COL2 = names(sample)[2 * i]),
             sample <- dplyr::mutate(sample, RES = COL1 + COL2))
}
sample
  x1 y1 x2 y2 x3 y3 x1y1 x2y2 x3y3
1  1  2  2  3  5  1    3    5    6
2  2  1  4  4  2  6    3    8    8
I like to use those packages because I find the code easier to understand. But if you can't install them for any reason, you can do it with base R:
newcols <- paste0(names(sample)[seq(1, 6, by = 2)], names(sample)[seq(2, 6, by = 2)])
for (i in 1:3) {
  ## pair i is made of columns 2i - 1 and 2i
  sample[[newcols[i]]] <- sample[[2 * i - 1]] + sample[[2 * i]]
}
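For comparison, the same pairing can probably be written without an explicit loop. This is only a sketch: the data frame is called dat here (a hypothetical name, chosen so that base::sample is not masked), and it assumes the columns strictly alternate x, y, x, y as in the question.

```r
# Same data as in the question, under the hypothetical name `dat`
dat <- data.frame(x1 = c(1, 2), y1 = c(2, 1),
                  x2 = c(2, 4), y2 = c(3, 4),
                  x3 = c(5, 2), y3 = c(1, 6))

odd  <- seq(1, ncol(dat), by = 2)   # positions of x1, x2, x3
even <- seq(2, ncol(dat), by = 2)   # positions of y1, y2, y3

# Adding two data frames of the same shape works element by element,
# so each x column is summed with its y partner in one step
sums <- dat[odd] + dat[even]
names(sums) <- paste0("sum", seq_along(odd))

dat <- cbind(dat, sums)
dat
```

The new columns are named sum1, sum2, sum3, as asked in the question. If the columns were not interleaved, odd and even would have to be built from the column names instead.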

How about this?
sample <- data.frame(x1=c(1,2),y1=c(2,1),
x2=c(2,4),y2=c(3,4),x3=c(5,2),y3=c(1,6))
sample_new <- data.frame(x1=c(1,2),y1=c(2,1),
x2=c(2,4),y2=c(3,4),x3=c(5,2),y3=c(1,6),sumx1y1=(sample$x1 + sample$y1),
sumx2y2=(sample$x2 + sample$y2),sumx3y3=(sample$x3 + sample$y3))

Related

Replace integers in a data frame column with other integers in R?

I want to replace a vector in a dataframe that contains only 4 numbers to specific numbers as shown below
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 = 10; 2 = 200, 3 = 458, 4 = -0.1
You could use recode from dplyr. Note that the old values are written as character strings, and the new values are numeric:
library(tidyverse)
df %>%
mutate(tt = recode(tt, '1'= 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and provide for a shorter example we use the input in the Note at the end. Here are several alternatives. nos defined in (1) is used in some of the others too. No packages are used.
1) indexing To get the result since the input is 1 to 4 we can use indexing. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
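As a rough illustration of the match() idea in (1a), here is a sketch for the case where the codes are not 1:4. The data frame codes_df and the vectors old and new are made up for this example and do not come from the question.

```r
# Hypothetical input whose codes are not 1:4
codes_df <- data.frame(tt = c(5, 5, 7, 9, 11, 7))

old <- c(5, 7, 9, 11)          # values to look for (assumed for illustration)
new <- c(10, 200, 458, -0.1)   # replacements, in the same order as `old`

# match() gives the position of each tt value in `old`; indexing `new`
# by that position performs the replacement. A value that does not
# appear in `old` would come out as NA.
recoded <- transform(codes_df, tt = new[match(tt, old)])
recoded$tt
```

Keeping old and new as two parallel vectors means the mapping can be changed in one place without touching the transform() call.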

Computing number of bits that are set to 1 for matching rows in terms of hamming distance between two data frames

I have two data frames of same number of columns (but not rows) df1 and df2. For each row in df2, I was able to find the best (and second best) matching rows from df1 in terms of hamming distance, in my previous post. In that post, we have been using the following example data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
I now need to compute the number of bits equal to 1 for:
1> each row in df2
2> the best matching rows in df1
3> the second best matching rows in df1
The number of bits equal to 1 in an integer a may be computed as
sum(as.integer(intToBits(a)))
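To make that expression concrete, here is a small sketch that wraps it in a helper and applies it to several integers. The name popcount is hypothetical and not part of base R.

```r
# Hypothetical helper: number of set bits for each element of an integer vector
popcount <- function(a) {
  vapply(a, function(v) sum(as.integer(intToBits(v))), integer(1))
}

# 9 is 1001 in binary, 7 is 111, 0 has no set bits
popcount(c(9L, 7L, 0L))
```

intToBits() always returns 32 bits per integer, which is why each value is handled separately here instead of summing over the whole vector at once.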
I have applied this to @ZheyuanLi's original function, so I have got item 1>. However, I'm unable to apply the same logic to get items 2> and 3> by a simple modification of @ZheyuanLi's function.
Below are @ZheyuanLi's functions with my modification:
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
foo <- function(df1, df2, p = 2) {
## check p: it cannot exceed the number of candidate rows in df1
if (p > nrow(df1)) p <- nrow(df1)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb[i] <- set.bits
k <- k + p
}
## recode "id", "d" and "sb" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb <- as.data.frame(matrix(sb, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb) <- "set.bits.1"
list(id = id, d = d, sb = sb)
}
Running these gives:
> foo(df1, df2)
$id
min.1 min.2 ## row id for best/second best match in df1
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2 ## minimum 2 hamming distance
1 2 2
2 1 3
3 1 3
$sb
set.bits.1 ## number of bits equal to 1 for each row of df2
1 3
2 2
3 4
OK, after reading through your question while re-editing it (many times!), I think I know what you want. Essentially, we don't need to change anything in hmd(). Your required items 1>, 2>, 3> can all be computed after the for loop in foo().
To get item 1>, which you called sb, we can use a tapply(). However, your computation of sb along the for loop is fine, so I will not change it. In the following, I will demonstrate the basic procedure to get item 2> and item 3>.
The id vector inside foo() stores all matching rows in df1:
id <- c(1, 4, 2, 3, 5, 2)
so we can simply extract those rows of df1 (actually, columns of xt) to compute the number of bits equal to 1. As you can see, there are many duplicates in id, so we only need to compute on unique(id):
id0 <- sort(unique(id))
## [1] 1 2 3 4 5
We now extract those subset columns of xt:
sub_xt <- xt[, id0]
## [,1] [,2] [,3] [,4] [,5]
## V1 9 3 10 5 6
## V2 2 4 8 7 1
To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() and a vectorized approach.
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
## [1] 3 3 3 5 3
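The grouped sum may be easier to follow on a tiny input. The sketch below uses a 2 x 2 matrix m (a made-up stand-in for sub_xt, holding its first two columns) and compares the tapply() grouping with a reshape-and-colSums alternative, which is my own variation and not part of the answer.

```r
# A small matrix in the same layout as sub_xt: one column per df1 row
m <- matrix(c(9, 2, 3, 4), nrow = 2)

# 32 bits per entry, laid out column by column
bits <- as.integer(intToBits(as.numeric(m)))
per_col <- length(bits) / ncol(m)   # number of bits belonging to each column

# Grouped sum, as in the answer
by_tapply <- unname(tapply(bits, rep(seq_len(ncol(m)), each = per_col), sum))

# Alternative: reshape so each column holds one group, then use colSums
by_colsums <- colSums(matrix(bits, nrow = per_col))

by_tapply
by_colsums
```

Both give 3 and 3 here, matching the first two entries of sbxt0 above.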
Now we need to map sbxt0 to sbxt:
sbxt <- sbxt0[match(id, id0)]
## [1] 3 5 3 3 3 3
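The "compute once per unique id, then expand with match()" step is a general pattern. Here is a minimal sketch of it in isolation; slow_square is a hypothetical stand-in for any per-id computation that is worth doing only once.

```r
id  <- c(1, 4, 2, 3, 5, 2)     # matched row ids, with repeats
id0 <- sort(unique(id))        # each id once: 1 2 3 4 5

# Stand-in for an expensive per-id computation (hypothetical)
slow_square <- function(i) i^2

res0 <- slow_square(id0)       # computed once per distinct id
res  <- res0[match(id, id0)]   # expanded back to the original order
res
```

The result is the same as calling slow_square(id) directly, but the function only sees each distinct id once.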
Then we can convert sbxt to a data frame sb1:
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## min.1.set.bits.1 min.2.set.bits.1
## 1 3 5
## 2 3 3
## 3 3 3
Finally we can assemble these things up:
foo <- function(df1, df2, p = 2) {
## check p: it cannot exceed the number of candidate rows in df1
if (p > nrow(df1)) p <- nrow(df1)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb2 <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb2[i] <- set.bits
k <- k + p
}
## compute "sb1"
id0 <- sort(unique(id))
sub_xt <- xt[, id0]
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
sbxt <- sbxt0[match(id, id0)]
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## recode "id", "d" and "sb2" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb2 <- as.data.frame(matrix(sb2, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb2) <- "set.bits.1"
list(id = id, d = d, sb1 = sb1, sb2 = sb2)
}
Now, running foo(df1, df2) gives:
> foo(df1,df2)
$id
min.1 min.2
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2
1 2 2
2 1 3
3 1 3
$sb1
min.1.set.bits.1 min.2.set.bits.1
1 3 5
2 3 3
3 3 3
$sb2
set.bits.1
1 3
2 2
3 4
Note that I have renamed the sb you used to sb2.

Conditional expression for a specific column in a list of data frames in R

Sorry if the title is confusing.
I have a list of data frames combined into temp.list. I want to raise each row of a specific column to a power given by the corresponding value in vec. For example, vec has the values 2, 0, and 3. I want to do: X2^2, log(X2), X2^3. So do log(X2) if the value in vec==0. The last three lines of code are where I have an issue.
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)
temp.list <-mlist
vec <- c(2,0,3)
The code below works! But I don't want to raise X2^0.
for(i in 1:length(vec)){
temp.list[[i]]$X2 <- temp.list[[i]]$X2^vec[[i]]
}
The code below replaces all rows of X2 by the first value calculated in X2.
for(i in 1:length(vec)){
temp.list[[i]]$X2 <- ifelse(vec[[i]]==0,log(temp.list[[i]]$X2),temp.list[[i]]$X2^vec[[i]])
}
Any other ways of doing this would also be much appreciated.
You could use this:
for(i in 1:length(vec)){
temp.list[[i]]$X2 <- if(vec[[i]]==0) log(temp.list[[i]]$X2)
else temp.list[[i]]$X2^vec[[i]]
}
temp.list
# [[1]]
# X1 X2
# 1 1 9
# 2 2 16
# [[2]]
# X1 X2 X3
# 1 1 1.386294 7
# 2 2 1.609438 8
# 3 3 1.791759 9
# [[3]]
# X1 X2
# 1 1 27
# 2 2 64
The problem is with the ifelse(...) call, which returns a vector of the same length as the condition (i.e., length 1 in your case), so only the first computed value is kept and then recycled across all rows of X2. The if (...) ... else ... statement instead evaluates the condition once and executes whichever branch applies, so the whole vector is returned.
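The difference is easy to see on a two-element vector. The sketch below also shows one possible way to write the transformation without an index loop, using Map(); the helper name transform_x2 is made up for this example.

```r
x <- c(3, 4)

# ifelse() is vectorised over the test: a length-1 test gives a length-1 result
ifelse(TRUE, x^2, log(x))    # only the first element: 9

# if / else picks a whole expression, so the full vector survives
if (TRUE) x^2 else log(x)    # 9 16

# The same transformation applied over the list (hypothetical helper name)
transform_x2 <- function(d, p) {
  d$X2 <- if (p == 0) log(d$X2) else d$X2^p
  d
}

mlist <- list(data.frame(matrix(1:4, nrow = 2, ncol = 2)),
              data.frame(matrix(1:9, nrow = 3, ncol = 3)),
              data.frame(matrix(1:4, nrow = 2, ncol = 2)))
out <- Map(transform_x2, mlist, c(2, 0, 3))
out
```

Map() pairs each data frame with its exponent, so there is no need to index temp.list and vec by hand.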

sum adjacent columns for each column in a matrix in R

I am trying to get a function that is the opposite of diff()
I want to add the values of adjacent columns in a matrix for each column in the matrix.
I do NOT need the sum of the entire column or row.
For example:
If I had:
[ 1 2 4;
3 5 8 ]
I would end up with:
[ 3 6;
8 13 ]
Of course for just one or two columns this is simple as I can just do x[,1]+x[,2], but these matrices are quite large.
I'm surprised that I cannot seem to find an efficient way to do this.
m <- matrix(c(1,3,2,5,4,8), nrow=2)
m[,-1] + m[,-ncol(m)]
[,1] [,2]
[1,] 3 6
[2,] 8 13
Or, just for the fun of it:
n <- ncol(m)
x <- suppressWarnings(matrix(c(1, 1, rep(0, n-1)),
nrow = n, ncol = n-1))
m %*% x
[,1] [,2]
[1,] 3 6
[2,] 8 13
Dummy data
mat <- matrix(sample(0:9, 100, replace = TRUE), nrow = 10)
Solution:
sum.mat <- lapply(1:(ncol(mat)-1), function(i) mat[,i] + mat[,i+1])
sum.mat <- matrix(unlist(sum.mat), byrow = FALSE, nrow = nrow(mat))
You could use:
m <- matrix(c(1,2,4,3,5,8), nrow=2, byrow=TRUE)
sapply(2:ncol(m), function(x) m[,x] + m[,(x-1)])
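If this is needed in several places, the shifted-matrix approach could be wrapped in a small function. This is just a sketch; adj_sum is a hypothetical name, and drop = FALSE is added so that a two-column input still returns a matrix.

```r
# Hypothetical helper: sum of each column with its right-hand neighbour
adj_sum <- function(m) {
  stopifnot(is.matrix(m), ncol(m) >= 2)
  # drop = FALSE keeps the matrix shape even when only one column remains
  m[, -1, drop = FALSE] + m[, -ncol(m), drop = FALSE]
}

m <- matrix(c(1, 3, 2, 5, 4, 8), nrow = 2)
adj_sum(m)

# Compare with the column-by-column version on random data
set.seed(1)
big <- matrix(sample(0:9, 100, replace = TRUE), nrow = 10)
loop_version <- sapply(2:ncol(big), function(j) big[, j] + big[, j - 1])
all.equal(adj_sum(big), loop_version)
```

The shifted version does a single matrix addition, so it should scale better than looping over columns for the large matrices mentioned in the question.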

Apply a list of n functions to each row of a dataframe?

I have a list of functions
funs <- list(fn1 = function(x) x^2,
fn2 = function(x) x^3,
fn3 = function(x) sin(x),
fn4 = function(x) x+1)
#in reality these are all f = splinefun()
And I have a dataframe:
mydata <- data.frame(x1 = c(1, 2, 3, 2),
x2 = c(3, 2, 1, 0),
x3 = c(1, 2, 2, 3),
x4 = c(1, 2, 1, 2))
#actually a 500x15 dataframe of 500 samples from 15 parameters
For each of i rows, I would like to evaluate function j on each of the j columns and sum the results:
unlist(funs)
attach(mydata)
a <- rep(NA,4)
for (i in 1:4) {
a[i] <- sum(fn1(x1[i]), fn2(x2[i]), fn3(x3[i]), fn4(x4[i]))
}
How can I do this efficiently? Is this an appropriate occasion to implement plyr functions? If so, how?
bonus question: why is a[4] NA?
Is this an appropriate time to use functions from plyr, if so, how can I do so?
Ignoring your code snippet and sticking to your initial specification that you want to apply function j on the column number j and then "sum the results"... you can do:
mapply( do.call, funs, lapply( mydata, list))
#      fn1 fn2       fn3 fn4
# [1,]   1  27 0.8414710   2
# [2,]   4   8 0.9092974   3
# [3,]   9   1 0.9092974   3
# [4,]   4   0 0.1411200   3
I wasn't sure which way you wanted to add the results (i.e. row-wise or column-wise), so you could use either rowSums or colSums on this matrix. E.g.:
colSums( mapply( do.call, funs, lapply( mydata, list)) )
#       fn1       fn2       fn3       fn4
# 18.000000 36.000000  2.801186 11.000000
Why not just write one function for all 4 and apply it to the data frame?
All your functions are vectorized, and so is splinefun, and this will work:
fun <- function(df)
cbind(df[, 1]^2, df[, 2]^3, sin(df[, 3]), df[, 4] + 1)
rowSums(fun(mydata))
This is considerably more efficient than "foring" or "applying" over the rows.
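When the functions live in a list (as with 15 splinefun() results), hard-coding them in one function may not be practical. A possible sketch, assuming the list order matches the column order, pairs each function with its column via mapply() and then sums across rows:

```r
funs <- list(fn1 = function(x) x^2,
             fn2 = function(x) x^3,
             fn3 = function(x) sin(x),
             fn4 = function(x) x + 1)

mydata <- data.frame(x1 = c(1, 2, 3, 2),
                     x2 = c(3, 2, 1, 0),
                     x3 = c(1, 2, 2, 3),
                     x4 = c(1, 2, 1, 2))

# Apply function j to column j; the result has one column per function
vals  <- mapply(function(f, col) f(col), funs, mydata)
total <- rowSums(vals)

# Check against an explicit row-by-row loop
expected <- numeric(nrow(mydata))
for (i in seq_len(nrow(mydata))) {
  expected[i] <- funs$fn1(mydata$x1[i]) + funs$fn2(mydata$x2[i]) +
                 funs$fn3(mydata$x3[i]) + funs$fn4(mydata$x4[i])
}
all.equal(unname(total), expected)
```

Each function is still called once on a whole column, so this keeps the benefit of vectorization while working for any number of function/column pairs.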
I tried using plyr::each:
library(plyr)
sapply(mydata, each(min, max))
x1 x2 x3 x4
min 1 0 1 1
max 3 3 3 2
and it works fine, but when I pass custom functions I get:
sapply(mydata, each(fn1, fn2))
Error in proto[[i]] <- fs[[i]](x, ...) :
more elements supplied than there are to replace
each has very brief documentation, and I don't quite understand what the problem is.
