R 2D binning of data frame with secondary complex calculations

I have a data frame that looks generally like this
df.data <- data.frame(x=sample(1:9, 10, replace = T), y=sample(1:9, 10, replace=T), vx=sample(-1:1, 10, replace=T), vy=sample(-1:1, 10, replace=T))
x and y are positions; vx and vy are the x and y components of a 2D vector. I want to "bin" the rows of this data frame by their x and y values while summing vx and vy and counting the rows in each bin. The function below does this, but it uses a loop that is going to be too slow for my data set.
slowWay <- function(df)
{
df.bin <- data.frame(expand.grid(x=0:3, y=0:3, vx=0, vy=0, count=0))
for(i in 1:nrow(df))
{
x.bin <- floor(df[i, ]$x / 3)
y.bin <- floor(df[i, ]$y / 3)
print(c(x.bin, y.bin))
df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$vx = df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$vx + df[i, ]$vx
df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$vy = df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$vy + df[i, ]$vy
df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$count = df.bin[df.bin$x == x.bin & df.bin$y == y.bin, ]$count + 1
}
return(df.bin)
}
Is this type of 2D binning possible in a non looping way?

Here's another faster way to do it, one that includes unpopulated bin combinations:
fasterWay <- function(df.data) {
a1 <- aggregate(df.data[,3:4], list(x=floor(df.data$x/3), y=floor(df.data$y/3)), sum)
a2 <- aggregate(list(count=rep(NA,nrow(df.data))), list(x=floor(df.data$x/3), y=floor(df.data$y/3)), length)
result <- merge(expand.grid(y=0:3,x=0:3), merge(a1,a2), by=c("x","y"), all=TRUE)
result[is.na(result)] <- 0
result <- result[order(result$y, result$x),]
rownames(result) <- NULL
result
}
It gives me:
x y vx vy count
1 0 0 0 0 1
2 0 1 0 0 0
3 0 2 -1 -1 1
4 0 3 0 0 0
5 1 0 -1 -1 1
6 1 1 0 0 0
7 1 2 0 0 0
8 1 3 -1 0 2
9 2 0 -1 -1 1
10 2 1 0 0 0
11 2 2 -1 1 2
12 2 3 0 0 1
13 3 0 0 0 0
14 3 1 0 0 0
15 3 2 -1 0 1
16 3 3 0 0 0
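Another base R option, sketched under the assumption that the bins are the fixed 0:3 grid used above: turning the bin indices into factors with explicit levels makes tapply and table keep the empty bins, so no merge is needed. The function name bin_sums and its width and bins parameters are illustrative and do not come from the answers here.

```r
# Sketch: bin by floor(x / width) and floor(y / width), keeping empty bins.
bin_sums <- function(df, width = 3, bins = 0:3) {
  # Explicit levels force every bin to appear, even when it has no rows
  xb <- factor(floor(df$x / width), levels = bins)
  yb <- factor(floor(df$y / width), levels = bins)

  # expand.grid varies x fastest, which matches the column-major order
  # of the [xb, yb] matrices returned by tapply() and table()
  out <- expand.grid(x = bins, y = bins)

  vx <- as.vector(tapply(df$vx, list(xb, yb), sum))
  vy <- as.vector(tapply(df$vy, list(xb, yb), sum))
  vx[is.na(vx)] <- 0   # empty bins come back as NA
  vy[is.na(vy)] <- 0

  out$vx <- vx
  out$vy <- vy
  out$count <- as.vector(table(xb, yb))
  out
}

# Small hand-made example (not the random data from the question)
toy <- data.frame(x = c(1, 4, 4, 9), y = c(1, 1, 1, 9),
                  vx = c(1, -1, 1, 1), vy = c(0, 1, 1, -1))
res <- bin_sums(toy)
res[res$count > 0, ]
```

Rows come out with x varying fastest, the same order fasterWay produces after its final sort.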

This is one way, but you will probably need to do it in a couple of steps if you want the full record with unpopulated bin combinations:
> by(df.data[, c("vx", "vy")], # input data
list(x.bin=floor(df.data$x / 3), y.bin=floor(df.data$y / 3)), # grouping
function(df) sapply(df, function(x) c(Sum=sum(x), Count=length(x) ) ) ) #calcs
x.bin: 0
y.bin: 1
vx vy
Sum 0 1
Count 1 1
---------------------------------------------------------------------
x.bin: 1
y.bin: 1
vx vy
Sum 0 1
Count 2 2
---------------------------------------------------------------------
x.bin: 2
y.bin: 1
vx vy
Sum -1 -2
Count 2 2
---------------------------------------------------------------------
x.bin: 0
y.bin: 2
vx vy
Sum 1 0
Count 1 1
---------------------------------------------------------------------
x.bin: 1
y.bin: 2
NULL
---------------------------------------------------------------------
x.bin: 2
y.bin: 2
vx vy
Sum 2 1
Count 4 4

Here is a data.table version:
library(data.table)
dt.data<-as.data.table(df.data) # Convert to data.table
dt.data[,c("x.bin","y.bin"):=list(floor(x/3),floor(y/3))] # Add bin columns
setkey(dt.data,x.bin,y.bin)
dt.bin<-CJ(x=0:3, y=0:3) # Cross join to create bin combinations
dt.data.2<-dt.data[dt.bin,list(vx=sum(vx),vy=sum(vy),count=.N),by=.EACHI] # Join the bins and data; by=.EACHI (required since data.table 1.9.4) sums vx/vy and counts matching rows per bin
dt.data.2[is.na(vx),vx:=0L] # Replace NA with 0
dt.data.2[is.na(vy),vy:=0L] # Replace NA with 0
dt.data.2[order(y.bin,x.bin)] # Display the final data.table output
## x.bin y.bin vx vy count
## 1: 0 0 0 0 0
## 2: 1 0 0 0 0
## 3: 2 0 1 1 1
## 4: 3 0 0 0 0
## 5: 0 1 0 0 0
## 6: 1 1 0 -2 3
## 7: 2 1 0 0 0
## 8: 3 1 0 0 0
## 9: 0 2 0 0 1
## 10: 1 2 0 0 0
## 11: 2 2 0 2 3
## 12: 3 2 -1 1 1
## 13: 0 3 0 0 0
## 14: 1 3 0 0 0
## 15: 2 3 0 0 0
## 16: 3 3 1 -1 1

Related

Combine two sequences of data

I have two sequences of data (with five variables in each sequence) that I want to combine accordingly into one using this rubric:
variable sequence 1 variable sequence 2 variable in combined sequence
0 0 1
0 1 2
1 0 3
1 1 4
Here are some example data:
set.seed(145)
mm <- matrix(0, 5, 10)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("s1_1", "s1_2", "s1_3", "s1_4", "s1_5", "s2_1", "s2_2", "s2_3", "s2_4", "s2_5")
> df
s1_1 s1_2 s1_3 s1_4 s1_5 s2_1 s2_2 s2_3 s2_4 s2_5
1 1 0 0 0 0 0 1 1 0 0
2 1 1 1 0 1 1 0 0 0 0
3 1 1 0 0 0 1 1 0 1 1
4 0 0 1 0 1 1 0 1 0 1
5 0 1 0 0 1 0 0 1 1 0
Here s1_1 represents variable 1 in sequence 1, s2_1 represents variable 1 in sequence 2, and so on. For example, in row 1, s1_1=1 and s2_1=0, so variable 1 in the combined sequence would be coded as 3. How do I do this in R?
Here's a way -
return_value <- function(x, y) {
dplyr::case_when(x == 0 & y == 0 ~ 1,
x == 0 & y == 1 ~ 2,
x == 1 & y == 0 ~ 3,
x == 1 & y == 1 ~ 4)
}
sapply(split.default(df, sub('.*_', '', names(df))), function(x)
return_value(x[[1]], x[[2]]))
# 1 2 3 4 5
#[1,] 3 2 2 1 1
#[2,] 4 3 3 1 3
#[3,] 4 4 1 2 2
#[4,] 2 1 4 1 4
#[5,] 1 3 2 2 3
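split.default splits the columns by variable number (the part of each name after the underscore), so each piece holds the sequence 1 and sequence 2 columns for one variable. sapply then applies return_value to the two columns in each piece.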
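Since the rubric is just the binary number formed by the pair (sequence 1 digit first) plus one, the same coding can also be computed arithmetically as 2 * s1 + s2 + 1. A minimal sketch, assuming the s1_ and s2_ columns appear in the same variable order; it uses only the first two rows of the example data, and the names s1, s2 and combined are illustrative.

```r
# First two rows of the example data from the question
df <- data.frame(
  s1_1 = c(1, 1), s1_2 = c(0, 1), s1_3 = c(0, 1), s1_4 = c(0, 0), s1_5 = c(0, 1),
  s2_1 = c(0, 1), s2_2 = c(1, 0), s2_3 = c(1, 0), s2_4 = c(0, 0), s2_5 = c(0, 0)
)

# Pull out each sequence as a matrix (columns assumed to be in matching order)
s1 <- as.matrix(df[, grep("^s1_", names(df))])
s2 <- as.matrix(df[, grep("^s2_", names(df))])

# (0,0) -> 1, (0,1) -> 2, (1,0) -> 3, (1,1) -> 4
combined <- 2 * s1 + s2 + 1
colnames(combined) <- sub(".*_", "", colnames(s1))
combined
#   1 2 3 4 5
# 1 3 2 2 1 1
# 2 4 3 3 1 3
```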

Change the value of variables that occur 80% of the times in each row, R

My data has 74 observations (rows) and 128 variables (columns), and each variable takes the value 0 or 1. In R, for each row I want to find the variables whose value is 1, pick 80% of them, and change their value from 1 to 0. I can calculate what 80% of the number of 1s in a row is, but I cannot pick those variables in each row and change their value from 1 to 0.
data# data frame with 74 observations and 128 variables
row1 <- data[1,]
count1 <- length(which(data[1,] == 1)) # #number of 1 in row 1
print(count1)
perform <- 80/100*count1# 80% of count1
The code below works for one row:
test <- t(apply(data[1,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
If I specify all the rows, the code does not work:
test <- t(apply(data[1:74,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
Example of desired output:
original data frame
df
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1
When the code is applied to all three rows of df, the output should look like this (80% of the 1s in each row replaced with 0):
a b c d e f
1 1 0 0 0 1 0
2 0 0 1 0 0 0
3 0 1 1 0 0 0
A solution is to use apply row-wise and get the indices where the value is 1 using which. Afterwards, pick 80% of those indices using sample and replace the values there with 0.
t(apply(df, 1, function(x){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# [1,] 0 0 0 1 0 0
# [2,] 0 0 0 1 0 0
# [3,] 0 0 1 0 0 1
# [4,] 0 1 0 0 0 0
# [5,] 0 1 0 0 0 0
# [6,] 1 0 0 0 0 0
# [7,] 0 0 0 0 0 1
# [8,] 0 0 1 0 0 0
# [9,] 0 0 1 0 1 0
# [10,] 0 0 0 0 0 1
Sample Data:
set.seed(1)
df <- data.frame(a = sample(c(0,1,1,1), 10, replace = TRUE),
b = sample(c(0,1,1,1), 10, replace = TRUE),
c = sample(c(0,1,1,1), 10, replace = TRUE),
d = sample(c(0,1,1,1), 10, replace = TRUE),
e = sample(c(0,1,1,1), 10, replace = TRUE),
f = sample(c(0,1,1,1), 10, replace = TRUE))
df
# a b c d e f
# 1 1 0 1 1 1 1
# 2 1 0 0 1 1 1
# 3 1 1 1 1 1 1
# 4 1 1 0 0 1 0
# 5 0 1 1 1 1 0
# 6 1 1 1 1 1 0
# 7 1 1 0 1 0 1
# 8 1 1 1 0 1 1
# 9 1 1 1 1 1 1
# 10 0 1 1 1 1 1
# Answer on OP's data (df1 is defined under "Data from OP" below)
t(apply(df1, 1, function(x){
onesInX <- which(x==1)
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# 1 1 1 0 0 0 0 <- .8*6 = 4.8 => 4 have been converted to 0
# 2 0 0 0 1 0 0 <- .8*4 = 3.2 => 3 have been converted to 0
# 3 0 1 0 0 0 0 <- .8*5 = 4.0 => 4 have been converted to 0
# Data from OP
df1 <- read.table(text="
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)
df1
# a b c d e f
# 1 1 1 1 1 1 1 <- No of 1 = 6
# 2 1 0 1 1 0 1 <- No of 1 = 4
# 3 1 1 1 0 1 1 <- No of 1 = 5
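The same idea can be wrapped in a small helper with the fraction as a parameter. This is only a sketch: the name zero_out_fraction and the frac argument are made up for illustration. It samples positions within the vector of indices via sample.int rather than passing the indices to sample directly, because sample(n, k) with a single number n draws from 1:n, which could matter for fractions other than 0.8 when a row has exactly one 1.

```r
# Sketch: set a given fraction of the 1s in each row to 0, chosen at random.
zero_out_fraction <- function(m, frac = 0.8) {
  t(apply(m, 1, function(x) {
    ones <- which(x == 1)              # positions of the 1s in this row
    k <- floor(length(ones) * frac)    # how many of them to flip
    # Sample positions within `ones`; safe when length(ones) is 0 or 1
    x[ones[sample.int(length(ones), k)]] <- 0
    x
  }))
}

# Data from the question
df1 <- read.table(text = "
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)

set.seed(1)
res <- zero_out_fraction(df1)
rowSums(res)   # number of 1s left per row: 6 - 4, 4 - 3, 5 - 4
```

Which positions are flipped depends on the random seed, but the number of 1s remaining in each row is fixed by the rounding rule.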

rearrange a variable based on another variable

Data:
set.seed(25)
df<- data.frame(rank=round(rnorm(10)),category=round(runif(10)),v=round(rnorm(10)))
rank category v
1 0 0 1
2 -1 0 -1
3 -1 0 1
4 0 0 2
5 -2 0 -1
6 0 0 1
7 2 0 0
8 1 1 0
9 0 1 2
10 0 0 -2
I want the variable "v" to follow the same ranking as the variable "rank1" within each category. My question is: how can I create the desired variable "v1"?
Desired output:
df <- transform(df, rank1 = ave(v, category, FUN = function(x) rank(x, ties.method = "random")))
rank category v rank1 v1
1 0 0 1 6 -1
2 -1 0 -1 3 1
3 -1 0 1 7 -1
4 0 0 2 8 -2
5 -2 0 -1 2 1
6 0 0 1 5 0
7 2 0 0 4 1
8 1 1 0 1 2
9 0 1 2 2 0
10 0 0 -2 1 2
So I get the desired result:
set.seed(25)
df <- data.frame(rank=round(rnorm(10)), category=round(runif(10)), v=round(rnorm(10)))
df <- transform(df, rank1 = ave(v, category, FUN = function(x) rank(x, ties.method = "random")))
df$v1 <- NA
for (i in unique(df$category)) {
df$v1[df$category==i] <- sort(df$v[df$category==i], decreasing=TRUE)[df$rank1[df$category==i]]
}
The idea is to go through the categories and apply the order given by rank1 to the sorted part of the vector v.
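The loop can also be written with ave by grouping row indices instead of values, so that both v and rank1 are available inside each group. A minimal sketch on a small hand-made data frame with fixed ranks (the random tie-breaking above makes the original example awkward to check by hand); the helper name reorder_within is illustrative.

```r
# Sketch: within each category, sort v in decreasing order and
# pick elements in the order given by rank1.
reorder_within <- function(v, rank1, category) {
  idx <- seq_along(v)
  # ave() splits the row indices by category; FUN sees one group's indices
  ave(idx, category, FUN = function(i) sort(v[i], decreasing = TRUE)[rank1[i]])
}

toy <- data.frame(category = c(0, 0, 0, 1, 1),
                  v        = c(5, 7, 6, 1, 2),
                  rank1    = c(1, 3, 2, 1, 2))
toy$v1 <- reorder_within(toy$v, toy$rank1, toy$category)
toy$v1
# [1] 7 5 6 2 1
```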

Finding values of vector that occur within range of another vector's values

I have two sequences. They are times in seconds. I wish to know which values in sequence b occur within 10s of any value in sequence a.
seqa = c(4.53333333333333, 7.43333333333334, 9.03333333333333, 20.6166666666667,
20.6333333333333, 42.5666666666667, 48.3166666666667, 48.8, 49.75,
55.1, 56.7833333333333, 59.3833333333333, 110.15, 113.95, 114.6)
seqb = c(18.3833333333333, 18.3833333333333, 63.8833333333333, 72.3166666666667,
76.7166666666667, 85.2166666666667, 91.25, 91.3666666666667,
96.2833333333333)
I have done this using two for loops. Going through each element of seqb and testing if it occurs at a time greater than each value of seqa but within a 10 second limit.
matX <- matrix(nrow=length(seqa), ncol=length(seqb))
for(j in seq_along(seqb)){
for(i in seq_along(seqa)){
test1 <- seqb[j]>=seqa[i]
test2 <- seqb[j]<=seqa[i]+10
matX[i,j] <- sum(test1 + test2)
}
}
matX
I'm storing the results in a matrix. You can see the values of 2 in columns 1, 2 and 3.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 1 1 1 1 1 1 1 1
[2,] 1 1 1 1 1 1 1 1 1
[3,] 2 2 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1
[5,] 1 1 1 1 1 1 1 1 1
[6,] 1 1 1 1 1 1 1 1 1
[7,] 1 1 1 1 1 1 1 1 1
[8,] 1 1 1 1 1 1 1 1 1
[9,] 1 1 1 1 1 1 1 1 1
[10,] 1 1 2 1 1 1 1 1 1
[11,] 1 1 2 1 1 1 1 1 1
[12,] 1 1 2 1 1 1 1 1 1
[13,] 1 1 1 1 1 1 1 1 1
[14,] 1 1 1 1 1 1 1 1 1
[15,] 1 1 1 1 1 1 1 1 1
out <- apply(matX, 2, function(x) any(x>=2))
seqb[out]
# [1] 18.38333 18.38333 63.88333
These values are those that occur within 10s of at least one value in seqa. (The first two occur within 10s of 9.03333; the third value, 63.88333, occurs within 10s of three values of seqa: 55.1, 56.78333 and 59.38333.)
I am trying to speed this up as I will be doing some randomizations of sequences of about 2000 elements. Any ideas greatly appreciated.
Here are two base options
seqa = c(4.53333333333333, 7.43333333333334, 9.03333333333333, 20.6166666666667,
20.6333333333333, 42.5666666666667, 48.3166666666667, 48.8, 49.75,
55.1, 56.7833333333333, 59.3833333333333, 110.15, 113.95, 114.6)
seqb = c(18.3833333333333, 18.3833333333333, 63.8833333333333, 72.3166666666667,
76.7166666666667, 85.2166666666667, 91.25, 91.3666666666667,
96.2833333333333)
## via alexis_laz
a <- function() seqb[seqa[findInterval(seqb, seqa)] + 10 >= seqb]
# [1] 18.38333 18.38333 63.88333
## f
f <- function() {
la <- length(seqa)
rr <- rep(seqb, each = la)
m <- matrix(rep(seqa, length(seqb)) - rr, la)
+(m < 0 & abs(m) <= 10)
}
## g
g <- function() {
o <- outer(seqa, seqb, `-`)
+(o < 0 & abs(o) <= 10)
}
`dimnames<-`(g(), list(round(seqa, 2), round(seqb, 2)))
# 18.38 18.38 63.88 72.32 76.72 85.22 91.25 91.37 96.28
# 4.53 0 0 0 0 0 0 0 0 0
# 7.43 0 0 0 0 0 0 0 0 0
# 9.03 1 1 0 0 0 0 0 0 0
# 20.62 0 0 0 0 0 0 0 0 0
# 20.63 0 0 0 0 0 0 0 0 0
# 42.57 0 0 0 0 0 0 0 0 0
# 48.32 0 0 0 0 0 0 0 0 0
# 48.8 0 0 0 0 0 0 0 0 0
# 49.75 0 0 0 0 0 0 0 0 0
# 55.1 0 0 1 0 0 0 0 0 0
# 56.78 0 0 1 0 0 0 0 0 0
# 59.38 0 0 1 0 0 0 0 0 0
# 110.15 0 0 0 0 0 0 0 0 0
# 113.95 0 0 0 0 0 0 0 0 0
# 114.6 0 0 0 0 0 0 0 0 0
Some benches on my crummy hardware
library('microbenchmark')
seqa <- rep(seqa, 100)
seqb <- rep(seqb, 100)
microbenchmark(f(), g(), baseR(), DT(), a(), unit = 'relative')
# Unit: relative
# expr min lq mean median uq max neval cld
# f() 525.3178 374.23871 402.51609 386.4717 372.50657 496.6496 100 c
# g() 293.2158 223.21560 247.40211 241.3430 225.80202 443.5323 100 bc
# baseR() 13268.9357 9357.70517 8895.30834 9111.6828 8466.15623 6702.1735 100 d
# DT() 136.1109 93.61985 96.88054 96.0771 95.03329 100.5602 100 ab
# a() 1.0000 1.00000 1.00000 1.0000 1.00000 1.0000 100 a
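The findInterval one-liner assumes that seqa is sorted and that no value of seqb is smaller than min(seqa). Otherwise findInterval returns 0 for that value, the zero index is silently dropped, and the comparison becomes misaligned. A slightly more defensive sketch of the same idea follows; the function name within_after and the window argument are illustrative, and the data are the question's values rounded to four decimals. It relies on the fact that if any value of a lies in [b - window, b], then the largest a that is <= b does too.

```r
# Sketch: values of b that fall within `window` seconds after some value of a.
within_after <- function(b, a, window = 10) {
  a <- sort(a)
  idx <- findInterval(b, a)   # index of the last a <= b, or 0 if there is none
  hit <- idx > 0              # b values that have a preceding a at all
  hit[hit] <- b[hit] - a[idx[hit]] <= window
  b[hit]
}

seqa <- c(4.5333, 7.4333, 9.0333, 20.6167, 20.6333, 42.5667, 48.3167,
          48.8, 49.75, 55.1, 56.7833, 59.3833, 110.15, 113.95, 114.6)
seqb <- c(18.3833, 18.3833, 63.8833, 72.3167, 76.7167, 85.2167,
          91.25, 91.3667, 96.2833)

within_after(seqb, seqa)
# [1] 18.3833 18.3833 63.8833
```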
You can try the foverlaps function from the data.table package.
library(data.table)
b <- data.table(seqb)
a <- data.table(seqa)
a[, end := seqa + 10]
setkey(a)
b[, end := seqb]
inds <- foverlaps(b, a,
by.x=c("seqb","end"),
type="within",
mult="all",
which=TRUE # you can use nomatch=0L, but it doesn't change the final matrix
)
# xid yid
#1: 1 3
#2: 2 3
#3: 3 10
#4: 3 11
#5: 3 12
#6: 4 NA
#7: 5 NA
#8: 6 NA
#9: 7 NA
#10: 8 NA
#11: 9 NA
These indices can now be used to create the matrix you want.
mat <- matrix(1, nrow=length(seqa), ncol=length(seqb))
mat[cbind(inds$yid, inds$xid)] <- 2
Here it is in a function with seqa and seqb hard-coded:
DT <- function(){
b <- data.table(seqb)
a <- data.table(seqa)
a[, end := seqa + 10]
setkey(a)
b[, end := seqb]
inds <- foverlaps(b, a,
by.x=c("seqb","end"),
type="within",
mult="all",
which=TRUE
)
mat <- matrix(1, nrow=length(seqa), ncol=length(seqb))
mat[cbind(inds$yid, inds$xid)] <- 2
mat
}
seqa = c(4.53333333333333, 7.43333333333334, 9.03333333333333, 20.6166666666667, 20.6333333333333, 42.5666666666667, 48.3166666666667, 48.8, 49.75, 55.1, 56.7833333333333, 59.3833333333333, 110.15, 113.95, 114.6)
seqb = c(18.3833333333333, 18.3833333333333, 63.8833333333333, 2.3166666666667, 76.7166666666667, 85.2166666666667, 91.25, 91.3666666666667, 96.2833333333333)
The data are read in above. Below, I show my approach and that of @jota. Note that this is a bit of a silly comparison, since the data are so small. The data.table solution is almost certainly way faster on larger data.
library(microbenchmark)
library(data.table)
DT <- function(){
b <- data.table(seqb)
a <- data.table(seqa)
a[, end := seqa + 10]
setkey(a)
b[, end := seqb]
inds <- foverlaps(b, a,
by.x=c("seqb","end"),
type="within",
mult="all",
which=TRUE
)
mat <- matrix(1, nrow=length(seqa), ncol=length(seqb))
mat[cbind(inds$yid, inds$xid)] <- 2
mat
}
baseR <- function(){
out <- matrix(NA, ncol=length(seqb), nrow=length(seqa));
for(i in 1:length(seqa)){
out[i,] <- sapply(seqb, function(x){seqa[i] -10 < x & x < seqa[i] +10})
}
out
}
microbenchmark(
baseR(), DT()
)
And the results for the microbenchmark (for fun).
Unit: microseconds
expr min lq mean median uq max neval
baseR() 703.382 750.129 786.283 770.867 788.3085 1905.357 100
DT() 7289.433 7415.906 7631.574 7503.236 7575.7345 8794.439 100
You can use the IRanges package.
library(IRanges)
a.ir <- IRanges(round(seqa, 4)*1e4, round(seqa, 4)*1e4+10*1e4)
b.ir <- IRanges(round(seqb, 4)*1e4, round(seqb, 4)*1e4)
findOverlaps(b.ir, a.ir)
# Hits of length 5
# queryLength: 9
# subjectLength: 15
# queryHits subjectHits
# <integer> <integer>
# 1 1 3
# 2 2 3
# 3 3 10
# 4 3 11
# 5 3 12
seqb[unique(queryHits(findOverlaps(b.ir, a.ir)))]
# [1] 18.38333 18.38333 63.88333

Using loop to make column selections using different vectors

Let's say I have 3 vectors (each of length 10):
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
Data.frame Y contains 10 columns and 6 rows:
1 2 3 4 5 6 7 8 9 10
0 1 0 0 1 1 1 0 1 0
1 1 1 0 1 0 1 0 0 0
0 0 0 0 1 0 0 1 0 1
1 0 1 1 0 1 1 1 0 0
0 0 0 0 0 0 1 0 0 0
1 1 0 1 0 0 0 0 1 1
I'd like to use vectors X, H and I to make column selections in data.frame Y, using the 1's and 0's in each vector as the selection criterion.
So the result for vector X, using '1' as the selection criterion, should be:
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
1 2 4 6 7
0 1 0 1 1
1 1 0 0 1
0 0 0 0 0
1 0 1 1 1
0 0 0 0 1
1 1 1 0 0
For vector H, using '1' as the selection criterion:
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
3 6 8 9 10
0 1 0 1 0
1 0 0 0 0
0 0 1 0 1
1 1 1 0 0
0 0 0 0 0
0 0 0 1 1
For vector I, using '1' as the selection criterion:
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
6 10
1 0
0 0
0 1
1 0
0 0
0 1
For convenience and speed I'd like to use a loop. It might be something like this:
all.ones <- lapply[,function(x) x %in% 1]
In the outcome (all.ones), the result for each vector should stay separate. For example:
X 1,2,4,6,7
H 3,6,8,9,10
I 6,10
The standard way of doing this is using the %in% operator:
Y[, X %in% 1]
To do this for multiple vectors (assuming you want an AND operation):
mylist = list(X, H, I, D, E, K)
Y[, Reduce(`&`, lapply(mylist, function(x) x %in% 1))]
The problem is the NA values; use which to get round them. Consider the following:
x <- c(1,0,1,NA)
x[x==1]
[1] 1 1 NA
x[which(x==1)]
[1] 1 1
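The two idioms differ in how they treat NA, which is why both %in% and which avoid the stray NA. A small illustration on the same x:

```r
x <- c(1, 0, 1, NA)

x == 1          # TRUE FALSE TRUE NA    (NA propagates through ==)
x %in% 1        # TRUE FALSE TRUE FALSE (NA counts as "no match")
which(x == 1)   # 1 3                   (which() drops NA and FALSE)

# Both of these therefore select only the real 1s
by_in    <- x[x %in% 1]
by_which <- x[which(x == 1)]
```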
How about this?
idx <- which(X==1)
Y[,idx]
EDIT: For six vectors, do
idx <- which(X==1 & H==1 & I==1 & D==1 & E==1 & K==1)
Y[,idx]
Replace & with | if you want all columns of Y where at least one of the vectors has a 1.
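If the result for each vector should stay separate, as in the question, one option is to put the vectors in a named list and lapply over it. A minimal sketch using the three vectors and the data frame Y from the question; the names vecs and selections are illustrative.

```r
X <- c(1, 1, 0, 1, 0, 1, 1, 0, NA, NA)
H <- c(0, 0, 1, 0, NA, 1, NA, 1, 1, 1)
I <- c(0, 0, 0, 0, 0, 1, NA, NA, NA, 1)

# Y from the question: 6 rows, 10 columns
Y <- as.data.frame(matrix(c(
  0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
  1, 1, 1, 0, 1, 0, 1, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
  1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
  0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
  1, 1, 0, 1, 0, 0, 0, 0, 1, 1), nrow = 6, byrow = TRUE))
names(Y) <- 1:10

vecs <- list(X = X, H = H, I = I)

# Column positions where each vector equals 1 (NA treated as no match)
all.ones <- lapply(vecs, function(v) which(v %in% 1))
all.ones
# $X
# [1] 1 2 4 6 7
# $H
# [1]  3  6  8  9 10
# $I
# [1]  6 10

# One data frame of selected columns per vector; drop = FALSE keeps
# a data frame even if only one column is selected
selections <- lapply(all.ones, function(idx) Y[, idx, drop = FALSE])
```

Using 0 as the criterion instead only requires changing v %in% 1 to v %in% 0.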
