2D irregular aggregation of a matrix - r

I'm trying to bin a symmetric matrix with irregular intervals in R but am not sure how to proceed. My ideas are:
Reshape the matrix to long format, aggregate and cast it back?
Bin as-is in both dimensions (somehow... tapply, aggregate?)
Keep the regular binning but for each of my (larger) irregular bins, replace all inner values with their sum?
Here's an example of what I'm trying to do:
set.seed(42)
# symmetric matrix
a <- matrix(rpois(1e4, 2), 100)
a[upper.tri(a)] <- t(a)[upper.tri(a)]
image(x=1:100, y=1:100, a, asp=1, frame=F, axes=F)
# vector of irregular breaks for binning
breaks <- c(12, 14, 25, 60, 71, 89)
# white lines show the desired bins
abline(h=breaks-.5, lwd=2, col="white")
abline(v=breaks-.5, lwd=2, col="white")
(The aim is for each rectangle drawn above to be filled according to the sum of the values within it.) I'd appreciate any pointers on how best to approach this.

This answer provides a great starting point using tapply:
library("reshape2")  # provides melt()
b <- melt(a)
bb <- with(b, tapply(value,
list(
y=cut(Var1, breaks=c(0, breaks, Inf), include.lowest=T),
x=cut(Var2, breaks=c(0, breaks, Inf), include.lowest=T)
),
sum)
)
bb
#           x
# y          [0,12] (12,14] (14,25] (25,60] (60,71] (71,89] (89,Inf]
#   [0,12]      297      48     260     825     242     416      246
#   (12,14]      48       3      43     141      46      59       33
#   (14,25]     260      43     261     794     250     369      240
#   (25,60]     825     141     794    2545     730    1303      778
#   (60,71]     242      46     250     730     193     394      225
#   (71,89]     416      59     369    1303     394     597      369
#   (89,Inf]    246      33     240     778     225     369      230
These can then be plotted as rectangular bins using a base plot and rect — i.e.:
library("reshape2")
library("magrittr")
bsq <- melt(bb)
# convert range notation to numerics
getNum <- . %>%
# rm brackets
gsub("\\[|\\(|\\]|\\)", "", .) %>%
# split digits and convert
strsplit(",") %>%
unlist %>% as.numeric
y <- t(sapply(bsq[,1], getNum))
x <- t(sapply(bsq[,2], getNum))
# normalise bin intensity by area
bsq$size <- (y[,2] - y[,1]) * (x[,2] - x[,1])
bsq$norm <- bsq$value / bsq$size
# draw rectangles on top of empty plot
plot(1:100, 1:100, type="n", frame=F, axes=F)
rect(ybottom=y[,1], ytop=y[,2],
xleft=x[,1], xright=x[,2],
col=rgb(colorRamp(c("white", "steelblue4"))(bsq$norm / max(bsq$norm)),
alpha=255*(bsq$norm / max(bsq$norm)), max=255),
border="white")
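As a complement to the melt/tapply route above, the second idea in the question (binning as-is in both dimensions) can be sketched in base R without reshaping. This is only a minimal sketch, not part of the original answer; the names grp, bb2 and bb3 are introduced here for illustration, and it assumes the same breaks apply to rows and columns.

```r
set.seed(42)
# symmetric matrix and breaks, as in the question
a <- matrix(rpois(1e4, 2), 100)
a[upper.tri(a)] <- t(a)[upper.tri(a)]
breaks <- c(12, 14, 25, 60, 71, 89)

# bin label for each row/column index (the matrix is square)
grp <- cut(seq_len(nrow(a)), breaks = c(0, breaks, Inf))

# Option A: tapply directly on the matrix, grouped by row bin and column bin
bb2 <- tapply(a, list(y = grp[row(a)], x = grp[col(a)]), sum)

# Option B: collapse rows with rowsum, then collapse columns via a transpose
bb3 <- t(rowsum(t(rowsum(a, grp)), grp))
```

Both options should give the same 7 x 7 table of bin sums; rowsum avoids building the long index vectors, which may matter for larger matrices.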

Related

Plot observations in same x-axis point which linked with id variable

I need help. This is a view of my data:
482 940 914 1
507 824 1042 2
514 730 1450 3
477 595 913 4
My aim is to plot the values of each row at the same point on the x-axis.
Example:
at x = 1 I want to plot 482, 940 and 914;
at x = 2 I want to plot 507, 824 and 1042.
So there are three vertically stacked points for each x-axis position.
It's a good idea to share the data in a reproducible way. I'm using readClipboard (Windows only) to read the copied values into R. Anyway, here's a quick answer:
x <- as.numeric(unlist(strsplit(readClipboard(), " ")))
This makes it into a numeric vector. We now need to split into groups based on the description you provided. I'm using matrix to achieve this and will then convert to data.frame for plotting using ggplot2:
m <- matrix(x, ncol = 4, byrow = T)
> m
     [,1] [,2] [,3] [,4]
[1,]  482  940  914    1
[2,]  507  824 1042    2
[3,]  514  730 1450    3
[4,]  477  595  913    4
df <- as.data.frame(m)
# Assign names to the data.frame
names(df) <- letters[1:4]
> df
    a   b    c d
1 482 940  914 1
2 507 824 1042 2
3 514 730 1450 3
4 477 595  913 4
To get the plot:
library(ggplot2)
ggplot(df, aes(x = d)) +
geom_point(aes(y = a), color = "red") +
geom_point(aes(y = b), color = "green") +
geom_point(aes(y = c), color = "blue")
You can play around with ggtitle and xlab etc. to change the plot labels and add legends.
Hope this is helpful!
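For comparison, here is a minimal base R sketch of the same plot that avoids the clipboard and ggplot2. It is an alternative to the answer above, not a replacement for it; the column names a to d follow the answer, and the colours and legend position are arbitrary choices.

```r
# read the four rows from the question in a reproducible way
df <- read.table(text = "
482 940 914  1
507 824 1042 2
514 730 1450 3
477 595 913  4", col.names = c("a", "b", "c", "d"))

cols <- c("red", "green", "blue")

# one column of points per x value: matplot plots every column of y against x
matplot(df$d, as.matrix(df[, c("a", "b", "c")]), type = "p", pch = 19,
        col = cols, xlab = "d", ylab = "value")
legend("topleft", legend = c("a", "b", "c"), pch = 19, col = cols)
```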

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage of the total over a look-back range in R.
I know how to do this in Excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range starting today and looking back 3 lines. It then divides this sum by the total of columns B and C, again looking back 3 lines.
I am looking to achieve the same calculation in R, run across my matrix.
The output would look something like this:
  adv dec       perct
1  69 376
2 113 293
3 270 150 0.355625492
4  74 371 0.359559402
5 308  96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code to which I could perhaps add the look-back range:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) } )
If I could get x[1] to sum over the previous 3-line range, and
if I could get x[2] to sum over the same 3-line range as well.
I am still learning how to apply forward and look-back periods in R, so any additional explanation in the answer would be appreciated!
Here are some approaches. The first three use rollsumr and/or rollapplyr from zoo, and the last one uses only base R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
  adv dec      frac
1  69 376        NA
2 113 293        NA
3 270 150 0.3556255
4  74 371 0.3595594
5 308  96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
#   adv dec      pred
# 1  69 376        NA
# 2 113 293        NA
# 3 270 150 0.3556255
# 4  74 371 0.3595594
# 5 308  96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
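If you would rather avoid both zoo and an explicit loop over rows, a trailing sum can also be written as a difference of cumulative sums. The sketch below is an additional base R variation, not taken from the answers above; roll_sum is a hypothetical helper name, and it assumes the window k is no longer than the vector.

```r
DF <- data.frame(adv = c(69, 113, 270, 74, 308, 236, 252, 287, 219),
                 dec = c(376, 293, 150, 371, 96, 173, 134, 129, 187))

# trailing sum of the last k values; the first k - 1 entries are NA
roll_sum <- function(x, k) {
  cs <- cumsum(x)
  out <- cs - c(rep(0, k), head(cs, -k))  # cs[i] - cs[i - k], with cs[0] taken as 0
  out[seq_len(k - 1)] <- NA
  out
}

adv3 <- roll_sum(DF$adv, 3)
dec3 <- roll_sum(DF$dec, 3)
DF$frac <- adv3 / (adv3 + dec3)
```

Row 3, for example, works out as (69 + 113 + 270) / (69 + 113 + 270 + 376 + 293 + 150), which is the same quantity as the Excel formula in the question.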

An R-like approach to averaging by histogram bin

As a person transitioning from Matlab, I would welcome any advice on a more efficient way to find the average of the DepDelay values whose indices (indxs) fall within each histogram bin (edges). In my current R script, carried over from Matlab, I have these commands:
edges = seq( min(t), max(t), by = dt )
indxs = findInterval( t, edges, all.inside=TRUE )
listIndx = sort( unique( indxs ) )
n = length( edges )
avgDelay = rep( 1, n) * 0
for (i in 1 : n ){
  id = listIndx[i]
  jd = which( id == indxs )
  if ( length(jd) > minFlights){
    avgDelay[id] = mean(DepDelay[jd])
  }
}
I know that using for-loops in R is a potentially fraught issue, but I ask this question in the interests of improved code efficiency.
Sure. A few snippets of the relevant vectors:
DepDelay[1:20] = [1] -4 -4 -4 -9 -6 -7 -1 -7 -6 -7 -7 -5 -8 -3 51 -2 -1 -4 -7 -10
and associated indxs values:
indxs[1:20] = [1] 3 99 195 291 387 483 579 675 771 867 963 1059 1155 1251 1351 1443 1539 1635 1731 1827
minFlights = 3
Thank you.
BSL
There are many ways to do this in R, all involving variations on the "split-apply-combine" strategy (split the data into groups, apply a function to each group, combine the results by group back into a single data frame).
Here's one method using the dplyr package. I've created some fake data for illustration, since your data is not in an easily reproducible form:
library(dplyr)
# Create fake data
set.seed(20)
dat = data.frame(DepDelay = sample(-50:50, 1000, replace=TRUE))
# Bin the data
dat$bins = cut(dat$DepDelay, seq(-50,50,10), include.lowest=TRUE)
# Summarise by bin
dat %>% group_by(bins) %>%
summarise(count = n(),
meanByBin = mean(DepDelay, na.rm=TRUE))
        bins count  meanByBin
1  [-50,-40]   111 -45.036036
2  (-40,-30]   110 -34.354545
3  (-30,-20]    95 -24.242105
4  (-20,-10]    82 -14.731707
5    (-10,0]    92  -4.304348
6     (0,10]   109   5.477064
7    (10,20]    93  14.731183
8    (20,30]    93  25.182796
9    (30,40]   103  35.466019
10   (40,50]   112  45.696429
data.table is another great package for this kind of task:
library(data.table)
datDT = data.table(dat)
setkey(datDT, bins)
datDT[, list(count=length(DepDelay), meanByBin=mean(DepDelay, na.rm=TRUE)), by=bins]
And here are two ways to calculate the mean by bin in base R:
tapply(dat$DepDelay, dat$bins, mean)
aggregate(DepDelay ~ bins, FUN=mean, data=dat)
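None of the snippets above applies the minFlights threshold from the question, so here is a minimal base R sketch that keeps the question's findInterval binning and adds that filter. The data are made up (the values of t, DepDelay and dt are assumptions for illustration), and bins with too few flights are set to 0 only to mirror the original loop; NA may be a more natural choice.

```r
set.seed(1)
t <- runif(500, 0, 100)          # hypothetical departure times
DepDelay <- rnorm(500, 5, 10)    # hypothetical delays
dt <- 10
minFlights <- 3

edges <- seq(min(t), max(t), by = dt)
indxs <- findInterval(t, edges, all.inside = TRUE)
nbins <- length(edges) - 1       # all.inside = TRUE keeps indices in 1..nbins

# keep every bin as a factor level so that empty bins are not dropped
bin <- factor(indxs, levels = seq_len(nbins))
n <- tabulate(indxs, nbins = nbins)       # flights per bin
avgDelay <- tapply(DepDelay, bin, mean)   # mean delay per bin (NA if empty)
avgDelay[n <= minFlights] <- 0            # same threshold rule as the loop
```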

barplot in R 3.1.1

Even though I am getting used to R, I am still new to it, and I hope someone can help me with this task. I looked through previous topics but couldn't find what I was looking for.
I am trying to draw a bar plot but am not having much luck with some of the settings. I am using R 3.1.1 on Mac OS 10.9.4.
My table looks like this:
family area1 area2 area3 area4 area5 area6
A         15    20   500   200    17    26
B        170   520    26    13   100    70
C         35   250   358   128    88    26
D         95   375   289   156   169   356
E        425   177   136   144   285    70
Since I have the file saved as a CSV, I am following these steps:
fam <- read.csv ("family_per_area_count.csv", sep =";", header = T)
I then convert the data to a matrix:
fam.mat <- as.matrix(fam_1, ncol = 6, byrow = T)
then I assign row names and col names
rownames(fam.mat) <- c("A", "B", "C", "D", "E")
colnames(fam.mat) <- c("area1", "area2", "area3", "area4", "area5", "area6")
then I am simply running the bar plot command as
barplot(fam.mat, beside = T, col = rainbow(ncol(fam.mat)))
but most of the x-axis labels are missing and the plot seems to be squeezed together.
I also tried to draw the cumulative (stacked) bar plot using these commands:
par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
prop <- prop.table(data_mat, margin = 2)
barplot(data_mat, col = rainbow(length(rownames(data_mat))), width = 3)
legend("topright", inset = c(-0.25, 0), fill = rainbow(length(rownames(data_mat))),
legend = rownames(data_mat))
but the legend colours don't match the data and again my x-axis seems off-centre. I have tried transposing the matrix, but still no luck.
Can anyone make any suggestion?
Thank you so much in advance
F.
Here is a start:
DF <- read.table(text="family area1 area2 area3 area4 area5 area6
A 15 20 500 200 17 26
B 170 520 26 13 100 70
C 35 250 358 128 88 26
D 95 375 289 156 169 356
E 425 177 136 144 285 70", header=TRUE)
library(reshape2)
DF <- melt(DF, id.var="family")
library(ggplot2)
ggplot(DF, aes(x=family, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge")
Study ggplot2 documentation and tutorials to learn how to customise the plot.
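If you prefer to stay with base barplot as in the question, the following is a minimal sketch under a couple of assumptions: the family column is read in as row names so that the matrix is purely numeric, and one colour is used per family (row). With beside = TRUE the colours are recycled over the bars within each group, so taking the number of colours from ncol(), as in the question, is one likely cause of the mismatched legend.

```r
fam <- read.table(text = "family area1 area2 area3 area4 area5 area6
A 15 20 500 200 17 26
B 170 520 26 13 100 70
C 35 250 358 128 88 26
D 95 375 289 156 169 356
E 425 177 136 144 285 70", header = TRUE, row.names = 1)

fam.mat <- as.matrix(fam)        # numeric 5 x 6 matrix, families as row names
cols <- rainbow(nrow(fam.mat))   # one colour per family

# grouped bars: one group per area, one bar per family within each group
barplot(fam.mat, beside = TRUE, col = cols, las = 2,
        legend.text = rownames(fam.mat),
        args.legend = list(x = "topright"))
```

Here las = 2 turns the axis labels perpendicular to the axis, which tends to help when labels are dropped for lack of space.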

Number of unique values within a range in data-frame

From a data-frame, I want to extract the number of unique values (of X) within a certain range of Y (e.g. for every 0-100, 101-200, 201-300, etc. up to 3000).
Example df
X Y
169 183
546 64
154 148
593 203
60 243
568 370
85 894
168 169
154 148
83 897
…
A time consuming way would be to run the following code for each range:
junk<-subset(df, Y > 0 & Y < 100)
length(unique(junk$record.no))
But I have to ask the experts - there must be a better way?
You can do it with by() and cut():
data <- data.frame(X=ceiling(rnorm(10000, 500, 10)), Y=runif(10000, 0, 3000))
data$Groups <- cut(data$Y, seq(0, 3000, 100)) # Create a categorical variable for each range
by(data$X, data$Groups, function(x) length(unique(x)))
This seems valid:
aggregate(DF$X, list(cut(DF$Y, seq(0, 1000, 100))), function(x) unique(x))
#    Group.1             x   # or length(unique(x))
#1   (0,100]           546
#2 (100,200] 169, 154, 168
#3 (200,300]       593, 60
#4 (300,400]           568
#5 (800,900]        85, 83
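To get the counts rather than the values themselves, here is a minimal sketch using tapply on the ten example rows from the question. The names bins and counts are introduced here for illustration; empty ranges come back as NA from tapply and are set to 0.

```r
df <- data.frame(X = c(169, 546, 154, 593, 60, 568, 85, 168, 154, 83),
                 Y = c(183, 64, 148, 203, 243, 370, 894, 169, 148, 897))

# one bin per 100 units of Y, up to 3000 as in the question
bins <- cut(df$Y, breaks = seq(0, 3000, by = 100))

# number of distinct X values in each Y range
counts <- tapply(df$X, bins, function(x) length(unique(x)))
counts[is.na(counts)] <- 0   # ranges with no rows

counts[counts > 0]
```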
You can also run a for loop over the Y ranges you want and count the number of levels of X in each range by converting to a factor:
range <- 100 # width of each Y range, based on the example
loops <- ceiling(max(df$Y) / range)
lvlMatrix <- matrix(nrow=0, ncol=2, dimnames=list(NULL, c("range", "unique values")))
for(a in 1:loops){
  # rows whose Y falls in ((a-1)*range, a*range]
  sub <- df[df$Y > (a-1)*range & df$Y <= range*a, ]
  lvls <- nlevels(factor(sub$X))
  lvlMatrix <- rbind(lvlMatrix, cbind(paste((a-1)*range, "-", range*a, sep=""), lvls))
}

Resources