Group and average a large numeric vector to plot - R

I have an R matrix which is very data dense; it has 500,000 rows. If I plot 1:500000 (x axis) against the third column of the matrix, mat[, 3], it takes too long to plot and sometimes even crashes. I've tried plot, matplot, and ggplot, and all of them take a very long time.
I am looking to group the data by 10 or 20, i.e. take the first 10 elements from the vector, average them, and use that average as a single data point.
Is there a fast and efficient way to do this?

We can use cut and aggregate to reduce the number of points plotted:
Generate some data:
set.seed(123)
xmat <- data.frame(x = 1:5e5, y = runif(5e5))
Use cut and aggregate:
xmat$cutx <- as.numeric(cut(xmat$x, breaks = 5e5/10))
xmat.agg <- aggregate(y ~ cutx, data = xmat, mean)
Make the plot:
plot(xmat.agg, pch = ".")
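If you literally want blocks of 10 consecutive values (as described in the question), a base-R alternative is to fold the vector into a matrix with 10 rows and take column means; a quick sketch, assuming the vector length is a multiple of the block size:
# block means of 10 consecutive values; matrix() fills column-wise,
# so each column holds one block of 10
block_means <- colMeans(matrix(xmat$y, nrow = 10))
plot(block_means, pch = ".")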
Solution for more than one column:
Here, we use the data.table package to group and summarize:
Generate some more data:
set.seed(123)
xmat <- data.frame(x = 1:5e5,
                   u = runif(5e5),
                   z = rnorm(5e5),
                   p = rpois(5e5, lambda = 5),
                   g = rbinom(n = 5e5, size = 1, prob = 0.5))
Use data.table:
library(data.table)
xmat$cutx <- as.numeric(cut(xmat$x, breaks = 5e5/10))
setDT(xmat)  # convert to data.table
# for each level of cutx, take the mean of each column
xmat.agg <- xmat[, lapply(.SD, mean), by = cutx]
# xmat.agg
# cutx x u z p g
# 1: 1 5.5 0.5782475 0.372984058 4.5 0.6
# 2: 2 15.5 0.5233693 0.032501186 4.6 0.8
# 3: 3 25.5 0.6155837 -0.258803746 4.6 0.4
# 4: 4 35.5 0.5378580 0.269690334 4.4 0.8
# 5: 5 45.5 0.3453964 0.312308395 4.8 0.4
# ---
# 49996: 49996 499955.5 0.4872596 0.006631221 5.6 0.4
# 49997: 49997 499965.5 0.5974486 0.022103345 4.6 0.6
# 49998: 49998 499975.5 0.5056578 -0.104263093 4.7 0.6
# 49999: 49999 499985.5 0.3083803 0.386846148 6.8 0.6
# 50000: 50000 499995.5 0.4377497 0.109197095 5.7 0.6
Plot it all:
par(mfrow = c(2,2))
for(i in 3:6) plot(xmat.agg[,c(1,i), with = F], pch = ".")
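If you prefer dplyr, a roughly equivalent aggregation is sketched below (this assumes dplyr 1.0+ for across(); the data.table version above will generally be faster on data this size):
library(dplyr)
xmat.agg2 <- xmat %>%
  group_by(cutx) %>%
  summarise(across(everything(), mean))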


Is there an efficient way to calculate percentiles from pre-aggregated data (R)?

First: this is my first question here and I'm relatively new to R, so I'm sorry if this is a stupid question or the wrong way to ask it.
I have a data frame like this:
df <- data.frame(Website = c("A", "A", "A", "B", "B", "B"),
                 seconds = c(1, 12, 40, 3, 5, 14),
                 visitors = c(200000, 100000, 12000, 250000, 180000, 90000))
> df
Website seconds visitors
A 1 200000
A 12 100000
A 40 12000
B 3 250000
B 5 180000
B 14 90000
How to interpret the data: Website A has 200000 visitors who have been on the website for only 1 second, 100000 visitors for 12 seconds and so on.
In reality, the data has about hundred different websites, each with seconds ranging from 0 to about 900 (and a high number of visitors respectively).
Now, I want to calculate percentiles or at least quartiles for the visiting duration (for each website).
I already found and tried this solution here: https://stackoverflow.com/a/53882909
However, this solution is very inefficient as it results in a data frame with several million rows (and a very long processing time).
My question now: Is there a faster (more efficient way) to calculate percentiles from such pre-aggregated data?
I believe this will be faster. First make a function to compute the quantiles you specify. Then split the data into a list and use sapply:
quant <- function(x, p=c(.25, .50, .75)) {
  v <- c(0, cumsum(x$visitors)/sum(x$visitors))
  s <- c(0, x$seconds)
  approx(v, s, p)$y
}
df.split <- split(df, df$Website)
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# A 0.2 0.3 0.5 0.6 0.8 0.9 3.0 6.5 9.9
# B 0.6 1.2 1.9 2.5 3.1 3.7 4.3 4.8 8.8
To see better what is going on, here is a plot showing the data for Website A:
test1 <- df[1:3, ]
test1$cumvis <- cumsum(test1$visitors)
barplot(test1$seconds, test1$visitors, space=0, xlim=c(0, 325000))
axis(1, seq(0, 300000, 50000), c("0", "50K", "100K", "150K", "200K",
                                 "250K", "300K"), xpd=NA)
axis(3, seq(0, sum(test1$visitors), by=31200), seq(0, 1, by=.1), lty=1)
lines(c(0, test1$cumvis), c(0, test1$seconds), col="red", lwd=2)
lines(c(0, test1$cumvis-.5*test1$visitors, tail(test1$cumvis, 1)),
      c(0, test1$seconds, tail(test1$seconds, 1)), col="blue", lwd=2)
The plot shows the data as grey rectangles. The bottom x-axis shows the cumulative number of visits and the top x-axis shows the cumulative proportion. We can treat the rectangles as the distribution, or we can assume that the rectangles are a sample that approximates the underlying distribution. My suggested solution took the red line and used the approx function to linearly interpolate between the data points, estimating the number of seconds along that curve.
The same approach can be used with a different definition of the curve in which the data points are placed in the middle of each rectangle, the blue curve. I'll provide code for that approach as well. It is also possible to estimate the quantiles from the original data without replicating it.
First a function to estimate the quantiles along the blue line:
quant2 <- function(x, p=c(.25, .50, .75)) {
  v <- c(0, (cumsum(x$visitors) - .5*x$visitors)/sum(x$visitors), 1)
  s <- c(0, x$seconds, tail(x$seconds, 1))
  approx(v, s, p)$y
}
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant2, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
#   0.1 0.2 0.3 0.4 0.5 0.6 0.7  0.8  0.9
# A 0.3 0.6 0.9 2.8 5.1 7.4 9.7 12.0 27.4
# B 1.2 2.5 3.3 3.8 4.3 4.7 6.6 10.1 13.5
The estimates are higher because the blue line is above the red line.
Finally, we can simply use the rectangles without any interpolation. Basically we set breaks at the boundaries of the data points and use those to identify which proportions fall in which groups of observations (seconds).
quant3 <- function(x, p=c(.25, .50, .75)) {
  v <- c(0, cumsum(x$visitors)/sum(x$visitors))
  limits <- cut(p, breaks=v, include.lowest=TRUE, labels=x$seconds)
  as.numeric(as.character(limits))
}
p <- 0:10/10
stats <- t(sapply(df.split, quant3, p=p))
colnames(stats) <- as.character(p)
stats
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
# A 1 1 1 1 1 1 1 12 12 12 40
# B 3 3 3 3 3 5 5 5 5 14 14
So for website A, 1 second is the value for quantiles 0 - .6.
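If you would rather not hand-roll the interpolation at all, a weighted-quantile helper can be applied directly to the aggregated counts; a sketch assuming the Hmisc package is installed (its quantile definition differs slightly from the approx-based curves above, so the numbers will not match exactly):
library(Hmisc)
# seconds weighted by visitor counts, computed per website
t(sapply(df.split, function(d) wtd.quantile(d$seconds, weights = d$visitors, probs = p)))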

R: interpolate a value from dataframe based on two inputs

I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1 2000 4.0 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2 2000 4.5 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3 2000 5.0 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4 2000 5.5 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5 2200 3.5 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6 2200 4.5 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
...
Let's say I have two values:
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000
Notice how every V value has a unique Teff, logg combination. From the input values, I would like to interpolate a value for V. Is there a way to do this in R?
Edit 1: Here is the link to the full data frame: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Building on Ian Campbell's observation that you can consider your data as points on a two-dimensional plane, you can use spatial interpolation methods. The simplest approach is inverse-distance weighting, which you can implement like this
library(data.table)
d <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(d,"#Teff","Teff")
First rescale the data as appropriate (not shown here, see Ian's answer)
library(gstat)
# fit model
idw <- gstat(id="V", formula = V~1, locations = ~Teff+logg, data=d, nmax=7, set=list(idp = .5))
# new "points" to predict to
newd <- data.frame(Teff=c(4100, 4852.928), logg=c(1.5, 1.9241934741026787))
p <- predict(idw, newd)
#[inverse distance weighted interpolation]
p$V.pred
#[1] -0.9818571 -0.3602857
For higher dimensions you could use fields::Tps (I think you can force that to be an exact method, that is, exactly honor the observations, by making each observation a node)
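A minimal sketch of that Tps idea, assuming the fields package is installed; lambda = 0 requests an exact fit through the observations, and Tps rescales the coordinates to a common range by default (scale.type = "range"):
library(fields)
# thin plate spline over the (Teff, logg) plane, predicting V
# note: if the same (Teff, logg) pair appears for several M_div_H values,
# average V per pair first
fit <- Tps(cbind(d$Teff, d$logg), d$V, lambda = 0)
predict(fit, cbind(4852.928, 1.9241935))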
We can imagine that Teff and logg exist in a 2-dimensional plane. We can see that your input point exists in that same space:
library(tidyverse)
ggplot(data, aes(x = Teff, y = logg)) +
  geom_point() +
  geom_point(data = data.frame(Teff = 4.8529282904170595e3, logg = 1.9241934741026787),
             color = "orange")
However, we can see the scale of Teff and logg are not the same. Simply taking log(Teff) gets us pretty close, but not quite. So we can rescale between 0 and 1 instead. We can create a custom rescale function. It will become clear why we can't use scales::rescale in a moment.
rescale = function(x,y){(x - min(y))/(max(y)-min(y))}
We can now rescale the data:
data %>%
  mutate(Teff.scale = rescale(Teff, Teff),
         logg.scale = rescale(logg, logg)) -> data
From here, we might use raster::pointDistance to calculate the distance from the input point to all of the scaled values:
raster::pointDistance(cbind(rescale(input_Teff, data$Teff), rescale(input_log_g, data$logg)),
                      data[, c("Teff.scale", "logg.scale")],
                      lonlat = FALSE)
We can use which.min to find the row with the minimum distance:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff, data$Teff), rescale(input_log_g, data$logg)),
                                     data[, c("Teff.scale", "logg.scale")],
                                     lonlat = FALSE)), ]
Teff logg M_div_H U B V R I J H K L Lprime M Teff.scale logg.scale
1: 4750 2 -0.1 -2.447 -1.438 -0.355 0.159 0.589 1.384 1.976 1.881 2.079 2.083 2.489 0.05729167 0.4631902
Here we can visualize the result:
ggplot(data, aes(x = Teff.scale, y = logg.scale)) +
  geom_point() +
  geom_point(data = data[which.min(raster::pointDistance(cbind(rescale(input_Teff, data$Teff), rescale(input_log_g, data$logg)),
                                                          data[, c("Teff.scale", "logg.scale")], FALSE)), ],
             color = "blue") +
  geom_point(data = data.frame(Teff.scale = rescale(input_Teff, data$Teff), logg.scale = rescale(input_log_g, data$logg)),
             color = "orange")
And access the appropriate value for V:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),data[,c("Teff.scale","logg.scale")], FALSE)),"V"]
V
1: -0.355
Data:
library(data.table)
data <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(data,"#Teff","Teff")
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000
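If you want a value that is actually interpolated between the surrounding grid points rather than taken from the single nearest one, bilinear interpolation over the scattered (Teff, logg) points is another option; a sketch assuming the akima package is installed (duplicate = "mean" averages V where the same Teff/logg pair occurs more than once):
library(akima)
V_interp <- with(data, interp(x = Teff, y = logg, z = V,
                              xo = input_Teff, yo = input_log_g,
                              linear = TRUE, duplicate = "mean"))
V_interp$z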

How to subset a vector in a way that represents the general shape of the original vector in R

I have vectors of different sizes, and I want to sample all of them equally (for example, 10 samples from each vector), in a way that the samples represent each vector.
suppose that one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
What are the 10 representative points of this vector?
In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:
library(dplyr)
df = as.data.frame(y)
y2 <- df %>%
  mutate(loc_minima = if_else(lag(y) > y & lead(y) > y, TRUE, FALSE)) %>%
  mutate(loc_maxima = if_else(lag(y) < y & lead(y) < y, TRUE, FALSE)) %>%
  filter(loc_minima == TRUE | loc_maxima == TRUE) %>%
  select(y)
Though this does not guarantee you'll have exactly 10 points.
Thanks to @minem, I got my answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
The side-by-side plots produced above show the original points and the reduced points obtained with the Ramer–Douglas–Peucker algorithm.
The best answer has already been given, but since I was working on it, I will post my naive heuristic solution:
Disclaimer: this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x, y, n = 10){
  idxReduced <- 1:length(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs], y[newIdxs])
      full <- spf(x)
      dist <- sum((full - y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x = x[idxReduced], y = y[idxReduced]))
}
Usage:
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
You could use cut to generate a factor that indicates in which part of the range your values fall (the code below uses four equal-width bins over the range of the values; if you want true quintiles, compute the breakpoints with quantile() instead), and then sample from there:
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
values quintiles
1 2.5 [-2,4.25]
2 1.0 [-2,4.25]
3 0.0 [-2,4.25]
4 1.2 [-2,4.25]
5 2.0 [-2,4.25]
6 3.0 [-2,4.25]
7 2.0 [-2,4.25]
8 1.0 [-2,4.25]
9 0.0 [-2,4.25]
10 -2.0 [-2,4.25]
11 -1.0 [-2,4.25]
12 0.5 [-2,4.25]
13 2.0 [-2,4.25]
14 3.0 [-2,4.25]
15 6.0 (4.25,10.5]
16 5.0 (4.25,10.5]
17 7.0 (4.25,10.5]
18 9.0 (4.25,10.5]
19 11.0 (10.5,16.8]
20 15.0 (10.5,16.8]
21 23.0 (16.8,23]
Now you could split the data by these bins, calculate the sampling proportions, and sample from the groups.
groups <- split(df, df$quintiles)
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y)/n)
first <- sample(step, 1)
z <- y[step*(seq_len(n) - 1) + first]
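A deterministic variant of the same idea, if you do not want the random starting offset, is to simply take (roughly) evenly spaced points along the vector:
# evenly spaced indices from the first to the last element
idx <- round(seq(1, length(y), length.out = n))
z <- y[idx]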

Using approx() with groups in dplyr

I am trying to use approx() and dplyr to interpolate values in an existing array. My initial code looks like this ...
p = c(1,1,1,2,2,2)
q = c(1,2,3,1,2,3)
r = c(1,2,3,4,5,6)
Inputs<- data.frame(p,q,r)
new.inputs= as.numeric(c(1.5,2.5))
library(dplyr)
Interpolated <- Inputs %>%
  group_by(p) %>%
  arrange(p, q) %>%
  mutate(new.output = approx(x = q, y = r, xout = new.inputs)$y)
I expect to see 1.5, 2.5, 4.5, 5.5 but instead I get
Error: incompatible size (2), expecting 3 (the group size) or 1
Can anyone tell me where I am going wrong?
You can get the values you expect using dplyr.
library(dplyr)
Inputs %>%
  group_by(p) %>%
  arrange(p, q, .by_group = TRUE) %>%
  summarise(new.outputs = approx(x = q, y = r, xout = new.inputs)$y)
# p new.outputs
# <dbl> <dbl>
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5
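Note that on recent dplyr versions (1.1.0 and later), summarise() warns when a group returns more than one row; reframe() is the suggested replacement for this pattern and gives the same values:
Inputs %>%
  group_by(p) %>%
  arrange(p, q, .by_group = TRUE) %>%
  reframe(new.outputs = approx(x = q, y = r, xout = new.inputs)$y)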
You can also get the values you expect using the ddply function from plyr.
library(plyr)
# Output as coordinates
ddply(Inputs, .(p), summarise,
      new.output = paste(approx(x = q, y = r, xout = new.inputs)$y, collapse = ","))
# p new.output
# 1 1.5,2.5
# 2 4.5,5.5
#######################################
# Output as flattened per group p
ddply(Inputs,
      .(p),
      summarise,
      new.output = approx(x = q, y = r, xout = new.inputs)$y)
# p new.output
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5

Generate multiple serial graphs/scatterplots from data in two dataframes

I have 2 dataframes, Tg and Pf, each with 127 columns. All columns have at least one row and can have up to thousands of them. All the values are between 0 and 1 and there are some missing values (empty cells). Here is a little subset:
Tg
Tg1 Tg2 Tg3 ... Tg127
0.9 0.5 0.4 0
0.9 0.3 0.6 0
0.4 0.6 0.6 0.3
0.1 0.7 0.6 0.4
0.1 0.8
0.3 0.9
0.9
0.6
0.1
Pf
Pf1 Pf2 Pf3 ...Pf127
0.9 0.5 0.4 1
0.9 0.3 0.6 0.8
0.6 0.6 0.6 0.7
0.4 0.7 0.6 0.5
0.1 0.6 0.5
0.3
0.3
0.3
Note that some cells are empty, and the vectors for the same column index (i.e. 1 to 127) can be of very different lengths and are rarely exactly the same length.
I want to generate 127 graphs, one per column pair (i.e. graph 1 uses column 1 from each data frame, graph 2 uses column 2 from each data frame, etc.).
Hope that makes sense. I'm looking forward to your assistance, as I don't want to make those graphs one by one...
Thanks!
Here is an example to get you started (data at https://gist.github.com/1349300). For further tweaking, check out the excellent ggplot2 documentation that is all over the web.
library(ggplot2)
library(reshape2)  # for melt()
library(plyr)      # for ddply() and join(), used further below
# Load data
Tg = read.table('Tg.txt', header=T, fill=T, sep=' ')
Pf = read.table('Pf.txt', header=T, fill=T, sep=' ')
# Format data
Tg$x = as.numeric(rownames(Tg))
Tg = melt(Tg, id.vars='x')
Tg$source = 'Tg'
Tg$variable = factor(as.numeric(gsub('Tg(.+)', '\\1', Tg$variable)))
Pf$x = as.numeric(rownames(Pf))
Pf = melt(Pf, id.vars='x')
Pf$source = 'Pf'
Pf$variable = factor(as.numeric(gsub('Pf(.+)', '\\1', Pf$variable)))
# Stack data
data = rbind(Tg, Pf)
# Plot
dev.new(width=5, height=4)
p = ggplot(data=data, aes(x=x)) + geom_line(aes(y=value, group=source, color=source)) + facet_wrap(~variable)
p
Highlighting the area between the lines
First, interpolate the data onto a finer grid. This way the ribbon will follow the actual envelope of the lines, rather than just where the original data points were located.
data = ddply(data, c('variable', 'source'), function(x) data.frame(approx(x$x, x$value, xout=seq(min(x$x), max(x$x), length.out=100))))
names(data)[4] = 'value'
Next, calculate the data needed for geom_ribbon - namely ymax and ymin.
ribbon.data = ddply(data, c('variable', 'x'), summarize, ymin=min(value), ymax=max(value))
Now it is time to plot. Notice how we've added a new ribbon layer, for which we've substituted our new ribbon.data frame.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, data=ribbon.data)
Dynamic coloring between the lines
The trickiest variation is if you want the coloring to vary based on the data. For that, you currently must create a new grouping variable to identify the different segments. Here, for example, we might use a function that indicates when the "Tg" group is on top:
GetSegs <- function(x) {
  segs = x[x$source=='Tg', ]$value > x[x$source=='Pf', ]$value
  segs.rle = rle(segs)
  on.top = ifelse(segs, 'Tg', 'Pf')
  on.top[is.na(on.top)] = 'Tg'
  group = rep.int(1:length(segs.rle$lengths), times=segs.rle$lengths)
  group[is.na(segs)] = NA
  data.frame(x=unique(x$x), group, on.top)
}
Now we apply it and merge the results back with our original ribbon data.
groups = ddply(data, 'variable', GetSegs)
ribbon.data = join(ribbon.data, groups)
For the plot, the key is that we now specify a grouping aesthetic to the ribbon geom.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, group=group, fill=on.top), alpha=0.3, data=ribbon.data)
Code is available together at: https://gist.github.com/1349300
Here is a three-liner to do the same :-). We first use reshape from base R to convert the data into long form. Then, it is melted to suit ggplot2. Finally, we generate the plot!
mydf <- reshape(cbind(Tg, Pf), varying = 1:8, direction = 'long', sep = "")
mydf_m <- melt(mydf, id.var = c(1, 4), variable = 'source')
qplot(id, value, colour = source, data = mydf_m, geom = 'line') +
  facet_wrap(~ time, ncol = 2)
NOTE. The reshape function in base R is extremely powerful, albeit very confusing to use. It is used to transform data between long and wide formats.
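For reference, here is a tiny self-contained illustration of that wide-to-long call with made-up column names (sep = "" tells reshape to split names like Tg1 into the variable "Tg" and the time 1):
wide <- data.frame(Tg1 = 1:3, Tg2 = 4:6, Pf1 = 7:9, Pf2 = 10:12)
long <- reshape(wide, varying = 1:4, direction = "long", sep = "")
long  # one row per id/time combination, with Tg and Pf columns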
Kudos for automating something you used to do in Excel using R! That's exactly how I got started with R and a common path to R enlightenment :)
All you really need is a little looping. Here's an example, most of which is creating example data that represents your data structure:
## create some example data
Tg <- data.frame(Tg1 = rnorm(10))
for (i in 2:10) {
  vec <- rep(NA, 8)
  vec <- c(rnorm(sample(5:10, 1)), vec)
  Tg[paste("Tg", i, sep="")] <- vec[1:10]
}
Pf <- data.frame(Pf1 = rnorm(10))
for (i in 2:10) {
  vec <- rep(NA, 8)
  vec <- c(rnorm(sample(5:10, 1)), vec)
  Pf[paste("Pf", i, sep="")] <- vec[1:10]
}
## ok, sample data created
## now lets loop through all the columns
## if you didn't know how many columns there are you could
## use ncol(Tg) to figure out
for (i in 1:10) {
  plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3),
       xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
  lines(1:10, Pf[,i], type = "l", col="red", lwd=5, ylim=c(-3,3))
  dev.copy(png, paste('rplot', i, '.png', sep=""))
  dev.off()
}
This will result in 10 PNG files in your working directory, one graph per column pair, with the Tg series in blue and the Pf series in red.
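If you are running this non-interactively (e.g. via Rscript, with no screen device open), a variation that writes each file directly with png() avoids the dev.copy() step; a minimal sketch:
for (i in 1:10) {
  png(paste('rplot', i, '.png', sep=""), width = 600, height = 400)
  plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3),
       xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
  lines(1:10, Pf[,i], col="red", lwd=5)
  dev.off()
}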
