Histogram-like summary for interval data - r

How do I get a histogram-like summary of interval data in R?
My MWE data has four intervals.
interval range
Int1 2-7
Int2 10-14
Int3 12-18
Int4 25-28
I want a histogram-like function that counts how many of the intervals Int1-Int4 fall into each bin when the overall range is split into fixed-size bins.
The function output should look like this:
bin count which
[0-4] 1 Int1
[5-9] 1 Int1
[10-14] 2 Int2 and Int3
[15-19] 1 Int3
[20-24] 0 None
[25-29] 1 Int4
Here the range is [floor(min(Int1, ..., Int4)), ceil(max(Int1, ..., Int4))) rounded out to bin boundaries, i.e. [0, 30), and there are six bins of size 5.
I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.
Update:
So far, I have a solution using the IRanges package, which uses a fast data structure called a Nested Containment List (NCList), reportedly faster than interval search trees.
> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1
But I am still unable to get which ranges overlap each bin. Will update if I get further.

With IRanges, you should use findOverlaps() or mergeByOverlaps() instead of countOverlaps(). Note that, by default, they do not return queries with no matches.
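A minimal sketch of the findOverlaps() route, using the same query and subject as above (the split() step is just one way of building the "which" column; bins with no overlaps come back empty):
library(IRanges)
subject <- IRanges(c(2, 10, 12, 25), c(7, 14, 18, 28))
query   <- IRanges(c(0, 5, 10, 15, 20, 25), c(4, 9, 14, 19, 24, 29))
hits <- findOverlaps(query, subject)
# for each bin (query), collect the indices of the overlapping intervals (subjects)
split(subjectHits(hits), factor(queryHits(hits), levels = seq_along(query)))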
I'll leave the polishing to you. Instead, I'll show an alternative method using foverlaps() from the data.table package:
require(data.table)
subject <- data.table(interval = paste("int", 1:4, sep=""),
                      start = c(2, 10, 12, 25),
                      end = c(7, 14, 18, 28))
query <- data.table(start = c(0, 5, 10, 15, 20, 25),
                    end = c(4, 9, 14, 19, 24, 29))
setkey(subject, start, end)  # foverlaps() requires the subject's interval columns to be keyed
ans = foverlaps(query, subject, type = "any")
# count the overlapping intervals per bin and collapse their names; unmatched bins have NA starts
ans[, .(count = sum(!is.na(start)),
        which = paste(interval, collapse = ", ")),
    by = .(i.start, i.end)]
# i.start i.end count which
# 1: 0 4 1 int1
# 2: 5 9 1 int1
# 3: 10 14 2 int2, int3
# 4: 15 19 1 int3
# 5: 20 24 0 NA
# 6: 25 29 1 int4

Related

Rolling window with dplyr to find value of factor

I have a tibble like this:
head(a)
# A tibble: 6 x 4
date ROE ROFE ROTFE
<date> <dbl> <dbl> <dbl>
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522
I want to pick, at each date, the value of the factor with the highest cumulative return over the last 2 months.
I have done something like this and it works.
However, my friend told me that it can be done in one or two lines of dplyr and I'm wondering if you could please show me how to do that.
index = as.Date(unique(a$date))
nmonth = 2
mean.ROE = numeric()
for (i in 1:(length(index) - nmonth)) { # i = 2
  index1 = index[i]
  index2 = index[nmonth + i]
  index3 = index[nmonth + i + 1]
  # Take a 2-month window of ROE returns:
  b = a[a$date >= index1 & a$date < index2, ] %>%
    mutate(cum.ROE = cumprod(1 + ROE)) %>%
    mutate(cum.ROFE = cumprod(1 + ROFE)) %>%
    mutate(cum.ROTFE = cumprod(1 + ROTFE))
  # Use the cumulative return over the 2-month window to determine which factor is best.
  mean.ROE1 = ifelse(b$cum.ROE[nmonth] > b$cum.ROFE[nmonth] & b$cum.ROE[nmonth] > b$cum.ROTFE[nmonth],
                     a[a$date == index3, ]$ROE,
                     ifelse(b$cum.ROFE[nmonth] > b$cum.ROE[nmonth] & b$cum.ROFE[nmonth] > b$cum.ROTFE[nmonth],
                            a[a$date == index3, ]$ROFE,
                            a[a$date == index3, ]$ROTFE))
  # Bind the answer to the answer vector
  mean.ROE = rbind(mean.ROE, mean.ROE1)
}
Create a function maxret which takes a window of 2 + nmonth rows, x, and calculates the cumulative returns, r, for each column over the first two rows. For the largest of those, return the corresponding value in the last row of x.
Now use rollapplyr to apply it over a rolling window of width 2 + nmonth:
library(zoo)
maxret <- function(x) {
  r <- apply(1 + x[1:2, ], 2, prod)
  x[2 + nmonth, which.max(r)]
}
z <- read.zoo(as.data.frame(a))
res <- rollapplyr(z, 2 + nmonth, maxret, by.column = FALSE)
giving the zoo series:
> res
2000-04-28 2000-05-31 2000-06-30
0.06002222 -0.03620956 0.03075652
If you want a data frame, use fortify.zoo(res).
Note 1: The input was not provided in reproducible form in the question, so I have assumed this data.frame:
Lines <-
"date ROE ROFE ROTFE
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522"
a <- read.table(text = Lines, header = TRUE)
Note 2: With the input in Note 1, or with zoo 1.8.1 (the development version of zoo), this line:
z <- read.zoo(as.data.frame(a))
could be simplified to just:
z <- read.zoo(a)
but we have added the as.data.frame part in the main code so it works with tibbles as well as straight data frames even with the current version of zoo on CRAN.
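Since the question asked specifically about dplyr: a rough sketch using lag(), hard-coded for nmonth = 2 (each row's pick is based on the cumulative returns of the rows 3 and 2 positions before it, mirroring the loop in the question; ties are broken slightly differently than in the nested ifelse):
library(dplyr)
a %>%
  mutate(cum.ROE   = (1 + lag(ROE, 3))   * (1 + lag(ROE, 2)),
         cum.ROFE  = (1 + lag(ROFE, 3))  * (1 + lag(ROFE, 2)),
         cum.ROTFE = (1 + lag(ROTFE, 3)) * (1 + lag(ROTFE, 2))) %>%
  filter(!is.na(cum.ROE)) %>%                 # drop the rows without a full 2-month history
  mutate(pick = case_when(
    cum.ROE >= cum.ROFE & cum.ROE >= cum.ROTFE ~ ROE,
    cum.ROFE >= cum.ROTFE                      ~ ROFE,
    TRUE                                       ~ ROTFE)) %>%
  select(date, pick)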

match 2 column elements based on a difference within a range

I want to match the elements of two unequal-length columns from two different data frames if they differ by an amount within the range 1 to 3 (2 +/- 1).
My data frames:
dat1:
Number status
10023 T
10324 F
12277 F
12888 T
12000 T
dat2:
Number status
10020 T
10002 F
12279 F
12888 T
Required output:
10023 10020 T
12277 12279 F
My attempt (below) did not work:
diff <- 2
allow <- 1
NewData <- dat1$Number %in% (dat2$Number<=diff+allow | dat2$Number>=diff+allow)
Help will be appreciated.
This looks like a perfect case for data.table::foverlaps to me.
The workflow is to create start and end columns in both data sets (building the allowed range in the second one), key both, and simply run foverlaps():
library(data.table)
diff <- 2
allow <- 1
setDT(dat1)[, `:=`(start = Number, end = Number)]
setkey(dat1, status, start, end)
setDT(dat2)[, `:=`(start = Number - (diff + allow), end = Number + diff + allow)]
setkey(dat2, status, start, end)
foverlaps(dat2, dat1, nomatch = 0L)[, .(Numdf1 = Number, Numdf2 = i.Number, status)]
# Numdf1 Numdf2 status
# 1: 12277 12279 FALSE
# 2: 10023 10020 TRUE
# 3: 12888 12888 TRUE ### <- I'm assuming you had an error in the desired output
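For small tables, a base-R sketch of the same matching (same +/-(diff + allow) tolerance and same-status requirement; the Numdf1/Numdf2 names just mirror the output above):
idx <- which(abs(outer(dat1$Number, dat2$Number, "-")) <= diff + allow &
             outer(dat1$status, dat2$status, "=="),
             arr.ind = TRUE)
# idx[, 1] indexes rows of dat1, idx[, 2] rows of dat2
data.frame(Numdf1 = dat1$Number[idx[, 1]],
           Numdf2 = dat2$Number[idx[, 2]],
           status = dat1$status[idx[, 1]])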

Combine information from two data frames with dplyr

I need some help with dplyr.
I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).
I would like to add another column to LargeDF, labeled left_boundary, containing the appropriate boundary value, like so:
LargeDF
ts timestamp signal # left_boundary
1 A 0.3209338 10.43279 # 0
2 A 1.4791524 10.34295 # 1
3 A 2.6007494 10.71601 # 2
and
Categories
ts left right
1 A 0 1
2 A 1 2
3 A 2 3
The code I came up with is:
LargeDF %>%
  group_by(ts) %>%
  do(myFUN(., Categories))
# ... which calls this:
myFUN <- function(Large, Categ) {
  CategTS <- Categ %>%
    filter(ts == Large[1, "ts"][[1]])
  Large %>%
    group_by(timestamp) %>%   # this is bothering me...
    mutate(left_boundary = CategTS$left[CategTS$left < timestamp
                                        & timestamp < CategTS$right])
}
but it is super slow for large time series. I would really like to lose the group_by(timestamp), as timestamps are unique within each ts anyway.
Does someone see a better solution? That would be much appreciated.
# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
    ts = rep(series, each = n)
  , timestamp = runif(n * length(series), max = 4)
  , signal = runif(n * length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)
m <- 7
Categories <- data.frame(
    ts = rep(series, each = m)
  , left = rep(seq(1 : m) - 1, length(series))
  , right = rep(seq(1 : m), length(series))
)
Update (data.table and my slightly modified mockup)
So, I tried the suggestions from @DavidArenburg on a quick and dirty mockup example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).
> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
left right value timestamp timestamp2
1: 0.9 1.9 0.1885459 1 1
2: 0.9 1.9 0.0542375 2 2 # binned here
3: 1.9 2.9 0.0542375 2 2 # and here as well
13: 19.9 25.9 0.4579986 20 20
I then read that minoverlap = 1L is the default and realized that a normal timestamp is >> 1.
> as.numeric(Sys.time())
[1] 1429022267
Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.
left right value timestamp timestamp2
1: 9 19 0.64971126 10 10
2: 19 29 0.75994751 20 20
3: 29 99 0.98276462 30 30
9: 199 259 0.89816165 200 200
With my real data, everything went smoothly, so thanks again.
## Code for my data.table example -----
library(data.table)
n <- 1
d <- data.table(value = runif(9),
                timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20) * n,
                timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20) * n)
c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9) * n,
                right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9) * n)
setkey(c, left, right)
foverlaps(d, c, type = "any", by.x = c("timestamp", "timestamp2"))
Update 2 (JOIN, then FILTER, within dplyr)
I tested the suggestion from @aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() it again. Very quickly, I ran into memory issues:
Error: std::bad_alloc
Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.
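For reference, a sketch of that join-then-filter approach (it materialises the full join, hence the memory blow-up on large data):
library(dplyr)
LargeDF %>%
  left_join(Categories, by = "ts") %>%             # all candidate intervals per ts
  filter(left < timestamp, timestamp < right) %>%  # keep the interval containing each timestamp
  rename(left_boundary = left) %>%
  select(-right)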
dplyr isn't suitable for such operations; try data.table's foverlaps() function instead:
library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary
# ts timestamp signal left right left_boundary
# 1: A 0.46771516 10.72175 0.46771516 0.46771516 0
# 2: A 0.58841492 10.35459 0.58841492 0.58841492 0
# 3: A 1.14494484 10.50301 1.14494484 1.14494484 1
# 4: A 1.18298225 10.82431 1.18298225 1.18298225 1
# 5: A 1.69822678 10.04780 1.69822678 1.69822678 1
# 6: A 1.83189609 10.75001 1.83189609 1.83189609 1
# 7: A 1.90947475 10.94715 1.90947475 1.90947475 1
# 8: A 2.73305266 10.14449 2.73305266 2.73305266 2
# 9: A 3.02371968 10.17724 3.02371968 3.02371968 3
# ...
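As an aside, more recent data.table versions (1.9.8+) also support non-equi joins, which could do the same lookup without the helper left/right columns; a rough sketch, starting from the original (unmodified) tables:
library(data.table)
setDT(LargeDF)
setDT(Categories)
# update join: for each category interval, assign its left value to the matching LargeDF rows
LargeDF[Categories,
        on = .(ts, timestamp > left, timestamp < right),
        left_boundary := i.left]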

How to find sum and average for some columns based on the numbers from another column in R

GIVEN DATA
I have 6 columns of vehicle trajectory data (observations of vehicles' change in position, velocity, etc. over time), a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles, e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which the vehicle was observed. Please note that each frame is 0.1 seconds long, so 10 frames make 1 second. The frame IDs are in the Frame ID column. Vehicle class is the type of vehicle (1 = motorcycle, 2 = car, 3 = truck). The Vehicle velocity column gives the instantaneous speed of the vehicle in that instant of time, i.e. in a frame. Lane is the number or ID of the lane in which the vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30-second time periods, i.e. 5-305, 305-605, 605-905, ..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:'Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of the data you posted above) and then created 300-frame buckets using cut. From there, you just let data.table do the work. In the first expression we calculate the total unique vehicles per frame.group:
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
                     breaks = seq(5, 9905, by = 300),
                     dig.lab = 5,
                     include.lowest = TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
                   .variables = c("timeSlot"),
                   .fun = summarise,
                   totalCars = length(unique(VehicleID)),
                   AverageVelocity = mean(velocity))
# include VehicleClass
data.sum2 <- ddply(.data = data,
                   .variables = c("timeSlot", "VehicleClass"),
                   .fun = summarise,
                   totalCars = length(unique(VehicleID)),
                   AverageVelocity = mean(velocity))
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
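If you prefer dplyr, an equivalent sketch (assuming the same column names as in the read.table call above):
library(dplyr)
data %>%
  mutate(timeSlot = cut(FrameID, breaks = seq(5, 9905, by = 300), include.lowest = TRUE)) %>%
  group_by(timeSlot, VehicleClass) %>%
  summarise(totalCars = n_distinct(VehicleID),
            AverageVelocity = mean(velocity))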

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that falls within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by passing arguments from Java (as I have with the input file name), that would be even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). So if the quartile for O was 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or find the quartiles and then use those values in code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
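To keep the pairs together, as asked in the update, the same idea extends to two columns; a sketch assuming the csv has been read into Values as in the question's code:
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_thr <- Values[keep, ]  # drop a row if either column falls in its own bottom quartile
cor(Values_thr$Abundance_O, Values_thr$Abundance_S)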
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]  # keep rows where both quartile groups exceed the first quartile
