select records according to the difference between records R - r

I hope someone can suggest something for this "problem", because I really don't know how to proceed.
Well, my data are like this
data<-data.frame(site=c(rep("A",3),rep("B",3),rep("C",3)),time=c(100,180,245,5,55,130,70,120,160))
where time is in minutes.
I want to select only the records, for each site, for which the difference is more than 60, so the output should be like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far: to get the difference I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then I have no idea how to pick out those records which satisfy my condition.
If there is already a similar question, I apologize; I searched for a while without finding one.
To make things clear, as my definition of difference was probably ambiguous: I need to select all the records (for each site) which are separated by at least 60 minutes, so not only those that are strictly consecutive in time.
Specifically,
> out
site time
1 A 100 # included because difference between 2 and 1 is > 60
2 A 180 # included because difference between 3 and 2 is > 60
3 A 245 # included because separated by more than 60 minutes from record 2
4 B 5 # included because difference between 6 and 4 is > 60
6 B 130 # included because separated by more than 60 minutes from record 4
7 C 70 # included because difference between 9 and 7 is > 60
9 C 160 # included because separated by more than 60 minutes from record 7
Maybe, to solve the "problem", it could be useful to consider the differences themselves, something like this:
> difference
values ind
1 80 A#include record 1 and 2
2 65 A#include record 2 and 3
3 50 B#include only record 4
4 75 B#include record 6 because there are(50+75)>60 m from r#4
5 50 C#include only record 7
6 40 C#include record 9 because there are (50+40)>60 m from r#7
Thanks for the help.

data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
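The cumsum version above measures every gap from the first record of each site, so on other data it could keep two records that lie closer than 60 minutes to each other once both are more than 60 minutes past the first. A minimal sketch of a greedy alternative that measures each gap from the last kept record instead, assuming the rows are already sorted by time within each site (keep_spaced is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: keep the first record, then keep a record only if it is
# more than `gap` minutes after the most recently kept record.
keep_spaced <- function(x, gap = 60) {
  keep <- logical(length(x))
  last <- -Inf                      # so the first record is always kept
  for (i in seq_along(x)) {
    if (x[i] - last > gap) {
      keep[i] <- TRUE
      last <- x[i]
    }
  }
  keep
}

data <- data.frame(site = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                   time = c(100, 180, 245, 5, 55, 130, 70, 120, 160))

# ave() applies keep_spaced within each site and returns 0/1 values
keep <- as.logical(ave(data$time, data$site, FUN = keep_spaced))
out <- data[keep, ]
```

On the sample data this selects the same rows as the requested output (1, 2, 3, 4, 6, 7, 9); the loop is per site, so it stays cheap unless single sites are very large.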

#Calculate the differences
data$diff <- unlist(by(data$time, data$site,function(x)c(NA,diff(x))))
#subset data
data[is.na(data$diff) | data$diff > 60,]

Using plyr:
library(plyr)
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])

Related

purrr map / lapply / sapply across groups of multiple (n > 1) elements at a time?

Suppose we have a vector, we can easily enough lapply, sapply or map across 1 element at a time.
Is there a way to do the same across groups of (>1) elements of the vector?
Example
Suppose we are constructing API calls by appending comma-separated user_identifiers to the URL, like so:
user_identifiers <- c("0011399", "0011400", "0013581", "0013769", "0013770", "0018374",
"0018376", "0018400", "0018401", "0018410", "0018415", "0018417",
"0018419", "0018774", "0018775", "0018776", "0018777", "0018778",
"0018779", "0021627", "0023492", "0023508", "0023511", "0023512",
"0024120", "0025672", "0025673", "0025675", "0025676", "0028226",
"0028227", "0028266", "0028509", "0028510", "0028512", "0028515",
"0028518", "0028520", "0028523", "0029160", "0033141", "0034586",
"0035035", "0035310", "0035835", "0035841", "0035862", "0036503",
"0036580", "0036583", "0036587", "0037577", "0038582", "0038583",
"0038587", "0039727", "0039729", "0039731", "0044703", "0044726"
)
get_data <- function(user_identifier) {
url <- paste0("https://www.myapi.com?userIdentifier=",
paste0(user_identifier, collapse=","))
fromJSON(url)
}
In the above, get_data(user_identifiers) would return the API's response for all 60 user_identifiers in one single request.
But suppose the API accepts a maximum of 10 identifiers at a time (so we cannot do all 60 at once).
A simple solution would be to map/lapply/sapply over each element, e.g. sapply(user_identifiers, get_data). This would work fine; however, we would make 60 API calls when all we really need is 6. If we could map/lapply/sapply over groups of 10 at a time, that would be ideal.
Question
Is there an elegant way to map/lapply/sapply over groups of n elements at a time (where n>1)?
We can split user_identifiers in groups of 10 and use sapply/map/lapply
sapply(split(user_identifiers, gl(length(user_identifiers)/10, 10)), get_data)
where gl creates groups from 1 to 6 each of length 10.
gl(length(user_identifiers)/10, 10)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
#Levels: 1 2 3 4 5 6
The same groups can be created with rep
rep(1:ceiling(length(user_identifiers)/10), each = 10)
As @thelatemail mentioned, we can use cut and specify the number of groups to cut the data into
sapply(split(user_identifiers, cut(seq_along(user_identifiers),6)), get_data)
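The gl and rep variants above work cleanly when the vector length is an exact multiple of the group size, as it is with 60 identifiers. A minimal sketch of a grouping that also copes with a shorter final group, using ceiling(seq_along(x) / n); chunk is a hypothetical helper name and the identifiers are made up for illustration:

```r
# Hypothetical helper: split x into consecutive groups of at most n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x) / n))

# 23 made-up identifiers, so the last group has only 3 elements
ids <- sprintf("%07d", 1:23)
groups <- chunk(ids, 10)
lengths(groups)

# Build one URL per group instead of calling the API, to show what
# get_data() would receive for each group
urls <- vapply(groups, function(g) {
  paste0("https://www.myapi.com?userIdentifier=", paste0(g, collapse = ","))
}, character(1))
```

With the real data, lapply(chunk(user_identifiers, 10), get_data) would issue one request per group.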

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see whether the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
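For moderate sizes, the same two indicator columns can also be computed without an explicit loop. A minimal base-R sketch using outer() on the sample values from the question; the tol variable and the Between/Match column names are assumptions made for illustration, and the comparison matrix has nrow(possible) x nrow(truth) cells, so this is not suited to very large inputs:

```r
truth <- data.frame(ID = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

# tol = 0 tests strict containment; tol = 2 would implement the
# "within 2 seconds of start or stop" variant mentioned in the question
tol <- 0

# hit[i, j] is TRUE when possible$Times[i] falls inside truth interval j
hit <- outer(possible$Times, truth$Start - tol, ">=") &
       outer(possible$Times, truth$Stop + tol, "<=")

possible$Between <- as.integer(rowSums(hit) > 0)  # any interval contains the time
truth$Match      <- as.integer(colSums(hit) > 0)  # any time falls in the interval
```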

R data.frame flow data pre-processing for aggregated time statistics

What is the most efficient way of processing a flow data.frame like
> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0))
> df
amount size tot first secs
1 4 164 1 1 2
2 3 124 1 1 2
3 1 131 1 3 0
4 1 315 1 3 0
5 4 1128 1 2 1
6 5 331 1 2 1
7 9 1135 1 2 1
8 13 13589 1 2 1
9 1 164 1 4 0
10 1 68 1 4 0
into per-time aggregated totals
> df2
time tot amount size
1 1 2 3.5 144
2 2 6 34.5 16327
3 3 8 36.5 16773
4 4 2 2.0 232
using R, when the actual data set can have more than 100 000 000 rows or even be tens of gigabytes?
Column first denotes the start of a flow with duration secs, with metrics amount, size, and tot. In the aggregated totals, size and amount are divided evenly over the time range in double precision, whereas tot is added to every time slot as an integer. Duration secs denotes how many time slots the flow lasts in addition to the value of first: if secs is 1 and first is 5, the flow lasts for time slots 5 and 6. My current implementation uses ugly and dead-slow for-loops, which are not an option:
df2 = data.frame()
for (i in 1:nrow(df)) {
items <- df[i, 'secs']
idd <- df[i, 'first']
for (ss in 0:items) { # run once for secs=0
if (items == 0) { items <- 1 }
df2[idd+ss, 'time'] <- idd+ss
if (is.null(df2[idd+ss, 'tot']) || is.na(df2[idd+ss, 'tot'])) {
df2[idd+ss, 'tot'] <- df[i, 'tot']
} else {
df2[idd+ss, 'tot'] <- df2[idd+ss, 'tot'] + df[i, 'tot']
}
if (is.null(df2[idd+ss, 'amount']) || is.na(df2[idd+ss, 'amount'])) {
df2[idd+ss, 'amount'] <- df[i, 'amount']/items
} else {
df2[idd+ss, 'amount'] <- df2[idd+ss, 'amount'] + df[i, 'amount']/items
}
if (is.null(df2[idd+ss, 'size']) || is.na(df2[idd+ss, 'size'])) {
df2[idd+ss, 'size'] <- df[i, 'size']/items
} else {
df2[idd+ss, 'size'] <- df2[idd+ss, 'size'] + df[i, 'size']/items
}
}
}
You can probably optimize this a lot and achieve good performance using only loops, but I bet that better algorithms exist. Maybe you could somehow expand/duplicate the rows with secs > 0, while increasing the first (timestamp) values of the expanded rows and adjusting the amount, size, and tot metrics on the fly:
Now the original data
amount size tot first secs
1 4 164 1 1 0
2 4 164 1 1 1
3 3 124 1 1 2
magically becomes
amount size tot first
1 4 164 1 1
2 2 82 1 1
3 2 82 1 2
4 1 41.33 1 1
5 1 41.33 1 2
6 1 41.33 1 3
After this pre-processing step, aggregation would be trivial using plyr ddply, of course in efficient parallel mode.
All the ddply, apply, etc. examples I was able to find operate on a per-row or per-column basis, making it hard to modify other rows. Hopefully I don't have to rely on awk-magic.
Update: The mentioned algorithm can easily exhaust your memory when the expansion is done "as is". Some kind of "on the fly" calculation is thus preferred, where we don't map everything to memory. Mattrition's answer is however correct and helped a lot, so marking it as the accepted answer.
The following is an implementation using data.table. I chose data.table for its aggregation abilities, but it's a nifty and efficient class to work with too.
library(data.table)
dt <- as.data.table(df)
# Using the "expand" solution linked in the Q.
# +1 to secs to allow room for 0-values
dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)]
# Create a new seci column that enumerates sec for each row of dt
dtr[,seci := dt[,seq(0,secs),by=1:nrow(dt)][,V1]]
# All secs that equal 0 are changed to 1 for later division
dtr[secs==0, secs := 1]
# Create time (first+seci) and adjusted amount and size columns
dtr[,c("time", "amount2", "size2") := list(first+seci, amount/secs, size/secs)]
# Aggregate selected columns (tot, amount2, and size2) by time
dtr.a <- dtr[,list(tot=sum(tot), amount=sum(amount2), size=sum(size2)), by=time]
dtr.a
time tot amount size
1: 1 2 3.5 144
2: 2 6 34.5 16327
3: 3 8 36.5 16773
4: 4 2 2.0 232
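The same expand-then-aggregate idea can be sketched in base R, using sequence() to enumerate the offsets of each flow. This is only an illustration under the same convention as the loop in the question and the data.table code above (each flow covers secs + 1 slots, and amount and size are divided by secs, with 0 treated as 1); the names n, idx, offset, div and expanded are made up here, and the expansion still materialises every slot in memory:

```r
df <- data.frame(amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
                 size = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
                 tot = 1,
                 first = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
                 secs = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0))

n      <- df$secs + 1                  # number of time slots per flow
idx    <- rep(seq_len(nrow(df)), n)    # source row for every expanded row
offset <- sequence(n) - 1              # 0, 1, ..., secs within each flow
div    <- pmax(df$secs, 1)[idx]        # divisor: secs, with 0 treated as 1

expanded <- data.frame(time   = df$first[idx] + offset,
                       tot    = df$tot[idx],
                       amount = df$amount[idx] / div,
                       size   = df$size[idx] / div)

# Sum the three metrics per time slot
df2 <- aggregate(cbind(tot, amount, size) ~ time, data = expanded, FUN = sum)
```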

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process dataframes, i.e. filtering out certain rows. The code gets very illegible as I tend to choose meaningful, but long, names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
id gene pos aa consequence V
ENSG00000105732 ZN574_HUMAN 81 x/N missense_variant 3
ENSG00000125879 OTOR_HUMAN 7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN 20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN 33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN 11 Q/R missense_variant 3
ENSG00000171533 MAP6_HUMAN 39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == "ZN574_HUMAN"), ]
So in order to kick out all other lines in which I am not interested, I need to reference the object all.mutations.extra.large.name 3 times. And repeating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: Is there a way to filter out rows by a criterion without referencing the object 3 times? Something like this would be beautiful: myobj[,gene=="ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
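A small sketch comparing subset with the explicit indexing form, on a made-up three-row data frame (muts and its values are invented for illustration). with() is shown as a third option that evaluates the condition inside the data frame, so the long name appears only twice:

```r
muts <- data.frame(gene = c("ZN574_HUMAN", "OTOR_HUMAN", "ZN574_HUMAN"),
                   pos = c(81, 7, 20),
                   stringsAsFactors = FALSE)

# Column names are looked up inside the data frame
by_subset <- subset(muts, gene == "ZN574_HUMAN")

# Equivalent explicit indexing, repeating the object name
by_index <- muts[muts$gene == "ZN574_HUMAN", ]

# with() evaluates the whole condition inside muts
by_with <- muts[with(muts, gene == "ZN574_HUMAN" & pos > 50), ]
```

The documentation for subset describes it as a convenience function intended for interactive use, so the indexing forms may be the safer choice inside functions.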
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
a b
1 1 4
2 2 9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo.

Is there a faster way to get percent change?

I have a data frame with around 25000 records and 10 columns. I am using code to determine the change to the previous value in the same column (NewVal) based on another column (y) with a percent change already in it.
x=c(1:25000)
y=rpois(25000,2)
z=data.frame(x,y)
z[1,'NewVal']=z[1,'x']
So I ran this:
for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+(z$NewVal[i-1]*(z$y[i]/100))}
This takes considerably longer than I expected it to. Granted I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any of the functions from packages as I would like to know what the formula for creating the values is - or if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average just like rollmean in caTools. Either that or how do I figure out what their formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 +(c(0, y[-1]/100))))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
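The reason cumprod works is that each step of the loop multiplies the previous value by (1 + y[i]/100), so the i-th value is the starting value times the product of those factors up to i. A minimal sketch checking this on ten made-up values; start, loop_val and vec_val are names chosen for illustration:

```r
set.seed(1)
y <- rpois(10, 2)   # made-up percent changes
start <- 1          # starting value, playing the role of z[1, 'x']

# Loop form: new value = previous value + previous value * y[i] / 100
loop_val <- numeric(length(y))
loop_val[1] <- start
for (i in 2:length(y)) {
  loop_val[i] <- loop_val[i - 1] + loop_val[i - 1] * (y[i] / 100)
}

# Vectorised form: cumulative product of the growth factors,
# with a factor of 1 for the first element
vec_val <- start * cumprod(c(1, 1 + y[-1] / 100))

all.equal(loop_val, vec_val)
```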
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <- Reduce("*", c(1, 1 + z$y[-1]/100), accumulate = TRUE))
user system elapsed
0.139 0.008 0.148
> head(z, 10)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16
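For the moving-average part of the question, a package-free sketch is possible with cumsum: the sum of any window of k values is the difference between two cumulative sums. This is only an illustration of the arithmetic, not the implementation used by any package; moving_average is a hypothetical name, and it returns one value per complete window, with no padding at the ends:

```r
# Hypothetical helper: simple moving average over windows of k values
moving_average <- function(x, k) {
  cs <- cumsum(c(0, x))             # cs[i + 1] is the sum of x[1..i]
  n <- length(cs)
  # sum of x[j..(j + k - 1)] is cs[j + k] - cs[j]
  (cs[(k + 1):n] - cs[1:(n - k)]) / k
}

ma <- moving_average(c(1, 2, 3, 4, 5), 3)
ma
# [1] 2 3 4
```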

Resources