Locate dates of evenly-spaced events - R

I wish to locate the dates of evenly-spaced events when given the number of events and the number of days in the period of interest. This seems like a trivial objective, but it is confusing me.
Here is a very simple example with a straightforward solution:
n.trips <- 5
n.days <- 20
mean.trips.per.day <- n.trips / n.days
cummulative.trips <- mean.trips.per.day * c(1:n.days)
cummulative.trips
#  [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75
# [16] 4.00 4.25 4.50 4.75 5.00
# Find the date of each trip
which(cummulative.trips %in% c(1:n.days))
#[1] 4 8 12 16 20
But the following example is not straightforward. Three possible solutions are shown, but none matches the desired result. In this example I am trying to pick out the locations of the six elements of the vector cummulative.trips that most closely match the integers 1:6. Those locations are shown in the vector desired.dates:
n.trips <- 6
n.days <- 17
# Here are the desired results
date.of.first.trip <- 3 # 1.0588235
date.of.second.trip <- 6 # 2.1176471
date.of.third.trip <- 8 # or 9: 2.8235294 3.1764706; 8 is the first
date.of.fourth.trip <- 11 # 3.8823529
date.of.fifth.trip <- 14 # 4.9411765
date.of.sixth.trip <- 17 # 6.0000000
desired.dates <- c(3,6,8,11,14,17)
mean.trips.per.day <- n.trips / n.days
cummulative.trips <- mean.trips.per.day * c(1:n.days)
cummulative.trips
#  [1] 0.3529412 0.7058824 1.0588235 1.4117647 1.7647059 2.1176471 2.4705882
#  [8] 2.8235294 3.1764706 3.5294118 3.8823529 4.2352941 4.5882353 4.9411765
# [15] 5.2941176 5.6470588 6.0000000
Here are three possible solutions I attempted:
# Find the date of each trip
which(cummulative.trips %in% c(1:n.days))
#[1] 17
which(round(cummulative.trips) %in% c(1:n.days))
#[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
round(seq(1, n.days, length = n.trips))
#[1] 1 4 7 11 14 17
EDIT
I tried this function suggested by MrFlick in a comment, but it simply returns a result that essentially matches the result of the first of three approaches I tried above for my second example.
What is the fastest way to check if a number is a positive natural number? (in R)
is.naturalnumber <-
function(x, tol = .Machine$double.eps^0.5) x > tol & abs(x - round(x)) < tol
x <- cummulative.trips
is.naturalnumber(x)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

Perhaps something like this will work
nearest_index <- function(targets, values) {
  # for each target, return the index of the closest observed value
  sapply(targets, function(x) which.min(abs(values - x)))
}
nearest_index(1:6, cummulative.trips)
# [1] 3 6 8 11 14 17
For each "target" value, we find the index of the observed value that minimizes the absolute difference from that target.

After checking @MrFlick's answer with a number of combinations of n.trips and n.days I discovered a scenario where his code did not return the answer I expected (n.trips <- 26; n.days <- 13). His code returned, assuming I used it correctly:
[1] 1 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13
But I was expecting:
[1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13
I probably should have explained my problem more clearly in my original post. I ended up writing the following for-loop and have tested it with the 10 combinations of n.trips and n.days listed below. So far this for-loop seems to return what I expect for all 10 combinations. This code does incorporate @MrFlick's approach, albeit in substantially modified form.
mean.trips.per.day <- n.trips / n.days
mean.trips.per.day
cummulative.trips.by.day <- mean.trips.per.day * c(1:n.days)
cummulative.trips.by.day
date.of.trip <- rep(0, n.trips)
for(i in 1:n.trips) {
  trip.candidate.days <- which(round(cummulative.trips.by.day) >= i)
  if(length(trip.candidate.days) > 0)
    date.of.trip[i] <- trip.candidate.days[which.min(abs(cummulative.trips.by.day[trip.candidate.days] - i))]
  # no dates round to a value >= i, which suggests there were at most i-1 trips
  if(length(trip.candidate.days) == 0) date.of.trip[i] <- 0
}
cummulative.trips.by.day
date.of.trip
Here are the 10 combinations of n.trips and n.days I have used so far to test this code; a sketch that wraps the loop into a reusable function follows the list.
n.trips <- 12
n.days <- 12
n.trips <- 6
n.days <- 12
n.trips <- 5
n.days <- 13
n.trips <- 26
n.days <- 13
n.trips <- 28
n.days <- 13
n.trips <- 20
n.days <- 13
n.trips <- 0
n.days <- 13
n.trips <- 1
n.days <- 13
n.trips <- 2
n.days <- 13
n.trips <- 100
n.days <- 23
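To make those checks easier to repeat, here is a minimal sketch (my own, not from the original post) that wraps the loop above into a function; the name trip.dates is hypothetical.
## Sketch: same logic as the for-loop above, packaged as a function.
trip.dates <- function(n.trips, n.days) {
  cummulative.trips.by.day <- (n.trips / n.days) * seq_len(n.days)
  date.of.trip <- rep(0, n.trips)   # days with no qualifying candidate stay 0
  for (i in seq_len(n.trips)) {
    trip.candidate.days <- which(round(cummulative.trips.by.day) >= i)
    if (length(trip.candidate.days) > 0) {
      date.of.trip[i] <- trip.candidate.days[
        which.min(abs(cummulative.trips.by.day[trip.candidate.days] - i))]
    }
  }
  date.of.trip
}
trip.dates(6, 17)   # expected (from the example above): 3 6 8 11 14 17
trip.dates(26, 13)  # expected (from the edit above): 1 1 2 2 3 3 ... 13 13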

Related

R vector: removing values conditioned on surrounding values

This may not be the best title, feel free to edit it.
x=c(NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA....)
or
x=c(NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA....)
y=c(seq(1:length(x)))
I would like z to be a new vector that is equal to y, except where NAs fall between a 0 and a 1 (not between a 1 and a 0); there it should repeat the value y takes at the position where x=0:
[1] 1 2 3 4 5 6 7 8 9 9 9 9 13 14 15 16 16 16 16 16 21 22 23 24 25 25
or
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25 26
depending on x
I really don't know how to translate this condition in R.
My solution is clunkier than @James's (now deleted) answer, but maybe (?) it's more flexible:
## identify strings of NAs preceded by 0
library(zoo)
na_following_zero <- na.locf(c(1,x))[-1]==0 & is.na(x)
## now identify the 'chunks' to reset
## (there may be a more elegant way to do this)
rr <- rle(na_following_zero)
startvals <- cumsum(c(0,rr$lengths))+1
endvals <- cumsum(rr$lengths)
values <- c(NA,y[startvals-1])
z <- y
## replace values in chunks
for (i in seq_along(rr$values)[rr$values])
z[startvals[i]:endvals[i]] <- values[i]
If time isn't prohibitive, you can just use a "for" loop:
z <- y
between.0.1 <- rep(FALSE, length(x))
for(i in 2:length(x)){
  if(!is.na(x[i-1]) && x[i-1]==0){ # switch on after a 0
    between.0.1[i] <- TRUE
    value.at.0 <- y[i-1]
    z[i] <- value.at.0
  }
  if(between.0.1[i-1]){ # if switched on, stay switched on
    between.0.1[i] <- TRUE
    z[i] <- value.at.0
  }
  if(!is.na(x[i]) && x[i]==1){ # switch off if at a 1
    between.0.1[i] <- FALSE
    value.at.0 <- NA
  }
}
z[between.0.1] # diagnostic check
Another approach:
y0 <- which(x==0)
y1<-which(x==1)
# need a kicker to start with first zero
y1<-y1[y1>y0[1]]
# and check for end of sequence
if(y1[length(y1)] < length(x)) y1[length(y1)+1] <- length(x)+1
#now y0 and y1 better be same length
z<-y
#now do a loop any way you want
for (jj in 1: length(y0) ) z[y0[jj]:(y1[jj]-1)]<-y[y0[jj]]
Rgames> z
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25
[26] 26

R ifelse condition with hourly data: frequency of continuously NA

With the help of sebastian-c, I figured out my problem with daily data. Please see: R ifelse condition: frequency of continuously NA
And now I have a data set with hourly data:
set.seed(1234)
day <- c(rep(1:2, each=24))
hr <- c(rep(0:23, 2))
v <- c(rep(NA, 48))
A <- data.frame(cbind(day, hr, v))
A$v <- sample(c(NA, rnorm(100)), nrow(A), prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
What I need to do is: if there are 4 or more consecutive missing day-hours (7AM-7PM) or 3 or more consecutive missing night-hours (7PM-7AM), I will delete the entire day from the data frame; otherwise I will just run linear interpolation. Thus, the second day should be entirely deleted from the data frame, since there are 4 consecutive NAs during day-time (7AM-10AM). The result should preferably remain a data frame. Please help, thank you!
If I modify the NA_run function from the question you linked to take a variable named v instead of value and return the boolean rather than the data.frame:
NA_run <- function(x, maxlen){
  runs <- rle(is.na(x$v))
  any(runs$lengths[runs$values] >= maxlen)
}
I can then write a wrapper function to call it twice for daytime and nighttime:
dropfun <- function(x) {
  dt <- x$hr > 7 & x$hr < 19
  daytime <- NA_run(x[dt,], 4)
  nighttime <- NA_run(x[!dt,], 3)
  any(daytime, nighttime)
}
Which gives me a data.frame of days to drop.
> ddply(A, .(day), dropfun)
day V1
1 1 TRUE
2 2 FALSE
>
We can alter the dropfun to return the dataframe instead though:
dropfun <- function(x) {
  dt <- x$hr > 7 & x$hr < 19
  daytime <- NA_run(x[dt,], 4)
  nighttime <- NA_run(x[!dt,], 3)
  if(any(daytime, nighttime)) NULL else x
}
> ddply(A, .(day), dropfun)
day hr v
1 2 0 NA
2 2 1 NA
3 2 2 2.54899107
4 2 3 NA
5 2 4 -0.03476039
6 2 5 NA
7 2 6 0.65658846
8 2 7 0.95949406
9 2 8 NA
10 2 9 1.08444118
11 2 10 0.95949406
12 2 11 NA
13 2 12 -1.80603126
14 2 13 NA
15 2 14 NA
16 2 15 0.97291675
17 2 16 NA
18 2 17 NA
19 2 18 NA
20 2 19 -0.29429386
21 2 20 0.87820363
22 2 21 NA
23 2 22 0.56305582
24 2 23 -0.11028549
>
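The interpolation step from the question is not shown above. Here is a possible sketch, assuming zoo::na.approx is an acceptable stand-in for "linear interpolation" and that interpolation should happen within each kept day:
library(plyr)
library(zoo)
kept <- ddply(A, .(day), dropfun)   # only the days that survive the NA-run check
# interpolate the remaining NAs within each day (rule = 2 also fills the end points)
kept$v <- ave(kept$v, kept$day,
              FUN = function(v) na.approx(v, na.rm = FALSE, rule = 2))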

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows where a unique element of df$time is not present for each of the levels of df$site?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way:
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, so that the final result should be empty (as Ben pointed out in comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution I put above. Rather than looking for a run length of 1, it deletes rows that do not match certain conditions on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping timestamps that at least two sites share, we could change the threshold >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site)) - 1 (obviously, 2).
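Spelled out, the relaxed threshold would be (same code as above, only the cutoff changes):
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
                               >= length(levels(df$site)) - 1)))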
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.

Quickly remove zero variance variables from a data.frame

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
A=factor(rep("X",10),levels=c('X','Y')),
B=round(runif(10)*10),
C=rep(10,10),
D=c(rep(10,9),1),
E=factor(rep("A",10)),
F=factor(rep(c("I","J"),5)),
G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out == 1)
}
And here's the result of the process:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
You may also want to look into the nearZeroVar() function in the caret package.
If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
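For example (a minimal sketch on the dat from the question, assuming caret is installed):
library(caret)
nzv <- nearZeroVar(dat, saveMetrics = TRUE)  # per-column metrics, including a zeroVar flag
dat[, !nzv$zeroVar, drop = FALSE]            # keep only columns that are not zero-variance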
Don't use table() - very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
  out <- lapply(dat, function(x) length(unique(x)))
  want <- which(!out > 1)
  unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
Which is an order magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.
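One rough way to check that (the sizes below are arbitrary and only for illustration):
set.seed(2)
big <- as.data.frame(replicate(500, sample(c(0, 1), 1e4, replace = TRUE)))
big$const <- 1  # one known zero-variance column
system.time(zeroVar(big))
system.time(foo(big))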
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
It will be TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance
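Depending on how columns containing NAs should be treated, the logical vector can then be used like this (my own suggestion, not part of the original answer):
df[, which(!var0)]       # keep only columns known to have non-zero variance
df[, !(var0 %in% TRUE)]  # additionally keep columns whose variance is NA because of NAs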
Use the caret package and the function nearZeroVar
require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ]
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]
Well, save yourself some coding time:
Rgames: foo
[,1] [,2] [,3]
[1,] 1 1e+00 1
[2,] 1 2e+00 1
[3,] 1 3e+00 1
[4,] 1 4e+00 1
[5,] 1 5e+00 1
[6,] 1 6e+00 2
[7,] 1 7e+00 3
[8,] 1 8e+00 1
[9,] 1 9e+00 1
[10,] 1 1e+01 1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
To avoid nasty floating-point round-off, take that output vector, which I'll call "bar", and do something like bar[bar < 2*.Machine$double.eps] <- 0; then your data frame dat[, as.logical(bar)] should do the trick.
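Put together, that recipe might look like this (data.matrix() is my addition, to coerce factor columns before taking sd):
bar <- apply(data.matrix(dat), 2, sd, na.rm = TRUE)  # per-column standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0              # zero out floating-point noise
dat[, as.logical(bar)]                               # keep columns with non-zero sd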
How about using factor to count the number of unique elements and looping with sapply:
dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
NAs are excluded by default, but this can be changed with the exclude parameter of factor:
dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on:
library(tidyverse)
df <- df %>%
  select(
    - {
      df %>%
        map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
        {which(. == 1)} %>%
        names()
    }
  )
I think this could be made shorter but I'm too tired!
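One shorter variant (my own suggestion, keeping the same useNA rule as above):
df <- df %>%
  select(where(function(x) length(table(x, useNA = "ifany")) > 1))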
I think having zero variance is equivalent to being constant, and one can get by without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:
removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
        , ifelse(all(bkeep)
                 , 'nothing'
                 , paste(names(a_dataframe)[!bkeep], collapse=','))
        , ' removed', '\n')
  }
  return (a_dataframe[, bkeep])
}
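Applied to the example dat from the question, this should keep B, D and F (G is dropped too, because range() is taken with na.rm = TRUE):
removeConstantColumns(dat, verbose = TRUE)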
Check this custom function. I did not try it on data frames with 100+ variables.
library(dplyr)
library(purrr)
library(tidyr)
remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() # see how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>%
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()
  if(length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  } else {
    print("There are no low variance columns with this threshold")
  }
  # How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}

Quickly generate the cartesian product of a matrix

Let's say I have a matrix x which contains 10 rows and 2 columns. I want to generate a new matrix M that contains each unique pair of rows from x - that is, a new matrix with 55 rows and 4 columns.
E.g.,
x <- matrix (nrow=10, ncol=2, 1:20)
M <- data.frame(matrix(ncol=4, nrow=55))
k <- 1
for (i in 1:nrow(x))
  for (j in i:nrow(x))
  {
    M[k,] <- unlist(cbind(x[i,], x[j,]))
    k <- k + 1
  }
So, x is:
[,1] [,2]
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
And then M has 4 columns, the first two are one row from x and the next 2 are another row from x:
> head(M,10)
X1 X2 X3 X4
1 1 11 1 11
2 1 11 2 12
3 1 11 3 13
4 1 11 4 14
5 1 11 5 15
6 1 11 6 16
7 1 11 7 17
8 1 11 8 18
9 1 11 9 19
10 1 11 10 20
Is there either a faster or simpler (or both) way of doing this in R?
The expand.grid() function is useful for this:
R> GG <- expand.grid(1:10,1:10)
R> GG <- GG[GG[,1]>=GG[,2],] # trim it to your 55 pairs
R> dim(GG)
[1] 55 2
R> head(GG)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
R>
Now you have the n*(n+1)/2 subsets and you can simply index your original matrix.
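For example, the indexing step would look like this (the column order is my choice):
M2 <- cbind(x[GG[, 2], ], x[GG[, 1], ])  # 55 x 4: each unique pair of rows from x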
I'm not quite grokking what you are doing, so I'll just throw out something that may or may not help.
Here's what I think of as the Cartesian product of the two columns:
expand.grid(x[,1],x[,2])
You can also try the "relations" package. Here is the vignette. It should work like this:
relation_table(x %><% x)
Using Dirk's answer:
idx <- expand.grid(1:nrow(x), 1:nrow(x))
idx<-idx[idx[,1] >= idx[,2],]
N <- cbind(x[idx[,2],], x[idx[,1],])
> all(M == N)
[1] TRUE
Thanks everyone!
Inspired by the other answers, here is a function implementing the Cartesian product of two matrices: with two arguments it returns the full Cartesian product; with only one argument it omits one of each pair:
cartesian_prod <- function(M1, M2) {
  if (missing(M2)) {
    M2 <- M1
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
    ind <- ind[ind[, 1] >= ind[, 2], ]
  } else {
    ind <- expand.grid(1:NROW(M1), 1:NROW(M2))
  }
  rbind(cbind(M1[ind[, 1], ], M2[ind[, 2], ]))
}
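Example use, continuing with the x from the question:
N1 <- cartesian_prod(x)     # one argument: the 55 unique row pairs
N2 <- cartesian_prod(x, x)  # two arguments: the full 100-row Cartesian product
dim(N1)  # 55 4
dim(N2)  # 100 4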