Subset timeseries (date sequence) into a list - r

I have a data frame with a series of dates. Here is a simplified version of it:
> eventdates
dr.rank dr.start dr.end
1 14 1964-09-30 1964-10-06
2 16 1964-11-01 1964-12-24
I also have a time series of dates with associated values. Here is a much simplified version of it:
ts1964 <- data.frame(DATE = seq(from = as.Date("1964-01-01"), to = as.Date("1964-12-31"), by = "days"),
                     Q = 1:366)
What I am trying to do is subset the time series by each date range in eventdates, i.e.:
> filter(ts1964, DATE >= eventdates[1,2] & DATE <= eventdates[1,3])
        DATE   Q
1 1964-09-30 274
2 1964-10-01 275
3 1964-10-02 276
4 1964-10-03 277
5 1964-10-04 278
6 1964-10-05 279
7 1964-10-06 280
But I need to do this hundreds of times. What I would like is for each subset to form an element of a list. I would normally consider something like dlply from plyr, but that isn't an option since I'm using dplyr. Could anyone advise on how I might achieve this otherwise? Thanks

We can use Map, which calls the function once per pair of start and end dates and returns the subsets as a list:
Map(function(x, y) filter(ts1964, DATE >= x & DATE <= y),
    eventdates$dr.start, eventdates$dr.end)
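As a minimal, self-contained sketch of the Map approach, the example below rebuilds the two data frames from the question and uses base R subsetting instead of dplyr's filter so that it runs without extra packages. The list names (rank_14, rank_16) are an illustrative assumption and not part of the original answer.

```r
# Data from the question
ts1964 <- data.frame(
  DATE = seq(as.Date("1964-01-01"), as.Date("1964-12-31"), by = "days"),
  Q = 1:366
)
eventdates <- data.frame(
  dr.rank  = c(14, 16),
  dr.start = as.Date(c("1964-09-30", "1964-11-01")),
  dr.end   = as.Date(c("1964-10-06", "1964-12-24"))
)

# One list element per event: rows whose DATE lies within [start, end]
subsets <- Map(
  function(start, end) ts1964[ts1964$DATE >= start & ts1964$DATE <= end, ],
  eventdates$dr.start, eventdates$dr.end
)

# Hypothetical naming scheme so elements can be looked up by rank
names(subsets) <- paste0("rank_", eventdates$dr.rank)

# Number of days covered by each event
sapply(subsets, nrow)
```

Naming the elements is optional, but it makes it easier to trace each subset back to its row in eventdates when there are hundreds of them.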

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
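To illustrate the difference, here is a small sketch of the window-like case using mapply, next to the "to the bottom of the frame" case. The toy values and the window width k are assumptions chosen only for this example.

```r
# Toy data (hypothetical values, not the data used above)
toy <- data.frame(val = c(5, 3, 8, 1, 9, 4))
n <- nrow(toy)

# Min of each row and all following rows
to_end <- rev(cummin(rev(toy$val)))

# Min over a forward-looking window of at most k rows (k is an assumed width)
k <- 3
starts <- seq_len(n)
ends <- pmin(starts + k - 1, n)  # clip the window at the last row
windowed <- mapply(function(a, b) min(toy$val[a:b]), starts, ends)

to_end    # 1 1 1 1 4 4
windowed  # 3 1 1 1 4 4
```

Here mapply is needed because both the lower and the upper bound change from row to row, which sapply over a single index cannot express as directly.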
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

Create a vector from a specific sequence of intervals

I have 20 intervals:
10 intervals from 1 to 250 of size 25:
[1,25] [26,50] [51,75] [76,100] [101,125] [126,150] ... [226,250]
10 intervals from 251 to 1000 of size 75:
[251,325] [326,400] [401,475] [476,550] [551,625] ... [926,1000]
I would like to create a vector composed of the first 5 elements of each interval like:
(1,2,3,4,5, 26,27,28,29,30, 51,52,53,54,55, 76,77,78,79,80, ....,
251,252,253,254,255, 326,327,328,329,330, ...)
How can I create this vector using R?
Let's assume you have the starting points of the two sets of intervals:
interval1 <- seq(1, 226, 25)
interval2 <- seq(251, 1000, 75)
We can combine the two and then use mapply to create a sequence of 5 values from each starting point:
new_interval <- c(interval1, interval2)
c(mapply(`:`, new_interval, new_interval + 4))
#[1] 1 2 3 4 5 26 27 28 29 30 51 52 53 54 .....
#[89] ..... 779 780 851 852 853 854 855 926 927 928 929 930
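As a cross-check, the same vector can be sketched without mapply by repeating each starting point five times and adding the offsets 0 to 4. The variable names below are illustrative assumptions.

```r
# Starting points of the 20 intervals (10 of width 25, then 10 of width 75)
starts <- c(seq(1, 226, 25), seq(251, 926, 75))

# Repeat each start 5 times and add 0..4 (the offsets are recycled)
first_five <- rep(starts, each = 5) + 0:4

# Same result as the mapply version
via_mapply <- c(mapply(`:`, starts, starts + 4))

head(first_five, 10)
tail(first_five, 5)
```

Both forms give 100 values; the rep() version relies on the recycling of 0:4, which works because every block has the same length.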

How to ask R not to combine the X axis values for a bar chart?

I am a beginner with R. My data looks like this:
id count date
1 210 2009.01
2 400 2009.02
3 463 2009.03
4 465 2009.04
5 509 2009.05
6 861 2009.06
7 872 2009.07
8 886 2009.08
9 725 2009.09
10 687 2009.10
11 762 2009.11
12 748 2009.12
13 678 2010.01
14 699 2010.02
15 860 2010.03
16 708 2010.04
17 709 2010.05
18 770 2010.06
19 784 2010.07
20 694 2010.08
21 669 2010.09
22 689 2010.10
23 568 2010.11
24 584 2010.12
25 592 2011.01
26 548 2011.02
27 683 2011.03
28 675 2011.04
29 824 2011.05
30 637 2011.06
31 700 2011.07
32 724 2011.08
33 629 2011.09
34 446 2011.10
35 458 2011.11
36 421 2011.12
37 459 2012.01
38 256 2012.02
39 341 2012.03
40 284 2012.04
41 321 2012.05
42 404 2012.06
43 418 2012.07
44 520 2012.08
45 546 2012.09
46 548 2012.10
47 781 2012.11
48 704 2012.12
49 765 2013.01
50 571 2013.02
51 371 2013.03
I would like to make a bar-chart-like graph that shows the count for each date (with dates formatted as Month-Year, e.g. Jan-2009). I have two issues:
1- I cannot find a good format for a bar-chart-like graph like that
2- I want all of my data points to be present on the X axis (date), while R aggregates them by year only (so I only have four data points there). Below is the current command that I am using:
plot(df$date,df$domain_count,col="red",type="h")
and my current plot is like this:
Ok, I see some issues in your original data. May I suggest the following:
Add the day to your date column (formatting to two decimals first, so that a numeric 2009.10 is not shortened to "2009.1"):
df$date <- paste0(sprintf("%.2f", as.numeric(as.character(df$date))), ".01")
Convert the date column to Date type:
df$date <- as.Date(df$date, format = "%Y.%m.%d")
Plot the data again:
plot(df$date, df$domain_count, col = "red", type = "h")
Also, one more suggestion: have you tried ggplot2 for plotting charts? I think you will find it much easier, and the charts look better. Your example could be visualized like this:
library(ggplot2) #if you don't have the package, run install.packages('ggplot2')
ggplot(df,aes(date, count))+geom_bar(stat='identity')+labs(x="Date", y="Count")
First, you should transform your date column into a real date:
library(plyr) # for mutate
d <- mutate(d,
            ym = sprintf("%.2f", as.numeric(as.character(date))), # keep the trailing zero of e.g. 2009.10
            month = as.numeric(sub("[0-9]*\\.([0-9]*)", "\\1", ym)),
            year = as.numeric(sub("([0-9]*)\\.[0-9]*", "\\1", ym)),
            Date = ISOdate(year, month, 1))
Then, you could use ggplot to create a decent barchart:
library(ggplot2)
ggplot(d, aes(x = Date, y = count)) + geom_bar(fill = "red", stat = "identity")
You can also use base R to create a bar chart, although the result is less nice:
dd <- setNames(d$count, format(d$Date, "%m-%Y"))
barplot(dd)
The former plot shows you the "holes" in your data, i.e. months where there is no count, while in the latter it is quite difficult to see which bar corresponds to which month (though this could presumably be tweaked).
Hope that helps.
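One caveat worth checking when converting year.month values like these: if the column was read as numeric, 2009.10 is stored as 2009.1, and a naive conversion would then treat October as January. The sketch below shows the pitfall and one way around it, assuming the values always have the form YYYY.MM; the variable names are illustrative.

```r
# Year.month values stored as numbers (as read.table would typically read them)
date_num <- c(2009.01, 2009.10, 2010.12)

# The trailing zero of October is lost when converting to character
as.character(date_num)

# Formatting with two decimals restores the month as two digits
ym <- sprintf("%.2f", date_num)

# Append a day and parse as Date
dates <- as.Date(paste0(ym, ".01"), format = "%Y.%m.%d")
dates

# Month-Year labels for the axis; month abbreviations depend on the locale
format(dates, "%b-%Y")
```

Reading the column as character in the first place (for example with colClasses in read.table) would avoid the problem altogether.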

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get ~900 samples per bin on average
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin contains about 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole data frame.
I also tried the Hmisc package, but unfortunately the bins don't contain equal frequencies:
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var = runif(500, 0, 100))  # make data
cut.vec <- cut(
  df$var,
  breaks = quantile(df$var, 0:50/50),  # breaks along 1/50 quantiles
  include.lowest = TRUE
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
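To see that quantile-based breaks do give equal frequencies, here is a small sketch on made-up data with a second column, so the whole data frame is split and not just the binned variable. The data, the seed and the number of bins are assumptions for illustration; with heavily tied values the quantiles can coincide and the bins would no longer be equal.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Toy data frame: the variable to bin plus one extra column
toy <- data.frame(val = runif(500, 0, 250), id = seq_len(500))

n_bins <- 10
breaks <- quantile(toy$val, probs = seq(0, 1, length.out = n_bins + 1))

# include.lowest = TRUE keeps the minimum value in the first bin
bins <- cut(toy$val, breaks = breaks, include.lowest = TRUE)

# Split the whole data frame by bin and count rows per bin
toy_split <- split(toy, bins)
sapply(toy_split, nrow)
```

With 500 distinct values and 10 bins, each element of toy_split holds 50 rows, while the bin widths vary with the density of the data.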
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n) {
  nx <- length(x)
  nrepl <- floor(nx / n)                 # base number of samples per bin
  nplus <- sample(1:n, nx - nrepl * n)   # bins that receive one extra sample
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep)   # replace sorted values by their bin index
  x
}
data <- split(df2, EqualFreq2(df2$val, 25))

How to find the distance to nearest non-overlapping element?

I have a table like the one below, where each cluster (column 1) contains annotations of different elements (column 4) in small regions with a start (column 2) and an end (column 3) coordinate. For each entry, I would like to add a column corresponding to the distance to the nearest other element in that cluster. But I want to exclude cases where a pair of elements in the cluster have identical start/end coordinates or overlapping regions. How can I add such a nearest_distance column to this data frame?
cluster-47593-walk-0125 252 306 AR
cluster-47593-walk-0125 6 23 ZNF148
cluster-47593-walk-0125 357 381 CEBPA
cluster-47593-walk-0125 263 276 CEBPB
cluster-47593-walk-0125 246 324 NR3C1
cluster-47593-walk-0125 139 170 HMGA1
cluster-47593-walk-0125 139 170 HMGA2
cluster-47593-walk-0125 207 227 IRF8
cluster-47593-walk-0125 207 227 IRF1
cluster-47593-walk-0125 207 245 IRF2
cluster-47593-walk-0125 207 227 IRF3
cluster-47593-walk-0125 207 227 IRF4
cluster-47593-walk-0125 207 227 IRF5
cluster-47593-walk-0125 207 227 IRF6
cluster-47593-walk-0125 204 245 IRF7
cluster-47593-walk-0125 13 36 PATZ1
cluster-47593-walk-0125 14 143 PAX4
cluster-47593-walk-0125 4 25 RREB1
cluster-47593-walk-0125 73 87 SMAD1
cluster-47593-walk-0125 73 87 SMAD2
cluster-47593-walk-0125 73 87 SMAD3
cluster-47593-walk-0125 71 89 SMAD4
cluster-47593-walk-0125 11 40 SP1
cluster-47593-walk-0125 11 38 SP2
cluster-47593-walk-0125 7 38 SP3
cluster-47593-walk-0125 11 38 SP4
cluster-47593-walk-0125 13 33 GTF2I
cluster-47593-walk-0125 281 352 YY1
cluster-47586-walk-0222 252 306 AR
cluster-47586-walk-0222 6 23 ZNF148
[...]
First, some column names
names(data) <- c("cluster", "start", "end", "element")
data
cluster start end element
1 cluster-47593-walk-0125 252 306 AR
2 cluster-47593-walk-0125 6 23 ZNF148
3 cluster-47593-walk-0125 357 381 CEBPA
4 cluster-47593-walk-0125 263 276 CEBPB
Now create the new column:
data$nearest_distance <- apply(data, 1, function(x) {
  # apply() passes each row as a character vector, hence as.numeric()
  cluster <- x[1]
  start <- as.numeric(x[2])
  end <- as.numeric(x[3])
  elem <- x[4]
  # other elements of the same cluster that do not overlap the current one
  posb <- data[data$cluster == cluster & data$element != elem &
                 ((data$start > end) | (data$end < start)), ]
  startDist <- as.matrix(dist(c(end, posb$start)))[, 1]
  endDist <- as.matrix(dist(c(start, posb$end)))[, 1]
  best.dist <- min(startDist[startDist > 0], endDist[endDist > 0])
  return(best.dist)
})
I don't really like the beginning of the function, but I couldn't come up with a better solution. So we have
cluster start end element nearest_distance
1 cluster-47593-walk-0125 252 306 AR 7
2 cluster-47593-walk-0125 6 23 ZNF148 48
3 cluster-47593-walk-0125 357 381 CEBPA 5
4 cluster-47593-walk-0125 263 276 CEBPB 5
5 cluster-47593-walk-0125 246 324 NR3C1 1
.....
Edit: after fixing the system.time() test, it turned out that this is a very inefficient approach. It is redundant to compute the whole dist() matrix, so we can change these two lines to
startDist <- abs(end - posb$start)
endDist <- abs(start - posb$end)
Another minor change is that we can drop the constraint data$element != elem, because the > 0 filter is applied later anyway. Testing this function on 1 000 clusters with 30 rows each took more than three minutes. The subsetting problem remains, so I tried splitting the data into a list, which lets us use matrices instead of data frames (since the cluster constraint disappears) and improves efficiency too. This time we have 10 000 clusters with 30 rows each
data <- data[rep(1:30, each = 10000), ]
data$cluster <- factor(rep(1:10000, 30))
spl <- split(data[, c(2:3)], data$cluster)
spl <- lapply(spl, data.matrix)
system.time({
  x = lapply(spl, function(z) {
    apply(z, 1, function(x) {
      start <- x[1]
      end <- x[2]
      posb <- z[z[,1] > end | z[,2] < start, , drop = FALSE]
      startDist <- abs(end-posb[, 1])
      endDist <- abs(start-posb[, 2])
      best.dist <- min(startDist[startDist > 0], endDist[endDist > 0])
      return(best.dist)
    })
  })
})
#    user  system elapsed
#   18.16    0.00   18.35
data$nearest_distance = unsplit(x, data$cluster)
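The distance logic can also be written as a single gap computation per row. This is only a sketch of an alternative formulation, not the method used above: nearest_gap is a hypothetical helper, it expects a numeric matrix with columns named start and end for one cluster, and it returns NA when no non-overlapping element exists. The toy matrix uses the first five rows of the question's table.

```r
# Hypothetical helper: distance from each interval to the nearest
# non-overlapping interval in the same matrix (one cluster)
nearest_gap <- function(z) {
  vapply(seq_len(nrow(z)), function(i) {
    # Positive gap = the other interval lies entirely before or after row i;
    # zero or negative = overlapping or touching, which is excluded
    gap <- pmax(z[, "start"] - z[i, "end"], z[i, "start"] - z[, "end"])
    ok <- gap > 0
    if (any(ok)) min(gap[ok]) else NA_real_
  }, numeric(1))
}

# First five rows of the example cluster: AR, ZNF148, CEBPA, CEBPB, NR3C1
z <- cbind(start = c(252, 6, 357, 263, 246),
           end   = c(306, 23, 381, 276, 324))

nearest <- nearest_gap(z)
nearest  # 51 223 33 81 33
```

These values differ from the table above because only five of the cluster's rows are used here. To apply the helper per cluster, it could be passed to lapply over the split list in place of the inner apply call.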
