I have one file (location) that has x,y coordinates and a date/time identification. I want to pull information from a second table (weather) that has a similar date/time variable plus covariates (temperature and wind speed). The trick is that the date/times are not exactly the same in both tables; I want to select the weather record closest in time to each location record. I know I need some kind of loop, and that's about it.
Example location:

x y date/time
1 3 01/02/2003 18:00
2 3 01/02/2003 19:00
3 4 01/03/2003 23:00
2 5 01/04/2003 02:00

Example weather:

date/time temp wind
01/01/2003 13:00 12 15
01/02/2003 16:34 10 16
01/02/2003 20:55 14 22
01/02/2003 21:33 14 22
01/03/2003 00:22 13 19
01/03/2003 14:55 12 12
01/03/2003 18:00 10 12
01/03/2003 23:44 2 33
01/04/2003 01:55 6 22
So the final output would be a table with the "best" matched weather data joined to the location data:
x y datetime datetime temp wind
1 3 01/02/2003 18:00 ---- 01/02/2003 16:34 10 16
2 3 01/02/2003 19:00 ---- 01/02/2003 20:55 14 22
3 4 01/03/2003 23:00 ---- 01/03/2003 00:22 13 19
2 5 01/04/2003 02:00 ---- 01/04/2003 01:55 6 22
Any suggestions where to start? I am trying to do this in R
I needed to bring that data in as date and time separately and then paste and format:
location$dt.time <- as.POSIXct(paste(location$date, location$time),
                               format = "%m/%d/%Y %H:%M")
And do the same for weather.
Then, for each value of dt.time in location, find the entry in weather with the smallest absolute time difference:
sapply(location$dt.time, function(x) which.min(abs(difftime(x, weather$dt.time))))
# [1] 2 3 8 9
cbind(location, weather[ sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])
x y date time dt.time date time temp wind dt.time
2 1 3 01/02/2003 18:00 2003-01-02 18:00:00 01/02/2003 16:34 10 16 2003-01-02 16:34:00
3 2 3 01/02/2003 19:00 2003-01-02 19:00:00 01/02/2003 20:55 14 22 2003-01-02 20:55:00
8 3 4 01/03/2003 23:00 2003-01-03 23:00:00 01/03/2003 23:44 2 33 2003-01-03 23:44:00
9 2 5 01/04/2003 02:00 2003-01-04 02:00:00 01/04/2003 01:55 6 22 2003-01-04 01:55:00
cbind(location, weather[
sapply(location$dt.time,
function(x) which.min(abs(difftime(x, weather$dt.time)))), ])[ #pick columns
c(1,2,5,8,9,10)]
x y dt.time temp wind dt.time.1
2 1 3 2003-01-02 18:00:00 10 16 2003-01-02 16:34:00
3 2 3 2003-01-02 19:00:00 14 22 2003-01-02 20:55:00
8 3 4 2003-01-03 23:00:00 2 33 2003-01-03 23:44:00
9 2 5 2003-01-04 02:00:00 6 22 2003-01-04 01:55:00
My results differ a bit from yours, but another reader has already questioned whether your hand-matched example is correct.
One fast and short way may be to use data.table.
If you create two data.tables, X and Y, both with keys, then the syntax is:
X[Y,roll=TRUE]
We call that a rolling join because we roll the prevailing observation in X forward to match the row in Y. See the examples in ?data.table and the introduction vignette.
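For instance, applied to the example above (a minimal sketch; it assumes the location and weather data frames with the dt.time column built earlier):

library(data.table)
# keyed copies of the two tables
W <- data.table(weather, key = "dt.time")
L <- data.table(location, key = "dt.time")
# for each row of L, take the prevailing (most recent) row of W
W[L, roll = TRUE]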
Another way to do this is the zoo package, which has na.locf (last observation carried forward), and possibly other packages too.
I'm not sure whether you mean closest in terms of location or of time. If location, and that location is x,y coordinates, then you will need some distance measure in 2D space, I guess; data.table only does univariate 'closest', e.g. by time. Reading your question a second time, though, it does seem you mean closest in the prevailing (time) sense.
EDIT: I've seen the example data now. data.table won't do this in one step, because although it can roll forwards or backwards, it won't roll to the nearest. You could do it with an extra step using which=TRUE, then test whether the observation after the prevailing one was actually closer.
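(For later readers: more recent versions of data.table added roll = "nearest", which does roll to the closest value in one step; with the keyed tables from the sketch above, W[L, roll = "nearest"].)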
I have a csv with a 15-minute-interval time series covering several years. Example data format:
Time stamp Value
07/07/2003 08:00 10
07/07/2003 08:15 10
07/07/2003 08:30 10.5
07/07/2003 08:45 11
07/07/2003 09:00 13
07/07/2003 09:15 15
07/07/2003 09:30 14.5
07/07/2003 09:45 14
07/07/2003 10:00 10
07/07/2003 10:15 9
07/07/2003 10:30 8
07/07/2003 10:45 11
07/07/2003 11:00 12
07/07/2003 11:15 15
07/07/2003 11:30 13
07/07/2003 11:45 12
07/07/2003 12:00 10
I would like to read this into RStudio and plot a time series with the time stamp on the x axis and the value on the y axis.
The second part of the question is working out the number of times the value exceeds a certain threshold and then drops back below it. For example, the value is >= 12 a total of 8 times in the example data, occurring in 2 separate instances or groups within the time series. I am interested in the number of readings above the threshold, but the key is counting the number of such groupings in which the series stays above the threshold.
Here's how to plot the data. You need to transform Time_stamp to a date/time object using as.POSIXct:
df <- read.table(text=" Time_stamp Value
'07/07/2003 08:00' 10
'07/07/2003 08:15' 10
'07/07/2003 08:30' 10.5
'07/07/2003 08:45' 11
'07/07/2003 09:00' 13
'07/07/2003 09:15' 15
'07/07/2003 09:30' 14.5
'07/07/2003 09:45' 14
'07/07/2003 10:00' 10
'07/07/2003 10:15' 9
'07/07/2003 10:30' 8
'07/07/2003 10:45' 11
'07/07/2003 11:00' 12
'07/07/2003 11:15' 15
'07/07/2003 11:30' 13
'07/07/2003 11:45' 12
'07/07/2003 12:00' 10", header=TRUE, stringsAsFactors=FALSE)
df$Time_stamp <- as.POSIXct(df$Time_stamp, format="%m/%d/%Y %H:%M")
library(ggplot2)
ggplot(data=df, aes(x=Time_stamp, y=Value))+
geom_line()
And here's how to get the runs at or above 12, using dplyr and rleid from data.table:
library(dplyr)
library(data.table)
df %>%
  mutate(above = ifelse(Value < 12, NA, rleid(Value >= 12))) %>%
  na.omit() %>%
  mutate(above = rleid(above))
Time_stamp Value above
1 2003-07-07 09:00:00 13.0 1
2 2003-07-07 09:15:00 15.0 1
3 2003-07-07 09:30:00 14.5 1
4 2003-07-07 09:45:00 14.0 1
5 2003-07-07 11:00:00 12.0 2
6 2003-07-07 11:15:00 15.0 2
7 2003-07-07 11:30:00 13.0 2
8 2003-07-07 11:45:00 12.0 2
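If you only need the counts themselves, the same pipeline can end in summarise instead (a sketch using the same df; n_distinct is from dplyr):

df %>%
  mutate(above = ifelse(Value < 12, NA, rleid(Value >= 12))) %>%
  na.omit() %>%
  summarise(times_above  = n(),                 # readings at or above 12: 8
            groups_above = n_distinct(above))   # separate excursions: 2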
Similarly to a question posted here, I want to compute number of overlapping days between two periods conditional on a third variable (location).
For each observation of the main dataset (DF) I have a start date, an end date, and a location (character) variable. The Events data comprises each event's location, start date, and end date. Multiple events in the same location over (partially) overlapping periods are allowed.
Thus for each observation in DF the period must be compared to the periods in the event dataset (Events). This means the count of overlapping days between one observation (DF) and multiple periods (Events) must be done net of any days on which two (or more) Events periods themselves overlap.
An example of the data structure of my two data sources can be easily reproduced in R using this code (note that the location variable has been set to an integer for simplicity):
set.seed(1)
DF <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 20),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 20),
location = sample(seq(1:5)),20)
Events <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 30),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 30),
location = sample(seq(1:5)), 30 )
In the simple case in which the Events data reduces to only one event (and we do not care about the location), counting overlapping days for each observation in DF can be done easily with dplyr (code taken from Matthew Lundberg's answer here; note that I have created another data frame, One_event, with a single event):
library(dplyr)
One_event <- data.frame(
start = as.Date('2018-01-01'),
end = as.Date('2018-07-30'))
DF %>%
  mutate(overlap = pmax(pmin(One_event$end, end) - pmax(One_event$start, start) + 1, 0))
resulting in:
start end location X20 overlap
1 2018-02-01 2018-10-19 5 20 180 days
2 2018-02-14 2018-06-08 3 20 115 days
3 2018-03-09 2018-08-26 4 20 144 days
4 2018-04-17 2018-05-23 2 20 37 days
5 2018-01-24 2018-06-17 1 20 145 days
6 2018-04-14 2018-07-08 5 20 86 days
7 2018-04-18 2018-05-03 3 20 16 days
8 2018-03-16 2018-07-07 4 20 114 days
9 2018-03-12 2018-09-30 2 20 141 days
10 2018-01-07 2018-06-29 1 20 174 days
11 2018-01-23 2018-07-23 5 20 182 days
12 2018-01-20 2018-08-12 3 20 192 days
13 2018-04-23 2018-07-24 4 20 93 days
14 2018-02-11 2018-06-01 2 20 111 days
15 2018-03-23 2018-09-17 1 20 130 days
16 2018-02-22 2018-08-21 5 20 159 days
17 2018-04-24 2018-09-10 3 20 98 days
18 2018-04-13 2018-05-18 4 20 36 days
19 2018-02-08 2018-08-28 2 20 173 days
20 2018-03-20 2018-10-23 1 20 133 days
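Why the formula works: the overlap of two closed date intervals runs from the later of the two starts to the earlier of the two ends; adding 1 makes the day count inclusive of both endpoints, and the outer pmax(..., 0) floors non-overlapping (negative) spans at zero.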
Now back to the original problem.
To compare the period of each observation in DF with its matching event(s) by location, I think it would be reasonable to use apply, subset the Events dataset by the observation's location, and finally run mutate for each row against that subset of Events (temp):
apply(DF, 1, function(x) {
temp = Events[Events$location %in% x["location"], ]
x %>%
mutate(overlap = pmax(pmin(temp$end, end) - pmax(temp$start, start) +
1,0))
})
There are several issues with this last bit of code. First, it does not work, and gives an error message:
(Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character")
Second, it does not account for two (or more periods) overlapping in the Events dataset.
Are you looking for this?
apply(DF, MARGIN = 1, function(x) {
  Events[Events$location == x["location"], ] %>%
    mutate(overlap = pmax(pmin(.data$end, x["end"]) -
                            pmax(.data$start, x["start"])))
})
In my case this results in:
[[1]]
start end location X30 overlap
1 2018-02-01 2018-07-28 5 30 177 days
2 2018-04-14 2018-08-27 5 30 135 days
3 2018-01-23 2018-09-20 5 30 231 days
4 2018-02-22 2018-09-10 5 30 200 days
5 2018-04-04 2018-07-17 5 30 104 days
6 2018-02-06 2018-05-16 5 30 99 days
[[2]]
start end location X30 overlap
1 2018-01-24 2018-09-26 3 30 114 days
2 2018-01-07 2018-07-11 3 30 114 days
3 2018-03-23 2018-10-28 3 30 77 days
4 2018-03-20 2018-08-22 3 30 80 days
5 2018-01-26 2018-05-12 3 30 87 days
6 2018-01-31 2018-07-02 3 30 114 days
[[3]]
start end location X30 overlap
1 2018-03-09 2018-07-29 4 30 142 days
2 2018-03-16 2018-05-19 4 30 64 days
3 2018-04-23 2018-09-11 4 30 125 days
4 2018-04-13 2018-07-19 4 30 97 days
5 2018-03-05 2018-07-10 4 30 123 days
6 2018-02-05 2018-07-20 4 30 133 days
...
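Note that the approach above still double-counts days covered by more than one event. One way to count net of such overlaps (a sketch, assuming the Date columns from the example data) is to take the union of event days per location and intersect it with each DF row's days:

overlap_net <- sapply(seq_len(nrow(DF)), function(i) {
  ev <- Events[Events$location == DF$location[i], ]
  # union of all days covered by the matching events (duplicates dropped)
  ev_days <- unique(unlist(Map(function(s, e) seq(s, e, by = "day"),
                               ev$start, ev$end)))
  # this row's days, as numeric day codes to match what unlist() returns
  row_days <- as.numeric(seq(DF$start[i], DF$end[i], by = "day"))
  sum(row_days %in% ev_days)
})
DF$overlap <- overlap_net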
Working with the Rblpapi package, I receive a list of multiple data frames when requesting securities. (Equaling the number of securities requested)
My problem is the following one: Let's say:
I request daily data for A and B from 01.10.2016 to 31.10.2016.
Some data for A is missing during that time that B has,
and some data for B is missing that A has.
So basically:
list$A
date PX_LAST
1 2016-10-03 216.704
2 2016-10-04 217.245
3 2016-10-05 216.887
4 2016-10-06 217.164
5 2016-10-10 217.504
6 2016-10-11 217.022
7 2016-10-12 217.326
8 2016-10-13 216.219
9 2016-10-14 217.275
10 2016-10-17 216.751
11 2016-10-18 218.812
12 2016-10-19 219.682
13 2016-10-20 220.189
14 2016-10-21 220.930
15 2016-10-25 221.179
16 2016-10-26 219.840
17 2016-10-27 219.158
18 2016-10-31 217.820
list$B
date PX_LAST
1 2016-10-03 1722.82
2 2016-10-04 1717.82
3 2016-10-05 1721.14
4 2016-10-06 1718.40
5 2016-10-07 1712.40
6 2016-10-11 1700.33
7 2016-10-12 1695.54
8 2016-10-13 1689.62
9 2016-10-14 1693.71
10 2016-10-17 1687.84
11 2016-10-18 1701.10
12 2016-10-19 1706.74
13 2016-10-21 1701.16
14 2016-10-24 1706.24
15 2016-10-25 1701.20
16 2016-10-26 1699.92
17 2016-10-27 1694.66
18 2016-10-28 1690.96
19 2016-10-31 1690.92
As you can see, they have a different number of observations, and the dates are not equal either. For example, the 5th observation for A is on 2016-10-10 but for B is on 2016-10-07.
So what I need is a way to combine both data frames. My idea was a full date range (every day) to which I add the PX_LAST values of A and B on the corresponding dates. After that I could delete empty rows.
Sorry for bad formatting, this is my first post here.
Thanks in advance.
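A full outer join on date does exactly that (a minimal sketch; list$A and list$B as above, the suffixes are just for illustration):

merged <- merge(list$A, list$B, by = "date", all = TRUE,
                suffixes = c(".A", ".B"))
# optionally keep only dates on which both series have a value
complete <- merged[complete.cases(merged), ]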
I am using a Kaggle data set for bike sharing. I would like to write a script that compares my predicted values to the training data set, comparing the mean by month for each year.
The training data set, which I call df, looks like this:
datetime count
1 2011-01-01 00:00:00 16
2 2011-01-11 01:00:00 40
3 2011-02-01 02:00:00 32
4 2011-02-11 03:00:00 13
5 2011-03-21 04:00:00 1
6 2011-03-11 05:00:00 1
My predicted values, which I call sub, look like this:
datetime count
1 2011-01-01 00:00:00 42
2 2011-01-11 01:00:00 33
3 2011-02-01 02:00:00 33
4 2011-02-11 05:00:00 36
5 2011-03-21 06:00:00 57
6 2011-03-11 07:00:00 129
I have isolated the month and year using the lubridate package, then concatenated month-year as a new column. I used the new column to split, then used lapply to find the mean.
library(lubridate)
df$monyear <- interaction(
month(ymd_hms(df$datetime)),
year(ymd_hms(df$datetime)),
sep="-")
s<-split(df,df$monyear)
x <-lapply(s,function(x) colMeans(x[,c("count", "count")],na.rm=TRUE))
But this gives me the average for each month-year combination nested in a list, so it is not easy to compare. What I would like instead is:
year-month train-mean sub-mean diff
1 2011-01 28 37.5 9.5
2 2011-02 22.5 34.5 12
3 2011-03 1 93 92
Is there a better way to do this?
Something like this. For each of your data sets:
library(dplyr)
dftrain %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtrain
dftest %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtest
merged <- merge(xtrain, xtest, by="monyear")
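With the question's objects, df and sub play the roles of dftrain and dftest here (sub needs a monyear column built the same way as for df). merge suffixes the two mc columns as mc.x and mc.y, so the requested difference is simply:

merged$diff <- merged$mc.y - merged$mc.x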
So, here is my problem. I have a dataset of locations of radiotagged hummingbirds I’ve been following as part of my thesis. As you might imagine, they fly fast, so there were intervals when I lost track of where they were until I eventually found them again.
Now I am trying to identify the segments where the bird was followed continuously (i.e., the intervals between “Lost” periods).
ID Type TimeStart TimeEnd Limiter Starter Ender
1 Observed 6:45:00 6:45:00 NO Start End
2 Lost 6:45:00 5:31:00 YES NO NO
3 Observed 5:31:00 5:31:00 NO Start NO
4 Observed 9:48:00 9:48:00 NO NO NO
5 Observed 10:02:00 10:02:00 NO NO NO
6 Observed 10:18:00 10:18:00 NO NO NO
7 Observed 11:00:00 11:00:00 NO NO NO
8 Observed 13:15:00 13:15:00 NO NO NO
9 Observed 13:34:00 13:34:00 NO NO NO
10 Observed 13:43:00 13:43:00 NO NO NO
11 Observed 13:52:00 13:52:00 NO NO NO
12 Observed 14:25:00 14:25:00 NO NO NO
13 Observed 14:46:00 14:46:00 NO NO End
14 Lost 14:46:00 10:47:00 YES NO NO
15 Observed 10:47:00 10:47:00 NO Start NO
16 Observed 10:57:00 11:00:00 NO NO NO
17 Observed 11:10:00 11:10:00 NO NO NO
18 Observed 11:19:00 11:27:55 NO NO NO
19 Observed 11:28:05 11:32:00 NO NO NO
20 Observed 11:45:00 12:09:00 NO NO NO
21 Observed 11:51:00 11:51:00 NO NO NO
22 Observed 12:11:00 12:11:00 NO NO NO
23 Observed 13:15:00 13:15:00 NO NO End
24 Lost 13:15:00 7:53:00 YES NO NO
25 Observed 7:53:00 7:53:00 NO Start NO
26 Observed 8:48:00 8:48:00 NO NO NO
27 Observed 9:25:00 9:25:00 NO NO NO
28 Observed 9:26:00 9:26:00 NO NO NO
29 Observed 9:32:00 9:33:25 NO NO NO
30 Observed 9:33:35 9:33:35 NO NO NO
31 Observed 9:42:00 9:42:00 NO NO NO
32 Observed 9:44:00 9:44:00 NO NO NO
33 Observed 9:48:00 9:48:00 NO NO NO
34 Observed 9:48:30 9:48:30 NO NO NO
35 Observed 9:51:00 9:51:00 NO NO NO
36 Observed 9:54:00 9:54:00 NO NO NO
37 Observed 9:55:00 9:55:00 NO NO NO
38 Observed 9:57:00 10:01:00 NO NO NO
39 Observed 10:02:00 10:02:00 NO NO NO
40 Observed 10:04:00 10:04:00 NO NO NO
41 Observed 10:06:00 10:06:00 NO NO NO
42 Observed 10:20:00 10:33:00 NO NO NO
43 Observed 10:34:00 10:34:00 NO NO NO
44 Observed 10:39:00 10:39:00 NO NO End
Note: When there is a “Start” and an “End” in the same row it’s because the non-lost period consists only of that record.
I was able to identify the records that start or end these “non-lost” periods (under the columns “Starter” and “Ender”), but now I want to be able to identify those periods by giving them unique identifiers (period A,B,C or 1,2,3, etc).
Ideally, the name of the identifier would be the ID of the start row for that period (i.e., ID[Starter == "Start"]).
I'm looking for something like this:
ID Type TimeStart TimeEnd Limiter Starter Ender Period
1 Observed 6:45:00 6:45:00 NO Start End 1
2 Lost 6:45:00 5:31:00 YES NO NO Lost
3 Observed 5:31:00 5:31:00 NO Start NO 3
4 Observed 9:48:00 9:48:00 NO NO NO 3
5 Observed 10:02:00 10:02:00 NO NO NO 3
6 Observed 10:18:00 10:18:00 NO NO NO 3
7 Observed 11:00:00 11:00:00 NO NO NO 3
8 Observed 13:15:00 13:15:00 NO NO NO 3
9 Observed 13:34:00 13:34:00 NO NO NO 3
10 Observed 13:43:00 13:43:00 NO NO NO 3
11 Observed 13:52:00 13:52:00 NO NO NO 3
12 Observed 14:25:00 14:25:00 NO NO NO 3
13 Observed 14:46:00 14:46:00 NO NO End 3
14 Lost 14:46:00 10:47:00 YES NO NO Lost
15 Observed 10:47:00 10:47:00 NO Start NO 15
16 Observed 10:57:00 11:00:00 NO NO NO 15
17 Observed 11:10:00 11:10:00 NO NO NO 15
18 Observed 11:19:00 11:27:55 NO NO NO 15
19 Observed 11:28:05 11:32:00 NO NO NO 15
20 Observed 11:45:00 12:09:00 NO NO NO 15
21 Observed 11:51:00 11:51:00 NO NO NO 15
22 Observed 12:11:00 12:11:00 NO NO NO 15
23 Observed 13:15:00 13:15:00 NO NO End 15
24 Lost 13:15:00 7:53:00 YES NO NO Lost
Would this be too hard to do in R?
Thanks!
> d <- data.frame(Limiter = rep("NO", 44), Starter = rep("NO", 44), Ender = rep("NO", 44), stringsAsFactors = FALSE)
> d$Starter[c(1, 3, 15, 25)] <- "Start"
> d$Ender[c(1, 13, 23, 44)] <- "End"
> d$Limiter[c(2, 14, 24)] <- "Yes"
> d$Period <- ifelse(d$Limiter == "Yes", "Lost", which(d$Starter == "Start")[cumsum(d$Starter == "Start")])
> d
Limiter Starter Ender Period
1 NO Start End 1
2 Yes NO NO Lost
3 NO Start NO 3
4 NO NO NO 3
5 NO NO NO 3
6 NO NO NO 3
7 NO NO NO 3
8 NO NO NO 3
9 NO NO NO 3
10 NO NO NO 3
11 NO NO NO 3
12 NO NO NO 3
13 NO NO End 3
14 Yes NO NO Lost
15 NO Start NO 15
16 NO NO NO 15
17 NO NO NO 15
18 NO NO NO 15
19 NO NO NO 15
20 NO NO NO 15
21 NO NO NO 15
22 NO NO NO 15
23 NO NO End 15
24 Yes NO NO Lost
25 NO Start NO 25
26 NO NO NO 25
27 NO NO NO 25
28 NO NO NO 25
29 NO NO NO 25
30 NO NO NO 25
31 NO NO NO 25
32 NO NO NO 25
33 NO NO NO 25
34 NO NO NO 25
35 NO NO NO 25
36 NO NO NO 25
37 NO NO NO 25
38 NO NO NO 25
39 NO NO NO 25
40 NO NO NO 25
41 NO NO NO 25
42 NO NO NO 25
43 NO NO NO 25
44 NO NO End 25
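The Period line works because cumsum(d$Starter == "Start") is a running count of how many Start rows have been seen so far; using it to index which(d$Starter == "Start") returns the row ID of the most recent Start, and ifelse then overwrites the limiter rows with "Lost".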