Data looks like this:
ID Lat Long Time
1 3 3 00:01
1 3 4 00:02
1 4 4 00:03
2 4 3 00:01
2 4 4 00:02
2 4 5 00:03
3 5 2 00:01
3 5 3 00:02
3 5 4 00:03
4 9 9 00:01
4 9 8 00:02
4 8 8 00:03
5 7 8 00:01
5 8 8 00:02
5 8 9 00:03
I want to measure how far apart the IDs are from each other, within a given radius, at each time interval. I am doing this for 1057 IDs across 16213 time intervals, so efficiency is important.
It is important to only measure distances between points within a radius because if the points are too far away I don't care about them. I am trying to measure distances between points that are relatively close. For example, I don't care how far ID 1 is from ID 5, but I do care how far ID 4 is from ID 5.
I am using R and the sp package.
From what I can see, the same coordinates will be repeated many times. Therefore, as a starting point, I would suggest calculating the distance for each pair of coordinates only once (even if they are repeated many times in the df). Then you can filter the data and merge the tables. (I would add this as a comment, but I don't have enough reputation to do so yet.)
The first lines would be:
library(dplyr)
library(geosphere)

# Create a data frame with each distinct coordinate pair appearing only once
df2 <- df %>% group_by(Lat, Long) %>% summarise()

# Distance matrix (in metres) between the distinct coordinates
Dist <- distm(cbind(df2$Long, df2$Lat))
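From there, one possible way to finish (a hedged sketch, not part of the original answer) is to give every distinct coordinate an index into Dist, attach that index back onto df, pair up the IDs present at each Time, and keep only the pairs whose precomputed distance falls within the radius. The radius value radius_m below is an assumed placeholder.

# Hedged sketch: look up precomputed distances per Time and filter by radius.
# `radius_m` is an assumed placeholder (in metres), not from the question.
radius_m <- 1000

df2$coord_id <- seq_len(nrow(df2))                        # row index into Dist
df_idx <- df %>% left_join(df2, by = c("Lat", "Long"))    # attach coord_id to every observation

pairs_within_radius <- df_idx %>%
  inner_join(df_idx, by = "Time") %>%                       # all ID pairs at the same Time
  filter(ID.x < ID.y) %>%                                   # keep each unordered pair once
  mutate(dist_m = Dist[cbind(coord_id.x, coord_id.y)]) %>%  # look up the precomputed distance
  filter(dist_m <= radius_m)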
I am working with a dataset that contains variables measured on permanent plots. These plots are remeasured every couple of years, and the data looks roughly like the table at the bottom. I used the code below to slice out the initial measurement at t1. Now I want to slice t2, the remeasurement one step after the minimum Cycle or minimum Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order:
df %>%
  group_by(State, County, Plot) %>%
  mutate(t = rank(Cycle))
You can then filter on t == 1 or t == 2, etc.
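For example, a hedged illustration of pulling out the second measurement (t2) for every plot, built on the mutate() call above:

# Hedged example: t == 2 picks the remeasurement that follows the minimum Cycle
df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = rank(Cycle)) %>%
  filter(t == 2)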
I have a data frame with a few columns. I would like to create a new column whose value depends on another column's value in the row above. For example, I have a data set like this:
Day Precipitation Condition
1   3             Wet
2   0             Dry
3   3             Wet
I would like the final product to look something like this:
Day Precipitation Condition Day Before
1   3             Wet
2   0             Dry       Wet
3   3             Wet       Dry
Any ideas on how I can do this?
You may try
library(dplyr)
df %>%
mutate('Day Before' = lag(Condition))
Day Precipitation Condition Day Before
1 1 3 Wet <NA>
2 2 0 Dry Wet
3 3 3 Wet Dry
Plain R approach:
To define the Day Before column, add NA to the beginning of the Condition column (df$Condition) and remove the last element of that vector (df$Condition[-length(df$Condition)]).
df$DayBefore <- c(NA,df$Condition[-length(df$Condition)])
df
Day Precipitation Condition DayBefore
1 1 3 Wet <NA>
2 2 0 Dry Wet
3 3 3 Wet Dry
I'm looking at daily bookings for a hotel room based on the days before arrival.
I think booking speed varies by day of week and by hotel (A vs. B), so I'd like to facet by these categories. However, when I facet (7 days x 2 hotels = 14 facets), everything gets divided by the total number of Date-Hotels rather than the number of Date-Hotels in each facet. That is, I have 1400 unique Date-Hotels, so everything is being divided by 1400 instead of approximately 100 when I facet. I'd like my code to divide by 97, 103, 101, etc., depending on how many Hotel-Dates I have in each facet, so each facet represents a "typical" booking pattern.
Here is my current data and code:
DaysBeforeArrival=rep(1:5,8)
Hotel=rep(LETTERS[1:2],20)
DayOfWeek=c(rep(1,10),rep(2,10),rep(1,10),rep(2,10))
Dates=c(rep("Jan-1",10),rep("Jan-2",10),rep("Jan-8",10),rep("Jan-9",10))
bookings=(sample(1:40))
Date_HotelID=paste(Hotel,Dates,sep="-")
mydf=data.frame(DaysBeforeArrival,Hotel,DayOfWeek,Dates,bookings,Date_HotelID)
library(ggplot2)
ggplot(mydf, aes(DaysBeforeArrival, bookings/length(unique(Date_HotelID)))) +
  geom_bar(stat = "identity") +
  facet_grid(DayOfWeek ~ Hotel)
Thanks!
Is this what you wanted to achieve?
library(ggplot2)
ggplot(mydf,aes(DaysBeforeArrival,bookings/length(unique(Date_HotelID))))+
geom_bar(stat = "identity") + facet_wrap(Hotel ~ DayOfWeek)
One approach is to simply calculate what you want to plot prior to making the graph. In your case, you'd just need to calculate the number of unique Date_HotelID for each DayOfWeek/Hotel combination, and then divide bookings by that value for each row.
For example, I might do this with functions from dplyr. Note I use n_distinct, which is the dplyr version of length(unique(...)).
library(dplyr)
mydf3 = mydf %>%
group_by(DayOfWeek, Hotel) %>%
mutate(book.speed = bookings/n_distinct(Date_HotelID))
mydf3
Source: local data frame [40 x 7]
Groups: DayOfWeek, Hotel [4]
DaysBeforeArrival Hotel DayOfWeek Dates bookings Date_HotelID book.speed
(int) (fctr) (dbl) (fctr) (int) (fctr) (dbl)
1 1 A 1 Jan-1 5 A-Jan-1 2.5
2 2 B 1 Jan-1 34 B-Jan-1 17.0
3 3 A 1 Jan-1 20 A-Jan-1 10.0
4 4 B 1 Jan-1 11 B-Jan-1 5.5
5 5 A 1 Jan-1 13 A-Jan-1 6.5
6 1 B 1 Jan-1 38 B-Jan-1 19.0
7 2 A 1 Jan-1 7 A-Jan-1 3.5
8 3 B 1 Jan-1 15 B-Jan-1 7.5
9 4 A 1 Jan-1 22 A-Jan-1 11.0
10 5 B 1 Jan-1 14 B-Jan-1 7.0
.. ... ... ... ... ... ... ...
Then just make your graph with the calculated data.
ggplot(mydf3, aes(DaysBeforeArrival, book.speed)) +
geom_bar(stat="identity") +
facet_grid(DayOfWeek ~ Hotel)
I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong: t increments by one per row, even though the rows are separated by varying numbers of days:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me this is the exact opposite of a linear trend.
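One quick way to see the issue (a hedged illustration, not from the original question) is to build t from the dates themselves, so that it reflects the actual calendar spacing:

# Hedged illustration: days elapsed since the first payment, assuming myDate is a Date
t_days <- as.numeric(d_exp_0014$myDate - min(d_exp_0014$myDate)) + 1
head(t_days)
# expected from the rows shown above: 1 2 6 7 11 12  (the gaps between dates now show up in t)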
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "days"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to regular (continuous) time series is a good idea.
You can use xts to transform the time-series data (it is handy because an xts object can be used in other packages like a regular ts).
Filling the gaps
library(xts)

# convert myDate to POSIXct if necessary
# create an xts series from data frame x (columns a = Amount, c = Count)
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), order.by = x$myDate)
ts1
# create an empty xts series with one observation slot per day
ts_empty <- xts( , seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty ts to the data and fill the gap with 0
ts2 <- merge( ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge( ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your other question they are very likely. I think you want to aggregate the values for each day, for example with a per-column sum:
ts1 <- period.apply(ts1, endpoints(ts1, 'days'), colSums)
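As a hedged follow-up sketch (not part of the original answer): once ts2 is a regular daily series, the time-trend regression from the question can use a plain row index for t, because consecutive rows are now exactly one day apart.

# Hedged sketch: rebuild the trend regression on the regularised series
d_reg   <- data.frame(Date = index(ts2), coredata(ts2))
d_reg$t <- seq_len(nrow(d_reg))            # one unit of t per calendar day
fit_t   <- lm(a ~ t + c, data = d_reg)     # `a` = Amount, `c` = Count, as named above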
How can I get the duration of the drawdowns in a zoo series?
The drawdowns can be calculated with cummax(mydata) - mydata. Whenever this value is above zero, I am in a drawdown.
A drawdown measures the decline from a historical peak (maximum), and it lasts until that peak is reached again.
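A minimal base-R sketch of the computation described above (a hedged illustration, assuming mydata is a numeric vector or zoo series of cumulative values):

library(zoo)

dd_depth  <- cummax(mydata) - mydata        # drawdown depth; > 0 means we are below the last peak
in_dd     <- as.vector(dd_depth > 0)        # TRUE while in a drawdown
runs      <- rle(in_dd)                     # run-length encode the indicator
durations <- runs$lengths[runs$values]      # lengths of the TRUE runs = drawdown durations
durations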
The PerformanceAnalytics package has several functions to do this operation.
> library(PerformanceAnalytics)
> data(edhec)
> dd <- findDrawdowns(edhec[,"Funds of Funds", drop=FALSE])
> dd$length
[1] 3 3 6 5 4 11 14 5 2 10 2 6 3 2 4 9 2 2 13 8 5 5 4 2 7
[26] 6 11 3 2 23
As a side note, if you have two dates in a time series and need to know the time between them, just use diff. You can also use the lubridate package.
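For instance, a small hedged illustration with diff on plain Date values:

dates <- as.Date(c("2024-01-01", "2024-03-15"))   # assumed example dates
diff(dates)
# Time difference of 74 days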