Subset data frame by ID but within 7 days - r

I have data frame with two variables ID and arrival. Here is head of my data frame:
head(sun_2)
Source: local data frame [6 x 2]
ID arrival
(chr) (date)
1 027506905 01.01.15
2 042363988 01.01.15
3 026050529 01.01.15
4 028375072 01.01.15
5 055384859 01.01.15
6 026934233 01.01.15
How could I subset the data by ID to keep only arrivals within 7 days?

As several others have noted, without more information (for example, what the original observation looks like) we can't get at exactly what your issue is without making some assumptions.
I assumed that you have a column of data that indicates the original date, and that both date columns are of class Date.
# generate data
Data <- data.frame(ID = as.character(1394:2394),
                   arrival = sample(seq(as.Date("2015/01/01"), as.Date("2016/01/01"), by = "day"),
                                    1001, replace = TRUE))
# make the "original observation" variable
delta_times <- sample(3:10, 1001, replace = TRUE)
Data$First <- Data$arrival - delta_times
This gives me a data set that looks like this:
ID arrival First
1 1394 2015-11-06 2015-10-28
2 1395 2015-08-04 2015-07-26
3 1396 2015-04-19 2015-04-16
4 1397 2015-05-13 2015-05-03
5 1398 2015-07-18 2015-07-11
6 1399 2015-01-08 2015-01-03
If that is the case, then the solution is to use difftime(), like so:
# Now we need to make a subsetting variable
Data$diff_times <- difftime(Data$arrival, Data$First, units = "days")
Data$diff_times
within_7 <- subset(Data, diff_times <= 7)
max(within_7$diff_times)
Time difference of 7 days

It's a bit difficult to be sure given the information you've provided, but I think you could do it like this:
library(dplyr)
dt %>% group_by(ID) %>% filter(arrival < min(arrival) + 7)
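To make the behavior of that filter concrete, here is a small reproducible sketch; the IDs and dates below are invented purely for illustration:

```r
library(dplyr)

# Hypothetical data in the shape of the question
dt <- data.frame(
  ID = c("A", "A", "A", "B", "B"),
  arrival = as.Date(c("2015-01-01", "2015-01-05", "2015-01-20",
                      "2015-02-01", "2015-02-03"))
)

# Within each ID, keep only arrivals in the 7 days starting at that
# ID's first arrival
result <- dt %>%
  group_by(ID) %>%
  filter(arrival < min(arrival) + 7)
```

Here A's 2015-01-20 arrival is dropped because it falls outside A's first week, while both of B's rows are kept.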

Related

How do I change values from xx mm to just xx in R?

I'm working with a dataset about weather where one column contains the amount of rain in mm. The problem is it hasn't been recorded in a consistent format: some rows contain just numbers, while others contain numbers plus "mm". It looks something like this:
Date Rain
1 2014-12-08 10mm
2 2014-12-09 3
3 2014-12-10 5mm
4 2014-12-11 0
5 2014-12-12 11
Is there any way to delete the "mm" part so that I only keep the numbers?
Ideally it should look like this:
Date Rain
1 2014-12-08 10
2 2014-12-09 3
3 2014-12-10 5
4 2014-12-11 0
5 2014-12-12 11
The only way I know to do this now is one value at a time, e.g. weather_data[weather_data == "10mm"] <- 10; weather_data[weather_data == "5mm"] <- 5; and so on. But since it is a very large dataset spanning several years, this would take a lot of time, and I hoped to find an easier and quicker way.
We could use parse_number() from readr to extract the digits and convert to numeric class:
library(dplyr)
library(readr)
df1 <- df1 %>%
  mutate(Rain = parse_number(Rain))
Or use a regex option to remove the 'mm' and convert to numeric
library(stringr)
df1 <- df1 %>%
  mutate(Rain = as.numeric(str_remove(Rain, "mm")))
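A base R version of the same idea is also possible; this sketch assumes the Rain column is stored as character, as in the question:

```r
# Toy data matching the question's shape
df1 <- data.frame(Date = as.Date("2014-12-08") + 0:4,
                  Rain = c("10mm", "3", "5mm", "0", "11"),
                  stringsAsFactors = FALSE)

# Strip a trailing "mm" where present, then convert to numeric
df1$Rain <- as.numeric(sub("mm$", "", df1$Rain))
```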

Duplicate rows to expand date into week?

I have a data frame a that I'm trying to merge with data frame b. Both a and b have a column called date of class Date. In a, date is the last day of the week, because the data is a weekly summary of pop. In b, date is an individual day, because the data is a daily summary of cars.
Since I'd like to merge a and b to do some analysis of daily cars for the population, I would like to expand the date column in a and create duplicate rows for each day in the week.
i.e. I start off with the data frame a below
pop date
1 10002 2020-07-12
2 10025 2020-07-19
3 10102 2020-07-26
and turn it into the data frame a_mod below
pop date
1 10002 2020-07-06
2 10002 2020-07-07
3 10002 2020-07-08
4 10002 2020-07-09
5 10002 2020-07-10
6 10002 2020-07-11
7 10002 2020-07-12
8 10025 2020-07-13
9 10025 2020-07-14
...
then merge a_mod and b together to look like this
pop date cars
1 10002 2020-07-06 252
2 10002 2020-07-07 46
3 10002 2020-07-08 43
4 10002 2020-07-09 44
Any idea how I can achieve this? I'm stumped.
ETA: I later figured out this was not the best idea, since I really only just wanted to map values from a onto b rather than blow up my data frames with so many rows. Instead, I asked a different question and got a different technique that worked much better. Thank you to all who took the time to help!
Here are two ways (depending on how you want to go).
Using ceiling_date() on b:
library(lubridate)
library(dplyr)
b %>%
  mutate(date2 = ceiling_date(date,
                              unit = "weeks",
                              week_start = 7)) %>% # weeks end on Sunday, as in a
  inner_join(a %>% rename(date2 = date)) %>%
  select(pop, date, cars)
Modifying a:
library(dplyr)
library(tidyr)
mod_a <- data.frame(date = seq(min(b$date), max(b$date), by = "days")) %>%
  left_join(a) %>%
  fill(pop, .direction = "updown")
mod_a %>% inner_join(b)
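A third option, closer to the literal expansion the asker described, is to generate the seven days ending on each weekly date and unnest. This sketch assumes each date in a really is the last day of its week:

```r
library(dplyr)
library(tidyr)

# a as shown in the question
a <- data.frame(pop = c(10002, 10025, 10102),
                date = as.Date(c("2020-07-12", "2020-07-19", "2020-07-26")))

# For each weekly row, build the 7 dates ending on `date`, then unnest
a_mod <- a %>%
  mutate(date = lapply(date, function(d) seq(d - 6, d, by = "day"))) %>%
  unnest(date)
```

a_mod then has one row per day and can be joined to b by date, as in the question.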

Extract the values from the dataframes created in a loop for further analysis (I am not sure how to sum up the question in one line)

My raw dataset has multiple product IDs, monthly sales, and corresponding dates arranged in a matrix format. I wish to create an individual dataframe for each product_id along with its sales values and dates. For this, I am using a for loop.
base is the base dataset.
x is the variable that contains the unique product_id and the corresponding no of observation points.
for (i in 1:nrow(x)) {
  n <- paste("df", x$vars[i], sep = "")
  assign(n, base[base[, 1] == x$vars[i], ])
  print(n)
}
This is a part of the output:
[1] "df25"
[1] "df28"
[1] "df35"
[1] "df37"
[1] "df39"
So all the dataframe names are saved in n. This, I think, is a character vector.
When I write df25 outside the loop, I get the dataframe I want:
> df25
# A tibble: 49 x 3
ID date Sales
<dbl> <date> <dbl>
1 25 2014-01-01 0
2 25 2014-02-01 0
3 25 2014-03-01 0
4 25 2014-04-01 0
5 25 2014-05-01 0
6 25 2014-06-01 0
7 25 2014-07-01 0
8 25 2014-08-01 0
9 25 2014-09-01 0
10 25 2014-10-01 0
# ... with 39 more rows
Now, I want to use each of these dataframes separately to perform a forecast analysis. To do that, I need to get at the values in the individual dataframes. This is what I have tried:
for(i in 1:4) {print(paste0("df", x$vars[i]))}
[1] "df2"
[1] "df3"
[1] "df5"
[1] "df14"
But I am unable to refer to the individual dataframes.
How can I access the dataframes and their values for further analysis? Since there are more than 200 products, I am looking for some approach that deals with all the dataframes at once.
First, I wish to convert each to a TS, using year and month values from the date variable, and then use ets or forecast, etc.
SAMPLE DATASET:
set.seed(354)
df <- data.frame(Product_Id = rep(1:10, each = 50),
                 Date = seq(from = as.Date("2010/1/1"), to = as.Date("2014/2/1"), by = "month"),
                 Sales = rnorm(100, mean = 50, sd = 20))
df <- df[-c(251:256, 301:312), ]
As always, any suggestion would be highly appreciated.
I think this is one way to get access to the individual dataframes. If there is a better method, please let me know:
(Var <- get(paste0("df",x$vars[i])))
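A common alternative worth noting: rather than creating objects with assign() and retrieving them with get(), keep the per-product dataframes in a named list via split(), which is much easier to iterate over for forecasting. A sketch using the sample dataset from the question:

```r
# Sample dataset from the question
set.seed(354)
df <- data.frame(Product_Id = rep(1:10, each = 50),
                 Date = seq(from = as.Date("2010/1/1"), to = as.Date("2014/2/1"), by = "month"),
                 Sales = rnorm(100, mean = 50, sd = 20))
df <- df[-c(251:256, 301:312), ]

# One data frame per product, named "1" .. "10"
df_list <- split(df, df$Product_Id)

# Apply the same analysis to every product, e.g. mean sales per product
sales_means <- sapply(df_list, function(d) mean(d$Sales))
```

The same lapply()/sapply() pattern works for building a ts object and running a forecast model per product, without any get() calls.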

R - Sample consecutive series of dates in time series without replacement?

I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"), ymd("2015-07-03"), by = "days"),
                              seq(ymd("2015-08-09"), ymd("2015-10-01"), by = "days"),
                              seq(ymd("2015-11-12"), ymd("2016-01-03"), by = "days")),
                    sample_id = 0L)
I.e.:
> date_data
# A tibble: 211 x 2
dates sample_id
<date> <int>
1 2015-03-22 0
2 2015-03-23 0
3 2015-03-24 0
4 2015-03-25 0
5 2015-03-26 0
6 2015-03-27 0
7 2015-03-28 0
8 2015-03-29 0
9 2015-03-30 0
10 2015-03-31 0
# … with 201 more rows
What I want to do is take ten 10-day-long samples of consecutive dates from within that time series, without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10, because that range falls completely within the dates column in my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column in date_data, such as 1:10.
To be clear, my requirements are:
Each sample would be 10 consecutive days.
The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.
Each 10-day-long sample can't include any date that's not within date_data$dates.
At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 200 rows - 10 for each sample - where sample_id != 0).
I am aware of dplyr::sample_n() but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled...
What's a good way to do this? A for loop?!?! Or perhaps something with purrr? Thank you very much for your help.
UPDATE: Thanks to @gfgm's solution, it reminded me that performance is an important consideration. My real dataset is quite a bit larger, and in some cases I would want to take 20+ samples instead of just 10. Ideally the size of each sample could be changed as well, i.e. not necessarily 10 days long.
This is tricky, as you anticipated, because of the requirement of sampling without replacement. I have a working solution below which achieves a random sample and works fast on a problem of the scale given in your toy example. It should also be fine with more observations, but will get really really slow if you need to pick a lot of points relative to the sample size.
The basic premise is to pick n=10 points, generate the 10 vectors from these points forwards, and if the vectors overlap ditch them and pick again. This is simple and works fine given that 10*n << nrow(df). If you wanted to get 15 subvectors out of your 200 observations this would be a good deal slower.
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"), ymd("2015-07-03"), by = "days"),
                              seq(ymd("2015-08-09"), ymd("2015-10-01"), by = "days"),
                              seq(ymd("2015-11-12"), ymd("2016-01-03"), by = "days")),
                    sample_id = 0L)
# A function that picks n starting indices, projects each forward `out`
# positions, and resamples if any of the segments overlap
pick_n_vec <- function(df, n = 10, out = 10) {
  points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  vecs <- lapply(points, function(i) i:(i + (out - 1)))
  while (max(table(unlist(vecs))) > 1) {
    points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
    vecs <- lapply(points, function(i) i:(i + (out - 1)))
  }
  vecs
}
# demonstrate
set.seed(42)
indices <- pick_n_vec(date_data)
for (i in 1:10) {
  date_data$sample_id[indices[[i]]] <- i
}
date_data[indices[[1]], ]
#> # A tibble: 10 x 2
#> dates sample_id
#> <date> <int>
#> 1 2015-05-31 1
#> 2 2015-06-01 1
#> 3 2015-06-02 1
#> 4 2015-06-03 1
#> 5 2015-06-04 1
#> 6 2015-06-05 1
#> 7 2015-06-06 1
#> 8 2015-06-07 1
#> 9 2015-06-08 1
#> 10 2015-06-09 1
table(date_data$sample_id)
#>
#> 0 1 2 3 4 5 6 7 8 9 10
#> 111 10 10 10 10 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
A marginally faster version:
pick_n_vec2 <- function(df, n = 10, out = 10) {
  points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  while (min(diff(sort(points))) < out) {
    points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  }
  lapply(points, function(i) i:(i + (out - 1)))
}
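One caveat with both versions: the index arithmetic treats the three date blocks as one contiguous run, so a window of 10 consecutive indices can straddle one of the calendar gaps. A small helper (hypothetical, not part of the original answer) can detect such windows, so they could be rejected inside the same while loop:

```r
# TRUE if the dates at these indices do not form a run of consecutive
# days, i.e. the candidate window straddles a gap in df$dates
spans_gap <- function(df, idx) {
  as.integer(diff(range(df$dates[idx]))) != length(idx) - 1
}

# Example with a 5-day block, a gap, then another 5-day block
gap_df <- data.frame(dates = c(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day"),
                               seq(as.Date("2015-02-01"), as.Date("2015-02-05"), by = "day")))
spans_gap(gap_df, 1:5)   # FALSE: five consecutive days
spans_gap(gap_df, 3:7)   # TRUE: the window crosses the January/February gap
```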

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
For time analysis, we can use $hour on the variable in which the time vector has been stored if our interest is in dealing with hours.
I want to subset my data for each hour in a day for 365 days and then take an average of the data at a particular hour throughout the year. Say I am interested in the values of irradiation/wind speed etc. at 12:00 PM for a year, and then take the mean of these values to get the desired result.
I know how to subset a data frame based upon conditions. If, for example, my data is in a matrix called data and contains two columns, say time and wind speed, and I'm interested in subsetting rows of data in which irradiation isn't zero, we can do this using the following code:
my_data <- subset(data, data[,1]>0)
but now in order to deal with hours values in time column which is a variable stored in data, how can I subset values?
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time, '%H'), which extracts just the hour from each timestamp; we can then simply group by this new column and calculate the mean for each group.
df <- data.frame(time = seq(Sys.time(), Sys.time() + 2 * 60 * 60 * 24, by = "hour"),
                 val = sample(seq(5), 49, replace = TRUE))
library(dplyr)
df %>%
  mutate(hour = format(time, '%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(val != 0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct; otherwise you'd first have to convert it using, for example, as.POSIXct(x, format = '%Y-%m-%d %H:%M:%S').
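For reference, the same hourly mean can be computed in base R with aggregate(); this sketch assumes, as above, a POSIXct time column and a numeric val column:

```r
# Toy data: hourly observations over two days
df <- data.frame(time = seq(Sys.time(), Sys.time() + 2 * 60 * 60 * 24, by = "hour"),
                 val = sample(seq(5), 49, replace = TRUE))

# Group by hour of day and average val within each group
df$hour <- format(df$time, "%H")
hourly_means <- aggregate(val ~ hour, data = df, FUN = mean)
```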
