Difficulty in generating time series in R for my data set - r

I am trying to generate a time series for my dataset in R but am having difficulty. My dataset has two columns: one for the date and one for the price of a material. Many dates have no price and are therefore not in the dataset. The data covers roughly a year. I am having trouble setting the frequency and start time for the time series. Is there a way to set the start from the dataset and have the time series automatically incorporate the missing data points?

The following is for a data frame df with two columns called "date" and "price".
This will create rows for the missing dates and fill the missing prices on those dates with the preceding price. You can replace fill('price') if you want to fill with other specified values instead.
library(tidyverse)
df<-df %>%
complete(date = seq.Date(min(date), max(date), by="day")) %>%
fill('price')
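As a quick check, here is a minimal sketch of the same approach on a made-up data frame (the dates and prices are invented for illustration). The last step is an assumption about what comes next, not part of the answer above: it wraps the filled prices in a zoo series indexed by the dates themselves, which is one way to avoid choosing a start and frequency by hand.

```r
library(tidyr)   # complete(), fill()
library(dplyr)   # %>%
library(zoo)     # zoo() for a date-indexed series (assumed to be installed)

# Toy data (invented): 2 Jan and 4 Jan are missing
df <- data.frame(
  date  = as.Date(c("2021-01-01", "2021-01-03", "2021-01-05")),
  price = c(10, 12, 11)
)

filled <- df %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(price)   # carry the previous price forward into the new rows

# Assumption: a zoo object ordered by the dates in the data
price_series <- zoo(filled$price, order.by = filled$date)
```

With a date-indexed series the start is taken from the data itself, so there is no need to work out a start argument for ts().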

Related

How to count the number of days that pass between two dates in a dataset column in R

I am working with a dataset in which 33 unique Ids are repeated for each day they provided data from their Fitbit, within a 30-day period. I am trying to count the number of days each Id has an entry in the ActivityDay column, grouped by Id, so that I can see how many days in total each person used their Fitbit out of the 30 days.
The activity date column was originally POSIXct and I converted it to Date. How can I count the dates as a number of days and group the result by each individual Id?
I tried using count inside dplyr::summarise to get the Id and the number of days while grouping the data by Id. That failed.
I also thought of using case_when, but I didn't think it would work: it wouldn't count all the way up to the end dates I specify, so anything between the two dates would just get the outputs I specified. I also tried count_date_between(min(user_device_activity), max(user_device_activity), by 'day'), but R said the function doesn't exist, and when I tried installing it, R said it didn't exist.
library(dplyr)
user_device_activity %>%
distinct(Id, ActivityDate) %>% # in case duplicates possible in data
count(Id, month = lubridate::floor_date(ActivityDate, "month"))
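If the goal is a single total per Id rather than a per-month count, a minimal sketch with n_distinct() could look like this. The toy data frame is invented; only the column names Id and ActivityDate are taken from the code above.

```r
library(dplyr)

# Toy data (invented): Id 1 logged on 2 distinct days, Id 2 on 1 day
user_device_activity <- data.frame(
  Id = c(1, 1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12",
                           "2016-04-13", "2016-04-12"))
)

days_used <- user_device_activity %>%
  group_by(Id) %>%
  summarise(days_used = n_distinct(ActivityDate))  # duplicates count once
```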

Average after 2 group_by's in R

I am new to R and can't find the right syntax for a specific average I need. I have a large Fitbit dataset of heart rate per second for 30 people, covering a month each. I want the average heart rate per day per person, to make the data easier to manage and to join with other Fitbit data.
First few lines of Data
The columns I have are Id (person Id#), Time (Date-Time), and Value (Heartrate). I already separated Time into two columns, one for date and one for time only. My idea is to group the information by person, then by date and get one average number per person per day. But, my code is not doing that.
hr_avg <- hr_per_second %>% group_by(Id) %>% group_by(Date) %>% summarize(mean(Value))
As a result I get an average by date only. I can't do this manually because the dataset is so big, Excel can't open it. And I can't upload it to BigQuery either, the database I learned to use during my data analysis course. Thanks.
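One likely cause: by default a second group_by() call replaces the earlier grouping instead of adding to it, so only Date is in effect when summarize() runs. A minimal sketch of grouping by both columns in a single call, on invented toy data that reuses the column names from the question (avg_hr is a name chosen here for illustration):

```r
library(dplyr)

# Toy data (invented), same column names as in the question
hr_per_second <- data.frame(
  Id    = c(1, 1, 1, 2, 2),
  Date  = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13",
                    "2016-04-12", "2016-04-12")),
  Value = c(60, 70, 80, 90, 100)
)

hr_avg <- hr_per_second %>%
  group_by(Id, Date) %>%                              # both keys at once
  summarize(avg_hr = mean(Value), .groups = "drop")   # one row per Id and Date
```

Writing group_by(Id) %>% group_by(Date, .add = TRUE) should give the same grouping, if keeping two separate calls is preferred.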

Changing period of dates to standard date to do line graph

I'm trying to plot a line graph in R using a dataset of COVID-19 case rates. I'm looking specifically at how to plot the number of cases in each region (north east, north west, etc.) against the period of time.
However, as the date is a period of a week rather than a standard date, how can I convert it to make the line graph actually possible? For example, right now it has the dates as 01/09/2020 - 07/09/2020. How can I use this for a line graph?
Sorry if my explanation isn't clear, here is a picture below.
I assume you're trying to plot a time series? You could just trim each period to the date the week begins on and label the time axis as "Week beginning on date". You could do this with substr() in base R, keeping the first 10 characters.
substr(data$column,1,10)
You may also want to format it as a date, easiest with the lubridate package, something like dmy() (day month year).
Here is the full code you would want:
library(tidyverse)
library(lubridate) # provides dmy(); not attached by older tidyverse versions
#Read in data
data <- read.csv("/Users/sabrinaxie/Downloads/covid19casesbysociodemographiccharacteristicengland1sep2020to10dec20213.csv")
#Modify data and remove extraneous top rows
data <- data %>%
rename(Period=Table.9..Weekly.estimates.of.age.standardised.COVID.19.case.rates..per.100.000.person.weeks..by.region..England..1.September.2020.to.6.December.20211.2.3) %>%
slice(3:n())
#Keep first 10 characters of Period column and assign to old column to replace
data$Period <- substr(data$Period,1,10)
#Parse as date
data$Period <- dmy(data$Period)

Create a column in one dataframe based on another column in another dataframe in R

I am fairly new to R and dplyr and I am stuck on this issue:
I have two tables:
(1) Repairs done on cars
(2) Amount owed on each car over time
What I would like to do is create three extra columns on the repair table that gives me:
(1) the amount owed on the car when the repair was done,
(2) the amount owed three months down the road, and
(3) the last payment record on file.
In the case where the repair date does not match any payment record, I need to use the closest amount owed on record.
So something like:
Any ideas how I can do that?
Here are the data frames:
Repairs done on cars:
df_repair <- data.frame(
  unique_id = c("A1","A2","A3","A4","A5","A6","A7","A8"),
  car_number = c(1,1,1,2,2,2,3,3),
  repair_done = c("Front Fender","Front Lights","Rear Lights","Front Fender","Rear Fender","Rear Lights","Front Lights","Front Fender"),
  YearMonth = c("2014-03","2016-03","2016-07","2015-05","2015-08","2016-01","2018-01","2018-05"))
df_owed <- data.frame(car_number = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3),
YearMonth = c("2014-02","2014-05","2014-06","2014-08","2015-06","2015-12","2016-03","2016-04","2016-05","2016-06","2016-07","2016-08","2015-05","2015-08","2015-12","2016-03","2018-01","2018-02","2018-03","2018-04","2018-05","2018-09"),
amount_owed = c(20000,18000,17500,16000,10000,7000,6000,5500,5000,4500,4000,3000,10000,8000,6000,0,50000,40000,35000,30000,25000,15000))
Using zoo for year-months, and tidyverse, you could try the following. Use left_join to add all the df_owed data to your df_repair data by car_number. You can convert your year-month columns to yearmon objects with zoo. Then sort your rows by the year-month column from df_owed.
For each unique_id (using group_by) you can create your three columns of interest. The first uses the latest amount_owed where the owed date is on or before the service date. The second (3 months) uses the first amount_owed value where the owed date is at least 3 months after the service date (3/12 of a year in yearmon terms). Finally, the most recent just takes the last value of amount_owed.
Using the example data, the results differ a bit, possibly due to the data frames not matching the images in the post.
library(tidyverse)
library(zoo)
df_repair %>%
left_join(df_owed, by = "car_number") %>%
mutate_at(c("YearMonth.x", "YearMonth.y"), as.yearmon) %>%
arrange(YearMonth.y) %>%
group_by(unique_id, car_number) %>%
summarise(
owed_repair_done = last(amount_owed[YearMonth.y <= YearMonth.x]),
owed_3_months = first(amount_owed[YearMonth.y >= YearMonth.x + 3/12]),
owed_most_recent = last(amount_owed)
)

R - fill in values for all dates

I have a data set with sales by date, where date is not unique and not all dates are represented: my data set has dates (the date of the sale), quantity, and totalprice. This is an irregular time series.
What I'd like is a vector of sales by date, with every date represented exactly once, and quantities and totalprice summed by date, with zeros where there are no sales.
I have part of this now; I can make a sequence containing all dates:
first_date=as.Date(min(dates))
last_date=as.Date(max(dates))
all_dates=seq(first_date, by=1, to=last_date)
And I can aggregate the sales data by sale date:
quantitybydate=aggregate(quantity, by=list(as.Date(dates)), sum)
But I'm not sure what to do next. If this were Python I'd loop through one of the date arrays, setting or getting the related quantity. But this being R, I suspect there's a better way.
Make a data frame with all_dates as a column, then merge it with quantitybydate, using quantitybydate's grouping column as by.y and setting all.x=TRUE. Then replace the NAs with 0.
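A minimal sketch of that merge in base R, on invented toy data. The names dates, quantity, all_dates and quantitybydate follow the question; the result name sales and the column name date are chosen here for illustration. It relies on aggregate() naming its output columns Group.1 and x when given an unnamed vector and an unnamed list.

```r
# Toy data (invented): two sales on 1 Jan, none on 2 Jan, one on 3 Jan
dates    <- c("2021-01-01", "2021-01-01", "2021-01-03")
quantity <- c(2, 3, 4)

all_dates <- seq(as.Date(min(dates)), as.Date(max(dates)), by = 1)

# Sum quantity per sale date; columns come out as Group.1 (date) and x (sum)
quantitybydate <- aggregate(quantity, by = list(as.Date(dates)), sum)

# Keep every date from all_dates; dates with no sales get NA in x
sales <- merge(data.frame(date = all_dates), quantitybydate,
               by.x = "date", by.y = "Group.1", all.x = TRUE)

sales$x[is.na(sales$x)] <- 0   # dates with no sales become 0
```

The same pattern should carry over to totalprice: aggregate it the same way, merge, and zero-fill.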
