How can I get columns from two different data frames? [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I am new to R.
I've been struggling to understand how to get these columns for plotting. The columns come from different data sets. What I need to do is use the wday() function in the lubridate package, which is suggested as being useful for getting the weekday. Then I need to pivot the data to long format to get the Direction column. The Timeperiod column comes from the lockdown_dates data. Finally, I need to summarise the data to get the appropriate averages.
Column      Description
Weekday     Day of week (Sunday through Saturday).
Hour        Hour of day (0 through 23).
Direction   To City or From City
Timeperiod  Before times, Lockdown
Count       Average (mean) number of cyclists per direction per hour for each time period and weekday.
I have these sample tables.
        Date   Timeperiod
1 2020-02-01 Before times
2 2020-02-02 Before times
3 2020-02-03 Before times
        Date Hour From City To City
1 2020-02-01    0         0       1
2 2020-02-01    1         0       0
3 2020-02-01    2         0       0
I don't know how to start my code. I was thinking of grouping these to form the data, but I know it won't work. I would appreciate it if someone could give me an example of how to do it.
I tried this, but it only gave me "Friday", not a column.
weekdays(as.Date("4/6/2018 20:14", "%m/%d/%Y"))

You can join the two data frames and compute the weekday from Date (this assumes Date is of class Date in both data frames).
result <- transform(merge(df1, df2, by = 'Date'), wday = weekdays(Date))
Using dplyr :
library(dplyr)
result <- inner_join(df1, df2, by = 'Date') %>% mutate(wday = weekdays(Date))
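The remaining steps described in the question (pivot to long format, then average per group) could be sketched as follows. This is only an illustration under assumptions: the table names lockdown_dates and counts and all values are made up, and tidyr and lubridate are assumed to be installed alongside dplyr.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy stand-ins for the two tables in the question (values are made up).
lockdown_dates <- data.frame(
  Date = as.Date(c("2020-02-01", "2020-02-02", "2020-02-08")),
  Timeperiod = "Before times"
)
counts <- data.frame(
  Date = as.Date(c("2020-02-01", "2020-02-01", "2020-02-02", "2020-02-08")),
  Hour = c(0L, 1L, 0L, 0L),
  `From City` = c(0, 0, 2, 2),
  `To City` = c(1, 0, 4, 3),
  check.names = FALSE  # keep the spaces in the column names
)

result <- counts %>%
  # attach Timeperiod to every count row
  inner_join(lockdown_dates, by = "Date") %>%
  # weekday as a labelled factor (labels depend on the locale)
  mutate(Weekday = wday(Date, label = TRUE, abbr = FALSE)) %>%
  # one row per direction instead of one column per direction
  pivot_longer(c(`From City`, `To City`),
               names_to = "Direction", values_to = "Count") %>%
  # average over all dates that share weekday, hour, direction and period
  group_by(Weekday, Hour, Direction, Timeperiod) %>%
  summarise(Count = mean(Count), .groups = "drop")

print(result)
```

In this toy data the two Saturdays (2020-02-01 and 2020-02-08) are averaged together for hour 0, which is the kind of per-weekday mean the Count column describes.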

Related

Calculate average and std same day last 3 weeks in R [duplicate]

This question already has an answer here:
use rollapply and zoo to calculate rolling average of a column of variables
(1 answer)
Closed 2 years ago.
I have a data frame like the one below (sample data). For each day, I want to add two columns showing the average and standard deviation of sales on the same weekday over the previous 3 weeks, i.e. the 3 previous occurrences of that weekday (last 3 Tuesdays, last 3 Wednesdays, etc.).
df <- data.frame(
  stringsAsFactors = FALSE,
  date = c("3/28/2019","3/27/2019",
           "3/26/2019","3/25/2019","3/24/2019","3/23/2019",
           "3/22/2019","3/21/2019","3/20/2019","3/19/2019","3/18/2019",
           "3/17/2019","3/16/2019","3/15/2019","3/14/2019",
           "3/13/2019","3/12/2020","3/11/2020","3/10/2020","3/9/2021",
           "3/8/2021","3/7/2021","3/6/2022","3/5/2022",
           "3/4/2022","3/3/2023"),
  weekday = c(4L,3L,2L,1L,7L,6L,5L,4L,
              3L,2L,1L,7L,6L,5L,4L,3L,2L,1L,7L,6L,5L,4L,
              3L,2L,1L,7L),
  store_id = c(344L,344L,344L,344L,344L,
               344L,344L,344L,344L,344L,344L,344L,344L,344L,344L,
               344L,344L,344L,344L,344L,344L,344L,344L,344L,
               344L,344L),
  store_sales = c(1312005L,1369065L,1354185L,
                  1339183L,973780L,1112763L,1378349L,1331890L,1357713L,
                  1366399L,1303573L,936919L,1099826L,1406752L,
                  1318841L,1321099L,1387767L,1281097L,873449L,1003667L,
                  1387767L,1281097L,873449L,1003667L,1331636L,1303804L)
)
For example, for 3/28/2019, take the average sales of 3/21/2019, 3/14/2019 and 3/7/2021, like this:
date      weekday store_id store_sales avg_sameday3
3/28/2019       4      344     1312005      1310609
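We can group by weekday and store_id and calculate a rolling mean over the 3 rows that follow each row using zoo::rollapply. Because the sample data is sorted from newest to oldest, the rows that follow the current one within a group are the previous occurrences of that weekday.
library(dplyr)
df %>%
  arrange(weekday) %>%
  group_by(store_id, weekday) %>%
  mutate(store_sales_avg = zoo::rollapply(store_sales, 4,
                             function(x) mean(x[-1]),
                             align = "left", partial = TRUE))
Note that I have used a window size of 4 and dropped the first entry of each window so that the mean does not include the current value. The window is left-aligned (align = "left") so that the current row is that first entry; with a right-aligned window (rollapplyr) the current row would be the last entry, and x[-1] would drop the wrong value. With partial = TRUE a mean is still computed when fewer than 4 rows remain; the oldest row in each group has no earlier values, so it gets NaN.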
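The question also asks for the standard deviation. As an alternative sketch that does not rely on zoo, the three previous same-weekday values can be picked up with dplyr::lag after sorting each group by date. The data below is a made-up toy set (five consecutive Thursdays with invented sales figures), and the column names avg_sameday3 and sd_sameday3 are only illustrative.

```r
library(dplyr)

# Toy data: five consecutive Thursdays, newest first (sales values are made up).
sales <- data.frame(
  date = as.Date("2019-03-28") - 7 * 0:4,
  store_id = 344L,
  store_sales = c(100, 110, 120, 130, 140)
)

result <- sales %>%
  mutate(weekday = format(date, "%u")) %>%   # ISO weekday number, "1" = Monday
  arrange(store_id, weekday, date) %>%       # oldest first within each group
  group_by(store_id, weekday) %>%
  mutate(
    prev1 = lag(store_sales, 1),             # same weekday, 1 week earlier
    prev2 = lag(store_sales, 2),             # 2 weeks earlier
    prev3 = lag(store_sales, 3),             # 3 weeks earlier
    avg_sameday3 = (prev1 + prev2 + prev3) / 3,
    sd_sameday3 = apply(cbind(prev1, prev2, prev3), 1, sd)
  ) %>%
  ungroup() %>%
  select(-prev1, -prev2, -prev3)

print(result)
```

Rows with fewer than three earlier occurrences of the same weekday get NA in both new columns, which is a design choice here; the rollapply version with partial = TRUE averages whatever is available instead.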

Using lubridate with multiple date formats

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)
I do not think it is possible to convert your values to dates and keep the "yyyy" values intact. By transforming your "yyyy" values into "yyyy-01-01", lubridate is doing the right thing: dates have an order, and if other values in your column have days and months defined, all the values need to have these components too.
For example, suppose I create the data.frame below and ask R to order the table by the dates column. Does the value in the first row ("2020") come before or after the value in the second row ("2020-02-28")? The value "2020" can mean any day in that year, so how should R treat it? By adding the first day of the year, lubridate defines these components and keeps R from getting confused.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
  id,
  dates
)
  id      dates
1  1       2020
2  2 2020-02-28
3  3 2020-02-20
4  4 2020-01-10
5  5 2020-05-12
So if you want to keep the "yyyy" values intact, they probably should not sit in your eventDate column alongside values that have a different structure ("mm/dd/yyyy"). If it really is necessary to keep these values intact, I think it is best to keep the eventDate column as character and store the parsed dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
  id eventDate   as_dates
1  1 10/7/2001 2001-10-07
2  2      1989 1989-01-01
3  3      <NA>       <NA>
4  4  5/5/2016 2016-05-05
5  5 9/18/2011 2011-09-18
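Building on the two-column idea, one possible sketch is to also record which rows were year-only, so the original precision can be restored when the values are displayed. The columns year_only and label are hypothetical names introduced here for illustration, not part of lubridate.

```r
library(lubridate)

id <- 1:5
eventDate <- c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)

# Parsed dates for sorting and arithmetic; year-only values become January 1.
df$as_dates <- as.Date(parse_date_time(df$eventDate, orders = c("mdy", "Y")))

# Flag the rows whose original text was just a four-digit year (NA gives FALSE).
df$year_only <- grepl("^[0-9]{4}$", df$eventDate)

# Display label that does not invent a month and day for year-only rows.
df$label <- ifelse(df$year_only,
                   format(df$as_dates, "%Y"),
                   format(df$as_dates, "%Y-%m-%d"))

print(df)
```

The as_dates column can then be used for ordering and date arithmetic, while label is used for printing or plotting.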

Filtering Data based on another dataframe based on two rows

I have two datasets.
The first dataset includes companies, the quarter and the corresponding value over the whole timespan.
Quarter  Date        Company  value
2012.1   2012-12-28  x        1
2013.1   2013-01-02  y        2
2013.1   2013-01-03  z        3
Companies are in the dataset over the whole timespan and show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which the company was part of the index (companies can be in the index in multiple quarters).
Quarter  Date        Company  value
2012.1   2012-12-28  x        1
2014.1   2013-01-02  y        2
2013.1   2013-01-03  x        3
Now I need to select only the companies that are in the index in the same quarter for which I have data in the first dataset.
In the example above I would need the data from company x in both quarters, but company y needs to be dropped because its data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!
Have a look at dplyr's powerful join functions. Here inner_join might help you:
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
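The attempts with == compare the two columns element by element (row 1 with row 1, row 2 with row 2, recycling the shorter vector), which is why they filter too much or too little. If the goal is only to filter, without adding the index columns to the result, dplyr::semi_join is a possible alternative to inner_join. A minimal sketch with made-up data, loosely following the tables in the question:

```r
library(dplyr)

# Toy data (made up): company x appears in two quarters.
data <- data.frame(
  Quarter = c("2012.1", "2013.1", "2013.1", "2013.1"),
  Company = c("x", "x", "y", "z"),
  value = c(1, 2, 3, 4)
)
# Index membership per quarter; y is in the index only in 2014.1.
index <- data.frame(
  Quarter = c("2012.1", "2014.1", "2013.1"),
  Company = c("x", "y", "x")
)

# Keep the rows of data whose (Company, Quarter) pair occurs in index.
filtered <- semi_join(data, index, by = c("Company", "Quarter"))

print(filtered)
```

Unlike inner_join, semi_join never duplicates rows of data when a pair occurs more than once in index, and it returns only the columns of data.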

Calculate mean of one column for 14 rows before certain row, as identified by date for each group (year)

I would like to calculate the mean of Mean.Temp.c. before a certain date, such as 1963-03-23, as shown in the date2 column in this example. This is when peak snowmelt runoff occurred in 1963 in my area. I want to know the mean temperature over the 10 days before this date (i.e., 1963-03-23). How can I do it? I have 50 years of data, and the peak snowmelt date is different each year.
example data
You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2, as.Date("1963-03-14"), as.Date("1963-03-23"))]))
In this case the desired mean populates the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), as.Date("1963-03-14"), as.Date("1963-03-23")), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you get NA for the days that fall outside your date range.
Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code within a for loop (or [s|l]apply) taking elements from a vector of dates.
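The multi-year idea could be sketched as below, using sapply over a vector of peak dates. Everything here is an assumption for illustration: the weather data frame, its date and mean_temp columns, the helper mean_before_peak and the second peak date are all made up, and the window is taken as the 10 days ending on the peak date.

```r
# Toy daily data for two years; the "temperature" is just the day of the year,
# so the expected means are easy to check by hand.
weather <- data.frame(
  date = seq(as.Date("1963-01-01"), as.Date("1964-12-31"), by = "day")
)
weather$mean_temp <- as.integer(format(weather$date, "%j"))

# One peak snowmelt date per year (the 1964 date is invented).
peaks <- as.Date(c("1963-03-23", "1964-04-02"))

# Hypothetical helper: mean over the n_days days ending on the peak date.
mean_before_peak <- function(peak, data, n_days = 10) {
  in_window <- data$date > peak - n_days & data$date <= peak
  mean(data$mean_temp[in_window])
}

ten_day_means <- sapply(seq_along(peaks),
                        function(i) mean_before_peak(peaks[i], weather))
names(ten_day_means) <- format(peaks)

print(ten_day_means)
```

Whether the peak day itself belongs in the window is a choice; shifting both comparisons by one day gives the 10 days strictly before the peak.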

averaging by months with daily data [duplicate]

This question already has answers here:
Get monthly means from dataframe of several years of daily temps
(3 answers)
Closed 5 years ago.
I have daily data in a matrix with 6 columns: "Years, months, days, ssts, anoms, missing".
I want to calculate the average SST for each month of each year (for example, for September 1981, the average of the SST values of all days in that month), and I want to do the same for all the years. I have been trying to get my code to work but have been unable to do so.
You should use the dplyr package in R. For this, we will call your data df.
require(dplyr)
df.mnths <- df %>%
  group_by(Years, months) %>%
  summarise(Mean.SST = mean(SSTs))
df.years <- df %>%
  group_by(Years) %>%
  summarise(Mean.SST = mean(SSTs))
This gives two new data sets: df.mnths has the mean SST for each month of each year, and df.years has the mean SST for each year.
With data.table you can do the following:
library(data.table)
dt[, average_sst := mean(SSTs), by = .(years,months)]
This adds an extra column average_sst to the existing table instead of collapsing it to one row per month.
Just suppose that your data is stored in a data.frame named "data":
  years months        SSTs
1  1981      1 -0.46939368
2  1981      1  0.03226932
3  1981      1 -1.60266798
4  1981      1 -1.53095676
5  1981      1  1.71177023
6  1981      1 -0.61309846
tapply(data$SSTs, list(data$years, data$months), mean)
tapply(data$SSTs, factor(data$years), mean)
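Since the question mentions a "missing" column, the real data may contain NA values, in which case mean() returns NA unless na.rm = TRUE is set. A small self-contained sketch with made-up numbers (the data frame sst and its values are invented, and the column names follow the sample table above):

```r
library(dplyr)

# Toy daily data (values are made up); one SST value is missing.
sst <- data.frame(
  years = c(1981, 1981, 1981, 1981, 1982, 1982),
  months = c(9, 9, 10, 10, 9, 9),
  SSTs = c(20, 22, 18, NA, 21, 23)
)

# One row per year and month; na.rm = TRUE skips missing days.
monthly <- sst %>%
  group_by(years, months) %>%
  summarise(mean_sst = mean(SSTs, na.rm = TRUE), .groups = "drop")

print(monthly)
```

Without na.rm = TRUE, the October 1981 mean in this toy data would be NA because one of its two daily values is missing.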
