Join or merge two data frames based on multiple-column criteria - r

I have two datasets with the following data.
# note: the shorter 5-element vectors recycle to length 10
maindata <- data.frame(eventid = 1:10,
                       district = c(rep("lucknow", 2), rep("allahabad", 1), rep("kanpur", 2)),
                       date = c(rep("2018-01-01", 2), rep("2018-01-02", 1), rep("2018-01-03", 2)))
# note: the 5-element temperature vector recycles to length 10
weather <- data.frame(district = c(rep("lucknow", 4), rep("allahabad", 3), rep("kanpur", 3)),
                      date = c(rep("2017-01-01", 4), rep("2017-01-02", 3), rep("2017-01-03", 3)),
                      temperature = c(rep("19.3", 2), rep("22.1", 1), rep("24.1", 2)))
A few considerations:
- "date" differs between the two data frames; that is OK, matching on MM-DD is sufficient.
- The datasets have different lengths; maindata is the main dataset, to which "temperature" should be added.
- The merge must happen on "district" and "date".
- maindata has the district column in lowercase.
What I tried:
# lower-case the district names so they match maindata
weather$district <- as.factor(tolower(weather$district))
# parse the dates (stored as "YYYY-MM-DD" strings)
weather$date <- as.Date(as.character(weather$date), format = "%Y-%m-%d")
# build month-day keys in both frames
maindata$md <- strftime(as.Date(maindata$date), "%m-%d")
weather$mdr <- strftime(weather$date, "%m-%d")
maindata <- left_join(maindata, weather, by = c("md" = "mdr", "district" = "district"))
The expected result in maindata would be something like this:
eventid district date temperature
1 lucknow 2018-01-01 19.3
2 lucknow 2018-01-01 19.3
3 allahabad 2018-01-03 24.1
4 kanpur 2018-01-03 NA
5 kanpur 2018-01-02 22.1
6 lucknow 2018-01-01 19.3
7 lucknow 2018-01-01 19.3
8 allahabad 2018-01-03 24.1
9 kanpur 2018-01-03 NA
10 kanpur 2018-01-02 22.1
Can anybody please help?

I don't understand your logic rules for merging; specifically, I don't see how date comes in.
It is entirely possible to reproduce your expected output without considering date at all, by simply matching df1$district with df2$dist (df1/df2 being the question's original sample data):
library(tidyverse)
left_join(df1, df2, by = c("district" = "dist")) %>%
    distinct() %>%
    select(-date.y)
# eventid date.x district temp
#1 1 2017-01-01 dist-1 19.3
#2 2 2017-01-01 dist-1 19.3
#3 3 2017-01-01 dist-1 19.3
#4 4 2017-01-01 dist-1 19.3
#5 5 2017-01-02 dist-2 22.1
#6 6 2017-01-02 dist-2 22.1
#7 7 2017-01-02 dist-2 22.1
#8 8 2017-01-03 dist-3 24.10
#9 9 2017-01-03 dist-3 24.10
#10 10 2017-01-03 dist-3 24.10
Could you provide sample data that is more representative of what you're trying to do, and where the role/importance of merging on date becomes clear?

A quick note: you should post your own attempts before asking for help on SO.
On to the answer:
What you should use is the merge function available in base R.
After reproducing the data frames you provided, try the chunk of code below:
# Since the dates don't matter, df2 can be reduced to a new df with only dist and temp
df3 <- df2[, c("dist", "temp")]
df3 <- unique(df3)
df4 <- merge(df1, df3, by.x = "district", by.y = "dist", all.x = T)
The deduplication is done to avoid creating numerous rows for each combination of dates in df1 and df2.
all.x = T ensures a left join (all rows of df1 are present in the final output).
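To see the effect of the deduplication, a quick check (a sketch using the same df1/df2 frames): without unique(), merge() returns one row for every matching district row in df2, so rows multiply.
# Sketch: row counts with and without the deduplication
nrow(merge(df1, df2, by.x = "district", by.y = "dist", all.x = TRUE))
nrow(df4)  # one row per df1 row, assuming df3 ends up with one temp per district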

Perhaps something like this (with the updated data):
library(tidyverse)
df1 %>%
    mutate(date = as.POSIXct(date),
           date1 = format(date, "%d/%m")) %>%
    left_join(df2 %>%
                  mutate(date = as.POSIXct(date),
                         date1 = format(date, "%d/%m")),
              by = c("date1" = "date1", "district" = "dist")) %>%
    select(-date1, -date.y) %>%
    rename(date = date.x) %>%
    filter(!duplicated(eventid))
#output
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 <NA>
6 6 2017-01-02 dist-2 <NA>
7 7 2017-01-02 dist-2 <NA>
8 8 2017-01-03 dist-3 24.10
9 9 2017-01-03 dist-3 24.10
10 10 2017-01-03 dist-3 24.10
Convert date in both data frames to POSIXct, make a %d/%m column, join by it and district, and then clean up.
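The same month-day idea also works with plain base R Dates; a minimal sketch, assuming the df1/df2 frames used in these answers:
# Sketch: month-day join key with base R
df1$md <- format(as.Date(df1$date), "%m-%d")
df2$md <- format(as.Date(df2$date), "%m-%d")
merged <- merge(df1, df2[, c("dist", "md", "temp")],
                by.x = c("district", "md"), by.y = c("dist", "md"),
                all.x = TRUE)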

Maybe you want this.
# convert the factor column to numeric
df2[, 2] <- as.numeric(as.character(df2[, 2]))
# left join on district, then drop the fifth column (df2's date)
m1 <- merge(df1, df2, by.x = "district", by.y = "dist", all.x = TRUE)[-5]
names(m1)[3] <- "date"
# reorder the columns and drop duplicated rows
m1 <- unique(m1[, c(2, 3, 1, 4)])
rownames(m1) <- NULL
> m1
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 22.1
6 6 2017-01-02 dist-2 22.1
7 7 2017-01-02 dist-2 22.1
8 8 2017-01-03 dist-3 24.1
9 9 2017-01-03 dist-3 24.1
10 10 2017-01-03 dist-3 24.1

How to reshape wide data to long format but keep the date as a date [duplicate]

This question already has an answer here:
How to use Pivot_longer to reshape from wide-type data to long-type data with multiple variables
I have S&P101 data structured like this:
Symbol Name X2020.01.02 X2020.01.03 X2020.01.06 X2020.01.07 X2020.01.08 X2020.01.09
1 AAPL Apple 75.0875 74.3575 74.95 74.5975 75.7975 77.4075
2 ABBV AbbVie 89.5500 88.7000 89.40 88.8900 89.5200 90.2100
Now, I turned it into long format because I use a mixed model:
# convert data to long format for the mixed model
c_data = ncol(data_2020)
# convert the dates into numbers
names(data_2020)[3:c_data] <- 1:(c_data - 2)
tempDataLong <- data_2020 %>%
    gather(key = day, value = close, 3:c_data, factor_key = TRUE)
# convert data to numeric for analysis
tempDataLong$day <- as.numeric(tempDataLong$day)
When I try to use as.Date to transform the day column back into dates, it fails because the column is now a factor; and when I make it numeric first, as.Date maps the numbers onto irrelevant dates (i.e. around 1970).
Please note that the dates are not continuous right now, because the stock market is closed on some days of the week, but for my analysis purposes I'm allowed to treat them as such.
My question is: how do I turn the day column in my long format back into the dates from the wide format?
Here is how my long format looks right now:
Symbol Name day close
1 AAPL Apple 1 75.08750
2 ABBV AbbVie 1 89.55000
3 ABT Abbott Laboratories 1 86.95000
4 ACN Accenture 1 210.14999
5 ADBE Adobe 1 334.42999
6 AIG American International Group 1 51.76000
7 AMGN Amgen 1 240.10001
8 AMT American Tower 1 228.50000
If you change the column names you'll lose the date information.
Try this with pivot_longer instead, since gather has been superseded:
library(dplyr)
library(tidyr)
tempDataLong <- data_2020 %>%
    pivot_longer(cols = starts_with('X'),
                 names_to = 'day',
                 names_pattern = 'X(.*)') %>%
    mutate(day = lubridate::ymd(day))
tempDataLong
# Symbol Name day value
# <chr> <chr> <date> <dbl>
# 1 AAPL Apple 2020-01-02 75.1
# 2 AAPL Apple 2020-01-03 74.4
# 3 AAPL Apple 2020-01-06 75.0
# 4 AAPL Apple 2020-01-07 74.6
# 5 AAPL Apple 2020-01-08 75.8
# 6 AAPL Apple 2020-01-09 77.4
# 7 ABBV AbbVie 2020-01-02 89.6
# 8 ABBV AbbVie 2020-01-03 88.7
# 9 ABBV AbbVie 2020-01-06 89.4
#10 ABBV AbbVie 2020-01-07 88.9
#11 ABBV AbbVie 2020-01-08 89.5
#12 ABBV AbbVie 2020-01-09 90.2
data
data_2020 <- structure(list(Symbol = c("AAPL", "ABBV"), Name = c("Apple",
"AbbVie"), X2020.01.02 = c(75.0875, 89.55), X2020.01.03 = c(74.3575,
88.7), X2020.01.06 = c(74.95, 89.4), X2020.01.07 = c(74.5975,
88.89), X2020.01.08 = c(75.7975, 89.52), X2020.01.09 = c(77.4075,
90.21)), class = "data.frame", row.names = c("1", "2"))
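Should you later need the wide layout back, the reverse is a one-liner with pivot_wider (a sketch; note the new column names come out as plain dates such as 2020-01-02 rather than the original X-prefixed names):
# Sketch: back to wide format, one column per date
tempDataWide <- tempDataLong %>%
    pivot_wider(names_from = day, values_from = value)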

Price returns calculation in a df with many tickers with dplyr

I have a dataframe with 3 columns: Dates, Tickers (i.e. financial instruments), and Prices.
I just want to calculate the returns for each ticker.
Some data to play with:
AsofDate <- as.Date(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                      "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                      "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"))
Tickers <- c("Ticker1", "Ticker1", "Ticker1", "Ticker1", "Ticker1",
             "Ticker2", "Ticker2", "Ticker2", "Ticker2", "Ticker2",
             "Ticker3", "Ticker3", "Ticker3", "Ticker3", "Ticker3")
Prices <- c(1, 2, 7, 4, 2,
            6, 5, 7, 9, 12,
            11, 11, 16, 14, 15)
df <- data.frame(AsofDate, Tickers, Prices)
My first idea was just to order the prices by (Tickers, Prices), then calculate over the whole vector and set the first day to NA...
TTR::ROC(x = Prices)
It works (Excel-style), but I want something prettier.
So I tried something like this:
require(dplyr)
ret = df %>%
    select(Tickers, Prices) %>%
    group_by(Tickers) %>%
    do(data.frame(LogReturns = TTR::ROC(x = Prices)))  # note: Prices here resolves to the global vector, not the grouped column
df$LogReturns = ret$LogReturns
But here I get too many values; it seems the calculation is not done by ticker.
Can you give me a hint?
Thanks!!
In dplyr, we can use lag to get the previous Prices:
library(dplyr)
df %>%
    group_by(Tickers) %>%
    mutate(returns = (Prices - lag(Prices)) / Prices)
# AsofDate Tickers Prices returns
# <date> <fct> <dbl> <dbl>
# 1 2018-01-01 Ticker1 1 NA
# 2 2018-01-02 Ticker1 2 0.5
# 3 2018-01-03 Ticker1 7 0.714
# 4 2018-01-04 Ticker1 4 -0.75
# 5 2018-01-05 Ticker1 2 -1
# 6 2018-01-01 Ticker2 6 NA
# 7 2018-01-02 Ticker2 5 -0.2
# 8 2018-01-03 Ticker2 7 0.286
# 9 2018-01-04 Ticker2 9 0.222
#10 2018-01-05 Ticker2 12 0.25
#11 2018-01-01 Ticker3 11 NA
#12 2018-01-02 Ticker3 11 0
#13 2018-01-03 Ticker3 16 0.312
#14 2018-01-04 Ticker3 14 -0.143
#15 2018-01-05 Ticker3 15 0.0667
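Note the division above is by the current price; the conventional simple return divides by the previous price instead. A sketch of that variant:
df %>%
    group_by(Tickers) %>%
    mutate(returns = (Prices - lag(Prices)) / lag(Prices))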
In base R, we can use ave with diff:
df$returns <- with(df, ave(Prices, Tickers, FUN = function(x) c(NA, diff(x))) / Prices)
We can use data.table:
library(data.table)
setDT(df)[, returns := (Prices - shift(Prices)) / Prices, by = Tickers]
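For the log returns the OP originally asked about, TTR::ROC can also be applied per group (a sketch, assuming the TTR package is installed; ROC defaults to type = "continuous", i.e. log returns, with a leading NA per group):
library(dplyr)
df %>%
    group_by(Tickers) %>%
    mutate(LogReturns = TTR::ROC(Prices)) %>%
    ungroup()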

R: aggregate per-second data to minutes more efficiently

I have a data.table, allData, containing data for roughly every (POSIXct) second from different nights. Some nights, however, fall on the same date since data is collected from different people, so I have a column nightNo as an id for each different night.
timestamp nightNo data1 data2
2018-10-19 19:15:00 1 1 7
2018-10-19 19:15:01 1 2 8
2018-10-19 19:15:02 1 3 9
2018-10-19 18:10:22 2 4 10
2018-10-19 18:10:23 2 5 11
2018-10-19 18:10:24 2 6 12
I'd like to aggregate the data to minutes (per night), and using this question I've come up with the following code:
aggregate_minute <- function(df){
    df %>%
        group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
        summarise(data1 = mean(data1), data2 = mean(data2)) %>%
        as.data.table()
}
allData <- allData[, aggregate_minute(allData), by = nightNo]
However my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?
allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
                      nightNo = rep(1:2, c(3, 3)),
                      data1 = 1:6,
                      data2 = 7:12)
timestamp nightNo data1 data2
1: 2018-06-14 10:43:11 1 1 7
2: 2018-06-14 10:43:11 1 2 8
3: 2018-06-14 10:43:11 1 3 9
4: 2018-06-14 10:48:31 2 4 10
5: 2018-06-14 10:48:31 2 5 11
6: 2018-06-14 10:48:31 2 6 12
You can group directly in data.table, by nightNo and the cut() minute, without the dplyr helper:
allData[, .(data1 = mean(data1), data2 = mean(data2)),
        by = .(nightNo, timestamp = cut(timestamp, breaks = "1 min"))]
nightNo timestamp data1 data2
1: 1 2018-06-14 10:43:00 2 8
2: 2 2018-06-14 10:48:00 5 11
> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
user system elapsed
3.25 0.02 3.31
> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
user system elapsed
1.02 0.04 1.06
You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.
library(data.table)
library(lubridate)
Reproducible data:
text <- "timestamp nightNo data1 data2
'2018-10-19 19:15:00' 1 1 7
'2018-10-19 19:15:01' 1 2 8
'2018-10-19 19:15:02' 1 3 9
'2018-10-19 18:10:22' 2 4 10
'2018-10-19 18:10:23' 2 5 11
'2018-10-19 18:10:24' 2 6 12"
allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
Create data.table:
setDT(allData)
Parse the timestamp and floor it to the nearest minute:
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
Change the type of the integer columns to numeric:
allData[, ':='(data1 = as.numeric(data1),
               data2 = as.numeric(data2))]
Replace the data columns with their means by nightNo group (each night here falls within a single minute; with longer nights you would also group by the floored timestamp):
allData[, ':='(data1 = mean(data1),
               data2 = mean(data2)),
        by = nightNo]
The result is:
timestamp nightNo data1 data2
1: 2018-10-19 19:15:00 1 2 8
2: 2018-10-19 19:15:00 1 2 8
3: 2018-10-19 19:15:00 1 2 8
4: 2018-10-19 18:10:00 2 5 11
5: 2018-10-19 18:10:00 2 5 11
6: 2018-10-19 18:10:00 2 5 11
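For nights spanning several minutes, the two ideas combine: group by nightNo and the floored minute so the rows collapse to one per minute. A minimal sketch, assuming timestamp is already POSIXct:
allData[, lapply(.SD, mean),
        by = .(nightNo, minute = floor_date(timestamp, "minute")),
        .SDcols = c("data1", "data2")]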

Count how many cases exist per week given start and end dates of each case [closed]

I'm new here, so I apologize if I miss any conventions.
I have a ~2,000-row dataset with data on unique cases occurring over a three-year period. Each case has a start date and an end date. I want to get a new data frame that shows how many cases occur per week in this three-year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem can be solved more easily with the sqldf package, but I thought to stick with dplyr.
The approach:
library(dplyr)
library(lubridate)
# First create a data frame having all weeks from the chosen start date
# to the end date: 2015-01-01 to 2017-12-31
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
                                      as.Date("2017-12-31"), by = 7))
df_week <- df_week %>%
    mutate(weekEnd = weekStart + 7,
           weekNum = as.character(weekStart, "%V-%Y"),
           dummy = TRUE)
# The dummy column is only for joining purposes.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
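As an aside, newer dplyr versions (>= 1.1.0, an assumption about your setup) provide cross_join(), which removes the need for the dummy column; a minimal sketch:
# Sketch: explicit cross join instead of the dummy-column trick
joined <- cross_join(df_week, df)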
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter, and then group by week to find the number of
# cases in each week
df_week %>%
    left_join(df, by = "dummy") %>%
    select(-dummy) %>%
    filter((weekStart >= Start_Date & weekStart <= End_Date) |
               (weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
    group_by(weekStart, weekEnd, weekNum) %>%
    summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
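For reference, the sqldf route mentioned at the top might look like this (a sketch, assuming the same df_week and df frames; sqldf stores Date columns as numbers, so BETWEEN still compares them correctly, but the date columns come back numeric unless reconverted):
library(sqldf)
# Sketch: overlap test in SQL instead of the dummy-column cross join
sqldf("SELECT w.weekStart, w.weekEnd, w.weekNum, COUNT(*) AS cases
       FROM df_week w
       JOIN df c
         ON (w.weekStart BETWEEN c.Start_Date AND c.End_Date)
         OR (w.weekEnd BETWEEN c.Start_Date AND c.End_Date)
       GROUP BY w.weekStart, w.weekEnd, w.weekNum")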
Welcome to SO!
Before solving the problem, be sure you have the needed packages installed by running
install.packages(c("tidyr", "dplyr", "lubridate"))
if you haven't installed them yet.
I'll present a modern R solution next; those packages are magic.
This is one way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_table(raw_data) %>%  # whitespace-separated, so read_table() rather than read_delim()
    mutate(start_date = as.Date(start_date)) %>%  # ensure start_date is a Date, assuming yyyy-mm-dd
    mutate(weeks_lapse = as.integer((start_date - min(start_date)) / dweeks(1)))  # weeks since the earliest date
curated_data %>%
    group_by(weeks_lapse) %>%          # group by week
    summarise(cases_per_week = n())    # count cases in each week
And the solution is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3
(Note this groups cases by the week in which they start; the first answer counts all cases active in each week.)

Converting sets of calendar dates to Julian days in a data frame

I am a beginner in R and I am trying to convert sets of calendar dates to Julian dates in a data frame. I know similar questions have been answered, but I am not able to get what I want.
df <- data.frame(Date = c('2010-06-20', '2005-10-19', '2000-05-01', '2003-04-04', '2010-11-20', '2009-09-14'),
                 No = c(1, 4, 6, 11, 7, 9))
df$jDate <- as.POSIXct(as.numeric(df$Date), origin = '1970-01-01')
gives me
df
Date No jDate
1 2010-06-20 1 1969-12-31 19:00:05
2 2005-10-19 4 1969-12-31 19:00:03
3 2000-05-01 6 1969-12-31 19:00:01
4 2003-04-04 11 1969-12-31 19:00:02
5 2010-11-20 7 1969-12-31 19:00:06
6 2009-09-14 9 1969-12-31 19:00:04
How could I get Julian days in the 'jDate' column?
Thank you for your help.
You can do
df$Date <- as.Date(df$Date)
to get the date, and then
df$jDate <- format(df$Date, "%j")
to get the Julian days, or
df$jDateYr <- format(df$Date, "%Y-%j")
to prepend the year (if you want). This returns
df
Date No jDate jDateYr
1 2010-06-20 1 171 2010-171
2 2005-10-19 4 292 2005-292
3 2000-05-01 6 122 2000-122
4 2003-04-04 11 094 2003-094
5 2010-11-20 7 324 2010-324
6 2009-09-14 9 257 2009-257
To read more about the possible date-time formats, see ?strptime.
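One caveat: format() returns character (note the zero-padded "094" above). If a numeric day of year is needed, e.g. for modelling, a quick conversion (a sketch):
df$jDateNum <- as.integer(format(df$Date, "%j"))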
Based on aosmith's comments, I did this and got what I wanted.
> df$jDate <- julian(as.Date(df$Date), origin = as.Date('1970-01-01'))
df
Date No jDate
1 2010-06-20 1 14780
2 2005-10-19 4 13075
3 2000-05-01 6 11078
4 2003-04-04 11 12146
5 2010-11-20 7 14933
6 2009-09-14 9 14501
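Note that julian() returns a numeric vector carrying an origin attribute; wrapping it in as.integer() gives a plain integer column if that attribute gets in the way (a sketch):
df$jDate <- as.integer(julian(as.Date(df$Date), origin = as.Date('1970-01-01')))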
