Filtering data based on another dataframe on two columns - r

I have two datasets.
The first dataset contains companies, the quarter, and the corresponding value, spanning the whole time period.
Quarter Date Company value
2012.1 2012-12-28 x 1
2013.1 2013-01-02 y 2
2013.1 2013-01-03 z 3
Companies are present in this dataset over the whole period and thus show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which it existed in the index (Companies can be in the index in multiple quarters).
Quarter Date Company value
2012.1 2012-12-28 x 1
2014.1 2013-01-02 y 2
2013.1 2013-01-03 x 3
Now I need to select only the companies which are in the index in the same quarter (Quarter) as the data from the first dataset.
In the example above I would need the data from company x, but company y needs to be dropped because its data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!

Have a look at dplyr's powerful join functions. Here inner_join might help you:
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
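A minimal sketch with the toy data from the question (column names assumed to match exactly across both frames):

```r
library(dplyr)

data <- data.frame(
  Quarter = c(2012.1, 2013.1, 2013.1),
  Company = c("x", "y", "z"),
  value   = c(1, 2, 3)
)
index <- data.frame(
  Quarter = c(2012.1, 2014.1, 2013.1),
  Company = c("x", "y", "x")
)

# inner_join keeps only rows whose (Company, Quarter) pair appears in both:
# y is dropped because the index has it in 2014.1, not 2013.1
inner_join(data, index, by = c("Company", "Quarter"))
#   Quarter Company value
# 1  2012.1       x     1

# If you only want to filter `data` without merging in columns from `index`,
# the filtering join semi_join does the same selection
semi_join(data, index, by = c("Company", "Quarter"))
```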

Related

Using lubridate with multiple date formats

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)
I do not think it is possible to convert your values to Dates and keep the "yyyy" values intact, and by transforming your "yyyy" values into "yyyy-01-01" lubridate is doing the right thing. Dates have an order, and if other values in your column have day and month components defined, all the other values need those components too.
For example, if I produce the data.frame below and ask R to order the table by the date column, does the date in the first row ("2020") come before the value in the second row ("2020-02-28"), or after it? The value "2020", being the year 2020, could mean any day in that year, so how should R treat it? By adding the first day of the year, lubridate defines these components and avoids that ambiguity.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
id,
dates
)
id dates
1 1 2020
2 2 2020-02-28
3 3 2020-02-20
4 4 2020-01-10
5 5 2020-05-12
So if you want to keep the "yyyy" values intact, they very likely should not sit in your eventDate column alongside values with a different structure ("mm/dd/yyyy"). If it really is necessary to keep these values intact, I think it is best to leave the eventDate column as character and store the parsed values as dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
id eventDate as_dates
1 1 10/7/2001 2001-10-07
2 2 1989 1989-01-01
3 3 <NA> <NA>
4 4 5/5/2016 2016-05-05
5 5 9/18/2011 2011-09-18
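If you later need to know which rows were year-only, one option (a sketch, assuming the strings follow the formats above; `year_only` and `display` are made-up column names) is to flag them before parsing:

```r
library(lubridate)

df <- data.frame(
  id        = 1:5,
  eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011"),
  stringsAsFactors = FALSE
)

# TRUE where the original string is a bare four-digit year
df$year_only <- grepl("^\\d{4}$", df$eventDate)
df$as_dates  <- parse_date_time(df$eventDate, orders = c("mdy", "Y"))

# Display rule: show only the year where the source had only a year
df$display <- ifelse(df$year_only,
                     format(df$as_dates, "%Y"),
                     format(df$as_dates, "%Y-%m-%d"))
```

This keeps the full Date semantics for sorting and arithmetic while preserving the original granularity for reporting.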

For looping over dates

So I have a dataframe called Swine_flu_cases that looks as follows (just an extract):
Country Date Confirmed
1 Canada 2020-01-22 1
2 Egypt 2020-01-23 1
3 Algeria 2020-01-24 1
4 France 2020-01-25 1
5 Zambia 2020-01-26 1
6 Congo 2020-01-27 1
This data set looks at the recorded amount of swine flu cases of a country on a specific date.
I have filtered my data to rows where the confirmed cases equal 1, grouped it by country, and sorted it in ascending order of date. (I did this to get the date on which each country had its first case.)
I sorted by date because I want to extract, for each country, the date of its first recorded swine flu case and store those dates as a vector.
I have tried doing so by using the following code :
first_case_date = as.Date(data.frame(Swine_flu_cases$Date))
This, however, gave me an error:
Error in as.Date.default(data.frame(Swine_flu_cases$Date)) :
  do not know how to convert 'data.frame(Swine_flu_cases$Date)' to class “Date”
What I want to do is create a new variable Swine_flu_cases$days_since_first_case which will take the stored date of each of the countries on my lists first case and subtract that from all the other dates for each country.
My knowledge of for loops is very basic, but I know I need to somehow use one here. I have recently familiarised myself with the lead and lag functions as well and was thinking there may be a way to combine these to create this variable?
If someone can just give me a general idea on how I could go about doing this please I would really appreciate it.
You can do this with dplyr and lubridate to make your dates behave.
library(dplyr)
library(lubridate)
Swine_flu_cases %>%
  mutate(Date = ymd(Date)) %>%  # makes the dates behave better for subtraction
  group_by(Country) %>%         # you want this grouped by country
  mutate(days_since_first_case = Date - min(Date))
  # subtracts the first date in each group from the current row's date
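The same computation also works in base R without loops. A sketch with a three-row stand-in for the real data (ave() is doing the per-country grouping here):

```r
# Stand-in for Swine_flu_cases, with Date already parsed
Swine_flu_cases <- data.frame(
  Country = c("Canada", "Canada", "Egypt"),
  Date    = as.Date(c("2020-01-22", "2020-01-25", "2020-01-23"))
)

# ave() replaces each Date with the minimum Date of its Country group
first_case <- ave(Swine_flu_cases$Date, Swine_flu_cases$Country, FUN = min)

# Days elapsed since the country's first recorded case
Swine_flu_cases$days_since_first_case <-
  as.numeric(Swine_flu_cases$Date - first_case)
```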

Calculate mean of one column for 14 rows before certain row, as identified by date for each group (year)

I would like to calculate the mean of Mean.Temp.c. before a certain date, such as 1963-03-23, as shown in the date2 column in this example. This is the time when peak snowmelt runoff occurred in 1963 in my area. I want to know the 10-day mean temperature before this date (i.e., 1963-03-23). How to do it? I have 50 years of data, and the peak snowmelt date is different in each year.
example data
You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2,
                                                 as.Date("1963-03-14"),
                                                 as.Date("1963-03-23"))]))
In this case the desired mean would populate the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), "1963-03-14", "1963-03-23"), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you'd get NA for those days that are not relevant for your date range.
Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code within a for loop (or [s|l]apply) taking elements from a vector of dates.
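Since the peak date differs every year, a join-based version may be easier than a loop. A sketch with toy data, assuming you can build a lookup table (here called `peaks`, a made-up name) of one peak date per year:

```r
library(dplyr)

# Toy stand-in for the real series (column names assumed from the question)
df <- data.frame(
  date2       = as.Date("1963-03-10") + 0:20,
  Mean.Temp.c = 1:21
)

# Hypothetical lookup table: one peak snowmelt date per year
peaks <- data.frame(year = 1963L,
                    peak = as.Date("1963-03-23"))

res <- df %>%
  mutate(year = as.integer(format(date2, "%Y"))) %>%
  left_join(peaks, by = "year") %>%
  group_by(year) %>%
  # the 10 days ending at the peak (1963-03-14 .. 1963-03-23, same window as above)
  summarise(ten_day_mean = mean(Mean.Temp.c[date2 > peak - 10 & date2 <= peak]))
res
# one row per year; here ten_day_mean = 9.5 (the mean of the values 5..14)
```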

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps)]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
In case you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, intervals, by.x = c("date", "date1"))[, sum(value), by = c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
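On newer data.table versions (those with non-equi join support, an assumption), the same aggregation can also be written as a non-equi join instead of foverlaps; a sketch using the same generated data:

```r
set.seed(1)
library(data.table)
library(lubridate)

date.range <- ymd(paste(2008, 3, 1:31, sep = "-"))
daily     <- data.table(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.table(start = daily$date[1:5],
                        end   = daily$date[c(6, 9, 15, 24, 31)])

# For each interval row, join the daily rows with start <= date <= end
# (the same bounds foverlaps uses above) and sum their values per interval
daily[intervals, on = .(date >= start, date <= end),
      .(nhdd = sum(value)), by = .EACHI]
# nhdd: 144.8444 233.4530 319.5452 531.7620 614.2481
```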

Combine different rows

Consider a dataframe of the form
id start end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with the start value of the earlier row and the end value of the later row
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just iterate the script several times to combine ids that have, for example, 4 rows in consecutive years satisfying the condition.
The Question
I'd know how to program this in an iterative manner: go over every single row and check whether, somewhere in the whole data frame, there is a row whose start date next year corresponds to this row's end date. But that's extremely slow and unsatisfying from an aesthetic point of view. I'm a very beginner with R, so I have no clue where to even look to do such a thing more efficiently; I'm open to any suggestion.
Warning: this kind of code with rbind() is cancerous, but this is the easiest solution I could think of. Let df be your data.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
  s = subset(df, id == i)
  # data.frame() rather than c() keeps the date classes intact when rbinding
  df2 = rbind(df2, data.frame(id = i, start = min(s$start), end = max(s$end)))
}
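A non-loop alternative (a sketch with dplyr, assuming start and end are of class Date): sort within each id, start a new run whenever a record does not begin the day after the previous one ends, then collapse each run into a single row.

```r
library(dplyr)

# Toy data: the second id has two records spanning consecutive years
df <- data.frame(
  id    = c(65693384, 65693592, 65693592),
  start = as.Date(c("2010-03-20", "2010-01-01", "2011-01-01")),
  end   = as.Date(c("2010-07-04", "2010-12-31", "2011-12-31"))
)

res <- df %>%
  arrange(id, start) %>%
  group_by(id) %>%
  # new run whenever a record does not start the day after the previous end
  mutate(run = cumsum(is.na(lag(end)) | start != lag(end) + 1)) %>%
  group_by(id, run) %>%
  summarise(start = min(start), end = max(end), .groups = "drop") %>%
  select(-run)
```

Note this merges any day-adjacent records, which covers the Dec 31 / Jan 1 case; if you need exactly year boundaries, add a check such as format(lag(end), "%m-%d") == "12-31" to the run condition. It also collapses chains spanning several consecutive years in a single pass, so no repeated iteration is needed.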
