R - date sequence with condition - r

I have this dataframe
test <-
data.frame(
id = c(4, 6, 9, 12),
open = c(as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01")),
closed = c(as.Date("2011-12-01"), as.Date("2011-12-31"), as.Date("2012-01-01"), as.Date("2015-12-31"))
)
My goal is to find, for each id, the years whose last day (31 December) falls within the period between open and closed. Id 4 opened in 2011 but closed before the last day of that year, so it should get NA. Id 6 reached the last day of 2011 but not of 2012, and the same holds for id 9.
Result should be
summary <-
data.frame(
id = c(4, 6, 9, 12),
open = c(as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01"), as.Date("2011-01-01")),
closed = c(as.Date("2011-12-01"), as.Date("2011-12-31"), as.Date("2012-01-01"), as.Date("2015-12-31")),
open_summary = c(NA, 2011, 2011, 2011),
closed_summary = c(NA, 2011, 2011, 2015)
)
Then I'd like to expand these summary years into one row per id and year, so the result should be
result <-
data.frame(
y = c(2011, 2011, 2011, 2012, 2013, 2014, 2015),
id = c(6, 9, 12, 12, 12, 12, 12)
)

Here is a tidyverse solution, also using lubridate (for the year function)...
library(tidyverse)
library(lubridate)
summary <- test %>%
mutate(open_summary = year(open) * (year(open) > year(open - 1)),
closed_summary = (year(closed + 1) - 1) * (year(closed + 1) > year(open)))
output <- summary %>%
filter(open_summary * closed_summary > 1) %>%
mutate(open_year = map2(open_summary, closed_summary, seq)) %>%
select(id, open_year) %>%
unnest(c(open_year))
summary
id open closed open_summary closed_summary
1 4 2011-01-01 2011-12-01 2011 0
2 6 2011-01-01 2011-12-31 2011 2011
3 9 2011-01-01 2012-01-01 2011 2011
4 12 2011-01-01 2015-12-31 2011 2015
output
id open_year
1 6 2011
2 9 2011
3 12 2011
4 12 2012
5 12 2013
6 12 2014
7 12 2015
If either open_summary or closed_summary is zero, that is equivalent to your NA row.
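If you would rather get NA than 0 for rows that never reach a year end, the same idea can be written in base R. This is only a sketch, assuming the `test` data frame from the question; the helper `yr()` and the names `last_full` and `reached` are made up for illustration.

```r
# Assumed input: the `test` data frame from the question
test <- data.frame(
  id = c(4, 6, 9, 12),
  open = as.Date(rep("2011-01-01", 4)),
  closed = as.Date(c("2011-12-01", "2011-12-31", "2012-01-01", "2015-12-31"))
)

# Hypothetical helper: calendar year of a Date as an integer
yr <- function(d) as.integer(format(d, "%Y"))

# Last year whose 31 December falls on or before `closed`
last_full <- yr(test$closed + 1) - 1
# TRUE when at least one year end lies between open and closed
reached <- last_full >= yr(test$open)

test$open_summary <- ifelse(reached, yr(test$open), NA_integer_)
test$closed_summary <- ifelse(reached, last_full, NA_integer_)

# Expand each remaining id into one row per year
keep <- test[reached, ]
result <- do.call(rbind, Map(
  function(id, from, to) data.frame(y = from:to, id = id),
  keep$id, keep$open_summary, keep$closed_summary
))
result
```

Here id 4 ends up with NA in both summary columns, as in the question's expected `summary`, and `result` holds one row per id and year.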

Here is an approach using data.table:
library(data.table)
#create a lookup table of year end dates
yrend <- data.table(YR_END=seq(as.Date("2010-12-31"), as.Date("2015-12-31"), by="1 year"))[,
YR := year(YR_END)]
setDT(test)
#create open_summary column, which is just the year of the open column
test[, open_summary := year(open)]
#lookup the year for the closed date
test[, closed_summary := yrend[test, on=.(YR_END>=open, YR_END<=closed), mult="last", YR]]
#create the sequence asked for in the second part of the question
test[!is.na(open_summary) & !is.na(closed_summary),
.(y=open_summary:closed_summary), id]
Output of test:
id open closed open_summary closed_summary
1: 4 2011-01-01 2011-12-01 2011 NA
2: 6 2011-01-01 2011-12-31 2011 2011
3: 9 2011-01-01 2012-01-01 2011 2011
4: 12 2011-01-01 2015-12-31 2011 2015
Output of the sequence step:
id y
1: 6 2011
2: 9 2011
3: 12 2011
4: 12 2012
5: 12 2013
6: 12 2014
7: 12 2015
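To see what the non-equi join with mult="last" picks for a single row, the lookup can be mimicked in base R. This is a rough sketch, not the data.table internals; the function name `last_year_end` is hypothetical, and the range of year ends is taken from the answer above.

```r
# Year-end dates covering the range used in the answer above
yr_end <- seq(as.Date("2010-12-31"), as.Date("2015-12-31"), by = "1 year")

# Hypothetical helper: year of the last year end between open and closed
last_year_end <- function(open, closed) {
  hits <- yr_end[yr_end >= open & yr_end <= closed]
  if (length(hits) == 0) return(NA_integer_)  # no year end inside the period
  as.integer(format(max(hits), "%Y"))
}

last_year_end(as.Date("2011-01-01"), as.Date("2011-12-01"))  # id 4
last_year_end(as.Date("2011-01-01"), as.Date("2012-01-01"))  # id 9
last_year_end(as.Date("2011-01-01"), as.Date("2015-12-31"))  # id 12
```

The first call finds no year end in the period and returns NA, which is why id 4 gets NA in closed_summary.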

Related

Updating table with custom numbers

Below is my dataset, which contains four columns id, year, quarter, and price.
df <- data.frame(id = c(1,2,1,2),
year = c(2010,2010,2011,2011),
quarter = c("2010-q1","2010-q2","2011-q1","2011-q2"),
price = c(10,50,10,50))
Now I want to expand this dataset to cover 2012 and 2013. First, I want to copy the rows for 2010 and 2011 and append them below. Then, in the copied rows, I want to replace the years with 2012 and 2013 and the quarters with 2012-q1, 2012-q2, 2013-q1 and 2013-q2.
Can anybody help me solve this and produce the table below?
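One way is to shift the year by two, rebuild the quarter label, and bind the new rows below the original ones. Note that this relies on id matching the quarter number, as it does in the sample data.
library(dplyr)
df %>%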
mutate(year = year + 2, quarter = paste0(year, "-q", id)) %>%
bind_rows(df, .)
id year quarter price
1 1 2010 2010-q1 10
2 2 2010 2010-q2 50
3 1 2011 2011-q1 10
4 2 2011 2011-q2 50
5 1 2012 2012-q1 10
6 2 2012 2012-q2 50
7 1 2013 2013-q1 10
8 2 2013 2013-q2 50
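If id does not always coincide with the quarter number, the quarter label can instead be rebuilt from the existing one. The following base R sketch assumes every quarter value starts with a four-digit year, as in the sample data; the names `extra` and `expanded` are only illustrative.

```r
df <- data.frame(id = c(1, 2, 1, 2),
                 year = c(2010, 2010, 2011, 2011),
                 quarter = c("2010-q1", "2010-q2", "2011-q1", "2011-q2"),
                 price = c(10, 50, 10, 50),
                 stringsAsFactors = FALSE)

# Copy the rows and shift the year by two
extra <- df
extra$year <- df$year + 2
# Keep the "-q1" / "-q2" suffix and swap in the new year
# (assumes the first four characters of quarter are the year)
extra$quarter <- paste0(extra$year, substring(df$quarter, 5))

expanded <- rbind(df, extra)
expanded
```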

Conversion of monthly data to yearly data in a dataframe in r

I have a dataframe showing monthly mgpp from 2000-2010:
dataframe1
Year Month mgpp
1: 2000 1 0.01986404
2: 2000 2 0.011178429
3: 2000 3 0.02662008
4: 2000 4 0.05034293
5: 2000 5 0.23491388
---
128: 2010 8 0.13234501
129: 2010 9 0.10432369
130: 2010 10 0.04329537
131: 2010 11 0.04343289
132: 2010 12 0.09494946
I am trying to convert dataframe1 into a raster that shows the variable mgpp. Before that, I want to reshape the dataframe so that it shows only the yearly mgpp. The expected outcome is shown below:
dataframe1
Year mgpp
1: 2000 0.01986704
2: 2001 0.01578429
3: 2002 0.02662328
4: 2003 0.05089593
5: 2004 0.07491388
6: 2005 0.11229201
7: 2006 0.10318569
8: 2007 0.07129537
9: 2008 0.04373689
10: 2009 0.02885386
11: 2010 0.74848348
I want to aggregate the months by mean. For instance, the value for 2000 should be the mean over January to December of that year. How can I achieve this? Help would be appreciated.
Here is a data.table approach.
library(data.table)
setDT(dataframe1)[,.(Yearly.mgpp = mean(mgpp)),by=Year]
Year Yearly.mgpp
1: 2000 0.06858387
2: 2010 0.08366928
Or if you prefer dplyr.
library(dplyr)
dataframe1 %>%
group_by(Year) %>%
summarise(Yearly.mgpp = mean(mgpp))
# A tibble: 2 x 2
Year Yearly.mgpp
<dbl> <dbl>
1 2000 0.0686
2 2010 0.0837
Or base R.
result <- sapply(split(dataframe1$mgpp,dataframe1$Year),mean)
data.frame(Year = as.numeric(names(result)),Yearly.mgpp = result)
Year Yearly.mgpp
2000 2000 0.06858387
2010 2010 0.08366928
Sample Data
dataframe1 <- structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2010, 2010,
2010, 2010, 2010), Month = c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12),
mgpp = c(0.01986404, 0.011178429, 0.02662008, 0.05034293,
0.23491388, 0.13234501, 0.10432369, 0.04329537, 0.04343289,
0.09494946)), class = "data.frame", row.names = c(NA, -10L
))

Fill NA values in one data table with observed values from a second data table in R

I can't believe I'm having this much trouble finding a solution to this problem: I have two data tables with identical rows and columns that look like this:
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(20, 30, 10, NA, 20, NA)
acctf <- c(20, NA, 15, NA, 40, NA)
dt1 <- data.table(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 NA
3 DEU 2010 10 15
4 DEU 2020 NA NA
5 CHE 2010 20 40
6 CHE 2020 NA NA
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(1, 1, 1, 60, 1, 70)
acctf <- c(1, 60, 1, 80, 1, 100)
dt2 <- data.table(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 1 1
2 FRA 2020 1 60
3 DEU 2010 1 1
4 DEU 2020 60 80
5 CHE 2010 1 1
6 CHE 2020 70 100
I need to create a new data table that replaces NA values in dt1 with values for the corresponding country/year/variable match from dt2, yielding a table that looks like this:
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 60
3 DEU 2010 10 15
4 DEU 2020 60 80
5 CHE 2010 20 40
6 CHE 2020 70 100
We can do this with a join on the 'Country', 'Year' columns
library(data.table)
nm1 <- names(dt1)[3:4]
nm2 <- paste0("i.", nm1)
dt3 <- copy(dt1)
dt3[dt2, (nm1) := Map(function(x, y)
fifelse(is.na(x), y, x), mget(nm1), mget(nm2)), on = .(Country, Year)]
dt3
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 60
#3: DEU 2010 10 15
#4: DEU 2020 60 80
#5: CHE 2010 20 40
#6: CHE 2020 70 100
Or, to make this more compact, use fcoalesce from data.table (from the comments by @IceCreamToucan)
dt3[dt2, (nm1) := Map(fcoalesce, mget(nm1), mget(nm2)), on = .(Country, Year)]
If the datasets have the same dimensions and the same values for 'Country' and 'Year' in the same row order, then another option is
library(purrr)
library(dplyr)
list(dt1[, .(acctm, acctf)], dt2[, .(acctm, acctf)]) %>%
reduce(coalesce) %>%
bind_cols(dt1[, .(Country, Year)], .)
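For comparison, the same fill can be sketched in base R without a join: match the rows on a combined Country/Year key and copy values only into the NA cells. This is a minimal illustration using plain data frames rather than data.table; the names `key1`, `key2`, `idx` and `filled` are made up here.

```r
dt1 <- data.frame(Country = c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE"),
                  Year = c(2010, 2020, 2010, 2020, 2010, 2020),
                  acctm = c(20, 30, 10, NA, 20, NA),
                  acctf = c(20, NA, 15, NA, 40, NA))
dt2 <- data.frame(Country = c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE"),
                  Year = c(2010, 2020, 2010, 2020, 2010, 2020),
                  acctm = c(1, 1, 1, 60, 1, 70),
                  acctf = c(1, 60, 1, 80, 1, 100))

# Row of dt2 that corresponds to each row of dt1
key1 <- paste(dt1$Country, dt1$Year)
key2 <- paste(dt2$Country, dt2$Year)
idx <- match(key1, key2)

filled <- dt1
for (col in c("acctm", "acctf")) {
  miss <- is.na(filled[[col]])
  # Only the missing cells are overwritten with the dt2 value
  filled[[col]][miss] <- dt2[[col]][idx[miss]]
}
filled
```

Because the rows are matched by key, this does not depend on dt1 and dt2 being in the same order.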

Insert a new column to a list of dataframes by parsing the date names of the

I have a list of data frames which I would like to insert a new column into
df<- data_frame(first =seq(1:10), second = seq(1:10))
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
I wish to insert a new column into each data frame by taking the name of each data frame and converting it to the corresponding YYYY QQ format.
Using the example above, the April 2018 ASP NDC-HCPCS Crosswalk dataframe will have a new column called date which will contain "2018 Q2" for all 10 rows, similarly the Apr 2019 ASP Pricing File dataframe will have a new date column containing "2019 Q2" for all rows and finally the Jan 18 ASP Pricing File will have a new date column with "2018 Q1"
Here is an example of an approach you can use and improve on. You should probably make the regex more robust: it works for your example, but I can see situations where it could fail. Also, this works for years 2000 and up; you would need to change it if you have any earlier dates. Let me know if you have any questions!
# Getting month indices
# Extracting month names
new_mon <- gsub(paste0(".*(", paste(c(month.name, month.abb), collapse = "|"), ").*"), "\\1", names(ldf))
# Subsetting a "quarter" vector by month name/abbreviation and taking nonmissing values
new_mon <- pmax(sort(rep(1:4, 3))[match(new_mon, month.name)],
sort(rep(1:4, 3))[match(new_mon, month.abb)], na.rm = TRUE)
# Getting the year indices and adding "20" if there are only two characters
new_year <- gsub(".*?([0-9]{2,4}).*", "\\1", names(ldf)) # This is good for your example but you can make it more robust
new_year <- ifelse(nchar(new_year) == 2, paste0("20", new_year), new_year)
# Pasting them Together
new_cols <- paste0(new_year, " Q", new_mon)
Results:
# Adding the columns using Map
Map(function(x, y) cbind(x, "new_column" = y), ldf, new_cols)
$`April 2018 ASP NDC-HCPCS Crosswalk`
first second new_column
1 1 1 2018 Q2
2 2 2 2018 Q2
3 3 3 2018 Q2
4 4 4 2018 Q2
5 5 5 2018 Q2
6 6 6 2018 Q2
7 7 7 2018 Q2
8 8 8 2018 Q2
9 9 9 2018 Q2
10 10 10 2018 Q2
$`Apr 2019 ASP Pricing File`
first second new_column
1 1 1 2019 Q2
2 2 2 2019 Q2
3 3 3 2019 Q2
4 4 4 2019 Q2
5 5 5 2019 Q2
6 6 6 2019 Q2
7 7 7 2019 Q2
8 8 8 2019 Q2
9 9 9 2019 Q2
10 10 10 2019 Q2
$`Jan 18 ASP Pricing File`
first second new_column
1 1 1 2018 Q1
2 2 2 2018 Q1
3 3 3 2018 Q1
4 4 4 2018 Q1
5 5 5 2018 Q1
6 6 6 2018 Q1
7 7 7 2018 Q1
8 8 8 2018 Q1
9 9 9 2018 Q1
10 10 10 2018 Q1
Data:
df<- data_frame(first =seq(1:10), second = seq(1:10))
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
A way to start:
df<- data_frame(first =seq(1:10), second = seq(1:10))
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
lookup_quarters <- setNames(paste0("Q", rep(1:4, each = 3)),
c("jan", "feb", "mar", "apr", "may", "jun",
"jul", "aug", "sep", "oct", "nov", "dec"))
lapply(seq_along(ldf),
function(i) {
qtr <- substr(tolower(strsplit(names(ldf)[i], " ")[[1]][1]), 1, 3)
qtr <- lookup_quarters[qtr]
yr <- strsplit(names(ldf)[i], " ")[[1]][2]
yr <- if (nchar(yr) == 2) paste0("20", yr) else yr
res <- ldf[[i]]
res$newcol <- paste(yr, qtr)
res
})
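Both answers note that the name parsing could be more robust. One possible variation, shown only as a sketch, looks for the first three-letter month abbreviation anywhere in the name and takes the first standalone two- or four-digit number as the year. It assumes English month names, case-sensitive matching, and years from 2000 on; the variable names are illustrative, and plain data.frame is used instead of data_frame.

```r
df <- data.frame(first = 1:10, second = 1:10)
ldf <- list(df, df, df)
names(ldf) <- c("April 2018 ASP NDC-HCPCS Crosswalk",
                "Apr 2019 ASP Pricing File",
                "Jan 18 ASP Pricing File")
nm <- names(ldf)

# Position of the first month abbreviation ("April" also matches "Apr")
pos <- as.integer(regexpr(paste(month.abb, collapse = "|"), nm))
mon <- ifelse(pos > 0, substr(nm, pos, pos + 2), NA_character_)
# Months 1-3 -> quarter 1, 4-6 -> quarter 2, and so on
qtr <- (match(mon, month.abb) - 1) %/% 3 + 1

# First standalone 4- or 2-digit number; names without one are returned unchanged
yr <- sub(".*?\\b(\\d{4}|\\d{2})\\b.*", "\\1", nm, perl = TRUE)
yr <- ifelse(nchar(yr) == 2, paste0("20", yr), yr)

labels <- paste0(yr, " Q", qtr)

# Add the label as a constant `date` column to each data frame
ldf_new <- Map(function(d, lab) { d$date <- lab; d }, ldf, labels)
labels
```

A name without a recognisable month yields NA for the quarter, so it is worth checking `labels` before adding the column.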

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates (note that the mean is not floored here, so Staff 2 gets 15.5 rather than 15):
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5
