map/dplyr way for dynamically populating two columns in dataframe - r

I have the dataframe 'test' shown at the very bottom below.
I have two different operations I'd like to complete on two different columns, and would like to use an efficient dplyr or purrr method to do so, if possible.
Operation #1:
I'd like to populate the NA values of 'amt_needed' with the two values of 'remaining' above them (this is a test dataframe; the actual version will have more rows, and each time I'd like the two 'amt_needed' values to equal the two 'remaining' values in the two rows above).
Operation #2:
The two NA values for 'remaining' should be the new 'amt_needed' values minus sum(contrib) for both a and b.
Any thoughts/suggestions appreciated!
test <- data.frame(date = c("2018-01-01", "2018-01-01", "2018-01-15", "2018-01-15"),
name = c("a","b","a","b"),
contrib = c(4,2,4,2),
amt_needed = c(100,100, NA,NA),
remaining = c(94,94, NA,NA))

Based on the new data provided in the OP, one solution using dplyr could be:
library(dplyr)
# Data
test <- data.frame(date = c("2018-01-01", "2018-01-01", "2018-01-15", "2018-01-15", "2018-01-30", "2018-01-30"),
name = c("a","b","a","b", "a","b"),
contrib = c(4,2,4,2,4,2),
amt_needed = c(100,100, NA,NA, NA,NA),
remaining = c(94,94, NA,NA, NA,NA))
# Change column to date
test$date <- as.Date(test$date, "%Y-%m-%d")
test$amt_needed <- test$amt_needed[1]
test %>%
  arrange(date, name) %>%
  group_by(date) %>%
  # total contribution per date (a and b together)
  mutate(group_contrib = sum(contrib)) %>%
  ungroup() %>%
  select(date, group_contrib) %>%
  unique() %>%
  arrange(date) %>%
  # running total of contributions across dates
  mutate(cumm_group_sum = cumsum(group_contrib)) %>%
  inner_join(test, by = "date") %>%
  mutate(remaining = amt_needed - cumm_group_sum) %>%
  mutate(amt_needed_act = remaining + group_contrib) %>%
  select(date, name, contrib, amt_needed_act, remaining)
# A tibble: 6 x 5
date name contrib amt_needed_act remaining
<date> <fctr> <dbl> <dbl> <dbl>
1 2018-01-01 a 4.00 100 94.0
2 2018-01-01 b 2.00 100 94.0
3 2018-01-15 a 4.00 94.0 88.0
4 2018-01-15 b 2.00 94.0 88.0
5 2018-01-30 a 4.00 88.0 82.0
6 2018-01-30 b 2.00 88.0 82.0
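The same arithmetic can also be written without the join. The following is a minimal base R sketch, not the answer's method: it assumes the rows are already sorted by date and that the starting amount is the first 'amt_needed' value (100); the names start_amt, group_total and running are introduced here for illustration only.

```r
# Same test data as above, without the columns that are to be computed
test <- data.frame(
  date = as.Date(c("2018-01-01", "2018-01-01", "2018-01-15",
                   "2018-01-15", "2018-01-30", "2018-01-30")),
  name = c("a", "b", "a", "b", "a", "b"),
  contrib = c(4, 2, 4, 2, 4, 2)
)
start_amt <- 100  # assumption: the first amt_needed value

# Total contribution per date, repeated on every row of that date
group_total <- ave(test$contrib, as.character(test$date), FUN = sum)

# Running total across dates: add each date's total once, on its first row
# (assumes rows are sorted by date)
first_row <- !duplicated(test$date)
running <- cumsum(ifelse(first_row, group_total, 0))

test$remaining  <- start_amt - running
test$amt_needed <- test$remaining + group_total
test
```

Each date's 'amt_needed' ends up equal to the previous date's 'remaining', which is what Operation #1 asks for.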


Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I think that if I can create a date column in column 4 holding the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second set of 1's with the first set of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct type, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between OBS_DTM and the minimum value in the group.
If your data is called data:
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
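To see why a run-length id separates the two sets of 1's, here is a small base R sketch that builds the same kind of id with rle() instead of data.table::rleid(). The state vector and timestamps are made up for illustration, and a single patient is assumed; with several MRNs the id would have to be computed within each patient.

```r
# Hypothetical states for one patient, in time order
state <- c(1, 1, 2, 2, 1, 1, 1, 2)

# One id per consecutive run of equal values
runs <- rle(state)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
run_id  # 1 1 2 2 3 3 3 4: the second run of 1's gets its own id (3)

# Hypothetical observation times, one per row
obs <- as.POSIXct(c("2020-07-27 08:00", "2020-07-28 08:00",
                    "2020-07-29 08:00", "2020-07-30 20:00",
                    "2020-08-01 08:00", "2020-08-02 08:00",
                    "2020-08-04 08:00", "2020-08-05 08:00"), tz = "UTC")

# Days since the first observation of each run (86400 seconds per day)
days_in_state <- ave(as.numeric(obs), run_id,
                     FUN = function(x) (x - min(x)) / 86400)
days_in_state
```

Grouping by state alone would merge runs 1 and 3, so the later rows would be measured from the very first timestamp instead of from the start of their own run.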
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

How to split dataframe into multiple dataframes by column index

I'm trying to process the weather data specified below. I thought I was on the right track, but pivot_longer is not being used in the correct manner and is causing partial duplicates.
Can anyone offer any suggestions as to how I can edit my code? I guess one way would be to perform the pivot_longer after splitting the dataframe into several dataframes i.e. first dataframe - jan, year, second dataframe - feb, year.
maxT <- read.table('https://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Tmax/ranked/England_S.txt', skip = 5, header = TRUE) %>%
select(c(1:24)) %>%
pivot_longer(cols = seq(2,24,2) , values_to = "year") %>%
mutate_at(c(1:12), ~as.numeric(as.character(.))) %>%
pivot_longer(cols = c(1:12), names_to = "month", values_to = "tmax") %>%
mutate(month = match(str_to_title(month), month.abb),
date = as.Date(paste(year, month, 1, sep = "-"), format = "%Y-%m-%d")) %>%
select(-c("name","year","month")) %>%
arrange(date)
Here is an option with tidyverse, using map2:
library(dplyr)
library(purrr)
list_df <- maxT %>%
select(seq(1, ncol(.), by = 2)) %>%
map2(maxT %>%
select(seq(2, ncol(.), by = 2)), bind_cols) %>%
imap( ~ .x %>%
rename(!! .y := `...1`, year = `...2`))
Output:
map(list_df, head)
#$jan
# A tibble: 6 x 2
# jan year
# <dbl> <int>
#1 9.9 1916
#2 9.8 2007
#3 9.7 1921
#4 9.7 2008
#5 9.5 1990
#6 9.4 1975
#$feb
# A tibble: 6 x 2
# feb year
# <dbl> <int>
#1 11.2 2019
#2 10.7 1998
#3 10.7 1990
#4 10.3 2002
#5 10.3 1945
#6 10 2020
# ...
data
maxT <- read.table('https://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Tmax/ranked/England_S.txt', skip = 5, header = TRUE) %>%
select(c(1:24))
We can use split.default to split the columns into groups of two.
list_df <- split.default(maxT, ceiling(seq_along(maxT)/2))
data
maxT <- read.table('https://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Tmax/ranked/England_S.txt', skip = 5, header = TRUE) %>%
select(c(1:24))
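As a small illustration of how the grouping vector drives split.default, here is a sketch on a toy data frame rather than the Met Office file. The column names (jan, year, feb, year.1) are assumptions about how read.table would name repeated headers, and the renaming step is an optional extra, not part of the answer above.

```r
# Toy stand-in for the first four columns of maxT (names are assumed)
toy <- data.frame(jan = c(9.9, 9.8), year = c(1916L, 2007L),
                  feb = c(11.2, 10.7), year.1 = c(2019L, 1998L))

# Column positions 1,2,3,4 map to groups 1,1,2,2
grp <- ceiling(seq_along(toy) / 2)

# split.default treats the data frame as a list of columns,
# so each piece is a two-column data frame
pieces <- split.default(toy, grp)

# Optional: name each piece after its month column and
# give every piece a plain "year" column
names(pieces) <- unname(vapply(pieces, function(d) names(d)[1], character(1)))
pieces <- lapply(pieces, function(d) setNames(d, c(names(d)[1], "year")))
pieces$feb
```

Calling plain split() on a data frame would split rows instead, which is why split.default is named explicitly.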

How to get the difference of a lagged variable by date?

Consider the following example:
library(tidyverse)
library(lubridate)
df = tibble(client_id = rep(1:3, each=24),
date = rep(seq(ymd("2016-01-01"), (ymd("2016-12-01") + years(1)), by='month'), 3),
expenditure = runif(72))
In df you have stored information on monthly expenditure from a bunch of clients for the past 2 years. Now you want to calculate the monthly difference between this year and the previous year for each client.
Is there any way of doing this maintaining the "long" format of the dataset? Here I show you the way I am doing it nowadays, which implies going wide:
df2 = df %>%
mutate(date2 = paste0('val_',
year(date),
formatC(month(date), width=2, flag="0"))) %>%
select(client_id, date2, expenditure) %>%
pivot_wider(names_from = date2,
values_from = expenditure)
df3 = (df2[,2:13] - df2[,14:25])
However, I find this unnecessarily complex, and in large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.
If you want to keep the data in long format, one way would be to group by the month and day of the date for each client_id and calculate the difference using diff.
library(dplyr)
df %>%
group_by(client_id, month_date = format(date, "%m-%d")) %>%
summarise(diff = -diff(expenditure))
# client_id month_date diff
# <int> <chr> <dbl>
# 1 1 01-01 0.278
# 2 1 02-01 -0.0421
# 3 1 03-01 0.0117
# 4 1 04-01 -0.0440
# 5 1 05-01 0.855
# 6 1 06-01 0.354
# 7 1 07-01 -0.226
# 8 1 08-01 0.506
# 9 1 09-01 0.119
#10 1 10-01 0.00819
# … with 26 more rows
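If every row should be kept, with the previous year's value sitting next to the current one, a lookup by key is another way to stay in long format. This is a base R sketch under stated assumptions: the data are rebuilt here with set.seed(1) so the example is self-contained, and the columns prev_year and yoy_diff are names introduced for illustration. Note that it computes this year minus the previous year, the opposite sign of -diff(expenditure).

```r
set.seed(1)
df <- data.frame(
  client_id = rep(1:3, each = 24),
  date = rep(seq(as.Date("2016-01-01"), as.Date("2017-12-01"), by = "month"), 3),
  expenditure = runif(72)
)

# Key for each row: client plus year-month
key <- paste(df$client_id, format(df$date, "%Y-%m"))

# Key of the same client and month one year earlier
prev_year_num <- as.integer(format(df$date, "%Y")) - 1
prev_key <- paste(df$client_id, paste0(prev_year_num, format(df$date, "-%m")))

# Look up last year's expenditure; rows with no earlier year get NA
df$prev_year <- df$expenditure[match(prev_key, key)]
df$yoy_diff  <- df$expenditure - df$prev_year
head(df[df$date >= as.Date("2017-01-01"), ])
```

Unlike diff within groups, this does not rely on the rows being ordered or on each month having exactly two observations.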
An option with data.table
library(data.table)
# Group by month only: a year-month key (e.g. zoo::as.yearmon) would put every row in its own group
setDT(df)[, .(diff = -diff(expenditure)), by = .(client_id, month_date = format(date, "%m"))]

Create new data frame with multiple subsets of same variable

I'd like to create a new data frame where the columns are subsets of the same variable that are split by a different variable. For example, I'd like to make a new subset of variable ('b') where the columns are split by a subset of a different variable ('year')
set.seed(88)
df <- data.frame(year = rep(1996:1998,3), a = runif(9), b = runif(9), e = runif(9))
df
year a b e
1 1996 0.41050128 0.97679183 0.7477684
2 1997 0.10273570 0.54925568 0.7627982
3 1998 0.74104481 0.74416429 0.2114261
4 1996 0.48007870 0.55296210 0.7377032
5 1997 0.99051343 0.18097104 0.8404930
6 1998 0.99954223 0.02063662 0.9153588
7 1996 0.03247379 0.33055434 0.9182541
8 1997 0.76020784 0.10246882 0.7055694
9 1998 0.67713100 0.59292207 0.4093590
Desired output for variable 'b' for years 1996 and 1998, is:
V1 V2
1 0.9767918 0.74416429
2 0.5529621 0.02063662
3 0.3305543 0.59292207
I could probably find a way to do this with a loop, but am wondering if there is a dplyr method (or any simple method) to accomplish this.
We subset dataset based on 1996, 1998 in 'year', select the 'year', 'b' columns and unstack to get the expected output
unstack(subset(df, year %in% c(1996, 1998), select = c('year', 'b')), b ~ year)
# X1996 X1998
#1 0.9767918 0.74416429
#2 0.5529621 0.02063662
#3 0.3305543 0.59292207
Or using tidyverse, we select the columns of interest, filter the rows based on the 'year' column, create a sequence column by 'year', spread to 'wide' format and select out the unwanted columns
library(tidyverse)
df %>%
select(year, b) %>%
filter(year %in% c(1996, 1998)) %>%
group_by(year = factor(year, levels = unique(year), labels = c('V1', 'V2'))) %>%
mutate(n = row_number()) %>%
spread(year, b) %>%
select(-n)
# A tibble: 3 x 2
# V1 V2
# <dbl> <dbl>
#1 0.977 0.744
#2 0.553 0.0206
#3 0.331 0.593
As there are only two 'year's, we can also use summarise
df %>%
summarise(V1 = list(b[year == 1996]), V2 = list(b[year == 1998])) %>%
unnest(cols = c(V1, V2))
Another option with dplyr, mixing in some base R, resulting in a slightly shorter solution than @akrun's code:
bind_cols(split(df$b, df$year)) %>% select(-'1997')
# A tibble: 3 x 2
`1996` `1998`
<dbl> <dbl>
1 0.977 0.744
2 0.553 0.0206
3 0.331 0.593

R transpose including NA

I have data like,
trackingnumer = c(1,1,2,2,3)
date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12")
scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
df = data.frame(trackingnumer, date, scan)
I want to transpose this data by tracking number:
df2 <- df %>%
group_by(trackingnumer) %>%
mutate(n = row_number()) %>%
{data.table::dcast(data = setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
I have tried this, but I couldn't get the desired outcome. I want date_1 to be the pickup date and date_2 to be the delivered date. As you can see, tracking number 3 doesn't have a pickup record, so I want its date_1 to be NA.
Base R attempt, using relevel to set the appropriate ordering of the scan column:
reshape(
cbind(df, time=as.numeric(relevel(factor(df$scan), "Pickup"))),
idvar="trackingnumer", direction="wide", sep="_"
)
# trackingnumer date_1 scan_1 date_2 scan_2
#1 1 2017-08-01 Pickup 2017-08-10 Delivered
#3 2 2017-08-02 Pickup 2017-08-05 Delivered
#5 3 <NA> <NA> 2017-08-12 Delivered
The problem was that your function in mutate was just counting the rows; it wasn’t paying attention to what was in them. The case_when() function lets you set the value of the “n” column based on the value of “scan”.
df2 <- df %>%
group_by(trackingnumer) %>%
mutate(n = case_when(scan == "Pickup" ~ 1,
scan == "Delivered" ~ 2)) %>%
{data.table::dcast(data = setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
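The key idea, assigning the slot number from the scan type rather than from the row count, can be checked in isolation with a base R sketch. This is not the answer's code: the helper names slot, ids, wide and row are introduced here for illustration, and only the date columns are filled to keep it short.

```r
df <- data.frame(
  trackingnumer = c(1, 1, 2, 2, 3),
  date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12"),
  scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered"),
  stringsAsFactors = FALSE
)

# Slot depends on the scan type, not on row order: Pickup -> 1, Delivered -> 2
slot <- match(df$scan, c("Pickup", "Delivered"))

# One output row per tracking number, with both dates starting as NA
ids <- unique(df$trackingnumer)
wide <- data.frame(trackingnumer = ids,
                   date_1 = NA_character_,
                   date_2 = NA_character_,
                   stringsAsFactors = FALSE)

# Fill each slot only where a matching scan exists
row <- match(df$trackingnumer, ids)
wide$date_1[row[slot == 1]] <- df$date[slot == 1]
wide$date_2[row[slot == 2]] <- df$date[slot == 2]
wide
```

Tracking number 3 has no Pickup row, so its date_1 is never assigned and stays NA.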
Or with tidyr
library(tidyr)
df %>% group_by(trackingnumer,scan2 = scan) %>%
nest(date,scan) %>%
spread(scan2,data) %>%
mutate_at(c("Delivered","Pickup"),~ifelse(map_lgl(.x,is_tibble),.x,lst(tibble(date=NA,scan=NA)))) %>%
unnest %>%
rename_at(c("date","scan"),paste0,2)
# # A tibble: 3 x 5
# trackingnumer date2 scan2 date1 scan1
# <dbl> <fctr> <fctr> <fctr> <fctr>
# 1 1 2017-08-10 Delivered 2017-08-01 Pickup
# 2 2 2017-08-05 Delivered 2017-08-02 Pickup
# 3 3 2017-08-12 Delivered <NA> <NA>
