R dplyr rowwise mutate

Good morning all, this is my first time posting on Stack Overflow. Thank you for any help!
I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information; we can call it df1:
df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))
Here is what I want to do:
For each row in df1, I need to:
- Find the row in df2 where df2$Answer is the same as df1$Key and whose date is the closest to that row's date in df1.
- Then extract the value of another column from that row in df2, and put it in a new column in df1.
The code I tried:
df1 %>%
group_by(Key, i) %>%
mutate(
`New Column` = df2$j[
which.min(subset(df2$date, df2$Answer== Key) - date)])
This has the result:
Key i date `New Column`
1 a 11 2011-01-01 14
2 b 12 2011-01-02 14
3 c 13 2011-01-03 14
This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.
However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.
Thank you for your help!
Jesse

This should work:
library(dplyr)
df1 %>%
left_join(df2, by = c("Key" = "Answer")) %>%
mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
group_by(Key) %>%
arrange(date_diff) %>%
slice(1) %>%
ungroup()
We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.
Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.
Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.
Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.
Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).
This pipeline gives us the following:
Key i date.x j date.y date_diff
<chr> <int> <date> <int> <date> <time>
1 a 11 2011-01-01 14 2015-03-04 131587200
2 b 12 2011-01-02 17 2015-03-07 131760000
3 c 13 2011-01-03 19 2015-03-09 131846400
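For reference, here is a self-contained sketch of the same idea that can be run from a clean session. It assumes dplyr 1.0.0 or later (for slice_min()) and lubridate (for ymd()); the suffix "_df2" and the object name closest are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(lubridate)

# Data from the question
df1 <- tibble(Key = c("a", "b", "c"), i = 11:13, date = ymd(20110101:20110103))
df2 <- tibble(Answer = c("a", "d", "e", "b", "f", "c"), j = 14:19,
              date = ymd(20150304:20150309))

closest <- df1 %>%
  # keep df1's column name as `date`; df2's date becomes `date_df2`
  left_join(df2, by = c("Key" = "Answer"), suffix = c("", "_df2")) %>%
  group_by(Key) %>%
  # one row per Key: the df2 row with the smallest absolute gap in days
  slice_min(abs(as.numeric(date_df2 - date)), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Key, i, date, `New Column` = j)

closest
```

With with_ties = FALSE exactly one row is kept per Key even if two df2 dates are equally close.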

Related

In R, replace values across time series based on another column

Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Where the value 1 should be replaced by "1+5GBP" and 4 should be "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(.*)(\\d)") %>%
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
You can use pivot_longer to put all the yr values in one column and the type.yr values in another. Then recode 1 into 1+5GBP and 4 into 7+3GBP where the type.yr column is mixed, and finally use pivot_wider to go back to the wide layout:
df %>%
pivot_longer(contains('yr'), names_to = c('.value','grp'),
names_pattern = '(\\D+)(\\d+)') %>%
mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
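If you would rather stay close to the across() attempt in the question, something along these lines may work without pivoting. It is a sketch that assumes dplyr 1.0.0 or later (for cur_column()) and that every yrN column has a matching type.yrN column; the name result is illustrative.

```r
library(dplyr)

df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(across(
    starts_with("yr"),
    # cur_column() is the name of the column being processed, e.g. "yr1";
    # get() then looks up the matching "type.yr1" column in the data
    ~ ifelse(get(paste0("type.", cur_column())) == "mixed", mixed.rate, .x)
  ))

result
```

starts_with("yr") only matches yr1 and yr2 here, so the type.* columns are left untouched.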
If you're happy to use base R instead of dplyr then the following will produce your required output:
for (i in 1:2) {
df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed',df[,'mixed.rate'],df[,paste0('yr',i)])
}
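Since the question mentions many more years than two, the loop can be written so that it finds the year columns itself. This is a base R sketch that assumes the columns are named yr1, yr2, ... with matching type.yr1, type.yr2, ...; yr_cols and type_col are illustrative names.

```r
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# All columns named "yr" followed by digits, however many years there are
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE)

for (col in yr_cols) {
  type_col <- paste0("type.", col)
  # Replace the value with mixed.rate wherever the matching type is "mixed"
  df[[col]] <- ifelse(df[[type_col]] == "mixed", df$mixed.rate, df[[col]])
}

df
```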

Subsetting a dataframe conditional on minimum difference between 2 dates

I have a dataframe with 4 variables: id, measurement, date_a, date_b.
A single id can contribute to the df more than once. I want to subset this dataframe to only include one measurement for each id. I want to select a single row for each id based on the minimum difference between date_b and date_a, however this minimum difference is required to be at least one year. Is there a way to do this using dplyr using one line of code rather than creating a new variable for the difference in dates?
Here's some fake data. (It's best practice to include something like this in your question to avoid ambiguity or misunderstandings about your particular situation.)
set.seed(8601)
df <- data.frame(
id = rep(1:3, each = 5),
measurement = "foo",
date_a = as.Date(sample(1:3000, 15), origin = "2010-01-01")
)
df$date_b <- df$date_a + sample(1:1000, 15)
Here's a slightly-longer-than-one-line approach with dplyr:
library(dplyr)
df %>% group_by(id) %>% filter(date_b-date_a >= 365) %>% filter(date_b-date_a == min(date_b-date_a))
Result:
# A tibble: 3 x 4
# Groups: id [3]
id measurement date_a date_b
<int> <fct> <date> <date>
1 1 foo 2013-06-13 2014-11-26
2 2 foo 2014-10-05 2017-04-14
3 3 foo 2012-01-07 2014-02-11
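A variant of the same idea, sketched with slice_min() (this assumes dplyr 1.0.0 or later). It treats "one year" as 365 days, as the code above does, and keeps a single row per id even when two rows tie on the difference; the name result is illustrative.

```r
library(dplyr)

set.seed(8601)
df <- data.frame(
  id = rep(1:3, each = 5),
  measurement = "foo",
  date_a = as.Date(sample(1:3000, 15), origin = "2010-01-01")
)
df$date_b <- df$date_a + sample(1:1000, 15)

result <- df %>%
  # drop rows where the gap is shorter than 365 days
  filter(as.numeric(date_b - date_a) >= 365) %>%
  group_by(id) %>%
  # of the remaining rows, keep the one with the smallest gap per id
  slice_min(as.numeric(date_b - date_a), n = 1, with_ties = FALSE) %>%
  ungroup()

result
```

An id with no gap of at least 365 days simply drops out of the result.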

cbind columns of varying length, filtering observations that do not share a common index in R

I have 6 time series objects stored in their own dataframe, each with an index from 2000-01-01 to 2010-01-01, however, the observations differ for each object. For clarification, whilst each object might have an observation for 2005-01-01, one object might not have an observation for 2010-02-01, whilst all 5 others do.
I want to use cbind to bind them all together, however, as each object has a differing length I can't (and the fact I want to find the time-varying correlations between each object). Basically I want to find a way to only bind 'complete cases' across all 6 objects, and slot them into their respective index spot.
I am thinking of creating a data frame with a time index ranging from 2000-01-01 to 2010-01-01, binding them to their respective time index (this is the part I don't know how to do), and then using complete cases to remove the observations that don't share a common index. If there is a better way to do this, clarification is also appreciated!
Thank you!
One way to do this would be:
1. Create a data frame with the full time range from 2000-01-01 to 2010-01-01. For this you can use seq().
2. Use dplyr::left_join() to join your various data frames onto this reference data frame (make sure to give your reference data frame as the first argument of left_join()).
Edit to explain comment:
left_join needs to "know", how to join the data frames together. You have two options:
you can give your reference data frame's date column the same name as in the others (so, for instance, if the date variable in your 6 data frames is called "Date", your reference data frame's only column should also be called "Date")
or, if you called it something else (say, "Reference"), you need to add a by argument that maps one name to the other: left_join(df_ref, df1, by = c("Reference" = "Date"))
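Put together, the two steps might look like the sketch below. The data frames df_a and df_b are made-up stand-ins for two of the six series, the date range is shortened to three months, and the column names Reference and Date follow the example above.

```r
library(dplyr)

# Hypothetical stand-ins for two of the six series
df_a <- data.frame(Date = as.Date(c("2000-01-01", "2000-02-01", "2000-03-01")),
                   a = c(1, 2, 3))
df_b <- data.frame(Date = as.Date(c("2000-01-01", "2000-03-01")),
                   b = c(10, 30))

# Step 1: reference data frame covering the full (here shortened) range
df_ref <- data.frame(
  Reference = seq(as.Date("2000-01-01"), as.Date("2000-03-01"), by = "month")
)

# Step 2: join each series onto the reference; gaps become NA
joined <- df_ref %>%
  left_join(df_a, by = c("Reference" = "Date")) %>%
  left_join(df_b, by = c("Reference" = "Date"))

# Keep only the dates for which every series has an observation
complete_only <- joined[complete.cases(joined), ]
complete_only
```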
In Base R you could do
merge( merge( df1, df2, all = TRUE ), df3, all = TRUE )
which gives you
time var_A var_B var_C
1 2010-01-01 NA 3 NA
2 2011-01-01 NA NA 0
3 2012-01-01 3 2 0
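With six data frames the nested merge() calls get unwieldy; Reduce() can fold the same call over a list. A sketch with three small example data frames (the name merged is illustrative):

```r
df1 <- data.frame(time = c("2012-01-01"), var_A = c(3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(time = c("2010-01-01", "2012-01-01"), var_B = c(3, 2),
                  stringsAsFactors = FALSE)
df3 <- data.frame(time = c("2011-01-01", "2012-01-01"), var_C = c(0, 0),
                  stringsAsFactors = FALSE)

# Full outer merge of every data frame in the list on the shared "time" column.
# merge() sorts the result on the key column by default (sort = TRUE).
merged <- Reduce(function(x, y) merge(x, y, by = "time", all = TRUE),
                 list(df1, df2, df3))
merged

# Rows that have a value in every series
merged[complete.cases(merged), ]
```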
You could go for a full_join from dplyr. I'd suggest to load tidyverse, just in case the task gets more complex (see the examples below).
Example dataframes:
df1 <- data.frame(time = c("2012-01-01"), var_A = c(3))
df2 <- data.frame(time = c("2010-01-01", "2012-01-01"), var_B = c(3, 2))
df3 <- data.frame(time = c("2011-01-01", "2012-01-01"), var_C = c(0, 0))
Code:
library(tidyverse)
df <- df1 %>%
full_join(df2, by = "time") %>%
full_join(df3, by = "time")
Output:
df
time var_A var_B var_C
1 2012-01-01 3 2 0
2 2010-01-01 NA 3 NA
3 2011-01-01 NA NA 0
This can also be shortened:
library(tidyverse)
df <- list(df1, df2, df3) %>%
reduce(full_join, by = "time")
Output:
time var_A var_B var_C
1 2012-01-01 3 2 0
2 2010-01-01 NA 3 NA
3 2011-01-01 NA NA 0
If you need it arranged, you can then always use arrange afterwards.
P.S. If you're missing some of the dates in that sequence in your dataframes, you can just add a few lines to the statement to complete them (I've also added a replace statement to fill NAs with 0):
library(tidyverse)
list(df1, df2, df3) %>%
reduce(full_join, by = "time") %>%
mutate(time = as.Date(time)) %>%
complete(time = seq.Date(as.Date("2000-01-01"), as.Date("2010-01-01"), by="month")) %>%
replace(., is.na(.), 0)
In the above case I've added a sequence from 2000-01-01 until 2010-01-01 by month, but you can also change that to min(time) and max(time) or what would suit you best.
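If the goal is only the 'complete cases', as the question puts it, swapping full_join for inner_join keeps just the time points that appear in every data frame. A sketch using the same example data (reduce() comes from purrr; the name common is illustrative):

```r
library(dplyr)
library(purrr)

df1 <- data.frame(time = c("2012-01-01"), var_A = c(3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(time = c("2010-01-01", "2012-01-01"), var_B = c(3, 2),
                  stringsAsFactors = FALSE)
df3 <- data.frame(time = c("2011-01-01", "2012-01-01"), var_C = c(0, 0),
                  stringsAsFactors = FALSE)

# inner_join drops any time point that is missing from either side,
# so after reducing over the list only shared time points remain
common <- list(df1, df2, df3) %>%
  reduce(inner_join, by = "time")

common
```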

Gather twice in same data frame

I have a dataframe where I want to do two separate gathers
library(tidyverse)
id <- c("A","B","C","D","E")
test_1_baseline <- c(1,2,4,5,6)
test_2_baseline <- c(21000, 23400, 26800,29000,30000)
test_1_followup <- c(0,4,2,3,1)
test_2_followup <- c(10000,12000,13000,15000,21000)
layout_1 <-data.frame(id,test_1_baseline,test_1_followup,test_2_baseline,test_2_followup)
This is the current layout.
Each person is 1 line.
The result of Test 1 at baseline is one variable
The result of Test 2 at baseline is a second variable
The same applies to Test 1/2 follow-up results
I would like the data to be tidier. One column for timepoint, one for result of test 1, one for result of test 2.
id2 <- c("A","B","C","D","E","A","B","C","D","E")
time <- c(rep("baseline",5),rep("followup",5))
test_1_result <- c(1,2,4,5,6,0,4,2,3,1)
test_2_result <- c(21000, 23400, 26800,29000,30000,10000,12000,13000,15000,21000)
layout_2 <- data.frame(id2, time,test_1_result,test_2_result)
I'm currently following what seems to me an odd process, where first of all I gather the test 1 data
test_1 <- select(layout_1,id,test_1_baseline,test_1_followup) %>%
gather("Timepoint","test_1",c(test_1_baseline,test_1_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_1_followup", "followup"))
Then I do same for test 2 and join them
test_2 <- select(layout_1,id,test_2_baseline,test_2_followup) %>%
gather("Timepoint","test_2",c(test_2_baseline,test_2_followup)) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_baseline", "baseline")) %>%
mutate(Timepoint = replace(Timepoint,Timepoint=="test_2_followup", "followup"))
test_combined <- full_join(test_1,test_2)
I tried doing the first Gather and then the second on the same dataframe but then you end up with duplicates; i.e you end up with
ID 1 Test_1 Baseline Test_2 Baseline
ID 1 Test_1 Baseline Test_2 Followup
ID 1 Test_1 Followup Test_2 Baseline
ID 1 Test_1 Followup Test_2 Followup
== 4 rows where there should only be 2
I feel there must be a cleaner tidyverse way to do this.
Guidance welcomed
One option with data.table using melt which can take multiple measure patterns
library(data.table)
nm1 <- unique(sub(".*_", "", names(layout_1)[-1]))
melt(setDT(layout_1), measure = patterns("test_1", "test_2"),
value.name = c('test_1_result', 'test_2_result'),
variable.name = 'time')[, time := nm1[time]][]
You could gather all columns except id, then use separate to split into result and time.
Note that this code assumes that the result name is always 6 characters (test_1, test_2), and separates based on that assumption. You'll need to devise a different separate if that is not the case.
library(tidyr)
library(dplyr)
layout_1 %>%
gather(Var, Val, -id) %>%
separate(Var, into = c("result", "time"), sep = 6) %>%
spread(result, Val) %>%
mutate(time = gsub("_", "", time))
Result:
id time test_1 test_2
1 A baseline 1 21000
2 A followup 0 10000
3 B baseline 2 23400
4 B followup 4 12000
5 C baseline 4 26800
6 C followup 2 13000
7 D baseline 5 29000
8 D followup 3 15000
9 E baseline 6 30000
10 E followup 1 21000
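In more recent versions of tidyr the same reshaping can be sketched in one step with pivot_longer() and the ".value" sentinel. This assumes tidyr 1.0.0 or later and that every column other than id is named test_<number>_<timepoint>; the regular expression is an assumption based on the column names in the question.

```r
library(tidyr)

layout_1 <- data.frame(
  id = c("A", "B", "C", "D", "E"),
  test_1_baseline = c(1, 2, 4, 5, 6),
  test_1_followup = c(0, 4, 2, 3, 1),
  test_2_baseline = c(21000, 23400, 26800, 29000, 30000),
  test_2_followup = c(10000, 12000, 13000, 15000, 21000),
  stringsAsFactors = FALSE
)

layout_2 <- layout_1 %>%
  pivot_longer(
    -id,
    # first capture group becomes the output column name (test_1, test_2),
    # second capture group becomes the value of the `time` column
    names_to = c(".value", "time"),
    names_pattern = "(test_\\d+)_(.*)"
  )

layout_2
```

Unlike the sep = 6 approach, the pattern does not depend on the test name having a fixed number of characters.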

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual could have more than one e_date, I'm trying to get the earliest date for each individual. So basically I would like to have a dataset with one row per each id showing his earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums using the new variable. I've come to this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids appear more than once with different e_date. Plus, the code gives me different results when I use the as.Date format instead of its original format for the date (integer). I think the answer is simple but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data_full)), order by 'e_date' and, grouped by 'id', get the 1st row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
group_by(id) %>%
arrange(e_date) %>%
slice(1L)
If we need a base R option, ave can be used
data_full[with(data_full, as.logical(ave(as.numeric(e_date), id, FUN = function(x) rank(x, ties.method = "first") == 1))), ]
Another answer that uses dplyr's filter command:
data_full %>%
group_by(id) %>%
filter(e_date == min(e_date))
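Note that the filter approach keeps every row that ties on the earliest date. If exactly one row per id is wanted, slice_min() with with_ties = FALSE is one option. The sketch below assumes dplyr 1.0.0 or later and uses a small made-up data_full with e_date stored as Date.

```r
library(dplyr)

# Made-up sample data; e_date is converted to Date so min() compares dates
data_full <- data.frame(
  id = c("789", "123", "456", "123", "123", "456", "789"),
  e_date = as.Date(c("2016-05-01", "2016-07-02", "2016-08-25", "2015-12-11",
                     "2014-03-01", "2015-07-08", "2015-12-11")),
  stringsAsFactors = FALSE
)

earliest <- data_full %>%
  group_by(id) %>%
  # one row per id: the row with the smallest e_date
  slice_min(e_date, n = 1, with_ties = FALSE) %>%
  ungroup()

earliest
```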
You may use library(sqldf) to get the minimum date as follows:
data1<-data.frame(id=c("789","123","456","123","123","456","789"),
e_date=c("2016-05-01","2016-07-02","2016-08-25","2015-12-11","2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 = sqldf("SELECT id,
min(e_date) as 'earliest_date'
FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
group_by(which_quarter) %>% summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
which_quarter sort(rand_weeks)[1]
<dbl> <time>
1 1 2017-01-05 05:46:32
2 2 2017-04-06 05:46:32
3 3 2016-08-18 05:46:32
4 4 2016-10-06 05:46:32
