How to create several new group-based variables most efficiently?

How to create several new group-based variables most efficiently? - r

Let's use the following example:
set.seed(2409)
N=5
T=10
id<- rep(LETTERS[1:N],each=T)
time<-rep(1:T, times=N)
var1<-runif(N*T,0,100)
var2<-runif(N*T,0,100)
var3<-runif(N*T,0,100)
var4<-runif(N*T,0,100)
var5<-runif(N*T,0,100)
df<-data.frame(id,time,var1,var2,var3,var4,var5); rm(N,T,id,time,var1,var2,var3,var4,var5)
I now try to execute a function for several of these variables (not the whole series of variables!) and create new variables accordingly.
I already have a suitable code for creating log variables. For this I would use the following code:
cols <- c("var1",
"var3",
"var5")
log <- log(df[cols])
colnames(log) <- paste(colnames(log), "log", sep = "_")
df <- cbind(df,log); rm(log, cols)
This would give me my additional log variables. But now I also want to create lagged and z-transformed variables. These functions refer to the individual IDs. So I wrote the following code that of course works, but is extremely long and inefficient in my real dataset where I apply the function to 38 variables each:
library(Hmisc)
library(dplyr)
df<-df %>%
group_by(id) %>%
mutate(var1_1=Lag(var1, shift=1),
var3_1=Lag(var3, shift=1),
var5_1=Lag(var5, shift=1),
var1_2=Lag(var1, shift=2),
var3_2=Lag(var3, shift=2),
var5_2=Lag(var5, shift=2),
var1_z=scale(var1),
var3_z=scale(var3),
var5_z=scale(var5)
)
I am very sure that there is also a way to make this more efficient. It would be desirable if I could define the original variable once and execute different functions and create new variables as a result.
Thank you very much!

You can use mutate_at with funs. This will apply the three functions in funs to each of the three variables in vars, creating 9 new columns.
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(vars(var1, var3, var5),
funs(lag1 = lag(.), lag2 = lag(., 2), scale))
# # A tibble: 50 x 16
# # Groups: id [5]
# id time var1 var2 var3 var4 var5 var1_lag1 var3_lag1 var5_lag1
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 38.8 25.7 29.2 91.1 35.3 NA NA NA
# 2 A 2 87.1 22.3 8.27 31.5 93.7 38.8 29.2 35.3
# 3 A 3 61.7 38.8 0.887 63.0 50.4 87.1 8.27 93.7
# 4 A 4 0.692 60.1 71.5 74.0 41.6 61.7 0.887 50.4
# 5 A 5 60.1 13.3 90.4 80.6 47.5 0.692 71.5 41.6
# 6 A 6 46.4 3.67 36.7 86.9 67.5 60.1 90.4 47.5
# 7 A 7 80.4 72.1 82.2 25.5 70.3 46.4 36.7 67.5
# 8 A 8 48.8 25.7 93.4 19.8 81.2 80.4 82.2 70.3
# 9 A 9 48.2 31.5 82.1 47.2 49.2 48.8 93.4 81.2
# 10 A 10 21.8 32.6 76.5 19.7 41.1 48.2 82.1 49.2
# # ... with 40 more rows, and 6 more variables: var1_lag2 <dbl>, var3_lag2 <dbl>,
# # var5_lag2 <dbl>, var1_scale <dbl>, var3_scale <dbl>, var5_scale <dbl>

Here is an option with data.table
library(data.table)
nm1 <- c('var1', 'var3', 'var5')
nm2 <- paste0(nm1, rep(c('_lag1', '_lag2'), each = 3))
nm3 <- paste0(nm1, '_scale')
setDT(df)[, c(nm2, nm3) := c(shift(.SD, n = 1:2), lapply(.SD,
function(x) as.vector(scale(x)))), by = id, .SDcols = nm1]'

Related

Add new variable with arithmetic conditions

the randomly generated data frame contains ID, Dates, and Earnings. I changed up the data frame format so that each column represents a date and its values corresponds to the earnings.
I want to create a new variable named "Date_over100 " that would determine the date when one's cumulative earnings have exceeded 100. I have put below a reproducible code that would generate the data frame. I assume conditional statements or loops would be used to achieve this. I would appreciate all the help there is. Thanks in advance!
ID <- c(1:10)
Date <- sample(seq(as.Date('2021/01/01'), as.Date('2021/01/11'), by="day", replace=T), 10)
Earning <- round(runif(10,30,50),digits = 2)
df <- data.frame(ID,Date,Earning,check.names = F)
df1 <- df%>%
arrange(Date)%>%
pivot_wider(names_from = Date, values_from = Earning)
df1 <- as.data.frame(df1)
df1[is.na(df1)] <- round(runif(sum(is.na(df1)),min=30,max=50),digits = 2)

I go back to long format for the calculation, then join to the wide data:
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -ID, names_to = "date") %>%
group_by(ID) %>%
summarize(Date_over_100 = Date[which.max(cumsum(value) > 100)]) %>%
right_join(df1, by = "ID")
# # A tibble: 10 × 12
# ID Date_over_100 `2021-01-04` `2021-01-01` `2021-01-08` `2021-01-11` `2021-01-02` `2021-01-09`
# <int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2021-01-08 45.0 46.2 40.1 47.4 47.5 48.8
# 2 2 2021-01-08 36.7 30.3 36.2 47.5 41.4 41.7
# 3 3 2021-01-08 49.5 46.0 45.0 43.9 45.4 37.1
# 4 4 2021-01-08 31.0 48.7 47.3 40.4 40.8 35.5
# 5 5 2021-01-08 48.2 35.2 32.1 44.2 35.4 49.7
# 6 6 2021-01-08 40.8 37.6 31.8 40.3 38.3 42.5
# 7 7 2021-01-08 37.9 42.9 36.8 46.0 39.8 33.6
# 8 8 2021-01-08 47.7 47.8 39.7 46.4 43.8 46.5
# 9 9 2021-01-08 32.9 42.0 41.8 32.8 33.9 35.5
# 10 10 2021-01-08 34.5 40.1 42.7 35.9 44.8 31.8
# # … with 4 more variables: 2021-01-10 <dbl>, 2021-01-03 <dbl>, 2021-01-07 <dbl>, 2021-01-05 <dbl>

Pivot Longer with Modification of Columns

I have data that is in the following format:
(data <- tribble(
~Date, ~ENRSxOPEN, ~ENRSxCLOSE, ~INFTxOPEN, ~INFTxCLOSE,
"1989-09-11",82.97,82.10,72.88,72.56,
"1989-09-12",83.84,83.96,73.52,72.51,
"1989-09-13",83.16,83.88,72.91,72.12))
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
For analysis, I want to pivot this tibble longer to the following format:
tribble(
~Ticker, ~Date, ~OPEN, ~CLOSE,
"ENRS","1989-09-11",82.97,82.10,
"ENRS","1989-09-12",83.84,83.96,
"ENRS","1989-09-13",83.16,83.88,
"INFT","1989-09-11",72.88,72.56,
"INFT","1989-09-12",73.52,72.51,
"INFT","1989-09-13",72.91,72.12)
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
I.e., I want to separate the Open/Close prices from the ticker, and put the latter as an entirely new column in the beginning.
I've tried to use the function pivot_longer:
pivot_longer(data, cols = ENRSxOPEN:INFTxCLOSE)
While this goes into the direction of what I wanna achieve, it does not separate the prices and keep them in one row for each Ticker.
Is there a way to add additional arguments to pivot_longer()to achieve that?

pivot_longer(data, -Date, names_to = c('Ticker', '.value'), names_sep = 'x')
# A tibble: 6 x 4
Date Ticker OPEN CLOSE
<dbl> <chr> <dbl> <dbl>
1 1969 ENRS 83.0 82.1
2 1969 INFT 72.9 72.6
3 1968 ENRS 83.8 84.0
4 1968 INFT 73.5 72.5
5 1967 ENRS 83.2 83.9
6 1967 INFT 72.9 72.1

How to get returns from a DF

So I'm using the quantmod library to calculate historical returns, but while I can get the past prices, how can I calculate the returns and add it on to the dataframe???
My code looks like this
tickers <- c('KO', 'AAPL')
getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
history <- cbind(KO$KO.Close,AAPL$AAPL.Close)

First I did a way to better import and structure data
Import
library(quantmod)
library(tidyverse)
tickers <- c('KO', 'AAPL')
df <-
map_df(
.x = tickers,
.f = function(x){
getSymbols(x, from = '2020-07-01', to = '2021-07-01',auto.assign = FALSE) %>%
as_tibble() %>%
set_names(c("open","high","low","close","volume","adjusted")) %>%
mutate(symbol = x)
}
)
# A tibble: 504 x 7
open high low close volume adjusted symbol
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO
2 45.3 45.4 44.8 44.9 15111900 43.2 KO
3 45.1 45.3 44.6 45.2 15146000 43.5 KO
4 45 45.5 44.8 45.2 13043600 43.5 KO
5 45.1 45.2 44.5 45.1 13851200 43.3 KO
6 45.0 45.0 43.8 43.9 16087100 42.2 KO
7 43.9 45.2 43.9 45.2 15627800 43.4 KO
8 45.5 45.7 45.0 45.2 16705300 43.5 KO
9 44.9 45.9 44.7 45.9 17080100 44.1 KO
10 46.3 47.2 46.2 46.4 23738000 44.6 KO
Return
I do not know if this is the right formula for return, but you can change later inside mutate
df %>%
group_by(symbol) %>%
mutate(return = 100*((open/lag(open))-1))
# A tibble: 504 x 8
# Groups: symbol [2]
open high low close volume adjusted symbol return
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO NA
2 45.3 45.4 44.8 44.9 15111900 43.2 KO 0.801
3 45.1 45.3 44.6 45.2 15146000 43.5 KO -0.331
4 45 45.5 44.8 45.2 13043600 43.5 KO -0.310
5 45.1 45.2 44.5 45.1 13851200 43.3 KO 0.311
6 45.0 45.0 43.8 43.9 16087100 42.2 KO -0.199
7 43.9 45.2 43.9 45.2 15627800 43.4 KO -2.60
8 45.5 45.7 45.0 45.2 16705300 43.5 KO 3.76
9 44.9 45.9 44.7 45.9 17080100 44.1 KO -1.36
10 46.3 47.2 46.2 46.4 23738000 44.6 KO 3.10
# ... with 494 more rows

Assuming the return you're looking for as today's value/yesterday's value, and using the tidyverse:
library(tidyverse)
library(timetk)
tickers <- c('KO', 'AAPL')
quantmod::getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
# Convert to a tibble to keep the dates
equity1 <- tk_tbl(KO) %>%
select(date = index, 5)
equity2 <- tk_tbl(AAPL) %>%
select(date = index, 5)
# Combine the series using a join, in case dates don't line up exactly.
history <- full_join(equity1, equity2, by = "date")
# Make data long, group by equity, do the calculation, turn back into wide data:
return <- history %>%
pivot_longer(-date) %>%
group_by(name) %>%
mutate(return = value/lag(value)-1) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = c(value, return))
# A tibble: 252 x 5
date value_KO.Close value_AAPL.Close return_KO.Close return_AAPL.Close
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-07-01 44.8 91.0 NA NA
2 2020-07-02 44.9 91.0 0.00134 0
3 2020-07-06 45.2 93.5 0.00780 0.0268
4 2020-07-07 45.2 93.2 -0.000442 -0.00310
5 2020-07-08 45.1 95.3 -0.00310 0.0233
6 2020-07-09 43.9 95.8 -0.0257 0.00430
7 2020-07-10 45.2 95.9 0.0282 0.00175
8 2020-07-13 45.2 95.5 0.00221 -0.00461
9 2020-07-14 45.9 97.1 0.0137 0.0165
10 2020-07-15 46.4 97.7 0.0116 0.00688
# ... with 242 more rows

R: Why is merge dropping data? How to interpolate missing values for a merge

I am trying to merge two relatively large datasets. I am merging by SiteID - which is a unique indicator of location, and date/time, which are comprised of Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr"). So, I needed to interpolate missing values from dB before I could merge (see answer below).
END EDIT
see link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
Now to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see if R recognizes any as missing. One siteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarize by SiteID), it looks like a nice, full dataframe - meaning I have data for every site that I should have.
library(dplyr)
dB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two data sets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr" - keeping all PC17 values (even if they don't have dB values to go with it; all.x=TRUE).
However, when I look at the summary of this data, now all of the SiteID have different values, and some sites are missing completely such as "CU3D" and "CU4D".
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr function left_join to merge the datasets, and it hasn't made a difference.

problay solved when using na.rm = TRUE in your summarising functions...
a data.table approach:
library( data.table )
dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )
#data.table left join on "SiteID", "Year","Mo","Day", "Hr", and the summarise...
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list(min_dBL50 = min( dbAL050, na.rm = TRUE ),
max_dBL50 = max( dbAL050, na.rm = TRUE ),
mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
med_dBL50 = median( dbAL050, na.rm = TRUE )
),
by = "SiteID" ],
SiteID)
head( result, 10 )
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3D Inf -Inf NaN NA
# 8: CU3M 44.1 71.2 57.650 57.65
# 9: CU3U 45.0 45.0 45.000 45.00
# 10: CU4D Inf -Inf NaN NA
If you would like to perform a left join, but exclude hits that cannot be found (so you do not get rows like the one above on "CU3D") use:
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]
this will result in:
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3M 44.1 71.2 57.650 57.65
# 8: CU3U 45.0 45.0 45.000 45.00
# 9: CU4M 52.4 55.9 54.150 54.15
# 10: CU4U 51.3 51.3 51.300 51.30

In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values, since it was only doing exactly as one tells it. However, since datasets were merged by SiteID, Year, Mo, Day, Hr the result was Inf, NaN, and NA values for a few SiteID.
The reason for this is that dB is not a fully continuous dataset to merge with. Thus, Inf, NaN, and NA values for some SiteID were returned because data did not overlap in all variables (SiteID, Year, Mo, Day, Hr).
So I solved this problem with interpolation. That is, I filled the missing values in based on values from dates on either side of the missing values. The package imputeTS was valuable here.
So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.
library(imputeTS)
library(tidyverse)
### We want to first interpolate dB values on the siteID first in dB dataset, BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!!
dB<-read.csv("dB.csv")
dB_clean <- dB %>%
mutate_if(is.integer, as.character)
# Create a wide table with spots for each minute. Missing will
# show up as NA's
# All the NA's here in the columns represent
# missing jDays that we should add. jDay is an integer date 'julian date'
dB_NA_find <- dB_clean %>%
count(SiteID, jDay) %>%
spread(jDay, n)
dB_NA_find
# A tibble: 59 x 88
# SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640` `13641`
# <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 CU1D NA NA NA NA NA NA NA NA
# 2 CU1M NA 11 24 24 24 24 24 24
# 3 CU1U NA 11 24 24 24 24 24 24
# 4 CU2D NA NA NA NA NA NA NA NA
# 5 CU2M NA 9 24 24 24 24 24 24
# 6 CU2U NA 9 24 24 24 24 21 NA
# 7 CU3D NA NA NA NA NA NA NA NA
# 8 CU3M NA NA NA NA NA NA NA NA
# 9 CU3U NA NA NA NA NA NA NA NA
# 10 CU4D NA NA NA NA NA NA NA NA
# Take the NA minute entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
gather(jDay, count, 2:88) %>%
filter(is.na(count)) %>%
select(-count, -NA)
# Add these lines to the original, remove the NA jDay rows
# (these have been replaced with jDay rows), and sort
dB <- dB_clean %>%
bind_rows(dB_rows_to_add) %>%
filter(jDay != "NA") %>%
arrange(SiteID, jDay)
length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030
## Here is where we do the na.interpolation with package imputeTS
# prime the for loop with zeros
D<-rep("0",17)
sites<-unique(dB$SiteID)
for(i in 1:length(sites)){
temp<-dB[dB$SiteID==sites[i], ]
temp<-temp[order(temp$jDay),]
temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
D<-rbind(D, temp)
}
# delete the first row of zeros from above 'priming'
dBN<-D[-1,]
length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.
dBN$Year<-"2017" #all data are from 2017
##I could not figure out how jDay was formatted, so I created a manual 'key'
##to get Mo and Day by counting from a known date/jDay pair in original data
#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30
key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
#make master 'key'
key<-rbind(key4,key5,key6,key7)
# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
#clean unecessary columns and rename 'Mo' and 'Day' so it matches PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
#I noticed an issue with duplication - merge with PC17 created a massive dataframe
dBM %>% ### Have too many observations per day, will duplicate merge out of control.
count(SiteID, jDay, DayL50) %>%
summarise(
min=min(n, na.rm=TRUE),
mean=mean(n, na.rm=TRUE),
max=max(n, na.rm=TRUE)
)
## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now run above line again to check how many observations per day are left. Should be 1
Now when you do the merge with dB and PC17, the interpolated values (that were missing NAs before) should be included. It will look something like this:
> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even stuff that DOESNT have dB data that corresponds to it.
> library(dplyr)
#Here is the NA interpolated 'dB' dataset
> dB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.7 53.1 49.4 50.2
2 CU1M 37.6 65.2 59.5 62.6
3 CU1U 35.5 51 43.7 44.8
4 CU2D 42 52 47.8 49.3
5 CU2M 38.2 49 43.1 42.9
6 CU2U 34.1 53.7 46.5 47
7 CU3D 46.1 53.3 49.7 49.4
8 CU3M 44.5 73.5 61.9 68.2
9 CU3U 42 52.6 47.0 46.8
10 CU4D 42 45.3 44.0 44.6
# ... with 49 more rows
# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 60 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.8 50 46.8 47
2 CU1M 59 63.9 62.3 62.9
3 CU1U 37.9 46 43.6 44.4
4 CU2D 42.1 51.6 45.6 44.3
5 CU2M 38.4 48.3 44.2 45.5
6 CU2U 39.8 50.7 45.7 46.4
7 CU3D 46.5 49.5 47.7 47.7
8 CU3M 67.7 71.2 69.5 69.4
9 CU3U 43.3 52.6 48.1 48.2
10 CU4D 43.2 45.3 44.4 44.9
# ... with 50 more rows

Nesting several groups of columns inside a data frame

The concept of nesting several columns into a single list-column is very powerful. However, I am not sure whether it is possible at all to nest more than one set of columns into several list-columns within the same pipeline using the nest function in {tidyr}. For instance, assume I have the following data frame:
df <- as.data.frame(replicate(6, runif(10) * 100))
colnames(df) <- c(
paste0("a", 1:2), # a1, a2
paste0("b", 1:4) # b1, b2, b3, b4
)
df
a1 a2 b1 b2 b3 b4
1 20.807348 69.339482 91.837151 99.76813 3.394350 33.780049
2 64.667733 20.676381 80.523369 38.42774 85.635208 60.111491
3 55.352501 55.699571 4.812923 38.65333 98.869203 80.345576
4 45.194094 16.511696 83.834651 51.48698 7.191081 16.697210
5 66.401642 89.041055 26.965636 67.90061 90.622428 59.552935
6 35.750100 55.997766 49.768556 68.45900 67.523080 58.993232
7 21.392823 5.335281 56.348328 35.68331 51.029617 66.290035
8 8.851236 19.486580 14.199370 22.49754 14.617592 18.236406
9 70.475652 6.229997 43.169364 12.63378 21.415589 2.163004
10 47.837613 37.641530 38.001288 71.15896 71.000568 2.135611
I would like to nest the "a" columns into a list-column AND nest the "b" columns into a second list-column because I would like to perform different computations on them.
Nesting the "a" columns works:
library(tidyr)
nest(df, a1, a2, .key = "a")
b1 b2 b3 b4 a
1 91.837151 99.76813 3.394350 33.780049 20.80735, 69.33948
2 80.523369 38.42774 85.635208 60.111491 64.66773, 20.67638
3 4.812923 38.65333 98.869203 80.345576 55.35250, 55.69957
4 83.834651 51.48698 7.191081 16.697210 45.19409, 16.51170
5 26.965636 67.90061 90.622428 59.552935 66.40164, 89.04105
6 49.768556 68.45900 67.523080 58.993232 35.75010, 55.99777
7 56.348328 35.68331 51.029617 66.290035 21.392823, 5.335281
8 14.199370 22.49754 14.617592 18.236406 8.851236, 19.486580
9 43.169364 12.63378 21.415589 2.163004 70.475652, 6.229997
10 38.001288 71.15896 71.000568 2.135611 47.83761, 37.64153
But it is impossible to nest the "b" columns AFTER the "a" columns have been nested:
nest(df, a1, a2, .key = "a") %>%
nest(b1, b2, b3, b4, .key = "b")
Error in grouped_df_impl(data, unname(vars), drop) :
Column `a` can't be used as a grouping variable because it's a list
which makes sense by reading the error message.
My work-around is to:
nest the "a" columns
perform the required computations on the "a" list-column
unnest the "a" list-column
nest the "b" columns
perform the required computations on the "b" list-column
unnest the "b" list-column
Is there a more straight-forward way to achieve this? Your help is much appreciated.

We can use map to do this
library(tidyverse)
out <- list('a', 'b') %>%
map(~ df %>%
select(matches(.x)) %>%
nest(names(.), .key = !! rlang::sym(.x))) %>%
bind_cols
out
# A tibble: 1 x 2
# a b
# <list> <list>
#1 <data.frame [10 × 2]> <data.frame [10 × 4]>
out %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20.8 69.3 91.8 99.8 3.39 33.8
# 2 64.7 20.7 80.5 38.4 85.6 60.1
# 3 55.4 55.7 4.81 38.7 98.9 80.3
# 4 45.2 16.5 83.8 51.5 7.19 16.7
# 5 66.4 89.0 27.0 67.9 90.6 59.6
# 6 35.8 56.0 49.8 68.5 67.5 59.0
# 7 21.4 5.34 56.3 35.7 51.0 66.3
# 8 8.85 19.5 14.2 22.5 14.6 18.2
# 9 70.5 6.23 43.2 12.6 21.4 2.16
#10 47.8 37.6 38.0 71.2 71.0 2.14
We could do the separate computations on the 'a' and 'b' list of columns
out %>%
mutate(a = map(a, `*`, 4)) %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 83.2 277. 91.8 99.8 3.39 33.8
# 2 259. 82.7 80.5 38.4 85.6 60.1
# 3 221. 223. 4.81 38.7 98.9 80.3
# 4 181. 66.0 83.8 51.5 7.19 16.7
# 5 266. 356. 27.0 67.9 90.6 59.6
# 6 143. 224. 49.8 68.5 67.5 59.0
# 7 85.6 21.3 56.3 35.7 51.0 66.3
# 8 35.4 77.9 14.2 22.5 14.6 18.2
# 9 282. 24.9 43.2 12.6 21.4 2.16
#10 191. 151. 38.0 71.2 71.0 2.14
Having said that, it is also possible to select columns of interest with mutate_at instead of doing nest/unnest
df %>%
mutate_at(vars(matches('^a\\d+')), funs(.*4))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to create several new group-based variables most efficiently? - r

Here is an option with data.table library(data.table) nm1 <- c('var1', 'var3', 'var5') nm2 <- paste0(nm1, rep(c('_lag1', '_lag2'), each = 3)) nm3 <- paste0(nm1, '_scale') setDT(df)[, c(nm2, nm3) := c(shift(.SD, n = 1:2), lapply(.SD, function(x) as.vector(scale(x)))), by = id, .SDcols = nm1]'

Related

Add new variable with arithmetic conditions

Pivot Longer with Modification of Columns

How to get returns from a DF

R: Why is merge dropping data? How to interpolate missing values for a merge

Nesting several groups of columns inside a data frame

Categories

Resources