Create new variables with lag data from all current variables - r

My dataset has about 20 columns, and I would like to create 7 new columns of lagged data for each of them.
For example, given columns x, y, and z, I would like to create xlag1, xlag2, xlag3, xlag4, xlag5, xlag6, xlag7, ylag1, ylag2, etc.
My current attempt uses dplyr in R:
aq %>% mutate(.,
xlag1 = lag(x, 1),
xlag2 = lag(x, 2),
xlag3 = lag(x, 3),
xlag4 = lag(x, 4),
xlag5 = lag(x, 5),
xlag6 = lag(x, 6),
xlag7 = lag(x, 7),
)
As you can see, it will take a lot of lines of code to cover all 20 columns. Is there a more efficient way of doing this? Preferably in dplyr, as it is the package I'm most familiar with.

We can use data.table. Its shift function accepts a vector for 'n', so a single call returns a list with one lagged vector per value of 'n'.
library(data.table)
setDT(aq)[, paste0('xlag', 1:7) := shift(x, 1:7)]
If there are multiple columns,
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
c(shift(x, 1:7), shift(y, 1:7))]
If we have many columns, specify them in .SDcols, loop over .SD with lapply to get the shifted vectors, flatten the nested list with unlist(recursive = FALSE), and assign the result to the new columns
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
unlist(lapply(.SD, shift, n = 1:7), recursive = FALSE) , .SDcols = x:y]
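With 20 columns, the new names can be generated from the existing ones instead of being typed out. A minimal sketch of that idea, assuming the toy aq from the data section; cols, lags and new_names are illustrative names that do not appear in the answer itself:

```r
library(data.table)

aq <- data.frame(x = 1:20, y = 21:40)  # toy data from the answer
dt <- as.data.table(aq)                # work on a copy instead of converting aq in place

cols <- names(dt)                      # every existing column
lags <- 1:7

# "xlag1", ..., "xlag7", "ylag1", ..., "ylag7": same order as the flattened list below
new_names <- paste0(rep(cols, each = length(lags)), "lag", lags)

# One list of 7 shifted vectors per column, flattened into a single list of 14
dt[, (new_names) := unlist(lapply(.SD, shift, n = lags), recursive = FALSE),
   .SDcols = cols]
```

Because the names are built from cols, the same call should scale to any number of columns without further edits.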
We can also use shift inside a dplyr pipeline (shift still comes from data.table, so that package has to be loaded as well)
library(dplyr)
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7)), c(names(aq), paste0('xlag', 1:7))))
and for multiple columns
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7), shift(.$y, 1:7)),
c(names(aq), paste0(rep(c("xlag", "ylag"), each = 7), 1:7) )))
data
aq <- data.frame(x = 1:20, y = 21:40)
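Since the question asks for dplyr, here is a sketch that stays within dplyr by combining across() with a named list of lag functions. It assumes dplyr 1.0.0 or later; lag_funs and aq_lagged are illustrative names, and this is an alternative to the answer above rather than part of it:

```r
library(dplyr)

aq <- data.frame(x = 1:20, y = 21:40)

# One lag function per offset, named lag1..lag7 (the names feed the {.fn} placeholder)
lag_funs <- lapply(1:7, function(k) {
  force(k)
  function(v) dplyr::lag(v, n = k)
})
names(lag_funs) <- paste0("lag", 1:7)

# Apply every function to every column; new columns are named <column><function>
aq_lagged <- aq %>%
  mutate(across(everything(), lag_funs, .names = "{.col}{.fn}"))
```

Replacing everything() with a selection such as c(x, y) would restrict the lags to a subset of columns.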

Related

Concatenate columns if they don't contain a zero

I'm trying to concatenate 4 columns into a single column named "tags" for later use in multilabel classification. I would like to concatenate the columns so that only the values that are not zero are pasted, separated by commas.
For example, the cell in the last row would be {1,2} instead of {1,2,0,0}.
I currently have no code that works as needed and haven't been able to find anything on the internet. Do you have a tip for doing this?
Current code:
df$TV[df$TV==1] = '1'
df$Internet[df$Internet ==1] = '2'
df$Mobil[df$Mobil==1] = '3'
df$Fastnet[df$Fastnet==1] = '4'
df$tags = paste(df$TV,df$Internet,df$Mobil,df$Fastnet, sep=",")
Base R option using apply -
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
#df <- read.csv('stack.csv')
df$tags <- apply(df[cols], 1, function(x) toString(x[x!= 0]))
df
In dplyr we can use rowwise -
library(dplyr)
df <- df %>%
rowwise() %>%
mutate(tags = {
tmp <- c_across(all_of(cols))
toString(tmp[tmp != 0])
}) %>%
ungroup
df
We may use dapply from collapse
library(collapse)
cols <- c('TV', 'Internet', 'Mobil', 'Fastnet')
df$tags <- dapply(slt(df, cols), MARGIN = 1, FUN = function(x) toString(x[x != 0]))
data
df <- data.frame(TV = c(1, 3, 2, 0), Internet = c(1, 0, 1, 4), Mobil = c(0, 1, 3, 2), Fastnet = c(1, 5, 3, 2))
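The question recodes each 0/1 flag to its column position (TV to 1, Internet to 2, and so on) before pasting. Under the assumption that the columns really hold only 0/1 flags, that recoding step can be folded into the row-wise call with which(). A small sketch on made-up data (flags is a hypothetical data frame, not the asker's):

```r
# Assumption: every column holds only 0/1 flags, as in the question's own code
flags <- data.frame(TV       = c(1, 0, 1, 0),
                    Internet = c(1, 1, 0, 0),
                    Mobil    = c(0, 0, 0, 0),
                    Fastnet  = c(0, 1, 0, 0))
cols <- c("TV", "Internet", "Mobil", "Fastnet")

# which() returns the positions of the non-zero flags in each row;
# toString() joins them with ", " and yields "" for an all-zero row
flags$tags <- apply(flags[cols], 1, function(r) toString(which(r != 0)))
```

The order of cols determines which number each column maps to, so it has to match the intended labelling.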

Calculate field with previous value in dplyr by group

I want to use dplyr to calculate a field using its previous value. A for loop would do the job, but I want to calculate by different groups of st. I understand mutate (with lag or ave) can't use an uninitialized field.
df <- data.frame(st = rep(c('a','b', 'c'), each = 5),
z = rep(c(10,14,12), each = 5),
day = rep(1:5, 3),
GAI = rep(0:4, 3),
surfT = sample(1:15))
df %>%
group_by(st) %>%
mutate(soilT = lag(soilT, order_by = day) + (surfT - lag(soilT,
order_by = day))*0.24*exp(-z*0.017)*exp(-0.15*GAI))
or
df %>%
group_by(st) %>%
mutate(soilT = ave(soilT, c(st), FUN=function(x) c(0, soilT + (surfT - soilT)))
*0.24*exp(-z*0.017)*exp(-0.15*GAI))
How can a simple for loop like this one be calculated in dplyr by group of st?
df$soilT <- 0
for (i in 2:nrow(df)) {
# the trailing * keeps the expression going onto the next line
df$soilT[i] <- df$soilT[i-1] + (df$surfT[i] - df$soilT[i-1]) *
0.24*exp(-df$z[i]*0.017)*exp(-0.15*df$GAI[i])
}
We can use accumulate from purrr to get the output of previous row as an input to current row.
library(dplyr)
result <- df %>%
group_by(st) %>%
mutate(soilT = purrr::accumulate(2:n(),
~.x + (surfT[.y] - .x)*0.24*exp(-z[.y]*0.017)*exp(-0.15*GAI[.y]),
.init = 0))
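The same recurrence can be written in base R with Reduce(accumulate = TRUE), which may help in seeing what accumulate does: each step receives the previous soilT and the current row index. This is a sketch under the assumption that rows are already ordered by day within each group; soil_one_group is a hypothetical helper name and the seed is only there for reproducibility:

```r
set.seed(42)  # surfT is random in the question; the seed just makes the run repeatable
df <- data.frame(st = rep(c("a", "b", "c"), each = 5),
                 z = rep(c(10, 14, 12), each = 5),
                 day = rep(1:5, 3),
                 GAI = rep(0:4, 3),
                 surfT = sample(1:15))

# soilT for one group: start at 0, then update from the previous value row by row
soil_one_group <- function(d) {
  step <- function(prev, i) {
    prev + (d$surfT[i] - prev) * 0.24 * exp(-d$z[i] * 0.017) * exp(-0.15 * d$GAI[i])
  }
  Reduce(step, seq_len(nrow(d))[-1], init = 0, accumulate = TRUE)
}

# Restart the recurrence for every group of st
df$soilT <- NA_real_
for (g in unique(df$st)) {
  rows <- df$st == g
  df$soilT[rows] <- soil_one_group(df[rows, ])
}
```

The first row of every group keeps the initial value 0, which is what .init = 0 does in the purrr version.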

Custom dcasting via data.table in R

Here is my data
dt = data.table(x=sample(8,20,TRUE),
y=sample(2,20,TRUE),
w = sample(letters[5:20], 20, TRUE),
u = sample(letters[2:25], 20, TRUE),
z=sample(letters[1:4], 20,TRUE),
d1 = runif(20), d2=runif(20))
Here is my dcasting code.
DC_1 = dcast.data.table(dt,x+w ~ z, value.var = "d1")
This works fine. However my data could also additionally include column 'a' and column 's' as shown below. Both of them could be included, either one, or none of them.
dt = data.table(x=sample(8,20,TRUE),
y=sample(2,20,TRUE),
w = sample(letters[5:20], 20, TRUE),
u = sample(letters[2:25], 20, TRUE),
z=sample(letters[1:4], 20,TRUE),
a = sample(letters[1:25], 20, T),
s = sample(letters[2:17], 20, T),
d1 = runif(20), d2=runif(20))
The additional columns, however, would always be character columns. Also, my data always has to be cast on column 'z', and the value variable is always 'd1'.
How do I dcast via data.table such that it takes all the character columns (except z) available in the data table and casts them on z?
We could drop the unwanted column ('d2') and use ... on the lhs of ~ to stand for all the remaining columns, with 'z' on the rhs of the formula
dcast(dt[, setdiff(names(dt), 'd2'), with = FALSE], ... ~ z, value.var = 'd1')
Or get the column names of the character columns programmatically
nm1 <- dt[, names(which(unlist(lapply(.SD, is.character))))]
nm2 <- setdiff(nm1, 'z')
dcast(dt,paste0(paste(nm2, collapse="+"), "~ z"), value.var = 'd1')
Or another option is select from dplyr
library(dplyr) #1.0.0
dcast(dt[, select(.SD, where(is.character), d1)], ... ~ z, value.var = 'd1')
A similar option in tidyverse would be
library(tidyr)
dt %>%
select(where(is.character), d1) %>%
pivot_wider(names_from = z, values_from = d1)
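To handle data where 'a' and 's' may or may not be present, the programmatic variant can be wrapped in a small function. The sketch below is only an illustration: cast_on_z is a hypothetical helper, and the toy table uses z values ("k1", "k2") that cannot collide with a column named a, which could happen with the sampled letters in the question.

```r
library(data.table)

# Hypothetical helper: cast d1 on z, using every other character column as an id
cast_on_z <- function(dt) {
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  id_cols <- setdiff(chr_cols, "z")
  stopifnot(length(id_cols) > 0)  # the formula needs at least one id column
  f <- stats::as.formula(paste(paste(id_cols, collapse = " + "), "~ z"))
  dcast(dt, f, value.var = "d1")
}

small <- data.table(x = 1:3,
                    w = c("e", "e", "f"),
                    z = c("k1", "k2", "k1"),
                    d1 = c(1, 2, 3),
                    d2 = c(4, 5, 6))
wide <- cast_on_z(small)            # ids: w

small_a <- copy(small)
small_a[, a := c("p", "p", "q")]    # optional extra character column
wide_a <- cast_on_z(small_a)        # ids: w and a
```

Numeric columns such as x and d2 are ignored here because only character columns are picked up as ids.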

How to use box plot with column range

I want to plot the fee as a percentage of income (fee_per_inc) for each quintile of income (year_hh_inc).
This is what I have so far:
pacman::p_load(RCurl, plm, tibble, ggplot2, AER, dplyr, car, arm, broom, tidyr, fastDummies, dummies)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x, row.names=NULL)
maindf <- maindf %>%
mutate(category = cut(year_hh_inc, breaks = (quantile(year_hh_inc, c(0, 1 / 5, 2 / 5, 3 / 5, 4 / 5, 1), na.rm = TRUE)), labels = c("first_quint", "second_quint", "third_quint", 'fourth_quint', 'fifth_quint'), include.lowest = TRUE), vals = 1) %>%
pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
box <- boxplot(maindf$year_hh_inc ~ maindf$fee_per_inc, col = 3:5)
This is what I would like as an end result:
I think I have a bit more work to do; any help from this point is appreciated.
I think there were a couple of problems here. You need the boxplot to have the variables the other way round. Also, you need to use the category variable that you created in mutate instead of the original variable. Lastly, you don't need the pivot_wider.
Some of the values were also way outside the useful range and may have been wrong (some numbers were -8), so I have trimmed the outliers to make the graph prettier. You'll want to check the original data to see whether this makes sense.
pacman::p_load(RCurl, plm, tibble, ggplot2, AER, dplyr, car, arm, broom, tidyr, fastDummies, dummies)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x, row.names=NULL)
maindf <- maindf %>%
mutate(category = cut(year_hh_inc,
breaks = (quantile(year_hh_inc, c(0, 1/5, 2/5, 3/5, 4/5, 1), na.rm = TRUE)),
labels = c("first_quint", "second_quint", "third_quint",
'fourth_quint', 'fifth_quint'),
include.lowest = TRUE),
vals = 1)
maindf <- maindf[maindf$fee_per_inc > 0 & maindf$fee_per_inc < 0.01, ]
box <- boxplot(maindf$fee_per_inc ~ maindf$category, col = 3:5)
Created on 2020-03-03 by the reprex package (v0.3.0)
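The two key points of the answer (put the grouping factor on the right of ~ and use the quintile category rather than the raw income) can be tried without downloading the CSV. The sketch below uses synthetic data: the column names follow the question, but the values in toy are made up and say nothing about the real file.

```r
set.seed(1)
# Synthetic stand-in for maindf: names follow the question, values are invented
toy <- data.frame(year_hh_inc = rlnorm(500, meanlog = 10, sdlog = 0.5))
toy$fee_per_inc <- runif(500, min = 0, max = 0.01)

# Quintile membership: break points at the 0%, 20%, ..., 100% quantiles of income
toy$category <- cut(toy$year_hh_inc,
                    breaks = quantile(toy$year_hh_inc, probs = (0:5) / 5),
                    labels = c("first_quint", "second_quint", "third_quint",
                               "fourth_quint", "fifth_quint"),
                    include.lowest = TRUE)

# The formula reads "numeric outcome ~ grouping factor": one box per quintile
box <- boxplot(fee_per_inc ~ category, data = toy, col = 3:5)
```

Passing data = toy lets the formula use bare column names instead of repeating the data frame on both sides.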

Utilizing roll functions with data.table

I'm having problems specifically applying functions from the roll package using data.table. I'm attempting to calculate rolling metrics on column DT$obs for each group DT$group. I'm able to calculate rolling metrics using the zoo package, but I'd like to use some of the additional arguments in roll package functions.
Demo of the error is below.
require(data.table)
require(zoo)
require(roll)
# Fabricated Data:
DT <- data.table(group = rep(c("A", "B"), each = 20), obs = runif(40, min = 0, max = 100))
# Calculate a rolling sum (this is working properly)
DT[, RollingSum := lapply(.SD, function(x) zoo::rollsumr(x, k = 5, fill = NA)), by = "group", .SDcols = "obs"]
# Attempt to calculate a rolling z-score (this throws me an error)
DT[, RollingZScore := lapply(.SD, function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)), by = "group", .SDcols = "obs"]
I can't figure out what's different about the zoo function and the roll function. They each return numeric vectors. Any guidance appreciated.
As @Frank describes, the problem is that the result of roll_scale (and thus each element of lapply output) is a matrix. You can either use sapply instead of lapply, or put as.vector in your function definition.
DT[, RollingZScore := sapply(.SD,
function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)),
by = "group", .SDcols = "obs"]
or
DT[, RollingZScore := lapply(.SD,
function(x) as.vector(roll::roll_scale(as.matrix(x), width = 10, min_obs = 5))),
by = "group", .SDcols = "obs"]
This can also be done with zoo's rollapplyr by defining a function that returns NA if the window has fewer than 5 elements:
Scale <- function(x) if (length(x) < 5) NA else tail(scale(x), 1)
DT[, rollingScore := rollapplyr(obs, 10, Scale, partial = TRUE), by = "group"]
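To make explicit what the Scale function above computes, here is a base R sketch of a trailing-window z-score applied per group with ave(). It is not the roll package's implementation and is not guaranteed to match roll_scale in every edge case; roll_z and obs_df are illustrative names.

```r
# Trailing-window z-score of the newest value; NA until min_obs values are available
roll_z <- function(x, width = 10, min_obs = 5) {
  vapply(seq_along(x), function(i) {
    win <- x[max(1, i - width + 1):i]          # at most `width` values ending at i
    if (length(win) < min_obs) return(NA_real_)
    (x[i] - mean(win)) / sd(win)               # standardise against the window
  }, numeric(1))
}

set.seed(1)
obs_df <- data.frame(group = rep(c("A", "B"), each = 20),
                     obs = runif(40, min = 0, max = 100))

# ave() applies roll_z within each group and returns a plain numeric vector
obs_df$RollingZScore <- ave(obs_df$obs, obs_df$group, FUN = roll_z)
```

Because roll_z returns a plain numeric vector rather than a matrix, no as.vector step is needed before assigning it to a column.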
