Is there already a function to subtract different variables in subsequent quarters? - r

I have an unbalanced quarterly panel data set with missing values. I want to subtract variable A2 from A1 in subsequent quarters. Note that I do not want differences of A2 with itself, but want to subtract DIFFERENT variables from each other. Differences should be calculated separately for every uid. Quarters that span a change of year, like Q4 1999 and Q1 2000, also count as subsequent.
I am really not sure whether I should concatenate my time index here, since packages like zoo only take one index. But that's not the problem here. Here is some example data:
x <- structure(list(uid = c(1, 1, 1, 2, 2, 3, 3, 3), tndx = c(1999.4,
2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3), A1 = c(2,
2, 2, 10, 11, 1, 1, 1), A2 = c(3, 3, 3, 14, 14, 2, 100, 2)), .Names = c("uid",
"tndx", "A1", "A2"), row.names = c(NA, -8L), class = "data.frame")
# which results in
uid tndx A1 A2
1 1 1999.4 2 3
2 1 2000.1 2 3
3 1 2000.2 2 3
4 2 1999.4 10 14
5 2 2000.1 11 14
6 3 2000.1 1 2
7 3 2000.2 1 100
8 3 2000.3 1 2
If you prefer a separated index, use this example:
# Thx Andrie!
library(reshape2) # provides colsplit()
x2 <- data.frame(x, colsplit(x$tndx, "\\.", names=c("year", "qtr")))
Is there a good way to solve this with reshape2, plyr or even base or would you rather write a custom function?
Note, it is also possible that some uid occurs only once. Obviously you can't calculate a lagged difference then. Still I need to check for that and create an NA then.

We split the data on uid using by. Within the function that operates on the rows for a single uid, we create a zoo object, z, using the yearqtr class for the index. We then merge the time series with an empty series that has all the desired quarters, including any missing intermediate ones, giving zm, and perform the computation, giving zz. Finally, we convert back to data.frame form on the way out:
library(zoo)
to.yearqtr <- function(x) as.yearqtr(trunc(x) + (10*(x-trunc(x))-1)/4)
DF <- do.call("rbind", by(x, x$uid, function(x) {
# columns of x are: uid tndx A1 A2
z <- zoo(x[c("A1", "A2")], to.yearqtr(x$tndx))
zm <- merge(z, zoo(, seq(start(z), end(z), 1/4)))
zz <- with(zm, cbind(zm, `A1 - A2 lag` = A1 - lag(A2, -1)))
if (ncol(zz) <= ncol(z)) zz$`A1 - A2 lag` <- NA # append NA if col not added
data.frame(uid = x[1, 1], tndx = time(zz), coredata(zz), check.names = FALSE)
}))
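which gives output of this form (the printout below comes from a different version of the example data, so its values and column names do not match x above):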
> DF
uid tndx A1 A2 result A1 - A2 lagged
1.1 1 1999 Q4 2 3 NA NA
1.2 1 2000 Q1 2 2 NA -1
1.3 1 2000 Q2 2 3 NA 0
2.1 2 1999 Q4 2 4 NA NA
2.2 2 2000 Q1 NA NA NA NA
2.3 2 2000 Q2 NA NA NA NA
2.4 2 2000 Q3 NA NA NA NA
2.5 2 2000 Q4 NA NA NA NA
2.6 2 2001 Q1 3 4 NA NA
3.1 3 2000 Q1 1 2 NA NA
3.2 3 2000 Q2 1 NA NA -1
3.3 3 2000 Q3 1 2 NA NA
EDIT: Completely re-did the solution based on further discussion. Note that this not only adds an extra column but it also converts the index to "yearqtr" class and adds the extra missing rows.
EDIT: Some minor simplifications in the by function.

I wasn't entirely clear what you wanted because you didn't include a "right answer". If you want to subtract one lagged variable from another unlagged variable, you can do that with offset indexing. (You do need to pad the result if you want to put it back into the data frame.)
x$A1lagA2 <- ave(x[, c("A1", "A2")], x$uid, FUN=function(z) {
# negative indices also cope with a uid that has a single row (result is just NA)
with(z, c(NA, A1[-1] - A2[-NROW(z)]) ) } )[[1]]
x
uid tndx A1 A2 A1lagA2
1 1 1999.4 2 3 NA
2 1 2000.1 2 3 -1
3 1 2000.2 2 3 -1
4 2 1999.4 10 14 NA
5 2 2000.1 11 14 -3
6 3 2000.1 1 2 NA
7 3 2000.2 1 100 -1
8 3 2000.3 1 2 -99
You do get annoying duplicate extra columns with ave() when its argument is multicolumn, but I just took the first one.
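The offset indexing above assumes that neighbouring rows within a uid are always consecutive quarters. If gaps are possible, one option is to check that explicitly. This is only a sketch in base R, assuming tndx always has the form year.quarter as in the example; the qidx column, the shift1 helper and the rows for uid 4 (a gap) and uid 5 (a single observation) are hypothetical additions for illustration.

```r
# Example data from the question plus two hypothetical uids:
# uid 4 skips a quarter, uid 5 occurs only once.
x <- data.frame(
  uid  = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5),
  tndx = c(1999.4, 2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3,
           2000.1, 2000.3, 2000.1),
  A1   = c(2, 2, 2, 10, 11, 1, 1, 1, 5, 5, 7),
  A2   = c(3, 3, 3, 14, 14, 2, 100, 2, 6, 6, 8)
)

# Turn year.quarter into a running quarter counter, so that
# Q4 1999 and Q1 2000 differ by exactly 1.
yr     <- floor(x$tndx)
qtr    <- round((x$tndx - yr) * 10)
x$qidx <- yr * 4 + qtr
x      <- x[order(x$uid, x$qidx), ]

# Shift a vector down by one position, padding with NA.
shift1 <- function(v) c(NA, v[-length(v)])

prev_A2 <- ave(x$A2,   x$uid, FUN = shift1)  # A2 of the previous row, per uid
prev_q  <- ave(x$qidx, x$uid, FUN = shift1)  # quarter counter of the previous row

# Only subtract when the previous row really is the preceding quarter.
consecutive <- !is.na(prev_q) & (x$qidx - prev_q) == 1
x$A1lagA2   <- ifelse(consecutive, x$A1 - prev_A2, NA)
x
```

For the original eight rows this reproduces the A1lagA2 column shown above; the rows for uid 4 and uid 5 all come out as NA.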

Related

Adding two variables that have NAs present

Let's say the data is 'ab':
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
ab <-c(a,b)
I would like to have a new variable which is the sum of the two, with NAs handled as follows:
desired output:
ab$c <- c(6,2,7,NA,5,6)
so number + NA should equal the number
I tried the following, but it does not work as desired:
ab$c <- a+b
gives me : 6 NA 7 NA NA NA
Also, I don't know how to include "na.rm=TRUE", which is something I was trying.
I would also like to create a third, categorical variable based on a cutoff: if <=4 then event 1, otherwise 0:
desired output:
ab$d <- c(1,1,1,NA,0,0)
I tried:
ab$d =ifelse(ab$a<=4|ab$b<=4,1,0)
print(ab$d)
gives me logical(0)
Thanks!
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
dfd <- data.frame(a,b)
dfd$c <- rowSums(dfd, na.rm = TRUE)
dfd$c <- ifelse(is.na(dfd$a) & is.na(dfd$b), NA_integer_, dfd$c)
dfd$d <- ifelse(dfd$c >= 4, 1, 0)
dfd
a b c d
1 1 5 6 1
2 2 NA 2 0
3 3 4 7 1
4 NA NA NA NA
5 5 NA 5 1
6 NA 6 6 1
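Note that the d column printed above (1, 0, 1, NA, 1, 1) is not the desired output from the question (1, 1, 1, NA, 0, 0). The desired values look like "1 if a or b is at most 4, ignoring NAs". Under that reading, which is my assumption about what was meant, a sketch with pmin() would be:

```r
a <- c(1, 2, 3, NA, 5, NA)
b <- c(5, NA, 4, NA, NA, 6)
dfd <- data.frame(a, b)

# Row-wise minimum, ignoring NAs; stays NA only when both a and b are NA.
smallest <- pmin(dfd$a, dfd$b, na.rm = TRUE)

# Event = 1 when the smaller of the two values is at or below the cutoff.
dfd$d <- ifelse(smallest <= 4, 1, 0)
dfd$d
```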

How to replace NA with zero in a column if the columns beside it have values? Using R

I want to know a way to replace the NA in a column if the columns beside it have a value. For example, if a worker has values in the other columns, it means he went to work that day, so an NA should be replaced with zero. If there are no values in the surrounding columns, he didn't go to work that day and the NA is correct.
I have been doing this by sorting the other columns, but it is very time consuming.
A sample of my data, called df; the real one has 30 columns and about 30,000 rows:
df <- data.frame(
hours = c(NA, 3, NA, 8),
interactions = c(NA, 3, 9, 9),
sales = c(1, 1, 1, NA)
)
df$hours2 <- ifelse(
# row-wise test: hours is NA but at least one neighbouring column has a value
test = is.na(df$hours) & rowSums(!is.na(df[, c("interactions", "sales")])) > 0,
yes = 0,
no = df$hours)
df
hours interactions sales hours2
1 NA NA 1 0
2 3 3 1 3
3 NA 9 1 0
4 8 9 NA 8
You could also do as follows:
library(dplyr)
mutate(df, X = if_else(is.na(hours) & (!is.na(interactions) | !is.na(sales)), 0, hours))
# hours interactions sales X
# 1 NA NA 1 0
# 2 3 3 1 3
# 3 NA 9 1 0
# 4 8 9 NA 8
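With around 30 columns, naming every neighbouring column by hand gets tedious. A sketch that checks all other columns row by row, in base R; the fifth all-NA row and the names others and worked are hypothetical, added only to show that a genuine day off stays NA:

```r
df <- data.frame(
  hours        = c(NA, 3, NA, 8, NA),
  interactions = c(NA, 3, 9, 9, NA),
  sales        = c(1, 1, 1, NA, NA)
)

# Every column except the one being repaired.
others <- setdiff(names(df), "hours")

# TRUE for rows where at least one of the other columns has a value.
worked <- rowSums(!is.na(df[others])) > 0

# Replace NA by 0 only on days with some recorded activity.
df$hours[is.na(df$hours) & worked] <- 0
df
```

Rows 1 and 3 become 0, while row 5, which has no values at all, keeps its NA.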

Adding a vector to a data.frame in an asymmetric way in R

In my code below, I was wondering if there might be a way to add the z1 vector to data.frame d1 such that we can achieve my Desired_Output using Base R or tidyverse?
This is a toy example. d1 can have any number of rows and columns, and the z1 vector can have any number of elements, so a functional answer applicable to other data.frames is highly appreciated.
d1 <- data.frame(b = 1:5, SE = 2:6)
z1 <- c(2.3, 5.4)
d1$tau <- ""
Desired_Output =
" b SE tau
1 2
2 3
3 4
4 5
5 6
2.3
5.4"
You may use dplyr::bind_rows or data.table::rbindlist
d1 <- data.frame(b = 1:5, SE = 2:6)
z1 <- c(2.3, 5.4)
d2 <- data.frame(tau = z1)
dplyr::bind_rows(d1, d2)
# b SE tau
#1 1 2 NA
#2 2 3 NA
#3 3 4 NA
#4 4 5 NA
#5 5 6 NA
#6 NA NA 2.3
#7 NA NA 5.4
With data.table -
data.table::rbindlist(list(d1, d2), fill = TRUE)
The d1 data frame has 5 rows and two columns. To add a column, the new column must also have 5 rows. However, because the z1 vector is required to occupy rows 6 and 7, those rows must first be added to d1:
d1 <- data.frame(b = 1:5, SE = 2:6)
d1[6:7,] <- NA
d1$tau <- c(rep(NA,5),2.3,5.4)
d1
#> b SE tau
#> 1 1 2 NA
#> 2 2 3 NA
#> 3 3 4 NA
#> 4 4 5 NA
#> 5 5 6 NA
#> 6 NA NA 2.3
#> 7 NA NA 5.4

Iterate over a column ignoring but retaining NA values in R

I have a time series data frame in R that has a column, V1, which consists of integers with a few NAs interspersed throughout. I want to iterate over this column and subtract V1 from itself one time step previously. However, I want to ignore the NA values in V1 and use the last non-NA value in the subtraction. If the current value of V1 is NA, then the difference should return NA. See below for an example
V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
time <- 1:length(V1)
dat <- data.frame(time = time,
V1 = V1)
lag_diff <- c(NA, 2, 1, NA, NA, 2, 3, NA, 1) # The result I want
diff(dat$V1) # Not the result I want
I'd prefer not to do this with loops because I have hundreds of data frames, each with >10,000 rows.
My first thought to solve this was to filter out the NA rows, perform the iterative difference calculation and then reinsert the rows that were filtered out but I can't think of a way to do that. It doesn't seem very "tidy" to do it that way either and I'm not sure it would be faster than looping. Any help is appreciated, bonus points if the solution uses tidyverse functions.
dat[!is.na(dat$V1), 'lag_diff'] <- c(NA, diff(dat[!is.na(dat$V1), 'V1']))
# time V1 lag_diff
# 1 1 1 NA
# 2 2 3 2
# 3 3 4 1
# 4 4 NA NA
# 5 5 NA NA
# 6 6 6 2
# 7 7 9 3
# 8 8 NA NA
# 9 9 10 1
Or with data.table (same result)
library(data.table)
setDT(dat)
dat[!is.na(V1), lag_diff := V1 - shift(V1)]
# time V1 lag_diff
# 1: 1 1 NA
# 2: 2 3 2
# 3: 3 4 1
# 4: 4 NA NA
# 5: 5 NA NA
# 6: 6 6 2
# 7: 7 9 3
# 8: 8 NA NA
# 9: 9 10 1
A tidyverse version, just in case. It does need a filter, though:
library(dplyr)
dat %>%
filter(!is.na(V1)) %>%
mutate(diff = V1 - lag(V1)) %>%
right_join(dat, by = c("time", "V1")) %>%
arrange(time) # put the re-joined NA rows back in time order
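If you would rather avoid the filter and join, another tidyverse option is to carry the last non-NA value forward with tidyr::fill() and subtract that. A sketch, assuming dplyr and tidyr are installed; prev is a temporary helper column I introduced:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(time = 1:9, V1 = c(1, 3, 4, NA, NA, 6, 9, NA, 10))

res <- dat %>%
  mutate(prev = lag(V1)) %>%        # value one time step earlier
  fill(prev) %>%                    # replace NA with the last non-NA value above it
  mutate(lag_diff = V1 - prev) %>%  # NA wherever V1 itself is NA
  select(-prev)

res$lag_diff
```

Rows stay in their original order, so no re-sorting is needed afterwards.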

Drop data frame rows if NA for certain variables referred to by name in dplyr

I would like to drop entire rows from a data frame if they have all NAs but for only certain subset of columns (which are named in a sequence as well as start with "X").
This is different than other SO answers that I found from what I can tell since I cannot refer to each column manually by name (too many variables) and do not only want to drop the rows if they are completely NA (rather if some variables are completely NA).
So turn sample data:
data1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(1, NA, NA), c(4, 8, NA)))
colnames(data1) <- c("Z","X1","X2")
data1
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 1 NA NA
5 4 8 NA
into:
V1 V2 V3
1 1 2 3
2 1 NA 4
3 4 6 7
4 4 8 NA
I.e. drop the row if both X1 and X2 (all of the X sequence) are NA.
In this example there are only two variables(X1:X2)for ease but in reality I have closer to 100 of this sequence and many other important variables that may or may not be NA. I would prefer to do so in dplyr with filter but other solutions would be appreciated as well.
I feel like:
data2 %>% filter(!is.na(all(X1:X2)))
or something similar is close but R does not like the sequence reference to X1:X2 within filter.
You can use rowSums + select + starts_with + filter:
data1 %>%
filter(rowSums(!is.na(select(., starts_with("X")))) != 0)
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#4 4 8 NA
A base R solution using apply would be:
# logical index, so no rows are lost when nothing qualifies for dropping
drop <- apply(data1[, startsWith(colnames(data1), "X"), drop = FALSE], 1, function(x) all(is.na(x)))
data1[!drop,]
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Another option using rowSums:
drop <- rowSums(is.na(data1[,c("X1","X2")])) >= 2
data1[!drop, ]
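A variant of the dplyr approach that states the condition more directly is if_any(), which keeps a row when at least one of the selected columns passes the test. A sketch, assuming a reasonably recent dplyr that provides if_any(); kept is just a name for the result:

```r
library(dplyr)

data1 <- data.frame(
  Z  = c(1, 1, 4, 1, 4),
  X1 = c(2, NA, 6, NA, 8),
  X2 = c(3, 4, 7, NA, NA)
)

# Keep rows where at least one column starting with "X" is not NA.
kept <- data1 %>%
  filter(if_any(starts_with("X"), ~ !is.na(.x)))

kept
```

Only the row where both X1 and X2 are NA is dropped; NAs in Z or other columns play no role.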
