Labeling each value in a column by grouping from another column in R

I have an unusual data set that I need to work with and I've created a small scale, reproducible example.
library(data.table)
DT <- data.table(Type = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)), Cohort = c(NA,1:4, NA, 5:7, NA, 8:12))
dt <- data.table(Type = c(rep("A", 4), rep("B", 3), rep("C", 5)), Cohort = 1:12)
I need DT to look like dt and the actual dataset has 6.8 million rows. I realize it might be a simple issue but I can't seem to figure it out, maybe setkey? Any help is appreciated, thanks.

You can replace "" by NA and use na.locf from the zoo package:
library(zoo)
DT[Type=="",Type:=NA][,Type:=na.locf(Type)][!is.na(Cohort)]

Here is another option without using na.locf. Grouped by the cumulative sum of the logical vector (Type!=""), we select the first 'Type' and the lead value of 'Cohort', assign (:=) them to the names of 'DT' to replace the original column values, and use na.omit to remove the NA rows.
na.omit(DT[, names(DT) := .(Type[1L], shift(Cohort, type="lead")), cumsum(Type!="")])
# Type Cohort
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: B 5
# 6: B 6
# 7: B 7
# 8: C 8
# 9: C 9
#10: C 10
#11: C 11
#12: C 12
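If loading zoo is not desirable, the same fill can be sketched with plain indexing: each non-empty label is looked up by the running count of non-empty entries. This is an illustrative alternative rather than one of the posted answers, and `filled` is a name chosen here.

```r
library(data.table)

# Same toy data as in the question
DT <- data.table(Type = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)),
                 Cohort = c(NA, 1:4, NA, 5:7, NA, 8:12))

# cumsum(Type != "") numbers the blocks 1, 2, 3, ...;
# indexing the non-empty labels by that number repeats each label over its block
DT[, Type := Type[Type != ""][cumsum(Type != "")]]

# Drop the header rows, which carry no Cohort value
filled <- DT[!is.na(Cohort)]
```

This relies on every block starting with a non-empty Type, as in the example data.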

Related

na.locf in data.table when completing by group

I have a data.table in which I'd like to complete a column to fill in some missing values, however I'm having some trouble filling in the other columns.
dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this:
dt %>%
complete(a = seq(min(a), max(a), 1)) %>%
mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward
Another possible solution with only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform every other solution.
Credit to @Mako212, who gave the hint by using seq in his answer.
First posted solution which works, but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))] so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
This gives you the desired result:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow @Jaap's min/max line to avoid creating the second table. So basically you can either use his rolling join solution, or if you want to use na.locf, this gets the same result.
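To see what roll = Inf adds, here is a small sketch comparing a plain join with the rolling join on the question's data. The names `full`, `plain` and `rolled` are made up for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 3, 5), b = c("a", "b", "c"))

# Every value of a from min to max (kept as double to match dt$a)
full <- data.table(a = seq(min(dt$a), max(dt$a), by = 1))

# Plain join: rows without a match in dt get NA in b
plain <- dt[full, on = .(a)]
plain$b   # "a" NA "b" NA "c"

# Rolling join: the last observed b is carried forward into the gaps
rolled <- dt[full, on = .(a), roll = Inf]
rolled$b  # "a" "a" "b" "b" "c"
```

The plain join is the "complete" step and the roll does the work of na.locf, which is why the rolling join needs no second pass.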

data.table grouping with variable names

I'm attempting to create a summarised data.table from an existing one, however I am wanting to do this in a function that allows me to pass in a column prefix so I can prefix my columns as required.
I've seen the question/response here but am trying to work out how to do it when not using the := operator.
Reprex:
library(data.table)
tbl1 <- data.table(urn = c("a", "a", "a", "b", "b", "b"),
amount = c(1, 2, 1, 3, 3, 4))
# urn amount
# 1: a 1
# 2: a 2
# 3: a 1
# 4: b 3
# 5: b 3
# 6: b 4
tbl2 <- tbl1[, .(mean_amt = mean(amount),
rows = .N),
by = urn]
# urn mean_amt rows
# 1: a 1.333333 3
# 2: b 3.333333 3
This is using fixed names for the column names being created, however as mentioned I'd like to be able to include a prefix.
I've tried the following:
prefix <- "mypfx_"
tbl2 <- tbl1[, .(paste0(prefix, mean_amt) = mean(amount),
paste0(prefix, rows) = .N),
by = urn]
# Desired output
# urn mypfx_mean_amt mypfx_rows
# 1: a 1.333333 3
# 2: b 3.333333 3
Unfortunately that code gets an error saying: Error: unexpected '=' in " tbl2 <- tbl1[, .(paste0(prefix, mean_amt) ="
Any thoughts on how to make the above work would be appreciated.
You can use setNames to rename the columns dynamically:
prefix <- "mypfx_"
tbl2 <- tbl1[, setNames(list(mean(amount), .N), paste0(prefix, c("mean_amt", "rows"))),
by = urn]
tbl2
# urn mypfx_mean_amt mypfx_rows
#1: a 1.333333 3
#2: b 3.333333 3
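Since the question asks for a function that takes the prefix as an argument, the setNames approach could be wrapped roughly as follows. `summarise_amount` is a hypothetical helper name, and the sketch assumes the table has columns `urn` and `amount` as in the reprex.

```r
library(data.table)

tbl1 <- data.table(urn = c("a", "a", "a", "b", "b", "b"),
                   amount = c(1, 2, 1, 3, 3, 4))

# Hypothetical helper: summarise amount per urn, prefixing the new column names
summarise_amount <- function(tbl, prefix) {
  new_names <- paste0(prefix, c("mean_amt", "rows"))
  tbl[, setNames(list(mean(amount), .N), new_names), by = urn]
}

tbl2 <- summarise_amount(tbl1, "mypfx_")
names(tbl2)  # "urn" "mypfx_mean_amt" "mypfx_rows"
```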

How to combine data.tables by= with its shift() without having to create new variables?

I'm trying to generate row sums of a variable and its lag(s). Say I have:
library(data.table)
data <- data.table(id = rep(c("AT","DE"), each = 3),
time = rep(2001:2003, 2), var1 = c(1:6), var2 = c(NA, 1:3, NA, 8))
And I want to create a variable which adds 'var1' and the first lag of 'var2' by 'id'. If I create the lag first and then the sum, I know how to:
data[ , lag := shift(var2, 1), by = id]
data[ , goalmessy := sum(var1, lag, na.rm = TRUE), by = 1:NROW(data)]
But is there a way to use shift inside sum or something similar (like applying sum)? The intuitive problem I have is that the by command is evaluated first, as far as I know, so we will be in a single row, which makes the shifting unfeasible. Any hints?
I think this will do what you want in one line:
data[, myVals := rowSums(cbind(var1, shift(var2)), na.rm=TRUE), by=id]
data
id time var1 var2 myVals
1: AT 2001 1 NA 1
2: AT 2002 2 1 2
3: AT 2003 3 2 4
4: DE 2001 4 3 4
5: DE 2002 5 NA 8
6: DE 2003 6 8 6
The two variables of interest are put into cbind which is used to feed rowSums and NAs are dropped as in your code.
We can use rowSums
data[, goalmessy := rowSums(setDT(.(var1, shift(var2))), na.rm = TRUE), by = id]
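Another way to keep it to one line is to replace the missing lags with 0 and add directly. This is a sketch, not one of the posted answers: it uses fcoalesce(), which I believe is available from data.table 1.12.4 onward, and it assumes var1 itself has no NAs.

```r
library(data.table)

data <- data.table(id = rep(c("AT", "DE"), each = 3),
                   time = rep(2001:2003, 2),
                   var1 = 1:6,
                   var2 = c(NA, 1:3, NA, 8))

# shift(var2) is the first lag within each id; fcoalesce() swaps its NAs for 0.
# Assumption: var1 has no NAs (unlike sum(..., na.rm = TRUE), an NA in var1 would propagate)
data[, goal := var1 + fcoalesce(shift(var2), 0), by = id]
data$goal  # 1 2 4 4 8 6
```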

Different results from dplyr and data.table

Reproducible dataset:
library(data.table)
library(dplyr)
library(zoo)
df = expand.grid(ID = sample(LETTERS[1:5]),
Date = seq.Date(as.Date("2012-01-01"), as.Date("2012-12-01"), by = "1 month"))
df = df[order(as.character(df$ID)),]
df = data.table(df, V1 = runif(nrow(df),0,1), V2 = runif(nrow(df),0,1), V3 = runif(nrow(df),0,1))
ind = sample(nrow(df), nrow(df)*.5)
na.gen <- function(x, ind){x[ind] <- NA}
df1 <- df %>% slice(., ind) %>% mutate_each(funs(na.gen), starts_with("V"))
df2 = df[!ind]
df <- rbind(df1, df2)
df <- df[order(as.character(df$ID), df$Date),]
df$ID = as.character(df$ID)
In the above dataset, my idea was to impute data using Last Observation Carried Forward method. My original problem is a very large dataset, so I tested dplyr and data.table solutions.
final_dplyr <- df %>% group_by(ID) %>% mutate_each(funs(na.locf), starts_with("V"))
final_data.table <- df[, na.locf(.SD), by = ID]
data.table gives me the right solution; however, dplyr messes up the subset that begins with NA. I get the following warning using dplyr:
Warning messages:
1: In `[.data.table`(`_dt`, , `:=`(V1, na.locf(V1)), by = `_vars`) :
Supplied 11 items to be assigned to group 1 of size 12 in column 'V1' (recycled leaving remainder of 1 items).
Can someone help me understand what I am doing wrong with dplyr?
Okay, a lot of things going on here. First, as @Frank noted, the two commands operate on different objects: na.locf(.SD) operates on the subset data.table for each ID, whereas dplyr's version operates on each column separately for each ID.
To identify where the issue is, I'll use data.table equivalent of your dplyr syntax.
df[, lapply(.SD, na.locf), by=ID]
# warning
We get the same warning message. It seems like the number of rows returned for each column isn't identical for one or more groups. Let's check that.
df[, lapply(.SD, function(x) length(na.locf(x))), by=ID]
# ID Date V1 V2 V3
# 1: A 12 12 12 12
# 2: B 12 12 12 12
# 3: C 12 11 11 11 # <~~~ we've a winner!
# 4: D 12 12 12 12
# 5: E 12 12 12 12
Why is this happening?
head(df[ID == "C"])
# ID Date V1 V2 V3
# 1: C 2012-01-01 NA NA NA
# 2: C 2012-02-01 0.7475075 0.8917311 0.7601174
# 3: C 2012-03-01 0.4922747 0.7749479 0.3995417
# 4: C 2012-04-01 0.9013631 0.3388313 0.8873779
# 5: C 2012-05-01 NA NA NA
# 6: C 2012-06-01 NA NA NA
nrow(df[ID == "C", na.locf(.SD), .SDcols= -c("ID")])
# 12 as expected
nrow(df[ID == "C", lapply(.SD, na.locf), .SDcols= -c("ID")])
# 12, but with warnings
Using na.locf() on columns separately returns 11 for V1:V3. Why? It seems like it's because of the NA at the beginning. ?na.locf has a na.rm argument which by default is set to TRUE, which removes NAs from the beginning. So let's set it to FALSE and try again:
nrow(df[ID == "C", lapply(.SD, na.locf, na.rm=FALSE), .SDcols = -c("ID")])
# 12, no warnings
It worked with na.locf(.SD) because it also ran na.locf on Date column which returned 12 rows, I think.
In essence, you need to set na.rm=FALSE in dplyr somehow, or get dplyr to work on the entire object somehow. I've no idea how to do either.
PS: Note that you can use := to update the data.table by reference instead of returning a new object with data.table syntax.
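As a minimal sketch of both points (the effect of na.rm and updating by reference), here is a toy example. The `toy` table and its values are made up for illustration and are not the question's data.

```r
library(data.table)
library(zoo)

x <- c(NA, 0.7, 0.5, NA, NA)
length(na.locf(x))                 # 4: the leading NA is dropped
length(na.locf(x, na.rm = FALSE))  # 5: the leading NA is kept

# Toy data (made up): group "C" starts with an NA in V1
toy <- data.table(ID = rep(c("A", "C"), each = 3),
                  V1 = c(1, NA, 3, NA, 5, NA),
                  V2 = c(NA, 2, NA, 4, NA, 6))
cols <- c("V1", "V2")

# Fill each column within each ID, updating toy by reference with :=
toy[, (cols) := lapply(.SD, na.locf, na.rm = FALSE), by = ID, .SDcols = cols]
```

With na.rm = FALSE every column keeps its group length, so := has nothing to recycle and no warning is raised.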

Moving Averages on multiple columns - Grouped Data

Apologies if this has been answered. I've gone through numerous examples today but I can't find any that match what I am trying to do.
I have a data set which I need to calculate a 3 point moving average on. I've generated some dummy data below:
set.seed(1234)
df <- data.frame(Week = rep(seq(1:5), 3),
Section = c(rep("a", 5), rep("b", 5), rep("c", 5)),
Qty = runif(15, min = 100, max = 500),
To = runif(15, min = 40, max = 80))
I want to calculate the MA for each group based on the 'Section' column for both the 'Qty' and the 'To' columns. Ideally the output would be a data table. The moving average would start at Week 3 so would be the average of wks 1:3
I am trying to master the data.table package so a solution using that would be great but otherwise any will be much appreciated.
Just for reference my actual data set will have approx. 70 sections with c.1M rows in total. I've found the data.table to be extremely fast at crunching these kind of volumes so far.
We could use rollmean from the zoo package, in combination with data.table.
library(data.table)
library(zoo)
setDT(df)[, c("Qty.mean","To.mean") := lapply(.SD, rollmean, k = 3, fill = NA, align = "right"),
.SDcols = c("Qty","To"), by = Section]
> df
# Week Section Qty To Qty.mean To.mean
#1: 1 a 145.4814 73.49183 NA NA
#2: 2 a 348.9198 51.44893 NA NA
#3: 3 a 343.7099 50.67283 279.3703 58.53786
#4: 4 a 349.3518 47.46891 347.3271 49.86356
#5: 5 a 444.3662 49.28904 379.1426 49.14359
#6: 1 b 356.1242 52.66450 NA NA
#7: 2 b 103.7983 52.10773 NA NA
#8: 3 b 193.0202 46.36184 217.6476 50.37802
#9: 4 b 366.4335 41.59984 221.0840 46.68980
#10: 5 b 305.7005 48.75198 288.3847 45.57122
#11: 1 c 377.4365 72.42394 NA NA
#12: 2 c 317.9899 61.02790 NA NA
#13: 3 c 213.0934 76.58633 302.8400 70.01272
#14: 4 c 469.3734 73.25380 333.4856 70.28934
#15: 5 c 216.9263 41.83081 299.7977 63.89031
A solution using dplyr:
library(dplyr); library(zoo)
myfun = function(x) rollmean(x, k = 3, fill = NA, align = "right")
df %>% group_by(Section) %>% mutate_each(funs(myfun), Qty, To)
#### Week Section Qty To
#### (int) (fctr) (dbl) (dbl)
#### 1 1 a NA NA
#### 2 2 a NA NA
#### 3 3 a 279.3703 58.53786
#### 4 4 a 347.3271 49.86356
There is now a faster approach using the frollmean function, new in data.table 1.12.0:
setDT(df)[, c("Qty.mean","To.mean") := frollmean(.SD, 3),
.SDcols = c("Qty","To"),
by = Section]
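A self-contained version of the frollmean approach, with a hand check of one window, might look like this. It rebuilds the question's dummy data directly as a data.table, and `manual` is a name chosen here.

```r
library(data.table)

set.seed(1234)
df <- data.table(Week = rep(1:5, 3),
                 Section = rep(c("a", "b", "c"), each = 5),
                 Qty = runif(15, min = 100, max = 500),
                 To = runif(15, min = 40, max = 80))

# Defaults of frollmean(): right-aligned window, NA where the window is incomplete
df[, c("Qty.mean", "To.mean") := frollmean(.SD, 3),
   .SDcols = c("Qty", "To"), by = Section]

# Hand check for section "a", week 3: mean of weeks 1 to 3
manual <- df[Section == "a" & Week <= 3, mean(Qty)]
all.equal(manual, df[Section == "a" & Week == 3, Qty.mean])
```

Grouping with by = Section keeps each window inside its own section, so weeks 1 and 2 of every section stay NA.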
