Custom dcasting via data.table in R

Here is my data
library(data.table)
dt = data.table(x=sample(8,20,TRUE),
y=sample(2,20,TRUE),
w = sample(letters[5:20], 20, TRUE),
u = sample(letters[2:25], 20, TRUE),
z=sample(letters[1:4], 20,TRUE),
d1 = runif(20), d2=runif(20))
Here is my dcasting code.
DC_1 = dcast.data.table(dt,x+w ~ z, value.var = "d1")
This works fine. However, my data could additionally include column 'a' and column 's', as shown below. Both, either one, or neither of them could be present.
dt = data.table(x=sample(8,20,TRUE),
y=sample(2,20,TRUE),
w = sample(letters[5:20], 20, TRUE),
u = sample(letters[2:25], 20, TRUE),
z=sample(letters[1:4], 20,TRUE),
a = sample(letters[1:25], 20, T),
s = sample(letters[2:17], 20, T),
d1 = runif(20), d2=runif(20))
The additional columns, however, would always be character columns. Also, my data always has to be cast on column 'z', and the value variable would always be 'd1'.
How do I dcast via data.table so that it takes all the character columns (except 'z') available in the data table and casts them on 'z'?

We could subset the columns of the dataset and use ... on the lhs of ~ to stand for all remaining columns, with 'z' on the rhs of the formula:
dcast(dt[, setdiff(names(dt), 'd2'), with = FALSE], ... ~ z, value.var = 'd1')
Or get the column names of the character columns programmatically
nm1 <- dt[, names(which(unlist(lapply(.SD, is.character))))]
nm2 <- setdiff(nm1, 'z')
dcast(dt,paste0(paste(nm2, collapse="+"), "~ z"), value.var = 'd1')
Another option is select() from dplyr:
library(dplyr) #1.0.0
dcast(dt[, select(.SD, where(is.character), d1)], ... ~ z, value.var = 'd1')
A similar option in tidyverse would be
library(tidyr)
dt %>%
select(where(is.character), d1) %>%
pivot_wider(names_from = z, values_from = d1)
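The programmatic approach above can be wrapped in a small helper so the same call works whether or not the optional columns are present. This is only a sketch: the function name cast_on_chars and the toy table dt_small are made up for illustration, and it assumes at least one character column besides 'z' exists. Note also that an identifier column named like a value of 'z' (for example 'a') can collide with the generated column names.

```r
library(data.table)

# Hypothetical helper (not part of data.table): cast `value_col` on `cast_col`,
# using every other character column as an identifier on the lhs.
cast_on_chars <- function(dt, cast_col = "z", value_col = "d1") {
  is_chr  <- vapply(dt, is.character, logical(1))
  id_cols <- setdiff(names(dt)[is_chr], cast_col)
  # Build a formula such as `w + s ~ z` from the detected column names
  f <- as.formula(paste(paste(id_cols, collapse = " + "), "~", cast_col))
  dcast(dt, f, value.var = value_col)
}

# Small made-up table; `s` plays the role of an optional character column
dt_small <- data.table(
  w  = c("e", "e", "f"),
  s  = c("p", "p", "q"),
  z  = c("a", "b", "a"),
  d1 = c(1, 2, 3),
  d2 = c(9, 9, 9)
)
res <- cast_on_chars(dt_small)
print(res)
```

Numeric columns such as d2 are simply ignored because only character columns enter the formula.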

Related

How do you compare means row-wise for the same ratings object in the R expss package?

I have repeated measures data with two ratings (reliable and fast) repeated on two different objects (each survey respondent rates each object using the same two rating measures). I would like to have two columns, one for object 1 and one for object 2, with the ratings displayed in two separate rows.
The reference manual mentions using a | separator to compare two variables, but the example given is for mrsets, not means. I'm not sure how to do the same with means and keep them in separate data frame columns.
In the code below, the problem is that the means are stacked on top of each other instead of being placed side by side for comparison.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE))
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot()
I discovered that adding variable labels first and using 'tab_pivot(stat_position = "inside_columns")' solves the problem.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE)
)
#labels
df = apply_labels(df,
q1_reliable = "reliable",
q1_fast = "fast",
q2_reliable = "reliable",
q2_fast = "fast")
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot(stat_position = "inside_columns")
Like this data.table approach?
library(data.table)
#melt first
DT <- melt( setDT(df),
measure.vars = patterns( reliable = "reliable", fast = "fast"),
variable.name = "q")
#then summarise
DT[, lapply(.SD, mean), by = .(q), .SDcols = c("reliable", "fast")]
q reliable fast
1: 1 3.04 2.96
2: 2 2.92 2.96
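If the installed data.table version does not pick up the names given inside patterns(), the value columns can be named explicitly through value.name instead. The following is a sketch of that variant on the same dummy data; the object names long and means are illustrative only.

```r
library(data.table)

# Same dummy data as in the question
set.seed(9)
df <- data.frame(
  q1_reliable = sample(c(1, 5), 100, replace = TRUE),
  q1_fast     = sample(c(1, 5), 100, replace = TRUE),
  q2_reliable = sample(c(1, 5), 100, replace = TRUE),
  q2_fast     = sample(c(1, 5), 100, replace = TRUE)
)

# Melt both rating families at once; `q` indexes the rated object (1 or 2)
long <- melt(
  as.data.table(df),
  measure.vars  = patterns("reliable$", "fast$"),
  value.name    = c("reliable", "fast"),
  variable.name = "q"
)

# One row per object, one column per rating
means <- long[, lapply(.SD, mean), by = q, .SDcols = c("reliable", "fast")]
print(means)
```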

split join data.table R

Objective
Join DT1 (as i in data.table) to DT2 given key(s) column(s), within each group of DT2 specified by the Date column.
I cannot run DT2[DT1, on = 'key'] as that would be incorrect since key column is repeated across the Date column, but unique within a single date.
Reproducible example with a working solution
DT3 is my expected output. Is there any way to achieve this without the split manoeuvre, which does not feel very data.table-y?
library(data.table)
set.seed(1)
DT1 <- data.table(
Segment = sample(paste0('S', 1:10), 100, TRUE),
Activity = sample(paste0('A', 1:5), 100, TRUE),
Value = runif(100)
)
dates <- seq(as.Date('2018-01-01'), as.Date('2018-11-30'), by = '1 day')
DT2 <- data.table(
Date = rep(dates, each = 5),
Segment = sample(paste0('S', 1:10), 3340, TRUE),
Total = runif(3340, 1, 2)
)
rm(dates)
# To ensure that each Date Segment combination is unique
DT2 <- unique(DT2, by = c('Date', 'Segment'))
iDT2 <- split(DT2, by = 'Date')
iDT2 <- lapply(
iDT2,
function(x) {
x[DT1, on = 'Segment', nomatch = 0]
}
)
DT3 <- rbindlist(iDT2, use.names = TRUE)
You can achieve the same result with a cartesian merge:
DT4 <- merge(DT2,DT1,by='Segment',allow.cartesian = TRUE)
Here is the proof:
> all(DT3[order(Segment,Date,Total,Activity,Value),
c('Segment','Date','Total','Activity','Value')] ==
DT4[order(Segment,Date,Total,Activity,Value),
c('Segment','Date','Total','Activity','Value')])
[1] TRUE
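The equivalence can also be checked on a tiny hand-made example, where the result is easy to verify by eye. This is a sketch with made-up values; by_split and by_merge are illustrative names, and fsetequal() is used for an order-independent comparison.

```r
library(data.table)

# Tiny made-up inputs: Segment repeats across dates but is unique within a date
DT1 <- data.table(
  Segment  = c("S1", "S1", "S2"),
  Activity = c("A1", "A2", "A1"),
  Value    = c(0.1, 0.2, 0.3)
)
DT2 <- data.table(
  Date    = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02")),
  Segment = c("S1", "S2", "S1"),
  Total   = c(1.5, 1.2, 1.8)
)

# Split-by-date approach from the question
by_split <- rbindlist(
  lapply(split(DT2, by = "Date"), function(x) x[DT1, on = "Segment", nomatch = 0]),
  use.names = TRUE
)

# Single merge on Segment
by_merge <- merge(DT2, DT1, by = "Segment", allow.cartesian = TRUE)

# Align column order, then compare as sets of rows
cols <- c("Segment", "Date", "Total", "Activity", "Value")
setcolorder(by_split, cols)
setcolorder(by_merge, cols)
same <- fsetequal(by_split, by_merge)
print(same)
```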

Create new variables with lag data from all current variables

My dataset has about 20 columns, and I would like to create 7 new columns with lagged data for each of the 20 current columns.
For example, I have columns x, y, and z. I would like to create columns xlag1, xlag2, xlag3, xlag4, xlag5, xlag6, xlag7, ylag1, ylag2, etc.
My current attempt is with dplyr in R:
aq %>% mutate(.,
xlag1 = lag(x, 1),
xlag2 = lag(x, 2),
xlag3 = lag(x, 3),
xlag4 = lag(x, 4),
xlag5 = lag(x, 5),
xlag6 = lag(x, 6),
xlag7 = lag(x, 7),
)
As you can see, it will take a lot of lines of code to cover all 20 columns. Is there a more efficient way of doing this? Preferably in dplyr, as that is the package I'm most familiar with.
We can use data.table. Its shift() function accepts a vector of lags for 'n' and returns one lagged vector per value.
library(data.table)
setDT(aq)[, paste0('xlag', 1:7) := shift(x, 1:7)]
If there are multiple columns,
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
c(shift(x, 1:7), shift(y, 1:7))]
If we have many columns, specify them in .SDcols, loop over .SD to get the shifted vectors, flatten the result with unlist, and assign it to the new columns:
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
unlist(lapply(.SD, shift, n = 1:7), recursive = FALSE) , .SDcols = x:y]
We can also use data.table's shift inside a dplyr pipeline:
library(dplyr)
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7)), c(names(aq), paste0('xlag', 1:7))))
and for multiple columns
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7), shift(.$y, 1:7)),
c(names(aq), paste0(rep(c("xlag", "ylag"), each = 7), 1:7) )))
data
aq <- data.frame(x = 1:20, y = 21:40)
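To cover every column without typing the prefixes by hand, the new names can be generated from names(aq). This is a sketch that reuses the .SDcols idea above; cols, lags and new_names are illustrative variable names.

```r
library(data.table)

aq <- data.table(x = 1:20, y = 21:40)

cols <- names(aq)   # every existing column gets lagged
lags <- 1:7

# xlag1 ... xlag7, ylag1 ... ylag7 (column-major, matching the unlist order below)
new_names <- paste0(rep(cols, each = length(lags)), "lag", lags)

# shift() returns a list of lagged vectors per column; flatten one level and assign
aq[, (new_names) := unlist(lapply(.SD, shift, n = lags), recursive = FALSE),
   .SDcols = cols]

print(head(aq[, .(x, xlag1, xlag2, y, ylag7)], 9))
```

With 20 columns this creates 140 new columns in a single assignment, so it may be worth restricting cols to the variables that are actually needed.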

Fit model by group using Data.Table package

How can I fit multiple models by group using data.table syntax? I want my output to be a data.frame with columns for each "by group" and one column for each model fit. Currently I am able to do this using the dplyr package, but can't do this in data.table.
# example data frame
library(data.table)
N <- 100 # N was not defined in the original post; 100 is an assumed value
df <- data.table(
id = sample(c("id01", "id02", "id03"), N, TRUE),
v1 = sample(5, N, TRUE),
v2 = sample(round(runif(100, max = 100), 4), N, TRUE)
)
# equivalent code in dplyr
library(dplyr)
group_by(df, id) %>%
do( model1= lm(v1 ~v2, .),
model2= lm(v2 ~v1, .)
)
# attempt in data.table
df[, .(model1 = lm(v1~v2, .SD), model2 = lm(v2~v1, .SD) ), by = id ]
# Brodie G's solution
df[, .(model1 = list(lm(v1~v2, .SD)), model2 = list(lm(v2~v1, .SD))), by = id ]
Try:
df[, .(model1 = list(lm(v1~v2, .SD)), model2 = list(lm(v2~v1, .SD))), by = id ]
or slightly more idiomatically:
formulas <- list(v1~v2, v2~v1)
df[, lapply(formulas, function(x) list(lm(x, data=.SD))), by=id]
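Once the fitted models sit in list columns, they can be pulled out again with [[1]] inside a grouped call. The following sketch shows one way to extract the slope of each model; the sample size, the seed, and the names fits and slopes are assumptions made for this example.

```r
library(data.table)

set.seed(1)
N <- 90  # assumed sample size, 30 rows per id
df <- data.table(
  id = rep(c("id01", "id02", "id03"), each = 30),
  v1 = sample(5, N, TRUE),
  v2 = runif(N, max = 100)
)

# One row per id; each model is wrapped in list() so it fits in a single cell
fits <- df[, .(model1 = list(lm(v1 ~ v2, .SD)),
               model2 = list(lm(v2 ~ v1, .SD))), by = id]

# Within each group the list column has length 1, so [[1]] yields the lm object
slopes <- fits[, .(slope1 = coef(model1[[1]])[["v2"]],
                   slope2 = coef(model2[[1]])[["v1"]]), by = id]
print(slopes)
```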

Replace all NA with FALSE in selected columns in R

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA values to FALSE, but I don't want to use an explicit loop.
Can plyr do the trick? Thanks.
UPDATE #1
Thanks for quick reply, but what if my dataset is like below:
df <- data.frame(
id = c(rep(1:19),NA),
x1 = sample(c(NA,TRUE), 20, replace = TRUE),
x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)
I only want x1 and x2 to be processed. How can this be done?
If you want to do the replacement for a subset of variables, you can still use the is.na(*) <- trick, as follows:
df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE
IMO using temporary variables makes the logic easier to follow:
vars.to.replace <- c("x1", "x2")
df2 <- df[vars.to.replace]
df2[is.na(df2)] <- FALSE
df[vars.to.replace] <- df2
tidyr::replace_na is an excellent function for this.
library(tidyr)
df %>%
replace_na(list(x1 = FALSE, x2 = FALSE))
This is a great quick fix. The only trick is that you pass a list naming the columns you want to change.
Try this code:
df <- data.frame(
id = c(rep(1:19), NA),
x1 = sample(c(NA, TRUE), 20, replace = TRUE),
x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
replace(df, is.na(df), FALSE)
Updated with another solution, which leaves the NA in id untouched:
df2 <- df <- data.frame(
id = c(rep(1:19), NA),
x1 = sample(c(NA, TRUE), 20, replace = TRUE),
x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
df2[names(df) == "id"] <- FALSE
df2[names(df) != "id"] <- TRUE
replace(df, is.na(df) & df2, FALSE)
You can use the NAToUnknown function in the gdata package
df[,c('x1', 'x2')] = gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')
With dplyr you could also do
df %>% mutate_each(funs(replace(., is.na(.), F)), x1, x2)
It is a bit less readable compared to just using replace() but more generic as it allows to select the columns to be transformed. This solution especially applies if you want to keep NAs in some columns but want to get rid of NAs in others.
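mutate_each() and funs() are deprecated in current dplyr releases. Assuming dplyr 1.0.0 or later, the same column-wise replacement can be sketched with across(); the small data frame below is made up so the result is easy to check.

```r
library(dplyr)

# Small made-up data frame; id deliberately contains an NA that must survive
df <- data.frame(
  id = c(1:3, NA),
  x1 = c(TRUE, NA, TRUE, NA),
  x2 = c(NA, NA, TRUE, TRUE)
)

# Replace NA with FALSE only in the selected columns
out <- df %>%
  mutate(across(c(x1, x2), ~ replace(.x, is.na(.x), FALSE)))

print(out)
```

The column selection inside across() accepts the usual tidyselect helpers, so something like starts_with("x") could stand in for the explicit names when there are many columns.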
