data.table is copied when using spread()? - r

Here is a sample data.table.
set.seed(123)
mydt <- data.table(id = 1:100, x = sample(LETTERS[1:6], size = 100, replace = TRUE), group = paste0("group", sample(1:3, size = 100, replace = TRUE)), prob = runif(100, 0, 1))
I use tidyr::spread to go from long to wide format:
mydt2 <- mydt %>% spread(group, prob)
Then I want to define new columns as in
mydt2[!is.na(group1), new.col := x]
If I do this, I get the following warning
Warning message:
In `[.data.table`(mydt2, !is.na(group1), `:=`(myscale, x)) :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy
etc, while if I run this instead
mydt2 <- copy(mydt %>% spread(group, prob))
mydt2[!is.na(group1), myscale := x]
I don't get any warning. I don't understand this behaviour. Can anyone offer help? Is using copy() the right way to address this?

I think using dcast is the way to go. However, a possible solution using tidyr::spread would be to add setDT() to the piped call, i.e.,
set.seed(123)
# install.packages(c("data.table", "tidyr", "magrittr"), dependencies = TRUE)
library(data.table)
library(magrittr) # provides %>%
mydt <- data.table(id = 1:100, x = sample(LETTERS[1:6], size = 100, replace = TRUE),
group = paste0("group", sample(1:3, size = 100, replace = TRUE)),
prob = runif(100, 0, 1)
)
class(mydt)
mydt2 <- mydt %>% tidyr::spread(group, prob) %>% setDT()
mydt2[!is.na(group1), new.col := x]
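For reference, the dcast route mentioned at the top of this answer might look like the sketch below. It assumes that id and x together identify a row, which holds for the sample data. dcast returns a regular data.table, so := should work on the result without the selfref warning.

```r
library(data.table)

set.seed(123)
mydt <- data.table(id = 1:100,
                   x = sample(LETTERS[1:6], size = 100, replace = TRUE),
                   group = paste0("group", sample(1:3, size = 100, replace = TRUE)),
                   prob = runif(100, 0, 1))

# Long to wide: one column per level of group, filled with prob
mydt2 <- dcast(mydt, id + x ~ group, value.var = "prob")

# Add a column by reference on the reshaped table
mydt2[!is.na(group1), new.col := x]
```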

Related

chi square over multiple groups and variables

I have a huge dataset with several groups (factors with between 2 to 6 levels), and dichotomous variables (0, 1).
example data
DF <- data.frame(
group1 = sample(x = c("A","B","C","D"), size = 100, replace = T),
group2 = sample(x = c("red","blue","green"), size = 100, replace = T),
group3 = sample(x = c("tiny","small","big","huge"), size = 100, replace = T),
var1 = sample(x = 0:1, size = 100, replace = T),
var2 = sample(x = 0:1, size = 100, replace = T),
var3 = sample(x = 0:1, size = 100, replace = T),
var4 = sample(x = 0:1, size = 100, replace = T),
var5 = sample(x = 0:1, size = 100, replace = T))
I want to do a chi square for every group, across all the variables.
library(tidyverse)
library(rstatix)
chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc
I managed to make it work by using two nested for loops, but I'm sure there is a better solution
groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")
results <- data.frame()
for(i in groups){
for(j in vars){
test <- chisq_test(DF[,i], DF[,j])
test <- mutate(test, group=i, var=j)
results <- rbind(results, test)
}
}
results
I think I need some kind of apply function, but I can't figure it out
Here is one way to do it with apply. I am sure there is an even more elegant way to do it with dplyr. (Note that here I extract the p.value of the test, but you can extract something else or the whole test result if you prefer).
res <- apply(DF[,1:3], 2, function(x) {
apply(DF[,4:8], 2,
function(y) {chisq.test(x,y)$p.value})
})
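If a long table with one row per group/variable pair is preferred over a matrix of p-values, a base R variation could be sketched as follows. It uses stats::chisq.test rather than rstatix::chisq_test, and the names combos and p.value are chosen here for illustration only.

```r
set.seed(1)
DF <- data.frame(
  group1 = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  group2 = sample(c("red", "blue", "green"), 100, replace = TRUE),
  group3 = sample(c("tiny", "small", "big", "huge"), 100, replace = TRUE),
  var1 = sample(0:1, 100, replace = TRUE),
  var2 = sample(0:1, 100, replace = TRUE),
  var3 = sample(0:1, 100, replace = TRUE),
  var4 = sample(0:1, 100, replace = TRUE),
  var5 = sample(0:1, 100, replace = TRUE))

# Pick the columns by name so no column index can be miscounted
groups <- grep("^group", names(DF), value = TRUE)
vars   <- grep("^var", names(DF), value = TRUE)

# Every group/variable pair, one row each
combos <- expand.grid(group = groups, var = vars, stringsAsFactors = FALSE)

# One chi-square test per pair, keeping only the p-value
combos$p.value <- mapply(
  function(g, v) chisq.test(DF[[g]], DF[[v]])$p.value,
  combos$group, combos$var, USE.NAMES = FALSE)

combos
```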
Here's a quick and easy dplyr solution, that involves transforming the data into long format keyed by group and var, then running the chi-sq test on each combination of group and var.
DF %>%
pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
group_by(group, var) %>%
summarise(chisq_test(group_val, var_val)) %>%
ungroup()

Bin data within a group using breaks from another DF

How can I avoid the for loop in the following code to speed up the computation? (The real data is about 1e6 times larger.)
id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)
bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)
for (i in 1:5) {
df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
bktpts[bktpts$groupid == i, "value"],
include.lowest = TRUE, labels = F)
}
I'm not sure why your bktpts is formatted the way it is.
But here is a data.table solution that should be (at least a bit) faster than your for loop.
library( data.table )
setDT(df)[ setDT(bktpts)[, `:=`( id = seq_len(.N),
value_next = shift( value, type = "lead", fill = 99999999 ) ),
by = .(groupid) ],
bin := i.id,
on = .( groupid, value >= value, value < value_next ) ][]
Another way:
library(data.table)
setDT(df); setDT(bktpts)
bktpts[, b := rowid(groupid) - 1L]
df[, b := bktpts[copy(.SD), on=.(groupid, value), roll = -Inf, x.b]]
# check result
df[, any(b != bin)]
# [1] FALSE
See ?data.table for how rolling joins work.
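As a rough illustration of what roll = -Inf does here, the toy tables below (breaks and obs are made-up names and values) show each observation picking up the label of the next break point at or above it, which matches the right-closed intervals that cut() produces for values above the lowest break.

```r
library(data.table)

# Break points with a 0-based label, like b := rowid(groupid) - 1L above
breaks <- data.table(value = c(0, 10, 20), b = 0:2)
obs    <- data.table(value = c(3, 10, 15, 20))

# roll = -Inf: when there is no exact match, the next break point is
# carried backward, so 3 takes the label of 10 and 15 takes the label of 20
rolled <- breaks[obs, on = .(value), roll = -Inf, x.b]

# Same bins computed with cut() for comparison
binned <- cut(obs$value, breaks$value, include.lowest = TRUE, labels = FALSE)
```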
I came out with another data.table answer:
library(data.table) # load package
# set to data.table
setDT(df)
setDT(bktpts)
# Make a join
df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]
# define the bins:
df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]
# remove the unneeded bks column
df[, bks := NULL]
Explaining the code:
bktpts[, list(.(value)), by = groupid] is a new table that holds, in a list column, all the values of value for each groupid. If you run it alone, you'll see where we're going.
bks := V1 assigns to the variable bks in df whatever exists in V1, which is the name of the list column in the previous table. on = "groupid" names the variable on which we make the join.
The code defining the bins needs little explanation, except for the bks[[1]] bit. It needs to be [[ in order to access the list values and provide a vector, as required by the cut function.
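To make the intermediate table concrete, here is a small sketch on toy break points (the values and the names bks_tbl, b1 and b2 are illustrative, and the list column is named bks explicitly instead of relying on the automatic V1).

```r
library(data.table)

# Two groups with different break points
bktpts <- data.table(groupid = rep(1:2, each = 3),
                     value   = c(0, 50, 100, 0, 10, 100))

# One row per group; all breaks of a group are packed into one list cell
bks_tbl <- bktpts[, .(bks = list(value)), by = groupid]
print(bks_tbl)

# [[ extracts the plain numeric vector that cut() needs as breaks
b1 <- cut(c(5, 30), bks_tbl$bks[[1]], include.lowest = TRUE, labels = FALSE)
b2 <- cut(c(5, 30), bks_tbl$bks[[2]], include.lowest = TRUE, labels = FALSE)
```

The same values land in different bins depending on the group's breaks, which is why the bins are computed by groupid.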
EDIT TO ADD:
All data.table commands can be chained in a -rather unintelligible- single call:
df[bktpts[, list(.(value)), by = groupid],
bks := V1,
on = "groupid"][,
bin := cut(value,
bks[[1]],
include.lowest = TRUE,
labels = FALSE),
by = groupid][,
bks := NULL]

How do you compare means row-wise for the same ratings object in the R expss package?

I have repeated measures data with two ratings (reliable and fast) repeated on two different objects, (each survey respondent rates each object using the same two ratings measures). I would like to have two columns, one for object 1 and one for object 2, with the ratings displayed in two separate rows.
The reference manual mentions using a | separator to compare two variables, but the example given is for mrsets, not means. I'm not sure how to do the same with means and keep them in separate data frame columns.
In the code below, the problem is that instead of placing the means side by side (for comparison) they are stacked on top of each other.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE))
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot()
I discovered that adding variable labels first and using 'tab_pivot(stat_position = "inside_columns")' solves the problem.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE)
)
#labels
df = apply_labels(df,
q1_reliable = "reliable",
q1_fast = "fast",
q2_reliable = "reliable",
q2_fast = "fast")
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot(stat_position = "inside_columns")
Like this data.table approach?
library(data.table)
#melt first
DT <- melt( setDT(df),
measure.vars = patterns( reliable = "reliable", fast = "fast"),
variable.name = "q")
#then summarise
DT[, lapply(.SD, mean), by = .(q), .SDcols = c("reliable", "fast")]
q reliable fast
1: 1 3.04 2.96
2: 2 2.92 2.96

Utilizing roll functions with data.table

I'm having problems specifically applying functions from the roll package using data.table. I'm attempting to calculate rolling metrics on column DT$obs for each group DT$group. I'm able to calculate rolling metrics using the zoo package, but I'd like to use some of the additional arguments in roll package functions.
Demo of the error is below.
require(data.table)
require(zoo)
require(roll)
# Fabricated Data:
DT <- data.table(group = rep(c("A", "B"), each = 20), obs = runif(40, min = 0, max = 100))
# Calculate a rolling sum (this is working properly)
DT[, RollingSum := lapply(.SD, function(x) zoo::rollsumr(x, k = 5, fill = NA)), by = "group", .SDcols = "obs"]
# Attempt to calculate a rolling z-score (this throws me an error)
DT[, RollingZScore := lapply(.SD, function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)), by = "group", .SDcols = "obs"]
I can't figure out what's different about the zoo function and the roll function. They each return numeric vectors. Any guidance appreciated.
As @Frank describes, the problem is that the result of roll_scale (and thus each element of lapply output) is a matrix. You can either use sapply instead of lapply, or put as.vector in your function definition.
DT[, RollingZScore := sapply(.SD,
function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)),
by = "group", .SDcols = "obs"]
or
DT[, RollingZScore := lapply(.SD,
function(x) as.vector(roll::roll_scale(as.matrix(x), width = 10, min_obs = 5))),
by = "group", .SDcols = "obs"]
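The matrix-versus-vector point can be checked without the roll package. The sketch below swaps in base R's scale(), which also returns a one-column matrix, purely as a stand-in for roll::roll_scale; it is not a rolling computation.

```r
library(data.table)

DT <- data.table(group = rep(c("A", "B"), each = 5),
                 obs   = as.numeric(1:10))

# scale() hands back a one-column matrix, not a plain vector
m <- scale(DT$obs)
is.matrix(m)

# Dropping the dim attribute with as.vector() gives := an ordinary
# numeric vector per group
DT[, z := lapply(.SD, function(x) as.vector(scale(x))),
   by = "group", .SDcols = "obs"]
```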
This can be done with rollapplyr by simply defining a function that returns NA if the input has fewer than 5 elements:
Scale <- function(x) if (length(x) < 5) NA else tail(scale(x), 1)
DT[, rollingScore := rollapplyr(obs, 10, Scale, partial = TRUE), by = "group"]

Subset data.table columns independently

I'm starting with the table dt below and trying to subset each of its columns by the matching element of the list keys:
library(data.table)
set.seed(123)
randomchar <- function(n, w){
chararray <- replicate(w, sample(c(letters, LETTERS), n, replace = TRUE))
apply(chararray, 1, paste0, collapse = "")
}
dt <- data.table(x = randomchar(1000, 3),
y = randomchar(1000, 3),
z = randomchar(1000, 3),
key = c("x", "y", "z"))
keys <- with(dt, list(x = sample(x, 501),
y = sample(y, 500),
z = sample(z, 721)))
I can get the result I want by using a loop:
desired <- copy(dt)
for(i in seq_along(keys)){
keyname <- names(keys)[i]
desired <- desired[get(keyname) %in% keys[[i]]]
}
desired
The question is - Is there a more data.table idiomatic way to do this subset?
I tried using CJ: dt[CJ(keys)], but it takes a very long time.
What about building a mask and filtering dt on it:
dt[Reduce(`&`, Map(function(key, col) col %in% key, keys, dt)),]
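A small variation, sketched on toy data, looks the columns up by name instead of relying on keys and dt being in the same order. The table and key values below are made up for illustration.

```r
library(data.table)

dt <- data.table(x = c("a", "b", "c", "d"),
                 y = c("p", "q", "r", "s"),
                 z = c("k", "l", "m", "n"))

keys <- list(x = c("a", "b", "c"),
             y = c("q", "r", "s"),
             z = c("l", "m"))

# One logical vector per key, matched to its column by name,
# then combined with & so a row must pass every filter
mask <- Reduce(`&`, Map(function(col, key) dt[[col]] %in% key,
                        names(keys), keys))

res <- dt[mask]
```

Only the rows whose x, y and z values all appear in their respective key vectors survive, which is the same result the loop in the question builds step by step.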

Resources