Map term indices to words in sparklyr ml_count_vectorizer - r

I cannot figure out how to map term indices produced by ft_count_vectorizer in sparklyr back to vocabulary words. The output of LDA models contains only term indices, not words, so the output is hard to interpret without a way to map indices back to words. An example is below.
library(sparklyr)
library(dplyr)
# connection
sc <- spark_connect(master = 'local')
# fake data
fake_data <- data.frame(a = c(1, 2, 3, 4),
b = c("the groggy", "frog was",
"a very groggy", "frog"))
fake_tbl <- copy_to(sc, df = fake_data, overwrite = TRUE)
# count vectorizer
fake_vectorizer <- fake_tbl %>%
ft_tokenizer(input_col = 'b', output_col = 'tokens') %>%
ft_count_vectorizer(input_col = 'tokens', output_col = 'features')
# model
fake_model <- fake_vectorizer %>%
ml_lda(features_col = 'features', k = 2)
fake_model$topicsMatrix
# Which indices correspond to which words?
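One possible approach, sketched here under assumptions: fit the count vectorizer as an estimator instead of applying it directly to the table, so the fitted model stays accessible, then read its vocabulary with sparklyr's ml_vocabulary(). The helper names fit_vocabulary and index_to_word are hypothetical, and the vocabulary vector at the bottom is a stand-in for illustration, not real Spark output. Spark term indices are 0-based, so they need an offset of one when used to index an R vector.

```r
# Hypothetical helper: fit CountVectorizer as an estimator and return its vocabulary.
# Needs a live Spark connection `sc` and a tokenized Spark table; not called here.
fit_vocabulary <- function(sc, tokens_tbl, input_col = "tokens") {
  cv_model <- sparklyr::ml_fit(
    sparklyr::ft_count_vectorizer(sc, input_col = input_col, output_col = "features"),
    tokens_tbl
  )
  sparklyr::ml_vocabulary(cv_model)
}

# Hypothetical helper: Spark term indices start at 0, R vectors start at 1
index_to_word <- function(vocab, term_index) vocab[term_index + 1L]

# Stand-in vocabulary for illustration only (assumed, not produced by Spark)
vocab <- c("groggy", "frog", "the", "was", "a", "very")
index_to_word(vocab, c(0, 2, 5))
```

With a real connection, the vocabulary would come from something like fit_vocabulary(sc, fake_tbl %>% ft_tokenizer(input_col = 'b', output_col = 'tokens')), and the same offset applies to the term indices reported by the LDA model.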

Related

How to plot sjPlots from a nested tibble?

I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).
I think there are some issues with the original code as well as with the data. The function call uses arguments from plot_model that are not supported in plot_models. I first show an example of how plot_models can be called and used with a nested tibble, using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield usable results*. Finally, I create some new toy data to show how the approach could be applied to a binomial model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model so this is unlikely to yield useable results).
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binomial model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)

Can you pipe data into a pairwise.t.test?

I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector
You can use curly braces:
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
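The braces matter because the magrittr pipe otherwise still inserts the data frame as the first argument, which shifts the remaining arguments and produces the match.arg error above. A minimal alternative sketch, assuming base R only: with() evaluates the call inside the data frame, so columns can be referenced by name. The column names follow the reproducible example; the seed is an arbitrary choice.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
df <- data.frame(X1 = runif(1000), subj = rep(c("A", "B"), times = 500))

# with() looks up X1 and subj inside df, so no `.` placeholder is needed
res <- with(df, pairwise.t.test(X1, subj, paired = TRUE))
res$p.value
```

Because the data frame is the first argument of with(), the same call also fits at the end of a pipeline, e.g. df %>% with(pairwise.t.test(X1, subj, paired = TRUE)).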

Rules to change words

I have a dataframe like this:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
I would like to implement rules like the following:
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")
and take the results like this:
df_ne <- data.frame(id = c(1,2), keywords = c("stocks, climate, cuisine", "climate, cuisine"))
How is it possible to make it?
You can use str_replace_all from stringr package
library(dplyr)
library(stringr)
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
df %>%
mutate(keywords = str_replace_all(keywords,
c("google, yahoo" = "stocks","air" = "climate", "cookie" = "cuisine")))
I liked cholland's answer (+1), but you can also use tidytext::unnest_tokens(), which is going to be easier imho if you're going to have many more than six words.
First you can create a mapping df:
mapped <- rbind (data.frame(word_a = stocks, type = "stock", stringsAsFactors = F),
data.frame(word_a = climate, type = "climate", stringsAsFactors = F),
data.frame(word_a = cuisine, type = "cuisine", stringsAsFactors = F))
Now you can use the mentioned function to have a couple of unnested df to reach the goal:
library(tidytext)
library(stringr)
library(tidyverse)
mapped <- mapped %>% unnest_tokens(word, word_a)
df %>%
unnest_tokens(word, keywords) %>% # split words
left_join(mapped) %>% # join to map
group_by(id) %>% # group
summarise(keywords = str_c(unique(type), collapse = ",")) # collapse the word (unique)
# A tibble: 2 x 2
id keywords
<dbl> <chr>
1 1 stock,climate,cuisine
2 2 cuisine,climate
Note that the words in the second row are in a different order from your expected output, because that is the order in which the corresponding words appear in the original df.
With data:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"), stringsAsFactors = F)
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")
Here is a naïve solution to start with :
key <- list(
stocks = c("google", "yahoo"),
climate = "air",
cuisine = "cookie"
)
df2 <- df
#replace by the key
for (k in 1:length(key)){
for(sk in key[[k]]){
df2$keywords <- gsub(sk, names(key)[k], df2$keywords, fixed = TRUE)
}
}
#remove duplicated items
df2$keywords <- sapply(strsplit(df2$keywords, ", "), function(l) paste(unique(l), collapse = ", "))

r - Split dataframe into multiple dataframes and save in environment

This is a follow-up to this question:
split into multiple subset of dataframes with dplyr:group_by? .
Reproducible example:
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
I'm interested in how to save the dataframes from the following output:
test %>%
group_by(a) %>%
nest() %>%
select(data) %>%
unlist(recursive = F)
as separate dataframes in the environment ? The desired output is the following:
data1 <- data.frame(a = c(1,1,1), b = c(1:3))
data2 <- data.frame(a = c(2,2,2), b = c(4:6))
data3 <- data.frame(a = c(3,3,3), b = c(7:9))
There are many groups so automation is required giving: data1,data2,data3, ... data(n) dataframes.
If you want the dataframe names to be created automatically as well, you could try something like this.
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
test
n <- length(unique(test$a))
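# build and evaluate one assignment per group (each piece is deparsed as a list)
eval(parse(text = paste0("data", seq_len(n), " <- ", split(test, test$a))))
# convert each list back to a data frame; use n here rather than a hard-coded 3
eval(parse(text = paste0("data", seq_len(n), " <- as.data.frame(data", seq_len(n), ")")))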
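An alternative sketch that avoids eval(parse()): split the data frame into a named list and copy the elements into the global environment with base R's list2env(). The names data1, data2, ... follow the question; assigning into globalenv() is an assumption about where the objects should live.

```r
test <- data.frame(a = c(1, 1, 1, 2, 2, 2, 3, 3, 3), b = 1:9)

# One data frame per value of `a`, in the order of the sorted group values
pieces <- split(test, test$a)
names(pieces) <- paste0("data", seq_along(pieces))

# Create data1, data2, data3 as separate objects in the global environment
list2env(pieces, envir = globalenv())
data2
```

Keeping the pieces in the named list and indexing it (pieces$data2) is often easier to work with than many separate objects, especially when the number of groups is large.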

sort data into deciles based on a rolling subset

I am trying to replicate the Fama French 1993 paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME=floor(100*abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month==1 & df$exchangeCode=="NYSE","ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month==1,"ME"]))
cut(x = ME_ALL_vals,
breaks = c(-Inf,quantile(ME_NYSE_vals,probs=seq(.1,.9,.1)),+Inf),
labels = 1:10
)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values the code below could do. The point was to generate deciles only for the entries pertaining to the NYSE values but to keep the full data set achieving some form of a partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
character.only = TRUE)
# Transformations
df %<>%
mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
Stock = c("A", "B", "C", "A"),
Month = 1:12,
ME=rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c( "tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
character.only = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME = floor(100*abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# For some reason this was needed
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
arrange(exchangeCode, Month, ME) %>%
group_by(exchangeCode, Month) %>%
mutate(cutsBsdOnNYSE = cut(x = ME,
breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
g = 10, onlycuts = TRUE))) %>%
ungroup() %>%
group_by(cutsBsdOnNYSE) %>%
mutate(grpBsdOnNYSE = n())
It's fairly straightforward and boils down to:
1. Generating cut brackets reflecting a subset of the data.
2. Applying those brackets to the whole vector (ME).
3. Numbering the obtained groups so that a group identifier is created.
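As a hedged sketch of those three steps applied month by month, here is a base-R version. The helper name assign_nyse_deciles and the use of findInterval() are illustrative choices, not the answer's own code. findInterval() is used instead of cut() because it tolerates tied breakpoints, and left.open = TRUE reproduces cut()'s right-closed intervals. The data is built with data.frame() so that ME stays numeric.

```r
# Hypothetical helper: NYSE-based decile breakpoints applied to all stocks of one month
assign_nyse_deciles <- function(month_df) {
  nyse_me <- month_df$ME[month_df$exchangeCode == "NYSE"]
  if (length(nyse_me) == 0) {
    # No NYSE stocks in this month, so no breakpoints can be computed
    month_df$decile <- NA_integer_
    return(month_df)
  }
  # Nine interior breakpoints (10%, ..., 90%) from NYSE stocks only
  inner <- quantile(nyse_me, probs = seq(0.1, 0.9, 0.1), names = FALSE)
  # Interval index 0..9 becomes decile 1..10 for every stock in the month
  month_df$decile <- findInterval(month_df$ME, inner, left.open = TRUE) + 1L
  month_df
}

set.seed(1234)
n <- 120
df <- data.frame(
  Month = rep(1:12, length.out = n),
  exchangeCode = sample(c("NYSE", "NASDAQ", "AMEX"), n, replace = TRUE),
  ME = floor(100 * abs(rnorm(n)))
)

# Split by month, assign deciles within each month, and recombine
by_month <- lapply(split(df, df$Month), assign_nyse_deciles)
result <- do.call(rbind, by_month)
head(result)
```

Months without any NYSE stock get NA deciles here; whether to drop or otherwise handle such months depends on the replication design.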
