Rules to change words

I have a dataframe like this:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
I would like to implement rules like the following:
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")
and get a result like this:
df_ne <- data.frame(id = c(1,2), keywords = c("stocks, climate, cuisine", "climate, cuisine"))
How can I do this?

You can use str_replace_all from the stringr package:
library(dplyr)
library(stringr)
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"))
df %>%
mutate(keywords = str_replace_all(keywords,
c("google, yahoo" = "stocks","air" = "climate", "cookie" = "cuisine")))

I liked cholland's answer (+1), but you can also use tidytext::unnest_tokens(), which is going to be easier, IMHO, if you're going to have many more than six words.
First, create a mapping data frame:
mapped <- rbind (data.frame(word_a = stocks, type = "stock", stringsAsFactors = F),
data.frame(word_a = climate, type = "climate", stringsAsFactors = F),
data.frame(word_a = cuisine, type = "cuisine", stringsAsFactors = F))
Now you can use unnest_tokens() to split both the mapping and your data into one word per row, then join the two:
library(tidytext)
library(stringr)
library(tidyverse)
mapped <- mapped %>% unnest_tokens(word, word_a)
df %>%
unnest_tokens(word, keywords) %>% # split words
left_join(mapped, by = "word") %>% # join to the mapping
group_by(id) %>% # group
summarise(keywords = str_c(unique(type), collapse = ",")) # collapse the word (unique)
# A tibble: 2 x 2
id keywords
<dbl> <chr>
1 1 stock,climate,cuisine
2 2 cuisine,climate
Note that the second row has the words in the opposite order from your expected output, because the corresponding words appear in that order in the original df.
With this data:
df <- data.frame(id = c(1,2), keywords = c("google, yahoo, air, cookie", "cookie, air"), stringsAsFactors = F)
stocks <- c("google, yahoo")
climate <- c("air")
cuisine <- c("cookie")

Here is a naïve solution to start with:
key <- list(
stocks = c("google", "yahoo"),
climate = "air",
cuisine = "cookie"
)
df2 <- df
# replace each keyword by the name of its group
for (k in seq_along(key)) {
  for (sk in key[[k]]) {
    df2$keywords <- gsub(sk, names(key)[k], df2$keywords, fixed = TRUE)
  }
}
# remove duplicated items and collapse back to one string per row
df2$keywords <- sapply(strsplit(df2$keywords, ", "), function(l) paste(unique(l), collapse = ", "))
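A variation on the same idea is sketched below, using a named lookup vector instead of nested loops. The names lookup and recode_keywords are made up for this illustration. Because each keyword is matched as a whole word, a rule such as "air" cannot fire inside a longer word, and keywords without a rule are kept unchanged.

```r
df <- data.frame(id = c(1, 2),
                 keywords = c("google, yahoo, air, cookie", "cookie, air"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: names are keywords, values are categories
lookup <- c(google = "stocks", yahoo = "stocks",
            air = "climate", cookie = "cuisine")

# Split one comma-separated string, map each word, drop duplicates, re-join
recode_keywords <- function(x, lookup) {
  words <- strsplit(x, ",\\s*")[[1]]
  mapped <- unname(lookup[words])
  # keep the original word when there is no rule for it
  mapped[is.na(mapped)] <- words[is.na(mapped)]
  paste(unique(mapped), collapse = ", ")
}

df$keywords <- vapply(df$keywords, recode_keywords, character(1),
                      lookup = lookup, USE.NAMES = FALSE)
df
```

As with the other answers, the categories come out in the order in which the keywords appear in each row.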

Related

Vectorization to extract and bind very nested data

I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]@data
b <- df[[2]][["result"]]@coords
desired_df <- cbind(a, b)
My original Large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this. I also no longer know how to write loops. There are definitely some elements in there that are junk and will fail.

You haven't provided a minimal example of the data, so I've assumed it looks something like this:
base_data <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
base_data2 = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
rownames(base_data2) = c("d", "e", "f")
methods::setClass(
"weird_object",
slots = c(data = "data.frame", coords = "matrix"),
prototype = list(data = base_data, coords = base_data2)
)
df <- list(
list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
)
)
If I had a list of such objects, I could do:
library(tidyverse)
df %>%
  map(. %>% {
    list(data = .$result@data,
         coords = .$result@coords)
  }) %>%
  enframe() %>%
  unnest_wider(value)
But the selecting / hoisting function might fail, so you can wrap it in purrr::possibly() and choose a reasonable default:
df %>%
  map(possibly(. %>% {
    list(data = .$result@data,
         coords = .$result@coords)
  },
  otherwise = list(data = NA, coords = NA))) %>%
  enframe() %>%
  unnest_wider(value)
Hopefully, this could be a step forward.
Next step is probably something resembling this:
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value) %>%
mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
unnest(cols = c(data, coords)) %>%
#' rotating the thing now
pivot_longer(cols = c(group, rowid),
names_to = "var_name",
values_to = "var") %>%
select(-var_name) %>%
pivot_longer(cols = c(var1, var2, V1, V2, V3),
names_to = "var_name") %>%
pivot_wider(names_from = var, values_from = value) %>%
identity()
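For comparison, here is a minimal base R sketch of the original cbind idea applied to every element, skipping the ones that fail. It assumes each element holds an S4 object with data and coords slots, as in the mock-up above; the names big_list and bind_one are invented for this illustration.

```r
library(methods)

# Assumed stand-in for the real objects: an S4 class with two slots
setClass("weird_object", slots = c(data = "data.frame", coords = "matrix"))
obj <- new("weird_object",
           data = data.frame(group = c("a", "b")),
           coords = matrix(1:4, nrow = 2, dimnames = list(NULL, c("x", "y"))))

# The middle element is junk and will fail on slot access
big_list <- list(list(result = obj), list(result = "junk"), list(result = obj))

# cbind the two slots; return NULL instead of stopping on a bad element
bind_one <- function(el) {
  tryCatch(cbind(el$result@data, el$result@coords),
           error = function(e) NULL)
}

pieces <- lapply(big_list, bind_one)
pieces <- Filter(Negate(is.null), pieces)  # drop the failures
combined <- do.call(rbind, pieces)
combined
```

With a 3.3 GB list it may be worth checking which elements returned NULL before discarding them, so that the junk entries can be inspected separately.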

If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("@data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the @ is not special, and is just part of a name parsed from JSON or some other format.
dont_at_me <- function(x){
a <- x[[1]][["result"]][["@data"]]
a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))

Can you pipe data into a pairwise.t.test?

I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector

You can use curly braces:
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
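Why the braces are needed: when the dot appears only inside nested expressions such as .$meanRT, magrittr still passes the piped data frame as the first unnamed argument. With x and g supplied by name, that data frame lands in the next free parameter, p.adjust.method, which is what the match.arg error is complaining about. The braces suppress this automatic insertion. A sketch of an alternative that avoids the dot altogether is with(), shown here with the base pipe (this assumes R 4.1 or later, and reuses the column names from the reproducible example):

```r
set.seed(1)
df <- data.frame(X1 = runif(1000),
                 subj = rep(c("A", "B"), times = 500))

# with() evaluates the call using the data frame's columns,
# so no placeholder is needed and no intermediate object is created
res <- df |>
  with(pairwise.t.test(X1, subj, paired = TRUE))
res
```

The same with() call also works after magrittr's %>%, so it can be appended directly to a group_by() and summarise() chain.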

How do you compare means row-wise for the same ratings object in the R expss package?

I have repeated measures data with two ratings (reliable and fast) repeated on two different objects, (each survey respondent rates each object using the same two ratings measures). I would like to have two columns, one for object 1 and one for object 2, with the ratings displayed in two separate rows.
The reference manual mentions using a | separator to compare two variables, but the example given is for mrsets, not means, and I'm not sure how to do the same with means while keeping them in separate data frame columns.
In the code below, the problem is that instead of placing the means side by side (for comparison) they are stacked on top of each other.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE))
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot()

I discovered that adding variable labels first and using 'tab_pivot(stat_position = "inside_columns")' solves the problem.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE)
)
#labels
df = apply_labels(df,
q1_reliable = "reliable",
q1_fast = "fast",
q2_reliable = "reliable",
q2_fast = "fast")
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot(stat_position = "inside_columns")

Like this data.table approach?
library(data.table)
#melt first
DT <- melt( setDT(df),
measure.vars = patterns( reliable = "reliable", fast = "fast"),
variable.name = "q")
#then summarise
DT[, lapply(.SD, mean), by = .(q), .SDcols = c("reliable", "fast")]
q reliable fast
1: 1 3.04 2.96
2: 2 2.92 2.96

r - Split dataframe into multiple dataframes and save in environment

This is a follow-up to this question:
split into multiple subset of dataframes with dplyr:group_by?
Reproducible example:
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
I'm interested in how to save the dataframes from the following output:
test %>%
group_by(a) %>%
nest() %>%
select(data) %>%
unlist(recursive = F)
as separate dataframes in the environment. The desired output is the following:
data1 <- data.frame(a = c(1,1,1), b = c(1:3))
data2 <- data.frame(a = c(2,2,2), b = c(4:6))
data3 <- data.frame(a = c(3,3,3), b = c(7:9))
There are many groups so automation is required giving: data1,data2,data3, ... data(n) dataframes.

If you want the dataframe names to be created automatically as well, you could try something like this.
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
test
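n <- length(unique(test$a))
# build and evaluate one assignment per group: data1, data2, ..., data(n)
eval(parse(text = paste0("data", seq_len(n), " <- ", split(test, test$a))))
# the pieces are deparsed as plain lists, so convert each back to a data frame
eval(parse(text = paste0("data", seq_len(n), " <- as.data.frame(data", seq_len(n), ")")))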
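An alternative sketch that avoids eval(parse()) is to name the pieces returned by split() and copy them into the environment with list2env(). The variable pieces is a name chosen for this illustration, and writing into the global environment is assumed to be what is wanted here.

```r
test <- data.frame(a = c(1, 1, 1, 2, 2, 2, 3, 3, 3), b = 1:9)

# One data frame per distinct value of a
pieces <- split(test, test$a)

# Name them data1, data2, ... in group order
names(pieces) <- paste0("data", seq_along(pieces))

# Create one variable per list element in the global environment
list2env(pieces, envir = globalenv())

data2
```

In many cases keeping the named list and working with pieces[["data2"]] is easier to manage than many separate variables, but the call above produces the requested objects.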

Map term indices to words in sparklyr ml_count_vectorizer

I cannot figure out how to map term indices produced by ft_count_vectorizer in sparklyr back to vocabulary words. The output of lda models only has term indices, not words, so it is hard to make sense of the output without being able to map indices to words. Example below.
library(sparklyr)
library(dplyr)
# connection
sc <- spark_connect(master = 'local')
# fake data
fake_data <- data.frame(a = c(1, 2, 3, 4),
b = c("the groggy", "frog was",
"a very groggy", "frog"))
fake_tbl <- copy_to(sc, df = fake_data, overwrite = TRUE)
# count vectorizer
fake_vectorizer <- fake_tbl %>%
ft_tokenizer(input_col = 'b', output_col = 'tokens') %>%
ft_count_vectorizer(input_col = 'tokens', output_col = 'features')
# model
fake_model <- fake_vectorizer %>%
ml_lda(features_col = 'features', k = 2)
fake_model$topicsMatrix
# Which indices correspond to which words?