Nested Lists with different lengths into dataframe - r

I am using R and the package mongolite to get data from a MongoDB. This results in data consisting of lots of nested lists that cannot be simplified into a data frame with unlist, rbindlist, or bind_rows from dplyr (at least I did not manage to do it).
After a lot of trial and error I found a way to do it using the package reshape2 with the function melt and using dplyr and tidyr to get it into the form I want it to be. However, the melting takes a lot of time (up to 15 mins per list, and I have 6 of them).
Do you have any ideas how I can make this faster? (Of course another possible solution would be to query the MongoDB correctly such that it does not result in lists but in something more like my target data frame).
The following code creates a dummy dataset with similar attributes, the target form of the dataset and my solution to get there.
Dummy Data:
dummy_data <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02')
)
Target Form:
final_data = data.frame(c(30, 10), c(3, 2), c(NA, 5), c('2015-08-04', '2015-08-02'))
names(final_data) = c('actions: link_clicks', 'actions: post_likes', 'actions: page_engagement', 'date')
final_data
Temporary solution
Solution <- reshape2::melt(dummy_data)
Solution <- Solution %>%
select(L1, L2, L3, L4, value) %>%
mutate(L4 = ifelse(is.na(L4), L2, L4)) %>%
spread(key = L4, value = value) %>%
mutate(L2 = ifelse(!is.na(action_type), paste0(L2, ": ", action_type), L2),
value = ifelse(!is.na(value), value, date)) %>%
select(L1, L2, value) %>%
spread(key = L2, value = value) %>%
select(-L1)
If you have any advice on the mongolite query here is the simplest kind of queries I use:
M_DB <- mongolite::mongo(collection = "name", url = "url")
M_DB_List <- M_DB$iterate()$batch(size = 100000)
Thanks a lot
Edit:
Here is a more complex data structure that is closer to my actual problem:
dummy_data_complex <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04',
currency = 'EUR'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02',
demographics = list(gender = "female",
list(actions = list(action_type = "link_clicks", value = 10)))
))

Here is an option with tidyverse
library(tidyverse)
dummy_data %>%
map_df(~ .x %>%
as_tibble(.) %>%
mutate(actions = map(actions, as_tibble)) %>%
unnest(actions)) %>%
group_by(date, action_type) %>%
mutate(n = row_number()) %>%
spread(action_type, value) %>%
select(-n)
# A tibble: 2 x 4
# Groups: date [2]
# date link_clicks page_engagement post_likes
#* <chr> <dbl> <dbl> <dbl>
#1 2015-08-02 10.0 5.00 2.00
#2 2015-08-04 30.0 NA 3.00

I was able to find a fast solution for my problem.
It was solved by using another query as SymbolixAU suggested.
Instead of using iterate() I used find() which then resulted in a dataframe with nested lists inside it.
From this point on I was easily able to get to my target using tidyr::unnest()
Thanks for your help.
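For readers who want to see what that last step might look like, here is a minimal sketch. It does not query MongoDB; the `nested` tibble below is a hand-built stand-in, and it assumes the result of find() has one row per document with `actions` as a list-column holding one data frame per row. The names `nested` and `wide` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hand-built stand-in for the assumed shape of the find() result:
# one row per document, `actions` is a list-column of data frames.
nested <- tibble(
  date = c("2015-08-04", "2015-08-02"),
  actions = list(
    data.frame(action_type = c("link_clicks", "post_likes"),
               value = c(30, 3), stringsAsFactors = FALSE),
    data.frame(action_type = c("link_clicks", "post_likes", "page_engagement"),
               value = c(10, 2, 5), stringsAsFactors = FALSE)
  )
)

# Expand the list-column to one row per action, then spread the
# action types into columns; missing combinations become NA.
wide <- nested %>%
  unnest(actions) %>%
  pivot_wider(names_from = action_type, values_from = value)

wide
```

Because the list-column is already rectangular per row, no melting of the whole nested list is needed, which is likely where the speed-up comes from.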

Related

Group dataframe row and column wise based on other dataframe?

I have a dataframe that I would like to group in both directions, first row-wise and then column-wise. The first part worked well, but I am stuck on the second. I would appreciate any help or advice for a solution that does both steps at the same time.
This is the dataframe:
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
This is the second dataframe, which holds the "recipe" for grouping:
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
Row-wise grouping:
df1_grouped<-bind_cols(df1[1:2], map_df(df2, ~rowSums(df1[unique(.x)])))
Now I would like to apply the same grouping to the ID2 column and sum the values in the other columns. My idea was to add another column (e.g. "group") that contains the name of the final group of ID2. After this I can use group_by() and summarise() to calculate the sum for each group. However, I can't figure out an automated way to do it:
bind_cols(df1_grouped,
#add group label
data.frame(
group = rep(c("Group_2","Group_1","Group_1","Group_2","Group_3"),2))) %>%
#remove temporary label column and make ID a character column
mutate(ID2=group,
ID=as.character(ID))%>%
select(-group) %>%
#summarise
group_by(ID,ID2)%>%
summarise_if(is.numeric, sum, na.rm = TRUE)
This is the final table I need, but I had to assign the groups manually, which is impossible for big datasets.
Here is one possible solution:
library(tidyverse)
set.seed(1)
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
df2 <- df2 %>% pivot_longer(everything())
df1 %>%
pivot_longer(-c(ID, ID2)) %>%
mutate(gr_r = df2$name[match(ID2, table = df2$value)],
gr_c = df2$name[match(name, table = df2$value)]) %>%
arrange(ID, gr_r, gr_c) %>%
pivot_wider(id_cols = c(ID, gr_r), names_from = gr_c, values_from = value, values_fn = list(value = sum))

How can I pass an argument as a character string into a function?

Here is a minimal example.
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
x %>% group_by(Treatment) %>%
summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to.
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How can I solve this problem?
Also, how can I pass a character string that indicates the mean and SD belong to Price?
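Price isn't a standalone variable; it only exists as a column of your data frame df, so you need to pass it as df$Price. To label the rows, add the variable name z as a column with mutate():
SummarizeFn <- function(x, y, z) {
  # Caution: y is the full column vector, so mean(y) and sd(y) are computed
  # over all rows and repeated for every Treatment group.
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  # Add a column showing which variable the statistics belong to.
  df1 <- df1 %>% mutate(Var = z)
  return(df1)
}
SumPrice <- SummarizeFn(df, df$Price, "Price")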
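If per-group statistics are what is wanted, one possible variant is to pass only the column name as a string and look it up inside summarize() with dplyr's `.data` pronoun. This is a sketch, not the answerer's method; the function name `summarize_by_treatment` and the argument `col_name` are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  Treatment = c(rep("A", 2), rep("B", 2)),
  Price = 1:4,
  Cost = 2:5
)

# Hypothetical helper: `col_name` is a column name given as a string.
# .data[[col_name]] is evaluated within each Treatment group, so the
# mean and SD are per group rather than over the whole column.
summarize_by_treatment <- function(data, col_name) {
  data %>%
    group_by(Treatment) %>%
    summarize(
      n = n(),
      Mean = mean(.data[[col_name]]),
      SD = sd(.data[[col_name]])
    ) %>%
    mutate(Var = col_name)  # label rows with the variable name
}

# Apply the helper to several variables and stack the results.
all_stats <- bind_rows(
  lapply(c("Price", "Cost"), summarize_by_treatment, data = df)
)

all_stats
```

The same string serves both purposes here: it selects the column and labels the output, so there is no need for separate y and z arguments.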

Multiple group_by with dplyr

There are a lot of questions about this on the forum, but I could not get it to work. I have a data frame with a bunch of categorical variables (class factor) and a target column (1 or 0). I want to compute the frequency of 1's within each level of each categorical variable, and I want to do those 3 group_by computations at once.
library(dplyr)
# Build the toy dataset
target = sample(x = c(0,1),size = 100,replace = T)
cat1 = sample(x = c("a","b","c"),size = 100,replace = T)
cat2 = sample(x = c("x","y","z"),size = 100,replace = T)
cat3 = sample(x = c("T","U","V"),size = 100,replace = T)
df = data.frame(target,cat1,cat2,cat3)
# How can I do these 3 group_by computations at once, knowing that in reality I have thousands of such categorical columns?
df %>%
group_by(cat1) %>%
summarise(statistics = mean(target))
df %>%
group_by(cat2) %>%
summarise(statistics = mean(target))
df %>%
group_by(cat3) %>%
summarise(statistics = mean(target))
If I understand your question correctly, I believe this code can help you:
df %>%
group_by(cat1,cat2,cat3) %>%
summarise(statistics = mean(target)) %>% arrange(cat1,cat2)
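Note that grouping by all three columns together yields one row per combination of levels, which differs from the three separate summaries in the question. If the separate per-column summaries are what is needed, one possible sketch is to reshape the categorical columns into long form first and then group by column name and level. The names `variable`, `level`, and `stats` are made up for illustration, and the seed is added only to make the toy data reproducible.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # assumption: fixed seed for a reproducible toy dataset
df <- data.frame(
  target = sample(c(0, 1), size = 100, replace = TRUE),
  cat1 = sample(c("a", "b", "c"), size = 100, replace = TRUE),
  cat2 = sample(c("x", "y", "z"), size = 100, replace = TRUE),
  cat3 = sample(c("T", "U", "V"), size = 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Stack every categorical column into (variable, level) pairs, then
# compute the share of 1's per level of each variable in one pass.
stats <- df %>%
  pivot_longer(-target, names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(statistics = mean(target), .groups = "drop")

stats
```

This scales to any number of categorical columns without listing them, at the cost of a long intermediate table with one row per original row and column.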

How do you compare means row-wise for the same ratings object in the R expss package?

I have repeated measures data with two ratings (reliable and fast) repeated on two different objects, (each survey respondent rates each object using the same two ratings measures). I would like to have two columns, one for object 1 and one for object 2, with the ratings displayed in two separate rows.
The reference manual mentions using a | separator to compare two variables, but the example given is for mrsets, not means, and I'm not sure how to do the same with means while keeping them in separate data frame columns.
In the code below, the problem is that instead of placing the means side by side (for comparison) they are stacked on top of each other.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE))
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot()
I discovered that if I add variable labels first and use 'tab_pivot(stat_position = "inside_columns")' it solved the problem.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE)
)
#labels
df = apply_labels(df,
q1_reliable = "reliable",
q1_fast = "fast",
q2_reliable = "reliable",
q2_fast = "fast")
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot(stat_position = "inside_columns")
Like this data.table approach?
library(data.table)
#melt first
DT <- melt( setDT(df),
measure.vars = patterns( reliable = "reliable", fast = "fast"),
variable.name = "q")
#then summarise
DT[, lapply(.SD, mean), by = .(q), .SDcols = c("reliable", "fast")]
q reliable fast
1: 1 3.04 2.96
2: 2 2.92 2.96

dplyr check values between first and last permanent change

I have data structured like this
set.seed(2)
require(tidyverse)
data <- data.frame("TIME" = c(sample(seq(1:20), 20, replace = F), seq(21:30)),
"ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
"LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9))))
and I'm trying to use dplyr to create a variable to indicate whether or not a given ID has a change in time that is permanent (PERMANENT =1) or not (PERMANENT=0). I can't use first() or last() because that ignores the in between values. If for instance they go from X to Y and back to X again like in the case of A and B, the indicator should be 0 for every instance of A and B in the data. However C starts at X and stays at Y in all other instances.
I tried to use indexing in a mutate function, but something isn't working.
data %>%
arrange(ID, TIME) %>%
group_by(ID)%>%
mutate(LOC = as.character(LOC),
PERMANENT = ifelse(last(LOC) != "X" & any(LOC[2:length(ID) -1]) != "X"), 1, 0)
Like I said the output should indicate C moved permanently, while A and B bounced around in the data set.
the expected output is what happens if you run the following code:
data$PERMANENT<-ifelse(data$ID%in%c("A","B"),0,1)
Here's a go at it with dplyr. I am assuming an ID is permanent if it hasn't changed since its second observed time period.
library(dplyr)
set.seed(2)
data <- data.frame(
  "TIME" = c(sample(seq(1:20), 20, replace = F), seq(21:30)),
  "ID" = c(rep("A", 10), rep("B", 10), rep("C", 10)),
  "LOC" = c(sample(c("X", "Y"), 20, replace = T), c("X", rep("Y", 9)))
)
data %>% arrange(ID, TIME) %>%
group_by(ID) %>%
mutate(timeObs = row_number(), SecondLoc = LOC[timeObs == 2], Change = LOC != SecondLoc) %>%
filter(timeObs > 1) %>%
summarize(Permanent = sum(Change) == 0 ) %>%
right_join(data, by = 'ID')
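An alternative reading of "permanent" is that an ID changes location exactly once over time and never changes again. Under that assumption (which may not match every case the asker has in mind, for example an ID that never moves is flagged 0), a sketch could count the changes between consecutive rows with lag(). The small `moves` data frame is a hand-made example rather than the sampled data above, so that the outcome is easy to check by eye.

```r
library(dplyr)

# Hand-made example: A moves away and back, B never moves, C moves once.
moves <- data.frame(
  TIME = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  ID   = rep(c("A", "B", "C"), each = 3),
  LOC  = c("X", "Y", "X",  "Y", "Y", "Y",  "X", "Y", "Y"),
  stringsAsFactors = FALSE
)

flagged <- moves %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  # Count how often LOC differs from the previous row within each ID;
  # exactly one change is treated as a permanent move (assumption).
  mutate(PERMANENT = as.integer(sum(LOC != lag(LOC), na.rm = TRUE) == 1)) %>%
  ungroup()

flagged
```

Here only C receives PERMANENT = 1, since A changes twice and B not at all.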
