Translate Spark SQL function to "normal" R code

Translate Spark SQL function to "normal" R code - r

I am trying to follow an Vignette "How to make a Markov Chain" (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
This tutorial is interesting, because it is using the same data source as I use. But, a part of the code is using "Spark SQL code" (what I got back from my previous question Concat_ws() function in Sparklyr is missing).
My question: I googled a lot and tried to solve this by myself. But I have no idea how, since I don't know exactly what the data should look like (the author didn't gave an example of his DF before and after the function).
How can I transform this piece of code into "normal" R code (without using Spark) (especially: the concat_ws & collect_list functions are causing trouble
He is using this line of code:
channel_stacks = data_feed_tbl %>%
group_by(visitor_id, order_seq) %>%
summarize(
path = concat_ws(" > ", collect_list(mid_campaign)),
conversion = sum(conversion)
) %>% ungroup() %>%
group_by(path) %>%
summarize(
conversion = sum(conversion)
) %>%
filter(path != "") %>%
collect()
From my previous question, I know that we can replace a part of the code:
concat_ws() can be replaced the paste() function
But again, another part of code is jumping in:
collect_list() # describtion: Aggregate function: returns a list of objects with duplicates.
I hope that I described this question as clear as possible.

paste has the ability to collapse the string vector with a separator that is provided with the collapse parameter.
This can act as a drop in replacement for concat_ws(" > ", collect_list(mid_campaign))
channel_stacks = data_feed_tbl %>%
group_by(visitor_id, order_seq) %>%
summarize(
path = paste(mid_campaign, collapse = " > "),
conversion = sum(conversion)
) %>% ungroup() %>%
group_by(path) %>%
summarize(
conversion = sum(conversion)
) %>%
filter(path != "")

Related

Error in pluck: object not found -- trying to create loop to scrape data from multiple PDFs with uniform formatting

Thanks to other articles on this website, I managed to put together a script that will do the following:
Collect PDF file names from directory and put into a list.
Start a data frame using target data from the first PDF in the directory.
Use loop function to add rows to the original data frame containing the same target data (pulling from the same section of the PDF).
My first two steps work (code below)
file_names <- list.files(pattern = "*.pdf")
df <-
extract_tables(
file = "firstlastname.pdf",
method = "decide",
output = "data.frame"
) %>%
pluck(2) %>%
t() %>%
as.data.frame() %>%
slice(2) %>%
select(1:3) %>%
rename("inst" = "V1",
"date" = "V2",
"field" = "V3")
but my final step throws the following error:
"Error in pluck(., 2) : object 'tmp' not found"
for (i in file_names)
{
new <-
extract_tables(
file = i,
method = "decide",
output = "data.frame"
) %>%
pluck(2) %>%
t() %>%
as.data.frame() %>%
slice(2) %>%
select(1:3) %>%
rename("inst" = "V1",
"date" = "V2",
"field" = "V3") %>%
df[nrow(df) + 1, ] <- new
}
I am confused because I actually made it all the way through successfully a couple times, but I tried it again after closing RStudio and coming back, and it just won't work anymore. I'm a complete beginner just trying to automate my secretary job a little bit, but I'm probably in way over my head. All I can do is Google things, copy and paste code, and try to understand what everything means and how it comes together.
Unfortunately I can't provide my data files because they contain people's personal information, but the final result is supposed to look like a table with about 50 rows and 3 columns. I did take a photo the first time it worked, though:
successful data frame
Thank you for reading. Any tips would be much appreciated!

Problems generating tree diagram with hctreemap2

library(highcharter)
library(dplyr)
library(viridisLite)
library(forecast)
library(treemap)
data("Groceries", package = "arules")
dfitems <- tbl_df(Groceries#itemInfo)
set.seed(10)
dfitemsg <- dfitems %>%
mutate(category = gsub(" ", "-", level1),
subcategory = gsub(" ", "-", level2)) %>%
group_by(category, subcategory) %>%
summarise(sales = n() ^ 3 ) %>%
ungroup() %>%
sample_n(31)
hctreemap2(group_vars = c("category","subcategory"),
size_var = "sales")%>%
hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
Pop: {point.value:,.0f}<br>
GNI: {point.colorValue:,.0f}")
the error is the following
Error in hctreemap2(., group_vars = c("category", "subcategory"), size_var = "sales") : Treemap data uses same label at multiple levels.
I tried everything and it doesn't work out, could someone with experience explain to me what is happening?

When I tried your code, it also stated that the function was deprecated and to use data_to_hierarchical. Although, it's never quite that simple, right? I tried multiple ways to get hctreemap2 to work, but wasn't able to discern that issue. From there I turned to the package recommended data_to_hierarchical. Now that worked without an issue--once I figured out the right type, which in hindsight seemed kind-of obvious.
That being said, this is what I've got:
data_to_hierarchical(data = dfitemsg,
group_vars = c(category,subcategory),
size_var = sales) %>%
hchart(type = "treemap") %>%
hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
Pop: {point.value:,.0f}<br>
GNI: {point.colorValue:,.0f}")
You didn't actually designate a color, so the GNI comes up blank.
Let me know if you run into any issues.
Based on your comment:
I have not found a way to change the color to density, which is what both hctreemap2 and treemap appear to do. The function data_to_heirarchical codes the colors to the first grouping variable or the level 1 variable.
Inadvertently, I did figure out why the function hctreemap2 would not work. It checks to see if any category labels are the same as a subcategory label. I didn't go through all of the data, but I know there is a perfumery perfumery. I don't understand what that's a hard stop. If that is a problem for this call, why wouldn't data_to_heirchical be looking for this issue, as well?
So, I changed the function. First, I called the function itself.
x = hctreemap2
Then I selected it from the environment pane. Alternatively, you can code View(x).
This view is read-only, but it's easier to read than the console. I copied the function and assigned it to its original name with changes. I removed two pieces of the code, which changed nothing structurally speaking to how the chart is created.
I removed the first line of code in the function:
.Deprecated("data_to_hierarchical")
and this code (about a third of the way down)
if (data %>% select(!!!group_syms) %>% map(unique) %>% unlist() %>%
anyDuplicated()) {
stop("Treemap data uses same label at multiple levels.")
}
This left me to recreate the function with this code:
hctreemap2 <- function (data, group_vars, size_var, color_var = NULL, ...)
{
assertthat::assert_that(is.data.frame(data))
assertthat::assert_that(is.character(group_vars))
assertthat::assert_that(is.character(size_var))
if (!is.null(color_var))
assertthat::assert_that(is.character(color_var))
group_syms <- rlang::syms(group_vars)
size_sym <- rlang::sym(size_var)
color_sym <- rlang::sym(ifelse(is.null(color_var), size_var, color_var))
data <- data %>% mutate_at(group_vars, as.character)
name_cell <- function(..., depth) paste0(list(...),
seq_len(depth),
collapse = "")
data_at_depth <- function(depth) {
data %>%
group_by(!!!group_syms) %>%
summarise(value = sum(!!size_sym), colorValue = sum(!!color_sym)) %>%
ungroup() %>%
mutate(name = !!group_syms[[depth]], level = depth) %>%
mutate_at(group_vars, as.character()) %>% {
if (depth == 1) {
mutate(., id = paste0(name, 1))
}
else {
mutate(.,
parent = pmap_chr(list(!!!group_syms[seq_len(depth) - 1]),
name_cell, depth = depth - 1),
id = paste0(parent, name, depth))
}
}
}
treemap_df <- seq_along(group_vars) %>% map(data_at_depth) %>% bind_rows()
data_list <- treemap_df %>% highcharter::list_parse() %>%
purrr::map(~.[!is.na(.)])
colorVals <- treemap_df %>%
filter(level == length(group_vars)) %>% pull(colorValue)
highchart() %>%
hc_add_series(data = data_list, type = "treemap",
allowDrillToNode = TRUE, ...) %>%
hc_colorAxis(min = min(colorVals), max = max(colorVals), enabled = TRUE)
}
Now your code, as originally written will work. You did not change the highcharter package by doing this. So if you think you'll use it in the future save the function code, as well. You will need the library purrr, since you already called dplyr (where most, if any conflicts occur), you could just call tidyverse (which calls several libraries at one time, including both dplyr and purrr).
This is what it will look like with set.seed(10):
If you drill down on the largest block:
It looks odd to me, but I'm guessing that's what you were looking for to begin with.

error in could not find function in r with pipeline

I'm just starting with r, so this may very well be a very simple question but...
I've tried changing the name in 'a' to be more elaborate but this makes no difference
If I try to assign it to a variable
(e.g. baseline <- a %>% filter(Period == "Baseline") %>% group_by(File)%>%
It just tells me:
"Error in a %>% filter(Period == "Baseline") %>% group_by(File) %>% :
could not find function "%>%<-"
I'd really be grateful for any help with this.
It keeps telling me "Error in a(.) : could not find function "a"
and that it is unable to find Baseline_MAP even though it is defined earlier.
in mutate(Delta_MAP = Group_MAP - Baseline_MAP,
a <- read_csv("file.csv")
summary(a)
a %>%
filter(Period == "Baseline") %>%
group_by(File)%>%
summarise(Baseline_MAP = mean(MAP_Mean, na.rm=T),
Baseline_SBP = mean(SBP_Mean, na.rm=T),
Baseline_LaserMc1 = mean(Laser1_Magic, na.rm=T),
Baseline_Laser1 = mean(Laser1_Mean, na.rm=T))%>%
a%>%
filter(Period != "Baseline") %>%
group_by(File)%>%
summarise(Group_MAP = mean(MAP_Mean, na.rm=T),
Group_SBP = mean(SBP_Mean, na.rm=T),
Group_Laser_1Magic = mean(Laser1_Magic, na.rm=T),
Group_Laser_1 = mean(Laser1_Mean, na.rm=T))
a%>%
mutate(Delta_MAP = Group_MAP - Baseline_MAP,
Delta_MAP_Log = log(Group_MAP)-log(Baseline_MAP),
Delta_SBP = Group_SBP - Baseline_SBP,
Delta_SBP_Log = log(Group_SBP)-log(Baseline_SBP),
Delta_Laser1_Magic = Group_Laser_1Magic - Baseline_LaserMc1,
Delta_Laser1_Log = log(Group_Laser_1Magic)-log(Baseline_LaserMc1))

%>% is from the package "dplyr". So make sure you load it, i.e. library(dplyr).
Next, %>% does not assign the result to a variable. I.e.
a %>% mutate(foo=bar(x))
does not alter a. It will just show the result on the console (and none if you are running the script or calling it from a function).
You might be confusing the pipe-operator with %<>% (found in the package magrittr) which uses the left-hand variable as input for the pipe, and overwrites the variable with the modified result.
Finally, when you write
If I try to assign it to a variable (e.g. baseline <- a %>% filter(Period == "Baseline") %>% group_by(File)%>%)
You are assigning the result from the pipeline to a variable baseline -- this however does not modify the variable-names in the data frames (i.e. the column names).

How to properly parse (?) mdsets in expss within a loop?

I'm new to R and I don't know all basic concepts yet. The task is to produce a one merged table with multiple response sets. I am trying to do this using expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
There are 10 multiple response sets therefore I need to create 10 tables in a manner shown above. Then I transpose those tables and merge. To simplify the code (and learn something new) I decided to produce tables using a loop. However nothing works. I'd looked for a solution and I think the most close to correct one is:
#this generates a message: '1' not found
for(i in 1:10) {
assign(paste0("table_undropped",i),1) = df %>%
tab_cells(mdset(assign(paste0("q20s",i,"i1"),1) %to% assign(paste0("q20s",i,"i8"),1)))
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
Still it causes an error described above the code.
Alternatively, an SPSS macro for that would be (published only to better express the problem because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.

%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
table_name = paste0("table_undropped",i)
..$table_name = df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
However, it is not good practice to store all tables as separate variables in the global environment. Better approach is to save all tables in the list:
all_tables = lapply(1:10, function(i)
df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
)
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
tab_total_row_position("none")
for(i in 1:10) {
my_big_table = my_big_table %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_stat_cpct()
}
my_big_table = my_big_table %>%
tab_pivot(stat_position = "inside_columns") # here we say that we need combine subtables horizontally

Same R function works as function/object in Global Env, does not work when loaded from R package

This might be a hard question to field because I can't easily offer a reproducible SNAFU without providing my access token to the Gmail API, but my hope is that I am tripping over an issue that will be clear enough from my description. We'll see...
I wrote a function that takes the gmail message ID of a Google Scholar Alert and parses it into a data frame with a columns for article title, authors, publication, date, etc. My conundrum is that the same code that works when I load it interactively (i.e., load the function into session RAM a la "my_fancy_function <- function(arguments){blah blah})" does not work when I load the function as part of an R package (a la 'devtools::load_all("/mypackage/")'). It's worth noting that the other main function in my package works fine either way, but this one trips up when I try to use it from the package version after loading the package via devtools::load_all("my_misbehaving_package/").
Below is the .R file content of the function that won't come to heel when loaded as part of a package -- and I hereby acknowledge in advance that it's some ugly mess of code, if that will spare me some finger wagging in your answers. This feels like the package version of the classic "strings as factors" SNAFU, but you tell me. The problem seems to occur very early in the function evaluation, and the error I get reads as follows:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "NULL"
And here is my whole sick code for this function:
library(stringr)
library(rvest)
library(plyr)
library(dplyr)
library(lubridate)
library(gmailr)
GScholar_alert_msg_to_df <- function(message_id){
one_message <- message(message_id)
msg_html <- body(one_message)
title <- read_html(msg_html) %>% html_nodes("h3") %>% html_text()
link <- read_html(msg_html) %>% html_nodes("h3 a") %>% html_attr("href")
msg_chunks <- msg_html %>% str_split("<a href") %>% unlist
msg_chunks <- msg_chunks[2:(length(msg_chunks)-2)]
excerpt <- msg_chunks %>% str_replace_all(fixed("<b>"), "") %>%
str_replace_all(fixed("</b>"), "") %>%
str_replace_all(fixed("<br>"), "") %>%
str_extract("<(font size=2 color=#222222|font size=\"-1\")>(.*?)</font>") %>%
unlist %>% str_replace_all("<(font size=2 color=#222222||font size=\"-1\")>", "") %>%
str_replace_all("</font>", "")
author_pub_field <- msg_chunks %>% str_replace_all(fixed("<b>"), "") %>%
str_replace_all(fixed("</b>"), "") %>%
str_extract("<font size=(2|\"-1\") color=#(006621|009933|008000)>(.*?)</font>") %>%
unlist %>%
str_replace_all("<font size=(2|\"-1\") color=#(006621|009933|008000)>", "") %>%
str_replace_all("</font>", "")
one_message_df <- data.frame(title, excerpt, link, author_pub_field, stringsAsFactors = FALSE)
one_message_df$date <- date(one_message) # needs reformatting
one_message_df$date %>% str_replace("[[:alpha:]]{3}, ", "") %>%
str_extract("^.{11}") %>% dmy -> one_message_df$date
one_message_df$author_only <- str_detect(one_message_df$author_pub_field, " - ")
one_message_df$author <- one_message_df$author_pub_field %>%
str_extract("^(.*?) - ") %>% str_replace(" - ", "")
one_message_df$author <- ifelse(one_message_df$author_only == 1, one_message_df$author, one_message_df$author_pub_field)
one_message_df$publication <- one_message_df$author_pub_field %>%
str_extract(" - (.*?)$") %>% str_replace(" - ", "") %>%
str_replace(", [0-9]{4}$", "") %>% str_replace("^[0-9]{4}$", NA)
one_message_df$publication <- str_replace(one_message_df$publication, "^[0-9]{4}$", NA)
one_message_df$author_MIAs <- str_detect(one_message_df$author, "…")
one_message_df$author %>% str_replace("…", " \\.\\.\\.") -> one_message_df$author
one_message_df$pub_name_clipped <- one_message_df$publication %>% str_detect("…")
one_message_df$publication %>% str_replace("…", " \\.\\.\\.") -> one_message_df$publication
return(one_message_df)
Yeah, it's ugly, but again, I promise it works when the code is entered interactively. As for the misbehaving package version, according to RStudio traceback the error seems to happen way up top, I think either the body() or the message() function I'm using from the gmailr package. (Yes, I've tried the code via Terminal to no happier conclusion.) Help me, oh be anyone, you're my only hope.

Thank you to #MrFlick for indicating a way out of the woods here. I was able to resolve the problem by moving that list of packages to the 'Depends' field of the DESCRIPTION file, and I also had to use the '::' operator to make explicit calls to the 'body' and 'date' functions from the gmailr package. Those two names also belong to nonstandard functions that are part of the 'base' namespace and they can't be masked by loading gmailr within the package environment. According to Hadley, I should be using '::' for every use of a function from another package when I'm writing my own.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Translate Spark SQL function to "normal" R code - r

Related

Error in pluck: object not found -- trying to create loop to scrape data from multiple PDFs with uniform formatting

Problems generating tree diagram with hctreemap2

error in could not find function in r with pipeline

How to properly parse (?) mdsets in expss within a loop?

Same R function works as function/object in Global Env, does not work when loaded from R package

Categories

Resources