How to one-hot encode/generate dummy columns using sparklyr - r

I know there are a number of questions similar to this here, but 1) most of the solutions rely on deprecated functions like ml_create_dummy_variables and 2) other solutions are incomplete.
Is there a function or an approach to easily hot encode a categorical variable into multiple dummy variables in sparklyr?
This post asks for a solution in SparkR; incidentally, a sparklyr solution is given, but it only works when the categories are unique in a given column, which renders it pointless.
This solution results in a single dummy instead of a dummy for each category (it grabs the first category). This is also the solution I stumbled onto (based on this post), which does not cut it:
iris_sdf <- copy_to(sc, iris, overwrite = TRUE)
iris_sdf %>%
  ft_string_indexer(input_col = "Species", output_col = "species_num") %>%
  mutate(cat_num = species_num + 1) %>%
  ft_one_hot_encoder("species_num", "species_dum") %>%
  ft_vector_assembler(c("species_dum"))
I'm looking for a solution that will take Species from the iris dataset and generate three columns, one for each category in Species (virginica, setosa, and versicolor). In plain R, the fastDummies package has what I need, but I'm left wondering how to achieve similar functionality in sparklyr.
Again, I'll note that ml_create_dummy_variables (suggested by this post) produced the following error:
Error in ml_create_dummy_variables(., "species_num", "species_dum") :
  could not find function "ml_create_dummy_variables"
Note: I'm using sparklyr_1.3.1
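For what it's worth, one approach that appears to cover this on sparklyr 1.3.x is to index the string column, one-hot encode it with drop_last = FALSE so that every category gets its own dummy, and then split the resulting vector column with sdf_separate_column(). The sketch below is untested against the asker's setup; the output column names are assumptions, and the position-to-label mapping follows ft_string_indexer()'s frequency-based ordering, so check it before relying on the names:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # or reuse an existing connection
iris_sdf <- copy_to(sc, iris, overwrite = TRUE)

iris_sdf %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  # drop_last = FALSE keeps a dummy for every category instead of n - 1
  ft_one_hot_encoder("species_idx", "species_vec", drop_last = FALSE) %>%
  # split the one-hot vector into one 0/1 column per category;
  # the names below are assumed -- check them against the indexer's labels
  sdf_separate_column("species_vec",
                      into = c("species_cat1", "species_cat2", "species_cat3"))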

Related

ENP function in mutate

Currently, I am cleaning my dataset (Comparative Manifesto Project) and trying to compute the effective number of parties using the enp function from the electoral package (https://www.rdocumentation.org/packages/electoral/versions/0.1.2/topics/enp). However, I am running into some issues.
When I run this code:
cmp_1990 %>%
  mutate(enp_vote = round(pervote, digits = 2)) %>%
  mutate(enp_vote = as.numeric(enp_vote)) %>%
  relocate(enp_vote, .before = parfam) %>%
  mutate(enp_vote = enp(votes = cmp_1990$enp_vote)) %>%
  relocate(enp, .before = parfam)
I get the error message:
Error: Can't subset columns that don't exist.
x Column `enp` doesn't exist.
I suppose R treats enp as a single column name, even though I have installed the package and loaded it with library().
I tried it with differently rounded numbers and by using the enp command outside of the rest of the pipeline, but so far nothing has worked. Oh, and the cmp_1990$enp_vote was necessary because otherwise the enp function treated enp_vote as a categorical rather than a numerical value.
Sorry, by the way, if my code doesn't look the nicest; it's my first time using R, haha.
Thanks very much in advance!
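A likely fix, sketched below under the assumption that the CMP data is grouped by country and election date (the column names countryname and edate are assumptions): the final relocate() fails because the new column is called enp_vote, not enp, and writing cmp_1990$enp_vote inside mutate() bypasses the pipeline and any grouping, so enp() sees the whole column instead of one election's vote shares.
library(dplyr)
library(electoral)

cmp_1990 %>%
  mutate(pervote = as.numeric(pervote)) %>%        # make sure votes are numeric
  group_by(countryname, edate) %>%                 # assumed grouping columns
  mutate(enp_vote = enp(votes = pervote)) %>%      # enp() gets each election's shares
  ungroup() %>%
  relocate(enp_vote, .before = parfam)             # relocate the column that exists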

How do I solve the error object not found after I created this variable using mutate?

I have added a variable that is the sum of all policies for each customer:
mhomes %>% mutate(total_policies = rowSums(select(., starts_with("num"))))
However, when I now want to use this total_policies variable in plots or when using summary() it says: Error in summary(total_policies) : object 'total_policies' not found.
I don't understand what I did wrong or what I should do differently here.
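The most likely cause is that the piped result was never assigned back, so mhomes is unchanged and total_policies only exists in the printed output; even once it is saved, the column has to be referenced through the data frame rather than as a free-standing object. A minimal sketch of that fix:
library(dplyr)

# assign the result back so the new column is actually stored in mhomes
mhomes <- mhomes %>%
  mutate(total_policies = rowSums(select(., starts_with("num"))))

# reference the column through the data frame, not as a bare name
summary(mhomes$total_policies)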
This may be slightly roundabout, but I feel it solves the purpose. Assuming df is the dataset and it has customer_id, policy_id and policy_amount as variables, the command below should work:
req_output = df %>% group_by(customer_id) %>% summarise(total_policies = sum(policy_amount))
If you still face the issue, kindly convert it to a data frame and try plotting:
req_output = as.data.frame(req_output)

Custom Descriptive Statistics Table Export

I have spent a few weeks looking for a way to create a descriptive statistics summary table for my data that is exportable to xlsx (ideally). I have only found partial answers, and my knowledge of R and R packages is still basic enough to limit my progress. My data set is time series data with 6 columns that have 50,000+ rows.
My DF information:
DateTime: POSIXct, format "YYYY-MM-DD HH:MM:SS"
Var1: num
Var2: num
Var3: num
Var4: factor w/ 2 levels "A","B"
Var5: factor w/ 4 levels "S1","S2","S3","S4"
My objectives are as follows:
Manipulate my data frame using tidyverse to subset my data
Take the subsetted data and create one summary table (i.e., in a tibble or data.frame format) with 2 sub-factors (Var4 and/or Var5) for Var1, Var2, and Var3. Below is a simplified, visual example of the table I am aiming for:
Export the summary table (or summary tables if one table is not possible) to xlsx (ideally), .CSV, or .TXT to be used in Excel for stylistic table edits. At the moment, the "writexl" package works very well for me, as I have problems with the "xlsx" and "openxlsx" packages. Here is the code needed to export to xlsx using the writexl package: write_xlsx(dataframe, path = "C:/Users/user/Desktop"). Note for macOS users, path = /Users/admin/yoursubfolder/yoursubfolder.... (fill in "yoursubfolder" with the actual folder name on your computer).
What I have done:
Used dplyr and the %>% function to manipulate the data without and with factor Var4 or Var5
Tried to create a summary table with Var4 as a factor for Var1,Var2, and Var3 (partial success; style is not what I want or it is not exportable to excel)
Looked through multiple Stack Overflow questions and Google searches with no success in finding code that works for my particular case. I've tried qwraps2 to create one and looked into the following packages for something pre-made: psych, stargazer, and Hmisc. I do not like their table styles, and they do not all have the option to show just N, mean, StDev, SEM, Min, and Max.
I know SEM is not a standard function in most packages; thus, I borrowed this function from an answer on Stack Overflow because I do not know how to create functions. Here is the borrowed code: SEM <- function(x) sd(x)/sqrt(length(x))
Since I cannot attach sample data and my coding is very basic, here is what I could come up with:
Example data:
Unfortunately, I cannot attach sample data for testing. Also, due to my limited knowledge of R, I cannot make a perfect data frame. Below is a sample data frame, but I cannot get the factors to be evenly distributed in their respective columns (sorry). Here is my code:
df <- data.frame(
"DateTime" = seq(c(ISOdate(2018,03,01)), by = "day", length.out = 100),
"Var1" = rnorm(1:100),
"Var2" = rnorm(1:100),
"Var3" = rnorm(1:100),
"Var4" = c("A", "B"),
"Var5" = c("S1","S2", "S3", "S4"))
I was trying this:
"S1"[(1:25)],
"S2"[(26:50)],
"S3"[(51:75)],
"S4"[(76:100)] # and
"A"[(1:50], "B"[(51:100)] #but that didn't work, so sorry again.
Despite my lack of proper coding, any guidance, tips, and suggestions from anyone with more experience in R would be greatly appreciated. I do like R and all the capabilities of the software, but I find it very inconvenient that there is no simple, straightforward way to export tables from the console into useful forms like Excel spreadsheets or Word documents, instead of the standard export to LaTeX format (which I do not understand at all, by the way). I know this topic has been discussed in different forums, and others share my sentiment on how frustrating it is, especially for people who need it for data processing rather than for document creation like R Markdown.
Here is an example with your df:
library(dplyr)
library(tidyr)
SEM_function <- function(x){sd(x)/sqrt(length(x))}
df %>%
  as_tibble() %>%
  gather("Var_num", "value", Var1:Var3) %>%
  group_by(Var_num, Var4, Var5) %>%
  summarise("N" = n(),
            "mean" = mean(value),
            "StDev" = sd(value),
            "SEM" = SEM_function(value),
            "min" = min(value),
            "max" = max(value))
Hope this helps
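To cover the export step (objective 3), here is a minimal sketch that stores the summary from the pipeline above and writes it out with writexl, which the asker says already works for them; the file name is just an example:
library(writexl)

summary_tbl <- df %>%
  as_tibble() %>%
  gather("Var_num", "value", Var1:Var3) %>%
  group_by(Var_num, Var4, Var5) %>%
  summarise(N = n(), mean = mean(value), StDev = sd(value),
            SEM = SEM_function(value), min = min(value), max = max(value)) %>%
  ungroup()

write_xlsx(summary_tbl, path = "summary_table.xlsx")  # example file name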

What is the correct way to reference variables when using tidyverse with other functions?

Say I would like to use reporttools with the tidyverse.
I first make sure the packages are loaded:
# install.packages(c("tidyverse", "reporttools"))  # use this to install them, do this only once
library(reporttools); library(tidyverse)
Second, I test it with a basic reporttools tableNominal, i.e.:
data(CO2)
## the basic function
reporttools::tableNominal(vars = CO2[, 1:2], group = CO2[, "Treatment"])
That works.
Now, what is the correct/best way to reference variables if I wish to use the tidyverse to subset and select before using reporttools? It would be neat if I could use tidyverse's select helpers, e.g. contains() or one_of().
CO2 %>% select(Plant, Type, Treatment) %>%
  reporttools::tableNominal(vars = vars(Plant, Type), group = vars(Treatment))
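One pattern that should work, sketched below: reporttools expects plain data frames and vectors rather than quosures, so subset inside the pipe and pass the pieces explicitly using magrittr's braces; tidyselect helpers such as contains() or one_of() can be used inside the inner select() call. This is a sketch of the general idiom, not a tested answer:
library(dplyr)
library(reporttools)

CO2 %>%
  select(Plant, Type, Treatment) %>%
  {                                      # braces stop %>% from injecting `.` as the first argument
    reporttools::tableNominal(
      vars  = select(., Plant, Type),    # helpers like contains() also work here
      group = .$Treatment
    )
  }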

Model Matrices Incompatible - Error in update in Biglm package in R

I'm running through a large dataset chunk by chunk, updating a list of linear models as I go using the biglm function. The issue occurs when a particular chunk does not contain all the factors that I have in my linear model, and I get this error:
Error in update.biglm(model, new) : model matrices incompatible
The description of update.biglm mentions that factor levels must be the same across all chunks. I could probably come up with a workaround to avoid this, but there must be a better way. This PDF, on the 'biglm' page, mentions that "Factors must have their full set of levels specified (not necessarily present in the data chunk)". So I think there is some way to specify all the possible levels so that I can update a model when not all the factors are present, but I can't figure out how to do it.
Here's an example piece of code to illustrate my problem:
df = data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
model = biglm(a ~ b + c, data = df)
df.new = data.frame(a = rnorm(6), b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))
model.new = update(model, df.new)
Thanks for any advice you have.
I came across this problem also. Are the variables in your large data frame specified as factors before breaking them into chunks? Also, is the data set formatted as a data frame?
large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)
If this is the case, then all of the factor levels should be preserved in the factor variables even after breaking the data frame into chunks. This will ensure that biglm creates the proper design matrix from the first call, and that all subsequent updates will be compatible.
If you have different data frames from the start (as you illustrate in your example), perhaps you should merge them into one before breaking it down into chunks. Continuing from your example:
df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]
model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2) # this is now compatible
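Alternatively, here is a sketch of the more direct route hinted at by the quoted documentation: declare the factor's full set of levels on every chunk up front, so a chunk that happens to be missing some categories still yields a compatible design matrix:
library(biglm)

# both chunks carry all four levels of b, even though df.new only contains 1 and 2
df     <- data.frame(a = rnorm(12), b = factor(rep(1:4, each = 3), levels = 1:4),
                     c = rep(0:1, 6))
df.new <- data.frame(a = rnorm(6),  b = factor(rep(1:2, each = 3), levels = 1:4),
                     c = rep(0:1, 3))

model     <- biglm(a ~ b + c, data = df)
model.new <- update(model, df.new)   # model matrices are now compatible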
