Summarise_each and dplyr syntax - R

I've been given a set of particularly messy data. It contains three columns denoting the same factor variable - focus1, focus2, and focus3 - where each observation can have more than one focus, but the columns are not a measure of magnitude, i.e. the focus given in focus1 is not necessarily stronger than that in focus2. I need to expand these three variables into indicator variables for each possible level of a consolidated focus variable. To do this I used the code below, and it worked perfectly on my PC yesterday, but I work on a Mac in my office and am now running into problems.
library(dplyr)
library(tidyr)

# Create focus variables
spr.focus <- y1 %>%
  gather(foc_num, focus, starts_with("focus")) %>%
  mutate(present = 1) %>%
  spread(focus, present, fill = 0)
# Reorder data on ID var while removing unnecessary columns
spr.focus <- spr.focus[order(spr.focus$tid), -c(34, 54)]
# Group by ID var and summarise indicator variables to get one obs per ID
focusvars <- spr.focus %>%
  group_by(tid) %>% # tid is id var
  summarise_each(funs(sum), Arts:Unclear)
I have run into two problems:
summarise_each appears to have been deprecated on my Mac but not on Windows? The answer here appears to be to use summarise_at. Can I use the same x:y notation to denote the columns to summarise? This is important because there are around 20-30 columns between the first and last index.
For some reason R no longer recognises column names I refer to within the pipe notation. I get an error "Error in eval_bare(dot$expr, dot$env) : object 'Arts' not found".
I'm also quite curious: what is causing these disparities between Windows and Mac? I have to imagine it is different versions of the packages/RStudio itself, but it is creating quite a conundrum.

After some tinkering with summarize_at, I found my solution:
focusvars <- spr.focus %>%
  group_by(tid) %>% # tid is an id var
  summarise_at(vars(Arts:Unclear), funs(sum))
For some reason, RStudio still flags errors in the margin saying the column names cannot be found in scope, but the code creates the new dataframe. I'll leave this up in case it is helpful to others.
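
As an aside, summarise_each(), summarise_at(), and funs() have all since been superseded; on current dplyr (>= 1.0) the same x:y column range works with across(). A minimal sketch, assuming the same spr.focus object as above:
library(dplyr)

focusvars <- spr.focus %>%
  group_by(tid) %>%                     # tid is the id var
  summarise(across(Arts:Unclear, sum))  # the x:y range notation still works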


Removing a set of adjacent rows of a data frame meeting a specific pattern - R

I posted this question on 12/19. I received one response that was very helpful but not quite what I was looking for. Then the question was closed by three folks with the note that it needed more focus. The instructions indicated I could update the question or post a new one, but after editing it to make it more focused it remained closed. So, I am posting it again.
Here is the link to the edited question, including a more concise dataset (which had been one critical comment): Identifying a specific pattern in several adjacent rows of a single column - R
But, in case that link isn't allowed, here's the content:
I need to remove a specific set of rows from the data whenever they occur. Our survey is an automated telephone survey: the survey tool will attempt three times during a call to prompt the respondent to enter a response. After three timeouts on a question, the survey tool hangs up. This mostly happens when the call goes to someone's voicemail.
I would like to identify that pattern when it happens so I can exclude those rows when calculating call time.
The pattern I am looking for looks like this in the Interactions column:
It doesn't HAVE to be Intro. It can be any part of the survey where the tool prompts the respondent for a response THREE times but no response is provided, so the call fails. But it does have to be sandwiched between "Answer" (the phone picks up) and "Timeout. Call failed." (a failure).
I did try to apply what I learned from yesterday's solution (about run length encoding) to my other indexing question but I couldn't make it work in the slightest. So, here I am.
Here's an example dataset: it covers 4 respondents and every interaction between the survey tool and the respondent (or their phone, essentially).
Here's the code for the dataframe: the link goes to a Google Drive text editor with the code.
The response I got from Rui Barradas was this:
removeRows <- function(X, col = "Interaction",
                       ans = "Answer",
                       fail = c("Timeout. Call failed.", "Partial", "Enqueueing call")) {
  # Rows where the call is answered and rows that mark a failure
  a <- grep(ans, X[[col]])
  f <- which(X[[col]] %in% fail)
  # For each failure row, find the closest preceding "Answer" row
  a <- a[findInterval(f, a)]
  # Blank out everything between each Answer/failure pair...
  for (i in seq_along(a)) {
    X[[col]][a[i]:f[i]] <- NA_character_
  }
  # ...and drop those rows
  Y <- X[complete.cases(X), , drop = FALSE]
  Y
}
removeRows(survey_data)
However, this solution is too broad. I need to remove only the rows where 3 attempts are made to prompt a response but no response is provided - that is, where the prompt is Intro, there's no response, the question times out, and eventually the call fails.
Thanks!
I would normally use the dplyr package. I'm sure this method can be modified to use base R if needed, but dplyr has pre-made functions that make it easier. Comments in the code explain what it's doing.
library(dplyr)

df2 <- df %>%
  # Find any entry with three evenly spaced timeouts after it and set TRUE.
  # You can add other conditions here if needed (to also check leading values).
  mutate(triple_timeout = ifelse(
    lead(Interaction, n = 1) == "Timeout" &
      lead(Interaction, n = 3) == "Timeout" &
      lead(Interaction, n = 5) == "Timeout",
    TRUE,
    FALSE
  )) %>%
  # lead() leaves some NA values, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout), FALSE, triple_timeout)) %>%
  # Every triple timeout has six entries that should be TRUE, but only the first is id'd.
  # Use `or` logic and lag statements to set the value to TRUE for the 5 entries after any TRUE
  mutate(triple_timeout = triple_timeout |
           lag(triple_timeout, n = 1) |
           lag(triple_timeout, n = 2) |
           lag(triple_timeout, n = 3) |
           lag(triple_timeout, n = 4) |
           lag(triple_timeout, n = 5)
  ) %>%
  # lag() also leaves some NA values, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout), FALSE, triple_timeout)) %>%
  # Filter out any row flagged as part of a triple timeout
  filter(!triple_timeout) %>%
  # Remove the helper column
  select(-triple_timeout)
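
If you want to try this without the linked data, here is a tiny, purely hypothetical interaction log (the column names and values are invented for illustration) that exercises the pattern. Running the pipeline above on it, with df replaced by survey_data, keeps only the rows outside the Intro/Timeout block:
library(dplyr)

# Hypothetical single-call log: the phone answers, three Intro prompts each
# time out, then the call fails.
survey_data <- data.frame(
  Respondent  = 1,
  Interaction = c("Answer",
                  "Intro", "Timeout",
                  "Intro", "Timeout",
                  "Intro", "Timeout",
                  "Timeout. Call failed."),
  stringsAsFactors = FALSE
)
# After the pipeline, only the "Answer" and "Timeout. Call failed." rows remain.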
I'll know for sure in the coming month when I have this kind of data for 5K respondents. But I have decent RAM. Thanks again!

mutate function produces unexpected results for numeric column in table (huxtable), but not in dataframe

I am trying to learn how to produce pretty tables using the package huxtable. It's a learning curve, but so far I am really impressed. However, I have run into a few problems that I can't seem to solve.
Firstly, I am trying to format numbers so that there is a comma separator at the thousands position (using the mutate_at function from the dplyr package, and prettyNum). It works well except that, for columns of class numeric, internal zeros are excised (e.g., 1001 becomes 1,1 instead of the desired 1,001). If the column class is integer, the desired output is produced. The correct output is also produced if the input data is a dataframe rather than a huxtable, regardless of whether the column is numeric or integer.
Secondly, when I add other table formatting (in particular, a caption), the caption does not seem to be carried over when I write the table to a Word file. Additionally, a note is produced:
Note: zip::zip() is deprecated, please use zip::zipr() instead
Below is some example code that I think illustrates the issue.
My questions are:
1) Why does the mutate function produce the odd result for a numeric column in huxtables, but not in data frames, and how can I ensure that it works? I could, of course, do the number formatting before converting the dataframe to a huxtable, but I'd still like to know what is going on here.
2) Why is the table formatting not preserved in the output file?
3) What does the note about using zipr mean, and could the issue it references also be responsible for the failure to export the table properties?
Thanks,
Glenn
library(dplyr)
library(flextable)
library(huxtable)

test <- data.frame(var1 = 1918:1925, var2 = c(9009, 1000:1006), var3 = 1100:1107)
str(test)
HUX <- hux(test)
number_format(HUX)
number_format(HUX[, 2]) <- 0

# works as expected on a data frame
mutate_at(test, -1, .funs = list(~prettyNum(., big.mark = ",")))
# does not work as expected on a huxtable, for var2 of class numeric
mutate_at(HUX, -1, .funs = list(~prettyNum(., big.mark = ",")))

# add caption, borders, and colnames
set_caption(HUX, "Example table") %>%
  set_caption_pos("topleft") %>%
  set_top_border(1, , 1) %>%
  set_bottom_border(final(1), , 1) %>%
  add_colnames()

# write out the table (this produces a note about zipr)
quick_docx(HUX)
Re the note about using zipr: see https://github.com/awalker89/openxlsx/issues/454
Re mutate_at: your data is being transformed correctly, but huxtable is displaying it wrongly - it is recognising the digits before and after the comma as separate numbers. (Number recognition is hard, let's go shopping…) I would suggest using number_format instead of transforming the data directly:
number_format(HUX)[, 2:3] <- list(function(x) prettyNum(x, big.mark = ","))
Finally, your second problem has a simple solution: you are changing all of the features of HUX, but you're not saving the result back to the original variable. Remember that R is a functional language; objects are very rarely modified in place. Add HUX <- to the start of your pipe chain.
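In code terms, a sketch that reuses the calls from the question, with the assignment added so the caption, borders, and column names persist:
HUX <- HUX %>%
  set_caption("Example table") %>%
  set_caption_pos("topleft") %>%
  set_top_border(1, , 1) %>%
  set_bottom_border(final(1), , 1) %>%
  add_colnames()

# the caption and borders now survive the export
quick_docx(HUX)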

Using filter clause in R group by function

I'm trying to get the number of phone numbers per day using group_by, but I want to count only phones that are valid. How can I set this condition in the filter option? (Or maybe there is a different solution?)
The data consists of a table with 4 columns:
CreatedDate
Calculation_id__c
Acc_Phone__c (phone no)
Acc_PhoneHLRStatus__c (the status about the phone no)
The data has some N/A values from time to time in all but the first column. The idea is to count how many calculations, phone numbers, and valid/invalid phone numbers there are per day. I've managed to count the number of non-empty rows based on various columns, but I need to add the "WHERE" clause to my group-by statement so that it takes only valid/invalid phone numbers. This code works:
grouped_SF_hlr_status <- declaredSalesDF %>%
  group_by(CreatedDate) %>%
  count(Acc_Phone__c)
But this code produces an error:
grouped_SF_hlr_status <- declaredSalesDF %>%
  group_by(CreatedDate) %>%
  count(Acc_Phone__c) %>%
  filter(Acc_PhoneHLRStatus__c == 'komórkowy(poprawny)')
The error message is:
Error in filter_impl(.data, quo) :
Evaluation error: object 'Acc_PhoneHLRStatus__c' not found.
I'm not sure if the syntax is okay; I'm not very familiar with R. Thank you all for your help!
I believe this is because when you count(), the data is mutated so that it shows only the variable you are counting (Acc_Phone__c) and the frequency of that variable. If you inspect grouped_SF_hlr_status after running your first code chunk, that is what you should find. The other variables are lost because they no longer make sense once the individual cases they refer to have been grouped together.
In this instance, as you are only interested in the valid numbers, you should filter before counting. Try switching the lines of your code around so the filter goes first.
grouped_SF_hlr_status <- declaredSalesDF %>%
  filter(Acc_PhoneHLRStatus__c == 'komórkowy(poprawny)') %>%
  group_by(CreatedDate) %>%
  count(Acc_Phone__c)
(I can't check if this works without your data, but logically that seems right to me. Let me know if it shows any error...)
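
If you later need the valid and invalid counts side by side per day, one possible variation - a sketch assuming the same column names and status label as above - is to summarise instead of count:
library(dplyr)

daily_counts <- declaredSalesDF %>%
  group_by(CreatedDate) %>%
  summarise(
    n_phones  = sum(!is.na(Acc_Phone__c)),  # non-empty phone numbers
    n_valid   = sum(Acc_PhoneHLRStatus__c == 'komórkowy(poprawny)', na.rm = TRUE),
    n_invalid = sum(Acc_PhoneHLRStatus__c != 'komórkowy(poprawny)', na.rm = TRUE)
  )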

Slowdown with repeated calls to spark dataframe in memory

Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles for all of the variables is super fast, as the function supports execution on multiple variables at once.
Next, I want to one-hot encode those bucketed variables, but there is no functionality currently supported to one-hot encode all of them with a single call. So I'm piping ft_string_indexer, ft_one_hot_encoder, and sdf_separate_column for each of the bucketed variables, one at a time, by looping through them. This gets the job done. However, as the loop progresses, it slows down considerably. I'm thinking it's running out of memory, but I can't figure out how to program this so that it executes with the same speed across the variables.
If q_vars is a character array of variable names (say 40 of them) for continuous variables, how can I code this up in a more Spark-efficient way?
for (v in q_vars) {
  data_sprk_q <- data_sprk_q %>%
    ft_string_indexer(v, paste0(v, "b"), "keep", string_order_type = "alphabetAsc") %>%
    ft_one_hot_encoder(paste0(v, "b"), paste0(v, "bc")) %>%
    sdf_separate_column(paste0(v, "bc"), into = q_vars_cat_list[[v]])
}
I also tried executing it as a single massive pipeline with all of the variables referenced, but that didn't solve the issue either, so I'm thinking it doesn't have anything to do with the loop itself.
test_text<-paste0("data_sprk_q<-data_sprk_q %>% ", paste0("ft_string_indexer('",q_vars,"',paste0('",q_vars,"','b'),'keep',string_order_type = 'alphabetAsc') %>% ft_one_hot_encoder(paste0('",q_vars,"','b'),paste0('",q_vars,"','bc')) %>% sdf_separate_column(paste0('",q_vars,"','bc'),into=",q_vars_cat_list,")",collapse=" %>% "))
eval(parse(text=test_text))
Any help would be appreciated.
In general, some (sometimes substantial) slowdown with a long ML Pipeline is expected as a result of the worse-than-linear complexity of the Catalyst optimizer. Short of splitting the process into multiple pipelines and breaking the lineage in between (either using checkpoints or writing data to persistent storage and loading it back), there is not much you can do about it at the moment.
However, your current code adds a number of problems on top of that:
Unless you use more than 10 buckets, StringIndexer
ft_string_indexer(v, paste0(v, "b"), "keep", string_order_type = "alphabetAsc")
just duplicates the labels assigned by QuantileDiscretizer. With a larger number of levels, the behaviour becomes even less useful when lexicographic order is used.
Applying one-hot encoding might not be required at all (and in the worst-case scenario can be harmful), depending on the downstream process; even with linear models it might not be strictly necessary (you could argue that the assigned labels are valid ordinals, and that recoding them as nominal values and increasing the dimensionality is not a desired outcome).
However, the biggest problem is the application of sdf_separate_column. It:
Increases the cost of computing the execution plan by increasing the number of expressions.
Increases the amount of memory required for processing by converting sparse data into dense.
Internally, sparklyr uses a UserDefinedFunction on each index, effectively causing repeated allocation, decoding, and garbage collection for the same row and putting a lot of pressure on the cluster.
Last but not least, it discards column metadata, which is used extensively by Spark ML.
I would strongly advise against using this function here. Based on your comments it looks like you want to subset columns before passing the result to some other algorithm - for that you can use VectorSlicer.
Overall you can rewrite your pipeline as
library(sparklyr)
library(tibble)

# assumes an existing Spark connection `sc`
set.seed(1)
df <- copy_to(sc, tibble(x = rnorm(100), y = runif(100), z = rpois(100, 1)))

input_cols <- colnames(df)
discretized_cols <- paste0(input_cols, "_d")
encoded_cols <- paste0(discretized_cols, "_e") %>% setNames(discretized_cols)

discretizer <- ft_quantile_discretizer(
  sc, input_cols = input_cols, output_cols = discretized_cols, num_buckets = 10
)
encoders <- lapply(
  discretized_cols,
  function(x) ft_one_hot_encoder(sc, input_col = x, output_col = encoded_cols[x])
)

transformed_df <- do.call(ml_pipeline, c(list(discretizer), encoders)) %>%
  ml_fit(df) %>%
  ml_transform(df)
and apply ft_vector_slicer when needed. For example, to take the values corresponding to the first, third, and sixth buckets from x (vector indices are 0-based) you can:
transformed_df %>%
  ft_vector_slicer(
    input_col = "x_d_e", output_col = "x_d_e_s", indices = c(0, 2, 5))

Problems using dplyr in a function (group_by)

I want to use dplyr for some data manipulation. Background: I have a survey weight and a bunch of variables (mostly Likert items). I want to sum the frequencies and percentages per category, with and without the survey weight.
As an example, let us just use frequencies for the gender variable. The result should be this:
gender freq freq.weighted
1 292 922.2906
2 279 964.7551
9 6 21.7338
I will do this for many variables, so I decided to put the dplyr code inside a function, so that I only have to change the variable and type less.
#exampledata
gender<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
survey_weight<-c("2.368456","2.642901","2.926698","3.628653","3.247463","3.698195","2.776772","2.972387","2.686365","2.441820","3.494899","3.133106","3.253514","3.138839","3.430597","3.769577","3.367952","2.265350","2.686365","3.189538","3.029999","3.024567","2.972387","2.730978","4.074495","2.921552","3.769577","2.730978","3.247463","3.230097")
test_dataframe<-data.frame(gender,survey_weight)
# function
weighting.function <- function(dataframe, variable){
  test_weighted <- dataframe %>%
    group_by_(variable) %>%
    summarise_(interp(freq = count(~weight)),
               interp(freq_weighted = sum(~weight)))
  return(test_weighted)
}
result_dataframe <- weighting.function(test_dataframe, "gender")
# this second step was left out in this example:
# mutate_(perc=interp(~freq/sum(~freq)*100), perc_weighted=interp(~freq_weighted/sum(~freq_weighted)*100))
This leads to the following Error-Message:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "formula"
I have tried a lot of different things. First, I used freq=n() to count the frequencies, but I always got an error (I checked that plyr was loaded before dplyr and not afterwards - it also didn't work).
Any ideas? I read the vignette on standard evaluation, but I always run into problems and have no idea what a solution could be.
I think you have a few nested mistakes which are causing you problems. The biggest one is using count() instead of summarise(). I'm guessing you wanted n():
weighting.function <- function(dataframe, variable) {
  dataframe %>%
    group_by_(variable) %>%
    summarise_(
      freq = ~n(),
      freq_weighted = ~sum(survey_weight)
    )
}
weighting.function(test_dataframe, ~gender)
You also had a few unneeded uses of interp(). If you do use interp(), the call should look like freq = interp(~n()), i.e. the name goes outside the call to interp(), and the thing being interpolated starts with ~.
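
As an aside, the underscore verbs (group_by_(), summarise_()) have since been deprecated. On current dplyr the same function can be written with tidy evaluation; this is only a sketch, and it assumes survey_weight has been converted to numeric (the example data builds it as a character vector):
library(dplyr)

weighting.function <- function(dataframe, variable) {
  dataframe %>%
    group_by({{ variable }}) %>%         # curly-curly takes an unquoted column name
    summarise(
      freq          = n(),
      freq_weighted = sum(survey_weight)  # assumes survey_weight is numeric
    )
}

test_dataframe$survey_weight <- as.numeric(as.character(test_dataframe$survey_weight))
weighting.function(test_dataframe, gender)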
