Creating a funnel using a pivot table in R considering NA column

I have the following dataset:
library(tidyverse)
dataset <- data.frame(id = c(121,122,123,124,125),
segment = c("A","B","B","A",NA),
Web = c(1,1,1,1,1),
Tryout = c(1,1,1,0,1),
Purchase = c(1,0,1,0,0),
stringsAsFactors = FALSE)
As you can see, this table describes a funnel: from web visits (the number of rows), to tryouts, to purchases. A useful view of this funnel would be:
Step     Total A B NA
Web          5 2 2  1
Tryout       4 1 2  1
Purchase     2 1 1  0
So I tried building this row by row. The code for the web views row is:
dataset %>% mutate(segment = ifelse(is.na(segment), "NA", segment)) %>%
group_by(segment) %>% summarise(Total = n()) %>%
ungroup() %>% spread(segment, Total) %>% mutate(Total = `A` + `B` + `NA`) %>%
select(Total,A,B,`NA`)
This worked fine, except that I have to add the row name manually. For the other steps, such as tryout and purchase, is there a way to do everything in one simpler piece of code, without binding rows? Keep in mind that this is only an example and I have many columns, so any help will be greatly appreciated.

Here is one option. After removing the 'id' column, we convert the data to 'long' format. Grouped by 'name', we take the sum of 'value' as 'Total'; then, grouped by 'name', 'segment' and 'Total', we take a second sum as 'n'. Finally we drop 'value', keep the distinct rows, and pivot back to 'wide' format:
library(dplyr)
library(tidyr)
dataset %>%
select(-id) %>%
pivot_longer(cols = -segment) %>%
group_by(name) %>%
mutate(Total = sum(value)) %>%
group_by(name, segment, Total) %>%
mutate(n = sum(value)) %>%
ungroup %>%
select(-value) %>%
distinct %>%
pivot_wider(names_from = segment, values_from = n)
# A tibble: 3 x 5
# name Total A B `NA`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Web 5 2 2 1
#2 Tryout 4 1 2 1
#3 Purchase 2 1 1 0

dataset %>%
select(-id) %>%
group_by(segment) %>%
summarise_all(sum) %>%
gather(Step, val, -segment) %>%
spread(segment, val) %>%
mutate(Total = rowSums(.[,-1]))
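Since gather() and spread() are superseded in tidyr by pivot_longer() and pivot_wider(), the same idea can also be sketched with the newer verbs. This is only an illustration under a few assumptions: the column names Step, value and n are my own choices, count(..., wt = value) is used as a shortcut for a grouped sum, and the final arrange() is there only to put the widest funnel step first.

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(id = c(121, 122, 123, 124, 125),
                      segment = c("A", "B", "B", "A", NA),
                      Web = c(1, 1, 1, 1, 1),
                      Tryout = c(1, 1, 1, 0, 1),
                      Purchase = c(1, 0, 1, 0, 0),
                      stringsAsFactors = FALSE)

funnel <- dataset %>%
  select(-id) %>%
  # one row per (row, step) pair
  pivot_longer(cols = -segment, names_to = "Step", values_to = "value") %>%
  # weighted count = sum of the 0/1 flags per step and segment
  count(Step, segment, wt = value, name = "n") %>%
  # total per step, computed while the data is still long
  group_by(Step) %>%
  mutate(Total = sum(n)) %>%
  ungroup() %>%
  # one column per segment; a missing segment becomes the column `NA`
  pivot_wider(names_from = segment, values_from = n, values_fill = 0) %>%
  # order the steps from the top of the funnel down
  arrange(desc(Total))

funnel
```

Because every step column is handled by pivot_longer(), adding more funnel columns to the data should not require changing the pipeline.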

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average run length for state == 1, grouped by ID. I wrote a function, inspired by https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/, to calculate the rle average:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, but L1 has no instances of state == 1, so its average should be zero or NA.
I narrowed the problem down by using just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
Because the function returns a data.frame, we can wrap the result in a list column and unnest it afterwards:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or drop the list() and unpack the resulting data.frame column:
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do() code, the column should be extracted not from the whole data (DF$state) but from the group's data coming from the lhs of the pipe, i.e. the dot (.). Note that do() is superseded, so it may be better to use summarise() with unnest/unpack.
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
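To see why the original call returned the same row for both groups, it may help to run the helper on plain vectors. The sketch below copies the two state columns from the test data by hand (the names state_L0 and state_L1 are mine) and compares what the function sees when it receives the whole column, as with DF$state, against one group at a time.

```r
# Same helper as in the question
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# state column for each ID, copied from the test data
state_L0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0)
state_L1 <- c(0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)

r <- rle(state_L0)
r$values   # 0 1 0 2 1 0
r$lengths  # 3 3 2 1 1 1

# DF$state ignores the grouping, so every group sees both IDs concatenated
whole   <- rle_mean_lengths(c(state_L0, state_L1), 1)
only_L0 <- rle_mean_lengths(state_L0, 1)
only_L1 <- rle_mean_lengths(state_L1, 1)

whole    # count 2, avg_length 2   (the row repeated for both IDs)
only_L0  # count 2, avg_length 2
only_L1  # count 0, avg_length NaN
```

The runs of state == 1 all sit in L0 (lengths 3 and 1), which is why the whole-column result happens to match L0 and is wrong for L1.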

How can I find the longest names (by letters) in my data set?

I have a problem set that wants me to find out the "two longest names given to at least 1000 US babies" in the 'babynames' data set.
The code that I've tried in the past is this:
babynames %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
filter(n >= 1000) %>%
arrange(-long.name) %>%
head(2) %>%
select(name, long.name)
But it gave me this:
name long.name
<chr> <int>
1 Christopher 11
2 Christopher 11
By grouping by name, I'm hoping to eliminate the duplicate shown above.
This is where I'm currently at:
babynames %>%
filter(n >= 1000) %>%
group_by(name) %>%
mutate(long.name = str_count(babynames$name,
"[:alpha:]")) %>%
arrange(-long.name) %>%
head(2)
I'm expecting to get something like:
name long.name
<chr> <int>
1 Christopher 11
2 (some name) 10
But I get this:
Error: Column `long.name` must be length 1 (the group size), not 1924665
What am I doing wrong?
We can group_by name and sum all the occurrences of each name, keep only those names given more than 1000 times in total, calculate the length using nchar, and select the top 2 values.
library(babynames)
library(dplyr)
babynames %>%
group_by(name) %>%
summarise(n = sum(n)) %>%
filter(n > 1000) %>%
mutate(name_length = nchar(name)) %>%
#Can also do
#mutate(name_length = stringr::str_count(name, "[:alpha:]")) %>%
top_n(2, name_length)
# name n name_length
# <chr> <int> <int>
#1 Maryelizabeth 1969 13
#2 Michaelangelo 1236 13
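As for the error itself: inside a grouped mutate(), babynames$name is still the full column of the original data, so its length (1924665) does not match the size of the current group. Referring to the bare column name avoids this. The sketch below uses a small made-up tibble (toy, with invented counts) instead of the real babynames data, and slice_max() in place of top_n(); it assumes, as in the answer above, that the 1000 threshold applies to the total across years, and uses >= to match "at least 1000".

```r
library(dplyr)

# Made-up data with the same columns as babynames that matter here
toy <- tibble(
  name = c("Christopher", "Christopher", "Maximilian", "Ann", "Ann"),
  year = c(1990, 1991, 1990, 1990, 1991),
  n    = c(1500, 1200, 1100, 2000, 500)
)

longest <- toy %>%
  group_by(name) %>%
  summarise(n = sum(n), .groups = "drop") %>%   # one row per name
  filter(n >= 1000) %>%
  mutate(name_length = nchar(name)) %>%         # bare `name`, not toy$name
  slice_max(name_length, n = 2, with_ties = FALSE)

longest
```

Summarising to one row per name is also what removes the duplicated Christopher rows from the first attempt.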

Spread in SparklyR / pivot in Spark

I am trying to refactor my R code (shown below) into sparklyr code that works on a Spark dataset and produces the final result shown in Table 1.
With help from the Stack Overflow posts "Gather in sparklyr" and "SparklyR separate one Spark Data Frame column into two columns", I got everything working except the last step, the spread.
I need help with:
Implementing spread via sparklyr
Optimizing the code in any way
Table 1: Final output needed:
var n nmiss
1 Sepal.Length 150 0
2 Sepal.Width 150 0
R code to achieve it:
library(dplyr)
library(tidyr)
library(tibble)
data <- iris
data_tbl <- as_tibble(data)
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width) %>%
summarize_all(funs(
n = n(), #Count
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>%
gather(variable, value) %>%
separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
spread(stat, value)
Spark Code:
sdf_gather <- function(tbl){
all_cols <- colnames(tbl)
lapply(all_cols, function(col_nm){
tbl %>%
select(col_nm) %>%
mutate(key = col_nm) %>%
rename(value = col_nm)
}) %>%
sdf_bind_rows() %>%
select(c('key', 'value'))
}
profile <- data_tbl %>%
select(Sepal.Length,Sepal.Width ) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.)))
)) %>%
sdf_gather(.) %>%
ft_regex_tokenizer(input_col="key", output_col="KeySplit", pattern="_(?=[^_]*$)") %>%
sdf_separate_column("KeySplit", into=c("var", "stat")) %>%
select(var,stat,value) %>%
sdf_register('profile')
In this specific case (and in general whenever all columns have the same type, although this can be relaxed further if you are only interested in missing-data statistics), you can use a much simpler structure than this.
With data defined like this:
df <- copy_to(sc, iris, overwrite = TRUE)
gather the columns (below I assume an sdf_gather function as defined in my answer to "Gather in sparklyr"):
long <- df %>%
select(Sepal_Length, Sepal_Width) %>%
sdf_gather("key", "value", "Sepal_Length", "Sepal_Width")
and then group and aggregate:
long %>%
group_by(key) %>%
summarise(n = n(), nmiss = sum(as.numeric(is.na(value)), na.rm=TRUE))
with result as:
# Source: spark<?> [?? x 3]
key n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
Given the reduced size of the output, it is also fine to collect the result after aggregation:
agg <- df %>%
select(Sepal_Length,Sepal_Width) %>%
summarize_all(funs(
n = n(),
nmiss=sum(as.numeric(is.na(.))) # MissingCount
)) %>% collect()
and apply your gather-spread logic to the collected result:
agg %>%
tidyr::gather(variable, value) %>%
tidyr::separate(variable, c("var", "stat"), sep = "_(?=[^_]*$)") %>%
tidyr::spread(stat, value)
# A tibble: 2 x 3
var n nmiss
<chr> <dbl> <dbl>
1 Sepal_Length 150 0
2 Sepal_Width 150 0
In fact, the latter approach should perform better in this particular case.
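For the local (already collected) part of this approach, the gather, separate and spread steps can also be folded into a single pivot_longer() call. The sketch below runs on a plain local copy of iris rather than a Spark table, so it says nothing about what sparklyr translates; it also swaps funs(), which is deprecated in current dplyr, for across(), and uses length(.x) as a stand-in for n().

```r
library(dplyr)
library(tidyr)

profile <- as_tibble(iris) %>%
  summarise(across(
    c(Sepal.Length, Sepal.Width),
    list(n = ~ length(.x),            # count
         nmiss = ~ sum(is.na(.x)))    # missing count
  )) %>%
  # column names look like "Sepal.Length_n"; split on the last underscore
  # and send the suffix to the special ".value" slot so that n and nmiss
  # become their own columns
  pivot_longer(
    everything(),
    names_to = c("var", ".value"),
    names_pattern = "^(.*)_([^_]*)$"
  )

profile
```

The ".value" sentinel is what replaces the separate spread step here: the part of the name captured by the second group becomes the output column name.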

Select rows by ID with most matches

I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
I would like my code to select, for each id, the torre that repeats the most, or the last torre for that id if none repeats more than the others, so I'll get a new data frame like this:
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))
You can use aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
Most of the work is done by the FUN= argument. Here we define a function that gets the frequency counts for each torre, sorts them in increasing order, takes the last one with tail(, 1), and returns its name. aggregate() then applies this function separately for each id.
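The steps described above can be followed on a single id. In this sketch the function is pulled out and given a name (pick_torre, which is my own label, not part of the answer) and applied to the torre values of id 3 and id 4 from the example data.

```r
pick_torre <- function(x) {
  # counts per value, with levels kept in order of first appearance
  counts <- table(factor(x, levels = unique(x)))
  # sort() keeps the original order among equal counts, so on a tie
  # the value that appeared later ends up last
  names(tail(sort(counts), 1))
}

id3 <- c("q", "t", "q", "g")   # "q" appears twice
id4 <- c("a", "b", "c")        # all tied, so the last one wins

res3 <- pick_torre(id3)
res4 <- pick_torre(id4)
res3  # "q"
res4  # "c"
```

Setting the factor levels with unique(x) is what makes the tie rule work: without it, table() would order the values alphabetically rather than by first appearance.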
You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c
An approach with the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
And two possible dplyr alternatives:
library(dplyr)
# option 1
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
group_by(id) %>%
mutate(f = rank(n, ties.method = "first")) %>%
filter(f == max(f)) %>%
select(-n, -f)
# option 2
df %>%
group_by(id, torre) %>%
mutate(n = n()) %>%
distinct() %>%
arrange(n) %>%
group_by(id) %>%
slice(n()) %>%
select(-n)
Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n == max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c

Ordering rows when they are not numeric

Using df below, I made a table of frequencies for each unit, for each combination of group and year.
After obtaining the absolute and relative frequencies, I pasted the values into a single column, Frequency.
After reshaping the table so that the units are on the rows, is there a way to order them in descending order of n for the Total group in 2016? I want my final output to contain only the Frequency rows, not the n and prop rows.
df <- data.frame(cbind(sample(c('Controle','Tratado'),
10, replace = T),
sample(c(2012,2016), 10, T),
c('A','B','A','B','C','D','D','A','F','A')))
colnames(df) <- c('Group', 'Year', 'Unit')
table <- df %>%
group_by(Year, Group) %>%
count(Unit) %>%
mutate(prop = prop.table(n)) %>%
bind_rows(df %>%
mutate(Group ="Total") %>%
group_by(Year, Group) %>%
count(Unit)) %>%
mutate(prop = prop.table(n))
is.num <- sapply(table, is.numeric)
table[is.num] <- lapply(table[is.num], round, 4)
table <- table %>%
mutate(Frequency = paste0(n,' (', 100*prop,'%)'))
table <- table %>%
gather(type, measurement, -Year, -Group, -Unit) %>%
unite(year_group, Year:Group, sep = ":") %>%
spread(year_group, measurement)
Here is what I am expecting to generate:
Unit type 2012:Total 2012:Tratado 2016:Controle 2016:Total 2016:Tratado
1 A Frequency 2 (66.67%) 2 (66.67%) - 2 (28.57%) 2 (100%)
2 D Frequency - - 2 (40%) 2 (28.57%) -
3 B Frequency 1 (33.33%) 1 (33.33%) 1 (20%) 1 (14.29%) -
4 C Frequency - - 1 (20%) 1 (14.29%) -
5 F Frequency - - 1 (20%) 1 (14.29%) -
Notice that the results are ordered according to column 2016:Total
I just found a way myself, though it is probably not the best one.
After running the code in the question, I did the following:
table <- subset.data.frame(table, type == 'Frequency')
table <- table %>%
# the column created by unite() is named `2016:Total`; the count is the
# text before the first space, e.g. "2" in "2 (28.57%)"
mutate(value = as.numeric(sub(" .*$", "", `2016:Total`))) %>%
arrange(desc(value))
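An alternative that avoids parsing numbers back out of the pasted strings is to work out the unit order while n is still numeric, and apply it after reshaping. The sketch below uses a small made-up long table (the values and the names long, unit_order and wide are assumptions for illustration), not the randomly sampled df from the question.

```r
library(dplyr)
library(tidyr)

# Made-up long table: one row per unit and year:group combination
long <- tibble(
  Unit       = c("A", "B", "F", "A", "D", "B", "C"),
  year_group = c(rep("2012:Total", 3), rep("2016:Total", 4)),
  n          = c(2, 1, 1, 2, 4, 1, 3)
) %>%
  group_by(year_group) %>%
  mutate(Frequency = paste0(n, " (", round(100 * n / sum(n), 2), "%)")) %>%
  ungroup()

# Desired row order, taken from the numeric counts of 2016:Total
unit_order <- long %>%
  filter(year_group == "2016:Total") %>%
  arrange(desc(n)) %>%
  pull(Unit)

wide <- long %>%
  select(Unit, year_group, Frequency) %>%
  pivot_wider(names_from = year_group, values_from = Frequency) %>%
  # units absent from 2016:Total get NA from match() and sort last
  arrange(match(Unit, unit_order))

wide
```

Since only the Frequency column is pivoted, the result has no n or prop rows to filter out afterwards.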
