How to use case_when when doing quartiles in R?

If I have this tibble:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2011END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
website = c(
"google",
"google",
"facebook",
"facebook",
"youtube"
),
method = c("website",
"phone",
"website",
"laptop",
"phone"),
values = c(1, NA, 1, 2, 3))
And then I have this dataframe, which tells you which quantiles to create along with the rank to assign to each quantile:
tibble(
method = c(
"phone",
"phone",
"phone",
"website",
"website",
"website",
"laptop",
"laptop",
"laptop"
),
rank = c(3,2,1,3,2,1,3,2,1),
tile_condition = c("lowest 25%", "25 to 50%", "more than 50%",
"highest 25%", "25 to 50%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")
)
How can I use a case_when statement to create a ranking column based on the quartiles of the values column in the first dataframe?
I'm trying to apply the quantile definitions from the second dataframe to create a ranking column in the original dataframe, but I'm stuck on how to use case_when for it.

I would do something like this:
set.seed(124)
left_join(
df1[sample(1:5,1000, replace=T),] %>%
mutate(values=sample(c(df1$values,1:30),1000, replace=T)) %>%
group_by(method) %>%
mutate(q=as.double(cut(values,quantile(values,probs=seq(0,1,0.25), na.rm=T), labels=c(1:4), include.lowest=T))) %>%
ungroup(),
df2 %>% mutate(q = list(1,2,c(3,4),4,c(2,3),1,1,2,c(3,4))) %>% unnest(q),
by=c("method", "q")
) %>% select(-q)
Output:
# A tibble: 1,000 × 7
period date website method values rank tile_condition
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 2010END 31-12-2010 google website 7 2 25 to 75%
2 2011END 31-12-2011 google phone 18 1 more than 50%
3 2010Q1 30-04-2010 facebook website 21 2 25 to 75%
4 2011END 30-09-2010 youtube phone 15 1 more than 50%
5 2011END 30-09-2010 youtube phone 26 1 more than 50%
6 2011END 31-12-2011 google phone 3 3 lowest 25%
7 2010END 31-12-2010 google website 1 1 less than 25%
8 2010Q1 30-04-2010 facebook website 2 1 less than 25%
9 2010Q2 31-07-2010 facebook laptop 14 2 25 to 50%
10 2010Q2 31-07-2010 facebook laptop 16 1 more than 50%
# … with 990 more rows
Note that I resampled your input to 1,000 rows with random new values for the purposes of illustration. I also fixed df2 so that the website method covers the full range of values; in your example the 50% to 75% range is missing.
Adjusted df2 input:
structure(list(method = c("phone", "phone", "phone", "website",
"website", "website", "laptop", "laptop", "laptop"), rank = c(3,
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%",
"more than 50%", "highest 25%", "25 to 75%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))

If I've understood your question correctly, you first have to create a table to compare against, as:
df_quants <-
df1 %>%
drop_na(values) %>%
group_by(method) %>%
summarize(quant25 = quantile(values, probs = 0.25),
quant50 = quantile(values, probs = 0.5),
quant75 = quantile(values, probs = 0.75),
quant100 = quantile(values, probs = 1))
Then, using a join and a case_when statement, you would arrive at:
df2 %>%
left_join(df_quants, by = 'method') %>%
mutate(tiles =
case_when(rank < quant25 ~ 'lowest 25%',
rank < quant50 ~ '25 to 50%',
rank < quant75 ~ 'more than 50%',
rank >= quant75 ~ 'highest 25%'))
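Both answers above rest on the same step: assigning each value to a quartile bin within its method. A minimal base R sketch of that step follows. The vector `values` and the label set `tile_labels` are made up for illustration and are not the asker's data.

```r
# Illustrative values for a single method (not from the question)
values <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Quartile break points at 0%, 25%, 50%, 75% and 100%
breaks <- quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE)

# Bin index 1..4; include.lowest keeps the minimum inside the first bin
q <- cut(values, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# Map each bin to a label through a lookup vector (labels assumed for illustration)
tile_labels <- c("lowest 25%", "25 to 50%", "more than 50%", "more than 50%")
tile <- tile_labels[q]
```

In a dplyr pipeline the same cut() call would go inside mutate() after group_by(method), as in the first answer. Note that cut() stops with an error when the break points are not unique, which can happen with heavily tied values.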

Here's a quick version. It doesn't get the exact labels you want. To do that you'd have to parse your tile_condition column, which is a little tricky.
library(dplyr)  # group_by() and mutate()
library(santoku)
df |>
group_by(method) |>
mutate(
quantile = chop_quantiles(values, c(0.25, 0.50),
labels = c("Lowest 25%", "25 to 50%", "Above 50%"), extend = TRUE)
)

Related

Classifying columns based on str_detect

I am currently working with a data frame that looks like this:
Example <- structure(list(ID = c(12301L, 12301L, 15271L, 11888L, 15271L,
15271L, 15271L), StationOwner = c("Brian", "Brian", "Simon",
"Brian", "Simon", "Simon", "Simon"), StationName = c("Red", "Red",
"Red", "Green", "Yellow", "Yellow", "Yellow"), Parameter = c("Rain - Daily",
"Temperature -Daily", "VPD - Daily", "Rain - Daily", "Rain - Daily",
"Temperature -Daily", "VPD - Daily")), class = "data.frame", row.names = c(NA,
-7L))
I am looking into using str_detect to filter, for example, all the observations that start with “Rain –” and put what comes after it into a new column called "Rain". I have been able to filter out only the values that start with “Rain” using str_detect, but have not found a way to assign them automatically. Is there a specific function that would help with this? Appreciate the pointers, thanks!
Example of desired output that I am trying to achieve:
Desired <- structure(list(ID = c(12301L, 15271L, 12301L, 15271L
), StationOwner = c("Brian", "Simon", "Brian", "Simon"), StationName = c("Red",
"Red", "Green", "Yellow"), Rain = c("Daily", NA, "Daily", "Daily"
), Temperature = c("Daily", NA, NA, "Daily"), VDP = c(NA, "Daily",
NA, "Daily")), class = "data.frame", row.names = c(NA, -4L))
Directly using pivot_wider:
pivot_wider(Example, names_from = Parameter, values_from = Parameter,
names_repair = ~str_remove(.,' .*'),values_fn = ~str_remove(.,'.*- ?'))
# A tibble: 4 x 6
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
This doesn't use str_detect, but it produces Desired:
library(dplyr)
library(tidyr)  # separate() and pivot_wider() come from tidyr
Example %>%
separate(Parameter, c('a', 'b'), sep = "-") %>%
mutate(across(where(is.character), ~trimws(.x))) %>%
pivot_wider(id_cols = c("ID","StationOwner", "StationName"), names_from = "a", values_from = "b")
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
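Both answers work by splitting Parameter around the hyphen into a name part and a value part. A base R sketch of that split on its own, where `param` is a small illustrative sample of the Parameter column:

```r
# Illustrative sample of the Parameter column
param <- c("Rain - Daily", "Temperature -Daily", "VPD - Daily")

# Everything before the first hyphen becomes the new column name
name <- trimws(sub("-.*$", "", param))

# Everything after the first hyphen becomes the cell value
value <- trimws(sub("^[^-]*-", "", param))
```

trimws() absorbs the inconsistent spacing around the hyphen ("Temperature -Daily" versus "Rain - Daily"), so the resulting names and values are uniform before they are pivoted.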

Applying factor labels based on a condition in R

I have a data set with a variable 'education' which is coded differently in each of the three countries included, for example:
Code  Country 1     Country 2     Country 3
1     No education  No education  No education
2     Primary       Primary       Islamic education
3     Secondary     Secondary     Primary
4     NA            NA            Secondary
I need to apply factor levels, which are different for each country.
Below is my attempt, but it doesn't appear to work:
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4)
)
df$Education <-
if(df$Country == "Country1") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else if (df$Country == "Country2") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Islamic education", "Primary", "Secondary")
)
}
Thanks
Perhaps this helps? This takes the data from the table mapping countries with the education code and the education category and converts it to long format.
Then use a left join to the two column dataframe with countries and education codes.
You could use the resulting column with education type as a string or the codes could be recoded to be consistent.
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4))
df_ed <- structure(list(Code = 1:4, Country.1 = c("No education", "Primary",
"Secondary", NA), Country.2 = c("No education", "Primary", "Secondary",
NA), Country.3 = c("No education", "Islamic education", "Primary",
"Secondary")), class = "data.frame", row.names = c(NA, -4L))
df_levels <-
df_ed %>%
pivot_longer(-Code) %>%
mutate(name = str_replace(name, "\\.", " "))
df1 <-
df %>%
left_join(df_levels, by = c("Country" = "name", "Education_1" = "Code"))
head(df1)
#> Country Education_1 value
#> 1 Country 1 3 Secondary
#> 2 Country 2 4 <NA>
#> 3 Country 3 1 No education
#> 4 Country 1 2 Primary
#> 5 Country 3 3 Primary
#> 6 Country 2 4 <NA>
Created on 2021-09-22 by the reprex package (v2.0.0)
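The original attempt fails mainly because if() tests a single condition rather than one per row, so it cannot choose a different label set for each observation. As an alternative to the join, a per-row lookup can be sketched in base R. The list `labels_by_country` and the four-row data frame are illustrative assumptions built from the table in the question.

```r
# Label sets per country, indexed by education code 1..4 (taken from the question's table)
labels_by_country <- list(
  "Country 1" = c("No education", "Primary", "Secondary", NA),
  "Country 2" = c("No education", "Primary", "Secondary", NA),
  "Country 3" = c("No education", "Islamic education", "Primary", "Secondary")
)

# Small illustrative data frame (not the asker's random sample)
df <- data.frame(
  Country = c("Country 1", "Country 3", "Country 3", "Country 2"),
  Education_1 = c(2L, 2L, 4L, 4L),
  stringsAsFactors = FALSE
)

# Look up the label for each (country, code) pair, one row at a time
df$Education <- mapply(
  function(country, code) labels_by_country[[country]][code],
  df$Country, df$Education_1,
  USE.NAMES = FALSE
)
```

The result is a character column; it can be wrapped in factor() afterwards if a single combined set of levels is wanted.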

name variable at the same time using complete

I would like to get a subtotal by age group. The sample data and code are:
set.seed(12345)
#create a numeric variable Age
AGE <- sample(0:110, 100, replace = TRUE)
# Create data frame
Sample.data <-data.frame(AGE)
summary_data<- Sample.data %>%
group_by(grp = cut(
AGE,
breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
right=TRUE,
labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
)) %>% summarise("Total People" = n()) %>% complete(grp = levels(grp), fill = list("Total People" = 0))
which gave results that look like this:
Is it possible to label grp as "Age Group" and sort it by age?
I tried to define the name as follows, and the results are strange. What did I do wrong?
summary_data<- Sample.data %>%
group_by("Age Group" = cut(
AGE,
breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
right=TRUE,
labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
)) %>% summarise("Total People" = n())%>% complete("Age Group" = levels("Age Group"), fill = list("Total People" = 0))
The new results are:
The variable is now named "Age Group", but the table doesn't fill in 0 for the age categories that have no observations. What should I do?
The ideal results should look like this:
Change the double quotes to backticks inside levels() so that complete() evaluates the column rather than a string literal. Backticks are needed because the column name is non-standard, i.e. it includes a space.
...
%>%
complete("Age Group" = levels(`Age Group`), fill = list("Total People" = 0))
# A tibble: 7 x 2
# `Age Group` `Total People`
# <chr> <dbl>
#1 Adolescent(13-17.999 yrs) 14
#2 Adult(18-64.999 yrs.) 37
#3 Child(2-12.999 yrs) 2
#4 Elderly(65-199 yrs) 46
#5 Foetus(0 yr) 1
#6 Infant(0.083-1.999 yrs) 0
#7 Neonate (0.001 - 0.082 yr) 0
If we also want to sort the result (here by count):
Sample.data %>%
group_by("Age Group" = cut(
AGE,
breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
right=TRUE,
labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
)) %>%
summarise("Total People" = n()) %>%
complete("Age Group" = levels(`Age Group`), fill = list("Total People" = 0)) %>%
arrange(`Total People`)
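The output above is in alphabetical order because levels() turns the grouping column into plain character strings. If the groups should stay in age order, one option is to keep the column as a factor. A base R sketch with made-up ages and shortened labels (both are assumptions for illustration):

```r
# Illustrative ages; no value falls into the "Child" bin
age <- c(0, 1, 30, 70, 70)

# cut() returns a factor whose levels follow the order of the breaks
grp <- cut(
  age,
  breaks = c(-Inf, 0, 2, 13, 65, Inf),
  labels = c("Foetus", "Infant", "Child", "Adult", "Elderly")
)

# table() counts every level, including empty ones, in level order
counts <- as.data.frame(table(grp))
names(counts) <- c("Age Group", "Total People")
```

Here `counts` has one row per age group in the order of the breaks, with a zero for the empty "Child" group.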

Create funnelarea chart from data frame using plotly rstudio

I want to create a funnelarea chart using plotly, this is the example from plotly:
fig <- plot_ly(
type = "funnelarea",
values = c(5, 4, 3, 2, 1),
text = c("The 1st","The 2nd", "The 3rd", "The 4th", "The 5th"),
marker = list(colors = c("deepskyblue", "lightsalmon", "tan", "teal", "silver"),
line = list(color = c("wheat", "wheat", "blue", "wheat", "wheat"), width = c(0, 1, 5,
0, 4))),
textfont = list(family = "Old Standard TT, serif", size = 13, color = "black"),
opacity = 0.65)
fig
I would like to fill this chart from a dataframe, using the categories in my dataframe columns instead of the hard-coded text and values, but I can't find a way to do it.
Example of my dataframe
funnel_stage size purchaser_payment
1. Available for Sale 10 10000
1. Available for Sale 15 15000
2. Pending on Sale 8 8000
2. Pending on Sale 9 9000
3. Already Sold 1 1000
3. Already Sold 45 45000
3. Already Sold 12 12000
I would like the funnel to be filled by counting how many times each value in the first column occurs. It would look like this:
It's probably easiest if you first bring your data into a shape with one row per category in the correct order, see df2 below.
library("dplyr")
library("plotly")
df <- structure(
list(funnel_stage = c("Available for Sale", "Available for Sale",
"Pending on Sale", "Pending on Sale", "Already Sold",
"Already Sold", "Already Sold"),
size = c(10L, 15L, 8L, 9L, 1L, 45L, 12L),
purchaser_payment = c(10000L, 15000L, 8000L, 9000L, 1000L, 45000L, 12000L)),
class = "data.frame", row.names = c(NA, -7L))
df$funnel_stage <- factor(df$funnel_stage, levels = c("Available for Sale",
"Pending on Sale",
"Already Sold"))
df2 <- df %>%
group_by(funnel_stage) %>%
count()
df2
#> # A tibble: 3 x 2
#> # Groups: funnel_stage [3]
#> funnel_stage n
#> <fct> <int>
#> 1 Available for Sale 2
#> 2 Pending on Sale 2
#> 3 Already Sold 3
plot_ly() %>%
add_trace(
type = "funnelarea",
values = df2$n,
text = df2$funnel_stage)
packageVersion("dplyr")
#> [1] '0.8.5'
packageVersion("plotly")
#> [1] '4.9.2.1'

how to further refine expss table format?

I am trying to improve my table design using expss. My current design is shown below using the following code:
library(expss)
# bogus example data
x<-structure(list(visits= structure(c(17, 2, 23, 1, 21), label = "Total # Home Visits", class = c("labelled", "numeric")), months_enrolled = structure(c(21.42474, 51.105, 52.474, 53.75, 60.0392105), label = "Enrollment Duration (months)", class =c("labelled","numeric")), marital2 = structure(c("Married", NA, "Married", "Married", "Married"), label = "Marital Status", class = c("labelled", "character")), Relationship2 = structure(c("Mother", "Mother", "Mother", "Mother", "Mother"), label = "Relationship (recoded)", class = c("labelled", "character"))), row.names = c(NA, 5L), class = "data.frame")
htmlTable(x %>%
tab_cells(visits,months_enrolled) %>%
tab_rows(marital2, Relationship2, total()) %>% tab_stat_fun(Mean = w_mean, "Valid N" = w_n, method = list) %>%
tab_pivot() %>%
set_caption("Table 6: Bogus Visits and Duration by Characteristics") %>%
htmlTable(.,css.cell = c("width: 220px", # first column width
rep("width: 50px", ncol(.) - 1))))
I'd like to improve the table design by placing the mean statistics for Home Visits and Enrollment Duration as columns, thus saving a row for each level of Marital Status (and other vars in tab_rows). How is this achieved? Also, is it possible to shade alternate rows?
It seems the simplest way is to transpose the table:
htmlTable(x %>%
tab_cells(visits, months_enrolled) %>%
tab_cols(marital2, Relationship2, total()) %>%
tab_rows(total(label = "|")) %>%
tab_stat_fun(Mean = w_mean, "Valid N" = w_n) %>%
tab_pivot() %>%
tab_transpose() %>%
set_caption("Table 6: Bogus Visits and Duration by Characteristics") %>%
htmlTable(.,css.cell = c("width: 220px", # first column width
rep("width: 50px", ncol(.) - 1))))
