How to use case_when when doing quartiles in R?

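If I have this tibble:
# entries cut off in the post are filled in from the output printed in the answer below
df1 <- tibble(
  period = c("2010END", "2011END", "2010Q1", "2010Q2", "2011END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
  website = c("google", "google", "facebook", "facebook", "youtube"),
  method = c("website", "phone", "website", "laptop", "phone"),
  values = c(1, NA, 1, 2, 3))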
And then I have this data frame, which specifies the quantiles to create for each method along with the rank to assign to each of them:
df2 <- tibble(
  method = c("phone", "phone", "phone", "website", "website", "website",
             "laptop", "laptop", "laptop"),
  rank = c(3,2,1,3,2,1,3,2,1),
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%",
                     "highest 25%", "25 to 50%", "less than 25%",
                     "lowest 25%", "25 to 50%", "more than 50%"))
How can I use a case_when statement to create a ranking column based on quartiles calculated from the values column of the first data frame?
I'm trying to apply the quantile rules from the second data frame to create a ranking column in the first one, and I'm stuck on how to use case_when for it.

I would do something like this:
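library(tidyverse)

df1[sample(1:5, 1000, replace = TRUE), ] %>%
  mutate(values = sample(c(df1$values, 1:30), 1000, replace = TRUE)) %>%
  group_by(method) %>%
  # quartile number (1 to 4) of each value within its method
  mutate(q = as.double(cut(values, quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE),
                           labels = 1:4, include.lowest = TRUE))) %>%
  left_join(
    # expand df2 so that every quartile number maps to exactly one rank
    df2 %>% mutate(q = list(1, 2, c(3, 4), 4, c(2, 3), 1, 1, 2, c(3, 4))) %>% unnest(q),
    by = c("method", "q")
  ) %>% select(-q)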
# A tibble: 1,000 × 7
   period  date       website  method  values  rank tile_condition
   <chr>   <chr>      <chr>    <chr>    <dbl> <dbl> <chr>
 1 2010END 31-12-2010 google   website      7     2 25 to 75%
 2 2011END 31-12-2011 google   phone       18     1 more than 50%
 3 2010Q1  30-04-2010 facebook website     21     2 25 to 75%
 4 2011END 30-09-2010 youtube  phone       15     1 more than 50%
 5 2011END 30-09-2010 youtube  phone       26     1 more than 50%
 6 2011END 31-12-2011 google   phone        3     3 lowest 25%
 7 2010END 31-12-2010 google   website      1     1 less than 25%
 8 2010Q1  30-04-2010 facebook website      2     1 less than 25%
 9 2010Q2  31-07-2010 facebook laptop      14     2 25 to 50%
10 2010Q2  31-07-2010 facebook laptop      16     1 more than 50%
# … with 990 more rows
Notice that I resampled your input to 1000 rows and drew random new values, purely for illustration. Also notice that I fixed df2 so that method website covers the full range of values; in your example the 50% to 75% quartile is missing for it.
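As an aside on the list column used in the join: it maps quartile numbers onto the rows of df2. Here is a minimal sketch on just the three phone rows (the name lookup is made up for illustration). unnest() turns each list element into its own row, so a rule that spans two quartiles becomes two rows carrying the same rank.

```r
library(dplyr)
library(tidyr)

# Three rules for one method; `lookup` is an illustrative name
lookup <- tibble(
  method = c("phone", "phone", "phone"),
  rank = c(3, 2, 1),
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%")
) %>%
  # quartiles covered by each rule: 1, 2, and both 3 and 4
  mutate(q = list(1, 2, c(3, 4))) %>%
  # one row per (rule, quartile) pair
  unnest(q)

lookup
```

After unnesting there are four rows, one per quartile, which is what lets the join on method and q find exactly one rank for every value.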
Adjusted df2 input:
structure(list(method = c("phone", "phone", "phone", "website",
"website", "website", "laptop", "laptop", "laptop"), rank = c(3,
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%",
"more than 50%", "highest 25%", "25 to 75%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))

If I've understood your question correctly, you first have to create a table of quantiles to compare against:
df_quants <-
  df1 %>%
  drop_na(values) %>%
  group_by(method) %>%
  summarize(quant25 = quantile(values, probs = 0.25),
            quant50 = quantile(values, probs = 0.5),
            quant75 = quantile(values, probs = 0.75),
            quant100 = quantile(values, probs = 1))
Then, using a join and a case_when statement, you would arrive at:
df2 %>%
  left_join(df_quants, by = 'method') %>%
  mutate(tiles =
           case_when(rank < quant25 ~ 'lowest 25%',
                     rank < quant50 ~ '25 to 50%',
                     rank < quant75 ~ 'more than 50%',
                     rank >= quant75 ~ 'highest 25%'))
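If the goal is instead to label each row of the first data frame, one possible variation is to compute the thresholds per method inside mutate and test values against them. This is only a sketch under my own assumptions: toy, q25, q50 and tile are made-up names, the numbers are invented, and the three labels follow the rules listed for phone.

```r
library(dplyr)

# Toy data: the column names mirror df1, the numbers are invented
toy <- tibble(
  method = rep(c("phone", "website"), each = 8),
  values = c(1:8, seq(10, 80, by = 10))
)

ranked <- toy %>%
  group_by(method) %>%
  mutate(
    # per-method thresholds, recycled to every row of the group
    q25 = quantile(values, 0.25, na.rm = TRUE),
    q50 = quantile(values, 0.50, na.rm = TRUE),
    # conditions are checked top to bottom; the first match wins
    tile = case_when(
      is.na(values) ~ NA_character_,
      values <= q25 ~ "lowest 25%",
      values <= q50 ~ "25 to 50%",
      TRUE          ~ "more than 50%"
    )
  ) %>%
  ungroup()

count(ranked, method, tile)
```

Because case_when stops at the first matching condition, the second branch only needs the upper bound of its interval.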

Here's a quick version using chop_quantiles from the santoku package. It doesn't get the exact labels you want. To do that you'd have to parse your tile_condition column, which is a little tricky.
library(santoku)
df |>
  group_by(method) |>
  mutate(
    quantile = chop_quantiles(values, c(0.25, 0.50),
      labels = c("Lowest 25%", "25 to 50%", "Above 50%"), extend = TRUE)
  )


Classifying columns based on str_detect

I am currently working with a data frame that looks like this:
Example <- structure(list(ID = c(12301L, 12301L, 15271L, 11888L, 15271L,
15271L, 15271L), StationOwner = c("Brian", "Brian", "Simon",
"Brian", "Simon", "Simon", "Simon"), StationName = c("Red", "Red",
"Red", "Green", "Yellow", "Yellow", "Yellow"), Parameter = c("Rain - Daily",
"Temperature -Daily", "VPD - Daily", "Rain - Daily", "Rain - Daily",
"Temperature -Daily", "VPD - Daily")), class = "data.frame", row.names = c(NA,
-7L))
I am looking into using str_detect to pick out, for example, all the observations that start with "Rain -" and put whatever comes after it into a new column called "Rain". I have been able to filter only the values that start with "Rain" using str_detect, but have not found a way to assign them automatically. Is there a specific function that would help with this? Appreciate the pointers, thanks!
Example of desired output that I am trying to achieve:
Desired <- structure(list(ID = c(12301L, 15271L, 12301L, 15271L
), StationOwner = c("Brian", "Simon", "Brian", "Simon"), StationName = c("Red",
"Red", "Green", "Yellow"), Rain = c("Daily", NA, "Daily", "Daily"
), Temperature = c("Daily", NA, NA, "Daily"), VDP = c(NA, "Daily",
NA, "Daily")), class = "data.frame", row.names = c(NA, -4L))
Directly using pivot_wider:
pivot_wider(Example, names_from = Parameter, values_from = Parameter,
names_repair = ~str_remove(.,' .*'),values_fn = ~str_remove(.,'.*- ?'))
# A tibble: 4 x 6
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
It's not using str_detect, but you can achieve Desired with:
Example %>%
  separate(Parameter, c('a', 'b'), sep = "-") %>%
  mutate(across(where(is.character), ~trimws(.x))) %>%
  pivot_wider(id_cols = c("ID","StationOwner", "StationName"), names_from = "a", values_from = "b")
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
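If you do want to stay with str_detect, a rough sketch for a single column could look like this. The names params and with_rain are illustrative, and only three rows of Example are used: str_detect decides which rows qualify, and str_remove strips everything up to and including the dash.

```r
library(dplyr)
library(stringr)

# Three rows in the style of Example; `params` is an illustrative name
params <- data.frame(
  ID = c(12301L, 12301L, 15271L),
  Parameter = c("Rain - Daily", "Temperature -Daily", "VPD - Daily")
)

with_rain <- params %>%
  mutate(Rain = if_else(
    str_detect(Parameter, "^Rain"),               # rows starting with "Rain"
    str_trim(str_remove(Parameter, "^[^-]*-")),   # keep what follows the dash
    NA_character_                                 # everything else stays NA
  ))

with_rain
```

This would have to be repeated for every parameter, which is why the pivot_wider approaches scale better.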

Applying factor labels based on a condition in R

I have a data set with a variable 'education' which is coded differently in each of the three countries included, for example:
Code  Country 1     Country 2     Country 3
1     No education  No education  No education
2     Primary       Primary       Islamic education
3     Secondary     Secondary     Primary
4     NA            NA            Secondary
I need to apply factor levels, which are different for each country.
Below is my attempt, but it doesn't appear to work:
df <- data.frame(
  Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
  Education_1 = sample(1:4)
)
df$Education <-
  if (df$Country == "Country1") {
    factor(df$Education_1,
           levels = c(1:4),
           labels = c("No education", "Primary", "Secondary", "NA"))
  } else if (df$Country == "Country2") {
    factor(df$Education_1,
           levels = c(1:4),
           labels = c("No education", "Primary", "Secondary", "NA"))
  } else {
    factor(df$Education_1,
           levels = c(1:4),
           labels = c("No education", "Islamic education", "Primary", "Secondary"))
  }
Perhaps this helps? It takes the table that maps each country's education codes to education categories and converts it to long format.
A left join then attaches the matching category to the two-column data frame of countries and education codes.
You could use the resulting education column as a string, or recode the codes so they are consistent across countries.
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4))
df_ed <- structure(list(Code = 1:4, Country.1 = c("No education", "Primary",
"Secondary", NA), Country.2 = c("No education", "Primary", "Secondary",
NA), Country.3 = c("No education", "Islamic education", "Primary",
"Secondary")), class = "data.frame", row.names = c(NA, -4L))
df_levels <-
  df_ed %>%
  pivot_longer(-Code) %>%
  mutate(name = str_replace(name, "\\.", " "))
df1 <-
  df %>%
  left_join(df_levels, by = c("Country" = "name", "Education_1" = "Code"))
#> Country Education_1 value
#> 1 Country 1 3 Secondary
#> 2 Country 2 4 <NA>
#> 3 Country 3 1 No education
#> 4 Country 1 2 Primary
#> 5 Country 3 3 Primary
#> 6 Country 2 4 <NA>
Created on 2021-09-22 by the reprex package (v2.0.0)
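If a factor is needed in the end, one option, sketched here with base R only, is to look up the label for each row and then build a single factor whose levels are the union of all country-specific labels. The names labels_by_country and toy, the four example rows, and the level order are assumptions for illustration.

```r
# Country-specific labels, indexed by education code (illustrative subset)
labels_by_country <- list(
  "Country 1" = c("No education", "Primary", "Secondary", NA),
  "Country 3" = c("No education", "Islamic education", "Primary", "Secondary")
)

toy <- data.frame(
  Country = c("Country 1", "Country 3", "Country 3", "Country 1"),
  Education_1 = c(2L, 2L, 4L, 4L)
)

# Look up the label for each (country, code) pair
toy$Education <- mapply(
  function(country, code) labels_by_country[[country]][code],
  toy$Country, toy$Education_1,
  USE.NAMES = FALSE
)

# One factor for all countries, with the union of labels as levels
toy$Education <- factor(
  toy$Education,
  levels = c("No education", "Islamic education", "Primary", "Secondary")
)

toy
```

Code 2 becomes Primary in Country 1 and Islamic education in Country 3, while code 4 in Country 1 stays NA.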

name variable at the same time using complete

I would like to get a subtotal by age group. The sample data and the code are:
# create a numeric variable AGE
AGE <- sample(0:110, 100, replace = TRUE)
# create data frame (its name is missing from the post; `dat` is a placeholder)
dat <- data.frame(AGE)
summary_data <- dat %>%
  group_by(grp = cut(
    AGE,
    breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
    labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
  )) %>% summarise("Total People" = n()) %>% complete(grp = levels(grp), fill = list("Total People" = 0))
which gave the results that looks like this:
Is it possible to label grp as "Age Group"? and sort it by age?
I tried to define the name as follows and the results are strange. What did I do wrong?
# `dat` is a placeholder; the data frame's name is missing from the post
summary_data <- dat %>%
  group_by("Age Group" = cut(
    AGE,
    breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
    labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
  )) %>% summarise("Total People" = n()) %>% complete("Age Group" = levels("Age Group"), fill = list("Total People" = 0))
The new results are:
The variable is now named "Age Group", but the table doesn't fill in 0 for the age categories that have no observations. What should I do?
The ideal results should look like this:
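We can change the double quotes to backquotes inside levels so that complete evaluates the column. With double quotes, levels("Age Group") is called on a plain character string, which has no levels and returns NULL. The backquotes are needed because the column name is non-standard, i.e. it includes a space.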
complete("Age Group" = levels(`Age Group`), fill = list("Total People" = 0))
# A tibble: 7 x 2
# `Age Group` `Total People`
# <chr> <dbl>
#1 Adolescent(13-17.999 yrs) 14
#2 Adult(18-64.999 yrs.) 37
#3 Child(2-12.999 yrs) 2
#4 Elderly(65-199 yrs) 46
#5 Foetus(0 yr) 1
#6 Infant(0.083-1.999 yrs) 0
#7 Neonate (0.001 - 0.082 yr) 0
If we want to arrange the rows as well:
# `dat` is a placeholder; the data frame's name is missing from the post
dat %>%
  group_by("Age Group" = cut(
    AGE,
    breaks=c(-Inf, 0,0.001, 0.083, 2, 13, 65,1000),
    labels = c("Foetus(0 yr)","Neonate (0.001 - 0.082 yr)","Infant(0.083-1.999 yrs)","Child(2-12.999 yrs)", "Adolescent(13-17.999 yrs)","Adult(18-64.999 yrs.)","Elderly(65-199 yrs)")
  )) %>%
  summarise("Total People" = n()) %>%
  complete("Age Group" = levels(`Age Group`), fill = list("Total People" = 0)) %>%
  arrange(`Total People`)
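For sorting by age rather than by count, one alternative I would consider is to keep the cut() result as a factor and let count() keep the empty levels with .drop = FALSE. This is a sketch, not the code from the question: the data frame ages, the simplified breaks and the labels are my own assumptions. The rows then come out in the order of the factor levels, i.e. by age, and no complete() call is needed.

```r
library(dplyr)

set.seed(42)
ages <- data.frame(AGE = sample(0:110, 100, replace = TRUE))

# Simplified, illustrative age groups; "Over 110" can never occur in this sample
age_labels <- c("Child", "Adolescent", "Adult", "Elderly", "Over 110")

summary_by_age <- ages %>%
  mutate(`Age Group` = cut(AGE,
                           breaks = c(-Inf, 12, 17, 64, 110, Inf),
                           labels = age_labels)) %>%
  # .drop = FALSE keeps factor levels without observations, counted as 0
  count(`Age Group`, name = "Total People", .drop = FALSE)

summary_by_age
```

The empty "Over 110" group shows up with a count of 0 as the last row.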

Create funnelarea chart from data frame using plotly rstudio

I want to create a funnelarea chart using plotly, this is the example from plotly:
fig <- plot_ly(
type = "funnelarea",
values = c(5, 4, 3, 2, 1),
text = c("The 1st","The 2nd", "The 3rd", "The 4th", "The 5th"),
marker = list(colors = c("deepskyblue", "lightsalmon", "tan", "teal", "silver"),
line = list(color = c("wheat", "wheat", "blue", "wheat", "wheat"), width = c(0, 1, 5,
0, 4))),
textfont = list(family = "Old Standard TT, serif", size = 13, color = "black"),
opacity = 0.65)
I would like to fill this chart from a data frame, using categories from my data frame's columns instead of the hard-coded text and values, but I can't find a way to do it.
Example of my dataframe
funnel_stage            size  purchaser_payment
1. Available for Sale     10              10000
1. Available for Sale     15              15000
2. Pending on Sale         8               8000
2. Pending on Sale         9               9000
3. Already Sold            1               1000
3. Already Sold           45              45000
3. Already Sold           12              12000
I would like my funnel to be filled by counting how many times each value of the first column occurs. It would look like:
It's probably easiest if you first bring your data into a shape with one row per category in the correct order, see df2 below.
df <- structure(
list(funnel_stage = c("Available for Sale", "Available for Sale",
"Pending on Sale", "Pending on Sale", "Already Sold",
"Already Sold", "Already Sold"),
size = c(10L, 15L, 8L, 9L, 1L, 45L, 12L),
purchaser_payment = c(10000L, 15000L, 8000L, 9000L, 1000L, 45000L, 12000L)),
class = "data.frame", row.names = c(NA, -7L))
df$funnel_stage <- factor(df$funnel_stage, levels = c("Available for Sale",
"Pending on Sale",
"Already Sold"))
library(dplyr)
df2 <- df %>%
  group_by(funnel_stage) %>%
  count()
df2
#> # A tibble: 3 x 2
#> # Groups: funnel_stage [3]
#> funnel_stage n
#> <fct> <int>
#> 1 Available for Sale 2
#> 2 Pending on Sale 2
#> 3 Already Sold 3
library(plotly)
plot_ly() %>%
  add_trace(
    type = "funnelarea",
    values = df2$n,
    text = df2$funnel_stage)

how to further refine expss table format?

I am trying to improve my table design using expss. My current design is shown below using the following code:
# bogus example data
x<-structure(list(visits= structure(c(17, 2, 23, 1, 21), label = "Total # Home Visits", class = c("labelled", "numeric")), months_enrolled = structure(c(21.42474, 51.105, 52.474, 53.75, 60.0392105), label = "Enrollment Duration (months)", class =c("labelled","numeric")), marital2 = structure(c("Married", NA, "Married", "Married", "Married"), label = "Marital Status", class = c("labelled", "character")), Relationship2 = structure(c("Mother", "Mother", "Mother", "Mother", "Mother"), label = "Relationship (recoded)", class = c("labelled", "character"))), row.names = c(NA, 5L), class = "data.frame")
htmlTable(x %>%
tab_cells(visits,months_enrolled) %>%
tab_rows(marital2, Relationship2, total()) %>% tab_stat_fun(Mean = w_mean, "Valid N" = w_n, method = list) %>%
tab_pivot() %>%
set_caption("Table 6: Bogus Visits and Duration by Characteristics") %>%
htmlTable(.,css.cell = c("width: 220px", # first column width
rep("width: 50px", ncol(.) - 1))))
I'd like to improve the table design by placing the mean statistics for Home Visits and Enrollment Duration as columns, thus saving a row for each level of Marital Status (and other vars in tab_rows). How is this achieved? Also, is it possible to shade alternate rows?
It seems the simplest way is to transpose the table:
htmlTable(x %>%
tab_cells(visits, months_enrolled) %>%
tab_cols(marital2, Relationship2, total()) %>%
tab_rows(total(label = "|")) %>%
tab_stat_fun(Mean = w_mean, "Valid N" = w_n) %>%
tab_pivot() %>%
tab_transpose() %>%
set_caption("Table 6: Bogus Visits and Duration by Characteristics") %>%
htmlTable(.,css.cell = c("width: 220px", # first column width
rep("width: 50px", ncol(.) - 1))))
