mutate and case_when with multiple cases - r

I would like to write code that converts general scale scores to T-scores.
The norms depend on two conditions, gender and age, and each combination requires a separate T-score.
My data looks something like this:
w <- factor(c("m", "w", "w", "m", "m", "w", "w", "w", "m", "m"))
x <- c(28, 18, 25, 29, 21, 19, 27, 26, 31, 22)
y <- c(80, 55, 74, 101, 84, 74, 65, 56, 88, 78)
z <- c(170, 174, 183, 190, 185, 178, 169, 163, 189, 184)
bsp1 <- data.frame(w, x, y, z)
colnames(bsp1) <- c("Geschlecht", "Alter", "xx", "yy")
rm(w, x, y, z)
bsp1
So far, I've created something like this, even though in this example it's not complete.
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht = "m" & Alter > 18 & xx == 55 ~ "1",
Geschlecht = "m" & Alter > 18 & xx == 56 ~ "2",
Geschlecht = "m" & Alter > 18 & xx == TRUE ~ "3",
))
I can't seem to figure out how to combine these multiple conditions in the case_when function. Also, if there needs to be a TRUE statement at the end, where does it go?
I hope it's understandable what I want to do here.
Thank you in advance.

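You probably meant to write the following. Comparisons need == (a single = is not a comparison), the TRUE fallback goes last, and all right-hand sides should have the same type: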
library(dplyr)
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht == "m" & Alter > 18 & xx == 55 ~ 1,
Geschlecht == "m" & Alter > 18 & xx == 56 ~ 2,
TRUE ~ 3
))
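To see how the conditions and the TRUE fallback interact, here is a minimal sketch on a toy data frame. The rows in toy are invented for illustration and are not the asker's data; only the column names and the case_when call are taken from the answer above. Conditions are checked from top to bottom, and a row gets the value of the first condition that is TRUE.

```r
library(dplyr)

# Toy rows, invented for illustration only
toy <- data.frame(
  Geschlecht = c("m", "m", "w", "m"),
  Alter      = c(25, 17, 30, 40),
  xx         = c(55, 55, 55, 56)
)

toy <- toy %>%
  mutate(xxx = case_when(
    Geschlecht == "m" & Alter > 18 & xx == 55 ~ 1,
    Geschlecht == "m" & Alter > 18 & xx == 56 ~ 2,
    TRUE ~ 3  # fallback: used only when no earlier condition is TRUE
  ))

toy$xxx
```

Only the first and last rows satisfy one of the explicit conditions; the other two fall through to the TRUE line.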

Related

Transferring name of column to a function in R

I'm trying to write a function that returns specific details about outliers (only sex, age, education, and the outlying value). I need to do this for many columns, so I would like to pass the name of the column to the function. Is there a way to do it?
For example, this code should return: f, 27, 12, 110.
my_data= data.frame( sex= c("f", "m", "f", "f", "m"),
age= c(22, 30, 24, 27, 30),
eduyears= c(12,16, 15, 12, 17),
weight= c(53, 70, 60, 110, 75),
height= c(160, 183, 157, 168, 180))
find_outliers= function (my_data, colname) {
out_values= boxplot.stats(my_data$colname)$out
out_ind= which(my_data$colname %in% out_values) #find outliers indices
outliers= my_data[out_ind ,c("sex","age","eduyears", colname)]
return (outliers)
}
find_outliers(weight)
Your function has two arguments, so you need to pass both in the call; you are only passing one, weight. Also, my_data$colname looks for a column that is literally named colname. Because you pass the column as an unquoted name, the function must first convert it to a character string and then use [[ to access the column.
Finally, see the famous question on how to Dynamically select data frame columns using $ and a vector of column names.
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
age = c(22, 30, 24, 27, 30),
eduyears = c(12,16, 15, 12, 17),
weight = c(53, 70, 60, 110, 75),
height = c(160, 183, 157, 168, 180))
find_outliers <- function (my_data, colname) {
# get the colname as a character string
colname <- as.character(substitute(colname))
out_values <- boxplot.stats(my_data[[colname]])$out
out_ind <- which(my_data[[colname]] %in% out_values) #find outliers indices
outliers <- my_data[out_ind, c("sex","age","eduyears", colname)]
outliers
}
find_outliers(my_data, weight)
#> sex age eduyears weight
#> 4 f 27 12 110
my_data |> find_outliers(weight)
#> sex age eduyears weight
#> 4 f 27 12 110
Created on 2022-11-05 with reprex v2.0.2
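If quoting the column name in the call is acceptable, a simpler variant is possible. The sketch below is an alternative to the answer above, not part of it; the name find_outliers_chr is made up for this example. It takes the column name as a character string, so no substitute() is needed.

```r
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
                      age = c(22, 30, 24, 27, 30),
                      eduyears = c(12, 16, 15, 12, 17),
                      weight = c(53, 70, 60, 110, 75),
                      height = c(160, 183, 157, 168, 180))

# Hypothetical variant: colname is a character string such as "weight"
find_outliers_chr <- function(data, colname) {
  stopifnot(is.character(colname), colname %in% names(data))
  out_values <- boxplot.stats(data[[colname]])$out
  # keep the rows whose value in the chosen column is an outlier
  data[data[[colname]] %in% out_values, c("sex", "age", "eduyears", colname)]
}

res <- find_outliers_chr(my_data, "weight")
res
```

This form is also convenient for looping over several column names, for example with lapply(c("weight", "height"), find_outliers_chr, data = my_data).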

Summing ranks for variable with fewest entries

I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average"))
# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2])>as.numeric(asc_or_desc[2,2])) {
a.df <- arrange(a.df, desc(State))
} else {
a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
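Please try this code and tell me whether I am on the right track.
I replaced the code you called clumsy with the following. Note that it orders the groups by their largest value, so Alaska comes first here because it contains the overall maximum (557):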
... %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
The whole code:
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average")) %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
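# -----------------------------------------------------------------------------
# group sizes (as in the question), needed for the statistics below
nA <- min(count(ungroup(a.df), State)$n)
nB <- max(count(ungroup(a.df), State)$n)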
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
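A way to avoid sorting altogether is to find the smaller group by its count and sum its ranks directly. The following is a minimal base R sketch under that idea, not the answerer's method. The object names long, sizes and smallest are made up for the example; nA, nB and a.U follow the question.

```r
alaska <- c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557)
calif  <- c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)

# long format: one row per village, ranked over both states together
long <- data.frame(
  State = rep(c("Alaska", "Calif"), c(length(alaska), length(calif))),
  Value = c(alaska, calif)
)
long$StateRank <- rank(long$Value, ties.method = "average")

sizes    <- table(long$State)              # number of entries per state
smallest <- names(sizes)[which.min(sizes)] # state with the fewest entries
nA <- min(sizes)
nB <- max(sizes)

# sum of ranks for the state with the fewest entries
a.U <- sum(long$StateRank[long$State == smallest])
```

If both groups have the same size, which.min picks the first one, which may or may not be what the textbook convention requires.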

Assigning a range of values to a descriptive variable in R

Apologies in advance - I am relatively new to R/RStudio and am trying to figure out how to assign a value of ranges to a letter grade. In the project I am working on I am trying to predict a hidden value, and one portion of it derives from the three semi-revealed values represented by letter grades. For example, I may know that the three traits revealed are an A, B+, and B but not the exact numbers. However, from the previous data I have pulled, I know the following ranges are correct for each letter grade:
A+: 90 or greater
A: 86-89
A-: 82-85
B+: 82-77
B: 75-78
B-: 72-74
C+: 69-71
C: 66-68
C-: 63-65
D: 60-62
F: 0-59
Is there a way for me to link these associated values to the letter grades to use later on in a multiple regression model?
Appreciate it.
I think you just want to organize all these values in a dataframe. Like this:
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 77, 74, 71, 68, 65, 62, 59)
mean_value <- (max_value+min_value)/2
df <- data.frame(grades, min_value, max_value, mean_value)
df
Edit: I'm no longer sure I understood your goal correctly. Here are two options to convert numeric grades to letter grades.
First, use the data.table package and perform a "rolling join". You can learn about rolling joins in this blog post:
https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
library(data.table)
grades =
"let, num
A+, 90
A, 86
A-, 82
B+, 80
B, 75
B-, 72
C+, 69
C, 68
C-, 63
D, 60
F, 0"
grades = read.csv(text=grades)
students =
"name, result
john, 60
mary, 86
anish, 79"
students = read.csv(text=students)
setDT(students)
setDT(grades)
grades[students, roll=TRUE, on=c("num"="result")]
#> let num name
#> 1: D 60 john
#> 2: A 86 mary
#> 3: B 79 anish
Alternatively, you could write a conversion function with a for loop. That would look something like this:
pct2let = function(grades, slack_fail = 2, slack_pass = 1){
bareme = structure(list(lb = c(90, 85, 80, 77, 73, 70, 65, 60, 57, 54,
50, 35, 0.01, 0), ub = c(100, 89.99, 84.99, 79.99, 76.99, 72.99,
69.99, 64.99, 59.99, 56.99, 53.99, 49.99, 34.99, 0.01), let = c("A+",
"A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "E",
"F", "F*")), .Names = c("lb", "ub", "let"), class = "data.frame", row.names = c(NA,
-14L))
grades = ifelse(grades < 50, grades + slack_fail, grades + slack_pass)
out = grades
for(i in nrow(bareme):1){
out[grades >= bareme$lb[i]] = bareme$let[i]
}
return(out)
}
pct2let(c(45, 65, 78))
#> [1] "E" "C+" "B+"
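A third option, sketched here in base R, is findInterval(), which returns the index of the last lower bound that a score reaches. This is an illustration rather than part of the answer above; the function name score_to_grade is made up, the lower bounds are the min_value column from the first data frame in this answer, and scores are assumed to lie between 0 and 100.

```r
# grades and their lower bounds, sorted in increasing order
grades    <- c("F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+")
min_value <- c(0, 60, 63, 66, 69, 72, 75, 79, 82, 86, 90)

# hypothetical helper: map numeric scores (0-100) to letter grades
score_to_grade <- function(score) {
  grades[findInterval(score, min_value)]
}

score_to_grade(c(60, 86, 79, 95, 12))
```

Because these bounds put B+ at 79, a score of 79 maps to B+ here, whereas the rolling join above uses 80 as the B+ bound and therefore returns B.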

Conditionally replace values across multiple columns based on string match in a separate column

I'm trying to conditionally replace values in multiple columns based on a string match in a different column. I'd like to do so in a single line of code using the across() function, but I keep getting errors that don't quite make sense to me. I feel like there is probably a simple solution, so if anyone could point me in the right direction, that would be fantastic!
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
# working but not concise
df %>%
mutate(total = ifelse(str_detect(type, "Park"), NA, total),
group_a = ifelse(str_detect(type, "Park"), NA, group_a),
group_b = ifelse(str_detect(type, "Park"), NA, group_b))
# concise but not working
df %>% mutate(across(total, group_a, group_b), ifelse(str_detect(type, "Park"), NA, .))
Update
We got a solution that works with my dummy dataset but is not working with my real data, so I am going to share a small snippet of my real data frame with the numbers changed and organization names hidden. When I run this line of code (df %>% mutate(across(c(Attempts, Canvasses, Completes)), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .))) on these data, I get the following error message:
Error: Problem with mutate() input ..2. x Input ..2 must be a
vector, not a formula object. i Input ..2 is
~ifelse(str_detect(long_name, "park-cemetery"), NA, .).
This is a small sample of the data that produces this error:
df <- structure(list(Org = c("OrgName", "OrgName", "OrgName", "OrgName",
"OrgName", "OrgName", "OrgName", "OrgName", "OrgName", "OrgName"
), nCode = c("M34", "R36", "R46", "X29", "M31", "K39", "Q12",
"Q39", "X41", "K27"), Attempts = c(100, 100, 100, 100, 100, 100,
100, 100, 100, 100), Canvasses = c(80, 80, 80, 80, 80, 80, 80,
80, 80, 80), Completes = c(50, 50, 50, 50, 50, 50, 50, 50, 50,
50), van_nocc_id = c(999, 999, 999, 999, 999, 999, 999, 999,
999, 999), van_name = c("M-Upper West Side", "SI-Rosebank", "SI-Tottenville",
"BX-park-cemetery-etc-Bronx", "M-Stuyvesant Town-Cooper Village",
"BK-Kensington", "Q-Broad Channel", "Q-Lindenwood", "BX-Wakefield",
"BK-East New York"), boro_short = c("M", "SI", "SI", "BX", "M",
"BK", "Q", "Q", "BX", "BK"), long_name = c("Upper West Side",
"Rosebank", "Tottenville", "park-cemetery-etc-Bronx", "Stuyvesant Town-Cooper Village",
"Kensington", "Broad Channel", "Lindenwood", "Wakefield", "East New York"
)), row.names = c(NA, -10L), class = "data.frame")
Final update
The curse of the misplaced closing bracket! Thanks to everyone for your help... the correct solution was df %>% mutate(across(c(Attempts, Canvasses, Completes), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .)))
If you use the newly introduced function across (which is the right tool for this task), you have to specify the function you want to apply inside across itself. In this case, the ifelse(...) call has to be written as a purrr-style lambda (starting with ~). Check out the across documentation and look for the arguments .cols and .fns.
df %>%
mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
Output
# type total group_a group_b
# 1 Park NA NA NA
# 2 Neighborhood 56 26 30
# 3 Airport 75 45 30
# 4 Park NA NA NA
# 5 Neighborhood 21 3 18
# 6 Neighborhood 56 46 10
Here is a data.table solution.
require(data.table)
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
setDT(df)
df[type == "Park", c("total", "group_a", "group_b") := NA]
Update: that didn't take long to figure out! Just needed to place the columns in a vector:
# concise AND working! (the ~ifelse(...) lambda must sit inside across())
df %>% mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
I had tried this initially but placed the columns in quotes... don't do that :)
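For comparison, the same replacement can be written in base R with logical row indexing. This is a small sketch on the dummy data from the question, not one of the posted solutions; the variable name cols is made up for the example.

```r
df <- data.frame(type = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
                 total = c(34, 56, 75, 89, 21, 56),
                 group_a = c(30, 26, 45, 60, 3, 46),
                 group_b = c(4, 30, 30, 29, 18, 10))

cols <- c("total", "group_a", "group_b")

# rows whose type contains "Park" get NA in all three columns
df[grepl("Park", df$type, fixed = TRUE), cols] <- NA
df
```

Unlike the mutate() version, this modifies df in place rather than returning a new data frame.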

Add rows with specific values in between existing rows

I have hockey data, called df
structure(list(event_index = 1:57, coords_x = c(80, 53, 31, -56,
-34, -33, -40, 30, -66, -36, 45, 17, -6, 47, -51, -31, -69, -86,
-70, 80, 65, -76, -71, 81, -57, 80, 75, 77, -71, -40, -83, 62,
77, 76, NA, -61, 69, -45, 68, 31, 58, 61, 80, 34, 80, -85, -37,
-57, 76, 14, 49, -82, -34, -36, -83, -84, -55), coords_y = c(-1,
14, -30, 17, 26, -23, -37, 17, -32, -18, 25, 17, -38, 21, 28,
22, 17, 13, 10, -37, -17, 9, 18, -11, 21, -7, 3, 3, -38, 31,
8, -30, -2, 4, NA, -5, 15, 10, -30, -34, 20, 27, -4, 8, -18,
19, 32, -21, 0, 40, -4, -30, -24, -28, -2, -3, 34), event_rinkside = c("R",
"R", "R", "L", "L", "L", "L", "R", "L", "L", "R", "N", "N", "R",
"L", "L", "L", "L", "L", "R", "R", "L", "L", "R", "L", "R", "R",
"R", "L", "L", "L", "R", "R", "R", NA, "L", "R", "L", "R", "R",
"R", "R", "R", "R", "R", "L", "L", "L", "R", "N", "R", "L", "L",
"L", "L", "L", "L")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -57L))
How do I insert a new row after every existing row, leaving me with 57 * 2 = 114 rows, where the values in the newly created rows depend on the event_rinkside column?
If event_rinkside equals R, then, I want to insert 82 into coords_x and 0 into coords_y.
If event_rinkside equals L, then, I want to insert -82 into coords_x and 0 into coords_y.
I feel like the solution to this SO question is a good starting point, but I don't know how to incorporate my own conditions.
Here is the solution I'm talking about:
library(purrr)
df %>%
group_by(id) %>%
map_dfr(rbind, NA) %>%
mutate(id = rep(df$id, each = 2))
Here's a solution with dplyr:
library(dplyr)
df %>%
mutate(coords_x = 82 * ifelse(event_rinkside == "L", -1, 1),
coords_y = 0) %>%
rbind(df, .) %>%
arrange(event_index)
How it works:
In the first step, mutate is used to modify an unassigned copy of df. The column coords_x gets the value of 82; the value is multiplied with -1 if event_rinkside == "L" and 1 otherwise. The column coords_y gets the value of 0.
In the next step, the unchanged original data frame df and the current unassigned and modified copy of it are combined with rbind. Here, . represents the result of the mutate step above. The result of rbind has the rows of the original version above the rows of the modified version.
In the last step, arrange is used to sort the rows along the values of event_index. In this way, each original row is directly followed by the corresponding modified row.
The result:
# A tibble: 114 x 4
event_index coords_x coords_y event_rinkside
<int> <dbl> <dbl> <chr>
1 1 80 -1 R
2 1 82 0 R
3 2 53 14 R
4 2 82 0 R
5 3 31 -30 R
6 3 82 0 R
7 4 -56 17 L
8 4 -82 0 L
9 5 -34 26 L
10 5 -82 0 L
# … with 104 more rows
I'm not too familiar with R, but my algorithm should work regardless. Each original row moves to every second position and the new row goes directly after it. I would create a second array and place the rows at the specific indexes by hand.
Some code for you (I usually write in Python, so that is what it is written in):
def reinsert(rows, make_new_row):
    # make_new_row is a placeholder for your own if/then logic on event_rinkside
    array_out = [None] * (len(rows) * 2)  # initialize to the desired length
    for n, row in enumerate(rows):
        array_out[2 * n] = row                    # original row
        array_out[2 * n + 1] = make_new_row(row)  # newly created row
    return array_out
You can insert the newly created rows in the loop, or add them after the fact, knowing that they will all sit at the odd-numbered indexes (counting from 0).
This is similar to Sven's answer, using case_when to distinguish between the possibilities within event_rinkside:
new_df <- df %>% bind_rows(
df %>% mutate(
coords_x = case_when(
event_rinkside == 'R' ~ 82,
event_rinkside == 'L' ~ -82,
TRUE ~ coords_x
),
coords_y = case_when(
event_rinkside == 'R' ~ 0,
event_rinkside == 'L' ~ 0,
TRUE ~ coords_y
)
)
) %>% arrange(
event_index
)
If you know the ranges of your variables, it could be simplified into if_elses.
My attempt, which is pretty similar to other answers already,
df <- df[rep(1:nrow(df), each = 2),] ## Create a duplicate row after each row
df[seq(2,nrow(df),2),] <- df[seq(2,nrow(df),2),] %>% mutate(coords_x = case_when(event_rinkside == "R" ~ 82,
event_rinkside == "L" ~ -82,
TRUE ~ coords_x),
coords_y = case_when(event_rinkside == "R" ~ 0,
event_rinkside == "L" ~ 0,
TRUE ~ coords_y)
)
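The same interleaving can also be done in base R without dplyr. The sketch below uses a three-row toy data frame (values chosen for illustration, with made-up names df_small, added and both) and relies on order() keeping tied positions in their original order, so each original row is followed by its modified copy. Rows whose event_rinkside is neither "R" nor "L" are copied unchanged, as in the case_when answers above.

```r
# toy data for illustration: one "R" row, one "L" row, one "N" row
df_small <- data.frame(event_index = 1:3,
                       coords_x = c(80, -56, 17),
                       coords_y = c(-1, 17, 17),
                       event_rinkside = c("R", "L", "N"))

# build the rows to insert
added <- df_small
is_r <- added$event_rinkside %in% "R"
is_l <- added$event_rinkside %in% "L"
added$coords_x[is_r] <- 82
added$coords_x[is_l] <- -82
added$coords_y[is_r | is_l] <- 0

# stack originals and copies, then interleave them by original position
n <- nrow(df_small)
both <- rbind(df_small, added)
both <- both[order(c(seq_len(n), seq_len(n))), ]
rownames(both) <- NULL
both
```

Using %in% instead of == means that an NA in event_rinkside is treated as "no match" rather than propagating NA into the coordinates.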
