Passing a column name to a function in R

I'm trying to write a function that returns specific details about outliers (only sex, age, education, and the outlying value). I need to do this for many parameters, so I would like to pass the name of the column to the function. Is there a way to do it?
For example, this code should return: f, 27, 12, 110.
my_data= data.frame( sex= c("f", "m", "f", "f", "m"),
age= c(22, 30, 24, 27, 30),
eduyears= c(12,16, 15, 12, 17),
weight= c(53, 70, 60, 110, 75),
height= c(160, 183, 157, 168, 180))
find_outliers= function (my_data, colname) {
out_values= boxplot.stats(my_data$colname)$out
out_ind= which(my_data$colname %in% out_values) #find outliers indices
outliers= my_data[out_ind ,c("sex","age","eduyears", colname)]
return (outliers)
}
find_outliers(weight)

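Your function has two arguments, so you need to pass both in the call; you are passing only one, weight. Also, my_data$colname looks for a column literally named colname instead of using the value of the argument. If you want to pass the column as an unquoted name, the function must first convert it to a character string and then index with [[ to access the column.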
Finally, see the famous question on how to Dynamically select data frame columns using $ and a vector of column names.
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
age = c(22, 30, 24, 27, 30),
eduyears = c(12,16, 15, 12, 17),
weight = c(53, 70, 60, 110, 75),
height = c(160, 183, 157, 168, 180))
find_outliers <- function (my_data, colname) {
# get the colname as a character string
colname <- as.character(substitute(colname))
out_values <- boxplot.stats(my_data[[colname]])$out
out_ind <- which(my_data[[colname]] %in% out_values) #find outliers indices
outliers <- my_data[out_ind, c("sex","age","eduyears", colname)]
outliers
}
find_outliers(my_data, weight)
#> sex age eduyears weight
#> 4 f 27 12 110
my_data |> find_outliers(weight)
#> sex age eduyears weight
#> 4 f 27 12 110
Created on 2022-11-05 with reprex v2.0.2
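If you would rather pass the column name as a quoted string, a minimal sketch along these lines should also work. The name find_outliers_chr is made up for illustration; it skips substitute() and indexes with [[ directly.

```r
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
                      age = c(22, 30, 24, 27, 30),
                      eduyears = c(12, 16, 15, 12, 17),
                      weight = c(53, 70, 60, 110, 75),
                      height = c(160, 183, 157, 168, 180))

# Hypothetical variant: colname is a character string such as "weight"
find_outliers_chr <- function(data, colname) {
  stopifnot(is.character(colname), colname %in% names(data))
  out_values <- boxplot.stats(data[[colname]])$out   # outlying values
  out_ind <- which(data[[colname]] %in% out_values)  # their row indices
  data[out_ind, c("sex", "age", "eduyears", colname)]
}

res <- find_outliers_chr(my_data, "weight")
res
```

Because the column name is an ordinary string here, the function can be applied to several columns at once, for example lapply(c("weight", "height"), find_outliers_chr, data = my_data).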

Related

Recoding several columns at once

I'm recoding values to letters with the following line of code (which worked) :
df_mean$COMMUNITY_mean <- cut(df_mean$COMMUNITY_mean, breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
In order to apply it to multiple columns :
names <- colnames(df_mean) #extract columns names to a list
names <- names[-c(1:10)]; #remove 10 first columns not interested in
for(i in 1:length(names)) {
df_mean <- cut(names[[i]], breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
}
But it fails to execute "Error in cut.default(names[[i]], breaks = c(0, 10, 25, 50, 75, 90, Inf), :
'x' must be numerical"
Any suggestions ?
Replace the first line with the following to extract only the numeric columns from your data:
names <- colnames(df_mean[as.logical(lapply(df_mean , is.numeric))])
# remove this line ===> names <- colnames(df_mean)
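Note that the loop in the question also passes the column name (a string) to cut() and overwrites df_mean itself on every iteration, which is what triggers the "'x' must be numeric" error. A minimal sketch of a loop that indexes each column and assigns the result back to it, assuming a small made-up df_mean with an id column standing in for the columns to skip:

```r
# Made-up stand-in for df_mean; "id" plays the role of the columns to leave alone
df_mean <- data.frame(id = 1:3,
                      A_mean = c(5, 30, 95),
                      B_mean = c(12, 60, 80))

num_cols <- names(df_mean)[sapply(df_mean, is.numeric)]  # numeric columns only
num_cols <- setdiff(num_cols, "id")                      # drop columns not to recode

for (col in num_cols) {
  # index the column by name and write the recoded factor back to the same column
  df_mean[[col]] <- cut(df_mean[[col]],
                        breaks = c(0, 10, 25, 50, 75, 90, Inf),
                        labels = c("a", "b", "c", "d", "e", "f"))
}
df_mean
```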

mutate and case_when with multiple cases

I would like to write code that converts raw scale scores to T-scores.
The norms depend on two conditions, gender and age, and each combination requires a separate T-score.
So my data looks something like this:
w <- factor(c("m", "w", "w", "m", "m", "w", "w", "w", "m", "m"))
x <- c(28, 18, 25, 29, 21, 19, 27, 26, 31, 22)
y <- c(80, 55, 74, 101, 84, 74, 65, 56, 88, 78)
z <- c(170, 174, 183, 190, 185, 178, 169, 163, 189, 184)
bsp1 <- data.frame(w, x, y, z)
colnames(bsp1) <- c("Geschlecht", "Alter", "xx", "yy")
rm(w, x, y, z)
bsp1
So far, I've created something like this, even though in this example it's not complete.
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht = "m" & Alter > 18 & xx == 55 ~ "1",
Geschlecht = "m" & Alter > 18 & xx == 56 ~ "2",
Geschlecht = "m" & Alter > 18 & xx == TRUE ~ "3",
))
I can't seem to figure out, how to combine these multiple conditions into the case_when function. Also, if there needs to be a TRUE statement for it at the end, where does it go?
I hope it's kind of understandable, what I want to do here.
Thank you in advance.
You probably meant to write the following. Use == (not =) for comparisons, and put the TRUE fallback last as its own case:
library(dplyr)
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht == "m" & Alter > 18 & xx == 55 ~ 1,
Geschlecht == "m" & Alter > 18 & xx == 56 ~ 2,
TRUE ~ 3
))
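To illustrate how the cases combine, here is a small sketch with made-up cutoffs and T-scores (the values 80, 70, 60 and 55 are assumptions for illustration, not real norms). case_when() evaluates the conditions in order and uses the first one that matches, so the TRUE fallback catches every row that matched nothing above it.

```r
library(dplyr)

# Reduced example data in the shape of bsp1
bsp1 <- data.frame(Geschlecht = factor(c("m", "w", "w", "m")),
                   Alter = c(28, 18, 25, 17),
                   xx = c(80, 55, 74, 101))

res <- bsp1 %>%
  mutate(xxx = case_when(
    Geschlecht == "m" & Alter > 18 & xx >= 80 ~ 60,  # hypothetical norm for men
    Geschlecht == "w" & Alter > 18 & xx >= 70 ~ 55,  # hypothetical norm for women
    TRUE ~ NA_real_                                  # everything else: no score
  ))
res
```

Using NA_real_ in the fallback keeps all right-hand sides the same type, which case_when() requires.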

Assigning a range of values to a descriptive variable in R

Apologies in advance: I am relatively new to R/RStudio and am trying to figure out how to assign a range of values to a letter grade. In the project I am working on, I am trying to predict a hidden value, and one portion of it derives from three semi-revealed values represented by letter grades. For example, I may know that the three traits revealed are an A, B+, and B, but not the exact numbers. However, from the data I have pulled previously, I know the following ranges are correct for each letter grade:
A+: 90 or greater
A: 86-89
A-: 82-85
B+: 82-77
B: 75-78
B-: 72-74
C+: 69-71
C: 66-68
C-: 63-65
D: 60-62
F: 0-59
Is there a way for me to link these associated values to the letter grades to use later on in a multiple regression model?
Appreciate it.
I think you just want to organize all these values in a dataframe. Like this:
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 77, 74, 71, 68, 65, 62, 59)
mean_value <- (max_value+min_value)/2
df <- data.frame(grades, min_value, max_value, mean_value)
df
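With such a table in place, one way to turn observed letter grades into numbers for a regression is a lookup with match(). This is only a sketch: it assumes the range midpoint is an acceptable numeric stand-in for each letter, and the names lookup, observed and observed_mid are made up for illustration.

```r
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 77, 74, 71, 68, 65, 62, 59)
lookup <- data.frame(grades, mean_value = (min_value + max_value) / 2)

observed <- c("A", "B+", "B")  # the three revealed traits from the question
# match() returns the row of each observed letter in the lookup table
observed_mid <- lookup$mean_value[match(observed, lookup$grades)]
observed_mid
```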
Edit: I'm no longer sure I understood your goal correctly. Here are two options to convert numeric grades to letter grades.
First, use the data.table package and perform a "rolling join". You can learn about rolling joins in this blog post:
https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
library(data.table)
grades =
"let, num
A+, 90
A, 86
A-, 82
B+, 80
B, 75
B-, 72
C+, 69
C, 68
C-, 63
D, 60
F, 0"
grades = read.csv(text=grades)
students =
"name, result
john, 60
mary, 86
anish, 79"
students = read.csv(text=students)
setDT(students)
setDT(grades)
grades[students, roll=TRUE, on=c("num"="result")]
#> let num name
#> 1: D 60 john
#> 2: A 86 mary
#> 3: B 79 anish
Alternatively, you could write a conversion function with a for loop. That would look something like this:
pct2let = function(grades, slack_fail = 2, slack_pass = 1){
bareme = structure(list(lb = c(90, 85, 80, 77, 73, 70, 65, 60, 57, 54,
50, 35, 0.01, 0), ub = c(100, 89.99, 84.99, 79.99, 76.99, 72.99,
69.99, 64.99, 59.99, 56.99, 53.99, 49.99, 34.99, 0.01), let = c("A+",
"A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "E",
"F", "F*")), .Names = c("lb", "ub", "let"), class = "data.frame", row.names = c(NA,
-14L))
grades = ifelse(grades < 50, grades + slack_fail, grades + slack_pass)
out = grades
for(i in nrow(bareme):1){
out[grades >= bareme$lb[i]] = bareme$let[i]
}
return(out)
}
pct2let(c(45, 65, 78))
#> [1] "E" "C+" "B+"
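The same kind of numeric-to-letter conversion can also be sketched without a loop using base R's findInterval(). The lower bounds below are the ones from the first table in this answer (so B+ starts at 79), and there is no slack adjustment; both are assumptions of this sketch rather than part of pct2let.

```r
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)

# findInterval() needs the breakpoints in ascending order
ord <- order(min_value)
scores <- c(60, 86, 79)
# for each score, pick the grade whose lower bound is the largest one not above it
letters_out <- grades[ord][findInterval(scores, min_value[ord])]
letters_out
```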

Conditionally replace values across multiple columns based on string match in a separate column

I'm trying to conditionally replace values in multiple columns based on a string match in a different column. I'd like to do so in a single line of code using the across() function, but I keep getting errors that don't quite make sense to me. I feel like there is probably a simple solution, so if anyone could point me in the right direction, that would be fantastic!
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
# working but not concise
df %>%
mutate(total = ifelse(str_detect(type, "Park"), NA, total),
group_a = ifelse(str_detect(type, "Park"), NA, group_a),
group_b = ifelse(str_detect(type, "Park"), NA, group_b))
# concise but not working
df %>% mutate(across(total, group_a, group_b), ifelse(str_detect(type, "Park"), NA, .))
Update
We got a solution that works with my dummy dataset but is not working with my real data, so I am going to share a small snippet of my real data frame with the numbers changed and organization names hidden. When I run this line of code (df %>% mutate(across(c(Attempts, Canvasses, Completes)), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .))) on these data, I get the following error message:
Error: Problem with mutate() input ..2. x Input ..2 must be a
vector, not a formula object. i Input ..2 is
~ifelse(str_detect(long_name, "park-cemetery"), NA, .).
This is a small sample of the data that produces this error:
df <- structure(list(Org = c("OrgName", "OrgName", "OrgName", "OrgName",
"OrgName", "OrgName", "OrgName", "OrgName", "OrgName", "OrgName"
), nCode = c("M34", "R36", "R46", "X29", "M31", "K39", "Q12",
"Q39", "X41", "K27"), Attempts = c(100, 100, 100, 100, 100, 100,
100, 100, 100, 100), Canvasses = c(80, 80, 80, 80, 80, 80, 80,
80, 80, 80), Completes = c(50, 50, 50, 50, 50, 50, 50, 50, 50,
50), van_nocc_id = c(999, 999, 999, 999, 999, 999, 999, 999,
999, 999), van_name = c("M-Upper West Side", "SI-Rosebank", "SI-Tottenville",
"BX-park-cemetery-etc-Bronx", "M-Stuyvesant Town-Cooper Village",
"BK-Kensington", "Q-Broad Channel", "Q-Lindenwood", "BX-Wakefield",
"BK-East New York"), boro_short = c("M", "SI", "SI", "BX", "M",
"BK", "Q", "Q", "BX", "BK"), long_name = c("Upper West Side",
"Rosebank", "Tottenville", "park-cemetery-etc-Bronx", "Stuyvesant Town-Cooper Village",
"Kensington", "Broad Channel", "Lindenwood", "Wakefield", "East New York"
)), row.names = c(NA, -10L), class = "data.frame")
Final update
The curse of the misplaced closing bracket! Thanks to everyone for your help... the correct solution was df %>% mutate(across(c(Attempts, Canvasses, Completes), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .)))
If you use the newly introduced function across (which is the correct way to approach this task), you have to specify the function you want to apply inside across itself. In this case ifelse(...) has to be written as a purrr-style lambda (so starting with ~). Check out the across documentation and look for the arguments .cols and .fns.
library(dplyr)
library(stringr) # for str_detect()
df %>%
mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
Output
# type total group_a group_b
# 1 Park NA NA NA
# 2 Neighborhood 56 26 30
# 3 Airport 75 45 30
# 4 Park NA NA NA
# 5 Neighborhood 21 3 18
# 6 Neighborhood 56 46 10
Here is a data.table solution.
require(data.table)
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
setDT(df)
df[type == "Park", c("total", "group_a", "group_b") := NA]
Update: that didn't take long to figure out! Just needed to place the columns in a vector:
# concise AND working!
df %>% mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
I had tried this initially but placed the columns in quotes... don't do that :)

What are points to consider when choosing between seq_along() and 1:length()

Recently, I came across a code snippet from another R user on Stack Overflow who printed a list of data.frames using kableExtra with a for loop starting with for (i in 1:length(listofdfs)), followed by the corresponding function. Until now, I have always used for (i in seq_along(listofdfs)) to get the job done. A comparison of both methods on some random data.frames did not show any differences (see reprex below).
I was wondering whether someone could elaborate on the points to consider when applying either method in practice, independent of printing dfs, etc. Unfortunately, the R documentation does not seem to give any information on this specific topic.
Here is a small reprex resulting in the same output:
library (dplyr)
library (kableExtra)
# Some random data
city <- rep(c(1:3), each = 4) %>% factor ()
gender <- rep(c("m", "f", "m", "f", "m", "f", "m", "f", "m", "f", "m", "f"))
age <- c(32, 54, 67, 35, 19, 84, 34, 46, 67, 41, 20, 75)
working_yrs <- c(16, 27, 39, 16, 2, 50, 16, 23, 48, 21, 0, 57)
# Generating list of dfs by group split
listcity <- data.frame(city, gender, age, working_yrs) %>% group_split (city, keep=TRUE)
# for loop using seq_along
for (i in seq_along(listcity)) {
print(
kable(listcity[[i]], caption = "Using seq_along") %>%
kable_styling("striped", bootstrap_options = "hover", full_width = FALSE)
)}
# for loop using 1:length
for (i in 1:length(listcity)) {
print(
kable(listcity[[i]], caption = "Using 1:length") %>%
kable_styling("striped", bootstrap_options = "hover", full_width = FALSE)
)}
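One point where the two forms are known to differ is a zero-length input, which the reprex above does not cover. A minimal sketch, using an empty list as a stand-in for a listofdfs that happens to be empty:

```r
empty <- list()

idx_colon <- 1:length(empty)  # 1:0 counts down, giving c(1, 0)
idx_seq <- seq_along(empty)   # integer(0), so a loop body never runs

# count how many times each loop body executes
n_colon <- 0
for (i in 1:length(empty)) n_colon <- n_colon + 1

n_seq <- 0
for (i in seq_along(empty)) n_seq <- n_seq + 1

c(n_colon = n_colon, n_seq = n_seq)
```

With 1:length() the loop runs twice on an empty list (for i = 1 and i = 0), and indexing with empty[[1]] inside it would fail, whereas the seq_along() loop is skipped entirely. For non-empty input both forms produce the same indices.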
