Recoding several columns at once - r

I'm recoding numeric values to letters with the following line of code (which works):
df_mean$COMMUNITY_mean <- cut(df_mean$COMMUNITY_mean, breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
To apply it to multiple columns, I tried:
names <- colnames(df_mean) #extract columns names to a list
names <- names[-c(1:10)]; #remove 10 first columns not interested in
for(i in 1:length(names)) {
df_mean <- cut(names[[i]], breaks=c(0, 10, 25, 50, 75, 90, Inf), labels=c("a", "b", "c", "d", "e", "f"))
}
But it fails with:
Error in cut.default(names[[i]], breaks = c(0, 10, 25, 50, 75, 90, Inf), :
'x' must be numeric
Any suggestions?

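Replace the first line, names <- colnames(df_mean), with the following to extract only the numeric columns from your data:
names <- colnames(df_mean)[sapply(df_mean, is.numeric)]
Note that the loop must also pass the column itself to cut(), not its name. names[[i]] is a character string, which is why cut() complains that 'x' must be numeric. Use df_mean[[names[i]]] and assign the result back to that column rather than to df_mean as a whole.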
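The loop from the question can be repaired along these lines. This is only a sketch: the small df_mean below is a made-up stand-in (one leading column to skip instead of ten), and the helper names cols, brks and labs are mine, not from the question.

```r
# Hypothetical stand-in for df_mean; the real data has 10 leading columns to skip
df_mean <- data.frame(
  id             = 1:4,
  COMMUNITY_mean = c(5, 20, 60, 95),
  HEALTH_mean    = c(12, 80, 0.5, 30)
)

cols <- names(df_mean)[-1]            # in the question: names(df_mean)[-(1:10)]
brks <- c(0, 10, 25, 50, 75, 90, Inf)
labs <- c("a", "b", "c", "d", "e", "f")

# Pass the column (df_mean[[nm]]), not its name, and assign back to that column
for (nm in cols) {
  df_mean[[nm]] <- cut(df_mean[[nm]], breaks = brks, labels = labs)
}

df_mean
```

The same thing without a loop would be df_mean[cols] <- lapply(df_mean[cols], cut, breaks = brks, labels = labs). Keep in mind that cut() uses right-closed intervals by default, so a value of exactly 0 becomes NA unless include.lowest = TRUE is set.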

Related

Transferring name of column to a function in R

I'm trying to write a function that returns specific details about outliers (only sex, age, education, and the outlying value). I need to do this for many variables, so I would like to pass the name of the column to the function. Is there a way to do it?
For example, this code should return: f, 27, 12, 110.
my_data= data.frame( sex= c("f", "m", "f", "f", "m"),
age= c(22, 30, 24, 27, 30),
eduyears= c(12,16, 15, 12, 17),
weight= c(53, 70, 60, 110, 75),
height= c(160, 183, 157, 168, 180))
find_outliers= function (my_data, colname) {
out_values= boxplot.stats(my_data$colname)$out
out_ind= which(my_data$colname %in% out_values) #find outliers indices
outliers= my_data[out_ind ,c("sex","age","eduyears", colname)]
return (outliers)
}
find_outliers(weight)
The function has two arguments, so you need to pass both in the call; you are passing only one, weight. Inside the function, my_data$colname looks for a column literally named colname, so use my_data[[colname]] instead. And since weight is passed as an unquoted name, the function must first convert it to a character string before it can use it to access the column.
For background, see the well-known question on how to dynamically select data frame columns using $ and a vector of column names.
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
age = c(22, 30, 24, 27, 30),
eduyears = c(12,16, 15, 12, 17),
weight = c(53, 70, 60, 110, 75),
height = c(160, 183, 157, 168, 180))
find_outliers <- function(my_data, colname) {
  # get the colname as a character string
  colname <- as.character(substitute(colname))
  out_values <- boxplot.stats(my_data[[colname]])$out
  out_ind <- which(my_data[[colname]] %in% out_values) # find outlier indices
  outliers <- my_data[out_ind, c("sex", "age", "eduyears", colname)]
  outliers
}
find_outliers(my_data, weight)
#>   sex age eduyears weight
#> 4   f  27       12    110
my_data |> find_outliers(weight)
#>   sex age eduyears weight
#> 4   f  27       12    110
Created on 2022-11-05 with reprex v2.0.2
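If the column names are already stored as strings (for example when looping over many variables, as the question mentions), a variant that takes the name as a character string may be simpler. The following is a sketch under that assumption; find_outliers_chr is a hypothetical name, not part of the answer above.

```r
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
                      age = c(22, 30, 24, 27, 30),
                      eduyears = c(12, 16, 15, 12, 17),
                      weight = c(53, 70, 60, 110, 75),
                      height = c(160, 183, 157, 168, 180))

# Hypothetical variant: colname is a quoted string such as "weight"
find_outliers_chr <- function(my_data, colname) {
  stopifnot(is.character(colname), length(colname) == 1,
            colname %in% names(my_data))
  out_values <- boxplot.stats(my_data[[colname]])$out
  my_data[my_data[[colname]] %in% out_values,
          c("sex", "age", "eduyears", colname)]
}

res <- find_outliers_chr(my_data, "weight")
res

# Apply it to several columns at once; returns one data frame per column
all_res <- lapply(c("weight", "height"), find_outliers_chr, my_data = my_data)
```

With this version no substitute() is needed, at the cost of having to quote the column name in every call.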

proportion within each factor using dplyr [duplicate]

I want to get the proportion of each value within each factor level using dplyr. The desired result appears in desired$prop
Thanks in advance :))
data <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa","uk",
"spain","usa","uk","spain","usa","uk","spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)
desired <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa",
"uk","spain","usa","uk","spain","usa","uk",
"spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25),
prop = c(0.285714286,0.181818182,0.142857143,0.357142857,
0.272727273,0.5,0.357142857,0.545454545,
0.357142857)
)
@MrFlick is right. And also faster than I am.
library(dplyr)
df <- data %>%
group_by(country) %>%
mutate(prop = value/sum(value))
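For comparison, the same per-country proportion can be sketched in base R with ave(), which applies a function within groups and returns a vector aligned with the original rows. This assumes the data frame from the question; it is an alternative to the dplyr answer, not a replacement for it.

```r
data <- data.frame(
  team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  country = c("usa", "uk", "spain", "usa", "uk", "spain", "usa", "uk", "spain"),
  value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)

# Divide each value by the total of its country group
data$prop <- ave(data$value, data$country, FUN = function(v) v / sum(v))
data
```

Within each country the proportions sum to 1, which is a quick way to check that the grouping variable is the intended one.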

Assigning a range of values to a descriptive variable in R

Apologies in advance - I am relatively new to R/RStudio and am trying to figure out how to assign a range of values to a letter grade. In the project I am working on, I am trying to predict a hidden value, and one portion of it derives from three semi-revealed values represented by letter grades. For example, I may know that the three traits revealed are an A, B+, and B, but not the exact numbers. However, from the previous data I have pulled, I know the following ranges are correct for each letter grade:
A+: 90 or greater
A: 86-89
A-: 82-85
B+: 82-77
B: 75-78
B-: 72-74
C+: 69-71
C: 68-66
C-: 63-65
D: 60-62
F: 0-59
Is there a way for me to link these associated values to the letter grades to use later on in a multiple regression model?
Appreciate it.
I think you just want to organize all these values in a dataframe. Like this:
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 78, 74, 71, 68, 65, 62, 59)
mean_value <- (max_value+min_value)/2
df <- data.frame(grades, min_value, max_value, mean_value)
df
Edit: I'm no longer sure I understood your goal correctly. Here are two options to convert numeric grades to letter grades.
First, use the data.table package and perform a "rolling join". You can learn about rolling joins in this blog post:
https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
library(data.table)
grades =
"let, num
A+, 90
A, 86
A-, 82
B+, 80
B, 75
B-, 72
C+, 69
C, 66
C-, 63
D, 60
F, 0"
grades = read.csv(text=grades)
students =
"name, result
john, 60
mary, 86
anish, 79"
students = read.csv(text=students)
setDT(students)
setDT(grades)
grades[students, roll=TRUE, on=c("num"="result")]
#>    let num  name
#> 1:   D  60  john
#> 2:   A  86  mary
#> 3:   B  79 anish
Alternatively, you could write a conversion function with a for loop. That would look something like this:
pct2let = function(grades, slack_fail = 2, slack_pass = 1){
  # grading scale: lower bound (lb), upper bound (ub) and letter for each band
  bareme = data.frame(
    lb = c(90, 85, 80, 77, 73, 70, 65, 60, 57, 54, 50, 35, 0.01, 0),
    ub = c(100, 89.99, 84.99, 79.99, 76.99, 72.99, 69.99, 64.99, 59.99,
           56.99, 53.99, 49.99, 34.99, 0.01),
    let = c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D",
            "E", "F", "F*"),
    stringsAsFactors = FALSE
  )
  # add some slack: failing marks get slack_fail points, passing marks slack_pass
  grades = ifelse(grades < 50, grades + slack_fail, grades + slack_pass)
  out = grades
  # go from the lowest band to the highest; higher bands overwrite lower ones
  for(i in nrow(bareme):1){
    out[grades >= bareme$lb[i]] = bareme$let[i]
  }
  return(out)
}
pct2let(c(45, 65, 78))
#> [1] "E" "C+" "B+"
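A third option that needs neither a package nor a loop is base R's findInterval(), which returns, for each score, the index of the band whose lower bound it reaches. The sketch below assumes the lower bounds from the min_value vector at the start of this answer (so B+ starts at 79, unlike the rolling-join table); num2let, lower and band_letters are hypothetical names.

```r
# Lower bounds in increasing order, with the matching letter for each band
lower        <- c(0, 60, 63, 66, 69, 72, 75, 79, 82, 86, 90)
band_letters <- c("F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+")

num2let <- function(score) {
  idx <- findInterval(score, lower)  # 0 when score is below the lowest bound
  idx[idx == 0] <- NA                # keep positions aligned for invalid scores
  band_letters[idx]
}

num2let(c(60, 86, 79, 78))
#> [1] "D"  "A"  "B+" "B"
```

For the regression mentioned in the question, the reverse lookup (letter to a representative number such as mean_value) can be done with match() against the grades column of the data frame above.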

What are points to consider when choosing between seq_along() and 1:length()

Recently, I came across a code snippet from another R user on Stack Overflow who printed a list of data.frames with kableExtra, starting with the loop for (i in 1:length(listofdfs)) followed by the corresponding function. Until now, I have always used for (i in seq_along(listofdfs)) for this. Comparing both methods on some random data.frames did not show any difference (see reprex below).
I was wondering whether someone could elaborate on the points to consider when choosing between the two in practice, independent of printing data frames. Unfortunately, the R documentation does not seem to say anything on this specific topic.
Here is a small reprex resulting in the same output:
library (dplyr)
library (kableExtra)
# Some random data
city <- rep(c(1:3), each = 4) %>% factor ()
gender <- rep(c("m", "f", "m", "f", "m", "f", "m", "f", "m", "f", "m", "f"))
age <- c(32, 54, 67, 35, 19, 84, 34, 46, 67, 41, 20, 75)
working_yrs <- c(16, 27, 39, 16, 2, 50, 16, 23, 48, 21, 0, 57)
# Generating list of dfs by group split
listcity <- data.frame(city, gender, age, working_yrs) %>% group_split (city, keep=TRUE)
# for loop using seq_along
for (i in seq_along(listcity)) {
  print(
    kable(listcity[[i]], caption = "Using seq_along") %>%
      kable_styling("striped", bootstrap_options = "hover", full_width = FALSE)
  )
}
# for loop using 1:length
for (i in 1:length(listcity)) {
  print(
    kable(listcity[[i]], caption = "Using 1:length") %>%
      kable_styling("striped", bootstrap_options = "hover", full_width = FALSE)
  )
}
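One point where the two forms do differ is a zero-length input, which the reprex above never hits because listcity always has three elements. A minimal base R sketch (the variable names are mine) shows the difference:

```r
empty <- list()

1:length(empty)   # 1:0 counts down, giving the two values 1 and 0
seq_along(empty)  # integer(0), so a loop over it never runs

n_colon <- 0
for (i in 1:length(empty)) n_colon <- n_colon + 1   # body runs twice

n_seq <- 0
for (i in seq_along(empty)) n_seq <- n_seq + 1      # body never runs

c(n_colon = n_colon, n_seq = n_seq)
```

With 1:length(), the loop body would try to access empty[[1]] and fail with a subscript error, so seq_along() is the safer habit whenever the list could be empty (for example after filtering). For non-empty inputs both produce the same indices.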

R -ggplot two boxplots for separate columns

I am trying to draw two boxplots on one graph, one for each of two separate variables.
I have a dataset with multiple variables, but I want to compare only two: income_husband and income_wife.
I have done it using boxplot(), but how can I do it using ggplot?
It would help if you had some data to work with, but I have put some sample data together. Group "c" is filtered out. Is this the sort of thing you are after?
library(tidyverse)
group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", 'c', "c")
income = c(100, 120, 110, 23, 34, 120, 45, 156, 65, 52, 65, 98)
data <- tibble(group, income)
data
data2 <- data %>%
filter(group == "a" | group == "b" )
b <- ggplot(data2, aes(x = group, y = income))
b + geom_boxplot()
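If the two incomes really sit in separate columns, as in the question, one common approach is to reshape them into long format first, so that the column name becomes the grouping variable on the x axis. The sketch below assumes columns named income_husband and income_wife as in the question; the values and the extra id column are made up for illustration.

```r
library(tidyr)
library(ggplot2)

# Made-up sample data with one column per spouse
wide <- data.frame(
  id             = 1:5,
  income_husband = c(100, 120, 110, 23, 34),
  income_wife    = c(120, 45, 156, 65, 52)
)

# One row per (id, spouse) pair: the column names go to "spouse",
# the values go to "income"
long <- pivot_longer(wide,
                     cols = c(income_husband, income_wife),
                     names_to = "spouse",
                     values_to = "income")

p <- ggplot(long, aes(x = spouse, y = income)) +
  geom_boxplot()
p
```

Any other columns in the data set (here id) are simply carried along and do not affect the plot.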
