Specify levels when `mutate`-ing dataframe variables to factors - r

Let's say I have the following tibble dataframe called data:
library(tibble)
data <- tribble(
~"ID", ~"some factor", ~"some other factor",
1L, "low", "high",
2L, "very high", "low",
3L, "very low", "low",
4L, "high", "very high",
5L, "very low", "very low"
)
I use the fct() function in forcats to convert my two factor variables accordingly:
library(dplyr)
library(forcats)
data <- data %>%
mutate(across(starts_with("some"), fct))
Which gives me:
# A tibble: 5 × 3
ID `some factor` `some other factor`
<int> <fct> <fct>
1 1 low high
2 2 very high low
3 3 very low low
4 4 high very high
5 5 very low very low
However, when I call fct this way it's unclear to me how to specify the levels of this ordinal variable. The order I would like is:
order <- c("very low", "low", "high", "very high")
How should I do this with dplyr's set of functions? The goal is to have ggplot2 visualizations that respect this ordering.

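When you use across(), you can pass extra arguments along to the called function through across()'s ... argument. Note that this use of ... is deprecated as of dplyr 1.1.0, so on current versions the anonymous-function form shown further down is the safer choice.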
data <- data %>%
mutate(across(starts_with("some"), fct, levels = order))
This is equivalent to
data <- data %>%
mutate(across(starts_with("some"), function(x) fct(x, levels = order)))
(This is a common pattern in R: many functions that apply another function take a ... argument whose contents are passed along to the applied function; see also lapply, sapply, purrr::map, etc.)
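As a minimal sketch of the anonymous-function form, the example below rebuilds the tibble from the question and checks the resulting levels. The name lvl_order is an assumption of this sketch, used instead of order so that base R's order() function is not masked; fct() requires forcats 1.0.0 or later.

```r
library(tibble)
library(dplyr)
library(forcats)

data <- tribble(
  ~ID, ~`some factor`, ~`some other factor`,
  1L, "low",       "high",
  2L, "very high", "low",
  3L, "very low",  "low",
  4L, "high",      "very high",
  5L, "very low",  "very low"
)

# Hypothetical name; avoids masking base::order()
lvl_order <- c("very low", "low", "high", "very high")

# Apply fct() to every column whose name starts with "some",
# spelling out the levels argument inside an anonymous function
data <- data %>%
  mutate(across(starts_with("some"), function(x) fct(x, levels = lvl_order)))

levels(data$`some factor`)
levels(data$`some other factor`)
```

Both columns end up with the same level order, which is what ggplot2 uses when it lays out a discrete axis or legend.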

order <- c("very low", "low", "high", "very high")
data <- data %>%
mutate(across(starts_with("some"), fct, order))
This should do the trick: levels is the second argument of fct(), so order is matched to it by position.

Related

Trying to turn continuous variable into categorical variable with unequal sizes [duplicate]

I am trying to turn Failures_Avg from continuous into a factor of 3 levels.
Use <- read_csv("FinalData.csv")%>% filter(Week %between% c("1", "7")) %>% dplyr::select(-c(LactationNumber,DaysPerWeek, TotalWeeks))%>%
mutate_at(vars(Farm,Parity,CowNr,Week,Failures_Avg), as.factor)%>%
mutate_at(vars(Milk_AvgDay,Milk_AvgVisit,IntervalCV), as.numeric)%>%
group_by(Farm,CowNr)
Here I read in Failures_Avg as a categorical variable, but since it is continuous there are about 65 levels. I am trying to change it into three levels:
Low: equal to 0
Med: greater than 0, up to and including 1
High: greater than 1
Thank you in advance
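We can use cut. It needs a numeric input, so Failures_Avg should be left out of the mutate_at(..., as.factor) call. With the default right = TRUE the intervals are (-Inf, 0], (0, 1] and (1, Inf), which matches the requested bins.
library(dplyr)
Use <- Use %>%
mutate(Failures_Avg = cut(Failures_Avg, breaks = c(-Inf, 0, 1, Inf), labels = c("Low", "Med", "High")))
or case_when
Use <- Use %>%
mutate(Failures_Avg = case_when(Failures_Avg == 0 ~ "Low",
(Failures_Avg > 0 & Failures_Avg <= 1) ~ "Med",
Failures_Avg > 1 ~ "High"))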
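To check that the two approaches agree on the boundary values, here is a small sketch on a made-up vector (the values in failures are hypothetical and only stand in for the real column). It assumes the bins from the question: 0 is Low, anything above 0 up to and including 1 is Med, and anything above 1 is High.

```r
library(dplyr)

# Hypothetical values standing in for the Failures_Avg column
failures <- c(0, 0.5, 1, 1.5, NA)

# cut(): right-closed intervals (-Inf, 0], (0, 1], (1, Inf)
by_cut <- cut(failures,
              breaks = c(-Inf, 0, 1, Inf),
              labels = c("Low", "Med", "High"))

# case_when(): the same bins written as explicit conditions;
# values matching no condition (including NA) become NA
by_case <- case_when(
  failures == 0 ~ "Low",
  failures > 0 & failures <= 1 ~ "Med",
  failures > 1 ~ "High"
)

as.character(by_cut)
by_case
```

One difference worth keeping in mind: cut() returns a factor with ordered labels, while case_when() returns a character vector here. The two would also disagree on negative inputs, which cut() puts in Low and this case_when() leaves as NA.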

Changing value conditionally only for numerical variables in dataframe

Imagine I have a dataframe. This dataframe consists of numerical and non-numerical variables.
For all the numerical variables I would like to operate the following:
IF the value is bigger than the mean of the column it is in then change the value to "high".
ELSE change it to "low".
I have come very close to the solution with the following line of code:
df <- mutate_if(df, is.numeric, funs(ifelse(. > mean(.), "high", "low")))
However, I am aware that the mean(.) part is incorrect. So my question is:
How can I correct this part so I get the mean of the corresponding variable where . is in?
Also, I am assuming the rest is correct. If this is not the case I would appreciate someone telling me so I can try to correct it!
Here is an illustration of what I am trying to achieve:
duration amount sex
6 2 F
5 2 M
3 9 M
2 3 M
should become:
duration amount sex
high low F
high low M
low high M
low low M
EDIT:
The accepted answer made me realize my code was correct in the end!
In newer versions of dplyr (>= 1.0), we use mutate with across, because the scoped verbs with the _if, _at and _all suffixes have been superseded
library(dplyr)
df %>%
mutate(across(where(is.numeric),
~ case_when(. > mean(., na.rm = TRUE) ~ "high", TRUE ~ "low")))
-output
# duration amount sex
#1 high low F
#2 high low M
#3 low high M
#4 low low M
Or with ifelse
df %>%
mutate(across(where(is.numeric),
~ ifelse(. > mean(., na.rm = TRUE), "high", "low")))
Or, on older versions of dplyr, using the scoped verb mutate_if
df %>%
mutate_if(is.numeric, ~ ifelse(. > mean(.), "high", "low"))
data
df <- structure(list(duration = c(6L, 5L, 3L, 2L), amount = c(2L, 2L,
9L, 3L), sex = c("F", "M", "M", "M")), class = "data.frame",
row.names = c(NA,
-4L))
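To make it visible that the mean is computed separately for each column, here is a base R sketch of the same transformation on the example data. It is only an illustration of what the across() call does per column, not a replacement for it; num_cols is a helper name introduced for this sketch.

```r
df <- data.frame(
  duration = c(6L, 5L, 3L, 2L),
  amount   = c(2L, 2L, 9L, 3L),
  sex      = c("F", "M", "M", "M")
)

# Names of the numeric columns (hypothetical helper variable)
num_cols <- names(df)[sapply(df, is.numeric)]

# Each column x is compared against its own mean
df[num_cols] <- lapply(df[num_cols], function(x) {
  ifelse(x > mean(x, na.rm = TRUE), "high", "low")
})

df
```

Both duration and amount have a mean of 4 in this data, so the values 6, 5 and 9 become "high" and everything else becomes "low", while sex is left untouched.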

importing csv into R with empty factors and missing factor levels

I have a data set with Scores and Categories in a csv file
VAR1_SCORE  VAR1_CAT  VAR2_SCORE  VAR2_CAT  VAR3_SCORE  VAR3_CAT
80          MID       60          LOW
80          MID       100         HIGH
90          HIGH      90          HIGH
I am reading a CSV file in the above format.
Please note: VAR1_CAT has no LOW level in the data.
While importing, I want to achieve the following:
Define the same factor levels for all category columns (those whose names contain "_CAT").
Empty columns such as VAR3_SCORE should be read in as numeric, not logical.
Empty or incomplete category columns (VAR3_CAT / VAR1_CAT) should still have the same factor levels (HIGH - MID - LOW).
Read the data with read.csv, for instance, then use some tidyverse afterwards:
library(tidyverse)
df <- df %>%
mutate_at(vars(ends_with("CAT")), ~factor(., levels = c("LOW", "MID", "HIGH")))
Show the levels:
select(df, ends_with("CAT")) %>%
map(levels)
$VAR1_CAT
[1] "LOW" "MID" "HIGH"
$VAR2_CAT
[1] "LOW" "MID" "HIGH"
$VAR3_CAT
[1] "LOW" "MID" "HIGH"
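We can also use mutate with across (passing an anonymous function, since handing extra arguments to across() through ... is deprecated as of dplyr 1.1.0)
library(dplyr)
df <- df %>%
mutate(across(ends_with("CAT"), function(x) factor(x, levels = c("LOW", "MID", "HIGH"))))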
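The answers above set the factor levels but do not cover the request that an empty score column be read as numeric. One possible sketch, in base R, is to declare the column types at import time with colClasses. It assumes a comma-separated file with the six columns from the question; the inline csv_text string is a stand-in for the real file, which would be read with read.csv("yourfile.csv", ...) instead.

```r
# Inline stand-in for the CSV file (assumed to be comma-separated)
csv_text <- "VAR1_SCORE,VAR1_CAT,VAR2_SCORE,VAR2_CAT,VAR3_SCORE,VAR3_CAT
80,MID,60,LOW,,
80,MID,100,HIGH,,
90,HIGH,90,HIGH,,"

# Declare types up front so the empty VAR3_SCORE is numeric, not logical
df <- read.csv(
  text = csv_text,
  colClasses = rep(c("numeric", "character"), times = 3),
  na.strings = c("", "NA")
)

# Give every *_CAT column the same full set of levels
cat_cols <- grep("_CAT$", names(df), value = TRUE)
df[cat_cols] <- lapply(df[cat_cols], factor, levels = c("LOW", "MID", "HIGH"))

str(df)
```

With this, VAR3_SCORE is a numeric column of NA values, and VAR3_CAT is a factor of NA values that still carries all three levels.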

Is there a way to add new columns to R based on conditions to

Currently using R in Azure. I'm trying to create a new column in my dataframe whose values depend on an existing column ("Sum of Pillar").
->WithSumIDAPillars <- maml.mapInputPort(1)
->WithSumIDAPillars["newcolumn"] <- NA
->WithSumIDAPillars$newcolumn <- if (WithSumIDAPillars$Sum of Pillar <5 ="Low";WithSumIDAPillars$Sum of Pillar <=6<=10 ="Medium";WithSumIDAPillars$Sum of Pillar <=11<=16 ="High"
I need to create a new column that meets the following requirements:
If the "Sum of Pillar" value is 0-5 it should be Low, 6-11 Medium, and 11-16 High.
Have you used the dplyr package? Would something like this work?
library("dplyr")
WithSumIDAPillars$newcolumn <-
case_when(
WithSumIDAPillars$`Sum of Pillar` <= 5 ~ "Low",
WithSumIDAPillars$`Sum of Pillar` <= 11 ~ "Medium",
WithSumIDAPillars$`Sum of Pillar` <= 16 ~ "High",
TRUE ~ NA_character_
)
The case_when() function goes through the cases sequentially until one of the expressions on the left side of a ~ evaluates to TRUE, so the final TRUE ~ NA_character_ line acts as the default value.
Depending on your application, it may make things easier to name your column sum_of_pillar, using underscores. That would make it easier to use the pipe (%>%) and the mutate() function to write things a little more concisely:
WithSumIDAPillars <-
WithSumIDAPillars %>%
mutate(
newcolumn = case_when(
sum_of_pillar <= 5 ~ "Low",
sum_of_pillar <= 11 ~ "Medium",
sum_of_pillar <= 16 ~ "High",
TRUE ~ NA_character_
)
)
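Because case_when() stops at the first matching condition, the order of the conditions matters. The sketch below illustrates this on a made-up vector (pillar_sum is a hypothetical stand-in for the real column): the same three conditions give different results when listed from the widest range down.

```r
library(dplyr)

# Hypothetical values standing in for the "Sum of Pillar" column
pillar_sum <- c(3, 6, 11, 16, 20, NA)

# Narrowest condition first: each value lands in the intended bin
in_order <- case_when(
  pillar_sum <= 5  ~ "Low",
  pillar_sum <= 11 ~ "Medium",
  pillar_sum <= 16 ~ "High",
  TRUE ~ NA_character_
)

# Widest condition first: everything up to 16 matches "High" immediately
reversed <- case_when(
  pillar_sum <= 16 ~ "High",
  pillar_sum <= 11 ~ "Medium",
  pillar_sum <= 5  ~ "Low",
  TRUE ~ NA_character_
)

in_order
reversed
```

In the second call the "Medium" and "Low" branches are never reached, which is why the thresholds need to be listed in increasing order when they are written as cumulative <= checks.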
To learn more about dplyr, you can visit the website: https://dplyr.tidyverse.org/ or the (free) R for Data Science Book: https://r4ds.had.co.nz/
Hope this helps!
An alternative, perhaps less elegant, method to case_when is nested if_else statements. Maybe the one advantage is that you don't have to pay as much attention to the order of the statements as you do with case_when.
library(tidyverse)
WithSumIDAPillars %>%
# each comparison needs the column name repeated on both sides of &
mutate(new_col = if_else(`Sum of Pillar` >= 0 & `Sum of Pillar` <= 5, "Low",
if_else(`Sum of Pillar` >= 6 & `Sum of Pillar` <= 11, "Medium",
if_else(`Sum of Pillar` >= 12 & `Sum of Pillar` <= 16, "High",
NA_character_))))
NB - there's an overlap between your upper Medium and lower High thresholds so I upped the lower boundary for High to 12.

R - Filling missing values (blanks) based upon values on the same row but different column

I'm using R and have the following sample of data frame in which all variables are factors:
first    second            third
social   birth control     high
         birth control     high
medical  Anorexia Nervosa  low
medical  Anorexia Nervosa  low
         Alcoholism        high
family   Alcoholism        high
Basically, I need a function to help me fill the blanks in the first column based upon the values in the second and third columns.
For instance, if I have in the second column "birth control" and in the third column "high" I need to fill the blank in the first column with "social". If it is "Alcoholism" and "high" in the second and third column respectively, I need to fill the blanks in the first column with "family".
Based on the data shown, it is not very clear whether 'first' can take more than one value for each combination of 'second' and 'third'. If there is only a single value per combination and you need to replace the '' with it, then you could try
library(data.table)
setDT(df1)[, replace(first, first=='', first[first!='']),
list(second, third)]
Or a more efficient method would be
setDT(df1)[, first:= first[first!=''] , list(second, third)]
# first second third
#1: social birth control high
#2: social birth control high
#3: medical Anorexia Nervosa low
#4: medical Anorexia Nervosa low
#5: family Alcoholism high
#6: family Alcoholism high
data
df1 <- structure(list(first = c("social", "", "medical", "medical",
"", "family"), second = c("birth control", "birth control",
"Anorexia Nervosa",
"Anorexia Nervosa", "Alcoholism", "Alcoholism"), third = c("high",
"high", "low", "low", "high", "high")), .Names = c("first", "second",
"third"), class = "data.frame", row.names = c(NA, -6L))
One way would be to create a lookup list of some sort (for example, either using a named vector, factor or something similar) and then replacing any "" values with the values from the lookup list.
Here's an example (though I think that your problem is not fully defined and perhaps overly simplified).
library(dplyr)
library(tidyr)
df1 %>%
unite(condition, second, third, remove = FALSE) %>%
mutate(condition = factor(condition,
c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
c("social", "medical", "family"))) %>%
mutate(condition = as.character(condition)) %>%
mutate(first = replace(first, first == "", condition[first == ""])) %>%
select(-condition)
# first second third
# 1 social birth control high
# 2 social birth control high
# 3 medical Anorexia Nervosa low
# 4 medical Anorexia Nervosa low
# 5 family Alcoholism high
# 6 family Alcoholism high
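The lookup idea can also be sketched with a plain named vector in base R. This assumes, as the answer does, that each combination of second and third maps to exactly one value of first; the lookup contents are typed in by hand here for illustration.

```r
df1 <- data.frame(
  first  = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third  = c("high", "high", "low", "low", "high", "high"),
  stringsAsFactors = FALSE
)

# Named vector: names are "second_third" keys, values are the fill-ins
lookup <- c(
  "birth control_high"   = "social",
  "Anorexia Nervosa_low" = "medical",
  "Alcoholism_high"      = "family"
)

# Build the key for every row, then fill only the blank entries
key   <- paste(df1$second, df1$third, sep = "_")
blank <- df1$first == ""
df1$first[blank] <- unname(lookup[key[blank]])

df1
```

If the original columns are factors, as in the question, they would need converting with as.character() first. A key that is missing from lookup would produce NA rather than an error.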
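A "data.table" approach would follow the same steps. as.data.table() makes a copy of the input first, and the := steps after that modify the copy by reference rather than copying again (setDT() would instead convert df1 itself by reference).
library(data.table)
as.data.table(df1)[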
, condition := sprintf("%s_%s", second, third)][
, condition := as.character(
factor(condition,
c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
c("social", "medical", "family")))][
first == "", first := condition][
, condition := NULL][]
Another approach with dplyr, using @akrun's very nice solution
library(dplyr)
df1 %>% group_by(second, third) %>%
mutate(first=replace(first, first=='', first[first!=''])) %>% ungroup()
