Recoding turns everything into the same value in R

I'm practicing R. I created a new column of continuous numbers called ROI and wanted to recode the numeric values into string values, like this:
df = mutate(diabetes_df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
df = mutate(diabetes_df, ROI = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI))
diabetes_df = mutate(diabetes_df, ROI = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI))
This works normally and displays these words wherever the condition is met. However, when I add the last ifelse statement:
df = mutate(diabetes_df, ROI = ifelse(ROI > 30, 'OVER', ROI))
It turns every value in the new column I made into the OVER value. I was wondering if anyone knew how to make it so that it would only say OVER for where the condition is met?

If ROI is a numeric column, the issue is that you are overwriting a numeric column with text values.
If ROI is not a numeric column, then inequality comparison on text strings works differently from how you have assumed.
Note that all your commands take the form: df = mutate(df, ROI = ifelse(ROI <condition>, 'label', ROI)). This means you are overwriting the original ROI values, and the replaced values will be used for subsequent comparisons.
Suppose df had only one row, with ROI = 10. Then:
# df:
# ROI = 10
df2 = mutate(df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
# compares 10 < 18.5
# replaces 10 with 'Under'
# df2:
# ROI = 'Under'
df3 = mutate(df2, ROI = ifelse(ROI > 30, 'OVER', ROI))
# compares 'Under' > 30
# After standardizing formats, compares 'Under' > '30' (conversion to string)
# replaces 'Under' with 'OVER'
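The coercion in the walk-through above can be checked directly in base R. The sketch below uses a made-up two-element vector (roi is a stand-in, not the asker's data) to show that ifelse() returns a character vector as soon as one branch is text, and that comparing a string with a number is done as a string comparison.

```r
# Toy values standing in for the ROI column (hypothetical data)
roi <- c(10, 40)

# Mixing a text label with numbers makes the whole result character
step1 <- ifelse(roi < 18.5, "Under", roi)
print(step1)
print(class(step1))

# The number on the right is converted to a string before comparing,
# so this compares "10" with "9" character by character
print("10" > 9)
```

Because the second element comes back as the string "40" rather than the number 40, every later numeric-looking comparison on that column is really a text comparison.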
Two possible solutions:
Write to a different column; this is good practice:
df %>%
  mutate(ROI_label = NA) %>%
  mutate(ROI_label = ifelse(ROI < 18.5, 'Under', ROI_label)) %>%
  mutate(ROI_label = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI_label)) %>%
  mutate(ROI_label = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI_label)) %>%
  mutate(ROI_label = ifelse(ROI > 30, 'OVER', ROI_label))
Use case_when; this is also good practice:
df %>%
  mutate(ROI = case_when(ROI < 18.5 ~ 'Under',
                         ROI >= 18.5 & ROI <= 25 ~ 'average',
                         ROI > 25 & BMI <= 30 ~ 'above average',
                         ROI > 30 ~ 'OVER'))
Even better, write to a different column and use case_when.
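A minimal sketch of that combined approach, assuming dplyr is installed. The data frame toy and the column name ROI_label are made up for illustration; only ROI is used in the conditions, so the numeric column is never overwritten.

```r
library(dplyr)

# Hypothetical stand-in for diabetes_df with one numeric column
toy <- data.frame(ROI = c(17, 22, 28, 35))

# All conditions are evaluated against the untouched numeric ROI,
# and the labels go into a separate column
toy <- toy %>%
  mutate(ROI_label = case_when(
    ROI < 18.5              ~ "Under",
    ROI >= 18.5 & ROI <= 25 ~ "average",
    ROI > 25 & ROI <= 30    ~ "above average",
    ROI > 30                ~ "OVER"
  ))

print(toy)
```

Rows that match none of the conditions (for example a missing ROI) are left as NA in ROI_label, which makes gaps easy to spot.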

We can replicate the problem with the mtcars data frame. In the following code, the third mutate() call sets wt to High in every row, because after the first mutate() the wt column is a character vector.
library(dplyr)
data(mtcars)
mtcars <- mutate(mtcars,wt = ifelse(wt < 2.6,"Low", wt))
# at this point, wt is character
str(mtcars$wt)
# chr [1:32] "2.62" "2.875" "Low" "3.215" "3.44" "3.46" "3.57" "3.19" "3.15" ...
By the third mutate(), every row satisfies the ifelse() condition, because the comparison is made on character strings: the number 3.61 is converted to the string "3.61", and the values Low and Medium both compare greater than it.
mtcars <- mutate(mtcars, wt = ifelse( 2.6 <= wt & wt <= 3.61,"Medium",wt))
mtcars <- mutate(mtcars, wt = ifelse( wt > 3.61,"High",wt))
...and the output:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 High 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 High 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 High 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 High 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 High 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 High 20.22 1 0 3 1
We can prevent this behavior by using case_when(), which makes all of the comparisons to the numeric version of wt in a single pass of the data.
# use case_when()
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) %>% head(.)
...and the output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 Medium 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 Medium 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 Low 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 Medium 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 Medium 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 Medium 20.22 1 0 3 1
From the comments to this answer, it wasn't clear to the OP how to save the changed column to the existing data frame. The following code snippet addresses that question.
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) -> mtcars
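For comparison, base R's cut() can bin a numeric column in one call without dplyr. This is an alternative technique, not what the answer above uses, and the column name wt_label is made up here. Note the boundary handling differs slightly: with right = TRUE each interval is closed on the right, so a weight of exactly 2.6 would land in "Low" rather than "Medium".

```r
data(mtcars)

# Breaks mirror the thresholds used above: (-Inf, 2.6], (2.6, 3.61], (3.61, Inf)
wt_label <- cut(mtcars$wt,
                breaks = c(-Inf, 2.6, 3.61, Inf),
                labels = c("Low", "Medium", "High"),
                right = TRUE)

# Stored in a new column, so the numeric wt stays available
mtcars$wt_label <- as.character(wt_label)

head(mtcars[, c("wt", "wt_label")])
```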

Related

Prevent change to dataframe format in R

I have a dataframe that must have a specific layout. Is there a way for me to make R reject any command I attempt that would change the number or names of the columns?
It is easy to check the format of the data table manually, but I have found no way to make R do it for me automatically every time I execute a piece of code.
regards
This doesn’t offer the level of foolproof safety I think you’re looking for (hard to know without more details), but you could define a function operator that yields modified functions that error if changes to columns are detected:
same_cols <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    stopifnot(identical(sort(names(.data)), sort(names(out))))
    out
  }
}
For example, you could create modified versions of dplyr functions:
library(dplyr)
my_mutate <- same_cols(mutate)
my_summarize <- same_cols(summarize)
which work as usual if columns are preserved:
mtcars %>%
my_mutate(mpg = mpg / 2) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 10.50 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 10.50 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 11.40 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 10.70 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 9.35 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 9.05 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
my_summarize(across(everything(), mean))
# mpg cyl disp hp drat wt qsec vs am
# 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
# gear carb
# 1 3.6875 2.8125
But throw errors if changes to columns are made:
mtcars %>%
my_mutate(mpg2 = mpg / 2)
# Error in my_mutate(., mpg2 = mpg/2) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
mtcars %>%
my_summarize(mpg = mean(mpg))
# Error in my_summarize(., mpg = mean(mpg)) :
# identical(sort(names(.data)), sort(names(out))) is not TRUE
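If a more informative error is wanted, the same idea can be extended to report which columns changed. The sketch below is a hypothetical variant (same_cols_verbose and my_transform are names invented here); it also treats a change in column order as an error, and it wraps base R's transform() so that it runs without any packages.

```r
# Hypothetical variant of same_cols(): also checks column order and
# reports which names were added or dropped
same_cols_verbose <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    before <- names(.data)
    after <- names(out)
    if (!identical(before, after)) {
      stop("columns changed; added: ",
           paste(setdiff(after, before), collapse = ", "),
           "; dropped: ",
           paste(setdiff(before, after), collapse = ", "))
    }
    out
  }
}

# Wrapping base R's transform() keeps the example free of packages
my_transform <- same_cols_verbose(transform)

# Columns preserved: returns the modified data frame
ok <- my_transform(mtcars, mpg = mpg / 2)

# A new column is added: the wrapper raises an error, captured here with try()
res <- try(my_transform(mtcars, mpg2 = mpg / 2), silent = TRUE)
print(inherits(res, "try-error"))
```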
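You mention that the column names and the number of columns need to stay the same. Be aware that with data.table, names are also updated by reference, so a variable that holds names(foo) changes along with the table. See the example below.
library(data.table)
foo <- data.table(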
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- names(foo)
colnames
# [1] "x" "y"
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
colnames
# [1] "a" "b" "z"
identical(colnames, names(foo))
# [1] TRUE
To check that the columns and their names are unaltered (and in the same order), take a copy of the names right away. After each code run, you can then compare the current names with the copied names.
foo <- data.table(
x = letters[1:5],
y = LETTERS[1:5]
)
colnames <- copy(names(foo))
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]
identical(colnames, names(foo))
# [1] FALSE
colnames
# [1] "x" "y"
names(foo)
# [1] "a" "b" "z"

Loop within Loop in R

I am trying to figure out how to run two different loops on the same code. I am trying to create a matrix where I am filling a column with the mean of a variable for each year.
Here's the code I am using to do it right now:
matplot2 = as.data.frame(matrix(NA, nrow=16, ncol=4))
matplot2[1,1] = mean(matplot[matplot$Year==2003, 'TotalTime'])
matplot2[2,1] = mean(matplot[matplot$Year==2004, 'TotalTime'])
matplot2[3,1] = mean(matplot[matplot$Year==2005, 'TotalTime'])
matplot2[4,1] = mean(matplot[matplot$Year==2006, 'TotalTime'])
matplot2[5,1] = mean(matplot[matplot$Year==2007, 'TotalTime'])
matplot2[6,1] = mean(matplot[matplot$Year==2008, 'TotalTime'])
matplot2[7,1] = mean(matplot[matplot$Year==2009, 'TotalTime'])
matplot2[8,1] = mean(matplot[matplot$Year==2010, 'TotalTime'])
matplot2[9,1] = mean(matplot[matplot$Year==2011, 'TotalTime'])
matplot2[10,1] = mean(matplot[matplot$Year==2012, 'TotalTime'])
matplot2[11,1] = mean(matplot[matplot$Year==2013, 'TotalTime'])
matplot2[12,1] = mean(matplot[matplot$Year==2014, 'TotalTime'])
matplot2[13,1] = mean(matplot[matplot$Year==2015, 'TotalTime'])
matplot2[14,1] = mean(matplot[matplot$Year==2016, 'TotalTime'])
matplot2[15,1] = mean(matplot[matplot$Year==2017, 'TotalTime'])
matplot2[16,1] = mean(matplot[matplot$Year==2018, 'TotalTime'])
If it were just the year changing, I would write the loop like this:
for(i in 2003:2018) {
matplot2[1,1] = mean(matplot[matplot$Year==i, 'TotalTime'])
}
But, I need the row number in the matrix I'm printing the results into to change as well. How can I write a loop where I am printing the results of all these means into one column of a matrix?
In other words, I need to be able to have it loop matplot2[j,1] in addition to the matplot$Year==i.
Any suggestions would be greatly appreciated!
Your literal calculations of the mean(TotalTime) can all be reduced to a single command (with no for loop required):
matplot2 <- aggregate(TotalTime ~ Year, data = matplot, FUN = mean)
That should return a two-column frame with the unique values of Year in the first column, and the respective means in the second column.
Demonstrated with data I have:
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
res <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)
res
# cyl disp
# 1 4 105.1364
# 2 6 183.3143
# 3 8 353.1000
This and more can be seen in summarize by group (of which this question is essentially a dupe, even if you didn't know to ask it that way).
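R is a vectorized language, but mean() collapses whatever it is given into a single number, so comparing Year against a whole vector of years at once would write one value into all 16 rows. Computing the mean once per year with sapply and assigning the results to the indexed rows works:
i <- 1:16
matplot2[i, 1] <- sapply(2002 + i, function(yr) mean(matplot[matplot$Year == yr, 'TotalTime']))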
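For the loop the question literally asks about, a single index can drive both the row number and the year. A minimal sketch, using a small made-up matplot data frame because the real data is not shown in the question:

```r
# Hypothetical stand-in for the asker's data: three years, two rows each
matplot <- data.frame(
  Year = rep(2003:2005, each = 2),
  TotalTime = c(10, 20, 30, 50, 5, 7)
)

years <- sort(unique(matplot$Year))
matplot2 <- as.data.frame(matrix(NA_real_, nrow = length(years), ncol = 4))

# j is the row to fill; years[j] is the year that belongs in that row
for (j in seq_along(years)) {
  matplot2[j, 1] <- mean(matplot[matplot$Year == years[j], "TotalTime"])
}

print(matplot2[, 1])
```

With the real data, years would be 2003:2018, which gives the 16 rows of the original code.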

R - Unite without NA values [duplicate]

This question already has answers here:
Combine two or more columns in a dataframe into a new column with a new name
(9 answers)
Closed 3 years ago.
I have got multiple columns. All columns do have NA values in some rows. Is it possible to unite these columns without having the NA values in the new column?
Without NA values:
library(tidyr)
unite(mtcars, 'mpg_am', c('mpg','am'))
Creating fake data:
mtcars$NA_1 = ifelse(mtcars$mpg>20, NA, mtcars$mpg)
mtcars$NA_2 = ifelse(mtcars$cyl>6, NA, mtcars$mpg)
unite(mtcars, 'Var1', c('NA_1','NA_2'))
This will create values like
Var1
NA_21
15.5_NA
NA_NA
15.5_21
...
desired output:
Var1
21
15.5
NA
15.5_21
...
We can use unite with na.rm
library(tidyverse)
mtcars %>%
rownames_to_column('rn') %>%
mutate_at(vars(starts_with("NA")), as.character) %>%
unite(Var1, NA_1, NA_2, na.rm = TRUE) %>%
mutate(Var1 = na_if(Var1, "")) %>%
column_to_rownames('rn')
Or another option is coalesce instead of unite
mtcars %>%
mutate(Var1 = str_c(coalesce(NA_1, NA_2), coalesce(NA_2, NA_1), sep="_"))
Or another option is
mtcars %>%
  mutate_at(vars(starts_with("NA")), list(~ replace_na(., ''))) %>%
  mutate(Var1 = str_remove(na_if(str_c(NA_1, NA_2, sep="_"), '_'), '^_|_$') ) %>%
  select(-NA_1, -NA_2)
unite has an na.rm parameter that removes NA values, but for that the columns need to be of character type.
library(dplyr)
library(tidyr)
mtcars %>%
mutate_at(vars(NA_1, NA_2), as.character) %>%
unite(Var1, NA_1, NA_2, na.rm = TRUE)
# mpg cyl disp hp drat wt qsec vs am gear carb Var1
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 18.1_18.1
#.....
However, if both values are NA, this returns an empty string instead of NA. If we strictly need NA, we can check for empty values and replace them:
mtcars %>%
  mutate_at(vars(NA_1, NA_2), as.character) %>%
  unite(Var1, NA_1, NA_2, na.rm = TRUE) %>%
  mutate(Var1 = replace(Var1, Var1 == "", NA_character_))
Without any packages we can use paste0 in base R
cols <- c('NA_1','NA_2')
mtcars["V1"] <- apply(mtcars[cols],1,function(x) paste0(na.omit(x), collapse = "-"))
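A base R sketch that also covers the all-NA case from the question's desired output. It uses a small made-up data frame (df is hypothetical) and "_" as the separator, and returns NA when every value in a row is missing.

```r
# Hypothetical four-row example covering each NA pattern
df <- data.frame(
  NA_1 = c(NA, 15.5, NA, 15.5),
  NA_2 = c(21, NA, NA, 21)
)

cols <- c("NA_1", "NA_2")

# Keep only the non-missing values of each row; if nothing is left,
# return NA instead of an empty string
df$Var1 <- apply(df[cols], 1, function(x) {
  kept <- x[!is.na(x)]
  if (length(kept) == 0) NA_character_ else paste(kept, collapse = "_")
})

print(df)
```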

Naming a new variable based on a quosure

I'm trying to write a custom function that will compute a new variable based on values from a predefined vector of variables (e.g., vector_heavy) and then name the new variable based on an argument provided to the function (e.g., custom_name).
This variable naming is where my quosure skills are failing me. Any help is greatly appreciated.
library(tidyverse)
vector_heavy <- quos(disp, wt, cyl)
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(!!cv_name = mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
d <- cv_compute(mtcars, cv_name = custom_name, cv_vector = vector_heavy)
My error message reads:
Error: unexpected '=' in:
" rowwise() %>%
mutate(!!cv_name ="
Removing the !! before cv_name within mutate() will result in a function that calculates a new variable literally named cv_name, and ignoring the custom_name I've included as an argument.
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(cv_name = mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
How can I get this function to utilize the custom_name I supply as an argument for cv_name?
You need to use the := helper within mutate. You'll also need quo_name to convert the input to a string.
The mutate line of your function will then look like
mutate(!!quo_name(cv_name) := mean(c(!!!cv_vector), na.rm = TRUE))
In its entirety:
cv_compute <- function(data, cv_name, cv_vector){
cv_name <- enquo(cv_name)
data %>%
rowwise() %>%
mutate(!!quo_name(cv_name) := mean(c(!!!cv_vector), na.rm = TRUE)) %>%
ungroup()
}
cv_compute(mtcars, cv_name = custom_name, cv_vector = vector_heavy)
mpg cyl disp hp drat wt qsec vs am gear carb custom_name
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 56.20667
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 56.29167
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 38.10667
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 89.07167
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 123.81333
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 78.15333
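With a reasonably recent dplyr and rlang, the name can also be injected with a glue-style string on the left of :=, which avoids the enquo() and quo_name() steps. The sketch below assumes rlang 0.4.3 or later (as far as I know, that is where this syntax arrived); cv_compute2 is a name made up here.

```r
library(dplyr)

vector_heavy <- quos(disp, wt, cyl)

cv_compute2 <- function(data, cv_name, cv_vector) {
  data %>%
    rowwise() %>%
    # "{{ cv_name }}" is filled in with the name the caller typed
    mutate("{{ cv_name }}" := mean(c(!!!cv_vector), na.rm = TRUE)) %>%
    ungroup()
}

d <- cv_compute2(mtcars, cv_name = custom_name, cv_vector = vector_heavy)
print(names(d))
```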

Re-assembling a dataframe after a split [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 6 years ago.
I have trouble applying a split to a data.frame and then assembling some aggregated results back into a different data.frame. I tried using the 'unsplit' function but I can't figure out how to use it properly to get the desired result. Let me demonstrate on the common 'mtcars' data: Let's say that my ultimate result is to get a data frame with two variables: cyl (cylinders) and mean_mpg (mean over mpg for group of cars sharing the same count of cylinders).
So the initial split goes like this:
spl <- split(mtcars, mtcars$cyl)
The result of which looks something like this:
$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
...
$`6`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
$`8`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
...
Now I want to do something along the lines of:
df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))
However, doing the above results in:
X4 X6 X8
1 26.66364 19.74286 15.1
While I'd want the df to be like this:
cyl mean_mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Thanks, J.
If you are only interested in reassembling a split then look at (2), (4) and (4a) but if the actual underlying question is really about the way to perform aggregations over groups then they all may be of interest:
1) aggregate Normally one uses aggregate as already mentioned in the comments. Simplifying @alistaire's code slightly:
aggregate(mpg ~ cyl, mtcars, mean)
2) split/lapply/do.call Also @rawr has given a split/lapply/do.call solution in the comments which we can also simplify slightly:
spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
3) do.call/by The last one could alternately be rewritten in terms of by:
do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
4) split/lapply/unsplit Another possibility is to use split and unsplit:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))
4a) or if row names are sufficient:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
The above do not use any packages but there are also many packages that can do aggregations including dplyr, data.table and sqldf:
5) dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup()
6) data.table
library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]
7) sqldf
library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")
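The attempt in the question can also be repaired with a small change: compute one mean per group with sapply, then build the two-column data frame explicitly from the group names and the means. A base R sketch (the variable name means is made up here):

```r
spl <- split(mtcars, mtcars$cyl)

# One mean per group; the names of the result are the cyl values as strings
means <- sapply(spl, function(x) mean(x$mpg))

# Turn the names into a proper cyl column and drop them from the means
df <- data.frame(
  cyl = as.numeric(names(means)),
  mean_mpg = unname(means)
)

print(df)
```

as.data.frame() on the list puts each group in its own column, which is why the original attempt produced X4, X6 and X8 instead of rows.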
