The labelled package provides this functionality to modify value labels for multiple variables in one go:
df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3)
val_labels(df)
But I'm wondering if there's a way to do this in tidyverse syntax:
Something like this:
library(tidyverse)
df%>%
mutate(across(V1:V2), ~val_labels(.x)<-c(YES = 1, MAYBE = 2, NO = 3)
We need to assign the labels and then return the column (.x). In addition, when there is more than one expression, wrap them inside {}:
library(dplyr)
library(labelled)
df <- df %>%
mutate(across(v1:v2, ~
{
val_labels(.x) <- c(YES = 1, MAYBE = 2, NO = 3)
.x
}))
-output
> val_labels(df)
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
I would suggest using haven's labelled class directly, alternatively check out the labelled package's functions made for the dplyr syntax, e.g. add_value_labels.
df <-
df |>
mutate(across(v1:v2,
~ haven::labelled(.,
labels = c(YES = 1,
MAYBE = 2,
NO = 3)
)
)
)
labelled::val_labels(df)
Output:
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
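For reference, the dplyr-style helpers mentioned above can also be used without across(). This is a minimal sketch, assuming the labelled package is installed; it uses set_value_labels(), a sibling of add_value_labels(), and the names `labs` and `df_labelled` are only illustrative.

```r
library(labelled)

df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
labs <- c(YES = 1, MAYBE = 2, NO = 3)  # illustrative name, not from the question

# Label v1 and v2 by name; v3 is left untouched
df_labelled <- set_value_labels(df, v1 = labs, v2 = labs)
val_labels(df_labelled)
```

Because the columns are named explicitly, this form suits a handful of variables; across() is probably more convenient for a long range of columns.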
A side note: unless you have a very specific reason for using the labelled package, I'd suggest keeping its usage to a minimum and coercing to factors, especially in the case of value labels. I've learned the hard way that the labelled package (and sjlabelled, for that matter) will often let you do things that seem smart at the outset but aren't in the long run.
A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. (...) This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing.
https://haven.tidyverse.org/reference/labelled.html
(My emphasis)
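The coercion to a factor suggested in the quote can be sketched as follows. This assumes the haven package is installed and uses its as_factor() function; the vector `x` is made up for illustration.

```r
library(haven)

# A small labelled vector using the labels from the question
x <- labelled(c(1, 3, 2), labels = c(YES = 1, MAYBE = 2, NO = 3))

# Coerce to a regular factor; the value labels become the factor levels
f <- as_factor(x)
as.character(f)
```

After this step the data behaves like any other factor, so the usual tidyverse and modelling functions need no special handling.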
Related
I'm looking for an easy way to change several values for the same person. Preferably with dplyr or any other package from tidyverse.
Here my example:
df <- data.frame(personid = 1:3, class = c("class1", "class3", "class3"), classlevel = c(1, 11, 3), education = c("BA", "Msc", "BA"))
df
My dataset contains an entry with several mistakes. Person #2 should be part of class 1, at classlevel 1, and his education is BA, not MSc. I use mutate with case_when a lot, but in this case I don't want to change one variable based on multiple conditions; I have one condition and want to change multiple values in other variables based on it.
Basically, I'm looking for shorter code that replaces this:
df$class[df$personid == 2] <- "class1"
df$classlevel[df$personid == 2] <- 1
df$education[df$personid == 2] <- "BA"
df
or this:
library(tidyverse)
df <- df |>
mutate(class = case_when(personid == 2 ~ "class1", TRUE ~ class)) |>
mutate(classlevel = case_when(personid == 2 ~ 1, TRUE ~ as.numeric(classlevel))) |>
mutate(education = case_when(personid == 2 ~ "BA", TRUE ~ education))
df
In my original data, there are several dozen cases like this, and I find it a bit tedious to use three lines of code for each person. Is there a shorter way?
Thanks for your input!
One way would be to create a data frame of the values to be updated and use rows_update(). Note that this assumes the rows are uniquely identified.
library(dplyr)
df_update <- tribble(
~personid, ~class, ~classlevel, ~education,
2, "class1", 1, "BA"
)
df %>%
rows_update(df_update, by = "personid")
personid class classlevel education
1 1 class1 1 BA
2 2 class1 1 BA
3 3 class3 3 BA
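Since the question mentions several dozen corrections, the same idea extends to one row per person in the update table. The sketch below assumes dplyr 1.0.0 or later; the correction for person 3 is invented purely for illustration and is not from the question.

```r
library(dplyr)

df <- data.frame(
  personid   = 1:3,
  class      = c("class1", "class3", "class3"),
  classlevel = c(1, 11, 3),
  education  = c("BA", "Msc", "BA")
)

# One row per person to correct; the values for person 3 are hypothetical
corrections <- tribble(
  ~personid, ~class,   ~classlevel, ~education,
  2L,        "class1", 1,           "BA",
  3L,        "class2", 2,           "MA"
)

fixed <- rows_update(df, corrections, by = "personid")
fixed
```

Every column present in the update table overwrites the matching row, so this form fits best when each correction supplies all of the listed columns.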
I think I need a little more information to answer your question, but I'll try anyway.
If you want to change the value of some columns based on a single condition across all the rows, I recommend doing this (I created new columns named col_name1 so you can see both the original and the output):
df <- df %>% mutate(class1 = case_when(class != "class1" ~ "class1", TRUE ~ class),
classlevel1 = case_when( classlevel != 1 ~ 1, TRUE ~ as.numeric(classlevel)),
education1 = case_when( education != "BA" ~ "BA", TRUE ~ education))
If that was your problem, then you are probably not familiar with the concept of vectorization. Briefly, a vectorized function runs over all the rows or elements of your vector without you needing to specify that. There are a lot of examples and tutorials on the web if you search for "vectorization in R" or something similar.
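As a small illustration of vectorization in base R, using the classlevel values from the question (the names `is_wrong` and `fixed` are only illustrative):

```r
classlevel <- c(1, 11, 3)

# The comparison is evaluated for every element at once, no loop needed
is_wrong <- classlevel != 1

# ifelse() is vectorized too: it picks a value element by element
fixed <- ifelse(is_wrong, 1, classlevel)
fixed
```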
Otherwise, if your condition changes for each single id (or row) in your data, then the problem is more complicated.
Let me know if that helps and, if it doesn't, consider providing more information in your question.
First off - newbie with R so bear with me. I'm trying to recode string values as numeric. My problem is I have two different string patterns present in my values: "M" and "B" for 'million' and 'billion', respectively.
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
I've successfully knocked off the dollar sign and now have:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
  fundsR = c("1.76M", "2B", "57M", "9.87B")
)
How can I recode these as numeric while retaining their respective monetary values? I've tried various if statements and for loops, with or without str_detect, pipe operators, case_when, mutate, etc., to isolate the values with "M" and the values with "B", convert them to numeric, and multiply to get the corresponding numeric value, all in a new column. This seemingly simple task turned out to be less simple than I imagined, which I attribute to being a novice. At this point I'd like to start from scratch and see if anyone has any fresh ideas. My RStudio is a MESS.
Something like this would be nice:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
  fundsR = c("1.76M", "2B", "57M", "9.87B"),
  fundsFinal = c(1760000, 2000000000, 57000000, 9870000000)
)
I'd really appreciate your input.
You could create a helper function f, and then apply it to the funds column:
library(dplyr)
library(stringr)
f <- function(x) {
curr = c("M"=1e6, "B" = 1e9)
val = str_remove(x,"\\$")
as.numeric(str_remove_all(val,"B|M"))*curr[str_extract(val, "B|M")]
}
df %>% mutate(fundsFinal = f(funds))
Output:
funds fundsFinal
1 $1.76M 1.76e+06
2 $2B 2.00e+09
3 $57M 5.70e+07
4 $9.87B 9.87e+09
Input:
df = structure(list(funds = c("$1.76M", "$2B", "$57M", "$9.87B")), class = "data.frame", row.names = c(NA,
-4L))
This works but I'm sure better solutions exist. Assuming funds is a character vector:
library(tidyverse)
options(scipen = 999)
df <- data.frame(funds = c('$1.76M', '$2B', '$57M', '$9.87B'))
df = df %>%
mutate( fundsFinal = ifelse(str_sub(funds,nchar(funds),-1) =='M',
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^6,
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^9))
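For comparison, the same conversion can be sketched in base R with a named multiplier vector. The function name `parse_funds` is hypothetical, and the sketch assumes every value ends in either "M" or "B", as in the question.

```r
# Hypothetical helper: strip "$" and the suffix, then scale by the suffix
parse_funds <- function(x) {
  multiplier <- c(M = 1e6, B = 1e9)
  suffix <- substr(x, nchar(x), nchar(x))     # last character: "M" or "B"
  number <- as.numeric(gsub("[$MB]", "", x))  # numeric part only
  unname(number * multiplier[suffix])
}

funds <- c("$1.76M", "$2B", "$57M", "$9.87B")
fundsFinal <- parse_funds(funds)
fundsFinal
```

A value with any other suffix would come back as NA, which makes unexpected input easy to spot.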
I am trying to apply a function to each row and create a new column that depends on multiple columns, using the tidyverse. I was initially using rowwise(), but that was very slow. I want the columns passed to my custom function to be held in a variable, but I can't get it to work unless I explicitly list the variable names. For example, this works:
low_risk_codes <- c(0,1,10)
vars <- c("V1", "V2")
m <- matrix(1:9, ncol=3)
classify_low_risk_drug <- function(...){
t <- cbind(...)
return(apply(t, 1, function(x) ifelse(any(x %in% low_risk_codes), 1, 0)))
}
as.data.frame(m) %>%
mutate(val4 = classify_low_risk_drug(V1, V2))
But I would like to pass the columns through vars instead:
as.data.frame(m) %>%
mutate(val4 = classify_low_risk_drug(vars))
I can't get this to work, even if I include !!. What am I missing?
Also any suggestions for how to do this with map instead are also appreciated!
This sounds like it will do what you want, but I need to qualify it (a lot). First, FYI, I am still wrapping my mind around NSE in R but I find this vignette very helpful.
As for the solution, I tried to speed up the function by avoiding rowwise() and apply(). It should be quicker with rapply()/rowSums(), but I did not benchmark it. It may run into issues with very large data because rowSums() converts the data frame into a matrix, but that probably won't be a problem. In theory, you should also be able to use select helpers, unquoted variable names, or column positions (if you dare).
Also, I find it a little quirky that you need to supply the data frame as the first argument (i.e., as .), but there may be a way around that. I am certainly open to anyone who wants to edit this or use it as the base for their own solution. Hope this helps and gets you going in the right direction!
classify_low_risk_drug <- function(.data, vars, codes, na.rm = FALSE){
df <- rapply(.data, function(x) x %in% codes, how = "replace")
as.integer(rowSums(select(df, !!enquo(vars)), na.rm = na.rm) > 0)
}
as.data.frame(m) %>%
mutate(val4 = classify_low_risk_drug(., vars = vars, codes = c(0, 1, 10)))
V1 V2 V3 val4
1 1 4 7 1
2 2 5 8 0
3 3 6 9 0
EDIT: you could improve the speed a little bit by avoiding the matrix conversion / using lapply() w/ pmax():
classify_low_risk_drug2 <- function(.data, vars, codes, na.rm = FALSE){
as.integer(do.call(pmax, lapply(select(.data, !!enquo(vars)), `%in%`, codes)))
}
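In recent dplyr versions, the same row-wise "any column matches" test can probably be written without a helper that takes the data frame. The sketch below assumes dplyr 1.0.4 or later, where if_any() is available, and uses all_of() to select the columns named in vars; the name `res` is illustrative.

```r
library(dplyr)

low_risk_codes <- c(0, 1, 10)
vars <- c("V1", "V2")
m <- matrix(1:9, ncol = 3)

# if_any() applies the predicate to each selected column and
# combines the results row by row with a logical OR
res <- as.data.frame(m) %>%
  mutate(val4 = as.integer(if_any(all_of(vars), ~ .x %in% low_risk_codes)))
res
```

This stays vectorized over rows, so it avoids the slowdown seen with rowwise().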
Does anyone know how I can create a format of a variable in R and apply it to any other variable I want?
More specifically, I am trying to translate a SAS script to R script.
In SAS I can create a format of a variable like this:
PROC FORMAT;
VALUE bool
1 = "Yes"
2 = "No"
3 = "NA"
;
RUN;
(so the variable bool has the levels 1, 2, 3, where 1 will be replaced with "Yes", 2 with "No", etc)
Then I can indicate that for a specific variable of my data set (myVariable) - which also has the levels 1, 2, 3 - I want to have the same format:
FORMAT myVariable bool.;
so all the 1s will become "Yes", etc. Obviously, the order of the levels is not the same between the two variables; I just want to apply the same labels.
I cannot find how to do this with R, has anyone already done it?
You can also create a function if you want to reuse the format (and not deal with factors if that is a problem).
library(dplyr)
lvl <- function(y){ifelse(y == 1, "Yes",
ifelse(y == 2, "No","NA"))}
df <- data.frame(
answers = c(1,2,3)
)
df2 <- df %>% mutate(var2 = lvl(answers))
Try a look-up vector. For example:
v <- setNames(c("yes", "no", "na"), 1:3)
v[c(1,2,2,3,1,1)]
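To reuse one set of labels across several variables, much like a SAS format, the mapping can be wrapped in a small function. This is only a sketch: the function name `bool_fmt` and the example columns are hypothetical, and it assumes the codes are always 1, 2 and 3.

```r
# Hypothetical reusable "format": maps codes 1/2/3 to fixed labels
bool_fmt <- function(x) {
  factor(x, levels = 1:3, labels = c("Yes", "No", "NA"))
}

df <- data.frame(myVariable = c(1, 3, 2), otherVariable = c(2, 2, 1))

# Apply the same format to any variable that uses the same codes
df$myVariable <- bool_fmt(df$myVariable)
df$otherVariable <- bool_fmt(df$otherVariable)
df
```

Note that "NA" here is an ordinary text label, not a missing value; codes outside 1 to 3 would become real missing values.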
In vanilla R, you can do this:
# create data
df <- data.frame(
'answers' = c('1','2','3')
)
# make 'answers' into a factor
df$answers <- as.factor(df$answers)
#rename factor levels
levels(df$answers)
# [1] "1" "2" "3"
levels(df$answers) <- c('Yes','No','NA')
In Tidyverse, this is slightly less clunky.
# you can also do this within tidyverse
library(tidyverse)
# create data
df <- data.frame(
'answers' = c('1','2','3')
)
df <- df %>%
  mutate(answers = recode(as.factor(answers), '1' = 'Yes', '2' = 'No', '3' = 'NA'))
I am trying to replace NA values for a specific set of columns in my tibble. The columns all start with the same prefix so I am wanting to know if there is a concise way to make use of the starts_with() function from the dplyr package that would allow me to do this.
I have seen several other questions on SO, however they all require the use of specific column names or locations. I'm really trying to be lazy and not wanting to define ALL columns, just the prefix.
I've tried the replace_na() function from the tidyr package to no avail. I know the code I have is wrong for the assignment, but my vocabulary isn't large enough to know where to look.
Reprex:
library(tidyverse)
tbl1 <- tibble(
id = c(1, 2, 3),
num_a = c(1, NA, 4),
num_b = c(NA, 99, 100),
col_c = c("d", "e", NA)
)
replace_na(tbl1, list(starts_with("num_") = 0)))
How about using mutate_at with if_else (or case_when)? This works if you want to replace all NA in the columns of interest with 0.
mutate_at(tbl1, vars( starts_with("num_") ),
funs( if_else( is.na(.), 0, .) ) )
# A tibble: 3 x 4
id num_a num_b col_c
<dbl> <dbl> <dbl> <chr>
1 1 1 0 d
2 2 0 99 e
3 3 4 100 <NA>
Note that starts_with and other select helpers return an integer vector giving the position of the matched variables. I always have to keep this in mind when using them outside the situations where I normally use them.
In newer versions of dplyr, use list() with a tilde instead of funs():
list( ~if_else( is.na(.), 0, .) )
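With dplyr 1.0.0 or later, the scoped verbs such as mutate_at() are superseded by across(), so the same replacement can be sketched as below. This assumes dplyr and tidyr are installed; the name `tbl2` is illustrative.

```r
library(dplyr)
library(tidyr)

tbl1 <- tibble(
  id = c(1, 2, 3),
  num_a = c(1, NA, 4),
  num_b = c(NA, 99, 100),
  col_c = c("d", "e", NA)
)

# Replace NA with 0 only in columns whose names start with "num_"
tbl2 <- tbl1 %>%
  mutate(across(starts_with("num_"), ~ replace_na(.x, 0)))
tbl2
```

Columns outside the selection, such as col_c, keep their missing values.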