I am trying to replace NA values in a specific set of columns in my tibble. The columns all start with the same prefix, so I want to know if there is a concise way to use the starts_with() function from the dplyr package to do this.
I have seen several other questions on SO, but they all require naming specific columns or positions. I'm really trying to be lazy and don't want to list ALL the columns, just the prefix.
I've tried the replace_na() function from the tidyr package to no avail. I know the code I have is wrong for the assignment, but my vocabulary isn't large enough to know where to look.
Reprex:
library(tidyverse)
tbl1 <- tibble(
id = c(1, 2, 3),
num_a = c(1, NA, 4),
num_b = c(NA, 99, 100),
col_c = c("d", "e", NA)
)
replace_na(tbl1, list(starts_with("num_") = 0))
How about using mutate_at with if_else (or case_when)? This works if you want to replace all NA in the columns of interest with 0.
mutate_at(tbl1, vars(starts_with("num_")),
          funs(if_else(is.na(.), 0, .)))
# A tibble: 3 x 4
id num_a num_b col_c
<dbl> <dbl> <dbl> <chr>
1 1 1 0 d
2 2 0 99 e
3 3 4 100 <NA>
Note that starts_with and the other select helpers return an integer vector giving the positions of the matched variables. I always have to keep this in mind when trying to use them outside the situations where I normally use them.
In newer versions of dplyr, use list() with a tilde instead of funs():
list( ~if_else( is.na(.), 0, .) )
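And in dplyr 1.0.0 and later, mutate_at() itself is superseded by across(), so the whole call can be written in one step. A quick sketch of the equivalent (it should give the same result as the mutate_at() version above):
library(dplyr)
# Replace NAs with 0 in every column whose name starts with "num_"
mutate(tbl1, across(starts_with("num_"), ~ if_else(is.na(.x), 0, .x)))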
The labelled package provides this functionality to modify value labels for multiple variables in one go:
df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3)
val_labels(df)
But I'm wondering if there's a way to do this in tidyverse syntax:
Something like this:
library(tidyverse)
df %>%
  mutate(across(v1:v2, ~ val_labels(.x) <- c(YES = 1, MAYBE = 2, NO = 3)))
We need to assign and then return the column (.x). In addition, when there is more than one expression, wrap them inside {}.
library(dplyr)
library(labelled)
df <- df %>%
mutate(across(v1:v2, ~
{
val_labels(.x) <- c(YES = 1, MAYBE = 2, NO = 3)
.x
}))
-output
> val_labels(df)
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
I would suggest using haven's labelled class directly; alternatively, check out the labelled package's functions made for the dplyr syntax, e.g. add_value_labels().
df <-
df |>
mutate(across(v1:v2,
~ haven::labelled(.,
labels = c(YES = 1,
MAYBE = 2,
NO = 3)
)
)
)
labelled::val_labels(df)
Output:
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
A side note: unless you have a very specific reason for using the labelled package, I'd suggest that you keep its usage to a minimum and coerce into factors, especially in the case of value labels. I've learned the hard way that the labelled package (and sjlabelled, for that matter) will often let you do things that seem smart at the outset but aren't in the long run.
A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. (...) This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing.
https://haven.tidyverse.org/reference/labelled.html
(My emphasis)
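For example, a minimal sketch of that coercion (assuming the labelled df created above), using haven::as_factor() to turn the labelled columns into ordinary factors:
library(dplyr)
# Convert the labelled columns v1 and v2 into plain factors
df <- df |>
  mutate(across(v1:v2, haven::as_factor))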
I have two dataframes, df1 and df2, which I have merged into another dataframe, df3.
df1 <- data.frame(
Name = c("A", "B", "C"),
Value = c(1, 2, 3),
Method = c("Indirect"))
df2 <- data.frame(
Name = c("A", "B"),
Value = c(4, 5),
Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 looks something like this:
  Name Value   Method
1    A     1 Indirect
2    B     2 Indirect
3    C     3 Indirect
4    A     4   Direct
5    B     5   Direct
Now I need to identify all the entries in the Name column that appear only once (which is C in this case), and for each of these unique entries a row is to be added with the same "Name", a "Value" of 0, and the opposite "Method". The output should look like this.
Finally the rows with similar "Name" are to be arranged one below the other.
I have a huge dataframe and I need to achieve the above mentioned outcome in the most efficient way in R. How do I proceed?
One way
tmp=df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]),]
tmp$Value=0
tmp$Method=ifelse(tmp$Method=="Direct","Indirect","Direct")
Name Value Method
3 C 0 Direct
you can now rbind this to your original data (and sort it).
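For example, roughly:
# Combine the original data with the flipped rows and sort by Name
out <- rbind(df3, tmp)
out <- out[order(out$Name), ]
out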
Here is another solution using data.table.
Reprex
Code
library(data.table)
library(magrittr) # for the pipe!
setDT(df3)
df3 <- rbindlist(list(df3,
df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)
I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
add_count(Name) |>
filter(n == 1) |>
mutate(
Value = 0,
Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
n = NULL
) |>
bind_rows(df3) |>
arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect
I have to apologize in advance if the question is very basic as I am still new to R. I have tried to look on stackoverflow for similar questions, but I still can't resolve the problem that I am facing.
I am currently working on a large dataset, X. What I am trying to do is pretty simple: I want to replace all NAs in selected (non-consecutive) columns with "no".
I first created a vector containing all the columns that I want to modify. For instance, if I want to modify the NAs in the columns named "m", "l" and "h", I wrote the following:
modify <- c("m","l","h")
for (i in 1:length(modify))
column <- modify[i]
X$column <- as.character(X$column) #X is my dataframe
X$column %>% replace_na("no")
This loop printed output only for the "m" column, which is the first element of my modify vector. However, even though output was generated in the loop, when I checked X$m afterwards, nothing had changed in my original dataset.
I also tried to create a function, which is very similar to the loop. Even though no error message was generated, it didn't work, as I don't know what the return value should be.
Why isn't the loop applied to my entire dataset when the individual steps in the loop work on their own?
Thank you so so much for your help!
This might help; it is similar to one of the other answers here, but slightly different in that it uses all_of():
library(tidyverse)
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 <NA>
#> 3 NA b
modify <- c("x","y")
df %>%
mutate(
across(all_of(modify), ~replace_na(.x, 0))
)
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 0
#> 3 0 b
Created on 2021-09-22 by the reprex package (v2.0.1)
Here's a base R approach, modifying the data from @scrameri.
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"), c = c(1, NA, 5))
modify <- c('x', 'y')
df[modify][is.na(df[modify])] <- 'no'
df
# x y c
#1 1 a 1
#2 2 no NA
#3 no b 5
I'm going to fix your code with as few changes as possible, so you can learn.
There are two big problems. First, the for loop needs to have curly braces {} around the lines you want to loop over. Second, if you want to reference variables in a data frame dynamically, you can't use the $ operator. You have to use double brackets [[]].
library(tidyr)
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"), h = c(1, NA, 5))
modify <- c("m","l","h")
for (i in seq_along(modify)) {
column <- modify[i]
X[[column]] <- as.character(X[[column]]) #X is my dataframe
X[[column]] <- X[[column]] %>% replace_na("no")
}
X
# m l h
# 1 1 a 1
# 2 2 no no
# 3 no b 5
You can do what you were trying to do much more efficiently, as shown in the other answers. But I wanted to show you how to do it the way you were trying, to correct your understanding of for loops and the subsetting operators. These are basic things that everyone should understand when first learning R.
You might want to go through a beginners tutorial to solidify your understanding. I used tutorialspoint when I was first learning and found it useful.
We could do this efficiently with set from data.table
library(data.table)
setDT(X)
for(nm in modify) {
set(X, i = NULL, j= nm, value = as.character(X[[nm]]))
set(X, i = which(is.na(X[[nm]])), j = nm, value = 'no')
}
-output
> X
m l h i
1: 1 a 1 NA
2: 2 no no 5
3: no b 5 6
data
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"),
h = c(1, NA, 5), i = c(NA, 5, 6))
modify <- c("m","l","h")
I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code pattern for different columns, which I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
replace = c("rural" = 1,
"suburban" = 2,
"urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
value1, num_val1,
value2, num_val2,
value3, num_val3) {
# plyr using revalue
data$columnX <- revalue(x = data$columnX,
replace = c(value1 = num_val1,
value2 = num_val2,
value3 = num_val3))
# set as numeric
data$columnX <- as.numeric(data$columnX)
# return dataset
return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I can just do:
df$Area <- as.numeric(as.factor(df$Area))
but I want specific values for each choice rather than letting R choose.
Any assistance appreciated.
As already mentioned by #MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result would be to make use of dplyr::recode. Additionally, if you want to write your own function, I would suggest passing the values to recode and the replacements as vectors, instead of passing each value and replacement as a separate argument:
library(dplyr)
set.seed(42)
df <- data.frame(
Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
recode_table <- setNames(replacements, values)
data[[x]] <- recode(data[[x]], !!!recode_table)
data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
You can use case_when with across.
If the columns that you want to change are called col1, col2 you can do -
library(dplyr)
df <- df %>%
mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
. == 'suburban' ~ 2,
. == 'urban' ~ 3)))
Based on your actual column names, you can also pass starts_with(), ends_with(), or a range of columns (A:Z) to across().
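For example, if the relevant columns all share a prefix (a hypothetical prefix 'col' here), something like this should work:
df <- df %>%
  mutate(across(starts_with('col'), ~case_when(. == 'rural' ~ 1,
                                               . == 'suburban' ~ 2,
                                               . == 'urban' ~ 3)))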
I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for(i in 1:4){
cname1 <- paste0("age_",i)
cname2 <- paste0("age1_corr_",i)
y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}
The output I'd like to have is
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -grep("_corr_", current_vars()))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)
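Note that current_vars() has since been superseded in newer tidyselect/dplyr releases; the same column drop can also be written with a selection helper such as contains(), e.g.:
x <- select(y, -contains("_corr_"))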
Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
dplyr::mutate(
age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
age_1 = dplyr::coalesce(age1_corr_1, age_1),
age_2 = dplyr::coalesce(age1_corr_2, age_2),
age_3 = dplyr::coalesce(age1_corr_3, age_3),
age_4 = dplyr::coalesce(age1_corr_4, age_4)
) %>%
dplyr::select(
-age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
)
Produces
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.
Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the 4th column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function( index ) {
name_age <- paste0("age_" , index)
name_corr <- paste0("age1_corr_", index)
y %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
) %>%
dplyr::select(!!name_age)
}
1:4 %>%
purrr::map(coalesce_corr1) %>%
dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function( index ) {
name_age <- paste0( "age_" , index)
name_corr <- paste0( "age1_corr_", index)
z <<- z %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
)
z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.
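For completeness, here is a sketch (not from the original answer) that avoids the global assignment by accumulating the data frame with purrr::reduce() instead:
library(dplyr)
library(purrr)
# y is the original data frame from the question
coalesce_corr3 <- function(data, index) {
  name_age  <- paste0("age_",       index)
  name_corr <- paste0("age1_corr_", index)
  data[[name_age]]  <- coalesce(data[[name_corr]], data[[name_age]])
  data[[name_corr]] <- NULL   # drop the correction column once it has been used
  data
}
reduce(1:4, coalesce_corr3, .init = y)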