I have a seemingly small problem. I want to use mutate_all() in conjunction with case_when(). A sample data frame:
library(tibble)
library(lubridate)  # provides today()
tbl <- tibble(
  x = c(0, 1, 2, 3, NA),
  y = c(0, 1, NA, 2, 3),
  z = c(0, NA, 1, 2, 3),
  date = rep(today(), 5)
)
I first made another data frame, replacing all the NAs with 0 and all other values with 1, using the following code.
tbl %>%
mutate_all(
funs(
case_when(
. %>% is.na() ~ 0,
TRUE ~ 1
)))
Now I want to replace the NA values with blanks ("") and leave the other values as they are. However, I don't know how to set the TRUE branch so that it keeps the column's original value.
Any suggestions would be much appreciated!
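To replace the NA values with "", we can use replace_na from tidyr. (Recent tidyr releases cast the replacement to the column's type, so on numeric columns this call may raise a type error; in that case, convert the columns to character first.)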
library(dplyr)
library(tidyr)
tbl %>%
mutate_all(replace_na, "")
# A tibble: 5 x 3
# x y z
# <chr> <chr> <chr>
#1 0 0 0
#2 1 1 ""
#3 2 "" 1
#4 3 2 2
#5 "" 3 3
With case_when or if_else, every branch must return the same type. Because inserting "" makes the result character, the other values must also be coerced to character.
tbl %>%
mutate_all(~ case_when(is.na(.) ~ "", TRUE ~ as.character(.)))
If we want to use only specific columns, then we can use mutate_at
tbl %>%
mutate_at(vars(x:y), ~ case_when(is.na(.) ~ "", TRUE ~ as.character(.)))
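In dplyr 1.0.0 and later, mutate_all and mutate_at are superseded by across. A minimal sketch of the same two operations, assuming a recent dplyr and using a copy of the sample data without the date column (the names res_all and res_xy are only for illustration):

```r
library(dplyr)

# Sample data without the date column, to keep the sketch small
tbl <- tibble(
  x = c(0, 1, 2, 3, NA),
  y = c(0, 1, NA, 2, 3),
  z = c(0, NA, 1, 2, 3)
)

# Counterpart of mutate_all(): apply the function to every column
res_all <- tbl %>%
  mutate(across(everything(), ~ if_else(is.na(.x), "", as.character(.x))))

# Counterpart of mutate_at(vars(x:y), ...): apply it to x and y only
res_xy <- tbl %>%
  mutate(across(x:y, ~ if_else(is.na(.x), "", as.character(.x))))
```

Here z stays numeric in res_xy because it is not part of the selection.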
Also, to simplify the code in the OP's post, the logical vector from !is.na can be coerced directly to integer with as.integer or with unary +
tbl %>%
mutate_all(~ as.integer(!is.na(.)))
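To see the two coercions side by side on a plain vector (a small base R sketch; flag_int and flag_plus are illustrative names):

```r
x <- c(0, 1, 2, 3, NA)

flag_int  <- as.integer(!is.na(x))  # explicit conversion
flag_plus <- +(!is.na(x))           # unary plus coerces logical to integer

flag_int
# [1] 1 1 1 1 0
identical(flag_int, flag_plus)
# [1] TRUE
```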
Or if we are using case_when
tbl %>%
mutate_all(~ case_when(is.na(.)~ 0, TRUE ~ 1))
Related
I've got a dataframe with 15 columns (the data are categorical).
I'd like to extract rows with contradictory categories (based on a set of rules). I tried
Df %>% filter_at(vars(col_1,col_2), any_vars(. %in% c(8, 1))) and it works fine for rows with category 8 or rows with category 1. The problem is that I'd like both 8 and 1 in the same row (that's how I figure it would catch the contradictions in the dataset).
I'd appreciate any ideas on this.
We may use & with ==
library(dplyr)
Df %>%
filter(if_any(c(col_1, col_2), ~ .x == 8) & if_any(c(col_1, col_2), ~ .x == 1))
-output
ID col_1 col_2
1 1 1 8
Or another option is to paste the columns and detect with a regex
library(stringr)
Df %>%
filter(str_detect(str_c(col_1, col_2), "18|81"))
-output
ID col_1 col_2
1 1 1 8
If there are more than 2 values, we may also use
library(purrr)
Df %>%
filter(map(c(1, 8), \(x) if_any(c(col_1, col_2), ~ .x == x)) %>%
reduce(`&`))
ID col_1 col_2
1 1 1 8
data
Df <- data.frame(ID = 1:5, col_1 = c(1, 2, 3, 4, 1), col_2 = c(8, 8, 3, 4, 1))
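For comparison, the same "8 somewhere and 1 somewhere in the row" logic can be sketched in base R with rowSums. This is only an illustration on the sample data; cols, has_8, has_1 and contradictory are names made up for the example:

```r
Df <- data.frame(ID = 1:5, col_1 = c(1, 2, 3, 4, 1), col_2 = c(8, 8, 3, 4, 1))

cols <- c("col_1", "col_2")

# TRUE for rows where at least one of the columns equals the value
has_8 <- rowSums(Df[cols] == 8) > 0
has_1 <- rowSums(Df[cols] == 1) > 0

# Keep only rows that contain both values
contradictory <- Df[has_8 & has_1, ]
contradictory
#   ID col_1 col_2
# 1  1     1     8
```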
I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select or make an operation only with those with suffixes lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
If we need to be more flexible, we can extract the numeric part from the column names that contain 'tot', subset the names based on the condition, and use the resulting names
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is a good use case for the often-forgotten num_range selection helper (available through dplyr), which selects columns whose names consist of a prefix followed by a number from the given range:
Determine the threshold
suffix_threshold <- 3
select()
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
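Since the question mentions that the real columns are not in order, here is a small sketch with shuffled columns and fixed values (the data frame df_shuffled and its values are made up for illustration). num_range matches on the names, so the column positions should not matter:

```r
library(dplyr)

# Hypothetical data with the tot_ columns out of order
df_shuffled <- data.frame(
  id    = 1:3,
  tot_4 = c(4, 5, 6),
  tot_1 = c(1, 2, 3),
  tot_3 = c(7, 8, 9),
  tot_2 = c(10, 20, 30)
)

suffix_threshold <- 3

# Select id plus every tot_ column whose suffix is below the threshold
selected <- df_shuffled %>%
  select(id, num_range("tot_", seq_len(suffix_threshold - 1)))

# Row sums over the same selection
summed <- df_shuffled %>%
  mutate(sumVar = rowSums(across(num_range("tot_", seq_len(suffix_threshold - 1)))))
```

With these values, sumVar is tot_1 + tot_2 for each row, that is 11, 22 and 33.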
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629
I want to change all NA values in a column to 0 and all other values to 1. However, I can't get the combination of case_when and is.na to work.
# Create dataframe
a <- c(rep(NA,9), 2, rep(NA, 10))
b <- c(rep(NA,9), "test", rep(NA, 10))
df <- data.frame(a,b, stringsAsFactors = F)
# Create new column (c), where all NA values in (a) are transformed to 0 and other values are transformed to 1
df <- df %>%
mutate(
c = case_when(
a == is.na(.$a) ~ 0,
FALSE ~ 1
)
)
I expect column (c) to contain all 0s and a single 1, but it's all 0s.
It does work when I use an if_else statement with is.na, like:
df <- df %>%
  mutate(
    c = if_else(is.na(a), 0, 1)
  )
What is going on here?
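The condition a == is.na(.$a) compares the values of a with a logical vector instead of testing for NA, and the fallback condition FALSE never matches any row. Test is.na(a) directly and use TRUE as the catch-all instead: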
df %>%
mutate(
c = case_when(
is.na(a) ~ 0,
TRUE ~ 1
)
)
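To see what each condition evaluates to, here is a small sketch on a shortened version of the vector (a_short, broken and fixed are illustrative names, and the result assumes a current dplyr):

```r
library(dplyr)

a_short <- c(NA, 2, NA)

# The comparison from the question: never TRUE, so no branch matches
a_short == is.na(a_short)
# [1]    NA FALSE    NA

broken <- case_when(
  a_short == is.na(a_short) ~ 0,
  FALSE ~ 1               # a FALSE condition can never be selected
)

fixed <- case_when(
  is.na(a_short) ~ 0,
  TRUE ~ 1                # TRUE acts as the catch-all branch
)

fixed
# [1] 0 1 0
```

In this sketch, broken comes out as all NA, because case_when returns NA for rows where no condition is TRUE.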
Suppose I have the tibble dat below. I would like to calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. My real data has more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but none worked the way I wanted; I am still struggling with the string vs. column name distinction.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) that calculates the answer. It takes two arguments. The first, dat, is the data frame to analyze, in the same format as the one the OP provided. The second, group, is the name of the group to analyze. The result is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
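A possible variant of Solution 1 replaces the row-wise apply with pmax, which takes the element-wise maximum across several vectors. This is only a sketch: the function name max_minus_first_pmax is made up, and it assumes the column names follow the "group number" pattern with a space, as in the OP's data.

```r
# Same data as the question, built with base R to keep the sketch dependency-free
dat <- data.frame(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
                  `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4,
                  check.names = FALSE)

max_minus_first_pmax <- function(dat, group) {
  # Columns whose name starts with the group name followed by a space
  cols <- dat[startsWith(names(dat), paste0(group, " "))]
  # Element-wise maximum over all columns except the first
  row_max <- do.call(pmax, c(unname(as.list(cols[-1])), na.rm = TRUE))
  # Subtract the first column of the group
  row_max - cols[[1]]
}

max_minus_first_pmax(dat, "a")
# [1] 3 2 3
max_minus_first_pmax(dat, "b")
# [1] 5 4 5
```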
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package and converts the data frame from wide format to long format. I used gather twice because the first column of each group is treated differently from the other columns in the same group. ends_with(" 1") is a select helper available in dplyr, which selects the columns whose names end in " 1". Notice that the space in " 1" is important, because "1" alone would also select columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, which is similar to Solution 2. It uses do to apply an operation to each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(tibble(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
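In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. A rough sketch of the same idea with those functions, assuming recent dplyr and tidyr and the "group number" naming pattern from the question (the name res is only for illustration):

```r
library(dplyr)
library(tidyr)

dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
              `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)

res <- dat %>%
  mutate(rowid = row_number()) %>%
  # Split each column name at the space into group and column number
  pivot_longer(-rowid,
               names_to = c("Group", "Column_Number"),
               names_sep = " ") %>%
  group_by(rowid, Group) %>%
  # Maximum of all columns except number 1, minus column number 1
  summarise(Final = max(value[Column_Number != "1"]) -
              value[Column_Number == "1"],
            .groups = "drop") %>%
  pivot_wider(names_from = Group, values_from = Final)
```

For the sample data this should give a = 3, 2, 3 and b = 5, 4, 5, matching the outputs above.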
I want to pad strings with zeros (on the left) if the number of characters is 2.
Let the dataframe be as follows:
df<-data.frame(a=c("352","35","54","1"),stringsAsFactors=FALSE)
I would like to get
df
a
1 352
2 035
3 054
4 1
I tried using mutate_if as follows:
df %>% mutate_if(nchar(a)==2,str_pad(a,width=3,side="left",pad="0"))
df %>% mutate_if(nchar(vars(a))==2,str_pad(a,width=3,side="left",pad="0"))
But neither works.
I also tried using mutate with replace:
df %>% mutate(a=replace(a,which(nchar(a)==2),str_pad(a,width=3,side="left",pad="0")))
Again, I can't achieve what I want.
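mutate_if selects columns by a predicate rather than rows, so it cannot express a row-wise condition. Instead, with dplyr and stringr loaded, we can either use if_else inside mutate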
df %>%
mutate(a = if_else(nchar(a)==2, str_pad(a,width=3,side="left",pad="0"), a))
or case_when
df %>%
mutate(a = case_when(nchar(.$a)==2 ~ str_pad(.$a, width = 3, side = "left", pad = "0"),
TRUE ~ .$a) )
# a
#1 352
#2 035
#3 054
#4 1
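The replace attempt from the question can also be made to work. It fails as written because the replacement is the padded version of the whole column (length 4) while only two positions are being replaced, so the values no longer line up. A minimal sketch that pads only the selected elements (idx and padded are illustrative names):

```r
library(stringr)

a <- c("352", "35", "54", "1")

# Positions of the two-character strings
idx <- which(nchar(a) == 2)

# Pad only those elements, so positions and replacement values have the same length
padded <- replace(a, idx, str_pad(a[idx], width = 3, side = "left", pad = "0"))
padded
# [1] "352" "035" "054" "1"
```

Inside a pipeline, the same expression could be used as the right-hand side of mutate(a = ...).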