rename multiple variables with pattern mutate_at [duplicate] - r

This question already has an answer here:
Create new variables with mutate_at while keeping the original ones
(1 answer)
Closed 3 years ago.
I have a dataset with many columns with similar names. Some of the columns have values in cents and others in dollars, e.g:
library(tidyverse)
data <- tribble(
~col1_cents, ~col1,~col2_cents, ~col2,
1000, NA, 3000, NA,
NA, 20, NA, 25.2,
2000, NA, 2030, NA,
)
For one variable, it's easy to divide the value by 100, assign it to the dollar variable, and delete the cents variable, e.g.:
data %>%
  mutate(col1 = if_else(is.na(col1), col1_cents / 100, col1)) %>%
  select(-col1_cents)
Is there a generalisable way to do this for all variables in the dataset that end in _cents? I tried this with mutate_at and ends_with but could not get it to rename to the original variable without _cents...
Thanks!

You can use mutate_at
library(dplyr)
data %>% mutate_at(vars(ends_with("cents")), ~./100)
# A tibble: 3 x 4
# col1_cents col1 col2_cents col2
# <dbl> <dbl> <dbl> <dbl>
#1 10 NA 30 NA
#2 NA 20 NA 25.2
#3 20 NA 20.3 NA
If you then want to combine each pair of columns, use split.default to split the columns into groups by name prefix, then imap_dfc from purrr together with coalesce to merge each group into a single column.
df1 <- data %>% mutate_at(vars(ends_with("cents")), ~./100)
# Group columns by the prefix before "_", then keep the first non-NA value per row
purrr::imap_dfc(split.default(df1, sub("_.*", "", names(df1))),
  ~.x %>% mutate(!!.y := coalesce(.x[[2]], .x[[1]])) %>% select(all_of(.y)))
# col1 col2
# <dbl> <dbl>
#1 10 30
#2 20 25.2
#3 20 20.3
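With dplyr 1.0.0 or later, the same result can probably be reached without splitting the data frame. The following is only a sketch, assuming every dollar column has a matching column with the `_cents` suffix and that the data is not grouped; `result` is a name introduced here for illustration.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~col1_cents, ~col1, ~col2_cents, ~col2,
  1000, NA, 3000, NA,
  NA, 20, NA, 25.2,
  2000, NA, 2030, NA
)

result <- data %>%
  # For each dollar column, fill NAs from its "<name>_cents" partner divided by 100
  mutate(across(
    !ends_with("_cents"),
    ~ coalesce(.x, data[[paste0(cur_column(), "_cents")]] / 100)
  )) %>%
  # Drop the cents columns once their values have been merged
  select(!ends_with("_cents"))

result
```

Here `cur_column()` gives the name of the column currently being processed inside `across()`, which is what lets the lambda look up the matching cents column.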

Related

sum across multiple columns of a data frame based on multiple patterns R

I have a data frame of multiple variables for different years, which looks roughly like this:
df <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(1,6,8,NA),
X1990.1=c(10,20,NA,2),
X1990.2=c(2,4,6,8),
X1990.3=c(1,NA,3,6),
X1990.4=c(8,7,5,4),
X1991=c(2,6,3,5),
X1991.1=c(NA,20,NA,2),
X1991.2=c(NA,NA,NA,NA),
X1991.3=c(1,NA,3,5),
X1991.4=c(8,9,6,3))
I made this example with only 5 variables per year and only 2 years, but in reality it is a much larger df, with tens of variables for the years 1990 to 2020.
I want to create a new data frame with the sums of all the columns for the same year, so that the new data frame looks like this:
df_sum <- data.frame(name=c("name1", "name2", "name3", "name4"),
X1990=c(22, 37, 22, 20),
X1991=c(11,35,12,15))
I was thinking of some loop over rowSums(across(matches('pattern')), na.rm = TRUE), which I found in another question, but so far I have not managed to implement it.
Thanks!
We can reshape to 'long' format with pivot_longer, and get the sum while reshaping back to 'wide'
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = starts_with("X"), names_to = "name1") %>%
mutate(name1 = str_remove(name1, "\\.\\d+$")) %>%
pivot_wider(names_from = name1, values_from = value,
values_fn = ~ sum(.x, na.rm = TRUE))
-output
# A tibble: 4 × 3
name X1990 X1991
<chr> <dbl> <dbl>
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15
Or in base R, use split.default to split the data into a list of datasets based on the column name pattern, get the rowSums and cbind with the first column
cbind(df[1], sapply(split.default(df[-1],
trimws(names(df)[-1], whitespace = "\\.\\d+")), rowSums, na.rm = TRUE))
name X1990 X1991
1 name1 22 11
2 name2 37 35
3 name3 22 12
4 name4 20 15
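The loop the question had in mind can also be written directly. This is a minimal base R sketch, assuming every year column is named like `X1990` or `X1990.<n>`; the names `years`, `cols`, and `df_sum` are chosen here for illustration.

```r
df <- data.frame(name = c("name1", "name2", "name3", "name4"),
                 X1990 = c(1, 6, 8, NA),
                 X1990.1 = c(10, 20, NA, 2),
                 X1990.2 = c(2, 4, 6, 8),
                 X1990.3 = c(1, NA, 3, 6),
                 X1990.4 = c(8, 7, 5, 4),
                 X1991 = c(2, 6, 3, 5),
                 X1991.1 = c(NA, 20, NA, 2),
                 X1991.2 = c(NA, NA, NA, NA),
                 X1991.3 = c(1, NA, 3, 5),
                 X1991.4 = c(8, 9, 6, 3))

# Unique year prefixes, e.g. "X1990", "X1991"
years <- unique(sub("\\.\\d+$", "", names(df)[-1]))

df_sum <- df["name"]
for (yr in years) {
  # Columns belonging to this year: "X1990", "X1990.1", ...
  cols <- grep(paste0("^", yr, "(\\.\\d+)?$"), names(df), value = TRUE)
  df_sum[[yr]] <- unname(rowSums(df[cols], na.rm = TRUE))
}

df_sum
```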

How to replace NA in a dataframe for a specific value using the results of another column and taking into account conditions of another column?

I have a dataframe composed of 9 columns with more than 4000 observations. For this question I will present a simpler dataframe (I use the tidyverse library)
Let's say I have the following dataframe:
library(tidyverse)
df <- tibble(Product = c("Bread","Oranges","Eggs","Bananas","Whole Bread" ),
Weight = c(NA, 1, NA, NA, NA),
Units = c(2,6,1,2,1),
Price = c(1,3.5,0.5,0.75,1.5))
df
I want to replace the NA values in the Weight column with a number multiplied by the value of Units, depending on the word shown in the Product column. Basically, the rule is:
Replace NA in Weight with 2.5 * number of units if Product contains the word "Bread". Replace it with 1 if Product contains the word "Eggs".
The thing is that I don't know how to code something like that in R. I tried the following code that a kind user gave me for a similar question:
df <- df %>%
mutate(Weight = case_when(Product == "bread" & is.na(Weight) ~ 0.25*Units))
But it doesn't work and it doesn't take into account the fact that if there is "Whole Bread" written in my dataframe it also has to apply the rule.
Does anyone have an idea?
Some of them are not exact matches, so use str_detect
library(dplyr)
library(stringr)
df %>%
mutate(Weight = case_when(is.na(Weight) &
str_detect(Product, regex("Bread", ignore_case = TRUE)) ~ 2.5 * Units,
is.na(Weight) & Product == "Eggs"~ Units, TRUE ~ Weight))
-output
# A tibble: 5 × 4
Product Weight Units Price
<chr> <dbl> <dbl> <dbl>
1 Bread 5 2 1
2 Oranges 1 6 3.5
3 Eggs 1 1 0.5
4 Bananas NA 2 0.75
5 Whole Bread 2.5 1 1.5
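To see why the original attempt matched nothing, it may help to compare an exact comparison with a pattern match on the Product values alone. This is a small illustrative sketch; `products`, `exact`, and `partial` are names introduced here.

```r
library(stringr)

products <- c("Bread", "Oranges", "Eggs", "Bananas", "Whole Bread")

# `==` compares whole strings and is case-sensitive, so "bread" matches nothing
exact <- products == "bread"

# str_detect() looks for the pattern anywhere in the string;
# regex(..., ignore_case = TRUE) makes the match case-insensitive
partial <- str_detect(products, regex("bread", ignore_case = TRUE))

exact    # FALSE FALSE FALSE FALSE FALSE
partial  # TRUE FALSE FALSE FALSE TRUE
```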

Merging rows in a dataframe R with duplicate id's

I have a question regarding merging rows in a dataframe:
I have seen a couple of questions about merging rows, but I have a hard time understanding them and applying them to my situation:
I have a dataframe with a structure like this:
person_id test_date serial_number freezer_number test_1 test_2 test_3 test_4
x 01/01/2010 c d positive NA NA NA
x 05/01/2010 a b NA positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
I want to merge the rows so that the data of the other columns remain intact (mainly the test date), however I want the rows of the test number and the person_id to merge so that the same individual is in 1 row with multiple tests.
This would be the ideal output:
person_id test_date serial_number freezer_number test_date2 test_1 test_2 test_3 test_4
x 01/01/2010 c d 05/01/2010 positive positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
How do I go about this? I have tried the "aggregate()" functions before, however this is very unclear to me.
Any help is appreciated, I can give more information to clarify my current code and data frame!
You could use summarize_all, grouped by person_id. For each person_id, this keeps the first non-NA value of every variable.
I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).
library(dplyr)
library(tidyr)  # needed for pivot_wider
df1 <- df %>%
group_by(person_id) %>%
mutate(id = seq_along(person_id)) %>%
pivot_wider(names_from = id,
values_from = test_date,
names_prefix = "test_date") %>%
summarize_all(list(~ .[!is.na(.)][1]))
Output
> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA
What you're trying to do is reshaping the data from long format (one row per test) to wide format (one row per person, tests are in separate columns). This can be done in many ways, for example with tidyr::pivot_wider().
However there's a complicating factor - your dataset is not quite in long format because there are already multiple columns per test result. So you first need to fix that.
# Load libraries
library(tidyr)
library(dplyr)
library(stringr)
# Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)
df2 <- df %>%
# Add a column indicating test number
group_by(person_id) %>%
mutate(test_number = row_number(),
# Gather the test results into a single column
test_result = paste0(test_1, test_2, test_3, test_4) %>%
str_remove_all("NA")) %>%
select(-(test_1:test_4)) %>%
# Reshape from long to wide
pivot_wider(names_from = test_number,
values_from = c(test_date, serial_number,
freezer_number, test_result)) %>%
# Reorder the columns
relocate(ends_with("1"), .before = ends_with("2"))
df2

R coalesce down columns by identifer [duplicate]

This question already has answers here:
combine rows in data frame containing NA to make complete row
(7 answers)
Closed 2 years ago.
I have a long dataset with student grades and courses going across many semesters. It has many NAs and many rows for each student. I want it to have one long row per student to fill in those NAs but keep the same column names.
Here's a sample:
library(tidyverse)
sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
fall_course_1 = c("Math", NA, "Science", NA),
fall_course_2 = c(NA, "English", NA, NA),
fall_grade_1 = c(90, NA, 98, NA),
fall_grade_2 = c(NA, 60, NA, NA))
And here's what I'd like it to look like:
library(tidyverse)
answer <- tibble(student = c("Corey", "Sibley"),
fall_course_1 = c("Math", "Science"),
fall_course_2 = c("English", NA),
fall_grade_1 = c(90, 98),
fall_grade_2 = c(60, NA))
Some semesters, some students take many classes and some just one. I've tried using coalesce(), but I can't figure it out. Any help would be appreciated!
This should do it: pivot the data long, remove the NAs, and then pivot it back to wide.
You need to convert the numeric values to character temporarily so they can go in the same column as the course labels, then type_convert() is a lazy way to put them back again.
library(dplyr)
library(tidyr)
library(readr)
reshaped <- sample %>%
mutate_if(is.numeric, as.character) %>%
pivot_longer(-student) %>%
drop_na() %>%
pivot_wider(id_cols = student, names_from = name, values_from = value) %>%
type_convert()
You could get the first non-NA value in each column for each student.
library(dplyr)
sample %>% group_by(student) %>% summarise_all(~na.omit(.)[1])
# A tibble: 2 x 5
# student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
# <chr> <chr> <chr> <dbl> <dbl>
#1 Corey Math English 90 60
#2 Sibley Science NA 98 NA
This approach returns NA if there are all NA values in a group.
Using a custom coalesce function and dplyr:
coalesce_all_columns <- function(df) {
return(coalesce(!!! as.list(df)))
}
library(dplyr)
sample %>%
group_by(student) %>%
summarise_all(coalesce_all_columns)
# A tibble: 2 x 5
student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
<chr> <chr> <chr> <dbl> <dbl>
1 Corey Math English 90 60
2 Sibley Science NA 98 NA
You could also use data.table package as follows:
library(data.table)
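# setDT() converts `sample` to a data.table by reference;
# the grouped result is returned by the expression, not stored back in `sample`
setDT(sample)[, lapply(.SD, na.omit), by = student]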
# 1: Corey Math English 90 60
# 2: Sibley Science <NA> 98 NA
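In dplyr 1.0.0 or later, the "first non-NA value per group" idea from the answers above can also be expressed with `across()`. This is a sketch rather than a definitive solution; the helper `first_non_na` and the name `combined` are hypothetical names introduced here.

```r
library(dplyr)

sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
                 fall_course_1 = c("Math", NA, "Science", NA),
                 fall_course_2 = c(NA, "English", NA, NA),
                 fall_grade_1 = c(90, NA, 98, NA),
                 fall_grade_2 = c(NA, 60, NA, NA))

# Hypothetical helper: first non-NA element, or NA if every element is NA
first_non_na <- function(x) x[!is.na(x)][1]

combined <- sample %>%
  group_by(student) %>%
  # across(everything()) covers all non-grouping columns
  summarise(across(everything(), first_non_na))

combined
```

Indexing an empty vector with `[1]` returns an NA of the same type, which is why groups with only NA values still produce one row.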

Summarize data frame to return non-NA values along subsets

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.
I need to remove NAs from a data frame along data subsets and compress the remaining non-NA values into one row for each data subset.
Example:
#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values
dat <- as.data.frame(cbind(a, b, c, d))
> desired output
a b c d
1 B A 1
2 C A <NA>
Rules of thumb:
1) Need to remove NA values from each column
2) Loop along data subsets (column "a" in example above)
3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values
Ideas:
lapply or dplyr is probably helpful to loop along all columns
na.omit is likely helpful, if the subsetting column that has entries for all rows can be ignored (something like as.data.frame(lapply(dat.admin, na.omit))). The issue is returning the lapply output to a data frame if some subsets don't return any non-NA values
x[which.min(is.na(x))] effectively accomplishes this if laboriously applied to each individual column
Any help is appreciated to put the final pieces together! Thank you!
One solution uses dplyr::summarise_all. The data needs to be grouped by a.
library(dplyr)
dat %>%
group_by(a) %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
summarise_all(~ .[which.min(is.na(.))])
# # A tibble: 2 x 4
# a b c d
# <fctr> <fctr> <fctr> <fctr>
# 1 1 B A 1
# 2 2 C A <NA>
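The `x[which.min(is.na(x))]` idiom works because `is.na(x)` is FALSE (treated as 0) at non-NA positions, so `which.min` returns the index of the first non-NA value; if every value is NA, it returns 1 and the result is NA. A small sketch with made-up vectors `x` and `y`:

```r
x <- c(NA, NA, "B")
y <- c(NA_character_, NA_character_, NA_character_)

# is.na(x) is TRUE TRUE FALSE, so the first minimum is at position 3
first_x <- x[which.min(is.na(x))]

# is.na(y) is all TRUE, so which.min() falls back to position 1, which is NA
first_y <- y[which.min(is.na(y))]

first_x  # "B"
first_y  # NA
```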
Solution with data.table and na.omit
library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)
I think the merge statement can be improved
Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:
library(dplyr)
library(tidyr)
dat %>%
filter_at(vars(b:c), any_vars(!is.na(.))) %>%
group_by(a) %>%
fill(b) %>%
fill(c) %>%
filter_at(vars(b:c), all_vars(!is.na(.)))
# A tibble: 2 x 4
# Groups: a [2]
a b c d
<fctr> <fctr> <fctr> <fctr>
1 1 B A 1
2 2 C A NA
You could also use just dplyr:
dat %>%
group_by(a) %>%
# summarise_each() and funs() are deprecated; summarise_all() with a lambda is equivalent here
summarise_all(~ first(.[!is.na(.)]))
