Mutate based on conditions? - r

library(dplyr)
df <- data.frame(x1 = c("a","a","a","a","b","b","b","b"),ind = c("O","O","C","C","O","O","O","O"), num = c(6,12,18,24,6,12,18,24))
set.seed(1)
df <- df[sample(nrow(df)),]
df2 <- df %>% group_by(x1) %>%
arrange(x1,num)
> df2
# A tibble: 8 x 3
# Groups: x1 [2]
x1 ind num
<fct> <fct> <dbl>
1 a O 6
2 a O 12
3 a C 18
4 a C 24
5 b O 6
6 b O 12
7 b O 18
8 b O 24
I want to add two new columns to this data. For each unique value of x1, the first column should hold the minimum of num over the rows where ind equals C; for the value a this should return 18. The second column does the same for the rows where ind equals O. If a group has no matching rows, the result should be NA. The two columns should look like this:
x1 ind num min_O min_C
<fct> <fct> <dbl> <dbl> <dbl>
1 a O 6 6 18
2 a O 12 6 18
3 a C 18 6 18
4 a C 24 6 18
5 b O 6 6 NA
6 b O 12 6 NA
7 b O 18 6 NA
8 b O 24 6 NA
I've tried variations of grouping by the x1 and ind columns but couldn't get it to work, because I need the minimum only where ind equals a particular value. I am sure there is an easy way!

This looks a bit cumbersome but does the job
library(dplyr)
library(tidyr)
df2 %>%
group_by(x1, ind) %>%
pivot_wider(names_from = ind, values_from = num, values_fn = min, names_prefix = 'min_') %>%
left_join(df2, by = 'x1')
# A tibble: 8 x 5
# Groups: x1 [2]
x1 min_O min_C ind num
<chr> <dbl> <dbl> <chr> <dbl>
1 a 6 18 O 6
2 a 6 18 O 12
3 a 6 18 C 18
4 a 6 18 C 24
5 b 6 NA O 6
6 b 6 NA O 12
7 b 6 NA O 18
8 b 6 NA O 24

Another way could be
library(tidyr)
library(dplyr)
df %>%
arrange(x1,num) %>%
group_by(x1) %>%
mutate(min_C = min(num[ind == "C"]),
min_O = min(num[ind == "O"]),
across(starts_with("min"), ~ ifelse(.x == Inf, NA_real_, .x)))
which returns
# A tibble: 8 x 5
# Groups: x1 [2]
x1 ind num min_C min_O
<chr> <chr> <dbl> <dbl> <dbl>
1 a O 6 18 6
2 a O 12 18 6
3 a C 18 18 6
4 a C 24 18 6
5 b O 6 NA 6
6 b O 12 NA 6
7 b O 18 NA 6
8 b O 24 NA 6
but it also emits a warning, since there is no C in group b and min() of an empty vector returns Inf.
If you omit the across(...) part, min_C for group b is Inf instead of NA.
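One way to avoid both the warning and the Inf clean-up is to check for an empty subset before calling min(). The following is only a minimal sketch: min_or_na is a hypothetical helper written for this example (it is not part of dplyr), and it assumes num is numeric.

```r
library(dplyr)

df <- data.frame(
  x1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  ind = c("O", "O", "C", "C", "O", "O", "O", "O"),
  num = c(6, 12, 18, 24, 6, 12, 18, 24)
)

# Hypothetical helper: minimum of v, or NA when v is empty,
# so min() is never called on a zero-length vector.
min_or_na <- function(v) if (length(v) == 0) NA_real_ else min(v)

res <- df %>%
  group_by(x1) %>%
  mutate(min_O = min_or_na(num[ind == "O"]),
         min_C = min_or_na(num[ind == "C"])) %>%
  ungroup()

res
```

Because the helper always returns a double, the per-group results combine without type conflicts.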

Related

How to fill a column by group with sampled row numbers according to n per group

I am working with a data frame in R whose groups are given by the column Group1. I need to create a new column named sampled and fill it with a specific value on rows chosen by running sample within each group, over 1 to the number of rows in that group. Here is the data I have:
library(tidyverse)
#Data
dat <- data.frame(Group1=sample(letters[1:3],15,replace = T))
Then dat looks like this:
dat
Group1
1 b
2 a
3 a
4 c
5 c
6 c
7 a
8 b
9 c
10 b
11 a
12 b
13 c
14 c
15 c
In order to get the N per group, we do this:
#Code
dat %>%
arrange(Group1) %>%
group_by(Group1) %>%
mutate(N=n())
Which produces:
# A tibble: 15 x 2
# Groups: Group1 [3]
Group1 N
<chr> <int>
1 a 4
2 a 4
3 a 4
4 a 4
5 b 4
6 b 4
7 b 4
8 b 4
9 c 7
10 c 7
11 c 7
12 c 7
13 c 7
14 c 7
15 c 7
What I need to do next is the following. With the N per group, I have to draw a sample of 3 numbers from 1:N. For group a, which has N=4, that would be sample(1:4,3), which produces [1] 2 4 3. The rows of group a at the sampled positions must then be filled with 999. So for the first group we would have:
Group1 N sampled
<chr> <int> <int>
1 a 4 NA
2 a 4 999
3 a 4 999
4 a 4 999
And then the same for the rest of the groups, so that sample gives random positions per group. Is that possible using dplyr or the tidyverse? Many thanks!
You could try:
set.seed(3242)
library(dplyr)
dat %>%
arrange(Group1) %>%
add_count(Group1, name = 'N') %>%
group_by(Group1) %>%
mutate(
sampled = case_when(
row_number() %in% sample(1:n(), 3L) ~ 999L,
TRUE ~ NA_integer_
)
)
Output:
# A tibble: 15 × 3
# Groups: Group1 [3]
Group1 N sampled
<chr> <int> <int>
1 a 4 999
2 a 4 999
3 a 4 NA
4 a 4 999
5 b 4 999
6 b 4 999
7 b 4 999
8 b 4 NA
9 c 7 NA
10 c 7 999
11 c 7 NA
12 c 7 999
13 c 7 NA
14 c 7 NA
15 c 7 999
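The call sample(1:n(), 3L) fails for any group with fewer than three rows. The sketch below is one possible guard, not part of the answer above: it caps the sample size at the group size using sample.int(). The small dat here is made-up example data chosen to include such groups.

```r
library(dplyr)

set.seed(1)

# Made-up data: group "b" has 2 rows and group "c" has 1 row
dat <- data.frame(Group1 = c("a", "a", "a", "a", "b", "b", "c"))

res <- dat %>%
  group_by(Group1) %>%
  mutate(
    N = n(),
    # draw at most n() positions so small groups do not raise an error
    sampled = if_else(row_number() %in% sample.int(n(), min(3L, n())),
                      999L, NA_integer_)
  ) %>%
  ungroup()

res
```

With this guard, every row of a group with three or fewer rows is marked 999.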

Fill up missing values based on other entries on R

I have a dataset input with a couple of missing values, and I have to create a dataset output with the following logic:
If there is a missing value in any of the columns b, c, or d, then
check the corresponding a column and fill the missing value with the
corresponding value from that row for the specific column.
I tried to do that with _join functions from dplyr but was unsuccessful.
I can do it manually, but this option is off the table because I have a big dataset with multiple instances like that.
Input
library(dplyr)
input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
b = c(1:3, NA, rep(NA,4)),
c = c(21:28),
d = c(rep(NA,4), 54, NA, 34,11)) %>%
arrange(a)
Input view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
#1 A 1 21 NA
#2 A NA 25 54
#3 B 2 22 NA
#4 B NA 26 NA
#5 C 3 23 NA
#6 C NA 27 34
#7 D NA 24 NA
#8 D NA 28 11
Output - expected view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11
Use the function na.locf from the package zoo to carry the last observation forward, and then again in the opposite direction.
suppressPackageStartupMessages(library(dplyr))
input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
b = c(1:3, NA, rep(NA,4)),
c = c(21:28),
d = c(rep(NA,4), 54, NA, 34,11)) %>%
arrange(a)
input %>%
group_by(a) %>%
mutate(across(b:d, ~ zoo::na.locf(.x, na.rm = FALSE))) %>%
mutate(across(b:d, ~ zoo::na.locf(.x, na.rm = FALSE, fromLast = TRUE)))
#> # A tibble: 8 × 4
#> # Groups: a [4]
#> a b c d
#> <chr> <int> <int> <dbl>
#> 1 A 1 21 54
#> 2 A 1 25 54
#> 3 B 2 22 NA
#> 4 B 2 26 NA
#> 5 C 3 23 34
#> 6 C 3 27 34
#> 7 D NA 24 11
#> 8 D NA 28 11
Created on 2022-05-14 by the reprex package (v2.0.1)
This is hasty imputation:
library(dplyr)
input %>%
group_by(a) %>%
mutate(across(b:d, ~ if_else(is.na(.), na.omit(.)[1], .))) %>%
ungroup()
# # A tibble: 8 x 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11
I think the group_by(a) is fairly intuitive and makes sense. The "hasty" part of my first sentence is that we find the first non-NA value and use it. Other imputation techniques may use the average, median, previous valid data ("locf" as in Rui's answer), or random sampling.
The mice package specializes in imputation.
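For the "previous valid data" style of fill, tidyr also has fill(), which needs no extra package beyond the tidyverse. The following is a small sketch on the question's input; the "downup" direction fills downwards first and then upwards within each group.

```r
library(dplyr)
library(tidyr)

input <- tibble(a = rep(c("A", "B", "C", "D"), 2),
                b = c(1:3, NA, rep(NA, 4)),
                c = 21:28,
                d = c(rep(NA, 4), 54, NA, 34, 11)) %>%
  arrange(a)

filled <- input %>%
  group_by(a) %>%
  # fill each column down, then up, within every value of a
  fill(b:d, .direction = "downup") %>%
  ungroup()

filled
```

Groups whose column is entirely missing (b for D, d for B) stay NA, as in the expected output.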

collapse a dataframe in R that contains both numeric and character variables

I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4],6),
"date" = c(sapply(1:3, function(x) rep(x, 8))),
"num_var1"= 1:24,
"num_var2"= 24:1,
"alpha_var1" = LETTERS[1:24],
"alpha_var2" = LETTERS[25:2] )
and I would like to summarise its rows (taking the mean) by ag and date using dplyr. The issue is that some columns are character: for those, I would like to get the first entry per group (the example dataset is already sorted).
Since my dataset has several entries, I would like the code to be able to recognize whether a variable is numeric (including integers) or a character. However, the best solution that I have so far is the following one:
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), ~
if(is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
group_by(ag, date) %>%
summarise(across(where(is.numeric), mean),
across(where(is.character), first), .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F

Sum up tables results from multiple sheets into one table in R

I am reading an Excel file that has multiple sheets.
file_to_read <- "./file_name.xlsx"
# Get all names of sheets in the file
sheet_names <- readxl::excel_sheets(file_to_read)
# Loop through sheets
L <- lapply(sheet_names, function(x) {
all_cells <-
tidyxl::xlsx_cells(file_to_read, sheets = x)
})
L now holds all the sheets. Next, I need to combine the data from every sheet into one table. To be exact, I want to sum the values in matching rows and columns across the sheets.
Here is a simple example to make it clear.
For example, this table is in one sheet:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
The second table in the next sheet,
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, w = 8:12)
rownames(df2) <- LETTERS[3:7]
df2
M x y z w
C 1 2 3 8
D 2 3 4 9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12
My goal is to combine (sum) the matching records across all 100 tables in one Excel file to get one big table that has the total of each value.
The final table should be like this:
M x y z w
A 1 2 3 0
B 2 3 4 0
C 4 6 8 8
D 6 8 10 9
E 8 10 12 10
F 4 5 6 11
G 5 6 7 12
Is there a way to achieve this in R? I am not an expert in R, but I would like to know how to read all the sheets, compute the sums, and then save the output to a file.
Thank you
Since you have hundreds of sheets, I suggest importing them all into one single list in R, say my.list (as suggested in this link or this readxl documentation), and then following this strategy instead of binding the data frames two at a time:
df1 <- read.table(text = 'M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7', header = T)
df2 <- read.table(text = 'M x y z w
C 1 2 3 8
D 2 3 4 9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12', header = T)
library(tibble)
library(tidyverse)
my.list <- list(df1, df2)
map_dfr(my.list, ~.x)
#> M x y z w
#> 1 A 1 2 3 NA
#> 2 B 2 3 4 NA
#> 3 C 3 4 5 NA
#> 4 D 4 5 6 NA
#> 5 E 5 6 7 NA
#> 6 C 1 2 3 8
#> 7 D 2 3 4 9
#> 8 E 3 4 5 10
#> 9 F 4 5 6 11
#> 10 G 5 6 7 12
map_dfr(my.list, ~ .x) %>%
group_by(M) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
#> # A tibble: 7 x 5
#> M x y z w
#> <chr> <int> <int> <int> <int>
#> 1 A 1 2 3 0
#> 2 B 2 3 4 0
#> 3 C 4 6 8 8
#> 4 D 6 8 10 9
#> 5 E 8 10 12 10
#> 6 F 4 5 6 11
#> 7 G 5 6 7 12
Created on 2021-05-26 by the reprex package (v2.0.0)
One approach that will work is these steps:
read each sheet into a list
convert each sheet into a long format
bind into a single data frame
sum and group by over that long data frame
cast back to tabular format
That should work for N sheets with any combination of row and column headers in those sheets. E.g.
library(dplyr) # provides %>% and group_by(), which are used unqualified below
file <- "D:\\Book1.xlsx"
sheet_names <- readxl::excel_sheets(file)
sheet_data <- lapply(sheet_names, function(sheet_name) {
readxl::read_xlsx(path = file, sheet = sheet_name)
})
# use pivot_longer on each sheet to make long data
long_sheet_data <- lapply(sheet_data, function(data) {
long <- tidyr::pivot_longer(
data = data,
cols = !M,
names_to = "col",
values_to = "val"
)
})
# combine into a single tibble
long_data = dplyr::bind_rows(long_sheet_data)
# sum up matching pairs of `M` and `col`
summarised <- long_data %>%
group_by(M, col) %>%
dplyr::summarise(agg = sum(val))
# convert to a tabular format
tabular <- summarised %>%
tidyr::pivot_wider(
names_from = col,
values_from = agg,
values_fill = 0
)
tabular
I get this output with a spreadsheet using your initial inputs:
> tabular
# A tibble: 7 x 5
# Groups: M [7]
M x y z w
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 0
2 B 2 3 4 0
3 C 4 6 8 8
4 D 6 8 10 9
5 E 8 10 12 10
6 F 4 5 6 11
7 G 5 6 7 12
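The five steps above can also be checked without an Excel file. The sketch below replaces the readxl step with an in-memory list of two data frames built from the question's example; the list name sheets is an assumption standing in for the sheets read from the workbook.

```r
library(dplyr)
library(tidyr)

# Stand-in for the sheets read from the workbook
sheets <- list(
  data.frame(M = LETTERS[1:5], x = 1:5, y = 2:6, z = 3:7),
  data.frame(M = LETTERS[3:7], x = 1:5, y = 2:6, z = 3:7, w = 8:12)
)

# Reshape every sheet to long format: one row per (M, col) pair
long <- lapply(sheets, function(d) {
  pivot_longer(d, cols = !M, names_to = "col", values_to = "val")
})

total <- bind_rows(long) %>%
  group_by(M, col) %>%
  summarise(agg = sum(val), .groups = "drop") %>%
  # cells that never occur in any sheet become 0
  pivot_wider(names_from = col, values_from = agg, values_fill = 0)

total
```

Here .groups = "drop" returns an ungrouped tibble, which differs from the grouped output shown above only in the "Groups" header line.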
You could use dplyr and tidyr to get your desired result:
Let
df <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart2", "child", "tech", "unkn"), mother_chart2=1:4, mother_child=5:8, child_tech=9:12, child_unkn=13:16)
> df
subject modifier mother_chart2 mother_child child_tech child_unkn
1 Mother chart2 1 5 9 13
2 Mother child 2 6 10 14
3 Child tech 3 7 11 15
4 Child unkn 4 8 12 16
and
df2 <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart", "child", "tech", "unkn"), mother_chart=101:104, mother_child=105:108, child_tech=109:112, child_unkn=113:116)
> df2
subject modifier mother_chart mother_child child_tech child_unkn
1 Mother chart 101 105 109 113
2 Mother child 102 106 110 114
3 Child tech 103 107 111 115
4 Child unkn 104 108 112 116
Then
library(dplyr)
library(tidyr)
df2_tmp <- df2 %>%
pivot_longer(col=-c("subject", "modifier"))
df %>%
pivot_longer(col=-c("subject", "modifier")) %>%
full_join(df2_tmp, by=c("subject", "modifier", "name")) %>%
mutate(across(starts_with("value"), ~ replace_na(., 0)),
sum = value.x + value.y) %>%
select(-value.x, -value.y) %>%
pivot_wider(names_from=name, values_from=sum, values_fill=0)
returns
# A tibble: 5 x 7
subject modifier mother_chart2 mother_child child_tech child_unkn mother_chart
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mother chart2 1 5 9 13 0
2 Mother child 2 112 120 128 102
3 Child tech 3 114 122 130 103
4 Child unkn 4 116 124 132 104
5 Mother chart 0 105 109 113 101

Using dplyr: Within groups, select the first value meeting a condition

I need assistance obtaining a solution that will scan backwards in time and obtain the first value meeting a condition. I have data similar to:
set.seed(42)
df <- data.frame(
id = sample(LETTERS[1:3], 20, replace = TRUE),
time.var = sample(1:20, 20, replace = TRUE),
x = sample(c(1:10), 20, replace = TRUE)
)
df <- df[order(df$id, df$time.var),]
id time.var x
A 5 2
A 14 8
A 19 7
A 20 1
B 1 1
B 2 5
B 9 10
B 11 10
B 13 6
B 15 4
B 19 3
C 1 7
C 3 5
C 8 9
C 8 4
C 17 7
C 17 4
C 17 8
C 19 4
C 19 10
For the last member of each group defined in time order by time.var, I'd like to obtain the first value from x less than 5 by scanning in descending time order.
I have tried:
test <- df %>%
group_by(id) %>%
arrange(id, time.var) %>%
mutate(less.5 = which.max(x[x < 5]) )
What strategy can I use to obtain this type of output:
id time.var x previous.less.5
A 5 2
A 14 8
A 19 7
A 20 1 2
B 1 1
B 2 5
B 9 10
B 11 10
B 13 6
B 15 4
B 19 3 4
C 1 7
C 3 5
C 8 9
C 8 4
C 17 7
C 17 4
C 17 8
C 19 4
C 19 10 4
Using library(dplyr):
df %>%
arrange(id, time.var) %>%
group_by(id) %>%
mutate(previous.less.5 = tail(c(x[c((x[-n()] < 5), FALSE)]),1)) %>%
group_by(id) %>%
mutate(previous.less.5 = if_else(row_number() == n(), previous.less.5, NULL))
or
df %>%
arrange(id, time.var) %>%
group_by(id) %>%
slice(1:(n()-1)) %>%
filter(x < 5) %>%
slice(n()) %>%
select(-time.var) %>%
right_join(df, ., by="id", suffix =c("",".y")) %>%
group_by(id) %>%
mutate(previous.less.5 = if_else(row_number() == n(), x.y, NULL)) %>%
select(-x.y)
giving:
#> # A tibble: 20 x 4
#> # Groups: id [3]
#> id time.var x previous.less.5
#> <fct> <int> <int> <int>
#> 1 A 3 10 NA
#> 2 A 4 8 NA
#> 3 A 4 6 NA
#> 4 A 5 2 NA
#> 5 A 5 8 NA
#> 6 A 5 7 NA
#> 7 A 11 6 NA
#> 8 A 13 3 NA
#> 9 A 15 2 3
#> 10 B 2 1 NA
#> 11 B 4 3 NA
#> 12 B 4 6 NA
#> 13 B 8 5 NA
#> 14 B 8 4 NA
#> 15 B 20 7 4
#> 16 C 1 2 NA
#> 17 C 2 10 NA
#> 18 C 10 6 NA
#> 19 C 13 2 NA
#> 20 C 18 5 2
Update:
If there's a group with no record less than 5 (or where only the last record is less than 5), then the following works:
df %>%
arrange(id, time.var) %>%
group_by(id) %>%
mutate(previous.less.5 = if_else(row_number() == n(),
max(tail(c( x[ c( x[-n()] < 5, FALSE) ] ), 1)),
NULL)) %>%
mutate(previous.less.5 = replace(previous.less.5, is.infinite(previous.less.5), NA))
Data:
set.seed(42) # I am getting different data than what you've shown with this seed
df <- data.frame(
id = sample(LETTERS[1:3], 20, replace = TRUE),
time.var = sample(1:20, 20, replace = TRUE),
x = sample(c(1:10), 20, replace = TRUE)
)
df <- df[order(df$id, df$time.var),]
We can reverse the values of x within each id and get the first number that is less than 5 using which. The last replace assigns NA to all the values in previous.less.5 except the last one.
library(dplyr)
df %>%
#Data is already sorted by `id` and `time.var`, but if you still need to sort, use
#arrange(id, time.var) %>%
group_by(id) %>%
mutate(rev_x = c(NA, rev(x)[-1]), previous.less.5 = rev_x[which(rev_x < 5)[1]],
previous.less.5 = replace(previous.less.5, row_number() != n(), NA)) %>%
select(-rev_x)
# id time.var x previous.less.5
# <fct> <int> <int> <int>
# 1 A 5 2 NA
# 2 A 14 8 NA
# 3 A 19 7 NA
# 4 A 20 1 2
# 5 B 1 1 NA
# 6 B 2 5 NA
# 7 B 9 10 NA
# 8 B 11 10 NA
# 9 B 13 6 NA
#10 B 15 4 NA
#11 B 19 3 4
#12 C 1 7 NA
#13 C 3 5 NA
#14 C 8 9 NA
#15 C 8 4 NA
#16 C 17 7 NA
#17 C 17 4 NA
#18 C 17 8 NA
#19 C 19 4 NA
#20 C 19 10 4
This should also handle the case where an id has no value less than 5, returning NA for that id.
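The same backward scan can be written as a small function and applied per group, which may be easier to read and test. This is only a sketch: last_below is a hypothetical helper name, the data are typed in from the table shown in the question, and x is assumed to be an integer column already sorted by id and time.var.

```r
library(dplyr)

# Data typed in from the question's table (already sorted by id, time.var)
df <- data.frame(
  id = c(rep("A", 4), rep("B", 7), rep("C", 9)),
  time.var = c(5, 14, 19, 20,
               1, 2, 9, 11, 13, 15, 19,
               1, 3, 8, 8, 17, 17, 17, 19, 19),
  x = c(2L, 8L, 7L, 1L,
        1L, 5L, 10L, 10L, 6L, 4L, 3L,
        7L, 5L, 9L, 4L, 7L, 4L, 8L, 4L, 10L)
)

# Hypothetical helper: ignoring the last element of v, return the latest
# value below `limit`, or NA if there is none.
last_below <- function(v, limit = 5) {
  prev <- v[-length(v)]
  hits <- prev[prev < limit]
  if (length(hits) == 0) NA_integer_ else hits[length(hits)]
}

res <- df %>%
  group_by(id) %>%
  # only the last row of each group receives the value
  mutate(previous.less.5 = if_else(row_number() == n(),
                                   last_below(x), NA_integer_)) %>%
  ungroup()

res
```

Returning NA_integer_ from the helper keeps the column type consistent when a group has no earlier value below the limit.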
