R dplyr full_join(x, y): fill NA with values from x

I would like to join two data frames:
library(dplyr)
set.seed(666)
x <- data.frame(id = 1:10, c1 = rnorm(10), c2 = rnorm(10), c3 = rnorm(10))
y <- data.frame(id = 1:10, c1 = rnorm(10))
joined <- x |>
full_join(y) |>
arrange(id)
What is an elegant way to fill NAs of the new rows from y with the values from the columns of x?
target:
id c1 c2 c3
1 1 0.75331105 2.15004262 -0.69209929
2 1 0.75499616 2.15004262 -0.69209929
3 2 2.01435467 -1.77023084 -1.18304354
4 2 -0.64148890 -1.77023084 -1.18304354
...
EDIT: tidyr::fill() works fine but appears to be extremely slow on moderately large data sets (e.g. >100k rows, >20 columns). I would be happy to see a data.table alternative.

Add one more line of code: fill() from tidyr, applied to the columns you need to fill.
Edit: grouping isn't necessary in this situation, since after arrange(id) each id group starts with a non-missing value (the row that came from x).
library(tidyr)  # fill() comes from tidyr, not dplyr
joined <- x |>
  full_join(y) |>
  arrange(id) |>
  fill(c2:c3, .direction = "down")
id c1 c2 c3
<int> <dbl> <dbl> <dbl>
1 1 -0.0822 1.18 -0.889
2 1 1.58 1.18 -0.889
3 2 0.120 0.0288 0.278
4 2 1.64 0.0288 0.278
5 3 0.0213 -0.166 -1.20
6 3 -0.404 -0.166 -1.20
7 4 -0.274 -1.53 -0.660
8 4 -0.0456 -1.53 -0.660
9 5 -0.881 -0.335 -1.02
10 5 -2.47 -0.335 -1.02
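Regarding the request in the question's edit for a data.table alternative, here is a minimal sketch, not a benchmarked solution. It assumes, as in the example, that no row of y matches a row of x on both id and c1, so the full join amounts to stacking the two tables. It also assumes data.table >= 1.12.4 for nafill(). The vector `cols` is just an illustrative name.

```r
library(data.table)

set.seed(666)
x <- data.table(id = 1:10, c1 = rnorm(10), c2 = rnorm(10), c3 = rnorm(10))
y <- data.table(id = 1:10, c1 = rnorm(10))

# Stack x and y; columns missing from y (c2, c3) are filled with NA.
joined <- rbindlist(list(x, y), fill = TRUE)

# Sort by id; the sort is stable, so within each id the row from x stays first.
setorder(joined, id)

# Carry the last observed value forward in the columns that need filling.
cols <- c("c2", "c3")
joined[, (cols) := lapply(.SD, nafill, type = "locf"), .SDcols = cols]
```

Because nafill() works column by column without grouping, this relies on the same observation as the tidyr answer: every id block starts with a complete row.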

Related

Using group_modify with selected columns (retaining whole data frame and order)

I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except that it removes all columns except the numerical ones AND reorders the rows, which makes the result hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_modify to specified columns while retaining the other columns and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
name=letters[1:10],
csize=c("L","S","S","L","L","S","L","S","L","S"),
v1=rnorm(10),
v2=rnorm(10),
v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(.dots = groupvar) %>%
mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
Does this work:
> library(dplyr)
> library(tidyr)  # replace_na() comes from tidyr
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = TRUE))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476
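Since the grouping variable in the question is stored as a string, here is a small sketch that combines the two answers without relying on the .dots argument, using across(all_of()) both for the grouping column and for the columns to impute. It assumes dplyr >= 1.0.0 and tidyr; the toy values are made up so that the group means are easy to check by hand.

```r
library(dplyr)
library(tidyr)

# Toy data (made up): group "L" has v1 mean 3, group "S" has v1 mean 5.
df <- tibble(
  name  = letters[1:6],
  csize = c("L", "S", "S", "L", "L", "S"),
  v1    = c(1, NA, 3, NA, 5, 7),
  v2    = c(2, 4, 6, 8, 10, 12)
)
df_names <- c("v1", "v2")   # columns to impute, given as strings
groupvar <- "csize"         # grouping column, given as a string

df_imp <- df %>%
  group_by(across(all_of(groupvar))) %>%
  mutate(across(all_of(df_names), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
```

Because mutate() is used instead of select() plus group_modify(), the name column and the original row order are kept.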

Generating a column with the average value of rows before and after the row index

Given some data like the following:
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# A tibble: 12 x 2
# class value
# <chr> <dbl>
# 1 a -1.21
# 2 a 0.277
# 3 a 1.08
# 4 a -2.35
# 5 a 0.429
# 6 a 0.506
# 7 b 0.943
# 8 b 0.945
# 9 b 0.944
#10 b 0.911
#11 b 0.952
#12 b 0.900
I'm trying to generate a new column (context) that contains the average of "value" of the X preceding and posterior rows, when possible. It would be desirable to have this by level of a factor in a different column. For example, for X=2, I would expect something like the following:
# A tibble: 12 x 3
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.7135
# 4 a -2.35 0.573
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.9377
#10 b 0.911 0.9278
#11 b 0.952 NA
#12 b 0.900 NA
Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X=2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.
Similarly, rows 5 and 6 do not have a value of context, because these do not have two values afterwards belonging to the same level of the factor "class" (because row 7 is class="b" while 5 and 6 are class="a").
I do not know if this is even possible in R. I haven't found any similar questions, and I could only come up with a solution like the following, which I don't think is idiomatic for this language.
My solution:
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
  tmp <- df_list[[i]]
  context <- vector()
  for (j in 1:nrow(tmp)) {
    if (j<=X | j>nrow(tmp)-X) context <- c(context, NA)
    else {
      values <- vector()
      for (k in 1:X) {
        values <- c(values, tmp$value[j-k], tmp$value[j+k])
      }
      context <- c(context, mean(values))
    }
  }
  tmp <- tmp %>% dplyr::mutate(context=context)
  result <- result %>% dplyr::bind_rows(tmp)
}
This gives approximately the result shown above (differences are due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions designed to solve tasks like this one (e.g. vectorized functions)?
# this is your dataframe
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# pipes ('%>%') and grouping from the dplyr package
library(tidyverse)
# rolling mean function from the zoo package
library(zoo)
df %>% # take df
group_by(class) %>% # group it by class
mutate(context = (rollsum(value, 5, fill = NA) - value) / 4) # and calculate the rolling mean
Basically, you calculate a rolling sum with a window width of 5 that is centered (the default) and fill the remaining positions with NA. Since the value of the row itself must not be included in the average, it is subtracted from the sum before dividing by 4.
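The same idea can be wrapped in a small helper so that columns for several window sizes can be added at once, which the question asks about. This is only a sketch: `context_mean` is a hypothetical helper name, and it assumes every group has at least 2 * X + 1 rows.

```r
library(dplyr)
library(zoo)

set.seed(1234)
df <- tibble(
  class = rep(c("a", "b"), each = 6),
  value = c(rnorm(n = 6, mean = 0, sd = 1), rnorm(n = 6, mean = 1, sd = 0.1))
)

# Mean of the X values before and the X values after each position:
# centered rolling sum over 2 * X + 1 values, minus the value itself,
# divided by the 2 * X neighbours. Positions without a full window are NA.
context_mean <- function(v, X) {
  (zoo::rollsum(v, k = 2 * X + 1, fill = NA, align = "center") - v) / (2 * X)
}

res <- df %>%
  group_by(class) %>%
  mutate(
    context_1 = context_mean(value, 1),
    context_2 = context_mean(value, 2)
  ) %>%
  ungroup()
```

Grouping by class before mutate() keeps the windows from crossing the boundary between the two classes.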
One way using dplyr and purrr:
n <- 2
library(dplyr)
library(purrr)  # map_dbl() comes from purrr
df %>%
  group_by(class) %>%
  mutate(context = map_dbl(row_number(), ~ if (.x <= n | .x > (n() - n))
    NA_real_ else mean(value[c((.x - n):(.x - 1), (.x + 1):(.x + n))])))
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.712
# 4 a -2.35 0.574
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.938
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Here is a base R solution using ave(), i.e.,
df <- within(df,
  context <- ave(value,
    class,
    FUN = function(v, X = 2) sapply(seq_along(v), function(k) ifelse(k - X < 1 | k + X > length(v), NA, mean(v[c(k - (X:1), k + (1:X))])))))
such that
> df
# A tibble: 12 x 3
class value context
<chr> <dbl> <dbl>
1 a -1.21 NA
2 a 0.277 NA
3 a 1.08 -0.712
4 a -2.35 0.574
5 a 0.429 NA
6 a 0.506 NA
7 b 0.943 NA
8 b 0.945 NA
9 b 0.944 0.938
10 b 0.911 0.935
11 b 0.952 NA
12 b 0.900 NA

How to compute multiple new columns in an R dataframe with dynamic names

I'm trying to generate multiple new columns/variables in an R dataframe with dynamic new names taken from a vector. The new variables are computed from groups/levels of a single column.
The dataframe contains measurements (counts) of different chemical elements (element) along depth (z). The new variables are computed by dividing the counts of each element at a certain depth by the respective counts of proxy elements (proxies) at the same depth.
There is already a solution using mutate that works if I only want to create one new column/name the columns explicitly (see code below). I'm looking for a generalised solution to use in a shiny web app where proxies is not a string but a vector of strings and is dynamically changing according to user input.
# Working code for just one new column at a time (here Ti_ratio)
proxies <- "Ti"
df <- tibble(z = rep(1:10, 4), element = rep(c("Ag", "Fe", "Ca", "Ti"), each = 10), counts = rnorm(40))
df_Ti <- df %>%
group_by(z) %>%
mutate(Ti_ratio = counts/counts[element %in% proxies])
# Not working code for multiple columns at a time
proxies <- c("Ca", "Fe", "Ti")
varname <- paste(proxies, "ratio", sep = "_")
df_ratios <- df %>%
group_by(z) %>%
map(~ mutate(!!varname = .x$counts/.x$counts[element %in% proxies]))
Output of working code:
> head(df_Ti)
# A tibble: 6 x 4
# Groups: z [6]
z element counts Ti_ratio
<int> <chr> <dbl> <dbl>
1 1 Ag 2.41 4.10
2 2 Ag -1.06 -0.970
3 3 Ag -0.312 -0.458
4 4 Ag -0.186 0.570
5 5 Ag 1.12 -1.38
6 6 Ag -1.68 -2.84
Expected output of not working code:
> head(df_ratios)
# A tibble: 6 x 6
# Groups: z [6]
z element counts Ca_ratio Fe_ratio Ti_ratio
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Ag 2.41 4.78 -10.1 4.10
2 2 Ag -1.06 3.19 0.506 -0.970
3 3 Ag -0.312 -0.479 -0.621 -0.458
4 4 Ag -0.186 -0.296 -0.145 0.570
5 5 Ag 1.12 0.353 3.19 -1.38
6 6 Ag -1.68 -2.81 -0.927 -2.84
Edit:
I found a general solution to my problem with base R using two nested for-loops, similar to the answer posted by @fra (the difference being that here I loop over both the depth and the proxies):
library(tidyverse)
df <- tibble(z = rep(1:3, 4), element = rep(c("Ag", "Ca", "Fe", "Ti"), each = 3), counts = runif(12)) %>% arrange(z, element)
proxies <- c("Ca", "Fe", "Ti")
for (f in seq_along(proxies)) {
  proxy <- proxies[f]
  tmp2 <- NULL
  for (i in unique(df$z)) {
    tmp <- df[df$z == i,]
    tmp <- as.data.frame(tmp$counts/tmp$counts[tmp$element %in% proxy])
    names(tmp) <- paste(proxy, "ratio", sep = "_")
    tmp2 <- rbind(tmp2, tmp)
  }
  df[, 3 + f] <- tmp2
}
And the correct output:
> head(df)
# A tibble: 6 x 6
z element counts Ca_ratio Fe_ratio Ti_ratio
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Ag 0.690 0.864 9.21 1.13
2 1 Ca 0.798 1 10.7 1.30
3 1 Fe 0.0749 0.0938 1 0.122
4 1 Ti 0.612 0.767 8.17 1
5 2 Ag 0.687 0.807 3.76 0.730
6 2 Ca 0.851 1 4.66 0.904
I made the dataframe contain less data so that it's clearly visible why this solution is correct (Ratios of elements with themselves = 1).
I'm still interested in a more elegant solution that I could use with pipes.
A tidyverse option could be to create a function, similar to your original code and then pass that through using map_dfc to create new columns.
library(tidyverse)
proxies <- c("Ca", "Fe", "Ti")
your_func <- function(x){
df %>%
group_by(z) %>%
mutate(!!paste(x, "ratio", sep = "_") := counts/counts[element %in% !!x]) %>%
ungroup() %>%
select(!!paste(x, "ratio", sep = "_") )
}
df %>%
group_modify(~map_dfc(proxies, your_func)) %>%
bind_cols(df, .) %>%
arrange(z, element)
# z element counts Ca_ratio Fe_ratio Ti_ratio
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 Ag -0.112 -0.733 -0.197 -1.51
# 2 1 Ca 0.153 1 0.269 2.06
# 3 1 Fe 0.570 3.72 1 7.66
# 4 1 Ti 0.0743 0.485 0.130 1
# 5 2 Ag 0.881 0.406 -6.52 -1.49
# 6 2 Ca 2.17 1 -16.1 -3.69
# 7 2 Fe -0.135 -0.0622 1 0.229
# 8 2 Ti -0.590 -0.271 4.37 1
# 9 3 Ag 0.398 0.837 0.166 -0.700
#10 3 Ca 0.476 1 0.198 -0.836
# ... with 30 more rows
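Another pipe-friendly possibility, sketched here under the assumption of dplyr >= 1.0.0 (for glue-style names on the left of :=) and purrr, is to fold over the proxies with reduce(), adding one ratio column per step. `add_ratio` is a hypothetical helper, and each z group is assumed to contain every proxy exactly once.

```r
library(dplyr)
library(purrr)

set.seed(42)
df <- tibble(
  z       = rep(1:3, 4),
  element = rep(c("Ag", "Ca", "Fe", "Ti"), each = 3),
  counts  = runif(12)
)
proxies <- c("Ca", "Fe", "Ti")

# Add one "<proxy>_ratio" column to a data frame that is grouped by z.
# `proxy` is a plain string; it is not a column, so it is looked up in the
# function environment.
add_ratio <- function(data, proxy) {
  mutate(data, "{proxy}_ratio" := counts / counts[element == proxy])
}

# Start from the grouped data and apply add_ratio() once per proxy.
df_ratios <- reduce(proxies, add_ratio, .init = group_by(df, z)) %>%
  ungroup()
```

Since proxies is an ordinary character vector, it can come straight from user input, for example in a shiny app, without changing the pipeline.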
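Using base R
proxies <- c("Ca", "Fe", "Ti")
for (f in proxies) {
  # Note: this relies on vector recycling, so it assumes every element block
  # lists the same z values in the same order (as in the example data).
  newDF <- as.data.frame(df$counts / df$counts[df$element %in% f])
  names(newDF) <- paste(f, "ratio", sep = "_")
  df <- cbind(df, newDF)
}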
> df
z element counts Ca_ratio Fe_ratio Ti_ratio
1 1 Ag -0.40163072 -0.35820754 1.7375395 0.45692965
2 2 Ag -1.00880171 1.27798430 22.8520332 -2.84599471
3 3 Ag 0.72230855 -1.19506223 6.3893485 -0.73558507
4 4 Ag -1.71524002 -1.38942436 1.7564861 -3.03313134
5 5 Ag -0.30813737 1.08127226 4.1985801 -0.33008370
6 6 Ag 0.20524663 0.08910397 -0.3132916 -0.23778331
...

summarise dplyr with dynamic columns? [duplicate]

I have some R code that does what I want. But now the question:
Is there any mechanism to avoid coding A1, A2, A3 and so on? I would like to write something like A* for all columns beginning with A. There can be any number of "A" columns, depending on the length of a list defined elsewhere in the code. The rest of the code is dynamic, but here I have to intervene manually (adding or deleting A columns within the summarise statement).
I have found summarize_at, but I don't see how I can apply other functions like last() and sum() to the other columns at the same time.
l_af <- l_cf %>%
group_by(PID, Server) %>%
summarise(Player=last(Player),
Guild=last(Guild),
Points=last(Points),
Battles=last(Battles),
A1=max(A1),
A2=max(A2),
A3=max(A3),
A4=max(A4),
A5=max(A5),
A6=max(A6),
RecCount=sum(RecCount))
Any help is appreciated.
The problem with summarise is that it drops all columns that are not used in it. You could first perform all the operations with mutate and then use summarise.
library(dplyr)
l_cf %>%
group_by(PID, Server) %>%
mutate_at(vars(Player,Guild,Points,Battles), last) %>%
mutate_at(vars(starts_with("A")), max) %>%
mutate(RecCount = sum(RecCount)) %>%
summarise_all(max)
A reproducible example
set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
a1 = runif(10), a2 = runif(10), z = runif(10))
First applying functions individually for each column
df %>%
group_by(group) %>%
summarise(x=last(x),
y=last(y),
a1=max(a1),
a2=max(a2),
z=sum(z))
# A tibble: 5 x 6
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
Now apply the functions together for multiple columns
df %>%
group_by(group) %>%
mutate_at(vars(x, y), last) %>%
mutate_at(vars(starts_with("a")), max) %>%
mutate(z = sum(z)) %>%
summarise_all(max)
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
We can see that both approaches give the same output.
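With dplyr >= 1.0.0, a sketch of the same result that skips the intermediate mutate steps is to put several across() calls inside a single summarise(). This is offered as an alternative under that version assumption, not as the method used in the answer above; it rebuilds the df from the reproducible example.

```r
library(dplyr)

set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
                 a1 = runif(10), a2 = runif(10), z = runif(10))

res <- df %>%
  group_by(group) %>%
  summarise(
    across(c(x, y), last),           # last value per group
    across(starts_with("a"), max),   # max of every column starting with "a"
    z = sum(z),                      # sum of z
    .groups = "drop"
  )
```

starts_with("a") picks up however many a* columns exist, so nothing has to be edited by hand when columns are added or removed.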

Order data frame by the last column with dplyr

library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(colnames(df) %>% tail(1) %>% desc())
I am looping over a list of data frames. There are different columns in the data frames and the last column of each may have a different name.
I need to arrange every data frame by its last column. The simple case looks like the above code.
Using arrange_at and ncol:
df %>% arrange_at(ncol(.), desc)
As arrange_at will be deprecated in the future, you could also use:
# option 1
df %>% arrange(desc(.[[ncol(.)]]))
# option 2
df %>% arrange(across(ncol(.), desc))
If we need to arrange by the last column name, either use the name string
df %>%
arrange_at(vars(last(names(.))), desc)
Or specify the index
df %>%
arrange_at(ncol(.), desc)
The new dplyr way (I guess from 1.0.0 on) would be using across(last_col()):
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(across(last_col(), desc))
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.283 0.443 1.30 0.910
#> 2 0.797 -0.0819 -0.936 0.828
#> 3 0.0717 -0.858 -0.355 0.671
#> 4 -1.38 -1.08 -0.472 0.426
#> 5 1.52 1.43 -0.0593 0.249
#> 6 0.827 -1.28 1.86 0.0824
#> 7 -0.448 0.0558 -1.48 -0.143
#> 8 0.377 -0.601 0.238 -0.918
#> 9 0.770 1.93 1.23 -1.43
#> 10 0.0532 -0.0934 -1.14 -2.08
> packageVersion("dplyr")
#> [1] ‘1.0.4’
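For the use case in the question, looping over a list of data frames whose last columns have different names, the same call can be applied to each element. A minimal sketch with two made-up tibbles (`dfs` and its contents are purely illustrative), assuming dplyr >= 1.0.0:

```r
library(dplyr)

# Two data frames with different columns and differently named last columns.
dfs <- list(
  tibble(a = c(1, 2, 3), b = c(10, 30, 20)),
  tibble(x = c("p", "q", "r"), y = c(0, 0, 0), z = c(5, 9, 7))
)

# Sort every data frame by its own last column, in descending order.
sorted <- lapply(dfs, function(d) arrange(d, across(last_col(), desc)))
```

last_col() is resolved separately for each data frame, so no column name has to be passed around.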

Resources