summarise dplyr with dynamic columns? [duplicate] - r

I have some R code that does what I want it to do, but I have a question:
Is there any mechanism to avoid coding A1, A2, A3 and so on? I would like to write something like A* for all columns beginning with A. There can be any number of "A" columns, depending on the length of a list that is defined in the code. The rest of the code is dynamic, but here I have to intervene manually (adding or deleting A columns within the summarise statement).
I have found summarize_at, but I don't see how I can apply last() and sum() to the other columns at the same time.
l_af <- l_cf %>%
  group_by(PID, Server) %>%
  summarise(Player=last(Player),
            Guild=last(Guild),
            Points=last(Points),
            Battles=last(Battles),
            A1=max(A1),
            A2=max(A2),
            A3=max(A3),
            A4=max(A4),
            A5=max(A5),
            A6=max(A6),
            RecCount=sum(RecCount))
Any help is appreciated.

The problem with summarise is that it removes all columns that are not used in it. You can instead use mutate to perform all the operations first and then use summarise.
library(dplyr)
l_cf %>%
  group_by(PID, Server) %>%
  mutate_at(vars(Player,Guild,Points,Battles), last) %>%
  mutate_at(vars(starts_with("A")), max) %>%
  mutate(RecCount = sum(RecCount)) %>%
  summarise_all(max)
A reproducible example
set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
                 a1 = runif(10), a2 = runif(10), z = runif(10))
First applying functions individually for each column
df %>%
  group_by(group) %>%
  summarise(x=last(x),
            y=last(y),
            a1=max(a1),
            a2=max(a2),
            z=sum(z))
# A tibble: 5 x 6
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
Now apply the functions together for multiple columns
df %>%
  group_by(group) %>%
  mutate_at(vars(x, y), last) %>%
  mutate_at(vars(starts_with("a")), max) %>%
  mutate(z = sum(z)) %>%
  summarise_all(max)
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
We can see that both the approaches gave the same output.
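In dplyr 1.0.0 and later, the `_at` and `_all` variants are superseded by across(), which can be used several times inside a single summarise() call. The following is a minimal sketch on the same example data, assuming a dplyr version that provides across(); the name `res` is only illustrative.

```r
library(dplyr)

set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
                 a1 = runif(10), a2 = runif(10), z = runif(10))

# One summarise() call: a different function for each set of columns.
res <- df %>%
  group_by(group) %>%
  summarise(across(c(x, y), last),          # last value of x and y
            across(starts_with("a"), max),  # max of every column starting with "a"
            z = sum(z))                     # sum of z
res
```

Because starts_with("a") is evaluated against the data, columns can be added or removed without touching the summarise() call.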

Related

Calculate variable from data.frame using dplyr

I'm sure this is obvious, but I can't figure it out.
I have a data.frame, and want to calculate a variable from several types.
df = data.frame(time = rep(seq(10),each=2),Type=rep(c("A","B"),times=10),value = runif(20))
I want a new data.frame, with A / B for each time point.
I've tried:
df2 <- df |> group_by(time) |> mutate(new_value= value[Type=="A"] / value[Type=="B"],.keep="none")
But I still have a new_value twice for each time.
A better option may be to reshape to 'wide' format with pivot_wider and then create the column
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = Type, values_from = value) %>%
  mutate(new_value = A/B)
-output
# A tibble: 10 × 4
time A B new_value
<int> <dbl> <dbl> <dbl>
1 1 0.565 0.913 0.618
2 2 0.902 0.274 3.29
3 3 0.321 0.986 0.326
4 4 0.620 0.937 0.661
5 5 0.467 0.407 1.15
6 6 0.659 0.152 4.33
7 7 0.573 0.239 2.40
8 8 0.962 0.601 1.60
9 9 0.515 0.403 1.28
10 10 0.880 0.364 2.42
mutate creates or modifies a column in the original dataset, thus it returns the same number of rows. Instead, it may be better to use summarise if we want unique values (but here the 'Type' will be lost)
df |>
  group_by(time) |>
  summarise(new_value= value[Type=="A"] / value[Type=="B"])
In addition, this works only when each 'time' has the same number of 'A' and 'B' elements.
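As a quick check, the two approaches can be compared on data shaped like the question's. This is only a sketch: the seed and the names `by_summarise` and `by_pivot` are assumptions made for illustration, and each time point has exactly one "A" and one "B" row.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only to make the sketch repeatable
df <- data.frame(time = rep(seq(10), each = 2),
                 Type = rep(c("A", "B"), times = 10),
                 value = runif(20))

# One row per time point, computed directly within each group
by_summarise <- df %>%
  group_by(time) %>%
  summarise(new_value = value[Type == "A"] / value[Type == "B"])

# Same ratio after reshaping to one column per Type
by_pivot <- df %>%
  pivot_wider(names_from = Type, values_from = value) %>%
  mutate(new_value = A / B)

all.equal(by_summarise$new_value, by_pivot$new_value)
```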

problem with `replace_na()` from tidyr package

I wrote a function that has five arguments to calculate random numbers from a normal distribution. It has two steps:
replace NA with 0 in tibble column
replace 0 with a random number
My problems are:
line three doesn't replace NA value with 0
line five doesn't replace 0 with a random number
I have this error :
! Must subset columns with a valid subscript vector.
x Subscript `col` has the wrong type `function`.
It must be logical, numeric, or character.
here is my code :
whithout=function(col,min,max,mean,sd){
  for(i in 1:4267){
    continuous_dataset=continuous_dataset %>% replace_na(continuous_dataset[,col]=0)
    if(is.na(continuous_dataset[,col])){
      continuous_dataset[i,col]=round(rtruncnorm(1,min,max,mean,sd))
    }
  }
}
There's no need to write a function that loops across both columns and observations.
I assume you have no zeroes in your dataset to begin with, in which case I can skip replacing NA with 0 and go straight to generating the replacement value.
My solution is based on the tidyverse.
First, generate some test data.
library(tidyverse)
set.seed(123)
df <- tibble(x=runif(5), y=runif(5), z=runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA
df
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 NA 0.892 0.678
4 0.883 NA 0.573
5 0.940 0.457 NA
Now solve the problem.
df %>%
  mutate(
    across(
      everything(),
      function(.x, mean, sd) .x <- ifelse(is.na(.x), rnorm(nrow(.), mean, sd), .x),
      mean=500,
      sd=100
    )
  )
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 467.
By avoiding looping through columns and rows, the code is more compact, more robust and (though I've not tested) faster.
If you don't want to process every column, simply replace everything() with a vector of columns that you do want to process. For example
df %>%
  mutate(
    across(
      c(x, y),
      function(.x, mean, sd) .x <- ifelse(is.na(.x), rnorm(nrow(.), mean, sd), .x),
      mean=500,
      sd=100
    )
  )
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 NA
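The same idea can be written with a small named helper, which avoids relying on the pipe placeholder `.` and on passing extra arguments through across() (the latter is deprecated in recent dplyr releases). This is a sketch only: `fill_na_normal` is a hypothetical helper name, and it draws from an ordinary normal distribution via rnorm() rather than the truncated normal used in the question.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: replace each NA in x with its own draw from N(mean, sd).
fill_na_normal <- function(x, mean, sd) {
  missing <- is.na(x)
  x[missing] <- rnorm(sum(missing), mean, sd)
  x
}

set.seed(123)
df <- tibble(x = runif(5), y = runif(5), z = runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA

filled <- df %>%
  mutate(across(everything(),
                function(col) fill_na_normal(col, mean = 500, sd = 100)))
filled
```

Only as many random numbers are generated as there are missing values, and the non-missing values are left untouched.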

Using group_modify with selected columns (retaining whole data frame and order)

I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except it removes all columns except the numerical ones, AND reorders everything, which makes it hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_modify to specified columns, while retaining the other columns and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
  name=letters[1:10],
  csize=c("L","S","S","L","L","S","L","S","L","S"),
  v1=rnorm(10),
  v2=rnorm(10),
  v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(.dots = groupvar) %>%
  mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
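The `.dots` argument of group_by() is deprecated in current dplyr, so when the grouping column is held in a string, grouping with across(all_of()) is a safer choice. The sketch below rebuilds the question's data with an arbitrary seed (an assumption, since the question sets none) and stores the result in `df_imp`.

```r
library(dplyr)
library(tibble)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
df <- tibble(
  name  = letters[1:10],
  csize = c("L","S","S","L","L","S","L","S","L","S"),
  v1 = rnorm(10),
  v2 = rnorm(10),
  v3 = rnorm(10)
)
df$v1[c(3, 6, 7)] <- NA
df$v3[2] <- NA

df_names <- c("v1", "v2", "v3")
groupvar <- "csize"

# Group by the column named in groupvar, fill NAs with the group mean,
# and keep every column and the original row order.
df_imp <- df %>%
  group_by(across(all_of(groupvar))) %>%
  mutate(across(all_of(df_names),
                ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()
df_imp
```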
Does this work:
> library(dplyr)
> library(tidyr) # replace_na() comes from tidyr
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = T))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476

pivot_longer but for multiple sets of columns [duplicate]

I have a data frame as follows:
Z1 Z2 Y1 Y2 smpl
0.451 0.333 0.034 0.173 A
0.491 0.270 0.033 0.207 B
Columns Z1 and Z2 are multiple observations from the same sample. Same goes for Y1 and Y2.
I need to merge columns Z1 and Z2, and columns Y1 and Y2. yielding the following:
Z Y smpl
0.451 0.034 A
0.333 0.173 A
0.491 0.033 B
0.270 0.207 B
I am sure the solution is trivial, but... Well, I am a bit of a n00b, and hence I am also stuck. How can I do this?
You can use:
tidyr::pivot_longer(df, cols = -smpl,
                    names_to = '.value',
                    names_pattern = '([A-Z])')
# A tibble: 4 x 3
# smpl Z Y
# <chr> <dbl> <dbl>
#1 A 0.451 0.034
#2 A 0.333 0.173
#3 B 0.491 0.033
#4 B 0.27 0.207
You can try:
library(tidyverse)
df %>%
  pivot_longer(matches("Z|Y")) %>%
  separate(name, letters[1:2], sep = 1) %>%
  pivot_wider(names_from = a, values_from = value)
# A tibble: 4 x 4
smpl b Z Y
<fct> <chr> <dbl> <dbl>
1 A 1 0.451 0.034
2 A 2 0.333 0.173
3 B 1 0.491 0.033
4 B 2 0.27 0.207
Or, more simply:
df %>%
  pivot_longer(matches("Z|Y"),
               names_to = c(".value", "b"),
               names_sep = 1)
If you don't need the information about the subgroups of Y and Z, you can finish the pipeline with select(-b).
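For a self-contained run, the question's two-row data frame can be rebuilt by hand. In this sketch the column name `obs` (for the numeric suffix) and the result name `long` are illustrative choices, not names from the original answers.

```r
library(tidyr)

# The data frame from the question, typed in by hand
df <- data.frame(Z1 = c(0.451, 0.491), Z2 = c(0.333, 0.270),
                 Y1 = c(0.034, 0.033), Y2 = c(0.173, 0.207),
                 smpl = c("A", "B"))

# Split each column name after its first character: the letter becomes
# an output column (via ".value"), the digit goes into "obs".
long <- pivot_longer(df, cols = -smpl,
                     names_to = c(".value", "obs"),
                     names_sep = 1)
long
```

The special name ".value" tells pivot_longer() to use that part of the column name as the name of the output column, which is what keeps Z and Y as separate columns.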

R summarise with multiple evaluation metric functions that use actual and predicted from a data frame

I want to calculate multiple model evaluation metrics by groups for a data set. Each metric requires the input of actual (observed) and predicted values. These are columns in my data frame. My groups are represented by the variables iTime and an_id.
I can do the necessary calculations with summarise and much redundant typing, but there must be a purrr way to do this. I am trying to master purrr. I have tried invoke_map and pmap but could not figure out how to refer to the columns "actual" and "predicted" in my data frame.
A short example - there are more metrics needed:
library(Metrics)
library(dplyr) # needed for %>%, group_by() and summarise()
df <- data.frame(an_id = c('G','J','J', 'J', 'G','G','J','G'),
                 iTime = c(1,1,2,2,1,2,1,2),
                 actual = c(1.28, 2.72,.664,.927,.711,1.16,.727,.834),
                 predicted = c(1.14,1.61,.475,.737,.715,1.15,.725,.90))
dataMetrics <- df %>%
  group_by(an_id, iTime) %>%
  summarise(vmae = mae(actual, predicted),
            rae = rae(actual, predicted),
            vrmse = rmse(actual, predicted))
> dataMetrics
A tibble: 4 x 5
an_id iTime vmae rae vrmse
<chr> <dbl> <dbl> <dbl> <dbl>
1 G 1 0.072 0.253 0.0990
2 G 2 0.038 0.233 0.0472
3 J 1 0.556 0.558 0.785
4 J 2 0.190 1.44 0.190
I don't know where mae, mase and rmse come from, which regrettably makes your example not reproducible. It's important to always explicitly state which packages you're using.
invoke_map is the way to map multiple functions to the same data. We can then combine that with nesting data and mapping invoke_map over the nested data.
I'll demonstrate with the sample data you give and by defining two functions f1 and f2:
f1 <- function(x, y) sum(abs(x - y))
f2 <- function(x, y) sum((x - y)^2)
library(tidyverse)
df %>%
  group_by(an_id, iTime) %>%
  nest() %>%
  mutate(tmp = map(data, ~invoke_map_dfc(
    list(f1 = f1, f2 = f2),
    x = .x$actual, y = .x$predicted))) %>%
  select(-data) %>%
  unnest(tmp)
## A tibble: 4 x 4
# an_id iTime f1 f2
# <fct> <int> <dbl> <dbl>
#1 G 1 0.144 0.0196
#2 J 1 1.11 1.23
#3 J 2 0.381 0.0718
#4 G 2 0.01 0.0001
Explanation: We group observations by an_id and iTime, then nest the remaining data and use invoke_map_dfc inside map to apply f1 and f2 to data and store the result in columns of a nested tibble. The last step is removing the data column and un-nesting the summary stats.
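The invoke_map family is deprecated in recent purrr releases, so the same result can also be sketched without it: map over a named list of functions inside summarise() and let dplyr unpack the resulting one-row tibble into columns. This is an alternative under the assumption of dplyr 1.0.0 or later, not the method used in the answer; `metrics` and `res` are illustrative names, and the data is the data frame from the question.

```r
library(dplyr)
library(purrr)
library(tibble)

f1 <- function(x, y) sum(abs(x - y))
f2 <- function(x, y) sum((x - y)^2)
metrics <- list(f1 = f1, f2 = f2)  # named list: names become column names

df <- data.frame(an_id = c('G','J','J','J','G','G','J','G'),
                 iTime = c(1,1,2,2,1,2,1,2),
                 actual = c(1.28, 2.72, .664, .927, .711, 1.16, .727, .834),
                 predicted = c(1.14, 1.61, .475, .737, .715, 1.15, .725, .90))

# Each function is called on the current group's actual/predicted columns;
# the unnamed one-row tibble is unpacked into one column per metric.
res <- df %>%
  group_by(an_id, iTime) %>%
  summarise(as_tibble(map(metrics, function(f) f(actual, predicted))),
            .groups = "drop")
res
```

Adding another metric then only requires adding an entry to the `metrics` list, provided the function takes the actual values first and the predicted values second.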
Update
To reproduce your expected output
library(Metrics)
df %>%
  group_by(an_id, iTime) %>%
  nest() %>%
  mutate(tmp = map(data, ~invoke_map_dfc(
    list(vmae = mae, rae = rae, vrmse = rmse),
    actual = .x$actual, predicted = .x$predicted))) %>%
  select(-data) %>%
  unnest(tmp)
## A tibble: 4 x 5
# an_id iTime vmae rae vrmse
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 G 1 0.072 0.253 0.0990
#2 J 1 0.556 0.558 0.785
#3 J 2 0.190 1.44 0.190
#4 G 2 0.038 0.233 0.0472
Sample data
df <- read.table(text =
"an_id iTime actual predicted
G 1 1.28 1.14
J 1 2.72 1.61
J 2 0.664 0.475
J 2 0.927 0.737
G 1 0.711 0.715
G 2 1.16 1.15
J 2 0.727 0.725", header = T)
