I'm sure this is obvious, but I can't figure it out.
I have a data.frame, and want to calculate a variable from several types.
df = data.frame(time = rep(seq(10),each=2),Type=rep(c("A","B"),times=10),value = runif(20))
I want a new data.frame, with A / B for each time point.
I've tried:
df2 <- df |> group_by(time) |> mutate(new_value= value[Type=="A"] / value[Type=="B"],.keep="none")
But I still have a new_value twice for each time.
A better option may be to reshape to 'wide' format with pivot_wider and then create the column
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = Type, values_from = value) %>%
mutate(new_value = A/B)
-output
# A tibble: 10 × 4
time A B new_value
<int> <dbl> <dbl> <dbl>
1 1 0.565 0.913 0.618
2 2 0.902 0.274 3.29
3 3 0.321 0.986 0.326
4 4 0.620 0.937 0.661
5 5 0.467 0.407 1.15
6 6 0.659 0.152 4.33
7 7 0.573 0.239 2.40
8 8 0.962 0.601 1.60
9 9 0.515 0.403 1.28
10 10 0.880 0.364 2.42
mutate creates or modifies a column in the original dataset, thus it returns the same number of rows. Instead, it may be better to use summarise if we want unique values (but here the 'Type' will be lost)
df |>
group_by(time) |>
summarise(new_value= value[Type=="A"] / value[Type=="B"])
In addition, this works only when the count of 'A', 'B' elements per 'time' is the same
Related
I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
is there a !!! or rlang-y way to accomplish this?
We can evalutate the string
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
Here is a base way with a pipe and a as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except it removes all columns except the numerical ones, AND reorders everything. Which makes it hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_apply to specified columns, while retaining the other columns, and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
name=letters[1:10],
csize=c("L","S","S","L","L","S","L","S","L","S"),
v1=rnorm(10),
v2=rnorm(10),
v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(.dots = groupvar) %>%
mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
Does this work:
> library(dplyr)
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = T))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476
This question already has answers here:
summarise_at using different functions for different variables
(2 answers)
Aggregate multiple variables with different functions [duplicate]
(2 answers)
Closed 3 years ago.
I've some R-code which does, what I want it to do. But now the question:
Is there any mechanism to avoid coding A1 A2 A3 and so on? I would like to code A* for all columns beginning with A. There can be any number of "A" columns in dependency to a list length which is definied in the code. The rest of the code is dynamic, but here I have a manual intervention (add some A columns or delete some A columns within the summerise statement).
I have found summarize_at, but I don't see how I can do the other things like last() and sum() at the same time for the other columns.
l_af <- l_cf %>%
group_by(PID, Server) %>%
summarise(Player=last(Player),
Guild=last(Guild),
Points=last(Points),
Battles=last(Battles),
A1=max(A1),
A2=max(A2),
A3=max(A3),
A4=max(A4),
A5=max(A5),
A6=max(A6),
RecCount=sum(RecCount))
Any help is appreciated.
The problem with using summarise it is removes all other columns if they are not used. You can consider to use mutate first perform all the operations and then use summarise.
library(dplyr)
l_cf %>%
group_by(PID, Server) %>%
mutate_at(vars(Player,Guild,Points,Battles), last) %>%
mutate_at(vars(starts_with("A")), max) %>%
mutate(RecCount = sum(RecCount)) %>%
summarise_all(max)
A reproducible example
set.seed(123)
df <- data.frame(group = rep(1:5, 2), x = runif(10), y = runif(10),
a1 = runif(10), a2 = runif(10), z = runif(10))
First applying functions individually for each column
df %>%
group_by(group) %>%
summarise(x=last(x),
y=last(y),
a1=max(a1),
a2=max(a2),
z=sum(z))
# A tibble: 5 x 6
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
Now apply the functions together for multiple columns
df %>%
group_by(group) %>%
mutate_at(vars(x, y), last) %>%
mutate_at(vars(starts_with("a")), max) %>%
mutate(z = sum(z)) %>%
summarise_all(max)
# group x y a1 a2 z
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.0456 0.900 0.890 0.963 0.282
#2 2 0.528 0.246 0.693 0.902 0.648
#3 3 0.892 0.0421 0.641 0.691 0.880
#4 4 0.551 0.328 0.994 0.795 0.635
#5 5 0.457 0.955 0.656 0.232 1.01
We can see that both the approaches gave the same output.
I tried to calculate the cumsum with a depreciation rate.
I have a grouped dataframe with a column number.
I want to add the number one by one with depreciation.
If the rate is 1, then the cumsum function in base r is good enough.
But if not, let's say the rate of 0.5 (means each number will multiply by 0.5 to add the next number), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id=sample(1:5,25,replace=TRUE)
num=rnorm(25)
df=data.frame(id,num)
# my custom function
depre=function(data){
rate=0.5
r=nrow(data)
sl=data$num
nl=data$num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
# work with one group
df %>% filter(id==1) %>% depre(.)
# failed to work with dplyr
df %>% group_by(id) %>% mutate(sl=depre(.))
I expect the first element of column s, should be the same as in column num.
But the following ones, should be depreciate by times 0.5 and add next num.
It works in one group, but failed in multi-grouped dataframe.
The error message is: "Error: Column sl must be length 6 (the group size) or one, not 25".
I have no idea. Could anyone have a clue?
Thanks
Your function would work if you pass vector to your function instead of dataframe
depre <- function(num){
rate = 0.5
r= length(num)
sl = num
nl = num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
We can split by 'id' and use the OP's function without any changes
library(dplyr)
library(purrr)
df %>%
group_split(id, keep = FALSE) %>%
map_df(~ tibble(id = .$id, sl = depre(.)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
or an option would be accumulate from purrr which would be more compact
out <- df %>%
group_by(id) %>%
mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
Issue in the OP's function is that the input is the whole dataset and during the process of getting the number of rows, it uses nrow(data), which would be the total number of rows. With group_by, the dplyr convention is n() - giving the number of rows. By doing the group_split, the input data.frame is split into subset of data.frames and the nrow of those will work for the created function
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(colnames(df) %>% tail(1) %>% desc())
I am looping over a list of data frames. There are different columns in the data frames and the last column of each may have a different name.
I need to arrange every data frame by its last column. The simple case looks like the above code.
Using arrange_at and ncol:
df %>% arrange_at(ncol(.), desc)
As arrange_at will be depricated in the future, you could also use:
# option 1
df %>% arrange(desc(.[ncol(.)]))
# option 2
df %>% arrange(across(ncol(.), desc))
If we need to arrange by the last column name, either use the name string
df %>%
arrange_at(vars(last(names(.))), desc)
Or specify the index
df %>%
arrange_at(ncol(.), desc)
The new dplyr way (I guess from 1.0.0 on) would be using across(last_col()):
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(across(last_col(), desc))
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.283 0.443 1.30 0.910
#> 2 0.797 -0.0819 -0.936 0.828
#> 3 0.0717 -0.858 -0.355 0.671
#> 4 -1.38 -1.08 -0.472 0.426
#> 5 1.52 1.43 -0.0593 0.249
#> 6 0.827 -1.28 1.86 0.0824
#> 7 -0.448 0.0558 -1.48 -0.143
#> 8 0.377 -0.601 0.238 -0.918
#> 9 0.770 1.93 1.23 -1.43
#> 10 0.0532 -0.0934 -1.14 -2.08
> packageVersion("dplyr")
#> [1] ‘1.0.4’