row wise calculation and update entire row in dplyr - r

I want to do a row-wise calculation with the dplyr package in R. The result of the calculation is a series (a vector of values). Then I want to replace the entire row with the calculated series. Here is the data:
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
I want to run isoreg, whose result is a series, and then replace the values under the w:z columns with it:
df %>%
  rowwise() %>%
  mutate(across(c_across(w:z), ~ isoreg(as.numeric(c_across(w:z)))$yf))
It seems this method is just for replacing one element, not the entire row.
isoreg is just a sample function; we could use any other function that returns a series rather than a single value.

You don't need to use across as well as c_across. For rowwise operations use only c_across. Also, mutate() on a rowwise data frame expects a result of size one per row, so you can't replace all the columns in one go. A workaround is to capture all the values in a list and use unnest_wider to spread those values into separate columns.
library(dplyr)
df %>%
  rowwise() %>%
  mutate(output = list(isoreg(c_across(w:z))$yf)) %>%
  tidyr::unnest_wider(output)
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45
Since the output of isoreg is not named, unnest_wider generates names such as ...1, ...2, etc. You can rename them if needed and remove the columns you don't need.
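For example, one possible cleanup (a sketch, not part of the original answer; names_sep and the final rename are assumptions about how you want the columns labelled):
library(dplyr)

df %>%
  rowwise() %>%
  mutate(output = list(isoreg(c_across(w:z))$yf)) %>%
  ungroup() %>%
  select(-(w:z)) %>%                                  # drop the original columns
  tidyr::unnest_wider(output, names_sep = "_") %>%    # gives output_1 ... output_4
  rename(w = output_1, x = output_2, y = output_3, z = output_4)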
A base R option is to use apply:
df[-1] <- t(apply(df[-1], 1, function(x) isoreg(x)$yf))

We could use pmap with unnest_wider
library(dplyr)
library(tidyr)
library(purrr)
df %>%
  mutate(new = pmap(select(., w:z), ~ isoreg(c(...))$yf)) %>%
  unnest_wider(c(new))
# A tibble: 6 x 9
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45
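If you want to overwrite w:z in place instead of adding new columns, one possible sketch (assuming dplyr >= 1.0 and tibble >= 3.0; this is not taken from the answers above) is to return a named one-row tibble from mutate(), which gets unpacked so its columns replace the existing ones:
library(dplyr)
library(tibble)

# Sketch: the unnamed one-row tibble returned by the expression is unpacked,
# so its w, x, y, z columns overwrite the originals row by row.
df %>%
  rowwise() %>%
  mutate(as_tibble_row(setNames(isoreg(c_across(w:z))$yf,
                                c("w", "x", "y", "z")))) %>%
  ungroup()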

Related

R function which transforms data adds new column depending on counted grouped rows

I am new to R, so maybe someone could help me.
I have a dataset like this:
ID  Date        Revenue  Sales
1   2022.01.01  10       20
1   2022.02.01  11       21
1   2022.03.01  12       22
2   2022.01.01  13       33
2   2022.02.01  14       41
2   2022.03.01  15       51
2   2022.04.01  16       61
I need to transform this dataset, grouping by ID. It is also important how many rows there are per group.
My transformed data must look like this:
ID  Revenue4  Revenue3  Revenue2  Revenue1  Sales4  Sales3  Sales2  Sales1
1   -         12        11        10        -       22      21      20
2   16        15        14        13        61      51      41      33
I need to do this with some function, because I have a lot of rows with different IDs and about 40 columns.
Thank You!
One approach is to add a column containing the row number within each group, and then use pivot_wider, building the new column names from the row number combined with Revenue and Sales.
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = ID,
              names_from = rn,
              values_from = c("Revenue", "Sales"))
Output
ID Revenue_1 Revenue_2 Revenue_3 Revenue_4 Sales_1 Sales_2 Sales_3 Sales_4
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 10 11 12 NA 20 21 22 NA
2 2 13 14 15 16 33 41 51 61
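Since the real data has about 40 columns, you can also let pivot_wider pick up every measure column instead of listing them. A sketch (the negative selection and names_sep are assumptions, not part of the answer above):
df %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID,
              names_from = rn,
              values_from = !c(ID, Date, rn),   # every remaining column
              names_sep = "")                   # Revenue1, Sales1, ...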
Same idea as the other answer, but with data.table:
##
# In future provide this using dput(...)
#
df <- read.table(text="ID Date Revenue Sales
1 2022.01.01 10 20
1 2022.02.01 11 21
1 2022.03.01 12 22
2 2022.01.01 13 33
2 2022.02.01 14 41
2 2022.03.01 15 51
2 2022.04.01 16 61", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df)[, N := seq(.N), by = .(ID)] |> dcast(ID ~ N, value.var = c('Revenue', 'Sales'), sep = "")
## ID Revenue1 Revenue2 Revenue3 Revenue4 Sales1 Sales2 Sales3 Sales4
## 1: 1 10 11 12 NA 20 21 22 NA
## 2: 2 13 14 15 16 33 41 51 61

Expand a data frame by group

I have a big data frame made up of 1000 data frames (500x500), which I created with the following code:
setwd("user/all_csv")
archivos <- list.files(full.names = F)
big.df <- lapply(archivos, read.csv, header = TRUE) %>%
set_names(archivos)%>%
bind_rows(.id = 'grp')
The big.df looks like below (a small example):
grp X X1 X2 X5
2020_01_19 1 23 47 3
2020_01_19 2 13 45 54
2020_01_19 5 23 41 21
2020_01_20 1 65 32 19
2020_01_20 2 39 52 12
2020_01_20 5 43 76 90
...
How can I generate the output below?
1-X1 1-X2 1-X5 2-X1 2-X2 2-X5 5-X1 5-X2 5-X5
2020_01_19 23 47 3 13 45 54 23 41 21
2020_01_20 65 32 19 39 52 12 43 76 90
...
I don't really know how to proceed. Any help would be greatly appreciated.
Use tidyr::pivot_wider with the names_glue argument as follows.
Store the names of all variables to be pivoted (even 500 of them) in a vector, say cols.
Use values_from = all_of(cols) as an argument in pivot_wider.
cols <- c('X1', 'X2', 'X5')

df %>%
  pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
              names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
If you want to use all columns except first two, use this
df %>%
  pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X),
              names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
However, if you want to rearrange the columns as shown in the expected outcome, you may use names_vary = 'slowest' in pivot_wider (available since tidyr 1.2.0).
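Putting those pieces together for the many-column case, a sketch (building cols programmatically is an assumption about your column names; names_vary requires tidyr >= 1.2.0):
library(dplyr)
library(tidyr)

# Assumed: every column except grp and X is a value column to be pivoted
cols <- setdiff(names(df), c("grp", "X"))

df %>%
  pivot_wider(id_cols = grp,
              names_from = X,
              values_from = all_of(cols),
              names_glue = '{X}-{.value}',
              names_vary = 'slowest')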

Create multiple new columns in tibble in R based on value of previous row giving prefix to all

I have a tibble as so:
df <- tibble(a = seq(1:10),
             b = seq(21, 30),
             c = seq(31, 40))
I want to create a new tibble where I lag some of the columns. The new columns should be called prev + lagged_col_name, e.g. prev_a.
In my actual data there are a lot of columns, so I don't want to write them out manually. Additionally, I only want to do it for some columns. In this example I have done it manually, but I would like to know if there is a way to use a function to do it.
df_new <- df %>%
  mutate(prev_a = lag(a),
         prev_b = lag(b),
         prev_c = lag(c))
Thanks for your help!
With the current dplyr version you can create new variable names with mutate_at; using a named list takes the name of the list element as a suffix. If you want it as a prefix, as in your example, you can use rename_at to correct the variable names. With your real data you need to adjust the vars() selection; for your example data matches("[a-c]") works.
library(dplyr)
df <- tibble(a = seq(1:10),
             b = seq(21, 30),
             c = seq(31, 40))

df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x)))
#> # A tibble: 10 x 6
#> a b c a_prev b_prev c_prev
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x))) %>%
  rename_at(vars(contains("_prev")), list(~ paste("prev", gsub("_prev", "", .), sep = "_")))
#> # A tibble: 10 x 6
#> a b c prev_a prev_b prev_c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
Created on 2020-04-29 by the reprex package (v0.3.0)
You could do it this way:
df_new <- bind_cols(
  df,
  df %>% mutate_at(.vars = vars("a", "b", "c"), function(x) lag(x))
)
The names are a bit nasty, but you can rename them. Or see @Bas' comment to get the names with a suffix.
# A tibble: 10 x 6
a b c a1 b1 c1
<int> <int> <int> <int> <int> <int>
1 1 21 31 NA NA NA
2 2 22 32 1 21 31
3 3 23 33 2 22 32
4 4 24 34 3 23 33
5 5 25 35 4 24 34
6 6 26 36 5 25 35
7 7 27 37 6 26 36
8 8 28 38 7 27 37
9 9 29 39 8 28 38
10 10 30 40 9 29 39
If you have dplyr 1.0 you can use the new across() function.
See some examples from the docs; instead of mean you want lag:
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(is.numeric, mean, na.rm = TRUE))
df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))
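Applied to this question, a sketch (assuming dplyr >= 1.0; the .names specification gives the prev_ prefix directly):
library(dplyr)

df %>%
  mutate(across(a:c, lag, .names = "prev_{.col}"))
# adds prev_a, prev_b, prev_c alongside a, b, c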

Group_by / summarize by two variables within a function

I would like to write a function that summarizes the provided data by some specified criteria, in this case by age.
The example data is a table of users' age and their stats.
df <- data.frame('Age'=rep(18:25,2), 'X1'=10:17, 'X2'=28:35,'X4'=22:29)
Next I define the output columns that are relevant for the analysis
output_columns <- c('Age', 'X1', 'X2', 'X3')
This function computes the sum of X1, X2 and X3 grouped by age.
aggr <- function(data, criteria, output_columns){
  k <- data %>%
    .[, colnames(.) %in% output_columns] %>%
    group_by_(.dots = criteria) %>%
    #summarise_each(funs(count), age) %>%
    summarize_if(is.numeric, sum)
  return(k)
}
When I call it like this
> e <- aggr(df, "Age", output_columns)
> e
# A tibble: 8 x 3
Age X1 X2
<int> <int> <int>
1 18 20 56
2 19 22 58
3 20 24 60
4 21 26 62
5 22 28 64
6 23 30 66
7 24 32 68
8 25 34 70
I want to have another column called count which shows the number of observations in each age group. Desired output is
> desired
Age X1 X2 count
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
I have tried different ways to do that, e.g. tally(), summarize_each, etc. They all deliver wrong results.
I believe there should be an easy and simple way to do that.
Any help is appreciated.
Since you're already summing all variables, you can just add a column of all 1s before the summary function.
aggr <- function(data, criteria, output_columns){
  data %>%
    .[, colnames(.) %in% output_columns] %>%
    group_by_(.dots = criteria) %>%
    mutate(n = 1L) %>%
    summarize_if(is.numeric, sum)
}
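Calling it the same way as in the question gives the output below:
aggr(df, "Age", output_columns)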
# A tibble: 8 x 4
Age X1 X2 n
<int> <int> <int> <int>
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
We could create the 'count' column before summarise_if
aggr <- function(data, criteria, output_columns){
  data %>%
    select(intersect(names(.), output_columns)) %>%
    group_by_at(criteria) %>%
    group_by(count = n(), add = TRUE) %>%
    summarize_if(is.numeric, sum) %>%
    select(setdiff(names(.), 'count'), count)
}
aggr(df,"Age",output_columns)
# A tibble: 8 x 4
# Groups: Age [8]
# Age X1 X2 count
# <int> <int> <int> <int>
#1 18 20 56 2
#2 19 22 58 2
#3 20 24 60 2
#4 21 26 62 2
#5 22 28 64 2
#6 23 30 66 2
#7 24 32 68 2
#8 25 34 70 2
In base R you could do
aggr <- function(data, criteria, output_columns){
  ds <- data[, colnames(data) %in% output_columns]
  d <- aggregate(ds, by = list(criteria), function(x) c(sum(x), length(x)))
  "names<-"(do.call(data.frame, d)[, -c(2:3, 5)], c(names(ds), "n"))
}
> with(df, aggr(df, Age, output_columns))
Age X1 X2 n
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
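For completeness, with dplyr >= 1.0 the same function can be written without the superseded group_by_/summarize_if verbs. A sketch, not part of the answers above:
library(dplyr)

aggr <- function(data, criteria, output_columns){
  data %>%
    select(any_of(output_columns)) %>%          # keep only the requested columns
    group_by(across(all_of(criteria))) %>%      # group by the criteria column(s)
    summarize(across(where(is.numeric), sum),
              count = n(),
              .groups = "drop")
}

aggr(df, "Age", output_columns)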

Appending many columns - functions of existing columns - to data frame

I have a data frame with 200 columns: A_1, ..., A_100, B_1, ..., B_100. The entries of A are integers from 1 to 5 or NA, while the entries of B are -1, 0, 1, NA.
I want to append 100 more columns: C_1, ..., C_100 where C_i = A_i + B_i, except when it would yield 0 or 6, in which case it should stay as is.
What would be the best way to do this in R, in terms of clarity and computational complexity? There has to be a better way than a for loop; perhaps there are functions for this in some library? I'm going to have to do similar operations a lot, so I'd like a streamlined method.
You can try:
library(tidyverse)

# some data
d <- data.frame(A_1 = 1:10,
                A_2 = 1:10,
                A_3 = 1:10,
                B_1 = 11:20,
                B_2 = 21:30,
                B_3 = 31:40)
d %>%
  gather(key, value) %>%
  separate(key, into = c("a", "b")) %>%
  group_by(b, a) %>%
  mutate(n = row_number()) %>%
  unite(a2, b, n) %>%
  spread(a, value) %>%
  mutate(Sum = A + B) %>%
  separate(a2, into = c("a", "b"), remove = TRUE) %>%
  select(-A, -B) %>%
  mutate(a = paste0("C_", a)) %>%
  spread(a, Sum) %>%
  arrange(as.numeric(b)) %>%
  left_join(d %>% rownames_to_column(), by = c("b" = "rowname"))
# A tibble: 10 x 10
b C_1 C_2 C_3 A_1 A_2 A_3 B_1 B_2 B_3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 12 22 32 1 1 1 11 21 31
2 2 14 24 34 2 2 2 12 22 32
3 3 16 26 36 3 3 3 13 23 33
4 4 18 28 38 4 4 4 14 24 34
5 5 20 30 40 5 5 5 15 25 35
6 6 22 32 42 6 6 6 16 26 36
7 7 24 34 44 7 7 7 17 27 37
8 8 26 36 46 8 8 8 18 28 38
9 9 28 38 48 9 9 9 19 29 39
10 10 30 40 50 10 10 10 20 30 40
The idea is to use tidyr's gather and spread to get the A and B columns side by side. Then you can calculate the sum and transform it back into the expected data.frame. As long as your data.frame has the same number of A and B columns, this works.
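A more direct sketch that also handles the 0/6 rule (not part of the answer above; it assumes the columns are literally named A_1, A_2, ... / B_1, B_2, ..., and that "stay as is" means falling back to A_i):
library(dplyr)
library(purrr)

# Derive each C_i from A_i and B_i column by column
suffixes <- sub("^A_", "", grep("^A_", names(d), value = TRUE))

c_cols <- map_dfc(suffixes, function(i) {
  s <- d[[paste0("A_", i)]] + d[[paste0("B_", i)]]
  # keep A_i where the sum would be 0 or 6 (assumed interpretation of "stay as is")
  setNames(tibble::tibble(ifelse(s %in% c(0, 6), d[[paste0("A_", i)]], s)),
           paste0("C_", i))
})

d_new <- bind_cols(d, c_cols)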
