Expand a data frame by group - r

I have a big data frame made up of 1000 data frames (500x500 each), which I created with the following code:
setwd("user/all_csv")
archivos <- list.files(full.names = F)
big.df <- lapply(archivos, read.csv, header = TRUE) %>%
set_names(archivos)%>%
bind_rows(.id = 'grp')
The big.df looks like below (a small example):
grp X X1 X2 X5
2020_01_19 1 23 47 3
2020_01_19 2 13 45 54
2020_01_19 5 23 41 21
2020_01_20 1 65 32 19
2020_01_20 2 39 52 12
2020_01_20 5 43 76 90
...
How can I generate the output below?
1-X1 1-X2 1-X5 2-X1 2-X2 2-X5 5-X1 5-X2 5-X5
2020_01_19 23 47 3 13 45 54 23 41 21
2020_01_20 65 32 19 39 52 12 43 76 90
...
I don't really know how to proceed. Any help would be greatly appreciated.

Use tidyr::pivot_wider with the names_glue argument, as follows.
Store the names of all variables to be pivoted (even 500 of them) in a vector, say cols.
Pass values_from = all_of(cols) as an argument to pivot_wider.
cols <- c('X1', 'X2', 'X5')
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
If you want to use all columns except the first two, use this:
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X),
names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
However, if you want to arrange the columns as shown in the expected output, you can use names_vary = 'slowest' in pivot_wider (available from tidyr 1.2.0).
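As a minimal sketch of the names_vary option, assuming tidyr 1.2.0 or later is installed, the sample data from the question can be rebuilt by hand (the data frame below is typed in for illustration, not read from the CSV files):

```r
library(tidyr)

# Small hand-typed copy of the example rows from the question
df <- data.frame(
  grp = rep(c("2020_01_19", "2020_01_20"), each = 3),
  X   = rep(c(1, 2, 5), times = 2),
  X1  = c(23, 13, 23, 65, 39, 43),
  X2  = c(47, 45, 41, 32, 52, 76),
  X5  = c(3, 54, 21, 19, 12, 90)
)

wide <- pivot_wider(
  df,
  id_cols     = grp,
  names_from  = X,
  values_from = c(X1, X2, X5),
  names_glue  = "{X}-{.value}",
  names_vary  = "slowest"  # keep all columns for one X value together
)

names(wide)
```

With "slowest", the values of X change most slowly across the new column names, so the columns for X = 1 come first, then those for X = 2, and so on.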

Related

How do I reshape my data so that rows are columns in R?

I have a dataset that contains the following values
Item Number Sales in Dollars
1 50 10
2 50 15
3 60 20
4 60 30
5 70 35
6 70 45
I would like to reshape the data such that the result would be
50 60 70
1 10 20 35
2 15 30 45
How could I go about achieving this?
In base R:
unstack(df, Sales_in_Dollars~Item_Number)
X50 X60 X70
1 10 20 35
2 15 30 45
We could use pivot_wider:
The trick is to group_by and create an id within each group to get this output. Otherwise pivot_wider cannot tell which values belong on the same row, and you get list-columns or rows padded with NAs.
library(dplyr)
library(tidyr)
df %>%
group_by(ItemNumber) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from=ItemNumber, values_from = SalesinDollars) %>%
select(-id)
`50` `60` `70`
<int> <int> <int>
1 10 20 35
2 15 30 45
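A self-contained sketch of the row-id approach, assuming the columns are named ItemNumber and SalesinDollars as in the answer above (the data frame is typed in from the question, and the ungroup() call is an addition to make the grouping explicit before pivoting):

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the question's data; column names are an assumption
df <- data.frame(
  ItemNumber     = c(50, 50, 60, 60, 70, 70),
  SalesinDollars = c(10, 15, 20, 30, 35, 45)
)

wide <- df %>%
  group_by(ItemNumber) %>%
  mutate(id = row_number()) %>%  # 1, 2 within each item number
  ungroup() %>%
  pivot_wider(names_from = ItemNumber, values_from = SalesinDollars) %>%
  select(-id)                    # the helper id is no longer needed

wide
```

The id column gives every value a unique (id, ItemNumber) position, which is what lets pivot_wider place the first sale of each item on row 1 and the second on row 2.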
With data.table:
library(data.table)  # needed for as.data.table() and rowid()
dcast(as.data.table(df),
rowid(`Item.Number`) ~ `Item.Number`,
value.var = "Sales.in.Dollars")[, -1]
Output
50 60 70
<int> <int> <int>
1: 10 20 35
2: 15 30 45
Another possible solution, based on tidyverse:
library(tidyverse)
df %>%
pivot_wider(names_from = item, values_from = sales, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 3
#> `50` `60` `70`
#> <int> <int> <int>
#> 1 10 20 35
#> 2 15 30 45

R function which transforms data adds new column depending on counted grouped rows

I am new to R, so maybe someone can help me.
I have a dataset like this:
ID Date Revenue Sales
1 2022.01.01 10 20
1 2022.02.01 11 21
1 2022.03.01 12 22
2 2022.01.01 13 33
2 2022.02.01 14 41
2 2022.03.01 15 51
2 2022.04.01 16 61
I need to transform this dataset, grouping by ID. The number of rows in each group also matters.
My transformed data must look like this:
ID Revenue4 Revenue3 Revenue2 Revenue1 Sales4 Sales3 Sales2 Sales1
1 - 12 11 10 - 22 21 20
2 16 15 14 13 61 51 41 33
I need to do this with a function, because I have many rows with different IDs and about 40 columns.
Thank You!
One approach is to add a column containing the row number within each group, and then use pivot_wider using the row number in the new column names, combined with sales and revenue.
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols = ID,
names_from = rn,
values_from = c("Revenue", "Sales"))
Output
ID Revenue_1 Revenue_2 Revenue_3 Revenue_4 Sales_1 Sales_2 Sales_3 Sales_4
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 10 11 12 NA 20 21 22 NA
2 2 13 14 15 16 33 41 51 61
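If the column names should have no underscore (Revenue1 rather than Revenue_1), one option is the names_sep argument of pivot_wider. The sketch below assumes the data frame from the question, typed in by hand, and adds an ungroup() before pivoting:

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the question's data
df <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2, 2),
  Date    = c("2022.01.01", "2022.02.01", "2022.03.01",
              "2022.01.01", "2022.02.01", "2022.03.01", "2022.04.01"),
  Revenue = c(10, 11, 12, 13, 14, 15, 16),
  Sales   = c(20, 21, 22, 33, 41, 51, 61)
)

wide <- df %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%  # position of each row within its ID
  ungroup() %>%
  pivot_wider(id_cols     = ID,
              names_from  = rn,
              values_from = c(Revenue, Sales),
              names_sep   = "")  # "Revenue" + "1" -> "Revenue1"

wide
```

Groups with fewer rows than the largest group are filled with NA, as in the output above.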
Same idea as the other answer, but with data.table:
##
# In future provide this using dput(...)
#
df <- read.table(text="ID Date Revenue Sales
1 2022.01.01 10 20
1 2022.02.01 11 21
1 2022.03.01 12 22
2 2022.01.01 13 33
2 2022.02.01 14 41
2 2022.03.01 15 51
2 2022.04.01 16 61", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df)[, N:=seq(.N), by=.(ID)] |> dcast(ID~N, value.var = c('Revenue', 'Sales'), sep="")
## ID Revenue1 Revenue2 Revenue3 Revenue4 Sales1 Sales2 Sales3 Sales4
## 1: 1 10 11 12 NA 20 21 22 NA
## 2: 2 13 14 15 16 33 41 51 61

row wise calculation and update entire row in dplyr

I want to do a row-wise calculation with the dplyr package in R. The result of the calculation is a series, and I then want to replace the entire row with the calculated series. Here is the code:
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
I want to run isoreg, whose result is a series, and then replace the values in columns w:z with that result:
df %>% rowwise %>%
mutate(across(c_across(w:z), ~ isoreg(as.numeric(c_across(w:z)))$yf))
It seems this method is just for replacing one element, not the entire row.
The isoreg is just a sample function, we could use other functions that return a series not a single value as the output.
You don't need across as well as c_across; for rowwise operations, c_across alone is enough. Also, a rowwise mutate expects a single value per row, so you can't replace the whole row in one go. A workaround is to capture all the values in a list and use unnest_wider to spread them into separate columns.
library(dplyr)
df %>%
rowwise() %>%
mutate(output = list(isoreg(c_across(w:z))$yf)) %>%
tidyr::unnest_wider(output)
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45
Since the output of isoreg is not named, unnest_wider names the new columns ...1, ...2, etc. You can rename them if needed and remove the columns you don't need.
A base R option is to use apply:
df[-1] <- t(apply(df[-1], 1, function(x) isoreg(x)$yf))
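To see the apply approach actually change values, here is a small sketch with hypothetical data in which the first row is deliberately not monotone (the data in the question is already increasing, so isoreg returns it unchanged):

```r
# Hypothetical example data, not from the question:
# row 1 is out of order (30 before 20), row 2 is already increasing.
df <- data.frame(id = 1:2,
                 w = c(10, 1), x = c(30, 2), y = c(20, 3), z = c(40, 4))

# apply() hands each row to isoreg() as a numeric vector and returns the
# fitted values column-wise, so t() is needed to get rows back.
df[-1] <- t(apply(df[-1], 1, function(row) isoreg(row)$yf))

df
```

In the first row the out-of-order pair 30, 20 is pooled to its mean, 25, while the second row is left as it was.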
We could use pmap with unnest_wider
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(new = pmap(select(., w:z), ~ isoreg(c(...))$yf)) %>%
unnest_wider(c(new))
# A tibble: 6 x 9
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45

Create "metadata" field in R

I have a data frame set up similar to this:
id <- c(123,234,123,234)
task <- c(54,23,12,58)
a <- c(23,67,45,89)
b <- c(78,45,65,45)
df <- data.frame(id,task,a,b)
> df
id task a b
1 123 54 23 78
2 234 23 67 45
3 123 12 45 65
4 234 58 89 45
where I score a and b for each ID:
df$score <- rowMeans(subset(df, select = c(3:4)), na.rm = TRUE)
> df
id task a b score
1 123 54 23 78 50.5
2 234 23 67 45 56.0
3 123 12 45 65 55.0
4 234 58 89 45 67.0
For each id I compute an aggregate score like this:
out <- ddply(df, 1, summarise,
overall = mean(score, na.rm = TRUE))
> out
id overall
1 123 52.75
2 234 61.50
but what I want my final output to have is a new column that has the scores that went into the overall and their task id like this:
id overall meta
1 123 52.75 "task_scores":[{"54":50.5,"12":55}]
2 234 61.50 "task_scores":[{"23":56,"58":67}]
how would I go about doing that using R?
We could make use of jsonlite to create the structure
library(jsonlite)
library(plyr)
ddply(df, "id", summarise, overall = mean(score, na.rm = TRUE),
meta = paste0('"task_scores":',
toJSON(setNames(as.data.frame.list(score), task))))
# id overall meta
#1 123 52.75 "task_scores":[{"54":50.5,"12":55}]
#2 234 61.50 "task_scores":[{"23":56,"58":67}]
I don't know how to make that metadata dictionary offhand, but you could do something like this:
library(dplyr)
library(magrittr)
out <- df %>% group_by(id) %>% mutate(overall = mean(score))
> out
# A tibble: 4 x 6
# Groups: id [2]
id task a b score overall
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 123 54 23 78 50.5 52.8
2 234 23 67 45 56 61.5
3 123 12 45 65 55 52.8
4 234 58 89 45 67 61.5
So the df would have both the aggregated scores and preserve the data in the original rows.
You can do it with a few mutates. Paste your tallies, get your row average, then your group average.
library(dplyr)
df %>%
mutate(score = rowMeans(subset(., select = c(3:4)), na.rm = TRUE)) %>%
group_by(id) %>%
mutate(overall = mean(score)) %>%
mutate(tally = paste(task, score, sep = ":", collapse = ","))
# A tibble: 4 x 7
# Groups: id [2]
id task a b score overall tally
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 123 54 23 78 50.5 52.8 54:50.5,12:55
2 234 23 67 45 56 61.5 23:56,58:67
3 123 12 45 65 55 52.8 54:50.5,12:55
4 234 58 89 45 67 61.5 23:56,58:67
And to get your desired final output, just select and slice.
df %>%
mutate(score = rowMeans(subset(., select = c(3:4)), na.rm = TRUE)) %>%
group_by(id) %>%
mutate(overall = mean(score)) %>%
mutate(tally = paste(task, score, sep = ":", collapse = ",")) %>%
select(id, overall, tally) %>%
slice(1)
# A tibble: 2 x 3
id overall tally
<dbl> <dbl> <chr>
1 123 52.8 54:50.5,12:55
2 234 61.5 23:56,58:67
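As a possible shortcut, summarise can collapse each group to one row directly, which avoids the mutate, select and slice steps. This is a sketch using the data from the question; rowMeans(cbind(a, b)) is used here in place of the subset() call:

```r
library(dplyr)

df <- data.frame(id   = c(123, 234, 123, 234),
                 task = c(54, 23, 12, 58),
                 a    = c(23, 67, 45, 89),
                 b    = c(78, 45, 65, 45))

out <- df %>%
  mutate(score = rowMeans(cbind(a, b), na.rm = TRUE)) %>%  # per-row score
  group_by(id) %>%
  summarise(overall = mean(score),                         # per-id mean
            tally   = paste(task, score, sep = ":", collapse = ","))

out
```

Because summarise returns one row per id, the result has the same three columns as the select-and-slice version.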

Group_by / summarize by two variables within a function

I would like to write a function that summarizes the provided data by some specified criteria, in this case by age.
The example data is a table of users' age and their stats.
df <- data.frame('Age'=rep(18:25,2), 'X1'=10:17, 'X2'=28:35,'X4'=22:29)
Next I define the output columns that are relevant for the analysis
output_columns <- c('Age', 'X1', 'X2', 'X3')
This function computes the sum of X1, X2 and X3 grouped by age.
aggr <- function(data, criteria, output_columns){
k <- data %>% .[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
#summarise_each(funs(count), age) %>%
summarize_if(is.numeric, sum)
return (k)
}
When I call it like this
> e <- aggr(df, "Age", output_columns)
> e
# A tibble: 8 x 3
Age X1 X2
<int> <int> <int>
1 18 20 56
2 19 22 58
3 20 24 60
4 21 26 62
5 22 28 64
6 23 30 66
7 24 32 68
8 25 34 70
I want to have another column called count which shows the number of observations in each age group. Desired output is
> desired
Age X1 X2 count
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
I have tried different ways to do that, e.g. tally(), summarize_each, etc., but they all deliver wrong results.
I believe there should be an easy and simple way to do that.
Any help is appreciated.
Since you're already summing all variables, you can just add a column of all 1s before the summary function
aggr <- function(data, criteria, output_columns){
data %>%
.[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
mutate(n = 1L) %>%
summarize_if(is.numeric, sum)
}
# A tibble: 8 x 4
Age X1 X2 n
<int> <int> <int> <int>
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
We could create the 'count' column before summarise_if
aggr<- function(data, criteria, output_columns){
data %>%
select(intersect(names(.), output_columns))%>%
group_by_at(criteria)%>%
group_by(count = n(), add= TRUE) %>%
summarize_if(is.numeric,sum) %>%
select(setdiff(names(.), 'count'), count)
}
aggr(df,"Age",output_columns)
# A tibble: 8 x 4
# Groups: Age [8]
# Age X1 X2 count
# <int> <int> <int> <int>
#1 18 20 56 2
#2 19 22 58 2
#3 20 24 60 2
#4 21 26 62 2
#5 22 28 64 2
#6 23 30 66 2
#7 24 32 68 2
#8 25 34 70 2
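The underscore and scoped verbs used above (group_by_, group_by_at, summarize_if) have been deprecated or superseded in newer dplyr releases. A rough equivalent with across(), assuming dplyr 1.0 or later, might look like this:

```r
library(dplyr)

aggr <- function(data, criteria, output_columns) {
  data %>%
    select(any_of(output_columns)) %>%        # silently skips missing names such as X3
    group_by(across(all_of(criteria))) %>%    # criteria is a character vector
    summarise(across(where(is.numeric), sum), # sum every numeric non-grouping column
              count = n(),                    # number of rows in each group
              .groups = "drop")
}

df <- data.frame(Age = rep(18:25, 2), X1 = 10:17, X2 = 28:35, X4 = 22:29)
e <- aggr(df, "Age", c("Age", "X1", "X2", "X3"))
e
```

Here n() counts the rows of each group directly, so no helper column of 1s is needed.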
In base R you could do
aggr <- function(data, criteria, output_columns){
ds <- data[, colnames(data) %in% output_columns]
d <- aggregate(ds, by=list(criteria), function(x) c(sum(x), length(x)))
"names<-"(do.call(data.frame, d)[, -c(2:3, 5)], c(names(ds), "n"))
}
> with(df, aggr(df, Age, output_columns))
Age X1 X2 n
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
