Calculating % of total within groups across each column and transposing - r

Is there a way to create the following output (assuming a lot of IDs and a lot more attributes)?
I am stuck after calculating the % of total by ATT1 within ID and then ATT2, etc.. Not sure how to go about making the rows into column headers and aggregate.
Input File (df in R):
ID ATT1 ATT2 ATT3 ATT4 Value
1 a x d i 10
1 a y d j 10
1 a y d k 10
1 b y c k 10
1 b y c l 10
2 a x c k 20
…
And I want the output file to look like (ATT4_l is cut off):
ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_d ATT3_c ATT4_i ATT4_j ATT4_k
1 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.2 0.4
...
I tried using dplyr
df %>% group_by(ID, ATT1) %>% mutate(proc = (Value/sum(Value) * 100))
But I am not sure what to do once I have all the ATT calculated to get them into columns and aggregated so that each ID only has 1 row of data.

You can do this with the two main workhorses of the tidyverse: dplyr for calculations and tidyr for reshaping data. Some of the reshaping is convoluted so I'm breaking it into steps.
library(dplyr)
library(tidyr)
...
If you gather the data from its original wide format into a long format, you'll have a column of IDs, a column of ATTx values, a column of letters (don't know the context meaning of these, so I'm literally calling it letters), and a column of values. From this format, you can group observations by combinations of ID, ATT, and letter, and you can later stick ATTs and letters together in the way you've laid out.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
head()
#> # A tibble: 6 x 4
#> ID Value att letter
#> <int> <int> <chr> <chr>
#> 1 1 10 ATT1 a
#> 2 1 10 ATT1 a
#> 3 1 10 ATT1 a
#> 4 1 10 ATT1 b
#> 5 1 10 ATT1 b
#> 6 2 20 ATT1 a
After grouping, calculate total values for each ID/ATT/letter combo:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: ID, att [3]
#> ID att letter group_val
#> <int> <chr> <chr> <int>
#> 1 1 ATT1 a 30
#> 2 1 ATT1 b 20
#> 3 1 ATT2 x 10
#> 4 1 ATT2 y 40
#> 5 1 ATT3 c 20
#> 6 1 ATT3 d 30
Using mutate, you can calculate the share of each observation within its larger group. mutate drops one layer of the grouping hierarchy, so this is the share of values for each letter within a given ID and ATT. Since you no longer need the total values, just their shares, drop that column, and stick the ATTs and letters back together with unite.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
head()
#> # A tibble: 6 x 3
#> # Groups: ID [1]
#> ID group share
#> <int> <chr> <dbl>
#> 1 1 ATT1_a 0.6
#> 2 1 ATT1_b 0.4
#> 3 1 ATT2_x 0.2
#> 4 1 ATT2_y 0.8
#> 5 1 ATT3_c 0.4
#> 6 1 ATT3_d 0.6
Now you have all the information you're looking for, just need to get it into a wide format, turning the values in the group column into individual columns. You do this with spread:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
spread(key = group, value = share)
#> # A tibble: 2 x 11
#> # Groups: ID [2]
#> ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_c ATT3_d ATT4_i ATT4_j ATT4_k
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.6 0.4 0.2 0.8 0.4 0.6 0.2 0.2 0.4
#> 2 2 1 NA 1 NA 1 NA NA NA 1
#> # ... with 1 more variable: ATT4_l <dbl>
Note that there are NAs filled in here where there aren't observations for combinations of ID/ATT/letter. I'm assuming you'll have more complete data than in the sample you posted.
Created on 2018-10-03 by the reprex package (v0.2.1)

I believe you are looking for the reshape2 package
library(reshape2)
df.new <- dcast(df,
formula = ID~ATT1,
value.var = "proc",
fun.aggregate = mean)
This will not completely fix your problem though - I recommend doing this first to make your data tidy
df.tidy <- melt(df,
id.vars = c("ID","Value"),
variable.name = "ATT1_4",
value.name = "att.factor")
df.tidy <- df.tidy %>% group_by(ID, att.factor) %>% mutate(proc = (Value/sum(Value)*100))
df.new <- dcast(df.tidy,
formula = ID~att.factor,
value.var = "proc",
fun.aggregate = mean)
NaN will be returned for anything combination that isnt represented in df.tidy. you can use the fill argument to assign a value to those.

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1)
y = c(3,4),
)
I want to find all groups for which all elements of x and x only are all NA, then remove those groups. "B" is filtered in because it has at least 1 non NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but it seems that filters out if it finds at least 1 NA; I need the correct word, which is not all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both cells in the column x are not defined.
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
-output
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In the devel version of dplyr, we could use .by in filter thus we don't need to group_by/ungroup
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

Places after decimal points discarded when extracting numbers from strings

I'd like to extract weight values from strings with the unit and the time of measurement using tidyverse.
My dataset is like as below:
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
------
A tibble: 3 × 2
ID Weight
<chr> <chr>
1 A 45kg^20221120
2 B 11.5kg^20221015
3 C 66.05kg^20221020
I use stringr in the tidyverse package with regular expressions.
library(tidyverse)
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))
----------
A tibble: 3 × 3
ID Measurement Weight
<chr> <chr> <dbl>
1 A 45kg^20221120 45
2 B 11.5kg^20221015 11.5
3 C 66.05kg^20221020 66.0
The second decimal place of C (.05) doesn't extracted.
What's wrong with my code?
Any answers or comments are welcome.
Thanks.
Yes, it was extracted, however tibble is rounding it for 66.0 for easy display.
You can see it if you transform it in data.frame or if you View it
Solution
Check here
Check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
as.data.frame()
Output
#> ID Measurement Weight
#> 1 A 45kg^20221120 45.00
#> 2 B 51.5kg^20221015 51.50
#> 3 C 66.05kg^20221020 66.05
Or check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
View()
You could try to pull all the data out of the string at once with extract:
library(tidyverse)
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
df |>
extract(col = Weight,
into = c("weight", "unit", "date"),
regex = "(.*)(kg)\\^(.*$)",
remove = TRUE,
convert = TRUE) |>
mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#> ID weight unit date
#> <chr> <dbl> <chr> <date>
#> 1 A 45 kg 2022-11-20
#> 2 B 51.5 kg 2022-10-15
#> 3 C 66.0 kg 2022-10-20
Note that, as stated in the comments, the .05 is just not printing, but is present in the data.

Spread data with non-unique keys with R

I have the following data frame:
ID
Group
1
A
1
B
2
C
2
D
And I want to reshape the data frame into a wider version in terms of ID. Thus, the new data frame looks like this:
ID
Group1
Group2
1
A
B
2
C
D
You can do this by adding a helper column and then using tidyr::pivot_wider():
library(dplyr)
library(tidyr)
data <- tibble(
id = c(1, 1, 2, 2),
group = letters[1:4]
)
# Add a helper column to use when pivoting. This uses the row number
# over each subgroup, i.e. over each value of `id`
transformed_data <- data %>%
group_by(id) %>%
mutate(helper = paste0("Group", row_number())) %>%
ungroup()
# Here's what the helper column looks like
transformed_data
#> # A tibble: 4 x 3
#> id group helper
#> <dbl> <chr> <chr>
#> 1 1 a Group1
#> 2 1 b Group2
#> 3 2 c Group1
#> 4 2 d Group2
# Pivot the data using the helper column
transformed_data %>%
pivot_wider(names_from = helper, values_from = group)
#> # A tibble: 2 x 3
#> id Group1 Group2
#> <dbl> <chr> <chr>
#> 1 1 a b
#> 2 2 c d

How do you use dplyr::pull to convert grouped a colum into vectors?

I have a tibble, df, I would like to take the tibble and group it and then use dplyr::pull to create vectors from the grouped dataframe. I have provided a reprex below.
df is the base tibble. My desired output is reflected by df2. I just don't know how to get there programmatically. I have tried to use pull to achieve this output but pull did not seem to recognize the group_by function and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base r. Note - new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)
Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT If new_col should be a list of vectors then you could do `summarise(new_col = list(c(new_col)))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a dataframe/tibble resulted from a grouping. The index is according a defined order. If I use row_number() it does enumerate but within group. But I want that it enumerates without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me a object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_numer():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() does enumerate within the former grouping. This was not my intention. This can be avoid converting tibble to a dataframe first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case—I do this for work probably every day—is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, here I ungroup to show the shares instead of the entire data frame.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

Resources