I have a dataframe with 0-3 rows depending on the underlying data. Here is an example with 2 rows:
df <- tibble(ID = c(1, 1), v = c(1, 2))
ID v
<dbl> <dbl>
1 1 1
2 1 2
I now want to convert each row of v into a separate column. As I have 3 rows at maximum, the result should look like this:
ID v1 v2 v3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
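What's the best way to achieve this? Thanks!
Perhaps this helps. Note that the 2:3 in mutate() is hardcoded for this two-row example, so the existing values land in v2 and v3: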
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(nm = str_c("v", 2:3)) %>%
complete(ID, nm = str_c("v", 1:3)) %>%
pivot_wider(names_from = nm, values_from = v)
Update, at the OP's request (see comments). This reuses max_n <- 3 as defined in the first answer below:
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
arrange(!is.na(v), v) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 2
First answer:
Maybe something like this. What we are doing here is:
define the maximum number of rows per group (in this case 3),
then pad each group up to that maximum by adding NA rows.
For naming, add a row_number() column and use pivot_wider() with its names_glue argument:
library(dplyr)
library(tidyr)
max_n <- 3
df %>%
group_by(ID) %>%
summarise(cur_data()[seq(max_n),]) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row,
values_from = v,
names_glue = "v_{.name}")
ID v_1 v_2 v_3
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 NA
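The padding steps above can also be sketched with tidyr's complete(), which handles several IDs in one pass. This is a minimal sketch rather than the answerer's method: the second ID and the shifted index (row_number() + max_n - n()) are assumptions added for illustration, chosen so that the NA values come first, as in the question's desired output.

```r
library(dplyr)
library(tidyr)

# Example data; ID 2 is a hypothetical second group added for illustration
df <- tibble(ID = c(1, 1, 2), v = c(1, 2, 5))
max_n <- 3L  # maximum number of rows per ID

wide <- df %>%
  group_by(ID) %>%
  # Shift the index so existing values occupy the last slots (left-pad with NA)
  mutate(row = row_number() + max_n - n()) %>%
  ungroup() %>%
  # Add the missing ID/row combinations; v is NA for them
  complete(ID, row = seq_len(max_n)) %>%
  pivot_wider(names_from = row, values_from = v, names_prefix = "v")

print(wide)
```

Using plain row_number() as the index instead would put the NA values at the end, as in the output of the first answer.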
I have a data frame with a lot of columns and want to summarise them with multiple functions.
test_df <- data.frame(Group = sample(c("A", "B", "C"), 10, T), var1 = sample(1:5, 10, T), var2 = sample(3:7, 10, T))
test_df %>%
group_by(Group) %>%
summarise_all(c(Mean = mean, Sum = sum))
# A tibble: 3 x 5
Group var1_Mean var2_Mean var1_Sum var2_Sum
<chr> <dbl> <dbl> <int> <int>
1 A 3.14 5.14 22 36
2 B 4.5 4.5 9 9
3 C 4 6 4 6
This results in a tibble with Group as the first column and column names that combine the previous column name with the function name.
The desired result is a table with the previous column names in the first column and the groups and functions in the column names.
I can achieve this with
test_longer <- test_df %>% pivot_longer(cols = starts_with("var"), names_to = "var", values_to = "val")
# Add row number because spread needs unique identifiers for rows
test_longer <- test_longer %>%
group_by(Group) %>%
mutate(grouped_id = row_number())
spread(test_longer, Group, val) %>%
select(-grouped_id) %>%
group_by(var) %>%
summarise_all(c(Mean = mean, Sum = sum), na.rm = T)
# A tibble: 2 x 7
var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 var1 3.14 4.5 4 22 9 4
2 var2 5.14 4.5 6 36 9 6
But this seems to be a rather long detour... There probably is a better way, but I could not find it. Any suggestions? Thank you
There are lots of ways to go about it, but I would simplify it by pivoting to a longer data frame first and then grouping by var and Group. Then you can pivot wider to get the final result you want. Note that I used summarize(across()), which replaces the superseded summarize_all(); with a single value column, I could also have specified Mean = ... and Sum = ... manually.
set.seed(123)
test_df %>%
pivot_longer(
var1:var2,
names_to = "var"
) %>%
group_by(Group, var) %>%
summarize(
across(
everything(),
list(Mean = mean, Sum = sum),
.names = "{.fn}"
),
.groups = "drop"
) %>%
pivot_wider(
names_from = "Group",
values_from = c(Mean, Sum),
names_glue = "{Group}_{.value}"
)
#> # A tibble: 2 × 7
#> var A_Mean B_Mean C_Mean A_Sum B_Sum C_Sum
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 var1 1 2.5 3.2 1 10 16
#> 2 var2 5 4.5 4.4 5 18 22
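The manual variant mentioned above (naming Mean and Sum directly instead of using across()) might look like the following. It is a minimal sketch on a small fixed data frame, which is an assumption made so the numbers are reproducible; it relies on pivot_longer() naming its value column value by default.

```r
library(dplyr)
library(tidyr)

# Small deterministic stand-in for the sampled test_df (hypothetical values)
test_df <- data.frame(
  Group = c("A", "A", "B", "B", "C"),
  var1 = c(1, 3, 2, 4, 5),
  var2 = c(3, 5, 4, 6, 7)
)

res <- test_df %>%
  pivot_longer(var1:var2, names_to = "var") %>%
  group_by(Group, var) %>%
  # One value column only, so the summaries can be named by hand
  summarise(Mean = mean(value), Sum = sum(value), .groups = "drop") %>%
  pivot_wider(
    names_from = Group,
    values_from = c(Mean, Sum),
    names_glue = "{Group}_{.value}"
  )

print(res)
```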
My code is messy.
If a group's total is smaller than two, its name should become "unpopular".
df <- data.frame(vote=c("A","A","A","B","B","B","B","B","B","C","D"),
val=c(rep(1,11))
)
df %>% group_by(vote) %>% summarise(val=sum(val))
Output:
vote val
<fct> <dbl>
1 A 3
2 B 6
3 C 1
4 D 1
But I need:
vote val
<fct> <dbl>
1 A 3
2 B 6
3 unpopular 2
My idea is:
df2 <- df %>% group_by(vote) %>% summarise(val=sum(val))
df2$vote[df2$val < 2] <- "unpop"
df2 %>% group_by....
That is not elegant.
Do you know a cleaner, more helpful function?
We can do a double grouping
library(dplyr)
df %>%
group_by(vote) %>%
summarise(val=sum(val)) %>%
group_by(vote = replace(vote, val <2, 'unpop')) %>%
summarise(val = sum(val))
Output:
# A tibble: 3 x 2
# vote val
# <chr> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
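The same relabel-then-sum idea can be sketched with a single summarise by attaching each vote's total to its rows first. This is a minimal sketch, not the answerer's code: the helper column total is a hypothetical name, and stringsAsFactors = FALSE is set on the assumption that vote should be a character column so it can be relabelled directly.

```r
library(dplyr)

df <- data.frame(
  vote = c(rep("A", 3), rep("B", 6), "C", "D"),
  val = 1,
  stringsAsFactors = FALSE  # keep vote as character so it can be relabelled
)

res <- df %>%
  group_by(vote) %>%
  mutate(total = sum(val)) %>%  # per-vote total attached to every row
  ungroup() %>%
  # Relabel every vote whose total is below the threshold of 2
  mutate(vote = if_else(total < 2, "unpop", vote)) %>%
  group_by(vote) %>%
  summarise(val = sum(val), .groups = "drop")

print(res)
```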
Or another option with rowsum
df %>%
group_by(vote = replace(vote, vote %in%
names(which((rowsum(val, vote) < 2)[,1])), 'unpopular')) %>%
summarise(val = sum(val))
Or using fct_lump_n from forcats
library(forcats)
df %>%
group_by(vote = fct_lump_n(vote, 2, other_level = "unpop")) %>%
summarise(val = sum(val))
# A tibble: 3 x 2
# vote val
# <fct> <dbl>
#1 A 3
#2 B 6
#3 unpop 2
Or using table
df %>%
group_by(vote = replace(vote,
vote %in% names(which(table(vote) < 2)), 'unpop')) %>%
summarise(val = sum(val))
If you want to relabel the votes based on the sum of val, you can do this in base R:
aggregate(val~vote, transform(aggregate(val~vote, df, sum),
vote = replace(vote, val < 2, 'unpop')), sum)
# vote val
#1 A 3
#2 B 6
#3 unpop 2
I am looking to perform a group by on id and code1 and then summarise. I want the summarise to do several conditional sums, e.g. the sum of the count column when code2 == "B". I know how to do this by creating an intermediary binary column, but I was wondering if there is a quicker method where this can all be performed in the summarise statement.
Here is some test data:
id <- c(1,1,1)
code1 <- c("M", "M", "M")
code2 <- c("B", "B", "U")
code3 <- c("H", "N", "N")
count <- c(15, 2, 1)
x <- data.frame(id, code1, code2, code3, count)
Desired output:
id | code1 | Total | B_count | U_count | H_count | N_count
1 | M | 18 | 17 | 1 | 15 | 3
We can use the conditions inside the summarise call:
library(dplyr)
x %>%
group_by(id, code1) %>%
summarise(total = sum(count),
B_count = sum(count[code2 == "B"]),
U_count = sum(count[code2 == "U"]),
H_count = sum(count[code3 == "H"]),
N_count = sum(count[code3 == "N"]))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: id [1]
id code1 total B_count U_count H_count N_count
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 M 18 17 1 15 3
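One property of this subsetting pattern is worth seeing on a group with no matching rows: count[code2 == "B"] is then empty, and sum() of an empty numeric vector is 0, so the column gets 0 rather than NA. The sketch below assumes a hypothetical second id added to the test data, and passes .groups = "drop" to return an ungrouped result without the regrouping message.

```r
library(dplyr)

# Test data from the question plus a hypothetical id 2 that has no "B" rows
x <- data.frame(
  id = c(1, 1, 1, 2),
  code1 = c("M", "M", "M", "F"),
  code2 = c("B", "B", "U", "U"),
  count = c(15, 2, 1, 4)
)

res <- x %>%
  group_by(id, code1) %>%
  summarise(
    total = sum(count),
    B_count = sum(count[code2 == "B"]),  # empty subset sums to 0
    U_count = sum(count[code2 == "U"]),
    .groups = "drop"
  )

print(res)
```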
This solution is very complicated but it gets the job done.
library(dplyr)
library(tidyr)
x %>%
pivot_longer(
cols = matches('code[2-9]'),
names_to = 'vars',
values_to = 'code'
) %>%
dplyr::select(-vars) %>%
group_by(id, code1, code) %>%
summarise(count = sum(count), .groups = "rowwise") %>%
pivot_wider(
id_cols = c(id, code1),
names_from = code,
values_from = count
) %>%
left_join(
x %>%
group_by(id, code1) %>%
summarise(Total = sum(count), .groups = "rowwise"),
by = c("id", "code1")
) %>%
select(id, code1, Total, everything())
## A tibble: 1 x 7
# id code1 Total B H N U
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 M 18 17 15 3 1
I want to group by column A and then sum the values in column C over the distinct combinations of columns B and C. Is it possible to do this inside the summarise() call?
I know it is possible by calling distinct() before the aggregation. What about something like this:
Data:
df <- tibble(A = c(1,1,1,2,2), B = c('a','b','b','a','a'), C=c(5,10,10,15,15))
My try that doesn't work:
df %>%
group_by(A) %>%
summarise(sumC=sum(distinct(B,C) %>% select(C)))
Desired output:
A sumC
1 15
2 15
You could use duplicated():
df %>%
group_by(A) %>%
summarise(sumC = sum(C[!duplicated(B)]))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
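Note that !duplicated(B) deduplicates on B alone. That matches the question's data because C is constant within each B there, but the two notions differ once the same B appears with different C values. A minimal sketch of the difference, using hypothetical data and base R's duplicated() on a two-column data frame to deduplicate on the (B, C) pair:

```r
library(dplyr)

# Hypothetical data: B == "a" occurs with two different C values
df <- tibble(
  A = c(1, 1, 1, 1),
  B = c("a", "a", "a", "b"),
  C = c(5, 5, 7, 7)
)

res <- df %>%
  group_by(A) %>%
  summarise(
    by_B_only = sum(C[!duplicated(B)]),                 # keeps first row per B
    by_B_and_C = sum(C[!duplicated(data.frame(B, C))])  # keeps first row per (B, C) pair
  )

print(res)
```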
Or with distinct
df %>%
group_by(A) %>%
distinct(B, C) %>%
summarise(sumC = sum(C))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
A different possibility could be:
df %>%
group_by(A, B, C) %>%
slice(1) %>%
group_by(A) %>%
summarise(sumC = sum(C))
A sumC
<dbl> <dbl>
1 1 15
2 2 15
Or a twist on @Maurits Evers's answer:
df %>%
distinct(A, B, C) %>%
group_by(A) %>%
summarise(sumC = sum(C))
Using tidyr/dplyr, I have some factor columns which I'd like to Z-score, and then mutate an average Z-score, whilst retaining the original data for reference.
I'd like to avoid using a for loop in tidyr/dplyr, thus I'm gathering my data and performing my calculation (Z-score) on a single column. However, I'm struggling with restoring the wide format.
Here is a MWE:
library(dplyr)
library(tidyr)
# Original Data
dfData <- data.frame(
Name = c("Steve","Jwan","Ashley"),
A = c(10,20,12),
B = c(0.2,0.3,0.5)
) %>% tbl_df()
# Gather to Z-score
dfLong <- dfData %>% gather("Factor","Value",A:B) %>%
mutate(FactorZ = paste0("Z_",Factor)) %>%
group_by(Factor) %>%
mutate(ValueZ = (Value - mean(Value,na.rm = TRUE))/sd(Value,na.rm = TRUE))
# Now go wide to do some mutations (e.g. Z_Avg = (Z_A + Z_B)/2)
# This does not work
dfWide <- dfLong %>%
spread(Factor,Value) %>%
spread(FactorZ,ValueZ)%>%
mutate(Z_Avg = (Z_A+Z_B)/2)
# This is the desired result
dfDesired <- dfData %>% mutate(Z_A = (A - mean(A,na.rm = TRUE))/sd(A,na.rm = TRUE)) %>% mutate(Z_B = (B - mean(B,na.rm = TRUE))/sd(B,na.rm = TRUE)) %>%
mutate(Z_Avg = (Z_A+Z_B)/2)
Thanks for any help/input!
Another approach using dplyr (version 0.5.0)
library(dplyr)
dfData %>%
mutate_each(funs(Z = scale(.)), -Name) %>%
mutate(Z_Avg = (A_Z+B_Z)/2)
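mutate_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, roughly the same computation can be sketched with across(); the Z_ prefix set through .names is a choice made here to match the column names in the question, and as.numeric() drops the matrix structure that scale() returns.

```r
library(dplyr)

dfData <- data.frame(
  Name = c("Steve", "Jwan", "Ashley"),
  A = c(10, 20, 12),
  B = c(0.2, 0.3, 0.5)
)

res <- dfData %>%
  # Z-score A and B, keeping the originals and adding Z_A and Z_B
  mutate(across(c(A, B), ~ as.numeric(scale(.x)), .names = "Z_{.col}")) %>%
  mutate(Z_Avg = (Z_A + Z_B) / 2)

print(res)
```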
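# Collapse the question's dfWide (several partially NA rows per Name) to one row per Name
# NA-tolerant mean used for the collapse
means <- function(x) mean(x, na.rm = TRUE)
dfWide %>%
group_by(Name) %>%
summarise_each(funs(means)) %>%
mutate(Z_Avg = (Z_A + Z_B)/2)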
# A tibble: 3 x 6
Name A B Z_A Z_B Z_Avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
3 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
Here is one approach with long and wide format. For z-transformation, you can use the base function scale. Furthermore, this approach includes a join to combine the original data frame and the one including the new values.
dfLong <- dfData %>%
gather(Factor, Value, A:B) %>%
group_by(Factor) %>%
mutate(ValueZ = scale(Value))
# Name Factor Value ValueZ
# <fctr> <chr> <dbl> <dbl>
# 1 Steve A 10.0 -0.7559289
# 2 Jwan A 20.0 1.1338934
# 3 Ashley A 12.0 -0.3779645
# 4 Steve B 0.2 -0.8728716
# 5 Jwan B 0.3 -0.2182179
# 6 Ashley B 0.5 1.0910895
dfWide <- dfData %>% inner_join(dfLong %>%
ungroup %>%
select(-Value) %>%
mutate(Factor = paste0("Z_", Factor)) %>%
spread(Factor, ValueZ) %>%
mutate(Z_Avg = (Z_A + Z_B) / 2))
# Name A B Z_A Z_B Z_Avg
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
# 2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
# 3 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
I would just do it all in wide format. No need to keep switching between the long and wide formats.
dfData %>%
mutate(Z_A = (A - mean(A)) / sd(A),
Z_B = (B - mean(B)) / sd(B)) %>%
mutate(Z_AVG = (Z_A + Z_B) / 2)