Aggregate column - how to handle uneven dataframe

I want to aggregate per Chr in blocks of three rows (the sum of every three consecutive rows). However, since the number of rows per Chr is not always divisible by 3, I am not sure how to handle the remaining rows at the end, which could be just 1 or 2 rows. If two rows remain, I would like to sum just those 2 remaining rows.
Input
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
value = c(1,3,1,3,5,6,3,1,3,5,0),
seq = c(1,2,3,4,5,1,2,3,4,5,6))
Output (using dplyr mutate, keeping all columns)
data.frame(Chr = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
value = c(1,3,1,3,5,6,3,1,3,5,0),
seq = c(1,2,3,4,5,1,2,3,4,5,6),
agg = c(5,5,5,8,8,10,10,10,8,8,0))

Here is a similar but slightly different approach that tries out the new .by argument.
I wanted to use .by for the last mutate as well, but I could not combine cumsum(...) with .by:
library(dplyr)
df %>%
mutate(group = as.integer(gl(n(),3,n())), .by=Chr) %>%
mutate(id = row_number(), .by = c(Chr, group)) %>%
group_by(Chr, sumgroup = cumsum(id == 1)) %>%
mutate(agg = sum(value)) %>%
ungroup() %>%
select(Chr, value, seq, agg)
Chr value seq agg
<chr> <dbl> <dbl> <dbl>
1 chr1 1 1 5
2 chr1 3 2 5
3 chr1 1 3 5
4 chr1 3 4 8
5 chr1 5 5 8
6 chr2 6 1 10
7 chr2 3 2 10
8 chr2 1 3 10
9 chr2 3 4 8
10 chr2 5 5 8
11 chr3 0 6 0

An option with dplyr & ave
library(dplyr) # >= 1.1.0
df1 %>%
mutate(agg = ave(value, as.integer(gl(n(), 3, n())), FUN = sum), .by = Chr)
-output
Chr value seq agg
1 chr1 1 1 5
2 chr1 3 2 5
3 chr1 1 3 5
4 chr1 3 4 8
5 chr1 5 5 8
6 chr2 6 1 10
7 chr2 3 2 10
8 chr2 1 3 10
9 chr2 3 4 8
10 chr2 5 5 8
11 chr3 0 6 0
Or with data.table
library(data.table)
setDT(df1)[, agg := .SD[, rep(sum(value), .N),
as.integer(gl(.N, 3, .N))]$V1, Chr]
-output
> df1
Chr value seq agg
1: chr1 1 1 5
2: chr1 3 2 5
3: chr1 1 3 5
4: chr1 3 4 8
5: chr1 5 5 8
6: chr2 6 1 10
7: chr2 3 2 10
8: chr2 1 3 10
9: chr2 3 4 8
10: chr2 5 5 8
11: chr3 0 6 0

You can use (row_number()-1) %/% 3 to group per 3 observations.
library(dplyr)
df %>%
mutate(Grp = (row_number()-1) %/% 3 + 1, .by = Chr) %>%
mutate(agg = sum(value), .by = c(Chr, Grp))
# Chr value seq Grp agg
# 1 chr1 1 1 1 5
# 2 chr1 3 2 1 5
# 3 chr1 1 3 1 5
# 4 chr1 3 4 2 8
# 5 chr1 5 5 2 8
# 6 chr2 6 1 1 10
# 7 chr2 3 2 1 10
# 8 chr2 1 3 1 10
# 9 chr2 3 4 2 8
# 10 chr2 5 5 2 8
# 11 chr3 0 6 1 0
If the Grp column is not needed, you can drop it with select(-Grp).

Dividing the row number (within Chr groups) by 3 and rounding this ratio up gives 1 for the first 3 values, 2 for the next 3 values, etc. You can then group by Chr and this variable to compute your sums:
d %>%
group_by(Chr) %>%
mutate(Chr_group = ceiling(row_number()/3)) %>%
group_by(Chr, Chr_group) %>%
mutate(agg = sum(value)) %>%
ungroup()
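
As a rough base R sketch of the same idea, the block below rebuilds the question's input under the assumed name df, derives the row position within each Chr, and checks that the integer-division and ceiling formulas give the same chunk index. The helper names pos, grp_floor, grp_ceiling and key are illustrative only and do not come from any of the answers.

```r
# Copy of the question's input; the name `df` is an assumption
df <- data.frame(
  Chr   = c("chr1","chr1","chr1","chr1","chr1","chr2","chr2","chr2","chr2","chr2","chr3"),
  value = c(1,3,1,3,5,6,3,1,3,5,0),
  seq   = c(1,2,3,4,5,1,2,3,4,5,6)
)

# Row position within each Chr: 1, 2, 3, ... restarting per chromosome
pos <- ave(seq_along(df$Chr), df$Chr, FUN = seq_along)

# Two equivalent chunk indices: rows 1-3 -> 1, rows 4-6 -> 2, ...
grp_floor   <- (pos - 1) %/% 3 + 1
grp_ceiling <- ceiling(pos / 3)

# Sum `value` within each (Chr, chunk) pair; a short final chunk
# simply sums whatever rows it contains
key <- paste(df$Chr, grp_floor)
df$agg <- ave(df$value, key, FUN = sum)
df
```

Because a trailing chunk of one or two rows is just a smaller group, no special handling is needed for data that is not divisible by 3.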

Related

how to pipe all these commands dplyr aggregate groupby mixedorder in R

assume this is my dataset
library(gtools)
library(dplyr)
df <- data.frame(grp=c(0.5,0.6,1,2,2,2,4.5,10,22,"kids","Parents","Teachers"),
f1= c(1,0,3,2,4,0,3,0,1,6,8,4),
f2= c(1,0,3,1,4,0,1,0,1,5,8,4),
f3= c(1,0,3,2,4,6,1,2,1,6,8,4))
df
grp f1 f2 f3
1 0.5 1 1 1
2 0.6 0 0 0
3 1 3 3 3
4 2 2 1 2
5 2 4 4 4
6 2 0 0 6
7 4.5 3 1 1
8 10 0 0 2
9 22 1 1 1
10 kids 6 5 6
11 Parents 8 8 8
12 Teachers 4 4 4
and this is my desired output
df_final
grp f1 f2 f3
1 <=1 4 4 4
2 2-9 9 6 13
3 10-19 0 0 2
4 >20 1 1 1
5 kids 6 5 6
6 Parents 8 8 8
7 Teachers 4 4 4
This is what I did, with my questions as comments:
############ how NOT to split the data set into two subsets
df_1 <- df %>%
filter(grepl('kids|Parents|Teachers', grp))
df_1
grp f1 f2 f3
1 kids 6 5 6
2 Parents 8 8 8
3 Teachers 4 4 4
df_2 <- df %>%
filter(!grepl('kids|Parents|Teachers', grp)) %>%
mutate(across(.cols = grp, .fns = as.numeric)) %>%
mutate(grp= cut(grp, breaks=c(-999,2,10,21,999) , labels=c("<=1", "2-9","10-19",">20"), right=F))
df_2
grp f1 f2 f3
1 <=1 1 1 1
2 <=1 0 0 0
3 <=1 3 3 3
4 2-9 2 1 2
5 2-9 4 4 4
6 2-9 0 0 6
7 2-9 3 1 1
8 10-19 0 0 2
9 >20 1 1 1
### how to pipe both aggregate and mixedorder/sort instead of separate lines of code
df_2 <- aggregate(.~grp, data = df_2, FUN=sum)
df_2[mixedorder(df_2$grp, decreasing = T),]
df_2
grp f1 f2 f3
1 <=1 4 4 4
2 2-9 9 6 13
3 10-19 0 0 2
4 >20 1 1 1
### how to make sure 10-19 does not come before 2-9 in case of actual dataset
grp a b d
1 <=1 53 48 53
2 10-15 65 63 65
3 2-9 30 40 30
df_final <- rbind(df_2, df_1)
df_final
grp f1 f2 f3
1 <=1 4 4 4
2 2-9 9 6 13
3 10-19 0 0 2
4 >20 1 1 1
5 kids 6 5 6
6 Parents 8 8 8
7 Teachers 4 4 4
Is there any neat way to get from original df to df_final all in dplyr by just piping commands?
How can I avoid splitting the data set into two subsets?
How can I pipe both aggregate and mixedorder/sort instead of using separate lines of code?
How can I make sure 10-19 does not come before 2-9 in the actual dataset?

Here is one option: create a second column ('grp2') by applying cut to the numeric elements only, then coalesce it with the original column while appending the remaining values to the factor levels, and finally do a group_by/summarise with across. This way we don't have to use mixedsort, because cut already returns its levels in the desired order
library(dplyr)
library(stringr)
df %>%
mutate(grp2 = case_when(str_detect(grp, '^[0-9.]+$')
~ cut(as.numeric(grp), breaks=c(-999,2,10,21,999) ,
labels=c("<=1", "2-9","10-19",">20"), right=FALSE))) %>%
mutate(grp =factor(coalesce(grp2, grp),
levels = c(levels(grp2), unique(grp[is.na(grp2)]))), .keep = "unused") %>%
group_by(grp) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
-output
# A tibble: 7 × 4
grp f1 f2 f3
<fct> <dbl> <dbl> <dbl>
1 <=1 4 4 4
2 2-9 9 6 13
3 10-19 0 0 2
4 >20 1 1 1
5 kids 6 5 6
6 Parents 8 8 8
7 Teachers 4 4 4
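
To illustrate why cut removes the need for mixedsort, here is a small base R sketch that uses only the numeric grp values from the question. The names x and bins are illustrative. cut returns a factor whose levels follow the order of the breaks, so sorting or grouping by it keeps 2-9 ahead of 10-19, which plain character sorting would not.

```r
# Numeric part of the question's grp column
x <- c(0.5, 0.6, 1, 2, 2, 2, 4.5, 10, 22)

# Left-closed bins: [-999, 2), [2, 10), [10, 21), [21, 999)
bins <- cut(x,
            breaks = c(-999, 2, 10, 21, 999),
            labels = c("<=1", "2-9", "10-19", ">20"),
            right  = FALSE)

# Levels are stored in break order, not alphabetical order
levels(bins)

# Counts per bin come out in that same order
table(bins)

# Sorting a factor follows its level order
sort(unique(bins))
```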

Recode character IDs into numeric IDs

I need to modify the values of an id variable. Here is what the sample data looks like:
df <- data.frame(id = c(11,21,22,"33_AS_A","33_AS_B","33_AS_X", "35_Part1","35_Part2","35_Part4","35_Part7"),
Grade= c(3,3,3, 4,4,4,5,5,5,5))
> df
id Grade
1 11 3
2 21 3
3 22 3
4 33_AS_A 4
5 33_AS_B 4
6 33_AS_X 4
7 35_Part1 5
8 35_Part2 5
9 35_Part4 5
10 35_Part7 5
I need to recode id as a numeric variable by replacing the text parts with ordered numeric values, in order of appearance.
Here is what my desired output looks like:
> df2
id Grade
1 11 3
2 21 3
3 22 3
4 331 4
5 332 4
6 333 4
7 351 5
8 352 5
9 353 5
10 354 5
Any ideas?

library(dplyr)
library(stringr)
df %>%
mutate(
group = str_extract(id, "[0-9]+")
) %>%
group_by(group) %>%
mutate(id = as.numeric(paste0(group, if(n() > 1) row_number() else ""))) %>%
ungroup() %>%
select(-group)
# # A tibble: 10 × 2
# id Grade
# <dbl> <dbl>
# 1 11 3
# 2 21 3
# 3 22 3
# 4 331 4
# 5 332 4
# 6 333 4
# 7 351 5
# 8 352 5
# 9 353 5
#10 354 5

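Using base R: split into groups based on the leading number; if a group has more than one element, append the row number within the group. This assumes the rows are already ordered by that leading number, since unlist returns the values in group order: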
x <- sapply(strsplit(df$id, "_"), `[`, 1)
df$ID <- unlist(sapply(split(x, x), function(i)
if(length(i) == 1) i else paste0(i, seq(i))))
df
# id Grade ID
# 1 11 3 11
# 2 21 3 21
# 3 22 3 22
# 4 33_AS_A 4 331
# 5 33_AS_B 4 332
# 6 33_AS_X 4 333
# 7 35_Part1 5 351
# 8 35_Part2 5 352
# 9 35_Part4 5 353
# 10 35_Part7 5 354
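
A variant that does not rely on the rows being sorted by prefix could use ave, which writes results back to the original positions. This is only a sketch: the names ids, prefix, n_in_group, idx and new_id are made up for illustration, and the prefix is assumed to be everything before the first underscore.

```r
# The id values from the question, as character strings
ids <- c("11", "21", "22", "33_AS_A", "33_AS_B", "33_AS_X",
         "35_Part1", "35_Part2", "35_Part4", "35_Part7")

# Leading number: drop everything from the first underscore onward
prefix <- sub("_.*$", "", ids)

# Size of each prefix group and running index within it,
# both aligned with the original row order
n_in_group <- ave(seq_along(prefix), prefix, FUN = length)
idx        <- ave(seq_along(prefix), prefix, FUN = seq_along)

# Append the running index only when the prefix occurs more than once
new_id <- as.numeric(ifelse(n_in_group > 1, paste0(prefix, idx), prefix))
new_id
```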

Parse one column of json and bind with other column to make dataframe

I have data that takes the format:
have <- structure(list(V1 = c(4L, 28L, 2L),
V2 = c("[{\"group\":1,\"topic\":\"A\"},{\"group\":1,\"topic\":\"B\"},{\"group\":2,\"topic\":\"C\"},{\"group\":2,\"topic\":\"T\"},{\"group\":2,\"topic\":\"U\"},{\"group\":3,\"topic\":\"V\"},{\"group\":3,\"topic\":\"D\"},{\"group\":3,\"topic\":\"R\"},{\"group\":4,\"topic\":\"A\"},{\"group\":4,\"topic\":\"Q\"},{\"group\":4,\"topic\":\"S\"},{\"group\":4,\"topic\":\"W\"},{\"group\":6,\"topic\":\"O\"},{\"group\":6,\"topic\":\"P\"},{\"group\":6,\"topic\":\"E\"},{\"group\":6,\"topic\":\"F\"},{\"group\":6,\"topic\":\"G\"},{\"group\":6,\"topic\":\"H\"},{\"group\":6,\"topic\":\"I\"},{\"group\":6,\"topic\":\"J\"},{\"group\":6,\"topic\":\"K\"},{\"group\":6,\"topic\":\"L\"},{\"group\":6,\"topic\":\"M\"},{\"group\":6,\"topic\":\"N\"}]",
"[]",
"[{\"group\":2,\"topic\":\"C\"},{\"group\":3,\"topic\":\"D\"},{\"group\":6,\"topic\":\"O\"},{\"group\":6,\"topic\":\"P\"},{\"group\":6,\"topic\":\"E\"},{\"group\":6,\"topic\":\"G\"},{\"group\":6,\"topic\":\"M\"}]")
),
row.names = c(NA, 3L),
class = "data.frame")
The contents of V2 are nested groupings for each row like [{"group":1,"topic":"A"},{"group":1,"topic":"B"}...]
I want to get a wide dataframe that has an indicator (1/0) for each combination of group+topic (see also_have) for each row. Something like this:
# A tibble: 3 x 4
id topic_id_1 topic_id_2 topic_id_3 topic_id_4 ...
<dbl> <dbl> <dbl> <dbl>
1 4 1 1 0
2 28 0 0 0
3 2 0 0 0
The first step is to parse the json.
I can use purrr::map(have$V2, jsonlite::fromJSON) to unnest into a list, but I'm not sure how to bind the V1 column (that we might rename to id) to each element of the resulting list (note that list element two is empty because V1==28 is empty). Here's a snippet of what the first element might look like with the id (V1) added.
[[1]]
group topic id
1 1 A 4
2 1 B 4
3 2 C 4
4 2 T 4
...
Alternatively, I think purrr::map_df(have$V2, jsonlite::fromJSON) would get me closer to what I ultimately need, but here too I'm not sure how to add the row id (V1).
df <- purrr::map_df(have$V2, jsonlite::fromJSON)
df
What I get:
group topic
1 1 A
2 1 B
3 2 C
4 2 T
...
What I want (notice `V1==28` does not appear):
group topic id
1 1 A 4
2 1 B 4
3 2 C 4
4 2 T 4
5 2 U 4
6 3 V 4
7 3 D 4
8 3 R 4
9 4 A 4
10 4 Q 4
11 4 S 4
12 4 W 4
13 6 O 4
14 6 P 4
15 6 E 4
16 6 F 4
17 6 G 4
18 6 H 4
19 6 I 4
20 6 J 4
21 6 K 4
22 6 L 4
23 6 M 4
24 6 N 4
25 2 C 2
26 3 D 2
27 6 O 2
28 6 P 2
29 6 E 2
30 6 G 2
31 6 M 2
STOP.
I think if I can get the above dataframe with id I can get the rest of the way. The ultimate goal is to join this info with also_have and then pivot wide.
# join
also_have <- expand_grid(c(1:6), c(LETTERS)) %>%
mutate(topic_id = 1:n()) %>%
magrittr::set_colnames(c("group", "topic", "topic_id")) %>%
select(topic_id, group, topic)
# pivot wide
# A tibble: 3 x 4
id topic_id_1 topic_id_2 topic_id_3 topic_id_4 ...
<dbl> <dbl> <dbl> <dbl>
1 4 1 1 0
2 28 0 0 0
3 2 0 0 0
Update:
Applying @akrun's solution:
purrr::map_dfr(setNames(have$V2, have$V1),
jsonlite::fromJSON,
.id = 'V1') %>%
rename(id = V1) %>%
left_join(also_have, by=c("group", "topic")) %>%
select(-group, -topic) %>%
mutate(value = 1) %>%
pivot_wider(id_cols = id,
names_from = topic_id,
names_prefix = "topic_id",
values_from = value,
values_fill = 0
) %>%
full_join(tibble(id = as.character(have$V1))) %>%
replace(is.na(.), 0)
# A tibble: 3 x 25
id topic_id1 topic_id2 topic_id29 topic_id46 topic_id47 topic_id74 topic_id56
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 1 1 1 1 1 1 1
2 2 0 0 1 0 0 0 1
3 28 0 0 0 0 0 0 0
# … with 17 more variables: topic_id70 <dbl>, topic_id79 <dbl>, topic_id95 <dbl>,
# topic_id97 <dbl>, topic_id101 <dbl>, topic_id145 <dbl>, topic_id146 <dbl>,
# topic_id135 <dbl>, topic_id136 <dbl>, topic_id137 <dbl>, topic_id138 <dbl>,
# topic_id139 <dbl>, topic_id140 <dbl>, topic_id141 <dbl>, topic_id142 <dbl>,
# topic_id143 <dbl>, topic_id144 <dbl>

We could pass a named vector and then use .id in map_dfr
purrr::map_dfr(setNames(have$V2, have$V1), jsonlite::fromJSON, .id = 'id')
-output
id group topic
1 4 1 A
2 4 1 B
3 4 2 C
4 4 2 T
5 4 2 U
6 4 3 V
7 4 3 D
8 4 3 R
9 4 4 A
10 4 4 Q
11 4 4 S
12 4 4 W
...
Or this can be done within the dplyr framework itself, using rowwise
library(dplyr)
library(tidyr)
library(jsonlite)
have %>%
rowwise %>%
transmute(ID = V1, V2 = list(fromJSON(V2))) %>%
ungroup %>%
unnest(c(V2), keep_empty = TRUE) %>%
select(-V2)
# A tibble: 32 x 3
ID group topic
<int> <int> <chr>
1 4 1 A
2 4 1 B
3 4 2 C
4 4 2 T
5 4 2 U
6 4 3 V
7 4 3 D
8 4 3 R
9 4 4 A
10 4 4 Q
# … with 22 more rows
For the second step do a join
out <- have %>%
rowwise %>%
transmute(ID = V1, V2 = list(fromJSON(V2))) %>%
ungroup %>%
unnest(c(V2), keep_empty = TRUE) %>%
select(-V2) %>% right_join(also_have)
out
Joining, by = c("group", "topic")
# A tibble: 163 x 4
ID group topic topic_id
<int> <int> <chr> <int>
1 4 1 A 1
2 4 1 B 2
3 4 2 C 29
4 4 2 T 46
5 4 2 U 47
6 4 3 V 74
7 4 3 D 56
8 4 3 R 70
9 4 4 A 79
10 4 4 Q 95
# … with 153 more rows
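
For the final wide step, a base R alternative to pivot_wider is sketched below. It assumes a long data frame with one row per (id, topic_id) pair; the toy values in long are made up for illustration and are not the full result from the question. Declaring id as a factor with every id as a level keeps the row for id 28 even though it has no topics.

```r
# Hypothetical long-format data: one row per (id, topic_id) pair
long <- data.frame(id       = c(4, 4, 4, 2, 2),
                   topic_id = c(1, 2, 29, 29, 56))

# Every id from the original data, including the one with an empty JSON array
all_ids <- c(4, 28, 2)

# Cross-tabulate; the explicit factor levels force a row for ids without topics
tab <- table(id = factor(long$id, levels = all_ids),
             topic_id = long$topic_id)

# Turn counts into a 1/0 indicator matrix and label the columns
ind <- (unclass(tab) > 0) * 1L
colnames(ind) <- paste0("topic_id_", colnames(ind))
ind
```

This avoids the extra full_join that is otherwise needed to bring back ids that disappear during unnesting.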

Compare and count the frequency of pairs of entries in two columns

I have two columns (v5 & v6) in a matrix where both columns have entries between 0 and 5 as
head(matrix)
v1 v2 ... v5 v6
[1,] 0 5
[2,] 1 3
[3,] 2 1
[4,] 4 1
[5,] 2 2
I want to construct a new (6*6) matrix that contains the number of occurrences of each pair of values in the two columns, as
new_matrix
0 1 2 3 4 5
0 2326 2882 2587 734 341 0
1 50 17 103 14 0 6
2 ......
3 .......
4 ......
5 .......
In other words, I want to know how many times each pair (0,0), (0,1), ..., (0,5), ..., (5,5) occurs across the two columns.
I used library(plyr) as
freq <- ddply(matrix, .(matrix$v5, matrix$v6), nrow)
names(freq) <- c("v5", "v6", "Freq")
But this will not give the needed result!

With tidyverse, you can arrive at this answer using the usual group_by operations.
Sample data
I'm creating column names to make it easier to convert to tibble.
set.seed(123)
M <- matrix(sample(0:5, 100, TRUE),
sample(0:5, 100, TRUE),
ncol = 2,
nrow = 100,
dimnames = list(NULL, c("colA", "colB")))
Solution
library("tidyverse")
as_tibble(M) %>%
arrange(colA, colB) %>%
group_by(colA, colB) %>%
summarise(num_pairs = n(), .groups = "drop") %>%
pivot_wider(names_from = colB, values_from = num_pairs) %>%
remove_rownames()
Preview
# A tibble: 6 x 7
colA `0` `1` `2` `4` `5` `3`
<int> <int> <int> <int> <int> <int> <int>
1 0 4 4 4 2 4 NA
2 1 2 2 4 6 2 NA
3 2 6 4 NA 2 6 NA
4 3 2 NA NA 4 6 2
5 4 NA 2 6 NA 2 4
6 5 6 2 4 4 2 2
Comments
You have asked:
I mean that I want to know how many pairs (0,0) , (0,1), ...,
(0,5),... (5,5) are in both columns?
This answer gives you that. The question is how important it is for you to have the results stored as a matrix. You can convert the result into a matrix by calling as.matrix on it. I would likely stop after summarise(num_pairs = n(), .groups = "drop"), as that gives very usable results that are easy to subset, join and so forth.

We can also use table
table(as.data.frame(M))
-output
# colB
#colA 0 1 2 3 4 5
# 0 4 4 4 0 2 4
# 1 2 2 4 0 6 2
# 2 6 4 0 0 2 6
# 3 2 0 0 2 4 6
# 4 0 2 6 4 0 2
# 5 6 2 4 2 4 2
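
Note that table only reports values that actually occur. One possible way to always get the full 6 x 6 layout, with zeros for pairs that never appear, is to convert both columns to factors with explicit levels first. The vectors v5 and v6 below reuse the five rows shown in the question and merely stand in for the real matrix columns.

```r
# The five example rows from the question
v5 <- c(0, 1, 2, 4, 2)
v6 <- c(5, 3, 1, 1, 2)

# Explicit levels 0..5 guarantee a 6 x 6 table, including empty cells
counts <- table(v5 = factor(v5, levels = 0:5),
                v6 = factor(v6, levels = 0:5))
counts

# Look up a single pair by its labels, e.g. how often (0, 5) occurs
counts["0", "5"]
```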

Only rows where difference between them is less than 'n' in groups

Let's say we have the below dataset where values in V2 are ordered ascending in groups V1:
Input =(" V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
10 B 12
11 C 13
12 C 14
13 C 18")
df = as.data.frame(read.table(textConnection(Input), header = T, row.names = 1))
Now I want to keep rows where the difference between consecutive ones is <= 1, so my desired output:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 12
6 A 13
7 B 4
8 B 5
9 B 6
11 C 13
12 C 14
However when I use:
df %>%
group_by(V1) %>%
filter(c(0,diff(V2)) <= 1)
I have:
V1 V2
1 A 3
2 A 4
3 A 5
4 A 6
5 A 13
6 B 4
7 B 5
8 B 6
9 C 13
10 C 14
The row with V2 value 12 is missing and it should be in dataset. I tried also with lag() but result is same.
df %>%
group_by(V1) %>%
filter(V2 - lag(V2) <= 1 | is.na(V2 - lag(V2)))
Could you point out my mistake?

You need to compare each value with its neighbors on both sides, so that the first row of a run is kept as well. Try lead and lag:
library(dplyr)
df %>%
group_by(V1) %>%
filter(V2 - lag(V2) <= 1 | lead(V2) - V2 <= 1)
# V1 V2
# <chr> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
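
To see why the one-sided check loses the 12, here is a small base R sketch on the group A values alone. The names v, gap_prev and gap_next are illustrative. A row that starts a new run has a large gap to the previous value but a small gap to the next one, so it only survives when both directions are checked.

```r
# V2 values of group A from the question
v <- c(3, 4, 5, 6, 12, 13)

# Gap to the previous and to the next value (NA at the edges)
gap_prev <- c(NA, diff(v))
gap_next <- c(diff(v), NA)

# One-sided rule, as in the question: keep the first row or rows close to the previous one
keep_one_sided <- is.na(gap_prev) | gap_prev <= 1

# Two-sided rule: keep rows close to either neighbor
keep_two_sided <- (!is.na(gap_prev) & gap_prev <= 1) |
                  (!is.na(gap_next) & gap_next <= 1)

v[keep_one_sided]   # 12 is dropped because its gap to 6 is too large
v[keep_two_sided]   # 12 is kept because 13 follows directly
```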

Here is another idea: create groups of consecutive values with a tolerance of 1, then filter out the groups that contain only one observation:
df %>%
group_by(V1, grp = cumsum(c(TRUE, diff(V2) > 1))) %>%
filter(n() > 1) %>%
ungroup() %>%
select(-grp)
# A tibble: 11 x 2
# V1 V2
# <fct> <int>
# 1 A 3
# 2 A 4
# 3 A 5
# 4 A 6
# 5 A 12
# 6 A 13
# 7 B 4
# 8 B 5
# 9 B 6
#10 C 13
#11 C 14
