Grouping and stacking data - r

A sample of my data:
dat <- read.table(text = " ID BC1 DC1 DE1 MN2 DC2 PO2 SA3 BC3 KL3 AA4 AP4 BC4 PO4
1 2 1 2 3 1 3 1 1 3 2 2 2 2
2 3 1 1 2 3 1 1 2 3 1 1 3 2
3 2 3 2 3 2 3 2 1 1 3 1 1 1
4 3 3 1 1 1 1 1 2 2 1 2 1 2", header = TRUE)
I want to get the following table, with missing values left blank:
ID Group1 Group2 Group3 Group4
1 2 1 2
2 3 1 1
3 2 3 2
4 3 3 1
1 3 1 3
2 2 3 1
3 3 2 3
4 1 1 1
1 1 1 3
2 1 2 3
3 2 1 1
4 1 2 2
1 2 2 2 2
2 1 1 3 2
3 3 1 1 1
4 1 2 1 2
The number at the end of each column name identifies the set that column belongs to. For example, BC1, DC1 and DE1 form the first four rows together with their IDs; MN2, DC2 and PO2 form the next four rows with their IDs, and so on.

What about using the row numbers with some pivoting?
library(dplyr)
library(tidyr)
dat |>
pivot_longer(-ID, names_sep = "(?=\\d)", names_to = c(NA, "id")) |>
group_by(ID, id) |>
mutate(name = row_number()) |>
pivot_wider(id_cols = c(ID, id), names_prefix = "Group") |>
arrange(id) |>
ungroup() |>
select(-id)
Or using a map:
library(purrr)
library(dplyr)
paste(1:4) |> # unique(readr::parse_number(names(dat |> select(-ID))))
map(\(x) select(dat, ID, ends_with(x)) |> rename_with(\(nm) paste0("Group", seq_along(nm)), -ID)) |>
bind_rows()
Output:
# A tibble: 16 × 5
ID Group1 Group2 Group3 Group4
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
Update 13-01: Now the first solution returns the correct ID (not id) + another approach added.

Would be interesting to see if there is an easier approach:
library(tidyverse)
dat |>
pivot_longer(-ID) |>
mutate(id = str_extract(name, "\\d$")) |>
group_by(ID, id) |>
mutate(name = paste0("Group", row_number())) |>
ungroup() |>
pivot_wider(names_from = name, values_from = value) |>
arrange(id, ID) |>
select(-id)
#> # A tibble: 16 × 5
#> ID Group1 Group2 Group3 Group4
#> <int> <int> <int> <int> <int>
#> 1 1 2 1 2 NA
#> 2 2 3 1 1 NA
#> 3 3 2 3 2 NA
#> 4 4 3 3 1 NA
#> 5 1 3 1 3 NA
#> 6 2 2 3 1 NA
#> 7 3 3 2 3 NA
#> 8 4 1 1 1 NA
#> 9 1 1 1 3 NA
#> 10 2 1 2 3 NA
#> 11 3 2 1 1 NA
#> 12 4 1 2 2 NA
#> 13 1 2 2 2 2
#> 14 2 1 1 3 2
#> 15 3 3 1 1 1
#> 16 4 1 2 1 2
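The pivoting answers above all rely on two pieces of information per column: the trailing digit (which set the column belongs to) and its position within that set (which GroupN it becomes). A minimal base R sketch of that bookkeeping, using the column names from the question; the variable names `cols`, `grp` and `pos` are only illustrative:

```r
# Column names from the question, without ID
cols <- c("BC1", "DC1", "DE1", "MN2", "DC2", "PO2",
          "SA3", "BC3", "KL3", "AA4", "AP4", "BC4", "PO4")

# Strip the leading non-digits to get the set each column belongs to
grp <- sub("\\D+", "", cols)

# Running position of each column within its set
pos <- ave(seq_along(cols), grp, FUN = seq_along)

# Target column name for each original column
data.frame(column = cols, set = grp, target = paste0("Group", pos))
```

Only set 4 has a fourth column, which is why Group4 is NA for the first twelve rows of the result.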

You can rename the data with a specified pattern ("index1_index2"), i.e.
# ID 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3 4_1 4_2 4_3 4_4
# 1 1 2 1 2 3 1 3 1 1 3 2 2 2 2
# 2 2 3 1 1 2 3 1 1 2 3 1 1 3 2
# 3 3 2 3 2 3 2 3 2 1 1 3 1 1 1
# 4 4 3 3 1 1 1 1 1 2 2 1 2 1 2
so that you can add the special element ".value" to names_to when using pivot_longer() to stack multiple columns that are grouped by that pattern.
Code
library(dplyr)
library(tidyr)
dat %>%
rename_with(~ sub('\\D+', '', .x) %>%
paste(., ave(., ., FUN = seq_along), sep = '_'), -ID) %>%
pivot_longer(-ID, names_to = c("set", ".value"), names_sep = '_') %>%
arrange(set) %>%
select(-set)
Output
# A tibble: 16 × 5
ID `1` `2` `3` `4`
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
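To see what ".value" does in isolation, here is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not the question's data). The part of each name before the underscore goes into the `set` column, while the part after it becomes the name of an output column:

```r
library(tidyr)

# Hypothetical toy data: set 1 has two columns, set 2 has only one
toy <- data.frame(ID = 1:2,
                  `1_1` = c(10, 20),
                  `1_2` = c(11, 21),
                  `2_1` = c(30, 40),
                  check.names = FALSE)

long <- pivot_longer(toy, -ID,
                     names_to = c("set", ".value"),
                     names_sep = "_")
long
# Columns: ID, set, `1`, `2`; set 2 has no second column, so `2` is NA there
```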

Related

how to move up the values within each group in R

I need to shift valid values to the top of the data frame within each id. Here is an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
itemid = c(1,2,3,1,2,3,1,2,3,4),
values = c(1,NA,0,NA,NA,0,1,NA,0,NA))
df
id itemid values
1 1 1 1
2 1 2 NA
3 1 3 0
4 2 1 NA
5 2 2 NA
6 2 3 0
7 3 1 1
8 3 2 NA
9 3 3 0
10 3 4 NA
Excluding the id column, whenever the values column has a missing value, I want the non-missing values shifted to the top within each id.
How can I get this desired dataset below?
df1
id itemid values
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
Using tidyverse you can arrange by whether values is missing or not (which will put those at the bottom).
library(tidyverse)
df %>%
arrange(id, is.na(values))
Output
id itemid values
<dbl> <dbl> <dbl>
1 1 1 1
2 1 3 0
3 1 2 NA
4 2 3 0
5 2 1 NA
6 2 2 NA
7 3 1 1
8 3 3 0
9 3 2 NA
10 3 4 NA
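The reason sorting on is.na() works is that it returns a logical vector, and FALSE sorts before TRUE; ties keep their original order. A minimal base R sketch on a hypothetical vector `x`:

```r
x <- c(1, NA, 0, NA, 5)

# FALSE (not missing) sorts before TRUE (missing); ties keep input order
idx <- order(is.na(x))
idx     # 1 3 5 2 4
x[idx]  # 1 0 5 NA NA
```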
Or, if you wish to retain the same order for itemid and other columns, you can use mutate to reorder only the columns of interest (like values). Other answers provide good solutions, such as those by @Santiago and @ThomasIsCoding. If you have multiple columns in which NA should move to the bottom per group, you can also try:
df %>%
group_by(id) %>%
mutate(across(.cols = values, ~ .x[order(is.na(.x))]))
where the .cols argument would contain the columns to transform and reorder independently.
Output
id itemid values
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
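As a sketch of the multi-column case, assuming a hypothetical data frame `df2` with two value columns `a` and `b`, each column is reordered on its own, so rows are no longer aligned across `a` and `b` afterwards:

```r
library(dplyr)

# Hypothetical data with two columns to compact independently
df2 <- data.frame(id = c(1, 1, 1),
                  a  = c(NA, 1, 2),
                  b  = c(3, NA, NA))

out <- df2 %>%
  group_by(id) %>%
  mutate(across(c(a, b), ~ .x[order(is.na(.x))])) %>%
  ungroup()

out$a  # 1 2 NA
out$b  # 3 NA NA
```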
We can try ave + order
> transform(df, values = ave(values, id, FUN = function(x) x[order(is.na(x))]))
id itemid values
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
With data.table:
library(data.table)
setDT(df)[, values := values[order(is.na(values))], id][]
#> id itemid values
#> 1: 1 1 1
#> 2: 1 2 0
#> 3: 1 3 NA
#> 4: 2 1 0
#> 5: 2 2 NA
#> 6: 2 3 NA
#> 7: 3 1 1
#> 8: 3 2 0
#> 9: 3 3 NA
#> 10: 3 4 NA
I'd define a function that does what you want and then group by id:
completed_first <- function(x) {
completed <- x[!is.na(x)]
length(completed) <- length(x)
completed
}
library(dplyr)
df %>%
group_by(id) %>%
mutate(
values = completed_first(values)
) %>%
ungroup()
# # A tibble: 10 × 3
# id itemid values
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 2 0
# 3 1 3 NA
# 4 2 1 0
# 5 2 2 NA
# 6 2 3 NA
# 7 3 1 1
# 8 3 2 0
# 9 3 3 NA
# 10 3 4 NA
(This method preserves the order of itemid.)
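The helper works because assigning a larger length to a vector pads it with NA. A short base R illustration on a hypothetical vector:

```r
x <- c(NA, 0, NA, 1)

kept <- x[!is.na(x)]       # 0 1
length(kept) <- length(x)  # growing a vector pads it with NA
kept                       # 0 1 NA NA
```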
Or building upon ThomasIsCoding's answer:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
values = values[order(is.na(values))]
) %>%
ungroup()
# # A tibble: 10 × 3
# id itemid values
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 2 0
# 3 1 3 NA
# 4 2 1 0
# 5 2 2 NA
# 6 2 3 NA
# 7 3 1 1
# 8 3 2 0
# 9 3 3 NA
# 10 3 4 NA

count the different name considering as same

I want to count the number of fluctuations in the response column per id. However, No, no and DK should be treated as the same response, only for the purpose of counting fluctuations; I don't want to change the responses permanently.
df <- data.frame(
id=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4),
response=c("Yes","Yes","No","DK","no","No","No","no","No","Yes","Yes","DK","No","Yes","Yes","No","No","No","died","TO","Yes","No","Yes")
)
I am trying it using the following code:
library(tidyverse)
df <- df %>%
group_by(id) %>% fill(response) %>%
mutate(new = rleid(response), rn = row_number()) %>%
mutate(flactuation = case_when(rn >2 & duplicated(new) ~ 'No', rn > 2 ~ 'Yes')) %>%
mutate(numberofchange = sum(flactuation=="Yes", na.rm = T)) %>% select(-rn, -flactuation)
Expected
id response new numberofchange
<dbl> <chr> <int> <int>
1 1 Yes 1 1
2 1 Yes 1 1
3 1 No 2 1
4 1 DK 2 1
5 1 no 2 1
6 2 No 1 1
7 2 No 1 1
8 2 no 1 1
9 2 No 1 1
10 2 Yes 2 1
11 2 Yes 2 1
12 3 DK 1 2
13 3 No 1 2
14 3 Yes 2 2
15 3 Yes 2 2
16 3 No 3 2
17 3 No 3 2
18 4 No 1 5
19 4 died 2 5
20 4 TO 3 5
21 4 Yes 4 5
22 4 No 5 5
23 4 Yes 6 5
You could use data.table::rleid() to get the run-length indices.
library(dplyr)
df %>%
group_by(id) %>%
mutate(new = data.table::rleid(replace(response, response %in% c('no', 'DK'), "No")),
numberofchange = max(new) - 1) %>%
ungroup()
# A tibble: 23 × 4
id response new numberofchange
<dbl> <chr> <int> <dbl>
1 1 Yes 1 1
2 1 Yes 1 1
3 1 No 2 1
4 1 DK 2 1
5 1 no 2 1
6 2 No 1 1
7 2 No 1 1
8 2 no 1 1
9 2 No 1 1
10 2 Yes 2 1
11 2 Yes 2 1
12 3 DK 1 2
13 3 No 1 2
14 3 Yes 2 2
15 3 Yes 2 2
16 3 No 3 2
17 3 No 3 2
18 4 No 1 5
19 4 died 2 5
20 4 TO 3 5
21 4 Yes 4 5
22 4 No 5 5
23 4 Yes 6 5
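If data.table is not available, the same count can be sketched with base R's rle(): the number of changes is the number of runs minus one. The helper name `count_changes` is made up for this example, and the recoding is applied only to a temporary copy, so the original column stays untouched:

```r
df <- data.frame(
  id = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4),
  response = c("Yes","Yes","No","DK","no","No","No","no","No","Yes","Yes",
               "DK","No","Yes","Yes","No","No","No","died","TO","Yes","No","Yes")
)

# Hypothetical helper: treat "no" and "DK" as "No", then count run boundaries
count_changes <- function(x) {
  x <- replace(x, x %in% c("no", "DK"), "No")
  length(rle(x)$lengths) - 1L
}

changes <- tapply(df$response, df$id, count_changes)
changes
# id 1: 1, id 2: 1, id 3: 2, id 4: 5
```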

is there a way in R to fill missing groups absent of observations?

Say I have something like:
df<-data.frame(group=c(1, 1,1, 2,2,2,3,3,3,4,4, 1, 1,1),
group2=c(1,2,3,1,2,3,1,2,3,1,3, 1,2,3))
group group2
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 3
12 1 1
13 1 2
14 1 3
My goal is to count the number of unique instances for group= something and group2= something. Like so:
df1<-df%>%group_by(group, group2)%>% mutate(want=n())%>%distinct(group, group2, .keep_all=TRUE)
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 3 1
However, notice that group=4, group2=2 was not in my dataset to begin with. Is there some sort of autofill function that fills these non-observations with a zero, so that I can easily get the table below?
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 2 0
12 4 3 1
After getting the count, we can expand with complete to fill the missing combinations with 0
library(dplyr)
library(tidyr)
df %>%
count(group, group2) %>%
complete(group, group2, fill = list(n = 0))
# A tibble: 12 x 3
# group group2 n
# <dbl> <dbl> <dbl>
# 1 1 1 2
# 2 1 2 2
# 3 1 3 2
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 1
# 8 3 2 1
# 9 3 3 1
#10 4 1 1
#11 4 2 0
#12 4 3 1
Or, instead of group_by + mutate followed by distinct, use summarise directly:
df %>%
group_by(group, group2) %>%
summarise(n = n()) %>%
ungroup %>%
complete(group, group2, fill = list(n = 0))
Here is a data.table solution to this problem:
library(data.table)
setDT(df)[CJ(group, group2, unique = TRUE),
c(.SD, .(want = .N)), .EACHI,
on = c("group", "group2")]
# group group2 want
# 1 1 2
# 1 2 2
# 1 3 2
# 2 1 1
# 2 2 1
# 2 3 1
# 3 1 1
# 3 2 1
# 3 3 1
# 4 1 1
# 4 2 0
# 4 3 1
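For comparison, a base R sketch: table() cross-tabulates every combination of the observed levels, so absent pairs come out as zero without an explicit completion step. Note that `group` and `group2` become factors in the result, which may or may not suit later steps:

```r
df <- data.frame(group  = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 1, 1, 1),
                 group2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3))

# Every group x group2 combination gets a row, including unobserved ones
counts <- as.data.frame(table(group = df$group, group2 = df$group2),
                        responseName = "want")

counts[counts$group == "4", ]
# group 4 / group2 2 appears with want = 0
```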

How to arrange/sort by unique sequences?

A) Here is my data frame arranged by plate:
df <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
plate phase score
A 1 1
A 1 1
A 1 1
A 2 1
A 2 1
A 2 1
A 3 2
A 3 2
A 3 2
B 1 1
B 1 1
B 1 2
B 2 1
B 2 1
B 2 3")
B) Goal: I want to order it by plate first and then by phase, but sequentially (see below how the rows are ordered alphabetically by plate but sequentially by phase):
plate phase score
<chr> <int> <int>
1 A 1 1
2 A 2 1
3 A 3 2
4 A 1 1
5 A 2 1
6 A 3 2
7 A 1 1
8 A 2 1
9 A 3 2
10 B 1 1
11 B 2 1
12 B 1 1
13 B 2 1
14 B 1 2
15 B 2 3
One option is to create a sequence variable grouped by 'plate', 'phase' and arrange on it along with 'plate' and 'score'
library(dplyr)
df %>%
group_by(plate, phase) %>%
mutate(rn = row_number()) %>%
ungroup %>%
arrange(plate, rn, score) %>%
select(-rn)
# A tibble: 15 x 3
# plate phase score
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 2
# 4 A 1 1
# 5 A 2 1
# 6 A 3 2
# 7 A 1 1
# 8 A 2 1
# 9 A 3 2
#10 B 1 1
#11 B 2 1
#12 B 1 1
#13 B 2 1
#14 B 1 2
#15 B 2 3
Or using data.table
library(data.table)
setDT(df)[order(plate, rowid(phase), score)]
Or in base R, with ave to number the occurrences of each phase:
df[with(df, order(plate, ave(phase, phase, FUN = seq_along), phase)),]
#> plate phase score
#> 1 A 1 1
#> 4 A 2 1
#> 7 A 3 2
#> 2 A 1 1
#> 5 A 2 1
#> 8 A 3 2
#> 3 A 1 1
#> 6 A 2 1
#> 9 A 3 2
#> 10 B 1 1
#> 13 B 2 1
#> 11 B 1 1
#> 14 B 2 1
#> 12 B 1 2
#> 15 B 2 3
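All three answers use the same idea: number the occurrences of each phase, then sort on that occurrence counter before the phase itself. A small base R sketch on a hypothetical single-plate vector shows the intermediate counter:

```r
phase <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)

# Occurrence counter: 1st, 2nd, 3rd time each phase value appears
occ <- ave(phase, phase, FUN = seq_along)
occ                      # 1 2 3 1 2 3 1 2 3

# Sort by occurrence first, then by phase
ord <- order(occ, phase)
ord                      # 1 4 7 2 5 8 3 6 9
phase[ord]               # 1 2 3 1 2 3 1 2 3
```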

R cummax function with NA

data
data=data.frame("person"=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
"score"=c(1,2,1,2,3,1,3,NA,4,2,1,NA,2,NA,3,1,2,4),
"want"=c(1,2,1,2,3,3,3,3,4,2,1,1,2,2,3,3,3,4))
attempt
library(dplyr)
data = data %>%
group_by(person) %>%
mutate(wantTEST = ifelse(score >= 3 | (row_number() >= which.max(score == 3)),
cummax(score), score),
wantTEST = replace(wantTEST, duplicated(wantTEST == 4) & wantTEST == 4, NA))
I am basically trying to use the cummax function, but only under specific circumstances. I want to keep the values as they are (1-2-1-1) unless a 3 or 4 occurs, after which the running maximum applies: (1-2-1-3-2-1-4) should become (1-2-1-3-3-3-4). If there is an NA value I want to carry the previous value forward. Thank you.
Here's one way with tidyverse. You may want to use fill() after group_by() but that's somewhat unclear.
library(dplyr)
library(tidyr)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(
w = ifelse(cummax(score) > 2, cummax(score), score)
) %>%
ungroup()
# A tibble: 18 x 4
person score want w
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 2 2
3 1 1 1 1
4 1 2 2 2
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 1 3 3 3
9 1 4 4 4
10 2 2 2 2
11 2 1 1 1
12 2 1 1 1
13 2 2 2 2
14 2 2 2 2
15 2 3 3 3
16 2 1 3 3
17 2 2 3 3
18 2 4 4 4
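The core of this answer can be checked on a plain vector, here person 1's scores after filling the NA. The running maximum only takes over once it exceeds 2; before that the original score is kept:

```r
score <- c(1, 2, 1, 2, 3, 1, 3, 3, 4)

running <- cummax(score)                   # 1 2 2 2 3 3 3 3 4
w <- ifelse(running > 2, running, score)
w                                          # 1 2 1 2 3 3 3 3 4
```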
One way to do this is to first fill the NA values and then, for each row, check whether a score of 3 has already been reached within the group. If it has, we take the maximum score up to that row; otherwise we return the score unchanged.
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(want1 = map_dbl(seq_len(n()), ~if(. >= which.max(score == 3))
max(score[seq_len(.)]) else score[.]))
# person score want want1
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 1 2 2 2
# 3 1 1 1 1
# 4 1 2 2 2
# 5 1 3 3 3
# 6 1 1 3 3
# 7 1 3 3 3
# 8 1 3 3 3
# 9 1 4 4 4
#10 2 2 2 2
#11 2 1 1 1
#12 2 1 1 1
#13 2 2 2 2
#14 2 2 2 2
#15 2 3 3 3
#16 2 1 3 3
#17 2 2 3 3
#18 2 4 4 4
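The pipelines in this thread call fill() before group_by(), which is harmless for the question's data because no person starts with an NA. A sketch on hypothetical data where that is not the case shows the difference: ungrouped, the value leaks from one person into the next, while a grouped fill() leaves the leading NA alone:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: person 2 starts with a missing score
d <- data.frame(person = c(1, 1, 2, 2),
                score  = c(1, 4, NA, 2))

ungrouped <- d %>% fill(score)
grouped   <- d %>% group_by(person) %>% fill(score) %>% ungroup()

ungrouped$score  # 1 4 4 2   (person 1's 4 carried into person 2)
grouped$score    # 1 4 NA 2  (nothing to carry forward within person 2)
```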
Another way is to use accumulate from purrr. I use if_else_ from hablar for type stability:
library(tidyverse)
library(hablar)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(wt = accumulate(score, ~if_else_(.x > 2, max(.x, .y), .y)))

Resources