How to arrange/sort by unique sequences in R?

A) Here is my data frame arranged by plate:
df <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
plate phase score
A 1 1
A 1 1
A 1 1
A 2 1
A 2 1
A 2 1
A 3 2
A 3 2
A 3 2
B 1 1
B 1 1
B 1 2
B 2 1
B 2 1
B 2 3")
B) Goal: I want to order it by plate first and then by phase, but sequentially (see below how the rows are ordered alphabetically by plate but cycle through the phases 1, 2, 3, 1, 2, 3, ... within each plate):
plate phase score
<chr> <int> <int>
1 A 1 1
2 A 2 1
3 A 3 2
4 A 1 1
5 A 2 1
6 A 3 2
7 A 1 1
8 A 2 1
9 A 3 2
10 B 1 1
11 B 2 1
12 B 1 1
13 B 2 1
14 B 1 2
15 B 2 3

One option is to create a sequence variable grouped by 'plate' and 'phase', then arrange by 'plate', that sequence, and 'phase'
library(dplyr)
df %>%
group_by(plate, phase) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
arrange(plate, rn, phase) %>%
select(-rn)
# A tibble: 15 x 3
# plate phase score
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 2
# 4 A 1 1
# 5 A 2 1
# 6 A 3 2
# 7 A 1 1
# 8 A 2 1
# 9 A 3 2
#10 B 1 1
#11 B 2 1
#12 B 1 1
#13 B 2 1
#14 B 1 2
#15 B 2 3
Or using data.table, with rowid() numbering the occurrences within each plate/phase pair
library(data.table)
setDT(df)[order(plate, rowid(plate, phase), phase)]

Or in base R, using ave() to number the occurrences within each plate/phase pair:
df[with(df, order(plate, ave(phase, plate, phase, FUN = seq_along), phase)), ]
#> plate phase score
#> 1 A 1 1
#> 4 A 2 1
#> 7 A 3 2
#> 2 A 1 1
#> 5 A 2 1
#> 8 A 3 2
#> 3 A 1 1
#> 6 A 2 1
#> 9 A 3 2
#> 10 B 1 1
#> 13 B 2 1
#> 11 B 1 1
#> 14 B 2 1
#> 12 B 1 2
#> 15 B 2 3
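For readers who want the occurrence counter spelled out, here is a minimal base R sketch of the same idea. It assumes the counter should restart for every plate/phase pair; the names `occurrence` and `res` are introduced here for illustration and are not part of the original answers.

```r
# Rebuild the data frame from the question.
df <- data.frame(
  plate = rep(c("A", "B"), times = c(9, 6)),
  phase = c(rep(1:3, each = 3), rep(1:2, each = 3)),
  score = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 3)
)

# occurrence is 1 for the first row of each (plate, phase) pair,
# 2 for the second row of that pair, and so on.
occurrence <- ave(seq_len(nrow(df)), df$plate, df$phase, FUN = seq_along)

# Sort by plate, then by occurrence, then by phase, so that each plate
# cycles through its phases once per occurrence.
res <- df[order(df$plate, occurrence, df$phase), ]
res
```

Sorting on phase as the last key (rather than score) keeps the phase cycle intact even if scores within one occurrence are not already in increasing order.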

Related

Grouping and stacking data

A sample of my data is :
dat <- read.table(text = " ID BC1 DC1 DE1 MN2 DC2 PO2 SA3 BC3 KL3 AA4 AP4 BC4 PO4
1 2 1 2 3 1 3 1 1 3 2 2 2 2
2 3 1 1 2 3 1 1 2 3 1 1 3 2
3 2 3 2 3 2 3 2 1 1 3 1 1 1
4 3 3 1 1 1 1 1 2 2 1 2 1 2", header = TRUE)
I want to get the following table, with missing data left blank:
ID Group1 Group2 Group3 Group4
1 2 1 2
2 3 1 1
3 2 3 2
4 3 3 1
1 3 1 3
2 2 3 1
3 3 2 3
4 1 1 1
1 1 1 3
2 1 2 3
3 2 1 1
4 1 2 2
1 2 2 2 2
2 1 1 3 2
3 3 1 1 1
4 1 2 1 2
The number at the end of each column name indicates which set the column belongs to. For example, BC1, DC1 and DE1 form the first four rows together with their IDs, MN2, DC2 and PO2 form the next four rows with their IDs, and so on.
What about using the row numbers with some pivoting?
library(dplyr)
library(tidyr)
dat |>
pivot_longer(-ID, names_sep = "(?=\\d)", names_to = c(NA, "id")) |>
group_by(ID, id) |>
mutate(name = row_number()) |>
pivot_wider(id_cols = c(ID, id), names_prefix = "Group") |>
arrange(id) |>
ungroup() |>
select(-id)
Or using a map:
library(purrr)
library(dplyr)
paste(1:4) |> # unique(readr::parse_number(names(dat |> select(-ID))))
map(\(x) select(dat, ID, ends_with(x)) |> rename_with(\(nm) paste0("Group", seq_along(nm)), -ID)) |>
bind_rows()
Output:
# A tibble: 16 × 5
ID Group1 Group2 Group3 Group4
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
Update 13-01: Now the first solution returns the correct ID (not id) + another approach added.
Would be interesting to see if there is an easier approach:
library(tidyverse)
dat |>
pivot_longer(-ID) |>
mutate(id = str_extract(name, "\\d$")) |>
group_by(ID, id) |>
mutate(name = paste0("Group", row_number())) |>
ungroup() |>
pivot_wider(names_from = name, values_from = value) |>
arrange(id, ID) |>
select(-id)
#> # A tibble: 16 × 5
#> ID Group1 Group2 Group3 Group4
#> <int> <int> <int> <int> <int>
#> 1 1 2 1 2 NA
#> 2 2 3 1 1 NA
#> 3 3 2 3 2 NA
#> 4 4 3 3 1 NA
#> 5 1 3 1 3 NA
#> 6 2 2 3 1 NA
#> 7 3 3 2 3 NA
#> 8 4 1 1 1 NA
#> 9 1 1 1 3 NA
#> 10 2 1 2 3 NA
#> 11 3 2 1 1 NA
#> 12 4 1 2 2 NA
#> 13 1 2 2 2 2
#> 14 2 1 1 3 2
#> 15 3 3 1 1 1
#> 16 4 1 2 1 2
You can rename the data with a specified pattern ("index1_index2"), i.e.
# ID 1_1 1_2 1_3 2_1 2_2 2_3 3_1 3_2 3_3 4_1 4_2 4_3 4_4
# 1 1 2 1 2 3 1 3 1 1 3 2 2 2 2
# 2 2 3 1 1 2 3 1 1 2 3 1 1 3 2
# 3 3 2 3 2 3 2 3 2 1 1 3 1 1 1
# 4 4 3 3 1 1 1 1 1 2 2 1 2 1 2
so that you can add the special element ".value" to names_to when using pivot_longer() to stack multiple columns that are grouped by that pattern.
Code
library(dplyr)
library(tidyr)
dat %>%
rename_with(~ sub('\\D+', '', .x) %>%
paste(., ave(., ., FUN = seq), sep = '_'), -ID) %>%
pivot_longer(-ID, names_to = c("set", ".value"), names_sep = '_') %>%
arrange(set) %>%
select(-set)
Output
# A tibble: 16 × 5
ID `1` `2` `3` `4`
<int> <int> <int> <int> <int>
1 1 2 1 2 NA
2 2 3 1 1 NA
3 3 2 3 2 NA
4 4 3 3 1 NA
5 1 3 1 3 NA
6 2 2 3 1 NA
7 3 3 2 3 NA
8 4 1 1 1 NA
9 1 1 1 3 NA
10 2 1 2 3 NA
11 3 2 1 1 NA
12 4 1 2 2 NA
13 1 2 2 2 2
14 2 1 1 3 2
15 3 3 1 1 1
16 4 1 2 1 2
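The same stacking can also be sketched in base R without any pivoting. This is only an illustration on a smaller, made-up data frame (the column names A1, B1, C2, D2, E2 are hypothetical), and it assumes that the trailing digits of each column name identify its set.

```r
# Hypothetical two-row example: set 1 has two columns, set 2 has three.
dat <- data.frame(ID = 1:2,
                  A1 = c(2, 3), B1 = c(1, 1),
                  C2 = c(3, 2), D2 = c(1, 3), E2 = c(3, 1))

vars   <- setdiff(names(dat), "ID")
set_id <- sub("^\\D+", "", vars)   # strip the leading letters, keep the digits
width  <- max(table(set_id))       # number of columns in the widest set

# Build one block per set, renaming its columns to Group1, Group2, ...
# and padding narrower sets with NA columns.
blocks <- lapply(split(vars, set_id), function(cols) {
  block <- dat[cols]
  names(block) <- paste0("Group", seq_along(cols))
  for (k in seq_len(width)[-seq_along(cols)]) {
    block[[paste0("Group", k)]] <- NA
  }
  cbind(ID = dat$ID, block)
})

out <- do.call(rbind, blocks)
rownames(out) <- NULL
out
```

Note that split() orders the sets by their sorted labels, so with ten or more sets the labels would need to be converted to numbers first to keep the numeric order.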

Replacing values in a data.frame that have lost their order

In my toy data, for each unique study, the numeric variables (sample and group) must have an order starting from 1. But:
For example, in study 1, we see that there are two unique sample values (1 & 3), so 3 must be replaced with 2.
For example, in study 2, we see that there is one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group seem ok meaning their unique values are 1 and 2 (no replacing needed).
For this toy data, my desired output is shown below. But I appreciate a functional solution that can automatically replace any number of numeric variables in a data.frame that have lost their order just like I showed in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text=m, h=T)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
You can do:
library(dplyr)
data %>%
group_by(study) %>%
mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
function(x) as.numeric(as.factor(x)))) %>%
as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
Here is an alternative dplyr solution (not as elegant as @Allan Cameron's, +1):
library(dplyr)
data %>%
group_by(study) %>%
mutate(x = n()/length(unique(sample)),
sample = rep(row_number(), each = first(x), length.out = n()),
y = length(unique(group)),
group = ifelse(y==1, 1, group)) %>%
select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
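Since the question asks for a functional solution that handles any number of numeric variables, here is a minimal base R sketch of the re-indexing idea. The function name `reindex_within` and its arguments are hypothetical, introduced only for this example; it assumes each value should be replaced by its rank among the sorted unique values within the group.

```r
# Replace each value in `cols` by its position among the sorted unique
# values observed within the same level of `group_col`.
reindex_within <- function(data, group_col, cols) {
  for (col in cols) {
    data[[col]] <- ave(data[[col]], data[[group_col]],
                       FUN = function(x) match(x, sort(unique(x))))
  }
  data
}

# Small made-up example in the spirit of the toy data.
toy <- data.frame(study  = c(1, 1, 1, 1, 2, 2),
                  sample = c(1, 1, 3, 3, 1, 2),
                  group  = c(1, 2, 1, 2, 2, 2))

out <- reindex_within(toy, "study", c("sample", "group"))
out
```

In study 1 the sample value 3 becomes 2, and in study 2 the lone group value 2 becomes 1, which mirrors the two cases described in the question.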

Remove Redundant row with large number of variable

I have data with 33 columns. 30 of them are variables, and the other 3 are the cluster number, the degree and the sum of degree. I want to remove duplicate rows that have the same values in variables 1 through 30. Among the duplicates, I want to keep the row with the highest sum of degree. I am doing this in R. My question is: how do I simplify zz?
df_order=dfOrder(rule2,c(33),ascending=FALSE)
df_order2=as_tibble(df_order)
zz=df_order2 %>% distinct(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,X30,.keep_all = TRUE)
Sample data:
library(dplyr)
set.seed(42)
dat <- tibble(a=rep(1:2, each=10), b=rep(1:4, each=5), x1=sample(3,size=20,replace=TRUE), x2=sample(3,size=20,replace=TRUE), x3=sample(3,size=20,replace=TRUE))
dat
# # A tibble: 20 x 5
# a b x1 x2 x3
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 3
# 2 1 1 1 3 3
# 3 1 1 1 1 1
# 4 1 1 1 1 1
# 5 1 1 2 2 2
# 6 1 2 2 3 2
# ...truncated...
Brute-force to show what distinct gives you:
distinct(dat, x1, x2, x3, .keep_all = TRUE)
# # A tibble: 14 x 5
# a b x1 x2 x3
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 3
# 2 1 1 1 3 3
# 3 1 1 1 1 1
# 4 1 1 2 2 2
# 5 1 2 2 3 2
# 6 1 2 1 1 2
# 7 1 2 3 2 2
# 8 1 2 3 2 3
# 9 2 3 1 3 2
# 10 2 3 1 3 1
# 11 2 3 2 2 3
# 12 2 4 3 1 2
# 13 2 4 1 2 1
# 14 2 4 3 2 1
Programmatic ways, without naming each of x1 through x3; all three work (pick according to your preference for "just use these" or "don't use those"). The first two work equally well in base R and the tidyverse; the third uses dplyr::select.
dat[!duplicated(subset(dat, select = -(a:b))),]
dat[!duplicated(subset(dat, select = x1:x3)),]
dat[!duplicated(select(dat, x1:x3)),] # or -(a:b), same
Or perhaps a pipe-looking method:
select(dat, x1:x3) %>%
Negate(duplicated)(.) %>%
which(.) %>%
slice(dat, .)
Using the data from @r2evans's post, an option is to convert the column names to symbols and splice them into distinct with !!!
library(dplyr)
dat %>%
distinct(!!! rlang::syms(names(select(., starts_with('x')))), .keep_all = TRUE)
# A tibble: 14 x 5
# a b x1 x2 x3
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 3
# 2 1 1 1 3 3
# 3 1 1 1 1 1
# 4 1 1 2 2 2
# 5 1 2 2 3 2
# 6 1 2 1 1 2
# 7 1 2 3 2 2
# 8 1 2 3 2 3
# 9 2 3 1 3 2
#10 2 3 1 3 1
#11 2 3 2 2 3
#12 2 4 3 1 2
#13 2 4 1 2 1
#14 2 4 3 2 1
From dplyr version >= 1.0.0, we can also use distinct with across
dat %>%
distinct(across(starts_with('x')), .keep_all = TRUE)
# A tibble: 14 x 5
# a b x1 x2 x3
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 3
# 2 1 1 1 3 3
# 3 1 1 1 1 1
# 4 1 1 2 2 2
# 5 1 2 2 3 2
# 6 1 2 1 1 2
# 7 1 2 3 2 2
# 8 1 2 3 2 3
# 9 2 3 1 3 2
#10 2 3 1 3 1
#11 2 3 2 2 3
#12 2 4 3 1 2
#13 2 4 1 2 1
#14 2 4 3 2 1
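The question also asks to keep, within each set of duplicates, the row with the highest sum of degree. A minimal base R sketch of that step follows; the data and the column name `sum_degree` are made up for illustration. The idea is to sort in descending order first, because duplicated() flags every occurrence after the first one.

```r
# Hypothetical data: x1 and x2 are the variables, sum_degree is the score.
dat <- data.frame(x1 = c(1, 1, 2, 2, 1),
                  x2 = c(1, 1, 3, 3, 1),
                  sum_degree = c(5, 9, 4, 2, 7))

# Columns that define a duplicate (here: every column starting with "x").
key_cols <- grep("^x", names(dat), value = TRUE)

# Put the highest sum_degree first, then keep the first row of each key.
sorted <- dat[order(-dat$sum_degree), ]
kept   <- sorted[!duplicated(sorted[key_cols]), ]
kept
```

The same ordering trick applies to the dplyr variants above: arrange by the score in descending order before calling distinct with .keep_all = TRUE.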

is there a way in R to fill missing groups absent of observations?

Say I have something like:
df<-data.frame(group=c(1, 1,1, 2,2,2,3,3,3,4,4, 1, 1,1),
group2=c(1,2,3,1,2,3,1,2,3,1,3, 1,2,3))
group group2
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 3
12 1 1
13 1 2
14 1 3
My goal is to count the number of unique instances for group= something and group2= something. Like so:
df1<-df%>%group_by(group, group2)%>% mutate(want=n())%>%distinct(group, group2, .keep_all=TRUE)
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 3 1
However, notice that the combination group=4, group2=2 was not in my dataset to begin with. Is there some sort of autofill function that fills these non-observations with a zero, so that I easily get the result below?
group group2 want
<dbl> <dbl> <int>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 2 0
12 4 3 1
After getting the count, we can expand with complete to fill the missing combinations with 0
library(dplyr)
library(tidyr)
df %>%
count(group, group2) %>%
complete(group, group2, fill = list(n = 0))
# A tibble: 12 x 3
# group group2 n
# <dbl> <dbl> <dbl>
# 1 1 1 2
# 2 1 2 2
# 3 1 3 2
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 1
# 8 3 2 1
# 9 3 3 1
#10 4 1 1
#11 4 2 0
#12 4 3 1
Or, if we use group_by, call summarise directly instead of mutate followed by distinct
df %>%
group_by(group, group2) %>%
summarise(n = n()) %>%
ungroup %>%
complete(group, group2, fill = list(n = 0))
Here is a data.table approach to this problem:
library(data.table)
setDT(df)[CJ(group, group2, unique = TRUE),
c(.SD, .(want = .N)), .EACHI,
on = c("group", "group2")]
# group group2 want
# 1 1 2
# 1 2 2
# 1 3 2
# 2 1 1
# 2 2 1
# 2 3 1
# 3 1 1
# 3 2 1
# 3 3 1
# 4 1 1
# 4 2 0
# 4 3 1
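For comparison, base R's table() already counts every combination of the observed levels, including those that never occur together. A minimal sketch, with the caveat that as.data.frame() returns group and group2 as factors rather than numbers:

```r
df <- data.frame(group  = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 1, 1, 1),
                 group2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3))

# Cross-tabulate, then flatten to one row per combination.
counts <- as.data.frame(table(group = df$group, group2 = df$group2),
                        responseName = "want")

# Sort by group, then group2, to match the layout in the question.
counts <- counts[order(counts$group, counts$group2), ]
counts
```

The combination group = 4, group2 = 2 appears with want equal to 0 without any explicit fill step.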

Adding sequence of numbers to the data

I have a data frame like this:
df <-data.frame(x=rep(rep(seq(0,3),each=2),2 ),gr=gl(2,8))
x gr
1 0 1
2 0 1
3 1 1
4 1 1
5 2 1
6 2 1
7 3 1
8 3 1
9 0 2
10 0 2
11 1 2
12 1 2
13 2 2
14 2 2
15 3 2
16 3 2
I want to add a new column that numbers each run of identical x values, with the numbering starting over whenever x returns to 0.
I tried:
library(dplyr)
df%>%
group_by(gr)%>%
mutate(numbering=seq(2,8,2))
Error in mutate_impl(.data, dots) :
Column `numbering` must be length 8 (the group size) or one, not 4
As a side note, mutate(numbering=rep(seq(2,8,2),each=2)) would work for this minimal example, but for the general case it is better to detect where the x value changes, starting from 0.
The expected output:
x gr numbering
1 0 1 2
2 0 1 2
3 1 1 4
4 1 1 4
5 2 1 6
6 2 1 6
7 3 1 8
8 3 1 8
9 0 2 2
10 0 2 2
11 1 2 4
12 1 2 4
13 2 2 6
14 2 2 6
15 3 2 8
16 3 2 8
Do you mean something like this?
library(tidyverse);
df %>%
group_by(gr) %>%
mutate(numbering = cumsum(c(1, diff(x) != 0)))
## A tibble: 16 x 3
## Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 1.
# 2 0 1 1.
# 3 1 1 2.
# 4 1 1 2.
# 5 2 1 3.
# 6 2 1 3.
# 7 3 1 4.
# 8 3 1 4.
# 9 0 2 1.
#10 0 2 1.
#11 1 2 2.
#12 1 2 2.
#13 2 2 3.
#14 2 2 3.
#15 3 2 4.
#16 3 2 4.
Or if you must have a numbering sequence 2,4,6,... instead of 1,2,3,... you can do
df %>%
group_by(gr) %>%
mutate(numbering = 2 * cumsum(c(1, diff(x) != 0)));
## A tibble: 16 x 3
## Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 2.
# 2 0 1 2.
# 3 1 1 4.
# 4 1 1 4.
# 5 2 1 6.
# 6 2 1 6.
# 7 3 1 8.
# 8 3 1 8.
# 9 0 2 2.
#10 0 2 2.
#11 1 2 4.
#12 1 2 4.
#13 2 2 6.
#14 2 2 6.
#15 3 2 8.
#16 3 2 8.
Here is an option that uses match to get the index of each x value among the unique values and then uses that index to pick from the seq values
df %>%
group_by(gr) %>%
mutate(numbering = seq(2, length.out = n_distinct(x), by = 2)[match(x, unique(x))])
# A tibble: 16 x 3
# Groups: gr [2]
# x gr numbering
# <int> <fct> <dbl>
# 1 0 1 2
# 2 0 1 2
# 3 1 1 4
# 4 1 1 4
# 5 2 1 6
# 6 2 1 6
# 7 3 1 8
# 8 3 1 8
# 9 0 2 2
#10 0 2 2
#11 1 2 4
#12 1 2 4
#13 2 2 6
#14 2 2 6
#15 3 2 8
#16 3 2 8
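The run-counting idea can also be written in base R with ave(), which applies a function within each level of gr. This is a minimal sketch on the question's data, assuming a new number should start whenever x changes from one row to the next:

```r
df <- data.frame(x = rep(rep(0:3, each = 2), 2), gr = gl(2, 8))

# Within each gr, mark the first row and every row where x differs from the
# previous row, accumulate the marks, and scale by 2 to get 2, 4, 6, ...
df$numbering <- ave(df$x, df$gr,
                    FUN = function(v) 2 * cumsum(c(TRUE, diff(v) != 0)))
df
```

Unlike the match-based variant, this counts runs rather than distinct values, so a value that reappears later within the same gr gets a new number.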
