R unique ID renumbering for each group in a data.frame

I want to create a unique sequential numeric ID for each distinct group based on 3 columns, but within each group the IDs must run from 1 to n.
Using the solution from "Creating a unique ID", I can create unique IDs, but they are sequential across the entire data frame.
k1 <- c(1,1,1,1,1,1,1,1,1,1)
k2 <- c(1,1,1,1,1,2,2,2,2,2)
k3 <- rep(letters[1:2],5)
df <- as.data.frame(cbind(k1,k2, k3))
d <- transform(df, id = as.numeric(interaction(k1,k2,k3, drop=TRUE)))
d <- d[with(d, order(k1,k2,k3)),]
The result is:
> d
k1 k2 k3 id
1 1 1 a 1
3 1 1 a 1
5 1 1 a 1
2 1 1 b 3
4 1 1 b 3
7 1 2 a 2
9 1 2 a 2
6 1 2 b 4
8 1 2 b 4
10 1 2 b 4
and I'd like to have
> d
k1 k2 k3 id
1 1 1 a 1
3 1 1 a 1
5 1 1 a 1
2 1 1 b 2
4 1 1 b 2
7 1 2 a 1
9 1 2 a 1
6 1 2 b 2
8 1 2 b 2
10 1 2 b 2

Try using data.table as mentioned in the link:
library(data.table)
setDT(df)[,id:=.GRP,by=list(k1,k3)][]
# k1 k2 k3 id
# 1: 1 1 a 1
# 2: 1 1 b 2
# 3: 1 1 a 1
# 4: 1 1 b 2
# 5: 1 1 a 1
# 6: 1 2 b 2
# 7: 1 2 a 1
# 8: 1 2 b 2
# 9: 1 2 a 1
#10: 1 2 b 2

Try
d$id <- with(d, ave(id, k2, FUN=function(x) as.numeric(factor(x))))
d$id
#[1] 1 1 1 2 2 1 1 2 2 2
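
Note that grouping by k1 and k3 alone reproduces the desired output here only because every k2 block contains both a and b. A minimal base R sketch that restarts the numbering inside each (k1, k2) block, assuming that is the intended grouping, could look like the following. The data frame is rebuilt with data.frame() rather than cbind() so that the columns keep their original types; that change is an assumption of this sketch, not part of the question.

```r
k1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
k2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
k3 <- rep(letters[1:2], 5)
df <- data.frame(k1, k2, k3)

# ave() splits a vector by the grouping columns, so pass the integer codes
# of k3 and renumber the distinct codes inside each (k1, k2) block.
df$id <- ave(as.integer(factor(df$k3)), df$k1, df$k2,
             FUN = function(x) match(x, sort(unique(x))))

# Sort the rows the same way as in the question
df <- df[order(df$k1, df$k2, df$k3), ]
df$id
# [1] 1 1 1 2 2 1 1 2 2 2
```

Because match() works on the values actually present in each block, a block that contained only b would get ID 1 for b rather than 2.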

Related

Looping over various lists of data frames to create a variable in each data frame

I have four lists, each of which contains 12 data frames. Something like this:
# Build some example data frames named df1, df2, ...
for (i in 1:10) {
  assign(paste0("df", i), data.frame(c=c(1,2,3), d=c(1,2,3),
                                     e=c(1,2,3), f=c(1,2,3),
                                     g=c(1,2,3), h=c(1,2,3),
                                     i=c(1,2,3), j=c(1,2,3),
                                     k=c(1,2,3), l=c(1,2,3)))
}
# Collect them into four lists named list_1 ... list_4
for (i in 1:4) {
  assign(paste0("list_", i), lapply(ls(pattern="df"), get))
}
rm(list=ls(pattern="df"))
Each list corresponds to a year and each of its elements (the data frames) correspond to a month. Conveniently, the position of each element of the list (of each data frame) is equivalent to its month. So, in the first list, the first data frame corresponds to January 2020, the second to February 2020, and so on. In the second list, the first data frame corresponds to January 2021, the second to February 2021, and so on.
What I need to do is to create a new variable that indicates the month of each data frame.
I have been trying different things, including this:
for(j in 20:21) {
for(i in 1:12) {
assign(get(paste0("df_20", j))[[i]],
get(paste0("df_20", j))[[i]] %>%
mutate(month=i)) ## 1 is added to the month because it starts from February
}
}
But nothing works. The problem seems to be the left-hand side of the assignment. When I use the get() function, R returns an error ("Error in assign(get(paste0("mies_20", j))[[1]], get(paste0("mies_20", :
invalid first argument"). If I don't include this function, paste0("df_20", j)[[i]] does not recognize the "[[i]]".
Any ideas?
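assign() expects a variable name (a character string) as its first argument, so passing it the result of get() fails. Instead, get() each list, iterate over its data frames using lapply(), then assign the modified list back to the environment. It’s also best to use something other than i for your iteration variable, since there’s also a column i in your data frames.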
library(dplyr)
for (j in 1:4) {
list_j <- get(paste0("list_", j))
list_j <- lapply(
seq_along(list_j),
\(mnth) mutate(list_j[[mnth]], month = mnth)
)
assign(paste0("list_", j), list_j)
}
list_1
[[1]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 1
3 3 3 3 3 3 3 3 3 3 3 1
[[2]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 2
2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 2
[[3]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 3
2 2 2 2 2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3 3 3 3 3
[[4]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 4
2 2 2 2 2 2 2 2 2 2 2 4
3 3 3 3 3 3 3 3 3 3 3 4
[[5]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 5
2 2 2 2 2 2 2 2 2 2 2 5
3 3 3 3 3 3 3 3 3 3 3 5
[[6]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 6
2 2 2 2 2 2 2 2 2 2 2 6
3 3 3 3 3 3 3 3 3 3 3 6
[[7]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 7
2 2 2 2 2 2 2 2 2 2 2 7
3 3 3 3 3 3 3 3 3 3 3 7
[[8]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 8
2 2 2 2 2 2 2 2 2 2 2 8
3 3 3 3 3 3 3 3 3 3 3 8
[[9]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 9
2 2 2 2 2 2 2 2 2 2 2 9
3 3 3 3 3 3 3 3 3 3 3 9
[[10]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 10
2 2 2 2 2 2 2 2 2 2 2 10
3 3 3 3 3 3 3 3 3 3 3 10
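
As a possible alternative to get() and assign(), the yearly lists can be kept inside one named list, so that no variable names need to be built from strings. The sketch below is only an illustration under that assumption: make_year(), add_month() and the years object are hypothetical names, and the stand-in data has three months and two columns per data frame to keep it short.

```r
# Hypothetical stand-in data: one list of monthly data frames per year
make_year <- function(n_months) {
  lapply(seq_len(n_months), function(m) data.frame(c = 1:3, d = 1:3))
}
years <- list(`2020` = make_year(3), `2021` = make_year(3))

# Add a month column to every data frame, using its position in the list
add_month <- function(year_list) {
  Map(function(dat, mnth) {
    dat$month <- mnth
    dat
  }, year_list, seq_along(year_list))
}

years <- lapply(years, add_month)

# Month recorded in each data frame of the 2021 list
sapply(years[["2021"]], function(dat) unique(dat$month))
# [1] 1 2 3
```

Keeping the lists in one container means the outer loop becomes a single lapply() call, and the year is available as the list name if it is needed later.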

How to create a group based on pattern from another column?

I have a data frame as below,
dt <- data.frame(id = c("a","b","c","d","e","f","g","h","i","j"),
value = c(1,2,1,2,1,1,1,2,1,2))
> dt
id value
1 a 1
2 b 2
3 c 1
4 d 2
5 e 1
6 f 1
7 g 1
8 h 2
9 i 1
10 j 2
I hope to create a column based on column value so that whenever it runs into a 2 in column value it will assign a new group number. The output will look like,
dtgroup <- data.frame(id = c("a","b","c","d","e","f","g","h","i","j"),
value = c(1,2,1,2,1,1,1,2,1,2),
group = c(1,1,2,2,3,3,3,3,4,4))
> dtgroup
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
Any ideas? Thanks!
We can use findInterval like below
> transform(dt, group = 1 + findInterval(seq_along(value), which(value == 2), left.open = TRUE))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
or cut
> transform(dt, group = as.integer(cut(seq_along(value), c(-Inf, which(value == 2)))))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
Another possibility: increment by one when value is 1 and the previous value (via dplyr::lag) is not 1.
dt$group <- with(dt, cumsum(value == 1 & dplyr::lag(value != 1, default = TRUE)))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
With cumsum, if value doesn't have NAs:
dt$group <- head(c(0,cumsum(dt$value==2))+1,-1)
dt
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
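
The answers above all rest on the same idea: a new group starts on the first row and on every row that directly follows a 2. A minimal base R sketch of that idea, checked against the findInterval version, is shown here; starts_group and alt are names introduced only for this illustration, and the sketch assumes value contains no NAs.

```r
dt <- data.frame(id = letters[1:10],
                 value = c(1, 2, 1, 2, 1, 1, 1, 2, 1, 2))

# TRUE on the first row and on each row whose predecessor is a 2
starts_group <- c(TRUE, head(dt$value, -1) == 2)

# Counting the group starts so far gives the group number
dt$group <- cumsum(starts_group)
dt$group
# [1] 1 1 2 2 3 3 3 3 4 4

# Cross-check against the findInterval approach
alt <- 1 + findInterval(seq_along(dt$value), which(dt$value == 2),
                        left.open = TRUE)
stopifnot(all(dt$group == alt))
```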

Replacing values in a data.frame that have lost their order

In my toy data, for each unique study, the numeric variables (sample and group) must be numbered consecutively starting from 1. But:
For example, in study 1, we see that there are two unique sample values (1 & 3), so 3 must be replaced with 2.
For example, in study 2, we see that there is one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group seem OK, meaning their unique values are 1 and 2 (no replacing needed).
For this toy data, my desired output is shown below, but I would appreciate a functional solution that can automatically renumber any number of numeric variables in a data.frame that have lost their order, as in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text = m, header = TRUE)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
You can do:
library(dplyr)
data %>%
group_by(study) %>%
mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
function(x) as.numeric(as.factor(x)))) %>%
as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
Here is an alternative dplyr solution (not as elegant as @Allan Cameron's, +1):
library(dplyr)
data %>%
group_by(study) %>%
mutate(x = n()/length(unique(sample)),
sample = rep(row_number(), each=x, length.out = n()),
y = length(unique(group)),
group = ifelse(y==1, 1, group)) %>%
select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
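
For readers who prefer base R, the same renumbering can be sketched with ave() and match(). The function name renumber_within and its arguments are hypothetical, and the small toy data frame below is a shortened stand-in for the data in the question, not the original.

```r
# Shortened stand-in data: study 1 skips sample 2, study 2 has only group 2
toy <- data.frame(
  study  = c(1, 1, 1, 1, 2, 2, 2),
  sample = c(1, 1, 3, 3, 1, 2, 3),
  group  = c(1, 2, 1, 2, 2, 2, 2)
)

# Replace each value by its rank among the distinct values seen
# within the same level of the `by` column
renumber_within <- function(dat, by, cols) {
  for (col in cols) {
    dat[[col]] <- ave(dat[[col]], dat[[by]],
                      FUN = function(x) match(x, sort(unique(x))))
  }
  dat
}

out <- renumber_within(toy, by = "study", cols = c("sample", "group"))
out$sample
# [1] 1 1 2 2 1 2 3
out$group
# [1] 1 2 1 2 1 1 1
```

Passing the column names explicitly keeps the function general: any number of numeric columns can be listed in cols.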

R assign order references by other column value [duplicate]

I am trying to obtain a sequence within category.
My data are:
A B
1 1
1 2
1 2
1 3
1 3
1 3
1 4
1 4
and I want to add a variable "C" so that my data look like:
A B C
1 1 1
1 2 1
1 2 2
1 3 1
1 3 2
1 3 3
1 4 1
1 4 2
Use ave with seq_along:
> mydf$C <- with(mydf, ave(A, A, B, FUN = seq_along))
> mydf
A B C
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 4 2
If your data are already ordered (as they are in this case), you can also use sequence with rle (mydf$C <- sequence(rle(do.call(paste, mydf))$lengths)), but you don't have that limitation with ave.
If you're a data.table fan, you can make use of .N as follows:
library(data.table)
DT <- data.table(mydf)
DT[, C := sequence(.N), by = c("A", "B")]
DT
# A B C
# 1: 1 1 1
# 2: 1 2 1
# 3: 1 2 2
# 4: 1 3 1
# 5: 1 3 2
# 6: 1 3 3
# 7: 1 4 1
# 8: 1 4 2
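
The ordering limitation of the rle approach mentioned above can be seen on rows that are not sorted by group. The following is a small sketch with made-up data (the values in mydf here are an assumption for illustration, not the data from the question); runs is a name introduced only for this example.

```r
# Rows deliberately not sorted by B
mydf <- data.frame(A = c(1, 1, 1, 1, 1, 1),
                   B = c(2, 1, 2, 3, 2, 3))

# ave() counts within each (A, B) combination regardless of row order
mydf$C <- ave(seq_len(nrow(mydf)), mydf$A, mydf$B, FUN = seq_along)
mydf$C
# [1] 1 1 2 1 3 2

# rle() only sees runs of adjacent identical rows, so on unsorted data
# every run has length 1 and the counter never advances
runs <- sequence(rle(do.call(paste, mydf[c("A", "B")]))$lengths)
runs
# [1] 1 1 1 1 1 1
```

Sorting by A and B first would make both approaches agree, at the cost of changing the row order.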
