Assign unique non-repeated ID to nested groups with the same values in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to include a unique non-repeating ID to nested groups that can have identical values. While I regularly conduct this type of data wrangling, both the structure of this data set as well as the required outcome are beyond my skillset at this time.
Below I have provided an example data set (df) and what the results should look like.
I used the code below on my actual data set, but realized that it fails under certain circumstances, which are exaggerated in the example data set provided here. I would prefer the ID to be numbered sequentially.
df$ID = cumsum(c(TRUE, diff(df$LENGTH) != 0))
I am open to all options (e.g., library(data.table), library(boot), etc.), as it would be great if others found this post useful. However, I prefer solutions that do not require installing and loading additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP REGION TIME LENGTH
a x 1 3
a x 2 3
a x 3 3
a y 4 3
a y 5 3
a y 6 3
a z 7 2
a z 8 2
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)
result <- read.table(text = "GROUP REGION TIME LENGTH ID
a x 1 3 1
a x 2 3 1
a x 3 3 1
a y 4 3 2
a y 5 3 2
a y 6 3 2
a z 7 2 3
a z 8 2 3
b z 1 2 4
b z 2 2 4
b x 3 2 5
b x 4 2 5
c x 1 2 6
c x 2 2 6
c y 3 2 7
c y 4 2 7
c x 5 2 8
c x 6 2 8
c z 7 1 9", header = TRUE)
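To make the failure mode of the cumsum/diff attempt concrete: it increments the ID only when LENGTH changes, so consecutive GROUP/REGION runs that happen to share a LENGTH are merged. On the example data:
cumsum(c(TRUE, diff(df$LENGTH) != 0))
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3
# rows 1-6 (a/x and a/y, both LENGTH 3) collapse into ID 1, and
# rows 7-18 (a/z through c/x, all LENGTH 2) collapse into ID 2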

Paste GROUP and REGION columns and use rle to create a sequential ID column.
transform(df, ID = with(rle(paste(GROUP, REGION)), rep(seq_along(values), lengths)))
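A quick sanity check against the result object from the question confirms the IDs match:
new_id <- with(rle(paste(df$GROUP, df$REGION)), rep(seq_along(values), lengths))
identical(new_id, result$ID)
# [1] TRUE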
In data.table we can use rleid.
library(data.table)
setDT(df)[, ID := rleid(GROUP, REGION)]
# GROUP REGION TIME LENGTH ID
# 1: a x 1 3 1
# 2: a x 2 3 1
# 3: a x 3 3 1
# 4: a y 4 3 2
# 5: a y 5 3 2
# 6: a y 6 3 2
# 7: a z 7 2 3
# 8: a z 8 2 3
# 9: b z 1 2 4
#10: b z 2 2 4
#11: b x 3 2 5
#12: b x 4 2 5
#13: c x 1 2 6
#14: c x 2 2 6
#15: c y 3 2 7
#16: c y 4 2 7
#17: c x 5 2 8
#18: c x 6 2 8
#19: c z 7 1 9

Another base R option, but without rle
transform(
df,
ID = cumsum(c(1, (s <- paste0(GROUP, REGION))[-1] != head(s, -1)))
)
gives
GROUP REGION TIME LENGTH ID
1 a x 1 3 1
2 a x 2 3 1
3 a x 3 3 1
4 a y 4 3 2
5 a y 5 3 2
6 a y 6 3 2
7 a z 7 2 3
8 a z 8 2 3
9 b z 1 2 4
10 b z 2 2 4
11 b x 3 2 5
12 b x 4 2 5
13 c x 1 2 6
14 c x 2 2 6
15 c y 3 2 7
16 c y 4 2 7
17 c x 5 2 8
18 c x 6 2 8
19 c z 7 1 9
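For readers unfamiliar with the in-line assignment to s, the same logic spelled out step by step (with hypothetical intermediate names):
s <- paste0(df$GROUP, df$REGION)        # combined key per row, e.g. "ax", "ay", "az", ...
new_run <- c(1, s[-1] != head(s, -1))   # 1/TRUE wherever the key differs from the previous row
ID <- cumsum(new_run)                   # running count of run starts gives the sequential ID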

With dplyr
library(dplyr)
library(data.table)
df %>%
mutate(ID = rleid(GROUP, REGION))
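If loading data.table just for rleid() is undesirable, newer dplyr releases (1.1.0 or later, if available) provide consecutive_id(), which plays the same role; a sketch:
df %>%
  mutate(ID = consecutive_id(GROUP, REGION))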

Related

How to add value into new column based on corresponding value in another column?

This is the sample data, with 'y' being the new variable created.
x A B C y
A 1 4 7
B 5 6 7
C 3 5 3
If the value of column x is "A", I would like the value of column A to be displayed in column y, and similarly for the "B" and "C" values in column x.
Final result should be something like this:
x A B C y
A 1 4 7 1
B 5 6 7 6
C 3 5 3 3
A proposition:
df <- read.table(header=TRUE, text="
x A B C
A 1 4 7
B 5 6 7
C 3 5 3
"
)
df$y <- paste0("df$",df$x,"[df$x=='",df$x,"']")
df
#> x A B C y
#> 1 A 1 4 7 df$A[df$x=='A']
#> 2 B 5 6 7 df$B[df$x=='B']
#> 3 C 3 5 3 df$C[df$x=='C']
df$y <- eval(ivmte:::unstring(df$y))
df
#> x A B C y
#> 1 A 1 4 7 1
#> 2 B 5 6 7 6
#> 3 C 3 5 3 3
# Created on 2021-01-30 by the reprex package (v0.3.0.9001)
Regards,
Try this:
create_column <- function() {
  # for each row, pick the value from the column whose name matches x
  y <- numeric(nrow(your_dataframe))
  for (i in 1:nrow(your_dataframe)) {
    y[i] <- your_dataframe[i, which(names(your_dataframe) == your_dataframe$x[i])]
  }
  cbind(your_dataframe, y)
}
create_column()
x A B C y
1 A 1 4 7 1
2 B 5 6 7 6
3 C 3 5 3 3
Another option with apply:
cbind(your_dataframe, y = apply(your_dataframe, 1, function(x) {
  # x is the row as a character vector, so y comes out as character here
  x[which(names(x) == x['x'])]
}))
> your_dataframe
x A B C y
1 A 1 4 7 1
2 B 5 6 7 6
3 C 3 5 3 3
Try this:
df$y <- df[-1][cbind(seq(nrow(df)), match(df$x, names(df)[-1]))]
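This works because indexing with a two-column matrix picks one cell per row: the first column gives the row number and the second the matching column position within df[-1]. A quick decomposition (the name idx is just for illustration):
idx <- cbind(seq(nrow(df)), match(df$x, names(df)[-1]))
idx
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    3
df[-1][idx]
# [1] 1 6 3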

Selecting top N rows for each group based on value in column

I have a data frame like below:
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df
x y z
1 3 a 2
2 2 a 2
3 1 a 2
4 8 b 1
5 7 b 1
6 11 c 3
7 10 c 3
8 9 c 3
9 7 c 3
10 5 c 3
11 4 c 3
I want to select the top n rows for each group (by column y), where n is provided in column z.
So the output should be like:
output:
x y z
1 3 a 2
2 2 a 2
3 8 b 1
4 11 c 3
5 10 c 3
6 9 c 3
A solution with base R:
# df is split according to y, then we keep only the top "z" value (after ordering x)
# and rbind everything back together:
do.call(rbind,
lapply(split(df, df$y),
function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
# x y z
#a.1 3 a 2
#a.2 2 a 2
#b 8 b 1
#c.6 11 c 3
#c.7 10 c 3
#c.8 9 c 3
EDIT:
A much more direct way (still in base R) provided in comment by #mt1022:
df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
# x y z
#1 3 a 2
#2 2 a 2
#4 8 b 1
#6 11 c 3
#7 10 c 3
#8 9 c 3
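To see why this works: ave() here returns the within-group row number, which is then compared against z row by row. A quick decomposition of the same one-liner:
within_group_row <- ave(1:nrow(df), df$y, FUN = seq_along)
within_group_row
# [1] 1 2 3 1 2 1 2 3 4 5 6
df[within_group_row <= df$z, ]   # keep rows whose within-group index does not exceed z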
One approach with data.table:
library(data.table)
setDT(df)
df[,.(inc=seq_len(.N)<=z,x,z),by=.(y)][inc==T ,-2]
# y x z
#1: a 3 2
#2: a 2 2
#3: b 8 1
#4: c 11 3
#5: c 10 3
#6: c 9 3
A solution with dplyr that uses do:
df %>%
group_by(y) %>%
do(head(.,as.numeric(unique(.$z))))
I'm posting the solution I was looking for using dplyr. It is based on #HNSKD:
library(dplyr)
x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)
df %>% group_by(y) %>% slice(1:2)
Which returns the first two elements for each y:
# A tibble: 6 x 3
# Groups: y [3]
x y z
<dbl> <fct> <dbl>
1 3 a 2
2 2 a 2
3 8 b 1
4 7 b 1
5 11 c 3
6 10 c 3
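Note that slice(1:2) hard-codes two rows per group, so group b gets two rows even though its z is 1. If the cut-off should come from z, a variant like the following keeps the first z rows of each group (a sketch, assuming the rows are already sorted by decreasing x within each group, as in the example):
df %>%
  group_by(y) %>%
  slice(seq_len(first(z)))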

Label quantile by group with varying group sizes

Within each group (the "name" variable), I want to cut the value into quartiles and create a quartile label column for the variable "value". Since the group sizes vary, the quartile ranges for the different groups change as well. But the code below only cuts the quartiles by the overall values, resulting in the same quartile ranges for all groups.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt
dt.2 <- dt %>%
  group_by(name) %>%
  mutate(newcol = cut(value, breaks = quantile(value, probs = seq(0, 1, 0.25), na.rm = TRUE), include.lowest = TRUE))
dt.2
str(dt.2)
Data:
name value
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 b 1
10 b 2
11 b 3
12 b 4
13 c 1
14 c 2
15 c 3
16 c 4
17 c 5
Output from the above code.
Update: the problem is not that newcol is a factor, but that newcol has the same quartile ranges across all the groups. For example, for name b the values are 1-4, yet the quartile ranges include (3,5], because they are derived from the overall min(value) to max(value) regardless of the group.
name value newcol
<fctr> <int> <fctr>
1 a 1 [1,2]
2 a 2 [1,2]
3 a 3 (2,3]
4 a 4 (3,5]
5 a 5 (3,5]
6 a 6 (5,8]
7 a 7 (5,8]
8 a 8 (5,8]
9 b 1 [1,2]
10 b 2 [1,2]
11 b 3 (2,3]
12 b 4 (3,5]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,5]
17 c 5 (3,5]
Desired output
name value newcol/quartile label
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 2
15 c 3 3
16 c 4 4
17 c 5 4
Here's a way you can do it, following the split-apply-combine framework.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
split_dt <- lapply(split(dt, dt$name),
transform,
quantlabel = as.numeric(
cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
dt <- unsplit(split_dt, dt$name)
name value quantlabel
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 1
15 c 3 2
16 c 4 3
17 c 5 4
Edit: there's a data.table way.
Following this post, we can use the data.table package if performance is a concern:
library(data.table)
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt.t <- as.data.table(dt)
dt.t[,quantlabels := as.numeric(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)), name]
name value quantlabels
1: a 1 1
2: a 2 1
3: a 3 2
4: a 4 2
5: a 5 3
6: a 6 3
7: a 7 4
8: a 8 4
9: b 1 1
10: b 2 2
11: b 3 3
12: b 4 4
13: c 1 1
14: c 2 1
15: c 3 2
16: c 4 3
17: c 5 4
Edit: and there's a dplyr way.
We can follow #akrun's advice and use as.numeric (which is what we've done for the other solutions):
dt %>%
group_by(name) %>%
mutate(quantlabel =
as.numeric(
cut(value,
breaks = quantile(value, probs = seq(0,1,.25)),
include.lowest = T)))
Note that if you instead wanted the labels themselves, use as.character:
dt %>%
group_by(name) %>%
mutate(quantlabel = as.character(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
Source: local data frame [17 x 3]
Groups: name [3]
name value quantlabel
<fctr> <int> <chr>
1 a 1 [1,2.75]
2 a 2 [1,2.75]
3 a 3 (2.75,4.5]
4 a 4 (2.75,4.5]
5 a 5 (4.5,6.25]
6 a 6 (4.5,6.25]
7 a 7 (6.25,8]
8 a 8 (6.25,8]
9 b 1 [1,1.75]
10 b 2 (1.75,2.5]
11 b 3 (2.5,3.25]
12 b 4 (3.25,4]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,4]
17 c 5 (4,5]
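If the desired output above is taken literally (each group's maximum landing in quartile 4), findInterval() applied to the group quantiles gives those numeric labels directly, and on this example it reproduces the desired labels for all three groups. A sketch with dplyr, assuming dt as defined in the question:
dt %>%
  group_by(name) %>%
  mutate(quartile = findInterval(value,
                                 quantile(value, probs = seq(0, 1, 0.25)),
                                 all.inside = TRUE))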

Create a new variable which counts the length of duplicates in R

I have a data frame and I want to create a variable z that counts the length of each run of duplicates in y: if y contains a run like 1,1, set z = 2,2; if y contains a run of three identical values, set z = 3,3,3.
x = c("a","b","c","d","e","a","b","c","d","e","a","b","c")
y = c(1,1,2,2,2,3,3,4,4,4,5,5,5)
data <- data.frame(x,y)
data
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
Thanks for your help.
You can try rle:
data$z <- with(data, unlist(mapply(rep, rle(y)$lengths, rle(y)$lengths)))
data
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
If your variable y is sorted as an increasing sequence, as you say, then the following solution will work:
# calculate counts of each level
counts <- table(data$y)
# fill in z
data$z <- counts[match(data$y, names(counts))]
Note, however, that this method will fail if y is not ordered, since you want to restart the count when a different level occurs. For that purpose, #psidom's solution is more robust to mis-ordered data, as rle will reset the count.
This method calculates the total occurrences of a level and then feeds these total counts to the proper location using match.
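A small illustration of that caveat with a hypothetical unsorted vector: table() counts total occurrences of each level, while rle() counts consecutive runs, so the two disagree as soon as a level reappears later.
y_unsorted <- c(1, 1, 2, 1)
counts <- table(y_unsorted)
counts[match(y_unsorted, names(counts))]   # total occurrences per level: 3 3 1 3
rle(y_unsorted)$lengths                    # consecutive run lengths: 2 1 1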
Here is a quick method using dplyr, and its rather intuitive syntax:
library(dplyr)
left_join(data, data %>%
group_by(y) %>%
summarize(z = n()),
by = "y")
x y z
1 a 1 2
2 b 1 2
3 c 2 3
4 d 2 3
5 e 2 3
6 a 3 2
7 b 3 2
8 c 4 3
9 d 4 3
10 e 4 3
11 a 5 3
12 b 5 3
13 c 5 3
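The same result can be obtained without the join by counting inside mutate(). A sketch; like the join, this gives total counts per y, which coincides with run lengths here because y is sorted:
data %>%
  group_by(y) %>%
  mutate(z = n()) %>%
  ungroup()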
We can do this easily with data.table
library(data.table)
setDT(data)[, z := .N , rleid(y)]
data
# x y z
# 1: a 1 2
# 2: b 1 2
# 3: c 2 3
# 4: d 2 3
# 5: e 2 3
# 6: a 3 2
# 7: b 3 2
# 8: c 4 3
# 9: d 4 3
#10: e 4 3
#11: a 5 3
#12: b 5 3
#13: c 5 3
Or using rle from base R without any loops
inverse.rle(within.list(rle(data$y), values <- lengths))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3
Or another base R method with ave
with(data, ave(y, cumsum(c(TRUE, y[-1]!= y[-length(y)])), FUN=length))
#[1] 2 2 3 3 3 2 2 3 3 3 3 3 3

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows when the number of rows for an id is less than 5. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows.
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach.
First get the number of rows per id, and identify which ids to keep:
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer.
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
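An equivalent and arguably more readable data.table idiom returns the qualifying groups directly (a sketch; note that id comes out as the first column):
DT[, if (.N >= 5L) .SD, by = id]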
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
