This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 1 year ago.
I have a data frame in R that is divided into groups, like this:
Row Group
1   A
2   B
3   A
4   D
5   C
6   B
7   C
8   C
9   A
10  B
I would like to add a unique numeric ID to each group, so that finally I would have something like this:
Row Group ID
1   A     1
2   B     2
3   A     1
4   D     4
5   C     3
6   B     2
7   C     3
8   C     3
9   A     1
10  B     2
How could I achieve this?
Thank you very much.
Here is a simple way.
df1$ID <- as.integer(factor(df1$Group))
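With the sample data (the df1 defined in the last answer below), this reproduces the desired IDs, because factor() sorts its levels alphabetically:
df1$ID <- as.integer(factor(df1$Group))
head(df1, 5)
#   Row Group ID
# 1   1     A  1
# 2   2     B  2
# 3   3     A  1
# 4   4     D  4
# 5   5     C  3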
Three solutions have been posted (mine, TarJae's, and akrun's); they can be timed with increasing data sizes. akrun's is the fastest.
library(microbenchmark)
library(dplyr)
library(ggplot2)
funtest <- function(x, n){
  out <- lapply(seq_len(n), function(i){
    # double the data i times so the sizes grow geometrically
    for(j in seq_len(i)) x <- rbind(x, x)
    cat("nrow(x):", nrow(x), "\n")
    mb <- microbenchmark(
      match = with(x, match(Group, sort(unique(Group)))),
      dplyr = x %>% group_by(Group) %>% mutate(ID = cur_group_id()),
      intfac = as.integer(factor(x$Group))
    )
    mb$n <- i
    mb
  })
  out <- do.call(rbind, out)
  # median time per method and data size
  aggregate(time ~ ., out, median)
}
df1 %>%
  funtest(10) %>%
  ggplot(aes(n, time, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10, labels = 1:10) +
  scale_y_continuous(trans = "log10") +
  theme_bw()
Update
group_indices() was deprecated in dplyr 1.0.0.
Please use cur_group_id() instead.
df1 <- df %>%
  group_by(Group) %>%
  mutate(ID = cur_group_id())
First answer:
You can use group_indices
library(dplyr)
df1 <- df %>%
  group_by(Group) %>%
  mutate(ID = group_indices())
data
df <- tribble(
  ~Row, ~Group,
  1, "A",
  2, "B",
  3, "A",
  4, "D",
  5, "C",
  6, "B",
  7, "C",
  8, "C",
  9, "A",
  10, "B")
     Row Group    ID
   <int> <chr> <int>
 1     1 A         1
 2     2 B         2
 3     3 A         1
 4     4 D         4
 5     5 C         3
 6     6 B         2
 7     7 C         3
 8     8 C         3
 9     9 A         1
10    10 B         2
We can use match on the 'Group' column against its sorted unique values to get the position index:
df1$ID <- with(df1, match(Group, sort(unique(Group))))
data
df1 <- structure(list(Row = 1:10, Group = c("A", "B", "A", "D", "C",
"B", "C", "C", "A", "B")), class = "data.frame", row.names = c(NA,
-10L))
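For clarity, the two steps of the match() approach with this data:
sort(unique(df1$Group))
# [1] "A" "B" "C" "D"
match(df1$Group, sort(unique(df1$Group)))
# [1] 1 2 1 4 3 2 3 3 1 2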
Related
This question already has answers here:
Pasting elements of two vectors alphabetically
(5 answers)
How do you sort and paste two columns in a mutate statement?
(1 answer)
Row-wise sort then concatenate across specific columns of data frame
(2 answers)
Closed 1 year ago.
I'm not sure if I phrased my question properly, so let me give a simplified example:
Given a dataset as follows:
dat <- data_frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"))
how can I compute a pair variable so that it represents whatever is in X and Y at a given row, but without generating duplicates, as here:
dat$pair <- c("A-B", "A-B", "B-C", "C-A", "C-A")
dat
# A tibble: 5 × 3
  X     Y     pair
  <chr> <chr> <chr>
1 A     B     A-B
2 B     A     A-B
3 B     C     B-C
4 C     A     C-A
5 A     C     C-A
I can compute a pairing with paste0, but it will introduce duplicates (C-A is the same as A-C for me) that I want to avoid:
> dat <- mutate(dat, pair = paste0(X, "-", Y))
> dat
# A tibble: 5 × 3
  X     Y     pair
  <chr> <chr> <chr>
1 A     B     A-B
2 B     A     B-A
3 B     C     B-C
4 C     A     C-A
5 A     C     A-C
We can use pmin and pmax to sort the values of each row in parallel and paste them.
transform(dat, pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
# X Y pair
#1 A B A-B
#2 B A A-B
#3 B C B-C
#4 C A A-C
#5 A C A-C
If you prefer dplyr this can be written as -
library(dplyr)
dat %>% mutate(pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
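The trick works because pmin() and pmax() compare the two vectors element-wise (character vectors compare alphabetically), so the smaller value of each pair always comes first:
pmin(c("C", "A"), c("A", "C"))
# [1] "A" "A"
pmax(c("C", "A"), c("A", "C"))
# [1] "C" "C"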
Here I sort the two values within each row:
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"))
library(dplyr)
dat %>%
  rowwise %>%
  mutate(pair = paste0(sort(c(as.character(X), as.character(Y)), decreasing = FALSE), collapse = '-')) %>%
  ungroup
Output:
  X     Y     pair
  <fct> <fct> <chr>
1 A     B     A-B
2 B     A     A-B
3 B     C     B-C
4 C     A     A-C
5 A     C     A-C
With dplyr and tidyr you could try:
library(dplyr)
library(tidyr)
dat %>%
  rowwise() %>%
  mutate(pair = list(c(X, Y)),
         pair = list(sort(pair)),
         pair = list(paste(pair, collapse = "-"))) %>%
  select(pair) %>%
  distinct() %>%
  unnest(pair)
#> # A tibble: 3 x 1
#> pair
#> <chr>
#> 1 A-B
#> 2 B-C
#> 3 A-C
Created on 2021-08-27 by the reprex package (v2.0.0)
data
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"))
I am trying to convert a data frame in this format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"))
df
  first words
1     a about
2     a among
3     b  blue
4     b   but
5     b  both
6     c   cat
into the following format:
df1
  first           words
1     a    about, among
2     b blue, but, both
3     c             cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
  first   words
1     a    1, 2
2     b 3, 5, 4
3     c       6
and tidyverse:
df %>%
  group_by(first) %>%
  group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
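If you prefer to name the result column explicitly instead of aggregating every column of .SD, here is a sketch starting again from the original df:
setDT(df)[, .(words = paste(words, collapse = ", ")), by = first]
#    first           words
# 1:     a    about, among
# 2:     b blue, but, both
# 3:     c             cat
(toString(x) is just shorthand for paste(x, collapse = ", ").)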
Using the tidyverse, after the group_by use summarise to either paste the words together
library(dplyr)
df %>%
  group_by(first) %>%
  summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
  group_by(first) %>%
  summarise(words = list(words))
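The list-column version keeps the words as a character vector per group, which is convenient for further programmatic use; it prints roughly like this (first may show as <fct> on R < 4.0):
# A tibble: 3 x 2
#   first words
#   <chr> <list>
# 1 a     <chr [2]>
# 2 b     <chr [3]>
# 3 c     <chr [1]>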
I have two data.frames dfA and dfB. Both of them have a column called key.
Now I'd like to know how many duplicates for A$key there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. What's the easiest way to do this?
Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor is needed here so that we also 'count' values that do not appear in B$key, namely C.
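Without the factor() wrapper the zero-count value would simply be missing:
table(B$key)
#A B D
#2 3 1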
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
  filter(key %in% A$key) %>%                 # keep values that appear in A
  count(key) %>%                             # count values
  complete(key = A$key, fill = list(n = 0))  # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1
Using tidyverse you can do:
A %>%
  left_join(B %>%               # merge A with the per-key counts from B
              group_by(key) %>%
              tally(), by = c("key" = "key")) %>%
  mutate(n = ifelse(is.na(n), 0, n)) # replace NA with 0
  key n
1   A 2
2   B 3
3   C 0
4   D 1
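As a side note, the NA-replacement step can also be written with dplyr's coalesce(), which returns the first non-NA value:
mutate(n = coalesce(n, 0))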
Actually you mean how many occurrences of each value of A$key you have in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0
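A more compact equivalent for the truncation at zero is pmax():
dupes <- pmax(o - 1, 0)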
My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1   3    a    in other words, url #1 appears three times, each with seg = "a",
2   2    b    url #2 appears twice, each with seg = "b",
3   3    c    url #3 appears three times with seg = "c",
3   2    x    twice with seg = "x", and
3   1    y    once with seg = "y"
4   1    d    etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
  url.seg <- df[which(df$url == xx), ]      # subset rows for this url (xx is already a url value)
  freq.df.url <- data.frame(table(url.seg)) # frequency distribution of the segs for this url
  result <- rbind(result, freq.df.url)      # append onto the running result
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq > 0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?
In base R you can do this:
aggregate(freq ~ seg + url, `$<-`(df, freq, 1), sum)
# or aggregate(freq ~ seg + url, data.frame(df, freq = 1), sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
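For illustration, the intermediate table that aggregate() receives (first three rows shown):
head(`$<-`(df, freq, 1), 3)
#   url seg freq
# 1   1   a    1
# 2   3   c    1
# 3   1   a    1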
Another possibility:
subset(as.data.frame(table(df[2:1])), Freq != 0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.
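The intermediate contingency table makes the filtering step clear; Freq != 0 then drops the empty cells:
table(df[2:1])
#    url
# seg 1 2 3 4
#   a 3 0 0 0
#   b 0 2 0 0
#   c 0 0 3 0
#   d 0 0 0 1
#   x 0 0 2 0
#   y 0 0 1 0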
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want
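For reference, this produces:
#   url freq seg
# 1   1    3   a
# 2   2    2   b
# 3   3    3   c
# 4   3    2   x
# 5   3    1   y
# 6   4    1   d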
An option can be to use table and convert with as.data.frame to get the data in the format needed by the OP:
library(tidyverse)
table(df) %>%
  as.data.frame() %>%
  filter(Freq > 0) %>%
  arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>%
  group_by(url, seg) %>%
  summarise(freq = n()) %>%
  arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1
This question already has answers here:
How to remove rows that have only 1 combination for a given ID
(4 answers)
Selecting & grouping dual-category data from a data frame
(4 answers)
Closed 5 years ago.
I have a df that looks like
df <- data.frame(Name = c("A", "A", "A", "B", "B", "C", "D", "E", "E"),
                 Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has "2" and "15". "15" is the wrong value and should not be there.
I would like to find the rows where Value doesn't match within the same Name.
The ideal output would look like
B 2
B 15
You can use tidyverse functions like:
df %>%
  group_by(Name, Value) %>%
  unique()
giving:
  Name Value
1    A     1
2    B     2
3    B    15
4    C     3
5    D     4
6    E     5
then, to keep only the Names with multiple Values, extend the pipeline:
df %>%
  group_by(Name, Value) %>%
  unique() %>%
  group_by(Name) %>%
  filter(n() > 1)
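With the sample data this returns just the mismatching rows:
#   Name Value
# 1    B     2
# 2    B    15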
Something like this? This searches for Names that are associated with more than one Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A", "A", "B", "B", "C", "D", "E", "E"),
                 Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), function(i) {
  if (length(unique(df[df$Name == i, ]$Value)) > 1) {
    out <- df[df$Name == i, ]     # all rows for this Name
    out[!duplicated(out$Value), ] # keep one copy of each distinct Value
  }
}))
res
Result as expected:
  Name Value
4    B     2
5    B    15
Filter(function(x) nrow(unique(x)) != 1, split(df, df$Name))
$B
  Name Value
4    B     2
5    B    15
Or:
Reduce(rbind, by(df, df$Name, function(x) if (nrow(unique(x)) > 1) x))
  Name Value
4    B     2
5    B    15