Creating a new data frame using existing data - r

I would like to create a new data from my existing data frame "ab". The new data frame should look like "Newdf".
a<- c(1:5)
b<-c(11:15)
ab<-data.frame(C1=a,c2=b)
ab
df<-c(1,11,2,12,3,13,4,14,5,15)
CMT<-c(1:2)
CMT1<-rep.int(CMT,times=5)
Newdf<-data.frame(DV=df,Comp=CMT1)
Newdf
Can we use dplyr package? If yes, how?

More importantly than dplyr, you'd need tidyr:
library(tidyr)
library(dplyr)
ab %>%
gather(Comp, DV) %>%
mutate(Comp = recode(Comp, "C1" = 1, "c2" = 2))
# Comp DV
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 11
# 7 2 12
# 8 2 13
# 9 2 14
# 10 2 15

Using dplyr and tidyr gives you something close...
library(tidyr)
library(dplyr)
df2 <- ab %>%
mutate(Order=1:n()) %>%
gather(key=Comp,value=DV,C1,c2) %>%
arrange(Order) %>%
mutate(Comp=recode(Comp,"C1"=1,"c2"=2)) %>%
select(DV,Comp)
df2
DV Comp
1 1 1
2 11 2
3 2 1
4 12 2
5 3 1
6 13 2
7 4 1
8 14 2
9 5 1
10 15 2

Although the OP has asked for a dpylr solution, I felt challenged to look for a data.table solution. So, FWIW, here is an alternative approach using melt().
Note that this solution does not depend on specific column names in ab as the two other dplyr solutions do. In addition, it should be working for more than two columns in ab as well (untested).
library(data.table)
melt(setDT(ab, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
)[, Comp := rleid(variable)
][order(rn)][, c("rn", "variable") := NULL][]
# DV Comp
# 1: 1 1
# 2: 11 2
# 3: 2 1
# 4: 12 2
# 5: 3 1
# 6: 13 2
# 7: 4 1
# 8: 14 2
# 9: 5 1
#10: 15 2
Data
ab <- structure(list(C1 = 1:5, c2 = 11:15), .Names = c("C1", "c2"),
row.names = c(NA, -5L), class = "data.frame")
ab
# C1 c2
#1 1 11
#2 2 12
#3 3 13
#4 4 14
#5 5 15

Related

How to create index of unique elements of a column in R dataframe [duplicate]

I have a data frame (all_data) in which I have a list of sites (1... to n) and their scores e.g.
site score
1 10
1 11
1 12
4 10
4 11
4 11
8 9
8 8
8 7
I want create a column that numbers each level of site in numerical order, like a counter. In the example, the sites (1, 4, and 8) would be have a corresponding counter from 1 to 3 in the 'number' column:
site score number
1 10 1
1 11 1
1 12 1
4 10 2
4 11 2
4 11 2
8 9 3
8 8 3
8 7 3
I am sure this must be easily solved, but I have not found a way yet.
Try Data$number <- as.numeric(as.factor(Data$site))
On a sidenote : the difference between the solution of me and #Chase on one hand, and the one of #DWin on the other, is the ordering of the numbers. Both as.factor and factor will automatically sort the levels, whereas that doesn't happen in the solution of #DWin :
Dat <- data.frame(site = rep(c(1,8,4), each = 3), score = runif(9))
Dat$number <- as.numeric(factor(Dat$site))
Dat$sitenum <- match(Dat$site, unique(Dat$site) )
Gives
> Dat
site score number sitenum
1 1 0.7377561 1 1
2 1 0.3131139 1 1
3 1 0.7862290 1 1
4 8 0.4480387 3 2
5 8 0.3873210 3 2
6 8 0.8778102 3 2
7 4 0.6916340 2 3
8 4 0.3033787 2 3
9 4 0.6552808 2 3
In the new dplyr 1.0.0 we can use cur_group_id() which gives a unique numeric identifier to a group.
library(dplyr)
df %>% group_by(site) %>% mutate(number = cur_group_id())
# site score number
# <int> <int> <int>
#1 1 10 1
#2 1 11 1
#3 1 12 1
#4 4 10 2
#5 4 11 2
#6 4 11 2
#7 8 9 3
#8 8 8 3
#9 8 7 3
data
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 8L, 8L, 8L),
score = c(10L, 11L, 12L, 10L, 11L, 11L, 9L, 8L, 7L)),
class = "data.frame", row.names = c(NA, -9L))
Two other options:
1) Using the .GRP function from the data.table package:
library(data.table)
setDT(dat)[, num := .GRP, by = site]
with the example dataset from below this results in:
> dat
site score num
1: 1 0.14945795 1
2: 1 0.60035697 1
3: 1 0.94643075 1
4: 8 0.68835336 2
5: 8 0.50553372 2
6: 8 0.37293624 2
7: 4 0.33580504 3
8: 4 0.04825135 3
9: 4 0.61894754 3
10: 8 0.96144729 2
11: 8 0.65496051 2
12: 8 0.51029199 2
2) Using the group_indices function from dplyr:
dat$num <- group_indices(dat, site)
or when you want to work around non-standard evaluation:
library(dplyr)
dat %>%
mutate(num = group_indices_(dat, .dots = c('site')))
which results in:
site score num
1 1 0.42480366 1
2 1 0.98736177 1
3 1 0.35766187 1
4 8 0.06243182 3
5 8 0.55617002 3
6 8 0.20304632 3
7 4 0.90855921 2
8 4 0.25215078 2
9 4 0.44981251 2
10 8 0.60288270 3
11 8 0.46946587 3
12 8 0.44941782 3
As can be seen, dplyr gives a different order of the group numbers.
If you want another number every time the group changes, there are several other options:
1) with base R:
# option 1:
dat$num <- cumsum(c(TRUE, head(dat$site, -1) != tail(dat$site, -1)))
# option 2:
x <- rle(dat$site)$lengths
dat$num <- rep(seq_along(x), times=x)
2) with the data.table package:
library(data.table)
setDT(dat)[, num := rleid(site)]
which all result in:
> dat
site score num
1 1 0.80817855 1
2 1 0.07881334 1
3 1 0.60092828 1
4 8 0.71477988 2
5 8 0.51384565 2
6 8 0.72011650 2
7 4 0.74994627 3
8 4 0.09564052 3
9 4 0.39782587 3
10 8 0.29446540 4
11 8 0.61725367 4
12 8 0.97427413 4
Used data:
dat <- data.frame(site = rep(c(1,8,4,8), each = 3), score = runif(12))
This should be fairly efficient and understandable:
Dat$sitenum <- match(Dat$site, unique(Dat$site))
Using the data from #Jaap, a different dplyr possibility using dense_rank() could be:
dat %>%
mutate(ID = dense_rank(site))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 8 0.1150771 3
5 8 0.9978203 3
6 8 0.7781222 3
7 4 0.4081830 2
8 4 0.2782333 2
9 4 0.9566959 2
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or a rleid()-like dplyr approach, with the data arranged first:
dat %>%
arrange(site) %>%
mutate(ID = with(rle(site), rep(seq_along(lengths), lengths)))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 4 0.4081830 2
5 4 0.2782333 2
6 4 0.9566959 2
7 8 0.1150771 3
8 8 0.9978203 3
9 8 0.7781222 3
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or using duplicated() and cumsum():
df %>%
mutate(ID = cumsum(!duplicated(site)))
The same with base R:
df$ID <- with(rle(df$site), rep(seq_along(lengths), lengths))
Or:
df$ID <- cumsum(!duplicated(df$site))
You can turn site into a factor and then return the numeric or integer values of that factor:
dat <- data.frame(site = rep(c(1,4,8), each = 3), score = runif(9))
dat$number <- as.integer(factor(dat$site))
dat
site score number
1 1 0.5305773 1
2 1 0.9367732 1
3 1 0.1831554 1
4 4 0.4068128 2
5 4 0.3438962 2
6 4 0.8123883 2
7 8 0.9122846 3
8 8 0.2949260 3
9 8 0.6771526 3
Another solution using the data.table package.
Example with the more complete datset provided by Jaap:
setDT(dat)[, number := frank(site, ties.method = "dense")]
dat
site score number
1: 1 0.3107920 1
2: 1 0.3640102 1
3: 1 0.1715318 1
4: 8 0.7247535 3
5: 8 0.1263025 3
6: 8 0.4657868 3
7: 4 0.6915818 2
8: 4 0.3558270 2
9: 4 0.3376173 2
10: 8 0.7934963 3
11: 8 0.9641918 3
12: 8 0.9832120 3
Another way to do it. That I think is easy to get even when you know little about R:
library(dplyr)
df <- data.frame('site' = c(1, 1, 1, 4, 4, 4, 8, 8, 8))
df <- mutate(df, 'number' = cumsum(site != lag(site, default=-1)))
I too recently needed a solution to this. Didn't find this thread, started mine and was re-directed here (thank you). Good to see many solutions but to me (and I feel is good practice), a scalable solution is important. Hence, benchmarked several solutions below.
df <- data.table(country = rep(c('a', 'b', 'b', 'c', 'c', 'c'), 1e7)
)
a <-
microbenchmark(factor = {df[, group_id := as.integer(factor(country))]}
, unique_match = df[, group_id := match(country, unique(country))]
, rle = df[ , group_id := with(rle(country), rep(seq_along(lengths), lengths))]
, dup_cumsum = df[, group_id := cumsum(!duplicated(country))]
, frank = df[, group_id := frank(country, ties.method = "dense")]
, GRP = df[, group_id := .GRP, country]
, rleid = df[, group_id := rleid(country)]
, cumsum_head_tail = df[, group_id := cumsum(c(TRUE, head(country, -1) != tail(country, -1)))]
, times = 50
)
autoplot(a)
It would appear the podium is held by data.table.
Still, was great to learn of alternatives e.g. cumsum(!duplicated(country)). What a brainteaser!
Using collapse::group, Fast Hash-Based Grouping:
library(collapse)
d = data.frame(site = rep(c(1,8,4), each = 3))
settransform(d, number = group(site)) # settransform updates data by reference. See also ftransform.
d
# site number
# 1 1 1
# 2 1 1
# 3 1 1
# 4 8 2
# 5 8 2
# 6 8 2
# 7 4 3
# 8 4 3
# 9 4 3
The collapse functions are considerably faster on larger data. Here I compare with two common base idioms (factor / as.integer; match / unique), and two data.table methods (.GRP; frank), using 1e5 groups with 1e3 rows each.
library(data.table)
library(microbenchmark)
nr = 1e3
ng = 1e5
set.seed(1)
d1 = data.table(g = sample(1:ng, nr*ng, replace = TRUE))
d2 = copy(d1)
d3 = copy(d1)
d4 = copy(d1)
d5 = copy(d1)
microbenchmark(
factor = {d1[ , gi := as.integer(factor(g))]},
unique_match = {d2[, gi := match(g, unique(g))]},
frank = {d3[, gi := frank(g, ties.method = "dense")]},
GRP = {d4[, gi := .GRP, by = g]},
collap = {settransform(d5, gi = group(g))},
times = 20L)
# Unit: milliseconds
# expr min lq mean median uq max
# factor 46648.6099 48493.0146 49918.956 49336.208 51547.0789 53585.472
# unique_match 12662.0978 13057.3210 13534.391 13530.457 13998.0141 14407.036
# frank 2628.4923 2695.7064 3240.522 2833.950 3797.5579 5547.227
# GRP 2754.2153 3283.2444 3796.109 3717.239 4184.5174 5117.918
# collap 640.1993 668.2301 729.351 698.307 753.2932 1086.592
# Check equality with data.table .GRP. Use as.vector to remove attributes
all.equal(d4$gi, as.vector(d5$gi))
# [1] TRUE
Note: group, .GRP and match / unique all create the group number according to order of appearance of the different values in the original data (also discussed in previous posts).
If you want to keep your existing columns and assign back to the same data frame...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id())
And you can do multiple columns like so...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id()) %>%
group_by(state) %>%
mutate(state_id = cur_group_id()) %>%
group_by(name) %>%
mutate(name_id = cur_group_id())
If the numbers of the site column were unordered, we could use as_factor() in combination with fct_inorder() from the forcats package:
library(tibble)
library(dplyr)
library(forcats)
all_data_unordered <- tibble(site = c(1,1,1,8,8,8,4,4,4),
score = c(10,11,12,10,11,11,9,8,7))
all_data_unordered |>
mutate(number = as_factor(site) |> fct_inorder() |> as.integer())
#> # A tibble: 9 × 3
#> site score number
#> <dbl> <dbl> <int>
#> 1 1 10 1
#> 2 1 11 1
#> 3 1 12 1
#> 4 8 10 2
#> 5 8 11 2
#> 6 8 11 2
#> 7 4 9 3
#> 8 4 8 3
#> 9 4 7 3
Created on 2021-11-05 by the reprex package (v2.0.1)
Since dplyr 1.1.0, another option is consecutive_id:
library(dplyr)
df %>%
mutate(id = consecutive_id(site))
# site score id
# 1 1 10 1
# 2 1 11 1
# 3 1 12 1
# 4 4 10 2
# 5 4 11 2
# 6 4 11 2
# 7 8 9 3
# 8 8 8 3
# 9 8 7 3
Note that consecutive_id, like data.table::rleid but unlike cur_group_id or as.numeric(factor(.)) will return an ID for consecutive values, meaning that if the same value appears not consecutively, it'll create a new id.
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -9L))
df %>%
mutate(cons_id = consecutive_id(site)) %>%
group_by(site) %>%
mutate(cur_group_id = cur_group_id())
# site cons_id cur_group_id
# 1 1 1 1
# 2 1 1 1
# 3 1 1 1
# 4 4 2 2
# 5 4 2 2
# 6 4 2 2
# 7 1 3 1
# 8 1 3 1
# 9 1 3 1

Trying to make a new column using R [duplicate]

I have a data frame (all_data) in which I have a list of sites (1... to n) and their scores e.g.
site score
1 10
1 11
1 12
4 10
4 11
4 11
8 9
8 8
8 7
I want create a column that numbers each level of site in numerical order, like a counter. In the example, the sites (1, 4, and 8) would be have a corresponding counter from 1 to 3 in the 'number' column:
site score number
1 10 1
1 11 1
1 12 1
4 10 2
4 11 2
4 11 2
8 9 3
8 8 3
8 7 3
I am sure this must be easily solved, but I have not found a way yet.
Try Data$number <- as.numeric(as.factor(Data$site))
On a sidenote : the difference between the solution of me and #Chase on one hand, and the one of #DWin on the other, is the ordering of the numbers. Both as.factor and factor will automatically sort the levels, whereas that doesn't happen in the solution of #DWin :
Dat <- data.frame(site = rep(c(1,8,4), each = 3), score = runif(9))
Dat$number <- as.numeric(factor(Dat$site))
Dat$sitenum <- match(Dat$site, unique(Dat$site) )
Gives
> Dat
site score number sitenum
1 1 0.7377561 1 1
2 1 0.3131139 1 1
3 1 0.7862290 1 1
4 8 0.4480387 3 2
5 8 0.3873210 3 2
6 8 0.8778102 3 2
7 4 0.6916340 2 3
8 4 0.3033787 2 3
9 4 0.6552808 2 3
In the new dplyr 1.0.0 we can use cur_group_id() which gives a unique numeric identifier to a group.
library(dplyr)
df %>% group_by(site) %>% mutate(number = cur_group_id())
# site score number
# <int> <int> <int>
#1 1 10 1
#2 1 11 1
#3 1 12 1
#4 4 10 2
#5 4 11 2
#6 4 11 2
#7 8 9 3
#8 8 8 3
#9 8 7 3
data
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 8L, 8L, 8L),
score = c(10L, 11L, 12L, 10L, 11L, 11L, 9L, 8L, 7L)),
class = "data.frame", row.names = c(NA, -9L))
Two other options:
1) Using the .GRP function from the data.table package:
library(data.table)
setDT(dat)[, num := .GRP, by = site]
with the example dataset from below this results in:
> dat
site score num
1: 1 0.14945795 1
2: 1 0.60035697 1
3: 1 0.94643075 1
4: 8 0.68835336 2
5: 8 0.50553372 2
6: 8 0.37293624 2
7: 4 0.33580504 3
8: 4 0.04825135 3
9: 4 0.61894754 3
10: 8 0.96144729 2
11: 8 0.65496051 2
12: 8 0.51029199 2
2) Using the group_indices function from dplyr:
dat$num <- group_indices(dat, site)
or when you want to work around non-standard evaluation:
library(dplyr)
dat %>%
mutate(num = group_indices_(dat, .dots = c('site')))
which results in:
site score num
1 1 0.42480366 1
2 1 0.98736177 1
3 1 0.35766187 1
4 8 0.06243182 3
5 8 0.55617002 3
6 8 0.20304632 3
7 4 0.90855921 2
8 4 0.25215078 2
9 4 0.44981251 2
10 8 0.60288270 3
11 8 0.46946587 3
12 8 0.44941782 3
As can be seen, dplyr gives a different order of the group numbers.
If you want another number every time the group changes, there are several other options:
1) with base R:
# option 1:
dat$num <- cumsum(c(TRUE, head(dat$site, -1) != tail(dat$site, -1)))
# option 2:
x <- rle(dat$site)$lengths
dat$num <- rep(seq_along(x), times=x)
2) with the data.table package:
library(data.table)
setDT(dat)[, num := rleid(site)]
which all result in:
> dat
site score num
1 1 0.80817855 1
2 1 0.07881334 1
3 1 0.60092828 1
4 8 0.71477988 2
5 8 0.51384565 2
6 8 0.72011650 2
7 4 0.74994627 3
8 4 0.09564052 3
9 4 0.39782587 3
10 8 0.29446540 4
11 8 0.61725367 4
12 8 0.97427413 4
Used data:
dat <- data.frame(site = rep(c(1,8,4,8), each = 3), score = runif(12))
This should be fairly efficient and understandable:
Dat$sitenum <- match(Dat$site, unique(Dat$site))
Using the data from #Jaap, a different dplyr possibility using dense_rank() could be:
dat %>%
mutate(ID = dense_rank(site))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 8 0.1150771 3
5 8 0.9978203 3
6 8 0.7781222 3
7 4 0.4081830 2
8 4 0.2782333 2
9 4 0.9566959 2
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or a rleid()-like dplyr approach, with the data arranged first:
dat %>%
arrange(site) %>%
mutate(ID = with(rle(site), rep(seq_along(lengths), lengths)))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 4 0.4081830 2
5 4 0.2782333 2
6 4 0.9566959 2
7 8 0.1150771 3
8 8 0.9978203 3
9 8 0.7781222 3
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or using duplicated() and cumsum():
df %>%
mutate(ID = cumsum(!duplicated(site)))
The same with base R:
df$ID <- with(rle(df$site), rep(seq_along(lengths), lengths))
Or:
df$ID <- cumsum(!duplicated(df$site))
You can turn site into a factor and then return the numeric or integer values of that factor:
dat <- data.frame(site = rep(c(1,4,8), each = 3), score = runif(9))
dat$number <- as.integer(factor(dat$site))
dat
site score number
1 1 0.5305773 1
2 1 0.9367732 1
3 1 0.1831554 1
4 4 0.4068128 2
5 4 0.3438962 2
6 4 0.8123883 2
7 8 0.9122846 3
8 8 0.2949260 3
9 8 0.6771526 3
Another solution using the data.table package.
Example with the more complete datset provided by Jaap:
setDT(dat)[, number := frank(site, ties.method = "dense")]
dat
site score number
1: 1 0.3107920 1
2: 1 0.3640102 1
3: 1 0.1715318 1
4: 8 0.7247535 3
5: 8 0.1263025 3
6: 8 0.4657868 3
7: 4 0.6915818 2
8: 4 0.3558270 2
9: 4 0.3376173 2
10: 8 0.7934963 3
11: 8 0.9641918 3
12: 8 0.9832120 3
Another way to do it. That I think is easy to get even when you know little about R:
library(dplyr)
df <- data.frame('site' = c(1, 1, 1, 4, 4, 4, 8, 8, 8))
df <- mutate(df, 'number' = cumsum(site != lag(site, default=-1)))
I too recently needed a solution to this. Didn't find this thread, started mine and was re-directed here (thank you). Good to see many solutions but to me (and I feel is good practice), a scalable solution is important. Hence, benchmarked several solutions below.
df <- data.table(country = rep(c('a', 'b', 'b', 'c', 'c', 'c'), 1e7)
)
a <-
microbenchmark(factor = {df[, group_id := as.integer(factor(country))]}
, unique_match = df[, group_id := match(country, unique(country))]
, rle = df[ , group_id := with(rle(country), rep(seq_along(lengths), lengths))]
, dup_cumsum = df[, group_id := cumsum(!duplicated(country))]
, frank = df[, group_id := frank(country, ties.method = "dense")]
, GRP = df[, group_id := .GRP, country]
, rleid = df[, group_id := rleid(country)]
, cumsum_head_tail = df[, group_id := cumsum(c(TRUE, head(country, -1) != tail(country, -1)))]
, times = 50
)
autoplot(a)
It would appear the podium is held by data.table.
Still, was great to learn of alternatives e.g. cumsum(!duplicated(country)). What a brainteaser!
Using collapse::group, Fast Hash-Based Grouping:
library(collapse)
d = data.frame(site = rep(c(1,8,4), each = 3))
settransform(d, number = group(site)) # settransform updates data by reference. See also ftransform.
d
# site number
# 1 1 1
# 2 1 1
# 3 1 1
# 4 8 2
# 5 8 2
# 6 8 2
# 7 4 3
# 8 4 3
# 9 4 3
The collapse functions are considerably faster on larger data. Here I compare with two common base idioms (factor / as.integer; match / unique), and two data.table methods (.GRP; frank), using 1e5 groups with 1e3 rows each.
library(data.table)
library(microbenchmark)
nr = 1e3
ng = 1e5
set.seed(1)
d1 = data.table(g = sample(1:ng, nr*ng, replace = TRUE))
d2 = copy(d1)
d3 = copy(d1)
d4 = copy(d1)
d5 = copy(d1)
microbenchmark(
factor = {d1[ , gi := as.integer(factor(g))]},
unique_match = {d2[, gi := match(g, unique(g))]},
frank = {d3[, gi := frank(g, ties.method = "dense")]},
GRP = {d4[, gi := .GRP, by = g]},
collap = {settransform(d5, gi = group(g))},
times = 20L)
# Unit: milliseconds
# expr min lq mean median uq max
# factor 46648.6099 48493.0146 49918.956 49336.208 51547.0789 53585.472
# unique_match 12662.0978 13057.3210 13534.391 13530.457 13998.0141 14407.036
# frank 2628.4923 2695.7064 3240.522 2833.950 3797.5579 5547.227
# GRP 2754.2153 3283.2444 3796.109 3717.239 4184.5174 5117.918
# collap 640.1993 668.2301 729.351 698.307 753.2932 1086.592
# Check equality with data.table .GRP. Use as.vector to remove attributes
all.equal(d4$gi, as.vector(d5$gi))
# [1] TRUE
Note: group, .GRP and match / unique all create the group number according to order of appearance of the different values in the original data (also discussed in previous posts).
If you want to keep your existing columns and assign back to the same data frame...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id())
And you can do multiple columns like so...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id()) %>%
group_by(state) %>%
mutate(state_id = cur_group_id()) %>%
group_by(name) %>%
mutate(name_id = cur_group_id())
If the numbers of the site column were unordered, we could use as_factor() in combination with fct_inorder() from the forcats package:
library(tibble)
library(dplyr)
library(forcats)
all_data_unordered <- tibble(site = c(1,1,1,8,8,8,4,4,4),
score = c(10,11,12,10,11,11,9,8,7))
all_data_unordered |>
mutate(number = as_factor(site) |> fct_inorder() |> as.integer())
#> # A tibble: 9 × 3
#> site score number
#> <dbl> <dbl> <int>
#> 1 1 10 1
#> 2 1 11 1
#> 3 1 12 1
#> 4 8 10 2
#> 5 8 11 2
#> 6 8 11 2
#> 7 4 9 3
#> 8 4 8 3
#> 9 4 7 3
Created on 2021-11-05 by the reprex package (v2.0.1)
Since dplyr 1.1.0, another option is consecutive_id:
library(dplyr)
df %>%
mutate(id = consecutive_id(site))
# site score id
# 1 1 10 1
# 2 1 11 1
# 3 1 12 1
# 4 4 10 2
# 5 4 11 2
# 6 4 11 2
# 7 8 9 3
# 8 8 8 3
# 9 8 7 3
Note that consecutive_id, like data.table::rleid but unlike cur_group_id or as.numeric(factor(.)) will return an ID for consecutive values, meaning that if the same value appears not consecutively, it'll create a new id.
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -9L))
df %>%
mutate(cons_id = consecutive_id(site)) %>%
group_by(site) %>%
mutate(cur_group_id = cur_group_id())
# site cons_id cur_group_id
# 1 1 1 1
# 2 1 1 1
# 3 1 1 1
# 4 4 2 2
# 5 4 2 2
# 6 4 2 2
# 7 1 3 1
# 8 1 3 1
# 9 1 3 1

Merging columns while ignoring NAs

I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1 cat.2 and cat.3. It is a little complicated for me since there are NAs.
I need to have only one cat variable and even some columns have NA, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
With dplyr:
library(dplyr)
df %>%
mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
select(-cat.1, -cat.2, -cat.3)
An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
-output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since values across rows are unique, mean of the rows will return the same unique value, can also go with max or min.
Here is a base R solution which uses apply:
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))

Grouped pivot_longer dplyr

This is an example dataframe. My real dataframe is larger. I highly prefer a tidyverse solution.
#my data
age <- c(18,18,19)
A1 <- c(3,5,3)
A2 <- c(4,4,3)
B1 <- c(1,5,2)
B2 <- c(2,2,5)
df <- data.frame(age, A1, A2, B1, B2)
I want my data to look like this:
#what i want
new_age <- c(18,18,18,18,19,19)
A <- c(3,5,4,4,3,3)
B <- c(1,5,2,2,2,5)
new_df <- data.frame(new_age, A, B)
I want to pivot longer and stack columns A1:A2 into column A, and B1:B2 into B. I also want to have the responses to match the correct age. For example, the 19 year old person in this example has only responded with 3's in columns A1:A2.
tidyr::pivot_longer(df, cols = -age, names_to = c(".value",'groupid'),
#1+ non digits followed by 1+ digits
names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 4
age groupid A B
<dbl> <chr> <dbl> <dbl>
1 18 1 3 1
2 18 2 4 2
3 18 1 5 5
4 18 2 4 2
5 19 1 3 2
6 19 2 3 5
in Base R you will use reshape then select the columns you want. You can change the row names also
reshape(df,2:ncol(df),dir = "long",sep="")[,-c(2,5)] #
age A B
1.1 18 3 1
2.1 18 5 5
3.1 19 3 2
1.2 18 4 2
2.2 18 4 2
3.2 19 3 5
As you have a larger dataframe, maybe a solution with data.table will be faster. Here, you can use melt function from data.table package as follow:
library(data.table)
colA = grep("A",colnames(df),value = TRUE)
colB = grep("B",colnames(df),value = TRUE)
setDT(df)
df <- melt(df, measure = list(colA,colB), value.name = c("A","B"))
df[,variable := NULL]
dt <- dt[order(age)]
age A B
1: 18 3 1
2: 18 5 5
3: 18 4 2
4: 18 4 2
5: 19 3 2
6: 19 3 5
Does it answer your question ?
EDIT: Using patterns - suggestion from #Wimpel
As #Wimpel suggested it in comments, you can get the same result using patterns:
melt( setDT(df), measure.vars = patterns( A="^A[0-9]", B="^B[0-9]") )[, variable:=NULL][]
age A B
1: 18 3 1
2: 18 5 5
3: 19 3 2
4: 18 4 2
5: 18 4 2
6: 19 3 5

Creating index variable in R data frames [duplicate]

I have a data frame (all_data) in which I have a list of sites (1... to n) and their scores e.g.
site score
1 10
1 11
1 12
4 10
4 11
4 11
8 9
8 8
8 7
I want create a column that numbers each level of site in numerical order, like a counter. In the example, the sites (1, 4, and 8) would be have a corresponding counter from 1 to 3 in the 'number' column:
site score number
1 10 1
1 11 1
1 12 1
4 10 2
4 11 2
4 11 2
8 9 3
8 8 3
8 7 3
I am sure this must be easily solved, but I have not found a way yet.
Try Data$number <- as.numeric(as.factor(Data$site))
On a sidenote : the difference between the solution of me and #Chase on one hand, and the one of #DWin on the other, is the ordering of the numbers. Both as.factor and factor will automatically sort the levels, whereas that doesn't happen in the solution of #DWin :
Dat <- data.frame(site = rep(c(1,8,4), each = 3), score = runif(9))
Dat$number <- as.numeric(factor(Dat$site))
Dat$sitenum <- match(Dat$site, unique(Dat$site) )
Gives
> Dat
site score number sitenum
1 1 0.7377561 1 1
2 1 0.3131139 1 1
3 1 0.7862290 1 1
4 8 0.4480387 3 2
5 8 0.3873210 3 2
6 8 0.8778102 3 2
7 4 0.6916340 2 3
8 4 0.3033787 2 3
9 4 0.6552808 2 3
In the new dplyr 1.0.0 we can use cur_group_id() which gives a unique numeric identifier to a group.
library(dplyr)
df %>% group_by(site) %>% mutate(number = cur_group_id())
# site score number
# <int> <int> <int>
#1 1 10 1
#2 1 11 1
#3 1 12 1
#4 4 10 2
#5 4 11 2
#6 4 11 2
#7 8 9 3
#8 8 8 3
#9 8 7 3
data
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 8L, 8L, 8L),
score = c(10L, 11L, 12L, 10L, 11L, 11L, 9L, 8L, 7L)),
class = "data.frame", row.names = c(NA, -9L))
Two other options:
1) Using the .GRP function from the data.table package:
library(data.table)
setDT(dat)[, num := .GRP, by = site]
with the example dataset from below this results in:
> dat
site score num
1: 1 0.14945795 1
2: 1 0.60035697 1
3: 1 0.94643075 1
4: 8 0.68835336 2
5: 8 0.50553372 2
6: 8 0.37293624 2
7: 4 0.33580504 3
8: 4 0.04825135 3
9: 4 0.61894754 3
10: 8 0.96144729 2
11: 8 0.65496051 2
12: 8 0.51029199 2
2) Using the group_indices function from dplyr:
dat$num <- group_indices(dat, site)
or when you want to work around non-standard evaluation:
library(dplyr)
dat %>%
mutate(num = group_indices_(dat, .dots = c('site')))
which results in:
site score num
1 1 0.42480366 1
2 1 0.98736177 1
3 1 0.35766187 1
4 8 0.06243182 3
5 8 0.55617002 3
6 8 0.20304632 3
7 4 0.90855921 2
8 4 0.25215078 2
9 4 0.44981251 2
10 8 0.60288270 3
11 8 0.46946587 3
12 8 0.44941782 3
As can be seen, dplyr gives a different order of the group numbers.
If you want another number every time the group changes, there are several other options:
1) with base R:
# option 1:
dat$num <- cumsum(c(TRUE, head(dat$site, -1) != tail(dat$site, -1)))
# option 2:
x <- rle(dat$site)$lengths
dat$num <- rep(seq_along(x), times=x)
2) with the data.table package:
library(data.table)
setDT(dat)[, num := rleid(site)]
which all result in:
> dat
site score num
1 1 0.80817855 1
2 1 0.07881334 1
3 1 0.60092828 1
4 8 0.71477988 2
5 8 0.51384565 2
6 8 0.72011650 2
7 4 0.74994627 3
8 4 0.09564052 3
9 4 0.39782587 3
10 8 0.29446540 4
11 8 0.61725367 4
12 8 0.97427413 4
Used data:
dat <- data.frame(site = rep(c(1,8,4,8), each = 3), score = runif(12))
This should be fairly efficient and understandable:
Dat$sitenum <- match(Dat$site, unique(Dat$site))
Using the data from #Jaap, a different dplyr possibility using dense_rank() could be:
dat %>%
mutate(ID = dense_rank(site))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 8 0.1150771 3
5 8 0.9978203 3
6 8 0.7781222 3
7 4 0.4081830 2
8 4 0.2782333 2
9 4 0.9566959 2
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or a rleid()-like dplyr approach, with the data arranged first:
dat %>%
arrange(site) %>%
mutate(ID = with(rle(site), rep(seq_along(lengths), lengths)))
site score ID
1 1 0.1884490 1
2 1 0.1087422 1
3 1 0.7438149 1
4 4 0.4081830 2
5 4 0.2782333 2
6 4 0.9566959 2
7 8 0.1150771 3
8 8 0.9978203 3
9 8 0.7781222 3
10 8 0.2545320 3
11 8 0.1201062 3
12 8 0.5449901 3
Or using duplicated() and cumsum():
df %>%
mutate(ID = cumsum(!duplicated(site)))
The same with base R:
df$ID <- with(rle(df$site), rep(seq_along(lengths), lengths))
Or:
df$ID <- cumsum(!duplicated(df$site))
You can turn site into a factor and then return the numeric or integer values of that factor:
dat <- data.frame(site = rep(c(1,4,8), each = 3), score = runif(9))
dat$number <- as.integer(factor(dat$site))
dat
site score number
1 1 0.5305773 1
2 1 0.9367732 1
3 1 0.1831554 1
4 4 0.4068128 2
5 4 0.3438962 2
6 4 0.8123883 2
7 8 0.9122846 3
8 8 0.2949260 3
9 8 0.6771526 3
Another solution using the data.table package.
Example with the more complete datset provided by Jaap:
setDT(dat)[, number := frank(site, ties.method = "dense")]
dat
site score number
1: 1 0.3107920 1
2: 1 0.3640102 1
3: 1 0.1715318 1
4: 8 0.7247535 3
5: 8 0.1263025 3
6: 8 0.4657868 3
7: 4 0.6915818 2
8: 4 0.3558270 2
9: 4 0.3376173 2
10: 8 0.7934963 3
11: 8 0.9641918 3
12: 8 0.9832120 3
Another way to do it. That I think is easy to get even when you know little about R:
library(dplyr)
df <- data.frame('site' = c(1, 1, 1, 4, 4, 4, 8, 8, 8))
df <- mutate(df, 'number' = cumsum(site != lag(site, default=-1)))
I too recently needed a solution to this. Didn't find this thread, started mine and was re-directed here (thank you). Good to see many solutions but to me (and I feel is good practice), a scalable solution is important. Hence, benchmarked several solutions below.
df <- data.table(country = rep(c('a', 'b', 'b', 'c', 'c', 'c'), 1e7)
)
a <-
microbenchmark(factor = {df[, group_id := as.integer(factor(country))]}
, unique_match = df[, group_id := match(country, unique(country))]
, rle = df[ , group_id := with(rle(country), rep(seq_along(lengths), lengths))]
, dup_cumsum = df[, group_id := cumsum(!duplicated(country))]
, frank = df[, group_id := frank(country, ties.method = "dense")]
, GRP = df[, group_id := .GRP, country]
, rleid = df[, group_id := rleid(country)]
, cumsum_head_tail = df[, group_id := cumsum(c(TRUE, head(country, -1) != tail(country, -1)))]
, times = 50
)
autoplot(a)
It would appear the podium is held by data.table.
Still, was great to learn of alternatives e.g. cumsum(!duplicated(country)). What a brainteaser!
Using collapse::group, Fast Hash-Based Grouping:
library(collapse)
d = data.frame(site = rep(c(1,8,4), each = 3))
settransform(d, number = group(site)) # settransform updates data by reference. See also ftransform.
d
# site number
# 1 1 1
# 2 1 1
# 3 1 1
# 4 8 2
# 5 8 2
# 6 8 2
# 7 4 3
# 8 4 3
# 9 4 3
The collapse functions are considerably faster on larger data. Here I compare with two common base idioms (factor / as.integer; match / unique), and two data.table methods (.GRP; frank), using 1e5 groups with 1e3 rows each.
library(data.table)
library(microbenchmark)
nr = 1e3
ng = 1e5
set.seed(1)
d1 = data.table(g = sample(1:ng, nr*ng, replace = TRUE))
d2 = copy(d1)
d3 = copy(d1)
d4 = copy(d1)
d5 = copy(d1)
microbenchmark(
factor = {d1[ , gi := as.integer(factor(g))]},
unique_match = {d2[, gi := match(g, unique(g))]},
frank = {d3[, gi := frank(g, ties.method = "dense")]},
GRP = {d4[, gi := .GRP, by = g]},
collap = {settransform(d5, gi = group(g))},
times = 20L)
# Unit: milliseconds
# expr min lq mean median uq max
# factor 46648.6099 48493.0146 49918.956 49336.208 51547.0789 53585.472
# unique_match 12662.0978 13057.3210 13534.391 13530.457 13998.0141 14407.036
# frank 2628.4923 2695.7064 3240.522 2833.950 3797.5579 5547.227
# GRP 2754.2153 3283.2444 3796.109 3717.239 4184.5174 5117.918
# collap 640.1993 668.2301 729.351 698.307 753.2932 1086.592
# Check equality with data.table .GRP. Use as.vector to remove attributes
all.equal(d4$gi, as.vector(d5$gi))
# [1] TRUE
Note: group, .GRP and match / unique all create the group number according to order of appearance of the different values in the original data (also discussed in previous posts).
If you want to keep your existing columns and assign back to the same data frame...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id())
And you can do multiple columns like so...
my_df <- my_df %>%
select(everything()) %>%
group_by(geo) %>%
mutate(geo_id = cur_group_id()) %>%
group_by(state) %>%
mutate(state_id = cur_group_id()) %>%
group_by(name) %>%
mutate(name_id = cur_group_id())
If the numbers of the site column were unordered, we could use as_factor() in combination with fct_inorder() from the forcats package:
library(tibble)
library(dplyr)
library(forcats)
all_data_unordered <- tibble(site = c(1,1,1,8,8,8,4,4,4),
score = c(10,11,12,10,11,11,9,8,7))
all_data_unordered |>
mutate(number = as_factor(site) |> fct_inorder() |> as.integer())
#> # A tibble: 9 × 3
#> site score number
#> <dbl> <dbl> <int>
#> 1 1 10 1
#> 2 1 11 1
#> 3 1 12 1
#> 4 8 10 2
#> 5 8 11 2
#> 6 8 11 2
#> 7 4 9 3
#> 8 4 8 3
#> 9 4 7 3
Created on 2021-11-05 by the reprex package (v2.0.1)
Since dplyr 1.1.0, another option is consecutive_id:
library(dplyr)
df %>%
mutate(id = consecutive_id(site))
# site score id
# 1 1 10 1
# 2 1 11 1
# 3 1 12 1
# 4 4 10 2
# 5 4 11 2
# 6 4 11 2
# 7 8 9 3
# 8 8 8 3
# 9 8 7 3
Note that consecutive_id, like data.table::rleid but unlike cur_group_id or as.numeric(factor(.)) will return an ID for consecutive values, meaning that if the same value appears not consecutively, it'll create a new id.
df <- structure(list(site = c(1L, 1L, 1L, 4L, 4L, 4L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -9L))
df %>%
mutate(cons_id = consecutive_id(site)) %>%
group_by(site) %>%
mutate(cur_group_id = cur_group_id())
# site cons_id cur_group_id
# 1 1 1 1
# 2 1 1 1
# 3 1 1 1
# 4 4 2 2
# 5 4 2 2
# 6 4 2 2
# 7 1 3 1
# 8 1 3 1
# 9 1 3 1

Resources