In the following dataset:
Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A
How can I assign numbering to the variable Name in the following order using R:
Day Place Name Number
22 X A 1
22 X A 1
22 X B 2
22 X A 1
22 Y C 1
22 Y C 1
22 Y D 2
23 X B 1
23 X A 2
In a nutshell, I need to number the names according to their order of occurrence on a given day and at a given place.
In base R using tapply:
dat$Number <-
  unlist(tapply(dat$Name, paste(dat$Day, dat$Place),
                FUN = function(x) {
                  y <- as.character(x)
                  as.integer(factor(y, levels = unique(y)))
                }))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
Idea:
Group by Day and Place using tapply.
For each group, coerce Name to a factor whose levels follow the order of first occurrence (see the sketch below).
Coerce that factor to integer to get the final numbering.
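To make the factor step concrete, here is a minimal sketch using one group's Name values from the question:
y <- c("A", "A", "B", "A")
factor(y, levels = unique(y))              # levels ordered by first appearance: A, B
as.integer(factor(y, levels = unique(y)))  # 1 1 2 1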
Using data.table (syntactic sugar):
library(data.table)
setDT(dat)[, Number := {
  y <- as.character(Name)
  as.integer(factor(y, levels = unique(y)))
}, by = "Day,Place"]
Day Place Name Number
1: 22 X A 1
2: 22 X A 1
3: 22 X B 2
4: 22 Y C 1
5: 22 Y C 1
6: 22 Y D 2
7: 23 X B 1
8: 23 X A 2
# idx() numbers runs of consecutive values: the first run gets 1, the next run 2, and so on
idx <- function(x) cumsum(c(TRUE, tail(x, -1) != head(x, -1)))
# compute run indices for Name, then renumber them within each Day/Place group via ave()
transform(dat, Number = ave(idx(Name), Day, Place, FUN = idx))
# Day Place Name Number
# 1 22 X A 1
# 2 22 X A 1
# 3 22 X B 2
# 4 22 Y C 1
# 5 22 Y C 1
# 6 22 Y D 2
# 7 23 X B 1
# 8 23 X A 2
Use ddply from plyr.
dfr <- read.table(header = TRUE, text = "Day Place Name
22 X A
22 X A
22 X B
22 X A
22 Y C
22 Y C
22 Y D
23 X B
23 X A")
library(plyr)
ddply(
dfr,
.(Day, Place),
mutate,
Number = as.integer(factor(Name, levels = unique(Name)))
)
Or use dplyr, in a variant of beginneR's deleted answer.
library(dplyr)
dfr %>%
group_by(Day, Place) %>%
mutate(Number = as.integer(factor(Name, levels = unique(Name))))
Related
I am trying to conditionally subset data.frames in a list of data.frames based on value in a vector. Basically, whenever a > 0 I would like to subset the corresponding list element to have that many randomly-sampled rows.
# a list
l <- list( data.frame(x=1:5, y = 1:5),
data.frame(x= 11:15, y = 11:15),
data.frame(x=21:25, y = 21:25) )
# a vector
a <- c(3, 1,-2)
# one possible permutation of the desired output
[[1]]
x y
1 1 1
2 3 3
3 5 5
[[2]]
x y
1 13 13
[[3]]
x y
1 21 21
2 22 22
3 23 23
4 24 24
5 25 25
I have been trying to do this with purrr::map_if() as follows, but
my function only uses the first value of a as the number of rows for all of the data.frames. That is, the first and second elements of the list are subset to 3 rows, but I'd like the second element to have just 1 row.
f <- function(x, count) {x[sample(nrow(x), count),]}
purrr::map_if(l, a > 0, f, count = a)
Is there a way to pass the value in 'a' for each iteration of map_if()?
Or some other solution?
A base R one with Map + ifelse
> Map(function(x, k) x[sample(nrow(x), ifelse(k > 0, k, nrow(x))), ], l, a)
[[1]]
x y
3 3 3
4 4 4
5 5 5
[[2]]
x y
2 12 12
[[3]]
x y
2 22 22
1 21 21
5 25 25
3 23 23
4 24 24
You could use the following solution. Here you actually need purrr::map2, base::mapply, or base::Map, since you have to iterate over two vectors or lists in parallel.
library(dplyr)
library(purrr)
map2(a, l, ~ if(.x > 0) {
.y %>%
slice_sample(n = .x)
} else {
.y
})
[[1]]
x y
1 2 2
2 4 4
3 3 3
[[2]]
x y
1 11 11
[[3]]
x y
1 21 21
2 22 22
3 23 23
4 24 24
5 25 25
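For completeness, the base::mapply route mentioned above would look roughly like this (a sketch of my own, not part of the original answer):
# iterate over a and l in parallel; keep all rows when the count is not positive
mapply(function(k, d) if (k > 0) d[sample(nrow(d), k), ] else d,
       a, l, SIMPLIFY = FALSE)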
library(tidyverse)
# a list
l <- list( data.frame(x=1:5, y = 1:5),
data.frame(x= 11:15, y = 11:15),
data.frame(x=21:25, y = 21:25) )
# a vector
a <- c(3, 1, -2)
map2(
.x = l,
.y = a,
.f = ~sample_n(tbl = .x, size = ifelse(.y > nrow(.x) | .y < 0, nrow(.x), .y))
)
#> [[1]]
#> x y
#> 1 4 4
#> 2 2 2
#> 3 1 1
#>
#> [[2]]
#> x y
#> 1 13 13
#>
#> [[3]]
#> x y
#> 1 24 24
#> 2 21 21
#> 3 23 23
#> 4 22 22
#> 5 25 25
Created on 2021-09-10 by the reprex package (v2.0.1)
Here is my data.
dat<-read.table(text=" MP1 MP2 MP3 N1 N2 N3 WP1 WP2 WP3
A A A Y Y Y 10 11 11
A B A Y Y Y 10 11 11
B B A Y Y Y 10 10 11
A B A Y Y Y 11 11 10
B B A Y Y Y 10 10 11
B B A N Y Y 11 10 10
B C A Y Y Y 11 11 11
C C B Y Y N 10 11 10
B C B Y Y Y 11 11 11
B C B Y N Y 10 11 11
",header=TRUE)
I want to get the following table. That is, I want three columns instead of nine, named as follows:
MP N WP
A Y 10
A Y 10
B Y 10
A Y 11
B Y 10
B N 11
B Y 11
C Y 10
B Y 11
B Y 10
A Y 11
B Y 11
B Y 10
B Y 11
B Y 10
B Y 10
C Y 11
C Y 11
C Y 11
C N 11
A Y 11
A Y 11
A Y 11
A Y 10
A Y 11
A Y 10
A Y 11
B N 10
B Y 11
B Y 11
I have tried this:
dat1 <- data.frame(MP=unlist(dat, use.names = FALSE))
But I am not sure why it does not work. I also used
dat2 <- data.frame(MP = c(dat[,"MP"], dat[,"N"],dat[,WP])))
Here's another base R approach that preserves the factors:
names(dat) <- c(rep("MP", 3), rep("N", 3), rep("WP", 3))
rdat2 <- rbind(dat[, c(1, 4, 7)], dat[, c(2, 5, 8)], dat[, c(3, 6, 9)])
str(rdat2)
# 'data.frame': 30 obs. of 3 variables:
# $ MP: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 2 2 3 2 2 ...
# $ N : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 2 2 2 2 ...
# $ WP: int 10 10 10 11 10 11 11 10 11 10 ...
An option is pivot_longer. Specify the cols argument as everything() (since we are using all the columns). The split in each column name falls between the uppercase letters and the digits, so we can use a regex lookaround in names_sep to split at that junction:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(cols = everything(), names_to = c( ".value", "grp"),
names_sep="(?<=[A-Z])(?=[0-9])") %>%
select(-grp)
# A tibble: 30 x 3
# MP N WP
# <fct> <fct> <int>
# 1 A Y 10
# 2 A Y 11
# 3 A Y 11
# 4 A Y 10
# 5 B Y 11
# 6 A Y 11
# 7 B Y 10
# 8 B Y 10
# 9 A Y 11
#10 A Y 11
# … with 20 more rows
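As a quick illustration of that lookaround split (my own sketch, not part of the original answer; strsplit with perl = TRUE applies the same regex as names_sep):
strsplit("MP1", "(?<=[A-Z])(?=[0-9])", perl = TRUE)
# [[1]]
# [1] "MP" "1"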
Or with melt from data.table
library(data.table)
melt(setDT(dat), measure = patterns("^MP", "^N", "^WP"),
value.name = c("MP", "N","WP"))[, variable := NULL][]
A quick solution using base R is:
as.data.frame(sapply(c("MP", "N", "WP"), function(x) unlist(dat[grep(x, names(dat))]), simplify = FALSE))
This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 2
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
I appreciate any help.
Some dplyr alternatives, using the convenience functions row_number() and n():
library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.
getanID(data = df, id.vars = "personid")
# personid date measurement .id
# 1: 1 x 23 1
# 2: 1 x 32 2
# 3: 2 y 21 1
# 4: 3 x 23 1
# 5: 3 z 23 2
# 6: 3 y 23 3
The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header=TRUE)
## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3
## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
Using data.table, and assuming you wish to order by date within each personid subset:
library(data.table)
DT <- data.table(Data)
DT[,id := order(date), by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 3
## 6: 3 y 23 2
If you do not wish to order by date:
DT[, id := 1:.N, by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 2
## 6: 3 y 23 3
Any of the following would also work
DT[, id := seq_along(measurement), by = personid]
DT[, id := seq_along(date), by = personid]
The equivalent commands using plyr
library(plyr)
# ordering by date
ddply(Data, .(personid), mutate, id = order(date))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))
I think there's a canned command for this, but I can't remember it. So here's one way:
> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
[1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
[1] 1 1 2 2 3 4 5 6 7 8
This works because duplicated returns a logical vector, and cumsum expects a numeric vector, so the logical values get coerced to 0/1.
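A tiny sketch of that coercion (my own illustration, using a made-up three-element vector):
duplicated(c("a", "a", "b"))           # FALSE  TRUE FALSE
cumsum(duplicated(c("a", "a", "b")))   # 0 1 1  (the logicals are summed as 0/1)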
You can store the result to your data.frame as a new column if you want:
dat$id <- cumsum(duplicated(test))+1
Assuming your data are in a data.frame named Data, this will do the trick:
# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() counts how many rows each personid has
# sequence() creates a 1:n vector for each of those counts
# and concatenates the results
Data$id <- sequence(tabulate(Data$personid))
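A minimal sketch of what those two building blocks return, using the personid values from the example (this approach assumes personid is a positive integer, which tabulate() requires):
tabulate(c(1, 1, 2, 3, 3, 3))  # 2 1 3        (row counts per personid)
sequence(c(2, 1, 3))           # 1 2 1 1 2 3  (a 1:n run for each count, concatenated)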
You can use sqldf
df<-read.table(header=T,text="personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")
library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
FROM df a, df b
WHERE a.personid = b.personid AND b.ROWID <= a.ROWID
GROUP BY a.ROWID"
)
# personid date measurement count
#1 1 x 23 1
#2 1 x 32 2
#3 2 y 21 1
#4 3 x 23 1
#5 3 z 23 2
#6 3 y 23 3