Divide one column of data frame by condition from another column - r

I have a data frame with 2 columns like this:
cond val
1 5
2 18
2 18
2 18
3 30
3 30
I want to change values in val in this way:
cond val
1 5 # 5 = 5/1 (only "1" in cond column)
2 6 # 6 = 18/3 (there are three "2" in cond column)
2 6
2 6
3 15 # 15 = 30/2
3 15
How to achieve this?

A base R solution:
# method 1:
mydf$val <- ave(mydf$val, mydf$cond, FUN = function(x) x / length(x))
# method 2:
mydf <- transform(mydf, val = ave(val, cond, FUN = function(x) x / length(x)))
which gives:
cond val
1 1 5
2 2 6
3 2 6
4 2 6
5 3 15
6 3 15

Here's the dplyr way:
library(dplyr)
df %>%
group_by(cond) %>%
mutate(val = val / n())
Which gives:
#Source: local data frame [6 x 2]
#Groups: cond [3]
#
# cond val
# (int) (dbl)
#1 1 5
#2 2 6
#3 2 6
#4 2 6
#5 3 15
#6 3 15
The idea is to divide val by the number of observations in the current group (cond), which n() returns.

This seems like an appropriate situation for data.table:
library(data.table)
(dt <- data.table(df)[,val := val / .N, by = cond][])
# cond val
# 1: 1 5
# 2: 2 6
# 3: 2 6
# 4: 2 6
# 5: 3 15
# 6: 3 15
df <- read.table(
text = "cond val
1 5
2 18
2 18
2 18
3 30
3 30",
header = TRUE,
colClasses = "numeric"
)

In base R
df$result = df$val / ave(df$cond, df$cond, FUN = length)
The ave() divides up the cond column by its unique values and takes the length of each subvector, i.e., the denominator you ask for.
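To make the intermediate step visible, here is a minimal sketch that rebuilds the question's data and stores the group sizes in a helper column before dividing. The names demo and n are illustrative and do not come from the answer.

```r
# Rebuild the example data from the question (the name `demo` is illustrative)
demo <- data.frame(cond = c(1, 2, 2, 2, 3, 3),
                   val  = c(5, 18, 18, 18, 30, 30))

# ave() splits cond by its own values and applies length() to each piece,
# returning the group size aligned with the original rows
demo$n <- ave(demo$cond, demo$cond, FUN = length)

# Divide each value by the size of its group
demo$result <- demo$val / demo$n
demo$result
# [1]  5  6  6  6 15 15
```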

Here is a base R answer that will work if cond is an ID variable whose equal values sit in consecutive rows (rle counts runs, not overall occurrences):
# get length of repeats
temp <- rle(df$cond)
temp <- data.frame(cond=temp$values, lengths=temp$lengths)
# merge onto data.frame
df <- merge(df, temp, by="cond")
df$valNew <- df$val / df$lengths
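Because rle() counts runs of consecutive equal values, the approach above depends on equal cond values being adjacent. A possible order-independent variant, sketched here on made-up data where cond is deliberately shuffled, looks the counts up by name in a table() instead. The names demo and counts are illustrative.

```r
# Example data with cond deliberately out of order (hypothetical)
demo <- data.frame(cond = c(2, 1, 2, 3, 2, 3),
                   val  = c(18, 5, 18, 30, 18, 30))

# Count rows per cond value; the table is named by the values of cond
counts <- table(demo$cond)

# Look up each row's count by name, so row order does not matter
demo$valNew <- demo$val / as.vector(counts[as.character(demo$cond)])
demo$valNew
# [1]  6  5  6 15  6 15
```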

Related

Creating new vector that represents the count

I want to create a vector of counts in the following way:
say my vector is
x <- c(1,1,1,1,2)
which represents a categorical variable.
I want a second vector of the form
x1 <- c(4,4,4,4,1)
which represents the count at each level. e.g. 4 occurrences of level 1, and 1 occurrence of level 2.
I have tried
r <- range(x) ; table(factor(x, levels = r[1]:r[2]))
tabulate(factor(x, levels = min(x):max(x)))
table(x)
This uses ave to group by each value. This would likely be better if your vector is definitely an integer type.
x <- c(1,1,1,1,2)
ave(x, x, FUN = length)
[1] 4 4 4 4 1
Equivalents in data.table and dplyr:
library(data.table)
data.table(x)[, n:= .N, by = 'x'][]
x n
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 2 1
library(dplyr)
library(tibble)
tibble::enframe(x, name = NULL)%>%
add_count(value)
##or
x%>%
tibble::enframe(name = NULL)%>%
group_by(value)%>%
mutate(n = n())%>%
ungroup()
# A tibble: 5 x 2
value n
<dbl> <int>
1 1 4
2 1 4
3 1 4
4 1 4
5 2 1
If you do it like this:
x = c(1,1,1,1,2)
x1 = as.vector(table(x)[x])
You obtain the vector you wanted:
[1] 4 4 4 4 1
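Indexing table(x) with x itself works here because the values of x (1, 2) happen to coincide with positions in the table. For values that are not 1, 2, ..., a hedged variant is to look the counts up by name instead. The vector y below is a made-up example.

```r
# Values that are not 1, 2, ... (hypothetical example)
y <- c(10, 10, 20, 10, 30)

# Look counts up by name rather than by position
y1 <- as.vector(table(y)[as.character(y)])
y1
# [1] 3 3 1 3 1
```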
We can use fct_count from forcats which has a sort argument too:
x <- as.factor(x)
forcats::fct_count(x)
# A tibble: 2 x 2
f n
<fct> <int>
1 1 4
2 2 1
We can use tabulate or table along with rep (this assumes x is sorted, since the counts come out grouped by level):
x1 <- tabulate(x)
rep(x1,x1)
#[1] 4 4 4 4 1
x1 <- table(x)
as.integer(rep(x1, x1))
#[1] 4 4 4 4 1
An option with tapply from base R.
v1 <- tapply(x, x, FUN = length)
rep(as.integer(v1), v1)
#[1] 4 4 4 4 1
NB: It is a dupe

Adding sequence numbers based on time R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 2
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
I appreciate any help.
Some dplyr alternatives, using convenience functions row_number and n.
library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.
getanID(data = df, id.vars = "personid")
# personid date measurement .id
# 1: 1 x 23 1
# 2: 1 x 32 2
# 3: 2 y 21 1
# 4: 3 x 23 1
# 5: 3 z 23 2
# 6: 3 y 23 3
The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header=TRUE)
## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3
## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
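To attach the result as a column, one option is sketched below. It passes seq_along() of the rows as the first argument to ave(), which should keep the result an integer even when the grouping column is not numeric. The data frame name people is illustrative.

```r
# Rebuild the question's data (the name `people` is illustrative)
people <- data.frame(personid    = c(1, 1, 2, 3, 3, 3),
                     date        = c("x", "x", "y", "x", "z", "y"),
                     measurement = c(23, 32, 21, 23, 23, 23))

# Number the rows within each personid, in their current order
people$id <- ave(seq_along(people$personid), people$personid, FUN = seq_along)
people$id
# [1] 1 2 1 1 2 3
```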
Using data.table, and assuming you wish to order by date within the personid subset
library(data.table)
DT <- data.table(Data)
DT[, id := rank(date, ties.method = "first"), by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 3
## 6: 3 y 23 2
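When numbering rows by date, note that order() and rank() answer different questions: order() returns the row indices that would sort the vector, while rank() returns each element's position in the sorted result. The two coincide for the small groups shown above, but not in general. A short sketch on a made-up vector:

```r
# Made-up dates that are not already sorted
dates <- c("c", "a", "b")

# Indices that would sort the vector: "a" is at 2, "b" at 3, "c" at 1
order(dates)
# [1] 2 3 1

# Position of each element in the sorted vector, which is what an id needs
rank(dates, ties.method = "first")
# [1] 3 1 2
```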
If you do not wish to order by date
DT[, id := 1:.N, by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 2
## 6: 3 y 23 3
Any of the following would also work
DT[, id := seq_along(measurement), by = personid]
DT[, id := seq_along(date), by = personid]
The equivalent commands using plyr
library(plyr)
# ordering by date
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))
I think there's a canned command for this, but I can't remember it. So here's one way:
> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
[1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
[1] 1 1 2 2 3 4 5 6 7 8
This works because duplicated returns a logical vector. cumsum evaluates numeric vectors, so the logical gets coerced to numeric.
You can store the result to your data.frame as a new column if you want:
dat$id <- cumsum(duplicated(test))+1
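The coercion can be followed step by step on a fixed vector (made up here so that the output is reproducible, unlike sample()). Note that the running count spans the whole vector; it does not restart for each distinct value.

```r
# A fixed example vector instead of a random sample
test <- c("a", "b", "a", "c", "b", "a")

# TRUE wherever a value has been seen before
dup <- duplicated(test)
dup
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# cumsum() treats FALSE as 0 and TRUE as 1
cumsum(dup)
# [1] 0 0 1 1 2 3
```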
Assuming your data are in a data.frame named Data, this will do the trick:
# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() counts the rows for each personid (personid must hold positive integers)
# sequence() creates the vector 1:n for each count n in its input,
# and concatenates the results
Data$id <- sequence(tabulate(Data$personid))
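The two steps can be inspected separately. This is a small sketch on a made-up, already sorted vector of ids; the names ids and sizes are illustrative.

```r
# Sorted ids, as in the question's personid column
ids <- c(1, 1, 2, 3, 3, 3)

# Number of rows for id 1, 2, 3
sizes <- tabulate(ids)
sizes
# [1] 2 1 3

# 1:2, 1:1 and 1:3 concatenated
sequence(sizes)
# [1] 1 2 1 1 2 3
```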
You can use sqldf
df<-read.table(header=T,text="personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")
library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
FROM df a, df b
WHERE a.personid = b.personid AND b.ROWID <= a.ROWID
GROUP BY a.ROWID"
)
# personid date measurement count
#1 1 x 23 1
#2 1 x 32 2
#3 2 y 21 1
#4 3 x 23 1
#5 3 z 23 2
#6 3 y 23 3

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes no of distinct var1 values so far i.e. until that point in time for each group grp . This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer -
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum to it -
df = df %>%
arrange(time) %>%
group_by(grp,var1) %>%
mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
group_by(grp) %>%
mutate(var2 = cumsum(var_temp)) %>%
select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
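The same first-occurrence logic can be sketched in base R, assuming the rows are already sorted by time. Here duplicated() on the (grp, var1) pair marks first occurrences and ave() restarts the running total for each grp. The names demo and first_seen are illustrative.

```r
# Rebuild the question's data (the name `demo` is illustrative)
demo <- data.frame(time = 1:20,
                   grp  = sort(rep(1:5, 4)),
                   var1 = rep(c("A", "B"), 10))

# 1 the first time a (grp, var1) pair appears, 0 otherwise
first_seen <- as.integer(!duplicated(demo[c("grp", "var1")]))

# Running total of first occurrences, restarted for each grp
demo$var2 <- ave(first_seen, demo$grp, FUN = cumsum)
head(demo$var2, 10)
# [1] 1 2 2 2 1 2 2 2 1 2
```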
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
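Then a base solution that uses ave to create groups and applies our function to each group (note: ave expects a numeric vector, so var1 is passed through factor() first, which also covers the case where var1 is a character column):
transform(df, var2=ave(as.integer(factor(var1)), grp, FUN=dist_cum))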
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
With your new dataset, an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp),
function(x) {x$var2 <-0
indx <- match(unique(x$var1), x$var1)
x$var2[indx] <- 1
cumsum(x$var2) }))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
# Given a vector x, returns a corresponding vector y
# where the ith element of y gives the number of unique
# elements observed up to and including index i
# if na.include = TRUE (default) NA is counted as an
# additional unique element, otherwise it's essentially ignored
temp <- data.table(x, idx = seq_along(x))
firsts <- temp[temp[, .I[1L], by = x]$V1]
if(na.include == FALSE) firsts <- firsts[!is.na(x)]
y <- rep(0, times = length(x))
y[firsts$idx] <- 1
y <- cumsum(y)
return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
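For readers without data.table, a base R sketch of a function with the same interface might look like the following. The name cum_n_distinct_base is hypothetical, and the sketch relies on duplicated() treating repeated NA values as duplicates of each other.

```r
# Hypothetical base R counterpart of cum_n_distinct()
cum_n_distinct_base <- function(x, na.include = TRUE) {
  # TRUE at the first occurrence of each distinct value (NA included)
  first <- !duplicated(x)
  # Optionally stop NA from counting as a distinct value
  if (!na.include) first <- first & !is.na(x)
  # Running count of distinct values seen so far
  cumsum(first)
}

cum_n_distinct_base(c(5, 10, 10, 15, 5))                     # 1 2 2 3 3
cum_n_distinct_base(c(5, NA, 10, 15, 5))                     # 1 2 3 4 4
cum_n_distinct_base(c(5, NA, 10, 15, 5), na.include = FALSE) # 1 1 2 3 3
```

It can be dropped into the same group_by(grp) %>% mutate(...) call as the function above.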

Resources