Adding a new column with the number of duplicates in R

I have a column of IDs that contains duplicates. What I want to do is create a new column which, in each row, divides 1 by the number of times that row's ID occurs in the column.
So for example in row 2 the ID is 101, and 101 appears three times in that column. Therefore in a new column I want row 2 to hold 1 divided by the number of occurrences of the 101 ID (1/3), which gives a value of 0.33. How would I go about this? Sorry if this doesn't make sense, happy to clarify.
Thanks so much in advance!

We can use ave:
# create data
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
#    id
# 1   3
# 2   3
# 3   3
# 4   2
# 5   3
# 6   2
# 7   2
# 8   2
# 9   3
# 10  1
# use ave to count instances of id
1/ave(dat$id, dat$id, FUN = length)
# 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00
If your id variable is a character, we can use the seq_along function within our ave call:
dat$id <- as.character(dat$id)
1/ave(seq_along(dat$id), dat$id, FUN = length)
# [1] 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00

In the following code, I assume the data frame is named df.
library(tidyverse)
df <- table(df$ID) %>%
  enframe() %>%
  mutate(value = 1 / value) %>%
  left_join(df, ., by = c("ID" = "name"))
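One caveat: enframe() returns the name column as character, so if ID is numeric the join above will fail with a type mismatch. A guarded version of the same pipeline (a sketch using base type.convert() to restore the original type) might look like:
df <- table(df$ID) %>%
  enframe() %>%
  mutate(value = 1 / value,
         name = type.convert(name, as.is = TRUE)) %>%  # assumption: ID may be numeric
  left_join(df, ., by = c("ID" = "name"))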

Using data.table
library(data.table)
setDT(dat)[, value := 1/.N, id]
data
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
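As a further alternative, a short dplyr sketch of the same idea (assuming dplyr >= 0.8 for add_count()):
library(dplyr)
dat %>%
  add_count(id) %>%      # adds a column n with the per-id count
  mutate(value = 1 / n)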

Related

Aggregate with output length same as Data Frame length

I need to aggregate a number between two factors, but I need the output of the aggregation to be a vector the same length as the original data frame, rather than a summary table, so I can attach it and eventually output it as an .xlsx report.
data <- data.frame(A = c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","C","C","C","C","C","C"),
                   B = c(1,1,2,2,2,3,1,1,1,1,2,2,2,3,3,1,1,1,1,2),
                   X = c(0.17,0.15,0.30,0.36,0.47,0.43,0.50,0.38,0.38,0.47,0.40,0.29,0.46,0.14,0.03,0.34,0.42,0.35,0.19,0.27))
I need to sum X grouped both by A alone and by each unique combination of A and B, and append the results to the data frame as new columns.
I'm aware of the aggregate function, which calculates the quantities I need, but it outputs them in a summary-table format that I can't then append to the data frame.
So far this is the only method I've come up with. It takes 10 minutes to run on my actual 13,000-row data frame, it seems very hacky, and it also seems to be causing some other bugs that I'm hoping redoing this bit will solve.
TBL <- as.data.frame(table(data$A, data$B))
colnames(TBL) <- c("A", "B", "Freq")
# TBL contains every unique combination of A and B
for (i in 1:NROW(TBL)) {
  INDEX <- which(data$A == TBL$A[i] & data$B == TBL$B[i])
  data$`X by AB`[INDEX] <- sum(data$X[INDEX])
}
Seems like you need to group by A, and by the combination of A and B, and get the sum of X. With dplyr, we can use two group_by statements with mutate:
library(dplyr)
data %>%
  group_by(A, B) %>%
  mutate(XbyAB = sum(X)) %>%
  group_by(A) %>%
  mutate(XbyA = sum(X))
# # A tibble: 20 x 5
# # Groups:   A [3]
#    A         B     X XbyAB  XbyA
#    <fct> <dbl> <dbl> <dbl> <dbl>
#  1 A         1  0.17  0.32  1.88
#  2 A         1  0.15  0.32  1.88
#  3 A         2  0.3   1.13  1.88
#  4 A         2  0.36  1.13  1.88
#  5 A         2  0.47  1.13  1.88
#  6 A         3  0.43  0.43  1.88
#  7 B         1  0.5   1.73  3.02
#  8 B         1  0.38  1.73  3.02
#  9 B         1  0.38  1.73  3.02
# 10 B         1  0.47  1.73  3.02
# # ... with 10 more rows
Or, in base R, use two ave calls with transform:
transform(data, XbyAB = ave(X, A, B, FUN = sum), XbyA = ave(X, A, FUN = sum))
data.table solution.
library("data.table")
data <- as.data.table(data)
First, let's sum X by A:
data[, .( `X by A`=sum(X) ), by=A]
#    A X by A
# 1: A   1.88
# 2: B   3.02
# 3: C   1.60
We merge this summary data.table with the original one on column A:
data[data[, .( `X by A`=sum(X) ), by=A], on=.(A)]
We can also summarize and then merge on two columns:
data[data[, .( `X by AB`=sum(X) ), by=.(A, B)], on=.(A, B)]
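As a side note, the merge can be avoided entirely by adding the columns by reference with a grouped :=; a quick sketch, using the same data.table as above:
data[, `X by AB` := sum(X), by = .(A, B)]
data[, `X by A`  := sum(X), by = A]
data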
The problem is that, to the uninitiated, the data.table syntax isn't very readable, but I swear by its speed (compared to dplyr and especially base data.frame operations), although the difference shouldn't be very noticeable with 13K rows.

Split columns into rows, and change a specific column value for new rows in R

I have a data frame with multiple columns as follows:
Frequency Alels
0.5 C
0.6 C,G
0.02 A,T,TTT
And I want to split the values of the second column into separate rows, where the new rows get Frequency = 0.
I'm trying with separate() from the tidyr package, but I can't change the Frequency column in the new rows; I get the following result:
Frequency Alels
0.5 C
0.6 C
0.6 G
0.02 A
0.02 T
0.02 TTT
But I want the output as follows:
Frequency Alels
0.5 C
0.6 C
0 G
0.02 A
0 T
0 TTT
This should work:
d <- read.table(text = "Frequency Alels
0.5 C
0.6 C,G",
                header = TRUE, stringsAsFactors = FALSE)
counts <- sapply(strsplit(d$Alels, split = ","), length)
data.frame(Frequency = unlist(lapply(seq_along(d$Frequency),
                                     function(x) c(d$Frequency[x],
                                                   rep(0, counts[x] - 1)))),
           Alels = unlist(strsplit(d$Alels, split = ",")))
Not pretty, but I think it works.
# Create data frame
df <- data.frame(frequency = c(0.5, 0.6),
                 alels = c("C", "C, G, T"),
                 stringsAsFactors = FALSE)
# Duplicate the alels column, separate rows
# Requires magrittr, dplyr, tidyr
# The sep regex also swallows any spaces after the commas
df %<>%
  mutate(alels_check = alels) %>%
  separate_rows(alels, sep = ",\\s*", convert = TRUE)
# Check for dupes of the (frequency, alels_check) pair and set them to zero
df$frequency[duplicated(df[, c("frequency", "alels_check")])] <- 0
# Remove the duplicated alels column
df %<>% select(-alels_check)
Original:
# frequency alels
# 1 0.5 C
# 2 0.6 C, G, T
Result:
# frequency alels
# 1 0.5 C
# 2 0.6 C
# 3 0.0 G
# 4 0.0 T
Using your data:
# frequency alels
# 1 0.50 C
# 2 0.60 C, G
# 3 0.02 A, T, TTT
# frequency alels
# 1 0.50 C
# 2 0.60 C
# 3 0.00 G
# 4 0.02 A
# 5 0.00 T
# 6 0.00 TTT
The data from your example:
df <- read.table(text = "Frequency Alels
0.5 C
0.6 C,G
0.02 A,T,TTT",
                 header = TRUE, stringsAsFactors = FALSE)
and another solution for you to consider:
library(dplyr)
res <- lapply(1:nrow(df),
              function(row_num){
                s <- strsplit(df$Alels[row_num], ",") %>% unlist
                data.frame(Frequency = c(df$Frequency[row_num], rep(0, length(s) - 1)),
                           Alels = s)
              }) %>% do.call(rbind, .)
res
Instead of do.call(rbind, .) you can also use rbindlist() from the data.table package.
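Since the question mentions separate() from tidyr, here is a sketch of how its row-wise cousin separate_rows() can be combined with a per-row group to zero out the added rows (assuming current dplyr and tidyr):
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%                     # remember which original row each value came from
  separate_rows(Alels, sep = ",") %>%
  group_by(row) %>%
  mutate(Frequency = replace(Frequency, -1, 0)) %>%  # keep the first value, zero the rest
  ungroup() %>%
  select(-row)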

Efficient sampling of factor variable from dataframe subsets

I have a dataframe df1 with five columns, two of which (var1 & var3) I am using to split df1 by, resulting in a list of dataframes ls1.
For each sub-dataframe x in ls1, I want to sample() x$var2, x$num times, with probabilities x$probs, as follows:
Create data:
var1 <- rep(LETTERS[seq( from = 1, to = 3 )], each = 6)
var2 <- rep(LETTERS[seq( from = 1, to = 3 )], 6)
var3 <- rep(1:2,3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))
Have a look at the first couple of list elements:
$A.1
var1 var2 var3 num probs
1 A A 1 10 0.06
2 A B 1 10 0.27
3 A C 1 10 0.23
$B.1
var1 var2 var3 num probs
7 B A 1 13 0.93
8 B B 1 13 0.36
9 B C 1 13 0.04
lapply over ls1:
ls1 <- lapply(ls1, function(x) {
  res <- table(sample(x$var2, size = as.numeric(as.character(x$num))[1],
                      replace = TRUE, prob = as.numeric(as.character(x$probs))))
  res <- as.data.frame(res)
  cbind(x, res = res$Freq)
})
df2 <- do.call("rbind", ls1)
df2
Have a look at the first couple of list elements of the result:
$A.1
var1 var2 var3 num probs res
1 A A 1 10 0.06 2
2 A B 1 10 0.27 4
3 A C 1 10 0.23 4
$B.1
var1 var2 var3 num probs res
7 B A 1 13 0.93 10
8 B B 1 13 0.36 3
9 B C 1 13 0.04 0
So for each dataframe a new variable res is created, the sum of res equals num and the elements of var2 are represented in res in proportions relating to probs. This does what I want but it becomes very slow when there is a lot of data.
My Question: is there a way to replace the lapply piece of code with something more efficient/faster?
I am just beginning to learn about vectorization and am guessing this could be vectorized? but I am unsure of how to achieve it.
ls1 is eventually returned to a dataframe structure so if it doesn't need to become a list to begin with all the better (although it doesn't really matter how the data is structured for this step).
Any help would be much appreciated.
First, you should create df1 using data.frame() rather than converting from a matrix, because the matrix forces all data types to be the same even though you have both numeric and character variables.
df1 <- data.frame(var1, var2, var3, num, probs)
Next, instead of using the sample function, the rmultinom function is much more efficient because it directly outputs the number of draws for each value in x$var2:
ls1 <- lapply(ls1, function(x) {
  x$res <- rmultinom(1, x$num[1], x$probs)
  x
})
This should be noticeably faster than using the sample approach.
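If you want to check the speed claim on your own data, a minimal sketch with the microbenchmark package (assumptions: the package is installed, and df1 was built with data.frame() as above so num and probs are numeric) could look like:
library(microbenchmark)
x <- ls1[[1]]  # one sub-dataframe as a test case
microbenchmark(
  sample    = table(sample(x$var2, size = x$num[1], replace = TRUE, prob = x$probs)),
  rmultinom = rmultinom(1, x$num[1], x$probs)
)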
Rather than splitting your data frame into groups, I would use the {dplyr} package with group_by + mutate:
library(dplyr)
df1 %>%
  # converting via as.character first guards against factor columns from the cbind() construction
  mutate_at(vars(num, probs), function(x) as.numeric(as.character(x))) %>%
  group_by(var1, var3) %>%
  mutate(res = c(rmultinom(1, num[1], probs)))
This should be fast and you can keep the original data structure.

Returning observations that only occur once in a group [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I am trying to group a data.frame by a factor variable, and then return rows of the data.frame that correspond to observations that occur exactly once in each group. For example, consider the following data:
x = matrix(c(1,1,2,2,2,3,4,4,5,4), nrow = 5, ncol = 2, byrow = F)
x = data.frame(x)
x
# X1 X2
# 1 1 3
# 2 1 4
# 3 2 4
# 4 2 5
# 5 2 4
I would like to group the data by the values in column 1, then return the rows for which the value in column 2 occurs only once within a group. Here, the function would return the first, second, and fourth rows.
Desired output
# X1 X2
# 1 1 3
# 2 1 4
# 4 2 5
I am looking to apply this to a dataset with over 1 million rows.
In base R, you can try ave:
x[with(x, ave(X2, X1, X2, FUN = length)) == 1, ]
# X1 X2
# 1 1 3
# 2 1 4
# 4 2 5
Because ave scales very poorly when there are multiple groups and multiple grouping variables, you may want to create a new group first:
x[with(x, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ]
The speeds will vary widely according to the nature of your data.
You can also try:
library(dplyr)
x %>%
group_by(X1, X2) %>%
filter(n() == 1)
# Source: local data frame [3 x 2]
# Groups: X1, X2 [3]
#
# X1 X2
# (dbl) (dbl)
# 1 1 3
# 2 1 4
# 3 2 5
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(x), then group by the first column ("X1"): if a group has only one observation, return its row; otherwise drop all duplicated X2 values and return only the unique rows.
library(data.table)
setDT(x)[, if(.N==1) .SD else
.SD[!(duplicated(X2)|duplicated(X2, fromLast=TRUE))], X1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
If we are using both "X1" and "X2" as grouping variables:
setDT(x)[x[, .I[.N==1], .(X1, X2)]$V1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
NOTE: data.table is very fast and compact.
Or without using any group by option, with base R we can do
x[!(duplicated(x)|duplicated(x, fromLast=TRUE)),]
# X1 X2
#1 1 3
#2 1 4
#4 2 5
Or with tally from dplyr
library(dplyr)
x %>%
group_by_(.dots= names(x)) %>%
tally() %>%
filter(n==1) %>%
select(-n)
Note that this should be faster than the other dplyr solution.
Benchmarks
library(data.table)
library(dplyr)
Sample data
set.seed(24)
x1 <- data.frame(X1 = sample(1:5000, 1e6, replace = TRUE),
                 X2 = sample(1:10000, 1e6, replace = TRUE))
x2 <- copy(as.data.table(x1))
Base R approaches
system.time(x1[with(x1, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ])
# user system elapsed
# 20.245 0.002 20.280
system.time(x1[!(duplicated(x1)|duplicated(x1, fromLast=TRUE)), ])
# user system elapsed
# 1.994 0.000 1.998
dplyr approaches
system.time(x1 %>% group_by(X1, X2) %>% filter(n() == 1))
# user system elapsed
# 33.400 0.006 33.467
system.time(x1 %>% group_by_(.dots= names(x2)) %>% tally() %>% filter(n==1) %>% select(-n))
# user system elapsed
# 2.331 0.000 2.333
data.table approaches
system.time(x2[x2[, .I[.N==1], list(X1, X2)]$V1])
# user system elapsed
# 1.128 0.001 1.131
system.time(x2[, .N, by = list(X1, X2)][N == 1][, N := NULL][])
# user system elapsed
# 0.320 0.000 0.323
Summary: The "data.table" approaches win hands down, but if you're unable to use the package for some reason, using duplicated from base R also performs quite well.
With base R, something like
do.call(rbind, lapply(split(x, x$X1),
                      function(y) y[y$X2 %in% names(which(table(y$X2) == 1)), ]))
# X1 X2
# 1.1 1 3
# 1.2 1 4
# 2 2 5
where split splits x into a list of data.frames by the levels of X1, and then lapply subsets each piece to the rows whose X2 value occurs exactly once, as tabulated by table. do.call(rbind, ...) then reassembles the resulting data.frames into a single one. (Note the subset must map the table counts back to the rows via %in%; comparing table(y$X2) == 1 directly would recycle a logical vector of the wrong length.)
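For completeness, a sketch of yet another dplyr route using add_count() (dplyr >= 0.8), which avoids explicit grouping:
library(dplyr)
x %>%
  add_count(X1, X2) %>%  # n = number of rows per (X1, X2) combination
  filter(n == 1) %>%
  select(-n)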

Calculating summaries of multiple columns multiplied by a weight column in R, with loops

I have a large data table with over 300 columns. For each letter column I would like to get:
-- the sum of (each observation in the column * the weight of that observation);
-- the sum of the weights for observations in the letter column that are greater than 0.
Here I provide an example for one column.
id <- c("0001", "0002", "0003", "0004")
a <- c(0, 9, 8, 5)
b <- c(0,5,5,0)
c <- c(1.5, 0.55, 0, 0.06)
weight <- c(102.354, 34.998, 84.664, .657)
data <- data.frame(id, a, b, c, weight)
data
id a b c weight
1 0001 0 0 1.50 102.354
2 0002 9 5 0.55 34.998
3 0003 8 5 0.00 84.664
4 0004 5 0 0.06 0.657
sum(data$a * data$weight)
[1] 995.579
sum(data$weight[data$a >0])
[1] 120.319
Any idea?
A possible data.table solution
You could define a helper function:
tempfunc <- function(x) c(sum(x * data$weight), sum(data$weight[x > 0]))
Then do either
library(data.table)
setDT(data)[, lapply(.SD, tempfunc), .SDcols = -c("id", "weight")]
# a b c
# 1: 995.579 598.310 172.8193
# 2: 120.319 119.662 138.0090
Or
library(dplyr)
setDT(data) %>% summarise_each(funs(tempfunc), -c(id, weight))
## a b c
## 1: 995.579 598.310 172.8193
## 2: 120.319 119.662 138.0090
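Note that summarise_each() and funs() are deprecated in current dplyr; assuming dplyr >= 1.1 (where reframe() allows multi-row results), an equivalent sketch would be:
library(dplyr)
data %>% reframe(across(c(a, b, c), tempfunc))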
The following code should solve your question:
my.names <- names(data)[names(data) %in% letters]
res <- lapply(my.names, function(x){
  c(sum(data[[x]] * data[["weight"]]), sum(data[["weight"]][data[[x]] > 0]))
})
names(res) <- my.names
Or directly to a data.frame:
do.call("rbind", lapply(my.names, function(letter){
data.frame(letter, "sum1_name" = sum(data[[letter]]*data[["weight"]]),
"sum2_name" = sum(data[["weight"]][data[[letter]]>0]))
}))
# letter sum1_name sum2_name
# 1 a 995.5790 120.319
# 2 b 598.3100 119.662
# 3 c 172.8193 138.009
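A compact base R alternative is to sapply() over the letter columns; this is a sketch under the same assumptions (letter columns listed in my.names, a numeric weight column), with illustrative row names of my own choosing:
sapply(data[my.names], function(col) {
  c(weighted_sum  = sum(col * data$weight),     # sum of observation * weight
    weight_if_pos = sum(data$weight[col > 0]))  # total weight where observation > 0
})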
