Efficient sampling of a factor variable from dataframe subsets in R

I have a dataframe df1 with six columns, two of which (var1 and var3) I use to split df1, giving a list of dataframes ls1.
For each sub-dataframe x in ls1, I want to sample() x$var2, x$num times, with probabilities x$probs, as follows:
Create data:
var1 <- rep(LETTERS[seq( from = 1, to = 3 )], each = 6)
var2 <- rep(LETTERS[seq( from = 1, to = 3 )], 6)
var3 <- rep(1:2,3, each = 3)
num <- rep(c(10, 11, 13, 8, 20, 5), each = 3)
probs <- round(runif(18), 2)
df1 <- as.data.frame(cbind(var1, var2, var3, num, probs))
ls1 <- split(df1, list(df1$var1, df1$var3))
Have a look at the first couple of list elements:
$A.1
var1 var2 var3 num probs
1 A A 1 10 0.06
2 A B 1 10 0.27
3 A C 1 10 0.23
$B.1
var1 var2 var3 num probs
7 B A 1 13 0.93
8 B B 1 13 0.36
9 B C 1 13 0.04
lapply over ls1:
ls1 <- lapply(ls1, function(x) {
  # num and probs are character after the cbind() above, hence the conversions;
  # sample()'s size must be a single number, so take the first element
  res <- table(sample(x$var2, size = as.numeric(as.character(x$num[1])),
                      replace = TRUE, prob = as.numeric(as.character(x$probs))))
  res <- as.data.frame(res)
  cbind(x, res = res$Freq)
})
df2 <- do.call("rbind", ls1)
df2
Have a look at the first couple of list elements of the result:
$A.1
var1 var2 var3 num probs res
1 A A 1 10 0.06 2
2 A B 1 10 0.27 4
3 A C 1 10 0.23 4
$B.1
var1 var2 var3 num probs res
7 B A 1 13 0.93 10
8 B B 1 13 0.36 3
9 B C 1 13 0.04 0
So for each dataframe a new variable res is created: the sum of res equals num, and the elements of var2 are represented in res in proportions governed by probs. This does what I want, but it becomes very slow when there is a lot of data.
My question: is there a way to replace the lapply piece of code with something more efficient/faster?
I am just beginning to learn about vectorization and am guessing this could be vectorized, but I am unsure how to achieve it.
ls1 is eventually returned to a dataframe structure, so if it doesn't need to become a list to begin with, all the better (although it doesn't really matter how the data is structured for this step).
Any help would be much appreciated.

First, you should create df1 using data.frame() rather than converting from a matrix, because a matrix forces all columns to the same type even though you have both numeric and character variables.
df1 <- data.frame(var1, var2, var3, num, probs)
Next, instead of using sample(), the rmultinom() function is much more efficient because it directly outputs the number of draws for each value in x$var2 (re-run the split() after rebuilding df1 so the list columns stay numeric):
ls1 <- lapply(ls1, function(x) {
  # rmultinom() returns a one-column matrix; c() flattens it to a plain vector
  x$res <- c(rmultinom(1, x$num[1], x$probs))
  x
})
This should be noticeably faster than using the sample approach.
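If you want to verify the gain on your own data, a rough timing sketch might look like this (assuming df1 has been rebuilt with data.frame() and ls1 re-split, so num and probs are numeric; on this toy data both run near-instantly, so use a larger ls1 for a meaningful comparison):
# time the original sample()-based approach
system.time(lapply(ls1, function(x) {
  table(sample(x$var2, size = x$num[1], replace = TRUE, prob = x$probs))
}))
# time the rmultinom() approach
system.time(lapply(ls1, function(x) rmultinom(1, x$num[1], x$probs)))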

Rather than splitting your data frame into groups, I would use package {dplyr} with a group_by() + mutate():
library(dplyr)
df1 %>%
  mutate_at(vars(num, probs), ~ as.numeric(as.character(.))) %>%  # safe even if the columns are factors
  group_by(var1, var3) %>%
  mutate(res = c(rmultinom(1, num[1], probs)))
This should be fast and you can keep the original data structure.
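In dplyr 1.0 and later, the same idea can be written with across(); a sketch, assuming df1 was built with data.frame() so num and probs only need converting back to numeric:
library(dplyr)
df1 %>%
  mutate(across(c(num, probs), ~ as.numeric(as.character(.x)))) %>%
  group_by(var1, var3) %>%
  mutate(res = c(rmultinom(1, num[1], probs))) %>%
  ungroup()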

Related

Adding new column with the number of duplicates in R

I have a column of IDs which contains duplicates. I want to create a new column which, in each row, divides 1 by the number of occurrences of that row's ID in the column.
So, for example, in row 2 the ID is 101, and 101 appears three times in the column. Therefore, in the new column, row 2 should be 1 divided by the number of occurrences of ID 101 (1/3), which gives 0.33. How would I go about this? Sorry if this doesn't make sense; happy to clarify.
Thanks so much in advance!
We can use ave:
# create data
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
# id
# 1 3
# 2 3
# 3 3
# 4 2
# 5 3
# 6 2
# 7 2
# 8 2
# 9 3
# 10 1
# use ave to count instances of id
1/ave(dat$id, dat$id, FUN = length)
# 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00
If your id variable is character, we can use seq_along() within the ave() call:
dat$id <- as.character(dat$id)
1/ave(seq_along(dat$id), dat$id, FUN = length)
# [1] 0.20 0.20 0.20 0.25 0.20 0.25 0.25 0.25 0.20 1.00
In the following code, I assume the data frame to be named df.
library(tidyverse)
df <- table(df$ID) %>%
  enframe() %>%                 # the 'name' column comes out as character,
  mutate(value = 1 / value) %>% # so ID must be character for the join to work
  left_join(df, ., by = c("ID" = "name"))
Using data.table
library(data.table)
setDT(dat)[, value := 1/.N, id]
data
set.seed(123)
ids <- sample(1:3, 10, replace = TRUE)
dat <- data.frame(id = ids)
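For completeness, dplyr can do the same in one step with add_count(), which appends each id's group size as a column (a sketch using the dat built above):
library(dplyr)
dat %>%
  add_count(id) %>%     # adds a column n with the count of each id
  mutate(value = 1 / n)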

Identify which row of data.frame exactly matches a vector

Given this data.frame:
var1 <- c(1, 2)
var2 <- c(3, 4)
var3 <- c(5, 6)
df <- expand.grid(var1 = var1, var2 = var2, var3 = var3)
var1 var2 var3
1 1 3 5
2 2 3 5
3 1 4 5
4 2 4 5
5 1 3 6
6 2 3 6
7 1 4 6
8 2 4 6
I would like to identify the data.frame row number matching this vector (4 is the answer in this case):
vec <- c(var1 = 2, var2 = 4, var3 = 5)
var1 var2 var3
2 4 5
I can't seem to sort out a simple subsetting method. The best I have been able to come up with is the following:
working <- apply(df, 2, match, vec)
which(apply(working, 1, anyNA) == FALSE)
This seems more convoluted than expected; I was wondering if there is a simpler solution?
We can transpose the dataframe, compare it with vec and select the row where all of the values match.
which(colSums(t(df) == vec) == ncol(df))
#[1] 4
For the sake of completeness, subsetting can be implemented using data.table's join:
library(data.table)
setDT(df)[as.list(vec), on = names(vec), which = TRUE]
[1] 4
This can be solved using the prodlim library:
> library(prodlim)
> row.match(vec, df)
[1] 4
Here is a dplyr option:
library(dplyr)
library(magrittr)
df %>%
  mutate(new = paste0(var1, var2, var3), num = row_number()) %>%
  filter(new == "245") %>%  # note: pasting values works here but is fragile with multi-digit numbers
  select(num) %>%
  as.integer()
[1] 4
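Another base R sketch, assuming the elements of vec are in the same order as the columns of df: compare column by column with Map() and combine the logical vectors with Reduce():
# each Map() element compares one column against the matching vec entry
which(Reduce(`&`, Map(`==`, df, vec)))
# [1] 4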

Is there a way to find the indices of common (exactly the same) elements in a dataframe?

Given a dataframe such as,
num <- c(5,10,15,20,25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)
I need to find the indices where the letters are the same.
Output:
A 1 3
B 2 5
C 4
Then, take these indices and find the mean of those indices in num.
Output:
A 10
B 17.5
C 20
I cannot use loops or if statements; I am looking at using something like apply(), which(), etc.
As the objective is to find the mean for each 'letter', it is better to group by 'letter' and get the mean of 'num':
library(dplyr)
thelist %>%
  group_by(letter) %>%
  summarise(num = mean(num))
# A tibble: 3 x 2
# letter num
# <fct> <dbl>
#1 A 10
#2 B 17.5
#3 C 20
or in base R
aggregate(num ~ letter, thelist, mean)
To find the indices of the same 'letter', we can split the sequence of rows by 'letter':
split(seq_len(nrow(thelist)), thelist$letter)
#$A
#[1] 1 3
#$B
#[1] 2 5
#$C
#[1] 4
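If you then want to go from those indices to the means without an explicit loop, one sketch is to sapply() over the split result:
idx <- split(seq_len(nrow(thelist)), thelist$letter)
sapply(idx, function(i) mean(thelist$num[i]))
#    A    B    C
# 10.0 17.5 20.0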
Another option using data.table:
library(data.table)
setDT(thelist)[, .(ind = paste(.I, collapse = " "),
                   mean_num = mean(num)),
               by = letter]
Output:
letter ind mean_num
1: A 1 3 10.0
2: B 2 5 17.5
3: C 4 20.0
I'd use dplyr/tidyverse for this:
# setup
library(tidyverse)
# group by letters then get mean of num
thelist %>%
  group_by(letter) %>%
  summarise(mean_num = mean(num))
You could also use base R with a for loop:
lets <- unique(thelist$letter)
x <- rep(NA, length(lets))
for (i in seq_along(lets)) {  # seq_along() avoids hard-coding the number of groups
  x[i] <- mean(thelist$num[thelist$letter %in% lets[i]])
}
x
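Since the question rules out explicit loops, the loop above can also be collapsed into a single tapply() call (a base R sketch):
tapply(thelist$num, thelist$letter, mean)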

How to remove rows with inf from a dataframe in R

I have a very large dataframe (df) with approximately 35-45 columns (variables) and more than 300 rows. Some of the rows contain NA, NaN, Inf or -Inf values in one or more variables. I have used na.omit(df) to remove rows with NA and NaN, but I can't remove rows with Inf and -Inf values using na.omit().
While searching I came across the thread Remove rows with Inf and NaN in R and used the modified code df[is.finite(df)], but it does not remove the rows with Inf and -Inf, and it also gives this error:
Error in is.finite(df) : default method not implemented for type
'list'
EDITED
Remove the entire row even if only one or several of its columns contain Inf or -Inf.
To remove the rows with +/-Inf I'd suggest the following:
df <- df[!is.infinite(rowSums(df)),]
or, equivalently,
df <- df[is.finite(rowSums(df)),]
The second option (the one with is.finite() and without the negation) also removes rows containing NA values, in case that has not already been done.
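Note that both one-liners assume every column is numeric, because rowSums() cannot add factor or character columns. With mixed column types, one sketch is to restrict the check to the numeric columns first:
num_cols <- vapply(df, is.numeric, logical(1))  # flag the numeric columns
df <- df[is.finite(rowSums(df[, num_cols, drop = FALSE])), ]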
Depending on the data, there are a couple options using scoped variants of dplyr::filter() and is.finite() or is.infinite() that might be useful:
library(dplyr)
# sample data
df <- data_frame(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd'))
# across all columns:
df %>%
filter_all(all_vars(!is.infinite(.)))
# note that is.finite() does not work with NA or strings:
df %>%
filter_all(all_vars(is.finite(.)))
# checking only numeric columns:
df %>%
filter_if(~is.numeric(.), all_vars(!is.infinite(.)))
# checking only select columns, in this case a through c:
df %>%
filter_at(vars(a:c), all_vars(!is.infinite(.)))
is.finite() works on vectors, not on data.frame objects. So we can loop over the data.frame with lapply() and keep only the 'finite' values:
lapply(df, function(x) x[is.finite(x)])
If the number of Inf and -Inf values differs between columns, the above code will produce a list with elements of unequal length, so it may be better to leave it as a list; to get a data.frame, the elements must have equal lengths.
If we want to remove rows containing any NA or Inf/-Inf values:
df[Reduce(`&`, lapply(df, function(x) !is.na(x) & is.finite(x))),]
Or a compact option by @nicola:
df[Reduce(`&`, lapply(df, is.finite)),]
If we are open to using a package, a compact option would be NaRV.omit():
library(IDPmisc)
NaRV.omit(df)
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(1:5, NA, -Inf, Inf),
20*5, replace=TRUE), ncol=5))
To keep the rows without Inf we can do:
df[apply(df, 1, function(x) all(is.finite(x))), ]
NA and NaN rows are removed as well, because is.finite() returns FALSE for NA and NaN, so all(is.finite(x)) is FALSE for any row containing them.
set.seed(24)
df <- as.data.frame(matrix(sample(c(0:9, NA, -Inf, Inf, NaN), 20*5, replace=TRUE), ncol=5))
df2 <- df[apply(df, 1, function(x) all(is.finite(x))), ]
Here are the results of the different is.*() functions:
x <- c(42, NA, NaN, Inf)
is.finite(x)
# [1] TRUE FALSE FALSE FALSE
is.na(x)
# [1] FALSE TRUE TRUE FALSE
is.nan(x)
# [1] FALSE FALSE TRUE FALSE
df[!is.infinite(df$x), ]
where x is the column of df that contains the infinite values. The first answer posted relies on rowSums(), but for my own problem the df had columns which could not be added.
It took me a while to work this out for dplyr 1.0.0, so I thought I would put up the new version of @sbha's solutions using c_across(), since filter_all() and filter_if() are being deprecated.
library(dplyr)
df <- tibble(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd'))
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
  rowwise() %>%
  filter(!all(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 4 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
  rowwise() %>%
  filter(!any(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
df %>%
  rowwise() %>%
  filter(!any(is.infinite(c_across(a:c))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
To be honest, I think @sbha's answer is simpler!
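As an aside, dplyr 1.0.4 added if_all()/if_any(), which express the same filters without rowwise() and are typically much faster. A sketch mirroring the any() variant above (!is.infinite() keeps the NA row, just like the c_across() version):
df %>%
  filter(if_all(where(is.numeric), ~ !is.infinite(.x)))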
I had this problem and none of the above solutions worked for me. I used the following to remove rows with +/-Inf in columns 15 and 16 of my dataframe:
d <- subset(c, c[, 15:16] != "-Inf")
e <- subset(d, d[, 15:16] != "Inf")
I consider myself new to coding and I couldn't get the recommendations above to work with my code. I found a less complicated way to reduce a dataframe in two lines: first replace Inf with NA, then keep only the rows with complete data:
Df[sapply(Df, is.infinite)] <- NA
Df <- Df[complete.cases(Df), ]

Calculating sums of weighted columns in a dataset in R, without loops

I have a large data table with over 300 columns. For each letter column I would like to get:
-- the sum of (each observation in the column * the weight of that observation);
-- the sum of the weights where the observation in the letter column is greater than 0.
Here I provide an example for one column.
id <- c("0001", "0002", "0003", "0004")
a <- c(0, 9, 8, 5)
b <- c(0,5,5,0)
c <- c(1.5, 0.55, 0, 0.06)
weight <- c(102.354, 34.998, 84.664, .657)
data <- data.frame(id, a, b, c, weight)
data
id a b c weight
1 0001 0 0 1.50 102.354
2 0002 9 5 0.55 34.998
3 0003 8 5 0.00 84.664
4 0004 5 0 0.06 0.657
sum(data$a * data$weight)
[1] 995.579
sum(data$weight[data$a >0])
[1] 120.319
Any idea?
A possible data.table solution.
You could define a helper function:
tempfunc <- function(x) c(sum(x * data$weight), sum(data$weight[x > 0]))
Then do either
library(data.table)
setDT(data)[, lapply(.SD, tempfunc), .SDcols = -c("id", "weight")]
# a b c
# 1: 995.579 598.310 172.8193
# 2: 120.319 119.662 138.0090
Or
library(dplyr)
setDT(data) %>% summarise_each(funs(tempfunc), -c(id, weight))
## a b c
## 1: 995.579 598.310 172.8193
## 2: 120.319 119.662 138.0090
The following code should solve your question:
my.names <- names(data)[names(data) %in% letters]
res <- lapply(my.names, function(x) {
  c(sum(data[[x]] * data[["weight"]]), sum(data[["weight"]][data[[x]] > 0]))
})
names(res) <- my.names
or directly to data.frame:
do.call("rbind", lapply(my.names, function(letter) {
  data.frame(letter,
             "sum1_name" = sum(data[[letter]] * data[["weight"]]),
             "sum2_name" = sum(data[["weight"]][data[[letter]] > 0]))
}))
# letter sum1_name sum2_name
# 1 a 995.5790 120.319
# 2 b 598.3100 119.662
# 3 c 172.8193 138.009
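For comparison, here is a dplyr sketch of the same two summaries, assuming dplyr 1.0+ for across(); the wsum/wpos suffixes are made-up names:
library(dplyr)
data %>%
  summarise(across(a:c,
                   list(wsum = ~ sum(.x * weight),       # sum of column * weight
                        wpos = ~ sum(weight[.x > 0]))))  # sum of weights where column > 0
# returns one row: a_wsum = 995.579, a_wpos = 120.319, b_wsum = 598.31, ...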
