It seems that R might be missing an obvious simple function: psum. Does it exist as a different name, or is it in a package somewhere?
x = c(1,3,NA,5)
y = c(2,NA,4,1)
min(x,y,na.rm=TRUE) # ok
[1] 1
max(x,y,na.rm=TRUE) # ok
[1] 5
sum(x,y,na.rm=TRUE) # ok
[1] 16
pmin(x,y,na.rm=TRUE) # ok
[1] 1 3 4 1
pmax(x,y,na.rm=TRUE) # ok
[1] 2 3 4 5
psum(x,y,na.rm=TRUE)
[1] 3 3 4 6 # expected result
Error: could not find function "psum" # actual result
I realise that + is already like psum, but what about NA?
x+y
[1] 3 NA NA 6 # can't supply `na.rm=TRUE` to `+`
Is there a case to add psum? Or have I missed something?
This question is a follow up from this question :
Using := in data.table to sum the values of two columns in R, ignoring NAs
Following @JoshUlrich's comment on the previous question:
psum <- function(..., na.rm = FALSE) {
  rowSums(do.call(cbind, list(...)), na.rm = na.rm)
}
Edit: from Sven Hohenstein:
psum2 <- function(..., na.rm = FALSE) {
  dat <- do.call(cbind, list(...))
  res <- rowSums(dat, na.rm = na.rm)
  idx_na <- !rowSums(!is.na(dat))
  res[idx_na] <- NA
  res
}
x = c(1,3,NA,5,NA)
y = c(2,NA,4,1,NA)
z = c(1,2,3,4,NA)
psum(x,y,na.rm=TRUE)
## [1] 3 3 4 6 0
psum2(x,y,na.rm=TRUE)
## [1] 3 3 4 6 NA
n = 1e7
x = sample(c(1:10,NA),n,replace=TRUE)
y = sample(c(1:10,NA),n,replace=TRUE)
z = sample(c(1:10,NA),n,replace=TRUE)
library(rbenchmark)
benchmark(psum(x,y,z,na.rm=TRUE),
psum2(x,y,z,na.rm=TRUE),
pmin(x,y,z,na.rm=TRUE),
pmax(x,y,z,na.rm=TRUE), replications=20)
## test replications elapsed relative
## 4 pmax(x, y, z, na.rm = TRUE) 20 26.114 1.019
## 3 pmin(x, y, z, na.rm = TRUE) 20 25.632 1.000
## 2 psum2(x, y, z, na.rm = TRUE) 20 164.476 6.417
## 1 psum(x, y, z, na.rm = TRUE) 20 63.719 2.486
Sven's version (which arguably is the correct one) is quite a bit slower,
although whether it matters obviously depends on the application.
Anyone want to hack up an inline/Rcpp version?
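In that spirit, here is a rough Rcpp sketch for the two-vector case only (a sketch, not a drop-in replacement: psum_cpp is an illustrative name, and a full version would need to accept an arbitrary number of vectors, e.g. via a List):
library(Rcpp)
cppFunction("
NumericVector psum_cpp(NumericVector x, NumericVector y, bool narm = false) {
  int n = x.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    bool nax = NumericVector::is_na(x[i]);
    bool nay = NumericVector::is_na(y[i]);
    if (nax || nay) {
      if (!narm)           out[i] = NA_REAL;  // propagate NA, like plain +
      else if (nax && nay) out[i] = NA_REAL;  // keep NA when every input is NA
      else if (nax)        out[i] = y[i];
      else                 out[i] = x[i];
    } else {
      out[i] = x[i] + y[i];
    }
  }
  return out;
}")
psum_cpp(c(1, 3, NA, 5, NA), c(2, NA, 4, 1, NA), narm = TRUE)
## [1]  3  3  4  6 NA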
As for why this doesn't exist: don't know, but good luck getting R-core to make additions like this ... I can't offhand think of a sufficiently widespread *misc package into which this could go ...
Follow-up thread by Matthew on r-devel is here (which seems to confirm):
r-devel: There is pmin and pmax each taking na.rm, how about psum?
After a quick search on CRAN, there are at least 3 packages that provide a psum function: rccmisc, incadata and kit. kit seems to be the fastest. Below, Ben Bolker's example is reproduced with these packages added to the benchmark.
benchmark(
rccmisc::psum(x,y,z,na.rm=TRUE),
incadata::psum(x,y,z,na.rm=TRUE),
kit::psum(x,y,z,na.rm=TRUE),
psum(x,y,z,na.rm=TRUE),
psum2(x,y,z,na.rm=TRUE),
replications=20
)
# test replications elapsed relative
# 2 incadata::psum(x, y, z, na.rm = TRUE) 20 20.05 14.220
# 3 kit::psum(x, y, z, na.rm = TRUE) 20 1.41 1.000
# 4 psum(x, y, z, na.rm = TRUE) 20 8.04 5.702
# 5 psum2(x, y, z, na.rm = TRUE) 20 20.44 14.496
# 1 rccmisc::psum(x, y, z, na.rm = TRUE) 20 23.24 16.482
Another approach, whose advantage is that it also works with matrices, just like pmin and pmax:
psum <- function(..., na.rm = FALSE) {
  plus_na_rm <- function(x, y) ifelse(is.na(x), 0, x) + ifelse(is.na(y), 0, y)
  Reduce(if (na.rm) plus_na_rm else `+`, list(...))
}
x = c(1,3,NA,5)
y = c(2,NA,4,1)
psum(x, y)
#> [1] 3 NA NA 6
psum(x, y, na.rm = TRUE)
#> [1] 3 3 4 6
# With matrices
A <- matrix(1:9, nrow = 3)
B <- matrix(c(NA, 2:8, NA), nrow = 3)
psum(A, B)
#> [,1] [,2] [,3]
#> [1,] NA 8 14
#> [2,] 4 10 16
#> [3,] 6 12 NA
psum(A, B, na.rm = TRUE)
#> [,1] [,2] [,3]
#> [1,] 1 8 14
#> [2,] 4 10 16
#> [3,] 6 12 9
Created on 2020-03-09 by the reprex package (v0.3.0)
One caveat: if an element is NA across all the summed objects and na.rm = TRUE, the result will be 0 (and not NA).
For example:
psum(NA, NA, na.rm = TRUE)
#> [1] 0
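If that matters for your use case, one possible tweak (a sketch; psum_na is just an illustrative name) is to restore NA wherever every input is NA, mirroring psum2 further up:
psum_na <- function(..., na.rm = FALSE) {
  res <- psum(..., na.rm = na.rm)
  all_na <- Reduce(`&`, lapply(list(...), is.na))  # TRUE where every input is NA
  res[all_na] <- NA
  res
}
psum_na(NA, NA, na.rm = TRUE)
#> [1] NA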
Related
I have a character vector:
s <- c('peel', 'peer', 'pear', 'tggc', 'gcgt')
I would like to compare each string with every other string in the list and I use the following command:
z <- Map(utf8ToInt, s)
dmat <- outer(z, z, FUN=Vectorize(function(x, y) sum(bitwXor(x, y) > 0)))
However, I would like to output the number of character differences (instead of characters matching) based on position:
For example "tggc" when compared with the string "gcgt" should be output as 3.
Just use a simple negation ! as per the following:
s <- c('peel', 'peer', 'pear', 'tggc', 'gcgt')
z <- Map(utf8ToInt, s)
dmat <- outer(z, z, FUN = Vectorize(function(x, y) sum(!bitwXor(x, y))))
dmat
Or use a straightforward equality comparison given that you've mapped the characters to integers.
dmat <- outer(z, z, FUN = Vectorize(function(x, y) sum(x == y)))
Both give output:
peel peer pear tggc gcgt
peel 4 3 2 0 0
peer 3 4 3 0 0
pear 2 3 4 0 0
tggc 0 0 0 4 1
gcgt 0 0 0 1 4
Note: if the string length is fixed, you can also use subtraction, but the above saves you from passing the length explicitly, which adds a little generality.
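For example, with the fixed length of 4 used here, the subtraction idea could look like this (a sketch: it counts differing positions with bitwXor and subtracts them from the string length):
dmat_sub <- outer(z, z, FUN = Vectorize(function(x, y) 4L - sum(bitwXor(x, y) > 0)))
dmat_sub  # same values as dmat above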
If performance is a concern:
s <- c('peel', 'peer', 'pear', 'tggc', 'gcgt')
z <- mapply(utf8ToInt, s)
n <- length(s)
n1 <- 1:(n - 1L)
replace(matrix(nrow = n, ncol = n),
sequence(n1, seq(n + 1L, by = n, length.out = n - 1L)),
colSums(z[, sequence(n1)] == z[, rep.int(2:n, n1)]))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] NA 3 2 0 0
#> [2,] NA NA 3 0 0
#> [3,] NA NA NA 0 0
#> [4,] NA NA NA NA 1
#> [5,] NA NA NA NA NA
# benchmarking with a larger character vector
s <- mapply(FUN = function(x) paste0(sample(letters[1:4]), collapse = ""), 1:100)
microbenchmark::microbenchmark(bitwXor = {z <- Map(utf8ToInt, s)
outer(z, z, FUN = Vectorize(function(x, y) sum(!bitwXor(x, y))))},
logical = {z <- Map(utf8ToInt, s)
outer(z, z, FUN = Vectorize(function(x, y) sum(x == y)))},
mat = {z <- mapply(utf8ToInt, s)
n <- length(s)
n1 <- 1:(n - 1L)
replace(matrix(nrow = n, ncol = n),
sequence(n1, seq(n + 1L, by = n, length.out = n - 1L)),
colSums(z[, sequence(n1)] == z[, rep.int(2:n, n1)]))})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> bitwXor 23846.1 24875.6 26207.230 26120.95 27134.35 33842.8 100
#> logical 16645.5 17514.8 19020.051 18383.35 19875.15 32716.8 100
#> mat 387.4 455.0 511.322 482.70 544.05 1224.4 100
# confirm that the results are the same
z <- Map(utf8ToInt, s)
mat1 <- outer(z, z, FUN = Vectorize(function(x, y) sum(!bitwXor(x, y))))
mat2 <- outer(z, z, FUN = Vectorize(function(x, y) sum(x == y)))
z <- mapply(utf8ToInt, s)
n <- length(s)
n1 <- 1:(n - 1L)
mat3 <- replace(matrix(nrow = n, ncol = n), sequence(n1, seq(n + 1L, by = n, length.out = n - 1L)), colSums(z[, sequence(n1)] == z[, rep.int(2:n, n1)]))
all.equal(mat1[upper.tri(mat1)], mat2[upper.tri(mat2)])
#> [1] TRUE
all.equal(mat1[upper.tri(mat1)], mat3[upper.tri(mat3)])
#> [1] TRUE
A possible solution:
library(tidyverse)
sample <- c('peel','peer','pear','tggc','gcgt')
sample %>%
expand.grid(sample) %>%
rowwise %>%
mutate(cmp = mapply(function(x,y)
{ x != y}, x=str_split(Var1, ""), y=str_split(Var2, "")) %>% sum)
#> # A tibble: 25 × 3
#> # Rowwise:
#> Var1 Var2 cmp
#> <fct> <fct> <int>
#> 1 peel peel 0
#> 2 peer peel 1
#> 3 pear peel 2
#> 4 tggc peel 4
#> 5 gcgt peel 4
#> 6 peel peer 1
#> 7 peer peer 0
#> 8 pear peer 1
#> 9 tggc peer 4
#> 10 gcgt peer 4
#> # … with 15 more rows
I built this custom "winsorize" function that does what it should, unless there are NA's in the data.
How it works:
winsor1 <- function(x, probability){
  numWin <- ceiling(length(x) * probability)
  # Replace first lower, then upper
  x <- pmax(x, sort(x)[numWin + 1])
  x <- pmin(x, sort(x)[length(x) - numWin])
  return(x)
}
x <- 0:10
winsor1(x, probability=0.01)
[1] 1 1 2 3 4 5 6 7 8 9 9
So it replaces the top (and bottom) 1% of the data (rounded up to the next value, since there are only 11 values in the example). If there are, e.g., 250 values then the bottom 3 and top 3 values would be replaced by the bottom 4th and top 4th respectively.
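For instance, the number of values trimmed in each tail is just ceiling(length(x) * probability):
ceiling(11 * 0.01)   # the 11-value example above: 1 value trimmed at each end
# [1] 1
ceiling(250 * 0.01)  # 250 values: 3 values trimmed at each end
# [1] 3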
The whole thing breaks down when there are NA's in the data, causing an error. However, if I set na.rm = TRUE in the pmax() and pmin() then the NA's themselves are replaced by the bottom value.
x[5] <- NA
winsor1(x, probability=0.01)
[1] 1 1 2 3 1 5 6 7 8 9 9
What can I do so that the NA's are preserved but do not cause an error? This is the output I want for the last line:
winsor1(x, probability=0.01)
[1] 1 1 2 3 NA 5 6 7 8 9 9
The issue is with sort, which removes NAs by default; otherwise we have to specify na.last = TRUE, which may also not be what we need.
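To see why this matters, here is a quick illustration with the x from the question (sort() silently drops the NA, so the vector shrinks and the trimming indices shift):
x <- 0:10; x[5] <- NA
sort(x)
# [1]  0  1  2  3  5  6  7  8  9 10
sort(x, na.last = TRUE)
# [1]  0  1  2  3  5  6  7  8  9 10 NA
One option is to index with order instead: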
winsor1 <- function(x, probability){
  numWin <- ceiling(length(x) * probability)
  # Replace first lower, then upper
  x1 <- x[order(x)]
  x <- pmax(x, x1[numWin + 1])
  x1 <- x1[order(x1)]
  x <- pmin(x, x1[length(x) - numWin], na.rm = TRUE)
  return(x)
}
Testing:
x <- 0:10
winsor1(x, probability=0.01)
#[1] 1 1 2 3 4 5 6 7 8 9 9
x[5] <- NA
winsor1(x, probability=0.01)
#[1] 1 1 2 3 NA 5 6 7 8 9 10
Or, with na.last in sort:
winsor1 <- function(x, probability){
  numWin <- ceiling(length(x) * probability)
  # Replace first lower, then upper
  x <- pmax(x, sort(x, na.last = TRUE)[numWin + 1])
  x <- pmin(x, sort(x, na.last = TRUE)[length(x) - numWin], na.rm = TRUE)
  return(x)
}
Adding two vectors is easy:
> c(1:5) + c(6:10)
[1] 7 9 11 13 15
But since adding any number to NA gives NA, this happens:
> c(1,NA,3:5)+c(6:10)
[1] 7 NA 11 13 15
How can I add two vectors where there may be some NAs, treating them as zeros? I need to get this result:
> c(1,NA,3:5)+c(6:10)
[1] 7 7 11 13 15
Any ideas on how to do this using {base} and not changing the NAs to zeros on the original vectors?
You can also use colSums or rowSums, e.g.:
rowSums(cbind(x, y), na.rm = T)
# [1] 7 7 11 13 15
colSums(rbind(x, y), na.rm = T)
# [1] 7 7 11 13 15
Benchmarks; surprisingly, colSums is the fastest:
microbenchmark::microbenchmark(fn_replace(x, y),
fn_rowSums(x, y),
fn_colSums(x, y),
fn_coalesce(x, y))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fn_replace(x, y) 121.4322 130.99067 174.1531 162.2454 183.1781 385.7348 100
# fn_rowSums(x, y) 143.0654 146.20815 172.5396 149.3953 179.0337 370.1625 100
# fn_colSums(x, y) 96.8848 99.46521 121.5916 106.8800 140.9279 298.1607 100
# fn_coalesce(x, y) 259.2923 310.16915 357.0241 326.1245 360.9110 595.9711 100
## Code to generate x, y and functions for benchmark:
fn_replace <- function(x, y) {
  replace(x, is.na(x), 0) + replace(y, is.na(y), 0)
}
fn_rowSums <- function(x, y) {
  rowSums(cbind(x, y), na.rm = T)
}
fn_colSums <- function(x, y) {
  colSums(rbind(x, y), na.rm = T)
}
fn_coalesce <- function(x, y) {
  dplyr::coalesce(x, rep(0, length(x))) +
    dplyr::coalesce(y, rep(0, length(y)))
}
n_rep <- 1e6
x <- as.numeric(rep(c(1, NA, 3:5, NA, NA, 5), n_rep))
y <- as.numeric(rep(c(NA, 6:9, NA, 3, 4), n_rep))
Maybe replace the NAs with 0 and then add the vectors:
x <- c(1,NA,3:5)
y <- c(6:10)
replace(x, is.na(x), 0) + replace(y, is.na(y), 0)
#[1] 7 7 11 13 15
We could try using coalesce() from the dplyr package:
require(dplyr)
x <- c(1,NA,3:5)
y <- c(6:10)
coalesce(x, rep(0, 5)) + coalesce(y, rep(0, 5))
coalesce(x, y) works by taking, for each position, the value from x if it is not NA, and otherwise the value from y, e.g.:
x rep(0, 5) => result
1 0 1
NA 0 0
3 0 3
4 0 4
5 0 5
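As a side note (assuming dplyr >= 1.0.0, where coalesce() recycles a length-one replacement), the rep() calls can be dropped:
coalesce(x, 0) + coalesce(y, 0)
# [1]  7  7 11 13 15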
Instead of base::replace() and dplyr::coalesce() as above, we can also use tidyr::replace_na():
library(tidyr)
replace_na(x, 0) + replace_na(y, 0)
#[1] 7 7 11 13 15
For a given dataframe, I'd like to split it based on some boolean value, and then apply a label to that row and the previous rows up until that point.
Assuming the following dataframe:
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
I'd ultimately like to create a new column that would contain a label for that specific portion of the dataframe. Ideally, something like the following:
x y z
1 F 1
2 F 1
3 F 1
4 T 1
5 F 2
6 F 2
7 T 2
8 F 3
9 F 3
10 F 3
My current thought is that I need to loop through the dataframe with a function similar to the following (but not exactly):
label.portion <- function(test) {
  for (i in 1:nrow(test)) {
    z <- 1
    if (test$y[i]) { z <- z + 1 }
    return(z)
  }
}
What is the best/easiest way of doing this? Any help is much appreciated.
Your z column can be built as
z <- with(test, sum(y)-rev(cumsum(rev(y)))+1)
in order to make every new z value start at a FALSE y after a TRUE y, as per your example.
Then you can do cbind(test, z) to get what you want.
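Applied to the example data from the question, this gives:
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
z <- with(test, sum(y) - rev(cumsum(rev(y))) + 1)
z
# [1] 1 1 1 1 2 2 2 3 3 3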
A one-liner solution using transform:
transform(test,z= cumsum(c(0,diff(y)) == -1)+1)
x y z
1 1 FALSE 1
2 2 FALSE 1
3 3 FALSE 1
4 4 TRUE 1
5 5 FALSE 2
6 6 FALSE 2
7 7 TRUE 2
8 8 FALSE 3
9 9 FALSE 3
10 10 FALSE 3
Another one-liner solution, which will be slightly faster than the other solutions (except data.table):
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
test$z <- c(1, head(cumsum(test$y), -1) + 1)
test
## x y z
## 1 1 FALSE 1
## 2 2 FALSE 1
## 3 3 FALSE 1
## 4 4 TRUE 1
## 5 5 FALSE 2
## 6 6 FALSE 2
## 7 7 TRUE 2
## 8 8 FALSE 3
## 9 9 FALSE 3
## 10 10 FALSE 3
Benchmarks with other solutions provided (excluding data.table)
test <- data.frame(x = 1:1e+05, y = sample(c(T, F), size = 1e+05, replace = TRUE))
library(microbenchmark)
microbenchmark(c(1, head(cumsum(test$y), -1) + 1),
               cumsum(c(0, diff(test$y)) == -1) + 1,
               with(test, sum(y) - rev(cumsum(rev(y))) + 1),
               times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## c(1, head(cumsum(test$y), -1) + 1) 1.685473 1.758474 1.865409 4.647218 5.091512 100
## cumsum(c(0, diff(test$y)) == -1) + 1 4.064867 4.379714 6.936561 7.338810 7.657961 100
## with(test, sum(y) - rev(cumsum(rev(y))) + 1) 2.568766 2.720395 5.396096 5.701176 30.642436 100
Here is an approach using na.locf from xts and data.table for coding elegance (and efficiency)
library(data.table)
library(xts) # for na.locf
test <- data.table(test)
test[(y), grp := seq_along(y)][, grp := na.locf(grp, fromLast = TRUE)]
test[is.na(grp), grp := max(test[, grp], na.rm =TRUE) + 1L]
And a far clearer and faster approach
test[, grp := {xx <- diff(c(0,.I[y], length(.I))); rep.int(seq_along(xx),xx)}]
Note that diff is implemented in R (diff.default is plain R code), so an Rcpp sugar implementation would be faster (I'm sure a C++ function would blow most of these out of the water).
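For illustration only, a rough sketch of such a C++ loop via Rcpp might look like the following (a plain loop rather than sugar; label_groups is a hypothetical name and it assumes y contains no NAs):
library(Rcpp)
cppFunction("
IntegerVector label_groups(LogicalVector y) {
  // label each row with a group id; a TRUE closes the current group
  int n = y.size();
  IntegerVector out(n);
  int g = 1;
  for (int i = 0; i < n; ++i) {
    out[i] = g;
    int yi = y[i];
    if (yi == 1) ++g;
  }
  return out;
}")
label_groups(test$y)
# [1] 1 1 1 1 2 2 2 3 3 3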
I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never been across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use :
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
This Q&A seems to be a popular Google search result, but the first answer is a bit slow for a large matrix, and I do not have enough reputation to comment on it, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the test: it cuts the run time by more than 90% compared to the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the first answer is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
library(data.table)
library(dataPreparation)  # for which_are_constant
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is 10 times faster than the other approaches.
Moreover, the more rows you have, the greater the benefit.
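For intuition, the exponential-search idea is roughly the following (a sketch only, not the package's actual implementation; is_constant_exp is an illustrative name): compare each column against its first value over a prefix of rows that doubles in size, so a non-constant column is rejected after reading only a few rows.
is_constant_exp <- function(x) {
  x <- x[!is.na(x)]  # drop NAs for this sketch
  n <- length(x)
  if (n <= 1L) return(TRUE)
  i <- 1L
  while (i < n) {
    j <- min(2L * i, n)  # double the window each pass
    if (any(x[(i + 1L):j] != x[1L])) return(FALSE)  # early exit on the first mismatch
    i <- j
  }
  TRUE
}
names(which(sapply(df, is_constant_exp)))  # names of the constant columns in df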
The janitor package has the function remove_constant, which can help delete constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
Then remove_constant can delete the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Using either of the above two commands, we get:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing a single value plus NAs is also deleted. For example:
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
This is the test_dat_with_NA we get:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
Then the command
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the command
remove_constant(test_dat_with_NA, na.rm = TRUE)
deletes column A, which contains only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
A tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)