Replicate certain values in a vector, determined by another vector

I have a vector of values (say 1:10), and want to repeat certain values in it 2 or more times, determined by another vector (say c(3,4,6,8)). In this example, the result would be c(1,2,3,3,4,4,5,6,6,7,8,8,9,10) when repeating 2 times.
This should work for a range vector of arbitrary length (like 200:600), with a second vector that is contained in the first. Is there a handy way to achieve this?

Akrun's is a more compact method, but this will also work:
# get rep vector
reps <- rep(1L, 10L)
reps[c(3,4,6,8)] <- 2L
rep(1:10, reps)
[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
The insight here is that rep accepts as its second argument an integer vector of the same length as the first argument, giving the number of repetitions for each element of the first argument.
Note that this solution relies on the assumption that c(3,4,6,8) gives the indices (positions) of the elements to be repeated. Under this scenario, d-b's comment provides a one-liner (with x <- 1:10):
rep(x, (seq_along(x) %in% c(3,4,6,8)) + 1)
If instead c(3,4,6,8) indicates the values that are to be repeated, then docendo-discimus's super-compact code works:
rep(x, (x %in% c(3,4,6,8)) * (n - 1) + 1)
where n may be adjusted to change the number of repetitions. If you need to call this a couple of times, it can be rolled up into a function like
myReps <- function(x, y, n) rep(x, (x %in% y) * (n - 1) + 1)
and called as
myReps(1:10, c(3,4,6,8), 2)
in the current scenario.

We can try (with v1 <- 1:10 and v2 <- c(3,4,6,8) as in the question):
i1 <- v1 %in% v2
sort(c(v1[!i1], rep(v1[i1], each = 2)))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
Update
For the arbitrary vector,
f1 <- function(vec1, vec2, n){
  i1 <- vec1 %in% vec2
  vec3 <- seq_along(vec1)
  c(vec1[!i1], rep(vec1[i1], each = n))[order(c(vec3[!i1],
                                                rep(vec3[i1], each = n)))]
}
set.seed(24)
v1N <- sample(10)
v2 <- c(3,4,6,8)
v1N
#[1] 3 10 6 4 7 5 2 9 8 1
f1(v1N, v2, 2)
#[1] 3 3 10 6 6 4 4 7 5 2 9 8 8 1
f1(v1N, v2, 3)
#[1] 3 3 3 10 6 6 6 4 4 4 7 5 2 9 8 8 8 1
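For comparison, a minimal sketch of an alternative (my own, not from this answer): build a per-element count vector and pass it to rep(), which also preserves the original order without needing sort() or order().
# same setup as above: v1N, v2, with n repetitions of the matched values
n <- 2
rep(v1N, times = ifelse(v1N %in% v2, n, 1))
#[1] 3 3 10 6 6 4 4 7 5 2 9 8 8 1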

Here's another approach using sapply
#DATA
x = 1:10
r = c(3,4,6,8)
n = 2 #Two repetitions of selected values
#Assuming 'r' is the index of values in x to be repeated
unlist(sapply(seq_along(x), function(i) if(i %in% r){rep(x[i], n)}else{rep(x[i],1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
#Assuming 'r' is the values in 'x' to be repeated
unlist(sapply(x, function(i) if(i %in% r){rep(i, n)}else{rep(i, 1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
I haven't tested these thoroughly, but they could be possible alternatives. Note that the order of the output will be considerably different with the sort-based approach below.
sort(c(x, rep(x[x %in% r], n-1))) #assuming 'r' is values
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
sort(c(x, rep(x[r], n-1))) #assuming 'r' is index
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10

I suggest this solution just to emphasize the cool usage of the append function in base R (note that it assumes each value in v occurs exactly once in vec, since append's after argument must be a single position):
ff <- function(vec, v, n) {
  for(i in seq_along(v)) vec <- append(vec, rep(v[i], n - 1), after = which(vec == v[i]))
  vec
}
Examples:
set.seed(1)
ff(vec = sample(10), v = c(3,4,6,8), n = 2)
#[1] 3 3 4 4 5 7 2 8 8 9 6 6 10 1
ff(vec = sample(10), v = c(2,5,9), n = 4)
#[1] 3 2 2 2 2 6 10 5 5 5 5 7 8 4 1 9 9 9 9


How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
  set.seed(x)
  res <- character(n)
  for(i in seq(n)){
    res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
  }
  res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
  mutate(my_id = id_function(1234, nrow(.)),
         my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
                                  TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
The problem is that id_function() is not vectorized over x: with n = 1 it returns a single string (built from a single seed), and case_when() recycles that one value across every row where the condition is TRUE. I guess rowwise() would do the trick, since it calls the function once per row:
data %>%
  rowwise() %>%
  mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr's map functions can be used for non-vectorized functions. The following gives a similar result; map2 takes the two arguments expected by your id_function.
library(tidyverse)
data %>%
  mutate(my_id = map2(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
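A small hedged variant: map2() as used above returns a list-column; purrr's map2_chr() returns a plain character vector instead, keeping my_id an ordinary column (assuming x has no missing values, as in the data defined above):
data %>%
  mutate(my_id = map2_chr(x, 1, id_function))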

Vector recycling concept in R

I am trying to understand the working of vector recycling in R. I have 2 vectors
c(2,4,6)
and
c(1,2)
And I want to use rep() to produce the following output:
[1] 2 4 6 4 8 12
From what I understand of ?rep(), there are times and each parameters that control the repetition, which I tried:
> rep(c(2,4,6), times=2)
[1] 2 4 6 2 4 6
But I also need the first vector to be multiplied by the first element of the second vector, and then by the second element. I'm not sure how to proceed.
You can use:
rep(c(2,4,6), 2) * rep(c(1,2), each=3)
#[1] 2 4 6 4 8 12
or with auto recycling:
c(2,4,6) * rep(c(1,2), each=3)
#[1] 2 4 6 4 8 12
Alternatively, outer could be used:
c(outer(c(2,4,6), c(1,2)))
#[1] 2 4 6 4 8 12
Also crossprod could be used:
c(crossprod(t(c(2,4,6)), c(1,2)))
#[1] 2 4 6 4 8 12
Or %*%:
c(c(2,4,6) %*% t(c(1,2)))
#[1] 2 4 6 4 8 12
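For completeness, a hedged general form of the first answer's idea for any two vectors a and b (recycle_prod is a hypothetical helper name, not from the answers):
# multiply a by each element of b, block by block
recycle_prod <- function(a, b) rep(a, times = length(b)) * rep(b, each = length(a))
recycle_prod(c(2,4,6), c(1,2))
#[1] 2 4 6 4 8 12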

Subset columns using logical vector

I have a dataframe from which I want to drop columns whose NA rate is over 70%, or where a single dominant value takes up over 99% of rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols<-c(T,F,F,T,F,F,T)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks to @MrFlick for the advice to change from colSums()/nrow(); shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
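The question also asks about dropping columns dominated by a single value; here's a minimal sketch combining both criteria (keep_col and its thresholds are my own illustration, not from the answers):
# keep a column only if its NA rate is at most 70% and its most frequent
# non-NA value covers at most 99% of rows
keep_col <- function(x, na_max = 0.7, dom_max = 0.99) {
  na_rate <- mean(is.na(x))
  dom_rate <- if (all(is.na(x))) 1 else max(table(x)) / length(x)
  na_rate <= na_max && dom_rate <= dom_max
}
d[sapply(d, keep_col)]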
Maybe this will help too. The 2 parameter in apply() means the function is applied column-wise to the data frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types (unless you want to resort to complex numbers). head with a negative argument drops that many elements from the end; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
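As for the follow-up about summing or averaging the group, a minimal sketch assuming the b1/b2 columns created by the first code block above:
df$b_sum  <- rowSums(df[c("b1", "b2")], na.rm = TRUE)
df$b_mean <- rowMeans(df[c("b1", "b2")], na.rm = TRUE)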
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Quickly remove zero variance variables from a data.frame

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
  A = factor(rep("X", 10), levels = c('X', 'Y')),
  B = round(runif(10) * 10),
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  E = factor(rep("A", 10)),
  F = factor(rep(c("I", "J"), 5)),
  G = c(rep(10, 9), NA)
)
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out == 1)
}
And here's the result of the process:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
You may also want to look into the nearZeroVar() function in the caret package.
If a predictor has, say, only one differing event out of 1000 observations, it might be a good idea to discard it (but this depends on the model); nearZeroVar() can do that.
Don't use table() - very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
  out <- lapply(dat, function(x) length(unique(x)))
  want <- which(!out > 1)
  unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
Which is an order of magnitude faster than yours on the example data set, whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
It will be TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance
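A hedged usage sketch, assuming var0 was computed on the example dat above and that columns containing NAs should be kept:
dat[, !(var0 %in% TRUE)]  # keeps B, D, F, G; drops only the exactly-constant columns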
Use the caret package and the function nearZeroVar
require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ]
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]
Well, save yourself some coding time:
Rgames: foo
[,1] [,2] [,3]
[1,] 1 1e+00 1
[2,] 1 2e+00 1
[3,] 1 3e+00 1
[4,] 1 4e+00 1
[5,] 1 5e+00 1
[6,] 1 6e+00 2
[7,] 1 7e+00 3
[8,] 1 8e+00 1
[9,] 1 9e+00 1
[10,] 1 1e+01 1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
To avoid nasty floating-point roundoffs, take that output vector, which I'll call bar, and do something like bar[bar < 2 * .Machine$double.eps] <- 0; then dat[, as.logical(bar)] should give you the data frame you're after.
How about using factor to count the number of unique elements and looping with sapply:
dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
NAs are excluded by default, but this can be changed with the exclude parameter of factor:
dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on:
library(tidyverse)
df <- df %>%
  select(
    - {
      df %>%
        map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
        {which(. == 1)} %>%
        names()
    }
  )
I think this could be made shorter but I'm too tired!
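A hedged shorter variant, assuming dplyr >= 1.0.0 for select(where(...)), keeping the same rule of "more than one distinct value, counting NA":
df <- df %>%
  select(where(function(x) length(table(x, useNA = "ifany")) > 1))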
I think having zero variance is equivalent to being constant, and one can get by without doing any arithmetic operations at all. I would expect range() to outperform var(), but I have not verified this:
removeConstantColumns <- function(a_dataframe, verbose = FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return(0 != diff(range(x, na.rm = TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: ',
        ifelse(all(bkeep),
               'nothing',
               paste(names(a_dataframe)[!bkeep], collapse = ',')),
        ' removed', '\n')
  }
  return(a_dataframe[, bkeep])
}
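A brief usage sketch on the example dat from the question:
removeConstantColumns(dat, verbose = TRUE)
# note: G is treated as constant here because its single NA is removed by na.rm = TRUE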
Check this custom function. I did not try it on data frames with 100+ variables.
remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time()  # see how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>%
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()
  if (length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  } else {
    print("There are no low variance columns with this threshold")
  }
  # How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}
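A brief usage sketch: the function assumes dplyr, tidyr and purrr (or the tidyverse meta-package) are attached, and it only checks numeric columns.
library(tidyverse)
remove_low_variance_cols(dat, threshold = 0)
# removes C from the example dat; G survives because var() returns NA for it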
