How to remove near zero variance without employing the caret package?

I am new to programming. Here is the piece of code from which I am trying to remove the dependency on the caret package's nearZeroVar function:
N <- 200 # number of points per class
D <- 2 # dimensionality
K <- 4 # number of classes
X <- data.frame() # data matrix (each row = single example)
y <- data.frame() # class labels
...(some lines are omitted)...
X <- as.matrix(X)
Y <- matrix(0, N * K, K)
for (i in 1:(N * K)) { Y[i, y[i,]] <- 1}
...(some lines are omitted)...
nzv <- nearZeroVar(train)
nzv.nolabel <- nzv-1
inTrain <- createDataPartition(y=train$label, p=0.7, list=F)
training <- train[inTrain, ]
CV <- train[-inTrain, ]
X <- as.matrix(training[, -1])
N <- nrow(X)
y <- training[, 1]
K <- length(unique(y))
X.proc <- X[, -nzv.nolabel]/max(X)
D <- ncol(X.proc)
Xcv <- as.matrix(CV[, -1])
ycv <- CV[, 1]
Xcv.proc <- Xcv[, -nzv.nolabel]/max(X)
Y <- matrix(0, N, K)
So, to get rid of the nearZeroVar function, I have tried to use the Filter function and the following foo function:
foo <- function(data) {
  out <- lapply(data, function(x) length(unique(x)))
  want <- which(!out > 1)
  unlist(want)
}
nzv <- foo(trainingSet)
nzv.nolabel <- nzv - 1
But I get error messages: "Error in X[, training.nolabel]: incorrect number of dimensions. Execution halted" or something like "non-conformable arrays". Any ideas on how to work around `nearZeroVar` would be greatly appreciated. Please let me know if I should share more details.

It's not evident from the code posted how Filter() was being used.
Try the following:
# create sample data
R> df <- data.frame(a=1:10, b=sample(10:19), c=rep(5,10))
R> df
    a  b c
1   1 16 5
2   2 17 5
3   3 18 5
4   4 13 5
5   5 15 5
6   6 14 5
7   7 11 5
8   8 12 5
9   9 19 5
10 10 10 5
Create a custom function like:
R> zeroVarianceCol <- function(df) {
     Filter(var, df)
   }
Passing the data frame to this custom function, x <- zeroVarianceCol(df), will remove the zero-variance column, in this case column c. (Filter(var, df) keeps a column whenever its variance is non-zero, so strictly it drops constant columns; a column with a tiny but non-zero variance survives.)
R> x
    a  b
1   1 16
2   2 17
3   3 18
4   4 13
5   5 15
6   6 14
7   7 11
8   8 12
9   9 19
10 10 10
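Since the asker's downstream code needs column positions (nzv and nzv.nolabel) rather than a filtered data frame, here is a minimal index-returning sketch (it only flags constant columns; caret's nearZeroVar() also flags low-variability columns via its frequency-ratio rules, which this does not attempt):
constantCols <- function(df) {
  # positions of columns with at most one distinct value
  which(vapply(df, function(x) length(unique(x)) <= 1, logical(1)))
}
nzv <- constantCols(train)   # plays the role of nearZeroVar(train)
nzv.nolabel <- nzv - 1       # shift for the label column, as before
One caveat: if nzv comes back empty, X[, -nzv.nolabel] silently drops every column, so guard the subsetting with if (length(nzv) > 0).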

Related

How to increase performance when randomly selecting clusters and adding observations?

In a clustered dataset, I want to randomly pick some clusters and then add some simulated observations to the selected clusters. Then I want to create a dataset that combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters. I would also like to repeat this process many times and thus create many (maybe 1000) new datasets. I managed to do this using a for loop, but would like to know if there is a more efficient and concise way to accomplish this. Here is an example dataset:
## simulate some data
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z, cluster=w, x=x, y=y)
# id cluster x y
# 1 1 1 0.30003855 0.65325768
# 2 2 1 -1.00563626 -0.12270866
# 3 3 1 0.01925927 -0.41367651
# 4 4 1 -1.07742065 -2.64314895
# 5 5 1 0.71270333 -0.09294102
# 6 1 2 1.08477509 0.43028470
# 7 2 2 -2.22498770 0.53539884
# 8 3 2 1.23569346 -0.55527835
# 9 4 2 -1.24104450 1.77950291
# 10 5 2 0.45476927 0.28642442
# 11 1 3 0.65990264 0.12631586
# 12 2 3 -0.19988983 1.27226678
# 13 3 3 -0.64511396 -0.71846622
# 14 4 3 0.16532102 -0.45033862
# 15 5 3 0.43881870 2.39745248
# 16 1 4 0.88330282 0.01112919
# 17 2 4 -2.05233698 1.63356842
# 18 3 4 -1.63637927 -1.43850664
# 19 4 4 1.43040234 -0.19051680
# 20 5 4 1.04662885 0.37842390
cl <- split(dd, dd$cluster) ## split the data based on clusters
k <- length(dd$id)
l <- length(cl)
`%notin%` <- Negate(`%in%`) ## define "not in" to exclude unselected clusters so
## as to retain their original observations
The clsamp function below is then created with two for loops: the first excludes the unselected clusters, and the second simulates new observations and appends them to the selected clusters. Note that I randomly sample 2 clusters (10% of the total number of observations), without replacement:
clsamp <- function(cl, k) {
  a <- sample(cl, size=0.1*k, replace=FALSE)
  jud <- (names(cl) %notin% names(a))
  need <- names(cl)[jud]
  T3 <- NULL
  for (k in need) {
    T3 <- rbind(T3, cl[[k]])
  }
  subt <- NULL
  s <- a
  for (j in 1:2) {
    y <- rnorm(2)
    x <- rnorm(2)
    d <- cbind(id=nrow(a[[j]]) + c(1:length(x)),
               cluster=unique(a[[j]]$cluster), x, y)
    s[[j]] <- rbind(a[[j]], d)
    subt <- rbind(subt, s[[j]])
  }
  T <- rbind(T3, subt)
  return(T)
}
Finally, this creates a list of 5 datasets, each of which combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters:
Q <- vector(mode="list", length=5)
for (i in 1:length(Q)) {
Q[[i]] <- clsamp(cl, 20)
}
Does anyone know a shorter way to do this? Maybe using the replicate function? Thanks.
This generates a (2 * size) x 2 matrix of random values and cbinds sampled cluster names and consecutive ids to it. It starts directly from dd and also works when you convert dd to a matrix mm, which might be slightly faster. The output is a data frame, though. Instead of your k I use f to calculate directly the number of rows that should be added to each of the two selected clusters. In case size comes out zero, the original data frame is returned.
clsamp2 <- function(m, f=.1) {
  size <- round(nrow(m)*f)
  if (size == 0) as.data.frame(m)
  else {
    ids <- unique(m[,1])
    cls <- unique(m[,2])
    rd <- matrix(rnorm(size * 4), ncol=2, dimnames=list(NULL, c("x", "y")))
    out <- rbind.data.frame(m, cbind(id=rep(max(ids) + 1:size, each=2),
                                     cluster=sample(cls, 2), rd))
    `rownames<-`(out[order(out$cluster, out$id), ], NULL)
  }
}
Result
set.seed(42) ## same seed also used for creating `dd`
clsamp2(dd, .1)
## or
mm <- as.matrix(dd)
clsamp2(mm, .1)
# id cluster x y
# 1 1 1 -0.30663859 1.37095845
# 2 2 1 -1.78130843 -0.56469817
# 3 3 1 -0.17191736 0.36312841
# 4 4 1 1.21467470 0.63286260
# 5 5 1 1.89519346 0.40426832
# 6 1 2 -0.43046913 -0.10612452
# 7 2 2 -0.25726938 1.51152200
# 8 3 2 -1.76316309 -0.09465904
# 9 4 2 0.46009735 2.01842371
# 10 5 2 -0.63999488 -0.06271410
# 11 6 2 1.37095845 0.40426832
# 12 7 2 0.36312841 1.51152200
# 13 1 3 0.45545012 1.30486965
# 14 2 3 0.70483734 2.28664539
# 15 3 3 1.03510352 -1.38886070
# 16 4 3 -0.60892638 -0.27878877
# 17 5 3 0.50495512 -0.13332134
# 18 1 4 -1.71700868 0.63595040
# 19 2 4 -0.78445901 -0.28425292
# 20 3 4 -0.85090759 -2.65645542
# 21 4 4 -2.41420765 -2.44046693
# 22 5 4 0.03612261 1.32011335
# 23 6 4 -0.56469817 -0.10612452
# 24 7 4 0.63286260 -0.09465904
To create the list of five samples, you may use replicate.
replicate(5, clsamp2(dd, .1), simplify=FALSE)
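If the five samples need to be reproducible across runs, set the seed once before the call (the same pattern as the set.seed(42) used above):
set.seed(42)
Q <- replicate(5, clsamp2(dd, .1), simplify=FALSE)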
Running time is negligible.
system.time(replicate(1000, clsamp2(dd, .1), simplify=FALSE))
#    user  system elapsed
#    0.44    0.03    0.44

Use apply on two data.frames

If I had a data.frame X and wanted to apply a function foo to each of its rows, I would just run apply(X, 1, foo). This is all well-known and simple.
Now imagine I have another data.frame Y and the following function:
mean_of_sum <- function(x, y) {
  return(mean(x + y))
}
Is there a way to write an "apply equivalent" to the following loop:
my_loop_fun <- function(X, Y) {
  results <- numeric(nrow(X))
  for (i in 1:length(results)) {
    results[i] <- mean_of_sum(X[i,], Y[i,])
  }
  return(results)
}
If such an "apply syntax" exists, would it be more efficient than my "good" old loop?
This should work:
sapply(seq_len(nrow(X)), function(i) mean_of_sum(X[i,], Y[i,]))
You apply the function to the sequence 1, 2, ..., n (where n is the number of rows), and in each "iteration" you evaluate mean_of_sum for the i-th row.
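For this particular mean_of_sum there is also a fully vectorized route (a sketch, assuming X and Y are all-numeric data frames with identical dimensions): the mean of each row of the elementwise sum is simply
rowMeans(X + Y)   # elementwise sum, then one mean per row
With the example X and Y shown further down this returns 17 19 21 23 25. It also sidesteps the fact that mean() on a one-row data frame returns NA in current R, which is what the next answer works around with as.numeric().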
We can split every row of X and Y into lists and use mapply to apply the function, changing mean_of_sum a bit to convert each one-row data frame to numeric:
mean_of_sum <- function(x, y) {
  return(mean(as.numeric(x) + as.numeric(y)))
}
Consider an example,
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)
mapply(mean_of_sum, split(X, seq_len(nrow(X))), split(Y, seq_len(nrow(Y))))
#  1  2  3  4  5
# 17 19 21 23 25
where X and Y are
X
# a b
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Y
# c d
#1 11 16
#2 12 17
#3 13 18
#4 14 19
#5 15 20
So the first value, 17, is computed as
mean(c(1 + 11, 6 + 16))
#[1] 17
and so on for the next values.

R: loop exercise without NAs on output

I am practicing a simple R loop. From a vector "m" with values 1 to 20, I want to create a loop that saves a selected value in an object "a" and the remaining values in an object "b".
This is what I did:
a <- NULL
b <- NULL
m <- c(1:20)
for (i in m)
  if (i == 4) {
    a[[i]] <- i
  } else {
    b[[i]] <- i
  }
This is the output:
> a
[1] NA NA NA 4
> b
[1] 1 2 3 NA 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
My question is: how can I improve my loop code so that the output does not show NAs, without using the function na.omit?
Thanks
a <- NULL
b <- NULL
m <- c(1:20)
for (i in m) {
  if (i == 4) {
    a <- i
  } else {
    b <- append(b, i)
  }
}
This will put a single value (in this case 4) in object a, and will consecutively add the other values to b.
Result:
> a
[1] 4
> b
[1] 1 2 3 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Another way to solve it is with vector operations; we don't need a loop for this kind of selection problem.
In your case, we can use:
m <- c(1:20)
subset_with <- m[m == 4]      # returns the values matching (m == 4)
subset_without <- m[m != 4]   # returns the values not matching (m != 4)
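A related base-R idiom (a small sketch) gets both pieces in one call by splitting m on the logical condition:
split(m, m == 4)
# $`FALSE`
#  [1]  1  2  3  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#
# $`TRUE`
# [1] 4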
I hope this helps you.

Storing results of a for loop in a list or vector

add <- c(2, 3, 4)
for (i in add) {
  a <- i + 3
  b <- a + 3
  z <- a + b
  print(z)
}
# Result
[1] 13
[1] 15
[1] 17
In R, this prints the results, but I want to save them in a vector, data frame, or list for further computation.
Thanks in advance
Try something like:
add <- c(2, 3, 4)
z <- rep(0, length(add))
idx <- 1
for (i in add) {
  a <- i + 3
  b <- a + 3
  z[idx] <- a + b
  idx <- idx + 1
}
print(z)
This is simple algebra; there is no need for a for loop at all:
res <- (add + 3)*2 + 3
res
## [1] 13 15 17
Or if you want a data.frame
data.frame(a = add + 3, b = add + 6, c = (add + 3)*2 + 3)
# a b c
# 1 5 8 13
# 2 6 9 15
# 3 7 10 17
Though in general, when you are trying to do something like that, it is better to create a function, for example:
myfunc <- function(x) {
  a <- x + 3
  b <- a + 3
  z <- a + b
  z
}
myfunc(add)
## [1] 13 15 17
In cases when a loop is actually needed (unlike in your example) and you want to store its results, it is better to use the *apply family for such tasks. For example, use lapply if you want a list back:
res <- lapply(add, myfunc)
res
# [[1]]
# [1] 13
#
# [[2]]
# [1] 15
#
# [[3]]
# [1] 17
Or use sapply if you want a vector back
res <- sapply(add, myfunc)
res
## [1] 13 15 17
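If you want the result's type and length checked, vapply is a stricter variant of the same call (a small sketch using the myfunc defined above):
res <- vapply(add, myfunc, numeric(1))  # errors unless each call returns one number
res
## [1] 13 15 17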
For a data.frame to keep all the info
add <- c(2, 3, 4)
results <- data.frame()
for (i in add) {
  a <- i + 3
  b <- a + 3
  z <- a + b
  # print(z)
  results <- rbind(results, cbind(a, b, z))
}
results
  a  b  z
1 5  8 13
2 6  9 15
3 7 10 17
If you just want z, then use a vector; there is no need for lists:
add <- c(2, 3, 4)
results <- vector()
for (i in add) {
  a <- i + 3
  b <- a + 3
  z <- a + b
  # print(z)
  results <- c(results, z)
}
results
[1] 13 15 17
It might be instructive to compare these two results with those of @dugar:
> sapply(add, function(x) c(a=x+3, b=a+3, z=a+b))
  [,1] [,2] [,3]
a    5    6    7
b   10   10   10
z   17   17   17
That is the result of the anonymous function picking up the stale global values of a and b left over from earlier code (the names a=, b=, z= inside c() merely label the output; they do not create variables), and it sometimes trips us up when computing with intermediate values. This next one should give a slightly more expected result:
> sapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)))
  [,1] [,2] [,3]
a    5    6    7
b    8    9   10
z   13   15   17
Those results are the transpose of @dugar's. Using sapply or lapply often saves you the effort of setting up a zeroth-case object and then incrementing counters.
> lapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[[1]]
a b z
5 8 13
[[2]]
a b z
6 9 15
[[3]]
a b z
7 10 17

Flatten matrix in R to four columns (indexes and upper/lower triangles)

I'm using the cor.prob() function that's been posted several times around the mailing list to get a matrix of correlations (lower diagonal) and p-values (upper diagonals):
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
d <- data.frame(x=1:5, y=c(10,16,8,60,80), z=c(10,9,12,2,1))
> cor.prob(d)
           x           y           z
x         NA  0.04856042 0.107654038
y  0.8807155          NA 0.003523594
z -0.7953560 -0.97945703          NA
How would I collapse the above correlation matrix (with the correlations in the lower half, p-values in the upper half) into a four-column matrix: two indexes, the correlation, and the p-value? E.g.:
i j   cor   pval
x y   .88   .048
x z  -.79   .107
y z  -.97  .0035
I've seen the answer to a previous question like this, but it only gives me a three-column matrix, not a four-column one with separate columns for the p-value and the correlation.
Any help is appreciated!
Well, it's not a matrix, because you can't mix characters and numerics in one. But:
This is my first attempt (before your label swap):
m <- cor.prob(d)
ut <- upper.tri(m)
lt <- lower.tri(m)
d <- data.frame(i=rep(row.names(m), ncol(m))[as.vector(ut)],
                j=rep(colnames(m), each=nrow(m))[as.vector(ut)],
                cor=m[ut],
                p=m[lt])
Now apply the correction I suggested and you get:
d <- data.frame(i=rep(row.names(m), ncol(m))[as.vector(ut)],
                j=rep(colnames(m), each=nrow(m))[as.vector(ut)],
                cor=m[ut],
                p=t(m)[ut])
Finally, apply your label swap, use row()/col(), and write it as a function:
f1 <- function(m) {
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],
             p = m[ut])
}
then
m <- matrix(1:25, 5, dimnames=list(letters[1:5], letters[1:5]))
> m
  a  b  c  d  e
a 1  6 11 16 21
b 2  7 12 17 22
c 3  8 13 18 23
d 4  9 14 19 24
e 5 10 15 20 25
> f1(m)
   i j cor  p
1  a b   2  6
2  a c   3 11
3  b c   8 12
4  a d   4 16
5  b d   9 17
6  c d  14 18
7  a e   5 21
8  b e  10 22
9  c e  15 23
10 d e  20 24
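As a quick check against the four columns requested in the question (values read off the cor.prob(d) matrix shown earlier), f1 should return:
> f1(cor.prob(d))
  i j        cor           p
1 x y  0.8807155 0.048560420
2 x z -0.7953560 0.107654038
3 y z -0.9794570 0.003523594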
Can you explain what you expected if it wasn't this?
cd <- cor.prob(d)
dcd <- as.data.frame(which(row(cd) < col(cd), arr.ind=TRUE))
dcd$pval <- cd[row(cd) < col(cd)]
dcd$cor <- cd[row(cd) > col(cd)]
dcd[[2]] <- dimnames(cd)[[2]][dcd$col]
dcd[[1]] <- dimnames(cd)[[2]][dcd$row]
dcd
#--------------------
  row col        pval        cor
1   x   y 0.048560420  0.8807155
2   x   z 0.107654038 -0.7953560
3   y   z 0.003523594 -0.9794570
