Simple way of creating dummy variable in R

I want to know how a dummy variable can be created simply. I found many similar questions about dummy variables, but they are either based on external packages or quite technical.
I have data like this :
df <- data.frame(X=rnorm(10,0,1), Y=rnorm(10,0,1))
df$Z <- c(NA, diff(df$X)*diff(df$Y))
Z is a new variable within df, i.e. the product of the change in X and the change in Y.
Now I want to create a dummy variable D in df such that if Z < 0 then D = 1, and if Z > 0 then D = 0.
I tried it this way:
df$D <- NA
for(i in 2:10) {
  if(df$Z[i] < 0) {
    D[i] == 1
  }
  if(df$Z[i] > 0) {
    D[i] == 0
  }
}
This is not working.
I want to know why the above code is not working (and an easy way of doing this), and how dummy variables can be created in R without using any external packages, with a little bit of explanation.

Try:
df$D <- ifelse(df$Z < 0, 1, 0)
df
X Y Z D
1 -0.1041896 -1.11731404 NA NA
2 -1.4286604 1.42523717 -3.36753491 1
3 0.3931643 -0.05525477 -2.69719691 1
4 -0.2236541 1.64531526 -1.04894297 1
5 1.1725167 0.80063291 -1.17932089 1
6 0.7571427 0.64072381 0.06642209 0
7 0.4929186 1.25125268 -0.16131645 1
8 0.9715885 -0.54755653 -0.86103574 1
9 -0.2962052 -1.37459521 1.04851438 0
10 -1.4838675 -0.85788632 -0.61367565 1
The ifelse function takes 3 arguments: the condition to evaluate (df$Z < 0), the value if the condition is TRUE (1), and the value if the condition is FALSE (0). The function is vectorized, so it works well in this case.
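As to why the loop in the question does nothing: D[i] == 1 is a comparison, not an assignment, and D refers to a free-standing object rather than the column df$D. A minimal corrected version of that loop, kept close to the original purely for illustration, would be:
df$D <- NA
for (i in 2:nrow(df)) {
  if (df$Z[i] < 0) {
    df$D[i] <- 1   # assignment uses <-, not ==
  }
  if (df$Z[i] > 0) {
    df$D[i] <- 0
  }
}
The vectorized ifelse above avoids the loop entirely and is the more idiomatic choice.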

We can create a logical vector by df$Z < 0 and then coerce it to binary by wrapping with +.
df$D <- +(df$Z < 0)
Or as @BenBolker mentioned, the canonical options would be
as.numeric(df$Z < 0)
or
as.integer(df$Z < 0)
Benchmarks
set.seed(42)
Z <- rnorm(1e7)
library(microbenchmark)
microbenchmark(akrun = +(Z < 0), etienne = ifelse(Z < 0, 1, 0),
               times = 20L, unit = 'relative')
# Unit: relative
# expr min lq mean median uq max neval
# akrun 1.00000 1.00000 1.000000 1.00000 1.00000 1.000000 20
# etienne 12.20975 10.36044 9.926074 10.66976 9.32328 7.830117 20

You can try
df$D[df$Z < 0] <- 1
df$D[df$Z > 0] <- 0
But you should consider the possibility that Z can be 0.
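If you also want to handle that case (and the leading NA in Z), one possible sketch, assuming you want to code Z equal to 0 the same way as Z > 0, is:
df$D <- NA
df$D[df$Z < 0] <- 1
df$D[df$Z >= 0] <- 0  # assumption: Z == 0 coded as 0; the leading NA in Z stays NA in D
Whether a zero change should count as 0, 1 or NA depends on how you define the dummy.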

Related

If a value is X, how can I check if X is less than 0.05 and write "YES" or "NO" in another cell in a data.frame in R?

I have the following kind of file (variable "a"):
P OK
0.009109607206037 NA
0.296054274328919 NA
0.359366011629242 NA
4.77143881428015E-05 NA
0.002556197639041 NA
1.68489333654225E-05 NA
0.413536654401798 NA
7.8906355718309E-06 NA
0.183951454595559 NA
0.018652061230313 NA
9.62042790189634E-15 NA
0.151533362472736 NA
0.037140932397797 NA
0.350401082523352 NA
0.673474391454102 NA
0.000329419618776 NA
These data were generated in R in a data.frame; what I did was calculate the P-value. The final file has more than 5000 lines, so to make my life easier I put a marker as the last column, but I can't figure out how to write an if/else condition here.
So, I've tried:
If a$P<0.05
a$OK <- "Significant"
Else
a$OK <- "Not-Significant
But this didn't work... Can someone help me to fix this in R?
Use ifelse:
a$OK <- ifelse(a$P < 0.05, "Significant", "Non-Significant")
The ifelse function is vectorized, meaning that the above will populate the entire OK column in your data frame.
Tim Biegeleisen's answer is the canonical way of solving the problem but ifelse is known to be slow.
Here are two alternatives. They create an index and use it to get the values from a vector of strings.
The first uses a logical result and then adds 1 because R is one-based.
The second uses findInterval.
OK1 <- c("Significant", "Non-Significant")[(a$P >= 0.05) + 1]
OK2 <- c("Significant", "Non-Significant")[findInterval(a$P, c(0, 0.05, 1))]
OK3 <- ifelse(a$P < 0.05, "Significant", "Non-Significant")
identical(OK1, OK2) # TRUE
identical(OK1, OK3) # TRUE
Now some speed comparisons.
library(ggplot2)
library(microbenchmark)
mb <- microbenchmark(
  loginx = c("Significant", "Non-Significant")[(a$P >= 0.05) + 1],
  findInt = c("Significant", "Non-Significant")[findInterval(a$P, c(0, 0.05, 1))],
  ifelse = ifelse(a$P < 0.05, "Significant", "Non-Significant")
)
mb
#Unit: microseconds
# expr min lq mean median uq max neval
# loginx 14.450 15.8580 17.52272 16.7705 18.6525 63.106 100
# findInt 18.726 21.0170 23.00090 23.2135 24.3680 46.071 100
# ifelse 31.940 33.0065 33.70410 33.4330 33.9235 48.500 100
autoplot(mb)

How to define values of a data.frame that are outside a range as NA?

I am currently trying to create a routine that helps me clean my datasets. For some numeric / integer variables there is a range (min & max) of allowed values. Values that are not contained within that range should be declared NA.
My current code:
df$variable[df$variable < min.range && df$variable > max.range] <- NA
Or as an alternative:
df$variable[!df$variable %in% c(min.range:max.range)] <- NA
I am wondering which one would be more efficient since the datasets can be quite big and I want to keep the processing time as short as possible. Maybe there is even a better way to solve the problem.
Thank you in advance!
Your first way of doing it is wrong for 2 reasons:
Firstly, a value cannot be both < min.range and > max.range; you need an or there.
Secondly, you should not use the double && or ||: those only check the first element of the vectors.
You thus need to replace your first line of code by:
df$variable[df$variable < min.range | df$variable > max.range] <- NA
For the second way, it can only work with integers.
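To see why, note that min.range:max.range only contains whole numbers, so any non-integer value is treated as outside the range. For example, with the ranges used below:
min.range <- 85
max.range <- 115
85.5 %in% c(min.range:max.range)
# [1] FALSE -- 85.5 lies inside the range, but the second method would set it to NA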
Regarding the efficiency, you can test both your ways with a relatively large data.frame:
set.seed(123)
df <- data.frame(matrix(floor(rnorm(50000*1000, 100, 10)), nrow=50000))
colnames(df)[1] <- "variable"
min.range <- 85
max.range <- 115
meth1 <- function(){df$variable[df$variable < min.range | df$variable > max.range] <- NA; df}
meth2 <- function(){df$variable[!df$variable %in% c(min.range:max.range)] <- NA; df}
library(microbenchmark)
microbenchmark(meth1(), meth2(), unit="relative")
# expr min lq mean median uq max neval cld
# meth1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# meth2() 1.588484 1.603514 1.581301 1.597115 1.564948 1.481916 100 b
To sum up:
- modify your first method if you want to make it work
- don't use the second one if you are not working with integers
- even if you're working with integers, the first way is more efficient (a reusable wrapper along these lines is sketched below)
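If you apply this cleaning step to many columns, one way to package the first method into a small helper (the function name is purely illustrative) could be:
# Sketch of a helper based on the first method: replaces values outside [lo, hi] with NA
mark_outside_range <- function(x, lo, hi) {
  x[x < lo | x > hi] <- NA
  x
}
df$variable <- mark_outside_range(df$variable, min.range, max.range)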
You can get the execution time of your alternatives like this:
#processing time of option 1
system.time({
df$variable[df$variable < min.range && df$variable > max.range] <- NA
})
#processing time of option 2
system.time({
df$variable[!df$variable %in% c(min.range:max.range)] <- NA
})
(don't forget to reinitialize your df between the 2 tests)

Efficiently save indices of nonzero matrix elements to a file

I need to save the indices of a matrix's nonzero elements to a file. This works very well for small-sized matrices, storing the row numbers of the non-zero indices in a and the column numbers of the non-zero indices in b:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
But the size of the matrix is huge, and I need to find an efficient way to save the indices to a txt file, so that I have a[1] b[1] (new line) a[2] b[2] and so on. Any suggestions?
The package Matrix has a great solution for extremely large matrices. The sparseMatrix object can be summarized into a data.frame where i and j are your indices and x is the value:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
library(Matrix)
Y <- Matrix(X, sparse = TRUE)
(res <- summary(Y))
2 x 3 sparse Matrix of class "dgCMatrix", with 4 entries
i j x
1 1 1 1
2 2 1 4
3 1 3 3
4 2 3 5
class(res)
[1] "sparseSummary" "data.frame"
You can then subset to get just i and j:
res[, c("i", "j")]
i j
1 1 1
2 2 1
3 1 3
4 2 3
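Since the goal is a text file, these two columns can then be written out with write.table; a minimal sketch (the file name is just an example):
# write one "row col" pair per line, space-separated, without header or row names
write.table(res[, c("i", "j")], "indices.txt", row.names = FALSE, col.names = FALSE)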
You can grab the rows and columns of all non-zero locations using which with parameter arr.ind=TRUE, writing the result to a file with write.table:
write.table(which(X != 0, arr.ind=TRUE), "file.txt", row.names=F, col.names=F)
This yields space-separated output of the pairs of elements in the specified file:
1 1
2 1
1 3
2 3
Using which with arr.ind=TRUE saves a few scans through your input matrix compared to the code posted in your question, so it should be a bit quicker at calculating the data to output. You can see this with a benchmark for a larger matrix (1000 x 1000, with 1% density):
set.seed(144)
bigX <- matrix(sample(c(rep(0, 99), 1), 1000000, replace=T), nrow=1000)
OP <- function(X) cbind(row(X)[which(!X == 0)], col(X)[which(!X == 0)])
josilber <- function(X) which(X != 0, arr.ind=TRUE)
library(microbenchmark)
microbenchmark(OP(bigX), josilber(bigX))
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(bigX) 20.513535 23.014517 36.463423 25.354250 59.130520 65.50304 100
# josilber(bigX) 3.873165 4.281624 6.741824 5.250777 6.998415 45.02542 100
In this case we see about a 5x speedup in computing the non-zero rows and columns. Depending on the density and size of your matrix the output operation (write.table) might instead be the bottleneck, in which case there may not be too much benefit to this approach.

Memorize the last "correct" value of a sequence (for removing outliers)

I have a little problem in a function.
The aim of it is to remove outliers I've detected in my data.frame. They are detected when there is too big a difference from the previous correct value (e.g. c(1,2,3,20,30,4,5,6): "20" and "30" are the outliers). But my data is much more complex than this.
My idea is to consider the first two numeric values of my column as "correct". Then, I want to test each next value:
if the difference between the tested value and the previous one is <20, then it's a new correct one, and the test must start again from this new correct value (and not from the previous correct one)
if the same difference is >20, then it's a wrong one. An index must be put next to the wrong value, and the test must still continue from this same correct value, until a new correct value is detected
Here's an example with my function and a fake DF:
myts <- data.frame(x=c(12,12,35,39,46,45,33,5,26,28,29,34,15,15),z=NA)
test <- function(x){
  st1 = NULL
  temp <- st1[1] <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if((!is.na(x[i])) & (!is.na(x[i-1])) & (abs((x[i])-(temp)) > 20)){
      st1[i] <- 1
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
It does nearly the job, but the problem is that the last correct value is not memorized. It still does the test from the first correct value.
Because of this, rows 9 to 11 in my example are not detected. I'll let you imagine the problem on a 500,000-row data.frame.
How can I solve this little problem? The rest of the function may be OK.
You just need to update temp for any indices that aren't outliers:
test <- function(x) {
  temp <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))) {
    if(!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i]-temp) > 20) {
      st1[i] <- 1
    } else {
      temp <- x[i]
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
myts
# x z
# 1 12 0
# 2 12 0
# 3 35 1
# 4 39 1
# 5 46 1
# 6 45 1
# 7 33 1
# 8 5 0
# 9 26 1
# 10 28 1
# 11 29 1
# 12 34 1
# 13 15 0
# 14 15 0
One thing to note is that for loops in R will be quite slow compared to vectorized functions. However, because each element in your vector depends in a complicated way on the previous ones, it's tough to use R's built-in vectorized functions to compute the result efficiently. You can convert this code nearly verbatim to C++ and use the Rcpp package to regain the efficiency:
library(Rcpp)
test2 <- cppFunction(
  "IntegerVector test2(NumericVector x) {
    const int n = x.length();
    IntegerVector st1(n, 0);
    double temp = x[0];
    for (int i=1; i < n; ++i) {
      if (!R_IsNA(x[i]) && !R_IsNA(x[i-1]) && fabs(x[i] - temp) > 20.0) {
        st1[i] = 1;
      } else {
        temp = x[i];
      }
    }
    return st1;
  }")
all.equal(test(myts[,1]), test2(myts[,1]))
# [1] TRUE
# Benchmark on large vector with some NA values:
set.seed(144)
large.vec <- c(0, sample(c(1:50, NA), 1000000, replace=T))
all.equal(test(large.vec), test2(large.vec))
# [1] TRUE
library(microbenchmark)
microbenchmark(test(large.vec), test2(large.vec))
# Unit: milliseconds
# expr min lq mean median uq max neval
# test(large.vec) 2343.684164 2468.873079 2667.67970 2604.22954 2747.23919 3753.54901 100
# test2(large.vec) 9.596752 9.864069 10.97127 10.23011 11.68708 16.67855 100
The Rcpp code is about 250x faster on a vector of length 1 million. Depending on your use case this speedup may or may not be important.

Combining two vectors element-by-element

I have 2 vectors, such as these:
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
I would like to combine them so that the resulting vector is
1,2,3,4,NA,NA,-1
That is
when only 1 value (say X) in either vector at position i exists (the other being NA) the new vector should take the value X at position i.
when both values are NA at position i, the new vector should take the value NA at position i
when both vectors have a value at position i, the new vector should take the value -1 at position i.
I can easily do this with a loop, but it is very slow on a large dataset, so can anyone provide a fast way to do this?
These commands create the vector:
X <- A
X[is.na(A)] <- B[is.na(A)]
X[is.na(B)] <- A[is.na(B)]
X[!is.na(A & B)] <- -1
#[1] 1 2 3 4 NA NA -1
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
C <- rowMeans(cbind(A, B), na.rm=TRUE)
C[which(!is.na(A*B))] <- -1
#[1] 1 2 3 4 NaN NaN -1
Benchmarks:
Unit: microseconds
expr min lq median uq max
1 Roland(A, B) 17.863 19.095 19.710 20.019 68.985
2 Sven(A, B) 11.703 13.243 14.167 14.783 100.398
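For reference, a comparison like the one in the table can be produced with microbenchmark by wrapping each approach in a function. A minimal sketch (the wrapper names are placeholders, not a claim about which row of the table corresponds to which answer):
library(microbenchmark)
# first approach above: copy A, fill NAs from B, mark positions where both are present with -1
approach1 <- function(A, B) {
  X <- A
  X[is.na(A)] <- B[is.na(A)]
  X[is.na(B)] <- A[is.na(B)]
  X[!is.na(A & B)] <- -1
  X
}
# second approach above: rowMeans, then mark positions where both are present with -1
approach2 <- function(A, B) {
  C <- rowMeans(cbind(A, B), na.rm=TRUE)
  C[which(!is.na(A*B))] <- -1
  C
}
microbenchmark(approach1(A, B), approach2(A, B))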
A bit late to the party, but here is another option defining a function that works by applying rules to the two vectors cbind-ed together.
# get the data
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
# define the function
process <- function(A, B) {
  x <- cbind(A, B)
  apply(x, 1, function(x) {
    if(sum(is.na(x))==1) {na.omit(x)} else
    if(all(is.na(x))) {NA} else
    if(!any(is.na(x))) {-1}
  })
}
# call the function
process(A,B)
#[1] 1 2 3 4 NA NA -1
The main benefit of using a function is that it is easier to update the rules, or to apply the same code to new data.
