I'm looking for an indicator function in R, i.e. a function that returns a 1, if the value of an element in a vector is greater than 0 and returns zero, if the value of an element in a vector is less than 0.
I need to use this function on all elements in a vector returning a new vector with only zeros and ones.
Thanks.
There are a variety of ways, the minimal keystroke one:
Ivec <- 0+(vec>0)
Saves a couple of keystrokes over: as.numeric(vec>0). I would guess the ifelse(x>0,1,0)-approach would be somewhat slower if applied to a large vector or if used in simulations. Could also use:
Ivec <- 1*(vec>0)
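A quick check of how these variants behave at the boundary, as a sketch on a made-up vector `vec` (not taken from the question). Note that an element equal to 0 maps to 0 under all three, and that NA stays NA:

```r
# Hypothetical input, chosen to include a negative value, zero, and NA
vec <- c(-2, 3, 1, 0, NA)

ind_add    <- 0 + (vec > 0)          # logical coerced to numeric by arithmetic
ind_cast   <- as.numeric(vec > 0)    # explicit conversion
ind_ifelse <- ifelse(vec > 0, 1, 0)  # element-wise branch

# All three approaches should agree element by element
print(identical(ind_add, ind_cast))
print(identical(ind_add, ind_ifelse))
```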
If I understand you correctly, you want to apply this to an entire data frame. In that case I suggest using apply as below, where df is your data frame.
apply(df,2,function(x)ifelse((x>0),1,0))
If it is for only one vector, you can use ifelse directly, like below:
x <- c(-2,3,1,0)
y <- ifelse(x>0,1,0)
print(y)
[1] 0 1 1 0 #Output
Hope this helps
The I function in R, called the Inhibit Interpretation/Conversion of Objects function, can be used for this purpose. For instance, the line below returns the values for the function I(x < 4) where X = {0, 1, 2, 3, 4, 5}:
> I(0:5 < 4)
[1] TRUE TRUE TRUE TRUE FALSE FALSE
In R TRUE and FALSE can be treated as 1 and 0s, but if you insist on your output being precisely those numbers, just wrap your I function into as.numeric.
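For illustration, a small sketch of that wrapping step. As far as I can tell, `I()` itself only adds the class "AsIs" to the logical vector, so the comparison does the real work and `as.numeric()` performs the conversion:

```r
flags <- I(0:5 < 4)               # logical vector carrying the extra class "AsIs"
ones_zeros <- as.numeric(flags)   # plain numeric vector of 1s and 0s

print(class(flags))
print(ones_zeros)
```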
There is also a ready-made indicator function, although it is not part of base R and has to come from an add-on package:
Indicator(x,min,max)
-Inf and Inf are still the valid values.
I have a vector of factors given by a sequence of numbers. These factors are also found in separate data sets, called test_set and train_set. The following code finds where the factor in the data sets matches the vector of factors and puts a 1 in the corresponding place of the matrix. Multiplying this matrix compound_test by test_set$Compound should give you compare_comp.
compare_comp <- rbind(dcm,cmp1)[,1]
compound_test <- matrix(0,nrow(test_set),length(compare_comp)) # test indicator matrix
compound_train <-matrix(0,nrow(train_set),length(compare_comp))
for (i in 1:length(compare_comp)){
compound_test[which(compare_comp[i]==test_set$Compound),i]=1
compound_train[which(compare_comp[i]==train_set$Compound),i]=1}
It does this for a train and test set, and compare_comp is the vector of factors.
Is there a function in R that lets me create the same thing without the need for a for loop? I have tried model.matrix(~Compound,data=test_set) without much luck.
While you may not be able to completely avoid iteration since you are comparing each element of compare_comp vector to the full vector of Compound in each test_set and train_set, you can however use more compact assignment with apply family functions.
Specifically, sapply returns a logical matrix of booleans (TRUE, FALSE) that we assign in corresponding position to initialized matrices where TRUE converts to 1 and FALSE to 0.
# SAPPLY AFTER MATRIX INITIALIZATION
compound_test2 <- matrix(0, nrow(test_set), length(compare_comp))
compound_train2 <- matrix(0, nrow(train_set), length(compare_comp))
compound_test2[] <- sapply(compare_comp, function(x) x == test_set$Compound)
compound_train2[] <- sapply(compare_comp, function(x) x == train_set$Compound)
Alternatively, the lesser-known vapply (similar to sapply, but you must define the output type and length) returns an equivalent matrix of numeric type.
# VAPPLY WITHOUT MATRIX INITIALIZATION
# each call returns one value per row of the data set, so FUN.VALUE has nrow() elements
compound_test3 <- vapply(compare_comp, function(x) x == test_set$Compound,
numeric(nrow(test_set)))
compound_train3 <- vapply(compare_comp, function(x) x == train_set$Compound,
numeric(nrow(train_set)))
Testing with random data (see demo below) confirms that both versions are identical to your looped version:
identical(compound_test1, compound_test2)
identical(compound_train1, compound_train2)
# [1] TRUE
# [1] TRUE
identical(compound_test1, compound_test3)
identical(compound_train1, compound_train3)
# [1] TRUE
# [1] TRUE
Online Demo
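As a further alternative that is not from the answer above, `outer()` can build the same 0/1 matrix in one call. This is a sketch on made-up data: `compare_comp` and `test_set` below are small hypothetical stand-ins for the real objects, and the values are plain numbers. If `Compound` is a factor, converting both sides with `as.character()` first is probably the safer route.

```r
# Hypothetical stand-ins for the real objects
compare_comp <- c(101, 102, 103)
test_set <- data.frame(Compound = c(102, 101, 103, 102))

# Loop version from the question, for comparison
loop_mat <- matrix(0, nrow(test_set), length(compare_comp))
for (i in seq_along(compare_comp)) {
  loop_mat[which(compare_comp[i] == test_set$Compound), i] <- 1
}

# outer() compares every row value with every entry of compare_comp;
# adding 0 turns the logical matrix into a numeric one
outer_mat <- outer(test_set$Compound, compare_comp, "==") + 0

print(all(loop_mat == outer_mat))
```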
I am normally a Maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In Maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Because R is not zero-indexed, add one so that every M[k] is at least 1
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work, and I think it is what you were trying to do, BUT it is not great R code. I strongly recommend using the lapply family instead of for loops, and learning data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234) # reset the seed so that the same random numbers are drawn
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, you get an error. What you probably want to use is a list, as you will see in the example below.
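A minimal sketch of the list idea, using made-up lengths in place of the simulated `M`: `[[k]]` selects the k-th vector in the list, and a following `[n]` selects the n-th element of that vector, which mirrors Maple's `v[1][n]`.

```r
# Hypothetical lengths standing in for the simulated M
M <- c(3, 1, 4)

# One list slot per k, each holding a vector of length M[k]
v <- vector("list", length(M))
for (k in seq_along(M)) {
  v[[k]] <- rep(NA_real_, M[k])
}

v[[3]][2] <- 0.5      # assign to the 2nd element of the 3rd vector
print(lengths(v))     # length of each vector in the list
print(v[[3]][2])
```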
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
# as in the question, x <= 0.1056379 selects the first log-normal distribution
X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
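One side effect of the vectorised approach is worth a quick check: a count of zero simply yields an empty vector, whereas a loop over `1:M[k]` would run over `c(1, 0)`. The sketch below uses made-up counts instead of `rnegbin` so that it runs in base R, and follows the thresholds from the question:

```r
set.seed(42)
M <- c(5, 0, 12)   # hypothetical counts; note the zero

calculate_X <- function(m) {
  x <- runif(m)
  # pick element-wise between the two log-normal draws
  ifelse(x <= 0.1056379,
         rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

X <- lapply(M, calculate_X)
print(lengths(X))   # one vector per element of M, with matching lengths
```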
I want to find the index of the outlier spotted by the grubbs.test function of the outliers package (I adapted it from another SO answer here)
where = function(x) which(x==as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3]))
It works by retrieving the number in the text displayed by the grubbs result. It's kind of a hack but it works well, let's say, for round numbers:
df=c(0, 3, rnorm(10))
where(df) #[1] 2
With decimal numbers, the digits shown in the text do not always match the actual number:
df=c(0, sqrt(10), rnorm(10))
where(df) # integer(0)
Does anyone have an idea how to fix that problem? Or another way to find the index of the biggest outlier from the Grubbs test? I'm trying to use this in a loop.
The problem is that strsplit returns strings instead of numbers. In your second example I get:
[1] "highest" "value" "3.16227766016838" "is" "an" "outlier"
but the third element is not really the character version of the number 3.16227766016838. In fact the real number returned from grubbs.test might have a lot more decimal places and this is why the == operator does not 'catch' it as an equality. This can be seen clearly here:
> a <- sqrt(10)
> a == as.numeric(as.character(a))
[1] FALSE
Is there a solution to this?
YES there is.
In order to tackle this problem just use the almost.equal function that I took the liberty to copy from this R-help post:
almost.equal <- function (x, y, tolerance=.Machine$double.eps^0.5,
na.value=TRUE)
{
answer <- rep(na.value, length(x))
test <- !is.na(x)
answer[test] <- abs(x[test] - y) < tolerance
answer
}
The above function is a vectorized form of the all.equal function which checks for an 'approximate' equality so that it captures cases like yours.
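To see why a tolerance is needed at all, here is a tiny base R sketch (the example value is mine, not from the question): a result that is mathematically 2 fails an exact comparison but passes a tolerance-based one.

```r
a <- sqrt(2)^2   # mathematically 2, but not exactly 2 in floating point

exact  <- a == 2                                  # exact comparison
approx <- abs(a - 2) < .Machine$double.eps^0.5    # same tolerance as almost.equal

print(c(exact = exact, approx = approx))
```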
Let's convert your function to:
where = function(x) {
which(almost.equal(x, as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3])))
}
And let's check it now:
> df=c(0, 3, rnorm(10))
> where(df)
[1] 2
And:
> df=c(0, sqrt(10), rnorm(10))
> where(df)
[1] 2
And you have a solution that works well with decimal numbers too!!
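Another option sidesteps the text parsing entirely. It is offered as a sketch rather than as part of the answer above, and it assumes that grubbs.test with its default arguments examines the single value lying farthest from the sample mean. Under that assumption the position can be computed directly in base R. The helper name `where_farthest` is made up for this illustration.

```r
# Index of the value with the largest absolute deviation from the mean
where_farthest <- function(x) which.max(abs(x - mean(x)))

set.seed(1)
df <- c(0, sqrt(10), rnorm(10) / 10)   # sqrt(10) is far from the other values
print(where_farthest(df))
```

Because no number is converted to text and back, decimal values pose no problem here.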
I am trying to use apply() to fill in an additional column in a dataframe by calling a function I created on each row of the data frame.
The dataframe is called Hit.Data and has 2 columns, Zip.Code and Hits. Here are a few rows
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZip, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncate them to the required characters and look up the value. What happens when I run this code, though, is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list"
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
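The second option might look like the sketch below. It uses plain nested lists instead of the hash package so that it runs in base R, and `zone_map`, `lookup_zone_safe` and the zip codes are hypothetical names and values chosen for illustration:

```r
# Hypothetical nested lookup table: source key -> destination key -> zone
zone_map <- list("890" = list("972" = 3, "101" = 3, "877" = 3))

lookup_zone_safe <- function(source_zip, dest_zip, map = zone_map) {
  zone <- map[[substr(source_zip, 1, 3)]][[substr(dest_zip, 1, 3)]]
  # a missing entry comes back as NULL; replace it with NA so the
  # results can still be collected into an atomic vector
  if (is.null(zone)) NA_real_ else zone
}

zips <- c("97222", "10100", "87700", "12345")   # the last one has no entry
zones <- vapply(zips, function(z) lookup_zone_safe("89000", z),
                numeric(1), USE.NAMES = FALSE)
print(typeof(zones))
print(zones)
```

Using vapply with `numeric(1)` also makes the call fail loudly if a lookup ever returns something other than a single number, instead of silently producing a list.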
As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))
I am not sure what I am doing wrong here.
ee <- eigen(crossprod(X))$values
for(i in 1:length(ee)){
if(ee[i]==0:1e^-9) stop("singular Matrix")}
Using the eigen value approach, I am trying to determine if the matrix is singular or not. I am attempting to find out if one of the eigen values of the matrix is between 0 and 10^-9. How can I use the if statement (as above) correctly to achieve my goal? Is there any other way to approach this?
What if I want to collect the zero eigenvalues in a vector?
zer <-NULL
ee <- eigen(crossprod(X))$values
for(i in 1:length(ee)){
if(abs(ee[i])<=1e-9)zer <- c(zer,ee[i])}
Can I do that?
@AriBFriedman is quite correct. I can, however, see a couple of other issues:
1e^-9 should be 1e-9.
0:1e-9 returns 0 (: creates a sequence in steps of one from 0 up to at most 1e-9, and therefore returns just 0). See ?`:` for more details
Using == with decimals will cause problems due to floating point arithmetic
In the form written, your code checks (individually) whether the elements ee[i] == 0, which is not what you want (nor does it make sense in terms of floating point arithmetic)
You are looking for cases where the eigen value is less than this small number, so use less than (<).
What you are looking for is something like
if(any(abs(ee) < 1e-9)) stop('singular matrix')
If you want to get the 0 (or small) eigenvalues, then use which
# this will give the indices (which elements are small)
small_values <- which(abs(ee) < 1e-9)
# and those small values
ee[small_values]
There is no need for the for loop as everything being done is vectorized.
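A small self-contained check of this approach, as a sketch on matrices I made up for the purpose: one with linearly dependent columns and one identity matrix. The threshold 1e-9 is the absolute cut-off from the question, and whether it is appropriate depends on the scale of the matrix.

```r
# Second column is twice the first, so crossprod(X) is singular
X <- cbind(1:3, 2 * (1:3))
ee <- eigen(crossprod(X))$values
is_singular <- any(abs(ee) < 1e-9)

# Identity matrix for contrast: both eigenvalues are 1
ee_ok <- eigen(crossprod(diag(2)))$values
is_singular_ok <- any(abs(ee_ok) < 1e-9)

print(c(dependent = is_singular, identity = is_singular_ok))
print(which(abs(ee) < 1e-9))   # positions of the near-zero eigenvalues
```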
if takes a single argument of length 1.
Try either ifelse or using any() or all() to turn your vector of logicals into a logical vector of length 1.
Here's an example reproducing your data:
X <- matrix(1:10,1:10)
ee <- eigen(crossprod(X))$values
This will test if any of the values of ee are > 0 AND < 1e-9
if (any((ee > 0) & (ee < 1e-9))) {stop("singular matrix")}