Finding a value in an interval - r

Sorry if this is a basic question. I have been trying to figure this out but haven't been able to.
I have a vector of values called sym.
> head(sym)
[,1]
val 3.652166e-05
val -2.094026e-05
val 4.583950e-05
val 6.570184e-06
val -1.431486e-05
val -5.339604e-06
I put these into intervals by applying factor() to the result of cut() on sym.
factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))
[1] (2.82e-05,5.28e-05] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06]
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (7.74e-05,0.000102]
Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]
So clearly, four intervals were created in factorx. Now I have a new value tmp=3.7e-06.
My question is: how can I find which of the above intervals it belongs to? I tried to use findInterval() but it seems it does not work on factors like factorx.
Thanks

If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a number of intervals. Note that had those values been in the set originally, you might have got different breaks, and it is possible that your new values fall outside all the levels of your existing data, which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different ways to calculate breaks. Here are b1() and b2():
b1 <- function(x, n = nclass.Sturges(x)) {
  # like default cut()
  nb <- as.integer(n + 1)
  dx <- diff(rx <- range(x, na.rm = TRUE))
  if (dx == 0)
    dx <- abs(rx[1L])
  seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
          length.out = nb)
}
b2 <- function(x, n = nclass.Sturges(x)) {
  # like default hist()
  pretty(range(x), n = n)
}
So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor
mybreaks <- b1(x)
factorx <- cut(x, breaks=mybreaks)
Note that you don't have to wrap cut() in factor(), as cut() already returns a factor. Now, if you get new values, you can classify them using findInterval() and the breaks vector you've already prepared:
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() tells you which level of the factor you created earlier each new value belongs to. It returns 0 if the value is smaller than the first break, and length(mybreaks), which is one more than the number of levels, if the value is at or beyond the last break; neither of those maps to a level of factorx. This is somewhat different from cut(), which returns NA for values outside the breaks. Also, the intervals in cut() are right-closed by default, whereas findInterval() leaves the right end of each interval open, so a value exactly equal to a break can be classified differently.
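To make the edge behaviour concrete, here is a small sketch that compares cut() and findInterval() on a made-up breaks vector (brks and vals are hypothetical, not taken from the question). It also uses the left.open argument of findInterval(), available in R 3.3.0 and later, which switches findInterval() to right-closed intervals like the cut() default.

```r
# Hypothetical break points, chosen only for illustration
brks <- c(0, 1, 2, 3)
vals <- c(-0.5, 0.5, 1, 2.5, 3, 3.5)

# cut(): intervals are (a, b] by default; out-of-range values become NA
cut_codes <- as.integer(cut(vals, breaks = brks))

# findInterval(): intervals are [a, b) by default;
# values below the first break give 0, values at or above the last give length(brks)
fi_default <- findInterval(vals, brks)

# left.open = TRUE switches to (a, b], which agrees with cut() inside the range
fi_left_open <- findInterval(vals, brks, left.open = TRUE)

print(data.frame(vals, cut_codes, fi_default, fi_left_open))
```

The rows for 1 and 3 are the interesting ones: they sit exactly on a break, so the default findInterval() result is one higher than the cut() code, while the left.open = TRUE result matches it.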

Related

How to concatenate NOT as character in R?

I want to concatenate iris$SepalLength, so I can use that in a function to get the Sepal Length column from the iris data frame. But when I use the paste function, paste("iris$", colnames(iris[3])), the result is a character string (with quotes), such as "iris$SepalLength". I need the result not as a character. I have tried noquote(), as.data.frame(), etc., but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for (count in 1:length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid explicit for loops in R where you can; there is a whole family of functions for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
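If the goal is only to fetch a column whose name is held in a string, a common alternative to eval(parse()) is the [[ operator. A minimal sketch using the built-in iris data; the variable name col_name is made up for illustration:

```r
col_name <- "Sepal.Length"   # hypothetical variable holding the column name

# `[[` looks the column up by name, so no code string has to be parsed
by_name <- iris[[col_name]]

# the same column obtained by evaluating the pasted string
by_parse <- eval(parse(text = paste0("iris$", col_name)))

identical(by_name, by_parse)
```

Both routes give the same numeric vector, but the [[ form avoids building and evaluating code as text.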

initialise multiple variables at once in R [duplicate]

I am using the example of calculating the length of an arc of a circle and the area under that arc, based on the radius of the circle (r) and the angle of the arc (theta). The area and the length are both based on r and theta, and you can calculate them simultaneously in python.
In python, I can assign two values at the same time by doing this.
from math import pi
def circle_set(r, theta):
return theta * r, .5*theta*r*r
arc_len, arc_area = circle_set(1, .5*pi)
Implementing the same structure in R gives me this.
circle_set <- function(r, theta){
return(theta * r, .5 * theta * r *r)
}
arc_len, arc_area <- circle_set(1, .5*3.14)
But returns this error.
arc_len, arc_area <- circle_set(1, .5*3.14)
Error: unexpected ',' in "arc_len,"
Is there a way to use the same structure in R?
No, you can't do that in R (at least, not in base or any packages I'm aware of).
The closest you could come would be to assign objects to different elements of a list. If you really wanted, you could then use list2env to put the list elements in an environment (e.g., the global environment), or use attach to make the list elements accessible, but I don't think you gain much from these approaches.
If you want a function to return more than one value, just put them in a list. See also r - Function returning more than one value.
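As a rough sketch of the list approach applied to the question's function (the element names arc_len and arc_area are taken from the question; using list2env() afterwards is optional):

```r
# Return both results in a named list
circle_set <- function(r, theta) {
  list(arc_len = theta * r, arc_area = 0.5 * theta * r^2)
}

res <- circle_set(1, 0.5 * pi)
res$arc_len    # arc length
res$arc_area   # area of the sector

# Optional: copy the list elements into variables of the current environment
invisible(list2env(res, envir = environment()))
arc_len
arc_area
```

Keeping the results in res is usually enough; the list2env() step only matters if you really want two separate variables.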
You can assign multiple variables the same value as below. Even here, I think the code is unusual and less clear, and I think this outweighs any benefit of brevity. (Though I suppose it makes it crystal clear that all of the variables are the same value... perhaps in the right context it makes sense.)
x <- y <- z <- 1
# the above is equivalent to
x <- 1
y <- 1
z <- 1
As Gregor said, there's no way to do it exactly as you said and his method is a good one, but you could also have a vector represent your two values like so:
# Function that adds one value and returns a vector of all the arguments.
plusOne <- function(vec) {
vec <- vec + 1
return(vec)
}
# Creating variables and applying the function.
x <- 1
y <- 2
z <- 3
vec <- c(x, y, z)
vec <- plusOne(vec)
So you could put your values in a vector and have your function return a vector, which in effect fills 3 values at once. Again, not exactly what you want, just a suggestion.

Conversion: Can we allow user to provide input for a function in percentages in R?

Question:
I'm wondering how I can set things up so that if a user of a function provides an input value as a percentage, the function converts it to the corresponding ordinary numeric value before using it.
Note: By percentage as input value, I literally mean a user could put 12.5% or 1.5% etc. and so these values be converted to .125 and .015.
A simple annotated R example is below:
Accept.percent = function(x) { # "x" can be as small as ".013" & as large as ".85"
# but user can provide "1.3%" to "85%"
cdf <- pbeta(x, 4, 2) # and we can convert the input values
# provided as percentages to a corresponding
return(cdf) # ordinary numeric values before use.
}
## Example of use:
Accept.percent(.013)
Sure, you can cut off the % sign, convert to a numeric variable and divide by 100
x2 <- as.numeric(substr(x, 1, nchar(x)-1)) / 100
Then use x2 in your call to pbeta()
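Putting that together, a minimal sketch might look like the following. It assumes the user passes percentages as character strings such as "1.3%" (a bare 1.3% is not valid R syntax), and to_proportion is a hypothetical helper name:

```r
# Convert inputs such as "12.5%" to 0.125; plain numbers pass through unchanged
# (to_proportion is a hypothetical helper name)
to_proportion <- function(x) {
  if (is.character(x)) {
    is_pct <- grepl("%$", x)               # does the string end in a percent sign?
    num <- as.numeric(sub("%$", "", x))    # drop the sign and convert
    ifelse(is_pct, num / 100, num)
  } else {
    x
  }
}

Accept.percent <- function(x) {
  x <- to_proportion(x)
  pbeta(x, 4, 2)
}

Accept.percent("1.3%")   # should agree with the numeric call below
Accept.percent(.013)
```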
Do you need to allow for the percentage sign too? If not, you can do it by adding a condition and a corresponding argument:
my_fun <- function(x, percentage = FALSE){
if (percentage) x <- x/100
# rest of function goes here
}

Allocate people to teams based on 33/67% percentile of variable

I have a dataset where I would like to allocate people to different groups based on criteria; however, I would like R to do this automatically. I have split my variables into three bands: at or below the 33rd percentile, at or above the 67th percentile, and everything else.
dfOCEAN <-df[1:60,1:7]
print(colnames(dfOCEAN))
dfOCEAN <- dfOCEAN[complete.cases(dfOCEAN),]
i = 0
for(i in 1:length(dfOCEAN$factor_e)){
if(dfOCEAN$factor_e[i] <= quantile(dfOCEAN$factor_e, c(.33))){
dfOCEAN$Introversion[i] <- 1
}
else if(dfOCEAN$factor_e[i] >= quantile(dfOCEAN$factor_e, c(.67))){
dfOCEAN$Introversion[i] <- 2
}
else
dfOCEAN$Introversion[i] <- 3
}
i = 0
for(i in 1:length(dfOCEAN$factor_c)){
if(dfOCEAN$factor_c[i] <=quantile(dfOCEAN$factor_c, c(.33))){
dfOCEAN$Conscientious[i] <- 1
}
else if(dfOCEAN$factor_c[i] >= quantile(dfOCEAN$factor_c, c(.67))){
dfOCEAN$Conscientious[i] <- 2
}
else
dfOCEAN$Conscientious[i] <- 3
}
Then I am trying to create random samples with Dplyr's slice function.
dfOCEANset <- dfOCEAN %>% group_by(c(Introversion, Conscientious)) %>% slice(sample(c(1,2),1))
However, I am unable to get the desired results. Ideally, I would retrieve a data frame in which the data are clustered by combination of the different categories and the names remain.
Try this loop-less (but untested in the absence of a reproducible example) method:
dfOCEAN$fac_grp <- c(1,3,2)[ findInterval( dfOCEAN$factor_e,
                              quantile( dfOCEAN$factor_e, c(0, .33, .67)) ) ]
R is intended to be used as a "vectorized" language and both the findInterval and quantile functions will return vectors, with findInterval giving a vector the same length as its first argument. You added a little wrinkle in asking us to arrange in a rather unnatural manner, which I handled by using the result from findInterval as an index into a three-item vector. The other function that does something similar (but returns a factor) is the cut function.
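Since the question has no reproducible data, here is a small sketch of the same idea on simulated scores. The data frame below is a hypothetical stand-in for dfOCEAN, and qs is an illustrative name. Note that a value exactly equal to a quantile may land in a different band than in the loop version, because findInterval() uses left-closed intervals.

```r
set.seed(1)
# Hypothetical stand-in for dfOCEAN with a single score column
dfOCEAN <- data.frame(factor_e = rnorm(60))

qs <- quantile(dfOCEAN$factor_e, c(0, .33, .67))

# findInterval() gives 1, 2 or 3 for the bottom, middle and top band;
# indexing c(1, 3, 2) recodes them to the question's labels
# (1 = low, 3 = middle, 2 = high)
dfOCEAN$fac_grp <- c(1, 3, 2)[findInterval(dfOCEAN$factor_e, qs)]

table(dfOCEAN$fac_grp)
```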

add exact proportion of random missing values to data.frame

I would like to add random NA to a data.frame in R. So far I've looked into these questions:
R: Randomly insert NAs into dataframe proportionaly
How do I add random NAs into a data frame
add random missing values to a complete data frame (in R)
Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:
1. Add truly random NA, and not the same amount by row or by column.
2. Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts..), so the output must have the same format as the input data.frame or matrix.
3. Guarantee an exact number or proportion [note] of NA in the output (many solutions result in a smaller number of NA since several are generated at the same place).
4. Be computationally efficient for big datasets.
5. Add the proportion/number of NA independently of NA already present in the input.
Does anyone have an idea?
I have already tried to write a function to do this (in an answer of the first link) but it doesn't comply with points N°3&4.
Thanks.
[note] the exact proportion, rounded at +/- 1NA of course.
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion when n * p * pctNA is not a whole number, because the count is rounded down.
createNAs <- function (x, pctNA = 0.1) {
n <- nrow(x)
p <- ncol(x)
NAloc <- rep(FALSE, n * p)
NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
x[matrix(NAloc, nrow = n, ncol = p)] <- NA
return(x)
}
Obviously you should use a random seed for reproducibility, which can be specified before the function call.
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete. So, I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p)
Some users reported that Alex's answer did not address condition N°5 of my question. Indeed, when adding random NA on a dataframe that already contains missing values, the new ones will sometimes fall on the initial ones, and the final proportion will be somewhere between initial proportion and desired proportion... So I expand on Alex's function to comply with all 5 conditions:
I modify his createNAs function so that it enables one of 3 options:
option complement: complement with NA up to the desired %
option add : add % of NA in addition to those already present
option none : add a % of NA regardless of those already present
For the first two options (complement and add), the function calls itself repeatedly until the desired proportion of NA is reached:
createNAs <- function (x, pctNA = 0.0, option = "add"){
  prop.NA = function(x) sum(is.na(x))/prod(dim(x))
  initial.pctNA = prop.NA(x)
  if ( (option == "complement") && (initial.pctNA > pctNA) ){
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }
  if ( (option == "none") || (initial.pctNA == 0) ){
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else { # any option other than "none":
    target = ifelse(option == "complement", pctNA, pctNA + initial.pctNA)
    while (prop.NA(x) < target) {
      prop.remaining.to.add = target - prop.NA(x)
      # stop once less than one whole cell remains to be added;
      # otherwise floor() would add nothing and the loop would never end
      if (floor(prod(dim(x)) * prop.remaining.to.add) < 1) break
      x = createNAs(x, prop.remaining.to.add, option = "none")
    }
    return(x)
  }
}
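A possible alternative to the repeated sampling above is to draw only from cells that are not already NA, so the requested number of new NAs is reached in one pass. This is a sketch under stated assumptions rather than a drop-in replacement: addNAs is a hypothetical name, it implements only the "add" behaviour, and the count is rounded down as in the functions above.

```r
# addNAs is a hypothetical name; it adds floor(n * p * pctNA) new NAs
# on top of any that are already present in x
addNAs <- function(x, pctNA = 0.1) {
  present <- which(!is.na(x))                 # linear indices of non-missing cells
  n_new <- min(floor(prod(dim(x)) * pctNA), length(present))
  pick <- present[sample.int(length(present), n_new)]
  mask <- matrix(FALSE, nrow = nrow(x), ncol = ncol(x))
  mask[pick] <- TRUE
  x[mask] <- NA                               # logical-matrix indexing, as in createNAs
  x
}

set.seed(42)
df <- data.frame(a = c(1, NA, 3, 4, 5), b = letters[1:5], stringsAsFactors = FALSE)
out <- addNAs(df, pctNA = 0.5)
sum(is.na(out)) - sum(is.na(df))   # number of NAs added
```

Because the sample is taken without replacement from the non-missing cells only, new NAs never land on existing ones, which is what condition 5 asks for.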

Resources