R: classify numbers by assigning labels

How can I convert numeric data to strings in R? I don't mean changing the data type, but classifying the values. Say I have 100 numbers between 0 and 1; if a number is > 0.5, I need to label it "Good", otherwise "Bad".

You could try
nums <- seq(0,1, by = .01)
res <- c('Bad', 'Good')[(nums > 0.5)+1]

Do you wish to do it using factors?
a=runif(100, 0, 1) > 0.5
b=factor(a, c(FALSE,TRUE), labels=c("Bad","Good"))
c=as.character(b)
Alternatively, if you just want to change the names in the vector, a, then:
a=runif(100, 0, 1) > 0.5
c=ifelse(a,"Good","Bad")
names(a)=c
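The three approaches above should give the same labels. A minimal sketch that checks this on the same input (the variable names is_good, by_index, by_factor and by_ifelse are introduced here for illustration only):

```r
# Same input as the first answer: 0, 0.01, ..., 1
nums <- seq(0, 1, by = 0.01)
is_good <- nums > 0.5

# 1. Index into a lookup vector: FALSE + 1 = 1 ("Bad"), TRUE + 1 = 2 ("Good")
by_index <- c("Bad", "Good")[is_good + 1]

# 2. Go through a factor, then back to character
by_factor <- as.character(
  factor(is_good, levels = c(FALSE, TRUE), labels = c("Bad", "Good"))
)

# 3. Vectorised ifelse
by_ifelse <- ifelse(is_good, "Good", "Bad")

# All three agree element by element
stopifnot(identical(by_index, by_factor), identical(by_index, by_ifelse))
table(by_index)
```

Which one to use is mostly a matter of taste; the factor route is probably the most natural if the labels feed into modelling or plotting later.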

Related

Conditionally assign a value to a random subset of a vector

I want to assign a defined value (let's say 1) to a random sample of a subset of a vector that meets certain conditions. I can't seem to make it work.
I have tried this code:
a <- c(1:50)
df <- as.data.frame(a)
df$c <- 0
df$c[sample(x=(df$c[df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
I would just like to randomly set some of the df$c values to 1: exactly a random sample of one fifth of the values in df$c for which df$a is greater than 25 (that would be 5 observations switched to 1).
But so far all of them remain 0 :/
Thanks!
Here's a way with base R -
df$c[sample(which(df$a > 25), sum(df$a > 25)/5)] <- 1
Be aware that this will fail if only one value of df$a is > 25.
The approach below will not fail in any case but is a bit verbose. Feel free to use whichever suits your needs best, depending on the expected values in df$a -
df$c[which(df$a > 25)[sample(length(which(df$a > 25)), sum(df$a > 25)/5)]] <- 1
Also, note that since replace = F, the sample size sum(df$a > 25)/5 must be <= length(which(df$a > 25)). You can add this check to your code if you want to make it even safer.
Also, there will be no change if sum(df$a > 25)/5 < 1, so you may want to use size = max(sum(df$a > 25)/5, 1) if you want at least 1 change.
Here's a nicer version of my first approach, thanks to @Frank -
df$c <- replace(df$c, sample(w <- which(df$a > 25), length(w)*.2), 1)
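The caveats above (a single qualifying value, a sample size below 1, a sample size larger than the pool) can be folded into one guarded version. This is only a sketch under those assumptions; idx, n and picked are names made up for illustration, and set.seed() is there just so the run is reproducible:

```r
set.seed(1)  # only to make the sketch reproducible
df <- data.frame(a = 1:50, c = 0)

# Row positions that satisfy the condition
idx <- which(df$a > 25)

# One fifth of them, but at least 1 and never more than are available
n <- min(length(idx), max(1, floor(length(idx) / 5)))

if (length(idx) > 0) {
  # Sample positions within idx, so a single qualifying row is handled correctly
  picked <- idx[sample.int(length(idx), n)]
  df$c[picked] <- 1
}

sum(df$c)  # number of rows switched to 1
```

Sampling positions with sample.int() rather than the values themselves avoids the special case where sample() receives a single number and treats it as 1:x.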
Not as elegant as the other solution you have but here's another way:
df <- data.frame('a' = c(1:50), 'c' = rep(0,50))
df$c[sample(
  # subset to sample
  df$a[df$a > 25],
  # sample size
  size = round(length(df$a[df$a > 25])/5, 0),
  # no replacement
  replace = F)] <- 1
Yours didn't work because you sample from df$c (whose values are all 0) rather than from df$a, so every sampled index is 0 and nothing gets assigned:
df$c[sample(x=( df$c [df$a>25]), size = round(NROW(df$c[df$a>25])/5), replace = F)] <- 1
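To see why the original attempt leaves everything at 0, it may help to look at what sample() actually returns there. A minimal sketch reusing the question's data (the names candidates and idx are introduced here for illustration):

```r
df <- data.frame(a = 1:50, c = 0)

# The question samples from the *values* of c, which are all 0 ...
candidates <- df$c[df$a > 25]
idx <- sample(candidates, size = 5, replace = FALSE)
idx

# ... and assigning through an index of 0 silently does nothing
df$c[idx] <- 1
sum(df$c)
```

Sampling from which(df$a > 25) instead yields row positions, which is what the subassignment needs.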

how to correlate 2 variables when X > 1

I have a data set and want to run a correlation between X and Y. However, I only want to look at X values that are greater than 1.
cor(Data$X, Data$Y, use = "complete.obs")
What argument do I add to run a correlation between X and Y only for the X values that are greater than 1?
You can subset using the [ operator.
Try this:
# Generate Example Data
Data <- data.frame(X = seq(-5, 10, 1),
Y = sample(1:100, 16))
with(data = Data[Data$X > 1, ], cor(X, Y, use = "complete.obs"))
[ lets us specify rows and columns in the style my.data.frame[rows, columns]. Here we are specifying that we want only rows where X > 1, but all columns. We could also do the following to ask for each column individually by name:
cor(Data[Data$X > 1, "X"], Data[Data$X > 1, "Y"], use = "complete.obs")
Or even the following to subset the column vectors:
cor(Data$X[Data$X > 1], Data$Y[Data$X > 1], use = "complete.obs")
Of course, these are only to illustrate the flexibility. It's best to subset the whole data set once to avoid discrepancies.
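One way to follow the "subset once" advice is to store the filtered rows and reuse them. A small sketch with the same example data (sub, r1 and r2 are illustrative names; subset() is used here only as an alternative spelling of the same filter):

```r
set.seed(1)  # only so the example data are reproducible
Data <- data.frame(X = seq(-5, 10, 1),
                   Y = sample(1:100, 16))

# Filter the rows once, then work with the result
sub <- Data[Data$X > 1, ]
r1 <- cor(sub$X, sub$Y, use = "complete.obs")

# Equivalent filter written with subset()
r2 <- with(subset(Data, X > 1), cor(X, Y, use = "complete.obs"))

stopifnot(isTRUE(all.equal(r1, r2)))
nrow(sub)  # rows that entered the correlation
```

Because X and Y come from the same filtered data frame, they cannot end up with mismatched lengths or conditions.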

R: Change Vector Output to Several Ranges

I am using Jenks Natural Breaks via the BAMMtools package to segment my data in RStudio Version 1.0.153. The output is a vector that shows where the natural breaks occur in my data set, as such:
[1] 14999 41689 58415 79454 110184 200746
I would like to take the output above and create the ranges inferred by the breaks. Ex: 14999-41689, 41690-58415, 58416-79454, 79455-110184, 110185-200746
Are there any functions that I can use in R Studio to accomplish this? Thank you in advance!
Input data
x <- c(14999, 41689, 58415, 79454, 110184, 200746)
If you want the ranges as characters you can do
y <- x; y[1] <- y[1] - 1 # First range given in question doesn't follow the pattern. Adjusting for that
paste(head(y, -1) + 1, tail(y, -1), sep = '-')
#[1] "14999-41689" "41690-58415" "58416-79454" "79455-110184" "110185-200746"
If you want a list of the actual sets of numbers in each range you can do
seqs <- Map(seq, head(y, -1) + 1, tail(y, -1))
You can definitely create your own function that produces the exact output you're looking for, but you can use the cut function that will give you something like this:
# example vector
x = c(14999, 41689, 58415, 79454, 110184, 200746)
# use the vector and its values as breaks
ranges = cut(x, x, dig.lab = 6)
# see the levels
levels(ranges)
#[1] "(14999,41689]" "(41689,58415]" "(58415,79454]" "(79454,110184]" "(110184,200746]"
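The two answers can also be combined: build the range strings with paste() and pass them to cut() as labels, so that each value is tagged with the range asked for in the question. A sketch under the assumption that the lowest break should be included in the first range; the vector vals is made-up test data:

```r
breaks <- c(14999, 41689, 58415, 79454, 110184, 200746)

# Lower ends start one above the previous break, except for the first range
lower <- head(breaks, -1) + 1
lower[1] <- breaks[1]
labs <- paste(lower, tail(breaks, -1), sep = "-")

# Hypothetical values to classify
vals <- c(14999, 41690, 60000, 200746)

# include.lowest = TRUE puts 14999 itself into the first range
ranges <- cut(vals, breaks = breaks, labels = labs, include.lowest = TRUE)
as.character(ranges)
```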

Numeric Matching / Extracting with Hard Coded Values in R

Having trouble understanding numeric matching / indexing in R.
If I have a situation where I create a dataframe such as:
options(digits = 3)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
and I wanted to compare a hardcoded value to my y column -
> TestDF[TestDF$y == 0.0230,]$x
numeric(0)
However, it works if I compare to the value taken straight out of the data frame (for an x value of 4.9, the y value should be 0.0230):
> TestDF[TestDF$y == TestDF[50,]$y,]$x
[1] 4.9
Does this have to do with exact matching? If I limit the digits to 3 decimal places, won't 0.0230000 be the same as the original value in y I'm comparing to? If not, is there a way around it if I do need to extract values based on rounded, hard-coded values?
You can use the round() function to round y to the desired number of decimal places before comparing. See below.
set.seed(1L)
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))
constant <- 0.023
TestDF[ with(TestDF, round(y, 3) == constant), ]
# x y
# 50 4.9 0.02302884
You can compare the rounded y with the stated value:
> any(TestDF$y == 0.0230)
[1] FALSE
> any(round(TestDF$y, 3) == 0.0230)
[1] TRUE
I'm not certain you grok the meaning of the digits option. From ?options it says about digits
digits: controls the number of significant digits to print when printing numeric values.
(emphasis mine.) So this only affects how the values are printed, not how they are stored.
You generated a set of reals, none of which are exactly 0.0230. This has nothing to do with exact matching. The value you indicated should be 0.0230 is actually stored as
> with(TestDF, print(y[50], digits = 22))
[1] 0.02302883835550340041465
regardless of the digits setting in options because that setting only affects the printed value. And the issue is not exact matching because even with the small fudge allowed by the recommended way to do comparisons, all.equal(), y[50] and 0.0230 are still not equal
> with(TestDF, all.equal(0.0230, y[50]))
[1] "Mean relative difference: 0.001253842"
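An alternative to rounding, not taken from the answers above, is to compare against the hard-coded value with an explicit tolerance. A minimal sketch; target and tol are assumed names, and the tolerance of 5e-4 is chosen to mirror rounding to 3 decimal places:

```r
x <- seq(from = 0, to = 5, by = 0.10)
TestDF <- data.frame(x = x, y = dlnorm(x))

target <- 0.023  # hard-coded value to look up
tol <- 5e-4      # half a unit in the third decimal place

# Keep rows whose y lies within the tolerance of the target
hits <- TestDF[abs(TestDF$y - target) < tol, ]
hits
```

The tolerance makes the intended precision explicit, which may be easier to reason about than relying on how the values happen to print.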

R: populate a vector incrementally

I'm trying to write a short function that assigns values and populates incrementally a vector, based on values in another vector.
For instance, if I have a vector of binaries a = [0,1,1,0,1], I want to create a vector b of the same length as a, that assigns a value x if a[i]=0, or a value y if a[i]=1. So b = [0.4,0.6,0.6,0.4,0.6]
I have done this:
a<-sample(0:1,20,replace=T)
assign<-function(x){
c<-vector()
for (i in 1:length(x)){
ifelse (x[[i]]>0,b<-0.6,b<-0.4)
c[[length(c)+1]]=b}
return (b)
}
but then
assign(a)
only returns the first assignment. I assume I didn't nest the loop correctly?
As you state that your vector a is binary, you can turn it into a vector of indices and use that "property":
bfroma <- function(x) c(0.4, 0.6)[x+1]
a <- c(0, 1, 1, 0, 1)
bfroma(a)
#[1] 0.4 0.6 0.6 0.4 0.6
Some comments on your code:
it is not advised to write ifelse (x[[i]] > 0, b <- 0.6, b <- 0.4); ifelse is not meant to be used like this (check ?ifelse again). Use b <- ifelse (x[[i]] > 0, 0.6, 0.4) instead.
I think you want return(c) rather than return(b);
use a different function name, assign will mask R's built-in one.
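Applying those three comments to the loop from the question gives something like the following. This is only a sketch of the corrected loop; the function name assign_values and the variable out are made up here to avoid masking assign() and c():

```r
# Loop version with the fixes applied: plain if/else for a single value,
# a preallocated result vector, and the whole vector returned at the end
assign_values <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- if (x[i] > 0) 0.6 else 0.4
  }
  out
}

a <- c(0, 1, 1, 0, 1)
assign_values(a)
```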
Anyway, I figured that the whole function can be replaced by
function (x) ifelse(x > 0, 0.6, 0.4)
or
function (x) {y <- rep(0.4, length(x)); y[x > 0] <- 0.6; y}
For your particular case, where the input vector is strictly 0-1 binary, we can do better. As Cath has already pointed out, indexing alone is enough:
function (x) c(0.4, 0.6)[x + 1L]
More generally, as long as x is discrete, we can use match to get position index and use fast replacement, too, but I will not elaborate on that here.
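The match-based idea mentioned above could look roughly like this. It is a sketch rather than the answerer's own code; map_values, keys and values are hypothetical names:

```r
# Look up each element of x in `keys` and return the value at the same position
map_values <- function(x, keys, values) {
  values[match(x, keys)]
}

a <- c(0, 1, 1, 0, 1)
map_values(a, keys = c(0, 1), values = c(0.4, 0.6))

# Works for any discrete keys, not just 0/1; unmatched inputs give NA
map_values(c("lo", "hi", "mid"), keys = c("lo", "hi"), values = c(0.4, 0.6))
```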
