Allocate people to teams based on 33/67% percentile of variable - r

I have a dataset where I would like to allocate people to different groups based on criteria; however, I would like R to do this automatically. I have split my variables into three bands: at or below the 33rd percentile, at or above the 67th percentile, and everything else.
dfOCEAN <-df[1:60,1:7]
print(colnames(dfOCEAN))
dfOCEAN <- dfOCEAN[complete.cases(dfOCEAN),]
i = 0
for(i in 1:length(dfOCEAN$factor_e)){
if(dfOCEAN$factor_e[i] <= quantile(dfOCEAN$factor_e, c(.33))){
dfOCEAN$Introversion[i] <- 1
}
else if(dfOCEAN$factor_e[i] >= quantile(dfOCEAN$factor_e, c(.67))){
dfOCEAN$Introversion[i] <- 2
}
else
dfOCEAN$Introversion[i] <- 3
}
i = 0
for(i in 1:length(dfOCEAN$factor_c)){
if(dfOCEAN$factor_c[i] <=quantile(dfOCEAN$factor_c, c(.33))){
dfOCEAN$Conscientious[i] <- 1
}
else if(dfOCEAN$factor_c[i] >= quantile(dfOCEAN$factor_c, c(.67))){
dfOCEAN$Conscientious[i] <- 2
}
else
dfOCEAN$Conscientious[i] <- 3
}
Then I am trying to create random samples with dplyr's slice function.
dfOCEANset <- dfOCEAN %>% group_by(c(Introversion, Conscientious)) %>% slice(sample(c(1,2),1))
However, I am unable to get the desired results. Ideally, I would retrieve a data frame in which the rows are grouped by each combination of the categories and the names are retained.

Try this loop-less (but untested in the absence of a reproducible example) method:
dfOCEAN$fac_grp <- c(1,3,2)[ findInterval( dfOCEAN$factor_e,
quantile( dfOCEAN$factor_e, c(0, .33, .67))
) ]
R is intended to be used as a "vectorized" language, and both the findInterval and quantile functions return vectors, with findInterval giving a vector the same length as its first argument. You added a little wrinkle by asking for the groups to be numbered in a rather unnatural order (1 for low, 3 for middle, 2 for high), which I handled by using the result from findInterval as an index into a three-item vector. The other function that does something similar (but returns a factor) is the cut function.
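As a hedged illustration of the two functions named above, the sketch below runs findInterval() and cut() on made-up scores. The vector scores and the labels "low", "middle" and "high" are assumptions standing in for dfOCEAN$factor_e; they are not data from the question.

```r
# Hypothetical scores standing in for dfOCEAN$factor_e
scores <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)

# Break points at the minimum, the 33rd and the 67th percentile
brks <- quantile(scores, c(0, .33, .67))

# findInterval() returns 1, 2 or 3 (low, middle, high band);
# c(1, 3, 2) relabels them as 1 = low, 3 = middle, 2 = high
grp <- c(1, 3, 2)[findInterval(scores, brks)]

# cut() does a similar job but returns a labelled factor
fac <- cut(scores,
           breaks = c(-Inf, brks[2], brks[3], Inf),
           labels = c("low", "middle", "high"))

print(data.frame(scores, grp, fac))
```

Note that the two functions treat a value lying exactly on a break point differently: findInterval() uses intervals closed on the left, while cut() by default uses intervals closed on the right.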

How to concatenate NOT as character in R?

I want to build the expression iris$Sepal.Length from its parts, so I can use it in a function to get the Sepal.Length column from the iris data frame. But when I use the paste function, paste0("iris$", colnames(iris[1])), the result is a character string (with quotes), "iris$Sepal.Length". I need the result not to be a character string. I have tried noquote(), as.data.frame(), etc., but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display the values of iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display the values of iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for(count in 1:length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R where you can; the apply family of functions exists for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
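As a hedged aside to the eval(parse()) approach above, a column can also be pulled by name with [[ ]], which avoids building code as a string. The sketch below compares the two and then builds one frequency table per numeric column with lapply(); the names col_name, num_cols and tabs are illustrative, and the floor()/ceiling() break points are an assumption rather than part of the answer.

```r
# Column name held in a string, as in the question
col_name <- colnames(iris)[1]                      # "Sepal.Length"

# Two ways to get the column from the string
by_name <- iris[[col_name]]                        # direct lookup by name
by_eval <- eval(parse(text = paste0("iris$", col_name)))
same <- identical(by_name, by_eval)

# One frequency table per numeric column, without an explicit loop
num_cols <- names(iris)[sapply(iris, is.numeric)]
tabs <- lapply(iris[num_cols], function(x) {
  # unit-width bins from floor(min) to ceiling(max), closed on the left
  brks <- seq(floor(min(x)), ceiling(max(x)), by = 1)
  table(cut(x, breaks = brks, right = FALSE))
})

print(same)
print(tabs[["Sepal.Length"]])
```

Selecting the numeric columns with is.numeric() avoids hard-coding 1:4, so the same call should work on other data frames that mix numeric and non-numeric columns.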

R programming Function (Returning a subset of Real Mean Squared)

I am new to R and am working on writing some cool functions while I learn statistics in parallel. I'm trying to make a function that will take a numeric vector, perform the "root mean squared" operations and then have the output return essentially same vector with the possible outliers removed.
For example, if the vector is c(2,4,9,10,100) the resulting RMS would be about 37.6.
Therefore, I want the output to return the same vector with the possible outlier (in this case, 100) removed from the dataset. So the result would be 2, 4, 9, 10.
I put my code below but the output isn't working. I tried it 2 different ways. Everything up to the line that assigns RMS_FINAL works, but nothing below it does.
How can I modify this function so that it does what I want? Also, as a bonus, and this might be asking a lot but based on my coding below, any tips for a newbie on making functions would be something I'd be grateful for as well. Thanks so much!
RMS_x <- c(2,4,9,10,100)
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
for(i in length(RMS_x)){
if(abs(RMS_x[i]) > RMS_FINAL){
output <- RMS_x[i]}
else {NULL} }
return(output)
}
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
#output <- ifelse(abs(RMS_x) > RMS_FINAL,RMS_x, NULL)
return(RMS_FINAL)
}
Try the following in the first lines of the RMS function.
RMS <- function(RMS_x) {
bp <- boxplot(RMS_x, plot = FALSE)
RMS_x <- RMS_x[!(RMS_x %in% bp$out)]
...
Now you have RMS_x sans the outliers.
The boxplot function has its own way of determining the outliers, which it returns in the out component. Here, I am using that to remove them.
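A minimal sketch of that idea on the vector from the question, run outside of any function (the names flagged and RMS_clean are illustrative):

```r
RMS_x <- c(2, 4, 9, 10, 100)

# plot = FALSE returns the boxplot statistics without drawing anything;
# the out component holds the points beyond the whiskers
bp <- boxplot(RMS_x, plot = FALSE)
flagged <- bp$out

# keep everything that was not flagged
RMS_clean <- RMS_x[!(RMS_x %in% flagged)]

print(flagged)
print(RMS_clean)
```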
Since you are asking more specifically about R and R functions, I'll focus my response on that. There are a couple of errors I'll point out, then I'll provide a few alternative solutions.
Your first function isn't producing the output you want for two reasons:
The logic instructs the function to return a single value rather than a vector. If you're trying to fill a vector within your for loop (one without the outlier), make sure to initialize the vector before the loop: output <- vector() (note that my solution below does not require this). Also, the value it returns is just a value in your vector RMS_x that is greater than the RMS, rather than a detected outlier, just FYI in case that's not what you wanted.
There's an error and/or typo in your for loop argument. It's minor, but it turns your for loop into not-a-loop, which is obviously the opposite of what you intended. The for loop needs a vector to loop through, so the argument should be: for(i in 1:length(RMS_x))
In your code the loop jumps straight to i = 5 because that is the length of your vector (length(RMS_x) = 5). Given that the values in the RMS_x vector were already in ascending order, your code happens to give the "right" answer, but that's just because of how you initially loaded the vector. This may have been a typo in your question, and it's a difference of only 2 code characters, but it totally changes what the function looks for.
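The difference between the two loop headers can be checked directly. This sketch only records which indices each loop visits; visited_wrong and visited_right are illustrative names, and seq_along(RMS_x) is used as an equivalent of 1:length(RMS_x) that is also safe for empty vectors.

```r
RMS_x <- c(2, 4, 9, 10, 100)

# for (i in length(x)) iterates over the single number 5
visited_wrong <- integer(0)
for (i in length(RMS_x)) {
  visited_wrong <- c(visited_wrong, i)
}

# for (i in seq_along(x)) iterates over 1, 2, 3, 4, 5
visited_right <- integer(0)
for (i in seq_along(RMS_x)) {
  visited_right <- c(visited_right, i)
}

print(visited_wrong)
print(visited_right)
```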
Solution:
To get what you are trying to accomplish, you need to write two functions: 1.) one that defines what's considered an outlier in your data set and 2.) a second one that strips out the outliers and calculates RMS. From there, either keep the functions independent or nest them to pass variables (this also goes with your bonus request, since it shows multiple ways of writing functions).
Function to identify outliers:
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# defining outliers can get complicated depending on your sample data but
# your data set is super simple so we'll keep it that way
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
# the assign() function creates an actual object in your global environment
# called non_outlier_range that you can access directly; it has to come
# before return(), because nothing after return() is executed
assign("non_outlier_range", bounds, envir = globalenv())
# return() just means the result will be printed to the console or stored in
# whatever variable you assign the call to
return(bounds)
}
Now moving on to the second function, a few options here:
First Way: Input bounds argument into RMS_func()
RMS_func <- function(dat, bounds){
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call function from approach 1 - note that outlrs() has to be run first, so
# that the assign() in its definition creates non_outlier_range:
outlrs(vec = RMS_x)
RMS_func(dat = RMS_x, bounds = non_outlier_range)
Second Way: Call outlrs() inside the second function
RMS_func <- function(dat){
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call RMS_func - here the assign() in outlrs() is not needed because the
# output will exist within the function's temporary environment and be
# passed to RMS_func
RMS_func(dat = RMS_x)
Third Way: Nest outlrs() definition within the RMS_Func - in this case you only need one nested function to accomplish your task
RMS_Func <- function(dat){
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# equivalent alternative using quantile():
#Q1 <- quantile(vec)["25%"]
#Q3 <- quantile(vec)["75%"]
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
return(bounds)
}
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
P.S. Wrote this pretty quickly - will likely re-test and edit later. Hopefully for now this helps.
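Since the question asked for the vector itself with the outlier removed, here is a hedged sketch that returns the filtered data and keeps the RMS calculation separate. The function names drop_outliers and rms_about_mean and the argument k are hypothetical; the 1.5 * IQR rule is the same assumption used in outlrs() above.

```r
# Return x without the values outside [Q1 - k*IQR, Q3 + k*IQR]
drop_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x[x >= q[1] - fence & x <= q[2] + fence]
}

# Root mean squared deviation from the mean, as computed in the question
rms_about_mean <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

RMS_x <- c(2, 4, 9, 10, 100)
kept <- drop_outliers(RMS_x)

print(kept)
print(rms_about_mean(RMS_x))   # RMS with the outlier still included
print(rms_about_mean(kept))    # RMS after removing it
```

Keeping the two steps in separate functions means each one can be tested and reused on its own, which is one possible answer to the bonus question about writing functions.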

How to remove the for loop and perform vectorization for data frame variables?

I have a data frame (V6Stationary42Obs1D.df) with 6 variables. For the 1st variable of my data frame I obtain the value as follows ("1" in effrectpl[i,1] indicates I obtained the value for the 1st variable):
sum <- 0
for (i in as.integer(1:5)) { # 5= no. of variables - 1 = 6-1=5
sum <- sum + conditionalGb(as.matrix(V6Stationary42Obs1D.df[gctemplate(6,1,1)[effrectpl[i,1],]][(1+0):42,]), nx = 1, ny = 1, order = 5)[[2]]
}
sum
For the 2nd variable of my data frame I obtain the value as follows ("2" in effrectpl[i,2] indicates I obtained the value for the 2nd variable):
sum <- 0
for (i in as.integer(1:5)) {
sum <- sum + conditionalGb(as.matrix(V6Stationary42Obs1D.df[gctemplate(6,1,1)[effrectpl[i,2],]][(1+0):42,]), nx = 1, ny = 1, order = 5)[[2]]
} # "6" in gctemplate(6,1,1) is the no. of variables in the data frame
# there is no change other than the one in effrectpl[i,2] for the 2nd variable
sum
There are 6 variables in my data frame, and I have to do the same for each variable. (The number of variables will change when I turn this code into a function; in neuroscience there may sometimes be about 300 variables, so the computational load is considerable.) I need a vectorized solution that avoids the repetition above.
What I did (thought):
sum <- c(0,0,0,0,0,0)
for (i in as.integer(1:5)) {
sum??? <- sum + ????
}
sum
Though I know the s/t/...apply family, I could not figure out how to use them in this particular problem.
Any help will be greatly appreciated. Thx in advance.
Note: I found the following for-inside-for solution, and now think that a vectorized solution may be either difficult or unnecessary in this case. Anyway, if I see some sort of non-for solution, I will be glad.
for (j in as.integer(1:6)) {
sum[j] <- 0
for (i in as.integer(1:5)) {
sum[j] <- sum[j] + conditionalGb(as.matrix(V6Stationary42Obs1D.df[gctemplate(6,1,1)[effrectpl[i,j],]][(1+0):42,]), nx = 1, ny = 1, order = 5)[[2]]
}
print(sum[j])
}
If computation speed is not your concern and you understand what you are doing, then a for loop is fine. It's not wrong. It can be made more efficient by vectorization, but that's not a necessity.
It is very difficult to provide you with a solution, as the example is very hard to follow and calls functions whose behavior I don't know, but in general if you have a function f(i) that depends on i you can turn
sum = 0
for( i in 1:n) sum = sum + f(i)
into
sum(sapply(1:n,function(i) f(i)))
BTW, it's a bad idea to call your variable sum, since that is also the name of a common function in R.
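The same pattern extends to the double loop in the question by nesting one sapply() inside another. In this sketch f(i, j) is a hypothetical stand-in for the conditionalGb(...)[[2]] call, and n_terms and n_vars are assumed names for the 5 terms and 6 variables; the result is stored in totals rather than sum.

```r
# Hypothetical stand-in for conditionalGb(...)[[2]] evaluated at effrectpl[i, j]
f <- function(i, j) i * j

n_terms <- 5   # inner loop length (number of variables - 1)
n_vars  <- 6   # outer loop length (number of variables)

# Loop version, as in the question
loop_totals <- numeric(n_vars)
for (j in seq_len(n_vars)) {
  for (i in seq_len(n_terms)) {
    loop_totals[j] <- loop_totals[j] + f(i, j)
  }
}

# Loop-free version: one total per variable j
totals <- sapply(seq_len(n_vars), function(j) {
  sum(sapply(seq_len(n_terms), function(i) f(i, j)))
})

print(totals)
```

This mainly tidies the code; sapply() still calls f once per (i, j) pair, so it should not be expected to reduce the computational load of an expensive f by itself.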

Finding a value in an interval

Sorry if this is a basic question. I have been trying to figure this out but have not been able to.
I have a vector of values called sym.
> head(sym)
[,1]
val 3.652166e-05
val -2.094026e-05
val 4.583950e-05
val 6.570184e-06
val -1.431486e-05
val -5.339604e-06
I put these into intervals by calling factor() on the result of cut() applied to sym.
factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))
[1] (2.82e-05,5.28e-05] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06]
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (7.74e-05,0.000102]
Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]
So clearly, four intervals were created in factorx. Now I have a new value tmp=3.7e-06.
My question is: how can I find which of the above intervals it belongs to? I tried to use findInterval(), but it seems it does not work on factors like factorx.
Thanks
If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector of break points rather than a number of intervals. Note that had those values been in the set originally, you might have had different breaks, and it is possible that your new values fall outside all the levels of your existing data, which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different ways to calculate breaks. Here are b1() and b2():
b1<-function(x, n=nclass.Sturges(x)) {
#like default cut()
nb <- as.integer(n + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0)
dx <- abs(rx[1L])
seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
b2<-function(x, n=nclass.Sturges(x)) {
#like default hist()
pretty(range(x), n=n)
}
So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor
mybreaks <- b1(x)
factorx <- cut(x,breaks=mybreaks)
(Note that you don't have to wrap cut() in factor(), as cut() already returns a factor.) Now, if you get new values, you can classify them using findInterval() and the breaks vector you've already prepared
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() tells you which level of the previously created factor each item belongs to. It returns 0 if the number is smaller than the smallest break, and length(mybreaks), which is one past the last level, for anything at or above the largest break, so such values do not map to an existing level. This behavior is somewhat different from cut(), which returns NA in both cases. Also, cut() intervals are right-closed by default, whereas findInterval() leaves the right end of each interval open.
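The edge cases described above can be lined up with cut() by using the left.open argument of findInterval() and mapping out-of-range indices to NA. This is a sketch on made-up break points; breaks, old, new_vals and new_labels are illustrative names, not the data from the question.

```r
breaks <- c(0, 0.25, 0.5, 0.75, 1)     # hypothetical explicit break points
old    <- c(0.1, 0.3, 0.6, 0.9)
fac    <- cut(old, breaks = breaks)    # intervals of the form (a, b]

# New values: one on a break point, one inside, one above and one below the range
new_vals <- c(0.25, 0.4, 1.2, -0.1)

# left.open = TRUE makes findInterval() use (a, b] intervals, like cut()
idx <- findInterval(new_vals, breaks, left.open = TRUE)

# 0 means below the first break, length(breaks) means above the last one
idx[idx == 0 | idx == length(breaks)] <- NA

new_labels <- levels(fac)[idx]
print(data.frame(new_vals, new_labels))
```

With these settings the labels should agree with what cut(new_vals, breaks = breaks) would give, including NA for values outside the range.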

Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
mydata[x,21:32] <- converter(mydata[x,7:18],
mydata[x,"gender"],
mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender, age, and a row index, which you then use to do lookups with a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it coerces the data frame to a matrix first, so if a character column such as gender is included, every value is converted to "character". It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
# numeric score columns only (7:18 as in the question), so nothing is coerced to character
rawscores <- as.matrix(mydata[, 7:18])
# rows: male 8-11, female 8-11, male 12-14, female 12-14 (the genders alternate)
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ]
converter <- function(idx,gender,age) {
gidx <- match(gender, c("Male", "Female") )
aidx <- findInterval(age, c(8,12,15) )
ag.idx <- gidx + 2*(aidx - 1)
# the multiplier on aidx is the number of gender categories (2)
cvt <- cvt.arr[ ag.idx, ]
# array extraction: pair each raw score (row of normdf) with its column
normdf[cbind(rawscores[idx, ] + 1, cvt)]
}
# mapply() returns one column per record, so transpose to one row per record
cvt.scores <- t(mapply(converter, 1:NROW(mydata), as.character(mydata$gender), mydata$age))
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is this idxs data frame, which combines all the rest. It has the following columns:
row and col: Position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this first thought here and see whether I can come up with a better answer, either based on joran's comment or using a three- or four-dimensional array for normdf. Not sure yet.
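The key step in the column-wise approach is indexing a table with a two-column matrix of (row, column) pairs. Here is a scaled-down sketch of that idea with made-up numbers: norm_tab, people, s1 and s2 are hypothetical, with 2 subscales and 2 gender groups instead of 12 subscales and 4 gender/age groups.

```r
# Hypothetical normalization table: rows = raw score + 1,
# columns 1-2 = male subscales, columns 3-4 = female subscales
norm_tab <- matrix(101:112, nrow = 3)

# Hypothetical raw data: two people, two subscale scores each
people <- data.frame(s1 = c(0, 2), s2 = c(1, 1),
                     gender = c("Male", "Female"))

# Column offset into norm_tab, one value per person
offset <- 2 * (people$gender == "Female")

scores <- as.matrix(people[c("s1", "s2")])
rows <- scores + 1              # row in norm_tab for every score
cols <- col(scores) + offset    # column in norm_tab for every score

# A two-column matrix index picks one cell per (row, column) pair
converted <- matrix(norm_tab[cbind(c(rows), c(cols))],
                    nrow = nrow(scores))
print(converted)
```

Adding offset to col(scores) relies on R recycling the per-person vector down each column, which lines up here because the matrix has one row per person.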
