Removing dataframe outliers in R with `boxplot.stats`

I'm relatively new at R, so please bear with me.
I'm using the Ames dataset (full description of dataset here; link to dataset download here).
I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:
regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))
My next objective was to remove the outliers, so I tried to subset using a which() function:
regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]
Unfortunately, that produced the
longer object length is not a multiple of shorter object length
error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)

Nice use of boxplot.stats.
You cannot SAFELY test with != when boxplot.stats returns more than one outlier in $out, because != compares element by element instead of testing membership. An analogy here is 1:5 != 1:3. You probably want !(1:5 %in% 1:3) instead.
regressionFrame <- subset(regressionFrame,
subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))
What I mean by SAFELY is that 1:5 != 1:3 at least raises a warning, whereas 1:6 != 1:3 runs without any warning, yet neither is a membership test. The warning comes from the recycling rule. In the latter case, 1:3 can be recycled to the same length as 1:6 (the length of 1:6 is a multiple of the length of 1:3), so you are actually testing 1:6 != c(1:3, 1:3), position by position.
A simple example.
x <- c(1:10/10, 101, 102, 103) ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out ## `boxplot.stats` has picked them out
x[x != out] ## this gives a warning and a wrong result: nothing is removed
x[!(x %in% out)] ## this removes them from x
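The silent case is the more dangerous one, so here is a minimal sketch of it. The vectors are made up for illustration: `out` stands in for whatever boxplot.stats would return, and its length happens to divide the length of `x`, so R recycles it without complaint.

```r
x   <- c(3, 1, 2, 10)
out <- c(1, 2)            # pretend these are the flagged outliers

# != recycles `out` to c(1, 2, 1, 2) and compares position by position.
# Lengths 4 and 2 are compatible, so there is no warning at all.
keep_neq <- x != out

# %in% asks, for each element of x, whether it occurs anywhere in `out`.
keep_in <- !(x %in% out)

x[keep_neq]   # 3 1 2 10  (nothing removed)
x[keep_in]    # 3 10
```

The same logic applies to a data frame column: build the logical vector with %in% and use it as the row index.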

Related

How to convert multiple columns to numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(): I receive the "y must be numeric" error when I call it. I tried passing the arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow misuses as.numeric. Any suggestions for how else I could iterate over these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
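res2 <- sapply(res[cor.condition.cols], as.numeric) # numeric matrix, one column per gene; as.numeric(as.matrix(...)) would flatten it into a single vector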
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
Or, if the columns are already numeric, the shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
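To make the conversion step concrete, here is a small self-contained sketch. The data frame `res_demo` is a made-up stand-in for `res` (three gene columns stored as character strings instead of twenty), so treat the names and values as assumptions rather than the asker's data.

```r
# Hypothetical stand-in for `res`: gene columns read in as character
res_demo <- data.frame(
  Condition   = c(1, 2, 3, 4, 5, 6),
  Genes1_Acc4 = c("1", "2", "3", "4", "5", "6"),
  Genes2_Acc4 = c("6", "5", "4", "3", "2", "1"),
  Genes3_Acc4 = c("2", "1", "4", "3", "6", "5"),
  stringsAsFactors = FALSE
)

# Column names only, without the "res$" prefix
cols <- paste0("Genes", 1:3, "_Acc4")

# sapply() converts column by column and returns a numeric matrix
# that keeps the column names
num <- sapply(res_demo[cols], as.numeric)

# One correlation per gene column, returned as a 3 x 1 matrix
cors <- cor(num, res_demo$Condition, use = "complete.obs", method = "pearson")

# The same result as a data frame, one row per gene
cors_df <- data.frame(gene = rownames(cors), r = cors[, 1], row.names = NULL)
cors_df
```

Selecting columns with a character vector of names is what replaces the pasted "res$Genes..." strings from the question: those were just text, not references to the columns.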

Understanding Vectorized Code In R

I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by what group they were
in which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does (more compactly, and with less typing from us)
for(i in 1:N) {
pvals[i] <- myttest(geneExpression[i,], group=g) #row i is 'e'; the whole of g is passed every time
}
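The equivalence can be checked on a small made-up matrix without installing the course package. In this sketch `expr_demo` and `g_demo` are hypothetical stand-ins for geneExpression and g; the last line also shows the formula interface asked about in the question, applied to one row at a time.

```r
set.seed(42)
expr_demo <- matrix(rnorm(5 * 8), nrow = 5)   # 5 hypothetical genes x 8 samples
g_demo <- factor(rep(c(0, 1), each = 4))      # 4 controls, 4 cases

myttest <- function(e, group) {
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() hands each row to myttest() as `e`; `group` is passed unchanged
p_apply <- apply(expr_demo, 1, myttest, group = g_demo)

# The explicit loop that apply() replaces
p_loop <- numeric(nrow(expr_demo))
for (i in seq_len(nrow(expr_demo))) {
  p_loop[i] <- myttest(expr_demo[i, ], group = g_demo)
}

# Formula interface: `e` is one row, split by the grouping factor
p_formula <- apply(expr_demo, 1, function(e) t.test(e ~ g_demo)$p.value)

all.equal(p_apply, p_loop)
all.equal(p_apply, p_formula)
```

The formula version compares the groups in the opposite order, which flips the sign of the t statistic but leaves the two-sided p-value unchanged.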

R programming Function (Returning a subset of Root Mean Squared)

I am new to R and am working on writing some cool functions while I learn statistics in parallel. I'm trying to make a function that will take a numeric vector, perform the "root mean squared" operations and then have the output return essentially same vector with the possible outliers removed.
For example, if the vector is c(2,4,9,10,100) the resulting RMS would be about 37.
Therefore, I want the output to return the same vector with the possible outlier (in this case, 100) removed from the dataset. So the result would be 2, 4, 9, 10
I put my code below but the output isn't working. I tried it 2 different ways. Everything up to the line that says RMS final works. But below that it does not.
How can I modify this function so that it does what I want? Also, as a bonus, and this might be asking a lot but based on my coding below, any tips for a newbie on making functions would be something I'd be grateful for as well. Thanks so much!
RMS_x <- c(2,4,9,10,100)
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
for(i in length(RMS_x)){
if(abs(RMS_x[i]) > RMS_FINAL){
output <- RMS_x[i]}
else {NULL} }
return(output)
}
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
#output <- ifelse(abs(RMS_x) > RMS_FINAL,RMS_x, NULL)
return(RMS_FINAL)
}
Try the following in the first lines of the RMS function.
RMS <- function(RMS_x) {
bp <- boxplot(RMS_x, plot = FALSE)
RMS_x <- RMS_x[!(RMS_x %in% bp$out)]
...
Now, you have RMS_x sans the outliers.
The boxplot function has a way of determining the outliers. Here, I am using that to remove them.
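As a complete, runnable version of that idea, here is a minimal sketch. The helper name `remove_outliers` is made up for illustration, and it uses boxplot.stats() directly (the function boxplot() relies on to find outliers) with its default coefficient of 1.5.

```r
# Hypothetical helper: return the input without the values that
# boxplot.stats() flags as outliers (default coef = 1.5)
remove_outliers <- function(x) {
  out <- boxplot.stats(x)$out
  x[!(x %in% out)]
}

RMS_x <- c(2, 4, 9, 10, 100)
cleaned <- remove_outliers(RMS_x)
cleaned   # 2 4 9 10
```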
Since you are asking more specifically about R and R functions I’ll focus my response on that. There are a couple errors I'll point out then provide a few alternative solutions.
Your first function isn’t producing the output you want for two reasons:
The logic instructs the function to return a single value rather than a vector. If you're trying to fill a vector within your for loop (one without the outlier), make sure to initialize it before the loop, inside the function: output <- vector() (note that my solution below does not require this). Also, the value it returns is just an element of RMS_x that is greater than the RMS rather than an outlier as such, just FYI in case that's not what you wanted.
There's an error and/or typo in your for loop argument. It's minor, but it turns your for loop into not-a-loop, which is the opposite of what you intended. The for loop needs a vector to loop through; the argument should be: for(i in 1:length(RMS_x)), or more robustly for(i in seq_along(RMS_x)), which also behaves correctly for an empty vector.
In your code the loop jumps straight to i = 5 because that is the length of your vector (length(RMS_x) = 5). Given that the values in the RMS_x vector were already in ascending order, your code happens to give the "right" answer, but that's just because of how you initially loaded the vector. This may have been a typo in your question, and it's a difference of only 2 code characters, but it totally changes what the function looks for.
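Both points can be seen in a few lines. This is only a sketch: the variable names are made up, and the threshold is computed the same way as RMS_FINAL in the question.

```r
RMS_x <- c(2, 4, 9, 10, 100)
threshold <- sqrt(mean((RMS_x - mean(RMS_x))^2))   # same formula as RMS_FINAL

# length(RMS_x) is the single number 5, so the body runs once, with i = 5
visited_once <- integer(0)
for (i in length(RMS_x)) visited_once <- c(visited_once, i)

# seq_along(RMS_x) is 1:5, so every element is visited; the result vector
# is created before the loop and filled inside it
visited_all <- integer(0)
kept <- numeric(0)
for (i in seq_along(RMS_x)) {
  visited_all <- c(visited_all, i)
  if (abs(RMS_x[i]) <= threshold) kept <- c(kept, RMS_x[i])
}

# The vectorized equivalent of the second loop
kept_vec <- RMS_x[abs(RMS_x) <= threshold]
```

Growing a vector with c() is fine for five elements; for anything large, the vectorized one-liner is both shorter and faster.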
Solution:
To get what you are trying to accomplish, you need to write two functions: 1.) one that defines what's considered an outlier in your data set and 2.) a second one that strips out the outliers and calculates RMS. From there, either keep the functions independent or nest them to pass variables (this also goes with your bonus request, since it shows multiple ways of writing functions).
Function to identify outliers:
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# defining outliers can get complicated depending on your sample data but
# your data set is super simple so we'll keep it that way
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
assign("non_outlier_range", bounds, envir = globalenv())
# the assign() function will create an actual object in your global environment
# called non_outlier_range that you can access directly; it has to come
# before return(), because nothing after return() is executed
return(bounds)
# return() just means the result will be spit out into the console or into
# a variable you load it into
}
Now moving on to the second function, a few options here:
First Way: Input bounds argument into RMS_func()
RMS_func <- function(dat, bounds){
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call function from approach 1 - note that here the assign() in the
# definition of outlrs() is required, and outlrs() must be called first
# so that non_outlier_range exists:
outlrs(vec = RMS_x)
RMS_func(dat = RMS_x, bounds = non_outlier_range)
Second Way: Call outlrs() inside the second function
RMS_func <- function(dat){
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call RMS_func - here the assign() in outlrs() is not needed because
# the output will exist within the function's temporary environment
# and be passed to RMS_func
RMS_func(dat = RMS_x)
Third Way: Nest the outlrs() definition within RMS_Func - in this case a single function, with outlrs() defined inside it, accomplishes your task
RMS_Func <- function(dat){
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
#Q1 <- quantile(vec)["25%"]
#Q3 <- quantile(vec)["75%"]
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
return(bounds)
}
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
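Since the question asks for the vector with the outlier removed, and the functions above return only the RMS, here is a compact sketch that returns both. The function name, the `coef` argument and the list it returns are assumptions made for illustration; the bounds are the same 1.5 * IQR rule as in outlrs(), computed with quantile().

```r
# Hypothetical helper: drop values outside Q1 - coef*IQR .. Q3 + coef*IQR,
# then compute the same "RMS" as in the question on what is left
rms_without_outliers <- function(dat, coef = 1.5) {
  q <- quantile(dat, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  keep <- dat >= q[1] - coef * iqr & dat <= q[2] + coef * iqr
  kept <- dat[keep]
  list(kept = kept,
       rms  = sqrt(mean((kept - mean(kept))^2)))
}

res <- rms_without_outliers(c(2, 4, 9, 10, 100))
res$kept   # 2 4 9 10
res$rms
```

Returning a list lets one function hand back several results without writing anything into the global environment.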
P.S. Wrote this pretty quickly - will likely re-test and edit later. Hopefully for now this helps.

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I will hint at it: have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check the classes of the columns in mdata and ranklist. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values had been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
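Without solving the assignment, the mechanism behind the warning and the preallocation advice can be sketched on toy data. Everything here is made up (state codes, hospital names), and stringsAsFactors is set explicitly because its default differs between R versions (it became FALSE in R 4.0.0).

```r
# A factor only accepts values that are already among its levels.
# Assigning anything else yields NA plus the warning
# "invalid factor level, NA generated" (suppressed here to keep the output quiet).
state_f <- factor(c("AL", "AK"))
suppressWarnings(state_f[2] <- "TX")
is.na(state_f[2])   # TRUE

# Preallocate a data frame with character columns and fill it by position,
# so neither factors nor rbind() are involved
states <- c("AK", "AL", "TX")                      # hypothetical state codes
ranklist <- data.frame(hospital = character(length(states)),
                       state    = states,
                       stringsAsFactors = FALSE)
for (i in seq_along(states)) {
  ranklist$hospital[i] <- paste("Hospital", i)     # placeholder for the real lookup
}
str(ranklist)
```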

Is it possible to use a function in replace() in R?

I am new to R and would like to see if there is a more elegant and quicker way to replace some data in a vector, when the replacement is conditional and the new value is based on a formula. Just putting the formula into replace() gives a warning, because the replacement vector (the size of the whole vector) is longer than the number of elements being replaced (a subset of the vector).
This for loop works, but is pretty slow:
A <- df$v2/df$v1
for(i in 1:length(A)) {
if (is.na(A[i]) & !is.na(df$v3[i])) {
A[i] <- df$v3[i]/df$v1[i]
}
}
The following doesn't work and I understand why (the replacement needs to be of the same size as the object being replaced), but haven't found a nicer solution than the for loop above:
A <- replace(A,is.na(A) & !is.na(df$v3),df$v3/df$v1)
It gives the following error:
Warning message:
In replace(....
number of items to replace is not a multiple of replacement length
I don't see why replace() is necessary; plain logical indexing does it:
A <- with(df, v2/v1)
repl <- is.na(A) & !is.na(df$v3)
A[repl] <- with(df[repl, ], v3/v1)
Alternatively, since what you're doing is calculating v2/v1 unless v2 is missing and v3 is not, in which case use v3/v1:
A <- with(df, ifelse(!is.na(v2), v2/v1, v3/v1))
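To answer the title directly: replace() does work with a computed replacement, as long as the replacement is subset with the same index so the lengths match. The sketch below uses a made-up four-row data frame (named `df_demo` here) and shows that all three approaches agree on it.

```r
# Hypothetical data: row 2 has v2 missing but v3 present
df_demo <- data.frame(v1 = c(2, 4, 5, 10),
                      v2 = c(4, NA, 10, NA),
                      v3 = c(1, 8, NA, NA))

A   <- df_demo$v2 / df_demo$v1
idx <- is.na(A) & !is.na(df_demo$v3)

# replace(): subset the replacement with the same index
A_replace <- replace(A, idx, (df_demo$v3 / df_demo$v1)[idx])

# Logical indexing, as in the answer above
A_index <- A
A_index[idx] <- with(df_demo[idx, ], v3 / v1)

# ifelse(), as in the answer above
A_ifelse <- with(df_demo, ifelse(!is.na(v2), v2 / v1, v3 / v1))

A_replace   # 2 2 2 NA
```

The approaches can differ in corner cases such as 0/0, where v2/v1 is NaN although v2 itself is not missing, so it is worth checking which rule you actually want.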

Resources