Testing for outliers in a dataframe - r

I am new to R and I tried to use a function that tests for outliers in a large dataframe with over 600 variables, all numeric except for the last 2 columns. I tried the outlier function in the outliers package to test one column at a time, but I ended up with a numeric vector which I could not use. Is there a better way to identify all outliers in a dataframe?
myout <- c()
for (i in 1:dim(training)[2]){
if (is.numeric(training[,i])) {
myout <- c(myout,outlier(training[,i])) }
}

As you can read in the help file of outlier, it finds one value for each variable: the one that differs the most from the mean. I think what you want is to find, for each variable, the indices of all data points that are outliers. This can be done in the following way (of course you need to remove your non-numeric variables first):
# first write a custom function that returns the index of all outliers
# I define an outlier as 3 sd's away from the mean, you can adjust that
is.outlier <- function(x) which(abs(x - mean(x)) > 3*sd(x))
# turn the df into a list, and apply the function to each variable with lapply
df.as.list <- as.list(df) # enter the name of your data frame instead of df
lapply(df.as.list, is.outlier)
It will return a list whose element i holds the indices of the outliers of the variable in column i.
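As a rough sketch of the same idea that also skips the non-numeric columns and tolerates missing values, something like the following could work. The data frame `training` here is made-up illustration data, not the asker's, and the `na.rm = TRUE` handling is my own addition to the function above:

```r
# Hypothetical stand-in for the real `training` data frame
training <- data.frame(
  v1 = c(rep(c(-1, 1), 25), 25),   # row 51 is far from the rest
  v2 = c(rep(0, 49), NA, 40),      # contains an NA and one extreme value
  label = "a"                      # non-numeric column, to be skipped
)

# Indices of values more than 3 sd's from the mean, ignoring NAs
is.outlier <- function(x) {
  which(abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE))
}

# Keep only the numeric columns, then apply the function to each one
num_cols <- vapply(training, is.numeric, logical(1))
out_idx <- lapply(training[num_cols], is.outlier)
print(out_idx)
```

A data frame is already a list of columns, so lapply can be called on it directly without as.list().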

You may not actually want to remove outliers, but if you do, boxplot.stats() can be used to drop them from a vector x:
x[!x %in% boxplot.stats(x)$out]
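A small illustration of that one-liner on a made-up vector (the values are only an example). boxplot.stats() flags points beyond 1.5 times the hinge spread by default, which is a different rule from the 3-sd cutoff used above:

```r
# Made-up example vector with one obvious extreme value
x <- c(1, 2, 3, 4, 5, 100)

# Values that boxplot.stats() considers outliers
flagged <- boxplot.stats(x)$out
print(flagged)

# The vector with those values dropped
kept <- x[!x %in% flagged]
print(kept)
```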

Related

How to get standard deviation of multiple columns in R?

I want to get the standard deviation of specific columns in a dataframe and store those standard deviations in a list in R.
The specific variable names of the columns are stored in a vector. For those specific variables (depends on user input) I want to calculate the standard deviation and store those in a list, over which I can loop then to use it in another part of my code.
I tried as follows, e.g.:
specific_variables <- c("variable1", "variable2") # can be of a different length depending on user input
data <- data.frame(...) # this is a dataframe with multiple columns, of which "variable1" and "variable2" are both columns from
sd_list <- 0 # empty variable for storage purposes
# for loop over the variables
for (i in length(specific_variables)) {
sd_list[i] <- sd(data$specific_variables[i], na.rm = TRUE)
}
print(sd_list)
I get an error.
Second attempt using colSds and sapply:
colSds(data[sapply(specific_variables, na.rm = TRUE)])
But the colSds function doesn't work (anymore?).
Ideally, I'd like to store the standard deviations of certain columns, selected by name, in a list.
Let's assume you have a dataframe with two columns. The easiest way is to use apply:
frame<-data.frame(X=1:6,Y=rnorm(6))
sd_list<-apply(frame,2,sd)
The "2" in apply means: calculate sds for each column. A "1" would mean: calculate for each row.
Base R has no colSds() function (one is provided by the matrixStats package), but colMeans() and colSums() do exist in base R.
With help from @shghm I found a way:
sd_list <- as.list(unname(apply(data[specific_variables], 2, sd, na.rm = TRUE)))
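For comparison, here is a hedged sketch of what a working version of the original for loop might look like, next to an lapply equivalent. The data frame is invented for illustration. The two changes from the attempt in the question are iterating over every name (not just `length(specific_variables)`) and using `data[[v]]` instead of `data$specific_variables[i]`, since `$` does not evaluate a variable holding the column name:

```r
# Invented example data; `other` is a column we do not want
data <- data.frame(
  variable1 = c(1, 2, 3, NA),
  variable2 = c(2, 4, 6, 8),
  other = 1:4
)
specific_variables <- c("variable1", "variable2")

# Loop version: pre-allocate a named list, fill one element per column
sd_list <- vector("list", length(specific_variables))
names(sd_list) <- specific_variables
for (v in specific_variables) {
  sd_list[[v]] <- sd(data[[v]], na.rm = TRUE)
}

# Same result without an explicit loop
sd_list_lapply <- lapply(data[specific_variables], sd, na.rm = TRUE)

print(sd_list)
```

Unlike the unname() version above, both of these keep the column names on the list, which can make later lookups easier.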

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code; no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses mean as its FUN argument.
I can't see your data, but probably something like this? The vector needed to be initialized before the loop. It is better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
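If the exercise really requires control structures, a variant on the reproducible example from the question could keep a running total and count instead of growing a vector. This is only a sketch; the names `total` and `n_seen` are my own:

```r
# Reproducible example from the question
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

total <- 0    # running sum of the non-missing values
n_seen <- 0   # how many non-missing values were found

for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    n_seen <- n_seen + 1
  }
}

loop_mean <- total / n_seen
print(loop_mean)

# Should agree with the built-in approach
print(mean(x[, 1], na.rm = TRUE))
```

Using seq_len(nrow(x)) rather than a hard-coded 1:15 means the loop still works if the number of rows changes.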

Write script to ignore objects which can’t be found in r

I am trying to construct a script in R that ignores objects it can’t find.
A simplified version of my script is as follows
Trial<-sum(a,b,c,d,e)
a-e are numeric vectors generated by calculating the sum of a column in a data frame.
My problem is that I want to use the same script over multiple different conditions (and have far more objects than just a-e). For some of these conditions, some of the objects a-e may not exist, so R returns an error such as "object 'd' not found".
To avoid having to write a unique script for each condition, I would like to force R to ignore any missing objects.
I would be grateful for any help!
Welcome to SO! As mentioned in the comments, in the future try to include a working example in your question. The preferred solution to your problem would be to avoid assigning values to individual variables in the first place. Try to restructure your code so that your column sums get assigned to, for example, a list or a vector. In the example below, I create some sample data, assign the column sums to a vector, and compute the sum of that vector, without creating a new variable for each column.
# Create sample data
rData <- as.data.frame(matrix(c(1:6), nrow=6, ncol=5, byrow = TRUE))
print(rData)
# Compute column sum
sumVec <- apply(rData, 2, sum)
print(sumVec)
# Compute sum of column sums
total <- sum(sumVec)
print(total)
If you have to use individual variables, before adding them up, you could check if the variable exists, and if not, create it and assign NA. You can then compute the sum of your variables after excluding NA.
# Sample variables
a <- 15
b <- 20
c <- 50
# Assign NA if it doesn't exist (one variable at a time)
if(!exists("d")) { d <- NA }
# Assign NA using sapply (preferred)
sapply(c("a","b","c","d","e"), function(x)
if(!exists(x)) { assign(x, NA, envir=.GlobalEnv) }
)
# Compute sum after excluding NA
altTotal <- sum(na.omit(c(a,b,c,d,e)))
print(altTotal)
Hopefully this will get you closer to the solution!
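Another possible approach, sketched here under the assumption that the object names are known in advance, is to check which names exist and fetch only those with mget(), so nothing has to be created in the global environment. The names `sum_a` to `sum_e` are hypothetical placeholders for the asker's a-e:

```r
# Hypothetical column sums; sum_d and sum_e deliberately do not exist
sum_a <- 15
sum_b <- 20
sum_c <- 50

wanted <- c("sum_a", "sum_b", "sum_c", "sum_d", "sum_e")

# Look only in the current environment, not in attached packages
env <- environment()
is_present <- vapply(
  wanted,
  function(nm) exists(nm, envir = env, inherits = FALSE),
  logical(1)
)

# Fetch the objects that exist and add them up
values <- mget(wanted[is_present], envir = env)
trial <- sum(unlist(values))
print(trial)
```

Setting inherits = FALSE avoids accidentally matching a function of the same name from a package, which can happen with short names such as c.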

Looping a rep() function in r

df is a frequency table, where each value in column a was reported as many times as recorded in columns x, y and z. I'm trying to convert the frequency table back to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call, you can always try using an apply function. Note that you cannot store the result in a data.frame because the objects are of different lengths, but you could store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
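A variant of the same idea using lapply, which always returns a list (sapply only does so here because the results have different lengths), might look like the sketch below. The names `expanded` and `padded` are my own:

```r
# Frequency table from the question
df <- data.frame(a = 1:10, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per frequency column, always returned as a list
expanded <- lapply(df[c("x", "y", "z")], function(freq) rep(df$a, times = freq))
print(lengths(expanded))

# Optional: pad the shorter vectors with NA to fit them in a data frame
longest <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, function(v) v[seq_len(longest)]))
print(dim(padded))
```

Indexing a vector past its end returns NA, which is what pads the shorter columns.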

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into the 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
#    sample  V1
# 1:      1  65
# 2:      2  89
# 3:      3 117
You can easily replace value^2 by any function you want.
You can use the apply function, selecting the columns that you need with c(i1, i2, ...).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity, so long as it takes a vector as an input and returns a vector of the same length. If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data and define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
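To tie the answers together, here is a minimal sketch that picks the columns by name instead of by position, so it keeps working if columns are reordered. It assumes the operation is vectorized (as squaring is), and `exclude` is a name I made up for the columns to leave out:

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)

# Columns to leave out of the calculation (assumed known by name)
exclude <- "sample"
cols <- setdiff(names(x), exclude)

# Square the selected columns and sum across each row
x$result <- rowSums(x[cols]^2)
print(x)
```

With many columns, rowSums on the whole block is typically faster than apply with MARGIN = 1, because apply calls the function once per row.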

Resources