Passing multiple columns to a function for fixing outliers - R

summary(standard_airline)
#outlier treatment
outFix <- function(x){
quant <- quantile(x,probs = c(.25,.75))
h <- 1.5*IQR(x,na.rm = T)
x[x<(quant[1]-h)] <- quant[1]
x[x>(quant[2]+h)] <- quant[2]
}
v <- colnames(airline[,-1])
data2 <- lapply(v,outFix)
Error - Error in (1 - h) * qs[i] : non-numeric argument to binary operator
I couldn't work out what causes this error, although the logic seems right. Is there any way in R to pass multiple columns of a dataset to a particular function? Here I want to pass every column except ID to fix the outliers.

Problem
The issue is that v is a character vector of column names, while your function outFix expects a numeric vector. Your lapply call therefore evaluates something like outFix("Balance"): it tries to compute quantiles and the IQR of a string, which produces the error.
quantile("Balance")
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
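To make the difference concrete, here is a minimal sketch that compares iterating over the column names with iterating over the columns themselves. The small airline data frame and its Miles column are made up for illustration.

```r
# Hypothetical stand-in for the airline data (values are invented)
airline <- data.frame(ID = 1:3, Balance = c(10, 20, 30), Miles = c(5, 6, 7))
v <- colnames(airline[, -1])

# Iterating over v hands each column *name* (a string) to the function
name_classes <- vapply(v, class, character(1))

# Iterating over airline[v] hands each column's *values* to the function
value_classes <- vapply(airline[v], class, character(1))

print(name_classes)
print(value_classes)
```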
Solutions
In the following code replace df with airline for your specific data.
In base R:
df[,-1] <- lapply(df[, -1], function(x) outFix(as.numeric(x))) # exclude first column
Or using your code:
df[, v] <- lapply(df[, v], function(x) outFix(as.numeric(x)))
Using dplyr, you can apply your function to every column except ID with:
library(dplyr)
df %>%
dplyr::mutate_at(dplyr::vars(-ID), ~ outFix(as.numeric(.))) # remove ID by name
df %>%
dplyr::mutate_at(-1, ~ outFix(as.numeric(.))) # remove ID by column position
This makes sure that all your columns are numeric before being passed to your function outFix.
If you're certain ahead of time that all of your columns are numeric, you don't need as.numeric, but it can be a useful safeguard.
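One more thing to watch for: outFix as posted ends with an assignment, so it returns the replacement quartile rather than the modified vector. Below is a sketch of a version that returns x and tolerates missing values. The na.rm handling and the sample data are my assumptions, not part of the original question.

```r
# Sketch of outFix that returns the corrected vector.
# Values beyond 1.5 * IQR from the quartiles are replaced by the
# nearest quartile, as in the original function.
outFix <- function(x) {
  quant <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  x[!is.na(x) & x < quant[1] - h] <- quant[1]
  x[!is.na(x) & x > quant[2] + h] <- quant[2]
  x  # return the whole vector, not just the last assigned value
}

# Hypothetical data: one obvious outlier in Balance
df <- data.frame(ID = 1:7,
                 Balance = c(10, 12, 11, 13, 12, 11, 500),
                 Miles = c(1, 2, 3, 4, 5, 6, 7))

df[, -1] <- lapply(df[, -1], outFix)  # every column except the first
print(df)
```

Here the value 500 lies above the upper fence and is replaced by the third quartile of Balance; Miles has no outliers and is left unchanged.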

Related

R- identify columns with integers [duplicate]

I would like to know if there is a simpler way of subsetting a data.frame's integer columns.
My goal is to modify the numerical columns in my data.frame without touching the purely integer columns (in my case containing 0 or 1). The integer columns were originally factor levels turned into dummy variables and should stay as they are. So I want to temporarily remove them.
To distinguish numerical from integer columns I used the OP's version from here (Check if the number is integer).
But is.wholenumber returns a matrix of TRUE/FALSE instead of one value per column like is.numeric, therefore sapply(mtcars, is.wholenumber) does not help me. I came up with the following solution, but I thought there must be an easier way?
data(mtcars)
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
integer_column_names <- apply(is.wholenumber(mtcars), 2, mean) == 1
numeric_df <- mtcars[, !integer_column_names]
You can use dplyr to achieve that:
library(dplyr)
is_whole <- function(x) all(floor(x) == x)
df = select_if(mtcars, is_whole)
or in base R
df = mtcars[ ,sapply(mtcars, is_whole)]
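Both snippets above keep the whole-number columns. Since the question asks for the opposite (the columns that are not purely whole numbers), here is a base R sketch that negates the test. The is.numeric guard and na.rm = TRUE are my additions.

```r
# TRUE only for numeric columns whose values are all whole numbers
is_whole <- function(x) is.numeric(x) && all(floor(x) == x, na.rm = TRUE)

# One logical value per column
whole_cols <- vapply(mtcars, is_whole, logical(1))

# Keep only the columns that contain at least one non-whole value
numeric_df <- mtcars[, !whole_cols, drop = FALSE]
print(names(numeric_df))
```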

dataframe is collapsed to a vector when given to function

I am trying to make use of the content of a dataframe in a function, here is a simplified example of my problem.
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
Error in q[, 2] : incorrect number of dimensions
If I add a print statement:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
print(q)
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
I get:
[1] 1 2 3
Error in q[, 2] : incorrect number of dimensions
The data frame is converted to a vector of its first column for some reason. How can I stop this from happening, and have the whole dataframe accessible to my function?
I am trying to select a subset of the dataframe and returning it based on the other two parameters of the function, which is why I need the whole dataframe to be passed to the function.
If I understand you correctly, you want the whole data frame q = df2 passed to the fxm function you define.
The problem is that mapply treats q = df2 like the other vectorised arguments: it iterates over its elements (the columns of the data frame) in the same way it iterates over df[,1] and df[,2], so fxm receives one column at a time. To pass the whole data frame to every call, supply it through the MoreArgs parameter:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
print(q)
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2], MoreArgs = list(q=df2))
This still fails for me, but with a different error that comes from elsewhere in fxm. The printed output shows that the whole data.frame now reaches the function, which solves your original problem.
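For reference, here is a sketch of one way to get a complete run. It assumes the remaining failure comes from binding y onto a subset with zero rows, so it repeats y to match the subset's row count. SIMPLIFY = FALSE, which keeps each result as a data frame in a list, is also my choice.

```r
df  <- data.frame(v1 = 1:10, v2 = 23:32)
df2 <- data.frame(v1 = 1:3, v2 = 3:5)

# Rows of q whose second column equals x, tagged with y.
# rep(y, nrow(hit)) keeps the lengths consistent when nothing matches.
fxm <- function(x, y, q) {
  hit <- q[q[, 2] == x, , drop = FALSE]
  cbind(hit, y = rep(y, nrow(hit)))
}

# q is passed whole to every call via MoreArgs
res <- mapply(fxm, df[, 1], df[, 2],
              MoreArgs = list(q = df2), SIMPLIFY = FALSE)

combined <- do.call(rbind, res)  # stack the per-call results
print(combined)
```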

Mean imputation issue with data.table

Trying to impute missing values in all numeric columns using this loop:
for(i in 1:ncol(df)){
if (is.numeric(df[,i])){
df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
}
}
When the data.table package is not attached, the code above works as it should. Once I attach data.table, the behaviour changes and I get this error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere, but with no success. In fact, it does not even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm=TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
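If you go the data.frame route, a variant of the loop that selects columns with [[ ]] instead of [, i] may be easier to reason about. This is only a sketch on a plain data.frame with made-up values.

```r
# Hypothetical data with missing values in numeric and character columns
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", NA, "z"),
                 c = c(NA, 4, 6),
                 stringsAsFactors = FALSE)

for (i in seq_along(df)) {
  if (is.numeric(df[[i]])) {
    # Replace the NAs of column i with that column's mean
    df[[i]][is.na(df[[i]])] <- mean(df[[i]], na.rm = TRUE)
  }
}
print(df)
```

The character column b is skipped, so its NA stays in place.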
An alternative answer that I came up with while working on a similar problem at a larger scale: one might want to avoid for loops by using the [.data.table method.
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...){
if(!is.numeric(x))return(x)
na <- is.na(x)
if(is.function(val))
val <- val(x[!na])
if(!is.numeric(val)||length(val)>1)
stop("'val' needs to be either a function or a single numeric value!")
x[na] <- val
x
}
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of example here we'll overwrite using <-
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns and keep any non-numeric columns as they are. To impute another value (say, 42), or to compute the mean within the levels of a grouping variable, use:
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
These impute 42 and the within-group mean, respectively.
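To see what the helper does on its own, independent of data.table, here is a small sketch that repeats the impute_na definition from above and calls it on plain vectors. The input values are invented.

```r
# Same helper as defined above, repeated so the snippet runs on its own
impute_na <- function(x, val = mean, ...) {
  if (!is.numeric(x)) return(x)
  na <- is.na(x)
  if (is.function(val))
    val <- val(x[!na])
  if (!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}

by_mean   <- impute_na(c(1, NA, 3))                     # default: mean of observed values
by_const  <- impute_na(c(1, NA, 3), val = 42)           # fixed replacement value
by_median <- impute_na(c(1, NA, 10, 100), val = median) # any summary function
untouched <- impute_na(c("a", NA))                      # non-numeric input is returned as is
```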

Apply function not working in R

I created a data frame :
fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012)
company <-c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft")
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723)
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978)
companiesData <- data.frame(fy, company, revenue, profit)
I am trying to create a new column using the apply command, but it gives an error:
companiesData$Margin<-apply(companiesData,1,function(x){(x[4]/x[3])*100})
Error in x[4]/x[3] : non-numeric argument to binary operator
Can someone please tell me what is the mistake here?
The mistake is that apply coerces its first argument to a matrix. Since companiesData has both numeric and non-numeric variables, every value is converted to character, so x[4]/x[3] fails because division is not defined for character data.
Solution: you don't need apply in this case.
companiesData$Margin <- 100 * companiesData$profit / companiesData$revenue
or equivalently
companiesData <- within(companiesData, Margin <- 100 * profit / revenue)
Either form does what you want.
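The coercion is easy to check directly. The sketch below inspects the matrix that apply works on and, in case you do want to keep apply, restricts it to the numeric columns so that no coercion to character takes place.

```r
fy <- c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012)
company <- c("Apple", "Apple", "Apple", "Google", "Google", "Google",
             "Microsoft", "Microsoft", "Microsoft")
revenue <- c(65225, 108249, 156508, 29321, 37905, 50175, 62484, 69943, 73723)
profit <- c(14013, 25922, 41733, 8505, 9737, 10737, 18760, 23150, 16978)
companiesData <- data.frame(fy, company, revenue, profit)

# The matrix that apply() sees: one non-numeric column turns everything into text
m <- as.matrix(companiesData)
print(typeof(m))

# apply() on the numeric columns only, so each row stays numeric
margin <- apply(companiesData[, c("revenue", "profit")], 1,
                function(x) 100 * x[["profit"]] / x[["revenue"]])
print(margin)
```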

Calculate Mean of a column in R having non numeric values

I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values, which I can then use to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() with na.rm = TRUE
m <- mean(nums, na.rm = TRUE)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
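Put together, the steps above look roughly like this on a small made-up factor column. The values are invented, and suppressWarnings is my addition to silence the expected coercion warning.

```r
# Hypothetical column mixing numbers and text, stored as a factor
df <- data.frame(x = factor(c("1", "2", "abc", "4")))

# Factor -> character -> numeric; "abc" becomes NA
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Mean of the values that could be converted
m <- mean(nums, na.rm = TRUE)

# Fill the gaps and keep the result in a new column
nums[is.na(nums)] <- m
df$new.x <- nums
print(df)
```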
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
# This function tries to guess column type. Since all columns come as
# characters, it first tries to see if x == "TRUE" or "FALSE". If
# not so, it tries to coerce vector into integer. If that doesn't
# work it tries to see if there's a ' \" ' in the vector (meaning a
# column with character), it uses that as a result. Finally if nothing
# else passes, it means the column type is numeric, and it calculates
# the mean of that. The end.
# browser()
# try if logical
if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
# try if integer
try.int <- strtoi(x)
if (all(!is.na(try.int))) return(try.int[1])
# try if character
if (any(grepl("\\\"", x))) return(x[1])
# what's left is numeric
mean(as.numeric(as.character(x)), na.rm = TRUE)
# a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion

Resources