I used the following code to try to replace each variable's values below the bottom 2.5% and above the top 97.5% with specific values. You can run the code yourself; it loads an openly available data file.
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
fun <- function(x){
  quantiles <- quantile(x, c(.025, .975))
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
fun(credit)
But the following error message appeared:
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
What's the problem? I'd be happy for any help!
+Additional comment
I found that the above function does not work on a data frame but only on a vector.
I can change the outliers of each variable in the data file with the following code:
credit$Duration.of.Credit..month. <- pmax(quantile(credit$Duration.of.Credit..month.,.025),
pmin(credit$Duration.of.Credit..month., quantile(credit$Duration.of.Credit..month.,.975)))
However, my data file has so many variables that entering this code one variable at a time is inconvenient.
So how can I replace the outliers of every variable with a specific value, rather than using pmax and pmin for each one?
There's actually nothing wrong with your function as long as you apply it to a single column. I'd use mutate_at, or mutate_all if you really want to apply it to all columns, from the dplyr package. Something like this:
library(dplyr)
credit_trunc <- credit %>%
  mutate_at(vars(Credit.Amount, Creditability), fun)
or
credit_trunc <- credit %>%
  mutate_all(fun)
or if you also have columns of another type (e.g. factor, character) in your data frame, you can use:
credit_trunc <- credit %>%
  mutate_if(is.numeric, fun)
This will give you back the data frame with the chosen / all / all numeric columns modified as you wanted.
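If you'd rather stay in base R, lapply can do the same job (a minimal sketch, assuming every column of credit is numeric, as in this data set):
credit_trunc <- credit
credit_trunc[] <- lapply(credit_trunc, fun)  # apply fun column by column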
Related
I'm currently working with a dataframe "dat." I'm trying to calculate a score using columns 69-88 (if there are values in any of those columns, then add them together and put the result in a new column called "score").
This is the code I have now:
dat$score <- 0
for (num in 69:88){
dat$score[!is.na(dat[,num])] <- dat$score+dat[,num]
}
This gives me a column where some rows show the correct score, but other rows return "NA". I also have 20 warning messages that look like this:
1: In dat$score[!is.na(dat[, num])] <- dat$score + ... :
number of items to replace is not a multiple of replacement length
Why is my code working for some rows and not for others, and why am I getting this error?
Are you looking for the rowSums() function? You just have to add the argument na.rm=TRUE. (As for the why: the left-hand side dat$score[!is.na(dat[,num])] selects only some rows while the right-hand side dat$score + dat[,num] is full length, hence the recycling warnings; and adding a column that contains NA turns the running total into NA.)
A solution with dplyr:
library(dplyr)
dat %>% mutate(score=rowSums(across(69:88), na.rm=TRUE))
Or with base R
dat$score<-rowSums(dat[, 69:88], na.rm=TRUE)
Use apply(); it's usually quicker than a for loop:
dat$score <- apply(dat[, 69:88], 1, sum, na.rm = TRUE)
summary(standard_airline)
#outlier treatment
outFix <- function(x){
quant <- quantile(x,probs = c(.25,.75))
h <- 1.5*IQR(x,na.rm = T)
x[x<(quant[1]-h)] <- quant[1]
x[x>(quant[2]+h)] <- quant[2]
}
v <- colnames(airline[,-1])
data2 <- lapply(v,outFix)
Error - Error in (1 - h) * qs[i] : non-numeric argument to binary operator
I couldn't figure out why this error occurs, although the logic seems right to me. Is there any way in R to pass multiple columns of a dataset to a particular function? Here I want to pass every column except ID to fix the outliers.
Problem
The issue you are encountering is that v is a character vector of column names, while your function outFix expects a numeric vector. So what your lapply code is actually doing is something like outFix("Balance"): it tries to compute quantiles and IQRs on a string, which is why you're getting the error.
quantile("Balance")
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
Solutions
In the following code replace df with airline for your specific data.
In base R:
df[,-1] <- lapply(df[, -1], function(x) outFix(as.numeric(x))) # exclude first column
Or using your code:
df[, v] <- lapply(df[, v], function(x) outFix(as.numeric(x)))
Using dplyr you can apply your function to every column and except ID with:
library(dplyr)
df %>%
dplyr::mutate_at(dplyr::vars(-ID), ~ outFix(as.numeric(.))) # remove ID by name
df %>%
dplyr::mutate_at(-1, ~ outFix(as.numeric(.))) # remove ID by column position
This makes sure that all your columns are numeric before being passed to your function outFix.
If you're certain that all of your columns are numeric ahead of time then you don't need the as.numeric call, but it can be good to have just in case.
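Note one more caveat: as posted, outFix never returns x, so the function invisibly returns the value of its last assignment, quant[2]. For the solutions above to behave as intended, the function needs a final x; a corrected sketch (also adding na.rm = TRUE to quantile, in case of missing values):
outFix <- function(x){
  quant <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (quant[1] - h)] <- quant[1]  # cap low outliers at Q1
  x[x > (quant[2] + h)] <- quant[2]  # cap high outliers at Q3
  x  # return the modified vector
}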
I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean, but the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that we want to fill the NA elements with the corresponding column mean. na.aggregate uses mean as its default fun.aggregate.
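A minimal example (with a made-up two-column data frame):
library(zoo)
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
df1[] <- na.aggregate(df1)  # NAs become the column means: 2 in 'a', 3 in 'b'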
Can't see your data, but probably something like this? The vector needed to be initialized. (Better to avoid loops in R when you can...)
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()  # initialize an empty vector first
for(i in 1:153){
  if(!is.na(myDataFrame[i,1])){  # keep only the non-missing values
    New.Vec <- c(New.Vec, myDataFrame[i,1])
  }
}
print(mean(New.Vec))
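And, following that advice about avoiding loops, the same result can be had in one line (equivalent to the loop above):
mean(myDataFrame[, 1], na.rm = TRUE)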
Trying to impute missing values in all numeric columns using this loop:
for(i in 1:ncol(df)){
if (is.numeric(df[,i])){
df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
}
}
When data.table package is not attached then code above is working as it should. Once I attach data.table package, then the behaviour changes and it shows me the error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere but with no success. In fact, it doesn't even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm = TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
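For reference, a small reproducible check of the set() approach above (made-up data, all columns numeric):
library(data.table)
df <- data.table(x = c(1, NA, 3, NA), y = c(10, 20, NA, 40))
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm = TRUE))
}
df  # x becomes 1, 2, 3, 2; y's NA becomes 23.33 (the mean of 10, 20, 40)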
An alternative answer to this question, which I came up with while working on a similar problem at a larger scale. One might be interested in avoiding for loops by using the [.data.table method:
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation:
impute_na <- function(x, val = mean, ...){
  if(!is.numeric(x)) return(x)  # leave non-numeric columns untouched
  na <- is.na(x)
  if(is.function(val))
    val <- val(x[!na])  # e.g. the mean of the non-missing values
  if(!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}
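A quick sanity check on a plain vector: impute_na(c(1, NA, 3)) fills the NA with the mean, giving c(1, 2, 3), while impute_na(c(1, NA, 3), val = 0) gives c(1, 0, 3).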
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of the example we'll overwrite using <-:
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns and keep any non-numeric columns as is. If we wished to impute another value (like... 42 or whatever), or if we have some grouping variable within which the mean should be computed, that can be handled as well:
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
These would impute 42 and the within-group mean, respectively.
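A quick illustration with made-up data (the names group and val are just placeholders for the example):
library(data.table)
DF <- data.table(group = c("a", "a", "b", "b"), val = c(1, NA, 10, NA))
DF <- DF[, lapply(.SD, impute_na), by = group]
DF  # val becomes 1, 1, 10, 10 -- the within-group means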
I tried searching for the error I am getting while using the "mean" function in R 3.1.2.
Purpose: Calculate Mean of datasets
Used Functions: sapply, summary to calculate mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem Faced: Now, I am trying to use the "mean" function to calculate the mean of the complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
The warning that popped up:
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning comes up? I tried to remove NAs (missing values) using complete.cases.
The warning appears because testingnew is still a data frame, and mean() needs a numeric or logical vector, so refer to a single column.
# To eliminate missing values (! means "not"):
testingnew <- subset(data, !(is.na(data)))
#Choose a column to calculate the mean:
#Make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NA's randomly into the df
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)
Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)