How can I create the following function in R? - r

I need to remove some outliers from two variables in my dataset. My idea is to replace those outliers with the value of Q3 +/- 1.5*IQR. Is there a function available to do this, or how can I create a function that replaces the values of the observations that exceed Q3 + 1.5*IQR with the value of Q3 + 1.5*IQR itself? Thank you in advance.
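One possible approach is a small helper that caps values at the boxplot fences. This is only a sketch: the name cap_outliers and the multiplier argument k are made up for illustration, and the lower fence is assumed to be Q1 - 1.5*IQR (the question only spells out the upper fence, Q3 + 1.5*IQR).

```r
# Hypothetical helper (not from any package): replace values beyond the
# fences with the fence values themselves. NA values are left as NA.
cap_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr   # assumed lower fence: Q1 - k * IQR
  upper <- q[2] + k * iqr   # upper fence from the question: Q3 + k * IQR
  pmin(pmax(x, lower), upper)
}

x <- c(1, 2, 3, 4, 5, 100)
capped <- cap_outliers(x)
capped  # the value 100 is replaced by the upper fence
```

To apply it to two columns of a data frame, something like df[c("a", "b")] <- lapply(df[c("a", "b")], cap_outliers) should work, where df, a and b are placeholder names.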

Related

How can I see multiple variable's outlier in one boxplot using R?

I am a newbie to R and have a question. To check a variable for outliers we generally use:
boxplot(train$rate)
Here rate is a variable in my dataset and train is the name of the dataset. But when I have many variables, say 100 or 150, it is very time-consuming to check each variable's outliers one by one. Is there a function that shows the outliers of all 100 variables in one boxplot?
If yes, which function can be used to remove the outliers of all those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2 <- train       # Copy the old dataset
outvalue <- list()    # Create two empty lists
outindex <- list()
for (i in seq_len(ncol(train2))) {                        # For every column in your dataset
  outvalue[[i]] <- boxplot(train2[, i])$out               # Plot and get the outlier values
  outindex[[i]] <- which(train2[, i] %in% outvalue[[i]])  # Get the outlier indices
  train2[outindex[[i]], i] <- NA                          # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you only want the outliers and not the plots, you could look into other outlier functions. The extremevalues package has a function that takes a different approach to identifying outliers and doesn't require a plot.
This uses the getOutliers function from the extremevalues package:
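library(extremevalues)
outRight <- list()
outLeft <- outRight
for (i in seq_len(ncol(train2))) {
  out <- getOutliers(train2[, i])   # Detect outliers once per column
  outRight[[i]] <- out$iRight       # Indices of outliers on the right
  outLeft[[i]] <- out$iLeft         # Indices of outliers on the left
  train2[outRight[[i]], i] <- NA
  train2[outLeft[[i]], i] <- NA
}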
The function boxplot returns a value. If you see the Value section of its help page you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!(train$rate %in% bp$out)] # to remove the outliers (also safe when there are none)
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.
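If the plots themselves are not needed, base R's boxplot.stats applies the same boxplot rule without drawing anything. The following is a minimal sketch under stated assumptions: the small train data frame is invented for illustration, and non-numeric columns are simply skipped.

```r
# Made-up example data; in practice 'train' would be your own dataset.
train <- data.frame(
  rate  = c(1, 2, 3, 4, 100),
  score = c(10, 11, 12, 13, 14),
  grp   = letters[1:5]
)

# Only numeric columns can be checked with the boxplot rule.
num_cols <- vapply(train, is.numeric, logical(1))

# Outlier values per numeric column, computed without plotting.
out_values <- lapply(train[num_cols], function(col) boxplot.stats(col)$out)

# Copy the data and set the flagged values to NA.
train2 <- train
for (nm in names(out_values)) {
  train2[[nm]][train2[[nm]] %in% out_values[[nm]]] <- NA
}
```

To look at all numeric variables in one figure, boxplot(train[num_cols]) draws one box per column, although with 100 or more variables on different scales the plot may be hard to read.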

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two csv files. I simply use the following but get errors. Consider that each csv file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error: 'x' must be numeric
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical variable in x. You can first separate that categorical variable from x and then use the remaining numeric columns in the cor() function.
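The two suggestions above can be combined: drop the non-numeric columns first, then correlate the matching columns pairwise. This is a sketch with simulated data standing in for the two csv files; the column names a, b and label are invented, and it assumes both data frames have the same columns in the same order.

```r
set.seed(1)
# Simulated stand-ins for the two csv files (illustrative only).
first_values  <- data.frame(a = rnorm(20), b = rnorm(20), label = sample(letters, 20))
second_values <- data.frame(a = rnorm(20), b = rnorm(20), label = sample(letters, 20))

# Keep only the numeric columns; categorical ones trigger "'x' must be numeric".
numeric_cols <- names(first_values)[vapply(first_values, is.numeric, logical(1))]

# Pairwise Spearman correlation of corresponding columns.
correlations <- mapply(
  cor,
  first_values[numeric_cols],
  second_values[numeric_cols],
  MoreArgs = list(method = "spearman")
)
correlations  # one coefficient per column, named after the column
```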

using a function with lapply to create a column and match values

I have two datasets, H and G. Each has a column named 'diff' that, as the name suggests, holds the difference between two columns within that dataset. I used lapply to calculate the percentage for each dataset (I have more datasets than H and G, so I would like to calculate the percentage of the two columns in each dataset). lapply gives me the output, but it doesn't create the "perc" column in the datasets that pass through it. What am I doing wrong here?
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
dsay<-list(H,G)
lapply(dsay,function(x)x$perc<-round((x$diff/x$X10)*100,1))
Extension of this question:
once I have the percent differences as columns using:
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
H$perc<-round((H$diff/H$X10)*100,1)
G$perc<-round((G$diff/G$X10)*100,1)
I generated a plot using:
xyplot(X8+X9+X10~X1,H,type=c('p','l','g'),
col = c('yellow', 'green', 'blue','red'),
ylab='Count',layout=c(3, 1),
xlab=paste("H",'difference',min(pmin(H$perc, na.rm = TRUE),na.rm=TRUE),
'% change count'))
Never mind the plot it will generate. What I'm trying to do is also display the corresponding value from the "diff" column along with the lowest percentage difference (which is what the min function computes). I've tried using "match" in vain. Could someone help please?
If we need the changes to reflect in the dataframe objects as well, list2env or assign can be used. But, I would do all the computations within the list itself.
list2env(lapply(mget(c('H','G')), function(x)
{x$perc<-round((x$diff/x$X10)*100,1);x}), envir=.GlobalEnv)
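The reason the original lapply call adds no column is that the anonymous function modifies a local copy and returns only the assigned vector, so the function has to return x itself. Below is a sketch of keeping everything inside the list, with small hand-made data frames in place of the random ones; it also uses which.min, as one possible way (not confirmed by the original answer) to look up the "diff" value that belongs to the lowest "perc".

```r
# Small hand-made stand-ins for H and G (illustrative only).
H <- data.frame(X9 = c(4, NA, 10, 6), X10 = c(8, 5, 5, 12))
G <- data.frame(X9 = c(2, 3, NA, 9),  X10 = c(4, 6, 7, 3))
H$diff <- H$X10 - H$X9
G$diff <- G$X10 - G$X9

dsay <- list(H = H, G = G)

# Return the whole data frame so the new column is kept.
dsay <- lapply(dsay, function(x) {
  x$perc <- round((x$diff / x$X10) * 100, 1)
  x
})

# Row of the lowest percentage in H (which.min skips NA values),
# and the corresponding value in the diff column.
i <- which.min(dsay$H$perc)
lowest_perc <- dsay$H$perc[i]
matching_diff <- dsay$H$diff[i]
```

The two values could then be pasted into the axis label, for example with paste("H difference", matching_diff, "(", lowest_perc, "% change )").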

Remove a variable value from the list of possible values

I am using the dataset that can be accessed with the following command - load(url("http://bit.ly/dasi_gss_data"))
When I run the query table(gss$premarsx), it returns a column called Other with count 0. When I plot a graph of the same variable (premarsx), there is a column Other with zero height. Is there a way to remove the variable value Other from the variable definition so that it does not appear in the results of any queries/plots?
You can pass it through the factor() function to have it pick up the present levels:
gss$premarsx <- factor(gss$premarsx)
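Base R also has droplevels(), which removes unused levels from a factor and should give the same result here. A small sketch with a made-up factor standing in for gss$premarsx (the level names are illustrative, not taken from the dataset):

```r
# Made-up factor with an unused level "Other".
premarsx <- factor(
  c("Always Wrong", "Not Wrong At All", "Always Wrong"),
  levels = c("Always Wrong", "Not Wrong At All", "Other")
)

table(premarsx)            # still lists Other with count 0

cleaned <- droplevels(premarsx)
table(cleaned)             # Other no longer appears
```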

Calculating column totals and then sorting on the results in R

Please can anyone offer guidance on how to (1) calculate column totals and then (2) sort on the resultant totals in R?
Everything I've tried so far to total the columns (e.g. colsums() and sapply()) returns the resultant totals as a vector (e.g. ABOUT 4022), and I cannot find any information on how I can split this into the column header ABOUT and the column value 4022 and then sort both the header and the value on the column value.
Note that the function is called colSums (not colsums).
Use this with sort. Or if you want to order the columns use order:
colSums(mtcars)
sort(colSums(mtcars))
mtcars[ ,order(colSums(mtcars))]
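colSums returns a named vector, so the header and the total can be split with names() and unname(). A minimal sketch that builds a two-column table and sorts it by total; the names totals_df, column and total are chosen here for illustration:

```r
totals <- colSums(mtcars)

# Split the named vector into a header column and a value column.
totals_df <- data.frame(
  column = names(totals),
  total  = unname(totals)
)

# Sort both columns together on the total, largest first.
totals_df <- totals_df[order(totals_df$total, decreasing = TRUE), ]
head(totals_df)
```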
