remove outliers from multiple columns in R

I used the code below to identify outliers in different columns:
outliers_x1 <- boxplot(mydata$x1, plot=FALSE)$out
outliers_x4 <- boxplot(mydata$x4, plot=FALSE)$out
outliers_x6 <- boxplot(mydata$x6, plot=FALSE)$out
Now, how can I remove all of those outliers from the dataset in one step?

This will set any outlier values to NA and then, optionally, remove all rows where any column contains an outlier. It works with an arbitrary number of columns and uses data.table for convenience.
library(data.table)
library(matrixStats)
##
# create sample data
#
set.seed(1)
dt <- data.table(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
##
# incorporate possible outliers
#
dt[sample(100, 5), x1:=10*x1]
dt[sample(100, 5), x2:=10*x2]
dt[sample(100, 5), x3:=10*x3]
##
# you start here...
# remove all rows where any column contains an outlier
#
indx <- sapply(dt, \(x) !(x %in% boxplot(x, plot=FALSE)$out))
dt[as.logical(rowProds(indx))]
In the above, indx is a matrix with three logical columns. Each element is TRUE unless the corresponding column contains an outlier in that row. We use rowProds(...) from the matrixStats package to multiply (effectively &) the three columns together within each row. Unfortunately this converts everything to numeric (1, 0), so we have to convert back to logical to use the result as a row index into dt.
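If you would rather avoid matrixStats, an equivalent base-R alternative (a sketch, not part of the original answer) is to keep only the rows whose count of outlier flags is zero:
# same indx as above; keep a row only if none of its entries are FALSE
dt[rowSums(!indx) == 0]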
##
# replaces outliers with NA in each column
#
dt.melt <- melt(dt[, id:=seq(.N)], id='id')
dt.melt[, ol:=(value %in% boxplot(value, plot=FALSE)$out), by=.(variable)]
dt.melt[(ol), value:=NA]
result <- dcast(dt.melt, id~variable)[, id:=NULL]
##
# remove all rows where any column contains an outlier
#
na.omit(result)
In the code above we add an id column, then melt(...) so all other columns are in one column (value) with a second column (variable) indicating the original source column. Then we apply the boxplot(...) algorithm group-wise (by variable) to produce an ol column indicating an outlier. Then we set any value corresponding to ol == TRUE to NA. Then we re-convert to your original wide format with dcast(...) and remove the id.
It's a bit roundabout, but this melt / process / dcast pattern is common when processing multiple columns like this.
Finally, na.omit(result) will remove any rows that have NA in any of the columns. If that is all you want, it's simpler to use the first approach.
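Applying the same idea directly to the columns named in the question, here is a minimal base-R sketch, assuming your data frame is mydata and the columns of interest are x1, x4 and x6:
cols <- c("x1", "x4", "x6")
# TRUE where the value is NOT a boxplot outlier for its own column
keep <- sapply(mydata[cols], function(x) !(x %in% boxplot(x, plot = FALSE)$out))
# keep only the rows with no outlier in any of the selected columns
mydata_clean <- mydata[rowSums(!keep) == 0, ]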

Related

How do I use one column to gate select rows of another column in r?

I have a dataset with a column of 1's and 0's and another column with double values. I want to make a third column that contains the data in each of the rows in the second column that corresponds to a 1 in the first column. I have no idea how to do this and googling for this has been a nightmare. How do I do this?
You can do this in a one-liner with ifelse. Assuming your data frame is called df, with the 1/0 values in col1 and the doubles in col2, the values corresponding to the zeros will be NA:
df$col3 <- ifelse(df$col1, df$col2, NA)
We can also do this in base R in a single line using indexing. We create a logical vector i1 from 'col1' and use it as an index to create the new column. By default, the positions where 'i1' is FALSE will be NA:
i1 <- as.logical(df1$col1)
# // or
# i1 <- df1$col1 == 1
df1$col3[i1] <- df1$col2[i1]
Or as a single line
df1$col3[as.logical(df1$col1)] <- df1$col2[as.logical(df1$col1)]
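A quick illustration on made-up data (df is the assumed name from the first answer):
df <- data.frame(col1 = c(1, 0, 1, 0), col2 = c(2.5, 3.1, 4.7, 5.9))
df$col3 <- ifelse(df$col1, df$col2, NA)
df
#   col1 col2 col3
# 1    1  2.5  2.5
# 2    0  3.1   NA
# 3    1  4.7  4.7
# 4    0  5.9   NA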

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio, and I need to create a vector that stores the first 100 rows of the CSV file the programme reads. However, despite all my attempts, my variable v1 ends up becoming a data frame instead of an int vector. May I know what I can do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment shows (screenshot not reproduced here): v1 is stored as a data frame rather than an int vector.
cup_data_small is a tibble, a modified version of a data frame with slightly different rules designed to avoid some common quirks/inconsistencies of standard data frames. E.g. in a standard data frame, df[, c("a")] gives you a vector, while df[, c("a", "b")] gives you a data frame - you're using the same syntax, so arguably they should give the same type of result.
To get just a vector from a tibble, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]

How to apply this code to all cells of dataframe in R

This line of code works on the f_name column of my data frame, keeping only the rows where f_name has at most 100 characters, but I want to apply the condition to all columns.
How do I do it?
subset(m, nchar(as.character(f_name)) <= 100)
If your data.frame is named dat, try the following.
It first creates a logical index inx with value TRUE for each column whose elements all have fewer than 100 characters. It then subsets the original data.frame, keeping only those columns.
inx <- sapply(dat, function(x) all(nchar(as.character(x)) < 100))
new_dat <- dat[which(inx)]
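If instead you want to drop rows where any cell exceeds 100 characters, which is closer to what the original subset() call does for f_name, here is a sketch along the same lines (still assuming the data.frame is named dat):
inx_rows <- apply(dat, 1, function(x) all(nchar(as.character(x)) <= 100, na.rm = TRUE))
new_dat <- dat[inx_rows, ]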

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
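A quick illustration of what the rowSums() comparison is doing, using the sample data generated above (data == 1 is a logical matrix, and rowSums counts the TRUEs in each row):
head(data == 1)                  # TRUE where a column equals 1
head(rowSums(data == 1))         # how many of the three columns equal 1 in each row
data[rowSums(data == 1) == 3, ]  # rows where all three columns are 1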
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]), ])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the indices of the rows and the columns where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row numbers where a 1 is found across the variables. The tabulate function is then applied to these row values; it returns a vector where each element represents a row, and rows without a 1 are filled in with 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1])
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")

sum different columns in a data.frame

I have a very big data.frame and want to sum the values in every column.
So I used the following code:
sum(production[,4],na.rm=TRUE)
or
sum(production$X1961,na.rm=TRUE)
The problem is that the data.frame is very big, and I only want to sum 40 particular columns (with different names) of my data.frame. I don't want to list every single column. Is there a smarter solution?
At the end I also want to store the sum of every column in a new data.frame.
Thanks in advance!
Try this:
colSums(df[sapply(df, is.numeric)], na.rm = TRUE)
where sapply(df, is.numeric) is used to detect all the columns that are numeric.
If you just want to sum a few columns, then do:
colSums(df[c("X1961", "X1962", "X1999")], na.rm = TRUE)
res <- unlist(lapply(production, function(x) if(is.numeric(x)) sum(x, na.rm=T)))
will return the sum of each numeric column.
You could create a new data frame based on the result with
data.frame(t(res))
If you don't want to include every single column, you somehow have to indicate which ones to include (or alternatively, which ones to exclude):
colsInclude <- c("X1961", "X1962", "X1963") # by name
# or #
colsInclude <- paste0("X", 1961:2003) # by name
# or #
colsInclude <- c(10:19, 23, 55, 147) # by column number
To put those columns in a new data frame, simply use [ ] as you've done:
newDF <- oldDF[, colsInclude]
To sum up each column, simply use colSums
sums <- colSums(newDF, na.rm=T)
# or #
sums <- colSums(oldDF[, colsInclude], na.rm=T)
Note that sums will be a vector, not necessarily a data frame.
You can make it into a data frame using as.data.frame
sums <- as.data.frame(sums)
# or, to include the data frame from which it came #
sums <- rbind(newDF, "totals"=sums)
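Putting it together on a small mock data frame (the column names X1961 and X1962 are taken from the question; the values are made up purely for illustration):
production <- data.frame(X1961 = c(1, 2, NA), X1962 = c(3, NA, 4), Area = c("a", "b", "c"))
cols <- paste0("X", 1961:1962)
sums <- colSums(production[cols], na.rm = TRUE)
as.data.frame(t(sums))
#   X1961 X1962
# 1     3     7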
