Replacing outliers from multiple columns in a dataframe containing NAs using R - r

I am trying to replace outliers in a big dataset (more than 3000 columns and 250000 rows) with NA. I want to replace observations that lie more than 3 standard deviations above or below the mean with NA. I got it working column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) < 3*sd(height,na.rm=TRUE),height,NA)
However, I would like to create a function that does this for a subset of columns. To do that, I created a list with the names of the columns whose outliers I want to replace. But it is not working.
Could anyone help me, please?
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
This was my last try:
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Sorry, I am still learning how to program in R. Thank you very much.
Cheers.

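I would look into using scale, which centers each column on its mean and divides by its standard deviation, ignoring NAs. The scaled values are used only as a mask here, because writing them back would overwrite the original data with z-scores. The following code should work:
cols <- c("age","height","mark")
# get z-scores for a subset of the columns
data.scale <- scale(data[ ,cols])
# flag outliers; values that were already NA are not flagged
outlier <- !is.na(data.scale) & abs(data.scale) > 3
# set outliers to NA in the original data set
data[ ,cols][outlier] <- NA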
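Since the question asks for a function that works on a chosen subset of columns, here is a minimal sketch of one way to write it with lapply. The name replace_outliers and the argument k are hypothetical (they do not come from the question), and it flags values strictly more than k standard deviations from the column mean.

```r
# Hypothetical helper: replace_outliers() and k are illustrative names.
replace_outliers <- function(df, cols, k = 3) {
  df[cols] <- lapply(df[cols], function(x) {
    m <- mean(x, na.rm = TRUE)   # column mean, ignoring NAs
    s <- sd(x, na.rm = TRUE)     # column sd, ignoring NAs
    # flag values more than k sd from the mean; existing NAs stay NA
    x[!is.na(x) & abs(x - m) > k * s] <- NA
    x
  })
  df
}

# Toy data with one obvious outlier in column v
toy <- data.frame(id = 1:20,
                  v  = c(rep(10, 19), 1000),
                  w  = c(1:19, NA))
out <- replace_outliers(toy, c("v", "w"))
```

With the nine-row example from the question nothing would be flagged, because in a sample of n values no point can be more than (n-1)/sqrt(n) standard deviations from the mean, which is below 3 for n < 11.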

Related

subset dataframe based on certain threshold in r

I have a correlation dataframe with 381717 rows and 450 columns and no NA values, and I want to subset this dataframe to all correlations with an absolute value > 0.6. I have tried several ways of using lapply and sapply on all rows and columns to subset my dataframe, but I end up getting NAs, even though I can see that a few values should satisfy this condition. If I could get any leads on how to do this, I would be really grateful.
I know this seems like an easy issue but I am somehow unable to get the right subsetting done and would like your help!
Thanks in advance!
Best regards
Expected output :
x1 = seq(1:7)
x2 = c(2,4,8,5,1,2,3)
y1 = c(9,6,5,4,8,6,4)
y2 = c(1,7,4,5,1,2,2)
df = data.frame(x1,x2,y1,y2)
corr_df = data.frame(cor(df))
corr_df$var = row.names(corr_df)
corr_df1 = reshape2::melt(corr_df, id.vars = "var", value.name = "Corr")
corr_df1[abs(corr_df1$Corr) > 0.6,]
I have created a dummy dataset and done the subset of correlation dataframe. It might work for you.
Considering a dataframe of correlation values:
corr.vals<-data.frame(x1=runif(5,0,1),
x2=runif(5,0,1),
x3=runif(5,0,1),
x4=runif(5,0,1),
x5=runif(5,0,1))
row.names(corr.vals)<-c("y1","y2","y3","y4","y5")
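Note that complete.cases() only reports which rows are free of NAs, so it cannot select values by threshold. To keep the values whose absolute value is > 0.6, while keeping row and column names, you can mask everything else with NA:
values_06<-corr.vals
values_06[abs(values_06)<=0.6]<-NA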
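For a table this large, a long list of the qualifying (row, column, value) triples may be easier to work with than a masked copy. The following is a rough sketch using which() with arr.ind = TRUE; the function name strong_pairs and the toy matrix are made up for illustration.

```r
# Hypothetical helper: strong_pairs() is an illustrative name.
strong_pairs <- function(cm, cutoff = 0.6) {
  cm <- as.matrix(cm)
  # row/column positions of every entry whose absolute value exceeds the cutoff
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(row  = rownames(cm)[idx[, "row"]],
             col  = colnames(cm)[idx[, "col"]],
             corr = cm[idx],
             stringsAsFactors = FALSE)
}

# Small made-up correlation matrix
cm <- matrix(c( 1.0, 0.7, -0.9,
                0.7, 1.0,  0.1,
               -0.9, 0.1,  1.0),
             nrow = 3,
             dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
res <- strong_pairs(cm)
```

On a square correlation matrix the diagonal (all 1s) is included in the result; filter with res[res$row != res$col, ] if that is not wanted.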

Divide specific values in a column by 1000

I need to divide certain values in a column by 1000 but do not know how to go about it
I attempted to use this function initially:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the dataframe and weight the column just to see if it would at least divide the entire column by 1000 but no such luck. It did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)
If 'Updins' is the dataset object name, we can select columns with [ rather than (, because ( is used to call a function:
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000
Here is a fake data set where every row is divided by 1000. I also included a for-loop as one potential way to do this only for certain rows. Since you didn't specify how you were choosing them, I did it for any rows with a value greater than 1,005, and a second version that divides by 1,000 only if the ID is an odd number. If you have NAs, you may need an additional if statement to deal with them; the third and last for-loop shows an example.
ID<-1:10
grams<-1000:1009
df<-data.frame(ID,grams)
df$kg<-as.numeric(df$grams)/1000
df[,"kg"]<-as.numeric(df[,"grams"])/1000 #will do the same thing as the line above
for(i in 1:nrow(df)){
if(df[i,"grams"]>1005){df[i,"kg3"]<-as.numeric(df[i,"grams"])/1000}
}#if the weight is greater than 1,005 grams.
for(i in 1:nrow(df)){
if(df[i,"ID"] %in% seq(1,101, by = 2)){df[i,"kg4"]<-as.numeric(df[i,"grams"])/1000}
}#if the id is an odd number
df[3,"grams"]<-NA#add an NA to the weight data to test the next loop
for(i in 1:nrow(df)){
if(is.na(df[i,"grams"]) & (df[i,"ID"] %in% seq(1,101, by = 2))){df[i,"kg4"]<-NA}
else if(df[i,"ID"] %in% seq(1,101, by = 2)){df[i,"kg4"]<-as.numeric(df[i,"grams"])/1000}
}#Same as above, but works with NAs
Hard without data to work with or expected output, but here's a skeleton that you could probably use:
library(dplyr) #The package you'll need, for the pipes (%>% passes the object on its left into the function on its right)
test <- Updins %>% #Using the dataset Updins
mutate(weight = ifelse(as.numeric(weight) > 199, #CHANGING weight variable. Where weight > 199...
as.character(as.numeric(weight)/1000), #... divide a numeric version of the weight variable by 1000, but keep as a character...
weight)) #OTHERWISE, keep the weight variable as is
head(test)
I kept the new value as a character, because it seems that your weight variable is a character variable based on some of the warnings ('NAs introduced by coercion') that you're getting.
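If the column is already numeric, the same conditional division can be done in base R without a loop by using a logical index. This is only a sketch on made-up data; the threshold of 1005 is borrowed from the for-loop answer above and is not something the question specifies.

```r
# Made-up data: weights in grams, with one missing value
df <- data.frame(ID = 1:5,
                 weight = c(1000, 1003, NA, 1007, 1009))

# TRUE only for non-missing weights above the (assumed) threshold
big <- !is.na(df$weight) & df$weight > 1005

# divide just those rows by 1000; all other rows are left untouched
df$weight[big] <- df$weight[big] / 1000
```

Guarding with !is.na() keeps the index free of NAs, so rows with a missing weight simply stay NA.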

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally I want to do that using Control Structures (for loops and if-else loops)
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
I can't see your data, but probably something like this? The vector needed to be initialized before the loop. It is better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
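If the exercise really requires control structures, another option is to keep a running total and a count rather than growing a vector inside the loop. A small sketch using the reproducible 'y' from the question (the variable names total, n and loop_mean are my own):

```r
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y, nrow = 15)

total <- 0   # running sum of the non-missing values
n <- 0       # how many non-missing values were seen

for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    n <- n + 1
  }
}

loop_mean <- total / n
print(loop_mean)   # 7.5, the same as mean(y, na.rm = TRUE)
```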

How do you drop all rows from a dataframe where the sum of a range of columns is 0?

I have a dataframe with the columns
experimentResultDataColumns - faceGenderClk - 35 more columns ending with Clk - rougeClk - someMoreExperimentDataColumns
I am trying to drop all rows from the dataframe where the sum of the 50 columns from faceGenderClk to rougeClk (inclusive) is 0.
There is data of an online study in the dataframe and the "Clk" columns count how many times the participant clicked a specific slider. If no sliders were clicked, the data is invalid. (It's basically like someone handing you your survey without setting their pen on the paper)
I was able to perform similar logic with a statement like this:
df<-df[!(df$screenWidth < 1280),]
to cut out all insufficiently sized screens, but I am unsure of how to perform this sum operation within that statement. I tried
df <- df[!(sum(df$faceGenderClk:df$rougeClk) > 0)]
but that doesn't work. (I'm not very good at R, I assume it definitely shouldn't work with that syntax)
The expected result is a dataframe which has all rows stripped from it, where the sum of all 50 values in that row from faceGenderClk to rougeClk is 0
EDIT:
data: https://pastebin.com/SLAmkHk5
the expected result of the code would drop the second row of data
code so far:
df <- read.csv("./trials.csv")
SECONDS_IN_AN_HOUR <- 60*60
MILLISECONDS_IN_AN_HOUR <- SECONDS_IN_AN_HOUR * 1000
library(dplyr)
#levels(df$latinSquare) <- c("AlexaF", "SiriF", "CortanaF", "SiriM", "GoogleF", "RobotM") ignore this since I faked the dataset to protect participants' personal data
df<-df[!(df$timeMainSessionTime > 6 * MILLISECONDS_IN_AN_HOUR),]
df<-df[!(df$screenWidth < 1280),]
The accepted answer (as of this edit) solves the problem with:
cols = grep(pattern = "Clk$", names(df), value=TRUE)
sums = rowSums(df[cols])
df <- df[sums != 0, ]
First, get the names of the columns you want to check. Then add up those columns row by row and do your subset.
# columns that end in Clk
cols = grep(pattern = "Clk$", names(df), value = TRUE)
# add them up
sums = rowSums(df[cols])
# subset
df[sums != 0, ]
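If the columns of interest are defined by their position between two named columns rather than by a shared suffix, the range can be looked up with match(). The sketch below uses a tiny made-up data frame (midClk and other are invented column names), not the pastebin data.

```r
# Made-up stand-in for the real data
df <- data.frame(id            = 1:3,
                 faceGenderClk = c(1, 0, 0),
                 midClk        = c(0, 0, 2),
                 rougeClk      = c(0, 0, 0),
                 other         = c(5, 6, 7))

# positions of the first and last column in the range
first <- match("faceGenderClk", names(df))
last  <- match("rougeClk", names(df))

# row-wise click totals over that contiguous block of columns
clk_sums <- rowSums(df[first:last], na.rm = TRUE)

# keep only rows where at least one slider was clicked
kept <- df[clk_sums != 0, ]
```

The na.rm = TRUE is an assumption that a missing click count should be treated as zero clicks; drop it if an NA should propagate instead.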

Repeat a function on a data frame and store the output

I simulated a data matrix containing 200 rows x 1000 columns. It contains 0's and 1's in a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
(sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
(sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
newdata = GEcols
num.cols = 200
num.miss = 10
set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
newdata[set.to.missing] = NA
return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
indata = GEcols
missdata = apply(indata,1,addmissing)
missdata.out = as.data.frame(missdata) #have to get the df back in the right format
missdata.out.t = t(missdata.out)
missdata.new = as.data.frame(missdata.out.t)
missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
datasim2 = matrix(rbinom(200 * 1000,1,probmatrix),
nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities
#Assign column names
cnum = 1:1000
cnum = paste("M",cnum,sep='')
colnames(datasim2) = cnum
#Assign row names
rnum = 1:200
rnum = paste("L",rnum,sep='')
rownames(datasim2) = rnum
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do
return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))
in rep.missing() and return each of these 3 vectors as 3 column bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
I am not sure I understand which part you are stuck on.
If you don't know how to store your results across repetitions, one way is to keep a global variable and, inside your function, assign with <<- instead of <- or =.
x=c()
func = function(i){x <<- c(x,i) }
sapply(1:5,func)
mapply is for repeating a function over multiple input lists or vectors.
You want to repeat your function 500 times, so you can always do
sapply(1:500,func)
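To get the three data frames described in the update without global assignment, one possible approach is to have each repetition return its three frequency vectors, collect the repetitions in a list, and then pull each vector out with sapply. This is a sketch on a smaller made-up data set; one_rep and the dimensions used are assumptions, not the asker's code.

```r
# Hypothetical single repetition: blank out num.miss cells per row,
# then return the three per-row frequencies as columns of a matrix.
one_rep <- function(dat, num.miss = 10) {
  m <- as.matrix(dat)
  for (r in seq_len(nrow(m))) {
    m[r, sample(ncol(m), num.miss)] <- NA
  }
  cbind(allele.0 = rowSums(m == 0, na.rm = TRUE) / rowSums(!is.na(m)),
        allele.1 = rowSums(m == 1, na.rm = TRUE) / rowSums(!is.na(m)),
        miss     = rowMeans(is.na(m)))
}

# Small stand-in for datasim2: 20 rows x 50 columns of 0/1
dat <- as.data.frame(matrix(rbinom(20 * 50, 1, 0.5), nrow = 20))

n.reps <- 5   # use 500 for the real simulation
reps <- lapply(seq_len(n.reps), function(i) one_rep(dat))

# one data frame per statistic, one column per repetition
allele0 <- as.data.frame(sapply(reps, function(r) r[, "allele.0"]))
allele1 <- as.data.frame(sapply(reps, function(r) r[, "allele.1"]))
miss    <- as.data.frame(sapply(reps, function(r) r[, "miss"]))
```

Each resulting data frame has one row per row of the input and one column per repetition, and the input data frame itself is never modified.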
