subset dataframe based on certain threshold in r - r

I have a correlation dataframe with 381717 rows and 450 columns and no NA values, and I want to subset it to all correlations with absolute value > 0.6. I have tried several approaches using lapply and sapply over all rows and columns, but I end up getting NAs, even though I can see that a few values satisfy this condition. If I could get any leads on how to do this, I would be really grateful.
I know this seems like an easy issue but I am somehow unable to get the right subsetting done and would like your help!
Thanks in advance!
Best regards
Expected output :

x1 = 1:7
x2 = c(2,4,8,5,1,2,3)
y1 = c(9,6,5,4,8,6,4)
y2 = c(1,7,4,5,1,2,2)
df = data.frame(x1,x2,y1,y2)
corr_df = data.frame(cor(df))
corr_df$var = row.names(corr_df)
corr_df1 = reshape2::melt(corr_df, id.vars = "var", value.name = "Corr")
corr_df1[abs(corr_df1$Corr) > 0.6,]
I have created a dummy dataset and subset its correlation dataframe on the absolute value. It might work for you.

Considering a dataframe of correlation values:
corr.vals<-data.frame(x1=runif(5,0,1),
x2=runif(5,0,1),
x3=runif(5,0,1),
x4=runif(5,0,1),
x5=runif(5,0,1))
row.names(corr.vals)<-c("y1","y2","y3","y4","y5")
You should be able to select the values with absolute value > 0.6, while keeping row and column names, by setting everything that does not pass the threshold to NA:
values_06<-corr.vals
values_06[abs(values_06)<=0.6]<-NA
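Another option, sketched here in base R, is to ask which() for the row and column positions of the qualifying entries and build a long table from them. This is only a sketch on a small made-up data set; the names corr_mat, idx and strong are illustrative and not from the question.

```r
# Small deterministic example standing in for the large correlation table
df <- data.frame(x1 = 1:7,
                 x2 = c(2, 4, 8, 5, 1, 2, 3),
                 y1 = c(9, 6, 5, 4, 8, 6, 4),
                 y2 = c(1, 7, 4, 5, 1, 2, 2))
corr_mat <- cor(df)

# Row/column positions of every entry whose absolute value exceeds 0.6
idx <- which(abs(corr_mat) > 0.6, arr.ind = TRUE)

# Long-format result: one row per qualifying (row, column) pair
strong <- data.frame(row  = rownames(corr_mat)[idx[, "row"]],
                     col  = colnames(corr_mat)[idx[, "col"]],
                     corr = corr_mat[idx])
strong
```

The diagonal (each variable with itself) always passes the threshold, so you may want to drop rows where row == col afterwards.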

Related

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
can't see your data, but probably like this? the vector needed to be initialized. better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
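If the loop is a requirement, a variant that keeps a running total and a count avoids growing a vector on every iteration. This is just a sketch using the 'y' vector from the question; the names total, n_valid and loop_mean are made up for illustration.

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)

total   <- 0  # running sum of the non-missing values
n_valid <- 0  # how many non-missing values have been seen

for (value in y) {
  if (!is.na(value)) {
    total   <- total + value
    n_valid <- n_valid + 1
  }
}

loop_mean <- total / n_valid
print(loop_mean)
```

The result should agree with mean(y, na.rm = TRUE).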

Subtract a row value from the mean of the corresponding column and then sum the differences together

Here is a reproducible problem set-
c = c(1,2,3,4)
d = c(4,1,2,4)
e = c(2,1,5,4)
f = c(2,3,3,4)
tdf <- data.frame(c,d,e,f)
I can't figure out how to subtract each row value from the mean of the corresponding column, then sum all these differences together for each column and save them.
Basically, I want to compute summation(xi-xavg) for each column. I would really appreciate any help. Thank you.
The apply() family of functions will solve this issue. sapply will apply a function to each column of a data.frame and return the results. So simply pass it the data frame and define the function you want performed on each column:
sapply(tdf, function(x) sum(x-mean(x)))
An option would be to replicate the colMeans to get the dimensions same as that of the original data, get the difference and find the sum of each column with colSums
colSums(tdf - colMeans(tdf)[col(tdf)])
Or another option is to take the transpose of 'tdf', subtract the colMeans from it and then do the rowSums
rowSums(t(tdf) - colMeans(tdf))
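One thing worth knowing: the plain sum of deviations from the mean is always zero (up to floating point error), so all of the above return zeros for every column. If what you are really after is a measure of spread, the absolute or squared deviations may be closer to it. That is a guess at the intent, not something stated in the question; a small sketch with the same data:

```r
tdf <- data.frame(c = c(1, 2, 3, 4),
                  d = c(4, 1, 2, 4),
                  e = c(2, 1, 5, 4),
                  f = c(2, 3, 3, 4))

# Sum of raw deviations: zero for every column by construction
dev_sum <- sapply(tdf, function(x) sum(x - mean(x)))

# Sum of absolute deviations per column
abs_dev_sum <- sapply(tdf, function(x) sum(abs(x - mean(x))))

# Sum of squared deviations per column
sq_dev_sum <- sapply(tdf, function(x) sum((x - mean(x))^2))

dev_sum
abs_dev_sum
sq_dev_sum
```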

Subsetting dataframe rows based on decimals in R?

I am quite new to R and have quite a challenging question. I have a large dataframe consisting of 110,000 rows representing high-resolution data from a sediment core. I would like to select multiple rows based on depth (which is recorded in mm to 3 decimal places). Of course, I do not have the time to go through the entire dataframe and pick the rows that I need. I would like to be able to select rows based on the decimal part of the number and not the digits before the decimal point, i.e. I would like to subset to a dataframe where all the .035 values are returned. I have so far tried using the which() function but had no luck
newdata <- Linescan_EN18218[which(Linescan_EN18218$Position.mm.== .035),]
Can anyone offer any hints/suggestions on how I can solve this problem? Link to the first part of the dataframe csv
Welcome to stack overflow
Can you please further describe what you mean by "had no luck"? Did you get an error message or an empty data.frame?
In principle, your method should work. I have replicated it with simulated data.
n = 100
test <- data.frame(
a = 1:n,
b = rnorm(n = n),
c = sample(c(0.1,0.035, 0.0001), size = n, replace = TRUE)
)
newdata <- test[which(test$c == 0.035),]
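Note that a comparison with == only matches values that are exactly 0.035. If the goal is to match on the fractional part (for example 12.035 or 101.035), one possible sketch is to take the value modulo 1 and compare with a small tolerance, since floating point remainders are rarely exact. The vector depth and the tolerance of 1e-6 are made up for illustration.

```r
# Hypothetical depths in mm, standing in for Linescan_EN18218$Position.mm.
depth <- c(1.035, 12.035, 3.350, 0.035, 7.100)

# Fractional part of each depth
frac <- depth %% 1

# TRUE where the fractional part is .035, within a small tolerance
is_035 <- abs(frac - 0.035) < 1e-6

depth[is_035]
```

The same logical vector can be used to subset the rows of a data frame, e.g. df[is_035, ].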

Replacing outliers from multiple columns in a dataframe containing NAs using R

I am trying to replace outliers from a big dataset (more than 3000 columns and 250000 rows) by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) < 3*sd(height,na.rm=TRUE),height,NA)
However, I would like to create a function to do that for a subset of columns. To do that, I created a list with the names of the columns in which I want to replace the outliers. But it is not working.
Could anyone help me, please?
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
This was my last try:
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Sorry, I am still learning how to program in R. Thank you very much.
Cheers.
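I would look into using scale, which ignores NAs when it computes the column means and standard deviations. The scaled values are only used to find the outliers, so the original values are kept everywhere else. The following code should work:
# columns to screen for outliers
cols <- c("age","height","mark")
# z-scores for those columns (NAs stay NA)
z <- scale(data[ ,cols])
# flag values more than 3 standard deviations from the column mean
out <- !is.na(z) & abs(z) > 3
# set the flagged values to NA in the original data set
data[ ,cols][out] <- NA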
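Since the question asks for a function, here is one way it could look: a small helper applied to the chosen columns with lapply. This is a sketch, not a tested solution for the real data; replace_outliers, toy and cols are hypothetical names, and the toy data are made up so that exactly one value is an outlier.

```r
# Replace values more than k standard deviations from the mean by NA
replace_outliers <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  x[!is.na(x) & abs(x - centre) > k * spread] <- NA
  x
}

# Made-up data: column a has one extreme value, column b has one NA
toy <- data.frame(id = 1:21,
                  a  = c(rep(c(9, 10, 11), length.out = 20), 100),
                  b  = c(rep(c(1, 2, 3), length.out = 20), NA))

# Apply the helper only to the selected columns
cols <- c("a", "b")
toy[cols] <- lapply(toy[cols], replace_outliers)

tail(toy, 3)
```

Keep in mind that with very few observations no value can be 3 standard deviations away from the mean, so a small example may show no replacements at all.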

How can I get each numeric column's mean in one data?

I have data named cluster_1. Its first three columns are nominal variables.
# select the columns based on the clustering results
cluster_1 <- mat[which(groups==1),]
m_cluster_1 <- mean(cluster_1[c(-(1:3))])
With the last statement, I can get the mean over all columns. However, what I want is to attach the mean of each variable (column) to the bottom of that column.
How can I do that? Please let me know.
colMeans() will give you the mean of each column in a data frame or matrix. And rbind() can be used to append the result.
rbind(cluster_1[, -(1:3)], colMeans(cluster_1[, -(1:3)]))
A generalization of what you are doing can be found with the function addmargins. It expects a table or array, so convert the numeric columns with as.matrix first. Try, for example:
cluster_1Means <- addmargins(as.matrix(cluster_1[, -(1:3)]), margin = 1, FUN = mean)
cluster_1Means
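To see the rbind() approach on something concrete, here is a small sketch with made-up data. The contents of cluster_1 and the names num_part and with_means are assumptions for illustration only; the first three columns play the role of the nominal variables.

```r
# Made-up stand-in for cluster_1: three nominal columns, two numeric ones
cluster_1 <- data.frame(id   = c("a", "b", "c"),
                        grp  = c("g1", "g1", "g2"),
                        site = c("s1", "s2", "s3"),
                        v1   = c(1, 2, 3),
                        v2   = c(10, 20, 60))

# Keep only the numeric columns
num_part <- cluster_1[, -(1:3)]

# Append the column means as an extra row at the bottom
with_means <- rbind(num_part, colMeans(num_part))

# Label the new row so it is easy to tell apart from the data
rownames(with_means)[nrow(with_means)] <- "mean"

with_means
```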
