I'm trying to get the row sums for each row of my dataframe but with a condition. I'd like to exclude all the values that are between -1 and 1 after applying log2. I know how to exclude NAs but I'm confused with excluding actual numbers. My dataframe is just numbers, except for the row and column names.
Try
dat1 <- dat
dat1[dat1 >-1 & dat1<1] <- NA
rowSums(dat1, na.rm=TRUE)
If there are no NAs in the dataset, you could set those values to 0 instead and just use rowSums
dat1[dat1 >-1 & dat1<1] <- 0
rowSums(dat1)
data
set.seed(42)
dat <- as.data.frame(matrix(sample(seq(-5,5,by=0.25), 20*5,
replace=TRUE), ncol=5))
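A variation on the same idea, as a minimal sketch: instead of overwriting values in a copy, build a logical mask and multiply by it, so the excluded cells contribute 0 to the sum. The names keep, sums_masked and sums_na are only illustrative, and this assumes the data contain no NAs (NA times FALSE is still NA).

```r
set.seed(42)
dat <- as.data.frame(matrix(sample(seq(-5, 5, by = 0.25), 20 * 5,
                                   replace = TRUE), ncol = 5))

# TRUE where the value lies outside the open interval (-1, 1)
keep <- abs(as.matrix(dat)) >= 1

# Excluded cells are multiplied by FALSE (i.e. 0), so they drop out of the sum
sums_masked <- rowSums(as.matrix(dat) * keep)

# Same numbers as the NA-based approach above
dat1 <- dat
dat1[dat1 > -1 & dat1 < 1] <- NA
sums_na <- rowSums(dat1, na.rm = TRUE)
```

The mask version leaves dat untouched, which can be handy if the original values are needed later.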
I used below codes to identify outliers on different columns:
outliers_x1 <- boxplot(mydata$x1, plot=FALSE)$out
outliers_x4 <- boxplot(mydata$x4, plot=FALSE)$out
outliers_x6 <- boxplot(mydata$x6, plot=FALSE)$out
Now, how can I remove those outliers from the dataset with a single piece of code?
This will set any outlier values to NA, and then optionally remove all rows where any column contains an outlier. It works with an arbitrary number of columns.
Uses data.table for convenience.
library(data.table)
library(matrixStats)
##
# create sample data
#
set.seed(1)
dt <- data.table(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
##
# incorporate possible outliers
#
dt[sample(100, 5), x1:=10*x1]
dt[sample(100, 5), x2:=10*x2]
dt[sample(100, 5), x3:=10*x3]
##
# you start here...
# remove all rows where any column contains an outlier
#
indx <- sapply(dt, \(x) !(x %in% boxplot(x, plot=FALSE)$out))
dt[as.logical(rowProds(indx))]
In the above, indx is a matrix with three logical columns. Each element is TRUE unless the corresponding column contains an outlier in that row. We use rowProds(...) from the matrixStats package to multiply ( & ) the 3 columns together, row by row. Unfortunately this converts everything to numeric (1, 0), so we have to convert back to logical to use the result as an index into dt.
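If you would rather avoid the numeric round trip, the row-wise "and" can also be written in base R. This is only a sketch under a few assumptions: it uses a plain data.frame (df_demo, a made-up name) instead of the data.table, plants one obvious outlier by hand, and calls boxplot.stats(), which is the function boxplot() relies on to compute $out.

```r
set.seed(1)
df_demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df_demo$x1[5] <- 50   # plant an obvious outlier (illustrative only)

# TRUE where the value is NOT flagged as an outlier in its own column
is_ok <- sapply(df_demo, function(x) !(x %in% boxplot.stats(x)$out))

# A row is kept only if it has zero flagged cells; the result is already logical
keep_rows <- rowSums(!is_ok) == 0
cleaned <- df_demo[keep_rows, ]
```

apply(is_ok, 1, all) gives the same index and reads more literally, at the cost of a hidden loop over rows.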
##
# replaces outliers with NA in each column
#
dt.melt <- melt(dt[, id:=seq(.N)], id='id')
dt.melt[, ol:=(value %in% boxplot(value, plot=FALSE)$out), by=.(variable)]
dt.melt[(ol), value:=NA]
result <- dcast(dt.melt, id~variable)[, id:=NULL]
##
# remove all rows where any column contains an outlier
#
na.omit(result)
In the code above we add an id column, then melt(...) so all other columns are in one column (value) with a second column (variable) indicating the original source column. Then we apply the boxplot(...) algorithm group-wise (by variable) to produce an ol column indicating an outlier. Then we set any value corresponding to ol == TRUE to NA. Then we re-convert to your original wide format with dcast(...) and remove the id.
It's a bit roundabout but this melt - process - dcast pattern is common when processing multiple columns like this.
Finally, na.omit(result) will remove any rows which have NA in any of the columns. If that's what you want it's simpler to use the first approach.
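For comparison, the "replace outliers with NA in each column" step can also be sketched without reshaping, by looping over the columns with lapply. This is an alternative to the melt/dcast route, not a restatement of it; df_demo and df_na are made-up names, and the planted outlier is there only so the effect is visible.

```r
set.seed(1)
df_demo <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df_demo$x2[10] <- -40   # plant an obvious outlier (illustrative only)

# Replace each column's outliers with NA; df_na[] keeps the data.frame shape
df_na <- df_demo
df_na[] <- lapply(df_na, function(x) replace(x, x %in% boxplot.stats(x)$out, NA))

# Optionally drop every row that now contains an NA
complete_rows <- na.omit(df_na)
```

Note that na.omit() would also drop rows that had NAs before the outlier step, so this assumes the input had none.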
I have a DF with multiple columns of data. I want to keep all the rows that have a value in column "X". Simply put, I want to remove all the rows from the DF that have NA as a value in column "X". Right now I change the NA in column "X" to 0 and then remove all rows with 0 in column "X" from the DF. This is two steps. Can I do it in just one line/step?
DF[["X"]][is.na(DF[["X"]])] <- 0
DF <- DF[DF$X != 0,]
The result of is.na can be used directly to subset DF. To keep the rows without NA in column X, negate the result of is.na using !.
DF <- DF[!is.na(DF$X),]
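A small sketch of why the direct is.na() index is needed, using a made-up four-row DF: comparing NA with a number yields NA rather than FALSE, so a comparison such as DF$X != 0 cannot drop the NA rows on its own.

```r
DF <- data.frame(X = c(1, NA, 3, NA), Y = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# NA != 0 is NA, and an NA in a row index returns a row filled with NA,
# so the missing rows are not removed here
bad <- DF[DF$X != 0, ]

# Negating is.na() gives a clean TRUE/FALSE index
kept <- DF[!is.na(DF$X), ]
```

Here bad still has four rows (two of them all NA), while kept has only the two rows where X is present.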
I have a dataset with a column of 1's and 0's and another column with double values. I want to make a third column that contains the data in each of the rows in the second column that corresponds to a 1 in the first column. I have no idea how to do this and googling for this has been a nightmare. How do I do this?
You can do this in a one-liner with ifelse. Assuming your data frame is called df, 1 and 0 values in col1, doubles in col2, values corresponding to the zeros are NA:
df$col3 <- ifelse(df$col1, df$col2, NA)
We can also do this in base R using indexing. We first create a logical vector from 'col1' ('i1') and use it as the index when creating the new column. By default, the positions where 'i1' is FALSE will be NA.
i1 <- as.logical(df1$col1)
# or
# i1 <- df1$col1 == 1
df1$col3[i1] <- df1$col2[i1]
Or as a single line
df1$col3[as.logical(df1$col1)] <- df1$col2[as.logical(df1$col1)]
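To see that the two answers agree, here is a minimal sketch on a made-up four-row data frame. The names via_ifelse and via_index are illustrative, and the indexing variant starts from an explicit all-NA vector to make the "everything else is NA" default visible.

```r
df <- data.frame(col1 = c(1, 0, 1, 0), col2 = c(2.5, 3.1, 4.7, 0.2))

# ifelse(): take col2 where col1 is 1, NA elsewhere
via_ifelse <- ifelse(df$col1 == 1, df$col2, NA)

# Indexing: start from an all-NA vector and fill only the selected rows
i1 <- df$col1 == 1
via_index <- rep(NA_real_, nrow(df))
via_index[i1] <- df$col2[i1]

df$col3 <- via_ifelse
```

Both vectors hold 2.5 and 4.7 in rows 1 and 3 and NA in rows 2 and 4.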
As far as I know, missing data (NAs) in a data frame can be substituted by either row- or column-based averages. But what I'm trying to do in R (and I'm not sure if it's possible) is calculate averages for missing cells that are based on both the row and the column in which the missing cell is located. I was wondering if you had any suggestions.
Here is the sample data with NA's:
nr <- 50
mm <- t(matrix(sample(0:4, nr * 15, replace = TRUE), nr))
mm[,c(4,7,12,13)]<-NA
mm[c(3,5,8,9,10,13),]<-NA
Assuming that the OP wanted to replace the NA element based on the row/column averages of that index, we get the row/column index using which with arr.ind=TRUE ('ind'). Get the colMeans and rowMeans of the dataset ('df') subsetted by the columns of 'ind', and replace the NA elements by the average of the corresponding elements of 'c1' and 'r1'.
ind <- which(is.na(df), arr.ind=TRUE)
c1 <- colMeans(df[,ind[,2]], na.rm=TRUE)
r1 <- rowMeans(df[ind[,1],], na.rm=TRUE)
df[ind] <- colMeans(rbind(c1, r1))
Or, as @thelatemail suggested, we can use outer to get the combinations of colMeans and rowMeans and then replace the NA values based on that.
ind <- is.na(df)
df[ind] <- (outer(rowMeans(df,na.rm=TRUE), colMeans(df,na.rm=TRUE), `+`)/2)[ind]
data
set.seed(24)
df <- as.data.frame(matrix( sample(c(NA, 0:5), 10*10, replace=TRUE), ncol=10))
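A tiny worked case may make the outer() version easier to check by hand. This sketch uses a made-up 3 x 3 matrix m with a single NA at row 2, column 2; the names avg and filled are illustrative.

```r
m <- matrix(c(1, 2, 3,
              4, NA, 6,
              7, 10, 9), nrow = 3)   # filled column by column

na_pos <- is.na(m)

# avg[i, j] = (mean of row i + mean of column j) / 2, ignoring NAs
avg <- outer(rowMeans(m, na.rm = TRUE), colMeans(m, na.rm = TRUE), `+`) / 2

filled <- m
filled[na_pos] <- avg[na_pos]
```

Row 2 contains 2 and 10 (mean 6) and column 2 contains 4 and 6 (mean 5), so the missing cell becomes (6 + 5) / 2 = 5.5.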
Is there a better way to go through observations in a data frame and impute NA values? I've put together a 'for loop' that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop to solve this problem -- perhaps a built in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){                  # for every row
  x <- as.numeric(df[i,])       # subset/extract that row into a numeric vector
  y <- is.na(x)                 # create logical vector of NAs
  z <- !is.na(x)                # create logical vector of non-NAs
  result <- mean(x[z])          # get the mean value of the row
  df2[i,y] <- result            # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter of which. Then we can simply index df (by the row and column indexes) and the row means (by the row indexes only) and replace the values accordingly.
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
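To make the indexing step concrete, here is a minimal sketch on a made-up 3 x 3 data frame called small with two NAs. pos is the two-column (row, col) matrix returned by which(..., arr.ind = TRUE).

```r
small <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(7, 8, 9))

# One line per NA: its row index and its column index
pos <- which(is.na(small), arr.ind = TRUE)

# Row means computed on the non-missing values only
row_means <- rowMeans(small, na.rm = TRUE)

# Each NA cell receives the mean of the row it sits in
small[pos] <- row_means[pos[, "row"]]
```

Row 2 holds 5 and 8, so its NA becomes 6.5; row 3 holds 3 and 9, so its NA becomes 6.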
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but didn't manage so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
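The recycling this relies on can be checked on a small case. In this sketch (made-up 2 x 3 matrix m, illustrative names) the row-means vector has one entry per row, and because matrices are stored column by column, repeating it once per column lines each mean up with its own row.

```r
m <- matrix(c(1, NA,
              3, 4,
              NA, 6), nrow = 2)   # filled column by column

rmeans <- rowMeans(m, na.rm = TRUE)   # one mean per row

# What the implicit recycling amounts to: every column is a copy of rmeans
recycled <- matrix(rep(rmeans, times = ncol(m)), nrow = nrow(m))

filled <- ifelse(is.na(m), rmeans, m)
```

Row 1 holds 1 and 3 (mean 2) and row 2 holds 4 and 6 (mean 5), so the NA in row 1 becomes 2 and the NA in row 2 becomes 5. The same trick would silently give wrong answers with column means, which is why it is worth being careful with.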
One possibility is impute from Hmisc, which lets you choose any function to do the imputation:
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply
t(apply(df2, 1, function(x) {
  mu <- mean(x, na.rm=TRUE)   # row mean, ignoring NAs
  x[is.na(x)] <- mu           # fill the NAs in this row
  x
}))