How to delete rows from a data frame that contain n NAs (R)

I have a number of large datasets with ~10 columns and ~200,000 rows. Not every column contains a value in every row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.
My Dataframe looks something like this:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
C NA 9 4 NA 4 8 4 NA 5 NA
D 2 2 6 8 4 NA 3 7 1 32
I would like to be able to delete the rows that contain NA in more than 2 cells, to get:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
D 2 2 6 8 4 NA 3 7 1 32
complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns. But is there a way to make the test non-specific about which columns contain NA, and instead look at how many of them do?
Alternatively, this data frame is generated by merging several data frames using
file1 <- read.delim("~/file1.txt")
file2 <- read.delim(file = args[1])
file1 <- merge(file1, file2, by = "chr.pos", all = TRUE)
Perhaps the merge function could be altered?
Thanks

Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA values:
df <- df[rowSums(is.na(df)) < n, ]
In both cases, replace n with the required number.
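For instance, a quick sketch on the sample data above (values transcribed from the question's table):
df <- data.frame(ID = c("A", "B", "C", "D"),
                 q = c(1, 5, NA, 2),  r = c(5, NA, 9, 2),
                 s = c(NA, 4, 4, 6),  t = c(3, 6, NA, 8),
                 u = c(8, 1, 4, 4),   v = c(9, 9, 8, NA),
                 w = c(NA, 7, 4, 3),  x = c(8, 4, NA, 7),
                 y = c(6, 9, 5, 1),   z = c(4, 3, NA, 32))
df[rowSums(is.na(df)) <= 2, ]  # keeps rows A, B and D; row C has 4 NAs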

If dat is the name of your data.frame the following will return what you're looking for:
keep <- rowSums(is.na(dat)) < 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of TRUE/FALSE values;
# note that when adding logicals,
# TRUE counts as 1 and FALSE as 0
rowSums(.)
# quickly computes the total per row,
# which here is the number of NAs in each row
rowSums(.) < 2
# for each row, determines whether the sum
# (the number of NAs) is less than 2,
# returning TRUE/FALSE accordingly
We use the output of this last statement to identify which rows to keep. Note that it is not actually necessary to store this logical vector in keep; you can subset with it directly.
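To make the intermediate steps concrete, here is a small sketch on a toy data frame (not the question's data):
dat <- data.frame(a = c(1, NA, NA), b = c(2, NA, 5), c = c(3, 4, 6))
is.na(dat)
#       a     b     c
# 1 FALSE FALSE FALSE
# 2  TRUE  TRUE FALSE
# 3  TRUE FALSE FALSE
rowSums(is.na(dat))            # 0, 2 and 1 NAs in rows 1 to 3
dat[rowSums(is.na(dat)) < 2, ] # drops row 2, the only row with 2 or more NAs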

If d is your data frame, try this:
d <- d[rowSums(is.na(d)) < 2,]

This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ), ]
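Note that apply coerces the data frame to a matrix and calls the function once per row; since is.na also works on a whole data frame at once, the rowSums version from the answers above is equivalent and usually faster:
dfrm[rowSums(is.na(dfrm)) <= 2, ]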

Related

First row where the value in one column is higher than in another column (R)

I have a data table:
library(data.table)
df <- data.table(Points = 1:5, A = c(2,4,6,8,10), B = c(1,3,4,5,9))
df
Points A B
1 2 1
2 4 3
3 6 4
4 8 5
5 10 9
I want the value of column Points corresponding to the first value in column B that is higher than the current value of column A.
Expected output if A == 4:
4 (the first value in B bigger than 4 has a corresponding value of Points equal to 4)
How about this:
df[, Points[apply(outer(A, B, `<`), 1, function(z) which(z)[1])]]
# [1] 2 4 5 5 NA
The trick is that outer produces a 5x5 logical matrix in which, for each row, the column of the first TRUE is the index we want. I tried which.max, but in the 5th row nothing is found, and which.max(rep(FALSE, 5)) returns 1, which is obviously not right, as opposed to which(rep(FALSE, 5))[1], which returns NA (more meaningful here).
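To see what outer builds here, a small sketch printing the intermediate matrix (reusing the question's data):
library(data.table)
df <- data.table(Points = 1:5, A = c(2, 4, 6, 8, 10), B = c(1, 3, 4, 5, 9))
m <- df[, outer(A, B, `<`)]    # m[i, j] is TRUE when A[i] < B[j]
m
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,] FALSE  TRUE  TRUE  TRUE  TRUE
# [2,] FALSE FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE FALSE  TRUE
# [4,] FALSE FALSE FALSE FALSE  TRUE
# [5,] FALSE FALSE FALSE FALSE FALSE
apply(m, 1, function(z) which(z)[1])  # 2 4 5 5 NA: column of first TRUE per row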

Substitute Average of Previous and Next Available Values of Field for NA Values in Dataframe

A sample of the much bigger data set is in the following format:
Station <-c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A")
Parameter <-c(2,3,NA,4,4,9,NA,NA,10,15,NA,NA,NA,18,20)
Par_Count <-c(1,1,1,2,2,1,2,2,1,1,3,3,3,1,1)
df<-data.frame(Station, Parameter, Par_Count)
df
Station Parameter Par_Count
A 2 1
A 3 1
A NA 1
A 4 2
A 4 2
A 9 1
A NA 2
A NA 2
A 10 1
A 15 1
A NA 3
A NA 3
A NA 3
A 18 1
A 20 1
I want to replace runs of at most 2 consecutive NAs with values interpolated from the previous and next available values in that column. In the original data set some runs of NAs are hundreds long, so I want to leave runs of 3 or more consecutive NAs untouched. Par_Count represents the number of consecutive occurrences of that particular value in Parameter.
I tried with:
library(zoo)
df1 <- within(df, na.approx(df$Parameter, maxgap = 2))
and even for for single occurence with:
df1 <- within(df, Parameter[Parameter == is.na(df$Parameter) & Par_Count == 1] <-
lead(Parameter) - lag(Parameter))
but neither worked; no occurrence of NA was changed.
The desired output is like:
Station Parameter Par_Count
A 2 1
A 3 1
A 3.5 1
A 4 2
A 4 2
A 9 1
A 9.5 2
A 9.75 2 <-- 9.5 would also work here
A 10 1
A 15 1
A NA 3
A NA 3
A NA 3
A 18 1
A 20 1
You are nearly there; I think you have misinterpreted the use of within. If you would like to use within, you need to assign the output of na.approx to a column of the data frame. The following will work:
library(zoo)
df1 <- within(df, Parameter <- na.approx(Parameter, maxgap = 2, na.rm = FALSE))
Note it is advisable to use na.rm = FALSE, otherwise leading or trailing NAs will be removed, leading to an error.
Personally, I think the following is more readable, though it is a matter of style.
library(zoo)
df1 <- df
df1$Parameter <- na.approx(df$Parameter, maxgap = 2, na.rm = FALSE)
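On the sample data this fills the single NA between 3 and 4 with 3.5 and leaves the run of three NAs untouched. Note that na.approx interpolates linearly, so the two-NA gap between 9 and 10 fills as roughly 9.33 and 9.67, consistent with the question's note that exact values there are flexible:
library(zoo)
df1 <- within(df, Parameter <- na.approx(Parameter, maxgap = 2, na.rm = FALSE))
round(df1$Parameter, 2)
# 2  3  3.5  4  4  9  9.33  9.67  10  15  NA  NA  NA  18  20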

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data within 2 SDs of the mean for each numeric column.
I know how to do it for a single column, but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the code for getting only the data within 2 SDs for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we needed to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning 'x', we return the indices of 'x' that pass, and then take the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
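As a check on the toy data, only row 1 (birds = 21, which is more than 2 SDs above that column's mean) is dropped:
Reduce(intersect, lst)
# [1] 2 3 4 5 6 7 8
df[Reduce(intersect, lst), ]  # the same seven rows as the single-column filter above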
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: one that indicates the numeric columns, and a second that checks that all of them are within 2 SDs. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
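scale centers each numeric column at its mean and divides by its standard deviation, so the condition above is the multi-column analogue of the single-column z-score filter from the question. A quick sketch of the equivalence for one column:
all.equal(as.numeric(scale(df$birds)),
          (df$birds - mean(df$birds)) / sd(df$birds))
# [1] TRUE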

Removing same values in columns in some rows of a file in R

I have a file like this.
1 3
1 2
1 10
1 5
**5 5**
6 7
8 9
4 6
1 2
**10 10**
......
The file contains thousands of rows. How can I remove the rows that contain the same value in both columns (the row containing 5 5 and the row containing 10 10)? I know how to remove duplicate columns or duplicate rows, but how do I go about selectively removing these? Thanks. :)
I would do this with indexing, example with small data frame:
myDf <- data.frame(a=c(3,5,8,6,9,4,3), b=c(3,3,5,8,9,6,4))
myDf <- myDf[myDf$a != myDf$b,]
I would consider writing a helper function like this:
indicator <- function(indf) {
  rowSums(vapply(indf, function(x) x == indf[, 1],
                 logical(nrow(indf)))) == ncol(indf)
}
Basically, the function compares each column in the data.frame with the first column, then checks which row sums equal the number of columns in the data.frame.
This basically creates a logical vector that can be used to subset your data.frame.
Example:
mydf <- data.frame(a = c(3, 5, 8, 6, 9, 4, 3),
                   b = c(3, 3, 5, 8, 9, 6, 4),
                   c = c(3, 4, 5, 6, 9, 7, 2))
indicator(mydf)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
mydf[!indicator(mydf), ]
# a b c
# 2 5 3 4
# 3 8 5 5
# 4 6 8 6
# 6 4 6 7
# 7 3 4 2
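Applied to data transcribed from the question (assuming the default column names V1 and V2 that read.table would assign), the same function drops the two flagged rows:
dat <- data.frame(V1 = c(1, 1, 1, 1, 5, 6, 8, 4, 1, 10),
                  V2 = c(3, 2, 10, 5, 5, 7, 9, 6, 2, 10))
dat[!indicator(dat), ]  # removes the 5 5 and 10 10 rows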

Choose one cell per row in data frame

I have a vector that tells me, for each row in a data frame, the column index for which the value in that row should be updated.
> set.seed(12008); n <- 10000; d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
> i <- sample.int(3, n, replace=TRUE)
> head(d); head(i)
c1 c2 c3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
[1] 3 2 2 3 2 1
This means that for rows 1 and 4, c3 should be updated; for rows 2, 3 and 5, c2 should be updated (among others). What is the cleanest way to achieve this in R using vectorized operations, i.e., without apply and friends? EDIT: And, if at all possible, without R loops?
I have thought about transforming d into a matrix and then address the matrix elements using an one-dimensional vector. But then I haven't found a clean way to compute the one-dimensional address from the row and column indexes.
With your example data, and using only the first few rows (D and I below), you can easily do what you want via a matrix, as you surmise.
set.seed(12008)
n <- 10000
d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- sample.int(3, n, replace=TRUE)
## just work with small subset
D <- head(d)
I <- head(i)
First, convert D into a matrix:
dmat <- data.matrix(D)
Next, compute the indices into the vector representation of the matrix corresponding to the rows and columns indicated by I. The column indices are given by I itself, and the row indices by seq_along(I), which in this simple example is the vector 1:6. To compute the vector indices we can use:
(I - 1) * nrow(D) + seq_along(I)
where the first part, (I - 1) * nrow(D), skips over the full columns (of nrow(D) = 6 elements each) that precede the Ith column, since R matrices are stored column by column. Adding the row index then gives the position of the desired element within the Ith column.
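With I = c(3, 2, 2, 3, 2, 1) (the head of i shown in the question) and nrow(D) = 6, this works out to:
(I - 1) * nrow(D) + seq_along(I)
# [1] 13  8  9 16 11  6
so, for example, index 13 is the first element of the third column: two full columns of 6 elements, plus 1.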
Using this we just index into dmat using "[", treating it like a vector. The replacement version of "[" ("[<-") allows us to do the replacement in a single line. Here I replace the indicated elements with NA to make it easier to see that the correct elements were identified:
> dmat
c1 c2 c3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
> dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA
> dmat
c1 c2 c3
1 1 2 NA
2 2 NA 6
3 3 NA 9
4 4 8 NA
5 5 NA 15
6 NA 12 18
If you are willing to first convert your data.frame to a matrix, you can index elements-to-be-replaced using a two-column matrix. (Beginning with R-2.16.0, this will be possible with data.frames directly.) The indexing matrix should have row indices in its first column and column indices in its second column.
Here's an example:
## Create a subset of the your data
set.seed(12008); n <- 6
D <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- seq_len(nrow(D)) # vector of row indices
j <- sample(3, n, replace=TRUE) # vector of column indices
ij <- cbind(i, j) # a 2-column matrix to index a 2-D array
# (This extends smoothly to higher-D arrays.)
## Convert it to a matrix
Dmat <- as.matrix(D)
## Replace the elements indexed by 'ij'
Dmat[ij] <- NA
Dmat
# c1 c2 c3
# [1,] 1 2 NA
# [2,] 2 NA 6
# [3,] 3 NA 9
# [4,] 4 8 NA
# [5,] 5 NA 15
# [6,] NA 12 18
Beginning with R-2.16.0, you will be able to use the same syntax for dataframes (i.e. without having to first convert dataframes to matrices).
From the R-devel NEWS file:
Matrix indexing of dataframes by two column numeric indices is now supported for replacement as well as extraction.
Using the current R-devel snapshot, here's what that looks like:
D[ij] <- NA
D
# c1 c2 c3
# 1 1 2 NA
# 2 2 NA 6
# 3 3 NA 9
# 4 4 8 NA
# 5 5 NA 15
# 6 NA 12 18
Here's one way:
d[which(i == 1), "c1"] <- "one"
d[which(i == 2), "c2"] <- "two"
d[which(i == 3), "c3"] <- "three"
c1 c2 c3
1 1 2 three
2 2 two 6
3 3 two 9
4 4 8 three
5 5 two 15
6 one 12 18
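One caveat worth noting: assigning character strings such as "one" into numeric columns coerces those columns to character, which may or may not be what you want:
sapply(d, class)
#          c1          c2          c3
# "character" "character" "character"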
