Data frame with NA in R

I have a data frame in which some rows contain NA entries. I want to find the row and column index of every NA entry. I am currently doing this with nested loops, and that is taking too long. Is there a quicker way to do it? Thanks.

set.seed(123)
dfrm <- data.frame(a=sample(c(1:5, NA), 25, replace=TRUE), b=sample(c(letters, NA), 25, replace=TRUE))
which(is.na(dfrm), arr.ind=TRUE)
row col
[1,] 4 1
[2,] 5 1
[3,] 8 1
[4,] 11 1
[5,] 16 1
[6,] 20 1
[7,] 21 1
[8,] 24 1
[9,] 6 2
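
The same call can be tried on a small hand-built data frame, so the result does not depend on the random seed. This is only a sketch: `df_demo`, `na_pos` and `na_cols` are hypothetical names that do not appear in the question, and the last line shows one possible way to turn column indices back into column names.

```r
# Hypothetical example data with NAs placed by hand
df_demo <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# One (row, col) pair per NA entry, no explicit loops needed
na_pos <- which(is.na(df_demo), arr.ind = TRUE)

# Translate the column indices into column names
na_cols <- names(df_demo)[na_pos[, "col"]]
```

Here `na_pos` has the columns `row` and `col`, in the same layout as the output above.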

Related

Remove duplicate rows based on a column values by storing the row whose entry in another column is maximum

I have the following matrix
> mat<-rbind(c(9,6),c(10,6),c(11,7),c(12,7),c(12,8),c(12,9),c(12,10),c(12,11),c(12,12),c(13,12))
> mat
[,1] [,2]
[1,] 9 6
[2,] 10 6
[3,] 11 7
[4,] 12 7
[5,] 12 8
[6,] 12 9
[7,] 12 10
[8,] 12 11
[9,] 12 12
[10,] 13 12
I would like to remove duplicate rows based on the first column values and keep the row whose entry in the second column is maximum. E.g. for the example above, the desired outcome is
[,1] [,2]
[1,] 9 6
[2,] 10 6
[3,] 11 7
[4,] 12 12
[5,] 13 12
I tried with
> mat[!duplicated(mat[,1]),]
but I obtained
[,1] [,2]
[1,] 9 6
[2,] 10 6
[3,] 11 7
[4,] 12 7
[5,] 13 12
which is different from the desired outcome for the entry [4,2]. Suggestions?

You can sort the matrix first, using ascending order for column 1 and descending order for column 2. Then the duplicated function will remove all but the maximum column 2 value for each column 1 value.
mat <- mat[order(mat[,1],-mat[,2]),]
mat[!duplicated(mat[,1]),]
[,1] [,2]
[1,] 9 6
[2,] 10 6
[3,] 11 7
[4,] 12 12
[5,] 13 12
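
As a cross-check of the sort-then-deduplicate idea, here is a minimal sketch that compares it with a per-group maximum computed by `tapply`. The names `sorted`, `by_sort`, `grp_max` and `by_tapply` are made up for this illustration. Note that the `tapply` variant only returns the maximum itself, so it is a substitute only for a two-column matrix like this one.

```r
mat <- rbind(c(9, 6), c(10, 6), c(11, 7), c(12, 7), c(12, 8), c(12, 9),
             c(12, 10), c(12, 11), c(12, 12), c(13, 12))

# Sort by column 1 ascending and column 2 descending, then drop duplicates
sorted  <- mat[order(mat[, 1], -mat[, 2]), ]
by_sort <- sorted[!duplicated(sorted[, 1]), ]

# Alternative: maximum of column 2 within each group of column 1
grp_max   <- tapply(mat[, 2], mat[, 1], max)
by_tapply <- cbind(as.numeric(names(grp_max)), as.vector(grp_max))
```

Both `by_sort` and `by_tapply` should hold the same five rows.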

Like Joseph's solution, but if you add row names first you can restore the original row order afterwards (which will be the same in this case). The row names of the result are the original row numbers of the rows that were kept.
rownames(mat) <- 1:nrow(mat)
mat <- mat[order(mat[,1], -mat[,2]),]
mat <- mat[!duplicated(mat[,1]),]
mat[order(as.numeric(rownames(mat))),]
# [,1] [,2]
# 1 9 6
# 2 10 6
# 3 11 7
# 9 12 12
# 10 13 12

First sort, then keep only the first row for each duplicate:
mat <- mat[order(mat[,1], mat[,2]),]
mat[!duplicated(mat[,1]),]
EDIT: Sorry, I thought your desired result was the last output. OK, so you want the maximum value:
mat<-rbind(c(9,6),c(10,6),c(11,7),c(12,7),c(12,8),c(12,9),c(12,10),c(12,11),c(12,12),c(13,12))
#Reverse sort
mat <- mat[order(mat[,1], mat[,2], decreasing=TRUE),]
#Keep only the first row for each duplicate, this will give the largest values
mat <- mat[!duplicated(mat[,1]),]
#finally sort it
mat <- mat[order(mat[,1], mat[,2]),]

Is there a way to order a matrix with R and then break ties later? [duplicate]

I have a matrix. For example, like this:
stack_temp <- cbind(rep(1:4, 3), c(rep(1, 4), rep(3,4), rep(2, 4)))
stack_temp
# output:
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 1 3
[6,] 2 3
[7,] 3 3
[8,] 4 3
[9,] 1 2
[10,] 2 2
[11,] 3 2
[12,] 4 2
I need to order the matrix by its first column, and then break ties using the second column.
i.e., first go to this:
stack_temp[order(stack_temp[,1]),]
# output:
[,1] [,2]
[1,] 1 1
[2,] 1 3
[3,] 1 2
[4,] 2 1
[5,] 2 3
[6,] 2 2
[7,] 3 1
[8,] 3 3
[9,] 3 2
[10,] 4 1
[11,] 4 3
[12,] 4 2
and then this:
stack_temp[order(stack_temp[,1], stack_temp[,2]),]
# output:
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 2 1
[5,] 2 2
[6,] 2 3
[7,] 3 1
[8,] 3 2
[9,] 3 3
[10,] 4 1
[11,] 4 2
[12,] 4 3
but I cannot chain inputs as order wants me to. That is, I cannot write stack_temp[,1], stack_temp[,2] within the order function call.
This is because for the matrix I am using, I have a vector of column indices (i.e. c(1, 2)), so I cannot directly write the inputs above.
How do I achieve the same effect as the single order call when my input is a vector of column indices?
Note: in my actual problem, I have a vector of column names, not indices, and it is of variable length (usually longer than 2).

I can just use do.call with order, passing the selected columns of my matrix (chosen by column name) as the arguments, to get the desired result. Thanks to @akrun for the help.
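
The answer does not show the call itself, so the following is only a sketch of how `do.call` and `order` might be combined. The column names `"x"` and `"y"` and the variables `sort_cols`, `key_list`, `ord` and `sorted` are assumptions made for this illustration.

```r
stack_temp <- cbind(rep(1:4, 3), c(rep(1, 4), rep(3, 4), rep(2, 4)))
colnames(stack_temp) <- c("x", "y")  # hypothetical column names

# Columns to sort by, in order of priority; may have any length
sort_cols <- c("x", "y")

# One list element per sort column; names are dropped so that no column
# name is mistaken for a named argument of order()
key_list <- unname(as.list(as.data.frame(stack_temp[, sort_cols, drop = FALSE])))

# Equivalent to order(stack_temp[, "x"], stack_temp[, "y"])
ord    <- do.call(order, key_list)
sorted <- stack_temp[ord, ]
```

Because `do.call` spreads the list elements over the arguments of `order`, later columns only break ties left by earlier ones, which is the behaviour asked for.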

How to reconstruct diff data with leading NAs using diffinv in R?

I have a large data matrix (“trial.matrix”) similar to the one below.
[,1] [,2]
[1,] 3 NA
[2,] 5 NA
[3,] 7 NA
[4,] 9 10
[5,] 11 12
[6,] 13 14
My problem requires that I shuffle some rows of the difference version of this matrix and then reconstruct a matrix from the shuffled difference matrix. When I apply diff(trial.matrix) I get:
[,1] [,2]
[1,] 2 NA
[2,] 2 NA
[3,] 2 NA
[4,] 2 2
[5,] 2 2
To reconstruct the original data frame, I need to use cumsum() or diffinv(), e.g.:
new.df <- diffinv(diff(trial.matrix), xi = t(c(3, 10)))
but this gives:
[,1] [,2]
[1,] 3 10
[2,] 5 NA
[3,] 7 NA
[4,] 9 NA
[5,] 11 NA
[6,] 13 NA
Obviously, the beginning value (“xi”) for column 2 has to be applied starting in row 3 (or 4?). I have a number of columns in the real matrix, some with leading NAs and some without. I need to preserve the leading NAs in the reconstruction. I cannot figure out an easy way to reconstruct the columns with NAs in the difference matrix in a straightforward way.
(For each column I am able to construct two vectors, one containing the first non-NA row, and the other containing the first NA value, but cannot figure out a straightforward way to use these.)
Suggestions appreciated.

You can temporarily replace NAs by zeroes:
trial.matrix <- matrix(c(seq(3,13,by=2),rep(NA,3),10,12,14),ncol=2)
xi <- apply(trial.matrix,2,function(cl) cl[which(!is.na(cl))[1]])
z2 <- diff(trial.matrix)
# temporarily replace NAs in the second column by zeroes:
nas <- which(is.na(z2[,2]))
z2[nas,2] <- 0
new.df <- diffinv(z2,xi = t(xi))
# return NAs
new.df[nas,2] <- NA
# [,1] [,2]
# [1,] 3 NA
# [2,] 5 NA
# [3,] 7 NA
# [4,] 9 10
# [5,] 11 12
# [6,] 13 14
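
The snippet above treats only the second column. A possible generalization to any number of columns with leading NAs is sketched below. It assumes that the NAs occur only at the top of a column, and the names `first_val`, `z` and `rebuilt` are invented for this illustration.

```r
trial.matrix <- matrix(c(seq(3, 13, by = 2), rep(NA, 3), 10, 12, 14), ncol = 2)

# First non-NA value of every column, used as the starting value
first_val <- apply(trial.matrix, 2, function(cl) cl[which(!is.na(cl))[1]])

# Difference matrix with every NA temporarily replaced by zero
z <- diff(trial.matrix)
z[is.na(z)] <- 0

# Invert the differencing, then restore the NAs from the original matrix
rebuilt <- diffinv(z, xi = matrix(first_val, nrow = 1))
rebuilt[is.na(trial.matrix)] <- NA
```

With zeros in place of the NA differences, the starting value is carried unchanged through the leading rows, so the first real difference is added to the right base value.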

Delete row based on the value of the rows above

I have the following data set:
data <- cbind(c(1,2,3,4,5,6,7,8,9,10,11),c(1,11,21,60,30,2,61,12,3,35,63))
I would like to select the rows for which the number in the second column is greater than the highest number reached up to that point. The result should look like this.
[,1] [,2]
[1,] 1 1
[2,] 2 11
[3,] 3 21
[4,] 4 60
[5,] 7 61
[6,] 11 63

You want to try cummax (here d is your matrix):
> d[ d[,2] == cummax(d[,2]) ,]
[,1] [,2]
[1,] 1 1
[2,] 2 11
[3,] 3 21
[4,] 4 60
[5,] 7 61
[6,] 11 63
PS. data is also the name of a built-in R function. Calling data() still works after your assignment, because R skips non-function objects when it looks up a function name, but reusing the name is confusing, so a different variable name (such as d here) is clearer.

The cummax function should work well
data[ data[,2]==cummax(data[,2]),]
returns
[,1] [,2]
[1,] 1 1
[2,] 2 11
[3,] 3 21
[4,] 4 60
[5,] 7 61
[6,] 11 63
as desired.
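
One caveat: comparing with `==` also keeps a row whose value merely equals the running maximum. The example data has no such ties, but if "greater than" has to be strict, a variant along the following lines could be used. This is a sketch only; `x`, `prev_max`, `strict` and the small `ties` vector are hypothetical names and data.

```r
d <- cbind(1:11, c(1, 11, 21, 60, 30, 2, 61, 12, 3, 35, 63))
x <- d[, 2]

# Running maximum of all previous values (-Inf before the first row)
prev_max <- c(-Inf, cummax(x)[-length(x)])

# Keep only rows that strictly exceed everything seen so far
strict <- d[x > prev_max, ]

# Hypothetical vector with a tie, to show where the two rules differ
ties         <- c(1, 5, 5, 7)
keep_equal   <- ties == cummax(ties)
keep_greater <- ties > c(-Inf, cummax(ties)[-length(ties)])
```

On the question's data `strict` contains the same six rows as the answers above; on `ties` the second 5 is kept by the `==` rule but dropped by the strict rule.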

R fill interpolated matrix with NA's

I took a stratified random sample out of a raster layer using R's raster package and the sampleStratified function:
library(raster)
r<-raster(nrows=5, ncols=5)
r[]<-c(1,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,1,0,1,0,0,0,1,1,1)
#Stratified random sample size
sampleStratified(r, size=5)
cell layer
[1,] 3 0
[2,] 22 0
[3,] 7 0
[4,] 21 0
[5,] 12 0
[6,] 13 1
[7,] 17 1
[8,] 11 1
[9,] 8 1
[10,] 23 1
What I would like to do now is to order the sample by the first column, interpolate the first column to get the original length of the raster and fill the missing values of the second column with NA to look like this:
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] 3 0
[4,] 4 NA
[5,] 5 NA
[6,] 6 NA
[7,] 7 0
[8,] 8 1
[9,] 9 NA
[10,] 10 NA
[11,] 11 1
[12,] 12 0
[13,] 13 1
[14,] 14 NA
[15,] 15 NA
[16,] 16 NA
[17,] 17 1
[18,] 18 NA
[19,] 19 NA
[20,] 20 NA
[21,] 21 0
[22,] 22 0
[23,] 23 1
[24,] 24 NA
[25,] 25 NA
I tried something with the approxTime function from the simecol package but failed with the NA filling. I have 10 raster layers with around 500,000 values each, so a fast approach would be really appreciated.

I'd think about it the opposite way. Instead of interpolating, which could be expensive, note that you already know which cells you want to change: those that are not in the random sample. So use your random sample as an index vector for the cell numbers you don't want to change, and apply the [<- replacement method to the cell indices that do not appear in your stratified sample. We use raster methods for the base functions [<- and %in% and also seq_len. Forgive the slightly long-winded example; it is better to show the steps. It should be quite fast and I don't envisage any problems with rasters of 500,000 cells.
# For reproducible example
set.seed(1)
# Get stratified random sample
ids <- sampleStratified(r, size=5)
# Copy of original raster (to visualise difference)
r2 <- r
# Get set of cell indices
cell_no <- seq_len(ncell(r2))
# Those indices to replace are those not in the random sample
repl <- cell_no[ ! cell_no %in% ids[,1] ]
# Replace cells not in sample with NA
r2[ repl ] <- NA
# Plot to show what is going on
par( mfrow = c(1,2))
plot(r)
plot(r2)

I would use merge as @Roland suggested.
mm <- data.frame(col1 = sample(1:100, 50), col2 = sample(0:1, 50, replace = TRUE))
mm <- as.matrix(mm[order(mm[, 1]), ])
mdl <- as.matrix(data.frame(col1 = 1:100, col2 = NA))
merge(mdl, mm, by = "col1", all.x = TRUE)
col1 col2.x col2.y
1 1 NA NA
2 2 NA 0
3 3 NA 0
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 NA 0
8 8 NA 1
9 9 NA NA
10 10 NA 0
11 11 NA NA
12 12 NA 0
13 13 NA 1
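
If the two-column layout from the question is wanted directly, `match` is one more option that avoids both the raster copy and the extra column that `merge` produces. The sketch below hard-codes the sample shown in the question instead of calling `sampleStratified`; `smp`, `n_cells`, `all_cells` and `filled` are hypothetical names.

```r
# Sample from the question: cell number and layer value
smp <- cbind(cell  = c(3, 22, 7, 21, 12, 13, 17, 11, 8, 23),
             layer = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1))

# Number of cells in the 5 x 5 raster, i.e. ncell(r) in the question
n_cells   <- 25
all_cells <- seq_len(n_cells)

# match() gives NA for cells that are not in the sample, and indexing
# with NA yields NA, so unsampled cells are filled automatically
filled <- cbind(all_cells, smp[match(all_cells, smp[, "cell"]), "layer"])
```

The result has one row per cell, in cell order, with NA in the second column wherever the cell was not sampled.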
