Imputation with column medians in R - r

If I have a vector, for example
vec <- c(3,4,5,NA)
I can replace the NA with the median value of the other values in the vector with the following code:
vec[which(is.na(vec))] <- median(vec, na.rm = T)
However, if I have a matrix containing NAs, applying the same code across all columns of the matrix doesn't give me back a matrix; it just returns the median of each column.
mat <- matrix(c(1,NA,3,5,6,7,NA,3,4,NA,2,8), ncol = 3)
apply(mat, 2, function(x) x[which(is.na(x))] <- median(x, na.rm=T) )
#[1] 3 6 4
How can I get the matrix back with NAs replaced by column medians? This question is similar: Replace NA values by row means but I can't adapt any of the solutions to my case.

There is a convenient function (na.aggregate) in zoo to replace the NA elements with the specified FUN.
library(zoo)
apply(mat, 2, FUN = function(x) na.aggregate(x, FUN = median))
# [,1] [,2] [,3]
#[1,] 1 6 4
#[2,] 3 7 4
#[3,] 3 6 2
#[4,] 5 3 8
Or, as @G.Grothendieck commented, na.aggregate can be applied directly to the matrix:
na.aggregate(mat, FUN = median)

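Adding return(x) as the last line of the function within apply will solve it. Without it, the function returns the value of its last expression, the assignment, which evaluates to the assigned median rather than the modified vector.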
> apply(mat, 2, function(x){
x[which(is.na(x))] <- median(x, na.rm=T)
return(x)
})
[,1] [,2] [,3]
[1,] 1 6 4
[2,] 3 7 4
[3,] 3 6 2
[4,] 5 3 8
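For comparison, here is a minimal base R sketch that avoids apply altogether. It assumes the same mat as in the question; col_med, idx and imputed are names made up for this illustration.

```r
# Same matrix as in the question
mat <- matrix(c(1, NA, 3, 5, 6, 7, NA, 3, 4, NA, 2, 8), ncol = 3)

# Median of each column, ignoring NAs
col_med <- apply(mat, 2, median, na.rm = TRUE)

# Row/column positions of every NA cell
idx <- which(is.na(mat), arr.ind = TRUE)

# Fill each NA with the median of the column it sits in
imputed <- mat
imputed[idx] <- col_med[idx[, "col"]]
imputed
```

Because the medians are computed once up front, the filled values cannot influence each other.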

Related

How to remove parts of a column from matrix in R

Let's say I have a matrix
[,1] [,2] [,3] [,4]
[1,] 10 11 12 13
[2,] 9 10 15 4
[3,] 5 7 4 10
[4,] 1 2 6 2
I want to remove parts of a column where the values are <= 5. Once a value <= 5 occurs, everything below it in that column should be removed as well, even if a later value is higher (e.g. [3,4] comes after [2,4], which is < 5), so I should be left with:
[,1] [,2] [,3] [,4]
[1,] 10 11 12 13
[2,] 9 10 15 NA
[3,] NA 7 NA NA
[4,] NA NA NA NA
The matrix was created by using a for-loop to iterate a population 100 times so my matrix is 100x100.
I tried to use an if function in the for-loop to remove parts of the column but instead it just removed all columns after the first one.
if(matrix[,col]<=5) break
Here's a way to replace the required values in a matrix with NA:
# Create a random matrix with 20 rows and 20 columns
m <- matrix(floor(runif(400, min = 0, max = 101)), nrow = 20)
# Function that iterates through a vector and replaces values <= 5
# and the following values with NA
f <- function(x) {
fillNA <- FALSE
for (i in seq_along(x)) {
if (fillNA || x[i] <= 5) {
x[i] <- NA
fillNA <- TRUE
}
}
x
}
# Apply the function column-wise
apply(m, 2, f)
We can do this in base R. Let's assume that your matrix is called m. The function below does the following:
Check each element to see if it is <= 5, producing TRUE/FALSE values.
Cumulatively sum the TRUE/FALSE values.
Replace any non-zero cumulative values with NA.
Use apply to perform this operation per column of the matrix.
This can be fit on one line:
m2 <- apply(m, 2, \(x) ifelse(cumsum(x <= 5), NA, x))
[,1] [,2] [,3] [,4]
[1,] 10 11 12 13
[2,] 9 10 15 NA
[3,] NA 7 NA NA
[4,] NA NA NA NA
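To see why the cumulative sum works, it may help to trace the steps on a single column. This is only a sketch on the 4x4 example from the question; below, hits and m2 are illustrative names.

```r
# Fourth column of the example matrix
x <- c(13, 4, 10, 2)

below <- x <= 5        # FALSE TRUE FALSE TRUE
hits <- cumsum(below)  # 0 1 1 2: non-zero from the first hit onwards
x[hits > 0] <- NA      # everything from the first hit down becomes NA
x

# The same logic applied to every column of the example matrix
m <- matrix(c(10, 9, 5, 1,
              11, 10, 7, 2,
              12, 15, 4, 6,
              13, 4, 10, 2), nrow = 4)
m2 <- apply(m, 2, function(col) ifelse(cumsum(col <= 5) > 0, NA, col))
m2
```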
# Set the seed for reproducibility
set.seed(123)
# Create a random matrix with 100 rows and 100 columns, with values between 0 and 100
mat <- matrix(runif(10000, min = 0, max = 100), nrow = 100)
# Within each column, flag the first value <= 5 and everything below it, then set those cells to NA
mat[apply(mat, 2, function(x) cumsum(x <= 5) > 0)] <- NA
# View the modified matrix
mat
This code sets the seed for reproducibility, so that the same random matrix is generated every time the code is run. It then creates a random matrix with 100 rows and 100 columns using the runif function, here generating uniform random numbers between 0 and 100 so that some values fall at or below 5. Finally, it uses the apply function on each column to build a logical mask marking the first value <= 5 and every value below it, and replaces the marked cells with NA. No extra packages are needed for this task.

Reshape each row of a data.frame to be a matrix in R

I am working with the hand-written zip codes dataset. I have loaded the dataset like this:
digits <- read.table("./zip.train",
quote = "",
comment.char = "",
stringsAsFactors = F)
Then I get only the ones:
ones <- digits[digits$V1 == 1, -1]
Right now, ones has 442 rows and 256 columns. I need to transform each row of ones into a 16x16 matrix. I think what I am looking for is a list of 16x16 matrices like the ones in this question:
How to create a list of matrix in R
But I tried that with my data and it did not work.
At first I tried ones <- apply(ones, 1, matrix, nrow = 16, ncol = 16), but it does not work as I expected. I also tried lapply with no luck.
An alternative is to just change the dims of your matrix.
Consider the following matrix "M":
M <- matrix(1:12, ncol = 4)
M
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
We are looking to create a three dimensional array from this, so you can specify the dimensions as "row", "column", "third-dimension". However, since the matrix is constructed by column, you first need to transpose it before changing the dimensions.
`dim<-`(t(M), c(2, 2, nrow(M)))
# , , 1
#
# [,1] [,2]
# [1,] 1 7
# [2,] 4 10
#
# , , 2
#
# [,1] [,2]
# [1,] 2 8
# [2,] 5 11
#
# , , 3
#
# [,1] [,2]
# [1,] 3 9
# [2,] 6 12
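Though there are probably simpler ways, you can try lapply. Since ones[i, ] is a one-row data frame, unlist it first so that matrix() receives a plain numeric vector:
ones_matrix <- lapply(seq_len(nrow(ones)), function(i){matrix(unlist(ones[i, ]), nrow=16)})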
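A small self-contained sketch of the row-to-matrix idea, using a made-up 2-row, 4-column data frame (toy) in place of the real 442 x 256 ones data. Each row is flattened with unlist() because a single data frame row is still a data frame, not a numeric vector. Whether the real images need byrow = TRUE depends on how the pixels are stored in the file, which is not checked here.

```r
# Hypothetical stand-in for `ones`: 2 "images" of 4 pixels each
toy <- data.frame(V2 = c(1, 5), V3 = c(2, 6), V4 = c(3, 7), V5 = c(4, 8))

# One 2x2 matrix per row; values are filled column by column
mats <- lapply(seq_len(nrow(toy)), function(i) {
  matrix(unlist(toy[i, ], use.names = FALSE), nrow = 2)
})

length(mats)  # one matrix per row of toy
mats[[1]]
```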

How to apply median function to multiple columns or vectors in R [duplicate]

Suppose I have an n by 2 matrix and a function that takes a 2-vector as one of its arguments. I would like to apply the function to each row of the matrix and get an n-vector. How to do this in R?
For example, I would like to compute the density of a 2D standard Normal distribution on three points:
bivariate.density(x = c(0, 0), mu = c(0, 0), sigma = c(1, 1), rho = 0){
exp(-1/(2*(1-rho^2))*(x[1]^2/sigma[1]^2+x[2]^2/sigma[2]^2-2*rho*x[1]*x[2]/(sigma[1]*sigma[2]))) * 1/(2*pi*sigma[1]*sigma[2]*sqrt(1-rho^2))
}
out <- rbind(c(1, 2), c(3, 4), c(5, 6))
How to apply the function to each row of out?
How to pass values for the other arguments besides the points to the function in the way you specify?
You simply use the apply() function:
R> M <- matrix(1:6, nrow=3, byrow=TRUE)
R> M
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
R> apply(M, 1, function(x) 2*x[1]+x[2])
[1] 4 10 16
R>
This takes a matrix and applies a (silly) function to each row. You pass extra arguments to the function as fourth, fifth, ... arguments to apply().
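As a rough illustration of passing extra arguments through apply(), the sketch below uses a made-up helper, weighted, whose a and b parameters are supplied after the function in the apply() call.

```r
M <- matrix(1:6, nrow = 3, byrow = TRUE)

# Hypothetical helper: x is one row of M, a and b are extra arguments
weighted <- function(x, a, b) a * x[1] + b * x[2]

# Arguments after the function are forwarded to it on every call
res <- apply(M, 1, weighted, a = 2, b = 1)
res
```

With a = 2 and b = 1 this reproduces the 2*x[1]+x[2] example above.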
In case you want to apply common functions such as sum or mean, you should use rowSums or rowMeans since they're faster than apply(data, 1, sum) approach. Otherwise, stick with apply(data, 1, fun). You can pass additional arguments after FUN argument (as Dirk already suggested):
set.seed(1)
m <- matrix(round(runif(20, 1, 5)), ncol=4)
diag(m) <- NA
m
[,1] [,2] [,3] [,4]
[1,] NA 5 2 3
[2,] 2 NA 2 4
[3,] 3 4 NA 5
[4,] 5 4 3 NA
[5,] 2 1 4 4
Then you can do something like this:
apply(m, 1, quantile, probs=c(.25,.5, .75), na.rm=TRUE)
[,1] [,2] [,3] [,4] [,5]
25% 2.5 2 3.5 3.5 1.75
50% 3.0 2 4.0 4.0 3.00
75% 4.0 3 4.5 4.5 4.00
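A quick sketch comparing the two routes for a common function. The small matrix here is made up for the example, and na.rm = TRUE is passed the same way in both calls.

```r
# Small made-up matrix with one NA per row
m <- matrix(c(NA, 2, 3, 5,
              5, NA, 4, 4), nrow = 2, byrow = TRUE)

# Generic route: apply() forwards na.rm to mean() for every row
via_apply <- apply(m, 1, mean, na.rm = TRUE)

# Specialised route: rowMeans() does the same job in one vectorised call
via_rowmeans <- rowMeans(m, na.rm = TRUE)

all.equal(via_apply, via_rowmeans)
```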
Here is a short example of applying a function to each row of a matrix.
(Here, the function applied normalizes every row so that it sums to 1.)
Note: The result from the apply() had to be transposed using t() to get the same layout as the input matrix A.
A <- matrix(c(
0, 1, 1, 2,
0, 0, 1, 3,
0, 0, 1, 3
), nrow = 3, byrow = TRUE)
t(apply(A, 1, function(x) x / sum(x) ))
Result:
[,1] [,2] [,3] [,4]
[1,] 0 0.25 0.25 0.50
[2,] 0 0.00 0.25 0.75
[3,] 0 0.00 0.25 0.75
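The need for t() comes from apply() storing each row's result as a column. The sketch below checks the dimensions and, as one possible alternative, divides by rowSums() directly; raw, normalized and vectorized are illustrative names.

```r
A <- matrix(c(0, 1, 1, 2,
              0, 0, 1, 3,
              0, 0, 1, 3), nrow = 3, byrow = TRUE)

# Each row's result becomes a column, so the raw result is 4 x 3
raw <- apply(A, 1, function(x) x / sum(x))
dim(raw)

# Transposing restores the 3 x 4 layout of A
normalized <- t(raw)

# Alternative without apply(): rowSums(A) is recycled down the columns,
# so element [i, j] is divided by the sum of row i
vectorized <- A / rowSums(A)

all.equal(normalized, vectorized)
```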
apply does the job well, but can be quite slow.
sapply and vapply can be useful alternatives, as can dplyr's rowwise.
Let's see an example of how to compute the row-wise product of a data frame.
a = data.frame(t(iris[1:10,1:3]))
vapply(a, prod, 0)
sapply(a, prod)
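Note that assigning the data to a variable before calling vapply/sapply/apply is good practice, as it reduces the time a lot: the subsetting and transposing are then not repeated inside every timed call. Let's see the microbenchmark results:
library(dplyr) # provides %>%, rowwise() and summarise() used below
a = data.frame(t(iris[1:10,1:3]))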
b = iris[1:10,1:3]
microbenchmark::microbenchmark(
apply(b, 1 , prod),
vapply(a, prod, 0),
sapply(a, prod) ,
apply(iris[1:10,1:3], 1 , prod),
vapply(data.frame(t(iris[1:10,1:3])), prod, 0),
sapply(data.frame(t(iris[1:10,1:3])), prod) ,
b %>% rowwise() %>%
summarise(p = prod(Sepal.Length,Sepal.Width,Petal.Length))
)
Have a careful look at how t() is being used
First step would be making the function object, then applying it. If you want a matrix object that has the same number of rows, you can predefine it and use the object[] form as illustrated (otherwise the returned value will be simplified to a vector):
bvnormdens <- function(x=c(0,0),mu=c(0,0), sigma=c(1,1), rho=0){
exp(-1/(2*(1-rho^2))*(x[1]^2/sigma[1]^2+
x[2]^2/sigma[2]^2-
2*rho*x[1]*x[2]/(sigma[1]*sigma[2]))) *
1/(2*pi*sigma[1]*sigma[2]*sqrt(1-rho^2))
}
out=rbind(c(1,2),c(3,4),c(5,6));
bvout<-matrix(NA, ncol=1, nrow=3)
bvout[] <-apply(out, 1, bvnormdens)
bvout
[,1]
[1,] 1.306423e-02
[2,] 5.931153e-07
[3,] 9.033134e-15
If you wanted to use other than your default parameters then the call should include named arguments after the function:
bvout[] <-apply(out, 1, FUN=bvnormdens, mu=c(-1,1), rho=0.6)
apply() can also be used on higher dimensional arrays and the MARGIN argument can be a vector as well as a single integer.
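For instance, here is a small sketch on a made-up 2 x 3 x 4 array, first with a vector MARGIN and then with a single one.

```r
# Made-up 3-dimensional array: 2 rows, 3 columns, 4 slices
arr <- array(1:24, dim = c(2, 3, 4))

# MARGIN = c(1, 2): keep rows and columns, sum over the third dimension
per_cell <- apply(arr, c(1, 2), sum)
dim(per_cell)   # 2 3
per_cell[1, 1]  # 1 + 7 + 13 + 19

# MARGIN = 3: one sum per slice
per_slice <- apply(arr, 3, sum)
per_slice[1]    # sum of 1:6
```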
Another approach, if you want to use a varying portion of the dataset instead of a single value, is to use rollapply(data, width, FUN, ...) from the zoo package. Using a vector of widths allows you to apply a function on a varying window of the dataset. I've used this to build an adaptive filtering routine, though it isn't very efficient.
A dplyr Approach using across, rowSums and rowMeans.
M <- matrix(1:9, nrow=3, byrow=TRUE)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
M %>% as_tibble() %>%
rowwise() %>%
mutate(sum = rowSums(across(where(is.numeric)))) %>%
mutate(mean = rowMeans(across(V1:V3))) %>%
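mutate(Max = max(c_across(V1:V3))) %>% # c_across() gathers the row's values; V1:V3 alone builds the sequence from V1 to V3
mutate(Min = min(c_across(V1:V3))) %>%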
as.matrix()
V1 V2 V3 sum mean Max Min
[1,] 1 2 3 6 2 3 1
[2,] 4 5 6 15 5 6 4
[3,] 7 8 9 24 8 9 7

Questions about missing data

Suppose a matrix has some missing data recorded as NA.
How can I delete the rows that contain NA?
Can I use na.rm?
na.omit() will take matrices (and data frames) and return only those rows with no NA values whatsoever - it takes complete.cases() one step further by deleting the FALSE rows for you.
> x <- data.frame(c(1,2,3), c(4, NA, 6))
> x
c.1..2..3. c.4..NA..6.
1 1 4
2 2 NA
3 3 6
> na.omit(x)
c.1..2..3. c.4..NA..6.
1 1 4
3 3 6
I think na.rm usually only works within functions, say for the mean function. I would go with complete.cases: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.htm
Let's say you have the following 3x3 matrix:
x <- matrix(c(1:8, NA), 3, 3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 NA
then you can get the complete cases of this matrix with
y <- x[complete.cases(x),]
> y
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
The complete.cases-function returns a vector of truth values that says whether or not a case is complete:
> complete.cases(x)
[1] TRUE TRUE FALSE
and then you index the rows of matrix x and add the "," to say that you want all columns.
If you want to remove rows that contain NA's you can use apply() to apply a quick function to check each row. E.g., if your matrix is x,
goodIdx <- apply(x, 1, function(r) !any(is.na(r)))
newX <- x[goodIdx,]
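To tie the answers together, here is a short sketch on the 3x3 matrix from above showing that the different approaches keep the same rows. drop = FALSE is added here (it is not in the answers) so that the result stays a matrix even if only one row survives. The last line shows where na.rm does apply: as an argument to a summary function such as colMeans(), not as a way of deleting rows.

```r
x <- matrix(c(1:8, NA), 3, 3)

# Three ways to keep only the rows without any NA
by_complete <- x[complete.cases(x), , drop = FALSE]
by_apply <- x[apply(x, 1, function(r) !any(is.na(r))), , drop = FALSE]
by_rowsums <- x[rowSums(is.na(x)) == 0, , drop = FALSE]

# na.omit() drops the same rows and records them in an attribute
by_omit <- na.omit(x)

# na.rm is an argument of summary functions; it skips NAs in the calculation
col_means <- colMeans(x, na.rm = TRUE)
```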
