How do I change NA into Column median [duplicate]

This question already has answers here:
Replacing NA's in each column of matrix with the median of that column
(4 answers)
Closed 5 years ago.
A data frame contains 123 columns,
and each column has at least one NA value.
I want to replace these NA values with the column median.
Because there are so many columns, I cannot write code that names each column.
So I tried to use 'apply' to solve this, but it didn't work.
data2[-1]<-lapply(data2[-1],function(x)x - median(x,na.rm=TRUE))
It says it doesn't work because the input is a data frame, not numeric.

We can use na.aggregate
library(zoo)
j1 <- sapply(df1, is.numeric)
df1[j1] <- na.aggregate(df1[j1], FUN = median)

We can use map2_df
library(purrr)
df <- data.frame(a = c(1, 2, 3), b = c(2, NA, 9), c = c(NA, 3, 5), d = c(0, 4, NA))
# map() computes each column's median (dmap() now lives in the purrrlyr package)
purrr::map2_df(df, purrr::map(df, median, na.rm = TRUE), function(x, y) ifelse(is.na(x), y, x))

for(i in 1:ncol(df)){
df[is.na(df[,i]), i] <- median(df[,i], na.rm = TRUE)
}
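
For reference, the lapply attempt in the question can also be adapted directly. The function there subtracts the median from every value instead of filling in the NAs, so a minimal sketch of the fix swaps the subtraction for replace(). The toy data2 below is made up for illustration, and the sketch assumes every column except the first is numeric.

```r
# Toy data (hypothetical): first column is an identifier, the rest are numeric
data2 <- data.frame(id = c("a", "b", "c"),
                    x  = c(1, NA, 3),
                    y  = c(NA, 4, 10))

# Replace each NA with the median of its own column, skipping column 1
data2[-1] <- lapply(data2[-1], function(x) {
  replace(x, is.na(x), median(x, na.rm = TRUE))
})
data2
```

If some of the remaining columns are not numeric, median() will fail on them, so those columns would need to be filtered out first (for example with sapply(data2, is.numeric)).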

Related

Find max and min value on a dataframe, ignoring NAs

I need to find the max and min value in a dataframe "df" like this:
col1 col2 col3
7 4 5
2 NA 6
3 2 4
NA NA 1
The result should be: min = 1 and max = 7.
I have used this function:
min <- min(df, na.rm=TRUE)
max <- max(df, na.rm=TRUE)
but it gives me the following error:
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
So I have converted all the values as.numeric in this way:
df <- as.numeric(as.character(df))
but it introduces NAs by coercion and now the results are:
min = Inf and max = -Inf
How can I operate on the df ignoring NAs?

If the columns are not numeric, convert them with type.convert
df <- type.convert(df, as.is = TRUE)
Or force the conversion by going through a matrix
df[] <- as.numeric(as.matrix(df))
Or with lapply
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
With R 4.1.0 we can also do
sapply(df, \(x) as.numeric(as.character(x))) |>
range(na.rm = TRUE)
#[1] 1 7
Once the columns are numeric, the functions work as expected
min(df, na.rm = TRUE)
#[1] 1
max(df, na.rm = TRUE)
#[1] 7
Note that as.character/as.numeric requires a vector input and not a data.frame
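
To see why the whole-data-frame conversion in the question produces only NAs, here is a small sketch using the values from the question. It is an illustration rather than part of the original answer: as.character() applied to a data frame turns each entire column into a single string, which as.numeric() then cannot parse.

```r
df <- data.frame(col1 = c(7, 2, 3, NA),
                 col2 = c(4, NA, 2, NA),
                 col3 = c(5, 6, 4, 1))

# One string per column, not one string per value
as.character(df)

# Every element fails to parse, so the result is all NA
suppressWarnings(as.numeric(as.character(df)))

# Converting column by column keeps the values intact
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
range(df, na.rm = TRUE)
```

With every value NA, min(..., na.rm = TRUE) and max(..., na.rm = TRUE) have nothing left to work on, which is where the infinite results come from.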

We could use minMax function from dataMaid package (handles NA's)
library(dataMaid)
minMax(df, maxDecimals = 2)
Output:
Min. and max.: 2; 7
data:
library(tibble)
df <- tribble(
~col1, ~col2, ~col3,
7, 4, 5,
2, NA, 6,
3, 2, 4,
NA, NA, 1)

Another base R option
> range(na.omit(as.numeric(unlist(df))))
[1] 1 7
If the columns are of class factor, you should use (thanks to @akrun's comment)
as.numeric(as.character(unlist(df)))

How to replace NA values with different values based on column in R dataframe?

I am trying to replace NA values, column by column, with predetermined values from a vector. For example, I have a vector containing the values (1,5,3) and a dataframe df, and I want to replace all NA values in column one of df with 1, the NAs in column two with 5, and the NAs in column three with 3.
I tried a formula I saw that took
df[is.na(df)] = vector
but it didn't seem to work due to "wrong length". The length of the vector is the same as the number of columns in df.

You can use which to get the row/column indices of the NA values and replace them directly.
mat <- which(is.na(df), arr.ind = TRUE)
df[mat] <- vector[mat[, 2]]
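
As a worked sketch of this approach, the example below uses a small df and a replacement vector named vec (both chosen here for illustration; naming it vec avoids shadowing the base function vector). The second column of the index matrix holds the column number of each NA, which is what selects the matching replacement value.

```r
df  <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)

# One row per NA: first column is the row index, second is the column index
idx <- which(is.na(df), arr.ind = TRUE)

# Look up the replacement for each NA by its column index
df[idx] <- vec[idx[, 2]]
df
```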

We can use Map to replace the NAs in each column of the dataset with the corresponding value in the vector. This works almost all the time, and it is a concise, single-step replacement
df[] <- Map(function(x, y) replace(x, is.na(x), y), df, vec)
df
# col1 col2 col3
#1 1 5 2
#2 3 2 3
#3 1 5 3
Or another option is to make the lengths same, and then use pmax
df[] <- pmax(as.matrix(df), is.na(df) * vec[col(df)], na.rm = TRUE)
or another option with replace
df <- replace(df, is.na(df), rep(vec, colSums(is.na(df))))
NOTE: All the solutions above are one-liners
Or using data.table with set
library(data.table)
setDT(df)
for(j in seq_along(df)) set(df, i = which(is.na(df[[j]])), j = j, value = vec[j])
data
df <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)

Find index of all columns which contain the max value of the row in R?

In order to find the max value of each row I used:
col_max <- apply(dat, 1, max, na.rm=TRUE)
so I have a list of the max value for each row, but now I want to find the indices of each column where that max value appears by row (i.e., each row has a different max which may appear more than once).
How can I do this in R? Thanks in advance!

Isn't it possible to simply use which.max() and then specify the column or row?
This is not the way you intended it, but it might be simpler?

Try this:
# data frame with 5 columns and 10 rows
dat <- as.data.frame(matrix(sample(1:5, 50, replace = TRUE), nrow = 10))
# name for rows
rownames(dat) <- letters[1:10]
# find the max positions by row
apply(dat, 1, function(x) which(x %in% max(x)))
# find the max positions by col
apply(dat, 2, function(x) which(x %in% max(x)))
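
Since the question uses na.rm=TRUE, here is a hedged variant that tolerates NAs. The small dat below is made up for illustration. It uses lapply over the row numbers rather than apply, because apply simplifies its result to a matrix whenever every row happens to have the same number of maxima, which makes the output shape unpredictable.

```r
# Toy data (hypothetical) with ties and an NA
dat <- data.frame(a = c(1, 5, NA),
                  b = c(3, 5, 2),
                  c = c(3, 1, 0))
m <- as.matrix(dat)

# For each row, the column indices where the row maximum occurs.
# which() treats NA comparisons as FALSE, so NA cells are never reported.
max_cols <- lapply(seq_len(nrow(m)), function(i) {
  unname(which(m[i, ] == max(m[i, ], na.rm = TRUE)))
})
max_cols
```

The result is always a list with one integer vector per row, so rows with one maximum and rows with several can sit side by side.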

Rearrange dataframe by subsetting and column bind [duplicate]

This question already has an answer here:
Merging rows with the same ID variable [duplicate]
(1 answer)
Closed 7 years ago.
I have the following dataframe:
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
but I want it to be the following:
1 2 3 4 5
-1.53697673029089 , 2.10652020463275 -1.02183940974772 , 0.623009466458354 1.33614674072657 , 1.5694345481646 0.270466789820086 , -0.75670874554064 -0.280167896821629 , -1.33313822867893
0.26012874418111 , 2.87972571647846 -1.32317949800031 , -2.92675188421021 0.584199000313255 , 0.565499464846637 -0.555881716346136 , -1.14460518414649 -1.0871665543915 , -3.18687136890236
I mean when the value of se is the same, make a column bind.
Do you have any ideas how to accomplish this?
I had no luck with spread (tidyr), and I guess the solution involves sapply, cbind and an if statement. The real data has more than 35,000 rows.

It seems as though your eventual goal is to have a data file which has roughly 35000 columns. Are you sure about that? That doesn't sound very tidy.
To do what you want, you are going to need to have a row identifier. In the below, I've called it caseid, and then removed it once it was no longer required. I then transpose the result to get what you asked for.
library(tidyr)
library(dplyr)
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
st$caseid = rep(1:(nrow(st)/2), each = 2) # temporary
df = spread(st, se, xy) %>% select(-caseid) %>% t()
print(df)

If we need to split the 'xy' column elements into individual units, cSplit from splitstackshape can be used. Then rbind the alternating rows of 'st1' after unlisting.
library(splitstackshape)
st1 <- cSplit(st, 'xy', ', ', 'wide')
rbind(unlist(st1[c(TRUE,FALSE)][,-1, with=FALSE]),
unlist(st1[c(FALSE, TRUE)][,-1, with=FALSE]))
If we don't need to split the 'xy' column into individual elements, we can use dcast from data.table, which should be fast enough. Convert the 'data.frame' to a 'data.table' (setDT(st)), create a sequence column ('N') by 'se', and then dcast from 'long' to 'wide'.
library(data.table)
dcast(setDT(st)[, N:= 1:.N, se], se~N, value.var= 'xy')
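
If no extra packages are wanted, a base R sketch of the same long-to-wide step is possible with split and rbind. This is an alternative shown for illustration, not the method above, and it assumes every value of 'se' occurs the same number of times (otherwise rbind would recycle values). The seed is arbitrary and only there for reproducibility.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
st <- data.frame(se = rep(1:2, 5),
                 X  = rnorm(10, 0, 1),
                 Y  = rnorm(10, 0, 2))
st$xy <- paste(st$X, ",", st$Y)

# split() groups the 'xy' strings by 'se'; rbind() stacks each group as a row
wide <- do.call(rbind, split(st$xy, st$se))
dim(wide)
```

The result is a character matrix with one row per 'se' value and one column per occurrence.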

Mean row by imbricated levels of factors

I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!

Try:
You can use a specialized package such as dplyr or data.table, or base R. Because you have many columns starting with acc to average, I chose dplyr. The idea is to first group the data by id and then use summarise_each to get the mean, by id, of each column that starts_with acc
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
I would rename using paste
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD means Subset of Data.table, and .SDcols are the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1
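
summarise_each and funs have since been deprecated in dplyr, so on a recent version the same grouping idea can be sketched with across. This assumes dplyr 1.0.0 or later. The data mirror the structure in the question, with an arbitrary seed added for reproducibility, and the .names argument builds the "mean" prefix so no separate rename step is needed.

```r
library(dplyr)  # across() needs dplyr >= 1.0.0

set.seed(1)  # arbitrary seed, for reproducibility only
df <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                 sub  = rep(1:4, 4),
                 acc1 = runif(16, 0, 3),
                 acc2 = runif(16, 0, 3),
                 acc3 = runif(16, 0, 3),
                 acc4 = runif(16, 0, 3))

# Mean of every column starting with "acc", computed within each id
dfavg <- df %>%
  group_by(id) %>%
  summarise(across(starts_with("acc"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean{.col}"))
dfavg
```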

(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all the columns of its first argument, within groups defined by its second argument:
df2 <- aggregate(df[-(1:2)], df[1], mean)
If you want to prepend "mean" to the names of the averaged columns:
names(df2)[-1] <- paste0("mean", names(df2)[-1])
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)

Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").

You can do this in base R itself:
df$id <- factor(df$id) # since R 4.0.0, data.frame() no longer converts strings to factors by default
a <- list()
for (i in 1:nlevels(df$id))
{
# select the columns of df whose means you want; in your example, these are columns 3, 4, 5 and 6
a[[i]] <- colMeans(subset(df, id == levels(df$id)[i])[, c(3, 4, 5, 6)])
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow = 4, ncol = 4, byrow = TRUE)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865