A package ('related') requires me to change some values within variables in a largish SNP dataframe (385x12300). This is no doubt simple, but I can't find this particular question anywhere. Sample data:
binfrom <- c(1, 1, 1, 1, 0, "NA")
x <- sample(binfrom, 100, replace = TRUE)
x <- data.frame(matrix(x, 10, 10))
I need the variable names X1, X2, etc. to replace each "1" in their respective columns; the values "0" and "NA" should remain unchanged.
Another way is to use which (I'm assuming you have real NAs there; see @akrun's comment):
indx <- which(x == 1, arr.ind = TRUE)
x[indx] <- names(x)[indx[, 2]]
This basically identifies the locations of the ones and replaces them with the corresponding column names, using the column locations from the generated index.
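To see what the index looks like, here is a minimal sketch on a tiny two-column example (not the OP's data):
# 'which' with arr.ind = TRUE returns a two-column matrix of row/column positions
m <- data.frame(X1 = c(1, 0), X2 = c(NA, 1))
idx <- which(m == 1, arr.ind = TRUE)
idx
#   row col
# 1   1   1
# 2   2   2
m[idx] <- names(m)[idx[, "col"]]
m
#   X1   X2
# 1 X1 <NA>
# 2  0   X2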
We convert the columns of 'x' from factor to character class and use Map to replace the 1s in each column with the corresponding column name.
x[] <- lapply(x, as.character)
x[] <- Map(function(y,z) replace(y, y==1, z), x, colnames(x))
In the OP's post, NA was created as the character string "NA". Because of that, the columns became factor when the data.frame was created (with stringsAsFactors = TRUE, the default before R 4.0). If real NA values were used, the first step, i.e. converting to character, would not be needed.
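To illustrate the difference with a small sketch (stringsAsFactors is set explicitly here, since it defaults to FALSE from R 4.0 on):
# The string "NA" is just an ordinary factor level, not a missing value
d1 <- data.frame(v = c(1, 0, "NA"), stringsAsFactors = TRUE)
class(d1$v)   # "factor"
is.na(d1$v)   # FALSE FALSE FALSE
# With real NA the column stays numeric and needs no conversion
d2 <- data.frame(v = c(1, 0, NA))
class(d2$v)   # "numeric"
is.na(d2$v)   # FALSE FALSE  TRUE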
If we work with data.table, another option is set, which should be fast on large datasets.
library(data.table)
setDT(x)
for(j in seq_along(x)){
  set(x, i = NULL, j = j, value = as.character(x[[j]]))
  set(x, i = which(x[[j]] == 1 & !is.na(x[[j]])),
      j = j, value = names(x)[j])
}
NOTE: The assumption is that we are working with real NA values.
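If the data does contain the literal string "NA" instead, one way to turn it into real NA first (a sketch, assuming the columns have already been converted to character as above):
# Replace the string "NA" with a real NA, by reference, column by column
for(j in seq_along(x)){
  set(x, i = which(x[[j]] == "NA"), j = j, value = NA)
}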
Related
This should be easy, but Google and I are failing. Say I have this data:
library(data.table)
mydata <- data.table(a = c(1, NA),
                     b = c(NA, NA),
                     pointer = c(1, 2))
and I want to get the rows where both a and b are NA. Of course I can do this manually like:
mydata[is.na(a) & is.na(b)]
but the issue arises deep inside other code, and I want to do this based on a character vector (or list, or whatever; this is flexible) of the column names, such as:
myvector <- c("a","b")
Again, I can do this manually if I know how many elements the vector has:
mydata[is.na(get(myvector[1])) & is.na(get(myvector[2]))]
But I don't know how many elements myvector has in my application. How can I do this without specifying the number of entries in myvector? Essentially, I'm looking for something like with = F but for i in data.table. So I want to use myvector like this:
mydata[is.na(somefunction(myvector))]
I tried all kinds of paste0(myvector, collapse = " & ") combinations with get() or as.formula(), but it got me nowhere.
We can specify .SDcols with the vector of column names, loop over the .SD (Subset of Data.table), create a list of logical vectors with is.na, and Reduce the list to a single logical vector with & (which combines the corresponding elements of the columns element-wise), then use that to subset the rows of the data:
library(data.table)
mydata[mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]]
-output
# a b pointer
#1: NA NA 2
Or use mget
mydata[mydata[, Reduce(`&`, lapply(mget(myvector), is.na))]]
Here is another solution assuming that myvector is a character vector:
library(data.table)
mydata[rowSums(!is.na(mydata[, ..myvector])) == 0]
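Here !is.na() marks the non-missing entries, so a row sum of 0 means every selected column is NA. A quick sanity check with the data above:
rowSums(!is.na(mydata[, ..myvector]))
# [1] 1 0
The first row has one non-NA value (a = 1), the second has none, so only the second row is returned.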
Trying to impute missing values in all numeric columns using this loop:
for(i in 1:ncol(df)){
  if (is.numeric(df[,i])){
    df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
  }
}
When the data.table package is not attached, the code above works as it should. Once I attach the data.table package, the behaviour changes and it shows me this error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere, but with no success. Actually, it did not even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm = TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back into a data.frame using
setDF(df)
Here is an alternative answer to this question, which I came up with while working on a similar problem at a larger scale. One might be interested in avoiding for loops by using the [.data.table method:
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...){
  # leave non-numeric columns untouched
  if(!is.numeric(x)) return(x)
  na <- is.na(x)
  # if 'val' is a function (e.g. mean), apply it to the non-missing values
  if(is.function(val))
    val <- val(x[!na], ...)
  if(!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}
To perform the imputation on the data table, one could create and evaluate an expression in the data.table environment, but for simplicity we'll overwrite using <-:
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns and keep any non-numeric columns as is. If we wished to impute another value (like... 42 or whatever), or if we have some grouping variable within which the mean should be computed, this can be done as well:
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
These would impute 42 and the within-group mean, respectively.
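For example, on a small hypothetical table with a grouping column (the names here are made up for illustration):
library(data.table)
DF <- data.table(group = c("a", "a", "b", "b"),
                 val = c(1, NA, 3, NA))
DF[, lapply(.SD, impute_na), by = group]
#    group val
# 1:     a   1
# 2:     a   1
# 3:     b   3
# 4:     b   3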
Is there a better way to go through observations in a data frame and impute NA values? I've put together a for loop that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop; perhaps a built-in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){ # for every row
x <- as.numeric(df[i,]) # subset/extract that row into a numeric vector
y <- is.na(x) # create logical vector of NAs
z <- !is.na(x) # create logical vector of non-NAs
result <- mean(x[z]) # get the mean value of the row
df2[i,y] <- result # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter in which. Then we can simply index df (by the row and column indexes) and the row means (only by the row indexes) and replace the values accordingly.
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
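Here the NAs are placed explicitly at positions (1,2), (2,3) and (3,4), so the index matrix is:
which(is.na(df), arr.ind = TRUE)
#   row col
# 1   1   2
# 2   2   3
# 3   3   4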
This is a little trickier than I'd like: it relies on the column-major ordering of matrices in R, as well as on the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but haven't managed so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
One possibility is to use impute from Hmisc, which allows choosing any function to do the imputation:
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply:
t(apply(df2, 1, function(x) {
mu <- mean(x, na.rm=T)
x[is.na(x)] <- mu
x
}))
I have a dataframe in which I occasionally have -1s. I want to replace them with NA. I tried the apply function, but it returns a character matrix, which is no good:
apply(d, c(1,2), function(x){
  if (x == -1){
    return(NA)
  } else {
    return(x)
  }
})
I am wrestling with by but I cannot seem to handle it properly. I have got this so far:
s <- by(d, d[,'Q1_I1'], function(x){
  for(i in x)
    print(i)
})
which, if I understood correctly, means by() serves my dataframe into x row by row, and I can iterate through each element of the row with the for loop. I just don't know how to replace the values.
The reason apply does not work is that it converts the data frame to a matrix, and if your data frame has any factor columns, this will be a character matrix.
You can use lapply instead, which will process the data frame one column at a time. This code works:
mydf <- data.frame( x=c(1:10, -1), y=c(-1, 10:1), g=sample(letters,11) )
mydf
mydf[] <- lapply(mydf, function(x) { x[x==-1] <- NA; x})
mydf
As @rawr mentions in the comments, it does work to do:
mydf[ mydf== -1 ] <- NA
but the documentation (?'[.data.frame') says that this is not recommended due to the conversions.
One big question is how the data frame is being created. If you are reading the data using read.table or related functions then you can just specify the na.strings argument and have the conversion done for you as the data is read in.
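For instance (a sketch with a hypothetical file name):
# Any "-1" in the file is read in directly as NA; the columns keep their numeric type
d <- read.table("mydata.txt", header = TRUE, na.strings = c("NA", "-1"))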
You can do this quickly and transparently with the data.table package.
# take the standard dataset and convert it to a data.table
mtcars = data.table(mtcars, keep.rownames = TRUE)
# select rows with gear == 5 and set gear to NA there
mtcars[gear == 5, gear := NA]
mtcars
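A quick check that the replacement happened (mtcars contains five cars with gear == 5):
mtcars[is.na(gear), .N]
# [1] 5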
I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values, which I can then use to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric:
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() and pass na.rm = T:
m <- mean(nums, na.rm = T)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead, just add a new column:
df$new.x <- nums
This is a function I wrote yesterday to combat non-numeric types. I have a data.frame with an unpredictable type for each column. I want to calculate the means for the numeric columns and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since all columns come as
  # characters, it first tries to see if x == "TRUE" or "FALSE". If
  # not, it tries to coerce the vector into integer. If that doesn't
  # work, it tries to see if there's a ' \" ' in the vector (meaning a
  # column with characters), and uses that as the result. Finally, if nothing
  # else passes, it means the column type is numeric, and it calculates
  # the mean of that. The end.
  # browser()
  # try if logical
  if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
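For example, on a small mixed data frame (a sketch; note that apply() coerces the whole data frame to a character matrix before the function sees it):
df <- data.frame(size = c("10", "20"), weight = c(1.5, 2.5))
apply(X = df, MARGIN = 2, FUN = colMeans2)
#   size weight
#     10      2
The size column survives the strtoi() check and returns its first value, while weight falls through to the mean.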
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- as.numeric(vec)
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
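Here the numeric entries are 0 through 10, whose mean is 5, so both NAs become 5:
vec2
#  [1]  0  1  2  3  4  5  6  7  8  9 10  5  5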
as.numeric will print the warning message listed below and convert the non-numeric values to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion