Too many NA values in dataset for na.omit to handle in R

I have a text file dataset that I read as follows:
cancer1 <- read.table("cancer.txt", stringsAsFactors = FALSE, quote='', header=TRUE,sep='\t')
I then have to convert the class of the constituent values so that I can perform mathematical analyses on the df.
cancer<-apply(cancer1,2, as.numeric)
This introduces >9000 NA values into a 17980 × 598 df, so there are too many NA values to simply use na.omit, as that just removes all of the rows....
Hence my plan is to replace each NA in each row with the mean value of that row, my attempt is as follows:
for (i in rownames(cancer)) {
  cancer2 <- replace(cancer, is.na(cancer), mean(cancer[i, ]))
}
However this removes every row just like na.omit:
dim(cancer2)
[1] 0 598
Can someone tell me how to replace each of the NA values with the mean of that row?

You can use rowMeans with indexing.
k <- which(is.na(cancer1), arr.ind=TRUE)
cancer1[k] <- rowMeans(cancer1, na.rm=TRUE)[k[,1]]
Here k is a two-column matrix giving the row and column position of every NA value (that is what arr.ind = TRUE returns), and k[,1] picks out the row of each NA so it gets that row's mean.
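A minimal self-contained demo of this indexing trick on a toy matrix (made-up data, not the asker's cancer set):

```r
# Toy matrix with one NA per row
m <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 2, byrow = TRUE)
k <- which(is.na(m), arr.ind = TRUE)      # row/col position of every NA
m[k] <- rowMeans(m, na.rm = TRUE)[k[, 1]] # fill each NA with its row mean
m
# row 1 becomes 1, 2, 3 (row mean 2); row 2 becomes 4, 5, 4.5 (row mean 4.5)
```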
This works better than my original answer, which was:
for (i in 1:nrow(cancer1)) {
  for (n in 1:ncol(cancer1)) {
    if (is.na(cancer1[i, n])) {
      cancer1[i, n] <- mean(t(cancer1[i, ]), na.rm = TRUE) # or rowMeans(cancer1[i, ], na.rm = TRUE)
    }
  }
}

Sorted it out with code adapted from a related post:
cancer1 <- read.table("TCGA_BRCA_Agilent_244K_microarray_genomicMatrix.txt", stringsAsFactors = FALSE, quote='', header=TRUE, sep='\t')
t <- cancer1[1:800, 1:400]
t <- t(t)
t <- apply(t, 2, as.numeric) # constituents read as character strings need to be converted to numerics
cM <- rowMeans(t, na.rm = TRUE) # subsequent data cleaning: the >1000 introduced NA values are replaced with the mean of their row
indx <- which(is.na(t), arr.ind = TRUE)
t[indx] <- cM[indx[, 1]] # indx[, 1] is the row index, matching the rowMeans vector

Related

How to count missing values from two columns in R

I have a data frame which looks like this:
Contig_A    Contig_B
Contig_0    Contig_1
Contig_3    Contig_5
Contig_4    Contig_1
Contig_9    Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)
Create a vector of all the values that you want to check (all_contig), which here is Contig_0 to Contig_10. Use setdiff to find the absent values and length to get the count of missing values.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5
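As a self-contained sketch, here is the same setdiff approach with df reconstructed from the question's example (10 contigs, Contig_0 to Contig_9):

```r
# Rebuild the question's data frame
df <- data.frame(Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
                 Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0"),
                 stringsAsFactors = FALSE)
cols <- c("Contig_A", "Contig_B")
all_contig <- paste0("Contig_", 0:9)                 # full set of expected ids
missing_contig <- setdiff(all_contig, unlist(df[cols]))
missing_contig
# [1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
length(missing_contig)
# [1] 4
```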
By match:
x <- 0:9
contigs <- paste0("Contig_", x) # paste0 is vectorised, no sapply needed
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193.

Vlookup/match function in R for a continuous column

I have a 2 dataframe.
df1:
Dis1_SubDIs1_Village1        Dis2_SubDIs1_Village1   Dis1_SubDIs2_Village1
JODHPUR|JODHPUR|JODHPUR      |JODHPUR|JODHPUR        JODHPUR||JODHPUR
JHUNJHUNUN|JHUNJHUNUN|BARI   |JHUNJHUNUN|BARI        JHUNJHUNUN|BARI|BARI
BUNDI|HINDOLI|BUNDI          |HINDOLI|BUNDI          BUNDI|BUNDI|BUNDI
SIROHI|SIROHI|SIROHI         |SIROHI|SIROHI          SIROHI||SIROHI
ALWAR|ALWAR|BASAI            |ALWAR|BASAI            ALWAR||BASAI
BHARATPUR|BHARATPUR|SEEKRI   |BHARATPUR|SEEKRI       BHARATPUR||SEEKRI
and second data,
df2 :
High
|BHARATPUR|SEEKRI
BUNDI|HINDOLI|BUNDI
SIROHI||SIROHI
CHURU|TARANAGAR|DABRI CHHOTI
Now, I want to apply vlookup/match in each df1 column with respect to the df2 column, the same as we do in Excel.
If there is an exact match, give me the match, else 0.
I tried making the function in R
For match
for (i in names(df1)) {
  match_vector = match(df_final[, i], df$High, incomparables = NA)
  df1$High = df2$High[match_vector]
}
but I'm getting an error: it only keeps the result for the last column and overwrites the values of the other columns.
For vlookup:
func_vlook = function(a) {
  for (i in 1:ncol(a)) {
    lookup_df = vlookup_df(lookup_value = i,
                           dict = df2,
                           lookup_column = 1)
  }
  return(lookup_df)
}
lookup_df <- func_vlook(a = df1)
Still getting an error.
My final output should be like the below:
Dis1_SubDIs1_Village1_M1   Dis2_SubDIs1_Village1_M2   Dis1_SubDIs2_Village1_M3
NA                         NA                         NA
NA                         NA                         NA
BUNDI|HINDOLI|BUNDI        NA                         NA
NA                         SIROHI||SIROHI             SIROHI||SIROHI
NA                         NA                         NA
NA                         NA                         |BHARATPUR|SEEKRI
For N columns in df1, there should be N matched columns in the output.
Please help.
No need for any loops with this one - apply and match should work fine. apply will iterate over as many columns as you have, so the output will have the same number of columns as the input. In your example, apply will simplify to produce a matrix.
apply(X = df1,
MARGIN = 2,
FUN = function(x) df2$High[match(x, df2$High)])
If you need a dataframe as the output, then wrap the call in as.data.frame():
as.data.frame(apply(X = df1,
MARGIN = 2,
FUN = function(x) df2$High[match(x, df2$High)]))
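A minimal runnable sketch of the apply + match idea, using two abbreviated rows from the question (column names A and B are stand-ins for the long originals):

```r
# Abbreviated lookup data from the question
df1 <- data.frame(A = c("BUNDI|HINDOLI|BUNDI", "ALWAR|ALWAR|BASAI"),
                  B = c("SIROHI||SIROHI", "CHURU|TARANAGAR|DABRI CHHOTI"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(High = c("BUNDI|HINDOLI|BUNDI", "SIROHI||SIROHI"),
                  stringsAsFactors = FALSE)

# For each column, keep the value where it exactly matches df2$High, else NA
res <- apply(df1, 2, function(x) df2$High[match(x, df2$High)])
res
# row 1 matches in both columns; row 2 matches in neither, so it is all NA
```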

Why does code throw error when looped? My code works when I increment the index "by hand", but when I put it in a loop it fails

I want to append values from one dataframe as column names to an another data frame.
I've written code that will produce one column at a time if I "manually" assign index values:
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"), wiggy = c("soar", "trot", "dive", "gallop"))
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"))
#create vector of categories
categroups <- as.character(unique(df_searchtable$category))
##### if I assign column names one at a time using index numbers, no problem:
group = categroups[1]
df_host[, group] <- NA
##### if I use a loop to assign the column names:
for (i in categroups) {
  group = categroups[i]
  df_host[, group] <- NA
}
the code fails, giving:
Error in [<-.data.frame(`*tmp*`, , group, value = NA) :
missing values are not allowed in subscripted assignments of data frames
How can I get around this problem?
Here's a simple base R solution:
df_host[categroups] <- NA
df_host
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA
The problem with your loop is that you are looping over the elements themselves ("air", "ground"), whereas your code assumes you are looping over the indices 1, 2, ..., n; indexing with those character values returns NA.
For instance:
for (i in categroups) {
  print(i)
  print(categroups[i])
}
[1] "air"
[1] NA
[1] "ground"
[1] NA
To fix your loop, you could do one of two things:
for (group in categroups) {
  df_host[, group] <- NA
}
# or
for (i in seq_along(categroups)) {
  group <- categroups[i]
  df_host[, group] <- NA
}
Here's a solution using purrr's map.
bind_cols(df_host,
          map_dfc(categroups,
                  function(group) tibble(!!group := rep(NA_real_, nrow(df_host)))))
Gives:
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA
map_dfc maps over the input categroups, creates a single-column tibble for each one, and joins the newly created tibbles into a dataframe; bind_cols then joins the original dataframe to your new tibble.
Alternatively you could use walk:
walk(categroups, function(group){df_host <<- mutate(df_host, !!group := rep(NA_real_, nrow(df_host)))})
Here's an ugly base R solution: create an empty matrix with the column names and cbind it to the second dataframe.
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"),
                             wiggy = c("soar", "trot", "dive", "gallop"),
                             stringsAsFactors = FALSE)
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"),
                      stringsAsFactors = FALSE)
cbind(df_host,
      matrix(nrow = nrow(df_host),
             ncol = length(unique(df_searchtable$category)),
             dimnames = list(NULL, unique(df_searchtable$category))))
Result:
textcolum air ground
1 run on the ground NA NA
2 fly through the air NA NA

Assigning NA to new column before a loop in R

What do you think of assigning NA to a new column before a loop? Is it considered best practice? Is there a more elegant way to do this?
I discovered that not assigning NA to columns before filling them in a loop can cause trouble, especially on rows where the API can't return an answer: the row is left filled with the data from the previous line...
Can you please help?
library(jsonlite) # for fromJSON()

Url <- c("https://www.r-project.org/", "https://cran.r-project.org/")
df <- data.frame(Url)
URL_row <- nrow(df)
df$PageSpeed_Score <- NA
df$PageSpeed_NumberResources <- NA
df$PageSpeed_NumberHosts <- NA
for (i in 1:URL_row) {
  url_to_check <- as.character(df[i, "Url"])
  print(url_to_check)
  PageSpeed_APIrequest <- paste("https://www.googleapis.com/pagespeedonline/v2/runPagespeed?url=", url_to_check, "&strategy=desktop", sep = "")
  PageSpeed_APIrequest <- fromJSON(PageSpeed_APIrequest)
  df$PageSpeed_Score[i] <- PageSpeed_APIrequest$rule$SPEED
  df$PageSpeed_NumberResources[i] <- PageSpeed_APIrequest$pageStats$numberResources
  df$PageSpeed_NumberHosts[i] <- PageSpeed_APIrequest$pageStats$numberHosts
}
The only (small) issue with NA is that it is of type logical.
typeof(NA)
It might be better to use NA_character_ or NA_real_ (etc.) depending on the expected type of the column. Not a big deal in practice, though, because R coerces the column to the right type on the first real assignment.
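For illustration, here is what the typed NA constants look like and why pre-allocating with the right type avoids a later coercion:

```r
typeof(NA)            # "logical"
typeof(NA_real_)      # "double"
typeof(NA_character_) # "character"

x <- rep(NA_real_, 3) # pre-allocate the column with the final type
x[1] <- 2.5
typeof(x)             # still "double": no coercion was needed
```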

Function to change blanks to NA

I'm trying to write a function that turns empty strings into NA. A summary of one of my column looks like this:
           a     b
  12     210   468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
  # split df into numeric & non-numeric columns
  a <- df[, sapply(df, is.numeric), drop = FALSE]
  b <- df[, sapply(df, Negate(is.numeric)), drop = FALSE]
  # Change empty strings to NA
  b <- b[lapply(b, function(x) levels(x) <- c(levels(x), NA)), ] # add NA level
  b <- b[lapply(b, function(x) x[x == "", ] <- NA), ]            # change Null to NA
  # Put the columns back together
  d <- cbind(a, b)
  d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won't work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You'll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE because the NULL values are wrapped inside the list.
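A quick toy check of the logical-matrix replacement (made-up data):

```r
df <- data.frame(x = c("a", "", "c"),
                 y = c("", "e", "f"),
                 stringsAsFactors = FALSE)
df[df == ""] <- NA  # df == "" is a logical matrix; matching cells become NA
df
#      x    y
# 1    a <NA>
# 2 <NA>    e
# 3    c    f
```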
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
null_na=function(vector){
new_vector=rep(NA,length(vector))
for(i in 1:length(vector))
if(vector[i]== ""){new_vector[i]=NA}else if(is.na(vector[i]))
{new_vector[i]=NA}else{new_vector[i]=vector[i]}
return(new_vector)
}
Just plug in the column or vector you are having an issue with.
