Replacing NA cells with a string in an R dataframe

I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "undefined". I know that it has something to do with variables being made into factors rather than characters (Warning message: In `...` : invalid factor level, NA generated); however, even when importing data with stringsAsFactors = FALSE I still get this warning for some cells.
Here is how I import the data:
raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))
The taxon_import function is used to split the taxa and assign variable names:
taxon_import <- function(data) {
  data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
  colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  return(data)
}
Now the following function is used to "clean" the data, and this is where I would like to replace certain strings with "Undefined". However, I keep getting the warning: In `[<-.factor`(`*tmp*`, thisvar, value = "Undefined") : invalid factor level, NA generated
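(For reference, a minimal illustration of what triggers that warning, independent of the taxonomy data: assigning a value that is not an existing level of a factor yields NA.)
f <- factor(c("a", "b"))
f[1] <- "Undefined"
# Warning: invalid factor level, NA generated
f
#[1] <NA> b
#Levels: a b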
Here follows the data_cleanup function:
data_cleanup <- function(data) {
  strip_1 = list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
  for (i in strip_1) {
    data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
  }
  data[data == ""] <- "Undefined"
  strip_2 = list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
  for (j in strip_2) {
    data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
  }
  return(data)
}
The function is simply applied like: test <- data_cleanup(raw_data_1)
I am linking the data from a cloud drive, since it is very lengthy. Here is the link to a data file: https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn
I hope you will forgive my ignorance, however I tried many solutions before posting here.

We start by using the tidyverse library. Let me give a twist to your question: it is about replacing NAs, but I think with the code below you can avoid that problem altogether.
As I read your code, you erase the strings "D_0__", "D_1__", ... from the observation strings. Then you replace the strings "Ambiguous_taxa", "unidentified", ... with the string "Undefined".
According to your data, I replaced the loops with regular expressions, which makes it a little easier to clean your data:
library(tidyverse)
taxon_import <- function(data) {
  data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
  colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  return(data)
}
raw_data_1 <- taxon_import(read.delim("taxonomy.tsv", stringsAsFactors = FALSE))
raw_data_1 <- data.frame(lapply(raw_data_1,as.character),stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(raw_data_1, function(x) sub("^D_[0-6]__", "", x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("__|unidentified|Ambiguous_taxa|uncultured","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("Unknown|uncultured\\s.\\*|Unassigned\\s.\\*","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("wastewater\\sUnassigned|metagenome","Undefined",x)), stringsAsFactors = FALSE)
depured[depured ==""] <- "Undefined"
Let me explain my code. First, I have read on many websites that it's better to avoid loops such as for. So how do you replace text that starts with "D_0__"?
The answer is regex (regular expression). It seems complicated at first but with practice it'll be helpful. See this expression:
"^D_[0-6]__"
It means: at the start of the string (^), match "D_", followed by a digit between 0 and 6, followed by "__".
Aha! So you can use the function sub:
sub("^D_[0-6]__","",string)
which reads: replace the match of the regular expression with the empty string "" in string.
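For instance, on a made-up taxon string of the same form as in your data:
sub("^D_[0-6]__", "", "D_0__Bacteria")
#[1] "Bacteria"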
Now you see another regex:
"__|unidentified|Ambiguous_taxa|uncultured"
It means: select the string "__" or "unidentified" or "Ambiguous_taxa" ...
Be careful with this regex
"Unknown|uncultured\\s.\\*|Unassigned\\s.\\*"
it means: select the string "Unknown" or "uncultured .*" or...
a blank space is represented by \s and a literal asterisk by \*
Now, what about the as.data.frame function? Every time I use it I have to set stringsAsFactors = FALSE, because otherwise the function turns the character columns into factors.
With this code no NAs are created.
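For completeness: once every column is a plain character vector (as raw_data_1 is after the lapply(..., as.character) step above), any remaining NA cells can also be overwritten directly, with no factor warning. A minimal sketch:
# NA cells in a character-only data frame can be replaced in one assignment
raw_data_1[is.na(raw_data_1)] <- "Undefined"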
Hope it helps, please don't hesitate to ask if needed.
Regards,
Alexis

Related

Using lapply with gsub to replace words in a dataframe using another dataframe as 'dictionary'

I have a dataframe called data where I want to replace some words in specific columns A & B.
I have a second dataframe called dict that plays the role of a dictionary/hash containing the words and the values to use for replacement.
I think it could be done with purrr's map(), but I want to use apply. It's for a package and I don't want to have to load another package.
The following code is not working, but it gives you the idea. I'm stuck.
columns <- c("A", "B" )
data[columns] <- lapply(data[columns], function(x){x}) %>% lapply(dict, function(y){
gsub(pattern = y[,2], replacement = y[,1], x)})
This is working for one word to change... but I'm not able to pass the list of changes contained in the dictionary.
data[columns] <- lapply(data[columns], gsub, pattern = "FLT1", replacement = "flt1")
@Gregor_Thomas is right, you need a for loop to have a cumulative effect, otherwise you just replace one value at a time.
df <- data.frame("A"=c("PB1","PB2","OK0","OK0"),"B"=c("OK3","OK4","PB1","PB2"))
dict <- data.frame("pattern"=c("PB1","PB2"), "replacement"=c("OK1","OK2"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x <- gsub(pattern = dict$pattern[i], replacement = dict$replacement[i],x)
}
return(x)
})
Or, if your dict data is too long, you can generate the succession of gsub calls you need by using paste as a code generator:
paste0("df[,'A'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'A'])")
It generates all the gsub lines for the "A" column :
"df[,'A'] <- gsub(pattern = 'PB1', replacement = 'OK1',df[,'A'])"
"df[,'A'] <- gsub(pattern = 'PB2', replacement = 'OK2',df[,'A'])"
Then you evaluate the code and wrap it in a lapply for the various columns:
lapply(c("A","B"), FUN = function(v) { eval(parse(text=paste0("df[,'", v,"'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'",v,"'])"))) })
It's ugly but it works fine to avoid long loops.
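As a side note, the same chaining can be written more compactly by folding the dictionary over each column with Reduce(). This is only a sketch, using the df, dict and column names from above:
cols <- c("A", "B")
df[cols] <- lapply(df[cols], function(x) {
  # apply one gsub() per dictionary row, feeding each result into the next
  Reduce(function(acc, i) gsub(dict$pattern[i], dict$replacement[i], acc),
         seq_len(nrow(dict)), init = x)
})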
Edit: for exact matching between df and dict, maybe you should use a boolean selection with == instead of gsub().
(I don't use match() here because it selects only the first match.)
df <- data.frame("A"=c("PB1","PB2","OK0","OK0","OK"),"B"=c("OK3","OK4","PB1","PB2","AB"))
dict <- data.frame("pattern"=c("PB1","PB2","OK"), "replacement"=c("OK1","OK2","ZE"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x[x==dict$pattern[i]] <- dict$replacement[i]
}
return(x)
})
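Another exact-match option, sketched here, is a named lookup vector; values that are not in dict are left untouched:
# Named lookup vector: names are the patterns, values are the replacements
lookup <- setNames(as.character(dict$replacement), dict$pattern)
apply(df[, c("A", "B")], 2, function(x) {
  hit <- x %in% names(lookup)
  x[hit] <- lookup[x[hit]]   # only exact matches are replaced
  x
})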

convert character column and then split it into multiple new boolean columns using r mutate

I am attempting to split out a flags column into multiple new columns in R using the mutate_at and then separate functions. I have simplified and cleaned my solution as seen below; however, I am getting an error that indicates that the entire column of data is being passed into my function rather than each row individually. Is this normal behaviour, which just requires me to loop over each element of x inside my function, or am I calling the mutate_at function incorrectly?
example data:
dataVariable <- data.frame(c_flags = c(".q.q.q","y..i.o","0x5a",".lll.."))
functions:
dataVariable <- read_csv("...",
  col_types = cols(
    c_date = col_datetime(format = ""),
    c_dbl = col_double(),
    c_flags = col_character(),
    c_class = col_factor(c("a", "b", "c")),
    c_skip = col_skip()
  ))
funTranslateXForNewColumn <- function(x){
  binary = ""
  if(startsWith(x, "0x")){
    binary = hex2bin(x)
  } else {
    binary = c(0,0,0,0,0,0)
    splitFlag = strsplit(x, "")[[1]]
    for(i in splitFlag){
      flagVal = 1
      if(i == "."){
        flagVal = 0
      }
      binary = append(binary, flagVal)
    }
  }
  return(paste(binary[4:12], collapse = ''))
}
mutate_at(dataVariable, vars(c_flags), funs(funTranslateXForNewColumn(.)))
separate(dataVariable, c_flags, c(NA, "flag_1","flag_2","flag_3","flag_4","flag_5","flag_6","flag_7","flag_8","flag_9"), sep="")
The error I am receiving is:
Warning messages:
1: Problem with `mutate()` input `c_flags`.
i the condition has length > 1 and only the first element will be used
After translating the string into an appropriate binary representation of the flags, I will then use the separate function to split it into new columns.
Similar to OP's logic but maybe shorter:
dataVariable$binFlags <- sapply(strsplit(dataVariable$c_flags, ''), function(x)
  paste(as.integer(x != '.'), collapse = ''))
If you want to do this using dplyr we can implement the same logic as :
library(dplyr)
dataVariable %>%
mutate(binFlags = purrr::map_chr(strsplit(c_flags, ''),
~paste(as.integer(. != '.'), collapse = '')))
# c_flags binFlags
#1 .q.q.q 010101
#2 y..i.o 100101
#3 .lll.. 011100
mutate_at/across is used when you want to apply a function to multiple columns. Moreover, here you are creating only one new binary column, not multiple new columns as mentioned in your post.
I was able to get the outcome I desired by replacing the mutate_at function with:
dataVariable$binFlags <- mapply(funTranslateXForNewColumn, dataVariable$c_flags)
However I want to know how to use the mutate_at function correctly.
credit to: https://datascience.stackexchange.com/questions/41964/mutate-with-custom-function-in-r-does-not-work
The above link also includes the solution to get this function to work which is to vectorize the function:
v_funTranslateXForNewColumn <- Vectorize(funTranslateXForNewColumn)
mutate_at(dataVariable, vars(c_flags), funs(v_funTranslateXForNewColumn(.)))
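For reference, funs() is deprecated in current dplyr; with dplyr >= 1.0 the same call can be written with across(). A sketch, reusing the vectorized function from above:
library(dplyr)
dataVariable %>%
  mutate(across(c_flags, v_funTranslateXForNewColumn))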

separate distinct strings with common characters using stringr::str_detect() in R

I have the following sample character vector:
sample_dat <- c("Q2", "Q20", "Q21", "Q23_8_T", "Q21_fct", "Q2_fct7", "Q20_fct7_4", "Q2_fct7_4")
From this vector of strings, I want to isolate those that share in common the initial prefix using a regular expression so that I might be able to use it again in a function, such that the desired subset of strings for prefix = "Q2" would be the result of the following code snippet:
(desired_subset <- sample_dat[c(1, 6, 8)])
That is, the desired output should be c("Q2", "Q2_fct7", "Q2_fct7_4")
I tried using stringr::str_detect() to reproduce the desired_subset using a regular expression, but I am unable to have desired_subset[1] appear in the result:
library(stringr)
sample_dat[str_detect(string = sample_dat, pattern = "Q2_")]
In the case above, too few results are returned; I am missing "Q2" itself.
Whereas in the code below, too many results are returned. For example "Q20" and "Q21" are returned which is not what I want.
sample_dat[str_detect(string = sample_dat, pattern = "Q2")]
Eventually, I'd like to use it in a function like so:
subset_str <- function(str, prefix){
  substitute(prefix)
  str_set <- str_detect(string = str, pattern = paste0(eval(prefix), '_'))
  return(str[str_set])
}
such that
subset_str(sample_dat, "Q2") would return ONLY
c("Q2", "Q2_fct7", "Q2_fct7_4") and
subset_str(sample_dat, "Q20") would return ONLY
c("Q20", "Q20_fct7")
Perhaps there is someone who might be able to help me.
Thanks.
We can specify the pattern as the intended prefix ("Q20") anchored at the start of the string (^), followed by either a _ or (|) the end ($) of the string:
grep("^Q20(_|$)", sample_dat, value = TRUE)
#[1] "Q20" "Q20_fct7_4"
grep("^Q2(_|$)", sample_dat, value = TRUE)
#[1] "Q2" "Q2_fct7" "Q2_fct7_4"
which can be wrapped into a function that builds the pattern from the prefix:
subset_str <- function(string, prefix){
  grep(paste0("^", prefix, "(_|$)"), string, value = TRUE)
}
Or the same pattern in str_detect
library(stringr)
sample_dat[str_detect(string = sample_dat, pattern = "Q2(_|$)")]
#[1] "Q2" "Q2_fct7" "Q2_fct7_4"

Extract and paste together multiple columns of a data frame like object using a vector of column names

I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].
I have a vector groups containing names of some of its columns (3 in example below).
I generate strings based on combinations of elements in the columns as follows:
paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")
I would like to generalize this so that I don't need to know how many elements are in groups.
The following attempt fails:
> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) :
attempt to extract more than one element
Here is how I would do it in a functional style with a Python dictionary:
map("-".join, zip(*map(rld.get, groups)))
Is there a similar column-getter operator in R?
As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)
This was generated using the DESeq2 bioinformatics package and, more precisely, by doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.
DESeq2 can be installed from bioconductor as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
Reproducible example
One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:
Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) :
second argument must be a list
After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:
library(DESeq2)
library(test.package)

lib_names <- c(
  "WT_1",
  "mut_1",
  "WT_2",
  "mut_2",
  "WT_3",
  "mut_3"
)
file_names <- paste(
  lib_names,
  "txt",
  sep = "."
)
wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times = 3)
replicates <- c(rep("1", times = 2), rep("2", times = 2), rep("3", times = 2))
sample_table = data.frame(
  lib = lib_names,
  file_name = file_names,
  genotype = genotypes,
  replicate = replicates
)

dds_raw <- DESeqDataSetFromHTSeqCount(
  sampleTable = sample_table,
  directory = ".",
  design = ~ genotype
)
# Remove genes with too few read counts
dds <- dds_raw[rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)

test_do_paste <- function(dds) {
  require(DESeq2)
  groups <- head(colnames(colData(dds)), -2)
  rld <- rlog(dds, blind = F)
  stopifnot(all(groups %in% names(colData(rld))))
  combined_names <- do.call(
    function (...) paste(..., sep = "-"),
    colData(rld)[groups]
  )
  print(combined_names)
}

test_do_paste(dds)

# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)
The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
Data used in the example:
WT_1.txt
WT_2.txt
WT_3.txt
mut_1.txt
mut_2.txt
mut_3.txt
I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library
Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.
We may use either of the following:
do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))
We can consider a small, reproducible example:
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0" "21-160-3.9-2.875-0" "22.8-108-3.85-2.32-1"
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
Note, it is your responsibility to ensure all(groups %in% names(rld)) is TRUE, otherwise you get "subscript out of bound" or "undefined column selected" error.
(I am copying your comment as a follow-up)
It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:
> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"
do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):
> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mcols’ for signature ‘"character"’
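One way around that error, sketched here (assuming DESeq2/S4Vectors are loaded), is to coerce the DataFrame to a base data.frame first, so that c(..., sep = "-") builds an ordinary list for do.call():
# Coerce the S4Vectors DataFrame to a base data.frame before combining
do.call(paste, c(as.data.frame(colData(rld))[groups], sep = "-"))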

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem I am faced with: I input the data using read.csv(filename, header=TRUE), and the spaces in variable names became "."; for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, but the variable names have been changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names = FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (backticks in some cases) or refer to the columns by location rather than name while editing.
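A minimal sketch (with a hypothetical file name) of reading the data that way and then referring to a column whose name contains a space:
dat <- read.csv("my_data.csv", header = TRUE, check.names = FALSE)
dat$`Full Code`      # backticks around the name
dat[["Full Code"]]   # or character indexing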
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
                        pattern = "\\.",
                        replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
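For example (a sketch, assuming write.xlsx() from the xlsx package, where row.names is an argument):
# Export without the row-index column
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)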
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.

  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")

  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")

  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." (or any other kind of punctuation) in column names, by using a regex with the stringr package, like so:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
