Multiple gsub() expressions in R

I'm trying to clean a column of data from a data frame with many gsub commands.
Some examples would be:
df$col1<-gsub("-00070", "-0070", df$col1)
df$col1<-gsub("-00063", "-0063",df$col1)
df$col1<-gsub("F4", "FA", df$col1)
...
Looking at the column after running these lines of code, some of the changes have taken effect but others have not. Moreover, the more times I rerun the block of gsub() commands, the more changes take effect.
I'm very confused by this behavior. Any information is appreciated.

There's probably a better way, but you could always use Map
new <- 1:3
old <- letters[1:3]
to.change <- letters[1:10]
Map(function(x, y) to.change <<- gsub(x, y, to.change), old, new)
to.change
# [1] "1" "2" "3" "d" "e" "f" "g" "h" "i" "j"
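A variation on the same idea that avoids the global assignment with `<<-` is to keep the patterns and replacements in a named vector and loop over it. This is only a sketch: the `replacements` table and the sample `col1` values are made up for illustration, and `fixed = TRUE` assumes the patterns are literal strings rather than regular expressions.

```r
# Hypothetical lookup table: names are the patterns, values the replacements
replacements <- c("-00070" = "-0070", "-00063" = "-0063", "F4" = "FA")

# Made-up sample data standing in for df$col1
col1 <- c("AB-00070", "F4-00063", "XY-0001")

cleaned <- col1
for (pattern in names(replacements)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  cleaned <- gsub(pattern, replacements[[pattern]], cleaned, fixed = TRUE)
}
cleaned
```

The replacements are applied in the order they appear in the table, so if one replacement can create text that another pattern matches, the order matters.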

Related

Why subtracting an empty vector in R deletes everything?

Could someone please enlighten me why subtracting an empty vector in R results in the whole content of a data frame being deleted? Just to give an example
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
Will give me all blanks in JointProcedures3 if WhichInstances2 has all its value as FALSE, but it should simply give me what JointProcedures3 was before those lines of code.
This is not the first time it has happened to me. I have asked my supervisor, and it has happened to him as well; he just thinks it is a quirk of R.
Rewriting the code as
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
if(length(WhichInstances2)>0)
{
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
}
fixes the issue. But in principle it should not have made a scooby of a difference whether that conditional was there or not, since if length(WhichInstances2) was equal to 0, I would simply be subtracting nothing from the original JointProcedures3...
Thanks all for your input.
Let's try a simpler example to see what's happening.
x <- 1:5
y <- LETTERS[1:5]
which(x>4)
## [1] 5
y[which(x>4)]
## [1] "E"
So far so good ...
which(x>5)
## integer(0)
y[which(x>5)]
## character(0)
This is also fine. Now what if we negate? The problem is that integer(0) is a zero-length vector, so -integer(0) is also a zero-length vector, so y[-which(x>5)] is also a zero-length vector.
What can you do about it? Don't use which(); instead use logical indexing directly, and use ! to negate the condition:
y[!(x>5)]
## [1] "A" "B" "C" "D" "E"
In your case:
JointID_OK <- (JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[!JointID_OK,]
For what it's worth, this is section 8.1.13 in the R Inferno, "negative nothing is something"
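To make the difference concrete, here is a small sketch on the same toy vectors. The variable names (`idx`, `dropped_all`, `kept_logical`, `kept_setdiff`) are only illustrative, and the setdiff() variant is one possible alternative for cases where you want to keep working with positions from which().

```r
x <- 1:5
y <- LETTERS[1:5]

idx <- which(x > 5)      # integer(0): nothing matches
dropped_all <- y[-idx]   # -integer(0) is still integer(0), so nothing is selected

# Option 1: negate the logical condition instead of the positions
kept_logical <- y[!(x > 5)]

# Option 2: keep which(), but remove the positions with setdiff()
kept_setdiff <- y[setdiff(seq_along(y), idx)]
```

Both options return all five letters when nothing matches, which is the behaviour the question expected.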
It seems you are checking for ids in a vector and you intend to remove them from another; probably setdiff is what you are looking for.
Suppose we have the vector of lowercase letters (letters is built into R) and we want to remove any entry that matches something that is not in there ("ab"). As programmers, we would want nothing to be removed, keeping all 26 letters.
# won't work: returns character(0)
letters[-which(letters == "ab")]
# works: setdiff compares values, so pass the value to drop rather than its position
setdiff(letters, "ab")
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"

R: Check if strings in a vector are present in other vectors, and return name of the match

I need a tool more selective than %in% or match(). I need a code that matches a vector of string with another vector, and that returns the names of the matches.
Currently I have the following,
test <- c("country_A", "country_B", "country_C", "country_D", "country_E", "country_F")
rating_3 <- c("country_B", "country_D", "country_G", "country_K")
rating_4 <- c("country_C", "country_E", "country_M", "country_F")
i <- 1
while (i <= 33) {
print(i)
print(test[[i]])
if (grepl(test[[i]], rating_3) == TRUE) {
print(grepl(test[[i]], rating_3)) }
i <- i+1
}
This should check whether each element of test is present in rating_3, but for some reason it prints only the position, the string itself, and a warning:
[1]
[country_A]
There were 6 warnings (use warnings() to see them)
I need to know why this piece of code fails, but I'd eventually like it to return the name only when it is inside another vector and, if possible, to test it against several vectors at once and print the name of the vector in which it was found, something like
[1]
[String]
[rating_3]
How could I get something like that?
Without a reproducible example, it is hard to determine what exactly you need, but I think this could be done using %in%:
# create reprex
test <- sample(letters,10)
rating_3 <- sample(letters, 20)
print(rating_3[rating_3 %in% test])
# prints the elements of rating_3 that also occur in test
# (at most 10 letters here; the exact result depends on the random draw)
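As for why the original loop misbehaves: grepl(test[[i]], rating_3) returns one logical value per element of rating_3, so if() receives a vector of length 4 instead of a single TRUE or FALSE. That is the likely source of the warnings (newer versions of R treat it as an error). A rough sketch of the "which vector does each name belong to" part, using the data from the question; the `ratings` list and the names `found_in` and `where` are assumptions introduced here for illustration:

```r
test <- c("country_A", "country_B", "country_C",
          "country_D", "country_E", "country_F")

# Put the candidate vectors in a named list so their names can be reported
ratings <- list(
  rating_3 = c("country_B", "country_D", "country_G", "country_K"),
  rating_4 = c("country_C", "country_E", "country_M", "country_F")
)

# For each rating vector: the elements of test found in it
found_in <- lapply(ratings, function(r) intersect(test, r))

# For each element of test: the names of the vectors that contain it
where <- lapply(test, function(s) {
  names(ratings)[vapply(ratings, function(r) s %in% r, logical(1))]
})
names(where) <- test
```

Here `where$country_B` is "rating_3", while `where$country_A` is an empty character vector because that name appears in neither list.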

r match filenames in two folders and perform code

I have 2 folders with text files: Aba with 90 files and Baa with 50 files. I have a piece of code that opens the files with the same name from the two folders and performs an operation on them.
dna_no= read.table("/home/Documents/Baa/112.txt",skip=1, header=TRUE, sep="\t", fill=FALSE)
sim = read.table("/home/Documents/Data/Aba/112.txt",header=FALSE, sep="\t")
Then I want to run code on the contents of the files.
Select the rows from sim where the 1st column of dna_no matches the 1st column of sim:
sm_dna= sim[which(dna_no[,1]%in%sim[,1]),]
sim_nn17 = cbind(sm_dna[,1],sm_dna[,4:6])
etc.
Now I want to do this in one go: for every file in Baa, find the file with the same name in Aba and perform the operation above.
I'm using list.files:
filenames= list.files("/home/Documents/Baa/", full.names=TRUE)
file_sim= list.files("/home/Documents/Data/Aba/",full.names=TRUE)
ldf <- lapply(filenames, function(x) read.table(x,skip=1))
tcf <- lapply(file_sim, function(z) read.table(z,colClasses = c(rep("numeric", 6), rep("NULL", 1)),header=FALSE, sep="\t"))
So now I need to find the ldf[i] that corresponds to tcf[i], i.e. the files with the same name (e.g. 112; the file names are all numeric). I cannot figure out how to do this, as list.files does not seem to save the file names.
Then I want to run the code on each of the files.
myFun <- function(filenames){
same operation described above for each file:
sm_dna= ..
sim_nn17 =..
...}
I'm not sure how the code should change here either.
Would it be possible to do this without a loop?
The code works fine for individual files but not for a batch of files in a folder.
Many thanks for your help!
I think you really have two distinct questions here:
1. Find matching list items
2. Execute some operations on a series of files without a for-loop
The first thing is pretty simple. Here's a reproducible example, but you could just as well use two lists of filenames from calls to list.files:
# here are two random vectors of letters
set.seed(1)
vec1 <- letters[sample(1:26, 5)]
vec2 <- letters[sample(1:26, 15)]
# > vec1
# [1] "g" "j" "n" "u" "e"
# > vec2
# [1] "x" "z" "p" "o" "b" "e" "d" "n" "g" "s" "h" "k" "q" "u" "j"
# here are the matching ones
intersect(vec1, vec2)
# [1] "g" "j" "n" "u" "e"
The second thing is simple too: read in two files of the same name from different locations, perform some operations:
my_func <- function(filename) {
# get files with same name from two dirs
dna_no <- read.table(paste0('/home/Documents/Baa/', filename))
sim <- read.table(paste0('/home/Documents/Data/Aba/', filename))
# do other stuff...
}
Putting these together you can do something like:
filenames <- list.files("/home/Documents/Baa/")
file_sim <- list.files("/home/Documents/Data/Aba/")
lapply(intersect(filenames, file_sim), my_func)
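For a version that can be run end to end, here is a hedged sketch that builds two throwaway folders under tempdir() in place of Baa and Aba. The file contents, the helper name `process_pair`, and the row filter (keep the rows of sim whose first column appears in the first column of dna_no) are assumptions for illustration, not the asker's actual data or full processing.

```r
# Two throwaway folders standing in for Baa and Aba (hypothetical paths)
baa_dir <- file.path(tempdir(), "Baa")
aba_dir <- file.path(tempdir(), "Aba")
dir.create(baa_dir, showWarnings = FALSE)
dir.create(aba_dir, showWarnings = FALSE)

# Made-up tab-separated files; only 112.txt exists in both folders
writeLines(c("id", "1", "3"), file.path(baa_dir, "112.txt"))
writeLines(c("id", "2"), file.path(baa_dir, "200.txt"))
writeLines(c("1\ta", "2\tb", "3\tc"), file.path(aba_dir, "112.txt"))
writeLines(c("9\tz"), file.path(aba_dir, "300.txt"))

process_pair <- function(filename) {
  # read the two files that share a name
  dna_no <- read.table(file.path(baa_dir, filename), header = TRUE, sep = "\t")
  sim <- read.table(file.path(aba_dir, filename), header = FALSE, sep = "\t")
  # keep the rows of sim whose first column appears in dna_no's first column
  sim[sim[, 1] %in% dna_no[, 1], ]
}

common <- intersect(list.files(baa_dir), list.files(aba_dir))
results <- lapply(common, process_pair)
names(results) <- common  # record which file each result came from
```

Naming the result list by file name addresses the concern about losing track of which list element belongs to which file.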

Generating Multiple Subsets in R

I have a large sequence of bytes, and I would like to generate a list containing an arbitrary number of subsets of that sequence. I suspect I need to use one of the apply functions, but the trick is that I need to iterate over the vector of starting positions, not the sequence itself.
Here's an example of how I want it to work --
extrct_by_mod <- function(x, startpos, endpos, lrecl)
{
x[1:length(x) %% lrecl %in% startpos:endpos]
}
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5
list_one <- extrct_by_mod(x=tmp_seq, startpos=startpos[1], endpos=endpos[1], lrecl=lrecl)
list_two <- extrct_by_mod(x=tmp_seq, startpos=startpos[2], endpos=endpos[2], lrecl=lrecl)
what_i_want <- list(list_one, list_two)
Ideally, I'd like to be able to just add more values to startpos and endpos, and automatically generate more subsets to add to my list. Note that the subsets will not be the same length and, in some cases, not even the same type.
My datasets are fairly large, so something that scales well would be ideal. I realize that this could be done with a loop, but I understand that you generally want to avoid looping in R.
Thank you!
Saving some time by pre-calculating the modulo-selection index:
> cats <- 1:length(tmp_seq) %% lrecl
> mapply(function(start,end) { tmp_seq[cats %in% start:end]} , startpos, endpos)
[[1]]
[1] "a" "e" "f" "j" "k" "o" "p" "t" "u" "y"
[[2]]
[1] "b" "c" "d" "g" "h" "i" "l" "m" "n" "q" "r" "s" "v" "w" "x"
(It is not true that R's apply functions are any faster than equivalent loops.)
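One caveat worth sketching: mapply() simplifies its result to a matrix when every subset happens to have the same length, so if a list is always wanted, Map() (or mapply with SIMPLIFY = FALSE) is a safer choice. The snippet below reuses the data from the question; `subsets` is just an illustrative name.

```r
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5

# Pre-compute the position of each element within its record
cats <- seq_along(tmp_seq) %% lrecl

# Map() always returns a list, one element per start/end pair
subsets <- Map(function(start, end) tmp_seq[cats %in% start:end],
               startpos, endpos)
```

Adding more values to startpos and endpos yields more list elements without changing the rest of the code.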

Read text file in R and convert it to a character object

I'm reading a text file like this in R 2.10.0.
248585_at 250887_at 245638_s_at AFFX-BioC-5_at
248585_at 250887_at 264488_s_at 245638_s_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at
248585_at 250887_at
Using the command
clusters<-read.delim("test",sep="\t",fill=TRUE,header=FALSE)
Now, I must pass every row in this file to a BioConductor function that takes only character vectors as input.
My problem is that using as.character on this "clusters" object turns everything into numeric strings.
> clusters[1,]
V1 V2 V3 V4 V5 V6 V7
1 248585_at 250887_at 245638_s_at AFFX-BioC-5_at
But
> as.character(clusters[1,])
[1] "1" "1" "2" "3" "1" "1" "1"
Is there any way to keep the original names and put them into a character vector?
Maybe it helps: my "clusters" object returned by read.delim is of type "list".
Thanks a lot :-)
Federico
By default (in R versions before 4.0.0), character columns are converted to factors. You can avoid this by setting the as.is=TRUE argument:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE, as.is=TRUE)
If you only need to get the contents of the text file into character vectors, you could do something like:
x <- readLines("test")
xx <- strsplit(x,split="\t")
xx[[1]] # xx is a list
# [1] "248585_at" "250887_at" "245638_s_at" "AFFX-BioC-5_at"
I never would have expected that to happen, but trying a small test case produces the same results you're giving.
Since the result of df[1,] is itself a data.frame, one fix I thought to try was to use unlist -- seems to work:
> df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20], c=LETTERS[5:14])
> df[1,]
a b c
1 A K E
> as.character(df[1,])
[1] "1" "1" "1"
> as.character(unlist(df[2,]))
[1] "B" "L" "F"
I think turning the data.frame into a matrix first would also get around this:
m <- as.matrix(df)
> as.character(m[2,])
[1] "B" "L" "F"
To avoid issues with factors in your data.frame you might want to set stringsAsFactors=FALSE when reading in your data from the text file, e.g.:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE,
stringsAsFactors=FALSE)
And, after all that, the unexpected behavior seems to come from the fact that the original affy probes in your data.frame are treated as factors. So, doing the stringsAsFactors=FALSE thing will side-step the fanfare:
df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20],
c=LETTERS[5:14], stringsAsFactors=FALSE)
> as.character(df[1,])
[1] "A" "K" "E"
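If the data frame already contains factors and re-reading the file is not convenient, a row can still be turned into a character vector by converting column by column. This is a minimal sketch on the toy data frame from above, with stringsAsFactors = TRUE set explicitly so the factor case is reproduced regardless of R version; `row_chr` is an illustrative name.

```r
# Toy data frame with factor columns, as in the example above
df <- data.frame(a = LETTERS[1:10], b = LETTERS[11:20],
                 c = LETTERS[5:14], stringsAsFactors = TRUE)

# as.character() on a single factor returns its labels, so apply it per column
row_chr <- vapply(df[1, ], as.character, character(1))
unname(row_chr)
```

vapply() with character(1) also acts as a check that each column contributes exactly one string for the selected row.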
