I have two folders of text files: Aba with 90 files and Baa with 50 files. I have a piece of code where I open files with the same name from the two folders and perform an operation.
dna_no= read.table("/home/Documents/Baa/112.txt",skip=1, header=TRUE, sep="\t", fill=FALSE)
sim = read.table("/home/Documents/Data/Aba/112.txt",header=FALSE, sep="\t")
Then I want to run code on the contents of the files.
Select the rows of sim whose 1st column matches the 1st column of dna_no:
sm_dna= sim[which(dna_no[,1]%in%sim[,1]),]
sim_nn17 = cbind(sm_dna[,1], sm_dna[,4:6])
etc.
Now I want to do this in one go: for each file in Baa, find the file with the same name in Aba and perform the above operation.
I'm using list.files:
filenames= list.files("/home/Documents/Baa/", full.names=TRUE)
file_sim= list.files("/home/Documents/Data/Aba/",full.names=TRUE)
ldf <- lapply(filenames, function(x) read.table(x,skip=1))
tcf <- lapply(file_sim, function(z) read.table(z,colClasses = c(rep("numeric", 6), rep("NULL", 1)),header=FALSE, sep="\t"))
So now I need to find the ldf[[i]] that corresponds to tcf[[i]], i.e. the files with the same name (the file names are all numeric, e.g. 112), and I cannot figure out how to do it, as list.files does not seem to save the file names.
Then I need to perform the code for each pair of files.
myFun <- function(filenames){
same operation described above for each file:
sm_dna= ..
sim_nn17 =..
...}
I'm not sure how the code changes here either.
Would it be possible to do this without a loop?
The code works fine for separate files, but not for a batch of files in a folder.
Many thanks for your help!
I think you have two distinct questions here, really.
Find matching list items
Execute some operations on a series of files without a for-loop
The first is pretty simple. Here's a reproducible example, but you could use two vectors of filenames from calls to list.files (or anything else) here:
# here are two random vectors of letters
set.seed(1)
vec1 <- letters[sample(1:26, 5)]
vec2 <- letters[sample(1:26, 15)]
# > vec1
# [1] "g" "j" "n" "u" "e"
# > vec2
# [1] "x" "z" "p" "o" "b" "e" "d" "n" "g" "s" "h" "k" "q" "u" "j"
# here are the matching ones
intersect(vec1, vec2)
# [1] "g" "j" "n" "u" "e"
The second thing is simple too: read in two files of the same name from different locations, perform some operations:
my_func <- function(filename) {
# get files with same name from two dirs
dna_no <- read.table(paste0('/home/Documents/Baa/', filename))
sim <- read.table(paste0('/home/Documents/Data/Aba/', filename))
# do other stuff...
}
Putting these together you can do something like:
filenames <- list.files("/home/Documents/Baa/")
file_sim <- list.files("/home/Documents/Data/Aba/")
lapply(intersect(filenames, file_sim), my_func)
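One detail worth flagging in the question's subsetting step: which(dna_no[,1] %in% sim[,1]) produces row indices into dna_no, not sim. Here is a minimal, self-contained sketch of the row selection as described, using toy data frames with made-up column names:

```r
# toy stand-ins for the two files (column names are invented here)
dna_no <- data.frame(id = c(1, 3, 5), val = c("x", "y", "z"))
sim <- data.frame(id = 1:6, a = letters[1:6], b = 11:16, c = 21:26)

# rows of sim whose first column appears in the first column of dna_no
sm_dna <- sim[sim[, 1] %in% dna_no[, 1], ]
sm_dna$id
# [1] 1 3 5
```

Inside my_func you would apply the same line to the two data frames read from disk.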
Related
I need something more selective than %in% or match(): code that matches a vector of strings against another vector and returns the names of the matches.
Currently I have the following,
test <- c("country_A", "country_B", "country_C", "country_D", "country_E", "country_F")
rating_3 <- c("country_B", "country_D", "country_G", "country_K")
rating_4 <- c("country_C", "country_E", "country_M", "country_F")
i <- 1
while (i <= length(test)) {
print(i)
print(test[[i]])
if (grepl(test[[i]], rating_3) == TRUE) {
print(grepl(test[[i]], rating_3)) }
i <- i+1
}
This should check whether each element of test is present in rating_3, but for some reason it returns only the position, the name of the string, and a warning:
[1] 1
[1] "country_A"
There were 6 warnings (use warnings() to see them)
I need to know why this piece of code fails. Eventually I'd like it to return the name only when it's inside another vector and, if possible, to test against several vectors at once and print the name of the vector in which it fits, something like:
[1]
[String]
[rating_3]
How could I get something like that?
Without a reproducible example it is hard to determine exactly what you need, but I think this can be done using %in%:
# create reprex
set.seed(1)
test <- sample(letters, 10)
rating_3 <- sample(letters, 20)
print(rating_3[rating_3 %in% test])
# elements of rating_3 that also appear in test (at most 10, since test has 10 letters)
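To also test against several vectors at once and report which vector each element falls in, one way (the names below are just illustrative) is to put the rating vectors in a named list and look each element up:

```r
test <- c("country_A", "country_B", "country_C", "country_D")
ratings <- list(rating_3 = c("country_B", "country_D", "country_G"),
                rating_4 = c("country_C", "country_E", "country_F"))

# for each element of test, the names of the rating vectors that contain it
matches <- lapply(test, function(x) {
  names(ratings)[sapply(ratings, function(r) x %in% r)]
})
names(matches) <- test
matches
# $country_A is character(0), $country_B is "rating_3",
# $country_C is "rating_4", $country_D is "rating_3"
```

An element absent from every rating vector comes back as character(0), which is easy to test for with length().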
I'm trying to clean a column of data from a data frame with many gsub commands.
Some examples would be:
df$col1<-gsub("-00070", "-0070", df$col1)
df$col1<-gsub("-00063", "-0063",df$col1)
df$col1<-gsub("F4", "FA", df$col1)
...
Looking at the column after running these lines, it looks like some of the changes have taken effect, but some have not. Moreover, if I rerun the block of gsub() commands, more changes take effect each time I run it.
I'm very confused by this behavior, any information is appreciated.
There's probably a better way, but you could always use Map:
new <- 1:3
old <- letters[1:3]
to.change <- letters[1:10]
Map(function(x, y) to.change <<- gsub(x, y, to.change), old, new)
to.change
# [1] "1" "2" "3" "d" "e" "f" "g" "h" "i" "j"
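If the <<- assignment inside Map feels awkward, the same sequential substitution can be written as a fold with Reduce; this is just a stylistic alternative to the Map version above:

```r
new <- 1:3
old <- letters[1:3]
to.change <- letters[1:10]

# apply each (old, new) substitution in turn, threading the vector through
to.change <- Reduce(function(txt, i) gsub(old[i], new[i], txt),
                    seq_along(old), to.change)
to.change
# [1] "1" "2" "3" "d" "e" "f" "g" "h" "i" "j"
```

Reduce makes the left-to-right order of the substitutions explicit, which matters when one replacement could create text that a later pattern matches.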
I have a large sequence of bytes, and I would like to generate a list containing an arbitrary number of subsets of that sequence. I suspect I need to use one of the apply functions, but the trick is that I need to iterate over the vector of starting positions, not the sequence itself.
Here's an example of how I want it to work --
extrct_by_mod <- function(x, startpos, endpos, lrecl)
{
x[1:length(x) %% lrecl %in% startpos:endpos]
}
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5
list_one <- extrct_by_mod(x=tmp_seq, startpos=startpos[1], endpos=endpos[1], lrecl=lrecl)
list_two <- extrct_by_mod(x=tmp_seq, startpos=startpos[2], endpos=endpos[2], lrecl=lrecl)
what_i_want <- list(list_one, list_two)
Ideally, I'd like to be able to just add more values to startpos and endpos and automatically generate more subsets to add to my list. Note that the subsets will not be the same length and, in some cases, not even the same type.
My datasets are fairly large, so something that scales well would be ideal. I realize this could be done with a loop, but my understanding is that you generally want to avoid loops in R.
Thank you!
Saving some time by pre-calculating the modulo-selection index:
> cats <- 1:length(tmp_seq) %% lrecl
> mapply(function(start,end) { tmp_seq[cats %in% start:end]} , startpos, endpos)
[[1]]
[1] "a" "e" "f" "j" "k" "o" "p" "t" "u" "y"
[[2]]
[1] "b" "c" "d" "g" "h" "i" "l" "m" "n" "q" "r" "s" "v" "w" "x"
(It is not correct that R's apply functions are inherently faster than equivalent loops.)
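For what it's worth, a Map call does the same job and always returns a list, avoiding the surprise simplification mapply can perform when the subsets happen to have equal lengths:

```r
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5

# pre-compute the modulo classes once, then subset per (start, end) pair
cats <- seq_along(tmp_seq) %% lrecl
what_i_want <- Map(function(s, e) tmp_seq[cats %in% s:e], startpos, endpos)
# what_i_want[[1]] has 10 elements, what_i_want[[2]] has 15
```

Adding more entries to startpos and endpos then automatically produces more list elements.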
I am using the Rscript front end that comes preinstalled with R.
I want to call the following R script, named 'test.R', from the command prompt:
a <- c("a", "b", "c")
a
args <- commandArgs(TRUE)
b <- as.vector(args[1])
b
I use the following command:
RScript test.R c("d","e","f")
This creates following output:
[1] "a" "b" "c"
[1] "c(d,e,f)"
As you can see, the first (and only) argument is interpreted as a string and then converted to a one-element vector. How can the argument be interpreted as a vector?
Sidenote: of course the items of the vector could be separated into several arguments, but in my final project there will be more than one vector argument, and implementing something like this is my last resort:
RScript test.R "d" "e" "f" END_OF_VECTOR_1 "g" "h" "i" END_OF_VECTOR_2 "j" "k" "l"
You could use comma-separated lists.
On the command line:
RScript test.R a,b,c d,e,f g,h,i j
In your code:
args <- commandArgs(TRUE)
vargs <- strsplit(args, ",")
vargs
# [[1]]
# [1] "a" "b" "c"
#
# [[2]]
# [1] "d" "e" "f"
#
# [[3]]
# [1] "g" "h" "i"
#
# [[4]]
# [1] "j"
You can clean up something like a = "c(1,2,3)" with:
as.numeric(unlist(strsplit(substr(a, 3, nchar(a) - 1), ",")))
which works fine as long as the script is always passed a properly formatted string and you are aware of the expected format. Better, though: leave out the c() part in the RScript arguments entirely and simply pass a quoted comma-separated list.
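If some of those comma-separated arguments are meant to be numeric vectors, you can convert them right after splitting. The args vector below is just a stand-in for what commandArgs(TRUE) would return:

```r
args <- c("1,2,3", "4.5,6")            # stand-in for commandArgs(TRUE)

# split each argument on commas, then coerce each piece to numeric
vargs <- lapply(strsplit(args, ","), as.numeric)
vargs
# [[1]] 1 2 3
# [[2]] 4.5 6.0
```

Non-numeric pieces would come back as NA (with a warning), which is a handy signal that an argument was malformed.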
@flodel thanks for your answer; that's one way to do it. Besides that, I have found a workaround for my problem.
The following code in the file 'test.R' stores the arguments in args. The text in args is then evaluated as ordinary R expressions, and a and b are printed as output.
args <- commandArgs(TRUE)
eval(parse(text=args))
a
b
The code can be called in the command prompt as follows:
RScript test.R a=5 b=as.vector(c('foo', 'bar'))
I'm reading a text file like this in R 2.10.0.
248585_at 250887_at 245638_s_at AFFX-BioC-5_at
248585_at 250887_at 264488_s_at 245638_s_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at
248585_at 250887_at
Using the command
clusters<-read.delim("test",sep="\t",fill=TRUE,header=FALSE)
Now I must pass every row of this file to a Bioconductor function that takes only character vectors as input.
My problem is that using as.character on this "clusters" object turns everything into numeric strings.
> clusters[1,]
V1 V2 V3 V4 V5 V6 V7
1 248585_at 250887_at 245638_s_at AFFX-BioC-5_at
But
> as.character(clusters[1,])
[1] "1" "1" "2" "3" "1" "1" "1"
Is there any way to keep the original names and put them into a character vector?
Maybe it helps: the "clusters" object returned by read.delim is of type "list" (it is a data.frame).
Thanks a lot :-)
Federico
By default, character columns are converted to factors. You can avoid this by setting the as.is=TRUE argument:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE, as.is=TRUE)
If you only need to pass the contents of the text file to a character vector, you could do something like:
x <- readLines("test")
xx <- strsplit(x,split="\t")
xx[[1]] # xx is a list
# [1] "248585_at" "250887_at" "245638_s_at" "AFFX-BioC-5_at"
I never would have expected that to happen, but trying a small test case produces the same results you're giving.
Since the result of df[1,] is itself a data.frame, one fix I thought to try was to use unlist -- seems to work:
> df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20], c=LETTERS[5:14])
> df[1,]
a b c
1 A K E
> as.character(df[1,])
[1] "1" "1" "1"
> as.character(unlist(df[2,]))
[1] "B" "L" "F"
I think turning the data.frame into a matrix first would also get around this:
m <- as.matrix(df)
> as.character(m[2,])
[1] "B" "L" "F"
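One caveat with the matrix route, in case your real data mixes types: as.matrix coerces the whole data.frame to character whenever any column is non-numeric (harmless here, but good to know):

```r
df <- data.frame(a = LETTERS[1:3], b = 1:3)
m <- as.matrix(df)   # everything becomes character
m[2, ]
#   a   b
# "B" "2"
```

So the matrix approach works for extracting character rows, but you lose the per-column types in the process.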
To avoid issues with factors in your data.frame you might want to set stringsAsFactors=FALSE when reading in your data from the text file, eg:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE,
stringsAsFactors=FALSE)
And, after all that, the unexpected behavior comes from the fact that the original affy probes in your data.frame are treated as factors. So, using stringsAsFactors=FALSE will side-step the fanfare:
df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20],
c=LETTERS[5:14], stringsAsFactors=FALSE)
> as.character(df[1,])
[1] "A" "K" "E"