Apply a particular function to all files in a folder using R

I have developed an R function named DNAdupstability for some biological analysis. It takes a FASTA file (.fasta/.txt) as input and returns a dataframe in this format:
Sequence Position8 Position9 Position10 Position11 Position12 Position13
1 1 -1.473571 -1.473571 -1.462143 -1.412143 -1.412143 -1.371429
Position14 Position15 Position16 Position17 Position18 Position19 Position20
1 -1.372143 -1.4 -1.428571 -1.439286 -1.430714 -1.420714 -1.397143
This is a sample dataframe; it continues to n positions depending on the input sequence. I have a folder named Random_fasta containing 1333 fasta sequences of equal length but different content. DNAdupstability gives the desired outcome (the dataframe above) for a single fasta sequence from Random_fasta, but now I want to run the same function on the other 1332 sequences and form a combined dataframe in this format for all the sequences:
Sequence Position8 Position9 Position10 Position11 Position12 Position13
1 1 -1.434286 -1.434286 -1.446429 -1.435714 -1.445714 -1.509286
2 2 -1.522143 -1.492143 -1.463571 -1.435714 -1.492857 -1.544286
3 3 -1.232857 -1.265000 -1.333571 -1.328571 -1.330000 -1.329286
4 4 -1.799286 -1.799286 -1.799286 -1.799286 -1.730714 -1.735714
5 5 -1.547143 -1.507143 -1.535714 -1.530714 -1.478571 -1.450714
Position14 Position15 Position16 Position17 Position18 Position19 Position20
1 -1.452143 -1.402143 -1.390000 -1.457143 -1.509286 -1.498571 -1.458571
2 -1.544286 -1.544286 -1.544286 -1.544286 -1.601429 -1.715000 -1.755000
3 -1.340000 -1.328571 -1.333571 -1.344286 -1.384286 -1.446429 -1.486429
4 -1.667143 -1.605000 -1.536429 -1.486429 -1.536429 -1.605000 -1.600000
5 -1.450714 -1.450714 -1.412143 -1.372143 -1.434286 -1.531429 -1.615000
This would let me calculate the position-wise mean, which will then be used for visualization with ggplot2. Is there a way to apply the same function to all the files in the folder using R and get the desired combined dataframe? Any help will be greatly appreciated!

One option is to recursively list all the files in the main folder with list.files, then apply the custom function by looping over the files with lapply, and combine the results into a single data.frame with do.call(rbind, ...):
files <- list.files('path/to/your/folder', recursive = TRUE,
                    pattern = "\\.txt$", full.names = TRUE)
lst1 <- lapply(files, DNAdupstability)
out <- do.call(rbind, lst1)
Or we can use map_dfr from purrr, which maps over the files and row-binds all the outputs into a single data.frame:
library(purrr)
out <- map_dfr(files, DNAdupstability)
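Either way, the Sequence column will repeat across files, since each file is numbered independently. A small self-contained sketch of the rbind pattern, using a dummy stand-in for DNAdupstability (the real function isn't shown in the question) and temporary files, shows one way to renumber the sequences and keep the source file name:

```r
# Dummy stand-in for DNAdupstability, for illustration only --
# swap in your real function.
DNAdupstability <- function(file) {
  data.frame(Sequence = 1, Position8 = -1.47, Position9 = -1.46)
}

# Three dummy fasta files in a temporary folder
dir.create(tmp <- tempfile())
for (f in c("seq1.txt", "seq2.txt", "seq3.txt")) {
  writeLines(c(">seq", "ACGTACGT"), file.path(tmp, f))
}

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
lst1  <- lapply(files, DNAdupstability)
out   <- do.call(rbind, lst1)
out$Sequence <- seq_len(nrow(out))   # renumber 1..n across all files
out$file     <- basename(files)      # optional: remember the source file
```

The last line assumes one row per file, as in the question; if a file can yield several rows, attach the file name inside the lapply instead.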


Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images, and return only sets (ie: r01c02f10p01) which have all four ch values (ch1-4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
I then thought about applying these lists with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove the all images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter files from a random assortment without complete ch values (ie: maybe only ch1 and ch2, or ch3 and ch4, or ch1, ch2, ch3, and ch4), to an assortment which only contains files which have a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of CH values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"), regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked even without the mutate line where I converted ch to numeric, i.e. comparing the character versions of those numbers to their numeric versions, because under the hood %in% converts to matching types. That didn't seem totally safe if you needed to scale this, so I converted them explicitly to have matching types.
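For completeness, the same filtering can be done in base R with split and Filter, which may be handy if you'd rather avoid the dplyr/tidyr dependency. A self-contained sketch on a subset of the filenames:

```r
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff",
           "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff",
           "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff")

group <- sub("-ch\\d\\.tiff$", "", files)                 # part before "-ch<n>"
ch    <- as.integer(sub(".*-ch(\\d)\\.tiff$", "\\1", files))

# keep only files whose group has all of ch1-ch4
complete <- names(Filter(function(x) all(1:4 %in% x), split(ch, group)))
files_to_keep <- files[group %in% complete]
files_to_keep
# [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
# [4] "r01c01f11p01-ch4.tiff"
```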

Need to use jsonlite to handle ndjson message list using stream_in() and stream_out()

I have an ndjson data source. For a simple example, consider a text file with three lines, each containing a valid json message. I want to extract 7 variables from the messages and put them in a dataframe.
Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt"
{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}
The following three lines of code accomplish what I want to do:
file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))
For reasons I won't get into, I cannot use the ndjson package. I must find a way to use the jsonlite package to do the same thing using the stream_in() and stream_out() functions. Here's what I tried:
con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df) {
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)
This is not giving me the same data frame as a final output. Can you tell me what I'm doing wrong and what I have to change to make raw_df_temp1 equal raw_df_temp2?
I could potentially solve this with the fromJSON() function operating on each line of the file, but I'd like to find a way to do it with the stream functions. The files I will be dealing with are quite large, so efficiency will be key. I need this to be as fast as possible.
Thank you in advance.
Currently under ct you'll find a string that can (subsequently) be fed to fromJSON independently, but it will not be parsed as such. Ignoring your stream_out(stream_in(...),...) test, here are a couple of ways to read it in:
library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
Calling fromJSON on each string might be cumbersome; with larger data, that slow-down is exactly why stream_in exists. So if we can capture the "ct" component into a stream of its own, then ...
writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')
(There are far-more-efficient ways to do this with non-R tools, including perhaps a simple
sed -e 's/.*"ct":"\({.*\}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy.ndjson2
though this makes a few assumptions about the data that may not be perfectly safe. A better solution would be to use jq, which should "always" correctly-parse proper json, then a quick sed to replace escaped quotes:
jq '.ct' ds_guy.ndjson | sed -e 's/\\"/"/g' > ds_guy2.ndjson
and you can do that with system(...) in R if needed.)
From there, under the assumption that each line will contain exactly one row of data.frame data:
json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
NB: in the first example, "ts" is a factor, all others are character because that's what fromJSON gives. In the second example, all strings are factor. This can easily be addressed through judicious use of stringsAsFactors=FALSE, depending on your needs.
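As for the original stream_in/stream_out round trip: another option is to parse the embedded "ct" strings inside the stream_in handler itself, so the nested JSON never has to be written back out to disk. A sketch, assuming the same escaped-JSON structure as the sample data (shortened to two variables here):

```r
library(jsonlite)

# Recreate a two-line sample file so the sketch is self-contained
writeLines(c(
  '{"ts":"1","ct":"{\\"Var1\\":6,\\"Var2\\":-70}"}',
  '{"ts":"2","ct":"{\\"Var1\\":7,\\"Var2\\":-68}"}'
), tmp <- tempfile())

rows <- list()
stream_in(file(tmp), pagesize = 1, handler = function(df) {
  # df$ct is a character column of embedded JSON: parse each element
  parsed <- do.call(rbind, lapply(df$ct, function(s) as.data.frame(fromJSON(s))))
  rows[[length(rows) + 1]] <<- cbind(ts = df$ts, parsed)
})
out <- do.call(rbind, rows)
```

This keeps everything in one streaming pass, at the cost of one fromJSON call per record, so it trades the intermediate file for per-record parsing overhead.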

Change column in dataframe based on regex in R

I have a large dataframe with a column displaying different profiles:
PROFILE         NTHREADS TIME
profAsuffix     1        3.12
profAanother    2        1.9
profAyetanother 3
...
profBsuffix     1        4.1
profBanother    1        3.9
...
I want to rename all profA* patterns, combining them into one name (profA), and do the same with profB*. Until now, I have done it like this:
data$PROFILE <- as.factor(data$PROFILE)
levels(data$PROFILE)[levels(data$PROFILE)=="profAsuffix"] <- "profA"
levels(data$PROFILE)[levels(data$PROFILE)=="profAanother"] <- "profA"
levels(data$PROFILE)[levels(data$PROFILE)=="profAyetanother"] <- "profA"
And so on. But this time I have too many different suffixes, so I wonder if I can use grepl or a similar approach to do the same thing.
We can use sub:
data$PROFILE <- sub("^([a-z]+[A-B]).*", "\\1", data$PROFILE)
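A quick self-contained check of that pattern on the profiles from the question (the regex captures the lowercase prefix plus the single capital letter and drops everything after it):

```r
PROFILE <- c("profAsuffix", "profAanother", "profAyetanother",
             "profBsuffix", "profBanother")
sub("^([a-z]+[A-B]).*", "\\1", PROFILE)
# [1] "profA" "profA" "profA" "profB" "profB"
```

If the profile letters go beyond B, widen the character class accordingly (e.g. [A-Z]).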

Replacing for loop with apply function for lm with a fixed reference colum [duplicate]

This question already has an answer here:
Fitting a linear model with multiple LHS
(1 answer)
Closed 6 years ago.
I'm under the pump by coworkers to stop using for loops so much, but I'm not great with apply functions either.
What I need to do is to regress multiple companies against a fixed reference value, which I can achieve easily with a for loop, but not so much using the apply family.
My data and for loop look like:
Date AANRI AGLRI APARI ASTRI ASXRI DUERI ENVRI GASRI HDFRI SKIRI
1: 2006-01-06 504.86 26443.30 255.75 101.15 28050.84 108.77 247.71 169.61 99.03 100.00
2: 2006-01-13 498.86 26618.78 252.21 100.00 28324.59 110.70 251.43 171.67 99.18 103.36
3: 2006-01-20 492.41 27734.33 255.67 100.38 28436.87 110.41 247.41 169.61 98.92 101.68
4: 2006-01-27 498.86 28850.82 264.88 99.23 28815.26 111.90 246.70 173.74 98.26 99.16
5: 2006-02-03 497.48 28164.16 265.79 100.38 28614.28 111.16 244.88 170.98 99.64 97.48
6: 2006-02-10 500.71 28104.86 262.23 101.54 28567.93 112.21 248.63 173.05 99.38 98.32
And my for loop:
reg1_store <- list()
for (i in names(RI_c)[!grepl("ASX|Date", names(RI_c))]) {
  reg1_store[[i]] <- lm(get(i) ~ ASXRI, data = RI_c)
}
This works fine, I am able to regress the separate companies on the ASX and store them accordingly.
I am wondering how I can replicate this with an apply function?
@zhequan-li offers a very efficient solution. If efficiency is not a consideration and you want the results in a list, you can use lapply. The main idea is to give lapply a vector of tickers (the companies for the left-hand side), paste each ticker into a character string of the form "X ~ ASXRI", then call lm on that formula.
tickers <- names(RI_c)[!grepl("ASX|Date", names(RI_c))]
reg1_store <- lapply(tickers, function(x) {
  lm(paste(x, "~ ASXRI"), RI_c)
})
# To name the elements of your list
names(reg1_store) <- tickers
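Since RI_c isn't reproducible here, the same pattern can be sketched with mtcars standing in for the data, with wt playing the role of the fixed reference column ASXRI:

```r
# mtcars stands in for RI_c; "wt" plays the role of ASXRI
tickers <- c("mpg", "hp")

fits <- lapply(tickers, function(x) {
  lm(as.formula(paste(x, "~ wt")), data = mtcars)
})
names(fits) <- tickers

# slope of each response on the fixed reference column
sapply(fits, function(m) coef(m)[["wt"]])
```

lm also accepts the character string directly, as in the answer above; wrapping it in as.formula is just slightly more explicit about the coercion.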

Extracting outputs from lapply to a dataframe

I have some R code which performs some data extraction operation on all files in the current directory, using the following code:
files <- list.files(".", pattern="*.tts")
results <- lapply(files, data_for_time, "17/06/2006 12:00:00")
The output from lapply is the following (extracted using dput()) - basically a list full of vectors:
list(c("amer", "14.5"), c("appl", "14.2"), c("brec", "13.1"),
c("camb", "13.5"), c("camo", "30.1"), c("cari", "13.8"),
c("chio", "21.1"), c("dung", "9.4"), c("east", "11.8"), c("exmo",
"12.1"), c("farb", "14.7"), c("hard", "15.6"), c("herm",
"24.3"), c("hero", "13.3"), c("hert", "11.8"), c("hung",
"26"), c("lizr", "14"), c("maid", "30.4"), c("mart", "8.8"
), c("newb", "14.7"), c("newl", "14.3"), c("oxfr", "13.9"
), c("padt", "10.3"), c("pbil", "13.6"), c("pmtg", "11.1"
), c("pmth", "11.7"), c("pool", "14.6"), c("prae", "11.9"
), c("ral2", "12.2"), c("sano", "15.3"), c("scil", "36.2"
), c("sham", "12.9"), c("stra", "30.9"), c("stro", "14.7"
), c("taut", "13.7"), c("tedd", "22.3"), c("wari", "12.7"
), c("weiw", "13.6"), c("weyb", "8.4"))
However, I would like to then deal with this output as a dataframe with two columns: one for the alphabetic code ("amer", "appl" etc) and one for the number (14.5, 14.2 etc).
Unfortunately, as.data.frame doesn't seem to work with this input of nested vectors inside a list. How should I go about converting this? Do I need to change the way that my function data_for_time returns its values? At the moment it just returns c(name, value). Or is there a nice way to convert from this sort of output to a dataframe?
Try this if results were your list:
> as.data.frame(do.call(rbind, results))
V1 V2
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
...
One option might be to use the ldply function from the plyr package, which will stitch things back into a data frame for you.
A trivial example of its use:
library(plyr)
ldply(1:10, .fun = function(x) c(runif(1), "a"))
V1 V2
1 0.406373084755614 a
2 0.456838687881827 a
3 0.681300171650946 a
4 0.294320539338514 a
5 0.811559669673443 a
6 0.340881009353325 a
7 0.134072444401681 a
8 0.00850683846510947 a
9 0.326008745934814 a
10 0.90791508089751 a
But note that if you're mixing variable types with c(), you probably will want to alter your function to return simply data.frame(name= name,value = value) instead of c(name,value). Otherwise everything will be coerced to character (as it is in my example above).
inp <- list(c("amer", "14.5"), c("appl", "14.2"), .... # did not see need to copy all
data.frame( first= sapply( inp, "[", 1),
second =as.numeric( sapply( inp, "[", 2) ) )
first second
1 amer 14.5
2 appl 14.2
3 brec 13.1
4 camb 13.5
5 camo 30.1
6 cari 13.8
snipped output
Because the other answers took the response I was in the process of giving, and Joran took the only other reasonable response I could think of, and since I'm supposed to be writing a paper, here's a ridiculous answer:
#I named your list LIST
LIST2 <- LIST[[1]]
lapply(2:length(LIST), function(i) {LIST2 <<- rbind(LIST2, LIST[[i]])})
data.frame(LIST2)
