Automate script for many genes with different values - r

I would like my R script to run automatically over many sets of parameters. For example:
gene_name start_x end_y
file1 -> gene1 100 200
file2-> gene2 150 270
My script does a trivial job, just for learning purposes. It should take the information about gene1, compute a sum and write it into a file; then take the information for the next gene, gene2, compute the sum and write it into a new file, and so on. I would like to name the files after the genes:
file_gene1.txt # this file holds sum of start_x +end_y for gene1
file_gene2.txt # this file holds sum of start_x +end_y for gene2
etc. for the rest of the 700 genes (obviously it is too much work to take each file manually, type the file name, and plug the start and end values into the already existing script).
I guess the idea is clear. I have never done this kind of thing, and I guess it is very trivial, but I would appreciate it if anyone could tell me the proper name for this process so I can search and learn online how to do it.
P.S: I think in Python I would just make a list of genes and related x/y values, loop and select required info, but I still don't know how I would keep gene names as a file name automatically.
EDIT:
I have to supply the info about a gene location, therefore start and end, which is X and Y respectively.
x=100 # assign x to a value of a related gene
y=150 # assign y to a value of a related gene
a=tbl[which(tbl[,'middle']>=x & tbl[,'middle']<y),] # for each new gene this info changes accordingly
write.table( a, file= 'gene1.txt' ) # here I would need to change the file name
my thoughts:
Maybe I need to generate a file that contains all 700 gene names and the related X and Y values.
Then I read line one of this file and supply it to my script (for the variables a, x and y).
When my computation is over, I write the results into a file named after the gene that was used to generate them.
Is it clearer now?
P.S.: I Googled it, but probably because I don't know the name of the topic I can't find anything relevant. Just give me an idea of what to search for; I would like to learn this programming step anyway.

I guess you are looking for a way to read all the files present in a folder (assuming all your gene files were written to a single folder by your older script). In that case you can use something like:
directory <- "C:/User/Downloads/R/data"
file <- list.files(directory, full.names = TRUE)
Then access filename using file[i] and do whatever needed (naming the file paste("gene", file[i], sep = "_") or reading it read.csv(file[i])).

I would divide your problem in two parts. (Sample data for reproducible example provided below)
library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation
1st: Apply your functions to your data by gene
output <- dt[ , .( f1 = sum(start_x, end_y),
f2 = start_x - end_y ,
f3 = start_x * end_y ,
f7 = start_x / end_y),
by=.(gene)]
2nd: Split your data frame by gene and save it in separate files
output[,fwrite(.SD,file=sprintf("%s.csv", unique(gene))),
by=.(gene)]
Later on, you can bind the multiple files into one single data frame if you like:
# Get a list of all `.csv` files in your folder
# (pattern is a regular expression, not a shell glob)
filenames <- list.files("C:/your/folder", pattern="\\.csv$", full.names=TRUE)
# Load and bind all data sets
data <- rbindlist(lapply(filenames,fread))
ps. note that fwrite is still in development version of data.table as of today (12 May 2016)
data for reproducible example:
dt <- data.table( gene = c('id1','id2','id3','id4','id5','id6','id7','id8','id9','id10'),
start_x = c(1:10),
end_y = c(20:29) )
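The same per-gene loop can also be sketched in base R, closer to the script in the question. This is only an illustration under assumptions: the parameter table `genes`, the toy `tbl`, and the output folder `out_dir` are made up here; in practice `genes` would be read from a file with read.table().

```r
# Toy stand-ins (assumptions): one row per gene with its start/end,
# and a table `tbl` with a 'middle' column as in the question.
genes <- data.frame(gene_name = c("gene1", "gene2"),
                    start_x   = c(100, 150),
                    end_y     = c(200, 270),
                    stringsAsFactors = FALSE)
tbl <- data.frame(middle = c(90, 120, 160, 210, 260),
                  value  = 1:5)

out_dir <- file.path(tempdir(), "gene_output")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_len(nrow(genes))) {
  x <- genes$start_x[i]
  y <- genes$end_y[i]
  # Rows of tbl whose 'middle' falls in [x, y)
  a <- tbl[tbl$middle >= x & tbl$middle < y, ]
  # Build the output file name from the gene name
  out_file <- file.path(out_dir, paste0("file_", genes$gene_name[i], ".txt"))
  write.table(a, file = out_file, quote = FALSE, row.names = FALSE)
}

written <- list.files(out_dir)
```

The key step is paste0(), which turns each gene name into a file name, so nothing has to be typed by hand for the 700 genes.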

Related

How to merge many databases in R?

I have this huge database from a telescope at the institute where I currently am working, this telescope saves every single day in a file, it takes values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for one single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file and, .sn1, .sn2, .sn3...... for days saved in multiple files, all the files have the same number of rows and variables, besides that there are 2 formats of databases, one has a sort of a header and it uses the first 5 lines of the file, the other one doesn't have it.
Every month has its own folder containing its days, and these folders are stored under the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than what I've used before and also the reason why I can't provide a simple example) and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = "\\.sn1$")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  return(dat) # Return the table itself; without this the function returns only fname
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
ptrns <- paste(sapply(1:5, function(q) paste(".sn",q,sep = "")), collapse = "|")
inter <- dir(z, pattern = ptrns)
return(paste(z,inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
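If the list of folders from Excel is not at hand, base R can walk the year/month folders itself. The following is a minimal sketch under assumptions: the folder `root`, the toy files, and the helper `read_one` are invented for illustration, and the toy files have no header lines.

```r
# Toy directory tree standing in for the year/month folders (assumption)
root <- file.path(tempdir(), "telescope")
dir.create(file.path(root, "2015", "01"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2015", "02"), recursive = TRUE, showWarnings = FALSE)
writeLines("1 2 3", file.path(root, "2015", "01", "day01.sn1"))
writeLines("4 5 6", file.path(root, "2015", "01", "day01.sn2"))
writeLines("7 8 9", file.path(root, "2015", "02", "day01.sn1"))
writeLines("notes", file.path(root, "2015", "02", "readme.txt"))

# recursive = TRUE descends into sub-folders; the regular expression keeps
# only names ending in ".sn" followed by digits
sn_files <- list.files(root, pattern = "\\.sn[0-9]+$",
                       recursive = TRUE, full.names = TRUE)

# Read one file and remember where each row came from
read_one <- function(fname) {
  dat <- read.table(fname, header = FALSE)
  dat$file_name <- fname
  dat
}
all_data <- do.call(rbind, lapply(sn_files, read_one))
```

For thousands of real files, data.table's fread() and rbindlist() as used above would likely be faster than read.table() and rbind(); the file discovery step stays the same.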

Populating a matrix (or a DF or a DT) with a loop from a folder containing txt files

I wrote my first code in R for treating some spectra [basically .txt files with a Xcol (wavelength) and Ycol (intensity)].
The code works for single files, provided I write the file name in the code. Here the code working for the first file HKU47_PSG_1_LW_0.txt.
setwd("C:/Users/dd16722/R/Raman/Data")
# import Spectra
PSG1_LW<-read.table("HKU47_PSG_1_LW_0.txt")
colnames(PSG1_LW)[colnames(PSG1_LW)=="V2"] <- "PSG1_LW"
PSG2_LW<-read.table("HKU47_PSG_2_LW_all_0.txt")
colnames(PSG2_LW)[colnames(PSG2_LW)=="V2"] <- "PSG2_LW"
#Plot 2 spectra and define the Y range
plot(PSG1_LW$V1, PSG1_LW$PSG1_LW, type="l",xaxs="i", yaxs="i", main="Raman spectra", xlab="Raman shift (cm-1)", ylab="Intensity", ylim=range(PSG1_LW,PSG2_LW))
lines(PSG2_LW$V1, PSG2_LW$PSG2_LW, col=("red"), yaxs="i")
# Temperature-excitation line correction
laser = 532
PSG1_LW_corr <- PSG1_LW$PSG1_LW*((10^7/laser)^3*(1-exp(-6.62607*10^(-34)*29979245800*PSG1_LW$V1/(1.3806488*10^(-23)*293.15)))*PSG1_LW$V1/((10^7/laser)-PSG1_LW$V1)^4)
PSG1_Raw_Corr <-cbind (PSG1_LW,PSG1_LW_corr)
lines(PSG1_LW$V1, PSG1_LW_corr, col="red")
plot(PSG1_LW$V1, PSG1_Raw_Corr$PSG1_LW_corr, type="l",xaxs="i", yaxs="i", xlab="Raman shift (cm-1)", ylab="Intensity")
Now, it's time for another little step forward. In the folder, there are many spectra (in the code above I reported the second one: HKU47_PSG_2_LW_all_0.txt) having again 2 columns, same length of the first file. I suppose I should merge all the files in a matrix (or DF or DT).
Probably I need a loop as I need a code able to check automatically the number of files contained in the folder and ultimately to create an object with several columns (i.e. the double of the number of the files).
So I started like this:
listLW <- list.files(path = ".", pattern = "LW")
numLW <- as.integer(length(listLW))
numLW represents the number of iterations I need to set. The question is: how can I populate a matrix (or DF or DT) in order to have in the first 2 columns the first txt file in my folder, then the second file in the 3rd and 4th columns etc? Considering that I need to perform some other operations as I showed above in the code.
I have been reading about loops in R since yesterday but could not find the best and easiest solution.
Thanks!
You could do something like
# Load data.table library
require(data.table)
# Import the first file
DT_final <- fread(file = listLW[1])
# Loop over the rest of the files and use cbind to merge them into 1 DT
for(file in setdiff(listLW, listLW[1])) {
DT_temp <- fread(file)
DT_final <- cbind(DT_final, DT_temp)
}
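A variant without an explicit loop, sketched in base R under assumptions: the folder `spec_dir` and the two toy spectra are invented, and the column suffixes `_shift` and `_intensity` are arbitrary names chosen here so that each pair of columns can be traced back to its file.

```r
# Two toy spectra (assumption): column 1 = Raman shift, column 2 = intensity
spec_dir <- file.path(tempdir(), "spectra")
dir.create(spec_dir, showWarnings = FALSE)
write.table(data.frame(c(100, 200, 300), c(1, 2, 3)),
            file.path(spec_dir, "S1_LW_0.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(100, 200, 300), c(4, 5, 6)),
            file.path(spec_dir, "S2_LW_0.txt"),
            row.names = FALSE, col.names = FALSE)

listLW <- list.files(spec_dir, pattern = "LW", full.names = TRUE)

# Read each file and name its two columns after the file
spectra <- lapply(listLW, function(f) {
  d <- read.table(f)
  id <- sub("\\.txt$", "", basename(f))
  names(d) <- paste0(id, c("_shift", "_intensity"))
  d
})

# Bind all pairs of columns side by side: 2 columns per file
wide <- do.call(cbind, spectra)
```

Keeping the list `spectra` around is also convenient for the later steps, since a correction such as the temperature-excitation formula can be applied to every element with another lapply() call.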

R - combining lines from multiple CSV into a data frame

I have a folder with hundreds of CSV files each containing data for a particular postal code.
Each CSV files contains two columns and thousands of rows. Descriptors are in Column A, values are in Column B.
I need to extract two pieces of information from each file and create a new table or dataframe using the values in [Column A, Row 2] (which is the postal code) and [Column B, Row 1585] (which is the median income).
The end result should be a table/dataframe with two columns: one for postal code, the other for median income.
Any help or advice would be appreciated.
Disclaimer: this question is pretty vague. Next time, be sure to add a reproducible example that we can run on our machines. It will help you, the people answering your questions, and future users.
You might try something like:
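# full.names = TRUE returns paths that read.csv can open directly
files = list.files("~/Directory", full.names = TRUE)
my_df = NULL
for(i in seq_along(files)){
  # skip = n drops the first n lines, so the next line read is row n + 1
  row2 = read.csv(files[i], header = FALSE, skip = 1, nrows = 1, col.names = c("A", "B"))
  row1585 = read.csv(files[i], header = FALSE, skip = 1584, nrows = 1, col.names = c("A", "B"))
  # postal code from [Column A, Row 2], median income from [Column B, Row 1585]
  my_df = rbind(my_df, data.frame(A = row2$A, B = row1585$B))
}
my_df = my_df[,c("A","B")]
# Note on interpreting indexing syntax:
# Read this as "my_df is now (=) my_df such that ([) the columns (,)
# are only A and B (c("A", "B"))"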
You can use the list.files function to get the paths of all your files and then use read.csv and rbind in a for loop to create one data.frame.
Something like this:
direct<-list.files("directory_to_your_files", full.names = TRUE)
df<-NULL
for(i in seq_along(direct)){
  df<-rbind(df,read.csv(direct[i]))
}
So here is the code which does what I want it to do. If there are more elegant solutions, please feel free to point them out.
# set the working directory to where the data files are stored
setwd("/foo")
# count the files
files = list.files("/foo")
#create an empty dataframe and name the columns
dataMatrix=data.frame(matrix(c(rep(NA,times=2*length(files))),nrow=length(files)))
colnames(dataMatrix)=c("Postal Code", "Median Income")
# create a for loop to get the information in R2/C1 and R1585/C2 of each data file
# Data in R2/C1 is a string, but is interpreted as a number unless specifically declared a string
for(i in 1:length(files)) {
getData = read.csv(files[i],header=F)
dataMatrix[i,1]=toString(getData[2,1])
dataMatrix[i,2]=(getData[1585,2])
}
Thank you to all those who helped me figure this out, especially Nancy.
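One possibly more compact variant replaces the pre-allocated data frame with lapply() and a single rbind. This is a sketch under assumptions: the folder `csv_dir`, the generated toy files, and the helpers `make_file` and `extract_one` are invented for illustration.

```r
# Toy CSV files (assumption): 1585 rows, postal code in row 2 / column 1,
# median income in row 1585 / column 2
csv_dir <- file.path(tempdir(), "postal")
dir.create(csv_dir, showWarnings = FALSE)
make_file <- function(code, income) {
  d <- data.frame(A = paste0("desc", 1:1585), B = 0, stringsAsFactors = FALSE)
  d$A[2] <- code
  d$B[1585] <- income
  write.table(d, file.path(csv_dir, paste0(code, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}
make_file("A1A", 50000)
make_file("B2B", 62000)

files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Extract the two cells from one file as a one-row data frame
extract_one <- function(f) {
  d <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)
  data.frame(postal_code = as.character(d[2, 1]),
             median_income = d[1585, 2],
             stringsAsFactors = FALSE)
}
result <- do.call(rbind, lapply(files, extract_one))
```

Because full.names = TRUE returns complete paths, this version does not depend on setwd().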

Replace multiple strings in multiple files with R

I have something like 700,000 files in a folder where I need to find and replace multiple strings with different other strings (all 4-character codes). It is not known in advance whether a given string is present in a file. I'm trying to use gsub but I can't figure out how to do it with regular expressions. Can someone tell me a good and efficient way to handle this task?
This is the code I've used so far. It worked well with only one y <- gsub(...) instruction but doesn't work for my purpose, obviously because only the last gsub instruction is taken into account for defining the y variable...
chm_files <- list.files(getwd(), pattern=("^[[:digit:]]*.chm$"), full.names=F)
for(chm_file in chm_files) {
x <- readLines(chm_file)
y <- gsub("AG02|AG07|AG05|AG18|AG19|AG08|AG09|AG17", "AGRL", x)
y <- gsub("SB28|SB42|SB43|SB33|SB41|SB34|SB39|SB35", "SWHT", x)
y <- gsub("WB28|WB42|WB43|WB32|WB09|WB33|WB41|WB26", "BARL", x)
y <- gsub("WW02|WW25|WW08|WW31|WW05|WW28|WW19|WW42", "WWHT", x)
cat(y, file=chm_file, sep="\n")
}
I am sure there are already numerous pre-built functions for this task in various R packages, but I cooked this one up for myself and others to use/modify. Apart from the task requested above, it also prints a tracking log with the count of all changes made across files. The function is called multi_replace.
Here is some example code of how it should be run
# local directory with files you want to work with
setwd("C:/Users/DW/Desktop/New folder")
# get a list of files based on a pattern of interest e.g. .html, .txt, .php
filer = list.files(pattern="\\.php$")
# f - list of original string values you want to change
f <- c("localhost","dbtest","root","oldpassword")
# r - list of values to replace the above values with
# make sure the indexing of f & r matches
r <- c("newhost", "newdb", "newroot", "newpassword")
# Run the function and watch all your changes take place ;)
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet
setwd("D:/R Training Material Kathmandu/File renaming procedures")
filer = list.files(pattern="2016")
f <- c("DATA,","$")
r <- c("","")
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet
I used the above script, but the code failed to replace the $ sign in any of the files.
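Two details likely explain the problems described in this thread. In the question's loop every gsub() call starts again from x, so only the last replacement survives; each call needs to work on the result of the previous one. And "$" is a regular-expression anchor (end of line), so it only matches literally when fixed = TRUE is set. The sketch below illustrates both points; `replace_all_fixed` is a hypothetical helper written for this example, not the multi_replace function mentioned above, and the demo file is made up.

```r
# Hypothetical helper (not the multi_replace function mentioned above):
# applies each replacement in turn to the result of the previous one.
replace_all_fixed <- function(files, find, replace) {
  stopifnot(length(find) == length(replace))
  for (fname in files) {
    txt <- readLines(fname)
    for (k in seq_along(find)) {
      # fixed = TRUE treats find[k] as a literal string, so "$" is not
      # interpreted as the regex end-of-line anchor
      txt <- gsub(find[k], replace[k], txt, fixed = TRUE)
    }
    writeLines(txt, fname)
  }
  invisible(files)
}

# Toy file (assumption) to try it on
demo_file <- file.path(tempdir(), "demo_2016.txt")
writeLines(c("DATA,12$", "AG02 and SB28"), demo_file)
replace_all_fixed(demo_file,
                  c("DATA,", "$", "AG02", "SB28"),
                  c("", "", "AGRL", "SWHT"))
result_lines <- readLines(demo_file)
```

With fixed = TRUE the alternation syntax "AG02|AG07" no longer works, so each code is listed as its own entry in `find`; to keep the regular-expression alternations instead, drop fixed = TRUE and escape special characters such as "\\$".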

Opening and reading multiple netcdf files with RnetCDF

Using R, I am trying to open all the netcdf files I have in a single folder (e.g. 20 files), read a single variable, and create a single data.frame combining the values from all files. I have been using RNetCDF to read netcdf files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timestep. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following @mdsummer's comment, I have tried a for loop instead and have managed to do everything I needed:
# Declare data frame
df=NULL
#Open all files
files= list.files('MY_FOLDER/',pattern='*.nc',full.names=TRUE)
# Loop over files
for(i in seq_along(files)) {
nc = open.nc(files[i])
# Read the whole nc file and read the length of the varying dimension (here, the 3rd dimension, specifically time)
lw = var.get.nc(nc,'LWdown')
x=dim(lw)
# Vary the time dimension for each file as required
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,x[3]))
# Add the values from each file to a single data.frame
rbind(df,data.frame(lw))->df
# Close the file handle before moving on to the next file
close.nc(nc)
}
There may be a more elegant way but it works.
You're passing the additional function parameters wrong. You should use ... for that. Here's a simple example of how to pass na.rm to mean.
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
edit
For your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
I think this is much easier to do with CDO as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. This would be an example bash script:
# I don't know how your time axis is
# you may need to use a date with a time stamp too if your data is not e.g. daily
# see the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
# select the nearest grid point and the date slice desired:
# %??? strips the .nc from the file name
cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Rscript here to read in the files
It is possible to merge all the new files with cdo, but you would need to be careful if the time stamp is the same. You could try cdo merge or cdo cat - that way you can read in a single file to R, rather than having to loop and open each file separately.
