I have something like 700,000 files in a folder where I need to find and replace multiple strings with different other strings (all 4 caracters codes). It is unsure if a string is present or not in a file. I'm trying to use gsub but I can't find how to do it with regular expressions. Can someone tell me a good and efficient way to handle this task?
This is the code I've used so far. It worked well with only one y <- gsub(...) instruction but doesn't work for my purpose, obviously because only the last gsub instruction is taken into account for defining the y variable...
chm_files <- list.files(getwd(), pattern=("^[[:digit:]]*.chm$"), full.names=F)
for(chm_file in chm_files) {
x <- readLines(chm_file)
y <- gsub("AG02|AG07|AG05|AG18|AG19|AG08|AG09|AG17", "AGRL", x)
y <- gsub("SB28|SB42|SB43|SB33|SB41|SB34|SB39|SB35", "SWHT", x)
y <- gsub("WB28|WB42|WB43|WB32|WB09|WB33|WB41|WB26", "BARL", x)
y <- gsub("WW02|WW25|WW08|WW31|WW05|WW28|WW19|WW42", "WWHT", x)
cat(y, file=chm_file, sep="\n")
}
I am sure there are already numerous pre-built functions for this task in various R-packages, but anyhow I just cooked this one up for myself and others to use/modify. Apart from the tasks request above it also prints out a tracking log of the count of all changes made across files function: multi_replace.
Here is some example code of how it should be run
# local directory with files you want to work with
setwd("C:/Users/DW/Desktop/New folder")
# get a list of files based on a pattern of interest e.g. .html, .txt, .php
filer = list.files(pattern=".php")
# f - list of original string values you want to change
f <- c("localhost","dbtest","root","oldpassword")
# r - list of values to replace the above values with
# make sure the indexing of f & r
r <- c("newhost", "newdb", "newroot", "newpassword")
# Run the function and watch all your changes take place ;)
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet
setwd("D:/R Training Material Kathmandu/File renaming procedures")
filer = list.files(pattern="2016")
f <- c("DATA,","$")
r <- c("","")
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet
I used the above script but the code failed to replace the $ sign among all files
Related
My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over the list with sapply/lapply as in the OP's code, but the names$ is incorrect as the lambda function object is x which signifies each of the elements of the list (i.e. a data.frame). Therefore, instead of names$, use x$
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
I wrote my first code in R for treating some spectra [basically .txt files with a Xcol (wavelength) and Ycol (intensity)].
The code works for single files, provided I write the file name in the code. Here the code working for the first file HKU47_PSG_1_LW_0.txt.
setwd("C:/Users/dd16722/R/Raman/Data")
# import Spectra
PSG1_LW<-read.table("HKU47_PSG_1_LW_0.txt")
colnames(PSG1_LW)[colnames(PSG1_LW)=="V2"] <- "PSG1_LW"
PSG2_LW<-read.table("HKU47_PSG_2_LW_all_0.txt")
colnames(PSG2_LW)[colnames(PSG2_LW)=="V2"] <- "PSG2_LW"
#Plot 2 spectra and define the Y range
plot(PSG1_LW$V1, PSG1_LW$PSG1_LW, type="l",xaxs="i", yaxs="i", main="Raman spectra", xlab="Raman shift (cm-1)", ylab="Intensity", ylim=range(PSG1_LW,PSG2_LW))
lines(PSG2_LW$V1, PSG2_LW$PSG2_LW, col=("red"), yaxs="i")
# Temperature-excitation line correction
laser = 532
PSG1_LW_corr <- PSG1_LW$PSG1_LW*((10^7/laser)^3*(1-exp(-6.62607*10^(-34)*29979245800*PSG1_LW$V1/(1.3806488*10^(-23)*293.15)))*PSG1_LW$V1/((10^7/laser)-PSG1_LW$V1)^4)
PSG1_Raw_Corr <-cbind (PSG1_LW,PSG1_LW_corr)
lines(PSG1_LW$V1, PSG1_LW_corr, col="red")
plot(PSG1_LW$V1, PSG1_Raw_Corr$PSG1_LW_corr, type="l",xaxs="i", yaxs="i", xlab="Raman shift (cm-1)", ylab="Intensity")
Now, it's time for another little step forward. In the folder, there are many spectra (in the code above I reported the second one: HKU47_PSG_2_LW_all_0.txt) having again 2 columns, same length of the first file. I suppose I should merge all the files in a matrix (or DF or DT).
Probably I need a loop as I need a code able to check automatically the number of files contained in the folder and ultimately to create an object with several columns (i.e. the double of the number of the files).
So I started like this:
listLW <- list.files(path = ".", pattern = "LW")
numLW <- as.integer(length(listLW))
numLW represents the number of iterations I need to set. The question is: how can I populate a matrix (or DF or DT) in order to have in the first 2 columns the first txt file in my folder, then the second file in the 3rd and 4th columns etc? Considering that I need to perform some other operations as I showed above in the code.
I have been reading about loop in R since yestarday but actually could not find the best and easy solution.
Thanks!
You could do something like
# Load data.table library
require(data.table)
# Import the first file
DT_final <- fread(file = listLW[1])
# Loop over the rest of the files and use cbind to merge them into 1 DT
for(file in setdiff(listLW, listLW[1])) {
DT_temp <- fread(file)
DT_final <- cbind(DT_final, DT_temp)
}
I am interested in making my R script to work automatically for another set of parameters. For example:
gene_name start_x end_y
file1 -> gene1 100 200
file2-> gene2 150 270
and my script does trivial job, just for learning purposes. It should take the information about gene1 and find a sum, write into a file; then it should take information of the next gene2, find sum and write this into a new file and etc, and lets say I would like to keep files name according to the genes name:
file_gene1.txt # this file holds sum of start_x +end_y for gene1
file_gene2.txt # this file holds sum of start_x +end_y for gene2
etc for the rest of 700 genes (obviously manually its to much work to take file1, and write file name and plug in start and end values into already existing script )
I guess the idea is clear, I have never been doing this type of things, and I guess its very trivial, but i would appreciate if anyone can tell me the proper definition of this process so I can search and learn online how to do it.
P.S: I think in Python I would just make a list of genes and related x/y values, loop and select required info, but I still don't know how I would keep gene names as a file name automatically.
EDIT:
I have to supply the info about a gene location, therefore start and end, which is X and Y respectively.
x=100 # assign x to a value of a related gene
y=150 # assign y to a value of a related gene
a=tbl[which(tbl[,'middle']>=x & tbl[,'middle']<y),] # for each new gene this info is changing accoringly
write.table( a, file= ' gene1.txt' ) # here I would need changing file name
my thoughts:
may be I need to generate a file, which contains all 700 gene names and related X and Y values.
then I read line one of this file and supply it into my script (in case of variable a, x and y)
then my computation is over I write results into a file and keep a gene name, that was used to generate this results.
Is it more clear?
P.S.: I Google it by probably because I don't know the topic I cant find anything relevant, just give me the idea where I can search, I would like to learn this programming step anyway.
I guess so you are looking for reading all the files present in a folder (Assuming all your gene files written in a single folder using your older script). In that case you can use something like:
directory <- "C://User//Downloads//R//data"
file <- list.files(directory, full.names = TRUE)
Then access filename using file[i] and do whatever needed (naming the file paste("gene", file[i], sep = "_") or reading it read.csv(file[i])).
I would divide your problem in two parts. (Sample data for reproducible example provided below)
library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation
1st: Apply your functions to your data by gene
output <- dt[ , .( f1 = sum(start_x, end_y),
f2 = start_x - end_y ,
f3 = start_x * end_y ,
f7 = start_x / end_y),
by=.(gene)]
2nd: Split your data frame by gene and save it in separate files
output[,fwrite(.SD,file=sprintf("%s.csv", unique(gene))),
by=.(gene)]
Latter on, you can do bind the multiple files into one single data frame if you like:
# Get a List of all `.csv` files in your folder
filenames <- list.files("C:/your/folder", pattern="*.csv", full.names=TRUE)
# Load and bind all data sets
data <- rbindlist(lapply(filenames,fread))
ps. note that fwrite is still in development version of data.table as of today (12 May 2016)
data for reproducible example:
dt <- data.table( gene = c('id1','id2','id3','id4','id5','id6','id7','id8','id9','id10'),
start_x = c(1:10),
end_y = c(20:29) )
I have multiple time series (each in a seperate file), which I need to adjust seasonally using the season package in R and store the adjusted series each in a seperate file again in a different directory.
The Code works for a single county.
So I tried to use a for Loop but R is unable to use the read.dta with a wildcard.
I'm new to R and using usually Stata so the question is maybe quite stupid and my code quite messy.
Sorry and Thanks in advance
Nathan
for(i in 1:402)
{
alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
alo_ts[i] <-ts(alo[i], freq = 12, start = 2007)
m[i] <- seas(alo_ts[i])
original[i]<-as.data.frame(original(m[i]))
adjusted[i]<-as.data.frame(final(m[i]))
trend[i]<-as.data.frame(trend(m[i]))
irregular[i]<-as.data.frame(irregular(m[i]))
County[i] <- data.frame(cbind(adjusted[i],original[i],trend[i],irregular[i], deparse.level =1))
write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely to be that you're using Stata-like character string construction that will not work in R. You need to use paste (or paste0, as here) rather than just passing the indexing variable directly in the string like in Stata. Here's some code:
f <- function(i) {
d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County",i,".dta"))
alo_ts <- ts(d, freq = 12, start = 2007)
m <- seas(alo_ts)
original <- as.data.frame(original(m))
adjusted <- as.data.frame(final(m))
trend <- as.data.frame(trend(m))
irregular <- as.data.frame(irregular(m))
County <- cbind(adjusted,original,trend,irregular, deparse.level = 1)
write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County",i,".dta"))
invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simply the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible should, for example, someone ever run it on another computer.
I have several files in one folder and I want to rename them,I noticed that R reads in alphbatical order, so I used the command mixedsort and it worked but when I checked the results I found that the files were read in a different order not numerically. The name of the first file is Daily_NPP1.bin up to Daily_NPP365.bin
a<- list.files("C:\\New folder (6)", "*.bin", full.names = TRUE)
k<- mixedsort(a)#### load package feild
b <- sprintf("C:carbonflux\\Daily_Rh%d.bin", seq(k))
file.rename(a, b)
How do I force R to read in numerical order?
If renaming is all you want to do you could just do the following, regardless of sorting.
b <- sub("^.*?([0-9]+)\\.bin$", "C:\\\\carbonflux\\\\Daily_Rh\\1.bin", a)
file.rename(a, b)
The first argument to sub extracts the numbers at the end of the file names, and the second pastes it into the new file name template (at the position of \\1). All the \\\\ are needed to escape the backslashes properly.
Here is a way to order the vector without renaming the files:
# Replication of data:
a <- sort(paste0("Daily_NPP",1:365,".bin"))
# Extract numbers and order:
a <- a[order(as.numeric(gsub("[^0-9]","",a)))]