I am trying to let the user define how many drugs' worth of data they want to upload for a specific therapy. Based on that number, my function should let the user select data files for that many drugs and store them in variables, e.g. drug_1_data, drug_2_data, etc.
I have written some code, but it doesn't work. Could someone please help?
no_drugs <- readline("how many drugs for this therapy? Ans:")
i=0
while(i < no_drugs) {
i <- i+1
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add) # caption describes data for which drug
file_name <- noquote(paste("drug", i, "data", sep = "_")) # to create variable that will save uploaded .csv file
file_name <- read.csv(mydata[i],header=TRUE, sep = "\t")
}
In your example, mydata is a one-element character vector, so subsetting it with i greater than 1 will return NA. Furthermore, in your first assignment of file_name you set it to an unquoted character vector, but then you overwrite it with the data (and in every iteration of the loop you lose the data you created in the previous step). I think what you wanted was something more along the lines of:
file_name <- paste("drug", i, "data", sep = "_")
assign(file_name, read.delim(mydata, header = TRUE))
# I changed the function to read.delim since the separator is a tab
However, I would also recommend thinking about putting all the data in a list (it might be easier to apply operations to multiple drug data frames that way), using something like this:
n_drugs <- as.numeric(readline("how many drugs for this therapy? Ans:"))
drugs <- vector("list", n_drugs)
for(i in 1:n_drugs) {
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add)
drugs[[i]] <- read.delim(mydata, header = TRUE) # use [[ ]] to store a data frame as a list element
}
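The list makes it easy to work with all the drugs at once. A small sketch of possible follow-up usage, assuming drugs was filled as above (the element names here are just illustrative):

```r
# give the list elements readable names (illustrative)
names(drugs) <- paste("drug", seq_len(n_drugs), "data", sep = "_")

# look at one drug's data
head(drugs[["drug_1_data"]])

# apply the same operation to every drug at once, e.g. count the rows
sapply(drugs, nrow)
```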
I have several files with the names RTDFE, TRYFG, FTYGS, WERTS... like 100 files in txt format. For each file, I'm using the following code and writing the output to a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Every time, I assign the next file's name to the variable name, run my code, and write the output to the file. Instead, I want the code to run on all the files in one go and produce the following output, where Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How can I do this? Any help is appreciated.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the row count:
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row-binding the results together:
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data).txt"
names = unique(gsub(p,"",list.files(pattern = p)))
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in it.
If so, you can get the list of files with the command list.files.
But doing so will give you both the "_data.txt" and the "_filter.txt" files, so we need a way to extract the basic part of each name.
I use str_replace to remove the "_data.txt" and the "_filter.txt" suffixes from the list.
That still leaves each base name appearing twice, therefore I use the unique command.
I store the result in "lfiles", which will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfies the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## strip the "_filter.txt" and "_data.txt" suffixes from the file names
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))
}
Ok, I will assume you have a folder called "data" with files named "RTDFE_filter.txt", "RTDFE_data.txt", "TRYFG_filter.txt", "TRYFG_data.txt", etc. (only and exactly these files).
This code should give a possible way:
# save the file names
files = list.files("data")
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
A <- read.delim(files[i+1], sep = "\t", header = FALSE)
B <- read.delim(files[i], sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
samp = strsplit(files[i], "_")[[1]][1]
com = nrow(C)
return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)
In R I have a list of input files, which are data frames.
Now I want to subset them based on the gene given in one of the columns.
I am used to doing everything repetitively for every sample I have, but I want to make the code smoother and shorter, which is giving me some problems.
How I have done it before:
GM04284 <- read.table("GM04284_methylation_results_hg37.txt", header = TRUE)
GM04284_HTT <- subset(GM04284[GM04284$target == "HTT",])
GM04284_FMR1 <- subset(GM04284[GM04284$target == "fmr1",])
How I want to do it now:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
paste(sample,"_HTT", sep = "") <- subset(data[data$target == "HTT",])
paste(sample,"_FMR1", sep = "") <- subset(data[data$target == "fmr1",])
}
The subset part is what is causing me problems.
How can I make a new variable name that looks like the output of paste(sample,"_HTT", sep = "") and which can be taken as the name for the new subset table?
Thanks in advance, your help is very appreciated.
Are you sure you need to create a new variable for each data frame? If you're going to treat them all in the same way later, it might be better to use something more uniform and better organized.
One alternative is to keep them all in the list:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
res_list <- list()
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
res_list[[paste0(sample,"_HTT")]] <- data[data$target == "HTT", ]
res_list[[paste0(sample,"_FMR1")]] <- data[data$target == "fmr1",]
}
Then you can address them as members of this list, like res_list$GM04284_HTT (or, equivalently, res_list[['GM04284_HTT']])
Vasily makes a good point in the answer above. It would indeed be tidier to have each dataframe contained within a list.
Nonetheless, you could use assign() if you really wanted to create a new dynamic variable:
assign(paste0(sample, "_HTT"), data[data$target == "HTT", ], envir = .GlobalEnv)
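One caveat with assign() is that such variables then also have to be looked up dynamically; get() and mget() do the reverse lookup. A small sketch (using the sample variable from the loop in the question):

```r
# retrieve one dynamically named data frame
htt_table <- get(paste0(sample, "_HTT"))

# or fetch both gene tables for a sample into a named list
gene_tables <- mget(paste0(sample, c("_HTT", "_FMR1")))
```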
I have written code to filter, group, and sort my large data files. I have multiple text files to analyze. I know I can copy the code and run it with new data, but I was wondering whether there is a way to put this in a for loop that would open the text files one by one, run the code, and store the results. I use the following to load all my text files. In the next steps, I select columns and filter them to find the desired values. But at the moment it only gives me the result for one file; I want to obtain results from all data files.
Samples <- Sys.glob("*.csv")
for (filename in Samples) {
try <- read.csv(filename, sep = ",", header = FALSE)
shear <- data.frame(try[,5],try[,8],try[,12])
lane <- shear[which(shear$Load == "LL-1"),]
Ext <- subset(lane, Girder %in% c("Left Ext","Right Ext"))
Max.Ext <- max(Ext$Shear)
}
You can put everything that you want to apply to each file in a function:
apply_fun <- function(filename) {
try <- read.csv(filename, sep = ",", header = FALSE)
shear <- data.frame(Load = try[,5], Girder = try[,8], Shear = try[,12]) # name the columns so the filters below work
lane <- shear[which(shear$Load == "LL-1"),]
Ext <- subset(lane, Girder %in% c("Left Ext","Right Ext"))
return(max(Ext$Shear, na.rm = TRUE))
}
Since we want only one number (the max) from each file, we can use sapply to apply the function to each file:
Samples <- Sys.glob("*.csv")
sapply(Samples, apply_fun)
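sapply() returns a named numeric vector here, one maximum per file; if a data frame is more convenient, a small sketch:

```r
Samples <- Sys.glob("*.csv")
max_shear <- sapply(Samples, apply_fun)

# turn the named vector into a two-column data frame
data.frame(File = names(max_shear), Max.Ext = unname(max_shear))
```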
So let's say I've defined the below function to read in a set of files:
read_File <- function(file){
# read Excel file
df1 <- read.xls(file,sheet=1, pattern="Name", header=T, na.strings=("##"), stringsAsFactors=F)
# remove rows where Name is empty
df2 <- df1[-which(df1$Name==""),]
# remove rows where "Name" is repeated
df3 <- df2[-which(df2$Name=="Name"),]
# remove all empty columns (anything with an auto-generated colname)
df4 <- df3[, -grep("X\\.{0,1}\\d*$", colnames(df3))]
row.names(df4) <- NULL
df4$FileName <- file
return(df4)
}
It works fine like this, but it feels like bad form to define df1...df4 to represent the intermediate steps. Is there a better way to do this without compromising readability?
I see no reason to save intermediate objects separately unless they need to be used multiple times. This is not the case in your code, so I would replace all your df[0-9] with df:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is empty
df <- df[-which(df$Name == ""), ]
# remove rows where "Name" is repeated
df <- df[-which(df$Name == "Name"), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
df3 is not a nice descriptive variable name - it doesn't tell you anything more about the variable than df does. Sequentially naming variable steps like that also creates a maintenance burden: if you need to add a new step in the middle, you will need to rename all subsequent objects to maintain consistency - which sounds both annoying and potentially risky for bugs.
(Or have something hacky like df2.5, which is ugly and doesn't generalize well.) Generally, I think sequentially named variables are almost always bad practice, even when they are separate objects that you need saved.
Furthermore, keeping the intermediate objects around is not a good use of memory. In most cases it won't matter, but if your data is large, then saving all the intermediate steps separately will greatly increase the amount of memory used during processing.
The comments are excellent, lots of detail - they tell you all you need to know about what's going on in the code.
If it were me, I would probably combine some steps, something like this:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is bad:
bad_names <- c("", "Name")
df <- df[-which(df$Name %in% bad_names), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
Having a bad_names vector to omit saves a line and is more parametric - it would be trivial to promote bad_names to a function argument (perhaps with the default value c("", "Name")) so that the user could customize it.
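For illustration, the promoted version could look like this (a sketch; read.xls is the gdata function used in the question, and the default value keeps the current behaviour):

```r
library(gdata)  # provides read.xls, as in the original function

read_File <- function(file, bad_names = c("", "Name")) {
  # read Excel file
  df <- read.xls(file, sheet = 1, pattern = "Name", header = TRUE,
                 na.strings = "##", stringsAsFactors = FALSE)
  # remove rows whose Name is in the user-supplied bad_names
  df <- df[-which(df$Name %in% bad_names), ]
  # remove all empty columns (anything with an auto-generated colname)
  df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
  row.names(df) <- NULL
  df$FileName <- file
  df
}
```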
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, the mean of column 2, and the mean of column 4 to a data frame.
I have written the following function:
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but i keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply... Here's my take:
getMeans <- function(fpath,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
fcsv <- f[grepl("\\.csv", f)] # the dot must be escaped with a double backslash in an R string
fcsv <- paste0(fpath,fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
lapply(rel_csv_list,function(x) colMeans(x[,target_cols]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Set to empty data frame if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 22),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <- tk_choose.dir("", "Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <- matrix(
  c(filetitles[[i]],
    round(mean(file.in$V2, na.rm = T), 1),
    round(mean(file.in$V4, na.rm = T), 1)),
  nrow = 1, ncol = 3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function, because then I would only have one row in mdf, containing the last file's data. Somehow it did not add rows but overwrote row 1 with each iteration. Using it without a function wrapper worked fine, though...