In R, I have a list of input files that I read in as data frames.
Now I want to subset them based on the gene given in one of the columns.
I am used to repeating every step for each sample, but I want to make the code shorter and cleaner, which is giving me some problems.
How I have done it before:
GM04284 <- read.table("GM04284_methylation_results_hg37.txt", header = TRUE)
GM04284_HTT <- subset(GM04284[GM04284$target == "HTT",])
GM04284_FMR1 <- subset(GM04284[GM04284$target == "fmr1",])
How I want to do it now:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
paste(sample,"_HTT", sep = "") <- subset(data[data$target == "HTT",])
paste(sample,"_FMR1", sep = "") <- subset(data[data$target == "fmr1",])
}
The subset part is what is causing me problems.
How can I make a new variable name that looks like the output of paste(sample,"_HTT", sep = "") and which can be taken as the name for the new subset table?
Thanks in advance, your help is much appreciated.
Are you sure you need to create a new variable for each data frame? If you're going to treat them all in the same way later, it might be better to use something more uniform and better organized.
One alternative is to keep them all in the list:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
res_list <- list()
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
res_list[[paste0(sample,"_HTT")]] <- data[data$target == "HTT", ]
res_list[[paste0(sample,"_FMR1")]] <- data[data$target == "fmr1",]
}
Then you can address them as members of this list, like res_list$GM04284_HTT (or, equivalently, res_list[['GM04284_HTT']]).
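As a minimal sketch of why the list form pays off, the toy data frame below stands in for one input file (its columns and values are made up for illustration). It uses split(), a base R alternative to writing one subsetting line per gene, and then applies one operation to every table in the list:

```r
# Toy stand-in for one methylation table; the values are invented for illustration.
data <- data.frame(
  target = c("HTT", "fmr1", "HTT", "fmr1"),
  position = c(101, 202, 103, 204),
  stringsAsFactors = FALSE
)
sample <- "GM04284"

# split() returns a named list with one data frame per value of `target`.
by_gene <- split(data, data$target)
names(by_gene) <- paste0(sample, "_", names(by_gene))

# Apply the same operation to every gene-specific table at once.
n_rows <- sapply(by_gene, nrow)
```

Each table is then available as, for example, by_gene[["GM04284_HTT"]].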
Vasily makes a good point in the answer above. It would indeed be tidier to have each dataframe contained within a list.
Nonetheless, you could use assign() if you really wanted to create a new dynamic variable:
assign(paste0(sample,"_HTT"), data[data$target == "HTT", ], envir = .GlobalEnv)
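If you go the assign() route, the counterpart for reading such a variable back by its constructed name is get(). The sketch below uses a toy data frame and, as an assumption made here to keep the global environment clean, a separate environment called env:

```r
sample <- "GM04284"
# Toy data; the values are invented for illustration.
data <- data.frame(
  target = c("HTT", "fmr1", "HTT"),
  value = c(0.1, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# A dedicated environment (hypothetical name) to hold the dynamic variables.
env <- new.env()

# Create a variable whose name is built at run time.
assign(paste0(sample, "_HTT"), data[data$target == "HTT", ], envir = env)

# Retrieve it again by building the same name.
htt <- get(paste0(sample, "_HTT"), envir = env)
```

Having to rebuild the name with paste0() every time you need the data is the main reason the list approach tends to be easier to work with.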
I'm an R beginner.
I have a folder containing 150 CSV files, named "student1", "student2", and so on.
Each file has 2 columns, Courses and Score.
I want to loop over the files and store all of the data in a new data frame.
So far I have:
data_1 = dir(path_cwd.full.names = TRUE, pattern = "csv$")
for(i in data_1)
{
b = read.csv(i,sep = ", " header = TRUE)
}
Please help me and explain it to me!
Much thanks
You can use lapply here. It does the same job as a for loop, but it returns the result of each iteration, so nothing gets overwritten. Here lapply reads each file, and do.call then binds all the data frames into one. Make sure all the CSV files have the same number of columns, with matching names and order; otherwise you may need extra steps in between.
# path_cwd is assumed to hold the path of the folder with the csv files
data_1 = dir(path_cwd, full.names = TRUE, pattern = "csv$")
final_df <- do.call(rbind, lapply(data_1, function(i){
# sep must be a single character
read.csv(i, sep = ",", header = TRUE)
}))
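To see the whole pattern run end to end, here is a small sketch that first writes two toy CSV files to a temporary folder (stand-ins for student1.csv, student2.csv; the scores are invented) and then reads and binds them. The extra student column is an addition for illustration, so each row can be traced back to its file:

```r
# Write two small CSV files to a temporary folder as stand-ins for the real data.
tmp_dir <- file.path(tempdir(), "students_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(Courses = c("Math", "Art"), Score = c(90, 75)),
          file.path(tmp_dir, "student1.csv"), row.names = FALSE)
write.csv(data.frame(Courses = c("Math", "Art"), Score = c(60, 88)),
          file.path(tmp_dir, "student2.csv"), row.names = FALSE)

# Collect the full paths of all csv files in the folder.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, tag its rows with the file name, then stack everything.
all_scores <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f, header = TRUE)
  df$student <- sub("\\.csv$", "", basename(f))
  df
}))
```

The function passed to lapply must return the data frame as its last expression; that returned value is what ends up in the list that rbind combines.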
I'm trying to call a dataframe but it's named with a number because it was originally multiple. I want to either rename the dataframes in my loop or find a way to call my dataframe even though it is titled with a number. Right now, after I run this code:
filenames <- list.files(path = "filepath",pattern = ".*txt")
head(filenames)
names <- substr(filenames,1,22)
for(i in names){
filepath <-file.path("filepath",paste(i,".txt",sep = ""))
assign(i,read.delim(filepath,colClasses = c('character','character','factor','factor'),sep = "\t"))
}
I get a lot of separate dataframes with names like '101_1b1_Al_sc_Meditron'. When I even try to view a dataframe, R is confused because the name begins with a number.
Is there a good solution here?
The simplest solution is to reference the original names using backticks.
example:
`123_mtcars` <- mtcars
View(`123_mtcars`)
If you would prefer to create a naming convention or just to remove numbers from each dataframe name you could do that in your loop and use the new variable in your assign statement.
example:
filenames <- list.files(path = "filepath",pattern = ".*txt")
head(filenames)
names <- substr(filenames,1,22)
for(i in names){
filepath <-file.path("filepath",paste(i,".txt",sep = ""))
# gsub to replace all numbers with "" for the name i
# (if the result still starts with "_", it still needs backticks; make.names(i) is an alternative)
dfName <- gsub("[0-9]", "", i)
assign(dfName,read.delim(filepath,colClasses = c('character','character','factor','factor'),sep = "\t"))
}
There are 3 solutions I can think of:
1. Keeping your code in current state.
If we don't change anything about your code and your dataframes are named as '101_1b1_Al_sc_Meditron' to view the contents of the dataframe you can use backticks. Try using it like this :
`101_1b1_Al_sc_Meditron`
2. Change the name of dataframes.
In your loop change the assign line to
assign(paste0('df_', i), read.delim(filepath,
colClasses = c('character','character','factor','factor'),sep = "\t"))
After running the for loop you'll have dataframes named like df_101_1b1_Al_sc_Meditron, which is a standard name, so you can access them without any problem.
3. Store data in a list.
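Instead of having so many dataframes in the global environment, why not store them in a list? Lists are easier to manage.
# build the vector of all file paths first; filepath inside the loop holds only one path
filepaths <- file.path("filepath", filenames)
list_of_files <- lapply(filepaths, function(x) read.delim(x,
colClasses = c('character','character','factor','factor'),sep = "\t"))
names(list_of_files) <- names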
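A short sketch of how such a list behaves, assuming toy tab-separated content passed in as text rather than real files (the column names start, end, crackles and wheezes are hypothetical). Inside a list, names that begin with a digit are ordinary strings, so no renaming is needed:

```r
# Toy tab-separated content keyed by recording name; the values are invented.
raw <- c(
  "101_1b1_Al_sc_Meditron" = "start\tend\tcrackles\twheezes\n0.0\t0.5\t0\t1\n",
  "102_1b1_Ar_sc_Meditron" = "start\tend\tcrackles\twheezes\n0.2\t0.9\t1\t0\n"
)

# lapply keeps the names of its input, so the list is named automatically.
recordings <- lapply(raw, function(txt) {
  read.delim(text = txt, sep = "\t",
             colClasses = c("character", "character", "factor", "factor"))
})

# Access one table by its name, even though the name starts with a number.
first <- recordings[["101_1b1_Al_sc_Meditron"]]
```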
So, I have a .tsv file of human variants.
I need to store in a data.frame all the rows of this file with a precise name and save them in another file. I'm trying with this script:
data = read.table(file.choose(), sep = '\t', header = TRUE)
variant = readline("Insert variant:")
store <- data.frame(matrix(NA, ncol = ncol(data)))
colnames(store) = colnames(data)
for (i in 1:nrow(data))
{
if (data[i,3] == variant)
{
store[i,] = as.data.frame(data[i,], stringsAsFactors = FALSE)
}
}
But because I used a matrix in the data.frame, it stores only numeric data, of course. Any ideas on how I can solve this, and how I can write the output of the loop directly to a .tsv file?
If discarding the rows would work, what you need is a subset, something like:
store <- data[ data[[3]] == variant, ]
data[[3]] here looks at the third column, which we compare to variant. So we subset data by taking only those rows where that third column matches variant.
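For the second half of the question, writing the result to a .tsv file, write.table with a tab separator is one option. The sketch below uses a toy data frame (column names and values are invented) and a temporary output path chosen for illustration:

```r
# Toy variant table; the third column holds the variant name.
data <- data.frame(
  chrom = c("1", "2", "1"),
  pos = c(100, 200, 300),
  name = c("rs1", "rs2", "rs1"),
  stringsAsFactors = FALSE
)
variant <- "rs1"

# Keep only the rows whose third column matches the variant.
store <- data[data[[3]] == variant, ]

# Write the subset as a tab-separated file without quotes or row names.
out_file <- file.path(tempdir(), "variant_rows.tsv")
write.table(store, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the contents.
check <- read.delim(out_file, stringsAsFactors = FALSE)
```

Because the subset keeps the original column types, character columns stay character, which avoids the numeric-only problem of the pre-allocated matrix.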
So let's say I've defined the below function to read in a set of files:
read_File <- function(file){
# read Excel file
df1 <- read.xls(file,sheet=1, pattern="Name", header=T, na.strings=("##"), stringsAsFactors=F)
# remove rows where Name is empty
df2 <- df1[-which(df1$Name==""),]
# remove rows where "Name" is repeated
df3 <- df2[-which(df2$Name=="Name"),]
# remove all empty columns (anything with an auto-generated colname)
df4 <- df3[, -grep("X\\.{0,1}\\d*$", colnames(df3))]
row.names(df4) <- NULL
df4$FileName <- file
return(df4)
}
It works fine like this, but it feels like bad form to define df1...df4 to represent the intermediate steps. Is there a better way to do this without compromising readability?
I see no reason to save intermediate objects separately unless they need to be used multiple times. This is not the case in your code, so I would replace all your df[0-9] with df:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is empty
df <- df[-which(df$Name == ""), ]
# remove rows where "Name" is repeated
df <- df[-which(df$Name == "Name"), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
df3 is not a nice descriptive variable name - it doesn't tell you anything more about the variable than df. Naming variables sequentially like that also creates a maintenance burden: if you need to add a new step in the middle, you will need to rename all subsequent objects to maintain consistency - which sounds both annoying and a likely source of bugs.
(Or have something hacky like df2.5, which is ugly and doesn't generalize well.) Generally, I think sequentially named variables are almost always bad practice, even when they are separate objects that you need saved.
Furthermore, keeping the intermediate objects around is not a good use of memory. In most cases it won't matter, but if your data is large then saving all the intermediate steps separately will greatly increase the amount of memory used during processing.
The comments are excellent, lots of detail - they tell you all you need to know about what's going on in the code.
If it were me, I would probably combine some steps, something like this:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is bad:
bad_names <- c("", "Name")
df <- df[-which(df$Name %in% bad_names), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
Having a bad_names vector to omit saves a line and is more parametric - it would be trivial to promote bad_names to a function argument (perhaps with the default value c("", "Name")) so that the user could customize it.
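A sketch of that promotion, limited to the row-filtering step so it runs without an Excel file (drop_bad_rows is a hypothetical helper name and the toy data is invented). It also swaps -which(...) for logical negation, which is a change from the code above: when nothing matches, which() returns an empty vector and df[-integer(0), ] drops every row instead of none.

```r
# Hypothetical helper: remove rows whose Name is in bad_names.
drop_bad_rows <- function(df, bad_names = c("", "Name")) {
  # Logical negation stays correct even when no row matches.
  df <- df[!(df$Name %in% bad_names), , drop = FALSE]
  row.names(df) <- NULL
  df
}

# Toy data with one empty name and one repeated header row.
raw <- data.frame(
  Name = c("Ann", "", "Name", "Bob"),
  Value = c("1", "", "Value", "2"),
  stringsAsFactors = FALSE
)

cleaned <- drop_bad_rows(raw)

# Running it again on already clean data leaves the rows in place.
untouched <- drop_bad_rows(cleaned)

# Custom list of names to drop.
only_ann <- drop_bad_rows(raw, bad_names = c("", "Name", "Bob"))
```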
I am trying to let the user define how many drugs' data they want to upload for a specific therapy. Based on that number, my function should let the user select the data for that many drugs and store them in variables, e.g. drug_1_data, drug_2_data, etc.
I have written some code but it doesn't work.
Could someone please help?
no_drugs <- readline("how many drugs for this therapy? Ans:")
i=0
while(i < no_drugs) {
i <- i+1
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add) # caption describes data for which drug
file_name <- noquote(paste("drug", i, "data", sep = "_")) # to create variable that will save uploaded .csv file
file_name <- read.csv(mydata[i],header=TRUE, sep = "\t")
}
In your example, mydata is a one element string, so subsets with i bigger than 1 will return NA. Furthermore, in your first assignment of file_name you set it to a non-quoted character vector but then overwrite it with data (and in every iteration of the loop you lose the data you created in the previous step). I think what you wanted was something more in the line of:
file_name <- paste("drug", i, "data", sep = "_")
assign(file_name, read.delim(mydata, header=TRUE))
# I changed the function to read.delim since the separator is a tab
However, I would also recommend to think about putting all the data in a list (it might be easier to apply operations to multiple drug dataframes like that), using something like this:
n_drugs <- as.numeric(readline("how many drugs for this therapy? Ans:"))
drugs <- vector("list", n_drugs)
for(i in seq_len(n_drugs)) {
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add)
# [[i]] stores the whole data frame as list element i; [i] would not
drugs[[i]] <- read.delim(mydata,header=TRUE)
}
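To illustrate why the list makes later steps easier, the sketch below skips the interactive file selection and uses two toy data frames instead (the dose and response columns are invented for illustration). Naming the elements gives you the drug_1_data style labels without creating separate variables:

```r
# Toy stand-ins for the uploaded drug tables.
drugs <- list(
  data.frame(dose = c(1, 2, 4), response = c(0.10, 0.35, 0.60)),
  data.frame(dose = c(1, 2, 4), response = c(0.05, 0.20, 0.55))
)

# Label the elements drug_1_data, drug_2_data, ...
names(drugs) <- paste("drug", seq_along(drugs), "data", sep = "_")

# Run the same calculation on every drug in one call.
mean_response <- sapply(drugs, function(d) mean(d$response))

# Pull out a single drug by name when needed.
first_drug <- drugs[["drug_1_data"]]
```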