I'm trying to refer to a dataframe whose name starts with a number, because it is one of multiple dataframes named after the files they were read from. I want to either rename the dataframes in my loop or find a way to refer to my dataframe even though its name starts with a number. Right now, after I run this code:
filenames <- list.files(path = "filepath",pattern = ".*txt")
head(filenames)
names <- substr(filenames,1,22)
for(i in names){
filepath <-file.path("filepath",paste(i,".txt",sep = ""))
assign(i,read.delim(filepath,colClasses = c('character','character','factor','factor'),sep = "\t"))
}
I get a lot of separate dataframes with names like '101_1b1_Al_sc_Meditron.txt'. When I try to even view the dataframe, R is confused because the name begins with a number.
Is there a good solution here?
The simplest solution is to reference the original names using backticks.
example:
`123_mtcars` <- mtcars
View(`123_mtcars`)
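Backticks work anywhere a normal name would, not just in View(). A small sketch using the built-in mtcars data (the variable names n_rows, first_mpg and same are only for illustration):

```r
# Assign the built-in mtcars data to a name that starts with a digit
`123_mtcars` <- mtcars

# Backticked names can be passed to functions and used with $ like any other name
n_rows <- nrow(`123_mtcars`)
first_mpg <- `123_mtcars`$mpg[1]

# get() looks the object up by its name given as a string, so no backticks are needed
same <- identical(get("123_mtcars"), `123_mtcars`)

print(n_rows)
print(same)
```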
If you would prefer to create a naming convention or just to remove numbers from each dataframe name you could do that in your loop and use the new variable in your assign statement.
example:
filenames <- list.files(path = "filepath",pattern = ".*txt")
head(filenames)
names <- substr(filenames,1,22)
for(i in names){
filepath <-file.path("filepath",paste(i,".txt",sep = ""))
# gsub to replace all numbers with "" for the name i
dfName <- gsub("[0-9]", "", i)
assign(dfName,read.delim(filepath,colClasses = c('character','character','factor','factor'),sep = "\t"))
}
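One caveat with stripping digits: a name such as 101_1b1_Al_sc_Meditron becomes _b_Al_sc_Meditron, which still starts with an underscore and so still needs backticks, and two files that differ only in their digits would end up with the same name. A minimal sketch of an alternative, assuming base R's make.names() suits your naming scheme (the file stem is taken from the question; is_syntactic is a hypothetical helper):

```r
# File stem taken from the question
stem <- "101_1b1_Al_sc_Meditron"

# Stripping digits leaves a leading underscore, which is still not a syntactic name
stripped <- gsub("[0-9]", "", stem)

# make.names() prepends "X" when the first character is not allowed
safe <- make.names(stem)

# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged
is_syntactic <- function(x) identical(x, make.names(x))

print(stripped)
print(safe)
print(is_syntactic(stripped))
print(is_syntactic(safe))
```

In the loop you could then use assign(make.names(i), ...) instead of assign(dfName, ...).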
There are 3 solutions I can think of:
1. Keeping your code in current state.
If we don't change anything about your code and your dataframes are named as '101_1b1_Al_sc_Meditron' to view the contents of the dataframe you can use backticks. Try using it like this :
`101_1b1_Al_sc_Meditron`
2. Change the name of dataframes.
In your loop change the assign line to
assign(paste0('df_', i), read.delim(filepath,
colClasses = c('character','character','factor','factor'),sep = "\t"))
So after running the for loop you'll have dataframes with names like df_101_1b1_Al_sc_Meditron, which is a syntactically valid name, and you can access them without any problem.
3. Store data in a list.
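Instead of having so many dataframes in the global environment, why not store them in a list? Lists are easier to manage. Note that lapply needs the full vector of file paths, not the single filepath left over from the loop.
list_of_files <- lapply(file.path("filepath", filenames), function(x) read.delim(x,
colClasses = c('character','character','factor','factor'),sep = "\t"))
# name each element after its file so it can be looked up later
names(list_of_files) <- names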
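A self-contained sketch of the list approach, assuming two small tab-separated files written to a temporary directory as stand-ins for the real data (the directory, file names and columns are made up for illustration):

```r
# Create two small tab-separated files in a temporary directory (stand-ins for the real data)
dir_path <- file.path(tempdir(), "named_list_demo")
dir.create(dir_path, showWarnings = FALSE)
demo <- data.frame(a = c("x", "y"), b = c("1", "2"))
write.table(demo, file.path(dir_path, "101_first.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(demo, file.path(dir_path, "102_second.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# full.names = TRUE returns paths that can be passed straight to read.delim()
paths <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)
list_of_files <- lapply(paths, read.delim, colClasses = "character")

# Name each element after its file, minus the extension
names(list_of_files) <- tools::file_path_sans_ext(basename(paths))

# Names that start with a digit are fine as list names
print(names(list_of_files))
print(list_of_files[["101_first"]])
```

With a named list, the leading digit is no longer a problem, because the name is just a string inside [[ ]].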
In R I have a list of input files, which are data frames.
Now I want to subset them based on the gene given in one of the columns.
I am used to doing everything repetitively for every sample I have, but I want to make the code smoother and shorter, which is giving me some problems.
How I have done it before:
GM04284 <- read.table("GM04284_methylation_results_hg37.txt", header = TRUE)
GM04284_HTT <- subset(GM04284[GM04284$target == "HTT",])
GM04284_FMR1 <- subset(GM04284[GM04284$target == "fmr1",])
How I want to do it now:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
paste(sample,"_HTT", sep = "") <- subset(data[data$target == "HTT",])
paste(sample,"_FMR1", sep = "") <- subset(data[data$target == "fmr1",])
}
The subset part is what is causing me problems.
How can I make a new variable name that looks like the output of paste(sample,"_HTT", sep = "") and which can be taken as the name for the new subset table?
Thanks in advance, your help is very appreciated.
Are you sure you need to create a new variable for each dataframe? If you're going to treat them all in the same way later, it might be better to use something more uniform and better organized.
One alternative is to keep them all in a list:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
res_list <- list()
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
res_list[[paste0(sample,"_HTT")]] <- data[data$target == "HTT", ]
res_list[[paste0(sample,"_FMR1")]] <- data[data$target == "fmr1",]
}
Then you can address them as members of this list, like res_list$GM04284_HTT (or, equivalently, res_list[['GM04284_HTT']]).
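If there are more genes than just these two, base R's split() can build the per-gene tables in one step. This is a different technique from the explicit subsetting above, shown here as a sketch on a toy table (only the target column mirrors the question; the value column is made up):

```r
# Toy stand-in for one methylation table; only the `target` column mirrors the question
data <- data.frame(
  target = c("HTT", "fmr1", "HTT", "fmr1"),
  value  = c(0.1, 0.2, 0.3, 0.4)
)
sample <- "GM04284"

# split() returns one data frame per distinct value of data$target
by_gene <- split(data, data$target)

# Prefix each element with the sample name, e.g. "GM04284_HTT"
names(by_gene) <- paste0(sample, "_", names(by_gene))

print(names(by_gene))
print(by_gene[["GM04284_HTT"]])
```

Note that the element names follow the values in the column, so the second table is named GM04284_fmr1 rather than GM04284_FMR1.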
Vasily makes a good point in the answer above. It would indeed be tidier to have each dataframe contained within a list.
Nonetheless, you could use assign() if you really wanted to create a new dynamic variable:
assign(paste0(sample,"_HTT"), subset(data[data$target == "HTT",]), envir = .GlobalEnv)
I am trying to alter the contents of a specific column in a list of dataframes in R that has been constructed as such:
# Generating a filelist for all summary.txt files that are 3 subdirectories deep from the pwd
filelist = grep(Sys.glob(paste(getwd(), "/*/*/*/*.txt", sep = "")),pattern = "summary.txt", invert = TRUE, value = TRUE )
# Reading in all data files
cazys = lapply(filelist, read.csv, header = TRUE, sep = "\t")
A typical one of the dataframes will look like this:
fam group Functions
AA2 3 1.11.1.13:70,
I want to split the Functions column by ":" to remove ":70," and similar for every dataframe in the list. I have tried the following:
# Correcting the EC number column
fixed_EC = lapply(cazys, function(x){
x$Functions = strsplit(as.character(x$Functions), ":", fixed = TRUE)[[1]][1]
} )
But this only returns the result of strsplit and not the altered dataframe. However, when I use this command outside of lapply it produces the desired results. How can I get this to work inside an apply function?
Adding return(x) in your current approach should solve it. However, here is a different approach using regex which removes everything after ":" in Functions column.
fixed_EC <- lapply(cazys, function(x)
transform(x, Functions = sub(':.*', '', Functions)))
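For completeness, a sketch of the return-the-dataframe fix. Note that [[1]][1] in the question keeps only the first row's value, so this version uses vapply() to take the first piece for every row. The toy data frame is made up for illustration:

```r
# Toy list of data frames shaped like the question's example
cazys <- list(
  data.frame(fam = c("AA2", "GH5"), group = c(3, 1),
             Functions = c("1.11.1.13:70,", "3.2.1.4:12,"))
)

fixed_EC <- lapply(cazys, function(x) {
  parts <- strsplit(as.character(x$Functions), ":", fixed = TRUE)
  # Keep the text before the first ":" for every row
  x$Functions <- vapply(parts, function(p) p[1], character(1))
  # Return the modified data frame, not the result of the assignment
  x
})

print(fixed_EC[[1]])
```

The last expression in the function is what lapply() collects, which is why ending with x (or return(x)) matters.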
I have my file names as all.files in working directory. I want to read these files in loop and assign the names as gsub(".csv","", all.files) for each file.
all.files <- c("harvestA.csv", "harvestB.csv", "harvestC.csv", "harvestD.csv",
"seedA.csv", "seedB.csv", "seedC.csv", "seedD.csv")
I tried something like this below but it won't work. What do I need here?
for(i in 1:length(all.files)){
assign(gsub(".csv","", all.files)[i]) <- read.table(
all.files[i],
header = TRUE,
sep = ","
)
}
You could keep them in a named list, as it is not good practice to clutter the environment with a lot of global variables:
list_df <- lapply(all.files, read.csv)
names(list_df) <- sub("\\.csv", "", all.files)
You can always extract individual dataframes as list_df[["harvestA"]], list_df[["harvestB"]] etc.
If you still need them as separate dataframes
list2env(list_df, .GlobalEnv)
The . is a metacharacter in regex matching any character. So, we can use fixed = TRUE to match the literal dot. Also, in the OP's code, with the assign, there is no need for another assignment operator (<-), the second argument in assign is value and here it is the dataset read with the read.table
for(i in 1:length(all.files)){
assign(sub(".csv","", all.files, fixed = TRUE)[i], read.table(
all.files[i],
header = TRUE,
sep = ","
))
}
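To see why the literal match matters, here is a small sketch with a made-up file name (xcsv_data.csv is hypothetical) where the unescaped dot goes wrong:

```r
tricky <- "xcsv_data.csv"  # hypothetical file name

# As a regex, "." matches any character, so ".csv" matches "xcsv" at the start
regex_result <- sub(".csv", "", tricky)

# fixed = TRUE treats the pattern as a literal string
fixed_result <- sub(".csv", "", tricky, fixed = TRUE)

# Escaping the dot and anchoring to the end is an equivalent regex form
anchored_result <- sub("\\.csv$", "", tricky)

print(regex_result)
print(fixed_result)
print(anchored_result)
```

For the file names in the question all three give the same answer, so this only bites with names that contain "csv" somewhere before the extension.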
An option using strsplit
for (i in seq_along(all.files)) {
assign(x = strsplit(all.files[i],"\\.")[[1]][1],
value = read.csv(all.files[i]),
pos = .GlobalEnv)
}
I have a list of files like:
nE_pT_sbj01_e2_2.csv,
nE_pT_sbj02_e2_2.csv,
nE_pT_sbj04_e2_2.csv,
nE_pT_sbj05_e2_2.csv,
nE_pT_sbj09_e2_2.csv,
nE_pT_sbj10_e2_2.csv
As you can see, the name of the files is the same with the exception of 'sbj' (the number of the subject) which is not consecutive.
I need to run a for loop, but I would like to retain the original number of the subject. How to do this?
I assume I need to replace length(file) with something that keeps the original number of the subject, but not sure how to do it.
setwd("/path")
file = list.files(pattern="\\.csv$")
for(i in 1:length(file)){
data=read.table(file[i],header=TRUE,sep=",",row.names=NULL)
source("functionE.R")
Output = paste("e_sbj", i, "_e2.Rdata")
save.image(Output)
}
The code above gives me as output:
e_sbj1_e2.Rdata,e_sbj2_e2.Rdata,e_sbj3_e2.Rdata,
e_sbj4_e2.Rdata,e_sbj5_e2.Rdata,e_sbj6_e2.Rdata.
Instead, I would like to obtain:
e_sbj01_e2.Rdata,e_sbj02_e2.Rdata,e_sbj04_e2.Rdata,
e_sbj05_e2.Rdata,e_sbj09_e2.Rdata,e_sbj10_e2.Rdata.
Drop the extension "csv", then add "Rdata", and use filenames in the loop, for example:
myFiles <- list.files(pattern = "\\.csv$")
for(i in myFiles){
myDf <- read.csv(i)
outputFile <- paste0(tools::file_path_sans_ext(i), ".Rdata")
outputFile <- gsub("nE_pT_", "e_", outputFile, fixed = TRUE)
save(myDf, file = outputFile)
}
Note: I changed your variable names, try to avoid using function names as a variable name.
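As far as I can tell, the loop above produces names like e_sbj01_e2_2.Rdata, because the trailing "_2" is still part of the stem. A sketch of building the exact names asked for, using two of the file names from the question (the variable names stem and outputFiles are only for illustration):

```r
myFiles <- c("nE_pT_sbj01_e2_2.csv", "nE_pT_sbj09_e2_2.csv")  # names from the question

stem <- tools::file_path_sans_ext(myFiles)   # drop ".csv"
stem <- sub("^nE_pT_", "e_", stem)           # replace the prefix
stem <- sub("_2$", "", stem)                 # drop the trailing "_2"
outputFiles <- paste0(stem, ".Rdata")

print(outputFiles)
```

Inside the loop the same three steps can be applied to i before calling save().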
If you use regular expressions and sprintf (or paste0), you can do it easily without a loop:
fls <- c('nE_pT_sbj01_e2_2.csv', 'nE_pT_sbj02_e2_2.csv', 'nE_pT_sbj04_e2_2.csv', 'nE_pT_sbj05_e2_2.csv', 'nE_pT_sbj09_e2_2.csv', 'nE_pT_sbj10_e2_2.csv')
sprintf('e_%s_e2.Rdata',regmatches(fls,regexpr('sbj\\d{2}',fls)))
[1] "e_sbj01_e2.Rdata" "e_sbj02_e2.Rdata" "e_sbj04_e2.Rdata" "e_sbj05_e2.Rdata" "e_sbj09_e2.Rdata" "e_sbj10_e2.Rdata"
You can easily feed the vector to a function (if possible) or feed the function to the vector with sapply or lapply
fls_new <- sprintf('e_%s_e2.Rdata',regmatches(fls,regexpr('sbj\\d{2}',fls)))
res <- lapply(fls_new,function(x) yourfunction(x))
If I understood correctly, you only change extension from .csv to .Rdata, remove last "_2" and change prefix from "nE_pT" to "e". If yes, this should work:
Output = sub("_2.csv", ".Rdata", sub("nE_pT", "e", file[i]))
I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function to strip twitter handles from urls and do some other things to these files. I have developed scripts for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
data$rank <- c(1:500)
names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
data <- data[,c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know that this means that my function is being applied to the string I called from within the vector data_names, but I don't know how to tell R that, in this last line of my for-loop, I want the function applied to the objects of name i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
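If you want to avoid get and assign altogether, one option is to keep the data frames in a named list and apply the function with lapply(). A minimal sketch, assuming toy data in place of the imported files; strip_handle is a simplified, hypothetical stand-in for handlestripper() and uses base R's sub() rather than str_match():

```r
# Toy stand-ins for the imported files; real code would call read.delim() on each path
raw <- list(
  file1 = data.frame(URL = "https://twitter.com/alice/status/1", stringsAsFactors = FALSE),
  file2 = data.frame(URL = "https://twitter.com/bob/status/2", stringsAsFactors = FALSE)
)

# Simplified, hypothetical version of handlestripper(): extract the handle from the URL
strip_handle <- function(data) {
  data$handle <- sub(".*com/([^/]+)/status.*", "\\1", data$URL)
  data  # return the modified data frame explicitly
}

# lapply() keeps the list names, so each result stays attached to its file name
cleaned <- lapply(raw, strip_handle)

print(names(cleaned))
print(cleaned$file1$handle)
```

Each cleaned data frame is then available as cleaned[["file1"]] and so on, without any objects being created by name.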