So, I have a .tsv file of human variants.
I need to store in a data.frame all the rows of this file with a precise name and save them in another file. I'm trying with this script:
data = read.table(file.choose(), sep = '\t', header = TRUE)
variant = readline("Insert variant:")
store <- data.frame(matrix(NA, ncol = ncol(data)))
colnames(store) = colnames(data)
for (i in 1:nrow(data))
{
if (data[i,3] == variant)
{
store[i,] = as.data.frame(data[i,], stringsAsFactors = FALSE)
}
}
But because I initialised the data.frame from a matrix of NA, it only stores numeric data, of course. Any ideas on how I can solve this, and how I can write the output of the loop directly to a .tsv file?
If discarding the rows would work, what you need is a subset, something like:
store <- data[ data[[3]] == variant, ]
data[[3]] here looks at the third column, which we compare to variant. So we subset data by taking only those rows where that third column matches variant.
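For the second half of the question (writing the result to a .tsv file), base R's write.table() can be used. The following is only a minimal sketch: the toy data frame, its column names and the variant "rs1" are made up for illustration, and a temporary file stands in for the real output path.

```r
# Toy stand-in for the variants table; the column names are made up for illustration
data <- data.frame(chr = c("1", "2", "X"),
                   pos = c(100L, 200L, 300L),
                   name = c("rs1", "rs2", "rs1"),
                   stringsAsFactors = FALSE)
variant <- "rs1"

# Keep only the rows whose third column matches the variant
store <- data[data[[3]] == variant, ]

# Write the result as a tab-separated file (a temporary path is used here)
out_file <- tempfile(fileext = ".tsv")
write.table(store, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to check the round trip
check <- read.table(out_file, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
```

If the third column can contain NA, wrapping the condition in which() avoids getting rows full of NA in the subset.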
Beginner here, I have 31 excel files that I want to extract the data frame. I want a R loop to read all the files, then only take 2 columns and change the column names. Then I want to combine the files based on the same row.
This is my attempt:
files = list.files(path=".", pattern="xls")
for (i in 1:length(files)){
table = data.frame(readxl::read_xls(files[i]), stringsAsFactors=FALSE)
table = table[,c(1,3)]
colnames(table) = c("UR",paste0("Zscore",i))
}
The problem is that I don't know how to code it so that each individual file is kept. This code only keeps the last file. I tried googling all night and couldn't figure it out.
I also tried assign() but I don't know how to modify the tables within assign as part of the loop.
files = list.files(pattern="*.xls")
for (i in 1:length(files))assign(files[i], data.frame(readxl::read_xls(files[i])))
I want the columns of the combined table to end up like UR, Zscore1, Zscore2, Zscore3...
So instead I did it manually like this:
table1 = data.frame(readxl::read_xls(files[1]), stringsAsFactors=FALSE)
table1 = table1[,c(1,3)]
colnames(table1) = c("UR",paste0("Zscore",1))
table2 = data.frame(readxl::read_xls(files[2]), stringsAsFactors=FALSE)
table2 = table2[,c(1,3)]
colnames(table2) = c("UR",paste0("Zscore",2))
tableA = merge(table1,table2, all.x = T)
table3 = data.frame(readxl::read_xls(files[3]), stringsAsFactors=FALSE)
table3 = table3[,c(1,3)]
colnames(table3) = c("UR",paste0("Zscore",3))
tableA = merge(tableA,table3, all.x = T)
Try this approach with lapply and Reduce :
files = list.files(path=".", pattern="xls")
Reduce(function(x, y) merge(x, y, all.x = T, by = 'UR'),
lapply(seq_along(files), function(i) {
data <- readxl::read_xls(files[i])
data <- data[c(1, 3)]
names(data) <- c('UR', paste0('Zscore', i))
data
})) -> result
result
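To see what Reduce() does with merge() here, it may help to run it on a few small tables instead of Excel files. This is just an illustrative sketch with made-up values: each table plays the role of one file after the column selection and renaming.

```r
# Three toy tables standing in for the per-file results (values are made up)
tables <- list(
  data.frame(UR = c("a", "b", "c"), Zscore1 = c(1.0, 2.0, 3.0)),
  data.frame(UR = c("a", "b"), Zscore2 = c(0.5, 1.5)),
  data.frame(UR = c("b", "c"), Zscore3 = c(-1.0, -2.0))
)

# Reduce() merges the first two tables, then merges that result with the third, and so on
result <- Reduce(function(x, y) merge(x, y, by = "UR", all.x = TRUE), tables)
```

Because of all.x = TRUE, every UR of the first table is kept, and a UR missing from a later table simply gets NA in that table's Zscore column.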
The main problem is that you are not assigning the table to anything, so you are rebuilding the table on every iteration.
In every iteration, you should store the created table as the corresponding element [[i]] of a list, using the assignment operator <-
Maybe something like this will work:
files <- list.files(path=".", pattern="xls")
list_of_tables <- vector(mode = "list", length = length(files))
for (i in seq_along(files)){
# [[i]] stores the whole two-column table as one list element
list_of_tables[[i]] <- data.frame(readxl::read_xls(files[i]), stringsAsFactors=FALSE)[,c(1,3)]
names(list_of_tables[[i]]) <- c("UR",paste0("Zscore",i))
}
Then, if you want to combine the whole list into a single data frame, you can use cbind (this only lines up correctly if every table has the same rows in the same order), as in:
my_data_frame<-do.call(cbind, list_of_tables)
Otherwise just keep it as a list
In R, I have a list of input files, which are data frames.
Now I want to subset them based on the gene given in one of the columns.
I am used to doing everything repetitively for every sample I have, but I want to make the code smoother and shorter, which is giving me some problems.
How I have done it before:
GM04284 <- read.table("GM04284_methylation_results_hg37.txt", header = TRUE)
GM04284_HTT <- subset(GM04284[GM04284$target == "HTT",])
GM04284_FMR1 <- subset(GM04284[GM04284$target == "fmr1",])
How I want to do it now:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
paste(sample,"_HTT", sep = "") <- subset(data[data$target == "HTT",])
paste(sample,"_FMR1", sep = "") <- subset(data[data$target == "fmr1",])
}
The subset part is what is causing me problems.
How can I make a new variable name that looks like the output of paste(sample,"_HTT", sep = "") and which can be taken as the name for the new subset table?
Thanks in advance, your help is very appreciated.
Are you sure you need to create a new variable for each data frame? If you're going to treat them all in the same way later, it might be better to use something more uniform and better organized.
One alternative is to keep them all in the list:
input_files = list.files(pattern = "_methylation_results_hg37.txt")
res_list <- list()
for (file in input_files){
# Define sample and gene from input file
sample = strsplit(file, split = "_")[[1]][1]
# read input
data = read.table(file, header = T, na.strings = "NA")
# subset input into gene specific tables
res_list[[paste0(sample,"_HTT")]] <- data[data$target == "HTT", ]
res_list[[paste0(sample,"_FMR1")]] <- data[data$target == "fmr1",]
}
Then you can address them as members of this list, like res_list$GM04284_HTT (or, equivalently, res_list[['GM04284_HTT']])
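One practical benefit of the list is that the same operation can be applied to every table in one call. A small sketch, using a hand-built list in place of the real res_list (the meth column and all values are invented for illustration):

```r
# Toy list standing in for res_list; names and values are made up
res_list <- list(
  GM04284_HTT  = data.frame(target = c("HTT", "HTT"), meth = c(0.1, 0.3)),
  GM04284_FMR1 = data.frame(target = "fmr1", meth = 0.8)
)

# Apply the same operation to every table
n_rows <- sapply(res_list, nrow)
mean_meth <- sapply(res_list, function(d) mean(d$meth))

# Pick out all tables for one gene by their names
htt_tables <- res_list[grepl("_HTT$", names(res_list))]
```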
Vasily makes a good point in the answer above. It would indeed be tidier to have each dataframe contained within a list.
Nonetheless, you could use assign() if you really wanted to create a new dynamic variable:
assign(paste0(sample,"_HTT"), subset(data[data$target == "HTT",]), envir = .GlobalEnv)
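A short sketch of what working with such dynamically named variables looks like, with a made-up sample name and toy data. Note that the variable then has to be retrieved by name with get(), which is part of why the list approach tends to be easier to handle.

```r
sample <- "GM04284"  # hypothetical sample name
data <- data.frame(target = c("HTT", "fmr1", "HTT"), meth = c(0.1, 0.8, 0.3))

# Create a variable whose name is built at run time
var_name <- paste0(sample, "_HTT")
assign(var_name, data[data$target == "HTT", ])

# The variable has to be looked up by name again with get()
htt <- get(var_name)
```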
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could generate a data frame by first using lapply() (similar to option 2) to generate a result list (whose elements have different lengths) and then padding the shorter elements with NAs before transforming the result into a data frame again.
#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out)) #cbind alone returns a matrix
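The padding step can be tried in isolation. The sketch below uses two hand-written vectors (made up for illustration) in place of the per-file results, so that the effect of `length<-` is easy to see.

```r
# Vectors of unequal length, standing in for the per-file results
out <- list(file1 = c("start", "dat1", "end"),
            file2 = c("start", "dat2", "dat2", "dat2", "end"))

# Extending a vector with `length<-` fills the new positions with NA
padded <- lapply(out, `length<-`, max(lengths(out)))

# Equal lengths now, so the columns can be bound side by side
outDF <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)
```

Here file1 ends up with two trailing NA values, while file2 is left unchanged.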
I am trying to let the user define how many drugs' data they want to upload for a specific therapy. Based on that number, my function should let the user select data for that many drugs and store them in variables, e.g. drug_1_data, drug_2_data, etc.
I have written some code, but it doesn't work.
Could someone please help?
no_drugs <- readline("how many drugs for this therapy? Ans:")
i=0
while(i < no_drugs) {
i <- i+1
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add) # caption describes data for which drug
file_name <- noquote(paste("drug", i, "data", sep = "_")) # to create variable that will save uploaded .csv file
file_name <- read.csv(mydata[i],header=TRUE, sep = "\t")
}
In your example, mydata is a one-element character vector, so subsetting it with i greater than 1 returns NA. Furthermore, in your first assignment of file_name you set it to a non-quoted character vector but then overwrite it with data (and in every iteration of the loop you lose the data you created in the previous step). I think what you wanted was something more along the lines of:
file_name <- paste("drug", i, "data", sep = "_")
assign(file_name, read.delim(mydata, header=TRUE))
# I changed the function to read.delim since the separator is a tab
However, I would also recommend thinking about putting all the data in a list (it might be easier to apply operations to multiple drug data frames that way), using something like this:
n_drugs <- as.numeric(readline("how many drugs for this therapy? Ans:"))
drugs <- vector("list", n_drugs)
for(i in seq_len(n_drugs)) {
caption_to_add <- paste("drug",i, sep = "_")
mydata <- choose.files( caption = caption_to_add)
drugs[[i]] <- read.delim(mydata,header=TRUE)
}
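Since choose.files() is interactive (and only available on Windows), the list approach is easier to try out with files created on the fly. The following sketch writes two small tab-separated files to a temporary folder; the file names and the dose/response columns are invented for illustration. A data frame is itself a list of columns, so it has to be stored with [[i]] to end up as a single list element.

```r
# Two small tab-separated files written to a temp dir stand in for the user's choices
paths <- file.path(tempdir(), paste0("drug_", 1:2, "_data.tsv"))
for (i in seq_along(paths)) {
  write.table(data.frame(dose = c(1, 2), response = c(0.1 * i, 0.2 * i)),
              file = paths[i], sep = "\t", quote = FALSE, row.names = FALSE)
}

drugs <- vector("list", length(paths))
for (i in seq_along(paths)) {
  # [[i]] stores the whole data frame as one list element
  drugs[[i]] <- read.delim(paths[i], header = TRUE)
}
names(drugs) <- paste("drug", seq_along(drugs), "data", sep = "_")

# Each table is then available by name
first <- drugs$drug_1_data
```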
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame,
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
But I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
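f <- list.files(fpath)
# keep only names ending in .csv (the dot must be escaped twice inside an R string)
fcsv <- f[grepl("\\.csv$",f)]
# file.path() inserts the separator between folder and file name
fcsv <- file.path(fpath,fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
# drop = FALSE keeps a data frame even when only one column is selected
lapply(rel_csv_list,function(x) colMeans(x[,target_cols, drop = FALSE]))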
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
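# header=FALSE so the columns get the default names V1, V2, ... used below
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Skip the file if it has fewer than 3 rows of data (print() returns a
# string, which is filtered out when the list is bound below)
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
# characters 7 to 23 cover "yyyymmdd hh_mm_ss" in "Moist yyyymmdd hh_mm_ss.csv"
data.frame(timestamp=substr(f, 7, 23),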
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
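The difference between the two is easy to check on its own. In this small sketch the timestamp and the numbers are made up; the point is only that c() coerces everything to one type, while data.frame() keeps one type per column.

```r
# c() coerces everything to the most general type, here character
as_vector <- c("20150101 12_00_00", round(45.678, 1), round(21.234, 1))
class(as_vector)   # "character"

# data.frame() keeps one type per column
as_row <- data.frame(timestamp = "20150101 12_00_00",
                     humidity = round(45.678, 1),
                     temp = round(21.234, 1),
                     stringsAsFactors = FALSE)

# Rows built this way stack cleanly and stay numeric
mdf <- rbind(as_row, as_row)
```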
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)),
nrow = 1, ncol = 3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
However, I could not get it to work as a function: mdf then ended up with only one row, containing the data of the last file. Somehow each iteration overwrote row 1 instead of adding a row. Using it without a function wrapper worked fine, though...
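The cause of the function problem cannot be confirmed from the post, but one thing worth checking is that the function actually returns mdf: a data frame built inside a function is local and is lost unless it is the function's return value. Below is a hedged sketch of a function version; summarise_files is a hypothetical helper that takes already-loaded tables (toy data here) instead of reading files, so that it runs on its own.

```r
# Hypothetical helper: summarise a list of already-loaded tables
summarise_files <- function(tables, titles) {
  mdf <- data.frame()
  for (i in seq_along(tables)) {
    if (nrow(tables[[i]]) < 3) next            # skip tables with too few rows
    newrow <- data.frame(timestamp = titles[i],
                         humidity = round(mean(tables[[i]]$V2, na.rm = TRUE), 1),
                         temp = round(mean(tables[[i]]$V4, na.rm = TRUE), 1),
                         stringsAsFactors = FALSE)
    mdf <- rbind(mdf, newrow)
  }
  mdf  # the last expression is the return value; without it the result is lost
}

# Toy input: the second table has fewer than 3 rows and is skipped
tables <- list(data.frame(V2 = c(40, 41, 42), V4 = c(20, 21, 22)),
               data.frame(V2 = c(50, 51), V4 = c(25, 26)),
               data.frame(V2 = c(60, 61, 62), V4 = c(30, 31, 32)))
result <- summarise_files(tables, c("t1", "t2", "t3"))
```

The caller then assigns the returned data frame, as in the last line, rather than expecting mdf to exist after the call.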