Repeating tk_choose.files to import multiple .csv files multiple times - r

I am using sapply(tk_choose.files) to produce an interactive window where I can choose which .csv files (multiple) to import. I then do some basic data manipulation so that the mean of one particular column can be plotted using ggplot.
So far my code looks something like this:
>tfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose T files
(hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
>rfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose R files
(hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
I have then calculated the mean of a particular column for both tfiles and rfiles so that I could plot 100-tfiles-rfiles.
While this is working fine for one set of data, I would like to now import more sets of data, preferably also using sapply(tk_choose.files). Essentially I need to get t/rfiles1, t/rfiles2...and repeat the data manipulation process after that, so that I could get a plot of multiple sets of data. I have no idea how to do this without having to copy and paste my code!
Sorry if this is a stupid question, I am very new to R so I am really stuck, your help is greatly appreciated!

Assuming that the files in the working directory are as follow:
all.files<-list.files(pattern="\\.csv")
all.files
[1] "R01.csv" "R02.csv" "R03.csv" "R04.csv" "T01.csv" "T02.csv" "T03.csv" "T04.csv"
And you wish to call tfiles1 as merged data of T01 and T02; tfiles2 as merged data of T03 and T04
T <- grep("T", all.files, value=T)
T
[1] "T01.csv" "T02.csv" "T03.csv" "T04.csv"
t.list <- list(T[1:2], T[3:4])
all.T <- lapply(t.list, function(x)ldply(x, read.csv))
for (i in 1:length(all.T)) assign(paste0("tfiles", i), all.T[[i]]) #this will produce tfiles1 and tfiles2 in your R environment.

Related

Extracting a single cell value from multiple csv files in R and

I have 500 csv. files with data that looks like:
sample data
I want to extract one cell (e.g. B4 or 0.477) per a csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this
all.fi <- list.files("/path/to/csvfiles", pattern=".csv", full.names=TRUE) # store names of csv files in path as a string vector
library(readr) # package for read_lines and write_lines
ans <- sapply(all.fi, function(i) { eachline <- read_lines(i, n=4) # read only the 4th line of the file
ans <- unlist(strsplit(eachline, ","))[2] # split the string on commas, then extract the 2nd element of the resulting vector
return(ans) })
write_lines(ans, "/path/to/output.csv")
I can not add a comment. So, I will write my comment here.
Since your data is very large and it is very difficult to load it individually, then try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For second part, try this:
You can save your data as a data.frame (as with the comment of #Bruno Zamengo) and then you can use select and merge functions in R. Then, you can easily combine them in single csv file. With select and merge functions you can select all the values you need and them combine them. I used this idea in my project. Do not forget to use lapply.

Populating a matrix (or a DF or a DT) with a loop from a folder containing txt files

I wrote my first code in R for treating some spectra [basically .txt files with a Xcol (wavelength) and Ycol (intensity)].
The code works for single files, provided I write the file name in the code. Here the code working for the first file HKU47_PSG_1_LW_0.txt.
setwd("C:/Users/dd16722/R/Raman/Data")
# import Spectra
PSG1_LW<-read.table("HKU47_PSG_1_LW_0.txt")
colnames(PSG1_LW)[colnames(PSG1_LW)=="V2"] <- "PSG1_LW"
PSG2_LW<-read.table("HKU47_PSG_2_LW_all_0.txt")
colnames(PSG2_LW)[colnames(PSG2_LW)=="V2"] <- "PSG2_LW"
#Plot 2 spectra and define the Y range
plot(PSG1_LW$V1, PSG1_LW$PSG1_LW, type="l",xaxs="i", yaxs="i", main="Raman spectra", xlab="Raman shift (cm-1)", ylab="Intensity", ylim=range(PSG1_LW,PSG2_LW))
lines(PSG2_LW$V1, PSG2_LW$PSG2_LW, col=("red"), yaxs="i")
# Temperature-excitation line correction
laser = 532
PSG1_LW_corr <- PSG1_LW$PSG1_LW*((10^7/laser)^3*(1-exp(-6.62607*10^(-34)*29979245800*PSG1_LW$V1/(1.3806488*10^(-23)*293.15)))*PSG1_LW$V1/((10^7/laser)-PSG1_LW$V1)^4)
PSG1_Raw_Corr <-cbind (PSG1_LW,PSG1_LW_corr)
lines(PSG1_LW$V1, PSG1_LW_corr, col="red")
plot(PSG1_LW$V1, PSG1_Raw_Corr$PSG1_LW_corr, type="l",xaxs="i", yaxs="i", xlab="Raman shift (cm-1)", ylab="Intensity")
Now, it's time for another little step forward. In the folder, there are many spectra (in the code above I reported the second one: HKU47_PSG_2_LW_all_0.txt) having again 2 columns, same length of the first file. I suppose I should merge all the files in a matrix (or DF or DT).
Probably I need a loop as I need a code able to check automatically the number of files contained in the folder and ultimately to create an object with several columns (i.e. the double of the number of the files).
So I started like this:
listLW <- list.files(path = ".", pattern = "LW")
numLW <- as.integer(length(listLW))
numLW represents the number of iterations I need to set. The question is: how can I populate a matrix (or DF or DT) in order to have in the first 2 columns the first txt file in my folder, then the second file in the 3rd and 4th columns etc? Considering that I need to perform some other operations as I showed above in the code.
I have been reading about loop in R since yestarday but actually could not find the best and easy solution.
Thanks!
You could do something like
# Load data.table library
require(data.table)
# Import the first file
DT_final <- fread(file = listLW[1])
# Loop over the rest of the files and use cbind to merge them into 1 DT
for(file in setdiff(listLW, listLW[1])) {
DT_temp <- fread(file)
DT_final <- cbind(DT_final, DT_temp)
}

Replace values within dataframe with filename while importing using read.table (R)

I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or less - some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, function(x){
a = read.table(x, header=TRUE);
a$ID_Column=x
return(a)
}
)

R - Creating subsets of several datasets in a loop

I have a quite big number of quite heavy datasets. I would like to extract a subset out of each of them and save it into different csv files (one for each dataset). These are the commands I would like to loop for all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would appreciate a lot help with this!
EDIT:
Following #SimonShine's suggestion, I have run this code and it works!
You don't specify if you are trying to collect the subsets into one dataset, or if you are trying to make one file per subset. You refer to OPP_1985 that appears out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
df <- read.csv(csvfile, header=F, stringsAsFactors=T, sep="\t")
df_short <- df[df$V6 == "OPP",]
csvfile_short <- gsub(".csv", "_short.csv", csvfile)
write.csv(df_short, file=csvfile_short, row_names=F)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles")
for (csvfile in csvfiles) {
extract_and_save_from_dataset(csvfile)
}
The data.table approach is probably the fastest option, specially if you have a large dataset. The function fwrite{data.table} works in parallel using many CPUS, making it extremely fast.
Here is how you can divide your original data according to subgroups defined based on the values of df$V6 and save each subset into a separate .csv file.
library (data.table)
set(df)[, fwrite(.SD, paste0("output_", V6,".csv")), by = V6, .SDcols=names(df) ]
ps. The name of the files will be output_*.csv where * is the correspondent V6 value.

How to not overwrite file in R

I am trying to copy and paste tables from R into Excel. Consider the following code from a previous question:
data <- list.files(path=getwd())
n <- length(list)
for (i in 1:n)
{
data1 <- read.csv(data[i])
outline <- data1[,2]
outline <- as.data.frame(table(outline))
print(outline) # this prints all n tables
name <- paste0(i,"X.csv")
write.csv(outline, name)
}
This code writes each table into separate Excel files (i.e. "1X.csv", "2X.csv", etc..). Is there any way of "shifting" each table down some rows instead of rewriting the previous table each time? I have also tried this code:
output <- as.data.frame(output)
wb = loadWorkbook("X.xlsx", create=TRUE)
createSheet(wb, name = "output")
writeWorksheet(wb,output,sheet="output",startRow=1,startCol=1)
writeNamedRegion(wb,output,name="output")
saveWorkbook(wb)
But this does not copy the dataframes exactly into Excel.
I think, as mentioned in the comments, the way to go is to first merge the data frames in R and then writing them into (one) output file:
# get vector of filenames
filenames <- list.files(path=getwd())
# for each filename: load file and create outline
outlines <- lapply(filenames, function(filename) {
data <- read.csv(filename)
outline <- data[,2]
outline <- as.data.frame(table(outline))
outline
})
# merge all outlines into one data frame (by appending them row-wise)
outlines.merged <- do.call(rbind, outlines)
# save merged data frame
write.csv(outlines.merged, "all.csv")
Despite what microsoft would like you to believe, .csv files are not excel files, they are a common file type that can be read by excel and many other programs.
The best approach depends on what you really want to do. Do you want all the tables to read into a single worksheet in excel? If so you could just write to a single file using the append argument to the write.csv or other functions. Or use a connection that you keep open so each new one is appended. You may want to use cat to put a couple of newlines before each new table.
Your second attempt looks like it uses the XLConnect package (but you don't say, so it could be something else). I would think this the best approach, how is the result different from what you are expecting?

Resources