R Save Files Into Array

I have multiple files (.txt and .csv) that I would like to combine into a single saved object. They are all very different tables, so I would essentially like to have about 30 sheets of different tables, and then be able to save that one file and index it later.
I've had trouble finding an efficient way to do this, as most of my searches have ended up at merge(), which isn't applicable since each file in this collection is unique.
The biggest issue is that each data frame is different, varying in names of columns and number of rows, unlike similar questions that have been asked.
What's the best way to combine the tables I have into one array, and save it?
EDIT:
To add some more detail, I have essentially three different kinds of data frames from multiple different files:
.csv files with table headers "X" "gene" "baseMean" "log2FoldChange" "lfcSE" "stat"
"pvalue" "padj" "TuLob" "TuDu"
one kind of .txt files with headers "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id"
"band" "chromosome_name" "start_position" "end_position"
"transcript_start" "transcript_end" "description" "go_id"
"name_1006" "transcript_source" "status"
and a second kind of .txt files with headers "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id"
"band" "chromosome_name" "start_position" "end_position"
"transcript_start" "transcript_end" "description" "name_1006"
"transcript_source" "status"
Again, I'm not trying to merge these tables, just save them in a stack or three-dimensional array as one file, to be opened and indexed later.

I think what you want to do is use the save function to save the data in R's internal format.
df1 <- data.frame(x=rnorm(100))
df2 <- data.frame(y=rnorm(10), z=rnorm(10))
Gives us two data frames with different columns, rows, etc.
save(df1, df2, file="agg.RData")
Saves it to agg.RData
You can later do a
load("agg.RData")
head(df1)
...
See also attach, which does what load does, only lazily, so it will only load the objects once you try to access them.
Finally, you can get some measure of isolation by specifying an environment for load:
agg <- new.env()
load("agg.RData", agg)
head(agg$df1)
...
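If you would rather keep everything in a single object, a named list saved with saveRDS() works too (a sketch, assuming base R only; the file name agg.rds is arbitrary):

```r
# Collect heterogeneous data frames in one named list
dfs <- list(df1 = data.frame(x = rnorm(100)),
            df2 = data.frame(y = rnorm(10), z = rnorm(10)))

# saveRDS stores a single object; readRDS returns it directly
saveRDS(dfs, file = "agg.rds")

# Later: read it back and index by name
dfs2 <- readRDS("agg.rds")
head(dfs2$df1)
```

Unlike load(), readRDS() does not write names into your workspace; you assign the result to whatever name you like.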

Related

Export .RData into a CSV in R

I want to export fake.bc.Rdata from the package "qtl" into a CSV. Running summary() shows it is an object of class "cross", which is why I fail to convert it. I also tried resave, but I get the warning: cannot coerce class ‘c("bc", "cross")’ to a data.frame.
Thank you all for your help in advance!
CSV stands for comma-separated values, and is not suitable for all kinds of data.
As indicated in the comments, it requires clear columns and rows.
Take this JSON as an example:
{
  "name": "John",
  "age": 30,
  "likes": ["Walking", "Running"]
}
If you were to represent this in CSV format, how would you deal with the difference in length? One way would be to repeat the data:
name,age,likes
John,30,Walking
John,30,Running
But that doesn't really look right. Even if you merge the two values into one, you would still have trouble reading the data back, e.g.
name,age,likes
John,30,Walking/Running
Thus, CSV is best suited for tidy data.
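In R terms, the repeated-data ("long") version above can be built directly (a sketch; the names and values come from the JSON example):

```r
# One row per (name, like) pair: the repeated-data layout.
# R recycles the length-1 name and age across both rows.
john <- data.frame(name = "John", age = 30,
                   likes = c("Walking", "Running"))

write.csv(john, "john.csv", row.names = FALSE)
read.csv("john.csv")
```

This is the tidy layout: every cell holds exactly one value, at the cost of repeating name and age.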
TL;DR
Can your data be represented tidily as comma-separated values, or should you be looking at alternative forms of exporting your data?
EDIT:
It appears you do have some options:
If you look at the reference, you have the option to export your data using write.cross().
For your data, you could use write.cross(fake.bc, "csv", "myCrossData", c(1,5,13)). It then does the following:
Comma-delimited formats: a single csv file is created in the formats
"csv" or "csvr". Two files are created (one for the genotype data and
one for the phenotype data) for the formats "csvs" and "csvsr"; if
filestem="file", the two files will be names "file_gen.csv" and
"file_phe.csv".

Creating temporary data frames in R

I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning, and after each successive workbook processing, append it. How do I create such a temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?
As @James said, you do not need an empty data frame or tempfile; simply rbind each newly processed data frame onto the first. Here is an example (based on csv, but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
rbind(pre_processor(read.csv('2.csv'))) %>%
...
Now, if you have a lot of files or very complicated pre_processing, you might have further questions (e.g. how to loop over the list of files, or how to write the right pre_processing function), but those should be separate questions, and we would really need more specifics (example data, code so far, etc.).
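For many files, the same logic is usually written as a loop rather than a hand-typed pipe (a sketch; the file names and the identity pre_processor are placeholders):

```r
# Create two small example files so the sketch is self-contained
write.csv(data.frame(a = 1:2), "1.csv", row.names = FALSE)
write.csv(data.frame(a = 3:4), "2.csv", row.names = FALSE)

pre_processor <- function(df) df  # stand-in for the real processing

files <- c("1.csv", "2.csv")
# Read, process, and stack every file in one step
dataframe <- do.call(rbind, lapply(files, function(f) pre_processor(read.csv(f))))
```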

How to convert list of strings to list of objects or list of dataframes in R

I have written a program in R that takes all of the .csv files in a folder and imports them as data frames with the naming convention "main1," "main2," "main3" and so on for each data frame. The number of files in the folder may vary, so I was hoping the convention would make it easier to join the files later by being able to paste together the number of records. I successfully coded a way to find the folder and identify all of the files, as well as the total number of files.
agencyloc <- dirname(file.choose())
setwd(agencyloc)
listagencyfiles <- list.files(pattern = "*.csv")
numagencies <- 1:length(listagencyfiles)
I then created the individual dataframes without issue. I am not including this because it is long and does not relate to my problem. The problem is when I try to rbind these dataframes into one large dataframe, it says "Input to rbindlist must be a list of data.tables." Since there will be varying numbers of files, I can't just hard code this in, it has to be something similar to this. I tried the following, but it creates a list of strings and not a list of objects:
allfiles <- paste0("main", 1:length(numagencies))
However, this outputs a list of strings that can't be used to bind the files. Is there a way to change the data type from character strings to objects so that this will work when executed:
finaltable <- rbindlist(allfiles)
What I am looking for would almost be the opposite of as.character(objectname) if that makes any sense. I need to go from character to object instead of object to character.
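What the question describes is essentially mget(), which looks up objects by the names in a character vector and returns the objects themselves in a list (a sketch with two toy data frames; the real code would pass the list to data.table::rbindlist):

```r
# Two data frames following the "main" naming convention
main1 <- data.frame(x = 1:2)
main2 <- data.frame(x = 3:4)

allfiles <- paste0("main", 1:2)    # character vector of names
dfs <- mget(allfiles)              # list of the objects, not strings
finaltable <- do.call(rbind, dfs)  # or data.table::rbindlist(dfs)
```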

R: give data frames new names based on contents of their current name

I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string, may be different depending on where the data is from, though its position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files<- basename(list.files( pattern = ".csv",recursive = TRUE,full.names= FALSE))
sfsplit<- lapply(site.files, function(x) strsplit(x, ".", fixed =T)[[1]])
for (i in 1:length(site.files)) assign(site.files[i],read.csv(site.files[i]))
for (i in 1:length(site.files))
if (sfsplit[[i]][10]==grep("PARQL", sfsplit[[i]][10]))
{assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])}
else if (sfsplit[[i]][10]==grep("IRBT", sfsplit[[i]][10]))
{assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
...and so on for each data frame I'd like to eventually plot from. Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire filename as it was read into R. Is there something like data.frame[1] to generically refer to the 1st data frame in the global environment?
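As a sketch of the strsplit()/assign() idea (the file name is the example from the question; the meas column is made up for illustration):

```r
fname <- "SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv"
parts <- strsplit(fname, ".", fixed = TRUE)[[1]]
parts[10]  # "WDSD_30"

# Name a data frame after the 10th field of the file name
assign(parts[10], data.frame(meas = 1:3))

# Look the object up again by name, e.g. before plotting its meas column
plot_data <- get(parts[10])
```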

Join two dataframes before exporting as .csv files

I am working on a large questionnaire - and I produce summary frequency tables for different questions (e.g. df1 and df2).
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
c<-c(1,1,5,2,1)
Percent<-c(10,10,50,20,10)
df2<-data.frame(a,c,Percent)
rm(a,b,c,Percent)
I normally export the dataframes as csv files using the following command:
write.csv(df1, file="df1.csv")
However, as my questionnaire has many questions and therefore dataframes, I was wondering if there is a way in R to combine different dataframes (say with a line separating them), and export these to a csv (and then ultimately open them in Excel)? When I open Excel, I therefore will have just one file with all my question dataframes in, one below the other. This one csv file would be so much easier than having individual files which I have to open in turn to view the results.
Many thanks in advance.
If your end goal is an Excel spreadsheet, I'd look into some of the tools available in R for directly writing an xls file. Personally, I use the XLConnect package, but there is also xlsx and also several write.xls functions floating around in various packages.
I happen to like XLConnect because it allows for some handy vectorization in situations just like this:
require(XLConnect)
#Put your data frames in a single list
# I added two more copies for illustration
dfs <- list(df1,df2,df1,df2)
#Create the xls file and a sheet
# Note that XLConnect doesn't seem to do tilde expansion!
wb <- loadWorkbook("/Users/jorane/Desktop/so.xls",create = TRUE)
createSheet(wb,"Survey")
#Starting row for each data frame
# Note the +1 to get a gap between each
n <- length(dfs)
rows <- cumsum(c(1,sapply(dfs[1:(n-1)],nrow) + 1))
#Write the file
writeWorksheet(wb,dfs,"Survey",startRow = rows,startCol = 1,header = FALSE)
#If you don't call saveWorkbook, nothing will happen
saveWorkbook(wb)
I specified header = FALSE since otherwise it would write the column headers for each data frame. Adding a single header row at the top of the xls file afterwards isn't much additional work.
As James commented, you could use
merge(df1, df2, by="a")
but that would combine the data horizontally. If you want to combine them vertically you could use rbind:
rbind(df1, df2, df3,...)
(Note: the column names need to match for rbind to work).
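With the df1 and df2 from the question, the columns b and c differ, so rbind() needs a rename first (a sketch):

```r
df1 <- data.frame(a = 1:5, b = c(4, 3, 2, 1, 1), Percent = c(40, 30, 20, 10, 10))
df2 <- data.frame(a = 1:5, c = c(1, 1, 5, 2, 1), Percent = c(10, 10, 50, 20, 10))

# rbind() fails while the names differ; align them first
names(df2)[names(df2) == "c"] <- "b"
stacked <- rbind(df1, df2)
nrow(stacked)  # 10
```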
