Join two dataframes before exporting as .csv files - r

I am working on a large questionnaire - and I produce summary frequency tables for different questions (e.g. df1 and df2).
a <- c(1:5)
b <- c(4,3,2,1,1)
Percent <- c(40,30,20,10,10)
df1 <- data.frame(a, b, Percent)
c <- c(1,1,5,2,1)
Percent <- c(10,10,50,20,10)
df2 <- data.frame(a, c, Percent)
rm(a, b, c, Percent)
I normally export the dataframes as csv files using the following command:
write.csv(df1, file="df1.csv")
However, as my questionnaire has many questions and therefore many dataframes, I was wondering if there is a way in R to combine the different dataframes (say, with a blank line separating them) and export them to a single csv file that I can then open in Excel. I would then have just one file with all my question dataframes in it, one below the other. One csv file would be much easier to work with than opening individual files in turn to view the results.
Many thanks in advance.

If your end goal is an Excel spreadsheet, I'd look into some of the tools available in R for directly writing an xls file. Personally, I use the XLConnect package, but there is also xlsx, as well as several write.xls functions floating around in various packages.
I happen to like XLConnect because it allows for some handy vectorization in situations just like this:
require(XLConnect)
#Put your data frames in a single list
# I added two more copies for illustration
dfs <- list(df1,df2,df1,df2)
#Create the xls file and a sheet
# Note that XLConnect doesn't seem to do tilde expansion!
wb <- loadWorkbook("/Users/jorane/Desktop/so.xls",create = TRUE)
createSheet(wb,"Survey")
#Starting row for each data frame
# Note the +1 to get a gap between each
n <- length(dfs)
rows <- cumsum(c(1,sapply(dfs[1:(n-1)],nrow) + 1))
#Write the file
writeWorksheet(wb,dfs,"Survey",startRow = rows,startCol = 1,header = FALSE)
#If you don't call saveWorkbook, nothing will happen
saveWorkbook(wb)
I specified header = FALSE since otherwise it would write the column header for each data frame. Adding a single header row at the top of the xls file afterwards isn't much additional work.
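If you do want to stay with a single plain csv, as the question asks, here is a rough base-R sketch (reusing the dfs list from above; the output file name is made up):
#Append each table to one csv, with a blank line between tables
# write.table warns when appending column names, hence suppressWarnings
out <- "all_questions.csv"
if (file.exists(out)) file.remove(out)  # start fresh
for (df in dfs) {
  suppressWarnings(
    write.table(df, out, sep = ",", row.names = FALSE,
                col.names = TRUE, append = TRUE)
  )
  cat("\n", file = out, append = TRUE)  # blank separator line
}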

As James commented, you could use
merge(df1, df2, by="a")
but that would combine the data horizontally. If you want to combine them vertically you could use rbind:
rbind(df1, df2, df3,...)
(Note: the column names need to match for rbind to work).
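For the df1 and df2 in this question the names don't actually match (b vs c), so here is a sketch of one workaround, assuming the columns correspond positionally:
#Rename df2's columns to df1's before stacking (positional assumption!)
df2_renamed <- setNames(df2, names(df1))
combined <- rbind(df1, df2_renamed)
write.csv(combined, file = "combined.csv", row.names = FALSE)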

Related

Picking the last row only from each of 2000 csv files in the same directory and making a single dataframe in R

Using R, I want to pick only the last row from each of over 2000 csv files in the same directory
and make a single dataframe.
Directory = "C:\data"
File names are, for example, '123456_p' (a six-digit number)
Each csv has a different number of rows, but the same number of columns (10 columns)
I know the tail and list functions, but with over 2000 dataframes, entering them manually would be a waste of time.
Is there any way to do this with a loop in R?
As always, I really appreciate your help and support.
There are four things you need to do here:
1. Get all the filenames we want to read in
2. Read each in and get the last row
3. Loop through them
4. Bind them all together
There are many options for each of these steps, but let's use purrr for the looping and binding, and base-R for the rest.
Get all the filenames we want to read in
You can do this with the list.files() function.
filelist = list.files(pattern = '\\.csv$')
will generate a vector of filenames for all CSV files in the working directory. Edit as appropriate to specify the pattern further or target a different directory.
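Since the question's files live in C:\data, a sketch pointing it at that directory (full.names = TRUE keeps the path with each name so read.csv can find the files later):
filelist = list.files("C:/data", pattern = '\\.csv$', full.names = TRUE)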
Read each in and get the last row
The read.csv() function can read in each file (if you want it to go faster, use data.table::fread() instead), and as you mentioned tail() can get the last row. If you build a function out of this it will be easier to loop over, or change the process if it turns out you need another step of cleaning.
read_one_file = function(x) {
  tail(read.csv(x), 1)
}
Loop through them
Bind them all together
You can do both of these steps at once with map_df() in the purrr package.
library(purrr)
final_data = map_df(filelist, read_one_file)
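If you also want to record which file each row came from (not asked for here, but often handy), map_df() can add an id column when its input is named; a small sketch:
#set_names() names each element after itself; .id stores that name per row
final_data = map_df(set_names(filelist), read_one_file, .id = "file")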

Creating temporary data frames in R

I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning and, after each successive workbook is processed, append to it. How do I create such a temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?
As @James said, you do not need an empty data frame or tempfile; simply add newly processed data frames to the first data frame. Here is an example (based on csv, but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
  # do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
  rbind(pre_processor(read.csv('2.csv'))) %>%
  ...
Now if you have a lot of files or very complicated pre_processing, then you might have other questions (e.g. how to loop over the list of files, or how to write the right pre_processing function), but those should be separate questions, and we really need more specifics (example data, code so far, etc.).
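For completeness, a minimal sketch of the looping variant, assuming the list_of_files and pre_processor from above:
# read, pre-process, and stack every file without spelling out each rbind
dataframe <- do.call(rbind,
                     lapply(list_of_files,
                            function(f) pre_processor(read.csv(f))))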

R Save Files Into Array

I have multiple files that I would like to essentially merge (.txt and .csv). They are all very different tables, so I would like to have about 30 sheets of different tables, and then be able to save that one file and index it later.
I've had trouble trying to find the most efficient way to do this, as most of my searches have ended up trying to merge() files together, which isn't possible since this collection of data files is unique.
The biggest issue is that each data frame is different, varying in names of columns and number of rows, unlike similar questions that have been asked.
What's the best way to combine the tables I have into one array, and save it?
EDIT:
To add some more detail, I have essentially three different kinds of data frames from multiple different files:
.csv files with table headers "X" "gene" "baseMean" "log2FoldChange" "lfcSE" "stat"
"pvalue" "padj" "TuLob" "TuDu"
one kind of .txt files with headers "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id"
"band" "chromosome_name" "start_position" "end_position"
"transcript_start" "transcript_end" "description" "go_id"
"name_1006" "transcript_source" "status"
and a second kind of .txt files with headers "hgnc_symbol" "ensembl_gene_id" "ensembl_transcript_id" "ensembl_peptide_id"
"band" "chromosome_name" "start_position" "end_position"
"transcript_start" "transcript_end" "description" "name_1006"
"transcript_source" "status"
Again, I'm not trying to merge these tables, just save them in a stack or three-dimensional array as one file, to be opened and indexed later.
I think what you want to do is use the save function to save the data in R's internal format.
df1 <- data.frame(x=rnorm(100))
df2 <- data.frame(y=rnorm(10), z=rnorm(10))
Gives us two data frames with different columns, rows, etc.
save(df1, df2, file="agg.RData")
This saves them to agg.RData.
You can later do a
load("agg.RData")
head(df1)
...
See also attach, which does what load does, only lazily, so it will only load the objects once you try to access them.
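A quick sketch of that attach variant (note the file: prefix when detaching):
attach("agg.RData")   # objects become visible on the search path
head(df1)
detach("file:agg.RData")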
Finally, you can get some measure of isolation by specifying an environment for load:
agg <- new.env()
load("agg.RData", agg)
head(agg$df1)
...
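If you'd rather index the tables by name later, another option (a sketch, with made-up names) is to keep them in a single named list and save that one object:
tables <- list(genes_csv = df1, transcripts_txt = df2)  # hypothetical names
save(tables, file = "agg.RData")
# later:
load("agg.RData")
head(tables[["genes_csv"]])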

merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.
I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.
If someone could direct me on how to accomplish this, preferably in R, it would be a big help.
Alternatively, my ultimate goal is to read the median and mean of the "score" column from each file. So if this could be accomplished, with or without compiling the files, it would be even more helpful.
Thanks.
UPDATE:
As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.
I have tried lapply and Reduce, e.g.,
> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(files,function(score)
+ read.table(score,header=TRUE,row.names=1))
> File_list <- lapply(files,function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)
which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr
> files <- list.files()
> out_list <- llply(files,read.table)
As well as cbind and rbind. Usually I get an error message because the row numbers don't match up, or I just get all the "score" data compiled into one column.
The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.
I hope this clears things up.
This problem could be solved in two steps:
Step 1. Read the data from your csv files into a list of data frames, where files is a vector of file names. If you need to add extra arguments to read.csv, add them as shown below. See ?lapply for details.
list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)
Step 2. Calculate means for each data frame:
means <- sapply(list_of_dataframes, function(df) mean(df$score))
Of course, you can always do it in one step like this:
means <- sapply(files, function(filename) mean(read.csv(filename)$score))
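The question asks for medians as well; the same pattern covers both (reusing list_of_dataframes from Step 1):
medians <- sapply(list_of_dataframes, function(df) median(df$score))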
I think you want something like this:
all_data = do.call(rbind, lapply(files, function(f) {
  cbind(read.csv(f), file_name = f)
}))
You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.
E.g. once you have the above, you can do the following (and much more):
library(data.table)
dt = data.table(all_data)
dt[, list(mean(score), median(score)), by = file_name]
A small note: you could also use data.table's fread to read the files in, instead of read.table and its derivatives; that would be much faster. And while we're at it, use rbindlist instead of do.call(rbind, ...).
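A sketch of that faster variant, assuming the same files vector as above:
library(data.table)
# fread is a fast drop-in for read.csv; rbindlist stacks the results
all_data = rbindlist(lapply(files, function(f) cbind(fread(f), file_name = f)))
all_data[, .(mean = mean(score), median = median(score)), by = file_name]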

Select columns for heatmap in R

I need your help again :)
I wrote an R script that generates a heatmap out of a given tab-separated txt or xls file. At the moment, I delete all columns I don't want in the heatmap by hand in the xls file.
Now I want to automate this, but I don't know how :(
The interesting columns all start the same way in all xls files, followed by an individual name:
xls-file 1: L1_tpm_xxxx L2_tpm_xxxx L3_tpm_xxxx
xls-file 2: L1_tpm_xxxx L2_tpm_xxxx L3_tpm_xxxx L4_tpm_xxxx L5_tpm_xxxx
Any ideas how to select those columns?
Thanking you in anticipation, Philipp
You could use (if you have read your data into a data.frame df):
df <- df[,grep("^L[[:digit:]]+_tpm.*",colnames(df))]
or you can explicitly write the columns that you want:
df <- df[,c("L1_tpm_xxxx","L2_tpm_xxxx","L3_tpm_xxxx")]
etc...
If you think the column positions are going to be fixed across excel sheets, the simplest solution here is to just use column indices. For example, if you use read.table to import a tab-delimited text file as a data.frame, and then decide you'd prefer to only keep the first two columns, you might do something like this:
data <- read.table("path_to_file.txt", header=T, sep="\t")
data <- data[,1:2]
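Putting the two answers together for the tab-separated files described in the question (the file name here is hypothetical):
counts <- read.table("xls_file_1.txt", header=TRUE, sep="\t")
#keep only the L<number>_tpm_* columns, however many this file has
heat_cols <- grep("^L[[:digit:]]+_tpm.*", colnames(counts))
heat_data <- counts[, heat_cols]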
