merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.
I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.
If someone could direct me on how to accomplish this, preferably in R, it would be very helpful.
Alternatively, my ultimate goal is to obtain the median and mean of the "score" column from each file. So if that could be accomplished, with or without compiling the files, it would be even more helpful.
Thanks.
UPDATE:
As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.
I have tried lapply and Reduce, e.g.,
files <- dir(pattern = "X.*\\.txt$")
File_list <- lapply(files, function(score)
  read.table(score, header = TRUE))
File_list <- lapply(File_list, function(z) z[c("pos", "score")])
out_file <- Reduce(function(x, y) merge(x, y, by = "pos"), File_list)
which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr:
files <- list.files()
out_list <- llply(files, read.table)
As well as cbind and rbind. Usually I get an error message because the row numbers don't match up, or all the "score" data just gets compiled into one column.
The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.
I hope this clears things up.

This problem can be solved in two steps.
Step 1. Read the data from your files into a list of data frames, where files is a vector of file names. Since your files are tab-delimited, read.delim is the appropriate reader; if you need to pass it extra arguments, add them as shown below. See ?lapply for details.
list_of_dataframes <- lapply(files, read.delim, stringsAsFactors = FALSE)
Step 2. Calculate means for each data frame:
means <- sapply(list_of_dataframes, function(df) mean(df$score))
Of course, you can always do it in one step like this:
means <- sapply(files, function(filename) mean(read.delim(filename)$score))
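Since you ultimately want medians as well, here is a minimal sketch of the same pattern that collects both statistics into one summary data frame (again assuming files is the vector of file names):
summaries <- do.call(rbind, lapply(files, function(filename) {
  df <- read.delim(filename, stringsAsFactors = FALSE)
  data.frame(file = filename, mean = mean(df$score), median = median(df$score))
}))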

I think you want something like this:
all_data <- do.call(rbind, lapply(files, function(f) {
  cbind(read.delim(f), file_name = f)
}))
You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.delim options to suit your needs.
E.g. once you have the above, you can do the following (and much more):
library(data.table)
dt <- data.table(all_data)
dt[, list(mean(score), median(score)), by = file_name]
A small note: you could also use data.table's fread to read the files in instead of read.table and its derivatives, which would be much faster. While we're at it, use rbindlist instead of do.call(rbind, ...).
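A minimal sketch of that variant (assuming files holds the file paths, as above):
library(data.table)
# fread auto-detects the tab separator; := adds the file name by reference
all_data <- rbindlist(lapply(files, function(f) fread(f)[, file_name := f]))
all_data[, list(mean = mean(score), median = median(score)), by = file_name]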

Related

R: Generate dynamic names for dataframes

I need to read several csv files from a directory and save each data in separate dataframe.
The filenames are in a character vector:
lcl_forecast_data_files <- dir(lcl_forecast_data_path, pattern=glob2rx("*.csv"), full.names=TRUE)
For example: "fruc2021.csv", "gem2020.csv", "strb2021.csv".
So far I am reading the files step by step:
fruc2021 <- read_csv2("fruc2021.csv")
gem2020 <- read_csv2("gem2020.csv")
strb2021 <- read_csv2("strb2021.csv")
But there are many more files in the directory and subdirectories. To read them all one by one is very tedious.
Now I have already experimented a little with the map function, but I have not yet figured out how to automatically generate the names of the dataframes from the file names.
A first simple try was:
lcl_forecast_data <- lcl_forecast_data_files %>%
  map(
    function(x) {
      str_replace(basename(x), ".csv", "") <- read_csv2(x)
    }
  )
But this did not work :-(
Is it even possible to generate names for dataframes like this?
Or are there other, simpler possibilities?
Greetings
Benne
If you do not want to use a list and lapply as @Onyambu suggested, you can use assign() to generate the data frames.
filenames <- c("fruc2021.csv", "gem2020.csv", "strb2021.csv")
for (i in filenames) {
  assign(gsub("\\.csv$", "", i), read.csv(i))
}
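For reference, a minimal sketch of the list-based approach mentioned above, which avoids filling the global environment with variables (using read.csv as in the answer; swap in read_csv2 if your files are semicolon-separated, as in the question):
dfs <- lapply(filenames, read.csv)
names(dfs) <- gsub("\\.csv$", "", filenames)
# Access the individual data frames as dfs$fruc2021, dfs$gem2020, ...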

R - Read specific columns from XLSX

This seems like a silly question, but I really could not find a solution! I need to read only specific columns from an Excel file. The file has multiple sheets with different numbers of columns, but the ones I need to read will be there. I can do this for csv files, but not for Excel! This is my present code, which reads the first 14 columns (but the columns I need might not always be within the first 14). I can't just read them all, as rbind will throw an error about mismatched dimensions (the sheets have different numbers of columns).
EDIT: I solved this by omitting the col_types parameter; it worked because the sheets with a different number of columns contained only column headers. Still, this is in no way a robust solution, so I hope someone can do a better job than me.
INV <- lapply(sheets, function(X)
  read_excel("./Inventory.xlsx", sheet = X, col_types = c(rep("text", 14))))
names(INV) <- sheets
INV <- do.call("rbind", INV)
I am trying to do something like this:
INV <- lapply(FILES[grepl("Inventory", FILES)],
function(n) read_csv(file=paste0(n), col_types=cols_only(DIVISION="c",
DEPARTMENT="i",
ITEM_ID="c",
DESCRIPTION="c",
UNIT_QTY="i",
COMP_UNIT_QTY="i",
REGION="c",
LOCATION_TYPE="c",
ZONE="c",
LOCATION_ID="c",
ATS_IND="c",
CONTAINER_ID="c",
STATUS="c",
TROUBLE_CODES="c")))
But, for an Excel file. I tried using read.xlsx from openxlsx and read_excel from readxl, but neither supported doing this. There must be some other way. Don't worry about column types; I am fine with all of them as characters.
I would very much appreciate if this can be done using readxl or openxlsx.
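One workable pattern with readxl is to read every column as text and subset to the needed columns after reading. A minimal sketch, with illustrative column names (it assumes every sheet contains all of the wanted columns):
library(readxl)
wanted <- c("DIVISION", "DEPARTMENT", "ITEM_ID")  # hypothetical subset of columns
sheets <- excel_sheets("./Inventory.xlsx")
INV <- lapply(sheets, function(s) {
  df <- read_excel("./Inventory.xlsx", sheet = s, col_types = "text")
  df[, intersect(wanted, names(df)), drop = FALSE]  # keep only the wanted columns
})
INV <- do.call(rbind, INV)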

In R, can I get only the column names of a csv (txt) file?

I have many big files, but I would like to get only the names of their columns without loading them.
Using the data.table package, I can do
df1 <- fread("file.txt")
names1 <- names(df1)
But getting all the names of all the files this way is very expensive. Is there some other option?
Many functions to read in data have optional arguments that allow you to specify how many lines you'd like to read in. For example, the read.table function would allow you to do:
df1 <- read.table("file.txt", nrows=1, header=TRUE)
colnames(df1)
I'd bet that fread() has this option too.
(Note that you may even be able to get away with nrows=0, but I haven't checked to see if that works)
EDIT
As a commenter kindly points out, fread() and read.table() work a little differently.
For fread(), you'll want to supply argument nrows=0 :
df1 <- fread("file.txt", nrows=0) ##works
As per the documentation,
nrows=0 is a special case that just returns the column names and types; e.g., a dry run for a large file or to quickly check format consistency of a set of files before starting to read any.
But nrows=0 is one of the ignored cases when supplied to read.table():
df1 <- read.table("file.txt", header=TRUE)           ## reads entire file
df1 <- read.table("file.txt", header=TRUE, nrows=-1) ## reads entire file
df1 <- read.table("file.txt", header=TRUE, nrows=0)  ## also reads entire file
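To put that to use across many files, a minimal sketch (assuming files is a character vector of paths):
library(data.table)
# Grab just the header of each file; no data rows are read
all_names <- lapply(files, function(f) names(fread(f, nrows = 0)))
names(all_names) <- files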

R: Using the "names" function on a dataset created within a loop

I am using a for loop to read in multiple csv files and naming the datasets import1, import2, etc. For example:
assign(paste("import",i,sep=""), read.csv(files[i], header=FALSE))
However, I now want to rename the variables in each dataset. I have tried the following:
names(as.name(paste("import",i,sep=""))) <- c("xxxx", "yyyy")
But get the error "target of assignment expands to non-language object". (I need to change the name of variables in each dataset within the loop as the variable names need to be different in each dataset).
Any suggestions on how to do this would be much appreciated.
Thanks.
While I do agree it would be much better to keep your data.frames in a list rather than creating a bunch of variables in your global environment, you can also set the names when you read the files in:
assign(paste("import", i, sep=""),
       read.csv(files[i], header=FALSE, col.names=c("xxxx", "yyyy")))
Using assign() isn't very "R-like".
A better approach would be to read the files into a list of data.frames, instead of one data.frame object per file. Assuming files is the vector of file names (as you imply above):
import <- lapply(files, read.csv, header=FALSE)
Then if you want to operate on each data.frame in the list using a loop, you easily can:
for (i in seq_along(import)) names(import[[i]]) <- c('xxx', 'yyy')
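If, as in the question, each dataset needs different variable names, a minimal sketch of the same list-based idea (the name vectors are illustrative, one per file):
col_names <- list(c("xxx1", "yyy1"), c("xxx2", "yyy2"))
import <- Map(function(f, nm) setNames(read.csv(f, header = FALSE), nm),
              files[seq_along(col_names)], col_names)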

Join two dataframes before exporting as .csv files

I am working on a large questionnaire - and I produce summary frequency tables for different questions (e.g. df1 and df2).
a <- 1:5
b <- c(4, 3, 2, 1, 1)
Percent <- c(40, 30, 20, 10, 10)
df1 <- data.frame(a, b, Percent)
c <- c(1, 1, 5, 2, 1)
Percent <- c(10, 10, 50, 20, 10)
df2 <- data.frame(a, c, Percent)
rm(a,b,c,Percent)
I normally export the dataframes as csv files using the following command:
write.csv(df1, file="df1.csv")
However, as my questionnaire has many questions and therefore dataframes, I was wondering if there is a way in R to combine different dataframes (say with a line separating them), and export these to a csv (and then ultimately open them in Excel)? When I open Excel, I therefore will have just one file with all my question dataframes in, one below the other. This one csv file would be so much easier than having individual files which I have to open in turn to view the results.
Many thanks in advance.
If your end goal is an Excel spreadsheet, I'd look into some of the tools available in R for directly writing an xls file. Personally, I use the XLConnect package, but there is also xlsx and also several write.xls functions floating around in various packages.
I happen to like XLConnect because it allows for some handy vectorization in situations just like this:
require(XLConnect)
# Put your data frames in a single list
#  (I added two more copies for illustration)
dfs <- list(df1, df2, df1, df2)
# Create the xls file and a sheet
#  (note that XLConnect doesn't seem to do tilde expansion!)
wb <- loadWorkbook("/Users/jorane/Desktop/so.xls", create = TRUE)
createSheet(wb, "Survey")
# Starting row for each data frame
#  (note the +1 to get a gap between each)
n <- length(dfs)
rows <- cumsum(c(1, sapply(dfs[1:(n-1)], nrow) + 1))
# Write the file
writeWorksheet(wb, dfs, "Survey", startRow = rows, startCol = 1, header = FALSE)
# If you don't call saveWorkbook, nothing will happen
saveWorkbook(wb)
I specified header = FALSE since otherwise it would write the column headers for each data frame. But adding a single header row at the top of the xls file at the end isn't much additional work.
As James commented, you could use
merge(df1, df2, by="a")
but that would combine the data horizontally. If you want to combine them vertically you could use rbind:
rbind(df1, df2, df3,...)
(Note: the column names need to match for rbind to work).
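If you do want everything stacked in one plain csv with a blank line between tables, as described in the question, here is a minimal base-R sketch (the output file name is illustrative; unlike rbind, this does not require matching column names):
out <- "all_questions.csv"
cat("", file = out)  # start with an empty file
for (df in list(df1, df2)) {
  suppressWarnings(  # write.table warns when appending column names
    write.table(df, out, sep = ",", row.names = FALSE, append = TRUE)
  )
  cat("\n", file = out, append = TRUE)  # blank separator line
}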
