Combining multiple data.frames using R - r

I have several txt files, each containing 3 columns (A, B, C).
Column A is common to all the txt files. I want to combine the files so that column A appears only once, followed by the other columns (B and C) of the respective files. I used cbind, but it creates a data frame that repeats column A, which I don't want. Here is the R code I tried:
data <- read.delim(file.choose(),header=T)
data2 <- read.delim(file.choose(),header=T)
data3 <- cbind(data,data2)
write.table(data3,file="sample.txt",sep="\t",col.names=NA)

Unless your files are all sorted precisely the same, you'll need to use merge:
dat <- merge(data,data2,by="A")
dat <- merge(dat,data3,by="A")
This should automatically prevent you from having multiple A's, since merge knows they're all a key/index column. You'll likely want to rename the duplicate B's and C's before merging.
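For more than two files, the pairwise merges above can be folded into a single call. This is a minimal sketch, assuming every data frame has the columns A, B and C; the in-memory data frames and the names file1 to file3 are made up for illustration and stand in for the files read with read.delim. Renaming B and C per file first avoids relying on merge's automatic .x/.y suffixes.

```r
# Hypothetical stand-ins for the data frames read from the txt files
dfs <- list(
  file1 = data.frame(A = c("a", "b", "c"), B = 1:3, C = 4:6),
  file2 = data.frame(A = c("c", "a", "b"), B = 7:9, C = 10:12),
  file3 = data.frame(A = c("b", "c", "a"), B = 13:15, C = 16:18)
)

# Rename B and C to B_file1, C_file1, ... so the columns stay distinguishable
renamed <- Map(function(df, id) {
  not_key <- names(df) != "A"
  names(df)[not_key] <- paste(names(df)[not_key], id, sep = "_")
  df
}, dfs, names(dfs))

# Merge all data frames on the key column A, one pair at a time
combined <- Reduce(function(x, y) merge(x, y, by = "A"), renamed)
print(combined)
```

Because merge matches rows on A, the row order of the individual files does not matter, and A appears only once in the result.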

Related

R; Rbind Excel files from a List of vectors of files in R

I have web-scraped ~1000 Excel files into a specific folder on my computer.
I then listed these files, which returned a value of chr [1:1049].
I then grouped these files by similar names, with every 6 consecutive files belonging to one group.
This returned a list of 175, each element holding the 6 file names of one group.
I am confused about how to run a loop that would merge/rbind the 6 files in each group from that list. I would also need to remove the first row, but I know how to do that part with read.xlsx.
My code so far is
setwd("C:\\Users\\ewarren\\OneDrive\\Documents\\Reservoir Storage")
files <- list.files()
file_groups <- split(files, ceiling(seq_along(files)/6))
with
for (i in file_groups) {
print(i)
}
returning each group of file names
The files for example are:
files
They are each composed of two columns, date and amount.
I need to add a third column to each that is the reservoir name.
That way, when all the rows from all the files are combined, there's a date, an amount, and a reservoir. If I combine them all at once without the reservoir, I wouldn't know which rows belong to which.
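You can use startRow = 2 in read.xlsx to skip the first row.
For merging the files in a group: if the file names in each group share an identifier, e.g. x, that matches the others in the group but not the files in other groups,
you can make a list of them with group1 <- list.files(pattern = "x"),
then read each file and stack the rows, e.g. do.call(rbind, lapply(group1, read.xlsx, startRow = 2)). Calling do.call directly on group1 would only combine the file names, not the data.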
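The whole loop over file_groups, including the extra reservoir column, might look like the sketch below. To keep it runnable without Excel files it writes small CSV files to a temporary folder and reads them with read.csv; for the real data, read.xlsx(path, startRow = 2) would take its place. The file names (LakeA_1 and so on), the group size of 3, and the rule that the reservoir name is the part of the file name before the underscore are all assumptions made for illustration.

```r
# Stand-in data: six small CSV files in a temporary folder (hypothetical names)
folder <- tempfile("reservoirs")
dir.create(folder)
for (nm in c("LakeA_1", "LakeA_2", "LakeA_3", "LakeB_1", "LakeB_2", "LakeB_3")) {
  write.csv(data.frame(date = c("2020-01-01", "2020-01-02"), amount = c(10, 20)),
            file.path(folder, paste0(nm, ".csv")), row.names = FALSE)
}

files <- list.files(folder, full.names = TRUE)
# 3 files per group here; the question uses 6
file_groups <- split(files, ceiling(seq_along(files) / 3))

# Read one file and tag its rows with a reservoir name taken from the file name
read_one <- function(path) {
  df <- read.csv(path)  # swap in read.xlsx(path, startRow = 2) for Excel files
  df$reservoir <- sub("_.*$", "", basename(path))
  df
}

# rbind the files within each group, then rbind the groups together
per_group <- lapply(file_groups, function(g) do.call(rbind, lapply(g, read_one)))
all_rows <- do.call(rbind, per_group)
```

Adding the reservoir column while reading each file means the rows remain identifiable after everything has been stacked.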

writing single column to .csv in R

Hi folks: I'm trying to write a vector of length 100 to a single-column .csv in R. Each time I try, I get two columns in the csv file: the first with index numbers from the vector, the second with the contents of my vector. For example:
MyPath<-("~/rstudioshared/Data/HW3")
Files<-dir(MyPath)
write.csv(Files,"Names.csv",row.names = FALSE)
If I convert the vector to a data frame and then check its dimensions,
Files<-data.frame(Files)
dim(Files)
I get 100 rows by 1 column, and the column contains the names of the files in my directory folder. This is what I want.
Then I write the csv. When I open it outside of R or read it back in and look at it, I get a 100 X 2 DF where the first column contains the index numbers and the second column has the names of my files.
Why does this happen?
How do I write just the single column of data to the .csv?
Thanks!
Row names are written by write.csv() by default (and by default, a data frame with n rows will have row names 1,...,n). You can see this by looking at e.g.:
dat <- data.frame(mevar=rnorm(10))
# then compare what gets written by:
write.csv(dat, "outname1.csv")
# versus:
rownames(dat) <- letters[1:10]
write.csv(dat, "outname2.csv")
Just use write.csv(dat, "outname.csv", row.names=FALSE) and the row names won't show up.
And a suggestion: it might be easier/cleaner to just write the vector directly to a text file with writeLines(your_vector, "your_outfile.txt") (you can still use read.csv(), with header=FALSE, to read it back in if you prefer using that :p).
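The difference between the three ways of writing can be checked by reading the files back in. A small sketch, where files_vec is a made-up stand-in for the vector returned by dir(MyPath) and the output goes to temporary files:

```r
files_vec <- c("a.txt", "b.txt", "c.txt")  # hypothetical stand-in for dir(MyPath)
out_csv <- tempfile(fileext = ".csv")
out_txt <- tempfile(fileext = ".txt")

# Default: row names are written as an extra first column
write.csv(data.frame(Files = files_vec), out_csv)
with_rownames <- read.csv(out_csv)

# row.names = FALSE: only the data column is written
write.csv(data.frame(Files = files_vec), out_csv, row.names = FALSE)
single_col <- read.csv(out_csv)

# writeLines: one element per line, no header and no quoting
writeLines(files_vec, out_txt)
from_txt <- readLines(out_txt)

dim(with_rownames)  # two columns: the row names plus Files
dim(single_col)     # one column: Files
```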

Merge by first column multiple files in the same folder

This is probably a difficult question: I would like to know if there is an elegant way to solve it in R.
Essentially, I have a folder full of different tab-separated .txt files.
Each file has "names" in the first column and the important numerical value in the third column. Every file contains the same names; they are just in different rows.
So I was wondering whether, with a nice function, I can simplify the task and let R generate a data frame with the names in the first column (the order does not matter) and, in the other columns, the third column of every file saved in that folder (with the file name as the column name).
I have not been able to write anything decent. I only have a function for merging, because I cannot write a loop that processes all the files together, whatever files are in the folder.
So you just want the name column and the 3rd column?
Using data.table:
library(data.table)
dt1 <- fread("text1.txt")[, c(1, 3)]
dt2 <- fread("text2.txt")[, c(1, 3)]
...
Repeat for all your txt files, then:
dt <- dt1[dt2, on = "name"]
dt <- dt[dt3, on = "name"]
...
Repeat for all the files.
That should be sufficient, assuming all third columns are unique data and I'm correct in my assumptions about your data.
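The "repeat for all the files" step can also be automated. The sketch below is a base-R alternative to the data.table joins above (read.delim plus Reduce over merge), not the answer's own method. It assumes the first column is called name and uses a temporary folder with three made-up files so that it runs on its own.

```r
# Stand-in data: three tab-separated files in a temporary folder (hypothetical)
folder <- tempfile("txtfiles")
dir.create(folder)
orders <- list(text1 = c("g1", "g2", "g3"),
               text2 = c("g3", "g1", "g2"),
               text3 = c("g2", "g3", "g1"))
for (k in seq_along(orders)) {
  nms <- orders[[k]]
  df <- data.frame(name = nms, other = 0,
                   value = 10 * k + as.numeric(sub("g", "", nms)))
  write.table(df, file.path(folder, paste0(names(orders)[k], ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Keep columns 1 and 3 of each file; name the value column after the file
tables <- lapply(paths, function(p) {
  df <- read.delim(p)[, c(1, 3)]
  names(df) <- c("name", tools::file_path_sans_ext(basename(p)))
  df
})

# Merge every table on the shared name column
merged <- Reduce(function(x, y) merge(x, y, by = "name"), tables)
print(merged)
```

Since merge matches on name, it does not matter that the names sit in different rows in each file.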

Change column names after merging multiple data frames into one in R

After merging multiple data frames into one, I would like to know how to change the column headers in the master data frame to represent the original files that they came from. I merged a large number of data frames into one using the code below:
library(plyr)
dflist = list.files(path=dir, pattern="csv$", full.names=TRUE, recursive=FALSE)
import.list = llply(dflist, read.csv)
Master = Reduce(function(x, y) merge(x, y, by="Hours"), import.list)
I would like the columns that belonged to each original data frame to be named by the unique ID that the original data frame/ csv file is named by (i.e. aa, ab, ac). The unique IDs in the filenames comes immediately before a low line ("_") so I can isolate them using the code below. However, I am having trouble now applying this to column headers. Any help would be much appreciated.
filename = dflist[1]
unqID = strsplit(filename,"_")[[1]][1]
You could define a function in your llply call and have read.csv assign the names,
or just rename them after reading them in and before merging, as @joran suggested.
#First get the names
filenames = dflist
#I am unsure about the line below, as I
unqID = sapply(filenames, function(x) strsplit(basename(x), "_")[[1]][1]) # part of the file name before the first "_"
names(import.list) <- unqID #renaming the list items
And then merge using your code
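Naming the list items does not by itself change the column headers, so one option is to prefix the columns of each data frame with its ID before merging. A minimal sketch, assuming the key column is Hours as in the question; the file names and the Temp and Flow columns are invented for illustration, and import.list is built in memory instead of being read from disk.

```r
# Stand-in for import.list: three data frames sharing an "Hours" column.
# The file names below are hypothetical.
dflist <- c("some/dir/aa_data.csv", "some/dir/ab_data.csv", "some/dir/ac_data.csv")
import.list <- lapply(seq_along(dflist), function(k) {
  data.frame(Hours = 1:3, Temp = k * 1:3, Flow = k * 4:6)
})

# ID = part of the file name before the first underscore
unqID <- sapply(strsplit(basename(dflist), "_"), `[`, 1)

# Prefix every non-key column with the ID of the file it came from
renamed <- Map(function(df, id) {
  is_key <- names(df) == "Hours"
  names(df)[!is_key] <- paste(id, names(df)[!is_key], sep = ".")
  df
}, import.list, unqID)

Master <- Reduce(function(x, y) merge(x, y, by = "Hours"), renamed)
names(Master)
```

basename() is used because list.files(..., full.names=TRUE) returns full paths, and splitting those on "_" would otherwise keep the directory part in the ID.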

Appending a row to a dataframe while reading from multiple csv files in R

I'm reading from multiple csv files in a loop, and performing some calculations on each file's data, and then I wish to add that new row to a data frame:
for (i in csvFiles) {
fileToBeRead<-paste(directory, i, sep="/")
dataframe<-read.csv(paste(fileToBeRead, "csv", sep="."))
file <- i
recordsOK <- sum(complete.cases(dataframe))
record.data <- data.frame(monitorID, recordsOK)
}
So, I want to add file and recordsOK as a new row to the data frame. This just overwrites data frame every time, so I'd end up with the data from the latest csv file. How can I do this while preserving the data from the last iteration?
Building a data.frame one row at a time is almost always the wrong way to do it. Here's a more R-like solution:
OKcount<-sapply(csvFiles, function(i) {
fileToBeRead<-paste(directory, i, sep="/")
dataframe<-read.csv(paste(fileToBeRead, "csv", sep="."))
sum(complete.cases(dataframe))
})
record.data <- data.frame(monitorID=seq_along(csvFiles), recordsOK=OKcount)
The main idea is that you generally build your data column-wise, not row-wise, and then bundle it together in a data.frame when you're all done. Because R has so many vectorized operations, this is usually pretty easy.
But if you really want to add rows to a data.frame, you can rbind (row bind) additional rows in. So instead of overwriting record.data each time, you would do
record.data <- rbind(record.data, data.frame(monitorID, recordsOK))
But that means you will need to define record.data outside of your loop and initialize it with the correct column names and data types since only matching data.frames can be combined. You can initialize it with
record.data <- data.frame(monitorID=numeric(), recordsOK=numeric())
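Both approaches can be compared side by side. The sketch below assumes three small CSV files written to a temporary folder (the file names and contents are made up) and records the file name alongside the count, as the question asks. count_ok is a hypothetical helper introduced only for this example.

```r
# Stand-in data: three CSV files, some with missing values (hypothetical)
directory <- tempfile("monitors")
dir.create(directory)
csvFiles <- c("001", "002", "003")
write.csv(data.frame(x = c(1, NA, 3), y = c(1, 2, 3)),
          file.path(directory, "001.csv"), row.names = FALSE)
write.csv(data.frame(x = c(1, 2, 3), y = c(1, 2, 3)),
          file.path(directory, "002.csv"), row.names = FALSE)
write.csv(data.frame(x = c(NA, NA, 3), y = c(1, NA, 3)),
          file.path(directory, "003.csv"), row.names = FALSE)

# Hypothetical helper: number of complete rows in one file
count_ok <- function(i) {
  sum(complete.cases(read.csv(file.path(directory, paste0(i, ".csv")))))
}

# Column-wise: compute every count first, then build the data frame once
by_column <- data.frame(file = csvFiles,
                        recordsOK = sapply(csvFiles, count_ok),
                        row.names = NULL)

# Row-wise: start from an empty data frame and rbind one row per file
by_row <- data.frame(file = character(), recordsOK = numeric())
for (i in csvFiles) {
  by_row <- rbind(by_row, data.frame(file = i, recordsOK = count_ok(i)))
}
```

Both data frames hold the same values; the column-wise version avoids growing the data frame inside the loop.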
