Merge by first column multiple files in the same folder - r

This may be a difficult question, but I would like to know whether there is an elegant way to solve it in R.
Essentially, I have a folder full of different tab-separated .txt files.
Each file has "names" in the first column and the important numerical value in the third column. Every file contains the same names; they are just in different rows.
So I was wondering whether a function could simplify the task and have R generate a data frame with the names in the first column (the order does not matter) and, in the other columns, the third column of each file in the folder (with the file name as the column name).
I have not been able to write anything decent. I only have a function for merging, because I cannot write a loop that processes all the files in the folder together, whatever they are.

So you just want the name column and the 3rd column?
Using data.table:
library(data.table)
dt1 <- fread("text1.txt")[, c(1, 3)]
dt2 <- fread("text2.txt")[, c(1, 3)]
...
Repeat for all your txt files, then:
dt <- dt1[dt2, on = "name"]
dt <- dt[dt3, on = "name"]
...
Repeat for all the files.
That should be sufficient, assuming all third columns are unique data and I'm correct in my assumptions about your data.
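The repeated read-and-join steps above can also be folded into a loop over whatever files are in the folder. The following is a minimal base R sketch, not the answerer's method: the helper name merge_third_columns, the key column name "name", and the demo files are all assumptions made for illustration, and it assumes each file has a header line.

```r
# Hypothetical helper: read column 1 (names) and column 3 (values) from every
# tab-separated .txt file in `dir` and merge them on the names column.
merge_third_columns <- function(dir, header = TRUE) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    d <- read.delim(f, header = header, stringsAsFactors = FALSE)[, c(1, 3)]
    # Name the value column after the file (without the extension).
    names(d) <- c("name", tools::file_path_sans_ext(basename(f)))
    d
  })
  # Fold the list into one data frame, matching rows by name, not by position.
  Reduce(function(x, y) merge(x, y, by = "name"), tables)
}

# Small demonstration with two made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(name = c("a", "b", "c"), x = 0, value = c(1, 2, 3)),
            file.path(demo_dir, "text1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(name = c("c", "a", "b"), x = 0, value = c(30, 10, 20)),
            file.path(demo_dir, "text2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

merged <- merge_third_columns(demo_dir)
print(merged)
```

Because merge matches on the key column, the differing row order between files does not matter. If the files have no header line, pass header = FALSE.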

Related

R, Dataset without column names

Complete noob here, especially with R.
For a school project I have to work with a specific dataset that doesn't come with column names in the dataset itself, but there is a .txt file with extra information about the dataset, including the column names. The problem I'm having is that when I load the dataset, RStudio assumes that the first line of data is actually the column names. Initially I just replaced the names with colnames(), but by doing so I ended up losing the first line of data, and I'm sure that's not the right way of dealing with it.
How can I add the correct column names without deleting the first line of data? (Preferably inside R due to school work requirements)
Thanks in advance!
When reading the data with read.table, use header = FALSE so that the first line is treated as data and default column names (V1, V2, ...) are assigned automatically:
df1 <- read.table('file.txt', header = FALSE)
Then, we can assign the preferred column names from the other .txt file:
colnames(df1) <- scan('names.txt', what = '', quiet = TRUE)
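As a quick check that no data row is lost this way, here is a small self-contained sketch. The two temporary files and their contents are made up for illustration, and it assumes the names file holds one column name per whitespace-separated token.

```r
# Two made-up files: data without a header line, and the column names.
data_file  <- tempfile(fileext = ".txt")
names_file <- tempfile(fileext = ".txt")
writeLines(c("1 2.5 a", "2 3.5 b"), data_file)
writeLines(c("id", "score", "label"), names_file)

# header = FALSE keeps the first line as data; columns are V1, V2, V3 for now.
df1 <- read.table(data_file, header = FALSE)

# Replace the default names with the ones stored in the other file.
colnames(df1) <- scan(names_file, what = "", quiet = TRUE)
print(df1)
```

Both data rows survive, and the columns carry the names read from the second file.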

R; Rbind Excel files from a List of vectors of files in R

I have web-scraped ~1000 Excel files into a specific folder on my computer.
I then listed these files, which returned a character vector, chr [1:1049].
I then grouped these files by similar names, so that every 6 consecutive files belong to one group.
This returned a list of 175, each element holding the 6 file names of one group.
I am confused about how to write a loop that would merge/rbind the 6 files for each group in that list. I would also need to remove the first row, but I know how to do that part with read.xlsx.
My code so far is
setwd("C:\\Users\\ewarren\\OneDrive\\Documents\\Reservoir Storage")
files <- list.files()
file_groups <- split(files, ceiling(seq_along(files)/6))
with
for (i in file_groups) {
print(i)
}
returning each group of file names
The files for example are:
files
They are each composed of two columns, date and amount.
I need to add a third to each that is the reservoir name.
That way, when all the rows from all the files are combined, there's a date, an amount, and a reservoir. If I do them all at once without the reservoir, I wouldn't know which rows belong to which.
You can use startRow = 2 in read.xlsx to skip the first row.
To merge the files of one group: if the files in a group share an identifier in their names, e.g. x, that does not appear in the names of files from other groups,
you can list them with group1 <- list.files(pattern = "x"), read each one with group1_data <- lapply(group1, read.xlsx, startRow = 2),
then use do.call(rbind, group1_data) to stack the rows (cbind would place the files side by side instead).
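The per-group loop, including the extra reservoir column the question asks for, might be sketched as follows. This is an illustration under assumptions rather than a tested solution for the original files: combine_group is a hypothetical helper, the reservoir name is assumed to be recoverable from the file name, and the demo uses small CSV files with read.csv so that it runs without an Excel package.

```r
# Hypothetical helper: read every file in one group, tag its rows with a
# reservoir name taken from the file name, and stack the results.
combine_group <- function(paths, reader = read.csv) {
  parts <- lapply(paths, function(p) {
    d <- reader(p)
    d$reservoir <- tools::file_path_sans_ext(basename(p))  # assumed naming
    d
  })
  do.call(rbind, parts)
}

# Made-up demo files (CSV stand-ins for the Excel files in the question).
demo_dir <- file.path(tempdir(), "reservoir_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (nm in c("lake_a", "lake_b", "lake_c", "lake_d")) {
  write.csv(data.frame(date = c("2020-01-01", "2020-01-02"), amount = c(1, 2)),
            file.path(demo_dir, paste0(nm, ".csv")), row.names = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
# Groups of 2 here; the question uses groups of 6.
file_groups <- split(files, ceiling(seq_along(files) / 2))

# One combined data frame per group.
combined <- lapply(file_groups, combine_group)
print(combined[[1]])
```

For the Excel files, the reader argument could be swapped for something like function(p) read.xlsx(p, startRow = 2), following the suggestion above.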

How do I select columns by name while ignoring certain characters?

I'm trying to pull data from a file, but only pull certain columns based on the column name.
I have this bit of code:
filepath <- ([my filepath])
files <- list.files(filepath, full.names=T)
newData <- fread(file,select=c(selectCols))
selectCols contains a list of column names (as strings). But in the data I'm pulling, there may be underscores placed differently in each file for the same data.
Here's an example:
PERIOD_ID
PERIOD_ID_
_PERIOD_ID_
And so on. I know I can use gsub to change the column names once the data is already pulled:
colnames(newData) <- gsub("_","",colnames(newData))
Then I can select by column name, but given that it's a lot of data I'm not sure this is the most efficient idea.
Is there a way to ignore underscores or other characters within the fread function?
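One possible workaround, offered as a sketch rather than a built-in fread feature, is to read only the header line first, compare names with the underscores stripped, and then request the columns by the names they actually have in that file. The helper match_cols and the demo file below are hypothetical, and the sketch assumes a single header line with a known separator.

```r
# Hypothetical helper: return the column names exactly as they appear in the
# file header whose underscore-free form matches one of `wanted`.
match_cols <- function(path, wanted, sep = ",") {
  # Read only the first line of the file, split on the separator.
  header <- scan(path, what = "", sep = sep, nlines = 1, quiet = TRUE)
  header[gsub("_", "", header) %in% gsub("_", "", wanted)]
}

# Made-up file whose header has stray underscores.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("_PERIOD_ID_,VALUE_,OTHER", "1,10,x", "2,20,y"), demo_file)

matched <- match_cols(demo_file, c("PERIOD_ID", "VALUE"))
print(matched)
```

The result could then be passed on as fread(demo_file, select = matched), so only the wanted columns are loaded. Note that stripping every underscore treats names such as PERIOD_ID and PERIODID as the same column, which may or may not be desirable.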

Differences in imported data from one file vs. lots of files

I have built a function which allows me to process .csv files one by one. This involves importing data using the read.csv function, assigning one of the columns a name, and making a series of calculations based on that one column. However, I'm having problems with how to apply this function to a whole folder of files. Once a list of files is generated, do I need to read the data from each file from within my function, or prior to the application of it? This is what I had previously to import the data:
AllData <- read.csv("filename.csv", header=TRUE, skip=7)
DataForCalcs <- AllData[5]
My code resulted in the calculation of a number of variables, which I put into a matrix at the end of the code, and used the apply function to calculate the max of each of those variables.
NewVariables <- matrix(c(Variable1, Variable2, Variable3, Variable4, Variable5), ncol = 5)
colnames(NewVariables) <- c("Variable1", "Variable2", "Variable3", "Variable4", "Variable5")
apply(NewVariables, 2, max, na.rm=TRUE)
This worked great, but I then need to write this table to a new .csv file, which contains these results for each of the ~300 files I want to process, preceded by the name of each file. I'm new to this, so I would really appreciate your time helping me out!
Have you thought about reading in all your .csv files in a loop that combines them into one dataframe? I do this all the time like this:
df <- c()
for (x in list.files(pattern="\\.csv$")) {
u<-read.csv(x, skip=6)
u$Label = factor(x) #A column that is the filename
df <- rbind(df,u)
}
This of course assumes that every .csv file has an equal number of columns that are named the same thing. But if that assumption is true then you can simply treat the resulting dataframe like one dataframe.
Once you have your dataframe assembled, you can use the Label column as your group-by variable. You'll also need to select only the 5th and 13th variables as well as the Label variable. Then, if your goal is to take, say, the max value of each column for each .csv file and produce another dataframe of those max values, you'd go about it like this.
library(dplyr)
df.summary <- df %>%
group_by(Label) %>%
summarise(across(everything(), max)) ##Take the max value of each column except Label
There are better ways to do this using gather() but I don't want to overwhelm you.
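On the question of whether to read each file inside the function or before it: one common pattern is to let the function take a file path, read the file itself, and return a single row of results, then apply it over the file list. The sketch below uses base R only; summarise_file, its arguments, and the demo files are hypothetical, and the max and mean stand in for the asker's own calculations.

```r
# Hypothetical per-file function: read one file and return one row of results,
# preceded by the file name.
summarise_file <- function(path, skip = 0, col = 1) {
  values <- read.csv(path, header = TRUE, skip = skip)[[col]]
  data.frame(file = basename(path),
             max  = max(values, na.rm = TRUE),
             mean = mean(values, na.rm = TRUE))
}

# Made-up input files in a temporary folder.
demo_dir <- file.path(tempdir(), "summary_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = c(1, 5, 3)),  file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = c(2, NA, 8)), file.path(demo_dir, "b.csv"), row.names = FALSE)

# Apply the function to every file and stack the one-row results.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
results <- do.call(rbind, lapply(files, summarise_file))

# Write the combined table to a new file outside the input folder.
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
print(results)
```

For the files in the question, the call would presumably become something like lapply(files, summarise_file, skip = 7, col = 5).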

Combining multiple data.frames using R

I have several txt files, each containing 3 columns (A, B, C).
Column A is common to all the txt files. I want to combine the txt files so that column A appears only once, followed by the other columns (B and C) of the respective files. I used cbind, but it creates a data frame with repeats of column A, which I don't want. Column A must appear only once. Here is the R code I tried:
data <- read.delim(file.choose(),header=T)
data2 <- read.delim(file.choose(),header=T)
data3 <- cbind(data,data2)
write.table(data3,file="sample.txt",sep="\t",col.names=NA)
Unless your files are all sorted precisely the same, you'll need to use merge:
dat <- merge(data,data2,by="A")
dat <- merge(dat,data3,by="A")
This should automatically prevent you from having multiple A's, since merge knows they're all a key/index column. You'll likely want to rename the duplicate B's and C's before merging.
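The rename-then-merge idea can be sketched for any number of data frames with Map and Reduce. The three data frames and their labels below are made up for illustration; only the key column name A comes from the question.

```r
# Three made-up data frames sharing key column A, in different row orders.
tables <- list(
  f1 = data.frame(A = c("x", "y"), B = 1:2,  C = 3:4),
  f2 = data.frame(A = c("y", "x"), B = 5:6,  C = 7:8),
  f3 = data.frame(A = c("x", "y"), B = 9:10, C = 11:12)
)

# Suffix every non-key column with the table's label so names stay unique.
renamed <- Map(function(d, label) {
  other <- names(d) != "A"
  names(d)[other] <- paste(names(d)[other], label, sep = "_")
  d
}, tables, names(tables))

# Merge all tables on A, one after another.
combined <- Reduce(function(x, y) merge(x, y, by = "A"), renamed)
print(combined)
```

Column A appears once, and each file's B and C columns keep a suffix that identifies their source.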

Resources