Appending two Excel files into one dataframe - r

I am trying to append two Excel files into one data frame in R.
I am using the following code to do so:
rm(list = ls(all.names = TRUE))
library(rio) #this is for the excel appending
library("dplyr") #this is required for filtering and selecting
library(tidyverse)
library(openxlsx)
path1 <- "A:/Users/Desktop/Test1.xlsx"
path2 <- "A:/Users/Desktop/Test2.xlsx"
dat = bind_rows(path1,path2)
Output
> dat = bind_rows(path1,path2)
Error: Argument 1 must have names.
Run `rlang::last_error()` to see where the error occurred
I appreciate that this is more for combining rows together, but can someone help me with combining different workbooks into one dataframe in RStudio?

bind_rows() works with data frames AFTER they have been loaded into the R environment. Here you are merely trying to "bind" two character strings together, hence the error. First you need to import the data from Excel, which you could do with something like:
test_df1 <- readxl::read_xlsx(path1)
test_df2 <- readxl::read_xlsx(path2)
and then you should be able to run:
test_df <- bind_rows(test_df1, test_df2)
A quicker way would be to iterate the process using map_df() from purrr:
test_df <- map_df(c(path1, path2), readxl::read_xlsx)

If you want to append one under the other, meaning that both Excel files have the same columns, I would:
select the rows I want from the first Excel file and create a data frame,
select the rows from the second Excel file and create a second data frame,
and append them with rbind(), as sketched below.
On the other hand, if you want to append one next to the other, I would select the columns needed from the first and second Excel files into two data frames respectively and then go with cbind().
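A minimal sketch of both cases, assuming the two workbooks have already been read into data frames (the file names here are hypothetical):
library(readxl)
df1 <- read_xlsx("Test1.xlsx")  # hypothetical file
df2 <- read_xlsx("Test2.xlsx")  # hypothetical file
# same columns: stack the rows of one under the other
stacked <- rbind(df1, df2)
# same number of rows: place the columns side by side
side_by_side <- cbind(df1, df2)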

Related

Combining a list of data frames into a new data frame in R

This is a third edit to the question (leaving the earlier thread below just in case):
The following code makes some sample data frames, selects those with "_areaX" in the title and makes a list of them. The goal is then to combine the data frames in the list into 1 data frame. It almost works...
Area1 <- 100
Area2 <- 200
Area3 <- 300
Zone <- 3
a1_areaX <- data.frame(Area1)
a2_areaX <- data.frame(Area2)
a3_areaX <- data.frame(Area3)
a_zoneX <- data.frame(Zone)
library(dplyr)
pattern = "_areaX"
df_list <- mget(ls(envir = globalenv(), pattern = pattern))
big_data = bind_rows(df_list, .id = "FileName")
The problem is that the newly created data frame keeps each area in its own separate column (Area1, Area2, Area3) rather than stacking them.
And I need it to look like this:
File Name    Area measurement
a1_areaX     100
a2_areaX     200
a3_areaX     300
Below are my earlier attempts at asking this question. Edited from first version:
I have csv files imported into the R Global Env that look like this (screenshot omitted; I'd share the actual file(s) but there doesn't seem to be a way to do that here):
They all have a name, the above one is called "s6_section_area". There are many of them (with different names) and I've put them all together into a list using this code:
pattern = "section_area"
section_area_list <- list(mget(grep(pattern,ls(globalenv()), value = TRUE), globalenv()))
Now I want a new data frame that looks like this, put together from the data frames in the above made list.
File Name          Area measurement
a1_section_area    a number
a2_section_area    another number
many more          more numbers
So, the first column should list the name of the original file and the second column the measurement provided in that file.
Hope this is clearer - not sure how else to provide a reproducible example without sharing the actual files (which doesn't seem to be an option).
Addition to edit: using this code
section_area_data <- bind_rows(section_area_list, .id = "FileName")
I get a very wide data frame (it goes on and on to the right; screenshot omitted).
I'm after a table that looks like the sample above, left column is file name with a list of file names going down. Right column is the measurement for that file name (taken from original file).
Note that in your list of data frames (df_list) all the columns have different names (Area1, Area2, Area3), whereas in your desired output they have all been combined into one single column. So you need to rename the differing columns to a common name and then bind the data frames together.
library(dplyr)
library(purrr)
result <- map_df(df_list, ~ .x %>%
  rename_with(~"Area", contains('Area')), .id = 'FileName')
result
# FileName Area
#1 a1_areaX 100
#2 a2_areaX 200
#3 a3_areaX 300
Thanks everyone for your suggestions. In the end, I was able to combine the suggestions with some more thinking of my own and came up with this, which works perfectly.
library("dplyr")
pattern = "section_area"
section_area_list <- mget(ls(envir = globalenv(), pattern = pattern))
section_area_data <- bind_rows(section_area_list, .id = "FileName") %>%
  select(-V1)
So, a bunch of csv files were imported into the R Global Env. A list of all objects with a name containing "section_area" was made. Those data frames were then bound into one big data frame, with the file names in one column and the value (the area measurement in this case) in the other column (there was a pointless column in the original csv files called "V1", which I deleted).
This is what one of the many csv files looks like:
[screenshot: sample csv file]
And this is the layout of the final data frame (it goes on for about 150 rows):
[screenshot: final data frame]

rbind three databases using the rbind function

I know it's a newbie question. I have these 3 xlsx files containing three databases with the same 14 variables; it's a cross-section data panel.
All I want is to concatenate them into one single database called eplt.
First, I import them:
library(dplyr)
library(ggplot2)
library(xlsx)
##Import the three data bases
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## to render the number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when applying rbind to bind them all together I get the error:
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
But the names of the variables in the 3 sets are the same.
Could someone please help? I only want to rbind 25000 observations and leave the remaining 4441 to compare against the predicted observations of a multiple regression model.
Thanks in advance.
The third data frame doesn't have the same column names as the first two: 'Svt' isn't in upper case.
One way is to apply the names of one dataframe to the others:
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the janitor package whenever your data comes from Excel files. Indeed, it is common to have variable-name issues there. This package ensures consistent formatting of your data column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
After that, the rbind should work.
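For completeness, a minimal sketch of that final step, reusing the cleaned data frames from above:
eplt <- rbind(epl_data, epl_data2, epl_data3)
nrow(eplt)  # should equal nrow(epl_data) + nrow(epl_data2) + nrow(epl_data3)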
As already mentioned, you have a mismatch in the variable name 'SVT'. Here is an alternative that makes the column names lower case and binds the files together into one data frame.
library(dplyr)
library(purrr)
eplt <- list.files(pattern = 'Notes_ETAB\\d{4}-\\d{4}\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))

Writing a single column to .csv in R

Hi folks: I'm trying to write a vector of length 100 to a single-column .csv in R. Each time I try, I get two columns in the csv file: the first with index numbers from the vector, the second with the contents of my vector. For example:
MyPath<-("~/rstudioshared/Data/HW3")
Files<-dir(MyPath)
write.csv(Files,"Names.csv",row.names = FALSE)
If I convert the vector to a data frame and then check its dimensions,
Files<-data.frame(Files)
dim(Files)
I get 100 rows by 1 column, and the column contains the names of the files in my directory folder. This is what I want.
Then I write the csv. When I open it outside of R or read it back in and look at it, I get a 100 x 2 data frame where the first column contains the index numbers and the second column has the names of my files.
Why does this happen?
How do I write just the single column of data to the .csv?
Thanks!
Row names are written by write.csv() by default (and by default, a data frame with n rows will have row names 1,...,n). You can see this by looking at e.g.:
dat <- data.frame(mevar=rnorm(10))
# then compare what gets written by:
write.csv(dat, "outname1.csv")
# versus:
rownames(dat) <- letters[1:10]
write.csv(dat, "outname2.csv")
Just use write.csv(dat, "outname.csv", row.names=FALSE) and the row names won't show up.
And a suggestion: it might be easier/cleaner to just write the vector directly to a text file with writeLines(your_vector, "your_outfile.txt") (you can still use read.csv() to read it back in if you prefer using that :p).
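A minimal sketch of both options, assuming Files is the character vector of file names from the question:
# one column with a header, no row-number column
write.csv(data.frame(Files), "Names.csv", row.names = FALSE)
# or write the vector directly, one file name per line, no header
writeLines(Files, "Names.txt")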

Comparing column headers of two files to fetch data in R

I have a large CSV file, say INPUT, with 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT, and which changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in the dataframe DF and write it into another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe with the INPUT file to be automated, because I don't want to do this every day when the dataframe gets updated.
I'm not sure whether this is the answer:
input <- read.table(...)
input[colnames(input) %in% colnames(DF)]
If I understand it correctly, you need to import the INPUT.csv file into R and then match the columns of your DF with the columns of your INPUT, is that correct?
You can either use the match function or just import the INPUT.csv file into RStudio via the "Import Dataset" button and subset it. Subsetting imported dataframes is fairly easy.
If you import your dataset as INPUT, then you can subset these columns in the following way: INPUT[, c(1,2,4)]
That will get you the first, second and fourth columns of the INPUT dataset; a name-based sketch follows below.
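A minimal sketch of name-based subsetting, so it keeps working as DF changes (this assumes DF holds the wanted header names, as in the print(DF) output above; the file names are taken from the question):
# read the full input once
INPUT <- read.csv("INPUT.csv", stringsAsFactors = FALSE)
# keep ID plus whichever requested headers actually exist in INPUT
wanted <- intersect(c("ID", as.character(unlist(DF))), names(INPUT))
OUTPUT <- INPUT[, wanted]
write.csv(OUTPUT, "OUTPUT.csv", row.names = FALSE)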
First, reading the csv is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one dataframe's columns are always a subset of the other's, the code is as follows:
### Example Dataframes
df1 <- data.frame(one = c(1,3,4), two = c(1,2,3), three = c(1,2,3))
df2 <- data.frame(one = c(1,3,4), three = c(1,2,3))
### Make new data frame
df3 <- df1[,colnames(df2)]
### Write new dataframe
write.csv(df3, 'hello.csv')

read multiple csv files with the same column headings and find the mean

Is it possible to read multiple csv files (exported from Excel) into R? All of the csv files have the same 4 columns: the first is character, the second and third are numeric, and the fourth is integer. I want to combine the data in each numeric column and find the mean.
I can get the csv files into R with
data <- list.files(directory)
myFiles <- paste(directory,data[id],sep="/")
I am unable to get the numbers from the individual columns, add them, and find the mean.
I am completely new to R and any advice is appreciated.
Here is a simple method:
Prep: Generate dummy data: (You already have this)
dummy <- data.frame(names=rep("a",4), a=1:4,b=5:8)
write.csv(dummy,file="data01.csv",row.names=F)
write.csv(dummy,file="data02.csv",row.names=F)
write.csv(dummy,file="data03.csv",row.names=F)
Step0: Load the file names: (just like you are doing)
data <- dir(getwd(),".csv")
Step1: Read and combine:
DF <- do.call(rbind,lapply(data,function(fn) read.csv(file=fn,header=T)))
DF
Step2: Find mean of appropriate columns:
apply(DF[,2:3],2,mean)
Hope that helps!!
EDIT: If you are having trouble with file paths, try ?file.path.
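For example, a minimal sketch of using file.path() to build full paths when the csv files live in another directory (the directory name here is hypothetical):
directory <- "~/data"                 # hypothetical folder containing the csv files
files <- list.files(directory, pattern = "\\.csv$")
paths <- file.path(directory, files)  # e.g. "~/data/data01.csv"
DF <- do.call(rbind, lapply(paths, read.csv))
apply(DF[, 2:3], 2, mean)             # mean of the two numeric columns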
