rbind three data frames using the rbind function - r

I know this is a newbie question. I have three xlsx files, each holding a data set with the same 14 variables (it's a cross-sectional data panel).
All I want is to concatenate them into one single data frame called eplt.
First, I import them
library(dplyr)
library(ggplot2)
library(xlsx)
## Import the three data sets
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## Number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when applying rbind to bind them all together I get this error:
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
But the names of the variables in the 3 sets are the same.
Could someone please help? I only want to rbind 25000 observations and leave the remaining 4441 to compare against the predicted observations of a multiple regression model.
Thanks in advance.

The third data frame doesn't have the same names as the first two: Svt isn't in upper case.
One way is to apply the names of one data frame to the others:
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the janitor package whenever your data comes from Excel files, since variable-name issues are common there. This package ensures consistent formatting of your column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
After that, the rbind should work.
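To see what clean_names() does, here is a minimal sketch with made-up column names (df_a and df_b are purely illustrative):
# Two toy data frames whose only difference is the casing of one column name
df_a <- data.frame(Svt = 1:2, Maths = 3:4)
df_b <- data.frame(SVT = 5:6, Maths = 7:8)
df_a <- janitor::clean_names(df_a)  # names become "svt", "maths"
df_b <- janitor::clean_names(df_b)  # names become "svt", "maths"
rbind(df_a, df_b)                   # names now match, so rbind() succeeds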

As already mentioned, you have a mismatch in the variable name 'SVT'. Here is an alternative that makes the column names lower case and binds the files together into one data frame.
library(dplyr)
library(purrr)
eplt <- list.files(pattern = 'Notes_ETAB\\d{4}-\\d{4}\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))
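Once the three files bind cleanly, the 25000/4441 split mentioned in the question could be done with a sketch like this (the row count comes from the question; a random split is an assumption on my part):
set.seed(123)                         # reproducible split
idx     <- sample(nrow(eplt), 25000)  # rows used to fit the regression
train   <- eplt[idx, ]
holdout <- eplt[-idx, ]               # remaining ~4441 rows kept for comparison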

Related

Appending two excel files into one dataframe

I am trying to append two Excel files into one data frame in R.
I am using the following code to do so:
rm(list = ls(all.names = TRUE))
library(rio) #this is for the excel appending
library("dplyr") #this is required for filtering and selecting
library(tidyverse)
library(openxlsx)
path1 <- "A:/Users/Desktop/Test1.xlsx"
path2 <- "A:/Users/Desktop/Test2.xlsx"
dat = bind_rows(path1,path2)
Output
> dat = bind_rows(path1,path2)
Error: Argument 1 must have names.
Run `rlang::last_error()` to see where the error occurred
I appreciate that this is more for combining rows together, but can someone help me combine different workbooks into one data frame in RStudio?
bind_rows() works with data frames AFTER they have been loaded into the R environment. Here you are merely trying to "bind" two character strings together, hence the error. First you need to import the data from Excel, which you could do with something like:
test_df1 <- readxl::read_xlsx(path1)
test_df2 <- readxl::read_xlsx(path2)
and then you should be able to run:
test_df <- bind_rows(test_df1, test_df2)
A quicker way would be to iterate the process using the map function from purrr:
test_df <- map_df(c(path1, path2), readxl::read_xlsx)
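If it also helps to know which workbook each row came from, one variation (using map_df()'s .id argument; the names below are illustrative) is:
library(purrr)
paths   <- c(test1 = path1, test2 = path2)        # name the paths
test_df <- map_df(paths, readxl::read_xlsx, .id = "source_file")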
If you want to append one under the other, meaning that both Excel files have the same columns, I would:
1) select the rows I want from the first file and create a data frame,
2) select the rows from the second file and create a second data frame,
3) append them with rbind().
On the other hand, if you want to append one next to the other, I would pick the needed columns from the first and second files into two data frames respectively and then go with cbind(). A short sketch of both follows.
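A minimal sketch of both options, assuming the two workbooks from the question and the readxl package (the column names in the cbind() line are illustrative):
library(readxl)
df1 <- read_xlsx(path1)
df2 <- read_xlsx(path2)
# Same columns: stack one data frame under the other
stacked <- rbind(df1, df2)
# Same rows: put selected columns side by side
side_by_side <- cbind(df1[, c("a", "b")], df2[, "c"])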

How to conditionally change colnames from certain rows?

I have a problem that often arises when working with Excel survey data: the first 10 or so column names in a data set are appropriate, but the remaining columns (x:ncol) need to be renamed to the values in the first row of the data set, starting at x+1. (The colnames are correct up to x; after that point the colnames are empty, and the values that I would like to have as colnames sit in the first row.)
I have been doing this manually, writing them out one by one using dplyr::select(). How can I automate this in a tidy workflow? I imagine using set_names() or rename_at() but can't get the syntax right. Thank you in advance.
mtcars %>%
  select(miles_per_gallon = "mpg", everything()) %>% # etc., keep some names
  rename_at(vars(3:ncol(.)), funs(mtcars[,1]))
Error: `nm` must be `NULL` or a character vector the same length as `x`
The error isn't surprising, but to illustrate the point - how to have the names from x:ncol() replaced by the first row's values starting from x+1?
I think this should do it for you -
x <- 10 # means columns 1:10 already have appropriate names
names(df)[(x+1):ncol(df)] <- unlist(df[1, (x+1):ncol(df)])
df <- df[-1, ] # removing the 1st row, assuming it only held the names
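As a quick check, a toy sketch (made-up data) of what that does:
# Columns 1:2 have good names; the real names of the rest sit in row 1
df <- data.frame(id    = c(NA, 1, 2),
                 group = c(NA, "a", "b"),
                 X3    = c("score", 10, 20),
                 X4    = c("weight", 5, 6),
                 stringsAsFactors = FALSE)
x <- 2
names(df)[(x+1):ncol(df)] <- unlist(df[1, (x+1):ncol(df)])
df <- df[-1, ]
names(df)   # "id" "group" "score" "weight"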

How to create a "top ten" vector that keeps labels?

I have a data set with 655 rows and 21 columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function it doesn't keep the labels (they are names of bacteria; each column is a sample). Is there a way to create a sorted subset of the data that carries the row names along with it?
right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE), n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[,c(1,i)]
topten <- head(thisColumn[order(thisColumn[,2],decreasing=T),],10)
You could also use arrange_() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange_(genuscounts[,c(1,i)], paste0("desc(", names(genuscounts)[i], ")")), 10)
You'd need to use arrange_() instead of arrange() because your column name will be a string and not an object.
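In current dplyr versions arrange_() is deprecated; the same idea with the tidy-eval interface would look roughly like this (genuscounts and the column index i as in the question):
library(dplyr)
col_name <- names(genuscounts)[i]        # the column name as a string
topten <- genuscounts[, c(1, i)] %>%
  arrange(desc(.data[[col_name]])) %>%
  head(10)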
Hope this helps!!

Keeping only a list of certain columns in a very robust dataset?

I am trying to do a few things with some longitudinal data:
1) Combine several years' worth of data into one table
e.g.
data1996.csv,
data1997.csv,
...,
data2013.csv
2) Define a list of variables to keep
3) Drop all columns that do not match the list of keepers
4) Write the result dataset into a CSV file
require(data.table)
setwd("~/my/directory")
# Define names of the file paths
paths <- list()
# Make a list of the files I want to aggregate
for(i in 0:17)
{
paths[i]<- paste("MERGED",1996+i,"_PP.csv",sep="")
}
# Define the list of variables to keep
keeps <- list(
"CITY",
"ZIP",
"LONGITUDE",
"LATITUDE",
...
)
# Run fread on all files in the list of paths
out <- rbindlist(lapply(paths, fread), use.names=TRUE)
For some reason typeof(out) returns list
This is where I attempt to drop all columns except those in "keeps"
filteredOut <- out[,keeps,drop=FALSE]
But it just gives me a list of the 28 variables I want to keep
I tried this too:
filteredOut <- out[keeps]
but I get this error:
Error in `[.data.table`(out, keeps) :
When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
write.table(filteredOut, "testing.csv", sep=",")
My script appears to successfully combine the 17 years of data (I end up with 'out' which has 117905 obs. in 1729 variables)
Afterwards, I want to save to a csv:
write.table(filteredOut, "myfile.csv", sep=",")
I do get warnings as well, over 50 of them, but they appear to concern NULL values. The issues I am having are 1) understanding the data types involved (list, data.frame, data.table) and 2) the proper way to implement the drop command.
Any and all help is greatly appreciated!
We can unlist the 'keeps', and use with=FALSE to subset the columns.
out[, unlist(keeps), with=FALSE]
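Putting it together with the write-out step from the question (fwrite() is data.table's own writer and is an optional convenience here; write.table() works just as well):
library(data.table)
keep_cols   <- unlist(keeps)           # character vector of column names
filteredOut <- out[, ..keep_cols]      # same as out[, keep_cols, with = FALSE]
fwrite(filteredOut, "testing.csv")     # or write.table(filteredOut, "testing.csv", sep=",")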

R: row.names and data manipulation / export

I am having some issues understanding what row.names is, how it works, and how I can get my data to do the things that row.names allows.
For example, I am creating some clusters with the code below (my data). I want to export the results, which is what the sapply line does, but only to the screen for now. The first column of my data frame (path_country) contains country names and the other columns are other variables (integers). I don't see an easy way to export these clusters to a table or list of countries and their group membership.
I tried to make a dummy example using the example data sets in R, for instance mtcars, and it was then that I noticed the first column was denoted as row.names. With mtcars I can create clusters, cutree to the specified number of groups and then save as a data frame. With this approach I have the car names in the first column and the group number in the second column (more or less; it could be cleaned up to look nicer, but it is essentially what I am after), which is what I would like to happen with my data.
Any thoughts on this would be appreciated.
# my data
path_country <- read.csv("C:/path_country.csv")
patho <- subset(path_country, select=c(2:188))
patho.d <- dist(patho)
patho.hclust <- hclust(patho.d)
patho.hclust.groups11 = cutree(patho.hclust,11)
sapply(unique(patho.hclust.groups11),function(g)path_country$Country[patho.hclust.groups11 == g])
# mtcars data
car.d <- dist(mtcars)
car.h <- hclust(car.d)
car.h.11 <- cutree(car.h, 11)
nice_result <- as.data.frame(car.h.11)
write.table(nice_result, "test.txt", sep="\t")
1) You can create a data.frame with row.names directly from the CSV file:
# Names in the first column
path_country <- read.csv("C:/path_country.csv", row.names=1)
# Names in the column "Country"
path_country <- read.csv("C:/path_country.csv", row.names="Country")
Note that read.csv() uses header=TRUE by default, which the second call needs in order to look up the "Country" column by name.
Now rownames(path_country) should give you a vector of row names, and as.data.frame(patho.hclust.groups11) a nice result for export.
2) At any time you can set the row names of your data.frame with:
rownames(path_country) <- names.vector
where names.vector is a vector of unique names whose length equals the number of rows in the data.frame. In your example (cutree() returns a plain vector, so use names() rather than rownames()):
names(patho.hclust.groups11) <- path_country$Country
Note that if you are using the first approach you don't need this command.
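For the export the question is after, a hedged sketch that pairs each country with its cluster group (the output file name is illustrative):
nice_result <- data.frame(Country = path_country$Country,
                          group   = patho.hclust.groups11)
write.table(nice_result, "country_clusters.txt", sep="\t", row.names=FALSE)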
