Importing Excel tables in R

Is there a way to import a named Excel table into R as a data.frame?
I typically have several named Excel tables on a single worksheet that I want to import as data.frames, without relying on static row and column references for the location of the tables.
I have tried setting namedRegion, which is an available argument for several Excel import functions, but it does not seem to work for named Excel tables. I am currently using the openxlsx package, which has a function getTables() that returns the Excel table names from a single worksheet, but not the data in the tables.

Getting your named table takes a little bit of work.
First you need to load the workbook.
library(openxlsx)
wb <- loadWorkbook("name_excel_file.xlsx")
Next you need to extract the name of your named table.
# get the name and the range
tables <- getTables(wb = wb,
                    sheet = 1)
If you have multiple named tables they are all in tables. My named table is called Table1.
Next you need to extract the column numbers and row numbers, which you will later use to extract the named table from the Excel file.
# get the range
table_range <- names(tables[tables == "Table1"])
table_range_refs <- strsplit(table_range, ":")[[1]]
# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)
# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)
Now you re-read the Excel file with the cols and rows parameters.
# finally read it
my_df <- read.xlsx(xlsxFile = "name_excel_file.xlsx",
                   sheet = 1,
                   cols = table_range_col_num[1]:table_range_col_num[2],
                   rows = table_range_row_num[1]:table_range_row_num[2])
You end up with a data frame containing only the content of your named table.
I used this a while ago. I found the code somewhere, but I no longer remember where.
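If you do this often, the steps above can be wrapped into a single helper. This is just a sketch based on the code above; read_excel_table is a made-up name, not an openxlsx function.
library(openxlsx)
# Hypothetical helper: read a named Excel table into a data frame
read_excel_table <- function(file, sheet, table_name) {
  wb <- loadWorkbook(file)
  tables <- getTables(wb = wb, sheet = sheet)
  table_range <- names(tables[tables == table_name])
  refs <- strsplit(table_range, ":")[[1]]
  rows <- as.numeric(gsub("[^0-9]", "", refs))
  cols <- convertFromExcelRef(refs)
  read.xlsx(xlsxFile = file, sheet = sheet,
            rows = rows[1]:rows[2],
            cols = cols[1]:cols[2])
}
my_df <- read_excel_table("name_excel_file.xlsx", 1, "Table1")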

This link might be useful for you:
https://stackoverflow.com/a/17709204/10235327
1. Install XLConnect package
2. Save a path to your file in a variable
3. Load workbook
4. Save your data to df
To get the table names you can use the function
getTables(wb, sheet = 1, simplify = TRUE)
Where:
wb - your workbook
sheet - the sheet name or number
simplify = TRUE (the default) simplifies the result to a vector
https://rdrr.io/cran/XLConnect/man/getTables-methods.html
Here's the code (not mine, copied from the topic above and modified a bit):
require(XLConnect)
sampleFile <- "C:/Users/your.name/Documents/test.xlsx"
wb <- loadWorkbook(sampleFile)
myTable <- getTables(wb, sheet = 1)
df <- readTable(wb, sheet = 1, table = myTable)

You can check the following packages:
library(xlsx)
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1)
library(readxl)
Data <- read_excel('YourFile.xlsx', sheet = 1)
Both options allow you to define specific regions to load the data into R.
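For example, to read only a specific region (file names and ranges here are placeholders):
# xlsx: pick rows and columns by index
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1, rowIndex = 3:8, colIndex = 2:4)
# readxl: restrict to a rectangular range
Data <- read_excel('YourFile.xlsx', sheet = 1, range = "B3:D8")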

I use read.xlsx from package openxlsx. For example:
library(openxlsx)
fileA <- paste0(some.directory,'excel.file.xlsx')
A <- read.xlsx(fileA, startRow = 3)
Hope it helps.

Related

How can I parse a json string on a txt file?

I need to parse two columns from a txt file in which I have:
first column: id (with only one value)
second column: parameters (which has more json fields inside, SEE BELOW).
Here is an example:
ID;PARAMETERS
Y;"""Pr""=>""22"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""99"", ""Opt1""=>""67"", ""Opt2""=>""0"",
S;"""Pr""=>""5"", ""Err""=>""255"", ""Opt1""=>""0"", ""Opt2""=>""0"", ""Opt3""=>""55"", ""Opt4""=>""0"",
K;"""Pr""=>""1"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""21"", ""Opt1""=>""0"", ""Opt2""=>""0"",
P;"""Pr""=>""90"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""20"", ""Opt1""=>""0"", ""Opt2""=>""0"",
My dataset is in csv format, but I have also tried transforming it into a txt file and a json file, and I tried to import it into R, but I cannot parse the two columns.
I would like to obtain each parameter in its own column, and if an ID does not have a parameter I need NA.
Can you help me please?
I have tried this R code but it does not work:
setwd("D:/")
df <- read.delim("file name", header = TRUE, na.strings = -1)
install.packages("jsonlite")
library(jsonlite)
filepath <- "file name"
prova <- fromJSON(filepath)
Can you help me please?
Thanks
You can import the csv directly in RStudio with multiple functions:
data <- read.csv("your_data.csv")
Otherwise you can load it via 'Import Dataset' in the Environment tab. It will open a new window in which you can browse to your file without setting the working directory.
To set NA for 0 values, here is an example:
# I create new data for the example;
# you will have your own if you succeed in importing your data.
data <- data.frame(ID = c("H1","H2","H3"), PARAMETERS_Y = c(42,0,3))
data[which(data$PARAMETERS_Y == 0),2] = NA
This looks for rows in the PARAMETERS_Y column that are equal to 0 and replaces them with NA. Don't forget to change the column name to match the data you loaded previously.
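Note that the PARAMETERS column is not valid JSON, which is likely why fromJSON fails; it is a list of "key"=>"value" pairs. Here is a rough sketch of how you could split it into one column per parameter with base R, assuming a ;-separated file with columns ID and PARAMETERS, read with quoting disabled so the doubled quotes survive:
# read the raw file, keeping the literal quotes
raw <- read.table("your_file.txt", sep = ";", header = TRUE,
                  quote = "", stringsAsFactors = FALSE)
# extract the "key"=>"value" pairs from one PARAMETERS string
parse_params <- function(s) {
  pairs <- regmatches(s, gregexpr('"+[^">]+"+=>"+[^",]*"+', s))[[1]]
  keys <- sub('^"+([^">]+)"+=>.*$', "\\1", pairs)
  vals <- sub('^.*=>"+([^",]*)"+$', "\\1", pairs)
  setNames(vals, keys)
}
param_list <- lapply(raw$PARAMETERS, parse_params)
# one column per key; IDs missing a parameter get NA automatically
all_keys <- unique(unlist(lapply(param_list, names)))
wide <- t(sapply(param_list, function(p) p[all_keys]))
colnames(wide) <- all_keys
result <- data.frame(ID = raw$ID, wide, stringsAsFactors = FALSE)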

When writing to an excel sheet, how can I add a new row each time I run the script?

I've written a script that pulls and summarizes data from 2 databases regularly. I'm using write.xlsx to write 3 data frames to 3 different tabs:
Tab1 is data from one database,
Tab2 is data from another, and
Tab3 is data summarized from both.
It is fine if Tabs 1 and 2 are overwritten with each run, but I'd like Tab3 to add a new row (with Sys.Date in col1) each run, thus giving me a history of the data summaries. Is this possible, and how?
append=TRUE is not getting me what I'm looking for. My next option (less desirable) is to just write the summarized data to a separate file with Sys.Date in the filename.
# Write to spreadsheet w/ 3 tabs
write.xlsx(df1, file = "file_location\\filename.xlsx", sheetName = "dataset1", row.names = FALSE, append = FALSE) #write to tab1
write.xlsx(df2, file = "file_location\\filename.xlsx", sheetName = "dataset2", row.names = FALSE, append = TRUE) #write to tab2
write.xlsx(summary_df, file = "file_location\\filename.xlsx", sheetName = "summarized_data", row.names = FALSE, append = TRUE) #write to next blank row in tab3
The script is writing over the summary data with each run. I would like it to append it to the next blank row.
The openxlsx package allows you to specify not only the row to which you wish to append the new data, but also to extract the last row of the current workbook sheet. The trick is to write the date and the new summary data as DataTable objects, which are specific to the openxlsx package. The function writeDataTable does this, and its only required arguments are the workbook object, the name of the sheet in the workbook, and an R data.frame object. The workbook object can be obtained with the loadWorkbook function, which simply takes the path to the Excel file. You can also read in an entire Excel workbook with multiple sheets, then extract and modify the individual worksheets before saving them back to the original workbook.
To create the initial workbook so that the updates are compatible to the original workbook simply replace the write.xlsx statements with writeDataTable statements as follows:
library(openxlsx)
wb <- createWorkbook()
# sheets must be added to a new workbook before writing to them
addWorksheet(wb, "dataset1")
addWorksheet(wb, "dataset2")
addWorksheet(wb, "summarized_dat")
writeDataTable(wb, sheet = "dataset1", x = df1)
writeDataTable(wb, sheet = "dataset2", x = df2)
writeDataTable(wb, sheet = "summarized_dat", x = summary_df)
saveWorkbook(wb, "file_location\\filename.xlsx")
So after creating the initial workbook you can now simply load that and modify the sheets individually. Since you mentioned you can overwrite dataset1 and dataset2 I will focus on summarized_dat because you can just use the code above to overwrite those previous datasets:
library(openxlsx)
wb <- loadWorkbook("file_location\\filename.xlsx")
Now you can use the function below to append the new summarized data together with the date. For this I converted the date to a data.frame object, but you could also use the writeData function. I like writeDataTable because you can use getTables to extract the last row more easily. The output from calling getTables looks like this, using one of my workbooks with three tables stacked vertically as an example:
[1] "A1:P61" "A62:A63" "A64:T124"
The trick is then to extract the last row number, which in this case is 124. Here is a function I wrote quickly that does all of this automatically; it takes a workbook object, the name of the sheet you wish to modify, and the updated summary table as a data.frame object:
Update_wb_fun <- function(wb, sheetname, newdata){
  # copy, so the original workbook object is preserved as a check
  tmp_wb <- wb
  # Excel ranges of all DataTables in the worksheet; on the first run
  # there is only one, on later runs there are more and we want the last
  table_cells <- names(getTables(tmp_wb, sheetname))
  # extract the last row number from the last range, e.g. "A64:T124" -> 124
  lastrow <- as.numeric(gsub("[A-Za-z]", "",
                             unlist(strsplit(table_cells[length(table_cells)], ":"))[2]))
  start_time_row <- lastrow + 1
  # append the time stamp as a DataTable
  writeDataTable(tmp_wb, sheet = sheetname, x = data.frame(Sys.Date()),
                 startRow = start_time_row)
  # append the new data under the time stamp
  writeDataTable(tmp_wb, sheet = sheetname, x = newdata,
                 startRow = start_time_row + 2)
  return(tmp_wb)
}
The code to extract lastrow looks a little complicated but uses only base R functions. The length function extracts the last element from the example output above (e.g. "A64:T124"). strsplit separates the string at the colon, and unlist turns the result into a vector so we can take the second element (e.g. "T124"). Finally, gsub removes the letters, keeping only the row number (e.g. "124"), and as.numeric converts it from a character to a numeric object.
So to call this function and update the "summarized_dat" worksheet do the following:
# preserving the original workbook object before modification; you could save to wb directly
test_wb <- Update_wb_fun(wb, "summarized_dat", summary_df)
# need overwrite = TRUE since the file already exists
saveWorkbook(test_wb, "file_location\\filename.xlsx", overwrite = TRUE)
So that is how you could append to the "summarized_dat" sheet. Before the final save you could also update the first two sheets by re-running the earlier writeDataTable statements. Let me know if this is what you are looking for and I can modify the code easily. Good luck!
I would check out the openxlsx package - it allows you to write data starting at a given row in an Excel file. The only catch is that you will need to read the current Excel file to determine which row is the last one containing data. You can do that with the readxl package, using nrow() on the data frame you create from the Excel file to get the number of rows.
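A minimal sketch of that approach, using the file and sheet names from the question (writeData and the +1 for the header row are my assumptions):
library(openxlsx)
library(readxl)
file <- "file_location\\filename.xlsx"
# find the last used row on the summary sheet
existing <- read_excel(file, sheet = "summarized_data")
last_row <- nrow(existing) + 1  # +1 accounts for the header row
# append the new summary just below it
wb <- loadWorkbook(file)
writeData(wb, sheet = "summarized_data", x = summary_df,
          startRow = last_row + 1, colNames = FALSE)
saveWorkbook(wb, file, overwrite = TRUE)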

Read specific non-adjacent columns from Excel file [duplicate]

So I have a bunch of Excel files I want to loop through, reading specific, discontinuous columns into a data frame. Using readxl works for the basic stuff, like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "*.xls")
dflist <- list()
for (i in wb){
  dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip = 3, col_names = TRUE))
}
# now put them into a data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely), but the problem is my Excel files have about 114 columns and I only want specific ones. Also, I do not want to let R guess the col_types, because it messes some of them up (e.g. for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric, and crashes). So my question is: how do I specify specific, discontinuous columns to read? The range argument uses the cellranger package, which does not allow reading discontinuous columns. Any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric index, specifying which columns to read.
It seems it reads all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
The potential problem is that XLConnect requires rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices, and the colTypes parameter deals with column types. This works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
Edit:
The readxl library works well for both .xls and .xlsx if you want to read a range (rectangle) from your Excel file, e.g.:
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
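If you want to stay with readxl but still need discontinuous columns, one workaround (just a sketch, with placeholder ranges) is to read each contiguous block separately and bind the columns together:
library(readxl)
part1 <- read_xlsx("test.xlsx", range = cell_cols("B:C"))
part2 <- read_xlsx("test.xlsx", range = cell_cols("F:G"))
data <- cbind(part1, part2)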

How to not overwrite file in R

I am trying to copy and paste tables from R into Excel. Consider the following code from a previous question:
data <- list.files(path = getwd())
n <- length(data)
for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[, 2]
  outline <- as.data.frame(table(outline))
  print(outline) # this prints all n tables
  name <- paste0(i, "X.csv")
  write.csv(outline, name)
}
This code writes each table into a separate .csv file (i.e. "1X.csv", "2X.csv", etc.). Is there any way of "shifting" each table down some rows instead of rewriting the previous table each time? I have also tried this code:
output <- as.data.frame(output)
wb <- loadWorkbook("X.xlsx", create = TRUE)
createSheet(wb, name = "output")
writeWorksheet(wb, output, sheet = "output", startRow = 1, startCol = 1)
writeNamedRegion(wb, output, name = "output")
saveWorkbook(wb)
But this does not copy the dataframes exactly into Excel.
I think, as mentioned in the comments, the way to go is to first merge the data frames in R and then write them into one output file:
# get vector of filenames
filenames <- list.files(path=getwd())
# for each filename: load file and create outline
outlines <- lapply(filenames, function(filename) {
  data <- read.csv(filename)
  outline <- data[, 2]
  outline <- as.data.frame(table(outline))
  outline
})
# merge all outlines into one data frame (by appending them row-wise)
outlines.merged <- do.call(rbind, outlines)
# save merged data frame
write.csv(outlines.merged, "all.csv")
Despite what Microsoft would like you to believe, .csv files are not Excel files; they are a common file type that can be read by Excel and many other programs.
The best approach depends on what you really want to do. Do you want all the tables read into a single worksheet in Excel? If so, you could write to a single file, appending each table (note that write.table has an append argument; write.csv ignores it). Or use a connection that you keep open so each new table is appended. You may want to use cat to put a couple of newlines before each new table.
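A sketch of that approach, reusing the filenames vector from the answer above (write.table is used because write.csv ignores append):
out <- "all_tables.csv"
for (i in seq_along(filenames)) {
  outline <- as.data.frame(table(read.csv(filenames[i])[, 2]))
  if (i > 1) cat("\n", file = out, append = TRUE)  # blank line between tables
  write.table(outline, out, sep = ",", row.names = FALSE, append = i > 1)
}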
Your second attempt looks like it uses the XLConnect package (you don't say, so it could be something else). I would think that is the best approach; how is the result different from what you are expecting?
