Manipulating Excel files using R

I want to edit an existing Excel file using R. For example, ExcelFile_1 has the data, and I need to place the data from ExcelFile_1 into another file called ExcelFile_2, based on the column and row names.
ExcelFile_1:
Store Shipped Qty
1111 100
2222 200
ExcelFile_2:
Store Shipped Qty
1111
2222
If I were working with data frames, I would generally do:
ExcelFile_2$`Shipped Qty` <- ExcelFile_1$`Shipped Qty`[match(ExcelFile_2$`Store #`, ExcelFile_1$`Store #`)]
The above line works for my data frames, but I do not know how to apply this logic when writing to a worksheet with the XLConnect package. All I see is the following signature:
writeWorksheet(object,data,sheet,startRow,startCol,header,rownames)
I do not want to edit the data as a data frame and save it as another "worksheet" in an existing/new Excel file, as I want to preserve the formatting of ExcelFile_2.
For example: I want to change the value of ExcelFile_2 cell "B2" using the values from another sheet.
Could anyone please help me with the above problem?

Assuming your files are stored in your home directory and named one.xlsx and two.xlsx, you can do the following:
library(XLConnect)
# Load content of the first sheet of one.xlsx
df1 <- readWorksheetFromFile("~/one.xlsx", 1)
# Do what you like to df1 ...
# Write df1 to the first sheet of two.xlsx
wb2 <- loadWorkbook("~/two.xlsx")
writeWorksheet(wb2, df1, sheet = 1)
saveWorkbook(wb2)
If needed, you can also use startRow and startCol in both readWorksheetFromFile() and writeWorksheet() to specify exact rows and columns, and header to control whether the headers are read/written.
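To address the format-preservation point directly: XLConnect writes into the loaded workbook in place, so you can update just the Shipped Qty column after a match(). A minimal sketch, assuming the layout above (Store in column A, Shipped Qty in column B, headers in row 1; note that XLConnect turns the header "Shipped Qty" into Shipped.Qty on read):
library(XLConnect)
df1 <- readWorksheetFromFile("~/one.xlsx", sheet = 1)
wb2 <- loadWorkbook("~/two.xlsx")
df2 <- readWorksheet(wb2, sheet = 1)
# keep the existing cell styles instead of applying XLConnect's defaults
setStyleAction(wb2, XLC$"STYLE_ACTION.NONE")
# fill Shipped.Qty in file 2 by matching store numbers against file 1
df2$Shipped.Qty <- df1$Shipped.Qty[match(df2$Store, df1$Store)]
# write only that column back, starting at cell B2, leaving the header row alone
writeWorksheet(wb2, df2["Shipped.Qty"], sheet = 1,
               startRow = 2, startCol = 2, header = FALSE)
saveWorkbook(wb2)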

Related

Importing Excel-tables in R

Is there a way to import a named Excel-table into R as a data.frame?
I typically have several named Excel-tables on a single worksheet that I want to import as data.frames, without relying on static row and column references for the location of the Excel-tables.
I have tried to set namedRegion, which is an available argument for several Excel-import functions, but that does not seem to work for named Excel-tables. I am currently using the openxlsx package, which has a function getTables() that creates a variable with the Excel-table names from a single worksheet, but not the data in the tables.
Getting your named table takes a little bit of work.
First you need to load the workbook.
library(openxlsx)
wb <- loadWorkbook("name_excel_file.xlsx")
Next you need to extract the name of your named table.
# get the name and the range
tables <- getTables(wb = wb, sheet = 1)
If you have multiple named tables they are all in tables. My named table is called Table1.
Next you need to extract the column and row numbers, which you will later use to read the named table from the Excel file.
# get the range
table_range <- names(tables[tables == "Table1"])
table_range_refs <- strsplit(table_range, ":")[[1]]
# use a regex to extract out the row numbers
table_range_row_num <- gsub("[^0-9.]", "", table_range_refs)
# extract out the column numbers
table_range_col_num <- convertFromExcelRef(table_range_refs)
Now you re-read the Excel file with the cols and rows parameters.
# finally read it
my_df <- read.xlsx(xlsxFile = "name_excel_file.xlsx", sheet = 1,
                   cols = table_range_col_num[1]:table_range_col_num[2],
                   rows = table_range_row_num[1]:table_range_row_num[2])
You end up with a data frame with only the content of your named table.
I used this a while ago. I found the code somewhere, but I no longer remember where.
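Wrapped up as a helper, the whole recipe above looks like this (just a sketch; read_named_table is a made-up name, and it assumes the named table actually exists on the given sheet):
library(openxlsx)
read_named_table <- function(file, sheet, table_name) {
  wb <- loadWorkbook(file)
  tables <- getTables(wb = wb, sheet = sheet)
  # range of the requested table, e.g. "B3:E10"
  refs <- strsplit(names(tables[tables == table_name]), ":")[[1]]
  rows <- as.numeric(gsub("[^0-9.]", "", refs))
  cols <- convertFromExcelRef(refs)
  read.xlsx(xlsxFile = file, sheet = sheet,
            cols = cols[1]:cols[2], rows = rows[1]:rows[2])
}
my_df <- read_named_table("name_excel_file.xlsx", sheet = 1, "Table1")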
This link might be useful for you:
https://stackoverflow.com/a/17709204/10235327
1. Install XLConnect package
2. Save a path to your file in a variable
3. Load workbook
4. Save your data to df
To get the table names you can use the function
getTables(wb, sheet = 1, simplify = TRUE)
Where:
wb - your workbook
sheet - the sheet name or number
simplify - if TRUE (the default), the result is simplified to a vector
https://rdrr.io/cran/XLConnect/man/getTables-methods.html
Here's the code (not mine, copied from the topic above, just a bit modified)
require(XLConnect)
sampleFile <- "C:/Users/your.name/Documents/test.xlsx"
wb <- loadWorkbook(sampleFile)
myTable <- getTables(wb, sheet = 1)
df <- readTable(wb, sheet = 1, table = myTable)
You can check these packages:
library(xlsx)
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1)
library(readxl)
Data <- read_excel('YourFile.xlsx', sheet = 1)
Both options allow you to define specific regions to load the data into R.
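For instance, both packages can restrict the import to a region (a sketch; the ranges and indices here are made up):
library(readxl)
Data <- read_excel('YourFile.xlsx', sheet = 1, range = 'A1:C10')  # cell range
library(xlsx)
Data <- read.xlsx('YourFile.xlsx', sheetIndex = 1,
                  rowIndex = 1:10, colIndex = 1:3)  # row/column indices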
I use read.xlsx from package openxlsx. For example:
library(openxlsx)
fileA <- paste0(some.directory,'excel.file.xlsx')
A <- read.xlsx(fileA, startRow = 3)
Hope it helps.

How can I parse a JSON string in a txt file?

I need to parse two columns from a txt file in which I have:
first column: ID (with only one value)
second column: PARAMETERS (which has several JSON-like fields inside, see below)
Here is an example:
ID;PARAMETERS
Y;"""Pr""=>""22"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""99"", ""Opt1""=>""67"", ""Opt2""=>""0"",
S;"""Pr""=>""5"", ""Err""=>""255"", ""Opt1""=>""0"", ""Opt2""=>""0"", ""Opt3""=>""55"", ""Opt4""=>""0"",
K;"""Pr""=>""1"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""21"", ""Opt1""=>""0"", ""Opt2""=>""0"",
P;"""Pr""=>""90"", ""Err""=>""0"", ""DryT""=>""0"", ""Lang""=>""20"", ""Opt1""=>""0"", ""Opt2""=>""0"",
My dataset is in CSV format, but I have also tried transforming it into a txt file or a JSON file; either way, when I import it into R I cannot parse the two columns.
I would like to obtain each parameter in its own column, and if an ID does not have a parameter I need NA.
Can you help me please?
I have tried this R code but it does not work:
setwd("D:/")
df <- read.delim("file name", header = TRUE, na.strings = -1)
install.packages("jsonlite")
library(jsonlite)
filepath<-"file name"
prova <- fromJSON(filepath)
Can you help me please?
Thanks
You can import the CSV directly in RStudio with multiple functions:
data <- read.csv("your_data.csv")
Otherwise you can load it via 'Import Dataset' in the Environment tab. It will open a new window in which you can browse to your file without setting the working directory.
To set NA for 0 values, here is an example
# I create a new data frame for this example;
# you will have another one if you succeed in importing your data.
data <- data.frame(ID = c("H1","H2","H3"), PARAMETERS_Y = c(42,0,3))
data[which(data$PARAMETERS_Y == 0),2] = NA
This looks for rows where the PARAMETERS_Y column equals 0 and replaces them with NA. Don't forget to change the column name to match the data you loaded previously.
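Note that the PARAMETERS column in the sample is not actually JSON: the ""Key""=>""Value"" pairs are hstore-style, which is why fromJSON() fails on them. Below is a minimal sketch that splits the pairs with regular expressions instead; it assumes the exact layout shown above (one header line, ; separating ID from PARAMETERS, and at least one pair per row):
lines <- readLines("D:/yourfile.txt")[-1]    # drop the ID;PARAMETERS header
ids   <- sub(";.*$", "", lines)              # text before the first ';'
parms <- sub("^[^;]*;", "", lines)           # text after the first ';'

# extract every Key=>Value pair, tolerating single or doubled quotes
parse_params <- function(s) {
  pairs <- regmatches(s, gregexpr('"+([^"=>]+)"+\\s*=>\\s*"+([^"]*)"+', s))[[1]]
  keys <- sub('^"+([^"]+)"+\\s*=>.*$', "\\1", pairs)
  vals <- sub('^.*=>\\s*"+([^"]*)"+$', "\\1", pairs)
  setNames(as.list(vals), keys)
}

params   <- lapply(parms, parse_params)
all_keys <- unique(unlist(lapply(params, names)))

# one column per parameter; NA where an ID lacks that parameter
wide <- do.call(rbind, lapply(params, function(p) setNames(unlist(p)[all_keys], all_keys)))
result <- data.frame(ID = ids, wide, stringsAsFactors = FALSE)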

When writing to an excel sheet, how can I add a new row each time I run the script?

I've written a script that pulls and summarizes data from 2 databases regularly. I'm using write.xlsx to write 3 data frames to 3 different tabs:
Tab1 is data from one database,
Tab2 is data from another, and
Tab3 is data summarized from both.
It is fine if Tabs 1 & 2 are overwritten with each run, but I'd like Tab 3 to add a new row (with Sys.Date() in column 1) each run, thus giving me a history of the data summaries. Is this possible, and how?
append=TRUE is not getting me what I'm looking for. My next (less desirable) option is to just write the summarized data to a separate file with Sys.Date() in the filename.
# Write to spreadsheet w/ 3 tabs
write.xlsx(df1, file = "file_location\\filename.xlsx", sheetName = "dataset1", row.names = FALSE, append = FALSE) #write to tab1
write.xlsx(df2, file = "file_location\\filename.xlsx", sheetName = "dataset2", row.names = FALSE, append = TRUE) #write to tab2
write.xlsx(summary_df, file = "file_location\\filename.xlsx", sheetName = "summarized_data", row.names = FALSE, append = TRUE) #write to next blank row in tab3
The script is writing over the summary data with each run. I would like it to append it to the next blank row.
The openxlsx package allows you to specify not only the row at which to append new data but also to extract the last row currently in a workbook sheet. The trick is to write the date and the new summary data as DataTable objects, which are specific to openxlsx. The writeDataTable function does this, and its only required arguments are a workbook object, the name of a sheet in that workbook, and an R data.frame. The workbook object can be obtained with the loadWorkbook function, which simply takes the path to the Excel file. You can also read in an entire Excel workbook with multiple sheets, then extract and modify individual worksheets before saving them back to the original workbook.
To create the initial workbook so that later updates are compatible with it, simply replace the write.xlsx statements with writeDataTable statements as follows:
library(openxlsx)
wb <- createWorkbook()
# the sheets must exist before you can write tables to them
addWorksheet(wb, "dataset1")
addWorksheet(wb, "dataset2")
addWorksheet(wb, "summarized_dat")
writeDataTable(wb, sheet = "dataset1", x = df1)
writeDataTable(wb, sheet = "dataset2", x = df2)
writeDataTable(wb, sheet = "summarized_dat", x = summary_df)
saveWorkbook(wb, "file_location\\filename.xlsx")
After creating the initial workbook you can simply load it and modify the sheets individually. Since you mentioned that dataset1 and dataset2 may be overwritten, I will focus on summarized_dat; you can just re-use the code above to overwrite the other two:
library(openxlsx)
wb <- loadWorkbook("file_location\\filename.xlsx")
Now you can use the function below to append the new summarized data together with the date. Here I converted the date to a data.frame object, but you could also use the writeData function. I like writeDataTable because getTables then makes it easier to extract the last row. The output of getTables looks like this, using one of my workbooks with three tables stacked vertically as an example:
[1] "A1:P61" "A62:A63" "A64:T124"
The trick is then to extract the last row number which in this case is 124. So here is a function I just wrote quickly that will do all of this for you automatically and takes a workbook object, name of the sheet you wish to modify, and the updated summary table saved as a data.frame object:
Update_wb_fun <- function(wb, sheetname, newdata){
  tmp_wb <- wb # work on a copy so the original workbook object is preserved as a check
  # Excel ranges of all DataTables on the worksheet; on the first run there is
  # only one, on later runs there are several and you want the last one
  table_cells <- names(getTables(tmp_wb, sheetname))
  # extract the last row number from the final range (e.g., "A64:T124" -> 124)
  lastrow <- as.numeric(gsub("[A-Za-z]", "", unlist(strsplit(table_cells[length(table_cells)], ":"))[2]))
  start_time_row <- lastrow + 1
  # append the time stamp as a DataTable, then the new data beneath it
  writeDataTable(tmp_wb, sheet = sheetname, x = data.frame(Sys.Date()), startRow = start_time_row)
  writeDataTable(tmp_wb, sheet = sheetname, x = newdata, startRow = start_time_row + 2)
  return(tmp_wb)
}
The code to extract lastrow looks a little complicated but only uses base R functions. length picks out the last element of the example output above (e.g., "A64:T124"); strsplit splits that string at the colon, and unlist turns the result into a vector so the second element (e.g., "T124") can be taken; gsub then strips the letters, keeping only the row number (e.g., "124"); finally, as.numeric converts it from character to numeric.
So to call this function and update the "summarized_dat" worksheet do the following:
test_wb <- Update_wb_fun(wb, "summarized_dat", summary_df) #Here I am preserving the original workbook object before modification but you could save to wb directly.
saveWorkbook(test_wb, "file_location\\filename.xlsx", overwrite = TRUE) # the overwrite option is needed if the file already exists
So that is how you could append to the "summarized_dat" sheet. Before the final save you could also update the first two sheets by re-running the writeDataTable statements earlier before we wrote the function. Let me know if this is at all what you are looking for and I could modify this code easily. Good luck!
I would check out the openxlsx package - it allows you to write data starting at a given row in an Excel file. The only catch is that you will need to read the current Excel file in order to determine which line is the last one containing data. You can do that with the readxl package, using nrow() of the data frame you create from the Excel file to determine the number of rows.
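A minimal sketch of that approach, assuming the summary lives on the "summarized_data" sheet of filename.xlsx as in the question:
library(readxl)
library(openxlsx)
# the number of existing rows tells us where the next blank row is
existing <- read_excel("file_location\\filename.xlsx", sheet = "summarized_data")
next_row <- nrow(existing) + 2  # +1 for the header row, +1 to land on the first blank row
wb <- loadWorkbook("file_location\\filename.xlsx")
writeData(wb, sheet = "summarized_data", x = summary_df,
          startRow = next_row, colNames = FALSE)
saveWorkbook(wb, "file_location\\filename.xlsx", overwrite = TRUE)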

Crosschecking Two Text Files with R

I need help initiating/writing an R script that can cross-check one .txt file with another to find where there are multiple instances. I have an Excel file with a list of miRNAs that I want to cross-reference with another Excel file that has a column containing the same miRNA names.
So is it a text file or an Excel file you are importing? Without more details, the file structure, or a reproducible example, it will be hard to help.
You can try:
# Get the correct package
install.packages('readxl')
df1 <- readxl::read_excel('[directory name][file name].xlsx')
df2 <- readxl::read_excel('[directory name][file name].xlsx')
# Create a flag variable indicating whether each miRNA in the first dataset appears in the second
df1$is_in_df2 <- ifelse(df1$miRNA %in% df2$miRNA, 'Yes', 'No')
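If you want the overlapping rows themselves rather than a yes/no flag, subsetting or merging also works (assuming the shared column is named miRNA in both files):
overlap <- df1[df1$miRNA %in% df2$miRNA, ]  # rows of df1 that also occur in df2
both <- merge(df1, df2, by = "miRNA")       # columns from both files for shared miRNAs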

R - Import & Merge Multiple Excel Files And Add Filesource Variable

I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different Excel files containing data on students. Each Excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import data from Excel. Using the xlsx package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (I also thought this was where I should add the filename variable to the data files). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the Excel file names in the folder, and then try to merge them in one statement, using a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment as an answer (with bdemerast having picked up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR to read.xlsx2.
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
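To finish the original goal, wrap that corrected lapply() back in the do.call("rbind", ...) from the question (a sketch; colIndex is dropped so that all variables are kept, and it assumes every file has the same columns):
combined <- do.call("rbind",
  lapply(filenames, function(x)
    read.xlsx2(file = x, sheetIndex = 1, header = TRUE,
               startRow = 2, FILENAMEVAR = x)))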
