Trying to convert XML into a data frame - R

I am downloading zip files from this location:
http://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/2019/NEMDE_2019_03/NEMDE_Market_Data/NEMDE_Files/NemPriceSetter_20190301_xml.zip
The zip file contains multiple XML files which I am trying to read, but because of the way the XML is structured I cannot parse it properly or convert it into a data frame.
I have tried downloading the zip file into a temporary directory and then parsing one file at a time:
library(xml2)
library(tidyverse)
library(XML)

# Download the archive to a temporary file and unzip it there
tf <- tempfile(tmpdir = tdir <- tempdir())
download.file("http://nemweb.com.au/Data_Archive/Wholesale_Electricity/NEMDE/2019/NEMDE_2019_03/NEMDE_Market_Data/NEMDE_Files/NemPriceSetter_20190301_xml.zip", tf, mode = "wb")
xml_files <- unzip(tf, exdir = tdir)

# Parse the first file and try to flatten the PriceSetting nodes
doc <- xmlParse(xml_files[1])
a <- xmlToDataFrame(nodes = getNodeSet(doc, "//SolutionAnalysis/PriceSetting"))

# Clean up the temporary directory
unlink(tdir, recursive = TRUE, force = TRUE)
This is how the XML file looks, and I am trying to put the information into specific columns of a data frame.
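One thing worth checking (an assumption based on how these NEMDE files appear to be laid out): if the PriceSetting nodes carry their values as XML attributes rather than as child elements, xmlToDataFrame() will return an empty result, because it only reads child-element text. In that case the attributes can be pulled directly with xml2. A minimal sketch, assuming attribute-based nodes:

library(xml2)
library(purrr)

doc <- read_xml(xml_files[1])
# Strip any default namespace so the plain XPath below matches
xml_ns_strip(doc)
nodes <- xml_find_all(doc, "//PriceSetting")

# One row per PriceSetting node; attributes missing on some nodes become NA
prices <- map_dfr(nodes, ~ as.list(xml_attrs(.x)))

The resulting columns will all be character, so a follow-up type_convert() or mutate(across(...)) step may be needed depending on the actual attribute names in the file.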

Related

Read in feather file directly from GitHub in R

How can I read in a .feather file from the web (e.g. GitHub) in R? I can read formats such as .csv or .dta from GitHub directly as raw files:
# CSV
coursedata <- read.csv(file = 'https://raw.githubusercontent.com/MarcoKuehne/seminars_in_applied_economics/main/Data/GF_2020.csv')
# DTA
library(haven)
soep <- read_dta("https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/soep_lebensz_en.dta?raw=true")
But the same approach fails with arrow's read_feather:
library(arrow)
digital <- read_feather("https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/Digital_Literacy_EN.feather?raw=true")
Is there a direct way or a nested command? Or am I required to download the file manually or programmatically as a temporary file?
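One approach that should work (a sketch; feather is a binary, random-access format, so arrow generally wants a local file rather than a URL connection) is to download to a temporary file programmatically and read from there:

library(arrow)

url <- "https://github.com/MarcoKuehne/seminars_in_applied_economics/blob/main/Data/Digital_Literacy_EN.feather?raw=true"
tmp <- tempfile(fileext = ".feather")
# mode = "wb" matters on Windows: feather is a binary format
download.file(url, tmp, mode = "wb")
digital <- read_feather(tmp)

This is still a programmatic one-step read from the user's point of view, just routed through a temp file instead of a connection.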

Read .tar.gz file in R

I have a compressed file, cat.txt.tar.gz, which I just need to load into R and process as follows:
zip <- "cat.txt.tar.gz"
data <- read.delim(file = untar(zip, "cat.txt"), sep = "\t")
but "data" is empty while running the code.Is there any way to read a file from .tar.gz
Are you sure your file is named correctly?
Usually compressed files are named cat.tar.gz, excluding the .txt.
Second, note that untar() extracts the archive and returns a status code, not a file name, so you cannot pass its result straight to read.delim(). Extract first, then read the extracted file:
tarfile <- "cat.txt.tar.gz" # Or "cat.tar.gz" if that is right
untar(tarfile)              # extracts into the current working directory
data <- read.delim("cat.txt", sep = "\t")
If this doesn't work, check where the file was extracted to; untar(tarfile, list = TRUE) lists the archive contents without extracting anything.
To read a particular csv or txt inside a .tar.gz archive without extracting it first, one can use library(archive):
library(archive)
library(readr)
# file = 1 refers to the first entry in the archive; the data is
# tab-separated, so use read_tsv (read_csv has no sep argument)
data <- read_tsv(archive_read("cat.txt.tar.gz", file = 1), col_types = cols())
should work.

R issues vs Excel

I am working on a project where I have multiple Excel files, each containing multiple worksheets. I have to get the data from one of the worksheets (say sheet = 6), and then store all of these data in a new .xls or .csv file.
I am facing an issue while trying to read the data from the files and store it in a list; I get the following error:
Error: `path` does not exist: ‘BillingReport___Gurgaon-Apr-2019.xlsx’
I am using the map_dfr function to get the data.
library(purrr)
library(readxl)
library(dplyr)
library(rio)
library(XLConnect)
library(tidyverse)
setwd = "F:/Capstone/Billing Reports final/"  # note: this assigns a variable named `setwd`; it does not change the working directory
# Set path of Billing source folder
billingptah <- "F:/Capstone/Billing Reports final/"
# Set path of destination folder
csvexportpath <- "F:/Capstone/Billing_data/billing_data.csv"
# Get the names of the files to be loaded
files_to_load <- list.files(path = billingptah)
files_to_load
# Load all the data into one file
billing_data <- map_dfr(files_to_load, function(x)
  map_dfr(excel_sheets(x), function(y)
    read_excel(path = x, sheet = 6, col_types = "text") %>% mutate(sheet = 6)
  ) %>% mutate(filename = x)
)
The following is the error message:
Error: `path` does not exist: ‘BillingReport___Gurgaon-Apr-2019.xlsx’
It is all about the difference between relative and absolute paths. You're telling R to load a file named ‘BillingReport___Gurgaon-Apr-2019.xlsx’ located in your current working directory. You need to prepend the directory path to each file name. Try this after building files_to_load:
files_to_load <- paste0(billingptah, files_to_load)
This tells R to access the files named in files_to_load, located in the billingptah directory.
Edit
Let me point you to some useful links:
https://www.reed.edu/data-at-reed/resources/R/reading_and_writing.html
And for best practices: https://stat.ethz.ch/R-manual/R-devel/library/base/html/file.path.html
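As the file.path documentation suggests, a slightly more robust variant (a sketch reusing the same billingptah variable) is to let list.files() build the full paths itself:

# full.names = TRUE returns "F:/Capstone/.../file.xlsx" instead of bare
# file names, so no manual paste0() step is needed; the pattern also
# filters out any non-Excel files in the folder
files_to_load <- list.files(path = billingptah, pattern = "\\.xlsx$", full.names = TRUE)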

Dynamically converting a list of Excel files to csv files in R

I currently have a folder containing all Excel (.xlsx) files, and using R I would like to automatically convert all of these files to CSV files using the "openxlsx" package (or some variation). I currently have the following code to convert one of the files and place it in the same folder:
convert("team_order\\team_1.xlsx", "team_order\\team_1.csv")
I would like to automate the process so it is applied to all the files in the folder, and also removes the current xlsx files so only the csv files remain. Thanks!
You can try this using rio, since it seems like that's what you're already using:
library("rio")
xls <- dir(pattern = "xlsx")
created <- mapply(convert, xls, gsub("xlsx", "csv", xls))
unlink(xls) # delete xlsx files
Alternatively, using readxl and base R:
library(readxl)
# Create a vector of Excel files to read
files.to.read = list.files(pattern="xlsx")
# Read each file and write it to csv
lapply(files.to.read, function(f) {
  df = read_excel(f, sheet = 1)
  write.csv(df, gsub("xlsx", "csv", f), row.names = FALSE)
})
You can remove the files with the command below. However, this is dangerous to run automatically right after the previous code: if the previous code fails for some reason, the code below will still delete your Excel files.
lapply(files.to.read, file.remove)
You could wrap it in a tryCatch block to be safe.
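A minimal sketch of that idea, building on the files.to.read vector above: only delete each source file when its own conversion succeeded.

for (f in files.to.read) {
  ok <- tryCatch({
    df <- read_excel(f, sheet = 1)
    write.csv(df, gsub("xlsx", "csv", f), row.names = FALSE)
    TRUE
  }, error = function(e) FALSE)
  if (ok) file.remove(f)  # remove the xlsx only after a successful conversion
}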

Downloading and opening zipfile

I would like to download the following zip file, which contains 2 files, and open it to manipulate the data. Presently I can only download it manually:
https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip
I would also like to download the following zip file, which contains a series of nested folders with data that I am interested in manipulating. Presently I am only downloading it manually:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
Can anyone advise me how to do it through coding in R?
Here's a way to do it:
fn <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
# Download to a temporary .zip (mode = "wb" keeps the binary intact on Windows)
download.file(fn, tf <- tempfile(fileext = ".zip"), mode = "wb")
# Extract into a fresh subdirectory of the session temp directory
unzip(tf, exdir = td <- file.path(tempdir(), "myzip"))
# List everything that was extracted, including nested folders
(list.files(td, full.names = TRUE, recursive = TRUE))
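From there the extracted files can be read directly. For the NEI archive the contents are .rds files, so something like the sketch below; the file name is an assumption based on what that course dataset usually ships with, so adjust it to whatever list.files(td) actually reports:

# Read one of the extracted files; the name below is an assumption,
# use whatever list.files(td) returned on your machine
nei <- readRDS(file.path(td, "summarySCC_PM25.rds"))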
