I'm fairly new to working with XML files in R, but I have at least gotten further with other files than with this specific one.
Quick background: I receive data in the attached format, but I cannot convert it into a data frame (which I have managed with other files). Somehow my usual procedure doesn't work here. My goal is to turn the data into a data frame. Normally I would just use xmlToDataFrame(), but that gives me the following error:
unable to find an inherited method for function ‘xmlToDataFrame’ for
signature ‘"xml_document", "missing", "missing", "missing", "missing"’
Then I tried the sequence below:
data <- read_xml("file.xml")            # xml2 package
xmlimport <- xmlTreeParse("file.xml")   # XML package
topxml <- xmlRoot(xmlimport)
topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
That gave me the attached picture as output. All the data is contained within the cells, but I cannot seem to access it. I feel like there is a really simple solution, but after working with the file for longer than I'd like to admit, I hope you can point out something (hopefully) obvious to me.
If you have the time to assist me in it, I've uploaded the file here
Hope that will do.
Thanks for taking the time to assist me.
Note: the data is a bank fee statement and is completely fictional.
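For what it's worth, the error message itself hints at the cause: xmlToDataFrame() comes from the XML package and cannot dispatch on the "xml_document" objects produced by xml2's read_xml(). A minimal sketch that stays within the XML package (assuming the file holds flat, repeated records; the inline XML is a made-up stand-in for the real file):

```r
library(XML)

# Parse with XML::xmlParse so the object class matches what
# xmlToDataFrame() expects; with the real file this would be:
# doc <- xmlParse("file.xml")
# df  <- xmlToDataFrame(doc)

# the same calls on a tiny inline example:
doc <- xmlParse("<rows><row><a>1</a><b>x</b></row><row><a>2</a><b>y</b></row></rows>")
df  <- xmlToDataFrame(doc)
```

If the records are nested deeper in the tree, the nodes argument of xmlToDataFrame() (e.g. with getNodeSet()) lets you point it at the repeating elements.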
I am trying to figure out how to download the data into a nice CSV file so I can analyse it.
I am currently looking at WHO data here:
I am doing so by following the documentation, and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of list of lists.
It is not very easy to analyse. How could I clean this up so that I keep only a few of the columns returned by parse_json, say the dim fields REGION, YEAR, and COUNTRY, plus the values from the Value column? I would like to turn this into a tidy data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the 3rd element in the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
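To then keep only the columns mentioned in the question and write them out, a sketch (assuming, per the simple profile, that the fact table has columns named REGION, YEAR, COUNTRY, and Value; the CSV file name is made up):

```r
# select the columns of interest from the fact table returned by fromJSON()
keep_who_columns <- function(tmp) {
  data <- tmp[[3]]                     # 3rd element holds the fact table
  data[, c("REGION", "YEAR", "COUNTRY", "Value")]
}

# usage (requires network access):
# tmp  <- jsonlite::fromJSON("https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple")
# keep <- keep_who_columns(tmp)
# write.csv(keep, "who_whs6_102.csv", row.names = FALSE)
```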
I have dug into rlist and purrr and found them quite helpful for working with lists of pre-structured data. I have tried to solve the problems that came up on my own to improve my coding skills, so thanks to the community for helping out! However, I have now reached a dead end:
I want to write code so that we can drop our Excel files (in xlsm format) into a folder and R does the rest.
I import my data using:
library(readxl)  # for read_excel
vec.files <- list.files(pattern = "\\.xlsm$")
vec.numbers <- gsub("\\.xlsm$", "", vec.files)
list.alldata <- lapply(vec.files, read_excel, sheet = "XYZ")
names(list.alldata) <- vec.numbers
The data we read is a mix of characters, dates (...).
When I use the rlist package everything works fine until I try to filter on names that were not fixed entries in the Excel file (e.g. Measurable 1) but references to another field (e.g. =Table1!A1, or a reference).
If I try to call a false element I get this failure:
list.map(list.alldata, NameWhichWasAReferenceToAnotherFieldBefore)
Error in eval(.expr, .evalwith(.data), environment()) :
object 'Namewhichwasareferencetoanotherfieldbefore' not found
I am quite surprised, as if I call
names(list.alldata[[1]])
I get a vector with the correct entries / names.
As I identified read_excel() as the likely cause, I tried adding col_names = TRUE, but that did not help. With col_names = FALSE, however, the correct values do end up in the dataset.
I assume exporting the data as .csv would help, but that is not an option. Can this be done easily in R in a pre-loop?
In my workflow, accessing the data by name is essential and there is no workaround, so I really appreciate your help!
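Without the file it is hard to be sure, but names derived from formula references often contain spaces or other non-syntactic characters, which trips up rlist's non-standard evaluation. Pulling the column with a character subscript sidesteps that; a sketch (the column name and helper are hypothetical stand-ins):

```r
# hypothetical column name, exactly as reported by names(list.alldata[[1]])
target <- "Name which was a reference to another field before"

# pull that column from every imported workbook by character subscript,
# which works even for non-syntactic names
pull_column <- function(data.list, target) {
  lapply(data.list, function(df) df[[target]])
}
# vals <- pull_column(list.alldata, target)
```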
I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
  list(
    c(file = "file1.csv", descr = "descriptor1"),
    c(file = "file2.csv", descr = "descriptor2"),
    ...
  )
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    df <- read.csv(file)
    dfList[[item[["descr"]]]] <- df
  }
  return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem, but the above gets the job done in my case. It's probably slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible/generalizable solution then please don't hesitate to post!
What you have is sound, I would add only two comments:
Don't worry about extra memory usage, assuming the data frames are of nontrivial size you won't lose much putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file wasn't in quite the same format (or wants stringsAsFactors=FALSE or something) then they have the flexibility to do that.
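That second suggestion might look like this (a sketch based on the loader above):

```r
# Same loader, but with ... forwarded to read.csv so callers can
# tweak parsing (sep, stringsAsFactors, colClasses, ...)
loadTrialData <- function(load.dir, mapping, ...) {
  dfList <- list()
  for (item in mapping) {
    file <- file.path(load.dir, item[["file"]])
    dfList[[item[["descr"]]]] <- read.csv(file, ...)
  }
  dfList
}

# e.g. loadTrialData("~/data/directory", createDataFileMapping(),
#                    stringsAsFactors = FALSE)
```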
I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string may differ depending on where the data is from, though its position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files <- basename(list.files(pattern = "\\.csv$", recursive = TRUE, full.names = FALSE))
sfsplit <- lapply(site.files, function(x) strsplit(x, ".", fixed = TRUE)[[1]])
for (i in 1:length(site.files)) assign(site.files[i], read.csv(site.files[i]))
for (i in 1:length(site.files)) {
  if (grepl("PARQL", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])
  } else if (grepl("IRBT", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
  }
}

...and so on for each data frame I'd like to eventually plot from.

Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire filename as it was read into R. Is there something like data.frame[1] to generically refer to the 1st data frame in the global environment?
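One common alternative (a sketch, assuming the measurement token such as WDSD_30 is always the 10th dot-separated field of the file name) is to keep the data frames in a named list, keyed by that token, rather than assigning them into the global environment:

```r
# hypothetical helper: read every csv under dir into a list whose names
# are the 10th dot-separated token of each file name, e.g. "WDSD_30"
load_site_files <- function(dir = ".") {
  site.files <- list.files(dir, pattern = "\\.csv$",
                           recursive = TRUE, full.names = TRUE)
  keys <- vapply(basename(site.files),
                 function(x) strsplit(x, ".", fixed = TRUE)[[1]][10],
                 character(1), USE.NAMES = FALSE)
  setNames(lapply(site.files, read.csv), keys)
}

# then, regardless of site, the measurement is reachable by name:
# data.list <- load_site_files("~/some/site/folder")   # hypothetical path
# plot(data.list[["WDSD_30"]]$meas)
```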
I wish to read data into R from SAS data sets in Windows. The read.ssd function allows me to do so, however, it seems to have an issue when I try to import a SAS data set that has any non-alphabetic symbols in its name. For example, I can import table.sas7bdat using the following:
directory <- "C:/sas data sets"
sashome <- "/Program Files/SAS/SAS 9.1"
table.df <- read.ssd(directory, "table", sascmd = file.path(sashome, "sas.exe"))
but I can't do the same for a SAS data set named table1.sas7bdat. It returns an error:
Error in file.symlink(oldPath, linkPath) :
symbolic links are not supported on this version of Windows
Given that I do not have the option to rename these data sets, is there a way to read a SAS data set that has non-alphabetic symbols in its name in to R?
Looking around, it seems that others have run into your problem as well. Perhaps it's just a bug.
Anyway, try the suggestion from this (old) R-help post by the venerable Dan Nordlund, who's pretty good at this stuff and also active on SAS-L (sasl#listserv.uga.edu) if you want to try cross-posting your question there.
https://stat.ethz.ch/pipermail/r-help/2008-December/181616.html
Also, you might consider the transport method if you don't mind 8-character variable names.
Use:
directory <- "C:/sas data sets"
sashome <- "/Program Files/SAS/SAS 9.1"
table.df <- read.ssd(library = directory, mem = "table1", formats = FALSE,
                     sasprog = file.path(sashome, "sas.exe"))
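As a side note (assuming newer packages are an option): the haven package can read .sas7bdat files directly, without a SAS installation, and has no trouble with digits in the file name:

```r
library(haven)

# haven ships a small example file; with real data this would be e.g.
# read_sas("C:/sas data sets/table1.sas7bdat")
path <- system.file("examples", "iris.sas7bdat", package = "haven")
table1.df <- as.data.frame(read_sas(path))
head(table1.df)
```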