Convert Dat Data To Data Frame - r

I'm reading .dat data and trying to convert it to a data frame so I can save it in a readable format. However, I have no clue how to convert the .dat data. I'm a bit of a beginner in R. Any help will be highly appreciated.
Code so Far:
data <- readLines("Day8.dat")
print(data)
Output So Far:
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>
....
Thanks

It all depends on what you want to do with the data, i.e., how you want to process it.
For example, let's assume your interest is in parsing all XML tags as separate strings. Then you can extract the tags using a regular expression and the function str_extract_all:
library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")
This regex works even if the XML element names are variable:
str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")
The result is a list:
[[1]]
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
[2] "<d2lm:exchange>"
[3] "<d2lm:supplierIdentification>"
[4] "<d2lm:country>gb</d2lm:country>"
[5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
[7] "<d2lm:feedType>Event Data</d2lm:feedType>"
[8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"
[9] "<d2lm:publicationCreator>"
[10] "<d2lm:country>gb</d2lm:country>"
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[12] "<d2lm:situation version=\"\" id=\"2922904\">"
[13] "<d2lm:headerInformation>"
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"
To turn the list into a dataframe:
datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))
EDIT:
If you want to have a dataframe with the text values between XML start tag and XML end tag, you can extract these tags and values along these lines:
datDF <- data.frame(
tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
)
datDF
                       tags                        values
1            <d2lm:country>                            gb
2 <d2lm:nationalIdentifier>                          NTIS
3           <d2lm:feedType>                    Event Data
4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5            <d2lm:country>                            gb
6 <d2lm:nationalIdentifier>                          NTIS
7     <d2lm:areaOfInterest>                      national
Is this--roughly--what you had in mind?
DATA:
dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'
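One caveat with the EDIT approach: tags and values come from two separate regex passes, so the two columns only line up if both patterns match exactly the same elements. A minimal base-R sketch that captures each leaf tag and its text with a single pattern is below. The sample lines and the names xml_lines, dat_small, pat and leaf_df are made up for illustration, and it assumes the leaf elements carry no attributes. For real DATEX II files a proper XML parser (for example the xml2 package) is probably the safer choice.

```r
# Hypothetical stand-in for readLines("Day8.dat"): a character vector of lines
xml_lines <- c("<d2lm:exchange><d2lm:country>gb</d2lm:country>",
               "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:exchange>",
               "<d2lm:feedType>Event Data</d2lm:feedType>")

# Collapse the lines into one string, since the answer's `dat` is a single string
dat_small <- paste(xml_lines, collapse = "\n")

# One pattern: start tag without attributes, text without markup, matching end tag
pat <- "<([^<>/ ]+)>([^<>]*)</\\1>"
hits <- regmatches(dat_small, gregexpr(pat, dat_small, perl = TRUE))[[1]]

# Tag and value are taken from the same match, so the rows cannot drift apart
leaf_df <- data.frame(
  tag   = sub(pat, "\\1", hits, perl = TRUE),
  value = sub(pat, "\\2", hits, perl = TRUE),
  stringsAsFactors = FALSE
)
leaf_df
```

Container elements such as d2lm:exchange are skipped because their content contains further markup.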

Related

Rename row.name in data frame using matches or partial matches from a list

I have a data frame in R with 341 rows. I want to rename the row names using a list with 349 names. All 341 names will be in this list for sure. But not all of them will be perfect hits.
The data looks like this
rownames(df_RPM1)
[1] "LQNS02059392.1_11686_5p"
[2] "LQNS02277998.1_30984_3p"
[3] "LQNS02277998.1_30984_5p"
[4] "LQNS02277998.1_30988_3p"
[5] "LQNS02277998.1_30988_5p"
[6] "LQNS02277997.1_30943_3p"
[7] "miR-9|LQNS02278070.1_31740_3p"
[8] "miR-9|LQNS02278094.1_36129_3p"
head(inlist)
[1] "dpu-miR-2-03_LQNS02059392.1_11686_5p" "dpu-miR-10-P2_LQNS02277998.1_30984_3p"
[3] "dpu-miR-10-P2_LQNS02277998.1_30984_5p" "dpu-miR-10-P3_LQNS02277998.1_30988_3p"
[5] "dpu-miR-10-P3_LQNS02277998.1_30988_5p" "miR-9|LQNS02278070.1_31740_3p"
[6] "miR-9|LQNS02278094.1_36129_3p"
The order won't necessarily be the same in the two.
Can anyone suggest how to do this in R?
Thanks a lot
It depends a lot on what a "non-perfect hit" looks like. Assuming the row name is a substring of the real name, str_which() does the job quite well:
library(tidyverse)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
"dpu-miR-10-P2_LQNS02277998.1_30984_3p",
"dpu-miR-10-P2_LQNS02277998.1_30984_5p",
"dpu-miR-10-P3_LQNS02277998.1_30988_3p",
"dpu-miR-10-P3_LQNS02277998.1_30988_5p",
"miR-9|LQNS02278070.1_31740_3p",
"miR-9|LQNS02278094.1_36129_3p")
str_which(real_names, "LQNS02059392.1_11686_5p")
#> [1] 1
So we can vectorize (I removed the element 6 which is not found in the example list):
pos <- map_int(rownames(df_RPM1), ~ str_which(real_names, fixed(.)))
pos
#> [1] 1 2 3 4 5 6 7
And all that's left is to change the row names:
rownames(df_RPM1) <- real_names[pos]
Of course, if a non-perfect hit means something more complicated, you may need to create a regex from the row names or something like that.
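Note that map_int() stops with an error as soon as a row name has no hit or more than one hit in the list. If that can happen in the real data, a more forgiving variant might look like the base-R sketch below. The toy vectors and the helper match_name are hypothetical, and keeping the old name when the match is not unique is just one possible policy.

```r
# Toy inputs modelled on the question; the unmatched second name is deliberate
old_names  <- c("LQNS02059392.1_11686_5p",
                "LQNS02277997.1_30943_3p",
                "miR-9|LQNS02278070.1_31740_3p")
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
                "miR-9|LQNS02278070.1_31740_3p")

# Return the single full name containing `short`; otherwise keep `short` as is
match_name <- function(short, candidates) {
  # fixed = TRUE treats "." and "|" literally instead of as regex syntax
  hits <- candidates[grepl(short, candidates, fixed = TRUE)]
  if (length(hits) == 1) hits else short
}

new_names <- vapply(old_names, match_name, character(1),
                    candidates = real_names, USE.NAMES = FALSE)
new_names
```

The result could then be assigned with rownames(df_RPM1) <- new_names, provided the names are still unique.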

Nested List Parsing with jsonlite

This is the second time that I have faced this recently, so I wanted to reach out to see if there is a better way to parse data frames returned from jsonlite when one of the elements is an array that ends up as a list column in the data frame.
I know that this is part of the power of jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose that I can write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really care whether I parse these lists within the same data frame, but what are my options for parsing out the data?
I would prefer to take the data out of the lists and put them into a data frame that can be "related" to the original record it came from.
Thanks in advance for your help.
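Following @hrbmstr's suggestion above, I was able to get what I wanted using unnest:
library(tidyverse)  # dplyr provides select() and %>%, tidyr provides unnest()
select(pbp, eventid, aoi) %>% unnest(cols = aoi) %>% head()  # tidyr >= 1.0 expects the column to be named via cols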
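For anyone wondering what unnest does to a list column, here is a rough base-R equivalent on toy data. The data frame pbp_toy, the event ids and the column name player are made up for illustration; only the aoi values are taken from the output above.

```r
# Toy stand-in for pbp: two events, each with a vector of ids in a list column
pbp_toy <- data.frame(eventid = c(101L, 102L))
pbp_toy$aoi <- list(c(8465009L, 8470638L), 8470626L)

# Repeat each eventid once per id in its list entry, then flatten the list column
aoi_long <- data.frame(
  eventid = rep(pbp_toy$eventid, lengths(pbp_toy$aoi)),
  player  = unlist(pbp_toy$aoi)
)
aoi_long
```

Each row of aoi_long keeps the eventid it came from, so it can be joined back to the original records.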

R: Extract words from a website

I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
I am returned every line containing the phrase "stat_". I just want to extract the "stat_" words. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
"(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
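If you would rather stay with readLines and base R, the same idea (extract the matches instead of deleting everything else) can be sketched with gregexpr and regmatches. The three input lines below are invented stand-ins for what readLines might return, not the real page content.

```r
# Hypothetical lines, standing in for readLines() output from the page
page_lines <- c('<a href="stat_bin.html">stat_bin</a>',
                "text mentioning stat_identity and geom_point",
                "a line without any match")

# gregexpr() locates every match per line; regmatches() pulls the matched text out
found <- regmatches(page_lines, gregexpr("stat_[[:alnum:]_]+", page_lines))

# Flatten the per-line results and drop duplicates
stat_words <- unique(unlist(found))
stat_words
```

Because the raw HTML is searched here, names that only appear inside attributes such as href are picked up as well.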

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is to run the command above for all five variables at once. Instead of using just "clt.list" as the argument in the command above, I'd like to use all of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
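You can put your operation into a function and then iterate over it with lapply, which acts like a loop. First create the vector of variable names:
my_vecnames <- ls(pattern = "\\.list$")  # escape the dot so that "mlist" is not matched
This pattern also picks up commands.list, dirs.list and vars.list; if you only want the five variables, list them explicitly with c("clt.list", "hurs.list", "prec.list", "temp.list", "wind.list").
get_my_substr <- function(vecname) {
  levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
}
lapply(my_vecnames, get_my_substr)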
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function() {
  replicate(
    sample(10, 1),
    paste0(prestr, paste0(sample(LETTERS, sample(5, 1)), collapse = ""), posstr)
  )
}
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the names of the variables that you want to extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}
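Both answers look the variables up by name. Base R's mget() does that lookup for several names in one call and returns a named list, which keeps track of which result belongs to which variable. A minimal sketch, using shortened stand-in vectors (the hurs.list entry is made up) and a hypothetical helper extract_model:

```r
# Shortened, partly made-up stand-ins for the real file-name vectors
clt.list  <- c("clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
               "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc")
hurs.list <- c("hurs_Amon_CCSM4_historical_r1i1p1_185001-200512.nc")

# Pull out the text between "Amon_" and "_historical", dropping duplicates
extract_model <- function(x) {
  unique(sub(".*?Amon_(.*?)_historical.*", "\\1", x, perl = TRUE))
}

# mget() returns a named list with one element per variable name
file_lists <- mget(c("clt.list", "hurs.list"))
models <- lapply(file_lists, extract_model)
models
```

The results can then be addressed by name, for example models$clt.list.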

list files in R by dates

I have a set of NetCDF files organised by date in my directory (each file holds one day of data). I read all the files in R using
require(RNetCDF)
files= list.files( ,pattern='*.nc',full.names=TRUE)
When I run the code, R reads 2014 and 2013 first, and parts of 2010 end up at the end (see the sample output from R below):
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc"
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc"
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc"
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc"
I am trying to generate a daily time series from these files using a loop, so when I apply the rest of my code, the data from June to August 2010 ends up at the end of the series. I suspect this has to do with how the files are listed in R.
Is there any way to list files in R and ensure they are ordered by date?
Here are your files unsorted
paths <- c("./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc",
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc",
"./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc",
"./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc")
I'm using a regular expression to extract the 8 digits of the date, YYYYMMDD. Because the format is year-month-day, sorting the digit strings already gives chronological order, but you can also convert them into dates.
## matches ...Nx.<number of digits = 8>... and captures the stuff in <>
## and saves this match to the first capture group, \\1
pattern <- '.*Nx\\.(\\d{8}).*'
gsub(pattern, '\\1', paths)
# [1] "19820223" "19820224" "19820225" "20130829" "20130830" "20130831"
# [7] "20100626" "20100827" "20100828"
sort(gsub(pattern, '\\1', paths))
# [1] "19820223" "19820224" "19820225" "20100626" "20100827" "20100828"
# [7] "20130829" "20130830" "20130831"
## not necessary to convert that into dates but you can
as.Date(sort(gsub(pattern, '\\1', paths)), '%Y%m%d')
# [1] "1982-02-23" "1982-02-24" "1982-02-25" "2010-06-26" "2010-08-27"
# [6] "2010-08-28" "2013-08-29" "2013-08-30" "2013-08-31"
And order the original paths
## so you can use the above to order the paths
paths[order(gsub(pattern, '\\1', paths))]
# [1] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc"
# [2] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820224.SUB.nc"
# [3] "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820225.SUB.nc"
# [4] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc"
# [5] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100827.SUB.nc"
# [6] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100828.SUB.nc"
# [7] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc"
# [8] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130830.SUB.nc"
# [9] "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130831.SUB.nc"
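Since the goal is a daily time series, it may be handy to keep each path next to its parsed date rather than only reordering the paths. A small sketch along those lines, using three of the paths from above and made-up names (stamp, files_df):

```r
paths <- c("./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20130829.SUB.nc",
           "./MERRA100.prod.assim.tavg1_2d_lnd_Nx.19820223.SUB.nc",
           "./MERRA301.prod.assim.tavg1_2d_lnd_Nx.20100626.SUB.nc")

# Parse the YYYYMMDD stamp that follows "Nx." into a Date
stamp <- sub(".*Nx\\.([0-9]{8}).*", "\\1", basename(paths))
dates <- as.Date(stamp, format = "%Y%m%d")

# Keep path and date side by side, ordered chronologically
files_df <- data.frame(path = paths, date = dates, stringsAsFactors = FALSE)
files_df <- files_df[order(files_df$date), ]
files_df
```

Looping over files_df$path then reads the files in date order, and files_df$date can serve as the time index.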
