First of all, I must say I'm a complete noob in R, so I apologize in advance for asking for help with such a simple task. My task is to plot a graph of COVID-19 cases for a certain period using data from a CSV file. Unfortunately, at the moment I cannot contact the person from the World Health Organization who provided the data and the script to run it. But I was left with an error that I cannot fix, either by myself or with the help of Google.
script.R
library(EpiEstim)
library(ggplot2)
COVID<-read.csv("dataset.csv")
res_parametric_si<-estimate_R(COVID$I,method="parametric_si",config=make_config(list(mean_si=4,std_si=3)))
plot(res_parametric_si)
dataset.csv
Date,Suspected per day,Total suspected,Discarded/pending,Confirmed per day,Total confirmed,Deaths per day,Deaths Total,Case fatality rate,Daily confirmed,Recovered per day,Recovered total,Active cases,Tested with PCR,# of PCR tests total,average tests/ 7 days,Inf HCW,Inf HCW/d,Vent HCW,Susp per day
01-Jul-20,1239,91172,45285,889,45887,12,1185,2.58%,889,505,20053,24649,11109,676684,10073,6828,63,,1239
02-Jul-20,1249,92421,45658,876,46763,27,1212,2.59%,876,505,20558,24993,13167,689851,9966,6874,46,,1249
03-Jul-20,1288,93709,46032,914,47677,15,1227,2.57%,914,597,21155,25295,11825,701676,9915.7,6937,63,,1288
04-Jul-20,926,94635,46135,823,48500,22,1249,2.58%,823,221,21376,25875,9934,711610,9957,6990,53,,926
05-Jul-20,680,95315,46272,543,49043,13,1262,2.57%,543,327,21703,26078,6696,718306,9963.7,7030,40,,680
06-Jul-20,871,96186,46579,564,49607,21,1283,2.59%,564,490,22193,26131,9343,727649,10303.9,7046,16,,871
07-Jul-20,1170,97356,46942,807,50414,23,1306,2.59%,807,926,23119,25989,13568,741217,10806,7092,46,,1170
Error
Error in process_I(incid) (script.R#4): incid must be a vector or a dataframe with either i) a column called 'I', or ii) 2 columns called 'local' and 'imported'.
For the example data the issue seems to be that it only covers 7 data points, while the default configuration assumes it can window over more than 7 days. Note also that the dataset has no column called I (read.csv turns "Daily confirmed" into Daily.confirmed), so COVID$I is NULL, which is what triggers the process_I error. What worked for me was the following code (working in the sense that it does not throw an error).
config <- make_config(incid = COVID$Daily.confirmed,
                      method = "parametric_si",
                      list(mean_si = 4, std_si = 3, t_start = c(2, 3), t_end = c(6, 7)))
res_parametric_si<-estimate_R(COVID$Daily.confirmed,method="parametric_si",config=config)
plot(res_parametric_si)
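Equivalently (just a sketch: it assumes the series you want is the "Daily confirmed" column, which read.csv renames to Daily.confirmed), you can hand estimate_R a data frame with the column name I that it expects, keeping the short t_start/t_end windows from above:

library(EpiEstim)
library(ggplot2)

COVID <- read.csv("dataset.csv")                  # "Daily confirmed" becomes Daily.confirmed
incid <- data.frame(I = COVID$Daily.confirmed)    # estimate_R wants a vector, or a column named 'I'
res_parametric_si <- estimate_R(incid,
                                method = "parametric_si",
                                config = make_config(list(mean_si = 4, std_si = 3,
                                                          t_start = c(2, 3), t_end = c(6, 7))))
plot(res_parametric_si)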
In my Angular application I am tracking the filters that users apply on one of the pages. What I can later see in Logs is the following (query for the last 24 hours):
What I am interested in is the count of filters grouped by name, so I created the following query:
However, the problem, as you can see, is that my y-axis starts at 1 instead of 0. To users this looks like the last two filters don't have any values, when in reality they both have a count of 1.
I have tried to use ymin=0 together with the render function, but it did not work (the chart still starts at 1). Then I read that I need to use the make-series() function, so I tried:
customEvents
| where timestamp >= ago(24h)
| where customDimensions.pageName == 'product'
| make-series Count=count(name) default=0 on timestamp from datetime(2019-10-10) to datetime(2019-10-11) step 1d by name
| project name, Count
However, the result is some weird matrix instead of a regular table:
I have just started with Application Insights, so any help with this would be much appreciated. Thank you.
In Workbooks in Application Insights you could do almost this query (see below for a simplification), then use the chart settings and set the axis min/max explicitly.
But why are you using make-series and then summarizing down to just one series?
In this specific case, summarize is simpler:
customEvents
| where timestamp between(datetime(2019-10-10) .. datetime(2019-10-11))
| where customDimensions.pageName == 'product'
| summarize Count=count(name) by name
| render barchart
In the Logs blade (where you are), you could do this query, and I believe you can use
render barchart title="blah" ymin=0
(at some point Workbooks will be able to "see" all the render options like ymin/ymax/xmin/xmax/title/etc., but right now they're all stripped out at the service layer)
A bit late to the party, but the correct syntax to pass in ymin and ymax when using a query is this:
| ...
| render barchart with (ymin=0, ymax=100)
See https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/renderoperator?pivots=azuremonitor
I have a complex JSON file (~8 GB) containing publicly available data for businesses. We have decided to split the file up into multiple CSV files (or tabs in a .xlsx), so clients can easily consume the data. These files will be linked by the NZBN column/key.
I'm using R and jsonlite to read a small sample in (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go in each file (i.e., the first file will have the headers australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have the headers annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin...).
Here's a sample of two businesses/entities (I've bunged some of the data as well, so ignore the actual values): test file
I've read almost every similar question on Stack Overflow and none seem to be giving me any luck. I've tried variations of purrr, *apply commands, custom flattening functions, and jqr (an R version of 'jq', which looks promising, but I can't seem to get it running).
Here's an attempt at creating my separate files, but I'm unsure how to include the linking identifier (NZBN), and I keep running into further nested lists (I'm unsure how many levels of nesting there are):
bulk <- jsonlite::fromJSON("bd_test.json")
coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[,sapply(coreEntity, is.list)==FALSE]
company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)
shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)
shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)
I'm not sure if it's easier to split the files up during the flattening/wrangling process, or to completely flatten the whole file so I just have one line per business/entity (and then gather columns as needed); my only concern is that I need to scale this up to ~1.3 million nodes (the 8 GB JSON file).
Ideally I would want the csv files split every time there is a new collection, and the values in the collection would become the columns for the new csv/tab.
Any help or tips would be much appreciated.
------- UPDATE ------
Updated, as my question was a little vague. I think all I need is some code to produce one of the CSVs/tabs, and I can replicate it for the other collections.
Say for example, I wanted to create a csv of the following elements:
entityName (unique linking identifier)
nzbn (unique linking identifier)
emailAddress__uniqueIdentifier
emailAddress__emailAddress
emailAddress__emailPurpose
emailAddress__emailPurposeDescription
emailAddress__startDate
How would I go about that?
"i'm unsure how many levels of nesting there are"
This will provide an answer to that quite efficiently:
jq '
def max(s): reduce s as $s (null;
if . == null then $s elif $s > . then $s else . end);
max(paths|length)' input.json
(With the test file, the answer is 14.)
To get an overall view (schema) of the data, you could run:
jq 'include "schema"; schema' input.json
where schema.jq is available at this gist. This will produce a structural schema.
"Say for example, I wanted to create a csv of the following elements:"
Here's a jq solution, apart from the headers:
.companies.entity[]
| [.entityName, .nzbn]
+ (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv
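Since the question itself is framed around R and jsonlite, here is a rough R equivalent of that extraction (only a sketch: it assumes the .companies.entity[] / .emailAddress[] structure used above, keeps everything as nested lists, and the orNA helper for possibly-missing fields is mine):

library(jsonlite)

bulk <- fromJSON("bd_test.json", simplifyVector = FALSE)   # keep everything as nested lists

orNA <- function(x) if (is.null(x)) NA else x              # helper for fields that may be absent

email_rows <- lapply(bulk$companies$entity, function(e) {
  lapply(e$emailAddress, function(em) {
    data.frame(entityName              = orNA(e$entityName),
               nzbn                    = orNA(e$nzbn),
               uniqueIdentifier        = orNA(em$uniqueIdentifier),
               emailAddress            = orNA(em$emailAddress),
               emailPurpose            = orNA(em$emailPurpose),
               emailPurposeDescription = orNA(em$emailPurposeDescription),
               startDate               = orNA(em$startDate),
               stringsAsFactors = FALSE)
  })
})
emailAddress <- do.call(rbind, unlist(email_rows, recursive = FALSE))
write.csv(emailAddress, "emailAddress.csv", row.names = FALSE)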
shareholding
The shareholding data is complex, so in the following I've used the to_table function defined elsewhere on this page.
The sample data does not include a "company name" field, so in the following I've added a 0-based "company index" field:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| .company
| range(0;length) as $cix
| .[$cix]
| $ix + [$cix] + (.shareholding[] | to_table(false))
jqr
The above solutions use the standalone jq executable, but all going well, it should be trivial to use the same filters with jqr, though to use jq's include, it might be simplest to specify the path explicitly, as for example:
include "schema" {search: "~/.jq"};
If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header in the form of an array of strings based on the "paths" to the leaf elements of the input, which can be arbitrarily nested:
# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
  def prettify: map( (map(tostring)|join(":") ));
  def composite: type == "object" or type == "array";
  def check:
    select(hdr|type == "array")
    | if prettify == hdr then empty
      else error("expected header is \(hdr) but imputed header is \(.)")
      end ;
  . as $in
  | [paths(composite|not)]     # the paths in array-of-array form
  | if hdr==true then prettify
    else check, map(. as $p | $in | getpath($p))
    end;
For example, to produce the desired table (without headers) for .emailAddress, one could write:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv
(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)
Generating multiple files
More interestingly, you could select the level you want, and produce multiple tables automagically. One way to partition the output into separate files efficiently would be to use awk. For example, you could pipe the output obtained using this jq filter:
["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header
# First emit the line giving all the headers:
| $filename, ($common + $header | @tsv),
# Then emit the rows of the table:
(.[$key][]
| ($filename, ($ix + to_table(false) | @tsv)))
to
awk -F\\t 'fn {print >> fn; fn=0;next} {fn=$1".tsv"}'
This will produce headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
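If you would rather stay in R than pipe jq into awk, the same idea (one file per collection, keyed by entityName/nzbn) can be sketched as below. This reuses the bulk object and orNA helper from the jsonlite sketch above, assumes each collection is an array of reasonably flat objects (deeper nesting, such as shareholding inside company, would need another pass), and needs dplyr installed for bind_rows:

# one CSV per collection; list the array-valued fields you want split out
collections <- c("emailAddress")   # add the other collections here as needed

for (coll in collections) {
  rows <- lapply(bulk$companies$entity, function(e) {
    lapply(e[[coll]], function(item) {
      item <- item[!vapply(item, is.list, logical(1))]              # keep scalar fields only
      item <- lapply(item, function(v) if (is.null(v)) NA else v)   # JSON null -> NA
      cbind(data.frame(entityName = orNA(e$entityName),
                       nzbn       = orNA(e$nzbn),
                       stringsAsFactors = FALSE),
            as.data.frame(item, stringsAsFactors = FALSE))
    })
  })
  rows <- unlist(rows, recursive = FALSE)
  if (length(rows) > 0) {
    write.csv(dplyr::bind_rows(rows), paste0(coll, ".csv"), row.names = FALSE)
  }
}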
I have a data file in which each row may have a different format, but all contain the pattern "\\- .*\\| PR". The data set looks like the following:
|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes
[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.
|- 10 | PR - its easy to use and good candidiates
I want to extract the number that appears in this pattern; for the data above, I need 7, 9, 10. I have no idea how to do it. Can someone help?
as.integer(sub('.*- ([0-9]+) \\| PR.*', '\\1', yourvector))
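For example, with the three sample rows stored as plain strings, this gives exactly the 7, 9, 10 asked for:

x <- c("|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes",
       "[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.",
       "|- 10 | PR - its easy to use and good candidiates")

# capture the digits sitting between "- " and " | PR"
as.integer(sub('.*- ([0-9]+) \\| PR.*', '\\1', x))
# [1]  7  9 10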
I'm performing several tests using different approaches for cleaning a big csv file and then importing it into R.
This time I'm playing with PowerShell on Windows.
While things work well, and more accurately than when using cut() with pipe(), the process is horribly slow.
This is my command:
shell(shell = "powershell",
"Import-Csv In.csv |
select-object col1, col2, etc |
Export-csv new.csv")
And these are the system.time() results:
user system elapsed
0.61 0.42 1568.51
I've seen some other posts that use C# via streaming and take a couple of dozen seconds, but I don't know C#.
My question is: how can I improve the PowerShell command in order to make it faster?
Thanks,
Diego
There's a fair amount of overhead in reading in the CSV, converting the rows to PowerShell objects, and then converting back to CSV. Doing it through the pipeline that way also causes it to process one record at a time. You should be able to speed that up considerably if you switch to using Get-Content with a -ReadCount parameter and extract your data using a regular expression with the -replace operator, e.g.:
shell(shell = "powershell",
"Get-Content In.csv -ReadCount 1000 |
foreach { $_ -replace '^(.+?,.+?),','$1' |
Add-Content new.csv")
This will reduce the number of disk reads, and the -replace will function as an array operator, handling 1000 records at a time.
First and foremost, my first test was wrong: due to some errors I had before, several other PowerShell sessions remained open and delayed the whole process.
These are the real numbers:
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.09 0.05 824.53
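In these timings, psh.comm is just the PowerShell command stored as an R string; for this first test it is essentially the Import-Csv pipeline from the question, i.e. something like this reconstruction:

psh.comm <- "Import-Csv In.csv |
  Select-Object col1, col2, etc |
  Export-Csv new.csv"   # add -NoTypeInformation to drop the '#TYPE' header line Export-Csv writes by default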
Now, as I said, I couldn't find a good pattern for splitting the columns of my CSV file.
I should add that it is a messy file, with fields containing commas, multiline fields, summary lines, and so on.
I tried other approaches, like a well-known one on Stack Overflow that uses embedded C# code in PowerShell to split CSV files.
While it runs faster than the more common approach I showed previously, the results are not accurate for this type of messy file.
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.01 0.00 212.96
Both approaches showed similar RAM consumption (~40 MB) and CPU usage (~50%) most of the time.
So while the former approach took 4 times as long as the latter, the accuracy of the results, the low cost in terms of resources, and the shorter development time make me consider it the most efficient approach for big and messy CSV files.