I have a data file where each row may have a different format, but every row contains the pattern "\\- .*\\| PR". The data set looks like the following:
|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes
[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.
|- 10 | PR - its easy to use and good candidiates
I want to extract the number that appears in this pattern; for the data I offered, I need a record of 7, 9, 10. I have no idea how to do it. Can someone help?
as.integer(sub('.*- ([0-9]+) \\| PR.*', '\\1', yourvector))
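For example, applied to the three sample rows above (the vector name here is just for illustration):
rows <- c(
  "|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes",
  "[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.",
  "|- 10 | PR - its easy to use and good candidiates"
)
# capture the digits sitting between "- " and " | PR"
as.integer(sub('.*- ([0-9]+) \\| PR.*', '\\1', rows))
#> [1]  7  9 10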
I have a complex JSON file (~8GB) containing publicly available data for businesses. We have decided to split the file up into multiple CSV files (or tabs in a .xlsx) so clients can easily consume the data. These files will be linked by the NZBN column/key.
I'm using R and jsonlite to read a small sample in (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go in each file (i.e., the first file will have the headers australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have the headers annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin...)
Here's a sample of two businesses/entities (I've bunged some of the data as well so ignore the actual values): test file
I've read almost every similar question on Stack Overflow and none seem to be giving me any luck. I've tried variations of purrr, *apply commands, custom flattening functions, and jqr (an R wrapper for 'jq'; it looks promising, but I can't seem to get it running).
Here's an attempt at creating my separate files, but I'm unsure how to include the linking identifier (NZBN), and I keep running into further nested lists (I'm unsure how many levels of nesting there are):
bulk <- jsonlite::fromJSON("bd_test.json")
# keep only the columns that are already flat (non-list) at the top level
coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[, sapply(coreEntity, is.list) == FALSE]
# drill into the nested company records and stack them row-wise
company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)
# the same again for shareholdings and their share allocations
shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)
shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)
I'm not sure if it's easier to split the files up during the flattening/wrangling process, or to completely flatten the whole file so I just have one line per business/entity (and then gather columns as needed). My only concern is that I need to scale this up to ~1.3 million nodes (the 8GB JSON file).
Ideally I would want the csv files split every time there is a new collection, and the values in the collection would become the columns for the new csv/tab.
Any help or tips would be much appreciated.
------- UPDATE ------
Updated, as my question was a little vague. I think all I need is some code to produce one of the CSVs/tabs, and I can replicate it for the other collections.
Say for example, I wanted to create a csv of the following elements:
entityName (unique linking identifier)
nzbn (unique linking identifier)
emailAddress__uniqueIdentifier
emailAddress__emailAddress
emailAddress__emailPurpose
emailAddress__emailPurposeDescription
emailAddress__startDate
How would I go about that?
"I'm unsure how many levels of nesting there are"
This will provide an answer to that quite efficiently:
jq '
def max(s): reduce s as $s (null;
if . == null then $s elif $s > . then $s else . end);
max(paths|length)' input.json
(With the test file, the answer is 14.)
To get an overall view (schema) of the data, you could run:
jq 'include "schema"; schema' input.json
where schema.jq is available at this gist. This will produce a structural schema.
"Say for example, I wanted to create a csv of the following elements:"
Here's a jq solution, apart from the headers:
.companies.entity[]
| [.entityName, .nzbn]
+ (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv
shareholding
The shareholding data is complex, so in the following I've used the to_table function defined elsewhere on this page.
The sample data does not include a "company name" field so in the following, I've added a 0-based "company index" field:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| .company
| range(0;length) as $cix
| .[$cix]
| $ix + [$cix] + (.shareholding[] | to_table(false))
jqr
The above solutions use the standalone jq executable, but all going well, it should be trivial to use the same filters with jqr, though to use jq's include, it might be simplest to specify the path explicitly, as for example:
include "schema" {search: "~/.jq"};
If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header in the form of an array of strings based on the "paths" to the leaf elements of the input, which can be arbitrarily nested:
# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
def prettify: map( (map(tostring)|join(":") ));
def composite: type == "object" or type == "array";
def check:
select(hdr|type == "array")
| if prettify == hdr then empty
else error("expected head is \(hdr) but imputed header is \(.)")
end ;
. as $in
| [paths(composite|not)] # the paths in array-of-array form
| if hdr==true then prettify
else check, map(. as $p | $in | getpath($p))
end;
For example, to produce the desired table (without headers) for .emailAddress, one could write:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv
(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)
Generating multiple files
More interestingly, you could select the level you want, and produce multiple tables automagically. One way to partition the output into separate files efficiently would be to use awk. For example, you could pipe the output obtained using this jq filter:
["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header
# First emit the line giving all the headers:
| $filename, ($common + $header | @tsv),
# Then emit the rows of the table:
(.[$key][]
| ($filename, ($ix + to_table(false) | @tsv)))
to
awk -F\\t 'fn {print >> fn; fn=0;next} {fn=$1".tsv"}'
This will produce headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
We have a WordPress website using MariaDB where the wp_options table keeps growing due to a rogue plugin writing thousands of records to the table. The issue has not been resolved by the plugin maintainer yet, and I keep having to remove these 'transient' (temp) records manually via a DELETE statement. The problem is that the ibd file keeps growing and is now 35GB in size. Once this is resolved, I plan to do an OPTIMIZE TABLE on the table to clean it up. Is that the best approach to reclaim all that space? I assume I'll need as much as 40GB of free space to do this; how long should the OPTIMIZE TABLE take? Since this table is used quite a bit by WordPress, it seems it will be best to take the website offline while optimizing to avoid locks. I'm looking for the quickest way to resolve this.
At least I think these rogue records are the cause of the table growing. Below is a list of the top 10 types of entries in the table:
MariaDB [wmnf_www]> SELECT substr(`wp_options`.`option_name`, 1, 18) AS `option_name`, count(`wp_options`.`option_value`) AS `cnt` FROM `wp_options` GROUP BY substr(`wp_options`.`option_name`, 1, 18) ORDER BY `cnt` DESC LIMIT 10;
+--------------------+-------+
| option_name | cnt |
+--------------------+-------+
| _transient_timeout | 21186 |
| _transient_ee_ssn_ | 12628 |
| _transient_jpp_li_ | 222 |
| _transient_externa | 125 |
| _transient_wc_rela | 63 |
| jpsq_sync-14716436 | 50 |
| wpmf_current_folde | 35 |
| _wc_session_expire | 34 |
| jpsq_sync-14716465 | 29 |
| jpsq_sync-14716417 | 25 |
+--------------------+-------+
10 rows in set (0.17 sec)
The _transient_ee_ssn_ and _transient_timeout_ee_ records are the issue and keep growing; they are the only ones in the set above that have grown since last night, and the table was initially found with 800K of them. I keep removing the records, which the plugin maintainer said was safe. But is this the cause of the ibd file growing?
---UPDATE---
Oddly enough, the issue is not resolved and transient records keep getting generated by the thousands, but the ibd index file has stopped growing for the moment. After steadily growing over the weekend from 20GB to 39GB, it has not grown in a couple of hours. Perhaps there's a limit, or the file was growing for other reasons?
I think a better solution would be to recreate the table using Percona's pt-online-schema-change tool. This will recreate the table and move all the data to the new table, then drop the old one. This avoids locking the database for a long time.
I'm working on a solution for Cassandra that's proving impossible.
We have a table that will return a set of candidates given some search criteria. The row with the highest score is returned back to the user. We can do this quite easily with SQL, but there's a need to migrate to Cassandra. Here are the tables involved:
Value
ID | VALUE | COUNTRY | STATE | CITY | COUNTY
--------+---------+----------+----------+-----------+-----------
1 | 50 | US | | |
--------+---------+----------+----------+-----------+-----------
2 | 25 | | TX | |
--------+---------+----------+----------+-----------+-----------
3 | 15 | | | MEMPHIS |
--------+---------+----------+----------+-----------+-----------
4 | 5 | | | | BROWARD
--------+---------+----------+----------+-----------+-----------
5 | 30 | | NY | NYC |
--------+---------+----------+----------+-----------+-----------
6 | 20 | US | | NASHVILLE |
--------+---------+----------+----------+-----------+-----------
Scoring
ATTRIBUTE | SCORE
-------------+-------------
COUNTRY | 1
STATE | 2
CITY | 4
COUNTY | 8
A query is sent that can have any of those four attributes populated or not. We search through our values table, calculate the scores, and return the highest one. If a column in the values table is null, it means it's applicable for all.
ID 1 is applicable for all states, cities, and counties within the US.
ID 2 is applicable for all countries, cities, and counties where the state is TX.
Example:
Query: {Country: US, State: TX}
Matches Value IDs: [1, 2, 3, 4, 6]
Scores: [1, 2, 4, 8, 5(1+4)]
Result: {id: 4} (8 was the highest score so Broward returns)
How would you model something like this in Cassandra 2.1?
Found out the best way to achieve this was using Solr with Cassandra.
Some things to note, though, about using Solr, since all the resources I needed were scattered around the internet.
You must first start Cassandra with Solr. There's a command in the dse tool for starting Cassandra with Solr enabled.
$CASSANDRA_HOME/bin/dse cassandra -s
You must create your keyspace with NetworkTopologyStrategy and Solr enabled.
CREATE KEYSPACE ... WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Solr': 1}
Once you create your table within your solr enabled keyspace, create a core using the dsetool.
$CASSANDRA_HOME/bin/dsetool create_core keyspace.table_name generateResources=true reindex=true
This will allow solr to index your data and generate a number of secondary indexes against your cassandra table.
Performing the queries needed for columns where values may or may not exist requires a somewhat complex query:
SELECT * FROM keyspace.table_name WHERE solr_query = '{"q": "(-column:[* TO *] AND *:*) OR column:value"}';
Finally, you may notice when searching for text, your solr query column:"Hello" may pick up other unwanted values like HelloWorld or HelloThere. This is due to the datatype used in your schema.xml for Solr. Here's how to modify this behavior:
Head to your Solr Admin UI. (Normally http://hostname:8983/solr/)
Choose your core in the drop-down list in the left pane; it should be named keyspace.table_name.
Look for Config or Schema; both should take you to the schema.xml.
Copy and paste that file into a text editor. Optionally, you could try using wget or curl to download the file, but you need the real link, which is provided in the text field box at the top right.
There's a <fieldtype> tag with the name TextField. Replace org.apache.solr.schema.TextField with org.apache.solr.schema.StrField. You must also remove the analyzers; StrField does not support them.
That's it, hopefully I've saved people from all the headaches I encountered.
I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create 2 tables:
1) A full list of all Bitcoin addresses and their balance, i.e.:
| ID | Address | Balance |
-------------------------------
| 1 | 7d4kExk... | 32 |
| 2 | 9Eckjes... | 0 |
| . | ... | ... |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network
| ID | Sender | Receiver | Transactions |
--------------------------------------------------
| 1 | 7d4kExk... | klDk39D... | 2 |
| 2 | 9Eckjes... | 7d4kExk... | 3 |
| . | ... | ... | .. |
To do this I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:
1 - It's very slow... I can imagine it's going to take at least a week at the rate it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using
url_start <- "http://blockexplorer.com/b/"
url_end <- ""
readUrl <- function(url) {
table <- try(readHTMLTable(url)[[1]])
if(inherits(table,"try-error")){
message(paste("URL does not seem to exist:", url))
errors <- errors + 1
return(NA)
} else {
processed <- processed + 1
return(table)
}
}
block_loop <- function (end, start = 0) {
  ...
  addr_row <- 1   # starting row to fill out the table
  links_row <- 1  # starting row to fill out the table
  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste(url_start, i, url_end, sep = "")
    table <- readUrl(url)
    # readUrl() returns a single NA if the block page could not be read
    if (length(table) == 1 && is.na(table)) { next }
    ....
There are very close to 250,000 blocks on the site you mentioned (at least, 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average speed of about half a second. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or a day and a half. This is the absolute best you can get using any methods because you have to request the page.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming about 100 records per block's table (which seems about average), you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Assuming you can fit this table in memory, concatenating the tables will be a problem.
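To make the concatenation cost concrete in R terms (an aside, not part of the suggestion below; it reuses readUrl() and url_start from the question and assumes the block tables share the same columns): growing a data frame with rbind() inside the loop re-copies the accumulated table on every iteration, whereas collecting the per-block tables in a list and binding once at the end does a single copy.
# slow: the accumulated table is re-copied on every iteration
all_rows <- data.frame()
for (i in 0:999) {
  tbl <- readUrl(paste0(url_start, i))
  if (is.data.frame(tbl)) all_rows <- rbind(all_rows, tbl)
}

# better: collect first, bind once at the end
pages    <- lapply(0:999, function(i) readUrl(paste0(url_start, i)))
pages    <- Filter(is.data.frame, pages)   # drop the NA results from failed reads
all_rows <- do.call(rbind, pages)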
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and Pandas to replace R's dataframe. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and Pandas is very quick too. Its dataframe class is modeled after R's, so you probably can work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.
Background
I am trying to set up my trade analysis environment. I am running some rule-based strategies on futures with different brokers and trying to aggregate the trades from the different brokers in one place. I am using the blotter package as my main tool for analysis.
The idea is to use blotter and PerformanceAnalytics to analyse the live performance of the various strategies I am running.
Problem at hand
My source of futures EOD data is CSIData. All the EOD OHLC prices for these futures are stored in CSV format in the following directory structure: for each future there is a separate directory, and each contract of the future has one CSV file with its OHLC price series.
|
+---AD
| AD_201203.TXT
| AD_201206.TXT
| AD_201209.TXT
| AD_201212.TXT
| AD_201303.TXT
| AD_201306.TXT
| AD_201309.TXT
| AD_201312.TXT
| AD_201403.TXT
| AD_201406.TXT
| AD_54.TXT
...
+---BO2
| BO2195012.TXT
| BO2201201.TXT
| BO2201203.TXT
| BO2201205.TXT
| BO2201207.TXT
| BO2201208.TXT
| BO2201209.TXT
| BO2201210.TXT
| BO2201212.TXT
| BO2201301.TXT
...
I have managed to define root contracts for all the futures I will be using (e.g. AD, BO2, etc. in the listing above) in FinancialInstrument, with the CSIData symbols as primary identifiers.
I am now struggling with how to define all the actual individual future contracts (e.g. AD_201203, AD_201206, etc.) and how to set up their lookup using setSymbolLookup.FI.
Any pointers on how to do that?
To set up the individual future contracts, I looked into ?future_series and ?build_series_symbols; however, the suffixes they support seem to be only in the futures month-code format. So I have a feeling I am left with setting up each individual future contract manually, e.g.
build_series_symbols(data.frame(primary_id=c('ES','NQ'), month_cycle=c('H,M,U,Z'), yearlist = c(10,11)))
[1] "ESH0" "ESM0" "ESU0" "ESZ0" "NQH0" "NQM0" "NQU0" "NQZ0" "ESH1" "ESM1" "ESU1" "ESZ1" "NQH1" "NQM1" "NQU1" "NQZ1"
I have no clue where to start digging for the second part of my question, i.e. setting up the price lookup for these futures from CSI.
PS: If this is not the right forum for this kind of question, I am happy to have it moved to the right section or even to ask on a totally different forum altogether.
PPS: Can someone with higher reputation tag this question with FinancialInstrument and CSIdata? Thanks!
The first part just works.
R> currency("USD")
[1] "USD"
R> future("AD", "USD", 100000)
[1] "AD"
Warning message:
In future("AD", "USD", 1e+05) :
underlying_id should only be NULL for cash-settled futures
R> future_series("AD_201206", expires="2012-06-18")
[1] "AD_201206"
R> getInstrument("AD_201206")
primary_id :"AD_201206"
currency :"USD"
multiplier :1e+05
tick_size : NULL
identifiers: list()
type :"future_series" "future"
root_id :"AD"
suffix_id :"201206"
expires :"2012-06-18"
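Since future_series("AD_201206") works once the root is defined, one way to define every contract in bulk is to derive the IDs from the CSI file names. A sketch only: the directory path is hypothetical, the pattern deliberately skips oddly named files such as AD_54.TXT, and contracts whose file names have no underscore (e.g. the BO2 files) may need explicit root_id/suffix_id arguments:
library(FinancialInstrument)

csi_dir <- "path/to/csi"       # hypothetical root of the CSIData tree
files   <- list.files(file.path(csi_dir, "AD"), pattern = "^AD_[0-9]{6}\\.TXT$")
ids     <- sub("\\.TXT$", "", files)     # "AD_201203.TXT" -> "AD_201203"
for (id in ids) future_series(id)        # expiry dates can be added later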
Regarding the second part, I've never used setSymbolLookup.FI. I'd either use setSymbolLookup directly, or set a src instrument attribute if I were going to go that route.
However, I'd probably make a getSymbols method, maybe getSymbols.mycsv, that knows how to find your data if you give it a dir argument. Then, I'd just setDefaults on your getSymbols method (assuming that's how most of your data are stored).
I save data with saveSymbols.days(), and use getSymbols.FI daily. I think it wouldn't be much effort to tweak getSymbols.FI to read csv files instead of RData files. So, I suggest looking at that code.
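A bare-bones sketch of what such a method might look like (the name getSymbols.mycsv, the dir and extension arguments, and the Date,Open,High,Low,Close,Volume column layout are all assumptions; getSymbols.FI is the code to model it on):
library(quantmod)
library(xts)

getSymbols.mycsv <- function(Symbols, env, dir = "", return.class = "xts",
                             extension = "TXT", ...) {
  importDefaults("getSymbols.mycsv")   # picks up the setDefaults() values below
  for (sym in Symbols) {
    root <- sub("_.*$", "", sym)       # "AD_201206" -> "AD", matching the directory tree
    fn   <- file.path(dir, root, paste(sym, extension, sep = "."))
    fr   <- read.csv(fn, header = TRUE, stringsAsFactors = FALSE)
    x    <- xts(fr[, -1], order.by = as.Date(fr[, 1]))   # assumed: first column is the date
    assign(sym, x, envir = env)
  }
  Symbols
}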
Then, you can just
setDefaults("getSymbols", src="mycsv")
setDefaults("getSymbols.mycsv", dir="path/to/dir")
Or, if you prefer
setSymbolLookup(AD_201206=list(src="mycsv", dir="/path/to/dir"))
or (essentially the same thing)
instrument_attr("AD_201206", "src", list(src="mycsv", dir="/path/to/dir"))
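Either way, once the src (and dir) are registered for a symbol, loading it is the usual call (assuming a method like the getSymbols.mycsv sketch above exists):
getSymbols("AD_201206")   # dispatches to getSymbols.mycsv with dir = "/path/to/dir"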