Downloading the entire Bitcoin transaction chain with R

I'm pretty new here so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create 2 tables
1) A full list of all Bitcoin addresses and their balance, i.e.,:
| ID | Address    | Balance |
-----------------------------
| 1  | 7d4kExk... | 32      |
| 2  | 9Eckjes... | 0       |
| .  | ...        | ...     |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network
| ID | Sender     | Receiver   | Transactions |
-----------------------------------------------
| 1  | 7d4kExk... | klDk39D... | 2            |
| 2  | 9Eckjes... | 7d4kExk... | 3            |
| .  | ...        | ...        | ..           |
To do this I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far but I'm running into two main issues
1 - It's very slow... I can imagine it's going to take at least a week at the rate that it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using
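library(XML) # provides readHTMLTable
url_start <- "http://blockexplorer.com/b/"
url_end <- ""
# Assumes `errors` and `processed` are counters defined at top level (not shown here)
readUrl <- function(url) {
  table <- try(readHTMLTable(url)[[1]])
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <<- errors + 1 # <<- updates the outer counter; <- would only change a local copy
    return(NA)
  } else {
    processed <<- processed + 1
    return(table)
  }
}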
block_loop <- function (end, start = 0) {
...
addr_row <- 1 #starting row to fill out table
links_row <- 1 #starting row to fill out table
for (i in start:end) {
print(paste0("Reading block: ",i))
url <- paste(url_start,i,url_end, sep = "")
table <- readUrl(url)
if(identical(table, NA)){ next } # is.na() on a data frame returns a matrix, not a single TRUE/FALSE
....

There are very close to 250,000 blocks on the site you mentioned (at least, 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average response time of about half a second per page. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or about a day and a half. This is the best you can get with any method that fetches the pages one after another, because every page has to be requested.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming 100 records per block (seems about average) you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Even assuming you can fit this table in memory, concatenating tables will be a problem.
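As a quick sanity check on the two estimates above, here is the same arithmetic as a small Python sketch. The per-block timings are just the figures quoted in this answer, not new measurements, and the function name is made up for illustration.

```python
# Back-of-the-envelope runtime estimates, using the figures quoted in this answer.
BLOCKS = 250_000           # approximate number of blocks on the site
SECONDS_PER_DAY = 86_400

def estimated_days(seconds_per_block, blocks=BLOCKS):
    """Total runtime in days when blocks are processed one after another."""
    return seconds_per_block * blocks / SECONDS_PER_DAY

fetch_only_days = estimated_days(0.5)        # curl alone, about 0.5 s per page
fetch_and_parse_days = estimated_days(5.0)   # readHTMLTable, about 5 s per page

print(f"fetch only:      {fetch_only_days:.1f} days")
print(f"fetch and parse: {fetch_and_parse_days:.1f} days")
```

Plugging in your own curl timing in place of 0.5 gives the lower bound for your connection.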
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and Pandas to replace R's dataframe. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and Pandas is very quick too. Its dataframe class is modeled after R's, so you probably can work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.
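A minimal sketch of that "reduce a chunk, then move on" idea, in Python with only the standard library. It assumes each scraped record has already been simplified to a hypothetical (sender, receiver, amount) tuple, with sender set to None for newly mined coins; real Bitcoin transactions have multiple inputs and outputs, so treat this as an illustration of the bookkeeping, not a parser.

```python
from collections import Counter

def update_tables(balances, pair_counts, transfers):
    """Fold one chunk of transfers into the two running tables, in place."""
    for sender, receiver, amount in transfers:
        if sender is not None:               # None marks newly mined coins
            balances[sender] -= amount
            pair_counts[(sender, receiver)] += 1
        balances[receiver] += amount

balances = Counter()      # address -> balance (table 1)
pair_counts = Counter()   # (sender, receiver) -> number of transactions (table 2)

# Two tiny made-up chunks standing in for "1% of the data at a time".
chunks = [
    [(None, "addrA", 50), ("addrA", "addrB", 5)],
    [("addrA", "addrB", 2), ("addrB", "addrA", 3)],
]
for chunk in chunks:
    update_tables(balances, pair_counts, chunk)
    # The raw chunk is no longer needed here; only the two summaries stay in memory.

print(dict(balances))
print(dict(pair_counts))
```

Because only the two summary tables are kept, memory use grows with the number of addresses and address pairs rather than with the number of raw rows, and no table concatenation is needed.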

Related

Optimize table in MariaDB for large ibd file

We have a WordPress website using MariaDB where the wp_options table keeps growing due to a rogue plugin writing thousands of records to the table. The issue has not been resolved by the plugin maintainer yet and I keep having to remove these 'transient' (temp) records manually via a DELETE statement. The problem is that the ibd file keeps growing and is now 35GB in size. Once this is resolved, I plan to do an OPTIMIZE TABLE on the table to clean up. Is that the best approach to reclaim all that space? I assume I'll need as much as 40GB of free space to do this, and how long should the OPTIMIZE TABLE take? Since this table is used quite a bit by WordPress, it seems best to take the website offline while optimizing to avoid locks. I'm looking for the quickest way to resolve this.
At least I think these rogue records are the cause of the table growing. Below is a list of the top 10 type of entries in the table:
MariaDB [wmnf_www]> SELECT substr(`wp_options`.`option_name`, 1, 18) AS `option_name`, count(`wp_options`.`option_value`) AS `cnt` FROM `wp_options` GROUP BY substr(`wp_options`.`option_name`, 1, 18) ORDER BY `cnt` DESC LIMIT 10;
+--------------------+-------+
| option_name | cnt |
+--------------------+-------+
| _transient_timeout | 21186 |
| _transient_ee_ssn_ | 12628 |
| _transient_jpp_li_ | 222 |
| _transient_externa | 125 |
| _transient_wc_rela | 63 |
| jpsq_sync-14716436 | 50 |
| wpmf_current_folde | 35 |
| _wc_session_expire | 34 |
| jpsq_sync-14716465 | 29 |
| jpsq_sync-14716417 | 25 |
+--------------------+-------+
10 rows in set (0.17 sec)
The _transient_ee_ssn_ and _transient_timeout_ee_ entries are the issue and keep growing; they are the only ones in the set above that have grown since last night, and they were initially found with 800K records. I keep removing the records, which the plugin maintainer said was safe. But is this the cause of the ibd file growing?
---UPDATE---
Oddly enough, the issue is not resolved and transient records keep getting generated by the thousands, but this ibd index file has stopped growing for the moment. After steadily growing over the weekend from 20GB to now 39GB, it has not grown in a couple of hours. Perhaps there's a limit or this file was growing for other reasons?
I think it would be a better solution to recreate the table using Percona's pt-online-schema-change tool. It creates a new table, copies all the data into it, and then drops the old table. This avoids locking the database for a long time.

Google Sheets FILTER() and QUERY() not working with SUM()

I'm trying to pull and sum data from one sheet on another. This is GA data being built into a report, so I have sessions split up by landing page and device type, and would like to group them in different ways.
I usually use FILTER() for this sort of thing, but it keeps returning a 0 sum. Thinking this may be an odd edge case with FILTER(), I switched to using QUERY() instead. That gave me an error, but a Google search doesn't offer much documentation about what the error actually means. Taking a guess that it could be indicating an issue with the data type (i.e. not numeric), I changed the format of the source from "Automatic" to "Number", but to no avail.
Maybe it's a lack of coffee, but I'm at a loss as to why neither function is working to do a simple lookup and sum by criteria.
FILTER() function
SUM(FILTER(AllData!C:C,AllData!A:A="/chestnut/",AllData!B:B="desktop"))
No error, but returns 0 regardless of filter parameters.
QUERY() function
QUERY(AllData!A:G, "SELECT SUM(C) WHERE A='/chestnut/' AND B='desktop'",1)
Error returned:
Unable to parse query string for Function QUERY parameter 2: AVG_SUM_ONLY_NUMERIC
Sample data:
landingPage | deviceCategory | sessions
-------------|----------------|----------
/chestnut/ | desktop | 4
/chestnut/ | desktop | 2
/chestnut/ | tablet | 5
/chestnut/ | tablet | 1
/maple/ | desktop | 1
/maple/ | desktop | 2
/maple/ | mobile | 3
/maple/ | mobile | 1
I think the summing doesn't work because your numbers are text formatted.
See if any of these work? (change ranges to suit)
using FILTER()
=SUM(FILTER(VALUE(AllData!C:C),AllData!A:A="/chestnut/",AllData!B:B="desktop"))
using QUERY()
=ArrayFormula(QUERY({AllData!A:B, VALUE(AllData!C:C)}, "SELECT SUM(Col3) WHERE Col1='/chestnut/' AND Col2='desktop' label SUM(Col3)''",1))
using SUMPRODUCT()
=SUMPRODUCT(VALUE(AllData!C2:C),AllData!A2:A="/chestnut/",AllData!B2:B="desktop")

React native firebase 'average' query?

I come from a php/mysql background, and json data and firebase queries are still pretty new to me. Now I am working in React Native and I have a collection of data, and one of the keys stores an integer. I want to get the average of all the integers. Where do I even start?
I have used Firebase's snapshot.numChildren function before so I am getting a little more familiar in this json world, but any sort of help would be appreciated. Thanks!
So, you know that you can return all of the data and compute the average. I'm guessing this is for a large set of data where it would be ideal not to return the entire node every time you would like to retrieve an average.
It depends on what this data is and how it's being updated, but I think one option is to simply have a separate node that is updated every time the collection is added to or is changed.
Here is some really rough pseudo code. For example, if your database looks like this:
database
|
+--collection
| |
| +--item_one (probably a uid like -k2jduwi5j5j5)
| | |
| | +--number: 90
| |
| +--item_two
| | |
| | +--number: 70
|
+--collection_metadata
| |
| +--average: 80
| |
| +--number_of_items: 2
Then when a new item is added, you run a metadata calculation:
var numerator = average * number_of_items + newItem.number;
number_of_items++; // this is your new number of items
average = numerator / number_of_items; // this is your new average
Then when an item is updated, you run a metadata calculation:
var numerator = average * number_of_items - changedItem.oldNumber + changedItem.newNumber;
average = numerator / number_of_items; // this is your new average
Now when you want this data, you always have this data on hand.
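To make the bookkeeping concrete, here is a small runnable sketch of the same two calculations in Python. A plain dict stands in for the collection_metadata node; the function names are made up for illustration, and nothing here talks to Firebase. With several clients writing at once, the read-modify-write would also need to be done atomically, which this sketch ignores.

```python
def add_item(meta, new_number):
    """Update the stored average when a new item is added to the collection."""
    total = meta["average"] * meta["number_of_items"] + new_number
    meta["number_of_items"] += 1
    meta["average"] = total / meta["number_of_items"]

def update_item(meta, old_number, new_number):
    """Update the stored average when an existing item's number changes."""
    total = meta["average"] * meta["number_of_items"] - old_number + new_number
    meta["average"] = total / meta["number_of_items"]

# Start from the example tree above: items 90 and 70, average 80.
meta = {"average": 80, "number_of_items": 2}

add_item(meta, 110)         # items are now 90, 70, 110
update_item(meta, 70, 100)  # items are now 90, 100, 110

print(meta)
```

Deleting an item would follow the same pattern: subtract its number from the total and decrement number_of_items, taking care not to divide by zero when the collection becomes empty.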

How to copy a value from a certain pattern in R?

I have a data file in which each row may have a different format, but every row contains the pattern "\\- .*\\| PR". The data set looks like the following:
|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes
[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.
|- 10 | PR - its easy to use and good candidiates
I want to extract the number inside this pattern; for the data shown above, I need 7, 9, 10. I have no idea how to do it. Can someone help?
as.integer(sub('.*- ([0-9]+) \\| PR.*', '\\1', yourvector))
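For comparison, here is a sketch of the same extraction in Python rather than R, using the three sample rows from the question. It searches for the "- <digits> | PR" pattern instead of substituting over the whole line; rows without the pattern yield None, which is a choice made for this sketch.

```python
import re

# Digits that sit between "- " and " | PR", as in the sample rows.
PATTERN = re.compile(r"- (\d+) \| PR")

def extract_numbers(rows):
    """Return the captured number for each row, or None if the pattern is absent."""
    numbers = []
    for row in rows:
        match = PATTERN.search(row)
        numbers.append(int(match.group(1)) if match else None)
    return numbers

rows = [
    "|- 7 | PR - Easy to post position and search resumes | Improvement - searching of resumes",
    "[ 1387028] | Recommend - 9 | PR - As a recruiter I find a lot qualified resumes for jobs that I am working on.",
    "|- 10 | PR - its easy to use and good candidiates",
]
numbers = extract_numbers(rows)
print(numbers)
```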

TCP/IP communication from the unix server to the Pure Data

I am interested in TCP/IP communication from the Unix server to the Pure Data. I have it realized using sockets on the Unix server side, and netclient on the Pure Data side. I exploited the chat-server tutorial for this (3.Networking > 10.chat_client.pd).
The problem is that the server streams the data out as a "string" message delimited with ";".
My question is, is there a way to send something other than string message to Pure Data, like byte-stream or serialized number stream? Can Pure Data receive such messages?
Strings take too many bytes to transfer: for example, the number "1024;" is already 5 bytes as a string, while such an integer is just 4 bytes.
UPDATE: For everyone that stumbles upon this post in search for the answer.
Apparently [netclient] on the Pure Data side cannot receive anything other than ; delimited messages.
So the solution for the problem posed above:
My question is, is there a way to send something other than string message to Pure Data, like byte-stream or serialized number stream? Can Pure Data receive such messages?
The solution is to use [tcpclient], it can receive byte-stream data.
Now my question is, how do I get four compact numbers to work with?
Now I have a series of bytes, at least in the correct order.
From my UNIX server I am sending a structure
typedef struct {
int var_code;
int sample_time;
int hr;
float hs;
} phy_data;
Sample data might be 2 1000000 51 2000.56
When received and printed in Pure Data I get output like this:
: 0 0 0 2 0 10 114 26 0 0 0 51 0 16 242 78
You can notice number 2 and number 51 clearly, I guess the others are correct as well.
How can I get these numbers back to a usable format?
Maybe some manipulation with [bytes2any] and [route] would work, but I haven't been able to extract the data with them.
here's an outline of what you have to do:
repackage the bytelist to small messages of the correct size for the various types.
since all your elements are 4 byte long, you simply repackage your list (or bytestream, as TCP/IP doesn't guarantee to deliver your 16 bytes as a single list, but could also decide to break it into a list of arbitrary length) to a number of 4 atom lists.
the most stable way would probably be to first serialize the list (check the "serializer" example in the [list] help) and then reassemble that list into lists of 4 elements.
if you can use externals like zexy you could use [repack 4] for that.
if you trust [netclient] to output your messages as complete lists, you could simply use a large [unpack ....] and 4 [pack]s
interpret the raw data for each sublist
integers are rather simple, floats are way more complicated
integers:
|
[unpack 0 0 0 0]
|       |  |  |
[<< 8]  |  |  |
|       |  |  |
[+      ]  |  |
|          |  |
[<< 8]     |  |
|          |  |
[+         ]  |
|             |
[<< 8]        |
|             |
[+            ]
|
floats are left as an exercise to the user :-)
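To illustrate what the patch above computes, here is a rough sketch of the same logic in Python (not Pd): regroup the byte list into 4-byte words, then shift-and-add each word as a big-endian integer. The float decoder assumes the server sends IEEE 754 single-precision values in big-endian order, which is an assumption about the server, not something the question confirms; the float example bytes are made up for illustration and are not taken from the question's output.

```python
import struct

def chunk_words(byte_values, size=4):
    """Split a flat list of byte values into complete groups of `size`.

    Returns (words, leftover); leftover holds trailing bytes of an
    incomplete word, to be prepended to the next batch from the stream.
    """
    usable = len(byte_values) - len(byte_values) % size
    words = [byte_values[i:i + size] for i in range(0, usable, size)]
    return words, byte_values[usable:]

def decode_int(word):
    """Big-endian shift-and-add, like the [<< 8] / [+ ] chain (unsigned)."""
    value = 0
    for byte in word:
        value = (value << 8) + byte
    return value

def decode_float(word):
    """Assumes a big-endian IEEE 754 single-precision float."""
    return struct.unpack(">f", bytes(word))[0]

# The byte list printed in the question.
received = [0, 0, 0, 2, 0, 10, 114, 26, 0, 0, 0, 51, 0, 16, 242, 78]
words, leftover = chunk_words(received)
ints = [decode_int(w) for w in words]
print(ints)

# Made-up example: the big-endian float32 encoding of roughly 2000.56.
print(decode_float([68, 250, 17, 236]))
```

Like the patch, decode_int treats each word as unsigned; negative values in the C struct's int fields would need a two's-complement correction on top of this.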
the real solution to your problem would be to use a well-defined application-layer protocol, rather than brew your own.
the most widespread protocol in use for applications like Pd, is certainly OSC.
in order to decode the raw OSC-bytes into Pd-messages, use [unpackOSC] (part of the "mrpeach" library; on Debian, you install it via the pd-osc package)
on the "server" side, you can use liblo for encoding data and sending it.
note
be aware that since OSC is packet-based, you will need a packetizing mechanism for stream-based protocols like TCP/IP. as of OSC-1.1, this should be SLIP. liblo should already take care of this. check the patches accompanying [unpackOSC] for how to do this within Pd.
all this is not needed if you are using UDP as the transport.
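to show what such a packetizing layer does, here is a minimal sketch of SLIP framing (RFC 1055) in Python. it is only an illustration of the idea; in practice liblo and the Pd patches mentioned above handle this for you, and the function names here are made up.

```python
# SLIP special bytes as defined in RFC 1055.
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def slip_encode(packet: bytes) -> bytes:
    """Wrap one packet in END markers, escaping END/ESC bytes in the payload."""
    out = bytearray([END])
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)

def slip_decode(stream: bytes) -> list:
    """Split a byte stream back into the packets it carries."""
    packets, current, escaped = [], bytearray(), False
    for b in stream:
        if escaped:
            # Map the escape codes back to the bytes they stand for.
            current.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == END:
            if current:                 # ignore empty frames between END markers
                packets.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    return packets

# Two packets sent back to back arrive as one stream and are split again.
stream = slip_encode(b"\x01\xc0\xdb\x02") + slip_encode(b"abc")
print(slip_decode(stream))
```

this decoder expects the whole stream at once; a real receiver would keep `current` and `escaped` between reads, since TCP may deliver a frame in several pieces.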
