How to export the constructed dictionary from Newsmap (Quanteda) - r

I have trained a newsmap model with the Newsmap package for quanteda in R, and I am trying to export the large dictionary it constructed from my corpus (not the seed dictionary).
I tried the code below, but it only gives me the 10 most strongly associated terms per country as a list, and I have not managed to turn that list into a dictionary object I can use in R.
Dict <- coef(model)
I would really appreciate any and all help!

You only need to extract the names of the coefficient vectors, passing the desired number of words per country as n:
> quanteda::dictionary(lapply(coef(model, n = 1000), FUN = names))
Dictionary object with 226 key entries.
- [bi]:
- burundi, burundi's, bujumbura, burundian, nkurunziza, uprona, msd, nduwimana, hutus, tutsi, radebe, drcongo, rapporteur, elderly, mushikiwabo, generation, kayumba, faustin, hutu, olga [ ... and 980 more ]
- [dj]:
- djibouti, djibouti's, djiboutian, western-led, pretty, photo, watkins, ask, entebbe, westerners, mujahideen, salvation, osprey, persistent, horn, afdb, donors, ismael, nevis, grenade [ ... and 980 more ]
- [er]:
- eritrea, eritreans, eritrean, keetharuth, issaias, eritrea's, binnie, sheila, somaliland, catania, mandeb, brutal, sicily's, lana, horn, lampedusa, aman, afdb, donors, monitoring [ ... and 980 more ]
- [et]:
- ethiopia, ethiopian, addis, ababa, addis, ababa, hailemariam, desalegn, ethiopians, maasho, ethiopia's, mandeb, igad, dibaba, genzebe, mesfin, bekele, spla, shrikesh, laxmidas [ ... and 980 more ]
- [ke]:
- kenya, kenyan, nairobi, nairobi, uhuru, lamu, mombasa, mpeketoni, kenyans, kws, nairobi's, akwiri, ruto, westgate, kenyatta's, mombasa, makaburi, kenyatta, kenya's, ol [ ... and 980 more ]
- [km]:
- comoros, mazen, emiratis, oil-rich, canterbury, lahiya, shoukri, gender, wadia, lombok, brisbane's, entire, christiana, blahodatne, everest's, culiacan, kamensk-shakhtinsky, protestants, pk-5, parwan [ ... and 980 more ]
[ reached max_nkey ... 220 more keys ]


I'm having trouble with parsing a JSON file

I am attempting to use a .json file I found online, but I'm starting to think that there is an underlying issue with the file. I am not very familiar with JSON, so I am trying to convert it into a CSV file. I have yet to find a website that can do that for me.
The file is quite large, and I assume most websites have a size limit, so I tried converting it in R instead. I tried flattening it with this code:
library(jsonlite)
library(tidyr)
library(tidyverse)
json_string <- readLines("data.json")
json_data <- fromJSON(json_string)
json_data <- flatten(json_data)
df <- as_data_frame(json_data)
write_csv(df, "output.csv")
but it returns this error:
! Tibble columns must have compatible sizes.
* Size 2: Columns `A-Alrund, God of the Cosmos // A-Hakka, Whispering Raven`, `A-Blessed Hippogriff // A-Tyr's Blessing`, `A-Emerald Dragon // A-Dissonant Wave`, `A-Monster Manual // A-Zoological Study`, `A-Rowan, Scholar of Sparks // A-Will, Scholar of Frost`, and 484 more.
* Size 3: Column `Smelt // Herd // Saw`.
* Size 5: Column `Who // What // When // Where // Why`.
* Size 6: Columns `Everythingamajig`, `Garbage Elemental`, `Ineffable Blessing`, `Knight of the Kitchen Sink`, `Scavenger Hunt`, and 4 more.
i Only values of size one are recycled.
Backtrace:
1. tibble::as_data_frame(json_data)
3. tibble:::as_tibble.list(x, ...)
4. tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
5. tibble:::recycle_columns(x, .rows, lengths)
Here is what the first 2 items of the .json file look like
{"data": {"\"Ach! Hans, Run!\"": [{"colorIdentity": ["G", "R"], "colors": ["G", "R"], "convertedManaCost": 6.0, "foreignData": [], "identifiers": {"scryfallOracleId": "a2c5ee76-6084-413c-bb70-45490d818374"}, "isFunny": true, "layout": "normal", "legalities": {}, "manaCost": "{2}{R}{R}{G}{G}", "manaValue": 6.0, "name": "\"Ach! Hans, Run!\"", "printings": ["UNH"], "purchaseUrls": {"cardKingdom": "https://mtgjson.com/links/84dfefe718a51cf8", "cardKingdomFoil": "https://mtgjson.com/links/d8c9f3fc1e93c89c", "cardmarket": "https://mtgjson.com/links/b9d69f0d1a9fb80c", "tcgplayer": "https://mtgjson.com/links/c51d2b13ff76f1f0"}, "rulings": [], "subtypes": [], "supertypes": [], "text": "At the beginning of your upkeep, you may say \"Ach! Hans, run! It's the . . .\" and the name of a creature card. If you do, search your library for a card with that name, put it onto the battlefield, then shuffle. That creature gains haste. Exile it at the beginning of the next end step.", "type": "Enchantment", "types": ["Enchantment"]}], "\"Brims\" Barone, Midway Mobster": [{"colorIdentity": ["B", "W"], "colors": ["B", "W"], "convertedManaCost": 5.0, "foreignData": [], "identifiers": {"scryfallOracleId": "c64c31f2-c1be-414e-9dff-c3b77ba97545"}, "isFunny": true, "layout": "normal", "leadershipSkills": {"brawl": false, "commander": true, "oathbreaker": false}, "legalities": {}, "manaCost": "{3}{W}{B}", "manaValue": 5.0, "name": "\"Brims\" Barone, Midway Mobster", "power": "5", "printings": ["UNF"], "purchaseUrls": {"cardKingdom": "https://mtgjson.com/links/d1e320bd9d6813c0", "cardKingdomFoil": "https://mtgjson.com/links/18f86e8a04682c34", "cardmarket": "https://mtgjson.com/links/d5a3d8cfb60767d4", "tcgplayer": "https://mtgjson.com/links/980f45f2bc8c3733"}, "rulings": [], "subtypes": ["Human", "Rogue"], "supertypes": ["Legendary"], "text": "When \"Brims\" Barone, Midway Mobster enters the battlefield, put a +1/+1 counter on each other creature you control that has a hat.\n\"Brims\" Barone, 
Midway Mobster has menace as long as you're wearing a hat.", "toughness": "4", "type": "Legendary Creature — Human Rogue", "types": ["Creature"]}]}
I am hoping the resulting CSV file will have the keys as column names, with the values assigned to the columns based on their keys.
EDIT:
I have now attached a screenshot of what the json_data structure looks like.
Assuming it's one of the JSON dumps from scryfall, try this:
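library(jsonlite)
library(tidyr)
library(tidyverse)
# find the JSON file in the working directory (anchored so only *.json matches)
todo <- list.files(pattern = "\\.json$")
json_data <- fromJSON(todo)
# call jsonlite::flatten explicitly: purrr (loaded by tidyverse) also exports flatten()
json_data_flat_jsl <- jsonlite::flatten(json_data)
df <- as_tibble(json_data_flat_jsl)
write_csv(df, "output.csv")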
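For comparison, here is a minimal Python sketch of the same conversion using only the standard library. It assumes the layout shown in the excerpt from the question: a top-level "data" object whose keys are card names and whose values are lists of card objects. The error message suggests those lists vary in length, which is why they cannot become tibble columns directly. The helper names (flatten, cards_to_rows, write_csv) and the sample document are hypothetical, and joining list values with ";" is just one possible choice.

```python
import csv
import io
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names; lists become one ';'-joined cell."""
    row = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            row[name] = ";".join(
                json.dumps(v) if isinstance(v, (dict, list)) else str(v)
                for v in value
            )
        else:
            row[name] = value
    return row

def cards_to_rows(doc):
    """Emit one row per card object, however many objects a name maps to."""
    rows = []
    for _card_name, faces in doc["data"].items():
        for face in faces:
            rows.append(flatten(face))
    return rows

def write_csv(rows, handle):
    """Write rows with the union of all keys as the header; missing cells stay empty."""
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(handle, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical miniature document with the same shape as the excerpt.
sample = {
    "data": {
        "Card A": [
            {"name": "Card A", "colors": ["G", "R"], "manaValue": 6.0,
             "identifiers": {"scryfallOracleId": "x"}}
        ],
        "Card B // Card C": [
            {"name": "Card B", "colors": [], "manaValue": 1.0},
            {"name": "Card C", "colors": ["U"], "manaValue": 2.0},
        ],
    }
}

rows = cards_to_rows(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real file, the sample would be replaced by json.load on the opened file, and the buffer by a file opened with newline="".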

Cassandra collection tombstones

I have created a table with a collection, inserted a record, and taken an sstabledump of it. I see there is a range tombstone for it in the sstable. Does this tombstone ever get removed? Also, when I run sstablemetadata on the only sstable, it shows "Estimated droppable tombstones" as 0.5. Similarly, it shows one record with the insert time as the epoch time: "Estimated tombstone drop times: 1548384720: 1". Does this mean that when I run sstablemetadata on a table with collections, the estimated droppable tombstone ratio and drop times are not accurate or dependable because of the collection/list range tombstones?
CREATE TABLE ks.nmtest (
reservation_id text,
order_id text,
c1 int,
order_details map<text, text>,
PRIMARY KEY (reservation_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC)
user#cqlsh:ks> insert into nmtest (reservation_id , order_id , c1, order_details ) values('3','3',3,{'key':'value'});
user#cqlsh:ks> select * from nmtest ;
reservation_id | order_id | c1 | order_details
----------------+----------+----+------------------
3 | 3 | 3 | {'key': 'value'}
(1 rows)
[root#localhost nmtest-e1302500201d11e983bb693c02c04c62]# sstabledump mc-5-big-Data.db
WARN 02:52:19,596 memtable_cleanup_threshold has been deprecated and should be removed from cassandra.yaml
[
  {
    "partition" : {
      "key" : [ "3" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 41,
        "clustering" : [ "3" ],
        "liveness_info" : { "tstamp" : "2019-01-25T02:51:13.574409Z" },
        "cells" : [
          { "name" : "c1", "value" : 3 },
          { "name" : "order_details", "deletion_info" : { "marked_deleted" : "2019-01-25T02:51:13.574408Z", "local_delete_time" : "2019-01-25T02:51:13Z" } },
          { "name" : "order_details", "path" : [ "key" ], "value" : "value" }
        ]
      }
    ]
  }
]
SSTable: /data/data/ks/nmtest-e1302500201d11e983bb693c02c04c62/mc-5-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1548384673574408
Maximum timestamp: 1548384673574409
SSTable min local deletion time: 1548384673
SSTable max local deletion time: 2147483647
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 1.0714285714285714
TTL min: 0
TTL max: 0
First token: -155496620801056360 (key=3)
Last token: -155496620801056360 (key=3)
minClustringValues: [3]
maxClustringValues: [3]
Estimated droppable tombstones: 0.5
SSTable Level: 0
Repaired at: 0
Replay positions covered: {CommitLogPosition(segmentId=1548382769966, position=6243201)=CommitLogPosition(segmentId=1548382769966, position=6433666)}
totalColumnsSet: 2
totalRows: 1
Estimated tombstone drop times:
1548384720: 1
Another question was about the nodetool tablestats output: what does "slice" refer to in Cassandra?
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
sstablemetadata has no information about your table beyond what is held within the sstable itself, because it is not guaranteed to run on a system that has Cassandra running, and even if it were, it's very complex to know how to pull the schema information from it.
Since gc_grace_seconds is a table parameter and not part of the sstable metadata, the tool defaults to assuming a gc grace of 0, so the drop times listed in that histogram are effectively a histogram of the tombstone creation times. If you know your gc grace, you can pass it with the -g parameter to your sstablemetadata call, like:
sstablemetadata -g 864000 mc-5-big-Data.db
See http://cassandra.apache.org/doc/latest/tools/sstable/sstablemetadata.html for information on the tool's output.
With collections, it's just a normal range tombstone with all that it entails. Range tombstones are used to avoid a read-before-write when overwriting the value of a multicell collection.
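To make the gc grace arithmetic concrete, here is a small Python sketch. It uses the "SSTable min local deletion time" from the output in the question and the 864000 seconds from the -g example. The function name droppable_after is hypothetical, and the calculation is a simplification: it only shows the earliest moment the tombstone becomes eligible for removal. Actual purging still happens during compaction and depends on other sstables holding overlapping data.

```python
from datetime import datetime, timezone

def droppable_after(local_delete_time, gc_grace_seconds):
    """Earliest epoch second at which the tombstone is eligible to be purged."""
    return local_delete_time + gc_grace_seconds

local_delete_time = 1548384673  # "SSTable min local deletion time" in the question
gc_grace_seconds = 864000       # value passed with -g in the example (10 days)

earliest = droppable_after(local_delete_time, gc_grace_seconds)
earliest_iso = datetime.fromtimestamp(earliest, tz=timezone.utc).isoformat()

print(earliest)      # 1549248673
print(earliest_iso)  # 2019-02-04T02:51:13+00:00
```

With a gc grace of 0, the same calculation returns the deletion time itself, which is consistent with the histogram showing a drop time right around the insert.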

how to print recursively a Python dictionary and its subdictionaries with whitespace alignment into columns

I want to create a function that can take a dictionary of dictionaries such as the following
information = {
    "sample information": {
        "ID": 169888,
        "name": "ttH",
        "number of events": 124883,
        "cross section": 0.055519,
        "k factor": 1.0201,
        "generator": "pythia8",
        "variables": {
            "trk_n": 147,
            "zappo_n": 9001
        }
    }
}
and then print it in a neat way such as the following, with alignment of keys and values using whitespace:
sample information:
   ID:               169888
   name:             ttH
   number of events: 124883
   cross section:    0.055519
   k factor:         1.0201
   generator:        pythia8
   variables:
      trk_n:   147
      zappo_n: 9001
My attempt at the function is the following:
def printDictionary(
    dictionary = None,
    indentation = ''
    ):
    # .items() works in Python 2 and 3 (iteritems() exists only in Python 2)
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # nested dictionary: print its key, then recurse one level deeper
            print("{indentation}{key}:".format(
                indentation = indentation,
                key = key
            ))
            printDictionary(
                dictionary = value,
                indentation = indentation + ' '
            )
        else:
            print(indentation + "{key}: {value}".format(
                key = key,
                value = value
            ))
It produces the output like the following:
sample information:
 name: ttH
 generator: pythia8
 cross section: 0.055519
 variables:
  zappo_n: 9001
  trk_n: 147
 number of events: 124883
 k factor: 1.0201
 ID: 169888
As shown, it successfully prints the dictionary of dictionaries recursively, but it does not align the values into a neat column. What would be a reasonable way of doing this for dictionaries of arbitrary depth?
Try using the pprint module. Instead of writing your own function, you can do this:
import pprint
pprint.pprint(my_dict)
Be aware that this will print characters such as { and } around your dictionary and [] around your lists, but if you can ignore them, pprint() will take care of all the nesting and indentation for you.
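If the braces are not acceptable, one possible approach is to pad each key to the width of the longest key at the same nesting level. The following is a minimal sketch of that idea, not a standard-library feature. The names format_aligned and step are hypothetical, and it assumes Python 3.7+ so that the dictionary keeps its insertion order.

```python
def format_aligned(dictionary, indent="", step="   "):
    """Return lines for a nested dict, with values aligned per nesting level."""
    lines = []
    # Width of the longest "key:" label among the non-dict values at this level.
    scalar_keys = [str(k) for k, v in dictionary.items() if not isinstance(v, dict)]
    width = max((len(k) for k in scalar_keys), default=0) + 1
    for key, value in dictionary.items():
        label = "{}:".format(key)
        if isinstance(value, dict):
            # Sub-dictionary: print the key alone, then recurse with more indentation.
            lines.append(indent + label)
            lines.extend(format_aligned(value, indent + step, step))
        else:
            lines.append("{}{} {}".format(indent, label.ljust(width), value))
    return lines

information = {
    "sample information": {
        "ID": 169888,
        "name": "ttH",
        "number of events": 124883,
        "cross section": 0.055519,
        "k factor": 1.0201,
        "generator": "pythia8",
        "variables": {
            "trk_n": 147,
            "zappo_n": 9001,
        },
    }
}

lines = format_aligned(information)
print("\n".join(lines))
```

Each nesting level gets its own column width, so the short keys under "variables" are not pushed out to the width of "number of events".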

How to use the function "table:get" (table extension) when 2 keys are required?

I have a .txt file with 3 columns: ID-polygon-1, ID-polygon-2 and distance.
When I import my file into NetLogo, I obtain 3 lists [[list1][list2][list3]], which correspond to the 3 columns.
I used table:from-list list to create a table with the content of the 3 lists.
I obtain {{table: [[1 1] [67 518] [815 127]]}} (the table displays the first two lines of my dataset).
For example, I would like to get the value of distance (list3) between ID-polygon-1 = 1 (list1) and ID-polygon-2 = 67 (list2), that is, 815.
How can I use table:get table key when I need 2 keys (ID-polygon-1 and ID-polygon-2)?
Thanks very much for your help.
Using table:from-list will not help you there: it expects "a list of two element lists, or pairs" where "the first element in the pair is the key and the second element is the value." That's not what you have in your original list.
Furthermore, NetLogo tables (and associative arrays in general) cannot have two keys. They are always just key-value pairs. Nothing prevents the value from being another table, however, and in your case, that is what you need: a table of tables!
There is no primitive to build that directly, however. You will need to build it yourself:
extensions [ table ]
globals [ t ]
to setup
  let lists [
    [ 1 1 ]     ; ID-polygon-1 column
    [ 67 518 ]  ; ID-polygon-2 column
    [ 815 127 ] ; distance column
  ]
  set t table:make
  foreach n-values length first lists [ ? ] [
    let id1 item ? (item 0 lists)
    let id2 item ? (item 1 lists)
    let dist item ? (item 2 lists)
    if not table:has-key? t id1 [
      table:put t id1 table:make
    ]
    table:put (table:get t id1) id2 dist
  ]
end
Here is what you get when you print the resulting table:
{{table: [[1 {{table: [[67 815] [518 127]]}}]]}}
And here is a small reporter to make it convenient to get a distance from the table:
to-report get-dist [ id1 id2 ]
  report table:get (table:get t id1) id2
end
Using get-dist 1 67 will give the 815 result you were looking for.
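The table-of-tables idea is not specific to NetLogo. As an illustration only, here is the same structure sketched in Python with nested dictionaries, using the two rows from the question. The function names build_distance_table and get_dist are hypothetical and mirror the setup procedure and the reporter above.

```python
def build_distance_table(id1_column, id2_column, distance_column):
    """Build {id1: {id2: distance}} from three parallel columns."""
    table = {}
    for id1, id2, dist in zip(id1_column, id2_column, distance_column):
        # Create the inner table the first time an id1 is seen, then store the distance.
        table.setdefault(id1, {})[id2] = dist
    return table

def get_dist(table, id1, id2):
    """Look up the distance with two keys: the outer key first, then the inner key."""
    return table[id1][id2]

t = build_distance_table([1, 1], [67, 518], [815, 127])
print(t)                    # {1: {67: 815, 518: 127}}
print(get_dist(t, 1, 67))   # 815
```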

Two index with one value in a lua table

I am very new to lua and my plan is to create a table. This table (I call it test) has 200 entries - each entry has the same subentries (In this example the subentries money and age):
This is a sort of pseudocode:
table test = {
Entry 1: money=5 age=32
Entry 2: money=-5 age=14
...
Entry 200: money=999 age=72
}
How can I write this in Lua? Is it possible? The alternative would be to write each subentry as a separate table:
table money = { }
table age = { }
But for me, this isn't a nice way, so maybe you can help me.
Edit:
The question "Table inside a table" is related, but I cannot write this out 200 times.
Try this syntax:
test = {
  { money = 5, age = 32 },
  { money = -5, age = 14 },
  ...
  { money = 999, age = 72 }
}
Examples of use:
-- money of the second entry:
print(test[2].money) -- prints "-5"
-- age of the last entry:
print(test[200].age) -- prints "72"
You can also turn the problem on its side and have 2 sequences in test, money and age, where each entry has the same index in both arrays.
test = {
  money = {1000, 100, 0, 50},
  age = {40, 30, 20, 25}
}
This will perform better, since you only have the overhead of 3 tables instead of n+1 tables, where n is the number of entries.
Either way, you have to enter your data somehow. What you'd typically do is use some easily parsed format like CSV, XML, ... and convert that to a table. Like this:
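s=[[
1000 40
100 30
0 20
50 25]]
test = { money = {}, age = {} }
n = 1
-- each line holds "money age"; %-? also accepts a leading minus sign
for balance, age in s:gmatch('(%-?[%d.]+)%s+(%-?[%d.]+)') do
  -- gmatch captures are strings, so convert them to numbers
  test.money[n], test.age[n] = tonumber(balance), tonumber(age)
  n = n + 1
end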
You mean you do not want to write "money" and "age" 200x?
There are several solutions but you could write something like:
local test0 = {
  5, 32,
  -5, 14,
  ...
}
local test = {}
for i = 1, #test0 / 2 do
  test[i] = { money = test0[2*i-1], age = test0[2*i] }
end
Otherwise you could always use metatables and create a class that behaves exactly like you want.
