I cannot find how to insert a line break in long titles inside nodes. For example:
library(DiagrammeR)
mermaid("
graph TB
A[GE Solution]-->C{ }
B[GA Solution]-->C{ }
C{ }-->D[Stir 10 mins at 500 r/min]
D[Stir 10 mins at 500 r/min]-->E[Homogenisation at 10000 r/min]
E[Homogenisation at 10000 r/min]-->F(Stir 10 min 450 r/min Complex coacervation)
")
Note that node F's label is too long. How do I make it into something like this?
|Stir 10 min 450 r/min|
|Complex coacervation |
Note \n doesn't work.
It appears you can use <br> instead:
mermaid("
graph TB
A[GE Solution]-->C{ }
B[GA Solution]-->C{ }
C{ }-->D[Stir 10 mins at 500 r/min]
D[Stir 10 mins at 500 r/min]-->E[Homogenisation at 10000 r/min]
E[Homogenisation at 10000 r/min]-->F(Stir 10 min 450 r/min <br> Complex coacervation)
")
Another approach that appears to work just fine is to use \n rather than <br>:
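For example (a sketch; note that inside an R string the backslash may need to be doubled so that mermaid, rather than R, interprets the \n):
mermaid("
graph TB
E[Homogenisation at 10000 r/min]-->F(Stir 10 min 450 r/min \\n Complex coacervation)
")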
I have two different data frames, one of 5.5 MB and the other of 25 GB. For each row, I want to check whether the two data frames have the same values in two specific columns.
For e.g.
x 0 0 a
x 1 2 b
y 1 2 c
z 3 4 d
and
x 0 0 w
x 1 2 m
y 5 6 p
z 8 9 q
I want to check whether the 2nd and 3rd columns are equal for each row; if they are, I return the 4th column from both data frames. Then I should have:
a w
b m
c m
Both data frames are sorted by the 2nd and 3rd column values. I tried in R, but the second file (25 GB) is too big. How can I obtain this new file in a faster way (even if it takes some hours)?
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR { a[$2,$3][$4]; next }    # 1st file: index the $4 values by ($2,$3)
($2,$3) in a {                    # 2nd file: was this ($2,$3) seen in the 1st?
    for (val in a[$2,$3]) {
        print val, $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
and with any awk (a bit less efficiently):
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] FS $4; next }    # 1st file: append $4 to a string keyed by ($2,$3)
($2,$3) in a {
    split(a[$2,$3], vals)                      # recover the individual $4 values
    for (i in vals) {
        print vals[i], $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
The above, when reading small_file (NR==FNR is only true for the first file read; look up those variables in the awk man page), creates an associative array a[] that maps an index built from the concatenation of the 2nd and 3rd fields to the list of 4th-field values seen for that 2nd/3rd-field combination. Then, when reading large_file, it looks up the current 2nd/3rd-field combination in that array and loops through all of the values stored for it in the first phase, printing each stored value (the $4 from small_file) followed by the current $4.
You said your small file is 5.5 MB and the large file is 25 GB. Since 1 MB is 1,048,576 bytes (see https://www.computerhope.com/issues/chspace.htm) and each of your sample lines is about 8 characters long, that puts your small file at roughly 700 thousand lines and your large one at roughly 3 billion lines (real lines are probably longer, so the true counts will be lower). awk processes input at on the order of a million lines per second on an average computer, so I'd expect the above to finish in minutes to tens of minutes, not many hours.
An alternative to Ed Morton's solution, built on the same idea: store all of small_file's $4 values for a key as one ORS-separated string, then use gsub() to append the current $4 after each stored value before printing:
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] $4 ORS; next }
($2,$3) in a {
    s = a[$2,$3]
    gsub(ORS, OFS $4 ORS, s)    # append the current $4 after every stored value
    printf "%s", s
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
I am looking to explore the GameTheory package from CRAN, but I would appreciate help in converting my data (a data frame of unique combinations and results) into the required coalition object. The precursor to this, I believe, is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has n ~ 30 'players' and a large number of unique combinations (say 1,000), for which I have 1 and 0 identifiers to describe the combinations. This data is sparsely populated in that I do not have data for all combinations, but I will assume combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations and act as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as coalition {1,2}'s base win rate (7.5%) times the stakes of the combinations it features in, plus player {1}'s base win rate (0.00333%) times the stakes of the combinations he features in but player {2} does not, i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example, row 4 corresponds to the coalition {1,2,3}, so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529
Descriptively, that is: set {1,2}'s success rate times the stakes of the combinations it appears in that {3} does not, {1,3} times those where {2} does not feature, {1,2,3} times its own occurrences, and the base player {1} times the rows where neither {2} nor {3} occurs.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this in to the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
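A sketch of one way to compute this in R, assuming the games data frame above: per the logic described, the value of a coalition S is the sum over observed rows of Stakes times the success rate of the intersection of S with that row's players, with never-observed intersections falling back to zero. Object names are illustrative:
library(GameTheory)

# success rates as numeric proportions ("7.50%" -> 0.075)
rate <- as.numeric(sub("%", "", games$Success_Rate)) / 100

combos <- as.matrix(games[, c("n1", "n2", "n3", "n4")])   # 0/1 membership per row
key <- apply(combos, 1, paste, collapse = "")             # e.g. "1100"
rate_by_key <- setNames(rate, key)

n <- 4
# all non-empty coalitions in DefineGame's order: singletons, then pairs, ...
coalitions <- unlist(lapply(1:n, function(k) combn(n, k, simplify = FALSE)),
                     recursive = FALSE)

value <- sapply(coalitions, function(S) {
  memb <- integer(n); memb[S] <- 1L
  # intersection of S with each row's players (both are 0/1, so multiply)
  eff_key <- apply(sweep(combos, 2, memb, `*`), 1, paste, collapse = "")
  r <- rate_by_key[eff_key]
  r[is.na(r)] <- 0    # combinations never observed are assumed worthless
  sum(r * games$Stakes)
})
round(value)    # should reproduce, up to rounding in the quoted rates,
                # c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)

game <- DefineGame(n, value)    # the coalition object for the package
Coalitions containing a player but not player 1 score zero automatically, since their intersection keys (e.g. "0100") never occur in the data.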
I'm using a hash table to store some values. Here are the details:
There will be roughly 1M items to store (not known in advance, so no perfect hash is possible).
The table has 10M slots.
The hash function is MurmurHash3.
I did some tests: storing 1M values, I get 350,000 collisions and 30 elements in the most-colliding slot.
Are these results good?
Would it make sense to implement binary search for the lists that build up in colliding slots?
What's your advice for improving performance?
EDIT: Here is my code:
program HashTest;
{$APPTYPE CONSOLE}
uses SysUtils;   // IntToStr, UIntToStr
var
  HashList: array [0..10000000 - 1] of Integer;
  I, Y: Integer;
  TotalCollisionsCount, MostCollidingSlotItemCount: Integer;
begin
  for I := 0 to High(HashList) do
    HashList[I] := 0;
  TotalCollisionsCount := 0;          // counters must be initialised
  MostCollidingSlotItemCount := 0;
  for I := 1 to 1000000 do
  begin
    Y := MurmurHash3(UIntToStr(I));   // MurmurHash3 as assumed available
    Y := Y mod Length(HashList);
    Inc(HashList[Y]);
    if HashList[Y] > 1 then           // slot already occupied: a collision
      Inc(TotalCollisionsCount);
    if HashList[Y] > MostCollidingSlotItemCount then
      MostCollidingSlotItemCount := HashList[Y];
  end;
  Writeln('Total: ' + IntToStr(TotalCollisionsCount) + ' Max: ' + IntToStr(MostCollidingSlotItemCount));
end.
Here is the result I get:
Total: 48169 Max: 5
Am I missing something?
This is what you get when you put 1M items randomly into 10M cells:
calendar_size=10000000 nperson = 1000000
E/cell| Ncell | frac | Nelem | frac |h/cell| hops | Cumhops
----+---------+--------+----------+--------+------+--------+--------
0: 9048262 (0.904826) 0 (0.000000) 0 0 0
1: 905064 (0.090506) 905064 (0.905064) 1 905064 905064
2: 45136 (0.004514) 90272 (0.090272) 3 135408 1040472
3: 1488 (0.000149) 4464 (0.004464) 6 8928 1049400
4: 50 (0.000005) 200 (0.000200) 10 500 1049900
----+---------+--------+----------+--------+------+--------+--------
Tot: 10000000 1000000 1.049900 1049900
The left column is the number of items in a cell; the second is the number of cells with that item count.
WRT the binary search: it is obvious that for chains as short as these (maximum chain length = 4, and most chains have length 1), linear search outperforms binary search. The crossover point is probably somewhere between 10 and 100 elements.
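These numbers follow from the Poisson approximation: hashing n = 1M keys uniformly into m = 10M cells makes each cell's count approximately Poisson with mean lambda = n/m = 0.1. A quick sanity check, sketched in R:
n <- 1e6                       # keys
m <- 1e7                       # cells
round(dpois(0:4, n / m) * m)   # expected cells holding 0..4 items
# 9048374  904837   45242    1508      38  -- close to the table above

# Expected collisions as counted by the questioner's code (each insertion
# into an already-occupied slot): n minus the expected number of occupied slots.
n - m * (1 - exp(-n / m))      # about 48374, matching the observed "Total: 48169"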
What Unix command returns the row number for all records in a file? Below is the requirement.
id name salary
10 a 1000
20 b 2000
30 c 3000
But I want output like this.
Row_id id name salary
1 10 a 1000
2 20 b 2000
3 30 c 3000
Thanks for your effort in advance.
Try:
nl file        ## nl numbers the (non-empty) lines
or
cat -n file    ## this numbers empty lines as well
awk '{ print FNR " " $0 }' file
It will also print the line number.
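Since the sample input has a header line and the desired output adds a Row_id column, a variant of that awk answer which treats the header specially might look like this (a sketch):
awk 'NR==1 { print "Row_id", $0; next } { print NR-1, $0 }' file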
I'm pretty new to statistics.
fisher = function(idxToTest, idxATI){
  idxDependent = c()
  dependent = c()
  p = c()
  for (i in seq_along(idxToTest)) {
    # cross-tabulate the column under test against the reference column
    tbl = table(data[[idxToTest[i]]], data[[idxATI]])
    rez = fisher.test(tbl, workspace = 20000000000)
    if (rez$p.value < 0.1) {
      dependent = c(dependent, TRUE)
      idxDependent = c(idxDependent, idxToTest[i])
    } else {
      dependent = c(dependent, FALSE)
    }
    p = c(p, rez$p.value)
  }
  # return the results (the original version computed them but returned nothing)
  list(dependent = dependent, idxDependent = idxDependent, p = p)
}
This is the function I use. It seems to work.
What I understand so far is that I have to pass data like this as the first parameter:
Men Women
Dieting 10 30
Non-dieting 5 60
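For reference, that 2x2 example can be constructed and tested directly (a small sketch using the numbers above):
tbl <- matrix(c(10, 5, 30, 60), nrow = 2,
              dimnames = list(c("Dieting", "Non-dieting"),
                              c("Men", "Women")))
fisher.test(tbl)    # tests independence of dieting and sex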
My data comes from a CSV:
data = read.csv('***.csv', header = TRUE, sep=',');
My first problem is that I don't know how to convert from:
Loan.Purpose Home.Ownership
lp_value_1 ho_value_2
lp_value_1 ho_value_2
lp_value_2 ho_value_1
lp_value_3 ho_value_2
lp_value_2 ho_value_3
lp_value_4 ho_value_2
lp_value_3 ho_value_3
to:
ho_value_1 ho_value_2 ho_value_3
lp_value1 0 2 0
lp_value2 1 0 1
lp_value3 0 1 1
lp_value4 0 1 0
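That cross-tabulation is what table() produces; assuming the column names shown above, something like:
myTable <- table(data$Loan.Purpose, data$Home.Ownership)
myTable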
The second issue is that I don't know what the second parameter should be.
POST UPDATE: This is what I get using fisher.test(myTable):
Error in fisher.test(test) : FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.
where myTable is:
MORTGAGE NONE OTHER OWN RENT
car 18 0 0 5 27
credit_card 190 0 2 38 214
debt_consolidation 620 0 2 87 598
educational 5 0 0 3 7
...
Basically, Fisher tests only work on smallish data sets because they require a lot of memory. But all is good, because chi-square tests make minimal additional assumptions and are easier on the computer. Just do:
chisq.test(data$Loan.Purpose, data$Home.Ownership)
to get your p-values.
Make sure you read through and understand the help page for chisq.test, especially the examples at the bottom.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
Then look at a mosaic plot to see the quantities, for example:
mosaicplot(table(data$Loan.Purpose, data$Home.Ownership))
This reference explains how mosaic plots work:
http://alumni.media.mit.edu/~tpminka/courses/36-350.2001/lectures/day12/