Hash Table + Binary Search - hashtable

I'm using a hash table to store some values. Here are the details:
There will be roughly 1M items to store (the count isn't known in advance, so no perfect hash is possible).
The table has 10M slots.
The hash function is MurmurHash3.
In my tests, storing 1M values produced about 350,000 collisions, with 30 elements in the most-colliding slot.
Are these results good?
Would it make sense to implement binary search for the lists that form at colliding slots?
What's your advice for improving performance?
EDIT: Here is my code
var
  HashList: array [0..10000000 - 1] of Integer;
  I, Y: Integer;
  TotalCollisionsCount, MostCollidingSlotItemCount: Integer;
begin
  TotalCollisionsCount := 0;
  MostCollidingSlotItemCount := 0;
  // Empty all slots
  for I := 0 to High(HashList) do
    HashList[I] := 0;
  // Hash 1M distinct keys and count every insertion into an already-occupied slot
  for I := 1 to 1000000 do
  begin
    Y := MurmurHash3(UIntToStr(I));
    Y := Y mod Length(HashList);
    Inc(HashList[Y]);
    if HashList[Y] > 1 then
      Inc(TotalCollisionsCount);
    if HashList[Y] > MostCollidingSlotItemCount then
      MostCollidingSlotItemCount := HashList[Y];
  end;
  Writeln('Total: ' + IntToStr(TotalCollisionsCount) + ' Max: ' + IntToStr(MostCollidingSlotItemCount));
end.
Here is the result I get:
Total: 48169 Max: 5
Am I missing something?

This is what you get when you put 1M items randomly into 10M cells
calendar_size=10000000 nperson = 1000000
E/cell| Ncell | frac | Nelem | frac |h/cell| hops | Cumhops
----+---------+--------+----------+--------+------+--------+--------
0: 9048262 (0.904826) 0 (0.000000) 0 0 0
1: 905064 (0.090506) 905064 (0.905064) 1 905064 905064
2: 45136 (0.004514) 90272 (0.090272) 3 135408 1040472
3: 1488 (0.000149) 4464 (0.004464) 6 8928 1049400
4: 50 (0.000005) 200 (0.000200) 10 500 1049900
----+---------+--------+----------+--------+------+--------+--------
5: 10000000 1000000 1.049900 1049900
The left column is the number of items in a cell; the second is the number of cells with that item count.
Regarding the binary search: for chains this short (maximum chain length = 4, and most chains have length 1), linear search clearly outperforms binary search. The crossover point is probably somewhere between 10 and 100 elements per chain.
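The occupancy counts above follow a Poisson distribution with mean n/m = 0.1 almost exactly, so the question's measured numbers can be sanity-checked without running a simulation. A quick check, sketched in R for convenience:
n <- 1e6        # items
m <- 1e7        # slots
lambda <- n / m
# expected number of slots holding exactly k items, k = 0..5
round(m * dpois(0:5, lambda))
# about 9048374  904837  45242  1508  38  1, which lines up with the simulated counts above
# expected collisions as counted by the question's code
# (every insertion into an already-occupied slot)
n - m * (1 - exp(-lambda))
# about 48374, in line with the observed 48169; a maximum chain length of 5 is also plausible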

Extracting record from big endian data

I have the following code for a network protocol implementation. As the protocol is big-endian, I wanted to use the Bit_Order attribute with the High_Order_First value, but it seems I made a mistake.
with Ada.Unchecked_Conversion;
with Ada.Text_IO; use Ada.Text_IO;
with System;      use System;

procedure Bit_Extraction is

   type Byte is range 0 .. (2**8) - 1 with Size => 8;

   type Command is (Read_Coils,
                    Read_Discrete_Inputs
                   ) with Size => 7;
   for Command use (Read_Coils           => 1,
                    Read_Discrete_Inputs => 4);

   type has_exception is new Boolean with Size => 1;

   type Frame is record
      Function_Code : Command;
      Is_Exception  : has_exception := False;
   end record
     with Pack => True,
          Size => 8;

   for Frame use
      record
         Function_Code at 0 range 0 .. 6;
         Is_Exception  at 0 range 7 .. 7;
      end record;

   for Frame'Bit_Order use High_Order_First;
   for Frame'Scalar_Storage_Order use High_Order_First;

   function To_Frame is new Ada.Unchecked_Conversion (Byte, Frame);

   my_frame : Frame;

begin
   my_frame := To_Frame (Byte'(16#32#)); -- Big endian version of 16#4#
   Put_Line (Command'Image (my_frame.Function_Code)
             & " "
             & has_exception'Image (my_frame.Is_Exception));
end Bit_Extraction;
Compilation is OK, but the result is:
raised CONSTRAINT_ERROR : bit_extraction.adb:39 invalid data
What did I forget or misunderstand?
UPDATE
The real record in fact is
type Frame is record
   Transaction_Id : Transaction_Identifier;
   Protocol_Id    : Word := 0;
   Frame_Length   : Length;
   Unit_Id        : Unit_Identifier;
   Function_Code  : Command;
   Is_Exception   : Boolean := False;
end record with Size => 8 * 8, Pack => True;

for Frame use
   record
      Transaction_Id at 0 range 0 .. 15;
      Protocol_Id    at 2 range 0 .. 15;
      Frame_Length   at 4 range 0 .. 15;
      Unit_id        at 6 range 0 .. 7;
      Function_Code  at 7 range 0 .. 6;
      Is_Exception   at 7 range 7 .. 7;
   end record;
where Transaction_Identifier, Word and Length are 16 bits wide.
These fields are displayed correctly if I remove the Is_Exception field and extend Function_Code to 8 bits.
The dump of the frame to decode is as follows:
00000000 00 01 00 00 00 09 11 03 06 02 2b 00 64 00 7f
So my only problem is really to extract the 8th bit of the last byte.
So,
for Frame use
   record
      Transaction_Id at 0 range 0 .. 15;
      Protocol_Id    at 2 range 0 .. 15;
      Frame_Length   at 4 range 0 .. 15;
      Unit_id        at 6 range 0 .. 7;
      Function_Code  at 7 range 0 .. 6;
      Is_Exception   at 7 range 7 .. 7;
   end record;
It seems you want Is_Exception to be the LSB of the last byte?
With for Frame'Bit_Order use System.High_Order_First; the LSB will be bit 7.
(Also, 16#32# will never be the "big endian version of 16#4#"; the bit pattern just doesn't match.)
It may be more intuitive and clear to specify all of your fields relative to the word they're in, rather than the byte:
Unit_ID       at 6 range  0 ..  7;
Function_Code at 6 range  8 .. 14;
Is_Exception  at 6 range 15 .. 15;
Given the definition of Command above, the legal values for the last byte will then be:
2 -> READ_COILS FALSE
3 -> READ_COILS TRUE
8 -> READ_DISCRETE_INPUTS FALSE
9 -> READ_DISCRETE_INPUTS TRUE
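Those values follow directly from the layout: with Function_Code occupying the seven high-order bits of the byte and Is_Exception the low-order bit, the byte value is just code * 2 + exception. A quick arithmetic cross-check of the table above, sketched in R:
# byte value = Command representation * 2 + Is_Exception,
# given Function_Code in the 7 high-order bits and Is_Exception in the LSB
codes <- c(Read_Coils = 1, Read_Discrete_Inputs = 4)
outer(codes, c("FALSE" = 0, "TRUE" = 1), function(code, exc) code * 2 + exc)
#   Read_Coils:            FALSE -> 2, TRUE -> 3
#   Read_Discrete_Inputs:  FALSE -> 8, TRUE -> 9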
BTW, by applying your update to your original program and adding/changing the following, your program works for me:
add:
   with Interfaces;
add:
   type Byte_Array is array (1 .. 8) of Byte with Pack;
change (since we don't know the definitions):
   Transaction_ID : Interfaces.Unsigned_16;
   Protocol_ID    : Interfaces.Unsigned_16;
   Frame_Length   : Interfaces.Unsigned_16;
   Unit_ID        : Interfaces.Unsigned_8;
change:
   function To_Frame is new Ada.Unchecked_Conversion (Byte_Array, Frame);
change:
   my_frame := To_Frame (Byte_Array'(00, 01, 00, 00, 00, 09, 16#11#, 16#9#));
I finally found what was wrong.
In fact, the Modbus Ethernet frame definition states that, in case of an exception, the returned code is the function code plus 128 (0x80) (see the explanation on Wikipedia). That's why I wanted to represent it with a Boolean value, but my representation clauses were wrong.
The correct clauses are these:
for Frame use
   record
      Transaction_Id at 0 range 0 .. 15;
      Protocol_Id    at 2 range 0 .. 15;
      Frame_Length   at 4 range 0 .. 15;
      Unit_id        at 6 range 0 .. 7;
      Is_Exception   at 6 range 8 .. 8;
      Function_Code  at 6 range 9 .. 15;
   end record;
This way, the Modbus network protocol is correctly modelled (or maybe not, but at least my code works).
I really thank egilhh and simonwright for helping me find what was wrong and for explaining the semantics behind the aspects.
Obviously, I don't know whom to reward :)
Your original record declaration works fine (GNAT complains about the Pack, warning: pragma Pack has no effect, no unplaced components). The problem is with working out the little-endian Byte.
---------------------------------
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | BE bit numbers
---------------------------------
| c c c c c c c | e |
---------------------------------
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | LE bit numbers
---------------------------------
so if you want the Command to be Read_Discrete_Inputs, the Byte needs to have BE bit 4 (LE bit 3) set i.e. LE 16#8#.
Take a look at this AdaCore post on bit order and byte order to see how they handle it. After reading that, you will probably find that the byte value for your frame is really 16#08#, which is probably not what you are expecting.
Big endian / little endian typically refers to byte order rather than bit order, so when you see that network protocols are big endian, that means byte order. Avoid setting Bit_Order for your records; in modern systems you will almost never need it.
Your record is only one byte in size, so byte order won't matter for it by itself. Byte order comes into play when you have larger field values (more than 8 bits long).
The Bit_Order pragma doesn't reverse the order in which the bits appear in memory. It simply defines whether the most significant (leftmost) bit is logically referred to as zero (High_Order_First) or the least significant bit is referred to as zero (Low_Order_First) when interpreting the First_Bit and Last_Bit offsets from the byte position in the representation clause. Keep in mind that these offsets are taken from the MSB or LSB of the scalar the record component belongs to as a value. So, in order for the byte positions to carry the same meaning on a little-endian CPU as they do on a big-endian CPU, and for the in-memory representation of multibyte machine scalars to match (these exist when one or more record components sharing a byte position have a Last_Bit value that exceeds the capacity of a single byte), 'Scalar_Storage_Order must also be specified.

R data.table fread fails on special character

I can only give you a picture of the data I'm working with, or rather of the character that creates my problems in the .csv file; I don't know how to reproduce that character here.
This pillar character stops fread from working. Is there a way to escape it? readr's read_csv works through it with no problem. I have tried dropping the column, reading it as a character column, and using comment.char = "", but nothing seems to work.
Here is what I'm hoping to get out (and what I do get with read_csv):
# A tibble: 5 x 4
X1 trade date trade_condition
<dbl> <dbl> <date> <chr>
1 2902 28.3 2019-01-14 -12------P----
2 2903 28.0 2019-01-14 P
3 2904 28.0 2019-01-14 P
4 2905 28.0 2019-01-14 P
5 2906 28.1 2019-01-14 P
I'm using data.table_1.12.0.
Here is the output with verbose = T:
omp_get_max_threads() = 8
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C:/Users/Markku/Desktop/KONECRANES_2019.01.14/trades.csv
File opened, size = 592KB (606768 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,trade,date,trade_condition,sy>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 9 fields using quote rule 0
Detected 9 columns on line 1. This line is either column names or first data row. Line starts as: <<,trade,date,trade_condition,sy>>
Quote rule picked = 0
fill=false and the most number of columns found is 9
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (606767 bytes from row 1 to eof) / (2 * 27623 jump0size) == 10
Type codes (jump 000) : 57AAAA5AA Quote rule 0
A line with too-few fields (4/9) was found on line 4 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-few fields (4/9) was found on line 13 of sample jump 9. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 010) : 57AAAA5AA Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 858 sample rows
=====
Sampled 858 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 606683
Line length: mean=213.01 sd=86.78 min=59 max=372
Estimated number of rows: 606683 / 213.01 = 2849
Initial alloc = 5698 rows (2849 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 57AAAA5AA
[10] Allocate memory for the datatable
Allocating 9 column slots (9 - 0 dropped) with 5698 rows
[11] Read the data
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==1
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==2
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==3
jumps=[0..1), chunk_size=606683, total_size=606683
Read 2903 rows x 9 columns from 592KB (606768 bytes) file in 00:00.014 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
1 : float64 '7'
6 : string 'A'
=============================
0.003s ( 21%) Memory map 0.001GB file
0.007s ( 50%) sep=',' ncol=9 and header detection
0.000s ( 0%) Column type detection using 858 sample rows
0.000s ( 0%) Allocation of 5698 rows x 9 cols (0.000GB) of which 2903 ( 51%) rows used
0.004s ( 29%) Reading 1 chunks (0 swept) of 0.579MB (each chunk 2903 rows) using 1 threads
+ 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
+ 0.002s ( 14%) Transpose
+ 0.002s ( 14%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.014s Total
Warning message:
In fread(trades_file, verbose = T) :
Stopped early on line 2905. Expected 9 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2903,28.04,2019-01-14,"P>>
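The warning shows the parser giving up at the row where the odd character appears. Two things worth trying, based on the hint in the warning itself and on standard fread arguments (a sketch; whether either handles this particular character is untested):
library(data.table)

# 1) pad short/garbled lines instead of stopping, as the warning suggests
dt <- fread(trades_file, fill = TRUE)

# 2) or turn off quote processing, in case the stray character is being
#    treated as the start of an unterminated quoted field
dt <- fread(trades_file, quote = "")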

GameTheory package: Convert data frame of games to Coalition Set

I am looking to explore the GameTheory package from CRAN, but I would appreciate help converting my data (a data frame of unique combinations and results) into the required coalition object. I believe the precursor to this is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has roughly 30 'players' and a large number of unique combinations (say 1,000), which I describe with 1/0 indicator columns. The data is sparsely populated, in that I do not have data for all combinations, but I will assume combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations and act as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as the {1,2} success rate (7.5%) times the stakes of the combinations they feature in, plus player {1}'s base rate (0.00333%) times the stakes of the combinations he features in that player {2} does not, i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example, row 4 is the combination {1,2,3}, so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529
which, descriptively, is the {1,2} success rate times the stakes of the combinations it appears in that {3} does not, plus {1,3} where {2} does not feature, plus {1,2,3} for their joint occurrences, plus the base player {1} rate for the rows where neither {2} nor {3} occurs.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this into the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
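A minimal sketch of one way to generate this vector, assuming the rule described above (each row's stake earns the success rate of the exact sub-combination of the coalition that appears in that row, taken from the matching row when one exists, otherwise zero):
players <- c("n1", "n2", "n3", "n4")

# success rates as fractions, parsed from the "7.50%"-style column
games$rate <- as.numeric(sub("%", "", games$Success_Rate)) / 100

# key identifying the exact player set of each observed row, e.g. "1-2"
row_key <- apply(games[, players] == 1, 1, function(z) paste(which(z), collapse = "-"))
rate_of <- setNames(games$rate, row_key)

coalition_value <- function(S) {                 # S = integer vector of player indices
  total <- 0
  for (r in seq_len(nrow(games))) {
    P      <- which(games[r, players] == 1)      # players present in row r
    common <- intersect(P, S)                    # part of the coalition present in row r
    key    <- paste(common, collapse = "-")
    rate   <- if (key %in% names(rate_of)) rate_of[[key]] else 0
    total  <- total + rate * games$Stakes[r]
  }
  total
}

# all non-empty coalitions of 1..4, ordered by size and then lexicographically,
# matching the ordering described above
coalitions <- unlist(lapply(1:4, function(k) combn(1:4, k, simplify = FALSE)),
                     recursive = FALSE)
v <- sapply(coalitions, coalition_value)
round(v)
# the {1}, {1,2} and {1,2,3} entries come out as 105, 375 and 529, matching the
# worked examples above; entries that rely on the rounded 0.01% and 5.29% rates
# can differ slightly from the expected vector. v is then the ordered list of
# coalition values to feed to DefineGame().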

Dealing with conditionals in a better manner than deeply nested ifelse blocks

I'm trying to write some code to analyze my company's insurance plan offerings... but they're complicated! The PPO plan is straightforward, but the high-deductible health plans introduce a "split" deductible and out-of-pocket maximum (individual and total) for the family plans. It works like this:
Once the individual meets the individual deductible, he/she is covered at 90%
Once the remaining 1+ individuals on the plan meet the total deductible, the entire family is covered at 90%
The individual cannot satisfy the family deductible with only their medical expenses
I want to feed in a vector of expenses for my family members (there are four of them) and output the total cost for each plan. Below is a table of possible scenarios, with the following column codes:
ded_ind: did one individual meet the individual deductible?
ded_tot: was the total deductible reached?
oop_ind: was the individual out of pocket max reached?
oop_tot: was the total out of pocket max reached?
exp_ind = the expenses of the highest spender
exp_rem = the expenses of the remaining /other/ family members (not the highest spender)
oop_max_ind = the level of expenses at which the individual has paid their out of pocket maximum (the point where ded_ind + 0.1 * (exp_ind - ded_ind) equals the individual out of pocket max)
oop_max_fam = same as for the individual, but for the remaining family members
The table:
| ded_ind | oop_ind | ded_rem | oop_rem | formula
|---------+---------+---------+---------+---------------------------------------------------------------------------|
| 0 | 0 | 0 | 0 | exp_ind + exp_rem |
| 1 | 0 | 0 | 0 | ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem |
| 0 | 0 | 1 | 0 | exp_ind + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 1 | 1 | 0 | 0 | oop_max_ind + exp_fam |
| 1 | 0 | 1 | 0 | ded_ind + 0.1 * (exp_ind - ded_ind) + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 0 | 0 | 1 | 1 | oop_max_rem + exp_ind |
| 1 | 0 | 1 | 1 | ded_ind + 0.1 * (exp_ind - ded_ind) + oop_max_rem |
| 1 | 1 | 1 | 0 | oop_ind_max + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 1 | 1 | 1 | 1 | oop_ind_max + oop_rem_max |
Omitted: 0 1 0 0, 0 0 0 1, 0 1 1 0, and 0 1 0 1 are not present, as oop_ind and oop_rem could not have been met if ded_ind and ded_rem, respectively, have not been met.
My current code is a somewhat massive ifelse loop like so (not the code, but what it does):
check if plan is ppo or hsa
if hsa plan
    if exp_ind + exp_rem < ded_rem          # didn't meet family deductible
        if exp_ind < ded_ind                # individual deductible also not met
            cost = exp_ind + exp_rem
        else                                # ded_ind met; was oop_max_ind reached?
            if exp_ind < oop_max_ind
                cost = ded_ind + 0.1 * (exp_ind - ded_ind) + exp_fam   # didn't reach oop_max_ind
            else
                cost = oop_max_ind + exp_fam                           # reached oop_max_ind
    else ...
After the else, the total is greater than the family deductible. I check to see if it was contributed by more than two people and just continue on like that.
My question, now that I've given some background to the problem: is there a better way to manage conditional situations like this than nested if/else blocks that filter the cases down a bit at a time?
The code ends up feeling redundant: one checks some higher-level condition (consider the table rows where ded_rem is met or not met), yet one still has to check ded_ind and oop_max_ind in both cases, and that code is identical, just positioned at two different places in the if/else structure.
Could this be done with some sort of matrix operation? Are there other examples online of cleverer ways to deal with filtering conditions?
Many thanks for any suggestions.
P.S. I'm using R and will be creating an interactive app with Shiny so that other employees can input best- and worst-case scenarios for each of their family members and see which plan comes out ahead via a dot or bar chart.
The suggestion to convert to a binary value based on the result gave me an idea, which also helped me learn that one can do vectorized TRUE / FALSE checks (I guess that was probably obvious to many).
Here's my current idea:
expenses will be a vector of individual forecast medical expenses for the year (example of three people):
expenses <- c(1500, 100, 400)
We set exp_ind to the max value, and sum the rest for exp_rem
exp_ind <- max(expenses)
# [1] index of which() for cases with multiple max values
exp_rem <- sum(expenses[-which(expenses == exp_ind)[1]])
For any given plan, I can set up a vector with the cutoffs, for example:
individual deductible = 1000
individual out of pocket max = 2000 (need to incur 11k of expenses to get there)
family deductible = 2000
family out of pocket max = 4000 (need to incur 22k of expenses to get there)
Set those values:
ded_ind <- 1000
oop_max_ind <- 11000
ded_tot <- 2000
oop_max_tot <- 22000
cutoffs <- c(ded_ind, oop_max_ind, ded_tot, oop_max_tot)
Now we can check the input expense against the cutoffs:
result <- as.numeric(rep(c(exp_ind, exp_rem), each = 2) > cutoffs)
Last, convert to binary:
result_bin <- sum(2^(seq_along(result) - 1) * result)
Now I can set up functions for the possible outcomes based on the value in result_bin:
if(result_bin == 1) {cost <- ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem }
cost
[1] 1550
We can check this...
High spender would have paid his 1000 and then 10% of remaining 500 = 1050
Other members did not reach the family deductible and paid the full 400 + 100 = 500
Total: 1550
I still need to create a mapping of result_bin values to the corresponding functions, but doing a vectorized check and collapsing it to a unique binary code is much, much better, in my opinion, than my nested ifelse mess.
I look at it like this: I'd have had to set the variables and write the functions anyway. This saves me from 1) explicitly writing all the conditions, and 2) the redundancy issue I mentioned, where one ends up writing identical "sibling" branches under different parent splits in the ifelse structure; and lastly, 3) the code is far more easily followed.
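A sketch of that mapping (only the two cases already worked out above are filled in; the remaining entries would be transcribed from the scenario table): a named list of cost functions keyed by result_bin, looked up instead of walking nested ifelse branches.
cost_formulas <- list(
  "0" = function() exp_ind + exp_rem,                              # nothing met
  "1" = function() ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem   # only ded_ind met
  # "4", "5", ... : remaining result_bin codes, taken from the scenario table
)
cost <- cost_formulas[[as.character(result_bin)]]()
cost
# [1] 1550   (for the example expenses above, where result_bin is 1)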
Since this question is not very specific, here is a simpler example/answer:
# example data
test <- expand.grid(opt1=0:1,opt2=0:1)
# create a unique identifier to represent the binary variables
test$code <- with(allopts,paste(opt1,opt2,sep=""))
# create an input variable to be used in functions
test$var1 <- 1:4
# opt1 opt2 code var1
#1 0 0 00 1
#2 1 0 10 2
#3 0 1 01 3
#4 1 1 11 4
Respective functions to apply depending on binary conditions, along with intended results for each combo:
var1 + 10 #code 00 - intended result = 11
var1 + 100 #code 10 - intended result = 102
var1 + 1000 #code 01 - intended result = 1003
var1 + var1 #code 11 - intended result = 8
Use ifelse combinations to do the calculations:
test$result <- with(test,
ifelse(code == "00", var1 + 10,
ifelse(code == "10", var1 + 100,
ifelse(code == "01", var1 + 1000,
ifelse(code == "11", var1 + var1,
NA
)))))
Result:
opt1 opt2 code var1 result
1 0 0 00 1 11
2 1 0 10 2 102
3 0 1 01 3 1003
4 1 1 11 4 8

How can I apply fisher test on this set of data (nominal variables)

I'm pretty new to statistics:
fisher = function(idxToTest, idxATI){
  idxDependent = c()
  dependent = c()
  p = c()
  for(i in seq_along(idxToTest)){
    # contingency table of the i-th column to test against the target column
    tbl = table(data[[idxToTest[i]]], data[[idxATI]])
    rez = fisher.test(tbl, workspace = 20000000000)
    if(rez$p.value < 0.1){
      dependent = c(dependent, TRUE)
      idxDependent = c(idxDependent, idxToTest[i])
    } else {
      dependent = c(dependent, FALSE)
    }
    p = c(p, rez$p.value)
  }
  # return the collected results so the caller can use them
  list(p = p, dependent = dependent, idxDependent = idxDependent)
}
This is the function I use. It seems to work.
What I've understood so far is that I have to pass, as the first parameter, data like:
Men Women
Dieting 10 30
Non-dieting 5 60
My data comes from a CSV:
data = read.csv('***.csv', header = TRUE, sep=',');
My first problem is that I don't know how to convert from:
Loan.Purpose Home.Ownership
lp_value_1 ho_value_2
lp_value_1 ho_value_2
lp_value_2 ho_value_1
lp_value_3 ho_value_2
lp_value_2 ho_value_3
lp_value_4 ho_value_2
lp_value_3 ho_value_3
to:
ho_value_1 ho_value_2 ho_value_3
lp_value1 0 2 0
lp_value2 1 0 1
lp_value3 0 1 1
lp_value4 0 1 0
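A contingency table in that shape can be built directly with table() on the two columns of the data frame; this is the same table() call the function above already uses:
myTable <- table(data$Loan.Purpose, data$Home.Ownership)
myTable
# counts of each Loan.Purpose / Home.Ownership combination,
# ready to pass to fisher.test() or chisq.test()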
The second issue is that I don't know what the second parameter should be.
POST UPDATE: This is what I get using fisher.test(myTable):
Error in fisher.test(test) : FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.
where myTable is:
MORTGAGE NONE OTHER OWN RENT
car 18 0 0 5 27
credit_card 190 0 2 38 214
debt_consolidation 620 0 2 87 598
educational 5 0 0 3 7
...
Basically, Fisher's exact test only works on smallish data sets because it requires a lot of memory. But all is good, because the chi-squared test makes minimal additional assumptions and is easier on the computer. Just do:
chisq.test(data$Loan.Purpose, data$Home.Ownership)
to get your p-values.
Make sure you read through and understand the help page for chisq.test, especially the examples at the bottom.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
Then look at a mosaic plot to see the quantities, e.g.:
mosaicplot(table(data$Loan.Purpose, data$Home.Ownership))
this reference explains how mosaicplots work.
http://alumni.media.mit.edu/~tpminka/courses/36-350.2001/lectures/day12/
