Identify common rows between two datasets without leaking other rows - encryption

I have a file with encrypted rows of data. So does my colleague.
We need to see which rows we have in common, but must not be able to see the rows we don't have in common.
Any ideas on how we can encrypt our data so that we learn what we have in common but nothing else? It feels impossible, since we would both know the encryption key, but I'm sure you have ideas.
Thank you :).

Hash (SHA-2) each row, and share the hashes. This will tell you if the row is the same, but won't leak any information if it isn't.
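For illustration, a minimal sketch in Python (the file names and one-row-per-line format are assumptions; in practice each party sends only the hash set, never the file):

import hashlib

def row_hashes(path):
    # Hash each row with SHA-256; only these hashes ever get shared.
    with open(path, "rb") as f:
        return {hashlib.sha256(line.rstrip(b"\n")).hexdigest() for line in f}

# Each party computes this locally, then the two hash sets are exchanged.
mine = row_hashes("my_rows.txt")
theirs = row_hashes("colleague_rows.txt")

common = mine & theirs  # hashes of rows present on both sides
# Each party maps the common hashes back to their own plaintext rows;
# a hash seen on only one side says nothing about that row's content.

One caveat: this only hides rows with enough entropy; if rows are easily guessable, either side could brute-force the other's hashes.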

Related

(CHORD) Peer-2-Peer How does it work/What does it do?

https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
I've looked into Chord and I'm having trouble understanding exactly what it does.
It's a protocol for a distributed hash table, which stores various keys/values for later use? Is it just an efficient way to look up the value for a given key in the hash table?
Any help, such as a basic example, would be much appreciated.
An example question: say I insert the string "Hi" and it hashes to 3, but there is no peer at 3; would it go to the next available peer and be stored there? Or where does it store its values?
I already answered a similar question for BitTorrent/Kademlia, so just to summarize in a more general sense:
DHTs store the values with some redundancy on N nodes whose IDs are closest to the target hash.
Considering the vastness of >= 128-bit keyspaces, it would be extremely unlikely for a node's ID to exactly match a key, at least in routing schemes where nodes don't adjust their IDs based on content, and Chord is one of those.
It's pretty much the same as regular hash tables, hence distributed hash table. You have a limited set of buckets into which the entries are hashed, where the bucket space is much smaller than the potential input keyspace and thus does not precisely match the keys either.
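To make the bucket analogy concrete, here is a toy Python sketch of the one rule Chord's routing ultimately serves: a key belongs to its successor, the first node ID at or after the key's hash on the ring. (The node names and three-node ring are made up, and real Chord finds the successor in O(log n) hops via finger tables rather than with a sorted list.)

import hashlib
from bisect import bisect_left

def ring_id(data: bytes) -> int:
    # The original Chord paper uses SHA-1, giving a 160-bit ring.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

node_ids = sorted(ring_id(name.encode()) for name in ["nodeA", "nodeB", "nodeC"])

def successor(key: bytes) -> int:
    # First node whose ID is >= hash(key), wrapping around the ring.
    i = bisect_left(node_ids, ring_id(key))
    return node_ids[i % len(node_ids)]

print(successor(b"Hi"))  # the node responsible for storing key "Hi"

So the asker's intuition is right: if nothing sits exactly at the key's hash, the value goes to the next node around the ring.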

Split large sqlite table by sessionid field

I am relatively new to SQL(ite), and I'm learning as I go while working on a new project.
We have millions of transaction rows in one "data" table, one field being a "sessionid" field.
Since I want to concentrate on in-session activity for now, I primarily need to look only at transactions from the same session.
My intuition is that it would be a lot faster to separate the database by session into many single-session tables than to always query for a single sessionid and then proceed. My question: is that correct? Will it make a difference?
Even if not: could you help me out and tell me how I could split the one "data" table's rows into many session-specific tables, keeping the rows themselves the same? Plus one table that relates sessionIds to their tables?
Thanks!
A friend just told me that splitting into tables would be extremely inflexible, and that I should instead add an index on the sessionId column to access single sessions faster. Any thoughts on that, and on how best to do it?
First of all, have you run into any specific performance bottleneck so far? If yes, please describe it.
Having one table per session will probably speed things up: lookups touch a smaller table, and the indexes that INSERTs must maintain stay small.
SQLite doesn't impose a limit on the number of tables, so you should be okay.
Another solution that provides easier maintenance is to create one table per day or week.
Depending on how long your sessions last, this could be feasible or not.
Related: https://stackoverflow.com/a/811862/89771
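For the index approach the questioner's friend suggested, a minimal sqlite3 sketch (the data table and sessionid column come from the question; the database file name is an assumption):

import sqlite3

conn = sqlite3.connect("transactions.db")  # assumed file name
# With an index on sessionid, SQLite can jump straight to one session's
# rows instead of scanning millions of transactions.
conn.execute("CREATE INDEX IF NOT EXISTS idx_data_sessionid ON data(sessionid)")
rows = conn.execute(
    "SELECT * FROM data WHERE sessionid = ?",
    ("some-session-id",),  # placeholder session ID
).fetchall()
conn.close()

This keeps everything in one table, so queries that span sessions stay trivial, which is exactly the flexibility the per-session-table design gives up.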

Hashing - Separate Chaining - Efficiently dealing with repetitive numbers

I am constructing a hash table (mod 17, for example) and am trying to figure out an efficient way to deal with repeated key values. Suppose I generate 1000 random numbers; there is a chance that some of those numbers occur multiple times. My implementation is an array of 17 slots, each holding a linked list, and keys are stored in the list for their respective slot.
I want to implement a failsafe checker function that ensures there are no repeated keys in the hash table. I have been looking this up on the internet and have not found a definitive answer. My idea was to keep each linked list sorted and scan it to check whether the number is already there. Does anyone know of a better approach?
Any thoughts and comments greatly appreciated.
If I understand, you want multiple values for the same key? I think that is not possible. When you go to retrieve the value, which value would you choose?
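A sketch of the checker the question describes, in Python (chains are plain lists here instead of hand-rolled linked lists):

class ChainedHashTable:
    def __init__(self, size=17):
        self.slots = [[] for _ in range(size)]  # one chain per slot

    def insert(self, key):
        chain = self.slots[key % len(self.slots)]
        # Scan only this key's chain; refuse the insert if the key
        # is already there, so the table never holds duplicates.
        if key in chain:
            return False
        chain.append(key)
        return True

table = ChainedHashTable()
for n in (3, 20, 3):  # 3 and 20 collide mod 17; the second 3 is rejected
    table.insert(n)

Keeping each chain sorted, as the question suggests, only lets the scan stop early; with 1000 keys in 17 slots the chains average about 59 entries, so a plain scan of one chain is already cheap.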

How to determine position of specific character/string in SQLite string column value?

I have values in a SQLite table* that contain a number of strings, of different lengths, joined by periods, something like this:
SomeApp.SomeNameSpace.InterestingString.NotInteresting
SomeApp.OtherNameSpace.WantThisOne.ReallyQuiteDull
SomeApp.OtherNameSpace.WantThisOne.AlsoDull
SomeApp.DifferentNameSpace.AlwaysWorthALook.LittleValue
I'd like to extract (in this case) the third period-delimited substring, so I could write something like:
SELECT interesting_string, COUNT(*)
FROM (SELECT third_part_of_period_delimited_string(name) AS interesting_string
      FROM my_table)
GROUP BY interesting_string;
Obviously I can do this any number of ways programmatically; I'm wondering if there's any way to achieve this in a SQLite SELECT query?
* It's a SharpDevelop Profiler database, if anyone's curious
No.
You can, as you mention, work with the strings after you have selected them from the database. Or you can split them up into separate columns when they are stored.
If you do not have access to the code that stores the data, you might consider reading the data in its entirety, splitting the strings, and storing the split-out tokens in separate columns in a new table. If the data is not too large, you might look at keeping this table in an in-memory database for excellent performance.
Whether this is worthwhile depends on whether the one pass needed to split the strings can be reused many times. If the data is constantly changing, this scheme would probably not work well.
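A sketch of that split-and-store approach using Python's sqlite3 (the source database and table/column names are placeholders; the in-memory destination follows the suggestion above):

import sqlite3

src = sqlite3.connect("profiler.db")  # assumed source database
dst = sqlite3.connect(":memory:")     # in-memory table for fast repeated queries
dst.execute("CREATE TABLE name_parts (part3 TEXT)")

# One pass over the source: split each period-delimited name, keep token 3.
for (name,) in src.execute("SELECT name FROM my_table"):
    parts = name.split(".")
    if len(parts) >= 3:
        dst.execute("INSERT INTO name_parts VALUES (?)", (parts[2],))

for part, count in dst.execute(
    "SELECT part3, COUNT(*) FROM name_parts GROUP BY part3"
):
    print(part, count)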

Database design question: How to handle a huge amount of data in Oracle?

I have over 1.500.000 data entries, and the number is going to increase gradually over time. This huge amount of data comes from 150 regions.
Should I create 150 tables to manage this growing data? Will that be efficient? I need fast operations. ASP.NET and Oracle will be used.
If all the data is the same, don't split it into different tables. Take a look at Oracle's table partitions. One hundred fifty partitions (or more), split out by region, is probably more in line with what you're looking for.
I would also recommend you look at the Oracle Database Performance Tuning Tips & Techniques book and browse Ask Tom on Oracle's website.
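As a rough illustration of what that partitioning looks like (the table, columns, and connection details below are all assumptions, executed here via the python-oracledb driver):

import oracledb

ddl = """CREATE TABLE region_data (
    id        NUMBER PRIMARY KEY,
    region_id NUMBER NOT NULL,
    payload   VARCHAR2(4000)
) PARTITION BY LIST (region_id) (
    PARTITION p_region_1 VALUES (1),
    PARTITION p_region_2 VALUES (2),
    PARTITION p_other    VALUES (DEFAULT)
)"""

# Placeholder credentials; one partition per region lives in a single table.
conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
conn.cursor().execute(ddl)

Queries that filter on region_id then touch only the matching partition (partition pruning), which buys what the 150-table idea was after without the 150-way unions.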
Only 1.5 M rows? Not a lot really...
Use one table; working out how to write a 150-way union across 150 tables will be murder.
1.5 million rows doesn't really seem like much. How many people are accessing the table(s) at any given point? Do you have any indexes set up? If you expect it to grow much larger, you may want to look into database partitioning.
FWIW, I regularly work with databases of 100M+ rows. It shouldn't be a problem unless you have thousands of people using it at a time.
One table per region is far from normalized; you're probably going to lose a lot of efficiency there. One table per data-entry site is pretty unusual too. Normalization is huge; it will save you a ton of time down the road, so make sure you're not storing any duplicate data.
If you're using Oracle, you shouldn't need multiple tables. It will support far more than 1.5 million rows. If you need to speed up data access, you can try a snowflake schema to pull in commonly accessed data.
If you mean 1,500,000 rows in a table, then you do not have much to worry about. Oracle can handle much larger loads than that with ease.
If you need to identify the region each row came from, you can create a Region table and tie its ID to the big data table.
IMHO, you should post more details so we can help you better.
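A sketch of that Region lookup table (all names are assumptions), using the same driver as above:

import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
# Small lookup table: one row per region.
cur.execute("""CREATE TABLE region (
    region_id NUMBER PRIMARY KEY,
    name      VARCHAR2(100) NOT NULL
)""")
# The big table carries only the region's ID, not repeated region details.
cur.execute("""CREATE TABLE data_entry (
    entry_id  NUMBER PRIMARY KEY,
    region_id NUMBER NOT NULL REFERENCES region(region_id),
    payload   VARCHAR2(4000)
)""")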
A database with 2,000 rows can be slow. It all depends on your database design, indexes, keys, and, most importantly, the hardware configuration your database server runs on. The way your application uses the data is also important: is it a read-intensive or a transaction-intensive database? There is no single right answer to what you are asking right now.
You first need to consider what operations are going to access the table. How will inserts be performed? Will the existing rows be updated, and if so how? By how much will the rows grow, and what percentage of them will grow? Will rows get deleted? By what criteria? How will you be selecting data? By what criteria and how many per query?
Data partitioning can be used for volumes of data much larger than 1.5M rows. Look into optimizing the SQL queries, batch processing, and data storage.

Resources