Protecting Java Source Code From Being Accessed [closed] - encryption

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
Last week, I had to create a little GUI for homework.
None of my school mates did it. They have stolen my one from where we had to upload it and then they uploaded it again as theirs. When I told my teacher it was all my work he did not believe me.
So I thought of putting a useless method or something inside with a proof that I coded it. I thought of encryption. My best idea up till now:
String key = ("ZGV2ZWxvcGVkIGJ5IFdhckdvZE5U"); //My proof in base64
Can you think of some other better ways?

I had the same problem as you a long time ago. We had Windows 2000 machines and uploaded files to a Novel network folder that everyone could see. I used several tricks to beat even the best thieves: whitespace watermarking; metadata watermarking; unusual characters; trusted timestamping; modus operandi. Here's them in order.
Whitespace watermarking:
This is my original contribution to watermarking. I needed an invisible watermark that worked in text files. The trick I came up with was to put in a specific pattern of whitespace between programming statements (or paragraphs). The file looked the same to them: some programming statements and line breaks. Selecting the text carefully would show the whitespace. Each empty line would contain a certain number of spaces that's obviously not random or accidental. (eg 17) In practice, this method did the work for me because they couldn't figure out what I was embedding in the documents.
Metadata watermarking
This is where you change the file's metadata to contain information. You can embed your name, a hash, etc. in unseen parts of a file, especially EXE's. In NT days, Alternate Data Streams were popular.
Unusual characters
I'll throw this one in just for kicks. An old IRC impersonation trick was to make a name with letters that look similar to another person's name. You can use this in watermarking. The Character Map in Windows will give you many unusual characters that look similar to, but aren't, a letter or number you might use in your source code. These showing up in a specific spot in someone else's work can't be accidental.
Trusted Timestamping
In a nutshell, you send a file (or its hash) to a third party who then appends a timestamp to it and signs it with a private key. Anyone wanting proof of when you created a document can go to the trusted third party, often a website, to verify your proof of creation time. These have been used in court cases for intellectual property disputes so they are a very strong form of evidence. They're the standard way to accomplish the proof you're seeking. (I included the others first b/c they're easy, they're more fun and will probably work.)
This Wikipedia article might help your instructor understand your evidence and the external links section has many providers, including free ones. I'd run test files through free ones for a few days before using them for something important.
Modus operandi
So, you did something and you now have proof right? No, the students can still say you stole the idea from them or some other nonsense. My fix for this was to, in private, establish one or more of my methods with my instructor. I tell the instructor to look for the whitespace, look for certain symbols, etc. but to never tell the others what the watermark was. If the instructor will agree to keep your simple techniques secret, they will probably continue to work fine. If not, there's always trusted timestamping. ;)

If your classmates stole your code from the upload site, I would encrypt your homework and email the key to the teacher. You can do this with PGP if you want to be complicated, or something as simple as a Zip file with a password.
EDIT:
PGP would allow you to encrypt/sign without revealing your key, but you can't beat the shear simplicity of a Zip file with a password, so just pick a new key every homework assignment. Beauty in simplicity :)

If you are giving source code to the teacher, then simply add a serialVersionUID to one of your class files that is an encrypted version of your name. You can decrypt it to the teacher yourself.
That does not mean anything to the others, just for you. You can say it's a generated code, if they're stealing it, probably won't bother to modify it at all.
If you want to do it in a stylish way, you could use this trick, if you find the random seed that produces your name. :) That would be your number then, and wherever it appears that would prove that it was you who made that code.

This happened with a pair of my students who lived in the same apartment. One stole the source code from a disk left in a desk drawer.
The thief slightly modified the stolen source, so that it wouldn't be obvious. I noticed the similarity of the code anyway, and examined the source in an editor. Some of the lines had extra spaces at the ends. Each student's source had the same number of extra spaces.
You could exploit this to encode information without making it visible. You could encode your initials or your student ID at the ends of some lines, with spaces.
A thief will likely make cosmetic changes to the visible code, but may miss the non-visible characters.
EDIT:
Thinking about this a little more, you could use spaces and tabs as Morse-code dits and dahs, and put your name at the end of multiple lines. A thief could remove, reorder or retype some lines without destroying your identification.
EDIT 2:
"Whitespace steganography" is the term for concealing messages in whitespace. Googling it reveals this open-source implementation dating back to the '90s, using Huffman encoding instead of Morse code.

It seems like an IT administration problem to me. Each student should have there own upload area which cannot accessed by other students.
The teacher would be a higher level up, being able to access each student upload folder. If this is not possible go with #exabrial answer as that is the simpliest solution.

The best thing you can do is to just zip the source code with a password and e-mail the password to the teacher.
Problem solved.

Use a distributed (=standalone) version control system, like git. Might be useful too.
A version history with your name, and dates might be sufficiently convincing.

What was stolen ?
The source ? You can put random Strings in it (but it can be changed). You can also try to add a special behavior know only from you (a special keypress will change a color row), you can then ask to the teacher "the others know this special combo ?". Best way will be to crash the program if a empty useless file is not present in the archive after 5 minutes of activity, your school mates will be too lazy to wait this ammount of time.
The binary ? Just comparing the checksum of each .class will be enough (your school mates are too lazy to rewrite the class files)

Just post your solution at the last minute. This won't give time to anyone to copy it.
And send a feedback to the administrator to disallow students to see other students assignments.

If you upload the file in a .zip with password encryption, anyone can just crack the password by downloading the .zip file and have their cpu run a million queries at it if they are that big of a cheat thief. Unfortunately, some are and it's easy to do.
Your source can be viewed on the shared server by the other students. The teacher should really be giving you your own password encrypted directory to upload to. This could be done easily just by adding subdomains. But perhaps the teacher might allow you to upload the files to your own server for him to access them there.
It's also possible to obfuscate the script so that it has a document.write('This page was written by xxxxx'), forcing anyone who copies your work to not be able to remove the credit unless they first decrypt it. But the real answer is that your school needs to give each of its students their own password protected directories.

In my case, my teachers came with a better approach. The questions they provided has something to do with our registration number.
Ex:
Input to a function/theory is our Registration number, which is
different for each student
So, answers or the approach to the solution are relatively different from each student.This make the necessarily of all students has to do their homework on their own, or at-least get to know how to hack the approach with their own registration[it may be harder than learning the lession ;)].
Hope your lecturer will read this thread before his next tutorial :D

Related

Static data storage on server-side

Why some data on server-side are still stored in DBC files, not in SQL-DB? In particular - spells (spells.dbc). What for?
We have a lot of bugs in spells and it's very hard to understand what's wrong with spell, but it's harder to find it spell...
Spells, Talents, achievements, etc... Are mostly found in DBC files because that is the way Blizzard did it back in the day. It's true that in 2019 this is a pretty outdated way to work indeed. Databases are getting stronger and more versatile and having hard-coded data is proving to be hard to work with. Hell, DBCs aren't really that heavy anyways and the reason why we haven't made this change yet is that... We have no other reason other than it being a task that takes a bit of time and It is monotonous to do.
We are aware that Trinity core has already made this change but they have far more contributors than we do if that serves as an excuse!
Nonetheless, this is already in our to-do list if you check the issue tracker at the main repository.
While It's true that we can't really edit DBC files because we would lose all the progress when re-extracted or lost the files, however, we can modify spells in a C++ file called SpellMgr.
There we have a function called SpellMgr::LoadDbcDataCorrections().
The main problem while doing this change is that we have to modify the core to support this change, and the function above contains a lot of corrections. Would need intense testing to make sure nothing is screwed up in the process.
In here by altering bits you can remove or add certain properties to the desired spells instead of touching the hard coded dbc files.
If you want an example, in this link, I have changed an Archimonde spell to have no cast time.
NOTE:
In this line, the commentary about damage can be miss leading but that's because I made a mistake and I haven't finished this pull request yet as of 18/04/2019.
The work has been started, notably by Kaev. I think at least 3 DBCs are now useless server side (but probably still needed client side, they are called DataBaseClient for a reason) like item.dbc.
Also, the original philosophy (for ALL cores, not just AC) was that we would not touch DBC because we don't do custom modifications, so there was no interest in having them server side.
But we wanted to change this and started to make them available directly in the DB, if you wish to help with that, it would be nice!
Why?
Because when emulation started, dbc fields were 90% unknown. So, developers created a parser for them that just required few code changes to support new fields as soon as their functionality was discovered.
Now that we've discovered 90% of required dbc fields and we've also created some great conversion tools for DBC<->SQL, it's just a matter of "effort".
SQL conversion is useful to avoid using of client data on server (you can totally overwrite them if you don't want to go against EULA) or just extends/customize them.
Here you are the issue about DBC->SQL conversion: https://github.com/azerothcore/azerothcore-wotlk/issues/584

ACORD AL3 - What's the deal with "?"s

We're writing a parser for ACORD AL3. Read AL3 coming in, write AL3 going out. Nice and simple.
As of right now, it is 99% solid. The only thing that's driving me nuts is the use of "?"s in the ACORD AL3 standard. It appears that they are used as placeholders for fields that do not have values in the message. HOWEVER, that's not the only rule for it, because if it was, the AL3 I'm currently generating would look that the sample files I'm trying to have it match.
So if anyone here knows anything about the rules around AL3 "?"s, that would be great. I've been pouring over the Data Dictionary and the other documentation from ACORD, and I'm seeing nothing to indicate which fields get it, and which ones don't.
Also, if the "?"s are not required for AL3 processing to begin with, that would also be great to know, because then I could just stop worrying about the whole thing.
In the ACORD AL3 standard, from what I recall, if you use a "?" in one of the fields, this tells the receiving system to not overwrite (with blanks) the target field in the user's management system.
There may be individual elements in a group that are valid, but the sender
cannot send them for some reason. The solution is to fill that data element with questions marks (?????) The receiving system will recognize this and not update that field on their system.
In ACORD Al3 "?" means there is not any data in that specific element but that element is much important to maintain the hierarchy. But there is one thing, Coverage groups and Transaction groups will not contains any question mark. that does not means these are not important even these are very much important in Al3 files. But that above mentioned description applies for data group.
Secondly number of question marks in element describe its length.
If anyone need more details related to al3 data, don't hesitate to ask.

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as tweet so something like - "Firefox is cool" is the blog post and becomes #Firefox is cool http://myshortu.rl/dhsgeh on Twitter.
So far the way i see it is i need a database table with the words i want to convert to hashtags. I'd have to parse out the title and compare the words to those in the db and add on the pound sign. Is the best way to use a db table? Or can I do it with an in memory collection or keep the words in web.config? Thanks....
The decision on whether to use a database or file (such as web.config) might depend on whether you want to write code that allows you to maintain the list. e.g. Add, Modify, Remove. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.

Handling asynchronous edits of documents on the web

I am writing a Web application that has a user interface for editing data. The idea is something similar to a wiki where there are edits to chunks of text. What is the best way to handle asynchronous edits from multiple users? The situation I am considering is this:
There is a document that is version 0. User A is editing it when it is version 0. A few minutes later but before user A saves his changes, user B opens up the same document and starts editing. How should the server treat the two different edits to version 0 of the document? Also what is this problem called and where can I get more information about similar problems?
Wikipedia addresses this problem in the following way:
Assume that person A and person B are both editing the same document. Also assume person A submits their edits slightly before person B.
First the media wiki software runs a traditional diffing algorithm over both edits.
Next, the results of the diffing algorithm are used to merge the text.
If the diffing algorithm finds that there are merge conflicts (i.e. person A and B edited the same piece of text), then person B is asked to resolve the conflicts since they submitted their edits last.
Wikipedia handles merge conflicts much like conflicts in a code repository.
If you want to allow multiple people to edit a document simultaneously and in real-time (like with google wave or etherpad), then I'd recommend looking into operational transforms (aka OT). Though the OT algorithm is no harder or simpler than a traditional diffing algorithm, there is less information on it and fewer ready-made implementations.
One typical pattern is to send each user a chunk of text and also a version number indicating which version they received. The rule is that the host will only accept the first revision to the currently activer version.
That way only one person can revise each version; everyone else would be told that their version was obsolete and you can do what you want for them at that point - usually send them the current version to retry.
This only works if it's unlikely to have multiple people working on the same version. If that's likely, then you probably need to research how subversion for instance handles multiple revisions to source code.
There are also schemes for multiple people working simultaneously on the same text and being feed each others' updates - see Google Wave for one example.

query strings - use row ids or human readable values?

I'm building a form and I wonder if there is a significant advantage in showing values in a more human readable format; e.g:
index.php?user=ted&location=newyork
Rather than:
index.php?user=23423&location=34645
On the one hand, having the query string a little more readable allows the user and search engines to better understand where they are, but this also creates a little more work on the server side, as I'll have to track down the associated rows through something other than their unique id.
For example, first find what the id of 'newyork' is before being able to work on other rows that require the location_id. I always prefer to give the db as little work as possible.
Edit: decided to go with readability. I figure I can always use the mysql query cache to speed things up if necessary.
Use human readable values when you can. Just be sure to sanitise the input.
Edit: Yes, this can and should still be done for SEO purposes (if its worth it to you) if you have lots of choices. Even if the user has lots of choices, you should know what they are (or what the limits are) so that you can properly sanitise the input. For instance, if they are choosing states, you can know all 50. If they are just making up their own text, make sure on your end that its only text.
A good rule of thumb is to store data as id's and display it as human readable text etc.
Depends on your goals.
If you are talking about something like a blog where you want everyone to see everthing (and find it easily), then the human/search engine readable format is a no brainer.
If these pages are locked behind a login then it doesn't much matter. You can do what is easier on the database.
For most internet apps, I'd err to the side of readability since that will help with search engines as well.
You shouldn't worry about the efficiency for any typically sized application on any reasonable database engine. Write your app for users, not for query optimizers. QO's can more easily take care of themselves. Deal with optimization in the unlikely event you start seeing a problem.
I am going to come from a different direction. In my opinion a URL should be readable if want the user to be able to use the URL to change their parameters by editing the URL instead of using the UI. An example would be https://www.coolreportingapp.com/accountReport.jsp?account=ABC&month=200911 . In this example, the user can "easily" change the account or month they looking at without messing with the UI. This of course means you need to validate the URL params each and every time, which you should do anyway. If you don't want the user to alter the URL params, you need to obfuscate and hash values and use the hash to verify they haven't.
Seriously, IMHO, in your example, none is more readable than the other. Do normal users know that "&" is a separator from "variables" in "user=ted&location=newyork&"? Do they need to know that exists something like a variable? Having this in mind, what's the difference in showing numbers or words?
If you really want readable urls you should build SEO Friendly urls (human readable). Remember that even a "dashes vs underscores" simple question matters in the end.

Resources