How do web scraping companies get their proxy pools? - web-scraping

I want to ask this question not from a business angle but a technical one.
How do they build their apparently massive proxy pools, with enough spare capacity to retire or repurpose the ones that get blacklisted by Google etc.?
I want to build my own system of rotating 'premium' proxies (ones that can get past rate-limiting websites, haven't been blacklisted anywhere, and handle other tricky cases), assuming premium proxies are something that actually exists and not a marketing term to raise the price of some features.
I just want to know whether these companies know something the rest of us don't, or whether it's simply a matter of buying a ton of proxies found online or something.
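To make the question concrete, here is the kind of rotation logic I have in mind. This is just a minimal sketch with made-up names (ProxyPool, Retire), not anything from a real provider:

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

// Hypothetical sketch: hand out proxies round-robin and retire
// the ones a target site (Google etc.) starts rejecting.
class ProxyPool
{
    private readonly List<Uri> _proxies;
    private int _next;
    private readonly object _gate = new object();

    public ProxyPool(IEnumerable<string> proxyUrls)
    {
        _proxies = new List<Uri>();
        foreach (var url in proxyUrls)
            _proxies.Add(new Uri(url));
    }

    // Hand out the next proxy, wrapping around the list.
    public Uri Next()
    {
        lock (_gate)
        {
            if (_proxies.Count == 0)
                throw new InvalidOperationException("Pool exhausted; acquire more proxies.");
            _next = (_next + 1) % _proxies.Count;
            return _proxies[_next];
        }
    }

    // Drop a proxy once it has been blacklisted.
    public void Retire(Uri proxy)
    {
        lock (_gate) { _proxies.Remove(proxy); }
    }

    // Route an HttpClient through a given proxy.
    public static HttpClient ClientFor(Uri proxy) =>
        new HttpClient(new HttpClientHandler { Proxy = new WebProxy(proxy) });
}

The hard part, and the part I'm really asking about, is where the list of proxy URLs comes from and how it stays healthy.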
Most of the people founding these companies seem to come from a data science background, which I don't have as a software developer.
Thank you.

Related

What do I need to know before I develop health-related software in the EU?

I have a client who needs to collect some health-related data, store it, and analyse it. Everything will be done through their website, which is based on WordPress.
I learned there is a HIPAA standard for how to process and work with such data. The entire data journey, from the person uploading, through the processing software and the cloud, to the person analysing it, has to be HIPAA-compliant, right?
However, I see no reference to it in the GDPR or any other EU data-protection regulation.
I have no idea what to do about it now. Could someone share their experience of developing a health-related app under European regulations?
It is important that my client's customers feel safe sharing their sensitive data.
Thank you

When a G-Suite form is embedded on an external website, does any form data get stored on the host site?

This question comes up because of very specific HIPAA requirements. A Covered Entity (CE), e.g. a doctor, can't use a cloud storage provider (CSP) unless they have a Business Associate Agreement (BAA) with the CSP, even if the data are encrypted and the CSP has no access. I'm not a security expert, but most web hosts' security would IMO satisfy HIPAA, if there were a BAA.
There's a conduit exception for video, ISPs, and other electronic equivalents of the USPS that do not store electronic Protected Health Information (e-PHI).
I don't know why, but the web hosts who will sign a BAA charge $100-300/mo for very basic hosting that other sites charge $5-15/mo for. I think they're preying on CE ignorance and the perception that there's lots of money sloshing around (true for radiology, but not for primary care).
G-Suite will execute a BAA, which makes it a reasonably priced solution for gathering Protected Health Information (PHI) from patients while keeping the CE compliant with HIPAA.
It's worth noting that "HIPAA compliance" is ONLY a property of CEs and Electronic Medical Records, not other software or sites. Any other product or service claiming "HIPAA compliance" is misrepresenting itself.
I find Google Sites not as user-friendly as most web hosts. There's less hand-holding for doing things like installing WP add-ins, or adding SSL certificates. Or maybe Google just does a terrible job of explaining how to actually DO something with a site hosted there. In any case, it seems easier to run a website on a web host that's set up to manage software and WP plug-ins for amateurs.
I'm willing to be educated on this. (24 hours later: I did a lot of self-education; see the answer below.)
The basic HIPAA privacy requirements are rather simple:
CEs can use PHI to treat patients and carry out essential functions, but must not share it with anyone not entitled to it.
The basic HIPAA security requirements are also simple:
1. Perform a security risk analysis.
2. Implement reasonable security measures.
3. Document why various measures were taken or not.
Some elements are required; others must simply be addressed, evaluated, and documented.
For example, 2FA is "addressable", as is data encryption, but performing the analysis, physical security, and employee training are required.
So my question is whether a G-Suite form embedded in a website on another web host stores any data on that host, or does it all go back to G-Suite (e.g. Google Drive), where it's secure and covered by a BAA?
The problem when you know very little about a topic is that you don't know what to ask. I know a bunch about HIPAA, not much about HTML. I did a lot more research, and there are at least two answers.
The short answer is NO: the embedded frame is an iframe linked over HTTPS to G-Suite.
The form in the iframe is a window into docs.google.com, so data never leaves docs.google.com, where it's covered by G-Suite's BAA. The host site is, in effect, a conduit.
<iframe src="https://docs.google.com/forms/..."></iframe>
Note the https scheme.
Embedding the form does not create a HIPAA violation.
The second answer is that G-Suite has its own content management system and website builder, which requires very little technical skill. There's no need to install WordPress or anything else; you just drag and drop to create a site. All the back-end stuff is done for you. And they execute a BAA, all for $6 a month. So G-Suite is much simpler, in fact so simple that only a child can do it. Their help pages leave much to be desired.
Bottom line: for small covered entities, G-Suite is a very economical website solution that doesn't create a HIPAA violation. Wish I'd known this yesterday!
FYI: HIPAA compliant Cloud Services

How to upgrade my asp.net app to support more users?

When an ASP.NET website has about 1,000 active users, it works good.
What should I do if the website has about 100,000 active users?
How do I upgrade my ASP.NET app to support a larger number of users?
Change the web app's architecture?
Or buy more web servers?
I just wonder how, in the real world, people build ASP.NET websites that support millions of users. What is the application architecture of such a site?
Any suggestions are welcome.
First, make sure you're with a first-rate hosting provider.
Second, download a performance profiler (I always suggest Red Gate Performance Profiler) and profile your app. Find the bottlenecks and eliminate them. Repeat until you get your desired performance metric.
If your application queries a database or other web services, try to use asynchronous methods. Async methods free the web server to handle many more client requests while it waits for a response from the database server or web service.
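For example, here is a minimal sketch of an async data call, assuming .NET 4.5+ and ASP.NET MVC (the connection string and query are placeholders):

using System.Data.SqlClient;
using System.Threading.Tasks;
using System.Web.Mvc;

public class ReportsController : Controller
{
    // While the database works, await releases the request thread back
    // to the pool so the server can accept other requests.
    public async Task<ActionResult> Index()
    {
        using (var conn = new SqlConnection("<your connection string>"))
        using (var cmd = new SqlCommand("SELECT COUNT(*) FROM Orders", conn))
        {
            await conn.OpenAsync();
            var count = (int)await cmd.ExecuteScalarAsync();
            return View(count);
        }
    }
}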
You say it "works good" at the moment. It's impossible to know at what point that may change without knowing a whole lot more about the nature of your traffic, your current setup, what else runs on the server, etc. It could be that it continues to "work good" with a million users as it is.
When you do need to make changes (and slowly degrading performance will alert you), that's when you need to worry. And then, as Justin says, knowing the potential bottlenecks will give you pointers to the solution you need.
Buying more servers is one strategy; so is changing the architecture. The easiest and most cost-effective is throwing more servers at it. It does depend a little on the current application architecture, but there's usually nothing that can't be easily overcome.
What I suggest is to load test your application and see what happens as you increase the number of active users. Who knows, it might handle 100k active users; maybe it won't, but at least you will know the tipping point.
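Even a crude script will show you the shape of the curve. A sketch, assuming .NET 4.5+, with https://your-site.example/ as a placeholder; a real test would use a dedicated load-testing tool:

using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class LoadTest
{
    static void Main()
    {
        RampUp().GetAwaiter().GetResult();
    }

    // Fire increasingly large batches of concurrent requests and time them;
    // the tipping point is where latency stops growing linearly.
    static async Task RampUp()
    {
        using (var client = new HttpClient())
        {
            foreach (var users in new[] { 10, 100, 1000 })
            {
                var sw = Stopwatch.StartNew();
                var tasks = Enumerable.Range(0, users)
                    .Select(_ => client.GetAsync("https://your-site.example/"));
                await Task.WhenAll(tasks);
                Console.WriteLine("{0} concurrent requests: {1} ms", users, sw.ElapsedMilliseconds);
            }
        }
    }
}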
As for what you should do, that really depends on your business needs. If your company has the $$ and this is a core product, then it makes sense to architect a robust application. If not, maybe throwing hardware at the problem is good enough.
It would also help if you could define an "active user". Is it someone who visits your site and has a session? Is it 100k concurrent requests to the server?
In terms of hardware scaling: Scaling Up or Scaling Out
Software scaling - Profile your app

User ownership of personal information [closed]

Closed. This question is off-topic and is not currently accepting answers. Closed 10 years ago.
At the moment it seems that most web apps store their user data centrally.
I would like to see a movement towards giving users total access to and ownership of their own personal information and data, ultimately allowing each user to choose where their data is stored.
As an example, with an application like Facebook, the user's profile data could live on any device they own (e.g. their mobile phone); Facebook would then request the data from the user and make use of it.
Does anyone see this idea becoming a reality? Is it a ridiculous idea?
CLARIFICATION:
The information would at least need to be cacheable. The motivation behind the idea is to give the user more control over their own data: the user self-publishes an authoritative version of what they are happy for the world to see.
I'm imagining a future which is largely dictated by choices made now. Perhaps the physical location of the data isn't actually important, and is more a symbolic gesture... but I think that decoupling the relationship between our information and the companies that make use of it could be a positive thing.
But perhaps, the details do need a bit more work ;)
What about performance? Imagine you want to search for data that is located on hundreds of mobile phones or private distributed systems.
What you're describing is similar to a combination of OpenID Attribute Exchange, Portable Contacts, and OpenSocial: one repository of user data that every other provider feeds off. It's nice for a user, but I would not go so far as to tie it to a specific device; rather, a federated identity that you control from one vendor's website/application.
I am with you on this one.
And I think the key technology might be RDF. Since vocabularies such as FOAF are already used in these social applications, it is a small step from Facebook storing your RDF graph to you storing it yourself, and saying: this is me, these are my friends, or anything else you might want someone to know.
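To make that concrete: a FOAF profile is just a handful of triples you could host anywhere. A minimal sketch in Turtle, with made-up names and URIs:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#me> a foaf:Person ;
    foaf:name "Alice Example" ;
    foaf:mbox <mailto:alice@example.org> ;
    foaf:knows <https://bob.example.org/card#me> .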
This approach might be generalised to other personal information you might need an authorised party to know, like health records.
There are quite a few conceptual problems with what you are suggesting.
Firstly, every time you reconnected to the system, you would need to upload your personal information back into it so that it could interact with you. This adds quite an overhead to the sign-in/handshake/auth with the remote system.
Secondly, a lot of online systems (particularly online communities) rely on you leaving an online profile of yourself so that other users can interact with you (via your profile) when you yourself are offline. That data would have to be kept somewhere central.
At the very least, the online system would need a very basic profile to represent you, so that you could log in and authenticate against it... which sounds like a contradiction of what you are suggesting.
Performance would suffer if the user had physical possession of the data, e.g. on a thumb drive or local drive. However, if a "padded cell" solution were possible, where the user has complete rights to a vault that the application can reach quickly, then there might be a possibility.
This really isn't a technology problem so much as one of corporate policy. Facebook could easily craft a policy stating that your records are yours, just as a bank should. They just don't. For that matter, many other institutions that are supposed to guard our personal information (our property, if I can invoke John Locke) fail miserably. If they reviewed their practices for violations of policy and were honest, you could trust them. Unfortunately this just doesn't happen.
The IRS, Homeland Security, and other agencies will always require that an institution yield access to assets. In the current climate I can't see how individuals would be allowed to remain in physical possession of the electronic records that a bank or institution uses online.
Don't misinterpret me: I think your idea is a good one to pursue, but it's more a corporate-policy issue than a technical one.
You need to clarify what you mean by ownership. Are you trying to ensure that the data is only stored on your own devices? As others have pointed out, this would make building social networks impossible. You would disappear from Facebook whenever you weren't connected, for example.
Or are you trying to ensure that a single authoritative copy exists and that services defer to it? This might be more feasible, and would essentially require syncing the master copy on your cell phone with the server when possible.
Or are you trying to ensure that you can edit or delete your account at any time? Most sites already work like this.
The user still wouldn't be sure they "own" their data, simply because they'd have to upload it every time they connect, and the company it's sent to could still do whatever it wants with it. It could simply not display your profile when you're not online, but still keep a copy of it somewhere.
Total access, ownership, and location choice for personal information and data is an interesting goal, but your example illustrates some fundamental architecture issues.
For example, Facebook is effectively a publishing mechanism. Anything you put on a public profile has essentially left the realm of information that you can reasonably expect to keep private. As a result, let's assume that public forums are outside the scope of your idea.
Within the realm of things that you can expect to keep private, I'm a big fan of encryption combined with physical and network security balanced against the need for performance. You use the mobile phone as an example. In that case, you almost certainly have at least three problems:
What encryption is used on the phone? Any?
Physical security risk is quite high - have you ever had an expensive portable electronic device stolen? There seems to be quite the stolen phone market out there....
The phone becomes a network hotspot - every service that needs your information would need to make an individual connection to your phone before it could satisfy a request. Your phone needs to be on, you need to have a sufficiently fat data pipeline, etc.
If you flip your idea around, however, it becomes clear that any organization that does require persistent storage of your sensitive private information (SPI) should meet some fundamental (and auditable) requirements:
Demonstrated need to persist the information: many web services already ask "should I remember you?" or "do you want to create an account?" I think the default answer should always be "NO" unless I say otherwise explicitly.
No resale or sharing of SPI. If I didn't tell my bank or my bookstore that they could share my demographic information, they shouldn't be able to. Admittedly, my phone number and address are in the phone book, so I can't expect to stay off every mailing list, but this would at least make things less convenient for the telemarketers.
Encryption all the time. My SPI should never be stored in the clear.
Physical security all the time. My SPI should never be on a laptop drive.
Given all of the above, it would be possible to partially achieve the goal of controlling the dissemination of your SPI. It wouldn't be perfect: the moment you type anything in, there is immediately a non-zero risk that someone somewhere has figured out how to monitor or capture it. Even so, you would have some control over where your information goes, some assurance that it only goes where you tell it to, and a somewhat reduced probability of it being stolen.
Admittedly, that's a lot of weasel words in a row....
We are currently developing a platform to let people exercise their right to access their personal data (habeas data) against any holder of such data.
Rather than following the approach you suggest, we pursue a different strategy: we take snapshots of the personal data as it exists in the database of the "data holder" whenever the individual wants to access her data.
Our objective is to give people freedom in the management of their own personal data, allowing them to share it with others based on their prior consent.
I would be happy to discuss this further should you be interested.
Please read Architecture Astronauts.

Membership bulk email software

We have a Microsoft web-stack site and a members database.
We want to start sending mass emails to our (opted-in) membership.
I don't particularly want to reinvent the wheel: a system where a web form submits and the server then mass-sends emails, looping through thousands of records without timing out or overloading the server...
I guess I'm looking for something like Mailman that can run on a Windows (ASP/ASP.NET + SQL Server) backend and do the work for me, with suitable APIs to manage the subscriber list and send emails to that list.
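In other words, I'd rather not write and babysit something like this myself. A hypothetical sketch (the SMTP host and addresses are placeholders) of the loop I want to avoid owning:

using System.Collections.Generic;
using System.Net.Mail;
using System.Threading;

class BulkMailer
{
    // Must run on a background thread, not in the page request,
    // so the web form can return immediately instead of timing out.
    public static void SendNewsletter(IEnumerable<string> addresses, string subject, string body)
    {
        var smtp = new SmtpClient("smtp.example.org");
        foreach (var address in addresses)
        {
            using (var message = new MailMessage("news@example.org", address, subject, body))
            {
                smtp.Send(message);
            }
            Thread.Sleep(50); // crude throttle so the relay isn't hammered
        }
    }
}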
Suggestions please?
I agree with acrosman: third parties that host email lists are a good way to go. A very reliable site I've found for mass emailing is http://mailing-list-services.com/. They do a good job of making sure their servers are never blacklisted or marked as spam. I've used them a few times; their website design blows, but their service is awesome. The Lyris ListManager software they use has a pretty extensive API.
Advanced Intellect has some great tools, like aspNetEmail and ListNanny.
MaxBulkMailer might be a solution for you. The organisation I work for uses it to connect to www.authsmtp.com, which gives us credits for a certain number of e-mails we can send per month. You can import a spreadsheet of your mailing list or tap straight into a SQL server and pull the names and addresses. Available for Mac and Windows.
(not a sales pitch)
My company offers a mail manager, but it's a hosted service. It has a full API, though.
You can also check out how DotNetNuke does this.
Unless you're running a business that specializes in email, I'd suggest you find a hosted solution. There are hundreds of little issues that come up when you run your own service over time. A hosted solution can save you lots of time and effort (and therefore money).
