Modifying HTML content on the fly => SLOW

We are working on proxy-based protection software. It catches the user's HTTP request, does the proxying, catches the HTTP response, modifies its content, and sends it back to the original user.
We have made two attempts:
First, a Squid proxy with a PHP wrapper around Squid.
It was promising, but when reading the PHP stream we did not know the length of the response data to expect, so every request ran into the timeout => SLOW
Now we have written a .NET application. It does everything we need, and it is pretty fast as long as it does not modify the content. As soon as we need to gzip/gunzip, or just modify the content, it becomes very slow.
Could you help us?
We have been working on this project for almost a year at our university in Hungary. We wrote an automatic, self-learning semantic analysis engine that can analyze and interpret text in any language and can detect and screen the target content. We also built image recognition software that can detect the target object with 90% confidence in any image.
So everything is ready, but our proxy application is stuck.
We could also pay for this job if anybody would write it.

I spend a lot of time programming in PHP - yes, as an interpreted language it can be slow - and there is a huge amount of badly written code available - but even before you start to touch the code, tuning the environment can reduce execution time by a factor of 5-10. Then changing the code can make it go faster still; the biggest wins come from good choices for architecture and data structures (which is true of any language - not just PHP).
I don't know where you're starting from, but I find it surprising that you are not able to process the stream quickly relative to the time taken to generate the content and send it across the network. For it to be timing out, something is very wrong. (You're not trying to parse the HTML using one of the XML parsers, are you?) The length of the content should have little impact on the performance of the script unless you are trying to map it all into PHP's address space at the same time.
However AFAIK, it's not possible to implement a content filter directly into Squid using PHP (if you did, I'd love to know how you did it, also if you've implemented ICAP, that's very interesting). I'm guessing that you are using a URL redirector to route the requests via a proxy script written in PHP.
It is possible to write an eCAP module in C/C++.
Image recognition and natural language processing are not trivial exercises in programming - so you must have some good programmers working on your team. Really addressing your problem goes rather beyond the scope of a stack overflow answer, and touting for contractors is definitely off topic.

Thanks for your reply!
First of all: our PHP is pretty fast. It is the read from the fsockopen handle that is slow, because it cannot know when to close the response connection from Squid.
Here is our code:
$buffer = socket_read($client, 4096);
if ( !($handle = fsockopen(HOST, SQUIDPROXYPORT, $errno, $error, 1)) ) {
Log::write($this->log, 'Errno: ' . $errno . ' Error: ' . $error . "\n" . $buffer);
exit('Failed to connect! ' . $errno . ':' . $error);
}
stream_set_timeout($handle, 0, 100000);
fwrite($handle, $buffer);
$result = '';
do {
$tmp = fgets($handle, 1024);
if ( $tmp ) {
$result .= $tmp;
}
} while ( !feof($handle) && $tmp != false );
socket_write($client, $result, strlen($result));
fclose($handle);
socket_close($client);
Again, how it works:
The client sends an HTTP request to us
Our PHP gets the request and sends its header to the Squid proxy
Squid does its work and sends the response data back to our PHP
Our PHP reads the response data from Squid through the fsockopen handle
We analyze the response data, or modify it
We send it back to the client
BUT:
While we are receiving the response data, we cannot know when to close the connection between our PHP and Squid. This makes it slow, and almost every request runs into the timeout.
If you have any ideas, please share them with us!
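One way to avoid waiting for the timeout is to stop reading until EOF and instead read exactly as much as the response headers announce. The following is a minimal sketch, assuming Squid answers with either a Content-Length header or chunked transfer encoding. read_http_response() and read_exactly() are hypothetical helper names, not part of the script above.

```php
<?php
// Hypothetical helper: read exactly $bytes bytes (or less if the stream ends).
function read_exactly($stream, int $bytes): string
{
    $data = '';
    while (strlen($data) < $bytes) {
        $part = fread($stream, $bytes - strlen($data));
        if ($part === false || $part === '') {
            break; // EOF or read timeout
        }
        $data .= $part;
    }
    return $data;
}

// Hypothetical helper: return one complete HTTP response as soon as its
// body is complete, instead of waiting for the connection to close.
function read_http_response($stream): string
{
    $head = '';
    $length = null;
    $chunked = false;

    // 1. Status line and headers, up to the first blank line.
    while (($line = fgets($stream)) !== false) {
        $head .= $line;
        if (rtrim($line, "\r\n") === '') {
            break;
        }
        if (preg_match('/^Content-Length:\s*(\d+)/i', $line, $m)) {
            $length = (int) $m[1];
        } elseif (preg_match('/^Transfer-Encoding:.*\bchunked\b/i', $line)) {
            $chunked = true;
        }
    }

    $body = '';
    if ($chunked) {
        // 2a. Chunked body: "<hex size>\r\n<data>\r\n", a size of 0 ends it.
        while (($sizeLine = fgets($stream)) !== false) {
            $body .= $sizeLine;
            $size = (int) hexdec(trim(explode(';', $sizeLine)[0]));
            if ($size === 0) {
                // Optional trailer headers, then the final blank line.
                while (($trailer = fgets($stream)) !== false) {
                    $body .= $trailer;
                    if (rtrim($trailer, "\r\n") === '') {
                        break;
                    }
                }
                break;
            }
            $body .= read_exactly($stream, $size + 2); // data plus CRLF
        }
    } elseif ($length !== null) {
        // 2b. Fixed-length body.
        $body = read_exactly($stream, $length);
    } else {
        // 2c. No length information: the body ends when the server closes.
        $body = stream_get_contents($stream);
    }

    return $head . $body;
}
```

In the script above, the do/while loop around fgets() could then be replaced by $result = read_http_response($handle);. The chunk framing is kept in the returned string so it can be forwarded unchanged. It has to be decoded first if the body is going to be modified. Another option is to add a "Connection: close" header to the request sent to Squid, so that the connection is closed after the response and feof() becomes true without a timeout.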

Related

Extending twilio plugin to work with WordPress REST API

I've worked through the twilio tutorials regarding sending and receiving SMS with WordPress. I integrated them into a test install I have and then merged them into one. (The receive one is pretty short, although it's not a full "receive" more than a blind response).
Then I came across themebound's twilio-core and so I had a look at that and quickly I got a fatal error because they both use the twilio helper library. For testing, I just deactivated the first one and activated the second, which leads me into my first question:
Both of these use the same library, and have both used require_once. Each loaded it into their own plugin folder. The original name of the library is twilio-php-master, one renames it twilio the other twilio-php. Not that it matters at all, since they're in separate locations. The fatal error is as a result of the inner workings having the same function names.
How can I test for the existence of the other plugins helper library and use that in place of the one that I have installed?
What about the order of how WordPress loads the plugins? It's likely if mine loads first... Well, I can't even get that far because it crashes just trying to have both activated.
--- we're now leading into the next question ---
As a result, I'm probably going to go with the twilio-core version because it seems more featured and is available on github, even if the author isn't overly active (there's a pull request that's months old and no discussion about it).
I want to extend the functionality of one of the sending plugins (either one) with respect to the receipt of the message from twilio. At the moment, both examples use classes for the sending and the receive that I'm using is not. As such I have just added it to the end of the main plugin file (the one with the plugin metadata). (a quick note, this is not the cause of the above fatal error, this happened before I started merging the send and receive).
Because the receive is involved with the REST API and not initiated by a user action on the system (ie someone in the admin area accessing the class through the admin panel), I'm not sure if it's appropriate that a) I put it there, and b) use the send function inside the class when further processing the receipt. I have an end goal of analysing the incoming message and forwarding it back out through twilio or other application or even just recording it in wordpress itself.
Should I keep the receive functionality/plugin separate from the sending one?
And this leads on to the hardest question for me:
How would I extend either plugin to make the send function available to my receive plugin? (this is where part of my confusion comes from) -->> Because both plugins only operate in the admin area, and the REST API isn't an actual user operating in the front-end, how can I call those functions in the admin area? Will it "just be available"? Do I have to replicate them on the public side? And if so, is it necessary to have it in the admin area as well?
edit: With respect to one of the comments below, I have tested and twl_send_sms is available once the helper is loaded. What I will do is determine a way to see if the helper is loaded (a function exists test will probably suffice) and if so, require or not require my version as appropriate.
From the receive message I am now able to craft a separate forward of a new message. But how can I pass the callback function the parameters of the initial inbound message? eg, how do I populate $sms_in with the POST data?
function register_receive_message_route() {
register_rest_route( 'sms/v1', '/receiver_sms', array(
'methods' => 'POST',
'callback' => 'trigger_receive_sms',
) );
}
function trigger_receive_sms($sms_in = '') {
/* we have three things to do:
* 1: send the reply to twilio,
* 2: craft a reply,
* 3: save the data message to the database
*/
header('Content-Type: text/xml'); // header() returns nothing, so it must not be echoed
echo ('<?xml version="1.0" encoding="UTF-8"?>');
echo ('<Response>');
echo (' <Message>Thanks, someone will be in contact shortly.</Message>');
echo ('</Response>');
$args = array(
'number_to' => '+xxxxxxxxxxx',
'message' => "Test Worked!\n$sms_in",
);
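twl_send_sms( $args );
}
// Register the route when the REST API is initialised.
add_action( 'rest_api_init', 'register_receive_message_route' );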

Twilio developer evangelist here.
It sounds to me as though your send functionality (whatever you can get out of twilio-core) is separate to your receive functionality. So I would likely split those up as two plugins and follow the instructions from the post you mentioned on how to write something that receives and responds to SMS messages. That way, twilio-core can use the Twilio library it bundles and your receive plugin won't need to use a library (it just needs to respond with TwiML, which is just XML).
I'm not sure I understand the last part of question 4 though, you can't really interact yourself with the receive endpoint of the plugin because all it will do is return XML to you. That's for Twilio to interact with.
Edit
In answer to your updated question, when your callback is triggered by a POST request it is passed a WP_REST_Request object. This object contains the POST body and the parameters can be accessed by array access.
function trigger_receive_sms($request) {
echo $request['Body'];
}
In good news, if you just plan to send two messages then you can do so entirely with TwiML. You just need to use two <Message>s:
echo ('<?xml version="1.0" encoding="UTF-8"?>');
echo ('<Response>');
echo (' <Message>Thanks, someone will be in contact shortly.</Message>');
echo (' <Message to="YOUR_OTHER_NUMBER">It worked!</Message>');
echo ('</Response>');
This way you don't need to worry about that other library.
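If the reply is going to contain text taken from the incoming message (for example $request['Body']), that text should be XML-escaped before it is echoed into the TwiML. A minimal sketch; build_twiml() is a hypothetical helper name and the array layout is an assumption made for illustration:

```php
<?php
// Hypothetical helper: builds a TwiML <Response> from a list of messages.
// Each entry is ['body' => string] with an optional 'to' phone number.
function build_twiml(array $messages): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n<Response>\n";
    foreach ($messages as $message) {
        // Escape attribute and element content so user text cannot break the XML.
        $to = isset($message['to'])
            ? ' to="' . htmlspecialchars($message['to'], ENT_XML1 | ENT_QUOTES, 'UTF-8') . '"'
            : '';
        $xml .= '  <Message' . $to . '>'
            . htmlspecialchars($message['body'], ENT_XML1 | ENT_QUOTES, 'UTF-8')
            . "</Message>\n";
    }
    return $xml . "</Response>\n";
}
```

Inside the callback this could be used as echo build_twiml(array(array('body' => 'Thanks, someone will be in contact shortly.'), array('body' => 'Forwarded: ' . $request['Body'], 'to' => 'YOUR_OTHER_NUMBER')));.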

How can subscribers be informed that an already published item has changed?

I want to implement a very basic RSS feed for a website that has an FAQ.
Subscribers will be informed about new questions/answers. That's great.
But questions will have essential changes in content from time to time (e.g. when a better answer for a question is found). Is RSS able to inform subscribers that such an answer has changed?
If not, what could be a good workaround? I'm thinking of offering another RSS feed which only announces changes to existing questions. Is this "the right way to go"?

The answer depends on what kind of client you want to implement your feed for.
If you are talking about simple RSS readers, there is no real standard for updating only the one entry that has changed, but most of them poll the feed periodically, effectively fetching the whole feed again, including the update.
But there are protocols, called light and fat pinging, that are specifically designed to handle what you want to do. The difference between the two is:
Light pinging means that the url of the feed that changed will be sent to the subscriber
Fat pinging means that the updated content of the feed that changed will be sent to the subscriber
One pretty popular fat-ping protocol, backed by Google, is called PubSubHubbub. There are a few services using it already, like Blogger, YouTube, MySpace, Tumblr or WordPress. They have open-source clients for a lot of languages available on their GitHub; I recommend taking a look at their wiki if you are interested, which is pretty complete and informative.
Using a second feed for updates does not seem like a good idea, since that's not what RSS is designed for, and it would mean that you have to implement a client for people to install to effectively get the updates.

I would do something with edit or update dates, titles and other XML node values: compare the old version with the new version to see whether a node has changed to another value. For example:
If you keep a variable for each XML node whose value can change, you can check for a change with if statements. In this example I'll check the value of $title.
<?php
$title = 'My awesome title';
/** Here your php code to get the content and assign variable to all nodes **/
$title_old = $title; /** Get the info your xml script or cached XML data **/
$title_new = 'My awesome title'; /** Use YOUR new value! Request the content again after someone clicks the save button, or use a crontab to check for changes every x minutes, hours or days **/
if($title_new == $title_old){
$title_changed = 'No changes detected to the title node';
echo $title_changed . ': The title is "' . $title . '"';
}else{
$title_changed = 'The title node has been changed';
echo $title_changed . ': The new title is "' . $title_new . '"';
}
?>
I hope this points you in the right direction.

You could format the links in such a way that they include a hash or an ID of the last answer. This way the links will get picked up as fresh, and if you have a custom feed reader, it can track whether the link is new or was already read by the user.
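A minimal sketch of that idea, assuming each FAQ entry has a stable URL and the answer text is available as a string. faq_item_link() and faq_rss_item() are hypothetical names, and the "#rev-" fragment is just one possible format:

```php
<?php
// Hypothetical helper: the link changes whenever the answer text changes,
// so feed readers treat the edited entry as a new item.
function faq_item_link(string $url, string $answer): string
{
    return $url . '#rev-' . substr(sha1($answer), 0, 8);
}

// Hypothetical helper: renders one RSS <item> using the revision link as guid.
function faq_rss_item(string $title, string $url, string $answer): string
{
    $link = htmlspecialchars(faq_item_link($url, $answer), ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return "<item>\n"
        . '  <title>' . htmlspecialchars($title, ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</title>\n"
        . '  <link>' . $link . "</link>\n"
        . '  <guid isPermaLink="false">' . $link . "</guid>\n"
        . "</item>\n";
}
```

An unchanged answer keeps the same link and guid, so readers that have already seen it will not show it again.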

How to get all client info from website visitors?

I want to collect all the information that we could when someone is visiting a webpage: e.g.:
client's screen resolution: <script type='text/javascript'>document.write(screen.width+'x'+screen.height); </script>
referer: <?php print ($_SERVER['HTTP_REFERER']); ?>
client ip: <?php print ($_SERVER['REMOTE_ADDR']); ?>
user agent: <?php print ($_SERVER['HTTP_USER_AGENT']); ?>
what else is there?

Those are the basic pieces of information. Anything beyond that could be viewed as SpyWare-like and privacy advocates will [justifiably] frown upon it.
The best way to obtain more information from your users is to ask them, make the fields optional, and inform your user of exactly what you will be using the information for. Will you be mailing them a newsletter?
If you plan to eMail them, then you MUST use the "confirmed opt-in" approach -- get their consent (by having them respond to an eMail, keyed with a special-secret-unique number, confirming that they are granting permission for you to send them that newsletter or whatever notifications you plan to send to them) first.
As long as you're up-front about how you plan to use the information, and give the users options to decide how you can use it (these options should all be "you do NOT have permission" by default), you're likely to get more users who are willing to trust you and provide you with better quality information. For those who don't wish to reveal any personal information about themselves, don't waste your time trying to get it because many of them take steps to prevent that and hide anyway (and that is their right).

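Dump everything PHP knows about the request and the server with this small script:
<?php
foreach($_SERVER as $key => $value){
// Escape the output: several $_SERVER values come straight from the client.
echo '$_SERVER["'.htmlspecialchars($key).'"] = '.htmlspecialchars(print_r($value, true))."<br />";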
}
?>

The list that is available to PHP is found here.
If you need more details than that, you might want to consider using Browserhawk.

For what end?
Remember that the client IP is close to meaningless now. All users coming from the same proxy or the same NAT point have the same client IP. Years ago, all AOL traffic came from just a few proxies, though now actual AOL users may be outnumbered by the proxies :).
If you want to uniquely identify a user, it's easy to create a cookie in Apache (mod_usertrack) or whatever framework you use. If the person blocks cookies, please respect that and don't try tricks to track them anyway. Or take the lesson of Google: make it so useful that people will choose the utility over cookie worries.
Remember that Javascript runs on the client. Your document.write() will show the info on their webpage, not do anything for your server. You'd want to use Javascript to put this info in a cookie, or store with a form submission if you have any forms.

I like to use something like this:
$log = array(
'ip' => $_SERVER['REMOTE_ADDR'],
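're' => $_SERVER['HTTP_REFERER'] ?? null, // not sent by every client
'ag' => $_SERVER['HTTP_USER_AGENT'] ?? null, // may be missing as well
'ts' => date("Y-m-d H:i:s",time()) // "H" is the 24-hour clock; "h" is ambiguous without am/pm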
);
echo json_encode($log);
You can save that string in a file, the JSON is pretty small and is just one line.
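A minimal sketch of saving one JSON line per visit. log_visit() and its parameters are hypothetical names, and the log file path is an assumption:

```php
<?php
// Hypothetical helper: appends one JSON-encoded line per visit to $file.
// $server is normally $_SERVER; $now can be passed in to make the time testable.
function log_visit(array $server, string $file, ?int $now = null): string
{
    $entry = [
        'ip' => $server['REMOTE_ADDR'] ?? null,
        're' => $server['HTTP_REFERER'] ?? null,
        'ag' => $server['HTTP_USER_AGENT'] ?? null,
        'ts' => date('Y-m-d H:i:s', $now ?? time()),
    ];
    $line = json_encode($entry);
    // FILE_APPEND adds to the end of the file; LOCK_EX keeps concurrent
    // requests from interleaving their lines.
    file_put_contents($file, $line . "\n", FILE_APPEND | LOCK_EX);
    return $line;
}
```

A call such as log_visit($_SERVER, '/path/to/visits.log'); then produces a file that can be read back line by line with json_decode().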

phpinfo(INFO_VARIABLES); // INFO_VARIABLES has the value 32
Prints a table with all the available variables. You can simply copy and paste the variable names directly into your PHP code.
e.g:
_SERVER["GEOIP_COUNTRY_CODE"] AT
would be in php code:
echo $_SERVER["GEOIP_COUNTRY_CODE"];

get all the outputs of $_SERVER variables:
<?php
$test_HTTP_proxy_headers = array('GATEWAY_INTERFACE','SERVER_ADDR','SERVER_NAME','SERVER_SOFTWARE','SERVER_PROTOCOL','REQUEST_METHOD','REQUEST_TIME','REQUEST_TIME_FLOAT','QUERY_STRING','DOCUMENT_ROOT','HTTP_ACCEPT','HTTP_ACCEPT_CHARSET','HTTP_ACCEPT_ENCODING','HTTP_ACCEPT_LANGUAGE','HTTP_CONNECTION','HTTP_HOST','HTTP_REFERER','HTTP_USER_AGENT','HTTPS','REMOTE_ADDR','REMOTE_HOST','REMOTE_PORT','REMOTE_USER','REDIRECT_REMOTE_USER','SCRIPT_FILENAME','SERVER_ADMIN','SERVER_PORT','SERVER_SIGNATURE','PATH_TRANSLATED','SCRIPT_NAME','REQUEST_URI','PHP_AUTH_DIGEST','PHP_AUTH_USER','PHP_AUTH_PW','AUTH_TYPE','PATH_INFO','ORIG_PATH_INFO','GEOIP_COUNTRY_CODE');
foreach($test_HTTP_proxy_headers as $header){
// Many of these keys are optional, so fall back to a placeholder when one is missing.
echo $header . ": " . htmlspecialchars((string) ($_SERVER[$header] ?? '(not set)')) . "<br/>";
}
?>

Create a timed cache in Drupal

I am looking for more detailed information on how I can get the following caching behavior in Drupal 7.
I want a block that renders information I'm retrieving from an external service. As the block is rendered for many users I do not want to continually request data from that service, but instead cache the result. However, this data is relatively frequent to change, so I'd like to retrieve the latest data every 5 or 10 minutes, then cache it again.
Does anyone know how to achieve such caching behavior without writing too much of the code oneself? I also haven't found much in terms of good documentation on how to use caching in Drupal (7), so any pointers on that are appreciated as well.

Keep in mind that cache_get() does not actually check if an item is expired or not. So you need to use:
if (($cache = cache_get('your_cache_key')) && $cache->expire >= REQUEST_TIME) {
return $cache->data;
}
Also make sure to use the REQUEST_TIME constant rather than time() in D7.

The functions cache_set() and cache_get() are what you are looking for. cache_set() has an expire argument.
You can use them basically like this:
<?php
// cache_get() also returns expired entries, so check the expiry time explicitly.
if (($cached_data = cache_get('your_cache_key')) && REQUEST_TIME < $cached_data->expire) {
// Return from cache.
return $cached_data->data;
}
// No or outdated cache entry, refresh data.
$data = _your_module_get_data_from_external_service();
// Save data in cache with 5min expiration time.
cache_set('your_cache_key', $data, 'cache', REQUEST_TIME + 60 * 5);
return $data;
?>
Note: You can also use a different cache bin (see documentation links) but you need to create a corresponding cache table yourself as part of your schema.

I think this should be $cache->expire, not expires. I didn't have luck with this example if I'm setting REQUEST_TIME + 300 in cache_set() since $cache->expires will always be less than REQUEST_TIME. This works for me:
if (($cache = cache_get('your_cache_key', 'cache')) && (REQUEST_TIME < $cache->expire)) {
return $cache->data;
}
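The pattern in these answers (look up, check the expiry yourself, refresh on a miss) can be tried outside Drupal. The following is a minimal sketch in plain PHP; TimedCache and cached_fetch() are hypothetical stand-ins, not Drupal APIs, and the clock is passed in so the behaviour is easy to check:

```php
<?php
// Hypothetical stand-in for a cache backend. Like the cache_get() behaviour
// described above, get() returns an entry even if it has expired.
final class TimedCache
{
    private array $items = [];

    public function set(string $key, $data, int $expire): void
    {
        $this->items[$key] = ['data' => $data, 'expire' => $expire];
    }

    public function get(string $key): ?array
    {
        return $this->items[$key] ?? null;
    }
}

// Returns cached data while it is fresh, otherwise calls $fetch and stores
// the result for $ttl seconds. $now plays the role of REQUEST_TIME.
function cached_fetch(TimedCache $cache, string $key, int $now, int $ttl, callable $fetch)
{
    $entry = $cache->get($key);
    if ($entry !== null && $now < $entry['expire']) {
        return $entry['data'];
    }
    $data = $fetch();
    $cache->set($key, $data, $now + $ttl);
    return $data;
}
```

With a TTL of 300 seconds, the external service is called at most once every five minutes, no matter how many users render the block in between.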

Are Sessions faster than running a processor intensive function?

I am using the native WordPress function wp_nav_menu() to create my site's navigation menus. This function takes a long time to run, especially if the navigation menu is large like mine is. So my thought to get around this is as follows:
session_start();
if(isset($_SESSION['topTranslucent']))
echo $_SESSION['topTranslucent'];
else {
// output buffering is necessary because wp_nav_menu() echoes its results
ob_start();
wp_nav_menu(array('menu'=>'Top Translucent','container'=>'','menu_id'=>'topMenu'));
$_SESSION['topTranslucent'] = ob_get_contents();
ob_end_flush();
}
My thinking here is that it will be much faster to print out the HTML stored in the session variable than to rerun the function on every page load. But not being too experienced with PHP sessions, I wanted to get some expert opinions from you lovely wunderkinds at StackOverflow. Question is: Are sessions actually just doing what they seem to be doing (i.e. storing text data in a cookie to be used across pages), or is there more than meets the eye?

Sessions store the serialized data on the server; they use cookies for identification only. Example:
Client:
cookie { PHPSESSID => '1234567890a' }
Server:
cookie { PHPSESSID => '1234567890a' }
=> session 1234567890a {
topTranslucent => '<yourcode>whatever</yourcode>'
}
Your approach could work; note that the whole session will be unserialized on load, so overusing this will slow down the system, as it will load a lot of data. Using this for a few small snippets should be OK.
Possibly a better approach would be using a different mechanism as a cache, but sessions-as-a-cache are somewhat usable.
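As one example of such a different mechanism, here is a minimal sketch of a file-based fragment cache that is shared between all visitors instead of being stored once per session. cached_fragment() is a hypothetical helper name, and the cache file location is an assumption:

```php
<?php
// Hypothetical helper: returns the output of $render, re-running it only
// when the cache file is missing or older than $ttl seconds.
function cached_fragment(string $file, int $ttl, callable $render): string
{
    clearstatcache(true, $file); // avoid stale results from PHP's stat cache
    if (is_file($file) && filemtime($file) + $ttl > time()) {
        return file_get_contents($file);
    }
    // Capture whatever $render echoes, as in the question's ob_start() approach.
    ob_start();
    $render();
    $html = ob_get_clean();
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}
```

For the menu in the question this could look like echo cached_fragment('/path/to/cache/top-menu.html', 300, function () { wp_nav_menu(array('menu'=>'Top Translucent','container'=>'','menu_id'=>'topMenu')); });. Within WordPress, the transients functions (get_transient() and set_transient()) are another option for the same pattern.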
