I have a problem with a Plone instance. On startup I get this CRITICAL message:
2011-03-25 10:23:06 CRITICAL ZODB.FileStorage /srv/plone/var/filestorage/Data.fs Database records 1258954454 seconds in the future
In the ZMI I can see that the Plone instance folder and everything in it has the date "2051-02-14 15:57" (the value of bobobase_modification_time). Every new object gets the very same timestamp.
Because of that, packing the ZODB doesn't make the Data.fs any smaller, and starting the instance takes a long time right before the CRITICAL message above appears. Other than that, the site seems to work okay; in particular, the time values within Plone appear to be correct.
I checked the following (a syntax-highlighted version of the debug session is here: http://pastie.org/1709881):
>>> plone = app.plonesite
>>> plone.created()
DateTime('2010/11/15 13:39:42.694 GMT+1')
>>> plone.modified()
DateTime('2010/11/15 13:39:42.694 GMT+1')
>>> plone.bobobase_modification_time()
DateTime('2051/02/14 15:57:21.077 GMT+1')
# Try to set creation date according to
# http://plone.org/documentation/kb/set-creation-date
# setCreationDate doesn't work anymore
>>> from DateTime import DateTime
>>> d = DateTime('2010/11/16')
>>> plone.setModificationDate(d)
>>> plone.setCreationDate(d)
Traceback (most recent call last):
File "", line 1, in ?
AttributeError: setCreationDate
>>> plone.setEffectiveDate(d)
>>> plone.reindexObject()
>>> plone.created()
DateTime('2010/11/15 13:39:42.694 GMT+1')
>>> plone.modified()
DateTime('2010/11/16')
>>> plone.bobobase_modification_time()
DateTime('2051/02/14 15:57:21.077 GMT+1')
What can I do about the wrong time returned by bobobase_modification_time()? Can I somehow set this value to a reasonable time?
Update: Writing this post gave me some new ideas about what to search for. I think the thread http://thread.gmane.org/gmane.comp.web.zope.general/12994/focus=12999 describes what happened to my site. I will now try to fix it by exporting and then re-importing the object.
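For reference, roughly what I have in mind from a bin/instance debug session is sketched below. This is only my rough plan (the object id is an example, the .zexp locations depend on the instance layout, and I'll back up Data.fs first):
>>> app.manage_exportObject('plonesite')        # writes plonesite.zexp into the instance's var/ directory (location may vary)
>>> app.manage_delObjects(['plonesite'])        # remove the original object
>>> # move plonesite.zexp into the instance's import/ directory before the next step
>>> app.manage_importObject('plonesite.zexp')   # re-import; the new records get fresh timestamps
>>> import transaction; transaction.commit()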
You ran Zope on a server with the clock way off into the future, and the ZODB really doesn't like that.
Someone once wrote a patch to auto-correct for this situation, see:
http://www.mail-archive.com/zodb-dev@zope.org/msg03916.html
YMMV applying that one though.
I have an Airflow DAG that works perfectly when files are present, but errors out and fails when the source files are not there.
I receive files from a given source at random times, and my DAG picks them up and processes them. While I need to run the DAG daily, files are not necessarily there daily. They could arrive on Monday, Wednesday, or even Sunday evening.
I'm not worried about days with no new files; I worry about days when new files come and it breaks.
How do I tell the DAG to gracefully exit with success when no files exist?
My DAG is below (please ignore the schedule setting; I'm still in development mode):
import airflow
from airflow import models
from airflow.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['email@gmail.com'],
    'email_on_failure': True,
    'schedule_interval': 'None',
}

dag = models.DAG(
    dag_id='Source1_Ingestion',
    default_args=args
)
# [START load ATTOM File to STAGING]
load_File_to_Source1_RAW = GoogleCloudStorageToBigQueryOperator(
    task_id='Source1_GCS_to_GBQ_Raw',
    bucket='Source1_files',
    source_objects=['To_Process/*.txt'],
    destination_project_dataset_table='Source1.Source1_RAW',
    schema_fields=[
        {'name': 'datarow', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
    field_delimiter='§',
    write_disposition='WRITE_TRUNCATE',
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    bigquery_conn_id='GCP_EDW_Staging',
    dag=dag)
# [END howto_operator_gcs_to_bq]
# [START move files to Archive]
archive_Source1_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='Archive_Source1_Files',
    source_bucket='Source1_files',
    source_object='To_Process/*.txt',
    destination_bucket='Source1_files',
    destination_object='Archive/',
    move_object=True,
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    dag=dag
)
# [END move files to archive]

load_File_to_Source1_RAW.set_downstream(archive_Source1_files)
One way to approach this would be to add a Sensor Operator to the workflow.
Nehil Jain describes sensors nicely:
Sensors are a special kind of airflow operator that will keep running until a certain criterion is met. For example, you know a file will arrive at your S3 bucket during certain time period, but the exact time when the file arrives is inconsistent.
For your use case, it looks like there's a Google Cloud Sensor, which "checks for the existence of a file in Google Cloud Storage." The reason you'd incorporate a sensor is that you're decoupling the operation "determine if a file exists" from the operation "get the file (and do something with it)".
By default, sensors have two methods (source):
poke: the code to run at poke_interval times, which tests to see if the condition is true
execute: use the poke method to test for a condition on a schedule defined by the poke_interval; fails out when the timeout argument is reached
In a common file-detection sensor, the operator receives instructions to check a source for a file on a schedule (e.g. check every 5 minutes for up to 3 hours to see if the file exists). If the sensor succeeds in meeting its test condition, it succeeds and allows the DAG to continue downstream to the next operator(s). If it fails to find the file, it times out and the sensor operator is marked failed.
With just a sensor operator, you've already succeeded in separating the error cases - the DAG fails at the GoogleCloudStorageObjectSensor instead of the GoogleCloudStorageToBigQueryOperator when the file doesn't exist, and fails at the GoogleCloudStorageToBigQueryOperator when something is wrong with the transfer logic. Importantly for your use case, Airflow supports a soft_fail argument, which "mark[s] the task as SKIPPED on failure".
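As an illustration only, a minimal sketch of that for your DAG might look like the following. I'm assuming the Airflow 1.10 contrib import path and reusing the connection and bucket names from your question; note that this stock sensor checks one exact object name, not a wildcard:

from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor

wait_for_source1_file = GoogleCloudStorageObjectSensor(
    task_id='Wait_For_Source1_File',
    bucket='Source1_files',
    object='To_Process/myfile.txt',   # hypothetical exact object name - no wildcards here
    google_cloud_conn_id='GCP_EDW_Staging',
    poke_interval=300,                # re-check every 5 minutes
    timeout=3 * 60 * 60,              # give up after 3 hours
    soft_fail=True,                   # mark SKIPPED instead of FAILED on timeout
    dag=dag)

wait_for_source1_file.set_downstream(load_File_to_Source1_RAW)

With soft_fail=True, a timeout skips the sensor, and the downstream load and archive tasks are skipped along with it, which gives you the "gracefully exit" behaviour you're after.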
I'll caveat this next part by explicitly stating that I'm not intimately familiar with the GoogleCloudStorage operators. If the operator doesn't allow wildcarding in the sensor, you may need to rewire the sensor's poke method to allow for more complex, pattern-based file detection. This is where Airflow's plug-in architecture can really shine, allowing you to modify and extend existing operators to meet your exact needs.
The example I'll give you here is that the SFTPSensor only supports poking for a specific file out of the box. I needed wildcard-based poking, so I wrote a plugin that modifies the SFTPSensor to support regular expressions in file identification. In my case, it was just a matter of modifying the poke method to switch from polling for the existence of a single file to listing the files and filtering that list with a regular expression.
At a cursory glance, it looks like the way that the GoogleCloudStorageSensor pokes for an object is with the hook.exists method. I can't speak to whether a wildcard would work there, but if it doesn't, it looks like there's a hook.list method which would allow you to implement a similar workflow to what I did for the SFTPRegexSensor.
I've included some of the source code for the SFTPRegexSensor Plugin's poke method, modified for how I think it'd work with GCS in case it's helpful:
def poke(self, context):
    # Create a hook (the SSH/SFTP intricacies are removed for simplicity); for GCS
    # this would be a GoogleCloudStorageHook built from the operator's connection id.
    # You need to define operator parameters for the choices that are dynamic in the
    # poke (e.g. which bucket, what the file prefix is); the GCS arguments are
    # swapped in here. Assumes `import re` at module level.
    files = hook.list(self.bucket, prefix=self.prefix)  # object names under the prefix
    # Filter the listing with the configured regular expression.
    regex = re.compile(self.remote_filename)
    files = list(filter(regex.search, files))
    if not files:
        return False
    return True
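Putting the pieces together, a full custom sensor could look roughly like the sketch below. This is my assumption of how it would be wired up against the Airflow 1.10 contrib GCS hook; the class name and parameter names are illustrative, not an existing operator:

import re

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class GoogleCloudStorageRegexSensor(BaseSensorOperator):
    """Succeeds once at least one object under `prefix` matches `regex`."""

    @apply_defaults
    def __init__(self, bucket, prefix, regex,
                 google_cloud_storage_conn_id='google_cloud_default',
                 *args, **kwargs):
        super(GoogleCloudStorageRegexSensor, self).__init__(*args, **kwargs)
        self.bucket = bucket
        self.prefix = prefix
        self.regex = regex
        self.google_cloud_storage_conn_id = google_cloud_storage_conn_id

    def poke(self, context):
        hook = GoogleCloudStorageHook(
            google_cloud_storage_conn_id=self.google_cloud_storage_conn_id)
        files = hook.list(self.bucket, prefix=self.prefix)
        pattern = re.compile(self.regex)
        return any(pattern.search(f) for f in files or [])

You would then drop an instance of this in front of the load task, with soft_fail=True, exactly like the stock object sensor shown earlier.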
While running pySpark SQL pipelines via Airflow I am interested in getting out some business stats like:
source read count
target write count
sizes of DFs during processing
error records count
One idea is to push them directly to the metrics, so they get automatically consumed by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about it in the docs.
Please post at least some pseudo code if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"


def your_python_operator(**context):
    [...]

    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except:
        logging.exception("Caught exception during statistics logging")

    [...]
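To have real numbers to feed into those calls, you could compute them inside the PySpark job itself. A minimal sketch under my own assumptions (the paths, DataFrame names, and the "dropna as error filter" step are purely illustrative, and Stats only ends up somewhere useful if Airflow is configured with a StatsD backend):

from airflow.stats import Stats
from pyspark.sql import SparkSession

PYSPARK_LOG_PREFIX = "airflow_pyspark"


def run_pipeline(**context):
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source: count what was read.
    source_df = spark.read.parquet("gs://my-bucket/source/")
    src_read_count = source_df.count()

    # Stand-in for the real transformations; dropped rows are treated as "errors".
    result_df = source_df.dropna()
    tgt_write_count = result_df.count()
    error_count = src_read_count - tgt_write_count

    # Hypothetical target: count what was written.
    result_df.write.mode("overwrite").parquet("gs://my-bucket/target/")

    Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
    Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
    Stats.incr(f"{PYSPARK_LOG_PREFIX}_error_count", error_count)

    # Returning the counts from a PythonOperator also pushes them to XCom,
    # which is the closest thing to a "DAG result object" you can query later.
    return {"read": src_read_count, "written": tgt_write_count, "errors": error_count}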
I'm trying to test a bit of debouncing logic - these are local unittests I run for a Google App Engine webapp, using the 2.7 runtime environment. All my other tests are going well but this one has me stumped!
def testThat_emailDebouncingWorks(self):
    # Do something, it triggers an email.
    doSomething()
    self.assertEqual(emails_sent, 1)
    # Do something again, the new email is debounced.
    doSomething()
    self.assertEqual(emails_sent, 1)
    # After an hour, the emails should start working again...
    mockWaitingAnHour()
    doSomething()
    self.assertEqual(emails_sent, 2)
    # ... and so should the debouncing.
    doSomething()
    self.assertEqual(emails_sent, 2)
The file under test logs the time an email was sent using datetime.now(), then reruns datetime.now() on all future attempts and returns early if under an hour has elapsed.
There are two things going wrong:
I think the unittest library only added mock support in 3.X, and I'm not keen on updating my whole app.
Even if I was using 3.X, all the examples I see are about faking a datetime response for your entire test case (using a mock decorator above the test def). Whereas I want to change that behaviour midway through my test, not for the entire case.
Any tips? Thanks in advance!
Okay, I got to the bottom of it and wanted to document the answers for anyone who finds this on Google ;)
1. Enable mocking on AppEngine for Python 2.7
You need to follow the instructions for copying a third party library (in our case, "mock") from the official docs. It's worth noting that on Ubuntu, the suggested command:
pip install -t lib/ mock
Will fail. You'll get an error like this:
DistutilsOptionError: can't combine user with prefix, exec_prefix/home, or install_(plat)base
This is to do with a weird conflict with Ubuntu which seems to have gone unfixed for years, and you'll see a lot of people suggesting a virtualenv workaround. I added the --system flag instead:
pip install --system -t lib/ mock
and it worked fine. Remember to follow the rest of the instructions with appengine_config, and you should be set. "import mock" is a good way to check.
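For completeness, the appengine_config.py step from those docs boils down to something like this (a sketch of the standard vendoring setup, assuming the lib/ directory used in the pip command above):

# appengine_config.py (lives next to app.yaml)
from google.appengine.ext import vendor

# Add any libraries installed in the "lib" folder (e.g. mock) to sys.path.
vendor.add('lib')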
2. Mocking the datetime.now() call
My module under test uses:
from datetime import datetime
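For context, a simplified, hypothetical version of the logic under test looks something like the sketch below (the names are made up). The important detail is that datetime is a name bound inside my_module, which is why the patch in the test below targets my_module.datetime:

# my_module.py - simplified, hypothetical sketch of the debouncing logic
from datetime import datetime

_last_email_time = None

def doSomething():
    """Send at most one email per hour; calls within the hour return early."""
    global _last_email_time
    now = datetime.now()
    if _last_email_time is not None and (now - _last_email_time).total_seconds() < 3600:
        return
    _last_email_time = now
    send_email()  # placeholder for whatever actually sends the email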
In my test module, import some stuff:
from mock import patch, Mock
import my_module #Also known as my_module.py
import datetime
Then the actual test case:
@patch.object(my_module, 'datetime', Mock(wraps=datetime.datetime))
def testThat_myModule_debouncesEmails(self):
    fake_time = datetime.datetime.now()
    # This is the first time the thing happened. It should send an email.
    doSomething()
    self.assertEqual(1, emails_sent)
    # Five minutes later, the thing happens again. Should be debounced.
    fake_time += datetime.timedelta(minutes=5)
    my_module.datetime.now.return_value = fake_time
    doSomething()
    self.assertEqual(1, emails_sent)
    # Another 56 minutes pass, the thing happens again. An hour has elapsed, so don't debounce.
    fake_time += datetime.timedelta(minutes=56)
    my_module.datetime.now.return_value = fake_time
    doSomething()
    self.assertEqual(2, emails_sent)
    # Give it another 15 minutes to check the debouncing kicks back in.
    fake_time += datetime.timedelta(minutes=15)
    my_module.datetime.now.return_value = fake_time
    doSomething()
    self.assertEqual(2, emails_sent)
Hope this helps someone!
I have a DAG without a schedule (it is run manually as needed). It has many tasks. Sometimes I want to 'skip' some initial tasks by changing their state to SUCCESS manually. Changing the task state of a manually executed DAG fails, seemingly because of a bug in parsing the execution_date.
Is there another way to individually set task states for a manually executed DAG?
An example run is below. The execution date of the task is 01-13T17:27:13.130427, and I believe the fractional seconds are not being parsed correctly.
Traceback
Traceback (most recent call last):
File "/opt/conda/envs/jumpman_prod/lib/python3.6/site-packages/airflow/www/views.py", line 2372, in set_task_instance_state
execution_date = datetime.strptime(execution_date, '%Y-%m-%d %H:%M:%S')
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 365, in _strptime
data_string[found.end():])
ValueError: unconverted data remains: ..130427
It doesn't work from the Task Instances page, but you can do it from another page:
- open DAG graph view
- select needed Run (screen 1) and click go
- select needed task
- in a popup window click Mark success (screen 2)
- then confirm.
P.S. This relates to Airflow version 1.9.
[Screen 1: selecting the DAG run in the graph view]
[Screen 2: the Mark Success confirmation popup]
What you may want to do to accomplish this is use branching, which, as the name suggests, allows you to follow different execution paths according to some conditions, just like an if in any programming language.
You can use the BranchPythonOperator (documented here) to attain this goal: the idea is that this operator is configured by a python_callable, a function that outputs the task_id to execute next (which should, of course, be a task which is directly downstream from the BranchPythonOperator itself).
Using branching will set the skipped tasks to the proper state automatically, as mentioned in the documentation:
All other “branches” or directly downstream tasks are marked with a state of skipped so that these paths can’t move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run’s state to be inferred.
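As a rough sketch only (the task ids, the dag object, and the dag_run.conf flag are assumptions of mine, and the import path is the Airflow 1.x one), driving the skip decision from a flag passed when triggering the DAG manually could look like this:

from airflow.operators.python_operator import BranchPythonOperator


def choose_path(**context):
    # Hypothetical condition: a conf flag passed when triggering the DAG manually,
    # e.g. {"skip_initial": true}, decides whether the initial tasks are skipped.
    conf = context["dag_run"].conf or {}
    if conf.get("skip_initial"):
        return "later_task"    # jump straight to the later task
    return "initial_task"      # otherwise run the full pipeline


branch = BranchPythonOperator(
    task_id="branch_on_skip_flag",
    python_callable=choose_path,
    provide_context=True,      # needed on Airflow 1.x so the callable receives **context
    dag=dag)

# Both targets must be direct downstream tasks of the branch operator.
branch >> [initial_task, later_task]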
The resulting DAG would look something like the branching example diagram in the official Airflow documentation (source: apache.org).
Branching is documented here, on the official Apache Airflow documentation.
I just found out something very odd with Firebase and I would like to know whether I'm doing something wrong or whether there is a solution to this problem.
Basically, this is what it has always written when I was developing the app (and it's precisely what I was expecting):
nscoachtools@gmail¸com
maxMatches: 60
maxPlayers: 500
maxTeams: 30
userId: "SnMuRZEVqyN***...***hv2"
userMail: "nscoachtools@gmail.com"
userName: "Nicola Salvaro"
userPicture: "https://lh4.googleusercontent.com/-L7lSPz0VJ9A/..."
userToken: -1
and this is what it writes after I built the app in release mode:
nsalvaro77@gmail¸com
a: "Nicola Salvaro"
b: "ESjqwuh***...***wg1"
c: "nsalvaro77#gmail.com"
d: "https://lh4.googleusercontent.com/-2kwSEmLEN1c/..."
e: -2
f: 30
g: 500
h: 60
userToken: 1499775285255
Every "title" has been replaced with a letter. And "e: " was supposed to be "userToken: " then, when I tried to update it, it wrote it with the proper string but not on top of the original value... just wrote a new one. Then, when I try to read the full user, it gets the value of "e: ", not the "userToken: " one.
Did I do something wrong?
In release mode your Android app is being minified by ProGuard. This process strips unused methods and makes other method names shorter.
As a consequence, your POJO classes (the classes you read from/write to Firebase) get new method names, and Firebase reflectively uses those method names to determine the property names in the JSON.
The solution is to tell ProGuard not to modify the method names of your POJOs.
More on that:
The oldest Q&A on how to do this is: What ProGuard configuration do I need for Firebase on Android?. But that one is from Firebase 2.x, while many of these are auto-included in Firebase 9 and up.
You can also potentially mark the classes with #Keep, see Firebase No properties to serialize found on class.
More interesting Q&A on this topic: https://stackoverflow.com/search?q=%5Bfirebase-database%5D+proguard+release