App Store Review Scraping HTTPSConnection Error - web-scraping

When I run this script, I get the error below. I tried adding a sleep time between calls because I want to scrape a lot of reviews, but it doesn't solve my problem.
Here is what I have tried, and it didn't work:
import random

import pandas as pd
from app_store_scraper import AppStore

app_ = AppStore(country="us", app_name="Peet's Coffee: Earn Rewards", app_id=1059865964)
app_.review(
    how_many=10000,
    sleep=random.randint(25, 30),  # note: randint is evaluated once, so this is a single fixed delay between batches
)

pets = pd.DataFrame(app_.reviews)
pets.to_excel("Peet's Coffee.xlsx")
Please advise how to deal with this error, or suggest another way to scrape App Store reviews.
Error:
2022-03-11 11:33:44,617 [INFO] Base - Initialised: AppStore('us', 'peet-s-coffee-earn-rewards', 1059865964)
2022-03-11 11:33:44,618 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/us/app/peet-s-coffee-earn-rewards/id1059865964
2022-03-11 11:34:12,743 [INFO] Base - [id:1059865964] Fetched 20 reviews (20 fetched in total)
2022-03-11 11:35:09,137 [INFO] Base - [id:1059865964] Fetched 60 reviews (60 fetched in total)
2022-03-11 11:36:12,063 [INFO] Base - [id:1059865964] Fetched 100 reviews (100 fetched in total)
2022-03-11 11:37:08,309 [INFO] Base - [id:1059865964] Fetched 140 reviews (140 fetched in total)
2022-03-11 11:38:04,731 [INFO] Base - [id:1059865964] Fetched 180 reviews (180 fetched in total)
2022-03-11 11:39:00,995 [INFO] Base - [id:1059865964] Fetched 220 reviews (220 fetched in total)
2022-03-11 11:39:57,254 [INFO] Base - [id:1059865964] Fetched 260 reviews (260 fetched in total)
2022-03-11 11:40:53,864 [INFO] Base - [id:1059865964] Fetched 300 reviews (300 fetched in total)
2022-03-11 11:41:13,868 [ERROR] Base - Something went wrong: HTTPSConnectionPool(host='amp-api.apps.apple.com', port=443): Max retries exceeded with url: /v1/catalog/us/apps/1059865964/reviews?l=en-GB&offset=300&limit=20&platform=web&additionalPlatforms=appletv%2Cipad%2Ciphone%2Cmac (Caused by ResponseError('too many 429 error responses'))
2022-03-11 11:41:13,872 [INFO] Base - [id:1059865964] Fetched 300 reviews (300 fetched in total)
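A workaround I am considering, shown below as a rough, unverified sketch: fetch in smaller batches, save progress after every batch, and back off for a longer period whenever a batch makes no progress (for example because of the 429 responses above). The batch size and back-off values are arbitrary, and the sketch assumes that how_many is a running total and that repeated review() calls on the same AppStore object continue from the last offset instead of starting over; that behaviour should be verified against the library.
import random
import time

import pandas as pd
from app_store_scraper import AppStore

app_ = AppStore(country="us", app_name="Peet's Coffee: Earn Rewards", app_id=1059865964)

TARGET = 10000
BATCH = 500      # reviews to request per round (arbitrary)
backoff = 60     # seconds to wait when a round makes no progress; doubled each time
stalled = 0      # consecutive rounds without any new reviews

while len(app_.reviews) < TARGET:
    before = len(app_.reviews)
    try:
        # Assumption: how_many is a running total, so asking for "before + BATCH"
        # fetches roughly one more batch on top of what is already collected.
        app_.review(how_many=before + BATCH, sleep=random.randint(25, 30))
    except Exception as exc:  # the library may also just log the 429 and return
        print(f"review() raised: {exc}")

    # Save progress after every round so a hard failure doesn't lose everything.
    pd.DataFrame(app_.reviews).to_excel("Peet's Coffee.xlsx")

    if len(app_.reviews) == before:
        stalled += 1
        if stalled > 8:  # give up: either blocked for good or no reviews left
            break
        print(f"No progress; sleeping {backoff}s before retrying")
        time.sleep(backoff)
        backoff = min(backoff * 2, 900)  # exponential back-off, capped at 15 minutes
    else:
        stalled, backoff = 0, 60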

Related

Google Cloud Vision API Oct 2021 Upgrade Error - Access Previous Version

The October 2021/January 2022 Vision API introduces significant changes from the previous version that was running up to September 2021.
The October 2021/January 2022 version's return JSON has less information compared to the previous September 2021 version. These JSON file statistics (file size in KB, number of JSON lines) show the differences. The same PNG image file was run:
September 2021 (more data):
File 1: 1,666 KB / 114,856 lines
File 2: 675 KB / 45,687 lines
October 2021/January 2022 (less data):
File 1: 1,435 KB / 102,275 lines
File 2: 584 KB / 41,778 lines
Using the Features parameter with and without "model": "builtin/legacy" produces the same result.
Release notes link: https://cloud.google.com/vision/docs/release-notes
Google Vision is a core function of my application. The upgrade has "broken" my application.
# Python application calls Vision API
Python version: 3.8.5
google-api-core==1.22.4
google-api-python-client==1.12.3
google-auth==1.28.1
google-auth-httplib2==0.0.4
google-cloud-core==1.6.0
google-cloud-documentai==0.4.0
google-cloud-storage==1.37.1
google-cloud-vision==2.0.0
google-crc32c==1.1.2
google-resumable-media==1.2.0
googleapis-common-protos==1.52.0
How can I access the previous version?
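For reference, this is roughly how I pass the legacy model hint per feature with the pinned google-cloud-vision==2.0.0 client. This is a minimal sketch: the file name is a placeholder, the DOCUMENT_TEXT_DETECTION feature type is just an example, and, as noted above, adding "model": "builtin/legacy" did not change the output.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("page.png", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

# Request the older text-detection model explicitly via Feature.model.
feature = vision.Feature(
    type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION,
    model="builtin/legacy",
)
request = vision.AnnotateImageRequest(image=image, features=[feature])
response = client.batch_annotate_images(requests=[request])
print(response.responses[0])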

"unknown blob" errors while pulling new images from JFrog

For the past two days, we have been getting "unknown blob" errors when pulling from JFrog. I am attaching a sample log:
Command ['ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'LogLevel=ERROR', 'localhost', 'docker', 'pull', '<redacted>.jfrog.io/<redacted>:latest'] failed with exit code 1 and output 'latest: Pulling from <redacted>
f5d23c7fed46: Pulling fs layer
3f4aa1d1dde5: Pulling fs layer
52c4bf0b6229: Pulling fs layer
fe61f8f5a308: Pulling fs layer
ebeed9e8b27e: Pulling fs layer
89831686aa31: Pulling fs layer
2e2c5baec652: Pulling fs layer
b6fa760c79e4: Pulling fs layer
2e2c5baec652: Waiting
ebeed9e8b27e: Waiting
b6fa760c79e4: Waiting
fe61f8f5a308: Waiting
3f4aa1d1dde5: Verifying Checksum
3f4aa1d1dde5: Download complete
f5d23c7fed46: Verifying Checksum
f5d23c7fed46: Download complete
fe61f8f5a308: Download complete
ebeed9e8b27e: Download complete
89831686aa31: Download complete
f5d23c7fed46: Pull complete
3f4aa1d1dde5: Pull complete
2e2c5baec652: Verifying Checksum
2e2c5baec652: Download complete
b6fa760c79e4: Downloading
unknown blob
This seems to have started during the Kinesis outage. We first noticed it while we were trying to deploy a workaround during the outage; however, the problem still persists.
The image pulls fine from Docker Hub, so it's not corrupted. This is currently breaking our automated deploy/provisioning process, as we have to manually pull failed images from Docker Hub.
Thanks,
-Caius
With @John's suggestion, I zapped the cache on the JFrog side, and that resolved the issue.
It seems it was a stale/invalid cache issue.
Also, while looking at the JFrog logs, I did find this, which might be relevant:
2020-11-28T18:55:24.493Z [jfrt ] [ERROR] [b66d3ae308977fb1] [o.a.r.RemoteRepoBase:858 ] [ttp-nio-8081-exec-17] - IO error while trying to download resource '<redacted>: org.artifactory.request.RemoteRequestException: Error fetching <redacted>/blobs/sha256:9c11dabbdc3a450cd1d9e15b016d455250606d78eecb33c92eebfa657549787f (remote response: 429: Too Many Requests)
TL;DR: zapping the cache fixed the problem.
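Since the log line above shows the upstream registry answering with 429 Too Many Requests, it may also be worth checking whether the remote is simply being rate-limited by Docker Hub (the new pull limits went live around that time). Below is a rough sketch of that check in Python; the token endpoint, the ratelimitpreview/test image, and the header names are taken from Docker's published rate-limit documentation as I recall it, so verify them before relying on this.
import requests

# Fetch an anonymous pull token for Docker's dedicated rate-limit test image,
# then HEAD its manifest; Docker Hub reports the current limits in the response headers.
token = requests.get(
    "https://auth.docker.io/token",
    params={
        "service": "registry.docker.io",
        "scope": "repository:ratelimitpreview/test:pull",
    },
).json()["token"]

resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
)
print("limit:    ", resp.headers.get("ratelimit-limit"))
print("remaining:", resp.headers.get("ratelimit-remaining"))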

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the Big Data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.
However, recently I have been having a lot of trouble bringing one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200 GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not help. What happens is that the jar process on my local machine ramps up to take over all the local RAM and CPU resources, the whole process freezes, and on the cluster executors keep getting dropped and re-added. Weirdly, this happens even when I select only 1 row for caching (which should make this dataset much smaller than other datasets which I have had no problem caching into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then after 20min or so the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connection config to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and I have seen how to increase memoryOverhead by changing settings on a YARN cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
UPDATE: my full current yml config file is as follows:
default:
  # local settings
  sparklyr.sanitize.column.names: TRUE
  sparklyr.cores.local: 3
  sparklyr.shell.driver-memory: "8G"
  # remote core/memory settings
  spark.executor.memory: "32G"
  spark.executor.cores: 5
  spark.executor.heartbeatInterval: '180s'
  spark.ext.h2o.nthreads: 10
  spark.cores.max: 30
  spark.memory.storageFraction: 0.6
  spark.memory.fraction: 0.3
  spark.network.timeout: 300
  spark.driver.extraJavaOptions: '-XX:+UseG1GC'
  # other configs for spark
  spark.serializer: org.apache.spark.serializer.KryoSerializer
  spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
  # cassandra settings
  spark.cassandra.connection.host: <cassandra_ip>
  spark.cassandra.auth.username: <cassandra_login>
  spark.cassandra.auth.password: <cassandra_pass>
  spark.cassandra.connection.keep_alive_ms: 60000
  # spark packages to load
  sparklyr.defaultPackages:
  - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
  - "com.databricks:spark-csv_2.11:1.3.0"
  - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
  - "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my questions are:
Does anyone have any ideas about what to do in this instance?
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the Cassandra data in batches with RStudio/sparklyr as the driver?
Or, alternatively again, is there a way to munge/filter/edit the data as it is brought into the cache so that the resulting table is smaller (similar to using an SQL query, but with more complex dplyr syntax)?
OK, I've finally managed to make this work!
I'd initially tried the suggestion of @user6910411 to decrease the Cassandra input split size, but this failed in the same way. After playing around with LOTS of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
By INCREASING the split size, there were fewer spark tasks, and thus less overhead and fewer calls to the GC. It worked!

Alfresco: index creation stuck when enabling "Content indexing"

I am trying to index an Alfresco 4.0.d 5.0.d Community repository (Alfresco Solr):
About 500,000 documents
Repository size: about 80 GB
Metadata indexing only: no problems; the index is ready in about an hour.
With content indexing enabled as well: the Solr index seems to get stuck. After about 4 hours the Solr web interface shows that no more transactions are left, but the index still isn't marked as ready, and Solr keeps trying to create/update the index when the indexer is left running. I stopped indexing after about 12 hours with no progress shown in the Solr web interface; the index size kept growing the whole time.
The "Troubleshooting Solr Index" tips from Alfresco Docs didn't make any difference.
I have enabled debugging in Solr and I am not getting any obvious errors there (no memory errors, no obvious errors at all). The only thing I see in the log files is that Solr seems to try to index the same Alfresco transaction IDs over and over (see the log excerpt; these lines keep popping up).
Any idea how I can track down the cause of this?
Is it possible to find the documents in the repository belonging to the transaction IDs?
Can specific transactions be excluded from indexing at all?
Thanks, Max
Log excerpt
2016-03-10 00:52:15,145 INFO [org.alfresco.solr.tracker.AclTracker] Scanning Acl change sets ...
2016-03-10 00:52:15,145 INFO [org.alfresco.solr.tracker.AclTracker] .... none found after lastTxCommitTime 1457481600850
2016-03-10 00:52:15,145 INFO [org.alfresco.solr.tracker.AclTracker] total number of acls updated: 0
2016-03-10 00:52:15,145 INFO [org.alfresco.solr.tracker.AbstractTracker] ... Running ContentTracker for core [archive].
2016-03-10 00:52:15,146 INFO [org.alfresco.solr.SolrInformationServer] .... registered Searchers for archive = 1
2016-03-10 00:52:15,146 INFO [org.alfresco.solr.Cloud] Running query FTSSTATUS:Dirty OR FTSSTATUS:New
2016-03-10 00:52:15,146 INFO [org.alfresco.solr.tracker.ContentTracker] total number of docs with content updated: 0
2016-03-10 00:52:15,146 INFO [org.alfresco.solr.tracker.AbstractTracker] ... Running MetadataTracker for core [archive].
2016-03-10 00:52:15,147 INFO [org.alfresco.solr.SolrInformationServer] .... registered Searchers for archive = 1
2016-03-10 00:52:15,155 INFO [org.alfresco.solr.Cloud] Running query TXID:1 AND TXCOMMITTIME:1399544992347
2016-03-10 00:52:15,155 INFO [org.alfresco.solr.tracker.MetadataTracker] Verified first transaction and timestamp in index
2016-03-10 00:52:15,156 INFO [org.alfresco.solr.tracker.MetadataTracker] Verified last transaction timestamp in index less than or equal to that of repository.
2016-03-10 00:52:15,161 INFO [org.alfresco.solr.tracker.MetadataTracker] Scanning transactions ...
2016-03-10 00:52:15,161 INFO [org.alfresco.solr.tracker.MetadataTracker] .... from Transaction [id=947618, commitTimeMs=1457521663509, updates=2, deletes=2]
2016-03-10 00:52:15,161 INFO [org.alfresco.solr.tracker.MetadataTracker] .... to Transaction [id=947654, commitTimeMs=1457524857746, updates=1, deletes=0]
2016-03-10 00:52:15,164 INFO [org.alfresco.solr.tracker.MetadataTracker] Scanning transactions ...
2016-03-10 00:52:15,164 INFO [org.alfresco.solr.tracker.MetadataTracker] .... from Transaction [id=947654, commitTimeMs=1457524857746, updates=1, deletes=0]
2016-03-10 00:52:15,165 INFO [org.alfresco.solr.tracker.MetadataTracker] .... to Transaction [id=947655, commitTimeMs=1457524858267, updates=2, deletes=1]
2016-03-10 00:52:15,180 INFO [org.alfresco.solr.tracker.MetadataTracker] Scanning transactions ...
2016-03-10 00:52:15,180 INFO [org.alfresco.solr.tracker.MetadataTracker] .... none found after lastTxCommitTime 1457524858267
2016-03-10 00:52:15,180 INFO [org.alfresco.solr.tracker.MetadataTracker] total number of docs with metadata updated: 0
2016-03-10 00:52:17,513 DEBUG [org.alfresco.solr.content.SolrContentUrlBuilder] Appending SOLR metadata: tenant - _DEFAULT_
2016-03-10 00:52:17,513 DEBUG [org.alfresco.solr.content.SolrContentUrlBuilder] Appending SOLR metadata: tenant - _DEFAULT_
2016-03-10 00:52:17,513 DEBUG [org.alfresco.solr.content.SolrContentUrlBuilder] Appending SOLR metadata: tenant - _DEFAULT_
2016-03-10 00:52:17,513 DEBUG [org.alfresco.solr.content.SolrContentUrlBuilder] Appending SOLR metadata: dbId - 124123
2016-03-10 00:52:17,513 DEBUG [org.alfresco.solr.content.SolrContentUrlBuilder] Converted SOLR metadata to URL: solr://
Edit: Adding Screenshots:
Solr Webadmin
Solr Health Report for Workspace Spaces Store
How did you check whether Solr is marked as ready?
Are you aware that there is a separate index for the trash (archive) and the "real" repository (workspace)? The log is showing output for the archive tracker.
Additionally, it may help to downsize the tracker config to allow only one thread per tracker, and/or to disable the trash indexing.
Index Reports
Have you checked the index reports? See https://wiki.alfresco.com/wiki/Alfresco_And_SOLR#Direct_URLs. You may need to import the repository certificates into your browser to be able to access the Solr user interface and the Alfresco Solr reports.
Could you please create and attach an alfresco-solr general report
http://<alfrescoserver>/solr/admin/cores?action=REPORT&wt=xml
and a summary report
http://<alfrescoserver>/solr/admin/cores?action=SUMMARY&wt=xml
?
Transactions and nodes
You can check the transactions in the database. The log is telling you all the required info. In your snippet I can't find log entries reindexing the same node as you described, but e.g. "Transaction id=947655" means the row in alf_transaction with id=947655. To find all nodes from a specific transaction_id you can just run:
select * from alf_node where transaction_id=947655
It is not possible to skip specific transactions, but you can attach the cm:indexControl aspect to nodes you don't want to index. Please check http://docs.alfresco.com/4.0/concepts/admin-indexes.html

filterStream of the streamR package is not returning data from Twitter, though it saves a file

This is my streamR query; running it produces the message shown below:
"Capturing tweets...
Connection to Twitter stream was closed after 600 seconds with up to 74 tweets downloaded."
It saves a file locally named tweets_keyword.json, but there is no data in it.
library(streamR)
library(ROAuth)
filterStream(file.name = "tweets_keyword.json", track = c("#liveittobelieveit"), timeout = 600, tweets = 1000, oauth = my_oauth)
Sorry if I have not followed any rules of this site, I'm new here.
