RuntimeError: Split pattern data/* does not match any files - airflow

I'm currently trying to convert the TFX-TFRS tutorial into an Airflow pipeline.
When I run the pipeline with LocalDagRunner() it executes without error, but when I use the AirflowDagRunner() (the pipeline is triggered through the Airflow web UI) it throws the following error:
RuntimeError: Split pattern data/* does not match any files.
Inside the data folder resides a CSV dataset.
I use the standard CsvExampleGen component in both cases.
This is the path to the dataset, which gets ingested by the pipeline:
import os

PIPELINE_NAME = 'TFRS-ranking'
# Directory where the MovieLens 100K rating data lives
DATA_ROOT = os.path.join('data', PIPELINE_NAME)

I fixed the issue by providing the Airflow pipeline with the absolute path to the data instead of the relative path. The Airflow scheduler executes the pipeline from its own working directory rather than from the directory containing the pipeline file, so the relative path no longer resolves.
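A minimal sketch of the fix, assuming the data folder sits next to the pipeline file (adapt the layout to your project):

import os

PIPELINE_NAME = 'TFRS-ranking'
# Anchor the data directory at this file's location and resolve it to an
# absolute path, so it is found regardless of the scheduler's working
# directory.
DATA_ROOT = os.path.abspath(
    os.path.join(os.path.dirname(__file__), 'data', PIPELINE_NAME))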

Related

FileNotFoundError: driver_scaling_report.html

I am having trouble debugging my openmdao model with the ScipyOptimizeDriver.
The model inputs are a vector of design variables and the scalar output is calculated in a separate flow solver. This involves setting up directories to save the results from the flow solver, which are separate from where the OpenMDAO Python file is located; I think this could be part of the problem. The error is shown below:
FileNotFoundError: [Errno 2] No such file or directory: 'reports/problem1/driver_scaling_report.html'
The file is evidently missing from the current directory, but I am not sure why; when I run the actuator disk example problem, this file is automatically generated in the correct place.
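No answer was recorded for this one, but a likely culprit is that OpenMDAO resolves the reports/ output directory relative to the process's current working directory: if the flow-solver wrapper calls os.chdir() into its own results directories, reports/problem1/ no longer exists where the driver tries to write it. A minimal sketch of a workaround under that assumption (run_flow_solver is a hypothetical stand-in for the external solver call); alternatively, I believe report generation can be switched off entirely via the OPENMDAO_REPORTS environment variable:

import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    # Temporarily chdir into `path`, then always restore the original
    # cwd, so relative paths like 'reports/problem1/' keep resolving
    # for the rest of the process.
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)

# Inside the component's compute(), run the external solver in its own
# results directory without disturbing the rest of the run:
# with working_directory('results/case_001'):
#     run_flow_solver()  # hypothetical call to the external flow solver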

Error reading data into Spark using sparklyr::spark_read_csv

I'm running Spark in 'standalone' mode on a local machine in Docker containers. I have a master and two workers, each is running in its own Docker container. In each of the containers the path /opt/spark-data is mapped to the same local directory on the host.
I'm connecting to the Spark master from R using sparklyr, and I can do a few things, for example, loading data into Spark using sparklyr::copy_to.
However, I cannot get sparklyr::spark_read_csv to work. The data I'm trying to load is in the local directory that is mapped in the containers. When attaching to the running containers I can see that the file I'm trying to load does exist in each of the 3 containers, in the local (to the container) path /opt/spark-data.
This is an example for the code I'm using:
xx_csv <- spark_read_csv(
  sc,
  name = "xx1_csv",
  path = "file:///opt/spark-data/data-csv"
)
data-csv is a directory containing a single CSV file. I've also tried specifying the full path, including the file name.
When I'm calling the above code, I'm getting an exception:
Error: org.apache.spark.sql.AnalysisException: Path does not exist: file:/opt/spark-data/data-csv;
I've also tried with different numbers of / in the path argument, but to no avail.
The documentation for spark_read_csv says that
path: The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3a://" and "file://" protocols.
My naive expectation is that if, when attaching to the container, I can see the file in the container file system, then it is "accessible from the cluster", so I don't understand why I'm getting the error. All the directories and files in the path are owned by root and are readable by all.
What am I missing?
Try without "file://", and with \\ if you are a Windows user.
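For instance, a sketch of the first suggestion (same call, scheme-less path; assumes sc is an open spark_connect() connection and that the mount point exists on the master and on every worker):

library(sparklyr)

# Plain filesystem path with no "file://" scheme; Spark resolves it on
# each node, so /opt/spark-data must be mounted in every container.
xx_csv <- spark_read_csv(
  sc,
  name = "xx1_csv",
  path = "/opt/spark-data/data-csv"
)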

Default Authorization Required response (401) - taskscheduleR

I'm trying to run a daily taskscheduleR script that pulls data into R from an API. It works when I run it as a one-time task, but for some reason it won't work as a daily task. I keep getting the following error in the log file:
<HEAD><TITLE>Authorization Required</TITLE></HEAD>
<BODY BGCOLOR=white FGCOLOR=black>
<H1>Authorization Required</H1><HR>
<FONT FACE=Helvetica,Arial>
<B>Description: Authorization is required for access to this proxy</B>
</FONT>
<HR>
<!-- default Authorization Required response (401) -->
Here's the code:
library(httr)
library(jsonlite)
library(tidyverse)
library(taskscheduleR)
# URL to feed into the GET function
url <- "http://files.airnowtech.org/airnow/yesterday/daily_data_v2.dat"
# Send a request to the AirNow API to get access to the data
my_raw_result <- httr::GET(url)
# Retrieve the contents of the request
my_content <- httr::content(my_raw_result, as = "text")
# Parse the contents into a data frame
my_content_from_delim <- my_content %>% textConnection() %>% readLines() %>% read.delim(text = ., sep = "|", header = FALSE)
head(my_content_from_delim)
I have been using the RStudio add-in to create the task.
If you are trying to access this on a work computer, you may need to allow downloads from the URL. Open a browser, paste the URL, click 'allow downloads', then run the script.
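Since the log shows a proxy demanding authorization, it may also be worth passing your corporate proxy credentials explicitly, so the non-interactive scheduled session can authenticate (a sketch; host, port, and credentials are placeholders):

library(httr)

url <- "http://files.airnowtech.org/airnow/yesterday/daily_data_v2.dat"

# Placeholders: substitute your actual proxy host, port, and credentials.
my_raw_result <- httr::GET(
  url,
  use_proxy("proxy.example.com", port = 8080,
            username = "me", password = "secret")
)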
I am not sure whether the solution I offer will work for you, but it won't hurt to try. If the problem is related to the task scheduler, the following solution might work. However, if the problem is an authorization issue, you may need to get IT help from your workplace.
For the task scheduler issue, you can directly send your script to the windows task scheduler with a batch file and create a schedule for it.
To make it easy, you can use the following code. First, create a new folder and copy your R script there. For the following code to work, the script should be named My Script.r.
Then, in the same folder, create a batch file with the following code. To create a batch file, copy the code into Notepad and save it as Run R Script.bat in the same folder.
cd %~dp0
"C:\PROGRA~1\R\R-40~1.0\bin\R.exe" -e "setwd(%~dp0)" CMD BATCH --vanilla --slave "%~dp0My Script.r" Log.txt
Here, cd %~dp0 will set the directory for the windows batch to the folder you run this batch. "C:\PROGRA~1\R\R-40~1.0\bin\R.exe" will specify your R.exe. You may need to change the path based on your system files.
-e "setwd(%~dp0)" will set the directory of R to the same folder in which the batch and script will be run.
"%~dp0My Script.r" Log.txt will define R script pathname and the log file for the batch.
Second, to create a daily schedule, we are going to create another batch file. To do so, copy the following code into Notepad and save it as Daily Schedule.bat.
When you run Daily Schedule.bat, it will create a daily task that runs for the first time one minute later and then repeats every day at that same time.
@echo off
for /F "tokens=1*" %%A in ('
powershell -NoP -C "(Get-Date).AddMinutes(1).ToString('MM/dd/yyyy HH:mm:ss')"
') do (
set "MyDate=%%A"
set "MyTime=%%B"
)
:: Change to the folder this batch file lives in
cd %~dp0
:: Create the scheduled task
SchTasks /Create /SC DAILY /TN "MY R TASK" /TR "%~dp0Run R Script.bat" /sd %MyDate% /st %MyTime%
This code will create a task called "MY R TASK". To check that it is scheduled, run taskschd.msc from the Windows prompt. This opens the Task Scheduler, where you can find your task; if you want to modify or delete it, you can do that there too, as it has a nice GUI and is easy to navigate.
For more details about the SchTasks syntax, see the Microsoft documentation.
If you have any questions, let me know.
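As an aside, the same daily schedule can also be created from R itself with taskscheduleR's own API instead of the RStudio add-in (a sketch; the task name, script path, and start time are placeholders):

library(taskscheduleR)

# Placeholders: point rscript at your actual script and pick a start time.
taskscheduler_create(
  taskname = "airnow_daily_pull",
  rscript = "C:/path/to/My Script.r",
  schedule = "DAILY",
  starttime = "09:00"
)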

What is the filepath that a "Read CSV" operator needs to read a file from RapidMiner Server?

I have an RM Server running on a VM (Ubuntu) on top of my Win10 machine.
I have a process that reads a .csv file and writes its contents to a MySQL database on a MySQL server, which also runs on the same VM.
The problem is that the Read CSV operator does not seem to be able to find the file.
Scenario 1.
When I use ../data/myFile.csv as the location name in the Read CSV operator and run the process on Server, I get:
Failed to execute initialization process: Error executing process /apps/myApp/process/task_read_csv_to_db: The file 'java.io.FileNotFoundException: /root/../data/myFile.csv (No such file or directory)' does not exist.
Scenario 2.
When I use /apps/myApp/data/myFile.csv as the location name in the Read CSV operator and run the process on Server, I get:
Failed to execute initialization process: Error executing process /apps/myApp/process/task_read_csv_to_db: The file 'java.io.FileNotFoundException: /apps/myApp/data/myFile.csv (No such file or directory)' does not exist.
What is the right filepath that I should give to the Read CSV operator?
Just to update with the answer: after David's suggestion, I ended up storing the .csv file outside of /rapidminer-server-home/data/repository, since every remote repository seems to be stored under an integer rather than its original name, which makes the actual full path of the file unusable.
I would say the issue is that the relative path may vary depending on the location of the JobAgent that is executing your process.
Is /apps/myApp/data/myFile.csv the correct path to the file? If not, I would suggest using the absolute path to the file. Hope this helps.
Best,
David

Publishing test results through command line test runner in VSTS

I'm trying to use vstest.console.exe with the TfsPublisher logger in VSTS (cloud).
There's a URL example shown in the article for TFS onsite, but I'm trying to work out what parameters to use for my VSTS build. The example is:
/logger:TfsPublisher;Collection=http://localhost:8080/tfs/DefaultCollection;TeamProject=MyProject;BuildName=DailyBuild_20121130.1
But I just get an error saying the build cannot be found in the project, e.g.
Error: Build "1234" cannot be found under team project "MyProject".
I believe the problem is the BuildName parameter. My project and build definition have no spaces in the names. I have tried various values, e.g.:
BuildName=%BUILD_BUILDID% (resolves to number, e.g. 1234)
BuildName=%BUILD_DEFINITIONNAME% (resolves to build definition name OK)
BuildName=%BUILD_BUILDURI% (resolves to url, e.g. vstfs:///Build/Build/1234)
The error message confirms that the environment variables seem to be resolving OK, but I can't determine what I should substitute for "DailyBuild_20121130.1" in my case.
Updated: My vstest.console.exe logger parameter currently looks like
/logger:TfsPublisher;Collection=%SYSTEM_TEAMFOUNDATIONCOLLECTIONURI%;TeamProject=%SYSTEM_TEAMPROJECT%;BuildName=%BUILD_BUILDNUMBER%
I effectively got the result I wanted using the Trx logger and one of the "Publish Test Results" build steps:
vstest.console.exe ... /logger:Trx
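A sketch of the equivalent steps in today's YAML pipeline syntax (the test assembly name is a placeholder):

steps:
- script: vstest.console.exe MyTests.dll /logger:Trx
- task: PublishTestResults@2
  inputs:
    testResultsFormat: 'VSTest'
    testResultsFiles: '**/*.trx'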
The build name is generated from the "Build number format" setting on the build definition's "General" tab. You can get it from the BUILD_BUILDNUMBER variable.
