Why am I not getting the NULL values when using FILTER to remove CSV headers in Pig?

I have this data below in a .csv file:
Needed_values,TEMP,Desc
,022.3,NewYork
3,022.30,India
,027.0,Australia
,027.00,Russia
1,027.1,Austria
,027.10,Norway
,036.2,Hungary
,036.20,Lithuania
2,785.52,Nigeria
I saw in one of the StackOverflow questions a way to remove the header using FILTER. But when I load this file in my Pig script and use FILTER to remove the header of my CSV, all the rows with NULL values under Needed_values get removed as well!
LOAD_DATA = LOAD 'DATA.csv' USING PigStorage(',') AS
(
    NEEDED_VALUES:chararray,
    TEMP:chararray,
    DESC:chararray
);
FILTER_HEADER = FILTER LOAD_DATA BY NEEDED_VALUES != 'Needed_values';
ACTUAL OUTPUT:
(3,022.30,India)
(1,027.1,Austria)
(2,785.52,Nigeria)
I'm expecting the output to include everything except the header row (Needed_values,TEMP,Desc):
,022.3,NewYork
3,022.30,India
,027.0,Australia
,027.00,Russia
1,027.1,Austria
,027.10,Norway
,036.2,Hungary
,036.20,Lithuania
2,785.52,Nigeria

The null values will not pass the filter condition: in Pig, comparing null to anything yields null rather than true, and FILTER only keeps rows for which the condition is true, so the rows with an empty NEEDED_VALUES are dropped. Change the filter to:
FILTER_HEADER = FILTER LOAD_DATA BY NEEDED_VALUES != 'Needed_values' OR NEEDED_VALUES IS NULL;
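Putting it together, a minimal sketch of the corrected script (DUMP is added here only to show the result):
LOAD_DATA = LOAD 'DATA.csv' USING PigStorage(',') AS
(
    NEEDED_VALUES:chararray,
    TEMP:chararray,
    DESC:chararray
);
-- Keep a row when NEEDED_VALUES is null OR differs from the header text
FILTER_HEADER = FILTER LOAD_DATA BY NEEDED_VALUES != 'Needed_values' OR NEEDED_VALUES IS NULL;
DUMP FILTER_HEADER;
This emits all nine data rows, e.g. (,022.3,NewYork), (3,022.30,India) and so on, with only the header line removed.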

Related

R JSON MongoDB query $in operator syntax error due to double quotes?

I'm building a json query to pass to a mongodb database in R.
In one scenario, I have a vector of dates and I want to query the database to return all records which have a date in the relevant field that matches a date in my vector of dates.
The second scenario is the same as the first, but this time I have a vector of character strings (IDs) and need to return all the records with matching IDs.
I understood the correct way to do this in a json query is to use the $in operator, and then put my vector in an array.
However, when I pass the query to my mongodb database, the exportLogId returns NULL. I'm quite sure that the problem is something to do with how I am representing the $in operator in the final query, since I have very similarly structured queries without the $in operator and they are all working. If I look for just one of my target dates or character strings, I get the desired result.
I followed the mongodb manual here to construct my query, and the only issue I can see is that the $in operator in the output of jsonlite::toJSON() is enclosed in double quotes; whereas I think it might need to be in single quotes (or no quotes at all, but I don't know how to write the syntax for that).
I'm creating my query in two steps:
1) Create the query as a series of nested lists
2) Convert the list object to JSON with jsonlite::toJSON()
Here is my code:
# Load libraries:
library(jsonlite)
# Create list of example dates to query in mongodb format:
sampledates <- c("2022-08-11T00:00:00.000Z",
                 "2022-08-15T00:00:00.000Z",
                 "2022-08-16T00:00:00.000Z",
                 "2022-08-17T00:00:00.000Z",
                 "2022-08-19T00:00:00.000Z")
# Create query as a list object:
query_list_l <- list(filter =
                       # Add where clause:
                       list(where =
                              # Filter results by list of sample dates:
                              list(dateSampleTaken = list('$in' = sampledates),
                                   # Define format of column names and values:
                                   useDbColumns = "true",
                                   dontTranslateValues = "true",
                                   jsonReplaceUndefinedWithNull = "true"),
                            # Define columns to return:
                            fields = c("id",
                                       "updatedAt",
                                       "person.visualId",
                                       "labName",
                                       "sampleIdentifier",
                                       "dateSampleTaken",
                                       "sequence.hasSequence")))
# Convert list object to JSON:
query_json <- jsonlite::toJSON(x = query_list_l,
                               pretty = TRUE,
                               auto_unbox = TRUE)
The JSON query now looks like this:
> query_json
{
  "filter": {
    "where": {
      "dateSampleTaken": {
        "$in": ["2022-08-11T00:00:00.000Z", "2022-08-15T00:00:00.000Z", "2022-08-16T00:00:00.000Z", "2022-08-17T00:00:00.000Z", "2022-08-19T00:00:00.000Z"]
      },
      "useDbColumns": "true",
      "dontTranslateValues": "true",
      "jsonReplaceUndefinedWithNull": "true"
    },
    "fields": ["id", "updatedAt", "person.visualId", "labName", "sampleIdentifier", "dateSampleTaken", "sequence.hasSequence"]
  }
}
As you can see, $in is now enclosed in double quotes, even though I put it in single quotes when I created the query as a list object. I have tried replacing them with sprintf(), but that just adds a lot of backslashes to my query. I also tried:
query_fixed <- gsub(pattern = "\\"\\$\\in\\"",
replacement = "\\'$in\\'",
x = query_json)
... but this fails with an error.
I would be very grateful to know:
1) Is the syntax problem that is preventing $in from working actually the double quotes?
2) If the double quotes are the problem, how do I replace them with single quotes without messing up the JSON format?
UPDATE:
The issue seems to occur when R is passing the query to the database, but I still can't work out exactly why.
If I try the query out in the Loopback explorer in the database, it works, and using the export log ID produced I can then fetch the results with httr::GET() in R. Example query results are shown below (sorry for all the escaping; the main point is that you can see the format of the returned values):
[1] "[{\"_id\":\"e59953b6-a106-4b69-9e25-1c54eef5264a\",\"updatedAt\":\"2022-09-12T20:08:39.554Z\",\"dateSampleTaken\":\"2022-08-16T00:00:00.000Z\",\"labName\":\"LNG_REFERENCE_DATA_CATEGORY_LAB_NAME_LAB_A\",\"sampleIdentifier\":\"LS0044-SCV2-PCR\",\"sequence\":{\"hasSequence\":false},\"person\":{\"visualId\":\"C-2022-0002\"}},{\"_id\":\"af5cd9cc-4813-4194-b60b-7d130bae47bc\",\"updatedAt\":\"2022-09-12T20:11:07.467Z\",\"dateSampleTaken\":\"2022-08-17T00:00:00.000Z\",\"labName\":\"LNG_REFERENCE_DATA_CATEGORY_LAB_NAME_LAB_A\",\"sampleIdentifier\":\"LS0061-SCV2-PCR\",\"sequence\":{\"hasSequence\":false},\"person\":{\"visualId\":\"C-2022-0003\"}},{\"_id\":\"b5930079-8d57-43a8-85c0-c95f7e0338d9\",\"updatedAt\":\"2022-09-12T20:13:54.378Z\",\"dateSampleTaken\":\"2022-08-16T00:00:00.000Z\",\"labName\":\"LNG_REFERENCE_DATA_CATEGORY_LAB_NAME_LAB_A\",\"sampleIdentifier\":\"LS0043-SCV2-PCR\",\"sequence\":{\"hasSequence\":false},\"person\":{\"visualId\":\"C-2022-0004\"}}]"

Is there a way to input multiple environment variables into a url for testing on Postman?

I am testing an endpoint in Postman using a URL like this: {{api_url}}/stackoverflow/help/{{customer_id}}/{{client_id}}.
I have the api_url, customer_id, and client_id stored in my environment variables. I would like to test multiple customer_id and client_id values without having to change the environment variables manually each time. I created a CSV to store a list of customer_id values and another for client_id. When I go to run the collection, it only allows me to add one file. Is there another way to do this if I want to iterate through my tests to automate them?
You can add both customer_id and client_id in one CSV file; Postman will iterate once per data row (the header line is not counted).
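For example, a single data file with both columns might look like this (the values are placeholders):
customer_id,client_id
101,clientA
102,clientB
Each row supplies one iteration's values, so {{customer_id}} and {{client_id}} in the URL resolve from the current row.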
Alternatively, you can use postman.setNextRequest to control the flow. The code below re-runs the request with successive values popped from two arrays:
url:
{{api_url}}/stackoverflow/help/{{customer_id}}/{{client_id}}
Now add this pre-request script:
// add values for the variable in an array
const tempArraycustomer_id = pm.variables.get("tempArraycustomer_id")
const tempArrayclient_id = pm.variables.get("tempArrayclient_id")
//modify the array to the values you want
const arrcustomer_id = tempArraycustomer_id ? tempArraycustomer_id : ["value1", "value2", "value3"]
const arrclient_id = tempArrayclient_id ? tempArrayclient_id : ["value1", "value2", "value3"]
// set the variables to the next value from each array and keep
// re-sending the request until all values have been used
pm.variables.set("customer_id", arrcustomer_id.pop())
pm.variables.set("client_id", arrclient_id.pop())
pm.variables.set("tempArraycustomer_id", arrcustomer_id)
pm.variables.set("tempArrayclient_id", arrclient_id)
// end the iteration when no more elements are left
if (arrcustomer_id.length !== 0) {
    postman.setNextRequest(pm.info.requestName)
}
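Note that pm.variables are local to the collection run, so the shrinking temp arrays persist across the iterations triggered by setNextRequest; once the array is empty, no next request is scheduled and the run ends. This assumes both arrays have the same length, since only arrcustomer_id is checked.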

ConvertFrom-StringData values stored in variable

I'm new to PowerShell and I'm putting together a script that will populate all variables from data stored in an Excel file, basically to create numerous VMs.
This works fine apart from where I have a variable with multiple name/value pairs, which PowerShell needs to be a hashtable.
As each VM will need multiple tags applied, I have a column in Excel called Tags.
The data in the field would look something like: "Top = Red `n Bottom = Blue".
I'm struggling to use ConvertFrom-StringData to create the hashtable for these tags, however.
If i run:
ConvertFrom-StringData $exceldata.Tags
I end up with something like:
Name Value
---- -----
Top  Red `n bottom = blue
I need help, please, with formatting the Excel field correctly so ConvertFrom-StringData properly creates the hashtable, or with a better way of achieving this.
Thanks.
Sorted it: I formatted the Excel field as Dept=it;env=prod;owner=Me and then ran the following commands. No ConvertFrom-StringData required.
$artifacts = Import-Excel -Path "C:\temp\Artifacts.xlsx" -WorkSheetname "VM"
foreach ($artifact in $artifacts) {
    $inputkeyvalues = $artifact.Tags
    # Create hashtable
    $tags = @{}
    # Split input string into pairs
    $inputkeyvalues.Split(';') | ForEach-Object {
        # Split each pair into key and value
        $key, $value = $_.Split('=')
        # Populate $tags
        $tags[$key] = $value
    }
}
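For what it's worth, ConvertFrom-StringData can still work with the same semicolon-separated cell format: it expects one key=value pair per line, and the `n in the original cell arrives from Excel as two literal characters rather than a newline, which is why everything ended up in one value. A minimal sketch, assuming a cell value like Dept=it;env=prod;owner=Me:
# Replace the semicolons with real newlines, then convert
$raw  = "Dept=it;env=prod;owner=Me"   # example cell value
$tags = ConvertFrom-StringData ($raw -replace ';', "`n")
$tags                                  # hashtable with Dept, env and owner keys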

Comparing files in R using for loop when there is missing files in a series

I've used the code below, which successfully compares two text files and logs the differences in a log file using a for loop. The file names are in a series, for example File_1, File_2, etc., but when there is a missing file in the series the code stops execution with the error: No such file or directory.
I then used an if condition to check for the files' existence, but I am getting the error mentioned below.
Please help me to skip the comparison for a nonexistent file.
Code:
for (i in 1:length) {
    prod_file_res_name <- sprintf("path/Query_Prod_%s.txt", i)
    beta_file_res_name <- sprintf("path/Query_Beta_%s.txt", i)
    if (exists('prod_file_res_name' && 'beta_file_res_name')) {
        res <- tools::Rdiff(prod_file_res_name, beta_file_res_name, Log = TRUE)
        if (res[2] != "character(0)") {
            write(toString(res[2]), file = "LogFile.txt", append = TRUE)
        }
        else {
            elsevar <- sprintf("No difference found between prod and beta responses for query %s", i)
            print(elsevar)
        }
    }
}
Error:
Error in "prod_file_res_name" && "beta_file_res_name" :
invalid 'x' type in 'x && y'
exists() checks whether an R object of the given name exists in your environment; it takes object names as strings, and applying && to two character strings is what caused the invalid 'x' type error. Note also that you create the objects prod_file_res_name and beta_file_res_name before checking for them, so exists() would always return TRUE anyway. What you are looking for is the file.exists() function, which checks whether the file itself exists on disk:
file.exists(prod_file_res_name) && file.exists(beta_file_res_name)
The 'No such file or directory' error occurred because the R objects existed but the files you wanted to compare did not.
Since exists() looks up single objects (from the documentation):
Is an Object Defined?
Look for an R object of the given name and possibly return it.
'prod_file_res_name' && 'beta_file_res_name' doesn't work. Rewrite
exists("prod_file_res_name" && "beta_file_res_name")
to:
exists("prod_file_res_name") && exists("beta_file_res_name")

SQLite: isolating the file extension from a path

I need to isolate the file extension from a path in SQLite. I've read the post here (SQLite: How to select part of string?), which gets 99% there.
However, the solution:
select distinct replace(column_name, rtrim(column_name, replace(column_name, '.', '' ) ), '') from table_name;
fails if a file has no extension (i.e. no '.' in the filename), for which it should return an empty string. Is there any way to trap this please?
Note that the filename in this context is the bit after the final '\'; the query shouldn't be searching for '.'s in the full path, as it currently does.
I think it should be possible to do it using further nested rtrims and replaces.
Thanks.
Yes, you can do it like this:
1) Create a scalar function called "extension" in QtScript in SQLiteStudio.
2) The code is as follows:
// Take the part of the path after the last backslash (the file name);
// if it contains no '.', the file has no extension
if (arguments[0].substring(arguments[0].lastIndexOf('\u005C')).lastIndexOf('.') == -1)
{
    return ("");
}
else
{
    // Otherwise return everything from the last '.' onwards (including the dot)
    return arguments[0].substring(arguments[0].lastIndexOf('.'));
}
3) Then, in the SQL query editor you can use
select distinct extension(PATH) from DATA
... to itemise the distinct file extensions from the column called PATH in the table called DATA.
Note that the PATH field must contain a backslash ('\') in this implementation, i.e. it must be a full path.
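If a plain-SQL version is preferred over a QtScript function, the nested rtrim/replace idea from the question can be extended; a sketch, assuming PATH always contains at least one backslash (and, like the original query, returning the extension without the leading dot):
-- Inner query isolates the file name (the part after the last '\');
-- outer CASE returns its extension, or '' when there is no '.'
SELECT DISTINCT
  CASE WHEN fname LIKE '%.%'
       THEN replace(fname, rtrim(fname, replace(fname, '.', '')), '')
       ELSE ''
  END AS extension
FROM (SELECT replace(PATH, rtrim(PATH, replace(PATH, '\', '')), '') AS fname
      FROM DATA);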
