MariaDB JSON remove key and its values - mariadb

I have a table like
CREATE TABLE `saved_links` (
`link_entry_id` bigint(20) NOT NULL AUTO_INCREMENT,
`link_id` varchar(30) COLLATE utf8mb4_unicode_ci NOT NULL,
`user_data_json` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
PRIMARY KEY (`link_entry_id`),
UNIQUE KEY `link_id` (`link_id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='saved Links'
and an INSERT
INSERT INTO `saved_links`(`link_id`, `user_data_json` )
VALUES (
'AABBCC',
'[{
"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
{
"papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}},
{
"daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
{
"son#gmail_DOT_com": {"u_email": "son#gmail_DOT_com", "private": "no"}
}]'
), (
'DDEEFF',
'[{
"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
{
"papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}}
]'
) ;
SELECT * returns:
---------------------------------------------------
`link_id` | `user_data_json`
----------------------------------------------------
`AABBCC` | [{
| "mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
| {
| "papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}},
| {
| "daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
| {
| "son#gmail_DOT_com": {"u_email": "son#gmail_DOT_com", "private": "no"}}]
---------------------------------------------------------------------------------------------
`DDEEFF` | [{
| "mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
| {
| "papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}}
| ]
---------------------------------------------------------------------------------------------
I would like to REMOVE "papa#gmail_DOT_com" and all its values from AABBCC.
I have tried the following (I am using 10.4.15-MariaDB):
UPDATE `saved_links`
SET `user_data_json` = IFNULL(
JSON_REMOVE( `user_data_json`, JSON_UNQUOTE(
REPLACE( JSON_SEARCH(
`user_data_json`, 'all', 'papa#gmail_DOT_com', NULL, '$**.papa#gmail_DOT_com'), '.u_email', '' ) ) ), `user_data_json` )
where `link_id` = 'AABBCC'
This returns
---------------------------------------------------
`link_id` | `user_data_json`
----------------------------------------------------
`AABBCC` | [{
| "mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
| {}, //-> Notice these empty braces that are left behind.
| {
| "daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
| {
| "son#gmail_DOT_com": {"u_email": "son#gmail_DOT_com", "private": "no"}}]
Is there a way to avoid having the empty {} after removal?
UPDATE 1: If you try:
UPDATE `saved_links` SET
`user_data_json` =
JSON_REMOVE(`user_data_json`, '$.papa#gmail_DOT_com')
WHERE `link_id`= 'AABBCC'
This deletes all data in the user_data_json column WHERE link_id = 'AABBCC'.
Thank you

select json_remove(user_data_json,'$[1]') from saved_links where link_entry_id=19;
will return:
[{"mama#gmail_DOT_com": {"private": "no", "u_email": "mama#gmail_DOT_com"}},
{"daughter#gmail_DOT_com": {"private": "no", "u_email": "daughter#gmail_DOT_com"}},
{"son#gmail_DOT_com": {"private": "no", "u_email": "son#gmail_DOT_com"}}]
I am not really using JSON, but got my inspiration from the second example here: https://mariadb.com/kb/en/json_remove/
EDIT:
You could optimize this:
with recursive abc as (
Select 0 as i
union all
select i+1 from abc where i<2)
select link_entry_id, link_id,i, json_keys(user_data_json,concat('$[',i,']'))
from saved_links,abc;
output:
+---------------+---------+------+----------------------------------------------+
| link_entry_id | link_id | i | json_keys(user_data_json,concat('$[',i,']')) |
+---------------+---------+------+----------------------------------------------+
| 19 | AABBCC | 0 | ["mama#gmail_DOT_com"] |
| 20 | DDEEFF | 0 | ["mama#gmail_DOT_com"] |
| 19 | AABBCC | 1 | ["papa#gmail_DOT_com"] |
| 20 | DDEEFF | 1 | ["papa#gmail_DOT_com"] |
| 19 | AABBCC | 2 | ["daughter#gmail_DOT_com"] |
| 20 | DDEEFF | 2 | NULL |
+---------------+---------+------+----------------------------------------------+
With this you could 'convert' "papa#gm...." to 1.
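The index lookup that this recursive query performs can be sketched outside SQL. Below is a minimal Python equivalent: the data is the AABBCC array from the question, and `index_of_key` is my own helper name, not anything from MariaDB.

```python
import json

# The AABBCC array from the question: a list of single-key objects.
data = json.loads('''[
  {"mama#gmail_DOT_com":     {"u_email": "mama#gmail_DOT_com",     "private": "no"}},
  {"papa#gmail_DOT_com":     {"u_email": "papa#gmail_DOT_com",     "private": "no"}},
  {"daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
  {"son#gmail_DOT_com":      {"u_email": "son#gmail_DOT_com",      "private": "no"}}
]''')

def index_of_key(arr, key):
    """Mimic scanning JSON_KEYS(user_data_json, CONCAT('$[', i, ']')) for each i."""
    for i, obj in enumerate(arr):
        if key in obj:  # JSON_KEYS would return ["<key>"] at this index
            return i
    return None

i = index_of_key(data, "papa#gmail_DOT_com")
print(i)  # 1 -> the value to splice into JSON_REMOVE(..., '$[1]')
```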
EDIT2:
Combining different JSON functions from MariaDB or MySQL can do a lot:
SELECT
j.person,
JSON_KEYS(j.person),
JSON_EXTRACT(JSON_KEYS(j.person),'$[0]'),
JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(j.person),'$[0]')),
JSON_VALUE(JSON_KEYS(j.person),'$[0]')
FROM
JSON_TABLE('[{
"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
{
"papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}},
{
"daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
{
"son#gmail_DOT_com": {"u_email": "son#gmail_DOT_com", "private": "no"}
}]',
'$[*]' COLUMNS(person JSON PATH '$[0]')) j
output (please scroll right, the last column is more interesting than the first column 😉):
+ ----------- + ------------------------ + --------------------------------------------- + ----------------------------------------------------------- + ------------------------------------------- +
| person | JSON_KEYS(j.person) | JSON_EXTRACT(JSON_KEYS(j.person),'$[0]') | JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(j.person),'$[0]')) | JSON_VALUE(JSON_KEYS(j.person),'$[0]') |
+ ----------- + ------------------------ + --------------------------------------------- + ----------------------------------------------------------- + ------------------------------------------- +
| {"mama#gmail_DOT_com": {"private": "no", "u_email": "mama#gmail_DOT_com"}} | ["mama#gmail_DOT_com"] | "mama#gmail_DOT_com" | mama#gmail_DOT_com | mama#gmail_DOT_com |
| {"papa#gmail_DOT_com": {"private": "no", "u_email": "papa#gmail_DOT_com"}} | ["papa#gmail_DOT_com"] | "papa#gmail_DOT_com" | papa#gmail_DOT_com | papa#gmail_DOT_com |
| {"daughter#gmail_DOT_com": {"private": "no", "u_email": "daughter#gmail_DOT_com"}} | ["daughter#gmail_DOT_com"] | "daughter#gmail_DOT_com" | daughter#gmail_DOT_com | daughter#gmail_DOT_com |
| {"son#gmail_DOT_com": {"private": "no", "u_email": "son#gmail_DOT_com"}} | ["son#gmail_DOT_com"] | "son#gmail_DOT_com" | son#gmail_DOT_com | son#gmail_DOT_com |
+ ----------- + ------------------------ + --------------------------------------------- + ----------------------------------------------------------- + ------------------------------------------- +
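The chain JSON_KEYS → JSON_EXTRACT('$[0]') → JSON_UNQUOTE just pulls out each object's single key as a bare string. In Python terms, roughly (data abbreviated to two rows from the question):

```python
import json

rows = json.loads('''[
  {"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
  {"papa#gmail_DOT_com": {"u_email": "papa#gmail_DOT_com", "private": "no"}}
]''')

# JSON_KEYS(obj) -> list of keys; '$[0]' -> first key; JSON_UNQUOTE/JSON_VALUE
# -> the bare string, which Python dict keys already are.
first_keys = [next(iter(obj)) for obj in rows]
print(first_keys)  # ['mama#gmail_DOT_com', 'papa#gmail_DOT_com']
```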
EDIT (2020-12-26):
I did have a look at MariaDB; the following was tested on version 10.5.8.
select json_extract(json_array(user_data_json,"papa#gmail_DOT_com"), '$[1]') from saved_links;
+-----------------------------------------------------------------------+
| json_extract(json_array(user_data_json,"papa#gmail_DOT_com"), '$[1]') |
+-----------------------------------------------------------------------+
| "papa#gmail_DOT_com" |
| "papa#gmail_DOT_com" |
+-----------------------------------------------------------------------+
But hard-coding $[1] is not desired, so we have to determine the correct value for the 1:
WITH RECURSIVE data AS (
SELECT
link_entry_id,
link_id,
0 as I,
JSON_KEYS(user_data_json, '$[0]') jk
FROM saved_links
UNION ALL
SELECT
sl.link_entry_id,
sl.link_id,
I+1,
JSON_KEYS(user_data_json, CONCAT('$[',i+1,']'))
FROM saved_links sl, (select max(i) as I from data) x
WHERE JSON_KEYS(user_data_json, CONCAT('$[',i+1,']'))<>'')
SELECT * FROM data
;
+---------------+---------+------+----------------------------+
| link_entry_id | link_id | I | jk |
+---------------+---------+------+----------------------------+
| 19 | AABBCC | 0 | ["mama#gmail_DOT_com"] |
| 20 | DDEEFF | 0 | ["mama#gmail_DOT_com"] |
| 19 | AABBCC | 1 | ["papa#gmail_DOT_com"] |
| 20 | DDEEFF | 1 | ["papa#gmail_DOT_com"] |
| 19 | AABBCC | 2 | ["daughter#gmail_DOT_com"] |
| 19 | AABBCC | 3 | ["son#gmail_DOT_com"] |
+---------------+---------+------+----------------------------+
Here I gives the correct index for finding papa#gmail_DOT_com:
WITH RECURSIVE data AS (
SELECT
link_entry_id,
link_id,
0 as I,
JSON_KEYS(user_data_json, '$[0]') jk
FROM saved_links
UNION ALL
SELECT
sl.link_entry_id,
sl.link_id,
I+1,
JSON_KEYS(user_data_json, CONCAT('$[',i+1,']'))
FROM saved_links sl, (select max(i) as I from data) x
WHERE JSON_KEYS(user_data_json, CONCAT('$[',i+1,']'))<>'')
SELECT
json_remove(user_data_json, concat('$[',I,']'))
FROM saved_links sl
INNER JOIN data d ON d.link_entry_id= sl.link_entry_id AND d.link_id=sl.link_id and d.I=1
;
[{"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}},
{"daughter#gmail_DOT_com": {"u_email": "daughter#gmail_DOT_com", "private": "no"}},
{"son#gmail_DOT_com": {"u_email": "son#gmail_DOT_com", "private": "no"}}]
[{"mama#gmail_DOT_com": {"u_email": "mama#gmail_DOT_com", "private": "no"}}]

I've played with this puzzle for some time and figured out another way to do it.
You can use JSON_SEARCH (plus two other functions) to finally apply JSON_REMOVE.
Since you are creating an array of JSON objects, I'll take it as your design decision to store the data as is.
So, this is my code:
UPDATE saved_links sl
SET user_data_json =
JSON_REMOVE(user_data_json,
SUBSTRING_INDEX(
JSON_UNQUOTE(
JSON_SEARCH(sl.user_data_json,'one','papa#gmail_DOT_com')
)
,'.', 1)
)
WHERE link_id='AABBCC'
json_search(sl.user_data_json,'one','papa#gmail_DOT_com')
Returns "$[1].papa#gmail_DOT_com.u_email"
JSON_UNQUOTE
Returns $[1].papa#gmail_DOT_com.u_email
SUBSTRING_INDEX(#JSON,'.',1)
Returns $[1]
And finally you use this last return value as the JSON_REMOVE path.
I don't know whether the nested key will always be u_email, but if it is, you can use this approach.
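The three steps above can be mimicked in plain Python to see why the trick works; the path string below is the one JSON_SEARCH returns in this example, and the array is abbreviated:

```python
import json

# JSON_SEARCH(user_data_json, 'one', 'papa#gmail_DOT_com') returns this quoted path:
quoted_path = '"$[1].papa#gmail_DOT_com.u_email"'

unquoted = json.loads(quoted_path)      # JSON_UNQUOTE -> $[1].papa#gmail_DOT_com.u_email
array_path = unquoted.split('.', 1)[0]  # SUBSTRING_INDEX(..., '.', 1) -> $[1]

# That path is what JSON_REMOVE receives; the equivalent deletion in Python:
arr = [{"mama": {}}, {"papa": {}}, {"son": {}}]
del arr[int(array_path[2:-1])]          # strip '$[' and ']' -> index 1
print(array_path, len(arr))             # $[1] 2
```

Note that this only works because '.' inside the email keys is encoded as _DOT_; a key containing a literal dot would make SUBSTRING_INDEX cut the path in the wrong place.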

How to create a variable based on character and number iteration in R?

I'm trying to create a dummy variable based on the character type variable.
For example, I need to create "newcat" variable ranging from "I00" to "I99".
In the code I wrote, I place all the characters from I00-I99.
But is there a way to make this code more efficient, with a loop that iterates the numbers after the letter?
Thank you in advance!!
mort <- mort %>%
mutate(newcat = ifelse(ucod=="I00" |
ucod=="I01" | ucod=="I02" | ucod=="I03" | ucod=="I04" | ucod=="I05" |
ucod=="I06" | ucod=="I07" | ucod=="I08" | ucod=="I09" | ucod=="I10" |
ucod=="I11" | ucod=="I12" | ucod=="I13" | ucod=="I14" | ucod=="I15" |
ucod=="I16" | ucod=="I17" | ucod=="I18" | ucod=="I19" | ucod=="I20" |
ucod=="I21" | ucod=="I22" | ucod=="I23" | ucod=="I24" | ucod=="I25" |
ucod=="I26" | ucod=="I27" | ucod=="I28" | ucod=="I29" | ucod=="I30" |
ucod=="I31" | ucod=="I32" | ucod=="I33" | ucod=="I34" | ucod=="I35" |
ucod=="I36" | ucod=="I37" | ucod=="I38" | ucod=="I39" | ucod=="I40" |
ucod=="I41" | ucod=="I42" | ucod=="I43" | ucod=="I44" | ucod=="I45" |
ucod=="I46" | ucod=="I47" | ucod=="I48" | ucod=="I49" | ucod=="I50" |
ucod=="I51" | ucod=="I52" | ucod=="I53" | ucod=="I54" | ucod=="I55" |
ucod=="I56" | ucod=="I57" | ucod=="I58" | ucod=="I59" | ucod=="I60" |
ucod=="I61" | ucod=="I62" | ucod=="I63" | ucod=="I64" | ucod=="I65" |
ucod=="I66" | ucod=="I67" | ucod=="I68" | ucod=="I69" | ucod=="I70" |
ucod=="I71" | ucod=="I72" | ucod=="I73" | ucod=="I74" | ucod=="I75" |
ucod=="I76" | ucod=="I77" | ucod=="I78" | ucod=="I79" | ucod=="I80" |
ucod=="I81" | ucod=="I82" | ucod=="I83" | ucod=="I84" | ucod=="I85" |
ucod=="I86" | ucod=="I87" | ucod=="I88" | ucod=="I89" | ucod=="I90" |
ucod=="I91" | ucod=="I92" | ucod=="I93" | ucod=="I94" | ucod=="I95" |
ucod=="I96" | ucod=="I97" | ucod=="I98" | ucod=="I99", 1, 0))
Try %in% instead of == with |
x <- c(paste0("I0", 0:9),paste0("I", c(10:99)))
mort %>%
mutate(newcat = ifelse(ucod %in% x, 1, 0))
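The same idea, generating the codes and testing membership, translated to Python for comparison (the ucod sample values are made up):

```python
# paste0("I0", 0:9) and paste0("I", 10:99) collapse to one formatted range here.
valid = {f"I{n:02d}" for n in range(100)}        # "I00" .. "I99"

ucod = ["I07", "J45", "I99", "X10"]              # hypothetical sample values
newcat = [1 if u in valid else 0 for u in ucod]  # ifelse(ucod %in% x, 1, 0)
print(newcat)  # [1, 0, 1, 0]
```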
Another option is to use regex:
mort <- mort %>%
mutate(newcat = +str_detect(ucod, '^I[0-9]{2}$'))
where ^ is a metacharacter that marks the beginning of the string. I[0-9]{2} matches the letter I followed by any two digits 0-9, and $ is another metacharacter that marks the end of the string. So the matched string must start with I, be followed by exactly two digits, and then end. Any string that does not match the pattern will be flagged as FALSE.
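For comparison, the same regex applied in Python (the sample values are made up):

```python
import re

pattern = re.compile(r'^I[0-9]{2}$')  # the pattern from the str_detect answer
ucod = ["I07", "I100", "XI45", "I99"]
# str_detect gives TRUE/FALSE; the unary + turns that into 1/0,
# mirrored here with int().
newcat = [int(bool(pattern.match(u))) for u in ucod]
print(newcat)  # [1, 0, 0, 1]
```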

DynamoDB and GSIs when using access control with IAM policies

As a lockdown project I'm introducing myself to the concept of multi-tenancy applications. My simple application has a tenant who has an online shop front. The shop has product categories, each containing many products. My initial thought on database schema is as follows:
+====================================================================================+
| Primary Key | Sort Key (GSI PK) | Attribute 1 (GSI SK) | Attribute 2 | Attribute 3 |
|-------------|-------------------|----------------------|-------------|-------------|
| TENANT-uuid | CATEGORY-uuid | categoryName | ... | ... |
| TENANT-uuid | PRODUCT-uuid | productName | ... | ... |
| TENANT-uuid | PRODUCT-uuid | productName | ... | ... |
+====================================================================================+
So our GSI looks like so:
+=======================================================================================+
| Primary Key | Sort Key | Attribute 1 (PK) | Attribute 2 (SK) | Attribute 3 |
|---------------|-------------------|------------------|------------------|-------------|
| CATEGORY-uuid | categoryName | TENANT-uuid | CATEGORY-uuid | ... |
| PRODUCT-uuid | productName | TENANT-uuid | PRODUCT-uuid | ... |
| PRODUCT-uuid | productName | TENANT-uuid | PRODUCT-uuid | ... |
+=======================================================================================+
If I were to implement the following role policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem"
],
"Resource": [
"arn:aws:dynamodb:XXX:XXX:table/XXX"
],
"Condition": {
"ForAllValues:StringEquals": {
"dynamodb:LeadingKeys": [
"TENANT-uuid"
]
}
}
}
]
}
How does the LeadingKeys condition work if we're running a query on an index?
Update 1
So upon further inspection it seems one way to do this (for this situation) is to have a GSI with the partition key as the TENANT-uuid and the sort key as the item's parent. I've realised I should probably add slightly more information as follows.
Our desired outcomes are:
Get list of tenant's categories -> Query with PK = TENANT-uuid and SK BeginsWith "CATEGORY"
Get list of tenant's products -> Query with PK = TENANT-uuid and SK BeginsWith "PRODUCT"
Get list of products in a specific tenant's category -> ???
Get single tenant's category -> Query with PK = TENANT-uuid and SK = CATEGORY-uuid
Get single tenant's product -> Query with PK = TENANT-uuid and SK = PRODUCT-uuid
As it stands the only one that was an issue was number 3. A little reorganisation of the schema as follows seems to work. However it does limit our ability to sort our data slightly.
Table
+----------------------+---------------+-----------------+-------------+
| TenantID (PK/GSI PK) | ItemType (SK) | Data - (GSI SK) | Attribute 2 |
+----------------------+---------------+-----------------+-------------+
| TENANT-uuid | CATEGORY-1 | Category Name | ... |
+----------------------+---------------+-----------------+-------------+
| TENANT-uuid | PRODUCT-1 | CATEGORY-1 | ... |
+----------------------+---------------+-----------------+-------------+
| TENANT-uuid | PRODUCT-2 | CATEGORY-1 | ... |
+----------------------+---------------+-----------------+-------------+
Index
+---------------+---------------+------------+-------------+
| TenantID (PK) | Data (SK) | ItemType | Attribute 2 |
+---------------+---------------+------------+-------------+
| TENANT-uuid | Category Name | CATEGORY-1 | ... |
+---------------+---------------+------------+-------------+
| TENANT-uuid | CATEGORY-1 | PRODUCT-1 | ... |
+---------------+---------------+------------+-------------+
| TENANT-uuid | CATEGORY-1 | PRODUCT-2 | ... |
+---------------+---------------+------------+-------------+
So now, for number 3, to get a list of products in a specific tenant's category we query the index with PK = TENANT-uuid and SK=CATEGORY-uuid
This allows us to meet the LeadingKeys condition.
However, I'm not sure if this is the best solution. For the time being, in my little project, it works.
After almost giving up, I have found a solution. See this SO post describing how you can use wildcards in the IAM policy. Then, in your GSIs, you could prefix each of your IDs with a tenant ID. Using your second table as an example, replace CATEGORY-uuid with TENANT-uuid-CATEGORY-uuid.
And then your policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem"
],
"Resource": [
"arn:aws:dynamodb:XXX:XXX:table/XXX"
],
"Condition": {
"ForAllValues:StringLike": {
"dynamodb:LeadingKeys": [
"TENANT-uuid*"
]
}
}
}
]
}
I tested this quickly; it works just fine, and this is the approach I plan to use in my multi-tenant app.
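A quick sketch of why the prefixing works, with Python's fnmatch standing in for IAM's StringLike * wildcard (the tenant IDs and the gsi_pk helper are made up for illustration):

```python
from fnmatch import fnmatchcase

def gsi_pk(tenant_id, item_id):
    # Prefix every GSI partition key with the tenant ID, as described above.
    return f"{tenant_id}-{item_id}"

keys = [gsi_pk("TENANT-1234", "CATEGORY-77"),
        gsi_pk("TENANT-9999", "CATEGORY-77")]

# The StringLike condition "TENANT-1234*" only lets the first key through.
allowed = [k for k in keys if fnmatchcase(k, "TENANT-1234*")]
print(allowed)  # ['TENANT-1234-CATEGORY-77']
```

One caveat worth checking: a bare prefix wildcard also matches longer tenant IDs (TENANT-1* would match TENANT-12-...), so including the trailing separator in the pattern, or using fixed-length tenant IDs, avoids cross-tenant over-matching.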

Split data in SQLite column

I have a SQLite database that looks similar to this:
---------- ------------ ------------
| Car | | Computer | | Category |
---------- ------------ ------------
| id | | id | | id |
| make | | make | | record |
| model | | price | ------------
| year | | cpu |
---------- | weight |
------------
The record column in my Category table contains a comma separated list of the table name and id of the items that belong to that Category, so an entry would look like this:
Car_1,Car_2.
I am trying to split the items in the record on the comma to get each value:
Car_1
Car_2
Then I need to take it one step further and split on the _ and return the Car records.
So if I know the Category id, I'm trying to wind up with this in the end:
---------------- ------------------
| Car | | Car |
---------------| -----------------|
| id: 1 | | id: 2 |
| make: Honda | | make: Toyota |
| model: Civic | | model: Corolla |
| year: 2016 | | year: 2013 |
---------------- ------------------
I have had some success on splitting on the comma and getting 2 records back, but I'm stuck on splitting on the _ and making the join to the table in the record.
This is my query so far:
WITH RECURSIVE record(recordhash, data) AS (
SELECT '', record || ',' FROM Category WHERE id = 1
UNION ALL
SELECT
substr(data, 0, instr(data, ',')),
substr(data, instr(data, ',') + 1)
FROM record
WHERE data != '')
SELECT recordhash
FROM record
WHERE recordhash != ''
This is returning
--------------
| recordhash |
--------------
| Car_1 |
| Car_2 |
--------------
Any help would be greatly appreciated!
If your recursive CTE works as expected then you can split each of the values of recordhash with _ as a delimiter and use the part after _ as the id of the rows from Car to return:
select * from Car
where id in (
select substr(recordhash, 5)
from record
where recordhash like 'Car%'
)
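The substr(recordhash, 5) trick is just SQL's 1-indexed slicing. Here is the equivalent in Python, including the more general split on _ that the question asked about:

```python
records = ["Car_1", "Car_2"]

# substr(recordhash, 5) takes characters from position 5 onward (SQL is
# 1-indexed), which in Python is records[i][4:].
ids_by_substr = [int(r[4:]) for r in records]

# Splitting on '_' is more general and doesn't hard-code the prefix length.
pairs = [tuple(r.split('_', 1)) for r in records]
print(ids_by_substr, pairs)  # [1, 2] [('Car', '1'), ('Car', '2')]
```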

manipulate multiple variables in a data frame

How can I shorten the following code? It feels repetitive and lengthy, and can perhaps be condensed. I'm not sure how to select those variables and do the recoding in a succinct way. Any help is welcome!
data_France$X.1CTP2[data_France$X.1CTP2>7.01 | data_France$X.1CTP2<0.99]<-NA
data_France$X.1CTP3[data_France$X.1CTP3>7.01 | data_France$X.1CTP3<0.99]<-NA
data_France$X.1CTP4[data_France$X.1CTP4>7.01 | data_France$X.1CTP4<0.99]<-NA
data_France$X.1CTP5[data_France$X.1CTP5>7.01 | data_France$X.1CTP5<0.99]<-NA
data_France$X.1CTP6[data_France$X.1CTP6>7.01 | data_France$X.1CTP6<0.99]<-NA
data_France$X.1CTP7[data_France$X.1CTP7>7.01 | data_France$X.1CTP7<0.99]<-NA
data_France$X.1CTP8[data_France$X.1CTP8>7.01 | data_France$X.1CTP8<0.99]<-NA
data_France$X.1CTP9[data_France$X.1CTP9>7.01 | data_France$X.1CTP9<0.99]<-NA
data_France$X.1CTP10[data_France$X.1CTP10>7.01 | data_France$X.1CTP10<0.99]<-NA
data_France$X.1CTP11[data_France$X.1CTP11>7.01 | data_France$X.1CTP11<0.99]<-NA
data_France$X.1CTP12[data_France$X.1CTP12>7.01 | data_France$X.1CTP12<0.99]<-NA
data_France$X.1CTP13[data_France$X.1CTP13>7.01 | data_France$X.1CTP13<0.99]<-NA
data_France$X.1CTP14[data_France$X.1CTP14>7.01 | data_France$X.1CTP14<0.99]<-NA
data_France$X.1CTP15[data_France$X.1CTP15>7.01 | data_France$X.1CTP15<0.99]<-NA
data_France$X.2CTP1[data_France$X.2CTP1>7.01 | data_France$X.2CTP1<0.99]<-NA
data_France$X.2CTP3[data_France$X.2CTP3>7.01 | data_France$X.2CTP3<0.99]<-NA
data_France$X.2CTP4[data_France$X.2CTP4>7.01 | data_France$X.2CTP4<0.99]<-NA
data_France$X.2CTP5[data_France$X.2CTP5>7.01 | data_France$X.2CTP5<0.99]<-NA
data_France$X.2CTP6[data_France$X.2CTP6>7.01 | data_France$X.2CTP6<0.99]<-NA
data_France$X.2CTP7[data_France$X.2CTP7>7.01 | data_France$X.2CTP7<0.99]<-NA
data_France$X.2CTP8[data_France$X.2CTP8>7.01 | data_France$X.2CTP8<0.99]<-NA
data_France$X.2CTP9[data_France$X.2CTP9>7.01 | data_France$X.2CTP9<0.99]<-NA
data_France$X.2CTP10[data_France$X.2CTP10>7.01 | data_France$X.2CTP10<0.99]<-NA
data_France$X.2CTP11[data_France$X.2CTP11>7.01 | data_France$X.2CTP11<0.99]<-NA
data_France$X.2CTP12[data_France$X.2CTP12>7.01 | data_France$X.2CTP12<0.99]<-NA
data_France$X.2CTP13[data_France$X.2CTP13>7.01 | data_France$X.2CTP13<0.99]<-NA
data_France$X.2CTP14[data_France$X.2CTP14>7.01 | data_France$X.2CTP14<0.99]<-NA
data_France$X.2CTP15[data_France$X.2CTP15>7.01 | data_France$X.2CTP15<0.99]<-NA
data_France$X.3CTP1[data_France$X.3CTP1>7.01 | data_France$X.3CTP1<0.99]<-NA
data_France$X.3CTP2[data_France$X.3CTP2>7.01 | data_France$X.3CTP2<0.99]<-NA
data_France$X.3CTP4[data_France$X.3CTP4>7.01 | data_France$X.3CTP4<0.99]<-NA
data_France$X.3CTP5[data_France$X.3CTP5>7.01 | data_France$X.3CTP5<0.99]<-NA
data_France$X.3CTP6[data_France$X.3CTP6>7.01 | data_France$X.3CTP6<0.99]<-NA
data_France$X.3CTP7[data_France$X.3CTP7>7.01 | data_France$X.3CTP7<0.99]<-NA
data_France$X.3CTP8[data_France$X.3CTP8>7.01 | data_France$X.3CTP8<0.99]<-NA
data_France$X.3CTP9[data_France$X.3CTP9>7.01 | data_France$X.3CTP9<0.99]<-NA
data_France$X.3CTP10[data_France$X.3CTP10>7.01 | data_France$X.3CTP10<0.99]<-NA
data_France$X.3CTP11[data_France$X.3CTP11>7.01 | data_France$X.3CTP11<0.99]<-NA
data_France$X.3CTP12[data_France$X.3CTP12>7.01 | data_France$X.3CTP12<0.99]<-NA
data_France$X.3CTP13[data_France$X.3CTP13>7.01 | data_France$X.3CTP13<0.99]<-NA
data_France$X.3CTP14[data_France$X.3CTP14>7.01 | data_France$X.3CTP14<0.99]<-NA
data_France$X.3CTP15[data_France$X.3CTP15>7.01 | data_France$X.3CTP15<0.99]<-NA
data_France$X.4CTP1[data_France$X.4CTP1>7.01 | data_France$X.4CTP1<0.99]<-NA
data_France$X.4CTP2[data_France$X.4CTP2>7.01 | data_France$X.4CTP2<0.99]<-NA
data_France$X.4CTP3[data_France$X.4CTP3>7.01 | data_France$X.4CTP3<0.99]<-NA
data_France$X.4CTP5[data_France$X.4CTP5>7.01 | data_France$X.4CTP5<0.99]<-NA
data_France$X.4CTP6[data_France$X.4CTP6>7.01 | data_France$X.4CTP6<0.99]<-NA
data_France$X.4CTP7[data_France$X.4CTP7>7.01 | data_France$X.4CTP7<0.99]<-NA
data_France$X.4CTP8[data_France$X.4CTP8>7.01 | data_France$X.4CTP8<0.99]<-NA
data_France$X.4CTP9[data_France$X.4CTP9>7.01 | data_France$X.4CTP9<0.99]<-NA
data_France$X.4CTP10[data_France$X.4CTP10>7.01 | data_France$X.4CTP10<0.99]<-NA
data_France$X.4CTP11[data_France$X.4CTP11>7.01 | data_France$X.4CTP11<0.99]<-NA
data_France$X.4CTP12[data_France$X.4CTP12>7.01 | data_France$X.4CTP12<0.99]<-NA
data_France$X.4CTP13[data_France$X.4CTP13>7.01 | data_France$X.4CTP13<0.99]<-NA
data_France$X.4CTP14[data_France$X.4CTP14>7.01 | data_France$X.4CTP14<0.99]<-NA
data_France$X.4CTP15[data_France$X.4CTP15>7.01 | data_France$X.4CTP15<0.99]<-NA
data_France$X.5CTP1[data_France$X.5CTP1>7.01 | data_France$X.5CTP1<0.99]<-NA
data_France$X.5CTP2[data_France$X.5CTP2>7.01 | data_France$X.5CTP2<0.99]<-NA
data_France$X.5CTP3[data_France$X.5CTP3>7.01 | data_France$X.5CTP3<0.99]<-NA
data_France$X.5CTP4[data_France$X.5CTP4>7.01 | data_France$X.5CTP4<0.99]<-NA
data_France$X.5CTP6[data_France$X.5CTP6>7.01 | data_France$X.5CTP6<0.99]<-NA
data_France$X.5CTP7[data_France$X.5CTP7>7.01 | data_France$X.5CTP7<0.99]<-NA
data_France$X.5CTP8[data_France$X.5CTP8>7.01 | data_France$X.5CTP8<0.99]<-NA
data_France$X.5CTP9[data_France$X.5CTP9>7.01 | data_France$X.5CTP9<0.99]<-NA
data_France$X.5CTP10[data_France$X.5CTP10>7.01 | data_France$X.5CTP10<0.99]<-NA
data_France$X.5CTP11[data_France$X.5CTP11>7.01 | data_France$X.5CTP11<0.99]<-NA
data_France$X.5CTP12[data_France$X.5CTP12>7.01 | data_France$X.5CTP12<0.99]<-NA
data_France$X.5CTP13[data_France$X.5CTP13>7.01 | data_France$X.5CTP13<0.99]<-NA
data_France$X.5CTP14[data_France$X.5CTP14>7.01 | data_France$X.5CTP14<0.99]<-NA
data_France$X.5CTP15[data_France$X.5CTP15>7.01 | data_France$X.5CTP15<0.99]<-NA
data_France$X.6CTP1[data_France$X.6CTP1>7.01 | data_France$X.6CTP1<0.99]<-NA
data_France$X.6CTP2[data_France$X.6CTP2>7.01 | data_France$X.6CTP2<0.99]<-NA
data_France$X.6CTP3[data_France$X.6CTP3>7.01 | data_France$X.6CTP3<0.99]<-NA
data_France$X.6CTP4[data_France$X.6CTP4>7.01 | data_France$X.6CTP4<0.99]<-NA
data_France$X.6CTP5[data_France$X.6CTP5>7.01 | data_France$X.6CTP5<0.99]<-NA
data_France$X.6CTP7[data_France$X.6CTP7>7.01 | data_France$X.6CTP7<0.99]<-NA
data_France$X.6CTP8[data_France$X.6CTP8>7.01 | data_France$X.6CTP8<0.99]<-NA
data_France$X.6CTP9[data_France$X.6CTP9>7.01 | data_France$X.6CTP9<0.99]<-NA
data_France$X.6CTP10[data_France$X.6CTP10>7.01 | data_France$X.6CTP10<0.99]<-NA
data_France$X.6CTP11[data_France$X.6CTP11>7.01 | data_France$X.6CTP11<0.99]<-NA
data_France$X.6CTP12[data_France$X.6CTP12>7.01 | data_France$X.6CTP12<0.99]<-NA
data_France$X.6CTP13[data_France$X.6CTP13>7.01 | data_France$X.6CTP13<0.99]<-NA
data_France$X.6CTP14[data_France$X.6CTP14>7.01 | data_France$X.6CTP14<0.99]<-NA
data_France$X.6CTP15[data_France$X.6CTP15>7.01 | data_France$X.6CTP15<0.99]<-NA
data_France$X.7CTP1[data_France$X.7CTP1>7.01 | data_France$X.7CTP1<0.99]<-NA
data_France$X.7CTP2[data_France$X.7CTP2>7.01 | data_France$X.7CTP2<0.99]<-NA
data_France$X.7CTP3[data_France$X.7CTP3>7.01 | data_France$X.7CTP3<0.99]<-NA
data_France$X.7CTP4[data_France$X.7CTP4>7.01 | data_France$X.7CTP4<0.99]<-NA
data_France$X.7CTP5[data_France$X.7CTP5>7.01 | data_France$X.7CTP5<0.99]<-NA
data_France$X.7CTP6[data_France$X.7CTP6>7.01 | data_France$X.7CTP6<0.99]<-NA
data_France$X.7CTP8[data_France$X.7CTP8>7.01 | data_France$X.7CTP8<0.99]<-NA
data_France$X.7CTP9[data_France$X.7CTP9>7.01 | data_France$X.7CTP9<0.99]<-NA
data_France$X.7CTP10[data_France$X.7CTP10>7.01 | data_France$X.7CTP10<0.99]<-NA
data_France$X.7CTP11[data_France$X.7CTP11>7.01 | data_France$X.7CTP11<0.99]<-NA
data_France$X.7CTP12[data_France$X.7CTP12>7.01 | data_France$X.7CTP12<0.99]<-NA
data_France$X.7CTP13[data_France$X.7CTP13>7.01 | data_France$X.7CTP13<0.99]<-NA
data_France$X.7CTP14[data_France$X.7CTP14>7.01 | data_France$X.7CTP14<0.99]<-NA
data_France$X.7CTP15[data_France$X.7CTP15>7.01 | data_France$X.7CTP15<0.99]<-NA
data_France$X.8CTP1[data_France$X.8CTP1>7.01 | data_France$X.8CTP1<0.99]<-NA
data_France$X.8CTP2[data_France$X.8CTP2>7.01 | data_France$X.8CTP2<0.99]<-NA
data_France$X.8CTP3[data_France$X.8CTP3>7.01 | data_France$X.8CTP3<0.99]<-NA
data_France$X.8CTP4[data_France$X.8CTP4>7.01 | data_France$X.8CTP4<0.99]<-NA
data_France$X.8CTP5[data_France$X.8CTP5>7.01 | data_France$X.8CTP5<0.99]<-NA
data_France$X.8CTP6[data_France$X.8CTP6>7.01 | data_France$X.8CTP6<0.99]<-NA
data_France$X.8CTP7[data_France$X.8CTP7>7.01 | data_France$X.8CTP7<0.99]<-NA
data_France$X.8CTP9[data_France$X.8CTP9>7.01 | data_France$X.8CTP9<0.99]<-NA
data_France$X.8CTP10[data_France$X.8CTP10>7.01 | data_France$X.8CTP10<0.99]<-NA
data_France$X.8CTP11[data_France$X.8CTP11>7.01 | data_France$X.8CTP11<0.99]<-NA
data_France$X.8CTP12[data_France$X.8CTP12>7.01 | data_France$X.8CTP12<0.99]<-NA
data_France$X.8CTP13[data_France$X.8CTP13>7.01 | data_France$X.8CTP13<0.99]<-NA
data_France$X.8CTP14[data_France$X.8CTP14>7.01 | data_France$X.8CTP14<0.99]<-NA
data_France$X.8CTP15[data_France$X.8CTP15>7.01 | data_France$X.8CTP15<0.99]<-NA
data_France$X.9CTP1[data_France$X.9CTP1>7.01 | data_France$X.9CTP1<0.99]<-NA
data_France$X.9CTP2[data_France$X.9CTP2>7.01 | data_France$X.9CTP2<0.99]<-NA
data_France$X.9CTP3[data_France$X.9CTP3>7.01 | data_France$X.9CTP3<0.99]<-NA
data_France$X.9CTP4[data_France$X.9CTP4>7.01 | data_France$X.9CTP4<0.99]<-NA
data_France$X.9CTP5[data_France$X.9CTP5>7.01 | data_France$X.9CTP5<0.99]<-NA
data_France$X.9CTP6[data_France$X.9CTP6>7.01 | data_France$X.9CTP6<0.99]<-NA
data_France$X.9CTP7[data_France$X.9CTP7>7.01 | data_France$X.9CTP7<0.99]<-NA
data_France$X.9CTP8[data_France$X.9CTP8>7.01 | data_France$X.9CTP8<0.99]<-NA
data_France$X.9CTP10[data_France$X.9CTP10>7.01 | data_France$X.9CTP10<0.99]<-NA
data_France$X.9CTP11[data_France$X.9CTP11>7.01 | data_France$X.9CTP11<0.99]<-NA
data_France$X.9CTP12[data_France$X.9CTP12>7.01 | data_France$X.9CTP12<0.99]<-NA
data_France$X.9CTP13[data_France$X.9CTP13>7.01 | data_France$X.9CTP13<0.99]<-NA
data_France$X.9CTP14[data_France$X.9CTP14>7.01 | data_France$X.9CTP14<0.99]<-NA
data_France$X.9CTP15[data_France$X.9CTP15>7.01 | data_France$X.9CTP15<0.99]<-NA
data_France$X.10CTP1[data_France$X.10CTP1>7.01 | data_France$X.10CTP1<0.99]<-NA
data_France$X.10CTP2[data_France$X.10CTP2>7.01 | data_France$X.10CTP2<0.99]<-NA
data_France$X.10CTP3[data_France$X.10CTP3>7.01 | data_France$X.10CTP3<0.99]<-NA
data_France$X.10CTP4[data_France$X.10CTP4>7.01 | data_France$X.10CTP4<0.99]<-NA
data_France$X.10CTP5[data_France$X.10CTP5>7.01 | data_France$X.10CTP5<0.99]<-NA
data_France$X.10CTP6[data_France$X.10CTP6>7.01 | data_France$X.10CTP6<0.99]<-NA
data_France$X.10CTP7[data_France$X.10CTP7>7.01 | data_France$X.10CTP7<0.99]<-NA
data_France$X.10CTP8[data_France$X.10CTP8>7.01 | data_France$X.10CTP8<0.99]<-NA
data_France$X.10CTP9[data_France$X.10CTP9>7.01 | data_France$X.10CTP9<0.99]<-NA
data_France$X.10CTP11[data_France$X.10CTP11>7.01 | data_France$X.10CTP11<0.99]<-NA
data_France$X.10CTP12[data_France$X.10CTP12>7.01 | data_France$X.10CTP12<0.99]<-NA
data_France$X.10CTP13[data_France$X.10CTP13>7.01 | data_France$X.10CTP13<0.99]<-NA
data_France$X.10CTP14[data_France$X.10CTP14>7.01 | data_France$X.10CTP14<0.99]<-NA
data_France$X.10CTP15[data_France$X.10CTP15>7.01 | data_France$X.10CTP15<0.99]<-NA
data_France$X.11CTP1[data_France$X.11CTP1>7.01 | data_France$X.11CTP1<0.99]<-NA
data_France$X.11CTP2[data_France$X.11CTP2>7.01 | data_France$X.11CTP2<0.99]<-NA
data_France$X.11CTP3[data_France$X.11CTP3>7.01 | data_France$X.11CTP3<0.99]<-NA
data_France$X.11CTP4[data_France$X.11CTP4>7.01 | data_France$X.11CTP4<0.99]<-NA
data_France$X.11CTP5[data_France$X.11CTP5>7.01 | data_France$X.11CTP5<0.99]<-NA
data_France$X.11CTP6[data_France$X.11CTP6>7.01 | data_France$X.11CTP6<0.99]<-NA
data_France$X.11CTP7[data_France$X.11CTP7>7.01 | data_France$X.11CTP7<0.99]<-NA
data_France$X.11CTP8[data_France$X.11CTP8>7.01 | data_France$X.11CTP8<0.99]<-NA
data_France$X.11CTP9[data_France$X.11CTP9>7.01 | data_France$X.11CTP9<0.99]<-NA
data_France$X.11CTP10[data_France$X.11CTP10>7.01 | data_France$X.11CTP10<0.99]<-NA
data_France$X.11CTP12[data_France$X.11CTP12>7.01 | data_France$X.11CTP12<0.99]<-NA
data_France$X.11CTP13[data_France$X.11CTP13>7.01 | data_France$X.11CTP13<0.99]<-NA
data_France$X.11CTP14[data_France$X.11CTP14>7.01 | data_France$X.11CTP14<0.99]<-NA
data_France$X.11CTP15[data_France$X.11CTP15>7.01 | data_France$X.11CTP15<0.99]<-NA
data_France$X.12CTP1[data_France$X.12CTP1>7.01 | data_France$X.12CTP1<0.99]<-NA
data_France$X.12CTP2[data_France$X.12CTP2>7.01 | data_France$X.12CTP2<0.99]<-NA
data_France$X.12CTP3[data_France$X.12CTP3>7.01 | data_France$X.12CTP3<0.99]<-NA
data_France$X.12CTP4[data_France$X.12CTP4>7.01 | data_France$X.12CTP4<0.99]<-NA
data_France$X.12CTP5[data_France$X.12CTP5>7.01 | data_France$X.12CTP5<0.99]<-NA
data_France$X.12CTP6[data_France$X.12CTP6>7.01 | data_France$X.12CTP6<0.99]<-NA
data_France$X.12CTP7[data_France$X.12CTP7>7.01 | data_France$X.12CTP7<0.99]<-NA
data_France$X.12CTP8[data_France$X.12CTP8>7.01 | data_France$X.12CTP8<0.99]<-NA
data_France$X.12CTP9[data_France$X.12CTP9>7.01 | data_France$X.12CTP9<0.99]<-NA
data_France$X.12CTP10[data_France$X.12CTP10>7.01 | data_France$X.12CTP10<0.99]<-NA
data_France$X.12CTP11[data_France$X.12CTP11>7.01 | data_France$X.12CTP11<0.99]<-NA
data_France$X.12CTP13[data_France$X.12CTP13>7.01 | data_France$X.12CTP13<0.99]<-NA
data_France$X.12CTP14[data_France$X.12CTP14>7.01 | data_France$X.12CTP14<0.99]<-NA
data_France$X.12CTP15[data_France$X.12CTP15>7.01 | data_France$X.12CTP15<0.99]<-NA
data_France$X.13CTP1[data_France$X.13CTP1>7.01 | data_France$X.13CTP1<0.99]<-NA
data_France$X.13CTP2[data_France$X.13CTP2>7.01 | data_France$X.13CTP2<0.99]<-NA
data_France$X.13CTP3[data_France$X.13CTP3>7.01 | data_France$X.13CTP3<0.99]<-NA
data_France$X.13CTP4[data_France$X.13CTP4>7.01 | data_France$X.13CTP4<0.99]<-NA
data_France$X.13CTP5[data_France$X.13CTP5>7.01 | data_France$X.13CTP5<0.99]<-NA
data_France$X.13CTP6[data_France$X.13CTP6>7.01 | data_France$X.13CTP6<0.99]<-NA
data_France$X.13CTP7[data_France$X.13CTP7>7.01 | data_France$X.13CTP7<0.99]<-NA
data_France$X.13CTP8[data_France$X.13CTP8>7.01 | data_France$X.13CTP8<0.99]<-NA
data_France$X.13CTP9[data_France$X.13CTP9>7.01 | data_France$X.13CTP9<0.99]<-NA
data_France$X.13CTP10[data_France$X.13CTP10>7.01 | data_France$X.13CTP10<0.99]<-NA
data_France$X.13CTP11[data_France$X.13CTP11>7.01 | data_France$X.13CTP11<0.99]<-NA
data_France$X.13CTP12[data_France$X.13CTP12>7.01 | data_France$X.13CTP12<0.99]<-NA
data_France$X.13CTP14[data_France$X.13CTP14>7.01 | data_France$X.13CTP14<0.99]<-NA
data_France$X.13CTP15[data_France$X.13CTP15>7.01 | data_France$X.13CTP15<0.99]<-NA
data_France$X.14CTP1[data_France$X.14CTP1>7.01 | data_France$X.14CTP1<0.99]<-NA
data_France$X.14CTP2[data_France$X.14CTP2>7.01 | data_France$X.14CTP2<0.99]<-NA
data_France$X.14CTP3[data_France$X.14CTP3>7.01 | data_France$X.14CTP3<0.99]<-NA
data_France$X.14CTP4[data_France$X.14CTP4>7.01 | data_France$X.14CTP4<0.99]<-NA
data_France$X.14CTP5[data_France$X.14CTP5>7.01 | data_France$X.14CTP5<0.99]<-NA
data_France$X.14CTP6[data_France$X.14CTP6>7.01 | data_France$X.14CTP6<0.99]<-NA
data_France$X.14CTP7[data_France$X.14CTP7>7.01 | data_France$X.14CTP7<0.99]<-NA
data_France$X.14CTP8[data_France$X.14CTP8>7.01 | data_France$X.14CTP8<0.99]<-NA
data_France$X.14CTP9[data_France$X.14CTP9>7.01 | data_France$X.14CTP9<0.99]<-NA
data_France$X.14CTP10[data_France$X.14CTP10>7.01 | data_France$X.14CTP10<0.99]<-NA
data_France$X.14CTP11[data_France$X.14CTP11>7.01 | data_France$X.14CTP11<0.99]<-NA
data_France$X.14CTP12[data_France$X.14CTP12>7.01 | data_France$X.14CTP12<0.99]<-NA
data_France$X.14CTP13[data_France$X.14CTP13>7.01 | data_France$X.14CTP13<0.99]<-NA
data_France$X.14CTP15[data_France$X.14CTP15>7.01 | data_France$X.14CTP15<0.99]<-NA
data_France$X.15CTP1[data_France$X.15CTP1>7.01 | data_France$X.15CTP1<0.99]<-NA
data_France$X.15CTP2[data_France$X.15CTP2>7.01 | data_France$X.15CTP2<0.99]<-NA
data_France$X.15CTP3[data_France$X.15CTP3>7.01 | data_France$X.15CTP3<0.99]<-NA
data_France$X.15CTP4[data_France$X.15CTP4>7.01 | data_France$X.15CTP4<0.99]<-NA
data_France$X.15CTP5[data_France$X.15CTP5>7.01 | data_France$X.15CTP5<0.99]<-NA
data_France$X.15CTP6[data_France$X.15CTP6>7.01 | data_France$X.15CTP6<0.99]<-NA
data_France$X.15CTP7[data_France$X.15CTP7>7.01 | data_France$X.15CTP7<0.99]<-NA
data_France$X.15CTP8[data_France$X.15CTP8>7.01 | data_France$X.15CTP8<0.99]<-NA
data_France$X.15CTP9[data_France$X.15CTP9>7.01 | data_France$X.15CTP9<0.99]<-NA
data_France$X.15CTP10[data_France$X.15CTP10>7.01 | data_France$X.15CTP10<0.99]<-NA
data_France$X.15CTP11[data_France$X.15CTP11>7.01 | data_France$X.15CTP11<0.99]<-NA
data_France$X.15CTP12[data_France$X.15CTP12>7.01 | data_France$X.15CTP12<0.99]<-NA
data_France$X.15CTP13[data_France$X.15CTP13>7.01 | data_France$X.15CTP13<0.99]<-NA
data_France$X.15CTP14[data_France$X.15CTP14>7.01 | data_France$X.15CTP14<0.99]<-NA
Base R equivalent of @Cettt's answer:
## helper function to replace out-of-range elements with NA
rfun <- function(x) replace(x, which(x < 0.99 | x > 7.01), NA)
## identify which columns need to be changed (note the escaped dot)
cnm <- grep("^X\\.[0-9]+CTP[0-9]+", names(data_France))
for (i in cnm) {
  data_France[[i]] <- rfun(data_France[[i]])
}
You could also use lapply(), but sometimes the for loop is easier to understand and debug.
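To make the loop concrete, here is a self-contained toy run (the small data frame and its column names are made up for illustration; only the columns matching the pattern are touched):

```r
## helper function to replace out-of-range elements with NA
rfun <- function(x) replace(x, which(x < 0.99 | x > 7.01), NA)

## toy data: two pattern columns plus one unrelated column
df <- data.frame(X.1CTP1 = c(0.5, 3, 8),
                 X.1CTP2 = c(1, 7, 0.2),
                 other   = c(10, 20, 30))

## apply the helper only to the matching columns
cnm <- grep("^X\\.[0-9]+CTP[0-9]+$", names(df))
for (i in cnm) df[[i]] <- rfun(df[[i]])

df
#   X.1CTP1 X.1CTP2 other
# 1      NA       1    10
# 2       3       7    20
# 3      NA      NA    30
```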
I would recommend the dplyr package, which has the mutate_at function.
In your case you could use it like this:
library(dplyr)
data_France %>%
  as_tibble() %>%
  mutate_at(vars(matches("^X\\.[0-9]+CTP[0-9]+")), ~ ifelse(.x < 0.99 | .x > 7.01, NA_real_, .x))
#Create a vector of variable names. There may be other ways to do this, like using
#regex or just taking the indices of the variables names (e.g., 1:225)
vars <- apply(expand.grid("X.", as.character(1:15), "CTP", as.character(1:15)),
              1, paste0, collapse = "")
#The grid generates all 225 combinations, so keep only the names actually in the data
vars <- intersect(vars, names(data_France))
for (i in vars) {
  data_France[[i]][data_France[[i]] > 7.01 | data_France[[i]] < 0.99] <- NA
}
If this is your entire data set (i.e., there are no other variables in the data), you can simply do
data_France[data_France > 7.01 | data_France < 0.99] <- NA
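A minimal sketch of that whole-data-frame shortcut, assuming every column is numeric and in scope for the replacement (toy data with made-up names):

```r
# When all columns are numeric, logical indexing on the whole
# data frame replaces every out-of-range cell in one statement
df <- data.frame(a = c(0.5, 3), b = c(8, 2))
df[df > 7.01 | df < 0.99] <- NA
df
#    a  b
# 1 NA NA
# 2  3  2
```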

R: Regex to match more than one pipe occurrence

I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe, except the ones we want to remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace with an empty string. This pattern removes any pipe (and any whitespace that follows it) so long as what follows is another pipe. No removal occurs when a pipe is followed by an actual term, or by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
Combining both into a single alternation, e.g.
result <- gsub("\\|\\s+(?=\\|)|(?:^\\|$)", "", badstring, perl=TRUE)
is tempting, but note that ^ and $ anchor to the whole string, so the ^\|$ branch only fires when the entire input is a lone "|"; an input like "| | |" still reduces to "|" in a single pass. The two-step approach above is the safer option.
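To make the two-step fallback concrete, here is a sketch on an input that contains no names at all: the lookahead pass leaves a single "|", which the follow-up sub() then clears.

```r
empty <- "| | |"
# pass 1: drop every pipe (plus trailing whitespace) that precedes another pipe
step1 <- gsub("\\|\\s+(?=\\|)", "", empty, perl = TRUE)
step1                    # "|"
# pass 2: clear the string if nothing but a lone pipe remains
sub("^\\|$", "", step1)  # ""
```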
