Substring operation or regex in Teradata

I have data as below
col1
abc1234
abc 1234
12345
abc 1234 123456789
xyz1234567890a
I want the output to contain only the numeric substring when it is at least 5 characters long; all other records should be filtered out.
I have tried REGEXP_SUBSTR(col1, '[0-9]+'), but it does not give the desired result:
SELECT col1
,REGEXP_SUBSTR(col1, '[0-9]+') as num
FROM table1
WHERE col1 IS NOT NULL
AND LENGTH(num) >5
The expected output is:
num
12345
123456789
1234567890

You need to tell the regex to return at least five consecutive digits; currently it matches one or more digits. And of course, if you want >= 5 you shouldn't write > 5 :-)
RegExp_Substr(col1, '[0-9]{5,}')
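The effect of the quantifier is not specific to Teradata. As a rough illustration only, here is the same comparison in R on the sample data from the question (the variable names are made up for this sketch, and R's regmatches() simply drops rows without a match, which may differ from how Teradata reports them):

```r
# Sample values from the question
col1 <- c("abc1234", "abc 1234", "12345", "abc 1234 123456789", "xyz1234567890a")

# '[0-9]+' returns the first run of digits, however short
first_run <- regmatches(col1, regexpr("[0-9]+", col1))

# '[0-9]{5,}' only matches runs of five or more digits;
# rows without such a run produce no match and are dropped here
long_run <- regmatches(col1, regexpr("[0-9]{5,}", col1))

print(first_run)
print(long_run)
```

With '[0-9]+' the fourth row yields its first short run, 1234, which is why filtering on the length afterwards loses 123456789.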

Related

Finding a duplicate value in a column and with other columns while inputs have different formats

I hope everyone is alright.
I have a data set that contains three columns:
column 1 contains the IDs; an ID may appear more than once
column 2 contains the phone_numbers related to the IDs. The phone_numbers are captured in different formats and need to be cleaned and written in one single format. Each ID might have more than one unique phone_number
column 3 contains random phone numbers related to the IDs, but from a different source
I need three output columns:
output_1 indicates whether an ID has more than one phone_number
output_2 indicates whether an ID shares a phone_number with any other ID
output_3 indicates whether an ID's phone_number matches any random phone number, even one from the same ID
My data looks like
id <- c(1, 2, 3, 1, 4, 5,2,3,7)
phone_number <- c("4121234567","3137894561",
"1234567788","(412)123-45%67",
"919-789-1$122","(123)1112233",
"(412)1234567","1234567788",
"123-11%12233")
phone_random<- c("na","4121234567",
"","na",
"123-1112233",
"na","","919-789-1$122","")
df <- data.frame(id, phone_number,phone_random)
df %>% head()
  id   phone_number phone_random
1  1     4121234567           na
2  2     3137894561   4121234567
3  3     1234567788
4  1 (412)123-45%67           na
5  4  919-789-1$122  123-1112233
6  5   (123)1112233           na
Please let me know if further information is needed in this case.
Thank you so much for your help in advance!
As a start, you can clean phone_number and phone_random by stripping every non-digit character:
df$phone_number <- gsub('\\D', '', df$phone_number)
df$phone_random <- gsub('\\D', '', df$phone_random)
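Building on that cleaning step, the three flags could be derived roughly as follows. This is only a sketch in base R: it assumes each flag is computed per ID and repeated on every row of that ID, and that empty strings and "na" in phone_random mean "no number". The helper names (n_phones, ids_per_phone, random_set and so on) are invented for the example.

```r
id <- c(1, 2, 3, 1, 4, 5, 2, 3, 7)
phone_number <- c("4121234567", "3137894561", "1234567788", "(412)123-45%67",
                  "919-789-1$122", "(123)1112233", "(412)1234567", "1234567788",
                  "123-11%12233")
phone_random <- c("na", "4121234567", "", "na", "123-1112233", "na", "",
                  "919-789-1$122", "")
df <- data.frame(id, phone_number, phone_random, stringsAsFactors = FALSE)

# Keep digits only; "na" and "" both become ""
df$phone_number <- gsub("\\D", "", df$phone_number)
df$phone_random <- gsub("\\D", "", df$phone_random)

key <- as.character(df$id)  # used to look up per-ID results by name

# output_1: the ID has more than one distinct phone_number
n_phones <- tapply(df$phone_number, df$id, function(x) length(unique(x)))
df$output_1 <- as.vector(n_phones[key]) > 1

# output_2: any phone_number of the ID also belongs to a different ID
ids_per_phone <- tapply(df$id, df$phone_number, function(x) length(unique(x)))
row_shared <- as.vector(ids_per_phone[df$phone_number]) > 1
df$output_2 <- as.vector(tapply(row_shared, df$id, any)[key])

# output_3: any phone_number of the ID appears among the random numbers
random_set <- setdiff(df$phone_random, "")
row_match <- df$phone_number %in% random_set
df$output_3 <- as.vector(tapply(row_match, df$id, any)[key])

print(df)
```

If the flags should instead describe individual rows, row_shared and row_match can be used directly without the per-ID tapply() step.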

Get frequency of occurrence of a character value in R dataframe

I have a dataframe like this:
col1 col2 col3
abc def xzy
lmk qwe abc
def lmk xzy
xzy abc qwe
The three columns hold character datatype values.
Across the 3 columns I have 5 unique values: abc, def, xzy, lmk and qwe.
What I need is a count of number of times each of these values appears in the whole dataframe.
abc 3
qwe 2
def 2
xzy 3
lmk 2
All the count() and aggregate functions only work column-wise, and when I unlist the data frame it doesn't seem to work either.
Any suggestions for functions that I can use?
Many thanks in advance.
You could do something like this (assuming that your data frame is called df1):
data <- c(df1$col1, df1$col2, df1$col3)
table(data)
This gives you the desired counts. Make sure the columns in df1 are character vectors, not factors.
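For a data frame with many columns, listing each one by hand gets tedious. A minimal sketch of the same idea, assuming df1 is rebuilt from the sample in the question: converting every column with as.character() first sidesteps the factor issue, and unlist() then flattens all columns into one vector for table().

```r
# Sample data from the question
df1 <- data.frame(col1 = c("abc", "lmk", "def", "xzy"),
                  col2 = c("def", "qwe", "lmk", "abc"),
                  col3 = c("xzy", "abc", "xzy", "qwe"),
                  stringsAsFactors = FALSE)

# Coerce each column to character, flatten, then tabulate
values <- unlist(lapply(df1, as.character), use.names = FALSE)
counts <- table(values)

print(counts)
```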
You can use an aggregate function with 'union all' - make sure you use 'union all', not just 'union'.
Let us assume that your table is called tmpcol with columns col1, col2, and col3...
select value, sum(cnt)
from (
select col1 as value, count(*) as cnt
from tmpcol
group by col1
union all
select col2 as value, count(*) as cnt
from tmpcol
group by col2
union all
select col3 as value, count(*) as cnt
from tmpcol
group by col3
) dataset
group by value;

Counting number of words between a predefined delimiter

What's the best way to count number of words between a predefined delimiter (in my case '/')?
Dataset:
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
Expected results are the following numbers..
2 (which are A DOG and 1)
2 (which are CAT and WHITE)
3 (A HORSE, BROWN & BLACK, 2)
1 (DOG)
Thank you!
Split with strsplit on one or more slashes ("/+") and count the resulting strings:
lengths(strsplit(as.character(df$v1), "/+"))
#[1] 2 2 3 1
Assuming your data doesn't have cases where a string (a) begins with "/" or (b) doesn't end with "/", you can just count the number of slash runs to get the number of chunks between slashes. So the following works for the data you've provided.
stringr::str_count(df$v1, "/+")
Using stringr::str_split() and counting the number of nonblank strings...
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
sapply(stringr::str_split(df$v1, '/'), function(x) sum(x != ''))
[1] 2 2 3 1
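The answers above agree on the sample data but can differ on strings that start with a slash. The sketch below compares two of the approaches in base R; the two "edge" strings are hypothetical inputs that are not in the question, and count_split / count_nonblank are names chosen just for this example.

```r
v1 <- c("A DOG//1//", "CAT/WHITE///", "A HORSE/BROWN & BLACK/2//", "DOG////")
edge <- c("/A/B", "A/B")  # hypothetical: leading slash, no trailing slash

# Approach 1: split on runs of slashes and count the pieces
count_split <- function(x) lengths(strsplit(x, "/+"))

# Approach 2: split on every slash and count only non-empty pieces
count_nonblank <- function(x) {
  vapply(strsplit(x, "/", fixed = TRUE),
         function(p) sum(nzchar(p)), integer(1))
}

print(count_split(v1))
print(count_nonblank(v1))
print(count_split(edge))     # a leading slash adds an empty first piece
print(count_nonblank(edge))
```

Counting only non-empty pieces is the more defensive choice if the input may contain a leading delimiter.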

Using if/else statement to insert a decimal for a column based on starting letter and string length of the row using R

I have a data frame "df" and want to apply if/else conditions to insert a decimal for the entire column "A"
A B
E0505 123
890 43
4505 56
Rules to apply:
If the code starts with "E" and length of the code is > 4: between character 4 and 5.
If length of the code is > 3 and the code doesn't start with "E": between character 3 and 4.
If length of the code is <= 3: return the code as such.
Final output:
A B
E050.5 123
890 43
450.5 56
I have tried this, but I am not sure how to include the condition where row starts with E or not.
ifelse(str_length(df$A)>3, as.character(paste0(substring(df$A, 1, 3),".", substring(df$A, 4))), as.character(df$A))
Using sub with a regular expression, you can do this:
df$A <- sub("((?:^E.|^[^E]).{2})(.+)", "\\1.\\2", df$A)
df
# A B
#1 E050.5 123
#2 890 43
#3 450.5 56
((?:^E.|^[^E]).{2})(.+) matches two kinds of strings:
case 1: the string starts with E and is followed by 4 or more characters, in which case the first 4 characters and the rest are captured as two separate groups and . is inserted between them;
case 2: the string does not start with E but has 4 or more characters, in which case the first 3 characters and the rest are captured as two separate groups and . is inserted between them.
Strings that start with E and have fewer than 5 characters in total, or that do not start with E and have fewer than 4 characters in total, are not matched and will not be modified.
To ignore case: df$A <- sub("((?:^[Ee].|^[^Ee]).{2})(.+)", "\\1.\\2", df$A).
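Since the question asked for an if/else formulation, the same rules can also be spelled out with ifelse(). This is a sketch in base R, not a replacement for the regex above; insert_dot is a hypothetical helper name, and codes starting with "E" that are only 4 characters long are left unchanged, as in the regex version.

```r
df <- data.frame(A = c("E0505", "890", "4505"),
                 B = c(123, 43, 56),
                 stringsAsFactors = FALSE)

insert_dot <- function(code) {
  # Dot goes after character 4 for "E" codes, after character 3 otherwise
  pos <- ifelse(startsWith(code, "E"), 4L, 3L)
  len <- nchar(code)
  # Only codes longer than the split position get a dot
  ifelse(len > pos,
         paste0(substr(code, 1, pos), ".", substr(code, pos + 1, len)),
         code)
}

df$A <- insert_dot(df$A)
print(df)
```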

Remove any row in a dataframe where the length of a zipcode is not equal to 5 digits

I have a dataframe with zipcodes:
address <- as.data.frame(matrix(c('1111 Spam Street', '12 Foo Bar', '666 Dead End', 95524, 94118, 9021), ncol=2))
address$V2 <- as.numeric(as.character(address$V2))
Which looks like this:
V1 V2
1 1111 Spam Street 95524
2 12 Foo Bar 94118
3 666 Dead End 9021
Unfortunately, the last zipcode is incorrect and I would like to remove that row and end up with just this:
V1 V2
1 1111 Spam Street 95524
2 12 Foo Bar 94118
My attempt newaddress <- address[length(address$V2) != 5, ] is obviously wrong, because it is looking at the length of the column, not the values inside the column.
How can I remove any row in a dataframe where there is a numeric value in a column which is not 5 digits in length?
Any advice is appreciated, and I apologize in advance for such a simple question.
This should do it
newaddress <- address[nchar(address$V2) ==5 , ] #would also remove rows with more than 5 digits
EDIT after comment by @Matt:
Assuming the values in address$V2 are integer, you can also do the following:
address[address$V2 >= 10000 & address$V2 <100000, ]
Using dplyr:
library(dplyr)
address %>% filter(nchar(V2) == 5)  # %>% replaces the old %.% operator, which was removed from dplyr
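Another option is to keep the zip codes as character strings and test them against a pattern, which also preserves leading zeros that a numeric column would lose. A minimal sketch, assuming V2 is stored as text rather than converted with as.numeric() as in the question (is_zip5 is a name chosen for this example):

```r
address <- data.frame(
  V1 = c("1111 Spam Street", "12 Foo Bar", "666 Dead End"),
  V2 = c("95524", "94118", "9021"),  # kept as character, not numeric
  stringsAsFactors = FALSE
)

# TRUE only when the whole value is exactly five digits
is_zip5 <- grepl("^[0-9]{5}$", address$V2)
newaddress <- address[is_zip5, ]

print(newaddress)
```

The anchors ^ and $ make the check reject values that are too long as well as those that are too short.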
