Trimming a string in Teradata

I have a table in Teradata with a particular column "location" like
location
Rockville County, TX
Green River County, IL
Joliet County, CA
Jones County, FL
.
.
.
What I need to do is strip off everything from ' County' onwards and turn the column into something like
location
Rockville
Green River
Joliet
Jones
I've been trying to use the function trim like
trim(trailing ' County' from location)
but it's not working. Any ideas?

The trim function removes a single pad character (whitespace by default), not a whole string such as ' County', which is why your attempt fails.
You can use a combination of index and substring, e.g.:
select
'Green River County, IL' your_string
, substring(your_string, 0, index(your_string, 'County')) your_desired_result
index(target_string, string_to_find) gives you the position of a string within another string
substring(target_string, start_position, length) allows you to pull out a specific part of a string (the third argument is a length, not an end position)

This is the Standard SQL way:
substring(location from 0 for position(' County' in location))
In TD14 you might also use a regular expression:
regexp_substr(location, '.*(?= County)')

Related

Pattern matching character vectors in R

I am trying to match character values between two vectors in two separate dataframes. Let's call the dataframes "rentals" and "parcels"; both contain a character vector "address", holding the addresses of all rental parcels in a county and the addresses of all parcels in a city, respectively. We would like to figure out which addresses in the "parcels" dataframe match an address in the "rentals" dataframe by searching through the vector of addresses in "parcels" for matches with an address in "rentals".
The values in rentals$address look like this:
rentals$address <- c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "1420 SE 16TH AVE",...)
And the values in parcels$address look like this:
parcels$address <- c("635 N MARINE DR, PORTLAND, OR, 97217", "7023 N BANK ST, PORTLAND, OR, 97203", "5410 N CECELIA ST, PORTLAND, OR, 97203",...)
There are about 172,000 entries in the "parcels" dataframe and 285 in the "rentals" dataframe. My first solution was to match character values using grepl, which I don't think worked:
matches = grepl(rentals$address, parcels$address, fixed = TRUE)
This returns FALSE for every entry in parcels$address, but when I copy and paste some values of rentals$address into Excel's Ctrl+F search over the "parcels" data, I do find a few of them. So some appear to match.
What is the best way to find which values in the "address" column of the "rentals" dataframe appear as a matching character sequence in the "parcels" dataframe?
Are the addresses all exact matches? That is, no variations in spacing, capitalization, apartment number? If so, you might be able to use the dplyr function left_join to create a new df, using the address as the key, like so
library(dplyr)
df_compare <- df_rentals %>%
  left_join(df_parcels, by = "address")
Additionally, if you have columns along the lines of df_rentals$rentals = "yes" and df_parcels$parcels = "yes", you can filter the resulting new dataframe:
df_both <- filter(df_compare, rentals == "yes", parcels == "yes")
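If the addresses are not exact matches, because rentals$address holds only the street portion while parcels$address also carries city, state and ZIP (as in the examples above), an exact join will never hit. Here is a minimal sketch of a substring match instead, assuming the data frames are named rentals and parcels as in the question (grepl only takes a single pattern at a time, so loop over the shorter rentals side):
library(dplyr)
# For each rental address, find the parcel rows whose address contains it verbatim
hits <- lapply(rentals$address, function(addr) {
  which(grepl(addr, parcels$address, fixed = TRUE))
})
# Keep only the rentals that matched at least one parcel address
matched <- data.frame(rental_address = rentals$address,
                      n_matches = lengths(hits)) %>%
  filter(n_matches > 0)
With 285 rentals against roughly 172,000 parcels this is about 49 million substring tests, which is still a manageable amount of work.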

filter different values in the same column

I have a 2018 Premier League dataset and I'm trying to practise with it. I want to take the Arsenal and Chelsea data but I can't. It works with the age column, but it doesn't with "Chelsea" and "Arsenal":
CheArs <- premier %>% filter(Current Club == "Chelsea", Current Club == "Arsenal"))
You are looking for filter with OR (the OR operator is written |), so a good idea is to use dplyr and do:
library(dplyr)
CheArs <- premier %>%
  filter(`Current Club` == "Chelsea" | `Current Club` == "Arsenal")
This will give you a new dataframe with only the information about Chelsea or Arsenal.
I think the problem is the space in the column name. You need to wrap it in backticks, like this:
library(dplyr)
premier <- tibble(`Current Club` = c("Arsenal", "Chelsea"), Matches = c(5, 10))
premier %>% filter(`Current Club` == "Arsenal" | `Current Club` == "Chelsea")
The other problem is the comma in filter: a comma means logical AND, so if you want logical OR you need the | operator instead.
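As a side note, an equivalent and slightly more compact filter, assuming the same premier data with the backticked column name, uses %in%:
premier %>% filter(`Current Club` %in% c("Arsenal", "Chelsea"))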

simple Join error where some rows join and some don't

I have two dataframes which I'm trying to join. It should be straightforward, but I see some anomalous behaviour.
Dataframe A
Name Sample Country Path
John S18902 UK /Home/drive/John
BOB 135671 USA /Home/drive/BOB
Tim GB12345_serum_63 UK /Home/drive/Tim
Wayne 12345_6789 UK /Home/drive/Wayne
Dataframe B
Surname Sample State FILE
Paul S18902 NJ John.csv
Gem 135671 PP BOB.csv
Love GB12345_serum_63 AP Tim.csv
Dave 12345_6789 MK Wayne.csv
I am using R markdown to do a simple join using the following command
Dataframec <- DataframeA %>%
  left_join(DataframeB, by = "Sample")
All rows join apart from the row where Sample == "GB12345_serum_63".
There should be a simple fix to this, but I'm out of ideas.
Thank you
If you are cutting and pasting your data directly into your question, the reason for this is that your key values are technically different: they have different numbers of trailing spaces.
I copied and pasted from your question, from the beginning of the value to the start of the adjacent column name (so to 'Country' in the first case and to 'State' in the second):
DataframeA: "GB12345_serum_63"
DataframeB: "GB12345_serum_63 "
You can see that for DataframeB there are three space characters after the value. This can be resolved by stripping leading and trailing whitespace from your key values with a regular expression: gsub("^\\s+|\\s+$", "", x)
DataframeA$Sample <- gsub("^\\s+|\\s+$", "", DataframeA$Sample)
DataframeB$Sample <- gsub("^\\s+|\\s+$", "", DataframeB$Sample)
Now your join should work
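For what it's worth, base R's trimws() strips leading and trailing whitespace in the same way, so an equivalent sketch of the fix (assuming the same data frame names) is:
DataframeA$Sample <- trimws(DataframeA$Sample)
DataframeB$Sample <- trimws(DataframeB$Sample)
Dataframec <- DataframeA %>%
  left_join(DataframeB, by = "Sample")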

Make only numeric entries blank

I have a dataframe with UK postcodes in it. Unfortunately some of the postcode data is incorrect, i.e. values that are only numeric (all UK postcodes should start with an alphabetic character).
I have done some research and found the grepl command that I've used to generate a TRUE/FALSE vector if the entry is only numeric,
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
However, what I really want to do is make the postcode blank wherever the value starts with a number.
Note, I don't want to remove the rows with an incorrect postcode, as I would lose the information in the other variables. I simply want to remove that postcode.
Example data
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton 1254
London 1290C
Newcastle N1 3DC
Desired output
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton
London
Newcastle N1 3DC
There are a few ways to go from a TRUE/FALSE vector to the kind of replacement you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
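If NA is acceptable, a minimal sketch (assuming the data frame is called Data, as above) that overwrites the bad postcodes in place rather than adding a new column:
Data$PostCode[grepl("^[0-9]", Data$PostCode)] <- NA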

Name matching with different length data frames in R

I have two dataframes with numerous variables. Of primary concern are the following variables, df1.organization_name and df2.legal.name. I'm just using fully qualified SQL-esque names here.
df1 has dimensions of 15 x 2,700 whereas df2 has dimensions of 10 x 40,000. Essentially, the 'common' or 'matching' columns are name fields.
I reviewed this post Merging through fuzzy matching of variables in R and it was very helpful but I can't really figure out how to wrangle the script to get it to work with my dfs.
I keep getting an error:
Error in which(organization_name[i] == LEGAL.NAME) : object 'LEGAL.NAME' not found
Desired Matching and Outcome
What I am trying to do is compare every one of my df1.organization_name values to every one of the df2.legal_name values and flag the pairs that are a very close match (say >= 85% similar). Then, as in the linked script, take the matched customer name and the matched comparison name and put them into a data.frame for later analysis.
So, if one of my customer names is 'Johns Hopkins Auto Repair' and one of my public list names is 'John Hopkins Microphone Repair', I would call that a good match, and I want some sort of indicator appended to my customer list (in another column) that says 'Partial Match' along with the name from the public list.
Example(s) of the dfs for text wrangling:
df1.organization_name (these are fake names b/c I can't post customer names)
- My Company LLC
- John Johns DBA John's Repair
- Some Company Inc
- Ninja Turtles LLP
- Shredder Partners
df2.LEGAL.NAME (these are real names from the open source file)
- $1 & UP STORE CORP.
- $1 store 0713
- LLC 0baid/munir/gazem
- 1 2 3 MONEY EXCHANGE LLC
- 1 BOY & 3 GIRLS, LLC
- 1 STAR BEVERAGE INC
- 1 STOP LLC
- 1 STOP LLC
- 1 STOP LLC DBA TIENDA MEXICANA LA SAN JOSE
- 1 Stop Money Centers, LLC/Richard
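A minimal base-R sketch of that approach, assuming the relevant columns are df1$organization_name and df2$LEGAL.NAME (adjust the names to your actual data): it uses adist() for edit distance and keeps the best candidate whenever the normalized similarity reaches 85%.
best_match <- function(name, candidates, threshold = 0.85) {
  # Normalized similarity: 1 - edit distance / length of the longer string
  d   <- adist(tolower(name), tolower(candidates))[1, ]
  sim <- 1 - d / pmax(nchar(name), nchar(candidates))
  i   <- which.max(sim)
  data.frame(organization_name = name,
             legal_name        = if (sim[i] >= threshold) candidates[i] else NA_character_,
             similarity        = sim[i],
             match_type        = if (sim[i] >= threshold) "Partial Match" else "No Match")
}
matches <- do.call(rbind, lapply(df1$organization_name, best_match,
                                 candidates = df2$LEGAL.NAME))
Looping over the 2,700 customer names keeps memory modest (one 40,000-long distance vector at a time instead of a single 2,700 x 40,000 matrix); the stringdist package offers faster alternatives if this is too slow.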
