Insert a space at a specific place from the right with classic ASP - asp-classic

I'm working with some UK postcode data (around 60,000 entries) in a SQL 2008 database and need to manipulate a string containing a postcode.
The original form data was collected with no validation, so the postcodes are held in different formats: CA12HW can also appear as CA1 2HW (correctly formatted).
UK postcodes vary in length and letter/number mix; the only constant is that all codes finish with a space, a number, a letter, and a letter.
I am only interested in the first part of the code, i.e. before the space. Therefore I am looking to write a piece of code that does the following:
1. Check for a space 4th from the right.
2. If there is no space, insert one 4th from the right.
3. Split the string at the space.
So far I have:
PostCode = "CA30GX"
SpaceLocate = InStr(PostCode, " ")
If SpaceLocate = 0 Then 'Postcode needs a space
If the only constant is that there should be a space 4th from the right, how do I insert one?
Once the space is inserted I can split the code to use as I need.
PostcodeArray = Split(Postcode, " ")
Now PostcodeArray(0) is equal to "CA3", PostcodeArray(1) is equal to "0GX"
Any help would be appreciated.

You can just recreate the string:
PostCode = Left(PostCode, 3) & " " & Right(PostCode, 3)
PostcodeArray = Split(PostCode, " ")
Edit: to handle postcodes of any length, take everything except the last three characters for the first part:
PostCode = Left(PostCode, Len(PostCode) - 3) & " " & Right(PostCode, 3)

You can use the Left and Right string functions to do this (keeping the last three characters as the second part, so it works for any postcode length):
newCode = Left(Postcode, Len(Postcode) - 3) & " " & Right(Postcode, 3)

Related

Splitting a column in a dataframe in R into two based on content

I have a column in an R dataframe that holds a product weight, e.g. 20 kg, but it mixes measuring systems, e.g. 1 lbs & 2 kg etc. I want to separate the value from the measurement, put them in separate columns, and then convert them in a new column to a standard weight. Any thoughts on how I might achieve that? Thanks in advance.
Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a `data.frame`.
Note: When using as.numeric, the decimal point has to be a dot. If you have commas instead, you need to replace them by a dot, for example via gsub(",","\\.",...).
tidyr's separate() can also do this in one call:
separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ")
My case was simple enough that a single space separator sufficed, but I learned you can also use a regular expression here for more complex separator considerations.
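The same split-and-convert idea, sketched in plain Python for illustration (the function and conversion table are my own; 1 lb = 0.45359237 kg is the standard factor):

```python
TO_KG = {"kg": 1.0, "lbs": 0.45359237}  # standard conversion factors

def to_standard_weight(s):
    """Split '20 kg' into (value, unit) and convert the value to kilograms."""
    value, unit = s.split(" ")
    return float(value), unit, float(value) * TO_KG[unit]

to_standard_weight("20 kg")   # (20.0, 'kg', 20.0)
to_standard_weight("50 lbs")  # roughly 22.68 kg
```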

Extract larger body of character data with stringr?

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case-history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique words, ("CASE HISTORY & INVESTIGATOR:"), to bound the text I would like to extract? If not, what sort of approach can I take to extracting the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
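The same "everything between two markers" idea can also be expressed as a single non-greedy regex capture; a minimal Python sketch with invented sample text:

```python
import re

# A minimal stand-in for the report text (hypothetical content)
text = ("HEADER STUFF\nCASE HISTORY\n This is the narrative.\n"
        "It spans lines.\nINVESTIGATOR: HERCULE POIROT\n")

# Non-greedy capture of everything between the two marker words;
# re.DOTALL lets '.' cross newlines, like gsub's behaviour on the full string.
m = re.search(r"CASE HISTORY(.*?)INVESTIGATOR:", text, flags=re.DOTALL)
narrative = m.group(1).strip()
```

Note the lazy `.*?`: it stops at the first "INVESTIGATOR:", so an earlier "INVESTIGATOR'S" (no colon) in the body does not end the match.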
After much deliberation I came to a solution I feel is worth sharing, so here we go:
# unlist text_data
file_contents_unlist <-
paste(unlist(text_data), collapse = " ")
# read lines, squish for good measure.
file_contents_lines <-
file_contents_unlist%>%
readr::read_lines() %>%
str_squish()
# Create indices in the lines of our text data based upon regex grepl()
# matches; be sure they align if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
# function basically states, "give me back whatever's in those indices".
pull_case_num <-
function(index_case_num_1, index_case_num_2){
(file_contents_lines[index_case_num_1:index_case_num_2]
)
}
# map2() to iterate.
case_nums <- map2(index_case_num_1,
index_case_num_2,
pull_case_num)
# transform to dataframe
case_nums_df <- as.data.frame.character(case_nums)
# Repeat pattern for other vectors as needed.
index_case_hist_1 <-
which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <-
which(grepl("Case#: ", file_contents_lines))
pull_case_hist <- function(index_case_hist_1,
index_case_hist_2 )
{(file_contents_lines[index_case_hist_1:index_case_hist_2]
)
}
case_hist <- map2(index_case_hist_1,
index_case_hist_2,
pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the vectors, also a good call place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
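The which(grepl(...)) / map2() pattern above amounts to pairing up start and end indices and slicing the lines between them; a minimal Python stand-in (the sample lines are invented):

```python
lines = ["HEADER", "CASE HISTORY", "narrative line 1",
         "narrative line 2", "Case#: 012-345-678", "FOOTER"]

# Index of each start and end marker (like which(grepl(...)) in R)
starts = [i for i, l in enumerate(lines) if "CASE HISTORY" in l]
ends = [i for i, l in enumerate(lines) if l.startswith("Case#:")]

# Pair them up (like map2()) and pull everything in between
chunks = [lines[s:e + 1] for s, e in zip(starts, ends)]
```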
Thanks all for responding. I hope this solution helps someone out there in the future. :)

How to remove a dash (-) and \n at the same time, or some solution to this text cleaning

I have been trying to remove the dash (-) and \n for quite some time and it is not working.
I have tried using this code to remove -
gsub(" - ", " ", df1$text)
I have also tried using this code to remove \n
gsub("[\n]", " ", df1$text)
However, when I remove \n it becomes "abc-", and when I remove the dash (-) it becomes "abc\n". It just loops. All these results are in the console.
When I try removing \n, the console result is:
Df1
Id text
1 I have learnt abc-d in school.
2 I want app-le.
3 Going to sc-hool is fun.
When I try removing the dash (-), the console result is:
Df1
Id text
1 I have learnt abc\nd in school.
2 I want app\nle.
3 Going to sc\nhool is fun.
This just loops and loops. I tried removing \n, then the dash (-), and so on over and over.
This is the data in dataframe. (It always stays the same)
Id text
1 I have learnt abc- d in school.
2 I want app- le.
3 Going to sc- hool is fun
In the dataframe there is a space right after the dash (-).
The data I am using is news articles; I copied and pasted them into an Excel file, but I am trying to clean them with R.
Could someone help me out with this. Thanks!
I don't mind sharing the data with you privately, but just do not disclose it. Because it is my school project.
gsub(" - ", " ", df2$text) is looking for a space, then a dash, then a space. The examples you give, like app- le, don't have a space before the dash, so they won't match. If you want to match a space next to a dash only if it's there, use ? to quantify the space:
df2 = read.table(text = 'Id|text
1|I have learnt abc- d in school.
2|I want app- le.
3|Going to sc- hool is fun', sep = "|", header = TRUE)
gsub(" ?- ?", " ", df2$text)
# [1] "I have learnt abc d in school." "I want app le." "Going to sc hool is fun"
## Maybe you want to replace it with nothing, not with a space?
gsub(" ?- ?", "", df2$text)
# [1] "I have learnt abcd in school." "I want apple." "Going to school is fun"
Since your example doesn't include any line breaks, I can't really tell what the issue is.
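If the text really does contain both artifacts, one alternation can handle the dash and the line break in a single pass (in R that would be gsub(" ?- ?|\n", "", df2$text)); sketched in Python here with invented sample strings:

```python
import re

texts = ["I have learnt abc- d in school.\n", "I want app- le.\n"]

# One pattern: a dash with optional surrounding spaces, OR a newline
cleaned = [re.sub(r" ?- ?|\n", "", t) for t in texts]
# ['I have learnt abcd in school.', 'I want apple.']
```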

How can I count the words of a line and keep the id of the line with Apache Pig?

I have a file with 2 columns, the first one with an id and the second one with a long text, and I need to know how to count the words for each id.
For example if I have these two rows:
id | line
(1, This country is beautiful)
(2, I would love to have a cup of tea)
The answer I need is:
(1, 4)
(2, 9)
I have read a lot of comments about this, but everyone either counts the total occurrences of each word or the total number of words, without keeping the id of the line.
I would appreciate if someone could help me.
Something like:
FOREACH row GENERATE
id,
COUNT(STRSPLITTOBAG(line, ' '));
This should take each row, produce the needed id field, and then split the text on a delimiter (here a ' ' value) into a bag type, where the COUNT function counts the number of items in the bag.
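For readers outside Pig, the per-id word count is just a split-and-length; a minimal Python sketch using the question's example rows:

```python
rows = [(1, "This country is beautiful"),
        (2, "I would love to have a cup of tea")]

# Keep the id, count whitespace-separated tokens in the text column
counts = [(row_id, len(line.split())) for row_id, line in rows]
# [(1, 4), (2, 9)]
```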

Split character string into unequal segments R

I have some data that I need to split into multiple elements, but there is no specific identifier within the row to split on. I know the positions of the different variables; is there a way I can split the string into multiple uneven parts based on my prior information? Example:
String: " 00008  L  1957110642706 194711071019561030R 1/812.5000000"
Desired result:
" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000"
So, my prior information is that the first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc.
xstr <- " 00008  L  1957110642706 194711071019561030R 1/812.5000000"
Rather than use this description:
first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc. ...
I'm just going to take the desired widths from your specified answer (nchar(res)):
res <- c(" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000")
Make sure that all variables are read as character strings:
res2 <- read.fwf(textConnection(xstr),widths=nchar(res),
colClasses=rep("character",length(res)))
Test:
all.equal(unname(unlist(res2)),res) ## TRUE
You can also use a simple substr function over your array of read lines:
my_lines <- readLines("your_file") # or however you read the lines in
firstColumn <- substr(my_lines, 1, 7) # wrap in as.numeric() etc. if needed
secondColumn <- substr(my_lines, 8, 15) # 8 characters starting at position 8
# ...etc
rm(my_lines) # to save memory
Sometimes this is actually faster than the various read.* functions, especially if you don't use them correctly.
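The same fixed-width slicing is easy to sketch in Python; here the widths are derived from the desired result in the question rather than typed by hand:

```python
# Fields taken from the desired result in the question
res = [" 00008 ", " ", "L", " ", " ", "19571106", "42706", " ",
       "19471107", "10", "19561030", "R 1/8", "12.5000000"]
widths = [len(f) for f in res]  # 7, 1, 1, 1, 1, 8, 5, 1, 8, 2, 8, 5, 10
xstr = "".join(res)             # reconstruct the original fixed-width record

# Slice the record at the known positions
fields, pos = [], 0
for w in widths:
    fields.append(xstr[pos:pos + w])
    pos += w

fields == res  # True
```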