I need to create a Sybase report in which each field has a fixed width and there is a fixed number of spaces between fields. For example, considering TABA:
OUTPUT
NAME SURNAME TELEPHONE
ROBERT ROSSI 0123456789
So between the columns NAME and SURNAME there should be 20 spaces, and between SURNAME and TELEPHONE there should be 35 spaces.
Is there a way to do this in Sybase?
Thank you
I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique words, ("CASE HISTORY & INVESTIGATOR:"), to bound the text I would like to extract? If not, what sort of approach can I take to extracting the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
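If the leading and trailing whitespace in that result is unwanted, the two substitutions can probably be folded into one call that captures only the narrative. This is a minimal sketch on a shortened, made-up string (the names `text` and `narrative` are just for illustration); it uses `perl = TRUE` with `(?s)` so that `.` matches newlines explicitly.

```r
# Shortened stand-in for one report; the real documents are much longer.
text <- "HEADER\nCASE HISTORY\n First line of narrative.\nSecond line.\nINVESTIGATOR: H. POIROT\n"

# (?s) lets "." match newlines. The lazy group keeps only what sits between
# "CASE HISTORY" and "INVESTIGATOR:", and the \\s* on either side drops the
# surrounding whitespace.
narrative <- sub(
  "(?s)^.*?CASE HISTORY\\s*(.*?)\\s*INVESTIGATOR:.*$",
  "\\1",
  text,
  perl = TRUE
)

narrative
```

If a document has no "CASE HISTORY" marker at all, `sub` returns the input unchanged, so it is worth checking for that case before trusting the result.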
After much deliberation I came to a solution I feel is worth sharing, so here we go:
# Load the packages used below (%>% and str_squish() come from stringr, map2() from purrr).
library(stringr)
library(purrr)
# unlist text_data
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")
# read lines, squish for good measure.
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()
# Create indices in the lines of our text data based upon regex grepl
# functions; be sure they match if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)",
file_contents_lines))
# function basically states, "give me back whatever's in those indices".
pull_case_num <-
function(index_case_num_1, index_case_num_2){
(file_contents_lines[index_case_num_1:index_case_num_2]
)
}
# map2() to iterate.
case_nums <- map2(index_case_num_1,
index_case_num_2,
pull_case_num)
# transform to dataframe
case_nums_df <- as.data.frame.character(case_nums)
# Repeat pattern for other vectors as needed.
index_case_hist_1 <-
which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <-
which(grepl("Case#: ", file_contents_lines))
pull_case_hist <- function(index_case_hist_1,
index_case_hist_2 )
{(file_contents_lines[index_case_hist_1:index_case_hist_2]
)
}
case_hist <- map2(index_case_hist_1,
index_case_hist_2,
pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the vectors, also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)
I'm cleaning up a data set. One of the difficulties is that some of the rows have wrestler names merged with wrestling company names without spaces.
Date Match
2001-06-16 American Dragon Defeats Jerry LynnMCW
1943-10-07 Lou Thesz Defeats Jack McDonaldGAC
1955-03-25 Buddy Rogers Defeats Danny McShain
To fix this, I use the following line to remove the company name by getting rid of a capitalized letter and everything that comes after if that capital follows a lower case letter:
data_set_2 <- data_set %>%
mutate(match = str_remove(match, "(?<=[:lower:])[:upper:].*"))
However, in the case of names with multiple capitalizations, like McDonald, the result looks like this:
date match
2001-06-16 American Dragon Defeats Jerry Lynn
1943-10-07 Lou Thesz Defeats Jack Mc
1955-03-25 Buddy Rogers Defeats Danny Mc
To fix this, I've tried to make it so that names only have one capitalization, by trying to lower a capital that comes after Mc:
data_set_2 <- data_set %>%
mutate(match = str_to_title(match, "(?<=Mc)[:upper:]"))
However, the below is the result:
Date Match
2001-06-16 American Dragon Defeats Jerry Lynnmcw
1943-10-07 Lou Thesz Defeats Jack Mcdonaldgac
1955-03-25 Buddy Rogers Defeats Danny Mcshain
As you can see, it is lowering everything, and not isolating the lower to just the one letter. I'm trying to think of a way to isolate the one character, but nothing I've tried has worked. Any ideas are appreciated. Thanks!
If the company names are all-caps, as in your first two examples, you could replace the empty string matched by the following regular expression with a space: (?<=[a-z])(?=[A-Z]{2}). (?<=[a-z]) is a positive lookbehind, requiring the match to be preceded by a lowercase letter. (?=[A-Z]{2}) is a positive lookahead, requiring the match to be followed by two uppercase letters. If the second character of the company name can be lowercase there is no logical way to determine if the substring is just a name or a name followed by a company name (e.g. MacDonald).
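As a rough sketch of that idea in base R (assuming, as above, that company names are at least two capital letters glued to the end of a lowercase-ending name; the vector `matches` is made up from the question's examples), the lookarounds need `perl = TRUE`:

```r
matches <- c(
  "American Dragon Defeats Jerry LynnMCW",
  "Lou Thesz Defeats Jack McDonaldGAC",
  "Buddy Rogers Defeats Danny McShain"
)

# Insert a space at the empty position between a lowercase letter
# and a run of two uppercase letters.
spaced <- gsub("(?<=[a-z])(?=[A-Z]{2})", " ", matches, perl = TRUE)

# Or drop the trailing all-caps company name entirely.
stripped <- sub("(?<=[a-z])[A-Z]{2,}$", "", matches, perl = TRUE)

spaced
stripped
```

"McDonald" and "McShain" are left alone because the capital after "Mc" is followed by a lowercase letter, so the two-uppercase lookahead does not match there.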
I am scraping https://www.transparency.org/news/pressreleases/year/2010 to retrieve the header and details from each page. Along with the header and details, though, a telephone number and a blank string also show up in the retrieved list for every page.
[1] "See our simple, animated definitions of types of corruption and the ways to challenge it."
[2] "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
[3] " "
[4] "+49 30 3438 20 666"
I have tried the following code, but it didn't work.
html %>% str_remove('+49 30 3438 20 666') %>% str_remove(' ').
How can these elements be removed?
Is it because you failed to escape the + sign?
From this cheatsheet,
Metacharacters (. * + etc.) can be used as
literal characters by escaping them. Characters
can be escaped using \ or by enclosing them
in \Q...\E.
s = "+49 30 3438 20 666"
str_remove(s, "\\+49 30 3438 20 666")
# ""
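If escaping feels error-prone, another option is to skip regex interpretation altogether. A small base R sketch, assuming the number is always exactly this string (`cleaned` is just an illustrative name):

```r
s <- "+49 30 3438 20 666"

# fixed = TRUE treats the pattern as a literal string,
# so the "+" needs no escaping.
cleaned <- sub("+49 30 3438 20 666", "", s, fixed = TRUE)

cleaned
```

Like str_remove, this leaves an empty string behind rather than dropping the element from the vector.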
In case you want to drop all lines that start with a + and end with a number:
dd <- c(
"See our simple, animated definitions of types of corruption and the ways to challenge it."
, "Judiciary - Commenting on Justice Bean’s sentencing in the BAE Systems’ Tanzania case, Transparency International UK welcomed the Judge’s stringent remarks concerning BAE Systems’ past conduct."
," "
, "+49 30 3438 20 666")
cleaned <- dd[!grepl("^\\+.*\\d$", dd)]  # \\d$ requires the line to end in a digit; avoid naming the result c, which masks base::c
You can also use \\s (a single whitespace character) and \\d{2} (exactly two digits) to build an exact match, to be on the safe side, if all numbers have the same format. Note that you can also use the pattern in str_remove, with the end result being an empty string. grepl instead returns a logical vector that you can use to subset your vector.
If you also want to delete all empty lines:
dd[!grepl("^\\s*$",dd)]
Note that you can do both at the same time by using "|":
dd[!grepl("^\\+.*\\d$|^\\s*$", dd)]
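To illustrate the stricter \\s and \\d{n} variant mentioned above, here is a sketch that assumes every phone number follows the exact "+49 30 3438 20 666" digit grouping; the sample vector and the names `phone`, `blank` and `kept` are made up for the example.

```r
dd <- c(
  "Some press release text.",
  " ",
  "+49 30 3438 20 666",
  "+1 is not a phone number"
)

# One group of digits per block of the number, separated by single whitespace.
phone <- "^\\+\\d{2}\\s\\d{2}\\s\\d{4}\\s\\d{2}\\s\\d{3}$"
# Lines that are empty or contain only whitespace.
blank <- "^\\s*$"

# Keep everything that is neither a phone number nor blank.
kept <- dd[!grepl(paste(phone, blank, sep = "|"), dd, perl = TRUE)]

kept
```

Unlike the looser "starts with +" pattern, this keeps ordinary text that merely begins with a plus sign.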
You can get familiar with regex here: https://regex101.com/
I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I just remove the number? Also, how do I collapse the repeated spaces into just one space?
Thanks for any help you might have!
Example data:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
You may use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
Details
(\\s){2,} - matches 2 or more whitespace chars while capturing the last occurrence that will be reinserted using the \1 placeholder in the replacement pattern
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0+ whitespace chars
(Votes) - Group 2 (will be restored in the output using the \2 placeholder): a Votes substring.
Note that trimws will remove any leading/trailing whitespace.
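If the single combined pattern is hard to read, the same result can probably be reached in two separate steps. This is only a sketch on a shortened, made-up string (`x`, `no_number` and `result` are illustrative names), using a lookahead so that "Votes" itself is never consumed:

```r
x <- "NO. 1 Shall Title 15 be amended?   558,586   Votes"

# Step 1: remove a number (digits and commas) plus any following
# whitespace, but only when "Votes" comes right after it.
no_number <- gsub("\\d[0-9,]*\\s*(?=Votes)", "", x, perl = TRUE)

# Step 2: collapse runs of whitespace to one space and trim the ends.
result <- trimws(gsub("\\s+", " ", no_number))

result
```

The numbers 1 and 15 survive because they are not directly followed by "Votes".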
Easiest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a base R version that will pull out the number before the word "Votes" even if it has commas or periods in it:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
[1] "558,586"
If you want the label too, then just throw out the gsub part:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) ))
[1] "558,586 Votes"
And if you want to pull out all the numbers:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
[1] "1" "15" "202" "558,586"
In an sqlite3 table we have names like the ones below.
Monty Python
Luther Blissett
Rey maR EsteBaN
Monty Cantsin
Geoffrey Cohen
Karen Eliot
Poor Konrad
I need a query to fetch characters from the database.
Initially I need a query that fetches the first letter of every word in a string. In the above example it should display M, P, L, B, R, E, C, G, K.
If the user then selects C, it should fetch the next possible characters of Cantsin and Cohen, i.e. A and O.
Any input on how to arrive at a solution would be appreciated.
The solution below works if only the first word of each name is considered.
Consider data:
sqlite> SELECT name FROM COMPANY;
Paul Allen
Mark
Monty Python
Luther Blissett
Monty Cantsin
For 1st query:
sqlite> SELECT distinct substr(name,1,1) FROM COMPANY;
P
M
L
For the 2nd output, try the algorithm below:
If the input is 1 character, use substr(name, 2, 1).
If the input is 2 characters, use substr(name, 3, 1).
More generally: if x is the number of characters entered and y is the number of characters expected back, use substr(name, x+1, y).
sqlite> SELECT distinct substr(name,2,1) FROM COMPANY where name like 'M%';
a
o
sqlite> SELECT distinct substr(name,3,1) FROM COMPANY where name like 'Ma%';
r
If other words also need to be considered in the search, replace the table name with a subquery as below.
SELECT distinct substr(name,2,1) FROM
(
select name from company where name like 'A%' -- counts for first word
UNION
select name from company where name like '% A%' -- counts for words that start with space
);
Please note that the above query still returns characters from the first word only: although the match considers all words, substr is always applied from the start of the name.
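To show what the x+1 rule would look like when every word counts, here is a small sketch of the logic outside the database, written in R. It is not a drop-in SQLite query; the sample names and the helper `next_chars` are hypothetical and only mirror substr(name, x+1, 1) word by word.

```r
names <- c("Monty Python", "Luther Blissett", "Monty Cantsin", "Geoffrey Cohen")

# Split every name into words so that later words count too.
words <- unlist(strsplit(names, "\\s+"))

# Distinct characters that can follow `prefix` at the start of any word.
# With x = nchar(prefix), this is substr(word, x + 1, x + 1).
next_chars <- function(words, prefix) {
  hits <- words[startsWith(words, prefix) & nchar(words) > nchar(prefix)]
  pos <- nchar(prefix) + 1
  unique(substr(hits, pos, pos))
}

first_letters <- next_chars(words, "")   # nothing typed yet
after_c       <- next_chars(words, "C")  # user picked "C"

first_letters
after_c
```

The comparison here is case-sensitive; if "c" should also match "Cantsin", both sides would need to be lowercased first.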