Split a character string into unequal segments in R

I have some data that I need to split into multiple elements, but there is no specific identifier within the row to split on. I know the positions of the different variables; is there a way to split the string into multiple uneven parts based on this prior information? Example:
String: " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Desired result:
" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000"
So, my prior information is that the first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc.

xstr <- " 00008 L 1957110642706 194711071019561030R 1/812.5000000"
Rather than use this description:
first element begins on the first position and is seven spaces long; the second begins at the 8th position in the string and is 8 spaces long; the 3rd element starts at the 16th position and is 1 space long, etc, etc. ...
I'm just going to take the desired widths from your specified answer (nchar(res)):
res <- c(" 00008 "," ","L"," "," ","19571106","42706"," ","19471107","10","19561030","R 1/8","12.5000000")
Make sure that all variables are read as character strings:
res2 <- read.fwf(textConnection(xstr),widths=nchar(res),
colClasses=rep("character",length(res)))
Test:
all.equal(unname(unlist(res2)),res) ## TRUE

You can also simply apply substr to the vector of lines you have read:
my_lines <- readLines("your_file") # or however you read the lines in
firstColumn <- substr(my_lines, 1, 7) # wrap in as.numeric() etc. if needed
secondColumn <- substr(my_lines, 8, 15) # second field: starts at position 8, 8 characters long
# ..etc
rm(my_lines) # to save memory
Sometimes this is actually faster than the read.* functions, especially if you don't use them correctly.
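When the field widths are known up front, the start and end positions can be derived from them instead of being typed by hand. The following is a minimal sketch of that idea using base R's substring; the helper name split_fixed, the sample record and its widths are made up for illustration and are not taken from the question's data.

```r
# Hypothetical helper: cut one string into fields of the given widths.
split_fixed <- function(x, widths) {
  ends <- cumsum(widths)        # last position of each field
  starts <- ends - widths + 1   # first position of each field
  substring(x, starts, ends)    # vectorised over starts and ends
}

# Made-up record: a 2-character code, a 4-digit number, a 3-letter tag.
fields <- split_fixed("AB1234XYZ", widths = c(2, 4, 3))
print(fields)
```

For a whole file, the same helper could be applied to each element of the vector returned by readLines, for example with lapply.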

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
in R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any character into a consecutively numbered group, and the (?!...) negative lookaheads make sure each subsequent . does not match any of the already captured characters.
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match, like stringr::str_extract, takes the input string as the first argument and the regex as the second argument.
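Since the question asks for a position rather than the matched text, the same pattern can also be passed to base R's regexpr with perl = TRUE, which reports where the match starts and how long it is. This is only a sketch, assuming the position wanted is that of the last character of the first run of four distinct characters; the variable names are illustrative.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# perl = TRUE is needed for lookaheads and backreferences.
m <- regexpr(pattern, txt, perl = TRUE)

start_pos <- as.integer(m)                           # first character of the match
end_pos <- start_pos + attr(m, "match.length") - 1L  # last character of the match
print(c(start = start_pos, end = end_pos))
```

For this input the match "vwbj" starts at position 2, so the end position is 5, which agrees with the "5th character" described in the question.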

Splitting a column in a dataframe in R into two based on content

I have a column in an R data frame that holds a product weight, e.g. 20 kg, but it has mixed measuring systems, e.g. 1 lbs and 2 kg. I want to separate the value from the unit of measurement, put them in separate columns, and then convert them to a standard weight in a new column. Any thoughts on how I might achieve that? Thanks in advance.
Assume you have the column given as
x <- c("20 kg","50 lbs","1.5 kg","0.02 lbs")
and you know that there is always a space between the number and the measurement. Then you can split this up at the space-character, e.g. via
splitted <- strsplit(x," ")
This results in a list of vectors of length two, where the first is the number and the second is the measurement.
Now grab the numbers and convert them via
numbers <- as.numeric(sapply(splitted,"[[",1))
and grab the units via
units <- sapply(splitted,"[[",2)
Now you can put everything together in a data.frame.
Note: When using as.numeric, the decimal point has to be a dot. If you have commas instead, you need to replace them with a dot, for example via gsub(",","\\.",...).
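To cover the conversion step the question also asks about, the pieces above can be put together roughly as follows. This is a sketch that assumes kilograms as the standard unit and that only "kg" and "lbs" occur; the names weights, to_kg and weight_kg are illustrative.

```r
x <- c("20 kg", "50 lbs", "1.5 kg", "0.02 lbs")

# Split each entry at the space into number and unit.
parts <- strsplit(x, " ", fixed = TRUE)
weights <- data.frame(
  value = as.numeric(sapply(parts, "[[", 1)),
  unit = sapply(parts, "[[", 2),
  stringsAsFactors = FALSE
)

# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition).
to_kg <- c(kg = 1, lbs = 0.45359237)
weights$weight_kg <- weights$value * unname(to_kg[weights$unit])
print(weights)
```

A unit that is missing from to_kg would yield NA in weight_kg, which makes unexpected units easy to spot.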
separate(DataFrame, VariableName, into = c("Value", "Metric"), sep = " ") # separate() is from the tidyr package
My case was simple enough that I could get away with a single space as the separator, but I learned that sep can also be a regular expression for more complex separators.

Retain string till character limit with last complete word, and store remaining words in 2nd variable

Take these example strings. I want to split them so that each is limited to X or fewer characters and ends on a complete word, with the remaining part stored in another column. The words are always separated by a space. I came across this partial solution in T-SQL (it doesn't create a variable for the extra words), but I need to do it in R. In a previous question I was given the first half of a solution, which doesn't put the remaining words in a new variable. I need help creating that new variable.
{gsub(patt="(^.{2,100})([ ].+)", repl="\\1",y)}
For example:
XOVEW VJIEW NI **stays** XOVEW VJIEW NI (assuming X is 14)
XOVEW VJIEW NIGOI **becomes** XOVEW VJIEW (NIGOI goes to a new vector)
XOVEW VJIEWNIGOI **becomes** XOVEW (assuming X is 14)
So new variable will contain c("NIGOI","VJIEWNIGOI") coming from 2nd and 3rd row above.
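vect <- c("XOVEW VJIEW NI", "XOVEW VJIEW NIGOI", "XOVEW VJIEWNIGOI") # example strings from the question
v1 <- ifelse(nchar(vect) > 14, gsub("(.*)\\s+(\\w+)", "\\1 - \\2", vect), vect) # mark the last word with a dash
m <- do.call(rbind, lapply(strsplit(v1, split = "-"), `length<-`, 2)) # split at the dash, pad to 2 columns
values <- data.frame(m) # the matrix m, printed below, as a data frame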
Output:
[,1] [,2]
[1,] "XOVEW VJIEW NI" NA
[2,] "XOVEW VJIEW " " NIGOI"
[3,] "XOVEW " " VJIEWNIGOI"
First, nchar checks whether each string is longer than 14 characters (see ?nchar for details).
Wherever it is, gsub rewrites the string as two parts separated by a dash. The dash only serves to mark where to split: the first part is everything except the last word, and the second part is the last word.
In the regex, a dot matches any character and a star means zero or more repetitions (so .* matches any run of characters), \\s+ matches one or more whitespace characters, and \\w+ matches one or more word characters, i.e. a single word. Because .* is greedy, the match separates the last word from the rest of the string; inside ifelse this is applied only when the string is longer than 14 characters. The two capture groups are written back as \\1 and \\2 with a dash between them, where \\1 is everything before the last word and \\2 is the last word.
Finally, do.call with rbind binds the split pieces into rows, and lapply with `length<-` pads every element to length 2 so that all rows have the same number of columns.
I hope this explains your query.
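The approach above always moves exactly one word. If more than one word may need to move, or the limit X should be a parameter, one possible variant is to let the regex find the last space that still fits within the limit. This is a sketch rather than the answer's method; the function name split_at_limit and the column names keep and rest are made up for illustration.

```r
# Hypothetical helper: keep as many whole words as fit in `limit` characters
# and return the remainder separately.
split_at_limit <- function(x, limit = 14) {
  # Greedy .{1,limit} followed by whitespace: cut at the last space that fits.
  pattern <- sprintf("^(.{1,%d})\\s.*$", limit)
  keep <- ifelse(nchar(x) <= limit, x, sub(pattern, "\\1", x))
  rest <- trimws(substring(x, nchar(keep) + 1))
  rest[rest == ""] <- NA   # nothing left over
  data.frame(keep = keep, rest = rest, stringsAsFactors = FALSE)
}

vect <- c("XOVEW VJIEW NI", "XOVEW VJIEW NIGOI", "XOVEW VJIEWNIGOI")
res <- split_at_limit(vect, limit = 14)
print(res)
```

A string whose first word is already longer than the limit is left unchanged, because there is no space at which to cut within the limit.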

Gathering the correct amount of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power, which appears as follows:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n, m) {
  substr(x, nchar(x) - n + 1, nchar(x) - m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" 9 as the last number.
The code seems to work fine for the rest of the data. This number (986559) is indeed the highest in the data and the only one of order of magnitude 10^5.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
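We can extract the digits before a . by using a regex lookaround:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)" # the excerpt from the question
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires them to be followed by a literal dot without including it in the match.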
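Because the pattern does not depend on character positions, it works regardless of how many digits the number has. A base R equivalent, applied to a whole vector, might look like the sketch below; the second sample line is invented to show a shorter number, and the names docs and total_power are illustrative.

```r
# Sample lines in the layout described in the question (second one is made up).
docs <- c("TotalPower:  986559. (UoPow)", "TotalPower:  4521. (UoPow)")

# perl = TRUE enables the lookahead; regmatches pulls out the matched text.
m <- regexpr("\\d+(?=\\.)", docs, perl = TRUE)
total_power <- as.numeric(regmatches(docs, m))
print(total_power)
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input if some lines do not follow the layout.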

R: Count number of rows in data frame, with matching character in specified position of string

I have a data frame with a column with characters:
strings
1 a;b;c;d
2 g;h;i;j
3 k;m
4 o
I would like to get a count of the number of strings (rows) with one of a specified set of characters at a certain position within the string.
Eg.
Get count of number of strings with 3rd character as one of the
characters in this set: {a,b,m}.
The output should be 2 in this case, since only the 1st and 3rd row
have any characters in {a,b,m} as their 3rd character within the
string.
I could only use this code to find any strings that contains 'b':
sum(grepl("b",df))
However, this is not good enough for the above task.
Please advise.
You can try grepl:
x = c('a;b;c;d','g;h;i;j','k;m','o')
sum(grepl('^.{2}[abm]', x))
#[1] 2
Try this:
sum(substr(df$strings,3,3) %in% c("a","b","m"))
Alternatively, if you want to treat ; as the delimiter and test the second field, you can do:
sum(sapply(strsplit(df$strings,";"),function(x) x[2] %in% c("a","b","m")))
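The substr variant generalises easily to any position and any set of characters. Here is a small sketch; the helper name count_char_at is made up, and the data frame is rebuilt from the question's example.

```r
# Hypothetical helper: count strings whose character at `pos` is in `chars`.
count_char_at <- function(strings, pos, chars) {
  # substr returns "" when a string is shorter than `pos`, which never matches.
  sum(substr(strings, pos, pos) %in% chars)
}

df <- data.frame(strings = c("a;b;c;d", "g;h;i;j", "k;m", "o"),
                 stringsAsFactors = FALSE)
n <- count_char_at(df$strings, pos = 3, chars = c("a", "b", "m"))
print(n)
```

Passing the column df$strings rather than the whole data frame matters here, since substr and grepl operate on character vectors.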
