I'm trying to extract values from a vector of strings. Each of the roughly 2300 strings in the vector follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
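YourNumbers #Returns "1234567", i.e. the character right after each "Sent. " (the sentence numbers, not the ratings between the underscores)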
We can use str_match_all from stringr to get all the numbers that follow "Sent".
library(stringr)
# text is assumed to hold the string from the question
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
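If the goal is one collapsed string per element (such as "1311531" for the example), the same base R idea can be wrapped in a small function. This is only a sketch: extract_ratings() is a made-up name, and the two input records are shortened, invented stand-ins for the real strings.

```r
# Hypothetical helper, not part of the original answers
extract_ratings <- function(x) {
  # Split every record on the literal pipe delimiter
  fields <- strsplit(x, "|", fixed = TRUE)
  vapply(fields, function(f) {
    # Keep only the fields that start with "Sent."
    sent <- f[startsWith(f, "Sent.")]
    # Pull out the digits that sit between the underscores
    digits <- sub("^.*: _+(\\d+)_+$", "\\1", sent)
    # Glue the ratings of one record into a single string
    paste(digits, collapse = "")
  }, character(1))
}

# Shortened, invented records in the same layout as the question's data
x <- c(
  "733|Overall (-2 to 2): ___2___|Sent. 1 (ANALYSIS BY...): ___1___|Sent. 2 (Bail is...): _3____|NA|",
  "734|Overall (-2 to 2): ___0___|Sent. 1 (In 2006,...): __5__|NA|"
)
ratings <- extract_ratings(x)
ratings
```

Note that sub() returns a field unchanged when the pattern does not match, so a "Sent." field with no rating filled in would come through as the whole field; checking with grepl() first would be one way to guard against that.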
This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank  Code   Name                       Club  Class  Time          Points
9     9875   Γεωργίου Άγγελος Δημήτρης  ΑΒ/Γ  Π/Π Β  00:54:05      167
10    8954F  Smith John                 ΔΕΖ   N      ΔΕΝ ΕΚΚΙΝΗΣΕ  0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
gsub("\\*", "", trimws(str_match(s, paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(", paste(sClub, collapse = "|"),")\\s*(", paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)"))[, -1]))
sapply(list(s1, s2), myfun)
#      [,1]                        [,2]
# [1,] "9"                         "10"
# [2,] "9875"                      "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ"                      "ΔΕΖ"
# [5,] "Π/Π Β"                     "N"
# [6,] "00:54:05"                  "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167"                       "0"
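It works by encoding all of your additional information in one long regex: a capture group each for the rank, the code, and the name, then the club and class as alternations built from sClub and sClass, then the time and the points (everything up to "\r"). Finally it removes the asterisks and trims leading and trailing whitespace.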
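One thing to keep in mind: the club and class names are pasted into the pattern as regex, not as literal text. The examples here only contain "/" and spaces, which are harmless, but a name containing "." or "(" would be misread. A minimal sketch of escaping them first, where escape_regex() and make_alternation() are hypothetical helper names and the second club is invented for illustration:

```r
# Hypothetical helper: backslash-escape regex metacharacters in literal text
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Build the "a|b|c" alternation that goes inside the long pattern
make_alternation <- function(choices) {
  paste(escape_regex(choices), collapse = "|")
}

# "A.C. (II)" is an invented club name, used only to show the escaping
sClub_demo <- c("ΑΒ/Γ", "A.C. (II)")
club_pattern <- make_alternation(sClub_demo)
club_pattern
```

Inside myfun, paste(sClub, collapse = "|") could then be swapped for make_alternation(sClub), and likewise for sClass.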
I would like to remove constant (shared) parts of a string automatically and retain the variable parts.
For example, I have a column with the following values:
D20181116_Basel-Take1_digital
D20181116_Basel-Take2_digital
D20181116_Basel-Take3_digital
D20181116_Basel-Take4_digital
D20181116_Basel-Take5_digital
D20181116_Basel-Take5a_digital
How can I automatically get the following for any similar column (here by removing "D20181116_Basel-Take" and "_digital")? The code should find the constant parts itself and remove them.
1
2
3
4
5
5a
I hope this is clear. Thank you very much.
You can do it with a regex. It removes everything up to and including 'Take', and everything from the next underscore onward:
vec <- c("D20181116_Basel-Take1_digital",
"D20181116_Basel-Take2_digital",
"D20181116_Basel-Take3_digital",
"D20181116_Basel-Take4_digital",
"D20181116_Basel-Take5_digital",
"D20181116_Basel-Take5a_digital")
sub(".*?Take(.*?)_.*", "\\1", vec)
[1] "1" "2" "3" "4" "5" "5a"
With gsub(), assuming you have a data frame df and want to change the column named column:
df$column <- gsub("^D20181116_Basel-Take","",df$column)
df$column <- gsub("_digital$","",df$column)
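Both answers above hard-code the constant parts, whereas the question asks for them to be detected. One possible way, sketched here in base R, is to find the longest prefix shared by all values, drop it, and then do the same for the suffix by reversing the strings. The function names are made up for this sketch.

```r
# Number of leading characters shared by every element of x
common_prefix_len <- function(x) {
  chars <- strsplit(x, "")
  n <- min(lengths(chars))
  i <- 0
  while (i < n &&
         length(unique(vapply(chars, `[`, character(1), i + 1))) == 1) {
    i <- i + 1
  }
  i
}

# Reverse each string so a shared suffix can be treated as a shared prefix
reverse_strings <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Drop the shared prefix, then the shared suffix of what is left
strip_constant <- function(x) {
  rest <- substring(x, common_prefix_len(x) + 1)
  n_suffix <- common_prefix_len(reverse_strings(rest))
  substring(rest, 1, nchar(rest) - n_suffix)
}

vec <- c("D20181116_Basel-Take1_digital", "D20181116_Basel-Take2_digital",
         "D20181116_Basel-Take3_digital", "D20181116_Basel-Take4_digital",
         "D20181116_Basel-Take5_digital", "D20181116_Basel-Take5a_digital")
variable_part <- strip_constant(vec)
variable_part
```

This only strips a common prefix and a common suffix. If the variable parts themselves happen to share a leading or trailing character (say every take number starts with 1), that character is treated as constant and removed too.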
I am cleaning up data in R and would like to maintain numeric formatting when switching my column from numeric to character, specifically the significant zeros in the hundredths place (in example below). My input columns mostly begin as Factor data and the below is an example of what I am trying to do.
I'm sure there is a better way, just hoping for some folks with more knowledge than I to shed some light. Most questions online deal with leading zeros or formatting purely numeric columns, but the aspect of the "<" symbol in my data throws me for a loop as to the proper way of doing this.
df = as.factor(c("0.01","5.231","<0.02","0.30","0.801","2.302"))
ind = which(df %in% "<0.02") # Locate the below detection value.
df[ind] <- NA # Substitute NA temporarily
df = as.numeric(as.character(df)) # Changes to numeric column
df = round(df, digits = 2) # Rounds to hundredths place
ind1 = which(df < 0.02) # Check for below reporting limit values
df = as.character(df) # Change back to character column...
df[c(ind,ind1)] = "<0.02" # so I can place the reporting limit back
> # RESULTS::
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3"
However, the 4th, 5th, and 6th values in the data are no longer reporting the zero in the hundredths place. What would be the proper order of operations for this? Perhaps changing the column back to character is incorrect? Any advice would be appreciated.
Thank you.
EDIT: ---- Upon recommendations from hrbrmstr and Mike:
Thanks for the advice. I tried the following and they both result in the same problem. Perhaps there is another way I could be indexing/replacing values?
format, same problem:
#... code from above...
ind1 = which(df < 0.02)
df = as.character(df)
df[!c(ind,ind1)] = format(df[!c(ind,ind1)],digits=2,nsmall=2)
> df
[1] "<0.02" "5.23" "<0.02" "0.3 " "0.8 " "2.3 "
sprintf, same problem:
# ... above code from example ...
ind1 = which(df < 0.02) # Check for below reporting limit values.
sprintf("%.2f",df) # sprintf attempt.
[1] "0.01" "5.23" "NA" "0.30" "0.80" "2.30"
df[c(ind,ind1)] = "<0.02" # Feed the symbols back into the column.
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
Tried a different way of replacing the values, and same problem.
# ... above code from example ...
> ind1 = which(df < 0.02)
> df[c(ind,ind1)] = 9999999
> sprintf("%.2f",df)
[1] "9999999.00" "5.23" "9999999.00" "0.30" "0.80" "2.30"
> gsub("9999999.00","<0.02",df)
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
You could just pad it out with a gsub and a bit of regex...
df <- c("<0.02", "5.23", "<0.02", "0.3", "4", "0.8", "2.3")
gsub("^([^\\.]+)$", "\\1\\.00", gsub("\\.(\\d)$", "\\.\\10", df))
[1] "<0.02" "5.23" "<0.02" "0.30" "4.00" "0.80" "2.30"
The inner gsub, which is evaluated first, looks for a dot followed by a single digit and the end of the string, and replaces the digit (the capture group \\1) with itself followed by a zero. The outer gsub then matches values that contain no dot at all and appends .00 to them.
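An alternative to padding afterwards is to format the numbers before the "<" values go back in. In the sprintf attempts in the question, the formatted result is printed but never assigned, so df still holds the unformatted values. A minimal sketch under the question's assumptions (a 0.02 reporting limit, two decimals); with a factor column, wrap the input in as.character() first:

```r
x <- c("0.01", "5.231", "<0.02", "0.30", "0.801", "2.302")
limit <- 0.02  # reporting limit taken from the question

# Flag the values already reported as below the limit
censored <- startsWith(x, "<")

# Convert only the plain numbers, then round to hundredths
num <- rep(NA_real_, length(x))
num[!censored] <- as.numeric(x[!censored])
num <- round(num, 2)

# Format first (this keeps trailing zeros), then put the "<" label back
out <- sprintf("%.2f", num)
out[censored | (!is.na(num) & num < limit)] <- sprintf("<%.2f", limit)
out
```

The key point is that out is a character vector from the moment sprintf() returns, so nothing converts it back and drops the zeros.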
xcv(123)
wert(232)
t(145)
tyui ier(133)
ytie(435)
...
The length of each string is dynamic, meaning it varies from row to row. The numbers between the brackets are the target letters that need to be taken out and stored in a new column in the same data set.
The following key words might help:
substr() strsplit()
I'm actively looking for an answer. Your help would be deeply appreciated.
Do you mean you want to extract the, for example, 123rd letter from the string called xcv?
set.seed(123)
xcv <- paste( sample( letters, 200, replace = TRUE ), collapse = "" )
n <- 123
You can extract the nth letter like so:
substr( xcv, n, n )
# [1] "i"
dat = c('xcv(123)' ,'wert(232)', 't(145)', 'tyui ier(133)', 'ytie(435)')
# Delete everything up to and including "(" and everything from ")" onward,
# which leaves only what sits between the brackets.
# \\( and \\) are escaped because brackets are special characters in a regex.
target = gsub(".*\\(|\\).*", "", dat)
cbind(dat, target)
dat target
[1,] "xcv(123)" "123"
[2,] "wert(232)" "232"
[3,] "t(145)" "145"
[4,] "tyui ier(133)" "133"
[5,] "ytie(435)" "435"
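Since the question asks for the result in a new column of the same data set, here is a small sketch of that step. The data frame df and the column names dat and target are assumptions made for the example.

```r
# Assumed layout: the strings live in a column called dat
df <- data.frame(
  dat = c("xcv(123)", "wert(232)", "t(145)", "tyui ier(133)", "ytie(435)"),
  stringsAsFactors = FALSE
)

# Capture the digits between the brackets and store them as integers
df$target <- as.integer(sub(".*\\((\\d+)\\).*", "\\1", df$dat))
df
```

Rows without a bracketed number would be left unchanged by sub() and turn into NA (with a warning) in as.integer(), which makes them easy to spot.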
Is there a function which converts a hex string to text in R?
For example:
I have the hex string 1271763355662E324375203137, which should be converted to qv3Uf.2Cu 17.
Does someone know a good solution in R?
Here's one way:
s <- '1271763355662E324375203137'
h <- sapply(seq(1, nchar(s), by=2), function(x) substr(s, x, x+1))
rawToChar(as.raw(strtoi(h, 16L)))
## [1] "\022qv3Uf.2Cu 17"
And if you want, you can sub out non-printable characters as follows:
gsub('[^[:print:]]+', '', rawToChar(as.raw(strtoi(h, 16L))))
## [1] "qv3Uf.2Cu 17"
Just to add to @jbaums's answer, or to simplify it:
library(wkb)
hex_string <- '231458716E234987'
hex_raw <- wkb::hex2raw(hex_string)
text <- rawToChar(as.raw(strtoi(hex_raw, 16L)))
An alternative way that separates the two parts involved:
Turn the initial string into a vector of bytes (with values as hexadecimals)
Convert those raw bytes into characters (excluding any not printable)
Part 1:
s <- '1271763355662E324375203137'
sc <- unlist(strsplit(s, ""))
i1 <- (1:nchar(s)) %% 2 == 1
# vector of bytes (as character)
s_pairs1 <- paste0(sc[i1], sc[!i1])
# make explicit it is a series of hexadecimals bytes
s_pairs2 <- paste0("0x", s_pairs1)
head(s_pairs2)
#> [1] "0x12" "0x71" "0x76" "0x33" "0x55" "0x66"
Part 2:
s_raw1 <- as.raw(s_pairs2)
# filter non printable values (ascii < 32 = 0x20)
s_raw2 <- s_raw1[s_raw1 >= as.raw("0x20")]
rawToChar(s_raw2)
#> [1] "qv3Uf.2Cu 17"
We could also use the as.hexmode() function to turn s_pairs1 into a vector of hexadecimals:
s_pairs2 <- as.hexmode(s_pairs1)
head(s_pairs2)
#> [1] "12" "71" "76" "33" "55" "66"
Created on 2023-01-03 by the reprex package (v2.0.1)
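For reuse, the two parts can be folded into a single base R function. This is only a sketch: hex_to_text() and its printable_only argument are made-up names, and "printable" is taken here to mean ASCII 0x20 to 0x7E.

```r
# Hypothetical helper combining the two steps described above
hex_to_text <- function(s, printable_only = TRUE) {
  # A hex string needs an even, non-zero number of digits (two per byte)
  stopifnot(nchar(s) > 0, nchar(s) %% 2 == 0)
  # Part 1: cut the string into two-character pairs and parse them as base 16
  starts <- seq(1, nchar(s), by = 2)
  bytes <- as.raw(strtoi(substring(s, starts, starts + 1), 16L))
  # Part 2: optionally drop bytes outside the printable ASCII range
  if (printable_only) {
    bytes <- bytes[bytes >= as.raw(0x20) & bytes < as.raw(0x7f)]
  }
  rawToChar(bytes)
}

decoded <- hex_to_text("1271763355662E324375203137")
decoded
```

Calling it with printable_only = FALSE keeps the leading control byte, as in the first answer's output.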