Normalize ASCII to UTF-8 in R - r

I have a dataframe I am trying to convert to rdf to edit in
Protege. The dataframe unfortunately has ASCII codes that are not visible when the strings are printed, most notoriously \u0020, which its he code for a space.
x <- "\u0020".
x
> " "
grepl() works fine when searching for the pattern,
but does not return the original string when the
result is printed.
match <-
grep(pattern = "\u0020", x = x, value = TRUE)
match
> " "
The problem is that these codes are throwing Protege off and I'm trying to normalize them to basic characters such as \u0020 to " ", but I cannot find any regex that will catch these and replace them with the single non-code character. The regex pattern [^ -~] does not catch these values and I'm completely blind to these strings otherwise. How can I normalize any of these codes in R?

Personally, I would just replace all unicode in the file using the stringi library.
Given a CSV file, test.csv that looks like
col1,col2,col3
\u0020, moretext, evenmoretext
First load it as a data.frame
> frame <- read.csv("test.txt", encoding="UTF-8")
> frame
col1 col2 col3
1 \\u0020 moretext evenmoretext
Next, find all of the occurrences that you want to replace and use stri_unescape_unicode to turn it into something that Protege likes.
> frame$col1
[1] "\\u0020"
frame$col1 <- stri_unescape_unicode(frame$col1)
> frame$col1
[1] " "
Once replaced, you should be able to write your csv back to disk without the unicode entries.

Related

String extraction with regular expression in R

I am trying to extract bunch of information from filenames using regular expressions in R. As I am matching the pattern, str_view() is showing me the correct set of strings. Yet, when I am trying to sub those and extract the remaining portion, it doesn't work. I also tried str_extract() but it isn't working. What am I doing wrong?
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
fext <- tools::file_path_sans_ext(fname)
stringr::str_view(fext, ".*-ASG_\\d+_", match = TRUE)
P_num <- gsub(".*-ASG_\\d{2}_", "", fext)
P_num <- stringr::str_extract(fname, "(?<=-ASG_\\d+)([^_])*(?=\\.tab)")
Using trimws from base R
trimws(fname, whitespace = ".*_|\\..*")
[1] "00020"
data
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
Here is a simple approach using sub:
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
output <- sub("^.*-ASG_\\d+_(.*)\\.tab$", "\\1", fname)
output
[1] "00020"
Above we use a capture group to isolate the portion of the filename, sans extension, which you want to match.

R RegEx gsub() Equivalent of "Line Operations>Remove Empty Lines (Containing Blank Characters)" in CSV file

I have a CSV fwith several columns: Tweet, date, etc. The spaces in some Tweets is causing blank lines and undesired truncated lines.
What works:
1. Using Notepad++'s function "Line Operations>Remove Empty Lines (Containing Blank Characters)"
2. Search and replace: \r with nothing.
However, I need to do this for a large number of files, and I can't manage to find a Regular Expression with gsub() in R that will do what the Notepadd++ function does.
Note that replacing ^[ \t]*$\r?\n with nothing and then \r with nothing does work in Notepad++, but not in R, as suggested here, but it does not work with g(sub) in R.
I have tried the following code:
tx <- readLines("tweets.csv")
subbed <-gsub(pattern = "^[ \\t]*$\\r?\\n", replace = "", x = tx)
subbed <-gsub(pattern = "\r", replace = "", x = subbed)
writeLines(subbed, "output.csv")
This is the input:
This is the desired output:
You may use
library(readtext)
tx <- readtext("weets.csv")
subbed <- gsub("(?m)^\\h*\\R?", "", tx$text, perl=TRUE)
subbed <- gsub("\r", "", subbed, fixed=TRUE)
writeLines(trimws(subbed), "output.csv")
The readtext llibrary reads the file into a single variable and thus all line break chars are kept.

Remove comma which is a thousands separator in R

I need to import a bunch of .csv files into R. I do this using the following code:
Dataset <- read.csv(paste0("./CSV/State_level/",file,".csv"),header = F,sep = ";",dec = "," , stringsAsFactors = FALSE)
The input is an .csv file with "," as separator for decimal places. Unfortunately there are quite a few entries as follows: 20,012,054.
This should really be: 20012,054 and leads to either NAs but usually the whole df being imported as character and not numeric which I'd like to have.
How do I get rid of the first "," when looking from left to right and only if the number has more than 3 figuers infront of the decimal-comma?
Here is a sample of how the data looks in the .csv-file:
A data.frame might look like this:
df<-data.frame(a=c(0.5,0.84,12.25,"20,125,25"), b=c("1,111,054",0.57,105.25,0.15))
I used "." as decimal separator in this case to make it a number, which in the .csv is a ",", but this is not the issue for numbers in the format: 123,45.
Thank you for your ideas & help!
We can use sub to get rid of the first ,
df[] <- lapply(df, function(x) sub(",(?=.*,)", "", x, perl = TRUE))
Just to show it would leave the , if there is only a single , in the code
sub(",(?=.*,)", "", c("0,5", "20,125,25"), perl = TRUE)
#[1] "0,5" "20125,25"

extract portion of file name using gsub()

I'm reading multiple .txt files using list.file() and file.path(). Just wanted to parse the full path names and extract the portion after the last “/” and before the “.”
Here is the structure of file path names :
"C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL192C.txt"
The code I've tried
# l <- list.files(pattern = "SAT(.+)*.txt")
# f <- file.path(getwd(), c=(l))
f <- c("C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL192C.txt", "C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL193D.txt")
d <- lapply(f, read.delim)
names(d) <- gsub(".*/(.*)..*", "1", f)
Last string give [1] "1" "1" instead of [1] "DPL192C" "DPL193D" etc...
I've also tried the syntax like ".*/(.+)*..* for the portion to conserv with same result.
A . is a special character, so you need to escape it. Please, when you want to grab the captured expression, you need to use \\1, not just 1. Try this:
gsub(".*/(.*)\\..*", "\\1", f)
# [1] "SAT-DPL192C" "SAT-DPL193D"

How to read \" double-quote escaped values with read.table in R

I am having trouble to read a file containing lines like the one below in R.
"_:b5507F4C7x59005","Fabiana D\"atri"
Any idea? How can I make read.table understand that \" is the escape of quote?
Cheers,
Alexandre
It seems to me that read.table/read.csv cannot handle escaped quotes.
...But I think I have an (ugly) work-around inspired by #nullglob;
First read the file WITHOUT a quote character.
(This won't handle embedded , as #Ben Bolker noted)
Then go though the string columns and remove the quotes:
The test file looks like this (I added a non-string column for good measure):
13,"foo","Fab D\"atri","bar"
21,"foo2","Fab D\"atri2","bar2"
And here is the code:
# Generate test file
writeLines(c("13,\"foo\",\"Fab D\\\"atri\",\"bar\"",
"21,\"foo2\",\"Fab D\\\"atri2\",\"bar2\"" ), "foo.txt")
# Read ignoring quotes
tbl <- read.table("foo.txt", as.is=TRUE, quote='', sep=',', header=FALSE, row.names=NULL)
# Go through and cleanup
for (i in seq_len(NCOL(tbl))) {
if (is.character(tbl[[i]])) {
x <- tbl[[i]]
x <- substr(x, 2, nchar(x)-1) # Remove surrounding quotes
tbl[[i]] <- gsub('\\\\"', '"', x) # Unescape quotes
}
}
The output is then correct:
> tbl
V1 V2 V3 V4
1 13 foo Fab D"atri bar
2 21 foo2 Fab D"atri2 bar2
On Linux/Unix (or on Windows with cygwin or GnuWin32), you can use sed to convert the escaped double quotes \" to doubled double quotes "" which can be handled well by read.csv:
p <- pipe(paste0('sed \'s/\\\\"/""/g\' "', FILENAME, '"'))
d <- read.csv(p, ...)
rm(p)
Effectively, the following sed command is used to preprocess the CSV input:
sed 's/\\"/""/g' file.csv
I don't call this beautiful, but at least you don't have to leave the R environment...
My apologies ahead of time that this isn't more detailed -- I'm right in the middle of a code crunch.
You might consider using the scan() function. I created a simple sample file "sample.csv," which consists of:
V1,V2
"_:b5507F4C7x59005","Fabiana D\"atri"
Two quick possibilities are (with output commented so you can copy-paste to the command line):
test <- scan("sample.csv", sep=",", what='character',allowEscapes=TRUE)
## Read 4 items
test
##[1] "V1" "V2" "_:b5507F4C7x59005"
##[4] "Fabiana D\\atri\n"
or
test <- scan("sample.csv", sep=",", what='character',comment.char="\\")
## Read 4 items
test
## [1] "V1" "V2" "_:b5507F4C7x59005"
## [4] "Fabiana D\\atri\n"
You'll probably need to play around with it a little more to get what you want. And I see that you've already mentioned writeLines, so you may have already tried this. Either way, good luck!
I was able to get your eample to work by setting the quote argument:
> read.csv('test.csv',quote="'",head=FALSE)
V1 V2
1 "_:b5507F4C7x59005" "Fabiana D\\"atri"
2 "_:b5507F4C7x59005" "Fabiana D\\"atri"
read_delim from package readr can handle escaped and doubled double quotes, using the arguments escape_double and escape_backslash.
For example, if our file escapes quotes by doubling them:
"quote""","hello"
1,2
then we use
read_delim(file, delim=',') # default escape_backslash=FALSE, escape_double=TRUE
If our file escapes quotes with a backslash:
"quote\"","hello"
1,2
we use
read_delim(file, delim=',', escape_double=FALSE, escape_backslash=TRUE)
As of newer R versions, readr::read_delim() is the correct answer.
data = read_delim(filename, delim = "\t", quote = "\"",
escape_backslash=T, escape_double=F,
# The columns depend on your data
col_names = c("timeStart", "posEnd", "added", "removed"),
col_types = "nncc"
)
This should be fine with read.csv(). Take a look at the help for ?read.csv - the option for specifying the quote is quote = "....". In this case, though, there may be a problem: it seems that read.csv() prefers to see matching quotes.
I tried the same with read.table("sample.txt", header = FALSE, as.is = TRUE), with your text in sample.txt, and it seems to work. When all else fails with read.csv(), I tend to back up to read.table() and specify the parameters carefully.

Resources