I'm new to R and am having trouble (1) generalizing previous Stack Overflow answers to my situation, and (2) understanding R documentation. So I turn to this community and hope someone will walk me through it.
I have this code where data1 is a text file:
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, regex(pattern, ignore_case = FALSE))
The error message I see is:
argument is not an atomic vector; coercing
[1] FALSE
When I use is.vector() to confirm the data type, it looks like it should be fine:
is.vector(pattern)
#this returns [1] TRUE as the output
Reference I used for str_detect function is https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_detect.
Edit 1: Here is the output of data1 - I'm trying to match the 4th to last line "An Error Has Occurred!":
Silk.Road.Forums
<fctr>
*
Welcome, Guest. Please login or register.
[ ] [ ] [Forever] [Login]
Login with username, password and session length
[ ] [Search]
â\200¢ Home
â\200¢ Search
â\200¢ Login
â\200¢ Register
â\200¢ Silk Road Forums
An Error Has Occurred!
The user whose profile you are trying to view does not exist.
Back
â\200¢ SMF | SMF © 2013, Simple Machines
Edit 2: After a bit of rudimentary testing, it looks like the issue is with how I opened up data1, not necessarily str_detect().
When I just create a vector, it works:
dataVector <- c("An Error Has Occurred!", "another one")
pattern <- c("An Error Has Occurred!")
str_detect(dataVector, pattern) # returns [1] TRUE FALSE
But when I try to use the function on the file, it doesn't:
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, pattern) # returns the atomic vector error message
Problem: So I'm convinced that the problem is that (1) I'm using the wrong function or (2) I'm loading up the file wrong for this file type. I've never used text files in R before so I'm a bit lost.
That's all I have and thank you in advance for anyone willing to take a stab at helping!
I think what is going on here is that read.delim is reading in your text file as a data frame and not a vector which is what str_detect requires.
For a quick workaround you can try:
str_detect(data1[,1], "An Error Has Occurred!")
This works because right now data1 is a 1 column data frame. data1[,1] selects all rows of the first (and only) column of that data frame and returns them as a vector.
However! The problem here is you are using read.delim, which is for delimited text files (i.e. like a csv file that has a separator such as ','), which your data is not. Much better would be to use the function readLines, which will return a character vector with one element per line.
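To see the difference between the data frame and its column, here is a minimal sketch. The temporary file and its three lines are made up for illustration; they just stand in for the real file behind data1.

```r
# Hypothetical stand-in for the real text file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Header", "An Error Has Occurred!", "Back"), tmp)

# read.delim() treats the first line as a column name and returns a data frame
df <- read.delim(tmp, stringsAsFactors = FALSE)
is.data.frame(df)   # a data frame, not an atomic vector

# Pulling out the single column gives an ordinary character vector
col <- df[[1]]
is.atomic(col)

unlink(tmp)
```

Note that the first line of the file ends up as the column name rather than as data, which is another reason read.delim is an awkward fit here.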
# open a connection to your file
con <- file('path/to/file.txt',open="r")
# read file contents
data1 <- readLines(con)
# close the connection
close(con)
Then str_detect should work.
str_detect(data1, "An Error Has Occurred!")
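Putting it together, a small self-contained sketch (again with a made-up temporary file in place of yours). It uses base R's grepl with fixed = TRUE so it runs without any packages; I believe stringr::str_detect(page_lines, stringr::fixed(pattern)) is the stringr equivalent.

```r
# Hypothetical stand-in for the saved page
tmp <- tempfile(fileext = ".txt")
writeLines(c("Silk Road Forums", "An Error Has Occurred!", "Back"), tmp)

# readLines() also accepts a file path directly and returns a character vector
page_lines <- readLines(tmp)

# fixed = TRUE matches the pattern as a literal string rather than a regex
pattern <- "An Error Has Occurred!"
has_error <- grepl(pattern, page_lines, fixed = TRUE)

which(has_error)   # index of the matching line(s)
any(has_error)     # does the page contain the message at all?

unlink(tmp)
```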
Just as.data.frame() your data, then the str_replace() works fine!
I would like to give a more informative error message when users of my R functions supply a string with an unrecognized escape
my_string <- "sql\sql"
# Error: '\s' is an unrecognized escape in character string starting ""sql\s"
Something like this would be ideal.
my_string <- "sql\sql"
# Error: my_string contains an unrecognized escape. Try sql\\sql with double backslashes instead.
I have tried an if statement that looks for single backslashes
if (stringr::str_detect("sql\sql", "\")) stop("my error message")
but I get the same error.
Almost all of my users are Windows users running R 3.3 and up.
Code execution in R happens in two phases. First, R takes the raw string you enter and parses that into commands that can be run; then, R actually runs those commands. The parsing step makes sure what you've written actually makes sense as code. If it doesn't make any sense, then R can't even turn it into anything it can attempt to run.
The error message you are getting about the unrecognized escape sequence is happening at the parsing stage. That means R isn't really even attempting to execute the command; it just straight up can't understand what you are saying. There is no way to catch an error like this in code because there's no user code that's running at that point.
So if you are counting on your users writing code like my_string <- "something", then they need to write valid code. They can't change how strings are encoded or what the assignment operator looks like or how variables can be named. They also can't type !my_string! <=== %something% because R can't parse that either. R can't parse my_string <- "sql\sql" but it can parse my_string <- "sql\\sql" (backslashes must be escaped in string literals). If they are not savvy users, you might want to consider providing an alternative interface that can sanitize user input before trying to run it as code. Maybe make a shiny front end or have users pass arguments to your scripts via command line parameters.
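If the code reaches you as text (for example read from a file or a text box) rather than being typed at the console, the parse step becomes something you call yourself, and then its error can be caught. A rough sketch of that idea; parses_ok is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: TRUE if the text is syntactically valid R code
parses_ok <- function(code_text) {
  tryCatch({
    parse(text = code_text)  # parse only, nothing is evaluated
    TRUE
  }, error = function(e) FALSE)
}

# Single-quoted literals so the inner double quotes need no escaping.
bad_code  <- 'my_string <- "sql\\sql"'     # the text: my_string <- "sql\sql"
good_code <- 'my_string <- "sql\\\\sql"'   # the text: my_string <- "sql\\sql"

parses_ok(bad_code)    # FALSE: '\s' is an unrecognized escape
parses_ok(good_code)   # TRUE
```

This does not help when users type the assignment directly into the console, since that text is parsed before any of your code gets a chance to run.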
If you're capturing your user input correctly, for a string input of \, R will store that in my_string as a single backslash, which it prints as \\.
readline()
\
[1] "\\"
readline()
sql\sql
[1] "sql\\sql"
That means internally in R:
my_string <- "sql\\sql"
However
cat(my_string)
sql\sql
To check the input, you need to escape each escape, because you're looking for \\
stringr::str_detect(my_string, "\\\\")
Which returns TRUE if the input string is sql\sql. So the full line is:
if (stringr::str_detect("sql\\sql", "\\\\")) stop("my error message")
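As a side note, the two levels of escaping (one for the R string literal, one for the regex) can be cut to one by matching literally. A small sketch using base R's grepl:

```r
my_string <- "sql\\sql"   # what R stores when the user types sql\sql

nchar(my_string)          # 7: the backslash is one character, printed as \\

# Literal match: only the string-literal escaping is needed
has_backslash_fixed <- grepl("\\", my_string, fixed = TRUE)

# Regex match: the backslash must be escaped for the regex engine as well
has_backslash_regex <- grepl("\\\\", my_string)
```

Both calls should return TRUE for this input.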
I have to search for some records with a regular expression in a massive log file in R.
All the data is about 1 GB and I have to look for some records and add the matches to a data frame.
What I am doing is the following:
grep(paste("(Sending+.+message+.+1234567890+.+5000)", sep=""), dfLogs)
This works correctly when I do it for only one record.
When I try to do the grep for all the searches:
dfTrx$RcvMessage <- paste("(Sending+.+message+.+", dfTrx$NUMBER, "+.+", dfTrx$AMOUNT,")", sep="")
dfReceived <- unique(grep(paste(dfTrx$RcvMessage, collapse="|"), dfLogs), value=TRUE)
And I get the following error:
Error in grep(paste(dfTrx$RcvMessage, collapse = "|"), dfLogs) :
invalid regular expression '(Sending+.+message+.+1234567890+.+20)|(Sending+.+message+.+9876543210+.+20)|...
How can I do this regular expression for all the values? What am I doing wrong?
An example for data:
2015-12-09 19:01:44,717 - [DEBUG] [pool-1-thread-4450 ] [OutputHandler ] Sending 8 message(s) : 01XX765903091220151901440XXXX0000129D3A00003996101901442015120903857655184776733438000000200001XX765904091220151901440XXXX0000118BC100001839671901442015120903857655194251212137000000300001XX765905091220151901440XXXX000010E52A00003311451901442015120903857655203331836622000000200001XX765906091220151901440XXXX000011DCD300001972561901442015120903857655215522476419000000300001XX765907091220151901440XXXX000012980900003923951901442015120903857655225531194531000000500001XX765908091220151901440XXXX000010ED2200003882461901442015120903857655237351043626000000200001XX765909091220151901440XXXX000011BDBE00001656451901442015120903857655243312669477000000200001XX765910091220151901440XXXX00001211F3000024385819014420151209038576552598211945310000002000
I need to find the sending of the message, and the number and amount in the content of the message.
Thanks to @Marek.
I found that in the df there were some bad values that were giving me NA and some spaces that I hadn't seen before. Seems I didn't clean all the data properly. Thanks and sorry for a silly mistake from me.
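For anyone hitting the same thing, here is roughly what the cleanup could look like. This is only a sketch: the small dfTrx and logs objects below are invented stand-ins, the pattern is a simplified version of the one in the question, and matching one pattern at a time is my own choice to avoid a single huge alternation.

```r
# Hypothetical stand-ins for the transaction table and the log lines
dfTrx <- data.frame(NUMBER = c("1234567890", " 9876543210 ", NA),
                    AMOUNT = c("5000", "20", "20"),
                    stringsAsFactors = FALSE)
logs <- c("Sending 1 message(s) : 1234567890 xx 5000",
          "Sending 1 message(s) : 5555555555 xx 20")

# Trim stray whitespace, then drop rows with missing values
dfTrx$NUMBER <- trimws(dfTrx$NUMBER)
dfTrx$AMOUNT <- trimws(dfTrx$AMOUNT)
clean <- dfTrx[complete.cases(dfTrx[, c("NUMBER", "AMOUNT")]), ]

# One pattern per remaining transaction
patterns <- paste0("Sending.+message.+", clean$NUMBER, ".+", clean$AMOUNT)

# Test each pattern separately instead of pasting them all together with "|"
hits <- vapply(patterns, function(p) any(grepl(p, logs)),
               logical(1), USE.NAMES = FALSE)
clean$found <- hits
```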
I have created a data frame in R with the following name:
table_file1_C.txt|file2_C.txt
This name was generated by the assign() function, in reference to a single .txt file that was generated by a program run on the command line. Here is a sample from the loop that created this object:
assign(x=paste("table_",
dir(file.dir, pattern="\\.txt$")[i],
sep=''),
value=tmpTables[[i]])#tmpTables holds the data I'm manipulating, as read in from readHTMLtable
The issue is that I am unable to reference this object after its creation:
>table_file1_C.txt|file2_C.txt
Error: object 'file2_C.txt' not found
I believe that R is seeing the '|' character, and reading it as an instruction, not a part of the object's name, even though it already accepted it as part of the object's name.
So, I need to strip the | from the object's name. I planned to accomplish this with gsub() embedded within the assign() function, using something like this:
assign(x=paste("table_",#creating the name of the object
gsub(x=dir(file.dir, pattern="\\.txt$")[i],
pattern="|",
replacement="."),#need to remove the | characters!!
sep=''),
value=tmpTables[[i]])
However, this output gives something like this:
[1] ".t.a.b.l.e._.f.i.l.e.1...t.x.t.|.f.i.l.e.2...t.x.t."
As you can see, the name has been mangled, and the | has not actually been removed.
I need to find a way to remove the | from the name, so I can process the object that I have created. Or, prevent it from being included in the name in the first place. I can only do this within R, as I cannot modify the output of the program that I used to generate the data.
Does this make sense? Let me know if more information is needed. Thank you for taking the time to read this.
You need to escape the | character in the regular expression. Otherwise it is an empty pattern, which matches everything.
Escaping the character with brackets (character class):
x <- 'a|b'
gsub('[|]', '.', x)
## [1] "a.b"
Escaping with a backslash:
gsub('\\|', '.', x)
## [1] "a.b"
If you don't escape the | character, it is an "or" operation in the regular expression: empty pattern or empty pattern, which is the same as the empty pattern. That matches at every position, so gsub inserts the . between each character:
gsub('', '.', x)
## [1] ".a.|.b."
gsub('|', '.', x) # Same as above
## [1] ".a.|.b."
For some reason, escaping with ' ' as per Matthew Lundberg did not work correctly for me, but escaping with ` did.
> 'file1.txt|file2.txt'
[1] "file1.txt|file2.txt"
>`denovo_AR_C.txt|FOXA1_C.txt`
*data*
Thanks go to Matthew
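For completeness, a short sketch of the ways I know to reach an object with a non-syntactic name like this, plus the alternative of keeping the tables in a named list so the name never has to be valid R syntax. The tiny data frame is a made-up placeholder.

```r
# Create an object whose name contains "|" (placeholder data)
assign("file1.txt|file2.txt", data.frame(a = 1:3))

# Backticks let the parser treat the whole thing as one name
obj1 <- `file1.txt|file2.txt`

# get() looks the object up from a character string
obj2 <- get("file1.txt|file2.txt")

identical(obj1, obj2)

# Alternative: a named list needs no assign() and accepts any string as a name
tables <- list()
tables[["file1.txt|file2.txt"]] <- obj1
names(tables)
```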
I am trying to append metadata to columns in R. I have a large data set, and would like to pull up metadata info in R through comments (or is there somewhere more appropriate?). I have the metadata typed into single cells in a separate file, and am trying to refer to them when defining comments.
I have tried the following:
comment(data$cat)=as.quoted(metadata$catdot)
comment(data$cat)
# NULL
For this next one, I tried to put quotation marks around the text I wanted to pull in:
comment(data$cat)=as.quoted(metadata$cat)
# Error in parse(text = x) : <text>:1:5: unexpected symbol
# 1: how many
#        ^
Another attempt using quotation marks. Please ignore the silly text, I'm just testing code with a data set and metadata I made up:
comment(data$place)=metadata$placequote
# Error in `comment<-`(`*tmp*`, value = list("whether the location is a shelter or a foster home because that is really important the poor little dogs and cats just don't have anywhere else to go maybe they are blind or starving and it's really just such a shame")) :
# attempt to set invalid 'comment' attribute
comment(data$place)=readChar(metadata$place,nchars=81)
# Error in readChar(metadata$place, nchars = 81) :
# cannot read from this connection
comment(data$place)=paste(readLines(metadata$place), collapse=" ")
# Error in readLines(metadata$place) : 'con' is not a connection
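One thing the "invalid 'comment' attribute" error above hints at is that the value being assigned was a list, whereas comment<- wants a plain character vector. A minimal sketch under that assumption, with made-up data and metadata standing in for the real ones:

```r
# Hypothetical stand-ins for the data set and the metadata file
data <- data.frame(place = c("shelter", "foster home"),
                   stringsAsFactors = FALSE)
metadata <- data.frame(place = "whether the location is a shelter or a foster home",
                       stringsAsFactors = FALSE)

# Coerce the metadata cell to a character vector before attaching it
comment(data$place) <- as.character(metadata$place[1])

comment(data$place)
```

readChar() and readLines() expect a file or connection, which would explain why passing them a column of text fails.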
This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
It seems to be French company names that are the problem with the É character. Although I have not investigated all of them (also not possible to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked properties and identified the encoding type.
After which I used the iconv() function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every column of the data.frame from the imported CSV. Important to note that I set stringsAsFactors=FALSE in my read.csv() call.
dat[,sapply(dat,is.character)] <- sapply(
dat[,sapply(dat,is.character)],
iconv,"WINDOWS-1252","UTF-8")
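To make the effect visible on a single value, here is a small sketch. The string is built from raw bytes so the example does not depend on how this page itself is encoded; 0xC9 is É in Windows-1252.

```r
# "École" as Windows-1252 bytes: 0xC9 is the É
x <- rawToChar(as.raw(c(0xC9, 0x63, 0x6F, 0x6C, 0x65)))

validUTF8(x)         # FALSE: a lone 0xC9 is not valid UTF-8

# Re-encode from the file's real encoding to UTF-8
y <- iconv(x, "WINDOWS-1252", "UTF-8")

validUTF8(y)         # TRUE after conversion
lower <- tolower(y)  # no longer raises "invalid multibyte string"
```

validUTF8() needs R 3.3.0 or later, as far as I know.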
I was getting the same error. However, in my case it wasn't when I was reading the file, but a bit later when processing it. I realised that I was getting the error, because the file wasn't read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
library(tidyverse)
data_clean = data %>%
mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.
com$uppervar<-toupper(str_trim(com$var))
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
lapply(old_data, enc2utf8),
stringsAsFactors = FALSE
)
My solution to this issue:
library(dplyr) # pipes
library(stringi) # for stri_enc_isutf8
#Read in csv data
old_data<- read.csv("non_utf_data.csv", encoding = "UTF-8")
#despite specifying UTF-8, the column below is not UTF-8:
all(stri_enc_isutf8(old_data$problem_column))
#The below code uses regular expressions to cleanse. May need to tinker with the last
#portion that selects the grammar to retain
utf_eight_data<- old_data %>%
mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
rename(solved_problem = problem_column)
#this column is now utf 8.
all(stri_enc_isutf8(utf_eight_data$solved_problem))
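If the goal is simply to drop whatever bytes are not valid UTF-8 (the "just ignore them" option from the question), a base R sketch that avoids the regex is below. The sample vector is made up, and this is lossy: the offending characters are removed, not converted.

```r
# Made-up sample: the second element contains a stray 0xE9 byte (invalid in UTF-8)
x <- c("Acme",
       rawToChar(as.raw(c(0x43, 0x61, 0x66, 0xE9, 0x20, 0x42, 0x61, 0x72))))

# Flag the problem strings first
bad <- !validUTF8(x)

# Converting UTF-8 to UTF-8 with sub = "" strips the bytes that cannot be converted
cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

all(validUTF8(cleaned))
```

If you know the real source encoding, converting with iconv() from that encoding keeps the accented characters instead of deleting them.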