Trying to do a multiple grep in R with regular expressions

I have to look for some registers with a regular expression in a massive log file with R.
The data is about 1 GB in total, and I have to look for several registers and add the matches to a data frame.
What I am doing is the following:
grep(paste("(Sending+.+message+.+1234567890+.+5000)", sep=""), dfLogs)
This works correctly when I do it for a single register.
But when I try to do the grep for all the searches at once:
dfTrx$RcvMessage <- paste("(Sending+.+message+.+", dfTrx$NUMBER, "+.+", dfTrx$AMOUNT,")", sep="")
dfReceived <- unique(grep(paste(dfTrx$RcvMessage, collapse="|"), dfLogs), value=TRUE)
And I get the following error:
Error in grep(paste(dfTrx$RcvMessage, collapse = "|"), dfLogs) :
invalid regular expression '(Sending+.+message+.+1234567890+.+20)|(Sending+.+message+.+9876543210+.+20)|...
How can I do this regular expression for all the values? What am I doing wrong?
An example of the data:
2015-12-09 19:01:44,717 - [DEBUG] [pool-1-thread-4450 ] [OutputHandler ] Sending 8 message(s) : 01XX765903091220151901440XXXX0000129D3A00003996101901442015120903857655184776733438000000200001XX765904091220151901440XXXX0000118BC100001839671901442015120903857655194251212137000000300001XX765905091220151901440XXXX000010E52A00003311451901442015120903857655203331836622000000200001XX765906091220151901440XXXX000011DCD300001972561901442015120903857655215522476419000000300001XX765907091220151901440XXXX000012980900003923951901442015120903857655225531194531000000500001XX765908091220151901440XXXX000010ED2200003882461901442015120903857655237351043626000000200001XX765909091220151901440XXXX000011BDBE00001656451901442015120903857655243312669477000000200001XX765910091220151901440XXXX00001211F3000024385819014420151209038576552598211945310000002000
I need to find the sending of the message, and the number and amount in the content of the message.

Thanks to @Marek.
I found that the data frame contained some bad values that were producing NA, and some stray spaces I hadn't noticed before. It seems I didn't clean all the data properly. Thanks, and sorry for such a silly mistake on my part.
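For reference, a minimal sketch of the resulting fix, assuming the bad values sat in dfTrx$NUMBER and dfTrx$AMOUNT (the exact cleaning rule is an assumption; note also that value = TRUE belongs inside grep(), not unique()):
# drop rows whose NUMBER or AMOUNT is NA, and trim stray whitespace
dfTrx <- dfTrx[!is.na(dfTrx$NUMBER) & !is.na(dfTrx$AMOUNT), ]
dfTrx$NUMBER <- trimws(dfTrx$NUMBER)
dfTrx$AMOUNT <- trimws(dfTrx$AMOUNT)
# rebuild the alternation; every piece is now a valid regular expression
dfTrx$RcvMessage <- paste0("(Sending+.+message+.+", dfTrx$NUMBER, "+.+", dfTrx$AMOUNT, ")")
dfReceived <- unique(grep(paste(dfTrx$RcvMessage, collapse = "|"), dfLogs, value = TRUE))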

Related

Error: "argument is not an atomic vector; coercing[1] FALSE"

I'm new to R and am having trouble (1) generalizing previous Stack Overflow answers to my situation, and (2) understanding the R documentation. So I turn to this community and hope someone will walk me through it.
I have this code where data1 is a text file:
library(stringr) # str_detect() comes from the stringr package
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, regex(pattern, ignore_case = FALSE))
The error message I see is:
argument is not an atomic vector; coercing[1] FALSE
When I use is.vector() to confirm the data type, it looks like it should be fine:
is.vector(pattern)
#this returns [1] TRUE as the output
Reference I used for str_detect function is https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_detect.
Edit 1: Here is the output of data1 - I'm trying to match the 4th to last line "An Error Has Occurred!":
Silk.Road.Forums
<fctr>
*
Welcome, Guest. Please login or register.
[ ] [ ] [Forever] [Login]
Login with username, password and session length
[ ] [Search]
• Home
• Search
• Login
• Register
• Silk Road Forums
An Error Has Occurred!
The user whose profile you are trying to view does not exist.
Back
• SMF | SMF © 2013, Simple Machines
Edit 2: After a bit of rudimentary testing, it looks like the issue is with how I opened up data1, not necessarily str_detect().
When I just create a vector, it works:
dataVector <- c("An Error Has Occurred!", "another one")
pattern <- c("An Error Has Occurred!")
str_detect(dataVector, pattern) # returns [1] TRUE FALSE
But when I try to use the function on the file, it doesn't:
data1 <- read.delim(file.choose())
pattern <- c("An Error Has Occurred!")
str_detect(data1, pattern) # returns the atomic vector error message
Problem: So I'm convinced that the problem is that (1) I'm using the wrong function or (2) I'm loading the file incorrectly for this file type. I've never used text files in R before, so I'm a bit lost.
That's all I have and thank you in advance for anyone willing to take a stab at helping!
I think what is going on here is that read.delim is reading your text file in as a data frame, not a vector, which is what str_detect requires.
For a quick workaround you can try:
str_detect(data1[,1], "An Error Has Occurred!")
This works because right now data1 is a one-column data frame. data1[,1] returns all rows of the first (and only) column of that data frame as a vector.
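A toy illustration of the difference (a sketch; on R versions before 4.0 you need stringsAsFactors = FALSE, as below, to get characters rather than factors):
library(stringr)
df <- data.frame(x = c("An Error Has Occurred!", "ok"), stringsAsFactors = FALSE)
class(df)      # "data.frame" -- triggers the coercion warning from the question
class(df[,1])  # "character"  -- the atomic vector str_detect() expects
str_detect(df[,1], "Error")   # TRUE FALSE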
However! The problem here is that you are using read.delim, which is meant for delimited text files (i.e. something like a csv file with a separator such as ','), and your data is not delimited. Much better would be to use the function readLines, which returns a character vector.
# open a connection to your file
con <- file('path/to/file.txt',open="r")
# read file contents
data1 <- readLines(con)
# close the connection
close(con)
Then str_detect should work.
str_detect(data1, "An Error Has Occurred!")
Just as.data.frame() your data, then the str_replace() works fine!

Read Table in R with "=" starting cell

I'm trying to read a table in R. I have always used the command read.table(...), but today I hit an error when trying to read a table with a cell starting with the "=" symbol.
It is something like:
chr7 79435791 a 7 ,,...... \ak.f9ka
chr7 79435792 a 7 C$...... =.;a#Kk
chr7 79435793 T 7 ........ GF-FGGB
I use:
read.table(files, sep="\t", comment.char="", header=FALSE, colClasses=c("character"), stringsAsFactors=FALSE)
but I get the error:
"line 2 did not have 6 elements", and I think it is because of the = symbol.
How can I read this?
Thank you
I am not able to reproduce your error. I copied your example exactly into a .txt file, and I do not get the error you describe.
By the way, with a plain read.table() statement the last value in row 2 is not read correctly: it contains '#', the default comment character, so "=.;a#Kk" gets truncated there. You can disable this behaviour (see ?read.table).
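For instance, a hedged sketch with the comment character disabled (the file name is an assumption):
# keep '#' as ordinary data by turning off comment parsing
read.table("file.txt", sep = "\t", comment.char = "",
           colClasses = "character", stringsAsFactors = FALSE)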
I was able to somewhat reproduce the error which gave me some ideas - it could of course go horribly wrong as well :)
This should hopefully do the trick:
files.data.frame <- read.table(textConnection(readLines(files)), sep="\t")
And of course, add the extra options to read.table as deemed necessary.
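For example, carrying over the options from the question (a sketch, not the only valid combination):
files.data.frame <- read.table(textConnection(readLines(files)), sep = "\t",
                               comment.char = "", header = FALSE,
                               colClasses = "character", stringsAsFactors = FALSE)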

Error: C stack usage too close to the limit when viewing data frame tail

I have a data frame appt which is 91.2MB in size, comprising 29255 observations of 51 variables.
When I tried to check its end with tail(appt), I get the error
Error: C stack usage 20212630 is too close to the limit
I have no clue what to do to troubleshoot this. Any suggestions on what I can do?
As additional info, I have in memory at the same time a few other variables of almost comparable size, including a 90.2MB character vector and a 42.3MB data frame of 77405 obs. x 60 variables. Calling tail on these two other variables did not trigger any error.
Edit:
I have narrowed down that the error occurs only when the last row is accessed. i.e. appt[29254, ] is fine, appt[29255, ] throws the error.
I had exactly the same error but managed to solve it by disabling quoting when reading in the data frame. For me, too, the error showed up when trying tail(df).
It turns out one of the lines in my text file contained the \ character. You could run into the same problem if your file has a " or ' somewhere, as read.table treats these as quote characters by default.
Add the option quote="" to disable quoting altogether. Example: read.table(filepath, quote="")
If this doesn't solve your issue, check out some of the other options in ?read.table (such as allowEscapes, sep, ...) as they might be causing your error.
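A hedged sketch of the whole workflow (the file name is an assumption):
# locate the lines containing ", ' or \ that confuse the default quoting
raw <- readLines("appt_source.txt")
grep('["\'\\\\]', raw)
# re-read with quoting disabled
appt <- read.table("appt_source.txt", sep = "\t", header = TRUE,
                   quote = "", stringsAsFactors = FALSE)
tail(appt) # should now print without the C stack error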

Error in tolower() invalid multibyte string

This is the error I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually: it is too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not investigated all of them (that is also not possible to do manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties, and identified the encoding type.
After which I used the iconv() function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every character column of the data.frame from the imported CSV. It is important to note that I set stringsAsFactors=FALSE in my read.csv() call.
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")
I was getting the same error. However, in my case it wasn't when reading the file, but a bit later when processing it. I realised I was getting the error because the file hadn't been read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share it. I simply specified the encoding as below and it worked:
read.csv(<path>, encoding = "UTF-8")
library(tidyverse)
data_clean = data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from the stringr package to trim whitespace from the start and end of the string.
com$uppervar <- toupper(str_trim(com$var))
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
My solution to this issue:
library(dplyr) # pipes
library(stringi) # for stri_enc_isutf8
#Read in csv data
old_data<- read.csv("non_utf_data.csv", encoding = "UTF-8")
#despite specifying utf -8, the below columns are not utf8:
all(stri_enc_isutf8(old_data$problem_column))
#The below code uses regular expressions to cleanse. You may need to tinker
#with the last portion, which selects the grammar to retain
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
  rename(solved_problem = problem_column)
#this column is now utf 8.
all(stri_enc_isutf8(utf_eight_data$solved_problem))

R: invalid multibyte string [duplicate]

This question already has answers here: Invalid multibyte string in read.csv (9 answers). Closed 8 years ago.
I use read.delim(filename) without any parameters to read a tab-delimited text file in R.
df = read.delim(file)
This worked as intended. Now I get a weird error message and I can't make any sense of it:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<fd>'
Calls: read.delim -> read.table -> type.convert
Execution halted
Can anybody explain what a multibyte string is? What does <fd> mean? Are there other ways to read a tab-delimited file in R? I have column headers, and some lines do not have data for all columns.
I realize this is pretty late, but I had a similar problem and figured I'd post what worked for me. I used the iconv utility (e.g., iconv file.pcl -f UTF-8 -t ISO-8859-1 -c). The -c option skips characters that cannot be translated.
If you want an R solution, here's a small convenience function I sometimes use to find where the offending multibyte character is lurking. Note that the offending character is the one after the last character that gets printed. This works because print will handle the string fine, but substr throws an error as soon as it hits a multibyte character.
find_offending_character <- function(x, maxStringLength = 256){
  print(x)
  for (c in 1:maxStringLength){
    offendingChar <- substr(x, c, c)
    #print(offendingChar) #uncomment if you want the individual characters printed
    #the next character is the offending multibyte character
  }
}
string_vector <- c("test", "Se\x96ora", "works fine")
lapply(string_vector, find_offending_character)
I fix that character and run this again. Hope that helps someone who encounters the invalid multibyte string error.
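Once located, one hedged alternative to deleting the character is converting from its source encoding (WINDOWS-1252 here is an assumption: pick whichever encoding your file actually uses):
# repair instead of remove: reinterpret the stray byte in its original encoding
iconv("Se\x96ora", from = "WINDOWS-1252", to = "UTF-8") # tolower() then works on the result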
I had a similarly strange problem with a file from the program E-Prime (edat -> SPSS conversion), but then I discovered that there are many additional encodings you can use. This did the trick for me:
tbl <- read.delim("dir/file.txt", fileEncoding="UCS-2LE")
This happened to me because I had the copyright symbol in one of my strings! Once it was removed, problem solved.
A good rule of thumb: if you are seeing this error, make sure that characters not appearing on your keyboard are removed.
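A hedged sketch of that rule of thumb, assuming it is acceptable to drop the offending characters outright (sub = "" deletes anything that cannot be converted):
# strip every non-ASCII character; from = "" means the current locale's encoding
x <- iconv(x, from = "", to = "ASCII", sub = "")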
I found Leafpad to be an adequate and simple text editor for viewing and saving/converting between certain character sets - at least in the Linux world.
I used it to save a Latin-15 file as UTF-8 and it worked.
