Detect weird characters in all character fields in a data.frame - r

I have a large data.frame d that was read from a .csv file (it is actually a data.table resulting from running fread on the .csv file). I want to check every column of type character for weird/corrupted characters, meaning the strange character sequences that result from corrupted parts of a text file or from using the wrong encoding.
A data.table solution, or some other fast solution, would be best.
This is pseudo-code for a possible solution:
1. Create a vector str_cols with the names of all the character columns of d.
2. For each column j in str_cols, compute a frequency table of the values: tab <- d[, .N, by = j]. (This step is probably not necessary; it is just used to reduce the size of the object that will be checked in columns with repeated values.)
3. Check the values of j in the summary table tab.
The crucial step is 3. Is there a function that does that?
Edit1: Perhaps some smart regular expression? This is a related non-R question that tries to explicitly list all weird characters. Another possible solution is to find any character outside an accepted list of characters [a-z 0-9 + punctuation].

If you post some example data, it would be easier to give a more definitive answer. You could likely try something like this, though.
DT[, lapply(.SD, stringr::str_detect, "[^[[:print:]]]")]
It will return a data.table of the same size, but each string is replaced with TRUE if it contains any character that isn't alphanumeric, punctuation, or space, and with FALSE otherwise. This would be my interpretation of your question about wanting to detect values that contain these characters.
You can change the behavior by replacing str_detect with whatever base R or stringr function you want, and slightly modifying the regex as needed. For example, you can remove the offending characters with the following code.
DT[, lapply(.SD, stringr::str_replace_all, "[^[[:print:]]]", "")]
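As a side note, step 1 of the question (restricting to character columns) can be folded into the same call via .SDcols. A minimal sketch, assuming DT is your data.table:

library(data.table)
library(stringr)

# names of the character columns only (step 1 of the question)
str_cols <- names(DT)[sapply(DT, is.character)]

# TRUE where a value contains a non-printable character
flags <- DT[, lapply(.SD, str_detect, "[^[[:print:]]]"), .SDcols = str_cols]

# count of flagged values per column
flags[, lapply(.SD, sum, na.rm = TRUE)]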

Related

Splitting strings into elements from a list

A function in a package gives me a single character string in which the original strings are merged together. I need to separate them; in other words, I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no separator string to base it on, especially as I have no prior knowledge of what the answers will be.
I have tried to match() the result to orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
# find where each of the original strings occurs in the merged result
index <- gregexpr(paste(orig, collapse = "|"), result)[[1]]
starts <- as.numeric(index)
stops <- starts + attr(index, "match.length") - 1
substring(result, starts, stops)
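With the orig and result from the question, this returns "answer3" then "answer2", i.e. the matches in the order they occur in result, not in the order of orig.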
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result) {
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning a TRUE/FALSE value. We then subset the original values by that logical vector: does the value occur in the string?
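For instance, with the question's values, grepl("answer2", "answer3answer2") returns TRUE, so "answer2" is kept by the subsetting above.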
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple string matching (see the sketch after this list)
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
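A sketch of the fixed=TRUE variant, using the vectors from the question (note it does not address the "answer10 contains answer1" issue from the second point):

FindSubstrings <- function(orig, result) {
  # fixed = TRUE treats each element of orig as a literal string, not a regex
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}

orig <- c("answer1", "answer2", "answer3")
FindSubstrings(orig, "answer3answer2")
# returns c("answer2", "answer3")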

How to prevent R from dropping leading zeros in an integer vector

Is there any way of stopping R from dropping leading zeros in an integer? e.g.,
a<-c(00217,00007,00017)
I understand this is not the correct way of writing integers. Sadly I've been given a text file (person and non-R code are not around anymore) containing thousands of vectors in a single list:
list(drugA=c(...), drugB=c(....),........)
I need to keep the leading zeros, but 00002 becomes 2. I could load these thousands of values and then write a function that parses the list and converts the values to characters, padding any number that isn't five characters long, but I was hoping for a speedy alternative.
UPDATE1
An example of the text file I've been provided:
list(CETUXIMAB=c(05142,05316),
DORNASEALFA=c(94074),
ETANERCEPT=c(05342,99075),
BIVALIRUDIN=c(04400,09177),
LEUPROLIDE=c(02074,03219,91035,91086),
PEGINTERFERONALFA2A=c(03162),
ALTEPLASE=c(00486,01032,03371,05314),
DARBEPOETINALFA=c(02217,03421),
GOSERELIN=c(99221),
RETEPLASE=c(00157),
ERYTHROPOIETIN=c(92078,92122))
I have truncated the list, as there are thousands of vectors. This was a text file generated by a program written in C++ (code not available). Some of the values, e.g. RETEPLASE=c(00157), become truncated to 157.
library(stringr)
str_pad(a, 5, pad = "0")
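str_pad left-pads each value back to five characters. To apply it across the whole list at once, something like this should work (drug_list is a placeholder name for the loaded list):

library(stringr)

# pad every value in every vector of the list to width 5 with leading zeros
padded <- lapply(drug_list, str_pad, width = 5, pad = "0")
# e.g. padded$RETEPLASE is "00157" instead of 157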

transform character to a number in R

I loaded a big data set with read_delim(), since it gives me the possibility to skip the first 4 rows of the data set, which are not important for me. The data set is separated by ";". My problem is the following:
I have some numbers like
-0,000364929204806685
0,00367021351121366
-0,0184237491339445
As you can see, these numbers use a comma as the decimal separator. Therefore, if I change the type to "numeric" during the loading phase, I get a mangled value like -3.649292e+14 for the first number. Thus I have to load the data as characters.
But now I am not able to do numeric calculations; as.numeric() doesn't work.
Is there any way to change these characters to numeric?
Thanks
Matthias
Thanks everybody for the help; it can be solved by using gsub(). For the example above:
as.numeric(gsub(",", ".", Dat[1,12]))
provides:
-0.0003649292
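The same call vectorises over a whole column, and since the file was loaded with read_delim(), readr can also parse comma decimals directly at read time via its locale argument. A sketch, with the file name and column index assumed:

library(readr)

# convert an already-loaded character column in place (column 12 assumed)
Dat[[12]] <- as.numeric(gsub(",", ".", Dat[[12]], fixed = TRUE))

# or parse comma decimals while reading (file name assumed)
Dat <- read_delim("data.csv", delim = ";", skip = 4,
                  locale = locale(decimal_mark = ","))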

R: If cells of a variable contain a specific text

I am trying to find out how many cells contain a specific text for a variable (in this case the "fruits" variable) in R. I tried to use the match() function but could not get the desired result. I tried to use %in% as well, but to no avail.
The command I used is match("apple", lifestyle$fruits), and it returns a value that is much larger than the correct answer :X
I think this will give you what you want:
sum(grepl("apple", lifestyle$fruits))
grepl returns a logical TRUE/FALSE vector, with TRUE wherever the pattern is found. sum adds these together (TRUE counts as 1). You can make this a little faster using the fixed=TRUE argument:
sum(grepl("apple", lifestyle$fruits, fixed=TRUE))
This tells grepl that it doesn't have to spend time making a regular expression and to just match literally.
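A tiny made-up example of the counting behaviour:

fruits <- c("apple pie", "apple", "banana", "crab apple")
sum(grepl("apple", fruits, fixed = TRUE))
# returns 3: every element containing "apple" counts once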

How do I export a df as.character in R?

How do I export a data frame completely as.character in R? I have digits that need to be treated as text in large data frames, and I'm using write.csv. Even though I imported the digits into R as characters, they are exported as numbers (not surrounded by "" when viewed in Notepad) and are occasionally rewritten as, e.g., 1e-04 (for a small decimal value). This is for data munging, and I need things to stay as formatted (once formatted). Shouldn't that be possible with some form of as.character or similar?
Make it into a matrix. If there is at least one character column in your data frame, the conversion will coerce the rest to character to match, since you can only have one type of data in a matrix.
new <- as.matrix(old_data_frame)
If there are no character columns in your old data frame, do:
new <- matrix(as.character(as.numeric(as.matrix(old_data_frame))),
              ncol = ncol(old_data_frame))
If you use the function
write.table(x, file = ..., quote = TRUE)
anything that is a string will be quoted on output.
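A minimal sketch of that quoting behaviour, with made-up data:

df <- data.frame(id = c("00017", "00002"), stringsAsFactors = FALSE)
write.csv(df, "ids.csv", quote = TRUE, row.names = FALSE)
# the file contains "id", "00017", "00002"; the quotes keep the leading
# zeros intact for any reader that respects quoted text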
