R: Count how many cells of a variable contain a specific text

I am trying to find out how many cells contain a specific text for a variable (in this case the "fruits" variable) in R. I tried to use the match() function but could not get the desired result. I tried %in% as well, but to no avail.
The command I used is match("apple", lifestyle$fruits), and it returns a value that is much larger than the correct answer :X

I think this will give you what you want:
sum(grepl("apple", lifestyle$fruits))
grepl returns a logical TRUE/FALSE vector, with TRUE wherever the text is found; sum adds these up, counting each TRUE as 1. You can make this a little faster using the fixed=TRUE argument:
sum(grepl("apple", lifestyle$fruits, fixed=TRUE))
This tells grepl that it doesn't have to spend time building a regular expression and should just match the text literally.
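To see the difference concretely, here is a minimal sketch with a made-up vector standing in for lifestyle$fruits (the real data isn't shown in the question):

```r
# Made-up stand-in for lifestyle$fruits
fruits <- c("apple pie", "apple", "banana", "cherry", "crabapple")

# match() only reports the position of the first exact match,
# which is why it can't be used as a count
match("apple", fruits)                      # 2

# grepl() flags every element containing the text; sum() counts them
sum(grepl("apple", fruits, fixed = TRUE))   # 3
```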

Related

Role of square brackets

I got this code from elsewhere and I am wondering if someone can explain what the square brackets are doing.
matrix1[i,] <- df[[1]][]
I am using this to assign values to a matrix and it works but I am not sure what exactly it's doing. What does the initial set of [[]] mean followed by another []?
This might help you understand a bit. You can copy and paste this code and see the differences between different ways of indexing using [] and $. The only thing I can't answer for you is the second empty set of square brackets; from my understanding that does nothing unless a value is within those brackets.
#Retrieves the first column as a data frame
mtcars[1]
#Retrieves the first column values only (three different methods of doing the same thing)
mtcars[,1]
mtcars[[1]]
mtcars$mpg
#Retrieves the first row as a data frame
mtcars[1,]
#I can use a second set of brackets to get the 4th value within the first column
mtcars[[1]][4]
mtcars$mpg[4]
The general function of [ is that of subsetting, which is well documented both in help (as suggested in comments) and in this piece. The rest of my answer is heavily based on that source.
In fact, there are three operators for subsetting in R: [[, [, and $.
The [ operator subsets by index position; for example, the first three elements of the vector a = 1:10 may be subsetted with a[c(1,2,3)]. You can also subset negatively to remove elements: a[-1] removes the first element.
The $ operator is different in that it only takes element names as input, e.g. if df is a data frame with a column values, df$values subsets that column. You can achieve the same with [, but only with a quoted name, such as df["values"].
To answer more specifically, what does df[[1]][] do?
First, the [[-operator will return the 1st element from df, and the following empty [-operator will pull everything from that output.
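Using the built-in mtcars data from the example above, you can check that the trailing empty [] really does nothing:

```r
# The empty second bracket selects everything, so both forms are identical
identical(mtcars[[1]][], mtcars[[1]])   # TRUE

# A non-empty second bracket does subset: the 4th value of column 1
mtcars[[1]][4]                          # 21.4
```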

Detect weird characters in all character fields in a data.frame

I have a large data.frame d that was read from a .csv file (it is actually a data.table produced by running fread on the .csv file). I want to check every column of type character for weird/corrupted characters, meaning the strange character sequences that result from corrupted parts of a text file or from reading with the wrong encoding.
A data.table solution, or some other fast solution, would be best.
This is pseudo-code for a possible solution:
1. Create a vector str_cols with the names of all the character columns of d.
2. For each column j in str_cols, compute a frequency table of the values: tab <- d[,.N,j]. (This step is probably not necessary; it is just used to reduce the size of the object to be checked in columns with repeated values.)
3. Check the values of j in the summary table tab.
The crucial step is 3. Is there a function that does that?
Edit1: Perhaps some smart regular expression? This is a related non-R question that tries to explicitly list all the weird characters. Another possible solution is to find any character outside an accepted list of characters [a-z, 0-9, plus punctuation].
If you post some example data it would be easier to give a more definitive answer. You could likely try something like this, though.
DT[, lapply(.SD, stringr::str_detect, "[^[:print:]]")]
It will return a data.table of the same size, but any string that contains a character that isn't alphanumeric, punctuation, and/or space will be replaced with TRUE, and everything else replaced with FALSE. This would be my interpretation of your question about wanting to detect values that contain these characters.
You can change the behavior by replacing str_detect with whatever base R or stringr function you want, and slightly modifying the regex as needed. For example, you can remove the offending characters with the following code.
DT[, lapply(.SD, stringr::str_replace_all, "[^[:print:]]", "")]
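If you prefer to avoid the stringr dependency, the same idea works with base R's grepl. A minimal sketch with a made-up data frame, using a control character ("\u0001") as a stand-in for a corrupted byte:

```r
# Made-up example data; "\u0001" plays the role of a corrupted byte
d <- data.frame(a = c("ok", "bro\u0001ken"),
                b = c("fine", "also fine"),
                stringsAsFactors = FALSE)

# TRUE wherever a value contains a non-printable character
sapply(d, grepl, pattern = "[^[:print:]]")
# column a: FALSE TRUE    column b: FALSE FALSE
```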

In R i have a column with text. How can i write a script in R that counts the frequency of the specific words?

The text column can hold up to 100 characters per entry. How can I write a script that recognizes the word "Approved" or "Rejected"? Sometimes the word will be "-Approved", "Approved", or "Approve". I want it to account for each scenario with a "LIKE" type of function.
There are two words I am looking for, so "OR" may be applicable to this as opposed to a range.
R has a pair of approximate-matching functions, agrep and agrepl, which are like grep and grepl in returning a vector when given a vector. The agrepl function returns a logical vector of the same length as the input, so it works better in cases like this:
agrepl("Approved", df$text_col) | agrepl("Rejected", df$text_col)
That could be used to logically index matching rows of a dataframe. Or you could sum the logical vector to get a count. Suggestion: Edit your question with an example to use for demonstration.
There are additional parameters that can be used to adjust the tightness of the approximate matching.
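For example, with some invented status values (max.distance = 1 is passed explicitly here so that one edit is allowed, letting "Approve" match as well; the right tolerance for your data is an assumption):

```r
status <- c("-Approved", "Approved", "Approve", "Rejected", "Pending")

# Approximate matching: allow up to one edit against either word
hits <- agrepl("Approved", status, max.distance = 1) |
        agrepl("Rejected", status, max.distance = 1)

hits        # TRUE TRUE TRUE TRUE FALSE
sum(hits)   # 4 matching entries
```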

Splitting strings into elements from a list

A function in a package gives me a character vector in which the original strings are merged together. I need to separate them; in other words, I have to recover the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no separator to base the split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
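With the question's orig and result, this approach recovers the pieces in their order of appearance in result; a sort() then reproduces the desired output:

```r
orig   <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"

# Find the start and length of every match of any original answer
index  <- gregexpr(paste(orig, collapse = "|"), result)[[1]]
starts <- as.numeric(index)
stops  <- starts + attr(index, "match.length") - 1

pieces <- substring(result, starts, stops)
pieces        # "answer3" "answer2" (order of appearance)
sort(pieces)  # "answer2" "answer3"
```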
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning a TRUE/FALSE value. We subset the original values by the resulting logical vector: does the value occur in the string?
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple string matching
some match patterns may contain others; for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
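Putting the pieces together on the question's data, with the fixed = TRUE improvement folded in (note that the output follows the order of orig, not the order inside result):

```r
# Variant of FindSubstrings with literal matching via fixed = TRUE
FindSubstrings <- function(orig, result) {
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}

FindSubstrings(c("answer1", "answer2", "answer3"), "answer3answer2")
# "answer2" "answer3"
```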

How to grep two words in string data?

So I have a data frame where one of the columns is of type character, consisting of strings. I want to find the rows where "foo" and "bar" both occur, but bar can also occur before foo. Basically like an AND operator for regular expressions. How should I do that?
You may try
rowIndx <- grepl('foo', df$yourcol) & grepl('bar', df$yourcol)
rowIndx is a logical TRUE/FALSE vector which can be used for subsetting the column (comment from @Konrad Rudolph). If you need the numeric index, just wrap it with which, i.e. which(rowIndx).
Regular expressions are bad at logical operations. Your particular case, however, can be trivially implemented by the following expression:
(foo.*bar)|(bar.*foo)
However, this is a very inefficient regex and I strongly advise against using it. In practice, you’d use akrun’s solution from the comment: grep for them individually and intersect the result (or do a logical grepl and & the results, which is semantically exchangeable).
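The two equivalent forms can be checked side by side on a small made-up vector:

```r
x <- c("foo then bar", "bar before foo", "just foo", "just bar", "neither")

# Logical AND of two simple greps
rowIndx <- grepl("foo", x) & grepl("bar", x)
which(rowIndx)                              # 1 2

# Same result: grep individually, then intersect the numeric indices
intersect(grep("foo", x), grep("bar", x))   # 1 2
```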