I am trying to understand how R deals with string manipulation and comparisons.
To this end I have set up two data frames, one which is my raw data and the other which is my reference data to which I would like to compare. I'm trying to understand the different ways of comparing strings and how to compare data frames in general (it seems far easier in SQL where you can just use the key word contains).
For the example below, the first item is the reference data and the second is the raw data.
grepl ("1845","UN1845")
Will return TRUE
any ("1845"=="UN1845")
Will return FALSE (I assume here because the word has to match fully)
is.element ("1845","UN1845")
Will return FALSE (same reason as with any)
If I wanted to check the entire data reference table against each and every item in the raw table, how would I go about this?
From playing around I could do something like
grepl(Raw$Contents, Ref$desc)
Where the Raw data is basically strings and the ref data is strings. However when I run something like this, I get the message:
In grepl(Raw$Contents, MyCode$desc)
argument 'pattern' has length > 1 and only the first element will be used
I assume this is related to the fact that the table size for the reference table is different to the table I'm running comparisons against.
Sample data:
rawdata = data.frame(A=c("UN1845","FROZEN FOOD DRY ICE","LTD QTY8000"))
refdata = data.frame(A=c("1845","8000"))
The warning message means that your pattern argument has more than one element, but grepl and its family accept only one pattern at a time. You will have to loop (or *apply) over each pattern in your refdata collection.
EDIT: To clarify, grepl accepts only one pattern, but if that pattern covers the complete search set, e.g. via the OR operator (|), grepl will work as desired. Thanks to David Arenburg for his comments.
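Both approaches can be sketched with the sample data from the question. This is only a minimal illustration: the names hits, any_hit_loop and any_hit_regex are made up for this example, and fixed = TRUE assumes the reference values should be matched literally rather than as regular expressions.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps the columns as text)
rawdata <- data.frame(A = c("UN1845", "FROZEN FOOD DRY ICE", "LTD QTY8000"),
                      stringsAsFactors = FALSE)
refdata <- data.frame(A = c("1845", "8000"), stringsAsFactors = FALSE)

# Option 1: call grepl once per reference value.
# The result is a logical matrix: one row per raw string, one column per pattern.
hits <- sapply(refdata$A, function(p) grepl(p, rawdata$A, fixed = TRUE))
any_hit_loop <- apply(hits, 1, any)

# Option 2: collapse the reference values into one regular expression
# using the OR operator, then call grepl a single time.
combined <- paste(refdata$A, collapse = "|")   # "1845|8000"
any_hit_regex <- grepl(combined, rawdata$A)

# Rows of the raw data that contain at least one reference value
matched_rows <- rawdata[any_hit_regex, , drop = FALSE]
```

Option 1 keeps track of which reference value matched which row; option 2 is shorter but only says whether anything matched, and it assumes the reference values contain no regex metacharacters.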
I want to learn how to access data from a nested list in R. I am relatively new to the R programming language, so I am unsure how to proceed.
The data is a "Large list (947 elements, 654.9 Mb)" and takes the form:
The numbers within the datalist refer to station numbers and when I click on one (in Rstudio) it looks like this:
I want to know how I can access the data within 'doy', for example. I have tried:
data[[1]]
which returns all the data for the first element of the list (site, location, doy,ltm etc). So clearly the number used within the square brackets is interpreted as an index for the list, as opposed to an identifier for the elements/station in the list.
Then I tried:
data$1
but it returned the error:
Error: unexpected numeric constant in "data$1"
Then I tried:
data[data$1==doy]
But was returned this:
Error: unexpected numeric constant in "data[data$1"
So at this point, I realise that it is not construing the number of the station as a category/factor within the list. It's just reading it as a number. So I thought I'd put some quotes around it to see if that changed what happened:
data[data$"1"=="doy"]
This returned
named list()
But when I looked at it in the environment, it was a list of 0.
I looked at some of the similar questions here on Stack (like: accessing nested lists in R) and tried:
data[data$"1"=="doy",][[1]]
But just got:
Error in data[data$"1" == "doy", ] : incorrect number of dimensions
How can I access this data? It reminds me of a structure in Matlab, but it doesn't seem to be indexed in a similar fashion in R.
Let's look at some ways to do what you want:
data[[1]]
This returns the first element of the list, which is itself a list. You can use the $ subsetting shorthand, but the name of the first element is nonstandard. R prefers names that start with letters and include only alphanumeric characters, periods and underscores. You can escape this behavior with backticks:
data$`1`
If you want to access one of the elements of list 1 in your list of lists, you need to subset further. To get to doy, which is the third element of 1, you can use any of these four ways:
data[[1]][[3]]
data$`1`[[3]]
data[[1]]$doy
data$`1`$doy
One way (in addition to what Ben Norris has shown):
our_list[[c("1", "doy")]]
Reproducible example data (please provide next time)
our_list <- list(`1` = list(site = "x", doy = 3))
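The access patterns from both answers can be tried on a slightly larger toy list. The station names and values below are invented stand-ins for the real data, and the location element is assumed only so that doy sits in third position as described above.

```r
# Toy stand-in for the real data: a list of stations, each itself a list
our_list <- list(
  `1` = list(site = "x", location = "y", doy = c(1, 2, 3)),
  `2` = list(site = "z", location = "w", doy = c(4, 5, 6))
)

by_position <- our_list[[1]][[3]]           # first station, third element
by_name     <- our_list$`1`$doy             # backticks for the nonstandard name
by_string   <- our_list[["1"]][["doy"]]     # names given as strings
by_path     <- our_list[[c("1", "doy")]]    # recursive indexing with a vector

# Pull doy out of every station at once; the result is a list named by station
all_doy <- lapply(our_list, function(station) station$doy)
```

Indexing with the string "1" looks up the element named 1, whereas the number 1 picks the first element by position; the two only coincide when the list happens to be in that order.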
Here is a challenge for you: I was trying to make a tic-tac-toe game in R. First, the players set up the game by entering their names, and the game should check whether each name exists in a file called "Players.txt" (if the file does not exist, the game creates it); if the name already exists, the game asks for a new one. Finally, the game should record the players' scores (each chip played subtracts 5 points from the 100 that a player has at the beginning of the game). The problem is that when a player wins, the game shows the following error: "Error in table[location_name1, 3]: Incorrect number of dimension in R".
A vector can either be atomic or a list. Atomic vectors can only contain elements of one and the same data type. That means, you are "accidentally" creating a list with
vector=c(win,name1,name2,table)
with the result that each column of the data frame becomes a separate entry.
You can solve it with
vector <- list(win, name1, name2, table)
vector is still a list but now it has the format I believe you want.
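The difference between c() and list() here can be checked on a small stand-in. The values below are invented, and the data frame is called score_table rather than table purely for this sketch.

```r
# Invented stand-ins for the objects in the question
win   <- FALSE
name1 <- "Ann"
name2 <- "Bob"
score_table <- data.frame(gamers = c("Ann", "Bob"),
                          games  = c(1, 1),
                          scores = c(100, 100),
                          stringsAsFactors = FALSE)

# c() flattens the data frame: its three columns become three separate entries
flattened <- c(win, name1, name2, score_table)
n_flat <- length(flattened)       # 3 scalars + 3 columns

# list() keeps the data frame intact as a single entry
wrapped <- list(win, name1, name2, score_table)
n_wrapped <- length(wrapped)

# [[ ]] returns the data frame itself; [ ] returns a one-element list around it
kept_df  <- is.data.frame(wrapped[[4]])
still_ls <- is.data.frame(wrapped[4])
```

The last two lines are why single brackets later in the code lead to the "incorrect number of dimensions" error: a one-element list has no rows and columns to index.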
Having done that you still get errors. The reason is that these assignments fail.
location_name1=which(grepl(name1,table$gamers))
location_name2=which(grepl(name2,table$gamers))
They return an empty vector because earlier in the code you set win=vector[1]... table=vector[4]. Since vector is now a list, you have to subset it accordingly. That means you have to change the statements to table=vector[[4]].
Now you are going to get another problem. The reason is that you treat the column table$scores as text. When you read the data, you need to make sure that this column is not interpreted as text. You also have to remove all statements that coerce the column to text. Otherwise table[location_name1,3]=table[location_name1,3]+pointsx will fail, because you cannot add a number to a string.
For example, you coerce the column into a character column with this statement:
name1 <- data.frame(gamers=name1,games="1",scores="100")
games and scores are strings, not numbers. Another example is the assignment after reading the table from the file. You can make sure that scores are numeric by doing this.
scores <- as.numeric(table[,3])
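As a rough sketch of the whole fix, the columns can be converted once, right after the table is built or read, so that the later arithmetic works. The player names and the value of pointsx below are invented, and the data frame is named score_table only for this example.

```r
# A table whose numeric columns arrived as text, as in the question
score_table <- data.frame(gamers = c("Ann", "Bob"),
                          games  = c("1", "1"),
                          scores = c("100", "100"),
                          stringsAsFactors = FALSE)

# Convert the text columns to numbers once
score_table$games  <- as.numeric(score_table$games)
score_table$scores <- as.numeric(score_table$scores)

# Locate the player by exact name and subtract the points for one chip
location_name1 <- which(score_table$gamers == "Ann")
pointsx <- -5
score_table[location_name1, "scores"] <-
  score_table[location_name1, "scores"] + pointsx
```

Comparing with == instead of grepl is a deliberate swap in this sketch: it avoids matching "Ann" inside a longer name such as "Anna".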
Please get familiar with RStudio's debugging capabilities (https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio). This way you can go through your code line by line and check the consequences of each assignment to the data frame.
The text column can hold up to 100 letters for each entry. How can I write a script that recognizes the word "Approved" or "Rejected"? Sometimes the word will be "-Approved", "Approved","Approved" or "Approve". I want it to account for each scenario with a "LIKE" type of function.
There are two words I am looking for, so "OR" may be more applicable here than a range.
R has a pair of approximate-matching functions, agrep and agrepl, which, like grep and grepl, return a vector when given a vector. agrepl returns a logical vector of the same length as the input, so it works better in cases like this:
agrepl("Approved", df$text_col) | agrepl("Rejected", df$text_col)
That could be used to logically index matching rows of a dataframe. Or you could sum the logical vector to get a count. Suggestion: Edit your question with an example to use for demonstration.
Additional parameters, such as max.distance, adjust how strict the approximate matching is.
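Since the question has no example data, here is a small made-up data frame to demonstrate the idea. The column name text_col follows the answer above; the strings themselves are invented, and the default max.distance is assumed.

```r
# Invented example data
df <- data.frame(text_col = c("-Approved", "Approve",
                              "Rejected by manager", "Pending review"),
                 stringsAsFactors = FALSE)

# Approximate match for either word; one logical value per row
is_decided <- agrepl("Approved", df$text_col) | agrepl("Rejected", df$text_col)

decided_rows <- df[is_decided, , drop = FALSE]   # logical indexing of rows
n_decided    <- sum(is_decided)                  # count of matching rows

# Exact alternative using the regex OR operator on the common word stems
is_decided_regex <- grepl("Approve|Reject", df$text_col)
```

The grepl line is a stricter alternative: it needs no fuzzy matching as long as every variant shares the stem "Approve" or "Reject", but it will miss misspellings that agrepl would tolerate.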
I'm a bit of an R novice and have been trying to experiment a bit using the agrep function in R. I have a large data base of customers (1.5 million rows) of which I'm sure there are many duplicates. Many of the duplicates though are not revealed using the table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that are "unique" because there was a minor miss-key in the spelling of the name.
So far, to find all of the duplicates in my data set, I have been using agrep() to accomplish the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches. I think I have found a happy medium between returning false positives and missing out on true matches. As the agrep() is limited to matching a single pattern at a time, I was able to find an entry on stack overflow to help me write a sapply code that would allow me to match the data set against numerous patterns. Here is the code I am using to loop over numerous patterns as it combs through my data sets for "duplicates".
dups4 <- data.frame(unlist(sapply(unique$name, agrep, value = TRUE, max.distance = 0.154, x = vf$name)))
unique$name= this is the unique index I developed that has all of the "patterns" I wish to hunt for in my data set.
vf$name= is the column in my data frame that contains all of my customer names.
This coding works well on a small scale of a sample of 600 or so customers and the agrep works fine. My problem is when I attempt to use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code that I have used? I have not yet tried anything out of the plyr package. Perhaps this might be faster... I am a little unfamiliar though with using the ddply or llply functions.
Any suggestions would be greatly appreciated.
I'm so sorry, I missed this last request to post a solution. Here is how I solved my agrep, multiple pattern problem, and then sped things up using parallel processing.
What I am essentially doing is taking a whole vector of character strings and fuzzy matching them against themselves to find out whether the vector contains any fuzzy-matched duplicate records.
Here I create a cluster of twenty worker processes for parSapply to run on (makeCluster and parSapply come from the parallel package):
library(parallel)
cl <- makeCluster(20)
So let's start with the innermost nesting of the code, parSapply. This is what allows me to run agrep() as a parallel process. The first argument is cl, the cluster object created above, which determines the workers the job is spread across.
The second argument is the vector of patterns I wish to match. The third argument is the function that does the matching (in this case agrep). The remaining arguments are passed on to agrep(). I have set value to TRUE so that the actual character strings are returned (not their positions), and I have set max.distance to the largest cost I am willing to accept in a fuzzy match, in this case 2. The last argument is the full vector of strings to be searched for each pattern from argument 2. Since I am looking for duplicates, I match the vector against itself. The final output is a list, so I use unlist() and then wrap it in a data frame to get a table of matches. From there I can run a frequency table on that result to find which fuzzy-matched strings have a frequency greater than 1, which tells me that such a pattern matched itself and at least one other string in the vector.
truedupevf <- data.frame(unlist(parSapply(cl,
    s4dupe$fuzzydob, agrep, value = TRUE,
    max.distance = 2, x = s4dupe$fuzzydob)))
stopCluster(cl)  # release the worker processes when done
I hope this helps.
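For anyone who wants to try the approach on something small first, here is a scaled-down sketch. The names are invented, only two workers are started instead of twenty, and simplify = FALSE is added (an assumption beyond the code above) so that the result is always a list.

```r
library(parallel)

# Invented customer names; the first two differ by one keystroke
customer_names <- c("John Smith", "Jon Smith", "Mary Jones", "Alice Brown")

cl <- makeCluster(2)   # a cluster with two worker processes
# Each name is used as a pattern and fuzzy matched against the whole vector
matches <- parSapply(cl, customer_names, agrep,
                     x = customer_names, value = TRUE, max.distance = 2,
                     simplify = FALSE)
stopCluster(cl)        # always release the workers

# Every name matches itself, so more than one match suggests a duplicate
match_counts   <- lengths(matches)
possible_dupes <- names(match_counts)[match_counts > 1]
```

This still compares every name with every other name, so the work grows with the square of the number of names; the cluster only divides that work among the workers.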
How can I select the second column of a dynamically named variable?
I create variables of the form "population.USA", "population.Mexico", "population.Canada". Each variable has a column for the year, and another column for the population value. I would like to select the second column from each of these variables during a loop.
I use this syntax:
sprintf("population.%s", country)[, 2]
R returns the error: Error in sprintf("population.%s", country)[, 2] : incorrect number of dimensions
Based on your sequence of questions over the last few minutes, I have two general recommendations for you as you get familiar with R:
Don't use sprintf.
Don't use assign.
Now, obviously, those functions are both useful at times. But you've learned about them too early, before you've mastered some basic stuff about R's data structures. Try to write code without those crutches (for the time being!), as they're just causing you problems.
Rather than creating separate individual variables for each nation's population, place them in a list.
population <- vector("list",3)
names(population) <- c('USA','Mexico','Russia')
Then you can access each using the string representation of the name of each country:
population[['USA']] <- 10000
Or,
region <- 'USA'
population[[region]]
In this example, I've assigned a single value to a list element, but lists will hold any other data type, including matrices or data frames. It will be a lot less typing than using sprintf and assign, and a lot safer and more efficient as well.
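Applied to the original question, a minimal sketch might look like the following. The years and population figures are placeholder numbers made up for illustration, and the column names year and pop are assumptions.

```r
# One data frame per country, stored in a named list (placeholder values)
population <- list(
  USA    = data.frame(year = c(1, 2, 3), pop = c(10, 11, 12)),
  Mexico = data.frame(year = c(1, 2, 3), pop = c(20, 21, 22)),
  Canada = data.frame(year = c(1, 2, 3), pop = c(30, 31, 32))
)

# Select the second column of each country's data inside a loop
second_columns <- list()
for (country in names(population)) {
  second_columns[[country]] <- population[[country]][, 2]
}

# The same idea without an explicit loop: last population value per country
latest <- sapply(population, function(d) d[nrow(d), 2])
```

Because the country name is just a string used as a list index, no sprintf, assign or get is needed.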
See ?get. Here is an example:
> country <- "FOO"
> assign(sprintf("population.%s", country), data.frame(runif(5), runif(5)))
>
> get(sprintf("population.%s", country))[,2]
[1] 0.2241105 0.5640709 0.5945869 0.1830719 0.1895938
It is critically important to look at the object returned by a function if you get an error. It is immediately clear why your example fails if you just look at what it returns:
> sprintf("population.%s", country)
[1] "population.FOO"
At that point it would be immediately clear, if you didn't already know or hadn't thought to read ?sprintf, that sprintf() returns a string, not the object of that name. Armed with that knowledge, you would have narrowed the problem down to: how do I recall an object from a computed name?
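The same check can be written out in a deterministic form. In this sketch the objects live in a separate environment called env (an assumption made to keep the example self-contained), and the data values are placeholders.

```r
env <- new.env()   # a scratch environment for the dynamically named objects
country <- "FOO"

# sprintf() only builds the name; it is a character string, not a data frame
object_name <- sprintf("population.%s", country)
name_is_text <- is.character(object_name)

# assign() stores an object under that name, get() retrieves it by name
assign(object_name, data.frame(year = c(1, 2), pop = c(10, 20)), envir = env)
second_col <- get(object_name, envir = env)[, 2]

# mget() fetches several objects by name at once and returns them as a list
as_list <- mget(ls(env), envir = env)
```

The mget() call at the end returns a named list, which is one way to move from separately named variables to the list-based layout recommended in the other answer.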