Creating abstract patterns from string data in R - r

I am trying to find patterns in strings in R by assigning tokens to locations. I have a vector of the following form; each element records one person's changes of location over one year. For example, in the first case a person moves from London to New York, back to London, then to Beijing, and finally to Cleveland.
path <- c('Lon-NYC-Lon-Bei-Cle', 'Mos-NYC-Bei-Cle-San', 'Bei-Cle-Bei-NYC-San')
I am trying to look for generic abstract patterns. I want to create a variable called: 'pattern' which gives me A-B-A-C-D for Lon-NYC-Lon-Bei-Cle string, A-B-C-D-E for Mos-NYC-Bei-Cle-San, A-B-A-C-D for Bei-Cle-Bei-NYC-San.
pattern <- c('A-B-A-C-D', 'A-B-C-D-E', 'A-B-A-C-D')
Is there a way I can create this variable?

If you never have more than 26 unique values in a single string, you can use something like this:
sapply(strsplit(path,"-"), function(x)
paste(LETTERS[factor(x, levels=unique(x))], collapse="-")
)
# [1] "A-B-A-C-D" "A-B-C-D-E" "A-B-A-C-D"
Here we use strsplit() to find the different pieces and factor() to take care of identifying duplicate values. Setting levels=unique(x) numbers the values in order of first appearance, and we then use the integer codes underlying the factor to index into the set of upper-case letters.
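The same idea can be sketched with match() instead of factor(), wrapped in a small helper that fails loudly when a string has more unique values than there are tokens. The function name abstract_pattern and its tokens argument are illustrative assumptions, not part of the original answer.

```r
path <- c('Lon-NYC-Lon-Bei-Cle', 'Mos-NYC-Bei-Cle-San', 'Bei-Cle-Bei-NYC-San')

# Hypothetical helper: map each location to a token by order of first appearance
abstract_pattern <- function(p, tokens = LETTERS) {
  parts <- strsplit(p, "-", fixed = TRUE)[[1]]
  # position of each piece among the unique pieces (1 = first seen, 2 = second, ...)
  idx <- match(parts, unique(parts))
  if (max(idx) > length(tokens)) stop("more unique values than available tokens")
  paste(tokens[idx], collapse = "-")
}

pattern <- vapply(path, abstract_pattern, character(1), USE.NAMES = FALSE)
pattern
# [1] "A-B-A-C-D" "A-B-C-D-E" "A-B-A-C-D"
```

Passing a longer token vector (for example one built with paste0()) would lift the 26-value limit.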

Related

Search large string for multiple instances of smaller string in R

In R, I have taken a JSON format of test results and converted them to a data frame of 14 variables and 1101 entries. In this test, the user must select squares in a particular order for a correct score. Under one variable, "input," the values are long strings with info on which square was selected and the time it took to select the square.
Ex:
"[{\"selectedSquare\":\"1\",\"tapTime\":\"00:00:00:06\"},
{\"selectedSquare\":\"0\",\"tapTime\":\"00:00:01:02\"},
{\"selectedSquare\":\"3\",\"tapTime\":\"00:00:02:00\"},
{\"selectedSquare\":\"2\",\"tapTime\":\"00:00:02:07\"}]"
Some entries have more than others, some have none.
I need to search each entry for the square a student selected, and output the order into a new column. Using the example above:
1,0,3,2
I have tried to access each entry individually to test functions on using df$input[1], but it returns a factor with 219 levels. I cannot find a way to only access the relevant piece of the input entry.
You can do this by using an appropriate regular expression. Try:
library(dplyr)
library(stringr)
pattern <- "(?<=\")\\d(?=\")" ## regular expression with look arounds
df$new.col <- sapply(df$input, function(x) {str_extract_all(x, pattern)[[1]] %>% paste(collapse = ",")})
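A base R variant of the same approach is sketched below. It assumes the square is always the value of the "selectedSquare" key, converts the factor to character first, and allows multi-digit squares. The sample input vector and the helper name extract_squares are made up for illustration.

```r
# Made-up sample data in the same shape as the question's input column (a factor)
input <- factor(c(
  '[{"selectedSquare":"1","tapTime":"00:00:00:06"},{"selectedSquare":"0","tapTime":"00:00:01:02"},{"selectedSquare":"12","tapTime":"00:00:02:00"}]',
  '[]'
))

# Hypothetical helper: pull out every selectedSquare value, in order
extract_squares <- function(x) {
  x <- as.character(x)  # factors must be converted before matching
  # lookbehind keeps only the digits that follow "selectedSquare":"
  m <- regmatches(x, gregexpr('(?<="selectedSquare":")\\d+', x, perl = TRUE))
  vapply(m, paste, character(1), collapse = ",")
}

new.col <- extract_squares(input)
new.col
# [1] "1,0,12" ""
```

Entries with no selections come back as empty strings rather than being dropped, so the result can be assigned straight to a new column.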

Creating Sub Lists from A to Z from a Master List

Task
I am attempting to use better functionality (a loop or vectorized code) to parse a larger list down into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list those with the letter B ... the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
# names starting with the i-th letter, in either case
stake <- grep(paste0('^[',letters[i],LETTERS[i],']'), df$uniqueID, value=TRUE)
# pairwise edit distances within this letter's group
stakeDist <- adist(stake, ignore.case=TRUE)
write.table(stakeDist, paste0("stake_",LETTERS[i],".csv"), quote=TRUE, sep=',')
}
Using a combination of paste0 and the built-in letters and LETTERS, this creates your grep expression.
Using grep with value=TRUE, the matching IDs are extracted.
paste0 also creates a unique filename for write.table().
And it is all tied together using a for() loop.
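For the second ask, one possible way to avoid the Excel step is sketched below: split the names by first letter (with an "other" group for everything else) and list only the pairs whose distance falls under a threshold. The sample names, the helper close_pairs, and the threshold max_dist = 2 are all assumptions for illustration.

```r
# Made-up sample names; in practice this would be the uniqueID vector
uniqueID <- c("Company A", "Companyy A", "Acme", "Beta Ltd", "Beta Ltd.", "3M")

# Group by upper-cased first character; anything that is not A-Z goes to "other"
first <- toupper(substr(uniqueID, 1, 1))
group <- ifelse(first %in% LETTERS, first, "other")
by_letter <- split(uniqueID, group)

# Hypothetical helper: return pairs within a group that are close but not identical
close_pairs <- function(x, max_dist = 2) {
  d <- adist(x, ignore.case = TRUE)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(d > 0 & d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = x[idx[, "row"]], b = x[idx[, "col"]], dist = d[idx],
             stringsAsFactors = FALSE)
}

candidates <- do.call(rbind, lapply(by_letter, close_pairs))
candidates
```

The resulting data frame holds one row per suspicious pair, which is usually easier to review than a full distance matrix. A suitable threshold depends on the data and still needs manual checking.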

Fuzzy String Matching in R on numbers separated by hyphens

I am trying to match Cell Phone Tower IDs contained in one table with a master table of locations(in lat long) of Cell Phone Tower IDs. The format of IDs in the locations table are different from the ones in the first table and I am trying to use agrep() to do a fuzzy match. To give you an example, let's say the ID I am trying to match is:
x <- c("405-800-125-39883")
A sample of IDs located in the locations table:
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
"405-810-5005-49883","125-39883","405-810-660-39883")
I am then using agrep() with different combinations of max.distance:
agrep(x,y,max.distance=0.3,value=TRUE)
This returns:
[1] "405-810-1802-19883" "405-810-2101-29883" "405-810-1401-31883" "405-810-5005-49883"
[5] "405-810-660-39883"
Whereas the value that I am really after is "125-39883"
I have also tried the stringdist_join() function from the fuzzyjoin package and applied it to the two data frames by varying max_dist, but with no success. Basically, what I am looking for is a perfect match on the number after the last hyphen, then a match on the number after the second-to-last hyphen, and so on. Is there any way of doing that?
You can vectorize agrep to be able to use all the values of y as the pattern.
Your aim is to look for the whole of y as a part of x. Thus your pattern should be y and not x:
names(unlist(Vectorize(agrep)(y,x)))
[1] "125-39883"
Alternatively, we can use adist with the argument partial=TRUE, which does what agrep does:
y[which.min(c(adist(y,x,partial = T)))]
[1] "125-39883"
If x and y are both vectors, you would rather use adist instead of agrep. All the arguments of agrep are contained in adist. Check ?adist for further details.
With your new question in the comments, you can do something like this:
w=adist(y,x,partial=T)
z=setNames(nchar(sub(".*?(M*)$","\\1",c(attr(adist(y,x,counts=T),"trafos")))),y)
names(which.max(z[which(min(w)==w)]))
[1] "126-39883"
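The matching rule described in the question (exact match on the last hyphen-separated number, then the one before it, and so on) can also be sketched without fuzzy matching, by counting how many trailing segments agree. This is an alternative to the adist approach above, not a reconstruction of it; the helper name trailing_matches is an assumption.

```r
x <- "405-800-125-39883"
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
       "405-810-5005-49883", "125-39883", "405-810-660-39883")

# Hypothetical helper: number of hyphen-separated segments that agree,
# counted from the right until the first mismatch
trailing_matches <- function(a, b) {
  a <- rev(strsplit(a, "-", fixed = TRUE)[[1]])
  b <- rev(strsplit(b, "-", fixed = TRUE)[[1]])
  n <- min(length(a), length(b))
  same <- a[seq_len(n)] == b[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1L
}

scores <- vapply(y, trailing_matches, integer(1), a = x)
best <- names(which.max(scores))
best
# [1] "125-39883"
```

Here "125-39883" scores 2 and "405-810-660-39883" scores 1, so the candidate sharing the longest run of trailing segments wins. Ties are resolved by which.max() in favour of the first candidate.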

Data Frame containing hyphens using R

I have created a list (Based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens in them -.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine, however when I come to aggregate this in order to get some summary statistics I can't reference the dataset as R stops reading after the hyphen (I assume that the hyphen is some special character)
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # ! & | ^ ( [ ' " in them. At ?Quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list. And keep it in a list rather than use assign, it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
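As a rough illustration of why the list is easier to work with, the sketch below uses a small made-up data frame (the Value column and the shortened dimension names are assumptions) and computes a summary per dimension with sapply. List element names may contain hyphens, so nothing needs renaming.

```r
# Made-up data standing in for the real df
df <- data.frame(
  Dimension = c("Age-Gender", "Age-Gender", "Age-Group", "Other"),
  Value = c(1, 3, 10, 99),
  stringsAsFactors = FALSE
)
dim.list <- c("Age-Gender", "Age-Group")

# Subset first, then split into a named list of data frames
nice_list <- df[df$Dimension %in% dim.list, ]
nice_list <- split(nice_list, nice_list$Dimension)

# One summary statistic per dimension, no assign() or backticks needed
means <- sapply(nice_list, function(d) mean(d$Value))
means

# Individual pieces are accessed by name with [[ ]]
nice_list[["Age-Gender"]]
```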

Applying a function over a column in an r data frame where each item is a character string

I have a dataframe, wineSA, with two columns. One of these columns is populated with a character string, like so:
> summary(wineSA$description)
Length Class Mode
129971 character character
An example of a typical entry would be:
review <- "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."
I also have a function, that when applied to a string returns a sentiment score, like so:
> Getting_Sentimental(review)
[1] 0.4317412
I want to apply this function to every element in the wineSA$description column and add the sentiment score, as a separate column, to the data frame wineSA.
I have tried the following method, which uses apply(), but I get this message:
> wineSA$reviewSentiment <- apply(wineSA$description, FUN = Getting_Sentimental)
Error in apply(wineSA$description, FUN = Getting_Sentimental) :
dim(X) must have a positive length
I'm not sure apply() is appropriate here, but when I use either sapply() or lapply() it populates the new column with the same value for the sentiment.
Is there a special way of handling functions on string characters? Is there anything I'm missing?
Thanks
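The error arises because apply() expects an object with dimensions (a matrix or data frame), while wineSA$description is a plain character vector. A minimal sketch of the element-wise call with vapply() follows; the Getting_Sentimental defined here is a hypothetical stand-in, since the real function is not shown, and the sample descriptions are made up.

```r
# Hypothetical stand-in for the real scoring function (not shown in the question)
Getting_Sentimental <- function(text) nchar(text) / 100

# Made-up sample data
wineSA <- data.frame(
  description = c("Brisk acidity.", "Aromas include tropical fruit and dried herb."),
  stringsAsFactors = FALSE
)

# Call the function once per review; numeric(1) checks that each call returns one number
wineSA$reviewSentiment <- vapply(wineSA$description, Getting_Sentimental,
                                 numeric(1), USE.NAMES = FALSE)
wineSA
```

If every row still receives the same score with the real function, a likely cause is that the function ignores its argument and reads a global variable such as review instead, which would be worth checking inside Getting_Sentimental.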
