What is an efficient and simple way in R to do the following?
1. read in two-column data from a file
2. use this information to build some kind of translation dictionary, like a Python dict
3. apply the translation to the content of a vector in order to obtain the translated vector, possibly for several vectors but using the same correspondence information
I thought the hash package would help me do that, but I'm unsure whether I'm performing step 3 correctly.
Say my initial vector is my_vect and my hash is my_dict
I tried the following:
values(my_dict, keys=my_vect)
The following observations make me doubt that I'm doing it the proper way:
- The operation seems slow (more than one second on a powerful desktop computer for a vector of 582 entries and a hash of 46665 entries).
- The result doesn't look homogeneous with my_vect: while my_vect is displayed with numeric indices (integer numbers between square brackets appear beside the values in the interactive console), the result of calling values as above still looks somewhat like a dictionary: each translated value is displayed with its original value (i.e. the hash key) above it. I just want the values.
Edit:
If I understand correctly, R has a way of using "names" instead of numerical indices for vectors, and what I obtain using the values function is such a named vector. It seems to work for what I wanted to do, although I imagine it uses more memory than necessary.
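For the record, the same lookup can be done in base R with a plain named vector, and unname() drops the names if they are not wanted (a minimal sketch with made-up data):
translations <- setNames(c("A", "B", "C", "D"), c("a", "b", "c", "d"))
my_vect <- c("b", "c", "c")
unname(translations[my_vect])
# [1] "B" "C" "C"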
I tried libraries hash and hashmap, and the second seemed more efficient.
A small usage example:
> library(hashmap)
> keys = c("a", "b", "c", "d")
> values = c("A", "B", "C", "D")
> my_dict <- hashmap(keys, values)
> my_vect <- c("b", "c", "c")
> translated <- my_dict$find(my_vect)
> translated
[1] "B" "C" "C"
To build the dictionary from a table obtained using read.table, the option stringsAsFactors = FALSE of read.table has to be used, otherwise weird things happen (see discussion in the comments of https://stackoverflow.com/a/38838271/1878788).
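For completeness, a sketch of building the dictionary from a two-column file (the file name and column names here are made up for the example):
# two-column file: one key/value pair per line
correspondence <- read.table("correspondence.txt",
                             col.names = c("key", "value"),
                             stringsAsFactors = FALSE)
my_dict <- hashmap(correspondence$key, correspondence$value)
translated <- my_dict$find(my_vect)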
Did you try the str_replace_all function from the stringr package?
Let's say you have a dictionary data frame dict with columns original and replacement. The following code replaces all instances of original with replacement in the vector.
library(stringr)
translations <- setNames(dict$replacement, dict$original)
new_vect <- str_replace_all(vect, fixed(translations))
I'm not sure whether it implements hashing, but the underlying work is done in C code from the stringi package, so it should be fast.
The one case where that won't work as is, is when some of the words in original contain other words in original. In that case you can add regular-expression start-of-string (^) and end-of-string ($) anchors to the strings you want to replace, and drop the fixed() wrapper so that the anchors are treated as regular expressions:
translations <- setNames(dict$replacement, paste0("^", dict$original, "$"))
new_vect <- str_replace_all(vect, translations)
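A small worked example with a made-up dictionary and input vector, to illustrate the difference:
library(stringr)
dict <- data.frame(original    = c("cat", "category"),
                   replacement = c("CAT", "CATEGORY"),
                   stringsAsFactors = FALSE)
vect <- c("cat", "category", "dog")

# fixed-string version: "cat" also rewrites the start of "category"
translations <- setNames(dict$replacement, dict$original)
str_replace_all(vect, fixed(translations))

# anchored version: only whole-string matches are replaced
anchored <- setNames(dict$replacement, paste0("^", dict$original, "$"))
str_replace_all(vect, anchored)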
Related
Task
I am attempting to use a better approach (a loop or vectorized code) to split a larger list into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list those with the letter B, ... and the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to identify which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled, e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I created a list of all the entries that begin with the letter A and then calculated the "distance" between each pair of entries (the result is a matrix).
Ask One
I can copy and paste this code 26 times and move my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
  # extract the IDs starting with the i-th letter (either case)
  stake <- subset(df, grepl(paste0('^[', letters[i], LETTERS[i], ']'), uniqueID))$uniqueID
  # distance matrix for this letter group
  stakeDist <- adist(stake, ignore.case = TRUE)
  # one output file per letter
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote = TRUE, sep = ',')
}
Using a combination of paste0 and the built-in letters and LETTERS vectors, this creates the matching pattern for each letter.
Using subset, the correct IDs are extracted
paste0 will also create a unique filename for write.table().
And it is all tied together using a for()-loop
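For Ask Two, one way to avoid the round trip through Excel is to flag the close pairs directly in R. A minimal sketch for one letter group, reusing stake from the loop above (the 1-to-10 threshold comes from your description and may need tuning):
d <- adist(stake, ignore.case = TRUE)
# indices of pairs whose edit distance is between 1 and 10
close <- which(d >= 1 & d <= 10, arr.ind = TRUE)
close <- close[close[, 1] < close[, 2], , drop = FALSE]  # keep each pair only once
data.frame(name1 = stake[close[, 1]],
           name2 = stake[close[, 2]],
           distance = d[close])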
If you load the pracma package into the R console and type
gammainc(2,2)
you get
lowinc uppinc reginc
0.5939942 0.4060058 0.5939942
This looks like some kind of a named tuple or something.
But I can't work out how to extract the number below lowinc, namely 0.5939942. The code (gammainc(2,2))[1] doesn't work; we just get
lowinc
0.5939942
which isn't a number.
How is this done?
As can be checked with str(gammainc(2,2)[1]) and class(gammainc(2,2)[1]), the output mentioned in the OP is in fact a number. It is just a named number. The names used as attributes of the vector are supposed to make the output easier to understand.
The function unname() can be used to obtain the numerical vector without names:
unname(gammainc(2,2))
#[1] 0.5939942 0.4060058 0.5939942
To select the first entry, one can use:
unname(gammainc(2,2))[1]
#[1] 0.5939942
In this specific case, a clearer version of the same might be:
unname(gammainc(2,2)["lowinc"])
Double brackets will strip the names:
gammainc(2,2)[[1]]
gammainc(2,2)[["lowinc"]]
I don't claim it to be intuitive, or obvious, but it is mentioned in the manual:
For vectors and matrices the [[ forms are rarely used, although they
have some slight semantic differences from the [ form (e.g. it drops
any names or dimnames attribute, and that partial matching is used for
character indices).
The partial matching can be employed like this
gammainc(2, 2)[["low", exact=FALSE]]
In R, vectors may have a names() attribute. Here is an example:
vector <- c(1, 2, 3)
names(vector) <- c("first", "second", "third")
If you display vector, you get the desired output:
> vector
 first second  third
     1      2      3
To check what type of object a function returns, you can use:
class(your_function())
I hope this helps.
Consider:
chars <- c("A", "B", "C")
string <- c("B", "C")
chars[!(chars %in% string)]
So I want to get the character(s) that are not in string.
The code works, but I feel like it's kind of inconvenient.
Is there a function in R which returns the actual value directly, instead of evaluating TRUE/FALSE and then indexing?
As akrun mentioned, you want to use
setdiff(chars, string)
More generally, setdiff() shares its help page with union() and other useful (and, I feel, underused) functions for performing operations on sets, such as
intersect()
which more directly answers the phrasing of your initial question.
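With the vectors from the question, a quick illustration:
chars  <- c("A", "B", "C")
string <- c("B", "C")
setdiff(chars, string)    # elements of chars not in string
# [1] "A"
intersect(chars, string)  # elements present in both
# [1] "B" "C"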
I am trying to subset a large data frame with my columns of interest. I do so using the grep function, this selects one column too many ("has_socio"), which I would like to remove.
The following code does exactly what I want, but I find it unpleasant to look at. I want to do it in one line. Aside from just calling the first subset inside the second subset, can it be optimized?
library(foreign)  # read.dta() is from the foreign package
DF <- read.dta("./big.dta")
DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))
I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions) but I do not find one that addresses this issue specifically. I recognize I could just write the grep RE more carefully, but I feel the above code more clearly expresses my intent.
Thanks.
Replace your grep with:
vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value=T, perl=T)
# [1] "has_bacon" "has_ham"
(?!...) is a negative lookahead, which makes sure that its contents do not follow the part of the pattern matched so far (has_ being that part).
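Plugged back into your original call (keeping the other_named_vars placeholder from the question), the whole thing becomes a single line:
DF0 <- na.omit(subset(DF, select = c(other_named_vars,
                                     grep("^has_(?!socio$)", names(DF), perl = TRUE))))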
setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"
Can I search a character list for a string where I don't know how the string is cased? Or more generally, I'm trying to reference a column in a dataframe, but I don't know exactly how the columns are cased. My thought was to search names(myDataFrame) in a case-insensitive manner to return the proper casing of the column.
I would suggest the grep() function and some of its additional arguments that make it a pleasure to use.
grep("stringofinterest",names(dataframeofinterest),ignore.case=TRUE,value=TRUE)
without the argument value=TRUE you will only get a vector of index positions where the match occurred.
Assuming that there are no variable names which differ only in case, you can search your all-lowercase variable name in tolower(names(myDataFrame)):
match("b", tolower(c("A","B","C")))
[1] 2
This will produce only exact matches, but that is probably desirable in this case.
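To get back the column name in its original casing, the index returned by match() can then be used on the unmodified names; a small made-up illustration:
nms <- c("FirstName", "LastName", "Age")   # hypothetical column names
nms[match("lastname", tolower(nms))]
# [1] "LastName"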
With the stringr package, you can modify the pattern with one of the built-in modifier functions (see ?modifiers). For example, since we are matching a fixed string (no special regular expression characters) but want to ignore case, we can do
str_detect(colnames(iris), fixed("species", ignore_case=TRUE))
Or you can use the (?i) case insensitive modifier
str_detect(colnames(iris), "(?i)species")
For anyone using this with %in%, simply use tolower on the right (or both) sides, like so:
"b" %in% c("a", "B", "c")
# [1] FALSE
tolower("b") %in% tolower(c("a", "B", "c"))
# [1] TRUE
The searchable package was created to allow various types of searching within objects:
library(searchable)
l <- list( a=1, b=2, c=3 )
sl <- searchable(l) # make the list "searchable"
sl <- ignore.case(sl) # turn on case insensitivity
> sl['B']
$b
[1] 2
It works with lists and vectors and does a lot more than simple case-insensitive matching.
If you want to search for one set of strings in another set of strings, case insensitively, you could try:
s1 = c("a", "b")
s2 = c("B", "C")
matches = s1[ toupper(s1) %in% toupper(s2) ]
Another way of achieving this is to use str_which(string, pattern) from the stringr package:
library("stringr")
str_which(string = tolower(colnames(iris)), pattern = "species")
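As str_which() returns the matching positions, the original casing can be recovered the same way as with match() above:
colnames(iris)[str_which(tolower(colnames(iris)), "species")]
# [1] "Species"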