Display unique characters in dataframe column - r

I have a data frame of unique object names. I need to create a list of all unique characters used in the column. Is there a way to display a list of all unique characters used in a column?
Example data:
Performance PM
Truck Tips - B - 2001
ASE Parts Specialist (P1)
Output:
a
b
c
d
-
(
1
2

data <- "Example data:
Performance PM
Truck Tips - B - 2001
ASE Parts Specialist (P1)"
sort(unique(strsplit(data,"")[[1]]))
[1] "-" " " "\n" "(" ")" ":" "0" "1" "2" "a" "A" "B" "c" "d" "e" "E" "f" "i" "k" "l" "m" "M" "n" "o" "p" "P"
[27] "r" "s" "S" "t" "T" "u" "x"
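The answer above works on a single string, so the header line and the "\n" characters end up in the result. For an actual data frame column, a minimal sketch could split each element and pool the pieces. The data frame `df` and its column `name` are hypothetical names chosen for illustration:

```r
# Hypothetical data frame; `df` and `name` are illustrative names
df <- data.frame(
  name = c("Performance PM", "Truck Tips - B - 2001", "ASE Parts Specialist (P1)"),
  stringsAsFactors = FALSE
)

# Split every element into single characters, pool them, and deduplicate
chars <- unique(unlist(strsplit(df$name, "")))

# Optional: fold case so "a" and "A" are counted once
# (the sort order depends on the locale)
chars_folded <- sort(unique(tolower(chars)))
```

Because each row is split separately, no newline characters are introduced.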

Related

Sampling alternating blocks of character strings (words) from two fixed lists in R

I am attempting to create a 5000 word vector composed of 500 blocks of 10 words. One block is drawn from sampling with replacement from a fixed list of animals, and this block is to alternate with a fixed list of foods. The following code yields one iteration of what I need:
anim<- data.frame(cbind(stim=list.sample(animals$WORD, 10, replace=T), cond="animal"))
food <- data.frame(cbind(stim=list.sample(foods$WORD, 10, replace=T), cond="food"))
both <- data.frame(rbind(anim, food))
This yields output as follows:
I just cannot figure out how to repeat this procedure 499 more times to create the total vector I need. I will be running semantic distances between clusters to determine whether I can autosegment the boundaries between foods and animals. I attempted a repeat loop, to no avail.
Thanks for any ideas!
Since you did not provide any reproducible data, we will assume that LETTERS are food and letters are animals. This line of code generates the vector you specified. Here we are only using batches of 5 to illustrate the process:
result <- as.vector(replicate(5, c(sample(LETTERS, 5, replace=TRUE), sample(letters, 5, replace=TRUE))))
result
# [1] "H" "O" "T" "K" "J" "m" "c" "s" "u" "c" "P" "Y" "V" "U" "Y" "p" "u" "q" "k" "l" "B" "H" "U" "F" "K" "h" "v" "g"
# [29] "c" "d" "X" "F" "R" "N" "U" "v" "t" "u" "q" "x" "N" "E" "G" "Q" "L" "d" "a" "v" "e" "a"
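Scaled up to the sizes in the question, the same idea might look like the sketch below. The word lists `animals` and `foods` are made-up stand-ins for `animals$WORD` and `foods$WORD`, and the `cond` column is rebuilt with `rep()` so each word keeps its label:

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical stand-ins for animals$WORD and foods$WORD
animals <- c("cat", "dog", "horse", "owl")
foods   <- c("bread", "rice", "apple", "soup")

block_size <- 10
n_pairs    <- 250  # 250 animal blocks + 250 food blocks = 500 blocks

# Each replicate yields one animal block followed by one food block
stim <- as.vector(replicate(
  n_pairs,
  c(sample(animals, block_size, replace = TRUE),
    sample(foods,   block_size, replace = TRUE))
))

# Matching condition labels: 10 x "animal", 10 x "food", repeated 250 times
cond <- rep(rep(c("animal", "food"), each = block_size), times = n_pairs)

both <- data.frame(stim, cond, stringsAsFactors = FALSE)
```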

Opposite of "bitwAnd" Function in R?

I found this function in R that can generate the "power set" for a set of elements:
f <- function(set) {
n <- length(set)
masks <- 2^(1:n-1)
lapply( 1:2^n-1, function(u) set[ bitwAnd(u, masks) != 0 ] )
}
results = f(LETTERS[1:5])
results = sapply(results, paste, collapse = " ")
In a previous question (Subsetting Elements in a "Hypothetical" List), I learned how to interact with very large "power sets" that the computer cannot load into memory. For example, suppose I wanted to make the "power set" for all 26 letters in the English alphabet (this set would contain 2^26 = 67108864 elements). I could find out the "13626980"th element in this list without actually generating the list (since it would be impossible to generate/store such a big list):
LETTERS[bitwAnd(13626980, 2^(1:26-1)) != 0]
[1] "C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"
I had the following question: is it possible to do the "opposite" of this task?
For example, given the sequence of letters ("C" "F" "G" "J" "K" "L" "N" "O" "P" "Q" "R" "S" "T" "W" "X"), can some function determine which number (13626980) it corresponds to? Is there some hypothetical function like:
#input
> hypothetical_function(c("C", "F", "G", "J", "K", "L", "N", "O", "P", "Q", "R", "S", "T", "W", "X"))
#output
13626980
Is this possible?
Thank you!
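One possible sketch of such an inverse, assuming the same encoding as above (the letter at position k of the set contributes 2^(k-1) to the number). The function name `subset_to_index` is made up for illustration:

```r
# Map a subset back to its number: each member at position k adds 2^(k - 1)
subset_to_index <- function(subset, set = LETTERS) {
  pos <- match(unique(subset), set)  # positions of the members within the set
  stopifnot(!anyNA(pos))             # every member must belong to the set
  sum(2^(pos - 1))
}

idx <- subset_to_index(c("C", "F", "G", "J", "K", "L", "N", "O",
                         "P", "Q", "R", "S", "T", "W", "X"))
idx
# [1] 13626980
```

The result is a double rather than an integer, which stays exact for sets of up to 53 elements.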

Split string into phonemic segments

I can't seem to figure out a seemingly simple task:
I have phonemic transcriptions of speech. To count the phonemes I want to split the strings into single phonemic segments. Unfortunately, the characters used for the phonemes are not 100% different from each other. For example, a long /i/ sound is transcribed iː (NB: ː is not a colon but a special char!) whereas a short /i/ sound may occasionally be transcribed i. This double use of the i in two distinct phonemes causes a problem in the split operation:
Test data:
test1 <- "dɪd ɛnɪbɒdi liːv ðeə glɑːsɪz hɪə lɑːst wiːk sʌmbədi dɪd"
A vector of all phonemes:
phonemes <- c("ɪə","eɪ","ʊə","ɔɪ","aɪ","eə","aʊ","əʊ", # diphthongs
"iː","uː","ɜː","ɔː","ɑː", # long vowels
"ə","ɪ", "ɛ","ɒ","ʌ","æ","i","ʊ", # short vowels
"j","w", # semi-vowels
"r","l", # approximants
"n","m","ŋ", # nasals
"f","v","θ","ð","s","z","ʃ","ʒ","h", # fricatives
"ʧ","ʤ", # affricates
"p","b","t","d","k","g") # plosives
The alternation pattern:
phonemes_pattern <- paste0("(", paste0(phonemes, collapse = "|"), ")")
The splitting operation:
str_split(gsub(" ", "", test1), paste0("(?<=", phonemes_pattern, ")"))
[[1]]
[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l" "i" "ː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h" "ɪ" "ə" "l" "ɑː" "s" "t"
[30] "w" "i" "ː" "k" "s" "ʌ" "m" "b" "ə" "d" "i" "d" "ɪ" "d" ""
The result is correct except for the long /i/ sound, where the two characters i and ː are also separated. Can anybody help with this?
Why not extract the phonemes instead?
phonemes_pattern <- paste0(phonemes, collapse = "|")
stringr::str_extract_all(test1, phonemes_pattern)[[1]]
#[1] "d" "ɪ" "d" "ɛ" "n" "ɪ" "b" "ɒ" "d" "i" "l"
#[12] "iː" "v" "ð" "eə" "g" "l" "ɑː" "s" "ɪ" "z" "h"
#[23] "ɪə" "l" "ɑː" "s" "t" "w" "iː" "k" "s" "ʌ" "m"
#[34] "b" "ə" "d" "i" "d" "ɪ" "d"
Or in base R:
regmatches(test1, gregexpr(phonemes_pattern, test1))[[1]]
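With stringr (ICU) or with perl = TRUE, the first alternative that matches at a position wins, so the extraction relies on the phonemes vector listing diphthongs and long vowels before the short vowels. A hedged sketch of making that ordering explicit, using a small made-up inventory (`phonemes_small` and `word` are illustrative names, and "\u02d0" is the length mark ː):

```r
# Deliberately unordered toy inventory; "\u02d0" is the length mark
phonemes_small <- c("i", "i\u02d0", "l", "v")
word <- "li\u02d0v"

# Put longer phonemes first so the long vowel is tried before plain "i"
by_length <- phonemes_small[order(-nchar(phonemes_small))]
pattern <- paste0(by_length, collapse = "|")

# Extract the segments with a Perl-compatible regex
segments <- regmatches(word, gregexpr(pattern, word, perl = TRUE))[[1]]
```

Sorting by length is only a convenience here; any order in which a phoneme never precedes a longer phoneme that starts with it would serve the same purpose.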
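Just changing the lookbehind to a lookahead keeps iː together. Note that a lookahead splits before every position where a phoneme starts, so diphthongs whose second character is itself a phoneme (for example eə) are then split instead: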
# using the unchanged phonemes vector
phonemes_pattern <- paste0(phonemes, collapse = "|")
str_split(gsub(" ", "", test1), paste0("(?=", phonemes_pattern, ")"))

Splitting character vector in my data frame by "|" not working

I am working on Tidy Tuesday's data set horror_movies.csv and cannot see how to split the genres column. I tried:
fieldList <- strsplit(df$genres, "|")
Here is a sample of the output:
[1] "D" "r" "a" "m" "a" "|" " " "H" "o" "r" "r" "o" "r" "|" " " "S" "c" "i" "-" "F" "i"
[22] "|" " " "T" "h" "r" "i" "l" "l" "e" "r"
For some reason this splits my elements into individual characters. Here is a glimpse of this column so you can see how it is structured in the data frame:
$ genres <chr> "Drama| Horror| Thriller", "Horror", "Horror", "Comedy| Horror…
Is the | character special in R? What am I missing?
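strsplit() treats its split argument as a regular expression by default, and in a regular expression '|' is the alternation ('OR') operator. A bare "|" therefore matches the empty string, which splits the input between every character.
To match the character literally, set fixed=TRUE (it is FALSE by default):
fieldList <- strsplit(df$genres, "|", fixed=TRUE)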
Below is the documentation of the above function strsplit:
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strsplit
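As a small illustration, the sketch below uses a made-up vector `genres` in place of df$genres. It shows the two usual ways of matching a literal | and trims the space that follows each separator in this data:

```r
# Hypothetical stand-in for df$genres
genres <- c("Drama| Horror| Thriller", "Horror")

# Option 1: treat the separator as a literal string
by_fixed <- strsplit(genres, "|", fixed = TRUE)

# Option 2: keep regex matching but escape the metacharacter
by_regex <- strsplit(genres, "\\|")

# Both give the same pieces; trimws() drops the leading spaces
tidy <- lapply(by_fixed, trimws)
tidy[[1]]
# [1] "Drama"    "Horror"   "Thriller"
```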

R: Create a vector with all Keyboard constants

I want to create a vector that contains every constant in a QWERTY keyboard. For now I have:
keyboard <- c(LETTERS, letters, 0:9)
I need to add an element to the vector containing all the symbols (e.g. #, !, ?, etc...) Is there an R constant (such as LETTERS for all upper case letters in the alphabet) that contains all the symbols? If not, is there a fast way of getting them without typing them one by one?
The printable ASCII characters have codes 32 through 126. We can generate that vector of codes, convert it to 'raw', then use rawToChar() to turn it into a single string of the actual characters. After that we just break the string into individual characters.
strsplit(rawToChar(as.raw(32:126)), "")[[1]]
which gives
> strsplit(rawToChar(as.raw(32:126)), "")[[1]]
[1] " " "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "."
[16] "/" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ":" ";" "<" "="
[31] ">" "?" "@" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
[46] "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "["
[61] "\\" "]" "^" "_" "`" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
[76] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y"
[91] "z" "{" "|" "}" "~"
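To keep only the symbols and append them to the vector from the question, one option is to drop the letters, digits, and the space from the printable range. This is a sketch; the names `printable` and `symbols` are illustrative:

```r
# All printable ASCII characters (codes 32 to 126)
printable <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]

# Remove letters, digits, and the space, leaving only the symbols
symbols <- setdiff(printable, c(LETTERS, letters, as.character(0:9), " "))

# Extend the vector from the question (0:9 is coerced to character by c())
keyboard <- c(LETTERS, letters, 0:9, symbols)
```

This covers ASCII symbols only; keys specific to non-US layouts are not included.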
