Find all unique characters in a data frame in R

I am wondering what the most efficient way is to find all unique characters in a data frame in R,
for example [0-9, a-z, A-Z, ",", "$", "&", "#", etc.].
> k
cola colb
1 1&3# %^
2 A4C% 89&
The output I am expecting is a list containing all unique characters, including special characters, e.g. 123#%^AC89&

There's nothing really efficient about this, but ... demonstrating on the diamonds dataset from the ggplot2 library,
library(ggplot2)
unique(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "0" "." "2" "3" "1" "9" "4" "6" "5" "8" "7" "I" "d" "e" "a" "l" "P" "r" "m" "i" "u" "G" "o"
# [24] "V" "y" " " "F" "E" "J" "H" "D" "S"
If you're curious about how many of each there are:
table(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# . 0 1 2 3 4 5 6 7 8 9 a
# 12082 261929 81785 142173 135042 108355 121267 157242 161862 91438 71904 67144 23161
# d D e E F G H i I J l m o
# 38539 6775 47424 9797 12942 28280 8304 15401 51763 2808 21551 27582 33976
# P r S u V y
# 13791 27483 51409 13791 49953 12082
(This is effectively akrun's answer ... posted before I saw his comment-edit.)
Using your sample frame:
k <- data.frame(cola = c("1&3#", "A4C%"), colb = c("%^", "89&"), stringsAsFactors = FALSE)
unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "1" "&" "3" "#" "A" "4" "C" "%" "^" "8" "9"
And if you want them in a sorted no-space string,
paste(sort(unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))), collapse = "")
# [1] "#%&^13489AC"
Since your question suggests you're considering using this in a regex somewhere, you can sandwich this in brackets. I wouldn't go through the pain of finding character ranges (e.g., AD-GW-Z24-9), since that buys you very little regex efficiency but would take a bit more effort to generate.
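As a rough sketch of the bracket idea, the unique characters can be pasted into a character class. The helper name make_char_class is hypothetical, and the escaping assumes the pattern is used with perl = TRUE, where a backslash escapes \, ], ^ and - inside brackets.

```r
# Hypothetical helper: build a regex character class from a data frame's characters.
make_char_class <- function(df) {
  chars <- unique(unlist(strsplit(unlist(lapply(df, as.character)), "")))
  # Escape the characters that are special inside a PCRE bracket expression.
  special <- chars %in% c("\\", "]", "^", "-")
  chars[special] <- paste0("\\", chars[special])
  paste0("[", paste(chars, collapse = ""), "]")
}

k <- data.frame(cola = c("1&3#", "A4C%"), colb = c("%^", "89&"),
                stringsAsFactors = FALSE)
char_class <- make_char_class(k)

# Strip every character that occurs in k from another string.
gsub(char_class, "", "1&3# hello ^", perl = TRUE)
```

The same pattern can be passed to grepl() to test whether a string contains any of the characters found in the data frame.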

Related

Reorder vector so no certain items are positioned next to each other

Please consider the following example:
[[1]]
[1] 11 12 13 14
[[2]]
[1] 1 2 3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6
[[6]]
[1] 7
[[7]]
[1] 8
[[8]]
[1] 9
[[9]]
[1] 10
[[10]]
[1] 15
[[11]]
[1] 16
[[12]]
[1] 17
In this example, I have 12 unique values in a vector that is 17 elements long. For simplicity, let's say that this vector is:
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
The first code block shows the index positions in foo_bar of each of the unique values (the letters a–l).
I am attempting to write an algorithm that reorders foo_bar so that, for every index i except the final one (index 17 in the foo_bar example), positions i and i+1 never contain the same value. Here's an example of what would be an appropriate outcome:
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")

something like this?
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
test <- FALSE
while (test == FALSE) {
  # shuffle, then accept only if no run is longer than 1
  new_foo_bar <- sample(foo_bar, size = length(foo_bar), replace = FALSE)
  test <- length(rle(new_foo_bar)$lengths) == length(foo_bar)
}
new_foo_bar
# [1] "f" "a" "g" "b" "h" "d" "j" "c" "e" "i" "a" "b" "k" "a" "l" "a" "b"

First we identify the indices of the unique values in the vector.
library(magrittr)  # provides %>%
indices <-
  unique(foo_bar) %>%
  sort() %>%
  lapply(function(x) which(foo_bar == x))
Then we create a position score based on 1) the rank of the value when values are ordered by decreasing frequency and 2) how many occurrences of this value have already been placed, and we add these two values together. However, to ensure that we get a different value inserted between them, we divide 2) by 2. Finally, we order the position scores and reorder foo_bar with this new order.
This solution is also robust in case it is not possible to prevent duplicate values next to each other (for example because the values are c("a","a","b","a")).
out <-
  lengths(indices) %>%
  lapply(., function(x) 1:x) %>%
  {lapply(seq_along(.), function(x) (unlist(.[x]) + x / 2))} %>%
  unlist() %>%
  order() %>%
  {unlist(indices)[.]} %>%
  foo_bar[.]
The output is then:
> "a" "b" "a" "c" "b" "d" "a" "e" "b" "f" "a" "g" "h" "i" "j" "k" "l"
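Whichever approach is used, the result can be checked by comparing each element with its right-hand neighbour. The sketch below uses two hypothetical helpers: has_adjacent_dups() performs that check, and is_feasible() applies the standard counting argument that a valid ordering exists exactly when the most frequent value occurs at most ceiling(n / 2) times.

```r
# Hypothetical helper: TRUE if any value sits next to an identical value.
has_adjacent_dups <- function(x) {
  length(x) > 1 && any(head(x, -1) == tail(x, -1))
}

# Hypothetical helper: a duplicate-free ordering exists exactly when the
# most frequent value can be placed in every other slot.
is_feasible <- function(x) {
  max(table(x)) <= ceiling(length(x) / 2)
}

foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")

has_adjacent_dups(foo_bar)            # TRUE
has_adjacent_dups(reordered_foo_bar)  # FALSE
is_feasible(foo_bar)                  # TRUE
is_feasible(c("a", "a", "b", "a"))    # FALSE
```

Running is_feasible() first avoids looping forever in a shuffle-and-retry approach when no valid ordering exists.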

Compact Letter Display from a matrix of significancies or by hand

I am running a multiple pairwise comparison in R. I'm using the survival package survminer. I'm using the function:
pairwise_survdiff {survminer}
It gives the pairwise comparisons with significance as expected, but doesn't seem to have a way to give a compact letter display (CLD) of the results. I'm looking at pairs of 19 levels. I ended up printing the results, putting them into excel by hand and then doing letters by hand. But now I need to do it again and am hoping for an easier way.
1) Can I have R do a CLD from the pairwise_survdiff {survminer} results directly?
Barring that:
2) Is there a way to get it to print results into a table that can be read by a spreadsheet?
3) If I make the logic matrix by hand, how do I have R take that and turn it into a CLD?
4) If I'm doing it all by hand, I'm wondering if there is a more compact method of showing this list of comparisons. Can I eliminate any of these letters due to redundancy?
[Image: hand-made CLD for comparisons]
Thank you

Here's the example from survminer
library(survival)   # provides Surv()
library(survminer)
library(multcomp)
library(tidyr)
data(myeloma)
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
                         data = myeloma)
Looking at the internals of the glht.summary method from the multcomp package, we create the lvl_order vector which identifies the ordering of the levels of x from smallest to largest.
x <- myeloma$molecular_group
levs <- levels(x)
y <- Surv(myeloma$time, myeloma$event)
lvl_order <- levels(x)[order(tapply(as.numeric(y)[1:length(x)],
x, mean))]
Then we can re-arrange the p-values from the res object into long form. mycomps is a matrix holding the two sides of each paired comparison. The signif vector is a logical vector indicating whether each difference is significant.
comps <- as_tibble(res$p.value, rownames="row") %>%
pivot_longer(-row, names_to="col", values_to="p") %>%
na.omit()
mycomps <- as.matrix(comps[,1:2])
signif <- comps$p < .05
Then, you can use the insert_absorb internal function to create the letters:
multcomp:::insert_absorb(signif,
decreasing=FALSE,
comps=mycomps,
lvl_order=lvl_order)
# $Letters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a" "b" "ab" "b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $monospacedLetters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a " " b" "ab" " b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $LetterMatrix
# a b
# MAF TRUE TRUE
# Proliferation TRUE FALSE
# Cyclin D-2 FALSE TRUE
# MMSET TRUE TRUE
# Hyperdiploid FALSE TRUE
# Low bone disease TRUE TRUE
# Cyclin D-1 TRUE TRUE
#
# $aLetters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
# [45] "S" "T" "U" "V" "W" "X" "Y" "Z"
#
# $aseparator
# [1] "."
#
# attr(,"class")
# [1] "multcompLetters"
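For the spreadsheet part of the question, one option is to write the p-value matrix to a CSV file with base R. The sketch below uses a small made-up matrix, p_mat, shaped like res$p.value (lower-triangular, NA elsewhere); the group names, the numbers and the file path are placeholders, not results.

```r
# Made-up stand-in for res$p.value: lower-triangular p-values, NA elsewhere.
p_mat <- matrix(c(0.01,   NA,
                  0.20, 0.03),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("B", "C"), c("A", "B")))

# Write to a CSV file that a spreadsheet can open; NA cells are left blank.
out_file <- tempfile(fileext = ".csv")
write.csv(p_mat, out_file, na = "")

# Read it back to confirm the round trip.
back <- read.csv(out_file, row.names = 1)
back
```

With the real object, the same call would presumably be write.csv(res$p.value, file, na = "") with a file name of your choosing.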

simple question about r language in naming

I am a developing beginner in r. I have a simple question about r language.
Thanks to many experts in this site, I am improving a lot.
I am always grateful for that, and anyone who's giving hand with this question, thank you in advance.
This is the code.
Data=sample(1:5,size=25,replace=T)
names(Data)=c("a","b","c","d","e")
I want to map each of 1, 2, 3, 4, 5 to a, b, c, d, e,
so I thought I could accomplish this with the code above.
I know that the right code is
Data=c("a","b","c","d","e")[Data]
But I can't understand why this is the right code and why I need the last [Data].
Any help would be really appreciated!! Thank you so much in advance!!:)

The last Data provides an index to subset values from c("a","b","c","d","e").
Let's take a simple example :
Consider,
a <- 1:10
Now to get the first value in a you can do
a[1]
#[1] 1
To get 3rd value in a you can do
a[3]
#[1] 3
To get 6th and 8th value in a you can do
a[c(6, 8)]
#[1] 6 8
What will happen if you repeat a certain index? Say you select 1 twice and 3 once.
a[c(1, 1, 3)]
#[1] 1 1 3
As you can see the first value is selected two times and third one time.
Now, the Data that you have serves as that index for subsetting, and a becomes c("a","b","c","d","e"):
a <- c("a","b","c","d","e")
set.seed(123)
Data=sample(1:5,size=25,replace=T)
Data
#[1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
Now use these Data values to subset a, giving
a[Data]
#[1] "c" "c" "b" "b" "c" "e" "d" "a" "b" "c" "e" "c" "c" "a" "d" "a" "a" "e" "c" "b" "b" "a" "c" "d" "a"
A side note: there are built-in constants letters and LETTERS, which give the 26 lower-case and upper-case letters of the alphabet.
letters
#[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Here is a way that takes advantage of how objects of class "factor" are coded internally in R. In R, factors are coded as consecutive integers starting at 1, and what the user sees is their labels and levels, not the integer values. But the integer values do not go away, they are still there.
First, create a vector of integers like in the question but setting the RNG seed in order to make the results reproducible. This vector is saved for later.
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)
Saved <- Data
Now create the factor. Note that the labels argument is set to the letters "a" to "e".
Data <- factor(Data, labels = c("a","b","c","d","e"))
Data
# [1] c c b b c e d a b c e c c a d a a e c b b a c d a
#Levels: a b c d e
See the internal representation.
as.integer(Data)
# [1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
And compare with the initial values.
identical(Saved, as.integer(Data))
# [1] TRUE

This is because Data contains the numbers you want to name, in the order you want to name them. By adding [Data] to the end you are selecting the letters in the order of Data. To understand this, try what c("a","b","c","d","e")[c(1, 2)] does; it selects just the first two letters. If you instead type c("a","b","c","d","e")[c(5, 4)] it will select the last two letters, but in reverse order. Then if you print just Data, you'll see that it contains only the numbers from 1 to 5, which matches the number of unique letters. So it will select the letters according to that order. You can see that all the numbers correspond to the letters in order by printing the correctly named Data.
Using names(Data)=c("a","b","c","d","e") does not work correctly since you aren't naming all 25 of the numbers, but rather just the first five of them.
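To see the difference concretely, the following sketch compares the two approaches on the same vector. Assigning names only attaches labels to elements and leaves the values untouched, while indexing builds a new vector of letters. The object names named and recoded are illustrative.

```r
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)

# Attempt 1: names() labels positions; it does not recode values.
named <- Data
names(named) <- c("a", "b", "c", "d", "e")
names(named)[1:7]               # "a" "b" "c" "d" "e" NA NA
identical(unname(named), Data)  # TRUE: the values are still the integers

# Attempt 2: use the integers as an index into the letters.
recoded <- c("a", "b", "c", "d", "e")[Data]
length(recoded)                 # 25
```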

Analyze table in R to count nucleotide frequencies

I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies for G, C, and T (variables to increment are g, c, t respectively).
At the end of the "for" loop I want the number of times "A" "G" "C" and "T" nucleotides occurred so I can calculate the dinucleotide frequencies, and hopefully the transition matrix. My code is below; it doesn't work, it just returns each variable being equal to 0, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t

The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code- first, I don't know where the inc function comes from (and why it wasn't throwing an error: are you sure dnaseq[[1]] has length greater than 0?) but in any case, you weren't iterating over the sequence, you were iterating over the lines. i was never going to be a single character like A or T, it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).

You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution may easily be extended to the task of finding subsequences of length > 1 (just change the search pattern in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides).
However, note that detecting AGA in AGAGA will report only 1 occurrence as str_count() does not take overlapping patterns into account.
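If overlapping matches should be counted, one base R sketch is to compare every window of the right width against the pattern. The helper name count_overlapping is hypothetical.

```r
# Hypothetical helper: count occurrences of `pattern` in `text`,
# including occurrences that overlap each other.
count_overlapping <- function(pattern, text) {
  n_windows <- nchar(text) - nchar(pattern) + 1
  if (n_windows < 1) return(0L)
  starts <- seq_len(n_windows)
  # Extract every substring of the pattern's length and compare.
  sum(substring(text, starts, starts + nchar(pattern) - 1) == pattern)
}

count_overlapping("AGA", "AGAGA")  # 2
count_overlapping("AA", "AAAA")    # 3
```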

I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and fixed-string matching to count dinucleotide sequences
## (sum(... > 0) gives 0 rather than 1 when there is no match):
nuc_comb$freq <- vapply(nuc_comb$two,
                        function(x) sum(gregexpr(x, dnaseq, fixed = TRUE)[[1]] > 0),
                        integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
                    idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from the nucleotide in the row to the nucleotide in the column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
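An alternative sketch for the transition matrix tabulates every adjacent pair of bases directly, so overlapping pairs such as the two AA in AAA are both counted. The shortened sequence and the variable names are illustrative only, so the numbers will differ from those above.

```r
myseq <- "AGCTTTTCATTCTGACTGCA"
bases <- factor(strsplit(myseq, "")[[1]], levels = c("A", "C", "G", "T"))

# Cross-tabulate each base (row) against the base that follows it (column).
trans_counts <- table(from = head(bases, -1), to = tail(bases, -1))

# Divide each row by its total to get transition probabilities.
trans_prob <- prop.table(trans_counts, 1)
round(trans_prob, 2)
```

Fixing the factor levels keeps the matrix 4 x 4 even if some base never appears in a short sequence.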

How to analyze the data whose different rows have different number of elements using R?

The data format is as following, the first column is the id:
1, b, c
2, a, d, e, f
3, u, i, c
4, k, m
5, o
However, I have not been able to analyze this data. Do you have a good idea of how to read it into R? More generally, my question is: how can I analyze data whose rows have different numbers of elements using R?

It seems you are trying to read a file whose rows have unequal numbers of elements. The R structure suited to that is a list.
It is possible to do this by combining read.table with sep="\n" and then to apply strsplit on each row of data.
Here is an example:
dat <- "
1 A B
2 C D E
3 F G H I J
4 K L
5 M"
The code to read and convert to a list:
x <- read.table(textConnection(dat), sep="\n")
apply(x, 1, function(i)strsplit(i, "\\s")[[1]])
The results:
[[1]]
[1] "1" "A" "B"
[[2]]
[1] "2" "C" "D" "E"
[[3]]
[1] "3" "F" "G" "H" "I" "J"
[[4]]
[1] "4" "K" "L"
[[5]]
[1] "5" "M"
You can now use any list manipulation technique to work with your data.
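For the comma-separated layout shown in the question, a similar sketch splits each line on commas and keeps the id as the list name. The lines are supplied inline here in place of a file, and the object names are illustrative.

```r
# Stand-in for readLines() on the real file: one string per row.
raw <- c("1, b, c", "2, a, d, e, f", "3, u, i, c", "4, k, m", "5, o")

# Split on commas (and any following spaces), then separate id from values.
parts <- strsplit(raw, ",\\s*")
rows <- lapply(parts, function(p) p[-1])
names(rows) <- vapply(parts, function(p) p[1], character(1))

lengths(rows)        # number of elements per id
table(unlist(rows))  # how often each element occurs overall
```

From here, rows[["2"]] returns the elements for id 2, and lapply() or vapply() can apply any per-row analysis.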

Use readLines and strsplit to solve this problem.
text <- readLines("./xx.txt", encoding = "UTF-8", n = -1L)
txt <- unlist(strsplit(text, split = " "))
