How to make a word cloud from a tally in R?

Here is a reprex of the tally I have:
word_tally <- data.frame(scarred = c(1,1,0,0,0,0,0,0,0,0,0,0,0),
                         happy = c(0,0,1,0,0,0,0,0,0,0,0,0,0),
                         cheerful = c(0,0,0,1,0,0,0,0,0,0,0,0,0),
                         mad = c(0,0,0,0,1,1,1,1,1,0,0,0,0),
                         curious = c(0,0,0,0,0,0,0,0,0,1,1,1,1))
To make a word cloud, it seems I need one column containing all the words. How could I transform the data frame above into that kind of structure for a word cloud?

Using rep and colSums:
words <- rep(names(word_tally), colSums(word_tally))
words
[1] "scarred" "scarred" "happy" "cheerful" "mad"
[6] "mad" "mad" "mad" "mad" "curious"
[11] "curious" "curious" "curious"
Or, since the frequencies are just the column sums, you can pass the names and counts to wordcloud() directly:
library(wordcloud)
wordcloud(names(word_tally), freq = colSums(word_tally), min.freq = 1)

You could get the data in long format and remove rows where value = 0.
library(dplyr)
tidyr::pivot_longer(word_tally, cols = everything(), names_to = "word") %>%
  filter(value != 0) %>%
  select(word)
# A tibble: 13 x 1
# word
# <chr>
# 1 scarred
# 2 scarred
# 3 happy
# 4 cheerful
# 5 mad
# 6 mad
# 7 mad
# 8 mad
# 9 mad
#10 curious
#11 curious
#12 curious
#13 curious
This would give all the words in one column which can be used as input for wordcloud.
In base R, another way could be:
names(word_tally)[which(word_tally != 0, arr.ind = TRUE)[,2]]
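Either way, here is a quick sketch of going from that single-column, long-format representation to an actual cloud (a minimal sketch, assuming the wordcloud package is installed; words is the vector built above):

library(wordcloud)

# words: the long-format character vector of repeated words built above
freqs <- table(words)
wordcloud(names(freqs), freq = as.numeric(freqs), min.freq = 1)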

Related

R- How would I identify a specific ordered sequence of characters that fits certain parameters?

Say I have the following sequence of letters which represent a sequence in a gene: 5’ CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG 3’
I want to identify a specific ordered sequence, length of 20 for instance, within the above sequence that follows the parameters:
Position 1: cannot be a G
Position 10: must be an A or T
I am only including two parameters above as a way to simplify the problem.
I start with the following, converting the sequence into a vector.
exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"
exon_2_vector <- as_vector(str_split_fixed(exon_2, pattern = "", n = nchar(exon_2)))
From here, I have been trying to start at position 1 and use if/else statements, then move on to position 2 if position 1 does not work, but I am struggling with the workflow and am not the greatest at coding.
Any suggestions would be greatly appreciated.
A solution using ngrams:
#install.packages('ngram')
library(ngram)
library(stringr)
library(dplyr)
library(purrr)
exon_2 <- "CTTGTACTGGCCATCGGCTGTGCGATTGTCGTCATCGGTGGCTATAACCGTGGCGACTCGATAAGAGTCCGCCCTCCCATG"
exon_2_ws <- exon_2 %>% str_replace_all('(\\w)', '\\1 ')

ngram <- exon_2_ws %>%
  ngram(n = 20) %>%
  get.ngrams() %>%
  as_tibble() %>%
  mutate(value = value %>% str_replace_all(' ', ''),
         identification = if_else(substring(value, 1, 1) != 'G' &
                                    substring(value, 10, 10) %in% c('A', 'T'), 1, 0))
Output:
# A tibble: 62 x 2
value identification
<chr> <dbl>
1 ATCGGCTGTGCGATTGTCGT 0
2 CGGCTGTGCGATTGTCGTCA 0
3 GCTATAACCGTGGCGACTCG 0
4 CGGTGGCTATAACCGTGGCG 1
5 GGCCATCGGCTGTGCGATTG 0
6 TAACCGTGGCGACTCGATAA 0
7 ATCGGTGGCTATAACCGTGG 1
8 GGCTGTGCGATTGTCGTCAT 0
9 CGATTGTCGTCATCGGTGGC 1
10 GATTGTCGTCATCGGTGGCT 0
# ... with 52 more rows
You could then filter valid sequences like this:
ngram %>% filter(identification == 1) %>% pull(value)
Which gives:
[1] "CGGTGGCTATAACCGTGGCG" "ATCGGTGGCTATAACCGTGG" "CGATTGTCGTCATCGGTGGC" "TACTGGCCATCGGCTGTGCG" "CTCGATAAGAGTCCGCCCTC" "CTATAACCGTGGCGACTCGA"
[7] "TTGTCGTCATCGGTGGCTAT" "CTGTGCGATTGTCGTCATCG" "ACCGTGGCGACTCGATAAGA" "CGACTCGATAAGAGTCCGCC" "CATCGGCTGTGCGATTGTCG" "CGATAAGAGTCCGCCCTCCC"
[13] "ATTGTCGTCATCGGTGGCTA" "TCGGTGGCTATAACCGTGGC" "CGTCATCGGTGGCTATAACC" "CGTGGCGACTCGATAAGAGT"

Going from dplyr to base: create a data frame of the first and last index for each level of a variable

Asking how to go from dplyr to base may be a weird ask, especially since I love the tidyverse, but because I learned the tidyverse first, my grasp of base R is far from masterful, and I need a base solution because the package I'm helping to develop doesn't want any tidyverse dependencies.
Data (there are many more columns, but it is abbreviated for the reprex's sake):
library(dplyr)  # for tibble() and the verbs used below

sample.df <- tibble(batch = rep(c(1, 2, 3), c(4, 5, 6)))
Desired base equivalent of:
sample.df %>%
  mutate(rowid = row_number()) %>%
  group_by(batch) %>%
  summarize(idx_b = min(rowid),
            idx_e = max(rowid))
# A tibble: 3 x 3
# Groups: batch [3]
batch idx_b idx_e
<dbl> <int> <int>
1 1 1 4
2 2 5 9
3 3 10 15
We create a sequence column in the data, use aggregate to get the min/max per group, and convert the resulting matrix column into regular data.frame columns with do.call.
out <- do.call(data.frame,
               aggregate(rowid ~ batch,
                         transform(sample.df, rowid = seq_len(nrow(sample.df))),
                         FUN = function(x) c(b = min(x), e = max(x))))
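Printing out with the reprex data should give something like the following (the .b/.e suffixes come from the names used inside FUN):
#   batch rowid.b rowid.e
# 1     1       1       4
# 2     2       5       9
# 3     3      10      15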
Another base R option using unique + ave
unique(
transform(
sample.df,
idx_b = ave(1:nrow(sample.df), batch, FUN = min),
idx_c = ave(1:nrow(sample.df), batch, FUN = max)
)
)
gives
batch idx_b idx_c
1 1 1 4
5 2 5 9
10 3 10 15

How to calculate common values across different groups?

I am trying to create a data frame for building network charts with the igraph package. I have sample data "mydata_data" and I want to create "expected_data".
I can easily calculate the number of customers who visited a particular store, but how do I calculate the common set of customers who go to both store x1 and store x2, etc.?
I have 500+ stores, so I don't want to create columns manually. Sample data for reproducibility is given below:
mydata_data <- data.frame(
  Customer_Name = c("A","A","C","D","D","B"),
  Store_Name = c("x1","x2","x2","x2","x3","x1"))
expected_data <- data.frame(
  Store_Name = c("x1","x2","x3","x1_x2","x2_x3","x1_x3"),
  Customers_Visited = c(2,3,1,1,1,0))
Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count, and merge with a data frame of all the combinations, i.e.
library(tidyverse)

# the answer refers to the data as df and needs character columns (see the NOTE below)
df <- mutate_all(mydata_data, as.character)

df %>%
  group_by(Customer_Name) %>%
  summarise(combos = list(unique(c(unique(Store_Name),
                                   paste(unique(Store_Name), collapse = '_'))))) %>%
  unnest() %>%
  group_by(combos) %>%
  count() %>%
  right_join(data.frame(combos = c(unique(df$Store_Name),
                                   combn(unique(df$Store_Name), 2, paste, collapse = '_'))))
which gives,
# A tibble: 6 x 2
# Groups: combos [?]
combos n
<chr> <int>
1 x1 2
2 x2 3
3 x3 1
4 x1_x2 1
5 x1_x3 NA
6 x2_x3 1
NOTE: Make sure that your Store_Name variable is a character, NOT a factor, otherwise combn() will fail.
Here's an igraph approach:
library(igraph)

A <- as.matrix(as_adj(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))

data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
           Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
# Store_Name Customers_Visited
# 1 x1 2
# 2 x2 3
# 3 x3 1
# 4 x1_x2 1
# 5 x1_x3 0
# 6 x2_x3 1
Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply
stores
# [1] "x1" "x2" "x3"
while
storeCombs
# [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"
The main trick then is how to obtain Customers_Visited: the first three numbers are just the numbers of graph neighbours of the individual stores, while the numbers of common customers come from the common graph neighbours, which we read off the square of A.
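As a quick hedged sanity check of that last point (not part of the original answer): the (i, j) entry of A %*% A counts length-two paths i -> customer -> j, i.e. the number of customers shared by stores i and j.

# the row/column names of A include both customers and stores, so we can index by store name
(A %*% A)["x1", "x2"]   # expected 1: only customer "A" visits both x1 and x2
(A %*% A)["x1", "x3"]   # expected 0: no customer visits both x1 and x3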
Here's one possible way to get the data. First, a helper function adapted from here: Generate all combinations, of all lengths, in R, from a vector
comball <- function(x) {
  do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))
}
Then you can use that with some tidyverse functions:
library(dplyr)
library(purrr)
library(tidyr)
mydata_data %>%
  group_by(Customer_Name) %>%
  summarize(visits = list(comball(Store_Name))) %>%
  mutate(visits = map(visits, ~map_chr(., ~paste(., collapse = "_")))) %>%
  unnest(visits) %>%
  count(visits)
Another option, with base R:
Get the list of all possible stores
all_stores <- as.character(unique(mydata_data$Store_Name))
Find the different combinations of 1 or 2 stores:
all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))
For each number of stores combined, get the number of customers that visited all of them, then combine these values in a data.frame with the names of the stores:
do.call(rbind,
        lapply(all_comb_store, function(nb_comb) {
          data.frame(
            Store_Name = if (nrow(nb_comb) == 1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse = "_"),
            Customers_Visited = apply(nb_comb, 2, function(vec_stores) {
              length(Reduce(intersect,
                            lapply(vec_stores, function(store)
                              mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))
            }))
        }))
# Store_Name Customers_Visited
#1 x1 2
#2 x2 3
#3 x3 1
#4 x1_x2 1
#5 x1_x3 0
#6 x2_x3 1
Using dplyr: self-join, then build the store-pair label and count unique customers per pair. This should be a lot quicker than the other answers, which enumerate all combinations.
Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.
library(dplyr)

left_join(mydata_data, mydata_data, by = "Customer_Name") %>%
  transmute(Customer_Name,
            grp = paste(pmin(Store_Name.x, Store_Name.y),
                        pmax(Store_Name.x, Store_Name.y), sep = "_")) %>%
  group_by(grp) %>%
  summarise(n = n_distinct(Customer_Name))
# # A tibble: 5 x 2
# grp n
# <chr> <int>
# 1 x1_x1 2
# 2 x1_x2 1
# 3 x2_x2 3
# 4 x2_x3 1
# 5 x3_x3 1
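If you need the output in exactly the question's format, a hedged follow-up sketch (assuming the piped result above has been assigned to a variable res): relabel the self-pairs and fill in the missing pair with a zero count using tidyr::complete().

library(tidyr)

# res: the summarised tibble from the pipeline above (assumed to have been assigned)
res %>%
  mutate(grp = ifelse(sub("_.*", "", grp) == sub(".*_", "", grp),
                      sub("_.*", "", grp), grp)) %>%   # "x1_x1" -> "x1"
  complete(grp = c("x1", "x2", "x3", "x1_x2", "x1_x3", "x2_x3"),
           fill = list(n = 0))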
Data without factors:
mydata_data <- data.frame(
  Customer_Name = c("A","A","C","D","D","B"),
  Store_Name = c("x1","x2","x2","x2","x3","x1"),
  stringsAsFactors = FALSE)

Extract data after each appearance of a string in R

I have data like this (named spectra):
#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216
In this data, each occurrence of the string ##XYDATA=(X++(Y..Y)) marks the start of the data for a different animal.
So I want code that can split this sample into 3 pieces of data.
Animal 1: 3 lines after 1st ' ##XYDATA=(X++(Y..Y))'
Animal 2: 3 lines after 2nd ' ##XYDATA=(X++(Y..Y))'
And so on.
I tried this line of code, but it only extracts the first line after each time the string '##XYDATA=(X++(Y..Y))' appears, all lumped together. So it did not meet my expectation of getting the three following lines, as a separate piece of data, after each appearance of the string.
bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])
Okay, I think you could do something along these lines. I'm sure this could be much better and more efficient, but: read the file in as a character vector, then loop through it to spread it out. However, this assumes there are always the same number of measures and that you have a way to identify the marker values.
c_data<- c("split", 1, 2, 3,
"split", 4, 5, 6)
y<- c_data == "split"
df_wide <- data.frame("animal"= character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
stringsAsFactors = FALSE)
names(df_wide)<- c("animal", "v1", "v2", "v3")
x <- 0
for (i in 1:length(c_data)){
if (y[i] == TRUE){
x <- x +1
df_wide[x,] <- rbind(c(c_data[i], c_data[i+1], c_data[i+2], c_data[i+3]))
}
}
yields
animal v1 v2 v3
1 split 1 2 3
2 split 4 5 6
If it is a one-time thing, it may not be worth trying to write something nicer. If it is an ongoing thing, then you may want to look at using an apply-style function, for which you would have to write a small helper; see the sketch below.
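For instance, a compact hedged sketch of that idea with vapply, under the same assumption that every marker is followed by the same number of measures:

# same toy vector as above
c_data <- c("split", 1, 2, 3,
            "split", 4, 5, 6)

starts <- which(c_data == "split")
# one row per animal: the three measures following each "split" marker
wide <- t(vapply(starts,
                 function(i) as.numeric(c_data[(i + 1):(i + 3)]),
                 numeric(3)))
colnames(wide) <- c("v1", "v2", "v3")
as.data.frame(wide)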
You can do either of the following with split and map:
library(dplyr)
library(purrr)
df %>%
  mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
  split(.$Animal) %>%
  map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
  '['(-1)
This creates an indicator variable Animal, splits by that indicator, removes the first row of each data frame, converts V1 to numeric, and finally removes the first element of the list (the group that contains only the '#Milk spectra' header line).
You can also do the following:
df %>%
  mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
  filter(!grepl("^#.*$", V1)) %>%
  mutate(V1 = as.numeric(V1)) %>%
  split(.$Animal)
This also creates the indicator Animal, but instead it filters out all rows that start with a # sign and converts V1 to numeric before splitting into separate data frames.
Result:
$`1`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 649.0251 1
2 667.6752 1
3 686.3254 1
$`2`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 723.6257 2
2 742.2758 2
3 760.9260 2
$`3`
# A tibble: 4 x 2
V1 Animal
<dbl> <int>
1 872.8268 3
2 891.4770 3
3 910.1271 3
4 928.7773 3
Note:
Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.
Data:
df = read.table(textConnection("'#Milk spectra: 1234'
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216"),comment.char = "", stringsAsFactors = FALSE)

Summarize Data String in R

I have a large dataset where some of the information needed is stored in the first column as a string separated by semicolons. For example:
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
Giving:
Information Data
1 Forrest;Trees;Unknown 5
2 Forrest;Trees;Leaves 1
3 Forrest;Trees;Trunks 3
4 Forrest;Shrubs;Unknown 4
5 Forrest;Shrubs;Branches 2
6 Forrest;Shrubs;Leaves 1
7 Forrest;Shrubs;NA 3
I need to simplify the names so that I only have the last unique name that isn't "Unknown" or "NA" such that my dataframe becomes:
Information Data
1 Trees;Unknown 5
2 Trees;Leaves 1
3 Trunks 3
4 Shrubs;Unknown 4
5 Branches 2
6 Shrubs;Leaves 1
7 Shrubs;NA 3
Maybe it's not the most efficient or elegant solution, but it works on the sample data. Hope it's also adequate for your needs:
library(stringr)
library(dplyr)
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)
# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())
# join back to original data
TestData <- TestData %>%
  left_join(a)
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count),
                         paste0(TestData$V4, ";", TestData$V5),
                         TestData$V5)
Generally it is not recommended to put multiple variables in the same column, but using dplyr together with tidyr should give you what you want:
library(dplyr)
library(tidyr)

TestData_filtered <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), remove = FALSE) %>%
  filter(!grepl("Unknown|NA", BL)) %>%
  mutate(wanted = paste(TS, BL, sep = ";"))
