Count letter frequencies in descending order - r

I have data that looks like this:
df<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L,
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG",
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV",
"VCMCVVDDNR", "YATTA"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
I am trying to count letter frequencies. There are 20 possible letters, which I want to count in each row.
For example:
The first row starts with sp|, so character frequencies are not calculated and the result is the original string.
The second row doesn't start with sp|, so it shows the character frequencies:
MGSSN 2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
which means there are 2 S, 1 M, 1 G and 1 N, and the other letters do not occur.
The character frequencies are sorted in descending order.
The final output would look like the following:
output<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L,
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG",
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV",
"VCMCVVDDNR", "YATTA"), class = "factor"), Col2 = structure(c(8L,
2L, 3L, 2L, 2L, 2L, 2L, 1L, 7L, 5L, 6L, 5L, 4L), .Label = c("1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
"2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
"2,2,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0",
"3,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))

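We can use str_count:
library(stringr)
# rows that do not start with "sp" get letter counts
i1 <- !grepl("^sp", df$col)
# start from the original strings (factor -> character), so the "sp" rows keep their text
df$col2 <- as.character(df$col)
# LETTERS has 26 entries, so this gives 26 counts per row; pass a 20-letter vector instead if needed
df$col2[i1] <- sapply(df$col2[i1], function(x)
  paste(sort(str_count(x, LETTERS), decreasing = TRUE), collapse = ","))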
Or, instead of keeping it as a string, it can be a list column as well:
library(tidyverse)
df %>%
  mutate(col = as.character(col),
         col2 = map(col, ~ if(str_detect(.x, "^sp")) .x
                           else str_count(.x, LETTERS) %>%
                             sort(decreasing = TRUE)))
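The question mentions 20 possible letters, whereas `LETTERS` contains 26. Below is a minimal base-R sketch, assuming the 20 letters are the standard one-letter amino-acid codes (an assumption; the question does not list them). The helper name `count_letters` and the short input vector are made up for illustration.

```r
# Assumed alphabet: the 20 standard one-letter amino-acid codes.
aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
        "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

# Hypothetical helper: return the string unchanged if it starts with "sp|",
# otherwise return its letter counts, sorted in descending order,
# as a comma-separated string.
count_letters <- function(x, alphabet = aa) {
  x <- as.character(x)
  if (startsWith(x, "sp|")) return(x)
  chars <- strsplit(x, "")[[1]]
  # Characters outside the alphabet become NA and are dropped by table().
  counts <- table(factor(chars, levels = alphabet))
  paste(sort(as.integer(counts), decreasing = TRUE), collapse = ",")
}

# Illustrative input taken from three rows of the question's data.
x <- c("sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "MGSSN", "S")
res <- vapply(x, count_letters, character(1), USE.NAMES = FALSE)
```

For the full data frame, something like `df$Col2 <- vapply(as.character(df$col), count_letters, character(1), USE.NAMES = FALSE)` should give one string per row.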

Related

convert df from factor to numeric

I am struggling to convert my dataset into numeric values. The dataset I have looks like this:
customer_id 2012 2013 2013 2014 2015 2016 2017
15251 X N U D S C L
X1 - X7 are marked as factors. The extract from dput(head(df)) is:
structure(list(`2012` = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("N",
"X"), class = "factor"), `2013` = structure(c(6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L
), .Label = c("C", "D", "N", "S", "U", "X"), class = "factor"),
`2014` = structure(c(8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("C",
"D", "L", "N", "R", "S", "U", "X"), class = "factor"), ...
I would like to have the data as numeric values, but I don't know how to transform them accordingly.
The goal is to feed the df into a heatmap so that I can visually explore the differences. To my knowledge, this is only possible with a numeric matrix, because I get the error: Heatmap.2(input, trace = "none", : `x' must be a numeric matrix
Does someone have any idea?
Many Thanks for your support!
It's doable. I think it would help next time to include the complete df. heatmap.2 does not work because you gave it a character matrix. Displaying a legend that maps colors to letters is a bit more complicated with heatmap.2, so I suggest something like the following using ggplot:
library(ggplot2)
library(dplyr)
library(tidyr)   # provides pivot_longer()
library(viridis)
# simulate data
df = data.frame(id=1:5,
                replicate(7,sample(LETTERS[1:10],5)))
colnames(df)[-1] = 2012:2018
# convert to long format for plotting and refactor;
# sort(unique(...)) also works when the columns are character rather than factor
df <- df %>% pivot_longer(-id) %>%
  mutate(value=factor(as.character(value),levels=sort(unique(as.character(value)))))
# define color scale
# sorted in alphabetical order
present_letters = levels(df$value)
COLS = viridis_pal()(length(present_letters))
names(COLS) = present_letters
# plot
ggplot(data=df,aes(x=name,y=id,fill=value)) +
  geom_tile() +
  scale_fill_manual(values=COLS)
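If a numeric matrix is still needed (for example for `heatmap.2`), one possible sketch is to recode every column against a shared set of levels, so that the same letter maps to the same integer in every column. The data below are made up for illustration, and the resulting integers are arbitrary labels rather than meaningful quantities, so any clustering or distance computed on them should be read with care.

```r
# Made-up example data: three factor columns with different level sets.
df <- data.frame(`2012` = factor(c("X", "N", "X")),
                 `2013` = factor(c("U", "X", "C")),
                 `2014` = factor(c("D", "S", "X")),
                 check.names = FALSE)

# Shared levels across all columns, sorted alphabetically.
all_levels <- sort(unique(unlist(lapply(df, as.character))))

# Recode each column against the shared levels and keep the integer codes.
num_mat <- sapply(df, function(col) as.integer(factor(col, levels = all_levels)))
rownames(num_mat) <- rownames(df)
```

Converting each factor column on its own with `as.integer()` would instead use per-column codes, so "X" could end up as 2 in one column and 6 in another.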

prop.test for multiple comparisons in R

kk=structure(list(items = structure(c(2L, 4L, 5L, 11L, 1L, 3L, 6L,
7L, 8L, 9L, 10L, 12L), .Label = c("ak47", "aks47", "colt", "dubstepgun",
"moneygun", "paintballgun", "portalgun", "s", "scar20", "spas12",
"tank", "watergun"), class = "factor"), N = c(3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("items", "N"), class = "data.frame", row.names = c(NA,
-12L))
To run prop.test for each item I do it the simple way, using the item's count (there are 12 items) and the total number of observations (N = 20):
prop.test(1,20)
so the colt item occurred 1 time in 20.
How can I run prop.test for all items at once instead of manually calling
prop.test(3,20)
prop.test(2,20)
and so on, but labelled with the name of the item?
#tank
prop.test(3,20)
#spas12
prop.test(1,20)
An option is to get the unique elements of the 'N' column, loop over them with lapply and apply prop.test. Each result is then named after the last item that has that count:
lst1 <- lapply(unique(kk$N), function(i) prop.test(i, 20))
names(lst1) <- unname(tapply(as.character(kk$items),
factor(kk$N, levels = unique(kk$N)), FUN = tail, 1))
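If one named result per item is wanted instead of one per distinct count, a possible variant is to loop over every row and name the list by `items`. This is a sketch of an alternative, not the code from the answer; `kk` is re-created in a compact form so the block runs on its own, and the total is taken as `sum(kk$N)`, which is 20 for this data.

```r
# Same data as in the question, written compactly.
kk <- data.frame(
  items = c("aks47", "dubstepgun", "moneygun", "tank", "ak47", "colt",
            "paintballgun", "portalgun", "s", "scar20", "spas12", "watergun"),
  N = c(3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)
)

total <- sum(kk$N)

# One prop.test per item, stored in a list named by item.
tests <- setNames(lapply(kk$N, function(n) prop.test(n, total)),
                  as.character(kk$items))

# Extract the estimated proportion for each item.
estimates <- sapply(tests, function(t) unname(t$estimate))
```

An individual result can then be looked up by name, for example `tests[["tank"]]`.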

Suppressing data from a graph in R

I have a dataset, d, that contains personally identifiable data. The dataset has an X in place of every value that is suppressed:
column1 column2 column3
* FSM X
* Male 2.5
* Female X
A FSM 6
A Male 10.3
A Female 11.7
B FSM 14.8
B Male 21.5
B Female 25.3
I want to plot this with an X above the bars in a bar plot, where data has been suppressed, such as:
My code is:
p <- ggplot(d, aes(x=column1, y=column3, fill=column2)) +
  geom_bar(position=position_dodge(), stat="identity", colour="black") +
  geom_text(aes(label=column2),position= position_dodge(width=0.9), vjust=-.5) +
  scale_y_continuous("Percentage",breaks=seq(0, max(d$column3), 2))
But of course, it can't plot 'X' on the graph and says:
Error: Discrete value supplied to continuous scale
How can I get the bar plotting to ignore the 'X' and still add the label if it's present?
Data dump:
structure(list(column1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("*",
"A", "B", "C", "D", "E", "U"), class = "factor"), column2 = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L), .Label = c("FSM", "Male", "Female"), class = "factor"),
column3 = structure(c(21L, 1L, 2L, 18L, 3L, 4L, 7L, 12L,
14L, 16L, 15L, 13L, 10L, 9L, 8L, 11L, 6L, 5L, 20L, 19L, 17L
), .Label = c("1.93889541715629", "1.97444831591173", "10.1057579318449",
"11.7305458768873", "12.7758420441347", "14.4535840188014",
"14.8471615720524", "18.5830429732869", "19.9764982373678",
"20.0873362445415", "20.9606986899563", "21.5628672150411",
"24.1579558652729", "25.3193960511034", "25.7931844888367",
"29.2576419213974", "5.45876887340302", "6.11353711790393",
"6.16921269095182", "6.98689956331878", "X"), class = "factor")), .Names = c("column1",
"column2", "column3"), row.names = c(NA, -21L), class = "data.frame")
I'm happy to print 0 where there are 0 instances, but where data has been suppressed I want to make that clear by printing an 'X', even though the bar will also show 0 instances.
First convert the height to numeric, which gives NA for censored values. Then create a label column based on that. You also need a column of zeroes for the y coordinate of the labels.
> d$column3=as.numeric(as.character(d$column3))
Warning message:
NAs introduced by coercion
> d$column4 = ifelse(is.na(d$column3),"X","")
> d$y=0
Then:
> p <- ggplot(d, aes(x=column1, y=column3, fill=column2))
> p + geom_bar(position=position_dodge(), stat="identity",
colour="black") +
geom_text(aes(label=column4,x=column1,y=y),
position=position_dodge(width=1), vjust=-0.5)
Giving:
It's a variant on labelling a geom_bar with the value of the bar. Almost a dupe.
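The conversion step goes through `as.character()` because calling `as.numeric()` directly on a factor returns the internal level codes rather than the printed values. A small sketch with made-up values:

```r
# Made-up factor with one suppressed value marked "X".
x <- factor(c("2.5", "X", "10.3"))

# Internal level codes: these are positions in levels(x), not the values.
codes <- as.numeric(x)

# Going through character gives the real numbers; "X" becomes NA
# (the coercion warning is suppressed here on purpose).
values <- suppressWarnings(as.numeric(as.character(x)))

# Label column: "X" where the value was suppressed, empty otherwise.
labels <- ifelse(is.na(values), "X", "")
```

The `labels` vector plays the role of `column4` in the answer above.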

assign multiple color to each vertex in igraph

I have a dataframe d:
d<-structure(list(V1 = c(1L, 3L, 3L, 2L, 1L, 1L, 7L, 9L, 10L, 9L, 7L), V2 = c(2L, 4L, 5L, 5L, 4L, 6L, 8L, 3L, 1L, 8L, 5L)),
.Names = c("V1", "V2"), class ="data.frame", row.names = c(NA, -11L))
library(igraph)
g<-graph.data.frame(d,directed = F)
I would like to assign each vertex one or more colors depending on its affiliation variable, given in a dataframe m:
m<-structure(list(vertex = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 9L, 9L, 10L, 1L, 1L, 6L, 6L), affilation = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 3L)),
.Names = c("vertex", "affilation"), class = "data.frame", row.names = c(NA, -16L))
I would like to ask if there is a simple method in the igraph package to assign one or multiple colors to each vertex according to its affiliation.
(EDIT) Because the edges were specified as integers, the vertices were not in order in the graph. I changed the initial graph.data.frame call to specify the order; with that change everything should work.
g<-graph.data.frame(d,directed = F, vertices = 1:10)
You can assign color as a vertex attribute by assigning into V(g)$color. Here's some code for your example (only for a single affiliation)
# Put in vertex order
m <- m[order(m$vertex), ]
# Create a palette (any palette would do)
pal <- rainbow(n=length(unique(m$affilation)))
# Get each node's first affiliation
oneAffil <- m$affilation[!duplicated(m$vertex)]
# Assign to the vertices as a color attribute
V(g)$color <- pal[oneAffil]
plot(g)
Now for multiple affiliations it's not too clear what you want. You could look at vertex.shape.pie, which can draw shapes with more than one color. Something like this works for me (but there's quite a bit of data wrangling to get it going):
# Use a cast to split out affiliation in each group as a column
library(reshape2)
am <- acast(m, formula = vertex ~ affilation)
# Turn this into a list
values <- lapply(seq_len(nrow(am)), function(i) am[i,])
plot(g, vertex.shape="pie", vertex.pie=values, vertex.pie.color=list(pal))
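The same vertex-by-affiliation count matrix can also be built in base R with `table()`, which avoids the reshape2 dependency. This is a sketch of an alternative, not the code from the answer; `m` is re-created here so the block runs on its own.

```r
# Same affiliation data as in the question.
m <- data.frame(
  vertex = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 9L, 9L, 10L, 1L, 1L, 6L, 6L),
  affilation = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 3L)
)

# Rows are vertices, columns are affiliations, cells are counts.
am <- table(m$vertex, m$affilation)

# One count vector per vertex, in the row order of the table.
values <- lapply(seq_len(nrow(am)), function(i) as.vector(am[i, ]))
names(values) <- rownames(am)
```

Assuming the graph was built with `vertices = 1:10` so that vertex order matches the table rows, `values` can be passed as `vertex.pie = values` in the `plot()` call above.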

Correlating two data sets in R leaving out subjects not in both

Still very new to R and have a question about performing a correlation. I have two data sets that I want to correlate. Let's say I named the sets Data1 and Data2 for simplicity. Most of the subjects are in both sets but there are some subjects that are not. This is a problem as I now have uneven data sets that cannot correlate. How do I tell R to ignore the subjects that are not in both data sets so that I can perform my correlation? I know there is likely a way to have R ignore these subjects in the same command where I ask it to correlate my sets.
Also if I want R to only correlate columns 4:7 using the subject IDs in column 1 would I, for example, use the command cor.test(Data1[1,4:7], Data2[1,4:7])?
Thanks for any help you can provide.
Disclaimer: not tested, because no MWE was provided.
Try something like this:
cor.test(subset(x=Data1, subset=ID==1, select=4:7), subset(x=Data2, subset=ID==1, select=4:7))
Try the following. Example data:
dat1 <- structure(list(V1 = c(9L, 2L, 5L, 9L, 9L), V2 = c(8L, 4L, 7L,
9L, 6L), V3 = c(4L, 5L, 7L, 7L, 8L), V4 = c(7L, 4L, 6L, 7L, 1L
), V5 = c(9L, 2L, 10L, 7L, 10L), subject = 1:5), .Names = c("V1",
"V2", "V3", "V4", "V5", "subject"), row.names = c(NA, -5L), class = "data.frame")
dat2 <- structure(list(V1 = c(2L, 6L, 5L, 9L, 7L), V2 = c(2L, 10L, 5L,
5L, 6L), V3 = c(3L, 4L, 3L, 8L, 7L), V4 = c(3L, 2L, 10L, 1L,
9L), V5 = c(2L, 4L, 8L, 1L, 6L), subject = c(1, 3, 5, 6, 8)), .Names = c("V1",
"V2", "V3", "V4", "V5", "subject"), row.names = c(NA, -5L), class = "data.frame")
Create an index of the subject IDs that are common to both:
indx <- intersect(dat1$subject, dat2$subject)
Apply cor.test to the rows with common subject IDs:
cor.test(as.matrix(dat1[dat1$subject %in% indx,3:5]), as.matrix(dat2[dat2$subject %in% indx, 3:5]))
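Filtering each data frame with `%in%` keeps the common subjects, but it only pairs them correctly if both data frames list those subjects in the same order. A sketch that aligns the rows explicitly with `merge()` and then tests one pair of columns at a time is shown below; the data and the choice of column `V4` are made up for illustration.

```r
# Made-up data: the second data frame lists its subjects in a different order.
dat1 <- data.frame(subject = c(1, 2, 3, 4, 5), V4 = c(7, 4, 6, 7, 1))
dat2 <- data.frame(subject = c(5, 3, 1, 6, 8), V4 = c(9, 10, 3, 1, 9))

# Inner join on subject: keeps only subjects present in both,
# with each subject's values from both data frames on the same row.
both <- merge(dat1, dat2, by = "subject", suffixes = c(".1", ".2"))

# Correlate the matched column pair.
res <- cor.test(both$V4.1, both$V4.2)
```

`merge()` drops subjects 2, 4, 6 and 8 here because each appears in only one data frame. With several columns to compare, the same `cor.test()` call can be repeated per column pair.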

Resources