Statistics on cluster member relationships over several days - r

Assume I have hourly data for 5 categories over 10 consecutive days, created as:
library(xts)
set.seed(123)
timestamp <- seq(as.POSIXct("2016-10-01"), as.POSIXct("2016-10-10 23:59:59"), by = "hour")
data <- data.frame(cat1 = rnorm(length(timestamp), 150, 5),
                   cat2 = rnorm(length(timestamp), 130, 3),
                   cat3 = rnorm(length(timestamp), 150, 5),
                   cat4 = rnorm(length(timestamp), 100, 8),
                   cat5 = rnorm(length(timestamp), 200, 15))
data_obj <- xts(data, timestamp)  # create time-series object
head(data_obj, 2)
Now, for each day separately, I perform clustering to see how these categories behave with respect to each other, using simple k-means:
daywise_data <- split.xts(data_obj, f = "days", k = 1)  # split data day-wise
clus_obj <- lapply(daywise_data, function(x) {          # cluster each day
  kmeans(t(x), 2)
})
Once clustering is done, I inspect the cluster relationships across the 10 days with
sapply(clus_obj, function(x) x$cluster)  # clustering results
which yields a table of cluster assignments with one row per category and one column per day.
On visual inspection, it is clear that cat1 and cat3 always remain in the same cluster, while cat4 and cat5 end up in different clusters on most of the 10 days.
Apart from visual inspection, is there any automatic approach to gather this type of statistic from such clustering tables?
Note: this is a dummy example. My real data frame contains 80 such categories over 100 continuous days, so an automatic summary like the one above would greatly reduce the effort.

Pair-counting cluster evaluation measures offer an easy way to tackle this problem.
Rather than looking at object-cluster assignments, which are unstable, these methods look at whether or not two objects are in the same cluster (such a co-occurrence is called a "pair").
So you could check whether these pairs change much over time, or not.
Since k-means is randomized, you may also want to run it several times for every time slice, as the runs may return different clusterings!
You could then say that, e.g., series 1 is in the same cluster as series 2 in 90% of the results, etc.
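A minimal sketch of this idea (my own illustration, not part of the original answer), applied to the clus_obj from the question: for every pair of categories, compute the fraction of days on which they land in the same cluster.
# Fraction of days each pair of categories shares a cluster.
# assign_mat has one row per category and one column per day.
assign_mat <- sapply(clus_obj, function(x) x$cluster)
cats <- rownames(assign_mat)
n <- length(cats)
same_frac <- matrix(NA_real_, n, n, dimnames = list(cats, cats))
for (i in seq_len(n))
  for (j in seq_len(n))
    same_frac[i, j] <- mean(assign_mat[i, ] == assign_mat[j, ])
same_frac  # e.g. same_frac["cat1", "cat3"] == 1 means always in the same cluster
A value near 1 means the pair is almost always clustered together; a value near 0 means almost never.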

Related

Analysis to identify similar occupations by frequency of skills requested in job postings (in R)

I have access to a dataset of job postings, which for each posting has a unique posting ID, the job posting occupation, and a row for each skill requested in each job posting.
The dataset looks a bit like this:
posting_id  occ_code  occname         skillname
1           1         data scientist  analysis
1           1         data scientist  python
2           2         lecturer        teaching
2           2         lecturer        economics
3           3         biologist       research
3           3         biologist       biology
1           1         data scientist  research
1           1         data scientist  R
I'd like to perform analysis in R to identify "close" occupations by how similar their overall skill demand is in job postings. E.g. if many of the top 10 in-demand skills for financial analysts matched some of the top 10 in-demand skills for data scientists, those could be considered closely related occupations.
To be clearer: I want to identify similar occupations by their overall skill demand in the postings, i.e. by summing the number of times each skill is requested for an occupation and then finding which other occupations have similar frequently requested skills.
I am fairly new to R so would appreciate any help!
I think you might want an unsupervised clustering strategy. See the help page for hclust for a debugged, worked example. The code below is untested.
# Load necessary libraries
library(tidyverse)
library(reshape2)

# Read in the data
data <- read.csv("path/to/your/data.csv")

# Sum the number of times each skill is requested for each occupation
skill_counts <- data %>%
  group_by(occ_code, skillname) %>%
  summarise(count = n(), .groups = "drop")

# Get the top 10 in-demand skills for each occupation
top_10_skills <- skill_counts %>%
  group_by(occ_code) %>%
  slice_max(count, n = 10) %>%
  ungroup()

# Convert the data into an occupation-by-skill matrix for clustering
skill_matrix <- dcast(top_10_skills, occ_code ~ skillname, value.var = "count", fill = 0)

# Perform hierarchical clustering on the occupations (rows)
fit <- hclust(dist(skill_matrix[, -1]), method = "ward.D2")

# Plot the dendrogram
plot(fit, hang = -1, labels = skill_matrix$occ_code, main = "Occupation Clustering")
The resulting dendrogram will show the relationships between the occupations based on their skill demand: closely related occupations are grouped together, while more distantly related occupations are joined further apart.
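As a small follow-up sketch (my addition, assuming the fit and skill_matrix objects from above), cutree() turns the dendrogram into discrete occupation groups:
groups <- cutree(fit, k = 5)            # k = 5 groups is an arbitrary choice
split(skill_matrix$occ_code, groups)    # occupations in each group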

How to alter R code in an efficient way

I have a sample created as follows:
survival1a= data.frame(matrix(vector(), 50, 2,dimnames=list(c(), c("Id", "district"))),stringsAsFactors=F)
survival1a$Id <- 1:nrow(survival1a)
survival1a$district<- sample(1:4, size=50, replace=TRUE)
This sample has 50 individuals from 4 different districts. I also have a matrix of probabilities (Migdata) that gives the likelihood of migration from one district to another:
district  prob1    prob2    prob3    prob4
1         0.83790  0.08674  0.05524  0.02014
2         0.02184  0.88260  0.03368  0.06191
3         0.01093  0.03565  0.91000  0.04344
4         0.03338  0.06933  0.03644  0.86090
I merge these probabilities into my data with:
survival1a <- merge(Migdata, survival1a, by.x = "district", by.y = "district")
I would like to know which district each person resides in by the end of the year, based on the migration probabilities in Migdata.
I have already written code that works perfectly, but with big data it is very time-consuming since it is based on a loop:
for (k in 1:nrow(survival1a)) {
  survival1a$migration[k] <- sample(1:4, size = 1, replace = TRUE, prob = survival1a[k, 2:5])
}
Now, I want to write the code in a way that does not rely on a loop but still gives each person's district at the end of the year.
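One way to avoid the explicit row-by-row loop (a sketch of my own, assuming the merged survival1a with the probability columns in positions 2:5) is inverse-transform sampling: draw one uniform number per person and find the first column where the cumulative migration probability exceeds it.
probs <- as.matrix(survival1a[, 2:5])    # per-person migration probabilities
cum_probs <- t(apply(probs, 1, cumsum))  # row-wise cumulative probabilities
u <- runif(nrow(probs))                  # one uniform draw per person
survival1a$migration <- max.col(u < cum_probs, ties.method = "first")
This reproduces what the loop does, drawing each person's destination district in one vectorized step.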

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found uses the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)

# load the human variation data
variation = useEnsembl(biomart = "snp", dataset = "hsapiens_snp")

# look up a single gene and get SNP data
getBM(attributes = c('ensembl_gene_stable_id',
                     'refsnp_id',
                     'chr_name',
                     'chrom_start',
                     'chrom_end',
                     'minor_allele',
                     'minor_allele_freq'),
      filters = 'ensembl_gene',
      values = "ENSG00000166813",
      mart = variation)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long: the query above takes about 45 seconds. I thought this might be related to the allele frequencies, which the server perhaps calculates on the fly, but requesting the bare minimum, only the SNP rs ids, still takes about 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). That can't be right. My internet connection is not slow (20-30 Mbit).
I tried looking up more genes per query, but this did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of the SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
Since you're using R, here's an idea that uses the rentrez package. It utilizes NCBI's Entrez database system, in particular the E-utilities function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)

# for converting gene name -> gene id
gene_search <- entrez_search(db = "gene",
                             term = "(PTEN[Gene Name]) AND Homo sapiens[Organism]",
                             retmax = 1)
geneId <- gene_search$ids

# elink function: find SNPs linked to the gene
snp_links <- entrez_link(dbfrom = 'gene', id = geneId, db = 'snp')

# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc.
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
With by_id=TRUE, the results are grouped by gene.
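As a small follow-up (my addition; the dbSNP uids returned by elink are, to my knowledge, the numeric part of the rs accession), you can prepend "rs" to get the familiar rs ids:
rs_ids <- paste0("rs", snp_links$links$gene_snp)
head(rs_ids)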

R t-test on multiple levels of a factor

I'm trying to perform multiple t-tests on my dataset in R and got totally confused by the capabilities of the apply functions, aggregate, and for loops.
My data is as follows: the observations are different products, and for each product I have multiple numeric variables that I'd like to compare. In addition, the products fall into 13 different categories, AND there is another factor variable that differentiates between new, used, and old products. So a sample of my data may look like this:
ProdID  Category  Cond  No. of instances  Sales  Time since launch
aaaaa   Sports    New   100               40000  30
bbbb    Crafts    New   0                 0      20
ccccc   Music     Used  20                1000   10
My goal is the following: for each Category (Sports, Crafts, Music, etc.), I want to output the results of a t-test. This t-test should compare the means of each numeric variable between "New" and "Used" products (I'm not interested in the "Old" values at all). So in the end I want to see the comparison of "Time since launch", "Sales", and "No. of instances" between new and used in Sports, then the same in Crafts, the same in Music, and so on...
I've tried it in so many ways, but with each of them (aggregate, tapply, for loop) I ran into a different problem... It seems that I'm missing something here (I'm kind of new to R; I used to do this in SPSS using split file...).
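A minimal sketch of one possible approach (not from the original thread; the data frame name products and the column names Instances, Sales, and TimeSinceLaunch are hypothetical stand-ins for the real ones): reshape the numeric variables to long format, then run one t.test per Category/variable combination, comparing "New" against "Used".
library(dplyr)
library(tidyr)

results <- products %>%
  filter(Cond %in% c("New", "Used")) %>%   # drop "Old" products entirely
  mutate(Cond = factor(Cond)) %>%          # ensure exactly two levels remain
  pivot_longer(c(Instances, Sales, TimeSinceLaunch),
               names_to = "variable", values_to = "value") %>%
  group_by(Category, variable) %>%
  summarise(p_value = t.test(value ~ Cond)$p.value, .groups = "drop")
results  # one row per Category x variable, with the New-vs-Used p-value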

HMM text recognition in R depmixS4

I'm wondering how I would utilize the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a testing data set?
I have a file of training data, a file of label data, and a test data.
The training data consists of 4620 rows, each with 1079 values. These values are 83 windows with 13 values per window; in other words, each row of 1079 values is made up of 83 states with 13 observations each. Each row is one spoken word, so there are 4620 utterances. In total, the data has only 7 distinct words, and each distinct word has 660 different utterances, hence the 4620 rows of words.
So we have words (0-6).
The label file is a list where each row is labeled 0-6 corresponding to what word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at:
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
       prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
       ntimes=NULL, ...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, example to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <- c("INDPRO")
getSymbols(fred.tickers, src = "FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise, and convert it into data.frame format for analysis in depmixS4:
indpro.1yr <- na.omit(ROC(INDPRO, 12))
indpro.1yr.df <- data.frame(indpro.1yr)
Now, let's run a simple HMM model and choose just 2 states--growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response = INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition = ~1)
Now let's fit the resulting model, generate posterior states for analysis, and estimate the probabilities of recession. We'll also bind the data with dates in xts format for easier viewing/analysis. (Note the use of set.seed(1), which creates a replicable starting value for the model fitting.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <- model.prob[, 2]
prob.rec.dates <- xts(prob.rec, order.by = as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
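The plot mentioned above can be produced with a one-liner (my addition, using xts's plot method):
# Plot the estimated probability of recession over time
plot(prob.rec.dates, main = "Estimated probability of recession")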
High values (say, above 0.80) indicate/suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.
