How to determine order of occurrence for Tweets (survival analysis?)? - r

Trying to figure something out for a pet project and I am awfully stuck :(
The project centers around the order of Tweet occurrences. I collected Tweets on 3 different topics for 2 actors. I want to determine which actor's tweet on each particular topic occurred earlier overall. A friend recommended I look into the package "survival", but I couldn't see how it could work. Any suggestions would be welcome! Thanks so much!
EDIT: Additional information
created_at   name            party  type  topic
1544469754   chicagotribune  M      1     trade
1541550304   chicagotribune  M      1     trade
The variables represent the following information:
- created_at: the time the tweet was sent out (Unix timestamp)
- name: Twitter account name
- party: classification variable of political leaning
- type: binary indicator (1 = media type A, 0 = media type B)
- topic: the topic the tweet belongs to (3 topics total)

I don't think this is a survival analysis problem; you just need to find the earliest timestamp within each topic. Something like this should work:
# Read in example data
library(dplyr)
df <- readr::read_table("created_at name party type topic
1544469754 chicagotribune M 1 trade
1541550304 chicagotribune M 1 trade")
# For each topic, report the account with the earliest tweet
df %>%
  group_by(topic) %>%
  summarise(first_tweeter = name[which.min(created_at)])
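The sample only shows one account, but if both actors' tweets are stacked in the same data frame, a minimal (untested) extension of the same idea reports, per topic, who tweeted first and when; the as.POSIXct conversion assumes created_at is a Unix timestamp:
# For each topic, keep the single earliest tweet and report who sent it
df %>%
  group_by(topic) %>%
  slice_min(created_at, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(first_time = as.POSIXct(created_at, origin = "1970-01-01", tz = "UTC")) %>%
  select(topic, first_tweeter = name, first_time)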

Related

Analysis to identify similar occupations by frequency of skills requested in job postings (in R)

I have access to a dataset of job postings which has, for each posting, a unique posting ID, the posting's occupation, and a row for each skill requested in that posting.
The dataset looks a bit like this:
posting_id  occ_code  occname         skillname
1           1         data scientist  analysis
1           1         data scientist  python
2           2         lecturer        teaching
2           2         lecturer        economics
3           3         biologist       research
3           3         biologist       biology
1           1         data scientist  research
1           1         data scientist  R
I'd like to perform analysis in R to identify "close" occupations by how similar their overall skill demand is in job postings. E.g. if many of the top 10 in-demand skills for financial analysts matched some of the top 10 in-demand skills for data scientists, those could be considered closely related occupations.
To be more clear, I want to identify similar occupations by their overall skill demand in the postings i.e. by summing the no. of times each skill is requested for an occupation, and identifying which other occupations have similar frequently requested skills.
I am fairly new to R so would appreciate any help!
I think you might want an unsupervised clustering strategy. See the help page for hclust for a worked example. The code below is untested:
# Load necessary libraries
library(tidyverse)
library(reshape2)
# Read in the data
data <- read.csv("path/to/your/data.csv")
# Count how many times each skill is requested for each occupation
skill_counts <- data %>%
  group_by(occname, skillname) %>%
  summarise(count = n(), .groups = "drop")
# Keep the top 10 in-demand skills per occupation
top_10_skills <- skill_counts %>%
  group_by(occname) %>%
  slice_max(count, n = 10) %>%
  ungroup()
# Convert to an occupation-by-skill matrix for clustering
skill_matrix <- dcast(top_10_skills, occname ~ skillname, value.var = "count", fill = 0)
rownames(skill_matrix) <- skill_matrix$occname
skill_matrix$occname <- NULL
# Perform hierarchical clustering of occupations by their skill profiles
fit <- hclust(dist(skill_matrix), method = "ward.D2")
# Plot the dendrogram
plot(fit, hang = -1, main = "Occupation Clustering")
The resulting dendrogram will show the relationships between the occupations based on their skill demand: closely related occupations will be grouped together, and more distantly related occupations will appear further apart.
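If you prefer an explicit similarity score over dendrogram distances, a minimal, untested sketch is to compute cosine similarity between the rows of the skill_matrix built above (cosine_sim here is a small helper defined for illustration, not a library function):
# Cosine similarity between occupation skill-count vectors
cosine_sim <- function(m) {
  m <- as.matrix(m)
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}
occ_similarity <- cosine_sim(skill_matrix)
# Occupations most similar to "data scientist" (an occupation from the sample data), highest first
sort(occ_similarity["data scientist", ], decreasing = TRUE)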

TABLEAU: How can I measure similarity of sets of dimensions across dates?

This is a bit of a complicated one, but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video-on-demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10); the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be roughly 33%.
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)
You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you'll have to write your function, of course, and then use the function Rserve() to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above controls the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau’s web site resources for more information. The official name for this functionality is Tableau “analytic extensions”
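For orientation, here is a minimal sketch of the two sides; the Tableau formula and the [Similarity] field name are only illustrative assumptions:
# R side: start Rserve so Tableau can connect (default port 6311)
library(Rserve)
Rserve()
# Tableau side (a calculated field), for illustration only:
# SCRIPT_REAL("mean(.arg1)", SUM([Similarity]))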
I am sharing a strategy to solve this in R.
Step-1: Load the libraries and data
library(tidyverse)
library(lubridate)
movies <- as_tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Step-2: Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Step-3: Getting the final result
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>% # Restricting search to designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>% # removing duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>% # checking whether it was screened the previous day
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 17 0 0
2 2018-08-14 17 10 0.588
3 2018-08-15 17 9 0.529
If you change the filters to 12, 12 respectively, then
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 68 0 0
2 2018-08-14 75 61 0.813
3 2018-08-15 72 54 0.75
Good Luck
As Alex has suggested, you can have the best of both worlds. However, to the best of my knowledge, Tableau Desktop interfaces with R (or Python, etc.) only through calculated fields, i.e. SCRIPT_INT, SCRIPT_REAL and so on. These are table calculations, which in Tableau only work within a context: the resulting values cannot be hard-coded as ordinary fields/columns, so you are not at liberty to use them independently of that context. Moreover, table calculations in Tableau can neither be further aggregated nor mixed with LOD expressions. Thus, in your use case (again, to the best of my knowledge), you can build a parameter-dependent view in Tableau after hard-coding the values with whatever programming language you prefer. I therefore suggest that, prior to importing the data into Tableau, you create a new column in your dataset by running the following (or an alternative in your programming language of choice):
library(dplyr)
library(lubridate)
movies_edited <- movies %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  ungroup()
write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, where 1 denotes that the title was available on the previous day, 0 denotes it was not screened on the immediately previous day, and NA means it is the first day of its screening.
I have imported this dataset into Tableau and created a parameter-dependent view, as you desired.

How to find topics and phrases (verbs/adjectives) that are grouped with a specific word?

I have a large tidy data set with columns containing text responses (i.e., from a grant application) and rows as the individual organizations applying for the grant. I'm trying to find the topics and phrases grouped with a specific word (e.g., "funder"/"funding"). More specifically, what adjectives and verbs are being grouped with these tokens?
So for example
text <- "This funding would help us create a new website and hire talented people."
So "funding" can be grouped with verbs like "create", "hire", and adjective phrases like "new website", "talented people".
I'm doing this in R. Does anyone have a package or program in which they'd recommend doing this? I've found cleanNLP, but not sure if this is the most convenient package. Would I need to tokenize all the words? If so, wouldn't I have problems grouping phrases?
I'm fairly new to NLP/text mining, so I apologize for the introductory question.
Thank you!
This is a huge area to start exploring.
I would strongly recommend taking a look at the tidytextmining book and package, as well as the authors' personal blogs (https://juliasilge.com, http://varianceexplained.org); there is a huge amount of great work there to get you started, and it's really well written for people new to NLP.
Also really helpful for what you are looking for are the widyr and udpipe libraries.
Here's a couple of examples:
Using widyr we can look at the pairwise pmi between a word, say funding, and all other words that it has some relationship with. For info on PMI check out: https://stackoverflow.com/a/13492808/2862791
library(tidytext)
library(tidyverse)
library(widyr)
texts <- tibble(id = 1:4,
                text = c('This funding would help us create a new website and hire talented people',
                         'this random funding function talented people',
                         'hire hire hire new website funding',
                         'fun fun fun for all'))
# Tokenise into one word per row, keeping the sentence id
tidy_texts <- texts %>%
  unnest_tokens(word, text)
tidy_texts %>%
  pairwise_pmi(word, id) %>%
  filter(item1 == 'funding') %>%
  top_n(5, wt = pmi) %>%
  arrange(desc(pmi))
item1 item2 pmi
<chr> <chr> <dbl>
1 funding this -0.0205
2 funding would -0.0205
3 funding help -0.0205
4 funding us -0.0205
So to introduce adjectives and phrases you could look at udpipe, as boski suggested.
I'm going to reproduce the above and calculate the PMI again, as it's a really intuitive and quick-to-compute metric.
library(udpipe)
english <- udpipe_download_model(language = "english")
ud_english <- udpipe_load_model(english$file_model)
tagged <- udpipe_annotate(ud_english, x = texts$text)
tagged_df <- as.data.frame(tagged)
tagged_df %>%
  filter(upos == 'ADJ' | token == 'funding') %>%
  pairwise_pmi(token, doc_id) %>%
  filter(item1 == 'funding')
item1 item2 pmi
<chr> <chr> <dbl>
1 funding new 0.170
2 funding talented 0.170
You've mentioned cleanNLP, which is a great library for this kind of work. It makes it easy to access udpipe and spacyr and a few other methods which do the kind of tokenisation and tagging needed for that adjective finding.
If you can get past the setup details, spacyr is my preferred option just because it's the fastest, but if speed isn't an issue I would just go with udpipe as it's very easy to use.
Would I need to tokenize all the words? If so, wouldn't I have problems grouping phrases?
So udpipe, and other text annotators, have a solution for this.
In udpipe you can use keywords_collocation(), which identifies words that occur together more frequently than expected by random chance.
We would need a text dataset bigger than the few junk sentences I've written above to get a meaningful reproducible example.
But you can find out a lot through this blog:
https://bnosac.github.io/udpipe/docs/doc7.html
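As a minimal, untested sketch on the small tagged_df built above (the token and doc_id columns come from the udpipe_annotate() output), a call might look like this:
# Find word pairs/triples that co-occur more often than expected by chance
collocs <- keywords_collocation(tagged_df,
                                term = "token",
                                group = "doc_id",
                                ngram_max = 3)
head(collocs)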
Sorry this reply is kind of a collection of links ... but as I said it's a huge area of study.

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
and I simply want to find their function, and I've been suggested to use GO analysis tools.
I am not sure if this is the correct way to do so.
Here is my solution:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
xx <- as.list(x[gene_list$ENTREZID])
So, I've got a list with EntrezID that are assigned to several GO terms for each genes.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is:
how can I find the function for each of these genes in a simpler way, and am I doing this right?
I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming at here.
BTW, for bioinformatics-related topics, you can also have a look at Biostars, which has the same purpose as SO but for bioinformatics.
If you just want a list of each function related to the genes, you can query a database such as ENSEMBL through the biomaRt Bioconductor package, which is an API for querying the BioMart databases.
You will need internet access to do the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that walk you through the different steps of the analysis (and even highlight how you should design your data, or what some of the pitfalls might be).
In your case, directly from the biomaRt vignette - task 2 in particular:
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer ensembl, so that's the one I will query, but you can
# query other bases; try out: listMarts()
ensembl <- useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms have a look at:
# listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes: your GO number and description. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look like something as:
goids <- getBM(
  # you want entrezgene so you know which is what; the GO ID and
  # name_1006 is actually the identifier of the 'GO term name'
  attributes = c('entrezgene', 'go_id', 'name_1006'),
  filters = 'entrezgene',
  values = gene_list$ENTREZID,
  mart = ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  return(
    data.frame('ENTREZGENE' = x,
               'Go.ID' = paste(tempo$go_id, collapse = ' ; '),
               'GO.term' = paste(tempo$name_1006, collapse = ' ; '))
  )
}))
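If the goal is to end up with the GO annotation as extra columns on gene_list, a small follow-up sketch (using base merge() on the shared ENTREZ identifier) could be:
# Attach the collapsed GO columns back onto the original gene list
gene_list_annotated <- merge(gene_list, Go.collapsed,
                             by.x = "ENTREZID", by.y = "ENTREZGENE",
                             all.x = TRUE)
head(gene_list_annotated)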
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.

How do I optimize association analysis for the rules to make sense?

I have a dataset of customers for which I want to identify frequent criteria, to paint a picture of an ideal customer.
The dataset has the following fields:
email
fullname
Job (title)
company web domain
company description (string data)
company founded (year)
company employees (number)
company city
company state
company country
linkedin groups followed
created
updated
Except for Company Employees, Company Founded, Created and Updated, there is no numerical data. The dataset has other useful data, like age (interval) and sex, but those columns have too many missing values, so I removed them for the analysis.
I ran the code in R:
data1 <- read.csv("final_account_list.csv")
library(arules)
str(data1)
data1$Company.Founded <- factor(data1$Company.Founded)
rules1 <- apriori(data1)
rules1
inspect(rules1)
options(digits=2)
inspect(rules1[1:5])
I am getting a list of 59 rules, but they don't make much sense. For example:
{Company.Employees = 500} => {Company.Country = USA}, lift = 1.176, confidence = 0.083, support = 0.109
The fact that the majority of customers have 500 employees and are in the USA does not bring much value. How do I make my analysis more meaningful?
For example, how do I find association for the title, geographies (city, state) and linkedin groups?
The most non-trivial part is to define what a "meaningful rule" means for you in terms of the right-hand side (rhs) and/or left-hand side (lhs).
Then, as described in the documentation of the arules package, you can investigate your rules.
For your example
how do I find association for the ... linkedin groups
you can use
# find rules with "linkedin groups followed" on the right-hand side
# (%pin% does partial matching on item labels such as "linkedin.groups.followed=...")
rulesLinkedIn <- subset(rules1, subset = rhs %pin% "linkedin.groups.followed")
# inspect the rules with the highest confidence
inspect(head(sort(rulesLinkedIn, by = "confidence"), n = 3))
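If you would rather mine only rules with a specific right-hand side in the first place, apriori() also accepts an appearance argument; a minimal sketch follows, where the item label and thresholds are assumptions you would adapt to your own data:
# Mine only rules whose rhs is a specific (hypothetical) linkedin-groups item
rulesTargeted <- apriori(data1,
                         parameter = list(supp = 0.05, conf = 0.5, minlen = 2),
                         appearance = list(rhs = "linkedin.groups.followed=Analytics",
                                           default = "lhs"))
inspect(head(sort(rulesTargeted, by = "lift"), n = 5))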
