This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large CSV file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
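If you want a single score per document, a minimal sketch (my own addition, not from the original answer) is to convert the dfm to a data frame and take the positive minus the negative counts:
sent <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")
# net sentiment of each keyword window: positive hits minus negative hits
sent$net <- sent$positive - sent$negative
sent[, c("doc_id", "negative", "positive", "net")]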
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic; in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), here a window of length 3. After that you can run some form of sentiment analysis on the pre and post results (or paste them together first).
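For instance, a hedged sketch (my addition, not part of this answer): paste the pre and post columns together and score each window with quanteda's LSD2015 sentiment dictionary:
ctx <- as.data.frame(x)
# combine the words before and after each keyword into one text per match
ctx$window <- paste(ctx$pre, ctx$post)
dfm(tokens_lookup(tokens(ctx$window), data_dictionary_LSD2015))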
What gsub function can I use in R to get the gene name and the id number from a vector which looks like this?
head(colnames(cn), 20)
[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)"
1) Assuming the input s given in the Note at the end, we can use read.table, specifying that the fields are separated by "(" and that ")" is a comment character. We also strip whitespace around fields and give meaningful column names. No packages are used.
DF <- read.table(text = s, sep = "(", comment.char = ")",
strip.white = TRUE, col.names = c("Gene", "Id"))
DF
giving the data frame below, so DF$Gene holds the gene names and DF$Id the ids.
Gene Id
1 A1BG 1
2 NAT2 10
3 ADA 100
4 CDH2 1000
5 AKT3 10000
6 GAGE12F 100008586
7 RNA5-8SN5 100008587
8 RNA18SN5 100008588
9 RNA28SN5 100008589
10 LINC02584 100009613
11 POU5F1P5 100009667
12 ZBTB11-AS1 100009676
13 MED6 10001
14 NR2E3 10002
15 NAALAD2 10003
16 DUXB 100033411
17 SNORD116-1 100033413
18 SNORD116-2 100033414
19 SNORD116-3 100033415
20 SNORD116-4 100033416
2) A variation of the above is to first remove the parentheses and then read it in, giving the same result. Note that the second argument of chartr contains two spaces so that each parenthesis is translated to a space.
read.table(text = chartr("()", "  ", s), col.names = c("Gene", "Id"))
Note
Lines <- '[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)" '
L <- Lines |>
  textConnection() |>
  readLines() |>
  gsub(pattern = "\\[\\d+\\]", replacement = "")
s <- scan(text = L, what = "")
so s looks like this:
> dput(s)
c("A1BG (1)", "NAT2 (10)", "ADA (100)", "CDH2 (1000)", "AKT3 (10000)",
"GAGE12F (100008586)", "RNA5-8SN5 (100008587)", "RNA18SN5 (100008588)",
"RNA28SN5 (100008589)", "LINC02584 (100009613)", "POU5F1P5 (100009667)",
"ZBTB11-AS1 (100009676)", "MED6 (10001)", "NR2E3 (10002)", "NAALAD2 (10003)",
"DUXB (100033411)", "SNORD116-1 (100033413)", "SNORD116-2 (100033414)",
"SNORD116-3 (100033415)", "SNORD116-4 (100033416)")
First, in the future please share your data using the dput() command. See this for details.
Second, here is one solution for extracting the parts you need:
library(tidyverse)
g<-c("A1BG (1)","NAT2 (10)","ADA (100)" , "RNA18SN5 (100008588)", "RNA28SN5 (100008589)")
gnumber<-stringr::str_extract(g,"(?=\\().*?(?<=\\))")
gnumber
gname<-stringr::str_extract(g, "[:alpha:]+")
gname
# or, to get the whole first word:
gname<-stringr::word(g,1,1)
gname
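Since the question asked for gsub specifically, here is a base-R sketch (my own addition, assuming every entry has the form "NAME (ID)"):
# gene name: drop everything from the opening parenthesis onward
gname <- sub("\\s*\\(.*$", "", g)
# id number: keep only the digits inside the parentheses
gid <- gsub("^.*\\((\\d+)\\).*$", "\\1", g)
gname
gid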
I am trying to read a TSV file in R using the read.table function.
myTable <- read.table("file_path", sep = "\t", header = TRUE)
But when I try the command
names(myTable)
It gives me column names indexed only at odd numbers, with the even-numbered names merged onto the same lines:
[1] "GeneSymbol" "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp" "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp" "GSM480634_JK_C_05.07.mas5.chp"
These are the exact column names; you can see that two column names appear on each line, separated by a space, while only the odd-numbered indices are listed.
The output should be like this:
[1] "GeneSymbol"
[2] "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp"
[4] "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp"
[6] "GSM480634_JK_C_05.07.mas5.chp"
This is creating a problem when assigning these column names to another table. Any suggestions?
As noted in the comments, R is displaying all the columns, just not in the format you expect. A one-name-per-line display can be forced by wrapping the result of names() in as.data.frame(), as follows:
rawData <- "
Number,Name,Type1,Type2,Total,HP,Attack,Defense,SpecialAtk,SpecialDef,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
9,Blastoise,Water,,530,79,83,100,85,105,78,1,False"
gen01 <- read.csv(text = rawData, header = TRUE)
as.data.frame(names(gen01))
...and the output:
> as.data.frame(names(gen01))
names(gen01)
1 Number
2 Name
3 Type1
4 Type2
5 Total
6 HP
7 Attack
8 Defense
9 SpecialAtk
10 SpecialDef
11 Speed
12 Generation
13 Legendary
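As a smaller alternative (my own sketch), you can also print one name per line without building a data frame:
cat(names(gen01), sep = "\n")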
So I want to find patterns and "clusters" based on what items are bought together, and according to the wiki for eclat:
The Eclat algorithm is used to perform itemset mining. Itemset mining let us find frequent patterns in data like if a consumer buys milk, he also buys bread. This type of pattern is called association rules and is used in many application domains.
However, when I use eclat in R, I get "zero frequent items" and "NULL" when retrieving the results through tidLists. Can anyone see what I am doing wrong?
The full dataset: https://pastebin.com/8GbjnHK2
Each row is a transaction, containing different items in the columns. Quick snapshot of the data:
3060615;;;;;;;;;;;;;;;
3060612;3060616;;;;;;;;;;;;;;
3020703;;;;;;;;;;;;;;;
3002469;;;;;;;;;;;;;;;
3062800;;;;;;;;;;;;;;;
3061943;3061965;;;;;;;;;;;;;;
The code
trans = read.transactions("Transactions.csv", format = "basket", sep = ";")
f <- eclat(trans, parameter = list(supp = 0.1, maxlen = 17, tidLists = TRUE))
dim(tidLists(f))
as(tidLists(f), "list")
Could it be due to the data structure? In that case, how should I change it? Furthermore, what do I do to get the suggested itemsets? I couldn't figure that out from the wiki.
EDIT: I used 0.004 for supp, as suggested by #hpesoj626. But it seems like the function is grouping the orders/users and not the items. I don't know how to export the data, so here is a picture of the tidLists:
The problem is that you have set your support too high. Try lowering supp, say to supp = .001, for which we get
dim(tidLists(f))
# [1] 928 15840
For your data set, the highest support is 0.08239, which is below 0.1. That is why you are getting no results with supp = 0.1.
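You can check this yourself (a quick sketch of my own, using arules::itemFrequency() on the trans object from the question):
# largest single-item support; should match the 0.08239 mentioned above
max(itemFrequency(trans))
The sorted itemset supports confirm this: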
inspect(head(sort(f, by = "support"), 10))
# items support count
# [1] {3060620} 0.08239 1305
# [2] {3060619} 0.07260 1150
# [3] {3061124} 0.05688 901
# [4] {3060618} 0.05663 897
# [5] {4027039} 0.04975 788
# [6] {3060617} 0.04564 723
# [7] {3061697} 0.04306 682
# [8] {3060619,3060620} 0.03087 489
# [9] {3039715} 0.02727 432
# [10] {3045117} 0.02708 429
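To get the suggested itemsets as association rules (the "if a consumer buys milk, he also buys bread" patterns), one option is arules::ruleInduction(); a hedged sketch, assuming the trans and f objects from above:
library(arules)
# induce rules from the eclat() itemsets; confidence = 0.5 is an arbitrary choice
rules <- ruleInduction(f, trans, confidence = 0.5)
inspect(head(sort(rules, by = "lift"), 10))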
How can I read the following vector c of strings into a list of tables? Which way is the shortest: read.table or strsplit? E.g., I can't see how to read the table in c[4:6] (Edit: a[4:6]) in one command.
require(car)
m <- matrix(rnorm(16), 4, 4, byrow = TRUE)
a <- Anova(lm(m ~ 1), type = 3, idata = data.frame(treatment = factor(1:4)), idesign = ~treatment)
c <- capture.output(summary(a, multivariate = FALSE))
c
This returns lines 4:6
c[4:6]
Now, if you wanted to parse this, I would do it in two steps: first read the column values from rows 5:6, then add back the names.
> vals <- read.table(text=c[5:6])
> txt <- " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)"
> names(vals) <- names(read.delim(text=txt))
> vals
X SS num.Df Error.SS den.Df F Pr..F.
1 (Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614
2 treatment 1.85936442 3 8.2899759 9 0.67287 0.58996
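A compact variant (my own sketch, the same two steps in one expression):
vals <- setNames(read.table(text = c[5:6]),
                 names(read.delim(text = " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)")))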
EDIT --
You could look at the source code of the summary function and calculate the required quantities yourself:
getAnywhere(summary.Anova.mlm)
The original idea does not seem to work.
c2 <- summary(a)
# find out what 'properties' the summary object has
# turns out, it is just the Anova object
class(c2) <- "list"
names(c2)
This returns
[1] "SSP" "SSPE" "P" "df" "error.df"
[6] "terms" "repeated" "type" "test" "idata"
[11] "idesign" "icontrasts" "imatrix" "singular"
and we can access them:
c2$SSP
c2$SSPE
As an aside, it is not a good idea to use the name of R's built-in c() function as a variable name.