T-test after running sentiment analysis in R

I am trying to run a t-test after doing a sentiment analysis.
I ran the sentiment analysis and grouped my data into two parts:
library(textdata)
library(tidytext) # for get_sentiments() and unnest_tokens()
library(dplyr)    # for the pipe and the verbs below
afinn_dictionary <- get_sentiments("afinn")
news_tokenized <- full_data %>%
  unnest_tokens(word, full_article, to_lower = TRUE)
head(news_tokenized$word, 10)
full_data$full_article[2]
word_counts_senti <- news_tokenized %>%
  inner_join(afinn_dictionary) # joins on the shared "word" column
head(word_counts_senti)
news_senti <- word_counts_senti %>%
  group_by(partisan_media) %>% # group by partisan media
  summarize(sentiment = sum(value))
head(news_senti)
# As a result I got: group 1: -13194, group 2: -12321. Both groups were negative, but group 1's stories tend to use more negative words (have a greater negative sentiment).
table(full_data$partisan_media) # there are 1866 articles in group 1 and 2174 in group 2
I am trying to see whether the difference between groups 1 and 2 (two groups of partisan media) is statistically significant by running a t-test.
I'm using:
g1_senti <- rnorm(1866, mean = -7.07074, sd = ) # group 1
g2_senti <- rnorm(2174, mean = -5.667433, sd = ) # group 2
t.test(g1_senti, g2_senti)
The means are each group's total sentiment score divided by that group's number of articles.
But I wasn't sure what should be entered for the sd argument. Does anyone have an idea about this?
I am adding my data set here: https://www.mediafire.com/file/uei2e3tajvi7wao/eight.csv/file
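For reference, a minimal sketch of how the missing SDs could come from the data itself rather than from a guess: compute one sentiment score per article and run the t-test on those observed scores directly, so no rnorm() simulation is needed. This assumes full_data has an article identifier column (called article_id here, which is hypothetical) and that partisan_media is coded 1/2 as in the table() output above.
# Sketch: per-article sentiment scores, then a t-test on the real scores
# ("article_id" is a hypothetical identifier column; substitute your own)
article_senti <- word_counts_senti %>%
  group_by(article_id, partisan_media) %>%
  summarize(sentiment = sum(value), .groups = "drop")
g1 <- article_senti$sentiment[article_senti$partisan_media == 1]
g2 <- article_senti$sentiment[article_senti$partisan_media == 2]
sd(g1); sd(g2) # the standard deviations the rnorm() calls were missing
t.test(g1, g2) # better still: test the observed scores, not simulated ones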


Repeated measures - correct code for measuring differences between Treatments (over Time)

For a research project I need to run an ANOVA to test the statistical significance of the differences between some treatments.
The experiment consisted of inoculating bacteria into tubes containing different treatments at different concentrations.
My dependent variable is the optical density at 660 nm (OD660) measured on a spectrophotometer; I measured the OD at 13 time points.
Here is the dataset; I'll give you all of it, it is not very big:
od34_stat1 <- data.frame(
OD = c(0.032667,0.09,0.157,0.184,0.345667,
0.4445,0.47725,0.53925,0.74,0.750667,0.859167,0.880333,
0.8275,0.034667,0.0935,0.146,0.1725,0.522167,0.5865,0.71075,
0.69875,0.927,0.929667,1.063167,1.037333,0.973,0.031167,
0.1045,0.139,0.1665,0.425667,0.523,0.69875,0.80575,
1.0435,0.994667,1.085667,1.215333,1.1145,0.034667,0.1085,
0.1285,0.1645,0.349667,0.474,0.74075,0.78125,1.0815,
0.937167,1.045667,1.104333,0.9555,0.028167,0.065,0.13,0.1715,
0.331667,0.4015,0.45775,0.54425,0.811,0.739167,0.797167,
0.773333,0.6905,0.021167,0.0835,0.131,0.1585,0.279167,
0.384,0.40225,0.46975,0.646,0.625667,0.684667,0.701333,
0.5885,0.015667,0.0655,0.086,0.12,0.191667,0.261,0.29875,
0.35825,0.446,0.411167,0.364667,0.369333,0.31),
Treatment = as.factor(c("0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","0_CNTRL","0_CNTRL",
"0_CNTRL","0_CNTRL","10_TOX","10_TOX","10_TOX","10_TOX",
"10_TOX","10_TOX","10_TOX","10_TOX","10_TOX",
"10_TOX","10_TOX","10_TOX","10_TOX","25_TOX",
"25_TOX","25_TOX","25_TOX","25_TOX","25_TOX","25_TOX",
"25_TOX","25_TOX","25_TOX","25_TOX","25_TOX",
"25_TOX","50_TOX","50_TOX","50_TOX","50_TOX",
"50_TOX","50_TOX","50_TOX","50_TOX","50_TOX",
"50_TOX","50_TOX","50_TOX","50_TOX","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","10_CNTRL",
"10_CNTRL","10_CNTRL","10_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","25_CNTRL",
"25_CNTRL","25_CNTRL","25_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL","50_CNTRL",
"50_CNTRL","50_CNTRL")),
Time = as.factor(c("0","2","4","6",
"70","94","478","496","568","616","736","784",
"808","0","2","4","6","70","94","478","496",
"568","616","736","784","808","0","2","4","6",
"70","94","478","496","568","616","736","784",
"808","0","2","4","6","70","94","478","496",
"568","616","736","784","808","0","2","4",
"6","70","94","478","496","568","616","736",
"784","808","0","2","4","6","70","94","478",
"496","568","616","736","784","808","0","2","4",
"6","70","94","478","496","568","616","736",
"784","808"))
)
So what I tried is a repeated measures ANOVA, on the grounds that I measured the OD over time, so Time is my repeated measures factor (?).
I need to see whether there are statistically significant differences between the treatment groups (e.g. is there a significant difference between 0_CNTRL and 25_TOX?).
Initially I found code that correctly performs the repeated measures ANOVA, but it reports the differences between the time points: it tells me whether there is a difference between Time 4 and Time 6, etc., which is not the question I need answered, and the output is far too scattered.
This is the original code (I followed this guide: https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/#one-way-repeated-measures-anova):
library(tidyverse)
library(ggpubr)
library(rstatix) # ggplot2 is already loaded via tidyverse
## Factors (already factors in the data frame above, but kept for safety)
od34_stat1$Treatment <- as.factor(od34_stat1$Treatment)
od34_stat1$Time <- as.factor(od34_stat1$Time)
# Interaction plot - boxplot
bxp34 <- ggboxplot(od34_stat1, x = "Time", y = "OD", add = "point")
bxp34
## Check assumptions: outliers
od34_stat1 %>%
  group_by(Time) %>%
  identify_outliers(OD)
## Check assumptions: normality
od34_stat1 %>%
  group_by(Time) %>%
  shapiro_test(OD)
# OR
ggqqplot(od34_stat1, "OD", facet.by = "Time")
# Computing one-way repeated measures ANOVA
od34.aov <- anova_test(data = od34_stat1, dv = OD, wid = Treatment, within = Time)
get_anova_table(od34.aov)
# Pairwise comparisons
od34.pwc <- od34_stat1 %>%
  pairwise_t_test(
    OD ~ Time, paired = TRUE,
    p.adjust.method = "bonferroni"
  )
od34.pwc
## Creating report
od34.pwc <- od34.pwc %>% add_xy_position(x = "Time")
bxp34 +
  stat_pvalue_manual(od34.pwc) +
  labs(
    subtitle = get_test_label(od34.aov, detailed = TRUE),
    caption = get_pwc_label(od34.pwc)
  )
Okay, here is my problem: the output now concerns the Time factor. However, the guide uses a dataset with only 3 measurement times, while I measured 13. Moreover, I think the guide's intent is precisely to look at differences over time, while mine is to look at differences between the Treatments whose OD was measured over time.
So, as an R novice, what I thought of is to swap "Time" and "Treatment" in the code. That way my output is just what I need.
My concern is that after swapping these factors the result looks clean but may not make logical sense.
Reviewed code:
# Interaction plot - boxplot
bxp34_1 <- ggboxplot(od34_stat1, x = "Treatment", y = "OD", add = "point")
bxp34_1
## Check assumptions: outliers
od34_stat1 %>%
  group_by(Time) %>%
  identify_outliers(OD)
## Check assumptions: normality
od34_stat1 %>%
  group_by(Treatment) %>%
  shapiro_test(OD)
# OR
ggqqplot(od34_stat1, "OD", facet.by = "Treatment")
# Computing one-way repeated measures ANOVA
od34.aov_1 <- anova_test(data = od34_stat1, dv = OD, wid = Time, within = Treatment)
get_anova_table(od34.aov_1)
# Pairwise comparisons
od34.pwc_1 <- od34_stat1 %>%
  pairwise_t_test(
    OD ~ Treatment, paired = TRUE,
    p.adjust.method = "bonferroni"
  )
od34.pwc_1
## Creating report
od34.pwc_1 <- od34.pwc_1 %>% add_xy_position(x = "Treatment")
bxp34_1 +
  stat_pvalue_manual(od34.pwc_1) +
  labs(
    subtitle = get_test_label(od34.aov_1, detailed = TRUE),
    caption = get_pwc_label(od34.pwc_1)
  )
This way the graphical output (od34.pwc_1) lets me report the statistical significance of the differences between treatments.
I hope I have summarized the doubt correctly. What do you think? Is it valid to do this?
And if it is not correct, what would you recommend for analyzing and visualizing the differences between these treatments?
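For what it's worth, one direction sometimes suggested for this kind of layout (a sketch under stated assumptions, not a definitive answer): with a single OD value per Treatment x Time cell, an additive two-way ANOVA treats Time as a blocking factor, leaves residual degrees of freedom for testing Treatment, and TukeyHSD then gives exactly the pairwise treatment comparisons asked about.
# Sketch: two-way ANOVA with Time as a blocking factor (one OD per cell,
# so no Treatment:Time interaction term can be fitted)
fit <- aov(OD ~ Treatment + Time, data = od34_stat1)
summary(fit)
# Pairwise treatment comparisons, adjusted for multiple testing
TukeyHSD(fit, which = "Treatment")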

Repeated random sampling and kurtosis on unbalanced sample

I have an unbalanced dataset with people from liberal and conservative backgrounds giving ratings on an issue (1-7). I would like to see how polarized the issue is.
The sample is heavily skewed towards liberals (70% of the sample). How do I do repeated sampling in R to create a balanced (50-50) sample and calculate kurtosis?
For example, I have 50 conservatives in total. How do I repeatedly draw a random sample of 50 liberals out of the 150?
A sample dataframe below:
political_ort rating
liberal 1
liberal 6
conservative 5
conservative 3
liberal 7
liberal 3
liberal 1
What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:
# Load library
library(tidyverse)
# Create some 'test' (fake) data
# (data_frame() is deprecated; tibble() is its current replacement)
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    ratings = sample(1:7, size = 100, replace = TRUE))
# Take the fake data
undersampled_df <- sample_df %>%
  # Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>%
  # And randomly sample 30 rows from each category (liberal / conservative)
  sample_n(size = 30, replace = FALSE) %>%
  # Because there are only 30 conservatives in total, they are all included
  # Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()
# You can see the id_numbers aren't in order anymore, indicating the sampling was random
There is also the ROSE package, whose ovun.sample() function can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
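To cover the "repeated" part of the question, the undersampling can be wrapped in replicate() and the kurtosis computed for each balanced sample. A sketch, assuming the e1071 package for its kurtosis() function:
library(e1071) # for kurtosis()
set.seed(42)
kurt_draws <- replicate(1000, {
  balanced <- sample_df %>%
    group_by(political_ort) %>%
    sample_n(size = 30, replace = FALSE) %>%
    ungroup()
  kurtosis(balanced$ratings)
})
mean(kurt_draws) # average kurtosis across 1000 balanced samples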

Clustering using daisy and pam in R

I'm trying to perform a fairly straightforward clustering analysis but can't get the results right. My question, for a large dataset, is: "Which diseases are frequently reported together?" The simplified data sample below should yield 2 clusters: 1) headache / dizziness, 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k = 2) because I know the desired result, but in reality I explore several values of k.
Does anyone know what I'm doing wrong here?
library(cluster)
library(dplyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
                             "headache","abd pain","nausea","headache","dizziness"))
gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs the cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
The format in which you pass the dataset to the clustering algorithm does not match your objective. If you want to group diseases that are reported together but you also include the IDs in your dissimilarity matrix, they contribute to the distance computation, and you do not want that, since your objective concerns only the diseases.
Hence, we need to build a dataset in which each row is a patient with all the diseases he/she reported, and then construct the dissimilarity matrix only on the numeric features. For this task, I add a column presence with value 1 if the disease is reported by the patient and 0 otherwise; the zeros are filled in automatically by pivot_wider().
Here is the code I used; I think it achieves what you wanted, but please tell me if not.
library(cluster)
library(dplyr)
library(tidyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness",
                             "headache","abd pain","nausea","headache","dizziness"),
                  presence = 1)
# Build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
  dat,
  id_cols = ID,
  names_from = PTName,
  values_from = presence,
  values_fill = list(presence = 0)
)
# In the dissimilarity matrix construction, we leave out the ID column
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)
Furthermore, since you are working only with binary data, instead of Gower's distance you could consider the simple matching or Jaccard distance if they suit your data better. In R you can compute them with
p <- ncol(dat_wider) - 1 # p = number of binary (disease) variables
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan") / p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider.
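As a quick sanity check on the result, the cluster medoids can be inspected to see which symptom profiles anchor the two clusters; a small sketch using the objects defined above:
# Rows of dat_wider that are the cluster medoids (representative patients)
dat_wider[pam_fit$id.med, ]
# How many patients fall into each cluster
table(pam_fit$clustering)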

Average probability of drawing categorical variable from differently sized populations

This might be a very simple question. Suppose I have multiple populations of categorical values, as well as a group of 'target' categories, e.g.:
set.seed(500)
pops <- list(
  val1 = c('20','20','10','90','100','30','10','20'),
  val2 = c('20','110','1400','50','40'),
  val3 = c('100','50','30')
)
target <- c('20','100','40')
What would be the average probability of drawing at least one of the target categories from all populations?
I can calculate the frequency distribution of each value, and therefore the chance of getting a specific result:
# Frequency table
p <- table(pops$val1) / length(pops$val1)
# The probability that a single draw is one of the target values
sum(p[which(names(p) %in% target)])
# 0.5
The problem is that this calculation is not independent of sample size, as increasing N obviously increases the probability that at least one of the target categories is present.
Does anyone have an idea how to assess this in a way that is unbiased by sample size?
We can use
sapply(pops, function(x) {
  p <- table(x) / length(x)
  sum(p[which(names(p) %in% target)])
})
Or, using the tidyverse:
library(tidyverse)
stack(pops) %>%
  group_by(ind) %>%
  mutate(n1 = n()) %>%
  group_by(values, .add = TRUE) %>% # .add replaces the deprecated add argument
  summarise(perc = n() / n1[1]) %>%
  filter(values %in% target) %>%
  summarise(perc = sum(perc))
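If the quantity of interest is instead the probability of seeing at least one target value in n independent draws (with replacement), 1 - (1 - p)^n makes the dependence on the number of draws explicit; the single-draw probability p itself is the size-independent quantity on which to compare populations. A sketch:
# Single-draw probability of hitting a target value, per population
p_hit <- sapply(pops, function(x) {
  p <- table(x) / length(x)
  sum(p[names(p) %in% target])
})
# Probability of at least one hit in n draws with replacement
n <- 3
1 - (1 - p_hit)^n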

Creating a summary table with comparisons of two groups in R

I frequently want to create summary tables for studies where I compare several variables between two groups, listing the values of each variable for both groups along with the difference between them.
For example, say I want to compare age groups (young and old) and the proportion of males between two groups, A and B. I'd like to end up with a table that has a row for each variable (age, proportion of males) and, repeated for each group, columns for the numerator, denominator, and rate, plus the difference between the two rates, its 95% CI, and the p-value from a chi-squared test.
I'm looking for a general approach to this type of table.
Let's say I have the following data:
library(dplyr)
AgeGroup <- sample(c("Young", "Old"), 10, replace = TRUE)
Gender <- sample(c("Male", "Female"), 10, replace = TRUE)
df <- data.frame(AgeGroup, Gender)
df
I can create a summary table without the comparison easily:
df1 <- df %>%
  group_by(AgeGroup) %>%
  summarise(num_M = sum(Gender == "Male"),
            den_M = n(),
            prop_M = num_M / den_M)
df1
But I can't figure out how to add columns comparing the different rows of grouped data. Let's say I want to run a chi-squared test on the proportion of males in each AgeGroup and add the p-value to the summary table above.
It would look like this (the numbers, obviously, are examples; Y = Young, O = Old):
Any gentle nudges in the right direction would be greatly appreciated.
Thanks!
I like the finalfit package for summary tables. If you need to add custom summary functions, it might not be flexible enough, but its default stats cover everything you've asked for in your example, e.g. numbers in each group, proportions, and a chi-squared test. If you have continuous variables it will calculate means and SDs in each group.
library(finalfit)
finalfit::summary_factorlist(
  df,
  dependent = "Gender",
  explanatory = "AgeGroup",
  total_col = TRUE,
  p = TRUE
)
Output:
label levels Female Male Total p
1 AgeGroup Old 0 (0.0) 6 (100.0) 6 0.197
2 Young 1 (25.0) 3 (75.0) 4
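If the rate difference and its 95% CI are also needed (summary_factorlist does not print those by default), here is a sketch using base R's prop.test on the same df; note that with counts this small, prop.test will warn that the chi-squared approximation may be inaccurate:
num <- tapply(df$Gender == "Male", df$AgeGroup, sum)    # males per age group
den <- tapply(df$Gender == "Male", df$AgeGroup, length) # group sizes
pt <- prop.test(num, den)
pt$estimate # proportion of males in each group
pt$conf.int # 95% CI for the difference in proportions
pt$p.value  # chi-squared p-value (with continuity correction)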
