Repeated random sampling and kurtosis on an unbalanced sample in R

I have an unbalanced dataset with people from liberal and conservative backgrounds giving a rating on an issue (1-7), and I would like to see how polarized the issue is.
The sample is heavily skewed towards liberals (70% of the sample). How do I do repeated sampling in R to create a balanced (50-50) sample and calculate kurtosis?
For example, I have 50 conservatives in total. How do I repeatedly draw a random sample of 50 liberals out of the 150?
A sample dataframe below:
political_ort rating
liberal            1
liberal            6
conservative       5
conservative       3
liberal            7
liberal            3
liberal            1

What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:
# Load library
library(tidyverse)

# Create some 'test' (fake) data
# (tibble() replaces the now-deprecated data_frame())
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    ratings = sample(1:7, size = 100, replace = TRUE))

# Take the fake data
undersampled_df <- sample_df %>%
  # Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>%
  # And randomly sample 30 rows from each category (liberal / conservative)
  sample_n(size = 30, replace = FALSE) %>%
  # Because there are only 30 conservatives in total, they are all included
  # Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()

# You can see the id_numbers aren't in order anymore, indicating the sampling was random
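The question also asks for repeated sampling and the kurtosis of the resulting rating distribution. A minimal sketch of that part, assuming the moments package for the kurtosis calculation (e1071::kurtosis would work just as well):

# Repeat the undersampling many times, computing kurtosis each time
library(moments)
set.seed(42) # for reproducibility

kurtosis_draws <- replicate(1000, {
  balanced <- sample_df %>%
    group_by(political_ort) %>%
    sample_n(size = 30, replace = FALSE) %>%
    ungroup()
  kurtosis(balanced$ratings)
})

# Distribution of kurtosis across the 1000 balanced resamples
summary(kurtosis_draws)
hist(kurtosis_draws)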
There is also the ROSE package, which has a function (ovun.sample) that can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
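A hedged sketch of how that might look here (ovun.sample() expects a data.frame and treats the formula's left-hand side as the class variable; N is the total size of the data after undersampling):

library(ROSE)
set.seed(42)

rose_df <- as.data.frame(sample_df)
rose_df$political_ort <- as.factor(rose_df$political_ort) # the class variable may need to be a factor
under_df <- ovun.sample(political_ort ~ ., data = rose_df,
                        method = "under", N = 60)$data # N = 60 keeps 30 per class here
table(under_df$political_ort)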


T-test after running sentiment analysis

I am trying to run a t-test after doing sentiment analysis.
I ran the sentiment analysis and grouped my data into two parts:
library(dplyr)
library(tidytext) # unnest_tokens() and get_sentiments() come from tidytext
library(textdata)

afinn_dictionary <- get_sentiments("afinn")

news_tokenized <- full_data %>%
  unnest_tokens(word, full_article, to_lower = TRUE)

head(news_tokenized$word, 10)
full_data$full_article[2]

word_counts_senti <- news_tokenized %>%
  inner_join(afinn_dictionary)

head(word_counts_senti)

news_senti <- word_counts_senti %>%
  group_by(partisan_media) %>% # group by partisan media
  summarize(sentiment = sum(value))

head(news_senti)

# As a result, I got: c(1): -13194, c(2): -12321. Both groups were negative,
# but group 1's stories tend to use more negative words (have greater
# negative sentiment).

table(full_data$partisan_media) # there are 1866 articles in group 1 and 2174 articles in group 2
I am trying to see whether the difference between groups 1 and 2 (two groups of partisan media) is statistically significant by running a t-test.
I'm using:
g1_senti <- rnorm(1866, mean = -7.07074, sd = )  # group 1
g2_senti <- rnorm(2174, mean = -5.667433, sd = ) # group 2
t.test(g1_senti, g2_senti)
The means come from the "sentiment score of a group" divided by the "number of articles in that group".
But I wasn't sure what should be entered for the sd argument. Does anyone have an idea about this?
I am adding my data set here: https://www.mediafire.com/file/uei2e3tajvi7wao/eight.csv/file
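As a side note: a minimal sketch of how the group standard deviations could be computed directly from per-article sentiment scores rather than guessed. The article identifier column (article_id) is hypothetical; substitute whatever uniquely identifies an article in full_data:

# Score each article first, then summarize by group (article_id is hypothetical)
article_senti <- word_counts_senti %>%
  group_by(partisan_media, article_id) %>%
  summarize(article_score = sum(value), .groups = "drop")

article_senti %>%
  group_by(partisan_media) %>%
  summarize(mean_score = mean(article_score),
            sd_score = sd(article_score))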

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
library(dplyr) # for %>% and arrange()

data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE
)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, aic) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
You may be able to use the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them into fewer groups (e.g. folds), with attempted balancing of multiple categorical columns, numeric columns, and factor columns (by number of unique levels, though the same levels might appear in multiple groups).
It does not have the constraints that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
          4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
             0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups into folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in the others (the count is normally only of levels unique within each group, but here we know each bird appears in only one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is reasonably well balanced across the folds.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances(), as in the sketch below.
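A sketch of that tuning idea (assuming groupdata2 >= 2.0.0, and that the extra columns are named by suffixing col_name, i.e. .folds_1 through .folds_10):

# Create 10 candidate fold columns and compare their balance
data_multi <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10, # 10 alternative collapsings
  auto_tune = TRUE         # search harder for well-balanced groupings
)

# Summarize all candidates and pick the best-balanced column
groupdata2::summarize_balances(
  data = data_multi,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)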
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

Looking for analysis that clusters like SIMPROF, but allows for many observations per category

I need to run a clustering or similarity analysis on some biological data, and I am looking for an output like the one SIMPROF gives, i.e. a dendrogram or hierarchical clustering.
However, I have 3200 observations/rows per group, and SIMPROF seems to expect only one observation per group (a US state in the example below):
library(clustsig)

usarrests <- USArrests[, c(1, 2, 4)]
rownames(usarrests) <- state.abb

# Run simprof on the data
res <- simprof(data = usarrests,
               method.distance = "braycurtis")

# Graph the result
pl.color <- simprof.plot(res)
Now, again, my biological data (140k rows total) has about 3200 observations per group.
I am trying to cluster together the groups that have a similar representation in the variables provided.
It is as if, in the example above, AK were represented by more than one observation.
What's my best bet for a function/package/analysis?
Cheers,
Mo
Example from a paper: [figure omitted]
The solution became obvious upon further reflection.
Instead of using all observations (200k) in the long format, I combined longitude and depth of sampling into one variable, used like sampling units along a transect. That left me with 3800 columns of longitude-depth combinations and 61 rows of taxa, with the value variable being the abundance of each taxon (if you want to cluster the sampling units instead, transpose the data frame; see the short sketch after the code below). This is feasible for hclust or SIMPROF, since the quadratic complexity now applies to only 61 rows (as opposed to the ~200k I tried at the beginning).
Cheers
Here is some code:
library(reshape2)
library(dplyr)

d4 <- d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))

# Make one variable out of longitude and depth that can be used for all taxa
# measured, like community-ecology sampling units
d4$sampling_units <- paste(d4$LONGITUDE_DEC, d4$BIN_MIDDEPTH_M)

d5 <- d4 %>% select(PREDICTED_GROUP, CONCENTRATION_IND_M3, sampling_units)
d5 <- d5 %>% na.omit()

# dcast the data frame so that you get the taxa as rows, the sampling units as
# columns, and the concentration/abundance as values
d6 <- dcast(d5, PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7 <- d6 %>% na.omit()
d7$PREDICTED_GROUP <- as.factor(d7$PREDICTED_GROUP)

# Give the rownames the taxa names
rownames(d7) <- paste(d7$PREDICTED_GROUP)

# Delete that variable, which is no longer needed
d7$PREDICTED_GROUP <- NULL

library(vegan)

# Calculate the dissimilarity matrix with vegdist so you can use the
# Sorensen/Bray method
distBray <- vegdist(d7, method = "bray")

# Calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1

# Plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)

dend <- clust1 %>% as.dendrogram() %>%
  set("branches_k_color", k = 5) %>%
  set("branches_lwd", 0.5) %>%
  set("clear_leaves") %>%
  set("labels_colors", k = 5) %>%
  set("leaves_cex", 0.5) %>%
  set("labels_cex", 0.5)

ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
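As mentioned above, if the goal were instead to cluster the sampling units, the same pipeline applies to the transposed matrix; a small sketch:

# Cluster sampling units (columns) instead of taxa (rows)
distBray_units <- vegdist(t(d7), method = "bray")
clust_units <- hclust(distBray_units, method = "ward.D2")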

In R, creating a summary table with comparisons of two groups

I frequently want to create summary tables for studies where I compare several variables between two groups, listing values for each variable along with the difference between that variable for the two groups.
For example, say I want to compare age groups (young and old) and the proportion of males between two groups, A and B. I'd like to end up with a table with a row for each variable (age, proportion of males) and, repeated for each group, columns for the numerator, denominator, rate, difference between the two rates, 95% CI, and p-value from a chi-square test.
I’m looking for a general approach to this type of table.
Let’s say I have the following table:
library(dplyr)
AgeGroup <- sample(c("Young", "Old"), 10, replace = TRUE)
Gender <- sample(c("Male", "Female"), 10, replace = TRUE)
df <- data.frame(AgeGroup, Gender)
df
I can create a summary table without the comparison easily:
df1 <- df %>%
  group_by(AgeGroup) %>%
  summarise(num_M = sum(Gender == "Male"),
            den_M = n(),
            prop_M = num_M / den_M)
df1
But I can't figure out how to create additional columns of comparisons between the different rows of grouped data. Let's say I want to run a chi-squared test on the proportion of males in each AgeGroup and add the p-value to the summary table above.
It would look like this (the numbers, obviously, are examples; Y = Young, O = Old): [desired output table omitted]
Any gentle nudges in the right direction would be greatly appreciated.
Thanks!
I like the finalfit package for summary tables. If you need to add custom summary functions, it might not be flexible enough, but its default stats cover everything you've asked for in your example, e.g. numbers in each group, proportions, and a chi-squared test. If you have continuous variables it will calculate means and SDs in each group.
library(finalfit)

finalfit::summary_factorlist(
  df,
  dependent = "Gender",
  explanatory = "AgeGroup",
  total_col = TRUE,
  p = TRUE
)
Output:
     label levels   Female      Male Total     p
1 AgeGroup    Old  0 (0.0) 6 (100.0)     6 0.197
2           Young 1 (25.0)  3 (75.0)     4
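And if the data had a continuous variable, the same call would report mean (SD) per group, as noted above. A small sketch, where the Age column is hypothetical and added only for illustration:

# Hypothetical continuous variable
df$Age <- rnorm(10, mean = 50, sd = 10)

finalfit::summary_factorlist(
  df,
  dependent = "Gender",
  explanatory = c("AgeGroup", "Age"),
  total_col = TRUE,
  p = TRUE
)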

Generate a crude incidence rate table (stratified by a factor variable) from a Lexis Model

I am using the 'Epi' package in R to model follow-up data from a study.
I am having no issues with declaring the Lexis model or running Poisson and (combined with the survival package) Cox regressions.
As part of the initial data review, I want to find a simple way to make a table of crude, unadjusted incidence/event rates from the data in a Lexis model in R (before fitting any Poisson/Cox models).
I have found a coded approach which allows me to do this and to stratify by a variable as part of exploratory data analysis:
# Generic syntax example
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum),
                tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum))
# Adds up the number of events and the amount of follow-up time
# within each level of the stratifying variable

rates <- tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum) /
  tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum) * 10^3
# Gives rates per 1,000 person-years

ratetable <- cbind(totals, rates)

# Specific example based on the dataset
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$grade, sum),
                tapply(lexis_model$lex.dur, lexis_model$grade, sum))
rates <- tapply(lexis_model$lex.Xst, lexis_model$grade, sum) /
  tapply(lexis_model$lex.dur, lexis_model$grade, sum) * 10^3
ratetable <- cbind(totals, rates)
ratetable

#        rates
# 1 90 20338.234 4.4251630
# 2 64  7265.065 8.8092811
# Shows the number of events, years of follow-up, and number of events per
# 1,000 years of follow-up, stratified by the stratifying variable
Note these are crude unadjusted/absolute rates, not the output of a Poisson model. Whilst I appreciate that the code above does indeed produce the desired output (and is pretty straightforward), I wanted to see if people were aware of a command which can take a Lexis dataset and output this. I've had a look at the available commands in the Epi and epitools packages; I may have missed something, but I could not see an obvious way to do this.
As this is quite a common thing to want to do, I wondered if anyone was aware of a package/function that could do it by simply specifying the Lexis dataset and the stratification variable (or indeed a single function that does the steps above in one go).
Ideally the output would look something like the following, taken from STATA (which I am trying to move away from in favour of R!): [example STATA output omitted]
A copy of the first twenty rows or so of the actual data is here (the data has already been put into a Lexis model using the Epi package, so all relevant Lexis variables are there):
https://www.dropbox.com/s/yjyz1kzusysz941/rate_table_data.xlsx?dl=0
I would do this simply using the tidyverse R package as such:
library(tidyverse)

lexis_model %>%
  group_by(grade) %>%
  summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
  mutate(rate = sum_Xst / sum_dur * 10^3) -> rateable

rateable
# A tibble: 2 x 4
# grade sum_Xst sum_dur rate
# <dbl> <int> <dbl> <dbl>
# 1 1 2 375.24709 5.329821
# 2 2 0 92.44079 0.000000
And you could wrap this into a function yourself:
rateFunc <- function(data, strat_var) {
  # group_by_() is deprecated; the .data pronoun handles a column name
  # passed as a string (and the data argument is now actually used)
  data %>%
    group_by(.data[[strat_var]]) %>%
    summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
    mutate(rate = sum_Xst / sum_dur * 10^3)
}
which you would then call:
rateFunc(lexis_model, "grade")
This is useful because, using the combination of tidyverse summarise and mutate it is very easy to add more summary statistics to the table.
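For instance, a hedged sketch of adding a large-sample 95% confidence interval for each rate (assuming event counts are roughly Poisson, so the SE of the rate is sqrt(events)/person-time):

lexis_model %>%
  group_by(grade) %>%
  summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
  mutate(rate = sum_Xst / sum_dur * 10^3,
         se = sqrt(sum_Xst) / sum_dur * 10^3, # Poisson SE on the rate scale
         rate_lo = rate - 1.96 * se,
         rate_hi = rate + 1.96 * se)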
EDIT:
After clarification of the question: this can be done using the rate() function from the popEpi package:
popEpi::rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = grade)
# Crude rates and 95% confidence intervals:
# grade lex.Xst lex.dur rate SE.rate rate.lo rate.hi
# 1: 1 2 375.2472 0.00532982 0.003768752 0.001332942 0.0213115
# 2: 2 0 92.4408 0.00000000 0.000000000 0.000000000 NaN
