I am trying to build a table with sjplot and the tab_df function but fail to get the label names in the table:
library(sjPlot)
library(stargazer)
Region<-c("Berlin", "Hamburg", "Berchtesgarden")
Sensor<-c("Riegl ", "Riegl ", "Riegl ")
Platform<-c("Aircraft", "UAV", "Helicoper")
Acquisition_Year <-c("2002", "2002", "2002")
Month<-c("August", "September", "July")
Flight_Height<-c("400-600m AGL","400-600m AGL","400-600m AGL")
LidarAcq<-as.data.frame(cbind(Region, Sensor, Platform, Acquisition_Year, Month, Flight_Height))
LidarAcq
attr(LidarAcq$Region, "label") <- "Region"
attr(LidarAcq$Sensor, "label") <- "Sensor"
attr(LidarAcq$Platform, "label") <- "Platform"
attr(LidarAcq$Acquisition_Year, "label") <- "Acquisition Year"
attr(LidarAcq$Month, "label") <- "Month"
attr(LidarAcq$Flight_Height, "label") <- "Flight Height"
tab_df(LidarAcq, title = "Lidar Acquisition Parameters" , file= "~/LidarAcq.doc")
As I understand, sjplot is supposed to automatically recognise the labels...
So what am I doing wrong?
I also cannot get sjplot to recognize the label for each column. You can use col.header and a character vector ("labels") to name your columns instead.
labels <- c("Region", "Sensor", "Platform", "Acquisition Value", "Month", "Flight Height")
tab_df(LidarAcq,
title = "Lidar Acquisition Parameters" ,
col.header = labels,
file= "~/LidarAcq.doc")
Related
below is a simple example for stacked bar plot with r and plotly.
By default, the stacks are alphabetly ordered.
How can I determine other order?
library(plotly)
bars <- c("bar1", "bar1", "bar2", "bar2")
category <- c("(order1)small value", "(order2)big value", "(order1)small value", "(order2)big value")
#category <- c("(order2)small value", "(order1)big value", "(order2)small value", "(order1)big value")
count <- c(3, 6, 2, 7)
data <- data.frame(bars, category, count)
#View(data)
p <- plot_ly(data, x = ~bars, y = ~count, color = ~category) %>%
layout(width = 460, height = 230, barmode = 'stack' )
p
E.g. I would like to be able to determine, that the category:
"(order1)small value" to be on the top. Now it is on the bottom, because alphabetly it is before the other category.
Thank you very much in advance
Add this line before creating you data. With category as a factor you can controle the order of each level.
category <- factor(category, c("(order2)big value", "(order1)small value"))
Thnak you very much. It works! The following after creating my data should also work.
data$category <- factor(data$category, c( "(order2)big value", "(order1)small value" ) )
I have a text file with words from historical accounts and I want to visualise the species and frequency of words associated with them.
So far I have tried using the following code with a txt file of all the historical documents in one doc but want to ask if there is specific formatting of a csv to then input into R for a bipartite network graph:
"""library(ggraph)
library(ggplot2)
library(dplyr)
library(pdftools)
library(tm)
library(readtext)
library(tidytext)
library(igraph)
library(tidyr)
library(FactoMineR)
library(factoextra)
library(flextable)
library(GGally)
library(ggdendro)
library(network)
library(Matrix)
library(quanteda)
library(stringr)
library(quanteda.textstats)
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=1000)
# Read in text--------
wordbase <- readtext("mq_bird_stories.txt")
# List of extra words to remove---------
extrawords <- c("the", "can", "get", "Ccchants", "make", "making", "house", "torn", "tree", "man", "however", "upon", "instructs", "wife", "coming","without", "mother", "versions","variant", "version", "thus", "got","throws", "are", "has", "already", "asks", "sacra", "can", "brings", "one", "look", "sees", "tonaheiee", "wants", "later",
"dont", "even", "may", "but", "will", "turn", "sing", "swallows", "alba", "gives", "find", "other","tonaheieee", "away","day","comes","another",
"much", "first", "but", "see", "new", "back","goes", "go","songs", "returns", "take","takes","come",
"many", "less", "now", "well", "taught", "like", "puts", "slits", "sends", "tell","tells","open","mentions",
"often", "every", "said", "two", "and", "handsome", "husband", "bring", "lives","gets", "von", "den", "steinen", "handy")
# Clean the data-------
darwin <- wordbase %>%
paste0(collapse = " ") %>%
stringr::str_squish() %>%
stringr::str_remove_all("\\(") %>%
stringr::str_remove_all("\\)") %>%
stringr::str_remove_all("!") %>%
stringr::str_remove_all(",") %>%
stringr::str_remove_all(";") %>%
stringr::str_remove_all("\\?") %>%
stringr::str_split(fixed(".")) %>%
unlist() %>%
tm :: removeWords(extrawords) %>%
paste0(collapse = " ")
# One method for calculating frequencies of bigrams------
# Process into a table of words
darwin_split <- darwin %>%
as_tibble() %>%
tidytext::unnest_tokens(words, value)
# Create data frame of bigrams-------
darwin_words <- darwin_split %>%
dplyr::rename(word1 = words) %>%
dplyr::mutate(word2 = c(word1[2:length(word1)], NA)) %>%
na.omit()
# Calculate frequency of bigrams-----
darwin2grams <- darwin_words %>%
dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
dplyr::group_by(bigram) %>%
dplyr::summarise(frequency = n()) %>%
dplyr::arrange(-frequency)
# Define stopwords
stps <- paste0(tm::stopwords(kind = "en"), collapse = "\\b|\\b")
# Remove stopwords from bigram table
darwin2grams_clean <- darwin2grams %>%
dplyr::filter(!str_detect(bigram, stps))
# Another method for calculating frequencies of bigrams
# Clean corpus
darwin_clean <- darwin %>%
stringr::str_to_title()
# Tokenize corpus----
darwin_tokzd <- quanteda::tokens(darwin_clean)
# Extract bigrams------
BiGrams <- darwin_tokzd %>%
quanteda::tokens_remove(stopwords("en")) %>%
quanteda::tokens_select(pattern = "^[A-Z]",
valuetype = "regex",
case_insensitive = FALSE,
padding = TRUE) %>%
quanteda.textstats::textstat_collocations(min_count = 1, tolower = FALSE)
# read in and process text
darwinsentences <- darwin %>%
stringr::str_squish() %>%
tokenizers::tokenize_sentences(.) %>%
unlist() %>%
stringr::str_remove_all("- ") %>%
stringr::str_replace_all("\\W", " ") %>%
stringr::str_squish()
# inspect data
head(darwinsentences)
darwincorpus <- Corpus(VectorSource(darwinsentences))
# clean corpus-----
darwincorpusclean <- darwincorpus %>%
tm::tm_map(removeNumbers) %>%
tm::tm_map(tolower) %>%
tm::tm_map(removeWords, stopwords()) %>%
tm::tm_map(removeWords, extrawords)
# create document term matrix
darwindtm <- DocumentTermMatrix(darwincorpusclean, control=list(bounds = list(global=c(1, Inf)), weighting = weightBin))
# convert dtm into sparse matrix
darwinsdtm <- Matrix::sparseMatrix(i = darwindtm$i, j = darwindtm$j,
x = darwindtm$v,
dims = c(darwindtm$nrow, darwindtm$ncol),
dimnames = dimnames(darwindtm))
# calculate co-occurrence counts
coocurrences <- t(darwinsdtm) %*% darwinsdtm
# convert into matrix
collocates <- as.matrix(coocurrences)
# inspect size of matrix
ncol(collocates)
#provide some summary stats
summary(rowSums(collocates))
#visualising collocations
# load function for co-occurrence calculation
source("https://slcladal.github.io/rscripts/calculateCoocStatistics.R")
# define term
coocTerm <- "pigeon"
# calculate co-occurrence statistics
coocs <- calculateCoocStatistics(coocTerm, darwinsdtm, measure="LOGLIK")
# inspect results
coocs[1:50]
coocdf <- coocs %>%
as.data.frame() %>%
dplyr::mutate(CollStrength = coocs,
Term = names(coocs)) %>%
dplyr::filter(CollStrength > 0)
###Make graph - visualize association strengths------
ggplot(coocdf, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
geom_point() +
coord_flip() +
theme_bw() +
labs(y = "")
##network
net = network::network(collocates_redux,
directed = FALSE,
ignore.eval = FALSE,
names.eval = "weights")
# vertex names
network.vertex.names(net) = rownames(collocates_redux)
# inspect object
net
ggnet2(net,label = TRUE,
label.size = 4,
alpha = 0.2,
size.cut = 3,
edge.alpha = 0.3) +
guides(color = FALSE, size = FALSE)"""
I'd suggest taking a look at the netCoin package. If you can transform your data into nodes and links data frames, then you can easily get a high quality network visualization:
#Example of links data frame
links <-
data.frame(
matrix(
c(
"Person A","Account 1", "not link",
"Person A","Account 2", "link",
"Person B","Account 2", "link",
"Person B","Account 3", "not link",
"Person B","Account 4", "link",
"Person C","Account 4", "link"
),
nrow = 6,
ncol = 3,
byrow = TRUE,
dimnames = list(NULL,
c("Source", "Target", "other_links_column"))
),
stringsAsFactors = FALSE
)
#Example of nodes data frame
nodes <-
data.frame(
matrix(
c(
"Person A","person",
"Person B","person",
"Person C","person",
"Account 1", "account",
"Account 2", "account",
"Account 3", "account",
"Account 4", "account"
),
nrow = 7,
ncol = 2,
byrow = TRUE,
dimnames = list(NULL,
c("name", "other_nodes_column"))
),
stringsAsFactors = FALSE
)
install.packages("netCoin") #may need to install the netCoin package
library(netCoin)
?netCoin #displays netCoin Help to see all the function options
graph_df <- netCoin(nodes = nodes, #Data frame of unique nodes and their attributes #Must contain name column
links = links, #Data frame of links and their attributes #Must contain Source and Target columns
cex = 1.25, #Font size
color = "other_nodes_column", #Column in node data frame to determine node color
shape = "other_nodes_column", #Column in node data frame to determine node shape
main = "This is the title of my visualization", #Visualization title
controls = 1:5, #Controls that will be shown in the visualization (maximum of 5)
dir = "folder-with-viz-output") #Output folder for the visualization #Entire folder should be exported as a zip file
plot(graph_df) #Command to display the visualization
I'm using the table1 package to create summary statistics. My current table1() looks like this (code below):
I want to add another section, with a labeled row in bold called "Co-occurring disorder", that is not referring to a specific variable. I want two un-bolded rows after it refers to just the number of TRUE in two distinct variables: "Mental Health" for one row and "Substance use" for another. For example, if 12 people have TRUE for "mental health" and 7 people have TRUE for "substance use," the following row would start like this:
Alternatively, how do I add a blank row to the table?
My current code is pasted below.
library(table1)
opp2$SEX <-
factor(opp2$SEX, levels=c(1,2),
labels=c("Male", "Female"))
opp2$REGION <-
factor(opp2$REGION, levels=c(1:5),
labels = c("Northeast", "North Central", "South", "West", "Unknown"))
opp2$GS <- factor(opp2$GS, levels = c(1:5),
labels = c("Male", "Female",
"Transmasculine", "Transfeminine", "Unknown"))
#units(opp2$AGE) <- "years"
labels <- list(variables=list(SEX="Sex",
AGE="Age (years)",
REGION="Region"),
groups=list("", "Cis", "TGM"))
strata <- c(list(Total=opp2), split(opp2, opp2$GS))
my.render.cont <- function(x) {
with(stats.apply.rounding(stats.default(x), digits=2),
c("",
"Median (IQR)"=sprintf("%s (± %s)", MEDIAN, IQR)))
}
my.render.cat <- function(x) {
c("", sapply(stats.default(x),
function(y) with(y, sprintf("%d (%0.0f %%)", FREQ, PCT))))
}
table1(strata, labels, groupspan=c(1,2, 3),
render.continuous=my.render.cont, render.categorical=my.render.cat)`
I'm just getting started in R and I'm trying to wrap my head around barplot for a university assignment. Specifically, I am using the General Social Survey 2018 dataset (for codebook: https://www.thearda.com/Archive/Files/Codebooks/GSS2018_CB.asp) and I am trying to figure out if religion has any effect on the way people seek out help for mental health. I want to use reliten (self-assessment of religiousness - from strong to no religion) as the IV and tlkclrgy, (asks if a person with mental health issues should reach out to a religious leader - yes or no) as the DV. For a better visualization of the data, I want to create a side-by-side barplot with reliten on the x-axis and see how many people answered yes and no on tlkclrgy. My problem is that on the barplot I get numbers instead of categories (from strong to no religion). This is what I tried, but I keep getting NA on the x-axis:
GSS$reliten <- factor(as.character(GSS$reliten),
levels = c("No religion", "Somewhat
strong", "Not very strong",
"Strong"))
GSS <- GSS18[!GSS18$tlkclrgy %in% c(0, 8, 9),]
GSS$reliten <- as_factor(GSS$reliten)
GSS$tlkclrgy <- as_factor(GSS$tlkclrgy)
ggplot(data=GSS,mapping=aes(x=reliten,fill=tlkclrgy))+
geom_bar(position="dodge")
Does anybody have any tips?
Here is complete code to download the codebook and data, table the two columns of interest and plot the frequencies.
1. Read the data
Data will be downloaded to a temporary directory, to keep my disk palatable. Use of these first two instructions is optional
od <- getwd()
setwd("~/Temp")
These are the links to the two files that need to be read and the filenames.
cols_url <- "https://osf.io/ydxu4/download"
cols_file <- "General Social Survey, 2018.col"
data_url <- "https://osf.io/e76rv/download"
data_file <- "General Social Survey, 2018.dat"
download.file(cols_url, cols_file, mode = "wb")
download.file(data_url, data_file, mode = "wb")
Now read in the codebook and process it, extracting the column widths and column names.
cols <- readLines(cols_file)
cols <- strsplit(cols, ": ")
widths_char <- sapply(cols, '[', 2)
i_widths <- grepl("-", widths_char)
f <- function(x) -eval(parse(text = x)) + 1L
widths <- rep(1L, length(widths_char))
widths[i_widths] <- f(widths[i_widths])
col_names <- sapply(cols, '[', 1)
col_names <- trimws(sub("^.[^ ]* ", "", col_names))
col_names <- tolower(col_names)
Finally, read the fixed width text file.
df1 <- read.fwf(data_file, widths = widths, header = FALSE, na.strings = "-", col.names = col_names)
2. Table the data
Find out where are the two columns we want with grep.
i_cols <- c(
grep("reliten", col_names, ignore.case = TRUE),
grep("tlkclrgy", col_names, ignore.case = TRUE)
)
head(df1[i_cols])
Table those columns and coerce to data.frame. Then coerce the columns to factor.
Here there is a problem, there is no answer 3 for tlkclrgy in the published survey but there are answers 3 in the data file. So I have created an extra factor level.
GSS <- as.data.frame(table(df1[i_cols]))
labels_reliten <- c(
"Not applicable",
"Strong",
"Not very strong",
"Somewhat Strong",
"No religion",
"Don't know",
"No answer"
)
levels_reliten <- c(0, 1, 2, 3, 4, 8, 9)
labels_tlkclrgy <- c(
"Not applicable",
"Yes",
"No",
"Not in codebook",
"Don't know",
"No answer"
)
levels_tlkclrgy <- c(0, 1, 2, 3, 8, 9)
GSS$reliten <- factor(
GSS$reliten,
labels = labels_reliten,
levels = levels_reliten
)
GSS$tlkclrgy <- factor(
GSS$tlkclrgy,
labels = labels_tlkclrgy,
levels = levels_tlkclrgy
)
3. Plot the frequencies table
library(ggplot2)
ggplot(data = GSS, mapping = aes(x = reliten, y = Freq, fill = tlkclrgy)) +
geom_col(position = "dodge")
I'm trying to classify a data frame of customer reviews into the respective categories. For example,
x <- data.frame(Reviews = c("The phone performance and display is good","Worth the money","Camera is good"))
The desired output is as below image
I tried creating a dictionary as below using R's Quanteda package
dic <- dictionary(list(camera = c("camera","lens","pixel", "pictures",
"pixels", "snap"), display = c("resolution", "display", "depth", "mode",
"color", "colour", "discolour"), performance = c("performance", "speed",
"usage", "fast", "run", "running", "lag", "processor", "shut", "shut down",
"restart", "hanging","hang"), Value = c("money", "worth", "budget", "value",
"price", "specs", "specifications", "invest",
"under","expectations","expected","expecting","expect")))
I would like to classify the texts based on keywords as stated above. Please help
P.S : dfm is one option. But particularly, I would like to know how to classify a data frame of texts as per the desired output.
Using already most of your code:
# Creating a DFM and saving the Reviews in a Vector
require("quanteda")
x <- dfm( Reviews <- c(
"The phone performance and display is good",
"Worth the money",
"Camera is good"),
tolower = TRUE)
I converted the capital letters to lowercase, otherwise, the fixed comparison would not work. Further, I recommend stopword removal and some kind of steaming.
# Creating the dictionary
dic <- dictionary(list(camera = c("camera","lens","pixel", "pictures", "pixels", "snap"),
display = c("resolution", "display", "depth", "mode", "color", "colour", "discolour"),
performance = c("performance", "speed", "usage", "fast", "run", "running", "lag", "processor", "shut", "shut down", "restart", "hanging","hang"),
Value = c("money", "worth", "budget", "value", "price", "specs", "specifications", "invest", "under","expectations","expected","expecting","expect")))
Using the dfm_lookup function:
# fixed parameter fof exact matching
res <- dfm_lookup(x, dic, valuetype = "fixed")
row.names(res)<- Reviews
res
Hope this is what you are looking for :)