Visualize blinks in conversation - r

I have conversational data combined with pupillometry data showing when people blink, like so (reproducible data below):
df
# A tibble: 6 x 8
# Groups:   Blinks_grp [6]
  Speaker Utterance  Starttime_ms Endtime_ms Blink_onset Blink_offset Blinks_grp Blink_dur
  <chr>   <chr>             <int>      <dbl>       <dbl>        <dbl>      <dbl>     <dbl>
1 ID16.B  an Americ…       289569     293940      289879       289946        113        67
2 ID16.B  an Americ…       289569     293940      290696       290879        114       183
3 ID16.B  an Americ…       289569     293940      290962       291046        115        84
4 ID16.A  [°gotcha°]       290604     291004      290696       290879        116       183
5 ID16.B  =↓my fath…       300938     302140      301529       301612        117        83
6 ID16.B  =↓my fath…       300938     302140      302062       302146        118        78
I want to visualize when blinks occur relative to speech (in column Utterance). So far I've come up with this code:
df %>%
  mutate(Utterance = paste0(sub(".*(.)$", "\\1", Speaker), ": ", Utterance),
         Utterance = factor(Utterance, levels = unique(Utterance))) %>%
  ggplot(aes(x = Blink_onset, xend = Blink_offset,
             y = Blinks_grp, yend = Blinks_grp)) +
  geom_segment(size = 3) +
  facet_wrap(~ Utterance, ncol = 1, scales = "free_x")
which produces this graph:
However, the graph does not make the relationship between Utterance and blinks clear enough:
it fails to take into account the duration of the Utterance
it fails to indicate whether blinks occur (partially) before the start or after the end of an Utterance
if more than one blink occurs during a single Utterance, it puts those blinks on different lines
So what I'm looking for is a visualization that shows both blinks and Utterances and makes clear where blinks occur relative to the Utterance. What I have in mind looks somewhat like this:
Reproducible data:
structure(list(Speaker = c("ID16.B", "ID16.B", "ID16.B", "ID16.A",
"ID16.B", "ID16.B"), Utterance = c("an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"[°gotcha°]", "=↓my father's↓ like ~°we're going to Germany°~",
"=↓my father's↓ like ~°we're going to Germany°~"), Starttime_ms = c(289569L,
289569L, 289569L, 290604L, 300938L, 300938L), Endtime_ms = c(293940,
293940, 293940, 291004, 302140, 302140), Blink_onset = c(289879,
290696, 290962, 290696, 301529, 302062), Blink_offset = c(289946,
290879, 291046, 290879, 301612, 302146), Blinks_grp = c(113,
114, 115, 116, 117, 118), Blink_dur = c(67, 183, 84, 183, 83,
78)), row.names = c(NA, -6L), groups = structure(list(Blinks_grp = c(113,
114, 115, 116, 117, 118), .rows = structure(list(1L, 2L, 3L,
4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

The main problem here is how to shape the data so that ggplot can plot it. Here's a proposal: in essence, I'm replacing Blink_onset and Blink_offset with an attribute that tells whether an entry in the table is a blink or talk; ggplot can then easily draw a separate line for each event:
## Preparing the data:
library(dplyr)
library(ggplot2)

df <- structure(list(Speaker = c("ID16.B", "ID16.B", "ID16.B", "ID16.A",
"ID16.B", "ID16.B"), Utterance = c("an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"an American family that (.) [uh] moved to Germany in <nineteen ninety one>",
"[°gotcha°]", "=↓my father's↓ like ~°we're going to Germany°~",
"=↓my father's↓ like ~°we're going to Germany°~"), Starttime_ms = c(289569L,
289569L, 289569L, 290604L, 300938L, 300938L), Endtime_ms = c(293940,
293940, 293940, 291004, 302140, 302140), Blink_onset = c(289879,
290696, 290962, 290696, 301529, 302062), Blink_offset = c(289946,
290879, 291046, 290879, 301612, 302146), Blinks_grp = c(113,
114, 115, 116, 117, 118), Blink_dur = c(67, 183, 84, 183, 83,
78)), row.names = c(NA, -6L), groups = structure(list(Blinks_grp = c(113,
114, 115, 116, 117, 118), .rows = structure(list(1L, 2L, 3L,
4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
df <- df %>%
  mutate(Utterance = paste0(sub(".*(.)$", "\\1", Speaker), ": ", Utterance),
         Utterance = factor(Utterance, levels = unique(Utterance)))
## Separate into a "blink" and a "talk" data frame, adding an attribute "Event" that distinguishes blinking from talking:
blink_df <- df %>%
  select(Speaker, Utterance, Blink_onset, Blink_offset, Blinks_grp) %>%
  mutate(Starttime_ms = Blink_onset, Endtime_ms = Blink_offset, Event = "blink")
talk_df <- df %>%
  select(Speaker, Utterance, Starttime_ms, Endtime_ms, Blinks_grp) %>%
  mutate(Event = "talk")
## Combine the datasets again:
plot_df <- bind_rows(talk_df, blink_df)
## ... and plot, using "Event" as the attribute that separates the talking and blinking lines:
plot_df %>%
  ggplot(aes(x = Starttime_ms, xend = Endtime_ms,
             y = Event, yend = Event, colour = Event)) +
  geom_segment(size = 3) +
  facet_wrap(~ Utterance, ncol = 1, scales = "free_x")
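As a side note, the same reshape can be written more compactly with tidyr::pivot_longer(). This is a sketch on my part, not part of the original proposal; plot_df2 and the talk_*/blink_* names are placeholders I introduce for the renaming step:
library(tidyr)

plot_df2 <- df %>%
  ungroup() %>%
  rename(talk_start = Starttime_ms, talk_end = Endtime_ms,
         blink_start = Blink_onset, blink_end = Blink_offset) %>%
  # "talk_start" splits into Event = "talk" plus a value column "start", etc.
  pivot_longer(c(talk_start, talk_end, blink_start, blink_end),
               names_to = c("Event", ".value"), names_sep = "_")

plot_df2 %>%
  ggplot(aes(x = start, xend = end, y = Event, yend = Event, colour = Event)) +
  geom_segment(size = 3) +
  facet_wrap(~ Utterance, ncol = 1, scales = "free_x")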

Related

How to make two-row column names in R Shiny?

I have a dataset which has a common feature: at the end of each column name (after the comma) is the group to which that column corresponds. Is it possible to create a table where a two-row column header is used? In the example, the first header row would show Up, with Quantity, Price and Quality underneath. Is it also possible to somehow separate these three groups (Up, Down and Total) with an empty column or a bold border? I know the DT library makes this easier, but I am looking for a solution that uses the shiny library only.
I also found that using tags$style() and CSS can help solve this, but I am not familiar with CSS.
library(shiny)
df <- structure(list(Year = c(2022L, 2022L, 2022L, 2022L, 2022L),
Week = c(0, 1, 2, 3, 4),
`Quantity, Up` = c(335, 305, 740, 910, 515),
`Price, Up` = c(1, 2, 5, 5, 3),
`Quality, Up` = c(243, 243, 243, 12321.434052, 1706.327462),
`Quantity, Down` = c(-175, -900, -205, -35, 0),
`Price, Down` = c(243, 243, 1219.717851, 902.819827, 0),
`Quality, Down` = c(2834.205418, 243, -1219.717851, 902.819827, 0),
`Quantity, Total` = c(510, 1205, 945, 945, 515),
`Price, Total` = c(431, 32, 423, 342, 243),
`Quality, Total` = c(24, 4, -2, 42, 1706.327462)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L),
groups = structure(list(Year = 2022L, .rows = structure(list(1:5), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame" ), row.names = c(NA, -1L), .drop = TRUE))
ui <- fluidPage(
  # Application title
  titlePanel("aFRR"),
  # Plot graphs
  mainPanel(
    tabsetPanel(
      tabPanel("Up",
               h3(helpText("aFRR Price and Quantity")),
               tableOutput("table_up"))
    )
  )
)
server <- function(input, output, session) {
  output$table_up <- renderTable(df, na = "missing", align = 'c', striped = TRUE)
}
shinyApp(ui, server)
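One shiny-only approach, sketched here as an assumption of mine rather than an answer from the original thread: build the two-row header by hand with tags and render the whole table via renderUI()/uiOutput(). The helper names (two_row_header, table_body, the "tworow" class) are placeholders; df is the data frame defined above.
library(shiny)

two_row_header <- function(nm) {
  has_grp <- grepl(",", nm)
  grp <- ifelse(has_grp, trimws(sub("^[^,]*,", "", nm)), nm)  # Up / Down / Total / Year / Week
  msr <- trimws(sub(",.*$", "", nm))                          # Quantity / Price / Quality / ...
  r <- rle(grp)
  top <- mapply(function(val, len, grouped) {
    if (grouped) tags$th(colspan = len, val, style = "border-left: 2px solid #444;")
    else         tags$th(rowspan = 2, val)
  }, r$values, r$lengths, has_grp[cumsum(r$lengths)], SIMPLIFY = FALSE)
  bottom <- lapply(which(has_grp), function(i) tags$th(msr[i]))
  tags$thead(tags$tr(top), tags$tr(bottom))
}

table_body <- function(d) {
  tags$tbody(lapply(seq_len(nrow(d)), function(i) {
    tags$tr(lapply(as.list(d[i, ]), function(x) tags$td(format(x))))
  }))
}

ui <- fluidPage(
  titlePanel("aFRR"),
  tags$style("
    table.tworow { border-collapse: collapse; }
    table.tworow th, table.tworow td { padding: 4px 10px; text-align: center; }
    table.tworow thead tr:last-child th { border-bottom: 1px solid #444; }
  "),
  uiOutput("table_up")
)

server <- function(input, output, session) {
  output$table_up <- renderUI({
    tags$table(class = "tworow", two_row_header(names(df)), table_body(df))
  })
}

shinyApp(ui, server)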

Spatial data: calculating the distance of points from the maximum point value and plotting

My question is similar to this post where the distance between each point was calculated.
In my case, I am looking to find the distance of each point to the point with the highest value. I would also like to plot this relationship with lm(), but I am struggling to achieve both tasks with spatial data objects.
My data does not need a CRS; it is based on Euclidean distance (the points are in a room).
A mock example of the data is below, where the column variable is of interest.
> dput(dat)
structure(list(date.hour = structure(c(1551057840, 1551057840,
1551057840, 1551057840, 1551057840, 1551057840, 1551057840), tzone = "UTC", class = c("POSIXct",
"POSIXt")), id = c(2, 5, 7, 8, 9, 10, 11), variable = c(456,
27, 130, 116, 92, 141, 145), xy_coord = c("6.2 14.8", "8.2 8.9",
"4.2 8.9", "2.2 8.9", "8.2 3.5", "6.2 3.5", "4.2 3.5")), row.names = c(NA,
-7L), groups = structure(list(id = c(2, 5, 7, 8, 9, 10, 11),
date.hour = structure(c(1551057840, 1551057840, 1551057840,
1551057840, 1551057840, 1551057840, 1551057840), tzone = "UTC", class = c("POSIXct",
"POSIXt")), .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L,
7L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
> dat
# A tibble: 7 x 4
# Groups:   id, date.hour [7]
  date.hour              id variable xy_coord
  <dttm>              <dbl>    <dbl> <chr>
1 2019-02-25 01:24:00     2      456 6.2 14.8
2 2019-02-25 01:24:00     5       27 8.2 8.9
3 2019-02-25 01:24:00     7      130 4.2 8.9
4 2019-02-25 01:24:00     8      116 2.2 8.9
5 2019-02-25 01:24:00     9       92 8.2 3.5
6 2019-02-25 01:24:00    10      141 6.2 3.5
7 2019-02-25 01:24:00    11      145 4.2 3.5
>
Turning the data frame into a SpatialPointsDataFrame with the sp package:
library(sp)
# Split x and y into separate columns
dat$x <- as.numeric(sapply(strsplit(as.character(dat$xy_coord), " "), "[", 1))
dat$y <- as.numeric(sapply(strsplit(as.character(dat$xy_coord), " "), "[", 2))
# SpatialPointsDataFrame
coordinates(dat) <- ~ x + y
This is the point where I don't know what steps to take, but I want to know the distance of all the points to the point with the highest value:
which.max(dat@data$variable)
And then plot this relationship with base plot().
If my question is unclear please let me know.
I'm still not sure I understand your question but I propose the following answer.
Load packages
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1
library(tidyr)
Load data
dat = structure(
list(
date.hour = structure(
c(
1551057840, 1551057840, 1551057840, 1551057840, 1551057840,
1551057840, 1551057840
),
tzone = "UTC",
class = c(
"POSIXct",
"POSIXt"
)
),
id = c(2, 5, 7, 8, 9, 10, 11),
variable = c(
456, 27, 130, 116, 92, 141, 145
),
xy_coord = c(
"6.2 14.8", "8.2 8.9", "4.2 8.9", "2.2 8.9", "8.2 3.5", "6.2 3.5",
"4.2 3.5"
)
),
row.names = c(NA,-7L),
groups = structure(
list(
id = c(2, 5, 7, 8, 9, 10, 11),
date.hour = structure(
c(
1551057840, 1551057840, 1551057840, 1551057840, 1551057840,
1551057840, 1551057840
),
tzone = "UTC",
class = c(
"POSIXct",
"POSIXt"
)
),
.rows = structure(
list(1L, 2L, 3L, 4L, 5L, 6L, 7L),
ptype = integer(0),
class = c(
"vctrs_list_of", "vctrs_vctr", "list"
)
)
),
row.names = c(NA, -7L),
class = c("tbl_df", "tbl", "data.frame"),
.drop = TRUE
),
class = c("grouped_df", "tbl_df", "tbl", "data.frame")
)
Separate the xy_coord column, convert columns to numeric and create an sf object
dat_sf <- st_as_sf(
separate(dat, xy_coord, c("x", "y"), sep = " ", convert = TRUE),
coords = c("x", "y")
)
Find the maximum of variable
which.max(dat_sf[["variable"]])
#> [1] 1
Compute all distances
dat_sf[["distances"]] <- st_distance(dat_sf, dat_sf[1, ])
Plot
plot(variable ~ distances, data = dat_sf)
Created on 2021-11-22 by the reprex package (v2.0.1)
You can also remove the first point (with distance = 0).
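Since the coordinates are plain room coordinates, a base-R alternative without sf is also possible. This is a sketch of mine (the distances column and the x/y split are not part of the answer above); it uses dat as given in the question's dput(), before any conversion to a spatial object:
# Split the coordinates, find the row with the maximum value, and compute
# plain Euclidean distances to that point.
xy <- do.call(rbind, strsplit(as.character(dat$xy_coord), " "))
dat$x <- as.numeric(xy[, 1])
dat$y <- as.numeric(xy[, 2])
i <- which.max(dat$variable)
dat$distances <- sqrt((dat$x - dat$x[i])^2 + (dat$y - dat$y[i])^2)
plot(variable ~ distances, data = dat)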

Make a subway graph that includes 102 topics in ggplot2 (R)

This is a followup from subway-style graph for word frequency across three datasets in ggplot2
I used the code in the answer to that question, but am struggling with how best to manipulate the graph so that it fits 100 unique dict entries within the subway graph without completely messing up the dict word entries on the margins.
I have tested different numbers of words to feed into the subway graph and found that it cannot contain more than 25 words.
I have data:
structure(list(dict = c("apple", "apple", "apple",
"mandarin", "mandarin", "mandarin", "orange", "orange", "orange", "pear"),
name = c("freq_ongov", "freq_onindiv", "freq_onmedia", "freq_ongov",
"freq_onindiv", "freq_onmedia", "freq_ongov", "freq_onindiv",
"freq_onmedia", "freq_ongov"), value = c(0, 87, 63, 0, 44,
20, 3, 27, 25, 0), rank = c(26, 85, 70, 26, 61, 42.5, 86,
47, 48, 26)), row.names = c(NA, -10L), groups = structure(list(
name = c("freq_ongov", "freq_onindiv", "freq_onmedia"), .rows = structure(list(
c(1L, 4L, 7L, 10L), c(2L, 5L, 8L), c(3L, 6L, 9L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
But there are 100 rows within this data that I want to include in the following code:
leftlabels <- df$dict[df$name == "freq_ongov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_ongov"])]
rightlabels <- df$dict[df$name == "freq_onmedia"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_onmedia"])]

ggplot(df, aes(name, rank, color = dict, group = dict)) +
  geom_line(size = 4) +
  geom_point(shape = 21, fill = "white", size = 4) +
  scale_y_continuous(breaks = seq(max(df$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~ ., breaks = seq(max(df$rank)),
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))
I tried changing the y.int and width of the y axis to fit in 100 words, but that only makes the y-axis longer, without changing the spacing between each word label on the y-axis, so all the words get squeezed together. Any suggestions?
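One pragmatic option, sketched here as an assumption of mine rather than an answer from the linked thread: keep the plot code as it is, shrink the y-axis text, and save to a tall device, so the spacing between the ~100 labels comes from the output size rather than the on-screen window (the object p, the file name and the dimensions are placeholders):
library(ggplot2)

p <- last_plot() +                    # or assign the ggplot() call above to p directly
  theme(axis.text.y = element_text(size = 6))

# A tall output device spreads the ~100 y-axis labels out vertically.
ggsave("subway_graph.png", p, width = 8, height = 20, units = "in", dpi = 300)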

How to create differences between several pairs of columns?

I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type) in time ym) I have many variables. For each variable I would like to subtract each group's first (dplyr::first) value from every value of that group.
Ultimately I am trying to take the Euclidean distance between each row's vector and its group's first entry, i.e. sqrt(c_1^2 + ... + c_k^2), where the c_i are the element-wise differences.
I was able to create a column equal to the first entry for each group:
df2 <- df %>%
  group_by(ym, NAICS2, occ_type) %>%
  distinct(ym, NAICS2, occ_type, .keep_all = TRUE) %>%
  arrange(occ_type, NAICS2, ym) %>%
  select(group_cols(), ends_with("_scf")) %>%
  mutate_at(vars(-group_cols(), ends_with("_scf")),
            list(first = dplyr::first))
I then tried to include variations of f.diff = . - dplyr::first(.) in the list, but none of those worked. I googled the dot notation for a while, as well as first and lag in dplyr time series, but have not been able to resolve this yet.
Ideally, I unite all variables into a vector for each row first and then take the difference.
df2 <- df %>%
  group_by(ym, NAICS2, occ_type) %>%
  distinct(ym, NAICS2, occ_type, .keep_all = TRUE) %>%
  arrange(occ_type, NAICS2, ym) %>%
  select(group_cols(), ends_with("_scf")) %>%
  unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
  # TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
  mutate(vector.diff = ???)
I expect the output to be a numeric column that contains a distance measure of how different each group's row vector is from its initial row vector.
Here is a sample of the data:
structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L,
0L), occ_type = c("is_middle_manager", "is_middle_manager"),
Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11,
17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57,
51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302,
163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7,
7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25,
28), Engineering_scf = c(90, 64), Environment_scf = c(19,
19), Finance_scf = c(455, 313), Health.Care_scf = c(105,
71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265,
174), Information.Technology_scf = c(467, 402), Legal_scf = c(21,
17), Maintenance..Repair..and.Installation_scf = c(194, 222
), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139,
109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31,
16), Public.Safety.and.National.Security_scf = c(14, 7),
Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52,
24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599,
3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01",
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager",
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
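A sketch of one way to get both the per-column differences and the row-wise Euclidean distance, assuming the groups are (NAICS2, occ_type) ordered by ym and that dplyr >= 1.0 (across()/c_across()) is available; the *_diff columns and vector.diff are placeholder names of mine:
library(dplyr)

df2 <- df %>%
  ungroup() %>%
  distinct(ym, NAICS2, occ_type, .keep_all = TRUE) %>%
  arrange(occ_type, NAICS2, ym) %>%
  group_by(NAICS2, occ_type) %>%
  # difference of every *_scf column from the group's first (earliest) row
  mutate(across(ends_with("_scf"), ~ .x - dplyr::first(.x), .names = "{.col}_diff")) %>%
  ungroup() %>%
  rowwise() %>%
  # Euclidean distance of each row's vector from its group's first row
  mutate(vector.diff = sqrt(sum(c_across(ends_with("_diff"))^2))) %>%
  ungroup()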

Passing current value of ddply split on to function

Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
  # If we get only one year (1890), convert it to a range of years (1890-1890)
  if (length(years) == 1) years <- c(years, years)
  # Calculate the male and female proportions for the given range of years
  ssa_select <- ssa_demo %.%
    filter(year >= years[1], year <= years[2]) %.%
    group_by(name) %.%
    summarise(female = sum(female),
              male = sum(male)) %.%
    mutate(proportion_male = round((male / (male + female)), digits = 4),
           proportion_female = round((female / (male + female)), digits = 4)) %.%
    mutate(gender = sapply(proportion_female, male_or_female))
  return(ssa_select)
}

# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
  if (proportion_female > 0.5) {
    return("female")
  } else if (proportion_female == 0.5000) {
    return("either")
  } else {
    return("male")
  }
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo, names_to_encode) %.%
  group_by(year, name) %.%
  summarise(female = sum(female),
            male = sum(male)) %.%
  mutate(proportion_male = round((male / (male + female)), digits = 4),
         proportion_female = round((female / (male + female)), digits = 4)) %.%
  mutate(gender = ifelse(proportion_female == 0.5, "either",
                         ifelse(proportion_female > 0.5, "female", "male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on GitHub suggesting that this is either fixed in the dev version or at least something they're working on.
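For reference, the %.% operator has long since been deprecated in favour of %>%; with a current dplyr the same pipeline would look roughly like this (a sketch; .groups = "drop" just silences the regrouping message):
library(dplyr)

tmp <- left_join(ssa_demo, names_to_encode, by = c("name", "year")) %>%
  group_by(year, name) %>%
  summarise(female = sum(female),
            male = sum(male),
            .groups = "drop") %>%
  mutate(proportion_male = round(male / (male + female), digits = 4),
         proportion_female = round(female / (male + female), digits = 4),
         gender = ifelse(proportion_female == 0.5, "either",
                         ifelse(proportion_female > 0.5, "female", "male")))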
