Computing share of sub-group by month and year (tidyverse) - r

I am trying to compute the share of entity mentions online by month, i.e. each entity's mentions as a share of the total mentions in that month rather than of the total number of mentions in my dataset.
Print data example
dput(directed_to_whom_monthly[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
To compute this, I have tried the following:
directed_to_whom_monthly %>%
group_by(directed_to_whom) %>% # group data entity mentions
group_by(month_year) %>%
add_count(treatment_details) %>% # add count of treatment_implementation
unique() %>% # remove duplicates
ungroup() %>% # remove grouping
mutate(directed_to_whom_percentage = n/sum(n)) # ...calculating percentage
But this essentially divides the number of mentions of entity X by all mentions in the dataset.
I have also tried a solution from here, as follows. The code runs without errors, but it does not divide the mentions by the total mentions per month.
test <-directed_to_whom_monthly %>%
group_by(month) %>% mutate(per= prop.table(n) * 100)
dput(test[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7), per = c(2.49376558603491,
2.49376558603491, 8, 30)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L), groups = structure(list(
month = c(6, 7), .rows = structure(list(1:2, 3:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))

I think you need to calculate the count for each directed_to_whom by month, then the total count of all entries for that month, and then calculate the percentage from those two counts:
directed_to_whom_monthly %>%
group_by(directed_to_whom, month_year) %>%
mutate(direct_month_count=n()) %>% #count of directed to whom by month
group_by(month_year) %>%
mutate(month_year_count=n()) %>% ###total count per month
mutate(directed_to_whom_percentage = direct_month_count/month_year_count*100) #percentage
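If the n column already holds the number of mentions per row, an alternative is to sum n within each month instead of counting rows. The following is only a minimal sketch under that assumption: the toy data, the mentions and monthly_share names, and the month_total and share_pct columns are hypothetical. It groups by year and month together so that, for example, June 2011 is not pooled with June 2012.

```r
library(dplyr)

# Toy data (hypothetical values): one row per entity and month, n = mentions
mentions <- tibble(
  directed_to_whom = c("Private employers", "MoL", "Private employers", "MoE"),
  year  = c(2011, 2011, 2011, 2011),
  month = c(6, 6, 7, 7),
  n     = c(10L, 10L, 8L, 24L)
)

monthly_share <- mentions %>%
  group_by(year, month) %>%               # one group per calendar month
  mutate(
    month_total = sum(n),                 # all mentions in that month
    share_pct   = n / month_total * 100   # entity share within the month
  ) %>%
  ungroup()

monthly_share
```

With the original data, group_by(month_year) should play the same role as group_by(year, month) here.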


Creating a unique id per username (dplyr) vs. Stata

I have a reddit dataset where each row represents a single reddit post, along with the username info. However, given that it's reddit data, the number of posts per username varies a lot (i.e. depending on how active a given username is on reddit).
I am trying to create a unique id for each username and my data are structured as follows:
dput(df[1:5,c(2,3)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("تتطور", "عاطله فقط",
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
I ran the following code, where I tried to replicate the code here.
The code runs without errors, but it does not create a unique id per username.
#create an ID per observation
df <- df %>%
group_by(username) %>%
mutate(id = row_number())%>%
relocate(id)
Print data example with specific columns
dput(df[1:10,c(1,4)])
output:
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
username = c("تتطور", "عاطله فقط", "قصه ألم",
"بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم",
"مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه",
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
In Stata, I would do this as follows:
// create an id variable per username
egen id = group(username)
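That's an incorrect use of group_by for your purpose: row_number() within group_by(username) numbers the rows inside each username instead of numbering the usernames. If you want an id like the one from your Stata code with egen, you may want to try this: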
df$id = as.integer(factor(df$username))
This produced the same id as Stata
egen id = group(username)
Just FYI, I also tried dplyr::consecutive_id():
df %>% mutate(
id_dplyr = dplyr::consecutive_id(username)
)
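but I was unable to reproduce the Stata result with your example, because consecutive_id() starts a new id whenever the value changes from one row to the next rather than assigning one id per distinct username.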
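To make the difference between the two numbering schemes concrete, here is a small base R sketch with hypothetical usernames (the username, id_sorted and id_first_seen names are made up for illustration). It contrasts ids numbered in sorted order of the names, as as.integer(factor()) does, with ids numbered in order of first appearance.

```r
# Hypothetical usernames; the repeated "b" and "a" should each share one id
username <- c("b", "a", "b", "c", "a")

# One id per distinct username, numbered in sorted order of the names
id_sorted <- as.integer(factor(username))

# One id per distinct username, numbered in order of first appearance
id_first_seen <- match(username, unique(username))

data.frame(username, id_sorted, id_first_seen)
```

Either variant gives every post by the same username the same id; which one lines up with another tool's numbering depends on how that tool orders the names, so it is worth checking on a few rows.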

connect points within position_dodged factor x-axis in ggplot2

I'm trying to add significance annotations to an errorbar plot with a factor x-axis and dodged groups within each level of the x-axis. It is a similar but NOT identical use case to this
My base errorbar plot is:
library(ggplot2)
library(dplyr)
library(tidyr) # needed for gather() and spread() further down
pres_prob_pd = structure(list(x = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), labels = c(`1` = 1,
`2` = 2, `3` = 3)), predicted = c(0.571584427222816, 0.712630712634987,
0.156061969566517, 0.0162388386564817, 0.0371877245103279, 0.0165022541901018,
0.131528946944238, 0.35927812866896, 0.0708662221985375), std.error = c(0.355802875027348,
0.471253661425626, 0.457109887762665, 0.352871728451576, 0.442646879181155,
0.425913568532558, 0.376552208691762, 0.48178172708116, 0.451758041335245
), conf.low = c(0.399141779923204, 0.496138837620712, 0.0701919316506831,
0.00819832576725402, 0.0159620304815404, 0.00722904089045731,
0.0675129352870401, 0.17905347369819, 0.030504893442457), conf.high = c(0.728233665534388,
0.861980236164486, 0.311759350126477, 0.031911364587827, 0.0842227723261319,
0.0372248587668487, 0.240584344249407, 0.590437963881823, 0.156035177669385
), group = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain",
"neutral", "uncertain"), class = "factor"), group_col = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain", "neutral",
"uncertain"), class = "factor"), language = structure(c(2L, 2L,
2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("english", "dutch", "german"
), class = "factor"), top = c(0.861980236164486, 0.861980236164486,
0.861980236164486, 0.0842227723261319, 0.0842227723261319, 0.0842227723261319,
0.590437963881823, 0.590437963881823, 0.590437963881823)), row.names = c(NA,
-9L), groups = structure(list(language = structure(1:3, .Label = c("english",
"dutch", "german"), class = "factor"), .rows = structure(list(
4:6, 1:3, 7:9), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
#dodge
pd = position_dodge(.75)
#plot
p = ggplot(pres_prob_pd,aes(x=language,y=predicted,color=group,shape=group)) +
geom_point(position=pd,size=2) +
geom_errorbar(aes(ymax=conf.high,ymin=conf.low),width=.125,position=pd)
p
What I want to do is annotate the plot such that the contrasts between group within each level of language are annotated for significance. I've plotted points representing the relevant contrasts and (toy) sig. annotations as follows:
#bump function
f = function(x){
v = c()
bump=0.025
constant = 0
for(i in x){
v = c(v,i+constant+bump)
bump = bump + 0.075
}
v
}
#create contrasts
combs = data.frame(gtools::combinations(3, 2, v=c("certain", "neutral", "uncertain"), set=F, repeats.allowed=F)) %>%
mutate(contrast=c("cont_1","cont_2","cont_3"))
combs = rbind(combs %>% mutate(language = 'english'),
combs %>% mutate(language='dutch'),
combs %>% mutate(language = "german")) %>%
left_join(select(pres_prob_pd,language:top)%>%distinct(),by='language') %>%
group_by(language)
#long transform and calc y_pos
combs_long = mutate(combs,y_pos=f(top)) %>% gather(long, probability, X1:X2, factor_key=TRUE) %>% mutate(language=factor(language,levels=c("english","dutch","german"))) %>%
arrange(language,contrast)
#back to wide
combs_wide =combs_long %>% spread(long,probability)
combs_wide$p = rep(c('***',"*","ns"),3)
#plot
p +
geom_point(data=combs_long,
aes(x = language,
color=probability,
shape=probability,
y=y_pos),
inherit.aes = T,
position=pd,
size=2) +
geom_text(data=combs_wide,
aes(x=language,
label=p,
y=y_pos+.025,
group=X1),
color='black',
position=position_dodge(.75),
inherit.aes = F)
What I am failing to achieve is plotting a line connecting each of the contrasts of group within each level of language, as is standard when annotating significant group-wise differences. Any help much appreciated!

how to reshape the matrix and fill the missing value as 0

I have a question about reshaping data in R. I need to transpose the data so that the phase and status columns are combined into column names, filling the missing values with 0. It seems very tricky, and I would appreciate any help. Thank you.
My data currently looks as follows:
structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
I need the reshaped matrix to have the following columns:
Customer, Phase1EarlyTotal, Phase2EarlyTotal, ..., Phase4EarlyTotal, Phase1OntimeTotal, ..., Phase4OntimeTotal, Phase1LateTotal, ..., Phase4LateTotal. For example, Phase1EarlyTotal is the sum of Amount over the rows with Phase = 1 and Status = Early.
Currently I use the following script, which does not work because I don't know how to combine the Phase and Status columns.
mydata2<-data.table(mydata2,V3,V4)
mydata2$V4<-NULL
datacus <- data.frame(mydata2[-1,],stringsAsFactors = F);
datacus <- datacus %>%
mutate(Phase = as.numeric(Phase), Amount = as.numeric(Amount)) %>%
complete(Phase = 1:4, fill = list(Amount = 0)) %>%
dcast(datacus ~ V3, value.var = 'Amount', fill = 0) %>%
select(Phase, V3) %>%
t()
I believe you are looking for something like this?
sample data
df <- structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
# Customer Phase Status Amount
# 1: 1096261 2 Ontime 21216.32
# 2: 1096261 3 Ontime 42432.65
# 3: 1169502 1 Ontime 200320.05
# 4: 1169502 2 Ontime 84509.24
code
library( data.table )
dcast( setDT( df ), Customer ~ Phase + Status, fun.aggregate = sum, value.var = "Amount" )[]
output
# Customer 1_Ontime 2_Ontime 3_Ontime
# 1: 1096261 0 21216.32 42432.65
# 2: 1169502 200320 84509.24 0.00
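The output above only contains Phase and Status combinations that occur in the data. If every combination should get a column, filled with 0 when absent, one possible base R sketch is shown below. It assumes the phases are 1 to 4 and the statuses are spelled "Early", "Ontime" and "Late"; those level sets, and the orders, key, wide and wide_df names, are assumptions for illustration.

```r
# Sample data from the question, as a plain data frame
orders <- data.frame(
  Customer = c("1096261", "1096261", "1169502", "1169502"),
  Phase    = c("2", "3", "1", "2"),
  Status   = c("Ontime", "Ontime", "Ontime", "Ontime"),
  Amount   = c(21216.32, 42432.65, 200320.05, 84509.24)
)

# Declare every level that should appear as a column (assumed level sets)
orders$Phase  <- factor(orders$Phase,  levels = as.character(1:4))
orders$Status <- factor(orders$Status, levels = c("Early", "Ontime", "Late"))

# Combined Phase_Status key; interaction() keeps unused combinations
orders$key <- interaction(orders$Phase, orders$Status, sep = "_")

# Sum Amount per Customer and key; absent combinations come out as 0
wide <- xtabs(Amount ~ Customer + key, data = orders)
wide_df <- as.data.frame.matrix(wide)
wide_df
```

xtabs() keeps unused factor levels by default, which is what produces the zero-filled columns here.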

Summarize dataframe with start and end times in R?

Here is a sample of my df:
structure(list(press_id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
start_time = c(164429106370978, 164429106370978, 164429411618824,
164429411618824, 164429837271939, 164429837271939, 164430399454284,
164430399454284), end_time = c(164429182443824, 164429182443824,
164429512525747, 164429512525747, 164429903243169, 164429903243169,
164430465927554, 164430465927554), timestamp = c(164429140697138,
164429175921880, 164429440899844, 164429440899844, 164429867184830,
164429891199391, 164430427558256, 164430433561155), acc_x = c(3.1053743,
2.9904492, 5.889916, 5.889916, 5.808511, 5.36557, 3.545921,
3.4788814), acc_y = c(8.406299, 8.12138, 8.600235, 8.600235,
7.920261, 7.922655, 7.9346266, 7.972935), acc_z = c(4.577853,
4.0894213, 0.35435268, 0.35435268, -0.21309046, 0.46927786,
4.005622, 4.4198313), grav_x = c(3.931084, 4.0214577, 4.7844357,
4.7844357, 5.6572776, 5.65053, 3.9938855, 3.9938855), grav_y = c(8.318872,
8.281514, 8.21449, 8.21449, 7.94851, 7.9495893, 8.027369,
8.027369), grav_z = c(3.393116, 3.3785365, 2.408623, 2.408623,
0.99327636, 1.0226398, 3.9724596, 3.9724596), gyro_x = c(-0.35906965,
0.099690154, 0.06792516, 0.04532315, -0.05546962, -0.06524346,
-0.2967614, -0.32180685), gyro_y = c(0.15843217, -0.48053285,
-0.2196934, -0.21175216, 0.1895863, 0.37467846, 0.12239113,
0.04847643), gyro_z = c(-0.042139318, 0.39585108, 0.12523776,
0.11240959, -0.05863268, 0.042770952, 0.047047008, 0.097137965
), acc_mag = c(10.0630984547559, 9.5719886173707, 10.4297995361418,
10.4297995361418, 9.82419166595324, 9.58008483176486, 9.56958006531909,
9.75731607717771), acc_mag_max = c(10.4656808698978, 10.4656808698978,
10.5978974240054, 10.5978974240054, 10.2717799984467, 10.2717799984467,
10.0054693945119, 10.0054693945119), acc_mag_min = c(9.55048847884876,
9.55048847884876, 9.45791784630329, 9.45791784630329, 9.58008483176486,
9.58008483176486, 9.49389444102469, 9.49389444102469), acc_mag_avg = c(9.9181794947982,
9.9181794947982, 9.82876220923978, 9.82876220923978, 9.89351246166363,
9.89351246166363, 9.77034322149792, 9.77034322149792), vel_ang_mag = c(0.394724572535758,
0.630514095219792, 0.261846355511019, 0.243985821544114,
0.206052505577139, 0.382714007838398, 0.324438496782347,
0.339625377757329), vel_ang_mag_max = c(0.665292823798622,
0.665292823798622, 1.00730683166191, 1.00730683166191, 0.561349818527019,
0.561349818527019, 0.445252333070234, 0.445252333070234),
vel_ang_mag_min = c(0.212944405199931, 0.212944405199931,
0.18680382123856, 0.18680382123856, 0.111795327479332, 0.111795327479332,
0.258342546774667, 0.258342546774667), vel_ang_mag_avg = c(0.440700089033948,
0.440700089033948, 0.405484992593493, 0.405484992593493,
0.284553957549617, 0.284553957549617, 0.348811700631375,
0.348811700631375)), .Names = c("press_id", "start_time",
"end_time", "timestamp", "acc_x", "acc_y", "acc_z", "grav_x",
"grav_y", "grav_z", "gyro_x", "gyro_y", "gyro_z", "acc_mag",
"acc_mag_max", "acc_mag_min", "acc_mag_avg", "vel_ang_mag", "vel_ang_mag_max",
"vel_ang_mag_min", "vel_ang_mag_avg"), row.names = c(NA, -8L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "press_id", drop = TRUE, indices = list(
0:1, 2:3, 4:5, 6:7), group_sizes = c(2L, 2L, 2L, 2L), biggest_group_size = 2L, labels = structure(list(
press_id = 1:4), row.names = c(NA, -4L), class = "data.frame", vars = "press_id", drop = TRUE, indices = list(
0:1, 2:3, 4:5, 6:7), group_sizes = c(2L, 2L, 2L, 2L), biggest_group_size = 2L, labels = structure(list(
press_id = 1:4), row.names = c(NA, -4L), class = "data.frame", vars = "press_id", drop = TRUE, .Names = "press_id"), .Names = "press_id"))
I am trying to summarize it in the following way, where the last columns (left blank here) are filled with the appropriate values from the data frame above:
press_id time_state time_state_val acc_mag acc_mag_max acc_mag_min acc_mag_avg vel_ang_mag vel_ang_mag_max vel_ang_mag_min vel_ang_mag_avg
1 start_time 164429106370978
1 end_time 164429182443824
2 start_time 164429411618824
2 end_time 164429512525747
3 start_time 164429837271939
3 end_time 164429903243169
4 start_time 164430399454284
4 end_time 164430427558256
Please advise how I can transform it to match the expected result.
I am trying to do this with combination of tidyr gather and dplyr but I don't get the structure I need.
library(dplyr)
library(tidyr)
df1 <- df[,1:6]
df1 %>% mutate(row=row_number()) %>%
gather(time_state , time_state_val, -press_id, -row,-timestamp:-acc_y) %>%
arrange(press_id, row) %>%
select(press_id, time_state, time_state_val, everything(),-row)
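For the time_state and time_state_val part of the expected result, one possible approach is to reduce the data to one row per press and then stack the two time columns. This is a rough sketch on toy data, not a full solution: the presses and press_times names and the small numbers are hypothetical, and how the remaining measurement columns should be matched to each row is left open because the question does not specify it.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values): two readings per press, times repeated
presses <- tibble(
  press_id   = c(1L, 1L, 2L, 2L),
  start_time = c(10, 10, 30, 30),
  end_time   = c(20, 20, 40, 40),
  acc_mag    = c(10.1, 9.6, 10.4, 10.4)
)

press_times <- presses %>%
  distinct(press_id, start_time, end_time) %>%   # one row per press
  pivot_longer(
    c(start_time, end_time),
    names_to  = "time_state",                    # "start_time" / "end_time"
    values_to = "time_state_val"                 # the corresponding time
  )

press_times
```

The per-press summary columns (such as acc_mag_max) could then be joined back by press_id if that is the intended layout.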

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example)
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
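library(data.table)
mydat.dt <- data.table(mydat)
# paste only the string columns; tally and score are summed separately below
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse=" "), by = date, .SDcols = c("location", "value")]
cbind(mydat.dt, aggregate(mydat[,c("tally", "score")], by=list(mydat$date), FUN = sum, na.rm=TRUE)[2:3])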
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that if you wanted to you could probably do it all in one step in the reshaping of the data.table but I found this to be a quicker and easier way for me to achieve the same thing in 2 steps.
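As a rough illustration of the one-step variant, the sketch below does the collapsing and the summing in a single data.table call. The dt and daily names are hypothetical, and it assumes that each distinct string should be listed once per day and that missing tallies and scores should be ignored in the sums.

```r
library(data.table)

# Data from the question, with the timestamp cut down to the calendar date
dt <- data.table(
  date     = as.Date(substr(c("2016-06-30T09:30Z", "2016-06-30T12:30Z",
                              "2016-06-30T19:30Z"), 1, 10)),
  location = c("home", "work", "home"),
  value    = c("foo", "foo", "bar"),
  tally    = c(1L, 2L, NA),
  score    = c(NA, NA, 5L)
)

daily <- dt[, .(
  location = paste(unique(location), collapse = ", "),  # distinct strings
  value    = paste(unique(value), collapse = ", "),
  tally    = sum(tally, na.rm = TRUE),                   # NA treated as absent
  score    = sum(score, na.rm = TRUE)
), by = date]

daily
```

If real arrays are needed rather than pasted strings, list(unique(location)) in place of paste() would give a list column instead.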
