Creating a unique id per username (dplyr) vs. Stata

I have a reddit dataset where each row represents a single reddit post, along with the username info. However, given that it's reddit data, the number of posts per username varies a lot (i.e. depending on how active a given username is on reddit).
I am trying to create a unique id for each username and my data are structured as follows:
dput(df[1:5,c(2,3)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("تتطور", "عاطله فقط",
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
I ran the following code, in which I tried to replicate the code here.
The code runs without errors, but it does not create a unique id per username.
#create an ID per observation
df <- df %>%
group_by(username) %>%
mutate(id = row_number())%>%
relocate(id)
Print data example with specific columns
dput(df[1:10,c(1,4)])
output:
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
username = c("تتطور", "عاطله فقط", "قصه ألم",
"بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم",
"مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه",
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
In Stata, I would do this as follows:
// create an id variable per username
egen id = group(username)

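That's an incorrect use of group_by for your purpose: group_by() followed by row_number() numbers the rows within each username, so it cannot give one id per username. If you want an id just like the one from your Stata code with egen, you may want to try this:
df$id <- as.integer(factor(df$username))
This produced the same id as Stata's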
egen id = group(username)
Just FYI, I also tried dplyr::consecutive_id():
df %>% mutate(
id_dplyr = dplyr::consecutive_id(username)
)
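but I was unable to reproduce the Stata results with your example. consecutive_id() increments each time the value changes from the previous row, which is not the same as one id per distinct username.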
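To make the difference concrete, here is a minimal base R sketch. The usernames are made up for illustration, and ave() with seq_along stands in for what group_by() plus row_number() computes; it is not the asker's data.

```r
# Made-up usernames; "carol" appears three times, like an active poster.
username <- c("carol", "alice", "bob", "carol", "carol")

# Counter within each username: the base R analogue of
# group_by(username) followed by row_number().
row_in_group <- ave(seq_along(username), username, FUN = seq_along)

# One id per distinct username, numbered in the sorted order of the names.
id <- as.integer(factor(username))

print(data.frame(username, row_in_group, id))
```

The first column restarts at 1 for every new username, while the second stays constant for all rows of the same username, which is the behaviour the Stata command is after.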

Related

Computing share of sub-group by month and year (tidyverse)

I am trying to compute the share of entity mentions online by month, as a share of the total mentions in that month rather than of the total number of mentions in my dataset.
Print data example
dput(directed_to_whom_monthly[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
To compute this, I have tried the following:
directed_to_whom_monthly %>%
group_by(directed_to_whom) %>% # group data entity mentions
group_by(month_year) %>%
add_count(treatment_details) %>% # add count of treatment_implementation
unique() %>% # remove duplicates
ungroup() %>% # remove grouping
mutate(directed_to_whom_percentage = n/sum(n)) # ...calculating percentage
But this essentially divides the number of mentions of entity X by all mentions in the dataset.
I have also tried a solution from here, as follows. The code runs, but it does not compute mentions as a share of the total mentions per month.
test <-directed_to_whom_monthly %>%
group_by(month) %>% mutate(per= prop.table(n) * 100)
dput(test[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7), per = c(2.49376558603491,
2.49376558603491, 8, 30)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L), groups = structure(list(
month = c(6, 7), .rows = structure(list(1:2, 3:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))

I think you need to count the directed_to_whom entries by month, then count all entries for that month, and calculate the percentage from those two counts:
directed_to_whom_monthly %>%
group_by(directed_to_whom, month_year) %>%
mutate(direct_month_count = n()) %>% # count of directed_to_whom by month
group_by(month_year) %>%
mutate(month_year_count = n()) %>% # total count per month
mutate(directed_to_whom_percentage = direct_month_count / month_year_count * 100) # percentage
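Note that n() counts rows. If the n column in the question already holds the number of mentions per entity and month (an assumption on my part), summing n within each month may be closer to what is wanted. A minimal base R sketch on made-up values, with month_year simplified to a character label:

```r
# Toy monthly counts; column names follow the question, values are made up.
mentions <- data.frame(
  directed_to_whom = c("Private employers", "MoL", "Private employers", "MoE"),
  month_year = c("2011-06", "2011-06", "2011-07", "2011-07"),
  n = c(10L, 10L, 8L, 24L),
  stringsAsFactors = FALSE
)

# Total mentions within each month, repeated on every row of that month.
mentions$month_total <- ave(mentions$n, mentions$month_year, FUN = sum)

# Share of each entity within its own month, in percent.
mentions$share_pct <- 100 * mentions$n / mentions$month_total

print(mentions)
```

Within each month the shares add up to 100, which is a quick way to check that the denominator is the monthly total and not the grand total.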

R missing variable that is conditional on the others that are present

I have a dataset in R and I'm trying to fill out two missing values at the same time. I had used the pad function from library(padr) to fill out the data frame with missing date values. Now I have two additional fields that are NA.
I know what these values should be but I don't understand an easy way to code them into the dataframe and the dataframe is too long to do it manually.
The missing value in the sales column should be 0. The harder part is the store column. There are three stores: store1, store2 and store3, and each date is listed three times. I don't know which store is missing for each day. In the example I'm including here, store2 is missing, but later in the data frame it might be store1 or store3. Is there a way to fill in the missing store from the other two stores that are present?
Here is a section of the data frame so it's reproducible.
structure(list(date = structure(c(18628, 18628, 18628, 18629,
18629, 18629, 18630, 18630, 18630, 18631, 18631, 18631), class = "Date"),
store = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, NA, 3L, 1L,
2L, 2L), .Label = c("store1", "store2", "store3"), class = "factor"),
sales = c(153461, 2332, 1734, 176912, 53063, 17484, 243581,
NA, 412, 1739263, 427311, 9772)), row.names = c(NA, -12L), groups = structure(list(
store = structure(c(1L, 2L, 3L, NA), .Label = c("store1",
"store2", "store3"), class = "factor"), .rows = structure(list(
c(1L, 4L, 7L, 10L), c(2L, 5L, 11L, 12L), c(3L, 6L, 9L
), 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

I guess you want a balanced panel (for each date, three rows, one per store). I would proceed as follows:
First, create a balanced dataset with dates and stores.
stores<-c('store1','store2','store3')
dates<-seq(as.Date('2021-01-01'),as.Date('2021-07-22'),by='day') # the end date must not precede the start date, otherwise seq() fails
data<-data.frame(expand.grid(stores,dates))
Now left-join your dataset onto it. Rows that are not in your data get NA in the sales column, which you can then easily fill with 0.
library(dplyr)
names(data)[1] <- "store"
names(data)[2] <- "date"
df2 <- left_join(data, df, by = c("store", "date"))
df2$sales[is.na(df2$sales)] <- 0
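If the padded rows should be kept and only their store label filled in, the missing store can also be inferred directly, as the question asks. This is a rough base R sketch on made-up data shaped like the question; sales_df and all_stores are hypothetical names, and it assumes at most one store is missing per date.

```r
# Toy data in the shape of the question: one store label is NA on the second date.
sales_df <- data.frame(
  date  = as.Date(c("2021-01-01", "2021-01-01", "2021-01-01",
                    "2021-01-02", "2021-01-02", "2021-01-02")),
  store = c("store1", "store2", "store3", "store1", NA, "store3"),
  sales = c(100, 200, 300, 400, NA, 600),
  stringsAsFactors = FALSE
)
all_stores <- c("store1", "store2", "store3")

# Row indices grouped by date.
rows_by_date <- split(seq_len(nrow(sales_df)), sales_df$date)

for (rows in rows_by_date) {
  # The store that does not appear among this date's rows.
  missing_store <- setdiff(all_stores, sales_df$store[rows])
  na_rows <- rows[is.na(sales_df$store[rows])]
  # Only fill when exactly one label is missing, so the choice is unambiguous.
  if (length(na_rows) == 1 && length(missing_store) == 1) {
    sales_df$store[na_rows] <- missing_store
  }
}

# Padded rows had no sales, so set them to 0.
sales_df$sales[is.na(sales_df$sales)] <- 0

print(sales_df)
```

If store is a factor, as in the dput output, the assignment works the same way as long as the inferred label is one of the existing levels.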

Conditionally adding characters to new column based on separate dataset

Hello all and thank you in advance.
I would like to add a new column to my pre-existing data frame, where the values are sourced from a second data frame based on certain conditions. The dataset I wish to add the new column to ("data_melt") has many different sample IDs (sample.#) under the variable column. Using a second dataset ("metadata"), I want to add the pond names to a new column in "data_melt" based on the sample IDs. The sample IDs are the same in both datasets.
My gut tells me there's an obvious solution but my head is pretty fried. Here is a toy example of my data_melt df (since the real one has 25,000 observations):
> dput(toy)
structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
And here is a toy example of my metadata df:
> dput(toy)
structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
Thank you again!

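We can use match from base R to create a numeric index into the metadata and look up the pond for each sample id (using the data frame names from the question):
data_melt$pond <- with(data_melt, metadata$pond[match(variable, metadata$sample)])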
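As a self-contained illustration of the match() lookup, here is a small sketch that rebuilds a trimmed-down version of the two toy data frames from the question (only the columns needed, with characters instead of factors):

```r
# Trimmed-down copies of the two toy data frames from the question.
data_melt <- data.frame(
  gene = c("serA", "mdh", "fdhB", "fdhA"),
  variable = c("sample.10", "sample.19", "sample.72", "sample.72"),
  stringsAsFactors = FALSE
)
metadata <- data.frame(
  sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
  pond = c("upper", "upper", "lower", "lower"),
  stringsAsFactors = FALSE
)

# Position of each data_melt sample id within metadata$sample.
idx <- match(data_melt$variable, metadata$sample)

# Use those positions to pull the matching pond names.
data_melt$pond <- metadata$pond[idx]

print(data_melt)
```

Sample ids that have no entry in metadata would get NA in the new column, since match() returns NA for them.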

I believe merge will work here.
sss <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
ss <- structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
ssss <- merge(sss, ss, by.x = "variable", by.y = "sample")

You can use left_join() from the dplyr package after renaming sample to variable in the metadata data frame.
library(tidyverse)
data_melt <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"),
process = structure(c(1L, 1L, 1L, 1L),
.Label = "energy",
class = "factor"),
category = structure(c(1L, 1L, 1L, 1L),
.Label = "metabolism",
class = "factor"),
ko = structure(1:4,
.Label = c("K00058", "K00093", "K00125", "K00148"),
class = "factor"),
variable = structure(c(1L, 2L, 3L, 3L),
.Label = c("sample.10", "sample.19", "sample.72"),
class = "factor"),
value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)),
row.names = c(NA, -4L),
class = "data.frame")
metadata <- structure(list(sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
pond = structure(c(2L, 2L, 1L, 1L),
.Label = c("lower", "upper"),
class = "factor")),
row.names = c(NA, -4L),
class = "data.frame") %>%
# Renaming the column, so we can join the two data sets together
rename(variable = sample)
data_melt <- data_melt %>%
left_join(metadata, by = "variable")

why I do not get my total count of a column with sum function in R?

I have been trying for several hours to get the total of one column using sum in R. It worked nicely in one case, but in this case it doesn't, and I do not understand why.
Here is a fake data set:
fake_data <- structure(list(age_recoded_band = c("0-19", "20-39", "40-59",
"60+"), country = c("India", "India", "India", "India"), count_age_country = c(921L,
24601L, 11446L, 2561L), comorbidities = c("asthma", "asthma",
"asthma", "asthma"), count_study_pop_comorb = c(27L, 570L, 330L,
142L), comorbidity_rate = c(0.0293159609120521, 0.0231697898459412,
0.0288310326751704, 0.0554470909800859), count_age_group_standard_pop = c(4772L,
102286L, 55505L, 12827L), total_counts_stnd_pop = c(175390L,
175390L, 175390L, 175390L), expected_comorb_study_pop = c(139.895765472313,
2369.94512418194, 1600.26646863533, 711.219836001562)), row.names = c(NA,
-4L), groups = structure(list(age_recoded_band = c("0-19", "20-39",
"40-59", "60+"), .rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I am applying sum to expected_comorb_study_pop and do not get the answer I want:
dt_total_expected_standard_pop <- fake_data %>%
dplyr::mutate(total_expected_study_pop_comorb = sum(expected_comorb_study_pop))
I actually get the same values as in expected_comorb_study_pop, which is wrong. The total is 4,819.
Does someone know why the sum function does not give me the desired output?

The data has a grouping attribute with only one row per group, so sum is computed within each group and returns exactly that single value. If we ungroup first, it will work:
library(dplyr)
fake_data %>%
ungroup() %>%
summarise(total_expected_study_pop_comorb = sum(expected_comorb_study_pop,
na.rm = TRUE))
summarise returns a single row. If we need a new column instead, use mutate.
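A minimal sketch of the mutate variant, on a made-up grouped tibble with one row per group (the column names age_band and expected are placeholders, not the columns from the question):

```r
library(dplyr)

# Made-up grouped data with exactly one row per group, as in the question.
grouped <- tibble(
  age_band = c("0-19", "20-39", "40-59", "60+"),
  expected = c(10, 20, 30, 40)
) %>%
  group_by(age_band)

# Still grouped: sum() runs per group, so each row just gets its own value back.
still_grouped <- grouped %>%
  mutate(total = sum(expected))

# Ungrouped first: sum() runs over the whole column.
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(total = sum(expected))

print(still_grouped)
print(ungrouped)
```

In the grouped version the new column simply repeats expected, which matches what the question describes; after ungroup() every row carries the overall total.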

how to reshape the matrix and fill the missing value as 0

I have a question about manipulating a matrix structure in R. I need to first transpose the matrix and combine the month and status columns, filling the missing values with 0. It seems very tricky, and I would appreciate any help. Thank you.
My data currently looks as follows:
structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
I need the reshaped matrix to have the following columns:
Customer Phase1earlyTotal Phase2earlyTotal....Phase4earlyTotal...Phase1_ Ontimetotal...Phase4_Ontimetotal...Phase1LateTotal_Phase4LateTotal. For example, Phase1earlyTotal is the sum of Amount for rows with Phase=1 and Status=Early.
Currently I use the following script, which does not work because I don't know
how to combine the Phase and Status columns.
mydata2<-data.table(mydata2,V3,V4)
mydata2$V4<-NULL
datacus <- data.frame(mydata2[-1,],stringsAsFactors = F);
datacus <- datacus %>% mutate(Phase= as.numeric(Phase),Amount=
as.numeric(Amount)) %>%
complete(Phase = 1:4,fill= list(Amount = 0)) %>%
dcast(datacus~V3, value.var = 'Amount',fill = 0) %>% select(Phase, V3)
%>%t()

I believe you are looking for something like this?
sample data
df <- structure(list(Customer = c("1096261", "1096261", "1169502",
"1169502"), Phase = c("2", "3", "1", "2"), Status = c("Ontime",
"Ontime", "Ontime", "Ontime"), Amount = c(21216.32, 42432.65,
200320.05, 84509.24)), .Names = c("Customer", "Phase", "Status",
"Amount"), row.names = c(NA, -4L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = c("Customer", "Phase"), drop = TRUE, indices
= list(
0L, 1L, 2L, 3L), group_sizes = c(1L, 1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
Customer = c("1096261", "1096261", "1169502", "1169502"),
Phase = c("2", "3", "1", "2")), row.names = c(NA, -4L), class =
"data.frame", vars = c("Customer",
"Phase"), drop = TRUE, .Names = c("Customer", "Phase")))
# Customer Phase Status Amount
# 1: 1096261 2 Ontime 21216.32
# 2: 1096261 3 Ontime 42432.65
# 3: 1169502 1 Ontime 200320.05
# 4: 1169502 2 Ontime 84509.24
code
library( data.table )
dcast( setDT( df ), Customer ~ Phase + Status, fun.aggregate = sum, value.var = "Amount" )[]
output
# Customer 1_Ontime 2_Ontime 3_Ontime
# 1: 1096261 0 21216.32 42432.65
# 2: 1169502 200320 84509.24 0.00
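The output above only contains the Phase and Status combinations that occur in the data. If every combination should appear as a column filled with 0, one option is to declare the full set of levels first. The following is a rough base R sketch with xtabs rather than data.table; the phases 1 to 4 and the status spellings "Early", "Ontime" and "Late" are assumptions taken from the question's wording, and orders is a hypothetical name.

```r
# Same four rows as the sample data, as a plain data frame.
orders <- data.frame(
  Customer = c("1096261", "1096261", "1169502", "1169502"),
  Phase = c("2", "3", "1", "2"),
  Status = c("Ontime", "Ontime", "Ontime", "Ontime"),
  Amount = c(21216.32, 42432.65, 200320.05, 84509.24),
  stringsAsFactors = FALSE
)

# Declare every level that should become a column, even if it is absent.
orders$Phase <- factor(orders$Phase, levels = as.character(1:4))
orders$Status <- factor(orders$Status, levels = c("Early", "Ontime", "Late"))

# One combined Phase_Status label per row; unused combinations are kept as levels.
orders$combo <- interaction(orders$Phase, orders$Status, sep = "_")

# Sum Amount per Customer and combination; absent combinations become 0.
wide_tab <- xtabs(Amount ~ Customer + combo, data = orders)
wide <- as.data.frame.matrix(wide_tab)

print(wide)
```

This gives one row per customer and twelve columns (four phases times three statuses), with zeros wherever a customer has no amount for that combination.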
