Reshape large dataset from long to wide with two ID variables - r

I want to change my data from long to wide format using two ID variables.
The code below works with the example dataset below. However, when I run it on the much larger dataset I am actually working with, the code runs for a very long time and doesn't seem to finish. When I use one ID variable the code runs fine, but I need to include two.
Is there a more efficient way of changing from long to wide format?
(I've also thought about creating an ID variable based on ID1 and ID2 for the purposes of converting from long to wide. Perhaps this is the best solution?)
Wide.vars <- names(df[, c("Date", "V1")])

### 1. Reshape from long to wide format with two ID variables
df_wide <- reshape(as.data.frame(df),
                   idvar = c("ID1", "ID2"),
                   direction = "wide",
                   v.names = Wide.vars,
                   timevar = "Timepoint")
Example data below (note that the example dataset is 15 rows by 5 columns, whereas the dataset I'm working with is 15658 rows by 99 columns).
df <- structure(list(ID1 = c(5643923L, 5643923L, 5643923L, 3914822L,
3914822L, 3914822L, 3914822L, 1156115L, 1506426L, 7183921L, 4753447L,
4606792L, 8492773L, 8492773L, 8492773L), ID2 = c("02179",
"02179", "04101", "00819", "00819", "00819", "00819",
"01904", "01127", "00475", "02084", "04118", "15553",
"15553", "15553"), Date = structure(c(16731, 16731,
16731, 16732, 16733, 16733, 16733, 16733, 16733, 16733, 16733,
16733, 16734, 16734, 16734), class = "Date"), Timepoint = structure(c(1L,
3L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 4L), .Label = c("baseline",
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), V1 = c(0, 8, 4, 9.5, 7, 7, 12, 9, 11, 8.4,
7.8, 6.6, 5, 5.5, 8.9)), row.names = c(NA,
-15L), groups = structure(list(CP1_t_210 = structure(1L, .Label = c("baseline",
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"),
.rows = structure(list(1:15), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

data.table is usually faster; you can try using dcast from it.
library(data.table)
dcast(setDT(df), ID1+ID2~Timepoint, value.var = c('Date', 'V1'))
As suggested by @Mark Davies, pivot_wider can also help.
tidyr::pivot_wider(df, names_from = Timepoint, values_from = c(Date, V1))
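If you prefer to stay with base reshape, the composite-ID idea mentioned in the question is also workable. A minimal sketch, assuming the separator "_" never occurs inside ID1 or ID2:
# combine ID1 and ID2 into a single id, then reshape on that
df <- as.data.frame(df)
df$ID <- paste(df$ID1, df$ID2, sep = "_")
df_wide <- reshape(df,
                   idvar = "ID",
                   drop = c("ID1", "ID2"),
                   direction = "wide",
                   v.names = c("Date", "V1"),
                   timevar = "Timepoint")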

Creating a unique id per username (dplyr) vs. Stata

I have a Reddit dataset where each row represents a single Reddit post, along with the username. Since it is Reddit data, the number of posts per username varies a lot (i.e. it depends on how active a given username is).
I am trying to create a unique id for each username. My data are structured as follows:
dput(df[1:5,c(2,3)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("تتطور", "عاطله فقط",
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
I ran the following code, where I tried to replicate the code here.
The code runs without errors, but it does not create a unique id per username.
# create an ID per observation
df <- df %>%
  group_by(username) %>%
  mutate(id = row_number()) %>%
  relocate(id)
Print data example with specific columns
dput(df[1:10,c(1,4)])
output:
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
username = c("تتطور", "عاطله فقط", "قصه ألم",
"بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم",
"مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه",
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
In Stata, I would do this as follows:
// create an id variable per username
egen id = group(username)
That's an incorrect use of group_by for your purpose: with group_by(username), row_number() counts posts within each username, so the first post of every username gets id 1. If you want an id just like your Stata code with egen, you may want to try this:
df$id <- as.integer(factor(df$username))
This produces the same id as Stata's
egen id = group(username)
Just FYI, I also tried dplyr::consecutive_id():
df %>% mutate(
  id_dplyr = dplyr::consecutive_id(username)
)
but was unable to reproduce the Stata result with your example: consecutive_id() starts a new id every time the username changes, rather than assigning one id per unique username.
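Another option that should mirror egen group() is dplyr::cur_group_id() (dplyr >= 1.0.0), which numbers groups by their sorted keys; a minimal sketch:
library(dplyr)

df %>%
  group_by(username) %>%
  mutate(id = cur_group_id()) %>%  # one id per username, numbered by sorted group key
  ungroup() %>%
  relocate(id)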

Computing share of sub-group by month and year (tidyverse)

I am trying to compute the share of entity mentions online by month, i.e. as a share of the total mentions at the monthly level rather than of the total number of mentions in my dataset.
Print data example
dput(directed_to_whom_monthly[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
To compute this, I have tried the following:
directed_to_whom_monthly %>%
  group_by(directed_to_whom) %>%   # group by entity mentioned
  group_by(month_year) %>%         # note: this replaces the previous grouping
  add_count(treatment_details) %>% # add count of treatment_details
  unique() %>%                     # remove duplicates
  ungroup() %>%                    # remove grouping
  mutate(directed_to_whom_percentage = n / sum(n)) # calculate percentage
But this essentially divides the number of mentions of entity X by all mentions in the dataset.
I have also tried a solution from here, as follows. The code runs fine, but it is not dividing by the total mentions per month.
test <- directed_to_whom_monthly %>%
  group_by(month) %>%
  mutate(per = prop.table(n) * 100)
dput(test[1:4, ])
Output:
structure(list(directed_to_whom = structure(c(3L, 2L, 3L, 3L), .Label = c("MoE",
"MoL", "Private employers"), class = "factor"), treatment_details = structure(c(2L,
2L, 2L, 1L), .Label = c("post", "pre"), class = "factor"), month_year = structure(c(2011.41666666667,
2011.41666666667, 2011.5, 2012.5), class = "yearmon"), n = c(10L,
10L, 8L, 30L), directed_to_whom_percentage = c(0.00279251605696733,
0.00279251605696733, 0.00223401284557386, 0.00837754817090198
), year = c(2011, 2011, 2011, 2012), month = c(6, 6, 7, 7), per = c(2.49376558603491,
2.49376558603491, 8, 30)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L), groups = structure(list(
month = c(6, 7), .rows = structure(list(1:2, 3:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))
I think you need to calculate counts for directed_to_whom by month, then the total count of all entries for that month, and then calculate the percentage based on that:
directed_to_whom_monthly %>%
  group_by(directed_to_whom, month_year) %>%
  mutate(direct_month_count = n()) %>%   # count of directed_to_whom by month
  group_by(month_year) %>%
  mutate(month_year_count = n()) %>%     # total count per month
  mutate(directed_to_whom_percentage = direct_month_count / month_year_count * 100) # percentage
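One caveat: n() counts rows, while the n column in the dput already holds the number of mentions per row. If that is the case for the full data, a sketch that divides each row's n by the monthly total of n may be closer to the goal (assuming one row per entity/treatment/month combination):
library(dplyr)

directed_to_whom_monthly %>%
  group_by(month_year) %>%
  mutate(month_total = sum(n),                                    # all mentions in that month
         directed_to_whom_percentage = n / month_total * 100) %>% # share of that month's mentions
  ungroup()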

R missing variable that is conditional on the others that are present

I have a dataset in R and I'm trying to fill in two missing values at the same time. I used the pad() function from library(padr) to fill out the data frame with missing date values. Now I have two additional fields that are NA.
I know what these values should be, but I can't see an easy way to code them into the data frame, and it is too long to do manually.
The missing value in the sales column should be 0. The harder part is the store column. There are three options for stores: store1, store2, store3, and each value in Date is listed three times. I don't know which store is missing for each day. In the example I'm including here, store2 is missing, but later in the data frame it might be store1 or store3. Is there a way to fill in the missing store by knowing the other two stores that are present?
Here is a section of my data frame so it's reproducible:
structure(list(date = structure(c(18628, 18628, 18628, 18629,
18629, 18629, 18630, 18630, 18630, 18631, 18631, 18631), class = "Date"),
store = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, NA, 3L, 1L,
2L, 2L), .Label = c("store1", "store2", "store3"), class = "factor"),
sales = c(153461, 2332, 1734, 176912, 53063, 17484, 243581,
NA, 412, 1739263, 427311, 9772)), row.names = c(NA, -12L), groups = structure(list(
store = structure(c(1L, 2L, 3L, NA), .Label = c("store1",
"store2", "store3"), class = "factor"), .rows = structure(list(
c(1L, 4L, 7L, 10L), c(2L, 5L, 11L, 12L), c(3L, 6L, 9L
), 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I guess you want a balanced panel (for each date, three rows, one per store). I would go about it as follows:
Create a balanced dataset with dates and stores.
stores <- c('store1', 'store2', 'store3')
dates <- seq(as.Date('2021-01-01'), as.Date('2021-07-22'), by = 'day')
data <- data.frame(expand.grid(stores, dates))
And now, left join your dataset. Where there is no match the sales column will be NA, but you can fill it with a 0 easily.
library(dplyr)  # for left_join()

names(data)[1] <- "store"
names(data)[2] <- "date"
df2 <- left_join(data, df)
df2$sales[is.na(df2$sales)] <- 0
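A tidyverse-only sketch of the same idea, assuming the NA-store rows created by padr carry no information beyond marking the gap:
library(dplyr)
library(tidyr)

stores <- c("store1", "store2", "store3")
grid <- expand.grid(date = seq(min(df$date), max(df$date), by = "day"),
                    store = stores,
                    stringsAsFactors = FALSE)

df2 <- df %>%
  ungroup() %>%
  mutate(store = as.character(store)) %>%
  filter(!is.na(store)) %>%                      # drop the padded NA-store rows
  right_join(grid, by = c("date", "store")) %>%  # recreate every date/store pair
  mutate(sales = replace_na(sales, 0)) %>%       # missing sales become 0
  arrange(date, store)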

Order dataframe/bar chart using a 2 coordinate data point - ggplot2

I have data that is summarized by a two-coordinate data point (e.g. [0,2]). However, my data frame, and therefore my bar chart, are ordered alphabetically even though the coordinate is a factor data type.
The data frame/ggplot default behavior: [0,1], [0,13], [0,2]
What I want to happen: [0,1], [0,2], [0,13]
This coordinate variable was created by pasting together numbers from 2 columns:
mutate(swimlane_coord = factor(paste0("[", sl_subsection_index, ",", sl_element_index, "]")))
where sl_subsection_index and sl_element_index are both integers.
There can be any combination of coordinates, so I would like to avoid having to manually force the factor definitions.
Here is an example of the data:
structure(list(application_type1 = c("SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV"), variant_uuid = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog"), swimlane_coord = structure(c(1L, 2L, 8L, 9L,
10L, 21L, 1L, 2L, 8L, 9L, 10L, 11L, 25L, 29L), .Label = c("[0,0]",
"[0,1]", "[0,10]", "[0,11]", "[0,12]", "[0,13]", "[0,14]", "[0,2]",
"[0,3]", "[0,4]", "[0,5]", "[0,6]", "[0,7]", "[0,8]", "[0,9]",
"[1,0]", "[1,1]", "[1,3]", "[1,4]", "[1,5]", "[1,7]", "[2,0]",
"[2,11]", "[3,1]", "[3,11]", "[3,2]", "[3,5]", "[3,6]", "[3,7]",
"[3,8]"), class = "factor"), ESPerVisitBySL = c(1.775, 1.83333333333333,
0.976190476190476, 0.966666666666667, 1.08333333333333, 1, 1.33333333333333,
1.45161290322581, 1.68965517241379, 1.44827586206897, 1.5, 1,
1, 1), UESPerVisitBySL = c(13, 16.4, 8.80952380952381, 8.4, 9.33333333333333,
1, 11.5555555555556, 17.741935483871, 16.3448275862069, 8.10344827586207,
15.3571428571429, 6, 7, 2)), row.names = c(NA, -14L), groups = structure(list(
application_type1 = c("SamsungTV", "SamsungTV"), variant_uuid = structure(1:2, .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog"), .rows = structure(list(1:6, 7:14), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Notice that [3,11] comes before [3,2].
The only packages I have loaded are tidyverse and data.table.
Thank you
Harry
To achieve your desired result you could arrange your data.frame by sl_subsection_index and sl_element_index, and after doing so set the order of swimlane_coord using forcats::fct_inorder:
library(ggplot2)
library(dplyr)
library(forcats)

d %>%
  ungroup() %>%
  mutate(
    sl_subsection_index = gsub("^\\[(\\d+),\\d+\\]$", "\\1", swimlane_coord),
    sl_element_index = gsub("^\\[\\d+,(\\d+)\\]$", "\\1", swimlane_coord)
  ) %>%
  arrange(as.integer(sl_subsection_index), as.integer(sl_element_index)) %>%
  mutate(swimlane_coord = forcats::fct_inorder(factor(swimlane_coord))) %>%
  ggplot(aes(swimlane_coord)) +
  geom_bar()
Created on 2021-06-04 by the reprex package (v2.0.0)
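An alternative, assuming the integer columns sl_subsection_index and sl_element_index are still available at the point where swimlane_coord is built, is to fix the level order at creation time instead of re-parsing the labels:
library(dplyr)
library(forcats)

d %>%
  ungroup() %>%
  arrange(sl_subsection_index, sl_element_index) %>%   # numeric sort on the source columns
  mutate(swimlane_coord = fct_inorder(
    factor(paste0("[", sl_subsection_index, ",", sl_element_index, "]"))
  ))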

Collapse and aggregate several row values by date

I've got a data set that looks like this:
date, location, value, tally, score
2016-06-30T09:30Z, home, foo, 1,
2016-06-30T12:30Z, work, foo, 2,
2016-06-30T19:30Z, home, bar, , 5
I need to aggregate these rows together, to obtain a result such as:
date, location, value, tally, score
2016-06-30, [home, work], [foo, bar], 3, 5
There are several challenges for me:
The resulting row (a daily aggregate) must include the rows for this day (2016-06-30 in my above example)
Some rows (strings) will result in an array containing all the values present on this day
Some others (ints) will result in a sum
I've had a look at dplyr, and if possible I'd like to do this in R.
Thanks for your help!
Edit:
Here's a dput of the data
structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat<-structure(list(date = structure(1:3, .Label = c("2016-06-30T09:30Z",
"2016-06-30T12:30Z", "2016-06-30T19:30Z"), class = "factor"),
location = structure(c(1L, 2L, 1L), .Label = c("home", "work"
), class = "factor"), value = structure(c(2L, 2L, 1L), .Label = c("bar",
"foo"), class = "factor"), tally = c(1L, 2L, NA), score = c(NA,
NA, 5L)), .Names = c("date", "location", "value", "tally",
"score"), class = "data.frame", row.names = c(NA, -3L))
mydat$date <- as.Date(mydat$date)
require(data.table)
mydat.dt <- data.table(mydat)
mydat.dt <- mydat.dt[, lapply(.SD, paste0, collapse = " "), by = date, .SDcols = c("location", "value")]
cbind(mydat.dt,
      aggregate(mydat[, c("tally", "score")], by = list(mydat$date), FUN = sum, na.rm = TRUE)[2:3])
which gives you:
date location value tally score
1: 2016-06-30 home work home foo foo bar 3 5
Note that you could probably do it all in one step within the data.table reshaping, but I found this two-step approach quicker and easier.
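Since the question mentions dplyr, here is a sketch of the same aggregation with summarise(), collapsing the string columns and summing the numeric ones:
library(dplyr)

mydat %>%
  mutate(date = as.Date(date)) %>%   # keep only the date part
  group_by(date) %>%
  summarise(
    location = paste(unique(location), collapse = ", "),  # all locations seen that day
    value    = paste(unique(value), collapse = ", "),     # all values seen that day
    tally    = sum(tally, na.rm = TRUE),
    score    = sum(score, na.rm = TRUE)
  )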
