R missing variable that is conditional on the others that are present - r

I have a dataset in R and I'm trying to fill out two missing values at the same time. I had used the pad function from library(padr) to fill out the data frame with missing date values. Now I have two additional fields that are NA.
I know what these values should be, but I don't know an easy way to code them into the dataframe, and the dataframe is too long to do it manually.
The missing field for the sales column should be 0. The harder part here is the store column. There are three options for stores: store1, store2, store3, and each value in the date column is listed three times. I don't know which store is missing for each day. In the example I'm including here, store2 is missing, but later in the data frame it might be store1 or store3. Is there a way to fill in the missing store by knowing the other two stores that are present?
And here is a section of it so it's reproducible.
structure(list(date = structure(c(18628, 18628, 18628, 18629,
18629, 18629, 18630, 18630, 18630, 18631, 18631, 18631), class = "Date"),
store = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, NA, 3L, 1L,
2L, 2L), .Label = c("store1", "store2", "store3"), class = "factor"),
sales = c(153461, 2332, 1734, 176912, 53063, 17484, 243581,
NA, 412, 1739263, 427311, 9772)), row.names = c(NA, -12L), groups = structure(list(
store = structure(c(1L, 2L, 3L, NA), .Label = c("store1",
"store2", "store3"), class = "factor"), .rows = structure(list(
c(1L, 4L, 7L, 10L), c(2L, 5L, 11L, 12L), c(3L, 6L, 9L
), 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

I guess you want a balanced panel (for each date, three rows, one per store). I would go as follows:
Create a balanced dataset with every combination of date and store.
library(dplyr)
stores <- c('store1', 'store2', 'store3')
# adjust the start and end dates to the range of your data
dates <- seq(as.Date('2021-01-01'), as.Date('2021-07-22'), by = 'day')
data <- expand.grid(store = stores, date = dates)
And now, left join your dataset onto it. The join leaves sales as NA wherever a store/date combination is not in your data, but you can easily fill those with 0.
df2 <- left_join(data, df, by = c("store", "date"))
df2$sales[is.na(df2$sales)] <- 0
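For reference, the same balanced-panel idea can be sketched in base R only. This is a minimal illustration on a made-up five-row data frame (the values are hypothetical and not taken from the question), using merge() with all.x = TRUE in place of left_join():

```r
# Toy data: store2 is absent on the second date (hypothetical values)
df <- data.frame(
  date  = as.Date(c("2021-01-01", "2021-01-01", "2021-01-01",
                    "2021-01-02", "2021-01-02")),
  store = c("store1", "store2", "store3", "store1", "store3"),
  sales = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Every store/date combination that should exist
grid <- expand.grid(
  store = c("store1", "store2", "store3"),
  date  = unique(df$date),
  stringsAsFactors = FALSE
)

# Keep all rows of the grid; unmatched combinations get NA sales
full <- merge(grid, df, by = c("store", "date"), all.x = TRUE)

# Missing sales are treated as zero, as stated in the question
full$sales[is.na(full$sales)] <- 0
full <- full[order(full$date, full$store), ]
print(full)
```

Because the grid already carries the store name, the missing store never has to be inferred from the two that are present.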

Related

Creating a unique id per username (dplyr) vs. Stata

I have a reddit dataset where each row represents a single reddit post, along with the username info. However, given that it's reddit data, the number of posts per username varies a lot (i.e. depending on how active a given username is on reddit).
I am trying to create a unique id for each username and my data are structured as follows:
dput(df[1:5,c(2,3)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("تتطور", "عاطله فقط",
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
I ran the following code, where I tried to replicate the code here.
The code runs without errors, but it does not create a unique id per username.
#create an ID per observation
df <- df %>%
group_by(username) %>%
mutate(id = row_number())%>%
relocate(id)
Print data example with specific columns
dput(df[1:10,c(1,4)])
output:
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
username = c("تتطور", "عاطله فقط", "قصه ألم",
"بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم",
"مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه",
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
In Stata, I would do this as follows:
// create an id variable per username
egen id = group(username)
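That's an incorrect use of group_by for your purpose: row_number() inside group_by(username) numbers the rows within each username rather than numbering the usernames themselves. If you want to get an id just like your Stata code with egen, you may want to try this: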
df$id = as.integer(factor(df$username))
This produced the same id as Stata
egen id = group(username)
Just FYI, I also tried dplyr::consecutive_id():
df %>% mutate(
id_dplyr = dplyr::consecutive_id(username)
)
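but I was unable to reproduce the Stata results with your example, because consecutive_id() starts a new id every time the value changes from one row to the next, so a username that reappears later gets a different id.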
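To make the difference concrete, here is a small base R sketch on a made-up vector of usernames (the letters are placeholders, not the data from the question). It contrasts a within-group row counter, which is what group_by() plus row_number() computes, with two ways of assigning one id per distinct username:

```r
# Hypothetical usernames; "c" and "a" each appear more than once
username <- c("c", "a", "b", "c", "c", "a")

# Counter within each username (the result the question got by accident)
within_id <- ave(seq_along(username), username, FUN = seq_along)

# One id per distinct username, numbered in sorted order of the names
group_id <- as.integer(factor(username))

# One id per distinct username, numbered by first appearance
first_seen_id <- match(username, unique(username))

print(data.frame(username, within_id, group_id, first_seen_id))
```

Note that the sort order used by factor() depends on the locale, which may matter for non-ASCII usernames.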

Reshape large dataset from long to wide with two ID variables

I want to change my data from long to wide format using two ID variables.
I have the below code that works with the below example dataset. However, when I run this code with a much larger dataset that I am working with, the code runs for a very long time and doesn't seem to finish running. When I use one ID variable the code runs fine, but I need to include two.
Is there a more efficient way of changing from long to wide format?
(I've also thought about creating an ID variable based on ID1 and ID2 for the purposes of converting from long to wide. Perhaps this is the best solution?)
Wide.vars <- names(df[,c("Date","V1")])
### 1. Reshape from long to wide format with two ID variables
df_wide <- reshape(as.data.frame(df),
idvar = c("ID1","ID2"),
direction = "wide",
v.names = Wide.vars,
timevar = "Timepoint")
Example data below (note that the dimensions of the example dataset are 15 rows 5 columns, whereas the dataset I'm working with is 15658 rows by 99 columns).
df <- structure(list(ID1 = c(5643923L, 5643923L, 5643923L, 3914822L,
3914822L, 3914822L, 3914822L, 1156115L, 1506426L, 7183921L, 4753447L,
4606792L, 8492773L, 8492773L, 8492773L), ID2 = c("02179",
"02179", "04101", "00819", "00819", "00819", "00819",
"01904", "01127", "00475", "02084", "04118", "15553",
"15553", "15553"), Date = structure(c(16731, 16731,
16731, 16732, 16733, 16733, 16733, 16733, 16733, 16733, 16733,
16733, 16734, 16734, 16734), class = "Date"), Timepoint = structure(c(1L,
3L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 4L), .Label = c("baseline",
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), V1 = c(0, 8, 4, 9.5, 7, 7, 12, 9, 11, 8.4,
7.8, 6.6, 5, 5.5, 8.9)), row.names = c(NA,
-15L), groups = structure(list(CP1_t_210 = structure(1L, .Label = c("baseline",
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"),
.rows = structure(list(1:15), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
data.table is usually faster; you can try using dcast from it.
library(data.table)
dcast(setDT(df), ID1+ID2~Timepoint, value.var = c('Date', 'V1'))
As suggested by @Mark Davies, pivot_wider can also help.
tidyr::pivot_wider(df, names_from = Timepoint, values_from = c(Date, V1))
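The combined-ID idea raised in the question can also be sketched in base R. The example below uses a tiny made-up data frame (the column name ID and all values are hypothetical) and pastes ID1 and ID2 into a single key before calling reshape(). Whether this is actually faster on the real 15658 x 99 dataset is not something this sketch shows.

```r
# Toy long-format data with two ID columns (hypothetical values)
df <- data.frame(
  ID1 = c(1L, 1L, 2L, 2L),
  ID2 = c("a", "a", "b", "b"),
  Timepoint = c("baseline", "wave1", "baseline", "wave1"),
  V1 = c(0, 8, 4, 9.5),
  stringsAsFactors = FALSE
)

# Single key built from both IDs
df$ID <- paste(df$ID1, df$ID2, sep = "_")

# Long to wide using the single key; ID1 and ID2 are constant within
# each key, so they are carried along as ordinary columns
df_wide <- reshape(df,
                   idvar = "ID",
                   timevar = "Timepoint",
                   v.names = "V1",
                   direction = "wide")
print(df_wide)
```

The separator in paste() should be a character that cannot occur in either ID, otherwise two different ID pairs could collapse into the same key.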

Order dataframe/bar chart using a 2 coordinate data point - ggplot2

I have data that is summarized by a two-coordinate data point (e.g. [0,2]). However, my data frame, and therefore my bar chart, is ordered alphabetically even though the coordinate is a factor.
The data frame/ggplot default behavior: [0,1], [0,13], [0,2]
What I want to happen: [0,1], [0,2], [0,13]
This coordinate variable was created by pasting numbers from 2 columns
mutate(swimlane_coord = factor(paste0("[", sl_subsection_index, ",", sl_element_index, "]")))
where sl_subsection_index is an integer and sl_element_index is an integer.
There can be any combination of coordinates, so I would like to avoid having to manually force the factor definitions.
Here is an example of the data:
structure(list(application_type1 = c("SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV"), variant_uuid = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog"), swimlane_coord = structure(c(1L, 2L, 8L, 9L,
10L, 21L, 1L, 2L, 8L, 9L, 10L, 11L, 25L, 29L), .Label = c("[0,0]",
"[0,1]", "[0,10]", "[0,11]", "[0,12]", "[0,13]", "[0,14]", "[0,2]",
"[0,3]", "[0,4]", "[0,5]", "[0,6]", "[0,7]", "[0,8]", "[0,9]",
"[1,0]", "[1,1]", "[1,3]", "[1,4]", "[1,5]", "[1,7]", "[2,0]",
"[2,11]", "[3,1]", "[3,11]", "[3,2]", "[3,5]", "[3,6]", "[3,7]",
"[3,8]"), class = "factor"), ESPerVisitBySL = c(1.775, 1.83333333333333,
0.976190476190476, 0.966666666666667, 1.08333333333333, 1, 1.33333333333333,
1.45161290322581, 1.68965517241379, 1.44827586206897, 1.5, 1,
1, 1), UESPerVisitBySL = c(13, 16.4, 8.80952380952381, 8.4, 9.33333333333333,
1, 11.5555555555556, 17.741935483871, 16.3448275862069, 8.10344827586207,
15.3571428571429, 6, 7, 2)), row.names = c(NA, -14L), groups = structure(list(
application_type1 = c("SamsungTV", "SamsungTV"), variant_uuid = structure(1:2, .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog"), .rows = structure(list(1:6, 7:14), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Notice that [3,11] comes before [3,2].
The only packages I have loaded are tidyverse and data.table.
Thank you
Harry
To achieve your desired result you could first arrange your data.frame by sl_subsection_index and sl_element_index, and then set the order of swimlane_coord using forcats::fct_inorder:
library(ggplot2)
library(dplyr)
library(forcats)
d %>%
ungroup() %>%
mutate(
sl_subsection_index = gsub("^\\[(\\d+),\\d+\\]$", "\\1", swimlane_coord),
sl_element_index = gsub("^\\[\\d+,(\\d+)\\]$", "\\1", swimlane_coord)
) %>%
arrange(as.integer(sl_subsection_index), as.integer(sl_element_index)) %>%
mutate(swimlane_coord = forcats::fct_inorder(factor(swimlane_coord))) %>%
ggplot(aes(swimlane_coord)) +
geom_bar()
Created on 2021-06-04 by the reprex package (v2.0.0)
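The same ordering logic can be sketched without forcats, assuming the labels always have the form "[a,b]" with non-negative integers. The coords vector below is a made-up sample, not the full data from the question:

```r
# Sample coordinate labels in alphabetical (i.e. wrong) order
coords <- c("[0,1]", "[0,13]", "[0,2]", "[3,11]", "[3,2]")

# Pull the two integers back out of each label
sub_idx <- as.integer(sub("^\\[(\\d+),\\d+\\]$", "\\1", coords))
el_idx  <- as.integer(sub("^\\[\\d+,(\\d+)\\]$", "\\1", coords))

# Order by first coordinate, then second, and use that as the level order
ord <- order(sub_idx, el_idx)
coord_factor <- factor(coords, levels = unique(coords[ord]))

print(levels(coord_factor))
```

ggplot2 uses the level order of a factor for a discrete axis, so a factor built this way should plot in numeric coordinate order.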

character and date variable chart

I have a data set that looks like this: the first two columns are just IDs and the last is the date.
I need to find a relation between them in R, but I am lost, since my first problem is how to visualize my data correctly. I have the id as a factor, but each time I make a plot it shows a numeric value instead.
You might start visualizing the relationship between your variables using pairs.panels from the psych package. Here is the output using the sample data you shared. Note that the data points are sparse here, but your full data set has more.
library(psych)
pairs.panels(df)
Output
Data
structure(list(id1 = structure(c(6L, 2L, 2L, 1L, 5L, 4L, 5L,
3L), .Label = c("10017097", "17596277", "20501146", "3603827",
"57106539", "7596227"), class = "factor"), id2 = structure(c(3L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("10122", "10197", "13840"
), class = "factor"), t_date = structure(c(17966, 17590, 17956,
17984, 17478, 17483, 17513, 17544), class = "Date")), class = "data.frame", row.names = c(NA,
-8L))
The documentation is available at pairs.panels.

R Wide to long format for multiple variables with patterns [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I have a data set with a single identifier and five columns that repeat 18 times. I want to restructure the data into long format keeping the first five column headings as the column headings. Below is a sample with just two repeats:
structure(list(Response.ID = 1:2, Task = structure(c(1L, 1L), .Label = "task1", class = "factor"),
Freq = structure(c(1L, 1L), .Label = "Daily", class = "factor"),
Hours = c(3L, 2L), Value = c(10L, 8L), Mood = structure(1:2, .Label = c("Engaged",
"Neutral"), class = "factor"), Task.1 = structure(c(1L, 1L
), .Label = "task2", class = "factor"), Freq.1 = structure(c(1L,
1L), .Label = "Weekly", class = "factor"), Hours.1 = c(4L,
4L), Value.1 = c(10L, 6L), Mood.1 = structure(c(2L, 1L), .Label = c("Neutral",
"Optimistic"), class = "factor")), .Names = c("Response.ID", "Task", "Freq", "Hours", "Value", "Mood", "Task.1", "Freq.1", "Hours.1", "Value.1", "Mood.1"), class = "data.frame", row.names = c(NA, -2L))
I attempted using the melt and patterns functions, which appears to approximate my desired outcome without the desired column headings:
df = melt(df1, id.vars = c("Response.ID"), measure.vars = patterns("^Task", "^Freq","^Hours","^Mood"))
Here is the result:
structure(list(Response.ID = c(1L, 2L, 1L, 2L), variable = structure(c(1L, 1L, 2L, 2L), class = "factor", .Label = c("1", "2")), value1 = c("task1", "task1", "task2", "task2"), value2 = c("Daily", "Daily", "Weekly", "Weekly"), value3 = c(3L, 2L, 4L, 4L), value4 = c("Engaged", "Neutral", "Optimistic", "Neutral")), .Names = c("Response.ID", "variable", "value1", "value2", "value3", "value4"), row.names = c(NA, -4L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000330788>)
When I tried to specify names with value.name() below I receive an error:
df = melt(df1, id.vars = c("Response.ID"),measure.vars = patterns("^Task", "^Freq","^Hours","^Mood"), value.name=c("Task", "Freq", "Hours", "Value","Mood"))
My desired result would look like this:
structure(list(Response.ID = c(1L, 2L, 1L, 2L), Task = structure(c(1L, 1L, 2L, 2L), .Label = c("task1", "task2"), class = "factor"),
Freq = structure(c(1L, 1L, 2L, 2L), .Label = c("Daily", "Weekly"
), class = "factor"), Hours = c(3L, 2L, 4L, 4L), Value = c(10L,
8L, 10L, 6L), Mood = structure(c(1L, 2L, 3L, 2L), .Label = c("Engaged",
"Neutral", "Optimistic"), class = "factor")), .Names = c("Response.ID", "Task", "Freq", "Hours", "Value", "Mood"), class = "data.frame", row.names = c(NA, -4L))
It looks to me like you embarked on a difficult journey by using melt: this function is well named in the sense that trying to use it will probably melt your brain. Joking aside, melt does a lot of work under the hood and can be inefficient on a large dataset.
I would instead solve the problem manually with rbindlist (from the excellent package data.table, which also ships with an optimized version of melt if you really want to use it) to concatenate the groups of columns. This also preserves the column names:
> rbindlist(lapply(1:2, function(i) df1[,c(1,((i-1)*5+2):((i-1)*5+6))]))
Response.ID Task Freq Hours Value Mood
1: 1 task1 Daily 3 10 Engaged
2: 2 task1 Daily 2 8 Neutral
3: 1 task2 Weekly 4 10 Optimistic
4: 2 task2 Weekly 4 6 Neutral
This works on your example: replace the indices 1:2 with 1:18 (the number of repetitions) to make it work with the real dataset, i.e. lapply(1:18, ...).
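The block-stacking idea above can also be written in base R, with the block size as a parameter instead of the hard-coded 5. The sketch below uses a reduced, made-up version of the data with two columns per repeat; block_size and n_blocks are illustrative names, and for the real data they would be 5 and 18:

```r
# Reduced toy data: one ID column, then two repeats of (Task, Hours)
df1 <- data.frame(
  Response.ID = 1:2,
  Task = "task1", Hours = c(3L, 2L),
  Task.1 = "task2", Hours.1 = c(4L, 4L),
  stringsAsFactors = FALSE
)

block_size <- 2  # columns per repeat (5 in the question)
n_blocks   <- 2  # number of repeats (18 in the question)

# Column names of the first block become the names of the long result
base_names <- names(df1)[2:(1 + block_size)]

blocks <- lapply(seq_len(n_blocks), function(i) {
  cols  <- (i - 1) * block_size + 1 + seq_len(block_size)
  block <- df1[, c(1, cols)]
  names(block) <- c("Response.ID", base_names)
  block
})

# Stack the blocks on top of each other
long <- do.call(rbind, blocks)
print(long)
```

This assumes the repeated columns appear in the same order within every block, as they do in the question's sample.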

Resources