Create full data frame from possible combinations of grouping variables - r

I apologize if this has been asked before, but I could not find the answer I needed when there are three grouping variables.
I need to fill a data frame with all possible combinations of the grouping variables, inserting NA for the non-grouping value whenever a combination does not appear. Say there is a data frame with three grouping variables: Year, Geography, and Grouping:
Year <- rep(2008:2019, each = 50)
Geography <- rep(1:60, each = 10)
Grouping <- rep(1:4, each = 150)
value <- rnorm(600, mean = 0, sd = 1)
# Combine all three grouping variables and the value into one data frame
df <- data.frame(Year, Geography, Grouping, value)
But the dataframe is missing some random observations like so:
df2=df[-c(15,60,150,510),]
How would one go about changing the dataframe back into a length of 600 (which is the length it would be if all possible combinations of three grouping variables were present), but inserting NAs where the value would be if the combinations were in the dataframe? Note that all unique observations for each grouping variable are present in the dataset at some point.
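One common way to do this in base R is to build every combination of the observed levels with expand.grid() and then left-join the data onto that grid with merge(), so absent combinations come back with NA. The sketch below uses a small hypothetical data frame (the `full`, `partial`, `grid`, and `completed` names are made up for illustration) and assumes each combination appears at most once in the data:

```r
# Toy data (hypothetical): one row per Year/Geography/Grouping combination
full <- expand.grid(Year = 2008:2010, Geography = 1:3, Grouping = 1:2)
full$value <- seq_len(nrow(full))

# Drop a few rows to mimic the missing combinations
partial <- full[-c(2, 7, 15), ]

# Every combination of the levels that still appear somewhere in the data
grid <- expand.grid(
  Year = unique(partial$Year),
  Geography = unique(partial$Geography),
  Grouping = unique(partial$Grouping)
)

# Left join: keep all rows of the grid, fill unmatched values with NA
completed <- merge(grid, partial,
                   by = c("Year", "Geography", "Grouping"),
                   all.x = TRUE)

nrow(completed)              # 18 rows, one per combination
sum(is.na(completed$value))  # 3 combinations were missing
```

If tidyr is available, tidyr::complete(partial, Year, Geography, Grouping) should give an equivalent result. Either approach only restores the original row count when every combination is meant to occur exactly once.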

Related

I want to select all rows in a larger dataset whose identification number exists in another dataset in R

I have a large dataset, let's call it df1 (4226 observations X 186 variables).
I used a package called naniar to assess missingness, and created a dataset that shows, for each observation, the percentage of missing data. I then filtered that dataset to show only the observations (rows) with less than 50% missing data. Finally, I created a dataset of just the row numbers of all rows that meet the missingness criterion; we can call this df2.
Now, I want to create a subset of dataset df1 using the data in df2 (2044 observations X 1 variable).
Can anyone help me here?
I have tried something like:
df3 <- df2[df2$row %in% df1]
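A minimal sketch of the two usual ways to do this, depending on what df2 actually holds. The data below is a hypothetical stand-in, and the `id` column and `keep` data frame are assumptions made for illustration:

```r
# Hypothetical stand-ins for the data described above
df1 <- data.frame(id = 101:110, x = letters[1:10])
df2 <- data.frame(row = c(2, 5, 9))

# If df2$row holds row positions in df1, index df1 by position
df3 <- df1[df2$row, ]

# If the second dataset instead holds identification numbers,
# keep the rows of df1 whose id appears in it
keep <- data.frame(id = c(102, 105, 109))
df3_by_id <- df1[df1$id %in% keep$id, ]
```

Note that the subset is taken from df1 (the large dataset), and the comma inside the brackets selects rows rather than columns.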

R function to subset DF by top X most frequent occurrences of categorical variable?

I have a ~150,000 observation DF of 23 columns, one of which is "groupname" which includes over 1000 different group names.
How can I subset this dataframe to only include the group names of the top 10, 25, or 50 most frequently occurring group names?
Thanks in advance
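One possible approach is to count the group names with table(), sort the counts, and keep the rows whose name is among the first n. The sketch below uses a tiny hypothetical data frame; only the `groupname` column comes from the question, and ties at the cutoff are resolved by whatever order sort() returns:

```r
# Hypothetical data: group "a" appears 5 times, "b" 3, "c" 2, "d" 1
df <- data.frame(
  groupname = rep(c("a", "b", "c", "d"), times = c(5, 3, 2, 1)),
  value = 1:11
)

n <- 2  # number of most frequent groups to keep (10, 25, or 50 in the question)

# Frequency of each group name, most frequent first
counts <- sort(table(df$groupname), decreasing = TRUE)

# Names of the top n groups (or all groups if there are fewer than n)
top_groups <- names(counts)[seq_len(min(n, length(counts)))]

# Keep only the rows belonging to those groups
top_df <- df[df$groupname %in% top_groups, ]
```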

R, how can I sum the values of a variable for unique id-numbers in another variable

I am using R. I have a dataset "df" with the columns {"id-number", "A"}. How can I sum the values in A for unique id-numbers? That is, let's say id-number has the values {1,1,2,2,3,3} and the corresponding A column has the values {10,10,40,40,30,30}. I want R to do 10+40+30. This question is probably a duplicate, but I couldn't find its duplicate.
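A short sketch using duplicated() to keep only the first row for each id before summing. It assumes A is constant within each id-number, as in the example values given; the `first_rows` and `total` names are made up for illustration:

```r
# Data from the question; check.names = FALSE keeps the hyphenated column name
df <- data.frame(
  `id-number` = c(1, 1, 2, 2, 3, 3),
  A = c(10, 10, 40, 40, 30, 30),
  check.names = FALSE
)

# TRUE for the first occurrence of each id-number, FALSE for repeats
first_rows <- !duplicated(df[["id-number"]])

# Sum A over one row per id: 10 + 40 + 30
total <- sum(df$A[first_rows])
total  # 80
```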

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset <- data.frame(
  Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
  YearN = c(2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017,
            2015, 2016, 2017),
  Question = c(rep('Overall Satisfaction', 5),
               rep('Overall Cleanliness', 5),
               rep('Overall Satisfaction', 4),
               rep('Overall Cleanliness', 4),
               rep('Overall Satisfaction', 3),
               rep('Overall Cleanliness', 3)),
  ScoreYearN = runif(24, min = 0.6, max = 1),
  TotalYearN = round(runif(24, min = 1000, max = 5000), 0))
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year
The second column contains the total number of respondents for the given question in the given hospital for the previous year
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
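You're close! Just use dplyr to group by hospital and question before calling lag(), so the lag restarts within each group (this relies on the rows already being sorted by year within each group, as they are here):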
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))

finding the mean of columns in a data frame in R

I have a vector that contains 50 data frames of re-sampled data. So all of the column names are consistent in each data frame but the numeric values are different. Each data frame consists of 12 rows. How can I find the mean value of each row in one particular column between the 50 data frames and place the 12 mean values into a new one column data frame?
If you want the mean of a specific column that exists in every data frame in your list, collected into a data frame of its own, you can use dplyr and purrr. The call below returns one mean per data frame:
library(dplyr)
library(purrr)
map2_df(your_list, "column_name", ~summarize_at(.x, .y, mean))
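If the goal is instead one mean per row position across the 50 data frames (12 values in total), as the question describes, a base R sketch could look like the following. The list contents here are randomly generated placeholders, and `col_matrix` and `row_means` are names made up for illustration:

```r
# Hypothetical list of 50 resampled data frames, each with 12 rows
set.seed(1)
your_list <- lapply(1:50, function(i) {
  data.frame(column_name = rnorm(12), other = runif(12))
})

# Pull the chosen column out of every data frame: a 12 x 50 matrix,
# one column per data frame
col_matrix <- sapply(your_list, function(d) d[["column_name"]])

# Average across the 50 data frames, row by row, into a one-column data frame
row_means <- data.frame(mean_value = rowMeans(col_matrix))

dim(row_means)  # 12 rows, 1 column
```

This assumes the rows are in the same order in every data frame, so that row i always refers to the same observation.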
