I have a large dataset, let's call it df1 (4226 observations x 186 variables).
I used a package called naniar to assess missingness and created a dataset that shows, for each observation, the percentage of missing data. I then filtered that dataset to keep only the observations (rows) with less than 50% missing data. Finally, I created a dataset containing just the row numbers of all rows that meet the missingness criterion; we can call this df2.
Now, I want to create a subset of dataset df1 using the data in df2 (2044 observations X 1 variable).
Can anyone help me here?
I have tried something like:
df3 <- df2[df2$row %in% df1]
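The attempt above indexes df2 rather than df1, and `%in% df1` compares the row numbers against the columns of df1. A minimal sketch of one way to do the subsetting, assuming df2 has a single column named row (taken from the attempt) that holds row positions in df1. The toy df1 and the rowMeans(is.na(...)) step are stand-ins for the real data and the naniar summary, not the actual workflow:

```r
# Toy stand-in for df1 (the real one is 4226 x 186)
df1 <- data.frame(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, NA, "c", "d", "e")
)

# Proportion of missing values per row (base R stand-in for the naniar summary)
prop_missing <- rowMeans(is.na(df1))

# df2: row numbers of rows with less than 50% missing data
# (the column name `row` is an assumption taken from the attempt above)
df2 <- data.frame(row = which(prop_missing < 0.5))

# Subset df1 by row position; the trailing comma keeps all columns
df3 <- df1[df2$row, ]
```

If the row numbers are not needed elsewhere, df1[rowMeans(is.na(df1)) < 0.5, ] should give the same subset in one step.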
I have a ~150,000 observation DF of 23 columns, one of which is "groupname" which includes over 1000 different group names.
How can I subset this dataframe to include only the rows belonging to the 10, 25, or 50 most frequently occurring group names?
Thanks in advance
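One possible approach, sketched in base R: count the group names with table(), sort the counts, and keep the rows whose groupname is among the first n. The helper name top_n_groups and the toy data are made up for illustration; only the groupname column comes from the question.

```r
# Toy stand-in for the ~150,000-row data frame
df <- data.frame(
  groupname = c("a", "b", "a", "c", "b", "a", "d"),
  value = 1:7
)

# Hypothetical helper: keep rows belonging to the n most frequent group names
top_n_groups <- function(data, n) {
  counts <- sort(table(data$groupname), decreasing = TRUE)
  keep <- head(names(counts), n)  # ties at the cutoff are broken arbitrarily
  data[data$groupname %in% keep, , drop = FALSE]
}

top2 <- top_n_groups(df, 2)  # rows for "a" (3 occurrences) and "b" (2 occurrences)
```

Calling top_n_groups(df, 10), top_n_groups(df, 25), or top_n_groups(df, 50) on the real data would give the three subsets asked about.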
I am using R. I have a dataset "df" with the columns {"id-number", "A"}. How can I sum the values in A over the unique id-numbers? For example, say id-number has the values {1,1,2,2,3,3} and the corresponding A column has the values {10,10,40,40,30,30}. I want R to compute 10+40+30. This question is probably a duplicate, but I couldn't find the original.
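A minimal sketch using the example values from the question, assuming A is constant within each id-number so that the first occurrence of each id is representative:

```r
# Example data from the question; check.names = FALSE keeps the hyphenated name
df <- data.frame(
  `id-number` = c(1, 1, 2, 2, 3, 3),
  A = c(10, 10, 40, 40, 30, 30),
  check.names = FALSE
)

# TRUE for the first row of each id-number, FALSE for repeats
first_per_id <- !duplicated(df[["id-number"]])

# Sum A over one row per id: 10 + 40 + 30
total <- sum(df$A[first_per_id])
total  # 80
```

If A can differ within an id, this keeps whichever value comes first, so a different rule (for example a per-id mean) would be needed.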
I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset <- data.frame(
  Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
  YearN = c(2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017,
            2015, 2016, 2017),
  Question = c(rep('Overall Satisfaction', 5),
               rep('Overall Cleanliness', 5),
               rep('Overall Satisfaction', 4),
               rep('Overall Cleanliness', 4),
               rep('Overall Satisfaction', 3),
               rep('Overall Cleanliness', 3)),
  ScoreYearN = runif(24, min = 0.6, max = 1),
  TotalYearN = round(runif(24, min = 1000, max = 5000), 0)
)
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year
The second column contains the total number of respondents for the given question in the given hospital for the previous year
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
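You're close! Just use dplyr to group by hospital and question before lagging, so that each Hospital-Question combination gets its own lag and the first year in each group becomes NA:
# lag() takes the previous row within each group, so this assumes the rows
# are already in year order within each Hospital-Question combination
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))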
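The grouped lag takes the previous row, which matches the previous year only when every group is sorted by year and has no missing years. If that can't be guaranteed, one alternative (not the method above, just a base R sketch) is to shift the year by one and join the data back onto itself. The toy data below is made up and includes a gap (no 2017 for hospital A) to show the difference:

```r
# Toy data with a missing year (hospital A has no 2017 row)
toy <- data.frame(
  Hospital = c("A", "A", "A", "B", "B"),
  Question = "Overall Satisfaction",
  YearN = c(2015, 2016, 2018, 2015, 2016),
  ScoreYearN = c(0.70, 0.75, 0.80, 0.90, 0.85),
  TotalYearN = c(1000, 1100, 1200, 2000, 2100)
)

# Copy of the data where each row is relabelled as belonging to the following year
prev <- toy
prev$YearN <- prev$YearN + 1
names(prev)[names(prev) == "ScoreYearN"] <- "ScoreYearN-1"
names(prev)[names(prev) == "TotalYearN"] <- "TotalYearN-1"

# Left join: each row picks up the values from exactly one year earlier, or NA
toy_lagged <- merge(toy, prev,
                    by = c("Hospital", "Question", "YearN"),
                    all.x = TRUE)
```

Here the 2018 row for hospital A gets NA rather than the 2016 values, because there is no 2017 row to take them from.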
I have a list that contains 50 data frames of re-sampled data. The column names are the same in every data frame, but the numeric values differ. Each data frame has 12 rows. How can I find the mean value of each row in one particular column across the 50 data frames, and place the 12 mean values into a new one-column data frame?
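If you want the mean of a specific column from each data frame in your list, collected into a data frame of its own, you can use dplyr and purrr. Note that this gives one mean per data frame (50 values for a list of 50), not one mean per row position.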
library(dplyr)
library(purrr)
map2_df(your_list, "column_name", ~summarize_at(.x, .y, mean))
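If the goal is instead the 12 row-wise means across the 50 data frames, as the question describes, a base R sketch could look like the following. The names your_list and column_name are the placeholders from the answer above, and the simulated data is only there to make the example runnable:

```r
set.seed(1)

# Simulated stand-in: a list of 50 data frames, each with 12 rows
your_list <- replicate(
  50,
  data.frame(column_name = rnorm(12), other = runif(12)),
  simplify = FALSE
)

# Pull the column out of every data frame: a 12 x 50 matrix,
# one column per data frame, one row per row position
values <- sapply(your_list, function(d) d[["column_name"]])

# Average across the 50 data frames for each of the 12 row positions
row_means <- data.frame(mean_value = rowMeans(values))
```

This relies on every data frame having its rows in the same order, so that row 1 in one data frame corresponds to row 1 in the others.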