Transpose data frame variables and add null, unique counts in R

I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better, faster way to do this. Is there a function that does this?
# trying to write the functions within the dplyr & magrittr framework
library(tidyverse)
library(reshape2) # melt() comes from reshape2, which tidyverse does not attach
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
                                      Nulls = nulls$value,
                                      Filled = filled$value,
                                      Complete = Filled/Total,
                                      Cardinality = uniques$value,
                                      Uniqueness = Cardinality/Total,
                                      Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt

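The summarise_all() function can apply several functions at once, so you can consolidate the code into a single pass and then reshape the result into the per-variable "profile" you want. With several named functions, each output column is named variable_function (for example, mpg_Total), which is what separate() splits on.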
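library(tidyverse)
library(reshape2) # provides melt(); not attached by tidyverse
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
  # one pass over all columns; results come back as <variable>_<measure>
  summarise_all(funs("Total" = n(),
                     "Nulls" = sum(is.na(.)),
                     "Filled" = sum(!is.na(.)),
                     "Cardinality" = length(unique(.)))) %>%
  melt() %>% # wide to long: one row per <variable>_<measure>
  separate(variable, into = c('variable', 'measure'), sep="_") %>%
  spread(measure, value) %>% # one row per variable, one column per measure
  mutate(Complete = Filled/Total,
         Uniqueness = Cardinality/Total,
         Distinctness = Cardinality/Filled)
DataProfile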
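Note that funs() has been deprecated since dplyr 0.8.0, and summarise_all(), gather() and spread() have been superseded by across(), pivot_longer() and pivot_wider(). The following is a sketch of the same one-pass idea with the newer verbs, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0. The names df and profile and the "__" separator are my own choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

df <- mtcars
df[2, 2] <- NA  # same test NA as in the question (row 2 of cyl)

profile <- df %>%
  # apply all four summaries to every column in one pass;
  # output columns are named <column>__<measure>
  summarise(across(everything(),
                   list(Total = ~ length(.x),
                        Nulls = ~ sum(is.na(.x)),
                        Filled = ~ sum(!is.na(.x)),
                        Cardinality = ~ n_distinct(.x)),
                   .names = "{.col}__{.fn}")) %>%
  # split the column names back into variable and measure
  pivot_longer(everything(),
               names_to = c("variable", "measure"),
               names_sep = "__") %>%
  # one row per variable, one column per measure
  pivot_wider(names_from = measure, values_from = value) %>%
  mutate(Complete = Filled / Total,
         Uniqueness = Cardinality / Total,
         Distinctness = Cardinality / Filled)
profile
```

Like length(unique(.)), n_distinct() counts NA as a distinct value, so Cardinality matches the original definition. The double underscore keeps the split unambiguous even if a column name contains a single "_".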

Related

R - filtering and merging multiple objects

Can anyone tell me an elegant solution to achieve the same result with less code? As far as I know, there is no "un-filter" in R.
test1 <- dat_long %>%
  filter(time == "0month") %>%
  merge(blood_m1, by="ID")
test2 <- dat_long %>%
  filter(time == "3month") %>%
  merge(blood_m2, by="ID")
test3 <- dat_long %>%
  filter(time == "4month") %>%
  merge(blood_m3, by="ID")
test_long <- test3 %>%
  bind_rows(test2) %>%
  bind_rows(test1)
Basically, I want to build the "test_long" df as a single object, with all steps connected by %>%. Thanks!
I think you want a cleaner blood_all lookup table to begin with. Here's a suggestion:
library(dplyr)
blood_all <-
  # create named list; each name is the "time" value that table belongs to
  list("0month" = blood_m1, "3month" = blood_m2, "4month" = blood_m3) %>%
  # bind into a single tibble, placing names into "time" column
  bind_rows(.id = "time")
test <- dat_long %>%
  merge(blood_all, by=c("ID", "time")) # or inner_join to keep it tidyverse
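To see the lookup-table idea end to end, here is a small sketch on toy data. The inputs are invented for illustration: the columns score and hb are assumptions, and only two time points are used. It uses inner_join(), the tidyverse option mentioned in the comment above.

```r
library(dplyr)

# Toy inputs, invented for illustration
dat_long <- tibble(ID    = c(1, 1, 2, 2),
                   time  = c("0month", "3month", "0month", "3month"),
                   score = c(10, 11, 20, 21))
blood_m1 <- tibble(ID = c(1, 2), hb = c(13.1, 14.2))  # measured at 0month
blood_m2 <- tibble(ID = c(1, 2), hb = c(12.9, 14.0))  # measured at 3month

# Stack the per-time tables; the list names become the "time" column
blood_all <- bind_rows(list("0month" = blood_m1, "3month" = blood_m2),
                       .id = "time")

# One join on both keys replaces the filter + merge + bind_rows steps
test_long <- inner_join(dat_long, blood_all, by = c("ID", "time"))
test_long
```

Joining on both ID and time is what makes the separate filter() calls unnecessary: each row of dat_long only matches the blood table for its own time point.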

Create loop to change structure of multiple data frames in R

I have a bunch of Excel files that I have loaded into R as separate data frames. I now need to change the structure/layout of every one of these data frames. I have done all of this separately, but it is becoming very time consuming, and I am not sure whether there is a better way to accomplish this. My guess is that I need to combine them all into a list and then write some type of loop to go through every data frame in that list. For each data frame I need to remove rows and columns from the edges, put 'row' in the top-left cell, which is currently empty, and then apply the pivot_longer, mutate, and select steps listed below, which so far I have run on each data frame separately.
names(df)[1] <- 'row'
df <- df %>%
  pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
  mutate(wellID = paste0(row, plateColumn)) %>%
  select(-c(row, plateColumn))
I have tried what is below and I get an error. Does anyone have a better way than what I am currently doing to accomplish this?
for(x in seq_along(files.list)){
names(files.list)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
}
If you have a vector of filenames, my_files, I think this will work:
library(tidyverse)
library(readxl)
prepare_df <- function(df) {
  # make changes to df
  names(df)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  return(df)
}
names(my_files) <- my_files # often useful if the vector we're mapping over has names
dfs <- map(my_files, read_excel) # read into a list of data frames
dfs <- map(dfs, prepare_df) # prepare each one
df <- bind_rows(dfs, .id = "file") # if you prefer one data frame instead

How to print a grouped_df grouped by two variables on two tables with dplyr in R

I want to group by two variables, compute a mean for the groups, then print the result on distinct tables.
Unlike the code below, where all my means end up in a single table, I would like one output table for x==1 and another one for x==2.
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
y=factor(sample(letters[1:2],10,rep=TRUE)),
z=1:10)
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
print(res)
res %>% knitr::kable() %>% kableExtra::kable_styling()
You want separate outputs for when x==1 and x==2. A simple way with dplyr would be to filter:
library(dplyr)
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
y=factor(sample(letters[1:2],10,rep=TRUE)),
z=1:10)
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
x1 = res %>%
  filter(x == 1)
x2 = res %>%
  filter(x == 2)
x1 %>% knitr::kable() %>% kableExtra::kable_styling()
x2 %>% knitr::kable() %>% kableExtra::kable_styling()
I'm not sure why you have this line of code:
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
It doesn't create a new object, so its output won't be available to subsequent lines of code. If you did use it, it would give you the mean of z for each x value, without splitting by y.
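If x has more than two levels, writing one filter() per level gets repetitive. One possible alternative, offered as a sketch and not as part of the answer above, is to split the summary by x with base R's split() and loop over the pieces. The toy data here is deterministic (my own choice, so the result is reproducible), and the name tables is hypothetical.

```r
library(dplyr)

# Deterministic toy data, invented for illustration
data <- tibble(x = factor(rep(1:2, each = 5)),
               y = factor(rep(letters[1:2], 5)),
               z = 1:10)

res <- data %>%
  group_by(x, y) %>%
  summarize(Mean_z = mean(z), .groups = "drop")

# split() returns a named list with one table per level of x
tables <- split(res, res$x)

# print each table separately
for (level in names(tables)) {
  cat("x ==", level, "\n")
  print(tables[[level]])
}
```

Each element of tables can be piped into knitr::kable() and kableExtra::kable_styling() in the same way as x1 and x2 above.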

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
[Image: data snapshot]
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125,130),2),0),
                  diastolic = round(sample(runif(1000, 85,85),3),0),
                  heart_rate = round(sample(runif(1000, 79,86),2),0),
                  phys_score = round(sample(runif(1000, 8,9),2),0)))
However, this works only when a single NA needs replacing (in that case the systolic NA value was replaced successfully). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, though it may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                               size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
Next, a function that replaces the NA values of a given variable by sampling between that variable's min and max within each group (here, make):
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(.dots = group) %>%
    # draw one candidate per row, between the group's min and max of var
    mutate(replace_var := round(runif(n(), min(!!var, na.rm = TRUE),
                                      max(!!var, na.rm = TRUE)), 0)) %>%
    rowwise %>%
    # use the candidate only where var is NA
    mutate_at(.vars = vars(!!var),
              .funs = funs(replace_na(., replace_var))) %>%
    select(-replace_var) %>%
    ungroup
}
An example that replaces several missing values in multiple columns:
mtcars_replaced <- mtcars_miss %>%
  replace_na_sample(cyl, group = "make") %>%
  replace_na_sample(disp, group = "make") %>%
  replace_na_sample(hp, group = "make")
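replace_na_sample() relies on funs() and group_by(.dots = ), both of which are deprecated in current dplyr. As a rough sketch of the same idea with newer syntax, assuming dplyr >= 1.0.0, the helper below fills each NA with a rounded uniform draw between the group's observed min and max. The function name fill_na_uniform and the toy data are hypothetical, not taken from the answer.

```r
library(dplyr)

# Hypothetical helper: within each group, replace NAs in `var` with a rounded
# draw from a uniform distribution between the group's observed min and max.
fill_na_uniform <- function(df, var, group) {
  df %>%
    group_by(across(all_of(group))) %>%
    mutate({{ var }} := ifelse(
      is.na({{ var }}),
      round(runif(n(),
                  min({{ var }}, na.rm = TRUE),
                  max({{ var }}, na.rm = TRUE))),
      {{ var }}
    )) %>%
    ungroup()
}

# Toy data, invented for illustration
toy <- tibble(
  make = c("A", "A", "A", "B", "B", "B"),
  hp   = c(100, NA, 120, 60, 70, NA),
  cyl  = c(4, 4, NA, NA, 6, 8)
)

set.seed(42)  # only so the draws are repeatable
filled <- toy %>%
  fill_na_uniform(hp, group = "make") %>%
  fill_na_uniform(cyl, group = "make")
filled
```

Because runif(n(), ...) draws one value per row, every NA in a group gets its own replacement, which avoids the "length 2, not length 1" error from replace_na(). A group with no observed values would still produce warnings and NaN, as in the original function.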

Can I create a data.frame in R from an existing data.frame by assigning a list of col.names?

I have a data.frame where I assign each column.name a vector of variables:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doesn't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is if combining multiple data sets that share the same col.names (and in which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
Ok, here's the first part:
# assign to dat.new so that dat2 is not overwritten
dat.new <-
  dat1 %>%
  setNames(names(.) %>%
             paste0("1"))
Here's the second part. The reshaping is a bit complex but more flexible, especially if you already have row ids and the data frames have different numbers of rows:
list(dat1, dat2) %>%
  bind_rows(.id = "number") %>%
  group_by(number) %>%
  mutate(id = 1:n()) %>%
  gather(variable, value, -number, -id) %>%
  unite(new_variable, variable, number) %>%
  spread(new_variable, value)
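When the data frames have the same number of rows and are already in matching row order, a simpler base R route may be enough: rename each one with setNames() and bind them side by side. This is a minimal sketch under that assumption, and add_suffix is a hypothetical helper name.

```r
dat1 <- data.frame(a = 1:5, b = 1:5, c = 1:5)
dat2 <- data.frame(a = 6:10, b = 6:10, c = 6:10)

# Hypothetical helper: append a suffix to every column name
add_suffix <- function(df, suffix) {
  setNames(df, paste0(names(df), suffix))
}

# Column-bind the renamed copies; rows are matched by position
dat.new <- cbind(add_suffix(dat1, "1"), add_suffix(dat2, "2"))
names(dat.new)
```

Unlike the reshaping approach, this matches rows purely by position, so it is only appropriate when both inputs have the same number of rows in the same order.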
