Comparing Column names in R across various data frames


I am currently trying to compare the column classes and names of various data frames in R before undertaking any transformations and calculations.
The code I have is below:
library(dplyr)
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
out <- cbind(sapply(m1, class), sapply(m2, class), sapply(m3, class))
If someone can solve this for dataframes stored in a list, that would be great. All my dataframes are currently stored in a list, for easier processing.
All.list <- list(m1,m2,m3)
I expect the output to be displayed in matrix form, as in the object "out". However, the output in "out" is not what I want, because it is incorrect. I am expecting something more along the following lines:

Try compare_df_cols() from the janitor package:
library(janitor)
compare_df_cols(All.list)
#> column_name All.list_1 All.list_2 All.list_3
#> 1 am numeric numeric numeric
#> 2 carb numeric numeric numeric
#> 3 cyl numeric factor factor
#> 4 disp numeric numeric numeric
#> 5 drat numeric numeric numeric
#> 6 gear numeric numeric numeric
#> 7 hp numeric numeric numeric
#> 8 mpg numeric numeric numeric
#> 9 qsec numeric numeric numeric
#> 10 vs numeric numeric numeric
#> 11 wt numeric numeric numeric
#> 12 xxxx1 <NA> factor <NA>
#> 13 xxxx2 <NA> <NA> factor
It accepts both a list and/or the individual named data.frames, i.e., compare_df_cols(m1, m2, m3).
Disclaimer: I maintain the janitor package to which this function was recently added - posting it here as it addresses exactly this use case.

I think the easiest way would be to define a function, and then use a combination of lapply and dplyr to obtain the result you want. Here is how I did it.
library(dplyr)
library(tidyr) # spread() comes from tidyr, not dplyr
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
All.list <- list(m1,m2,m3)
##Define a function to get variable names and types
my_function <- function(data_frame){
require(dplyr)
x <- tibble(`var_name` = colnames(data_frame),
`var_type` = sapply(data_frame, class))
return(x)
}
target <- lapply(1:length(All.list),function(i)my_function(All.list[[i]]) %>%
mutate(element =i)) %>%
bind_rows() %>%
spread(element, var_type)
target
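For comparison, the same kind of table can be sketched in base R alone. This is only an illustration, not the method of either answer above: compare_classes is a hypothetical helper name and the df_1, df_2, ... labels are an assumption. It aligns the classes by column name, which is what the cbind() attempt in the question misses: there, position alone decides which values share a row.

```r
m1 <- mtcars
m2 <- transform(mtcars, cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- transform(mtcars, cyl = factor(cyl), xxxx2 = factor(cyl))
All.list <- list(m1, m2, m3)

# Hypothetical helper: one row per column name, one column per data frame
compare_classes <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  out <- sapply(dfs, function(d) {
    # collapse multi-class columns (e.g. ordered factors) into a single string
    cls <- vapply(d, function(col) paste(class(col), collapse = "/"), character(1))
    unname(cls[all_cols]) # NA where the data frame lacks the column
  })
  dimnames(out) <- list(all_cols, paste0("df_", seq_along(dfs)))
  out
}

class_matrix <- compare_classes(All.list)
class_matrix
```

Rows where the entries differ, or where one is NA, are the columns to reconcile before binding the data frames together.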

Related

Reconcile dataset column types (formats) using a dictionary/list in R/dplyr

Following on the renaming request #67453183 I want to do the same for formats using the dictionary, because it won't bring together columns of distinct types.
I have a series of data sets and a dictionary to bring these together, but I'm struggling to figure out how to automate this. Suppose this data and dictionary (the actual one is much longer, which is why I want to automate):
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("factor", "numeric")
)
I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.
And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.
# @RonakShah proposed this in the previous question:
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
flatten_chr() -> val
mtcars_all <- list(mtcarsA,mtcarsB) %>%
map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
This works well in the previous example, but not if the formats vary. Here it throws an error:
Error: Can't combine ..1$cyl_true <double> and ..2$cyl_true <factor<51fac>>.
This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.
Desired output:
mtcars_all
# A tibble: 4 x 3
mpg_true cyl_true disp
<factor> <numeric> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258

Something simpler:
library(magrittr) # %<>% is cool
library(dplyr)
# The renaming is easy:
renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name
mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))
# Formatting is a little harder:
formats <- dic$true_format
names(formats) <- dic$true_name
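for (x in names(formats)) {
  # there's no nice programmatic way to do this, I think
  coercer <- switch(formats[[x]],
    factor = as.factor,
    # go through character first: as.numeric() on a factor returns the level codes
    numeric = function(v) as.numeric(as.character(v)),
    # stop rather than warn: a warning here would leave coercer as a string, not a function
    stop("Unrecognized format: ", formats[[x]])
  )
  mtcarsA[[x]] <- coercer(mtcarsA[[x]])
  mtcarsB[[x]] <- coercer(mtcarsB[[x]])
}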
mtcars_all <- bind_rows(mtcarsA, mtcarsB)
In the background you should be aware of how base R treated concatenating factors before 4.1.0, and how this'll change. Here it probably doesn't matter because bind_rows will use the vctrs package.
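To make the factor remark concrete: before R 4.1.0, c() on factors returned the underlying integer codes, while from 4.1.0 on it returns a factor with the union of the levels. A small sketch of a combination that behaves the same on either version is below. The vectors f1 and f2 are made up for illustration.

```r
f1 <- factor(c("4", "6"))
f2 <- factor(c("6", "8"))

# Go through character so the result does not depend on how c() treats factors
combined <- factor(c(as.character(f1), as.character(f2)))

levels(combined)
as.character(combined)
```

If the values are really numbers, as.numeric(as.character(combined)) recovers them. Calling as.numeric() directly on the factor would return the level codes instead.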

I took another approach than Ronak's to read the dictionary. It is more verbose but I find it a bit more readable. A benchmark would be interesting to see which one is faster ;-)
Unfortunately, it seems that you cannot blindly cast a variable to a factor so I switched to character instead. In practice, it should behave exactly like a factor and you can call as_factor() on the end object if this is very important to you. Another possibility would be to store a casting function name (such as as_factor()) in the dictionary, retrieve it using get() and use it instead of as().
library(tidyverse)
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("numeric", "character") #instead of factor
)
dic2 = dic %>%
pivot_longer(-c(true_name, true_format), names_to=NULL)
read_dic = function(key, dict=dic2){
x = dict[dict$value==key,][["true_name"]]
if(length(x)!=1) x=key
x
}
rename_from_dic = function(df, dict=dic2){
rename_with(df, ~{
map_chr(.x, ~read_dic(.x, dict))
})
}
cast_from_dic = function(df, dict=dic){
mutate(df, across(everything(), ~{
cl=dict[dict$true_name==cur_column(),][["true_format"]]
if(length(cl)!=1) cl=class(.x)
as(.x, cl, strict=FALSE)
}))
}
list(mtcarsA,mtcarsB) %>%
map(rename_from_dic) %>%
map_df(cast_from_dic)
#> # A tibble: 4 x 3
#> mpg_true cyl_true disp
#> <dbl> <chr> <dbl>
#> 1 21 6 160
#> 2 21 6 160
#> 3 22.8 4 108
#> 4 21.4 6 258
Created on 2021-05-09 by the reprex package (v2.0.0)
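The last idea mentioned in this answer, storing the name of a casting function in the dictionary and looking it up at run time, could look roughly like the base R sketch below. The caster column, the toy data frame df and its values are assumptions made for illustration. match.fun() is used here in place of get() because it fails early if the name does not resolve to a function.

```r
# Hypothetical dictionary: the caster column holds function names as strings
dic_cast <- data.frame(
  true_name = c("mpg_true", "cyl_true"),
  caster    = c("as.numeric", "as.factor"),
  stringsAsFactors = FALSE
)

# Toy data with the "wrong" types on purpose
df <- data.frame(
  mpg_true = c("21", "22.8"),
  cyl_true = c(6, 4),
  disp     = c(160, 108),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(dic_cast))) {
  col  <- dic_cast$true_name[i]
  cast <- match.fun(dic_cast$caster[i]) # look the function up by name
  df[[col]] <- cast(df[[col]])
}

sapply(df, class)
```

Columns that do not appear in the dictionary (disp here) are left untouched.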

Creating data frame with repeat rows

I want to create a data frame with rows that repeat.
Here is my original dataset:
> mtcars_columns_a
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 disp mtcars mtcars$disp 230.72188
3 hp mtcars mtcars$hp 146.68750
Here is my desired dataset:
> mtcars_columns_b
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 mpg mtcars mtcars$mpg 20.09062
3 disp mtcars mtcars$disp 230.72188
4 disp mtcars mtcars$disp 230.72188
5 hp mtcars mtcars$hp 146.68750
6 hp mtcars mtcars$hp 146.68750
I know how to do this the long way manually, but this is time consuming and rigid. Is there a quicker way to do this that is more automated and flexible?
Here is the code I used to create the dataset:
# mtcars data
## displays data
mtcars
## 3 row data set
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_a <-
data.frame(
c(
"mpg",
"disp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_a)[names(mtcars_columns_a) == 'c..mpg....disp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_a$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_a$data_set_and_variables_interest <-
paste(mtcars_columns_a$data_set,mtcars_columns_a$variables_interest,sep = "$")
### creates mean column
mtcars_columns_a$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$hp)
)
## 6 row data set, the long way
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_b <-
data.frame(
c(
"mpg",
"mpg",
"disp",
"disp",
"hp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_b)[names(mtcars_columns_b) == 'c..mpg....mpg....disp....disp....hp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_b$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_b$data_set_and_variables_interest <-
paste(mtcars_columns_b$data_set,mtcars_columns_b$variables_interest,sep = "$")
### creates mean column
mtcars_columns_b$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$disp),
mean(mtcars$hp),
mean(mtcars$hp)
)

You can try rep like below
mtcars_columns_a[rep(seq(nrow(mtcars_columns_a)), each = 2),]
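The rep() call above works by building a vector of row indices and then subsetting with it. A small sketch with a made-up three-row data frame (toy is not from the question) shows how each and times differ:

```r
toy <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)

idx_each  <- rep(seq_len(nrow(toy)), each = 2)  # every row twice, kept adjacent
idx_times <- rep(seq_len(nrow(toy)), times = 2) # the whole block repeated twice

dup_each  <- toy[idx_each, , drop = FALSE]
dup_times <- toy[idx_times, , drop = FALSE]

dup_each$id
dup_times$id
```

Changing the 2 to any other count, or passing a vector of counts to times, gives a different number of copies per row without any manual typing.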

Another option is uncount
library(dplyr)
library(tidyr)
mtcars_columns_a %>%
uncount(2)

Based on your expected output is this the sort of thing you were after?
The selection of required variables is made with the select function and the mean calculated using the summarise function following group_by variables.
The duplication of data and adding of additional variables (not really sure if these are necessary) is carried out using mutate.
You can edit variable names using the dplyr::rename function.
library(dplyr)
library(tidyr)
df <-
mtcars %>%
select(mpg, disp, hp) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value))
df1 <-
bind_rows(df, df) %>%
arrange(name) %>%
mutate(dataset = "mtcars",
variable = paste(dataset, name, sep = "$"))
df1
#> # A tibble: 6 x 4
#> name mean dataset variable
#> <chr> <dbl> <chr> <chr>
#> 1 disp 231. mtcars mtcars$disp
#> 2 disp 231. mtcars mtcars$disp
#> 3 hp 147. mtcars mtcars$hp
#> 4 hp 147. mtcars mtcars$hp
#> 5 mpg 20.1 mtcars mtcars$mpg
#> 6 mpg 20.1 mtcars mtcars$mpg
Created on 2021-04-06 by the reprex package (v1.0.0)

The order of records in a data.frame object is usually not meaningful, so you could just do:
rbind(mtcars_columns_a, mtcars_columns_a)
If you need it to be in the order you showed, this is also simple:
mtcars_columns_b <- rbind(mtcars_columns_a, mtcars_columns_a)
mtcars_columns_b[order(match(mtcars_columns_b$variables_interest, mtcars_columns_a$variables_interest)), ]

Pass a list of variable names to a function using {{foo}}

Problem
I would like to know how to pass a list of variable names to a purrr::map2 function for the purpose of iterating over a separate data frame.
The input_table$key variable below contains mpg and disp from the mtcars dataset. I think the names of the variables are being passed as character strings rather than variable names. The question is how I can change that so that my function recognises that they are variable names(?).
In this example I am trying to sum all of the values in the mtcars variables mpg and disp that fall below a set of numeric thresholds. Those variables from mtcars and the relevant thresholds are contained in input_table (below).
Ideal result
percentile key value sum_y
<fct> <chr> <dbl> <dbl>
1 0.5 mpg 19.2 266.5
2 0.9 mpg 30.1 515.8
3 0.99 mpg 33.4 609.0
4 1 mpg 33.9 642.9
5 ... ... ... ...
Attempt
library(dplyr)
library(purrr)
library(tidyr)
# Arrange a generic example
# Replicating my data structure
input_table <- mtcars %>%
as_tibble() %>%
select(mpg, disp) %>%
map_df(quantile, probs = c(0.5, 0.90, 0.99, 1)) %>%
mutate(
percentile = factor(c(0.5, 0.90, 0.99, 1))
) %>%
select(
percentile, mpg, disp
) %>%
gather(key, value, -percentile)
# Defining the function
test_func <- function(label_desc, threshold) {
mtcars %>%
select({{label_desc}}) %>%
filter({{label_desc}} <= {{threshold}}) %>%
summarise(
sum_y = sum(as.numeric({{label_desc}}), na.rm = T)
)
}
# Demo'ing that it works for a single variable and threshold value
test_func(label_desc = mpg, threshold = 19.2)
# This is where I am having trouble
# Trying to iterate over multiple (mpg, disp) variables
map2(input_table$key, input_table$value, ~test_func(label_desc = .x, threshold = .y))

The issue is that curly-curly ({{ }}) is meant for unquoted variable names, as in your first attempt. In your second attempt you are passing quoted names (character strings), which the curly-curly operator does not handle. A simple fix would be to use the _at variants of the dplyr verbs, which accept quoted arguments.
test_func <- function(label_desc, threshold) {
mtcars %>%
filter_at(label_desc, any_vars(. <= threshold)) %>%
summarise_at(label_desc, sum)
}
purrr::map2(input_table$key, input_table$value, test_func)
#[[1]]
# mpg
#1 266.5
#[[2]]
# mpg
#1 515.8
#[[3]]
# mpg
#1 609
#[[4]]
# mpg
#1 642.9
#[[5]]
# disp
#1 1956.7
#.....
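The scoped _at verbs were superseded in dplyr 1.0.0, so another way to accept column names as strings is the .data pronoun. The sketch below is an alternative, not the answer's own code: test_func_chr is a hypothetical name, and the two thresholds are illustrative values rather than the full input_table.

```r
library(dplyr)

# Hypothetical rewrite: label_desc is a character string, threshold a number
test_func_chr <- function(label_desc, threshold) {
  mtcars %>%
    filter(.data[[label_desc]] <= threshold) %>%
    summarise(sum_y = sum(.data[[label_desc]], na.rm = TRUE))
}

keys       <- c("mpg", "disp")
thresholds <- c(19.2, 196.3) # assumed values, for illustration only

# Map() pairs each key with its threshold, like purrr::map2()
results <- Map(test_func_chr, keys, thresholds)
results
```

Because the column is looked up by name inside .data, the same function works unchanged when the names come from a column such as input_table$key.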

how to get row names from the apply() function output?

I'm learning R and have a tibble with some World Bank data. I used the apply() function in a slice of the columns and applied the standard deviation over the values, in this way: result <- apply(df[6:46],2,sd,na.rm=TRUE).
The result is an object with two columns with no header, one column is all the names of the tibble columns that were selected and the other one is the standard deviation for each column. When I use the typeof() command in the output, the result is 'double'. The R documentation says the output of apply() is a vector, an array or a list.
I need to know this because I want to extract all the row names, but rownames(result) returns NULL. What can I do to extract the row names of this object? Please help.
I tried rownames(result) and row.names(result), and neither worked.

We can use stack to convert the named vector output into a data frame.
temp <- stack(apply(df[6:46],2,sd,na.rm=TRUE))
Now, we can access all the column names with temp$ind and values of sd in temp$values.
Using mtcars as example,
temp <- stack(apply(mtcars, 2, sd, na.rm = TRUE))
temp
# values ind
#1 6.02695 mpg
#2 1.78592 cyl
#3 123.93869 disp
#4 68.56287 hp
#5 0.53468 drat
#6 0.97846 wt
#7 1.78694 qsec
#8 0.50402 vs
#9 0.49899 am
#10 0.73780 gear
#11 1.61520 carb
We can also use this with sapply and lapply
stack(sapply(mtcars,sd, na.rm = TRUE))
#and
stack(lapply(mtcars,sd, na.rm = TRUE))

Here, sd returns a single value, and because apply is called with MARGIN = 2 (i.e., column-wise), we get a named vector. So names(out) returns the names, whereas row.names does not. Using a reproducible example with the built-in dataset iris:
data(iris)
out <- apply(iris[1:4], 2, sd, na.rm = TRUE)
names(out)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Also, by wrapping the output of apply with data.frame, we can use the row.names
out1 <- data.frame(val = out)
row.names(out1)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
If we need a data.frame as output, this can be created directly with a data.frame call
data.frame(names = names(out), values = out)
Also, this can be done in tidyverse
library(dplyr)
library(tidyr)
iris %>%
summarise_if(is.numeric, sd, na.rm = TRUE) %>%
gather
# key value
#1 Sepal.Length 0.8280661
#2 Sepal.Width 0.4358663
#3 Petal.Length 1.7652982
#4 Petal.Width 0.7622377
Or convert to a list and enframe
library(tibble)
iris %>%
summarise_if(is.numeric, sd, na.rm = TRUE) %>%
as.list %>%
enframe

Pass column name to function from mutate_each

I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evaluation to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))

I'd suggest taking a different approach. Instead of using mutate_each, a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiply each column by a different integer
mtcars %>%
dplyr::tbl_df() %>%
dplyr::mutate(make_and_model = rownames(mtcars)) %>%
tidyr::gather(key, value, -make_and_model) %>%
dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
value = value * m) %>%
dplyr::select(-m) %>%
tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the multiplicative values used.
mtcars %>%
tidyr::gather() %>%
dplyr::mutate(m = as.integer(factor(key))) %>%
dplyr::select(-value) %>%
dplyr::distinct()
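In dplyr 1.0.0 and later, the column name the question asks for is available directly through cur_column() inside across(). The sketch below is not part of the answer above and assumes such a dplyr version. It rebuilds param.A from the question with setNames().

```r
library(dplyr)

# Same per-column multipliers as in the question: mpg = 1, cyl = 2, ..., carb = 11
param.A <- setNames(seq_along(mtcars), names(mtcars))

# cur_column() returns the name of the column currently being transformed
scaled <- mtcars %>%
  mutate(across(everything(), ~ .x * param.A[[cur_column()]]))

head(scaled, 3)
```

This keeps the data in wide form, so no gather/spread round trip is needed and the original column order is preserved.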
