show unique values for each column - r

I am trying to create a data frame holding the column type and the unique values of each column.
I can get the column type in the desired data-frame format using map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class), but I cannot get the unique values into a data frame instead of a list.
Below is a small data frame and code that returns the unique values as a list, not a data frame. Ideally I could do this in one (map) call, but joining two results afterwards would not be a big deal.
df <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
library(tidyverse)
map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class)
map(df, unique)
When I try to use the same method on map(df, unique) as on map(df, class), I get the following error: Error: Argument 2 must be length 3, not 2. That is expected, but I am not sure how to get around it.

The number of unique values differs between the two columns, so bind_rows() cannot line them up. You need to reduce each column's unique values to a single element, for example one comma-separated string.
df2 <- map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
> df2
# A tibble: 2 x 2
col_name col_unique
<chr> <chr>
1 v1 1,2,3
2 v2 a,b
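Since the question also asked for the class and the unique values in one step, here is a minimal sketch that builds both columns in a single tibble. The name summary_tbl is only illustrative, and class(.x)[1] assumes that the first class is the one of interest when a column has several classes.

```r
library(purrr)
library(tibble)

df <- data.frame(v1 = c(1, 2, 3, 2), v2 = c("a", "a", "b", "b"))

# One row per column: its name, its (first) class, and its unique values
# collapsed into a single string so that every column yields exactly one element.
summary_tbl <- tibble(
  col_name   = names(df),
  col_class  = unname(map_chr(df, ~ class(.x)[1])),
  col_unique = unname(map_chr(df, ~ toString(unique(.x))))
)

summary_tbl
```

Collapsing to a string loses the original types; if those matter, keeping the unique values as a list-column (map(df, unique)) is an alternative.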

We could use map_df to get the class and unique values of each column into one tibble. Since the columns can hold values of different types, we need to convert them to one common class (character) to bind the data together in one data frame.
purrr::map_df(df,~tibble::tibble(class = class(.), value = as.character(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1
#2 numeric 2
#3 numeric 3
#4 factor a
#5 factor b
Or if you want to have only one value for every column, we could do
map_df(df, ~tibble(class = class(.), value = toString(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1, 2, 3
#2 factor a, b
Same in base R using lapply
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = as.character(unique(x)))))
and
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = toString(unique(x)))))

To address OP's comment asking about enframe and unnest I set up a benchmark.
set.seed(123)
df <- data.frame(v1 = sample(1:100000,10000000, replace = TRUE),
v2 = sample(c(letters,LETTERS),10000000, replace = TRUE))
library(tidyverse)
map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
#> # A tibble: 2 x 2
#> col_name col_unique
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j~
map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest()
#> # A tibble: 2 x 2
#> name value
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642,59~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j,c,~
microbenchmark::microbenchmark(
bind_gather = map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique) ,
frame_unnest = map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest() ,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> bind_gather 581.6403 594.6479 615.0841 612.9336 618.3057 697.6204 10
#> frame_unnest 568.6620 590.0003 604.2774 606.5676 624.8159 630.2372 10
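enframe() %>% unnest() appears marginally faster than bind_rows() %>% gather() here, but the medians differ by only about 1% (roughly 607 ms vs. 613 ms), which is well within the run-to-run spread, so the two are effectively equivalent.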
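As a side note, when each column is already reduced to a single string, the unnest() step may not be needed at all. The sketch below assumes the small example data frame from the question and uses map_chr() so that enframe() receives a plain named character vector; the column names col_name and col_unique are taken from the earlier answer.

```r
library(purrr)
library(tibble)

df <- data.frame(v1 = c(1, 2, 3, 2), v2 = c("a", "a", "b", "b"))

# map_chr() returns a named character vector (one string per column),
# which enframe() turns directly into a two-column tibble.
uniques <- map_chr(df, ~ paste(unique(.x), collapse = ","))
result  <- enframe(uniques, name = "col_name", value = "col_unique")

result
```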

Does this work for you?
data.table::rbindlist(list(map(df, class), map(df, function(x) list(unique(x)))))

Related

Combine attributes from two columns and sum the values from duplicate rows

This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: the value 50 in the second row corresponds to the value broad in the first row. Similarly, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In simple words, the values for name c belong to the preceding values for name a.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, also "c_" can be added to the names with paste.
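The renaming mentioned above can be sketched in base R without reshape2. This is only an illustration under the assumption that the rows strictly alternate between "a" and "c"; the helper names is_a, paired and out are hypothetical.

```r
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                  name = c("a", "c", "a", "c", "a", "c", "a", "c"),
                  value = c("broad", 50, "mangrove", 50, "mangrove", 50, "coniferous", 50),
                  stringsAsFactors = FALSE)

# Assumption: every "a" row is immediately followed by its "c" row.
is_a <- df1$name == "a"
paired <- data.frame(ID    = df1$ID[is_a],
                     name  = paste("c", df1$value[is_a], sep = "_"),
                     value = as.numeric(df1$value[!is_a]))

# Sum the values of duplicated ID/name combinations, then order by ID.
out <- aggregate(value ~ ID + name, data = paired, FUN = sum)
out <- out[order(out$ID, out$name), ]
rownames(out) <- NULL
out
```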
Using tidyverse:
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
The main problem with your data frame is that a single column contains both names and values, and that is the first thing you should fix. My advice is to always reshape the original data frame into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage all the power of the tidyverse, data.table, or your framework of choice.
Notice that the temporary variable value_a could be included in the pipeline directly; I have kept it separate for clarity. The main idea is to separate values and species into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.
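For illustration, one way to fold that step into the pipeline is dplyr::lag(), which is a swapped-in technique rather than the code of the answer above. The sketch assumes that each "c" row immediately follows its "a" row; the column name species is hypothetical.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                  name = c("a", "c", "a", "c", "a", "c", "a", "c"),
                  value = c("broad", 50, "mangrove", 50, "mangrove", 50, "coniferous", 50),
                  stringsAsFactors = FALSE)

result <- df1 %>%
  # Assumption: the row before each "c" row is the matching "a" row.
  mutate(species = dplyr::lag(value)) %>%
  filter(name == "c") %>%
  group_by(ID, name = paste0(name, "_", species)) %>%
  summarise(value = sum(as.numeric(value)), .groups = "drop")

result
```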
Might not be the most elegant, but it works:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is something improved that is actually human-readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
additional option
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(id_cols = ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = TRUE), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)

Efficient tidyverse way of finding value at given date in data set with historic changes

Given a tibble of historic changes I want to efficiently create a tibble with what value each id had at a given date.
Example
library(lubridate)
library(tidyverse)
df <- tribble(
~id, ~value, ~date_created,
1, "a", as_date("2020-01-01"),
1, "b", as_date("2020-01-06"),
1, "c", as_date("2020-02-01"),
2, "Y", as_date("2020-01-01"),
2, "Z", as_date("2020-01-02")
)
# function should output a tibble with one row per id with the value it had at that date
get_value_at_date <- function(df, date){}
get_value_at_date(df, as_date("2019-01-01")) should have output tribble(~id,~value,1,NA,2,NA)
get_value_at_date(df, as_date("2020-01-06")) should have output tribble(~id,~value,1,"b",2,"Z")
get_value_at_date(df, as_date("2020-03-01")) should have output tribble(~id,~value,1,"c",2,"Z")
Example Solution Function
get_value_at_date <- function(df, date){
# find the last change before the date
value_at_date_df <- df %>%
arrange(id, date_created) %>%
group_by(id) %>%
filter(date_created <= date) %>%
slice_tail(n = 1) %>%
select(id,value)
# value could be of many class types, and need a unique NA for each
value_class <- class(df %>% select(value) %>% pull())
# we're assuming as.CLASS(NA) works for all CLASS inputs
bespoke_na <- eval(parse(text=paste0("as.",value_class,"(NA)")))
# find any that have been removed so should be blank
missed_ids <- df %>%
anti_join(value_at_date_df, by = "id") %>%
pull(id) %>%
unique()
# make it a df
missed_ids_df <- tibble(
id = missed_ids,
value = bespoke_na
)
# attach the 2 dfs
out_df <- bind_rows(value_at_date_df,missed_ids_df) %>%
arrange(id) %>%
ungroup()
return(out_df)
}
I have the following two issues with my solution:
it seems fairly slow, especially when scaled up to the actual data (of the order of thousands of rows).
the use of eval to guess the class of NA does not feel like good practice. The reason for this is the function's input tibble could have the value column as any class. I do not know if for every class --class-- the function as.--class-- exists.
The .preserve argument to filter keeps groups that end up empty, which removes the need to handle removed groups separately.
last uses dplyr:::default_missing to pick a sensible missing value, but that can be overridden if needed.
get_value_at_date_2 <- function(df, date){
df %>%
group_by(id) %>%
dplyr::filter(date_created <= date, .preserve = TRUE) %>%
summarize(value = dplyr::last(value, order_by = date_created))
}
get_value_at_date_2(df, as_date("2019-01-01"))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> id value
#> <dbl> <chr>
#> 1 1 <NA>
#> 2 2 <NA>
get_value_at_date_2(df, as_date("2020-01-06"))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> id value
#> <dbl> <chr>
#> 1 1 b
#> 2 2 Z
get_value_at_date_2(df, as_date("2020-03-01"))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> id value
#> <dbl> <chr>
#> 1 1 c
#> 2 2 Z
(Benchmarking is left out, as it is better conducted at mid to actual scale on real data, and would probably be meaningless on the example data above. If performance remains an issue, consider the data.table package, which can be combined with the tidyverse.)
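On the question's concern about eval(parse()): if a class-specific NA is ever needed explicitly, indexing the vector with NA_integer_ is a common base R idiom that returns a missing value of the same class. The helper name typed_na is hypothetical, and the sketch assumes the class has a sensible `[` method, which holds for the common atomic types, factors and dates.

```r
# Return a length-one missing value with the same class as x,
# without constructing a call such as as.<class>(NA) from a string.
typed_na <- function(x) x[NA_integer_]

na_chr  <- typed_na(c("a", "b"))            # character NA
na_date <- typed_na(as.Date("2020-01-01"))  # Date NA
na_fct  <- typed_na(factor(c("x", "y")))    # factor NA, levels are kept

class(na_chr)
class(na_date)
class(na_fct)
```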
I ended up running a benchmark, and the performance is not really better
create_df <- function(n, rows_to_groups_ratio, seed = 123) {
set.seed(seed)
tibble(
id = sample(sample(n %/% rows_to_groups_ratio, 1), n, replace = TRUE),
value = sample(c(letters, LETTERS), n, replace = TRUE),
date_created = sample(seq(as.Date("2019-01-01"), as.Date("2020-07-01"), "day"),
n, replace = TRUE)
)
}
mybench <- bench::press(
n = c(1e3, 10e3, 100e3, 1e6),
rows_to_groups_ratio = c(3, 5, 10, 50, 100),
{
df <- create_df(n, rows_to_groups_ratio)
date <- as.Date("2020-01-01")
bench::mark(
get_value_at_date(df, date),
get_value_at_date_2(df, date)
)
}
)
autoplot(mybench)

Compare column values against another column

I have the following data:
set.seed(1)
data <- data.frame(
id = 1:500, ht_1 = rnorm(500,10:20), ht_2 = rnorm(500,15:25),
ht_3 = rnorm(500,20:30), ht_4 = rnorm(500,25:35),
ht_5 = rnorm(500,20:40)
)
I would like to identify the values in columns ht_1:ht_4 that are greater than the values in column ht_5 (number of observations and means).
For each of these columns, I would then like to replace any values that are greater than ht_5 with ht_5.
Hi you can use the mutate_at function like this:
library(tidyverse)
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~if_else(.x > ht_5, ht_5, .x))
In this case you can also use pmin instead of if_else which should be faster.
data %>% as_tibble %>%
mutate_at(vars(paste0("ht_", 1:4)), ~pmin(.x, ht_5))
To see how many values are greater than ht_5 you can use the summarize_at function:
data %>% as_tibble %>%
summarize_at(vars(paste0("ht_", 1:4)), ~ length(.x[.x > ht_5]))
# A tibble: 1 x 4
ht_1 ht_2 ht_3 ht_4
<int> <int> <int> <int>
1 6 39 131 258
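The question also asked for the means of the values that exceed ht_5. A minimal base R sketch that reports both the counts and the means, and then caps the values, could look like this. The names cols, exceeds, summary_tbl and capped are hypothetical, and a column with no exceeding values would give NaN for its mean.

```r
set.seed(1)
data <- data.frame(
  id = 1:500, ht_1 = rnorm(500, 10:20), ht_2 = rnorm(500, 15:25),
  ht_3 = rnorm(500, 20:30), ht_4 = rnorm(500, 25:35),
  ht_5 = rnorm(500, 20:40)
)

cols <- paste0("ht_", 1:4)

# Logical matrix: TRUE where a value in ht_1..ht_4 is greater than ht_5.
exceeds <- sapply(data[cols], function(x) x > data$ht_5)

# Count and mean of the exceeding values, one row per column.
summary_tbl <- data.frame(
  column       = cols,
  n_greater    = colSums(exceeds),
  mean_greater = sapply(cols, function(cn) mean(data[[cn]][exceeds[, cn]])),
  row.names    = NULL
)

# Replace every exceeding value with the corresponding ht_5 value.
capped <- data
capped[cols] <- lapply(data[cols], pmin, data$ht_5)

summary_tbl
```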

Return column names based on condition

I've a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation, simple example below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
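The question's edit mentions that some columns contain NAs, in which case x == max(x) returns NA. A hedged variant of the same idea ignores missing values with na.rm = TRUE and which(). The data below are made up for illustration, and a row consisting only of NAs would produce a warning and an empty string.

```r
# Hypothetical example data with missing values.
Df <- data.frame(a = c(4, 3, NA), b = c(4, 5, 6), c = c(3, NA, 5))
cols <- c("a", "b", "c")

# For each row, take the maximum while ignoring NAs; which() drops the
# NA comparisons, so only columns that truly hold the maximum are pasted.
Df$maxcol <- apply(Df[cols], 1, function(x) {
  paste(names(x)[which(x == max(x, na.rm = TRUE))], collapse = "")
})

Df
```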
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
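We first convert the data frame from wide to long using gather. Then, grouping by row, we paste together the column names that hold the maximum value, and finally spread the long data frame back to wide. (Note that in the code above the column called values holds the column names and key holds the numbers.)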
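gather and spread have since been superseded in tidyr by pivot_longer and pivot_wider. A sketch of the same idea with pivot_longer, assuming a reasonably recent dplyr and tidyr, is shown below; the names column, value and maxcols are illustrative.

```r
library(dplyr)
library(tidyr)

Df <- tibble(a = 4:2, b = 4:6, c = 3:5)

# Long format: one row per (row, column) pair, then keep the names of the
# columns whose value equals the row-wise maximum.
maxcols <- Df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "column", values_to = "value") %>%
  group_by(row) %>%
  summarise(maxcol = paste(column[value == max(value)], collapse = ""),
            .groups = "drop")

# The summary is ordered by row, so it can be attached back directly.
Df$maxcol <- maxcols$maxcol
Df
```

Summarising instead of spreading back avoids the second reshape, because the original wide data frame is still available.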
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)

issues calculating rowwise maximum

Suppose I have the tibble dat below. I would like to calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but none work the way I want; I am still struggling with the string vs. column name distinction.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
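As a variation on Solution 1, the row-wise apply can be replaced by pmax, which computes parallel maxima in a vectorised way, and all groups can be handled in one call. This is a sketch rather than the answer's own code: the function name max_minus_first_all is hypothetical, and it assumes that column names have the form "<group> <number>" and that the first column of each group is the one to subtract.

```r
dat <- data.frame(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
                  `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4,
                  check.names = FALSE)

max_minus_first_all <- function(dat) {
  # Group labels are everything before the first space in the column names.
  groups <- unique(sub(" .*$", "", names(dat)))
  sapply(groups, function(g) {
    # Match on "<group> " so that group "a" does not pick up e.g. "ab 1".
    cols <- dat[startsWith(names(dat), paste0(g, " "))]
    # Parallel maximum over all but the first column, minus the first column.
    do.call(pmax, c(unname(as.list(cols[-1])), na.rm = TRUE)) - cols[[1]]
  })
}

res <- max_minus_first_all(dat)
res
```

The result is a matrix with one column per group, so res[, "a"] and res[, "b"] correspond to the two calls of max_minus_first above.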
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package and converts the data frame from wide format to long format. I used gather twice because the first column of each group is treated differently from the other columns in the same group. ends_with(" 1") is a select helper available in dplyr, which selects the columns whose names end in " 1". Notice that the space in " 1" is important because "1" alone may also select other columns such as a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
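The steps above can also be sketched more compactly with pivot_longer, which can split the column names into group and number while reshaping. This assumes a recent tidyr and dplyr and the same "<group> <number>" naming; the names long_result and value are illustrative.

```r
library(dplyr)
library(tidyr)

dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
              `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)

long_result <- dat %>%
  mutate(rowid = row_number()) %>%
  # Split e.g. "a 2" into Group = "a" and Column_Number = "2" while reshaping.
  pivot_longer(-rowid, names_to = c("Group", "Column_Number"), names_sep = " ") %>%
  group_by(rowid, Group) %>%
  # Maximum of all columns except the first, minus the first column.
  summarise(Final = max(value[Column_Number != "1"], na.rm = TRUE) -
              value[Column_Number == "1"],
            .groups = "drop") %>%
  pivot_wider(names_from = Group, values_from = Final)

long_result
```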
Solution 3
Another tidyverse solution, similar to Solution 2. It uses do to apply an operation to each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(tibble(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5

Resources