map_dfr converting data frame input to column - r

I'm trying to use the following function to iterate through a dataframe and return the counts from each row:
library(dplyr)
library(tidyr)
row_freq <- function(df_input, row_input) {
  print(df_input)
  vec <- unlist(df_input %>%
                  select(-1) %>%
                  slice(row_input), use.names = FALSE)
  r <- data.frame(table(vec)) %>%
    pivot_wider(values_from = Freq, names_from = vec)
  return(r)
}
This works fine if I use a single row from the dataframe:
sample_df <- data.frame(id = c(1,2,3,4,5), obs1 = c("A","A","B","B","B"),
obs2 = c("B","B","C","D","D"), obs3 = c("A","B","A","D","A"))
row_freq(sample_df, 1)
id obs1 obs2 obs3
1 1 A B A
2 2 A B B
3 3 B C A
4 4 B D D
5 5 B D A
# A tibble: 1 × 2
A B
<int> <int>
1 2 1
But when iterating over rows using purrr::map_dfr, it seems to reduce df_input to only the id column instead of using the entire dataframe as the argument, which I found quite strange:
purrr::map_dfr(sample_df, row_freq, 1:5)
[1] 1 2 3 4 5
Error in UseMethod("select") :
no applicable method for 'select' applied to an object of class "c('double', 'numeric')"
I'm looking for help with regard to 1) why this is happening, 2) how to fix it, and 3) any alternative approaches or functions that may already do what I'm trying to do more efficiently.

Specify the arguments in the correct order if we are not passing them as named arguments:
purrr::map_dfr(1:5, ~ row_freq(sample_df, .x))
Output:
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
Or pass them as named arguments:
purrr::map_dfr(df_input = sample_df, .f = row_freq, .x = 1:5)
Output:
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
The reason is that map()'s first argument is .x:
map(.x, .f, ...)
If sample_df is supplied in that position, it is taken as .x, and map() loops over the columns of the data (a data.frame/tibble/data.table is a list with additional attributes, so the unit of iteration is the column). That is why the call printed only [1] 1 2 3 4 5: the numeric id column was passed to row_freq() as df_input, and select() then failed on a plain numeric vector.
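A quick way to see the column-wise iteration (using the sample_df defined above):
purrr::map(sample_df, class)
# one element per column: id is numeric, obs1/obs2/obs3 are character (or factor on older R versions)
As for the third part of the question, one vectorised alternative (my own sketch, not part of the answer above) is to reshape once and count, rather than tabulating row by row; unlike the map_dfr() output it also keeps the id column:
library(dplyr)
library(tidyr)

sample_df %>%
  pivot_longer(-id, values_to = "obs") %>%   # one row per (id, observation)
  count(id, obs) %>%                         # frequency of each value within each id
  pivot_wider(names_from = obs, values_from = n)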

Related

how to compute row means iff the number of NA's is smaller than a given value

I have questionnaire data (rows = individuals, cols = scores on questions) and would like to compute a sumscore for individuals if they answered a given number of questions; otherwise the sumscore variable should be NA. The code below computes row sums, counts the number of NA's, assigns an otherwise non-occurring value to the row sum variable when the number of NA's is large, and then replaces that with NA. The code works, but I bet there is a more elegant way... Suggestions much appreciated.
dum <- tibble(x = c(1, NA, 2, 3, 4), y = c(1, 2, 3, NA, 5), z = c(1, NA, 2, 3, 4))
dum <- dum %>%
  mutate(sumsum = rowSums(select(., x:z), na.rm = TRUE))
dum <- dum %>%
  mutate(countna = rowSums(is.na(select(., x:z))))
dum <- dum %>%
  mutate(sumsum = case_when(countna >= 2 ~ 100, TRUE ~ sumsum))
dum <- dum %>%
  mutate(sumsum = na_if(sumsum, 100))
You may combine your code into one statement:
library(dplyr)
dum <- tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum <- dum %>%
  mutate(sumsum = replace(rowSums(select(., x:z), na.rm = TRUE),
                          rowSums(is.na(select(., x:z))) >= 2, NA))
dum
# A tibble: 5 × 4
# x y z sumsum
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 3
#2 NA 2 NA NA
#3 2 3 2 7
#4 3 NA 3 6
#5 4 5 4 13
You can also try this:
dum <- tibble(x = c(1, NA, 2, 3, 4), y = c(1, 2, 3, NA, 5), z = c(1, NA, 2, 3, 4))
dum2 <- dum %>%
  mutate(sumsum = ifelse(rowSums(is.na(select(., x:z))) >= 2, NA,
                         rowSums(select(., x:z), na.rm = TRUE)))
dum2
# A tibble: 5 × 4
x y z sumsum
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 NA 2 NA NA
3 2 3 2 7
4 3 NA 3 6
5 4 5 4 13
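For completeness, a rowwise()/c_across() variant is sketched below (my own addition, not from the answers above); it mirrors the verbal description closely, although the rowSums() versions above will usually be faster on larger data:
library(dplyr)

dum %>%
  rowwise() %>%
  mutate(sumsum = ifelse(sum(is.na(c_across(x:z))) >= 2,
                         NA_real_,                          # too many missings: no sumscore
                         sum(c_across(x:z), na.rm = TRUE))) %>%
  ungroup()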

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lead()/lag() to get the next and previous values respectively, and data.table's rleid() to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
  mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
         IDbelow = ifelse(Type == 2, lead(ID), NA),
         grp = rleid(Type)) %>%
  group_by(grp) %>%
  mutate(IDabove = first(IDabove),
         IDbelow = last(IDbelow)) %>%
  ungroup() %>%
  select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr-only solution:
You could create your own rleid-style function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)

my_func <- function(x) {
  x <- rle(x)$lengths
  rep(seq_along(x), times = x)
}

# this part is the same as provided by Ronak
df %>%
  mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
         IDbelow = ifelse(Type == 2, lead(ID), NA),
         grp = my_func(Type)) %>%
  group_by(grp) %>%
  mutate(IDabove = first(IDabove),
         IDbelow = last(IDbelow)) %>%
  ungroup() %>%
  select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
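As a quick check (my own addition, using the df defined above), the hand-rolled my_func() reproduces what data.table::rleid() returns for the grouping step:
my_func(df$Type)
# [1] 1 2 2 3 3 4
data.table::rleid(df$Type)
# [1] 1 2 2 3 3 4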

Adding a column if it does not exist inside purrr

I've been struggling to add a new column if it does not exist. I found the answer here: Adding column if it does not exist.
However, in my problem I must use it inside a purrr call. I tried to adapt the above answer, but it doesn't fit my needs.
Here is an example what I'm dealing with:
Suppose I have a list of two data.frames:
library(tibble)
A = tibble(
x = 1:5, y = 1, z = 2
)
B = tibble(
x = 5:1, y = 3, z = 3, w = 7
)
dt_list = list(A, B)
The column I'd like to add is w:
cols = c(w = NA_real_)
Separately, if I want to add a column if it does not exist, I could do the following:
Since it does exist, no column is added:
B %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 5 3 3 7
2 4 3 3 7
3 3 3 3 7
4 2 3 3 7
5 1 3 3 7
In this case, since it does not exist, w is added:
A %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 1 1 2 NA
2 2 1 2 NA
3 3 1 2 NA
4 4 1 2 NA
5 5 1 2 NA
I tried the following to replicate it using purrr (I'd prefer not to use a for loop):
dt_list_2 = dt_list %>%
purrr::map(
~dplyr::select(., -starts_with("x")) %>%
~tibble::add_column(!!!cols[!names(cols) %in% names(.)])
)
But the output is not the same as doing it separately.
Note: This is an example of my real problem. In fact, I'm using purrr to read many *.csv files and then apply some data transformation. Something like this:
re_file <- list.files(path = dir_path, pattern = "*.csv")
cols_add = c(UCI = NA_real_)
file_list = re_file %>%
purrr::map(function(file_name){ # iterate through each file name
read_csv(file = paste0(dir_path, "//",file_name), skip = 2)
}) %>%
purrr::map(
~dplyr::select(., -starts_with("Textbox")) %>%
~dplyr::tibble(!!!cols[!names(cols) %in% names(.)])
)
You can use:
dt_list %>%
purrr::map(
~tibble::add_column(., !!!cols[!names(cols) %in% names(.)])
)
#[[1]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 1 1 2 NA
#2 2 1 2 NA
#3 3 1 2 NA
#4 4 1 2 NA
#5 5 1 2 NA
#[[2]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 5 3 3 7
#2 4 3 3 7
#3 3 3 3 7
#4 2 3 3 7
#5 1 3 3 7
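If the column-dropping step from the original attempt is also needed, one way to fold both steps into a single map() call is to use an anonymous function rather than two chained ~ formulas (a sketch based on the toy dt_list and cols objects above; the second ~ in the question is what breaks the pipe):
dt_list %>%
  purrr::map(function(d) {
    d <- dplyr::select(d, -dplyr::starts_with("x"))            # drop unwanted columns first
    tibble::add_column(d, !!!cols[!names(cols) %in% names(d)]) # then add w only if it is missing
  })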

how to set names in a dynamically long list

Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable number of columns. E.g. suppose I want to store, for each row, the column index of the column(s) that contain the row minimum. In row 2 below, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)

test %>%
  rowwise() %>%
  mutate(dist = min(c_across(everything())),
         code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
  ungroup() %>%
  unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the names_repair argument within unnest_wider, i.e. unnest_wider(code, names_repair = ~ paste0("code", .x)), but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution, but it is really ugly and requires stopping halfway through. The problem here is that rename_with() doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
  rowwise() %>%
  mutate(dist = min(c_across(everything())),
         code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
         length_code = length(code)) %>%
  ungroup() %>%
  unnest_wider(code)

test3 <- test2 %>%
  rename_with(.cols = starts_with("..."), .fn = ~ paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
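A shorter route, assuming a tidyr version whose unnest_wider() supports the names_sep argument (a sketch, not verified against older versions): names_sep pastes the outer column name onto each element's position, so the new columns come out as code_1 and code_2 directly, without the length_code workaround:
library(tidyverse)

test %>%
  rowwise() %>%
  mutate(dist = min(c_across(everything())),
         code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
  ungroup() %>%
  unnest_wider(code, names_sep = "_")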

How to detect class type and change in R

I have a dataset where the first line is the header, the second line is some explanatory data, and rows 3 onward are numbers. When I read in the data with this second explanatory row present, the columns are automatically converted to factors (or to character if I set stringsAsFactors = FALSE).
What I would like to do is remove the second row, and have a function that goes through all columns and detects if they're just numbers and change the class type to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
If all the values are numbers once the explanatory row is dropped, then we can do this:
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
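One caveat (my own note, not part of the answer): if the columns were read in as factors, as.numeric() returns the underlying integer codes rather than the printed values, so converting through character first is safer:
df[-1, ] %>% mutate_all(~ as.numeric(as.character(.)))   # safe for both factor and character columns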
Depending on the task, it can also be done this way:
df <- tibble(A = c("col 1", 1, 2, 3, 4, 5),
             B = c("col 2", 1, 2, 3, 4, 5),
             C = c(letters[1:5], 6))

df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
Or like this:
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
In base R, we can drop the explanatory row and then convert every column:
df2 <- df[-1, ]
df2[] <- lapply(df2, as.numeric)
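Another base R option worth knowing (my own suggestion, not from the answers above) is utils::type.convert(), which in reasonably recent R versions guesses a suitable class for each column of a data frame; assuming the columns came in as character:
type.convert(df[-1, ], as.is = TRUE)   # as.is = TRUE keeps strings as character rather than factor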
