pivot_wider() arguments imply differing number of rows Error - r

I have the following data.table object called x:
month.option som.month
all.year 56.6%
diff -0.9%
and when I perform the following operation :
x %>% pivot_wider(names_from = month.option, values_from = som.month) %>%
select(diff, everything()) %>%
set_names(c("Dif vs MA", "SOM YTD", "SOM AA"))
I get the following error: Error in data.frame(row = row_id, col = col_id) : arguments imply differing number of rows: 0, 2. However I don't understand the reason since x is a 2x2 data.table. If anyone knows a possible issue that I am not seeing I will appreciate the correction.
As a side note, all the columns are of type character, if that is any useful info

If we want to use pivot_wider, we could do this without creating a new column by specifying the values_fn as I
library(dplyr)
library(tidyr)
x %>%
pivot_wider(names_from = month.option, values_from = som.month, values_fn = I)
# A tibble: 1 x 2
# all.year diff
# <I<chr>> <I<chr>>
#1 56.6% -0.9%
Or it can be also a function to get the first element
x %>%
pivot_wider(names_from = month.option,
values_from = som.month, values_fn = first)
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
However, this kind of problem can be easily tackled with transpose from data.table
data.table::transpose(x, make.names = 'month.option')
# all.year diff
#1 56.6% -0.9%
Or use deframe with as_tibble_row which would be more direct
library(tibble)
deframe(x) %>%
as_tibble_row
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
Or another option is to convert the first column to rownames, do the transpose with t and convert to tibble (or data.frame)
x %>%
column_to_rownames('month.option') %>%
t %>%
as_tibble
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
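For comparison, the same one-row result can be sketched in base R without any packages. This assumes x is the two-column data frame from the data section (it is rebuilt here so the snippet runs on its own); the object name wide is only illustrative.

```r
x <- data.frame(month.option = c("all.year", "diff"),
                som.month = c("56.6%", "-0.9%"),
                stringsAsFactors = FALSE)

# Name each value by its month.option, then turn the named list into columns
wide <- as.data.frame(as.list(setNames(x$som.month, x$month.option)),
                      stringsAsFactors = FALSE)
wide
```

This relies on the values in month.option being unique and syntactically valid column names, as they are in this example.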
data
x <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))

Try this tidyverse solution with same pivot_wider(). You are having issues because the function can not identify the rows properly. Creating an id is the solution:
#Code
df %>% mutate(id=1) %>%
pivot_wider(names_from = month.option,values_from=som.month) %>%
select(-1)
Output:
# A tibble: 1 x 2
all.year diff
<chr> <chr>
1 56.6% -0.9%
Some data used:
#Data
df <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))

If you have data.table we can also use dcast :
library(data.table)
dcast(x, rowid(month.option)~month.option, value.var = 'som.month')
# month.option all.year diff
#1: 1 56.6% -0.9%

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
  ID count avg_length
1 L0     2          2
2 L1     2          2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
Because rle_mean_lengths() returns a data.frame, we can wrap its result in a list inside summarise and unnest it afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column should be extracted not from the whole data (DF$state) but from the grouped data coming from the lhs, i.e. from . (as in .$state). Note that do is superseded, so it may be better to use summarise with unnest/unpack as shown above.
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
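The difference between DF$state and .$state can be checked with a minimal base R sketch. It assumes only the ID and state columns of the question's DF (rebuilt here), and the names whole and per_group are illustrative.

```r
rle_mean_lengths <- function(x, value) {
  r <- rle(x)
  cond <- r$values == value
  data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}

# Only the ID and state columns of the question's DF are reproduced here
DF <- data.frame(
  ID = rep(c("L0", "L1"), each = 11),
  state = c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0,
            0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)
)

# Whole column: what rle_mean_lengths(DF$state, 1) returns, regardless of group
whole <- rle_mean_lengths(DF$state, 1)

# Per group: split state by ID first, then apply the function to each piece
per_group <- do.call(rbind,
                     lapply(split(DF$state, DF$ID), rle_mean_lengths, value = 1))
per_group
```

The whole-column call gives count 2 and avg_length 2, which is exactly the value repeated for both IDs in the question, while the per-group call gives 0 and NaN for L1.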

R - Collapsing observations and creating new columns

In my dataframe there are multiple rows for a single observation (each referenced by rif). I would like to collapse the rows and create new columns for the keyword column. The outcome would include as many keyword columns as the number of rows for an observation (e.g. keyword_1, keyword_2, etc). Do you have any idea? Thanks a lot.
This is my MWE
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova")),
row.names = c(NA, 5L), class = "data.frame")
Does this work:
library(dplyr)
library(tidyr)
df1 %>%
group_by(rif, date) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(rif, date), values_from = keyword,
names_from = n, names_prefix = 'keyword')
# A tibble: 2 x 5
# Groups: rif, date [2]
rif date keyword1 keyword2 keyword3
<chr> <chr> <chr> <chr> <chr>
1 text10 20180329 Lucca Piacenza Milano
2 text11 20180329 Cascina Padova NA
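The same two steps (number the rows within each observation, then spread them into columns) can also be sketched in base R. This is only an alternative under the assumption that df1 is the data frame from the question; sep = "_" is chosen here so the columns come out as keyword_1, keyword_2, and so on, as the question asked.

```r
df1 <- data.frame(
  rif = c("text10", "text10", "text10", "text11", "text11"),
  date = rep("20180329", 5),
  keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  stringsAsFactors = FALSE
)

# Running counter within each rif/date combination: 1, 2, 3, 1, 2
df1$n <- ave(seq_along(df1$rif), df1$rif, df1$date, FUN = seq_along)

# One row per rif/date, one keyword_<n> column per counter value
wide <- reshape(df1, idvar = c("rif", "date"), timevar = "n",
                direction = "wide", sep = "_")
wide
```

Observations with fewer keywords than the maximum are padded with NA, as in the pivot_wider output.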

show unique values for each column

I am trying to create a data-frame of the column type and unique variables for each column.
I am able to get column type in the desired data-frame format using map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class), but unable to get the unique variables to become a data-frame instead of a list.
Below is a small data-frame and code that gets the unique variables in a list, but not a data frame. Ideally, I could do this in one (map) function, but if I have to join them, it would not be a big deal.
df <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
library(tidyverse)
map(df, class) %>% bind_rows() %>% gather(key = col_name, value = col_class)
map(df, unique)
When I try to use the same method on the map(df, unique) as on the map(df, class) I get the following error: Error: Argument 2 must be length 3, not 2 which is expected, but I am not sure how to get around it.
The number of unique values are different in those two columns. You need to reduce them to a single element.
df2 <- map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
> df2
# A tibble: 2 x 2
col_name col_unique
<chr> <chr>
1 v1 1,2,3
2 v2 a,b
We could use map_df and get the class and unique values from each column into one tibble. Since every column would have variables of different type, we need to bring them in one common class to bind the data together in one dataframe.
purrr::map_df(df,~tibble::tibble(class = class(.), value = as.character(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1
#2 numeric 2
#3 numeric 3
#4 factor a
#5 factor b
Or if you want to have only one value for every column, we could do
map_df(df, ~tibble(class = class(.), value = toString(unique(.))))
# class value
# <chr> <chr>
#1 numeric 1, 2, 3
#2 factor a, b
Same in base R using lapply
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = as.character(unique(x)))))
and
do.call(rbind, lapply(df, function(x)
data.frame(class = class(x), value = toString(unique(x)))))
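If the goal is one row per column with the name, class and unique values side by side, a minimal base R sketch could look like the following. The column names col_name, col_class and col_unique follow the question; stringsAsFactors = FALSE is an assumption made here so that v2 stays character, and class(x)[1] keeps only the first class when a column has several.

```r
df <- data.frame(v1 = c(1, 2, 3, 2), v2 = c("a", "a", "b", "b"),
                 stringsAsFactors = FALSE)

# vapply() guarantees exactly one string per column
col_summary <- data.frame(
  col_name   = names(df),
  col_class  = vapply(df, function(x) class(x)[1], character(1)),
  col_unique = vapply(df, function(x) toString(unique(x)), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
col_summary
```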
To address OP's comment asking about enframe and unnest I set up a benchmark.
set.seed(123)
df <- data.frame(v1 = sample(1:100000,10000000, replace = TRUE),
v2 = sample(c(letters,LETTERS),10000000, replace = TRUE))
library(tidyverse)
map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique)
#> # A tibble: 2 x 2
#> col_name col_unique
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j~
map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest()
#> # A tibble: 2 x 2
#> name value
#> <chr> <chr>
#> 1 v1 51663,57870,2986,29925,95246,68293,62555,45404,65161,46435,9642,59~
#> 2 v2 S,V,k,t,z,K,f,J,n,R,W,h,M,P,q,g,C,U,a,d,Y,u,O,x,b,m,v,r,F,w,A,j,c,~
microbenchmark::microbenchmark(
bind_gather = map(df, ~str_c(unique(.x),collapse = ",")) %>%
bind_rows() %>%
gather(key = col_name, value = col_unique) ,
frame_unnest = map(df, ~str_c(unique(.x),collapse = ",")) %>%
enframe() %>%
unnest() ,
times = 10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> bind_gather 581.6403 594.6479 615.0841 612.9336 618.3057 697.6204 10
#> frame_unnest 568.6620 590.0003 604.2774 606.5676 624.8159 630.2372 10
In this run enframe %>% unnest is only marginally faster than bind_rows %>% gather(); the two timing ranges overlap, so the difference is within run-to-run variation.
Does this work for you?
data.table::rbindlist(list(map(df, class), map(df, function(x) list(unique(x)))))

summarise function in R

I am trying to create an R data frame that includes some numeric variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (for sure I am missing something here).
I have tried to look around for a possible explanation but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015), class =
c("class190"), area_1000ha = c(66.447)) %>%
mutate(FS_name = as.character(FS_name)) %>%
mutate(Year = as.integer(Year)) %>%
mutate(class = as.character(class)) %>%
tbl_df()
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type correctly, I have the right result on area_1000ha variable (10.5).
If I don't - i.e. keeping rm.na= I get 11.5, instead (+1, in fact).
What am I missing?
I think rm.na=TRUE is added to the sum, and as TRUE is considered as 1, it sums your initial sum and 1.
If you change TRUE to 2 for example
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
sum() has no argument named rm.na, so rm.na = TRUE falls into the ... of sum() and is treated as one more value to be summed; TRUE is coerced to 1.
Use na.rm = TRUE instead and you will get the right result.
Even if you change the name of the variable
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
I have replaced rm.na with tester variable.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
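The effect can be reproduced outside dplyr with a few plain calls to sum(). This sketch uses the value 10.5 from the question; the variable names are only illustrative.

```r
# Any argument that is not na.rm is passed through ... and added to the total
with_typo   <- sum(10.5, rm.na = TRUE)    # TRUE is coerced to 1
with_two    <- sum(10.5, rm.na = 2)       # 2 is simply added
with_tester <- sum(10.5, tester = TRUE)   # the argument name does not matter

# na.rm is a real argument of sum(): it drops NA values and adds nothing
correct <- sum(c(10.5, NA), na.rm = TRUE)

c(with_typo, with_two, with_tester, correct)
```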

How to use dplyr::mutate_all for rounding selected columns

I'm using the following package version
# devtools::install_github("hadley/dplyr")
> packageVersion("dplyr")
[1] ‘0.5.0.9001’
With the following tibble:
library(dplyr)
df <- structure(list(gene_symbol = structure(1:6, .Label = c("0610005C13Rik",
"0610007P14Rik", "0610009B22Rik", "0610009L18Rik", "0610009O20Rik",
"0610010B08Rik"), class = "factor"), fold_change = c(1.54037,
1.10976, 0.785, 0.79852, 0.91615, 0.87931), pvalue = c(0.5312,
0.00033, 0, 0.00011, 0.00387, 0.01455), ctr.mean_exp = c(0.00583,
59.67286, 83.2847, 6.88321, 14.67696, 1.10363), tre.mean_exp = c(0.00899,
66.22232, 65.37819, 5.49638, 13.4463, 0.97043), ctr.cv = c(5.49291,
0.20263, 0.17445, 0.46288, 0.2543, 0.39564), tre.cv = c(6.06505,
0.28827, 0.33958, 0.53295, 0.26679, 0.52364)), .Names = c("gene_symbol",
"fold_change", "pvalue", "ctr.mean_exp", "tre.mean_exp", "ctr.cv",
"tre.cv"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
That looks like this:
> df
# A tibble: 6 × 7
gene_symbol fold_change pvalue ctr.mean_exp tre.mean_exp ctr.cv tre.cv
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.54037 0.53120 0.00583 0.00899 5.49291 6.06505
2 0610007P14Rik 1.10976 0.00033 59.67286 66.22232 0.20263 0.28827
3 0610009B22Rik 0.78500 0.00000 83.28470 65.37819 0.17445 0.33958
4 0610009L18Rik 0.79852 0.00011 6.88321 5.49638 0.46288 0.53295
5 0610009O20Rik 0.91615 0.00387 14.67696 13.44630 0.25430 0.26679
6 0610010B08Rik 0.87931 0.01455 1.10363 0.97043 0.39564 0.52364
I'd like to round the floats (2nd column onward) to 3 digits. What's the way to do it with dplyr::mutate_all()?
I tried this:
cols <- names(df)[2:7]
# df <- df %>% mutate_each_(funs(round(.,3)), cols)
# Warning message:
#'mutate_each_' is deprecated.
# Use 'mutate_all' instead.
# See help("Deprecated")
df <- df %>% mutate_all(funs(round(.,3)), cols)
But get the following error:
Error in mutate_impl(.data, dots) :
3 arguments passed to 'round'which requires 1 or 2 arguments
While the new across() function is slightly more verbose than the previous mutate_if variant, the dplyr 1.0.0 updates make the tidyverse language and code more consistent and versatile.
This is how to round specified columns:
df %>% mutate(across(2:7, round, 3)) # columns 2-7 by position
df %>% mutate(across(all_of(cols), round, 3)) # columns named in the character vector cols
This is how to round all numeric columns to 3 decimal places:
df %>% mutate(across(where(is.numeric), round, 3))
This is how to round all columns, but it won't work in this case because gene_symbol is not numeric:
df %>% mutate(across(everything(), round, 3))
Where we put where(is.numeric) in across's arguments, you could put in other column specifications such as -1 or -gene_symbol to exclude column 1. See help(tidyselect) for even more options.
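In dplyr 1.1.0 and later, passing extra arguments such as 3 through the ... of across() is deprecated in favour of an anonymous function. A minimal sketch of that form, assuming a cut-down two-row version of the question's tibble:

```r
library(dplyr)

# Two rows and three columns taken from the question's data
df <- tibble(
  gene_symbol = c("0610005C13Rik", "0610007P14Rik"),
  fold_change = c(1.54037, 1.10976),
  pvalue      = c(0.5312, 0.00033)
)

# The lambda makes the digits argument explicit; non-numeric columns are untouched
rounded <- df %>%
  mutate(across(where(is.numeric), ~ round(.x, 3)))
rounded
```

The same lambda works with any other column selection, for example across(2:3, ~ round(.x, 3)).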
Update for dplyr 1.0.0
The across() function replaces the _if/_all/_at/_each variants of dplyr verbs. https://dplyr.tidyverse.org/dev/articles/colwise.html#how-do-you-convert-existing-code
Since some columns are not numeric, you could use mutate_if with the added benefit of rounding columns iff (if and only if) it is numeric:
df %>% mutate_if(is.numeric, round, 3)
packageVersion("dplyr")
[1] '0.7.6'
Try
df %>% mutate_at(2:7, funs(round(., 3)))
It works!!
