I plan to sum a data.table row-wise and add a constant to it. What is wrong with this code? I am specifically looking for a pmap_dfr solution:
library(data.table)
library(tidyverse)
temp.dt <- data.table(a = 1:3, b = 1:3, c = 1:3)
d <- 10
temp.dt %>% pmap_dfr(., sum, d) # add columns a b and c and add variable d to it
The output expected is a single column tibble with the following rows:
13
16
19
Error thrown: Argument 1 must have names.
I have been able to get it to work with pmap and pmap_dbl but it fails when using pmap_dfr. Additionally, the example I have provided is a toy example. I want the d variable as an input argument to the sum function instead of adding d later to the row-wise sum.
For example, I know the following works:
temp.dt %>% pmap_dbl(., sum) + d
The problem occurs for regular data frames too so to reduce this to the essentials start a new R session, get rid of the data.table part and use the input shown where we have a 3x4 data.frame so that we don't confuse rows and columns. Also note that pmap_dfr(sum, d) is the same as pmap(sum, d) %>% bind_rows and it is in the bind_rows step that the problem occurs.
library(dplyr)
library(purrr)
# test input
temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3, z = 1:3)
rownames(temp.df) <- LETTERS[1:3]
d <- 10
out <- temp.df %>% pmap(sum, d) # this works
out %>% bind_rows
## Error: Argument 1 must have names
The problem, as the error states, is that out has no names, and bind_rows does not supply default names for it. For example, the following works. I am not suggesting that you necessarily do this; it only illustrates why the original fails by showing a minimal change that makes it work:
temp.df %>% pmap(sum, d) %>% set_names(rownames(temp.df)) %>% bind_rows
## # A tibble: 1 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 14 18 22
or this could be written like this to avoid writing temp.df twice:
temp.df %>% { set_names(pmap(., sum, d), rownames(.)) } %>% bind_rows
I think we can conclude that pmap_dfr is just not the right function to use here.
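If a pmap_dfr call is still wanted, one option (a sketch, not part of the answer above) is to have the function return a one-row data frame with a named column, so that the bind_rows step has names to work with. The column name total and the anonymous wrapper function are illustrative choices; d is still passed into sum rather than added afterwards.

```r
library(purrr)  # pmap_dfr() also needs dplyr to be installed

temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3, z = 1:3)
d <- 10

# Each row is passed to the function as named arguments; wrapping the
# result in a data frame gives bind_rows a named column to stack.
res <- pmap_dfr(temp.df, function(...) data.frame(total = sum(..., d)))
res$total
## [1] 14 18 22
```

The result has one row per input row, which is the shape the question asked for.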
Base R
Of course, this is all trivial in base R as you can do this:
rowSums(temp.df) + d
## A B C
## 14 18 22
or more generally:
as.data.frame.list(apply(temp.df, 1, sum, d))
## A B C
## 14 18 22
or
as.data.frame.list(Reduce("+", temp.df) + d)
## X14 X18 X22
##1 14 18 22
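If the point is to hand d to the summing function itself, as pmap does, a base R sketch could use mapply over the columns. This is an illustration only; the wrapper function(...) is an assumption of the sketch and not something from the question.

```r
temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3, z = 1:3)
d <- 10

# mapply() walks the columns in parallel, so each call receives one row;
# d is passed into sum() together with that row's values.
row_totals <- do.call(mapply, c(list(FUN = function(...) sum(..., d)), temp.df))
row_totals
## [1] 14 18 22
```

Unlike the apply version, this does not carry the row names over to the result.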
data.table
In data.table we can write:
library(data.table)
DT <- as.data.table(temp.df)
DT[, as.list(rowSums(.SD) + d)]
## V1 V2 V3
## 1: 14 18 22
DT[, as.list(apply(.SD, 1, sum, d))]
## V1 V2 V3
## 1: 14 18 22
Also note that using data.table directly tends to be faster than sticking another level on top of it so if you thought you were getting the benefit of data.table's speed by using it with dplyr and purrr you likely aren't.
A pmap_dfr solution is to first transpose the dataset. We can later rename the columns as desired:
temp.dt %>%
t() %>%
as.data.frame()-> tmp_dt
pmap_dfr(list(tmp_dt, 10),sum)
# A tibble: 1 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 13 16 19
A possible dplyr-base alternative:
temp.dt %>%
mutate(Sum = rowSums(.) + d) %>%
pull(Sum)
[1] 13 16 19
Or using pmap_dbl:
temp.dt %>%
pmap_dbl(.,sum) + d
[1] 13 16 19
I'm still learning R and was wondering if there is an elegant way of manipulating the df below to obtain df2.
I'm not sure whether a loop is the right tool for this, but basically, for each X_ID/X_No pair of columns, I want to extract the first non-NA "X_No" value when the "X_No" value in the first row is NA. This is perhaps best shown with an example going from df to the desired df2.
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
I'm hoping for an elegant solution, as there are over 1000 columns like the ones in this example.
I've looked all over the web for a similar example, but found nothing that reproduces the expected result.
Your help is very much appreciated.
Thank you
I don't know if I'd call it "elegant", but here is a potential solution:
library(tidyverse)
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
output <- df %>%
pivot_longer(everything(),
names_sep = "_",
names_to = c("Col", ".value")) %>%
drop_na() %>%
group_by(Col) %>%
slice_head(n = 1) %>%
ungroup() %>%
select(-Col)
df2
#> ID No
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
output
#> # A tibble: 5 × 2
#> ID No
#> <chr> <dbl>
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
all_equal(df2, output)
#> [1] TRUE
Created on 2023-02-08 with reprex v2.0.2
Using base R with max.col (assuming the columns are alternating with ID, No)
ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
m1 <- cbind(seq_along(ind), ind)
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No = t(df[c(FALSE, TRUE)])[m1])
ID No
1 A 11
2 D 12
3 F 13
4 M 40
5 W 50
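To see what max.col is doing in that answer, here is a small sketch on a made-up matrix m (not the question's data). On a logical matrix with ties.method = "first", it returns, for each row, the position of the first TRUE, which can then be used for matrix indexing.

```r
# Hypothetical 3 x 3 matrix with missing values
m <- rbind(c(NA, 5, 7),
           c(3, NA, 9),
           c(NA, NA, 2))

# Column index of the first non-NA entry in each row
first_idx <- max.col(!is.na(m), ties.method = "first")
first_idx
## [1] 2 1 3

# Pair each row number with its column index to pull out the values
first_val <- m[cbind(seq_len(nrow(m)), first_idx)]
first_val
## [1] 5 3 2
```

Note that a row containing no TRUE at all is reported as column 1, so a column pair that is entirely NA would need separate handling.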
Here is a data.table solution that should scale well to a (very) large dataset.
functionally
Split the data.frame into a list of chunks of columns, based on their names, so that all columns starting with A_ go to the first element, all columns starting with B_ to the second, and so on.
Then stack these list elements on top of each other using data.table::rbindlist, ignoring the column names (this only works if every prefix, A_, B_, ..., has the same number of columns).
Now get the first non-NA value for each value in the first column.
code
library(data.table)
# split based on what comes before the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
# ID No
# 1: A 11
# 2: D 12
# 3: F 13
# 4: G 14
# 5: I 15
# 6: M 40
# 7: P 20
# 8: W 50
# 9: X 32
#10: Y 48
#11: Z 40
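The output above has one row per ID, whereas the df2 in the question has one row per column pair. A minimal base R sketch of the same split-then-pick idea that returns one row per prefix might look like this. It is not the data.table code from the answer, and it assumes each prefix has exactly an ID column followed by a No column.

```r
df <- data.frame(
  A_ID = c("A", "B", "I", "N"), A_No = c(11, NA, 15, NA),
  B_ID = c("B", "C", "D", "J"), B_No = c(NA, NA, 12, NA),
  C_ID = c("E", "F", "G", "P"), C_No = c(NA, 13, 14, 20),
  D_ID = c("J", "K", "L", "M"), D_No = c(NA, NA, NA, 40),
  E_ID = c("W", "X", "Y", "Z"), E_No = c(50, 32, 48, 40),
  stringsAsFactors = FALSE
)

# Split the columns by the prefix in front of the underscore
chunks <- split.default(df, sub("_.*", "", names(df)))

# For each chunk keep the first row whose No value is not NA
first_rows <- lapply(chunks, function(chunk) {
  names(chunk) <- c("ID", "No")  # assumption: ID column comes before No
  chunk[which(!is.na(chunk$No))[1], ]
})

result <- do.call(rbind, first_rows)
result$ID
## [1] "A" "D" "F" "M" "W"
result$No
## [1] 11 12 13 40 50
```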
I am trying to use the n_distinct function from dplyr inside a pipe in a function and am finding it to be sensitive to my choice of syntax in a way I didn't expect. Here's a toy example.
# preliminaries
library(tidyverse)
set.seed(123)
X <- data.frame(a1 = rnorm(10), a2 = rnorm(10), b = rep(LETTERS[1:5], times = 2), stringsAsFactors = FALSE)
print(X)
a1 a2 b
1 -0.56047565 1.2240818 A
2 -0.23017749 0.3598138 B
3 1.55870831 0.4007715 C
4 0.07050839 0.1106827 D
5 0.12928774 -0.5558411 E
6 1.71506499 1.7869131 A
7 0.46091621 0.4978505 B
8 -1.26506123 -1.9666172 C
9 -0.68685285 0.7013559 D
10 -0.44566197 -0.4727914 E
Okay, now let's say I want to iterate a function over the names of selected columns in that data frame (humor me). Here, I'm going to use values in the selected column to filter the initial data set, count the number of unique ids that remain, and return the results as a one-row tibble that I then bind into a new tibble. When I create a new tibble inside the function and then apply n_distinct to a selected column in that tibble as its own step, I get the expected results from n_distinct, 5 and 4.
bind_rows(map(str_subset(colnames(X), "a"), function(i) {
subdf <- filter(X, !!sym(i) > 0)
value <- n_distinct(subdf$b)
tibble(y = i, n_uniq = value)
}))
# A tibble: 2 x 2
y n_uniq
<chr> <int>
1 a1 5
2 a2 4
If I put n_distinct inside a pipe and use . to refer to the filtered tibble, however, the code executes but I get a different and incorrect result.
bind_rows(map(str_subset(colnames(X), "a"), function(i) {
value <- filter(X, !!sym(i) > 0) %>% n_distinct(.$b)
tibble(y = i, n_uniq = value)
}))
# A tibble: 2 x 2
y n_uniq
<chr> <int>
1 a1 5
2 a2 7
What's up with that? Am I misunderstanding the use of . inside a pipe? Is something funky with n_distinct?
n_distinct accepts multiple arguments, and here you're actually passing both the tibble and the b column as arguments, since the left-hand side of the pipe is passed as the first argument by default. Here are some other ways of getting the expected output:
filter(X, !!sym(i) > 0) %>%
{n_distinct(.$b)}
filter(X, !!sym(i) > 0) %>%
with(n_distinct(b))
library(magrittr)
filter(X, !!sym(i) > 0) %$%
n_distinct(b)
Also, not directly related to your question, there's a convenience function for this kind of thing
map_dfr(str_subset(colnames(X), "a"), function(i) {
value <- filter(X, !!sym(i) > 0) %>% {n_distinct(.$b)}
tibble(y = i, n_uniq = value)
})
Here is a minimal example of what I think you are seeing.
iris %>%
n_distinct(.$Species)
# 149
n_distinct(iris$Species)
# 3
The first option is actually doing the following: the whole data frame is passed as the first argument, and .$Species is passed in addition to it, where it adds nothing.
n_distinct(iris, iris$Species)
# 149
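The pipe behaviour behind this can be shown without n_distinct at all. In the sketch below, count_args is a made-up helper that just reports how many arguments it received. When the dot only appears inside another call such as .[1], magrittr still inserts the left-hand side as the first argument, while wrapping the call in braces suppresses that.

```r
library(magrittr)

# Hypothetical helper: counts the arguments it is called with
count_args <- function(...) length(list(...))

x <- 1:3

# Dot is nested inside `[`, so x is ALSO passed as the first argument
n_with_dot <- x %>% count_args(.[1])
n_with_dot
## [1] 2

# Braces turn off the automatic first-argument insertion
n_braced <- x %>% { count_args(.[1]) }
n_braced
## [1] 1
```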
I think to pipe it without doing weird syntax things you need to use this.
iris %>%
distinct(Species) %>%
count()
# 3
Agreed. If you're just looking for the number of distinct values, as in Adam's last example, you might be better off with
length(unique(iris$Species))
depending on what your goals are
Here is a problem that I can't solve:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
data.frame (f1 is a factor):
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want is (for example, fetch the data for the levels that have exactly 2 elements, then convert it to a data.frame):
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here, but the approach below using dplyr and tidyr works.
library(dplyr)
library(tidyr) # for spread()
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
As we have kept only the groups with exactly nlevels observations, every column in the final data frame has the same number of rows.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack :
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to @thelatemail's solution:
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
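Since spread() has been superseded in tidyr, the same pipeline can be sketched with pivot_wider(). This is a variant for illustration rather than what the answer used; the helper column i is the same row counter as above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(f1 = c("a", "a", "b", "b", "c", "c", "c"),
                 v1 = c(10, 11, 4, 5, 0, 1, 2))

res <- df %>%
  group_by(f1) %>%
  filter(n() == 2) %>%           # keep levels with exactly two rows
  mutate(i = row_number()) %>%   # position within each level
  ungroup() %>%
  pivot_wider(names_from = f1, values_from = v1) %>%
  select(-i)

res$a
## [1] 10 11
res$b
## [1] 4 5
```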
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
x <- df # the data frame from the question
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
result
a b
1 10 4
2 11 5
This is how I would code it; I hope it helps.
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length(v1)) # number of rows per level
dd = merge(aa, cc) # attach the count (id) to each row
ee = dd[dd$id == 2, ] # keep levels that occur exactly twice
ee$id = rep(c(1, 2), nrow(ee) / 2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!
I have a list of data frames to which I am trying to apply a script that works for a single data frame.
Part of the script uses both piping and group_by:
df2 <- df1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
I've tried various loops and variations with lapply but haven't been able to find a way to make it work with a list of data frames, where it would be something along the lines of:
mylist2 <- mylist1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
But obviously changed around to work with loops or lapply. I'm probably missing something simple here but would appreciate some help. Thanks
PS - I looked at providing the data from the lists but wasn't able to provide reproducible samples.
Here is a tidyverse way.
# generate some data
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
col2 = 1:4),
simplify = FALSE)
library(purrr)
library(dplyr)
mylist1 %>%
map(., ~ group_by(., col1) %>%
summarise(new_col = sum(col2)))
#[[1]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
#[[2]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
In base R you might try lapply and tapply
lapply(mylist1, function(x)
tapply(X = x[["col2"]], INDEX = x[["col1"]], FUN = 'sum'))
#[[1]]
#a b
#4 6
#[[2]]
#a b
#4 6
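If each list element should stay a data frame with a newcol column, as in the question's single-data-frame script, one possible base R sketch combines lapply with aggregate. The renaming step is an assumption about the desired column name.

```r
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
                                   col2 = 1:4),
                     simplify = FALSE)

mylist2 <- lapply(mylist1, function(x) {
  # Sum col2 within each level of col1; the result is a data frame
  out <- aggregate(col2 ~ col1, data = x, FUN = sum)
  names(out)[names(out) == "col2"] <- "newcol"
  out
})

mylist2[[1]]$newcol
## [1] 4 6
```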
I very much enjoy working with the magrittr pipes in R %>% and try to use them as often / efficiently as possible. I quite often need to target specific columns in a pipe chain, for example to change the column type. This results in me having to break the chain / my workflow because I need to target only a specific column instead of my entire dataframe.
Consider the following example:
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
rename(info = b) %>%
recode(x = df$info, "a" = "x") #I'd like to target only the df$info column here
This obviously doesn't work, because dplyr doesn't expect me to change the x = argument for a function in a pipe chain.
library(tidyverse)
rm(list = ls())
a <- c(1:20)
b <- rep(c("a", "b"), 10)
df <- data_frame(a, b) %>%
rename(info = b)
df$info <- df$info %>% #this works as expected, but is not as elegant
recode("a" = "x")
This is how I think it should be done, but I feel that it is not as efficient / elegant as I would like it to be, especially if I plan on chaining more functions together after recoding.
Is there a convenient way around this, so I can tell a command in my pipe chain to target only a specific column?
We need to place the recode call inside mutate, which lets us refer to the info column by name:
data_frame(a, b) %>%
rename(info = b) %>%
mutate(info = recode(info, a = "x"))
# A tibble: 20 x 2
# a info
# <int> <chr>
# 1 1 x
# 2 2 b
# 3 3 x
# 4 4 b
# 5 5 x
# 6 6 b
# 7 7 x
# 8 8 b
# 9 9 x
#10 10 b
#11 11 x
#12 12 b
#13 13 x
#14 14 b
#15 15 x
#16 16 b
#17 17 x
#18 18 b
#19 19 x
#20 20 b
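Because mutate returns the whole data frame, the chain can simply continue after the recode step. The sketch below uses tibble() in place of the deprecated data_frame(); the filter and summarise steps are only there to illustrate further chaining and are not part of the question.

```r
library(dplyr)

a <- 1:20
b <- rep(c("a", "b"), 10)

res <- tibble(a, b) %>%
  rename(info = b) %>%
  mutate(info = recode(info, a = "x")) %>%  # touches only the info column
  filter(info == "x") %>%                   # keep going with other verbs
  summarise(total = sum(a))

res$total
## [1] 100
```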