I'm trying to use dplyr with my own function which summarises a data frame to a single value. In the example below, my_func counts the number of missing values. I could do this specific case another way, but I'm interested in knowing how to do this generally. I need this to work with grouped data. I thought something like this might work:
my_func <- function(df) {
return(sum(is.na(df)))
}
data("airquality")
airquality %>% group_by(Month) %>% summarise(my_func(.))
## # A tibble: 5 × 2
##   Month `my_func(.)`
##   <int>        <int>
## 1     5           44
## 2     6           44
## 3     7           44
## 4     8           44
## 5     9           44
But it seems . is the whole data frame, not the individual groups.
dplyr::do can get the correct data frame:
airquality %>% group_by(Month) %>% do(data.frame(m = my_func(.)))
## Source: local data frame [5 x 2]
## Groups: Month [5]
##
##   Month     m
##   <int> <int>
## 1     5     9
## 2     6    21
## 3     7     5
## 4     8     8
## 5     9     1
But this seems like a hack. It's also not consistent with summarise, because the output from do is still a grouped data frame.
Essentially, my question is: can I pass the correct data frame (respecting groups) to my function from within summarise?
After some further checks, it seems that the problem lies with the use of . in summarise. For example, the following works for a single variable:
airquality %>% group_by(Month) %>% summarize(my_func(Ozone))
yet this one doesn't:
airquality %>% group_by(Month) %>% summarize(my_func(.$Ozone))
Similarly, explicitly creating a data.frame with all the variables gives the desired output:
airquality %>%
  group_by(Month) %>%
  summarize(NAs = my_func(data.frame(Ozone, Solar.R, Wind, Temp, Month, Day)))
So if you insist on using dplyr, you'll need a workaround like that one (or use do, as you already mentioned). I believe it's the same bug that has been reported here: dplyr Issue #2752.
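For what it's worth, newer dplyr releases expose the current group's data directly inside summarise, which sidesteps the problem with . being the whole data frame. A minimal sketch, assuming dplyr >= 1.1 for pick() (cur_data() plays the same role in dplyr 1.0):
library(dplyr)

# my_func as defined in the question
my_func <- function(df) {
  sum(is.na(df))
}

# pick(everything()) hands my_func the current group's columns
# (grouping variables excluded), so . is no longer needed
airquality %>%
  group_by(Month) %>%
  summarise(NAs = my_func(pick(everything())))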
I think you can use the following structure, applying the function to each Month group with base R:
num.missing <- sapply(split(airquality, airquality$Month), my_func)
You can also use summarise_each (from older dplyr) to apply the function to every column within each group:
object <- airquality %>% group_by(Month) %>% summarise_each(funs(my_func))
I hope this helps you!
If you don't mind using the plyr package, that seems to produce the desired output:
plyr::ddply(.data = airquality, .variables = ~ Month, .fun = my_func)
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but am not getting close. The closest was setting up a crosstab, but I didn't know how to get counts from that. Reshaping long/wide got me nowhere, etc. I'm still too new to think outside the box, I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the one expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = TRUE) %>%
  summarise(N = n())
Output:
# A tibble: 4 x 2
  Chain        N
  <chr>    <int>
1 AL-MA-TX     1
2 AL-TX        2
3 IA-MA        2
4 IA-TX        1
We can use base R with aggregate and table
table(aggregate(State ~ Guest, df[do.call(order, df), ], paste, collapse = '-')$State)
Output:
# AL-MA-TX    AL-TX    IA-MA    IA-TX
#        1        2        2        1
I have a dataset with 11 columns and 18,350 observations, which has the variables company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region that occurs, like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R. No other combinations are possible.
Any other approach would be highly appreciated.
Thanks in Advance
I used the split function to get a list:
p <- split(tsog1, list(tsog1$company), drop = TRUE)
Now I have a list of dataframes, but I can't convert each element of that list into an individual dataframe.
I tried using loops too, but can't get uniquely named dataframes.
v <- c(1:9)
p <- levels(tsog1$company)
for (x in v)
{
  x.tsog1 <- subset(tsog1, tsog1$company == p[x])
}
Dataset Image
You can create a column for the region/company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3) %>%
  mutate(x = round(rnorm(54 * 3), 2)) %>%
  select(-dummy) %>%
  as_tibble()
# Create the column to split on, and split.
df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)
Now, what to do once you have the list of data frames depends on your next steps. If you want to, for example, save them, you can use walk or lapply.
For saving:
df_list <- df %>%
  mutate(region_company = paste(region, company, sep = '_')) %>%
  split(., .$region_company)

iwalk(df_list, function(df, nm) {
  write_csv(df, paste0(nm, '.csv'))
})
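The same save step can also be written with base lapply instead of iwalk; a small sketch assuming the df_list object created just above:
# write one CSV per list element, using the list names as file names
lapply(names(df_list), function(nm) {
  write_csv(df_list[[nm]], paste0(nm, '.csv'))
})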
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
  region company     x region_company
   <int>   <int> <dbl> <chr>
1      0       4  0.54 0_4
2      0       4  1.61 0_4
3      0       4  0.16 0_4
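If the goal is literally to end up with separately named data frames in the workspace (as the loop in the question attempts), base R's list2env() can promote each list element to its own object; a sketch, again assuming df_list from above (names starting with a digit, like 0_4, will need backticks to access):
# creates one object per list element, named after the region_company key
list2env(df_list, envir = .GlobalEnv)
That said, keeping them in a single list is usually easier to work with than many loose objects.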
I am working with the tidygraph package and am trying to find a "tidy" solution for the example below. The problem is not really tied to tidygraph and is more about data wrangling, but I think it is interesting for people working with this package.
In the following code chunk I just generate some sample data.
library(tidyverse)
library(tidygraph)
library(igraph)
library(randomNames)
library(reshape2)
graph <- play_smallworld(1, 100, 3, 0.05)
labeled_graph <- graph %>%
  activate(nodes) %>%
  mutate(group = sample(letters[1:3], size = 100, replace = TRUE),
         name = randomNames(100))

sub_graphs_df <- labeled_graph %>%
  morph(to_split, group) %>%
  crystallise()
The resulting data.frame looks as follows:
sub_graphs_df
# A tibble: 3 x 2
  name     graph
  <chr>    <list>
1 group: a <S3: tbl_graph>
2 group: b <S3: tbl_graph>
3 group: c <S3: tbl_graph>
Now to my actual problem. I want to apply a function to each element in the column graph. The result is simply a named vector.
sub_graphs_df$graph %>% map(degree)
The first thing I do not like is the subsetting by $. Is there a better way?
Next, I want to reshape this result into just one data.frame with 3 columns: one for name (the names attribute of the vectors), one for group (the names attribute of the list), and one for the number (the elements of the vectors).
I tried melt from the reshape2 package.
sub_graphs_df$graph %>% map(degree) %>% melt
It works decently, but the names are lost, and from what I read one should use tidyr instead. However, I could not get gather to work because it only accepts data.frames.
Another option would be unlist:
sub_graphs_df$graph %>% map(degree) %>% unlist
Here the group and the name are in the names attribute and I would have to recover them with regular expressions.
I am pretty sure there is an easy way I just could not think of.
We can create a list column with mutate while applying the function with map, extract the names and integer values, and unnest to create the 'long'-format dataset:
sub_graphs_df %>%
  mutate(newout = map(graph, degree)) %>%
  transmute(name,
            group = map(newout, ~ .x %>% names),
            number = map(newout, as.integer)) %>%
  unnest
# A tibble: 100 x 3
#    name     group               number
#    <chr>    <chr>                <int>
#  1 group: a Seng, Trevor             0
#  2 group: a Buccieri, Joshua         1
#  3 group: a Street, Aimee            2
#  4 group: a Gonzalez, Corey          2
#  5 group: a Barber, Monique          1
#  6 group: a Doan, Christina          1
#  7 group: a Ninomiya, Janna          1
#  8 group: a Bazemore, Chao           1
#  9 group: a Perfecto, Jennifer       1
# 10 group: a Lopez Jr, Vinette        0
# ... with 90 more rows
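An alternative that keeps the vector names without reshape2, sketched under the assumption that tidyr >= 1.0 is available for unnest() and using the sub_graphs_df built above:
sub_graphs_df %>%
  # enframe() turns each named degree vector into a two-column tibble
  mutate(deg = map(graph, ~ enframe(degree(.x), name = "name", value = "number"))) %>%
  transmute(group = name, deg) %>%
  unnest(deg)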
There must be an R-ly way to call wilcox.test over multiple groups using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code are below, using magrittr pipes and summarise().
library(dplyr)
library(magrittr)
# create a data frame where x is the dependent variable, id1 is a category variable
# (here with five levels), and id2 is a binary category variable used for the
# two-sample Wilcoxon test
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))

# make sure piping and grouping are called correctly, with sum() as a well-behaved example function
df %>% group_by(id1) %>% summarise(s = sum(x))
df %>% group_by(id1, id2) %>% summarise(s = sum(x))

# make sure wilcox.test is called correctly
wilcox.test(x ~ id2, data = df, paired = FALSE)$p.value

# yet wilcox.test cannot be called within a pipe with summarise (regardless of group_by);
# the expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w = wilcox.test(x ~ id2, data = ., paired = FALSE)$p.value)
df %>% summarise(wilcox.test(x ~ id2, data = ., paired = FALSE))

# even specifying the formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w = wilcox.test(formula = x ~ id2, data = ., paired = FALSE)$p.value)
The buggy calls yield this error:
Error in wilcox.test.formula(c(1.09057358373486,
2.28465932554436, 0.885617572657959, : 'formula' missing or incorrect
Thanks for your help; I hope it will be helpful to others with similar questions as well.
Your task will be easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain will look like this:
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))
df <- tbl_df(df)
res <- df %>%
  group_by(id1) %>%
  do(w = wilcox.test(x ~ id2, data = ., paired = FALSE)) %>%
  summarise(id1, Wilcox = w$p.value)
Output:
res
Source: local data frame [5 x 2]

    id1    Wilcox
  (int)     (dbl)
1     1 0.6904762
2     2 0.4206349
3     3 1.0000000
4     4 0.6904762
5     5 1.0000000
Note I added the do function between the group_by and summarize.
I hope it helps.
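In more recent dplyr versions you can also skip do() entirely; a minimal sketch, assuming dplyr >= 1.0 so that cur_data() (superseded by pick() in 1.1) is available inside summarise():
library(dplyr)

df %>%
  group_by(id1) %>%
  # cur_data() is the current group's data frame, so the formula interface
  # sees only that group's rows
  summarise(w = wilcox.test(x ~ id2, data = cur_data(), paired = FALSE)$p.value)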
You can do this with base R (although the result is a cumbersome list):
by(df, df$id1, function(d) { wilcox.test(x ~ id2, data = d, paired = FALSE)$p.value })
or with plyr (ddply() comes from plyr, not dplyr):
library(plyr)
ddply(df, .(id1), function(d) { wilcox.test(x ~ id2, data = d, paired = FALSE)$p.value })
  id1        V1
1   1 0.3095238
2   2 1.0000000
3   3 0.8412698
4   4 0.6904762
5   5 0.3095238
I have a data frame df that has two columns, Term and Freq. I also have a list of terms with given IDs stored in a vector called indices. To illustrate these two objects, I have the following:
> head(indices)
     Term
1   hello
256     i
33    the
Also, for the data frame:
> head(df)
   Term Freq
1     i   24
2 hello   12
3   the   28
I want to add a column in df called TermID which will just be the index of the term in the vector indices. I have tried using dplyr::mutate but to no avail. Here is my code below
library(dplyr)
whichindex <- function(term) {
  ind <- which(indices == as.character(term))
  ind
}

mutate(df, TermID = whichindex(Term))
What I am getting as output is a df that has a new column called TermID, but all the values for TermID are the same.
Can someone help me figure out what I am doing wrong? It would also be nice if you could recommend a more efficient way to do this in R. I have implemented this in Python and did not encounter such issues.
Thanks in advance.
What about this? (Your whichindex() isn't vectorised: mutate() passes the whole Term column in one call, the comparison then happens element by element, and the result from which() gets recycled, which is why every row ends up with the same TermID. Doing it row-wise is one way around that.)
df %>% rowwise() %>% mutate(TermID = grep(Term, indices))
w/ example data:
library(dplyr)
indices <- c("hello","i","the")
df <- data_frame(Term = c("i","hello","the"), Freq = c(24,12,28))
df_res <- df %>% rowwise() %>% mutate(TermID = grep(Term,indices))
df_res
gives:
Source: local data frame [3 x 3]
Groups: <by row>

   Term  Freq TermID
1     i    24      2
2 hello    12      1
3   the    28      3
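A vectorised alternative that avoids rowwise() (and the partial matching that grep() does on patterns), a sketch using the same example objects as above:
# match() returns the position of each Term in indices in one vectorised call
df %>% mutate(TermID = match(Term, indices))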