I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:
group.count <- c(101,99,4)
data <- data.frame(
by = rep(3:1,group.count),
y = rep(letters[1:3],group.count))
data %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
Which gives me the outcome I'm looking for. However, when I try to do it as a function:
res0 <- function(x1,x2) {
output = data %>%
group_by(x2) %>%
summarise(non.miss = sum(!is.na(x1)))
}
res0(y,by)
I get an error (index out of bounds).
Can anybody tell me what I'm missing?
Thanks in advance.
You can't do this like that in dplyr.
The problem is that the names you pass in are never used as column names. Inside the function, dplyr tries to group_by a variable literally called x2, which is not a part of your data.frame, and the by you typed in the call is never looked up in the data. Your first thought might be to pass "by" as a string, but that won't work with this code either. To show this, make your data.frame as such:
data <- data.frame(
x2 = rep(3:1,group.count),
x1 = rep(letters[1:3],group.count)
)
Then call your function again and it will return the expected output.
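If renaming the columns to match the argument names is not practical, one option is to pass the column names as strings and look them up with the .data pronoun. This is only a sketch: it assumes a reasonably recent dplyr in which .data is available, and the helper name count_non_missing_chr is made up for illustration.

```r
library(dplyr)

# Same data as in the question
group.count <- c(101, 99, 4)
data <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# Hypothetical helper: x1 and x2 are column names given as strings
count_non_missing_chr <- function(df, x1, x2) {
  df %>%
    group_by(.data[[x2]]) %>%                        # group by the column named in x2
    summarise(non.miss = sum(!is.na(.data[[x1]])))   # count non-missing values of x1
}

res_chr <- count_non_missing_chr(data, "y", "by")
res_chr
```

Note that the data frame is passed in explicitly here instead of being picked up from the global environment.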
I suggest changing the name of your dataframe to df.
This is basically what you have done:
df %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
which produces this:
# by non.miss
#1 1 4
#2 2 99
#3 3 101
but to count the number of observations per group, you could use length, which gives the same answer:
df %>%
group_by(by) %>%
summarise(non.miss = length(y))
# by non.miss
#1 1 4
#2 2 99
#3 3 101
or, use tally, which gives this:
df %>%
group_by(by) %>%
tally
# by n
#1 1 4
#2 2 99
#3 3 101
Now, if you really wanted to, you could put that into a function. The input would be the dataframe, like this:
res0 <- function(df) {
df %>%
group_by(by) %>%
tally
}
res0(df)
# by n
#1 1 4
#2 2 99
#3 3 101
This of course assumes that your dataframe will always have the grouping column named 'by'. I realize that these data are just fictional, but it might be a good idea to avoid naming columns 'by', because by is also the name of a base R function and the code can get confusing to read.
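If the grouping column is not always called 'by', the column names can be passed in as arguments instead. A minimal sketch, assuming dplyr 1.0 or later so that the {{ }} embrace operator is available; the name res1 and its argument names are made up here, and df is rebuilt from the question's data:

```r
library(dplyr)

group.count <- c(101, 99, 4)
df <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# Hypothetical helper: value_col and group_col are bare (unquoted) column names
res1 <- function(df, value_col, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(non.miss = sum(!is.na({{ value_col }})))
}

out <- res1(df, y, by)
out
```

The call res1(df, y, by) then reads much like the res0(y, by) call the question was aiming for.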
I'd like to write a function with two inputs (x and y) to create some mutated variables in a very large dataframe. Specifically, if x=y then return x, and if x!=y then draw 1 sample from a sequence of x to y.
The function works fine when I test it outside of my data frame, but throws an error when I try to use it within mutate. I've tried both ifelse and if_else versions.
library(dplyr)
smx <- function(x,y){ #Function to allow sampling if length>1
if_else(x==y,x,sample(seq(x,y,1),1))}
#ifelse(x==y,x,sample(seq(x,y,1),1))} #Have also tried this with ifelse, doesn't work
smx(0,0) #This works
smx(0,5) #This works
#Create dummy data frame
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df
#This doesn't work
df1 <- df %>% mutate(
VarC = smx(varA,varB)
)
Ideally, my output should include a third column (VarC) in which the first row is equal to 0 (because varA=varB) and the remaining rows are a random sample between a sequence from varA to varB.
I have set up my data frame so that varA should always be smaller than varB, but I'm not certain that holds. I'd appreciate any help on a clean solution to this problem!
The function is not working because it is not vectorized. First, you'll need to vectorize your function in order to make it work inside mutate.
You can do that as follows:
vectorized_fun <- Vectorize(your_fun)
Your code will look like this:
smx_v <- Vectorize(smx)
#This works
df1 <- df %>%
  mutate(VarC = smx_v(varA, varB))
The issue here comes from seq: its from and to arguments must each have length 1, but inside a dplyr verb such as mutate() the function receives whole columns, so that is not the case here.
Using rowwise() solves the problem:
smx <- function(x,y){
ifelse(x==y,x,sample(seq(x,y,1),1))
}
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df %>%
rowwise() %>%
mutate(VarC = smx(varA, varB))
Output:
# A tibble: 5 x 3
# Rowwise:
varA varB VarC
<dbl> <dbl> <dbl>
1 0 0 0
2 0 1 1
3 0 2 1
4 0 3 2
5 0 4 0
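Another way to get the same row-by-row behaviour without rowwise() is to apply a scalar helper over the two columns with mapply. The following is a sketch only: smx_one is a made-up name, and it picks an element by index rather than calling sample() on the values, because sample(n, 1) with a single number n would draw from 1:n.

```r
library(dplyr)

# Hypothetical scalar helper: x and y are single numbers
smx_one <- function(x, y) {
  if (x == y) return(x)                      # nothing to sample
  candidates <- seq(x, y, 1)                 # all values from x to y
  candidates[sample.int(length(candidates), 1)]
}

df <- data.frame(varA = rep(0, 5), varB = seq(0, 4, 1))

set.seed(1)  # only to make the draw reproducible
df1 <- df %>%
  mutate(VarC = mapply(smx_one, varA, varB))  # one call per row
df1
```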
Related to a previous question, I want to add some random noise to every value in a column in dplyr. However, when I tried the code below I got identical values back. I understand why this is happening (dplyr generates one random number and then adds that very same number to every single value). Is there any way to prevent this?
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000236
# 2 1.000236
# 3 1.000236
# 4 1.000236
# 5 1.000236
Here is a solution with jitter:
library(dplyr)
set.seed(2020) # Make the results reproducible
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = jitter(value))
data
# value
#1 1.0058761
#2 0.9957690
#3 1.0047401
#4 0.9990756
#5 0.9854439
You could generate your random vector externally and then add it to data$value:
nrows <- nrow(data)
rands <- 1e-3 * runif(nrows)
data$value <- data$value + rands
Stepwise clarity works better for me.
Found my own answer. Adding rowwise() evaluates each row individually and thus gives a new random number.
data <- data.frame(value=c(1,1,1,1,1)) %>% rowwise() %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000625
# 2 1.000764
# 3 1.000588
# 4 1.000536
# 5 1.000079
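A vectorized alternative is to ask runif for one number per row instead of a single number. A small sketch, assuming it runs inside mutate() so that dplyr's n() returns the number of rows; the seed is only there to make the example reproducible:

```r
library(dplyr)

set.seed(2020)
data <- data.frame(value = c(1, 1, 1, 1, 1)) %>%
  mutate(value = value + 1e-3 * runif(n()))  # n() draws, one per row
data
```

This avoids the row-by-row evaluation that rowwise() introduces, which may matter on a large data frame.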
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the expected one:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain,.drop = TRUE) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
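The same chain can be written a little more compactly by sorting inside summarise() and letting count() do the second grouping. This is a sketch, assuming dplyr 1.0 or later for the .groups argument; the result name chains is made up:

```r
library(dplyr)

Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State)

chains <- df %>%
  group_by(Guest) %>%
  # one row per guest: that guest's states, sorted and joined with "-"
  summarise(Chain = paste(sort(State), collapse = "-"), .groups = "drop") %>%
  # number of guests sharing each chain
  count(Chain, name = "N")
chains
```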
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the "1" values in each of them. It's a classic frequency table, but dplyr is making things more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But what I really want is to select all the variables I need, count how many 0 (or NA) and how many 1 values each one has, and report that in this output
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
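With newer versions of tidyr, the gather step can also be written with pivot_longer, and count() can take the grouping columns directly. A sketch under the assumption that tidyr 1.0 or later is installed; the result name freq is made up:

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100), var2 = rep(c(NA, 1, 0), 100))

freq <- dataset %>%
  pivot_longer(c(var1, var2), names_to = "var", values_to = "val") %>%
  count(var, val)   # NA is kept as its own category
freq
```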
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor.
There must be an R-ly way to call wilcox.test over multiple observations in parallel using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code below, using magrittr pipes and summarize().
library(dplyr)
library(magrittr)
# create a data frame where x is the dependent variable, id1 is a category variable (here with five levels), and id2 is a binary category variable used for the two-sample wilcoxon test
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
# make sure piping and grouping are called correctly, with "sum" function as a well-behaving example function
df %>% group_by(id1) %>% summarise(s=sum(x))
df %>% group_by(id1,id2) %>% summarise(s=sum(x))
# make sure wilcox.test is called correctly
wilcox.test(x~id2, data=df, paired=FALSE)$p.value
# yet, cannot call wilcox.test within pipe with summarise (regardless of group_by). Expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w=wilcox.test(x~id2, data=., paired=FALSE)$p.value)
df %>% summarise(wilcox.test(x~id2, data=., paired=FALSE))
# even specifying formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w=wilcox.test(formula=x~id2, data=., paired=FALSE)$p.value)
The buggy calls yield this error:
Error in wilcox.test.formula(c(1.09057358373486,
2.28465932554436, 0.885617572657959, : 'formula' missing or incorrect
Thanks for your help; I hope it will be helpful to others with similar questions as well.
Your task will be easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain will look like this:
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
df <- tbl_df(df)
res <- df %>% group_by(id1) %>%
do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>%
summarise(id1, Wilcox = w$p.value)
output
res
Source: local data frame [5 x 2]
id1 Wilcox
(int) (dbl)
1 1 0.6904762
2 2 0.4206349
3 3 1.0000000
4 4 0.6904762
5 5 1.0000000
Note I added the do function between the group_by and summarize.
I hope it helps.
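If you would rather stay within summarise(), one option is to skip the formula interface and hand wilcox.test the two samples directly. This is only a sketch of that idea, assuming id2 takes exactly the values 1 and 2 as in the example data; the seed and the name res2 are added here for illustration:

```r
library(dplyr)

set.seed(1)  # only to make the example reproducible
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))

res2 <- df %>%
  group_by(id1) %>%
  # within each id1 group, compare x for id2 == 1 against x for id2 == 2
  summarise(Wilcox = wilcox.test(x[id2 == 1], x[id2 == 2], paired = FALSE)$p.value)
res2
```

Because x and id2 are evaluated per group inside summarise(), no data argument is needed.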
You can do this with base R (although the result is a cumbersome list):
by(df, df$id1, function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
or with plyr (ddply and .() come from the plyr package, not dplyr):
library(plyr)
ddply(df, .(id1), function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
id1 V1
1 1 0.3095238
2 2 1.0000000
3 3 0.8412698
4 4 0.6904762
5 5 0.3095238