How to manipulate a data.frame by factor with dplyr - r

df <- data.frame(a=factor(c(1,1,2,2,3,3) ), b=c(1,1, 10,10, 20,20) )
a b
1 1 1
2 1 1
3 2 10
4 2 10
5 3 20
6 3 20
I want to split the data frame by column a, calculate b/sum(b) in each group, and put the result in column c. With plyr I can do:
library(plyr)
fun <- function(x) {
  x$c <- x$b / sum(x$b)
  x
}
ddply(df, .(a), fun)
and have
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
but how can I do it with dplyr?
df %.% group_by(a) %.% do(fun)
returns a list instead of a data.frame.

Use group_by followed by mutate:
library(dplyr)
df %>%
  group_by(a) %>%
  mutate(c = b / sum(b))
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
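If the per-group logic already lives in a helper such as fun from the question, it may be possible to reuse it instead of rewriting it as a mutate call. The following is a minimal sketch, assuming dplyr 0.8.1 or later, where group_modify is available; res is a name introduced here for illustration.

```r
library(dplyr)

df <- data.frame(a = factor(c(1, 1, 2, 2, 3, 3)), b = c(1, 1, 10, 10, 20, 20))

# Same helper as in the question: adds each row's share of its group total.
fun <- function(x) {
  x$c <- x$b / sum(x$b)
  x
}

res <- df %>%
  group_by(a) %>%
  group_modify(~ fun(.x)) %>%  # .x holds one group's rows without the grouping column
  ungroup()
```

For a one-line computation like this, mutate is shorter; group_modify is mainly useful when the helper does more than add a column.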

For a base R solution, you can use transform (the base R counterpart of mutate) together with ave, which splits a vector by group and applies a function to each piece.
> transform(df, c=ave(b,a, FUN= function(b) b/sum(b)))
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
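To see what ave contributes, the same result can be built in two steps. This is a sketch in base R only; group_sum is a name introduced here for illustration.

```r
df <- data.frame(a = factor(c(1, 1, 2, 2, 3, 3)), b = c(1, 1, 10, 10, 20, 20))

# ave returns a vector as long as b, holding each row's group total.
group_sum <- ave(df$b, df$a, FUN = sum)

# Dividing row by row gives the share of each value within its group.
df$c <- df$b / group_sum
```

Because ave keeps the original row order, no merge or re-sorting is needed afterwards.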

Related

Split up grouped binomial data in r

I have data that looks like this
samplesize <- 6
group <- c(1,2,3)
total <- rep(samplesize,length(group))
outcomeTrue <- c(2,1,3)
df <- data.frame(group,total,outcomeTrue)
and would like my data to look like this
group2 <- c(rep(1,6),rep(2,6),rep(3,6))
outcomeTrue2 <- c(rep(1,2),rep(0,6-2),rep(1,1),rep(0,6-1),rep(1,3),rep(0,6-3))
df2 <- data.frame(group2,outcomeTrue2)
That is to say, I have binary data where I am given the total number of observations and the number of successful observations, but would prefer it organised as individual observations with an explicit outcome of 0 or 1.
Is there an easy way to do this in R, or will I need to write a loop to automate this myself?
Here is one option with the tidyverse. We use uncount to expand the rows according to the 'total' column, then, grouped by 'group', create a binary indicator from a logical condition based on row_number() and the value of 'outcomeTrue':
library(tidyverse)
df %>%
uncount(total) %>%
group_by(group) %>%
mutate(outcomeTrue = as.integer(row_number() <= outcomeTrue[1]))
# A tibble: 18 x 2
# Groups: group [3]
# group outcomeTrue
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 0
# 4 1 0
# 5 1 0
# 6 1 0
# 7 2 1
# 8 2 0
# 9 2 0
#10 2 0
#11 2 0
#12 2 0
#13 3 1
#14 3 1
#15 3 1
#16 3 0
#17 3 0
#18 3 0
You are almost there. Just use the group2 variable in the row position of the "[" function:
df[ group2 , ]
group total outcomeTrue
1 1 6 2
1.1 1 6 2
1.2 1 6 2
1.3 1 6 2
1.4 1 6 2
1.5 1 6 2
2 2 6 1
2.1 2 6 1
2.2 2 6 1
2.3 2 6 1
2.4 2 6 1
2.5 2 6 1
3 3 6 3
3.1 3 6 3
3.2 3 6 3
3.3 3 6 3
3.4 3 6 3
3.5 3 6 3
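When a number or character value that matches a row name is put in the row position of "[", the entire row is returned, so a repeated value repeats the row. This works here because the group values 1, 2, 3 coincide with the row numbers. Note that it only replicates the rows; the 0/1 outcome column still has to be built.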
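The row replication above can be taken one step further to produce the 0/1 outcome. This is a sketch in base R under the assumption that each row should be repeated total times; idx, long and outcome are names introduced here for illustration.

```r
df <- data.frame(group = c(1, 2, 3), total = 6, outcomeTrue = c(2, 1, 3))

# Repeat each row index `total` times instead of relying on group2.
idx <- rep(seq_len(nrow(df)), times = df$total)
long <- df[idx, ]

# sequence() counts 1..total within each group; the first outcomeTrue rows get 1.
long$outcome <- as.integer(sequence(df$total) <= long$outcomeTrue)

long <- long[, c("group", "outcome")]
rownames(long) <- NULL
```

Indexing by row position rather than by group value keeps the approach valid when the group labels are not 1, 2, 3.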
Here is a base R solution.
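# Use the full column name outcomeTrue rather than relying on partial matching of x$outcome
do.call(rbind, lapply(split(df, df$group), function(x) data.frame(group2 = x$group, outcome2 = rep(c(1,0), times = c(x$outcomeTrue, x$total-x$outcomeTrue)))))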
# group2 outcome2
# 1.1 1 1
# 1.2 1 1
# 1.3 1 0
# 1.4 1 0
# 1.5 1 0
# 1.6 1 0
# 2.1 2 1
# 2.2 2 0
# 2.3 2 0
# 2.4 2 0
# 2.5 2 0
# 2.6 2 0
# 3.1 3 1
# 3.2 3 1
# 3.3 3 1
# 3.4 3 0
# 3.5 3 0
# 3.6 3 0
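One way to gain confidence in the split/rbind expansion above is to collapse the long data back to counts and compare with the original. This is a sketch in base R; long and check are names introduced here for illustration.

```r
df <- data.frame(group = c(1, 2, 3), total = 6, outcomeTrue = c(2, 1, 3))

# Expand each group into one row per observation, successes first.
long <- do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(
    group2 = x$group,
    outcome2 = rep(c(1, 0), times = c(x$outcomeTrue, x$total - x$outcomeTrue))
  )
}))

# Collapse back to one row per group: row count and number of successes.
counts <- table(long$group2)
check <- data.frame(
  group = as.numeric(names(counts)),
  total = as.vector(counts),
  outcomeTrue = as.vector(tapply(long$outcome2, long$group2, sum))
)
```

If check matches the original df, the expansion has preserved both the group sizes and the number of successes.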

Expanding data frame using tidyverse [duplicate]

Here's an example of what I'm trying to do:
df <- data.frame(
id = letters[1:5],
enum_start = c(1, 1, 1, 1, 1),
enum_end = c(1, 5, 3, 7, 2)
)
library(dplyr)
df2 <- df %>%
  split(.$id) %>%
  lapply(function(x) cbind(x, hello = seq(x$enum_start, x$enum_end, by = 1L))) %>%
  bind_rows()
df2
# id enum_start enum_end hello
# 1 a 1 1 1
# 2 b 1 5 1
# 3 b 1 5 2
# 4 b 1 5 3
# 5 b 1 5 4
# 6 b 1 5 5
# 7 c 1 3 1
# 8 c 1 3 2
# 9 c 1 3 3
# 10 d 1 7 1
# 11 d 1 7 2
# 12 d 1 7 3
# 13 d 1 7 4
# 14 d 1 7 5
# 15 d 1 7 6
# 16 d 1 7 7
# 17 e 1 2 1
# 18 e 1 2 2
Note that the starting and ending values for hello depend on the data and hence the number of rows for each id is dynamic. I'm looking for a solution that involves maybe expand from tidyr but am struggling.
Here's a dplyr/tidyr approach:
library(dplyr)
library(tidyr)
group_by(df, id) %>%
  expand(enum_start, enum_end, hello = full_seq(enum_end:enum_start, 1))
I am not sure whether there is a tidyr way to do this without grouping the data (it would be interesting to know).
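On the question of avoiding the grouping step, one possibility is to build a list column holding one sequence per row and then unnest it. This is a sketch, not necessarily the canonical tidyr idiom; it assumes dplyr and tidyr are installed and uses base R's Map to create the list column.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = letters[1:5],
  enum_start = c(1, 1, 1, 1, 1),
  enum_end = c(1, 5, 3, 7, 2)
)

df2 <- df %>%
  mutate(hello = Map(seq, enum_start, enum_end)) %>%  # list column, one sequence per row
  unnest(hello)                                       # one output row per sequence element
```

Because the sequences are built row by row, this does not depend on id being unique.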
Here is a base R method that produces the desired output.
dfNew <- within(df[rep(seq_len(nrow(df)), df$enum_end), ],
hello <- sequence(df$enum_end))
sequence takes a vector of lengths and, for each length n, returns 1, ..., n, concatenating the results. It is used to produce the "hello" variable. within reduces typing and returns a modified data.frame. I fed it an expanded version of df in which each row is repeated enum_end times using rep and [. Note that this relies on enum_start being 1 in every row, as it is in the example.
dfNew
id enum_start enum_end hello
1 a 1 1 1
2 b 1 5 1
2.1 b 1 5 2
2.2 b 1 5 3
2.3 b 1 5 4
2.4 b 1 5 5
3 c 1 3 1
3.1 c 1 3 2
3.2 c 1 3 3
4 d 1 7 1
4.1 d 1 7 2
4.2 d 1 7 3
4.3 d 1 7 4
4.4 d 1 7 5
4.5 d 1 7 6
4.6 d 1 7 7
5 e 1 2 1
5.1 e 1 2 2
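When enum_start is not always 1, the same idea still works if the row counts and the offsets are computed from both columns. This is a sketch in base R; the data below are hypothetical values chosen so that the starts differ, and n is a name introduced here for illustration.

```r
df <- data.frame(
  id = letters[1:3],
  enum_start = c(1, 3, 2),  # hypothetical starts that are not all 1
  enum_end = c(1, 5, 4)
)

# Number of rows each id needs: the length of start:end.
n <- df$enum_end - df$enum_start + 1

# Repeat each row n times, then shift the 1..n counter by the row's start.
dfNew <- df[rep(seq_len(nrow(df)), times = n), ]
dfNew$hello <- sequence(n) + dfNew$enum_start - 1
```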

Cumulative sum minus mean with R

For each position, I am trying to compute the cumulative sum of the values up to that position minus the mean of the column. For example, I have:
A
1
2
3
4
5
and I want this:
A B
1 NA
2 3-mean(A)
3 6-mean(A)
4 10-mean(A)
5 15-mean(A)
Not sure why you want NA as the first value for the B column. Here I use 1-mean(A) instead:
> A <- 1:5
> data.frame(A=A, B=cumsum(A)-mean(A))
A B
1 1 -2
2 2 0
3 3 3
4 4 7
5 5 12
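With dplyr, lag yields NA in the first row. Note that this computes the mean minus the cumulative sum of the previous values, so the sign and the offset differ from the output requested in the question:
library(dplyr)
x <- data.frame(a = 1:6)
x %>%
  mutate(mycol = mean(a) - lag(cumsum(a), 1))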
a mycol
1 1 NA
2 2 2.5
3 3 0.5
4 4 -2.5
5 5 -6.5
6 6 -11.5
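To reproduce the layout shown in the question, with NA in the first row and cumsum(A) - mean(A) elsewhere, a sketch in base R could look like this; result is a name introduced here for illustration.

```r
A <- 1:5

# Cumulative sum at each position minus the overall mean of A.
B <- cumsum(A) - mean(A)

# The question asks for NA in the first row.
B[1] <- NA

result <- data.frame(A = A, B = B)
```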

Reshaping a data frame and setting flag variables

I want to reshape my data frame from the df1 to df2 as appears below:
df1 <-
ID TIME RATEALL CL V1 Q V2
1 0 0 2.4 10 6 20
1 1 2 0.6 10 6 25
2 0 0 3.0 15 7 30
2 5 3 3.0 16 8 15
into a long format like this:
df2 <-
ID var TIME value
1 1 0 0
1 1 1 2
1 2 0 2.4
1 2 1 0.6
1 3 0 10
1 3 1 10
1 4 0 6
1 4 1 6
1 5 0 20
1 5 1 25
2 1 0 0
2 1 5 3
and so on ...
Basically, I want to assign a flag variable (1 for RATEALL, 2 for CL, 3 for V1, 4 for Q, and 5 for V2) and then melt the values for each subject ID. Is there an easy way to do this in R?
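You can try:
df2 <- reshape2::melt(df1, c("ID", "TIME"))
# Look the flag up by variable name rather than by factor code, and avoid masking base::names
flags <- c("RATEALL"=1, "CL"=2, "V1"=3, "Q"=4, "V2"=5)
df2$variable <- unname(flags[as.character(df2$variable)])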
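You could use tidyr/dplyr. Setting factor_key = TRUE keeps the original column order as the factor levels, so the flags follow RATEALL, CL, V1, Q, V2 rather than alphabetical order:
library(tidyr)
library(dplyr)
res <- gather(df1, var, value, RATEALL:V2, factor_key = TRUE) %>%
  mutate(var = as.numeric(var))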
head(res)
# ID TIME var value
#1 1 0 1 0.0
#2 1 1 1 2.0
#3 2 0 1 0.0
#4 2 5 1 3.0
#5 1 0 2 2.4
#6 1 1 2 0.6
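For comparison, the same long format can be built in base R with the flag order stated explicitly. This is a sketch; the data.frame call below is one way to enter the df1 table from the question, and measure is a name introduced here for illustration.

```r
df1 <- data.frame(
  ID = c(1, 1, 2, 2),
  TIME = c(0, 1, 0, 5),
  RATEALL = c(0, 2, 0, 3),
  CL = c(2.4, 0.6, 3.0, 3.0),
  V1 = c(10, 10, 15, 16),
  Q = c(6, 6, 7, 8),
  V2 = c(20, 25, 30, 15)
)

# The position in this vector is the flag: 1 = RATEALL, ..., 5 = V2.
measure <- c("RATEALL", "CL", "V1", "Q", "V2")

df2 <- data.frame(
  ID = rep(df1$ID, times = length(measure)),
  var = rep(seq_along(measure), each = nrow(df1)),
  TIME = rep(df1$TIME, times = length(measure)),
  value = unlist(df1[measure], use.names = FALSE)  # columns stacked in measure order
)

# Sort by subject, then flag, then time, as in the requested layout.
df2 <- df2[order(df2$ID, df2$var, df2$TIME), ]
rownames(df2) <- NULL
```

Listing the columns in measure makes the flag assignment independent of factor level ordering.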

R: create variables up to power of n

Say I have a data frame with m variables. How can I generate all their powers and cross products up to total degree n? For example, df is a data frame with 2 variables a and b:
df <- data.frame(a=c(1,2), b=c(3,4))
I want to add variables up to power of 3, which means adding to df these generated columns:
a^2, a*b, b^2, a^3, a^2*b, b^2*a, b^3
How can I do this?
Use polym:
df <- data.frame(a=c(1,2), b=c(3,4))
# a b
#1 1 3
#2 2 4
res <- do.call(polym, c(df, degree=3, raw=TRUE))
# 1.0 2.0 3.0 0.1 1.1 2.1 0.2 1.2 0.3
#[1,] 1 1 1 3 3 3 9 9 27
#[2,] 2 4 8 4 8 16 16 32 64
#attr(,"degree")
#[1] 1 2 3 1 2 3 2 3 3
Edit:
Here is a possibility to create the desired column names:
colnames(res) <- apply(
do.call(rbind,
strsplit(colnames(res), ".", fixed=TRUE)),
1,
function(x) paste(rep(names(df), as.integer(x)), collapse="")
)
# a aa aaa b ab aab bb abb bbb
#[1,] 1 1 1 3 3 3 9 9 27
#[2,] 2 4 8 4 8 16 16 32 64
#attr(,"degree")
#[1] 1 2 3 1 2 3 2 3 3
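To attach the generated terms to the original data frame, the degree-1 columns can be dropped, since they duplicate a and b. This is a sketch that keeps polym's default "i.j" column names, where column "i.j" holds a^i * b^j; mixed_ok, higher and out are names introduced here for illustration.

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))
res <- do.call(polym, c(df, degree = 3, raw = TRUE))

# Spot check: column "2.1" should equal a^2 * b.
mixed_ok <- isTRUE(all.equal(unname(res[, "2.1"]), df$a^2 * df$b))

# Drop the columns that just repeat a and b, then bind the rest to df.
keep <- !(colnames(res) %in% c("1.0", "0.1"))
higher <- unclass(res)[, keep, drop = FALSE]
out <- cbind(df, as.data.frame(higher))
```

The result has the two original columns plus the seven higher-order terms listed in the question.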
