R: aggregate-like function that ensures interactions of all factor levels

I'm wondering how I can ensure that all interactions of factor levels are included when using aggregate, even if some combinations don't appear in the given dataset.
dff <- data.frame(a = as.factor(c(rep(1, 3), rep(2, 4), rep(3, 3))),
                  b = as.factor(c(rep("A", 4), rep("B", 6))),
                  c = sample(100, 10))
levels(dff$b) <- c(levels(dff$b), "C")
levels(dff$a) <- c(levels(dff$a), 10)
dff$b
#[1] A A A A B B B B B B
#Levels: A B C
dff$a
#[1] 1 1 1 2 2 2 2 3 3 3
#Levels: 1 2 3 10
aggregate(c~a+b, dff, sum)
# a b c
#1 1 A 233
#2 2 A 78
#3 2 B 212
#4 3 B 73
What I want is:
a b c
1 1 A 233
2 1 B 0
3 1 C 0
4 2 A 78
5 2 B 212
6 2 C 0
7 3 A 0
8 3 B 73
9 3 C 0
10 10 A 0
11 10 B 0
12 10 C 0
NA is fine too.
The reason I want it in this format is that I need to combine dff$c with results from other datasets, and those results may be of different lengths if not all factor levels are accounted for. I'm trying to avoid merge and instead use vector calculations.
Thank you in advance.

If your aggregation function is just going to be sum, you can use xtabs, which creates an object of class "table". You can then call data.frame on it, which dispatches the corresponding method and produces a "long" data.frame.
data.frame(xtabs(c ~ b + a, dff))
# b a Freq
# 1 A 1 121
# 2 B 1 0
# 3 C 1 0
# 4 A 2 89
# 5 B 2 203
# 6 C 2 0
# 7 A 3 0
# 8 B 3 126
# 9 C 3 0
# 10 A 10 0
# 11 B 10 0
# 12 C 10 0
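If you also want the exact column names and ordering from the question (a, b, c rather than b, a, Freq), a little post-processing gets there; the renaming and reordering below are my addition, not part of the original suggestion:
out <- data.frame(xtabs(c ~ b + a, dff))
names(out)[names(out) == "Freq"] <- "c"     # rename the count column back to c
out[order(out$a, out$b), c("a", "b", "c")]  # sort by a, then b, and reorder the columns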
This is similar to @nicola's suggestion to use as.data.frame.table, which explicitly calls that method on something that is not of class "table" but can be treated as one.
One advantage of this approach (and all the others that follow) is that you can use different functions other than sum.
as.data.frame.table(tapply(dff$c, dff[c("a","b")], sum))
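For example, a minimal sketch swapping in mean instead of sum (combinations with no data come back as NA rather than 0; responseName just names the value column c):
as.data.frame.table(tapply(dff$c, dff[c("a", "b")], mean), responseName = "c")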
If merge is OK, you can continue with your aggregate step. In this case, we use expand.grid on the levels of your factor vectors:
merge(expand.grid(lapply(dff[c(1, 2)], levels)),
      aggregate(c ~ a + b, dff, sum, drop = FALSE), all = TRUE)
A similar approach can be taken in "data.table":
library(data.table)
as.data.table(dff)[, sum(c), by = .(a, b)][do.call(CJ, lapply(dff[c(1, 2)], levels)), on = c("a", "b")]
Or using "dplyr" + "tidyr" (which essentially hides the merge, but ultimately uses left_join to create the missing combinations):
library(dplyr)
library(tidyr)
dff %>%
  group_by(a, b) %>%
  summarise(c = sum(c)) %>%
  complete(a, b, fill = list(c = 0))

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x y
1 a
2 b
3 c
I'd like to repeat that, say, 3 times, with an index that looks like this:
x y index
1 a 1
2 b 1
3 c 1
1 a 2
2 b 2
3 c 2
1 a 3
2 b 3
3 c 3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <- df %>%
  mutate(index = 1) %>%
  rbind(df %>% mutate(index = 2)) %>%
  rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
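(For concreteness, here is a sketch of what the rep-only attempt mentioned above might look like; this is a reconstruction, not code from the question. rep repeats the x and y columns, but nothing records which repetition each row came from.)
df[rep(1:nrow(df), 3), ]  # x and y repeated 3 times, but no index column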
You can use this code for as many repetitions as you would like; you just have to set the n argument of replicate:
The replicate function takes two main arguments: n, the number of times we would like to reproduce our data set, and expr, the data set itself. The result is a list whose elements are copies of our data set.
We then pass that list to the imap function from the purrr package to assign a unique id to each copy. Here .x represents each element of the list (a data frame) and .y is that element's position, which corresponds to the repetition number. So, for example, the id column of the first copy is set to 1 because .y equals 1 for that element, and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
  imap_dfr(~ .x %>% mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
        mapply(function(x, z) {
          x$id <- z
          x
        }, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
You can use rerun to repeat the dataframe n times and add an index column using bind_rows -
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row index n times and build the index column with a matching rep call:
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
One more way, using map_dfr:
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3

Multiply columns in different dataframes

I am writing code to analyze a data set with dplyr.
Here is how my table_1 looks:
  A B C
2 5 2 3
3 9 4 1
4 6 3 8
5 3 7 3
And my table_2 looks like this:
  D E F
2 2 9 3
Based on column "A" of table_1: if A > 6, I would like to create a column "G" in table_1 equal to C*D + C*E.
Basically, it's like using table_2 as a set of factors (multipliers)...
Is there any way I can do it?
So far, I can apply a filter to column "A" and multiply column "C" by fixed numbers instead of the values from table_2:
table_1_New <- mutate(Table_1, G = if_else(A > 6, C*2 + C*9, 0))
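None of the snippets define the two tables as R objects; the answers below refer to them as df1 and df2, so here is an assumed reconstruction from the tables shown above (the row names mirror the spreadsheet-style row numbers):
df1 <- data.frame(A = c(5, 9, 6, 3), B = c(2, 4, 3, 7), C = c(3, 1, 8, 3), row.names = 2:5)
df2 <- data.frame(D = 2, E = 9, F = 3)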
You could try
#Initialize G column with 0
df1$G <- 0
#Get index where A value is greater than 6
inds <- df1$A > 6
#Multiply those values with D and E from df2
df1$G[inds] <- df1$C[inds] * df2$D + df1$C[inds] * df2$E
df1
# A B C G
#2 5 2 3 0
#3 9 4 1 11
#4 6 3 8 0
#5 3 7 3 0
Using dplyr, we can do
df1 %>% mutate(G = ifelse(A > 6, C*df2$D + C*df2$E, 0))

R: using if/else to append column in a list with objects of varying lengths

I am trying to append a column of values to the elements of an R list, where each element is of varying length. Here is the example data, foo, before splitting:
A B C
1 1 150
1 2 25
1 4 30
2 1 200
2 3 15
3 4 30
First, I split foo into a list with one element per unique value of A. Now, I would like to write a function that (a) sums the values of C for each value of A, but (b) excludes rows where B == 4; (c) the sum is appended as a new column D, and (d) C is divided by D to yield a proportion (column E). Ultimately, it would be combined into a new data frame that looks like:
A B C D E
1 1 150 175 0.857
1 2 25 175 0.143
1 4 30 175 0.171
2 1 200 215 0.930
2 3 15 215 0.070
3 4 30 0 0/NA
However, I'm having problems because in some cases, for a given value of A, there are only rows where B == 4 (here, where A == 3), so when I try to divide C by D, I get error messages.
Is there a way to incorporate an if/else statement into the function so that when A is unique and the only possible value of B is 4, the operation is skipped and a default non-zero value is placed in the appended column?
Subsetting the df to exclude cases where B == 4 makes later operations more difficult, but including cases where B == 4 makes the sum/proportion calculation inaccurate.
Any help is appreciated! Here is the current code:
goo <- lapply(foo, function(df) {
  df$D <- sum(df$C, na.rm = TRUE)
  df$E <- df$C / df$D
  ### .....
  df
})
Here's how I would do it using dplyr
library(dplyr)
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = C/D)
#newfoo # the resulting data.frame
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 Inf
Or if you want to avoid Inf, you can use ifelse like this:
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = ifelse(D == 0, 0, C/D))
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 0.00000000
And a data.table (possible) solution
library(data.table)
setDT(foo)[, D := sum(C[B != 4]), by = A][, E := C/D]
# foo
# A B C D E
# 1: 1 1 150 175 0.85714286
# 2: 1 2 25 175 0.14285714
# 3: 1 4 30 175 0.17142857
# 4: 2 1 200 215 0.93023256
# 5: 2 3 15 215 0.06976744
# 6: 3 4 30 0 Inf
Not sure what you want to put into column E when A == 3, but you can use is.finite for it and avoid messing around with ifelse, for example (replacing with a zero)
setDT(foo)[, D := sum(C[B!=4]), by = A][, E := C/D][!is.finite(E), E := 0]
Here is a solution using base R.
First, ensure that the data are modeled appropriately by converting A into a factor if it is not one already:
df$A <- factor(df$A)
Now, we can compute D using tapply, which iterates groupwise and returns the result as a table. We do this with the subset of df where B != 4.
df$D <- with(subset(df, B != 4), tapply(C, A, sum))[df$A]
Note that since A is a factor, we can index into the table to perform the merge. Now we can use ifelse to compute E:
df$E <- with(df, ifelse(is.na(D), 0, C/D))
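Putting those base steps together on the question's example (a sketch assuming foo is the original, unsplit data frame shown at the top of the question, rather than the split list):
foo <- data.frame(A = c(1, 1, 1, 2, 2, 3),
                  B = c(1, 2, 4, 1, 3, 4),
                  C = c(150, 25, 30, 200, 15, 30))
foo$A <- factor(foo$A)                                        # grouping variable as a factor
foo$D <- with(subset(foo, B != 4), tapply(C, A, sum))[foo$A]  # group sums, excluding B == 4
foo$E <- with(foo, ifelse(is.na(D), 0, C / D))                # proportions, 0 where no valid sum exists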

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for loops, e.g. using the reshape2 package? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet.)
Edit: For those of you who also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal <- function(data, factors, responses, responsename) {
  library(reshape2)
  data$ID <- 1:nrow(data)                               # add an ID to preserve row order
  dL <- melt(data, id.vars = c("ID", factors))          # `melt` the data
  dL <- dL[order(dL$ID), ]                              # sort the molten data
  dL[, responsename] <- match(dL$variable, responses)   # convert responses to ordinal scores
  dL[, responsename] <- factor(dL[, responsename], ordered = TRUE)
  dL <- dL[dL$value != 0, ]                             # drop rows where `value == 0`
  out <- dL[rep(rownames(dL), dL$value), c(factors, responsename)]  # use `rep` to "expand" the data.frame; keep only wanted columns
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2

Condensing Data Frame in R

I just have a simple question; I really appreciate everyone's input, and you have been a great help to my project. I have an additional question about data frames in R.
I have a data frame that looks something like this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeated characters into one row each, so it looks similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course the data are all the same; they are just condensed, and new columns are formed to hold the values. I am sure there is an easy way to do it, but I haven't seen anything for this in the books I have looked through!
EDIT: I edited the example because it wasn't working with the answers so far. I wonder if the NAs, blanks, and unevenness from the blanks are contributing?
hereĀ“s a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
I changed T to my.df to avoid a bad habit (T is built-in shorthand for TRUE). This returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2
