Doing computations on lists inside data frames - r

I want to aggregate data frame f into a new data frame g so that the column g$z contains a list of all group-wise values from column f$z. At first sight, this seems to be working:
f = data.frame(x=c(1, 1, 1, 2), y=c(4, 4, 5, 6), z=c(11, 12, 13, 14))
g = aggregate(z ~ x + y, f, c)
x y z
1 1 4 11, 12
2 1 5 13
3 2 6 14
Now I want to do different computations on the lists in column c for all rows in the data frame and put the result in new columns in the same data frame. But this doesn't work!
g$m = sum(g$z)
g$n = g$z + 1
Error in sum(g$z) : invalid 'type' (list) of argument
How can I work with lists inside a data frame cell like attempted above? Or is this simply un-R-like / impossible? If so, what is the correct approach?
UPDATE
My underlying goal is to do a lot of group-wise operations on all combinations of X and Y in the original data set. What options do I have for this in R in general?
Use apply. Pro: Everything in one table. Con: Complex table structure, can't use sum etc.
for(y), for(x), subset. Pro: Can do sum etc. directly. Con: Lots of code, and possibly slow.
Work in parallell w/original and aggregated table. Pro: Can do sum etc. Con: Data duplication.
Other options?

sum and Vectorization doesn't apply to lists, you can simply use sapply and lapply for the task:
g$m <- sapply(g$z, sum)
g$n <- lapply(g$z, `+`, 1)
g
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15
Or if you use tidyverse, you can use map + mutate:
g %>% mutate(m = map_dbl(z, sum), n = map(z, ~.x + 1))
# x y z m n
#1 1 4 11, 12 23 12, 13
#2 1 5 13 13 14
#3 2 6 14 14 15

Related

R | Create a cross matrix

I don´t know how or where to start, but i hope someone can help. It´s the first time i´d use R like this, so even a keyword or a recommendation where to look it up would be helpful.
My dataframe looks like this:
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
And I would like to get a cross table like this:
using
length(which(df$X == & df$Y == ))
I could calculate the data with R and fill it in my Excel-sheet but there has to be a better option.
Thank you in advance.
Try this base R solution:
#Data
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
#Code
addmargins(table(df$X,df$Y))
Output:
1 2 3 Sum
1 6 7 5 18
2 4 6 9 19
3 5 5 3 13
Sum 15 18 17 50
You can also change the order of your variables like this:
#Code2
addmargins(table(df$Y,df$X))
Output:
1 2 3 Sum
1 6 4 5 15
2 7 6 5 18
3 5 9 3 17
Sum 18 19 13 50
In order to export to MS Excel, you use this code:
library(xlsx)
#Transform to dataframe
d1 <- as.data.frame.matrix(addmargins(table(df$X,df$Y)))
#Export
write.xlsx(d1,file='myexample.xlsx','Sheet1')
If the data have only two columns, just pass the data.frame object to table.
addmargins(table(df))
If the data include more than two columns, you can subset it's variable before passing to table().
addmargins(table(df[c("X", "Y")]))
You can also pass a formula to xtabs().
addmargins(xtabs( ~ X + Y, df))
All of above give
Y
X 1 2 3 Sum
1 5 6 3 14
2 2 6 6 14
3 13 4 5 22
Sum 20 16 14 50
To export the table to an excel file, you can use write.xlsx() from openxlsx.
library(openxlsx)
tab <- addmargins(xtabs( ~ X + Y, df))
write.xlsx(tab, "foo.xlsx")

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values on the variables with the ending "x1" with their corresponding values on the variables with the ending "x2". That is, if ax1 is missing it is replaced by ax2 and any missingness on bx1 is replaced by bx2 and so on. Since there are many variables than the scenario presented here, I am looking for a way to automate this process. I have tried the following codes
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace NA value in x1 with corresponding NA values in x2 using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
Modified data a bit making sure that NA are actual NAs and not string "NA" which were actually making columns as factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))

How to pass a multivariate vector valued function (with variable length output) to aggregate

I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
tempstate <- state[1]
energyvec <- energy[1]
for(i in 2:length(state)){
if(state[i] != tempstate){
energyvec <- c(energyvec, energy[i])
tempstate <- state[i]
}
}
return(energyvec)
}
And try to pass it to aggregate somehow
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[,myfun(state, energy), by= Particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= Particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually. Basically, create a changed_state variable by checking if the "this" state matches the last state lag(state, 1). Since we only care when this happens, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE). The toString function collapses the rows of energy as desired and we are already "grouped" by particle.
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#craete final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2

create a new variable in r using for loop based on the condition of an existing variable

Suppose the data frame is like this:
df <- data.frame(x = c(1,7,8,15,24,100,9,19,128))
How do I create a new variable that satisfies the following condition:
y = 1 if 1<=x<=7
y = 2 if 8<=x<=14
y = 3 if 15<=x<=21
...
y = k if 1+7*(k-1)<= x<= 7+7*(k-1)
so that I can have the new data frame like this
df <- data.frame(y = c(1,1,2,3,4,15, 2,3, 19))
I am wondering if a for loop can be applied in this case.
Via simple algebra, you can do:
df$y <- floor((df$x+6)/7)
df
# x y
# 1 1 1
# 2 7 1
# 3 8 2
# 4 15 3
# 5 24 4
# 6 100 15
# 7 9 2
# 8 19 3
# 9 128 19
In R you will often find it easier (less typing and less thinking) to use vectorized operators than for loops for simple computations like this. In this case we performed calls to +, /, and floor over a whole vector instead of looping and using them on each element.

creating a sum matrix from counts

I have one matrix of mutation counts, say "counts". This matrix has column names V1, V2,...,Vi,...Vn where not every "i" is there. Thus it can jump, such as V1, V2, V5 say. Further, most of columns have a 0 in them.
I need to create a sum matrix, called "answer", where element i, j is the sum of the number of the number counts at both i and j. At the i, i element it just shows the number of counts at i.
Here's a quick data set up. I already have the correct dimensioned matrix set up in my code called "answer". Thus what I would need to automate are the last several lines where I fill in the matrix.
counts <- matrix(data = c(0,2,0,5,0,6,0), nrow = 1, ncol = 7, dimnames=list("",c("V1","V2","V3","V4","V5","V6","V7")))
answer <- matrix(data =0, nrow = 3, ncol = 3, dimnames = list(c("V2","V4","V6"),c("V2","V4","V6")))
answer[1,1] <- 2
answer[1,2] <- 7
answer[1,3] <- 8
answer[2,1] <- 7
answer[2,2] <- 5
answer[2,3] <- 11
answer[3,1] <- 8
answer[3,2] <- 11
answer[3,3] <- 6
I understand I can do this with 2 nested for loops, but surely there must be a better way no? Thanks!
This could be done with the right use of expand.grid and rowSums:
n = counts[, counts > 0]
answer = matrix(rowSums(expand.grid(n, n)), nrow=length(n), dimnames=list(names(n), names(n)))
diag(answer) = n
To show how it works, n would end up being:
V2 V4 V5
2 5 6
and expand.grid(n, n) would be:
Var1 Var2
1 2 2
2 5 2
3 6 2
4 2 5
5 5 5
6 6 5
7 2 6
8 5 6
9 6 6
The last line (diag) is necessary because otherwise the diagonal would be twice the original vector (adding 2+2, 5+5, or 6+6).

Resources