Creating data.frames where one column contains matrices - r

I would like to create a list column of matrices, where the entries of each matrix are elements from variables already present in the original dataset. My goal is to create a 2 x 2 contingency table for each row of the data set, and subsequently pass each matrix as an argument to fisher.test.
I have tried adding the new column using a combination of mutate and matrix, but this returns an error. I've also tried using do instead of mutate and this seems like a step in the right direction, but I know this is also incorrect, because the dimensions of the elements are off, and there is only one row in the output.
library(tidyverse)
mtcars %>%
  mutate(mat = matrix(c(.$disp, .$hp, .$gear, .$carb)))
#> Error: Column `mat` must be length 32 (the number of rows) or one, not 128
mtcars %>%
  do(mat = matrix(c(.$disp, .$hp, .$gear, .$carb)))
#> # A tibble: 1 x 1
#>   mat
#>   <list>
#> 1 <dbl [128 x 1]>
Created on 2019-06-05 by the reprex package (v0.2.1)
I am expecting 32 rows in my output, and the mat column to contain 32 2x2 matrices composed of entries from mtcars$disp, mtcars$hp, mtcars$gear, and mtcars$carb.
My intent is to use map to pass each entry in the mat column as an argument to fisher.test, then extract the odds ratio estimate, and the p-value. But the main focus, of course, is creation of the list of matrices.

You have two issues:
To store a matrix in a data.frame (tibble), you simply have to put it in a list.
To create 2 x 2 matrices (instead of storing the same 128 x 1 matrix in each cell), you need to work row by row. Currently, matrix(c(disp, hp, gear, carb)) stacks all four full-length columns into a single 128 x 1 matrix! You want only the 4 values of each row, reshaped to 2 x 2. You can check the dimensions yourself, as shown below.
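# All four columns stacked give one 128 x 1 matrix, not 32 small ones
dim(matrix(c(mtcars$disp, mtcars$hp, mtcars$gear, mtcars$carb)))
#> [1] 128   1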
Working with pmap allows you to process the rows one by one, but alternatively you can use rowwise which groups by row:
library(tidyverse)
df <- mtcars %>%
  as_tibble() %>%
  rowwise() %>%
  mutate(mat = list(matrix(c(disp, hp, gear, carb), 2, 2)))
Edit: Now how do you actually use those? Let's take the example of a fisher.test. Note that a test is a complex object, with components (like p.value) and attributes, so we'll have to store them in a list-column.
You can either keep working rowwise, in which case the list is automagically "unlist-ed":
df %>%
  # keep in mind df is still grouped by row, so `mat` is only one matrix.
  # A test is a complex object, so we need to store it in a list-column
  mutate(test = list(fisher.test(mat)),
         # `test` is just one test, so we can extract the p-value directly
         pval = test$p.value)
Or if you stop working row by row (for which you simply need to ungroup), then mat is a list of matrices onto which you can map functions. We use the map functions from purrr.
library("purrr")
df %>%
ungroup() %>%
# Apply the test to each mat using `map` from `purrr`
# `map` returns a list so `test` is a list-column
mutate(test = map(mat, fisher.test),
# Now `test` is a list of tests... so you need to map operations onto it
# Extract the p-values from each test, into a numeric column rather than a list-column
pval = map_dbl(test, pluck, "p.value"))
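Since the question also asks for the odds ratio, here is a sketch along the same lines (for a 2 x 2 table, the `estimate` component returned by fisher.test is the odds ratio estimate):
df %>%
  ungroup() %>%
  mutate(test = map(mat, fisher.test),
         # `estimate` holds the odds ratio estimate for a 2 x 2 table
         or   = map_dbl(test, pluck, "estimate"),
         pval = map_dbl(test, pluck, "p.value"))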
Which one you prefer is a matter of taste :)

You can use the pmap function from the purrr package inside mutate:
library(tidyverse)
mtcars %>%
  as_tibble() %>%
  mutate(mat = pmap(list(disp, hp, gear, carb),
                    ~matrix(c(..1, ..2, ..3, ..4), 2, 2)))
# A tibble: 32 x 12
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb mat
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
1    21     6   160   110   3.9  2.62  16.5     0     1     4     4 <dbl[,2] [2 x 2]>
2    21     6   160   110   3.9  2.88  17.0     0     1     4     4 <dbl[,2] [2 x 2]>
Each entry of mat is then a 2x2 matrix with the desired elements. Hope this helps.
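From there you can carry on exactly as the question intends; a sketch (same idea as the other answer):
mtcars %>%
  as_tibble() %>%
  mutate(mat  = pmap(list(disp, hp, gear, carb),
                     ~matrix(c(..1, ..2, ..3, ..4), 2, 2)),
         # map fisher.test over each 2 x 2 matrix, then pull out the p-value
         test = map(mat, fisher.test),
         pval = map_dbl(test, "p.value"))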

Related

R arrange function seems not to be working

I have a dataframe with timestamps that also have decimal values. I want to calculate the difference between the first event and all other events from the same group. To do that I use the following code:
library(dplyr)

values <- c("1671535501.862424", "1671535502.060679", "1671535502.257422",
            "1671535502.472993", "1671535502.652619", "1671535502.856569",
            "1671535503.048685", "1671535503.245988")
column_b <- c("a", "a", "a", "a", "a", "a", "a", "a")
values <- as.numeric(values)

#-- Calculate differences
data <- data.frame(values, column_b)  # create data frame
res <- data %>%
  group_by(column_b) %>%
  arrange(values) %>%
  mutate(time = values - lag(values, default = first(values)))
In general, the code does exactly what I expect it to do. It groups them, arranges them, and calculates the difference for each group. The output looks like this:
> res
# A tibble: 8 × 3
# Groups:   column_b [2]
       values column_b  time
        <dbl> <fct>    <dbl>
1 1671535502. a        0
2 1671535502. a        0.198
3 1671535502. a        0.197
4 1671535502. a        0.216
5 1671535503. a        0.180
6 1671535503. a        0.204
7 1671535503. a        0.192
8 1671535503. a        0.197
Nevertheless, I have my doubts about the math in the results. If I am not mistaken, the values in this example are already sorted, but even if that were not the case, arrange() should have done the job. Hence, if it is arranging the values, how can the 4th time value be larger than the 5th? There are multiple places where the output does not seem to make sense. What am I missing?
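For reference, here is a minimal sketch (with made-up numbers) contrasting what my code computes, consecutive gaps via lag(), with the distance from the first event that I described; the consecutive gaps are not monotone even on sorted data:
library(dplyr)

tibble(values = c(1, 2.2, 3.1, 4.5)) %>%
  mutate(gap        = values - lag(values, default = first(values)),
         from_first = values - first(values))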

Tidyr: pivot_wider error: Can't convert <double> to <list>

I have a dataframe which lists species observations across multiple survey plots (the data is here). I'm trying to use tidyr's pivot_wider to spread that abundance data across several columns, with the new columns being each of the observed species. Here's the line of code I'm trying to use to do that:
data %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
However, this gives me two error messages:
Error: Can't convert <double> to <list>.
Values are not uniquely identified; output will contain list-cols.
I'm not sure what the issue is, because this has worked fine for several other dataframes that are (seemingly) identical to this one. I've tried googling the first error message and have not been able to find what conditions cause it: I don't know what double R is trying to convert to a list, nor why it's trying to convert to a list at all. The Total.Abundance column should be integers, but I wonder if somehow it's a double data type?
From what I've been able to find, the second error message appears when there are identical rows in the dataframe. However, the error persists when I modify my statement to
unique(data) %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
which I would have thought would remove duplicate rows.
Any help would be much appreciated!
Expanding on my comment, there are duplicates in your data that cannot be removed by unique() or, in dplyr, by distinct():
dat %>%
  distinct() %>%
  group_by(Plot.ID, Species) %>%
  count()
#    Plot.ID Species                  n
#      <dbl> <chr>                <int>
#  1       1 Calliopius               1
#  2       1 Idotea                   2
#  3       1 Lacuna vincta            2
#  4       1 Mitrella lunata          2
#  5       1 Podoceropsis nitida      1
#  6       1 Unk. Amphipod            1
#  7       1 Unk. Bivalve             1
#  8       2 Calliopius               1
#  9       2 Caprella penantis        1
# 10       2 Corophium insidiosum     1
You need to find out why you have duplicates like this and reconcile them, say by summing them up. The duplicates might come from data-wrangling bugs, in which case summing is not necessarily suitable. Or perhaps you sampled the same plot twice, in which case you may want the mean instead of the sum to normalize for sampling effort (or an extra column indicating sampling effort). Nevertheless, this works:
dat %>%
  group_by(Plot.ID, Species) %>%
  summarise(abundance = sum(Total.Abundance)) %>%
  tidyr::pivot_wider(names_from = Species, values_from = abundance,
                     values_fill = 0)
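As an alternative sketch (assuming tidyr >= 1.1, and that Plot.ID is the only other column in dat), pivot_wider can aggregate the duplicates itself via its values_fn argument:
dat %>%
  # values_fn = sum collapses duplicated Plot.ID/Species combinations
  tidyr::pivot_wider(names_from = Species, values_from = Total.Abundance,
                     values_fn = sum, values_fill = 0)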

Correlation between multiple variables of a data frame, grouped by a different variable

Assume I have a data frame like the one below (the actual data frame has a million observations). I am trying to look for the correlation between the signal column and the other net-return columns, grouped by the various values of the Signal_Up column.
I have tried the dplyr library and a combination of the functions group_by and summarize. However, I am only able to get the correlation between two columns, not between multiple columns.
library(dplyr)
df %>%
  group_by(Signal_Up) %>%
  summarize(COR = cor(signal, Net_return_at_t_plus1))
The data and desired result were posted as images in the original question. The desired result is the correlation between "signal" and each of "Net_return_at_t_plus1", "Net_return_at_t_plus5", and "Net_return_at_t_plus10", grouped by "Signal_Up".
Maybe you can try to use summarise_at to perform the correlation over several columns.
Here, I took the iris dataset as an example:
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise_at(vars(Sepal.Length:Petal.Length), ~cor(Petal.Width, .))
# A tibble: 3 x 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa            0.278       0.233        0.332
2 versicolor        0.546       0.664        0.787
3 virginica         0.281       0.538        0.322
For your dataset, you should try something like:
library(dplyr)
df %>%
  group_by(Signal_Up) %>%
  summarise_at(vars(Net_return_at_t_plus1:Net_return_at_t_plus10),
               ~cor(signal, .))
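In dplyr >= 1.0, where summarise_at is superseded, an equivalent sketch with across() (same assumed column names):
df %>%
  group_by(Signal_Up) %>%
  # correlate signal with each net-return column, per Signal_Up group
  summarise(across(Net_return_at_t_plus1:Net_return_at_t_plus10,
                   ~cor(signal, .x)))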
Does that answer your question?
NB: it is easier for people to solve your issue if you provide a reproducible example they can easily copy/paste, rather than adding your data as an image (see: How to make a great R reproducible example).

Delete outliers

I have a large data set with over 2000 observations. The data involves toxin concentrations in animal tissue. My response variable is myRESULT and I have multiple observations per ANALYTE of interest. I need to remove the outliers, as defined by numbers more than three SD away from the mean, from within each ANALYTE group.
While I realize that I should not remove outliers from a dataset normally, I would still like to know how to do it in R.
Here is a small portion of what my data look like (posted as an image in the original question).
It's subsetting by group, which can be done in different ways. With dplyr, you use group_by to set the grouping, then filter to subset rows, passing it an expression that returns TRUE for rows to keep and FALSE for outliers.
For example, using iris and 2 standard deviations (everything is within 3):
library(dplyr)
iris_clean <- iris %>%
  group_by(Species) %>%
  filter(abs(Petal.Length - mean(Petal.Length)) < 2 * sd(Petal.Length))

iris_clean %>% count()
#> # A tibble: 3 x 2
#> # Groups:   Species [3]
#>   Species        n
#>   <fct>      <int>
#> 1 setosa        46
#> 2 versicolor    47
#> 3 virginica     47
With a split-apply-combine approach in base R,
do.call(rbind, lapply(
  split(iris, iris$Species),
  function(x) x[abs(x$Petal.Length - mean(x$Petal.Length)) < 2 * sd(x$Petal.Length), ]
))
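Applied to the question's data with the 3-SD rule, it would look like this (column names ANALYTE and myRESULT are taken from the post; the data frame name df is assumed):
library(dplyr)

# keep only rows within 3 SD of their ANALYTE group's mean
df_clean <- df %>%
  group_by(ANALYTE) %>%
  filter(abs(myRESULT - mean(myRESULT)) < 3 * sd(myRESULT))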

In R, apply a complicated function (not base) on several columns by group (dplyr)

I have a data frame, df, on which I would like to run the function kepdf (from the package pdfCluster, which calculates multivariate density). The point is that this is not a simple base function like head, mean, and the like.
My data frame looks like this:
> head(df)
# A tibble: 6 x 4
      A     B     C Group
  <dbl> <dbl> <dbl> <dbl>
1     2     1    39     1
2     2     2    66     1
3     2     2    36     1
4     1     1    56     1
5     1     1    37     1
6     1     1    45     1
Now, I would like to calculate the density of columns A, B, and C for each Group separately (the variable Group just indicates the group the observation belongs to and should not enter the density calculation). I naively tried the following:
df %>% group_by(Group) %>% select(1:3) %>% do(kepdf(.))
and got the following error:
Adding missing grouping variables: `Group`
Error in kepdf(.) : NA/NaN/Inf in foreign function call (arg 2)
Now, there are no missing values in the data, so I'm confused. Also, I don't want to add the grouping variable Group, because then the algorithm will include it in the density calculation, which I don't want.
Any thoughts?
Your issue is that you're grouping your data.frame by Group, then trying to discard the grouping column before performing kepdf(...). When you call do(...), it necessarily adds the grouping column back.
Try instead
library(dplyr)
library(purrr)
library(pdfCluster)

df %>%
  split(.$Group) %>%
  map(~select(.x, 1:3)) %>%
  map(~kepdf(.x))
You can always combine the last two map(...) calls into a single function:
myfun <- function(df) {
  require(pdfCluster)
  data <- select(df, 1:3)
  kepdf(data)
}

df %>% split(.$Group) %>% map(~myfun(.x))
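As an aside, in dplyr >= 0.8 a sketch with group_map() achieves the same thing, since each group is passed to the function without the grouping column:
library(dplyr)
library(pdfCluster)

# group_map() drops `Group` before calling the function,
# so kepdf() only ever sees columns A, B, and C
df %>%
  group_by(Group) %>%
  group_map(~kepdf(.x))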
