I have a large data set with over 2000 observations. The data involves toxin concentrations in animal tissue. My response variable is myRESULT and I have multiple observations per ANALYTE of interest. I need to remove the outliers, as defined by numbers more than three SD away from the mean, from within each ANALYTE group.
While I realize that I should not remove outliers from a dataset normally, I would still like to know how to do it in R.
Here is a small portion of what my data look like:
This is subsetting by group, which can be done in different ways. With dplyr, you use group_by to set the grouping, then filter to subset rows, passing it an expression that returns TRUE for rows to keep and FALSE for outliers.
For example, using iris and 2 standard deviations (everything is within 3):
library(dplyr)
iris_clean <- iris %>%
  group_by(Species) %>%
  filter(abs(Petal.Length - mean(Petal.Length)) < 2*sd(Petal.Length))
iris_clean %>% count()
#> # A tibble: 3 x 2
#> # Groups:   Species [3]
#>   Species        n
#>   <fct>      <int>
#> 1 setosa        46
#> 2 versicolor    47
#> 3 virginica     47
With a split-apply-combine approach in base R,
do.call(rbind, lapply(
  split(iris, iris$Species),
  function(x) x[abs(x$Petal.Length - mean(x$Petal.Length)) < 2*sd(x$Petal.Length), ]
))
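Applied to the question's own setup, the same idea can be sketched in base R with ave(), which computes a statistic within each group and returns it aligned with the original rows. The column names ANALYTE and myRESULT come from the question; the data values below are invented for illustration, and the cutoff is the three standard deviations the question asks for.

```r
# Hypothetical data: two analytes, each with one planted extreme value
set.seed(1)
tox <- data.frame(
  ANALYTE  = rep(c("Hg", "Pb"), each = 50),
  myRESULT = c(rnorm(49, mean = 10, sd = 1), 100,
               rnorm(49, mean = 5, sd = 0.5), 60)
)

# Per-group z-score: ave() applies FUN within each ANALYTE group
z <- ave(tox$myRESULT, tox$ANALYTE,
         FUN = function(x) (x - mean(x)) / sd(x))

# Keep rows that are at most three SD away from their group mean
tox_clean <- tox[abs(z) <= 3, ]
table(tox_clean$ANALYTE)
```

Note that a single extreme value inflates the group SD, so in small groups a three-SD rule may never flag anything; the groups here are large enough (50 rows) for the planted values to be caught.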
Assume I have a data frame like the one below (the actual data frame has millions of observations). I am trying to compute the correlation between the signal column and the other net-return columns, grouped by the values of the Signal_Up column.
I have tried the dplyr library with a combination of group_by and summarize. However, I can only get the correlation between two columns, not between signal and multiple columns.
library(dplyr)
df %>%
group_by(Signal_Up) %>%
summarize (COR=cor(signal, Net_return_at_t_plus1))
Data and desired result are given below.
Data
Desired Result
Correlation between "signal" Vs ["Net_return_at_t_plus1", "Net_return_at_t_plus5", "Net_return_at_t_plus10"]
Group by "Signal_Up"
Maybe you can try to use summarise_at to perform the correlation over several columns.
Here, I took the iris dataset as example:
library(dplyr)
iris %>% group_by(Species) %>%
summarise_at(vars(Sepal.Length:Petal.Length), ~cor(Petal.Width,.))
# A tibble: 3 x 4
Species Sepal.Length Sepal.Width Petal.Length
<fct> <dbl> <dbl> <dbl>
1 setosa 0.278 0.233 0.332
2 versicolor 0.546 0.664 0.787
3 virginica 0.281 0.538 0.322
For your dataset, you should try something like:
library(dplyr)
df %>% group_by(Signal_Up) %>%
  summarise_at(vars(Net_return_at_t_plus1, Net_return_at_t_plus5, Net_return_at_t_plus10), ~cor(signal,.))
Does it answer your question?
NB: It is easier for people to help if you provide a reproducible example that they can copy and paste, instead of posting the data as an image (see: How to make a great R reproducible example).
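In dplyr 1.0.0 and later, summarise_at is superseded by across(). A minimal sketch of the same grouped correlation, assuming a data frame with the column names from the question (the values below are simulated, not the asker's data):

```r
library(dplyr)

# Hypothetical data using the question's column names
set.seed(42)
df <- data.frame(
  Signal_Up = rep(c(0, 1), each = 20),
  signal    = rnorm(40)
)
df$Net_return_at_t_plus1  <- df$signal + rnorm(40)
df$Net_return_at_t_plus5  <- rnorm(40)
df$Net_return_at_t_plus10 <- -df$signal + rnorm(40)

# Correlate `signal` with every net-return column, within each Signal_Up group
res <- df %>%
  group_by(Signal_Up) %>%
  summarise(across(starts_with("Net_return_at_t_plus"), ~ cor(signal, .x)))
res
```

Selecting with starts_with() avoids relying on the return columns being adjacent, which a `first:last` range would require.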
I would like to create a list column of matrices, where the entries of each matrix are elements from variables already present in the original dataset. My goal is to create a 2 by 2 contingency table for each row of the data set, and subsequently pass each matrix as an argument to fisher.test.
I have tried adding the new column using a combination of mutate and matrix, but this returns an error. I've also tried using do instead of mutate and this seems like a step in the right direction, but I know this is also incorrect, because the dimensions of the elements are off, and there is only one row in the output.
library(tidyverse)
mtcars %>%
mutate(mat = matrix(c(.$disp, .$hp, .$gear, .$carb)))
#> Error: Column `mat` must be length 32 (the number of rows) or one, not 128
mtcars %>%
do(mat = matrix(c(.$disp, .$hp, .$gear, .$carb)))
#> # A tibble: 1 x 1
#> mat
#> <list>
#> 1 <dbl [128 x 1]>
Created on 2019-06-05 by the reprex package (v0.2.1)
I am expecting 32 rows in my output, and the mat column to contain 32 2x2 matrices composed of entries from mtcars$disp, mtcars$hp, mtcars$gear, and mtcars$carb.
My intent is to use map to pass each entry in the mat column as an argument to fisher.test, then extract the odds ratio estimate, and the p-value. But the main focus, of course, is creation of the list of matrices.
You have two issues:
To store a matrix in a data.frame (tibble), you simply have to put it in a list.
To create 2 x 2 matrices (instead of one 128 x 1 matrix built from all rows at once), you need to work row by row. Currently, matrix(c(disp, hp, gear, carb)) concatenates four columns of 32 values each, so you create a single 128 x 1 matrix. You want only 4 values per row, reshaped to 2 x 2.
Working with pmap allows you to process the rows one by one, but alternatively you can use rowwise which groups by row:
library(tidyverse)
df <-
  mtcars %>%
  as_tibble() %>%
  rowwise() %>%
  mutate(mat = list(matrix(c(disp, hp, gear, carb), 2, 2)))
Edit: Now how do you actually use those? Let's take the example of a fisher.test. Note that a test is a complex object, with components (like p.value) and attributes, so we'll have to store them in a list-column.
You can either keep working rowwise, in which case the list is automagically "unlist-ed":
df %>%
  # keep in mind df is still grouped by row so 'mat' is only one matrix.
  # A test is a complex object so we need to store it in a list-column
  mutate(test = list(fisher.test(mat)),
         # test is just one test so we can extract p-value directly
         pval = test$p.value)
Or if you stop working row by row (for which you simply need to ungroup), then mat is a list of matrices onto which you can map functions. We use the map functions from purrr.
library("purrr")
df %>%
  ungroup() %>%
  # Apply the test to each mat using `map` from `purrr`
  # `map` returns a list so `test` is a list-column
  mutate(test = map(mat, fisher.test),
         # Now `test` is a list of tests... so you need to map operations onto it
         # Extract the p-values from each test, into a numeric column rather than a list-column
         pval = map_dbl(test, pluck, "p.value"))
Which one you prefer is a matter of taste :)
You can use the pmap function from the purrr package inside mutate:
library(tidyverse)
mtcars %>% as_tibble() %>%
mutate(mat = pmap(list(disp, hp, gear, carb), ~matrix(c(..1, ..2, ..3, ..4), 2, 2)))
# A tibble: 32 x 12
mpg cyl disp hp drat wt qsec vs am gear carb mat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 <dbl[,2] [2 x 2]>
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 <dbl[,2] [2 x 2]>
Each entry of mat is then a 2x2 matrix with the desired elements. Hope this helps.
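For comparison, the whole pipeline (build one 2 x 2 matrix per row, run fisher.test, pull out the odds ratio and p-value) can be sketched in base R with Map and vapply. The counts data frame and its column names a, b, c, d are hypothetical integer counts made up for this example; they stand in for the four source columns.

```r
# Hypothetical integer counts, one 2 x 2 table per row
counts <- data.frame(
  a = c(3, 10, 8),
  b = c(1, 2, 7),
  c = c(1, 3, 9),
  d = c(3, 15, 6)
)

# Map walks the four columns in parallel and returns a list of matrices
mats <- Map(function(a, b, c, d) matrix(c(a, b, c, d), nrow = 2),
            counts$a, counts$b, counts$c, counts$d)

# One fisher.test per matrix
tests <- lapply(mats, fisher.test)

# For a 2 x 2 table, `estimate` holds the conditional odds ratio
counts$odds_ratio <- vapply(tests, function(t) unname(t$estimate), numeric(1))
counts$p_value    <- vapply(tests, function(t) t$p.value, numeric(1))
counts
```

vapply with numeric(1) plays the same role as map_dbl: it guarantees a plain numeric column rather than a list-column.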
I have a data frame, df, on which I would like to run the function kepdf (from the package pdfCluster, which calculates multivariate density). The point is that this is not a simple base function like head, mean and the like.
My data frame looks like this:
> head(df)
# A tibble: 6 x 4
A B C Group
<dbl> <dbl> <dbl> <dbl>
2 1 39 1
2 2 66 1
2 2 36 1
1 1 56 1
1 1 37 1
1 1 45 1
Now, I would like to calculate the density of columns A, B, and C for each Group separately (the variable Group just indicates the group the observation belongs to and should not enter the density calculation). I naively tried the following:
df %>% group_by(Group) %>% select(1:3) %>% do(kepdf(.))
and got the following error:
Adding missing grouping variables: `Group`
Error in kepdf(.) : NA/NaN/Inf in foreign function call (arg 2)
Now, there are no missing values in the data, so I'm confused. Also, I don't want to add the grouping variable Group because then the algorithm will add it to the density calculation, which I don't want it to do.
Any thoughts?
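Your issue is that you're grouping your data.frame by Group and then trying to discard the grouping column before calling kepdf(...). select() on a grouped data frame always keeps the grouping column (that is the "Adding missing grouping variables" message), so Group is still passed to kepdf() inside do(...).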
Try instead
library(purrr)
df %>% split(.$Group) %>% map(., ~select(.x, 1:3)) %>% map(., ~kepdf(.x))
You can always combine the last two map(...) into a single function
myfun <- function(df) {
  require(pdfCluster)
  data <- select(df, 1:3)
  kepdf(data)
}
df %>% split(.$Group) %>% map(., ~myfun(.x))
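The same split-then-apply pattern can be sketched in base R so that the grouping column never reaches the function at all. In this sketch colMeans is only a stand-in for kepdf (so the example runs without pdfCluster), and the data values are invented to match the shape shown in the question.

```r
# Hypothetical data shaped like the question's df
df <- data.frame(
  A     = c(2, 2, 2, 1, 1, 1, 3, 3),
  B     = c(1, 2, 2, 1, 1, 1, 2, 3),
  C     = c(39, 66, 36, 56, 37, 45, 50, 41),
  Group = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Split only the feature columns; Group decides the split but is not in the pieces
pieces <- split(df[c("A", "B", "C")], df$Group)

# colMeans is a placeholder: swap in kepdf (or any function of a numeric data frame)
res <- lapply(pieces, colMeans)
res
```

The result is a named list with one element per group, which is also what the purrr version returns.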
I have a data frame in R that lists monthly sales data by department for a store. Each record contains a month/year, a department name, and the total sales in that department for the month. I'm trying to calculate the mean sales by department, adding them to the vector avgs, but I seem to be having two problems: the total sales per department is not accumulating at all (it's evaluating to zero), and avgs gets one entry per record instead of one per department. Here's what I have:
avgs = c()
for(dept in data$departmentName){
  total <- 0
  for(record in data){
    if(identical(data$departmentName, dept)){
      total <- total + data$ownerSales[record]
    }
  }
  avgs <- c(avgs, total/72)
}
Upon looking at avgs on completion of the loop, I find that it's returning a vector of zeroes the length of the data frame rather than a vector of 22 averages (there are 22 departments). I've been tweaking this forever and I'm sure it's a stupid mistake, but I can't figure out what it is. Any help would be appreciated.
Why not use library(dplyr)?
library(dplyr)
data(iris)
iris %>% group_by(Species) %>% # or dept
summarise(total_plength = sum(Petal.Length), # total owner sales
weird_divby72 = total_plength/72) # total/72?
# A tibble: 3 × 3
Species total_plength weird_divby72
<fctr> <dbl> <dbl>
1 setosa 73.1 1.015278
2 versicolor 213.0 2.958333
3 virginica 277.6 3.855556
Your case would probably look like this:
data %>% group_by(departmentName) %>%
  summarise(total_sales = sum(ownerSales),
            monthly_sales = total_sales/72)
I like dplyr for its syntax and pipeability. I think it is a huge improvement over base R for ease of data wrangling. Here is a good cheat sheet to help you get rolling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
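As for why the original loop misbehaves: the outer loop visits every row's department name (so avgs gets one entry per record), the inner `for(record in data)` iterates over the columns of the data frame rather than its rows, and identical() compares the whole column to a single name, which is never TRUE, so total stays at zero. A corrected loop and a loop-free base R equivalent might look like the sketch below. The data frame sales and its values are hypothetical; only the column names departmentName and ownerSales come from the question.

```r
# Hypothetical data with the question's column names
sales <- data.frame(
  departmentName = rep(c("Bakery", "Deli"), each = 3),
  ownerSales     = c(100, 200, 300, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Loop over each department once, not over every record
avgs <- c()
for (dept in unique(sales$departmentName)) {
  in_dept <- sales$departmentName == dept      # one TRUE/FALSE per row
  avgs[dept] <- mean(sales$ownerSales[in_dept])
}
avgs

# Same result without a loop
avgs_tapply <- tapply(sales$ownerSales, sales$departmentName, mean)
avgs_tapply
```

mean() divides by the number of records per department; that matches the question's division by 72 only if every department has exactly 72 monthly records.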
I am trying to pivot some data such that I retrieve (1) the total of some measurement for two+ groups, and then (2) that measurement divided by the # of observations in that group. I have achieved (1) but not (2). Below is an example output I desire:
grouping measurement_total group_size mean
1 1 301 60 5.0
2 2 215 40 5.4
Let some data be:
> grouping <- c(1,2,1,1,2)
> measurement <- sample(rnorm(1,10),100, replace=TRUE)
> dataframe <- cbind(grouping, measurement)
To create the pivot, I used aggregate. I then used a cbind to get the # of observations per group:
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
grouping measurement V2
1 1 301 60
2 2 215 40
I now need to create "V3" which would be { measurement / V2 } such that I achieve the result. NB I can get the mean only by using FUN=mean, but this means I cannot also get the group size.
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=mean)
grouping V2(# obs.) mean
1 1 1 5.0
2 2 1 5.4
What are some options for achieving this simply, ideally with a single line? I.e. I could obtain the two tables separately and merge the two, but it's a little long-winded.
Thanks
John
You can use dplyr to do this fairly easily
library(dplyr)
dataframe <- data.frame(dataframe) # Convert to dataframe
dataframe %>%
group_by(grouping) %>%
mutate(measurement_total = sum(measurement)) %>%
mutate(group_size = length(measurement)) %>%
mutate(mean = mean(measurement)) %>%
filter(row_number()==1) %>%
select(-measurement)
Of course, the easy way to do it in base R would be:
df <- aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
df$mean <- df$measurement/df$V2
But if you're going to be doing a lot of data frame manipulation, it might be a good idea to get into dplyr.
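The dplyr version above can also be written with a single summarise call, which returns one row per group directly instead of computing the values on every row and then keeping the first. A minimal sketch, using small made-up values rather than the question's random data:

```r
library(dplyr)

# Hypothetical data with the question's column names
dataframe <- data.frame(
  grouping    = c(1, 2, 1, 1, 2),
  measurement = c(4, 5, 6, 5, 7)
)

res <- dataframe %>%
  group_by(grouping) %>%
  summarise(
    measurement_total = sum(measurement),
    group_size        = n(),               # number of rows in the group
    mean              = mean(measurement)
  )
res
```

n() counts the rows in the current group, so the total, the group size and the mean come out of one call.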