How do you compare data from two experiments in R?

I often need to measure percentage changes between two distinct scenarios/tests/periods.
An example dataset:
library(dplyr)
set.seed(11)
toy_dat <- data.frame(state = sample(state.name, 3, replace = FALSE),
                      experiment = c('control', 'measure'),
                      accuracy = sample(30:50, size = 6, replace = TRUE),
                      speed = sample(21:39, size = 6, replace = TRUE)) %>%
  arrange(state)
state experiment accuracy speed
1 Alabama measure 31 24
2 Alabama control 36 37
3 Indiana control 30 23
4 Indiana measure 31 38
5 Missouri control 50 29
6 Missouri measure 48 34
I then resort to writing something horrible like this:
result <- toy_dat %>%
  group_by(state) %>%
  arrange(experiment) %>%
  summarise(acc_delta = (accuracy[2] - accuracy[1]) / accuracy[1],
            speed_delta = (speed[2] - speed[1]) / speed[1])
However, the above solution does not scale at all as the number of measurables grows. In addition, the code is fragile: it depends on the row ordering within each group.
I am very new to R. I was hoping that this is a common enough pattern that there are well-known (smarter) solutions to the problem.
I would greatly appreciate any help/pointers.

Just create your own custom function and use summarise_each in order to apply it to all the measurements at once (it doesn't matter how many measurements you have):
delta_fun <- function(x) diff(x)/x[1L]
toy_dat %>%
group_by(state) %>%
arrange(experiment) %>%
summarise_each(funs(delta_fun), -experiment)
# Source: local data frame [3 x 3]
#
# state accuracy speed
# 1 Alabama -0.13888889 -0.3513514
# 2 Indiana 0.03333333 0.6521739
# 3 Missouri -0.04000000 0.1724138
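Note that summarise_each() has since been deprecated. If you're on dplyr 1.0 or later (an assumption about your setup), the same idea is written with across(); a minimal sketch:
toy_dat %>%
  group_by(state) %>%
  arrange(experiment) %>%                   # 'control' sorts before 'measure'
  summarise(across(-experiment, delta_fun)) # apply delta_fun to every remaining column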
As you mentioned that you are new to R, here's another awesome package, data.table, you can use to achieve the same effect:
library(data.table)
setDT(toy_dat)[order(experiment),
lapply(.SD, delta_fun),
.SDcols = -"experiment",
by = state]
# state accuracy speed
# 1: Alabama -0.13888889 -0.3513514
# 2: Indiana 0.03333333 0.6521739
# 3: Missouri -0.04000000 0.1724138
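A third, order-robust option is to reshape to wide first so each experiment gets its own column, then compute the deltas explicitly. A sketch using tidyr (assuming tidyr 1.0+ for pivot_wider()); it trades automatic scaling across measures for column names you can read:
library(tidyr)
toy_dat %>%
  pivot_wider(names_from = experiment, values_from = c(accuracy, speed)) %>%
  mutate(acc_delta = (accuracy_measure - accuracy_control) / accuracy_control,
         speed_delta = (speed_measure - speed_control) / speed_control)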

Related

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept, etc., for each well by year. Note that in some years some months will be missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)

# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name

## perform a series of merges (the first two steps could be replaced on
## your system with: initial_tibble %>% dplyr::select(sites, wells) %>% unique())
full_sites_wells_months_set <-
  merge(sites, wells) %>%                 # cross join of every site and well
  dplyr::rename(sites = x, wells = y) %>%
  merge(months) %>%                       # cross join with every month
  dplyr::rename(months = y) %>%
  dplyr::arrange(sites, wells)

# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8

initial_tibble <-
  full_sites_wells_months_set %>%
  dplyr::sample_frac(data_availability) %>%
  dplyr::mutate(values = runif(nrow(full_sites_wells_months_set) * data_availability)) # generate random groundwater values

# generate final result by joining the full expected set of sites, wells, and
# months to the actual data, then grouping by site and well and taking the
# lagged difference
final_tibble <-
  full_sites_wells_months_set %>%
  dplyr::left_join(initial_tibble) %>%
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
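For what it's worth, the full-grid step can also be written with tidyr::complete(), which pads the observed data with NA rows for every missing combination. A minimal sketch under the same column names as above; months is recoded as a factor (an added assumption) so the expansion sorts chronologically rather than alphabetically:
library(tidyr)
final_tibble <-
  initial_tibble %>%
  dplyr::mutate(months = factor(months, levels = month.name)) %>% # chronological ordering
  tidyr::complete(sites, wells, months) %>%                       # insert NA rows for missing months
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))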

Arranging the Dataset in R as per Sum value

Please run the R code below. I wish to obtain a data set in which each combination of the "Brand" and "Candy" column values appears once, with the corresponding time value being the sum of all such cases. For illustration, I want the pair "Mars"/"A" to appear only once, with the sum 22 in the next column, and similarly for the rest. The command should also be fast, as it will run on large data. Thanks, and please help.
PlanetData <- read.table(
text = "
Brand Candy time
Mars A 10
Mars A 12
Jupiter B 13
Jupiter B 14
Saturn C 21
Saturn C 26",
header = TRUE,
stringsAsFactors = FALSE)
You can try two alternative approaches, using dplyr or data.table, and pick the fastest one:
PlanetData <- read.table(
text = "
Brand Candy time
Mars A 10
Mars A 12
Jupiter B 13
Jupiter B 14
Saturn C 21
Saturn C 26",
header = TRUE,
stringsAsFactors = FALSE)
library(dplyr)
PlanetData %>% group_by(Brand, Candy) %>% summarise(SUM = sum(time)) %>% ungroup()
# # A tibble: 3 x 3
# Brand Candy SUM
# <chr> <chr> <int>
# 1 Jupiter B 27
# 2 Mars A 22
# 3 Saturn C 47
library(data.table)
setDT(PlanetData)[, .(SUM=sum(time)),by=.(Brand, Candy)]
# Brand Candy SUM
# 1: Mars A 22
# 2: Jupiter B 27
# 3: Saturn C 47
It would also be useful to try the dplyr version with stringsAsFactors = TRUE; it's very likely to be (slightly?) faster, depending on how many rows and unique values you have.
Note that the moment you call setDT(PlanetData), PlanetData becomes a data.table rather than a plain data.frame. Make sure that doesn't skew/affect your timings when you go back to run the dplyr versions.
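If you want to time the two on your actual data rather than guess, the microbenchmark package gives a quick comparison. A sketch; it uses as.data.table() to work on a copy, precisely so the setDT() issue above doesn't contaminate the dplyr timing:
library(microbenchmark)
microbenchmark(
  dplyr      = PlanetData %>% group_by(Brand, Candy) %>% summarise(SUM = sum(time)) %>% ungroup(),
  data.table = as.data.table(PlanetData)[, .(SUM = sum(time)), by = .(Brand, Candy)],
  times = 100L
)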

R: Combine duplicate columns after dplyr join

When you use a dplyr join function like full_join, columns with identical names that are not used to join the tables are duplicated and given suffixes like "col.x", "col.y", "col.x.x", etc.
library(dplyr)
data1 <- data.frame(
  Code = c(2, 1, 18, 5),
  Country = c("Canada", "USA", "Brazil", "Iran"),
  x = c(50, 29, 40, 29))
data2 <- data.frame(
  Code = c(2, 40, 18),
  Country = c("Canada", "Japan", "Brazil"),
  y = c(22, 30, 94))
data3 <- data.frame(
  Code = c(25, 14, 52),
  Country = c("China", "Japan", "Australia"),
  z = c(22, 30, 94))
data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3))
This results in "Country", "Country.x", and "Country.y" columns.
Is there a way to combine the three columns into one, such that if a row has NA for a "Country", it takes the value from "Country.x" or "Country.y"?
I attempted a solution based on this similar question, but it gives me a warning and returns only values from the top three rows.
data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3)) %>%
  mutate(Country = coalesce(Country.x, Country.y, Country)) %>%
  select(-Country.x, -Country.y)
This returns the warning invalid factor level, NA generated.
Any ideas?
You could use my package safejoin to make a full join and deal with the conflicts using dplyr::coalesce.
First we'll have to rename the third column of each table so the value columns share a name.
library(dplyr)
data1 <- rename_at(data1,3, ~"value")
data2 <- rename_at(data2,3, ~"value")
data3 <- rename_at(data3,3, ~"value")
Then we can join
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
data1 %>%
safe_full_join(data2, by = c("Code","Country"), conflict = coalesce) %>%
safe_full_join(data3, by = c("Code","Country"), conflict = coalesce)
# Code Country value
# 1 2 Canada 50
# 2 1 USA 29
# 3 18 Brazil 40
# 4 5 Iran 29
# 5 40 Japan 30
# 6 25 China 22
# 7 14 Japan 30
# 8 52 Australia 94
You get some warnings because you're joining factor columns with different levels; add the parameter check = "" to silence them.
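Alternatively, without any extra packages, the invalid factor level warning disappears if you convert the factor columns to character before coalescing. A minimal sketch of your original approach with that one change (mutate_if is used here so it works on pre-1.0 dplyr too):
data4 <- Reduce(function(...) full_join(..., by = "Code"), list(data1, data2, data3)) %>%
  mutate_if(is.factor, as.character) %>% # give coalesce() one common type to work with
  mutate(Country = coalesce(Country.x, Country.y, Country)) %>%
  select(-Country.x, -Country.y)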

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame below into the lower one.
How can I do that?
Thank you in advance.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combining date columns, melting your data, some column-name transformations, and sorting.
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
  mutate(expense = -expense) %>%
  melt(value.name = "income_expense") %>%
  select(-variable) %>%
  arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
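One small wrinkle: unite() joins with "_" by default, which is why the dates above read 2016_07 rather than the 2016-07 in your expected output. Passing an explicit separator fixes that:
df %>% unite("date", c(year, month), sep = "-") %>%
  mutate(expense = -expense) %>%
  melt(value.name = "income_expense") %>%
  select(-variable) %>%
  arrange(date)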
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::data_frame(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
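As an aside, on current tidyr (1.0 or later, an assumption about your setup) gather() has been superseded by pivot_longer(); the reshaping step of the pipeline above translates to:
df %>%
  dplyr::mutate(month = paste0(year, "-", month)) %>%
  tidyr::pivot_longer(
    cols = income:expense,       # which columns you're acting on
    names_to = "direction",      # new column holding the classification key
    values_to = "income_expense" # new column holding the values
  ) %>%
  dplyr::mutate(income_expense =
    ifelse(direction == "expense", -income_expense, income_expense)
  )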

Creating Mean Function for Subset in R

I'm trying to create a function that will take a few parameters and return the total units and the average units per hour. My data set looks like this:
Location Time units
1 Columbus 3:35 12
2 Columbus 3:58 199
3 Chicago 6:10 -45
4 Chicago 6:19 87
5 Detroit 12:05 -200
6 Detroit 0:32 11
What I would like returned would be
Location Time units unitsph
Columbus 7:33 211 27.9
Chicago 12:29 42 3.4
Detroit 12:37 -189 -15.1
while also retaining the other columns: basically total time, total units produced, and units per hour.
I tried out
thing <- time %>% group_by(Location) %>% summarize(sum(units))
which returned locations and total units but not units per hour. Then I moved to
thing <- time %>% group_by(Location) %>% summarize(sum(units)) %>% summarize(sum(Time))
which returned
Error in eval(expr, envir, enclos) : object 'Time' not found
I also tried mutate but to no effect:
fin <- mutate(time, as.numeric(sum(Time))/as.numeric(sum(units)))
Error in Summary.factor(c(118L, 131L, 174L, 178L, 57L), na.rm = FALSE) :
‘sum’ not meaningful for factors
Any help here much appreciated. I also have a few other columns that I'd like to retain (they're geocodes for the locations etc), but didn't list those here. If that's important I can add back in.
Your Time column is a string (or factor), not a numeric time. You can use:
data <- data.frame(loc = c("C", "C", "D", "D"),
                   time = c("1:22", "1:23", "1:24", "1:25"),
                   u = c(1, 2, 3, 4))
basetime <- strptime("00:00", "%H:%M")                              # midnight, as a reference point
data$in.hours <- as.double(strptime(data$time, "%H:%M") - basetime) # time of day, in hours since midnight
thing <- data %>% group_by(loc) %>% summarize(sum(u), sum(in.hours))
The conversion into hours is not exactly beautiful. It first turns the time into a POSIXct object and then converts the difference to a double. But I guess it's OK.
The converted data
loc time u in.hours
1 C 1:22 1 1.366667
2 C 1:23 2 1.383333
3 D 1:24 3 1.400000
4 D 1:25 4 1.416667
so 1.366667 means 1 h + 22/60 h.
The final result is then
loc sum(u) sum(in.hours)
(fctr) (dbl) (dbl)
1 C 3 2.750000
2 D 7 2.816667
hence for C you have 2 hours and 0.75 * 60 = 45 minutes.
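To get all the way to the unitsph column in the question, divide the summed units by the summed hours inside the same summarize() call; a sketch reusing the in.hours conversion above:
thing <- data %>%
  group_by(loc) %>%
  summarize(units = sum(u),
            hours = sum(in.hours),
            unitsph = units / hours) # units produced per hour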
I ended up taking part of what @CAFEBABE recommended and modifying it.
I used
mutated_time <- time %>%
  group_by(Location) %>%
  summarize(play = sum(as.numeric(Time) / 60),
            unitsph = sum(units))
and that plus
selektor <- as.data.frame(select(distinct(mutated_time), Location, unitsph))
got me where I wanted to go. Thank you all for the many helpful comments.
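One caution on that self-answer: if Time is stored as a factor (which the earlier "sum not meaningful for factors" error suggests), as.numeric(Time) returns the internal level codes, not minutes. Parsing the text explicitly is safer; a hedged sketch using lubridate, assuming Time holds H:MM strings:
library(dplyr)
library(lubridate)
mutated_time <- time %>%
  group_by(Location) %>%
  summarize(hours = sum(period_to_seconds(hm(as.character(Time)))) / 3600, # parse H:MM, total in hours
            unitsph = sum(units) / hours)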
