I have two data frames with two columns each. The first column contains timestamps and the second contains values.
One of the data frames is much bigger than the other, but both contain data in the same timestamp range.
If I plot these two on top of each other, I will get a nice plot showing how they differ in time.
Now I would like to get the absolute difference over time between these two data frames, to make another plot showing how much they differ (or to create a boxplot with this information), even though they do not have the same length or exactly matching timestamps.
Check this example:
df1:
timestamp | data
1334103075| 1.2
1334103085| 1.5
1334103095| 0.9
1334103105| 0.7
1334103115| 1.1
1334103125| 0.8
df2:
timestamp | data
1334103078| 1.2
1334103099| 1.5
1334103123| 0.8
1334103125| 0.9
How would I achieve something like this:
df3 <- abs(df1-df2)
As you can see, df2 might not have the same timestamps as df1, but they both have timestamps in the same time range.
Of course the subtraction should match timestamps where possible, or otherwise use values from the nearest timestamps (or an average of the nearby ones).
I would suggest building two linear interpolators and evaluating both of them on the union of your two sets of timestamps:
df1 <- data.frame(timestamp = c(1334103075, 1334103085, 1334103095,
                                1334103105, 1334103115, 1334103125),
                  data = c(1.2, 1.5, 0.9, 0.7, 1.1, 0.8))
df2 <- data.frame(timestamp = c(1334103078, 1334103099, 1334103123,
                                1334103125),
                  data = c(1.2, 1.5, 0.8, 0.9))

library(Hmisc)

# Evaluate both series on the union of the two sets of timestamps,
# using linear interpolation (and extrapolation at the edges)
all.timestamps <- sort(unique(c(df1$timestamp, df2$timestamp)))
data1 <- approxExtrap(df1$timestamp, df1$data, all.timestamps)$y
data2 <- approxExtrap(df2$timestamp, df2$data, all.timestamps)$y

df3 <- data.frame(timestamp = all.timestamps,
                  data1     = data1,
                  data2     = data2,
                  abs.diff  = abs(data1 - data2))
df3
# timestamp data1 data2 abs.diff
# 1 1334103075 1.20 1.157143 0.04285714
# 2 1334103078 1.29 1.200000 0.09000000
# 3 1334103085 1.50 1.300000 0.20000000
# 4 1334103095 0.90 1.442857 0.54285714
# 5 1334103099 0.82 1.500000 0.68000000
# 6 1334103105 0.70 1.325000 0.62500000
# 7 1334103115 1.10 1.033333 0.06666667
# 8 1334103123 0.86 0.800000 0.06000000
# 9 1334103125 0.80 0.900000 0.10000000
Then you could consider fitting splines if you are not quite happy with linear approximations.
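For example, a minimal sketch with natural cubic splines, using splinefun() from base R's stats package and reusing all.timestamps from above, could look like this:
# Sketch only: spline interpolators instead of linear ones
spline1 <- splinefun(df1$timestamp, df1$data, method = "natural")
spline2 <- splinefun(df2$timestamp, df2$data, method = "natural")
df3.spline <- data.frame(timestamp = all.timestamps,
                         abs.diff  = abs(spline1(all.timestamps) -
                                         spline2(all.timestamps)))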
I'm trying to implement a lag function, but it seems I need an existing x column for it to work.
Let's say I have this data frame:
df <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96,.94))
I want a column Ix that is lag(Ix)*lag(px), starting from 1000.
The data I want is:
df2 <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96, .94),
Ix=c(1000, 990, 960.3, 912.285, 875.7936))
I've tried
library(dplyr)
df2<-mutate(df,Ix = lag(Ix, default = 1000)*lag(px))
An ifelse statement doesn't work either, even after creating a reference value first:
df$Ix2=NA
df[1,3]=1000
df$Ix<-ifelse(df[,3]==1000,1000,
lag(df$Ix, default = 1000)*lag(px,default =1))
I have also been playing around with creating a separate Ix column with Ix = 1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create this column?
You could use cumprod() combined with dplyr::lag() for this:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
I have a dataset, df, where columns consist of various chemicals and rows consist of samples identified by their id and the concentration of each chemical.
I need to correct the chemical concentrations using a unique value for each chemical, which are found in another dataset, df2.
Here's a minimal df1 dataset:
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",")
and here is a df2 example:
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",")
What I need to do is divide all observations of chem1 in df1 by the value provided for chem1 in df2, and repeat this for each chemical. In reality, the chemical names are not sequential, and there are roughly 30 chemicals.
Previously I would have done this using Excel and index/match but I'm looking to make my methods more reproducible, hence fighting my way through with R. I mostly do data manipulation with dplyr, so if there's a tidyverse solution out there, that would be great!
Thankful for any help
We can use the 'chem' column of 'df2' to subset 'df1', divide by the 'value' column of 'df2' (replicated so the lengths match), and update the columns of 'df1' by assigning the result back:
df1[as.character(df2$chem)] <- df1[as.character(df2$chem)]/df2$value[col(df1[-1])]
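For readability, here is an equivalent sketch of the same division written as an explicit loop (run it instead of the one-liner above, not in addition to it):
# Loop over the chemicals named in df2 and divide the matching df1 column
for (ch in as.character(df2$chem)) {
  df1[[ch]] <- df1[[ch]] / df2$value[df2$chem == ch]
}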
Using the reshape2 package, the data frame can be changed to long format and merged with df2 as follows. (Note that the example df2 introduces some whitespace in the chemical names, which is removed in this solution.)
library(reshape2)
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",",stringsAsFactors = F)
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",",stringsAsFactors = F)
df2$chem <- gsub("\\s+","",df2$chem) #example introduces whitespaces in the names
df1A <- melt(df1,id.vars=c("id"),variable.name="chem")
combined <- merge(x=df1A,y=df2,by="chem",all.x=T)
combined$div <- combined$value.x/combined$value.y
head(combined)
chem id value.x value.y div
1 chem1 1 0.5 1.7 0.2941176
2 chem1 2 1.5 1.7 0.8823529
3 chem1 3 1.0 1.7 0.5882353
4 chem1 4 2.0 1.7 1.1764706
5 chem1 5 3.0 1.7 1.7647059
6 chem2 1 1.0 2.3 0.4347826
or in wide format:
> dcast(combined[,c("id","chem","div")],id ~ chem,value.var="div")
id chem1 chem2 chem3 chemA chemB
1 1 0.2941176 0.4347826 1.2195122 0.7692308 1.1111111
2 2 0.8823529 0.2173913 0.4878049 0.5769231 1.4814815
3 3 0.5882353 0.4347826 0.6097561 1.3461538 0.3703704
4 4 1.1764706 2.1739130 0.7317073 0.1923077 2.5925926
5 5 1.7647059 1.7391304 0.5609756 0.1346154 0.8518519
Here's a tidyverse solution.
library(tidyverse)  # gather/spread come from tidyr

df3 <- df1 %>%
  # convert the data from wide to long to make the next step easier
  gather(key = chem, value = value, -id) %>%
  # do your math, using 'match' to map values from df2 onto the rows
  mutate(value = value / df2$value[match(chem, df2$chem)]) %>%
  # return the data to wide format if that's how you prefer to store it
  spread(chem, value)
I have two datasets of lobster egg size data taken by different samplers, which will be used to assess measurement variability. Each sampler measures ~50 eggs/lobster from numerous lobsters. However, occasionally some lobsters are processed by sampler one and not sampler two, and vice versa. I would like to combine the data from the two samplers into a new dataset, but remove all data from lobsters processed by only one sampler. I've played around with dplyr's semi_join and intersect, but I need the matching to be performed both from dataset 1 -> 2 and from 2 -> 1. I am able to create a new dataset which binds the rows from the two samplers, but I am not clear on how to remove all the lobster IDs unique to one of the two datasets from the new one.
Here is a simplified version of my data, where there are multiple egg area measurements taken from multiple lobsters, but the sampling does not always overlap (i.e., eggs measured from an individual by only one sampler and not the other):
install.packages("dplyr")
library(dplyr)
sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster2",
"Lobster2","Lobster3","Lobster3","Lobster3"),
Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8),
Sampler=c(rep("Sampler1", 10)))
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1",
"Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster4","Lobster4"),
Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4),
Sampler=c(rep("Sampler2", 10)))
combined <- bind_rows(sampler1, sampler2)
desiredresult <- combined[-c(8, 9, 10, 19, 20), ]
The bottom line of the script is the desired result from the mock data. I was hoping to limit use to base R or dplyr.
sampler1 %>% rbind(sampler2) %>% filter(LobsterID %in% intersect(sampler1$LobsterID, sampler2$LobsterID))
combined <- bind_rows(sampler1, sampler2)
Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)])
combined <- combined[combined$LobsterID %in% Lobsters.2.sample,]
Using base R
combined <-rbind(sampler1, sampler2)
inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]])
output <- combined[combined[["LobsterID"]] %in% inBoth, ]
intersect finds the set intersection of two vectors, giving you the lobsters present in both samples. All functions are vectorized, so it should run pretty fast.
Bind the rows, group, and filter by the number of distinct samplers in each group:
sampler1 %>% bind_rows(sampler2) %>%
group_by(LobsterID) %>%
filter(n_distinct(Sampler) == 2)
## Source: local data frame [15 x 3]
## Groups: LobsterID [2]
##
## LobsterID Area Sampler
## <chr> <dbl> <chr>
## 1 Lobster1 0.40 Sampler1
## 2 Lobster1 0.35 Sampler1
## 3 Lobster2 1.10 Sampler1
## 4 Lobster2 1.04 Sampler1
## 5 Lobster2 1.14 Sampler1
## 6 Lobster2 1.10 Sampler1
## 7 Lobster2 1.05 Sampler1
## 8 Lobster1 0.41 Sampler2
## 9 Lobster1 0.44 Sampler2
## 10 Lobster1 0.47 Sampler2
## 11 Lobster1 0.43 Sampler2
## 12 Lobster1 0.38 Sampler2
## 13 Lobster2 1.14 Sampler2
## 14 Lobster2 1.11 Sampler2
## 15 Lobster2 1.09 Sampler2
Here is an option using data.table. Use rbindlist to bind the datasets, group by 'LobsterID', and subset the rows with a logical condition on the number of unique elements in 'Sampler', i.e. that it equals 2.
library(data.table)
rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]
I have a large data set that I am trying to discretise and create a 3d surface plot with:
rowColFoVCell wpbCount Feret
1 001001001001 1 0.58
2 001001001001 1 1.30
3 001001001001 1 0.58
4 001001001001 1 0.23
5 001001001001 2 0.23
6 001001001001 2 0.58
There are currently 695302 rows in this data set. I am trying to discretise the third 'Feret' column based on the second column, i.e. bin the 'Feret' values for each 'wpbCount'.
I think the solution will involve using cut but I am not sure how to go about this. I would like to end up with a data frame something like this:
wpbCount Feret Count
1 1 [0.0,0.2] 3
2 1 [0.2,0.4] 5
3 1 [0.4,0.6] 6
4 1 [0.6,0.8] 9
5 2 [0.0,0.2] 6
6 2 [0.4,0.6] 23
This is to answer the first part:
Create Some data
DF <- data.frame(wpbCount = sample(1:1000, 1000),
Feret = sample(seq(0, 1, 0.001), 1000))
1) Discretize
Use cut with right = FALSE so the intervals are of the form [).
I normally find this more useful than the default.
DF$cut_it <- cut(DF$Feret, right = FALSE,
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1))
2) Aggregate
TABLE <- data.frame(table(DF$cut_it))
EDIT: Another attempt, counting rows per bin with data.table:
library(data.table)
DT <- data.table(DF)
DT <- DT[, list(wpbCount = length(wpbCount),
Feret = length(Feret)
), by=cut_it]
Perhaps you are just trying to discretize and not aggregate.
Try this:
DF2 <- data.frame(wpbCount = sample(1:3, 1000, replace=T),
Feret = sample(seq(0, 1, 0.001), 1000))
DF2$Feret2 <- cut(DF2$Feret, right = FALSE,
                  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.1))
DF2 <- DF2[, c(1, 3)]
Thanks very much for your help. I used the following functions in R:
x$bin <- cut(x$Feret, right = FALSE, breaks = seq(0,max(wpbFeatures$Feret), by=0.1))
y <-aggregate(x$bin, by = x[c('wpbCount', 'bin')], length)
From your suggestions I have been able to get the data frame that I require:
wpbCount | bin | x
1 [0.2,0.3) 72
2 [0.2,0.3) 142
3 [0.2,0.3) 224
4 [0.2,0.3) 299
5 [0.2,0.3) 421
6 [0.2,0.3) 479
Now I need to plot this in 3D, and I am not sure how to do so with a non-numerical column, i.e. the bin column, which is a factor.
Does anyone know how I can plot these three columns against each other?
Check out this link.
There are some 3D plots there. However, 3D plots aren't the greatest tool for analyzing data.
If you insist on the 3D approach, try stat_contour()
from the ggplot2 package.
However, a probably better approach is to do a few plots in 2D, or to use facet_grid().
Take a look at the current ggplot2 documentation as well.
Try this based on your last answer (not tested):
ggplot(DF, aes(wpbCount , x)) +
geom_point() +
facet_grid(. ~ bin)
The idea is to use the factor variable (in this case, bin) to facet the plot.
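If you would still like all three columns in a single picture, here is a sketch of a heatmap alternative (assuming the aggregated data frame y from above, with columns wpbCount, bin and x); geom_tile() accepts the factor bin column directly as an axis:
library(ggplot2)
# Heatmap sketch: wpbCount on x, the factor bins on y, the counts mapped to fill
ggplot(y, aes(x = wpbCount, y = bin, fill = x)) +
  geom_tile()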
I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. This is how I am computing it:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
  value_colname <- deparse(substitute(.value))
  ddply(df, ..., .fun = function(rdf) {
    xs <- rdf[[value_colname]]
    qxs <- sort(unique(.quantizer(xs)))
    data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
  })
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)