I have two datasets of lobster egg size data taken by different samplers, which will be used to assess measurement variability. Each sampler measures ~50 eggs/lobster from numerous lobsters. However, occasionally some lobsters are processed by sampler one and not sampler two, and vice versa. I would like to combine the data from the two samplers into a new dataset, but remove all data from lobsters processed by only one sampler. I've played around with dplyr's semi_join and intersect, but I need the matching to be performed in both directions (dataset 1 -> 2 and dataset 2 -> 1). I am able to create a new dataset that binds rows from the two samplers, but I am not clear on how to remove the lobster IDs that are unique to one dataset.
Here is a simplified version of my data, where there are multiple egg area measurements taken from multiple lobsters, but the sampling does not always overlap (i.e., eggs measured from an individual by only one sampler and not the other):
install.packages("dplyr")
library(dplyr)
sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster2",
"Lobster2","Lobster3","Lobster3","Lobster3"),
Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8),
Sampler=c(rep("Sampler1", 10)))
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1",
"Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster4","Lobster4"),
Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4),
Sampler=c(rep("Sampler2", 10)))
combined <- bind_rows(sampler1, sampler2)
desiredresult <- combined[-c(8, 9, 10, 19, 20), ]
The bottom line of the script is the desired result from the mock data. I was hoping to limit use to base R or dplyr.
sampler1 %>%
  rbind(sampler2) %>%
  filter(LobsterID %in% intersect(sampler1$LobsterID, sampler2$LobsterID))
combined <- bind_rows(sampler1, sampler2)
Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)])
combined <- combined[combined$LobsterID %in% Lobsters.2.sample,]
Using base R
combined <-rbind(sampler1, sampler2)
inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]])
output <- combined[combined[["LobsterID"]] %in% inBoth, ]
intersect finds the set intersection of two vectors, giving you the lobsters present in both samples. All functions are vectorized, so it should run pretty fast.
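Since the question mentions semi_join, here is a sketch of that dplyr route, which gives the same result: keep each sampler's rows that have a match in the other dataset, then bind them.
library(dplyr)

both <- bind_rows(
  semi_join(sampler1, sampler2, by = "LobsterID"),  # sampler1 rows with a match in sampler2
  semi_join(sampler2, sampler1, by = "LobsterID")   # sampler2 rows with a match in sampler1
)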
Bind the rows, group, and filter by the number of distinct samplers in each group:
sampler1 %>% bind_rows(sampler2) %>%
group_by(LobsterID) %>%
filter(n_distinct(Sampler) == 2)
## Source: local data frame [15 x 3]
## Groups: LobsterID [2]
##
## LobsterID Area Sampler
## <chr> <dbl> <chr>
## 1 Lobster1 0.40 Sampler1
## 2 Lobster1 0.35 Sampler1
## 3 Lobster2 1.10 Sampler1
## 4 Lobster2 1.04 Sampler1
## 5 Lobster2 1.14 Sampler1
## 6 Lobster2 1.10 Sampler1
## 7 Lobster2 1.05 Sampler1
## 8 Lobster1 0.41 Sampler2
## 9 Lobster1 0.44 Sampler2
## 10 Lobster1 0.47 Sampler2
## 11 Lobster1 0.43 Sampler2
## 12 Lobster1 0.38 Sampler2
## 13 Lobster2 1.14 Sampler2
## 14 Lobster2 1.11 Sampler2
## 15 Lobster2 1.09 Sampler2
Here is an option using data.table: use rbindlist to bind the datasets, group by 'LobsterID', and subset the rows with a logical condition on the number of unique elements in 'Sampler', i.e. equal to 2.
library(data.table)
rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]
I have a for loop I would like to run by group. It should run through a set of data, create a time series for most rows, and then output a forecast for that row of data (based on that time point and the ones preceding it) within the group. The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually, as that would take hours and surely there is a better way.
Allow me to explain in more detail.
I have a large dataset (1.6M rows); each row has a year, country A, country B, and a number of measures which concern the relationship between the two.
So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to output the necessary forecast data to a new variable in the dataset. I'd like to have that for loop run over every (country A, country B) grouping with more than 3 entries.
The data:
Here I will replicate a small slice of the data, and will include a missing value for realism.
set.seed(2000)
df <- data.frame(year = rep(c(1946:1970),length.out=50),
ccode1 = rep(c("2"), length.out = 50),
ccode2 = rep(c("20","31"), each=25),
kappavv = rnorm(50,mean = 0, sd=0.25),
output = NA)
df$kappavv[12] <- NA
What I've done:
NOTE: I start forecasting from the third data point of each group but based on all time points preceding the forecast.
library(imputeTS)  # for na_interpolation()
library(forecast)  # for holt()

for (i in 3:nrow(df)) {
  dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
  dat_ts_corr <- na_interpolation(dat_ts)
  trialseries <- holt(dat_ts_corr, h = 1)
  df$output[i] <- trialseries$mean
}
This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2, provided the data are arranged in ascending order of year.
What isn't working:
I am having some serious problems getting my head around applying this for loop by grouping of ccode2. Some of my data is uneven: groups can be different sizes, with different start/end points, and there are missing data.
I have tried expressing the loop as a function, using group_by() and piping, and using various types of apply() functions.
Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.
You can put the for loop code in a function.
library(dplyr)
library(purrr)
apply_func <- function(df) {
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1),
end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
trialseries <- forecast::holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
return(df)
}
Split the data by ccode2 and apply apply_func.
df %>% group_split(ccode2) %>% map_df(apply_func)
# year ccode1 ccode2 kappavv output
# <int> <chr> <chr> <dbl> <dbl>
# 1 1946 2 20 -0.213 NA
# 2 1947 2 20 -0.0882 NA
# 3 1948 2 20 0.223 0.286
# 4 1949 2 20 0.435 0.413
# 5 1950 2 20 0.229 0.538
# 6 1951 2 20 -0.294 0.477
# 7 1952 2 20 -0.485 -0.675
# 8 1953 2 20 0.524 0.405
# 9 1954 2 20 0.0564 0.0418
#10 1955 2 20 0.294 0.161
# … with 40 more rows
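If you also need to group by both country codes and skip pairs with too few rows (as described in the question), a sketch along the same lines, reusing apply_func from above:
library(dplyr)
library(purrr)

df %>%
  arrange(ccode1, ccode2, year) %>%   # the forecasts assume ascending years
  group_by(ccode1, ccode2) %>%
  filter(n() > 3) %>%                 # only pairings with more than 3 entries
  group_split() %>%                   # keeps all columns, including the codes
  map_df(apply_func)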
I have a data frame in which I would like to compute some extra columns as a function of the existing columns, but I want to specify both each new column name and the function dynamically. I have a vector of column names that are already in the dataframe df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the dataframe have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day, etc.). I want to average the responses to these questions by Days_From_First_Use. I start by grouping my dataframe by Days_From_First_Use:
df_test <- df_daily %>%
group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
df_test <- df_test %>%
mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a string such as "mean(Q1_Daily)" in the relevant column.
So I'm clearly doing something wrong - what do I need to do to fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use). I achieve the dynamic names by using rename_with() and a custom function that replaces a leading "Q" with "Avg_Score_Q".
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
df_test <- df_daily %>%
group_by(Days_From_First_Use) %>%
summarize(across(.fns = mean)) %>%
rename_with(.fn = function(x) gsub("^Q","Avg_Score_Q",x))
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
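For reference, the original loop also works with a small change: the pasted column-name string has to be converted to a symbol and unquoted, so that mean() is evaluated on the column rather than on a string. A minimal sketch, assuming df_test is already grouped by Days_From_First_Use:
library(dplyr)
library(rlang)

for (i in 1:5) {
  df_test <- df_test %>%
    mutate(!!paste0("Avg_Score_", DAILY_QUESTIONS[i]) :=
             mean(!!sym(DAILY_QUESTIONS[i]), na.rm = TRUE))  # sym() turns the string into a column symbol
}
Note that mutate() keeps one row per observation with the group average repeated; use summarize() as in the answer above if you want one row per day.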
The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination I would like to calculate, for all materials available at that combination (which might vary from key to key), the differences in amount,
e.g. for combination "1001237 2015-02-17" this would be, for materials 10025 and 10026, 2.40-3.40=-1 (but there might be more combinations). (How can I store those values efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new A x A matrix (A = 1000) where each entry (i,j) (material combination i and j) contains the average of the values calculated in the step before.
More formally, entry (i,j) is given by
entry(i,j) = (1 / |C_ij|) * sum over all (Key, Date) in C_ij of (Amount_i - Amount_j),
where C_ij is the set of all Key-Date combinations containing both Mat i and Mat j.
As the table is quite large, efficiency of the computation is very important.
Thank you very much for your help in advance!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
library(tidyverse)  # for %>%, group_by(), nest(), mutate()

materials <- unique(x$Mat)
n <- length(materials)
x <- x %>%
group_by(Key, Date) %>%
nest() %>%
# Create a n by n matrix for each combination of Key and Date
mutate(matrices = lapply(data,
function(y) {
out <- matrix(nrow = n, ncol = n,
dimnames = list(materials, materials))
# Only fill in when the pair of materials is present
# for the date of interest
mat_present <- as.character(unique(y$Mat))
for (i in mat_present) {
for (j in mat_present) {
# You may want to take an absolute value
out[i,j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
}
}
out
}))
If you really want speed, you can implement the function inside lapply with Rcpp, and use RcppParallel to speed it up further. At this point one of the columns of the data frame is a list of matrices. Then, element by element across those matrices, take an average while ignoring NAs:
x_arr <- array(unlist(x$matrices), dim = c(2,2,10))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
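An equivalent and perhaps more direct way to average across the slices is to apply mean over the first two margins of the same array (a sketch, assuming x_arr from above):
results <- apply(x_arr, c(1, 2), mean, na.rm = TRUE)  # element-wise mean across slices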
Is there a clean/automatic way in R to convert CSV values formatted as percentages (with a trailing % symbol)?
Here is some example data:
actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%
Which can be read using:
junk = read.csv("Example.csv")
But all of the % columns are read as strings and converted to factors:
> str(junk)
'data.frame': 4 obs. of 3 variables:
$ actual : num 2.15 0.917 7.941 4.964
$ simulated : num 8.607 8.027 0.215 3.524
$ percent.error: Factor w/ 4 levels "-300%","-775%",..: 1 2 4 3
but I would like them to be numeric values.
Is there an additional parameter for read.csv? Is there a way to easily post process the needed columns to convert to numeric values? Other solutions?
Note: of course in this example I could simply recompute the values, but in my real application with a larger data file this is not practical.
There is no "percentage" type in R. So you need to do some post-processing:
DF <- read.table(text="actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%", sep=",", header=TRUE)
DF[,3] <- as.numeric(gsub("%", "",DF[,3]))/100
# actual simulated percent.error
#1 2.1496 8.6066 -3.00
#2 0.9170 8.0266 -7.75
#3 7.9406 0.2152 0.97
#4 4.9637 3.5237 0.29
This is the same as Roland's solution, except it uses the stringr package. I'd recommend stringr when working with strings, as its interface is more intuitive.
library(stringr)
d <- str_replace(junk$percent.error, pattern="%", "")
junk$percent.error <- as.numeric(d)/100
With data.table you can achieve it as
a <- fread("file.csv")[,`percent error` := as.numeric(sub('%', '', `percent error`))/100]
The tidyverse has multiple ways of solving such issues. You can use the col_number() specification (backed by parse_number()), which strips symbols and text from a number:
library(readr)

sample_data = "actual,simulated,percent error\n 2.1496,8.6066,-300%\n 0.9170,8.0266,-775%\n7.9406,0.2152,97%\n4.9637,3.5237,29%"
DF <- read_csv(sample_data, col_types = cols(`percent error` = col_number()))
# A tibble: 4 x 3
# actual simulated `percent error`
# <chr> <dbl> <dbl>
# 1 2.1496 8.61 -300
# 2 0.9170 8.03 -775
# 3 7.9406 0.215 97.0
# 4 4.9637 3.52 29.0
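Note that col_number()/parse_number() only strips the % sign; it does not rescale, which is why the values above are -300 rather than -3.00. If you want fractions like the other answers, divide afterwards (a sketch using the DF from above):
library(dplyr)
DF <- DF %>% mutate(`percent error` = `percent error` / 100)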
I have two data frames with two columns each. The first column is timestamps and the second contains some values.
One of the data frames is much bigger than the other one, but both of them contain data in the same timestamp range.
If I plot these two on top of each other, I will get a nice plot showing how they differ in time.
Now I would like to get the absolute difference by time of these two dataframes to make another plot showing how much they differ (or to create a boxplot with this information), even though they do not have the same length and exactly matching timestamps.
Check this example:
df1:
timestamp | data
1334103075| 1.2
1334103085| 1.5
1334103095| 0.9
1334103105| 0.7
1334103115| 1.1
1334103125| 0.8
df2:
timestamp | data
1334103078| 1.2
1334103099| 1.5
1334103123| 0.8
1334103125| 0.9
How would I achieve something like this:
df3 <- abs(df1-df2)
As you see df2 might not have the same timestamps as df1, but they both have timestamps in the same time range.
Of course, the subtraction should match timestamps where possible, or otherwise use values from the nearest timestamps.
I would suggest using two linear interpolators and evaluate both of them on the union of your two sets of timestamps:
df1 <- data.frame(timestamp = c(1334103075, 1334103085, 1334103095,
1334103105, 1334103115, 1334103125),
data = c(1.2, 1.5, 0.9, 0.7, 1.1, 0.8))
df2 <- data.frame(timestamp = c(1334103078, 1334103099, 1334103123,
1334103125),
data = c(1.2, 1.5, 0.8, 0.9))
library(Hmisc)
all.timestamps <- sort(unique(c(df1$timestamp, df2$timestamp)))
data1 <- approxExtrap(df1$timestamp, df1$data, all.timestamps)$y
data2 <- approxExtrap(df2$timestamp, df2$data, all.timestamps)$y
df3 <- data.frame(timestamp = all.timestamps,
data1 = data1,
data2 = data2,
abs.diff = abs(data1 - data2))
df3
# timestamp data1 data2 abs.diff
# 1 1334103075 1.20 1.157143 0.04285714
# 2 1334103078 1.29 1.200000 0.09000000
# 3 1334103085 1.50 1.300000 0.20000000
# 4 1334103095 0.90 1.442857 0.54285714
# 5 1334103099 0.82 1.500000 0.68000000
# 6 1334103105 0.70 1.325000 0.62500000
# 7 1334103115 1.10 1.033333 0.06666667
# 8 1334103123 0.86 0.800000 0.06000000
# 9 1334103125 0.80 0.900000 0.10000000
Then you could consider fitting splines if you are not quite happy with linear approximations.
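For the spline variant, base R's splinefun() can replace the linear interpolators; a sketch on the same grid of timestamps (keep in mind splines can overshoot between points):
f1 <- splinefun(df1$timestamp, df1$data, method = "natural")
f2 <- splinefun(df2$timestamp, df2$data, method = "natural")
df3_spline <- data.frame(timestamp = all.timestamps,
                         abs.diff  = abs(f1(all.timestamps) - f2(all.timestamps)))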