R - sample and resample a person-period file - r

I am working with a very large person-period file, and I thought that
sampling and resampling would be a good way to deal with a dataset of this size.
My person-period file looks like this:
id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3
I actually have two distinct issues.
The first issue is that I am having trouble simply sampling a person-period file.
For example, I would like to sample 2 id-sequences such as:
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code is working for sampling a person-period file
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like to use a dplyr solution because I am interested in resampling and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)
I am struggling with the dplyr solution because I am not sure what should be the grouping variable.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and resample a person-period file?
data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")

I think the idiomatic way would probably look like
set.seed(1)
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt, by = "id")
id code time
1 2 b 1
2 2 c 2
3 2 b 3
4 5 a 1
5 5 c 2
6 5 a 3
This extends straightforwardly to more grouping variables and fancier sampling rules.
If you need to do this many times:
nrep = 100  # number of replicates
ng = 2      # number of ids drawn per replicate
samps = dt %>% select(id) %>% distinct %>%
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
repdat = left_join(samps, dt, by = "id")
# then do stuff with it (do_stuff stands in for your own per-replicate function):
repdat %>% group_by(r) %>% do_stuff
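If the goal is to call replicate() directly, as the question asks, one possible sketch is to wrap the id-level draw in a small function. The helper name sample_ids is hypothetical (it is not part of dplyr), and the data frame below only re-creates the example data with plain integer and character columns instead of factors.

```r
library(dplyr)

# Example data re-created from the question (plain columns, an assumption).
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

# Hypothetical helper: draw n_ids distinct ids and keep every row of each one.
sample_ids <- function(data, n_ids) {
  keep <- sample(unique(data$id), n_ids)
  filter(data, id %in% keep)
}

set.seed(42)
# A list of 100 data frames, each holding 2 complete id-sequences.
resamples <- replicate(100, sample_ids(dt, 2), simplify = FALSE)
```

Each element of resamples can then be passed to lapply() or purrr::map() for whatever is computed per replicate.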

I imagine you are running simulations and may want to do the subsetting many times. In that case you may also want to try this data.table method, which uses fast binary search on the key column:
library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)
#[[1]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 5 a 1
#5: 5 c 2
#6: 5 a 3
#[[2]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 4 c 1
#5: 4 a 2
#6: 4 c 3

We can use filter with sample
dt %>%
filter(id %in% sample(unique(id),2, replace = FALSE))
NOTE: The OP asked for a dplyr method, and this solution uses dplyr.
If we need to replicate this, one option is to use map from purrr
library(purrr)
dt %>%
distinct(id) %>%
replicate(2, .) %>%
map(~sample(., 2, replace=FALSE)) %>%
map(~filter(dt, id %in% .))
#$id
# id code time
#1 1 a 1
#2 1 a 2
#3 1 a 3
#4 4 c 1
#5 4 a 2
#6 4 c 3
#$id
# id code time
#1 4 c 1
#2 4 a 2
#3 4 c 3
#4 5 a 1
#5 5 c 2
#6 5 a 3
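Note that the %in% approaches above cannot return the same id twice, so they only cover sampling without replacement. For a bootstrap-style resample, where an id may be drawn more than once, a join keeps the duplicates. This is a minimal base R sketch, assuming that is the kind of resampling wanted; the draw column is an illustrative addition used to tell repeated ids apart.

```r
# Example data re-created from the question (plain columns, an assumption).
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

set.seed(1)
ids <- unique(dt$id)

# Draw as many ids as there are persons, with replacement.
draws <- data.frame(
  draw = seq_along(ids),
  id   = sample(ids, length(ids), replace = TRUE)
)

# The join repeats the full sequence of an id once per time it was drawn.
boot <- merge(draws, dt, by = "id")
boot <- boot[order(boot$draw, boot$time), ]
```

Grouping later analyses by draw rather than by id keeps two copies of the same person separate.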

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can drop the last column from one copy of the data and the first column from another, subtract the two, and use rowSums on the absolute values in base R. This can be very efficient compared to a package solution
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
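To make the shifted-columns idea easier to follow, here is the same computation split into named steps on a matrix. The names left and right are only illustrative.

```r
df1 <- data.frame(
  A = c(1L, 2L, 1L, 4L),
  B = c(2L, 1L, 2L, 1L),
  C = c(3L, 3L, 1L, 2L),
  D = c(4L, 4L, 1L, 1L)
)

m <- as.matrix(df1)
left  <- m[, -ncol(m), drop = FALSE]  # columns A, B, C
right <- m[, -1, drop = FALSE]        # columns B, C, D

# Element-wise |A-B|, |B-C|, |C-D|, then summed per row.
sum_diff <- rowSums(abs(left - right))
sum_diff
# 3 4 2 5
```

For the first row this is |1-2| + |2-3| + |3-4| = 3, which matches the expected output in the question.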
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column combinations. I tried combn, but that returns all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
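The hand-written list of adjacent pairs could also be generated from the column names, which may help when there are many columns. This is a sketch in base R, assuming the columns are already in the order in which they should be compared.

```r
data <- data.frame(
  A = c(1L, 2L, 1L, 4L),
  B = c(2L, 1L, 2L, 1L),
  C = c(3L, 3L, 1L, 2L),
  D = c(4L, 4L, 1L, 1L)
)

cols <- names(data)

# Pair each column with its right-hand neighbour: (A,B), (B,C), (C,D).
combinations <- unname(Map(c, head(cols, -1), tail(cols, -1)))

# One column of differences per pair, then the row-wise sum of absolute values.
diffs <- sapply(combinations, function(p) data[[p[1]]] - data[[p[2]]])
sum_diff <- rowSums(abs(diffs))
```

Unlike combn(cols, 2), this produces only the neighbouring pairs, not every possible pair.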
Here is a dplyr version of @akrun's elegant approach, which calculates the difference between the data frame and its shifted variant:
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector from which its shifted self is subtracted.
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())

Create a calculated field in R at each record/row level

I have the below dataframe from which I intend to create a calculated field at each Code level or row level.
Code count_pol const_q
A028 12 3
B09 7 4
M017 5 2
S83 4 1
S1960 6 4
S179 2 2
S193 3 3
In the above dataset, I want to create a calculated field y according to the following rule:
If count_pol for a code is 1, 2, or 3, then y = count_pol/const_q; otherwise y = const_q/4.
Thus the expected output is:
Code count_pol const_q y
A028 12 3 0.75
B09 7 4 1
M017 5 2 0.5
S83 4 1 0.25
S1960 6 4 1
S179 2 2 1
S193 3 3 1
I have tried the below code:
a_df <- mutate(a_df,
y = if_else(count_pol %in% c(1:3), as.integer(const_q)/count_pol,const_q/4))
but that does not give the desired output.
Can someone please help me rectify this?
We can use if_else to check for values in 1:3
library(dplyr)
df %>% mutate(y = if_else(count_pol %in% 1:3, count_pol/const_q, const_q/4))
# Code count_pol const_q y
#1 A028 12 3 0.75
#2 B09 7 4 1.00
#3 M017 5 2 0.50
#4 S83 4 1 0.25
#5 S1960 6 4 1.00
#6 S179 2 2 1.00
#7 S193 3 3 1.00
and in base R that would be
transform(df, y = ifelse(count_pol %in% 1:3, count_pol/const_q, const_q/4))
data
df <- structure(list(Code = structure(c(1L, 2L, 3L, 7L, 6L, 4L, 5L),
.Label = c("A028", "B09", "M017", "S179", "S193", "S1960", "S83"),
class = "factor"), count_pol = c(12L, 7L, 5L, 4L, 6L, 2L, 3L), const_q = c(3L,
4L, 2L, 1L, 4L, 2L, 3L)), class = "data.frame", row.names = c(NA, -7L))
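An alternative that avoids ifelse() altogether is to fill in the default first and then overwrite only the rows that meet the condition. This is a base R sketch of the same rule; the variable name small is illustrative, and Code is stored as character here rather than as a factor.

```r
df <- data.frame(
  Code      = c("A028", "B09", "M017", "S83", "S1960", "S179", "S193"),
  count_pol = c(12L, 7L, 5L, 4L, 6L, 2L, 3L),
  const_q   = c(3L, 4L, 2L, 1L, 4L, 2L, 3L)
)

# Rows where count_pol is 1, 2 or 3.
small <- df$count_pol %in% 1:3

# Default rule for every row, then the special rule for the flagged rows.
df$y <- df$const_q / 4
df$y[small] <- df$count_pol[small] / df$const_q[small]
```

Only S179 and S193 are flagged here, and both end up with y = 1.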
With case_when(); no grouping is needed because the rule is evaluated row by row:
df %>%
mutate(
y = case_when(
count_pol %in% c(1, 2, 3) ~ count_pol/const_q,
TRUE ~ const_q/4
)
)

R - add column that counts sequentially within groups but repeats for duplicates

I'm looking for a solution to add the column "desired_result", preferably using dplyr and/or ave(). See the data frame here, where the group is "section" and the unique instances I want my "desired_result" column to count sequentially are in "exhibit":
structure(list(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), exhibit = structure(c(1L,
2L, 3L, 3L, 1L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"),
desired_result = c(1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L)), .Names = c("section",
"exhibit", "desired_result"), class = "data.frame", row.names = c(NA,
-8L))
dense_rank it is
library(dplyr)
df %>%
group_by(section) %>%
mutate(desire=dense_rank(exhibit))
# section exhibit desired_result desire
#1 1 a 1 1
#2 1 b 2 2
#3 1 c 3 3
#4 1 c 3 3
#5 2 a 1 1
#6 2 b 2 2
#7 2 b 2 2
#8 2 c 3 3
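Since the question also mentions ave(), the same dense ranking can be sketched in base R. This assumes the ranking should follow the sorted order of the exhibit values within each section, which is what dense_rank does; the column name result is illustrative.

```r
df <- data.frame(
  section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  exhibit = c("a", "b", "c", "c", "a", "b", "b", "c")
)

# Integer codes for exhibit, then a dense rank of those codes within each section.
codes <- as.integer(factor(df$exhibit))
df$result <- ave(codes, df$section,
                 FUN = function(x) match(x, sort(unique(x))))
```

match() against the sorted unique values gives ties the same number and leaves no gaps between ranks.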
I've recently pushed a function rleid() to data.table (currently available on the development version, 1.9.5), which does exactly this. If you're interested, you can install it by following this.
require(data.table) # 1.9.5, for `rleid()`
require(dplyr)
DF %>%
group_by(section) %>%
mutate(desired_results=rleid(exhibit))
# section exhibit desired_result desired_results
# 1 1 a 1 1
# 2 1 b 2 2
# 3 1 c 3 3
# 4 1 c 3 3
# 5 2 a 1 1
# 6 2 b 2 2
# 7 2 b 2 2
# 8 2 c 3 3
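One caveat worth keeping in mind: a run-length id numbers runs of consecutive equal values, so it agrees with a dense rank only when equal exhibits sit next to each other and appear in sorted order. The base R sketch below imitates a run-length id with rle() (it does not call data.table) on a made-up vector where the two differ.

```r
# Made-up example: "a" appears again after "b".
x <- c("a", "b", "b", "a")

# Run-length id: a new number every time the value changes.
r <- rle(x)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Dense rank: the same value always gets the same number.
dense <- match(x, sort(unique(x)))

run_id
# 1 2 2 3
dense
# 1 2 2 1
```

For the data in the question the two give the same result, because each section is already sorted by exhibit.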
If exact enumeration is necessary and you need the desired result to be consistent (so that a same exhibit in a different section will always have the same number), you can try:
library(dplyr)
df <- data.frame(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
exhibit = c('a', 'b', 'c', 'c', 'a', 'b', 'b', 'c'))
if (is.null(saveLevels <- levels(df$exhibit)))
saveLevels <- sort(unique(df$exhibit)) ## or levels(factor(df$exhibit))
df %>%
group_by(section) %>%
mutate(answer = as.integer(factor(exhibit, levels = saveLevels)))
## Source: local data frame [8 x 3]
## Groups: section
## section exhibit answer
## 1 1 a 1
## 2 1 b 2
## 3 1 c 3
## 4 1 c 3
## 5 2 a 1
## 6 2 b 2
## 7 2 b 2
## 8 2 c 3
If/when a new exhibit appears in subsequent sections, they should get newly enumerated results. (Notice the last exhibit is different.)
df2 <- data.frame(section = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
exhibit = c('a', 'b', 'c', 'c', 'a', 'b', 'b', 'd'))
if (is.null(saveLevels2 <- levels(df2$exhibit)))
saveLevels2 <- sort(unique(df2$exhibit))
df2 %>%
group_by(section) %>%
mutate(answer = as.integer(factor(exhibit, levels = saveLevels2)))
## Source: local data frame [8 x 3]
## Groups: section
## section exhibit answer
## 1 1 a 1
## 2 1 b 2
## 3 1 c 3
## 4 1 c 3
## 5 2 a 1
## 6 2 b 2
## 7 2 b 2
## 8 2 d 4

R Aggregate and count of not null

I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
Where sample size is the count of pieces in each sample, and qc_code_count is the count of all qc_code values that are not NA.
How would I go about this in R?
You can try
library(dplyr)
df1 %>%
group_by(SAMPLE) %>%
summarise(SAMPLE_SIZE=n(), QC_CODE_UNIT= sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame,aggregate(QC_CODE~SAMPLE, df1, na.action=NULL,
FUN=function(x) c(SAMPLE_SIZE=length(x), QC_CODE_UNIT= sum(!is.na(x)))))
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))
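All three variants rely on the same trick: !is.na(x) is a logical vector, and summing it counts the TRUE values. The same idea can be sketched with table() and tapply() in base R; the column names follow the output requested in the question.

```r
df1 <- data.frame(
  PIECE   = 1:10,
  SAMPLE  = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA, 6L, NA)
)

out <- data.frame(
  SAMPLE        = sort(unique(df1$SAMPLE)),
  # Number of rows per sample.
  SAMPLE_SIZE   = as.vector(table(df1$SAMPLE)),
  # TRUE counts as 1, so the sum is the number of non-NA codes per sample.
  QC_CODE_COUNT = as.vector(tapply(!is.na(df1$QC_CODE), df1$SAMPLE, sum))
)
```

Both table() and tapply() return their results in sorted order of SAMPLE, which is why they line up with sort(unique(df1$SAMPLE)).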

Complex data frame transposition in R

I've tried searching for an answer to this, but most data.frame/matrix transpositions aren't as complicated as what I am trying to accomplish. Basically I have a data.frame which looks like
F M A
2008_b 1 5 6
2008_r 3 3 6
2008_a 4 1 5
2009_b 1 1 2
2009_r 5 4 9
2009_a 2 2 4
I'm trying to transpose it and rename the column and row names as such:
F_b M_b A_b F_r M_r A_r F_a M_a A_a
2008 1 5 6 3 3 6 4 1 5
2009 1 1 2 5 4 9 2 2 4
Essentially every three rows are being collapsed into a single row. I assume this can be done with some clever plyr or reshape2 commands, but I'm at a total loss as to how to accomplish it.
You could try
library(dplyr)
library(tidyr)
lvl <- c(outer(colnames(df), unique(gsub(".*_", "", rownames(df))),
FUN=paste, sep="_"))
res <- cbind(Var1=row.names(df), df) %>%
gather(Var2, value, -Var1) %>%
separate(Var1, c('Var11', 'Var12')) %>%
unite(VarN, Var2, Var12) %>%
mutate(VarN=factor(VarN, levels=lvl)) %>%
spread(VarN, value)
row.names(res) <- res[,1]
res1 <- res[,-1]
res1
# F_b M_b A_b F_r M_r A_r F_a M_a A_a
#2008 1 5 6 3 3 6 4 1 5
#2009 1 1 2 5 4 9 2 2 4
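For comparison, the collapse can also be sketched without any packages by flattening each year's block of rows. This assumes every year has the same suffixes in the same row order, as in the example; the names parts, year and suffix are illustrative.

```r
df <- data.frame(
  F = c(1L, 3L, 4L, 1L, 5L, 2L),
  M = c(5L, 3L, 1L, 1L, 4L, 2L),
  A = c(6L, 6L, 5L, 2L, 9L, 4L),
  row.names = c("2008_b", "2008_r", "2008_a", "2009_b", "2009_r", "2009_a")
)

# Split "2008_b" into the year ("2008") and the suffix ("b").
parts  <- strsplit(rownames(df), "_", fixed = TRUE)
year   <- sapply(parts, `[`, 1)
suffix <- sapply(parts, `[`, 2)

# One output row per year: flatten that year's rows, one input row after another.
res <- t(sapply(split(seq_len(nrow(df)), year), function(rows) {
  block <- as.matrix(df[rows, ])
  vals <- as.vector(t(block))  # F_b, M_b, A_b, F_r, M_r, A_r, ...
  names(vals) <- as.vector(outer(colnames(df), suffix[rows], paste, sep = "_"))
  vals
}))
```

The result is a matrix with the years as row names; wrap it in as.data.frame() if a data frame is needed.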
data
df <- structure(list(F = c(1L, 3L, 4L, 1L, 5L, 2L), M = c(5L, 3L, 1L,
1L, 4L, 2L), A = c(6L, 6L, 5L, 2L, 9L, 4L)), .Names = c("F",
"M", "A"), class = "data.frame", row.names = c("2008_b", "2008_r",
"2008_a", "2009_b", "2009_r", "2009_a"))
