R - stratified sampling for Person Period file

Following up on this question, I wondered how I can efficiently sample from a stratified person-period file.
I have a dataset that looks like this
id time var clust
1: 1 1 a clust1
2: 1 2 c clust1
3: 1 3 c clust1
4: 2 1 a clust1
5: 2 2 a clust1
...
Individuals (id) are grouped into clusters (clust). What I would like is to sample id by clust while keeping the person-period format.
The solution I came up with is to sample id and then merge back. However, it is not a very elegant solution.
library(data.table)
library(dplyr)
setDT(dt)
dt[,.SD[sample(.N,1)],by = clust] %>%
merge(., dt, by = 'id')
which gives
id clust.x time.x var.x time.y var.y clust.y
1: 2 clust1 1 a 1 a clust1
2: 2 clust1 1 a 2 a clust1
3: 2 clust1 1 a 3 c clust1
4: 3 clust2 3 c 1 a clust2
5: 3 clust2 3 c 2 b clust2
6: 3 clust2 3 c 3 c clust2
7: 5 clust3 1 a 1 a clust3
8: 5 clust3 1 a 2 a clust3
9: 5 clust3 1 a 3 c clust3
Is there a more straightforward solution ?
library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2",
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L,
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names = c("id",
"time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))

Here is a variant following @Frank's comment that might help: essentially, you sample one unique id from each clust group and use .I to find the corresponding row indices for subsetting:
dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]
# id time var clust
#1: 2 1 a clust1
#2: 2 2 a clust1
#3: 2 3 c clust1
#4: 3 1 a clust2
#5: 3 2 b clust2
#6: 3 3 c clust2
#7: 4 1 a clust3
#8: 4 2 b clust3
#9: 4 3 c clust3

I think tidy data here would have an ID table where cluster is an attribute:
idDT = unique(dt[, .(id, clust)])
id clust
1: 1 clust1
2: 2 clust1
3: 3 clust2
4: 4 clust3
5: 5 clust3
6: 6 clust2
From there, sample...
my_selection = idDT[, .(id = sample(id, 1)), by=clust]
and merge or subset
dt[ my_selection, on=names(my_selection) ]
# or
dt[ id %in% my_selection$id ]
I would keep the intermediate table my_selection around, expecting it to come in handy later.
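If you later need more than one id per cluster, the same idDT table generalizes; a minimal sketch, with a placeholder name and an assumed sample size of two per cluster (pmin() guards clusters that have fewer ids):
my_selection_k = idDT[, .(id = sample(id, pmin(2L, .N))), by = clust]
dt[id %in% my_selection_k$id]   # all time points for every sampled id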

Related

Calculating and looping summaries for individual participants into a table

I have data from several hundred participants who each provided between 1 and 6 sentences. They then rated their sentence(s) on 4 dimensions, as did two external raters.
I'd like to create a table, grouped by participant, with columns showing these values:
Participants' rate of agreement with rater 1 (par1), with rater 2 (par2) and overall (paro)
Participants' rate of agreement for each dimension with rater 1 (pad1.1, pad2.1 etc.), with rater 2 (pad1.2, pad2.2 etc.) and overall (pad1.o, pad2.o etc.)
Mean difference in rating between participant and rater 1 (mdrp1), rater 2 (mdrp2) and both raters (mdrpo)
Mean difference in rating for each dimension between participant and rater 1 (mdr1p1, mdr2p1 etc.), rater 2 (mdr1p2, mdr2p2 etc.) and both raters (mdr1po, mdr2po etc.)
(So with 4 dimensions there should be 30 values per participant)
Due to the size and structure of the data, I'm not sure where to start on this. I'm guessing that a loop would be necessary, but I've struggled to get my head around how to do that as well.
For agreement I'm considering adding TRUE/FALSE variables and then replacing them with 1 and 0 to eventually calculate agreement:
df <- df %>% mutate(par1 = (d1 == r1.1))
df <- df %>% mutate(par2 = (d1 == r2.1))
df <- df %>% mutate(paro = (d1 == r1.1 & d1 == r2.1))
And similarly for mean differences, adding variables with rating difference for each dimension...
df <- df %>% mutate(mdr1p1 = (df$d1 - df$r1.1))
df <- df %>% mutate(mdr1p2 = (df$d1 - df$r2.1))
df <- df %>% mutate(mdr1po = (df$d1 - ((df$r1.1 + df$r2.1)/2)))
...But these seem to be quite inefficient approaches!
My data looks like this:
ID Ans d1 d2 d3 d4 r1.1 r1.2 r1.3 r1.4 r2.1 r2.2 r2.3 r2.4
1 53 abc 3 3 3 3 3 2 4 3 3 2 4 3
2 a4 def 3 3 3 3 3 1 2 3 3 1 3 3
3 a4 ghi 4 4 4 4 3 2 5 1 3 1 5 2
4 hj jkl 3 3 3 3 3 1 3 3 3 1 5 3
5 32 mno 2 3 3 3 3 1 3 2 3 1 3 3
6 32 pqr 3 3 3 2 3 2 5 3 4 2 3 3
ID = participant
Ans = participants' written answer
d = dimension rated by participant
r1 = dimensions rated by external rater 1
r2 = dimensions rated by external rater 2
Example data:
structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 4L, 5L),
Ans = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu"),
d1 = c(3L, 3L, 4L, 3L, 2L, 3L, 3L), d2 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L),
d3 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L), d4 = c(3L, 3L, 4L, 3L, 3L, 2L, 3L),
r1.1 = c(3L, 3L, 3L, 3L, 3L, 3L, 3L), r1.2 = c(2L, 1L, 2L, 1L, 1L, 2L, 3L),
r1.3 = c(4L, 2L, 5L, 3L, 3L, 5L, 3L), r1.4 = c(3L, 3L, 1L, 3L, 2L, 3L, 2L),
r2.1 = c(3L, 3L, 3L, 3L, 3L, 4L, 3L), r2.2 = c(2L, 1L, 1L, 1L, 1L, 2L, 1L),
r2.3 = c(4L, 3L, 5L, 5L, 3L, 3L, 5L), r2.4 = c(3L, 3L, 2L, 3L, 3L, 3L, 2L)),
row.names = c(NA, -7L), class = "data.frame")
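One hedged starting point (not a full solution) is to group by participant and let summarise() pool the four dimensions, rather than building one mutate() per column. The name summary_tbl and the pooling of dimensions are my own assumptions; the per-dimension columns (pad1.1, mdr1p1, ...) would follow the same pattern with mean(d1 == r1.1), mean(d1 - r1.1), and so on:
library(dplyr)
summary_tbl <- df %>%
  group_by(ID) %>%
  summarise(
    # rate of agreement with rater 1, rater 2, and both raters, pooled over dimensions
    par1  = mean(c(d1 == r1.1, d2 == r1.2, d3 == r1.3, d4 == r1.4)),
    par2  = mean(c(d1 == r2.1, d2 == r2.2, d3 == r2.3, d4 == r2.4)),
    paro  = mean(c(d1 == r1.1 & d1 == r2.1, d2 == r1.2 & d2 == r2.2,
                   d3 == r1.3 & d3 == r2.3, d4 == r1.4 & d4 == r2.4)),
    # mean rating difference vs rater 1, rater 2, and the average of both
    mdrp1 = mean(c(d1 - r1.1, d2 - r1.2, d3 - r1.3, d4 - r1.4)),
    mdrp2 = mean(c(d1 - r2.1, d2 - r2.2, d3 - r2.3, d4 - r2.4)),
    mdrpo = (mdrp1 + mdrp2) / 2
  )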

Subset in R with specific values for specific columns identified by their index number

If I have a data frame like this:
df = data.frame(A = sample(1:5, 10, replace=T), B = sample(1:5, 10, replace=T), C = sample(1:5, 10, replace=T), D = sample(1:5, 10, replace=T), E = sample(1:5, 10, replace=T))
Giving me this:
A B C D E
1 1 5 1 4 3
2 2 3 5 4 3
3 4 2 2 4 4
4 2 1 2 5 2
5 3 3 4 4 5
6 3 2 3 1 5
7 1 5 4 2 3
8 1 3 5 5 1
9 3 1 1 3 5
10 5 3 1 2 4
How do I get a subset with all the rows where at least one of certain columns (B and D, say) equals 1, identifying the columns by their index numbers (2 and 4) rather than their names? In this case:
A B C D E
4 2 1 2 5 2
6 3 2 3 1 5
9 3 1 1 3 5
df[rowSums(df[c(2,4)] == 1) > 0,]
# A B C D E
# 4 2 1 2 5 2
# 6 3 2 3 1 5
# 9 3 1 1 3 5
You said to compare values by column index, so df[c(2,4)] (or df[,c(2,4)]).
df[c(2,4)] == 1 returns a matrix of logicals, whether the cell's value is equal to 1.
rowSums(.) > 0 finds those rows with at least one 1.
df[rowSums(.)>0,] selects just those rows.
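If the requirement were instead that every indexed column equals 1 (rather than at least one of them), the same pattern works with a stricter threshold:
df[rowSums(df[c(2,4)] == 1) == length(c(2,4)), ]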
Data
df <- structure(list(A = c(1L, 2L, 4L, 2L, 3L, 3L, 1L, 1L, 3L, 5L), B = c(5L, 3L, 2L, 1L, 3L, 2L, 5L, 3L, 1L, 3L), C = c(1L, 5L, 2L, 2L, 4L, 3L, 4L, 5L, 1L, 1L), D = c(4L, 4L, 4L, 5L, 4L, 1L, 2L, 5L, 3L, 2L), E = c(3L, 3L, 4L, 2L, 5L, 5L, 3L, 1L, 5L, 4L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
tidyverse
df <-
structure(
list(
A = c(1L, 2L, 4L, 2L, 3L, 3L, 1L, 1L, 3L, 5L),
B = c(5L, 3L, 2L, 1L, 3L, 2L, 5L, 3L, 1L, 3L),
C = c(1L, 5L, 2L, 2L, 4L, 3L, 4L, 5L, 1L, 1L),
D = c(4L, 4L, 4L, 5L, 4L, 1L, 2L, 5L, 3L, 2L),
E = c(3L, 3L, 4L, 2L, 5L, 5L, 3L, 1L, 5L, 4L)
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
)
library(tidyverse)
df %>%
filter(B == 1 | D == 1)
#> A B C D E
#> 4 2 1 2 5 2
#> 6 3 2 3 1 5
#> 9 3 1 1 3 5
Created on 2022-01-23 by the reprex package (v2.0.1)
data.table
library(data.table)
setDT(df)[B == 1 | D == 1, ]
#> A B C D E
#> 1: 2 1 2 5 2
#> 2: 3 2 3 1 5
#> 3: 3 1 1 3 5
Created on 2022-01-23 by the reprex package (v2.0.1)
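Since the question asks for selection by index, here is a hedged tidyverse variant that avoids hard-coding the names B and D (if_any() needs dplyr >= 1.0.4; the index vector c(2, 4) comes from the question):
library(dplyr)
df %>%
  filter(if_any(all_of(names(df)[c(2, 4)]), ~ .x == 1))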

Get sum of unique rows in table function in R

Suppose I have data which looks like this
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle
If I apply the table function, it will just combine all the rows and the result will be
K L M
X 4 4 1
Y 2 1 3
Z 0 1 2
Now, what if I want to count not all rows but only the rows with a unique Id,
so that it looks like
K L M
X 2 2 1
Y 1 1 2
Z 0 1 1
Thanks
If df is your data.frame:
# Subset original data.frame to keep columns of interest
df1 <- df[,c("Id", "Category", "Mode")]
# Remove duplicated rows
df1 <- df1[!duplicated(df1),]
# Create table
with(df1, table(Category, Mode))
# Mode
# Category K L M
# X 2 2 1
# Y 1 1 2
# Z 0 1 1
Or in one line using unique (the trailing [-1] drops the Id column so that table() only crosses Category and Mode)
table(unique(df[c("Id", "Category", "Mode")])[-1])
df <- structure(list(Id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), Name = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("A",
"B", "C", "D", "E", "F"), class = "factor"), Price = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), sales = c(5L, 6L, 5L, 4L, 3L, 5L, 5L, 5L, 5L, 8L, 8L, 5L,
5L, 5L, 5L, 4L, 5L, 5L), Profit = c(8L, 9L, 8L, 6L, 4L, 7L, 11L,
11L, 11L, 10L, 10L, 7L, 9L, 9L, 9L, 7L, 8L, 8L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"
), class = "factor"), Mode = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 2L, 3L), .Label = c("K",
"L", "M"), class = "factor"), Supplier = structure(c(1L, 1L,
1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L
), .Label = c("John", "Kyle", "Sam"), class = "factor")), .Names = c("Id",
"Name", "Price", "sales", "Profit", "Month", "Category", "Mode",
"Supplier"), class = "data.frame", row.names = c(NA, -18L))
We can try
library(data.table)
dcast(unique(setDT(df[c('Category', 'Mode', 'Id')])),
      Category ~ Mode, value.var = 'Id', length)
# Category K L M
#1: X 2 2 1
#2: Y 1 1 2
#3: Z 0 1 1
Or with dplyr (plus tidyr for spread)
library(dplyr)
library(tidyr)
df %>%
  distinct(Id, Category, Mode) %>%
  group_by(Category, Mode) %>%
  tally() %>%
  spread(Mode, n, fill = 0)
# Category K L M
# (chr) (dbl) (dbl) (dbl)
#1 X 2 2 1
#2 Y 1 1 2
#3 Z 0 1 1
Or, as @David Arenburg suggested, a variant of the above is
df %>%
  distinct(Id, Category, Mode) %>%
  select(Category, Mode) %>%
  table()
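With current tidyr, spread() has been superseded by pivot_wider(); a hedged equivalent using count() on the posted df:
library(dplyr)
library(tidyr)
df %>%
  distinct(Id, Category, Mode) %>%
  count(Category, Mode) %>%
  pivot_wider(names_from = Mode, values_from = n, values_fill = 0)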

Rescaling by group across data frames

I have two data frames
df1 <- structure(list(g1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), g2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), val1 = 1:20, val2 = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L)), .Names = c("g1", "g2", "val1", "val2"), row.names = c(NA, -20L), class = "data.frame")
df2 <- structure(list(g1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), g2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), val3 = c(5L, 6L, 7L, 3L, 4L, 5L, 2L, 3L, 4L, 8L, 9L, 10L, 4L, 5L, 6L, 5L, 6L)), .Names = c("g1", "g2", "val3"), row.names = c(NA, -17L), class = "data.frame")
> df1
g1 g2 val1 val2
1 A a 1 1
2 A a 2 2
3 A a 3 3
4 A a 4 4
5 A b 5 1
6 A b 6 2
7 A b 7 3
8 A c 8 1
9 A c 9 2
10 A c 10 3
11 B a 11 1
12 B a 12 2
13 B a 13 3
14 B b 14 1
15 B b 15 2
16 B b 16 3
17 B b 17 4
18 B c 18 1
19 B c 19 2
20 B c 20 3
> df2
g1 g2 val3
1 A a 5
2 A a 6
3 A a 7
4 A b 3
5 A b 4
6 A b 5
7 A c 2
8 A c 3
9 B c 4
10 B a 8
11 B a 9
12 B a 10
13 B b 4
14 B b 5
15 B b 6
16 B c 5
17 B c 6
My aim is to rescale df1$val2 to take values between the min and max values of df2$val3 within the respective groups.
I tried this:
library(dplyr)
df1 <- df1 %>%
  group_by(g1, g2) %>%
  mutate(rescaled = (max(df2$val3) - min(df2$val3)) * (val2 - min(val2)) /
                    (max(val2) - min(val2)) + min(df2$val3))
But the output is different from what I expect. The problem is that I can neither cbind nor merge the two data frames due to their different lengths. Any hints?
Does this work?
library(plyr)
df3 <- ddply(df2, .(g1, g2), summarize, max.val=max(val3), min.val=min(val3))
merged.df <- merge(df1, df3, by=c("g1", "g2"), all.x=TRUE)
## Now rescale merged.df$val2 as desired
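The last comment leaves the actual rescaling open; a hedged way to finish it, mirroring the formula from the question but using the merged group-wise min and max of val3 (the group-wise min and max of val2 still have to be computed, hence the group_by):
library(dplyr)
merged.df <- merged.df %>%
  group_by(g1, g2) %>%
  mutate(rescaled = (max.val - min.val) * (val2 - min(val2)) /
                    (max(val2) - min(val2)) + min.val) %>%
  ungroup()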

count frequency based on values in 2 or more columns

I have a pretty simple question, but I can't think of a way to do this without using if statements.
The data I have looks something like:
df <- structure(list(years = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), id = c(1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), x = structure(c(2L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L,
1L), .Label = c("E", "I"), class = "factor")), .Names = c("years",
"id", "x"), class = "data.frame", row.names = c(NA, -18L))
so the table looks like:
years id x
1 1 1 I
2 2 1 E
3 3 1 E
4 1 1 E
5 2 1 I
6 3 1 I
7 1 2 I
8 2 2 E
9 3 2 I
10 1 2 E
11 2 2 E
12 3 2 I
13 1 3 I
14 2 3 E
15 3 3 I
16 1 3 I
17 2 3 I
18 3 3 E
I would like the output to report the fraction of x's that are "I" for each id and each year:
years id xnew
1 1 1 0.5
2 2 1 0.5
3 3 1 0.5
4 1 2 0.5
5 2 2 0.0
6 3 2 1.0
7 1 3 1.0
8 2 3 0.5
9 3 3 0.5
Any help would be greatly appreciated! Thank you!
aggregate(x ~ years + id, data=df, function(y) sum(y=="I")/length(y) )
years id x
1 1 1 0.5
2 2 1 0.5
3 3 1 0.5
4 1 2 0.5
5 2 2 0.0
6 3 2 1.0
7 1 3 1.0
8 2 3 0.5
9 3 3 0.5
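An equivalent dplyr version, in case grouped verbs are preferred over the formula interface (a sketch assuming the posted df):
library(dplyr)
df %>%
  group_by(id, years) %>%
  summarise(xnew = mean(x == "I"), .groups = "drop")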
