I'm afraid I can't find an answer to my problem.
I am looking to create a randomized assignment of cards to people:
1) There are 4 sets of cards (A, B, C, D) and 16 cards in total.
2) Each card is numbered within a set (A from 1 to 4, B from 5 to 8, and so on).
3) We want to randomize the assignment such that each person is randomly assigned a set of cards, for example A.
4) In addition, the order of the cards within the set has to be randomized.
So what we want is the following:
Person 1: Set A, cards 1-2-3-4
Person 2: Set A, cards 4-2-3-1
Person 3: Set D, cards 16-15-12-13
and so on.
I would also like each number to be in a separate column.
Thanks for your help!
S.
If each person gets one set of cards:
> df=NULL
> a=rep(LETTERS[1:4],4)
> df$card1=sample(a,16,F)
> library(data.table) # order(card1) below needs data.table's column scoping
> df=as.data.table(df)
> df=df[order(card1),]
> df
card1
1: A
2: A
3: A
4: A
5: B
6: B
7: B
8: B
9: C
10: C
11: C
12: C
13: D
14: D
15: D
16: D
> df$card2=rep((1:4),4)
> df
card1 card2
1: A 1
2: A 2
3: A 3
4: A 4
5: B 1
6: B 2
7: B 3
8: B 4
9: C 1
10: C 2
11: C 3
12: C 4
13: D 1
14: D 2
15: D 3
16: D 4
> df1=df[sample(nrow(df)),]
> df1
card1 card2
1: A 2
2: D 4
3: C 3
4: D 3
5: B 3
6: D 1
7: C 2
8: A 3
9: B 2
10: D 2
11: B 1
12: A 1
13: C 4
14: C 1
15: B 4
16: A 4
Here's one way of approaching this.
person <- c("Person1", "Person2", "Person3", "Person4")
cardset <- LETTERS[1:4]
set.seed(357) # this is for reproducibility
xy <- data.frame(
person = sample(person), # pick out persons in a random order
set = sample(cardset)) # assign a random card set to a person
vx <- rep(xy$set, each = 4) # for each set, create repeats
vy <- split(paste(vx, rep(1:4, times = 4), sep = ""), f = vx) # append numbers to it
vz <- do.call(rbind, sapply(vy, FUN = sample, simplify = FALSE)) # shuffle using sapply and stitch together with do.call
cbind(xy, vz[as.character(xy$set), ]) # index vz by set name so each person's cards come from the assigned set
person set 1 2 3 4
C Person1 C C2 C3 C4 C1
B Person4 B B2 B1 B4 B3
D Person3 D D1 D2 D4 D3
A Person2 A A4 A3 A2 A1
Here's another option:
# create data frame of decks and their numbered cards
cards <- data.frame(deck = rep(LETTERS[1:4], each = 4),
numbers = c(1:16),
stringsAsFactors = FALSE)
# create list of people
people <- c("Person1", "Person2", "Person3")
# loop through each person and randomly select a deck
# based on deck selected, subset the cards that can be used
# randomize the numbered cards
# add the deck, order of cards, and person to a
# growing data frame of assignments
assignment <- NULL
for(i in unique(people)) {
set <- sample(cards$deck, size = 1)
setCards <- cards[cards$deck == set, ]
orderCards <- sample(setCards$numbers)
assignment <- rbind(assignment, data.frame(Person = i,
Deck = set,
Card1 = orderCards[1],
Card2 = orderCards[2],
Card3 = orderCards[3],
Card4 = orderCards[4],
stringsAsFactors = FALSE))
}
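For comparison, the same idea can be written without growing a data frame inside a loop. This is only a sketch: the seed value and the helper name draw_one are assumptions made for illustration and are not part of the answer above.

```r
set.seed(42)  # assumption: arbitrary seed, only so the draw is repeatable

people <- c("Person1", "Person2", "Person3")
# decks$A holds cards 1:4, decks$B holds 5:8, and so on
decks <- split(1:16, rep(LETTERS[1:4], each = 4))

# hypothetical helper: pick one deck at random, then shuffle its four cards
draw_one <- function(p) {
  deck <- sample(names(decks), size = 1)
  cards <- sample(decks[[deck]])
  data.frame(Person = p, Deck = deck,
             as.list(setNames(cards, paste0("Card", 1:4))),
             stringsAsFactors = FALSE)
}

# one row per person, one column per card position
assignment <- do.call(rbind, lapply(people, draw_one))
assignment
```

Because each person draws a deck independently, two people can end up with the same deck, which matches the example in the question (Person 1 and Person 2 both get set A).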
I am impressed by how efficient R code can be when it uses functions and loops.
I will give a simplified example of the question first, and then explain my actual problem (where the code is probably not reproducible).
Suppose I have several vectors that differ in content and length, like:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
And I have a function to create data.tables from the vector, like:
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = c("a","b","c"))
return(DT)
}
I am wondering whether there is a way to write a loop or function that creates all (or some) of the data.tables for a vector by running the function defined above.
(probably like)
for (i in seq_along(tables_vector)) {
create_dt(tables_vector, i)
}
And then combine the results in a list, same as the result if you run:
list(create_dt(tables_vector_1,1),create_dt(tables_vector_1,2),create_dt(tables_vector_1,3))
I have tried lapply(list(1:3),create_dt,tables_vector = tables_vector_1, i), but it fails, since I don't know how to specify the i argument correctly in lapply().
Here is why this problem arises:
In the real situation, I have created a function to import data.table from the database:
import_data <- function(tables_vector,i){
end <- Sys.time()
start <- end - 7200
con <- dbConnect("PostgreSQL", dbname="db", host = "host", user=db_user, password=db_password)
query <- sprintf("SELECT %s.timeutc, %s.scal AS %s FROM %s WHERE timeutc BETWEEN '%s' AND '%s' AND mode='General';",
tables_vector[i],tables_vector[i],tables_vector[i], tables_vector[i],start,end)
rs <- dbSendQuery(con, query)
df <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
return(as.data.table(df))
}
I have tens of vectors, which are defined by group (e.g. vector1 contains the channels for purpose 1, vector2 contains the channels for purpose 2).
Since they were created for different analysis purposes, I cannot simply combine them into one vector.
Moreover, some vectors contain 7 or 8 channels, so calling the function once per channel by hand is quite tedious.
How about something like this:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = letters[1:i])
return(DT)
}
make_list <- function(x){
lapply(seq_along(x), function(i)create_dt(x, i))
}
make_list(tables_vector_1)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
make_list(tables_vector_2)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
[[4]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
[[5]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
[[6]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
[[7]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
[[8]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
[[9]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
[[10]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Note, I changed the create_dt() function so it did not produce a warning, but the mechanics should still work as intended.
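If there are many such vectors, one option is to keep them in a named list and apply make_list() to each element, which gives one list of data.tables per group. This is a sketch only: the group names purpose1 and purpose2 are made up for illustration, and create_dt() is the modified version from this answer.

```r
library(data.table)

create_dt <- function(tables_vector, i) {
  data.table(id = 1:i, name = letters[1:i])
}

make_list <- function(x) {
  lapply(seq_along(x), function(i) create_dt(x, i))
}

# hypothetical grouping: one named entry per analysis purpose
groups <- list(purpose1 = c(1, 2, 3), purpose2 = 1:10)

# nested result: all_tables$purpose1[[2]] is the second table of the first group
all_tables <- lapply(groups, make_list)
lengths(all_tables)
```

Keeping the groups separate in the outer list avoids having to merge the vectors, while still removing the one-call-per-channel repetition.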
I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to update only the rows that match in the LHS data.table using [, as shown below with tables a2 (LHS) and a (RHS):
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to "Merge data frames whilst summing common columns in R", I saw how columns could be summed over matching rows:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
However, I am trying to retain the rows that don't match as well.
Alternatively, I could use merge as shown below, but I would like to avoid making a new table with 4 columns before reducing it to 2 columns.
c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This last table is what I would like to achieve. Thanks in advance.
merge is likely to not be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
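Applied to small tables like the ones in the question, the same approach can be sketched as follows. The tables are rebuilt from scratch here (so a2 has not been modified by the earlier := join), and naming the summed column c and using keyby are choices made for this example.

```r
library(data.table)

# fresh copies of the question's tables
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# stack both tables, then sum c within each value of b;
# keyby sorts the result by b and .(c = ...) keeps the column name
res <- rbindlist(list(a, a2))[, .(c = sum(c)), keyby = b]
res
```

Rows that appear in only one table (f to j and k to o) pass through with their original value, and rows that appear in both (a to e) are summed.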
I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') # names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD table's columns are ordered the same way as the names in d.vars.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') # names in different order than names(dt2)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to reorder 'd.vars' so that it follows the column order of 'dt' (stored as 'd.vars1'), and then use 'd.vars1' both for the new column names and for .SDcols so the two stay aligned:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
(L - shift(L, n = 1, type='lag')) * get(negate) ), # parenthesized as in the question: negate the difference
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.62414408 -3
# 3: A -0.7834941 7 -1 G 1.05373150 -3
# 4: A -1.2765652 10 -1 J 0.49307118 -3
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.04844252 -3
# 7: B -1.6577614 8 -1 H 1.41256181 -3
# 8: B 1.0668059 11 -1 K -2.72456737 -3
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F 0.67798333 -3
#11: C 0.2951743 9 -1 I -1.08925140 -3
#12: C -0.4508854 12 -1 L 0.74605969 -3
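Another way to sidestep the ordering question is to look each column up in .SD by name, so the result does not depend on the order in which .SD presents its columns. The sketch below is an assumption-laden variant, not the approach from the answer: it adds the neg column to .SDcols and reads it with .SD[[negate]] instead of get(), and the seed is arbitrary.

```r
library(data.table)

set.seed(1)  # assumption: arbitrary seed so the example is repeatable
dt2 <- data.table(col1 = rep(LETTERS[1:3], 4),
                  b = rnorm(12),
                  a = 1:12,
                  neg = -1,
                  c = LETTERS[1:12])

d.vars <- c('a', 'b')  # deliberately not in the order of names(dt2)
negate <- 'neg'

# iterate over the names in d.vars and index .SD by name,
# so the i-th result always belongs to the i-th new column name
dt2[, paste0('d.', d.vars) := lapply(d.vars, function(v) {
      x <- .SD[[v]]
      (x - shift(x, n = 1, type = 'lag')) * .SD[[negate]]
    }),
    by = col1, .SDcols = c(d.vars, negate)]
dt2
```

Here d.a is built from column a and d.b from column b regardless of where those columns sit in dt2.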
I want to add a field to a data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
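If you would rather stay in base R, a vectorized ifelse() gives the same column in one assignment. This is a sketch under the assumption that b is stored as character (stringsAsFactors = FALSE), so no as.character() conversion is needed.

```r
is.even <- function(x) x %% 2 == 0

df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"),
                 stringsAsFactors = FALSE)

# take b where a is even, otherwise a character NA
df$test <- ifelse(is.even(df$a), df$b, NA_character_)
df
```

The data.table form above has the advantage of evaluating the condition only once and updating by reference.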
I would like to remove data points above 97.5% and below 2.5%. I created the following minimal data set to explain the issue:
y <- data.table(a = rep(c("b","d"), each = 2, times = 3),
c = rep(c("e","f"), each = 3, times = 2),
seq(1,6)) # unnamed column, so data.table names it V3
I created the following script to accomplish the task:
require(data.table)
y[, trimErr := ifelse(y$V3 < quantile(y$V3, 0.95) & y$V3 > quantile(y$V3, 0.05),y$V3, NA),
by = list(a,c)]
I then got 4 warning messages. Here is the first one:
Warning messages:
1: In `[.data.table`(y, , `:=`(trimErr, ifelse(y$V3 < quantile(y$V3, :
RHS 1 is length 12 (greater than the size (3) of group 1). The last 9 element(s) will be discarded.
Can you please explain what the warning means and how I can modify my code?
Could you also suggest better code for removing the top and bottom 2.5% of the data? Thanks in advance.
You're grouping by a and c, but passing in a vector that is the length of the entire data.table, instead of just the data for each group.
You don't need the y$ inside the [.data.table call
y[, trimErr:=ifelse(V3 < quantile(V3, 0.95) & V3 > quantile(V3, 0.05),V3, NA),
by=list(a,c)]
y
# a c V3 trimErr
# 1: b e 1 NA
# 2: b e 2 2
# 3: d e 3 NA
# 4: d f 4 NA
# 5: b f 5 5
# 6: b f 6 NA
# 7: d e 1 NA
# 8: d e 2 2
# 9: b e 3 NA
#10: b f 4 NA
#11: d f 5 5
#12: d f 6 NA
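The question text asks for 2.5% and 97.5% cut-offs while the code uses 5% and 95%. A small helper makes the cut-offs explicit and keeps the per-group logic in one place. This is a sketch: the helper name trim and its lo/hi arguments are made up for illustration, and V3 is written out with rep() rather than relying on recycling.

```r
library(data.table)

y <- data.table(a = rep(c("b", "d"), each = 2, times = 3),
                c = rep(c("e", "f"), each = 3, times = 2),
                V3 = rep(1:6, times = 2))

# hypothetical helper: set values at or outside the lo/hi quantiles to NA
trim <- function(x, lo = 0.025, hi = 0.975) {
  q <- quantile(x, probs = c(lo, hi), na.rm = TRUE)
  replace(x, x <= q[1] | x >= q[2], NA)
}

# V3 here is the group's own values, so the lengths always match
y[, trimErr := trim(V3), by = .(a, c)]
y
```

With only three values per group, the smallest and largest value in each group fall outside the quantile range, so only the middle value survives, as in the output above.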