This is my first post on StackOverflow. I am relatively new to programming and am trying to work with data.table in R because of its reputation for speed.
I have a very large data.table, named "Actions", with 5 columns and potentially several million rows. The column names are k1, k2, i, l1 and l2. I have another data.table, with the unique values of Actions in columns k1 and k2, named "States".
For every row in Actions, I would like to find the index of the row(s) in States whose k1 and k2 match columns 4 and 5 (l1 and l2). A reproducible example is as follows:
library(data.table)
S.disc <- c(2000,2000)
S.max <- c(6200,2300)
S.min <- c(700,100)
Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])
So if I were using a data.frame, the line would look like this:
index <- apply(Actions,1,function(x) {which((States[,1]==x[4]) & (States[,2]==x[5]))})
How can I do the same with data.table efficiently?
This is relatively simple once you get the hang of keys and the special symbols which may be used in the j expression of a data.table. Try this...
# First make an ID for each row for use in the `dcast`
# because you are going to have multiple rows with the
# same key values and you need to know where they came from
Actions[ , ID := 1:.N ]
# Set the keys to join on
setkeyv( Actions , c("l1" , "l2" ) )
setkeyv( States , c("k1" , "k2" ) )
# Join States to Actions, using '.I', which
# is the row locations in States in which the
# key of Actions are found and within each
# group the row number ( 1:.N - a repeating 1,2,3)
New <- States[ J(Actions) , list( ID , Ind = .I , Row = 1:.N ) ]
# k1 k2 ID Ind Row
#1: 700 100 1 1 1
#2: 700 100 1 2 2
#3: 700 100 1 3 3
#4: 700 100 2 1 1
#5: 700 100 2 2 2
#6: 700 100 2 3 3
# reshape using 'dcast.data.table'
dcast.data.table( Row ~ ID , data = New , value.var = "Ind" )
# Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27...
#1: 1 1 1 1 4 4 4 7 7 7 10 10 10 13 13 13 16 16 16 1 1 1 4 4 4 7 7 7...
#2: 2 2 2 2 5 5 5 8 8 8 11 11 11 14 14 14 17 17 17 2 2 2 5 5 5 8 8 8...
#3: 3 3 3 3 6 6 6 9 9 9 12 12 12 15 15 15 18 18 18 3 3 3 6 6 6 9 9 9...
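For reference, the lookup can also be written as a join that returns row numbers directly. The following is a minimal sketch, not the method used above: it assumes a version of data.table that supports the `on=` argument, and it uses a small hypothetical States table in which every (k1, k2) pair is unique, so each row of Actions matches exactly one row of States.

```r
library(data.table)

# Hypothetical toy data: one row per (k1, k2) state
States  <- data.table(k1 = c(700, 700, 2700), k2 = c(100, 2100, 100))
Actions <- data.table(l1 = c(2700, 700, 700), l2 = c(100, 100, 2100))

# which = TRUE returns, for each row of Actions, the matching row number in States
idx <- States[Actions, on = .(k1 = l1, k2 = l2), which = TRUE]

# Store the index next to the rows it belongs to
Actions[, index := idx]
print(Actions)
```

If States holds several rows per (k1, k2), as in the question where i also varies, the join returns more than one row number per Actions row, and the result still has to be reshaped along the lines of the answer above.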
I am trying to add additional data from a reference table to my primary data frame. I have seen that similar questions have been asked, but I can't find anything for my specific case.
An example of my data frame is set up like this:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is join these two data frames so that each row of df picks up the lookup row whose interval contains its time. That is, the columns var1, var2 and var3 should be added to df wherever time lies between start.time and end.time (inclusive).
For example, the first lookup row has a start time of 1 and an end time of 3, so for times 1, 2 and 3, for each participant, the first row's data should be added.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
library(sqldf)
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)
output <- df %>%
  # Create an id for the join
  mutate(merge_id=1) %>%
  # Use full join to create all the combinations between the two datasets
  full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
  # Keep only the rows that we want
  filter(time >= start.time, time <= end.time) %>%
  # Select the relevant variables
  select(participant,time,var1:var3) %>%
  # Right join with initial dataset to get the missing rows
  right_join(df, by = c("participant","time")) %>%
  # Sort to match the formatting asked by OP
  arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
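If the full set of combinations is too large to hold in memory, one alternative is to loop over the (usually few) lookup rows instead of materialising every combination. This is a minimal base R sketch, assuming the lookup intervals do not overlap; `idx` and `inside` are helper names introduced only for this illustration.

```r
df <- data.frame(participant = rep(1:3, 9), time = rep(1:9, each = 3))
lookup <- data.frame(start.time = c(1, 5, 8), end.time = c(3, 6, 10),
                     var1 = c("A", "B", "A"), var2 = c(8, 12, 3),
                     var3 = c("fast", "fast", "slow"))

# idx[r] will hold the lookup row whose interval contains df$time[r], or NA
idx <- rep(NA_integer_, nrow(df))
for (k in seq_len(nrow(lookup))) {
  inside <- df$time >= lookup$start.time[k] & df$time <= lookup$end.time[k]
  idx[inside] <- k
}

# Indexing with NA yields an all-NA row, so unmatched times get NA values
output <- cbind(df, lookup[idx, c("var1", "var2", "var3")])
rownames(output) <- NULL
print(output)
```

Each pass compares every row of df with one lookup row, but only nrow(df) rows are ever stored.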
Using tidyverse and creating an auxiliary table:
library(tidyverse)
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")
I have the following dataframe containing a variable "group" and a variable "number of elements per group"
group elements
1 3
2 1
3 14
4 10
.. ..
.. ..
30 5
Then I have a set of numbers going from 1 to (let's say) 30.
Summing "elements" gives 900. What I want is to randomly select numbers from 1 to 30 and assign them to each group until I have filled the number of elements for that group. Each number should appear 30 times in total.
Thus, for group 1, I want to randomly select 3 numbers from 1 to 30;
for group 2, 1 number from 1 to 30, and so on, until all of the groups are filled.
The final table should look like this:
group number(randomly selected)
1 7
1 20
1 7
2 4
3 21
3 20
...
any suggestions on how I can achieve this?
In base R, if you have df like this...
df
group elements
1 3
2 1
3 14
Then you can do this...
# lapply (rather than sapply) so that equal-sized groups are not simplified to a matrix
data.frame(group = rep(df$group,                     #repeat group no...
                       df$elements),                 #elements times
           number = unlist(lapply(df$elements,       #for each elements...
                                  sample.int,        #...sample <elements> numbers
                                  n = 30,            #from 1 to 30
                                  replace = FALSE))) #without duplicates
group number
1 1 19
2 1 15
3 1 28
4 2 15
5 3 20
6 3 18
7 3 27
8 3 10
9 3 23
10 3 12
11 3 25
12 3 11
13 3 14
14 3 13
15 3 16
16 3 26
17 3 22
18 3 7
Give this a try:
df <- read.table(text = "group elements
1 3
2 1
3 14
4 10
30 5", header = TRUE)
# reproducibility
set.seed(1)
df_split2 <- do.call("rbind",
(lapply(split(df, df$group),
function(m) cbind(m,
`number(randomly selected)` =
sample(1:30, replace = TRUE,
size = m$elements),
row.names = NULL
))))
# remove element column name
df_split2$elements <- NULL
head(df_split2)
#> group number(randomly selected)
#> 1.1 1 25
#> 1.2 1 4
#> 1.3 1 7
#> 2 2 1
#> 3.1 3 2
#> 3.2 3 29
The split function splits the df into chunks based on the group column. We then take those smaller data frames and add a column to each by sampling from 1:30 a total of elements times. Finally, we use do.call on this list to rbind the pieces back together.
You have to generate a new data frame that repeats each group elements times, and then use sample to generate exactly that many random numbers:
data<-data.frame(group=c(1,2,3,4,5),
elements=c(2,5,2,1,3))
# replace = TRUE is needed whenever sum(data$elements) exceeds 30
data.elements<-data.frame(group=rep(data$group,data$elements),
number=sample(1:30,sum(data$elements),replace=TRUE))
The result:
group number
1 1 9
2 1 4
3 2 29
4 2 28
5 2 18
6 2 7
7 2 25
8 3 17
9 3 22
10 4 5
11 5 3
12 5 8
13 5 26
I solved it as follows:
random_sample <- rep(1:30, each=30)
random_sample <- sample(random_sample)
Then I create a data frame with this variable and a second variable that holds the group, with one row per element, so each group is repeated as many times as it has elements.
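A minimal sketch of that last step, scaled down so it is easy to check: the group table below is hypothetical, with 10 elements in total, and the pool holds the numbers 1 to 5 twice each (in the question this would be 900 elements and `rep(1:30, each = 30)`). The pool size must equal the sum of elements.

```r
set.seed(1)  # for reproducibility

# Hypothetical group table; elements sum to 10
df <- data.frame(group = 1:4, elements = c(3, 1, 4, 2))

# Every number 1..5 appears exactly twice, in random order
random_sample <- sample(rep(1:5, each = 2))
stopifnot(length(random_sample) == sum(df$elements))

# One row per element: each group repeated 'elements' times, paired with the pool
result <- data.frame(group = rep(df$group, df$elements),
                     number = random_sample)
print(result)
```

Because the pool is shuffled as a whole, the same number can occur more than once within a group.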
Hi, I have a data frame with multiple columns, i.e. the first 5 columns are my metadata and the remaining columns (their count will always be even) are the actual columns that need to be calculated.
Formula: (col6*col9) + (col7*col10) + (col8*col11)
country<-c("US","US","US","US")
name <-c("A","B","c","d")
dob<-c(2017,2018,2018,2010)
day<-c(1,4,7,9)
hour<-c(10,11,2,4)
a <-c(1,3,4,5)
d<-c(1,9,4,0)
e<-c(8,1,0,7)
f<-c(10,2,5,6)
j<-c(1,4,2,7)
m<-c(1,5,7,1)
df=data.frame(country,name,dob,day,hour,a,d,e,f,j,m)
How can I get the final summation if I have more columns?
I have tried the code below:
df$final <-(df$a*df$f)+(df$d*df$j)+(df$e*df$m)
Here is one way to generalize the computation:
x <- ncol(df) - 5
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
# country name dob day hour a d e f j m final
# 1 US A 2017 1 10 1 1 8 10 1 1 19
# 2 US B 2018 4 11 3 9 1 2 4 5 47
# 3 US c 2018 7 2 4 4 0 5 2 7 28
# 4 US d 2010 9 4 5 0 7 6 7 1 37
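The same index arithmetic can be wrapped in a small helper so that the number of metadata columns is explicit. This is only a sketch: `paired_row_sum` and `n_meta` are names made up here, and it assumes the value columns come in two equal halves, with the k-th column of the first half multiplied by the k-th column of the second.

```r
paired_row_sum <- function(d, n_meta = 5) {
  n_val <- ncol(d) - n_meta
  stopifnot(n_val > 0, n_val %% 2 == 0)   # value columns must pair up
  half <- n_val / 2
  first  <- d[, n_meta + seq_len(half), drop = FALSE]
  second <- d[, n_meta + half + seq_len(half), drop = FALSE]
  # Element-wise product of the two halves, summed across each row
  rowSums(first * second)
}

df <- data.frame(country = "US", name = c("A", "B", "c", "d"),
                 dob = c(2017, 2018, 2018, 2010), day = c(1, 4, 7, 9),
                 hour = c(10, 11, 2, 4),
                 a = c(1, 3, 4, 5), d = c(1, 9, 4, 0), e = c(8, 1, 0, 7),
                 f = c(10, 2, 5, 6), j = c(1, 4, 2, 7), m = c(1, 5, 7, 1))
df$final <- paired_row_sum(df)
print(df)
```

With eight value columns, for example, the helper would pair columns 6 to 9 with columns 10 to 13 without any change to the code.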
I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a and summing the columns that share a name, so that at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the columns.
My data frame is pretty big, so a solution that also speeds things up would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
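The string-built query can probably be avoided by stacking the two tables and summing by ID. This is a minimal sketch of that alternative (not the approach above), assuming all non-ID columns are numeric and have the same names in both tables. The tables are called `dt_a` and `dt_b` here, rather than a and b, so that they cannot be confused with the columns of the same name; `stacked` is also a name introduced for this illustration.

```r
library(data.table)

dt_a <- data.table(ID = c("A", "B", "Z", "H"), a = c(0, 1, 2, 45),
                   b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
dt_b <- data.table(ID = c("A", "B", "E", "W", "Z", "H"), a = c(9, 10, 11, 39, 5, 0),
                   b = c(4, 2, 7, 54, 12, 34), c = c(12, 0, 34, 23, 13, 14))

# Stack dt_a on top of the rows of dt_b whose ID also occurs in dt_a
stacked <- rbindlist(list(dt_a, dt_b[ID %in% dt_a$ID]))

# Sum every non-ID column within each ID; groups keep their order of first appearance
z <- stacked[, lapply(.SD, sum), by = ID]
print(z)
```

Rows of dt_a that have no partner in dt_b are kept with their own values, which matches the requirement of keeping every row of a.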
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row-indices in both data frames). Then, simply use vectorized additions for those matching rows, but omit ID, as factor cannot be summed up; add ID back manually.
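Note that match_a follows the row order of b while match_b follows the row order of a, so the two index vectors only line up when the shared IDs occur in the same order in both data frames. The following sketch is a variant that does not depend on row order, using a single match call (`pos` and `keep` are names introduced here):

```r
a <- data.frame(ID = c("A", "B", "Z", "H"), a = c(0, 1, 2, 45),
                b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"), a = c(9, 10, 11, 39, 5, 0),
                b = c(4, 2, 7, 54, 12, 34), c = c(12, 0, 34, 23, 13, 14))

# pos[r] is the row of b holding the same ID as row r of a (NA if absent)
pos  <- match(a$ID, b$ID)
keep <- !is.na(pos)

# Add the numeric columns of matching rows; both sides are now in a's order
z <- cbind(ID = a$ID[keep], a[keep, -1] + b[pos[keep], -1])
print(z)
```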
You cannot add the two data frames directly because they are of unequal size. To make them the same size, keep only the rows of b whose ID is also present in a, and then add them element-wise (this assumes the shared IDs appear in the same order in both data frames, as they do here).
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
I have a table that I routinely compute with R that has three dimensions. I would like to add some summary tables among the (here 5) marginal tables. What I usually do is something like:
A=sample(LETTERS[1:5],100, rep=T)
b=sample(letters[1:2],100, rep=T)
numbers=sample(1:3,100, rep=T)
( tab=table(A,b,numbers) )
( tab1=ftable(addmargins(tab)) )
I would like to add the sum of the first few marginal tables and then the sum of the remaining tables at the bottom, then the grand total. I can imagine that in the resulting ftable I would want the As,Bs,Cs, then the sum of those three, then the Ds, Es, and the sum of those two, then the sum of all of the tables, like:
numbers 1 2 3 Sum
A b
A a 1 5 0 6
b 4 2 2 8
Sum 5 7 2 14
B a 2 6 6 14
b 5 4 5 14
Sum 7 10 11 28
C a 3 2 5 10
b 1 2 2 5
Sum 4 4 7 15
sumac a 6 13 11 30 #### how do i add these three lines
b ....
sum ....
D a 2 1 1 4
b 4 5 3 12
Sum 6 6 4 16
E a 5 3 4 12
b 4 3 8 15
Sum 9 6 12 27
sumde a 7 4 5 16 #### and these
b ....
sum ....
sumae a 13 17 16 46
b 18 16 20 54
Sum 31 33 36 100
As usual I think the solution is prolly many fewer lines than the question. Thanks
Seth Latimer
You could do something like this:
isABC<-ifelse(A %in% c("A", "B", "C"), "ABC", "DE")
( tab=table(isABC,A,b,numbers) )
( tab1=ftable(addmargins(tab)) )
Now you have a larger table which holds even more rows than you want, but those should be easy to drop...
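If only the extra summary blocks are needed, a related option is to tabulate by the grouping variable alone instead of nesting A inside it. This is a minimal sketch, assuming the labels `sumac` and `sumde` from the question; the `Sum` level that addmargins appends then plays the role of `sumae`. `grp` and `group_tab` are names introduced here.

```r
set.seed(1)
A <- sample(LETTERS[1:5], 100, replace = TRUE)
b <- sample(letters[1:2], 100, replace = TRUE)
numbers <- sample(1:3, 100, replace = TRUE)

# Collapse the five letters into the two groups wanted in the question
grp <- ifelse(A %in% c("A", "B", "C"), "sumac", "sumde")

# One block per group plus a "Sum" block, each with its own row and column totals
group_tab <- addmargins(table(grp, b, numbers))
print(ftable(group_tab))
```

These blocks would still have to be placed between the per-letter blocks by hand, for example when the table is written out.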