I'm looking to optimise the following problem [simplified version here]:
I have two data frames; the first contains the information.
user_id game_id score ON
      1       1   450  1
      1       2   200  1
      1       3   400  1
      2       1   225  1
      2       2   150  1
      2       3   200  1
The second contains the conditions.
game_id game_id_ref req_score type
      2           1       150    1
      3           1       200    1
      1           1       400    2
      3           2       175    1
The conditions should be evaluated on the information data frame in the following way.
Conditions with type == 1 are TURN ON conditions: a game can only TURN ON if the user's score on the game given by game_id_ref is >= req_score. The first row of the conditions table therefore reads: the game with game_id == 2 can only TURN ON for user X once they have a score of 150 or higher on the game with game_id == 1.
Conditions with type == 2 are TURN OFF conditions: a game must be TURNED OFF if the user's score on the game given by game_id_ref is >= req_score. The third row therefore reads: for user X, the game with game_id == 1 must be TURNED OFF once they have a score of 400 or higher on the game with game_id == 1.
The information data frame has a column ON which indicates whether a game is ON for a particular user. The default is 1 [the game is ON], but this is before evaluating the conditions. I am looking for the fastest way to evaluate the conditions for each user separately and return the same information data frame, but now with ON = 0 wherever a game fails a type 1 condition or meets a type 2 condition for that user.
So for this mock example, the required output would be:
user_id game_id score ON
      1       1   450  0
      1       2   200  1
      1       3   400  1
      2       1   225  1
      2       2   150  1
      2       3   200  0
My current solution has been to write a separate function that loops over all rows of the conditions table [approx. 100 conditions] and to apply it with group_map to the information data frame grouped by user_id [approx. 350,000 unique users]. While this works reasonably well [approx. 10 min], I would like to know if someone has a much faster solution.
Thanks!
You can probably fine-tune your solution to be a bit faster in R, but without seeing your code it is hard to say; your approach already sounds quite reasonable.
However, with this much data this kind of problem can often be solved faster with SQL. I assume you already use some data management system. SQL uses indexing to make JOINs very fast, which is hard to match in plain R (unless you write a database management system in R, which is not recommended). After you join your information and conditions data frames on the game_id column, checking all the conditions should be fast, and that check can also be done in SQL.
Sorry if this is not the expected answer. If you are not familiar with SQL and do not want to learn a new technology for a simple question like this, please post your code so far so we can see what could be improved.
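For concreteness, here is a minimal sketch of that join-then-check idea, written in R with dplyr purely for illustration (the same joins translate directly to SQL). The table names info and conditions, and every other detail, are assumptions based on the mock data above, not the asker's actual code:
library(dplyr)
# Assumed inputs: info (user_id, game_id, score, ON) and
# conditions (game_id, game_id_ref, req_score, type) as in the mock tables above.
checked <- conditions %>%
  # one row per (user, condition) for the game the condition applies to
  inner_join(info, by = "game_id") %>%
  # attach that user's score on the referenced game
  inner_join(info %>%
               select(user_id, game_id_ref = game_id, ref_score = score),
             by = c("user_id", "game_id_ref")) %>%
  # a condition is violated if a TURN ON requirement is unmet (type 1)
  # or a TURN OFF trigger is met (type 2)
  mutate(violated = (type == 1 & ref_score <  req_score) |
                    (type == 2 & ref_score >= req_score)) %>%
  group_by(user_id, game_id) %>%
  summarise(ON = as.integer(!any(violated)), .groups = "drop")
result <- info %>%
  select(-ON) %>%
  left_join(checked, by = c("user_id", "game_id")) %>%
  mutate(ON = coalesce(ON, 1L))   # games with no conditions stay ON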
I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) with several columns that I could generate a signal from; each value in these columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals from the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in the row equal 0. I could generate a new column that reads TRUE in that case and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think you could paste the columns together to get the unique combinations, then turn those into dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
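As an aside, if the dummies package is not available, the same indicator columns can be built with base R alone; the names combos and flags below are illustrative, and this is only a sketch on the same sample data:
combos <- sort(unique(data$sig_tot))
flags  <- sapply(combos, function(x) data$sig_tot == x)   # logical matrix, one column per combination
colnames(flags) <- paste0("data_", combos)
data <- cbind(data, flags)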
EDIT: There were lots of problems in my first example, so I am reworking it here. This is primarily to direct credit towards the original responder, who cut my process time by a factor of about 180 even with my poor example. This question was frozen for being unclear, or not general enough, but I think it has value: data.table can do amazing things with the right syntax, but that syntax can be elusive even with the available vignettes. From my own experience, having more examples of how data.table can be used is helpful. Particularly for those of us who got our start in Excel, the VLOOKUP-like behavior here fills a gap that is not always easy to find.
The specific things that happen in this example that may be of general interest are:
looking up values in one data.table in another data.table
passing variables by name and by reference
apply-like behavior in data.table
Original question with modified (limited rows) example:
I am looking for help in the arcane world of data.table, passing functions, and fast lookups across multiple tables. I have a larger function that, when I profile it, seems to spend all of its time in this one area doing some fairly straightforward lookup and sum actions. I am not adept enough at profiling to figure out exactly which subareas of the call are causing the problem, but my guess is that I am unintentionally doing something computationally expensive that I don't need to do. data.table syntax is still a complete mystery to me, so I am seeking help here to speed this process up.
Small worked example:
library(data.table)
set.seed(seed = 911)
##Other parts of the analysis generate all of these data.tables
#A data table containing id values (the real version has other things too)
whoamI<-data.table(id=1:5)
#The result of another calculation; it tells me how many neighbors I will be interested in
#the real version has many more columns in it.
howmanyneighbors<-data.table(id=1:5,toCount=round(runif(5,min=1,max=3),0))
#Who the first three neighbors are for each id
#real version has hundreds of neighbors
myneighborsare<-data.table(id=1:5,matrix(1:5,ncol=3,nrow=5,byrow = TRUE))
colnames(myneighborsare)<-c("id","N1","N2","N3")
#How many of each group live at each location?
groupPops<-data.table(id=1:5,matrix(floor(runif(25,min=0,max=10)),ncol=5,nrow=5))
colnames(groupPops)<-c("id","ape","bat","cat","dog","eel")
whoamI
howmanyneighbors
myneighborsare
groupPops
> whoamI
id
1: 1
2: 2
3: 3
4: 4
5: 5
> howmanyneighbors
id toCount
1: 1 2
2: 2 1
3: 3 3
4: 4 3
5: 5 2
> myneighborsare
id N1 N2 N3
1: 1 1 2 3
2: 2 4 5 1
3: 3 2 3 4
4: 4 5 1 2
5: 5 3 4 5
> groupPops
id ape bat cat dog eel
1: 1 9 8 6 8 1
2: 2 9 8 0 9 8
3: 3 6 1 9 1 2
4: 4 6 1 9 0 3
5: 5 6 2 2 2 5
##At any given time I will only want the group populations for some of the groups
#I will always want 'ape' but other groups will vary. Here I have picked two
#I retain this because passing the column names by variable along with the pass of 'ape' was tricky
#and I don't want to lose that syntax in any new answer
animals<-c("bat","eel")
i<-2 #similarly, howmanyneighbors has many more columns in it and I need to pass a reference to one of them which I call i here
##Functions I will call on the above data
#Get the ids of my neighbors from myneighborsare. The number of ids returned will vary based on value in howmanyneighbors
getIDs<-function(a){myneighborsare[id==a,2:(as.numeric(howmanyneighbors[id==a,..i])+1)]} #so many coding fails here it pains me to put this in public view
#Sum the populations of my neighbors for groups I am interested in.
sumVals<-function(b){colSums(groupPops[id%in%b,c("ape",..animals)])} #cringe
#Wrap the first two together and put them into a format that works well with being returned as a row in a data.table
doBoth<-function(a){
ro.ws<-getIDs(a)
su.ms<-sumVals(ro.ws)
answer<-lapply(split(su.ms,names(su.ms)),unname) #not too worried about this as it just mimics some things that happen in the original code at little time cost
return(answer)
}
#Run the above function on my data
result<-data.table(whoamI)
result[,doBoth(id),by=id]
id ape bat eel
1: 1 18 16 9
2: 2 6 1 3
3: 3 21 10 13
4: 4 24 18 14
5: 5 12 2 5
This involves a reshape and non-equi join.
library(data.table)
# reshape to long and add a grouping ID for a non-equi join later
molten_neighbors <- melt(myneighborsare, id.vars = 'id')[, grp_id := .GRP, by = variable]
#regular join by id
whoamI[howmanyneighbors,
on = .(id)
#non-equi join - replaces getIDs(a)
][molten_neighbors,
on = .(id, toCount >= grp_id),
nomatch = 0L
#regular join - next steps replace sumVals(ro.ws)
][groupPops[, c('id','ape', ..animals)],
on = .(value = id),
.(id, ape, bat, eel),
nomatch = 0L
][,
lapply(.SD, sum),
keyby = id
]
I highly recommend simplifying future questions. Using 10 rows allows you to post the tables within your question. As is, it was somewhat difficult to follow.
I am currently working with a dataset with 551 observations and 141 variables. As usual, there are some mistakes made by the data entry operators, and I am now screening and correcting those. The problem is that the ID numbers and the row numbers of the dataset do not correspond, and my checks only give me the row number where the problematic data lies, so it takes me extra time to look up the matching ID number. Is there any way to get the ID number of the problematic data within one command?
Suppose the row number for ID B345 is #1, and for ID B346 the row is #2.
My dataset looks like this:
ID S1 S2 S3 I30 I31 I34
B345 12 23 3 2 1 4
B346 15 4 4 3 2 4
I am using the following command on my original dataset and got the following results: row numbers 351 and 500, but their ID numbers are actually B456 and B643.
which(x$I30 == 0)
[1] 351 500
I am expecting to get the ID number within one command. That would be very helpful to me.
How about this?
x$ID[which(x$I30==0)]
We can just use the logical condition to subset the 'ID'
x$ID[x$I30 ==0]
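For illustration, with a tiny made-up frame in the shape described in the question (the values are hypothetical), both one-liners return IDs rather than row numbers:
x <- data.frame(ID  = c("B345", "B346", "B347"),
                I30 = c(2, 0, 3))
x$ID[which(x$I30 == 0)]   # "B346"
x$ID[x$I30 == 0]          # same result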
I am running scripts for a school project on a Hidden Markov Model with 2 hidden states. At some point, I use the Viterbi algorithm to find the most likely sequence of hidden states. My output is a vector like this:
c("1","1","1","2","2","1", "1","1","1","2", "2","2")
I would like to count how many subsequences of each state there are, and also record their lengths and starting positions. The output would be, for example, a matrix like this:
State Length Starting_Position
1 3 1
2 2 4
1 4 6
2 3 10
Is there any R command or package that can do this easily?
Thank you.
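One possibility is base R's rle(), sketched below on the example vector (the object name states is illustrative):
states <- c("1","1","1","2","2","1","1","1","1","2","2","2")
r <- rle(states)                                  # run lengths and values of consecutive states
runs <- data.frame(State             = r$values,
                   Length            = r$lengths,
                   Starting_Position = cumsum(c(1, head(r$lengths, -1))))
runs
#   State Length Starting_Position
# 1     1      3                 1
# 2     2      2                 4
# 3     1      4                 6
# 4     2      3                10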
I am a relatively new R user, and most complex code (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl), and I have only used R for very simple manipulations in the past (basic loading of data from a file, subsetting, ANOVA/t-tests). However, I am working on a project where I had no control over the data layout, and the data file is very lengthy.
In my data, I have 172 rows, one per survey participant, and 158 columns, each of which represents a question number. The answers for each question are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any unanswered questions for a participant without excluding that entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
This works fine when all my answers are contained in one column: it simply deletes the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire participant.
I'd like to know if there is a way to filter/subset the data so that my large data set ends up with 'blanks' wherever a "99" occurred, so that these 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible in R? I've tried filtering the data before loading it into R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is better code or a package to use).
Any assistance would be greatly appreciated!
You could replace the 99 values with NA and then calculate the column means, omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))
colMeans(df)           # wrong: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA
colMeans(dfc, na.rm = TRUE)
You can also indicate which values are NAs when you read in your data. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")
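As a hedged follow-up, assuming the file also contains the Part/Q001/... header row shown in the question ('dat_base' is the placeholder file name from the answer above), the per-question means can then be computed directly:
mydata <- read.table("dat_base", header = TRUE, na.strings = "99")
colMeans(mydata[, -1], na.rm = TRUE)   # per-question means, ignoring unanswered items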