I am trying to add a column to my Access 2010 database that provides a version number based on the values of other rows. For example, if I had a column that was:
A
A
B
A
C
B
I would like a calculated column that returns:
1
2
1
3
1
2
How can I achieve this functionality?
Relational databases are unordered. In other words, there is no concept of which row "comes before" any other row. So you cannot set any value based on the order of entries.
I'm looking to optimise the following problem [simplified version here]:
I have two data frames; the first contains the information:
user_id  game_id  score  ON
1        1        450    1
1        2        200    1
1        3        400    1
2        1        225    1
2        2        150    1
2        3        200    1
The second contains the conditions:

game_id  game_id_ref  req_score  type
2        1            150        1
3        1            200        1
1        1            400        2
3        2            175        1
The conditions should be evaluated on the information data frame in the following way.
The conditions with type == 1 describe TURN ON conditions: a game can only TURN ON if the user's score on the game given by game_id_ref is >= req_score. So the first row of the conditions should be read as: the game with game_id == 2 can only TURN ON for user X when they have a score of 150 or higher on the game with game_id == 1.
The conditions with type == 2 describe TURN OFF conditions: a game must be TURNED OFF if the user's score on the game given by game_id_ref is >= req_score. So the third row of the conditions should be read as: for user X, the game with game_id == 1 must be TURNED OFF when they have a score of 400 or higher on the game with game_id == 1.
In the information data frame I have a column ON which indicates whether a game is ON for a particular user. The default is 1 [the game is ON], but this is before evaluating the conditions. I am looking for the fastest way to evaluate the conditions for each user separately and return the same information data frame, but now with ON = 0 wherever a game fails to meet a type 1 criterion or meets a type 2 criterion.
So for this mock example, the required output would be:
user_id  game_id  score  ON
1        1        450    0
1        2        200    1
1        3        400    1
2        1        225    1
2        2        150    1
2        3        200    0
My current solution has been to create a separate function in which I check this by applying a for loop over all the rows of the conditions table [approx. 100 conditions], and using this function in a group_map call on the information data frame grouped by user_id [approx. 350,000 unique users]. While this works reasonably well [approx. 10 min], I would like to know if someone has a much faster solution.
Thanks!
You can probably fine-tune your solution to be a bit faster in R, but without seeing your code it is hard to say. Your solution already sounds quite reasonable to me.
However, with this much data, this kind of problem can be solved faster with SQL. I assume you already use some database management system. SQL uses indexing to make a JOIN very fast, which you can never achieve in R (unless you write a database management system in R, which is not recommended). After you join your information and conditions data frames on the game_id column, you can check all the conditions, which should be fast. That check can also be done in SQL, by the way.
Sorry if this is not the answer you expected. If you are not familiar with SQL and feel there is no way you want to learn a new technology for a simple question like this, please provide your code so far so we can see what could be improved.
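If staying in R, a minimal sketch of the same join-then-check idea using dplyr might look like the following. It is built around the mock tables above (info, cond, checks and the off flag are names chosen here, not from the question), and it assumes every referenced game has a score row per user:

library(dplyr)

# mock data from the question
info <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 2),
  game_id = c(1, 2, 3, 1, 2, 3),
  score   = c(450, 200, 400, 225, 150, 200),
  ON      = 1
)
cond <- data.frame(
  game_id     = c(2, 3, 1, 3),
  game_id_ref = c(1, 1, 1, 2),
  req_score   = c(150, 200, 400, 175),
  type        = c(1, 1, 2, 1)
)

# attach each user's score on the referenced game to every condition
checks <- cond %>%
  inner_join(info, by = c("game_id_ref" = "game_id")) %>%
  # a type 1 condition fails when the referenced score is below req_score;
  # a type 2 condition fires when the referenced score reaches req_score
  mutate(off = (type == 1 & score < req_score) |
               (type == 2 & score >= req_score)) %>%
  group_by(user_id, game_id) %>%
  summarise(off = any(off), .groups = "drop")

# a game stays ON unless some condition switched it off;
# games with no conditions keep their default of 1
result <- info %>%
  left_join(checks, by = c("user_id", "game_id")) %>%
  mutate(ON = ifelse(coalesce(off, FALSE), 0, 1)) %>%
  select(-off)

Everything here is vectorised, so it avoids the per-user group_map loop entirely.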
I need to update all the values of a column, using another df as reference.
The two data frames have the same structure:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searches.
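For what it's worth, a minimal sketch of one way to do this kind of lookup with match(), assuming df1 is the 92-row reference table and df2 is the table being updated (names as in the question):

# match() returns, for each dom_by value in df2, the row position of
# that value in df1$cod, so the replacement has exactly nrow(df2) values
df2$name <- df1$name[match(df2$dom_by, df1$cod)]

This avoids the length mismatch: the element-wise == in the question compares two vectors of different lengths (92 vs. 2) and recycles the shorter one.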
I have a strange data set that has claim numbers in one column and a "companion" number in another column. The companion number is either equal to a different claim number that occurred in the same event as the claim in that row, or blank, meaning that the claim number is its own companion number.
All claim numbers are unique in the entire data set.
At most this goes up to 3 claims per event. I need to create a unique identifier column that will group these claims as one unique event. The majority are 1-claim events, but there is a significant number of 2-claims-per-event cases and some 3-claims-per-event cases.
Example with 2 claims per event:
claim_num companion_num
A B
B A
Or
claim_num companion_num
A B
B B
Example with 3 claims per event:
claim_num companion_num
A B
B
C A
The 3-claims-per-event scenario is particularly tricky because there are many possible combinations that can occur. In this example, claim number B is the 'original' because all paths can be traced back to claim B.
I need something that looks like the following and will work for both 2-claim and 3-claim events:
claim_num companion_num ID
A B 1
B 1
C A 1
Or
claim_num companion_num ID
A B 1
B B 1
I've tried many times in Excel but I can't figure out how to do this. I know some R, so I am hoping for some guidance here. I've gotten to the point where I can fill in any blank companion number with its claim number, but that is it.
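A minimal sketch of one way to build the ID in R: treat claims and companions as nodes of a graph and number its connected components, here with the igraph package (the data frame and column names follow the question; the blank-filling step is the one already described above):

library(igraph)

# the 3-claims-per-event example from the question
claims <- data.frame(
  claim_num     = c("A", "B", "C"),
  companion_num = c("B", "",  "A"),
  stringsAsFactors = FALSE
)

# a blank companion means the claim is its own companion
blank <- claims$companion_num == ""
claims$companion_num[blank] <- claims$claim_num[blank]

# each claim-companion pair is an edge; claims that can be chained
# together land in the same connected component, i.e. the same event
g <- graph_from_data_frame(claims[, c("claim_num", "companion_num")])
claims$ID <- unname(components(g)$membership[claims$claim_num])

This handles any chain length, so it covers both the 2-claim and 3-claim cases without enumerating the possible combinations.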
I am trying to exclude rows of a subset which contain an NA in a particular column that I choose. I have a CSV spreadsheet of survey data with this kind of organization, for instance:
name idnum term type q2 q3
bob 0321 1 2 0 .
. . 3 1 5 3
ron . 2 4 2 1
. 2561 4 3 4 2
When I was creating my R workspace, I set it up such that data <- read.csv(..., na.strings='.'). For the purposes of my analysis, I then created subsets by term and type, like set13 <- subset(data, term==1 & type==2), for example. When I tried to conduct t-tests, I noticed that the function threw out any row containing an NA, effectively cutting my sample size in half.
For my analysis, I want to exclude responses that are missing survey items, such as Bob from my example, who is missing question 3. But I still want to include rows that have one or more NAs in the name or idnum columns. So, in essence, I want to pick, column by column, which NAs are omitted. (Keep in mind, this is just an example - my actual CSV has about 1000 rows, so each subset may contain 100-150 rows.)
I know this can be done using data frames, but I'm not sure how to incorporate that into my given subset format. Is there a way to do this?
Check out complete.cases as shown in the answer to this SO post.
data[complete.cases(data[,3:6]),]
This will return all rows with complete information in columns 3 through 6.
Another approach.
data[rowSums(is.na(data[,3:6]))==0,]
Another option is
data[!Reduce(`|`, lapply(data[3:6], is.na)),]
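Applied to the subsetting workflow in the question, the first approach might look like this (assuming the survey items to protect are q2 and q3, i.e. columns 5 and 6; NAs in name or idnum are left alone):

# drop only rows with missing survey items, then subset as before
set13 <- subset(data[complete.cases(data[, 5:6]), ], term == 1 & type == 2)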
I have a data frame with 2 columns that contain two different types of text.
The first column contains codes that are strings in the form of DD-HI-HO (DD being the code).
Column 2 is free text which anyone can insert.
I am trying to populate a third column, based on three statements which use the logic below, to give a single column of 1s and 0s.
I don't seem to be able to update a single column to incorporate all three rules. Below is pseudocode.
Basic info:
Codes is a vector (basically a reference table with one column)
Fuzzy is a vector (basically another reference table with one column)
#----CHECK SEQUENCES----
library(RecordLinkage)  # provides levenshteinSim()
# Rule 1: check if one of the codes appears in column 1
Data$Has.Code <- grepl("(HC|HD|HE|HK|HM|HH|HY|HL)", Data$Col1)
# Rule 2: check, row by row, whether column 2 contains any of the
# reference codes (intersect() compares whole vectors, not rows)
Data$Has.DG <- sapply(Data$Col2, function(x) any(sapply(Codes, grepl, x = x)))
# Rule 3: take the highest similarity against the Fuzzy reference
# vector; if it is over 45%, set the flag
Data$Has.Fuzzy <- sapply(Data$Col2, function(x) max(levenshteinSim(x, Fuzzy)) > 0.45)
Added table with sample data:

Col1, Col2, Col3
1. HC-IE, Ice-cream, 1
2. IE-GB, Volvo, 0
3. IE-DE, Iced_Lollipop, 1
Record 1: rule 1 would catch "HC" in Col1 and so set Col3 to 1 (boolean). Rule 2 would also catch something in Col2 for record 1, as the vector Codes contains "Ice" as an element, but it wouldn't execute in any case because rule 1 supersedes it.
Record 2: none of the rules would return anything for the second item, so Col3 is set to 0.
Record 3: a bit of a daft example, but the Levenshtein similarity computes a 75% match between Col2 and one of the elements in the vector Fuzzy. This is above our stated threshold, so Col3 is set to 1.
Can anyone help? Thank you for your help.
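As a minimal sketch of the missing combination step, assuming the three flag columns computed above (Has.Code, Has.DG and Has.Fuzzy are the names used in the pseudocode): since any rule firing justifies a 1, a logical OR folds them into the single 1/0 column:

# rule 1 "superseding" rule 2 only matters for short-circuiting the
# evaluation; the final value is the same either way
Data$Col3 <- as.integer(Data$Has.Code | Data$Has.DG | Data$Has.Fuzzy)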