I'm using TraMineR and I'm trying to import a dataset and convert it from a SPELL format to STS format.
That's an example of my dataset (for the sake of simplicity I used numeric values instead of dates).
Alphabet=[a,b]
days=[1,2,3,4,5....]
id | start | end | values |
1 | 1 | 5 | a |
1 | 6 | 12 | a |
1 | 10 | 15 | b |
2 | 2 | 8 | b |
2 | 7 | 10 | a |
Defining the sequences in STS format, I'll have the following
id day1 day2 .........day9 day10 day11 day12 day13 day14.......
1 a a ......... a a a a b b .......
2 ........and so on
The problem is that when I have concomitant (overlapping) states, the later state starts only when the earlier one ends, as happens in my example between the second and third spells for id 1.
How can I split the states?
That is, b should start when state a finishes, but only if the overlap is less than n days.
Or could I define another state for when a and b overlap for more than n days?
For example:
id day1 day2 .........day9 day10 day11 day12 day13 day14.......
1 a a ......... a ab ab ab b b .......
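The two behaviours asked about above can be sketched outside TraMineR. This is a minimal, library-independent illustration in Python, not TraMineR's own conversion: the function name `spell_to_sts` and the `min_overlap` parameter are hypothetical, days are assumed to be 1-based with inclusive spell ranges, and spells are assumed to be listed in start order within each id. An overlap lasting at least `min_overlap` days gets a combined label such as "ab"; a shorter overlap keeps the earlier state until it ends.

```python
from itertools import groupby


def spell_to_sts(spells, last_day, min_overlap):
    """Expand (id, start, end, state) spells into one label per day.

    Assumes start >= 1, inclusive ranges, and spells listed in start
    order within each id (all assumptions of this sketch).
    """
    # Collect, for every id and day, the states active on that day.
    per_day = {}
    for pid, start, end, state in spells:
        days = per_day.setdefault(pid, [[] for _ in range(last_day)])
        for day in range(start, min(end, last_day) + 1):
            if state not in days[day - 1]:
                days[day - 1].append(state)

    sts = {}
    for pid, days in per_day.items():
        row = []
        # Consecutive days with the same set of active states form one run.
        for states, run in groupby(days):
            run_len = len(list(run))
            if not states:
                label = None  # no spell covers these days
            elif len(states) > 1 and run_len >= min_overlap:
                label = "".join(sorted(states))  # long overlap: combined state
            else:
                label = states[0]  # short overlap: earlier-listed state wins
            row.extend([label] * run_len)
        sts[pid] = row
    return sts


example_spells = [
    (1, 1, 5, "a"), (1, 6, 12, "a"), (1, 10, 15, "b"),
    (2, 2, 8, "b"), (2, 7, 10, "a"),
]
combined = spell_to_sts(example_spells, last_day=15, min_overlap=3)
split = spell_to_sts(example_spells, last_day=15, min_overlap=4)
```

With `min_overlap=3` the three overlapping days 10 to 12 of id 1 become "ab"; with `min_overlap=4` they stay "a" and b starts on day 13, matching the first STS table above.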
I am trying to determine repeat IDs based on date and an initial event. Below is a sample data set
+----+------------+-------------------------+
| ID | Date | Investigation or Intake |
+----+------------+-------------------------+
| 1 | 1/1/2019 | Investigation |
| 2 | 1/2/2019 | Investigation |
| 3 | 1/3/2019 | Investigation |
| 4 | 1/4/2019 | Investigation |
| 1 | 1/2/2019 | Intake |
| 2 | 12/31/2018 | Intake |
| 3 | 1/5/2019 | Intake |
+----+------------+-------------------------+
I want to write R code that goes through IDs 1 to 4 (the IDs that have investigations) and checks whether each has a subsequent intake (an intake dated later than the investigation). The expected output looks like this:
+----+------------+-------------------------+------------+
| ID | Date | Investigation or Intake | New Column |
+----+------------+-------------------------+------------+
| 1 | 1/1/2019 | Investigation | Sub Intake |
| 2 | 1/2/2019 | Investigation | None |
| 3 | 1/3/2019 | Investigation | Sub Intake |
| 4 | 1/4/2019 | Investigation | None |
| 1 | 1/2/2019 | Intake | |
| 2 | 12/31/2018 | Intake | |
| 3 | 1/5/2019 | Intake | |
+----+------------+-------------------------+------------+
What will the code look like to solve this? I am guessing it will be some loop function?
Thanks!
You can do this with the dplyr package, using ifelse statements to create the new column.
Instead of looping, check the next entry in each group with the lead function.
This solution assumes that each group has one "Investigation" row followed by zero or more "Intake" rows.
library(dplyr)
df <- data.frame(ID = c(1, 2, 3, 4, 1, 2, 3),
                 Date = as.Date(c("2019-01-01", "2019-01-02", "2019-1-03", "2019-01-04", "2019-01-02", "2018-12-31", "2019-1-5")),
                 Investigation_or_Intake = c("Investigation", "Investigation", "Investigation", "Investigation", "Intake", "Intake", "Intake"),
                 stringsAsFactors = FALSE)
df %>%
  group_by(ID) %>% # Make groups according to ID column
  mutate(newcol = ifelse(lead(Date) > Date, "Sub Intake", "None"), # Check next entry in the group to see if Date is after current
         newcol = ifelse(Investigation_or_Intake == "Investigation" & is.na(newcol), "None", newcol)) # Change "Investigation" entries with no Intake to "None"
This gives
ID Date Investigation_or_Intake newcol
<dbl> <date> <chr> <chr>
1 1 2019-01-01 Investigation Sub Intake
2 2 2019-01-02 Investigation None
3 3 2019-01-03 Investigation Sub Intake
4 4 2019-01-04 Investigation None
5 1 2019-01-02 Intake NA
6 2 2018-12-31 Intake NA
7 3 2019-01-05 Intake NA
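For comparison, the same check can be written without relying on row order inside each group. The following is a small Python sketch of the logic only, not a replacement for the dplyr answer: the function name `flag_subsequent_intakes` and the record keys `id`, `date`, and `kind` are made up for illustration. Each investigation is compared against every intake with the same ID, so several intakes per ID are handled too.

```python
from datetime import date


def flag_subsequent_intakes(records):
    """Return one flag per record: 'Sub Intake' or 'None' for
    investigations, and None (missing) for intake rows."""
    # Gather all intake dates per ID first.
    intake_dates = {}
    for rec in records:
        if rec["kind"] == "Intake":
            intake_dates.setdefault(rec["id"], []).append(rec["date"])

    flags = []
    for rec in records:
        if rec["kind"] != "Investigation":
            flags.append(None)
        elif any(d > rec["date"] for d in intake_dates.get(rec["id"], [])):
            flags.append("Sub Intake")  # at least one later intake exists
        else:
            flags.append("None")
    return flags


sample = [
    {"id": 1, "date": date(2019, 1, 1), "kind": "Investigation"},
    {"id": 2, "date": date(2019, 1, 2), "kind": "Investigation"},
    {"id": 3, "date": date(2019, 1, 3), "kind": "Investigation"},
    {"id": 4, "date": date(2019, 1, 4), "kind": "Investigation"},
    {"id": 1, "date": date(2019, 1, 2), "kind": "Intake"},
    {"id": 2, "date": date(2018, 12, 31), "kind": "Intake"},
    {"id": 3, "date": date(2019, 1, 5), "kind": "Intake"},
]
flags = flag_subsequent_intakes(sample)
```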
I have two data.tables in R like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that (1) first, last uniquely defines a person, and (2) truncation of the last name does not introduce ambiguity.
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in first_table, I need to find a row in second_table that backfills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
One approach is to join on first, then filter based on the substring match:
first_table[
  unique(second_table[, .(first, last)])
  , on = "first"
  , nomatch = 0
][
  substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or truncate the last names in second_table to match the first table, then join on both columns (the 5 assumes every trunc value is exactly five characters long, as in the example):
first_table[
  unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
  , on = c("first", "trunc")
  , nomatch = 0
]
## yields the same answer
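The matching rule behind both variants (same first name, and trunc is a prefix of last) can be spelled out in plain Python. This is only a sketch of the logic under the question's two naive assumptions, not data.table code; the function name `backfill_last_names` is hypothetical.

```python
def backfill_last_names(first_rows, second_rows):
    """Attach the full last name to each row of first_rows by prefix match."""
    # Distinct last names per first name, in order of first appearance.
    lasts_by_first = {}
    for row in second_rows:
        names = lasts_by_first.setdefault(row["first"], [])
        if row["last"] not in names:
            names.append(row["last"])

    result = []
    for row in first_rows:
        candidates = lasts_by_first.get(row["first"], [])
        # Take the first last name that starts with the truncated name.
        match = next((n for n in candidates if n.startswith(row["trunc"])), None)
        if match is not None:  # rows without a match are dropped, like nomatch = 0
            result.append({**row, "last": match})
    return result


first_table = [
    {"id": 1, "first": "Bob", "trunc": "Smith", "val1": 10},
    {"id": 2, "first": "Sue", "trunc": "Goldm", "val1": 20},
    {"id": 3, "first": "Sue", "trunc": "Wollw", "val1": 30},
    {"id": 4, "first": "Bob", "trunc": "Bellb", "val1": 40},
]
second_table = [
    {"first": f, "last": l, "val2": v}
    for f, l in [("Bob", "Smith"), ("Sue", "Goldman"),
                 ("Sue", "Wollworth"), ("Bob", "Bellbottom")]
    for v in ("A", "B")
]
joined = backfill_last_names(first_table, second_table)
```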
I am using the DGET function in LibreOffice. I have the first table shown below (top) and want to build the second table (bottom). I can use DGET with the cell range containing the top table as Database and "Winner" as Database Field.
Is it possible to use a different cell range as Search Criteria for each cell, so that every cell in the row for Case #1 has its own formula with the search criterion given in the first row of the bottom table?
If I have to use a separate contiguous cell range for each search criterion, there would be n*Chances cell ranges, where n = total number of cases (~150 in my case) and Chances = number of possible Chance# values (50 in my case).
Case | Chance# | Winner
-------------------------
1 | 7 | Joe
1 | 9 | Emil
1 | 10 | Harry
1 | 11 | Kate
2 | 1 | Tom
2 | 3 | Jerry
2 | 4 | Mike
2 | 7 | John
Case |Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|
|="=1" |="=2" |="=3" |="=4" |="=5" |="=6" |="=7" |="=8" |="=9" |="=10" |="=11" | ---- |="=50"
1 | | | | | | | Joe | |Emil |Harry | Kate | ---- |
2 | Tom | |Jerry |Mike | | | John | | | | | ---- |
To do this, you need to change your approach. Instead of DGET, I use a somewhat more complex method.
Considering your example:
A B C D
1 # Case Chance# Winner
2 1 1 7 Joe
3 2 1 9 Emil
4 3 1 10 Harry
5 4 1 11 Kate
6 5 2 1 Tom
7 6 2 3 Jerry
8 7 2 4 Mike
9 8 2 7 John
10
11 Case\Chance# 1 2 3 4
12 1
13 2 Tom Jerry Mike
I use the following:
=IF(SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))> 0,INDEX($D$2:$D$9,SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))),"")
Let's ignore the IF for now and focus on the core of the formula:
First, get the row that matches your conditions. $B$2:$B$9=$A12 and $C$2:$C$9=B$11 each produce a TRUE/FALSE array; multiplying them gives a 0/1 array with a single 1 at the match. Multiplying that by the # column ($A$2:$A$9) turns the 1 into the row number in your table.
SUMPRODUCT then sums the resulting array into a single value (the row number).
Finally, INDEX retrieves the desired value from the Winner column.
The IF tests whether a match exists (SUMPRODUCT > 0) and leaves the cell empty when there is none.
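The arithmetic behind the formula may be easier to follow outside the spreadsheet. Below is a small Python sketch that mimics the SUMPRODUCT/INDEX steps on the example data; `lookup_winner` is a hypothetical name, and like the formula it assumes at most one row matches each Case/Chance# pair.

```python
def lookup_winner(rows, case, chance):
    """rows: (row_id, case, chance, winner) tuples, with row_id equal to
    the 1-based position of the row, like the # column in the sheet."""
    # (case match) * (chance match) * row_id is non-zero only for the
    # matching row, so the sum is that row's number (0 if no match).
    row_number = sum((c == case) * (ch == chance) * rid
                     for rid, c, ch, _ in rows)
    # INDEX step, guarded by the same "> 0" test as the IF in the formula.
    return rows[row_number - 1][3] if row_number > 0 else ""


data = [
    (1, 1, 7, "Joe"), (2, 1, 9, "Emil"), (3, 1, 10, "Harry"), (4, 1, 11, "Kate"),
    (5, 2, 1, "Tom"), (6, 2, 3, "Jerry"), (7, 2, 4, "Mike"), (8, 2, 7, "John"),
]

# Build the Case x Chance# grid for chances 1..11, one list per case.
grid = {case: [lookup_winner(data, case, ch) for ch in range(1, 12)]
        for case in (1, 2)}
```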
When considering time-dependent data in survival analysis, you have multiple start-stop intervals for an individual subject, with covariate measurements at each interval. How does the coxph function keep track of which subject the start and stop times and covariates belong to?
The function looks as follows
coxph(Surv(start, stop, event, type) ~ X)
Your data may look as follows
subject | start | stop | event | covariate |
--------+---------+--------+--------+-----------+
1 | 1 | 7 | 0 | 2 |
1 | 7 | 14 | 0 | 3 |
1 | 14 | 17 | 1 | 6 |
2 | 1 | 7 | 0 | 1 |
2 | 7 | 14 | 0 | 1 |
2 | 14 | 21 | 0 | 2 |
3 | 1 | 3 | 1 | 8 |
How can the function get away without an individual subject specifier?
My understanding is that the Cox model is not interested in following individuals through time. At each event time it only needs to know which intervals are at risk and what their covariate values are, so the subject identifier is irrelevant to the fit. Based on those counts, it can estimate the probability that any particular subject will be alive or dead at a certain time given certain treatments.
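To make this concrete, here is a minimal Python sketch of a Cox partial log-likelihood for a single covariate, computed from the (start, stop, event, covariate) rows alone. It is an illustration under stated assumptions, not the implementation in the survival package: intervals are treated as (start, stop], there is no handling of tied event times, and the function names are made up. The subject column from the table above is never used.

```python
import math

# (start, stop, event, covariate) rows from the example table,
# with the subject column deliberately left out.
intervals = [
    (1, 7, 0, 2), (7, 14, 0, 3), (14, 17, 1, 6),
    (1, 7, 0, 1), (7, 14, 0, 1), (14, 21, 0, 2),
    (1, 3, 1, 8),
]


def risk_set(rows, t):
    """Rows whose interval (start, stop] contains time t."""
    return [r for r in rows if r[0] < t <= r[1]]


def partial_log_likelihood(rows, beta):
    """Sum over event rows of beta*x minus log of the risk-set total."""
    total = 0.0
    for start, stop, event, x in rows:
        if event == 1:
            at_risk = risk_set(rows, stop)
            denom = sum(math.exp(beta * r[3]) for r in at_risk)
            total += beta * x - math.log(denom)
    return total
```

At the event time 3 the risk set holds three rows and at time 17 it holds two, whichever subjects they came from.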
I have one table 'positions' with columns:
id | session_id | keyword_id | position
and some rows in it:
10 rows with session_id = 1
and 10 with session_id = 2.
As a result of the query I need a table like this:
id | keyword_id | position1 | position2
where 'position1' is a column with values that had session_id = 1 and 'position2' is a column with values that had session_id = 2.
The result set should contain 10 records.
Sorry for my bad English.
Data example:
id | session_id | keyword_id | position
1 | 1 | 1 | 2
2 | 1 | 2 | 3
3 | 1 | 3 | 0
4 | 1 | 4 | 18
5 | 2 | 5 | 9
6 | 2 | 1 | 0
7 | 2 | 2 | 14
8 | 2 | 3 | 2
9 | 2 | 4 | 8
10 | 2 | 5 | 19
Assuming that you wish to combine the positions from the two sessions that share the same keyword_id, the following query should do the trick:
SELECT T1.keyword_id
, T1.position as Position1
, T2.position as Position2
FROM positions T1
INNER JOIN positions T2
ON T1.keyword_id = T2.keyword_id -- this will match positions by [keyword_id]
AND T1.session_id = 1
AND T2.session_id = 2
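As a quick check, the query can be run against the sample data using Python's built-in sqlite3 module. This is only a test harness sketch: the column types, the in-memory database, the added ORDER BY, and the function name `paired_positions` are assumptions, and the original database engine is not stated in the question.

```python
import sqlite3

SAMPLE_ROWS = [
    (1, 1, 1, 2), (2, 1, 2, 3), (3, 1, 3, 0), (4, 1, 4, 18),
    (5, 2, 5, 9), (6, 2, 1, 0), (7, 2, 2, 14), (8, 2, 3, 2),
    (9, 2, 4, 8), (10, 2, 5, 19),
]

QUERY = """
SELECT T1.keyword_id
     , T1.position AS Position1
     , T2.position AS Position2
FROM positions T1
INNER JOIN positions T2
        ON T1.keyword_id = T2.keyword_id
       AND T1.session_id = 1
       AND T2.session_id = 2
ORDER BY T1.keyword_id  -- added here only to make the output order stable
"""


def paired_positions(rows):
    """Load rows into an in-memory table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE positions ("
            "id INTEGER PRIMARY KEY, session_id INTEGER, "
            "keyword_id INTEGER, position INTEGER)"
        )
        conn.executemany("INSERT INTO positions VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = paired_positions(SAMPLE_ROWS)
```

Because it is an inner join, a keyword that appears in only one session (keyword_id 5 in the sample) is left out of the result.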