Merge datasets using common and uncommon/time-varying variables - r

I am trying to merge multiple datasets based on subject IDs and dates of measurement. The subject IDs are common across datasets (but some datasets may contain subject IDs that others do not). The dates of measurement are uncommon, i.e. they differ between datasets. I am trying to match entries for the same subjects across datasets that were recorded within 14 days of each other.
Here is an example of a pair of datasets I am trying to merge. I assume that I need to match two datasets at a time, as is customary with data-matching functions such as merge in Stata or R.
Dataset 1:
ID t x1 x2 x3
1 01/01/2019 7957 0 1
1 31/01/2019 6991 0 1
1 02/03/2019 4242 0 1
1 26/03/2019 9459 0 1
1 30/03/2019 5584 0 1
2 04/02/2020 9142 1 3
2 29/02/2020 8208 1 3
2 12/03/2020 9260 1 3
3 12/03/2019 8919 1 2
3 25/03/2019 4694 1 2
3 16/04/2019 1393 1 2
4 25/03/2020 . 0 0
4 22/04/2020 . 0 0
5 02/04/2018 7537 1 1
5 29/04/2018 9172 1 1
5 19/05/2018 4914 1 1
5 22/06/2018 846 1 1
6 06/04/2020 3493 1 5
6 29/04/2020 9894 1 5
6 22/05/2020 7034 1 5
7 02/01/2022 8142 0 2
7 07/02/2022 7891 0 2
Dataset 2:
ID t y1 x4
1 16/01/2019 22 0
1 01/02/2019 16 0
1 06/03/2019 18 0
1 29/03/2019 13 0
2 17/03/2020 22 1
4 06/04/2020 17 0
4 14/05/2020 15 0
4 17/05/2020 23 0
4 22/05/2020 19 0
4 24/05/2020 16 0
5 10/03/2018 . .
5 17/04/2018 . .
5 14/05/2018 . .
5 07/06/2018 . .
6 06/04/2020 12 1
6 22/05/2020 15 1
7 22/01/2022 24 0
7 09/03/2022 27 0
8 22/02/2021 11 .
8 24/02/2021 14 .
8 28/02/2021 16 .
The merged dataset:
ID t1 t2 tdiff x1 x2 x3 y1 x4
1 31/01/2019 01/02/2019 -1 6991 0 1 16 0
1 02/03/2019 06/03/2019 -4 4242 0 1 18 0
1 30/03/2019 29/03/2019 1 5584 0 1 13 0
2 12/03/2020 17/03/2020 -5 9260 1 3 22 1
4 25/03/2020 06/04/2020 -12 . . . 17 0
5 29/04/2018 17/04/2018 12 9172 1 1 . .
5 19/05/2018 14/05/2018 5 4914 1 1 . .
6 06/04/2020 06/04/2020 0 3493 1 5 12 1
6 22/05/2020 22/05/2020 0 7034 1 5 15 1
t1 reflects the date of measurement in dataset 1; t2 reflects the date of measurement in dataset 2; tdiff reflects the difference in days between t1 and t2. There should be no values of tdiff whose absolute value exceeds 14. The periods represent missing values.
As you can see, only entries that were recorded within +/-14 days of each other for a given subject have been merged, on a 1:1 basis. There is an instance where two entries in dataset 1 fall within 14 days of one entry in dataset 2 for subject ID 1. In cases like this, I would like to keep the pair of entries that are closest in date (e.g., out of 26/03/2019 and 30/03/2019 in dataset 1, the latter is closer to 29/03/2019 in dataset 2). There may be more than two entries in one dataset that fall within 14 days of an entry in the other dataset; again, I would want to keep the pair of entries that are closest in time. Some subjects are not included in the merged dataset because they do not appear in both datasets (e.g., subject 3 appears only in dataset 1 and subject 8 only in dataset 2).
All variables, including the dates of measurement (t), from each dataset have been carried over (e.g., x1-x3 from dataset 1 and y1 and x4 from dataset 2). Each dataset has a different number of variables to merge. There are 12 datasets to merge in total, which I envision doing in pairs. Also, subjects vary in how many entries they have recorded within a dataset (e.g., subject 1 has 5 entries whereas subject 7 has only 2) and between datasets (e.g., subject 2 has three entries recorded in dataset 1 but only one entry in dataset 2).
I have had a few thoughts so far but feel lost in how to implement something like this:
The data is currently in long format, but there is no reason why we cannot reshape to wide format. It might be easier if each subject's data were on one row?
Ideally we would have a standardized time variable that could be used to match measurements across datasets. I have thought of creating a variable that reflects the difference between an absolute starting time point and the date of a given entry, and then converting this measure into months, but we still have the problem that the time variable is not the same across datasets.
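For example, such a variable could be built along these lines (a sketch, assuming a data frame df1 whose date column t is stored as dd/mm/yyyy text, and an arbitrary reference date of 01/01/2018):
library(lubridate)

# days elapsed since an arbitrary fixed reference date (here 01/01/2018)
df1$days_since_start <- as.numeric(dmy(df1$t) - dmy("01/01/2018"))
# rough conversion to months, using an average month length of 30.44 days
df1$months_since_start <- df1$days_since_start / 30.44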
As for programs for implementation, I am using Stata, R and Excel for data management and analysis.
Your guidance is greatly appreciated!
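One possible starting point in R (a rough sketch with dplyr and lubridate; the data frame names df1 and df2 and the date-parsing step are assumptions, not a verified solution): join the two datasets on ID, keep only date pairs within 14 days, and then keep the closest pair for each measurement date on either side.
library(dplyr)
library(lubridate)

# parse the dates and give each dataset its own date column name
d1 <- df1 %>% mutate(t1 = dmy(t)) %>% select(-t)
d2 <- df2 %>% mutate(t2 = dmy(t)) %>% select(-t)

merged <- d1 %>%
  inner_join(d2, by = "ID") %>%            # all date pairs per subject
  mutate(tdiff = as.numeric(t1 - t2)) %>%  # difference in days
  filter(abs(tdiff) <= 14) %>%             # keep pairs within 14 days
  group_by(ID, t1) %>%                     # for each date in dataset 1 ...
  slice_min(abs(tdiff), n = 1, with_ties = FALSE) %>%  # ... keep the closest dataset 2 date
  group_by(ID, t2) %>%                     # and vice versa
  slice_min(abs(tdiff), n = 1, with_ties = FALSE) %>%
  ungroup()
The two slice_min() steps are a greedy way of enforcing the 1:1 requirement; on the example above they should reproduce the merged dataset shown, but data with long chains of nearby dates might need a stricter assignment step. The same pairwise merge could then be repeated across the 12 datasets.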

Related

Removing rows in R where the contiguous values of rows in one column are further than 1 numeric value apart, whilst accounting for participant ID?

I am trying to clean up a time series with multiple data points. The data is arranged by day and by 'beep'. I want to keep only items that are 1 beep away from each other on the same day.
In order to do this I have created a dummy variable by multiplying the day number by 100 and adding the beep number to it.
I was wondering if it would be possible to use some kind of clause to specify that I want to keep data whose dummy variable is exactly 1 away from its lead OR lag value, but also less than 50 away (so that the days are kept separate). Alternatively, is there a way to group by participant and then by day, so that the rule is applied within each participant and each day and does not delete valid data between days (e.g. it should not delete day 2 beep 1 for being too far away from day 1 beep 7)?
I am doing this so I can use a function called lagvar from an ESM package to create a time-lagged series. Before doing this I want to make sure that any values of day_beep that are more than 1 away from their contiguous neighbours are removed.
E.g.
Take the following rows and day_beep values
Participant ID Day Beep Dummy Variable
1 1 1 101
1 1 2 102
1 1 4 104
**1 1 7 107**
1 2 3 203
1 2 4 204
2 1 2 102
2 1 3 103
**2 2 5 205
2 3 4 305**
**3 1 1 101**
3 2 4 204
3 2 5 205
**4 1 7 107**
4 4 4 404
4 4 5 405
In this instance I would want to remove the data held between the asterisks as it is either contiguously more than 1 beep from its neighbours, or an isolated beep in the series.
What would be the easiest way to do this for the entire dataframe?
Any help would be greatly appreciated!
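For reference, here is the example data above as a reproducible data frame (the underscored column names are an assumption, chosen to match the answer below):
# example data from the question, with underscores in the column names
df <- data.frame(
  Participant_ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  Day            = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 3, 1, 2, 2, 1, 4, 4),
  Beep           = c(1, 2, 4, 7, 3, 4, 2, 3, 5, 4, 1, 4, 5, 7, 4, 5),
  Dummy_Variable = c(101, 102, 104, 107, 203, 204, 102, 103, 205, 305,
                     101, 204, 205, 107, 404, 405)
)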
You can use lead and lag from dplyr to keep only the rows that have a contiguous value before or after:
library(dplyr)

df %>%
  group_by(Participant_ID) %>%
  filter((Dummy_Variable - lag(Dummy_Variable) == 1) |
         (lead(Dummy_Variable) - Dummy_Variable == 1))
output
Participant_ID Day Beep Dummy_Variable
1 1 1 1 101
2 1 1 2 102
3 1 2 3 203
4 1 2 4 204
5 2 1 2 102
6 2 1 3 103
7 3 2 4 204
8 3 2 5 205
9 4 4 4 404
10 4 4 5 405

R - How do you count the number of rows associated with two group_by() variables?

I have a dataset (see example below) in which each individual underwent two sessions, each with 4 trials. In each trial they could pick either correctly (1) or incorrectly (0), as designated by the y variable. I am trying to calculate the rate of correct choices per individual per session. (This is an example dataset; the real one is much larger and has many more rows, so I don't want to do this by hand.)
df
head(df, 16)
row name session_number y
1 1 Tom 1 1
2 2 Tom 1 1
3 3 Tom 1 0
4 4 Tom 1 0
5 5 Tom 2 1
6 6 Tom 2 0
7 7 Tom 2 1
8 8 Tom 2 0
9 9 Rob 1 0
10 10 Rob 1 1
11 11 Rob 1 0
12 12 Rob 1 1
13 13 Rob 2 0
14 14 Rob 2 1
15 15 Rob 2 0
16 16 Rob 2 1
For example, I want to know that Tom, on his first session, picked correctly in 0.50 of his trials. This is calculated by summing y and dividing by the number of rows associated with "Tom" AND "Session 1". I can't seem to figure out how to calculate that number of rows in a larger dataset, though.
I tried using group_by() and mutate(), but I still can't seem to get it to work because count() is not working.
by_name_by_session <- df %>%
  group_by(df$name) %>%
  group_by(session_number) %>%
  mutate(rate = (sum(df$y) / count(df$name)))
Thanks in advance to anyone who can help!
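One possible approach (a sketch with dplyr): group by both columns in a single group_by() call and take the mean of the 0/1 outcome, which equals sum(y) divided by the number of rows in each group.
library(dplyr)

by_name_by_session <- df %>%
  group_by(name, session_number) %>%
  summarise(rate = mean(y),     # proportion of correct choices
            n_trials = n(),     # number of rows per name/session group
            .groups = "drop")
For the example data this would give Tom a rate of 0.50 in session 1. If the rate is needed on every row rather than one row per group, mutate() can be used in place of summarise().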

Create a new dataframe in R resulting from comparison of differently ordered columns from two other databases with different lengths

I have these two dataframes, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with a structure (columns) like below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables have the same "Species_name" and the same "Biom" values (areas in the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben); furthermore, the two tables do not always share the same combinations of "Species_name" and "Biom" (combinations are simply populations of the same species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434
I want to highlight rows that have the same combination of "Species_name" and "Biom": in the example, these are lines 3, 4, 5 from CDD26_FF matching lines 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, but I want to keep not only the "Species_name" and "Biom" columns (as the compare() function seems to do) but also all the other columns.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" from the highlighted lines.
How can I do that?
Aside from compare(), I tried a for loop, but the lengths of the tables differ, so I tried a 3-nested for loop, still without results. I also tried compareDF() and semi_join(). No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames and that satisfy the matching condition (in this case: matching "Biom" and "Species_name").
Subsequently it is easy to calculate the ratio using mutate():
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: remove the select() part if you need all columns, or adapt it to return other columns.
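For instance, a version that keeps every column and adds the ratio could look like this (a sketch; the suffix argument is used to disambiguate the AreaCell and Area_total columns, which appear in both tables):
library(dplyr)

cdd_matched <- cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name"), suffix = c("_FF", "_HI")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI)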
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")

lag and summarize time series data

I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it by creating a rolling mean of every nth row. It looks like this is possible in zoo and maybe Hmisc, and I am sure in other packages. I need to average rows 1,2,3 then 3,4,5 then 5,6,7 and so on. My data looks like this and has thousands of observations:
id time x.1 x.2 y.1 y.2
10 1 22 19 0 -.5
10 2 27 44 -1 0
10 3 19 13 0 -1.5
10 4 7 22 .5 1
10 5 -15 5 .33 2
10 6 3 17 1 .33
10 7 6 -2 0 0
10 8 44 25 0 0
10 9 27 12 1 -.5
10 10 2 11 2 1
I would like it to look like this when complete:
id time x.1 x.2 y.1 y.2
10 1 22.66 25.33 -.33 -.66
10 2 3.66 13.33 .27 .50
Time 1 in the desired output would actually be times 1,2,3 averaged and time 2 would be times 3,4,5 averaged, but at this point the time variable is not important to keep. I would need to group by id as it does change eventually. The only way I could figure out how to do this was to use Lag() to make one new column that lags by 1 and another that lags by 2, then take the average across columns. After that you have to delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
Then you use the 1-2-3 and 3-4-5 rows and remove 2-3-4, and so on. Doing this for each variable would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.
something like this maybe?
library(zoo)  # for rollmean()

# sample data
id <- c(10,10,10,10,10,10)
time <- c(1,2,3,4,5,6)
x1 <- c(22,27,19,7,-15,3)
x2 <- c(19,44,13,22,5,17)
df <- data.frame(id,time,x1,x2)

# rolling 3-row means of id and the measurement columns
means <- data.frame(rollmean(df[,c(1,3:NCOL(df))], 3))
# keep every other window, i.e. the averages of rows 1-3, 3-5, ...
means <- means[c(T,F),]
means$time <- seq(1:NROW(means))
row.names(means) <- 1:NROW(means)
> means
id x1 x2 time
1 10 22.666667 25.33333 1
2 10 3.666667 13.33333 2
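Since the real data contains more than one id, one possible extension (a sketch built on the sample data above, using dplyr's group_modify() to apply the same rollmean logic within each id, and assuming every id has at least 3 rows):
library(dplyr)
library(zoo)

roll_collapse <- function(d) {
  # 3-row rolling means over the measurement columns, then keep every
  # other window so the averages cover rows 1-3, 3-5, 5-7, ...
  m <- data.frame(rollmean(as.matrix(d[, c("x1", "x2")]), 3))
  m <- m[c(TRUE, FALSE), , drop = FALSE]
  m$time <- seq_len(nrow(m))
  m
}

df %>%
  group_by(id) %>%
  group_modify(~ roll_collapse(.x)) %>%
  ungroup()
The y columns in the real data could simply be added to the column selection inside roll_collapse().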

Data Cleaning for Survival Analysis

I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual has only a single, sustained transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete, sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep coming to force me to use conditions based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
I don't really think you have considered all the edge cases: what to do with two NAs in a row at the end of a period, or 4 or 5 NAs in a row. This will, however, give you the requested solution in your tiny test case, using the na.locf() function:
require(zoo)
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) {
    vec
  } else {
    na.locf(vec)
  }
}
mydat$locf <- with(mydat, ave(ss, id, FUN = fillNA))
mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
