Matching two datasets using different IDs - r

I have two datasets: one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable, which makes it impossible to track someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID-year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked), I cannot match on this.
I guess I would need to find a set of variables that uniquely identifies each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!
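One way to approach this, as a rough sketch rather than a definitive answer: greedily pick shared columns until no row of the long data is duplicated on them, check that the same columns are also unique in the cross data, then merge on them. The helper name find_key is made up for illustration. Note that in the toy tables above the row for id 2, year 2 has y = 20 in long but y = 25 in cross, so the sketch assumes y = 25 in both to make the rows line up. A greedy search is not guaranteed to find the smallest possible key, only a small one.

```r
# Toy data from the question. ASSUMPTION: long has y = 25 for id 2, year 2,
# so that its rows line up with the cross table.
long <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                   year = c(1, 2, 3, 1, 2, 3),
                   y = c(10, 20, 30, 15, 25, 5))
cross <- data.frame(id = c(912, 492, 363, 789, 134, 267),
                    year = c(1, 2, 3, 1, 2, 3),
                    y = c(10, 20, 30, 15, 25, 5),
                    x = c(1, 1, 0, 1, 0, 0))

# Hypothetical helper: repeatedly add the candidate column that leaves the
# fewest duplicated rows, until the chosen columns identify every row.
find_key <- function(df, candidates) {
  key <- character(0)
  while (length(candidates) > 0) {
    n_dup <- sapply(candidates, function(v) sum(duplicated(df[c(key, v)])))
    best <- names(n_dup)[which.min(n_dup)]
    key <- c(key, best)
    candidates <- setdiff(candidates, best)
    if (min(n_dup) == 0) return(key)
  }
  NULL  # no combination of the candidates is unique
}

# Columns present in both datasets, excluding the incompatible id
shared <- setdiff(intersect(names(long), names(cross)), "id")
key <- find_key(long, shared)

# The key must exist and must also be unique in the cross-sectional data
stopifnot(!is.null(key), !anyDuplicated(cross[key]))

# Bring over only the variables that long does not already have
extra <- setdiff(names(cross), c(shared, "id"))
panel <- merge(long, cross[c(key, extra)], by = key, all.x = TRUE)
```

With 200-300 variables it is probably worth restricting the candidates to columns without missing values and checking afterwards that every row of long found a match (no NA in the merged-in columns).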

Related

R: How to measure difference with both categorical and numeric features

I'm very new to data wrangling. And now I have this problem at hand:
So basically I have used tables of biochemical measurements (all numerical) of patients to perform cluster analysis, and by doing so I sorted them into 5 clusters.
I also have their clinical data/features, and I want to ask whether any of these clinical features (a mix of numerical and categorical) differ significantly from one cluster to another. How can I go about this? Which test should I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
.....
As you can see, ClusterAssigned comes from my cluster analysis. FISH, IGHV, IgG are categorical; there are sometimes na values, and sometimes one person has multiple entries, such as "13q,t12".
As a simplification, I could perhaps just take the cluster 1 and 4 patients, omit all the na ones, and ask whether there is a difference in their age, sex, FISH, IGHV... Still, what method can I use to perform such a test in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields.
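If the goal is to test the clinical features against the existing cluster labels rather than to re-cluster, a possible sketch using only base R is below. It assumes a Kruskal-Wallis test for numeric features and Fisher's exact test for categorical ones, followed by a Holm correction because several features are tested; these test choices and the helper name test_feature are assumptions for illustration, not the only reasonable options. The toy rows are a subset of the columns in the table above, with missing values dropped per feature.

```r
# A few columns from the question's sample rows
clin <- data.frame(
  cluster = factor(c(1, 1, 1, 1, 1, 4, 4, 4)),
  age = c(50, 80, 55, 94, 59, 80, 75, 62),
  sex = c("m", "m", "f", "f", "m", "f", "f", "m"),
  IGHV = c("mutated", "mutated", NA, "unmutated", "mutated",
           "unmutated", "unmutated", "mutated"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick a test based on the feature type
test_feature <- function(x, cluster) {
  keep <- !is.na(x)                       # drop missing values for this feature only
  x <- x[keep]
  cluster <- droplevels(cluster[keep])
  if (is.numeric(x)) {
    kruskal.test(x, cluster)$p.value      # numeric feature vs. cluster
  } else {
    fisher.test(table(x, cluster))$p.value  # categorical feature vs. cluster
  }
}

features <- c("age", "sex", "IGHV")
p_raw <- sapply(features, function(f) test_feature(clin[[f]], clin$cluster))
p_adj <- p.adjust(p_raw, method = "holm")  # correct for testing several features
```

Multi-valued entries such as "13q,t12" would need to be split into one indicator column per abnormality before testing; that step is not shown here.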

R: Pairwise Matrix Manipulation & Variable Construction with Many Groups

I'm starting with data of scores at the "group-person" level as follows:
group_id person_id score
1 1 3
1 2 1
1 3 5
2 1 3
2 2 3
2 3 6
The goal is to generate data on person-person pairs that looks like the following:
person_id1 person_id2 sumsquarederror
1 2 4
1 3 13
2 3 25
where the "sumsquarederror" variable is defined as the sum across all groups of the squared differences in score values for each possible pair of persons. In mathspeak, this variable would be defined like: for persons i=1 and i=2 and groups j=(1,...,J)
sumsquarederror(i=1,i=2) = sum_j (( score(i=1) - score(i=2) )^2)
Building this data is trivial with small numbers of groups and persons, but I have roughly 1,000 groups and 150,000 persons, so creating matrices/dataframes for all combinations possible quickly becomes computationally burdensome (=150K by 150K by 1K, before collapsing to the sumsquarederror variable)
I'm guessing there might be some linear algebra approaches or regression-type ideas, but am stumped. Any tips or tricks or useful packages would be greatly appreciated!
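A linear algebra sketch of the idea: put the scores in a persons-by-groups matrix S, then use the identity sum_j (a_j - b_j)^2 = sum_j a_j^2 + sum_j b_j^2 - 2 * sum_j a_j * b_j, so that all pairwise values come from one cross-product. This assumes every person has a score in every group (no missing cells). The example uses the toy data from the question.

```r
scores <- data.frame(group_id = c(1, 1, 1, 2, 2, 2),
                     person_id = c(1, 2, 3, 1, 2, 3),
                     score = c(3, 1, 5, 3, 3, 6))

# Persons in rows, groups in columns
S <- with(scores, tapply(score, list(person_id, group_id), sum))

# Squared Euclidean distance between every pair of rows:
# |a|^2 + |b|^2 - 2 * a.b, computed for all pairs at once
sq <- rowSums(S^2)
D <- outer(sq, sq, "+") - 2 * tcrossprod(S)

# Keep each unordered pair once (upper triangle)
idx <- which(upper.tri(D), arr.ind = TRUE)
pairs <- data.frame(person_id1 = rownames(D)[idx[, 1]],
                    person_id2 = colnames(D)[idx[, 2]],
                    sumsquarederror = D[idx])
```

For small inputs, dist(S)^2 gives the same numbers. With 150,000 persons the full result has roughly 1.1e10 pairs, so the matrix D itself will not fit in memory on most machines; the same formula can be applied to blocks of rows (tcrossprod(S[block, ], S)) and the results written out or filtered block by block.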

Merging two columns with survival times into one (loses survival property once merged)

I am trying to run a cox regression for survival data. I am comparing two groups that have different censoring dates, and now in the dataset I have two columns, one for each survival data (days). In other words, some individuals have their survival data on the first column, while others have the data in the second column.
id censorgrp days1 days2
1 1 30+ NA
2 2 20+ 10+
3 1 50+ NA
4 1 35+ NA
5 1 100+ NA
6 2 80+ 30
7 2 75+ 15
8 2 40+ 20+
9 1 30+ NA
10 1 30+ NA
In order to run the regression model, I need to combine the two columns into one. Right now I am doing the following:
data$newcolumn<-ifelse(data$censorgrp==2,data$days2,data$days1)
where censorgrp==2 is the second group, so if the person belongs to the second group, this variable will take the survival data from the second column, otherwise first column for group 1.
However, with this approach, I lose the property of the survival data (i.e., previously the data looked like this "50+", meaning 50 days and was censored, but now it becomes simply "50"). Is there a better way to merge the two columns together? Many thanks.
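One possible workaround, sketched under the assumption that the two columns are stored as text such as "50+" (a trailing + meaning censored): pick the right column as text, then split it into a numeric time and a 0/1 event indicator, which is the form a survival model expects. If the columns are actually Surv objects from the survival package, the time and status can be taken from their "time" and "status" columns instead and combined the same way. The column names time and event are made up for this example.

```r
# Data from the question, stored as character
dat <- data.frame(
  id = 1:10,
  censorgrp = c(1, 2, 1, 1, 1, 2, 2, 2, 1, 1),
  days1 = c("30+", "20+", "50+", "35+", "100+", "80+", "75+", "40+", "30+", "30+"),
  days2 = c(NA, "10+", NA, NA, NA, "30", "15", "20+", NA, NA),
  stringsAsFactors = FALSE
)

# Pick the relevant column per person, still as text so the "+" survives
raw <- ifelse(dat$censorgrp == 2, dat$days2, dat$days1)

# Split into follow-up time and event indicator (1 = event, 0 = censored)
dat$time <- as.numeric(sub("\\+$", "", raw))
dat$event <- as.integer(!grepl("\\+$", raw))

# The pair can then be passed on, e.g. survival::Surv(dat$time, dat$event)
```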

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* ones in another), but wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With, for instance, an ANOVA I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
library(reshape2)  # melt()
library(ggplot2)   # ggplot(), geom_boxplot()
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifier for .imp and .id after melting so rather put:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
EDIT: fixed the data mess-up

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value for an individual across different rows (situations).
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid=1, and the next six rows (pid=2) are different observations for the second person.
This should work for you:
# Note: %in% ignores repeated draws, so a pid sampled twice appears only once
z <- replicate(100,
               d[d$pid %in% sample(unique(d$pid), 2000, replace = TRUE), ],
               simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but it will deal with duplicated rows. replicate performs a given operation a set number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call with rbind and lapply deals with the duplicates that are not handled well by the code above. Thus, instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
function(x) d[d$pid==x,])),
simplify=FALSE)
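A variation on the same idea, as a sketch: wrap the resampling in a function and add a new identifier to each drawn block, so that a person who is drawn twice can still be told apart in later panel estimation. The function name resample_panel and the column boot_id are made up for this example; the toy data are the rows from the question.

```r
d <- data.frame(pid = rep(1:2, each = 6),
                x = seq(10, 32, by = 2),
                y = 2:13,
                z = seq(-5, 0.5, by = 0.5))

# Hypothetical helper: draw as many individuals as the data contain, with
# replacement, and keep all rows of each drawn individual together
resample_panel <- function(d, id = "pid") {
  ids <- unique(d[[id]])
  draw <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(draw), function(k) {
    block <- d[d[[id]] == draw[k], , drop = FALSE]
    block$boot_id <- k   # new id: distinguishes repeated draws of one pid
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(1)
boot <- replicate(100, resample_panel(d), simplify = FALSE)
```

Each element of boot is a data frame; boot_id can be used in place of pid as the panel identifier in whatever is estimated on the resamples.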
