How to add a variable in longitudinal data using SPSS or R?

I have a file with repeated-measures data and another file with single observations for the same persons (e.g. one file contains subjects' repeated assessments, and the other just says whether each subject is male or female). When I merge the files I get something like this:
ID time gender
1 1 0
1 2
1 3
2 1 1
2 2
3 1 0
3 2
3 3
3 4
but I want the variable that was measured once (e.g. male/female) to be repeated across time (in each row) for each subject. So I would like to have:
1 1 0
1 2 0
1 3 0
2 1 1
2 2 1
and not do it manually, since I have thousands of cases...
How can I do this in SPSS (preferably), or in R?

You should have used MATCH FILES with one "file" (multiple records per ID) and one "table" (no duplicate IDs).
But you can probably still fix it by running
sort cases by ID.
if mis(gender) and ID = lag(ID) gender= lag(gender).
Wherever there's no value for gender, it will be filled in with the gender of the previous case if it has the same ID as the current one.
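The same fill-down (last-observation-carried-forward) idea can be sketched in plain Python; the data below is a toy version of the merged file, and the structure is illustrative:

```python
# Fill a once-measured variable (gender) down across repeated rows per ID,
# mirroring the SPSS lag() approach: rows must be sorted by ID first.
rows = [
    {"id": 1, "time": 1, "gender": 0},
    {"id": 1, "time": 2, "gender": None},
    {"id": 1, "time": 3, "gender": None},
    {"id": 2, "time": 1, "gender": 1},
    {"id": 2, "time": 2, "gender": None},
]

rows.sort(key=lambda r: (r["id"], r["time"]))
for prev, cur in zip(rows, rows[1:]):
    # if gender is missing and the previous row has the same ID, copy it down
    if cur["gender"] is None and cur["id"] == prev["id"]:
        cur["gender"] = prev["gender"]

print([r["gender"] for r in rows])  # [0, 0, 0, 1, 1]
```

Because the rows are sorted by ID, each subject's first (non-missing) value cascades through all later rows for that subject, just like the lag()-based SPSS fix.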

Related

Retain group in the dataframe if all rows in the group meet certain criteria

My dataframe pertains to contestants of village council elections, and looks something like this:
village_council college_yes_no winner
a 1 1
a 0 0
b 0 1
b 1 0
c 0 1
c 0 0
d 1 1
d 1 0
My 'group' variable is village_council. The variable 'winner' tells us if the contestant won the elections and the variable 'college_yes_no' tells us if the contestant has a college degree.
I wish to retain both observations for a given village_council only if either of the following criteria is met (and remove them from the data set otherwise):
one of the observations within a village_council has winner=1 and college_yes_no=1, AND the other observation has winner=0 and college_yes_no=0
OR
one of the observations within village_council has winner=1 and college_yes_no=0, AND the other observation has winner=0 and college_yes_no=1
If I apply the above criteria, then only the first 4 observations (village_council a and b) will be retained in the data set, while the last 4 will be dropped.
In essence, I want to retain only those village councils in which one of the contestants is college-educated while the other is not.
How can I code this in R?
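Since each council has exactly one winner, the two criteria together amount to keeping councils whose (winner, college) pairs are complementary. A plain-Python sketch of that filter on the example data (in R you would group by village_council the same way, e.g. with dplyr's group_by/filter):

```python
from itertools import groupby

# each row: (village_council, college_yes_no, winner), sorted by council
data = [
    ("a", 1, 1), ("a", 0, 0),
    ("b", 0, 1), ("b", 1, 0),
    ("c", 0, 1), ("c", 0, 0),
    ("d", 1, 1), ("d", 1, 0),
]

kept = []
for council, grp in groupby(data, key=lambda r: r[0]):
    grp = list(grp)
    pairs = {(r[2], r[1]) for r in grp}  # set of (winner, college) pairs
    # retain only if the pairs are exactly {(1,1),(0,0)} or {(1,0),(0,1)}
    if pairs in ({(1, 1), (0, 0)}, {(1, 0), (0, 1)}):
        kept.extend(grp)

print(sorted({r[0] for r in kept}))  # ['a', 'b']
```

As required, only councils a and b survive: in each, exactly one of the two contestants is college-educated.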

Matching two datasets using different IDs

I have two datasets, one longitudinal (following individuals over multiple years) and one cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable which does not allow tracking someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID-year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in cross-sectional so that one cannot track individuals) I cannot match them based on this.
I guess I would need to find a set of variables that uniquely identify each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!
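One way to approach this is to test column subsets from smallest to largest until the combination of values is unique across rows, then merge both datasets on that subset. A plain-Python sketch with toy data (column names and values are illustrative, and the toy values are chosen so a key exists; the same search could be written in R over names(df)):

```python
from itertools import combinations

# toy "long" dataset: here the y column alone happens to identify each row
long_rows = [
    {"id": 1, "year": 1, "y": 10},
    {"id": 1, "year": 2, "y": 20},
    {"id": 1, "year": 3, "y": 30},
    {"id": 2, "year": 1, "y": 15},
    {"id": 2, "year": 2, "y": 25},
    {"id": 2, "year": 3, "y": 5},
]
candidates = ["year", "y"]  # every shared variable except the unusable ID

def smallest_key(rows, cols):
    # try subsets from smallest to largest; return the first whose
    # value combinations are unique across all rows
    for size in range(1, len(cols) + 1):
        for subset in combinations(cols, size):
            keys = [tuple(r[c] for c in subset) for r in rows]
            if len(set(keys)) == len(keys):
                return subset
    return None  # no subset identifies the rows uniquely

print(smallest_key(long_rows, candidates))  # ('y',)
```

With 200-300 variables an exhaustive search over all subsets is infeasible, so in practice you would restrict the candidates to a handful of stable variables (birth year, region, etc.) before searching. Note also that the merge can only be trusted where the key really is unique in both datasets.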

Create recency variable using previous observation in data.table

I want to create a new variable called recency (how recent the customer's transaction is), which is useful for RFM analysis. The definition is as follows: we observe each customer's transaction log weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable equals the week number if she made a transaction that week; otherwise recency equals the previous recency value. To make it clearer, I have also created a demo data.table for you.
demo<-data.table( cust=rep(c(1:3), 3))
demo[,week:=seq(1,3,1),by=cust]
demo[, trans:=c(1,1,1,0,1,0,1,1,0)]
demo[, rec:=c(1,1,1, 1,2,1,3,3,1)]
I need to calculate the "rec" variable, which I entered manually in the demo data.table. Please also consider that I can handle it with a loop, but that takes a lot of time. Therefore, I would be grateful if you could help me with a data.table way. Thanks in advance.
This works for the example:
demo[, v := cummax(week*trans), by=cust]
cust week trans rec v
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 1 1 1
4: 1 2 0 1 1
5: 2 2 1 2 2
6: 3 2 0 1 1
7: 1 3 1 3 3
8: 2 3 1 3 3
9: 3 3 0 1 1
We observe transaction log of each customer weekly and assign dummy variable called "trans" if the customers made a transaction. Recency variable will equal to the number of the week if she made a transaction at that week, otherwise recency will be equal to the previous recency value.
This means taking the cumulative max week, ignoring weeks where there is no transaction. Since weeks are positive numbers, we can treat the no-transaction weeks as zero.
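The same cumulative-max logic can be traced in plain Python on the demo data (a running per-customer maximum of week*trans):

```python
# Recency = cumulative max of week*trans within each customer:
# weeks without a transaction contribute 0 and so keep the previous max.
cust  = [1, 2, 3, 1, 2, 3, 1, 2, 3]
week  = [1, 1, 1, 2, 2, 2, 3, 3, 3]
trans = [1, 1, 1, 0, 1, 0, 1, 1, 0]

running = {}  # per-customer running maximum
rec = []
for c, w, t in zip(cust, week, trans):
    running[c] = max(running.get(c, 0), w * t)
    rec.append(running[c])

print(rec)  # [1, 1, 1, 1, 2, 1, 3, 3, 1]
```

This reproduces the hand-entered "rec" column, which is exactly what cummax(week*trans) by cust computes in one vectorized step in data.table.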

How to use logistic regression for a given session ID from a dataset with many session IDs?

I have a data set divided by session IDs (customers), and for each session ID I have many rows of data (the customer's choices). I have to use logistic regression (0 or 1), but with the positive value 1 occurring only once per session ID and 0 for the rest (each customer chooses only one option).
How do I tell my algorithm to use each session ID in the data set to predict the choice? (as there are so many customers, each having many choices)
edit:
The data looks something like this:
Session ID choice price duration preferred
1 1 200 2 0
1 2 300 3 0
1 3 150 1 1
1 4 250 2 0
2 1 150 2 0
2 2 200 1 1
2 3 300 1 0

New calculation loop

I want to have a loop that will perform a calculation for me, and export the variable (along with identifying information) into a new data frame.
My data look like this:
Each unique sampling point (UNIQUE) has 4 data points associated with it (they differ by WAVE).
WAVE REFLECT REFEREN PLOT LOCAT COMCOMP DATE UNIQUE
1 679.9 119 0 1 1 1 11.16.12 1
2 799.9 119 0 1 1 1 11.16.12 1
3 899.8 117 0 1 1 1 11.16.12 1
4 970.3 113 0 1 1 1 11.16.12 1
5 679.9 914 31504 1 2 1 11.16.12 2
6 799.9 1693 25194 1 2 1 11.16.12 2
And I want to create a new data frame that will look like this:
For each unique sampling point, I want to calculate "WBI" from 2 specific "WAVE" measurements.
WBI PLOT .... UNIQUE
(WAVE==899.8/WAVE==970) 1 1
(WAVE==899.8/WAVE==970) 1 2
(WAVE==899.8/WAVE==970) 1 3
Depending on the size of your input data.frame there could be better solutions in terms of efficiency, but the following should work fine for small or medium data sets, and is fairly simple:
out.unique = unique(input$UNIQUE)
out.plot = sapply(out.unique, simplify = TRUE, function(uq) {
  # assuming PLOT is simply the first PLOT value among the rows belonging
  # to that unique number; if not, you should change this
  subset(input, subset = UNIQUE == uq)$PLOT[1]
})
out.wbi = sapply(out.unique, simplify = TRUE, function(uq) {
  # not sure how you compose WBI, but I assume it uses the last two
  # records with that unique number, so that it matches the first row
  # of your example output
  uq.subset = subset(input, subset = UNIQUE == uq)
  uq.nrow = nrow(uq.subset)
  paste("(WAVE=", uq.subset$WAVE[uq.nrow - 1], "/WAVE=", uq.subset$WAVE[uq.nrow], ")", sep = "")
})
output = data.frame(WBI = out.wbi, PLOT = out.plot, UNIQUE = out.unique)
If the input data is big, however, you may want to exploit the fact that records seem to be sorted by "UNIQUE"; repeated data.frame subsetting would be costly. Also, both sapply calls could be combined into one, but that makes the code a bit more cumbersome, so I left it like this.
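If WBI is actually a numeric ratio of the REFLECT readings at the two wavelengths (an assumption; the question leaves the formula unspecified), the per-UNIQUE computation can be sketched in plain Python like this, with toy rows mimicking the first sampling point:

```python
# Compute WBI per sampling point as REFLECT at WAVE 899.8 divided by
# REFLECT at WAVE 970.3 (hypothetical formula; adjust to your definition).
rows = [
    {"WAVE": 679.9, "REFLECT": 119, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 799.9, "REFLECT": 119, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 899.8, "REFLECT": 117, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 970.3, "REFLECT": 113, "PLOT": 1, "UNIQUE": 1},
]

output = []
for uq in sorted({r["UNIQUE"] for r in rows}):
    grp = [r for r in rows if r["UNIQUE"] == uq]
    num = next(r["REFLECT"] for r in grp if r["WAVE"] == 899.8)
    den = next(r["REFLECT"] for r in grp if r["WAVE"] == 970.3)
    output.append({"WBI": num / den, "PLOT": grp[0]["PLOT"], "UNIQUE": uq})

print(round(output[0]["WBI"], 3))  # 117/113, about 1.035
```

The grouping mirrors the sapply-over-unique structure above; only the body of the per-group computation changes once the real WBI formula is known.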
