Detecting identical combinations of two columns within an R dataframe

I have this following R dataframe:
OffspringID1 OffspringID2 Relation Replicate
1 ID24 ID1 PO 3
2 ID29 ID31 PO 3
3 ID31 ID82 PO 3
4 ID44 ID75 PO 3
5 ID1 ID24 HS 9
6 ID1 ID51 HS 9
7 ID1 ID54 HS 9
8 ID1 ID55 HS 9
9 ID1 ID83 HS 9
and so on. I would like to count the number of observations per level of the factor "Relation" for each combination of individuals (OffspringID1/OffspringID2) that is identical.
I think I could basically use a simple aggregate call, but as you may see, identical pairs can appear permuted across rows (e.g., rows 1 and 5 contain the same pair of individuals, but in a different order).
How can I take this into account within aggregate? And more generally, is there a rule of thumb for detecting rows that result from column permutations within an R dataframe?
Thank you very much!
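One way to handle this (a sketch, not from the original thread, assuming the data frame is named df and the ID columns are character rather than factor): sort the two IDs within each row with pmin()/pmax() so that permuted pairs collapse to the same key, then aggregate as usual.
# normalize each pair so the lexicographically smaller ID comes first;
# permuted pairs (ID24/ID1 vs ID1/ID24) then share one key
df$key1 <- pmin(df$OffspringID1, df$OffspringID2)
df$key2 <- pmax(df$OffspringID1, df$OffspringID2)
# count observations per level of Relation for each unordered pair
aggregate(Replicate ~ key1 + key2 + Relation, data = df, FUN = length)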

Related

Matching two datasets using different IDs

I have two datasets: one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable that does not allow tracking individuals across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID and year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked), I cannot match on them.
I guess I would need to find a set of variables that uniquely identifies each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables, so I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset, and then match to the cross-sectional dataset based on those.
Thanks in advance!
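Since this thread carries no answer, here is a brute-force sketch under stated assumptions: find_key is a hypothetical helper (not a library function) that searches variable combinations of increasing size until one uniquely identifies every row, then merges on that key. With 200-300 variables, keep max_size small or switch to a greedy strategy.
# mock data from the question
long  <- data.frame(id   = rep(1:2, each = 3),
                    year = rep(1:3, 2),
                    y    = c(10, 20, 30, 15, 20, 5))
cross <- data.frame(id   = c(912, 492, 363, 789, 134, 267),
                    year = rep(1:3, 2),
                    y    = c(10, 20, 30, 15, 25, 5),
                    x    = c(1, 1, 0, 1, 0, 0))
# return the first column set (up to max_size columns) with no duplicated
# rows, or NULL if none exists -- as in this tiny mock, where year/y do
# not uniquely identify rows
find_key <- function(data, candidates, max_size = 3) {
  for (k in seq_len(min(max_size, length(candidates)))) {
    for (cols in combn(candidates, k, simplify = FALSE)) {
      if (anyDuplicated(data[cols]) == 0) return(cols)
    }
  }
  NULL
}
key <- find_key(long, setdiff(names(long), "id"))
if (!is.null(key)) {
  merged <- merge(long, cross[, c(key, "x")], by = key, all.x = TRUE)
}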

R: How to measure difference with both categorical and numeric features

I'm very new to data wrangling. And now I have this problem at hand:
So basically I have used tables of biochemical measurements (all numerical) of patients to perform a cluster analysis, which sorted the patients into 5 clusters.
Then I also have their clinical data/features. Now I want to ask whether any of these clinical features (a mix of numerical and categorical features) are significantly different from one cluster to another. How can I go about this? What test should I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
.....
So you see, the assigned cluster is based on my cluster analysis. FISH, IGHV, and IgG are categorical, and you can see there are sometimes na values, and sometimes one person can have multiple entries ("13q,t12").
As a simplified approach, I could perhaps just take the cluster 1 and 4 patients, omit all rows with na values, and ask whether there is a difference in their age, sex, FISH, IGHV... Still, what method can I use to perform such tests in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields.
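To address the "in one go" part, here is a sketch of my own (not from the answer above) that loops over the clinical features of a hypothetical data frame clin: Kruskal-Wallis for numeric features, chi-squared for categorical ones, with a multiple-testing correction at the end.
# test one feature against cluster membership; `clin` and its column
# names are assumptions based on the question's example table
test_feature <- function(x, cluster) {
  if (is.numeric(x)) {
    kruskal.test(x ~ factor(cluster))$p.value   # numeric feature
  } else {
    chisq.test(table(x, cluster))$p.value       # categorical feature
  }
}
features <- setdiff(names(clin), c("ClusterAssigned", "PatientID"))
pvals <- sapply(features, function(f) test_feature(clin[[f]], clin$ClusterAssigned))
p.adjust(pvals, method = "BH")  # Benjamini-Hochberg correction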

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of the population for each price perception (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive, and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot the four cumulative-share curves. [Plot not shown.]
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
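For the cumulative-share step itself, a sketch using the mock columns above (my own illustration, not part of the question): ecdf() evaluated at the exact unique prices gives the share of respondents at or below each price point, which avoids the bucket problem entirely.
# unique price points across all four perception columns
price <- sort(unique(unlist(df)))
too.cheap     <- ecdf(df$a)(price)       # share answering <= each price
not.bargain   <- 1 - ecdf(df$b)(price)   # inverted, as described above
not.expensive <- 1 - ecdf(df$c)(price)   # inverted
too.expensive <- ecdf(df$d)(price)
shares <- data.frame(price, too.cheap, not.bargain, not.expensive, too.expensive)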
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we use rbind, it creates a matrix, and calling unique on a matrix dispatches to the unique.matrix method:
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows (the default MARGIN is 1) and then looks for unique elements. Instead, if we take the 'price' matrix, either as.vector(price) or c(price) converts it into a vector:
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
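As an aside (not part of the answer above), base R's stack() also collapses the four columns into a single values column, which matches the stacking the question originally asked for:
price <- stack(df)$values   # one column holding all twenty responses
sort(unique(price))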

TraMineR: Can I get the complete sequence if I give an event sub sequence?

I have a sequence dataset like below:
customerid flag 0 1 2 3 4 5 6 7 8 9 10 11
abc234 1 3 4 3 4 5 8 4 3 3 2 14 14
abc233 0 4 4 4 4 4 4 4 4 4 4 4 4
qpr81 0 9 8 7 8 8 7 8 8 7 8 8 7
qnr94 0 14 14 14 2 14 14 14 14 14 14 14 14
Values in columns 0 to 11 are the sequences. There are two sets of customers, with flag=1 and flag=0, and I have the differentiating event subsequences for both sets (only the frequencies and residuals for the two groups are shown here):
Subsequence Freq.0 Freq.1 Resid.0 Resid.1
(3>4) 0.19208177 0.0753386 5.540793 -21.43304
(4>5) 0.15752553 0.059960497 5.115241 -19.78691
(5>4) 0.15950556 0.062782167 5.037413 -19.48586
I want to find the customer ids and the flags for which the event sequences match.
Should I write a Python script to traverse the transactions, or is there some direct method in R to do this?
CODE
--------------
library(TraMineR)
custid=c('a1','a2','a3','b4','b5','c6','c7','d8','d9') #sample customer ids
flag=c(0,0,0,1,0,1,1,0,1)#flag
col1=c(14,14,14,14,14,5,14,14,2)
col2=c(14,14,3,14,3,14,6,3,3)
col3=c(14,2,2,14,2,14,2,2,2)
col4=c(14,2,2,14,2,14,2,2,14)
df=data.frame(custid,flag,col1,col2,col3,col4)#dataframe generation
print(df)
#Defining sequence from col1 to col4
df.s<-seqdef(df,3:6)
print(df.s)
#finding the transitions
transition<-seqetm(df.s,method='transition')
print(transition)
#converting to TSE format
df.tse=seqformat(df.s,from='SPS',to='TSE',tevent = transition)
print(df.tse)
#Event sequence generation
df.seqe=seqecreate(id=df.tse$id,timestamp=df.tse$time,event=df.tse$event)
print(df.seqe)
#subsequences
fsubseq <- seqefsub(df.seqe, pMinSupport = 0.01)
print(fsubseq)
groups <- factor(df$flag>0,labels=c(1,0))
#finding differentiating event sequences based on flag using ChiSquare test
diff <- seqecmpgroup(fsubseq, group = df$flag, method = "chisq")
#Using seqeapplysub for finding the presence of subsequences?
presence=seqeapplysub(fsubseq,method="presence")
print(presence[1:3,3:1])
Thanks
From what I understand, you have state sequences and have transformed them into event sequences using the seqecreate function of TraMineR. The events you are considering are the state changes. Thus (3>4) stands for a subsequence with only one event, namely the event 3>4 (switching from 3 to 4). Then, you identify the event subsequences that best discriminate your two flags using the seqefsub and seqecmpgroup functions.
If this is correct, then you can identify the sequences containing each subsequence with the seqeapplysub function. I cannot illustrate here because you do not provide any code in your question. Look at the online help of the seqeapplysub function.
======= update referring to your added code =======
Here is how you get the ids of the sequences that contain the most discriminating subsequence.
First, we extract the three most discriminating subsequences from your diff object. Second, we compute the presence matrix, which provides a column for each extracted subsequence, with a 1 for the sequences that contain the subsequence and a 0 otherwise.
diffseq <- seqefsub(df.seqe, strsubseq = paste(diff$subseq[1:3]))
(presence=seqeapplysub(diffseq, method="presence"))
Now you get the ids for the first subsequence with
custid[presence[,1]==1]
For the second it would be custid[presence[,2]==1] etc.
Likewise you get the flag with
flag[presence[,1]==1]
Hope this helps.
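Building on that answer, a small sketch of my own that collects the matching ids and flags for all extracted subsequences at once; it assumes the rows of the presence matrix follow the original order of custid:
# one data frame per subsequence column, then row-bind them
matches <- lapply(seq_len(ncol(presence)), function(j) {
  data.frame(subseq = colnames(presence)[j],
             custid = custid[presence[, j] == 1],
             flag   = flag[presence[, j] == 1])
})
do.call(rbind, matches)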

Chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:
present <- c(1:4)
pre.1 <- c(5:8)
pre.2 <- c(9:12)
present2 <- c(13:16)
id <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait <- rnorm(16,10,3)
d <- data.frame(id,prev.id,trait)
print(d):
id prev.id trait
1 1 5 10.693266
2 2 6 12.059654
3 3 7 3.594182
4 4 8 14.411477
5 5 9 10.840814
6 6 10 13.712924
7 7 11 11.258689
8 8 12 10.920899
9 9 NA 14.663039
10 10 NA 5.117289
11 11 NA 8.866973
12 12 NA 15.508879
13 13 NA 14.307738
14 14 NA 15.616640
15 15 NA 10.275843
16 16 NA 12.443139
Every observation has a unique value of id. However, some individuals have been observed in the past, and so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. However, some individuals have been remeasured multiple times. Observations 1-4 have previous IDs of 5-8, and observations 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these were measured. Furthermore, observations 13-16 have never been measured before. So, observations 1-4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unique individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individual's trait. The final output would look like:
id <- c(1:4,13:16)
prev.id <- c(5:8, rep(NA,4))
trait <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8, rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
id prev.id trait prev.trait.1 prev.trait.2
1 1 5 10.693266 10.84081 14.663039
2 2 6 12.059654 13.71292 5.117289
3 3 7 3.594182 11.25869 8.866973
4 4 8 14.411477 10.92090 15.508879
5 13 NA 14.307738 NA NA
6 14 NA 15.616640 NA NA
7 15 NA 10.275843 NA NA
8 16 NA 12.443139 NA NA
I can accomplish this in a straightforward manner, but it requires me coding an additional pairing for each previous observation, such that the number of code groups I need to write is the number of times any individual has been recorded. This is a pain, as in the data set I am applying this problem to, there may be anywhere from 0-100 previous observations of an individual.
#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)
#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)
#remove observations that are another individuals previous observation
d <- d[!(d$id %in% d$prev.id),]
How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?
What you have is a forest of linear lists. We'll start at the terminal ends:
roots <- d$id[is.na(d$prev.id)] # observations with no prior record
And trace each path from those ends forward to the most recent observation:
path <- function(node) {
  a <- integer(nrow(d))
  i <- 0
  while (!is.na(node)) {
    i <- i + 1
    a[i] <- node
    # move to the observation whose prev.id is the current node
    node <- d$id[match(node, d$prev.id)]
  }
  # reverse so the most recent observation comes first
  return(rev(a[1:i]))
}
Then we can get a 'stacked' representation of your desired output with
x <- do.call(rbind, lapply(roots, function(r) {
  p <- path(r)
  data.frame(id = p[[1]], seq = seq_along(p), traits = d$trait[p])
}))
And then use reshape2::dcast to get it in the desired shape
library(reshape2)
dcast(x, id ~ seq, fill = NA, value.var = 'traits')
id 1 2 3
1 1 10.693266 10.84081 14.663039
2 2 12.059654 13.71292 5.117289
3 3 3.594182 11.25869 8.866973
4 4 14.411477 10.92090 15.508879
5 13 14.307738 NA NA
6 14 15.616640 NA NA
7 15 10.275843 NA NA
8 16 12.443139 NA NA
I leave it to you to adapt column names.
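For instance, a sketch of one way to rename them (my addition, storing the dcast result first):
out <- dcast(x, id ~ seq, fill = NA, value.var = 'traits')
# first measurement column becomes 'trait', the rest 'prev.trait.1', 'prev.trait.2', ...
names(out)[-1] <- c('trait', paste0('prev.trait.', seq_len(ncol(out) - 2)))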
