Discovering the dependency relations among the samples of a data set


In R, let's say I have a data set similar to the one below:
ID Activity
1 a
1 b
2 a
3 c
2 a
1 c
4 a
4 b
3 b
4 c
As you can see, each ID has a sequence of activities. What matters is the number of times an activity is followed by each of the other ones.
The results I am looking for are:
1. The variants that exist in the dataset (the sequence for each ID), like:
<a,b,c> : id: 1 & 4
<a,a> : id: 2
<c,b> : id: 3
2. A matrix like the following, which shows the number of times an activity (row) is followed by another one (column):
  a b c
a 1 2 0
b 0 0 2
c 0 1 0
Thank you for your help.

Here is a solution with data.table
library(data.table)
dt <- data.table(ID=c(1,1,2,3,2,1,4,4,3,4),Activity=c("a","b","a","c","a","c","a","b","b","c"))
IDs per Sequence:
dt[,.(seq=paste(Activity,collapse = ",")),ID][,.(ids=paste(ID,collapse = ",")),seq]
The counts of consecutive pairs can be obtained quickly in long format:
consecutive_id <- dt[,.(first=(Activity),second=(shift(Activity,type = "lead"))),ID][!is.na(second)]
consecutive <- consecutive_id[,.N,.(first,second)]
but if you need it in the matrix form a few extra steps are needed:
classes <- dt[,unique(Activity)];n <- length(classes)
M_consecutive <- data.table(matrix(0,nrow = n,ncol=n))
setnames(M_consecutive,classes)
M_consecutive$classes <- classes; setkey(M_consecutive,classes)
for(i in seq_len(nrow(consecutive))) M_consecutive[consecutive[i]$first,(consecutive[i]$second):=consecutive[i]$N]
M_consecutive
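For comparison, both results can also be sketched in base R, without data.table. This is only an illustrative alternative, not the method above. The names `df`, `acts`, `seqs`, `variants`, `pairs` and `trans` are made up for this example, and it assumes that the rows of each ID are already in chronological order.

```r
# Same sample data as in the question
df <- data.frame(
  ID = c(1, 1, 2, 3, 2, 1, 4, 4, 3, 4),
  Activity = c("a", "b", "a", "c", "a", "c", "a", "b", "b", "c")
)

# Activities per ID, in row order
acts <- split(as.character(df$Activity), df$ID)

# Variants: one sequence string per ID, then the IDs that share each sequence
seqs <- sapply(acts, paste, collapse = ",")
variants <- split(names(seqs), seqs)

# Consecutive (first, second) pairs within each ID
pairs <- do.call(rbind, lapply(acts, function(a) {
  data.frame(first = head(a, -1), second = tail(a, -1))
}))

# Cross-tabulate the pairs; fixed levels keep every activity as a row and column
lev <- sort(unique(as.character(df$Activity)))
trans <- table(first = factor(pairs$first, levels = lev),
               second = factor(pairs$second, levels = lev))

variants
trans  # a -> b and b -> c each occur twice (IDs 1 and 4)
```

Rows of `trans` are the leading activity and columns the following one, so `trans["a", "b"]` is the number of times a is directly followed by b.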

Related

How to get which() to return similar indices from two dataframes?

I have two dataframes (ma.sig, pricebreak) that look like this:
ma.sig:
Date A B C
01/1 1 0 1
02/1 1 0 1
pricebreak:
Date D E G
01/1 1 0 1
02/1 0 1 0
For starters, I just want to retrieve the column indices for all non-zero values in the first row. I tried doing this via the following methods:
sig <- which(!ma.sig[1,]==0&!pricebreak[1,]==0)
and
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
I would like it to return something like: 1, 3 (based on the above sample dataframes). However, I get this logical vector:
[1] TRUE FALSE TRUE
How do I get it to return the column indices? I do not want to use merge to merge my dataframes because of the nature of the data.
EDIT: Just for background information, the above data frames are 'signals' that are on when the values are non-zero. I'm trying to use sig to collect indices that I can use for my main dataframe so that I can only calculate and print outputs when the signals are on.

@serhatCevikel has already given the answer.
I am just trying to explain it in more detail for your convenience.
ma.sig =
Date A B C
01/1 1 0 1
02/1 1 0 1
pricebreak =
Date D E G
01/1 1 0 1
02/1 0 1 0
Now as per your method:
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
print(sig)
gives:
TRUE TRUE TRUE
Now try:
which(sig)
it will return the indices of the TRUE values:
1 2 3
Please let me know if you get this. I have checked it twice in my terminal. Hope you will get this too.
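One caveat: `which(sig)` indexes into the result of the two `which()` calls, so its output is not necessarily the same as the column positions. A minimal sketch of one way to get the column indices directly is to combine the two logical tests first and call `which()` once. It assumes that Date is the first column of both data frames and that the signal columns line up position by position; the data frames are rebuilt here from the sample in the question, and `both_on` is a name made up for this example.

```r
# Sample data from the question
ma.sig <- data.frame(Date = c("01/1", "02/1"), A = c(1, 1), B = c(0, 0), C = c(1, 1))
pricebreak <- data.frame(Date = c("01/1", "02/1"), D = c(1, 0), E = c(0, 1), G = c(1, 0))

# First row of each frame without the Date column, as plain numeric vectors
both_on <- unlist(ma.sig[1, -1]) != 0 & unlist(pricebreak[1, -1]) != 0

# Positions (among the signal columns) where both signals are on
sig <- unname(which(both_on))
sig
# [1] 1 3
```

The indices count the signal columns only; add 1 if positions in the full data frame (including Date) are needed.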

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
if(!anyDuplicated(data[,c("part","site","index")]))
{ break }
data$index<-ifelse(duplicated(data[,c("part","site","index")]),data$index+1,data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
if(!anyDuplicated(df[,c(1,3)]))
{ break }
df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt

We may use diff and cumsum on the 'Run' column to get the expected output. With this method we do not create a column of 1s (i.e. 'index'), and we assume that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
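Since the question mentions parts that may be tested a third time, here is a small sketch checking that the `ave` approach keeps counting beyond 2. The data frame `retest` and its values are invented purely for illustration.

```r
# Hypothetical data: part A is tested three times at site N
part <- c("A", "A", "B", "A", "B", "A")
site <- c("N", "S", "N", "N", "N", "N")
retest <- data.frame(part, site)

# Running count of each (part, site) combination, in row order
retest$index <- with(retest, ave(seq_along(part), part, site, FUN = seq_along))
retest$index
# [1] 1 1 1 2 2 3
```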

I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE

Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]

Use value in new variable name

I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, dataframes) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats it like part of the string, doesn't insert the value for the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
SiteNum <- 1
for (Site in CoralSites){
  Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
  CGrfreq_SiteNum <- numeric(length(CoralGenera))
  for (Genus in CoralGenera){
    CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
    GenusNum <- GenusNum + 1
  }
  names(CGrfreq_SiteNum) <- c(CoralGenera)
  Site_SiteNum <- c(Site)
  CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
  SiteNum <- SiteNum + 1
}

Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign, it is not good practice for a few reasons:
1. It makes the code that generates the variables more complicated, because it will be littered with calls to assign, get and paste0.
2. It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
Instead, you'll find it easier to work with other R functions that will perform the aggregation for you. In this case you're looking to generate for each Site/Genus pairing the percentage of data points at the site with the particular genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80
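If separate per-site objects are still wanted, a named list is a common alternative to creating numbered variables with assign. The following is a rough sketch under that assumption, reusing the sample data above. The names `per_site` and `combined` are made up for this example.

```r
# Same sample data as above
dat <- data.frame(Site = c(rep(1, 5), rep(2, 5)),
                  Genus = c(rep("A", 3), rep("B", 6), "A"))

# One data frame of percentages per site, stored in a list named by site
per_site <- lapply(split(dat, dat$Site), function(d) {
  freq <- 100 * prop.table(table(Genus = d$Genus))
  data.frame(Site = d$Site[1], Genus = names(freq), Freq = as.vector(freq))
})

# Individual sites are reached by name, e.g. per_site[["2"]]
combined <- do.call(rbind, per_site)
rownames(combined) <- NULL
combined
```

The list elements play the role of `CG_1`, `CG_2`, ... and `do.call(rbind, ...)` stacks them into the single Site/Genus/Freq table described in the question.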

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused about how to go about this. Say I have two columns in a dataframe. One column is a numerical series in order (x), the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below, there are 10 photos, but only 6 unique individuals. In the y column, the corresponding x is reported if there is a match. y is -1 for no match (these might as well be NAs). If there are more than 2 photos per individual, the match # will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period in which the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...

Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to the lower numbers that belong to the same person. Note that the names are strings, even though they consist only of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many steps it might take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
oldx = newx
newx = mapdown[as.character(oldx)]
}
The result is the set each photo belongs to, named by the lowest number in that set.
DF$id <- unname(newx)
Getting the group membership is harder. Use reshape2 to convert this into wide format (one column per group), where the column is "1" if there was something in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
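As a cross-check, the same grouping strings can be sketched in base R with a contingency table instead of dcast. This is an illustrative shortcut rather than the method above: it assumes every match points directly at the first photo of an individual (true for this example), so the while loop can be replaced by a single `ifelse`. The names `present` and `grouping` are made up for this example.

```r
# Example data as used in this answer
x <- 1:10
y <- c(-1, -1, -1, -1, 1, -1, 1, -1, 3, 4)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Assumption: y always refers to an unmatched photo, so one step is enough
id <- ifelse(y == -1, x, y)

# TRUE where an individual appears at least once in a group
present <- table(id, group) > 0

# Collapse each row into a string such as "101"
grouping <- apply(present, 1, function(r) paste(as.integer(r), collapse = ""))
grouping
```

The result is a character vector named by id; for longer chains of matches the iterative mapping above is still needed.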
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, then you can use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?)
EDIT:
Now it agrees.

Another solution for bonus variable
f_bonus <- function(data = DF) {
  # rows without a match are the first photo of each individual
  data_a <- subset(data, y == -1, select = x)
  data_a$pos <- seq(nrow(data_a))
  # matched rows take the position of the photo they point to
  data_b <- subset(data, y != -1, select = c(x, y))
  data_b$pos <- match(data_b$y, data_a$x)
  data_t <- rbind(data_a, data_b[-2])
  data_t <- with(data_t, tapply(x, pos, paste, sep = "", collapse = ","))
  return(data_t)
}

Accessing high dimension tables - clearer way to index the different dimensions?

I was wondering if there is a clearer, easier way of accessing the different dimensions of a table.
I have this code
datasettable = addmargins(
table(dataset[c('observer','condition','stimulus1', 'stimulus2','response')]),
4, FUN = sum)
And I access it the following way:
datasettable[,'u',1,'sum',]
However, I find accessing it this way somewhat confusing. Because the indices for the different dimensions are only separated by commas, it is easy to mix them up.
Is there a way to specify the indices for the different dimensions by name (especially important for numerical indices), such as with
datasettable ['observer'=='ALL','condition'=='u',
'stimulus1'==1, 'stimulus2'=='sum','response'=='ALL']

I'll make up some data (hint: including data of your own helps you get better answers; dput can be a great tool for that).
dataset <- expand.grid(observer=LETTERS[1:3], condition=c("u","v"),
stimulus1=1:2, stimulus2=1:2)
set.seed(5)
dataset$response <- sample(1:4, nrow(dataset), replace=TRUE)
datasettable <- addmargins(table(dataset), 4, FUN = sum)
What you suggest is this:
> datasettable[,'u',1,'sum',]
response
observer 1 2 3 4
A 1 1 0 0
B 0 0 2 0
C 0 1 0 1
I'd probably get the total without first converting to a table, perhaps using the reshape package, like this:
> library(reshape)
> dw <- cast(dataset, condition + stimulus1 + observer ~ response,
fun.aggregate=length, value="stimulus2")
> subset(dw, condition=="u" & stimulus1==1)
condition stimulus1 observer 1 2 3 4
1 u 1 A 1 1 0 0
2 u 1 B 0 0 2 0
3 u 1 C 0 1 0 1
But to answer your question, no, I don't think there's an alternate way built in to access parts of a table, but you could certainly build one, maybe like this:
tableaccess <- function(tabl, ...) {
d <- list(...)
vv <- c(list(tabl), as.list(rep(TRUE, length(dim(tabl)))))
vv[match(names(d), names(dimnames(tabl)))+1] <- d
do.call(`[`, vv)
}
with a result of
> tableaccess(datasettable, condition='u', stimulus1=1, stimulus2='sum')
response
observer 1 2 3 4
A 1 1 0 0
B 0 0 2 0
C 0 1 0 1

For what you described, you can use the subset function:
subset(datasettable, observer == 'ALL' & condition == 'u' &
stimulus1 == 1 & stimulus2 == 'sum' & response == 'ALL')
This, of course, assumes datasettable is a data.frame.
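If `datasettable` is a table rather than a data.frame, one way to get there is to convert it to long format first. The sketch below uses made-up data with the same column names as the question; the `response` values and the names `long`, `count` and `picked` are assumptions for illustration only.

```r
# Made-up data with the same structure as in the question
dataset <- expand.grid(observer = LETTERS[1:3], condition = c("u", "v"),
                       stimulus1 = 1:2, stimulus2 = 1:2)
dataset$response <- rep(1:4, length.out = nrow(dataset))
datasettable <- addmargins(table(dataset), 4, FUN = sum)

# Long format: one row per cell, dimensions as columns, cell value in 'count'
long <- as.data.frame.table(datasettable, responseName = "count")

# Dimensions can now be selected by name
picked <- subset(long, condition == "u" & stimulus1 == 1 & stimulus2 == "sum")
picked
```

Each dimension becomes a factor column, so comparisons such as `stimulus1 == 1` match on the level label.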
