Comparing vector values by shifting reading frame - r

I have "Y maze" sequence data containing the characters A, B, and C. I am trying to quantify the number of times those three values are found together. The data looks like this:
Animal=c(1,2,3,4,5)
VisitedZones=c(1,2,3,4,5)
data=data.frame(Animal, VisitedZones)
data[1,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C")
data[2,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B")
data[3,2]=("A,C,B,A,C,A,B,A,C,A")
data[4,2]=("A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B")
data[5,2]=("A,C,B,A,C,A,A,A,B,")
The tricky part is that I also have to consider the reading frame so that I can find all instances of ABC combinations. There are three reading frames. For example:
Here is the working example I have so far.
Split <- strsplit(data$VisitedZones, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),ncol = max(Ncol),
dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),sequence(Ncol))] <- unlist(Split,
use.names = FALSE)
# Bind the values back together, here as a "data.table" (faster)
v2=data.table(Animal = data$Animal, M)
# I get error here
df=mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
It would be great if I could get something like this:
Animal VisitedZones    ABC ACB BCA BAC CAB CBA
1      A,B,C,A,B,C...    2   0   1   0   1   0
2      A,B,C,C...        1   0   0   0   0   0
3      A,C,B,A...        0   1   0   0   0   1

df<-mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
Using dplyr, I generate for every letter in your vector the three-letter combination that starts from it, then create a table of frequencies of all found combinations (minus the last two, which are incomplete).
Result:
AAB ABC BCA CAA CAB
  1   6   5   1   4
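For readers who prefer not to load dplyr, the same sliding-window idea can be sketched in base R. This is only an illustration under stated assumptions: count_trios is a hypothetical helper name, and the sample vector is animal 3's sequence from the question.

```r
# Hypothetical helper: count every overlapping three-letter window,
# which covers all three reading frames at once.
count_trios <- function(zones) {
  n <- length(zones)
  if (n < 3) return(table(character(0)))  # too short to form a trio
  # Paste each element with the next two; the last two starts are dropped
  trios <- paste0(zones[1:(n - 2)], zones[2:(n - 1)], zones[3:n])
  table(trios)
}

v <- strsplit("A,C,B,A,C,A,B,A,C,A", ",", fixed = TRUE)[[1]]
counts <- count_trios(v)
print(counts)
```

For this input there are eight windows in total, and the trio BAC is counted twice.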

Your revised question is basically completely different, so I'll answer it here.
First, I would say your data structure doesn't make much sense to me, so I'll start out by reshaping it into something I can work with:
v2<-as.data.frame(t(v2))
Flip it over so the letters are in columns, not rows;
v2<-tidyr::gather(v2,"v","letter",na.rm=T)
Melt the table so it's long data (so that I'll be able to use lead etc).
v2<-group_by(v2,v)
df=mutate(v2,trio=paste0(letter,lead(letter),lead(letter,2)))
This brings us back basically to where we were at the end of the last question, only the data is grouped by the "animal" variable (here called "v" and represented by V1 thru V5).
df<-df[!grepl("NA",df$trio),]
Even though we removed the unnecessary NA's, we still end up having those pesky ABNA and ANANA etc at the end of each group, so this line will remove anything with an NA in it.
tt<-table(df$v,df$trio)
And finally, we create the table but also break it by "v". The result is this:
   AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
V1   0   0   1   3   2   1   2   1   1   0   1   3   1   1   1
V2   0   0   1   3   2   0   2   0   0   0   1   2   1   0   0
V3   0   0   1   2   1   0   2   0   0   0   1   0   1   0   0
V4   1   1   1   3   2   0   2   0   0   1   0   2   1   0   0
V5   1   1   0   1   1   0   1   0   0   1   0   0   1   0   0
You can now cbind it to your original data to get something like what you described, but it requires just an additional step, because of the way table saves its results:
data<-cbind(data,tidyr::spread(as.data.frame(tt),Var2,Freq))[,-3]
Which ends up looking like this:
  Animal                            VisitedZones AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
1      1 A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C   0   0   1   3   2   1   2   1   1   0   1   3   1   1   1
2      2             A,C,B,A,C,A,B,A,C,A,C,A,C,B   0   0   1   3   2   0   2   0   0   0   1   2   1   0   0
3      3                     A,C,B,A,C,A,B,A,C,A   0   0   1   2   1   0   2   0   0   0   1   0   1   0   0
4      4         A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B   1   1   1   3   2   0   2   0   0   1   0   2   1   0   0
5      5                      A,C,B,A,C,A,A,A,B,   1   1   0   1   1   0   1   0   0   1   0   0   1   0   0
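The question asked for one column per ordering of A, B and C only. A minimal base R sketch of that narrower output follows; it assumes the six orderings are the only columns wanted, and count_perms and perms are hypothetical names introduced for illustration. Fixing the factor levels keeps all six columns even when a count is zero.

```r
zones <- c(
  "A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C",
  "A,C,B,A,C,A,B,A,C,A,C,A,C,B",
  "A,C,B,A,C,A,B,A,C,A",
  "A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B",
  "A,C,B,A,C,A,A,A,B,"
)
data <- data.frame(Animal = 1:5, VisitedZones = zones, stringsAsFactors = FALSE)

# The six orderings of three distinct zones (assumed to be the target columns)
perms <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")

# Hypothetical helper: count overlapping trios in one comma-separated string
count_perms <- function(s) {
  z <- strsplit(s, ",", fixed = TRUE)[[1]]  # a trailing comma adds no element
  n <- length(z)
  trios <- if (n >= 3) paste0(z[1:(n - 2)], z[2:(n - 1)], z[3:n]) else character(0)
  # Fixed levels: trios outside perms are dropped, absent ones count as 0
  as.integer(table(factor(trios, levels = perms)))
}

counts <- t(vapply(data$VisitedZones, count_perms, integer(6), USE.NAMES = FALSE))
colnames(counts) <- perms
result <- cbind(data, counts)
print(result)
```

Trios with a repeated letter (AAA, ACA and so on) are simply ignored here, which is a design choice rather than something the answer above prescribes.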

Related

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each variable shows up as either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to be able to add the number of "1"s in each row across the selected columns/variables I just made (using the code above) into another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result of this was 14 for all my rows. I was expecting something <9 since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to focus on summing some variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add the number of 1's in each row (within the columns I want to select). How would I go about adding up the 1's in code (from a set of variables/columns)?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
data <- matrix(rnorm(100,100,30),nrow = 10)
# recode to binary
data <- apply(data,2,function(x){x <- ifelse(x > 100,1,0)})
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V",1:5),paste0("X",1:5))
as.data.frame(data) %>%
mutate(total = select(.,starts_with("V")) %>% rowSums)
...and the output, where the sums should equal the sum of V1 - V5 but not X1 - X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
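The reason the original attempt returned the same number in every row is that sum() collapses everything it is given into one grand total, whereas rowSums() works row by row. A small base R sketch of the difference, using a made-up toy version of jb (the values are invented for illustration and are not the asker's data):

```r
# Toy data: three rows, five binary content columns (values are invented)
jb <- data.frame(
  V1 = c(1, 2, 2),
  Finances = c(1, 0, 0),
  Exercise = c(1, 1, 0),
  Volunteer = c(1, 0, 0),
  Relationships = c(1, 0, 0),
  Laugh = c(0, 1, 1)
)

content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# sum() returns a single scalar, which mutate() would recycle into every row
grand_total <- sum(jb[, content_cols])

# rowSums() returns one value per row, restricted to the selected columns
jb$sum.content <- rowSums(jb[, content_cols])

print(grand_total)
print(jb)
```

Listing the column names in a character vector keeps V1 and any other unselected columns in the data frame while leaving them out of the sum.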

Match combinations of row values between 2 different data frames

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changing the sample data output, since no "t" should be present
The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.
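The step of turning the row names into an explicit phenotype column is not shown above, so here is a hedged base R sketch of it, followed by a merge() that plays the role of the left join. It assumes combinations_df carries the phenotype letters as row names (rebuilt here with expand.grid, which happens to reproduce the question's ordering), and row_id is a hypothetical helper column used only to restore the original row order, since merge() may reorder rows.

```r
markers <- c("FITC", "Cy3", "TX_RED", "Cy5")

# Rebuild the 16 combinations; the first factor varies fastest, as in the question
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

# Make the phenotype an explicit variable instead of a row name
combinations_df$phenotype <- rownames(combinations_df)

main_df <- read.table(text = "
a b FITC d Cy3 f TX_RED h Cy5 j
0 1 1 1 1 0 1 1 1 1
0 1 0 1 1 0 1 0 1 1
1 1 0 0 0 1 1 0 0 0
0 1 1 1 1 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0
", header = TRUE)

# Keep track of the original order, join on the four markers, then restore it
main_df$row_id <- seq_len(nrow(main_df))
joined <- merge(main_df, combinations_df, by = markers, all.x = TRUE)
joined <- joined[order(joined$row_id), ]
print(joined$phenotype)
```

Note that merge() moves the key columns to the front of the result, so the column order differs from the dplyr output even though the phenotypes are the same.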
It's not very elegant, but this method works just fine. There are no nested loops here, so it should run reasonably fast. You might try matching on the data frame rows directly and do away with the loops altogether, but this was the fastest way I could figure out. You might also look at the plyr or data.table packages, which are very powerful for this kind of thing.
main_text=NULL
for(i in 1:length(main_df[,1])){
main_text[i]<-paste(main_df[i,3],main_df[i,5],main_df[i,7],main_df[i,9],sep="")
}
comb_text=NULL
for(i in 1:length(combinations_df[,1])){
comb_text[i]<-paste(combinations_df[i,1],combinations_df[i,2],combinations_df[i,3],combinations_df[i,4],sep="")
}
rownames(combinations_df)[match(main_text,comb_text)]
How about something like this? My results are different from yours because there is no "t" in combination_df. You could do it without assigning a new column if you wanted; this is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=T)
main_df <- read.table("Documents/main.txt", header=T)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]
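With close to a million rows, it may be worth noting that the lookup can also be done with plain arithmetic, because the four markers are binary. The sketch below rests on an assumption that holds for the table in the question but is not guaranteed in general: the combinations are listed in binary counting order with FITC as the lowest bit, so that row "a" is 0000, "b" is FITC only, "c" is Cy3 only, and so on.

```r
main_df <- read.table(text = "
a b FITC d Cy3 f TX_RED h Cy5 j
0 1 1 1 1 0 1 1 1 1
0 1 0 1 1 0 1 0 1 1
1 1 0 0 0 1 1 0 0 0
0 1 1 1 1 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0
", header = TRUE)

# Treat the four markers as bits of a number between 0 and 15,
# then add 1 to get a position in the 16-row combinations table.
idx <- 1 + main_df$FITC + 2 * main_df$Cy3 + 4 * main_df$TX_RED + 8 * main_df$Cy5

# Assumes phenotypes are labelled a..p in that same order
phenotype <- letters[idx]
print(phenotype)
```

This avoids building string keys altogether, at the cost of hard-coding the ordering of the combinations table.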

generalized aggregate by row

I would like to aggregate by row. I know how to do this and have answered several questions here from others asking for help doing it. However, I want to generalize the aggregate formula and ideally not have the aggregated rows in a different order than they first appear in the original data set.
Here is an example set:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
and my desired result:
desired.result <- read.table(text = '
0 0 0 1 2
2 2 2 2 2
0 4 0 0 2
2 2 0 0 4
', header = FALSE)
Here is one way to obtain the answer, albeit the rows are not in their original order:
my.data[,(ncol(my.data)+1)] = 1
aggregate(V5 ~ V1 + V2 + V3 + V4, FUN = sum, data=my.data)
V1 V2 V3 V4 V5
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
Here is an unsuccessful attempt to generalize the aggregate formula:
with(my.data, aggregate(my.data[,ncol(my.data)], by = list(paste0('V', seq(1, ncol(my.data)-1))), FUN = sum))
The order of the result is less important than the generalization.
Thank you for any advice.
Since it turned out that the desired result is just the frequency counts of unique rows, you could/should use table (as mentioned in the comments). table calls factor on its arguments, and factor, if "levels" is not specified, uses the sorted unique values of its input as the levels (unique by itself does not sort). So, for table to "see" your levels in the desired order of rows, you need to call table on a factor whose levels are specified explicitly.
tmp = do.call(paste, my.data)
as.data.frame(table(tmp))
# tmp Freq
#1 0 0 0 1 2
#2 0 4 0 0 2
#3 2 2 0 0 4
#4 2 2 2 2 2
res = table(factor(tmp, unique(tmp)))
as.data.frame(res)
# Var1 Freq
#1 0 0 0 1 2
#2 2 2 2 2 2
#3 0 4 0 0 2
#4 2 2 0 0 4
Instead of calling as.data.frame.table -where your rows have been concatenated- you could take advantage of unique.data.frame and use a call like:
data.frame(unique(my.data), unclass(res))
# V1 V2 V3 V4 unclass.res.
#1 0 0 0 1 2
#3 2 2 2 2 2
#5 0 4 0 0 2
#7 2 2 0 0 4
It might be useful to mention that the count function in the plyr package can also aggregate this quickly, although you would still lose the original order of rows.
> library(plyr)
> x <- count(my.data)
> x
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
## 4 2 2 2 2 2
To order the table as desired.result shows (borrowing a snippet from @alexis_laz):
> pst <- do.call(paste, my.data)
> x[order(x$freq, as.factor(unique(pst))), ]
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 4 2 2 2 2 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
I like the posted answers, especially the answer by @alexis_laz since I tend to prefer base R. However, here is a general answer using aggregate. The order of the rows in the output differs from the order of their first appearance in the original data set, but at least the rows are tallied:
I borrowed the . in aggregate from @alexis_laz's comment:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
my.data
my.count = rep(1, nrow(my.data))
my.count
aggregate(my.count ~ ., FUN = sum, data=my.data)
V1 V2 V3 V4 my.count
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
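If the original order of first appearance matters, the aggregate result can be reordered afterwards by matching pasted row keys. This is a minimal sketch, assuming a space-separated key is unambiguous for the data; the names count, value_cols, first_seen and agg_key are introduced here for illustration.

```r
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)

# Tally identical rows; the "." stands for all other columns
my.data$count <- 1
agg <- aggregate(count ~ ., data = my.data, FUN = sum)

# Build one key per row, then put the aggregated rows in first-seen order
value_cols <- setdiff(names(agg), "count")
first_seen <- unique(do.call(paste, my.data[value_cols]))
agg_key <- do.call(paste, agg[value_cols])
result <- agg[match(first_seen, agg_key), ]
rownames(result) <- NULL
print(result)
```

Because the value columns are found with setdiff, the same code should work unchanged for any number of columns.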

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not contain complete information, by which I mean all cases that do not repeat 8 times. Can someone help me find a solution?
Below is an example: I have three subjects {A, B, and C}. I have 8 observations for A and B, but only 2 for C. How can I delete the rows for C, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
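The approaches above can be wrapped into a small reusable function. The sketch below uses ave() on row indices so that the group sizes stay numeric; keep_complete is a hypothetical name, and the toy data only mimics the shape of the example (8 rows for A and B, 2 for C) rather than reproducing its values.

```r
# Toy data with the same group sizes as the example (values are invented)
temp <- data.frame(
  V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)),
  V2 = 1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only subjects with exactly n_required rows
keep_complete <- function(df, id_col, n_required) {
  # For every row, the number of rows sharing its subject id
  n_obs <- ave(seq_len(nrow(df)), df[[id_col]], FUN = length)
  df[n_obs == n_required, , drop = FALSE]
}

complete <- keep_complete(temp, "V1", 8)
print(table(complete$V1))
```

Unlike the rle() variant, this does not require the rows to be sorted by subject.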

randomly sum values from rows and assign them to 2 columns in R

I have a data.frame with 8 columns. One is for the list of subjects (one row per subject) and the other 7 columns each hold a score of either 1 or 0.
This is what the data looks like:
>head(splitkscores)
subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1
I want to create a data.frame with 3 columns. One column for subjects. In the other two columns, one must have the sum of 3 or 4 randomly chosen numbers from each row of my data.frame (except the subject) and the other column must have the sum of the remaining values which were not chosen in the first random sample.
Help is much appreciated.
Thanks in advance
Here's a neat and tidy solution free of unnecessary complexity (assume the input is called df):
chosen=sort(sample(setdiff(colnames(df),"subject"),sample(c(3,4),1)))
notchosen=setdiff(colnames(df),c("subject",chosen))
out=data.frame(subject=df$subject,
sum1=apply(df[,chosen],1,sum),sum2=apply(df[,notchosen],1,sum))
In plain English: sample from the column names other than "subject", choosing a sample size of either 3 or 4, and call those column names chosen; define notchosen to be the other columns (excluding "subject" again, obviously); then return a data frame with the list of subjects, the sum of the chosen columns, and the sum of the non-chosen columns. Done.
I think this'll do it: [changed the way data were read in based on the other response because I made a manual mistake...]
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)
df2 <- data.frame(subject = splitkscores$subject, sum3or4 = NA, leftover = NA)
df2$sum3or4 <- apply(splitkscores[,2:ncol(splitkscores)], 1, function(x){
sum(sample(x, sample(c(3,4),1), replace = FALSE))
})
df2$leftover <- rowSums(splitkscores[,2:ncol(splitkscores)]) - df2$sum3or4
df2
subject sum3or4 leftover
1 40002 1 0
2 40002 2 1
3 40002 3 4
4 40002 1 2
5 40002 2 1
6 40002 1 4
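Note that the two answers differ: the first picks one set of columns for every row, while the second draws a fresh sample per row. A hedged sketch of the per-row variant that sums both halves directly by position is shown below; split_row is a hypothetical helper, and the seed is arbitrary and only there to make a run repeatable.

```r
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)

set.seed(42)  # arbitrary seed, for repeatability only

scores <- as.matrix(splitkscores[, -1])  # drop the subject column

# Hypothetical helper: choose 3 or 4 positions at random, sum them,
# and sum whatever positions were not chosen.
split_row <- function(x) {
  k <- sample(c(3, 4), 1)
  chosen <- sample(seq_along(x), k)
  c(sum_chosen = sum(x[chosen]), sum_rest = sum(x[-chosen]))
}

sums <- t(apply(scores, 1, split_row))
out <- data.frame(subject = splitkscores$subject, sums)
print(out)
```

Whatever the random draw, the two sums in each row should add up to that row's total score, which gives a quick sanity check.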
