I have a dataframe like below.
11,15,12,25
11,12
15,25
134,45,56
46
45,56
15,12
66,45,56,24,14,11,25,12,134
I want to identify the frequency of pairs, triplets, or larger item sets that occur in the data. For example, in the data above the pair counts look like this:
item No of occurrence
11,12 3
11,25 2
15,12 2
15,25 2
.
.
45,56 3
134,45,56 2 ....and so on
I am trying to write R code for this and am having difficulty finding an approach.
Given a 1 column data.frame with commas separating the variables, the following should produce your desired result:
# split column into a list
myList <- strsplit(df$V1, split=",")
# get all pairwise combinations
myCombos <- t(combn(unique(unlist(myList)), 2))
# count the instances where the pair is present
myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
  sum(sapply(myList, function(j) {
    sum(!is.na(match(c(myCombos[i,]), j)))}) == 2)})
# construct final matrix
allDone <- cbind(matrix(as.integer(myCombos), nrow(myCombos)), myCounts)
This returns a matrix in which the first two columns are the items being compared and the third column is the number of rows of the data.frame that contain both items.
data
df <- read.table(text="11,15,12,25
11,12
15,25
134,45,56
46
45,56
15,12
66,45,56,24,14,11,25,12,134", as.is=TRUE)
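The code above covers pairs only, while the question also asks about triplets and larger sets. One way to generalise it, as a rough sketch rather than a tuned solution, is to list the k-item subsets of each row and tabulate them. The helper name count_sets is made up for this example, and the items within each key are sorted as text, so the pair written as 15,12 in the question shows up as "12,15" here.

```r
# Sample data from the question, one comma-separated set of items per row
df <- read.table(text = "11,15,12,25
11,12
15,25
134,45,56
46
45,56
15,12
66,45,56,24,14,11,25,12,134", as.is = TRUE)

baskets <- strsplit(df$V1, split = ",")

# count_sets() is a hypothetical helper name, not part of any package
count_sets <- function(baskets, k) {
  # list the k-item subsets of every row that has at least k items
  keys <- unlist(lapply(baskets, function(items) {
    items <- sort(unique(items))
    if (length(items) < k) return(character(0))
    combn(items, k, FUN = paste, collapse = ",")
  }))
  # count how often each subset was seen, most frequent first
  sort(table(keys), decreasing = TRUE)
}

pairs <- count_sets(baskets, 2)
triplets <- count_sets(baskets, 3)

pairs["11,12"]
triplets["134,45,56"]
```

Unlike the matrix approach, this only reports sets that occur at least once. The number of subsets grows quickly with row length, so for long rows or large k a dedicated frequent-itemset tool is probably a better fit.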
I have a dataframe composed by several paired columns. So, for example, the first column is a list of names and the second column contains numeric values quantifying the variables of the first column. In the third column I have again a list of names and the fourth column is numeric and quantifies variables of the third column and so on.
I now want to automatically subset the first two columns to make a separate dataframe and the third-fourth columns to make a second dataframe. The final aim is to align the rows by name.
For example, from dataframe a
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
I would obtain a dataframe containing names_a and values_a and another dataframe containing names_b and values_b, then aligning them to have dataframe a1:
names_a1<-c("a","b","c","d","e","f")
values_a1<-c(1,2,3,4,0,0)
values_b1<-c(5,6,0,0,7,8)
a1<-as.data.frame(cbind(names_a1,values_a1,values_b1))
Any suggestion?
Thanks in advance for any help
I can help with the first part of your request. Here is how to create the separate data frames.
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
#Subsetting usually targets observations (rows); to split by variables (columns) you can create 2 new data frames out of the existing one.
#a34 contains the 3rd and 4th variables
a34 <- data.frame(cbind(as.vector(a$names_b),as.vector(a$values_b)))
colnames(a34) <-c("names_b","values_b")
#then "subset" a (in fact you create a new one and replace it)
a <- data.frame(cbind(as.vector(a$names_a),as.vector(a$values_a)))
colnames(a) <-c("names_a","values_a")
This results in:
> a
  names_a values_a
1       a        1
2       b        2
3       c        3
4       d        4
> a34
  names_b values_b
1       a        5
2       b        6
3       e        7
4       f        8
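For the second part (aligning the rows by name), a full outer join with base R's merge() is one possible approach. This is only a sketch: it rebuilds the example with data.frame() instead of cbind() so the value columns stay numeric, and it assumes that a missing name should get 0, as in the desired a1.

```r
# Same example values as the question, built with data.frame() so the
# numeric columns stay numeric (cbind() would turn everything into text)
a <- data.frame(names_a = c("a", "b", "c", "d"), values_a = c(1, 2, 3, 4),
                names_b = c("a", "b", "e", "f"), values_b = c(5, 6, 7, 8),
                stringsAsFactors = FALSE)

# Select each name/value pair of columns by name
first  <- a[, c("names_a", "values_a")]
second <- a[, c("names_b", "values_b")]

# Full outer join on the name columns; names missing on one side give NA
a1 <- merge(first, second, by.x = "names_a", by.y = "names_b", all = TRUE)
names(a1) <- c("names_a1", "values_a1", "values_b1")

# The question wants 0 rather than NA for the missing values
a1$values_a1[is.na(a1$values_a1)] <- 0
a1$values_b1[is.na(a1$values_b1)] <- 0
a1
```

With more than two pairs of columns, the same merge() call could be repeated pair by pair, for instance with Reduce().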
I have a data frame with zero columns and zero rows, and I want to have the for loop fill in numbers from 1 to 39. The numbers should be repeating themselves twice until 39, so for instance, the result I am looking for will be in one column, where each number repeats twice
Assume st is the data frame I have set already. This is what I have so far:
for(i in 1:39) {
append(st,i)
for(i in 1:39) {
append(st,i)
}
}
Expected outcome will be in a column structure:
1
1
2
2
3
3
.
.
.
.
39
39
You don't need a for loop. Use rep() instead:
# How many times you want each number to repeat sequentially
times_repeat <- 2
# Assign the repeated values as a data frame
test_data <- as.data.frame(rep(1:39, each = times_repeat))
# Change the column name if you want to
names(test_data) <- "Dont_encourage_the_use_of_blanks_in_column_names"
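As for why the loop in the question leaves st unchanged: append() returns a new object and does not modify its argument in place, so the result has to be assigned back. If you do want a loop, a minimal sketch could look like this (the column name value is arbitrary):

```r
# Loop version: the result of each step has to be assigned back
values <- integer(0)
for (i in 1:39) {
  values <- c(values, i, i)   # add the current number twice
}
st <- data.frame(value = values)   # "value" is an arbitrary column name

# The vectorised version produces the same column
identical(st$value, rep(1:39, each = 2))
```

Growing a vector inside a loop is slow for large inputs, which is another reason to prefer rep().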
I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same
number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects every element of the by list to have the same length as the number of observations in your variables. You are getting an error because your grouping values have 20 entries, while you only have 10 observations per variable. So, for example, you can do this because now you have 1 variable with 20 observations, and list1 has a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
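If the goal is to average duplicate rows, the grouping needs one label per row of B. The sketch below uses invented labels (row_group is a hypothetical vector, not taken from your data) purely to show the shape that aggregate() expects:

```r
B <- matrix(1:20, nrow = 10, ncol = 2)

# Hypothetical labels, one per ROW of B (not one per cell); rows that share
# a label are treated as technical duplicates
row_group <- c(1, 1, 2, 3, 4, 5, 6, 7, 8, 8)

# Average every column of B within each group of rows
averaged <- aggregate(as.data.frame(B), by = list(group = row_group), FUN = mean)
averaged
```

With these made-up labels the result has 8 rows, one per group, plus a group column in front of the two averaged columns.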
I would like to select specific elements of a data.list after processing it.
To explain the processing parameters, I describe my problem in the reproducible example below.
In the example code, I have three data sets in data.list, each with 5 columns.
Each data set repeats itself three times, and each is assigned a unique number called set_nbr that identifies it.
#to create reproducible data (this part creates three sets of data each one repeats 3 times of those of Mx, My and Mz values along with set_nbr)
set.seed(1)
data.list <- lapply(1:3, function(x) {
  nrep <- 3
  time <- rep(seq(90,54000,length.out=600),times=nrep)
  Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012),decreasing=TRUE)))
  My <- c(replicate(nrep,sort(runif(600,-0.02,0.02),decreasing=TRUE)))
  Mz <- c(replicate(nrep,sort(runif(600,-1,1),decreasing=TRUE)))
  df <- data.frame(time,Mx,My,Mz,set_nbr=x)
})
After applying some function, I have output like this:
result
    time            Mz set_nbr
1  27810 -1.917835e-03       1
2  28980 -1.344288e-03       1
3  28350 -3.426615e-05       1
4  27900 -9.934413e-04       1
5  25560 -1.016492e-02       2
6  27360 -4.790767e-03       2
7  28080 -7.062256e-04       2
8  26550 -1.171716e-04       2
9  26820 -2.495893e-03       3
10 26550 -7.397865e-03       3
11 26550 -2.574022e-03       3
12 27990 -1.575412e-02       3
My questions start here.
1) How to get min,middle and max values of time column, for each set_nbr ?
2) How to use evaluated set_nbr and Mz values inside of data.list?
In short;
After determining the min, middle, and max values of the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My, and Mz columns that match those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract the whole family of rows for each selected set_nbr from data.list.
We use time as a factor to select set_nbr. Here "factor" means an extraction parameter, not a factor in the R sense.
In addition, as you will see, four rows exist for each set_nbr, but they in fact refer to different data sets in data.list.
I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
  summarize(min_time = min(time),
            max_time = max(time),
            middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How to select evaluated set_nbr values inside of data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in comments, if you want the Mx and My values for the time and set_nbr in the result object, using my combined dat above, simply do a join: left_join(result, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
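If you would rather name the join keys explicitly, here is a sketch using base R's merge() instead of dplyr. The result data frame below is a small hypothetical stand-in that holds only the key columns, and the assumption is that time and set_nbr together are what you want to match on. Mz is left out of the keys because the Mz values in your processed output may not equal the raw ones.

```r
# Rebuild the question's example data and stack it into one data frame
set.seed(1)
data.list <- lapply(1:3, function(x) {
  nrep <- 3
  time <- rep(seq(90, 54000, length.out = 600), times = nrep)
  Mx <- c(replicate(nrep, sort(runif(600, -0.014, 0.012), decreasing = TRUE)))
  My <- c(replicate(nrep, sort(runif(600, -0.02, 0.02), decreasing = TRUE)))
  Mz <- c(replicate(nrep, sort(runif(600, -1, 1), decreasing = TRUE)))
  data.frame(time, Mx, My, Mz, set_nbr = x)
})
dat <- do.call(rbind, data.list)

# Hypothetical stand-in for `result`: one selected time per set_nbr
result <- data.frame(time = c(27810, 25560, 26820), set_nbr = 1:3)

# Join on the key columns only; all matching rows of dat are returned
picked <- merge(result, dat, by = c("time", "set_nbr"))
nrow(picked)
```

Because every time value occurs three times per set_nbr in the simulated data (nrep is 3), each row of result matches three rows of dat.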
I have a dataset that contains results for many tests across many samples. The samples are replicated within the dataset. I would like to compare the test results between replicates within each group of replicated samples. I thought it might be easiest to first split my data frame by the SampleID so that I have a list of data frames, one data frame for each SampleID. There could be 2, 3, 4, or even 5 replicates of a sample so the number of unique combinations of rows to compare for each sample group is not the same. I have the logic that I am thinking laid out below. I want to run a function on the list of data frames and output the match results. The function would compare unique sets of 2 rows within each group of replicated samples and return values of "Match", "Mismatch", or NA (if one or both values for a test is missing). It would also return the count of tests that overlapped between the 2 compared replicates, the number of matches, and the number of mismatches. Lastly, it would include a column where the sample names are pasted together with their row numbers so I know which two samples were compared (ex. Sample1.1_Sample1.2). Could anyone point me in the right direction?
#Input data structure
data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"),
rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))
colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
data
data.split = split(data, data$SampleID)
##Row comparison function
#Input is a list of data frames. Each data frame contains results for replicates of the same sample.
RowCompare = function(x){
rowcount = nrow(x)
##ifelse(rowcount==2,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
#ifelse(rowcount==3,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 1 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 2 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
return(results)
}
#Output is a list of data frames - one for sample name
out = lapply(names(data.split), function(x) RowCompare(data.split[[x]]))
#Row bind the list of data frames back together to one large data frame
out.merge = do.call(rbind.data.frame, out)
head(out.merge)
#Desired output
out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"),
rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA),
rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"),
rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,8), rbind(2,1,1,1)))
colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
out.merge
One thing I did see on another post that I thought might be useful is the line below which would create a data frame of unique row combinations that could then be used to define which rows to compare in each group of replicated samples. Not sure how to implement it though.
t(combn(nrow(data),2))
Thank you.
You are on the right track with t(combn(nrow(data),2)). See below for how I would do it.
testCols <- which(grepl("^Test\\d+",colnames(data)))
TestsCompare=function(x,y){
  ##how many non-NA values overlap
  overlaps <- sum(!is.na(x) & !is.na(y))
  ##of those that overlap, how many match
  matches <- sum(x==y, na.rm=TRUE)
  ##of those that overlap, how many do not match
  non_matches <- overlaps - matches # complement of matches
  c(overlaps,matches,non_matches)
}
RowCompare= function(x){
  comp <- NULL
  ##all unique pairs of row indices within this sample group
  pairs <- t(combn(nrow(x),2))
  for(i in 1:nrow(pairs)){
    row_a <- pairs[i,1]
    row_b <- pairs[i,2]
    a_tests <- x[row_a,testCols]
    b_tests <- x[row_b,testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a","row_b","overlaps","matches","non_matches")
  return(comp)
}
out = lapply(data.split, RowCompare)
Produces:
> out
$Sample1
     row_a row_b overlaps matches non_matches
[1,]     1     2        8       6           2

$Sample2
     row_a row_b overlaps matches non_matches
[1,]     1     2       10       9           1
[2,]     1     3        9       8           1
[3,]     2     3        9       9           0
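If you also want the per-test "Match"/"Mismatch"/NA columns and the pasted sample labels from your desired output, the same pairing idea can be extended. The sketch below uses a cut-down copy of your data (only Test1, Test2, Test3 and Test8) to stay short. LabelPairs is a made-up helper name, and groups with fewer than two replicates are simply skipped.

```r
# Cut-down copy of the question's data (4 of the 10 tests)
data <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample2"),
  Test1 = c("A", "A", "C", "C", "C"),
  Test2 = c("A", "T", "C", "C", "C"),
  Test3 = c("A", NA, "C", "C", "C"),
  Test8 = c("A", NA, "C", "T", "T"),
  stringsAsFactors = FALSE
)

# LabelPairs() is a hypothetical helper: one output row per pair of replicates
LabelPairs <- function(x) {
  if (nrow(x) < 2) return(NULL)   # nothing to compare
  tests <- as.matrix(x[, grepl("^Test\\d+$", names(x)), drop = FALSE])
  pairs <- t(combn(nrow(x), 2))
  rows <- lapply(seq_len(nrow(pairs)), function(i) {
    a <- tests[pairs[i, 1], ]
    b <- tests[pairs[i, 2], ]
    # a == b is NA when either value is missing, so NA carries through
    status <- ifelse(a == b, "Match", "Mismatch")
    id <- paste0(x$SampleID[pairs[i, 1]], ".", pairs[i, 1], "_",
                 x$SampleID[pairs[i, 2]], ".", pairs[i, 2])
    data.frame(SampleID = id, t(status),
               Num_Overlap = sum(!is.na(status)),
               Num_Match = sum(status == "Match", na.rm = TRUE),
               Num_Mismatch = sum(status == "Mismatch", na.rm = TRUE),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

out.merge <- do.call(rbind, lapply(split(data, data$SampleID), LabelPairs))
out.merge
```

The row numbers in the labels are positions within each sample group, which matches the Sample2.1_Sample2.2 style in your desired output.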