R and appending to data frames

I have a cross-correlation function crosscor, and I would like to loop over it for each of the columns in my data matrix. Each time it is run, the function outputs a cross-correlation table that looks something like this:
Lags Cross.Correlation P.value
1 0 -0.0006844958 0.993233547
2 1 0.1021006478 0.204691627
3 2 0.0976746274 0.226628526
4 3 0.1150337867 0.155426784
5 4 0.1943150900 0.016092041
6 5 0.2360415470 0.003416147
7 6 0.1855274375 0.022566685
8 7 0.0800646242 0.330081900
9 8 0.1111071269 0.177338885
10 9 0.0689602574 0.404948252
11 10 -0.0097332533 0.906856279
12 11 0.0146241719 0.860926388
13 12 0.0862549791 0.302268025
14 13 0.1283308019 0.125302070
15 14 0.0909537922 0.279988895
16 15 0.0628012627 0.457795228
17 16 0.1669241304 0.047886605
18 17 0.2019811994 0.016703619
19 18 0.1440124960 0.090764520
20 19 0.1104842808 0.197035340
21 20 0.1247428178 0.146396407
I would like to put all of the outputs together in one data frame, and ultimately export it to a CSV file so the columns are as follows: lags.3, cross-correlation.3, p-value.3, lags.4, cross-correlation.4, ...etc. until p.value.50.
I have tried to use do.call as follows, but have not been successful:
for (i in 3:50) {
  l1 <- crosscor(data[, 2], data[, i], lagmax = 20)
  ccdata <- do.call(rbind, l1)
  cat("Data row", i)
}
I've also tried just creating the data frame straight out, but am just getting the lag column names:
ccdata <- data.frame()
for (i in 3:50) {
  ccdata[i-2:i+1] <- crosscor(data[, 2], data[, i], lagmax = 20)
  cat("Data row", i)
}
What am I doing wrong? Or is there an online source on data sets I could access to figure out how to do this? Best,

There is a transpose method for data.frames. If "crosscor" is the name of the object just try this:
tcrosscor <- t(crosscor)
write.csv(tcrosscor, file="my_crosscor_1.csv")
The first row would be the Lags; the second row, the Cross.Correlations; the third row, the P.values. I suppose you could "flatten" it further so it would be entirely "horizontal" or "wide". It seems painful, but this might go something like:
single_line <- as.data.frame(unlist(tcrosscor))
names(single_line) <- paste("Lag", 'Cross.Correlation', 'P.value'), rep(1:50, 3), sep=".")
write.csv(single_line, file="my_single_1.csv")
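If the goal is the wide, column-per-series layout from the question, the whole 3:50 loop could also be assembled in one go. Below is an untested sketch; it assumes crosscor() returns a data frame with columns Lags, Cross.Correlation and P.value as shown above, and the output file name is just illustrative:
# Collect one result per column, rename with the column index, then bind them side by side.
results <- lapply(3:50, function(i) {
  cc <- crosscor(data[, 2], data[, i], lagmax = 20)
  names(cc) <- paste(c("Lags", "Cross.Correlation", "P.value"), i, sep = ".")
  cc
})
ccdata <- do.call(cbind, results)
write.csv(ccdata, file = "my_crosscor_wide.csv", row.names = FALSE)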

Related

Can I subset an aggregated CellDataSet object using Monocle in R?

I have a CellDataSet object (cds):
> class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
composed of 6 different aggregated samples that can be distinguished by the suffixes of their barcodes. Here is a sample of what these look like:
cds$barcode
1 ACCAACGACTTGCC-1
2 CGCACTACTCGATG-4
3 CGTACAGAGTATCG-5
4 CGTCAAGATCACCC-5
5 ACTGAGACCCGTAA-2
6 TTAGACCTCGGGAA-6
7 TTCAAGCTGGTATC-3
8 TTTGACTGTCCTTA-4
9 TTTGCATGCTCTTA-4
10 AAACATTGAAGCCT-5
Is it possible to split this CellDataSet object into 6 smaller CellDataSet objects that each comprise barcodes with the same "-n" suffix, so I can analyse each sample separately? For example, the barcodes of CellDataSet1 would look like:
cds$barcode
1 AAACCGTGCCCTCA-1
2 AAACGCACACGCAT-1
3 AAACGGCTTCCGAA-1
4 AAAGACGAACCCAA-1
5 AAAGACGACTGTTT-1
6 AAAGAGACAAAGCA-1
7 AAAGATCTGGTAAA-1
8 AAAGCAGAGCAAGG-1
9 AAAGCAGATTATCC-1
10 AAAGCCTGATGACC-1
etc, and would contain the corresponding attributes as in the original object.
Many thanks!
Abigail
You can use tidyverse to solve the problem:
library(tidyverse)
dataseti <- data.frame(barcode = c("ACCAACGACTTGCC-1",
"GCACTACTCGATG-4",
"CGTACAGAGTATCG-5",
"CGTCAAGATCACCC-5",
"ACTGAGACCCGTAA-2",
"TTAGACCTCGGGAA-6",
"TTCAAGCTGGTATC-3",
"TTTGACTGTCCTTA-4",
"TTTGCATGCTCTTA-4",
"AAACATTGAAGCCT-5"),
stringsAsFactors = FALSE)
Let's say you want group 4:
dataseti %>% separate(barcode, c("chain", "group"), sep = "-") %>% filter(group == 4)
Good luck!
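If you need the result as actual CellDataSet objects rather than a plain data frame, here is a hedged, untested sketch; it relies on CellDataSet inheriting from ExpressionSet, so subsetting the columns (cells) with [ should carry the corresponding phenotype data along:
suffix <- sub(".*-", "", cds$barcode)                                     # "1", "2", ..., "6"
cds_list <- lapply(sort(unique(suffix)), function(s) cds[, suffix == s])  # one CellDataSet per suffix
names(cds_list) <- paste0("CellDataSet", sort(unique(suffix)))
cds_list[["CellDataSet1"]]                                                # cells whose barcodes end in -1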

Creating a loop with compare_means

I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data <- list()
for (i in 2:length(df3)) {
  data <- compare_means(df3[[i]] ~ Cluster2, data = df3, paired = FALSE,
                        p.adjust.method = "bonferroni", method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output that can be converted into a data frame with all the information contained in the compare_means output.
Thanks a lot
Try this:
library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {
  new <- df3[, c(colnames(df3)[i], "Cluster2")]
  colnames(new) <- c("interest", "Cluster2")
  data[[i - 1]] <- compare_means(interest ~ Cluster2, data = new, paired = FALSE,
                                 p.adjust.method = "bonferroni",
                                 method = "wilcox.test")
}
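To end up with one data frame and keep only the significant results, a possible follow-up is sketched below (untested; it assumes Cluster2 has exactly two levels so each compare_means call returns a single row, and 0.05 is purely an illustrative cutoff):
res <- do.call(rbind, data)                      # stack the per-column results
res$.y. <- colnames(df3)[2:(length(df3) - 1)]    # record which column each row came from
res_sig <- res[res$p.adj < 0.05, ]               # keep only significant adjusted p-values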

Listing my splitted data in R

My data looks like this:
colnames(dati) <- c("grupa", "regions5", "regions6", "novads.rep", "pilseta.lt", "specialists", "limenis.1", "limenis.2", "cipari.3", "ratio", "gads", "KV", "DS")
and I have manually applied split to it in order to have 24 splits (12 splits including year and 12 without splitting by year). I did them the following way:
k1<-split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k2<-split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
...
k13<-split(dati$ratio,list(dati$grupa),drop=TRUE)
k14<-split(dati$ratio,list(dati$grupa,dati$regions5),drop=TRUE)
...etc
and what I want to do is to pass these splits to my function as follows:
function(k1, k13)
but instead of inserting the values manually, I would like to change them so that I could call my function like this:
for(i in 1:12){function(k[i], k[i+12])}
I just can't seem to find the right way to do it.
dati after I split them looks like this:
grupa regions5 regions6 novads.rep pilseta.lt specialists
1 1* Zemgales Zemgales Novads lauki Silva
2 1* Kurzemes Kurzemes Novads lauki Sniedze
3 3* Kurzemes Kurzemes REP pilsēta AnitaE
4 1* Vidzemes Vidzemes Novads pilsēta Dainis
limenis.1 limenis.2 cipari.3 ratio gads KV
1 Jelgavas nov. Svētes pag. 1 0.8682626 2011 2162
2 Ventspils nov. Vārves pag. 1 0.3923857 2011 27467
3 _Liepāja _Liepāja 4 0.4069100 2011 30107
4 Alūksnes nov. Alūksne 2 0.5641127 2011 8147
DS
1 2490.03
2 70000.00
3 73989.33
4 14442.15
...
and here is the output I'm looking for:
count mean lowermean uppermean median ...
2011.1*.Kurzemes 119 0.83322820 7.719323e-01 0.8945241 0.79888324
2012.1*.Kurzemes 171 0.82800498 7.836221e-01 0.8723879 0.84424821
2013.1*.Kurzemes 144 0.77551814 7.347631e-01 0.8162731 0.80745150
2014.1*.Kurzemes 180 0.78134649 7.396007e-01 0.8230923 0.81635065
2015.1*.Kurzemes 80 0.78146588 7.135070e-01 0.8494248 0.73659659
2011.10*.Kurzemes 16 1.09552970 6.930780e-01 1.4979814 1.02127841
2012.10*.Kurzemes 22 0.87442906 5.721409e-01 1.1767172 0.74787482
2013.10*.Kurzemes 25 0.84406131 6.947097e-01 0.9934129 0.91786319
2014.10*.Kurzemes 22 0.79385199 5.880507e-01 0.9996533 0.71708060
2015.10*.Kurzemes 12 1.19059850 8.213604e-01 1.5598365 1.25322750
2012.11*.Kurzemes 1 0.09461065 NA NA 0.09461065
2013.11*.Kurzemes 2 0.18134522 -1.823437e+00 2.1861274 0.18134522
2014.11*.Kurzemes 1 0.11097174 NA NA 0.11097174
2013.12*.Kurzemes 1 0.44620780 NA NA 0.44620780
...
You could use a list:
k <- list()
k[[1]] <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k[[2]] <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
# etc
Then the following is valid (writing myfunction as a stand-in for your own function, since function itself is a reserved word):
for(i in 1:12){
  myfunction(k[[i]], k[[i+12]])
}
Note that k3 is the name of a variable, which could be x, myvar32, whatever. When you type k[3], you state that you want to access the third cell of the list k. Note that k and k3 are totally distinct variables. If you want to be able to access your variables using k[i], you must first create the list k and store what you need in k[i]...
The double bracket notation is used to access list elements; lists are basically handy structures for storing anything -- which is what you need in your case.
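If you would rather not write all 24 split() calls by hand, the list can also be built programmatically. Here is an untested sketch; it assumes every with-year grouping starts with dati$gads, so dropping its first element gives the matching without-year grouping:
groupings <- list(
  list(dati$gads, dati$grupa),
  list(dati$gads, dati$grupa, dati$regions5)
  # ... the remaining combinations, in the same order as k1..k12
)
k <- c(
  lapply(groupings, function(g) split(dati$ratio, g, drop = TRUE)),     # k[[1]]..k[[12]], with year
  lapply(groupings, function(g) split(dati$ratio, g[-1], drop = TRUE))  # k[[13]]..k[[24]], without year
)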

R: reading and manipulating a strangely formatted file

I have a file that is formatted in a slightly weird way, like so:
Cluster 1 Score:3.96
Category Term Count
GOTERM_BP_FAT GO:0006412 34
KEGG_PATHWAY hsa00970 9
GOTERM_BP_FAT GO:0043038 9
GOTERM_BP_FAT GO:0043039 9
Cluster 2 Score:3.94
Category Term Count
GOTERM_BP_FAT GO:0006414 21
KEGG_PATHWAY hsa03010 20
GOTERM_BP_FAT GO:0034660 16
GOTERM_BP_FAT GO:0006399 11
GOTERM_BP_FAT GO:0042254 10
GOTERM_BP_FAT GO:0022613 12
... and several more "sub-data frames" (with space in between), plus additional (here omitted) columns for the rows after the Cluster X rows.
What I want to do is to somehow read each separate cluster, get it as a data frame (i.e. a data frame with the columns Category, Term, Count), manipulate the data frame a bit (mostly adding columns based on calculations) and then write the manipulated data frame AND the Cluster X row to a new file in the very same format as it started.
I've racked my brain for some smart way to do this, but I haven't really come up with anything other than reading each row separately and doing different things depending on the type of row, like this:
con <- file('test.txt', open="r")
# Read file line for line
while ( length(currentLine <- readLines(con, n=1, warn=FALSE)) > 0 ) {
line = strsplit(currentLine, '\t')[[1]]
# Save previous data, initiate new cluster name/score
if ( grepl('Annotation Cluster', line[1]) ) {
# Save previous data if available
if ( exists('currentData') ) {
## save the current data somehow
}
# Initiate new
clusterInfo = line
}
# Initiate new, empty data frame
else if ( grepl('Category', line[1]) ) {
currentData = data.frame(t(rep(NA, length(line))))
names(currentData) = line
}
# Add data to data frame
else if ( grepl('GOTERM', line[1]) || grepl('KEGG', line[1]) ) {
currentData = rbind(currentData, line)
# Delete NAs if line row
if ( nrow(currentData) == 2 ) {
currentData = na.omit(currentData)
}
}
}
The above is obviously not finished (I'm not sure how to save the clusterInfo together with currentData in the same format), but I hope my idea comes across. I'm not really too fond of this approach, though... It seems very odd to me to create data frames row by row like this, and to try to save the data at the same time as initiating the start of the next block of data.
Is there some better way of doing this?
You can try read.mtable from my GitHub-only "SOfun" package.
Usage would be something like:
library(SOfun)
read.mtable(x, "Cluster", header = TRUE) ## Replace "x" with your file name
# $`Cluster 1 Score:3.96`
# Category Term Count
# 1 GOTERM_BP_FAT GO:0006412 34
# 2 KEGG_PATHWAY hsa00970 9
# 3 GOTERM_BP_FAT GO:0043038 9
# 4 GOTERM_BP_FAT GO:0043039 9
#
# $`Cluster 2 Score:3.94`
# Category Term Count
# 1 GOTERM_BP_FAT GO:0006414 21
# 2 KEGG_PATHWAY hsa03010 20
# 3 GOTERM_BP_FAT GO:0034660 16
# 4 GOTERM_BP_FAT GO:0006399 11
# 5 GOTERM_BP_FAT GO:0042254 10
# 6 GOTERM_BP_FAT GO:0022613 12
As you can see, the "cluster" information is retained as the list names. Thus, you can go ahead and use lapply to do whatever calculations you need to do, and then re-write the data in whatever form you need.
Reproducible sample data:
x <- tempfile()
writeLines("Cluster 1 Score:3.96
Category Term Count
GOTERM_BP_FAT GO:0006412 34
KEGG_PATHWAY hsa00970 9
GOTERM_BP_FAT GO:0043038 9
GOTERM_BP_FAT GO:0043039 9
Cluster 2 Score:3.94
Category Term Count
GOTERM_BP_FAT GO:0006414 21
KEGG_PATHWAY hsa03010 20
GOTERM_BP_FAT GO:0034660 16
GOTERM_BP_FAT GO:0006399 11
GOTERM_BP_FAT GO:0042254 10
GOTERM_BP_FAT GO:0022613 12", con = x, sep = "\n")
You could read the file with readLines and split it with a numeric index ('indx') created based on the lines having 'Cluster'. Read the list elements with read.table, create two new columns ('Cluster' and 'Score') and rbind the list elements to create a single dataset.
lines <- readLines('Clusterfile.txt')
indx <- cumsum(grepl('^Cluster', lines))
res <- do.call(rbind, lapply(split(lines, indx), function(x) {
  d1 <- read.table(text = x[-1], header = TRUE, stringsAsFactors = FALSE)
  d2 <- read.table(text = gsub('[^0-9.]+', ' ', x[1]),
                   col.names = c('Cluster', 'Score'))
  cbind(d1, d2)
}))
row.names(res) <- NULL
head(res,3)
# Category Term Count Cluster Score
#1 GOTERM_BP_FAT GO:0006412 34 1 3.96
#2 KEGG_PATHWAY hsa00970 9 1 3.96
#3 GOTERM_BP_FAT GO:0043038 9 1 3.96

How to Find difference between two values of last two dates in R program

DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between the values of the last two dates in a column, grouped by EMMI. For example, for the ACT column and EMMI 12345, the difference between the dates 2011/02/19 and 2011/02/12 is 21 - 21 = 0. I want to do that for the entire ACT column, add a new column diff, and fill it with these values. Can anybody please let me know how to do it?
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted (within EMMI) this will work; if they are not sorted, then we would need to modify the above to sort within EMMI first. I would probably sort the entire data frame by date first (and save the result of order), then run the above. Then, if you need it back in the original order, you can apply order to that saved result to "unorder" the data frame; a sketch of this follows.
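A possible version of that sort-then-unsort idea (untested; the date format string is assumed from the example data):
ord <- order(as.Date(DF2$Date, format = "%Y/%m/%d"))   # row order sorted by date
DF3 <- DF2[ord, ]
DF3$difACT <- ave(DF3$ACT, DF3$EMMI, FUN = function(x) c(NA, diff(x)))
DF3 <- DF3[order(ord), ]                                # restore the original row order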
This is based on the plyr package (not tested); an NA is prepended so the result has the same length as each group:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))
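For completeness, a dplyr version along the same lines (an untested sketch) might be:
library(dplyr)
DF3 <- DF2 %>%
  group_by(EMMI) %>%
  arrange(as.Date(Date, format = "%Y/%m/%d"), .by_group = TRUE) %>%
  mutate(DifACT = ACT - lag(ACT)) %>%
  ungroup()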
