Calling variable works outside of loop, but not inside loop - r

I am looping through vars_macro. The first variable in vars_macro is c1372 (dput below). The below code works perfectly fine:
len <- 32
c1372[1:(len-z),1:1]
However when I try to call the same variable (c1372) in code below, I get an error:
Error in m[1:(len - z), 1:1] : incorrect number of dimensions
Code:
output <- list()
forecast <- list()
for(m in noquote(vars_macro)){
output[[m]] <- list() # treat output as a list-of-lists
fit[[m]] <- list() # treat fit as a list-of-lists
for(z in rev(1:6)) {
x <- m[1:(len-z),1:1]
x <- ts((x), start = c(2011, 4), frequency = 4)
y <- Macro[1:(len-z),2:2]
y <- ts((y), start = c(2011, 4), frequency = 4)
t <- Macro[(len+1-z):(len+1-z),3:10]
t <- ts((t), start = c(2019, 2), frequency = 4)
#fit model
fit[[m]][[z]] <-auto.arima(y,xreg=x,seasonal=TRUE)
output[[m]][[z]] <- forecast(fit[[m]][[z]],xreg=t)$mean
}
}
Note the code above fails on the first variable (c1372), so the issue isn't the other variables. You can confirm this by simply writing for(m in ("c1372"))
Dput:
dput(vars_macro)
c("c1372", "c5244", "c5640", "c6164", "b1372", "b5244", "b5640",
"b6164", "v1372", "v5244", "v5640", "v6164", "bv1372", "bv5244",
"bv5640", "bv6164")
dput(c1372)
structure(list(c1372 = c(1.386445329, 1.600103663, 1.906186443,
1.962067415, 2.716663882, 1.875961101, 2.086589462, 2.115101307,
2.960605275, 2.109288864, 2.730920081, 2.816577742, 4.006180002,
3.503741762, 4.162132837, 4.122407811, 5.352681171, 3.961705849,
4.773003078, 4.575654378, 5.71727247, 4.401603262, 5.204187541,
4.7354794, 5.809822373, 4.137968937, 4.881120131, 4.812274313,
6.143882981, 4.935116748, 5.95001413, 5.384694268)), row.names = c(NA,
-32L), class = "data.frame")

The code in the OP fails because, on the first pass through for(m in noquote(vars_macro)), m is set to a single-element character vector containing the string "c1372", not to the data frame named c1372.
Therefore, x <- m[1:(len-z),1:1] fails: a character vector has no dimensions, so it cannot be indexed by row and column like a data frame with 32 rows and one column.
In R, everything is an object, and it's important to know the types of the objects one is manipulating. The mechanism for moving back and forth between a name held in a character vector and the object it refers to is a pair of R functions, get() and assign().
assign() binds a name to an object. get() retrieves an object, given its name.
To access the c1372 data frame rather than the character vector "c1372", use get() to look the object up by name.
Illustrating with the data provided in the OP:
vars_macro <- "c1372"
c1372 <- structure(list(c1372 = c(1.386445329, 1.600103663, 1.906186443,
1.962067415, 2.716663882, 1.875961101, 2.086589462, 2.115101307,
2.960605275, 2.109288864, 2.730920081, 2.816577742, 4.006180002,
3.503741762, 4.162132837, 4.122407811, 5.352681171, 3.961705849,
4.773003078, 4.575654378, 5.71727247, 4.401603262, 5.204187541,
4.7354794, 5.809822373, 4.137968937, 4.881120131, 4.812274313,
6.143882981, 4.935116748, 5.95001413, 5.384694268)), row.names = c(NA,
-32L), class = "data.frame")
len <- 32
theData <- NULL
for (m in vars_macro){
theData <- get(m)
}
# print first few rows to show that get() worked
head(theData)
...and the output:
> # print first few rows to show that get() worked
> head(theData)
c1372
1 1.386445
2 1.600104
3 1.906186
4 1.962067
5 2.716664
6 1.875961
>
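To connect this back to the nested loop in the question, here is a minimal sketch of how get() could be used there. The data frame below is a made-up stand-in for c1372, and rows_used is a hypothetical helper that only records how many rows each window contains; the model fitting from the question is left out.

```r
# Toy stand-in for the c1372 data frame from the question (values are made up)
c1372 <- data.frame(c1372 = seq(1, 32))
vars_macro <- "c1372"
len <- 32

rows_used <- list()                 # hypothetical: records the window sizes
for (m in vars_macro) {
  df <- get(m)                      # look up the data frame by its name
  for (z in rev(1:6)) {
    x <- df[1:(len - z), 1]         # first column, rows 1..(len - z)
    x <- ts(x, start = c(2011, 4), frequency = 4)
    rows_used[[as.character(z)]] <- length(x)
  }
}
unlist(rows_used)
```

Calling get() once per variable, outside the inner loop, avoids repeating the lookup. Keeping the data frames in a named list (for example via mget(vars_macro)) and indexing that list by name would avoid get() altogether.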

Related

posix time comparison in r not behaving the same in for loop and apply function

Hello, I am having an interesting issue with R.
When I do:
touchtimepairs = structure(list(v..length.v.. = structure(c(1543323677.254, 1543323678.137, 1543323679.181, 1543323679.918, 1543323680.729, 1543323681.803, 1543323682.523, 1543323682.977,1543323683.519, 1543323684.454), class = c("POSIXct", "POSIXt"), tzone = "CEST"),v.2.length.v.. = structure(c(1543323678.137, 1543323679.181, 1543323679.918, 1543323680.729, 1543323681.803, 1543323682.523, 1543323682.977, 1543323683.519, 1543323684.454, 1543323690.793), class = c("POSIXct", "POSIXt"), tzone = "CEST")), .Names = c("v..length.v..", "v.2.length.v.."), row.names = c(NA, 10L), class = "data.frame")
data = data.frame(a = seq(1,10), b = seq(21,30), posixtime = touchtimepairs[,1])
for(x in seq(nrow(touchtimepairs))){
a = data[data$posixtime < touchtimepairs[x,2],]
}
it works without a problem and I get results back, but when I try to use apply
a = apply(touchtimepairs, 1,
function(x) data[data$posixtime < x[2],])
it does not work anymore, I get an empty data frame. The same happens with the subset() command.
Interestingly, when I use > instead of <, it works!
a = apply(touchtimepairs, 1,
function(x) data[data$posixtime > x[2],])
Then there is another issue: with the > comparison, apply gives a different result than the for loop:
1951 lines with apply and
1897 with the for loop.
Can anyone reproduce this behavior?
The POSIX times also include milliseconds, if that is of any interest.
Many thanks
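If you look at your data inside the apply anonymous function, you'll see the symptom that is causing your trouble. apply() first converts the data frame to a matrix with as.matrix(), which turns the POSIXct columns into character strings, so each x the function receives is a character vector: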
apply(touchtimepairs, 1, class)
# 1 2 3 4 5 6 7 8 9 10
# "character" "character" "character" "character" "character" "character" "character" "character" "character" "character"
(It should be returning a 2-row matrix with POSIXct and POSIXt.) I should also note that I kept getting warnings about unknown timezone 'CEST'. I fixed it temporarily with attr(touchtimepairs[[1]], "tzone") <- "UTC", though that's just a kludge to stop the warnings on my console. It doesn't fix the problem and might just be my system. :-)
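The coercion can be seen directly on a small example. This is only a sketch: it rebuilds two of the timestamps from the question with an explicit UTC time zone (an assumption, to sidestep the CEST warning), and the names tp, m and kept are made up for illustration.

```r
# Two of the timestamps from the question, rebuilt with an explicit UTC time zone
times <- as.POSIXct(c(1543323677.254, 1543323678.137),
                    origin = "1970-01-01", tz = "UTC")
tp <- data.frame(start = times, end = times + 1)

m <- as.matrix(tp)      # the conversion apply() performs before looping
class(tp$start)         # the data frame column is still a date-time
class(m[1, "end"])      # the matrix cell is a plain string

# Looping over row indices keeps the original column types intact
kept <- lapply(seq_len(nrow(tp)), function(i) class(tp[i, "end"]))
```

Because the matrix holds formatted strings, anything compared against x[2] inside apply() is no longer compared against the original date-time value.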
Depending on whether you need one or both columns of touchtimepairs, there are a few options:
If you really only need one column of touchtimepairs at a time, then lapply will work:
lapply(touchtimepairs[[1]],
function(x) subset(data, posixtime < x))
If you need to use both columns at the same time, use an index on the rows:
lapply(seq_len(nrow(touchtimepairs)),
function(i) subset(data, posixtime < touchtimepairs[i,2]))
(where you'd also reference touchtimepairs[i,1] somehow).
Especially if you are trying to use both columns simultaneously, you can use Map:
Map(function(a, b) subset(data, a < posixtime & posixtime <= b),
touchtimepairs[[1]], touchtimepairs[[2]])
(This does not return anything in your sample data, so either the data is not the best representative sample, or you are not intending to use it in this fashion. Most likely the latter, I'm just guessing :-)
The biggest difference between Map and the *apply family is that it accepts one or more vectors/lists and zips them together. As an example of this "zipper" effect:
Map(func, 1:3, 11:13)
is effectively calling:
func(1, 11)
func(2, 12)
func(3, 13)
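As a concrete, runnable version of the zipper effect, here is a small sketch with an addition function standing in for func (the name sums is just for illustration):

```r
# Map pairs up the i-th elements of both inputs and calls the function on each pair
sums <- Map(function(a, b) a + b, 1:3, 11:13)
str(sums)      # a list of three elements
unlist(sums)   # flatten to a vector when every result is a single value
```

Map always returns a list, which is why unlist() is used when a plain vector is wanted.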

How to prevent coercion to list in R

I am trying to remove all NA values from two columns in a matrix and make sure that neither column has a value that the other doesn't.
code:
data <- dget(file)
dependent <- data[,"chroma"]
independent <- data[,"mass..Pantheria."]
names(independent) <- names(dependent) <- rownames(data)
for (name in rownames(data)) {
if(is.na(dependent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
if(is.na(independent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
}
print(dput(independent))
print(dput(dependent))
I am brand new to R and am trying to perform this task with a for loop. However, when I delete a section by assigning NULL I receive the following warning:
1: In independent$Aeretes_melanopterus <- NULL : Coercing LHS to a list
2: In dependent$name <- NULL : Coercing LHS to a list
No elements are deleted and independent and dependent retain all their original rows.
file (input):
structure(list(chroma = c(7.443501276, 10.96156313, 13.2987235,
17.58110922, 13.4991105), mass..Pantheria. = c(NA, 126.57, NA,
160.42, 250.57)), .Names = c("chroma", "mass..Pantheria."), class = "data.frame", row.names = c("Aeretes_melanopterus",
"Ammospermophilus_harrisii", "Ammospermophilus_insularis", "Ammospermophilus_nelsoni",
"Atlantoxerus_getulus"))
chroma mass..Pantheria.
Aeretes_melanopterus 7.443501 NA
Ammospermophilus_harrisii 10.961563 126.57
Ammospermophilus_insularis 13.298723 NA
Ammospermophilus_nelsoni 17.581109 160.42
Atlantoxerus_getulus 13.499111 250.57
desired output:
structure(list(chroma = c(10.96156313, 17.58110922, 13.4991105
), mass..Pantheria. = c(126.57, 160.42, 250.57)), .Names = c("chroma",
"mass..Pantheria."), class = "data.frame", row.names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
chroma mass..Pantheria.
Ammospermophilus_harrisii 10.96156 126.57
Ammospermophilus_nelsoni 17.58111 160.42
Atlantoxerus_getulus 13.49911 250.57
structure(c(126.57, 160.42, 250.57), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
126.57 160.42 250.57
structure(c(10.96156313, 17.58110922, 13.4991105), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
10.96156 17.58111 13.49911
Looks like you want to omit rows from your data where chroma or mass..Pantheria. is NA. Here's a quick way to do it:
data = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
I'm not sure why you are breaking independent and dependent out separately, but after filtering out bad observations is a good time to do it.
Since those are your only two columns, this is equivalent to omitting rows from your data frame that have any NA values, so you can use a shortcut like this:
data = na.omit(data)
If you want to keep a "pristine" copy of your raw data, simply assign the result to a new name:
data_no_na = na.omit(data)
# or
data_no_na = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
As to what's wrong with your code: $ extracts columns from a data frame (or elements from a list), but you're using it on a named atomic vector (since you've already extracted the columns), which doesn't work; that is why R warns that it is coercing the left-hand side to a list. Even on a data frame, $ takes the name literally, so independent$name refers to something called "name", not to the value stored in the variable name. To extract columns whose names are stored in variables, use brackets. For example, the built-in mtcars data has a column called "mpg":
# these work:
mtcars$mpg
mtcars[, "mpg"]
my_col = "mpg"
mtcars[, my_col]
mtcars$my_col ## does not work, need to use brackets!
You can never use $ with row names in a data frame, only column names.
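If the two named vectors really are needed separately, elements can be dropped with a logical index instead of $ and NULL. The following is a sketch using the sample values from the question; species and keep are names introduced here for illustration.

```r
# Named vectors built from the sample data in the question
species <- c("Aeretes_melanopterus", "Ammospermophilus_harrisii",
             "Ammospermophilus_insularis", "Ammospermophilus_nelsoni",
             "Atlantoxerus_getulus")
dependent   <- setNames(c(7.443501276, 10.96156313, 13.2987235,
                          17.58110922, 13.4991105), species)
independent <- setNames(c(NA, 126.57, NA, 160.42, 250.57), species)

# TRUE only where both vectors have a value for that species
keep <- !is.na(dependent) & !is.na(independent)
dependent   <- dependent[keep]
independent <- independent[keep]
names(independent)
```

Subsetting both vectors with the same logical index keeps them aligned, so neither ends up with a species the other lacks.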

Loop to Create and save dataframes into list

I have example code and data below. What I'm trying to figure out is how to write a loop that would create, say, x (in this example x=3) dataframes from a dataframe (in this example datadf) and save those dataframes in a list. The main part I'm stuck on is how to save each dataframe into a list. Any tips are greatly appreciated.
The updated code below seems to just about work, except the beginning index on the dataframes always stays at 1, instead of stepping 10 ahead each time. Anybody know what the issue is?
Update:
N<-3
x<-vector("list",N)
for (i in 1:N)
{
a<-(1:100)*rnorm(1,0.5)
b<-(1:100)*rnorm(1,2)
datadf<-as.data.frame(cbind(a,b))
n<-10
t<-50
datadfn<-datadf[((i-1)*n+1):(t+2*(i-1)*n),]
x[[i]]<-datadfn
}
Example Code:
n<-10
t<-50
datadf1<-datadf[1:t,]
datadf2<-datadf[(n+1):(t+n),]
datadf3<-datadf[(2*n+1):(t+2*n),]
dfList<-list(datadf1, datadf2, datadf3)
Data:
dput(datadf)
structure(list(a = c(2.00134717160119, 4.00269434320238, 6.00404151480358,
8.00538868640477, 10.006735858006, 12.0080830296072, 14.0094302012083,
16.0107773728095, 18.0121245444107, 20.0134717160119, 22.0148188876131,
24.0161660592143, 26.0175132308155, 28.0188604024167, 30.0202075740179,
32.0215547456191, 34.0229019172203, 36.0242490888215, 38.0255962604226,
40.0269434320238, 42.028290603625, 44.0296377752262, 46.0309849468274,
48.0323321184286, 50.0336792900298, 52.035026461631, 54.0363736332322,
56.0377208048334, 58.0390679764346, 60.0404151480358, 62.041762319637,
64.0431094912381, 66.0444566628393, 68.0458038344405, 70.0471510060417,
72.0484981776429, 74.0498453492441, 76.0511925208453, 78.0525396924465,
80.0538868640477, 82.0552340356489, 84.0565812072501, 86.0579283788513,
88.0592755504524, 90.0606227220536, 92.0619698936548, 94.063317065256,
96.0646642368572, 98.0660114084584, 100.06735858006, 102.068705751661,
104.070052923262, 106.071400094863, 108.072747266464, 110.074094438066,
112.075441609667, 114.076788781268, 116.078135952869, 118.07948312447,
120.080830296072, 122.082177467673, 124.083524639274, 126.084871810875,
128.086218982476, 130.087566154077, 132.088913325679, 134.09026049728,
136.091607668881, 138.092954840482, 140.094302012083, 142.095649183685,
144.096996355286, 146.098343526887, 148.099690698488, 150.101037870089,
152.102385041691, 154.103732213292, 156.105079384893, 158.106426556494,
160.107773728095, 162.109120899697, 164.110468071298, 166.111815242899,
168.1131624145, 170.114509586101, 172.115856757703, 174.117203929304,
176.118551100905, 178.119898272506, 180.121245444107, 182.122592615708,
184.12393978731, 186.125286958911, 188.126634130512, 190.127981302113,
192.129328473714, 194.130675645316, 196.132022816917, 198.133369988518,
200.134717160119), b = c(2.05061146723527, 4.10122293447054,
6.15183440170581, 8.20244586894108, 10.2530573361764, 12.3036688034116,
14.3542802706469, 16.4048917378822, 18.4555032051174, 20.5061146723527,
22.556726139588, 24.6073376068232, 26.6579490740585, 28.7085605412938,
30.7591720085291, 32.8097834757643, 34.8603949429996, 36.9110064102349,
38.9616178774701, 41.0122293447054, 43.0628408119407, 45.113452279176,
47.1640637464112, 49.2146752136465, 51.2652866808818, 53.315898148117,
55.3665096153523, 57.4171210825876, 59.4677325498228, 61.5183440170581,
63.5689554842934, 65.6195669515287, 67.6701784187639, 69.7207898859992,
71.7714013532345, 73.8220128204697, 75.872624287705, 77.9232357549403,
79.9738472221756, 82.0244586894108, 84.0750701566461, 86.1256816238814,
88.1762930911166, 90.2269045583519, 92.2775160255872, 94.3281274928224,
96.3787389600577, 98.429350427293, 100.479961894528, 102.530573361764,
104.581184828999, 106.631796296234, 108.682407763469, 110.733019230705,
112.78363069794, 114.834242165175, 116.88485363241, 118.935465099646,
120.986076566881, 123.036688034116, 125.087299501351, 127.137910968587,
129.188522435822, 131.239133903057, 133.289745370293, 135.340356837528,
137.390968304763, 139.441579771998, 141.492191239234, 143.542802706469,
145.593414173704, 147.644025640939, 149.694637108175, 151.74524857541,
153.795860042645, 155.846471509881, 157.897082977116, 159.947694444351,
161.998305911586, 164.048917378822, 166.099528846057, 168.150140313292,
170.200751780527, 172.251363247763, 174.301974714998, 176.352586182233,
178.403197649469, 180.453809116704, 182.504420583939, 184.555032051174,
186.60564351841, 188.656254985645, 190.70686645288, 192.757477920115,
194.808089387351, 196.858700854586, 198.909312321821, 200.959923789056,
203.010535256292, 205.061146723527)), .Names = c("a", "b"), row.names = c(NA,
-100L), class = "data.frame")
Simply change your second expression, (t+2*(i-1)*n), to (t+(i-1)*n) or, to align with the first expression, ((i-1)*n+t). Also, consider lapply instead of the for loop: it returns a list with one element per element of its input, here seq(N), i.e. 1 2 3:
N <- 3
n<-10
t<-50
dfList <- lapply(seq(N), function(i) {
a <- (1:100)*rnorm(1,0.5)
b <- (1:100)*rnorm(1,2)
datadf <- as.data.frame(cbind(a,b))
datadf[((i-1)*n+1):((i-1)*n+t),]
})
Or an easier read:
dfList <- lapply(seq(N), function(i) {
a <- (1:100)*rnorm(1,0.5)
b <- (1:100)*rnorm(1,2)
s <- (i-1)*n
datadf <- as.data.frame(cbind(a,b))
datadf[(s+1):(s+t),]
})
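To check that the windows land where intended, the same indexing can be run on a fixed data frame. This is a sketch with made-up deterministic data (a = 1:100) in place of the random draws, so the row ranges are easy to read off.

```r
# Deterministic stand-in for datadf so the selected rows are easy to verify
datadf <- data.frame(a = 1:100, b = 101:200)
N <- 3
n <- 10
t <- 50

dfList <- lapply(seq_len(N), function(i) {
  s <- (i - 1) * n                 # offset of window i: 0, 10, 20
  datadf[(s + 1):(s + t), ]        # 50 rows starting at s + 1
})

# First and last value of column a in each window
sapply(dfList, function(d) range(d$a))
```

Each window has t rows and starts n rows after the previous one, which matches datadf1, datadf2 and datadf3 in the question's example code.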

Applying `ar` (autoregressive model) for my data frame using `lapply` returns `numeric(0)`?

I'm working with a data.frame with all numeric data. I want to calculate the first order autoregressive coefficients for each column. I chose lapply to do the task and defined a function as follows:
return.ar <- function(vec){
return(as.numeric(ar(vec)$ar))
}
Then I applied it to a data frame that I subset by column names, as follows:
lapply(df_return[,col.names],return.ar)
I was expecting to get a vector of AR coefficients. Instead I got a list with all the coefficients in the first element, like the following:
$C.Growth
[1] 0.35629140 -0.07671252 -0.08699333 -0.27404355 0.21448342
[6] -0.19049197 0.06610908 -0.23077602
$Mkt.ret
numeric(0)
$SL
numeric(0)
$SM
numeric(0)
$SH
numeric(0)
$LL
numeric(0)
$LM
numeric(0)
$LH
numeric(0)
I don't understand what's going on.
The output of dput(head(df_return)) looks like the following:
structure(list(Year = c(1929, 1930, 1931, 1932, 1933, 1934),
C.Growth = c(0.94774902516838, 0.989078396169958, 0.911586749357132,
0.996183522774413, 1.08170234030149, 1.05797659377887), S.Return = c(-19.7068321696574,
-31.0834309393085, -45.2864376593084, -9.42504715968666,
57.0992131145999, 4.05781718258972), Rf = c(4.79316783034255,
2.58656906069154, 1.24356234069162, 0.954952840313344, 0.199213114599945,
0.147817182589718), Inflation = c(-0.0531678303425544, -0.15656906069154,
-0.15356234069162, -0.00495284031334435, 0.100786885400055,
0.0321828174102824), Mkt.ret = c(-14.9668321696574, -28.6534309393085,
-44.1964376593084, -8.47504715968666, 57.3992131145999, 4.23781718258972
), SL = c(-45.2568321696575, -35.1134309393085, -41.1864376593084,
-5.28504715968666, 166.0392131146, 34.1378171825897), SM = c(-30.7368321696574,
-31.9034309393085, -48.5364376593084, -8.94504715968666,
118.7092131146, 19.7578171825897), SH = c(-36.7568321696575,
-45.1834309393085, -51.5364376593084, 2.78495284031334, 125.7792131146,
7.95781718258972), LL = c(-19.6968321696574, -26.2734309393085,
-36.2264376593084, -7.31504715968666, 44.1492131145999, 10.6978171825897
), LM = c(0.673167830342554, -29.2434309393085, -59.9864376593084,
-16.7150471596867, 89.4692131145999, -2.93218281741028),
LH = c(-4.35683216965745, -43.1934309393085, -57.7364376593084,
-4.30504715968666, 114.7092131146, -21.8421828174103)), .Names = c("Year",
"C.Growth", "S.Return", "Rf", "Inflation", "Mkt.ret", "SL", "SM",
"SH", "LL", "LM", "LH"), row.names = c(NA, 6L), class = "data.frame")
Once you include your data, the diagnosis becomes easy.
ar selects the order p automatically based on AIC. Some of your columns show strong evidence of being white noise, so ar has selected p = 0 for them, in which case the $ar field is numeric(0).
I suggest you also use the following:
lapply(df_return[col.names], function (x) ar(x, order.max = 5)$order)
or even better:
fit_ar <- function(x) ar(x, order.max = 5)[c("order", "ar")]
lapply(df_return[col.names], fit_ar)
The latter returns both p and the AR coefficients for each column. I have set order.max = 5 so that ar does not pick the maximum order itself.
You tried to convince me that lapply is doing something wrong by using this for loop:
ar.vec <- numeric()
for (name in col.names)
ar.vec <- c(ar.vec, return.ar(df_return[[ name ]]))
But unfortunately you won't get anything useful from this. Note you used concatenation c(), thus there is no way to tell which coefficient is for which column.
lapply is not identical to such a loop. You should use:
ar.vec <- vector("list", length(col.names))
for (i in 1:length(col.names))
ar.vec[[i]] <- return.ar(df_return[[ col.names[i] ]])
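Since the question asks specifically for first-order coefficients, one option is to switch off the AIC-based order selection so that every column gets exactly one coefficient. The sketch below assumes that is what is wanted; df_demo is simulated data standing in for df_return, and ar1 is a name introduced here.

```r
set.seed(1)
# Simulated stand-in for df_return: one white-noise column, one random walk
df_demo <- data.frame(u = rnorm(50), v = cumsum(rnorm(50)))

# aic = FALSE makes ar() fit a model of order order.max instead of choosing p
ar1 <- vapply(df_demo,
              function(x) as.numeric(ar(x, aic = FALSE, order.max = 1)$ar),
              numeric(1))
ar1   # one named coefficient per column
```

vapply() with numeric(1) also guards against the original surprise: it raises an error if any column returns something other than a single number.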

Huge data file and running multiple parameters and memory issue, Fisher's test

I have R code that I am trying to run on a server, but it stops partway through or freezes, probably because of memory limitations. The data files are massive (one has 20 million lines), and in the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data contain gene-level records for human and rat; human has 36 replicates, while rat has 281, so the loop runs 281*36 steps. What I want to do is process the data using the function getGeneType and see how different/independent the expression is across replicate combinations, using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this:
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have a few questions on how to make this more efficient. I think that when I run this code, R takes up a lot of memory, which ultimately causes problems. I am wondering if there is any way of doing this more efficiently.
Another possible issue is the use of the double for loop. Will sapply help? In that case, how should I apply sapply?
At the end I want to write result to a CSV file. I know it is a bit overwhelming to post code like this, but any optimization or more efficient coding will help A LOT! I really need to run the whole thing at least once to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0). It's used in the for loop below in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 == 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
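As a rough illustration of what gc() reports, here is a minimal sketch; the vector big is only there to allocate some memory, and the variable names are made up.

```r
mem_before <- gc()        # matrix with one row for Ncells and one for Vcells
big <- numeric(1e6)       # allocate about 8 MB of doubles
mem_after <- gc()

rownames(mem_after)       # the two memory pools R reports on
mem_after                 # compare with mem_before to see the change
```

Printing gc() every few iterations of the outer loop would show whether memory use keeps growing as result gets longer.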
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.
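The per-iteration files described above could be written along these lines. This is a sketch under several assumptions: out_dir is a hypothetical temporary directory, the loop bounds are tiny placeholders for 281 and 36, and the contingency table is a fixed dummy rather than one built from the real data.

```r
out_dir <- tempfile("fisher_out_")   # hypothetical output directory
dir.create(out_dir)

for (i in 1:2) {                     # placeholder for seq_along(ratSplit)
  for (j in 1:3) {                   # placeholder for seq_along(humanSplit)
    # Dummy 2x2 table standing in for contingencyTable from the question
    fisherTest <- fisher.test(matrix(c(3, 1, 1, 3), nrow = 2))
    newLine <- data.frame(humanReplicate = paste0("h", j),
                          ratReplicate = paste0("r", i),
                          pvalue = fisherTest$p.value)
    csvFileName <- file.path(out_dir,
                             sprintf("compareRegAndRev%03d_%03d.csv", i, j))
    # One small headerless file per (i, j); nothing accumulates in memory
    write.table(newLine, file = csvFileName, sep = ",",
                row.names = FALSE, col.names = FALSE)
  }
}

# Optional: read the pieces back and combine them if they fit in memory
files <- list.files(out_dir, full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv, header = FALSE))
```

The zero-padded indices in the file names keep the files in loop order when they are listed or concatenated.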
