This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss) what this question is part of is attempting to get the benefits of SPSS style labeling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.
Whilst CSV does a reasonable job of storing data it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard code the meta data in. But for larger projects it makes more sense to store this meta data in a second csv file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
SO the common elements are the column names which are the key to both files
The first column of the metadata file describes the role of the row for the data file
so
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
missing describes how to treat missing data
varlabel describes the label for a factor level starting at one on up to as many labels as there are.
Right! Here's the code that works:
```#Libraries
library(expss)
library(data.table)
library(magrittr)```
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt datatable simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery
this code works just fine on a single columns when I don't use a for loop variable
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't make vector as you expect. Contrary to usual data.frame data.table doesn't drop dimensions when you provide single column in the index. And as.vector do nothing with data.frames/data.tables. So tlabels remains data.table. This line need to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"
I have the following reproducible data:
MyScaledData contains scaled values between 0 and 1 for 6 variables. minvec and maxvec are named vectors and contain the maximum and minimum values from the original data set that was used to create the scaled data frame MyScaledData. minvec and maxvec contain values for all 22 variables of the original data set, including the 6 variables I have now in MyScaledData.
X14863 X15066 X15067 X15068 X15069 X15070
0.6014784 0.6975109 0.5043208 0.15284648 0.9416364 0.7860731
0.2495215 0.7801444 0.6683925 0.13768245 0.4277954 0.2058412
0.6167705 0.3344044 0.9254125 0.12777565 0.3826231 0.2590457
0.1227380 0.4448501 0.3961802 0.19117246 0.7789835 0.7587897
0.7299760 0.6375931 0.5760061 0.44746838 0.3634903 0.1079679
0.1988647 0.7814712 0.6572054 0.71409305 0.6715690 0.4029459
0.5041371 0.6374958 0.9333635 0.89057831 0.5716711 0.7219823
0.5774327 0.7677038 0.7622717 0.45288270 0.2817869 0.2572325
0.6809509 0.6089656 0.8191862 0.01151454 0.2780449 0.4655353
0.5754383 0.5662045 0.7003630 0.62559642 0.2865510 0.1847980
MyScaledData<-structure(list(X14863=c(0.601478444979532,0.249521497274968,0.616770466379489,0.122737966507165,0.729975993009922,0.198864661389536,0.504137054265617,0.577432671357089,0.680950947164095,0.575438259547452),X15066=c(0.697510926657699,0.780144354632397,0.334404422875259,0.444850091405716,0.637593061483412,0.781471212351781,0.637495834667556,0.7677038048039,0.608965550162107,0.566204459603197),X15067=c(0.50432083998529,0.668392530333367,0.925412484830622,0.396180214305286,0.576006062451239,0.657205387087382,0.933363470346907,0.762271729415789,0.819186151914183,0.700362991098644),X15068=c(0.152846483002917,0.137682446305942,0.127775652495726,0.191172455317975,0.447468375530484,0.714093046059637,0.890578310935752,0.452882699805154,0.011514536383708,0.625596417031532),X15069=c(0.94163636689763,0.427795395079331,0.38262308941233,0.77898345642139,0.363490265569212,0.671568951210917,0.571671115989958,0.281786881885636,0.278044876559552,0.286551022600823),X15070=c(0.786073059382553,0.205841229942702,0.259045736299276,
0.758789694211416,0.107967864736275,0.402945912782515,0.721982268066207,0.257232456508833,0.46553533255268,0.184798001614338)),row.names=c(NA,10L),class="data.frame"); minvec<-c(X14861=22.95,X14862=29.95,X14863= 39.95,X15066=59.95,X15067=79.95,X15068=14.99,X15069=24.99,X15070=33.45,X15071=36.95,X15072=44.95,X15073=54.95,X15074=74.95,X15132=12.95,X15548=12.95,X15549=22.95,X15550=29.95,X15551=39.95,X15552=59.95,X15553=79.95,X15956=49.95,X15957=49.95,X16364=3.5);maxvec<-c(X14861=29.99,X14862=39.99,X14863=49.99,X15066=79.99,X15067=99.99,X15068=19.99,X15069=29.99,X15070=39.99,X15071=49.99,X15072=59.99,X15073=79.99,X15074=99.99,X15132=19.99,X15548=19.99,X15549=29.99,X15550=39.99,X15551=49.99,X15552=79.99,X15553=99.99,X15956=59.99,X15957=59.99,X16364=9.99)
I want to rescale back MyScaledData to their original scale by matching the min/max values to each corresponding column based on name. I've tried the following:
descale <- function(x,minval,maxval) {x*(maxval-minval) + minval}
as.data.frame(Map(descale,MyScaledData,minvec,maxvec))
The output I get has more than 6 columns than MyScaledData has. I sense that the function is not even matching columns by names and therefore the output is not calculated correctly. How can I match the function by column name so it takes the corresponding minvec and maxvec element for each column and return only the 6 columns I have?
Desired output shall be:
MyDeScaledData <- structure(list(X14863 = c(45.9888435875945, 42.4551958326407,46.1423754824501, 41.1822891837319, 47.2789589698196, 41.9466012003509,45.0115360248268, 45.7474240204252, 46.7867475095275, 45.7274001258564), X15066 = c(73.9281189702203, 75.5840928668332, 66.6514646344202,68.8647958317706, 72.7273649521276, 75.6106830955297, 72.7254165267378,75.3347842482702, 72.1536696252486, 71.2967373704481), X15067 = c(90.0565896333052,93.3445863078807, 98.4952661960057, 87.8894514946779, 91.4931614915228,93.1203959572311, 98.654603945752, 95.2259254574924, 96.3664904843602,93.9852743416168), X15068 = c(15.7542324150146, 15.6784122315297,15.6288782624786, 15.9458622765899, 17.2273418776524, 18.5604652302982,19.4428915546788, 17.2544134990258, 15.0475726819185, 18.1179820851577), X15069 = c(29.6981818344881, 27.1289769753967, 26.9031154470616,28.8849172821069, 26.8074513278461, 28.3478447560546, 27.8483555799498,26.3989344094282, 26.3802243827978, 26.4227551130041), X15070 = c(38.5909178083619,34.7962016438253, 35.1441591153973, 38.4124846001427, 34.1561098353752,36.0852662695976, 38.171764033153, 35.1323002655678, 36.4946010748945,34.6585789305578)), row.names = c(NA, 10L), class = "data.frame")
Thanks to #Shirin Yavari for providing the solution:
MyDeScaledData<-as.data.frame(Map(descale,MyScaledData,minvec[names(MyScaledData)],maxvec[names(MyScaledData)]))
I have a R code that I am trying to run in a server. But it is stopping in the middle/get frozen probably because of memory limitation. The data files are huge/massive (one has 20 million lines) and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data has specific data of human and rats' genes and human has 36 replicates, while rat has 281. So, the loop is basically 281*36 steps. What I want to do is to process data using the function getGeneType and see how different/independent are the expression of different replicate combinations. Using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this :
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have few questions on how to make this more efficient. I think when I run this code, R takes up lots of memory that ultimately causes problems. I am wondering if there is any way of doing this more efficiently
Another possibility is the usage of double for-loop'. Will sapply help? In that case, how should I apply sapply?
At the end I want to convert result into a csv file. I know this is a bit overwhelming to put code like this. But any optimization/efficient coding/programming will be A LOT! I really need to run the whole thing at least one to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 = 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.