Ordering Merged data frames - r

As a fairly new R programmer I seem to have run into a strange problem, probably due to my inexperience with R.
After reading and merging successive files into a single data frame, I find that order() does not sort the data as expected.
Each file contains multiple references, but each file holds measurement data obtained at a different time.
Here's the code:
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurrence of file
for ( round1 in 1 : 6) {
ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
#
ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) {
y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
if (round2 == (round1 +1))
z=data.frame(merge(x,y,all=TRUE))
z=data.frame(merge(y,z,all=TRUE))
}
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone can see why the ordering only works for about 55 of 70 records and could steer me in the right direction, I would be obliged.
Thanks very much.

Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
Ps. I had previously put these in the comments.
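To illustrate why a factor can sort in an unexpected order, here is a minimal sketch with made-up IDs (the values are hypothetical, not taken from the question's files). order() on a factor follows the level order, and by default the levels are sorted as text:

```r
# Hypothetical lab IDs stored as a factor, as read.csv() produced by default before R 4.0
lab_id <- factor(c("10", "9", "2", "100"))

# order() uses the factor's integer codes; the levels are sorted as text
by_factor <- as.character(lab_id[order(lab_id)])

# Convert via character first; as.numeric() on the factor itself would return the codes
lab_num <- as.numeric(as.character(lab_id))
by_number <- lab_num[order(lab_num)]

print(by_factor)  # "10" "100" "2" "9"
print(by_number)  # 2 9 10 100
```

If lab_id is meant to be text rather than a number, as.character() alone gives a plain alphabetical sort.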


Attribute labels to scored items [function]

I don't know if this has already been asked, but here is my problem:
I have a dataset of behavior/personality items scored from 1 to 8, and I would like to convert each score to a label according to its range (e.g. 1-2 = Rare; 3-5 = Occasionally; 6-8 = Frequent).
I managed to create new columns and put labels in them, but I don't understand why the same values are repeated across the other columns:
Beh_data[,c(2,3,4,32,33,34)]
You can see that the columns ending in "_class" all have the same output, and the labels do not match the scores correctly (e.g. row 4: a score of 8 is labeled Occasionally).
Here is the code:
l = unlist(names(Beh_data[,2:28]))
for (j in 1:length(l)) {
cl[j] = list(paste(l[j],"class",sep="_"))
for (k in 1:length(cl)) {
Beh_data[,cl[[k]] ] <- cl[[k]]
for(i in 1:nrow(Beh_data)){
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]<3, "Rare", Beh_data[,cl[[k]] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>2 & Beh_data[,l[j] ][i]<6, "Occasionally", Beh_data[,cl[[k] ] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>5, "Frequent", Beh_data[,cl[[k]] ][i])
}
}
}
I tried to see whether it could come from wrong indexing such as cl[[k]] ] or something like that, but it still doesn't work.
Do you have any ideas, please?
If you're open to a dplyr solution, I think its across() and case_when() functions are helpful here. It should also run faster since it's vectorized. This will create new columns like aff_sum_class that use the categorization you've specified.
library(dplyr)
Beh_data |>
mutate(across(aff_sum:qui_sum,
~case_when(. >= 6 ~ "Frequent",
. >= 3 ~ "Occasionally",
TRUE ~ "Rare"),
.names = "{.col}_class"))
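For comparison, a base R sketch of the same binning with cut(), assuming the scores are whole numbers from 1 to 8. The toy data frame below is made up; only the column names aff_sum and qui_sum are borrowed from the answer above:

```r
# Toy data standing in for Beh_data
Beh_data <- data.frame(id = 1:4,
                       aff_sum = c(1, 4, 8, 2),
                       qui_sum = c(6, 3, 5, 8))
score_cols <- c("aff_sum", "qui_sum")

for (col in score_cols) {
  # Intervals are (0,2], (2,5], (5,8]: 1-2 = Rare, 3-5 = Occasionally, 6-8 = Frequent
  Beh_data[[paste0(col, "_class")]] <- as.character(
    cut(Beh_data[[col]], breaks = c(0, 2, 5, 8),
        labels = c("Rare", "Occasionally", "Frequent"))
  )
}
print(Beh_data)
```

Because cut() works on the whole column at once, no loop over rows is needed, which avoids the indexing mix-up between the j and k loops in the question.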

Matching unequal data frames on a unique identifier

So I have all this data from the EPL and I'm currently trying to create a column for a team's "form" based on their last five games. A win counts as 1 point, a draw as 0.5, and a loss as zero. I have a loop that does this for a single team at a time, but when I try to create one that merges them all together I cannot get it to work. My data comes from: http://www.football-data.co.uk/englandm.php
To keep it simple I will use only the data from the 2013/2014 season for the Premier League.
I imported the data from the Excel sheet and labeled it PL_20:
library(RCurl)
library(lubridate) # provides guess_formats(), used below
URL <- "www.football-data.co.uk/mmz4281/1415/E0.csv"
x <- getURL(URL, ssl.verifypeer = FALSE)
PL_20 <- read.csv(textConnection(x))
# cleaning data, getting rid of the betting odds
data <- PL_20[,-c(1,10,11,24:65)]
#Get dates all in same date format
data$Date<-as.Date(data$Date, guess_formats(data$Date, "dmy"))
#sorting (used for all the seasons)
sorted <- data[order(data$Date,decreasing=TRUE),]
sorted$index <- seq(1:nrow(sorted))
# making a column for form
teams<-as.matrix(unique(data$HomeTeam))
test<-sorted
z<-1 # controls which team the form is being found for. Ideally I would have this cycle
# through all of the teams
current<-subset(sorted, HomeTeam==as.character(teams[z]) | AwayTeam==as.character(teams[z]))
current$h.form<-0
current$a.form<-0
current$recent<-0
for (i in 1:nrow(current)){
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="H") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="A")){
# current[i,7]<- "W"
current[i,24]<- 1
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="D") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="D"))
{
#current[i,7]<- "D"
current[i,24]<- .5
}else{
if((as.character(current[i,2])==as.character(teams[z]) && as.character(current[i,6])=="A") || (as.character(current[i,3])==as.character(teams[z]) && as.character(current[i,6])=="H"))
{
# current[i,7]<- "L"
current[i,24]<- 0
}
}
}
}
d<-0
for (d in 0:(nrow(current)-6))
{
if (as.character(current[nrow(current)-(5+d),2])==as.character(teams[z])){
current[(nrow(current)-(5+d)),22]<-as.numeric(sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24]))
}else{
if(as.character(current[nrow(current)-(5+d),3])==as.character(teams[z]))
{
current[(nrow(current)-(5+d)),23]<-sum(current[(nrow(current)-(4+d)):(nrow(current)-d),24])
}
}
}
Now these ugly loops create a data frame called current which has three columns at the end: h.form, a.form, and recent. h.form is the form for the designated home team for that game and a.form is the form for the designated away team for that game. Recent is just the outcome of that game.
I would like to be able to combine all of the teams together so there is one observation per game and both h.form and a.form are populated with the correct values for their corresponding teams.
Your help is appreciated and if you have suggestions on how to clean up these loops that would be helpful as well.
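One possible way to handle all teams at once is sketched below. It is not a drop-in replacement for the loops above: add_form, team_points and previous_form are hypothetical helper names, the result column is assumed to be FTR with values "H", "D" and "A" as in the football-data.co.uk files, and games without five earlier results get NA rather than 0.

```r
# Points a team earns from a game: win = 1, draw = 0.5, loss = 0
team_points <- function(is_home, ftr) {
  won <- (is_home & ftr == "H") | (!is_home & ftr == "A")
  ifelse(ftr == "D", 0.5, ifelse(won, 1, 0))
}

# Sum of the n results before each game (NA until n earlier games exist)
previous_form <- function(points, n = 5) {
  vapply(seq_along(points), function(i) {
    if (i <= n) NA_real_ else sum(points[(i - n):(i - 1)])
  }, numeric(1))
}

# Hypothetical helper: fill h.form and a.form for every team in one pass
add_form <- function(games, n = 5) {
  games <- games[order(games$Date), ]
  games$h.form <- NA_real_
  games$a.form <- NA_real_
  teams <- unique(c(as.character(games$HomeTeam), as.character(games$AwayTeam)))
  for (team in teams) {
    rows <- which(games$HomeTeam == team | games$AwayTeam == team)
    at_home <- games$HomeTeam[rows] == team
    form <- previous_form(team_points(at_home, as.character(games$FTR[rows])), n)
    games$h.form[rows[at_home]] <- form[at_home]
    games$a.form[rows[!at_home]] <- form[!at_home]
  }
  games
}

# Toy fixture list: two made-up teams meeting six times
games <- data.frame(
  Date = as.Date("2014-08-16") + 7 * (0:5),
  HomeTeam = c("A", "B", "A", "B", "A", "B"),
  AwayTeam = c("B", "A", "B", "A", "B", "A"),
  FTR = c("H", "H", "D", "A", "H", "D"),
  stringsAsFactors = FALSE
)
with_form <- add_form(games)
print(with_form[6, c("HomeTeam", "AwayTeam", "h.form", "a.form")])
```

Each game keeps a single row, and the loop over teams writes into h.form when the team is at home and into a.form when it is away, so no merge step is needed.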

How to access data saved in an assign construct?

I made a list, read the list into a for loop, do some calculations with it, and export a modified data frame to "IAEA_C2_NoStdConditionResiduals1", "IAEA_C2_EAstdResiduals2", etc. When I call View(IAEA_C2_NoStdConditionResiduals1) after the for loop, I get the following error message in the console: Error in print(IAEA_C2_NoStdConditionResiduals1) : object 'IAEA_C2_NoStdConditionResiduals1' not found. But I know it is there, because RStudio shows it in its Environment view. So the question is: how can I access the saved data (in this assign construct) for further use?
ResidualList = list(IAEA_C2_NoStdCondition = IAEA_C2_NoStdCondition,
IAEA_C2_EAstd = IAEA_C2_EAstd,
IAEA_C2_STstd = IAEA_C2_STstd,
IAEA_C2_Bothstd = IAEA_C2_Bothstd,
TIRI_I_NoStdCondition = TIRI_I_NoStdCondition,
TIRI_I_EAstd = TIRI_I_EAstd,
TIRI_I_STstd = TIRI_I_STstd,
TIRI_I_Bothstd = TIRI_I_Bothstd
)
C = 8
for(j in 1:C) {
#convert list Variable to string for later usage as Variable Name as unique identifier!!
SubNameString = names(ResidualList)[j]
SubNameString = paste0(SubNameString, "Residuals")
#print(SubNameString)
LoopVar = ResidualList[[j]]
LoopVar[ ,"F_corrected_normed"] = round(LoopVar[ ,"F_corrected_normed"] / mean(LoopVar[ ,"F_corrected_normed"]),
digit = 5
)
LoopVar[ ,"F_corrected_normed_error"] = round(LoopVar[ ,"F_corrected_normed_error"] / mean(LoopVar[ ,"F_corrected_normed_error"]),
digit = 5
)
assign(paste(SubNameString, j), LoopVar)
}
View(IAEA_C2_NoStdConditionResiduals1)
This is not really a problem with assign; it has more to do with the behavior of the paste function, whose default separator is a space. This will build a variable name with a space in it:
assign(paste(SubNameString, j), LoopVar)
#simple example
> assign(paste("v", 1), "test")
> `v 1`
[1] "test"
... so you need to get its value by putting backticks around its name, so that the space is not misinterpreted as a delimiter by the parser. See what happens when you type:
`IAEA_C2_NoStdConditionResiduals 1`
... and from here on, use paste0 to avoid this problem.
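A small sketch of the difference, with placeholder names and values. The named-list variant at the end is an alternative that is not part of the answer above; it avoids creating variables with assign at all:

```r
# paste() separates its arguments with a space by default; paste0() does not
spaced <- paste("v", 1)    # "v 1"
compact <- paste0("v", 1)  # "v1"

assign(spaced, "test")
assign(compact, "test")

get("v 1")  # a name containing a space is reachable with get() or backticks
v1          # a syntactic name can be typed directly

# Alternative that avoids assign(): collect the results in a named list
results <- list()
for (j in 1:2) {
  results[[paste0("item", j, "Residuals")]] <- j^2  # j^2 is a placeholder result
}
names(results)
```

With a list, each data frame can later be retrieved as results[["item1Residuals"]] or processed in bulk with lapply.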

Huge data file and running multiple parameters and memory issue, Fisher's test

I have R code that I am trying to run on a server, but it stops in the middle or freezes, probably because of memory limitations. The data files are huge (one has 20 million lines), and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data holds gene-specific information for human and rat; human has 36 replicates, while rat has 281. So the loop is basically 281*36 steps. What I want to do is process the data using the function getGeneType and see how different/independent the expression of different replicate combinations is, using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this:
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have a few questions on how to make this more efficient. I think that when I run this code, R takes up a lot of memory, which ultimately causes problems. I am wondering if there is any way of doing this more efficiently.
Another possible cause is the double for loop. Will sapply help? In that case, how should I apply sapply?
At the end I want to write result to a csv file. I know it is a bit overwhelming to post code like this, but any optimization/efficient coding/programming will help A LOT! I really need to run the whole thing at least once to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0). It's used in the for loop below in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 == 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from outside the loop to inside it, replacing the rbind(). It would then just be necessary to create a new file name for the CSV file written in each iteration, e.g.:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.
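A variation on this idea is sketched below: instead of one file per iteration, each result row is appended to a single CSV, so nothing accumulates in memory through rbind(). The 2x2 tables are made-up numbers standing in for the contingency table built in the question, and a temporary file stands in for compareRegAndRev.csv:

```r
out_file <- tempfile(fileext = ".csv")  # stand-in for "compareRegAndRev.csv"
first <- TRUE

for (i in 1:3) {
  for (j in 1:2) {
    # Toy 2x2 table; in the real script this would be contingencyTable
    tab <- matrix(c(10 + i, 3, 2 + j, 8), nrow = 2)
    ft <- fisher.test(tab)
    row <- data.frame(humanReplicate = paste0("human", j),
                      ratReplicate = paste0("rat", i),
                      pvalue = ft$p.value,
                      oddsratio = unname(ft$estimate))
    # Write the header once, then append one line per iteration
    write.table(row, file = out_file, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
    first <- FALSE
  }
}

res <- read.csv(out_file)
print(dim(res))
```

In the question's code the two merge() calls depend on only one loop index each, so they could also be computed once per replicate before the double loop rather than 281*36 times.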

Adding a row to a dataframe

I am reading a file line by line and then adding specific lines to a dataframe. Here is an example of a line I would add to a dataframe:
ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C
I have verified that my checks are OK; I think the problem has specifically to do with my rbind command. Thanks for your help!
Edit: The problem is as follows. The output of the data frame is:
Residue AtomCount SideChain XCoord YCoord ZCoord
2 MET 1 A 62.935 97.579 30.223
21 <NA> 2 A 63.155 95.525 27.079
3 <NA> 3 A 65.289 96.895 24.308
It seems like it stops picking up the name of the residue.
The code I am using is:
get.positions <- function(sourcefile, chain_required = "A"){
positions = data.frame()
visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
type = substr(input,14,15)
if(type == "CA"){
#if there are duplicates it takes the first one
residue = substr(input,18,20)
type_of_chain = substr(input,22,22)
atom_count = strtoi(substr(input, 23,26))
if(atom_count >=1){
if(type_of_chain == chain_required && !(atom_count %in% visited) ){
position_string = trim(substr(input,30,54))
position_string = lapply(unlist(strsplit(position_string," +")),as.numeric)
positions<- rbind(positions, list(residue, atom_count, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]]))
}
}
}
}
}
return (positions)
}
When I ran your code with that data I got type == "LU" (so it failed the type == "CA" test), and the rest of the processing was never carried out. I think you may need to change the indices to
type = substr(input,10,11)
Fixing that problem brings up others, and it's going to be very difficult to fix them all since the goal is not clearly stated. I suggest that you edit your code and data so the example is reproducible. This could be a reproducible input/execution method:
get.positions(textConnection("ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C") )
In the end, the following worked. First I made a much larger data frame, and then just replaced specific rows (thank you Joran, who linked me to The R Inferno).
For the user who asked why I am splitting on a plus: your assumption is incorrect. The pattern is actually " +", that is, space-plus, so it splits on runs of one or more spaces. Finally, as for the incorrect indices, I've finally figured out how to show the extra spaces on the form. Here is the correct original line; you will see the indices match.
ATOM      2  CA  MET A   1      62.935  97.579  30.223  1.00 37.58           C
The R code that works, is as follows.
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(NA, N),SideChain=rep(NA, N),XCoord=rep(NA, N),YCoord=rep(NA, N),ZCoord=rep(NA, N),stringsAsFactors=FALSE)
visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
type = substr(input,14,15)
if(type == "CA"){
#if there are duplicates it takes the first one
residue = substr(input,18,20)
type_of_chain = substr(input,22,22)
atom_count = strtoi(substr(input, 23,26))
if(atom_count >=1){
if(type_of_chain == chain_required && !(atom_count %in% visited) ){
visited <- c(visited, atom_count)
AACount <- AACount + 1
position_string = trim(substr(input,30,54))
position_string = lapply(unlist(strsplit(position_string," +")),as.numeric)
#print(input)
positions[AACount,]<- c(residue, atom_count, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
}
}
}
}
}
positions<-positions[1:AACount,]
return (positions)
}
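Since the ATOM records are fixed-width, the same extraction can also be sketched without a row-by-row loop, by applying substr() to all lines at once. This is an alternative technique, not the code above. pdb_line is a hypothetical helper used only to build correctly spaced test lines; the first record reuses the values from the line shown earlier and the other two are made up:

```r
# Hypothetical helper: build an ATOM line with the fixed column positions used above
pdb_line <- function(serial, atom, res, chain, seq, x, y, z) {
  sprintf("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f",
          serial, atom, res, chain, seq, x, y, z)
}

filedata <- c(
  pdb_line(2L, " CA", "MET", "A", 1L, 62.935, 97.579, 30.223),
  pdb_line(3L, " CB", "MET", "A", 1L, 61.000, 96.000, 29.000),  # not a CA atom
  pdb_line(9L, " CA", "GLU", "A", 2L, 63.155, 95.525, 27.079)
)

# Keep CA atoms of chain A, using the same column positions as get.positions()
keep <- substr(filedata, 1, 4) == "ATOM" &
  substr(filedata, 14, 15) == "CA" &
  substr(filedata, 22, 22) == "A"
ca <- filedata[keep]

positions <- data.frame(
  Residue   = substr(ca, 18, 20),
  AtomCount = as.integer(trimws(substr(ca, 23, 26))),
  SideChain = substr(ca, 22, 22),
  XCoord    = as.numeric(substr(ca, 31, 38)),
  YCoord    = as.numeric(substr(ca, 39, 46)),
  ZCoord    = as.numeric(substr(ca, 47, 54)),
  stringsAsFactors = FALSE
)

# If there are duplicates, keep the first one, as in the loop version
positions <- positions[!duplicated(positions$AtomCount), ]
print(positions)
```

Building the data frame in one call keeps the coordinate columns numeric, whereas filling a row with c() coerces every value in that row to character.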
