Ordering the data based on 2 columns in a data frame - r

I have a data frame x with columns a,b,c.
I am trying to order the data based on columns a and c and my ordering criteria is ascending on a and descending on c. Below is the code I wrote.
z <- x[order(x$a,-x$c),]
This code gives me a warning as below.
Warning message:
In Ops.factor(x$c) : - not meaningful for factors
Also, when I check the data frame z using head(z), it gives me wrong data as below:
30708908 0.3918980 NA
22061768 0.4022183 NA
21430343 0.4118651 NA
21429828 0.4134888 NA
21425966 0.4159323 NA
22057521 0.4173094 NA
Initially there weren't any NA values in column c of the data frame x. I have gone through a lot of threads but couldn't find a solution. Can anybody please suggest one?

try this
install.packages('plyr');
library('plyr');
z<-arrange(x,a,desc(c));
In addition, you can use the
options(stringsAsFactors = FALSE)
before you create your frame, or while creating your 'x' data frame, specify
stringsAsFactors = FALSE

z <- x[order(x$a, -xtfrm(as.character(x$c))), ]  # unary minus needs a numeric key, so rank the labels with xtfrm() first
z
If, as Roman suspects, you have digits in your factor levels, you may need to do as he suggests and add as.numeric; otherwise 9 will be greater than 10
z <- x[order(x$a,-as.numeric(as.character(x$c)) ), ]
z
But if they are characters, then you will again get all NAs, so it really depends on the nature of the levels of x$c
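As a minimal sketch of the two cases above (the data frame below is made up for illustration, it is not the asker's data): when the factor labels are digits, convert the labels to numbers before negating; when they are arbitrary text, one option is order() with method = "radix", which accepts a separate decreasing flag per sort key.

```r
# Hypothetical data: 'c' is a factor whose labels are digits stored as text
x <- data.frame(a = c(2, 1, 1, 2),
                b = c(0.4, 0.3, 0.2, 0.1),
                c = factor(c("9", "10", "3", "10")))

# Numeric labels: convert the labels (not the internal codes) to numbers,
# then negate to get descending order on c within ascending a
z <- x[order(x$a, -as.numeric(as.character(x$c))), ]

# Text labels: radix sort with one 'decreasing' flag per key
# (here "3" sorts above "10" because the comparison is character-based)
z2 <- x[order(x$a, as.character(x$c),
              decreasing = c(FALSE, TRUE), method = "radix"), ]
```

Note that the second form compares strings, so it is only appropriate when the levels are not meant to be read as numbers.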

Related

Combine table and matrix with R

I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem is that the content of the table varies with the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,] NA NA NA NA NA NA NA NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
1 1 1 1 1 4
Due to this, when I try to join the data of the matrix with the data of the table, I get the following error
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First check whether the column names of the table exist in the matrix.
For the names that do exist, set the values as usual, indexing the matrix columns by name.
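A minimal sketch of that idea, assuming y is a one-dimensional table whose names are a subset of the matrix column names (the objects below are reconstructed from the printed output in the question, so treat them as hypothetical):

```r
# Hypothetical reconstruction of the objects shown in the question
cols <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
          "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA, nrow = 1, ncol = length(cols),
                          dimnames = list(NULL, cols))
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Keep only the names present in both objects, then assign by name;
# columns missing from the table simply stay NA
common <- intersect(names(y), colnames(random.evaluate))
random.evaluate[1, common] <- as.vector(y[common])
```

Assigning by name avoids the length mismatch, because only the matching columns are touched.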

Extracting values from matrix to add to a dataframe column using dplyr

I am using mutate to add a fifth (h) and sixth (d) column to my data frame, which contains 37975 rows and 4 columns named i, j, k, x. For d, I am picking values from a matrix that was originally a raster. Adding this column (d) is causing problems. I receive the following error when I try to view the data frame.
Error: cannot allocate vector of size 9.3 Gb Error: no more error
handlers available (recursive errors?); invoking 'abort' restart
I use the following code to do this:
vxfile <- vxfile %>%
select(1:4) %>%
mutate(h = k+as.numeric(sub(".*\\s","",txt)),
d = as.vector(dt_mat[35-j, i+1]))
When I check lengths of the columns, column d is different. It is 1442100625 (37975 x 37975) while others are obviously 37975
When I check the class of each columns, column d is "matrix" "array" and others are numeric
When I check the str(vxfile), column d is $ d : num [1:37975, 1:37975] NA NA NA NA NA NA NA NA NA NA ...
Clearly the problem is with how I am picking the value from the matrix. Could someone explain why this is causing this problem?
I tried dt_mat[cbind(35-j, i+1)] and it seems to be working
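The difference is probably easiest to see on a tiny matrix. In this sketch (m, i and j are made-up values, not the asker's data), indexing with two vectors returns the full cross product of rows and columns, while indexing with a two-column matrix built by cbind() returns one element per (row, column) pair:

```r
m <- matrix(1:12, nrow = 3)   # 3 x 4 matrix, filled column by column
i <- c(1, 2, 3)               # row indices
j <- c(2, 3, 4)               # column indices

# Two index vectors: every row in i crossed with every column in j (3 x 3)
block <- m[i, j]

# One two-column index matrix: element-wise pairs m[1,2], m[2,3], m[3,4]
pairs <- m[cbind(i, j)]
```

With 37975 rows, the first form would produce a 37975 x 37975 result, which matches the size reported in the question.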

Replace NA for multiple columns with average of values from other dataframe

I am trying to replace NA values in multiple columns of dataframe x1 with the average of the values from dataframes x2 and x3, matched on the common, unique attribute 'ID'.
All the dataframes(each dataframe is for a particular year) have the same column structure:
ID A B C .....
01 2 5 7 .....
02 NA NA NA .....
03 5 4 8 .....
I have found an answer to do it for 1 column at a time, thanks to this post.
x1$A[is.na(x1$A)] <- (x2$A[match(x1$ID[is.na(x1$A)],x2$ID)] + x3$A[match(x1$ID[is.na(x1$A)],x3$ID)])/2
But since I have about 100 columns to apply this to, I would really like a smarter way to do it.
I tried the suggestions from this post and also from here.
I came up with this code, but couldn't make it work.
x1[6:105] = as.data.frame(lapply(x1[6:105], function(x) ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)]+x3$x[match(x1$ID, x3$ID)])/2, x1$x)))
Got the following error:
Error in ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)] + x3$x[match(x1$ID, : replacement has length zero
I initially thought function(x) worked on the entire column and x represented the column name, but I think it represents each individual cell value, and that is why it won't work.
I am a novice in R, so I would appreciate some guidance on where I am going wrong in applying the logic to multiple columns.
for (i in 1:ncol(x1)) {
nas <- is.na(x1[,i]) # where are NAs
if (sum(nas)==0) next
ids <- x1$ID[nas] # ids of NAs
nam <- colnames(x1)[i] # colname of the column
x1[nas, i] <- (x2[match(ids, x2$ID), nam] + x3[match(ids, x3$ID), nam]) / 2 # match on ID, the key column from the question
}
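For what it's worth, the lapply attempt in the question fails because inside function(x) the argument x holds the column's values, not its name, so x2$x looks for a column literally called "x" and returns NULL. A small sketch that works with column names instead (the three data frames are invented for illustration and have only two value columns):

```r
# Hypothetical miniature versions of the three yearly data frames
x1 <- data.frame(ID = c("01", "02", "03"), A = c(2, NA, 5), B = c(5, NA, 4))
x2 <- data.frame(ID = c("03", "02", "01"), A = c(1, 4, 1), B = c(2, 6, 2))
x3 <- data.frame(ID = c("01", "02", "03"), A = c(3, 6, 3), B = c(4, 8, 4))

cols <- setdiff(names(x1), "ID")   # every column except the key

# Align x2 and x3 to the row order of x1 once, then average them
m2 <- as.matrix(x2[match(x1$ID, x2$ID), cols, drop = FALSE])
m3 <- as.matrix(x3[match(x1$ID, x3$ID), cols, drop = FALSE])
avg <- (m2 + m3) / 2

# Fill only the NA cells of x1, one column at a time
for (nm in cols) {
  nas <- is.na(x1[[nm]])
  x1[[nm]][nas] <- avg[nas, nm]
}
```

In the real data, cols would be the roughly 100 value columns (for example names(x1)[6:105], going by the indices used in the question).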

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the Problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each values finds the way to the right position in the dataframe. The issue is that there are not equally many values since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) of each with the other, and if they are the same write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so every following ifelse yields an NA.
I tried to circumvent it somehow by editing the ifelse(test, yes, no) NO with j=j+1 to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
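If installing a package is not an option, roughly the same reshaping can be sketched in base R with tapply. This is a swapped-in alternative to acast/spread, not what those functions do internally, and it assumes each Sample/Detector pair occurs at most once (otherwise FUN decides how replicates are combined):

```r
# The example data from the question
DF <- data.frame(Sample   = c(1, 1, 2, 3, 3, 3),
                 Detector = c("A", "B", "A", "A", "B", "C"),
                 Value    = c(1, 2, 3, 2, 3, 1))

# Rows = samples, columns = detectors; empty combinations become NA
wide <- with(DF, tapply(Value, list(Sample, Detector), FUN = mean))

# Convert the matrix to a data frame if one is needed
result <- as.data.frame(wide)
```

The row names of result are the sample names and the column names are the detectors, as in the desired output.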

R: collapse two time series to create vectors containing only the points where both exist

Using R, I wish to:
Take two time series:
a<-c(NA,1,2,3,NA,5)
b<-c(0,NA,6,7,NA,NA)
I would like to end up with
aa<-c(2,3), bb<-c(6,7)
or alternately
aa<-c(NA,NA,2,3,NA,NA)
The genesis for this question lies in a 'feature' of the ccf/acf function in R. The mean of a series is calculated prior to checking for the existence of mutual data points. The default fails on NA values, but if na.action=na.pass, this can result in correlation coefficients greater than one.
Although my actual data are time series, I am not currently interested in time-lagged ACF. I am only interested in spatial cross correlation between disparate data sets, so the loss of absolute temporal data inherent in this approach is not important. I wish to run the CCF with vectors in which the unusable data has already been 'knocked out'.
The actual data sets are ~ 10,000 points each x 20 sets
Thank you in advance for advice
You can use standard subsetting and is.na to find where both have non NA elements.
a[!is.na(a)&!is.na(b)]
[1] 2 3
b[!is.na(a)&!is.na(b)]
[1] 6 7
The question has already been answered by @James. You can create the alternative versions with the following commands:
idx <- ! (a + b) * 0
a[idx]
# [1] NA NA 2 3 NA NA
b[idx]
# [1] NA NA 6 7 NA NA
index = !is.na(a) & !is.na(b)
aa = a[index]
bb = b[index]
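The same mask can also be built with complete.cases from base R. The sketch below uses the two vectors from the question; the zero-lag correlation at the end is only there to illustrate the use case, and cor(..., use = "complete.obs") is a swapped-in shortcut, not the ccf function the asker mentions:

```r
a <- c(NA, 1, 2, 3, NA, 5)
b <- c(0, NA, 6, 7, NA, NA)

# TRUE only where both series have a value
keep <- complete.cases(a, b)

# Collapsed versions
aa <- a[keep]
bb <- b[keep]

# Same-length versions with everything else knocked out
aa_na <- ifelse(keep, a, NA)
bb_na <- ifelse(keep, b, NA)

# Zero-lag correlation on the shared points, computed two ways
r1 <- cor(aa, bb)
r2 <- cor(a, b, use = "complete.obs")
```

Both r1 and r2 use only the points where a and b are both present, so they agree.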
1) Combine them both into a data.frame, remove rows with any NAs in them and then extract back out what is left:
both <- data.frame(a, b)
both <- na.omit(both) ##
aa <- both$a
bb <- both$b
To get the representation with NAs in it replace the second line (marked with ##) with:
both[is.na(rowSums(both)), ] <- NA
2) Alternately consider a time series representation:
library(zoo)
z <- zoo(cbind(a, b))
z <- na.omit(z) ##
aa.z <- z$a
bb.z <- z$b
To get the representation with NAs in it replace the second line (marked with ##) with:
z[is.na(rowSums(both))] <- NA
3) The representation with NAs in the output could also be done using "ts" class:
tt <- ts(cbind(a, b))
tt[is.na(rowSums(both))] <- NA
aa.ts <- tt[, "a"]
bb.ts <- tt[, "b"]
Note: Depending on how you use them afterwards, you might not need to extract them at the end, i.e. you might not need the last two lines in each solution.
ADDED additional solutions.
