Datamatrix in R - extracting data from scatterplot

I am trying to modify an R script, but I have only basic experience with R.
question 1:
In the line for (i in 1:nrow(x)), what does the integer 1 actually do? Changing the value to 2 or higher seems to have a big effect on the output.
question 2:
I have been getting the message:
"Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed"
In general, what might be causing this?
Any help is much appreciated!
question edited:
Say I have a dataframe for plotting scatterplot. The dataframe would be organized in the following fashion (in CSV format):
name ABC EFG
1 32 45
2 56 67
to, say 200 000 entries
I am going to first do a scatterplot, after which I am going to subset a portion of the dataset into A using alphahull and export them as XYZ. The script for doing this:
library(alphahull)  # provides ahull() and inahull()
# plot all data
plot(x = X$ABC,
     y = X$EFG,
     pch = 20)
# subset data using ahull: choose 4 points on the plot
A <- ahull(locator(4, type = "p", pch = 20), alpha = 10000)
# export subset: collect every row of X that lies inside the hull
XYZ <- NULL
for (i in 1:nrow(X)) {
  if (inahull(A, c(X$ABC[i], X$EFG[i]))) XYZ <- rbind(XYZ, X[i, ])
}
I get the following message if the number of data points in the subset that I choose is too large:
Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed

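Question 1 - this is a for loop. 1:nrow(x) is the sequence of row numbers from 1 up to the number of rows, so the body executes once for each row in the matrix or data frame x (not sure what x is here exactly). Changing the 1 to 2 makes the loop start at the second row, so it runs one less time and never touches the first row. Without the rest of the code I can't say much else.
Question 2 - can you post the whole code? if() needs its condition to evaluate to TRUE or FALSE; here the expression evaluated to NA, which means one or more of the values in it is missing.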
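To illustrate the general cause: if() stops with this message whenever its condition evaluates to NA. The sketch below reuses the names p, a and b from the error message with made-up values (they are assumptions, not the actual values inside the function that failed) and shows one way to guard against the missing value.

```r
# Made-up values; p[2] is deliberately missing
p <- c(1, NA)
a <- 0
b <- 1

cond <- p[2] > a + b * p[1]   # any comparison involving NA gives NA
is.na(cond)

# if (NA) raises the error; tryCatch() captures its message
msg <- tryCatch(
  if (cond) "above" else "below",
  error = function(e) conditionMessage(e)
)
msg

# Guard: isTRUE() treats NA as FALSE, so the if() no longer fails
side <- if (isTRUE(cond)) "above" else "not above / unknown"
side
```

In the question's setting, a check such as anyNA(X$ABC) or anyNA(X$EFG) would show whether the input itself contains missing values; whether that is the actual cause there is only a guess.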

Say you have data x
set.seed(123) # for reproducibility
x<-as.data.frame(rnorm(10)) # generate 10 random numbers and store them as a data frame
k<-2 # assign k as 2
for (i in (1:nrow(x))){
cat("this is row",i,"\n")
show (k)
k<-k+i
}
show (k)
this is row 1
[1] 2
this is row 2
[1] 3
this is row 3
[1] 5
this is row 4
[1] 8
this is row 5
[1] 12
this is row 6
[1] 17
this is row 7
[1] 23
this is row 8
[1] 30
this is row 9
[1] 38
this is row 10
[1] 47
> show (k)
[1] 57
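A small sketch of what the starting value changes, using a made-up three-row data frame: 1:nrow(x) visits every row, while 2:nrow(x) skips the first one. seq_len(nrow(x)) is a common, safer spelling, because 1:nrow(x) counts down (1, 0) when x has no rows.

```r
x <- data.frame(v = c(10, 20, 30))   # hypothetical data

# Record which row numbers each loop visits
visited_from_1 <- integer(0)
for (i in 1:nrow(x)) visited_from_1 <- c(visited_from_1, i)

visited_from_2 <- integer(0)
for (i in 2:nrow(x)) visited_from_2 <- c(visited_from_2, i)

visited_from_1   # 1 2 3
visited_from_2   # 2 3

# Edge case: a data frame with zero rows
empty <- x[0, , drop = FALSE]
1:nrow(empty)          # 1 0, so a loop body would run twice
seq_len(nrow(empty))   # integer(0), so a loop body would not run
```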

Related

How can I troubleshoot the delete row function

I am attempting to delete a row like this:
data <- data[-1645,]
However, after running the code, the row is still there. I can tell because there is an outlier in that row that shows up on all my graphs, and when I view the data I can sort a column to easily find the offending outlier. I have had no trouble deleting rows in the past. Has anyone run into anything similar? I do understand the limitations of outlier removal and I don't typically remove outliers; however, for a number of reasons I would like to see what the data look like without this one (in this case, all other values in the response variable are between -1 and 0, and in this row the value is 10^4).
You really need to provide more information, but there are several ways you can troubleshoot the problem. The first one is to print out the line you are removing:
data[1645, ]
Is that the outlier? You did not tell us how you identified the outlier. If lines have been removed from the data frame, the row names are not changed but the index values are changed, e.g.
set.seed(42)
x <- sample.int(25)
y <- sample.int(25)
data <- data.frame(x, y)
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 5 4 10
# 6 18 11
data <- data[-c(5, 10, 15, 20, 25), ]
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 6 18 11
# 7 25 15
data[6, ]
# x y
# 7 25 15
data["6", ]
# x y
# 6 18 11
Notice that the 6th row of the data has a row name of "7" but the row with name "6" is the 5th row in the data frame because we deleted the 5th row. The which function will give you the index value, but if you identified the outlier by looking at the printout, you got the row name and that may be different from the index. If we want to remove values in x greater than 24, here is one way to do that:
data[data$x<25, ]
After playing around with the data, I think the best explanation is that the indexing is off. This is in line with what dcarlson was saying: the code could be removing the 1,645th row, it just isn't labelled as such. I think the best solution is to use subset:
data <- subset(data, Yield.Decline < 100)
This is more robust than removing a row by its position, because the line can be accidentally run multiple times without erroneously removing additional rows.
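The difference between position and row name, and the reason a value-based filter is safe to re-run, can be sketched with the toy data from the answer above (keep_small is a hypothetical helper name used only for this illustration):

```r
set.seed(42)
data <- data.frame(x = sample.int(25), y = sample.int(25))
data <- data[-c(5, 10, 15, 20, 25), ]   # row names now skip 5, 10, ...

rownames(data)[6]              # "7": position 6 carries the label "7"
which(rownames(data) == "6")   # 5: the label "6" sits at position 5

by_position <- data[-6, ]                      # drops the row labelled "7"
by_name     <- data[rownames(data) != "6", ]   # drops the row labelled "6"

# Filtering by value gives the same result however often it is run
keep_small <- function(d) d[d$x < 25, ]
once  <- keep_small(data)
twice <- keep_small(keep_small(data))
identical(once, twice)         # TRUE
```

Running data <- data[-6, ] twice, by contrast, would remove two different rows.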

Looping through items on a list in R

This may be a simple question, but I'm fairly new to R.
What I want to do is perform some kind of addition on the indexes of a list, so that once an index goes past the maximum value it goes back to the first value in that list and starts over from there.
For example:
x <-2
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
data[x]
1
data[x+12]
1
data[x+13]
3
or something functionally equivalent. In the end I want to be able to do something like
v=6
x=8
y=9
z=12
values <- c(v,x,y,z)
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
set <- c(data[values[1]],data[values[2]], data[values[3]],data[values[4]])
set
5 7 8 11
values <- values + 8
set
1 3 4 7
I've tried some things with addition and subtraction of the length of my list, but that does not work well on the lower numbers.
I hope this was a clear enough explanation.
thanks in advance!
We don't need a loop here as vectors can take vectors of length >= 1 as index
data[values]
#[1] 5 7 8 11
NOTE: Both objects are vectors, not lists.
If we need to wrap the index around after shifting it
values <- values + 8
data[ifelse(values > length(data), values - length(data), values)]
#[1] 1 3 4 7
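The wrap-around can also be written with modular arithmetic, which keeps working when the shift is larger than the length of the vector. This is a minimal sketch; wrap_index is a hypothetical helper name, not a base R function.

```r
data   <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
values <- c(6, 8, 9, 12)

# Map any positive index onto 1..n: shift to 0-based, take the
# remainder, shift back to R's 1-based indexing
wrap_index <- function(idx, n) (idx - 1) %% n + 1

set1 <- data[wrap_index(values, length(data))]
set1   # 5 7 8 11

set2 <- data[wrap_index(values + 8, length(data))]
set2   # 1 3 4 7

# Still fine after several full laps around the vector
set3 <- data[wrap_index(values + 8 + 3 * length(data), length(data))]
```

The "- 1" and "+ 1" are needed because %% alone would map an index equal to the length onto 0, which R silently drops.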

Function meaning, how is it deleting zero expression genes?

I'm working with an expression matrix obtained by single-cell RNA sequencing, but I have a question about some R code a colleague sent me...
sort(unique(1 + slot(as(data_matrix, "dgTMatrix"), "i")))
# there are no more details in the code...
In theory, this function deletes non-expressed genes (those that are zero in all samples, I think...), but I can't work out how it does it. Can anyone give me a tip?
Well, I think I have understood this code... let me try to explain it! (Please correct me if I'm wrong.)
Our data is stored as a sparse matrix (i.e. handier with regard to memory), and with as() it is coerced to a specific format for this kind of matrix (triplet format for sparse matrices): three columns holding the i and j indices and the value x of each non-zero entry.
y <- matrix_counts # sparse matrix
AAACCTGAGAACAACT-1 AAACCTGTCGGAAATA-1 AAACGGGAGAGCTGCA-1
ENSG00000243485 1 . .
ENSG00000237613 . . 2
y2 <- as(y, "dgTMatrix") #triplet format for sparse matrix
i j x
1 9 1 1 #in row(9) and column(1) we have the value 1
2 50 1 2
3 60 1 1
4 62 1 2
5 78 1 1
6 87 1 1
Next, it takes only the "i" column (slot(data, "i")), because we only need the row indices (to know which rows contain non-zero values), and deletes duplicates (unique) to finally obtain a vector of row indices, which is used to filter the raw data:
y3 <- unique(1 + slot(as(exprs(gbm), "dgTMatrix"), "i"))
[1] 9 50 60 62 78 87
data <- data_raw[y3,]
I am a bit confused by sort and 1+, but I think this is the basic concept. So, to summarize, we take the row indices of the non-zero rows (genes) and use them to filter our raw data... another original method for deleting non-expressed genes, interesting!
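On the two open points: the i slot of a triplet matrix in the Matrix package stores 0-based row indices, so 1 + converts them to R's 1-based row numbers, and sort puts them back in ascending order so the filtered rows keep their original order. A small sketch on a made-up 3 x 3 matrix (the values are invented; as(m, "TsparseMatrix") is used here as the class-agnostic spelling of the same coercion, which is an assumption about what suits your Matrix version):

```r
library(Matrix)

# Toy matrix: rows 1 and 3 have non-zero entries, row 2 is all zero
m <- sparseMatrix(i = c(1, 3, 3), j = c(3, 1, 3), x = c(3, 1, 2),
                  dims = c(3, 3))

trip <- as(m, "TsparseMatrix")   # triplet representation
zero_based <- slot(trip, "i")    # 0-based row index of each non-zero entry

keep <- sort(unique(1 + zero_based))   # 1-based rows with any non-zero entry
keep                                   # 1 3

filtered <- m[keep, , drop = FALSE]    # the all-zero row is gone

# Possibly more readable alternative giving the same rows
keep_alt <- which(rowSums(m != 0) > 0)
```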

R: Large dataframe: Need to sort all elements and then output with position info

I have a large (500 * 21000) dataframe consisting of numeric values. I would like some help in doing this task most efficiently:
Essentially, I would like to sort the items in the dataframe and get output with index information, i.e. for the largest element, say, I would also like to know its position in the dataframe. I need this info for all elements in the dataframe, not just the largest/smallest (in which case I could easily get it from the summary call).
I can think of ways to program this, but I am wondering if there is some built-in utility in R to do it.
Thanks!
Your question is very vague. But this can be a starting point for you.
> set.seed(345)
# Create a dataframe
> newdf <- data.frame(x = rnorm(n=100,mean=2.5,sd=2.5),
+ y = rnorm(n=100,mean=4.5,sd=10),
+ z = rnorm(n=100,mean=3.8,sd=1))
> head(newdf)
x y z
1 0.5377296 -9.1446883 3.008115
2 1.8012141 -0.3508551 3.681795
3 2.0963553 13.3248010 4.116340
4 1.7735086 3.0728637 5.545473
5 2.3311710 -5.3247035 3.733314
6 0.9161990 9.3002188 3.763627
>
# Find the maximum on each column
> sapply(newdf,max)
x y z
9.545697 31.851232 5.956058
# Find the location of maximum value on each column
> sapply(newdf,which.max)
x y z
85 87 79
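If positions are needed for every element and not only the column maxima, one possible approach (a sketch on a small made-up data frame, assuming all columns are numeric) is to treat the data as a matrix, order all cells at once, and translate the linear positions back to row and column with arrayInd():

```r
set.seed(345)
df <- data.frame(x = rnorm(5), y = rnorm(5), z = rnorm(5))   # toy data

m   <- as.matrix(df)                # all-numeric data frame -> numeric matrix
ord <- order(m, decreasing = TRUE)  # linear positions, largest value first
pos <- arrayInd(ord, dim(m))        # matching (row, column) pairs

sorted <- data.frame(
  value = m[ord],
  row   = pos[, 1],
  col   = colnames(m)[pos[, 2]]
)
head(sorted)
```

Each line of sorted then holds one value together with the row number and column name where it was found.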

Find indices of 5 closest samples in distance matrix

Users
I have a distance matrix dMat and want to find the 5 nearest samples to the first one. What function can I use in R? I know how to find the closest sample (cf. 3rd line of code), but can't figure out how to get the other 4 samples.
The code:
Mat <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))
which(dMat[,1]==min(dMat[,1]))
The 3rd line of code finds the index of the closest sample to the first sample.
Thanks for any help!
Best,
Chega
You can use order to do this:
head(order(dMat[-1,1]),5)+1
[1] 10 3 4 8 6
Note that I removed the first one, as you presumably don't want to include the fact that your reference point is 0 distance away from itself.
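A reproducible version of the same idea, as a sketch: the seed value is arbitrary (it is not from the question), so the indices will differ from the ones printed above.

```r
set.seed(1)   # arbitrary seed, only to make the run repeatable
Mat  <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))

# Drop sample 1 itself, rank the remaining distances, then add 1 to
# shift the positions back to row numbers of the full matrix
nearest <- head(order(dMat[-1, 1]), 5) + 1
nearest

# The distances to sample 1, in increasing order
dMat[nearest, 1]
```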
Alternative using sort:
sort(dMat[,1], index.return = TRUE)$ix[1:6]
It would be nice to add a set.seed(.) when using random numbers in matrix so that we could show the results are identical. I will skip the results here.
Edit (correct solution): The above solution will only work if the first element is always the smallest! Here's the correct solution that will always give the 5 closest values to the first element of the column:
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
Example:
> dMat <- matrix(c(70,4,2,1,6,80,90,100,3), ncol=1)
# James' solution
> head(order(dMat[-1,1]),5) + 1
[1] 4 3 9 2 5 # values are 1,2,3,4,6 (wrong)
# old sort solution
> sort(dMat[,1], index.return = TRUE)$ix[1:6]
[1] 4 3 9 2 5 1 # values are 1,2,3,4,6,70 (wrong)
# Correct solution
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
[1] 6 7 8 5 2 # values are 80,90,100,6,4 (right)
