Extract a portion of one column from a data.frame/matrix in R

I get flummoxed by some of the simplest of things. In the following code I wanted to extract just a portion of one column in a data.frame called 'a'. I get the right values, but the final result is padded with NAs, which I don't want. 'b' is the extracted column; 'c' holds the correct portion of data but has extra NA padding at the end.
How do I best do this so that 'c' ends up being only 9 elements long? (i.e. the 15 original minus the 6 I skipped)
NumBars = 6
a = as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,2] = c(11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)
names(a)[1] = "Data1"
names(a)[2] = "Data2"
# Use 1st column of data only
b = as.matrix(a[,1])
c = as.matrix(b[NumBars+1:length(b)])

The immediate reason you're getting NAs is that the sequence operator : takes precedence over the addition operator +, as is detailed in the R Language Definition. Therefore NumBars+1:length(b) is not the same as (NumBars+1):length(b). The first adds NumBars to each element of the vector 1:length(b), while the second performs the addition first and then builds the sequence.
ind.1 <- 1+1:3 # == 2:4
ind.2 <- (1+1):3 # == 2:3
When you index with this longer vector, you get all the elements you want, but you are also asking for entries past the end of b, like b[length(b)+1], which the R Language Definition tells us returns NA. That's why you have trailing NAs.
If i is positive and exceeds length(x) then the corresponding
selection is NA. A negative out of bounds value for i causes an error.
b <- c(1,2,3)
b[ind.1]
#[1] 2 3 NA
b[ind.2]
#[1] 2 3
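For completeness, here is a sketch applying the parentheses fix to the original code from the question (where b is the extracted 15-element column):
c = as.matrix(b[(NumBars+1):length(b)])
length(c)
#[1] 9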
From a design perspective, the other solutions listed here are good choices to help avoid this mistake.

It is often easier to think of what you want to remove from your vector / matrix. Use negative subscripts to remove items.
c = as.matrix(b[-1:-NumBars])
c
## [,1]
## [1,] 7
## [2,] 8
## [3,] 9
## [4,] 10
## [5,] 11
## [6,] 12
## [7,] 13
## [8,] 14
## [9,] 15

If your goal is to remove NAs from a column, you can also do something like
c <- na.omit(a[,1])
E.g.
> x
[1] 1 2 3 NA NA
> na.omit(x)
[1] 1 2 3
attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"
You can ignore the attributes - they are there to let you know what elements were removed.
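If you'd rather have a plain vector without those attributes, one option is to wrap the result in as.vector:
> as.vector(na.omit(x))
[1] 1 2 3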

Related

Using distGeo with two sets of coordinates

I have two sets of coordinates (loc and stat) both in the following format
         x         y
1 49.68375  8.978462
2 49.99174  8.238287
3 51.30842 12.411870
4 50.70487  6.627252
5 50.70487  6.627252
6 50.37381  8.040766
For each location in the first data set (location of observation) I want to know the location in the second data set (weather stations), that is closest to it. Basically matching the locations of observations to the closest weather station for later analysis of weather effects.
I tried using the distGeo function simply by putting in
distGeo(loc, stat, a=6378137, f=1/298.257223563)
But that didn't work, because loc and stat are not in the right format.
Thanks for your help!
Try this:
outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
      function(a,b) geosphere::distGeo(loc[a,], stat[b,]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 88604.79 419299.1 283370.9 283370.9 128560.08
# [2,] 88604.79 0.00 483632.9 194784.6 194784.6 47435.65
# [3,] 419299.12 483632.85 0.0 643230.3 643230.3 494205.86
# [4,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [5,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [6,] 128560.08 47435.65 494205.9 160540.6 160540.6 0.00
Brief explanation:
outer(1:3, 1:4, ...) produces two vectors that are a cartesian product, very similar to
expand.grid(1:3, 1:4)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
(using expand.grid only for demonstration of the expansion)
The anonymous function I defined (function(a,b)...) is called once, where a is assigned the integer vector c(1,2,3,1,2,3,1,2,3,1,2,3) (using my 1:3 and 1:4 example), and b is assigned the integer vector c(1,1,1,2,2,2,3,3,3,4,4,4).
Within the anonymous function, loc[a,] results in a much longer frame: if loc has m rows and stat has n rows, then loc[a,] will have m*n rows; similarly, stat[b,] will have m*n rows. This works well because distGeo (and the other dist* functions in geosphere) operates in one of two ways:
If either of the arguments has 1 row, its distance is calculated against all rows of the other argument. Unfortunately, unless you know that loc or stat will always have just one row, this method doesn't work.
Otherwise, both arguments must have the same number of rows, and the distance is calculated pairwise (1st row of 1st arg with 1st row of 2nd arg; 2nd row of 1st arg with 2nd row of 2nd arg; etc.). This is the method we're prepared for.
In general, the anonymous function given to outer must deal with vectorized arguments on its own. For instance, if you needed distGeo to be called once for each pair (so it would be called m*n times), you would have to handle that yourself; outer will not do it for you. There are constructs in R that support this (e.g., mapply, Map) or that replace outer (Map, expand.grid, and do.call), but that's for another question.
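To finish the original task of matching each observation to its closest station, you can then take the index of the minimum in each row. A sketch, assuming dmat holds the matrix returned by the outer() call above (the names dmat, nearest, and nearest_dist are just illustrative):
dmat <- outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
              function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
nearest <- apply(dmat, 1, which.min)                       # row in stat closest to each row of loc
nearest_dist <- dmat[cbind(seq_len(nrow(loc)), nearest)]   # corresponding distances in metres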

Why does is.na() change its argument?

I just discovered the following behaviour of the is.na() function which I don't understand:
df <- data.frame(a = 5:1, b = "text")
df
## a b
## 1 5 text
## 2 4 text
## 3 3 text
## 4 2 text
## 5 1 text
is.na(df)
## a b
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
is.na(df) <- "0"
df
## a b 0
## 1 5 text NA
## 2 4 text NA
## 3 3 text NA
## 4 2 text NA
## 5 1 text NA
My question
Why does is.na() change its argument (and in this case add an extra column to the data frame)? In this case its behaviour seems extra puzzling (or at least unexpected) because the result of the query is FALSE for all instances.
NB
This question is not about subsetting and changing the NA values in a data frame - I know how to do that (df[is.na(df)] <- "0"). This question is about the behaviour of the is.na function! Why does an assignment to an is.something function change the argument itself? This is unexpected.
The actual function being used here is not is.na() but the assignment function `is.na<-`, for which the default method is `is.na<-.default`. Printing that function to the console, we see:
function (x, value)
{
    x[value] <- NA
    x
}
So clearly, value is supposed to be an index here. If you index a data.frame like df["0"], it will try to select the column named "0". If you assign something to df["0"], the column will be created and filled with (in this case) NA.
To clarify, `is.na<-` sets values to NA, it does not replace NA values with something else.
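A minimal sketch of the intended use, mirroring x[value] <- NA:
x <- 1:5
is.na(x) <- c(2, 4)  # mark positions 2 and 4 as NA
x
## [1]  1 NA  3 NA  5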

Why does class change from integer to character when indexing a data frame with a numeric matrix?

If I index a data.frame of all integers with a matrix, I get the expected result.
df1 <- data.frame(c1=1:4, c2=5:8)
df1
# c1 c2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
df1[matrix(c(1:4,1,2,1,2), nrow=4)]
# [1] 1 6 3 8
If the data.frame has a column of characters, the result is all characters, even though I'm only indexing the integer columns.
df2 <- data.frame(c0=letters[1:4], c1=1:4, c2=5:8)
df2
# c0 c1 c2
#1 a 1 5
#2 b 2 6
#3 c 3 7
#4 d 4 8
df2[matrix(c(1:4,2,3,2,3), nrow=4)]
# [1] "1" "6" "3" "8"
class(df2[matrix(c(1:4,2,3,2,3), nrow=4)])
# [1] "character"
df2[1,2]
# [1] 1
My best guess is that R doesn't bother to go back through the result and check whether the values all originated from columns of the same class. Can anyone please explain why this is happening?
In ?Extract it is described that indexing via a numeric matrix is intended for matrices and arrays. So it might be surprising that such indexing worked for a data frame in the first place.
However, if we look at the code for [.data.frame (getAnywhere(`[.data.frame`)), we see that when extracting elements from a data.frame using a matrix in i, the data.frame is first coerced to a matrix with as.matrix:
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1)
{
    # snip
    if (Narg < 3L) {
        # snip
        if (is.matrix(i))
            return(as.matrix(x)[i])
Then look at ?as.matrix:
"The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column".
Thus, because the first column in "df2" is of class character, as.matrix will coerce the entire data frame to a character matrix before the extraction takes place.
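If you need the original integer values, one workaround (a sketch, not the only option) is to drop the character column before indexing so that as.matrix never sees it; note that the column indices shift after removing c0:
df2[-1][matrix(c(1:4, 1, 2, 1, 2), nrow=4)]
# [1] 1 6 3 8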

Deleting inverses in a matrix in R

I have initially a matrix, p:
# p is a matrix
p
     A B
[1,] 1 1
[2,] 2 3
[3,] 3 2
[4,] 1 1
[5,] 8 2
For a given matrix, I want to iterate through the rows and remove any inversions (e.g. row 3, "3 2", is the inverse of row 2, "2 3"), so that the new matrix is:
p
     A B
[1,] 1 1
[2,] 2 3
[3,] 8 2
This is what I got:
p <- unique(p)  # gets rid of duplicates
output <- lapply(p, function(x){
  check <- which(p$A[x,] %in% p$B[x,])  # is the value in row x of column A found in
                                        # column B? if so, return the row number it was found in column B
  if (length(check) != 0){
    if (p$A[check,] == p$B[x]){  # now check whether the found row (check) of p$A is equal to p$B[x]
      p <- p[-check,]  # if so, remove that inverse
    }
  }
})
I get this message: Error in which(p$A[x] %in% p$B[x]) :
Why am I getting this error?
Is there a better way to find inversions?
Try
p <- unique(p)
p[!duplicated(apply(p, 1, function(x) paste(sort(x), collapse=''))),]
# A B
#[1,] 1 1
#[2,] 2 3
#[3,] 8 2
data
p <- matrix(c(1,2,3,1,8, 1,3,2,1,2),
            dimnames=list(NULL, c("A", "B")), ncol=2)
It's not clear whether the order of values is important in your final output, but perhaps you can make use of pmin and pmax.
Here's an approach using those functions within "data.table":
library(data.table)
unique(as.data.table(p)[, list(A = pmin(A, B), B = pmax(A, B))])
# A B
# 1: 1 1
# 2: 2 3
# 3: 2 8
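If you prefer to stay in base R, the same pmin/pmax idea can be sketched directly on the matrix:
unique(cbind(A = pmin(p[, "A"], p[, "B"]), B = pmax(p[, "A"], p[, "B"])))
#      A B
# [1,] 1 1
# [2,] 2 3
# [3,] 2 8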
The question is a bit unclear. I am assuming, based on your example, that you want to remove the row containing "3 2" because its first value occurs in the second column (in a different row). In that case
check <- which(p[,1] %in% p[,2])
should return the rows that you want to delete. Your second round of checking is not needed. You could just delete the rows returned.

In R, can out of bounds indexing return NAs on matrices, like it does on vectors?

I would like an out of bounds subscript on a matrix in R to return NAs instead of an error, like it does on vectors.
> a <- 1:3
> a[1:4]
[1] 1 2 3 NA
> b <- matrix(1:9, 3, 3)
> b
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> b[1:4, 1]
Error: subscript out of bounds
>
So I would have liked it to return:
[1] 1 2 3 NA
Right now I am doing this with ifelse tests to see if the index variables exist in the rownames, but on large data structures this is taking quite a bit of time. Here is an example:
s <- split(factors, factors$date)  # split so each date has its own list
names <- last(s)[[1]]$bond         # names of bonds that we want
cdmat <- sapply(names, function(n)
  sapply(s, function(x)
    if (n %in% x$bond) x[x$bond == n, column] else NA))
where factors is an xts with about 250 000 rows. So it's taking about 15 seconds and that's too long for my application.
The reason this is important is that each list element I am applying this to has a different length, but I need to output a matrix with equal length columns as a result of the sapply. I don't want another list out with different length elements.
Actually I have just realised that if I take the column I want and turn it into a vector, this works perfectly. So:
> b[, 1][1:4]
[1] 1 2 3 NA
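If the underlying goal is looking rows up by name (as with the bonds above), another sketch is to build the row index with match(), which returns NA for missing names; NA subscripts then propagate NA rather than erroring (m and idx here are just illustrative):
> m <- matrix(1:9, 3, 3, dimnames = list(c("x", "y", "z"), NULL))
> idx <- match(c("x", "y", "w"), rownames(m))  # NA where a name is missing
> unname(m[idx, 1])
[1]  1  2 NA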
