Choose one cell per row in data frame - r

I have a vector that tells me, for each row in a data frame, the index of the column whose value should be updated in that row.
> set.seed(12008); n <- 10000; d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
> i <- sample.int(3, n, replace=TRUE)
> head(d); head(i)
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
[1] 3 2 2 3 2 1
This means that for rows 1 and 4, c3 should be updated; for rows 2, 3 and 5, c2 should be updated; and so on. What is the cleanest way to achieve this in R using vectorized operations, i.e., without apply and friends? EDIT: And, if at all possible, without R loops?
I have thought about transforming d into a matrix and then addressing the matrix elements with a one-dimensional index vector, but I haven't found a clean way to compute the one-dimensional address from the row and column indices.

With your example data, and using only the first few rows (D and I below), you can easily do what you want via a matrix, as you surmise.
set.seed(12008)
n <- 10000
d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- sample.int(3, n, replace=TRUE)
## just work with small subset
D <- head(d)
I <- head(i)
First, convert D into a matrix:
dmat <- data.matrix(D)
Next, compute the indices into the vector representation of the matrix that correspond to the rows and columns indicated by I. The column indices are given by I itself; the row indices can be generated with seq_along(I), which in this small example is just the vector 1:6. The vector indices can then be computed as:
(I - 1) * nrow(D) + seq_along(I)
where the first part ((I - 1) * nrow(D)) gives the correct multiple of the number of rows (6 here) to reach the start of the I-th column, and adding the row index then gives the position of the desired element within that column.
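As a quick sanity check, with the I shown above (3 2 2 3 2 1) and nrow(D) equal to 6, the formula gives
(I - 1) * nrow(D) + seq_along(I)
# [1] 13  8  9 16 11  6
i.e. the 13th element of the vectorized matrix (row 1 of column 3), the 8th element (row 2 of column 2), and so on.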
Using this we just index into dmat using "[", treating it like a vector. The replacement version of "[" ("[<-") allows us to do the replacement in a single line. Here I replace the indicated elements with NA to make it easier to see that the correct elements were identified:
> dmat
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
> dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA
> dmat
  c1 c2 c3
1  1  2 NA
2  2 NA  6
3  3 NA  9
4  4  8 NA
5  5 NA 15
6 NA 12 18
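The same one-liner applies unchanged to the full data; a minimal sketch using the d and i defined in the question:
dmat_full <- data.matrix(d)
dmat_full[(i - 1) * nrow(d) + seq_along(i)] <- NA
head(dmat_full)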

If you are willing to first convert your data.frame to a matrix, you can index elements-to-be-replaced using a two-column matrix. (Beginning with R-2.16.0, this will be possible with data.frames directly.) The indexing matrix should have row indices in its first column and column indices in its second column.
Here's an example:
## Create a subset of your data
set.seed(12008); n <- 6
D <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- seq_len(nrow(D)) # vector of row indices
j <- sample(3, n, replace=TRUE) # vector of column indices
ij <- cbind(i, j) # a 2-column matrix to index a 2-D array
# (This extends smoothly to higher-D arrays.)
## Convert it to a matrix
Dmat <- as.matrix(D)
## Replace the elements indexed by 'ij'
Dmat[ij] <- NA
Dmat
#      c1 c2 c3
# [1,]  1  2 NA
# [2,]  2 NA  6
# [3,]  3 NA  9
# [4,]  4  8 NA
# [5,]  5 NA 15
# [6,] NA 12 18
Beginning with R-2.16.0, you will be able to use the same syntax for dataframes (i.e. without having to first convert dataframes to matrices).
From the R-devel NEWS file:
Matrix indexing of dataframes by two column numeric indices is now supported for replacement as well as extraction.
Using the current R-devel snapshot, here's what that looks like:
D[ij] <- NA
D
#   c1 c2 c3
# 1  1  2 NA
# 2  2 NA  6
# 3  3 NA  9
# 4  4  8 NA
# 5  5 NA 15
# 6 NA 12 18
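The two-column indexing also works on the full 10,000-row data from the original question; a minimal sketch, assuming the d and i from the question are still in the workspace:
Dmat_full <- as.matrix(d)
Dmat_full[cbind(seq_len(nrow(d)), i)] <- NA
head(Dmat_full)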

Here's one way:
d[which(i == 1), "c1"] <- "one"
d[which(i == 2), "c2"] <- "two"
d[which(i == 3), "c3"] <- "three"
   c1  c2    c3
1   1   2 three
2   2 two     6
3   3 two     9
4   4   8 three
5   5 two    15
6 one  12    18

Related

How do I calculate Euclidean distances across NA values in r

I have a data frame like this
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
df <- data.frame(individual,x,y,frame)
I have an ID column labeled 'individual', xy coordinates, and a frame number.
I need to calculate the Euclidean distances between the x,y coordinates of the rows, skipping over the NA values.
So, in the example I gave - I would need to calculate the distances between rows 1 and 9, as well as 10 and 9. In the real data there would be substantially more rows of course.
Eventually what I need to do is interpolate the data, so that if the Euclidean distance is <5, the missing rows are filled in with the ID of the individual. If the Euclidean distance is >5, then ignore them and interpolate nothing.
Here is the example result data frame that's needed:
individual <- c("1","1","1","1","1","1","1","1","1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
dist_measure <- c(NA,NA,NA,NA,NA,NA,NA,NA,2,2.828427)
df <- data.frame(individual,x,y,frame,dist_measure)
Any advice on an approach to this problem is greatly appreciated. My first thought was to write a function that calculates the Euclidean distance and put it in a for loop, but I'm stuck on how to make it work across the NA values. I thought the lag function from the tidyverse might help, but again I'm not sure how to integrate it into the loop/function.
Thank you in advance.
This should work. I've added another individual into the hypothetical data to show how it works.
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1",
"2",NA,NA,NA,NA,NA,NA,NA,"2","2")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665,
.665,NA,NA,NA,NA,NA,NA,NA,.663,.665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5,
-.4745,NA,NA,NA,NA,NA,NA,NA,-.4745,-.4725)
frame <- rep(1:10, 2)
df <- data.frame(individual,x,y,frame)
for(i in 1:2){
  tmp <- df[min(which(df$individual == as.character(i))):
              max(which(df$individual == as.character(i))), ]
  ends <- range(which(is.na(tmp$individual))) + c(-1, 1)
  if(nrow(tmp) > 1 & ends[1] > 0 & ends[2] <= nrow(tmp)){
    d <- c(dist(tmp[ends, c("x", "y")]))
    if(d < 5){
      df$individual[min(which(df$individual == as.character(i))):
                      max(which(df$individual == as.character(i)))] <- tmp$individual[ends[1]]
    }
  }
}
df
#    individual       x         y frame
# 1           1 665.000 -474.5000     1
# 2           1      NA        NA     2
# 3           1      NA        NA     3
# 4           1      NA        NA     4
# 5           1      NA        NA     5
# 6           1      NA        NA     6
# 7           1      NA        NA     7
# 8           1      NA        NA     8
# 9           1 663.000 -474.5000     9
# 10          1 665.000 -472.5000    10
# 11          2   0.665   -0.4745     1
# 12          2      NA        NA     2
# 13          2      NA        NA     3
# 14          2      NA        NA     4
# 15          2      NA        NA     5
# 16          2      NA        NA     6
# 17          2      NA        NA     7
# 18          2      NA        NA     8
# 19          2   0.663   -0.4745     9
# 20          2   0.665   -0.4725    10
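If you also want the dist_measure column from your expected output (the distance from each observed point back to the previous observed point of the same individual), here is a rough base-R sketch along the same lines, assuming df has already been processed as above:
df$dist_measure <- NA_real_
obs <- which(!is.na(df$x) & !is.na(df$y))  # rows that actually have coordinates
for(k in seq_along(obs)[-1]){
  cur  <- obs[k]
  prev <- obs[k - 1]
  if(identical(df$individual[cur], df$individual[prev])){  # stay within one individual
    df$dist_measure[cur] <- sqrt((df$x[cur] - df$x[prev])^2 +
                                 (df$y[cur] - df$y[prev])^2)
  }
}
On the single-individual example from the question this reproduces dist_measure = NA, ..., NA, 2, 2.828427.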

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
  1   1
  2   2    1
  3   3    1
  4   4    1
  5   5
  6   6
  7   7    2
  8   8    2
  9   9
 10  10
How do I go about doing this? Thank you!
Here is one way to do it using %in% and the fact that logicals coerce to 0/1. If an observation appears in several lists, the last one is taken here:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
   Obs List
1    1   NA
2    2    1
3    3    1
4    4    1
5    5   NA
6    6   NA
7    7    2
8    8    2
9    9   NA
10  10   NA
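An alternative sketch with the same dat and list_all, using match() on the flattened lists; note that if an observation appeared in several lists this would take the first match rather than the last:
ids <- rep(seq_along(list_all), lengths(list_all))  # list number for each element
dat$List <- ids[match(dat$Obs, unlist(list_all))]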

How to use row number inside apply function in R

I want to process a data frame as follows: for each row, take the sum of two vectors and append the result to a new data frame as a row. The two vectors are the row vector of the row under consideration and the column vector that starts just below that row, with a fixed length.
data
A b1 b2 b3
1  2  2  2
2  3  3  3
3  4  4  4
4  5  5  5
5  6  6  6
output (expected)
A b1 b2 b3
1  4  5  6
2  6  7  8
3  8  9  -
4 10  -  -
5  -  -  -
In the example, if the 1st row is considered, the two vectors are
row vector r: [2 2 2]
column vector c: [2, 3, 4]
After taking the transpose of the column vector I can add the two vectors and append the result to a new data frame. This process must be done for all the rows.
The easiest way to do this is a loop, but loops in R are not efficient; an apply function could be used instead. However, in this scenario I would need to know the current row number inside the function.
Is there a way to do this efficiently in R?
1) rollapply We can use rollapply to form a matrix of the sub-vectors of A, prepend a column of zeros, and add the result to m. Note that we pad A with NA values so that the result of rollapply has the appropriate shape.
library(zoo)
m <- cbind(A = 1:5, b1 = 2:6, b2 = 2:6, b3 = 2:6) # input matrix
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, rollapply(A[-1], nc1, c)) + m
giving:
     A b1 b2 b3
[1,] 1  4  5  6
[2,] 2  6  7  8
[3,] 3  8  9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
2) base This solution is similar but does not use any packages. The first two lines are the same as in (1).
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, embed(A[-1], nc1)[, seq(nc1, 1)]) + m
giving:
     A b1 b2 b3
[1,] 1  4  5  6
[2,] 2  6  7  8
[3,] 3  8  9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
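For reference, since the question asks specifically about having the row number available inside an apply-style call, here is a sketch that loops over the row indices with sapply, reusing the padded A from above; it is less elegant than the rollapply/embed versions but makes the current row number r explicit:
t(sapply(seq_len(nrow(m)), function(r)
  m[r, ] + c(0, A[(r + 1):(r + ncol(m) - 1)])))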

Get specific column value for each row

I want to get a "m" length vector that, considering a m x n matrix, for each row, gives the value on the column identified by another column (say column "Z").
I made it using a for loop:
for (i in 1:dim(data.frame)[1]){vector[i] <- data.frame[i,data.frame$Z[i]]}
Do you see a simpler way to code it avoiding the loop?
"apply" is a possibility:
> M <- cbind( matrix(1:15,3,5), "Z"=c(3,1,2) )
> M
                 Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M,1,function(x){x[x["Z"]]})
> v
[1] 7 2 6
>
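A fully vectorized alternative (no apply at all) is two-column matrix indexing, as in the answers to the first question above; a minimal sketch with the same M:
M[cbind(seq_len(nrow(M)), M[, "Z"])]
# [1] 7 2 6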

How to delete rows from a dataframe that contain n*NA

I have a number of large datasets with ~10 columns and ~200,000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.
My Dataframe looks something like this:
ID  q  r  s  t u  v  w  x y  z
A   1  5 NA  3 8  9 NA  8 6  4
B   5 NA  4  6 1  9  7  4 9  3
C  NA  9  4 NA 4  8  4 NA 5 NA
D   2  2  6  8 4 NA  3  7 1 32
I would like to delete the rows that contain more than 2 NA cells, to get:
ID  q  r  s  t u  v  w  x y  z
A   1  5 NA  3 8  9 NA  8 6  4
B   5 NA  4  6 1  9  7  4 9  3
D   2  2  6  8 4 NA  3  7 1 32
complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns, but is there a way to make it non-specific about which columns contain NA and instead look at how many of them do in total?
Alternatively, this dataframe is generated by merging several dataframes using
file1<-read.delim("~/file1.txt")
file2<-read.delim(file=args[1])
file1<-merge(file1,file2,by="chr.pos",all=TRUE)
Perhaps the merge function could be altered?
Thanks
Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA values:
df <- df[rowSums(is.na(df)) < n, ]
In both cases, of course, replace n with the required number.
If dat is the name of your data.frame, the following will return what you're looking for:
keep <- rowSums(is.na(dat)) <= 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of T/F
# note that when adding logicals
# T == 1, and F == 0
rowSums(.)
# quickly computes the total per row
# since your task is to identify the
# rows with a certain number of NA's
rowSums(.) <= 2
# for each row, determine whether the sum
# (which is the number of NAs) is at most
# 2. Returns TRUE/FALSE accordingly
We use the output of this last statement to
identify which rows to keep. Note that it is not necessary to actually store this last logical.
If d is your data frame, try this:
d <- d[rowSums(is.na(d)) <= 2, ]
This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ), ]
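As a quick check, here is a sketch that rebuilds the example data frame from the question (read.table on the pasted text is just a convenient way to do that) and applies the rowSums filter; only row C, with four NAs, is dropped:
dat <- read.table(text = "
ID  q  r  s  t u  v  w  x y  z
A   1  5 NA  3 8  9 NA  8 6  4
B   5 NA  4  6 1  9  7  4 9  3
C  NA  9  4 NA 4  8  4 NA 5 NA
D   2  2  6  8 4 NA  3  7 1 32
", header = TRUE)
dat[rowSums(is.na(dat)) <= 2, ]  # keep rows with at most two NAs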
