How can I pad a vector with NA from the front? - r

I want to extend an existing vector to length n, filling the extra positions with NA. I know I can pad at the end of the vector like so:
v1 <- 1:10
v2 <- diff(v1)
length(v2) <- length(v1)
v2
# 1 1 1 1 1 1 1 1 1 NA
But I want to put the NA values at the beginning instead, in a generic way. For this particular example I can just do
v2 <- c(NA, diff(v1))
# NA 1 1 1 1 1 1 1 1 1
But I was hoping that there exists some base R function or library that provides something like v2 <- pad(v2, n=length(v1), value=NA)
Is there anything like that I can use off the shelf, or do I need to define my own function:
pad <- function(x, n) { # ugly function that doesn't keep the attributes of x
len.diff <- n - length(x)
c(rep(NA, len.diff), x)
}
pad(1:10, 12) # NA NA 1 2 3 4 5 6 7 8 9 10

Assuming v1 has the desired length and v2 is shorter (or the same length), each of the following left-pads v2 with NA values to the length of v1. The first four assume numeric vectors, although they can be modified to work more generally by replacing NA*v1 in the code with rep(NA, length(v1)).
replace(NA * v1, seq(to = length(v1), length = length(v2)), v2)
rev(replace(NA * v1, seq_along(v2), rev(v2)))
replace(NA * v1, seq_along(v2) + length(v1) - length(v2), v2)
tail(c(NA * v1, v2), length(v1))
c(rep(NA, length(v1) - length(v2)), v2)
The fourth is the shortest. The first two and fourth do not involve any explicit arithmetic calculations other than multiplying v1 with NA values. The second is likely slow since it involves two applications of rev.
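As a rough sketch, the fifth approach can be wrapped in a small helper along the lines of the pad function in the question. The name pad_left and its value argument are made up for illustration and are not part of base R; like the original pad, it does not preserve the attributes of x.

```r
# Hypothetical helper (not a base R function): left-pad x to length n.
pad_left <- function(x, n, value = NA) {
  len_diff <- n - length(x)
  # Refuse to truncate: n is assumed to be at least length(x).
  if (len_diff < 0) stop("`n` must be at least length(x)")
  c(rep(value, len_diff), x)
}

v1 <- 1:10
v2 <- pad_left(diff(v1), length(v1))
v2
# NA 1 1 1 1 1 1 1 1 1
```

Passing a different value, for example pad_left(diff(v1), length(v1), value = 0L), pads with that value instead of NA.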

One option is diff from zoo, whose method for zoo objects has an na.pad argument:
library(zoo)
as.vector(diff(zoo(v1), na.pad=TRUE))
#[1] NA 1 1 1 1 1 1 1 1 1

Defining nrValues as the number of NA elements you want at the start of v2, you could use:
n <- length(v1)
v2 <- c(rep(NA,nrValues),v1[nrValues:n])
I'm not aware of a built-in function that does this, so if you intend to do it repeatedly, I would write your own function.

Related

A vector is created from different vectors and I want to find the starting and ending positions of these vectors in the vector created from them

These vectors will always be in increasing order such as 1 ..2 ... 3 ..4. They cannot decrease. Let's say I have three vectors as an example.
v1 <- c(1,3)
v2 <- c(2)
v3 <- c(1,3,4)
And I have a vector that was created from these vectors:
vsum <- c(v2, v1, v3)
Now I want to write code that finds the position where each vector (v1, v2, v3) starts and ends in vsum. In this case, the starting positions would look like
start <- c(1,2,4)
because if I run vsum these are the starting positions of each vector.
2 1 3 1 3 4
the ending position would look like
end <- c(1,3,6)
because these are ending positions
2 1 3 1 3 4
You can wrap your vectors in a list and use lengths with cumsum:
v1 <- c(1,3)
v2 <- c(2)
v3 <- c(1,3,4)
l = lengths(list(v2, v1, v3))
# [1] 1 2 3
start = cumsum(l) - l + 1
# [1] 1 2 4
end = cumsum(l)
# [1] 1 3 6
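One way to sanity-check these positions is to slice vsum with each start/end pair and confirm that the original vectors come back. The sketch below assumes every input vector is non-empty (an empty one would make start exceed end); the names parts and recovered are only illustrative.

```r
v1 <- c(1, 3)
v2 <- c(2)
v3 <- c(1, 3, 4)

# Keep the pieces in the order in which they were concatenated.
parts <- list(v2 = v2, v1 = v1, v3 = v3)
vsum <- unlist(parts, use.names = FALSE)

l <- lengths(parts)
end <- cumsum(l)
start <- end - l + 1

# Slice vsum back into its pieces using the computed positions.
recovered <- Map(function(s, e) vsum[s:e], start, end)
stopifnot(identical(unname(recovered), unname(parts)))
```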

R data.table sum number of columns exceeding threshold

I would like to sum the number of columns whose values exceed a threshold in an observation. Additionally, I would like to specify those column names and thresholds as vectors (cols, th)
Take the example data set:
x <- data.table(x1=c(1,2,3),x2=c(3,2,1))
The goal is to create a new column exceed.count with number of columns in which x1 and x2 exceed a respective threshold. Assuming the case in which the thresholds for both x1 and x2 are 2:
th <- c(2,2)
The function could be defined as:
fn <- function(z,th) (sum(z[,x1]>th[1],z[,x2]>th[2]))
And the number of columns exceeding the thresholds calculated by:
x[,exceed.count:=fn(.SD,th),by=seq_len(nrow(x))]
The results are:
x1 x2 exceed.count
1: 1 3 1
2: 2 2 0
3: 3 1 1
What I would like to do is be able to specify the column names as vector, e.g.
cols <- c("x1","x2")
I was playing around with a function of the form:
fn.i <- function(z,i) (sum(z[,cols[i],with=FALSE] > th[i]))
which works for a single i, but how do I vectorize this across elements of cols? (cols and th will always be the same length)
I think there is an easier way to solve your problem:
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
th<-c(2,2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x))]
Or, taking into account your input (only a subset of columns):
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x)), .SDcols=sd.cols]
Or
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2,2)
x[,exceed.count:=sum(.SD>th[1]),by=seq_len(nrow(x)), .SDcols=sd.cols]
@JonnyCrunch's approach, specifying a subset of columns with .SDcols=sd.cols, works fine (as long as you ensure ncol(x) == length(th), otherwise vector recycling will mess things up).
Here's an alternative with shorter syntax (though it will likely be slower for very wide tables):
x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
There is no need to explicitly specify .SDcols; let it default to all columns.
Define the threshold vector th for all columns, using the don't-care value +Inf in those columns you don't want counted.
> x <- data.table(x0=4:6, x1=1:3, x2=3:1, x3=7:5)
x0 x1 x2 x3
1: 4 1 3 7
2: 5 2 2 6
3: 6 3 1 5
> th <- c(+Inf, 2, +Inf, 2)
> fn <- function(z,th) (z>th)
> x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
x0 x1 x2 x3 exceed.count
1: 4 1 3 7 1
2: 5 2 2 6 1
3: 6 3 1 5 2
Here's one way to get around iteration over rows:
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
thL <- list(x1 = 2, x2 = 2)
nm = names(thL)
x[, n := 0L]
for (i in seq_along(thL)) x[thL[i], on=sprintf("%s>%s", nm[i], nm[i]), n := n + 1L][]
x1 x2 n
1: 1 3 1
2: 2 2 0
3: 3 1 1
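Another way to avoid iterating over rows, offered here as a sketch rather than as something taken from the answers above, is to compare each column with its own threshold using Map and add the resulting logical vectors with Reduce. It assumes cols and th have the same length and are in matching order. The base R part runs on its own; the data.table part is guarded so that it only runs if the package is installed.

```r
cols <- c("x1", "x2")
th <- c(2, 2)

# Base R illustration on a plain data.frame.
df <- data.frame(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
hits <- Map(`>`, df[cols], th)            # one logical vector per column
exceed_count <- Reduce(`+`, hits, 0L)     # starting from 0L keeps the result integer
exceed_count
# 1 0 1

# The same idea inside data.table, without by = seq_len(nrow(x)).
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  x <- data.table(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
  x[, exceed.count := Reduce(`+`, Map(`>`, .SD, th), 0L), .SDcols = cols]
}
```

Because each column is compared as a whole vector, this does one comparison per column instead of one per row.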

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between adjacent columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the difference from the previous column. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
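For a wide data frame, a column-wise subtraction that avoids both apply and the loop is also possible. This is only a sketch: it assumes all columns named in cols are numeric and are listed in their cumulative order, and cols is an illustrative name.

```r
x <- data.frame(id = 1:2, v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))
cols <- c("v1", "v2", "v3")

# Subtract each column's left neighbour from it. The right-hand side is
# evaluated in full before the assignment, so the original values are used.
x[cols[-1]] <- x[cols[-1]] - x[cols[-length(cols)]]
x
#   id v1 v2 v3
# 1  1  4  1  4
# 2  2  1  0  3
```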

Sorting a list of unequal-size vectors in r

Suppose I have several vectors - maybe they're stored in a list, but if there's a better data structure that's fine too:
ll <- list(c(1,3,2),
c(1,2),
c(2,1),
c(1,3,1))
And I want to sort them, using the first number, then the second number to resolve ties, then the third number to resolve remaining ties, etc.:
c(1,2)
c(1,3,1)
c(1,3,2)
c(2,1)
Are there any built in functions that will allow me to do this or do I need to roll my own solution?
(For those who know Python, what I'm after is something that mimics the behavior of sort in Python)
ll <- list(c(1,3,2),
c(1,2),
c(2,1),
c(1,3,1))
I'd prefer using NA for missing values and using rbind.data.frame instead of paste:
sortfun <- function(l) {
l1 <- lapply(l, function(x, n) {
length(x) <- n
x
}, n = max(lengths(l)))
l1 <- do.call(rbind.data.frame, l1)
l[do.call(order, l1)] #order's default is na.last = TRUE
}
sortfun(ll)
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3 1
#
#[[3]]
#[1] 1 3 2
#
#[[4]]
#[1] 2 1
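One caveat with NA padding: with order's default na.last = TRUE, a vector that is a strict prefix of another, such as c(1,3) versus c(1,3,1), sorts after the longer one, whereas Python puts the shorter one first. If the Python behavior matters, a variant along these lines may help. It is a sketch that assumes the vectors themselves contain no NA values, and sort_like_python is a made-up name.

```r
sort_like_python <- function(l) {
  n <- max(lengths(l))
  # Indexing past the end of a vector yields NA, which pads each element to length n.
  padded <- do.call(rbind, lapply(l, function(x) x[seq_len(n)]))
  keys <- unname(as.list(as.data.frame(padded)))
  # na.last = FALSE sorts NA first, so a shorter vector precedes longer
  # vectors that share its prefix.
  l[do.call(order, c(keys, list(na.last = FALSE)))]
}

ll <- list(c(1, 3, 2), c(1, 2), c(2, 1), c(1, 3, 1), c(1, 3))
sorted <- sort_like_python(ll)
# Order of the result: c(1,2), c(1,3), c(1,3,1), c(1,3,2), c(2,1)
```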
Here's an approach that uses data.table.
The result is a rectangular data.table with the rows ordered in the form you described. NA values are filled in where the list item was a different length.
library(data.table)
setorderv(data.table(do.call(cbind, transpose(ll))), paste0("V", 1:max(lengths(ll))))[]
# V1 V2 V3
# 1: 1 2 NA
# 2: 1 3 1
# 3: 1 3 2
# 4: 2 1 NA
This is ugly, but you can use the result on your list with something like:
ll[setorderv(
data.table(
do.call(cbind, transpose(ll)))[
, ind := seq_along(ll)][],
paste0("V", seq_len(max(lengths(ll)))))$ind]

taking the sum of a TRUE/FALSE vector in r

I am analyzing SNP data for a fungus, and I am trying to impute the missing data by changing the Ns to the genotype of the more frequent allele (see below).
newdata is a matrix of my SNPs (rows) and fungal isolates (columns). The genotypes for each SNP are in the 0, 1, and N format, and that is why I am trying to impute the missing genotypes.
newdata_imputed=newdata
for (k in 1:nrow(newdata)){
u=newdata[k,]
x<-sum(u==0)
y<-sum(u==1)
all_freq=y/(x+y)
if (all_freq<0.5){
newdata_imputed[k,]=gsub("N",0,u)
} else{newdata_imputed[k,]=gsub("N",1,u)}
print(k)
}
However, I keep getting this error:
[1] 295
[1] 296
Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
It is obvious that the code runs but stops after encountering a problem. Please, can someone tell me what I am doing wrong? I am a newbie to R, and any advice would be greatly appreciated.
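For illustration, here is a sketch of how the loop could guard against rows that contain no 0 or 1 at all, where all_freq becomes 0/0 = NaN and the if condition fails. The toy matrix is made up, and leaving such rows unchanged is an assumption about the desired behavior rather than something stated in the question.

```r
# Toy genotype matrix (rows = SNPs, columns = isolates); row 2 has no observed calls.
newdata <- matrix(c("1", "1", "N",
                    "N", "N", "N",
                    "0", "0", "N"), nrow = 3, byrow = TRUE)
newdata_imputed <- newdata

for (k in seq_len(nrow(newdata))) {
  u <- newdata[k, ]
  x <- sum(u == "0")
  y <- sum(u == "1")
  # With no observed genotypes, y / (x + y) is NaN and `if` would error,
  # so skip the row and leave it as it is.
  if (x + y == 0) next
  fill <- if (y / (x + y) < 0.5) "0" else "1"
  newdata_imputed[k, u == "N"] <- fill
}
newdata_imputed
```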
@akrun, the reason I used a for loop is that it is nested in another for loop. After using your code:
newdata=as.data.frame(newdata)
u=newdata
all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
indx <- all_freq < 0.5
indx1 <- indx & !is.na(indx)
indx2 <- !indx & !is.na(indx)
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
newdata[] <- lapply(newdata, as.numeric)
I got weird values
newdata[1:10,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 1 1 1 1 1 1 1 1 1 1
Where is the "3" coming from? I should only have 0 or 1.
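One plausible explanation, which the thread itself does not confirm, is that the columns of newdata were factors (as.data.frame converted character data to factors by default before R 4.0.0). Calling as.numeric on a factor returns the internal level codes rather than the labels, as this small sketch shows:

```r
f <- factor(c("0", "1", "N", "1"))

codes <- as.numeric(f)
codes
# 1 2 3 2   (level codes, not the genotype values)

# Converting through character gives the labels; "N" becomes NA.
values <- suppressWarnings(as.numeric(as.character(f)))
values
# 0 1 NA 1
```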
We could do this using rowSums. As @bergant and @MatthewLundberg mentioned in the comments, if there are rows with no 0 or 1 elements, we get NaN from the calculation. One way to handle this is to extend the logical condition with !is.na, i.e. keep only the elements that are not NA in addition to the previous condition.
#using `rowSums` to create the all_freq vector
all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
#Create a logical index based on elements that are less than 0.5
indx <- all_freq < 0.5
#The NA elements can be changed to FALSE by adding another condition
indx1 <- indx & !is.na(indx)
#similarly for elements that are >= 0.5
indx2 <- !indx & !is.na(indx)
Now, we subset the rows of the 'newdata' with 'indx1', loop through the columns (lapply) and use gsub with pattern and replacement arguments and assign the output back to the subset of 'newdata'.
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is 0.5 or greater:
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
The gsub output columns are character class, which can be converted back to numeric (if needed).
newdata[] <- lapply(newdata, as.numeric)
data
set.seed(24)
newdata <- as.data.frame(matrix(sample(c(0:1, "N"), 10*4, replace=TRUE),
ncol=4), stringsAsFactors=FALSE)
newdata[7,] <- 2
