Removing duplicates in vector but preserving order - r

Suppose a vector :
vec = c(NA,NA,1,NA,NA,NA,1,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
I would like to get :
vec = c(NA,NA,1,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
I have tried a for loop with an if that checks whether the value equals the previous non-NA value, but it doesn't work when the value is repeated more than once.
The approach from "Remove duplicates in vector to next value" doesn't work either, since I want to keep my NAs.

You can do this with a little bit of logic and a compound [ and [<- operation. First we need to find the duplicates. We'll do this with diff() on all the non NA values...
diff( vec[ ! is.na( vec ) ] )
[1] 0 -1 0 0 1 0 -1 0
Each 0 is a duplicate. Now we need to find their position in vec and set them to NA..
# This gives us a vector of TRUE/FALSE values which we will use to subset vec to the values we want to change
dups <- c( 1 , diff( vec[ ! is.na( vec ) ] ) ) == 0
# Now subset vec to non NA values and change the duplicates to NA
vec[ ! is.na( vec ) ][ dups ] <- NA
# [1] NA NA 1 NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA
#[19] 1 NA NA NA NA NA 0 NA NA

Use duplicated:
vec[duplicated(vec, incomparables=NA)] <- NA
You could omit the incomparables parameter in your example:
vec[duplicated(vec)] <- NA
According to the documentation this might be faster, but you'd need to benchmark it yourself.
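Note that duplicated() flags every later occurrence of a value, not only a repeat of the previous non-NA value, so it can blank more than the question asks for. A small illustration on a made-up vector (x, by_dup and by_prev are hypothetical names, not the question's data):

```r
x <- c(NA, 1, NA, 1, 0, NA, 1)

# duplicated() marks every later occurrence of a value; with
# incomparables = NA the NA entries themselves are never flagged.
by_dup <- x
by_dup[duplicated(x, incomparables = NA)] <- NA
by_dup
# [1] NA  1 NA NA  0 NA NA

# Comparing each non-NA value with the previous non-NA value instead
# keeps the final 1, because it follows a 0.
tmp <- x[!is.na(x)]
tmp[c(FALSE, diff(tmp) == 0)] <- NA
by_prev <- x
by_prev[!is.na(x)] <- tmp
by_prev
# [1] NA  1 NA NA  0 NA  1
```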
Edit:
After clarification:
vec <- c(NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
vec2 <- c(NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
tmp <- vec[!is.na(vec)]
tmp[c(FALSE, diff(tmp)==0)] <- NA
vec[!is.na(vec)] <- tmp
identical(vec, vec2)
#[1] TRUE

I think this does it:
vrl <- rle(vec)
vdif <- diff(vrl$values[!is.na(vrl$values)])
vdif <- c(1, vdif)
vrl$values[!is.na(vrl$values)][vdif == 0] <- NA
inverse.rle(vrl)
# [1] NA NA 1 NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA
#[19] 1 NA NA NA NA NA 0 NA NA
The trick in there was to prepend a 1 to the difference vector so that the very first non-NA location is preserved.
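The diff()-based idea used in the answers above can be wrapped in a small reusable function. This is only a sketch: the name blank_repeats is made up for illustration, and it assumes a numeric vector (diff() does not work on character data).

```r
# Hypothetical helper: set to NA every non-NA value that equals the
# previous non-NA value, leaving the existing NA positions untouched.
blank_repeats <- function(x) {
  keep <- !is.na(x)
  vals <- x[keep]
  if (length(vals) > 1) {
    # the first non-NA value is always kept (leading FALSE)
    vals[c(FALSE, diff(vals) == 0)] <- NA
  }
  x[keep] <- vals
  x
}

vec <- c(NA,NA,1,NA,NA,NA,1,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
expected <- c(NA,NA,1,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
result <- blank_repeats(vec)
identical(result, expected)
# [1] TRUE
```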

Related

Index TRUE occurrences preserving NA in a new vector

I have what some of you might categorise as a dumb question, but I cannot solve it. I have this vector:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
And I want to get this in a new vector:
b <- c(NA,NA,1,NA,2,NA,3)
That simple. All the ways I am trying do not preserve the NA and I need them untouched. I would prefer if there would be a way in base R.
In base R, use cumsum() while excluding the NA values:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
a[!is.na(a)] <- cumsum(a[!is.na(a)])
Output:
[1] NA NA 1 NA 2 NA 3
Using replace from base R
b <- replace(a, !is.na(a), seq_len(sum(a, na.rm = TRUE)))
b
[1] NA NA 1 NA 2 NA 3
Or slightly more compact option (if the values are logical/numeric)
cumsum(!is.na(a)) * a
[1] NA NA 1 NA 2 NA 3
Update
If the OP's vector is
a <- c(NA,NA,TRUE,NA,FALSE,NA,TRUE)
(a|!a) * cumsum(replace(a, is.na(a), 0))
[1] NA NA 1 NA 1 NA 2
Replacing the non-NAs with the cumsum:
replace(a, !is.na(a), cumsum(na.omit(a)))
# [1] NA NA 1 NA 2 NA 3
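As a quick check of the options above, the following sketch compares them on the original vector and on one containing a FALSE. The names b_replace, b_compact, a2 and so on are only illustrative. The compact cumsum(!is.na(a)) * a form counts every non-NA position, so it agrees with the others only when all non-NA values are TRUE.

```r
a <- c(NA, NA, TRUE, NA, TRUE, NA, TRUE)

b_replace <- replace(a, !is.na(a), cumsum(a[!is.na(a)]))
b_compact <- cumsum(!is.na(a)) * a
b_replace
# [1] NA NA  1 NA  2 NA  3
identical(b_replace, b_compact)
# [1] TRUE

# With a FALSE in the vector the two approaches differ
a2 <- c(NA, NA, TRUE, NA, FALSE, NA, TRUE)
b2_replace <- replace(a2, !is.na(a2), cumsum(a2[!is.na(a2)]))
b2_compact <- cumsum(!is.na(a2)) * a2
b2_replace
# [1] NA NA  1 NA  1 NA  2
b2_compact
# [1] NA NA  1 NA  0 NA  3
```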

R na.approx error: need at least two non-NA values to interpolate

Sample Data
1/1/2000 NA NA NA 29.71 NA
1/2/2000 NA NA NA NA NA
1/3/2000 NA NA NA NA NA
1/4/2000 NA NA NA 29.25 NA
1/5/2000 NA NA NA 30.28 NA
1/6/2000 NA NA NA 27.66 NA
1/7/2000 NA NA NA 27.22 NA
1/8/2000 NA NA NA 27.27 NA
1/9/2000 170 4.1 NA 5.24 NA
1/10/2000 NA NA NA NA NA
1/11/2000 NA NA NA 27.65 NA
1/12/2000 NA NA NA 28.28 100.57
1/13/2000 NA NA NA 27.52 NA
I'm trying to interpolate a lot of NA values.
I have unique dates (key), but most of the other data columns begin or end with NULL/NA values (combined_data_z[,a]). I want to interpolate the empty values in these columns against date, but I get this error when I try:
Error in approx(x[!na], y[!na], xout, ...) : need at least two
non-NA values to interpolate
library(zoo)
#start with 2 because 1st column is date
a=2
for (i in parsedList)
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,a]
test1_z <- zoo(test1)
test1_z_approx <- na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend")
#print(test1_z_approx)
a=a+1
}
update: apparently it has something to do with the for loop, when I removed it and tested using print statements and built up from there, I found that it works when not enclosed in brackets (but I need the loop).
dates <- combined_data_z[,1]
test1 <- combined_data_z[,4]
test1_z <- zoo(test1)
test1_z_approx <- na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend")
print(test1_z_approx)
For the following dataset you provided in comments this works:
library(zoo)
combined_data_z <- read.csv(file="http://thistleknot.sytes.net/wordpress/wp-content/uploads/2018/04/output_NoNA.csv")
test1_z_approx <- matrix(NA, ncol=ncol(combined_data_z)-2, nrow = nrow(combined_data_z))
for (i in 3:ncol(combined_data_z))
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,i]
test1_z <- zoo(test1)
test1_z_approx[,i-2] <-as.matrix( na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend"))[,1]
}
If your dataset starts with the "date" column, then the code will look like:
head(combined_data_z)
# date CPIAUCSL UNRATE MEHOINUSA672N INTDSRUSM193N CIVPART
# 1 1/1/2000 169.3 4 58544 5 67.3
# 2 1/2/2000 NA NA NA NA NA
# 3 1/3/2000 NA NA NA NA NA
# 4 1/4/2000 NA NA NA NA NA
# 5 1/5/2000 NA NA NA NA NA
# 6 1/6/2000 NA NA NA NA NA
test1_z_approx <- matrix(NA, ncol=ncol(combined_data_z)-1, nrow = nrow(combined_data_z))
for (i in 2:ncol(combined_data_z))
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,i]
test1_z <- zoo(test1)
test1_z_approx[,i-1] <-as.matrix( na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend"))[,1]
}
head(test1_z_approx)
# [,1] [,2] [,3] [,4] [,5]
#[1,] 169.3000 4.000000 58544 5.000000 67.30000
#[2,] 224.0420 4.033100 59039 2.844406 64.07145
#[3,] 196.4639 3.959895 59039 4.579983 65.57215
#[4,] 188.9426 3.939930 59039 5.053322 65.98144
#[5,] 186.4355 3.933275 59039 5.211101 66.11786
#[6,] 183.9284 3.926620 59039 5.368881 66.25429
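The error message itself comes from approx(), which needs at least two non-NA points per column. A minimal sketch of a guard follows, written with base R approx() rather than zoo::na.approx so that it stays self-contained. The helper interp_col and the toy data frame df are made up for illustration, and the dates are assumed to be parsed as Date values.

```r
# Hypothetical helper: linearly interpolate y against x, carrying the
# nearest value outward at both ends (rule = 2). Columns with fewer than
# two non-NA values are returned unchanged instead of raising an error.
interp_col <- function(x, y) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)
  approx(x[ok], y[ok], xout = x, rule = 2)$y
}

dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-03",
                   "2000-01-04", "2000-01-05"))
df <- data.frame(date = dates,
                 v1 = c(NA, 10, NA, 30, NA),   # two known points
                 v2 = c(NA, NA, 5, NA, NA))    # only one known point

filled <- df
for (i in 2:ncol(df)) {
  filled[[i]] <- interp_col(as.numeric(df$date), df[[i]])
}
filled$v1
# [1] 10 10 20 30 30
```

The column v2 keeps its NAs because a single point is not enough to interpolate.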
Thanks go to Katia for the assist (specifically, my x's and y's needed to be in separate data frames).
combined_data_z <- df3
#https://stackoverflow.com/a/50173660/1731972
#file begins with numeric iterations
#ncol(combined_data_z)
dates <- combined_data_z[1]
print(dates)
#important to start at 2!, otherwise na.approx will not work!
#either copy from 2: on or copy whole and drop first column (date)
#test1 <- combined_data_z[c(2:length(parsedList)+1)]
#drop date
test1 <- combined_data_z
test1[1] <- NULL
print(test1)
#wtf, had to add data.frame today!
test1_z <- zoo(data.frame(test1))
date_z <- zoo(data.frame(dates))
print(test1_z)
#colnames(test1_z)
print(dates)
test1_z_approx <- na.fill(na.approx(test1_z, dates$date, rule=2, na.rm = FALSE), "extend")
print(test1_z_approx)
# combine the date column with the interpolated columns
new <- c(data.frame(dates),data.frame(test1_z_approx))
print(new)
write.csv(new, file = "output_test.csv")

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example:
matrix
and a vector of genes.
My task is to search the matrix row by row, pair each gene that has a mutation (in the matrix shown, D707H) with the rest of the genes contained in the vector, and add the pairs to a new matrix. I tried to do this with loops, but I have no idea how to write it correctly. For this matrix the result should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this:
my idea
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was; now I have the matrix. I analyzed your code and it works well, but that's not exactly what I want to do.
a, b, c, d etc. are organisms and row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value=4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the number of elements does not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
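One possible reading of this follow-up is: for every non-NA cell, pair the gene of that cell with every other gene, and record the organism. A minimal sketch under that assumption, using the layout of the mtx example above (organisms as rows, genes as columns); the toy matrix and the name pairs are made up for illustration:

```r
# Toy data: 2 organisms (rows) x 4 genes (columns), two non-NA cells
mtx <- matrix(NA_character_, nrow = 2, ncol = 4,
              dimnames = list(c("a", "b"), c("A", "B", "C", "D")))
mtx["a", "A"] <- "x1"
mtx["b", "C"] <- "x2"

ind <- which(!is.na(mtx), arr.ind = TRUE)
genes <- colnames(mtx)

# For each non-NA cell, build one block of rows pairing its gene with
# all the other genes, then stack the blocks.
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(k) {
  g1 <- genes[ind[k, "col"]]
  data.frame(organism = rownames(mtx)[ind[k, "row"]],
             gene1 = g1,
             gene2 = setdiff(genes, g1),
             stringsAsFactors = FALSE)
}))
pairs
#   organism gene1 gene2
# 1        a     A     B
# 2        a     A     C
# 3        a     A     D
# 4        b     C     A
# 5        b     C     B
# 6        b     C     D
```

If genes are stored as rows instead, the same idea applies after transposing with t(mtx).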

Opening csv of specific sequences: NAs come out of nowhere?

I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)
You can avoid the loop for creating the filenames
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))
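The same call works for any subset of IDs, which is what the question needs. A short sketch follows; the file.exists() filter is an added assumption, so that missing files are skipped rather than raising an error, and the names alt and existing are illustrative.

```r
id <- 99:100

# Zero-pad every id to three digits in one vectorised call
filenames <- sprintf("%03d.csv", id)
filenames
# [1] "099.csv" "100.csv"

# Equivalent padding with formatC()
alt <- paste0(formatC(id, width = 3, flag = "0"), ".csv")

# Read only the files that are present in the working directory
existing <- filenames[file.exists(filenames)]
y <- do.call(rbind, lapply(existing, read.csv, header = TRUE))
```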
@akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character string (i is converted to character by paste), while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(is.na(filenames[2:99]))
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
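The behaviour described above can be reproduced directly, and the original loop can be kept if the result is indexed by position rather than by the padded string. A minimal sketch (the names fixed and k are illustrative):

```r
filenames <- character(0)
filenames["099"] <- "099.csv"   # by name: appended as element 1
filenames[100] <- "100.csv"     # by position: elements 2..99 become NA
length(filenames)
# [1] 100
sum(is.na(filenames))
# [1] 98

# Same loop idea, but always indexing by position
id <- 99:100
fixed <- character(length(id))
for (k in seq_along(id)) {
  fixed[k] <- sprintf("%03d.csv", id[k])
}
fixed
# [1] "099.csv" "100.csv"
```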
Another approach, although less efficient than @akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for(i in seq_along(id)){
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)

R Loop Script to Create Many, Many Variables

I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi
# create a small example dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
# assuming that you now need to add more columns, from a4 to a200
# first, creating the sequence from 4 to 200
v = c(4:200)
# then using that sequence to add the 197 more columns
a[, paste0("a", v) := NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
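For the situation in the question (several sheets, each needing its own block of numbered variables), one base R option is to keep the data frames in a named list and loop over it. This is a sketch under assumptions: the list sheets, the counts in n_new and the prefix "variable" are all made up for illustration.

```r
# Hypothetical sheets kept in a named list instead of separate objects
sheets <- list(a = data.frame(id = 1:3),
               b = data.frame(id = 1:2))

# How many new variables each sheet should get (illustrative numbers)
n_new <- c(a = 3, b = 2)

for (s in names(sheets)) {
  new_cols <- paste0("variable", seq_len(n_new[[s]]))
  sheets[[s]][new_cols] <- NA   # adds all the named columns at once
}

names(sheets$a)
# [1] "id"        "variable1" "variable2" "variable3"
names(sheets$b)
# [1] "id"        "variable1" "variable2"
```

Keeping the sheets in one list also makes the later step of combining them into one grand data frame easier to script.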
