Subset NAs between values - r

I would like to know how to subset only the NAs that lie between non-NA values, excluding the NAs at the extremes of a vector.
For instance,
vector <- c(NA,NA,1,3,5,NA,3,NA,7,NA,NA,NA)
How could I subset only the NAs at vector[6] and vector[8]?
Thank you very much for your help!

One way to get indices which are not on extremes is
non_NA_inds <- which(!is.na(vector))
NA_inds <- which(is.na(vector))
NA_inds[NA_inds > min(non_NA_inds) & NA_inds < max(non_NA_inds)]
#[1] 6 8

You can try the following code
idx <- which(!is.na(vector))
res <- setdiff(min(idx):max(idx),idx)
which gives:
> res
[1] 6 8
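Both answers can be wrapped in a small helper. The following is only a sketch: the function name interior_na is made up for illustration, and the early return for an all-NA input is an assumption about the desired behaviour, since the question does not cover that case.

```r
# Hypothetical helper: positions of NAs that lie strictly between
# the first and the last non-NA value of x.
interior_na <- function(x) {
  non_na <- which(!is.na(x))
  # Assumption: with no non-NA values there is no "inside", so return nothing
  if (length(non_na) == 0L) return(integer(0))
  inside <- seq(min(non_na), max(non_na))
  inside[is.na(x[inside])]
}

v <- c(NA, NA, 1, 3, 5, NA, 3, NA, 7, NA, NA, NA)
interior_na(v)
#[1] 6 8
```

Guarding the all-NA case avoids the warning that min() and max() raise on an empty index vector.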

Related

Split a group of integers into two subgroups of approximately the same sums

I have a group of integers, as in this R data.frame:
set.seed(1)
df <- data.frame(id = paste0("id",1:100), length = as.integer(runif(100,10000,1000000)), stringsAsFactors = FALSE)
So each element has an id and a length.
I'd like to split df into two data.frames with approximately equal sums of length.
Any idea of an R function to achieve that?
I thought that Hmisc's cut2 might do it but I don't think that's its intended use:
library(Hmisc) # cut2
ll <- split(df, cut2(df$length, g=2))
> sum(ll[[1]]$length)
[1] 14702139
> sum(ll[[2]]$length)
[1] 37564671
This is known as the bin packing problem; see https://en.wikipedia.org/wiki/Bin_packing_problem for background.
Using the binPack function from the BBmisc package,
library(BBmisc) # binPack
df$bins <- binPack(df$length, sum(df$length)/2 + 1)
tapply(df$length, df$bins, sum)
gives
       1        2        3
25019106 24994566    26346
Now, since you want two groups,
df$bins[df$bins == 3] <- 2 # merge bin 3 into bin 2, because bin 2 has the smaller sum
and running the tapply call again gives
       1        2
25019106 25020912
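If BBmisc is not available, a simple greedy heuristic in base R often gets close as well. The sketch below is not what binPack does internally; it sorts the lengths in decreasing order and always adds the next one to the group with the smaller running total. The function name greedy_split is made up for illustration, and the result is approximate rather than guaranteed optimal.

```r
set.seed(1)
df <- data.frame(id = paste0("id", 1:100),
                 length = as.integer(runif(100, 10000, 1000000)),
                 stringsAsFactors = FALSE)

# Hypothetical helper: assign each weight to group 1 or 2, largest first,
# always choosing the group whose running total is currently smaller.
greedy_split <- function(w) {
  group <- integer(length(w))
  totals <- c(0, 0)
  for (i in order(w, decreasing = TRUE)) {
    g <- which.min(totals)
    group[i] <- g
    totals[g] <- totals[g] + w[i]
  }
  group
}

df$group <- greedy_split(df$length)
sums <- tapply(df$length, df$group, sum)
sums
parts <- split(df, df$group)  # the two resulting data.frames
```

With this scheme the two sums can never differ by more than the largest single length, which is usually good enough when "approximately equal" is all that is needed.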

How to count missing values from two columns in R

I have a data frame which looks like this
**Contig_A** **Contig_B**
Contig_0 Contig_1
Contig_3 Contig_5
Contig_4 Contig_1
Contig_9 Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)
Create a vector of all the values that you want to check (all_contig) which is Contig_0 to Contig_10 here. Use setdiff to find the absent values and length to get the count of missing values.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5
Alternatively, using match:
x <- c(0:9)
contigs <- sapply(x, function(t) paste0("Contig_",t))
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193 (note that c(0,1193) would contain only the two values 0 and 1193).
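The same count can also be obtained with %in%. This is a minimal sketch on the four example rows from the question, using 0:9 as the id range as in the question's own example; the variable names are illustrative only.

```r
df1 <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0"),
  stringsAsFactors = FALSE
)

all_contig <- paste0("Contig_", 0:9)   # for the real data: 0:1193
present <- unique(c(df1$Contig_A, df1$Contig_B))

# TRUE for every id that appears in neither column
is_missing <- !(all_contig %in% present)
all_contig[is_missing]
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
n_missing <- sum(is_missing)
n_missing
#[1] 4
```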

How to convert loop output to a vector?

I extract certain values out of dataset Z (the positions are given in dataset A) using a loop function.
#Exemplary datasets
Z <- data.frame(Depth=c(0.02,0.04,0.06,0.08,0.10,0.12,0.14,0.16,0.18,0.2),
Value=c(10,12,5,6,7,4,3,2,11,13))
A <- data.frame(Depth=c(0.067, 0.155))
for (n in c(1:nrow(A))) {
  find_values <- Z$Value[Z$Depth>=A$Depth[n]][1]
  print(find_values)
}
#Result
[1] 6
[1] 2
The result seems to consist of values in two separate vectors. How can I merge them in an easy way into one vector as follows?
[1] 6, 2
Thanks in advance!
For your code to work as it is, you can store the values by index inside the for loop:
find_values <- numeric(nrow(A)) # preallocate the result vector
for (n in seq_len(nrow(A))) {
  find_values[n] <- Z$Value[Z$Depth>=A$Depth[n]][1]
}
find_values
#[1] 6 2
However, you can simplify this with sapply by doing
sapply(A$Depth, function(x) Z$Value[which.max(Z$Depth >= x)])
#[1] 6 2
We can use a vectorized approach
Z$Value[findInterval(A$Depth, Z$Depth) + 1]
#[1] 6 2
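One detail worth checking with the findInterval approach is what happens when a requested depth coincides exactly with a depth in Z. The loop uses >=, so it returns the value at that same depth, whereas findInterval(...) + 1 with default arguments would move on to the next row. The sketch below uses the left.open = TRUE argument of findInterval to reproduce the >= behaviour; the query depths, including the exact match 0.06, are made up for illustration.

```r
Z <- data.frame(Depth = c(0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.2),
                Value = c(10, 12, 5, 6, 7, 4, 3, 2, 11, 13))

# Hypothetical query depths; 0.06 coincides with a depth in Z
depths <- c(0.06, 0.067, 0.155)

# Reference result: first value at a depth >= the query depth
via_loop <- vapply(depths, function(d) Z$Value[Z$Depth >= d][1], numeric(1))

# Vectorized: intervals are (lower, upper], so an exact match stays on its own row
via_interval <- Z$Value[findInterval(depths, Z$Depth, left.open = TRUE) + 1]

via_loop
#[1] 5 6 2
via_interval
#[1] 5 6 2
```

Both variants return NA for a query depth larger than every depth in Z.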

R - Select Rows Where Number of Values Satisfies Condition

I have a dataframe called df, what I want to do is select all rows where there are at least n values in that row satisfying some condition c.
For example, I want rows from df such that at least 50% of the values (or columns) in the row are greater than 0.75.
Here is what I came up with to accomplish this:
test <- df[apply(df, 1, function(x) (length(x[x > 0.75]) / length(x)) > 0.5)]
Unfortunately I am getting this error message:
Error in `[.data.frame`(df, apply(df, :
undefined columns selected
I am very new to R, so I'm pretty stuck at this point, what's the problem here?
You are getting that error message because you haven't told R what columns you want to include in your subset.
You have:
df[your_apply_function]
Which doesn't specify which columns. Instead, you should try
df[your_apply_function, ]
That means 'subset df to the rows for which the apply function returns TRUE, and keep all columns'.
However, I would approach it by using dplyr:
library(dplyr)
n_cols <- ncol(df) # number of value columns, taken before the helper column is added
df$rowcounts <- rowSums(df > 0.75) # per-row count of values above 0.75
df <- filter(df, rowcounts > n_cols / 2)
I didn't get to test this yet (code still running on my machine), but it looks right to my eye. When I get a chance I will test it.
This can be accomplished with a cellwise comparison against 0.75, rowSums(), and then a vectorized comparison against 0.5:
set.seed(3L)
NR <- 5L; NC <- 4L
df <- as.data.frame(matrix(rnorm(NR*NC,0.75,0.1),NR))
df
## V1 V2 V3 V4
## 1 0.6538067 0.7530124 0.6755218 0.7192344
## 2 0.7207474 0.7585418 0.6368781 0.6546983
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
df[rowSums(df>0.75)/ncol(df)>=0.5,];
## V1 V2 V3 V4
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
This can work on both matrices and data.frames.
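The same idea can be written with rowMeans(), since the mean of a logical matrix is the share of TRUE values per row. The sketch below uses a small hand-made data frame so the result is easy to check by eye; the helper name rows_meeting and its argument names are made up for illustration.

```r
# Hypothetical helper: keep rows in which at least `min_share` of the
# values are greater than `threshold`.
rows_meeting <- function(d, threshold, min_share) {
  d[rowMeans(d > threshold) >= min_share, , drop = FALSE]
}

df <- data.frame(a = c(0.9, 0.1, 0.8),
                 b = c(0.8, 0.2, 0.1),
                 c = c(0.1, 0.9, 0.9),
                 d = c(0.2, 0.3, 0.7))

kept <- rows_meeting(df, threshold = 0.75, min_share = 0.5)
kept
##     a   b   c   d
## 1 0.9 0.8 0.1 0.2
## 3 0.8 0.1 0.9 0.7
```

Rows 1 and 3 each have two of four values above 0.75, while row 2 has only one. The comma inside the brackets is what selects rows, which is also the piece missing from the code in the question.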

R replace numbers in a data frame based on their value

I have a data frame with numbers like :
28521 59385 58381
V7220 25050 V7231
I need to replace them based on conditions like:
if the number is bigger than 59380 and smaller than 59390 then code it as 1
delete the values that start with "V"
so the data frame will look like
28521 1 1
NA 25050 NA
How can I do this quickly for a huge data frame?
x <- c(28521, 59385, 58381, 'V7220', 25050, 'V7231')
as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x))
This will return a warning message about NA values, but if you wrap it with suppressWarnings, you'll get what you want.
> suppressWarnings(as.numeric(ifelse(as.numeric(x) > 59380 & as.numeric(x) < 59390, 1, x)))
[1] 28521 1 58381 NA 25050 NA
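Write a function, then apply it to the columns of the matrix/data.frame after you convert them to numeric to get rid of those V entries (they become NA, with a coercion warning).
df[] <- lapply(df, as.numeric)
# If you have factor instead of character
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
# Named recode_range rather than replace so that base::replace is not masked
recode_range <- function(x) {
  x[which(x > 59380 & x < 59390)] <- 1 # strictly between the bounds, as asked
  return(x)
}
df[] <- lapply(df, recode_range)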
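For a huge data frame it may be quicker to do the conversion once on the whole table instead of column by column. The following is a sketch under the assumption that every column is character; the column names V1 to V3 and the two rows are taken from the question's example layout and are otherwise hypothetical.

```r
df <- data.frame(V1 = c("28521", "V7220"),
                 V2 = c("59385", "25050"),
                 V3 = c("58381", "V7231"),
                 stringsAsFactors = FALSE)

# Flatten to one character vector (column-major) and convert in one pass;
# entries such as "V7220" cannot be parsed and become NA.
num <- suppressWarnings(as.numeric(as.matrix(df)))

# Recode values strictly between 59380 and 59390 as 1; which() skips the NAs
num[which(num > 59380 & num < 59390)] <- 1

# Restore the original shape and column names
out <- as.data.frame(matrix(num, nrow = nrow(df),
                            dimnames = list(NULL, names(df))))
out
```

Note that 58381 is left unchanged here because it is not between 59380 and 59390, in line with the stated condition.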
