Remove same indices from two vectors - R

I have two vectors in R, e.g.
a <- c(2,6,4,9,8)
b <- c(8,9,4,2,1)
Vectors a and b are ordered in a way that I wish to preserve (I will be plotting them against each other). I want to remove certain values from vector a and remove the values at the same indices in b, e.g. if I wanted to remove values ≥ 8 from a:
a <- a[a<8]
... which gives a new vector without those values.
Is there now an easy way of removing values from the same indices in b (in this example indices 4 and 5)? Perhaps by using a data frame?

Something like this:
keep <- a < 8
a <- a[keep]
b <- b[keep]
You could also use:
keep <- which( a < 8 )
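Both variants give the same result; a quick sketch with the example data from the question:
a <- c(2,6,4,9,8)
b <- c(8,9,4,2,1)
keep <- a < 8   # TRUE TRUE TRUE FALSE FALSE
a[keep]         # 2 6 4
b[keep]         # 8 9 4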

If the vectors are logically part of the same data, use a data frame. It is better programming practice.
df <- data.frame(a = a, b = b)
df <- df[df$a < 8, ]
Otherwise, store the logical mask (or the indices) of the values you want to keep in another vector:
keep <- a < 8
a <- a[keep]
b <- b[keep]

Why not:
d <- data.frame(a=a, b=b)
d <- d[d$a < 8, ]
Or even:
d <- subset(d, a < 8)
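Either way, a and b stay aligned inside the data frame, so plotting the filtered values against each other is just:
plot(d$a, d$b)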

First subset b using the condition on a, then subset a itself:
b <- b[a<8]
a <- a[a<8]
a < 8 returns a logical vector indicating which elements of a are smaller than 8; because b is subset before a is modified, the same positions are dropped from both vectors.
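A small illustration of why the order matters (not part of the original answer, just the example data again):
a <- c(2,6,4,9,8)
b <- c(8,9,4,2,1)
b <- b[a < 8]   # a is still intact here, so b becomes 8 9 4
a <- a[a < 8]   # a becomes 2 6 4
# Subsetting a first would shorten it, so a < 8 would no longer
# line up with the original positions in b.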

If this is purely for plotting, you can avoid messing with b and the x-axis by using NA.
a[a >= 8] <- NA
plot(b, a)  # works for point or line graphs

Related

How to iteratively remove columns in R?

Let's take an example data frame from which I want to remove a variable set of columns:
frame <- data.frame("a" = 1:5, "b" = 2:6, "c" = 3:7, "d" = 4:8)
rem <- readline()
frame <- subset(frame, select = -c(rem))
How do I get the column named in the variable to be removed? This is not my real code; I just wanted to present my problem with a simple example. Thanks!
Edit: I am so sorry, I was really sleepy and don't know what I typed into my code; I have edited it now.
1) Do both at once. We assume that ix contains at least one column number.
ix <- 1:2
frame[-ix]
## c d
## 1 3 4
## 2 4 5
## 3 5 6
## 4 6 7
## 5 7 8
1a) Or, if the case where ix has zero length (ix <- c()) is important, we can do this. The output of this and all the remaining approaches is the same as for (1), so we won't repeat it.
ix <- 1:2
frame[setdiff(seq_along(frame), ix)]
1b) Or, if we have column names rather than column numbers. This works even if nms is a zero-length vector, in which case it returns the original data frame.
nms <- c("a", "b")
frame[setdiff(names(frame), nms)]
2) Or, if you need to do it iteratively, remove the largest index first: if the columns were removed in ascending order, then after the first one is removed the second column would no longer be the second but the first. If we knew that ix were already sorted we could omit the sort. We use frame_out to hold the result so that the input is not destroyed. This works even if ix is the empty vector.
ix <- 1:2
frame_out <- frame
for(i in rev(sort(ix))) frame_out <- frame_out[-i]
frame_out
3) One way to make the removal independent of the order is to do it by name. In this case the columns could be removed in ascending order. This works even if ix is the empty vector.
ix <- 1:2
nms <- names(frame)[ix]
frame_out <- frame
for(nm in nms) frame_out <- frame_out[-match(nm, names(frame_out))]
frame_out
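Coming back to the readline() idea in the question: readline() returns a character string, so approach (1b) applies directly. A sketch, assuming rem holds one or more column names (the value "b" here is just an illustration):
rem <- "b"                          # e.g. what readline() returned
frame[setdiff(names(frame), rem)]   # frame without column b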

reference x's column in R's apply function

I have a df like this:
a <- c(4,5,3,5,1)
b <- c(8,9,7,3,5)
c <- c(6,7,5,4,3)
df <- data.frame(rbind(a,b,c))
I want a new data frame, df2, containing, for each cell in rows a and b, the difference between that value and the value in row c of the same column.
df2 would look like this:
a <- c(-2,-2,-2,1,-2)
b <- c(2,2,2,-1,2)
df2 <- data.frame(rbind(a,b))
Here is where I'm getting stuck:
df2 <- data.frame(apply(df,c(1,2),function(x) x - df[nrow(df),the col index of x]))
How do I reference the column index of x? Is there something like JavaScript's this?
We can do this easily by replicating the 3rd row to make the lengths equal before subtracting it from the first two rows:
out <- df[c("a", "b"),] - df["c",][col(df[c("a", "b"),])]
identical(df2, out)
#[1] TRUE
Or explicitly using rep
df[c("a", "b"),] - rep(unlist(df["c",]), each = 2)

How to subset a column by triplicates?

I am wondering how to subset my data based on the appearance of triplicates in a column.
t <- c(1,1,2,2,3,3,4,4,5,5,5,6,6,7,7,7,8,8)
mydf <- data.frame(t, 1:18)
I want to be able to grab only the rows that correspond to a triplicate in column t, so that I can form a new dataframe of only those rows. That would look like this where p is the vector of rows I'm looking for:
p <- c(9,10,11,14,15,16)
myidealdf[p,]
Sorry if this isn't clear, it's my first post
This should do it
keeps <- unique(t)[table(as.factor(t)) == 3]
keeps <- t %in% keeps
mydf <- mydf[keeps, ]
Using the rle function:
which(t %in% with(rle(t), values[lengths==3]))
[1] 9 10 11 14 15 16
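Those row numbers can then be used to subset the data frame directly:
mydf[which(t %in% with(rle(t), values[lengths == 3])), ]
Note that rle() only counts consecutive runs, which works here because equal values of t are adjacent; the table() approach above counts occurrences regardless of position.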

R Check a row of strings, if equal, assign equal ID, less time consuming

I'm fairly new to R and was wondering if anyone here has a better solution to my problem, as mine is too time consuming. I know R is not very "for-loop-friendly", so I am sure there is a better way to solve this.
I have a data frame where x is a text string and y is a numeric id:
x = c("a", "b", "c", "b", "a")
y = c(1,2,3,4,5)
df <- data.frame(x, y)
I want to find all matches in column x and assign them the same numeric value as the first match's y. I have solved this with the following:
library(foreach)
library(iterators)
for(i in 1:NROW(df)) {
  for(j in i:NROW(df)) {
    if(df$x[j] == df$x[i]) {
      df$y[j] <- df$y[i]
    }
    j = j + 1
  }
  i = i + 1
}
The problem is, I have a fairly large dataset, which makes this process take a lot of time! I hope someone here knows a less time-consuming alternative!
If your dataset is indeed large, then data.table will probably be the fastest solution (see benchmarks here).
library(data.table)
setDT(df)
df[, y := first(y), by = x]
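For the example data in the question, this fills y with the y value of the first occurrence of each x:
df$y
# [1] 1 2 3 2 1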
R likes vectorised code, so things like arithmetic operations and assignments can be slow if done in a loop. Consider, for example, assigning the vector 1, 2, ..., 1,000,000 to a variable x in two different ways:
x <- 1:1e6
and
x <- numeric(1e6) # initialise a numeric vector of length 1 million
for (i in 1:1e6) x[i] <- i
If you try this out you will see that the second method takes much longer.
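If you want to see the difference yourself, a rough timing sketch (the exact numbers will vary by machine):
system.time(x <- 1:1e6)
system.time({
  x <- numeric(1e6)
  for (i in 1:1e6) x[i] <- i
})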
Coming to your problem, you want to group the data by the value in df$x and replace the values of y by the first element within each group:
df.by <- by(df, df$x, function(d) transform(d, y = y[1]))
This sets the y column of each subset of df (split by df$x) equal to its first value. The result is:
#df$x: a
# x y
#1 a 1
#5 a 1
#------------------------------------------------------------
#df$x: b
# x y
#2 b 2
#4 b 2
#------------------------------------------------------------
#df$x: c
# x y
#3 c 3
To combine these back to a data frame, use df.new <- do.call(rbind, df.by). One (possibly unwanted) side effect of this operation is that it will change the order of the rows.
If you are new to R, check out the dplyr package; it has a smooth learning curve and easy-to-write, easy-to-read syntax. What you want to do can be accomplished in only a few lines.
library(dplyr)
df %>% group_by(x) %>% mutate(y = y[1])
will do it!
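If you would rather stay in base R without an explicit loop, a fully vectorised sketch using match() (an addition, not part of the original answers):
# match(df$x, df$x) returns, for each element, the position of its first
# occurrence, so indexing y with it copies the first y within each group.
df$y <- df$y[match(df$x, df$x)]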

Efficiently replacing a variable number of NA values based on a logical vector

I am attempting to replace NA values in my data frame based on the logical return of one of the columns in the data frame.
#Creating random example data frame
a <- rbinom(1000,1,.5)
b <- rbinom(1000,1,.75)
c <- rbinom(1000,1,.25)
d <- rbinom(1000,1,.5)
e <- rbinom(1000,1,.5) # Will be the logical column
df <- cbind(a,b,c,d)
for(i in 1:1000){
  if(sum(df[i,1:4]) > 2){
    df[i,1:4] <- NA
  }
}
# randomly replacing some of the NA to represent the observation data
df[sample(1:length(df), 100, replace=F)] <- 1
df <- cbind(df, e)
I am attempting to fill in the NAs with 0 when e == 1 while still retaining the random 1s I placed in the other 4 columns (especially those where the rest of the values are NA).
I've tried creating loops like:
for(i in 1:nrow(df)){
  if(df[,'e']==1){
    df[i, is.na(df[i,1:4])] <- 0
  }
}
However, that clears both my logical column and my observation data.
The data frame that I want to apply this to is large (2.8 million rows × 23 columns), containing metadata and observation data, so something that takes speed into account would be great.
We can do this with data.table
library(data.table)
df1 <- as.data.frame(df)
setDT(df1)
for(j in 1:4){
  set(df1, i = which(df1[['e']]==1 & is.na(df1[[j]])), j = j, value = 0)
}
This should be more efficient because we are using set: according to its help page (?set), the overhead of [.data.table is avoided by calling set directly.
As @thelatemail mentioned, a compact base R option would be:
df[,1:4][df[,"e"]==1 & is.na(df[,1:4])] <- 0
If the matrix is very big, the logical matrix would be big as well and that could potentially create memory-related issues.
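If that memory overhead is a concern, a base R sketch that works one column at a time instead of building the full logical matrix (assuming the same matrix df as above):
e_rows <- df[, "e"] == 1
for (j in 1:4) {
  fill <- e_rows & is.na(df[, j])
  df[fill, j] <- 0
}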
