I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter that tracks whether each row is NA, counting the number of consecutive non-NA rows and restarting whenever an NA is encountered. That would give me the lengths of every sequence of non-NAs, from which I could find the longest. This seems very inefficient, so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and are looking to isolate all columns from the single row of data with the highest running count of consecutive non-NA values in WANT.
Example data
df <- data.frame(A = LETTERS[1:10],
                 B = LETTERS[1:10],
                 C = LETTERS[1:10],
                 WANT = LETTERS[1:10],
                 E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J
Here you would want all I values as it is the row with the longest running non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of this approach is that it counts consecutive non-NA runs across all columns (of any type, i.e. character or numeric); you can then index on whichever column you want using which.max().
# from #jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
  r <- rle(x)
  s <- sapply(r$lengths, seq_len)
  s[r$values] <- lapply(s[r$values], `*`, 0)
  unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
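Note that which.max() pulls out the single row with the longest tally. If instead you want every row in that longest unbroken stretch, here is a minimal sketch using rle() directly; it rebuilds the example df, and the variable names are my own:

```r
# Sketch: extract all rows in the longest run of non-NA WANT values,
# rebuilding the example df from above (the NAs land in rows 3 and 10).
set.seed(123)
df <- data.frame(A = LETTERS[1:10], B = LETTERS[1:10], C = LETTERS[1:10],
                 WANT = LETTERS[1:10], E = LETTERS[1:10])
df[sample(1:nrow(df), 2), 4] <- NA

r <- rle(!is.na(df$WANT))                 # run lengths of the non-NA indicator
best  <- which.max(r$lengths * r$values)  # index of the longest TRUE run
end   <- sum(r$lengths[seq_len(best)])    # last row of that run
start <- end - r$lengths[best] + 1        # first row of that run
longest_run <- df[start:end, ]            # rows 4 through 9 here
```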
If this isn't correct, please edit your question for clarity.
Related
I have a .csv file where I have one column with 25-30 characters per row. I need to separate the one column into 25 columns each with its own character (or nucleotide) inside each. Thus, I will be ignoring the extra 0-5 nucleotides in each row.
The .csv file looks similar to this:
Sequence
ATCGGTCGGGGGAT
TGCTGGCAAA
ACCGTCGAA
ACTGGTAATTG
I need the table to look similar to this:
Sequence
A T G C T
G T A C T
G G T C C
A T G T G
If this information helps: my end goal is to find the nucleotide frequencies of each column, which is why I need the columns to be separated.
I am very new to R so any help would be greatly appreciated!
First use dput() to provide data that we can cut/paste.
Sequence <- c("ATCGGTCGGGGGAT", "TGCTGGCAAA", "ACCGTCGAA", "ACTGGTAATTG")
Then chop off the bits you don't need and split the rest:
Sequence <- substr(Sequence, 1, 5)
Sequence <- data.frame(do.call(rbind, strsplit(Sequence, "")))
Sequence
# X1 X2 X3 X4 X5
# 1 A T C G G
# 2 T G C T G
# 3 A C C G T
# 4 A C T G G
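Since the stated end goal is per-column nucleotide frequencies, here is a hedged sketch building on the split data frame above; the helper vector nt and the name freq are my own, not from the question:

```r
# Rebuild the split data frame as above.
Sequence <- c("ATCGGTCGGGGGAT", "TGCTGGCAAA", "ACCGTCGAA", "ACTGGTAATTG")
Sequence <- substr(Sequence, 1, 5)
Sequence <- data.frame(do.call(rbind, strsplit(Sequence, "")))

# Count each nucleotide per position; factor() with fixed levels
# ensures every column reports all four nucleotides, even zero counts.
nt <- c("A", "C", "G", "T")
freq <- sapply(Sequence, function(col) table(factor(col, levels = nt)))
freq  # one count column per position (X1..X5), one row per nucleotide
```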
a <- rnorm(10)
b <- sample(a,18,replace = T)
For each element of a, I want to randomly assign a value from vector b, so that I have a combination for every element of vector "a". It will be something like:
combinations <- data.table(first=a ,second=sample(b,length(a)))
What I actually want is a little different from the data.table combinations above: I want a set of combinations where none of the rows has repeated values.
Edit: combinations$first[i] and combinations$second[i] may be equal in the code above. What I want is to make it impossible for combinations$first[i] and combinations$second[i] to be equal.
Note: I will do this for large vector, so it needs to be fast.
You can sample by group as follows:
library(data.table)
set.seed(0L)
a <- LETTERS[1L:10L]
output <- data.table(first=a)[, .(second=sample(setdiff(a, first), .N)), by=.(first)]
If random row ordering is needed, you can run output[sample(.N)].
output:
first second
1: A J
2: B D
3: C E
4: D G
5: E J
6: F B
7: G J
8: H J
9: I F
10: J F
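For comparison, here is a base-R sketch of the same idea (the names second and out are made up here): drawing each second value from the remaining elements guarantees no row repeats a value:

```r
# For each element of `a`, draw one value from the *other* elements,
# so first and second can never be equal in any row.
set.seed(0L)
a <- LETTERS[1:10]
second <- unname(vapply(a, function(x) sample(setdiff(a, x), 1L), character(1)))
out <- data.frame(first = a, second = second, stringsAsFactors = FALSE)
out
```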
I have a question about matching values between two vectors.
Lets say I have a vector and data frame:
data.frame
value name vector 2
154.0031 A 154.0084
154.0768 B 159.0344
154.2145 C 154.0755
154.4954 D 156.7758
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I
Now I want to compare vector 2 with the values in the data frame using a defined, adjustable global tolerance (e.g. ±0.005) and add the corresponding names to vector 2, so I get a result like this:
data.frame
value name vector 2 name
154.0031 A 154.0084 A
154.0768 B 159.0344 G
154.2145 C 154.0755 B
154.4954 D 156.7758 E
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I
I tried to use intersect() but there is no option for tolerance in it?
Many thanks!
This outcome can be achieved with outer(), which(), and subsetting.
# calculate distances between elements of each object
# rows are df and columns are vec 2
myDists <- outer(df$value, vec2, FUN=function(x, y) abs(x - y))
# get the values that have less than some given value
# using arr.ind =TRUE returns a matrix with the row and column positions
matches <- which(myDists < 0.05, arr.ind=TRUE)
data.frame(name = df$name[matches[, 1]], value=vec2[matches[, 2]])
name value
1 A 154.0084
2 G 159.0344
3 B 154.0755
4 E 156.7758
Note that this will only return the elements of vec2 that have matches, while returning every element of df that satisfies the threshold. To make the results robust to this, use
# get closest matches for each element of vec2
closest <- tapply(matches[,1], list(matches[,2]), min)
# fill in the names.
# NA will appear where there are no obs that meet the threshold.
data.frame(name = df$name[closest][match(seq_along(vec2),
                                         as.integer(names(closest)))],
           value = vec2)
Currently, this returns the same result as above, but will return NAs where there is no adequate observation in df.
data
Please provide reproducible data if you ask a question in the future. See below.
df <- read.table(header=TRUE, text="value name
154.0031 A
154.0768 B
154.2145 C
154.4954 D
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I")
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758)
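To make the tolerance easy to adjust, the steps above can be wrapped in a small helper; match_names and tol are made-up names for illustration, not part of the original answer:

```r
df <- read.table(header = TRUE, text = "value name
154.0031 A
154.0768 B
154.2145 C
154.4954 D
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I")
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758)

match_names <- function(df, vec2, tol = 0.05) {
  d <- outer(df$value, vec2, function(x, y) abs(x - y))  # pairwise distances
  m <- which(d < tol, arr.ind = TRUE)                    # row/col of matches
  # lowest-index df row among the matches for each vec2 element
  closest <- tapply(m[, 1], list(m[, 2]), min)
  # align back to vec2 positions; unmatched elements become NA
  idx <- closest[match(seq_along(vec2), as.integer(names(closest)))]
  data.frame(value = vec2, name = df$name[idx])
}
match_names(df, vec2)  # names A, G, B, E
```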
I’m trying to compare the values in column C pairwise and return the rows associated with them. For example, compare the first two values in column C: if the first value is greater than the second, return the first two rows of the data frame; if not, skip to the next pair and check whether the third value in column C is greater than the fourth, returning rows 3 and 4 if so, and so on.
I’ve been wrangling with the filter function from dplyr but no luck.
Below is an example data frame.
set.seed(99)
DF <- data.frame(abs(rnorm(10)), abs(rnorm(10)), abs(rnorm(10)))
colnames(DF) <-c("A", "B", "C")
DF
Any help would be appreciated.
You can use rollapply from the zoo package:
library(zoo)
ind <- rep(rollapply(DF$C, 2, by = 2, which.max) == 1, each = 2)
DF[ind,]
#           A         B          C
#1 1.52984334 2.0127251 1.70922539
#2 1.96454540 0.2887642 0.52301701
#5 1.15765833 0.2866493 1.72702076
#6 0.80379719 1.0945894 0.72269558
#7 1.52239099 0.5296913 2.04080511
#8 0.01663749 0.3593682 0.88601771
#9 0.12672258 0.4110257 0.19165526
#10 0.27740770 0.1950477 0.01378397
Here is a base R solution, where you find the index of every pair satisfying the condition and then subset the data frame:
ind <- which(DF$C[c(T, F)] > DF$C[c(F, T)]) # check whether the odd rows are larger than
# the even rows and find out the index
DF[c(2*ind-1, 2*ind), ] # subset the data frame based on index for every two rows
# A B C
# 1 1.6866933 0.6886403 1.1231086
# 9 0.8781335 2.1689560 1.3686023
# 2 0.8377870 0.5539177 0.4028848
# 10 0.8215811 1.2079620 0.2257710
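Since the question mentions wrangling with dplyr's filter(), here is a hedged sketch of the same pairwise filter using group_by(); it assumes an even number of rows and that dplyr is installed:

```r
library(dplyr)

set.seed(99)
DF <- data.frame(abs(rnorm(10)), abs(rnorm(10)), abs(rnorm(10)))
colnames(DF) <- c("A", "B", "C")

res <- DF %>%
  mutate(pair = (row_number() + 1) %/% 2) %>%  # pair ids: 1,1,2,2,...
  group_by(pair) %>%
  filter(first(C) > last(C)) %>%               # keep pairs where the odd row wins
  ungroup() %>%
  select(-pair)
res
```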
I have a vector of indices corresponding to the rows of a data frame that I want to modify in one specific column:
index <- c(1, 3, 6)
dm <- data.frame(one   = c("x", "a", "s", "e", "a", "q"),
                 two   = c("y", "b", "e", "d", "e", "t"),
                 three = c("z", "c", "t", "f", "d", "j"),
                 four  = c("k", "r", "f", "a", "r", "i"))
Now I want to modify column "three" only for rows 1, 3 and 6 replacing whatever value in it with "A".
Should I use apply?
There is no need for apply. You could simply use the following:
dm$three[index] <- "A"
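A quick end-to-end check, rebuilding the example data from the question:

```r
index <- c(1, 3, 6)
dm <- data.frame(one   = c("x", "a", "s", "e", "a", "q"),
                 two   = c("y", "b", "e", "d", "e", "t"),
                 three = c("z", "c", "t", "f", "d", "j"),
                 four  = c("k", "r", "f", "a", "r", "i"),
                 stringsAsFactors = FALSE)

dm$three[index] <- "A"  # overwrite only rows 1, 3 and 6 of column "three"
dm$three                # "A" "c" "A" "f" "d" "A"
```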