a <- rnorm(10)
b <- sample(a,18,replace = T)
For each element of a, I want to randomly assign a value from vector b, so that I end up with a pair for every element of vector "a". It will be something like:
combinations <- data.table(first=a ,second=sample(b,length(a)))
What I actually want is a little different from the data.table combinations above: I want a set of pairs in which no row has repeated values.
Edit: combinations$first[i] and combinations$second[i] may be equal in the code above. What I want is to make it impossible for combinations$first[i] and combinations$second[i] to be equal.
Note: I will do this for a large vector, so it needs to be fast.
You can sample by group as follows:
library(data.table)
set.seed(0L)
a <- LETTERS[1L:10L]
output <- data.table(first=a)[, .(second=sample(setdiff(a, first), .N)), by=.(first)]
If random row ordering is needed, you can run output[sample(.N)].
output:
first second
1: A J
2: B D
3: C E
4: D G
5: E J
6: F B
7: G J
8: H J
9: I F
10: J F
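Since speed matters for large vectors, a vectorized alternative (a sketch, assuming a contains at least two distinct values) is to draw all of second in one go and re-draw only the colliding positions:

```r
set.seed(0L)
a <- LETTERS[1L:10L]

# draw a candidate "second" for every element at once
second <- sample(a, length(a), replace = TRUE)

# re-draw only the positions where first == second until none collide
while (any(bad <- second == a)) {
  second[bad] <- sample(a, sum(bad), replace = TRUE)
}

output <- data.frame(first = a, second = second)
```

Each pass clears a fraction of the collisions, so the loop typically finishes in a handful of iterations even for very long vectors.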
Related
I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter which determines whether a row has NA or not, and then will determine the count for the number of consecutive rows without an NA. I would then put this in an if statement to keep restarting every time an NA is encountered, outputting a dataframe with the lengths of every sequence of non-NAs, which I could use to find the longest such sequence. This seems very inefficient so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and are looking to isolate all columns from the single row of data with the highest consecutive non-NA count in WANT.
Example data
df <- data.frame(A = LETTERS[1:10],
                 B = LETTERS[1:10],
                 C = LETTERS[1:10],
                 WANT = LETTERS[1:10],
                 E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J
Here you would want all I values as it is the row with the longest running non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of this approach is that it counts consecutive non-NA runs across all columns (of any type, i.e. character or numeric); you can then index on whichever column you want using which.max().
# from #jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
  r <- rle(x)
  s <- sapply(r$lengths, seq_len)
  s[r$values] <- lapply(s[r$values], `*`, 0)
  unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
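If, instead of just the single row where the counter peaks, you want every row of the longest non-NA stretch, the same running counter gives you both the end position and the length of that run. A self-contained sketch using the rle-based counter from the answer:

```r
set.seed(123)
df <- data.frame(A = LETTERS[1:10], B = LETTERS[1:10], C = LETTERS[1:10],
                 WANT = LETTERS[1:10], E = LETTERS[1:10])
df[sample(1:nrow(df), 2), 4] <- NA

# running count of consecutive non-NA values in WANT
r <- rle(is.na(df$WANT))
s <- sapply(r$lengths, seq_len)
s[r$values] <- lapply(s[r$values], `*`, 0)
run <- unlist(s)

end <- which.max(run)                 # last row of the longest run
longest <- df[(end - run[end] + 1):end, ]  # all rows of that run
```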
If this isn't correct, please edit your question for clarity.
I have a .csv file where I have one column with 25-30 characters per row. I need to separate the one column into 25 columns each with its own character (or nucleotide) inside each. Thus, I will be ignoring the extra 0-5 nucleotides in each row.
The .csv files looks similar to this:
Sequence
ATCGGTCGGGGGAT
TGCTGGCAAA
ACCGTCGAA
ACTGGTAATTG
I need the table to look similar to this:
Sequence
A T G C T
G T A C T
G G T C C
A T G T G
If this information helps: the end goal is to find the nucleotide frequencies of each column, which is why I need the columns to be separated.
I am very new to R so any help would be greatly appreciated!
First use dput() to provide data that we can cut/paste.
Sequence <- c("ATCGGTCGGGGGAT", "TGCTGGCAAA", "ACCGTCGAA", "ACTGGTAATTG")
Then chop off the bits you don't need and split the rest:
Sequence <- substr(Sequence, 1, 5)
Sequence <- data.frame(do.call(rbind, strsplit(Sequence, "")))
Sequence
# X1 X2 X3 X4 X5
# 1 A T C G G
# 2 T G C T G
# 3 A C C G T
# 4 A C T G G
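Since the stated end goal is the per-column nucleotide frequencies, you can get those directly from the split result with table(); a sketch (wrapping each column in factor() so all four bases appear even when a column lacks one):

```r
Sequence <- c("ATCGGTCGGGGGAT", "TGCTGGCAAA", "ACCGTCGAA", "ACTGGTAATTG")
mat <- do.call(rbind, strsplit(substr(Sequence, 1, 5), ""))

# count A/C/G/T in each column; factor() keeps zero counts
freqs <- apply(mat, 2, function(col) table(factor(col, levels = c("A", "C", "G", "T"))))
freqs
#   [,1] [,2] [,3] [,4] [,5]
# A    3    0    0    0    0
# C    0    2    3    0    0
# G    0    1    0    3    3
# T    1    1    1    1    1
```

Dividing by the number of sequences (`freqs / nrow(mat)`) turns the counts into proportions.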
I have a question about matching values between two vectors.
Lets say I have a vector and data frame:
data.frame
value name vector 2
154.0031 A 154.0084
154.0768 B 159.0344
154.2145 C 154.0755
154.4954 D 156.7758
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I
Now I want to compare vector 2 with values in the data frame with a defined global tolerance (e.g. +-0.005) that is adjustable and add the corresponding names to vector 2, so I get a result like this:
data.frame
value name vector 2 name
154.0031 A 154.0074 A
154.0768 B 159.0334 G
154.2145 C 154.0755 B
154.4954 D 156.7758 E
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I
I tried to use intersect() but there is no option for tolerance in it?
Many thanks!
This outcome can be achieved with outer, which, and subsetting.
# calculate distances between elements of each object
# rows are df and columns are vec 2
myDists <- outer(df$value, vec2, FUN=function(x, y) abs(x - y))
# keep the pairs whose distance is below the tolerance (0.05 here; adjust as needed)
# arr.ind = TRUE returns a matrix with the row and column positions
matches <- which(myDists < 0.05, arr.ind = TRUE)
data.frame(name = df$name[matches[, 1]], value=vec2[matches[, 2]])
name value
1 A 154.0084
2 G 159.0344
3 B 154.0755
4 E 156.7758
Note that this will only return elements of vec2 with matches and will return all elements of df that satisfy the threshold.
To make the results robust to this, use
# first match (lowest df row index) for each matched element of vec2
closest <- tapply(matches[, 1], list(matches[, 2]), min)
# fill in the names;
# NA will appear where no observation meets the threshold
data.frame(name = df$name[closest][match(seq_along(vec2),
                                         as.integer(names(closest)))],
           value = vec2)
Currently, this returns the same result as above, but will return NAs where there is no adequate observation in df.
data
Please provide reproducible data if you ask a question in the future. See below.
df <- read.table(header=TRUE, text="value name
154.0031 A
154.0768 B
154.2145 C
154.4954 D
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I")
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758)
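To pick the closest value within the tolerance (rather than the first row that happens to fall inside it), the outer() idea can be wrapped in a small helper with the tolerance exposed as a parameter; match_names and its argument names are made up for this sketch:

```r
df <- read.table(header = TRUE, text = "value name
154.0031 A
154.0768 B
154.2145 C
154.4954 D
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I")
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758)

# for each element of `values`, take the closest ref value; NA if outside tol
match_names <- function(values, ref_values, ref_names, tol = 0.05) {
  d <- outer(ref_values, values, function(x, y) abs(x - y))
  idx <- apply(d, 2, function(col) {
    i <- which.min(col)
    if (col[i] <= tol) i else NA_integer_
  })
  ref_names[idx]
}

match_names(vec2, df$value, as.character(df$name))
# [1] "A" "G" "B" "E"
```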
I'm trying to build a factor column that relates to two other factor columns with completely different factor levels. Here's example data.
set.seed(1234)
a <- sample(LETTERS[1:10], 50, replace = TRUE)
b <- sample(letters[11:20], 50, replace = TRUE)
df <- data.frame(a, b)
df$a <- as.factor(df$a)
df$b <- as.factor(df$b)
The rule I want creates a new column, c, whose value depends on column a: if a row in column a equals "F", that row in column c gets the entry from column b; otherwise it keeps the entry from column a. The code I'm trying:
dfn <- dim(df)[1]
for (i in 1:dfn) {
  df$c[i] <- ifelse(df$a[i] == "F", df$b[i], df$a[i])
}
df
only spits out the numbered index of the factor level for column b and not the actual entry. What have I done wrong?
I think you'll need to do a little finagling of character values. This seems to do it.
w <- df$a == "F"
df$c <- factor(replace(as.character(df$a), w, as.character(df$b)[w]))
Here is a quick look at the new column,
factor(replace(as.character(df$a), w, as.character(df$b)[w]))
# [1] B G G G I G A C G s G k C J C I C C B C D D B A C I n J I A
# [31] E C D p B H C C J I l G D G D p G E C H
# Levels: A B C D E G H I J k l n p s
As in my earlier comment, a solution with dplyr:
library(dplyr)
df %>% mutate(c = ifelse(a == "F", as.character(b), as.character(a)))
If you plan on doing anything involving combinations of the columns as factors, for example, comparisons, you should refactor to the same set of levels.
u <- union(levels(df$a), levels(df$b))
df$a <- factor(df$a, u)
df$b <- factor(df$b, u)
df$c <- df$a
ind <- df$a == "F"
df$c[ind] <- df$b[ind]
By taking this precaution, you can sensibly do
> sum(df$c==df$b)
[1] 6
> sum(df$a=="F")
[1] 6
otherwise the first line will fail.
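A minimal sketch of why the precaution matters: comparing factors whose level sets differ is an error, and refactoring both to the union of levels fixes it.

```r
f1 <- factor(c("A", "B"))
f2 <- factor(c("k", "B"))

# different level sets: `==` throws an error
msg <- tryCatch(f1 == f2, error = function(e) conditionMessage(e))
msg
# [1] "level sets of factors are different"

# after refactoring to the shared level set, comparison works
u <- union(levels(f1), levels(f2))
f1 <- factor(f1, u)
f2 <- factor(f2, u)
f1 == f2
# [1] FALSE  TRUE
```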
I have a vector, index, that holds the numbers of the rows I want to modify in one specific column of a data frame:
index <- c(1,3,6)
dm
  one two three four
1   x   y     z    k
2   a   b     c    r
3   s   e     t    f
4   e   d     f    a
5   a   e     d    r
6   q   t     j    i
Now I want to modify column "three" only for rows 1, 3 and 6, replacing whatever value is in it with "A".
Should I use apply?
There is no need for apply. You could simply use the following:
dm$three[index] <- "A"
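A quick reproducible sketch (the data frame is made up to mirror the one in the question):

```r
dm <- data.frame(one   = c("x", "a", "s", "e", "a", "q"),
                 two   = c("y", "b", "e", "d", "e", "t"),
                 three = c("z", "c", "t", "f", "d", "j"),
                 four  = c("k", "r", "f", "a", "r", "i"))
index <- c(1, 3, 6)

# replace column "three" at the indexed rows only
dm$three[index] <- "A"
dm$three
# [1] "A" "c" "A" "f" "d" "A"
```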