'Proper' way to do row-wise replacement - r

I have a data frame which looks something like:
dataDemo <- data.frame(POS = 1:4 , REF = c("A" , "T" , "G" , "C") ,
ind1 = c("A" , "." , "G" , "C") , ind2 = c("A" , "C" , "C" , "."),
stringsAsFactors=FALSE)
dataDemo
POS REF ind1 ind2
1 1 A A A
2 2 T . C
3 3 G G C
4 4 C C .
and I'd like to replace all the "."s with the REF value for that row. Here is how I did it:
for(i in seq_along(dataDemo$REF)){
dataDemo[i , ][dataDemo[i , ] == '.'] <- dataDemo$REF[i]
}
I'd like to know if there's a more 'proper' or idiomatic way of doing this in R. I generally try to use *apply whenever possible and this seems like something that could easily be adapted to that approach and made more readable (and run faster), but despite throwing a good bit of time at it I haven't made much progress.

In dplyr (note that mutate_each() and funs() have since been deprecated in favour of across()):
library(dplyr)
dataDemo %>% mutate_each(funs(ifelse(. == '.', REF, as.character(.))), -POS)
# POS REF ind1 ind2
# 1 1 A A A
# 2 2 T T C
# 3 3 G G C
# 4 4 C C C

Here's another base R alternative, where we use the row numbers of the "." occurrences to replace them by the appropriate REF values.
# Get row numbers
rownrs <- which(dataDemo==".", arr.ind = TRUE)[,1]
# Replace values
dataDemo[dataDemo=="."] <- dataDemo$REF[rownrs]
# Result
dataDemo
# POS REF ind1 ind2
#1 1 A A A
#2 2 T T C
#3 3 G G C
#4 4 C C C
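Why the row numbers line up with the replaced cells: both which(..., arr.ind = TRUE) and assignment through a logical matrix walk the cells column by column, so the i-th row index belongs to the i-th replaced cell. A minimal sketch on the question's data (the names is_dot and rows are only illustrative):

```r
dataDemo <- data.frame(POS = 1:4, REF = c("A", "T", "G", "C"),
                       ind1 = c("A", ".", "G", "C"),
                       ind2 = c("A", "C", "C", "."),
                       stringsAsFactors = FALSE)

# Logical matrix marking every "." cell
is_dot <- dataDemo == "."

# Row index of each TRUE cell, listed in column-major order
rows <- which(is_dot, arr.ind = TRUE)[, "row"]

# Logical-matrix assignment fills the cells in the same column-major order,
# so each "." receives the REF value of its own row
dataDemo[is_dot] <- dataDemo$REF[rows]
dataDemo
```

Keeping the logical matrix in a variable also avoids computing dataDemo == "." twice.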

Here is an option using set from data.table, which should be fast.
library(data.table)
setDT(dataDemo)
nm1 <- paste0("ind", 1:2)
for(j in nm1){
i1 <- dataDemo[[j]]=="."
set(dataDemo, i = which(i1), j=j, value = dataDemo$REF[i1])
}
dataDemo
# POS REF ind1 ind2
#1: 1 A A A
#2: 2 T T C
#3: 3 G G C
#4: 4 C C C
EDIT: Based on @alexis_laz's comments
Or using dplyr
library(dplyr)
dataDemo %>%
mutate_each(funs(ifelse(.==".", REF,.)), ind1:ind2)
# POS REF ind1 ind2
#1 1 A A A
#2 2 T T C
#3 3 G G C
#4 4 C C C
Or we can use base R methods to do this in a single line (nm1 holds the ind column names defined above).
dataDemo[nm1] <- lapply(dataDemo[nm1], function(x) ifelse(x==".", dataDemo$REF, x))
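If the replacement is needed in more than one place, the same column-by-column idea can be wrapped in a small function. This is only a sketch: fill_dots is a hypothetical helper, not part of any package, and it assumes the columns to fix are the ones whose names start with "ind".

```r
dataDemo <- data.frame(POS = 1:4, REF = c("A", "T", "G", "C"),
                       ind1 = c("A", ".", "G", "C"),
                       ind2 = c("A", "C", "C", "."),
                       stringsAsFactors = FALSE)

# Hypothetical helper: replace "." in the chosen columns by the value of
# the reference column in the same row
fill_dots <- function(df, ref = "REF",
                      cols = grep("^ind", names(df), value = TRUE)) {
  for (col in cols) {
    is_dot <- df[[col]] == "."
    df[[col]][is_dot] <- df[[ref]][is_dot]
  }
  df
}

filled <- fill_dots(dataDemo)
filled
```

The loop runs over columns rather than rows, so each iteration is a vectorised assignment; the function returns a modified copy and leaves dataDemo untouched.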

Related

Find values in data frame 2 which is found in data frame 1, within a certain range

I want to find which rows of df2 are also present in df1, within a certain range. A value is the pair (a, b) from one row; a and b always go together and cannot be split up. For example, can I find 9,1 (the first row of df1) in df2? It doesn't have to be at the same position. We can also allow a difference of, for example, 1 for "a" and 1 for "b", so I want to find all rows of df2 with a = 9±1 and b = 1±1. Does anyone have a suggestion for how to code this? Many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
This is not correct, but I imagine it can be done with a for loop somehow:
for(i in df1[,1]) {
for(j in df1[,2]){
s<- c(s,(df1[i,1] & df1[j,2]== df2[,1] & df2[,2]))# how to add certain allowed diff levels?
}
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I loooove data.table, which is (imo) the weapon of choice for this kind of problem.
library( data.table )
#make df1 and df2 a data.table
setDT(df1, key = "feature"); setDT(df2)
#now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[ df1, c( "match", "feature_df2") := {
val = df2[ a %between% c( i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1 ), ]
unique_val = sort( unique( val$feature ) )
num_val = length( unique_val )
list( num_val, paste0( unique_val, collapse = ";" ) )
}, by = .EACHI ][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
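For comparison, the same tolerance match can be sketched in base R. This is not the answer's method, just an equivalent formulation; the data are typed in from the printed tables in the question (so the result does not depend on the random number generator), and tol_a, tol_b and hits are illustrative names.

```r
df1 <- data.frame(feature = LETTERS[1:5], a = c(9, 4, 7, 1, 2),
                  b = c(1, 4, 1, 2, 5), stringsAsFactors = FALSE)
df2 <- data.frame(feature = LETTERS[1:5], a = c(5, 6, 9, 1, 10),
                  b = c(1, 4, 5, 1, 2), stringsAsFactors = FALSE)

tol_a <- 1  # allowed difference in "a"
tol_b <- 1  # allowed difference in "b"

# For every row of df1, collect the df2 features where both a and b
# fall inside the tolerance
hits <- lapply(seq_len(nrow(df1)), function(i) {
  df2$feature[abs(df2$a - df1$a[i]) <= tol_a &
              abs(df2$b - df1$b[i]) <= tol_b]
})

result <- data.frame(feature_df1 = df1$feature,
                     match = lengths(hits),
                     feature_df2 = vapply(hits, paste, character(1),
                                          collapse = ";"),
                     stringsAsFactors = FALSE)
result
```

Here match counts the number of df2 rows within tolerance, as in the data.table answer, and multiple matching features would be joined with ";".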
One way to go about this in base R is to split each data frame into a list of rows, calculate the absolute difference between every pair of row vectors, and then check whether that difference is larger than a given threshold. (The lists l1 and l2 used in the code are built under "Data" below.)
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
lapply(l2, function(y){
abs(x - y)
})
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
do.call(rbind, x)
})
# Finally flag the pairs whose difference exceeds our thresholds (TRUE = outside the allowed range)
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1<- lapply(1:nrow(df1), function(x){
df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
df2[x, ]
})
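The pairwise comparison described in this answer can also be sketched with outer(), which builds the full matrix of differences in one call instead of nested lists. This is an alternative formulation, not the answer's own code; d1, d2 and exceeds are illustrative names, and the data are the same two tables typed in directly.

```r
df1 <- data.frame(V1 = c(4, 7, 1, 2, 13, 19, 11, 17, 14, 3),
                  V2 = c(1, 5, 5, 10, 6, 10, 7, 9, 5, 5))
df2 <- data.frame(V1 = c(15, 6, 19, 8, 1, 13, 16, 12, 9, 2),
                  V2 = c(1, 3, 6, 2, 3, 7, 8, 7, 1, 6))

m1 <- 2  # threshold for the first column
m2 <- 3  # threshold for the second column

# d1[i, j] is |df1$V1[i] - df2$V1[j]|, likewise d2 for the second column
d1 <- abs(outer(df1$V1, df2$V1, "-"))
d2 <- abs(outer(df1$V2, df2$V2, "-"))

# TRUE where the pair (row i of df1, row j of df2) is outside the thresholds
exceeds <- d1 > m1 | d2 > m2
exceeds[1, ]
```

Row i of exceeds corresponds to element i of listfin in the list-based version; the rows of df2 that are within range of row i of df1 are which(!exceeds[i, ]).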

R: reshape data frame when one column has unequal number of entries

I have a data frame x with 2 character columns:
x <- data.frame(a = numeric(), b = I(list()))
x[1:3,"a"] = 1:3
x[[1, "b"]] <- "a, b, c"
x[[2, "b"]] <- "d, e"
x[[3, "b"]] <- "f"
x$a = as.character(x$a)
x$b = as.character(x$b)
x
str(x)
The entries in column b are comma-separated strings of characters.
I need to produce this data frame:
1 a
1 b
1 c
2 d
2 e
3 f
I know how to do it when I loop row by row. But is it possible to do without looping?
Thank you!
Have you checked out require(splitstackshape)?
> cSplit(x, "b", ",", direction = "long")
a b
1: 1 a
2: 1 b
3: 1 c
4: 2 d
5: 2 e
6: 3 f
> s <- strsplit(as.character(x$b), ',\\s*')
> data.frame(value=rep(x$a, sapply(s, FUN=length)),b=unlist(s))
value b
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
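The strsplit()/rep() idea can be checked end to end on the question's data. A minimal sketch, assuming the separator is a comma possibly followed by spaces; parts and long are illustrative names.

```r
x <- data.frame(a = c("1", "2", "3"),
                b = c("a, b, c", "d, e", "f"),
                stringsAsFactors = FALSE)

# One character vector of pieces per row of x
parts <- strsplit(x$b, ",", fixed = TRUE)

# Repeat each 'a' once per piece; trimws() drops the space left after a comma
long <- data.frame(a = rep(x$a, times = lengths(parts)),
                   b = trimws(unlist(parts)),
                   stringsAsFactors = FALSE)
long
```

lengths(parts) gives the number of pieces per row, which is exactly the repetition count rep() needs, so no explicit loop over rows is required.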
There you go, this should be very fast (splitting on ", " so the pieces carry no leading space):
library(data.table)
x <- data.table(x)
x[ ,strsplit(b, ", ", fixed = TRUE), by = a]

Adding two vectors by names

I have two named vectors
v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")
I want to add them up by the names, i.e. the expected result is
> v3
a b c d e
1 2 6 9 4
Is there a way to programmatically do this in R? Note the names may not necessarily be in a sorted order, like in v2 above.
Just combine the vectors (using c, for example) and use tapply:
v3 <- c(v1, v2)
tapply(v3, names(v3), sum)
# a b c d e
# 1 2 6 9 4
Or, for fun (since you're just doing sum), continuing with "v3":
xtabs(v3 ~ names(v3))
# names(v3)
# a b c d e
# 1 2 6 9 4
I suppose with "data.table" you could also do something like:
library(data.table)
as.data.table(Reduce(c, mget(ls(pattern = "v\\d"))),
keep.rownames = TRUE)[, list(V2 = sum(V2)), by = V1]
# V1 V2
# 1: a 1
# 2: b 2
# 3: c 6
# 4: d 9
# 5: e 4
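(I shared the latter not so much for "data.table" but to show an automated way of capturing the vectors of interest. Note that the pattern "v\\d" matches every object whose name contains "v" followed by a digit, including v3 if it already exists in the workspace, so remove v3 first or the sums will be doubled.)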
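Another way to see what "adding by names" means is to build the result vector over the union of the names and add each input into its own slots. A minimal sketch, assuming names are unique within each vector; all_names is an illustrative name.

```r
v1 <- c(a = 1, b = 2, c = 3, d = 4)
v2 <- c(c = 3, e = 4, d = 5)

# Every name that occurs in either vector, in sorted order
all_names <- sort(union(names(v1), names(v2)))

# Start from zeros and add each vector into the positions matching its names
v3 <- setNames(numeric(length(all_names)), all_names)
v3[names(v1)] <- v3[names(v1)] + v1
v3[names(v2)] <- v3[names(v2)] + v2
v3
```

The tapply() version is shorter and also copes with repeated names inside one vector; this sketch mainly makes the name matching explicit.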

order while splitting (eg. TA should be split to two column "A" in first "T" second) in r

I have the following issue, which I could not fully solve:
set.seed (1234)
mydf <- data.frame (var1a = sample (c("TA", "AA", "TT"), 5, replace = TRUE),
varb2 = sample (c("GA", "AA", "GG"), 5, replace = TRUE),
varAB = sample (c("AC", "AA", "CC"), 5, replace = TRUE)
)
mydf
var1a varb2 varAB
1 TA AA CC
2 AA GA AA
3 AA GA AC
4 AA AA CC
5 TT AA AC
I want to split each two-letter value into separate columns and then order the letters alphabetically.
Edit: The ordering can be done before the split (for example, the var1a value "TA" should become "AT") or after the split, so that var1aa is "A" and var1ab is "T" (instead of "T", "A").
So the sorting is within each cell.
split_col <- function(.col, data){
.x <- colsplit( data[[.col]], names = paste0(.col, letters[1:2]))
}
Split each column and combine:
require(reshape)
splitdf <- do.call(cbind, lapply(names(mydf), split_col, data = mydf))
var1aa var1ab varb2a varb2b varABa varABb
1 T A A A C C
2 A A G A A A
3 A A G A A C
4 A A A A C C
5 T T A A A C
The unsolved part is that I want each pair of columns (column name + "a" and column name + "b") to be ordered alphabetically within each row. Thus the expected output is:
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
How can I order (sort within each pair of variables)?
mylist <-as.list(mydf)
splits <- lapply(mylist, reshape::colsplit, names=c("a", "b"))
rowsort <- lapply(splits, function(x) t(apply(x, 1, sort)))
comb <- do.call(data.frame, rowsort)
comb
var1a.1 var1a.2 varb2.1 varb2.2 varAB.a varAB.b
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
EDIT:
If names are important, you can replace them:
replaceNums <- function(x){
.which <- regmatches(x, regexpr("[[:alnum:]]*(?=\\.)", x, perl=TRUE))
stopifnot(length(x) %% 2 == 0) #checkstep
paste0(.which, c("a", "b"))
}
names(comb) <- replaceNums(names(comb))
comb
var1aa var1ab varb2a varb2b varABa varABb
1 A T A A C C
2 A A A G A A
3 A A A G A C
4 A A A A C C
5 T T A A A C
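The split-then-sort step can also be sketched in base R without the reshape package, which sets the final column names directly. This is an alternative to the answer above, not its code: sort_pair is a hypothetical helper, it assumes every cell holds exactly two characters, and the data are typed in from the printed mydf rather than sampled.

```r
mydf <- data.frame(var1a = c("TA", "AA", "AA", "AA", "TT"),
                   varb2 = c("AA", "GA", "GA", "AA", "AA"),
                   varAB = c("CC", "AA", "AC", "CC", "AC"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: split each two-letter string into its letters,
# sort them, and return one row per input string
sort_pair <- function(v) {
  chars <- strsplit(v, "")
  t(vapply(chars, sort, character(2)))
}

out <- list()
for (nm in names(mydf)) {
  m <- sort_pair(mydf[[nm]])
  out[[paste0(nm, "a")]] <- m[, 1]  # alphabetically first letter
  out[[paste0(nm, "b")]] <- m[, 2]  # alphabetically second letter
}
out <- as.data.frame(out, stringsAsFactors = FALSE)
out
```

vapply() with character(2) fails loudly if a cell does not contain exactly two letters, which serves as a cheap input check.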

Create a vector from repetitions of items from a matrix

I have a data frame m
A 2
B 3
C 4
and I want to create a data frame like
A 1
A 2
B 1
B 2
B 3
C 1
C 2
C 3
C 4
Any help? Thanks a lot in advance
Your original question can be answered by:
text <- LETTERS[1:3]
n <- 2:4
rep(text, times=n)
[1] "A" "A" "B" "B" "B" "C" "C" "C" "C"
Your new question is quite different:
df <- data.frame(
text = LETTERS[1:3],
n = 2:4
)
data.frame(
text = rep(df$text, times=df$n),
seq = sequence(df$n)
)
text seq
1 A 1
2 A 2
3 B 1
4 B 2
5 B 3
6 C 1
7 C 2
8 C 3
9 C 4
rep accepts vectors. Try this:
dat <- data.frame(V1 = letters[1:3], V2 = 2:4)
rep(dat[, 1], dat[, 2])
> rep(dat[, 1], dat[, 2])
[1] a a b b b c c c c
Assuming m is a data frame:
m <- data.frame(V1 = LETTERS[1:3], V2 = 2:4, stringsAsFactors = FALSE)
This will do what you want:
with(m, rep(V1, times = V2))
e.g.
> with(m, rep(V1, times = V2))
[1] "A" "A" "B" "B" "B" "C" "C" "C" "C"
Edit: To address the edit made by the OP, try the following:
with(m, data.frame(X1 = rep(V1, times = V2),
X2 = unlist(lapply(V2, seq_len))))
Which produces:
> with(m, data.frame(X1 = rep(V1, times = V2),
+ X2 = unlist(lapply(V2, seq_len))))
X1 X2
1 A 1
2 A 2
3 B 1
4 B 2
5 B 3
6 C 1
7 C 2
8 C 3
9 C 4
Or more succinctly via sequence(), as per @Andrie's answer (which I also keep forgetting about):
with(m, data.frame(X1 = rep(V1, times = V2), X2 = sequence(V2)))
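When the data frame has more columns than just the label, it can be easier to repeat row indices instead of repeating one column. A minimal sketch of that variation (idx and out are illustrative names):

```r
m <- data.frame(V1 = LETTERS[1:3], V2 = 2:4, stringsAsFactors = FALSE)

# Row i of m is repeated m$V2[i] times
idx <- rep(seq_len(nrow(m)), times = m$V2)

# sequence(c(2, 3, 4)) counts 1..2, 1..3, 1..4 back to back
out <- data.frame(V1 = m$V1[idx], V2 = sequence(m$V2),
                  stringsAsFactors = FALSE)
out
```

Indexing with idx (for example m[idx, ]) carries every other column of m along with the repeated rows.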
@Andrie's answer is the only one so far that answers your new question. There may be a better way to do this, but:
m <- data.frame(V1 = LETTERS[1:3], V2 = 2:4, stringsAsFactors = FALSE)
library(plyr)
ddply(m,"V1",function(x) data.frame(V2=seq(x[,2])))
