While converting long data to wide, how do I provide multiple columns to the timevar argument in reshape?
`reshape(DT, idvar="Cell", timevar = "n1", direction="wide")`
For example, something like timevar=c("n1","n2", ...).
DT<-data.table(Cell = c("A","A","B","B"), n1=c("x","y","y","a"), n2=c("t","x","x","z"))
Cell n1 n2
1: A x t
2: A y x
3: B y x
4: B a z
but I need output like below:
Cell n1 n2 n3 n4
A x y t NA
B x y a z
The order of elements in the n1, n2, n3 columns of the output doesn't matter; only the unique elements from the n1 and n2 columns are required. Also, I have multiple columns like n1, n2, n3, ..., n in my actual DT.
Here is a rough concept that seems to achieve the desired result.
foo <- function(x, y, n) {
l <- as.list(unique(c(x, y)))
if (length(l) < n) l[(length(l)+1):n] <- NA_character_
l
}
DTw <- DT[, foo(n1, n2, 4), Cell]
DTw
# Cell V1 V2 V3 V4
# 1: A x y t <NA>
# 2: B y a x z
# Set the names by reference
setnames(DTw, c("Cell", paste0("n", 1:4)))
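Since the actual table has more columns than n1 and n2, the same idea can be generalized without naming each column. The following is a minimal base R sketch rather than the answer's data.table method. It assumes that every column other than Cell should be pooled, and the names df, vals and wide are illustrative only.

```r
# Same toy data as in the question, as a plain data.frame
df <- data.frame(
  Cell = c("A", "A", "B", "B"),
  n1   = c("x", "y", "y", "a"),
  n2   = c("t", "x", "x", "z"),
  stringsAsFactors = FALSE
)

# Pool every non-Cell column per Cell and keep the unique values
vals <- lapply(split(df[setdiff(names(df), "Cell")], df$Cell),
               function(d) unique(unlist(d, use.names = FALSE)))

# Pad each group with NA up to the longest group, then stack the rows
n <- max(lengths(vals))
wide <- do.call(rbind, lapply(vals, function(v) v[seq_len(n)]))
colnames(wide) <- paste0("n", seq_len(n))
wide <- data.frame(Cell = rownames(wide), wide,
                   row.names = NULL, stringsAsFactors = FALSE)
wide
```

The number of output columns is derived from the data here instead of being passed in as n, which may or may not be what you want if the output width has to be fixed in advance.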
I have 200 datasets, each of size 5120*732. Some of the elements are NA.
Now, in each row, whenever there are >= N1 (N1 = 8) consecutive elements that are not NA (i.e. is.na()==FALSE), I would like to prefix all of them with 'D'.
Here's an example with N1 = 3.
df1 <- data.frame(c(1.0,NA,1.1,1.2,1.3),
c(2.0,2.1,NA,NA,NA),
c(3.0,3.1,3.2,3.3,NA),
c(4.0,4.1,4.2,4.3,4.4),
c(5.0,NA,5.1,NA,5.2))
The expected outcome should be:
df1_expected <- data.frame(c('D1.0',NA,1.1,1.2,1.3),
c('D2.0','D2.1',NA,NA,NA),
c('D3.0','D3.1','D3.2',3.3,NA),
c('D4.0','D4.1','D4.2',4.3,4.4),
c('D5.0',NA,'D5.1',NA,5.2))
Here's the code I modified from this post but it doesn't work as expected.
Is there an efficient method to check for 8 successive elements that are not NA (i.e. is.na()==FALSE) in each column of a large dataset?
Any hints or tips greatly appreciated!
My code:
append_one <- function(x, N, pref = "D"){
y <- rep(pref, length(x))
is.na(y) <- is.na(x)
r <- rle(y)
r$values[r$lengths < N] <- ""
y <- inverse.rle(r)
paste0(y, x)
}
append_all <- function(X, n, pref = "D"){
Y <- X
Y [] <- apply(Y, 1, append_one, N = n, pref = pref) #where I modified
Y
}
Here is one base R solution that you can try:
N1 <- 3
df1_expected <- data.frame(t(apply(df1, 1, function(v) {
r <- rle(!is.na(v))
idx <- rep(r$lengths >=N1 & r$values,r$lengths)
replace(v,which(idx),paste0("D",v[idx]))
})))
such that
> df1_expected
v1 v2 v3 v4 v5
1 D1 D2 D3 D4 D5
2 <NA> D2.1 D3.1 D4.1 <NA>
3 1.1 <NA> D3.2 D4.2 D5.1
4 1.2 <NA> 3.3 4.3 <NA>
5 1.3 <NA> <NA> 4.4 5.2
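The rle() step above can also be pulled out into a small helper that works on one vector at a time, which makes it easier to test before applying it to all 200 datasets. This is only a sketch under the same assumptions as the answer; the names prefix_runs and m are made up for illustration.

```r
# Prefix every element that belongs to a run of at least n
# consecutive non-NA values; all other elements are left unchanged.
prefix_runs <- function(v, n, pref = "D") {
  r <- rle(!is.na(v))                               # runs of NA / non-NA
  keep <- rep(r$values & r$lengths >= n, r$lengths) # TRUE inside long non-NA runs
  out <- as.character(v)
  out[keep] <- paste0(pref, out[keep])
  out
}

# First three rows of the example data, written as a matrix
m <- rbind(c(1.0, 2.0, 3.0, 4.0, 5.0),
           c(NA,  2.1, 3.1, 4.1, NA),
           c(1.1, NA,  3.2, 4.2, 5.1))

# apply() over rows returns the results as columns, hence the t()
res <- t(apply(m, 1, prefix_runs, n = 3))
res
```

The transpose matters: leaving it out appears to be what goes wrong in the question's append_all(), where the row-wise apply() result is assigned back without t().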
I need to build a matrix from data that is stored in several other matrices that all have a pointer in their first column. This is how the original matrices might look, with a-e being the pointers connecting the data from all the matrices and v-z being the data that is linked together. The arrow points to what I want my final matrix to look like.
a x x
b y y
c z z
d w w
e v v
e v v
d w w
c z z
b y y
a x x
----->
x x x x
y y y y
z z z z
w w w w
v v v v
I can't seem to write the right algorithm to do this; I am getting either "subscript out of bounds" errors or "replacement has length zero" errors. Here is what I have now, but it is not working.
for(i in 1:length(matlist)){
tempmatrix = matlist[[i]] # list of matrices to be combined
genMatrix[1,i] = tempmatrix[1,2]
for(j in 2:length(tempmatrix[,1])){
index = which(indexv == tempmatrix[j,1]) #the row index for the data that needs to be match
# with an ECID
for(k in 1:length(tempmatrix[1,])){
genMatrix[index,k+i] = tempmatrix[j,k]
}
# places the data in same row as the ecid
}
}
print(genMatrix)
EDIT: I just want to clarify that my example only shows two matrices but in the list matlist there can be any number of matrices. I need to find a way of merging them without having to know how many matrices are in matlist at the time.
We can merge all the matrices in the list using Reduce and merge from base R.
as.matrix(read.table(text="a x x
b y y
c z z
d w w
e v v")) -> mat1
as.matrix(read.table(text="e v v
d w w
c z z
b y y
a x x")) -> mat2
as.matrix(read.table(text="e x z
d z w
c w v
b y x
a v y")) -> mat3
matlist <- list(mat1=mat1, mat2=mat2, mat3=mat3)
Reduce(function(m1, m2) merge(m1, m2, by = "V1", all.x = TRUE),
matlist)[,-1]
#> V2.x V3.x V2.y V3.y V2 V3
#> 1 x x x x v y
#> 2 y y y y y x
#> 3 z z z z w v
#> 4 w w w w z w
#> 5 v v v v x z
Created on 2019-06-05 by the reprex package (v0.3.0)
Or we can append all the matrices together and then use tidyr to go from long to wide and get the desired output.
library(tidyr)
library(dplyr)
bind_rows(lapply(matlist, as.data.frame), .id = "mat") %>%
gather(matkey, val, c("V2","V3")) %>%
unite(matkeyt, mat, matkey, sep = ".") %>%
spread(matkeyt, val) %>%
select(-V1)
#> mat1.V2 mat1.V3 mat2.V2 mat2.V3 mat3.V2 mat3.V3
#> 1 x x x x v y
#> 2 y y y y y x
#> 3 z z z z w v
#> 4 w w w w z w
#> 5 v v v v x z
Created on 2019-06-06 by the reprex package (v0.3.0)
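When matlist holds more than three matrices, repeated merges on identically named columns can produce repeated or ambiguous .x/.y column names. One way around this, sketched below under the assumption that the first column of every matrix is the pointer, is to give each matrix's data columns unique names before calling Reduce. The names m1 to m4, dfs, key and merged are illustrative, not part of the original answer.

```r
# Four small matrices sharing the keys a, b, c in different orders
m1 <- matrix(c("a", "b", "c", "x", "y", "z"), ncol = 2)
m2 <- matrix(c("c", "b", "a", "z", "y", "x"), ncol = 2)
m3 <- matrix(c("b", "a", "c", "1", "2", "3"), ncol = 2)
m4 <- m1
matlist <- list(m1, m2, m3, m4)

# Convert each matrix to a data.frame with a common key column and
# data columns named after the matrix's position in the list
dfs <- lapply(seq_along(matlist), function(i) {
  d <- as.data.frame(matlist[[i]], stringsAsFactors = FALSE)
  names(d) <- c("key", paste0("m", i, ".", seq_len(ncol(d) - 1)))
  d
})

# Merge everything on the key; all = TRUE keeps keys missing from some matrices
merged <- Reduce(function(a, b) merge(a, b, by = "key", all = TRUE), dfs)
merged
```

Because the column names already encode which matrix they came from, merge never needs to add suffixes, and the number of matrices does not have to be known in advance.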
I have a data set for MMA bouts.
The structure currently is
Fighter 1, Fighter 2, Winner
x y x
x y x
x y x
x y x
x y x
My problem is that Fighter 1 is always the Winner, so my model will learn that Fighter 1 always wins.
I need to be able to randomly swap Fighter 1 and Fighter 2 for half the data set in order to have the winner represented equally.
Ideally I would have this:
Fighter 1, Fighter 2, Winner
x y x
y x x
x y y
y x x
x y y
Is there a way to randomise across columns without messing up the order of the rows?
I'm assuming your xs and ys are arbitrary and just placeholders. I'll further assume that you need the Winner column to stay the same, you just need that the winner not always be in the first column.
Sample data:
set.seed(42)
x <- data.frame(
F1 = sample(letters, size = 5),
F2 = sample(LETTERS, size = 5),
stringsAsFactors = FALSE
)
x$W <- x$F1
x
# F1 F2 W
# 1 x N x
# 2 z S z
# 3 g D g
# 4 t P t
# 5 o W o
Choose some rows to change, randomly:
(ind <- sample(nrow(x), size = ceiling(nrow(x)/2)))
# [1] 3 5 4
This means that we expect rows 3-5 to change.
Now the random changes:
within(x, { tmp <- F1[ind]; F1[ind] = F2[ind]; F2[ind] = tmp; rm(tmp); })
# F1 F2 W
# 1 x N x
# 2 z S z
# 3 D g g
# 4 P t t
# 5 W o o
Rows 1-2 still show the F1 as the Winner, and rows 3-5 show F2 as the Winner.
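The two steps above can be wrapped into one function so that the swap is applied and checked in one place. This is a sketch only; swap_half and its arguments a and b are hypothetical names, and it assumes the winner column is not one of the two columns being swapped.

```r
# Swap columns a and b in a random half of the rows (rounded up),
# leaving every other column, including the winner, untouched.
swap_half <- function(df, a = "F1", b = "F2") {
  ind <- sample(nrow(df), size = ceiling(nrow(df) / 2))
  df[ind, c(a, b)] <- df[ind, c(b, a)]  # values are assigned by position
  df
}

set.seed(42)
x <- data.frame(
  F1 = sample(letters, size = 5),
  F2 = sample(LETTERS, size = 5),
  stringsAsFactors = FALSE
)
x$W <- x$F1

y <- swap_half(x)
y
```

Row order and the W column are unchanged, so each row still describes the same bout; only the position of the winner differs.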
I also found that this code worked
matches_clean[, c("fighter1", "fighter2")] <- lapply(matches_clean[, c("fighter1", "fighter2")], as.character)
changeInd <- !!((match(matches_clean$fighter1, levels(as.factor(matches_clean$fighter1))) -
match(matches_clean$fighter2, levels(as.factor(matches_clean$fighter2)))) %% 2)
matches_clean[changeInd, c("fighter1", "fighter2")] <- matches_clean[changeInd, c("fighter2", "fighter1")]
Let's say I have the following data.table
> DT
# A B C D E N
# 1: J t X D N 0.07898388
# 2: U z U L A 0.46906049
# 3: H a Z F S 0.50826435
# ---
# 9998: X b R L X 0.49879990
# 9999: Z r U J J 0.63233668
# 10000: C b M K U 0.47796539
Now I need to group by a pair of columns and calculate sum N.
That's easy to do when you know column names in advance:
> DT[, sum(N), by=.(A,B)]
# A B V1
# 1: J t 6.556897
# 2: U z 9.060844
# 3: H a 4.293426
# ---
# 674: V z 11.439100
# 675: M x 1.736050
# 676: U k 3.676197
But I must do that in a function, which receives a vector of column indices to group by.
> f <- function(columns = 1:2) {
DT[, sum(N), by=columns]
}
> f(1:2)
Error in `[.data.table`(DT, , sum(N), by = columns) :
The items in the 'by' or 'keyby' list are length (2). Each must be same
length as rows in x or number of rows returned by i (10000).
I also tried:
> f(list("A", "B"))
Error in `[.data.table`(DT, , sum(N), by = list(columns)) :
column or expression 1 of 'by' or 'keyby' is type list. Do not quote column
names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
How do I make this work?
Here's how I would approach this:
f <- function(columns) {
Get <- if (!is.numeric(columns)) match(columns, names(DT)) else columns
columns <- names(DT)[Get]
DT[, sum(N), by = columns]
}
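The first line (Get..) leaves "columns" as it is if it is already numeric; otherwise it converts the column names to their numeric positions with match(). The second line then maps those positions back to column names, which is a form that "by" accepts as a character vector.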
Test it out with some sample data:
library(data.table)
set.seed(1)
DT <- data.table(
A = sample(letters[1:3], 20, TRUE),
B = sample(letters[1:5], 20, TRUE),
C = sample(LETTERS[1:2], 20, TRUE),
N = rnorm(20)
)
## Should work with either column number or name
f(1)
f("A")
f(c(1, 3))
f(c("A", "C"))
I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now, however, I want to do almost the reverse. Namely, I have a (700k * 2000) matrix of 0/1/2 or NA. In a separate dataframe I have two columns (Ref and Obs). A 0 corresponds to two instances of Ref, a 1 to one instance of Ref and one instance of Obs, and a 2 to two instances of Obs. To clarify, here is a data snippet:
Genotype File ---
Ref Obs
A G
T C
G C
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
Current Data---
Sample.1 Sample.2 .... Sample.2000
0 1 2
0 0 0
0 NA 1
mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)
Desired Data format---
Sample.1 Sample.1 Sample.2 Sample.2 Sample.2000 Sample.2000
A A A G G G
T T T T T T
G G 0 0 G C
I think that's right. The desired data format has two columns (space separated) for each sample. 0 in this format (plink ped file for the bioinformaticians out there) is missing data.
MAJOR ASSUMPTION: your data is in 3 element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3 element window this will not work (but code can be modified to make it work). This will work for an arbitrary number of columns, and arbitrary number of 3 row groups:
# Make up a matrix with your properties (4 cols, 6 rows)
col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=TRUE), ncol=col)
# Mapping data
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=TRUE, ncol=2) # how many from each of ref obs
# Function to map 0,1,2,NA to Ref/Obs
re_map <- function(mat.small) { # 3 row matrices, with col columns
t(
mapply( # iterate through each row in matrix
function(vals, map, num.to.let) {
vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
ifelse(is.na(vals.2), 0, vals.2)
},
vals=split(mat.small, row(mat.small)), # a row
map=split(map.base, row(map.base)), # the mapping for that row
MoreArgs=list(num.to.let=num.to.let) # general conversion of number to Obs/Ref
) )
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group
mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map))
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
# Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G G A G G G G G
# 2 C C 0 0 T C T C
# 3 0 0 G C G G G G
# 1 A A A A A G A A
# 2 C C C C T C C C
# 3 C C G G 0 0 0 0
I am not sure, but this could be what you need.
First, some simple data:
geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))
Then a couple of functions:
f <- function(i , x, geno){
x <- x[i]
if(!is.na(x)){
if (x == 0) {y <- geno[i , c(1,1)]}
if (x == 1) {y <- geno[i, c(1,2)]}
if (x == 2) {y <- geno[i, c(2,2)]}
}
else y <- c(0,0)
names(y) <- c("s1", "s2")
y
}
g <- function(x, geno){
Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}
The way f() is defined may not be the most elegant, but it does the job.
Then simply run it as a double loop in lapply fashion:
as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))
Hope it helps.
Here's one way based on the sample data in your answer:
# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))
# list of matrices
lst <- lapply(idx, function(x) {
tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
replace(tmp, is.na(tmp), 0)
})
# one data frame
as.data.frame(lst)
# s1.1 s1.2 s2.1 s2.2
# 1 A A A G
# 2 T T T T
# 3 G G 0 0
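The same indexing idea can be applied to the whole matrix at once, using the mat, Ref and Obs objects from the question. The sketch below assumes row i of mat corresponds to element i of Ref and Obs; alleles, first, second and out are illustrative names. It has not been tuned for a 700k * 2000 input, where memory use would need checking.

```r
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
mat <- cbind(c(0, 0, 0), c(1, 0, NA), c(2, 0, 1))

alleles <- cbind(Ref, Obs)   # row i holds the two alleles for marker i
rows <- as.vector(row(mat))  # marker index of every cell, column-major

# Column 1 (Ref) or 2 (Obs) for each of the two alleles of a genotype:
# 0 -> Ref Ref, 1 -> Ref Obs, 2 -> Obs Obs; NA stays NA
first  <- alleles[cbind(rows, as.vector((mat > 1) + 1))]
second <- alleles[cbind(rows, as.vector((mat > 0) + 1))]

# Interleave the two alleles so each sample gets two adjacent columns
out <- matrix(NA_character_, nrow = nrow(mat), ncol = 2 * ncol(mat))
out[, seq(1, ncol(out), by = 2)] <- first
out[, seq(2, ncol(out), by = 2)] <- second
out[is.na(out)] <- "0"       # missing genotypes are coded as 0
colnames(out) <- paste0("Sample.", rep(seq_len(ncol(mat)), each = 2))
print(out, quote = FALSE)
```

Indexing with a two-column matrix picks one (row, column) cell per genotype, and rows of the index that contain NA yield NA, which is then recoded to "0".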