Insert an empty column between every column of a dataframe in R

Say you have a dataframe of four columns:
dat <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5), D = rnorm(5))
And you want to insert an empty column between each of the columns in the dataframe, so that the output is:
A A1 B B1 C C1 D D1
1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
How would you do this?
The dataframe I would like to do this operation to has hundreds of columns and so obviously I don't want to type out each column and add them naively like this:
dat$A1 <- NA
dat$B1 <- NA
dat$C1 <- NA
dat$D1 <- NA
dat <- dat[, c("A", "A1", "B", "B1", "C", "C1", "D", "D1")]
Thanks for your help in advance!

You can try
res <- data.frame(dat, dat*NA)[order(rep(names(dat),2))]
res
# A A.1 B B.1 C C.1 D D.1
#1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
#2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
#3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
#4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
#5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
NOTE: I am leaving the . in the column names as it is a trivial task to remove it.
Or another option is
dat[paste0(names(dat),1)] <- NA
dat[order(names(dat))]
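One caveat with sorting by name: both options above only keep the original column order when the column names already sort alphabetically. A minimal sketch that interleaves by position instead, assuming a small made-up data frame (`dat2`, `blank`, and `idx` are illustrative names, not from the answers above):

```r
# Hypothetical data whose column names are NOT in alphabetical order
dat2 <- data.frame(B = 1:3, A = 4:6, C = 7:9)

# An all-NA copy with suffixed names
blank <- dat2
blank[] <- NA
names(blank) <- paste0(names(dat2), "1")

# order() is stable, so 1,2,3,1,2,3 becomes positions 1,4,2,5,3,6
idx <- order(rep(seq_along(dat2), 2))
res <- cbind(dat2, blank)[idx]
names(res)
```

Because the interleaving relies on column positions only, it should not matter what the columns are called.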

You can try this:
df <- cbind(dat, dat)
df <- df[, sort(names(df))]
df[, seq(2, ncol(df), by = 2)] <- NA
names(df) <- sub("\\.", "", names(df))

# create new data frame with twice the number of columns
bigdat <- data.frame(matrix(ncol = dim(dat)[2]*2, nrow = dim(dat)[1]))
# set sequence of target column indices
inds <- seq(1,dim(bigdat)[2],by=2)
# insert values
bigdat[,inds] <- dat
# set column names
colnames(bigdat)[inds] <- colnames(dat)

Related

How to create columns in a loop?

I would like to create some columns with a loop. I am not sure why it is not working. To simplify, let's just assume that I want several columns with missing values.
Below are just some codes I've tried:
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
for(i in varlist){
df <- df %>% mutate(i = NA)
}
I have also tried:
letterseq <- c(LETTERS[1:8])
for(i in letterseq){
df <- df %>% mutate(paste("5000", i, sep = "_"), NA)
}
Or even:
letterseq <- c(LETTERS[1:8])
for(i in letterseq){
df <- df %>% assign(paste("5000", i, sep = "_"), NA)
}
All are giving me different errors. I would like to get by the end of the code 8 different columns called 5000_A, 5000_B, 5000_C, 5000_D, 5000_E, 5000_F, 5000_G, 5000_H.
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
for(i in varlist){
df[[i]] <- NA
}
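The loop is not strictly needed, since base R's `[<-` for data frames accepts a character vector of new column names. A small sketch, assuming a stand-in data frame `df` with a single `id` column (hypothetical, since the question does not show the real data):

```r
# Hypothetical starting data frame
df <- data.frame(id = 1:3)

# Build the eight names 5000_A ... 5000_H
varlist <- paste("5000", LETTERS[1:8], sep = "_")

# Create all columns at once, filled with NA
df[varlist] <- NA
names(df)
```

The names are not syntactic R names, so later access needs backticks or `[[`, for example `df[["5000_A"]]`.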
Here is a solution using the package data.table.
dt[, varlist[1:length(varlist)]:=NA]
For example...
library(data.table)
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
dt <- data.table("A" = c(1,2,3), B = c("a", "b", "c"))
dt[, varlist[1:length(varlist)]:=NA]
> dt
A B 5000_A 5000_B 5000_C 5000_D 5000_E 5000_F 5000_G 5000_H
1: 1 a NA NA NA NA NA NA NA NA
2: 2 b NA NA NA NA NA NA NA NA
3: 3 c NA NA NA NA NA NA NA NA

How to select n random values from each row of a dataframe in R?

I have a dataframe
df= data.frame(a=c(56,23,15,10),
b=c(43,NA,90.7,30.5),
c=c(12,7,10,2),
d=c(1,2,3,4),
e=c(NA,45,2,NA))
I want to keep two random non-NA values in each row and convert the rest to NA.
Required Output- will differ because of randomness
df= data.frame(
a=c(56,NA,15,NA),
b=c(43,NA,NA,NA),
c=c(NA,7,NA,2),
d=c(NA,NA,3,4),
e=c(NA,45,NA,NA))
Code Used
I know how to select random non-NA values from a specific row:
set.seed(2)
sample(which(!is.na(df[1,])),2)
But I have no idea how to apply it to the whole dataframe and get the required output.
You may write a function to keep n random values in a row.
keep_n_value <- function(x, n) {
x1 <- which(!is.na(x))
x[-sample(x1, n)] <- NA
x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
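Note that `keep_n_value()` assumes every row has at least `n` non-NA values; with a single non-NA position, `sample(x1, n)` would also sample from `1:x1` rather than from `x1` itself. A hedged variant (the name `keep_n_safe` is made up here) that leaves short rows untouched might look like this:

```r
# Keep n random non-NA values of x and set the rest to NA.
# Rows with n or fewer non-NA values are returned unchanged (an assumption
# about the desired behaviour, not something the question specifies).
keep_n_safe <- function(x, n) {
  idx <- which(!is.na(x))
  if (length(idx) <= n) return(x)
  keep <- idx[sample.int(length(idx), n)]  # sample positions, not values
  x[-keep] <- NA
  x
}

df <- data.frame(a = c(56, 23, 15, 10),
                 b = c(43, NA, 90.7, 30.5),
                 c = c(12, 7, 10, 2),
                 d = c(1, 2, 3, 4),
                 e = c(NA, 45, 2, NA))

set.seed(123)
out <- df
out[] <- t(apply(df, 1, keep_n_safe, n = 2))
out
```

Sampling from `seq_along(idx)` via `sample.int()` avoids the single-value surprise in `sample()`.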
Base R:
You could try a column-wise sapply that randomly replaces two non-NA values in each column with NA, like:
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
df %>%
mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example:
[image: matrix]
and a vector of genes.
My task is to search the matrix row by row and pair each gene that has a mutation (D707H in the matrix shown) with the rest of the genes in the vector, adding the pairs to a new matrix. I tried to do this with loops, but I have no idea how to write it correctly. For this matrix the result should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this:
[image: my idea]
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was; now I have the matrix. Your code works well, but that's not exactly what I want to do.
a, b, c, d, etc. are organisms and the row names are genes (A, B, C, D, etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value 4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the numbers of elements do not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
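One possible direction, offered only as a sketch: for every non-NA cell, pair the gene in that cell with all the other genes. It assumes samples (organisms) in rows and genes in columns, as the `rownames(a)` / `colnames(a)` usage above suggests; transpose first if the layout is the other way round. The matrix `m`, the vector `genes`, and the sample names are made up for illustration.

```r
# Hypothetical data: rows are samples, columns are genes
genes <- c("NBN", "BRCA1", "BRCA2", "CHEK2")
m <- matrix(NA_character_, nrow = 2, ncol = length(genes),
            dimnames = list(c("s1", "s2"), genes))
m["s1", "NBN"] <- "D707H"

# Row/column positions of every non-NA cell
hits <- which(!is.na(m), arr.ind = TRUE)

# For each hit, pair the mutated gene with every other gene
pairs <- do.call(rbind, lapply(seq_len(nrow(hits)), function(k) {
  g <- colnames(m)[hits[k, "col"]]
  data.frame(sample = rownames(m)[hits[k, "row"]],
             gene1  = g,
             gene2  = setdiff(genes, g),
             stringsAsFactors = FALSE)
}))
pairs
```

Each hit contributes `length(genes) - 1` rows, which avoids the length mismatch that arises when two separate `which()` results are placed side by side in one `data.frame()` call.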

Collapse and intersect data frames

I have two data.frames, each with 3 columns:
1. id - a unique key
2. target - semicolon-separated unique values
3. source - the same within each data frame but different between the two data.frames.
Here's simulated data:
set.seed(1)
df.1 <- data.frame(id=LETTERS[sample(length(LETTERS),10,replace=F)],
target=sapply(1:10,function(x) paste(LETTERS[sample(length(LETTERS),5,replace=F)],collapse=";")),
source="A",stringsAsFactors=F)
df.2 <- data.frame(id=LETTERS[sample(length(LETTERS),5,replace=F)],
target=sapply(1:5,function(x) paste(LETTERS[sample(length(LETTERS),5,replace=F)],collapse=";")),
source="B",stringsAsFactors=F)
I'm looking for a function that will collapse the two data.frames together and will create 3 columns:
1. intersected.targets - semicolon-separated unique values that are shared by the two data.frames
2. source1.targets - targets that are unique to the first data.frame
3. source2.targets - targets that are unique to the second data.frame
So for the example above the resulting data.frame will be:
> res.df
id intersected.targets sourceA.targets sourceB.targets
1 G NA F;E;Q;I;X <NA>
2 J NA M;R;X;I;Y <NA>
3 N NA Y;F;P;C;Z <NA>
4 U NA K;A;J;U;H <NA>
5 E NA M;O;L;E;S <NA>
6 S NA R;T;C;Q;J <NA>
7 W NA V;Q;S;M;L <NA>
8 M NA U;A;L;Q;P <NA>
9 B NA C;H;M;P;I <NA>
10 X NA <NA> G;L;S;B;T
11 H NA <NA> I;U;Z;H;K
12 Y NA <NA> L;R;J;H;Q
13 O NA <NA> F;R;C;Z;D
14 L V M;K;F;B X;J;R;Y
This is a continuation of DavidArenberg's deleted answer that taught me the notion of creating a list column in a data.table. I didn't know how to properly implement my idea of using setdiff row by row but eventually after multiple searches found an answer by Frank that does it. Here is David's (partial) answer:
=====
Here's a possible solution on a different seed that has more than one intersection and more than one letter in a single intersection
#Generating Data
set.seed(123)
df.1 <- data.frame(id=LETTERS[sample(length(LETTERS),10,replace=F)],
target=sapply(1:10,function(x) paste(LETTERS[sample(length(LETTERS),5,
replace=F)],collapse=";")),
source="A",stringsAsFactors=F)
df.2 <- data.frame(id=LETTERS[sample(length(LETTERS),5, replace=F)],
target=sapply(1:5,function(x) paste(LETTERS[sample(length(LETTERS),5,
replace=F)],collapse=";")),
source="B",stringsAsFactors=F)
#Solution
library(data.table)
library(stringi)
res <- dcast(rbind(setDT(df.1), setDT(df.2)), id ~ source, value.var = "target")
res[!is.na(A) & !is.na(B), intersected.targets :=
stri_extract_all(A, regex = gsub(";", "|", B, fixed = TRUE))]
res
==========================
So I used his listifying code to make A2 and B2 columns that are the list versions of A and B:
res[ , A2 := stri_extract_all(A, regex = "[[:alpha:]]") ]
res[ , B2 := stri_extract_all(B, regex = "[[:alpha:]]") ]
Then used Map() to do a row by row setdiff:
res[, SourceA := Map( setdiff, A2, intersected.targets)]
res[, SourceB := Map( setdiff, B2, intersected.targets)]
res
#-------------------------------
id A B intersected.targets A2 B2 SourceA SourceB
1: A M;S;F;H;X NA NULL M,S,F,H,X NA M,S,F,H,X NA
2: C NA T;P;R;A;K NULL NA T,P,R,A,K NA T,P,R,A,K
3: G NA G;Q;K;S;C NULL NA G,Q,K,S,C NA G,Q,K,S,C
4: H Y;L;Q;N;C NA NULL Y,L,Q,N,C NA Y,L,Q,N,C NA
5: J X;R;P;W;O F;J;O;I;C O X,R,P,W,O F,J,O,I,C X,R,P,W F,J,I,C
6: K D;K;J;I;Z NA NULL D,K,J,I,Z NA D,K,J,I,Z NA
7: Q D;F;L;G;S NA NULL D,F,L,G,S NA D,F,L,G,S NA
8: R NA L;U;T;S;J NULL NA L,U,T,S,J NA L,U,T,S,J
9: T X;G;B;H;U NA NULL X,G,B,H,U NA X,G,B,H,U NA
10: U S;N;O;G;D NA NULL S,N,O,G,D NA S,N,O,G,D NA
11: W Z;W;Q;S;A NA NULL Z,W,Q,S,A NA Z,W,Q,S,A NA
12: X B;L;T;C;M NA NULL B,L,T,C,M NA B,L,T,C,M NA
13: Z F;D;S;U;I L;Y;V;U;D D,U F,D,S,U,I L,Y,V,U,D F,S,I L,Y,V
I'm leaving the clean-up as a student exercise.
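For comparison, the same split / intersect / setdiff idea can be sketched in base R without list columns. This is only an illustration on toy data modeled on the `J` row above; the helper names `split_targets` and `collapse` are made up, and the output columns follow the names used in the question.

```r
# Toy inputs (hypothetical), same shape as df.1 and df.2
df.1 <- data.frame(id = c("J", "A"), target = c("X;R;P;W;O", "M;S"),
                   source = "A", stringsAsFactors = FALSE)
df.2 <- data.frame(id = c("J", "C"), target = c("F;J;O;I;C", "T;P"),
                   source = "B", stringsAsFactors = FALSE)

# Full outer join on id; target columns become target.A and target.B
m <- merge(df.1[c("id", "target")], df.2[c("id", "target")],
           by = "id", all = TRUE, suffixes = c(".A", ".B"))

# NA -> empty set, "X;Y" -> c("X", "Y")
split_targets <- function(x) {
  if (is.na(x)) character(0) else strsplit(x, ";", fixed = TRUE)[[1]]
}
# empty set -> NA, otherwise a semicolon-separated string
collapse <- function(x) {
  if (length(x)) paste(x, collapse = ";") else NA_character_
}

a <- lapply(m$target.A, split_targets)
b <- lapply(m$target.B, split_targets)

res <- data.frame(
  id = m$id,
  intersected.targets = mapply(function(x, y) collapse(intersect(x, y)), a, b),
  sourceA.targets     = mapply(function(x, y) collapse(setdiff(x, y)), a, b),
  sourceB.targets     = mapply(function(x, y) collapse(setdiff(y, x)), a, b),
  stringsAsFactors = FALSE
)
res
```

Collapsing back to strings right away sidesteps the clean-up of list columns, at the cost of re-splitting if the sets are needed again later.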
The pain in the butt in this type of data cleaning, as @42- mentions, is unlisting data frames of lists.
library(dplyr)
library(tidyr) # provides spread()
library(stringr)
df <- full_join(df.1, df.2) %>%
spread(source, target) %>%
mutate(intersect_targets = str_c(A,B,sep = ";"))
df[,4][!is.na(df[,4])] <- names(do.call("c",lapply(df$intersect_targets, function(x)
which(table(str_split(x, ";"))>1))))
a <- sapply(seq(nrow(df)), function(x) {
str_split(df[x,2:3],";")
})
sa <- do.call("c",lapply(mapply(setdiff,a[1,], a[2,]),paste0, collapse = ","))
sb <- do.call("c",lapply(mapply(setdiff,a[2,], a[1,]), paste0, collapse = ","))
df[,2:3] <-cbind(sa,sb)
head(df)
id A B intersect_targets
1 B C,H,M,P,I NA <NA>
2 E M,O,L,E,S NA <NA>
3 G F,E,Q,I,X NA <NA>
4 H NA I,U,Z,H,K <NA>
5 J M,R,X,I,Y NA <NA>
6 L M,K,F,B X,J,R,Y V

R Loop Script to Create Many, Many Variables

I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi:
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
# assuming that you now need to add more columns, from a4 to a200
# first, create the sequence from 4 to 200
v <- 4:200
# then use that sequence to add the 197 new columns
a[, paste0("a", v) := NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
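Since the question involves several sheets, one way to avoid repeating this per data frame is to keep the sheets in a named list and add the columns with `Map()`. A minimal sketch, assuming two tiny made-up sheets and a hypothetical vector `n_new` giving how many variables each sheet contributes:

```r
# Hypothetical sheets, keyed by their letter
sheets <- list(a = data.frame(id = 1:2),
               b = data.frame(id = 1:3))

# Hypothetical number of new variables per sheet (each at least 1)
n_new <- c(a = 3, b = 2)

# Add <letter>_variable1 ... <letter>_variableK to each sheet
sheets <- Map(function(d, prefix, k) {
  d[paste0(prefix, "_variable", seq_len(k))] <- NA
  d
}, sheets, names(sheets), n_new)

names(sheets$a)
names(sheets$b)
```

Prefixing each name with the sheet letter keeps the columns distinct if the sheets are later combined into one data frame.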
