Merging two columns at once in R - r

I have a data frame with 127 columns and 518 rows. Now I have to merge each odd column with the even column that follows it (starting from column 3).
>data
Name id S1 S2 S3 S4 S5 S6
abc 1 A A C C G G
abc 2 A G T T C G
abc 3 G C T A A C
The output which I want is
>output
Name id S2 S4 S6
abc 1 AA CC GG
abc 2 AG TT CG
abc 3 GC TA AC
Can anyone help me with this?

We can use Map to do this after subsetting the alternate columns
res <- data.frame(data[1:2], Map(paste0, data[-(1:2)][c(TRUE, FALSE)],
data[-(1:2)][c(FALSE, TRUE)]))
names(res)[3:5] <- names(data)[3:8][c(FALSE, TRUE)]
res
# Name id S2 S4 S6
#1 abc 1 AA CC GG
#2 abc 2 AG TT CG
#3 abc 3 GC TA AC
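With 127 columns, hardcoding positions such as 3:5 and 3:8 gets awkward. The same Map idea can be wrapped so the positions are computed from the data. This is only a sketch: `merge_pairs` and `first_col` are hypothetical names, and it assumes the columns to merge start at column 3 and come in an even count (the check stops early if there is a stray unpaired column).

```r
# Hypothetical helper: paste each odd/even column pair, starting at `first_col`.
merge_pairs <- function(data, first_col = 3) {
  keep  <- data[seq_len(first_col - 1)]   # identifier columns, left as-is
  pairs <- data[first_col:ncol(data)]     # columns to be merged pairwise
  stopifnot(ncol(pairs) %% 2 == 0)        # assumption: an even number of columns
  left  <- pairs[c(TRUE, FALSE)]          # S1, S3, S5, ...
  right <- pairs[c(FALSE, TRUE)]          # S2, S4, S6, ...
  merged <- Map(paste0, left, right)
  names(merged) <- names(right)           # name each result after the even column
  data.frame(keep, merged, stringsAsFactors = FALSE)
}

# The example data from the question
data <- data.frame(
  Name = "abc", id = 1:3,
  S1 = c("A", "A", "G"), S2 = c("A", "G", "C"),
  S3 = c("C", "T", "T"), S4 = c("C", "T", "A"),
  S5 = c("G", "C", "A"), S6 = c("G", "G", "C"),
  stringsAsFactors = FALSE
)
res <- merge_pairs(data)
res
```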

Related

Pasting one column to every other column in a dataframe

I have a bunch of columns and I need to paste the first column into every other column. It looks like this, except with actual words instead of letters, and there are a few hundred columns.
TEST0 TEST1 TEST2 TEST3 TEST4
1 Q1: AA AA AA AA AA AA BB BB BB
2 Q2:
3 Q3: BB BB BB CC CC CC CC CC CC CC CC CC
4 Q4: DD DD DD DD DD DD DD DD DD
I'm able to paste the first column into another column one at a time doing this:
paste(test[,2],test[,3])
[1] "Q1: AA AA AA" "Q2: " "Q3: BB BB BB" "Q4: DD DD DD "
paste(test[,2],test[,4])
[1] "Q1: AA AA AA " "Q2: " "Q3: CC CC CC " "Q4: "
but is there a way to do multiple columns at once? Thanks
Here is a way of doing it with dplyr. Create your own pasting function first:
df <- data.frame(A = LETTERS, B = 1:26, C = 1:26)
head(df)
A B C
1 A 1 1
2 B 2 2
3 C 3 3
4 D 4 4
5 E 5 5
library(dplyr)
# paste column A in front of whatever column is passed in
pasteA <- function(x) paste0(df$A, x)
df %>%
  mutate_if(.predicate = c(FALSE, rep(TRUE, ncol(df) - 1)), .funs = list(pasteA))
A B C
1 A A1 A1
2 B B2 B2
3 C C3 C3
4 D D4 D4
5 E E5 E5
We use mutate_if to select all columns except the first one using a logical vector.
This is a base solution with a for loop. For every target column, paste the
first column to it.
df <- data.frame(a = letters[1:5], b = 1:5, c = 5:1)
for (i in 2:length(df)) {
df[[i]] <- paste(df[[1]], df[[i]], sep = ": ")
}
Where length gives the number of columns of a data.frame.
Result:
a b c
1 a a: 1 a: 5
2 b b: 2 b: 4
3 c c: 3 c: 3
4 d d: 4 d: 2
5 e e: 5 e: 1
{dplyr} is surprisingly convoluted for this case. A much easier solution is to use lapply (which works since data.frames are lists of columns):
as.data.frame(lapply(test[-1], function (x) paste(test[[1]], x)))
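To see what the lapply one-liner returns, here is a small self-contained run. The `test` data frame below is a made-up stand-in for the one in the question (the real values are not recoverable from the post), and the final `cbind` step is an optional extra for when the first column should be kept.

```r
# Made-up stand-in for the `test` data frame from the question.
test <- data.frame(
  TEST0 = c("Q1:", "Q2:", "Q3:"),
  TEST1 = c("AA", "", "BB"),
  TEST2 = c("CC", "DD", ""),
  stringsAsFactors = FALSE
)

# Paste the first column onto every other column; lapply keeps the column names.
pasted <- as.data.frame(
  lapply(test[-1], function(x) paste(test[[1]], x)),
  stringsAsFactors = FALSE
)

# Optional: put the untouched first column back in front.
result <- cbind(test[1], pasted)
result
```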

How to create an ID, based on the minimum unique length of grouped columns?

I have a data.table similar to this one:
library(data.table)
set.seed(175232)
DT <- data.table( AU2 = sample(LETTERS[1:3], 10, TRUE), INST = sample(LETTERS[-4], 10),  # LETTERS[1-5] is LETTERS[-4], i.e. every letter except D
ZIP = replicate(10, paste0(sample(LETTERS[1:2],1), sample(10:15,1), collapse="")),
CU = replicate(10, paste0(sample(LETTERS[1:3], 2), collapse = "" )))
DT[,AID1:= .GRP, by= AU2]
setkey(DT, AID1)
setorder(DT, AID1)
AU2 INST ZIP CU AID1
1: B A B11 BC 1
2: B T B12 AC 1
3: B S A13 AC 1
4: B C A12 BC 1
5: B Q B11 BC 1
6: B J B12 BC 1
7: B L A12 AC 1
8: C I A11 BC 2
9: A W A14 CB 3
10: A O B12 AB 3
I would like to create an ID based on the number of unique elements in three columns, grouped by another column, AID1. Note that the AID1 column is apparently redundant here, since the data could be grouped by AU2 directly, but I use AID1 later. First, I compute the number of unique values in each column per group.
sd.cols = c("INST", "ZIP", "CU" )
DT[, lapply(.SD, function(x){length(unique(x)) }) , by=AID1, .SDcols = sd.cols]
AID1 INST ZIP CU
1: 1 7 4 2
2: 2 1 1 1
3: 3 2 2 2
Then, to find the column with the fewest unique values, I can just run
DT[, which.min(lapply(.SD, function(x){length(unique(x)) })) , by=AID1, .SDcols = sd.cols]
AID1 V1
1: 1 3
2: 2 1
3: 3 1
That is, for AID1 1 the fewest unique values are in column 3, for AID1 2 in column 1, and for AID1 3 in column 1. Taking AID1 2 as an example, the minimum is in column 1 and its value in the previous table is 1, i.e. there is only one distinct value, so the function should return the original ID, 2. For AID1 1, the minimum is in column 3 and its length is 2, so I would like to return AID1 plus a suffix, 1 or 2, according to the values in that column. My approach so far was to write a small function and run it by group inside the data.table, assigning by reference.
mk_id <- function(aid1, inst, zip, cu) {
grp <- list(inst, zip, cu)
u_grp <- lapply(grp, function(x) {
length(unique(x))
})
if (any(u_grp == 1)) {
paste0(aid1)
} else{
paste0(aid1, "-", as.integer(factor(grp[[which.min(u_grp)]])) )
}
}
DT[, ID:= mk_id(AID1, INST, ZIP, CU), by = AID1 ]
DT
AU2 INST ZIP CU AID1 ID
1: B A B11 BC 1 1-2
2: B T B12 AC 1 1-1
3: B S A13 AC 1 1-1
4: B C A12 BC 1 1-2
5: B Q B11 BC 1 1-2
6: B J B12 BC 1 1-2
7: B L A12 AC 1 1-1
8: C I A11 BC 2 2
9: A W A14 CB 3 3-2
10: A O B12 AB 3 3-1
Although this works (I figured it out while writing this question), surely there is a more straightforward way.
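For comparison, the same logic can be restated in base R, which may make the individual steps easier to follow. This is only a sketch, not necessarily a simpler or faster approach: `sub_id` is a hypothetical helper name, the data frame is typed in from the printed table above, and ties between columns are resolved by taking the first one, as `which.min` does.

```r
# Data typed in from the printed table in the question (AU2 omitted).
df <- data.frame(
  INST = c("A", "T", "S", "C", "Q", "J", "L", "I", "W", "O"),
  ZIP  = c("B11", "B12", "A13", "A12", "B11", "B12", "A12", "A11", "A14", "B12"),
  CU   = c("BC", "AC", "AC", "BC", "BC", "BC", "AC", "BC", "CB", "AB"),
  AID1 = c(rep(1L, 7), 2L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: IDs for ONE group, using the column with the fewest unique values.
sub_id <- function(group_id, cols) {
  n_unique <- vapply(cols, function(x) length(unique(x)), integer(1))
  # A column with a single distinct value means no suffix is needed.
  if (any(n_unique == 1)) return(as.character(group_id))
  key <- cols[[which.min(n_unique)]]
  # Number the distinct values in sorted order, as as.integer(factor(key)) would.
  paste0(group_id, "-", match(key, sort(unique(key))))
}

# Apply the helper per AID1 group and put the pieces back in the original row order.
df$ID <- unsplit(
  lapply(split(df, df$AID1), function(g) sub_id(g$AID1, g[c("INST", "ZIP", "CU")])),
  df$AID1
)
df
```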

Math function using multiple matching criteria

I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values of a particular row from multiple other rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each of which needs to have a blank value subtracted from it, where the blank is the one that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from rows A, B, C, and D, using the c5 factor to define which Blank values are used (aa or bb). The "Blank" values should be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the blanks and the samples in a single data structure. Then apply the function:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it. But this suffices. I'd also argue that there's no reason to keep the labels "aa" vs. "bb" in the initial data structure either, which would make this simpler; but it's your choice.
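The vectorized variant mentioned above could look like the following in base R. It is a sketch under two assumptions: there is exactly one "Blank" row per c5 value, and `value_cols` (a name introduced here) lists the numeric columns to adjust. `match` looks up, for every sample row, the blank row with the same c5.

```r
df <- data.frame(
  c1 = c("A", "B", "C", "D", "Blank", "Blank"),
  c2 = c(1, 2, 3, 4, 2, 3),
  c3 = c(2, 3, 4, 1, 3, 4),
  c4 = c(3, 4, 5, 6, 4, 5),
  c5 = c("aa", "bb", "aa", "bb", "aa", "bb"),
  stringsAsFactors = FALSE
)

value_cols <- c("c2", "c3", "c4")        # columns to subtract the blanks from
blanks  <- df[df$c1 == "Blank", ]
samples <- df[df$c1 != "Blank", ]

# For each sample row, the index of the blank row sharing its c5 value.
idx <- match(samples$c5, blanks$c5)

# Subtract the matched blank rows from all value columns in one step.
samples[value_cols] <- samples[value_cols] - blanks[idx, value_cols]
samples
```

With >5000 columns only `value_cols` needs to change, for example to every numeric column name.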

Compare two tables and return list of mismatches

I'd like to be able to compare two tables and have R return a list of records and variables that don't match.
For example, with the following two tables
> df1
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d d 4
5 5e e 5
> df2
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d e 4
5 5e d 5
I would want a compare() function to return something like "id=4d, let" to let me know that the let variable in the record with id = 4d doesn't match.
I have seen the compare library in CRAN but it only returns TRUE or FALSE for the entire variable if there is a mismatch. Is there a library with a different compare function, or a way to do this manually?
df1 <- read.table(text="
id let1 num1
1a a 1
2b b 2
3c c 3
4d d 4
5e e 5", head=T, as.is=T)
df2 <- read.table(text="
id let2 num2
1a a 1
2b b 2
3c c 3
4d e 4
5e d 5", head=T, as.is=T)
df <- merge(df1, df2, by="id")
df$let <- ifelse(df$let1 == df$let2, "equal", "not equal")
df$num <- ifelse(df$num1 == df$num2, "equal", "not equal")
df
# id let1 num1 let2 num2 let num
# 1 1a a 1 a 1 equal equal
# 2 2b b 2 b 2 equal equal
# 3 3c c 3 c 3 equal equal
# 4 4d d 4 e 4 not equal equal
# 5 5e e 5 d 5 not equal equal
You mean something like which? Quick reproducible example:
> m1 <- m2 <- matrix(1:9, 3)
> diag(m1) <- 0
> which(m1 != m2, arr.ind = TRUE)
row col
[1,] 1 1
[2,] 2 2
[3,] 3 3
Something like:
df_diff <- list()
for (i in 1:ncol(df1))
{
df_diff[[i]] <- df1$id[df2[i] != df1[i]]
names(df_diff)[i] <- names(df1)[i]
}
This should produce (hopefully :)) a list of character vectors (one for each variable). Each vector contains the IDs of df1 where the records of the two df don't match.
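The loop above can also be arranged to return one row per mismatch, which is close to the "id=4d, let" output asked for. A minimal sketch, assuming both tables have the same columns, the same ids in the same order, and no missing values; `find_mismatches` and its `key` argument are hypothetical names.

```r
df1 <- data.frame(id = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "d", "e"),
                  num = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(id = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "e", "d"),
                  num = 1:5, stringsAsFactors = FALSE)

# Hypothetical helper: one row per (id, variable) pair whose values differ.
find_mismatches <- function(x, y, key = "id") {
  # Assumption: rows are already aligned and the columns are identical.
  stopifnot(identical(names(x), names(y)), identical(x[[key]], y[[key]]))
  vars <- setdiff(names(x), key)
  hits <- lapply(vars, function(v) {
    bad <- x[[v]] != y[[v]]              # compare one variable at a time
    data.frame(id = x[[key]][bad],
               variable = rep(v, sum(bad)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

mismatches <- find_mismatches(df1, df2)
mismatches
```

If the rows are not in the same order, merging on the id first (as in the merge answer above) would be needed before comparing.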

Simple function does not work for `dcast` - reshape2

Suppose I have this data:
c1 c2 c3
A A AA
A B BB
A C CC
B A DD
B B EE
B C FF
C A GG
C B HH
C C II
A A JJ
I want to reshape them with dcast with this function:
dcast(data,c1~c2,value.var="c3",function(x)x)
But I get this error:
Error in vapply(indices, fun, .default) : values must be length 0,
but FUN(X[[1]]) result is length 1
How can I use a user-defined function with dcast?
I want to get:
A B C
A AA BB CC
B DD EE FF
C GG HH II
A JJ NA NA
Here's a possible solution using the rleid function, new in data.table v1.9.5+, which creates a run index for the c1 column (you can remove indx afterwards if you want). The data frame is called stocksm here.
library(data.table) # v 1.9.5+
dcast(setDT(stocksm)[, indx := rleid(c1)], indx + c1 ~ c2, value.var = "c3")
# indx c1 A B C
# 1: 1 A AA BB CC
# 2: 2 B DD EE FF
# 3: 3 C GG HH II
# 4: 4 A JJ NA NA
### installing the development version
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
So basically, after creating an index on c1, we spread the data more or less as before, while including indx in the formula.
Or if you insist on tidyr, here's an option
library(tidyr)
stocksm$indx <- with(rle(as.character(stocksm$c1)), rep(seq_along(lengths), lengths))
spread(stocksm, c2, c3)
# c1 indx A B C
# 1 A 1 AA BB CC
# 2 A 4 JJ <NA> <NA>
# 3 B 2 DD EE FF
# 4 C 3 GG HH II
Another way to use dcast is to create unique identifiers with cumsum. Without such an identifier, dcast cannot tell which value to fill in for duplicated combinations like A A.
data$ids <- cumsum(c(TRUE, diff(as.numeric(factor(data$c1))) != 0L))  # factor() so this also works when c1 is character
dcast(data, ids+c1~c2, value.var="c3")[-1]
# c1 A B C
# 1 A AA BB CC
# 2 B DD EE FF
# 3 C GG HH II
# 4 A JJ <NA> <NA>
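All three answers depend on the same ingredient: a run index that increases every time c1 changes, so that the second block of A rows gets its own identifier. The sketch below computes that index two ways in base R, on the c1 values from the question; the variable names are introduced here for illustration.

```r
c1 <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A")

# Via rle: repeat each run's number as many times as the run is long.
runs <- rle(c1)
run_id_rle <- rep(seq_along(runs$lengths), runs$lengths)

# Via cumsum: count how often a value differs from the one before it.
run_id_cumsum <- cumsum(c(TRUE, c1[-1] != c1[-length(c1)]))

run_id_rle
run_id_cumsum
```

Comparing neighbouring elements directly works on character vectors and factors alike, so no numeric conversion is needed.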
