Simple function does not work for `dcast` - reshape2 - r

Suppose I have this data:
c1 c2 c3
A A AA
A B BB
A C CC
B A DD
B B EE
B C FF
C A GG
C B HH
C C II
A A JJ
I want to reshape it with dcast, using this function:
dcast(data,c1~c2,value.var="c3",function(x)x)
But I get this error:
Error in vapply(indices, fun, .default) : values must be length 0,
but FUN(X[[1]]) result is length 1
How can I use a user-defined function with dcast?
I want to get:
A B C
A AA BB CC
B DD EE FF
C GG HH II
A JJ NA NA

Here's a possible solution using the rleid function, new in data.table v1.9.5+, which creates a run-length index for the c1 column (you can remove indx afterwards if you want). The data frame is called stocksm here:
library(data.table) # v 1.9.5+
dcast(setDT(stocksm)[, indx := rleid(c1)], indx + c1 ~ c2, value.var = "c3")
# indx c1 A B C
# 1: 1 A AA BB CC
# 2: 2 B DD EE FF
# 3: 3 C GG HH II
# 4: 4 A JJ NA NA
### installing the development version
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
So basically, after creating an index on c1, we spread the data more or less as before, while including indx on the left-hand side of the formula.
Or if you insist on tidyr, here's an option:
library(tidyr)
stocksm$indx <- with(rle(as.character(stocksm$c1)), rep(seq_along(lengths), lengths))
spread(stocksm, c2, c3)
# c1 indx A B C
# 1 A 1 AA BB CC
# 2 A 4 JJ <NA> <NA>
# 3 B 2 DD EE FF
# 4 C 3 GG HH II

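Another way to use dcast is to create unique identifiers with cumsum. Without such an identifier, the combination c1 = A, c2 = A occurs twice (AA and JJ), so dcast does not know which value belongs in that cell and has to aggregate. Wrapping c1 in factor() lets the run detection work whether c1 is a factor or a character column.
data$ids <- cumsum(c(TRUE, diff(as.numeric(factor(data$c1))) != 0L))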
dcast(data, ids+c1~c2, value.var="c3")[-1]
# c1 A B C
# 1 A AA BB CC
# 2 B DD EE FF
# 3 C GG HH II
# 4 A JJ <NA> <NA>
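The same run-identifier idea can be sketched without any packages. This is only an illustration, not the answerers' method: it rebuilds the question's data by hand, derives the identifier with rle, and uses base R's reshape in place of dcast, so the value columns come out named c3.A, c3.B and c3.C rather than A, B and C.

```r
# Rebuild the example data from the question
data <- data.frame(
  c1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A"),
  c2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
  c3 = c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"),
  stringsAsFactors = FALSE
)

# One id per run of consecutive identical c1 values: 1 1 1 2 2 2 3 3 3 4
runs <- rle(data$c1)
data$ids <- rep(seq_along(runs$lengths), runs$lengths)

# Every (ids, c2) pair is now unique, so no aggregation is needed
stopifnot(!any(duplicated(data[c("ids", "c2")])))

# Spread c3 across the levels of c2, one row per (ids, c1)
wide <- reshape(data, idvar = c("ids", "c1"), timevar = "c2",
                direction = "wide")
wide
```

The stopifnot line is the useful check in practice: if it fails, some cell would still receive more than one value and an aggregation function returning a single value would be required.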

Related

Pasting one column to every other column in a dataframe

I have a bunch of columns and I need to paste the first column into every other column. It looks like this, except with actual words instead of letters, and there are a few hundred columns.
TEST0 TEST1 TEST2 TEST3 TEST4
1 Q1: AA AA AA AA AA AA BB BB BB
2 Q2:
3 Q3: BB BB BB CC CC CC CC CC CC CC CC CC
4 Q4: DD DD DD DD DD DD DD DD DD
I'm able to paste the first column into another column one at a time doing this:
paste(test[,2],test[,3])
[1] "Q1: AA AA AA" "Q2: " "Q3: BB BB BB" "Q4: DD DD DD "
paste(test[,2],test[,4])
[1] "Q1: AA AA AA " "Q2: " "Q3: CC CC CC " "Q4: "
but is there a way to do multiple columns at once? Thanks
Here is a way of doing it with dplyr. Create your own pasting function first:
library(dplyr)
df <- data.frame(A = LETTERS, B = 1:26, C = 1:26)
head(df)
A B C
1 A 1 1
2 B 2 2
3 C 3 3
4 D 4 4
5 E 5 5
pasteA <- function(., x) paste0(df$A,.)
df %>%
mutate_if(.predicate = c(F, rep(T, ncol(df)-1)), .funs = list(pasteA))
A B C
1 A A1 A1
2 B B2 B2
3 C C3 C3
4 D D4 D4
5 E E5 E5
We use mutate_if to select all columns except the first one using a logical vector.
This is a base solution with a for loop. For every target column, paste the
first column to it.
df <- data.frame(a = letters[1:5], b = 1:5, c = 5:1)
for (i in 2:length(df)) {
df[[i]] <- paste(df[[1]], df[[i]], sep = ": ")
}
Where length gives the number of columns of a data.frame.
Result:
a b c
1 a a: 1 a: 5
2 b b: 2 b: 4
3 c c: 3 c: 3
4 d d: 4 d: 2
5 e e: 5 e: 1
{dplyr} is surprisingly convoluted for this case. A much easier solution is to use lapply (which works since data.frames are lists of columns):
as.data.frame(lapply(test[-1], function (x) paste(test[[1]], x)))
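A small variant of the lapply approach, as a sketch: assigning the result back into test[-1] keeps the first column and the original column names in place. The three-row test data frame below is made up for illustration and only mimics the shape of the question's data.

```r
# Hypothetical stand-in for the question's data
test <- data.frame(
  TEST0 = c("Q1:", "Q2:", "Q3:"),
  TEST1 = c("AA", "", "BB"),
  TEST2 = c("CC", "", "DD"),
  stringsAsFactors = FALSE
)

# Prefix every column except the first with the first column
test[-1] <- lapply(test[-1], function(x) paste(test[[1]], x))
test
```

Because paste is vectorised, this scales to a few hundred columns without an explicit loop.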

Merging two columns at once in R

I have a data frame with 127 columns and 518 rows. Now I have to merge the odd and even columns (starting from column 3).
>data
Name id S1 S2 S3 S4 S5 S6
abc 1 A A C C G G
abc 2 A G T T C G
abc 3 G C T A A C
The output which I want is
>output
Name id S2 S4 S6
abc 1 AA CC GG
abc 2 AG TT CG
abc 3 GC TA AC
Can anyone help me with this?
We can use Map to do this after subsetting the alternate columns
res <- data.frame(data[1:2], Map(paste0, data[-(1:2)][c(TRUE, FALSE)],
data[-(1:2)][c(FALSE, TRUE)]))
names(res)[3:5] <- names(data)[3:8][c(FALSE, TRUE)]
res
# Name id S2 S4 S6
#1 abc 1 AA CC GG
#2 abc 2 AG TT CG
#3 abc 3 GC TA AC
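For the full 127-column data, hard-coding 3:5 and 3:8 would not fit. A minimal sketch that avoids the fixed indices, assuming the first two columns are identifiers and the remaining columns come in adjacent pairs (the helper names vals, odd, even and merged are introduced here for illustration):

```r
# Rebuild the example data from the question
data <- data.frame(
  Name = "abc", id = 1:3,
  S1 = c("A", "A", "G"), S2 = c("A", "G", "C"),
  S3 = c("C", "T", "T"), S4 = c("C", "T", "A"),
  S5 = c("G", "C", "A"), S6 = c("G", "G", "C"),
  stringsAsFactors = FALSE
)

vals <- data[-(1:2)]            # everything after the id columns
odd  <- vals[c(TRUE, FALSE)]    # S1, S3, S5 (logical index is recycled)
even <- vals[c(FALSE, TRUE)]    # S2, S4, S6

# Paste each odd column to the even column that follows it
merged <- Map(paste0, odd, even)
names(merged) <- names(even)    # name the results after the even columns

res <- data.frame(data[1:2], merged, stringsAsFactors = FALSE)
res
```

The recycled logical index only pairs columns correctly when the number of value columns is even, so that is worth checking first on real data.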

melt data table and split values

I have a column in a data table which is a list of comma separated values
dt = data.table( a = c('a','b','c'), b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt'))
> dt
a b
1: a xx,yy,zz
2: b mm,nn
3: c qq,rr,ss,tt
I would like to transform it into a long format
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This question has been answered for a data frame here. I'm wondering if there is an elegant data table solution.
The following will work for your example:
dt[, c(b=strsplit(b, ",")), by=a]
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This method fails if the "by" variable is repeated as in
dt = data.table(a = c('a','b','c', 'a'),
b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
One robust solution in this situation can be had by using paste to collapse all observations with the same grouping variable (a) and feeding the result to the code above.
dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]
This returns
a b
1: a xx
2: a yy
3: a zz
4: a zz
5: a gg
6: a tt
7: b mm
8: b nn
9: c qq
10: c rr
11: c ss
12: c tt
There is another method, but it relies on an additional package: splitstackshape.
library(splitstackshape)
cSplit(dt, "b", sep = ",", direction = "long")
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This function uses data.table internally, and it works even if the same value appears several times in column "a".
We can split the column 'b' by the delimiter ',' (using strsplit), grouped by 'a' and set the name of the new column i.e. 'V1' to 'b' with setnames
setnames(dt[, strsplit(b, ','), by = a], "V1", "b")[]
# a b
#1: a xx
#2: a yy
#3: a zz
#4: b mm
#5: b nn
#6: c qq
#7: c rr
#8: c ss
#9: c tt
If there are repeating elements in 'a' as in the below example
dt <- data.table(a = c('a','b','c', 'a'),
b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
we can group by the sequence of rows, do the strsplit on 'b', concatenate with the 'a' column and assign (:=) the 'grp' to NULL
dt[, c(a=a, b=strsplit(b, ",")), .(grp = 1:nrow(dt))][, grp := NULL][]
# a b
# 1: a xx
# 2: a yy
# 3: a zz
# 4: b mm
# 5: b nn
# 6: c qq
# 7: c rr
# 8: c ss
# 9: c tt
#10: a zz
#11: a gg
#12: a tt
NOTE: Both the methods are data.table methods
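For comparison, the row-wise idea can be sketched in base R, independent of data.table. This is an illustration rather than one of the answers above: each value of a is repeated once per piece that its b string splits into, so repeated values of a need no special handling. The names a, b, parts and long are local to this sketch.

```r
# Same values as the repeated-'a' example above
a <- c("a", "b", "c", "a")
b <- c("xx,yy,zz", "mm,nn", "qq,rr,ss,tt", "zz,gg,tt")

# Split every string on commas; 'parts' is a list with one vector per row
parts <- strsplit(b, ",", fixed = TRUE)

# Repeat each 'a' by the number of pieces in its row, then flatten the pieces
long <- data.frame(
  a = rep(a, lengths(parts)),
  b = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Row order follows the original rows, so the second "a" group stays at the end instead of being merged with the first.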

Math function using multiple matching criteria

I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values in a particular row from multiple rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing to have a blank value subtracted, where the blank is the one that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" row from A, B, and C, using the c5 factor to define which Blank values are used (aa or bb). I would like the "Blank" values to be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe.)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the samples and the blanks in a single data structure. Then apply the function:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it, but this suffices. I'd also argue that there's no reason to keep the labels "aa" vs. "bb" in the initial data structure either, which would make this simpler; but it's your choice.
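One possible vectorized version, as a sketch in base R: match looks up, for every sample row, the Blank row with the same c5 value, and the subtraction then happens for all rows and columns at once. It assumes exactly one Blank row per c5 value; num_cols, blank, samples and idx are names introduced here.

```r
# Rebuild the example data from the question
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = TRUE, stringsAsFactors = FALSE)

num_cols <- c("c2", "c3", "c4")      # columns to correct
blank    <- df[df$c1 == "Blank", ]   # one blank row per c5 value (assumed)
samples  <- df[df$c1 != "Blank", ]

# For each sample row, the position of the blank row sharing its c5 value
idx <- match(samples$c5, blank$c5)

# Subtract the matching blank row from every sample row in one step
samples[num_cols] <- samples[num_cols] - blank[idx, num_cols]
samples
```

With thousands of numeric columns, num_cols could be built from the column names instead of typed out, for example with setdiff(names(df), c("c1", "c5")).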

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
j <- m1[i,"value"]
k <- 2
while ( k <= j) {
m2 <- rbind(m2,m1[i,])
k = k+1
}
}
> m2 <- subset(m2,select = - value)
> m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats each entry according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use table with as.data.frame to create 'dM'
dM <- as.data.frame(as.table(m0))
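Note that with the as.table route the count column is named Freq rather than value, so the replication step has to refer to that name. A minimal base R sketch putting the two steps together (the name long is introduced here):

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format with columns Var1, Var2 and Freq
dM <- as.data.frame(as.table(m0), stringsAsFactors = FALSE)

# Repeat each row index Freq times and drop the count column
long <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
row.names(long) <- NULL
long

# Cross-tabulating the long data recovers the original counts
table(long$Var1, long$Var2)
```

A combination with a count of zero simply disappears from the long data, since rep repeats that row zero times.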
