Pasting one column to every other column in a dataframe - r

I have a bunch of columns and I need to paste the first column onto every other column. It looks like this, except with actual words instead of letters, and there are a few hundred columns.
TEST0 TEST1 TEST2 TEST3 TEST4
1 Q1: AA AA AA AA AA AA BB BB BB
2 Q2:
3 Q3: BB BB BB CC CC CC CC CC CC CC CC CC
4 Q4: DD DD DD DD DD DD DD DD DD
I'm able to paste the first column into another column one at a time doing this:
paste(test[,2],test[,3])
[1] "Q1: AA AA AA" "Q2: " "Q3: BB BB BB" "Q4: DD DD DD "
paste(test[,2],test[,4])
[1] "Q1: AA AA AA " "Q2: " "Q3: CC CC CC " "Q4: "
but is there a way to do multiple columns at once? Thanks

Here is a way of doing it with dplyr. Create your own pasting function first:
library(dplyr)
df <- data.frame(A = LETTERS, B = 1:26, C = 1:26)
head(df)
A B C
1 A 1 1
2 B 2 2
3 C 3 3
4 D 4 4
5 E 5 5
pasteA <- function(., x) paste0(df$A,.)
df %>%
mutate_if(.predicate = c(F, rep(T, ncol(df)-1)), .funs = list(pasteA))
A B C
1 A A1 A1
2 B B2 B2
3 C C3 C3
4 D D4 D4
5 E E5 E5
We use mutate_if to select all columns except the first one using a logical vector.

This is a base solution with a for loop. For every target column, paste the
first column to it.
df <- data.frame(a = letters[1:5], b = 1:5, c = 5:1)
for (i in 2:length(df)) {
df[[i]] <- paste(df[[1]], df[[i]], sep = ": ")
}
Where length gives the number of columns of a data.frame.
Result:
a b c
1 a a: 1 a: 5
2 b b: 2 b: 4
3 c c: 3 c: 3
4 d d: 4 d: 2
5 e e: 5 e: 1

{dplyr} is surprisingly convoluted for this case. A much easier solution is to use lapply (which works since data.frames are lists of columns):
as.data.frame(lapply(test[-1], function (x) paste(test[[1]], x)))
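Note that the lapply call above returns only the pasted columns and drops the first one. If the first column should stay in place, one variation is to assign the result back into the selected columns. This is only a sketch: the small `test` data frame below is a made-up stand-in, since the asker's real data isn't shown.

```r
# Hypothetical stand-in for the asker's data: first column holds the labels.
test <- data.frame(
  TEST0 = c("Q1:", "Q2:", "Q3:"),
  TEST1 = c("AA", "", "BB"),
  TEST2 = c("CC", "DD", ""),
  stringsAsFactors = FALSE
)

# Paste the first column onto every other column, modifying test in place.
# test[-1] selects all columns but the first; column names are preserved.
test[-1] <- lapply(test[-1], function(x) paste(test[[1]], x))
test
```

Assigning into `test[-1]` keeps the result a data frame with the original column names, so no `as.data.frame()` call is needed.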


Math function using multiple matching criteria

I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values in a particular row from multiple other rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing to have a blank value subtracted, where the blank is the one that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from the A, B, C, and D rows, using the c5 factor to define which Blank values are used (aa or bb). I would like the "Blank" values to be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the samples and the blanks in a single data structure. Then apply the subtraction:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it, but this suffices. I'd also argue that there's no reason to keep the labels "aa" vs. "bb" in the initial data structure either, which would make this simpler; but it's your choice.
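For reference, a vectorized version in base R might look like the sketch below. It assumes there is exactly one "Blank" row per c5 value; the helper names `num_cols`, `blank`, `samples`, and `idx` are made up for illustration.

```r
# Same example data as in the question.
df <- data.frame(
  c1 = c("A", "B", "C", "D", "Blank", "Blank"),
  c2 = c(1, 2, 3, 4, 2, 3),
  c3 = c(2, 3, 4, 1, 3, 4),
  c4 = c(3, 4, 5, 6, 4, 5),
  c5 = c("aa", "bb", "aa", "bb", "aa", "bb"),
  stringsAsFactors = FALSE
)

num_cols <- c("c2", "c3", "c4")      # columns to correct
blank    <- df[df$c1 == "Blank", ]   # assumed: one blank row per c5 group
samples  <- df[df$c1 != "Blank", ]

# For each sample row, find the blank row with the same c5 value.
idx <- match(samples$c5, blank$c5)

# Subtract the matching blank values for all numeric columns at once.
samples[num_cols] <- samples[num_cols] - blank[idx, num_cols]
samples
```

Because `match` lines up each sample with its blank, the same code should work for any number of c5 groups and any number of numeric columns listed in `num_cols`.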

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
j <- m1[i,"value"]
k <- 2
while ( k <= j) {
m2 <- rbind(m2,m1[i,])
k = k+1
}
}
> m2 <- subset(m2,select = - value)
> m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the rows according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use as.table with as.data.frame to create 'dM' (the count column is then named Freq rather than value):
dM <- as.data.frame(as.table(m0))
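Put together, the as.table route could be sketched as follows, using only base R. The intermediate name `res` is just for illustration.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# as.table() + as.data.frame() gives one row per cell, with counts in Freq.
dM <- as.data.frame(as.table(m0))

# Repeat each row Freq times, then keep only the two dimension columns.
res <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]

# Sort by the first dimension and reset the row names.
res <- res[order(res$Var1), ]
row.names(res) <- NULL
res
```

`seq_len(nrow(dM))` is used instead of `1:nrow(dM)` so the code also behaves sensibly if the table happens to be empty.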

Compare two tables and return list of mismatches

I'd like to be able to compare two tables and have R return a list of records and variables that don't match.
For example, with the following two tables
> df1
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d d 4
5 5e e 5
> df2
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d e 4
5 5e d 5
I would want a compare() function to return something like "id=4d, let" to let me know that the let variable in the record with id = 4d doesn't match.
I have seen the compare library in CRAN but it only returns TRUE or FALSE for the entire variable if there is a mismatch. Is there a library with a different compare function, or a way to do this manually?
df1 <- read.table(text="
id let1 num1
1a a 1
2b b 2
3c c 3
4d d 4
5e e 5", head=T, as.is=T)
df2 <- read.table(text="
id let2 num2
1a a 1
2b b 2
3c c 3
4d e 4
5e d 5", head=T, as.is=T)
df <- merge(df1, df2, by="id")
df$let <- ifelse(df$let1 == df$let2, "equal", "not equal")
df$num <- ifelse(df$num1 == df$num2, "equal", "not equal")
df
# id let1 num1 let2 num2 let num
# 1 1a a 1 a 1 equal equal
# 2 2b b 2 b 2 equal equal
# 3 3c c 3 c 3 equal equal
# 4 4d d 4 e 4 not equal equal
# 5 5e e 5 d 5 not equal equal
You mean something like which? Quick reproducible example:
> m1 <- m2 <- matrix(1:9, 3)
> diag(m1) <- 0
> which(m1 != m2, arr.ind = TRUE)
row col
[1,] 1 1
[2,] 2 2
[3,] 3 3
Something like:
df_diff <- list()
for (i in 1:ncol(df1))
{
df_diff[[i]] <- df1$id[df2[i] != df1[i]]
names(df_diff)[i] <- names(df1)[i]
}
This should produce (hopefully :)) a list of character vectors (one for each variable). Each vector contains the IDs of df1 where the records of the two df don't match.
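To get output closer to the "id=4d, let" form asked for, the which(..., arr.ind = TRUE) idea can be applied to the data frames directly. The sketch below assumes both tables contain the same rows in the same order and the same columns; `vars`, `diffs`, and `mismatches` are names made up for this example.

```r
df1 <- data.frame(id  = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "d", "e"),
                  num = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(id  = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "e", "d"),
                  num = 1:5, stringsAsFactors = FALSE)

# Compare every non-id column cell by cell; the result is a logical matrix.
vars  <- setdiff(names(df1), "id")
diffs <- which(df1[vars] != df2[vars], arr.ind = TRUE)

# Translate row/column positions back into ids and variable names.
mismatches <- data.frame(
  id       = df1$id[diffs[, "row"]],
  variable = vars[diffs[, "col"]],
  stringsAsFactors = FALSE
)
mismatches
```

If the rows might be in a different order, merging on id first (as in the merge answer above) would be the safer starting point.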

Simple function does not work for `dcast` - reshape2

Suppose I have this data:
c1 c2 c3
A A AA
A B BB
A C CC
B A DD
B B EE
B C FF
C A GG
C B HH
C C II
A A JJ
I want to reshape them with dcast with this function:
dcast(data,c1~c2,value.var="c3",function(x)x)
But I get this error:
Error in vapply(indices, fun, .default) : values must be length 0,
but FUN(X[[1]]) result is length 1
How can I use a user-defined function with dcast?
I want to get:
A B C
A AA BB CC
B DD EE FF
C GG HH II
A JJ NA NA
Here's a possible solution using the rleid function, new in data.table v1.9.5+, which creates a run index for the c1 column (you can remove indx afterwards if you want). Here stocksm stands for the data frame from the question:
library(data.table) # v 1.9.5+
dcast(setDT(stocksm)[, indx := rleid(c1)], indx + c1 ~ c2, value.var = "c3")
# indx c1 A B C
# 1: 1 A AA BB CC
# 2: 2 B DD EE FF
# 3: 3 C GG HH II
# 4: 4 A JJ NA NA
### installing the development version
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
So basically, after creating an index on c1, we spread the data more or less as before, with indx included on the left-hand side of the formula.
Or if you insist on tidyr, here's an option:
library(tidyr)
stocksm$indx <- with(rle(as.character(stocksm$c1)), rep(seq_along(lengths), lengths))
spread(stocksm, c2, c3)
# c1 indx A B C
# 1 A 1 AA BB CC
# 2 A 4 JJ <NA> <NA>
# 3 B 2 DD EE FF
# 4 C 3 GG HH II
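Another way to use dcast is to create a unique identifier for each run of c1 with cumsum. Without such an identifier, dcast cannot tell which value to fill in for duplicated combinations like A A.
# factor() makes this work whether c1 is stored as character or as a factor
data$ids <- cumsum(c(TRUE, diff(as.numeric(factor(data$c1))) != 0L))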
dcast(data, ids+c1~c2, value.var="c3")[-1]
# c1 A B C
# 1 A AA BB CC
# 2 B DD EE FF
# 3 C GG HH II
# 4 A JJ <NA> <NA>
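All of the answers above rely on the same trick: giving each consecutive run of c1 its own id so that the second A block becomes a separate row. A minimal base R sketch of two equivalent ways to compute such a run id, using a plain character vector `c1` copied from the question's data:

```r
c1 <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A")

# Run id via rle(): each run of identical values gets the next integer.
r <- rle(c1)
ids_rle <- rep(seq_along(r$lengths), r$lengths)

# Run id via cumsum(): add 1 whenever a value differs from the previous one.
ids_cumsum <- cumsum(c(TRUE, c1[-1] != c1[-length(c1)]))

ids_rle
```

Comparing neighbouring elements directly avoids any conversion to numeric, so this version does not depend on whether the column is a factor.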

R: reshape data frame when one column has unequal number of entries

I have a data frame x with 2 character columns:
x <- data.frame(a = numeric(), b = I(list()))
x[1:3,"a"] = 1:3
x[[1, "b"]] <- "a, b, c"
x[[2, "b"]] <- "d, e"
x[[3, "b"]] <- "f"
x$a = as.character(x$a)
x$b = as.character(x$b)
x
str(x)
The entries in column b are comma-separated strings of characters.
I need to produce this data frame:
1 a
1 b
1 c
2 d
2 e
3 f
I know how to do it when I loop row by row. But is it possible to do without looping?
Thank you!
Have you checked out require(splitstackshape)?
> cSplit(x, "b", ",", direction = "long")
a b
1: 1 a
2: 1 b
3: 1 c
4: 2 d
5: 2 e
6: 3 f
> s <- strsplit(as.character(x$b), ',')
> data.frame(value=rep(x$a, sapply(s, FUN=length)),b=unlist(s))
value b
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
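One detail to watch with the strsplit approach: the entries are separated by a comma plus a space, so splitting on a bare comma leaves a leading space on pieces such as " b". A sketch that splits on the comma and any following whitespace (the data frame is rebuilt here in a simpler way than in the question; `parts` and `long` are illustrative names):

```r
x <- data.frame(a = c("1", "2", "3"),
                b = c("a, b, c", "d, e", "f"),
                stringsAsFactors = FALSE)

# Split on a comma followed by optional whitespace so no leading spaces remain.
parts <- strsplit(x$b, ",\\s*")

# lengths() gives the number of pieces per row; repeat a that many times.
long <- data.frame(a = rep(x$a, lengths(parts)),
                   b = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

`lengths(parts)` does the same job as `sapply(s, FUN = length)` in the answer above.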
There you go, this should be very fast:
library(data.table)
x <- data.table(x)
x[ ,strsplit(b, ","), by = a]
