Using unique() when naming dataframe columns - r

I am having trouble naming a data frame I reshaped. Using just reshape, I get the wrong titles, so I tried to name them myself but I cannot get the right names in the right spots.
df<-data.frame(color=rep(c("red", "blue", "green"), 10), letter=c("a", "b", "c", "d", "e", "b", "c", "d", "e", "f", "c", "d", "e", "f", "g", "d", "e", "f", "g", "h", "e", "b", "c", "d", "e", "f", "c", "d", "e", "f"))
b<-as.data.frame(table(df))
c<-reshape(b, direction="wide", idvar="color", timevar="letter")
  color Freq.a Freq.b Freq.c Freq.d Freq.e Freq.f Freq.g Freq.h
1  blue      0      1      2      1      3      2      0      1
2 green      0      1      2      2      2      2      1      0
3   red      1      1      1      3      2      1      1      0
To get rid of the "Freq.", I used names(), but this gave numbers instead of the letters as column names. This happens whatever name I give the first column.
names(c)<-c("color", unique(b$letter))
color 1 2 3 4 5 6 7 8
1 blue 0 1 2 1 3 2 0 1
2 green 0 1 2 2 2 2 1 0
3 red 1 1 1 3 2 1 1 0
I tried just unique without concatenating something for the first column, and the correct numbers are column names, but obviously they are in the wrong place. How can I get the right unique values over the correct columns?
names(c)<-unique(b$letter)
a b c d e f g h NA
1 blue 0 1 2 1 3 2 0 1
2 green 0 1 2 2 2 2 1 0
3 red 1 1 1 3 2 1 1 0

Your b$letter column is a factor (so unique(b$letter) is a factor too). When it is concatenated with a character string, R implicitly coerces its underlying integer codes (not its levels) to character, which is why you get numbers.
df <- data.frame(color=rep(c("red", "blue", "green"), 10),
letter=c(letter=c("a", "b", "c", "d", "e",
"b", "c", "d", "e", "f",
"c", "d", "e", "f", "g",
"d", "e", "f", "g", "h",
"e", "b", "c", "d", "e",
"f", "c", "d", "e", "f")))
b <- as.data.frame(table(df))
c <- reshape(b, direction="wide", idvar="color", timevar="letter")
You can easily verify this by comparing the following:
> unique(b$letter)
[1] a b c d e f g h
Levels: a b c d e f g h
> class(unique(b$letter))
[1] "factor"
> as.character(unique(b$letter))
[1] "a" "b" "c" "d" "e" "f" "g" "h"
> class(as.character(unique(b$letter)))
[1] "character"
To solve this, it's as simple as using the second version:
names(c) <- c("color", as.character(unique(b$letter)))
Alternatively, you can also use sub to remove "Freq." from names(c) (which IMO is a safer and easier approach):
names(c) <- sub('^Freq\\.', '', names(c))
Result:
color a b c d e f g h
1 blue 0 1 2 1 3 2 0 1
2 green 0 1 2 2 2 2 1 0
3 red 1 1 1 3 2 1 1 0
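The coercion described in this answer can be reproduced on a small factor. The following is a minimal sketch that uses a made-up factor `f` as a stand-in for unique(b$letter); the variable names are illustrative only.

```r
# Hypothetical toy factor standing in for unique(b$letter)
f <- factor(c("a", "b", "c"))

# c() dispatches on its first argument (a plain character string here),
# so the factor contributes its integer codes, which become strings
coded <- c("color", f)

# Converting the factor first keeps the labels
labelled <- c("color", as.character(f))

# levels() also returns the labels, in level order
from_levels <- c("color", levels(f))

print(coded)
print(labelled)
print(from_levels)
```

Note that levels() returns the labels in level order, which can differ from the order of appearance that unique() gives; for the output of as.data.frame(table(df)) the two coincide.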

Is this what you mean?
> setNames(reshape(b, timevar="numbers", idvar="color", direction="wide"),
c("Name", unique(b$numbers)))
Name 1 2 3 4 5 6 7 8
1 blue 0 1 2 1 3 2 0 1
2 green 0 1 2 2 2 2 1 0
3 red 1 1 1 3 2 1 1 0

Related

Rearrange rownames of data.frame (non alphabetic)

set0 <- data.frame(A = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
B = c("E", "F", "G", "H", "I", "E", "F", "G", "H", "I"))
set0 <- table(set0)
result:
> set0
   B
A   E F G H I
  A 1 0 0 1 1
  B 1 1 0 0 1
  C 0 1 1 0 0
  D 0 0 1 1 0
I know that when I want to change the column order I can use the following:
set0 <- set0[, c(5, 4, 3, 2, 1)]
The above makes it possible to change the order of the column in any way I like.
I am looking for something that makes it possible to do the exact same but for the rownames.
Any ideas?
Inside the brackets, columns are selected to the right of the comma (that is what you did). In case you didn't know yet, rows are selected to the left of the comma. So you can use whichever of these approaches you like best:
set0[c(4, 3, 2, 1), ]
set0[4:1, ]
set0[rev(seq(nrow(set0))), ]
set0[c("D", "C", "B", "A"), ]
set0[rev(rownames(set0)), ]
# B
# A E F G H I
# D 0 0 1 1 0
# C 0 1 1 0 0
# B 1 1 0 0 1
# A 1 0 0 1 1
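Since the question asks for a non-alphabetic order, the same indexing also works with an arbitrary vector of row names. A minimal sketch, where the order in `row_order` is a made-up example:

```r
# Rebuild the table from the question
set0 <- table(data.frame(
  A = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  B = c("E", "F", "G", "H", "I", "E", "F", "G", "H", "I")
))

# Hypothetical custom row order (any permutation of the row names works)
row_order <- c("C", "A", "D", "B")

# Rows go to the left of the comma; columns are left untouched
reordered <- set0[row_order, ]
print(reordered)
```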
You can use factor to define the order when building the table (here set0 is the original data frame, before table() is applied), e.g.,
table(
transform(
set0,
A = factor(A, levels = sort(unique(A), decreasing = TRUE))
)
)
which gives
B
A E F G H I
D 0 0 1 1 0
C 0 1 1 0 0
B 1 1 0 0 1
A 1 0 0 1 1
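The factor approach is not limited to a reversed alphabetical order: the levels can be listed in any order before tabulating. A minimal sketch, assuming the original data frame is kept under the hypothetical name `set0_df` and using a made-up level order:

```r
# Original data frame from the question, under an illustrative name
set0_df <- data.frame(
  A = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  B = c("E", "F", "G", "H", "I", "E", "F", "G", "H", "I"),
  stringsAsFactors = FALSE
)

# Hypothetical custom order, fixed through the factor levels
set0_df$A <- factor(set0_df$A, levels = c("C", "A", "D", "B"))

# table() lays out the rows in level order
tab <- table(set0_df)
print(tab)
```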

create weighted adjacency matrix from data frame R

I have a data frame like the following:
dat <- data.frame(A=c("D", "A", "D", "B"), B=c("B", "B", "D", "R"), C=c("A", "D", "C", ""), D=c("D", "C", "A", "A"))
My idea is to create a matrix with this information, based on the number of occasions that each column variable refers to the other columns (and ignore when referring to other things that are not in one of the columns (e.g. "R")). So I want to fill the following matrix:
n <- ncol(dat)
names_d <- colnames(dat)
mat <- matrix(0, nrow=n, ncol=n)
rownames(mat) <- names_d
colnames(mat) <- names_d
So in the end, I would have something like this:
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
Which would be the most efficient way of doing this in R?
You can try the code below
> t(sapply(dat, function(x) table(factor(x, levels = names(dat)))))
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
or
> t(xtabs(~., subset(stack(dat), values %in% names(dat))))
values
ind A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
Another option is stack with table (wrapped in t() so that the rows correspond to the columns of dat, as in the matrix above)
t(table(subset(stack(dat), nzchar(values) & values != 'R')))
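The stack-based answers can also be written out step by step. This is a sketch under the assumption that only values matching a column name should be counted; `long` and `mat` are illustrative names.

```r
dat <- data.frame(A = c("D", "A", "D", "B"), B = c("B", "B", "D", "R"),
                  C = c("A", "D", "C", ""), D = c("D", "C", "A", "A"),
                  stringsAsFactors = FALSE)

# Long format: one row per cell, with the value and its source column (ind)
long <- stack(dat)

# Keep only references to existing columns (drops "R" and "")
long <- long[long$values %in% names(dat), ]

# Rows: referring column; columns: referenced column
mat <- unclass(table(long$ind, factor(long$values, levels = names(dat))))
print(mat)
```

Fixing the levels with factor() keeps the matrix square even if some column is never referenced.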

Ignore specific levels when performing lapply in R

I have a data frame (500 obs of 40000 variables) in R where all columns consist of one or two letters interspersed with '1' and '3'. E.g., mydata[45:50,20:25]
45 C A 3 T C C
46 C G T C C A
47 C A G T C C
48 1 A T 3 C 3
49 C A G T C C
50 T A T C C A
I wish to replace only the letters, not the numbers. My goal is for the letters to be replaced with '0' or '2' depending on their frequency: the most frequent letter would become '0' and the least frequent '2'. If there is only one letter, it would become '0'.
I can achieve this without ignoring the interspersed '1' and '3' using:
data.frame(lapply(mydata[45:50,20:25], function(x){as.numeric(factor(x, levels = names(sort(-table(x)))))}))
which yields:
1 1 1 3 1 1 1
2 1 2 1 2 1 2
3 1 1 2 1 1 1
4 2 1 1 3 1 3
5 1 1 2 1 1 1
6 3 1 1 2 1 2
However, I would like to be able to do that while ignoring '1' and '3' in the original data frame.
Any help appreciated. Thank you.
I would work with a matrix here.
Using grep we pick out the letters and tabulate their frequencies. Ranking the negated counts and subtracting one gives the most frequent letter the code 0. Since I'm not sure what you want in case of ties, I chose "first" to get an integer (see ?rank for options).
Then we match the letters against the ranked frequencies. Finally we convert back to a data frame, using type.convert to get numeric columns.
m <- as.matrix(d)
ftb <- table(grep("[\\p{Lu}]", m, perl=TRUE, value=TRUE))
ftb <- rank(-ftb, ties.method="first") - 1
m.res <- apply(m, 1:2, function(x) ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x))
d.res <- type.convert(as.data.frame(m.res), as.is=TRUE)
d.res
# V1 V2 V3 V4 V5 V6 V7
# 1 45 0 1 3 2 0 0
# 2 46 0 3 2 0 0 1
# 3 47 0 1 3 2 0 0
# 4 48 1 1 2 3 0 3
# 5 49 0 1 3 2 0 0
# 6 50 2 1 2 0 0 1
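The ranking step may be easier to follow on a single vector. A minimal sketch with made-up data; the pattern "^[A-Z]+$" is a simplification of the answer's "\\p{Lu}" and assumes plain ASCII letters.

```r
# Hypothetical single column mixing letters with the codes "1" and "3"
x <- c("C", "A", "C", "T", "C", "A", "3", "1")

# Keep only entries made of uppercase ASCII letters, then count them
letters_only <- grep("^[A-Z]+$", x, value = TRUE)
ftb <- table(letters_only)

# Rank by descending frequency; subtracting 1 makes the top letter 0
codes <- rank(-ftb, ties.method = "first") - 1

# Replace letters by their code and leave "1" and "3" untouched
recoded <- as.numeric(ifelse(x %in% names(codes), codes[x], x))
print(recoded)
```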
Edit
Since you want the frequencies per column, we can apply the same approach inside an lapply (without the matrix conversion). Multiplying the rank by a factor of 2 then maps the most frequent letter to 0 and the other letter to 2.
f <- 2
d[-1] <- lapply(d[-1], function(x) {
ftb <- (rank(-table(grep("[\\p{Lu}]", x, perl=TRUE, value=TRUE)),
ties.method="first") - 1)*f
stopifnot(length(ftb) <= 2)
x <- ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x)
as.numeric(x)
})
d
# V1 V2 V3 V4 V5 V6 V7
# 1 45 0 0 3 0 0 0
# 2 46 0 2 0 2 0 2
# 3 47 0 0 2 0 0 0
# 4 48 1 0 0 3 0 3
# 5 49 0 0 2 0 0 0
# 6 50 2 0 0 2 0 2
Data:
d <- structure(list(V1 = 45:50, V2 = c("C", "C", "C", "1", "C", "T"
), V3 = c("A", "G", "A", "A", "A", "A"), V4 = c("3", "T", "G",
"T", "G", "T"), V5 = c("T", "C", "T", "3", "T", "C"), V6 = c("C",
"C", "C", "C", "C", "C"), V7 = c("C", "A", "C", "3", "C", "A"
)), class = "data.frame", row.names = c(NA, -6L))

R: Counting values of other columns in groups in data.table

(Sorry about the poorly made title, wasn't sure how to phrase my question in a single sentence.)
I have a data.table of matches that is sorted chronologically, with P1 representing player1, P2 representing player2, and Res representing whether P1 won ("w") or drew ("d") against P2.
EDIT: a, b, and c in the P1 and P2 columns represent individual players. Think of them as Alice, Bob, and Charlie.
DT <- data.table(time = 1:10,
P1 = c("a", "a", "b", "b", "b", "a", "a", "b", "a", "c"),
Res = c("d", "w", "w", "w", "w", "w", "d", "d", "w", "w"),
P2 = c("b", "c", "c", "a", "a", "c", "c", "a", "b", "b"))
I performed the following operations to count at the time of the match, how many wins P1 had (wins1) and how many losses P2 had (loss2).
DT[, wins1 := shift(cumsum(Res == "w"), 1, fill=0L), by=P1]
DT[, loss2 := shift(cumsum(Res == "w"), 1, fill=0L), by=P2]
I am trying to create the columns for wins2, loss1, draw1, and draw2.
That is, right now I have how many wins P1 had at the time of their match against P2, but I do not know how many losses they had. Hope that makes sense. I'm sure the methods for creating these columns are all similar, so if I can make one I should be able to make them all.
The final table should look like the following:
time P1 Res P2 wins1 loss1 draw1 wins2 loss2 draw2
1: 1 a d b 0 0 0 0 0 0
2: 2 a w c 0 0 1 0 0 0
3: 3 b w c 0 0 1 0 1 0
4: 4 b w a 1 0 1 1 0 1
5: 5 b w a 2 0 1 1 1 1
6: 6 a w c 1 2 1 0 2 0
7: 7 a d c 2 2 1 0 3 0
8: 8 b d a 3 0 1 2 2 2
9: 9 a w b 2 2 2 3 0 1
10: 10 c w b 0 3 1 3 1 1

make a boolean-like matrix from multiple vectors

Assume
xx.1 <- c("a", "b", "d")
xx.2 <- c("a", "d", "e")
xx.3 <- c("b", "e", "d", "f")
How can I make a boolean matrix like this:
xx.1 xx.2 xx.3
a 1 1 NA
b 1 NA 1
d 1 1 1
e NA 1 1
f NA NA 1
Try table and stack:
table(stack(list(xx.1 = xx.1, xx.2 = xx.2, xx.3 = xx.3)))
# ind
# values xx.1 xx.2 xx.3
# a 1 1 0
# b 1 0 1
# d 1 1 1
# e 0 1 1
# f 0 0 1
More conveniently, if these vectors are the only objects in your workspace whose names start with "xx.", you can try:
table(stack(mget(ls(pattern = "^xx\\."))))
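The table above uses 0 for absent values, whereas the question shows NA. One possible way to get the NA form is to convert the zero counts afterwards; this is a sketch, with `m` as an illustrative name.

```r
xx.1 <- c("a", "b", "d")
xx.2 <- c("a", "d", "e")
xx.3 <- c("b", "e", "d", "f")

# Cross-tabulate values against the vector they came from
tab <- table(stack(list(xx.1 = xx.1, xx.2 = xx.2, xx.3 = xx.3)))

# Drop the "table" class to get a plain integer matrix, then blank the zeros
m <- unclass(tab)
m[m == 0] <- NA
print(m)
```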

Resources