I have a data frame (500 obs of 40000 variables) in R where all columns consist of one or two letters interspersed with '1' and '3'. E.g., mydata[45:50,20:25]
45 C A 3 T C C
46 C G T C C A
47 C A G T C C
48 1 A T 3 C 3
49 C A G T C C
50 T A T C C A
I wish to replace only the letters, not the numbers. My goal is for the letters to be replaced with '0' or '2' depending on their frequency within each column: the most frequent letter becomes '0' and the least frequent becomes '2'. If there is only one letter, it becomes '0'.
I can do the recoding with the following, but it does not ignore the interspersed '1' and '3':
data.frame(lapply(mydata[45:50,20:25], function(x){as.numeric(factor(x, levels = names(sort(-table(x)))))}))
which yields:
1 1 1 3 1 1 1
2 1 2 1 2 1 2
3 1 1 2 1 1 1
4 2 1 1 3 1 3
5 1 1 2 1 1 1
6 3 1 1 2 1 2
However, I would like to be able to do that while ignoring '1' and '3' in the original data frame.
Any help appreciated. Thank you.
I would work with a matrix here.
Using grep we extract the letters and tabulate their frequencies, then rank the negated counts (so the most frequent letter gets rank 1) and subtract one so that the ranks start at zero. Since I'm not sure what you want in case of ties, I chose "first" to get an integer (see ?rank for options).
Then we match the letters against the ranked frequencies. Finally, we convert back to a data frame, using type.convert to get numeric columns.
m <- as.matrix(d)
ftb <- table(grep("[\\p{Lu}]", m, perl=TRUE, value=TRUE))
ftb <- rank(-ftb, ties.method="first") - 1
m.res <- apply(m, 1:2, function(x) ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x))
d.res <- type.convert(as.data.frame(m.res))
d.res
# V1 V2 V3 V4 V5 V6 V7
# 1 45 0 1 3 2 0 0
# 2 46 0 3 2 0 0 1
# 3 47 0 1 3 2 0 0
# 4 48 1 1 2 3 0 3
# 5 49 0 1 3 2 0 0
# 6 50 2 1 2 0 0 1
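To see what the ranking step does in isolation, here is a small sketch on a single made-up vector (the vector `x` and the name `counts` are illustrative, not part of the answer above). It reuses the same grep pattern and rank call.

```r
# Toy input: one column containing letters plus the codes "1" and "3"
x <- c("C", "C", "C", "1", "C", "T")

# Keep only the upper-case letters, then count them
counts <- table(grep("[\\p{Lu}]", x, perl = TRUE, value = TRUE))

# Rank by descending count; subtracting 1 makes the most frequent letter 0
ftb <- rank(-counts, ties.method = "first") - 1

# Look up each element; non-letters are left unchanged
res <- ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x)
print(res)
```

Note that the result is a character vector at this point, because ifelse mixes the numeric ranks with the untouched text codes; that is why a conversion step such as type.convert is still needed afterwards.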
Edit
Since you want per-column frequencies, we can apply the same approach in an lapply (without the matrix conversion). We then multiply the rank by a factor of 2, so that the two letters become 0 and 2.
f <- 2
d[-1] <- lapply(d[-1], function(x) {
  ftb <- (rank(-table(grep("[\\p{Lu}]", x, perl=TRUE, value=TRUE)),
               ties.method="first") - 1)*f
  stopifnot(length(ftb) <= 2)
  x <- ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x)
  as.numeric(x)
})
d
# V1 V2 V3 V4 V5 V6 V7
# 1 45 0 0 3 0 0 0
# 2 46 0 2 0 2 0 2
# 3 47 0 0 2 0 0 0
# 4 48 1 0 0 3 0 3
# 5 49 0 0 2 0 0 0
# 6 50 2 0 0 2 0 2
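As a variation on the column-wise idea, the recoding can also be written as a small stand-alone helper that keeps the numeric codes numeric from the start. This is only a sketch: `recode_column` is a hypothetical name, and it assumes that every non-letter entry is a number stored as text (such as "1" or "3") and that a column holds at most two distinct letters.

```r
recode_column <- function(x) {
  x <- as.character(x)
  is_letter <- grepl("^[A-Za-z]$", x)
  # Letters ordered from most to least frequent
  lev <- names(sort(table(x[is_letter]), decreasing = TRUE))
  # Start from the numeric codes; letters are NA at this point
  out <- suppressWarnings(as.numeric(x))
  # Most frequent letter -> 0, the other letter -> 2
  out[is_letter] <- (match(x[is_letter], lev) - 1) * 2
  out
}

recode_column(c("C", "C", "C", "1", "C", "T"))
recode_column(c("3", "T", "G", "T", "G", "T"))
```

Under those assumptions it could be applied to a whole data frame with something like `mydata[] <- lapply(mydata, recode_column)`.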
Data:
d <- structure(list(V1 = 45:50, V2 = c("C", "C", "C", "1", "C", "T"
), V3 = c("A", "G", "A", "A", "A", "A"), V4 = c("3", "T", "G",
"T", "G", "T"), V5 = c("T", "C", "T", "3", "T", "C"), V6 = c("C",
"C", "C", "C", "C", "C"), V7 = c("C", "A", "C", "3", "C", "A"
)), class = "data.frame", row.names = c(NA, -6L))
Related
I have a data frame of the following form:
dat <- data.frame(A=c("D", "A", "D", "B"), B=c("B", "B", "D", "R"), C=c("A", "D", "C", ""), D=c("D", "C", "A", "A"))
My idea is to create a matrix with this information, based on the number of occasions that each column variable refers to the other columns (and ignore when referring to other things that are not in one of the columns (e.g. "R")). So I want to fill the following matrix:
n <- ncol(dat)
names_d <- colnames(dat)
mat <- matrix(0, nrow=n, ncol=n)
rownames(mat) <- names_d
colnames(mat) <- names_d
So in the end, I would have something like this:
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
Which would be the most efficient way of doing this in R?
You can try the code below
> t(sapply(dat, function(x) table(factor(x, levels = names(dat)))))
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
or
> t(xtabs(~., subset(stack(dat), values != "")))
values
ind A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
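Another option is stack with table. Note that this hard-codes the excluded value 'R', and that t() is needed to put the column names in the rows, as in the expected output:
t(table(subset(stack(dat), nzchar(values) & values != 'R')))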
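The sapply version works without naming the unwanted values explicitly because factor() with fixed levels turns anything outside those levels into NA, and table() ignores NA by default. A minimal sketch of that mechanism on a single column of the question's data:

```r
dat <- data.frame(A = c("D", "A", "D", "B"),
                  B = c("B", "B", "D", "R"),
                  C = c("A", "D", "C", ""),
                  D = c("D", "C", "A", "A"))

# Restrict column B to the column names; "R" is not a level, so it becomes NA
x <- factor(dat$B, levels = names(dat))
print(x)

# table() drops the NA and still reports a zero count for unused levels
print(table(x))
```

The counts for column B correspond to row B of the expected matrix.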
My variable is as follows
variable
D
D
B
C
B
D
C
C
D
I want to turn that column into the table shown below:
variable B C D
D        0 0 1
D        0 0 1
B        1 0 0
C        0 1 0
B        1 0 0
D        0 0 1
C        0 1 0
C        0 1 0
D        0 0 1
But I don't want code like the one below, because the variable column has too many distinct values:
data = data %>% mutate(B=ifelse(variable=="B", 1,0),
C=ifelse(variable=="C", 1,0),
D=ifelse(variable=="D", 1,0))
Here is a base R approach. We can first find all unique variable values from the data frame. Then, we sapply over that vector and generate a new 0/1 column for each value. Finally, we cbind the resulting matrix of 0/1 columns to the original data frame.
cols <- sort(unique(df$variable))
df2 <- sapply(cols, function(x) ifelse(df$variable == x, 1, 0))
df <- cbind(df, df2)
df
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
Data:
df <- data.frame(variable=c("D", "D", "B", "C", "B",
"D", "C", "C", "D"),
stringsAsFactors=FALSE)
Try reshaping: duplicate the original variable so that the copy can supply the new column names, then pivot to wide format to obtain the expected output:
library(dplyr)
library(tidyr)
#Code
new <- df %>% mutate(Var=variable,Val=1,id=row_number()) %>%
pivot_wider(names_from = Var,values_from=Val,values_fill = 0) %>%
select(-id)
Output:
# A tibble: 9 x 4
variable D B C
<chr> <dbl> <dbl> <dbl>
1 D 1 0 0
2 D 1 0 0
3 B 0 1 0
4 C 0 0 1
5 B 0 1 0
6 D 1 0 0
7 C 0 0 1
8 C 0 0 1
9 D 1 0 0
Some data used:
#Data
df <- structure(list(variable = c("D", "D", "B", "C", "B", "D", "C",
"C", "D")), class = "data.frame", row.names = c(NA, -9L))
1) model.matrix
model.matrix will generate column names like variableB so the last line removes the variable part to ensure that the column names are exactly the same as in the question. Omit the last line if it is not important that the column names be exactly as shown there.
dat2 <- cbind(dat, model.matrix(~ variable - 1, dat))
names(dat2) <- sub("variable(.)", "\\1", names(dat2))
giving:
> dat2
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
2) outer
This can also be done using outer as shown. Each component of variable is compared to each level. We name the levels so that outer uses them as column names. The output is the same.
levs <- sort(unique(dat$variable))
names(levs) <- levs
cbind(dat, +outer(dat$variable, levs, `==`))
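To make the two ingredients of the outer approach visible, here is a small sketch on a shortened, made-up input (the names `is_level` and `dummies` are illustrative). outer() with `==` yields a logical matrix whose column names come from the names of the second argument, and the unary plus turns TRUE/FALSE into 1/0.

```r
variable <- c("D", "B", "C")
levs <- c(B = "B", C = "C", D = "D")

# One row per element of `variable`, one column per level
is_level <- outer(variable, levs, `==`)
print(is_level)

# Unary plus coerces the logical matrix to integer, keeping its dimnames
dummies <- +is_level
print(dummies)
```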
Note
The input in reproducible form:
Lines <- "
variable
D
D
B
C
B
D
C
C
D"
dat <- read.table(text = Lines, header = TRUE)
(Sorry about the poorly made title, wasn't sure how to phrase my question in a single sentence.)
I have a data.table of matches that is sorted chronologically, with P1 representing player1, P2 representing player2, and Res representing whether P1 won ("w") or drew ("d") against P2.
EDIT: a, b, and c in the P1 and P2 columns represent individual players. Think of them as Alice, Bob, and Charlie.
DT <- data.table(time = 1:10,
P1 = c("a", "a", "b", "b", "b", "a", "a", "b", "a", "c"),
Res = c("d", "w", "w", "w", "w", "w", "d", "d", "w", "w"),
P2 = c("b", "c", "c", "a", "a", "c", "c", "a", "b", "b"))
I performed the following operations to count at the time of the match, how many wins P1 had (wins1) and how many losses P2 had (loss2).
DT[, wins1 := shift(cumsum(Res == "w"), 1, fill=0L), by=P1]
DT[, loss2 := shift(cumsum(Res == "w"), 1, fill=0L), by=P2]
I am trying to create the columns for wins2, loss1, draw1, and draw2.
That is, I currently have how many wins P1 had at the time of their match against P2, but I do not know how many losses they had. Hope that makes sense. I'm sure the methods for creating these columns are all similar, so if I can make one I should be able to make them all.
The final table should look like the following:
time P1 Res P2 wins1 loss1 draw1 wins2 loss2 draw2
1: 1 a d b 0 0 0 0 0 0
2: 2 a w c 0 0 1 0 0 0
3: 3 b w c 0 0 1 0 1 0
4: 4 b w a 1 0 1 1 0 1
5: 5 b w a 2 0 1 1 1 1
6: 6 a w c 1 2 1 0 2 0
7: 7 a d c 2 2 1 0 3 0
8: 8 b d a 3 0 1 2 2 2
9: 9 a w b 2 2 2 3 0 1
10: 10 c w b 0 3 1 3 1 1
I have a data frame like this, called df:
a b c d e f
b c f a a a
d f a b c c
f e d f f d
The first row is actually the column names. An example to explain the meaning: df[1,1] is b, which means there is a relation from a to b. In general, each value in a column means there is a relation from the column name to that entry.
I want to create a 6 x 6 matrix (df1) whose row and column names are both the column names of df. The (i,j) entry is 1 if there is a relation from 'i' to 'j', and 0 otherwise.
The output I want is:
a b c d e f
a 0 1 0 1 0 1
b 0 0 1 0 1 1
c 1 0 0 1 0 1
d 1 1 0 0 0 1
e 1 0 1 0 0 1
f 1 0 1 1 0 0
How to do this with a loop in R?
How to do this without a loop, and only use basic R?
How to do this using some fancy packages in R?
Using the reshape2 package, this is one way to go. My sample data has all columns as character. You use melt() to reshape your data in a long format. Then, you use dcast() from the same package.
library(magrittr)
library(reshape2)
melt(mydf, measure.vars = names(mydf)) %>%
dcast(variable ~ value, length)
variable a b c d e f
1 a 0 1 0 1 0 1
2 b 0 0 1 0 1 1
3 c 1 0 0 1 0 1
4 d 1 1 0 0 0 1
5 e 1 0 1 0 0 1
6 f 1 0 1 1 0 0
EDIT
As mentioned below by akrun, you can do all the work in one step using recast() from the reshape2 package.
recast(mydf, measure.var= names(mydf),variable~value, length)
DATA
mydf <- structure(list(a = c("b", "d", "f"), b = c("c", "f", "e"), c = c("f",
"a", "d"), d = c("a", "b", "f"), e = c("a", "c", "f"), f = c("a",
"c", "d")), .Names = c("a", "b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-3L))
Just use table:
table(colnames(mydf)[col(mydf)], unlist(mydf) )
# a b c d e f
# a 0 1 0 1 0 1
# b 0 0 1 0 1 1
# c 1 0 0 1 0 1
# d 1 1 0 0 0 1
# e 1 0 1 0 0 1
# f 1 0 1 1 0 0
If the same relation can occur more than once in a column, cap the counts at 1:
pmin(table(colnames(mydf)[col(mydf)], unlist(mydf) ), 1)
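The table call pairs a "from" label with a "to" label for every cell. A minimal sketch on a cut-down, made-up data frame shows the two vectors it is built from (`from` and `to` are illustrative names):

```r
mydf <- data.frame(a = c("b", "d"), b = c("c", "a"),
                   stringsAsFactors = FALSE)

# col() gives the column index of each cell; indexing colnames() with it
# yields the source label for every cell, in column-major order
from <- colnames(mydf)[col(mydf)]

# unlist() flattens the data frame in the same column-major order
to <- unlist(mydf, use.names = FALSE)

print(data.frame(from, to))
print(table(from, to))
```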
You can do this with reshaping.
library(dplyr)
library(tidyr)
data %>%
gather(from, to) %>%
distinct %>%
mutate(value = 1) %>%
spread(to, value, fill = 0)
The other solution using dplyr is really neat and smart. I recommend using that solution.
Here is an alternative solution to your problem using most basic functions in R.
Say your data frame has n columns and m rows i.e. n <- ncol(df) and m <- nrow(df).
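n <- ncol(df)
m <- nrow(df)
output_matrix <- matrix(0, nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:m) {
    # UTF-8 to integer conversion: utf8ToInt("a") is 97, so "a" maps to 1, "b" to 2, ...
    # (this assumes the columns are named a, b, c, ... in order)
    colWithRelation <- utf8ToInt(as.character(df[j, i])) - 96
    # Row i is the source (the column name); the column is the entry it points to
    output_matrix[i, colWithRelation] <- 1
  }
}
rownames(output_matrix) <- letters[seq_len(n)]
colnames(output_matrix) <- letters[seq_len(n)]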
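The utf8ToInt trick only works when the columns are literally named a, b, c and so on. A slightly more general loop, sketched here under the assumption that every relevant entry is one of the column names, looks the targets up with match() instead. `relation_matrix` is a hypothetical helper name, and the data is the six-column example from the question.

```r
mydf <- data.frame(a = c("b", "d", "f"), b = c("c", "f", "e"),
                   c = c("f", "a", "d"), d = c("a", "b", "f"),
                   e = c("a", "c", "f"), f = c("a", "c", "d"),
                   stringsAsFactors = FALSE)

relation_matrix <- function(df) {
  nodes <- colnames(df)
  out <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
                dimnames = list(nodes, nodes))
  for (i in seq_along(nodes)) {
    # Positions of this column's entries among the column names
    targets <- match(as.character(df[[i]]), nodes)
    # Row i is the source; entries that are not column names are skipped
    out[i, targets[!is.na(targets)]] <- 1L
  }
  out
}

res <- relation_matrix(mydf)
print(res)
```

Rows are the sources and columns the targets, which is the layout the question asks for.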
Assume
xx.1 <- c("a", "b", "d")
xx.2 <- c("a", "d", "e")
xx.3 <- c("b", "e", "d", "f")
How to make a boolean matrix as this:
xx.1 xx.2 xx.3
a 1 1 NA
b 1 NA 1
d 1 1 1
e NA 1 1
f NA NA 1
Try table and stack:
table(stack(list(xx.1 = xx.1, xx.2 = xx.2, xx.3 = xx.3)))
# ind
# values xx.1 xx.2 xx.3
# a 1 1 0
# b 1 0 1
# d 1 1 1
# e 0 1 1
# f 0 0 1
More conveniently, if the xx vectors are the only objects in your workspace whose names contain "xx", you can try:
table(stack(mget(ls(pattern = "xx"))))
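The question asks for NA rather than 0 in the empty cells. One way to get there from the table, sketched here with the illustrative names `tab` and `m`, is to drop the table class and overwrite the zeros:

```r
xx.1 <- c("a", "b", "d")
xx.2 <- c("a", "d", "e")
xx.3 <- c("b", "e", "d", "f")

tab <- table(stack(list(xx.1 = xx.1, xx.2 = xx.2, xx.3 = xx.3)))

# Drop the "table" class to get a plain integer matrix, then blank out the zeros
m <- unclass(tab)
m[m == 0] <- NA
print(m)
```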