Suppose I have a data.frame with several columns of categorical data, and one column of quantitative data. Here's an example:
my_data <- structure(list(A = c("f", "f", "f", "f", "t", "t", "t", "t"),
B = c("t", "t", "t", "t", "f", "f", "f", "f"),
C = c("f","f", "t", "t", "f", "f", "t", "t"),
D = c("f", "t", "f", "t", "f", "t", "f", "t")),
.Names = c("A", "B", "C", "D"),
row.names = 1:8, class = "data.frame")
my_data$quantity <- 1:8
Now my_data looks like this:
A B C D quantity
1 f t f f 1
2 f t f t 2
3 f t t f 3
4 f t t t 4
5 t f f f 5
6 t f f t 6
7 t f t f 7
8 t f t t 8
What's the most elegant way to get a cross tab / sum of quantity where both values =='t'? That is, I'm looking for an output like this:
A B C D
A "?" "?" "?" "?"
B "?" "?" "?" "?"
C "?" "?" "?" "?"
D "?" "?" "?" "?"
..where the intersection of x/y is the sum of quantity where x=='t' and y=='t'. (I only care about half this table, really, since half is duplicated)
So for example the value of A/C should be:
good_rows <- with(my_data, A=='t' & C=='t')
sum(my_data$quantity[good_rows])
15
Edit: What I already had was:
nodes <- names(my_data)[-ncol(my_data)]
sapply(nodes, function(rw) {
sapply(nodes, function(cl) {
good_rows <- which(my_data[, rw]=='t' & my_data[, cl]=='t')
sum(my_data[good_rows, 'quantity'])
})
})
Which gives the desired result:
A B C D
A 26 0 15 14
B 0 10 7 6
C 15 7 22 12
D 14 6 12 20
I like this solution because, being very 'literal', it's fairly readable: two apply funcs (aka loops) to go through rows * columns, compute each cell, and produce the matrix. Also plenty fast enough on my actual data (tiny: 192 rows x 10 columns). I didn't like it because it seems like a lot of lines. Thank you for the answers so far! I will review and absorb.
Try using matrix multiplication
temp <- (my_data[1:4]=="t")*my_data$quantity
t(temp) %*% (my_data[1:4]=="t")
# A B C D
#A 26 0 15 14
#B 0 10 7 6
#C 15 7 22 12
#D 14 6 12 20
(This is not a fluke: entry [x, y] of the product is the sum of quantity over the rows where both x and y are 't'.)
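To see why the product works: entry [i, j] of t(X * q) %*% X is the sum over rows k of q[k] * X[k, i] * X[k, j], and that term is nonzero only where both columns are 't'. Here is a minimal sketch of the same computation using crossprod(). The names X and res are only for illustration, and the data frame is rebuilt so the block runs on its own.

```r
# Rebuild the example data from the question
my_data <- data.frame(
  A = c("f", "f", "f", "f", "t", "t", "t", "t"),
  B = c("t", "t", "t", "t", "f", "f", "f", "f"),
  C = c("f", "f", "t", "t", "f", "f", "t", "t"),
  D = c("f", "t", "f", "t", "f", "t", "f", "t"),
  quantity = 1:8,
  stringsAsFactors = FALSE
)

# Logical indicator matrix: TRUE where the cell equals "t"
X <- my_data[1:4] == "t"

# Scale each row by its quantity, then take t(scaled) %*% X
res <- crossprod(X * my_data$quantity, X)

res["A", "C"]  # sum of quantity where A == "t" and C == "t"
```

crossprod(a, b) computes t(a) %*% b without forming the transpose explicitly, so the result should match the two-line version above.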
For each column name, you could build a data frame dat that contains just the rows where that column equals t. Then you could multiply the true/false values in this subset by each row's quantity value (so it's 0 when false and the quantity value when true), finally taking the column sums.
sapply(c("A", "B", "C", "D"), function(x) {
dat <- my_data[my_data[,x] == "t",]
colSums((dat[,-5] == "t") * dat[,5])
})
# A B C D
# A 26 0 15 14
# B 0 10 7 6
# C 15 7 22 12
# D 14 6 12 20
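Since the question only needs half of the table, another option is to compute one sum per unordered pair of columns. This is a sketch using combn(); the names nodes and pair_sums are illustrative, and the data frame is rebuilt so the block is self-contained.

```r
my_data <- data.frame(
  A = c("f", "f", "f", "f", "t", "t", "t", "t"),
  B = c("t", "t", "t", "t", "f", "f", "f", "f"),
  C = c("f", "f", "t", "t", "f", "f", "t", "t"),
  D = c("f", "t", "f", "t", "f", "t", "f", "t"),
  quantity = 1:8,
  stringsAsFactors = FALSE
)

nodes <- c("A", "B", "C", "D")

# One sum per unordered pair of columns
pair_sums <- combn(nodes, 2, function(p) {
  both <- my_data[[p[1]]] == "t" & my_data[[p[2]]] == "t"
  sum(my_data$quantity[both])
})

# Label each sum with its pair, e.g. "A/C"
names(pair_sums) <- combn(nodes, 2, paste, collapse = "/")
pair_sums
```

This returns a named vector of six values rather than a matrix, which may or may not be more convenient depending on what happens to the result next.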
I have gathered that this question is somewhat commonly asked, but I've hit a few snags that I can't seem to find an answer to.
I have a long string:
line1 = "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC ..."
I want it to look like:
line1 = c("G", "G", "C", ...)
(As an aside, is it possible to have letters like above as integers - when I tried with the function as.integer, it converted it all to NAs?)
I have tried:
strsplit(line1, "")
Which produces a list of: 'G''G''C'...
To solve this, I've tried:
paste(line1, collapse = ", ")
Which sort of works: c(\"G\", \"G\", \"C" ...)
When I tried to remove the ' \ ' with gsub, it didn't let me do it, as it suddenly treated everything in the script as quoted.
Further, once this is done, I'd like to shape this into either a row or a column of a dataframe like so:
[1] [2] [3] ...
[1] G G C
Or:
[1]
[1] G
[2] G
[3] C
After splitting, unlist the result, convert it to a factor and then to numeric:
fac <- factor(unlist(strsplit(line1, "")))
as.numeric(fac)
## [1] 5 5 4 6 6 3 6 6 6 3 3 4 5 5 5 4 3 5 3 6 3 6 3 4 5 4 6 5 5 5 4 3 3 3 6 4 1 2 2 2
# this gives the correspondence between numbers and characters
# i.e. space is 1, dot is 2, A is 3, C is 4, G is 5 and T is 6
levels(fac)
## [1] " " "." "A" "C" "G" "T"
The levels can also be specified explicitly using the levels= argument in which case other characters will be NA and optionally could be eliminated using na.omit(...) .
fac <- factor(unlist(strsplit(line1, "")), levels = c("A", "C", "G", "T"))
as.numeric(fac)
## [1] 3 3 2 4 4 1 4 4 4 1 1 2 3 3 3 2 1 3 1 4 1 4 1 2 3 2 4 3 3 3 2 1 1 1 4 2 NA NA NA NA
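If only the integer codes are needed, match() against an explicit alphabet gives the same mapping without going through a factor. This is a sketch that assumes the input contains only the letters A, C, G and T, so it uses the string without the trailing " ...". The names bases and codes are illustrative.

```r
line1 <- "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC"

# Split into single characters
bases <- strsplit(line1, "")[[1]]

# Position of each character in the alphabet: A = 1, C = 2, G = 3, T = 4
codes <- match(bases, c("A", "C", "G", "T"))

head(codes)
```

Characters outside the alphabet would come back as NA, just as with the levels= approach.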
Note
The input in the question is the following. Possibly the last 4 characters were not intended to be part of the data but if that were so then it ought to have been written that way so that others don't have to edit it. In any case the code above should work.
line1 = "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC ..."
To convert that list to a character vector you can just go:
x <- strsplit(line1, "")
x <- x[[1]]
To make it a column of a df you can either go:
x <- as.data.frame(x)
Or just do it directly from the first line:
x <- as.data.frame(strsplit(line1, ""))
That'll give it an ugly column header which you can fix with
names(x)[1] <- 'whatever'
Or again directly in the one call:
x <- as.data.frame(strsplit(line1, ""), col.names = 'whatever')
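The question also asks for a single-row layout. One way to sketch that is to transpose the character vector before converting it. The shortened input and the names col_df and row_df are assumptions made for illustration.

```r
line1 <- "GGCTTA"  # shortened, hypothetical input

x <- strsplit(line1, "")[[1]]

# One column, one character per row
col_df <- data.frame(base = x, stringsAsFactors = FALSE)

# One row, one character per column (columns are named V1, V2, ...)
row_df <- as.data.frame(t(x), stringsAsFactors = FALSE)

dim(col_df)
dim(row_df)
```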
The question seems to ask for the output of dput but this is seldom needed.
x <- strsplit(line1, "")[[1]]
dput(x)
#c("G", "G", "C", "T", "T", "A", "T", "T", "T", "A", "A", "C",
#"G", "G", "G", "C", "A", "G", "A", "T", "A", "T", "A", "C", "G",
#"C", "T", "G", "G", "G", "C", "A", "A", "A", "T", "C")
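As for the question on how to get integers from the string, here is a way. The output is the ASCII codes of the letters in the original line1 string, printed in hexadecimal because charToRaw returns a raw vector; wrap the call in as.integer() to get ordinary integers.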
charToRaw(line1)
# [1] 47 47 43 54 54 41 54 54 54 41 41 43 47 47 47 43 41 47 41 54 41 54 41
#[24] 43 47 43 54 47 47 47 43 41 41 41 54 43
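For decimal integer codes, base R also provides utf8ToInt(). A short sketch, with codes as an illustrative name:

```r
line1 <- "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC"

# Decimal code points, one per character
codes <- utf8ToInt(line1)
head(codes)  # G = 71, C = 67, T = 84, A = 65

# For pure ASCII input this agrees with the raw bytes converted to integer
identical(codes, as.integer(charToRaw(line1)))
```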
Data
line1 <- "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC"
The issue is related to: InvalidArgumentError (see above for traceback): indices[1] = 10 is not in [0, 10)
I need this for R, so I am looking for a different solution from the one given in the link above.
maxlen <- 40
chars <- c("'", "-", " ", "!", "\"", "(", ")", ",", ".", ":", ";", "?", "[", "]", "_", "=", "0", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")
tokenizer <- text_tokenizer(char_level = T, filters = NULL)
tokenizer %>% fit_text_tokenizer(chars)
unlist(tokenizer$word_index)
Output is:
' - ! " ( ) , . : ; ? [ ] _ = 0 a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
How can I change the indexing so it starts from 0 not from 1 in text_tokenizer?
The error I get after running fit() is as follows:
InvalidArgumentError: indices[127,7] = 43 is not in [0, 43)
[[Node: embedding_3/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:#training_1/RMSprop/Assign_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_3/embeddings/read, embedding_3/Cast, training_1/RMSprop/gradients/embedding_3/embedding_lookup_grad/concat/axis)]]
But I believe that changing the indexing will solve my problem.
Index 0 is often reserved for padding so it is not a wise idea to start your actual character indices from 0 as well. Instead you should venture to the Embedding layer and add 1 to the input size as suggested by the documentation:
input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
In your case this would be 43 + 1 = 44.
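The arithmetic can be sketched in base R. Here word_index is a stand-in for unlist(tokenizer$word_index), on the assumption that the indices run from 1 to 43 as in the output above. The keras call appears only as a comment and is not run, and the output_dim value in it is arbitrary.

```r
n_chars <- 43L

# Stand-in for unlist(tokenizer$word_index): indices 1..43 (assumption)
word_index <- seq_len(n_chars)

# An embedding with input_dim = 43 accepts indices in [0, 43), i.e. 0..42
max(word_index) %in% 0:(n_chars - 1L)   # FALSE: index 43 is out of range

# Size the embedding as maximum index + 1 instead
input_dim <- max(word_index) + 1L
input_dim

# With keras for R this would be passed along the lines of:
# layer_embedding(input_dim = input_dim, output_dim = 64)
```

With input_dim set this way, index 0 stays free for padding and all 43 character indices fall inside the valid range.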
You need to initialize your Embedding layer with vocabulary size. For example:
model.add(Embedding(875, 64))
In this case, 875 is the length of my vocabulary.
Consider the below code to count the occurrence of letter 'a' in each of the words:
data <- data.frame(number=1:4, string=c("this.is.a.great.word", "Education", "Earth.Is.Round", "Pinky"), stringsAsFactors = FALSE)
library(stringr)
data$Count_of_a <- str_count(data$string, "a")
data
Which results in something like this:
number string Count_of_a
1 1 this.is.a.great.word 2
2 2 Education 1
3 3 Earth.Is.Round 1
4 4 Pinky 0
I was trying to do a couple more things:
compute the total number of vowels in each word
compute the total number of letters in each word
flag whether a word starts with a vowel (1 if so, else 0)
flag whether a word ends with a vowel (1 if so, else 0)
The problem is that nchar(data$string) also counts the dots '.'.
I also could not find much help on the four requirements above.
The final data should look like this:
number string starts_with_vowel ends_with_vowel TotalLtrs
1 this.is.a.great.word 0 0 16
2 Education 1 0 9
3 Earth.Is.Round 1 0 12
4 Pinky 0 1 5
You want a combination of regular expressions
library(tidyverse)
data %>%
mutate(
nvowels = str_count(tolower(string), "[aeoiu]"),
total_letters = str_count(tolower(string), "\\w"),
starts_with_vowel = grepl("^[aeiou]", tolower(string)),
ends_with_vowel = grepl("[aeiou]$", tolower(string))
)
# number string nvowels total_letters starts_with_vowel ends_with_vowel
# 1 1 this.is.a.great.word 6 16 FALSE FALSE
# 2 2 Education 5 9 TRUE FALSE
# 3 3 Earth.Is.Round 5 12 TRUE FALSE
# 4 4 Pinky 1 5 FALSE FALSE
If you consider y a vowel, add it like so
nvowels = str_count(tolower(string), "[aeoiuy]")
starts_with_vowel = grepl("^[aeiouy]", tolower(string))
ends_with_vowel = grepl("[aeiouy]$", tolower(string))
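The desired output in the question uses 0/1 flags and treats y as a vowel. Here is a base R sketch that produces that layout, with no packages needed. The variable s is illustrative, and the letter count assumes that only the characters a to z count as letters.

```r
data <- data.frame(
  number = 1:4,
  string = c("this.is.a.great.word", "Education", "Earth.Is.Round", "Pinky"),
  stringsAsFactors = FALSE
)

s <- tolower(data$string)

# as.integer() turns TRUE/FALSE into 1/0
data$starts_with_vowel <- as.integer(grepl("^[aeiouy]", s))
data$ends_with_vowel   <- as.integer(grepl("[aeiouy]$", s))

# Drop everything that is not a letter before counting, so dots are ignored
data$TotalLtrs <- nchar(gsub("[^a-z]", "", s))

data
```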
library(stringr)
str_count(data$string, "a|e|i|o|u|A|E|I|O|U")
[1] 6 5 5 1
str_count(data$string, paste0(c(letters,LETTERS), collapse = "|"))
[1] 16 9 12 5
ifelse(substr(data$string, 1, 1) %in% c("a", "e", "i", "o", "u", "A", "E", "I", "O", "U"), 1, 0)
[1] 0 1 1 0
ifelse(substr(data$string, nchar(data$string), nchar(data$string)) %in% c("a", "e", "i", "o", "u", "A", "E", "I", "O", "U"), 1, 0)
[1] 0 0 0 0
I would like to take nominal results from a round-robin tournament and convert them to a list of binary adjacency matrices.
By convention, results from these tournaments are written by recording the name of the winner. Here is code for an example table where four individuals (A,B,C,D) compete against each other:
set <- c(rep(1, 6), rep(2,6))
trial <- (1:12)
home <- c("B", "A", "C", "D", "B", "C", "D", "C", "B", "A", "A", "D")
visitor <- c("D", "C", "B", "A", "A", "D", "B", "A", "C", "D", "B", "C" )
winners.rr1 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
winners.rr2 <- c("D", "A", "C", "A", "A", "D", "D", "A", "C", "A", "A", "D")
winners.rr3 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
roundrobin <- data.frame(set=set, trial=trial, home=home, visitor=visitor,
winners.rr1=winners.rr1, winners.rr2=winners.rr2,
winners.rr3=winners.rr3)
Here's the table:
> roundrobin
set trial home visitor winners.rr1 winners.rr2 winners.rr3
1 1 1 B D D D D
2 1 2 A C A A A
3 1 3 C B B C B
4 1 4 D A A A A
5 1 5 B A A A A
6 1 6 C D D D D
7 2 7 D B D D D
8 2 8 C A A A A
9 2 9 B C B C B
10 2 10 A D D A D
11 2 11 A B A A A
12 2 12 D C D D D
This table shows the winners from three round robin tournaments. Within each tournament, there are two sets: each player competes against all others once at home, and once as a visitor. This makes for a total of 12 trials in each round robin tournament.
So, in the first trial in the first set, player D defeated player B. In the second trial of the first set, player A defeated player C, and so on.
I would like to turn these results into a list of six adjacency matrices. Each matrix is to be derived from each set within each round robin tournament. Wins are tallied on rows as "1", and losses are tallied as "0" on rows. ("Home" and "visitor" designations are irrelevant for what follows).
Here is what the adjacency matrix from Set 1 of the first round robin would look like:
> Adj.mat.set1.rr1
X A B C D
1 A NA 1 1 1
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 0 1 1 NA
And here is what Set 2 of the first round robin would look like:
> Adj.mat.set2.rr1
X A B C D
1 A NA 1 1 0
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 1 1 1 NA
The latter matrix shows, for example, that player A won 2 trials, player B won 1 trial, player C won 0 trials, and player D won 3 trials.
The trick of this manipulation is therefore to convert each win (recorded as a name) into a score of "1" in the appropriate row on the adjacency matrix, while losses are recorded as "0".
Any help is much appreciated.
Here's one way to go about it, although I imagine there must be a simpler approach - perhaps involving plyr. The following splits the data frame into subsets corresponding to set, then, for each round, sets up a table of zeroes (with NA diagonal) to hold results, and finally sets "winning cells" to 1 by subsetting the table with a matrix. Output class is set to matrix to ensure matrices are presented as such.
results <- lapply(split(roundrobin, roundrobin$set), function(set) {
lapply(grep('^winners', names(set)), function(i) {
tab <- table(set$home, set$visitor)
tab[] <- 0
diag(tab) <- NA
msub <- t(apply(set, 1, function(x) {
c(x[i], setdiff(c(x['home'], x['visitor']), x[i]))
}))
tab[msub] <- 1
class(tab) <- 'matrix'
tab
})
})
Results for set 1:
> results[[1]]
[[1]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
[[2]]
A B C D
A NA 1 1 1
B 0 NA 0 0
C 0 1 NA 0
D 0 1 1 NA
[[3]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
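A possibly simpler variant works out the loser of each trial directly and fills the matrix by indexing it with a two-column character matrix of (winner, loser) names. This is a sketch for a single set of a single tournament. The helper adjacency() and the variable names are hypothetical, and the inputs are assumed to be character vectors rather than factors.

```r
players <- c("A", "B", "C", "D")

# Build one adjacency matrix from the home, visitor and winner of each trial
adjacency <- function(home, visitor, winner) {
  # The loser is whichever of the two players did not win
  loser <- ifelse(winner == home, visitor, home)
  m <- matrix(0, length(players), length(players),
              dimnames = list(players, players))
  diag(m) <- NA
  # A two-column character matrix indexes by row name and column name
  m[cbind(winner, loser)] <- 1
  m
}

# Set 1 of the first round robin, copied from the table in the question
home    <- c("B", "A", "C", "D", "B", "C")
visitor <- c("D", "C", "B", "A", "A", "D")
winner  <- c("D", "A", "B", "A", "A", "D")

adj <- adjacency(home, visitor, winner)
adj
```

To get all six matrices, the same helper could be applied to each winners column within each subset produced by split(roundrobin, roundrobin$set).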
My dataframe looks like so:
group <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
value <- c(3:6, 1:4, 4:9)
type <- c("d", "d", "e", "e", "g", "g", "e", "e", "d", "d", "e", "e", "f", "f")
df <- cbind.data.frame(group, value, type)
df
group value type
1 A 3 d
2 A 4 d
3 A 5 e
4 A 6 e
5 B 1 g
6 B 2 g
7 B 3 e
8 B 4 e
9 C 4 d
10 C 5 d
11 C 6 e
12 C 7 e
13 C 8 f
14 C 9 f
Within each level of factor "group" I would like to subtract the values based on "type", such that (for group "A") I get 3 - 5 (1st value of d - 1st value of e) and 4 - 6 (2nd value of d - 2nd value of e). My output should look similar to this:
A
group d_e
1 A -2
2 A -2
B
group g_e
1 B -2
2 B -2
C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
So if - as for group C - there are more than 2 types, I would like to calculate the difference between each combination of types.
Reading this post I reckon I could maybe use ddply and transform. However, I am struggling with finding a way to automatically assign the types, given that each group consists of different types and also different numbers of types.
Do you have any suggestions as to how I could manage that?
It's not clear why the sample output in the question has two identical rows in each output group and not just one, but at any rate this produces output similar to that shown:
DF <- df[!duplicated(df[-2]), ]
f <- function(x) setNames(
data.frame(group = x$group[1:2], as.list(- combn(x$value, 2, diff))),
c("group", combn(x$type, 2, paste, collapse = "_"))
)
by(DF, DF$group, f)
giving:
DF$group: A
group d_e
1 A -2
2 A -2
------------------------------------------------------------
DF$group: B
group g_e
1 B -2
2 B -2
------------------------------------------------------------
DF$group: C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
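If the two rows are meant to be position-wise differences (1st d minus 1st e, 2nd d minus 2nd e), one alternative is to split the values by type within each group and subtract whole vectors. This is a sketch, not the approach used above. The helper pair_diffs() is hypothetical, and it assumes every type within a group has the same number of values.

```r
df <- data.frame(
  group = rep(c("A", "B", "C"), times = c(4, 4, 6)),
  value = c(3:6, 1:4, 4:9),
  type  = c("d", "d", "e", "e", "g", "g", "e", "e",
            "d", "d", "e", "e", "f", "f"),
  stringsAsFactors = FALSE
)

pair_diffs <- function(g) {
  types <- as.character(g$type)
  # Values per type, keeping the order in which the types appear
  vals <- split(g$value, factor(types, levels = unique(types)))
  # Every unordered pair of types within this group
  pairs <- combn(names(vals), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) vals[[p[1]]] - vals[[p[2]]])
  names(out) <- vapply(pairs, paste, character(1), collapse = "_")
  data.frame(group = g$group[1], out)
}

res <- lapply(split(df, df$group), pair_diffs)
res
```

With the example data the two rows in each group happen to be identical, which is why taking only the first row per type and repeating it gives the same numbers.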