Find the number of items row wise with an exception - r

I have a data frame of the following form:
a <- data.frame(X1=c("A", "B", "C", "D", "0"),
X2=c("B", "A", "D", "E", "A"),
X3=c("0", "0", "B", "A", "0"),
X4=c("A", "0", "A", "0", "0")
)
# a
# X1 X2 X3 X4
# A B 0 A
# B A 0 0
# C D B A
# D E A 0
# 0 A 0 0
What I want to know is how many items each row contains, excluding "0", and I want to save that count in a new column. The expected output should be:
# b
# 3
# 2
# 4
# 3
# 1
Duplicates should be counted separately, i.e., if a row consists of 2 "A", 1 "B" and a "0", it should return 3. Thanks in advance.

We could compare the dataframe with 0 and use rowSums to calculate number of entries except 0 in each row.
rowSums(a != 0)
#[1] 3 2 4 3 1
Although it is not needed here (since applying rowSums is straightforward), we can also use apply row-wise:
apply(a!= 0 , 1, sum)
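To store the count in a new column, as the question asks, the result of rowSums can be assigned directly. The following is a minimal sketch using the question's data; comparing against the string "0" instead of the number 0 is only a stylistic choice here, since the columns hold characters.

```r
a <- data.frame(X1 = c("A", "B", "C", "D", "0"),
                X2 = c("B", "A", "D", "E", "A"),
                X3 = c("0", "0", "B", "A", "0"),
                X4 = c("A", "0", "A", "0", "0"))

# a != "0" is a logical matrix; rowSums counts the TRUE values in each row
a$b <- rowSums(a != "0")
a$b
```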

If you have a single character in each cell of data frame a, then here is a base R option. Otherwise (if some cells contain multiple characters), please turn to the approach by @Ronak Shah.
a$b <- nchar(gsub("0","",do.call(paste0,a)))
such that
> a
X1 X2 X3 X4 b
1 A B 0 A 3
2 B A 0 0 2
3 C D B A 4
4 D E A 0 3
5 0 A 0 0 1

We can use lengths with split
lengths(split(a[a!=0], row(a)[a != 0]))
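One caveat with the split approach seems worth illustrating: a row that contains only "0" has no kept cells, so it would disappear from the result. A possible workaround, sketched here on a small hypothetical data frame a2 (not from the question), is to split on a factor that lists every row index as a level.

```r
# Hypothetical data: the second row contains only "0"
a2 <- data.frame(X1 = c("A", "0", "C"),
                 X2 = c("B", "0", "0"),
                 X3 = c("0", "0", "B"))

keep <- a2 != "0"                          # logical matrix of non-"0" cells
rows <- factor(row(a2)[keep],              # row index of each kept cell
               levels = seq_len(nrow(a2))) # keep a level for every row
counts <- lengths(split(a2[keep], rows))   # empty groups get length 0
unname(counts)
# 2 0 2
```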

Related

apply function removing 0 counts from table() output of ordered factors [duplicate]

This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
(1 answer)
Closed 7 months ago.
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: that a 0 is provided for any factor missing
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by only including columns of my data.frame that have at least 1 count for each factor that is similar between the columns. (i.e., I can get my desired formatting outcome by removing any column with a 0 count for any factor):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply, which is faster, but we need to know the length of each result (here, the number of levels) in advance.
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
If we don't urgently need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
If we know the number of levels beforehand, we can simplify these to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
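As a self-contained sketch of the vapply variant: the third argument is a template (FUN.VALUE) that every per-column result must match in length, which is why the number of levels has to be known up front. The name n_lev is introduced here only for illustration.

```r
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1, 1:4), X4 = c(2, 2:5))
dat <- data.frame(lapply(dat0, factor, ordered = TRUE,
                         levels = 1:5, labels = letters[1:5]))

n_lev <- nlevels(dat$X1)   # all columns share the same five levels
# Each table() result must have length n_lev, otherwise vapply stops with an error
counts <- vapply(dat, table, numeric(n_lev))
dim(counts)
```

Because the template fixes the shape of the result, a column with an unexpected number of levels fails loudly instead of silently turning the output into a list, as sapply would.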
Here comes the benchmark
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(, 5L, dim(DAT)[2])
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
expr min lq mean median uq max neval cld
t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732 100 c
sapply(DAT, table) 9690.340 9822.2150 11721.6487 9934.2045 14128.6330 19107.070 100 c
vapply(DAT, table, numeric(5L)) 9630.185 9729.9155 11313.4803 9816.3260 14017.8180 22655.129 100 c
vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252 9890.5700 11309.0461 9976.4840 14110.4775 17906.082 100 c
sapply(DAT, tabulate, 5L) 725.613 742.7820 778.6458 785.3595 807.1935 916.700 100 a
sapply(DAT, tabulate, nlevels(unlist(dat))) 848.600 891.1135 936.7825 939.8245 967.2390 1114.601 100 a
for 3580.538 3846.5700 4059.3048 3922.1300 3981.4300 19752.024 100 b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
Use lapply rather than apply, as explained in ZheyuanLi's linked answer and comment.
Summary: The problem with apply is that it first converts the data frame to a character matrix; table then re-factors those characters, so unused levels are not preserved. lapply, by contrast, passes each column to table as a factor and returns a list.
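The loss of levels can be seen directly by doing the matrix conversion by hand. This is a small sketch with two of the columns rebuilt from the question's data; as.matrix stands in for the coercion that apply performs internally on a data frame.

```r
dat <- data.frame(
  X3 = factor(c("a", "a", "b", "c", "d"), levels = letters[1:5], ordered = TRUE),
  X4 = factor(c("b", "b", "c", "d", "e"), levels = letters[1:5], ordered = TRUE)
)

m <- as.matrix(dat)     # character matrix: the factor levels are gone
is.character(m)
table(m[, "X3"])        # only the four values that actually occur
table(dat$X3)           # all five levels, with a zero count for "e"
```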
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Also thanks jay.sf for showing how vapply works.

problems with counting vowels, checking starts with or ends with a vowel in words

Consider the below code to count the occurrence of letter 'a' in each of the words:
data <- data.frame(number=1:4, string=c("this.is.a.great.word", "Education", "Earth.Is.Round", "Pinky"), stringsAsFactors = FALSE)
library(stringr)
data$Count_of_a <- str_count(data$string, "a")
data
Which will result into something like this:
number string Count_of_a
1 1 this.is.a.great.word 2
2 2 Education 1
3 3 Earth.Is.Round 1
4 4 Pinky 0
I was trying to do a couple more things:
compute the total number of vowels in each word
compute the total number of letters in each word
flag whether a word starts with a vowel (1 if so, else 0)
flag whether a word ends with a vowel (1 if so, else 0)
The problem is that nchar(data$string) also counts the dots ('.').
Also, I could not find much help on the above four requirements.
The final data I want should look like this:
number string starts_with_vowel ends_with_vowel TotalLtrs
1 this.is.a.great.word 0 0 16
2 Education 1 0 9
3 Earth.Is.Round 1 0 12
4 Pinky 0 1 5
You want a combination of regular expressions
library(tidyverse)
data %>%
mutate(
nvowels = str_count(tolower(string), "[aeoiu]"),
total_letters = str_count(tolower(string), "\\w"),
starts_with_vowel = grepl("^[aeiou]", tolower(string)),
ends_with_vowel = grepl("[aeiou]$", tolower(string))
)
# number string nvowels total_letters starts_with_vowel ends_with_vowel
# 1 1 this.is.a.great.word 6 16 FALSE FALSE
# 2 2 Education 5 9 TRUE FALSE
# 3 3 Earth.Is.Round 5 12 TRUE FALSE
# 4 4 Pinky 1 5 FALSE FALSE
If you consider y a vowel, add it like so
nvowels = str_count(tolower(string), "[aeoiuy]")
starts_with_vowel = grepl("^[aeiouy]", tolower(string))
ends_with_vowel = grepl("[aeiouy]$", tolower(string))
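The question asks for 1/0 flags and a letter count that ignores dots. A base R sketch of that, under the assumption that "letter" means any alphabetic character and that y is not treated as a vowel, could look like this:

```r
data <- data.frame(number = 1:4,
                   string = c("this.is.a.great.word", "Education",
                              "Earth.Is.Round", "Pinky"),
                   stringsAsFactors = FALSE)

s <- tolower(data$string)
# Remove every non-letter (such as the dots) before counting characters
data$TotalLtrs <- nchar(gsub("[^[:alpha:]]", "", s))
# grepl returns TRUE/FALSE; as.integer turns that into 1/0
data$starts_with_vowel <- as.integer(grepl("^[aeiou]", s))
data$ends_with_vowel <- as.integer(grepl("[aeiou]$", s))
data
```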
library(stringr)
str_count(data$string, "a|e|i|o|u|A|E|I|O|U")
[1] 6 5 5 1
str_count(data$string, paste0(c(letters,LETTERS), collapse = "|"))
[1] 16 9 12 5
ifelse(substr(data$string, 1, 1) %in% c("a", "e", "i", "o", "u", "A", "E", "I", "O", "U"), 1, 0)
[1] 0 1 1 0
ifelse(substr(data$string, nchar(data$string), nchar(data$string)) %in% c("a", "e", "i", "o", "u", "A", "E", "I", "O", "U"), 1, 0)
[1] 0 0 0 0

Add new column describing unique values per ID

The data I have contain four fields: ID, x1 (numeric), x2 (numeric), and x3 (factor). Some IDs have multiple records, and also some values of x3 are missing (NA). Here is a sample
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,4,5,6,6)
x1 <- rnorm(16,0,1)
x2 <- rnorm(16,2,2)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c", "w", "w", "w", "y", NA, NA)
df <- data.frame(ID, x1, x2, x3)
I want to create a new field (let's call it unqind) to check whether each ID has unique values of x3.
For example, ID=1 has four observations of x3 ("a", "a", "a", NA) ... three "a"'s and one NA. Therefore unqind=0.
ID=2 has two observations of x3 (2 "b"s)... therefore, unqind=1.
In case all values of x3 are NAs per ID, then unqind=1.
After creating unqind, df looks like:
ID x1 x2 x3 unqind
1 0.9087691 4.4353865 a 0
1 0.3686852 2.5851186 a 0
1 -1.335171 1.18109 a 0
1 -0.1596629 0.593775 NA 0
2 0.4841148 0.1684549 b 1
2 0.1256352 4.2785666 b 1
3 -0.954508 3.1284599 c 0
3 0.3502183 2.4766285 c 0
3 -1.2365438 1.041901 a 0
3 0.9786498 -0.6517521 c 0
4 1.3426399 1.5733424 w 1
4 -0.3117586 -0.4648479 w 1
4 0.136769 -2.6124866 w 1
5 -1.3295984 6.2783164 y 1
6 -1.1989125 -1.7025381 NA 1
6 -0.8936165 2.3131387 NA 1
You could do this quite easily with the data.table package. uniqueN() is equivalent to length(unique(x)) but much faster. Group by ID and compare the result to 1.
library(data.table)
setDT(df)[, unqind := as.integer(uniqueN(x3) == 1L), by = ID]
Another option, using base R, could be with ave().
df$unqind <- with(df, {
as.integer(ave(as.character(x3), ID, FUN=function(x) length(unique(x))) == 1L)
})
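For checking the logic without data.table, here is a base R sketch of the same idea that runs ave over row indices so the intermediate result stays numeric. It relies on unique() keeping NA as its own value, which is what makes ID 1 come out as 0 and ID 6 as 1; the name n_distinct is only illustrative.

```r
ID <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c",
        "w", "w", "w", "y", NA, NA)
df <- data.frame(ID, x3)

# For each ID, count the distinct x3 values (NA counts as one value)
n_distinct <- ave(seq_along(df$x3), df$ID,
                  FUN = function(i) length(unique(df$x3[i])))
df$unqind <- as.integer(n_distinct == 1L)
df$unqind
```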

Sorting data frame by character string

I have a data frame and need to sort its columns by a character string.
I tried it like this:
# character string
a <- c("B", "E", "A", "D", "C")
# data frame
data <- data.frame(A = c(0, 0, 1), B = c(1, 1, 1), C = c(1, 0, 1), D = c(0, 0, 1), E = c(0, 1, 1))
data
# A B C D E
# 1 0 1 1 0 0
# 2 0 1 0 0 1
# 3 1 1 1 1 1
# sorting
data.sorted <- data[, order(a)]
# order of characters in data
colnames(data.sorted)
# [1] "C" "A" "E" "D" "B"
However, the order of columns in the sorted data frame is not the same as the characters in the original character string.
Is there any way, how to sort it?
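The function order(a) returns, for each rank, the position in the vector a at which the value with that rank lies. So, since "A" (ranked first) lies in the third position of a, order(a)[1] is equal to 3. Similarly, "C" (ranked third) lies in the fifth position of a, so order(a)[3] equals 5. Indexing the columns with order(a) therefore selects columns by those positions, not by the names stored in a.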
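A quick way to see what order does, using the vector from the question (cols is a stand-in for the column names of data):

```r
a <- c("B", "E", "A", "D", "C")

ord <- order(a)
ord          # 3 1 5 4 2: where "A", "B", "C", "D", "E" sit inside a
a[ord]       # "A" "B" "C" "D" "E", the same as sort(a)

# Used as column positions, these indices pick columns 3, 1, 5, 4, 2
cols <- c("A", "B", "C", "D", "E")
cols[ord]    # "C" "A" "E" "D" "B", the unwanted order seen in the question
```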
Luckily, the solution is even simpler, thanks to the way R works with brackets. If you ask to see just the column named "B" you'll get:
> data[, "B", drop=FALSE]
B
1 1
2 1
3 1
Or if you want two specific columns
> data[, c("B", "E")]
B E
1 1 0
2 1 1
3 1 1
And finally, more generally, if you have a whole vector by which you want to order your columns, then you can do that, too:
> data.sorted <- data[, a]
> data.sorted
B E A D C
1 1 0 0 0 1
2 1 1 0 0 0
3 1 1 1 1 1
> all(colnames(data.sorted)==a)
[1] TRUE
string[] str = { "H", "G", "F", "D", "S","A" };
Array.Sort(str);
for (int i = 0; i < str.Length; i++)
{
Console.WriteLine(str[i]);
}
Console.ReadLine();

Convert nominal results from round robin tournaments into a list of adjacency matrices

I would like to take nominal results from a round-robin tournament and convert them to a list of binary adjacency matrices.
By convention, results from these tournaments are written by recording the name of the winner. Here is code for an example table where four individuals (A,B,C,D) compete against each other:
set <- c(rep(1, 6), rep(2,6))
trial <- (1:12)
home <- c("B", "A", "C", "D", "B", "C", "D", "C", "B", "A", "A", "D")
visitor <- c("D", "C", "B", "A", "A", "D", "B", "A", "C", "D", "B", "C" )
winners.rr1 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
winners.rr2 <- c("D", "A", "C", "A", "A", "D", "D", "A", "C", "A", "A", "D")
winners.rr3 <- c("D", "A", "B", "A", "A", "D", "D", "A", "B", "D", "A", "D")
roundrobin <- data.frame(set=set, trial=trial, home=home, visitor=visitor,
winners.rr1=winners.rr1, winners.rr2=winners.rr2,
winners.rr3=winners.rr3)
Here's the table:
> roundrobin
set trial home visitor winners.rr1 winners.rr2 winners.rr3
1 1 1 B D D D D
2 1 2 A C A A A
3 1 3 C B B C B
4 1 4 D A A A A
5 1 5 B A A A A
6 1 6 C D D D D
7 2 7 D B D D D
8 2 8 C A A A A
9 2 9 B C B C B
10 2 10 A D D A D
11 2 11 A B A A A
12 2 12 D C D D D
This table shows the winners from three round robin tournaments. Within each tournament, there are two sets: each player competes against all others once at home, and once as a visitor. This makes for a total of 12 trials in each round robin tournament.
So, in the first trial in the first set, player D defeated player B. In the second trial of the first set, player A defeated player C, and so on.
I would like to turn these results into a list of six adjacency matrices. Each matrix is to be derived from each set within each round robin tournament. Wins are tallied on rows as "1", and losses are tallied as "0" on rows. ("Home" and "visitor" designations are irrelevant for what follows).
Here is what the adjacency matrix from Set 1 of the first round robin would look like:
> Adj.mat.set1.rr1
X A B C D
1 A NA 1 1 1
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 0 1 1 NA
And here is what Set 2 of the first round robin would look like:
> Adj.mat.set2.rr1
X A B C D
1 A NA 1 1 0
2 B 0 NA 1 0
3 C 0 0 NA 0
4 D 1 1 1 NA
The latter matrix shows, for example, that player A won 2 trials, player B won 1 trial, player C won 0 trials, and player D won 3 trials.
The trick of this manipulation is therefore to convert each win (recorded as a name) into a score of "1" in the appropriate row on the adjacency matrix, while losses are recorded as "0".
Any help is much appreciated.
Here's one way to go about it, although I imagine there must be a simpler approach - perhaps involving plyr. The following splits the data frame into subsets corresponding to set, then, for each round, sets up a table of zeroes (with NA diagonal) to hold results, and finally sets "winning cells" to 1 by subsetting the table with a matrix. Output class is set to matrix to ensure matrices are presented as such.
results <- lapply(split(roundrobin, roundrobin$set), function(set) {
lapply(grep('^winners', names(set)), function(i) {
tab <- table(set$home, set$visitor)
tab[] <- 0
diag(tab) <- NA
msub <- t(apply(set, 1, function(x) {
c(x[i], setdiff(c(x['home'], x['visitor']), x[i]))
}))
tab[msub] <- 1
class(tab) <- 'matrix'
tab
})
})
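The key step above is indexing a matrix with a two-column character matrix, where each row names one (row, column) cell. The sketch below isolates that step for Set 1 of the first round robin; deriving the loser with ifelse is a simplification introduced here, not part of the answer's code.

```r
home    <- c("B", "A", "C", "D", "B", "C")
visitor <- c("D", "C", "B", "A", "A", "D")
winner  <- c("D", "A", "B", "A", "A", "D")
loser   <- ifelse(winner == home, visitor, home)  # the other player in each trial

players <- sort(unique(c(home, visitor)))
adj <- matrix(0, length(players), length(players),
              dimnames = list(players, players))
diag(adj) <- NA

# Each row of cbind(winner, loser) addresses one cell by row and column name
adj[cbind(winner, loser)] <- 1
adj
```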
Results for set 1:
> results[[1]]
[[1]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
[[2]]
A B C D
A NA 1 1 1
B 0 NA 0 0
C 0 1 NA 0
D 0 1 1 NA
[[3]]
A B C D
A NA 1 1 1
B 0 NA 1 0
C 0 0 NA 0
D 0 1 1 NA
