Count number of | per column in a data frame - r

I have a 5 column by 100 row data frame. I want to count the number of pipe symbols | occurring in each column.
df <- as.data.frame(matrix(c(
c("1", "2", "3", "4", "5"),
c("A", "B", "C", "B", "B"),
c("|", "W", "G", "|", "D"),
c("Q", "D", "F", "|", "F"),
c("Q", "|", "|", "|", "Q")),
5, 5, byrow = TRUE)
)
  V1 V2 V3 V4 V5
1  1  2  3  4  5
2  A  B  C  B  B
3  |  W  G  |  D
4  Q  D  F  |  F
5  Q  |  |  |  Q
I'd like a result showing 1 pipe in column 1, 1 pipe in column 2, 1 pipe in column 3, 3 pipes in column 4, 0 pipes in column 5

Another way to do it is using colSums() on Dan Y's data frame.
colSums(df == "|")
V1 V2 V3 V4 V5
 1  1  1  3  0

If each string is just a single character, you can do a simple sapply:
# turning the example data you provided into a data.frame
df <- as.data.frame(matrix(c(
c("1", "2", "3", "4", "5"),
c("A", "B", "C", "B", "B"),
c("|", "W", "G", "|", "D"),
c("Q", "D", "F", "|", "F"),
c("Q", "|", "|", "|", "Q")),
5, 5, byrow = TRUE)
)
# calculation you want
sapply(df, function(x) sum(x == "|"))
# result = c(1, 1, 1, 3, 0)
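If cells can hold longer strings with several pipes each, the equality test above only counts cells that are exactly "|". Here is a minimal base R sketch for that case, assuming character columns with no missing values. The helper name count_pipes and the small data frame are made up for illustration.

```r
# Hypothetical data: some cells contain more than one pipe
df <- data.frame(
  V1 = c("a|b", "c", "||"),
  V2 = c("x", "y|z", "w"),
  stringsAsFactors = FALSE
)

# Count every literal "|" in a column; fixed = TRUE stops "|" from
# being read as the regex alternation operator
count_pipes <- function(x) {
  x <- as.character(x)
  sum(lengths(regmatches(x, gregexpr("|", x, fixed = TRUE))))
}

pipe_counts <- sapply(df, count_pipes)
pipe_counts
```

For cells that are only ever a single character, this gives the same counts as sum(x == "|").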

Related

How to pull the column indices when matching the rows of a dataframe and a vector

Say I have a dataframe of letters like so:
  X1 X2 X3
1  G  A  C
2  G  T  C
3  G  T  C
4  A  T  G
5  A  C  G
And a vector like so:
ref <- c("A", "C", "C", "A", "G")
Going row-wise, how do I pull the column indices of the dataframe which match the vector?
So the answer should be a vector of numbers like so:
2, 3, 3, 1, 3
We can use
max.col(df1 == ref)
#[1] 2 3 3 1 3
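Note that max.col() breaks ties at random by default, and a row with no match at all is one big tie. Here is a hedged sketch, assuming you want the first matching column and NA where nothing matches. The names hits and idx are illustrative.

```r
df1 <- data.frame(
  X1 = c("G", "G", "G", "A", "A"),
  X2 = c("A", "T", "T", "T", "C"),
  X3 = c("C", "C", "C", "G", "G"),
  stringsAsFactors = FALSE
)
ref <- c("A", "C", "C", "A", "G")

# ref is recycled down each column, so hits[i, j] is TRUE when df1[i, j] == ref[i]
hits <- as.matrix(df1) == ref

# Take the first matching column per row instead of a random one
idx <- max.col(hits, ties.method = "first")

# Rows without any match would otherwise report column 1; mark them as NA
idx[rowSums(hits) == 0] <- NA
idx
```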
data
df1 <- structure(list(X1 = c("G", "G", "G", "A", "A"), X2 = c("A", "T",
"T", "T", "C"), X3 = c("C", "C", "C", "G", "G")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))

How to apply separate_rows() to all columns, passing a `sep` parameter?

Pretty straightforward: I have a data frame where the values in many columns need to be split into their own rows, based on ;s as the delimiter.
After reading a bit,
df %>%
Reduce(separate_rows_, x = colnames)
works, except that I can't pass the sep parameter (so it also separates by white spaces, commas, and other non-alphanumeric chars).
One answer proposed writing a modified version of the function that includes the parameter, but I couldn't get that working:
Reduce(f = function(y) separate_rows_(sep = ";"), x = colnames)
What am I doing wrong?
Having said that, my ideal solution would be a tidyverse solution, if it's cleaner (maybe map_dfr?); but obviously any solution is better than none :).
Here's sample data:
structure(list(q1 = c("1,2,3,4", "2,4"), q2 = c("a,b", "e,f"),
q3 = c("c,d", "g,h,z")), row.names = 1:2, class = "data.frame")
Expected output:
structure(list(q1 = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "2", "2", "2", "2", "2",
"2", "4", "4", "4", "4", "4", "4"), q2 = c("a", "a", "b", "b",
"a", "a", "b", "b", "a", "a", "b", "b", "a", "a", "b", "b", "e",
"e", "e", "f", "f", "f", "e", "e", "e", "f", "f", "f"), q3 = c("c",
"d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d",
"c", "d", "g", "h", "z", "g", "h", "z", "g", "h", "z", "g", "h",
"z")), row.names = c(NA, -28L), class = "data.frame")
The process I want to streamline is not having to pass every column name like so:
output <- test %>%
separate_rows(q1, sep = ",") %>%
separate_rows(q2, sep = ",") %>%
separate_rows(q3, sep = ",")
You can use purrr::reduce, which applies the given function .f to .init and the first element of .x, then applies the function to the output of that and the second element of .x, etc. until all elements of .x have been used.
Within the .f argument formula, .x is the previous output (or .init for the first run) and .y is the given element of the .x argument to reduce.
library(tidyverse)
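# all_of() marks .y as a character vector of column names, which avoids tidyselect's ambiguity warning
reduce(.init = df, .x = names(df), .f = ~separate_rows(.x, all_of(.y), sep = ','))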
# equiv to: reduce(.init = df, .x = names(df), .f = separate_rows, sep = ',')
As akrun notes in the comments, the same fold can be written with base R's Reduce(), still calling tidyr's separate_rows() (same output):
Reduce(function(x, y) separate_rows(x, all_of(y), sep = ","), names(df), init = df)
# q1 q2 q3
# 1 1 a c
# 2 1 a d
# 3 1 b c
# 4 1 b d
# 5 2 a c
# 6 2 a d
# 7 2 b c
# 8 2 b d
# 9 3 a c
# 10 3 a d
# 11 3 b c
# 12 3 b d
# 13 4 a c
# 14 4 a d
# 15 4 b c
# 16 4 b d
# 17 2 e g
# 18 2 e h
# 19 2 e z
# 20 2 f g
# 21 2 f h
# 22 2 f z
# 23 4 e g
# 24 4 e h
# 25 4 e z
# 26 4 f g
# 27 4 f h
# 28 4 f z
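To see the order in which reduce()/Reduce() nests the calls, the fold can be traced without tidyr by building the call as text. This is only an illustration: the strings are never evaluated, and the name steps is made up.

```r
cols <- c("q1", "q2", "q3")

# accumulate = TRUE keeps every intermediate value of the fold,
# starting with the initial value "df"
steps <- Reduce(
  function(acc, nm) paste0("separate_rows(", acc, ", ", nm, ")"),
  cols,
  "df",
  accumulate = TRUE
)
steps
```

The last element shows the fully nested call, which is the hand-written pipeline of one separate_rows() per column.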

Count occurrences per entry in dataframe

I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
  id bank
1  1    a
2  1    b
3  1    c
4  2    b
5  3    b
6  3    c
7  4    a
8  4    c
In this dataframe you can see that for some ids there are multiple banks, i.e. for id==1, bank=c(a,b,c).
The information I would like to extract from this dataframe is the overlap between id's within different banks and the count.
So for example for bank a: bank a has two persons (unique ids), 1 and 4. For these persons, I want to know which other banks they have:
For person 1: banks b and c
For person 4: bank c
The total number of other banks is 3, of which b = 1 and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a    b       1
a    c       2
b    a       1
b    c       2
c    a       2
c    b       2
Took me a while to get a result, so I'm posting it. Not as sexy as Ronak Shah's, but same result.
library(data.table)
library(dplyr)
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)
for(i in 1:length(dflist)) {
if(nrow(dflist[[i]]) < 2) {
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
result <- setNames(data.table(rbindlist(resultlist)), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that you have b for id==2 which is not overlapping with other values. If you don't want that in the final product, just apply na.omit() on the output.
An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
Do you need each pair of banks in both directions? In this case a -> b is the same as b -> a. We can use combn to create all combinations of the unique banks taken 2 at a time, then count the ids common to both banks in each combination.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
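Another way to get the two-directional table, sketched in base R only: cross-tabulate id by bank and take the cross-product, which counts for every pair of banks the ids they share. This assumes each id/bank pair appears at most once. The names co and overlap are illustrative.

```r
id <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df <- data.frame(id, bank, stringsAsFactors = FALSE)

# id-by-bank incidence table; crossprod() gives a bank-by-bank matrix
# whose [i, j] entry is the number of ids present in both banks
co <- crossprod(table(df$id, df$bank))

# Long format, then drop the diagonal (a bank paired with itself)
# and any pairs that share no ids
overlap <- as.data.frame(as.table(co), stringsAsFactors = FALSE)
names(overlap) <- c("bank", "overlap", "amount")
overlap <- overlap[overlap$bank != overlap$overlap & overlap$amount > 0, ]
overlap <- overlap[order(overlap$bank, overlap$overlap), ]
rownames(overlap) <- NULL
overlap
```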

Expanding data frame with "mirror" observations

I have a data frame arranged as follows:
df <- structure(list(name1= c("A","A","B"),
name2 = c("B", "C","C"),
size = c(10,20,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3"), class =("data.frame"))
I would like to add "mirror" observations as follows:
df <- structure(list(name1 = c("A","B","A", "C", "B", "C"),
name2 = c("B", "A","C", "A", "C", "B"),
size = c(10,10,20,20,30,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3", "4", "5", "6"), class =("data.frame"))
Inputs would be much appreciated.
We can do this in two steps,
df1 <- df[rep(rownames(df), each = 2),]
df1[c(FALSE, TRUE), 1:2] <- df1[c(FALSE, TRUE), 2:1]
df1
# name1 name2 size
#1 A B 10
#1.1 B A 10
#2 A C 20
#2.1 C A 20
#3 B C 30
#3.1 C B 30
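Alternatively, with data.table, bind a copy with the first two columns swapped, matching columns by position (the mirrored rows come after the originals rather than interleaved):
library(data.table)
rbindlist(list(df, df[c(2:1, 3)]), use.names = FALSE)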
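For comparison, here is a base R sketch of the same idea that also interleaves each mirrored row with its original, as in the desired output. The names mirror and both are illustrative.

```r
df <- data.frame(
  name1 = c("A", "A", "B"),
  name2 = c("B", "C", "C"),
  size = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Swap the two name columns, then restore the original column names
mirror <- setNames(df[c("name2", "name1", "size")], names(df))

# Stack originals and mirrors, then order so each mirror follows its original
both <- rbind(df, mirror)
both <- both[order(rep(seq_len(nrow(df)), 2)), ]
rownames(both) <- NULL
both
```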

Create column identifying minimum character from within a group and label ties

I have paired data for 10 subjects (with some missing and some ties). My goal is to select the eye with the best disc_grade (A > B > C) and label ties accordingly from the data frame below.
I'm stuck on how to use R code to select the rows with the best disc_grade for each subject.
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
The desired output is:
patientID eye disc_grade
2 1 L B
4 2 L B
5 3 R B
7 4 R B
10 5 L A
11 6 Tie B
14 7 L B
17 9 R B
19 10 R B
This seems to work:
df %>%
group_by(patientID) %>%
filter(disc_grade == min(disc_grade, na.rm=TRUE)) %>%
summarise(eye = if (n()==1) eye else "Tie", disc_grade = first(disc_grade))
patientID eye disc_grade
(dbl) (chr) (chr)
1 1 L B
2 2 L B
3 3 R B
4 4 R B
5 5 L A
6 6 Tie B
7 7 L B
8 9 R B
9 10 R B
There is a warning for group 8, but we get the desired result thanks to how filter works on NAs.
With data.table:
setDT(df)[,
.SD[ disc_grade == min(disc_grade, na.rm=TRUE) ][,
.( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
Again, there's a warning, but now we do get a row for group 8, since [ does not ignore NAs. To get around this, you could filter the NAs before or after the operation (as in other answers). My best idea for doing it during the main operation is pretty convoluted:
setDT(df)[,
.SD[ which(disc_grade == min(disc_grade, na.rm=TRUE)) ][,
if (.N >= 1) list( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
One option with data.table
library(data.table)
na.omit(setDT(df))[, eye:=if(uniqueN(disc_grade)==1 &
.N >1) 'Tie' else eye, patientID
][order(factor(disc_grade, levels=c('A', 'B', 'C'))),
.SD[1L] ,patientID][order(patientID)]
# patientID eye disc_grade
#1: 1 L B
#2: 2 L B
#3: 3 R B
#4: 4 R B
#5: 5 L A
#6: 6 Tie B
#7: 7 L B
#8: 9 R B
#9: 10 R B
library(dplyr)
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
df %>%
filter(!is.na(disc_grade)) %>% ## remove rows with NAs
group_by(patientID) %>% ## for each patient
filter(disc_grade == min(disc_grade)) %>% ## keep the row (his eye) that has the best score
mutate(eye_upd = ifelse(n() > 1, "tie", eye)) %>% ## if you kept both eyes you have a tie
select(patientID,eye_upd,disc_grade) %>%
distinct()
# patientID eye_upd disc_grade
# (dbl) (chr) (fctr)
# 1 1 L B
# 2 2 L B
# 3 3 R B
# 4 4 R B
# 5 5 L A
# 6 6 tie B
# 7 7 L B
# 8 9 R B
# 9 10 R B
There's certainly a better way to do this, but this gets the job done...need more coffee...
df_orig <- df
library(dplyr)
df %>%
filter(!is.na(disc_grade)) %>%
group_by(patientID) %>%
summarise(best = min(disc_grade)) %>%
left_join(., df_orig, by = c("patientID" = "patientID",
"best" = "disc_grade")) %>%
group_by(patientID) %>%
mutate(eye = ifelse(n() > 1, "tie", eye)) %>%
distinct(patientID, .keep_all = TRUE) %>%
select(patientID, eye, best)
Note: I am able to get away with min(disc_grade) because the grades sort alphabetically in the same order as their ranking (A before B before C). Consider looking at as.numeric(as.factor(df$disc_grade)).
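If the grades did not happen to sort alphabetically, the ranking could be spelled out explicitly. Here is a base R sketch under that assumption. The names grade_rank, winners and result are illustrative, and the data is the question's, rebuilt as a plain data frame.

```r
df <- data.frame(
  patientID = rep(1:10, each = 2),
  eye = rep(c("R", "L"), times = 10),
  disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C", "B", "A",
                 "B", "B", "C", "B", NA, NA, "B", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Explicit ranking: position in this vector, so A = 1 is best
grade_rank <- match(df$disc_grade, c("A", "B", "C"))

# Drop rows without a grade
ok <- df[!is.na(grade_rank), ]
ok$grade_rank <- grade_rank[!is.na(grade_rank)]

# Keep the rows that carry each patient's best (lowest) rank
best <- ave(ok$grade_rank, ok$patientID, FUN = min)
winners <- ok[ok$grade_rank == best, ]

# More than one row left for a patient means both eyes tie
n_best <- ave(winners$grade_rank, winners$patientID, FUN = length)
winners$eye[n_best > 1] <- "Tie"

result <- unique(winners[c("patientID", "eye", "disc_grade")])
rownames(result) <- NULL
result
```

Patient 8 has no grade for either eye, so it drops out at the filtering step, as in the desired output.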
