The problem
I am having trouble using a per-group value inside a function call in a dplyr pipeline. The issue is with the following line; everything after it is just the data and setup needed to reach the problematic step.
data <- data %>%
group_by(Group) %>%
bind_cols(as_tibble(rotate2(as.matrix(.)[,1:2], theta = min(.$theta))))
min(.$theta) is my attempt to pick up the theta value within each group and use it. The data contains a theta column (created as shown below) that repeats the same value within each group. I want to take that value for each group (Group) and pass it to rotate2. There are only two groups in the sample below, but the real data has hundreds. How can I use the existing value for each group?
Is there something I can replace min(.$theta) with that would do this? As written, it seems to take data from the entire column rather than from each Group individually.
Data to get to the problem
Packages: dplyr, tidyr (for fill), plyr, lava (for rotate2)
data <- structure(list(X = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 4.9046,
6.1424, 7.275, 8.5851, 10.0373, 11.9981, 13.7726, 15.0731, 16.0664,
18.1945, 21.2666, 24.2093, 26.7119, 28.8037, 30.7135, 32.1351,
33.1982, 34.2341, 35.7587, 37.2147, 38.4303, 39.625, 40.4596,
42.0938, 42.7428, 42.7593, 43.5085, 43.7419, 43.5989, 44.0841,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -14.845, -11.9052,
-8.7897, -5.8034, -2.6756, 0.3316, 3.4003, 6.5281, 9.6517, 12.804,
15.9861, 19.1769, 22.2929, 25.4089, 28.3392, 31.0054, 33.1847,
35.081, 36.7227, 38.1544, 39.1697, 40.049, 40.9647, 41.5014,
41.8874, 42.1778, 42.3435, 42.2681, 42.3745, 42.4619, NA, NA,
NA, NA), Y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, -9.9938, -7.4596,
-4.8647, -2.2903, 0.3158, 2.9302, 5.7262, 8.7033, 11.8007, 14.9847,
16.7225, 16.7813, 15.6921, 14.2964, 11.5579, 8.2378, 5.183, 1.5938,
-2.0712, -5.195, -7.1447, -9.0446, -11.1269, -13.0979, -15.3295,
-17.1898, -19.4376, -21.4781, -23.8426, -25.6343, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 8.0113, 9.1826, 9.838, 10.7908,
11.175, 12.0393, 12.6813, 12.8828, 13.2281, 13.5102, 13.6637,
13.5493, 12.8699, 12.2191, 10.9208, 9.0209, 6.2158, 3.2466, 0.2169,
-2.7807, -6.0439, -9.1262, -11.8684, -14.7779, -17.5825, -20.2452,
-22.807, -25.3519, -27.6105, -29.7536, NA, NA, NA, NA), fan_line = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L,
40L, 41L, 42L)), class = "data.frame", row.names = c(NA, -84L
))
data <- data %>% mutate(Group = rep(1:(n()/42), each = 42)) %>% dplyr::group_by(Group) %>%
mutate(start = min(which(!is.na(X))), end = max(which(!is.na(X))), midpoint = round((start+end)/2, digits = 0)) %>% ungroup()
data$start_val_x <- 0
data$end_val_x <- 0
data$start_val_y <- 0
data$end_val_y <- 0
for (i in 1:nrow(data)){
if (data[i, "fan_line"] == data[i, "start"]){
data[i, "start_val_x"] = data[i, "X"]
data[i, "start_val_y"] = data[i, "Y"]
}
else{data[i, "start_val_y"] = NA
data[i, "start_val_x"] = NA}
}
for (i in 1:nrow(data)){
if (data[i, "fan_line"] == data[i, "end"]){
data[i, "end_val_x"] = data[i, "X"]
data[i, "end_val_y"] = data[i, "Y"]
}
else{data[i, "end_val_y"] = NA
data[i, "end_val_x"] = NA}
}
data <- data %>% group_by(Group) %>% fill(c(start_val_x, start_val_y), .direction = "down") %>% fill(c(start_val_x, start_val_y), .direction = "up")
data <- data %>% group_by(Group) %>% fill(c(end_val_x, end_val_y), .direction = "down") %>% fill(c(end_val_x, end_val_y), .direction = "up")
data <- data %>% group_by(Group) %>% mutate(theta = max(atan(diff(c(start_val_y, end_val_y))/diff(c(start_val_x, end_val_x))), na.rm = T))
data <- data %>% group_by(Group) %>% bind_cols(as_tibble(rotate2(as.matrix(.)[,1:2], theta = min(.$theta))))
We could use group_modify. However, I'm not sure whether the outcome below is what you are looking for.
In a normal dplyr pipeline we could use cur_data() to access the data of each group. That is not possible here, because bind_cols() does not evaluate its arguments per group, so . refers to the whole data frame rather than to the current group. For this case there are group_map (which returns a list) and group_modify (which returns a grouped tibble, as long as each output is a data.frame). Both accept a lambda function in which .x is the data of the current group.
library(tidyverse)
library(lava)
data %>%
group_by(Group) %>%
group_modify(~ as_tibble(rotate2(as.matrix(.x)[,1:2], theta = min(.x$theta))))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> # A tibble: 84 x 3
#> # Groups: Group [2]
#> Group V1 V2
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 1 NA NA
#> 3 1 NA NA
#> 4 1 NA NA
#> 5 1 NA NA
#> 6 1 NA NA
#> 7 1 NA NA
#> 8 1 NA NA
#> 9 1 NA NA
#> 10 1 8.26 -7.46
#> # … with 74 more rows
Created on 2021-04-13 by the reprex package (v0.3.0)
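If the goal is to keep the original columns next to the rotated coordinates, one option is to return the whole group from the lambda. The sketch below is only an illustration: rotate_xy is a hypothetical stand-in for lava::rotate2 (whose sign convention may differ), toy is made-up data, and the per-group angle is read with .x$theta[1] on the assumption that theta is constant within each group.

```r
library(dplyr)

# Hypothetical stand-in for lava::rotate2: multiplies each (x, y) row
# of a two-column matrix by a 2x2 rotation matrix built from theta.
rotate_xy <- function(m, theta) {
  rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
  m %*% rot
}

# Made-up data: two groups, each with its own constant theta.
toy <- tibble(
  Group = rep(1:2, each = 3),
  X     = c(1, 2, 3, 1, 2, 3),
  Y     = c(0, 0, 0, 1, 2, 3),
  theta = rep(c(pi / 2, pi), each = 3)
)

rotated <- toy %>%
  group_by(Group) %>%
  group_modify(~ {
    # .x holds only the rows of the current group (without the Group column)
    xy <- rotate_xy(as.matrix(.x[, c("X", "Y")]), theta = .x$theta[1])
    mutate(.x, X_rot = xy[, 1], Y_rot = xy[, 2])
  }) %>%
  ungroup()

rotated
```

Because the lambda returns the group with two extra columns, the result has the same number of rows as the input and can be used directly in place of the bind_cols() step.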
Related
I have a list of data frames and I want to extract the observations from the earliest year in each. My problem is that the first year differs between the data frames: it is either 1992 or 2005.
I want to create a list to store them. I tried which, but since the same years appear in several data frames, observations are repeated, and I want them kept apart.
new_df<- which(df[[i]]==1992 | df[[i]]==2005)
I've tried ifelse(), but I have to run lm afterwards and it doesn't work. I also can't just take the first rows, because the years are repeated.
My code looks like this:
df<- list(a<-data.frame(a_1<-(1992:2015),
a_2<-sample(1:24)),
b<-data.frame(b_1<-(1992:2015),
b_2<-sample(1:24)),
c<-data.frame(c_1<-(2005:2015),
c_2<-sample(1:11)),
d<-data.frame(d_1<-(2005:2015),
d_2<-sample(1:11)))
You can define a function that extracts the data from one data.frame and loop over the list to extract the values.
Below I use map from the purrr package, but you can also use lapply or a for loop.
Please do not use <- when passing values in a function call (here data.frame()), because it will mess up the column names. = is what names arguments in a function call, and it is fine to use it there.
library(purrr)
df <- list(a = data.frame(a_1 = 1992:2015,
a_2 = sample(1:24)),
b = data.frame(b_1 = 1992:2015,
b_2 = sample(1:24)),
c = data.frame(c_1 = 2005:2015,
c_2 = sample(1:11)),
d = data.frame(d_1 = 2005:2015,
d_2 = sample(1:11)))
extract_miny <- function(df){
miny <- min(df[,1])
res <- df[df[,1] == miny, 2]
names(res) <- miny
return(res)
}
map(df, extract_miny)
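The same extraction can be done without purrr. A minimal base R sketch, assuming the year is always in the first column of each data frame; df_list and first_year_rows are illustrative names, and the data are regenerated here with a fixed seed only to keep the example self-contained.

```r
set.seed(1)

# Two small data frames with different first years.
df_list <- list(
  a = data.frame(a_1 = 1992:2015, a_2 = sample(1:24)),
  c = data.frame(c_1 = 2005:2015, c_2 = sample(1:11))
)

# For each data frame, keep the row(s) whose first column equals its minimum.
first_year_rows <- lapply(df_list, function(d) {
  d[d[[1]] == min(d[[1]]), , drop = FALSE]
})

first_year_rows
```

Each list element stays a data frame, which may be convenient if a model is fitted on it afterwards.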
If the data is sorted as in the example, you can slice() the first row to get the information. Note the use of = rather than <- when creating the list of data frames.
library(tidyverse)
df <- list(
a = data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b = data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c = data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d = data.frame(d_1 = (2005:2015),
d_2 = sample(1:11))
)
df %>%
imap_dfr( ~ slice(.x, 1) %>%
set_names(c("year", "value")) %>%
mutate(dataframe = .y) %>%
as_tibble())
# A tibble: 4 x 3
year value dataframe
<int> <int> <chr>
1 1992 19 a
2 1992 2 b
3 2005 1 c
4 2005 5 d
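slice(.x, 1) relies on the rows already being sorted by year. If that cannot be assumed, a variant with slice_min() picks the earliest year regardless of row order. This is a sketch on made-up data (df_list and earliest are illustrative names; the first data frame is deliberately stored in descending order).

```r
library(dplyr)
library(purrr)

set.seed(42)
df_list <- list(
  a = data.frame(a_1 = 2015:1992, a_2 = sample(1:24)),  # descending on purpose
  c = data.frame(c_1 = 2005:2015, c_2 = sample(1:11))
)

earliest <- df_list %>%
  imap(~ .x %>%
         set_names(c("year", "value")) %>%   # harmonise the column names
         slice_min(year, n = 1) %>%          # row(s) with the smallest year
         mutate(dataframe = .y)) %>%         # record which list element it came from
  bind_rows()

earliest
```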
You may subset with an anonymous function.
lapply(df, \(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value'))) |> do.call(what=rbind)
# year value
# 1 1992 6
# 2 1992 9
# 3 2005 11
# 4 2005 11
Or, maybe better, first add a variable recording which sample each value stems from.
Map(`[<-`, df, 'sample', value=letters[seq_along(df)]) |>
lapply(\(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value', 'sample'))) |>
do.call(what=rbind)
# year value sample
# 1 1992 6 a
# 2 1992 9 b
# 3 2005 11 c
# 4 2005 11 d
Data:
df <- list(structure(list(a_1.....1992.2015. = 1992:2015, a_2....sample.1.24. = c(6L,
18L, 23L, 5L, 7L, 14L, 4L, 10L, 19L, 17L, 15L, 1L, 11L, 22L,
13L, 8L, 20L, 16L, 2L, 3L, 24L, 21L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(b_1.....1992.2015. = 1992:2015, b_2....sample.1.24. = c(9L,
24L, 18L, 8L, 16L, 11L, 13L, 23L, 15L, 20L, 19L, 21L, 12L, 22L,
7L, 3L, 6L, 17L, 2L, 5L, 4L, 10L, 1L, 14L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(c_1.....2005.2015. = 2005:2015, c_2....sample.1.11. = c(11L,
2L, 5L, 10L, 9L, 6L, 1L, 7L, 3L, 8L, 4L)), class = "data.frame", row.names = c(NA,
-11L)), structure(list(d_1.....2005.2015. = 2005:2015, d_2....sample.1.11. = c(11L,
2L, 5L, 1L, 6L, 9L, 3L, 7L, 10L, 4L, 8L)), class = "data.frame", row.names = c(NA,
-11L)))
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y.
The X column is to contain the values from the A column, but with the division performed. Values from the first row (from column A) divided by the values from the second row in column A and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
The Y column is to contain the values from the B column, but with the division performed. Values from the first row (from column B) divided by the values from the second row in column B and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
So far I used Excel for this. But now I need it in R if possible in the form of a function so that I can reuse this code easily. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample results:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c("",
"0,285714286", "", "0,454545455", "", "4,5", "", "2,428571429",
"", "1,666666667"), Y = c("", "0,6", "", "0,047619048", "", "2,09375",
"", "0,791666667", "", "1,189189189")), class = "data.frame", row.names = c(NA,
-10L))
You could use dplyr's across and lag (combined with modulo for picking every second row):
library(dplyr)
df |> mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
If you want a character vector change NA to "".
Output:
A B new_A new_B
1 2 3 NA NA
2 7 5 0.2857143 0.60000000
3 5 1 NA NA
4 11 21 0.4545455 0.04761905
5 54 67 NA NA
6 12 32 4.5000000 2.09375000
7 34 19 NA NA
8 14 24 2.4285714 0.79166667
9 10 44 NA NA
10 6 37 1.6666667 1.18918919
Function:
ab_fun <- function(data, vars) {
data |>
# {{ vars }} forwards the columns supplied by the caller instead of hard-coding A and B
mutate(across({{ vars }}, ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
}
ab_fun(df, c(A, B))
Updated with new data and correct code. + Function
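For comparison, the same pairwise division can be written without dplyr. A minimal base R sketch; pair_ratio is a hypothetical helper name, and it assumes rows are paired as (1, 2), (3, 4), and so on, with the result placed on the second row of each pair.

```r
# Divide each odd-row value by the following even-row value;
# the ratio is stored on the even row, odd rows stay NA.
pair_ratio <- function(x) {
  out  <- rep(NA_real_, length(x))
  even <- seq_len(length(x) %/% 2) * 2   # 2, 4, 6, ... (empty if length < 2)
  out[even] <- x[even - 1] / x[even]
  out
}

df <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))
df$X <- pair_ratio(df$A)
df$Y <- pair_ratio(df$B)
df
```

A trailing unpaired row (odd number of rows) simply stays NA.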
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y. Both new columns are to contain data from column A, but only every second row: X takes every second value starting from the first value in column A, and Y takes every second value starting from the second value in column A.
So far, I have been doing this in Excel. But now I need it in R, preferably as a function, so that I can easily reuse the code. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
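Since the question asks for something reusable, the steps above could be wrapped in a function. A minimal base R sketch; split_alternating and its col argument are hypothetical names, and the output columns are assumed to be called X and Y as in the question.

```r
# Copy odd-row values of `col` into X and even-row values into Y.
split_alternating <- function(data, col) {
  odd <- seq_len(nrow(data)) %% 2 == 1
  data$X <- ifelse(odd, data[[col]], NA)
  data$Y <- ifelse(odd, NA, data[[col]])
  data
}

small <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))
res <- split_alternating(small, "A")
res
```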
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
xy_fun <- function(data, A, X, Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
I like @harre's approach.
Another approach, in base R:
We could use R's recycling of a shorter vector to the length of a longer one:
df$X <- df$A
df$Y <- df$A
df$X[c(FALSE, TRUE)] <- NA
df$Y[c(TRUE, FALSE)] <- NA
df
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
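To see what the recycling does, it may help to look at the logical index on its own. In this small sketch (a is just an illustrative vector) the two-element pattern is repeated along the whole vector, so it selects every second element.

```r
a <- c(2L, 7L, 5L, 11L, 54L)

odd_values  <- a[c(TRUE, FALSE)]   # positions 1, 3, 5
even_values <- a[c(FALSE, TRUE)]   # positions 2, 4

odd_values
even_values
```

The pattern is recycled even when the vector length is not a multiple of two, as here with five elements.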
I need to overlay two different plots. They already use the same scale.
My code for each separate scatterplot looks like this.
ggscatter(chemicals, x = "columnB", y = "columnA",
color = "nombre",
palette = "jco",
ellipse = FALSE,
ellipse.type = "convex",
repel = TRUE,
max.overlaps = 10,
font.label = c(6, "plain", "red"))
ggscatter(rivers, x = "V3", y = "V2",
label = rivers$V1,
palette = "jco",
ellipse = FALSE,
ellipse.type = "convex",
repel = FALSE,
max.overlaps = 10,
font.label = c(6, "plain", "blue"))
The first data set looks like this...
chemicals <- structure(list(columnA = c(0.34526, -0.47491, 1.9717, -1.28922,
-1.3365, -1.06089, -1.35741, -1.03362, 1.33577, 0.26619, -1.33583,
0.56619, -0.84651, 0.52487, -0.44644, 0.33894, 1.33558, -1.36652,
-1.41608, 0.08864, -0.98665, -0.13102, 0.96633, -0.33869, -1.45537,
1.50434, -1.30283, -0.03662, -0.83985, -0.86605, 0.96659, -1.37216,
1.05501, 0.34936, -0.56608, -0.84148, 1.16633, 1.15391, -1.10533,
-0.04087, 1.36684, 0.39588, -0.4166, -0.7338, -1.33663, 1.24798,
0.26939, 0.57514, 0.21976, -0.62348, -1.3341, 0.6696, 1.71274,
0.0337, -1.33959, -0.33319, -0.21368, -0.25305, 0.56606, 0.56665
), columnB = c(0.46696, 0.15238, 0.28205, -1.01343, -0.45548, -0.58032,
-0.03174, -1.86618, 0.37332, 0.33668, 0.3668, 0.67415, -0.0393,
1.21716, 0.06624, 1.4333, 0.42663, 0.33143, 0.33529, -2.66816,
0.76601, 0.06666, 0.86633, 0.59532, -0.33115, -0.76641, 0.06633,
0.50038, -0.11718, 0.28718, -1.84348, -0.2598, -0.37834, 1.82102,
0.66669, 0.56604, -2.17667, -1.86617, 0.67087, -2.2598, -2.06249,
-0.25863, 1.26661, -1.76684, 0.06665, 0.80114, -1.33408, 0.23333,
0.21658, 0.39268, 0.50466, -0.09929, -0.09178, 1.07363, 1.15409,
-0.49409, 1.628, 0.26664, 0.62084, 0.50397)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,
35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L,
57L, 58L, 59L, 60L), class = "data.frame")
The second data looks like this...
rivers <- structure(list(V1 = structure(c(7L, 5L, 6L, 1L, 3L, 4L, 8L, 2L
), .Label = c("riverA", "riverB", "riverC", "riverD",
"riverE", "riverF", "riverG", "riverH"), class = "factor"),
V2 = structure(c(8L, 7L, 6L, 5L, 4L, 1L, 2L, 3L), .Label = c("-0.800",
"0.021", "0.220", "0.590", "0.999", "0.333", "0.700", "0.850"
), class = "factor"), V3 = structure(c(1L, 3L, 4L, 2L, 7L,
6L, 8L, 5L), .Label = c("-0.028", "-0.011", "-0.078", "-0.4",
"-0.952", "0.275", "0.630", "0.725"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
I need to put both of these scatter plots together in one plot.
I don't have ggpubr, but here is a demonstration using ggplot2:
library(dplyr)
library(ggplot2)
rivers %>%
mutate(source = "rivers", across(c(V3,V2), ~ as.numeric(as.character(.)))) %>%
# in the question V3 is plotted on x (like columnB) and V2 on y (like columnA)
select(source, columnB = V3, columnA = V2) %>%
bind_rows(mutate(chemicals, source = "chemicals")) %>%
ggplot(aes(columnB, columnA)) +
geom_point(aes(color = source))
I'm guessing this should be straight-forward to translate into ggpubr::ggscatter.
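The as.numeric(as.character(.)) step in the pipeline above is needed because V2 and V3 are stored as factors. A short sketch of the difference, using a made-up factor f:

```r
f <- factor(c("0.850", "-0.800", "0.220"))

codes  <- as.numeric(f)                 # internal integer level codes, not the numbers
values <- as.numeric(as.character(f))   # the numbers the labels represent

codes
values
```

Calling as.numeric() directly on a factor returns the level codes, which would silently plot the wrong coordinates.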
The premise of row-binding (via base rbind or dplyr::bind_rows or data.table::rbindlist) is that the number of rows matters not, it's the columns that matter. In the base case, there must be the same number of columns with the same names:
dat1 <- data.frame(a = 1, b = 2)
dat2 <- data.frame(a = 1:2, d = 3:4)
rbind(dat1, dat2)
# Error in match.names(clabs, names(xi)) :
# names do not match previous names
dat2b <- data.frame(a = 1:2, b = 3:4)
rbind(dat1, dat2b)
# a b
# 1 1 2
# 2 1 3
# 3 2 4
Both dplyr::bind_rows and data.table::rbindlist provide wiggle room around this, either by default (former) or with options (latter):
dat2 <- data.frame(a = 1:2, d = 3:4)
dplyr::bind_rows(dat1, dat2)
# a b d
# 1 1 2 NA
# 2 1 NA 3
# 3 2 NA 4
data.table::rbindlist(list(dat1, dat2), use.names = TRUE, fill = TRUE)
# a b d
# <num> <num> <int>
# 1: 1 2 NA
# 2: 1 NA 3
# 3: 2 NA 4
In this case, though, the names need to be normalized: rename the columns in either or both data frames so that they can be aligned and row-bound properly.
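A minimal sketch of that renaming step with the toy frames from above (dat2_renamed and combined are illustrative names): once the column names agree, plain rbind works.

```r
dat1 <- data.frame(a = 1, b = 2)
dat2 <- data.frame(a = 1:2, d = 3:4)

# Rename d to b so both data frames share the same column names.
dat2_renamed <- dat2
names(dat2_renamed)[names(dat2_renamed) == "d"] <- "b"

combined <- rbind(dat1, dat2_renamed)
combined
```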
FYI, you don't actually have to rename or rbind them if you do things the brute-force way in ggplot2, but doing it this way has consequences and limits several other options, so it is generally discouraged:
ggplot() +
# x/y order follows the question: columnB and V3 on x, columnA and V2 on y
geom_point(aes(columnB, columnA), color = "red", data = chemicals) +
geom_point(aes(as.numeric(as.character(V3)), as.numeric(as.character(V2))), color = "blue", data = rivers)
... but this doesn't help you adapt the process to ggscatter, so it is doubly not useful. I'll keep it, but don't go down this last path.
I have data like this
df<-structure(list(X1 = c(37L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, NA,
11L, 12L), X2 = c(40L, NA, 35L, 35L, 35L, 34L, NA, 28L, 28L,
NA, 25L, 24L), X3 = c(60L, 44L, 49L, 41L, NA, NA, NA, 25L, 26L,
NA, NA, 22L), T1 = c(19L, 55L, 47L, 46L, 36L, 42L, 25L, NA, 33L,
42L, 50L, 22L), T2 = c(75L, NA, 32L, 44L, 27L, 31L, 17L, NA,
18L, 45L, 10L, 11L), T3 = c(5L, 6L, 7L, 8L, 9L, 10L, 11L, NA,
46L, 36L, 42L, NA), P1 = c(2L, 2L, 3L, 4L, 2L, 6L, 7L, 8L, 9L,
NA, 1L, 12L), P2 = c(40L, 44L, 4L, 2L, 1L, 1L, NA, 1L, 1L, 1L,
5L, 55L), P3 = c(1L, 44L, 49L, 3L, NA, NA, NA, 25L, 26L, NA,
NA, 66L)), class = "data.frame", row.names = c(NA, -12L))
I have three groups, each with 3 columns; they are called X, T and P.
I am trying to find out how many rows in each group overlap with another group and how many rows in each group differ from another group (a row counts for a group only if it has at least 2 values in that group's columns).
So I am looking for an output like this:
X 10 rows overlapping with T and 2 different
T has 10 overlapping with X and 2 different
X has 10 overlapping with P and 1 different
T has 10 overlapping with P and 3 different
This means there are 10 rows that have at least 2 values in X1, X2 and X3 and also have values in the T group (T1, T2, T3). There is one row that is completely empty or has only 1 value in X but has values in the T group.
The same goes for the other combinations.
This question is still sort of ambiguous and narrow, but here is the general idea for tidying your data to the point where you can easily summarize over different groups and/or rows:
library(tidyverse)
df %>%
as_tibble %>%
rowid_to_column %>%
gather(key, value, -rowid) %>%
separate(key, into=c('group', 'column'), sep=1) %>%
group_by(group)
Extending along the lines of John Colby's answer, you can summarize how many rows are populated with 2 or more non-NA values in each letter's columns:
library(tidyverse)
df_summarized <- df %>%
rowid_to_column() %>%
gather(colname, value, -rowid) %>%
separate(colname, into = c("letter", "number"), sep = 1) %>%
count(rowid, letter, wt = !is.na(value), name = "num_values") %>%
mutate(populated = num_values >= 2)
> df_summarized
# A tibble: 36 x 4
rowid letter num_values populated
<int> <chr> <int> <lgl>
1 1 P 3 TRUE
2 1 T 3 TRUE
3 1 X 3 TRUE
4 2 P 3 TRUE
5 2 T 2 TRUE
6 2 X 2 TRUE
7 3 P 3 TRUE
8 3 T 3 TRUE
9 3 X 3 TRUE
10 4 P 3 TRUE
# ... with 26 more rows
And then use that to compare between letters. For instance, here I see that 9 rows have the same populated / not-populated status among X and T columns. Three rows (7, 8, and 10) differ in their populated status between those two letters.
> df_summarized %>%
+ select(-num_values) %>%
+ spread(letter, populated)
# A tibble: 12 x 4
rowid P T X
<int> <lgl> <lgl> <lgl>
1 1 TRUE TRUE TRUE
2 2 TRUE TRUE TRUE
3 3 TRUE TRUE TRUE
4 4 TRUE TRUE TRUE
5 5 TRUE TRUE TRUE
6 6 TRUE TRUE TRUE
7 7 FALSE TRUE FALSE # T but no X
8 8 TRUE FALSE TRUE # X but no T
9 9 TRUE TRUE TRUE
10 10 FALSE TRUE FALSE # T but no X
11 11 TRUE TRUE TRUE
12 12 TRUE TRUE TRUE
We could query the data like this to get the overlaps and non-overlaps:
df_summarized %>%
select(-num_values) %>%
spread(letter, populated) %>%
summarize(PT = sum(P==T),
PT_non = sum(P!=T),
TX = sum(T==X),
TX_non = sum(T!=X),
XP = sum(X==P),
XP_non = sum(X!=P))
# A tibble: 1 x 6
PT PT_non TX TX_non XP XP_non
<int> <int> <int> <int> <int> <int>
1 9 3 9 3 12 0
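Note that a comparison such as P==T also counts rows that are unpopulated in both groups as agreeing. If "overlapping" is instead taken to mean that a row is populated in both groups, the counts can be computed directly. This is a base R sketch under that assumption, on made-up data (toy, populated, both, only_x and only_t are illustrative names; "populated" means at least 2 non-NA values).

```r
toy <- data.frame(
  X1 = c(1, NA, 3), X2 = c(2, NA, NA), X3 = c(NA, NA, 5),
  T1 = c(1, 2, NA), T2 = c(4, 5, NA),  T3 = c(NA, 6, 7)
)

# For each letter, flag rows with at least 2 non-NA values in its columns.
populated <- sapply(c(X = "^X", T = "^T"), function(pattern) {
  cols <- grepl(pattern, names(toy))
  rowSums(!is.na(toy[, cols, drop = FALSE])) >= 2
})

both   <- sum(populated[, "X"] & populated[, "T"])    # populated in X and T
only_x <- sum(populated[, "X"] & !populated[, "T"])   # populated in X only
only_t <- sum(!populated[, "X"] & populated[, "T"])   # populated in T only

c(both = both, only_x = only_x, only_t = only_t)
```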