Subset data frame with matrix of logical values - r

Problem
I have data on two measures for four individuals each in a wide format. The measures are x and y and the individuals are A, B, C, D. The data frame looks like this
d <- data.frame(matrix(sample(1:100, 40, replace = F), ncol = 8))
colnames(d) <- paste(rep(c("x.", "y."),each = 4), rep(LETTERS[1:4], 2), sep ="")
d
x.A x.B x.C x.D y.A y.B y.C y.D
1 56 65 42 96 100 76 39 26
2 19 93 94 75 63 78 5 44
3 22 57 15 62 2 29 89 79
4 49 13 95 97 85 81 60 37
5 45 38 24 91 23 82 83 72
Now, what I would like to obtain for each row is the value of y for the individual with the lowest value of x.
So in the example above, the lowest value of x in row 1 is for individual C. Hence, for row 1 I would like to obtain y.C which is 39.
In the example, the resulting vector should be 39, 63, 89, 81, 83.
Approach
I have tried to get to this by first generating a matrix of the subset of d for the values of x.
t(apply(d[,1:4], 1, function(x) min(x) == x))
x.A x.B x.C x.D
[1,] FALSE FALSE TRUE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE
Now I wanted to use this matrix to subset the y columns of the data frame, but I cannot find a way to achieve this.
Any help is much appreciated. Suggestions for a totally different - more elegant - approach are highly welcome too.
Thanks a lot!

We subset the dataset into the columns starting with 'x' ('dx') and 'y' ('dy'). Get the column index of the minimum value in each row of 'dx' using max.col on the negated values (max.col finds row maxima, so the maximum of -dx is the minimum of dx, and 'first' breaks any ties), cbind it with the row index, and use the resulting two-column matrix to pick the corresponding elements of 'dy'.
dx <- d[grep('^x', names(d))]
dy <- d[grep('^y', names(d))]
dy[cbind(1:nrow(dx),max.col(-dx, 'first'))]
#[1] 39 63 89 81 83
The above can easily be converted to a function
get_min <- function(dat){
  dx <- dat[grep('^x', names(dat))]            # x columns
  dy <- dat[grep('^y', names(dat))]            # y columns
  dy[cbind(1:nrow(dx), max.col(-dx, 'first'))] # y value at each row's x minimum
}
get_min(d)
#[1] 39 63 89 81 83
Or using the OP's apply-based method
t(d[,5:8])[apply(d[,1:4], 1, function(x) min(x) == x)]
#[1] 39 63 89 81 83
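A caveat worth noting (a constructed tie case, not from the answers above): if a row's minimum is tied, the logical-matrix route returns more than one value for that row, while max.col(-dx, 'first') always picks exactly one.
# hypothetical data with a tie in row 1 (x.A == x.B)
dtie <- data.frame(x.A = c(1, 2), x.B = c(1, 5),
                   y.A = c(10, 30), y.B = c(20, 40))
t(dtie[, 3:4])[apply(dtie[, 1:2], 1, function(x) min(x) == x)]
#[1] 10 20 30   # row 1 contributes two values
dtie[, 3:4][cbind(1:2, max.col(-dtie[, 1:2], "first"))]
#[1] 10 30      # exactly one value per row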
data
d <- structure(list(x.A = c(56L, 19L, 22L, 49L, 45L),
  x.B = c(65L, 93L, 57L, 13L, 38L), x.C = c(42L, 94L, 15L, 95L, 24L),
  x.D = c(96L, 75L, 62L, 97L, 91L), y.A = c(100L, 63L, 2L, 85L, 23L),
  y.B = c(76L, 78L, 29L, 81L, 82L), y.C = c(39L, 5L, 89L, 60L, 83L),
  y.D = c(26L, 44L, 79L, 37L, 72L)),
  .Names = c("x.A", "x.B", "x.C", "x.D", "y.A", "y.B", "y.C", "y.D"),
  class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

Here is my solution. The core idea is that the functions which.min and which.max can be applied row-wise over the data frame:
Edit:
Now, what I would like to obtain for each row is the value of y for
the individual with the lowest value of x.
ind <- apply(d[, 1:4], 1, which.min)   # column index of the row minimum, per row
res <- d[, 5:8][cbind(1:nrow(d), ind)] # rows are in order; select values with an index matrix
names(res) <- colnames(d)[5:8][ind]    # name each result after the sampled y column
res
y.D y.B y.D y.A y.D
18 46 16 85 80
Caveat: this only works if the individuals appear in the same column order for the x. and y. measures and all individuals are present in both. Otherwise you can use grep as in akrun's solution; a sketch follows below the data.
# My d was:
x.A x.B x.C x.D y.A y.B y.C y.D
1 88 96 65 55 14 99 63 18
2 12 11 27 45 70 46 20 69
3 32 81 21 9 77 44 91 16
4 8 84 42 78 85 94 28 90
5 31 51 83 2 67 25 54 80
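A minimal order-robust sketch of that grep idea (assuming the x./y. prefix naming from the question; not part of the original answer):
# select the x columns by pattern, then reorder the y columns to match them
dx <- d[grep("^x\\.", names(d))]
dy <- d[sub("^x\\.", "y.", names(dx))]  # y columns in the same individual order as dx
dy[cbind(seq_len(nrow(d)), apply(dx, 1, which.min))]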

We can create a function as follows,
get_min <- function(x){
  d1 <- x[, 1:4]                                    # x columns
  d2 <- x[, 5:8]                                    # y columns
  mtrx <- as.matrix(d2[, apply(d1, 1, which.min)])  # column i holds the y values for row i's x minimum
  a <- row(mtrx) - col(mtrx)                        # zero exactly on the diagonal
  split(mtrx, a)$"0"                                # so this extracts mtrx[i, i]
}
get_min(d)
#[1] 39 63 89 81 83

Related

Calculating Percent Change in R for Multiple Variables

I'm trying to calculate percent change in R, with each of the time points included in the column label (table below). I have dplyr loaded, and my dataset was loaded in R and named data. Below is the code I'm using, but it's not calculating correctly. I want to create a new data frame called data_per_chg which contains the percent change of each variable from "v1". For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do that for all the remaining variables in my dataset. I'm assuming I can probably use a loop to do this easily, but I'm pretty new to R, so I'm not quite sure how to proceed. Any guidance will be greatly appreciated.
id  wbc.v1  wbc.v2  wbc.v3  rbc.v1  rbc.v2  rbc.v3  hct.v1  hct.v2  hct.v3
a1      23      63      30      23      56      90      13      89      47
a2      81      45      46     N/A      18      78      14      45      22
a3      NA      27      14      29      67      46      37      34      33
data_per_chg <- data %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(change = (wbc.v2 - wbc.v1)/(wbc.v1))
data_per_chg
Since the data mixes real NA values with "N/A" strings, first convert the "N/A" strings to NA and fix the column types, then compute each column's change from its matching v1 column
library(dplyr)
library(stringr)
data <- data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
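A hedged alternative sketch (not from the answer above): reshape to long format with tidyr, compute each visit's change from its v1 baseline within id/variable groups, and keep the result long. Uses the data object defined above.
library(dplyr)
library(tidyr)
data %>%
  mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>%
  type.convert(as.is = TRUE) %>%
  pivot_longer(-id, names_to = c("var", "visit"), names_sep = "\\.") %>%
  group_by(id, var) %>%
  mutate(change = (value - value[visit == "v1"]) / value[visit == "v1"]) %>%
  ungroup()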

Correlation of similar variables in R

I have slightly edited the data table.
I would like to correlate variables with similar names in my dataset:
A_y B_y C_y A_p B_p C_p
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30
..........
n 55 23 85 12 34 52
I would like to obtain correlation of
A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93
How can I do it in R? Is it possible?
This is really dangerous. Behavior of data.frames with invalid column names is undefined by the language definition. Duplicated column names are invalid.
You should restructure your input data. Anyway, here is an approach that works with it as-is.
DF <- read.table(text = " A B C A B C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)
sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
# A B C
#0.9995544 0.1585501 -0.6004010
#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544
Here is another base R option
within(
  rev(
    stack(
      Map(
        function(x) do.call(cor, unname(x)),
        split.default(df, unique(gsub("_.*", "", names(df))))
      )
    )
  ),
  ind <- sapply(
    ind,
    function(x) {
      paste0(grep(paste0("^", x), names(df), value = TRUE),
             collapse = "-")
    }
  )
)
which gives
ind values
1 A_y-A_p 0.9995544
2 B_y-B_p 0.1585501
3 C_y-C_p -0.6004010
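A simpler sketch under the same assumption (matched _y/_p suffix pairs with valid, unique column names; uses the df defined under Data below):
ys <- grep("_y$", names(df), value = TRUE)
ps <- sub("_y$", "_p", ys)
setNames(mapply(function(a, b) cor(df[[a]], df[[b]]), ys, ps),
         paste(ys, ps, sep = "-"))
#  A_y-A_p   B_y-B_p   C_y-C_p
#0.9995544 0.1585501 -0.6004010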
Data
df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L),
C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L,
46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))

Matching dataframes with data.table

I need to fill a matrix (MA) with information from a long data frame (DF), using another matrix as an identifier (MA.ID).
An idea of my three matrices:
MA.ID creates an identifier to look in the big DF the needed variables:
a b c
a ID.aa ID.ab ID.ac
b ID.ba ID.bb ID.bc
c ID.ca ID.cb ID.cc
The original big data frame contains information I don't need, but it also has the rows that are useful for filling the target MA matrix:
ID 1990 1991 1992
ID.aa 10 11 12
ID.ab 13 14 15
ID.ac 16 17 18
ID.ba 19 20 21
ID.bb 22 23 24
ID.bc 25 26 27
ID.ca 28 29 30
ID.cb 31 32 33
ID.cc 34 35 36
ID.xx 40 40 55
ID.xy 50 51 45
....
MA should be filled with the cross-referenced information. In my example it should look like this for a chosen column of DF (say, 1990):
a b c
a 10 13 16
b 19 22 25
c 28 31 34
I've tried to use match but honestly it didn't work out:
MA$a = DF[match(MA.ID$a, DF$ID),2]
I was recommended to use the data.table package but I couldn't see how that would help me.
Anyone has any good way to approach this problem?
Supposing that your inputs are data frames, you could do the following:
library(data.table)
setDT(ma)[, lapply(.SD, function(x) unlist(df[match(x, df$ID), "1990"])),
          .SDcols = colnames(ma)]
which returns:
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Explanation:
With setDT(ma) you transform the dataframe into a datatable (which is an enhanced dataframe).
With .SDcols=colnames(ma) you specify on which columns the transformation has to be applied.
lapply(.SD, function(x) unlist(df[match(x, df$ID), "1990"])) performs the matching operation on each column specified with .SDcols.
An alternative approach with data.table is first transforming ma to a long data.table:
ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value) # set key by which 'ma2' has to be indexed
setDT(df, key="ID") # transform to a datatable & set key by which 'df' has to be indexed
# joining the values of the 1990 column of df into
# the right place in the value column of 'ma'
ma2[df, value := `1990`]
which gives:
> ma2
variable value
1: a 10
2: b 13
3: c 16
4: a 19
5: b 22
6: c 25
7: a 28
8: b 31
9: c 34
The only drawback of this method is that the numeric values in the 'value' column get stored as character values. You can correct this by extending it as follows:
ma2[df, value := `1990`][, value := as.numeric(value)]
If you want to change it back to wide format you could use the rowid function within dcast:
ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]
which gives:
> ma3
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Used data:
ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"),
b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"),
c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")),
.Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))
df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"),
`1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L),
`1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L),
`1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)),
.Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))
In base R, it can be seen as a job for outer:
> outer(1:nrow(MA.ID), 1:ncol(MA.ID), Vectorize(function(x,y) {DF[which(DF$ID==MA.ID[x,y]),'1990']}))
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
Explanations:
outer creates a matrix from all combinations of the first argument X (here a b c) and the second argument Y (here the same, a b c)
for every combination of X and Y, it applies a function that looks up the value in DF, at the row where the ID is MA.ID[X,Y], and at the column 1990
the important trick here is to wrap the function with Vectorize, because outer expects a vectorized function (see the sketch below)
the result is finally returned as a matrix
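To see why the Vectorize wrapper is needed (a sketch, assuming the question's MA.ID and DF objects):
f <- function(x, y) DF[which(DF$ID == MA.ID[x, y]), "1990"]
# outer(1:3, 1:3, f)          # fails: f receives whole index vectors at once,
#                             # so MA.ID[x, y] is no longer a single cell
outer(1:3, 1:3, Vectorize(f)) # Vectorize() maps f over each (x, y) pair instead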
Alternatively, another way to do it (still in base R) is:
to convert your data frame MA.ID into a vector
sapply a quick function that looks up the correspondence with DF$ID
and convert back to a matrix.
This works:
> structure(
sapply(unlist(MA.ID),
function(id){DF[which(DF$ID==id),'1990']}),
dim=dim(MA.ID), names=NULL)
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
(here the call to structure(..., dim=dim(MA.ID), names=NULL) converts the vector back to a matrix)
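The same lookup can also be written in one line with match (a sketch, again assuming the question's MA.ID and DF objects):
# unlist() flattens MA.ID column by column; matrix() restores its shape
matrix(DF$`1990`[match(unlist(MA.ID), DF$ID)],
       nrow = nrow(MA.ID), dimnames = dimnames(MA.ID))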

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize the team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
group_by(Team) %>%
summarize(players = paste0(Person, collapse = ",")) %>%
group_by(players) %>%
summarize(teams = paste0(Team, collapse = ",")) %>%
mutate(
original_team = ifelse(grepl(",", teams), substr(teams, 1, gregexpr(",", teams)[[1]][1]-1), teams),
dup_teams = ifelse(grepl(",", teams), substr(teams, gregexpr(",", teams)[[1]][1]+1, nchar(teams)), NA)
)
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
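If you do want the wide layout from the question, a hedged follow-up sketch (assuming the pipeline's result above is stored as res; the into columns are hardcoded for up to two duplicates):
library(tidyr)
res %>%
  separate(teams, into = c("Team", "Duplicate1", "Duplicate2"),
           sep = ",", fill = "right")  # pads missing duplicate slots with NA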
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from #Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators of membership for each person. You could easily rename them if you like with setNames(), for example as below. But here it collapses the appropriate teams.
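For example (hypothetical labels, reusing tt and grp from above):
grp_list <- split(rownames(tt), grp)
setNames(grp_list, paste0("group", seq_along(grp_list)))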
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would require adding NAs or empty strings in some of the column entries, because in a data.frame all columns must have the same number of rows. That is different for lists, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
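That warning comes from forging the data frame attributes by hand. A sturdier sketch (not from the original answer, starting again from the unstack() list) pads the shorter groups with NA instead:
padded <- lapply(printout, `length<-`, max(lengths(printout)))  # pad each group with NA
as.data.frame(padded, check.names = FALSE)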

Deriving contingency table from a bigger contingency table in R

I have a CSV-format contingency table made with Python, like this:
case control
disease_A 20 30
disease_B 35 45
disease_C 42 52
disease_D 52 62
Now I want to derive 2x2 contingency tables from this contingency table to calculate chi-square values using R.
How can I derive a 2x2 table like the following from the contingency table above:
case control
disease_A 20 30
disease_D 52 62
That's probably a novice question, but I'm new to R and I couldn't find the solution anywhere else.
Here's an approach.
The data:
txt <- " case control
disease_A 20 30
disease_B 35 45
disease_C 42 52
disease_D 52 62"
Read the data:
dat <- read.table(textConnection(txt))
# case control
# disease_A 20 30
# disease_B 35 45
# disease_C 42 52
# disease_D 52 62
Extract a subset of rows:
dat2 <- dat[rownames(dat) %in% c("disease_A", "disease_D"), ]
# case control
# disease_A 20 30
# disease_D 52 62
If M is of class table
M <- structure(c(20, 35, 42, 52, 30, 45, 52, 62), .Dim = c(4L, 2L), .Dimnames = list(
c("disease_A", "disease_B", "disease_C", "disease_D"), c("case",
"control")), class = "table")
xtabs(Freq ~ Var1 + Var2,
      data = subset(as.data.frame(M, stringsAsFactors = FALSE),
                    Var1 %in% c("disease_A", "disease_D")))
Var2
Var1 case control
disease_A 20 30
disease_D 52 62
If M is a data.frame
M <- structure(list(case = c(20L, 35L, 42L, 52L), control = c(30L,
45L, 52L, 62L)), .Names = c("case", "control"), class = "data.frame", row.names = c("disease_A",
"disease_B", "disease_C", "disease_D"))
as.table(as.matrix(M[grep("A|D", rownames(M)),]))
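As a follow-up sketch for the stated goal (not part of the original answers), the extracted 2x2 table can go straight into chisq.test:
tab <- as.table(as.matrix(M[grep("A|D", rownames(M)), ]))
chisq.test(tab)  # Pearson's chi-squared test, with Yates' continuity correction by default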
