recursive data subset based on column attributes in R

I have a data frame with 10K rows and 6 columns. The first two columns are factors.
A B C D E F
A1 B1 0.1 0.2 0.3 0.4
A2 B2 .........................
A1 B3 .........................
A1 B1 0.3 ...................
Now I want to generate models (using my function F) based on different subsets of the data (different rows), that is, different combinations of the attributes of A and B.
In my example above, I should call my function F 6 times, once for each combination in the Cartesian product of A and B:
(A1, A2) x (B1, B2, B3). I wonder how to do this in R efficiently, without an explicit loop?
To avoid confusion:
e.g., applying F to the (A1, B1) combination means using rows 1 and 4, columns 3 to 6.
The other combinations are handled similarly.

Try:
lapply(seq_len(nlevels(df$A) * nlevels(df$B)) - 1, function(x)
  myFunction(df[df$A == paste0("A", 1 + floor(x / nlevels(df$B))) &
                df$B == paste0("B", 1 + (x %% nlevels(df$B))), ]))
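A more idiomatic alternative is to split the data frame on the two factors and apply the model-fitting function to each piece (a minimal sketch; myFunction stands in for the asker's F, and drop = TRUE removes A/B combinations that have no rows):
# one data frame per (A, B) combination, model fitted on columns C to F
pieces <- split(df, list(df$A, df$B), drop = TRUE)
models <- lapply(pieces, function(d) myFunction(d[, 3:6]))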

Related

Assign 0 or 1 based on conditional probability + previous simulated data

I'm doing a simulation study and I have some problem generating data that meet certain conditions.
My first simulated data looks like below.
A1 A2
1 0.8 6
2 0.5 3
3 0.9 2
...
1000
This is how I generated A1 & A2
set.seed(47)
df <- data.frame(A1 = rnorm(1000, mean=0.7, sd=0.1), A2 = rnorm(1000, mean=4, sd=1))
df
In tabular format, this is how the conditional statement looks, where 0 = fail and 1 = pass and the entries are the probability of getting a 1 for A3:
                A1 = 0 (fail)   A1 = 1 (pass)
A2 = 0 (fail)        0.1             0.3
A2 = 1 (pass)        0.9             0.7
Here is the explanation in words:
I want to generate a third column (A3) based on conditional probabilities involving the first two columns. This is the condition I want to apply:
If A1 >= 0.7 (pass) & A2 >= 0.8 (pass) --> A3 = 1 with a 70% probability (implying a 30% probability of 0)
If A1 >= 0.7 (pass) & A2 < 0.8 (fail) --> A3 = 1 with a 30% probability
If A1 < 0.7 (fail) & A2 >= 0.8 (pass) --> A3 = 1 with a 90% probability
If A1 < 0.7 (fail) & A2 < 0.8 (fail) --> A3 = 1 with a 10% probability
I hope my logic makes sense. Please let me know if I need more data or words to better explain. Thank you.
You could use a little trick here of converting logical vectors to integers then counting in binary.
If you do the logical test df$A1 >= 0.7 you get a vector of TRUE and FALSE values. If instead you do as.numeric(df$A1 >= 0.7) you get the equivalent vector of 1s and 0s. The trick is to do this for both variables, but multiply the second vector by 2. Now if you add both vectors together, you will get a number between 0 and 3 that corresponds to your truth table:
A1 pass, A2 pass = 3
A1 fail, A2 pass = 2
A1 pass, A2 fail = 1
A1 fail, A2 fail = 0.
Note that if we add one to these numbers, we get a value between one and four. We can therefore use them as indexes of our probability vector:
probs <- c(0.1, 0.3, 0.9, 0.7)[(df$A1 >= 0.7) + 2*(df$A2 >= 0.8) + 1]
That means we can generate the random binary numbers using rbinom like so:
df$A3 <- rbinom(1000, 1, probs)
Resulting in:
head(df)
#> A1 A2 A3
#> 1 0.8994696 5.345481 1
#> 2 0.7711143 3.662635 1
#> 3 0.7185405 3.125840 1
#> 4 0.6718235 3.914527 0
#> 5 0.7108776 3.366858 1
#> 6 0.5914263 2.082173 0
Created on 2022-09-30 with reprex v2.0.2
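To sanity-check the simulated column, you can compare the empirical pass rate of A3 within each of the four cells against the target probabilities (a quick check, reusing the df and thresholds from above):
# empirical P(A3 = 1) per cell; should be close to c(0.1, 0.3, 0.9, 0.7)
cell <- (df$A1 >= 0.7) + 2 * (df$A2 >= 0.8) + 1
tapply(df$A3, cell, mean)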

R Compare non side-by-side duplicates in 2 columns

There are many similar questions but I'd like to compare 2 columns and delete all the duplicates in both columns so that all that is left is the unique observations in each column. Note: Duplicates are not side-by-side. If possible, I would also like a list of the duplicates (not just TRUE/FALSE). Thanks!
C1 C2
1 a z
2 c d
3 f a
4 e c
would become
C1 C2
1 f z
2 e d
with duplicate list
duplicates: a, c
Here is one answer.
where_dupe <- which(apply(df, 2, duplicated), arr.ind = TRUE)
This gives you the row/column locations of elements that are repeated within a column of your original data frame.
col_unique <- setdiff(1:ncol(df), where_dupe[, "col"])
This gives you the columns that had no such duplicates, and you can look up their values by indexing:
df[, col_unique]
Here is a base R method using duplicated and lapply.
temp <- unlist(df)
# get duplicated elements
myDupeVec <- unique(temp[duplicated(temp)])
# get list without duplicates
noDupesList <- lapply(df, function(i) i[!(i %in% myDupeVec)])
noDupesList
$C1
[1] "f" "e"
$C2
[1] "z" "d"
data
df <- read.table(header=T, text=" C1 C2
1 a z
2 c d
3 f a
4 e c ", as.is=TRUE)
Note that this returns a list, which is a more flexible structure, since after dropping the duplicated values the columns will in general be left with different lengths. If they do end up the same length, you can use do.call and data.frame to put the result into a rectangular structure.
do.call(data.frame, noDupesList)
C1 C2
1 f z
2 e d
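If duplicates only ever appear across the two columns (never repeated within a single column), a shorter alternative sketch with intersect and setdiff also gives the duplicate list the question asks for:
# values present in both columns -- these are the "duplicates"
dupes <- intersect(df$C1, df$C2)
dupes
# [1] "a" "c"
# keep only the values unique to each column
lapply(df, setdiff, y = dupes)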

vectorize combining two matrices in R

I have a data frame A as follows (the numbers are totally made up)
ID statistic p.value
1 4 .1
2 5 .3
3 3 .4
4 2 .4
5 1 .5
6 7 .8
and data frame B as follows:
ID Info1 Info2 ....
4 A1 B1
5 A2 B2
2 A3 ..
3 A4
1 A5
6 A6
7 A7
9 A8
8 A9
How would I cbind data frame A to data frame B in the correct order without a loop? I know I need to do something like
cbind(A, B[something in here, ]), but how do I get the ordering? Do I use a which statement? Something else?
Too long for a comment.
So if I understand you correctly (from your question and all the comments), A and B are extremely large data frames. A has an ID column, and B has the IDs in the row names.
You should definitely use data.tables for this. Assuming you are pulling in the data from some kind of text file, read up on fread(...) in the data.table package. This will read the file directly into a data.table. fread(...) is extremely fast: 10 - 100 times faster than read.table(...) or read.csv(...) for large datasets.
Below is a comparison of the data frame approach with merge(...) and the data.table join approach.
data.frame approach
N <- 1e7 # 10 million rows; big enough??
set.seed(1) # for reproducible example
A <- data.frame(ID=1:N,statistic=sample(1:10,N,replace=T),pvalue=runif(N),stringsAsFactors=F)
B <- data.frame(info1=sample(LETTERS,N,replace=T),info2=sample(letters,N,replace=T),stringsAsFactors=F)
rownames(B) <- sample(1:N,N) # row names in random order in B
system.time({
# this does the work...
B$ID <- as.integer(rownames(B))
result <- merge(B,A,by="ID")
})
# user system elapsed
# 285.75 3.15 289.33
data.table approach
set.seed(1)
A <- data.frame(ID=1:N,statistic=sample(1:10,N,replace=T),pvalue=runif(N),stringsAsFactors=F)
B <- data.frame(info1=sample(LETTERS,N,replace=T),info2=sample(letters,N,replace=T),stringsAsFactors=F)
rownames(B) <- sample(1:N,N)
library(data.table)
system.time({
# this does the work...
IDs <- as.integer(rownames(B))
setDT(A)
setDT(B)
B[,ID:=IDs]
setkey(A,ID)
setkey(B,ID)
B[A,c("statistic","pvalue"):=list(statistic,pvalue=pvalue)]
})
# user system elapsed
# 122.46 0.40 122.87
So the data.table approach is twice as fast in this example. But most of the time is spent converting the rownames to a column, so if you can read them into a column to begin with, and especially if you can read the data directly into data.tables using fread(...), this will be much faster.
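If the data fit comfortably in memory and you only need the ordering the question asks about, match() on the row names does the same alignment without a join (a minimal sketch, assuming B's IDs live in its row names as in the question):
# reorder B so its rows line up with A$ID, then bind the columns
idx <- match(A$ID, rownames(B))
result <- cbind(A, B[idx, , drop = FALSE])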

In R how to make two columns an ID and get a frequency histogram for each ID

Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
So here I would like to create 3 frequency histograms based on the first two columns, i.e. A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it's like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution, based on the by function, which is just a wrapper for the tapply approach that Jilber suggested. You might find the 'ex' variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat,
         # Create group names like A.1, A.2, ...
         INDICES = apply(dat[, c("First", "Second")], MARGIN = 1,
                         FUN = function(z) paste(z, collapse = ".")),
         # Extract the third column for each group
         FUN = function(x) x[, 3])
# Draw histograms
par(mfrow = c(3, 2))
for (i in 1:length(ex)) {
  hist(ex[[i]], main = names(ex)[i], xlim = extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a, b, c respectively, I think this command should do the trick:
library(lattice)
histogram(~ c | a + b, x)
Notice that this requires you to have the lattice package installed.
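A faceted histogram via ggplot2 is another common route to the same plot (a sketch, assuming the column names First, Second and Num from the by-based answer above, with ggplot2 installed):
library(ggplot2)
# one histogram panel per First/Second combination
ggplot(dat, aes(Num)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ First + Second)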

subset rows and columns in a dataframe based on boundary conditions

I have some problems expressing myself; probably that is why I haven't found anything that helps me yet. The example should make clear what I want.
Suppose I have an m x m matrix structure of coordinates, say ranging from A1 to E5, and I want to subset the rows/columns which are k lines away from the outer coordinates.
In my example k is 2, so I want to select all records in the data frame which have the coordinates B2, B3, B4, C2, C4, D2, D3, D4. Manually, I would do the following:
cc <- data.frame(x=(LETTERS[1:5]), y=c(rep(1,5),rep(2,5),rep(3,5), rep(4,5), rep(5,5)) , z=rnorm(25))
slct <- with(cc, which( (x=="B" | x=="C" | x=="D" ) & (y==2 | y==3 | y==4) & !(x=="C" & y==3) ))
cc[slct,] # result data frame
But if the matrix dimensions increase, that approach will not scale well. Any better ideas?
Rather hard to read but it does the trick.
m <- 5 # Matrix dimensions
k <- 2 # The index of the inner square that you want to extract
cc[(cc$x %in% LETTERS[c(k,m-k+1)] & !cc$y %in% c(1:(k-1), m:(m-k+2))) |
(cc$y %in% c(k, m-k+1) & !cc$x %in% LETTERS[c(1:(k-1), m:(m-k+2))]),]
The first line of comparisons extracts the k-th column from the left and right edges of the matrix, but not the parts that are closer than k to the upper and lower edges. The second line does the same thing for rows.
Another option is to paste the coordinates together and match them against the labels you want:
cc$xy <- paste0(cc$x, cc$y)
coords <- c("B2","B3","B4", "C2", "C4", "D2", "D3", "D4")
cc[cc$xy %in% coords,]
# x y z xy
#7 B 2 -0.9031472 B2
#8 C 2 -0.1405147 C2
#9 D 2 1.6017619 D2
#12 B 3 1.7713041 B3
#14 D 3 -0.2005749 D3
#17 B 4 1.8671238 B4
#18 C 4 0.3428815 C4
#19 D 4 0.1470436 D4
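For larger matrices you can avoid typing the coordinates out by computing, for every cell, its distance from the nearest border and keeping the cells exactly k steps in (a sketch, reusing m, k and cc from above, and assuming the letter coordinates map to positions via match against LETTERS):
xi <- match(cc$x, LETTERS)  # numeric position of the letter coordinate
yi <- cc$y
# distance of each cell from the nearest border (1 = on the border)
d <- pmin(xi, yi, m - xi + 1, m - yi + 1)
cc[d == k, ]  # the ring of cells exactly k steps in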
