Able to specify thresholds for calculating TPR and FPR using pROC in R?

I am calculating TPR and FPR using the pROC package. Am I able to specify the thresholds I want in the calculation using the package?
I want to calculate TPR and FPR for thresholds from 0 to 1, in 0.05 increments.
This is the data set I am working with:
structure(list(prediction_resp_4 = c(0.0093660156194744, 0.63696691410065,
0.693562340217509, 0.850026939982271, 0.0921374166454612, 0.223883311111169,
0.000699258172241612, 0.117062385395824, 0.951947429014154, 0.714711536699156,
0.230100717565363, 0.839895799034341, 0.149678433930086, 0.0913803675468538,
0.86430898026459, 0.0807110314548418, 0.452757912184497, 0.819293921115556,
0.0700190883640999, 0.44900095299551, 0.803772423123997, 0.373799624421601,
0.122405205571954, 0.858831937028595, 0.276135791757235, 0.86129869300195,
0.674060141486476, 0.303046534598074, 0.356020758015023, 0.0246899999008411,
0.670342328628664, 0.0178170992678319, 0.0945567242524256, 0.0110559559742041,
0.356077534809716, 0.0792480681507026, 0.630756724182966, 0.0165338433136149,
0.816750535548877, 0.661098390528446, 0.0587373125478858, 0.315062410973728,
0.831315518918304, 0.463520030831427, 0.725937488979879, 0.301643645590828,
0.288625193696339, 0.9038875106375, 0.780722912230085, 0.37912106477669,
0.136094212636133, 0.503643519530075, 0.544482442341009, 0.575738927352128,
0.356077534809716, 0.722011034808203, 0.760550508601042, 0.603109270061287,
0.793014589613734, 0.834485477242473, 0.783008040183127, 0.365330782046478,
0.022358212647161, 0.0884525015278602, 0.200257196356859, 0.912502624283191,
0.230100717565363, 0.112122111461138, 0.453938368209989, 0.704600061065344,
0.224872418284352, 0.395491910748845, 0.999703986760998, 0.794479788600805,
0.385076713799569, 0.0305635117938959, 0.92898574855535, 0.163314780984271,
0.893410014409946, 0.496199240836053, 0.618023472980794, 0.584273518401166,
0.295133623201644, 0.12042873699888, 0.251479713644139, 0.825885814333607,
0.674317836386347, 0.371047453863054, 0.645239618141106, 0.00544077442795404,
0.377910289600606, 0.936696423985203, 0.418497325382622, 0.871684421084382,
0.345285714385491, 0.835470162627044, 0.0581701844461216, 0.612133334197249,
0.675206715878502, 0.667971057122422), landslide.validation.lslpts = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0)), row.names = c(NA,
100L), class = "data.frame")
where I have the response and the prediction.
The pROC package lets me calculate these, but only over all possible thresholds.
My current code is as follows:
library(pROC)
myRoc <- roc(response = new_df$landslide.validation.lslpts, predictor = new_df$prediction_resp_4)
ROC <- data.frame(myRoc$sensitivities, myRoc$specificities, myRoc$thresholds)
I expect to calculate TPR and FPR for thresholds from 0 to 1 in 0.05 increments. How can I go about doing it?
Any help will be appreciated. Thank you.

Although you can't specify which thresholds to use to build the ROC curve itself (because a ROC curve goes over all thresholds by definition), you can easily extract the information you want with the coords function:
> coords(myRoc, seq(0, 1, 0.05))
0 0.05 0.1 0.15 0.2 0.25 0.3 ...
threshold 0 0.0500000 0.1000000 0.1500000 0.2000000 0.2500000 0.3000000 ...
specificity 0 0.1568627 0.3333333 0.4509804 0.4705882 0.5098039 0.5882353 ...
sensitivity 1 0.9795918 0.9795918 0.9795918 0.9795918 0.9183673 0.9183673 ...
Noting that:
TPR = sensitivity
FPR = 1 − specificity
Please note that, although you have FPR and TPR, this is not a ROC curve, as it doesn't go over all possible thresholds.
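To turn that output into explicit TPR/FPR columns, here is a minimal sketch (assuming a recent pROC version, where coords() can return a data frame via transpose = FALSE; new_df is the data frame from the question):
library(pROC)

# ROC object from the question
myRoc <- roc(response = new_df$landslide.validation.lslpts,
             predictor = new_df$prediction_resp_4)

# sensitivity/specificity at the requested thresholds (0 to 1 by 0.05)
cc <- coords(myRoc, x = seq(0, 1, 0.05),
             ret = c("threshold", "sensitivity", "specificity"),
             transpose = FALSE)

# TPR = sensitivity, FPR = 1 - specificity
rates <- data.frame(threshold = cc$threshold,
                    TPR = cc$sensitivity,
                    FPR = 1 - cc$specificity)
rates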

Related

Chi Square Test of Independence of Whole Dataset

I have a 3185x90 dataset of binary values and want to do a chi-squared test of independence, comparing all column variables against each other.
I've tried using different variations of code from Google searches with chisq.test() and some for loops, but none of them have worked so far.
How do I do this?
This is the frame I've tinkered with. My dataset is oak.
chi_trial <- data.frame(a = c(0,1), b = c(0,1))
for(row in 1:nrow(oak)){
  print(row)
  print(chisq.test(c(oak[row,1], d[row,2])))
}
I also tried this:
apply(d, 1, chisq.test)
which gives me the error: Error in FUN(newX[, i], ...) :
all entries of 'x' must be nonnegative and finite
dput(oak[1:2])
structure(list(post_flu = structure(c(1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
label = "Receipt of Flu Vaccine - Encounter Survey", format.stata = "%10.0g")), row.names = c(NA,
-3185L), class = c("tbl_df", "tbl", "data.frame"), label = "Main Oakland Clinic Analysis Dataset")
I added a sample of my data with the final lines of the output. The portion of the dataset is small, but it all looks like this.
You could use something like the code below, which is similar to R's cor function. I don't have your data, so I'm simulating some. Note that I get one significant p-value, using the traditional cut-off of 0.05.
set.seed(3)
nr=3185; nc=3
oak <- as.data.frame(matrix(sample(0:1, size=nr*nc, replace=TRUE), ncol=nc))
oak
mult.chi <- function(data){
  nc <- ncol(data)
  res <- matrix(0, nrow=nc, ncol=nc) # or NA
  for(i in 1:(nc-1))
    for(j in (i+1):nc)
      res[i,j] <- suppressWarnings(chisq.test(data[,i], data[,j])$p.value)
  rownames(res) <- colnames(data)
  colnames(res) <- colnames(data)
  res
}
mult.chi(oak)
# V1 V2 V3
# V1 0 0.7847063 0.32012466
# V2 0 0.0000000 0.01410326
# V3 0 0.0000000 0.00000000
So consider applying a multiple testing adjustment as mentioned in the comments.
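As a sketch of such an adjustment (assuming the mult.chi() result above; the "BH" method is just one common choice), p.adjust() can be applied to the upper-triangle p-values:
pvals <- mult.chi(oak)

# adjust only the upper triangle, which holds the actual test p-values
adjusted <- pvals
adjusted[upper.tri(adjusted)] <- p.adjust(pvals[upper.tri(pvals)], method = "BH")
adjusted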
Here is a solution with combn to get all combinations of column numbers 2 by 2. Tested with the data in #Edward's answer.
chisq2cols <- function(X){
  y <- matrix(0, ncol(X), ncol(X))
  cmb <- combn(ncol(X), 2)
  y[upper.tri(y)] <- apply(cmb, 2, function(k){
    tbl <- table(X[k])
    chisq.test(tbl)$p.value
  })
  y
}
chisq2cols(oak)
# [,1] [,2] [,3]
#[1,] 0 0.7847063 0.32012466
#[2,] 0 0.0000000 0.01410326
#[3,] 0 0.0000000 0.00000000

Intersecting ranges of consecutive values in logical vectors in R

I have two logical vectors which look like this:
x = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
I would like to count the intersections between ranges of consecutive values. Meaning that consecutive values (of 1s) are handled as one range. So in the above example, each vector contains one range of 1s and these ranges intersect only once.
Is there any R package for range intersections which could help here?
I think this should work (calling your logical vectors x and y):
sum(rle(x & y)$values)
A few examples:
x = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 1
x = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
y = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 2
x = c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0)
y = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)
sum(rle(x & y)$values)
# [1] 3
By way of explanation, x & y gives the intersections on a per-element level, rle collapses runs of adjacent intersections, and sum counts.
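To make the mechanics visible, here is a small sketch with the third example above (outputs shown as comments, lightly abbreviated):
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0)
y <- c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)

x & y
# [1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE

rle(x & y)
# Run Length Encoding
#   lengths: int [1:7] 1 1 3 2 3 1 2
#   values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE FALSE

sum(rle(x & y)$values)  # each TRUE run contributes 1 to the sum
# [1] 3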

Optimum algorithm to check various combinations of items when number of items is too large

I have a data frame which has 20 columns/items and 593 rows (the number of rows doesn't matter, though); a sample is shown further below.
Using this, the reliability of the test comes out as 0.94, computed with psych::alpha from the psych package. The output also gives me the new value of Cronbach's alpha if I drop one of the items. However, I want to know how many items I can drop and still retain an alpha of at least 0.8. I used a brute force approach for this, where I create every combination of the items in my data frame and check whether its alpha is in the range (0.7, 0.9). Is there a better way of doing this? It is taking forever to run because the number of items is too large to check every combination. Below is my current piece of code:
numberOfItems <- 20
for(i in 2:(2^numberOfItems - 1)){
  # ignoring the first case i.e. i=1, as it doesn't represent any model
  # convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
  # using the binaryLogic package
  combination <- as.binary(i, n=numberOfItems)
  model <- c()
  for(j in 1:length(combination)){
    # choose which columns to consider depending on the combination
    if(combination[j])
      model <- c(model, j)
  }
  itemsToUse <- itemResponses[, c(model)]
  #cat(model)
  if(length(model) > 13){
    alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
    if(alphaVal > 0.7 && alphaVal < 0.9){
      cat(alphaVal)
      print(model)
    }
  }
}
A sample output from this code is as follows:
0.8989831 1 4 5 7 8 9 10 11 13 14 15 16 17 19 20
0.899768 1 4 5 7 8 9 10 11 12 13 15 17 18 19 20
0.899937 1 4 5 7 8 9 10 11 12 13 15 16 17 19 20
0.8980605 1 4 5 7 8 9 10 11 12 13 14 15 17 19 20
Here are the first 10 rows of data:
dput(itemResponses)
structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2,
3, 1, 0, 0, 1, 1, 1, 0, 1), CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0,
0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0), CESD5 = c(0,
1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0,
0, 0), CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1,
3, 1, 1, 0, 1, 0, 0, 1, 0), CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1,
0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD11 = c(0,
2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0,
0, 0), CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0,
3, 1, 2, 1, 1, 1, 0, 1, 1), CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1,
1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0), CESD17 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0,
0, 1), CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0,
3, 0, 1, 0, 0, 0, 0, 0, 0)), .Names = c("CESD1", "CESD2", "CESD3",
"CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9", "CESD10",
"CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The idea is to replace the computation of alpha with the so-called discrimination for each item from classical test theory (CTT). The discrimination is the correlation of the item score with a "true score" (which we would assume to be the row sum).
Let the data be
dat <- structure(list(CESD1 = c(1, 2, 2, 0, 1, 0, 0, 0, 0, 1), CESD2 = c(2, 3, 1, 0, 0, 1, 1, 1, 0, 1),
CESD3 = c(0, 3, 0, 1, 1, 0, 0, 0, 0, 0), CESD4 = c(1, 2, 0, 1, 0, 1, 1, 1, 0, 0),
CESD5 = c(0, 1, 0, 2, 1, 2, 2, 0, 0, 0), CESD6 = c(0, 3, 0, 1, 0, 0, 2, 0, 0, 0),
CESD7 = c(1, 2, 1, 1, 2, 0, 1, 0, 1, 0), CESD8 = c(1, 3, 1, 1, 0, 1, 0, 0, 1, 0),
CESD9 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1), CESD10 = c(0, 1, 0, 2, 0, 0, 1, 1, 0, 1),
CESD11 = c(0, 2, 1, 1, 1, 1, 2, 3, 0, 0), CESD12 = c(0, 3, 1, 1, 1, 0, 2, 0, 0, 0),
CESD13 = c(0, 3, 0, 2, 1, 2, 1, 0, 1, 0), CESD14 = c(0, 3, 1, 2, 1, 1, 1, 0, 1, 1),
CESD15 = c(0, 2, 0, 1, 0, 1, 0, 1, 1, 0), CESD16 = c(0, 2, 2, 0, 0, 1, 1, 0, 0, 0),
CESD17 = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0), CESD18 = c(0, 2, 0, 0, 0, 0, 0, 0, 0, 1),
CESD19 = c(0, 3, 0, 0, 0, 0, 0, 1, 1, 0), CESD20 = c(0, 3, 0, 1, 0, 0, 0, 0, 0, 0)),
.Names = c("CESD1", "CESD2", "CESD3", "CESD4", "CESD5", "CESD6", "CESD7", "CESD8", "CESD9",
"CESD10", "CESD11", "CESD12", "CESD13", "CESD14", "CESD15", "CESD16", "CESD17",
"CESD18", "CESD19", "CESD20"), row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
We compute (1) the discrimination and (2) the alpha coefficient.
stat <- t(sapply(1:ncol(dat), function(ii){
  dd <- dat[, ii]
  # discrimination is the correlation of the item to the rowsum
  disc <- if(var(dd, na.rm = TRUE) > 0) cor(dd, rowSums(dat[, -ii]), use = "pairwise")
  # alpha that would be obtained when we skip this item
  alpha <- psych::alpha(dat[, -ii])$total$raw_alpha
  c(disc, alpha)
}))
dimnames(stat) <- list(colnames(dat), c("disc", "alpha^I"))
stat <- data.frame(stat)
Observe that the discrimination (which is more efficient to compute) is inversely related to the alpha obtained when deleting that item. In other words, alpha is highest when there are many highly "discriminating" items (items that correlate with each other).
plot(stat, pch = 19)
Use this information to select the sequence with which the items should be deleted to fall below a benchmark (say .9, since the toy data doesn't allow for a lower mark):
1) delete as many items as possible to stay above the benchmark; that is, start with the least discriminating items.
stat <- stat[order(stat$disc), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
  ind <- match(rownames(stat)[1:ii], colnames(dat))
  alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
2) delete as few items as possible to stay above the benchmark; that is, start with the highest discriminating items.
stat <- stat[order(stat$disc, decreasing = TRUE), ]
this <- sapply(1:(nrow(stat)-2), function(ii){
  ind <- match(rownames(stat)[1:ii], colnames(dat))
  alpha <- psych::alpha(dat[, -ind, drop = FALSE])$total$raw_alpha
})
delete_these <- rownames(stat)[which(this > .9)]
psych::alpha(dat[, -match(delete_these, colnames(dat)), drop = FALSE])$total$raw_alpha
length(delete_these)
Note that 1) is consistent with classical item selection procedures in (psychological/educational) diagnostics/assessment: remove items from the assessment that fall below a benchmark in terms of discriminatory power.
I changed the code as follows: now I am dropping a fixed number of items and changing the value of numberOfItemsToDrop from 1 to 20 manually. Although it is a little better, it is still taking too long to run :(
I hope there is some better way of doing this.
numberOfItemsToDrop <- 13
combinations <- combinat::combn(20, numberOfItemsToDrop)
timesToIterate <- length(combinations)/numberOfItemsToDrop
for(i in 1:timesToIterate){
  model <- combinations[,i]
  itemsToUse <- itemResponses[, -c(model)]
  alphaVal <- psych::alpha(itemsToUse)$total$raw_alpha
  if(alphaVal < 0.82){
    cat("Cronbach's alpha =", alphaVal, ", number of items dropped = ", length(model), " :: ")
    print(model)
  }
}
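For a sense of why both brute-force versions are slow, the number of candidate subsets can be checked directly (a quick sketch):
# subsets of exactly 13 items to drop
choose(20, 13)
# [1] 77520

# all non-empty subsets of the 20 items (the original loop)
2^20 - 1
# [1] 1048575
Evaluating psych::alpha() over that many subsets is what makes the exhaustive search so expensive.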

Extract value from table function in R (no factors)

I have this data frame
d1 <- c(1, 0, 0, 1, 0, 0, 0, 1)
d2 <- c(0, 1, 0, 1, 1, 0, 0, 0)
d3 <- c(0, 0, 1, 0, 0, 0, 1, 0)
d4 <- c(0, 0, 0, 1, 0, 0, 0, 0)
d5 <- c(0, 0, 0, 0, 0, 0, 1, 0)
d6 <- c(0, 0, 0, 1, 0, 1, 0, 1)
d7 <- c(0, 0, 1, 0, 0, 1, 0, 1)
d8 <- c(1, 0, 0, 0, 0, 0, 0, 1)
d9 <- c(0, 0, 0, 0, 0, 1, 0, 1)
d10 <- c(1, 1, 0, 0, 0, 1, 0, 1)
df <- as.data.frame(rbind(d1,d2,d3,d4,d5,d6,d7,d8,d9,d10))
str(df)
I get all lines where V8 == 1, and find the relative frequencies for each column like this (for example column 2, V2):
table(df[which(df$V8==1),][2])/sum(as.numeric(df[which(df$V8==1),]$V8))
0 1
0.8333333 0.1666667
My question is how I can get each relative frequency individually, say, assigned to a new variable. I found this:
How to extract value from table function in R
but it does not work in my case, since 0 and 1 are numeric.
table(df[which(df$V8==1),][2])/sum(as.numeric(df[which(df$V8==1),]$V8))["1"]
Use as.numeric, and then convert the counts to ratios.
The values 0 and 1 are extracted with
as.numeric(names(table(data)))
and the corresponding counts are extracted with
counts <- as.numeric(table(data))
then
ratios <- counts/sum(counts)
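Applied to the data in the question, a minimal sketch of that idea (the names freqs, freq0, and freq1 are just illustrative):
freqs <- table(df[which(df$V8 == 1), ][2]) /
  sum(as.numeric(df[which(df$V8 == 1), ]$V8))

# pick out each relative frequency by its name ("0" or "1")
freq0 <- freqs[["0"]]
freq1 <- freqs[["1"]]
freq1
# [1] 0.1666667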
Not completely sure about what you're trying to do but...
sapply(subset(df, V8==1), function(x) sum(x==1)/length(x))

Differences between merge and match functions in R

Hi everybody, I removed my last post to make a reproducible example of my problem. I am working with the following two data frames. Here is a1 (dput structure):
structure(list(r04_numero_operacion = c("0050475725", "0050490602",
"0050491033", "0050496386", "0050518985", "0050630090", "0050631615",
"0060235906", "0060238732", "0060241333", "0060244391", "0060245813",
"0060260056", "0060266356", "0800041441", "0800054041", "0800055382",
"0800058554", "2020200062", "2020200073", "CAR1010001706000",
"CAR1010001795000", "CAR1010001803000", "CAR1010001871000", "CAR1010001962000",
"CAR1010002002000", "CAR1010002120000", "CAR1010002189000", "CAR1010002215000",
"CAR1010002250000"), perdida3 = c(523.12, 265.43, 8371.66, 5242.13,
4960.51, 8473.27, 3743.45, 1283.32, 2229.25, 8001.27, 8653.94,
3670.13, 4536.02, 8216.55, 2481.36, 288.94, 1637.28, 4566.89,
1573.63, 11217.92, 0, 0, 0, 0, 0, 0, 0, 0, 9633.9, 0), Saldo = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89,
1, 1, 481.59, 299.52, 258.13, 603.84, 231.61, 631.68, 220.6,
210.54, 1, 1224.44), Bvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603.84, 0, 631.68,
0, 0, 0, 0), Cvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1224.44),
Dvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), vencida = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28,
4566.89, 1, 1, 0, 0, 0, 603.84, 0, 631.68, 0, 0, 1, 1224.44
), V1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("r04_numero_operacion",
"perdida3", "Saldo", "Bvencida", "Cvencida", "Dvencida", "vencida",
"V1"), codepage = 1252L, row.names = c(NA, 30L), class = "data.frame")
And the a2 data frame (dput structure):
structure(list(r04_numero_operacion = c("0050475725", "0050490602",
"0050491033", "0050496386", "0050518985", "0050630090", "0050631615",
"0060235906", "0060238732", "0060241333", "0060244391", "0060245813",
"0060260056", "0060266356", "0800041441", "0800054041", "0800055382",
"0800058554", "2020200073", "CAR1010002002000", "CAR1010002189000",
"CAR1010002215000", "CAR1010002250000", "CAR1010002264000", "CAR1010002297000",
"CAR1010002401000", "CAR1010002412000", "CAR1010002436000", "CAR1010002529000",
"CAR1010002709000"), perdida3 = c(523.12, 265.43, 8371.66, 5242.13,
4960.51, 8473.27, 3743.45, 1283.32, 2229.25, 8001.27, 8653.94,
3670.13, 4536.02, 8216.55, 2481.36, 288.94, 1637.28, 4566.89,
11217.92, 0, 0, 9633.9, 0, 0, 0, 0, 0, 0, 0, 0), Saldo = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89,
1, 317.72, 210.54, 1, 868.93, 242.91, 298.78, 120.63, 255.01,
357.68, 284.08, 308.83), Bvencida = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 317.72, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), Cvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 868.93, 0, 0, 0, 0, 0, 0, 0), Dvencida = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), vencida = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89, 1, 317.72, 0,
1, 868.93, 0, 0, 0, 0, 0, 0, 0), V2 = c(2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2)), .Names = c("r04_numero_operacion", "perdida3", "Saldo",
"Bvencida", "Cvencida", "Dvencida", "vencida", "V2"), class = "data.frame", row.names = c(NA,
30L))
My problem is with the merge() and match() functions. merge() is more convenient than match() for adding new variables by a common key, but when I use merge() I don't get the same result as with match(). First I used merge() with a2 and a1 to create DF with the following code:
DF=merge(a2,a1,all.x=TRUE)
It added the V1 variable from a1 to DF, and I got this summary for DF$V1:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 9
Then I created a copy of a2 named DF and matched on r04_numero_operacion, using this code to add the V1 variable from a1 to a2:
a2$V1<-a1[match(a2$r04_numero_operacion,a1$r04_numero_operacion),"V1"]
It added V1 to DF, but the result differs from the merge() approach. I got this summary for DF$V1 with the match() solution:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 7
My problem is that I want to do the same thing I did with match() but using the merge() function, since merge() is more powerful than match(). Thanks for your help.
When using match(a2$r04_numero_operacion, a1$r04_numero_operacion), the a2$r04_numero_operacion values get matched against only the corresponding column in a1, while when using merge(a2, a1, all.x=TRUE) all the columns of a1 whose names match column names in a2 are used as the merge key. If you only merge on the first column, the NA counts match up:
summary( merge(a2,a1,by=1,all.x=TRUE)$V1 )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 7
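A minimal sketch to confirm the two approaches agree once merging is restricted to the operation number (using the a1 and a2 from the question; selecting only V1 from a1 avoids duplicated column names):
# merge only on the key column, keeping all rows of a2
DF <- merge(a2, a1[, c("r04_numero_operacion", "V1")],
            by = "r04_numero_operacion", all.x = TRUE)

# match-based lookup, as in the question
a2$V1 <- a1[match(a2$r04_numero_operacion, a1$r04_numero_operacion), "V1"]

# both give the same number of missing V1 values
sum(is.na(DF$V1))
sum(is.na(a2$V1))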
