Multiple wilcox.tests across columns using variables in first column (R) - r

I have this data.frame
df <- data.frame(
variable=c(2.4860651, -0.68863024, 2.63530974, -2.95754943, 1.67945091, 2.63530974,
4.79002539, 2.32575938, 3.57236441, -0.364825998, -2.00646016, -3.12380516,
0.69307013, -5.65846824, 0.45632519, 2.08978142),
A=c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
B=c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0),
C=c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1),
D=c(1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0),
E=c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
F=c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1))
I would like to perform wilcox.test for each column with groups defined by 0 and 1 in the columns and using the variables in the column df$variable. Then add the p.values in a new row and adjusted p.values in another row.
I have tried this:
library(dplyr)
result <- df %>% summarise(across(!variable, ~wilcox.test(.x ~ variable)$p.value), exact=NULL) %>%
bind_rows(., p.adjust(., method = 'BH')) %>%
bind_rows(df, .) %>%
mutate(variable=replace(variable, is.na(variable), c('p.values', 'p.adjust')))
But this causes errors.
This is the result I would like to get:
result <- data.frame(
variable=c(2.4860651, -0.68863024, 2.63530974, -2.95754943, 1.67945091, 2.63530974,
4.79002539, 2.32575938, 3.57236441, -0.364825998, -2.00646016, -3.12380516,
0.69307013, -5.65846824, 0.45632519, 2.08978142, 'p.value', 'p.adjust'),
A=c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1),
B=c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0.560444274, 1),
C=c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0.143117298, 0.764253489),
D=c(1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0.820753088, 1),
E=c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0.95482869, 1),
F=c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0.254751163, 0.764253489))
Can anyone help?

You may try something along the lines of the following -
library(dplyr)
tmp <- df %>% summarise(across(!variable,
~wilcox.test(variable[.x == 0], variable[.x == 1])$p.value))
adj_value <- p.adjust(unlist(tmp), method = "BH")
result <- bind_rows(df %>% mutate(variable = as.character(variable)),
rbind(tmp, adj_value) %>%
mutate(variable = c('p.values', 'p.adjust'))
)

Thank you, Ronak. I modified your former answer and I found that this also works and results in the same as you found:
result <- df %>%
summarise(across(!variable,
~wilcox.test(variable[.x == 0], variable[.x == 1])$p.value), exact=NULL) %>%
bind_rows(., p.adjust(., method = 'BH')) %>%
bind_rows(df, .) %>%
mutate(variable=replace(variable, is.na(variable), c('p.values', 'p.adjust')))
Thank you! :)

Related

Why am I getting this error when using glmnet in R

When trying to predict I am getting this error
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
Here is my code so far
library(data.table)
library(caret)
library(Metrics)
library(glmnet)
library(plotmo)
library(lubridate)
#Reading in the necessary data needed for this project
train <- fread("project/volume/data/processed/start_train.csv")
test<-fread("project/volume/data/processed/start_test.csv")
example_sub<-fread("project/volume/data/processed/example_submission.csv")
card_tab <- fread("project/volume/data/processed/card_tab.csv")
#Merging the card_tab dataset with both my train and test datasets to add more variables to each
train = merge(train, card_tab, by = "id")
test = merge(test, card_tab, by = "id")
train$power = as.numeric(train$power)
train$toughness = as.numeric(train$toughness)
test$power = as.numeric(test$power)
test$toughness = as.numeric(test$toughness)
train$powerovertough = train$power/train$toughness
test$powerovertough = test$power/test$toughness
train$current_date<-as_date(train$current_date)
train<-train[order(-current_date)]
test$current_date<-as_date(test$current_date)
test<-test[order(-current_date)]
#Handling NA values in both train and test. The NA values will be set to 0
train[is.na(train)] <- 0
test[is.na(test)] <- 0
# Specifying which columns of our model we will be dropping and applying it to our train and test datasets
drops<- c('id','future_date','current_date','card_name','power','loyalty','cmc','type','colors','mana_cost','subtypes', 'text','set','set_name')
train<-train[, !drops, with = FALSE]
test<-test[, !drops, with = FALSE]
#Saving the response variable in the train dataset
train_y<-train$future_price
test$future_price<-0
#Using dummies to train model
dummies <- dummyVars(future_price ~ ., data = train)
train<-predict(dummies, newdata = train)
test<-predict(dummies, newdata = test)
train<-data.table(train)
test<-data.table(test)
#Cross validating the model to fin the best lamda value
train<-as.matrix(train)
test<-as.matrix(test)
gl_model<-cv.glmnet(train, train_y, alpha = 1,family="gaussian")
bestlam<-gl_model$lambda.min
# Fitting the full model
gl_model<-glmnet(train, train_y, alpha = 1,family="gaussian")
plot_glmnet(gl_model)
saveRDS(gl_model,"./project/volume/models/gl_model.model")
test<-as.matrix(test)
#use the full model
pred<-predict(gl_model,s=bestlam, newx = test)
I am trying to predict future_price for my test set. The error is saying my dimensions are wrong but I don't know what could be causing them to be different. I have tried observing the data sets as it runs through the code and they seem to have the same variables.
Here is the dput
> dput(head(train))
structure(c(0.25, 0.1, 0.1, 0.1, 0.25, 0.25, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 3, 0, 2, 0, 0, 0, 0.333333333333333,
0, 1, 0), .Dim = c(6L, 20L), .Dimnames = list(NULL, c("current_price",
"typesArtifact", "typesArtifact Creature", "typesCreature", "typesEnchantment",
"typesEnchantment Artifact", "typesEnchantment Creature", "typesInstant",
"typesLand", "typesPlaneswalker", "typesSorcery", "supertypes",
"supertypesBasic", "supertypesLegendary", "rarityCommon", "rarityMythic",
"rarityRare", "rarityUncommon", "toughness", "powerovertough"
)))
> dput(head(test))
structure(c(0.15, 0.16, 2, 0.39, 0.16, 0.19, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0), .Dim = c(6L, 18L), .Dimnames = list(
NULL, c("current_price", "typesArtifact", "typesArtifact Creature",
"typesCreature", "typesEnchantment", "typesInstant", "typesLand",
"typesPlaneswalker", "typesSorcery", "supertypes", "supertypesBasic",
"supertypesLegendary", "rarityCommon", "rarityMythic", "rarityRare",
"rarityUncommon", "toughness", "powerovertough")))

lapply Question: How to streamline further without generating errors

I'm looking to condense the steps in my script, but I'm having issues with lapply(). It looks to be an issue with my code as usual. Any help would be much appreciated!
library(iNEXT)
sa4 <- list(Bird = list(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0), .Dim = c(26L,
6L), .Dimnames = list(Scientific_name = c(" Pycnonotus plumosus",
"Acridotheres javanicus", "Aegithina tiphia", "Aethopyga siparaja",
"Anthreptes malacensis", "Aplonis panayensis", "Cacatua goffiniana",
"Callosciurus notatus", "Cinnyris jugularis", "Copsychus malabaricus",
"Copsychus saularis", "Dicaeum cruentatum", "Dicrurus paradiseus",
"Gorsachius melanolophus", "Larvivora cyane", "Macronus gularis",
"Oriolus chinensis", "Orthotomus atrogularis", "Otus lempiji",
"Pitta moluccensis", "Pycnonotus goiavier", "Pycnonotus plumosus",
"Pycnonotus zeylanicus", "Spilopelia chinensis", "Todiramphus chloris",
"Zosterops simplex"), Sampling_Point = c("SA_01", "SA_02", "SA_03",
"SA_04", "SA_05", "SA_06")))), Butterfly = list(structure(c(0,
0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0), .Dim = c(10L,
4L), .Dimnames = list(Scientific_name = c("Burara harisa consobrina",
"Catopsilia pyranthe pyranthe", "Catopsilia scylla cornelia",
"Delias hyparete metarete", "Eurema sp", "Idea leuconoe clara",
"Pachliopta aristolochiae asteris", "Phalanta phalantha phalantha",
"Troides helena cerberus", "Zizula hylax pygmaea"), Sampling_Point = c("SA_01",
"SA_02", "SA_04", "SA_06")))), Mammal = list(structure(c(0, 1,
1, 1, 1, 0), .Dim = 2:3, .Dimnames = list(Scientific_name = c("Callosciurus notatus",
"Unidentified Fruit Bat sp"), Sampling_Point = c("SA_03", "SA_04",
"SA_05")))), Reptile = list(structure(1, .Dim = c(1L, 1L), .Dimnames = list(
Scientific_name = "Hemidactylus frenatus", Sampling_Point = "SA_05"))))
I've been doing it the longer way:
estimateD(sa4$Butterfly, datatype="incidence_raw") #Sampling coverage for butterflies
estimateD(sa4$Mammal, datatype="incidence_raw") #Sampling coverage for mammals
estimateD(sa4$Bird, datatype="incidence_raw") #Sampling coverage for birds
estimateD(sa4$Reptile, datatype="incidence_raw") #Sampling coverage for reptiles
Note that estimateD(sa4$Reptile, datatype="incidence_raw" generates an error since it only has one species.
Is it possible to condense the following steps via lapply? In this situation I've only have 4 taxa, but for other projects, it might be a lot more. I tried the following and it gives me a warning message--which actually is the same warning message as the one above. I'm wondering if lapply stops working if one component gives an error?
> (lapply(sa4, function(x) estimateD(x, datatype="incidence_raw")) )
Error in `[.data.frame`(tmp, , c(1, 2, 3, 7, 4, 5, 6)) :
undefined columns selected
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
Invalid data type, the element of species by sites presence-absence matrix should be 0 or 1. Set nonzero elements as 1.
2: In log(b/Ub) : NaNs produced
Please let me know if I need to provide more information? Thank you!
This is a simple error trapping issue. Wrap tryCatcharound your problem function call and have the error function return information on what happened.
results <- lapply(sa4, function(x) {
tryCatch(estimateD(x, datatype="incidence_raw"),
error = function(e) e)
})
Now determine which ran alright.
ok <- !sapply(results, inherits, "error")
ok
# Bird Butterfly Mammal Reptile
# TRUE TRUE TRUE FALSE
And check those that did.
results[ok]
It is the issue with the 'Reptiles', so if we select the first 3 elements of the list, it should work
lapply(sa4[1:3], function(x) estimateD(x, datatype="incidence_raw"))

How to use ids from one dataframe to sum rows in another dataframe

I feel like this answer has been asked before, but I can't seem to find an answer to this question. Maybe my title is too vague, so feel free to change it.
So I have one data frame, a, with ids the correspond to column name in data frame b. Both data frames are simplified versions of a much larger data frame.
here is data frame a
a <- structure(list(V1 = structure(c(4L, 5L, 1L, 2L, 3L), .Label = c("GEN[D00105].GT",
"GEN[D00151].GT", "GEN[D00188].GT", "GEN[D86396].GT", "GEN[D86397].GT"
), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
here is data frame b
b <- structure(list(`GEN[D01104].GT` = c(0, 0, 0, 0, 1, 0, 0, 2, 0,
1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 0), `GEN[D01312].GT` = c(1, 0,
2, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 2, 0, 0, 0), `GEN[D01878].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 0, 0), `GEN[D01882].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0), `GEN[D01952].GT` = c(0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D01953].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D02053].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0), `GEN[D00316].GT` = c(0,
0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0), `GEN[D01827].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0), `GEN[D01881].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0), `GEN[D02044].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0), `GEN[D02085].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02204].GT` = c(0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02276].GT` = c(0,
0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0), `GEN[D02297].GT` = c(0,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0), `GEN[D02335].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0), `GEN[D02397].GT` = c(0,
0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0), `GEN[D00856].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0), `GEN[D00426].GT` = c(0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0), `GEN[D02139].GT` = c(0,
0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0), `GEN[D02168].GT` = c(0,
0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0)), row.names = c(NA,
-20L), class = "data.frame")
I want to be able to use the ids from data frame a to sum the row in data frame b that have a matching id if that makes sense.
So in the past, I just did something like
b$affected.samples <- (b$`GEN[D86396].GT` + b$`GEN[D86397].GT` + b$`GEN[D00105].GT` + b$`GEN[D00151].GT` + b$`GEN[D00188].GT`)
which got annoying and took to much time, so I moved over to
b$affected.samples <- rowSums(b[,c(1:5)])
Which isn't too bad for this example but with my large data set, my sample can be all over the place, and it's starting to take too much time to finds where everything is. I was hoping there is a way just to use my data frame a to sum the correct rows in data frame b.
Hopefully, I gave this is all the information you need! Let me know if you have any questions.
Thanks in advance!!
Extract the 'V1' column as a character string, use that to select the columns of 'b' (assuming these column names are found in 'b') and get the rowSums
rowSums( b[as.character(a$V1)], na.rm = TRUE)

tapply of two variables in a loop

My data :
TEST <- structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0,
0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2,
0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
0, 0, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
3, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 3, 0, 0, 1, 0, 0, 2, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 3, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0), .Dim = c(22L, 20L), .Dimnames = list(NULL, c("month",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "")))
and :
month <- c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
I want to sum by row, so I used this code :
su_test <- list()
for (i in 1:ncol(TEST)){
su_test[[i]] <- tapply(TEST[,i], month, sum)
}
su_test <- do.call(cbind, su_test)
and to check the quantile :
su_test_obs <- apply(su_test,1,quantile,c(0.1,0.9))
This is an observation simulation per month. However I also have the detail by area. :
TEST2 <- structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 3, 1, 0,
0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0,
0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
0, 3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2,
0, 0, 0, 2, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 3, 0, 0, 1, 0, 0, 2, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 3, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1), .Dim = c(22L, 20L), .Dimnames = list(NULL, c("month",
"area", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "")))
area <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
I would like to have the same result of su_test_obs
but with the details of the areas in addition like list(month, area) but I don't understand the logic.
Would you have a solution please? Maybe there is a solution simpler with dplyr?
Thanks
It would be much simpler if you convert your matrix to dataframe. We can then use aggregate which can be applied to multiple groups easily
df <- data.frame(TEST2)
apply(aggregate(.~month + area, df, sum)[-c(1, 2)], 1, quantile, c(0.1,0.9))
# [,1] [,2]
#10% 2 1.7
#90% 6 5.3

Differences between merge and match functions in R

I everybody I remove my last post to make a reproducible exmaple of my problem. I am working with the next to data frames a1 (dput structure):
structure(list(r04_numero_operacion = c("0050475725", "0050490602",
"0050491033", "0050496386", "0050518985", "0050630090", "0050631615",
"0060235906", "0060238732", "0060241333", "0060244391", "0060245813",
"0060260056", "0060266356", "0800041441", "0800054041", "0800055382",
"0800058554", "2020200062", "2020200073", "CAR1010001706000",
"CAR1010001795000", "CAR1010001803000", "CAR1010001871000", "CAR1010001962000",
"CAR1010002002000", "CAR1010002120000", "CAR1010002189000", "CAR1010002215000",
"CAR1010002250000"), perdida3 = c(523.12, 265.43, 8371.66, 5242.13,
4960.51, 8473.27, 3743.45, 1283.32, 2229.25, 8001.27, 8653.94,
3670.13, 4536.02, 8216.55, 2481.36, 288.94, 1637.28, 4566.89,
1573.63, 11217.92, 0, 0, 0, 0, 0, 0, 0, 0, 9633.9, 0), Saldo = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89,
1, 1, 481.59, 299.52, 258.13, 603.84, 231.61, 631.68, 220.6,
210.54, 1, 1224.44), Bvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603.84, 0, 631.68,
0, 0, 0, 0), Cvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1224.44),
Dvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), vencida = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28,
4566.89, 1, 1, 0, 0, 0, 603.84, 0, 631.68, 0, 0, 1, 1224.44
), V1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("r04_numero_operacion",
"perdida3", "Saldo", "Bvencida", "Cvencida", "Dvencida", "vencida",
"V1"), codepage = 1252L, row.names = c(NA, 30L), class = "data.frame")
And a2 data frame (dput structure):
structure(list(r04_numero_operacion = c("0050475725", "0050490602",
"0050491033", "0050496386", "0050518985", "0050630090", "0050631615",
"0060235906", "0060238732", "0060241333", "0060244391", "0060245813",
"0060260056", "0060266356", "0800041441", "0800054041", "0800055382",
"0800058554", "2020200073", "CAR1010002002000", "CAR1010002189000",
"CAR1010002215000", "CAR1010002250000", "CAR1010002264000", "CAR1010002297000",
"CAR1010002401000", "CAR1010002412000", "CAR1010002436000", "CAR1010002529000",
"CAR1010002709000"), perdida3 = c(523.12, 265.43, 8371.66, 5242.13,
4960.51, 8473.27, 3743.45, 1283.32, 2229.25, 8001.27, 8653.94,
3670.13, 4536.02, 8216.55, 2481.36, 288.94, 1637.28, 4566.89,
11217.92, 0, 0, 9633.9, 0, 0, 0, 0, 0, 0, 0, 0), Saldo = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89,
1, 317.72, 210.54, 1, 868.93, 242.91, 298.78, 120.63, 255.01,
357.68, 284.08, 308.83), Bvencida = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 317.72, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), Cvencida = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 868.93, 0, 0, 0, 0, 0, 0, 0), Dvencida = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), vencida = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 288.94, 1637.28, 4566.89, 1, 317.72, 0,
1, 868.93, 0, 0, 0, 0, 0, 0, 0), V2 = c(2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2)), .Names = c("r04_numero_operacion", "perdida3", "Saldo",
"Bvencida", "Cvencida", "Dvencida", "vencida", "V2"), class = "data.frame", row.names = c(NA,
30L))
My problem is when I use merge() and match() functions. merge() is more functional than match() related to add new variables by common one but when I use merge() I don't get the same result as match(). First I used merge() with a2 and a1 to create DF with the next code:
DF=merge(a2,a1,all.x=TRUE)
It added V1 variable from a1 to DF and I got this summary for DF$V1:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 9
After I create a copy of a2 named DF and I made a match with r04_numero_operacion using this code to add V1 variable from a1 to a2:
a2$V1<-a1[match(a2$r04_numero_operacion,a1$r04_numero_operacion),"V1"]
It added `V1 to DF but the result is different to the merge() way. I got this summary for DF$V1 in match() solution:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 7
My problem is I want to make the same I made with match() but using merge() function due to this function is more poweful than match(). Thanks for your help.
In using match(a2$r04_numero_operacion,a1$r04_numero_operacion) the a2$r04_numero_operacion values gets matched the coresponding column in a1 while in using merge(a2,a1,all.x=TRUE) the a1 all the matching columns get matched to the matching column names in a2. If you only match on the first column, the NA counts match up:
summary( merge(a2,a1,by=1,all.x=TRUE)$V1 )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 1 1 1 1 1 7

Resources