It's my first time using R. I want to create a scatterplot with a line of best fit for a decade of data covering all countries. I joined two Excel datasets: one has the number of people jailed for a certain crime by country in a given year (rows: country, columns: year), the other has the average income for a certain population group (rows: country, columns: year).
dataclean=inner_join(EnforcementData, IncomeData, by = "Country")
This gives me a dataset with x, y points where enforcement is the x and income is the y
I want to plot this and find the outliers - so those countries where enforcement is out of step with income. I tried:
ggplot(dataclean, aes(x=EnforcementData, y=IncomeData, group= "Country")) +
geom_line(aes(color = "Country"))
Thanks for any suggestions!
EDIT: I think I've improperly merged the datasets somehow, as it returns a matrix. Like this:
dput(head(dataclean))
structure(list(Country = c("Albania", "Algeria", "Angola", "Antigua and Barbuda",
"Argentina", "Armenia"), 2006.x = c(0, 0, 0, 0, 0, 0), 2007.x = c(0,
0, 0, 0, 0, 0), 2008.x = c(0, 0, 0, 0, 3, 0), 2009.x = c(0,
0, 0, 0, 2, 0), 2010.x = c(0, 0, 3, 0, 0, 0), 2011.x = c(0,
0, 0, 0, 4, 0), 2012.x = c(0, 0, 0, 0, 2, 0), 2013.x = c(1,
1, 3, 0, 3, 0), 2014.x = c(0, 0, 0, 0, 1, 0), 2015.x = c(0,
0, 1, 1, 0, 0), 2016.x = c(0, 0, 5, 1, 5, 0), 2017.x = c(0,
0, 3, 0, 0, 0), 2018.x = c(0, 0, 0, 0, 0, 0), 2019.x = c(0,
1, 3, 0, 0, 0), 2020.x = c(0, 1, 0, 0, 0, 0), 2006.y = c(3.273755,
2.9912451, 3.689971, 1.342365, 2.8111637, 3.1407325), 2007.y = c(3.157699,
3.0298389, 3.759603, 1.315153, 2.8102016, 3.2122944), 2008.y = c(3.0636166,
3.0644794, 3.754531, 1.181255, 2.9054865, 3.1780076), 2009.y = c(3.0084051,
3.0477934, 3.874565, 1.144331, 2.9149061, 3.0896677), 2010.y = c(2.9951254,
2.9948973, 3.796005, 1.161454, 2.8314702, 3.1664003), 2011.y = c(3.1528966,
3.0144704, 3.814187, 1.190574, 2.8360401, 3.1267727), 2012.y = c(3.1964009,
2.9731618, 3.73838, 1.201921, 2.913096, 3.0577149), 2013.y = c(3.1683419,
2.943247, 3.779373, 1.209151, 2.9020493, 3.0017037), 2014.y = c(3.0180735,
3.0699088, 3.913854, 1.8298544, 3.0114942, 2.9938708), 2015.y = c(2.9489451,
3.1155215, 3.864924, 1.7799824, 3.0169873, 3.0037498), 2016.y = c(2.8750588,
3.1476701, 3.909438, 1.7761061, 2.7538409, 3.041738), 2017.y = c(2.8906318,
3.0717401, 3.880863, 2.2256225, 2.7280908, 3.0332232), 2018.y = c(2.9485421,
3.12678, 3.609102, 2.1923678, 2.5386973, 2.8175096), 2019.y = c(3.0029988,
3.0910585, 3.524361, 2.1915031, 2.5461976, 2.6481938), 2020.y = c(1.9297139,
3.1117555, 3.3970031, 2.1946293, 2.5862916, 2.438313)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
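One possible way forward, offered as a sketch rather than a definitive answer: the join produced a wide table (one .x and one .y column per year), whereas a scatterplot needs one row per country-year. The base R code below stacks the year columns of a tiny made-up table, fits a straight line with lm, and ranks the rows by residual size to flag candidate outliers. The values, the years vector, and the names wide, long, Enforcement, Income and Residual are all assumptions for illustration.

```r
# Toy stand-in for `dataclean`: same layout (Country, <year>.x, <year>.y),
# but the numbers here are invented for illustration.
wide <- data.frame(
  Country  = c("A", "B", "C"),
  `2006.x` = c(0, 1, 5),
  `2007.x` = c(1, 2, 4),
  `2006.y` = c(3.0, 2.5, 1.0),
  `2007.y` = c(2.8, 2.2, 3.9),
  check.names = FALSE
)
years <- 2006:2007

# Stack each year's pair of columns: one row per country-year.
long <- do.call(rbind, lapply(years, function(yr) {
  data.frame(
    Country     = wide$Country,
    Year        = yr,
    Enforcement = wide[[paste0(yr, ".x")]],
    Income      = wide[[paste0(yr, ".y")]]
  )
}))

# Line of best fit, then residuals: rows with large absolute residuals
# are the country-years most out of step with the fitted line.
fit <- lm(Income ~ Enforcement, data = long)
long$Residual <- residuals(fit)
outliers <- long[order(-abs(long$Residual)), ]
print(head(outliers, 3))
```

With ggplot2 loaded, ggplot(long, aes(x = Enforcement, y = Income)) + geom_point() + geom_smooth(method = "lm") should then draw the scatterplot with the fitted line on the long table.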
I have a dataset like original below, with one numeric variable (NP) and binary variables (all the rest). My real dataset is much larger and includes many more numeric and dummy variables:
NP <- c(4,6,18,1,3,12,8)
iso_mode_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_mode_CHN <- c(0, 1, 1, 0, 0, 0, 0)
iso_mode_COL <- c(0, 0, 0, 1, 1, 0, 0)
iso_mode_mod_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_mode_mod_CHN <- c(0, 1, 1, 0, 0, 0, 0)
iso_mode_mod_COL <- c(0, 0, 0, 1, 1, 0, 0)
exp_sector_4 <- c(0, 1, 0, 0, 1, 0, 0)
exp_sector_5 <- c(1, 0, 1, 0, 0, 0, 0)
exp_sector_7 <- c(0, 0, 0, 1, 0, 1, 1)
original <- data.frame(NP, iso_mode_USA, iso_mode_CHN, iso_mode_COL, iso_mode_mod_USA, iso_mode_mod_CHN, iso_mode_mod_COL, exp_sector_4, exp_sector_5, exp_sector_7)
I want a vector that records the group of each column, based on the start of its name (e.g. NP forms one group, iso_mode_ forms another, iso_mode_mod_ another, exp_sector_ another, and so on). The vector should therefore look like:
vector <- c("1", "2", "2", "2", "3", "3", "3", "4", "4", "4")
Any idea on how to do it in dplyr (for many more variables)?
Thank you.
You can use grepl to find the name and which in apply to get the position.
tt <- paste0("^", unique(sub("_[^_]+$", "_", names(original))), "([^_]+$|$)")
apply(sapply(tt, grepl, names(original)), 1, which)
# [1] 1 2 2 2 3 3 3 4 4 4
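As a possible alternative, here is a minimal sketch that strips the last underscore-delimited token from each name and numbers the resulting prefixes in order of first appearance. It assumes, as in the example, that a group is defined by everything before the final underscore; the names nms, prefix and grp are introduced here for illustration.

```r
# Column names from the example (typed out so the block runs on its own).
nms <- c("NP",
         "iso_mode_USA", "iso_mode_CHN", "iso_mode_COL",
         "iso_mode_mod_USA", "iso_mode_mod_CHN", "iso_mode_mod_COL",
         "exp_sector_4", "exp_sector_5", "exp_sector_7")

# Drop the trailing "_<token>"; names without an underscore stay unchanged.
prefix <- sub("_[^_]+$", "", nms)

# Number each prefix by the position of its first occurrence.
grp <- match(prefix, unique(prefix))
print(grp)
# [1] 1 2 2 2 3 3 3 4 4 4
```

Wrapping the result in as.character(grp) gives the character vector shown in the question, and names(original) can replace nms on the real data.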
For an assignment, I am applying mixture modeling with the mixtools package in R. When I try to find the optimal number of components with the bootstrap, I get the following error:
Error in boot.comp(y, x, N = NULL, max.comp = 2, B = 5, sig = 0.05, arbmean = TRUE, :
Number of trials must be specified!
I found out that I have to supply N. The documentation describes it as: "An n-vector of number of trials for the logistic regression type logisregmix. If NULL, then N is an n-vector of 1s for binary logistic regression."
But I don't know how to work out what N should be to make my bootstrap work.
Link to my data:
https://www.kaggle.com/blastchar/telco-customer-churn
My code:
data <- read.csv("Desktop/WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = FALSE,
na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))
data <- droplevels(na.omit(data))
testdf <- data[c(5033:7032),] # take the test rows before truncating data
data <- data[c(1:5032),]
data <- subset(data, select = -customerID)
set.seed(100)
library(plyr)
library(mixtools)
data$Churn <- revalue(data$Churn, c("Yes"=1, "No"=0))
y <- as.numeric(data$Churn)
x <- model.matrix(Churn ~ . , data = data)
x <- x[, -1] #remove intercept
x <-x[,-c(7, 11, 13, 15, 17, 19, 21)] #multicollinearity
a <- boot.comp(y, x, N = NULL, max.comp = 2, B = 100,
sig = 0.05, arbmean = TRUE, arbvar = TRUE,
mix.type = "logisregmix", hist = TRUE)
Below there is more information about my predictors:
dput(x[1:4,])
structure(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
34, 2, 45, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 1, 0, 29.85, 56.95, 53.85, 42.3, 29.85, 1889.5, 108.15,
1840.75), .Dim = c(4L, 23L), .Dimnames = list(c("1", "2", "3",
"4"), c("genderMale", "SeniorCitizen", "PartnerYes", "DependentsYes",
"tenure", "PhoneServiceYes", "MultipleLinesYes", "InternetServiceFiber optic",
"InternetServiceNo", "OnlineSecurityYes", "OnlineBackupYes",
"DeviceProtectionYes", "TechSupportYes", "StreamingTVYes", "StreamingMoviesYes",
"ContractOne year", "ContractTwo year", "PaperlessBillingYes",
"PaymentMethodCredit card (automatic)", "PaymentMethodElectronic check",
"PaymentMethodMailed check", "MonthlyCharges", "TotalCharges"
)))
My response variable is binary
I hope you guys can help me out!
Looking in the source code of mixtools::boot.comp, which is scary as it is over 800 lines long and in serious need of refactoring, the offending lines are:
if (mix.type == "logisregmix") {
if (is.null(N))
stop("Number of trials must be specified!")
Despite what the documentation says, N must be specified.
Try to set it to a vector of 1s: N = rep(1, length(y)) or N = rep(1, nrow(x))
In fact, if you look in mixtools::logisregmixEM, the internal function called by boot.comp, you'll see how N is set if NULL:
n <- length(y)
if (is.null(N)) {
N = rep(1, n)
}
Too bad this is never reached if N is NULL since it stops with an error before. This is a bug.
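To make the control flow concrete, here is a toy sketch. The functions outer_boot and inner_fit are hypothetical stand-ins written for this illustration, not the mixtools source; they only mimic the pattern described above, where the outer check fires before the inner default can apply.

```r
# Hypothetical stand-ins, NOT the mixtools source: they only mimic the control flow.
inner_fit <- function(y, N = NULL) {
  n <- length(y)
  if (is.null(N)) N <- rep(1, n)  # sensible default, but unreachable via outer_boot
  N
}

outer_boot <- function(y, N = NULL) {
  if (is.null(N)) stop("Number of trials must be specified!")
  inner_fit(y, N)
}

y <- c(1, 0, 0, 1)

# Leaving N as NULL triggers the outer check before the default is ever set.
msg <- tryCatch(outer_boot(y), error = function(e) conditionMessage(e))
print(msg)

# Workaround: pass the vector of 1s explicitly.
N_used <- outer_boot(y, N = rep(1, length(y)))
print(N_used)
```

The same idea applies to the real call: build N <- rep(1, length(y)) yourself and pass it in place of N = NULL.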
I am trying to implement the rowSums solution proposed in "Getting rowSums in a data table in R". Basically, I want a variable with the sum of top15, top16 and top17 for each row. The code below produces an answer, but it's clearly not right, and I am not sure I understand what is happening.
I am looking for a data.table solution - I am running this on millions of cases
library( data.table)
d <- structure(list(top15 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1), top16 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0), top17 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0)), class = c("data.table",
"data.frame"), row.names = c(NA, -20L))
d[ , tops:=lapply(.SD,sum), .SDcols=c(paste0("top", 15:17))]
We can use rowSums on the Subset of data.table (.SD), which can also take care of the NA elements with na.rm
nm1 <- paste0("top", 15:17)
d[, tops := rowSums(.SD, na.rm = TRUE), .SDcols = nm1]
Or if there are no NA elements, then do + with Reduce
d[, tops := Reduce(`+`, .SD), .SDcols = nm1]
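To see why summing each column is not what was wanted, here is a small base R sketch on a made-up three-row table (a plain data.frame stands in for the data.table, and the values are invented). Applying sum to each column gives one total per column, whereas rowSums and Reduce with + give one total per row.

```r
# Toy data, invented for illustration; a data.frame stands in for the data.table.
df <- data.frame(top15 = c(1, 1, 0),
                 top16 = c(0, 1, 0),
                 top17 = c(0, 1, 1))

# sum applied per column: one number per COLUMN (not what the question wants).
col_totals <- sapply(df, sum)

# One number per ROW, two equivalent ways when there are no NAs.
row_totals  <- rowSums(df, na.rm = TRUE)
row_totals2 <- Reduce(`+`, df)

print(col_totals)
print(row_totals)
print(row_totals2)
```

Here col_totals has length 3 (the number of columns), while both row totals have length 3 only because the toy table happens to have three rows; on the 20-row table in the question the difference in length is obvious.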
Let's start with my data.
> dput(head(tbl_ready)) ## To make it clear I didn't put all of the row names
structure(list(Gene_name = structure(1:6, .Label = c("AT1G01050",
"AT1G01080", "AT1G01090", "AT1G01220", "AT1G01320", "AT1G01420",
"AT1G01470", "AT1G01800", "AT1G01910", "AT1G01920", "AT1G01960",
"AT5G66570", "AT5G66720", "AT5G66760", "AT5G67150", "AT5G67360",
"ATCG00120", "ATCG00160", "ATCG00170", "ATCG00190", "ATCG00380",
"ATCG00470", "ATCG00480", "ATCG00490", "ATCG00500", "ATCG00650",
"ATCG00660", "ATCG00670", "ATCG00750", "ATCG00770", "ATCG00780",
"ATCG00800", "ATCG00810", "ATCG00820", "ATCG01090", "ATCG01110",
"ATCG01120", "ATCG01240", "ATCG01300", "ATCG01310", "ATMG01190"
), class = "factor"), `10` = c(0, 0, 0, 0, 0, 0), `20` = c(0,
0, 0, 0, 0, 0), `52.5` = c(0, 1, 0, 0, 0, 0), `81` = c(0, 0.660693687777888,
0, 0, 0, 0), `110` = c(0, 0.521435654491704, 0, 0, 0, 1), `140.5` = c(0,
0.437291194705566, 0, 0, 0, 1), `189` = c(0, 0.52204783488213,
0, 0, 0, 0), `222.5` = c(0, 0.524298383907171, 0, 0, 0, 0), `278` = c(1,
0.376865096972469, 0, 1, 0, 0), `340` = c(0, 0, 0, 0, 0, 0),
`397` = c(0, 0, 0, 0, 0, 0), `453.5` = c(0, 0, 0, 0, 0, 0
), `529` = c(0, 0, 0, 0, 0, 0), `580` = c(0, 0, 0, 0, 0,
0), `630.5` = c(0, 0, 0, 0, 0, 0), `683.5` = c(0, 0, 0, 0,
0, 0), `735.5` = c(0, 0, 0, 0, 0, 0), `784` = c(0, 0, 0.476101907006443,
0, 0, 0), `832` = c(0, 0, 1, 0, 0, 0), `882.5` = c(0, 0,
0, 0, 0, 0), `926.5` = c(0, 0, 0, 0, 1, 0), `973` = c(0,
0, 0, 0, 0, 0), `1108` = c(0, 0, 0, 0, 0, 0), `1200` = c(0,
0, 0, 0, 0, 0)), .Names = c("Gene_name", "10", "20", "52.5",
"81", "110", "140.5", "189", "222.5", "278", "340", "397", "453.5",
"529", "580", "630.5", "683.5", "735.5", "784", "832", "882.5",
"926.5", "973", "1108", "1200"), row.names = c(NA, 6L), class = "data.frame")
Take a look on the names of the columns (just picked the 6 of them):
10
20
52.5
81
110
140.5
Those names give the size range. Each column covers sizes from its own name up to the name of the next column, so the first column starts at 10 and ends where the second column begins, at 20. That means genes with a size between 10 and 20 should belong to the first column.
I have another table which gives the size of all genes (it contains many more genes than can be found in my first table):
>dput(head(tbl_size))
structure(list(Gene_name = structure(1:6, .Label = c("ATMG01290", "ATMG01300", "ATMG01310", "ATMG01320", "ATMG01330",
"ATMG01350", "ATMG01360", "ATMG01370", "ATMG01400", "ATMG01410"
), class = "factor"), tp = c(26L, 17L, 22L, 142L, 12L, 45L),
size = c(49.4255, 28.0913, 40.2872, 213.572, 24.4838, 70.4375
)), .Names = c("locus", "tp", "size"), row.names = c(NA,
6L), class = "data.frame")
And now the main part: what do I want to achieve with my code?
I'm trying to keep only those genes which are found in fractions (columns) whose size range is at least two times higher than the real size of the gene. In case that is unclear, let me use an example.
Let's say that we have these genes:
Names Size
AT1G01080 40
AT1G01090 30
AT1G01220 50
Let's multiply the size by 2:
Names Size
AT1G01080 80
AT1G01090 60
AT1G01220 100
In the first table (tbl_ready) we can find the list of genes and the specific fractions (columns) defined by size, which I explained at the beginning of this thread. I would like to put 0 in place of any value where a gene is found in a fraction (column) that is not at least two times higher than the gene size.
To find the size of a gene you have to look in the second table (tbl_size).
To sum up: I'm trying to determine which of those genes occur as a complex of at least 2, so only fractions with a size at least two times higher than the size of the gene matter to me.
If someone understands what I am trying to do, please edit my question to make it readable.
Firstly, convert the columns to their numerical value:
frac <- as.numeric(colnames(tbl_ready))
and then get, per gene, the index of the last column whose fraction size does not exceed twice the gene's size:
ind <- sapply(tbl_size$size, function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus, tbl_ready$Gene_name), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
You'll have to be careful if Gene_name doesn't have a 1:1 mapping with locus, and in cases where none of the columns exceeds the gene size two-fold, as there will be NAs that need dealing with. I'm assuming you're stuck with these representations of your data; otherwise it would probably be better to store tbl_ready in a longer, narrower form than you have here (containing only three columns: name, size, and value, and omitting the zero values).
I'm going to change my original answer, this time using the data you've provided. The only real differences are that you've changed the column names (I'm assuming column tp in tbl_size is the thing we need to match to the column headings in tbl_ready), and that some of the rows in tbl_size don't correspond to tbl_ready.
Firstly, convert the columns to their numerical value:
frac <- as.numeric(colnames(tbl_ready))
and then get, per gene, the index of the last column whose fraction size does not exceed twice the gene's tp:
mapToReady <- tbl_size$locus %in% tbl_ready[[1]]
ind <- sapply(tbl_size$tp[mapToReady], function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus[mapToReady], tbl_ready[[1]]), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
So, for instance, AT1G01050 is the 5th row of tbl_size (none of the previous entries have an entry in your tbl_ready), and the first row of tbl_ready. So the first 'iteration' of the sapply line hits 'tbl_size$tp[mapToReady][1]', which is the tp of AT1G01050, which is 12. 2*12 is 24, which lies between 20.0 and 52.5, so we need to set the columns corresponding to '10' and '20' to zero, but not columns '52.5' onwards, for AT1G01050. This corresponds to columns 2 and 3 of row 1 of tbl_ready, which is what the cbind portion of the last three lines is doing.
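The worked example can be checked on a small self-contained table. This is a sketch of a variation, not the exact code above: it moves the numeric columns into a matrix first, so column indices start at 1 rather than 2, and it assumes that a gene with no fraction above twice its tp should have every column zeroed. The gene names g0, g1, g2 and all values are invented, and vals, n_zero and result are names introduced here.

```r
# Toy tables, invented for illustration (same column layout as in the question).
tbl_ready <- data.frame(
  Gene_name = c("g1", "g2"),
  `10` = c(1, 1), `20` = c(1, 1), `52.5` = c(1, 1), `81` = c(1, 1),
  check.names = FALSE
)
tbl_size <- data.frame(locus = c("g0", "g1", "g2"), tp = c(5, 12, 30))

# Numeric part as a matrix; its column names are the fraction sizes.
vals <- as.matrix(tbl_ready[-1])
frac <- as.numeric(colnames(vals))

# Only genes present in tbl_ready are relevant.
keep <- tbl_size$locus %in% tbl_ready$Gene_name

# Per gene: how many leading fraction columns do not exceed 2 * tp.
# ASSUMPTION: if no fraction exceeds 2 * tp, all columns are zeroed.
n_zero <- sapply(tbl_size$tp[keep], function(x) {
  first_ok <- which(frac > 2 * x)[1]
  if (is.na(first_ok)) length(frac) else first_ok - 1
})

# (row, column) pairs to blank out, then a single matrix-indexed assignment.
rowI <- rep(match(tbl_size$locus[keep], tbl_ready$Gene_name), times = n_zero)
colI <- unlist(lapply(n_zero, seq_len))
vals[cbind(rowI, colI)] <- 0

result <- data.frame(Gene_name = tbl_ready$Gene_name, vals, check.names = FALSE)
print(result)
```

For g1 (tp 12, so 24) only the '10' and '20' columns are zeroed, as in the AT1G01050 example; for g2 (tp 30, so 60) the '52.5' column is zeroed as well.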