Problem with cut() in R

I want to assign subjects to classes based on probabilities that I provide. I will be doing this in a variety of cases, with different values. Sometimes, I want the probability of a particular class to be 0. I've been using
classlist <- cut(runif(p), c(0, pdrop, ptitrate, pcomplete, pnoise, 1), labels = c("D", "T", "C", "N", "O"))
but this fails when two of the p variables are the same, because cut() requires unique breaks. I could make them differ by minimal amounts, e.g. pdrop = .2, ptitrate = .200001. But is there some better way?
Thanks
Peter

I suggest sample():
> p <- 100
> groups <- c("D", "T", "C", "N", "O")
> probVec <- c(0.2, 0.2, 0.3, 0.25, 0.05)
> classlist <- factor(sample(groups, size=p, replace=TRUE, prob=probVec))
> table(classlist)
classlist
C D N O T
26 16 28 5 25
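The original question also asked about classes with probability 0. A minimal sketch of that case follows, assuming the same group labels as above; the probability values are made up for illustration. Passing `levels = groups` to `factor()` keeps the empty class in the table.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
p <- 100
groups <- c("D", "T", "C", "N", "O")

# Illustrative probabilities: class "T" is switched off entirely
probVec <- c(0.2, 0, 0.3, 0.25, 0.25)

# Fixing the levels keeps "T" as a (zero-count) level of the factor
classlist <- factor(sample(groups, size = p, replace = TRUE, prob = probVec),
                    levels = groups)
table(classlist)
```

Unlike the cut() approach, nothing here depends on the break points being distinct, so equal or zero probabilities need no special handling.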

Related

R: Simulating Correlated Coin Flips

I am working with the R programming language.
I want to simulate coin flips such that:
If heads, then next head with p = 0.6 and tail = 0.4
if tails, the next tails with p = 0.6 and heads = 0.4
Using the 'markovchain' package in R, I did this as follows:
library(markovchain)
# transition matrix
P <- matrix(c(0.6, 0.4, 0.4, 0.6), byrow = TRUE, nrow = 2)
rownames(P) <- colnames(P) <- c("H", "T")
mc <- new("markovchain", states = c("H", "T"), transitionMatrix = P)
# Generate states
states <- rmarkovchain(n = 100, object = mc, t0 = "H")
# Print
table(states)
The output looks something like this:
> states
[1] "H" "T" "T" "H" "T" "T" "T" "H" "H"
My Question: Can someone please show me how I can do this in base R?
I think I need to:
create an empty vector x of length n
assign x[1] = H or T with probability 0.5
write an ifelse statement that says x[i] = ifelse(x[i-1] == "H", sample(c("H", "T"), 1, prob = c(0.6, 0.4)), sample(c("H", "T"), 1, prob = c(0.4, 0.6)))
But I am not sure how to do this.
Can someone please show me how to do this?
Thanks!
Here it is.
Fair=function() sample( c(rep('H',5), rep('T',5)),1)
A=function() sample( c(rep('H',6), rep('T',4)),1)
B=function() sample( c(rep('H',4), rep('T',6)),1)
x=Fair()
for (i in 1:10) x=c(x,ifelse(tail(x,n=1)=='H', A(), B()))
print(x)
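A variant of the same idea, sketched with a preallocated vector instead of growing x inside the loop. The function name `simulate_flips` and the parameter `p_stay` are made up for this example; `p_stay` is the probability of repeating the previous outcome (0.6 in the question).

```r
# Hypothetical helper: n correlated coin flips in base R
simulate_flips <- function(n, p_stay = 0.6) {
  flips <- character(n)
  flips[1] <- sample(c("H", "T"), 1)  # first flip is fair
  for (i in seq_len(n - 1) + 1) {     # i = 2, ..., n (empty when n == 1)
    stay <- runif(1) < p_stay
    # Repeat the previous outcome with probability p_stay, otherwise switch
    flips[i] <- if (stay) flips[i - 1] else setdiff(c("H", "T"), flips[i - 1])
  }
  flips
}

set.seed(42)  # arbitrary seed
flips <- simulate_flips(100)
table(flips)
```

Because both states use the same "stay" probability, a single `runif()` draw per step is enough. A chain with different probabilities per state would need a lookup by the previous state instead.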

what's wrong with this small decision tree using C5.0?

I'm trying to build a simple decision tree using C5.0 in R.
The data has 3 columns (including the target) and 14 rows.
This is my 'jogging' data. The target variable is 'CLASSIFICATION'.
WEATHER JOGGED_YESTERDAY CLASSIFICATION
C N +
W Y -
Y Y -
C Y -
Y N -
W Y -
C N -
W N +
C Y -
W Y +
W N +
C N +
Y N -
W Y -
or as dput result:
structure(list(WEATHER = c("C", "W", "Y", "C", "Y", "W", "C",
"W", "C", "W", "W", "C", "Y", "W"), JOGGED_YESTERDAY = c("N",
"Y", "Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"
), CLASSIFICATION = c("+", "-", "-", "-", "-", "-", "-", "+",
"-", "+", "+", "+", "-", "-")), class = "data.frame", row.names = c(NA,
-14L))
jogging <- read.csv("Jogging.csv")
jogging #training data
library(C50)
jogging$CLASSIFICATION <- as.factor(jogging$CLASSIFICATION)
jogging_model <- C5.0(jogging[-3], jogging$CLASSIFICATION)
jogging_model
summary(jogging_model)
plot(jogging_model)
but it does not produce a decision tree.
I thought it would produce 2 nodes (because there are 2 columns besides the target variable).
I want to know what's wrong :(
For this answer I will use a different tree-building package, partykit, simply because I am more used to it. Let's do the following:
jogging <- read.table(header = TRUE, text = "WEATHER JOGGED_YESTERDAY CLASSIFICATION
C N +
W Y -
Y Y -
C Y -
Y N -
W Y -
C N -
W N +
C Y -
W Y +
W N +
C N +
Y N -
W Y -",
stringsAsFactors = TRUE)
library(partykit)
ctree(CLASSIFICATION ~ WEATHER + JOGGED_YESTERDAY, data = jogging,
minsplit = 1, minbucket = 1, mincriterion = 0) |> plot()
That will print the following tree:
That tree uses up to three levels of splits and still does not fit the data perfectly. The first split has a p-value of .2, indicating that there is not nearly enough data to justify even this first split, let alone those following it. Such a tree is very likely to overfit the data massively. That is why the usual tree algorithms come with safeguards against overfitting, and in your case those safeguards prevent any tree from being grown. I disabled them with the arguments in the ctree call.
In short: you do not have enough data. Predicting "-" all the time is the most reasonable thing a classification tree can do.
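To see what "always predict -" amounts to, here is a small base R sketch that computes the majority class and its accuracy on the training data from the question. The variable names `class_counts`, `majority` and `baseline_acc` are made up for this example.

```r
# Training data copied from the question
jogging <- data.frame(
  WEATHER = c("C", "W", "Y", "C", "Y", "W", "C", "W", "C", "W", "W", "C", "Y", "W"),
  JOGGED_YESTERDAY = c("N", "Y", "Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"),
  CLASSIFICATION = factor(c("+", "-", "-", "-", "-", "-", "-", "+", "-", "+", "+", "+", "-", "-"))
)

# Class frequencies of the target
class_counts <- table(jogging$CLASSIFICATION)

# The majority class and the accuracy obtained by always predicting it
majority <- names(class_counts)[which.max(class_counts)]
baseline_acc <- max(class_counts) / nrow(jogging)

print(class_counts)
print(majority)
print(baseline_acc)
```

Any split would have to improve on this baseline by a clear margin to be kept, which is hard to demonstrate with 14 rows.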

Extract single linkage clusters from very large pairs list

I have a very large pairs list that I need to break down into single linkage communities. So far I have been able to do this entirely in R just fine. But I need to prepare for the eventuality that the entire list may be too large to hold in memory, or for igraph's R implementation to handle. A very simple version of this task looks like:
library(igraph)
df <- data.frame("p1" = c("a", "a", "d", "d"),
"p2" = c("b", "c", "e", "f"),
"val" = c(0.5, 0.75, 0.25, 0.35))
g <- graph_from_data_frame(d = df,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
If df is incredibly large, on the scale of hundreds of millions to tens of billions of rows, is there a way for me to extract individual elements of sg without having to build g in its entirety? It's relatively easy for me to store representations of df outside of R, either as a compressed txt file or as a sqlite database.
To address the problem with igraph's R implementation (assuming the dataset still fits in RAM; otherwise see @Paul Brodersen's answer):
The solution below works by specifying one element of the graph and then going over all connections until no further edges are found. It therefore creates the subgraph without building the whole graph. It looks a bit hacky compared to a recursive function but scales better.
library(igraph)
reduce_graph <- function(df, element) {
  # Start from one vertex and collect edges until no new ones are found
  elements_to_inspect <- element
  rows_graph <- 0
  repeat {
    # All edges that touch any vertex found so far
    graph_parts <- df[df$p1 %in% elements_to_inspect |
                        df$p2 %in% elements_to_inspect, ]
    elements_to_inspect <- unique(c(graph_parts$p1, graph_parts$p2))
    # Stop once an iteration adds no further edges
    if (nrow(graph_parts) == rows_graph) break
    rows_graph <- nrow(graph_parts)
  }
  graph_parts
}
df <- data.frame("p1" = c("a", "a", "d", "d","o"),
"p2" = c("b", "c", "e", "f","u"),
"val" = c(100, 0.75, 0.25, 0.35,1))
small_graph <- reduce_graph(df, "f")
g <- graph_from_data_frame(d = small_graph,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
One can test the speed on a bigger dataset.
##larger dataset with lots of sparse graphs.
set.seed(100)
p1 <- as.character(sample(1:10000000, 1000000, replace=T))
p2 <- as.character(sample(1:10000000, 1000000, replace=T))
val <- rep(1, 1000000)
df <- data.frame("p1" = p1,
"p2" = p2,
"val" = val)
small_graph <- reduce_graph(df, "9420672") #has 3 pairwise connections
g <- graph_from_data_frame(d = small_graph,
directed = FALSE)
sg <- groups(components(g))
sg <- sapply(sg,
function(x) induced_subgraph(graph = g,
vids = x),
USE.NAMES = FALSE,
simplify = FALSE)
Building groups and subgraph takes one second, compared to multiple minutes for the whole graph on my machine. This of course depends on how sparsely connected the graphs are.
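For the case where the edge list itself does not fit in memory, a different technique from the one above can be sketched: a union-find (disjoint-set) structure that is fed the pairs one chunk at a time. This is only a sketch under assumptions. The names `make_union_find`, `unite` and `chunk_size` are made up, the chunks are taken from an in-memory data frame here rather than from a file or database, vertex labels are assumed to be non-empty strings, and the set of vertex labels (though not the edges) must still fit in memory.

```r
# Hypothetical union-find keyed by vertex label; parents live in an environment
make_union_find <- function() {
  parent <- new.env(hash = TRUE)
  find <- function(v) {
    if (!exists(v, envir = parent, inherits = FALSE)) assign(v, v, envir = parent)
    root <- v
    while (get(root, envir = parent) != root) root <- get(root, envir = parent)
    # Path compression: point every vertex on the path directly at the root
    while (get(v, envir = parent) != root) {
      nxt <- get(v, envir = parent)
      assign(v, root, envir = parent)
      v <- nxt
    }
    root
  }
  unite <- function(a, b) {
    ra <- find(a)
    rb <- find(b)
    if (ra != rb) assign(ra, rb, envir = parent)
    invisible(NULL)
  }
  components <- function() {
    vs <- ls(parent, all.names = TRUE)
    roots <- vapply(vs, find, character(1))
    unname(split(vs, roots))  # one character vector of labels per component
  }
  list(find = find, unite = unite, components = components)
}

# Toy edge list from the question, processed in chunks of 2 rows
df <- data.frame(p1 = c("a", "a", "d", "d"),
                 p2 = c("b", "c", "e", "f"))
uf <- make_union_find()
chunk_size <- 2
for (start in seq(1, nrow(df), by = chunk_size)) {
  chunk <- df[start:min(start + chunk_size - 1, nrow(df)), ]
  for (i in seq_len(nrow(chunk))) uf$unite(chunk$p1[i], chunk$p2[i])
}
comps <- uf$components()
print(comps)
```

Each component's label set could then be used to pull only that component's rows from the external store and build its subgraph with igraph. The row-by-row loop in R is slow, so for billions of pairs the same idea would probably need a compiled implementation.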

How can I set the size and/or color of symbols in an R plotrix::polar.plot on the basis of a data frame column?

I have a data frame "df" with three columns: distance, azimuth, intensity.
Through plotrix::polar.plot I got a plot using the following code
polar.plot(df$distance, df$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=4, cex = 1.2)
Is there a way to have the size (or color) of symbols changing with the "intensity" column value?
Since I didn't find any "direct" solution, I divided the data frame into several data frames using subset() and then plotted each one with a different "cex"/"point.col" setting, with the "add" parameter set to TRUE.
df3 <- subset(outPutDeg, intensity <= 3)
df5 <- subset(outPutDeg, intensity > 3 & intensity <= 5)
df7 <- subset(outPutDeg, intensity > 5 & intensity <= 7)
df9 <- subset(outPutDeg, intensity > 7 & intensity <= 9)
df11 <- subset(outPutDeg, intensity > 9)
polar.plot(df3$distance, df3$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=5, cex = 0.6)
polar.plot(df5$distance, df5$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=4, cex = 0.6, add = T)
polar.plot(df7$distance, df7$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=3, cex = 0.6, add = T)
polar.plot(df9$distance, df9$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=2, cex = 0.6, add = T)
polar.plot(df11$distance, df11$azimuth, radial.lim=c(0,450),start=90,rp.type = "s", clockwise=TRUE, point.col=1, cex = 0.6, add = T)
Crude and effective.
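The same binning can be written more compactly with cut() and a loop, as a sketch. The data below is randomly generated for illustration, the column `bin` and the vector `cols` are made up for this example, and the polar.plot arguments are copied from the calls above. The plotting part only runs if plotrix is installed.

```r
set.seed(1)  # arbitrary seed
# Made-up data with the three columns described in the question
outPutDeg <- data.frame(distance  = runif(50, 0, 450),
                        azimuth   = runif(50, 0, 360),
                        intensity = runif(50, 1, 12))

# Right-closed bins, matching the subset() conditions: <=3, (3,5], (5,7], (7,9], >9
breaks <- c(-Inf, 3, 5, 7, 9, Inf)
outPutDeg$bin <- cut(outPutDeg$intensity, breaks = breaks, labels = FALSE)

cols <- c(5, 4, 3, 2, 1)  # one color per bin, same order as in the calls above

if (requireNamespace("plotrix", quietly = TRUE)) {
  first <- TRUE
  for (b in sort(unique(outPutDeg$bin))) {
    part <- outPutDeg[outPutDeg$bin == b, ]
    # The first call draws the plot, later calls add to it
    plotrix::polar.plot(part$distance, part$azimuth, radial.lim = c(0, 450),
                        start = 90, rp.type = "s", clockwise = TRUE,
                        point.col = cols[b], cex = 0.6, add = !first)
    first <- FALSE
  }
}
```

Changing the number of classes then only requires editing `breaks` and `cols`; a vector of cex values indexed by `b` would vary the symbol size in the same way.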

How to generate n MarkovChain sequences of 25 transitions each

I have a transition matrix "T" and would like to produce 20 different sequences of 25 states each.
I have the markovchain package and have tried the following:
lapply(1:20,markovchainSequence(n = 25, markovchain = T, t0 = "In"))
but it says that markovchainSequence is not a function. Is there a way around this, please?
A reproducible example would really help here, but I think this does the job. You may just need a bigger transition matrix.
set.seed(123)
statesNames <- c("a", "b", "c") #easier with three states
t <- new("markovchain", states = statesNames,
transitionMatrix = matrix(c(0.2, 0.5, 0.3, 0, 0.2, 0.8, 0.1, 0.8, 0.1),
nrow = 3, byrow = TRUE, dimnames = list(statesNames, statesNames)))
mchain = function(n){
markovchainSequence(n = n, markovchain = t, t0 = "a")
}
lapply(rep(25, each=20), mchain) # you may change 25 to desired number
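The error in the question arises because lapply() expects a function as its second argument, whereas the original call passes the result of calling markovchainSequence(). Wrapping the call in a function, as above, fixes it. The sketch below shows the same pattern with an anonymous function. To keep it runnable without the markovchain package it uses a made-up base R stand-in, `simulate_chain`, which is not part of any package. With the package loaded, the body of the anonymous function would instead be the markovchainSequence() call.

```r
statesNames <- c("a", "b", "c")
P <- matrix(c(0.2, 0.5, 0.3,
              0,   0.2, 0.8,
              0.1, 0.8, 0.1),
            nrow = 3, byrow = TRUE, dimnames = list(statesNames, statesNames))

# Hypothetical stand-in: n states drawn after the initial state t0
simulate_chain <- function(n, P, t0) {
  out <- character(n)
  current <- t0
  for (i in seq_len(n)) {
    # The next state is drawn from the row of P for the current state
    current <- sample(colnames(P), 1, prob = P[current, ])
    out[i] <- current
  }
  out
}

set.seed(123)  # arbitrary seed
# The second argument is a function; lapply() calls it once per element of 1:20
seqs <- lapply(1:20, function(i) simulate_chain(25, P, t0 = "a"))
length(seqs)
```

`replicate(20, simulate_chain(25, P, t0 = "a"), simplify = FALSE)` would give the same kind of result without the unused index argument.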
