Modifying a R for loop to work with three variables


I've got a dataset that I need to remove outliers from. The data are from a repeated-measures experiment with 105 subjects. The code I am using only removes outliers from one of the conditions in the study ("markLiturEndurt") but not the other two ("markStadsEndurt" and "litStaEndurt"). Each condition is binary, and its levels are stored in "Skilyrdi".
The code I am using looks like this:
(Skilyrdi<-unique(gogn$markLiturEndurt))
(fjoldiRada<-length(gogn$subject))
(fjoldiSkil<-length(Skilyrdi))
gognHrein<-0
for (i in 1:length(Skilyrdi))
{
gognSkil<-subset(gogn,gogn$markLiturEndurt==Skilyrdi[i])
numerThatt<-unique(gognSkil$subject)
for (j in 1:length(numerThatt))
{
gognThatt<-subset(gognSkil,gognSkil$subject==numerThatt[j])
(medalST<-mean(gognThatt$areitiRT))
(sfST<-sd(gognThatt$areitiRT))
(nedriMork<-100)
(efriMork<-medalST+3*sfST)
gognThatt<-subset(gognThatt,gognThatt$areitiRT>nedriMork)
gognThatt<-subset(gognThatt,gognThatt$areitiRT<efriMork)
gognHrein<-rbind(gognHrein,gognThatt)
}
}
I've tried messing around with the code, but my limited R programming skills got the better of me. I'm pretty sure it should be possible either to modify my code so that it checks for outliers in all the conditions, or to create additional for loops for each condition and then bind the results together.
subject umferd areitiACC areitiRT markLiturEndurt markStadsEndurt litStaEndurt
117 1 1 1202 0 0 0
117 2 1 924 0 1 0
117 4 1 1139 0 0 0
117 5 1 1211 0 0 0
117 6 1 998 1 1 0
117 7 1 778 0 1 0
"areitiRT" is reaction time the dependent variable. "umferð" is each round, "areitiACC" is correct or wrong answer. The other three variables are the independent variables.
What I would like to achieve:
The dependent variable (DV) of interest is areitiRT, the reaction time. The independent variables (IVs) I am working with are markLiturEndurt, markStadsEndurt and litStaEndurt. Each independent variable can take two values: "0" (IV changed) and "1" (IV held constant).
I want to see how reaction time differs under each condition that the IVs can take by doing an ANOVA. Before I can do the ANOVA I need to clean the data so that they do not contain any extreme values. For each subject, every reaction time (areitiRT) under a given condition of an IV needs to be checked against that subject's mean reaction time under that condition. Reaction times less than 100ms or greater than "mean + 3 * Standard Deviation" need to be removed.

It's taken me a week to get back to you because I've been on holiday, so sorry about the delay.
I'm still not entirely sure what you need, so here are some options. If you clarify what you need by editing your question I can edit my answer appropriately.
Example dataset
Use the following dataset, gogn, as an example:
gogn <- structure(list(subject = c(117L, 117L, 117L, 117L, 117L, 117L,
117L, 118L, 118L, 118L, 118L, 118L, 118L), umferd = c(1L, 2L,
4L, 5L, 6L, 7L, 7L, 1L, 2L, 4L, 5L, 6L, 7L), areitiACC = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), areitiRT = c(1202L,
924L, 1139L, 1211L, 998L, 778L, 53L, 1202L, 924L, 1139L, 1211L,
1024L, 778L), markLiturEndurt = c(0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L), markStadsEndurt = c(0L, 1L, 0L, 0L,
1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L), litStaEndurt = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("subject",
"umferd", "areitiACC", "areitiRT", "markLiturEndurt", "markStadsEndurt",
"litStaEndurt"), class = "data.frame", row.names = c(NA, -13L
))
Remove reaction times <100ms
gogn <- gogn[gogn$areitiRT >= 100, ]
Or, if you want to use dplyr:
library(dplyr)
gogn <- filter(gogn, areitiRT >= 100)
In both cases this removes reaction times of less than 100ms (i.e. it keeps those of 100ms or greater).
Reaction times 3 * sd + mean
This is a bit more involved, but not too difficult. First, we need a list of all unique subjects:
uSubject <- as.list(unique(gogn$subject))
Then we create a list which contains the value of 3 * sd + mean for each subject:
outliers <- lapply(uSubject, function(x){
3 * sd(gogn$areitiRT[gogn$subject == x]) +
mean(gogn$areitiRT[gogn$subject == x])
})
Then I prefer to put this back into the data frame (gogn) as a new column for easy use:
for(i in seq_along(outliers)){
# use [[ ]] to extract the plain values, so that outlier is a numeric column rather than a list
gogn$outlier[gogn$subject == uSubject[[i]]] <- outliers[[i]]
}
Then we can compare each individual reaction time with the 'outlier value' we just calculated and, if it's greater, ditch it:
gogn <- gogn[which(gogn$areitiRT < gogn$outlier), ]
# this removes any cases where reaction time is NOT less than the limit
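If the limits are meant to be computed separately for each subject within each level of a condition, as in the loop in the question, the same idea can be sketched with ave(). This is only a sketch under assumptions: remove_outliers() and the toy data frame gogn_demo are hypothetical names that are not part of the original code, and each of the three conditions is cleaned separately, giving one cleaned data frame per condition.

```r
# Hypothetical helper: keep rows with lower <= RT < mean + n_sd * sd,
# where mean and sd are computed per subject within each level of `condition`.
remove_outliers <- function(data, condition, lower = 100, n_sd = 3) {
  grp <- interaction(data$subject, data[[condition]], drop = TRUE)
  upper <- ave(as.numeric(data$areitiRT), grp,
               FUN = function(rt) mean(rt) + n_sd * sd(rt))
  keep <- data$areitiRT >= lower & data$areitiRT < upper
  # which() drops rows whose limit is NA (groups with a single observation)
  data[which(keep), , drop = FALSE]
}

# Toy data (made up for illustration): each subject has eleven ordinary
# trials, one very slow trial and one very fast trial.
gogn_demo <- data.frame(
  subject         = rep(1:2, each = 13),
  areitiRT        = rep(c(rep(500, 11), 5000, 50), times = 2),
  markLiturEndurt = 0L,
  markStadsEndurt = 0L,
  litStaEndurt    = 0L
)

conditions <- c("markLiturEndurt", "markStadsEndurt", "litStaEndurt")
cleaned <- lapply(conditions, remove_outliers, data = gogn_demo)
names(cleaned) <- conditions
```

Each element of cleaned can then be passed to the ANOVA for its own condition. As in the question's loop, the mean and standard deviation are computed before the fast trials are removed.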
Is this answer along the lines of what you need to do?

Related

How to do multiple arithmetic operations according to conditions between two datasets in R

I have several datasets.
The first one
lid=structure(list(x1 = 619490L, x2 = 10L, x3 = 0L, x4 = 6089230L,
x5 = 0L, x6 = -10L), class = "data.frame", row.names = c(NA,
-1L))
second dataset
lidar=structure(list(A = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), B = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
C = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
How can I subtract the lid values from each row of the lidar dataset using the formula
(lidar$A-lid$x1)/lid$x3
then
(lidar$B-lid$x4)/lid$x6
So for the first row the result will be
(lidar$A-lid$x1)/lid$x3 = 1874,876, but everything after the comma is discarded, leaving 1874
(lidar$B-lid$x4)/lid$x6 = 1122
Also, for column lidar$C in the lidar dataset,
subtract the smallest value from the largest value. In this case lidar$c11-lidar$c1=5,5
So the desired output will be
A B C Intensity ReturnNumber NumberOfReturns row col subs(lidar$Cmax-lidar$Cmin)
638238.76 6078001.09 186.64 13 2 2 1874 1122 5,5
638238.76 6078001.09 186.59 99 1 1 1874 1122 5,5
638239.29 6078001.15 199.28 5 1 3 1874 1122 5,5
638235.39 6078001.15 189.37 2 2 2 1874 1122 5,5
The result of the subtraction (lidar$Cmax-lidar$Cmin) is always the same for all rows.
row and col are the results of this arithmetic:
(lidar$A-lid$x1)/lid$x3 (row)
then
(lidar$B-lid$x4)/lid$x6 (col)
Including the part after the comma, these values (row and col) differ between rows, but once the part after the comma is removed they look the same.
How can I get the desired output from these arithmetic operations?
Any help is valuable. Thank you.

If I understand your purpose correctly, the main question is how to remove the part after comma, which is a decimal separator in your examples.
If that's true, one way of doing that is to split the number into two parts, one which comes before the comma and another one which comes after it, and then extract only the first part. In R you can do this by strsplit(). However, this function requires the input to be characters, not numerics. So, you need to coerce the numbers into characters, do the splitting, coerce the result back to numbers, and then extract its first element.
Here is an example of a function to implement the steps:
remove_after_comma <- function(num_with_comma){
myfun <- function(num_with_comma) {
num_with_comma|>
as.character() |>
strsplit("[,|.]") |>
unlist() |>
as.numeric() |>
getElement(1)
}
vapply(num_with_comma, myfun, FUN.VALUE = numeric(1))
}
Notes:
[,|.] is used to anticipate other systems that use . instead of , as the decimal separator.
vapply is used so that the function can be applied to a numeric vector, such as a numeric column.
Check:
remove_after_comma(c(a = '1,5', b = '12,74'))
# a b
# 1 12
(4:10)/3
#[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
remove_after_comma ((4:10)/3)
#[1] 1 1 2 2 2 3 3
Assuming that lid$x3 = 10L in your example:
(lidar$A-lid$x1)/lid$x3
#[1] 1874.876 1874.876 1874.929 1874.539 1874.386 1874.386 1874.555 1874.197
#[9] 1874.191 1873.841
remove_after_comma((lidar$A-lid$x1)/lid$x3)
#[1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1873
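For values that are already numeric, base R's trunc() drops the fractional part directly, without a round trip through character strings. A minimal sketch, using the first and last rows of the question's lidar data and the same assumption as above that x3 is 10 (the variable names row_idx, col_idx and c_range are made up for illustration):

```r
# First and last rows of the question's lidar data
A <- c(638238.76, 638228.41)
B <- c(6078001.09, 6078001.01)
C <- c(186.64, 192.14)

# Values from lid; x3 = 10 is an assumption (the question's dput has x3 = 0)
x1 <- 619490;  x3 <- 10
x4 <- 6089230; x6 <- -10

row_idx <- trunc((A - x1) / x3)   # 1874 1873
col_idx <- trunc((B - x4) / x6)   # 1122 1122
c_range <- max(C) - min(C)        # largest minus smallest value of C
```

trunc() rounds toward zero, so it differs from floor() for negative inputs and from round(), which would turn 1874.876 into 1875.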

I'm not sure if this is what you mean. Note that trunc() drops the decimal part, whereas round() would turn 1874.876 into 1875:
lidar$row <- trunc((lidar$A-lid$x1)/lid$x3)
lidar$col <- trunc((lidar$B-lid$x4)/lid$x6)
lidar$cdif <- max(lidar$C)-min(lidar$C)

Weighted mean using aggregated

Sorry for asking what might be a very basic question, but I am stuck in a conundrum and cannot seem to get out of it.
I have data that look like this:
Medicine Biology Business sex weights
0 1 0 1 0.5
0 0 1 0 1
1 0 0 1 05
0 1 0 0 0.33
0 0 1 0 0.33
1 0 0 1 1
0 1 0 0 0.33
0 0 1 1 1
1 0 0 1 1
The first three columns are fields of study, and the fourth variable is gender. Obviously there are many more observations.
What I want to get is the mean level of each field of study (medicine, biology, business) by the variable sex (so the mean for men and the mean for women). To do so, I have used the following code:
barplot_sex<-aggregate(x=df_dummies[,1:19] , by=list(df$sex),
FUN= function(x) mean(x))
This works perfectly and gives me what I needed. My problem is that I now need to use a weighted mean, but I cannot use
FUN= function(x) weighted.mean(x, weights)
as there are many more observations than fields of study.
The only alternative I managed was to use edit(boxplot) and change the values manually, but then R doesn't save the changes. Plus, I am sure there must be a trivial way to do exactly what I need.
Any help would be greatly appreciated.
Bests,
Gabriele

Using by.
by(dat, dat$sex, function(x) sapply(x[, 1:3], weighted.mean, x[, "weights"]))
# dat$sex: 0
# Medicine Biology Business
# 0.0000000 0.3316583 0.6683417
# ---------------------------------------------------------------------------------------
# dat$sex: 1
# Medicine Biology Business
# 0.82352941 0.05882353 0.11764706
Data:
dat <- structure(list(Medicine = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L
), Biology = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Business = c(0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), sex = c(1L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), weights = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
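If the result is needed as a plain matrix (for example to pass to barplot()), the same per-group weighted means can be collected with split() and sapply(). This is a sketch of one possible approach, not the only one; the names fields and wm are made up for illustration, and the data repeat the example above.

```r
# Same example data as above, written out as a data frame
dat <- data.frame(
  Medicine = c(0, 0, 1, 0, 0, 1, 0, 0, 1),
  Biology  = c(1, 0, 0, 1, 0, 0, 1, 0, 0),
  Business = c(0, 1, 0, 0, 1, 0, 0, 1, 0),
  sex      = c(1, 0, 1, 0, 0, 1, 0, 1, 1),
  weights  = c(0.5, 1, 5, 0.33, 0.33, 1, 0.33, 1, 1)
)

fields <- c("Medicine", "Biology", "Business")

# Split the rows by sex, then take the weighted mean of every field using
# only the weights that belong to the rows in that group.
wm <- t(sapply(split(dat, dat$sex), function(d) {
  sapply(d[fields], weighted.mean, w = d$weights)
}))
```

wm has one row per level of sex and one column per field, so barplot(wm, beside = TRUE) would draw the groups side by side.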

String matching where strings contain punctuation

I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = (df[j,i+4]) # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))

I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = "\\bim\\b")
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
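If stripping the punctuation is not desirable, the counts can also be sketched in base R with gregexpr(), keeping the apostrophe in the pattern. This is a hedged alternative rather than the approach above: count_matches() is a hypothetical helper, the three example strings are taken from the question's data, and the lookahead (?!') is an assumption about the desired behaviour, namely that the "I" inside "I'm" should not also be counted as "I".

```r
# Hypothetical helper: number of case-insensitive matches of `pattern`
# in each element of `text`.
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, as.character(text), ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

text <- c("I'm alright", "I am a good person.", "No, I'm not in a mood")

patterns <- c(I  = "\\bI\\b(?!')",   # a standalone "I", not followed by an apostrophe
              IM = "\\bI'm\\b",
              A  = "\\ba\\b")

counts <- sapply(patterns, count_matches, text = text)
```

counts is a matrix with one row per text and one column per pattern, so it can be assigned into the data frame by column name in the same way as results above.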

I was able to get my code working by writing the pattern in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')

result of rpart is a root, but data shows Information Gain

I have a dataset with an event rate of less than 3% (i.e. there are about 700 records with class 1 and 27000 records with class 0).
ID V1 V2 V3 V5 V6 Target
SDataID3 161 ONE 1 FOUR 0 0
SDataID4 11 TWO 2 THREE 2 1
SDataID5 32 TWO 2 FOUR 2 0
SDataID7 13 ONE 1 THREE 2 0
SDataID8 194 TWO 2 FOUR 0 0
SDataID10 63 THREE 3 FOUR 0 1
SDataID11 89 ONE 1 FOUR 0 0
SDataID13 78 TWO 2 FOUR 0 0
SDataID14 87 TWO 2 THREE 1 0
SDataID15 81 ONE 1 THREE 0 0
SDataID16 63 ONE 3 FOUR 0 0
SDataID17 198 ONE 3 THREE 0 0
SDataID18 9 TWO 3 THREE 0 0
SDataID19 196 ONE 2 THREE 2 0
SDataID20 189 TWO 2 ONE 1 0
SDataID21 116 THREE 3 TWO 0 0
SDataID24 104 ONE 1 FOUR 0 0
SDataID25 5 ONE 2 ONE 3 0
SDataID28 173 TWO 3 FOUR 0 0
SDataID29 5 ONE 3 ONE 3 0
SDataID31 87 ONE 3 FOUR 3 0
SDataID32 5 ONE 2 THREE 1 0
SDataID34 45 ONE 1 FOUR 0 0
SDataID35 19 TWO 2 THREE 0 0
SDataID37 133 TWO 2 FOUR 0 0
SDataID38 8 ONE 1 THREE 0 0
SDataID39 42 ONE 1 THREE 0 0
SDataID43 45 ONE 1 THREE 1 0
SDataID44 45 ONE 1 FOUR 0 0
SDataID45 176 ONE 1 FOUR 0 0
SDataID46 63 ONE 1 THREE 3 0
I am trying to find splits using a decision tree, but the resulting tree consists of only the root node.
> library(rpart)
> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class")
> printcp(tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class")
Variables actually used in tree construction:
character(0)
Root node error: 749/18239 = 0.041066
n= 18239
CP nsplit rel error xerror xstd
1 0 0 1 0 0
After reading most of the resources on StackOverflow, I loosened/tweaked the control parameters which gave me the desired decision tree.
> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class" ,control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
> printcp(tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)),
method = "class", control = rpart.control(minsplit = 1, minbucket = 2,
cp = 2e-05))
Variables actually used in tree construction:
[1] V5 V2 V1
[4] V3 V6
Root node error: 749/18239 = 0.041066
n= 18239
CP nsplit rel error xerror xstd
1 0.00024275 0 1.00000 1.0000 0.035781
2 0.00019073 20 0.99466 1.0267 0.036235
3 0.00016689 34 0.99199 1.0307 0.036302
4 0.00014835 54 0.98798 1.0334 0.036347
5 0.00002000 63 0.98665 1.0427 0.036504
When I pruned the tree it resulted in a tree with a single node.
> pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
> printcp(pruned.tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)),
method = "class", control = rpart.control(minsplit = 1, minbucket = 2,
cp = 2e-05))
Variables actually used in tree construction:
character(0)
Root node error: 749/18239 = 0.041066
n= 18239
CP nsplit rel error xerror xstd
1 0.00024275 0 1 1 0.035781
The tree should not consist of only the root node, because mathematically, at a given node (example provided), we are getting information gain. I don't know whether I am making a mistake when pruning or whether rpart has an issue handling datasets with a low event rate.
NODE p 1-p Entropy Weights Ent*Weight # Obs
Node 1 0.032 0.968 0.204324671 0.351398601 0.071799404 10653
Node 2 0.05 0.95 0.286396957 0.648601399 0.185757467 19663
Sum(Ent*wght) 0.257556871
Information gain 0.742443129

The data you provided does not reflect the ratio of the two target classes, so I've tweaked the data to better reflect that (see Data section):
> prop.table(table(train$Target))
0 1
0.96707581 0.03292419
> 700/27700
[1] 0.02527076
The ratios are now relatively close...
library(rpart)
tree <- rpart(Target ~ ., data=train, method="class")
printcp(tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class")
Variables actually used in tree construction:
character(0)
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 0 0 1 0 0
Now, the reason that you are seeing only the root node for your first model is probably that you have extremely imbalanced target classes, so your independent variables could not provide enough information to grow the tree. My sample data has a 3.3% event rate, but yours has only about 2.5%!
As you have mentioned, there is a way to force rpart to grow the tree: override the default complexity parameter (cp). The complexity measure is a combination of the size of the tree and how well the tree separates the target classes. From ?rpart.control, "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted". This means that your data at this point offer no split beyond the root node that decreases the lack of fit enough for rpart to attempt it. We can relax this threshold of what is considered "enough" by setting either a low or a negative cp (a negative cp basically forces the tree to grow to its full size).
tree <- rpart(Target ~ ., data=train, method="class" ,parms = list(split = 'information'),
control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
printcp(tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))
Variables actually used in tree construction:
[1] ID V1 V2 V3 V5 V6
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 4.1118e-04 0 1.00000 1.0000 0.032564
2 3.6550e-04 30 0.98355 1.0285 0.033009
3 3.2489e-04 45 0.97807 1.0702 0.033647
4 3.1328e-04 106 0.95504 1.0877 0.033911
5 2.7412e-04 116 0.95175 1.1031 0.034141
6 2.5304e-04 132 0.94737 1.1217 0.034417
7 2.1930e-04 149 0.94298 1.1458 0.034771
8 1.9936e-04 159 0.94079 1.1502 0.034835
9 1.8275e-04 181 0.93640 1.1645 0.035041
10 1.6447e-04 193 0.93421 1.1864 0.035356
11 1.5664e-04 233 0.92654 1.1853 0.035341
12 1.3706e-04 320 0.91228 1.2083 0.035668
13 1.2183e-04 344 0.90899 1.2127 0.035730
14 9.9681e-05 353 0.90789 1.2237 0.035885
15 2.0000e-05 364 0.90680 1.2259 0.035915
As you can see, the tree has kept splitting until no further split improves the fit by at least cp. Two things to note:
At zero nsplit, CP is already as low as 0.0004, whereas the default cp in rpart is 0.01.
Starting from nsplit == 0, the cross-validation error (xerror) tends to increase as the number of splits increases.
Both of these indicate that any tree beyond nsplit == 0 is overfitting the data: the extra splits improve the fit too little (every CP value is far below the default) to reduce the cross-validation error. That said, the root-node model is the best model in this case, which explains why your initial model has only the root node.
pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree)
Results in:
Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"),
control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))
Variables actually used in tree construction:
character(0)
Root node error: 912/27700 = 0.032924
n= 27700
CP nsplit rel error xerror xstd
1 0.00041118 0 1 1 0.032564
As for the pruning part, it is now clearer why your pruned tree is the root-node tree: any tree that goes beyond 0 splits has a higher cross-validation error, so taking the tree with the minimum xerror leaves you with the root-node tree, as expected.
Information gain basically tells you how much "information" is added for each split. So technically, every split has some degree of information gain since you are adding more variables into your model (information gain is always non-negative). What you should think about is whether that additional gain (or no gain) reduces the errors enough for you to warrant a more complex model. Hence, the tradeoff between bias and variance.
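To put a number on this for the two child nodes in the question's table: there, the figure labelled information gain is 1 minus the weighted child entropy, whereas the usual definition subtracts the weighted child entropy from the entropy of the parent node. A minimal sketch in base R, assuming the parent node is simply the union of the two listed nodes (entropy() is a helper written for this illustration, not an rpart function):

```r
# Binary entropy in bits for an event probability p
entropy <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]            # treat 0 * log2(0) as 0
  -sum(q * log2(q))
}

# Child nodes from the question's table: event rate and number of observations
p_child <- c(0.032, 0.05)
n_child <- c(10653, 19663)

w          <- n_child / sum(n_child)              # node weights
p_parent   <- sum(w * p_child)                    # event rate before the split
h_children <- sum(w * sapply(p_child, entropy))   # weighted child entropy
gain       <- entropy(p_parent) - h_children      # information gain of the split
```

Here h_children reproduces the 0.2576 from the question's table, while gain comes out as a small positive number (well below 0.01 bits), which is consistent with rpart finding no split worth keeping.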
In this case, it doesn't really make sense to reduce cp and later prune the resulting tree: by setting a low cp you are telling rpart to make splits even if they overfit, while pruning "cuts" all the nodes that overfit.
Data:
Note that I am sampling each column independently (with replacement) instead of sampling row indices. This is because the data you provided are probably not a random sample of your original dataset (likely biased), so I am basically creating new observations at random from combinations of the values in your existing rows, which would hopefully reduce that bias.
init_train = structure(list(ID = structure(c(16L, 24L, 29L, 30L, 31L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L), .Label = c("SDataID10",
"SDataID11", "SDataID13", "SDataID14", "SDataID15", "SDataID16",
"SDataID17", "SDataID18", "SDataID19", "SDataID20", "SDataID21",
"SDataID24", "SDataID25", "SDataID28", "SDataID29", "SDataID3",
"SDataID31", "SDataID32", "SDataID34", "SDataID35", "SDataID37",
"SDataID38", "SDataID39", "SDataID4", "SDataID43", "SDataID44",
"SDataID45", "SDataID46", "SDataID5", "SDataID7", "SDataID8"), class = "factor"),
V1 = c(161L, 11L, 32L, 13L, 194L, 63L, 89L, 78L, 87L, 81L,
63L, 198L, 9L, 196L, 189L, 116L, 104L, 5L, 173L, 5L, 87L,
5L, 45L, 19L, 133L, 8L, 42L, 45L, 45L, 176L, 63L), V2 = structure(c(1L,
3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 2L,
1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("ONE", "THREE", "TWO"), class = "factor"),
V3 = c(1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 3L, 3L,
2L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L), V5 = structure(c(1L, 3L, 1L, 3L, 1L, 1L, 1L,
1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 2L, 1L, 2L, 1L, 3L,
1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L), .Label = c("FOUR", "ONE",
"THREE", "TWO"), class = "factor"), V6 = c(0L, 2L, 2L, 2L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 3L, 0L,
3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 3L), Target = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
)), .Names = c("ID", "V1", "V2", "V3", "V5", "V6", "Target"
), class = "data.frame", row.names = c(NA, -31L))
set.seed(1000)
train = as.data.frame(lapply(init_train, function(x) sample(x, 27700, replace = TRUE)))

Adding counts in matrix in R

My issue is that within a loop, for every i, a matrix like this is output:
structure(c(8L, 4L, 3L, 4L, 1L, 8L, 28L, 32L, 24L, 32L, 8L, 64L,
0L, 6L, 12L, 16L, 4L, 32L, 0L, 0L, 3L, 12L, 3L, 24L, 0L, 0L,
0L, 6L, 4L, 32L, 0L, 0L, 0L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 0L,
28L), .Dim = 6:7, .Dimnames = structure(list(c("ESN", "GWD",
"LWK", "MSL", "PEL", "YRI"), c("ACB", "ESN", "GWD", "LWK", "MSL",
"PEL", "YRI")), .Names = c("", "")), class = "table")
This matrix counts pairwise sharing. These counts should now be added to a larger table with more levels than the 7 present in this table. It is always a symmetric matrix, so the upper triangle can be neglected.
The real table (in which all elements are 0 at the beginning):
matr<-matrix(0,nrow=26,ncol=26)
pop<-c("CHB","JPT","CHS","CDX","KHV","CEU","TSI","FIN","GBR","IBS","YRI","LWK","GWD","MSL","ESN","ASW","ACB","MXL","PUR","CLM","PEL","GIH","PJL","BEB","STU","ITU")
rownames(matr)<-pop
colnames(matr)<-pop
Can somebody tell me how I can add these counts from the small table to the large table (in the correct cells) in an efficient way? I need to update the table 100k times, so efficiency matters. As mentioned, adding to the lower triangle only is fine.
EDIT
Another data set might look like this (it would be generated by the next iteration of the loop):
structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), .Dim = c(3L,
3L), .Dimnames = structure(list(c("IBS", "MXL", "TSI"), c("GBR",
"IBS", "MXL")), .Names = c("", "")), class = "table")
This should then also be added to matr; if a cell already holds a number, the two numbers should be summed.
Thanks

Taking into account duplicate/non-equal/non-zero entries in each "table" created through the iterations, and updating only the lower.tri of "matr":
library(Matrix) # provides sparseMatrix()
for(tab in tabs) {
## if each 'tab' is large enough,
## instead of creating (and subsetting with) 'row(tab)' and 'col(tab)'
##, a 'rep(, each = )' could be used
i = match(rownames(tab), rownames(mat))[row(tab)]
j = match(colnames(tab), colnames(mat))[col(tab)]
## to fill only the 'lower.tri'
ii = pmax(i, j); jj = pmin(i, j)
## sum duplicate entries 'tab' with 'sparseMatrix's intrinsic 'xtabs'-like behaviour
ijx = summary(sparseMatrix(ii, jj, x = c(tab)))
## subset and assign with a matrix index updating previous entries
ij = cbind(ijx$i, ijx$j)
mat[ij] = mat[ij] + ijx$x
}
mat
# a b c d e
#a 0 0 0 0 0
#b 4 1 0 0 0
#c 6 7 2 0 0
#d 5 12 5 7 0
#e 4 6 3 3 0
where "tabs" is a "list" containing the -iteratively- created "table"s:
set.seed(007)
tabs = replicate(3, table(replicate(2,
sample(letters[1:5], 50, TRUE), simplify = FALSE))[
sample(5, sample(2:5, 1)), sample(5, sample(2:5, 1))],
simplify = FALSE)
and "mat" is a smaller "matr":
mat = matrix(0L, 5, 5, dimnames = replicate(2, letters[1:5], simplify = FALSE))
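The same update can also be sketched without the Matrix package, by folding each entry onto the lower triangle and summing entries that land on the same cell with tapply(). This is a hedged base R variant of the loop above, not a benchmarked replacement; add_counts() and the two small example tables are hypothetical.

```r
# Hypothetical helper: add the counts in `tab` to the lower triangle of `mat`,
# matching cells by row and column names.
add_counts <- function(mat, tab) {
  i <- match(rownames(tab), rownames(mat))[row(tab)]
  j <- match(colnames(tab), colnames(mat))[col(tab)]
  ii <- pmax(i, j); jj <- pmin(i, j)      # fold everything onto the lower triangle
  lin <- (jj - 1L) * nrow(mat) + ii       # column-major linear index into mat
  sums <- tapply(as.vector(tab), lin, sum) # sum entries that share a cell
  idx <- as.integer(names(sums))
  mat[idx] <- mat[idx] + as.vector(sums)
  mat
}

mat_demo <- matrix(0, 3, 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Two small example tables with different subsets of the levels
tab1 <- matrix(1:4, 2, 2, dimnames = list(c("b", "c"), c("a", "b")))
tab2 <- matrix(1:4, 2, 2, dimnames = list(c("a", "b"), c("a", "b")))

# Apply the updates one after another, starting from the empty matrix
mat_demo <- Reduce(add_counts, list(tab1, tab2), mat_demo)
```

In tab2 the cells (b, a) and (a, b) both fold onto (b, a), so their counts 2 and 3 are summed before being added to the 1 already stored there.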
