Repeat and while problems in loop with some condition - r

In my artificial problem, I need to remove the empty values created during a sampling process and resample until at least one value remains (nrow(s.df)>0). If the first sample satisfies the condition, I keep the result (res[[i]] <- s.df); if not, I need to sample again. For this I tried combining repeat and while with else, without success.
My example:
#Artificial data set
v0<-rnorm(20)
vNA<-rep(NA, 80)
v<-c(v0,vNA)
id<-1:100
df<-data.frame(id,v)
s_size<-c(1,2,3,4,5)
# Sampling using repeat
res<-list()
for(i in 1:length(s_size)){ # Loop for different sample size
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
s.df
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
else{
repeat{
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
if (nrow(res.circle)>0){break}
}
}
}
big_sample = do.call(rbind, res)
or
# Sampling using while
res<-list()
for(i in 1:length(s_size)){ # Loop
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
else{
while(nrow(res.circle)>0) {
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
}
big_sample = do.call(rbind, res)
This approach obviously doesn't work, but if I don't use the else{}, I will overwrite the results that already satisfied the first condition. Any ideas, please?

You could put the while loop inside the for loop and make it depend on the outcome of the if condition; then you don't need the else:
set.seed(42)
v0 <- rnorm(20)
vNA <- rep(NA, 80)
v <- c(v0, vNA)
id <- 1:100
df <- data.frame(id, v)
s_size <- c(1, 2, 3, 4, 5)
res <- list()
for (i in 1:length(s_size)) {
condition <- FALSE
while (condition == FALSE) {
s.df <- df[sample(nrow(df), 3),]
s.df <- s.df[complete.cases(s.df),]
if (nrow(s.df) > 0) {
res[[i]] <- s.df
condition <- TRUE
}
}
}
big_sample <- do.call(rbind, res)
big_sample
#> id v
#> 15 15 -0.13332134
#> 8 8 -0.09465904
#> 18 18 -2.65645542
#> 4 4 0.63286260
#> 6 6 -0.10612452
#> 2 2 -0.56469817
Created on 2020-06-11 by the reprex package (v0.3.0)
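Since the question also asks about repeat, the same idea can be sketched with repeat and an early return. This is only a sketch: the helper name sample_complete and the max_tries guard are assumptions added for illustration, not part of the original answer. The guard stops the loop from running forever if the data contain no complete rows.

```r
# Hypothetical helper (not from the original answer): draw `n` rows and
# retry until at least one row without NA survives.
sample_complete <- function(data, n = 3, max_tries = 1000) {
  tries <- 0
  repeat {
    tries <- tries + 1
    s <- data[sample(nrow(data), n), ]   # sample n rows
    s <- s[complete.cases(s), ]          # drop rows containing NA
    if (nrow(s) > 0) return(s)           # success: leave the loop and the function
    if (tries >= max_tries) stop("no complete row found in ", max_tries, " tries")
  }
}

set.seed(42)
df <- data.frame(id = 1:100, v = c(rnorm(20), rep(NA, 80)))
s_size <- c(1, 2, 3, 4, 5)

# One successful sample per element of s_size, as in the loop above
res <- lapply(seq_along(s_size), function(i) sample_complete(df))
big_sample <- do.call(rbind, res)
```

As in the question, every draw uses 3 rows; passing n = s_size[i] instead would make the sample size vary per iteration.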

Related

How to manually write a function which duplicates values in r?

I am learning how to do loops now, and trying to figure out how to write a function which duplicates arguments manually.
Essentially, I want to take something like this:
duplicate_easy <- function(x){
rep(c(x), c(x))
}
x1 <- c(3,1,9)
duplicate_easy(x1)
result: 3 3 3, 1, 9 9 9 9 9 9 9 9 9
And replace it with a for loop along the lines of,
duplicate <- function(x)
{
result <- NULL
for (i in rep(x) )
{
result <- c(result, rep(x))
}
return(result)
}
x1 <- c(3, 1, 9)
duplicate(x1)
Which is also intended to result in the same thing, but the above does not work.
Maybe this:
duplicate <- function(x)
{
result <- NULL
for (i in 1:length(x))
{
result <- c(result, rep(x[i], x[i]))
}
return(result)
}
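The attempt in the question fails because rep(x) with no further arguments returns x unchanged, so the loop iterates over the values 3, 1, 9 and appends the whole vector each time. A variation on the loop above, sketched under the assumption that x holds non-negative whole numbers, uses seq_along (which is safe for empty input, unlike 1:length(x)) and collects the pieces in a pre-sized list:

```r
# Sketch: repeat each element x[i] exactly x[i] times using an explicit loop
duplicate <- function(x) {
  pieces <- vector("list", length(x))      # one slot per element of x
  for (i in seq_along(x)) {
    pieces[[i]] <- rep(x[i], times = x[i]) # x[i] copies of x[i]
  }
  unlist(pieces)                           # flatten into a single vector
}

x1 <- c(3, 1, 9)
out <- duplicate(x1)
```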

for loop with lists R

I want to create two lists of data frames in a for loop, but I cannot use assign:
dat <- data.frame(name = c(rep("a", 10), rep("b", 13)),
x = c(1,3,4,4,5,3,7,6,5,7,8,6,4,3,9,1,2,3,5,4,6,3,1),
y = c(1.1,3.2,4.3,4.1,5.5,3.7,7.2,6.2,5.9,7.3,8.6,6.3,4.2,3.6,9.7,1.1,2.3,3.2,5.7,4.8,6.5,3.3,1.2))
a <- dat[dat$name == "a",]
b <- dat[dat$name == "b",]
samp <- vector(mode = "list", length = 100)
h <- list(a,b)
hname <- c("a", "b")
for (j in 1:length(h)) {
for (i in 1:100) {
samp[[i]] <- sample(1:nrow(h[[j]]), nrow(h[[j]])*0.5)
assign(paste("samp", hname[j], sep="_"), samp[[i]])
}
}
Instead of lists named samp_a and samp_b, I get vectors that contain only the result of the 100th sample. I want to get lists samp_a and samp_b that hold all the different samples for dat[dat$name == "a",] and dat[dat$name == "b",].
How could I do this?
How about creating two different lists and avoiding using assign:
Option 1:
# create empty list
samp_a <-list()
samp_b <- list()
for (j in seq(h)) {
# fill samp_a list
if(j == 1){
for (i in 1:100) {
samp_a[[i]] <- sample(1:nrow(h[[j]]), nrow(h[[j]])*0.5)
}
# fill samp_b list
} else if(j == 2){
for (i in 1:100) {
samp_b[[i]] <- sample(1:nrow(h[[j]]), nrow(h[[j]])*0.5)
}
}
}
You could use assign too, shorter answer:
Option 2:
for (j in seq(hname)) {
l = list()
for (i in 1:100) {
l[[i]] <- sample(1:nrow(h[[j]]), nrow(h[[j]])*0.5)
}
assign(paste0('samp_', hname[j]), l)
rm(l)
}
You could easily use lapply for this together with the rep function. Note that this keeps each x paired with its existing y; it does not pair a random x with a random y.
dat <- data.frame(name = c(rep("a", 10), rep("b", 13)),
x = c(1,3,4,4,5,3,7,6,5,7,8,6,4,3,9,1,2,3,5,4,6,3,1),
y = c(1.1,3.2,4.3,4.1,5.5,3.7,7.2,6.2,5.9,7.3,8.6,6.3,4.2,3.6,9.7,1.1,2.3,3.2,5.7,4.8,6.5,3.3,1.2))
a <- dat[dat$name == "a",]
b <- dat[dat$name == "b",]
h <- list(a,b)
hname <- c("a", "b")
testfunc <- function(df) {
#df[sample(nrow(df), nrow(df)*0.5), ] #gives you the values in your data frame
sample(nrow(df), nrow(df)*0.5) # just gives you the indices
}
lapply(h, testfunc) # Standard lapply: returns one sample for a and one for b
samp <- lapply(rep(h, 100), testfunc) # rep(h, 100) repeats the list as a, b, a, b, ..., giving 100 samples for a and 100 for b in one list
samp_a <- samp[c(TRUE, FALSE)] # Recycles the T/F vector, selecting the odd elements, which in this case are the `a` samples.
samp_b <- samp[c(FALSE, TRUE)] # And here, the even elements, which are the `b` samples.
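Another way to avoid assign altogether is to keep everything in one named list and draw the 100 samples with replicate. This is a sketch, not one of the original answers; the floor() call is an assumption that makes the rounding of the half-size explicit (13 * 0.5 becomes 6).

```r
dat <- data.frame(name = c(rep("a", 10), rep("b", 13)),
                  x = c(1,3,4,4,5,3,7,6,5,7,8,6,4,3,9,1,2,3,5,4,6,3,1))
h <- split(dat, dat$name)   # named list with elements "a" and "b"

set.seed(1)
samp <- lapply(h, function(d) {
  # 100 independent draws of row indices, each half the size of the group
  replicate(100, sample(nrow(d), floor(nrow(d) * 0.5)), simplify = FALSE)
})

samp_a <- samp[["a"]]   # list of 100 index vectors for group a
samp_b <- samp[["b"]]   # list of 100 index vectors for group b
```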

Loop for value matching won't work across data frames for multiple instances

Can anyone tell me what’s preventing this loop from running?
For each row i in column 3 of the data frame 'depth.df', the loop performs a calculation using a second data frame, 'linker.df': it multiplies the value in row i by a constant divided by a value from linker.df, which is found by matching the value in row i.
If I run the loop for a single value (let's say it is 50), it runs fine:
cor.depth <- function(depth.df){
result <- seq(from=1, to=(nrow(depth.df)))
x <- 8971
for(i in 1:nrow(depth.df)){
result[i] <- depth.df[i,3]*(x /( linker.df [i,2][ linker.df [i,1] == 50]))
return(result)
}
}
>97,331
but if I run it to loop over each instance of i, it always returns an error:
cor.depth <- function(depth.df){
result <- seq(from=1, to=(nrow(depth.df)))
x <- 8971
for(i in 1:nrow(depth.df)){
result[i] <- depth.df[i,3]*(x /( linker.df [i,2][ linker.df [i,1] %in% depth.df[i,3]]))
return(result)
}
}
Error in result[i] <- depth.df[i, 3] * (all_SC_bins/(depth.ea.bin.all[, :
replacement has length zero
EDIT
Here is a reproducible data set provided to illustrate data structure and issue
#make some data as an example
linker.data <- sample(x=40:50, replace = FALSE)
linker.df <- data.frame(
X = linker.data
, Y = sample(x=2000:3000, size = 11, replace = TRUE)
)
depth.df <- data.frame(
X = sample(x=9000:9999, size = 300, replace = TRUE)
, Y = sample(x=c("A","G","T","C"), size = 300, replace = TRUE)
, Z = sample(linker.data, size = 300, replace = TRUE)
)
cor.depth <- function(depth.df){
result <- seq(from=1, to=(nrow(depth.df)))
x <- 8971
for(i in 1:nrow(depth.df)){
result[i] <- depth.df[i,3]*(x /( linker.df [i,2][ linker.df [i,1] %in% depth.df[i,3]]))
return(result)
}
}
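The error arises because the denominator evaluates to integer(0) or numeric(0), a zero-length vector, on most rows. Your loop only looks at row i of linker.df, so it finds a value only when X in linker.df and Z in depth.df happen to match at the same row number. In addition, return(result) sits inside the loop, so the function would exit after the first iteration. You likely intended to match against any row of linker.df, which with loops requires a second, nested loop with an if condition on the match: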
cor.depth <- function(depth.df){
result <- seq(from=1, to=(nrow(depth.df)))
x <- 8971
for(i in 1:nrow(depth.df)){
for (j in 1:nrow(linker.df)){
if (linker.df[j,1] == depth.df[i,3]) {
result[i] <- depth.df[i,3]*(x /( linker.df[j,2]))
}
}
}
return(result)
}
Nonetheless, consider merge, a more efficient, vectorized approach that matches rows between both data frames on their ids. The setNames calls below rename the columns to avoid duplicate headers:
mdf <- merge(setNames(linker.df, paste0(names(linker.df), "_l")),
setNames(depth.df, paste0(names(depth.df), "_d")),
by.x="X_l", by.y="Z_d")
mdf$result <- mdf$X_l * (8971 / mdf$Y_l)
And as a comparison, the two approaches give equivalent results:
depth.df$result <- cor.depth(depth.df)
depth.df <- with(depth.df, depth.df[order(Z),]) # ORDER BY Z
mdf <- with(mdf, mdf[order(X_l),]) # ORDER BY X_L
all.equal(depth.df$result, mdf$result)
# [1] TRUE
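If the original row order of depth.df should be preserved, base R's match() offers a further vectorized option. The sketch below is not from the original answer; it rebuilds the example data with an arbitrary seed and assumes that every Z value occurs exactly once in linker.df$X, which holds for this example.

```r
set.seed(1)
linker.data <- sample(x = 40:50, replace = FALSE)
linker.df <- data.frame(X = linker.data,
                        Y = sample(x = 2000:3000, size = 11, replace = TRUE))
depth.df <- data.frame(X = sample(x = 9000:9999, size = 300, replace = TRUE),
                       Y = sample(x = c("A", "G", "T", "C"), size = 300, replace = TRUE),
                       Z = sample(linker.data, size = 300, replace = TRUE))

# For each depth.df row, the position of its Z value in linker.df$X
idx <- match(depth.df$Z, linker.df$X)

# Same formula as in the loop, applied to all rows at once
result <- depth.df$Z * (8971 / linker.df$Y[idx])
```

Unmatched values would produce NA in idx and therefore NA in result, which makes missing links easy to spot.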

Small bug in backpropagation algorithm in r

I've been trying to implement backpropagation in R, but I've been getting some strange results. It appears that after 1000 iterations of backprop, the program predicts 1 for all values. I was hoping it was a problem in the test function, but testing on smaller numbers of iterations shows that 0 is predicted as an output value in some instances. It seems that somewhere in iterating through the dataset, the weight updates tend toward increasing, when they should tend toward reducing error.
I apologize that the code is difficult to read in spots. I'm working on this with a partner and I dislike the way that he names variables. It's also not as fully commented as I'd like. Any help is appreciated.
# initialize a global output vector and a global vector of data frames
createNeuralNet <- function(numberOfInputNodes,hiddenLayers,nodesInHiddenLayer){
L <<- initializeWeightDataFrames(numberOfInputNodes,nodesInHiddenLayer,hiddenLayers)
# print(L)
OutputList <<- initializeOutputVectors(hiddenLayers)
}
# creates a list of weight data frames
# each weight data frame uses the row as an index of the "tail" for a connection
# the "head" of the connection (where the arrow points) is in the column index
# the value in the weight data frame is the weight of that connection
# the last row is the weight between the bias and a particular node
initializeWeightDataFrames <- function(numberOfInputNodes, nodesPerHiddenLayer, numberOfHiddenLayers) {
weights <- vector("list", numberOfHiddenLayers + 1)
# this code simply creates empty data frames of the proper size so that they may
first <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
middle <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
# assume binary classifier, so output layer has 1 node
last <- read.csv(text=generateColumnNamesCSV(1))
first <- assignWeights(first, numberOfInputNodes + 1)
weights[[1]] <- first
# assign random weights to each row
if (numberOfHiddenLayers != 1) {
for (i in 1:numberOfHiddenLayers - 1) {
middle <- assignWeights(middle, nodesPerHiddenLayer + 1)
weights[[i+1]] <- middle
}
}
last <- assignWeights(last, nodesPerHiddenLayer + 1)
weights[[length(weights)]] <- last
return(weights)
}
# generate a comma-separated string of column names c1 thru cn for creating arbitrary size data frame
generateColumnNamesCSV <- function(n) {
namesCSV <- ""
if (n==1) {
return("c1")
}
for (i in 1:(n-1)) {
namesCSV <- paste0(namesCSV, "c", i, ",")
}
namesCSV <- paste0(namesCSV, "c", n)
return(namesCSV)
}
assignWeights <- function(weightDF, numRows) {
modifiedweightDF <- weightDF
for (rowNum in 1:numRows) {
# creates a bunch of random numbers from -1 to 1, used to populate a row
rowVector <- runif(length(weightDF))
for (i in 1:length(rowVector)) {
sign <- (-1)^round(runif(1))
rowVector[i] <- sign * rowVector[i]
}
modifiedweightDF[rowNum,] <- rowVector
}
return(modifiedweightDF)
}
# create an empty list of the right size, will hold vectors of node outputs in the future
initializeOutputVectors <- function(numberOfHiddenLayers) {
numberOfLayers <- numberOfHiddenLayers + 1
outputVectors <- vector("list", numberOfLayers)
return(outputVectors)
}
# this is the main loop that does feed-forward and back prop
trainNeuralNet <- function(trainingData,target,iterations){
count <- 0
# iterations is a constant for how many times the dataset should be iterated through
while(count<iterations){
print(count)
for(row in 1:nrow(trainingData)) { # for each row in the data set
#Feed Forward
# instance is the current row that's being looked at
instance <- trainingData[row,]
# print(instance)
for (l in 1:length(L)) { # for each weight data frame
# w is the current weights
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
# x is the values in the previous layer
# can't access the previous layer if you're on the first layer
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in the "head" layer
s <- 0
for (i in 1:(nrow(w)-1)) {
# calculate the weighted sum s of connection weights and node values
# this is used to calculate a node in the next layer
# check the instance if on the first layer
if (l==1) {
# print(i)
# print(instance[1,i])
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
# if the layer is 2 or more
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
# print("s")
# print(s)
# print("sigmoid s")
# print(sigmoid(s))
Output[j] <- sigmoid(s)
}
OutputList[[l]] <- Output
}
# print(OutputList)
# print("w")
# print(L)
# print("BAck prop Time")
#Back Propagation
out <- OutputList[length(OutputList)]
#print(OutputList)
outputError <- rep(NA, length(w))
outputErrorPresent <- rep(NA, length(w))
outputError[1] <- out[[1]]*(1-out[[1]])*(out[[1]]-target[row])
for (h in (length(L)):1) { # for each weight matrix in hidden area h (going backwards)
hiddenOutput <- OutputList[h]
#print("hiddenOutput")
#print(h)
if (row==1||row==2) {
# print("h")
# print(h)
# print("output error Present")
# print(outputErrorPresent)
}
if (h!=(length(L))) {
outputError <- outputErrorPresent
}
w <- L[[h]]
for (j in 1:(nrow(w))) { # for each node j in hidden layer h
#print("length w")
#print(length(w))
if (row==1||row==2) {
# print("j")
# print(j)
}
errSum <- 0
nextLayerNodes <- L[[h]]
# print(nextLayerNodes)
#print(class(nextLayerNodes))
for (k in 1:ncol(nextLayerNodes)) {
errSum <- errSum + outputError[k]*nextLayerNodes[j,k]
}
m <- 0
if (h == 1) {
m <- as.numeric(instance)
m <- m[-1]
} else {
m <- OutputList[h-1][[1]]
}
deltaWeight <- 0
for (k in 1:ncol(nextLayerNodes)) {
hiddenNodeError <- hiddenOutput[[1]][k]*(1- hiddenOutput[[1]][k])*errSum
if (j == nrow(w)) {
deltaWeight <- learningRate*hiddenNodeError
} else {
deltaWeight <- learningRate*hiddenNodeError*m[j]
}
# print(deltaWeight)
w[j,k] <- w[j,k] + deltaWeight
}
if (j != nrow(w)) {
outputErrorPresent[j] <- hiddenNodeError
}
}
L[[h]] <<- w
}
# print(OutputList)
}
count <- count +1
# print(L)
#calculate global error
}
########################repeat
# print("w")
}
sigmoid <- function(s){
sig <- 1/(1+exp(-s))
return(sig)
}
testNeuralNetwork <- function(testingData,testTarget){
correctCount <- 0
# run the same code as feed forward
# this time run it on testing examples and compare the outputs
for(row in 1:nrow(testingData)) { # for each test instance
#Feed Forward
instance <- testingData[row,]
#print(instance)
for (l in 1:length(L)) { # for each layer l
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in layer l
s <- 0
for (i in 1:(nrow(w)-1)) {
if (l==1) {
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
Output[j] <- sigmoid(s)
#print(sigmoid(s))
}
OutputList[[l]] <- Output
}
# print(OutputList)
outputVal <- threshold(OutputList[[length(OutputList)]])
if (outputVal==testTarget[row]) {
print(paste0(" ", outputVal, " Correct!"))
correctCount <- correctCount + 1
}else{
print(paste0(" ", outputVal, " Wrong."))
}
#print()
#print(paste0("s2 ",str))
}
}
# convert real-valued output to a binary classification
threshold <- function(value){
if (value>=0.5) {
return(1)
}else{
return(0)
}
}
# this modifies df by removing 30 random rows
# this means that the same df will be changed permanently, so be careful of that
# it also returns the 30 random rows as a test set
makeTestSet <- function(df, size) {
len <- 1:length(df[,1])
randRows <- sample(len, size, replace=F)
return(randRows)
}
Data <- read.csv(file = "Downloads/numericHouse-votes-84.csv", head = TRUE, sep = ",")
learningRate <<- 0.1
# assume that the first column of the data is the column that is to be predicted
# thus the number of inputs is 1 less than the number of columnns
numberOfInputNodes <- ncol(Data) - 1
randRows <- makeTestSet(Data,30) #change this to 30
testData <- Data[randRows,]
trainingData <- Data[-randRows,]
testTarget <- testData[,1]
#trainingData <- Data[,1:numberOfInputNodes]
trainingTarget <- trainingData[,1]
createNeuralNet(numberOfInputNodes,1,numberOfInputNodes)
iterations <- 100
trainNeuralNet(trainingData,trainingTarget,iterations)
testNeuralNetwork(testData,testTarget)
L
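One thing that may be worth checking, offered as a possibility rather than a confirmed diagnosis: the output error in the code is computed as out*(1-out)*(out-target), which is the gradient of the squared error 0.5*(out-target)^2 with respect to the node's weighted sum, yet the weights are updated with w[j,k] + deltaWeight. With that sign convention, gradient descent subtracts the step. The single-neuron sketch below uses made-up toy data and only illustrates the sign convention; it is not a rewrite of the network above.

```r
sigmoid <- function(s) 1 / (1 + exp(-s))

# Toy data (assumption for illustration): one input, binary target
x <- c(-2, -1, 1, 2)
target <- c(0, 0, 1, 1)

w <- 0
b <- 0
learningRate <- 0.1

# Total squared error over the toy data set
loss <- function(w, b) sum(0.5 * (sigmoid(w * x + b) - target)^2)
loss_before <- loss(w, b)

for (epoch in 1:200) {
  for (k in seq_along(x)) {
    out <- sigmoid(w * x[k] + b)
    # Same form as the output error in the question: (out - target)
    delta <- out * (1 - out) * (out - target[k])
    # With this sign convention the step is subtracted, not added
    w <- w - learningRate * delta * x[k]
    b <- b - learningRate * delta
  }
}

loss_after <- loss(w, b)
```

Equivalently, the error term could be defined as (target - out) and the step added; the two conventions should not be mixed.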

R repeat function until condition met

I am trying to generate a random sample that excludes certain "bad data." I do not know whether the data is "bad" until after I sample it. Thus, I need to make a random draw from the population and then test it. If the data is "good" then keep it. If the data is "bad" then randomly draw another and test it. I would like to do this until my sample size reaches 25. Below is a simplified example of my attempt to write a function that does this. Can anyone please tell me what I am missing?
df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
df
random.sample <- function(x) {
x <- df[sample(nrow(df), 1), ]
if (x$SCORE > 0) return(x)
#if (x$SCORE <= 0) run the function again
}
random.sample(df)
Here is a general use of a while loop:
random.sample <- function(x) {
success <- FALSE
while (!success) {
# do something
i <- sample(nrow(df), 1)
x <- df[i, ]
# check for success
success <- x$SCORE > 0
}
return(x)
}
An alternative is to use repeat (syntactic sugar for while(TRUE)) and break:
random.sample <- function(x) {
repeat {
# do something
i <- sample(nrow(df), 1)
x <- df[i, ]
# exit if the condition is met
if (x$SCORE > 0) break
}
return(x)
}
where break makes you exit the repeat block. Alternatively, you could have if (x$SCORE > 0) return(x) to exit the function directly.
Use this after your first sample:
while (any(bad <- (x$SCORE <= 0)))
x[bad, ] <- df[sample(nrow(df), sum(bad)), ]
You can just select the rows to sample directly like so (just 5):
> df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
> df[sample(which(df$SCORE>0), 5),]
NAME SCORE
14 Mary 1.0858854
10 Frank 0.7037989
16 Mary 0.7688913
5 Frank 0.2067499
17 Mary 0.4391216
This samples without replacement; for a bootstrap, add replace=TRUE.
random.sample <- function(x) {
x <- df[sample(nrow(df), 1), ]
if (x$SCORE > 0) return(x)
Recall(x)# run the function again
}
random.sample(df)
# NAME SCORE
#14 Mary 1.252566
It seems to me that this should work as well:
df$SCORE[ df$SCORE > 0 ][ sample(1:sum(df$SCORE > 0), 1) ]
#[1] 0.6579631
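The question asks for a sample of 25, while the snippets above return a single draw or a handful of rows. A sketch of a loop that keeps drawing until 25 good rows are collected follows. The function name draw_good, the seed and the max_tries guard are assumptions for illustration. Because df has only 20 rows, draws are independent and the same row can be kept more than once.

```r
set.seed(123)
df <- data.frame(NAME = c(rep("Frank", 10), rep("Mary", 10)), SCORE = rnorm(20))

# Hypothetical helper: draw one row at a time, keep it only if it is "good"
draw_good <- function(data, n = 25, max_tries = 10000) {
  kept <- vector("list", n)   # pre-sized container for the accepted rows
  n_kept <- 0
  tries <- 0
  while (n_kept < n) {
    tries <- tries + 1
    if (tries > max_tries) stop("gave up after ", max_tries, " draws")
    x <- data[sample(nrow(data), 1), ]
    if (x$SCORE > 0) {        # test the draw; discard it if "bad"
      n_kept <- n_kept + 1
      kept[[n_kept]] <- x
    }
  }
  do.call(rbind, kept)        # combine accepted rows into one data frame
}

good <- draw_good(df)
```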
