while loop in R challenge - r

I have a dataset with two columns: a boolean column and a value column. I would like to find the sum of the values where the boolean is F using a while loop. My code is shown below, but it gives an error:
sum <- 0
FM <- 0
idx <- 1
while ( idx <= nrow(dataset)){
if(subset(dataset,boolean=="F")){
sum <- sum + dataset [ idx,"value" ]
FM <- FM + 1
}
idx <- idx + 1
}
print(sum)
The error message is: Error in idx : object 'idx' not found

If you sum a logical vector, you get the number of TRUE values it contains. Since you want to count the FALSE values here, negate the vector first and then use sum.
sum(!df$boolean)
#[1] 2
However, I guess you want this in a while loop. You can iterate over every value in boolean column, check if it is FALSE and increment the count.
i <- 1
FM <- 0
while(i <= nrow(df)) {
if(!df$boolean[i])
FM <- FM + 1
i <- i + 1
}
FM
#[1] 2
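We can also do this without an if condition, because a negated FALSE is TRUE and TRUE counts as 1 when added. Remember to reset i and FM first, otherwise the loop from above has already run i past nrow(df).
i <- 1
FM <- 0
while(i <= nrow(df)) {
FM <- FM + !df$boolean[i]
i <- i + 1
}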
data
df <- data.frame(boolean= c(TRUE,FALSE,TRUE,TRUE,FALSE),value=c(8,16,4,12,9))
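Since the question asks for the sum of the values on the FALSE rows rather than their count, here is a minimal sketch of that variant. It assumes the column is logical, as in the df above, and uses a hypothetical accumulator name total instead of sum so that the built-in function is not masked.

```r
df <- data.frame(boolean = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                 value = c(8, 16, 4, 12, 9))

total <- 0  # running sum of value on rows where boolean is FALSE
i <- 1
while (i <= nrow(df)) {
  if (!df$boolean[i]) {
    total <- total + df$value[i]
  }
  i <- i + 1
}
total
# [1] 25

# vectorized equivalent, no loop needed
sum(df$value[!df$boolean])
# [1] 25
```

Both forms add up 16 and 9, the two values on the FALSE rows.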

Related

Storing simulation results in R

I want to estimate the Mantel-Haenszel Differential Item Functioning (DIF) odds ratio and the MH D-DIF index. I wrote the function below. It seems to me I am making a mistake when storing the results. Would you please take a look at this and give me feedback?
Here is the sample data:
# generate dataset
r <- 1000
c <- 16
test <- matrix(rbinom(r*c,1,0.5),r,c)
# create sum scores for each student using first 15 columns
test <- cbind(test, apply(test[,1:15],1,sum))
colnames(test) <- c("v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","group","score")
test <- as.data.frame(test)
The first 15 columns are the student True/false responses to items/questions. The group membership column is the 16th column. The student "score" variable is the sum of item scores at the last (17th) column. The formula can be found here in the picture that I got from Wikipedia (https://en.wikipedia.org/wiki/Differential_item_functioning).
For each score category, I want to estimate the last two formulas in this picture. Rows are 10 students and columns are six items/questions. Again, the 16th column is group membership (1-focal, 0-reference)
Here is my function code.
library(dplyr)
# this function first starts with the first item and loop k scores from 1-15. Then move to the second item.
# data should only contain the items, grouping variable, and person score.
Mantel.Haenszel <- function (data) {
# browser() #runs with debug
for (item in 1:15) { #item loop not grouping/scoring
item.incorrect <- data[,item] == 0
item.correct <- data[,item] == 1
Results <- c()
for (k in 1:15) { # for k scores
Ak <- nrow(filter(data, score == k, group == 0, item.correct)) # freq of ref group & correct
Bk <- nrow(filter(data, score == k, group == 0, item.incorrect)) # freq of ref group & incorrect
Ck <- nrow(filter(data, score == k, group == 1, item.correct)) # freq of foc group & correct
Dk <- nrow(filter(data, score == k, group == 1, item.incorrect)) # freq of foc group & incorrect
nrk <- nrow(filter(data, score == k, group == 0)) #sample size for ref
nfk <- nrow(filter(data, score == k, group == 1)) #sample size for focal
if (Bk == 0 | Ck == 0) {
next
}
nominator <-sum((Ak*Dk)/(nrk + nfk))
denominator <-sum((Bk*Ck)/(nrk + nfk))
odds.ratio <- nominator/denominator
if (odds.ratio == 0) {
next
}
MH.D.DIF <- (-2.35)*log(odds.ratio) #index
# save the output
out <- list("Odds Ratio" = odds.ratio, "MH Diff" = MH.D.DIF)
results <- rbind(Results, out)
return(results)
} # close score loop
} # close item loop
} #close function
Here is what I get
# test function
Mantel.Haenszel(test)
> Mantel.Haenszel(test)
Odds Ratio MH Diff
out 0.2678571 3.095659
What I want to get is
> Mantel.Haenszel(test)
Odds Ratio MH Diff
out 0.2678571 3.095659
## ##
.. ..
(15 rows here for 15 score categories in the dataset)
Shouldn't you expect a result for every combination of item and k, for a maximum of 225 output rows, barring any iterations skipped by next? If so, I think you just need to change a few minor things. First, declare Results only once, at the beginning of your function. Then, make sure you are rbind-ing and returning either Results or results, but not both (R is case-sensitive, so these are two different objects). Then, move your return to the function level rather than inside the loops; as written, it exits the function the first time it is reached. In the example below I've also included the current item and k for demonstration:
Mantel.Haenszel <- function (data) {
# browser() #runs with debug
Results <- c()
for (item in 1:15) {
#item loop not grouping/scoring
item.incorrect <- data[, item] == 0
item.correct <- data[, item] == 1
for (k in 1:15) {
# for k scores
Ak <-
nrow(filter(data, score == k, group == 0, item.correct)) # freq of ref group & correct
Bk <-
nrow(filter(data, score == k, group == 0, item.incorrect)) # freq of ref group & incorrect
Ck <-
nrow(filter(data, score == k, group == 1, item.correct)) # freq of foc group & correct
Dk <-
nrow(filter(data, score == k, group == 1, item.incorrect)) # freq of foc group & incorrect
nrk <-
nrow(filter(data, score == k, group == 0)) #sample size for ref
nfk <-
nrow(filter(data, score == k, group == 1)) #sample size for focal
if (Bk == 0 | Ck == 0) {
next
}
nominator <- sum((Ak * Dk) / (nrk + nfk))
denominator <- sum((Bk * Ck) / (nrk + nfk))
odds.ratio <- nominator / denominator
if (odds.ratio == 0) {
next
}
MH.D.DIF <- (-2.35) * log(odds.ratio) #index
# save the output
out <-
list(
item = item,
k = k,
"Odds Ratio" = odds.ratio,
"MH Diff" = MH.D.DIF
)
Results <- rbind(Results, out)
} # close score loop
} # close item loop
return(Results)
} #close function
test.output <- Mantel.Haenszel(test)
Gives an output like:
> head(test.output, 20)
item k Odds Ratio MH Diff
out 1 3 2 -1.628896
out 1 4 4.666667 -3.620046
out 1 5 0.757085 0.6539573
out 1 6 0.5823986 1.27041
out 1 7 0.9893293 0.02521097
out 1 8 1.078934 -0.1785381
out 1 9 1.006237 -0.01461145
out 1 10 1.497976 -0.9496695
out 1 11 1.435897 -0.8502066
out 1 12 1.5 -0.952843
out 2 3 0.8333333 0.4284557
out 2 4 2.424242 -2.08097
out 2 5 1.368664 -0.7375117
out 2 6 1.222222 -0.4715761
out 2 7 0.6288871 1.089938
out 2 8 1.219512 -0.4663597
out 2 9 1 0
out 2 10 2.307692 -1.965183
out 2 11 0.6666667 0.952843
out 2 12 0.375 2.304949
Is that what you're looking for?
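As a side note, a matrix of lists grown with rbind can be awkward to work with afterwards. Here is a minimal sketch of another way to store the results, assuming a plain data frame is acceptable: collect one small data frame per kept score level in a list and bind them once at the end. The toy data and the names toy, rows and results are made up for illustration, there is only one item, and the cells are counted with sum() on logical vectors instead of nrow(filter(...)), so dplyr is not needed for this sketch.

```r
# Hypothetical toy data: one item (v1), a group flag and a total score
toy <- data.frame(
  v1    = c(1, 1, 0, 1, 0, 1, 0, 0),
  group = c(0, 0, 0, 1, 1, 0, 0, 1),
  score = c(1, 1, 1, 1, 1, 2, 2, 2)
)

rows <- list()  # one data frame per score level that is kept
for (k in sort(unique(toy$score))) {
  at_k <- toy$score == k
  Ak <- sum(at_k & toy$group == 0 & toy$v1 == 1)  # reference, correct
  Bk <- sum(at_k & toy$group == 0 & toy$v1 == 0)  # reference, incorrect
  Ck <- sum(at_k & toy$group == 1 & toy$v1 == 1)  # focal, correct
  Dk <- sum(at_k & toy$group == 1 & toy$v1 == 0)  # focal, incorrect
  if (Bk == 0 || Ck == 0) next                    # same skip rule as above
  N <- sum(at_k)                                  # nrk + nfk
  odds <- ((Ak * Dk) / N) / ((Bk * Ck) / N)
  rows[[length(rows) + 1]] <- data.frame(k = k, odds.ratio = odds)
}

# Bind everything once, after the loop
results <- do.call(rbind, rows)
results
```

With this toy data only score level 1 survives (level 2 has Ck equal to 0 and is skipped), so results has a single row with an odds ratio of 2. The columns are ordinary numeric vectors, which makes later filtering or plotting simpler than with a matrix of lists.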

About missing value where TRUE/FALSE needed in R

I want to return the number of times in string vector v that the element at the next successive index has more characters than the current index.
Here's my code
BiggerPairs <- function (v) {
numberOfTimes <- 0
for (i in 1:length(v)) {
if((nchar(v[i+1])) > (nchar(v[i]))) {
numberOfTimes <- numberOfTimes + 1
}
}
return(numberOfTimes)
}
This gives the error: missing value where TRUE/FALSE needed.
I do not know why this happens.
The error says that your code is evaluating a missing value (NA) where if() expects TRUE or FALSE. There are two likely reasons for this:
You have NAs in your vector v (I suspect this is not the actual issue).
The loop you wrote runs over 1:length(v), so on the last iteration it tries to compare nchar(v[n+1]) with nchar(v[n]). There is no v[n+1], so the comparison evaluates to NA and you get the error.
To remove NAs, try the following code:
v <- na.omit(v)
To improve your loop, try the following code:
for(i in 1:(length(v) -1)) {
if(nchar(v[i + 1]) > nchar(v[i])) {
numberOfTimes <- numberOfTimes + 1
}
}
Here is some example dummy code.
# create random 15 numbers
set.seed(1)
v <- rnorm(15)
# accessing the 16th element produces an NA
v[16]
#[1] NA
# if we add an NA and try to do a comparison, the result is NA (which if() cannot handle)
v[10] <- NA
v[10] > v[9]
#[1] NA
# if we remove NAs and limit our loop to N-1, we should get a fair comparison
v <- na.omit(v)
numberOfTimes <- 0
for(i in 1:(length(v) -1)) {
if(nchar(v[i + 1]) > nchar(v[i])) {
numberOfTimes <- numberOfTimes + 1
}
}
numberOfTimes
#[1] 5
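Putting both fixes together, a complete version of the function could look roughly like the sketch below. It assumes v is a character vector and that dropping NA entries before comparing neighbours is acceptable. seq_len is used instead of 1:(length(v) - 1) so that a vector with fewer than two elements produces no iterations rather than counting backwards.

```r
BiggerPairs <- function(v) {
  v <- v[!is.na(v)]  # drop missing entries first (an assumption about the desired behaviour)
  numberOfTimes <- 0
  # seq_len(0) is empty, so short vectors skip the loop entirely
  for (i in seq_len(max(length(v) - 1, 0))) {
    if (nchar(v[i + 1]) > nchar(v[i])) {
      numberOfTimes <- numberOfTimes + 1
    }
  }
  numberOfTimes
}

BiggerPairs(c("a", "bb", "c", "ddd"))
# [1] 2
BiggerPairs("x")
# [1] 0
```

In the first call, "bb" is longer than "a" and "ddd" is longer than "c", which gives the two counted pairs.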
Is this what you're after? I don't think there is any need for a for loop.
I'm generating some sample data, since you don't provide any.
# Generate some sample data
set.seed(2017);
v <- sapply(sample(30, 10), function(x)
paste(sample(letters, x, replace = T), collapse = ""))
v;
#[1] "raalmkksyvqjytfxqibgwaifxqdc" "enopfcznbrutnwjq"
#[3] "thfzoxgjptsmec" "qrzrdwzj"
#[5] "knkydwnxgfdejcwqnovdv" "fxexzbfpampbadbyeypk"
#[7] "c" "jiukokceniv"
#[9] "qpfifsftlflxwgfhfbzzszl" "foltth"
The following vector marks the positions with 1 in v where entries have more characters than the previous entry.
# The following vector has the same length as v and
# returns 1 at the index position i where
# nchar(v[i]) > nchar(v[i-1])
idx <- c(0, diff(nchar(v)) > 0);
idx;
# [1] 0 0 0 0 1 0 0 1 1 0
If you're just interested in whether there is any entry with more characters than the previous entry, you can do this:
# If you just want to test if there is any position where
# nchar(v[i+1]) > nchar(v[i]) you can do
any(idx == 1);
#[1] TRUE
Or count the number of occurrences:
sum(idx);
#[1] 3

How to create a conditional dummy in R?

I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has temperature above a threshold of 5C. This would be easy in itself, but an additional condition exists: counting starts only after ten consecutive days above the threshold occurs. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 9:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where at the same time lengths is greater than or equal to 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths) and as.integer in order to return 1/0 instead of TRUE/FALSE.
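To make the steps easier to follow, here is a small worked example on a made-up vector. The values in temp and the shorter run length min_run = 3 are assumptions chosen only so that the result can be checked by eye; the real task uses a run length of 10.

```r
temp <- c(1, 6, 7, 2, 6, 7, 8, 9, 1)  # toy temperatures
min_run <- 3                          # hypothetical run length for this toy case

runs <- rle(temp > 5)
runs$lengths
# [1] 1 2 1 4 1
runs$values
# [1] FALSE  TRUE FALSE  TRUE FALSE

# keep only the TRUE runs that are long enough, then expand back to full length
dummy <- rep(as.integer(runs$values & runs$lengths >= min_run), runs$lengths)
dummy
# [1] 0 0 0 0 1 1 1 1 0
```

The first TRUE run has length 2 and is therefore dropped, while the run of four days is kept in full.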
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
#find first 10 consecutive values above threshold
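# stats::filter is spelled out so that a loaded dplyr does not mask it
test1 <- stats::filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
# index of the first day that completes ten consecutive days above the threshold
n <- which(test1)[1]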
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)

calculation conditional on the signs of an element in r

I know the question is silly but I really can't solve it. I just want to apply different operations to the elements in a dataframe depending on their sign. The following code generates a mock dataframe:
mock<-data.frame(matrix(NA,ncol=5,nrow=2))
colnames(mock)<-as.vector(c("m","n","1985-02-04","1985-02-05","1985-02-06"))
rownames(mock)<-as.vector(c("fund1","fund2"))
mock
mock[1,]<-c(0.001,0.0045,-0.03,0.25,NA)
mock[2,]<-c(0.004,0.0004,NA,0.12,-0.087)
mock
so it looks like
m n 1985-02-04 1985-02-05 1985-02-06
fund1 0.001 0.0045 -0.03 0.25 NA
fund2 0.004 0.0004 NA 0.12 -0.087
For each fund, m and n represent two different ratios, and the last three figures are returns on the given days. I wish to do the following operations:
if the return x on one day is positive, I need (x+m)/(1+n) to replace the corresponding figure in the dataframe.
If the return x is negative, I need x+m to replace the corresponding figure in the dataframe.
If it is NA on the day, I will leave it NA.
I tried the following code:
Grossreturn<-function(x){
a<-x[3:5]
m<-x[1]
p<-x[2]
a[a>0]<-(a[a>0]+m)/(1-p)
a[a<0]<-a[a<0]+m
return(a)
}
apply(mock,1,Grossreturn)
and of course it failed and the error message is:
Error in a[a > 0] <- (a[a > 0] + m)/(1 - p) :
NAs are not allowed in subscripted assignments
I am really stuck here and can't sort it out. Can someone help?
Thanks!
You should just exclude NAs from all your assignments. A sample syntax for doing this is below:
> foo = data.frame(x=runif(3)-0.5, y=runif(3)) #random data frame
> foo[2,1] <- NA #adding an NA
> foo
x y
1 -0.4616014 0.4892859
2 NA 0.4730237
3 0.4060813 0.1517448
If you now try to reassign without filtering out NAs, you get your error.
> foo[sign(foo$x)==-1, 1] <- -10
Error in `[<-.data.frame`(`*tmp*`, sign(foo$x) == -1, 1, value = -10) :
missing values are not allowed in subscripted assignments of data frames
But not if you explicitly leave out NAs:
> foo[sign(foo$x)==-1 & !is.na(foo$x), 1] <- -10
> foo
x y
1 -10.0000000 0.4892859
2 NA 0.4730237
3 0.4060813 0.1517448
Here is a code which solves your problem:
grossreturn <- function(x) {
m <- x[1]
n <- x[2]
# iterate over all date columns and compute new value
for (i in 3:length(x)) {
if (is.na(x[i])) {
# NA remains NA
} else if (x[i] < 0) {
x[i] <- x[i] + m # x + m
} else {
# x[i] >= 0
# includes case where x[i] == 0
x[i] <- (x[i] + m) / (1 + n) # (x + m) / (1 + n)
}
}
return(x)
}
result <- apply(mock, 1, FUN=function(x) grossreturn(x))
I wanted to use an apply function to iterate over the numerical columns after extracting m and n, but there do not seem to be any vectorized apply functions that can also take multiple parameters as input (so mapply would not be a vectorized solution).
I assumed that when a return is 0 you want (x + m) / (1 + n). Also, you should check whether R drops either the row or column names when you run this code.
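For what it is worth, a version without an explicit loop seems possible with ifelse applied to the return columns as a matrix. This is only a sketch under a few assumptions: the data frame is rebuilt here so the block is self-contained, the return columns are columns 3 to 5, and a return of exactly 0 is treated like a positive one, as above. Because R fills matrices column by column, mock$m and mock$n are recycled down each column, so every row uses its own ratios.

```r
mock <- data.frame(
  m = c(0.001, 0.004),
  n = c(0.0045, 0.0004),
  "1985-02-04" = c(-0.03, NA),
  "1985-02-05" = c(0.25, 0.12),
  "1985-02-06" = c(NA, -0.087),
  row.names = c("fund1", "fund2"),
  check.names = FALSE
)

returns <- as.matrix(mock[, 3:5])

# NA in the test stays NA in the result, so no special handling is needed
adjusted <- ifelse(returns >= 0,
                   (returns + mock$m) / (1 + mock$n),  # x >= 0: (x + m) / (1 + n)
                   returns + mock$m)                   # x < 0:  x + m

mock[, 3:5] <- adjusted
mock
```

The row and column names are kept because the result is assigned back into the existing columns of mock.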

Euler Project #1 in R

Problem
Find the sum of all numbers below 1000 that are divisible by 3 or 5
One solution I created:
x <- c(1:999)
values <- x[x %% 3 == 0 | x %% 5 == 0]
sum(values)
Second solution I can't get to work and need help with. I've pasted it below.
I'm trying to use a loop (here, I use while() and after this I'll try for()). I am still struggling with keeping references to indexes (locations in a vector) separate from values/observations within vectors. Loops seem to make it more challenging for me to distinguish the two.
Why does this not produce the answer to Euler #1?
x <- 0
i <- 1
while (i < 100) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- c(x, i)
}
i <- i + 1
}
sum(x)
And in words, line by line this is what I understand is happening:
x gets value 0
i gets value 1
while object i's value (not the index #) is < 1000
if is divisible by 3 or 5
add that number i to the vector x
add 1 to i (in order to keep the loop going to the defined limit of 1e3)
sum all items in vector x
I am guessing x[i] <- c(x, i) is not the right way to add an element to vector x. How do I fix this and what else is not accurate?
First, your loop runs until i < 100, not i < 1000.
Second, replace x[i] <- c(x, i) with x <- c(x, i) to add an element to the vector.
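With both changes applied, the loop from the question would look like the sketch below. Nothing else is altered; x still starts at 0, which does not affect the sum.

```r
x <- 0
i <- 1
while (i < 1000) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x <- c(x, i)  # append the value i to the end of x
  }
  i <- i + 1
}
sum(x)
# [1] 233168
```

The original x[i] <- c(x, i) tries to store a whole vector in the single position i, so only its first element (the initial 0) ends up there and R warns about the length mismatch. That is why index and value got mixed up.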
Here is a shortcut that performs this sum, which is probably more in the spirit of the problem:
3*(333*334/2) + 5*(199*200/2) - 15*(66*67/2)
## [1] 233168
Here's why this works:
In the set of integers [1,999] there are:
333 values that are divisible by 3. Their sum is 3*sum(1:333) or 3*(333*334/2).
199 values that are divisible by 5. Their sum is 5*sum(1:199) or 5*(199*200/2).
Adding these up gives a number that is too high by the sum of their intersection, i.e. the values that are divisible by 15. There are 66 such values, and their sum is 15*sum(1:66) or 15*(66*67/2).
As a function of N, this can be written:
f <- function(N) {
threes <- floor(N/3)
fives <- floor(N/5)
fifteens <- floor(N/15)
3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}
Giving:
f(999)
## [1] 233168
f(99)
## [1] 2318
And another way:
x <- 1:999
sum(which(x%%5==0 | x%%3==0))
# [1] 233168
A very efficient approach is the following:
div_sum <- function(x, n) {
# calculates the double of the sum of all integers from 1 to n
# that are divisible by x
max_num <- n %/% x
(x * (max_num + 1) * max_num)
}
n <- 999
a <- 3
b <- 5
(div_sum(a, n) + div_sum(b, n) - div_sum(a * b, n)) / 2
In contrast, a very short code is the following:
x=1:999
sum(x[!x%%3|!x%%5])
Here is an alternative that I think gives the same answer (using 99 instead of 999 as the upper bound):
iters <- 100
x <- rep(0, iters-1)
i <- 1
while (i < iters) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- i
}
i <- i + 1
}
sum(x)
# [1] 2318
Here is the for-loop mentioned in the original post:
iters <- 99
x <- rep(0, iters)
# for() creates and advances i itself, so no manual counter is needed
for (i in 1:iters) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- i
}
}
sum(x)
# [1] 2318

Resources