I have a function that is currently working, but I think there may be a better way for it to work without having to manipulate the data so much beforehand. Basically, I am returning a simple TRUE or FALSE if a value in my column is greater than both the two values before it, and after it.
y1 #a single vector column of values
for (i in 3:length(y1)){ #for every number starting at 3 (because for 2 and 1 you can't go back two)
if(y1[i] > y1[i-1] && y1[i] > y1[i-2] && y1[i] > y1[i+1] && y1[i] > y1[i+2]){ #if the number is greater than 2 before and 2 after...
y2[i] <- 'TRUE' #if it is true, write true. Here y2[i] you're saving the results in the blank vector
} else {
y2[i] <- 'FALSE' } #opposite here
print(y2[i])
This works okay, but as you see I have to start at 3 in my for loop because otherwise I get an error, given that the first and second values, as well as the last two, can't compute the [i-1],[i-2] or [i+1] and [i+2]. If I do for i:length(y1) it will not work and I also have to add two zeros onto the dataset in order to not get an error/be able to "compute" the last TRUE/FALSE value.
Is there any way to clean up the actual function so that I don't have to manipulate the data beforehand? Essentially have the function give me a null just for the first two and last two values in my data?
Another approach is using lag and lead from dplyr:
library(dplyr)
v2 <- (lag(v1, 2) < v1 & lag(v1, 1) < v1 & lead(v1, 1) < v1 & lead(v1, 2) < v1)
Output & data:
v1 <- c(1,2,3,2,1,1,1,1,1,2,1,1)
v2
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
my first quick tip would be to look into the lead and lag functions of dplyr.
See for example this tutorial or in the dplyr documentation or in Hadley Wickhams R for Data Science.
Hope this helps!
Related
I am trying to understand the for and if-statement in r, so I run a code where I am saying that if the sum of rows are bigger than 3 then return 1 else zero:
Here is the code
set.seed(2)
x = rnorm(20)
y = 2*x
a = cbind(x,y)
hold = c()
Now comes the if-statement
for (i in nrow(a)) {
if ([i,1]+ [i,2] > 3) hold[i,] == 1
else ([i,1]+ [i,2]) <- hold[i,] == 0
return (cbind(a,hold)
}
I know that maybe combining for and if may not be ideal, but I just want to understand what is going wrong. Please keep the explanation at a dummy level:) Thanks
You've got some issues. #mnel covered a better way to go about doing this, I'll focus on understanding what went wrong in this attempt (but don't do it this way at all, use a vectorized solution).
Line 1
for (i in nrow(a)) {
a has 20 rows. nrow(a) is 20. Thus your code is equivalent to for (i in 20), which means i will only ever be 20.
Fix:
for (i in 1:nrow(a)) {
Line 2
if ([i,1]+ [i,2] > 3) hold[i,] == 1
[i,1] isn't anything, it's the ith row and first column of... nothing. You need to reference your data: a[i,1]
You initialized hold as a vector, c(), so it only has one dimension, not rows and columns. So we want to assign to hold[i], not hold[i,].
== is used for equality testing. = or <- are for assignment. Right now, if the >3 condition is met, then you check if hold[i,] is equal to 1. (And do nothing with the result).
Fix:
if (a[i,1]+ a[i,2] > 3) hold[i] <- 1
Line 3
else ([i,1]+ [i,2]) <- hold[i,] == 0
As above for assignment vs equality testing. (Here you used an arrow assignment, but put it in the wrong place - as if you're trying to assign to the else)
else happens whenever the if condition isn't met, you don't need to try to repeat the condition
Fix:
else hold[i] <- 0
Fixed code together:
for (i in 1:nrow(a)) {
if (a[i,1] + a[i,2] > 3) hold[i] <- 1
else hold[i] <- 0
}
You aren't using curly braces for your if and else expressions. They are not required for single-line expressions (if something do this one line). They are are required for multi-line (if something do a bunch of stuff), but I think they're a good idea to use. Also, in R, it's good practice to put the else on the same line as a } from the preceding if (inside the for loop or a function it doesn't matter, but otherwise it would, so it's good to get in the habit of always doing it). I would recommend this reformatted code:
for (i in 1:nrow(a)) {
if (a[i, 1] + a[i, 2] > 3) {
hold[i] <- 1
} else {
hold[i] <- 0
}
}
Using ifelse
ifelse() is a vectorized if-else statement in R. It is appropriate when you want to test a vector of conditions and get a result out for each one. In this case you could use it like this:
hold <- ifelse(a[, 1] + a[, 2] > 3, 1, 0)
ifelse will take care of the looping for you. If you want it as a column in your data, assign it directly (no need to initialize first)
a$hold <- ifelse(a[, 1] + a[, 2] > 3, 1, 0)
Such operations in R are nicely vectorised.
You haven't included a reference to the dataset you wish to index with your call to [ (eg a[i,1])
using rowSums
h <- rowSums(a) > 3
I am going to assume that you are new to R and trying to learn about the basic function of the for loop itself. R has fancy functions called "apply" functions that are specifically for doing basic math on each row of a data frame. I am not going to talk about these.
You want to do the following on each row of the array.
Sum the elements of the row.
Test that the sum is greater than 3.
Return a value of 1 or 0 representing the result of 2.
For 1, luckily "sum" is a built in function. It pays off to check out the built in functions within every programming language because they save you time. To sum the elements of a row, just use sum(a[row_number,]).
For 2, you are evaluating a logical statement "is x >3?" where x is the result from 1. The ">3" statement returns a value of true or false. The logical expression is a fancy "if then" statement without the "if then".
> 4>3
[1] TRUE
> 2>3
[1] FALSE
For 3, a true or false value is a data structure called a "logical" value in R. A 1 or 0 value is a data structure called a "numeric" value in R. By converting the "logical" into a "numeric", you can change the TRUE to 1's and FALSE to 0's.
> class(4>3)
[1] "logical"
> as.numeric(4>3)
[1] 1
> class(as.numeric(4>3))
[1] "numeric"
A for loop has a min, a max, a counter, and an executable. The counter starts at the min, and increments until it goes to the max. The executable will run for each run of the counter. You are starting at the first row and going to the last row. Putting all the elements together looks like this.
for (i in 1:nrow(a)){
hold[i] <- as.numeric(sum(a[i,])>3)
}
I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame.So far I have tried these with no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a boolean array, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
sum(mydata$sCode == "CA"), as suggested in the comments; because
TRUE is interpreted as 1 and FALSE as 0, this should return the
numer of TRUE values in your vector.
length(which(mydata$sCode == "CA")); the which() function
returns a vector of the indices where the condition is met, the
length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identify each column where the condition is met (in this case, columns 1 and 2 of the dataframe). The length() of this vector is the number of occurences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
mydata$sCode is a vector, it's why nrow output is NULL.
mydata[mydata$sCode == 'CA',] returns data.frame where sCode == 'CA'. sCode includes character. That's why sum gives you the error.
subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead sCode='CA'. Then subset returns you vector where sCode equals CA, so you should use
length(subset(na.omit(mydata), sCode='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
With dplyr package, Use
nrow(filter(mydata, sCode == "CA")),
All the solutions provided here gave me same error as multi-sam but that one worked.
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
to get the number of observations the number of rows from your Dataset would be more valid:
nrow(dat[dat$sCode == "CA",])
grep command can be used
CA = mydata[grep("CA", mydata$sCode, ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42
I have several dataframes in which I want to delete each row that matches a certain string. I used the following code to do it:
df[!(regexpr("abc", df$V4) ==1),]
How can I delete the row that is following, e.g. if I delete row n as specified by the code above, how can I additionally delete row n+1?
My first try was to simply find out the indices of the desired rows, but that won't work, as I need to delete rows in different dataframes which are of different lengths. So the indices vary.
Thanks!
I would suggest taking out and manipulating the logical vector directly. Suppose we have the vector:
x = c(5,0,1, 4, 3)
and we want to do:
x[x > 3]
First, note that:
R> (s_n = x>3)
[1] TRUE FALSE FALSE TRUE FALSE
So
R> (s_n1 = as.logical(s_n + c(F, l[1:(length(s_n)-1)])))
[1] TRUE TRUE FALSE TRUE TRUE
Hence,
x[s_n1]
gives you what you want.
In your particular example, something like:
s_n = !(regexpr("abc", df$V4) == 1)
s_n1 = as.logical(s_n + c(F, l[1:(length(s_n)-1)])))
df[s_n1, ]
should work.
Use which() on your logical expression and then you can just add 1 to the result.
sel <- which(grep("abc", df$V4))
sel <- c(sel, sel+1)
df[-sel,]
df[which(!(regexpr("abc", df$V4) ==1))+c(0,1),]
I have a dataframe where the dates are given as hydrological years (October to September). To change this I am trying to use a if statement:
if(cet$month== 10|cet$month==11|cet$month==12)
cet$year <- substr(as.character(cet[,2]),1,4) else
cet$year <- substr(as.character(cet[,2]),6,9)
but I get an error:
the condition has length > 1 and only the first element will be used
Reading the "if" help file I realized that the condition has to be a length-one logical vector. Is there no way of using an "or" with an "if"? All I want is to apply that expression if the month is October, November or December.
ifelse is the vectorised version. You can also use %in% to reduce the number of statements.
cet$year <- ifelse(cet$month%in%(10:12), substr(as.character(cet[,2]),1,4), substr(as.character(cet[,2]),6,9))
Ok, here's a reproducible example that should help to clarify things:
# generate some vector
x <- c(1,2,4,4,5,5,6,6,6)
# have a check using OR, return values
x[x == 2 | x == 1]
## or return TRUE / FALSE
(x == 2 | x == 1)
or check ?ifelse
EDIT: Note that for characters you need to use "", like x == "yourchars" | x == "someotherchars"
Here's also some simple reference and how to work with operators: QuickR
the OR instruction is double pipes
| => || in the if()
I was wondering about how to write do-while-style loop?
I found this post:
you can use repeat{} and check conditions whereever using if() and
exit the loop with the "break" control word.
I am not sure what it exactly means. Can someone please elaborate if you understand it and/or if you have a different solution?
Pretty self explanatory.
repeat{
statements...
if(condition){
break
}
}
Or something like that I would think. To get the effect of the do while loop, simply check for your condition at the end of the group of statements.
See ?Control or the R Language Definition:
> y=0
> while(y <5){ print( y<-y+1) }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
So do_while does not exist as a separate construct in R, but you can fake it with:
repeat( { expressions}; if (! end_cond_expr ) {break} )
If you want to see the help page you cannot type ?while or ?repeat at the console but rather need to use ?'repeat' or ?'while'. All the "control-constructs" including if are on the same page and all need character quoting after the "?" so the interpreter doesn't see them as incomplete code and give you a continuation "+".
Building on the other answers, I wanted to share an example of using the while loop construct to achieve a do-while behaviour. By using a simple boolean variable in the while condition (initialized to TRUE), and then checking our actual condition later in the if statement. One could also use a break keyword instead of the continue <- FALSE inside the if statement (probably more efficient).
df <- data.frame(X=c(), R=c())
x <- x0
continue <- TRUE
while(continue)
{
xi <- (11 * x) %% 16
df <- rbind(df, data.frame(X=x, R=xi))
x <- xi
if(xi == x0)
{
continue <- FALSE
}
}
Noticing that user 42-'s perfect approach {
* "do while" = "repeat until not"
* The code equivalence:
do while (condition) # in other language
..statements..
endo
repeat{ # in R
..statements..
if(! condition){ break } # Negation is crucial here!
}
} did not receive enough attention from the others, I'll emphasize and bring forward his approach via a concrete example. If one does not negate the condition in do-while (via ! or by taking negation), then distorted situations (1. value persistence 2. infinite loop) exist depending on the course of the code.
In Gauss:
proc(0)=printvalues(y);
DO WHILE y < 5;
y+1;
y=y+1;
ENDO;
ENDP;
printvalues(0); # run selected code via F4 to get the following #
1.0000000
2.0000000
3.0000000
4.0000000
5.0000000
In R:
printvalues <- function(y) {
repeat {
y=y+1;
print(y)
if (! (y < 5) ) {break} # Negation is crucial here!
}
}
printvalues(0)
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# [1] 5
I still insist that without the negation of the condition in do-while, Salcedo's answer is wrong. One can check this via removing negation symbol in the above code.