R: Create binary data from a data frame

I need some advice on the following problem:
I have a data frame with two columns, one containing the date, the other the frequency of an event.
Now I want to add a third column to this data frame, which should contain binary data: 1 for days with a frequency of 100 or higher, 0 for the lower ones.
Does anyone have an idea how to do this in a smart way (I'm afraid of writing it by hand ;-)? Thanks in advance for your answers!

data$newcol = as.integer(data$freq >= 100)
alternatively
data$newcol = ifelse(data$freq >= 100, 1, 0)
alternatively
data$newcol = 0
data$newcol[data$freq >= 100] = 1

df$freq.gt.100 = as.integer(df$freq >= 100)
The bit inside brackets evaluates to TRUE or FALSE which can be converted to 1 or 0 via as.integer.
There's nothing to be "afraid" of: you can test the right-hand side of the expression on its own to check it works and only when you are happy with this do you add it as a new column to the original data.
EDIT: didn't see the above answer as I was creating this one and had a call to take!
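The "test the right-hand side first" workflow described above can be sketched as follows. The data frame `df` and its `date` and `freq` values are made up for illustration; they are not the asker's data.

```r
# Hypothetical example data: one row per day with an event frequency
df <- data.frame(
  date = as.Date("2020-01-01") + 0:4,
  freq = c(50, 100, 250, 99, 100)
)

# Step 1: evaluate the right-hand side on its own and inspect it
flag <- as.integer(df$freq >= 100)
flag

# Step 2: once it looks right, attach it as a new column
df$freq.gt.100 <- flag
```

Because the comparison is `>=`, a frequency of exactly 100 is flagged as 1 while 99 stays 0.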

Related

Finding the percentage of a specific value in the column of a data set

I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column: 1 (which means the student was accepted) and 0 (which means the student was not accepted). I want to find the percentage of accepted students.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need take the column mean.
mean_accepted <- mean(df$accepted)
You could first sum the column, and then divide by the total number of entries in the column.
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)
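As a small check that the table-based and mean-based approaches agree, here is a sketch on made-up data. The values in `college$accepted` below are hypothetical, and one `NA` is included on purpose to show why `na.rm = TRUE` matters.

```r
# Hypothetical data: 1 = accepted, 0 = not accepted, NA = unknown
college <- data.frame(accepted = c(1, 0, 0, 1, NA, 0, 1, 0))

# Proportion via the frequency table (table() drops NA by default)
p_table <- as.numeric(prop.table(table(college$accepted))["1"])

# Proportion via the mean of a logical vector, ignoring NA
p_mean <- mean(college$accepted == 1, na.rm = TRUE)

# Convert the proportion to a percentage
pct <- 100 * p_mean
```

Both proportions come out as 3/7 here. Without `na.rm = TRUE`, `mean()` would return `NA` for this column, and `sum(x) / length(x)` would also return `NA` because `sum()` propagates missing values.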

How to create a new column with repeated values based on another column?

Here is what I currently have. I have a column named "test1M", which has values of either 0 or 1. If the value is 1, I want to set the next 20 values in column "test1Mxx" to 1.
If I run this code, I get an error (Error in if (data$test1M[x] == 1) { : argument is of length zero).
What's a better way for me to do this? The code is pretty repetitive, so I would like to minimize that if possible. If there is a way to turn this into a function, that would be preferable, so I could change the number of values (for instance, the following 25 values, or 40 values, etc.).
for(x in data$test1){
if(data$test1[x]==1){
data$test2[x+1]=1
data$test2[x+2]=1
data$test2[x+3]=1
data$test2[x+4]=1
data$test2[x+5]=1
data$test2[x+6]=1
data$test2[x+7]=1
data$test2[x+8]=1
data$test2[x+9]=1
data$test2[x+10]=1
data$test2[x+11]=1
data$test2[x+12]=1
data$test2[x+13]=1
data$test2[x+14]=1
data$test2[x+15]=1
data$test2[x+16]=1
data$test2[x+17]=1
data$test2[x+18]=1
data$test2[x+19]=1
data$test2[x+20]=1}
}
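Your loop doesn't work because x is a value of data$test1, not an index of it. You need something like the following, which also clips the indices so that they never run past the last row:
data$test2 <- data$test1
n <- nrow(data)
for (x in seq_along(data$test1))
  if (data$test1[x] == 1) data$test2[intersect(x + 1:20, seq_len(n))] <- 1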
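Since the question asks for a function with an adjustable window, the loop above could be wrapped roughly like this. The name `flag_following` and its arguments are hypothetical, and this sketch assumes the output should start as all zeros, so the trigger row itself is not flagged unless an earlier 1 covers it.

```r
# Hypothetical helper: flag the n positions that follow each 1 in x
flag_following <- function(x, n = 20) {
  out <- numeric(length(x))          # start with all zeros
  for (i in which(x == 1)) {         # positions of the trigger values
    idx <- i + seq_len(n)            # the next n positions
    idx <- idx[idx <= length(x)]     # drop positions past the end
    out[idx] <- 1
  }
  out
}

test1 <- c(0, 1, 0, 0, 0, 0, 1, 0)
flag_following(test1, n = 2)
```

With `n = 2`, the triggers at positions 2 and 7 flag positions 3, 4 and 8. For a data frame, the call would look like `data$test2 <- flag_following(data$test1, n = 25)`.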

Is it possible to create a countif like function in R using ranges?

I've already read this question with an approach to counting entries in R:
how to realize countifs function (excel) in R
I'm looking for a similar approach, except that I want to count data that is within a given range.
For example, let's say I have this dataset:
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
Following the approach on the linked question, we would develop something like this:
count <- data$values == 1.5
sum(count)
Problem is, I want to be able to include in the count anything that varies 0.2 from 1.5 - that is, all possible number from 1.3 to 1.7.
Is there a way to do so?
sum(data$values>=1.3 & data$values<=1.7)
As the explanation in the question you linked to points out, when you just write out a boolean condition, it generates a vector of TRUEs and FALSEs the same length as your original dataframe. TRUE equals 1 and FALSE equals 0, so summing across it gives you a count. So it simply becomes a matter of putting your condition as a boolean phrase. In the case of more than one condition, you connect them with & or | (or) -- much the same way that you could do in excel (only in excel you have to do AND() or OR()).
(For a more general solution, you can use dplyr::between; it is also supposed to be faster since it's implemented in C++. In this case, it would be sum(between(data$values, 1.3, 1.7)).)
As #doviod writes, you can use a compound logical condition.
My approach is different: I wrote a function that takes the vector and describes the range by a center value and a distance delta.
Following a suggestion by #doviod in the comments, delta defaults to 0, so that if only value is passed, the function returns
a count of the cases that equal the value the user provides.
countif <- function(x, value, delta = 0)
sum(value - delta <= x & x <= value + delta)
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
countif(data$values, 1.5, 0.2)
#[1] 3
which identifies the location of all values in your vector that satisfy your criterion, and length subsequently counts the 'hits'.
length( which(data$values>=1.3 & data$values<=1.7) )
[1] 3

R: Produce Index Values to Group Increasing Values in Vector

I have a list of increasing year values that occasionally has breaks in it and I want to create a grouping value for each unbroken sequence. Think of a vector like this one (missing 2005,2011):
x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
I would like to produce an equal length vector that numbers every value in a run with the same index to end up with something like this.
[1] 1 1 1 1 2 2 2 2 2 3 3 3 3
I would like to do this using best R practices so I am trying to avoid falling back to a for loop but I am not sure how to get from Vector A to Vector B. Does anyone have any suggestions?
Some things I know I can do:
I can flag the record before or after a gap as true with an ifelse
I can get the index of when the counter should change by wrapping that in a which statement
This is the code to do each
ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE)
which(ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE))
I think there are a couple of solutions to this problem. One, which d.b posted in a comment above, produces a sequence that increments every time there is a break in the sequence.
cummax(c(1, diff(x)))
There is a similar solution that I chose to use, with ifelse() flagging the breaks and cumsum() counting them. I chose this solution because additional information, like other vectors, can be included in the decision, and diff seems to have problems with very erratic up and down values.
cumsum(ifelse(!is.na(lag(x)) & x == lag(x) + 1, FALSE, TRUE))
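A base R variant of the same flag-then-cumsum idea is sketched below. It is offered as an alternative under the assumption that a "break" means any step other than exactly +1; it avoids `lag()`, which in the snippets above appears to rely on `dplyr::lag` rather than `stats::lag`.

```r
x <- c(2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010,
       2013, 2014, 2015, 2016)

# TRUE wherever the step from the previous value is not exactly 1
breaks <- diff(x) != 1

# The first element always opens group 1; every break adds 1
grp <- cumsum(c(TRUE, breaks))
grp

# A second vector whose gaps do not grow in size
y <- c(1, 2, 5, 6, 8, 9)
grp_y <- cumsum(c(TRUE, diff(y) != 1))
```

For `x` this gives `1 1 1 1 2 2 2 2 2 3 3 3 3`. For `y` it gives `1 1 2 2 3 3`, whereas `cummax(c(1, diff(y)))` returns `1 3 3 3 3 3`, because `cummax` tracks the largest gap seen so far rather than the number of gaps.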

R: Create a Vector starting at position 0 in R

I am new to R. I want to carry out a simulation starting at Period 0. That works quite well using vectors, but they all start at position 1.
Is there a way to change that? Or an alternative?
Thanks a lot!
Serijoscha
Use the Oarray package with offset = 0
library(Oarray)
vec <- Oarray(1:10, offset = 0)
vec[0]
#[1] 1
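If adding a package is not an option, one possible base R workaround is to keep the usual 1-based vector and either shift the index by one or label the elements with their period. The simulation below (a value growing by 10% per period) is a made-up example, not the asker's model.

```r
# Hypothetical simulation over periods 0..5
periods <- 0:5
value <- numeric(length(periods))
names(value) <- periods              # labels "0" to "5"

value[1] <- 100                      # period 0 lives at position 1
for (t in 1:5) {
  # period t is stored at position t + 1, period t - 1 at position t
  value[t + 1] <- value[t] * 1.1
}

value[["0"]]                         # look up period 0 by its label
```

Looking up by label (`value[["3"]]`) or by shifted position (`value[3 + 1]`) returns the same element; note that `value[0]` in base R returns an empty vector rather than the first element.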
