Finding the percentage of a specific value in the column of a data set - r

I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column - 1 (which means student was accepted) and 0 (which means student was not accepted). I was to find the accepted student percentage.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!

Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]

If it's a simple 0/1 column then you only need take the column mean.
mean_accepted <- mean(df$accepted)

you could first sum the column, and the count the total number in the column
sum(college$accepted)/length(college$accepted)

To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)

Related

Is there some way to detect 'wrong' measures in a dataframe?

I'm struggling on how can I remove 'wrong' measures from my dataset. I'm dealing with kind a huge table, where I have a date and the size of an equipment. It can't get bigger with use, at most it can stay the same size, so of course this problem is a measurement error.
My database is extensive and with several particular cases, which makes it impossible for me to place it here, among other business reasons... Therefore, I use an image and a part of the data as an example, but the problem is what I described above...
simplest_example = test = data.frame(data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021","20-11-2021"), measure = c(5,4,3,5,2))
#as result:
# data1 measure
#1 20-09-2020 5
#2 15-10-2020 4
#3 13-05-2021 3
#4 20-11-2021 2
The point is: Select the largest non-ascending sequence possible, and exclude some values that inhibit this from happening.
So I would like to ask for a suggestion, if anyone here has come across something similar, and let me know how to recommend something.
If I understand, you want to detect any time the variable measure is greater than the value at the previous time point? I'd create a lag column, which is just the measure column lagged by one time. Then identify when a previous measure is greater than the current measure
library(dplyr)
simplest_example %>%
mutate(previous_measure = lag(measure)) %>%
filter(previous_measure < measure)

R programming- adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why do we -1 in the tail and -1 in head to create this new column.
I made an effort to understand by removing the -1 and "R"(The code is in R studio) throws me this error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one line to the next. That calculation is invalid for the first row (no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't follow sum rules. For data.frames, the right-hand side needs to have the same number of elements, or a number that can be recycled a whole-number of times. For example, if you have 10 rows, you need to have a replacement of 10 values. Or you can have 5 values that R will recycle.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first n. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(dat_date_ord)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending for observations 7-10, since 3/4 observations in that group of 4 all have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of if I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")

Is it possible to create a countif like function in R using ranges?

I've already read this question with an approach to counting entries in R:
how to realize countifs function (excel) in R
I'm looking for a similar approach, except that I want to count data that is within a given range.
For example, let's say I have this dataset:
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
Following the approach on the linked question, we would develop something like this:
count <- data$values == 1.5
sum(count)
Problem is, I want to be able to include in the count anything that varies 0.2 from 1.5 - that is, all possible number from 1.3 to 1.7.
Is there a way to do so?
sum(data$values>=1.3 & data$values<=1.7)
As the explanation in the question you linked to points out, when you just write out a boolean condition, it generates a vector of TRUEs and FALSEs the same length as your original dataframe. TRUE equals 1 and FALSE equals 0, so summing across it gives you a count. So it simply becomes a matter of putting your condition as a boolean phrase. In the case of more than one condition, you connect them with & or | (or) -- much the same way that you could do in excel (only in excel you have to do AND() or OR()).
(For a more general solution, you can use dplyr::between - it's also supposed to be faster since it's implemented in C++. In this case, it would be sum(between(data$values,1.3,1.7).)
Like #doviod writes, you can use a compound logical condition.
My approach is different, I wrote a function that takes the vector and as range the center point value and the distance delta.
After a suggestion by #doviod, I have set a default value delta = 0, so that if only value is passed, the function returns
a count of cases where the values equal the value the user provides.
(doviod, in the comment)
countif <- function(x, value, delta = 0)
sum(value - delta <= x & x <= value + delta)
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
countif(data$values, 1.5, 0.2)
#[1] 3
which identifies the location of all values in your vector that satisfy your criterion, and length subsequently counts the 'hits'.
length( which(data$values>=1.3 & data$values<=1.7) )
[1] 3

R Compare each data value of a column to rest of the values in the column?

I would like to create a function that looks at a column of values. from those values look at each value individually, and asses which of the other data points value is closest to that data point.
I'm guessing it could be done by checking the length of the data frame, making a list of the respective length in steps of 1. Then use that list to reference which cell is being analysed against the rest of the column. though I don't know how to implement that.
eg.
data:
20
17
29
33
1) is closest to 2)
2) is closest to 1)
3) is closest to 4)
4) is closest to 3)
I found this example which tests for similarity but id like to know what letter is assigns to.
x=c(1:100)
your.number=5.43
which(abs(x-your.number)==min(abs(x-your.number)))
Also if you know how I could do this, could you expain the parts of the code and what they mean?
I wrote a quick function that does the same thing as the code you provided.
The code you provided takes the absolute value of the difference between your number and each value in the vector, and compares that the minimum value from that vector. This is the same as the which.min function that I use below. I go through my steps below. Hope this helps.
Make up some data
a = 1:100
yourNumber = 6
Where Num is your number, and x is a vector
getClosest=function(x, Num){
return(which.min(abs(x-Num)))
}
Then if you run this command, it should return the index for the value of the vector that corresponds to the closest value to your specified number.
getClosest(x=a, Num=yourNumber)

Resources