R - Counting the number of a specific value in bins

I have a data frame (df) like below:
Value <- c(1,1,0,2,1,3,4,0,0,1,2,0,3,0,4,5,2,3,0,6)
Sl <- c(1:20)
df <- data.frame(Sl,Value)
> df
Sl Value
1 1 1
2 2 1
3 3 0
4 4 2
5 5 1
6 6 3
7 7 4
8 8 0
9 9 0
10 10 1
11 11 2
12 12 0
13 13 3
14 14 0
15 15 4
16 16 5
17 17 2
18 18 3
19 19 0
20 20 6
I would like to create 4 bins out of df and count the occurrences of Value=0 grouped by Sl values in a separate data frame like below:
Bin Count
1 1
2 2
3 2
4 1
I was trying to use table and cut to create the desired data frame, but it's not clear how to specify df$Value and the logic to find the 0s here:
df.4.cut <- as.data.frame(table(cut(df$Sl, breaks=seq(1,20, by=5))))

Using your df
tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
gives
> tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
(0.981,5.75] (5.75,10.5] (10.5,15.2] (15.2,20]
1 2 2 1
In cut you can specify either the number of bins or the break points themselves, and the counting logic lives in the function passed to tapply.
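The question asked for the result as a separate data frame with Bin and Count columns. Here is a minimal base R sketch of that last step, assuming integer bin numbers are acceptable in place of interval labels (hence labels = FALSE in cut). The names bin_id, counts and result are illustrative and do not come from the original answer.

```r
# Rebuild the example data from the question
Value <- c(1,1,0,2,1,3,4,0,0,1,2,0,3,0,4,5,2,3,0,6)
Sl <- 1:20
df <- data.frame(Sl, Value)

# labels = FALSE makes cut() return the bin number (1..4) instead of an interval label
bin_id <- cut(df$Sl, breaks = 4, labels = FALSE)

# Count the zeros within each bin
counts <- tapply(df$Value, bin_id, function(x) sum(x == 0))

# Assemble the data frame; 'result' is a hypothetical name
result <- data.frame(Bin = as.integer(names(counts)), Count = as.vector(counts))
result
```

If the interval labels are preferred, dropping labels = FALSE and using names(counts) directly as the Bin column should work the same way.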

Or using data.table: we convert the 'data.frame' to a 'data.table' with setDT(df) and use the cut output as the grouping variable. Within each group, sum(!Value) counts the zeros, because negating (!) coerces the column to a logical vector that is TRUE where Value is 0 and FALSE everywhere else.
library(data.table)
setDT(df)[,sum(!Value) , .(gr=cut(Sl,breaks=seq(0,20, 5)))]
# gr V1
#1: (0,5] 1
#2: (5,10] 2
#3: (10,15] 2
#4: (15,20] 1

Your question used table(), but it lacked a second argument, which is needed to produce a contingency table. You can get the count of every value in each bin with:
table(cut(df$Sl,4),df$Value)
0 1 2 3 4 5 6
(0.981,5.75] 1 3 1 0 0 0 0
(5.75,10.5] 2 1 0 1 1 0 0
(10.5,15.2] 2 0 1 1 1 0 0
(15.2,20] 1 0 1 1 0 1 1
And the count of Value == 0 for each bin:
table(cut(df$Sl,4),df$Value)[,"0"]
(0.981,5.75] (5.75,10.5] (10.5,15.2] (15.2,20]
1 2 2 1

A more convoluted way, using sqldf:
First we create a table defining the bins and their ranges (min and max):
bins <- data.frame(id = c(1, 2, 3, 4),
bins = c("(0,5]", "(5,10]", "(10,15]", "(15,20]"),
min = c(0, 6, 11, 16),
max = c(5, 10, 15, 20))
id bins min max
1 1 (0,5] 0 5
2 2 (5,10] 6 10
3 3 (10,15] 11 15
4 4 (15,20] 16 20
Then we run the following query against both tables. BETWEEN assigns each Sl to its bin, and only the rows where Value equals 0 are counted.
library(sqldf)
sqldf("SELECT bins, COUNT(Value) AS freq FROM df, bins
WHERE (((sl) BETWEEN [min] AND [max]) AND Value = 0)
GROUP BY bins
ORDER BY id")
Output:
bins freq
1 (0,5] 1
2 (5,10] 2
3 (10,15] 2
4 (15,20] 1
Alternatively, as suggested by mts, the construction of bins can be simplified by using cut and extracting the levels of the factor:
bins <- data.frame(id = 1:4,
bins = levels(cut(Sl, breaks = seq(0, 20, 5))),
min = seq(1, 20, 5),
max = seq(5, 20, 5))

Related

How to shift data in only one column up and down in R?

I have a data frame that looks as follows:
ID Count
1 3
2 5
3 2
4 0
5 1
And I am trying to shift ONLY the values in the "Count" column down one so that it looks as follows:
ID Count
1 NA
2 3
3 5
4 2
5 0
I will also need to eventually shift the same data up one:
ID Count
1 5
2 2
3 0
4 1
5 NA
I've tried the following code:
shift <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
df$Count <- shift(df$Count, 1)
But it ended up duplicating the titles and shifting the data down, as follows:
ID Count
ID Count
1 3
2 5
3 2
4 0
Is there an easy way for me to accomplish this? Thank you!!
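library(data.table)
# set as data.table
setDT(df)
# shift Count down by one; namespace-qualified so that a user-defined shift(),
# such as the one in the question, cannot mask data.table's
df[, Count := data.table::shift(Count, 1)]
Or, in base R:
df$Count <- c(NA, df$Count[1:(nrow(df)-1)])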
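Note that the shift() function in the question drops the first n elements and pads the end with NA, so it shifts values up rather than down. Below is a minimal base R sketch covering both directions. The helper names shift_down and shift_up are hypothetical, and both assume 1 <= n < length(x).

```r
# Hypothetical helpers; both assume 1 <= n < length(x)
shift_down <- function(x, n = 1) c(rep(NA, n), head(x, -n))  # lag: pad the front with NA
shift_up   <- function(x, n = 1) c(tail(x, -n), rep(NA, n))  # lead: pad the end with NA

DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))
DF$CountDown <- shift_down(DF$Count)  # NA 3 5 2 0
DF$CountUp   <- shift_up(DF$Count)    # 5 2 0 1 NA
DF
```

Writing the results to new columns keeps the original Count intact, which makes it easy to check the shift before overwriting anything.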
1) dplyr Using DF shown reproducibly in the Note at the end, use lag and lead from dplyr
library(dplyr)
DF %>% mutate(CountLag = lag(Count), CountLead = lead(Count))
## ID Count CountLag CountLead
## 1 1 3 NA 5
## 2 2 5 3 2
## 3 3 2 5 0
## 4 4 0 2 1
## 5 5 1 0 NA
2) zoo This creates a zoo object using zoo's vectorized lag. Optionally use fortify.zoo(z) or as.ts(z) to convert it back to a data frame or ts object.
Note that dplyr clobbers lag with its own lag so we used stats::lag to ensure it does not interfere. The stats:: can optionally be omitted if dplyr is not loaded.
library(zoo)
z <- stats::lag(read.zoo(DF), seq(-1, 1)); z
Index lag-1 lag0 lag1
1 1 NA 3 5
2 2 3 5 2
3 3 5 2 0
4 4 2 0 1
5 5 0 1 NA
3) collapse flag from the collapse package is also vectorized over its second argument.
library(collapse)
with(DF, data.frame(ID, Count = flag(Count, seq(-1, 1))))
## ID Count.F1 Count... Count.L1
## 1 1 5 3 NA
## 2 2 2 5 3
## 3 3 0 2 5
## 4 4 1 0 2
## 5 5 NA 1 0
Note
DF <- data.frame(ID = 1:5, Count = c(3, 5, 2, 0, 1))

R: Creating bins by a factor when number of observations not divisible by number of bins?

I have a data set in which I have a number of DVs for each level of a factor. The number of DVs per factor level is not consistent. I would like to create quantile bins such that, for each level of the factor (taking four bins as an example), the smallest 25% of values are assigned to bin 1, the next smallest 25% to bin 2, etc.
I have found a package with a NEAR perfect solution: schoRsch, in which the function ntiles creates bins based on levels of the factor, like so:
library(schoRsch)
#{
dv <- c(5, 2, 10, 15, 3, 7, 20, 44, 18)
factor <- c(1,1,2,2,2,2,3,3,3)
tmpdata <- data.frame(cbind(dv,factor))
tmpdata$factor <- as.factor(tmpdata$factor)
head(tmpdata)
tmpdata$bins <- ntiles(tmpdata, dv = "dv", bins=2, factors = "factor")
tmpdata
#}
the output looks like:
dv factor bins
1 5 1 2
2 2 1 1
3 10 2 2
4 15 2 2
5 3 2 1
6 7 2 1
7 20 3 2
8 44 3 2
9 18 3 1
My problem occurs when the number of DVs for a particular factor level is not divisible by the number of bins. In the example above, factor 3 has 3 observations, and when sorting into two bins the first bin gets one observation and the second gets 2. However, I would like the first bin to get priority when assigning a DV, then the second, and so on. In my actual data set, for instance, I have a factor with 79 associated DVs and 5 bins. So I would want 16 observations in each of bins 1-4, and then 15 in bin 5. However, this method gives me 16 observations in bins 1 and 3-5, and 15 in bin 2.
Is there any way to specify here my desired order of binning? Or is there an alternative way that I can solve this problem with another method that allows me to bin on the basis of a factor or, more helpfully, multiple factors?
Thank-you!
Something like this?
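foo = function(x, bins) {
  len = length(x)
  n1 = ceiling(len/bins)
  n2 = len - n1 * (bins - 1)   # size of the last bin; assumes this is positive
  # bin labels for the values in ascending order: the first bins fill up first
  sorted_bins = c(rep(1:(bins - 1), each = n1), rep(bins, n2))
  # give each value the bin of its rank, so the smallest values land in bin 1
  sorted_bins[rank(x, ties.method = "first")]
}
table(foo(1:79, 5))
# 1 2 3 4 5
#16 16 16 16 15
library(dplyr)
tmpdata %>% group_by(factor) %>% mutate(bin = foo(dv, 2))
## A tibble: 9 x 3
## Groups: factor [3]
# dv factor bin
# <dbl> <fct> <dbl>
#1 5 1 2
#2 2 1 1
#3 10 2 2
#4 15 2 2
#5 3 2 1
#6 7 2 1
#7 20 3 1
#8 44 3 2
#9 18 3 1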

subtracting the greater column from smaller columns in a dataframe in R

I have the input below and I would like to subtract the two columns, but I always want to subtract the lowest value from the highest value, because I don't want negative values in the result.
Sometimes the highest value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), colnames = c("PaternalOrigin", "MaternalOrigin"), row.names= c(NA, -6L), class="data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap with abs
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
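For readers who prefer to spell out "highest minus lowest" literally, here is a base R sketch using pmax and pmin, which take the element-wise maximum and minimum of two vectors. For these data it should give the same result as abs; the data frame is rebuilt from the values in the question.

```r
# Data from the question
df <- data.frame(PaternalOrigin = c(16, 3, 11, 1, 1, 3),
                 MaternalOrigin = c(20, 6, 0, 3, 4, 11))

# Element-wise larger value minus element-wise smaller value
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) -
                       pmin(PaternalOrigin, MaternalOrigin))
df
```

The abs version is shorter; the pmax/pmin form is mainly useful when the larger and smaller values are needed separately as well.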

Percolation clustering

Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have a vector x. What is the easiest way to generate grp values, essentially an id field for groups of values that lie within a threshold of each other? Is this a percolation algorithm?
One option would be to compare the next with the current value and check if the difference is greater than 1, and get the cumulative sum.
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
EDIT: From #geotheory's comments.
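The same idea extends to an arbitrary threshold. The sketch below wraps it in a helper, group_by_gap, which is a hypothetical name and not part of the original answer. It assumes x is sorted in ascending order; an unsorted vector would need sort(x) first.

```r
# Hypothetical helper: start a new group whenever the gap to the previous
# value exceeds 'threshold'. Assumes x is sorted in ascending order.
group_by_gap <- function(x, threshold = 1) {
  cumsum(c(TRUE, diff(x) > threshold))
}

x <- c(3:5, 7:9, 12:14)
group_by_gap(x)                 # 1 1 1 2 2 2 3 3 3
group_by_gap(x, threshold = 2)  # 1 1 1 1 1 1 2 2 2
```

With threshold = 2 the gap between 5 and 7 no longer splits the groups, while the gap of 3 between 9 and 12 still does.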

Conditional calculation of means of different columns in data.table with R

An earlier question (linked below the table) discussed how to calculate the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x, leaving values of 0 and NA in any vector out of the calculation? For example, if the 3rd value in vector y is 0, then the 3rd value of every vector (y, z, t) should not be used. And in the result, the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; it needs the formula (mean(y)+mean(z))/(mean(z)-mean(t)) added to it:
data <- data.table(dataframe)
bar <- data[,.N,by=x]
foo <- data[ ,list(mean.y =mean(y, na.rm = T),
mean.z=mean(z, na.rm = T),
mean.t=mean(t,na.rm = T)),
by=x]
This code uses all rows to calculate the means, but when calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t is zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table drops rows for which the condition in i evaluates to NA (a default chosen with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter out for your conditions using a nice little hack TRUE | NA = TRUE and FALSE | NA = NA and NA | NA = NA (you can test these out in your R session).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
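The three-valued logic behind the hack can be checked in plain R. This is only an illustration on a small made-up vector y, not the data from the question. Base R indexing, unlike data.table's i, does not drop NA entries of a logical index, so the sketch uses which() to keep only the TRUE positions.

```r
TRUE | NA    # TRUE
FALSE | NA   # NA
NA | NA      # NA

# Made-up vector for illustration: a non-zero value, a zero, an NA, a non-zero value
y <- c(1, 0, NA, 3)
keep <- y | NA   # TRUE NA NA TRUE: only non-zero, non-NA values give TRUE
y[which(keep)]   # 1 3
```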
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
group_by(x) %>%
summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.
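For comparison, here is a base R sketch of the same calculation without additional packages. It is an alternative under the same assumptions (rows with an NA or a zero in y, z or t are dropped before averaging), not the method of either answer; the names keep, clean and ans are illustrative.

```r
# Sample data from the question
df <- read.table(text = "x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)

# Keep rows where y, z and t are all present and non-zero.
# complete.cases() is FALSE for any row containing NA, and FALSE & NA is FALSE,
# so 'keep' itself contains no NA values.
keep <- complete.cases(df) & df$y != 0 & df$z != 0 & df$t != 0
clean <- df[keep, ]

# One value of the formula per level of x
ans <- sapply(split(clean, clean$x), function(g) {
  (mean(g$y) + mean(g$z)) / (mean(g$z) - mean(g$t))
})
round(ans, 4)
```

The result is a named vector with one entry per value of x, which should agree with the data.table and dplyr outputs shown above.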
