selecting row in a dataframe according to defined values - r

I have an annual record of temperature. I need to select specific rows (days) along with the five rows before each of them, and then take the mean of each selected group. Here is my data frame and the code I tried, which didn't work.
Day T.m
1 22
2 21
3 34
4 28
5 14
6 7
7 12
8 22
9 11
10 12
11 14
12 3
13 4
14 11
15 16
a <- c(8, 12,14)
apply(DF [c((a-5):a),2], 1, mean)

We can use mapply
mapply(function(x, y) mean(DF[[2]][x:y]), a-5, a)
#[1] 19.500000 12.333333 9.166667
Or a vectorized approach would be
tapply(DF[[2]][rep(a-5 , each = 6) + 0:5], rep(1:3, each = 6), FUN = mean)
#         1         2         3
# 19.500000 12.333333  9.166667
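For readers who prefer an explicit window, here is a minimal base-R sketch of the same calculation. The helper name window_means and its before argument are illustrative assumptions, not something from the answer above; the data frame is rebuilt from the question so the block runs on its own.

```r
# Data from the question
DF <- data.frame(
  Day = 1:15,
  T.m = c(22, 21, 34, 28, 14, 7, 12, 22, 11, 12, 14, 3, 4, 11, 16)
)
a <- c(8, 12, 14)

# Hypothetical helper: mean of each selected row plus the `before` rows preceding it
window_means <- function(values, ends, before = 5) {
  vapply(ends, function(end) {
    start <- max(1, end - before)  # clamp so the window never starts before row 1
    mean(values[start:end])
  }, numeric(1))
}

group_means <- window_means(DF$T.m, a)
group_means
```

Each window here covers six rows (the selected day and the five days before it), which is what the mapply call above computes as well.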

Related

Binning with quantiles adding exception in r

I need to create 10 bins with approximately equal frequencies. For this, I am using the function classIntervals from the classInt package with style 'quantile' to bin some data. This works for most columns, but when a column has one number repeated too many times, I get an error saying that some breaks are not unique. That makes sense: if, say, the last 30% or more of the column is the same number, the function doesn't know how to split the bins.
What I would like is this: if a number accounts for more than 10% of the length of the column, treat it as its own bin; otherwise, use the function as it is.
For example, let's assume we have this DF:
df <- read.table(text="
X
1 5
2 29
3 4
4 26
5 4
6 17
7 4
8 4
9 4
10 25
11 4
12 4
13 5
14 14
15 18
16 13
17 29
18 4
19 13
20 6
21 26
22 11
23 2
24 23
25 4
26 21
27 7
28 4
29 18
30 4", header = TRUE, stringsAsFactors = FALSE)
In this case, 10% of the length is 3. A table with the frequency of each number looks like this:
2 1
4 11
5 2
6 1
7 1
11 1
13 2
14 1
17 1
18 2
21 1
23 1
25 1
26 2
29 2
With this info, first we should treat "4" as a unique bin.
So we have a final output more or less like this:
X Bins
1 5 [2,6)
2 29 [27,30)
3 4 [4]
4 26 [26,27)
5 4 [4]
6 17 [15,19)
7 4 [4]
8 4 [4]
9 4 [4]
10 25 [19,26)
11 4 [4]
12 4 [4]
13 5 [2,6)
14 14 [12,15)
15 18 [15,19)
16 13 [12,15)
17 29 [27,30)
18 4 [4]
19 13 [12,15)
20 6 [6,12)
21 26 [26,27)
22 11 [6,12)
23 2 [2,6)
24 23 [19,26)
25 4 [4]
26 21 [19,26)
27 7 [6,12)
28 4 [4]
29 18 [15,19)
30 4 [4]
Until now, my approach has been something like this:
Moda <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Binner <- function(df) {
library(classInt)
#Input is a matrix that wants to be binned
for (c in 1:ncol(df)) {
if (sapply(df,class)[c]=="numeric") {
VectorTest <- df[,c]
# Here I get the 10% of the values
TenPer <- floor(length(VectorTest)/10)
while((sum(VectorTest == Moda(VectorTest)))>=TenPer) {
# in this loop I manage to remove the values that
# are repeated more than 10% but I still don't know how to add it as a special bin
VectorTest <- VectorTest[VectorTest!=Moda(VectorTest)]
Counter <- Counter +1
}
binsTest <- classIntervals(VectorTest_Fixed, 10- Counter, style = 'quantile')
binsBrakets <- cut(VectorTest, breaks = binsTest$brks)
df[ , paste0("Binned_", colnames(df)[c])] <- binsBrakets
}
}
return (df)
}
Can someone help me?
You could use cutr::smart_cut:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
#
#   [2,4)   [4,5)   [5,6)  [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29]
#       1      11       2       2       3       2       2       2       3       2
more on cutr and smart_cut
You can create two different data frames: one for the values that make up 10% or more of the data (each becomes its own bin) and one for the rest, binned with cut. Then bind them together (make sure the bins are strings).
library(magrittr)
# let's find the numbers that make up at least 10% of the data
large <- table(df$X) %>%
.[. >= length(df$X)/10] %>%
names()
# these numbers each make up less than 10% of the data
left_over <- df$X[!df$X %in% large]
# we want a total of 10 bins, so we cut the rest into 10 minus the number of frequent values
left_over_bins <- cut(left_over, 10 - length(large))
# let's combine the information into a single data frame
numbers_bins <- rbind(
data.frame(
n = left_over,
bins = left_over_bins %>% as.character,
stringsAsFactors = F
),
data.frame(
n = df$X[df$X %in% large],
bins = df$X[df$X %in% large] %>% as.character,
stringsAsFactors = F
)
)
If you tabulate the information, you'll get something like this:
table(numbers_bins$bins) %>% sort(T)
       4 (1.97,5]  (11,14]  (23,26]  (17,20]
      11        3        3        3        2
 (20,23]  (26,29]    (5,8]  (14,17]   (8,11]
       2        2        2        1        1
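The approach above changes the row order and uses equal-width bins for the remaining values. If the original row order and quantile-based breaks matter, a base-R sketch along the following lines may help. The function name bin_with_exceptions and its arguments n_bins and threshold are hypothetical, and the bin edges it produces will differ from the illustrative output in the question.

```r
# Column X from the question
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)

# Hypothetical helper: frequent values get their own bin, the rest get quantile bins
bin_with_exceptions <- function(x, n_bins = 10, threshold = 0.10) {
  counts <- table(x)
  # Values that cover more than `threshold` of the column
  frequent <- as.numeric(names(counts)[counts > threshold * length(x)])
  is_frequent <- x %in% frequent

  bins <- character(length(x))
  bins[is_frequent] <- paste0("[", x[is_frequent], "]")

  # Quantile breaks for the remaining values; unique() guards against duplicate breaks
  rest <- x[!is_frequent]
  probs <- seq(0, 1, length.out = n_bins - length(frequent) + 1)
  brks <- unique(quantile(rest, probs = probs, names = FALSE))
  bins[!is_frequent] <- as.character(
    cut(rest, breaks = brks, include.lowest = TRUE, right = FALSE)
  )
  bins
}

df <- data.frame(X = x)
df$Bins <- bin_with_exceptions(df$X)
table(df$Bins)
```

This is only a sketch: it assumes at least two distinct values remain after the frequent ones are set aside, and it uses base quantile rather than classIntervals so that it has no package dependencies.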

calculating column sum for certain row

For each row, I am trying to calculate the sum of a column over that row and the 5 rows before it, in R, using the following code:
df <- data.frame(count=1:10)
for (loop in (1:nrow(df)))
{df[loop,"acc_sum"] <- sum(df[max(1,loop-5):loop,"count"])}
But I don't like the explicit loop here, how can I modify it? Thanks.
According to your question, your desired result is:
df
# count acc_sum
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
# 7 7 27
# 8 8 33
# 9 9 39
# 10 10 45
This can be done like this:
df <- data.frame(count=1:10)
library(zoo)
df$acc_sum <- rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
To obtain this result, we reverse df$count, compute the rolling sums (partial = TRUE and align = "left" are important here), and then reverse the result to get the vector we need.
rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
# [1] 1 3 6 10 15 21 27 33 39 45
Note that this sums 6 elements, not 5. According to the code in your question, this gives the same output. If you just want to sum 5 rows, just replace the 6 with a 5.
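If adding a package dependency is not desirable, the same trailing sum can be sketched in base R with cumsum. The variable names k, cs and lagged_cs are illustrative, and the sketch assumes k is smaller than the number of rows.

```r
x <- 1:10
k <- 6  # window width: the current row plus the 5 rows before it

cs <- cumsum(x)
# Running total from k rows back (0 where fewer than k rows precede the current one)
lagged_cs <- c(rep(0, k), head(cs, -k))
acc_sum <- cs - lagged_cs
acc_sum
```

Subtracting the running total from k rows back leaves exactly the sum of the last k elements, so no explicit loop is needed.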

Function for lagged sums

I know how to take the lagged difference:
delX = diff(x)
But the only way I know to take the lagged sum is:
sumY = apply(embed(c(0,y),2),1, sum)
Is there a function that can take the lagged sum? This way (or sliding the index in some other fashion) is not very intuitive.
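You're looking for filter (from the stats package; writing stats::filter avoids a clash with dplyr::filter if dplyr is attached):
x <- 1:10
stats::filter(x, filter=c(1,1), sides=1)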
# [1] NA 3 5 7 9 11 13 15 17 19
You could also use head and tail:
head(x, -1) + tail(x, -1)
# [1] 3 5 7 9 11 13 15 17 19
Two more options:
x <- 1:10
x + dplyr::lag(x)
# [1] NA 3 5 7 9 11 13 15 17 19
x + data.table::shift(x)
# [1] NA 3 5 7 9 11 13 15 17 19
Note that you can easily change the number of lags in both functions. Instead of lagging, you can also create a leading vector by using dplyr::lead() or data.table::shift(x, 1L, type = "lead"). Both functions also allow you to specify default values (which are NA by default).
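For a dependency-free variant with a configurable lag, something like the following base-R sketch could work. The function name lagged_sum and its argument k are hypothetical, chosen here only for illustration.

```r
# Hypothetical helper: x[i] + x[i - k], with NA where no lagged value exists
lagged_sum <- function(x, k = 1) {
  stopifnot(k >= 1, k < length(x))
  x + c(rep(NA, k), head(x, -k))
}

x <- 1:10
lagged_sum(x)         # lag of one position
lagged_sum(x, k = 2)  # lag of two positions
```

The leading NA values mirror what dplyr::lag and data.table::shift return by default.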

Assigning Percentile Based Groups to Dataframe in R

I am having trouble figuring out how to take on this particular problem.
Suppose I have the following data frame:
set.seed(123)
Factors <- sample(LETTERS[1:26],50,replace=TRUE)
Values <- sample(c(5,10,15,20,25,30),50,replace=TRUE)
df <- data.frame(Factors,Values)
df
Factors Values
1 H 5
2 U 15
3 K 25
4 W 5
5 Y 20
6 B 10
7 N 5
8 X 25
9 O 30
10 L 15
11 Y 20
12 L 5
13 R 15
Data goes all the way to row 50, but left out here
Now suppose that I take the sum of Values by Factors
Sum.df <- aggregate(Values ~ Factors, data = df, FUN = sum)
Sum.df
Factors Values
1 A 5
2 B 35
3 C 25
4 D 30
5 F 30
6 G 75
7 H 20
8 I 55
9 J 20
10 K 60
11 L 20
12 M 20
13 N 5
14 O 55
15 P 20
16 Q 25
17 R 45
18 S 30
19 T 30
20 U 40
21 W 25
22 X 90
23 Y 55
24 Z 15
Then finally I use quantile to find percentile cut offs for the aggregated data.
quantile(Sum.df$Values, probs = c(0.33,.66,1))
33% 66% 100%
22.95 35.90 90.00
Okay, so here's my question. What I want to do is create three groups, Group 1, Group 2 and Group 3, based on these quantiles. For example, in Sum.df the aggregated value for A is 5, so I want to assign that factor to Group 1 because 5 is less than 22.95. If the value in Sum.df is greater than 22.95 and less than or equal to 35.9, assign it to Group 2; assign everything else to Group 3. What I would love to see is a new column in df that denotes which group each of the Factors is in. I hope this makes sense. Thanks guys!
How about the cut function? You just need to include the minimum (probs = 0) in your quantiles.
q <- quantile(Sum.df$Values, probs = c(0, 0.33,.66,1))
Sum.df$group <- cut(Sum.df$Values, q, include.lowest=TRUE,
labels=paste("Group", 1:3))
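The code above adds the group to Sum.df, while the question asks for a column in df. One way to carry the labels back, sketched here on a small made-up data frame (the toy values below are an assumption for illustration, not the data from the question), is a lookup with match:

```r
# Toy data (illustrative only): several rows per factor
df <- data.frame(
  Factors = c("A", "B", "B", "C", "D", "D", "E", "E", "E", "F"),
  Values  = c(5, 15, 20, 25, 10, 20, 30, 25, 20, 20)
)

# Sum by factor, then cut the sums at the 0%, 33%, 66% and 100% quantiles
Sum.df <- aggregate(Values ~ Factors, data = df, FUN = sum)
q <- quantile(Sum.df$Values, probs = c(0, 0.33, 0.66, 1))
Sum.df$group <- cut(Sum.df$Values, q, include.lowest = TRUE,
                    labels = paste("Group", 1:3))

# Look up each row's factor in Sum.df and copy its group into df
df$group <- Sum.df$group[match(df$Factors, Sum.df$Factors)]
df
```

merge(df, Sum.df[c("Factors", "group")], by = "Factors") should give the same column, but it may reorder the rows, which is why match is used in this sketch.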

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y <- qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on column w and append the output column to data frame m, then:
m$w_n <- qnorm((rank(m$w, na.last = "keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
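The single step alluded to above could look like the following sketch, which assigns all transformed columns at once. It uses the data frame m from the question; the variable name cols and the choice to exclude only the id column are assumptions made for illustration.

```r
# Data from the question
m <- data.frame(
  id = 1:6,
  w = c(2, 18, 1, 52, 5, 3),
  y = c(5, 5, 25, 25, 5, 3),
  z = c(8, 98, 5, 8, 4, 5)
)

# Rank-based normal transformation from the question
transCols <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))

cols <- setdiff(names(m), "id")  # every column except the identifier
m[paste0(cols, "_n")] <- lapply(m[cols], transCols)
names(m)
```

With a larger data frame, cols can be replaced by any character vector of the column names to transform.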

Resources