I have the following dataframe and would prefer a dplyr solution. For each zone I want at least two values; values > 4.0 are preferred.
Therefore, for zone 10 all values (all being > 4.0) are kept. For zone 20, the top two values are picked, and similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows:
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n, but it picks the same number of rows for each zone.
You could dynamically calculate n in top_n:
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
We can also do so with filter:
library(tidyverse)
df %>%
  group_by(zone) %>%
  filter(row_number(-value) <= 2 | value > 4)
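On newer dplyr (>= 1.0.0), top_n is superseded by slice_max; a sketch of the same dynamic-n idea, rebuilding the example data from the printed values above and assuming (as in recent dplyr) that n is evaluated per group:

```r
library(dplyr)

# Rebuild the example data as printed above
zone  <- c(rep(10, 4), rep(20, 4), rep(30, 4))
value <- c(4.4, 4.3, 4.5, 5.2, 5.0, 2.9, 3.0, 3.1, 3.1, 3.0, 3.2, 3.0)
df <- data.frame(zone, value)

res <- df %>%
  group_by(zone) %>%
  # keep all values > 4, but never fewer than 2 per zone
  slice_max(value, n = max(sum(value > 4), 2)) %>%
  ungroup()
res
```

This returns all four rows for zone 10 and the top two for zones 20 and 30, matching the top_n output.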
I have a dataframe with a column that increases with every row and periodically (though not regularly) resets back to 1.
I'd like to track/count these resets in a separate column. This for-loop example does exactly what I want, but it is incredibly slow on large datasets. Is there a better/quicker/more idiomatic R way to do this same operation?
ColA <- seq(1, 20)
ColB <- rep(seq(1, 5), 4)
DF <- data.frame(ColA, ColB)
DF$ColC <- NA
DF[1, 'ColC'] <- 1
# Removing row 15 and changing row 5's ColB to 0.1, per comments in the answer
DF <- DF[-15, ]
DF[5, 2] <- 0.1
for (i in seq(1, nrow(DF) - 1)) {
  print(i)
  MyRow <- DF[i + 1, ]
  if (MyRow$ColB < DF[i, 'ColB']) {
    DF[i + 1, "ColC"] <- DF[i, "ColC"] + 1
  } else {
    DF[i + 1, "ColC"] <- DF[i, "ColC"]
  }
}
No need for a loop here. We can just use the vectorized cumsum. This ought to be faster:
DF$ColC <- cumsum(DF$ColB == 1)
DF
To handle varying reset values that are always lower than the previous value, count a reset whenever ColB drops, using cumsum(ColB < lag(ColB)):
library(dplyr)
DF %>% mutate(ColC = cumsum(ColB < lag(ColB, default = Inf)))
ColA ColB ColC
1 1 1.0 1
2 2 2.0 1
3 3 3.0 1
4 4 4.0 1
5 5 0.1 2
6 6 1.0 2
7 7 2.0 2
8 8 3.0 2
9 9 4.0 2
10 10 5.0 2
11 11 1.0 3
12 12 2.0 3
13 13 3.0 3
14 14 4.0 3
16 16 1.0 4
17 17 2.0 4
18 18 3.0 4
19 19 4.0 4
20 20 5.0 4
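The same reset counter can also be written in base R without any packages; a minimal sketch using diff in place of lag, on the same modified data:

```r
# Rebuild the example data with the modifications from the question
ColA <- seq(1, 20)
ColB <- rep(seq(1, 5), 4)
DF <- data.frame(ColA, ColB)
DF <- DF[-15, ]   # drop row 15
DF[5, 2] <- 0.1   # vary one reset value

# A reset is any row whose ColB is lower than the previous row's;
# cumsum over that logical vector numbers the stretches between resets.
DF$ColC <- cumsum(c(TRUE, diff(DF$ColB) < 0))
DF
```

The leading TRUE makes the counter start at 1 on the first row, matching the output above.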
So I have a dataframe (my.df) which I have grouped by the variable "strat". Each row consists of numerous variables. Example of what it looks like is below - I've simplified my.df for this example since it is quite large. What I want to do next is draw a simple random sample from each group. If I wanted to draw 5 observations from each group I would use this code:
new_df <- my.df %>% group_by(strat) %>% sample_n(5)
However, I have a different specified sample size that I want to sample for each group. I have these sample sizes in a vector nj.
nj <- c(3, 4, 2)
So ideally, I would want 3 observations from my first stratum, 4 observations from my second and 2 observations from my last stratum. I'm not sure how to sample by group using each group's own sample size (without having to write out "sample" however many times I need)? Thanks in advance!
my.df looks like:
var1 var2 strat
15 3 1
13 5 3
8 6 2
12 70 3
11 10 1
14 4 2
You can use stratified from my "splitstackshape" package.
Here's some sample data:
set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
var2 = runif(20),
strat = sample(3, 20, TRUE))
table(my.df$strat)
#
# 1 2 3
# 5 9 6
Here's how you can use stratified:
library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
# var1 var2 strat
# 1: 72 0.7942399 1
# 2: 39 0.1862176 1
# 3: 50 0.6684667 1
# 4: 21 0.2672207 2
# 5: 69 0.4935413 2
# 6: 91 0.1255551 2
# 7: 78 0.4112744 2
# 8: 7 0.3403490 3
# 9: 27 0.9347052 3
table(.Last.value$strat)
#
# 1 2 3
# 3 4 2
Since your sample data is too small to sample from, let us consider this example on the iris dataset:
library(tidyverse)
nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))
# A tibble: 14 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.6 3.1 1.5 0.2 setosa
2 4.4 3 1.3 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 6 2.7 5.1 1.6 versicolor
5 6.3 2.5 4.9 1.5 versicolor
6 5.8 2.6 4 1.2 versicolor
7 6.1 2.9 4.7 1.4 versicolor
8 5.8 2.7 4.1 1 versicolor
9 6.4 2.8 5.6 2.2 virginica
10 6.9 3.2 5.7 2.3 virginica
11 6.2 3.4 5.4 2.3 virginica
12 6.9 3.1 5.1 2.3 virginica
13 6.7 3 5.2 2.3 virginica
14 7.2 3.6 6.1 2.5 virginica
You can bring the nj values into the dataframe and then use sample_n by group.
library(dplyr)
my.df %>%
  mutate(nj = nj[strat]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))
Note that the above works because strat has the values 1, 2, 3. For a general solution, when the groups do not have such values, you could use:
my.df %>%
  mutate(nj = nj[match(strat, unique(strat))]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))
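On dplyr >= 1.0.0 the same idea can be written with a small lookup table and slice_sample; a sketch, where nj_tbl is a hypothetical helper mapping each stratum to its sample size (sizes shrunk here to fit the tiny example data), and n is assumed to be evaluated per group as in recent dplyr:

```r
library(dplyr)

# The small example data from the question
my.df <- data.frame(var1  = c(15, 13, 8, 12, 11, 14),
                    var2  = c(3, 5, 6, 70, 10, 4),
                    strat = c(1, 3, 2, 3, 1, 2))
# Hypothetical lookup table: one sample size per stratum
nj_tbl <- data.frame(strat = c(1, 2, 3), nj = c(2, 1, 2))

set.seed(1)
res <- my.df %>%
  left_join(nj_tbl, by = "strat") %>%
  group_by(strat) %>%
  slice_sample(n = first(nj)) %>%
  ungroup() %>%
  select(-nj)
res
```

The join attaches each group's size as a column, so no assumption about strat being 1, 2, 3 is needed.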
I am stuck on the question of how to sum consecutive duplicate odd rows and remove all but the first row of each run. I have found how to sum all consecutive duplicate rows and remove all but the first row (link: https://stackoverflow.com/a/32588960/11323232). But for this project, I would like to sum only the consecutive duplicate odd rows, not all of the consecutive duplicate rows.
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
The desired result is:
a
ia time
1 1 6.9
3 2 3.6
5 2 1.2
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
How should I edit the following code so that it sums only consecutive duplicate odd rows and removes all but the first row?
library(dplyr)
library(zoo)  # for na.locf

result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  # this part
  summarise(time = sum(time))
One dplyr possibility could be:
a %>%
  group_by(grp = with(rle(ia), rep(seq_along(lengths), lengths))) %>%
  mutate(grp2 = ia %/% 2 == 0,
         time = sum(time)) %>%
  filter(!grp2 | (grp2 & row_number() == 1)) %>%
  ungroup() %>%
  select(-grp, -grp2)
ia time
<dbl> <dbl>
1 1 6.9
2 2 3.6
3 2 1.2
4 1 20.4
5 2 7.3
6 1 2.3
7 2 4.3
You could try the following, using data.table:
library(data.table)
ia <- c(1,1,2,NA,2,1,1,1,1,2,1,2)
time <- c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a <- data.table(ia, time)
a[, sum(time), by=.(ia, rleid(!ia %% 2 == 0))]
This gives:
## ia rleid V1
##1: 1 1 6.9
##2: 2 2 3.6
##3: NA 3 1.5
##4: 2 4 1.2
##5: 1 5 20.4
##6: 2 6 7.3
##7: 1 7 2.3
##8: 2 8 4.3
I have a single data frame of 100 columns and 25 rows. I would like to cbind different groupings of columns (sometimes as many as 30 columns) into several new data frames without having to type out each column name every time.
Some columns that I want fall individually, e.g. 6 and 72, and some lie next to each other, e.g. columns 23, 24, 25, 26 (23:26).
Usually I would use:
z <- cbind(visco$fish, visco$bird)
for example, but I have too many columns and need to create too many new data frames to be typing the name of every column that I need every time. Generally I do not attach my data.
I would like to use column numbers, something like:
z <- cbind(6, 72, 23:26, data=visco)
and also retain the original column names, not the automatically generated V1, V2. I have tried adding deparse.level=2, but my column names then become "visco$fish" rather than the original "fish".
I feel there should be a simple answer to this, but so far I have failed to find anything that works as I would like.
df <- data.frame(AA = 11:15, BB = 2:6, CC = 12:16, DD = 3:7, EE = 23:27)
df
# AA BB CC DD EE
# 1 11 2 12 3 23
# 2 12 3 13 4 24
# 3 13 4 14 5 25
# 4 14 5 15 6 26
# 5 15 6 16 7 27
df1 <- data.frame(cbind(df,df,df,df))
df1
# AA BB CC DD EE AA.1 BB.1 CC.1 DD.1 EE.1 AA.2 BB.2 CC.2 DD.2 EE.2 AA.3 BB.3
# 1 11 2 12 3 23 11 2 12 3 23 11 2 12 3 23 11 2
# 2 12 3 13 4 24 12 3 13 4 24 12 3 13 4 24 12 3
# 3 13 4 14 5 25 13 4 14 5 25 13 4 14 5 25 13 4
# 4 14 5 15 6 26 14 5 15 6 26 14 5 15 6 26 14 5
# 5 15 6 16 7 27 15 6 16 7 27 15 6 16 7 27 15 6
# CC.3 DD.3 EE.3
# 1 12 3 23
# 2 13 4 24
# 3 14 5 25
# 4 15 6 26
# 5 16 7 27
Result <- data.frame(cbind(df1[,c(1:5,14:17,20)]))
Result
# AA BB CC DD EE DD.2 EE.2 AA.3 BB.3 EE.3
# 1 11 2 12 3 23 3 23 11 2 23
# 2 12 3 13 4 24 4 24 12 3 24
# 3 13 4 14 5 25 5 25 13 4 25
# 4 14 5 15 6 26 6 26 14 5 26
# 5 15 6 16 7 27 7 27 15 6 27
Note: columns with the same name are automatically suffixed .1, .2, etc. by R on each later appearance.
Here's an example of how to do this using the select function from dplyr, which should be your go-to package for this type of data wrangling:
> library(dplyr)
> df <- head(iris)
> df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
>
>## select by variable name
>newdf <- df %>% select(Sepal.Length, Sepal.Width,Species)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
>## select by variable indices
> newdf <- df %>% select(1:2,5)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
However, I'm not sure why you would need to do this? Can you not run your analyses on the original dataframe?
I understand your question as subsetting a large dataframe into smaller ones, which can be achieved in different ways. One way is the data.table package, which lets you retain the column names while subsetting by column index.
If you have your data as a data frame, you can just do:
DT <- data.table(df)
# You still have to define the subsets of columns you need to create
sub_1 <- c(2, 3)
sub_2 <- c(2:5, 9)
sub_3 <- c(1:2, 5:6, 10)
DT[, sub_2, with = FALSE]
Output
bird cat dog rat car
1: 0.2682538 0.1386834 0.01633384 0.5336649 0.43432878
2: 0.2418727 0.7530654 0.26999873 0.2679446 0.00859734
3: 0.1211858 0.2563736 0.92637523 0.8572615 0.63165705
4: 0.4556401 0.2343427 0.09324584 0.8731174 0.50098461
5: 0.1646126 0.9258622 0.86957980 0.3636781 0.89608415
Data
require("data.table")
DT <- data.table(matrix(runif(10*10),5,10))
colnames(DT) <- c("fish","bird","cat","dog","rat","tiger","insect","boat","car", "cycle")
Try this:
z <- visco[c(6,72,23:26)]
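Plain numeric indexing on a data frame keeps the original column names; a quick check on a small stand-in for visco (the real one has 100 columns):

```r
# Small stand-in data frame for the question's visco
visco <- data.frame(fish = 1:3, bird = 4:6, cat = 7:9, dog = 10:12)

z <- visco[c(1, 3:4)]   # same pattern as visco[c(6, 72, 23:26)]
names(z)
# [1] "fish" "cat"  "dog"
```

Unlike cbind on extracted vectors, subsetting the data frame itself never produces the generated V1, V2 names.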
In R we have vectors and matrices. You can create your own vectors with the function c.
c(1,5,3,4)
They are also the output of many functions such as
rnorm(10)
You can turn vectors into matrices using functions such as rbind, cbind or matrix.
Create the matrix from the vector 1:1000 like this:
X = matrix(1:1000,100,10)
What is the entry in row 25, column 3?
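Since matrix fills column-wise by default, the entry in row 25, column 3 sits two full columns (2 × 100 values) past the start:

```r
X <- matrix(1:1000, 100, 10)

# Column-major filling: entry (25, 3) is 25 + 2 * 100
X[25, 3]
# [1] 225
```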
I want to transform a dataset based on certain conditions. These conditions are given in another dataset. Let me explain it using an example.
Suppose I've a dataset in the following format:
Date Var1 Var2
3/1/2016 8 14
3/2/2016 7 8
3/3/2016 7 6
3/4/2016 10 8
3/5/2016 5 10
3/6/2016 9 15
3/7/2016 2 5
3/8/2016 6 14
3/9/2016 8 15
3/10/2016 8 8
And the following dataset has the transformation conditions and is in the following format:
Variable Trans1 Trans2
Var1 1||2 0.5||0.7
Var2 1||2 0.3||0.8
Now, I want to extract the first condition from the transformation table for Var1 (1 and 0.5): add 1 to Var1 and multiply the result by 0.5. I'll do the same for Var2: add 1 and multiply by 0.3. This transformation gives the new variables Var1_1 and Var2_1. Doing the same with the second condition gives Var1_2 and Var2_2; for Var1_2, the transformation is to add 2 to Var1 and multiply by 0.7.
After the transformation, the dataset will look like the following:
Date Var1 Var2 Var1_1 Var2_1 Var1_2 Var2_2
3/1/2016 8 14 4.5 4.5 7 12.8
3/2/2016 7 8 4 2.7 6.3 8
3/3/2016 7 6 4 2.1 6.3 6.4
3/4/2016 10 8 5.5 2.7 8.4 8
3/5/2016 5 10 3 3.3 4.9 9.6
3/6/2016 9 15 5 4.8 7.7 13.6
3/7/2016 2 5 1.5 1.8 2.8 5.6
3/8/2016 6 14 3.5 4.5 5.6 12.8
3/9/2016 8 15 4.5 4.8 7 13.6
3/10/2016 8 8 4.5 2.7 7 8
Given that your original data.frame is called df and your conditions table cond1, we can create a custom function:
funV1Cond1 <- function(x){
  t1 <- as.numeric(gsub("[||].*", "", cond1$Trans1[cond1$Variable == "Var1"]))
  t2 <- as.numeric(gsub("[||].*", "", cond1$Trans2[cond1$Variable == "Var1"]))
  result <- (x$Var1 + t1) * t2
  return(result)
}
funV1Cond1(df)
#[1] 4.5 4.0 4.0 5.5 3.0 5.0 1.5 3.5 4.5 4.5
The same way for the second condition:
funV1Cond2 <- function(x){
  t1 <- as.numeric(gsub(".*[||]", "", cond1$Trans1[cond1$Variable == "Var1"]))
  t2 <- as.numeric(gsub(".*[||]", "", cond1$Trans2[cond1$Variable == "Var1"]))
  result <- (x$Var1 + t1) * t2
  return(result)
}
funV1Cond2(df)
#[1] 7.0 6.3 6.3 8.4 4.9 7.7 2.8 5.6 7.0 7.0
Assuming the Trans1 column had 3 conditions, i.e. 1, 2, 3, then:
library(stringr)
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[2]))
#[1] 2
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[1]))
#[1] 1
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[3]))
#[1] 3
Note that I changed the delimiter to a ','.
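The per-variable functions above can be generalized; a base-R sketch that splits every condition on the original '||' delimiter and builds all transformed columns in one pass, with df and cond1 rebuilt here as described in the question:

```r
# Rebuild the question's data and conditions table
df <- data.frame(Var1 = c(8, 7, 7, 10, 5, 9, 2, 6, 8, 8),
                 Var2 = c(14, 8, 6, 8, 10, 15, 5, 14, 15, 8))
cond1 <- data.frame(Variable = c("Var1", "Var2"),
                    Trans1   = c("1||2", "1||2"),
                    Trans2   = c("0.5||0.7", "0.3||0.8"),
                    stringsAsFactors = FALSE)

for (i in seq_len(nrow(cond1))) {
  v   <- cond1$Variable[i]
  # split "1||2" -> c(1, 2) and "0.5||0.7" -> c(0.5, 0.7)
  add <- as.numeric(strsplit(cond1$Trans1[i], "||", fixed = TRUE)[[1]])
  mul <- as.numeric(strsplit(cond1$Trans2[i], "||", fixed = TRUE)[[1]])
  for (j in seq_along(add)) {
    # e.g. Var1_1 = (Var1 + 1) * 0.5, Var1_2 = (Var1 + 2) * 0.7
    df[[paste0(v, "_", j)]] <- (df[[v]] + add[j]) * mul[j]
  }
}
head(df)
```

This handles any number of '||'-separated conditions per variable without writing one function per variable and condition.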