I have a data frame containing fish population sampling data. I would like to create bins to count how many fish are in a given length group for each species.
The code below accomplishes this for two species, but repeating it for every species in the data frame doesn't seem like the most elegant way to achieve this goal.
I would also like to apply this code to other lakes with different species, so it would be great to find an "automated" way to apply these bins to each species group in the data frame.
The data frame looks like:
Species TL WT
BLG 75 6
BLG 118 27
LMB 200 98
LMB 315 369
RBS 112 23
RES 165 73
SPB 376 725
YEP 155 33
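For reference, here is the sample above as a small in-memory data frame (plus the dplyr load the pipelines below assume), so the code can be run without the CSV:
library(dplyr)
ss <- data.frame(
  Species = c("BLG", "BLG", "LMB", "LMB", "RBS", "RES", "SPB", "YEP"),
  TL      = c(75, 118, 200, 315, 112, 165, 376, 155),
  WT      = c(6, 27, 98, 369, 23, 73, 725, 33)
)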
ss = read.csv("SS_West Point.csv" , na.strings="." , header=T)
blg = ss %>% subset(Species == "BLG")
lmb = ss %>% subset(Species == "LMB")
blgn = blg %>% summarise(n = n())
lmbn = lmb %>% summarise(n = n())
### 20mm Length Groups - BLG ###
blg20 = blg %>% group_by(gr=cut(TL , breaks = seq(0 , 1000 , by = 20))) %>%
summarise(n = n()) %>% mutate(freq = n , percent = ((n/blgn$n)*100) ,
cumfreq = cumsum(freq) , cumpercent = cumsum(percent))
### 20mm Length Groups - LMB ###
lmb20 = lmb %>% group_by(gr=cut(TL , breaks = seq(0 , 1000 , by = 20))) %>%
summarise(n = n()) %>% mutate(freq = n , percent = ((n/lmbn$n)*100) ,
cumfreq = cumsum(freq) , cumpercent = cumsum(percent))
I've successfully used do() to run linear models on this data frame but can't seem to get it to work on cut(). Here is how I used do() on lm():
ssl = ss %>% mutate(lTL = log10(TL) , lWT = log10(WT)) %>% group_by(Species)
m = ssl %>% do(broom::augment(lm(lWT ~ lTL, data = .))) %>% mutate(wp = 10^(.fitted))
Does this do what you expect?
ss20 <- ss %>%
add_count(Species) %>%
rename(Species_count = n) %>%
# I added Species_count to the grouping so it goes along for the ride in summarization
group_by(Species, Species_count, gr=cut(TL , breaks = seq(0 , 1000 , by = 20))) %>%
summarise(n = n()) %>%
mutate(freq = n, percent = ((n/Species_count)*100),
cumfreq = cumsum(freq) , cumpercent = cumsum(percent)) %>%
ungroup()
> ss20
# A tibble: 8 x 8
Species Species_count gr n freq percent cumfreq cumpercent
<chr> <int> <fct> <int> <int> <dbl> <int> <dbl>
1 BLG 2 (60,80] 1 1 50 1 50
2 BLG 2 (100,120] 1 1 50 2 100
3 LMB 2 (180,200] 1 1 50 1 50
4 LMB 2 (300,320] 1 1 50 2 100
5 RBS 1 (100,120] 1 1 100 1 100
6 RES 1 (160,180] 1 1 100 1 100
7 SPB 1 (360,380] 1 1 100 1 100
8 YEP 1 (140,160] 1 1 100 1 100
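If that works, here is a sketch of wrapping the same pipeline in a function so it can be reused for other lakes and bin widths (the function name and arguments are illustrative choices, not from the original code):
library(dplyr)

length_freq <- function(df, bin_width = 20, max_len = 1000) {
  df %>%
    add_count(Species) %>%
    rename(Species_count = n) %>%
    group_by(Species, Species_count,
             gr = cut(TL, breaks = seq(0, max_len, by = bin_width))) %>%
    summarise(n = n()) %>%
    mutate(freq = n,
           percent = 100 * n / Species_count,
           cumfreq = cumsum(freq),
           cumpercent = cumsum(percent)) %>%
    ungroup()
}

ss20 <- length_freq(ss, bin_width = 20)
# another_lake20 <- length_freq(another_lake_df, bin_width = 20)  # hypothetical second data frame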
Edit: I found the solution with na.locf().
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(weight.io = na.locf(weight)) %>%
mutate(lag_weight = weight - lag(weight.io))
I have a dataset below.
set.seed(42000)
data <- data_frame(
country = sample(letters[1:20], size = 100, replace = TRUE),
weight = round(runif(100, min = 48, max = 90)))
data <- data %>%
group_by(country) %>%
arrange(weight) %>%
mutate(wave = seq_along(weight))
n_rows <- nrow(data)
perc_missing <- 10
data[sample(1:n_rows, sample(1:n_rows, round(perc_missing/100 * n_rows, 0))), c("weight")] <- NA
I would like to obtain the difference between one country's current "weight" and the last observed "weight" for each wave.
For country "a", wave 5, I want the value to be 69 - 65 (the last observed weight at wave < 5).
And for wave 8, 82 (weight at wave 8) - 69 (weight at wave 5).
My approach was the one below, but it didn't work.
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(lag_weight = weight - lag(weight, default = first(weight, na.rm = TRUE)))
Thank you!
I think this is a combination of diff (instead of lag, though that could work just as well) and, more importantly, tidyr::fill (or zoo::na.locf, not demonstrated):
BTW, na.rm= is not an argument of first(), so I've removed it.
library(dplyr)
# library(tidyr) # fill
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = weight - lag(weight, default = first(weight)))
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a 65 4 0
# 5 a 69 5 4
# 6 a 69 6 0
# 7 a 69 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
The issue here is that weight is overwritten with the LOCF (last observation carried forward) value instead of preserving the NA values. If preserving them matters, you can make another weight variable for temporary use (and remove it afterwards):
data %>%
mutate(tmpweight = weight) %>%
group_by(country) %>%
tidyr::fill(tmpweight) %>%
filter(country == "a") %>%
mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
select(-tmpweight)
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a NA 4 0
# 5 a 69 5 4
# 6 a NA 6 0
# 7 a NA 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
FYI, you can use c(0, diff(weight)) instead of weight - lag(weight) for the same effect. Since diff() returns a vector one element shorter (it gives the gap between consecutive values), we prepend a 0 here:
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = c(0, diff(weight)))
(The filter(country == "a") is purely for demonstration to match your example, not that it is required for this solution.)
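For completeness, a minimal sketch of the zoo::na.locf() variant mentioned above, assuming zoo is installed; na.rm = FALSE keeps leading NAs in place so the column length is preserved, and the lag_weight values should match the tidyr::fill() version:
library(dplyr)
library(zoo)   # na.locf

data %>%
  group_by(country) %>%
  arrange(wave) %>%
  mutate(weight_io  = na.locf(weight, na.rm = FALSE),
         lag_weight = weight_io - lag(weight_io, default = first(weight_io))) %>%
  select(-weight_io)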
In my dataset, I have several variables like this:
Hypertension = 1,0,1,1,1,1,0,1
Diabetes = 1,1,0,0,1,1,0,1
Other NCD = 1,1,0,0,0,0,1,1
Here, 1 = Yes and 0 = No.
Now I want to combine all of the "yes" responses from the above variables and create a multiple response table like this:
SPSS has a function called "Multiple Response". This image is one of the outputs of this function.
How do I create this table?
Thanks in advance.
Please try this.
dat <- data.frame(
Hypertension = c(1,0,1,1,1,1,0,1),
Diabetes = c(1,1,0,0,1,1,0,1),
`Other NCD` = c(1,1,0,0,0,0,1,1),
check.names = FALSE
)
library(dplyr)
library(tidyr) # pivot_longer
dat %>%
tidyr::pivot_longer(everything(), names_to="k", values_to="v") %>%
group_by(k) %>%
summarize(
n = n(),
cases = sum(v),
percent = 100 * cases / n()
) %>%
ungroup() %>%
mutate(overall = 100 * cases / sum(n))
# # A tibble: 3 x 5
# k n cases percent overall
# <chr> <int> <dbl> <dbl> <dbl>
# 1 Diabetes 8 5 62.5 20.8
# 2 Hypertension 8 6 75 25
# 3 Other NCD 8 4 50 16.7
With base R, we can do
dat1 <- transform(stack(colSums(dat)), n = nrow(dat))
dat1$percent <- 100 *dat1$values/dat1$n
dat1$overall <- round(100 * dat1$values/sum(dat1$n), 2)
data
dat <- data.frame(
Hypertension = c(1,0,1,1,1,1,0,1),
Diabetes = c(1,1,0,0,1,1,0,1),
`Other NCD` = c(1,1,0,0,0,0,1,1),
check.names = FALSE
)
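If I've followed the base R version correctly, dat1 should come out matching the dplyr result above, something like:
dat1
#   values          ind n percent overall
# 1      6 Hypertension 8    75.0   25.00
# 2      5     Diabetes 8    62.5   20.83
# 3      4    Other NCD 8    50.0   16.67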
I have the DF1:
KEY <- c(11,12,22,33,44,55,66,77,88,99,1010,1111,1212,1313,1414,1515,1616,1717,1818,1919,2020)
PRICE <- c(0,0,1,5,7,10,20,80,110,111,200,1000,2500,2799,3215,4999,7896,8968,58914,78422,96352)
DF1 <- data.frame(KEY,PRICE)
I want to group DF1 into ranges, accumulating the values of the two columns (count the KEY column and sum the PRICE column). This is the result I hope for:
INTERVAL <-c('0','UP_TO_10','UP_TO_100','UP_TO_1000','UP_TO_5000','UP_TO_10000','UP_TO_100000')
COUNT_KEY <-c(2,6,8,12,16,18,21)
SUM_PRICE <- c(0,23,123,1544,15057,31921,265609)
DF2 <- data.frame(INTERVAL,COUNT_KEY,SUM_PRICE)
How do I make this table?
If you have a vector of limits or thresholds, such as:
LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
You could obtain a count of rows where PRICE is at or below each limit:
unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
[1] 2 6 8 12 16 18 21
And a sum of these prices as well:
unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
[1] 0 23 123 1544 15057 31921 265609
Is this what you had in mind?
This is everything all together:
LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
COUNT_KEY <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
SUM_PRICE <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
data.frame(INTERVAL = c(0, paste("UP_TO", LIMITS[-1], sep="_")), COUNT_KEY, SUM_PRICE)
INTERVAL COUNT_KEY SUM_PRICE
1 0 2 0
2 UP_TO_10 6 23
3 UP_TO_100 8 123
4 UP_TO_1000 12 1544
5 UP_TO_5000 16 15057
6 UP_TO_10000 18 31921
7 UP_TO_100000 21 265609
You have to manually define your boundaries first:
X = c(-Inf,0,10,100,1000,5000,10000,100000)
Then you use cut() to assign entries to your labels. We first summarize the counts and total price within each interval.
library(dplyr)
DF1 %>%
  mutate(LABELS = cut(PRICE, X, INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE))
# A tibble: 7 x 3
LABELS COUNT_KEY SUM_PRICE
<fct> <int> <dbl>
1 0 2 0
2 UP_TO_10 4 23
3 UP_TO_100 2 100
4 UP_TO_1000 4 1421
5 UP_TO_5000 4 13513
6 UP_TO_10000 2 16864
7 UP_TO_100000 3 233688
This is close to what you want, except that the counts and SUM_PRICE should be cumulative. That can be achieved with mutate_if(is.numeric, cumsum):
DF1 %>%
  mutate(LABELS = cut(PRICE, X, INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE)) %>%
  mutate_if(is.numeric, cumsum)
To give:
# A tibble: 7 x 3
LABELS COUNT_KEY SUM_PRICE
<fct> <int> <dbl>
1 0 2 0
2 UP_TO_10 6 23
3 UP_TO_100 8 123
4 UP_TO_1000 12 1544
5 UP_TO_5000 16 15057
6 UP_TO_10000 18 31921
7 UP_TO_100000 21 265609
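Side note: mutate_if() is superseded in current dplyr (1.0+); the same cumulative step can be written with across(), which should be equivalent here (X and INTERVAL as defined above):
library(dplyr)

DF1 %>%
  mutate(LABELS = cut(PRICE, X, INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE)) %>%
  mutate(across(where(is.numeric), cumsum))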
Okay, here's an all-in-one, tidy way to handle this using dplyr ;)
library(dplyr)
DF1 %>%
mutate(
INTERVAL =
factor(
case_when( # create discrete variable
PRICE == 0 ~ '0',
PRICE <= 10 ~ 'UP_TO_10',
PRICE <= 100 ~ 'UP_TO_100',
PRICE <= 1000 ~ 'UP_TO_1000',
PRICE <= 5000 ~ 'UP_TO_5000',
PRICE <= 10000 ~ 'UP_TO_10000',
PRICE <= 100000 ~ 'UP_TO_100000'
),
levels = # set the factor levels
c(
'0',
'UP_TO_10',
'UP_TO_100',
'UP_TO_1000',
'UP_TO_5000',
'UP_TO_10000',
'UP_TO_100000'
)
)
) %>%
group_by(INTERVAL) %>% # create desired group
summarise( # and summary variables
COUNT_KEY = n(),
SUM_PRICE = sum(PRICE)
) %>%
mutate( # cumulative totals
COUNT_KEY_CUM = cumsum(COUNT_KEY),
SUM_PRICE_CUM = cumsum(SUM_PRICE)
)
Is there a way to pause a series of pipes to store a temporary variable that can be used later on in pipe sequence?
I found this question but I'm not sure that it was doing the same thing I am looking for.
Here's a sample dataframe:
library(dplyr)
set.seed(123)
df <- tibble(Grp = c("Apple","Boy","Cat","Dog","Edgar","Apple","Boy","Cat","Dog","Edgar"),
a = sample(0:9, 10, replace = T),
b = sample(0:9, 10, replace = T),
c = sample(0:9, 10, replace = T),
d = sample(0:9, 10, replace = T),
e = sample(0:9, 10, replace = T),
f = sample(0:9, 10, replace = T),
g = sample(0:9, 10, replace = T))
I am going to convert df to long format, but after doing so I will need to use the number of rows from before the gather.
This is what my desired output looks like. In this case, storing the number of rows before the pipe begins would look like:
n <- nrow(df)
df %>%
gather(var, value, -Grp) %>%
mutate(newval = value * n)
# A tibble: 70 x 4
Grp var value newval
<chr> <chr> <int> <int>
1 Apple a 2 20
2 Boy a 7 70
3 Cat a 4 40
4 Dog a 8 80
5 Edgar a 9 90
6 Apple a 0 0
7 Boy a 5 50
8 Cat a 8 80
9 Dog a 5 50
10 Edgar a 4 40
# ... with 60 more rows
In my real world problem, I have a long chain of pipes and it would be a lot easier if I could perform this action within the pipe structure. I would like to do something that looks like this:
df %>%
{ "n = nrow(.)" } %>% # temporary variable is created here but df is passed on
gather(var, value, -Grp) %>%
mutate(newval = value * n)
I could do something like the following, but it seems really sloppy.
df %>%
mutate(n = nrow(.)) %>%
gather(var, value, -Grp, -n) %>%
mutate(newval = value * mean(n))
Is there a way to do this or perhaps a good workaround?
You could use a code block for a local variable. This would look like
df %>%
{ n = nrow(.)
gather(., var, value, -Grp) %>%
mutate(newval = value * n)
}
Notice how we have to pass the . to gather explicitly here, and the pipe continues inside the block. But you could add further steps afterwards:
df %>%
{ n = nrow(.)
gather(., var, value, -Grp) %>%
mutate(newval = value * n)
} %>%
select(newval)
Here is an option with %>>% (the pipe operator from pipeR):
library(pipeR)
library(dplyr)
library(tidyr)
df %>>%
(~ n = nrow(.)) %>%
gather(., var, value, -Grp) %>%
mutate(newval = value * n)
# A tibble: 70 x 4
# Grp var value newval
# <chr> <chr> <int> <int>
# 1 Apple a 2 20
# 2 Boy a 7 70
# 3 Cat a 4 40
# 4 Dog a 8 80
# 5 Edgar a 9 90
# 6 Apple a 0 0
# 7 Boy a 5 50
# 8 Cat a 8 80
# 9 Dog a 5 50
#10 Edgar a 4 40
# … with 60 more rows
I have a data table called prob72 and I want to add a column for rank, ranking each row by frac_miss_arr_delay. The highest value of frac_miss_arr_delay should get rank 1 and the lowest value should get the highest rank (for my data that is rank 53). The frac_miss_arr_delay values are all decimals less than 1. When I use the following line of code, it ranks every single row as "1":
prob72<- prob72 %>% mutate(rank=rank(desc(frac_miss_arr_delay), ties.method = "first"))
I've tried using row_number as well
prob72<- prob72 %>% mutate(rank=row_number())
This STILL outputs all "1s" in the rank column.
week arrDelayIsMissi~ n n_total frac_miss_arr_d~
<dbl> <lgl> <int> <int> <dbl>
1 6. TRUE 1012 6101 0.166
2 26. TRUE 536 6673 0.0803
3 10. TRUE 518 6549 0.0791
4 50. TRUE 435 6371 0.0683
5 49. TRUE 404 6398 0.0631
6 21. TRUE 349 6285 0.0555
prob72[6]
# A tibble: 53 x 1
rank
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
# ... with 43 more rows
flights_week = mutate(flights, week=lubridate::week(time_hour))
prob51<-flights_week %>%
mutate(pos_arr_delay=if_else(arr_delay<0,0,arr_delay))
prob52<-prob51 %>% group_by(week) %>% mutate(avgDelay =
mean(pos_arr_delay,na.rm=T))
prob52 <- prob52 %>% mutate(ridic_late=TRUE)
prob52$ridic_late<- ifelse(prob52$pos_arr_delay>prob52$avgDelay*10,TRUE, FALSE)
prob53<- prob52 %>% group_by(week) %>% count(ridic_late) %>% arrange(desc(ridic_late))
prob53<-prob53 %>% filter(ridic_late==TRUE)
prob54<- prob52 %>% group_by(week) %>% count(n())
colnames(prob53)[3] <- "n_ridiculously_late"
prob53["n"] <- NA
prob53$n <- prob54$n
table5 = subset(prob53, select=c(week,n, n_ridiculously_late))
prob71 <- flights_week
prob72 <- prob71 %>% group_by(week) %>% count(arrDelayIsMissing=is.na(arr_delay)) %>% arrange(desc(arrDelayIsMissing)) %>% filter(arrDelayIsMissing==TRUE)
prob72["n_total"] <- NA
prob72$n_total<- table5$n
prob72<-prob72 %>% mutate(percentageMissing = n/n_total)
prob72<-prob72 %>% arrange(desc(percentageMissing))
colnames(prob72)[5]="frac_miss_arr_delay"
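For what it's worth, a likely cause: count() keeps the group_by(week) grouping, so prob72 is still grouped by week with one row per group, and rank()/row_number() are then computed within each one-row group, giving all 1s. A minimal sketch of a fix (ungroup before ranking; not tested against your data):
prob72 <- prob72 %>%
  ungroup() %>%
  mutate(rank = rank(desc(frac_miss_arr_delay), ties.method = "first"))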