How to select consecutive measurement cycles - r

I am working with a dataset that contains variables measured from permanent plots. These plots are continuously remeasured every couple of years. The data sort of looks like the table at the bottom. I used the following code to separate the dataset to slice the initial measurement at t1. Now, I want to slice t2 which is the remeasurement that is one step greater than the minimum_Cycle or minimum_Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2) and the measured_year intervals and cycle intervals are different.
I would really appreciate the help. I have stuck on this for quite sometime now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3

Create a t variable for yourself based on the Cycle order:
df_Time1 %>%
group_by(State, County, Plot) %>%
mutate(t = order(Cycle))
You can then filter on t == 1 or t == 2, etc.

Related

How to sample from smaller data frame with multiple conditions to a larger data frame?

I have a main dataset df.main with 3 sites and each site has 3 subsites, that were samlpled over three months. I have a separate dataset with some abiotic variables ONLY for a single month df.sample. But for each site, I have three values from the sub-sites. In my original dataset, I need to add the abitoic column. However, for every month, for each sub-site I only want to SAMPLE with replacement from one of the three samples from the sub.site.
set.seed(111)
##Main Data Set
month <- rep(c("Jan","Feb","Mar"), each =9 )
site <- rep(c("1","2","3","1","2","3","1","2","3"), each = 3)
sub.site <- rep(c(1,2,3,1,2,3,1,2,3), time = 3 )
df.main <- data.frame(month, site, sub.site)
month site sub.site
Jan 1 1
Jan 1 2
Jan 1 3
Jan 2 1
Jan 2 2
... .. ..
Mar 3 3
##Sampler Data Set
site <- rep(c(1,2,3), time = 9)
sub.site <- rep(c(1,1,1,2,2,2,3,3,3), each = 3)
abiotic <- rnorm(27,7,1)
df.sample <- data.frame(site, sub.site, abiotic)
site sub.site abiotic
1 1 7.175096
1 1 8.805868
1 1 6.783571
1 2 7.910917
1 2 7.202307
1 2 8.404883
...
2 1 7.122915
2 1 6.152732
...
3 1 7.978232
3 1 6.870228
##Desired Output in df.main
month site sub.site abiotic
Jan 1 1 8.805868
Jan 1 2 7.910917
You can do a full join of the two tables using site and sub.site, and then just sample one row from each month, site and sub.site combination.
If you are unsure about table joining (full join, left join, etc.), you may want to look that up online. It is very simple and standard in, say querying database.
In your case, after the full joining, because you have 9 unique combination of site and sub.site, you will have 81 rows:
joining.output <- df.main %>%
full_join(df.sample, by = c("site", "sub.site"))
> joining.output
month site sub.site abiotic
1 Jan 1 1 7.235221
2 Jan 1 1 4.697654
3 Jan 1 1 5.502573
...
28 Feb 1 1 7.235221
29 Feb 1 1 4.697654
30 Feb 1 1 5.502573
...
55 Mar 1 1 7.235221
56 Mar 1 1 4.697654
57 Mar 1 1 5.502573
Then to sample 1 row for each site and sub.site combination for each month, just group by the 3 variables and sample.
Here is the code that puts everything together:
output <- df.main %>%
full_join(df.sample, by = c("site", "sub.site")) %>%
group_by(month, site, sub.site) %>%
slice_sample(n=1)
p.s. in your example, df.main$sub.site is a character array but df.sample$sub.site is a numeric array. You may want to convert the character array to numeric using as.double() function before joining.

How to summarize events prior to a specific event (that can happen multiple times) across multiple observations in r?

I'm trying to collect data on what events have happened prior to a specific event (i.e. bDragons)which can be recurring based on the full observation. These are just an excerpt of one observation where a dragon is taken more than once, and I want to be able to pull insights on each and every one over many observations. So in the data set below, I would want to know that only 1 outer turret was taken prior to the first dragon at Time == 12.891. The next is taken at 20.215, which 4 towers and a drake before it.
ID TeamObj Time Type Lane League Year Season bResult rResult gamelength Gold
1 1 bTowers 9.397 OUTER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
2 1 bDragons 12.891 AIR_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
3 1 bTowers 16.215 OUTER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
4 1 bTowers 16.591 INNER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
5 1 bTowers 19.830 OUTER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
6 1 bDragons 20.215 EARTH_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
7 1 bBarons 22.512 BARON_NASHOR <NA> CBLoL 2017 Summer 1 0 34 NA
8 1 bTowers 23.962 INNER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
9 1 bTowers 24.707 INNER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
10 1 bTowers 24.962 BASE_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
I'd want this for every TeamObj of that type but the issue comes up where I try to group_by address and filter by (Time <= which(Team == bDragons)and the wrong things get filtered out or I can't summarize based on that count(Type) or anything. I'm looking for help on recording some type of recurring function or a better way to record and summarize that. Looking to fit the observations into a linear model later on, but I can't get to that square one which causes the issue.
Am I thinking about my filter incorrectly? My summarize? tst3 %>% group_by(ID) %>% filter(Time <= which(Team == "bDragons")) %>% summarize(count(Type))
Something like:
ID dragonID dragonType Time Baron_Nashor Base_Turret Inner_Turret Nexus_Turret Outer_Turret
1 1 AIR_DRAGON 12.891 N/A N/A N/A N/A 1
2 2 EARTH_DRAGON 20.215 N/A N/A 1 N/A 3
and so on, if that is clear. Want to be able to use each as an observation.
How about the following
tst3 %>%
group_by(ID) %>%
# arrange(Time) %>% # uncomment if needed
mutate(
Type = factor(Type),
dragonID = cumsum(dplyr::lag(TeamObj == 'bDragons', default = 1))) %>%
group_by(ID, dragonID) %>%
summarize(
dragonType = last(Type),
Time = last(Time),
tmp = list(as.data.frame(table(Type)))) %>%
unnest() %>%
spread(Type, Freq, fill = 0) %>%
# select(-ends_with("DRAGON")) %>%
group_by(ID) %>%
mutate_at(vars(BARON_NASHOR:OUTER_TURRET), cumsum) %>%
filter(str_detect( dragonType, "DRAGON"))

How to express a variable as a function of 2 others in a dataframe composed of 3 vectors

I know it is fundamental but I can't find the trick ...
Here is an exemple :
Species <- c("dark frog",rep(c("elephant","tiger","boa"),3),"black mamba")
Year <- c(rep(2011,4),rep(2012,3),rep(2013,4))
Abundance <- c(2,4,5,6,9,2,1,5,6,8,4)
df <- data.frame(Species, Year, Abundance)
I would like to obtain another dataframe (3 rows *5 columns) with the abundance values in function of the species as the column names (each species appearing thus only one time) and the years as the row names (appearing one time also).
May someone help me please ?
You mean something like this?
> xtabs(Abundance~Year+Species, data=df)
Species
Year black mamba boa dark frog elephant tiger
2011 0 6 2 4 5
2012 0 1 0 9 2
2013 4 8 0 5 6
The class for the above is a table, so if you prefer a data.frame instead, you can try:
library(tidyr)
new.df<- spread(df, key = Species, value = Abundance)
Year black mamba boa dark frog elephant tiger
1 2011 NA 6 2 4 5
2 2012 NA 1 NA 9 2
3 2013 4 8 NA 5 6
If you want 0s instead of NA add the following line:
new.df[is.na(new.df)]<- 0

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, there many users who are the same, and first we should subset by month and make a frequency table of the users and the amount of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset, this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in month of July (month = 7), users have taken multiple trips. Now the important part - which is to subset only the top 10% of these users (the top 10% in terms of Freq)
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried
output <- full_data %>%
+ group_by(month) %>%
+ summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look as the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is on of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.

How to calculate top rows from a large data set

I have a dataset in which there are following columns: flavor, flavorid and unitSoled.
Flavor Flavorid unitsoled
beans 350 6
creamy 460 2
.
.
.
I want to find top ten flavors and then calculate market share for each flavor. My logic is market share for each flavor = units soled for particular flavor divided by total units soled.
How do I implement this. For output I just want two col Flavorid and corresponding market share. Do I need to save top ten flavors in some table first?
One way is with the dplyr package:
An example data set:
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
select(flavorid,unitsold) %>% #select the columns you want
group_by(flavorid) %>% #group by flavorid
summarise(total=sum(unitsold)) %>% #sum the total units sold per id
mutate(marketshare=total/sum(total)) %>% #calculate the market share per id
arrange( desc(marketshare)) %>% #order by marketshare descending
head(10) #pick the 10 first
#and you can add another select(flavorid,marketshare) if you only want those two
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281

Resources