How can I extract the unique values in one column conditional on a value in another and make a new data frame with the output? - r

I would like to extract the number of camera trap nights (CTN, one column in my data frame) per camera trap station (another column) so I can work out relative abundance indices for each camera station. For example, Station 1 has had 5 triggers/events (of the same species) and 30 CTN, so it is listed in my database 5 times (it has 5 rows). I want to extract the unique CTN for Station 1 and likewise for all the other stations in the data frame.
Data frame:
EventID CameraStation CTN
001 Station 1 30
002 Station 1 30
003 Station 1 30
004 Station 1 30
005 Station 2 29
006 Station 2 29
007 Station 2 29
008 Station 2 29
009 Station 2 29
010 Station 3 31
011 Station 3 31
I have tried to use 'unique' and 'with' but do not get the result I want.
with(unique(rai.PS[c("CameraStation", "CTN")]), table(CameraStation))
I expect to get the following result:
CameraStation CTN
Station 1 30
Station 2 29
Station 3 31
That is, each station is listed only once, together with its CTN, in a new data frame.
But instead I get:
CameraStation
Station 1 Station 2 Station 3
        1         1         1
I assume it is giving me each unique station once, without reporting the CTN.
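The call to table() is what collapses the result: after unique() there is exactly one row per station, so counting rows per station gives 1 everywhere. A minimal sketch of one way to get the expected data frame is to stop at unique(). The rai.PS object below is rebuilt from the sample rows in the question, so its column types are an assumption.

```r
# Rebuild the sample data from the question (column types are assumed).
rai.PS <- data.frame(
  EventID       = sprintf("%03d", 1:11),
  CameraStation = rep(c("Station 1", "Station 2", "Station 3"), times = c(4, 5, 2)),
  CTN           = rep(c(30, 29, 31), times = c(4, 5, 2)),
  stringsAsFactors = FALSE
)

# Keep one row per distinct CameraStation/CTN combination.
station_ctn <- unique(rai.PS[c("CameraStation", "CTN")])

# Reset the row names left over from the original row positions.
rownames(station_ctn) <- NULL

station_ctn
```

If a station ever appears with two different CTN values, it will show up twice in station_ctn, which can serve as a quick consistency check on the data.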

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured on permanent plots. These plots are remeasured every couple of years. The data looks roughly like the table at the bottom. I used the following code to slice out the initial measurement at t1. Now I want to slice out t2, the remeasurement that is one step after the minimum Cycle or minimum Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the rank of Cycle within each plot. Start from the full df, because df_Time1 already holds only one row per plot:
df %>%
group_by(State, County, Plot) %>%
mutate(t = rank(Cycle))
You can then filter on t == 1 or t == 2, etc.
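As a minimal sketch of that idea on the sample rows from the question, the code below numbers the measurements within each plot and keeps the second one. It uses dplyr's min_rank() rather than order(), because order() returns sorting positions and not ranks. The names df_ranked and df_Time2 are illustrative only.

```r
library(dplyr)

# Sample rows from the question.
df <- data.frame(
  State         = c(1, 2, 1, 2, 2, 2, 2, 2),
  County        = rep(1, 8),
  Plot          = c(1, 2, 1, 1, 1, 2, 2, 1),
  Measured_year = c(2006, 2002, 2009, 2005, 2010, 2013, 2021, 2019),
  basal_area    = seq(10, 80, by = 10),
  tph           = seq(10, 80, by = 10),
  Cycle         = c(8, 7, 9, 6, 8, 10, 12, 13),
  num_obs       = c(2, 3, 2, 3, 3, 2, 3, 3)
)

# Number the measurements of each plot by increasing Cycle.
df_ranked <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = min_rank(Cycle)) %>%
  ungroup()

# Second measurement of every plot, whatever the gap between cycles.
df_Time2 <- df_ranked %>% filter(t == 2)
df_Time2
```

Because the rank depends only on the ordering of Cycle within a plot, uneven cycle or year intervals do not matter.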

How to view all rows of an output that's not in table form

Problem
I started with an ungrouped data set, which I then grouped. The printed output of the grouping, however, does not show all 427 rows. I need the full output so that I can put the data into a table.
So initially the data was ungrouped and appears as follows:
Occupation Education Age Died
1 household Secondary 39 no
2 farming primary 83 yes
3 farming primary 60 yes
4 farming primary 73 yes
5 farming Secondary 51 no
6 farming iliterate 62 yes
The data is then grouped as follows:
occu %>% group_by(Occupation, Died, Age) %>% count() # group by the occupation of the suicide victims
which gives the following output:
Occupation Died Age n
<fct> <fct> <int> <int>
1 business/service no 20 2
2 business/service no 30 1
3 business/service no 31 2
4 business/service no 34 1
5 business/service no 36 2
6 business/service no 41 1
7 business/service no 44 1
8 business/service no 46 1
9 business/service no 84 1
10 business/service yes 21 1
# ... with 417 more rows
The problem is that I need all the rows in order to read the grouped data into a table using:
dt <- read.table(text="full output from above")
If any more code would be useful for solving this, let me know.
It is not really clear what you want, but try the following code:
dt <- occu %>% group_by(Occupation, Died, Age) %>% count() %>% as.data.frame()
It seems you simply want to convert the tibble to a data frame. There is no need to print all the output and then copy-paste it into read.table().
If you need to, you can also save your output with write.table(dt, "filename.txt"), which creates a .txt file with your data.
If what you really want is to print the whole tibble in the console, you can use the following code, as suggested by Akrun's link:
options(dplyr.print_max = 1e9)
This lets R print the full tibble to the console, although I do not think that is an efficient way to do what you are asking.
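Both routes can be sketched on made-up data: printing every row of a tibble for one call only with print(n = Inf), and converting the counts to a plain data frame. The occu tibble below is invented for illustration and merely has the same column names as the question's data.

```r
library(dplyr)
library(tibble)

# Invented stand-in for the question's data (30 distinct groups).
occu <- tibble(
  Occupation = rep(c("farming", "household"), each = 15),
  Died       = rep(c("yes", "no"), times = 15),
  Age        = rep(20:34, times = 2)
)

# One row per Occupation/Died/Age combination with its count.
counts <- occu %>% count(Occupation, Died, Age)

# Print every row of the tibble without changing global options.
print(counts, n = Inf)

# A plain data frame is always printed in full.
dt <- as.data.frame(counts)
```

Using print(n = Inf) affects only that one call, so the compact default printing stays in place for the rest of the session.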

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
The numbers of unique values in the respective columns are StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, it must contain at least one row for each StoreID value and at least one row for each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group the desired columns before sampling rows. The last line will return one random row for each unique storeID-week pairing.
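library(dplyr)
# example data: 1000 rows with random store IDs, weeks and values
df <- data.frame(storeid=sample(c(2000:2010),1000,TRUE),
week=sample(c(1:52),1000,TRUE),
value=runif(1000))
# count number of duplicated storeid-week pairs
df %>% count(storeid,week) %>% filter(n>1)
# one random row per storeid-week pair
df %>% group_by(storeid,week) %>% sample_n(1)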
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
I am not sure I have read the problem correctly, but I would have tried the following with the sample function.
Assuming your data frame is called MyDataFrame, this shuffles all of its rows into a random order:
RandomizedDF <- MyDataFrame[sample(nrow(MyDataFrame),nrow(MyDataFrame),replace=FALSE),]
Let me know whether this is what you wanted or something else.
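If the requirement is that every StoreID and every WEEK appears at least once, not every StoreID-WEEK pair, one possible sketch is to draw one guaranteed row per StoreID, one per WEEK, and then top up with an ordinary random sample. The example data, the 10% fraction and the helper column row_id are all assumptions made for illustration. slice_sample() needs dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(1)

# Made-up data with the same key columns as the question.
df <- data.frame(
  StoreID = sample(2001:2010, 1000, replace = TRUE),
  WEEK    = sample(1:52, 1000, replace = TRUE),
  Units   = sample(1:5, 1000, replace = TRUE)
)
df$row_id <- seq_len(nrow(df))  # hypothetical helper to spot duplicate picks

# One guaranteed row for every store and for every week.
one_per_store <- df %>% group_by(StoreID) %>% slice_sample(n = 1) %>% ungroup()
one_per_week  <- df %>% group_by(WEEK) %>% slice_sample(n = 1) %>% ungroup()

# Ordinary random sample to reach roughly the desired size.
random_part <- df %>% slice_sample(prop = 0.1)

# Combine and drop rows that were picked more than once.
smp <- bind_rows(one_per_store, one_per_week, random_part) %>%
  distinct(row_id, .keep_all = TRUE)
```

The result is no longer a simple random sample, because the guaranteed rows slightly over-represent rare stores and weeks. Whether that matters depends on what the sample is used for.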

Calculate Percentage Column for List of Dataframes When Total Value is Hidden Within the Rows

library(tidyverse)
I feel like there is a simple solution for this but I'm stuck. The code below creates a simple list of two dataframes (they are the same for simplicity of the example, but the real data has different values)
Loc<-c("Montreal","Toronto","Vancouver","Quebec","Ottawa","Hamilton","Total")
Count<-c("2344","2322","122","45","4544","44","9421")
Data<-tibble(Loc,Count)
Data2<-tibble(Loc,Count)
Data3<-list(Data,Data2)
Each dataframe has "Total" within the "Loc" column with the corresponding overall total of the "Count" column. I would like to calculate percentages for each dataframe by dividing each value in the "Count" column by the total, which is the last number in the "Count" column.
I would like the percentages to be added as new columns for each dataframe.
For this example, the total is the last number in the column, but in reality, it may be mixed anywhere in the column and can be found by the corresponding "Total" value in the "Loc" column.
I would like to use purrr and Tidyverse:
Below is an example of the code, but I'm stuck on the percentage...
Data3%>%map(~mutate(.x,paste0(round(100* (MISSING PERCENTAGE),2),"%"))
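This solution uses only base R and finds the total through the "Total" row of Loc, so it does not depend on the row order:
for (i in seq_along(Data3)) {
Data3[[i]]$Count <- as.numeric(Data3[[i]]$Count)
total <- Data3[[i]]$Count[Data3[[i]]$Loc == "Total"]
Data3[[i]]$perc <- Data3[[i]]$Count / total
}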
> Data3
[[1]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
[[2]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
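Since the question asks for purrr and the tidyverse, here is a minimal sketch of the same calculation with map() and mutate(). It assumes each data frame has exactly one row where Loc is "Total". The second data frame is reordered here only to show that the position of that row does not matter, and the column name Percent is an illustrative choice.

```r
library(dplyr)
library(purrr)
library(tibble)

Loc   <- c("Montreal", "Toronto", "Vancouver", "Quebec", "Ottawa", "Hamilton", "Total")
Count <- c("2344", "2322", "122", "45", "4544", "44", "9421")

Data  <- tibble(Loc, Count)
Data2 <- Data[c(7, 1:6), ]  # same data with the "Total" row moved to the top
Data3 <- list(Data, Data2)

# Convert Count to numeric, then divide by the value in the "Total" row.
result <- Data3 %>%
  map(~ mutate(.x,
               Count   = as.numeric(Count),
               Percent = paste0(round(100 * Count / Count[Loc == "Total"], 2), "%")))

result
```

mutate() evaluates its arguments in order, so Percent is computed from the already numeric Count.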

Only changing a single variable in R

I have a dataframe dt:
Group Age Sales
A1234 12 1000
A2312 11 900
B2100 23 2100
...
I want to create a new dataframe by modifying the Group variable so that it holds only a substring of Group. At present, I can do it in 2 steps:
dt1<- dt
dt1$Group<- substr(dt$Group,1,2)
Is it possible to do the above in one single command? I guess the above would get tedious if I had to create and transform many intermediate dataframes along the way.
You can try:
dt1<-`$<-`(dt,"Group",substr(dt$Group,1,2))
dt1
# Group Age Sales
#1 A1 12 1000
#2 A2 11 900
#3 B2 23 2100
dt
# Group Age Sales
#1 A1234 12 1000
#2 A2312 11 900
#3 B2100 23 2100
The original table is unchanged and you get the new one with a single line.
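A more conventional one-liner, offered as an alternative sketch, is base R's transform(), which returns a modified copy and leaves its input alone. The dt object below is rebuilt from the three sample rows in the question.

```r
# Sample rows from the question.
dt <- data.frame(
  Group = c("A1234", "A2312", "B2100"),
  Age   = c(12, 11, 23),
  Sales = c(1000, 900, 2100),
  stringsAsFactors = FALSE
)

# transform() evaluates substr() on dt's own Group column and returns a copy.
dt1 <- transform(dt, Group = substr(Group, 1, 2))

dt1  # Group is now A1, A2, B2
dt   # unchanged
```

With dplyr loaded, mutate(dt, Group = substr(Group, 1, 2)) follows the same pattern and chains well when several such steps are needed.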
