Add multiple selects in one dataset - r

I have the dataset below and in it I consolidate the categories Mk_Cap, Exports and Money_Supply, but each of these grids has a different Unit.
df <- data.frame(Mes=c("Jan","Fev","Mar","Abr","Mai",
"Jan","Fev","Mar","Abr","Mai",
"Jan","Fev","Mar","Abr","Mai"),
Ano=c(2005,2006,2007,2008,2009,
2005,2006,2007,2008,2009,
2005,2006,2007,2008,2009),
Mk_Cap=c(11:15,116:120,1111:1115),
Exports=c(21:25,146:150,1351:1355),
Money_Supply=c(31:35,546:550,2111:2115),
Unit=c("USD","USD","USD","USD","USD","200=10",
"200=10","200=10","200=10","200=10",
"CNY","CNY","CNY","CNY","CNY"))
Today I am consolidating as follows:
library(dplyr)
Money_Supply <- df %>% dplyr::select(Ano, Mes,Money_Supply) %>% dplyr::filter(df$Unit == "USD")
Mk_Cap <- df %>% dplyr::select(Mk_Cap) %>% dplyr::filter(df$Unit == "200=10")
Exports <- df %>% dplyr::select(Exports) %>% dplyr::filter(df$Unit == "CNY")
Consolidado <- base::cbind(Money_Supply,Mk_Cap,Exports)
I doubt this is the best way to do it, but it is the only one I have found. This example has only a few columns, but in practice I do this for more than 30 variables, which is extremely tedious. Is there an easier way?

A solution with dplyr:
There is a pattern in the data frame: each year has three rows, one per Unit.
Each of the three columns of interest (Money_Supply, Mk_Cap, Exports) has its wanted value in the first, second, or third of those rows, respectively.
First reorder the columns, then arrange by year, then use lead() to pull the values of interest up into the first row. Finally, group the rows in threes and keep only the first row of each group (id == 1).
df1 <- df %>%
select(Ano, Mes, Money_Supply, Mk_Cap, Exports) %>%
arrange(Ano) %>%
mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
mutate(group = rep(row_number(), each=3, length.out = n())) %>%
group_by(group) %>%
mutate(id = row_number()) %>%
filter(id ==1) %>%
ungroup() %>%
select(-group, -id)
Edit: Try to clarify my point and the simplicity of the pattern in the data:
# slightly simplified code
df1 <- df %>%
arrange(Ano) %>%
mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) %>%
mutate(Exports = lead(Exports, 2, order_by = Ano)) %>%
group_by(Ano) %>%
mutate(id = row_number()) %>%
filter(id ==1) %>%
ungroup() %>%
select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
If you consider your dataframe like Fig1 with arrange(Ano):
You have 5 Ano (orange): 2005-2009
In each Ano you have 1 Mes(purple): In 2005 = Jan, 2006 = Fev, 2007 = Mar, 2008 = Abr, 2009 = Mai
In each Ano and Mes you have 3 Unit (blue): In 2005 & Jan = USD, 200=10, CNY ; In 2006 & Fev = USD, 200=10, CNY ; etc...
In your desired output you want to condense the 3 rows of one Ano (one per Unit) into 1 row with Ano, Mes and the corresponding values of Money_Supply, Mk_Cap, Exports.
This can be achieved by lead function (see Fig.1):
In Money_Supply: no code is needed, the value is already in the first row (green)
In Mk_Cap: mutate(Mk_Cap = lead(Mk_Cap, order_by = Ano)) yellow arrow
In Exports: mutate(Exports = lead(Exports, 2, order_by = Ano)) red arrow
group_by(Ano) Group by Ano
mutate(id = row_number()) Assign unique id within each group
filter(id ==1) Filter the 1 row in each group
Finally, tweak the order of the columns and remove the unnecessary ones.
select(Ano, Mes, Money_Supply, Mk_Cap, Exports, -id, -Unit)
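As a minimal sketch of what lead() does here, the three rows of a single year can be examined in isolation. The small one_year data frame below is an illustrative assumption built from the 2005 values of the sample data; it is not part of the original answer.

```r
library(dplyr)

# Hypothetical one-year slice: the three Unit rows for Ano 2005
one_year <- data.frame(
  Ano          = 2005,
  Unit         = c("USD", "200=10", "CNY"),
  Money_Supply = c(31, 546, 2111),
  Mk_Cap       = c(11, 116, 1111),
  Exports      = c(21, 146, 1351)
)

# lead(x) shifts values up by one row, lead(x, 2) by two rows;
# the rows that run off the end are filled with NA
shifted <- one_year %>%
  mutate(Mk_Cap  = lead(Mk_Cap),
         Exports = lead(Exports, 2))

# After the shift, the first row holds the value wanted from each Unit
first_row <- shifted %>% slice(1)
print(first_row)
```

The first row now combines Money_Supply from the USD row, Mk_Cap from the 200=10 row, and Exports from the CNY row, which is why keeping only that row per year gives the consolidated table.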

I think a simple way would be filtering your dataset by the Unit column before doing any other operations. Store the variations in a list by performing:
unit_variations <- lapply(unique(df$Unit), function(x) {
return(df[df$Unit == x, ])
})
names(unit_variations) <- unique(df$Unit)
Then, to make your Consolidado dataframe, select which variables you want from which unit variations. Say:
vars <- c("Money_Supply", "Mk_Cap", "Exports")
unit <- c("USD", "200=10", "CNY")
Consolidado <- mapply(
FUN = function(var, unit) {
return(unit_variations[[unit]][[var]])
},
vars,
unit
)
I used a list because, from what you described, I cannot assume that the number of rows for each type of Unit will always be the same, so a list allows for more flexibility. I also did not include month and year, for the same reason.
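If month and year should be kept, one possible variation is to join the per-unit pieces on Ano and Mes instead of binding columns by position. This is only a sketch under the assumption that each variable is wanted from exactly one Unit; the lookup vector and the use of purrr are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  Mes = rep(c("Jan", "Fev", "Mar", "Abr", "Mai"), 3),
  Ano = rep(2005:2009, 3),
  Mk_Cap = c(11:15, 116:120, 1111:1115),
  Exports = c(21:25, 146:150, 1351:1355),
  Money_Supply = c(31:35, 546:550, 2111:2115),
  Unit = rep(c("USD", "200=10", "CNY"), each = 5)
)

# Hypothetical lookup: which Unit each variable should be taken from
lookup <- c(Money_Supply = "USD", Mk_Cap = "200=10", Exports = "CNY")

# One small data frame per variable: the key columns plus that variable
pieces <- imap(lookup, function(unit, var) {
  df %>%
    filter(Unit == unit) %>%
    select(Ano, Mes, all_of(var))
})

# Join the pieces on the keys so the rows line up by Ano and Mes
Consolidado <- reduce(pieces, full_join, by = c("Ano", "Mes"))
print(Consolidado)
```

With 30 or more variables only the lookup vector grows. A full join leaves NA where a Unit has no row for some Ano and Mes, rather than silently misaligning rows.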

Related

Getting rid of NA values in R when trying to aggregate columns

I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
group_by(location) %>%
arrange(date, .by_group = TRUE) %>% # or whatever
summarise(Last_value_vacc = last(na.omit( people_vaccinated_per_hundred)))
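The effect of dropping the NA values before taking the last one can be seen on a toy table. The data below is invented for illustration and only reuses the column names from the question.

```r
library(dplyr)

# Hypothetical stand-in for `combined`: the latest rows are NA
combined <- tibble(
  location = c("A", "A", "A", "B", "B"),
  date = as.Date("2021-01-01") + c(0, 1, 2, 0, 1),
  people_vaccinated_per_hundred = c(1.0, 2.5, NA, 0.4, NA)
)

aggre_data <- combined %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  # na.omit() removes the missing entries, so last() sees only real values
  summarise(Last_value_vacc = last(na.omit(people_vaccinated_per_hundred)))

print(aggre_data)
```

Without na.omit(), last() would return the NA in the final row of each location.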

Finding the first row after which x rows meet some criterium in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
datetime=seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x = comb %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0 | flag==0,
0,
difftime(datetime, lag(datetime), units="days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp. Summarise the data and keep only 1 row for each id and date which shows whether the animal was on land even once in the entire day.
Use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on the next two days (three consecutive days in total).
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
group_by(id, date) %>%
summarise(on_land = any(land == 1)) %>%
mutate(consec_three = rollapply(on_land, 3,all, align = 'left', fill = NA)) %>%
ungroup %>%
#If you want all the rows of the data
left_join(dat, by = c('id', 'date'))
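To see how the left-aligned window behaves, here is a small sketch on a made-up logical vector (one value per day, not taken from the question's data):

```r
library(zoo)

# Hypothetical daily flags: was the animal on land at least once that day?
on_land <- c(TRUE, TRUE, TRUE, FALSE, TRUE)

# Each position looks at itself and the next two days;
# the last two positions have no full window and are filled with NA
consec_three <- rollapply(on_land, 3, all, align = "left", fill = NA)
print(consec_three)
```

The first TRUE in consec_three marks the first day of a run of three consecutive days with land use.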

Manipulating data.frame while using cycles and storing values in a list

I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in the vector:
seq(from=as.Date('2020-03-02'), to=Sys.Date(),by='days') #.... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
filter(DATE < '2020-03-02')%>%
summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK,na.rm = T))
STAVPTF <- data.frame %>%
filter (DATE < '2020-03-02')%>%
group_by(CP) %>%
summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
select(DATE,CP,mnozstvi) %>%
rbind(KOKOKO)%>%
drop_na()
So instead of '2020-03-02' I want to substitute each day since '2020-03-02', one after another. I want to save the KOKOKO and STAVPTF created for each day as a separate data.frame and store all of them in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
filter(DATE < .x)%>%
summarize(DATE = .x, CZK = sum(Objem.v.CZK,na.rm = TRUE)))
As this is a repeated cycle, a function would make it cleaner
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
filtered <- dat %>%
filter({{date_col}} < date_val)
KOKOKO <- filtered %>%
summarize({{date_col}} := date_val,
CZK = sum({{Objem_col}}, na.rm = TRUE))
STAVPTF <- filtered %>%
group_by({{group_col}}) %>%
summarize(mnozstvi = last({{aktualni_col}}),
{{date_col}} := date_val) %>%
select({{date_col}}, {{group_col}}, mnozstvi) %>%
bind_rows(KOKOKO)%>%
drop_na()
return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example. That is easy to do with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as simple cumulative sum:
KOKOKO =
data.frame %>%
arrange(DATE) %>% # if necessary
group_by(DATE) %>%
summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>% # summarise per DATE (if necessary)
mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
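The identity behind cumsum(CZK) - CZK can be checked on a toy vector. The numbers below are invented and stand for per-date totals that are already sorted by date.

```r
# Hypothetical per-date totals, ordered by date
czk <- c(10, 20, 30)

# Running total up to and including each date, minus that date's own value,
# leaves the sum over all strictly earlier dates (the DATE < day filter)
before_today <- cumsum(czk) - czk
print(before_today)  # 0 10 30
```

Note that this yields one row per date present in the data, so calendar days without any rows would still need to be filled in separately.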
Even STAVPTF code can probably be rewritten without iterations. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE. Then this value is assigned to the next DATE:
STAVPTF <-
data.frame %>%
group_by(CP, DATE) %>%
summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups='drop_last') %>%
arrange(DATE) %>% # if necessary
mutate(DATE = lead(DATE))

How to use count for object of class "Character"

I have a data frame where in one column named "City" there are more than 50 different cities and if I plot a bar graph using city then it gets very difficult to read the plot.
Is there a way to first use count() to count how often each city appears, then select the top 15 cities by number of occurrences, and finally plot a bar graph with ggplot()?
We can also do
library(dplyr)
res <- df %>%
group_by(City) %>%
summarise(n = n()) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City')
To keep the rows for top 15 Cities you can do -
library(dplyr)
df %>%
count(City) %>%
slice_max(n = 15, n) %>%
left_join(df, by = 'City') -> res
res
Or in base R -
res <- subset(df, City %in% names(tail(sort(table(City)), 15)))
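For the plotting step asked about in the question, a minimal sketch could look like the following. The simulated df and the choice of geom_col() with flipped axes are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: 50 cities with unequal frequencies
set.seed(1)
df <- data.frame(
  City = sample(paste0("City", 1:50), 1000, replace = TRUE, prob = 50:1)
)

# Count occurrences per city and keep the 15 most frequent
top_cities <- df %>%
  count(City) %>%
  slice_max(n, n = 15, with_ties = FALSE)

# Bar chart of the counts, ordered by frequency
p <- ggplot(top_cities, aes(x = reorder(City, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Count")
```

Printing p draws the chart. with_ties = FALSE keeps exactly 15 rows even when several cities share the 15th-highest count.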

Writing a for loop to count similar keys in two data frames

I have panel data where I split the whole data set into multiple data frames by year and match unique keys across years. For example, if you have 6,000 observations in 2000 and 7000 observations in, I'm trying to match the overlap between each year for every year from 2000 to 2017.
I have a brute forced solution that's about 350 lines of copy and pasted code, but I'm looking for a more efficient and elegant solution using loops.
I'm working with for loops and looking into map() functions at the moment, but I haven't found a solution. I'm using R4DS.
#1989
b1989 <- b %>% filter(year == 1989) %>% select(key, V7, z9, z11, z13, z15)
a1990 <- a %>% select(key,year) %>% filter(year == 1990) %>% distinct()
br1989 <- inner_join(b1989, a1990, by = "key")
#1990
b1990 <- b %>% filter(year == 1990) %>% select(key, V7, z9, z11, z13, z15)
a1991 <- a %>% select(key,year) %>% filter(year == 1991) %>% distinct()
br1990 <- inner_join(b1990, a1991, by = "key")
#1991
b1991 <- b %>% filter(year == 1991) %>% select(key, V7, z9, z11, z13, z15)
a1992 <- a %>% select(key,year) %>% filter(year == 1992) %>% distinct()
br1991 <- inner_join(b1991, a1992, by = "key")
busrescount_t1 <- c(nrow(br1989),nrow(br1990),nrow(br1991))
busrescount_t1
[1] 4366 4956 4768
It currently works, but it is simply bad, cumbersome code. Also, doing it at scale for 2-year, 3-year, 4-year differences is a nightmare and will be 1000+ lines of copy/pasted code.
The goal is to have a loop that produces a vector of these matches that can be placed into a data frame. I'm trying to do this for 20+ years.
How about something like this? (I'd love to be able to verify this works using a sample of your data.)
In theory, we should be able to join b to a version of a whose year is shifted back by one, so that a's row for the following year lines up with b's row for the current year. If a row in b has a match in a with the same key and the following year, the join completes and a_match is TRUE.
b %>%
select(key, year, V7, z9, z11, z13, z15) %>% # keep year: it is a join key
left_join(a %>% select(key, year) %>% distinct() %>%
mutate(year = year - 1, a_match = TRUE), # a's year t+1 lines up with b's year t
by = c("key", "year")) %>%
filter(!is.na(a_match))
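To get the vector of match counts per year, and to reuse the same idea for 2-year or 3-year gaps, the shift can be wrapped in a small helper. This is a sketch on invented data; count_matches and its lag argument are hypothetical names, not something from the question.

```r
library(dplyr)

# Hypothetical toy versions of a and b
a <- data.frame(key  = c(1, 2, 3, 1, 2, 4),
                year = c(1990, 1990, 1990, 1991, 1991, 1991))
b <- data.frame(key  = c(1, 2, 5, 1, 3, 4),
                year = c(1989, 1989, 1989, 1990, 1990, 1990),
                V7   = 1:6)

# Count, per year of b, the keys that also appear in a `lag` years later
count_matches <- function(b, a, lag = 1) {
  a_shifted <- a %>%
    select(key, year) %>%
    distinct() %>%
    mutate(year = year - lag)  # align a's year t + lag with b's year t
  b %>%
    inner_join(a_shifted, by = c("key", "year")) %>%
    count(year, name = "matches")
}

res <- count_matches(b, a, lag = 1)
print(res)
```

Calling count_matches(b, a, lag = 2) would give the 2-year version without any extra copy-pasted code. Years of b with no match at all are absent from the result rather than listed with a count of zero.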
