Subset Data Frame to Exclude 28 Different Months in R Using dplyr - r

I have a data frame consisting of monthly volumes beginning 2004-01-01 and ending 2019-12-01. I need to apply a filter that will delete rows that equal certain dates. My problem is that there are 28 dates I need to filter out and they aren't consecutive. What I have right now works, but isn't efficient. I am using dplyr's filter function.
I currently have 28 variables, d1-d28, which are the dates that I would like filtered out and then I use
df<-data%>%dplyr::filter(Date!=d1 & Date!=d2 & Date!=d3 .......Date!=d28)
I would like to put the dates of interest, the d1-d28, into a data.frame and just reference the data.frame in my filter code.
I've tried:
df<-data%>%dplyr::filter(!Date %in% DateFilter)
Where DateFilter is a data.frame with 1 column and 28 rows of the dates I want filtered, but I get an error saying that the lengths of the objects don't match.
Is there any way I can do this with dplyr?

Here, we may use filter_at, assuming d1 to d28 are columns of data:
library(dplyr)
data %>%
filter_at(vars(matches('^d\\d+$')), all_vars(Date != .))
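If the dates to exclude live in a separate one-column data frame, as in the question, a minimal sketch of two alternatives follows. It assumes toy data and a column named Date inside DateFilter (both hypothetical, not taken from the original post). The likely issue with the attempt in the question is that %in% expects a vector on its right-hand side, so the column has to be extracted from the data frame first.

```r
library(dplyr)

# Hypothetical toy data: twelve monthly volumes for 2004
data <- data.frame(
  Date   = seq(as.Date("2004-01-01"), as.Date("2004-12-01"), by = "month"),
  Volume = 1:12
)

# Hypothetical one-column data frame holding the dates to drop
DateFilter <- data.frame(
  Date = as.Date(c("2004-02-01", "2004-07-01", "2004-11-01"))
)

# Option 1: extract the column as a vector before using %in%
df <- data %>% filter(!Date %in% DateFilter$Date)

# Option 2: anti_join keeps the rows of data that have no match in DateFilter
df2 <- anti_join(data, DateFilter, by = "Date")
```

Both options scale to any number of dates without listing them one by one in the filter condition.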

Related

How do I use mutate in a data frame in R to update column based on value of a second column [duplicate]

In R I am trying to update one column in a data frame based on the value of another column.
The data frame has 2 columns: the first should be a number and the second is a date. For 3 specific dates, I need to update the Exams column to a certain value. The data frame looks like this:
Exams Year
NA 2009-12-01
NA 2010-01-01
NA 2010-02-01
and I want to change the NA to a specific value for these 3 dates
I have tried this:
library(dplyr)
ABVILE %>%
mutate(Exams=replace(Exams, Year==2009-12-01, 1709.67)) %>%
as.data.frame()
and tried putting the value I need to update it to as a variable too but it doesn't do the update either way and I don't get an error.
I expect it to be like this:
1709.67 2009-12-01
but I get this:
NA 2009-12-01.
There are (at least) two options using dplyr methods. Note that here I'm assuming your Year variable is a character vector. The first option uses an ifelse statement:
ABVILE %>%
mutate(Exams = ifelse(Year == "2009-12-01", 1709.67, Exams)) %>%
data.frame
ifelse takes a test clause and two return values (for when the clause is and is not met, respectively). That's an easy solution if you have only one test condition. If you have to string together many of them, then dplyr provides a nice function called case_when. Here's what that would look like:
ABVILE %>%
mutate(Exams = case_when(Year == "2009-12-01" ~ 1709.67,
Year == "2010-01-01" ~ 999.999)) %>%
data.frame
That's equivalent to
ABVILE %>%
mutate(Exams = ifelse(Year == "2009-12-01", 1709.67,
ifelse(Year == "2010-01-01", 999.999, NA))) %>%
data.frame
Obviously case_when is going to be a lot easier to deal with if you have lots of test conditions/assignments.
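As a hedged sketch of why the original attempt did nothing: an unquoted 2009-12-01 is evaluated as arithmetic (2009 minus 12 minus 1, which is 1996), so the comparison never matches a date. The example below assumes Year is stored as a Date (the answer above assumes character instead) and uses hypothetical toy values. It also adds a TRUE ~ Exams fallback so that rows not matched by any condition keep their existing value instead of becoming NA.

```r
library(dplyr)

# Hypothetical toy data; the 250 in the third row is made up for illustration
ABVILE <- data.frame(
  Exams = c(NA_real_, NA_real_, 250),
  Year  = as.Date(c("2009-12-01", "2010-01-01", "2010-02-01"))
)

# Unquoted, the "date" is just subtraction
unquoted <- 2009-12-01

updated <- ABVILE %>%
  mutate(Exams = case_when(
    Year == as.Date("2009-12-01") ~ 1709.67,
    Year == as.Date("2010-01-01") ~ 999.999,
    TRUE ~ Exams  # keep the existing value for every other row
  ))
```

Without the fallback line, the third row would be set to NA, matching the nested ifelse version shown above.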

How to unnest irregular JSON data

I have been looking at many solutions on this site to similar problems for weeks but cannot wrap my head around how to apply them successfully to this particular one:
I have the dataset at https://statdata.pgatour.com/r/006/player_stats.json
using:
player_stats_url<-"https://statdata.pgatour.com/r/006/player_stats.json"
player_stats_json <- fromJSON(player_stats_url)
player_stats_df <- ldply(player_stats_json,data.frame)
gives:
a dataframe of 145 rows, one for each player, and 7 columns, the 7th of which is named "players.stats" that contains the data I'd like broken out into a 2-dimensional dataframe
next, I do this to take a closer look at the "players.stats" column:
player_stats_df2<- ldply(player_stats_df$players.stats, data.frame)
the data in the "players.stats" column is formatted as follows: rows of 25 repeating stat categories in the column (player_stats_df2$name) and another nested list in the column $rounds ... on which I repeat ldply to unnest everything, but I cannot sew it back together logically in the way that I want ...
the format of the column $rounds, after unnested, using:
player_stats_df3<- ldply(player_stats_df2$rounds, data.frame)
gives the round number in the first column $r (1,2,3,4 as only choices) and then the stat value in the second column $rValue. to complicate things, some entries have 2 rounds, while others have 4 rounds
the final format of the 2-dimensional dataframe I need would have columns named players.pid and players.pn from player_stats_df, a NEW COLUMN denoting "round.no" which would correspond to player_stats_df3$r and then each of the 25 repeating stat categories from player_stats_df2$name as a column (eagles, birdies, pars ... SG: Off-the-tee, SG: tee-to-green, SG: Total) and each row being unique to a player name and round number ...
For example, there would be four rows for Matt Kuchar, one for each round played, and a column for each of the 25 stat categories ... However, some other players would only have 2 rows.
Please let me know if I can clarify this at all for this particular example- I have tried many things but cannot sew this data back together in the format I need to use it in ...
Here is something you can start with: we can create a tibble using tibble::as_tibble and then apply multiple unnest steps using tidyr::unnest
library(tidyverse)
as_tibble(player_stats_json$tournament$players) %>% unnest() %>% unnest(rounds)
Also see this tutorial here. Finally, use dplyr (part of the tidyverse) instead of plyr.
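To get from the unnested rows to the layout the question asks for (one row per player and round, one column per stat), a possible next step is tidyr::pivot_wider. The sketch below runs on hypothetical toy data that only mimics the nesting described in the question. The column names pid, pn, stats, name, rounds, r and rValue are assumptions based on that description, not verified against the real JSON.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two players, two stat categories, uneven round counts
players <- tibble(
  pid = c("1", "2"),
  pn  = c("Player A", "Player B"),
  stats = list(
    tibble(
      name   = c("Eagles", "Birdies"),
      rounds = list(
        tibble(r = 1:2, rValue = c("0", "1")),
        tibble(r = 1:2, rValue = c("3", "4"))
      )
    ),
    tibble(
      name   = c("Eagles", "Birdies"),
      rounds = list(
        tibble(r = 1L, rValue = "1"),
        tibble(r = 1L, rValue = "5")
      )
    )
  )
)

wide <- players %>%
  unnest(stats) %>%    # one row per player and stat category
  unnest(rounds) %>%   # one row per player, stat category and round
  pivot_wider(names_from = name, values_from = rValue)  # one column per stat
```

In this toy case Player A ends up with two rows (rounds 1 and 2) and Player B with one, which mirrors the uneven round counts mentioned in the question.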

Assigning a variable in one dataset to multiple fields in another dataset

I'm trying to assign a variable in one dataframe into multiple rows of another dataframe - namely the AWND variable here (average wind speed).
I'm trying to obtain the AWND from here, and I am trying to match it with multiple dates based on the date here.
Here's what I've tried so far.
dfNew <- merge(dfWeather, dfFlight, by="DATE")
I'm not sure how to proceed with this.
Should I do a join?
EDIT: Here's the data: https://shrib.com/#-7dXevTkb12Bt6Kdfxim (this is the dput output of the data I am getting AWND from).
I got the flights data (that I am trying to match dates with) from the nycflights13 package, and then I subset the flights data to include only the carriers that had at least 1000 flights depart from LaGuardia.
The flights data has the date-time class, as shown in your tibble. First, make sure that the elements you want to join on have the same form, i.e. 2013-01-01 05:00:00 will not match 2013-01-01 in your dfWeather data.frame.
# Make sure dates match between data.frames
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")
# Join AWND wherever dates match to left-hand side
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
I did assume some things about your data since I couldn't fully see what you're working with from the screenshot. This is my first answer on Stack Overflow, so feel free to edit or leave me suggestions.
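A variation on the same idea, as a minimal sketch on hypothetical toy data: convert the date-time column with as.Date instead of extracting text, so that both DATE columns share the Date class before joining. The values and the carrier column below are made up for illustration, and the time zone is set explicitly because as.Date on a date-time depends on it.

```r
library(dplyr)

# Hypothetical flights: several departures per day, stored as date-times
dfFlight <- data.frame(
  DATE = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 09:30:00", "2013-01-02 06:15:00"),
    tz = "UTC"
  ),
  carrier = c("AA", "DL", "UA")
)

# Hypothetical daily weather: one AWND value per day
dfWeather <- data.frame(
  DATE = as.Date(c("2013-01-01", "2013-01-02")),
  AWND = c(7.2, 5.4)
)

# Drop the time component so both DATE columns are of class Date
dfFlight$DATE <- as.Date(dfFlight$DATE, tz = "UTC")

# Every flight row receives the AWND of its day
dfNew <- left_join(dfFlight, dfWeather, by = "DATE")
```

A left join keeps every flight row, so a single daily AWND value is repeated for all flights on that date.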

How can I delete na values from specific columns in a data set?

I have a dataset with 28 variables, and I want to exclude all missing data from 4 of these variables.
If I use na.omit on the whole dataset, I'll lose data from these columns. What I want is to get the examples with complete data and exclude rows in which there is an NA value in these 4 variables.
Also, what if I wanted to exclude NA values in these 4 variables so each of them have no more than 5% missing data?
You can use tidyr package:
library(tidyr)
df %>% drop_na(col_a, col_b, col_c, col_d)
For the second part, as far as I know, you probably need to build different subsets and combine them together.
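A small sketch of the drop_na call on hypothetical toy data, together with one way to check the share of missing values per column, which may be a useful starting point for the 5% question. The column names and values are made up for illustration.

```r
library(tidyr)

# Hypothetical toy data: two columns of interest plus one other column
df <- data.frame(
  col_a = c(1, NA, 3, 4),
  col_b = c("x", "y", NA, "z"),
  other = c(NA, 2, 3, 4)
)

# Drop rows with NA in col_a or col_b only; NAs in other columns are kept
complete <- drop_na(df, col_a, col_b)

# Share of missing values in each column of interest
na_share <- colMeans(is.na(df[c("col_a", "col_b")]))
```

Here the first row survives even though its value in the other column is NA, because only col_a and col_b are checked.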

Create a stack of n subset data frames from a single data frame based on date column

I need to create a bunch of subset data frames out of a single big df, based on a date column (e.g. - "Aug 2015" in month-Year format). It should be something similar to the subset function, except that the count of subset dfs to be formed should change dynamically depending upon the available values on date column
All the subsets data frames need to have similar structure, such that the date column value will be one and same for each and every subset df.
Suppose my big df currently has the last 10 months of data: I need 10 subset data frames now, and 11 dfs if I run the same command next month (with 11 months of base data).
I have tried something like the code below, but after each iteration the subset subdf_i gets overwritten. Thus, I end up with only one subset df, which holds the last value of the month column.
I thought that would be created as 45 subset dfs like subdf_1, subdf_2,... and subdf_45 for all the 45 unique values of month column correspondingly.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there should be some option in the subset function or any looping might do. I am a beginner in R, not sure how to arrive at this.
I think the right solution for this is to use assign(), so that the iterating variable i gets appended to the name of each of the 45 subsets. Thanks to my friend for the note. Here is the solution that avoids the subset data frame being overwritten on each run of the loop.
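uniqmnth <- unique(df$mnth)
# for advances i on its own, so no manual increment is needed
for (i in seq_along(uniqmnth)) {
# Create subdf_1, subdf_2, ... in the current environment
assign(paste("subdf_", i, sep = ""), subset(df, mnth == uniqmnth[i]))
}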
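As an alternative to creating many separately named objects, base R's split() returns all subsets in one named list, and the number of elements follows the number of distinct months automatically. A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data with three distinct months
df <- data.frame(
  mnth  = c("Jun 2015", "Jul 2015", "Aug 2015", "Jul 2015", "Aug 2015"),
  value = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# One list element per distinct month, named after the month
subsets <- split(df, df$mnth)

# Access a single subset by name
aug <- subsets[["Aug 2015"]]
```

Keeping the subsets in a list makes it straightforward to loop over them later with lapply, whereas objects created through assign() have to be looked up by constructed names.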
