Flexibly naming subsetted objects in R

I'm somewhat new to R, so I apologize in advance if the answer to this question is obvious. I have a very long data frame (only one variable) from which I want to create multiple objects from subsets within the data frame. The code to scrape the data and format it as the data frame 'aa', with the variable named 'whatever':
aa <- data.frame(readLines("ftp://ftp.cmegroup.com/pub/settle/stlint"))
aa <- data.frame(aa[-1:-3, ])
colnames(aa) <- "whatever"
I am looking to subset each section under a heading beginning with 'ZE', ending with the last data row before the next 'ZE' (or before the 'TOTAL' line). So basically I want 36 objects (length(grep("ZE", aa$whatever)) = 36), each starting with its respective 'ZE' title followed by roughly 70 rows of data, and each object identified by that title. For instance, I would want the first dataset (headed by the row ZE MAR15 EURODOLLAR OPTIONS CALL) to be named some variant of 'March 2015 Calls', as I just need to denote the month, the year, and whether the data is for calls or puts.
I can actually code this up in a batch through a loop, but here's my problem: right now the first 'ZE' month is Mar15, i.e. March 2015, and the last 'ZE' month is Dec18, i.e. December 2018. This will change as time goes on, though, and I'm hoping to be able to name the objects automatically based on the first line of each section, without tweaking the script when the contract months change. So is it possible to flexibly name each of these subsets based on the content of its header?
Thanks
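A minimal sketch of one possible approach, working from the raw character vector rather than the data frame (splitting is easier on plain lines). It assumes the section headings begin the line with 'ZE' and that a 'TOTAL' line follows the last section, as in the example; a named list avoids cluttering the workspace with 36 loose objects, but assign() can create them if that is really what is wanted.
raw <- readLines("ftp://ftp.cmegroup.com/pub/settle/stlint")
raw <- raw[-(1:3)]
hdr <- grep("^ZE", raw)                                  # first line of each section
tot <- grep("TOTAL", raw)
end <- c(hdr[-1] - 1, min(tot[tot > max(hdr)]) - 1)      # last line before next ZE / TOTAL
sections <- Map(function(s, e) data.frame(whatever = raw[s:e]), hdr, end)
# Turn "ZE MAR15 EURODOLLAR OPTIONS CALL" into "March 2015 Calls"
hdr_txt <- raw[hdr]
code    <- sub("^ZE\\s+(\\S+).*", "\\1", hdr_txt)        # e.g. "MAR15"
mon     <- month.name[match(substr(code, 1, 3), toupper(month.abb))]
yr      <- paste0("20", substr(code, 4, 5))
type    <- ifelse(grepl("CALL", hdr_txt), "Calls", "Puts")
names(sections) <- paste(mon, yr, type)
sections[["March 2015 Calls"]]   # the first block
# for (i in seq_along(sections)) assign(names(sections)[i], sections[[i]])  # loose objects, if preferred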

Related

Merging multiple excel sheets based on different column values in R

I'm a bit new to R so apologies up front if this isn't explained as clearly as it should be. I have six Excel sheets within a single workbook (Trees_2020, Trees_2017, Trees_2014, Trees_2011, Trees_2008, Trees_2003). These contain plot IDs (ID_Plot), within-plot tree ID numbers (ID_tree), and growth data (DBH_mm). The problem is that the tree IDs do not remain the same through the years but are linked via their old ID (the Field_Mapping software recognises trees by location but assigns a new number, which is linked to the Old_ID).
What I'm trying to do is merge all the sheets, linking the years together based on the plot ID and then on the Old_ID-to-current-ID link.
2020 Data Example
2017 Data Example
You can see in the 2020 sheet a column linking to the Old_ID number of 2017 and this is true of all sheets. Trees that are recorded for the first time do not have an Old_ID number in that first recording.
The ideal output would be a single sheet with a unique identifier added for each tree, and the DBH of each tree for each year linked together based on the plot ID and the within-plot ID_tree (coupled via Old_ID).
Ideal Output
Apologies if that's very confusing, but I struggled to explain it in a simpler way. I've been playing with the tidyverse and loops but can't seem to figure it out, so any help is greatly appreciated!
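In case it helps, here is a rough sketch of how two of the years could be linked with dplyr joins, chaining the same pattern back through the earlier sheets. The sheet and column names are taken from the question; the workbook path is a placeholder.
library(readxl)
library(dplyr)

trees_2020 <- read_excel("trees.xlsx", sheet = "Trees_2020")   # placeholder path
trees_2017 <- read_excel("trees.xlsx", sheet = "Trees_2017")

# Join each 2020 tree to its 2017 record: same plot, and 2020's Old_ID
# pointing at 2017's within-plot ID_tree.
linked <- trees_2020 %>%
  rename(DBH_2020 = DBH_mm) %>%
  left_join(
    trees_2017 %>%
      select(ID_Plot, ID_tree, DBH_2017 = DBH_mm, Old_ID_2017 = Old_ID),
    by = c("ID_Plot", "Old_ID" = "ID_tree")
  ) %>%
  mutate(tree_uid = row_number())   # one unique identifier per tree

# Trees recorded for the first time in 2020 have NA in Old_ID and simply get
# NA for DBH_2017. Carrying Old_ID_2017 along lets the same join be repeated
# against Trees_2014, then 2011, 2008 and 2003 to build the full table.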

Beginner R Data Frames - How to do arithmetic on columns from two tables with different State abbreviations

I am very new to R programming, and I have a few datasets that I'm playing around with. One of the things I'm trying to do is use ggplot to graph what percentage of the population in each state voted in the 2016 election.
The first csv file I have contains an estimate of the population of each state in 2016, and the second csv file I have contains the number of votes cast by each party in the 2016 election. I'm not sure how to attach the file here, so I will show some screenshots:
2016 Election Votes:
2016 State Populations:
From what I understand, I can read the 2016 election votes csv file, and create a new column that contains the total votes using something like:
electionVotes$TotalVotes <- electionVotes$DemocraticCandidates + electionVotes$RepublicanCandidates + electionVotes$OtherCandidates
Once I have that, I would like to create a column where I do something like:
electionVotes$PercentVoted <- electionVotes$TotalVotes / *number of people per state*
I understand how to use ggplot to display the results, but what is confusing to me is how I can accurately use these tables with each other when one State column uses an abbreviation for the state name, like "AL", while the other one uses ".Alabama".
Any thoughts on what would be the best process to do this other than manually editing the csv file? Thank you!
You could bring in a table (http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html) that links the abbreviations to the full state names and use that to join the two datasets.
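A rough sketch of that join, using the state.abb and state.name vectors built into R instead of the linked table. The statePop object and the Population column name are guesses from the screenshots, so adjust them to match the real data.
library(dplyr)

state_lookup <- data.frame(abb = state.abb, full = state.name)   # "AL" -> "Alabama", ...

electionVotes <- electionVotes %>%
  mutate(TotalVotes = DemocraticCandidates + RepublicanCandidates + OtherCandidates)

statePop <- statePop %>%
  mutate(full = sub("^\\.", "", State))            # ".Alabama" -> "Alabama"

combined <- electionVotes %>%                      # its State column holds "AL", "AK", ...
  left_join(state_lookup, by = c("State" = "abb")) %>%
  left_join(statePop %>% select(full, Population), by = "full") %>%
  mutate(PercentVoted = TotalVotes / Population)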

Updating a File in R by adding a column/vector

Is there any way that I can update an existing .csv file by adding a column/vector that I have scraped from the web? I have a web scraper that pulls COVID-19 data, and I am trying to create a file that has positive cases in columns and each column is the list of cases for a day in each county (x-axis is counties, y-axis is date). I have toyed around with many different ideas at this point and seem to have hit a roadblock. I'm fairly new to R, so any ideas would be appreciated!
Packages I am Currently Using/Planning to Use:
library(tidyverse)
library(funModeling)
library(Hmisc)
library(rvest)
library(ggplot2)
CODE:
#writing the original file
positive <- data.frame(Counties = counties_list, "06/12/2020" = positive_data)
positive[is.na(positive)] <- 0
positive <- positive[-c(76), ]
write.csv(positive, "C:/Users/Nathan May/Desktop/Research Files (ABI)/Covid/Data For Shiny/Positive/Positive Data.csv")
#creating the new vector and updating the existing file with it
datap <- read.csv("C:/Users/Nathan May/Desktop/Research Files (ABI)/Covid/Data For Shiny/Positive/Positive Data.csv")
positive_data <- positive_data[-c(76), ]
datap$DATE <- positive_data
NOTE: The end goal is to create a Shiny app that displays bar charts for positives, recoveries, and deaths by day in each county. This is the data wrangling portion.
First things first, if you are going to use the tidyverse, use tibble instead of data.frame. Tibbles are the Tidyverse version of data frames.
Next, be aware of the structure of your data frame. The way you create your data.frame now (and later, probably, your tibble), you get a variable "Counties" plus one additional variable for each day, which means you will have to add a column every time a new day arrives. That is the opposite of what you described: moving along the x-axis (along columns) moves along dates, while moving along the y-axis (along rows) moves along counties. It's possible, but a bit unconventional. You might instead initialise the data frame with one column for each county and an additional variable called "date". Then whenever you get new data you add a row rather than a column (you are "adding a new case" instead of "adding a new variable").
To actually add the row you will have to load the data as you do in your code, create the new row (or column, if you insist) and then "glue" it to the rest of the data.
Depending on how your data looks, you can create a single-row data frame using tibble_row(), with the same counties as variable names as in your main data frame, and then glue them together with add_row(datap, your_new_row). Alternatively, if you want to add the row by position only rather than by column names, you can build the new row as a vector and use rbind() instead of add_row().
If you stick with the "one variable per date" approach, there are column equivalents of both these functions: add_column() and cbind().
Hope this helps, Cheers
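To make the "one row per date" layout concrete, here is a small sketch with made-up county names and counts; counties_list and positive_data stand in for the scraped objects.
library(tibble)
library(dplyr)

counties_list <- c("Adams", "Baxter", "Boone")   # toy values for illustration
positive_data <- c(12, 3, 7)

# Helper: one day's counts as a one-row tibble with a date column.
day_row <- function(date, counts, counties) {
  bind_cols(tibble(date = as.Date(date)),
            as_tibble(setNames(as.list(counts), counties)))
}

datap <- day_row("2020-06-12", positive_data, counties_list)

# Next day's scrape: build another one-row tibble and glue it on.
datap <- bind_rows(datap, day_row("2020-06-13", c(15, 4, 9), counties_list))

write.csv(datap, "Positive Data.csv", row.names = FALSE)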

How to create a new column filtering data by date

I need to sort my data by date. Previously I had one dataset and used select and filter to create two separate datasets, one with data from June 30 or earlier and the other with data from July 1 or later. However, my problem is that I seem to have lost some rows during this process - I went from 1390 rows in my original dataset to 1335 rows between the two new datasets. I can't figure out what happened.
What I am trying to do now is use my original dataset, ethica_surveys and create a new column. I want to call this column pre_post. I know how to create a new column, but I want to filter the data into this column based on my date parameters. So the rows containing pre should be dated June 30 or earlier, and those containing post should be dated July 1 or later. I am filtering based on the variable response_time, but I am just unsure of how to code this in RStudio.
Thanks in advance for any help you can provide.
This seemed to work after a lot of trial and error.
ethica_surveys$pre_post <- if_else(ethica_surveys$response_time >= as.Date("2018-07-01"), "post", "pre")

Extract the last 30 days of data from an R data frame

I am totally new to the R environment and I'm stuck on date operations. The scenario is this: I have a daily database of customer activity for a certain store, and I need to extract the last 30 days of data, counting back from the current date.
In other words, suppose today is 18-NOV-2014; I need all the data from 18-OCT-2014 up to today in a separate data frame. What kind of iteration logic should I write in R to extract it?
You don't need iteration. Assuming your data.frame is called X and the date column is DATE, you could write:
X$DATE <- as.Date(X$DATE, format = "%d-%b-%Y")
The format argument matches the date format you show in your question ('%b' covers abbreviated month names like "NOV"). Then, to get the rows you are interested in, something like:
X[X$DATE >= Sys.Date() - 30, ]
which keeps all the rows dated within the last 30 days of today.
Does this help at all?
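A self-contained version of the same idea with made-up data; Sys.Date() stands in for "today" in real use, but a fixed date is used here so the example lines up with the question's 18-NOV-2014 scenario.
X <- data.frame(
  DATE     = c("18-NOV-2014", "01-NOV-2014", "15-OCT-2014", "01-SEP-2014"),
  activity = c(10, 20, 30, 40)
)

# %b matches abbreviated month names; this is locale-dependent, so all-caps
# "NOV" may need adjusting (e.g. with tolower()) on some systems.
X$DATE <- as.Date(X$DATE, format = "%d-%b-%Y")

today  <- as.Date("2014-11-18")        # use Sys.Date() in practice
last30 <- X[X$DATE >= today - 30, ]    # rows from the last 30 days
last30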
