For looping over dates - r

So I have a dataframe called Swine_flu_cases that looks as follows (just an extract):
Country Date Confirmed
1 Canada 2020-01-22 1
2 Egypt 2020-01-23 1
3 Algeria 2020-01-24 1
4 France 2020-01-25 1
5 Zambia 2020-01-26 1
6 Congo 2020-01-27 1
This data set records the number of swine flu cases in a country on a specific date.
I have filtered my data to only show rows where the confirmed cases are 1, grouped it by country, and sorted it in ascending order of date. (I did this to get the date on which each country had its first case.)
I sorted by date because I want to extract the date of each country's first recorded swine flu case and store those dates as a vector.
I have tried doing so by using the following code :
first_case_date = as.Date(data.frame(Swine_flu_cases$Date))
However, this gave me an error:
Error in as.Date.default(data.frame(Swine_flu_cases$Date)) : do not know how to convert 'data.frame(Swine_flu_cases$Date)' to class “Date”
What I want to do is create a new variable, Swine_flu_cases$days_since_first_case, which takes the stored date of each country's first case and subtracts it from all the other dates for that country.
My knowledge of for loops is very basic, but I think I need to use one for this. I have also recently familiarised myself with the lead and lag functions and was wondering whether I could combine them to create this variable.
If someone could give me a general idea of how to go about this, I would really appreciate it.

You can do this with dplyr and lubridate to make your dates behave.
library(dplyr)
library(lubridate)
Swine_flu_cases %>%
  mutate(Date = ymd(Date)) %>%   # makes the Dates behave better for subtraction
  group_by(Country) %>%          # you want the result grouped by country
  # subtracts the first date in each group from the current date for the row
  mutate(days_since_first_case = Date - min(Date))
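For what it's worth, the error in the question comes from passing a data frame to as.Date, which expects a vector such as Swine_flu_cases$Date itself. If you'd rather avoid extra packages, here is a minimal base R sketch of the same grouped subtraction. The cases data frame and its values are made up for illustration and are not the asker's data.

```r
# Hypothetical sample data: two countries, several report dates each.
cases <- data.frame(
  Country = c("Canada", "Canada", "Canada", "Egypt", "Egypt"),
  Date = as.Date(c("2020-01-22", "2020-01-25", "2020-02-01",
                   "2020-01-23", "2020-01-30")),
  stringsAsFactors = FALSE
)

# ave() applies min() within each country and returns a vector that is
# aligned with the original rows, so no explicit for loop is needed.
first_date <- ave(as.numeric(cases$Date), cases$Country, FUN = min)

# Days elapsed since each country's first recorded case.
cases$days_since_first_case <- as.numeric(cases$Date) - first_date

print(cases)
```

Working on the numeric day counts keeps the sketch independent of how ave() handles the Date class; the result is a plain number of days per row.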

Related

R language: how to return and print a list of missing entries based on two columns

I'm struggling to write R code that prints a "list of dates that do not have data between given start and end dates for all the possible values of another variable / column in a table". It's a little difficult to explain in words, so I'll give a very simplified example that will hopefully make it clear what I'm trying to do.
You are the manager of a pet store and in charge of checking the quality of pet food sales data. The data comes in a csv file with four columns: date, type of animal food, sales price, and quantity sold. The animal_type column can have 3 possible values: dog, cat, or bird, in string format.
I've simulated the first three days' worth of data for the month of December in a very simplified manner below. The price and quantity columns aren't relevant, so I've left them blank.
date        animal_type  price  quantity
2021-12-01  dog
2021-12-01  dog
2021-12-01  cat
2021-12-01  bird
2021-12-02  dog
2021-12-02  bird
2021-12-03  cat
2021-12-03  cat
2021-12-03  cat
What I'm trying to do is print out / return the dates that don't have entries for all the possible values in the animal_type column. So for my example, what I'm looking to print out is something like...
2021-12-02 : ['cat']
2021-12-03 : ['dog', 'bird']
Because [2021-12-02] doesn't have an entry for 'cat' and [2021-12-03] doesn't have entries for 'dog' or 'bird' in the data. However, I've only been able to get a count of the number of unique animal_type values for each date so far with the following functions.
library(tidyverse) # dplyr is loaded as part of the tidyverse
df %>% group_by(date) %>% summarise(n = n_distinct(animal_type)) # counts the unique animal_type values appearing in the entries for every date
df %>% group_by(animal_type) %>% summarise(num_dates = n_distinct(date)) # counts the unique dates appearing in the entries for every animal_type
# output for "counts the unique animal_type values appearing in the entries for every date"
  date           n
  <date>     <int>
1 2021-12-01     3
2 2021-12-02     2
3 2021-12-03     1
# output for "counts the unique dates appearing in the entries for every animal_type"
  animal_type num_dates
  <chr>           <int>
1 dog                 2
2 cat                 2
3 bird                2
This can tell me which dates have missing animal_type values, but not which one(s) specifically. I've tried looking around but couldn't find many similar problems, so I'm wondering how feasible this would be. I'm also rusty with R and relearning much of the syntax, packages, and libraries, so I could be missing something simple. I'm open to both tidyverse / dplyr and base R advice, as you can likely see from my code. I would appreciate any help, and thank you for your time!
You can use both the tidyr::complete function and an anti-join.
First you have to complete the implicit missing values and then anti-join the completed tibble with the one you currently have.
See the example below
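library(tidyverse)
# every Date/Pet combination (9 rows, sorted by Date then Pet)
example <- crossing("Date" = c("2021-12-01", "2021-12-02", "2021-12-03"),
                    "Pet" = c("Bird", "Cat", "Dog"))
# drop three combinations to mimic the data in the question
op_example <- example %>% slice(-c(5, 7, 9))
# complete() restores the missing combinations; anti_join() keeps only those
op_example %>%
  complete(Date, Pet) %>%
  anti_join(op_example, by = c("Date", "Pet"))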
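Since the question also asked about base R, here is a rough sketch of the same idea without any packages: split the animal types by date and compare each group against the full set of expected values. The sales data frame below is retyped from the table in the question, and all_types is an assumption that the three listed values are the only valid ones.

```r
# Data retyped from the question's table (price and quantity omitted).
sales <- data.frame(
  date = c("2021-12-01", "2021-12-01", "2021-12-01", "2021-12-01",
           "2021-12-02", "2021-12-02",
           "2021-12-03", "2021-12-03", "2021-12-03"),
  animal_type = c("dog", "dog", "cat", "bird",
                  "dog", "bird",
                  "cat", "cat", "cat"),
  stringsAsFactors = FALSE
)

# Assumed full set of valid values for animal_type.
all_types <- c("dog", "cat", "bird")

# For each date, list the animal types that never appear on that date.
missing_by_date <- lapply(
  split(sales$animal_type, sales$date),
  function(seen) setdiff(all_types, seen)
)

# Keep only the dates where at least one type is missing.
missing_by_date <- Filter(length, missing_by_date)

print(missing_by_date)
```

The result is a named list keyed by date, which is close to the "date : [missing types]" printout the question asks for.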

Using lubridate with multiple date formats

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)
I do not think it is possible to convert your values to Dates and keep the "yyyy" values intact. By transforming your "yyyy" values into "yyyy-01-01", lubridate is doing the right thing: dates are ordered, and if other values in your column have days and months defined, all values need to have those components too.
For example, take the data.frame below. If I ask R to order the table by the dates column, does the value in the first row ("2020") come before or after the value in the second row ("2020-02-28")? The value "2020" stands for the whole year, so it could mean any day in that year. How should R treat it? By filling in the first day of the year, lubridate defines the missing components so that R has an unambiguous value to work with.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
id,
dates
)
id dates
1 1 2020
2 2 2020-02-28
3 3 2020-02-20
4 4 2020-01-10
5 5 2020-05-12
So if you want to keep the "yyyy" values intact, they most likely should not sit in your eventDate column alongside values that have a different structure ("mm/dd/yyyy"). If it really is necessary to keep these values intact, I think it is best to leave the eventDate column as character and store the parsed Dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
id eventDate as_dates
1 1 10/7/2001 2001-10-07
2 2 1989 1989-01-01
3 3 <NA> <NA>
4 4 5/5/2016 2016-05-05
5 5 9/18/2011 2011-09-18
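If you also need to remember which rows were year-only, one option is to carry a flag column next to the parsed dates. Below is a small base R sketch of that idea, using as.Date instead of lubridate; the year_only column name is made up for illustration.

```r
# Data frame from the question's edit.
id <- 1:5
eventDate <- c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate, stringsAsFactors = FALSE)

# Flag values that consist of exactly four digits (a bare year).
year_only <- !is.na(df$eventDate) & grepl("^[0-9]{4}$", df$eventDate)

# Parse the month/day/year values; anything else becomes NA for now.
df$as_dates <- as.Date(df$eventDate, format = "%m/%d/%Y")

# Fill in the bare years as January 1st, mirroring what lubridate did.
df$as_dates[year_only] <- as.Date(paste0(df$eventDate[year_only], "-01-01"))

# Keep the flag so the reduced precision is not lost.
df$year_only <- year_only

print(df)
```

With the flag in place you can sort and compute on as_dates, and still print just the year for the flagged rows, for example with format(df$as_dates, "%Y").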

creating columns of monthly averages in R

I have a dataframe in R where each row corresponds to a household. One column describes a date in 2010 when that household planted crops. The remainder of the dataset contains over 1000 columns describing the temperature on every day between 2007-2010 for those households.
This is the basic form:
Date 2007-01-01 2007-01-02 2007-01-03
1 2010-05-01 70 72 61
2 2010-02-10 63 59 73
3 2010-03-06 60 59 81
I need to create columns for each household that describe the monthly mean temperatures of the two months following their planting date in each of the three years prior to 2010.
For instance: if a household planted on 2010-05-01, I would need the following columns:
mean temp of 2007-05-01 through 2007-06-01
mean temp of 2007-06-02 through 2007-07-01
mean temp of 2008-05-01 through 2008-06-01
...
mean temp of 2009-06-02 through 2009-07-01
I skipped two columns, but you get the idea. Specific code would be most helpful, but in general, I am just looking for a way to pull data from specific columns based upon a date that is described by another column.
Hi @bricevk, you could use the apply function. It lets you apply a function over your data either column-wise or row-wise.
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply
Say your data is in an object df. The call below applies the mean function over the columns of df, giving you the column-wise mean (the 2 indicates columns). This would be the daily average across households, assuming each column is a day.
Averages <- apply(df,2,mean)
If I didn't answer this the way you would like perhaps I have not really understood your dataset. Could you try explain it more clearly?
I suggest you use the tidyverse. However, to be compatible with it, you first have to make your data tidy. In your example, things would be easier if you transformed your data so that observations are ordered by rows and columns are variables. If I understood your data correctly, you have households planting crops (are the row names the planting dates?), and then temperature measurements. I'd aim for something like:
-----------------------------------------------------------------------------
| Household ID | planting date | Date of control | Temperature controlled |
-----------------------------------------------------------------------------
First, store your planting date as something other than a row name, for example:
library(dplyr)
df <- tibble::rownames_to_column(data, "PlantingDate")
You also need a household ID variable (called ID below), which you haven't described to us.
Then you can get the tidy data with tidyr, using
library(tidyr)
# every remaining column is a day; stack them into DateOfControl / Temperature
df <- gather(df, "DateOfControl", "Temperature", -c(PlantingDate, ID))
Once you have that, you can use the lubridate package. Something like
library(lubridate)
df %>%
  # DateOfControl holds the old column names as text, so parse it first
  group_by(ID, PlantingDate, Year = year(ymd(DateOfControl)), Month = month(ymd(DateOfControl))) %>%
  summarise(MeanT = mean(Temperature))
could work.
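Note that the grouping above gives calendar-month means, while the question asks for windows that start on the planting day. Here is a rough base R sketch of that windowed version for a single household and a single year. The window_means helper, the temps_row vector, and its values are all hypothetical; it assumes the daily columns are named in yyyy-mm-dd form as in the question.

```r
# Hypothetical helper: mean temperature for the two one-month windows that
# follow the planting day, shifted into an earlier year.
# temps_row is a named numeric vector whose names are yyyy-mm-dd dates.
window_means <- function(temps_row, planting_date, year) {
  # Same month and day as the planting date, but in the requested year.
  start <- as.Date(paste0(year, format(planting_date, "-%m-%d")))
  mid <- seq(start, by = "month", length.out = 2)[2]  # start + 1 month
  end <- seq(start, by = "month", length.out = 3)[3]  # start + 2 months

  # Column names for each window, e.g. 2007-05-01 through 2007-06-01,
  # then 2007-06-02 through 2007-07-01.
  first <- format(seq(start, mid, by = "day"))
  second <- format(seq(mid + 1, end, by = "day"))

  c(month1 = mean(temps_row[first]), month2 = mean(temps_row[second]))
}

# Made-up temperatures for one household: 1, 2, 3, ... for each day.
days <- seq(as.Date("2007-05-01"), as.Date("2007-07-01"), by = "day")
temps_row <- setNames(as.numeric(seq_along(days)), format(days))
planting_date <- as.Date("2010-05-01")

res <- window_means(temps_row, planting_date, 2007)
print(res)
```

To cover all households and years, you could call this helper inside a loop or sapply over rows and over 2007:2009. Planting dates near the end of a month may need extra care, since adding a month to such a date is ambiguous.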

Filtering Data based on another dataframe based on two rows

I have two Datasets.
The first dataset includes Companies, the Quarter and the corresponding value from the whole timespan.
Quarter Date Company value
2012.1 2012-12-28 x 1
2013.1 2013-01-02 y 2
2013.1 2013-01-03 z 3
Companies again are in the dataset over the whole time and show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which it existed in the index (Companies can be in the index in multiple quarters).
Quarter Date Company value
2012.1 2012-12-28 x 1
2014.1 2013-01-02 y 2
2013.1 2013-01-03 x 3
Now I need to only select the companies which are in the index at the same time (quarter) as I have data from the first dataset.
In the example above I would need the data from company x in both quarters, but company y needs to get kicked out because the data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!
Have a look at dplyr's powerful join functions. Here inner_join might help you
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
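As for why the attempts in the question misbehave: Company == index$Company compares the two vectors element by element (row 1 with row 1, row 2 with row 2, recycling the shorter one), rather than asking whether each Company/Quarter pair appears anywhere in the index. A minimal base R sketch of the pairwise membership check is below. The two small data frames are invented for illustration, and the key helper is a hypothetical name.

```r
# Hypothetical data: company values by quarter.
data <- data.frame(
  Quarter = c("2012.1", "2013.1", "2013.1", "2013.1"),
  Company = c("x", "y", "x", "z"),
  value = 1:4,
  stringsAsFactors = FALSE
)

# Hypothetical index membership: which company was in the index in which quarter.
index <- data.frame(
  Quarter = c("2012.1", "2014.1", "2013.1"),
  Company = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Build one combined key per row so that both columns must match together.
key <- function(d) paste(d$Company, d$Quarter, sep = "|")

# Keep only the rows of data whose Company/Quarter pair exists in the index.
filtered <- data[key(data) %in% key(index), ]

print(filtered)
```

Unlike a join, this keeps only the columns of the first data set, so you don't end up with duplicated Date or value columns from the index.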

R Display by Unique Value and Frequency

I have a dataset like the one below, except with around 5 million observations. In earlier code I have already filtered the dates by the time they were recorded, so that only calls made during working hours are included. Now I want to separate the dates by WORKERCALL_ID in order to see a list of all the unique dates for each worker and the number of times each WORKERCALL_ID shows up on each date (number of calls per date, separated by WORKERCALL_ID). I tried to do this using a contingency matrix and then converting it to a data frame, but the file is so large that my R session always aborts. Does anyone have any idea how to accomplish this?
WORKERCALL_ID DATE
124789244 02-01-2014
128324834 05-01-2014
124184728 06-10-2014
An example of the desired output is below, for each WORKERCALL_ID and date. My end goal is to be able to subset the result and remove the rows/IDs with a high frequency of calls.
WORKERCALL_ID DATE FREQ
124789244 02-01-2014 4
124789244 02-23-2014 1
Two options :
table(df$WORKERCALL_ID, df$DATE)
Or, using dplyr (also including the requested added filtering out for IDs that have any cases of frequency higher than 5):
df %>% group_by(WORKERCALL_ID, DATE) %>% summarize(freq=n()) %>% group_by(WORKERCALL_ID) %>%
filter(!any(freq>5))
Example:
rbind(as.data.frame(df),data.frame(WORKERCALL_ID=128324834, DATE="Moose",freq=6,stringsAsFactors = FALSE)) %>% group_by(WORKERCALL_ID) %>% filter(!any(freq>5))
# A tibble: 2 x 3
# Groups: WORKERCALL_ID [2]
WORKERCALL_ID DATE freq
<dbl> <chr> <dbl>
1 124184728. 06-10-2014 1.
2 124789244. 02-01-2014 1.
Note how ID 128324834 is removed from the final result.
I would use dplyr::count
library(dplyr)
count(df,WORKERCALL_ID,DATE)
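If dplyr isn't available, a base R sketch along the same lines uses aggregate, which only produces rows for ID/date combinations that actually occur (unlike a full contingency table). The calls data frame and the threshold of 1 call per date are made up for illustration.

```r
# Hypothetical call log: one row per call.
calls <- data.frame(
  WORKERCALL_ID = c(124789244, 124789244, 124789244, 128324834, 124184728),
  DATE = c("02-01-2014", "02-01-2014", "02-23-2014", "05-01-2014", "06-10-2014"),
  stringsAsFactors = FALSE
)

# Count calls per worker and date by summing a column of ones.
calls$FREQ <- 1L
freq <- aggregate(FREQ ~ WORKERCALL_ID + DATE, data = calls, FUN = sum)
freq <- freq[order(freq$WORKERCALL_ID, freq$DATE), ]

# Drop every worker whose busiest date exceeds the (assumed) threshold.
max_calls <- 1
low_freq <- freq[ave(freq$FREQ, freq$WORKERCALL_ID, FUN = max) <= max_calls, ]

print(freq)
print(low_freq)
```

On 5 million rows this may still be slow, so treat it as a starting point rather than a tuned solution.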
