What's the difference between a double and a numeric? - r

I'd like to store quarterly data using a number representation where the left side represents the year and the right the quarter.
This is my code:
library(tidyverse)
library(fpp)
ausbeer %>%
as_tibble %>%
select(megalitres = x) %>%
mutate(year = as.double(seq(1956, 2009, 0.25)[1:211]))
For some reason, it will only show the year as integers, and it won't show the decimals.
I've checked and it's the right data underneath but I'm having a hard time making it show up.
I don't want to code them as characters because that will make visualization more difficult.

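I guess this has to do with converting your data.frame into a tibble: the stored values are unchanged, but a tibble prints only about three significant digits by default. Replicating your code on the mtcars dataset, we get: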
mtcars %>%
as_tibble() %>%
mutate(year = as.double(seq(1956, 2009, 0.25)[1:nrow(mtcars)])) %>%
dplyr::select(year) %>%
head
# year
# <dbl>
# 1 1956
# 2 1956.
# 3 1956.
# 4 1957.
# 5 1957
# 6 1957.
Here's the difference if we comment out as_tibble():
# year
# 1 1956.00
# 2 1956.25
# 3 1956.50
# 4 1956.75
# 5 1957.00
# 6 1957.25
Swapping as.double with as.numeric does not change anything.
From ?double:
as.double is a generic function. It is identical to as.numeric.
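To check that the decimals are still stored, and that "double" and "numeric" refer to the same storage type, here is a minimal base R sketch. The variable names are only for illustration. If you want the tibble itself to print more digits, the pillar package that formats tibbles has an option for this, options(pillar.sigfig = 7); it is left out of the runnable part because it needs the tidyverse installed.

```r
# First six quarters, stored as doubles
x <- seq(1956, 2009, 0.25)[1:6]

storage_type <- typeof(x)   # internal storage type
storage_mode <- mode(x)     # "numeric" covers both integer and double

# as.double() and as.numeric() give the same result
same_result <- identical(as.double(x), as.numeric(x))

# An integer is numeric, but it is not a double
int_is_numeric <- is.numeric(1L)
int_is_double <- is.double(1L)

# format() shows the decimals regardless of how a tibble prints them
shown <- format(x, nsmall = 2)
print(shown)
```

The printed strings are for display only; keep the column itself as a double so it can still be used on a plot axis.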

Related

How to use group_by without ordering alphabetically?

I'm trying to visualize some bird data, however after grouping by month, the resulting output is out of order from the original data. It is in order for December, January, February, and March in the original, but after manipulating it results in December, February, January, March.
Any ideas how I can fix this or sort the rows?
This is the code:
BirdDataTimeClean <- BirdDataTimes %>%
group_by(Date) %>%
summarise(Gulls=sum(Gulls), Terns=sum(Terns), Sandpipers=sum(Sandpipers),
Plovers=sum(Plovers), Pelicans=sum(Pelicans), Oystercatchers=sum(Oystercatchers),
Egrets=sum(Egrets), PeregrineFalcon=sum(Peregrine_Falcon), BlackPhoebe=sum(Black_Phoebe),
Raven=sum(Common_Raven))
BirdDataTimeClean2 <- BirdDataTimeClean %>%
pivot_longer(!Date, names_to = "Species", values_to = "Count")
You haven't shared any workable data, but I run into this often when reading from CSV, where all dates and data come in as character.
As suggested, please convert the date column to Date format using the lubridate package or base as.Date(); then arrange() in dplyr will work, and so will group_by().
Example with toy data:
library(data.table)
library(dplyr)
library(lubridate)
birds <- data.table(dates = c("2020-Feb-20","2020-Jan-20","2020-Dec-20","2020-Apr-20"),
species = c('Gulls','Terns','Gulls','Sandpiper'),
Counts = c(20,30,40,50))
str(birds) will show that dates is character (and I have not kept the rows in order).
Use lubridate to convert the dates:
birds$dates %>% lubridate::ymd() will change them to the Date data type:
birds$dates %>% ymd() %>% str()
Date[1:4], format: "2020-02-20" "2020-01-20" "2020-12-20" "2020-04-20"
Save it with birds$dates <- ymd(birds$dates), or do it in your pipeline as follows.
Now simply do the dplyr analysis:
birds%>%group_by(Months= ymd(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
This will give:
# A tibble: 4 x 3
Months N Species_Count
<date> <int> <dbl>
1 2020-01-20 1 30
2 2020-02-20 1 20
3 2020-04-20 1 50
4 2020-12-20 1 40
However, if you want month names such as Apr or Jan instead of numbers and apply format() to the dates, they become character again and sort alphabetically. I would suggest keeping the data as dates and formatting only when presenting output to others, for example with format() or, if you are using DT or another table package, with its output formatting options. That way your original data stays intact and users see what they want.
Grouping by the character version sorts alphabetically:
birds%>%group_by(Months= as.character.Date(dates))%>%
summarise(N=n()
,Species_Count = sum(Counts)
)%>%arrange(Months)
# A tibble: 4 x 3
Months N Species_Count
<chr> <int> <dbl>
1 2020-Apr-20 1 50
2 2020-Dec-20 1 40
3 2020-Feb-20 1 20
4 2020-Jan-20 1 30
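If the grouping column has to stay as month names, another option is to turn it into a factor whose levels follow the order of first appearance, so that grouping no longer falls back to alphabetical order. A minimal base R sketch, assuming made-up month names and counts rather than the asker's data:

```r
# Made-up data: months appear in the desired (non-alphabetical) order
months <- c("December", "January", "February", "March", "December")
counts <- c(5, 3, 2, 4, 1)

# Levels follow order of first appearance instead of the alphabet
month_f <- factor(months, levels = unique(months))

# Sum counts per month; the result keeps the factor level order
totals <- tapply(counts, month_f, sum)
print(totals)
```

dplyr's group_by() also respects factor levels, so the same factor() call inside a mutate() before group_by() should keep December first there as well.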

Trying to convert month number to month name in a data set

I'm getting NA values when I try to replace the month numbers with month names using the code below:
total_trips_v2$month <- ordered(total_trips_v2$month, levels=c("Jul","Aug","Sep","Oct", "Nov","Dec","Jan", "Feb", "Mar","Apr","May","Jun"))
I'm working with a big data set where the month column is of character type and the months are numbered '06', '07' and so on, starting with 06.
I'm not quite sure what the ordered() function in my code really does. I saw it somewhere and used it. I tried to look up code to replace specific values in rows, but it looked very confusing.
Can anyone help me out with this?
Working with data types can be confusing at times, but it helps you with what you want to achieve. Thus, make sure you understand how to move from type to type!
There are some "helpers" built into R for working with months and month names.
Below we have a "character" vector in our data frame, i.e. df$month.
The helper vectors in R are month.name (full month names) and month.abb (abbreviated month names).
You can index a vector by calling the element of the vector at the n-th position.
Thus, month.abb[6] will return "Jun".
We use this to coerce the month to "numeric" and then recode it with the abbreviated names.
# simulating some data
df <- data.frame(month = c("06","06","07","09","01","02"))
# test index month name
month.abb[6]
# check what happens to our column vector - for this we coerce the 06,07, etc. to numbers!
month.abb[as.numeric(df$month)]
# now assign the result
df$month_abb <- month.abb[as.numeric(df$month)]
This yields:
df
month month_abb
1 06 Jun
2 06 Jun
3 07 Jul
4 09 Sep
5 01 Jan
6 02 Feb
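As for why the original ordered() call returned NA: ordered() builds an ordered factor, and any value that does not match one of the given levels becomes NA. The strings "06", "07" and so on match none of "Jul", "Aug", and so on. A minimal sketch that recodes first and then applies the July-to-June ordering from the question (the sample values are made up):

```r
# Made-up month codes as character, like the question's column
month_chr <- c("06", "07", "12", "01")

# Values that match no level become NA, which is what the question ran into
bad <- ordered(month_chr, levels = c("Jul", "Aug", "Sep"))

# Levels in the July-to-June order used in the question
fiscal_levels <- month.abb[c(7:12, 1:6)]

# Recode numbers to abbreviations first, then apply the ordering
good <- factor(month.abb[as.integer(month_chr)],
               levels = fiscal_levels, ordered = TRUE)

# Sorting now follows the custom level order, not the alphabet
sorted_months <- as.character(sort(good))
print(sorted_months)
```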
The lubridate package can also help you extract certain components of datetime objects, such as month number or name.
Here, I have made some sample dates:
tibble(
date = c('2021-01-01', '2021-02-01', '2021-03-01')
) %>%
{. ->> my_dates}
my_dates
# # A tibble: 3 x 1
# date
# <chr>
# 2021-01-01
# 2021-02-01
# 2021-03-01
The first thing we need to do is convert these character-formatted values to date-formatted values. We use lubridate::ymd() to do this:
my_dates %>%
mutate(
date = ymd(date)
) %>%
{. ->> my_dates_formatted}
my_dates_formatted
# # A tibble: 3 x 1
# date
# <date>
# 2021-01-01
# 2021-02-01
# 2021-03-01
Note that the format printed under the column name (date) has changed from <chr> to <date>.
Now that the dates are in <date> format, we can pull out different components using lubridate::month(). See ?month for more details.
my_dates_formatted %>%
mutate(
month_num = month(date),
month_name_abb = month(date, label = TRUE),
month_name_full = month(date, label = TRUE, abbr = FALSE)
)
# # A tibble: 3 x 4
# date month_num month_name_abb month_name_full
# <date> <dbl> <ord> <ord>
# 2021-01-01 1 Jan January
# 2021-02-01 2 Feb February
# 2021-03-01 3 Mar March
As in my answer to your other question: when working with dates in R, it is good to leave them in the default YYYY-MM-DD format. This generally makes calculations and manipulations more straightforward. The month names shown above are useful for labels, for example when making figures and labelling data points or axes.
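If lubridate is not available, roughly the same components can be pulled out with base R. This is a sketch of an alternative, not what the answer above uses; it combines as.Date(), format() and the built-in month.abb and month.name vectors:

```r
# Same sample dates as above, parsed with base R
d <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))

# "%m" gives the zero-padded month number as text; convert it to integer
month_num <- as.integer(format(d, "%m"))

# Index the built-in English month vectors with the month number
month_name_abb <- month.abb[month_num]
month_name_full <- month.name[month_num]

print(data.frame(date = d, month_num, month_name_abb, month_name_full))
```

Unlike lubridate::month(label = TRUE), this returns plain character vectors rather than ordered factors, so wrap them in factor(..., levels = month.abb) if the calendar order matters for plotting.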

Can't convert chr to numeric in RStudio

I get NAs when I try to convert to numeric values (see below).
I'm supposed to turn these annual data frames into monthly ones. To do this I need to make the numbers numeric, but I get NAs when I try. Does anyone know why?
When you unlist() the data frame, it turns it into a vector. Here's a couple of lines of the data that I can see from your post (with shorter variable names).
TBS <- tibble::tibble(
desc = c("1934-01", "1934-02"),
rate = c("0.72", "0.6")
)
unlist(TBS)
# desc1 desc2 rate1 rate2
# "1934-01" "1934-02" "0.72" "0.6"
When you do as.numeric() on that vector, it turns the dates into missing. I think that's what the output above in your RStudio window shows us.
as.numeric(unlist(TBS))
# [1] NA NA 0.72 0.60
You're probably better off just fixing the variables in place in the data frame, like this:
library(zoo)
library(lubridate)
library(dplyr)
TBS <- TBS %>%
mutate(desc = as.yearmon(desc),
year = year(desc),
rate = as.numeric(rate))
TBS
# A tibble: 2 x 3
# desc rate year
# <yearmon> <dbl> <dbl>
# 1 Jan 1934 0.72 1934
# 2 Feb 1934 0.6 1934
Then you could do whatever you need (e.g., average) over the years. If it was just a straight average, you could do:
TBS %>%
group_by(year) %>%
summarise(mean_rate = mean(rate))
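For comparison, here is a base R sketch of the same idea without zoo, lubridate or dplyr. It assumes the desc column always starts with a four-digit year, as in the two rows visible in the post, and adds a made-up third row so that there is more than one year to average:

```r
# Sample rows in the shape shown above; the 1935 row is made up
TBS <- data.frame(
  desc = c("1934-01", "1934-02", "1935-01"),
  rate = c("0.72", "0.6", "0.5"),
  stringsAsFactors = FALSE
)

# Convert only the column that actually holds numbers
TBS$rate <- as.numeric(TBS$rate)

# Take the year from the first four characters of desc
TBS$year <- as.integer(substr(TBS$desc, 1, 4))

# Average rate per year
annual <- aggregate(rate ~ year, data = TBS, FUN = mean)
print(annual)
```

Converting column by column avoids the unlist() step that mixed the date strings in with the rates.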

R - create a timeseries from filenames

I have 900 files named like 20120412_bwDD2yYa.txt. The first part up to the _ is in the year-month-day format. Some days have multiple files associated with them.
I'd like to use the dates extracted from the file names as data to compile a timeseries where the dates are the x axis and the number of files are the y axis.
How can I do this?
Here is a solution with Base R. Since the question does not include a reproducible example, we'll simulate the file names, parse out the dates, and create the counts by date.
# use list.files() to list the files in the directory (pattern is a regular expression)
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate result from list.files()
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
# extract dates from file names
date <- as.Date(substr(files,1,8),"%Y%m%d")
df <- data.frame(date,count = rep(1,length(date)))
aggregate(count ~ date,data = df, sum)
...and the output:
date count
1 2012-01-01 2
2 2012-01-02 1
dplyr solution
A solution with dplyr::summarise() looks like this:
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate result from list.files()
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
library(dplyr)
data.frame(date=as.Date(substr(files,1,8),"%Y%m%d")) %>%
group_by(date) %>% summarise(count = n())
# A tibble: 2 x 2
date count
<date> <int>
1 2012-01-01 2
2 2012-01-02 1
Accounting for dates with no files
In response to a comment on my answer, here is a solution that fills in gaps in the file list where there are days with 0 files. We take the minimum and maximum dates from the file list and create a data frame containing the sequence of dates. Then we left_join() this with the previously aggregated data, and recode NA values for count to 0.
# create a gap in dates with files
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt",
"20120104_aaa.txt","20120104_aab.txt","20120104_aac.txt")
library(dplyr)
data.frame(date=as.Date(substr(files,1,8),"%Y%m%d")) %>%
group_by(date) %>% summarise(count = n()) -> fileCounts
# create df with all dates, left_join() and recode NA to 0
data.frame(date = as.Date(min(fileCounts$date):max(fileCounts$date),
origin = "1970-01-01")) %>%
left_join(.,fileCounts) %>%
mutate(count = if_else(is.na(count),0,as.numeric(count)))
...and the output:
Joining, by = "date"
date count
1 2012-01-01 2
2 2012-01-02 1
3 2012-01-03 0
4 2012-01-04 3
You can use table to count frequencies and then stack it to get a dataframe.
Using @Len Greski's files:
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
stack(table(as.Date(sub('_.*', '', files),"%Y%m%d")))[2:1]
# ind values
#1 2012-01-01 2
#2 2012-01-02 1
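The table() approach can also cover days without files if the dates are first turned into a factor whose levels span every day in the range. A minimal base R sketch using simulated file names with a gap (the names all_days and ts_df are only for illustration):

```r
# Simulated file names with a gap on 2012-01-03
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt",
           "20120104_aaa.txt", "20120104_aab.txt", "20120104_aac.txt")

# Parse the date part of each file name
dates <- as.Date(substr(files, 1, 8), "%Y%m%d")

# Every day between the first and last file date
all_days <- seq(min(dates), max(dates), by = "day")

# A factor with one level per day makes table() report zeros for empty days
counts <- table(factor(as.character(dates), levels = as.character(all_days)))

ts_df <- data.frame(date = all_days, count = as.integer(counts))
print(ts_df)

# To draw the series with dates on the x axis (uncomment to plot):
# plot(ts_df$date, ts_df$count, type = "h", xlab = "Date", ylab = "Files")
```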

Recoding Dates in nested data to continuous long file with for loop in R

I am struggling a little with the logic for recoding nested data into a long "continuous" format based on dates in R
Below is a dummy example of my data. I have three sets of dates The start and stop time for a participant that is stored in long format, and then the start of another incident that is stored as wide data.
GC_ID   HMIS_Start   HMIS_Stop   CPS Start   CPS Start 2   CPS Start 3
------- ------------ ----------- ----------- ------------- -------------
1       1/10/14      1/20/14     1/15/14     6/2/14        NA
1       4/10/14      5/30/14     1/15/14     6/2/14        NA
1       12/1/14      12/2/14     1/15/14     6/2/14        NA
1       1/1/15       2/28/15     1/15/14     6/2/14        NA
2       8/13/13      8/17/14     NA          NA            NA
3       5/1/15       5/2/15      1/16/13     6/26/14       7/27/15
3       6/4/16       7/10/16     1/16/13     6/26/14       7/27/15
4       10/15/13     10/25/13    2/18/15     NA            NA
4       12/25/13     1/18/14     2/18/15     NA            NA
4       2/8/15       7/20/15     2/18/15     NA            NA
My goal is to create two long continuous variables, with one value for each month from August 2013 to December 2015. For the first variable, I would like to code a 1 for each month that falls within a participant's HMIS_Start and HMIS_Stop period AND has at least one CPS Start date within that month. The second variable would do a similar thing, but would flag a CPS Start date that happened in the month after the HMIS_Stop date.
So participant 1's data could look like this:
I assume I need to create a blank data set with the ID variable and a month/year variable. Then I would use a for loop over each ID to run an "if then" statement checking IF the month is later than the HMIS start and earlier than the HMIS stop AND the CPS start is within that month too.
I am mostly struggling with how to set up that process and use the for loop logically, given that the file already contains long data, with multiple rows per participant that need to be compared to all possible CPS start dates.
Any thoughts or code tips on how to tackle this?
I am not sure how you came to your answers, and I will update this code once that is provided. But I used library(tidyverse) and library(lubridate) for this:
dat <- data.frame(
  GC_ID = c(1,1,1,1,2,3,3,4,4,4),
  HMIS_Start = c("1/10/14", "4/10/14", "12/1/14", "1/1/15", "8/13/13", "5/1/15", "6/4/16", "10/15/13", "12/25/13", "2/8/15"),
  HMIS_Stop = c("1/20/14", "5/30/14", "12/2/14", "2/28/15", "8/17/14", "5/2/15", "7/10/16", "10/25/13", "1/18/14", "7/20/15"),
  CPS_Start = c("1/15/14", "1/15/14", "1/15/14", "1/15/14", NA, "1/16/13", "1/16/13", "2/18/15", "2/18/15", "2/18/15"),
  CPS_Start_2 = c("6/2/15", "6/2/15", "6/2/15", "6/2/15", NA, "6/26/14", "6/26/14", NA, NA, NA),
  CPS_Start_3 = c(NA, NA, NA, NA, NA, "7/27/15", "7/27/15", NA, NA, NA)
)
dats <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, ~as.Date(., format = "%m/%d/%y")) %>%
gather(Var, Dates, -GC_ID, -HMIS_Start, -HMIS_Stop) %>%
filter(!is.na(Dates)) %>%
mutate(HMIS_CPS_SAME = if_else(month(HMIS_Start) == month(HMIS_Stop) &
year(HMIS_Start) == year(HMIS_Stop) &
month(HMIS_Start) == month(Dates) &
year(HMIS_Start) == year(Dates), 1, 0 ),
CPS_After = if_else(month(HMIS_Stop) + 1 == month(Dates) &
year(HMIS_Stop) == year(Dates), 1,0 ),
Months = month(HMIS_Start),
Years = year(HMIS_Start)) %>%
arrange(GC_ID, HMIS_Start, Dates) %>%
group_by(GC_ID, Months, Years) %>%
summarise(HMIS_CPS_SAME = max(HMIS_CPS_SAME),
CPS_After = max(CPS_After)) %>%
ungroup()
full_dat <- merge(
  data.frame(GC_ID = unique(dat$GC_ID)),
  data.frame(Dates = seq.Date(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month"))
) %>%
mutate(Months = month(Dates), Years = year(Dates)) %>%
left_join(dats, by = c("GC_ID", "Months", "Years")) %>%
mutate_if(is.numeric , replace_na, replace = 0)
First I created the data as an R data frame. Then I converted the five date columns you mentioned to Date format. I made the data long to do the comparisons specified, then took the max for each GC_ID, Months, Years. Next I used a Cartesian join (merge() with no common columns) of every month and GC_ID, extracted the months and years, and joined dats to full_dat by GC_ID, Months, Years. The last mutate_if converts all NA values to 0. No looping needed! :-)
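Since the expected output was not provided, here is a small base R sketch of one possible reading of the two flags, for a single participant. It is an interpretation, not the asker's confirmed rule: "during" means the month overlaps any HMIS spell and contains a CPS start, and "after" means the month directly follows an HMIS stop month and contains a CPS start. The helper names ym_index and flag_months are hypothetical, and the dates are the first two spells of participant 1 from the question's table.

```r
# Hypothetical helper: month counter (year * 12 + zero-based month) for Dates
ym_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900) * 12 + lt$mon
}

# Hypothetical helper: flags per month for one participant
flag_months <- function(months, start, stop, cps) {
  m <- ym_index(months)
  s <- ym_index(start)
  e <- ym_index(stop)
  has_cps <- m %in% ym_index(cps[!is.na(cps)])
  # month overlaps at least one HMIS spell
  in_spell <- vapply(m, function(k) any(s <= k & k <= e), logical(1))
  data.frame(
    month = months,
    cps_during = as.integer(in_spell & has_cps),
    cps_after = as.integer(m %in% (e + 1) & has_cps)
  )
}

months <- seq(as.Date("2014-01-01"), as.Date("2014-06-01"), by = "month")
start <- as.Date(c("2014-01-10", "2014-04-10"))
stop <- as.Date(c("2014-01-20", "2014-05-30"))
cps <- as.Date(c("2014-01-15", "2014-06-02", NA))

flags <- flag_months(months, start, stop, cps)
print(flags)
```

To cover all participants, the same function could be applied per GC_ID, for example with split() and lapply(), once the rule itself has been confirmed.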
