I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id     when             time      day  month  year  handle  what
UK1.1  Sat Feb 20 2016  12:34:02  20   2      2016  dave    Great goal by #lfc
UK1.2  Sat Feb 20 2016  15:12:42  20   2      2016  john    Can't wait for the weekend
UK1.3  Sat Mar 01 2016  12:09:21  1    3      2016  smith   Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
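To tie this back to the question's own columns, here is a minimal base-R sketch. It assumes a data frame shaped like the sample in the question (`day`, `month`, `year`, `what`); the name `tweets` and the derived columns `date`, `has_lfc` and `share_lfc` are made up for illustration. It builds a real Date column, flags tweets that contain the hashtag with `grepl`, and computes the daily share of matching tweets.

```r
# Sample rows modelled on the question's corpus (only the columns needed here)
tweets <- data.frame(
  day   = c(20, 20, 1),
  month = c(2, 2, 3),
  year  = c(2016, 2016, 2016),
  what  = c("Great goal by #lfc", "Can't wait for the weekend", "Generic boring tweet"),
  stringsAsFactors = FALSE
)

# Build a Date from the numeric day/month/year columns
tweets$date <- as.Date(ISOdate(tweets$year, tweets$month, tweets$day))

# Flag tweets containing the hashtag (fixed = TRUE matches "#lfc" literally)
tweets$has_lfc <- grepl("#lfc", tweets$what, fixed = TRUE)

# The mean of a logical vector is the share of TRUE values, so this gives
# matching tweets divided by all tweets for each day
daily <- aggregate(has_lfc ~ date, data = tweets, FUN = mean)
names(daily)[2] <- "share_lfc"
daily
```

Because `daily$date` is a Date, it can go straight onto the x axis of a plot, and February 7th sorts before February 13th as expected.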
Related
I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO =zoo(test[, c("abc1", "def2", "xyz3")], order.by=as.Date(paste0(test$month, "01"), format="%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
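On the question of dynamically naming variables in a loop: one common alternative is to skip separate variables and keep the series in a named list instead. The sketch below assumes the first column of `df` is the month and every other column is a monthly series starting in January 2012; the `_ts` suffix follows the question, and `ts_list` is a made-up name.

```r
df <- data.frame(month = c(201201, 201202, 201203, 201204),
                 abc1  = c(1, 2, 4, 6),
                 def2  = c(2, 5, 11, 23),
                 xyz3  = c(4, 7, 4, 40))

# One ts object per data column, collected in a named list
ts_list <- lapply(df[-1], ts, start = c(2012, 1), frequency = 12)
names(ts_list) <- paste0(names(ts_list), "_ts")

# Individual series are then available by name
ts_list$abc1_ts
```

With the full data (around 100 monthly observations per column), something like `lapply(ts_list, decompose)` could then be applied to every series. Note that `decompose` needs at least two full periods, so it will not run on this four-row sample.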
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series showing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one for each of the 48 months) represent the time to maturity of each loan as of that month.
I would really appreciate your help. Thanks.
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
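If the division by 30.5 feels too approximate, a different technique is to count whole calendar months and ignore the day of the month. The base-R sketch below is an alternative, not the approach used above; the helper `calendar_months_between` is a hypothetical name, and subtracting 1 is an assumption chosen because it reproduces the figures in the question's expected table for these three loans.

```r
# Hypothetical helper: calendar months from `from` to `to`, ignoring day of month
calendar_months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  (to$year - from$year) * 12 + (to$mon - from$mon)
}

obs_months <- seq(as.Date("2013-02-01"), by = "month", length.out = 3)
maturity   <- as.Date(c("2017-01-03", "2013-01-08", "2015-02-13"))

# Rows = loans, columns = observation months
remaining <- sapply(seq_along(obs_months), function(i) {
  calendar_months_between(obs_months[i], maturity) - 1
})

# Loans that have already matured get NA
remaining[remaining < 0] <- NA
remaining
```

Extending `length.out` to 48 would give the full set of columns, which could then be bound onto `loan_data` with `cbind`.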
I have a dataset of 2 years of user text messages - 2015 and 2016 (135,000). I am trying to identify new users to this program for February 2016 (based on subscriber_id and entity=="subscribe-online").
The wrinkle is that a new user is one where the subscriber_id has not occurred in the data within the past 12 months. So, for example, if I have the following sample data:
created subscriber_id cellnum entity message msgtxt
2015-21-01 14:03:00 15855 7788826943 tip 100 end
2015-07-12 14:03:00 15839 7788815940 tip 24 tip 24
2015-08-12 14:03:00 15839 7788815940 stop 99 stop
2016-01-01 14:05:00 15800 2508816941 tip 25 tip 25
2016-02-01 16:05:00 15800 2508816941 tip 26 tip 26
2016-03-01 14:05:00 15800 2508816941 tip 27 tip 27
2016-01-02 14:03:00 15855 7788826943 subscribe-online 1 msg 1
2016-01-02 14:03:00 15839 7788815940 subscribe-online 1 msg 1
15855 and 15839 both subscribe on February 1. I want to be able to assign 15855 as a new user based on the fact that the last occurrence of the subscriber_id 15855 was on Jan 21, 2015 - more than 12 months. I would like to assign 15839 as a repeat user since their last occurrence was on December 8th, 2015 (less than 12 months).
The created (date) field is in POSIXct, format. I have been trying to understand loops and sapply and tapply to see how I could use this here. Any help would be greatly appreciated. Thanks.
Here is a potential solution using dplyr
library(dplyr)
df <- data.frame(created = c("2015-21-01 14:03:00","2015-07-12 14:03:00","2015-08-12 14:03:00","2016-01-01 14:05:00","2016-02-01 16:05:00","2016-03-01 14:05:00","2016-01-02 14:03:00","2016-01-02 14:03:00"),
subscriber_id = c(15855,15839,15839,15800,15800,15800,15855,15839),
cellnum = c(7788826943,7788815940,7788815940,2508816941,2508816941,2508816941,7788826943,7788815940),
entity = c("tip","tip","stop","tip","tip","tip","subscribe-online","subscribe-online"),
message = c("100","24","99","25","26","27","1","1"),
msgtxt = c("end","tip 24","stop","tip 25 ","tip 26 ","tip 27 ","msg 1","msg 1"),
stringsAsFactors = FALSE
)
df$created <- as.POSIXct(df$created, format = "%Y-%d-%m %H:%M:%S")
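df <- df %>%
  arrange(subscriber_id, created) %>%
  group_by(subscriber_id) %>%
  # new_user: TRUE = no record in the previous 365 days, FALSE = repeat user, NA = not a subscribe event
  mutate(days_since_last = as.numeric(difftime(created, lag(created), units = "days")),
         new_user = if_else(entity == "subscribe-online", is.na(days_since_last) | days_since_last > 365, NA))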
I have data that includes dates (dd/mm/yyyy) and am wanting to summarise the data by year. I'm sure that there is an easier way to do it but the route that I've taken is to try to create a new categorical variable using the "cut" function.
For example:
# create sample dataframe
dates<-c("01/01/2013", "01/02/2013", "01/01/2014", "01/02/2014", "01/01/2015", "01/02/2015")
cases<-c(3,5,2,6,8,4)
df<-as.data.frame(cbind(dates, cases))
df$dates <- as.Date(df$dates,"%d/%m/%Y")
# categorise by year
df$year <- cut(df$dates, c(2013-01-01, 2013-12-31, 2014-12-31, 2015-12-31))
This gives an error:
invalid specification of 'breaks'
How do I tell R to cut at various "date" intervals? Is my approach to this all wrong? Still new to R (sorry about the basic question).
Greg
What should your output look like?
Your code works when you define your breaks with as.Date:
# cut() on dates uses left-closed intervals, so break on the first day of each year
breaks <- as.Date(c("2013-01-01", "2014-01-01", "2015-01-01", "2016-01-01"))
# categorise by year
df$year <- cut(df$dates, breaks)
dates cases year
1 2013-01-01 3 2013-01-01
2 2013-02-01 5 2013-01-01
3 2014-01-01 2 2014-01-01
4 2014-02-01 6 2014-01-01
5 2015-01-01 8 2015-01-01
6 2015-02-01 4 2015-01-01
I'm guessing you want your variable year to look different, though? You can define labels when using cut:
# categorise by year
df$year <- cut(df$dates, breaks, labels = c(2013, 2014, 2015))
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
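As a side note, `cut` for Date objects also accepts an interval specification such as `"year"` for `breaks`, which avoids listing the break dates by hand. The sketch below rebuilds the sample data with `data.frame(dates, cases)` (an assumption, so that `cases` stays numeric) and sums the cases per year; `totals` is a made-up name.

```r
dates <- as.Date(c("01/01/2013", "01/02/2013", "01/01/2014",
                   "01/02/2014", "01/01/2015", "01/02/2015"), format = "%d/%m/%Y")
cases <- c(3, 5, 2, 6, 8, 4)
df <- data.frame(dates, cases)

# breaks = "year" starts a new bin on 1 January of each year
df$year <- cut(df$dates, breaks = "year")

# Relabel the bins with the four-digit year only
levels(df$year) <- substr(levels(df$year), 1, 4)

# Total cases per year
totals <- aggregate(cases ~ year, data = df, FUN = sum)
totals
```

The same call with `breaks = "month"` or `breaks = "week"` would bin at a finer resolution.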
If you are just looking for the year, maybe this helps:
df$year <- format(df$dates, format="%Y")
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
I think the solutions based on cut are a bit overkill. You can use the year function from the lubridate package to extract the year from the date:
library(dplyr)
library(lubridate)
df %>% mutate(year = year(dates))
# dates cases year
# 1 2013-01-01 3 2013
# 2 2013-02-01 5 2013
# 3 2014-01-01 2 2014
# 4 2014-02-01 6 2014
# 5 2015-01-01 8 2015
# 6 2015-02-01 4 2015
lubridate is such an awesome package when it comes to dealing with time data.
After the year column is constructed you can apply all kinds of summaries. I use the dplyr style here:
# Note that as.numeric(as.character()) is needed as `cbind` forces `cases` to be a factor
df %>% mutate(year = year(dates), cases = as.numeric(as.character(cases))) %>%
group_by(year) %>% summarise(tot_cases = sum(cases))
# # A tibble: 3 × 2
# year tot_cases
# <dbl> <dbl>
# 1 2013 8
# 2 2014 8
# 3 2015 12
Note that group_by ensures that all operations after that are done per unique category mentioned there, in this case per year.
A simple solution would be using the dplyr package. Here is a simple example:
library(dplyr)
library(lubridate) # provides as_date() and year()
df_grouped <- df %>%
  mutate(
    dates = as_date(dates),
    # go via as.character() so that a factor is not turned into its level codes
    cases = as.numeric(as.character(cases))) %>%
  group_by(year = year(dates)) %>%
  summarise(tot_cases = sum(cases))
In the mutate statement we convert the variables to a more suitable format, in group_by we select which variable is going to do the grouping and in summarise we create any new variables that we want.
df_grouped looks like this:
# A tibble: 3 × 2
year tot_cases
<dbl> <dbl>
1 2013 8
2 2014 8
3 2015 12
The table consists of a Month/Year field, e.g. "January 2016".
How do I reorder a boxplot to display the x axis in date order (Jan 2016, Feb 2016, ...)? I tried the following code:
boxplot(YR$S~reorder(format(YR$MY,'%M %Y'),YR$MY),outline =FALSE)
IDX MY Day V Time G S W
24 January 2015 1 G 1821 6 11 71
25 January 2015 2 G 1600 9 15 1
26 January 2015 5 G 1700 5 14 64
27 January 2015 6 F 1805 3 14 4
28 January 2015 7 G 1716 3 15 45
29 January 2015 9 F 1910 3 8 38
As I mentioned above, the best recommendation depends on the format of the data and on how you want to bin it (e.g. monthly or daily). Below are the different approaches I would consider (they may not be the best way, but they get the job done):
#Sample data
string<-rep(c("January 2016", "February 2016", "March 2016"), 3)
day<-rep(c(1:3), each=3)
value<-runif(9,10, 20)
#data frame with string, int and float
df<-data.frame(string, day, value)
#Date as string
boxplot(df$value~df$string, las=2, main="String")
#undesirable - x axis not in date order
#Date as a Date Class
#convert to Date Class
#xdate<-as.Date(paste(df$string, day), format= "%B %Y %d")
#Need to convert everything to first of month to bin by month
xdate<-as.Date(paste(df$string, 1), format= "%B %Y %d")
b<-boxplot(df$value~xdate, las=2, main="Date", names=unique(months(xdate)))
#Good - may need work on x axis labels
#Date as a factor
#convert to factor
xfactor<-as.factor(df$string)
#sets the factor levels in calendar-month order (levels keep the "2016" suffix)
xfactor<-factor(xfactor, levels = paste(month.name, "2016"))
#remove unused levels
xfactor<-droplevels(xfactor)
boxplot(df$value~xfactor, las=2, main="factor")
#Good - may need work on x axis labels depending on the timeframe of interest
All three approaches have their pros and cons; the best one depends on the initial format of the data, how much data there is, how often the report is run, and the final result you need.
Hope this helps.
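When the data spans more than one year, the factor levels can also be derived from the strings themselves rather than hard-coding "2016". The sketch below is one way to do that, assuming the labels look like "January 2016" with English month names; the names `my`, `key` and `my_ordered` are made up for illustration. Matching against the built-in `month.name` avoids any dependence on the system locale.

```r
my <- c("March 2016", "January 2016", "February 2016", "January 2016", "December 2015")

# Split each label into month name and year
parts <- strsplit(my, " ", fixed = TRUE)
mon   <- match(sapply(parts, `[`, 1), month.name)  # month number 1..12
yr    <- as.integer(sapply(parts, `[`, 2))

# Sortable key of the form yyyymm, e.g. 201601 for January 2016
key <- yr * 100 + mon

# Factor whose levels follow date order instead of alphabetical order
my_ordered <- factor(my, levels = unique(my[order(key)]))
levels(my_ordered)
```

Using `my_ordered` as the grouping variable, for example `boxplot(value ~ my_ordered, las = 2)`, should then draw the boxes in chronological order while keeping the original labels.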
Convert your dates to class Date, so that boxplot orders the groups on the x axis chronologically instead of alphabetically:
y <- YR$S
oldloc <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "english")
x <- as.Date(with(YR, paste(MY, Day, sep = "-")), format = "%B-%Y-%d")
Sys.setlocale("LC_TIME", oldloc)
boxplot(y~x)
I set the locale to English so that R can parse the English month name "January" even when the system locale is a different language (in German, for example, the month is "Januar"). You may omit that if your locale is already English.
Data used:
YR <- read.table(header=T, text="
MY Day V Time G S W
February-2015 1 G 1821 6 11 71
January-2015 2 G 1600 9 15 1
January-2015 5 G 1700 5 14 64
January-2015 6 F 1805 3 14 4
January-2015 7 G 1716 3 15 45
January-2015 9 F 1910 3 8 38")
Thanks one and all for your responses.
Turns out that there is a much shorter and simpler solution. The library "Rlab" has a built-in box chart + binning function called "bplot". Here's a code example (MY = month-year field, S = number of sunspots):
library(Rlab)
bplot(MY,S)