I have a data frame like the following:
user_id track_id created_at
1 81496937 cd52b3e5b51da29e5893dba82a418a4b 2014-01-01 05:54:21
2 2205686924 da3110a77b724072b08f231c9d6f7534 2014-01-01 05:54:22
3 132588395 ba84d88c10fb0e42d4754a27ead10546 2014-01-01 05:54:22
4 97675221 33f95122281f76e7134f9cbea3be980f 2014-01-02 05:54:24
5 17945688 b5c42e81e15cd54b9b0ee34711dedf05 2014-01-02 05:54:24
6 452285741 8bd5206b84c968eda0af8bc86d6ab1d1 2014-01-02 05:54:25
I want to create a line chart in R showing the number of user_id across days. I want to know how many user_id are present per day and create a plot of that. How do I do it?
First of all, you should know how to process date and time in R. I strongly recommend the lubridate package.
library(lubridate)
t <- ymd_hms("20170621111800")
dt <- floor_date(t, unit='day')
dt
Then you need to learn how to manipulate a data frame in R. I usually use the dplyr package because it is quite simple to learn and the code is easy to read.
library(dplyr)
new_df <- df %>%
  # parse the timestamp first, then truncate it to the day
  mutate(dt = floor_date(ymd_hms(created_at), unit = 'day')) %>%
  group_by(dt) %>%
  summarise(user_cnt = n_distinct(user_id))
new_df
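To sanity-check the result, the same per-day count can be sketched in base R without any extra packages. This is only an illustration built from the six sample rows in the question; the column name `day` is an assumption introduced here.

```r
# Sample rows copied from the question
df <- data.frame(
  user_id = c(81496937, 2205686924, 132588395, 97675221, 17945688, 452285741),
  created_at = c("2014-01-01 05:54:21", "2014-01-01 05:54:22",
                 "2014-01-01 05:54:22", "2014-01-02 05:54:24",
                 "2014-01-02 05:54:24", "2014-01-02 05:54:25"),
  stringsAsFactors = FALSE
)

# Keep only the "YYYY-MM-DD" part of the timestamp (hypothetical column `day`)
df$day <- as.Date(substr(df$created_at, 1, 10))

# Count distinct users per day
daily <- aggregate(user_id ~ day, data = df,
                   FUN = function(x) length(unique(x)))
names(daily)[2] <- "user_cnt"
daily
```

With the sample rows this gives one row per day, each with three distinct users, which should match what the dplyr pipeline returns.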
Finally, you need to learn how to plot a data frame in R. I personally prefer to use ggplot2 for this task.
library(ggplot2)
p <- ggplot(new_df) + geom_line(aes(x=dt, y=user_cnt))
p
If you run the code in RStudio, the plot will appear in the bottom right panel. Furthermore, you could use the plotly package to turn the static image into an interactive chart!
library(plotly)
ggplotly(p)
I'm trying to create a simple time series plot in R with the following data (it's in tbl format):
Date sales
<date> <dbl>
1 2010-02-05 1105572.
2 2010-09-03 1048761.
3 2010-11-19 1002791.
4 2010-12-24 1798476.
5 2011-02-04 1025625.
6 2011-11-18 1031977.
When I use the following command: plot(by_date$Date, by_date$sales, type = 'l'), the resulting graph does not show the individual dates as I want it to; it only shows the year on the x-axis, like this (please ignore the axis labels for now):
I've checked the format of the date column using class(by_date$Date) and it does show up as 'Date'. I've also checked other threads here and the closest one that came to answering my query is the one here. Tried that approach but didn't work for me, while plotting in ggplot or converting data to data frame didn't work either. Please help, thanks.
With ggplot this should work -
library(ggplot2)
ggplot(by_date, aes(Date, sales)) + geom_line()
You can use scale_x_date to format your x-axis as you want.
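As a rough sketch of what that could look like, the example below rebuilds the six rows from the question and asks for a tick every three months. The break interval, the label format and the rotated labels are assumptions chosen for illustration, not something the question specifies.

```r
library(ggplot2)

# Sample rows copied from the question
by_date <- data.frame(
  Date = as.Date(c("2010-02-05", "2010-09-03", "2010-11-19",
                   "2010-12-24", "2011-02-04", "2011-11-18")),
  sales = c(1105572, 1048761, 1002791, 1798476, 1025625, 1031977)
)

p <- ggplot(by_date, aes(Date, sales)) +
  geom_line() +
  # one tick every 3 months, labelled like "Feb 2010" (assumed choices)
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  # rotate the labels so they do not overlap
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Printing `p` draws the chart. Changing `date_breaks` (for example to "1 month") or `date_labels` (any strftime-style format) adjusts how dense and how detailed the axis is.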
I am trying to pull stock price data using tq_get in tidyquant, then want to plot the current price against the 52 week range. Here is an example of what I am looking to create.
Basically just a visual representation of where the stock is currently trading in relation to its 52 week range. Below is the code I have begun to load in the appropriate values for TSLA. First, I am wondering if it is possible to set the "from" and "to" dates so that they constantly update to be exactly one year ago and the current date, respectively? Second, is there a ggplot or another package that might be able to generate a similar plot? I've explored boxplots, but really I need something even more simple than that, as I really only need one axis. Thanks in advance!
X <- tq_get(c("^GSPC","TSLA"),get="stock.prices",from="2019-05-04", to="2020-05-04")
TSLA <- X %>% filter(symbol == "TSLA") %>% tk_xts()
chartSeries(TSLA)
TSLAlow <- min(TSLA$close)
TSLAlow
TSLAhigh <- max(TSLA$close)
TSLAhigh
TSLAclose <- tail(X$close, n=1)
TSLAclose
TSLArange <- tibble(TSLAlow, TSLAhigh, TSLAclose)
The first row of the table holds the 12 month names, and the rows below give the number of visitors, split into Portuguese (Portugal) and foreigners (ESTRANGEIRO). Ignore the row with no names.
How can I plot, in ggplot2, a bar graph that shows the portuguese visitors and the foreigners visitors during the 12 month period?
Usually it is better to provide a reproducible code example than to submit a screenshot.
To accomplish what you want, you will have to reshape your data from wide to long format. Given a data frame that looks like yours, and using reshape2:
library(reshape2)
library(ggplot2)
# levels= (not labels=) keeps the months in calendar order without renaming them
df <- data.frame(month=factor(c("Jan","Feb","Mar"),levels=c("Jan","Feb","Mar"),ordered=TRUE),
                 portugal=c(4000,2330,3000),
                 foreigner=c(4999,2600,3244),
                 stringsAsFactors = FALSE)
# wide to long: one row per month and visitor group
plotdf<-melt(df, id.vars="month")
colnames(plotdf)<-c("Month","Country","Visitors")
levels(plotdf$Country)<-c("Portugal","Foreigners")
ggplot(plotdf,aes(x=Month,y=Visitors,fill=Country)) +
  geom_bar(stat="identity",position=position_dodge()) +
  xlab("Month") +
  ylab("Visitors")
These data are exported from PostgreSQL and are of interval type, for example:
1 00:01:30
2 00:07:00
3 00:07:00
4 00:03:00
5 00:02:00
6 00:03:30
7 -00:02:00
...
what i want
I want to see the distribution of these data, and what's more, I want to get the decile of the distribution, even if it's interval time.
what I did
I used:
COPY (SELECT the_interval from the_table) TO '/some/file/path.txt';
to get the file path.txt.
Then I used
tools -> import datasets -> from local file
to import the data into the R workspace in RStudio.
what I am asking
I'm new to R, and I want to know: do I need to convert the data into a time type in R, or is there a function I could use to plot these data directly? If you can think of a better way to achieve the goal I described, please suggest it.
Thanks a lot!
Assuming you can read your data into R as character strings, the easiest option is to convert your times into times objects with the "times" function from the chron package. From there R makes it easy to plot a histogram.
#Sample data
t<-c("00:01:30", "00:07:00", "00:07:00", "00:03:00", "00:02:00", "00:03:30", "00:06:00")
#load library and convert to a times object
library(chron)
tt<-times(t)
#Plot the histogram
h<-hist(as.numeric(tt), main="sample data", col="blue")
#For data summaries
summary(tt)
quantile(tt, 0.90)
I hope this gives you a head start on solving your problem. If not, please ask a more detailed question that provides some sample data and the expected output.
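Since the question asks for deciles and the sample contains a negative interval (-00:02:00), here is a minimal base-R sketch that converts the strings to signed seconds and takes the deciles with quantile. It swaps chron for plain numeric seconds; the helper name `to_seconds` is hypothetical, and it assumes every value has the form [-]HH:MM:SS.

```r
# Hypothetical helper: convert "[-]HH:MM:SS" strings to signed seconds
to_seconds <- function(x) {
  sgn <- ifelse(startsWith(x, "-"), -1, 1)
  parts <- strsplit(sub("^-", "", x), ":", fixed = TRUE)
  secs <- vapply(parts,
                 function(p) sum(as.numeric(p) * c(3600, 60, 1)),
                 numeric(1))
  sgn * secs
}

# Sample values copied from the question
intervals <- c("00:01:30", "00:07:00", "00:07:00", "00:03:00",
               "00:02:00", "00:03:30", "-00:02:00")
secs <- to_seconds(intervals)

# Deciles (10%, 20%, ..., 90%) of the durations, in seconds
deciles <- quantile(secs, probs = seq(0.1, 0.9, by = 0.1))
deciles
```

Calling `hist(secs / 60)` on the result would show the distribution in minutes. Working in seconds keeps the arithmetic simple, at the cost of losing the HH:MM:SS display.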
I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (Link of picture of data is given below). Data has been collected 3 times a day, but sometimes due to NAs, there is just 1 or 2 entries per day; some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested, however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the package plyr :
Data <- data.frame(dates, values)
library(plyr)
# summarise computes one mean per date
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the date format (yet).
Look at the data.table package especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
# assumes Data has the columns date and dist
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
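As a rough illustration of why this matters, the sketch below uses a small made-up stand-in for Dis_sub (the dist values are invented, and the timestamps are assumed to be POSIXct in UTC). Grouping by date_only instead of the full timestamp collapses the readings to one mean per day.

```r
# Hypothetical stand-in for Dis_sub; the dist values are made up
Dis_sub <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00", "2006-10-08 04:00:00"),
                    tz = "UTC"),
  dist = c(600, 800, 300, NA, 900, 100)
)

# Drop the time of day so that all readings of one day share a key
Dis_sub$date_only <- as.Date(Dis_sub$date)

# The formula interface drops rows with NA before averaging
daily_mean <- aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)
daily_mean
```

Each day's mean is divided by the number of readings actually present, so a day with two valid entries is averaged over two, not three.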
However if for some reason you really want to use a loop you could try something like
newFrame <- data.frame()
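# loop over the days as character strings, because a for loop drops the Date class
for (d in as.character(unique(Dis_sub$date_only))) {
  meanDist <- mean(Dis_sub$dist[as.character(Dis_sub$date_only) == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, data.frame(date = d, mean_dist = meanDist))
}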
But keep in mind that this will be slow and memory-inefficient.