I am analysing the Bike Sharing dataset from Kaggle. Here is a sample:
Head
yr mnth Ano cnt
<int> <int> <chr> <int>
1 0 1 2011 985
2 0 1 2011 801
3 0 1 2011 1349
4 0 1 2011 1562
5 0 1 2011 1600
Tail
yr mnth Ano cnt
<int> <int> <chr> <int>
1 1 12 2012 2114
2 1 12 2012 3095
3 1 12 2012 1341
4 1 12 2012 1796
5 1 12 2012 2729
Here "cnt" is the number of bikes for each day. Each row is one day, from 01/01/2011 to 12/12/2012.
My goal is to analyse cnt for each month of both 2011 and 2012. However, I keep getting this weird output:
My code:
k<- bike_new %>%
ggplot(aes(x=mnth,y=cnt))+ geom_line();k
What am I doing wrong here?
As @AllanCameron advised, add the group aesthetic as a factor and, since you have two years, map the year to colour as well. Here is the code, using simulated data:
library(ggplot2)
library(dplyr)
#Code
bike_new %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
xlab('month')+
labs(color='Ano')
Output:
Some data used:
#Data
bike_new <- structure(list(yr = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L), mnth = c(1, 1, 1, 1, 1, 12, 12, 12, 12,
12, 2, 2, 2, 2, 2), Ano = c(2011L, 2011L, 2011L, 2011L, 2011L,
2012L, 2012L, 2012L, 2012L, 2012L, 2011L, 2011L, 2011L, 2011L,
2011L), cnt = c(985, 801, 1349, 1562, 1600, 2114, 3095, 1341,
1796, 2729, 1085, 901, 1449, 1662, 1700)), row.names = c(NA,
-15L), class = "data.frame")
If you want only one line per year, one strategy is the one explained by @Phil, using another variable such as the day. Alternatively, you can aggregate the values as follows:
#Code 2
bike_new %>%
group_by(Ano,mnth) %>%
summarise(cnt=sum(cnt,na.rm=TRUE),.groups='drop') %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
geom_point()+
xlab('month')+
labs(color='Ano')
Output:
As you are analyzing number of bikes.
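The per-day idea could be sketched as follows, assuming the rows are already in date order within each year. The `day` column is a hypothetical helper that is not part of the original data, and `bikes` is a small stand-in built from the sample rows shown in the question.

```r
library(ggplot2)

# Stand-in with the sample counts from the question (assumption: rows are in date order)
bikes <- data.frame(
  Ano = rep(c(2011L, 2012L), each = 5),
  cnt = c(985, 801, 1349, 1562, 1600, 2114, 3095, 1341, 1796, 2729)
)

# Hypothetical helper column: running row number within each year
bikes$day <- ave(seq_along(bikes$cnt), bikes$Ano, FUN = seq_along)

# Each year now has exactly one y value per x value, so geom_line() draws one clean line per year
p <- ggplot(bikes, aes(x = day, y = cnt, colour = factor(Ano))) +
  geom_line() +
  labs(x = "day within year", colour = "Ano")
print(p)
```

With the month on the x axis there are about 30 counts per x value, which is what produces the vertical zigzag; a per-day index avoids that without aggregating.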
My dataset is constructed as follows:
# A tibble: 20 x 8
iso3 year Var1 Var1_imp Var2 Var2_imp Var1_type Var2_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATG 2000 NA 144 NA 277 imputed imputed
2 ATG 2001 NA 144 NA 277 imputed imputed
3 ATG 2002 NA 144 NA 277 imputed imputed
4 ATG 2003 NA 144 NA 277 imputed imputed
5 ATG 2004 NA 144 NA 277 imputed imputed
6 ATG 2005 NA 144 NA 277 imputed imputed
7 ATG 2006 NA 144 NA 277 imputed imputed
8 ATG 2007 144 144 277 277 observed observed
9 ATG 2008 45 45 NA 301 observed imputed
10 ATG 2009 NA 71.3 NA 325 imputed imputed
11 ATG 2010 NA 97.7 NA 349 imputed imputed
12 ATG 2011 NA 124 NA 373 imputed imputed
13 ATG 2012 NA 150. NA 397 imputed imputed
14 ATG 2013 NA 177. 421 421 imputed observed
15 ATG 2014 NA 203 434 434 imputed observed
16 ATG 2015 NA 229. 422 422 imputed observed
17 ATG 2016 NA 256. 424 424 imputed observed
18 ATG 2017 282 282 429 429 observed observed
19 ATG 2018 NA 282 435 435 imputed observed
20 EGY 2000 NA 38485 NA 146761 imputed imputed
I am new to R and I would like to create a line chart for each country showing the time series of Var1_imp and Var2_imp on the same chart (I have 193 countries in my database, with data from 2000 to 2018). Points should be filled circles when the data are observed and unfilled circles when the data are imputed (based on Var1_type and Var2_type). Circles should be joined with solid lines if two consecutive data points are both observed, and with dotted lines otherwise.
The main goal is to check, country by country, whether the method used to impute missing data is good or bad, depending on whether there are outliers in the time series.
I have tried the following:
ggplot(df, aes(x=year, y=Var1_imp, group=Var1_type)) +
geom_point(size=2, shape=21) + # shape = 21 for unfilled circles and shape = 19 for filled circles
geom_line(linetype = "dashed") # omit linetype for a solid line, otherwise linetype = "dashed"
I am having difficulty working out:
1/ how to do one single chart per country per variable
2/ how to include both Var1_imp and Var2_imp on the same chart
3/ how to use geom_point based on conditions (imputed versus observed in Var1_type)
4/ how to use geom_line based on conditions (plain line if two subsequent observed data points, otherwise dotted).
Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.
You can use the following code (it needs tidyr for pivot_longer, plus dplyr and ggplot2):
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, group=variable)) +
geom_point(size=2, shape=21) +
geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") +
xlab("Year") + ylab("Imp")
It is better to use colour, like this:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, colour=variable)) +
geom_point(size=2, shape=21) +
geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Update
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),
names_to = c("group", ".value"),
names_pattern = "(.*)_(.*)") %>%
ggplot(aes(x=year, y=imp, shape = type, colour=group)) +
geom_line(aes(group = group, colour = group), size = 0.5) +
geom_point(aes(group = group, colour = group, shape = type),size=2) +
scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) +
facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Data
df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA,
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144,
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177,
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA,
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L,
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L,
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L,
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed",
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
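The code above does not yet cover point 4 of the question (solid line between two observed points, dotted otherwise). One possible approach, offered only as a sketch, is to draw each year-to-year link as its own segment and map its line type to a helper column. The names `toy`, `long`, `segs` and `seg` are hypothetical, and `toy` holds just a few ATG rows copied from the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# A few ATG rows from the question, trimmed to the columns that are plotted
toy <- data.frame(
  iso3 = "ATG",
  year = 2007:2012,
  Var1_imp  = c(144, 45, 71.3, 97.7, 124, 150),
  Var1_type = c("observed", "observed", "imputed", "imputed", "imputed", "imputed"),
  Var2_imp  = c(277, 301, 325, 349, 373, 397),
  Var2_type = c("observed", "imputed", "imputed", "imputed", "imputed", "imputed")
)

# Long format: one row per country, year and variable, with columns imp and type
long <- toy %>%
  pivot_longer(cols = -c(iso3, year),
               names_to = c("group", ".value"),
               names_sep = "_")

# One row per pair of consecutive years; seg is a hypothetical helper column
segs <- long %>%
  arrange(iso3, group, year) %>%
  group_by(iso3, group) %>%
  mutate(xend = lead(year),
         yend = lead(imp),
         seg = ifelse(type == "observed" & lead(type) == "observed",
                      "both observed", "involves imputed")) %>%
  ungroup() %>%
  filter(!is.na(xend))   # the last year of each series has no following point

p <- ggplot(long, aes(x = year, y = imp, colour = group)) +
  geom_segment(data = segs, aes(xend = xend, yend = yend, linetype = seg)) +
  geom_point(aes(shape = type), size = 2) +
  scale_shape_manual(values = c(imputed = 21, observed = 16)) +
  scale_linetype_manual(values = c("both observed" = "solid",
                                   "involves imputed" = "dotted")) +
  facet_wrap(~ iso3, scales = "free")
print(p)
```

Using geom_segment rather than geom_line is a design choice here: a single geom_line path cannot easily switch line type partway through, whereas separate segments can each carry their own.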
Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column containing the names Var1_imp and Var2_imp, and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with colour set to the key column. You'll then get two lines per country.
However, looking for outliers by eye in line charts for 193 countries isn't a very good idea. Use
outlier_values <- boxplot.stats(airquality$Ozone)$out
to get the outliers in one column, or something similar with sapply to cover multiple columns. Outliers are normally defined as values lying more than 1.5 * IQR beyond the quartiles, so it's easy to figure out which ones they are.
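As a rough illustration of the multi-column variant, the sketch below applies boxplot.stats to several columns of the built-in airquality data. The names `num_cols` and `outliers` are illustrative only, and lapply is used instead of sapply so that the result is always a list, even when columns have different numbers of outliers.

```r
# airquality ships with base R (datasets package)
num_cols <- c("Ozone", "Solar.R", "Wind", "Temp")

# boxplot.stats() flags values more than coef * IQR beyond the hinges (coef = 1.5 by default)
outliers <- lapply(airquality[num_cols], function(col) boxplot.stats(col)$out)

# Number of flagged values per column
lengths(outliers)
```

For the country data in the question, the same function could presumably be applied per country and variable, for example through tapply on df$Var1_imp grouped by df$iso3, though with only 19 years per country the boxplot rule is a coarse screen at best.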
I want to count the number of days on which rain fell in each month, for different years and at different locations.
This is my data:
Location Year Month Day Precipitation
A 2008 1 1 0
A 2008 1 2 8.32
A 2008 1 3 4.89
A 2008 1 4 0
I have up to 18 locations, the years run from 2008 to 2018 with 12 months in each year, and a precipitation of 0 means no rain on that day.
You can use aggregate:
aggregate(cbind(days=x$Precipitation > 0), as.list(x[c("Location", "Year", "Month")]), sum)
# Location Year Month days
#1 A 2008 1 2
Data:
x <- structure(list(Location = structure(c(1L, 1L, 1L, 1L), .Label = "A", class = "factor"),
Year = c(2008L, 2008L, 2008L, 2008L), Month = c(1L, 1L, 1L,
1L), Day = 1:4, Precipitation = c(0, 8.32, 4.89, 0)), class = "data.frame", row.names = c(NA, -4L))
Based on the available information, using dplyr:
df <- df %>%
filter(Precipitation != 0) %>%
group_by(Location, Year, Month) %>%
summarize(DaysOfRain = n())
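One thing to keep in mind with the filter-based version is that a month with no rain at all disappears from the result instead of showing a count of 0. A minimal sketch of a variant that keeps such months is shown below; the data frame `rain` is a made-up extension of the sample (February is invented and completely dry), and `rain_days` is just an illustrative name.

```r
library(dplyr)

# Made-up extension of the sample: January as in the question, February invented and dry
rain <- data.frame(
  Location = "A",
  Year = 2008L,
  Month = c(1L, 1L, 1L, 1L, 2L, 2L),
  Day = c(1L, 2L, 3L, 4L, 1L, 2L),
  Precipitation = c(0, 8.32, 4.89, 0, 0, 0)
)

# Summing the logical vector counts the wet days and keeps months where the count is 0
rain_days <- rain %>%
  group_by(Location, Year, Month) %>%
  summarise(DaysOfRain = sum(Precipitation > 0), .groups = "drop")

print(rain_days)
```

This is the same idea as the aggregate answer, which also sums a logical condition rather than dropping rows first.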
I want a stacked barplot in R with year as my x-axis, percentage as my y-axis, and land use as the colour fill. My data is given below:
Year Percentage LandUse
1 2015 49.8 Agriculture
2 2012 51.2 Agriculture
3 2009 50.2 Agriculture
10 2015 22.5 fishing
11 2012 21.4 fishing
12 2009 21.9 fishing
19 2015 14.7 services and residential
20 2012 16.0 services and residential
21 2009 17.1 services and residential
28 2015 0.8 mining and quarrying
29 2012 0.7 mining and quarrying
30 2009 0.7 mining and quarrying
37 2015 0.4 water and waste treatment
38 2012 0.5 water and waste treatment
39 2009 0.4 water and waste treatment
46 2015 0.8 Industry and Manufacturing
47 2012 0.8 Industry and Manufacturing
48 2009 0.9 Industry and Manufacturing
You can use the ggplot2 package to plot stacked bars. Note that the Year variable should be a factor.
See the code below:
df <- structure(list(Year = c(2015L, 2012L, 2009L, 2015L, 2012L, 2009L,
2015L, 2012L, 2009L, 2015L, 2012L, 2009L, 2015L, 2012L, 2009L,
2015L, 2012L, 2009L), Percentage = c(49.8, 51.2, 50.2, 22.5,
21.4, 21.9, 14.7, 16, 17.1, 0.8, 0.7, 0.7, 0.4, 0.5, 0.4, 0.8,
0.8, 0.9), LandUse = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 5L,
5L, 5L, 4L, 4L, 4L, 6L, 6L, 6L, 3L, 3L, 3L), .Label = c("Agriculture",
"fishing", "Industry_and_Manufacturing", "mining_and_quarrying",
"services_and_residential", "water_and_waste_treatment"), class = "factor")), class = "data.frame", row.names = c("1",
"2", "3", "10", "11", "12", "19", "20", "21", "28", "29", "30",
"37", "38", "39", "46", "47", "48"))
df$Year <- factor(df$Year)
library(ggplot2)
ggplot(df, aes(Year, Percentage, fill = LandUse)) +
geom_bar(stat = "identity")
Output:
P.S.
If you want to use base barplot instead, you will need to build the value matrix and set up the legend and colours yourself (roughly a dozen lines of code). ggplot2 gives you all of this by default.
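For comparison, a rough base-graphics sketch is given below. It uses only a subset of the question's data (two land uses), and the names `lu` and `m` as well as the two colours are arbitrary choices for illustration.

```r
# Subset of the question's data: two land uses, three years
lu <- data.frame(
  Year = rep(c(2015, 2012, 2009), times = 2),
  Percentage = c(49.8, 51.2, 50.2, 22.5, 21.4, 21.9),
  LandUse = rep(c("Agriculture", "fishing"), each = 3)
)

# barplot() stacks the rows of a matrix: one row per land use, one column per year
m <- xtabs(Percentage ~ LandUse + Year, data = lu)

# legend.text = TRUE takes the legend labels from the row names of m
barplot(m, col = c("darkgreen", "steelblue"), legend.text = TRUE,
        xlab = "Year", ylab = "Percentage")
```

With all six land uses you would also have to pick six colours and probably reposition the legend so that it does not cover the bars, which is where the extra lines of code come from.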
I am trying to migrate this activity from Excel/SQL to R and I am stuck. Any help is very much appreciated. Thanks!
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer ID, produce one row of output. Use the names stored in the Variable_Name column as new columns, and fill each with the sum of Amount. Also create a matching column for each of them (suffix _P) set to 1 or 0 depending on the presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option (assuming the data above is stored in dat):
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 63539 2000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$","_P",names(result))
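Alternatively, the flag columns can be given their _P names before the pieces are combined, which avoids the regular expression altogether. The sketch below uses only the first four rows of the question's data, and `flags` is a hypothetical helper name.

```r
# First four rows of the question's data
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009"),
  Amount = c(2000L, 100L, 300L, 20L)
)

# Sum of Amount per customer (rows) and variable name (columns)
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence flags, named with the _P suffix from the start
flags <- as.data.frame((intm > 0) + 0)
names(flags) <- paste0(names(intm), "_P")

# Customer IDs come from the row names of the cross-tabulation
result <- cbind(Cust_ID = as.integer(rownames(intm)), intm, flags)
print(result)
```

Note that a flag of 0 here means the summed amount is not positive, so a customer whose purchases net out to zero would also be flagged as 0.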