The following is my data, of which I would like to plot the monthly frequency. There are missing values.
YEAR MONTH
1960 5
1961 7
1961 8
1961 11
1962 5
1963 6
1964
1965 7
1966 7
1966 7
1966 10
1967 4
1967 8
1968
1969
1970 8
1971 6
1971 9
1971 10
1972 7
1973 6
1973 9
1974 10
1974 10
1975 10
1976
1977
1978 9
1979 11
1980 7
1980 7
1980 8
1981
1982 10
1982 12
1983
1984 7
1985 9
1986
1987
1988 9
1988 10
1989 7
1989 10
1990
1991 7
1992
1993 6
1993 7
1993 9
1993 9
1994
1995 7
1996 8
1996 9
1997 5
1998 8
1998 9
1998 10
1999 8
1999 9
2000 9
2001
2002 1
2003 5
2003 7
2003 8
2003 9
2003 10
2004
2005 11
2006 7
2006 10
2007 9
2007 11
2007 11
2008 5
2009 5
2009 7
2009 9
2009 9
2010 10
2011 5
2011 9
2011 9
2012 8
2013 7
2014 9
2015 7
2016
2017 8
2018 10
2019 11
2020
I used the following code in a Jupyter Notebook. There are other columns but I selected only the month.
#Plot Frequency
ISA = pd.read_csv (r'G\:data.csv', encoding="ISO-8859-1")
ISA = pd.DataFrame(ISA,columns=['YEAR','MONTH','TYPE'])
ISA= ISA[ISA['YEAR'].between(1960,2020, inclusive="both")]
ISA['YEAR'] = pd.to_datetime(ISA['MONTH'])
ISA = ISA.set_index('YEAR')
ISA=ISA.drop(['MSW','TC NAME', 'KNOTS','PAR BEG', 'PAR END'],axis=1)
ISA=ISA.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
ax=ISA.groupby([ISA.index.month, 'MONTH']).count().plot(kind='bar',color='lightgray',width=1, edgecolor='darkgray')
plt.xlabel('Month', color='black', fontsize=14, weight='bold')
plt.ylabel('Monthly frequency' , color='black', fontsize=14, weight='bold',)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct','Nov','Dec'],rotation=0, fontsize=12)
ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.yticks(fontsize=12)
plt.ylim(0,20)
plt.suptitle("Monthly Frequency",fontweight='bold',y=0.95,x=0.53)
plt.title("ISA", pad=0)
L=plt.legend()
L.get_texts()[0].set_text('Frequency')
plt.bar_label(ax.containers[0], label_type='center', fontsize=11)
plt.plot()
plt.tight_layout()
plt.show()
Using this code, the resulting plot shows counts for February and other months that should be zero. Can you help me adjust the bar chart, or tell me if there is something wrong with my code?
This comes close with your supplied example data:
# Read the initial data to a dataframe
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.ticker as mtick
ISA = pd.read_csv(r'data.txt', sep=r'\s+')  # `sep=r'\s+'` replaces `delim_whitespace=True`, which is deprecated in recent pandas
ISA = pd.DataFrame(ISA,columns=['YEAR','MONTH'])
ISA['MONTH'] = ISA['MONTH'].astype(dtype='Int64')
ISA= ISA[ISA['YEAR'].between(1960,2020, inclusive="both")]
# Use `value_counts()` with that dataframe to collect counts fixing for the month numbers that are missing
# because no values ever reported for those months in imported data
months_count_collected = {}
for x in range(1, 13):
    if x in ISA['MONTH'].value_counts():
        months_count_collected[x] = ISA['MONTH'].value_counts()[x]
        #print(ISA['MONTH'].value_counts()[x])
    else:
        months_count_collected[x] = 0
        #print(0)
# Make a dataframe with the frequency from `months_count_collected` where those with zero counts added back in
df = pd.DataFrame.from_dict(months_count_collected, orient='index', columns = ["Frequency"])
# Make plot from frequency dataframe
# Note that `sort_index()` isn't needed here, but it would come in handy if values for unrepresented months
# were added later or differently, and it can be useful when developing, so it is left in;
# `sort_index()` use is based on https://stackoverflow.com/a/57876952/8508004 .
ax = df.sort_index().plot(kind='bar',color='lightgray',width=1, edgecolor='darkgray')
# Set tick labels to the month names based on https://stackoverflow.com/a/30280076/8508004
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct','Nov','Dec'],rotation=0, fontsize=12);
ax.set_xlabel('Month', color='black', fontsize=14, weight='bold')
ax.set_ylabel('Annual frequency' , color='black', fontsize=14, weight='bold',)
#ax.set_title("Passage Frequency", pad=0);
#plt.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.0f')) # based on OP code and https://stackoverflow.com/a/36319915/8508004 to import and use `mtick` with Pandas
plt.yticks(fontsize=12)
ax.set_ylim(0,20)
plt.suptitle("Monthly Frequency",fontweight='bold',y=0.95,x=0.53)
plt.title("ISA", pad=0)
L=plt.legend()
L.get_texts()[0].set_text('Frequency')
plt.bar_label(ax.containers[0], label_type='center', fontsize=11)
plt.plot()
plt.tight_layout()
plt.show();
There's probably a more clever way to fill in the months that are unrepresented in the input.
Titles and labels get generated, but their text may not be correct right now.
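One possibly shorter way to add the unrepresented months back in is to let `reindex` do the filling. This is only a sketch under assumptions: the `months` series below is a made-up stand-in for the `MONTH` column, and `frequency` is a hypothetical name, not something from the code above.

```python
import pandas as pd

# Toy stand-in for the MONTH column; None marks a year with no recorded month.
months = pd.Series([5, 7, 8, 11, 5, 6, None, 7, 7, 7, 10])

# Drop the missing entries, count each month, then reindex over 1..12 so that
# months that never occur appear with a count of 0 instead of being absent.
frequency = (
    months.dropna()
    .astype(int)
    .value_counts()
    .reindex(range(1, 13), fill_value=0)
    .rename("Frequency")
)

print(frequency)
```

If this fits your data, `frequency.to_frame()` should be usable in place of the `df` built from `months_count_collected`, with the plotting calls left unchanged.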
I need to count the number of contiguous years in a data frame. I want to filter data frames that have more than 30 years of consecutive records. Before I was doing:
length(unique(Daily_Streamflow$year)) > 30
But I realized that the number of years (unique years) could be more than 30 but not in a consecutive range, for example:
(unique(DSF_09494000$year))
[1] 1917 1918 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
[27] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
[53] 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
How is it possible to count the number of years in a range that is continuous, without missing years? Is there a function similar to na.contiguous from the stats package, but for non-NA values?
I might be overcomplicating things - I would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born and 4322 are Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally, these are only foreign-born individuals (mean = 1985); however, 348 foreign-born subjects have a missing value. There are 4670 NAs in total, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the status is given by df$Brthcoun with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
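Something like this should also work (the outer ifelse keeps the existing non-missing values, so only the NAs are recoded):
df$YR_IMM <- ifelse(!is.na(df$YR_IMM), df$YR_IMM, ifelse(df$Brthcoun == 0, 100, 1985))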
I would like to create a time series portrayed visually as a spiral graph like this one. I would like the ticks to be in months instead of hours, with each turn of the spiral representing a year instead of a day. I would like the option of either breaking each main tick into four minor ticks (representing weeks) or having no minor ticks and showing only the main month ticks.
Time-Spiral Graph
I have included a sample of mock data. The daily temperature means could be binned into four bins (as represented by weeks).
Year Month Day Temperature
1993 January 1 9
1993 January 2 6
1993 January 3 6
1993 January 4 5
1993 January 5 5
1993 January 6 5
1993 January 7 8
1993 January 8 9
1993 January 9 6
1993 January 10 5
1993 January 11 7
1993 January 12 10
1993 January 13 7
1993 January 14 10
1993 January 15 5
1993 January 16 5
1993 January 17 7
1993 January 18 7
1993 January 19 10
1993 January 20 8
1993 January 21 9
1993 January 22 8
1993 January 23 9
1993 January 24 9
1993 January 25 5
1993 January 26 6
1993 January 27 7
1993 January 28 6
1993 January 29 8
1993 January 30 8
1993 January 31 10
1993 February 1 8
1993 February 2 9
1993 February 3 9
1993 February 4 6
1993 February 5 5
1993 February 6 9
1993 February 7 8
1993 February 8 10
1993 February 9 9
1993 February 10 6
1993 February 11 6
1993 February 12 9
1993 February 13 8
1993 February 14 6
1993 February 15 6
1993 February 16 9
1993 February 17 10
1993 February 18 5
1993 February 19 7
1993 February 20 6
1993 February 21 8
1993 February 22 9
1993 February 23 5
1993 February 24 10
1993 February 25 10
1993 February 26 8
1993 February 27 10
1993 February 28 9
1993 March 1 10
1993 March 2 9
1993 March 3 9
1993 March 4 6
1993 March 5 7
1993 March 6 6
1993 March 7 5
1993 March 8 10
1993 March 9 9
1993 March 10 8
1993 March 11 9
1993 March 12 7
1993 March 13 7
1993 March 14 6
1993 March 15 6
1993 March 16 9
1993 March 17 7
1993 March 18 6
1993 March 19 10
1993 March 20 7
1993 March 21 6
1993 March 22 6
1993 March 23 10
1993 March 24 9
1993 March 25 8
1993 March 26 6
1993 March 27 5
1993 March 28 5
1993 March 29 10
1993 March 30 7
1993 March 31 8
1993 April 1 6
1993 April 2 7
1993 April 3 10
1993 April 4 7
1993 April 5 8
1993 April 6 5
1993 April 7 7
1993 April 8 5
1993 April 9 10
1993 April 10 7
1993 April 11 6
1993 April 12 9
1993 April 13 10
1993 April 14 10
1993 April 15 6
1993 April 16 5
There is a thread that shows the code needed to achieve this (How to Create A Time-Spiral Graph Using R); however, I am having difficulty understanding the code and modifying it to fit my purpose. I am hoping someone can either point me in the right direction or help me customize the code.
Thank you!!
As #42 said, it sounds like you have some other pre-processing to do to get your data ready for what you want.
In ggplot, here's the approach I would take. First get your data printing as a bar chart. Then add an ascending baseline. Finally, use coord_polar to put it around an annual circle.
library(ggplot2)
sample <- data.frame(date = seq.Date(from = as.Date("1993-01-01"), to = as.Date("1996-12-31"), by = 1),
day_num = 1:1461,
temp = rnorm(1461, 10, 2))
# as normal bar
ggplot(sample, aes(date, temp, fill = temp)) +
geom_col() +
scale_fill_viridis_c() + theme_minimal()
# or use the fill pattern below to replicate OP picture:
# scale_fill_gradient2(low="green", mid="yellow", high="red", midpoint=10)
# as ascending bar
ggplot(sample, aes(date, 0.01*day_num + temp/2,
height = temp, fill = temp)) +
geom_tile() +
scale_fill_viridis_c() + theme_minimal()
# as spiral
ggplot(sample, aes(day_num %% 365,
0.05*day_num + temp/2, height = temp, fill = temp)) +
geom_tile() +
scale_y_continuous(limits = c(-20, NA)) +
scale_x_continuous(breaks = 30*0:11, minor_breaks = NULL, labels = month.abb) +
coord_polar() +
scale_fill_viridis_c() + theme_minimal()
I have one data frame which has sales values for the time period Oct. 2000 to Dec. 2001 (15 months). Also I have profit values for the same time period as above and I want to find the correlation between these two data frames month wise for these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
Now I know that for the initial two months I cannot get the correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation by taking the previous months' values into consideration. So, for Dec. 2000 I will consider the values of Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider the values of Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. Thus, for every month I will also consider all previous months' values when calculating the correlation, and my output should be something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?
Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note I call this runcor as it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.
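For anyone doing the same thing in pandas rather than R, the running correlation can be sketched with an expanding window. This is an illustration only: it assumes the same toy numbers as the sample data above, and the `data` frame and `runcor` column names simply mirror the R version.

```python
import pandas as pd

# Same toy numbers as the R sample data above.
data = pd.DataFrame({
    "sales":  [1, 4, 3, 2, 3, 4, 5, 6, 7, 6, 7, 5],
    "profit": [4, 3, 2, 3, 4, 5, 6, 7, 7, 7, 6, 5],
})

# expanding() grows the window from the first row up to the current row;
# min_periods=3 leaves the first two rows as NaN, like the two leading NAs in R.
data["runcor"] = data["sales"].expanding(min_periods=3).corr(data["profit"])

print(data)
```

Here the third entry of `runcor` is the correlation of the first 3 sales and profit numbers, the fourth is the correlation of the first 4, and so on.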
Another possibility would be: (if dat1 and dat2 are the initial datasets)
Update
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
I have managed to aggregate some data into the following:
Month Year Number
1 1 2011 3885
2 2 2011 3713
3 3 2011 6189
4 4 2011 3812
5 5 2011 916
6 6 2011 3813
7 7 2011 1324
8 8 2011 1905
9 9 2011 5078
10 10 2011 1587
11 11 2011 3739
12 12 2011 3560
13 1 2012 1790
14 2 2012 1489
15 3 2012 1907
16 4 2012 1615
I am trying to create a barplot where the bars for the months are next to each other, so for the above example January through April will have two bars (one for 2011 and one for 2012) and the remaining months will only have one bar representing 2011.
I know I have to use beside=T, but I guess I need to create some sort of matrix in order to get the barplot to display properly. I am having an issue figuring out what that step is. I have a feeling it may involve matrix but for some reason I am completely stumped to what seems like a very simple solution.
Also, I have this data: y=c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec') which I would like to use in my names.arg. When I try to use it with the above data it tells me undefined columns selected which I am taking to mean that I need 16 variables in y. How can I fix this?
To use barplot you need to rearrange your data:
dat <- read.table(text = " Month Year Number
1 1 2011 3885
2 2 2011 3713
3 3 2011 6189
4 4 2011 3812
5 5 2011 916
6 6 2011 3813
7 7 2011 1324
8 8 2011 1905
9 9 2011 5078
10 10 2011 1587
11 11 2011 3739
12 12 2011 3560
13 1 2012 1790
14 2 2012 1489
15 3 2012 1907
16 4 2012 1615",sep = "",header = TRUE)
y <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
barplot(rbind(dat$Number[1:12],c(dat$Number[13:16],rep(NA,8))),
beside = TRUE,names.arg = y)
Or you can use ggplot2 with the data pretty much as is:
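library(ggplot2)
dat$Year <- factor(dat$Year)
dat$Month <- factor(dat$Month)
# stat = "identity" plots the Number values directly instead of counting rows
ggplot(dat,aes(x = Month,y = Number,fill = Year)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(labels = y)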