How to adjust list interval in xticks? - jupyter-notebook

I have data that I want to plot as a time series. FRQ should be drawn as bars and FILT as a line chart.
YEAR FRQ FILT
1960 1
1961 3
1962 1 1.416666667
1963 1 0.916666667
1964 0 0.833333333
1965 1 1.333333333
1966 3 1.75
1967 2 1.5
1968 0 0.833333333
1969 0 0.666666667
1970 1 1.166666667
1971 3 1.666666667
1972 1 1.833333333
1973 2 1.75
1974 2 1.5
1975 1 1
1976 0 0.5
1977 0 0.416666667
1978 1 0.833333333
1979 1 1.333333333
1980 3 1.5
1981 0 1.333333333
1982 2 1
1983 0 0.833333333
1984 1 0.75
1985 1 0.583333333
1986 0 0.5
1987 0 0.75
1988 2 1.166666667
1989 2 1.25
1990 0 0.916666667
1991 1 0.833333333
1992 0 1.25
1993 4 1.5
1994 0 1.416666667
1995 1 1.25
1996 2 1.416666667
1997 1 1.833333333
1998 3 2
1999 2 1.75
2000 1 1.166666667
2001 0 1.083333333
2002 1 1.666666667
2003 5 2
2004 0 1.75
2005 1 1.5
2006 2 1.75
2007 3 2.166666667
2008 1 2.333333333
2009 4 2.333333333
2010 1 2.25
2011 3 1.916666667
2012 1 1.5
2013 1 1.166666667
2014 1 0.916666667
2015 1 0.75
2016 0 0.666666667
2017 1 0.75
2018 1 0.833333333
2019 1
2020 0
My working code looks like this:
# Read tropical cyclone frequency
import pandas as pd
import matplotlib.pyplot as plt

TC = pd.read_csv(r'G:\TC_Atlas\data.csv', encoding="ISO-8859-1")
TC = pd.DataFrame(TC, columns=['YEAR','FRQ','FILT','FRQ2','FILT2','LMI','FILTLMI','LMI2','FILTLMI2'])
TC = TC[TC['YEAR'].between(1960, 2020, inclusive="both")]
#TC = TC.set_index('YEAR')
labels = [str(year) for year in range(1960, 2021)]  # '1960' ... '2020'
#Plot timeseries
# Plot time series
TC['FRQ'].plot(kind='bar', color='lightgray', width=1, edgecolor='darkgray')
TC['FILT'].plot(kind='line', color='black')
plt.suptitle("TC Passage Frequency", fontweight='bold', y=0.95, x=0.53)
plt.title("Isabela (1960-2020)", pad=0)
L = plt.legend()
L.get_texts()[0].set_text('filtered')
plt.yticks(fontsize=12)
tickvalues = range(len(labels))
plt.xticks(ticks=tickvalues, labels=labels, rotation=30)
plt.xlabel('Year', color='black', fontsize=14, weight='bold', labelpad=10)
plt.ylabel('Frequency', color='black', fontsize=14, weight='bold', labelpad=15)
plt.tight_layout()
plt.show()
Unfortunately, I cannot adjust the x-axis to make the xticks appear at a 4-year interval. I have been scouring for a possible solution. Kindly help. I use Jupyter Notebook in Python. My current output labels every single year, but my goal is to have xticks at a 4-year interval.

You are explicitly adding ticks every year. The culprit is this line:
plt.xticks(ticks=tickvalues, labels=labels, rotation=30)
To make them less frequent, you can take advantage of list slicing like so:
plt.xticks(ticks=tickvalues[::4], labels=labels[::4], rotation=30)
If you need to shift them so that a specific year is listed, you can also set the initial index in the slice (e.g. tickvalues[2::4]).
EDIT: Since you are producing the plots from a pd.DataFrame, a more sensible way is to take the tick labels from a column instead of hard-coding the list. Note that kind='bar' draws the bars at ordinal positions 0, 1, 2, ..., so keep the tick positions ordinal and take only the labels from the column:
plt.xticks(ticks=range(len(TC))[::4], labels=TC['YEAR'][::4], rotation=30)
If your data is not converted properly, you might encounter weird ordering bugs; in that case, make sure you convert the YEAR column to a numeric type using astype or to_numeric first (a detailed write-up can be found in this SO answer).
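Putting the pieces together, here is a minimal self-contained sketch; the data below is synthetic (the original CSV is not available), but the tick logic is identical:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the CSV: yearly counts plus a smoothed series.
rng = np.random.default_rng(0)
TC = pd.DataFrame({'YEAR': range(1960, 2021)})
TC['FRQ'] = rng.integers(0, 6, size=len(TC))
TC['FILT'] = TC['FRQ'].rolling(12, center=True, min_periods=1).mean()

TC['FRQ'].plot(kind='bar', color='lightgray', width=1, edgecolor='darkgray')
TC['FILT'].plot(kind='line', color='black')

# Bars sit at positions 0..N-1; label every 4th position with its year.
positions = range(len(TC))
plt.xticks(ticks=positions[::4], labels=TC['YEAR'][::4], rotation=30)
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()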

Related

r data.table adjust min and max years only if each set has at least one incrementing obs

I have a data set that holds an id, location, start year, end year, age1 and age2. For each group defined by id, location, age1 and age2, I would like to create a new start and end year. For instance, I may have three entries for China encompassing age 0 - age 4. One will be 2000 - 2000, another 2001 - 2001, and the final one 2005 - 2005. Since the years increment by 1 in the first two entries, I'd want their corresponding newstart and newend to be 2000 - 2001. The third entry would have newstart == 2005 and newend == 2005, as it is not part of a continuous run of years.
The data table I have resembles the following, except it has thousands of entries and many combinations:
id location start end age1 age2
1 brazil 2000 2000 0 4
1 brazil 2001 2001 0 4
1 brazil 2002 2002 0 4
2 argentina 1990 1991 1 1
2 argentina 1991 1991 2 2
2 argentina 1992 1992 2 2
2 argentina 1993 1993 2 2
3 belize 2001 2001 0.5 1
3 belize 2005 2005 1 2
I want to alter the data table so that it looks like the following:
id location start end age1 age2 newstart newend
1 brazil 2000 2000 0 4 2000 2002
1 brazil 2001 2001 0 4 2000 2002
1 brazil 2002 2002 0 4 2000 2002
2 argentina 1990 1991 1 1 1991 1991
2 argentina 1991 1991 2 2 1991 1993
2 argentina 1992 1992 2 2 1991 1993
2 argentina 1993 1993 2 2 1991 1993
3 belize 2001 2001 0.5 1 2001 2001
3 belize 2005 2005 1 2 2005 2005
I have tried creating a variable that tracks the difference between the current year and the previous year using lag. I then created newstart and newend by taking the min start and max end. I have found that this only works for runs of 2 continuous years; for longer runs it fails, as it has no way of tracking the number of observations over which the years increase by 1 within each grouping. I believe I need some type of loop.
Is there a more efficient way to accomplish this?
data.table
You tagged with data.table, so my first suggestion is this:
library(data.table)
dat[, contiguous := cumsum(c(TRUE, diff(start) != 1)), by = .(id)]  # new group at every gap != 1
dat[, c("newstart", "newend") := .(min(start), max(end)), by = .(id, contiguous)]
dat[, contiguous := NULL]
dat
# id location start end age1 age2 newstart newend
# 1: 1 brazil 2000 2000 0.0 4 2000 2002
# 2: 1 brazil 2001 2001 0.0 4 2000 2002
# 3: 1 brazil 2002 2002 0.0 4 2000 2002
# 4: 2 argentina 1990 1991 1.0 1 1990 1993
# 5: 2 argentina 1991 1991 2.0 2 1990 1993
# 6: 2 argentina 1992 1992 2.0 2 1990 1993
# 7: 2 argentina 1993 1993 2.0 2 1990 1993
# 8: 3 belize 2001 2001 0.5 1 2001 2001
# 9: 3 belize 2005 2005 1.0 2 2005 2005
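For reproducibility, dat can be built from the question's table like this (a construction sketch of mine; the original answer does not show this step):
library(data.table)
dat <- fread("id location start end age1 age2
1 brazil 2000 2000 0 4
1 brazil 2001 2001 0 4
1 brazil 2002 2002 0 4
2 argentina 1990 1991 1 1
2 argentina 1991 1991 2 2
2 argentina 1992 1992 2 2
2 argentina 1993 1993 2 2
3 belize 2001 2001 0.5 1
3 belize 2005 2005 1 2")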
base R
If instead you really just mean data.frame, then
dat <- transform(dat, contiguous = ave(start, id, FUN = function(a) cumsum(c(TRUE, diff(a) != 1))))
dat <- transform(dat,
newstart = ave(start, id, contiguous, FUN = min),
newend = ave(end , id, contiguous, FUN = max)
)
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
dat
# id location start end age1 age2 newstart newend contiguous
# 1 1 brazil 2000 2000 0.0 4 2000 2002 1
# 2 1 brazil 2001 2001 0.0 4 2000 2002 1
# 3 1 brazil 2002 2002 0.0 4 2000 2002 1
# 4 2 argentina 1990 1991 1.0 1 1990 1993 1
# 5 2 argentina 1991 1991 2.0 2 1990 1993 1
# 6 2 argentina 1992 1992 2.0 2 1990 1993 1
# 7 2 argentina 1993 1993 2.0 2 1990 1993 1
# 8 3 belize 2001 2001 0.5 1 2001 2001 1
# 9 3 belize 2005 2005 1.0 2 2005 2005 2
dat$contiguous <- NULL
An interesting point I just learned about ave: it uses interaction(...) (all grouping variables), which produces all possible combinations, not just those observed in the data. Because of that, FUN may be called with zero-length data. In this case it was, giving the warnings above. One could suppress this with function(a) suppressWarnings(min(a)) instead of just min.
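A tiny illustration of those zero-length calls (toy data of my own):
# Grouping combination (2, 2) never occurs in the data,
# so min() gets called on numeric(0) for that empty cell:
ave(1:3, c(1, 1, 2), c(1, 2, 1), FUN = min)
# [1] 1 2 3
# Warning message:
# In FUN(X[[i]], ...) : no non-missing arguments to min; returning Inf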
We could use dplyr. After grouping by 'id', take the difference between 'start' and the lag of 'start', apply rleid to get the run-length id, and create 'newstart' and 'newend' as the min of 'start' and the max of 'end':
library(dplyr)
library(data.table)
library(tidyr)  # replace_na() comes from tidyr

df1 %>%
  group_by(id) %>%
  group_by(grp = rleid(replace_na(start - lag(start), 1)),
           .add = TRUE) %>%
  mutate(newstart = min(start), newend = max(end))
-output
# A tibble: 9 x 9
# Groups: id, grp [4]
# id location start end age1 age2 grp newstart newend
# <int> <chr> <int> <int> <dbl> <int> <int> <int> <int>
#1 1 brazil 2000 2000 0 4 1 2000 2002
#2 1 brazil 2001 2001 0 4 1 2000 2002
#3 1 brazil 2002 2002 0 4 1 2000 2002
#4 2 argentina 1990 1991 1 1 1 1990 1993
#5 2 argentina 1991 1991 2 2 1 1990 1993
#6 2 argentina 1992 1992 2 2 1 1990 1993
#7 2 argentina 1993 1993 2 2 1 1990 1993
#8 3 belize 2001 2001 0.5 1 1 2001 2001
#9 3 belize 2005 2005 1 2 2 2005 2005
Or with data.table
library(data.table)
library(tidyr)  # replace_na() comes from tidyr

setDT(df1)[, grp := rleid(replace_na(start - shift(start), 1))
           ][, c('newstart', 'newend') := .(min(start), max(end)), .(id, grp)
           ][, grp := NULL]
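If you prefer to stay entirely within data.table, nafill() can stand in for tidyr's replace_na() (my variation, not part of the original answer):
library(data.table)
setDT(df1)[, grp := rleid(nafill(start - shift(start), fill = 1))  # fill the leading NA with 1
           ][, c('newstart', 'newend') := .(min(start), max(end)), .(id, grp)
           ][, grp := NULL]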

Trying to convert data from long format to wide format

My data frame currently looks like this:
country_txt Year nkill_yr Countrycode Population deathsPer100k
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 Afghanistan 1988 128 4 11541 1.109089e-04
5 Afghanistan 1989 10 4 11778 8.490406e-06
6 Afghanistan 1990 12 4 12249 9.796718e-06
It contains a list of all countries and the terrorist deaths per 100,000 population.
Ideally I would like a data frame in wide format with this structure:
country_txt 1970 1971 1972 1973 1974 1975
Afghanistan 3.98 1.1 0 4.3 0.8 0.09
Albania 0 0.4 0.5 0 0 0
Algeria 0 0 0 0.1 0.2 0
Angola 0 0.3 0 0 0 0
Except my output currently repeats each country, like this:
YearCountryRatio <- spread(data = YearCountryRatio, Year, deathsPer100k)
country_txt 1970 1971 1972 1973
Afghanistan 3.98 NA NA NA
Afghanistan NA 1.1 NA NA
Afghanistan NA NA 0 NA
Afghanistan NA NA NA 4.3
and similarly for other countries.
Is there any way to either:
collapse all of the NA values to show only one row per country, or
put it directly into wide format?
I've assumed you want each country_txt value reduced to a single row and are happy to drop the unused variables. (Note: I added a dummy country_txt value of "XYZ" to the sample data to show how multiple countries spread)
library(dplyr)
library(tidyr)
df <- read.table(text = "country_txt Year nkill_yr Countrycode Population deathsPer100k
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 XYZ 1988 128 4 11541 1.109089e-04
5 XYZ 1989 10 4 11778 8.490406e-06
6 XYZ 1990 12 4 12249 9.796718e-06", header = TRUE)
df <- mutate(df, deathsPer100k = round(deathsPer100k*100000, 2))
select(df, country_txt, Year, deathsPer100k) %>% spread(Year, deathsPer100k, fill = 0)
#> country_txt 1973 1979 1987 1988 1989 1990
#> 1 Afghanistan 0 3.98 0 0.00 0.00 0.00
#> 2 XYZ 0 0.00 0 11.09 0.85 0.98
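In current tidyr (1.0.0 and later), spread() is superseded by pivot_wider(); an equivalent call on the same df would be:
select(df, country_txt, Year, deathsPer100k) %>%
  pivot_wider(names_from = Year, values_from = deathsPer100k, values_fill = 0)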

Identifying and removing rows that don't have a preceding or succeeding year of data (years where there is no surrounding data)

This may seem trivial and have a trivial answer, but it's not coming to me:
So I have an example table below:
Year SINDEX
1976 0
1981 16
1982 85
1983 135
1984 141
1986 42
1988 6
1989 0
1990 0
1991 0
1992 0
1994 0
2002 1
2003 3
2004 10
2005 36
and I would like it to look like this:
Year SINDEX
1981 16
1982 85
1983 135
1984 141
1988 6
1989 0
1990 0
1991 0
1992 0
2002 1
2003 3
2004 10
2005 36
having removed the years 1976, 1986 and 1994.
I know how to remove rows; it's more about finding a neat way of identifying the rows that don't have any accompanying years of data.
Any help would be much appreciated.
Let's first put this data in a data frame.
tmp <- data.frame(matrix(c(1976, 0,
1981, 16,
1982, 85,
1983, 135,
1984, 141,
1986, 42,
1988, 6,
1989, 0,
1990, 0,
1991, 0,
1992, 0,
1994, 0,
2002, 1,
2003, 3,
2004, 10,
2005, 36), ncol = 2, byrow = TRUE))
One solution would be to create two auxiliary variables aux1 and aux2: the first encoding the preceding year and the second encoding the succeeding year:
aux1 <- tmp$X1 - 1
aux2 <- tmp$X1 + 1
Then you can simply condition on the logical that checks whether or not the preceding or succeeding year is included in the dataset:
tmp[aux1 %in% tmp$X1 | aux2 %in% tmp$X1, ]
which returns
X1 X2
2 1981 16
3 1982 85
4 1983 135
5 1984 141
7 1988 6
8 1989 0
9 1990 0
10 1991 0
11 1992 0
13 2002 1
14 2003 3
15 2004 10
16 2005 36
In case you're already familiar with dplyr (or planning to start using it), here's an alternative approach using the filter function with lead and lag:
library(dplyr)
df <- setNames(tmp, c("Year", "SINDEX"))  # the example data above, with named columns
filter(df, Year - lag(Year) == 1L | lead(Year) - Year == 1L)
# Year SINDEX
#1 1981 16
#2 1982 85
#3 1983 135
#4 1984 141
#5 1988 6
#6 1989 0
#7 1990 0
#8 1991 0
#9 1992 0
#10 2002 1
#11 2003 3
#12 2004 10
#13 2005 36
I should note that this approach assumes that the data is already sorted by year (as in the example).
Assuming the data DF is sorted by Year as in the question, examine successive triples (at the ends, partial = TRUE lets us look at pairs) and keep a row if at least one difference within its window equals 1:
library(zoo)
DF <- setNames(tmp, c("Year", "SINDEX"))  # the example data above, with named columns
DF[ rollapply(DF$Year, 3, function(x) 1 %in% diff(x), partial = TRUE), ]
This gives:
Year SINDEX
2 1981 16
3 1982 85
4 1983 135
5 1984 141
7 1988 6
8 1989 0
9 1990 0
10 1991 0
11 1992 0
13 2002 1
14 2003 3
15 2004 10
16 2005 36

How do I format row.names of an R table?

Consider this set of dates, x:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
x <- strftime(x, '%Y')
The following is a distribution of the years of those dates:
> table(x)
x
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994
4 4 3 3 6 4 3 4 5 12 1 1 1 2
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
9 4 2 1 4 4 2 1 4 1 4 3 4 3
2010
1
Now say I want to group them by decade. For this, I use the cut function:
> table(cut(x, seq(1980, 2010, 10)))
Error in cut.default(x, seq(1980, 2010, 10)) : 'x' must be numeric
Ok, so let's force x to numeric:
> table(cut(as.numeric(x), seq(1980, 2010, 10)))
(1.98e+03,1.99e+03] (1.99e+03,2e+03] (2e+03,2.01e+03]
45 28 23
Now, as you can see, the row.names of that table are in scientific notation. How do I force them out of scientific notation? I've tried wrapping the whole command above in format, formatC and prettyNum, but all those do is format the frequencies.
Thanks joran for pointing the way to the answer. I'll elaborate on it here for the record:
Changing cut's dig.lab parameter from the default 3 to 4 solved this particular mock-up as well as my real problem:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4))
(1980,1990] (1990,2000] (2000,2010]
45 28 23
By the way, in order for 1980 itself to be counted, one should also pass the include.lowest argument:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4, include.lowest = T))
[1980,1990] (1990,2000] (2000,2010]
49 28 23
Now it sums to 100! :)
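If you want full control over the bin names rather than the interval notation, cut also accepts an explicit labels argument (a variation of mine, not from the original thread):
> table(cut(as.numeric(x), seq(1980, 2010, 10), include.lowest = T,
+           labels = c('1980s', '1990s', '2000s')))
1980s 1990s 2000s
   49    28    23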
This doesn't exactly answer the question you asked, but shows you a possible alternative: use the fact that there is a cut.Date method:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
out <- table(cut(x, "10 years"))
out
#
# 1980-01-01 1990-01-01 2000-01-01 2010-01-01
# 48 25 26 1
Here, we also get what I would consider the "correct" values for each bin.
As a crude justification of my statement about "correct" values, consider the values we get when we manually calculate based on table:
y <- strftime(x, '%Y')
Tab <- table(y)
Tab
# y
# 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994 1995 1996
# 4 4 3 3 6 4 3 4 5 12 1 1 1 2 9 4
# 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2010
# 2 1 4 4 2 1 4 1 4 3 4 3 1
sum(Tab[grepl("198", names(Tab))])
# [1] 48
sum(Tab[grepl("199", names(Tab))])
# [1] 25
sum(Tab[grepl("200", names(Tab))])
# [1] 26
sum(Tab[grepl("201", names(Tab))])
# [1] 1

How to get standardized column for specific rows only? [duplicate]

Possible Duplicate:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
Related to How to get column mean for specific rows only?
I am trying to create a new column in my data frame that scales the "Score" column within sections defined by the "Quarter" column.
Score Quarter
98.7 QTR 1 2011
88.6 QTR 1 2011
76.5 QTR 1 2011
93.5 QTR 2 2011
97.7 QTR 2 2011
89.1 QTR 1 2012
79.4 QTR 1 2012
80.3 QTR 1 2012
It would look like this:
Unit Score Quarter Scale
6 98.7 QTR 1 2011 1.01
1 88.6 QTR 1 2011 .98
3 76.5 QTR 1 2011 .01
5 93.5 QTR 2 2011 2.0
6 88.6 QTR 2 2011 2.5
9 89.1 QTR 1 2012 2.2
1 79.4 QTR 1 2012 -.09
3 80.3 QTR 1 2012 -.01
3 98.7 QTR 1 2011 -2.2
I do not want to standardize the entire column, because I want to trend the data and truly see how units did relative to each other quarter to quarter, rather than use scale(data$Score), which would compare all points to each other regardless of quarter.
I've tried variants of something like this:
data$Score_Scale <- with (data, scale(Score), findInterval(QTR, c(-Inf,"2011-01-01","2011-06-30", Inf)), FUN= scale)
Using ave might be a good option here:
Get your data:
test <- read.csv(textConnection("Score,Quarter
98.7,Round 1 2011
88.6,Round 1 2011
76.5,Round 1 2011
93.5,Round 2 2011
97.7,Round 2 2011
89.1,Round 1 2012
79.4,Round 1 2012
80.3,Round 1 2012"),header=TRUE)
Then scale the data within each Quarter group:
test$score_scale <- ave(test$Score,test$Quarter,FUN=scale)
test
Score Quarter score_scale
1 98.7 Round 1 2011 0.96866054
2 88.6 Round 1 2011 0.05997898
3 76.5 Round 1 2011 -1.02863953
4 93.5 Round 2 2011 -0.70710678
5 97.7 Round 2 2011 0.70710678
6 89.1 Round 1 2012 1.15062301
7 79.4 Round 1 2012 -0.65927589
8 80.3 Round 1 2012 -0.49134712
Just to make it obvious that this works, here are the individual results for each Quarter group:
> as.vector(scale(test$Score[test$Quarter=="Round 1 2011"]))
[1] 0.96866054 0.05997898 -1.02863953
> as.vector(scale(test$Score[test$Quarter=="Round 2 2011"]))
[1] -0.7071068 0.7071068
> as.vector(scale(test$Score[test$Quarter=="Round 1 2012"]))
[1] 1.1506230 -0.6592759 -0.4913471
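For comparison, the same grouped scaling can be written with dplyr (my addition, not part of the original answer):
library(dplyr)
test %>%
  group_by(Quarter) %>%
  mutate(score_scale = as.vector(scale(Score))) %>%  # as.vector drops scale()'s matrix shape
  ungroup()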
