How to get standardized column for specific rows only? [duplicate] - r

Possible Duplicate:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
Related to How to get column mean for specific rows only?
I am trying to create a new column in my dataframe that scales the "Score" column within groups defined by the "Quarter" column.
Score Quarter
98.7 QTR 1 2011
88.6 QTR 1 2011
76.5 QTR 1 2011
93.5 QTR 2 2011
97.7 QTR 2 2011
89.1 QTR 1 2012
79.4 QTR 1 2012
80.3 QTR 1 2012
The result would look like this:
Unit Score Quarter Scale
6 98.7 QTR 1 2011 1.01
1 88.6 QTR 1 2011 .98
3 76.5 QTR 1 2011 .01
5 93.5 QTR 2 2011 2.0
6 88.6 QTR 2 2011 2.5
9 89.1 QTR 1 2012 2.2
1 79.4 QTR 1 2012 -.09
3 80.3 QTR 1 2012 -.01
3 98.7 QTR 1 2011 -2.2
I do not want to standardize the entire column, because I want to trend the data and truly see how units did relative to each other quarter to quarter; scale(data$Score) would compare all points to each other regardless of quarter.
I've tried variants of something like this:
data$Score_Scale <- with (data, scale(Score), findInterval(QTR, c(-Inf,"2011-01-01","2011-06-30", Inf)), FUN= scale)

Using ave might be a good option here:
Get your data:
test <- read.csv(textConnection("Score,Quarter
98.7,Round 1 2011
88.6,Round 1 2011
76.5,Round 1 2011
93.5,Round 2 2011
97.7,Round 2 2011
89.1,Round 1 2012
79.4,Round 1 2012
80.3,Round 1 2012"),header=TRUE)
Scale the data within each Quarter group:
test$score_scale <- ave(test$Score,test$Quarter,FUN=scale)
test
Score Quarter score_scale
1 98.7 Round 1 2011 0.96866054
2 88.6 Round 1 2011 0.05997898
3 76.5 Round 1 2011 -1.02863953
4 93.5 Round 2 2011 -0.70710678
5 97.7 Round 2 2011 0.70710678
6 89.1 Round 1 2012 1.15062301
7 79.4 Round 1 2012 -0.65927589
8 80.3 Round 1 2012 -0.49134712
Just to make it obvious that this works, here are the individual results for each Quarter group:
> as.vector(scale(test$Score[test$Quarter=="Round 1 2011"]))
[1] 0.96866054 0.05997898 -1.02863953
> as.vector(scale(test$Score[test$Quarter=="Round 2 2011"]))
[1] -0.7071068 0.7071068
> as.vector(scale(test$Score[test$Quarter=="Round 1 2012"]))
[1] 1.1506230 -0.6592759 -0.4913471

Related

How to adjust list interval in xticks?

I have data whose time series I want to plot. FRQ is drawn as a column bar and FILT as a line chart.
YEAR FRQ FILT
1960 1
1961 3
1962 1 1.416666667
1963 1 0.916666667
1964 0 0.833333333
1965 1 1.333333333
1966 3 1.75
1967 2 1.5
1968 0 0.833333333
1969 0 0.666666667
1970 1 1.166666667
1971 3 1.666666667
1972 1 1.833333333
1973 2 1.75
1974 2 1.5
1975 1 1
1976 0 0.5
1977 0 0.416666667
1978 1 0.833333333
1979 1 1.333333333
1980 3 1.5
1981 0 1.333333333
1982 2 1
1983 0 0.833333333
1984 1 0.75
1985 1 0.583333333
1986 0 0.5
1987 0 0.75
1988 2 1.166666667
1989 2 1.25
1990 0 0.916666667
1991 1 0.833333333
1992 0 1.25
1993 4 1.5
1994 0 1.416666667
1995 1 1.25
1996 2 1.416666667
1997 1 1.833333333
1998 3 2
1999 2 1.75
2000 1 1.166666667
2001 0 1.083333333
2002 1 1.666666667
2003 5 2
2004 0 1.75
2005 1 1.5
2006 2 1.75
2007 3 2.166666667
2008 1 2.333333333
2009 4 2.333333333
2010 1 2.25
2011 3 1.916666667
2012 1 1.5
2013 1 1.166666667
2014 1 0.916666667
2015 1 0.75
2016 0 0.666666667
2017 1 0.75
2018 1 0.833333333
2019 1
2020 0
My working code looks like this:
import pandas as pd
import matplotlib.pyplot as plt

#Read tropical cyclone frequency
TC = pd.read_csv(r'G:\TC_Atlas\data.csv', encoding="ISO-8859-1")
TC = pd.DataFrame(TC, columns=['YEAR','FRQ','FILT','FRQ2','FILT2','LMI','FILTLMI','LMI2','FILTLMI2'])
TC = TC[TC['YEAR'].between(1960, 2020, inclusive="both")]
#TC = TC.set_index('YEAR')
labels = [str(y) for y in range(1960, 2021)]
#Plot timeseries
TC['FRQ'].plot(kind='bar', color='lightgray', width=1, edgecolor='darkgray')
TC['FILT'].plot(kind='line',color='black')
plt.suptitle("TC Passage Frequency",fontweight='bold',y=0.95,x=0.53)
plt.title("Isabela (1960-2020)", pad=0)
L=plt.legend()
L.get_texts()[0].set_text('filtered')
plt.yticks(fontsize=12)
tickvalues = range(0,len(labels))
plt.xticks(ticks = tickvalues ,labels = labels, rotation = 30)
plt.xlabel('Year', color='black', fontsize=14, weight='bold',labelpad=10)
plt.ylabel('Frequency' , color='black', fontsize=14, weight='bold',labelpad=15)
plt.tight_layout()
plt.show()
Unfortunately, I cannot adjust the interval of the x-axis to put the xticks at every 4-year interval. I have been scouring for a possible solution. Kindly help. I use Jupyter Notebook in Python. Below is the sample output, but my goal is to make the xticks a 4-year interval.
You are explicitly adding ticks every year. The culprit is this line:
plt.xticks(ticks = tickvalues ,labels = labels, rotation = 30)
To make them less frequent, you can take advantage of list slicing like so:
plt.xticks(ticks=tickvalues[::4], labels=labels[::4], rotation=30)
If you need to shift them so that a specific year is listed, you can set the initial index in the slice as well (e.g. tickvalues[2::4]).
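The slicing itself is plain Python; a minimal standalone sketch (the labels/tickvalues here rebuild the lists from the question rather than reuse them):

```python
# Tick positions and labels for 1960-2020, as in the question
labels = [str(y) for y in range(1960, 2021)]
tickvalues = list(range(len(labels)))

# Keep every 4th entry; start the slice at 2 to shift which years appear
every4 = labels[::4]
shifted = labels[2::4]
print(every4[:3])   # ['1960', '1964', '1968']
print(shifted[:3])  # ['1962', '1966', '1970']
```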
EDIT: Since you are producing plots from a pd.DataFrame, a more sensible way would be to use a column/index for the ticks:
plt.xticks(TC['YEAR'][::4], rotation=30)
If your data is not converted properly, you might encounter weird ordering bugs, in which case make sure you convert the YEAR column to a numeric type with astype or to_numeric first (a detailed write-up can be found in this SO answer).
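As a small sketch of that conversion (the frame and its values here are a made-up stand-in for the question's CSV), string years sort lexically, so coercing them to numbers first keeps the ticks in order:

```python
import pandas as pd

# Hypothetical frame standing in for the YEAR column of the question's CSV
TC = pd.DataFrame({"YEAR": ["1964", "1960", "1962", "1961", "1963"]})

# Coerce the string years to numbers, then sort so ticks come out in order
TC["YEAR"] = pd.to_numeric(TC["YEAR"])
TC = TC.sort_values("YEAR").reset_index(drop=True)

ticks = TC["YEAR"][::4].tolist()  # every 4th year
print(ticks)  # [1960, 1964]
```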

Aggregate by specific year in R

Apologies if this question has already been dealt with on SO, but I cannot seem to find a quick solution as of yet.
I am trying to aggregate a dataset by a specific year. My data frame consists of hourly climate data over a period of 10 years.
head(df)
# day month year hour rain temp pressure wind
#1 1 1 2005 0 0 7.6 1016 15
#2 1 1 2005 1 0 8.0 1015 14
#3 1 1 2005 2 0 7.7 1014 15
#4 1 1 2005 3 0 7.8 1013 17
#5 1 1 2005 4 0 7.3 1012 17
#6 1 1 2005 5 0 7.6 1010 17
To calculate daily means from the above dataset, I use this aggregate function
g <- aggregate(cbind(temp, pressure, wind) ~ day + month + year, df, mean)
options(digits=2)
head(g)
# day month year temp pressure wind
#1 1 1 2005 6.6 1005 25
#2 2 1 2005 6.5 1018 25
#3 3 1 2005 9.7 1019 22
#4 4 1 2005 7.5 1010 25
#5 5 1 2005 7.3 1008 25
#6 6 1 2005 9.6 1009 26
Unfortunately, I get a huge dataset spanning the whole 10 years (2005 to 2014). I am wondering if anybody could help me tweak the above aggregate code so that I can summarise daily means for a specific year, as opposed to all of them in one sweep?
You can use the subset argument in aggregate
aggregate(cbind(temp, pressure, wind) ~ day + month + year, df,
          subset = year == 2005, mean)
dplyr also does it nicely.
library(dplyr)
df %>%
  filter(year == 2005) %>%
  group_by(day, month, year) %>%
  summarise_each(funs(mean), temp, pressure, wind)

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do here is find the monthly average and median for each of my factors, then order each group by month in a new data frame. So the new data frame would have a median and average for months 10, 11, 12 and 1 for Group 1 bundled together, and the next 4 rows would have the median and average for months 10, 11, 12 and 1 for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
                 Median = median(Values)),
          by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5

R How do I add a dataframe column whose values are derived from other column values in a DIFFERENT row?

The answer to this question might be simple but I can't seem to get around it.
I have a dataset with years, treatments, treatment levels and a value (yield). Treatments include mineral (fertiliser), manure and compost. I would like to add a column with a reference value: the value (yield) of the mineral treatment at the given year and level. For example:
DF1<-data.frame(treatment = c("mineral","mineral", "manure","manure","compost","compost","mineral","mineral", "manure","manure", "compost","compost"),
year = c("1990","1990","1990","1990","1990","1990", "1991","1991","1991", "1991","1991","1991"),
level = c("1","2","1","2","1","2","1","2","1","2","1","2"),
value = c("1","2","1.1","2.2","1.3","2.5","3","4","3.2","4.4","3.5","4.8"))
DF1
treatment year level value
mineral 1990 1 1
mineral 1990 2 2
manure 1990 1 1.1
manure 1990 2 2.2
compost 1990 1 1.3
compost 1990 2 2.5
mineral 1991 1 3
mineral 1991 2 4
manure 1991 1 3.2
manure 1991 2 4.4
compost 1991 1 3.5
compost 1991 2 4.8
Mineral should be the referent. So I would like to add a column called ref which, for all treatments (manure, compost and mineral) in year 1990, holds 1 for level 1 and 2 for level 2. For the year 1991, the reference value should be 3 for level 1 and 4 for level 2, again for all treatments.
I would be very grateful if anybody could give me advice on this.
You could try
res <- do.call(rbind,
               lapply(split(DF1, list(DF1$year, DF1$level), drop = TRUE),
                      function(x) {
                        x$ref <- x$value[x$treatment == 'mineral']
                        x
                      }))
indx <- as.numeric(gsub(".*\\.", "", row.names(res)))
res1 <- res[order(indx),]
row.names(res1) <- NULL
res1
Or using data.table
library(data.table)
DT <- as.data.table(DF1)
DT1 <- DT[treatment=='mineral', list(ref=value), by=list(year, level)]
DT[,indx:=1:.N]
setkey(DT, year, level)
DT[J(DT1)][order(indx),][,indx:=NULL][]
# treatment year level value ref
#1: mineral 1990 1 1 1
#2: mineral 1990 2 2 2
#3: manure 1990 1 1.1 1
#4: manure 1990 2 2.2 2
#5: compost 1990 1 1.3 1
#6: compost 1990 2 2.5 2
#7: mineral 1991 1 3 3
#8: mineral 1991 2 4 4
#9: manure 1991 1 3.2 3
#10: manure 1991 2 4.4 4
#11: compost 1991 1 3.5 3
#12: compost 1991 2 4.8 4

How to remove subjects who have missing measurements in time series data?

I have data like the following:
ID Year Measurement
1 2009 5.6
1 2010 6.2
1 2011 4.5
2 2008 6.4
2 2009 5.2
3 2008 3.5
3 2010 5.6
4 2009 5.9
4 2010 2.2
4 2011 4.1
4 2012 5.5
Where subjects are measured over a few years with different starting and ending years. Subjects are also measured a different number of times. I want to remove subjects that are not measured every single year between their start and end measurement years. So, in the above data I would want subject 3 removed since they missed a measurement in 2009.
I thought about doing a for loop where I get the maximum and minimum value of Year for each unique ID, take the difference between them and add 1, then count the number of times each unique ID appears in the data and check whether the two are equal. This ought to work, but I feel like there has got to be a quicker, more efficient way to do this.
This will be easiest with the data.table package:
dt = data.table(df, key="Year")
dt[,Remove:=any(diff(Year) > 1),by=ID]
dt = dt[(!Remove)]
dt$Remove = NULL
ID Year Measurement
1: 1 2009 5.6
2: 1 2010 6.2
3: 1 2011 4.5
4: 2 2008 6.4
5: 2 2009 5.2
6: 4 2009 5.9
7: 4 2010 2.2
8: 4 2011 4.1
9: 4 2012 5.5
Here's an alternative
> ind <- aggregate(Year~ID, FUN=function(x) any(diff(x) > 1), data=df)$Year
> df[!df$ID %in% unique(df$ID)[ind], ]
ID Year Measurement
1 1 2009 5.6
2 1 2010 6.2
3 1 2011 4.5
4 2 2008 6.4
5 2 2009 5.2
8 4 2009 5.9
9 4 2010 2.2
10 4 2011 4.1
11 4 2012 5.5
You may try ave. My anonymous function is basically the pseudo code suggested in the question.
df[as.logical(ave(df$Year, df$ID, FUN = function(x) length(x) > max(x) - min(x))), ]
# ID Year Measurement
# 1 1 2009 5.6
# 2 1 2010 6.2
# 3 1 2011 4.5
# 4 2 2008 6.4
# 5 2 2009 5.2
# 8 4 2009 5.9
# 9 4 2010 2.2
# 10 4 2011 4.1
# 11 4 2012 5.5
