Create a Custom Function that Extracts Certain Rows - r

head(MYK)
X Analyte Subject Cohort DayNominal HourNominal Concentration uniqueID FS EF VTI deltaFS deltaEF deltaVTI HR
2 MYK-461 005-010 1 1 0.25 31.00 005-0100.25 31.82 64.86 0.00 3 -1 -100 58
3 MYK-461 005-010 1 1 0.50 31.80 005-0100.5 NA NA NA NA NA NA NA
4 MYK-461 005-010 1 1 1.00 9.69 005-0101 26.13 69.11 0.00 -15 6 -100 55
5 MYK-461 005-010 1 1 1.50 8.01 005-0101.5 NA NA NA NA NA NA NA
6 MYK-461 005-010 1 1 2.00 5.25 005-0102 NA NA NA NA NA NA NA
7 MYK-461 005-010 1 1 3.00 3.26 005-0103 29.89 60.99 23.49 -3 -7 9 55
105 MYK-461 005-033 2 1 0.25 3.4 005-0330.25 30.18 68.59 23.22 1 0 16 47
106 MYK-461 005-033 2 1 0.50 12.4 005-0330.5 NA NA NA NA NA NA NA
107 MYK-461 005-033 2 1 0.75 27.1 005-0330.75 NA NA NA NA NA NA NA
108 MYK-461 005-033 2 1 1.00 23.5 005-0331 32.12 69.60 21.06 7 2 5 43
109 MYK-461 005-033 2 1 1.50 16.8 005-0331.5 NA NA NA NA NA NA NA
110 MYK-461 005-033 2 1 2.00 15.8 005-0332 NA NA NA NA NA NA NA
organize = function(x, y) {
g1 = subset(x, Cohort == y)
g1 = aggregate(x[,'Concentration'], by=list(x[,'HourNominal']), FUN=mean)
g1 = setNames(g1, c('HourNominal', 'Concentration'))
g2 = aggregate(x[,'Concentration'], by=list(x[,'HourNominal']), FUN=sd)
g2 = setNames(g2, c('HourNominal', 'SD'))
g1[,'SD'] = g2$SD
g1$top = g1$Concentration + g1$SD
g1$bottom = g1$Concentration - g1$SD
return(g1)
}
I have a dataframe here, along with some code to subset the dataframe based on a certain Cohort, and to aggregate the Concentration based on Hour. However, all of the dataframes look the same.
CA1 = organize(MYK, 1)
CA2 = organize(MYK, 2)
Yet whenever I use these two commands, the two datasets are identical.
I want a dataset that looks like
HourNominal Concentration SD top bottom
1 0.25 27.287500 25.112204 52.399704 2.1752958
2 0.50 41.989722 32.856013 74.845735 9.1337094
3 0.75 49.866667 22.485254 72.351921 27.3814122
4 1.00 107.168889 104.612098 211.780987 2.5567908
5 1.50 191.766389 264.375466 456.141855 -72.6090774
6 1.75 319.233333 290.685423 609.918757 28.5479100
7 2.00 226.785278 272.983234 499.768512 -46.1979560
8 2.25 341.145833 301.555769 642.701602 39.5900645
9 2.50 341.145833 319.099679 660.245512 22.0461542
10 3.00 195.303333 276.530533 471.833866 -81.2271993
11 4.00 107.913889 140.251991 248.165880 -32.3381024
12 6.00 50.174167 64.700785 114.874952 -14.5266184
13 8.00 38.132639 47.099796 85.232435 -8.9671572
14 12.00 31.404444 39.667850 71.072294 -8.2634051
15 24.00 33.488583 41.267392 74.755975 -7.7788087
16 48.00 29.304833 38.233776 67.538609 -8.9289422
17 72.00 7.322792 6.548898 13.871690 0.7738932
18 96.00 7.002833 6.350251 13.353085 0.6525821
19 144.00 6.463875 5.612630 12.076505 0.8512452
20 216.00 5.007792 4.808156 9.815948 0.1996353
21 312.00 3.964727 4.351626 8.316353 -0.3868988
22 480.00 2.452857 3.220947 5.673804 -0.7680897
23 648.00 1.826625 2.569129 4.395754 -0.7425044
The problem is that even when I try to separate the values by Cohort, the two dataframes have the same content. They should not be identical.
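A likely cause, judging only from the code shown: the result of subset() is stored in g1, but both aggregate() calls are then run on the full x, so the Cohort filter is discarded. Below is a minimal sketch of a corrected version. The toy data frame is made up for illustration and only mimics the relevant columns of MYK.

```r
# Sketch of a corrected organize(): aggregate the cohort subset, not the full data
organize <- function(x, y) {
  # keep only the rows of the requested cohort
  g <- x[which(x$Cohort == y), ]
  # mean and SD of Concentration per nominal hour, computed on the subset
  g1 <- aggregate(Concentration ~ HourNominal, data = g, FUN = mean)
  g2 <- aggregate(Concentration ~ HourNominal, data = g, FUN = sd)
  g1$SD <- g2$Concentration
  g1$top <- g1$Concentration + g1$SD
  g1$bottom <- g1$Concentration - g1$SD
  g1
}

# Hypothetical toy data standing in for MYK
toy <- data.frame(
  Cohort = c(1, 1, 1, 1, 2, 2, 2, 2),
  HourNominal = c(0.25, 0.25, 0.5, 0.5, 0.25, 0.25, 0.5, 0.5),
  Concentration = c(10, 20, 30, 50, 1, 3, 5, 9)
)
CA1 <- organize(toy, 1)
CA2 <- organize(toy, 2)
```

With this change CA1 and CA2 are built from different rows, so they should no longer be identical.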

Related

Fill Timeseries with founded values or keep NA if no value was found yet

I had a long-format time series and reshaped it to wide format for forecasting purposes. Currently the time series has the following structure:
Day   Value  Strength1  Strength2  Strength3
1/2   1.356  3          NA         NA
2/2   1.385  NA         NA         NA
3/2   1.385  NA         1.01       NA
4/2   1.4    NA         NA         10
5/2   1.6    NA         NA         NA
6/2   1.7    4          NA         NA
7/2   1.8    NA         1.05       NA
8/2   1.88   NA         NA         NA
9/2   1.98   NA         NA         11
10/2  1.8    NA         NA         NA
I want a function that, given a time series, loops through the columns and handles each cell as follows:
if the cell is NA and only NAs have been found so far in the column, keep NA
if the cell is not NA, keep its value
if the cell is NA but a non-NA value was found earlier in the column, replace it with the most recently found value
This would be the result:
Day   Value  Strength1  Strength2  Strength3
1/2   1.356  3          NA         NA
2/2   1.385  3          NA         NA
3/2   1.385  3          1.01       NA
4/2   1.4    3          1.01       10
5/2   1.6    3          1.01       10
6/2   1.7    4          1.01       10
7/2   1.8    4          1.05       10
8/2   1.88   4          1.05       10
9/2   1.98   4          1.05       11
10/2  1.8    4          1.05       11
I tried this function but it isn't right :
filler <- function(df) {
col <- colnames(df)
one <- NA
for (i in col) {
for (a in i) {
if(!is.na(a)) {
one = a
}
if(!is.na(one) & is.na(a)) {
a = one
}
}
}
}
You may use tidyr::fill -
filler <- function(data) tidyr::fill(data, dplyr::everything())
filler(df)
# Day Value Strength1 Strength2 Strength3
#1 1/2 1.356 3 NA NA
#2 2/2 1.385 3 NA NA
#3 3/2 1.385 3 1.01 NA
#4 4/2 1.400 3 1.01 10
#5 5/2 1.600 3 1.01 10
#6 6/2 1.700 4 1.01 10
#7 7/2 1.800 4 1.05 10
#8 8/2 1.880 4 1.05 10
#9 9/2 1.980 4 1.05 11
#10 10/2 1.800 4 1.05 11
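If adding a package dependency is not wanted, the same forward fill can be sketched in base R. The helper names fill_down and filler_base are made up for this illustration. The idea is to count how many non-NA values have been seen so far and use that count to look up the most recent one.

```r
# Forward-fill one vector: leading NAs stay NA, later NAs take the last seen value
fill_down <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen up to each position
  c(NA, x[!is.na(x)])[seen + 1]    # position 1 is the NA used before any value is seen
}

# Apply the fill to every column while keeping the data frame structure
filler_base <- function(df) {
  df[] <- lapply(df, fill_down)
  df
}

# Small example with the same pattern as the question's columns
example_df <- data.frame(
  Strength1 = c(3, NA, NA, 4, NA),
  Strength2 = c(NA, NA, 1.01, NA, NA)
)
filled <- filler_base(example_df)
```

Unlike the loop in the question, this returns the modified data frame rather than assigning to a loop variable, which never changes the original object.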

Quarterly year-to-year changes

I have a quarterly time series. I am trying to apply a function which is supposed to calculate the year-to-year growth and the year-to-year difference, and to multiply a variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes and it worked.
I modified this function for yoy changes, but it does not have any effect on my data frame, and no error popped up.
Do you have any suggestions on how to modify the function, or how to apply the yoy change function to a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply following changes on the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the function that I am trying to use for my data frame.
transformation = function(D,abs.diff,percent.change,negative)
{
TT <- dim(D)[1]
DData <- D[-1,]
nms <- c()
for (i in c(2:dim(D)[2])) {
# yoy absolute difference change
if (names(D)[i] %in% abs.diff)
{ DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
names(DData)[i] = paste('a',names(D)[i],sep='') }
# yoy percent. change
if (names(D)[i] %in% percent.change)
{ DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
names(DData)[i] = paste('p',names(D)[i],sep='') }
#CA.deficit
if (names(D)[i] %in% negative)
{ DData[,i] = (-1)*D[1:TT,i] }
}
return(DData)
}
This is what I would like to get :
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Group by month, i.e. the 6th and 7th characters of Date, using ave(), and do the necessary calculations within each group. With sapply() we can loop over the columns.
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
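The answer above applies the percentage change to every column. If a different transformation is wanted per column, as in the question, one option is to compare each value with the value four rows earlier. This is a sketch only: the function name yoy_transform and the lag argument are made up here, and it assumes the rows are complete quarters in date order.

```r
# Year-over-year transformations chosen per column (lag = 4 rows for quarterly data)
yoy_transform <- function(D, abs.diff, percent.change, negative, lag = 4) {
  # value from `lag` rows earlier, padded with NA at the start
  lagged <- function(x) c(rep(NA, lag), head(x, -lag))
  out <- D
  for (nm in names(D)) {
    if (nm %in% abs.diff) {
      out[[nm]] <- D[[nm]] - lagged(D[[nm]])                           # absolute change
    } else if (nm %in% percent.change) {
      out[[nm]] <- 100 * (D[[nm]] - lagged(D[[nm]])) / lagged(D[[nm]]) # percent change
    } else if (nm %in% negative) {
      out[[nm]] <- -D[[nm]]                                            # sign flip
    }
  }
  # prefix transformed columns as in the question: 'a' = absolute, 'p' = percent
  nms <- names(D)
  is_abs <- nms %in% abs.diff
  is_pct <- nms %in% percent.change
  names(out)[is_abs] <- paste0("a", nms[is_abs])
  names(out)[is_pct] <- paste0("p", nms[is_pct])
  out
}

# Hypothetical toy data: two years of quarterly values
toy <- data.frame(
  Date = paste0("Q", 1:8),
  B1 = c(1, 2, 3, 4, 11, 12, 13, 14),
  B3 = c(10, 10, 10, 10, 20, 30, 40, 50),
  B4 = -(1:8)
)
res <- yoy_transform(toy, abs.diff = "B1", percent.change = "B3", negative = "B4")
```

The output keeps all rows of the input, with NA in the first four rows of the differenced columns, which matches the shape of the desired result in the question.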

Pick the color of choice (HSV or HCL or RGB) for individual legend values using ggplot

I have a dataset something like this
data <- read.table(text = "Me EE PE DE TE DEE CE
1 1 1 4.5 2000 0.50 0.2547 0.69
2 1 2 2.4 3000 NA 0.5896 2.56
3 1 3 6.5 2345 15.24 NA 1.85
4 1 4 NA NA 18.23 1.2594 2.06
5 2 1 2.6 NA 12.25 1.5943 2.34
6 2 2 NA 3145 10.25 NA NA
7 2 3 2.7 4235 NA NA 2.90
8 2 4 NA NA 6.32 2.5990 3.18
9 3 1 3.5 NA 8.25 2.9339 3.46
10 3 2 3.8 NA NA NA NA
11 3 3 NA NA NA 3.6037 1.58
12 3 4 4.4 4325 NA 3.9386 4.30
13 4 1 4.7 NA 15.24 4.2735 4.58
14 4 2 NA 4325 6.66 NA NA
15 4 3 5.3 NA 25.20 NA NA
16 4 4 5.6 3256 NA 5.2782 5.42
17 5 1 NA 4351 25.36 5.6131 5.70
18 5 2 6.2 2345 NA 5.9480 NA
19 5 3 6.5 NA 19.36 NA NA
20 5 4 NA 4643 17.25 6.6178 6.54", header = T)
The dataset can also be found here: Dataset
I am trying to plot a ggplot2 point graph using
library(ggplot2)
ggplot(data,aes(x=EE,y=PE)) + geom_point( aes(color = factor(Me)))
and the plot is
I want to pick colors of my choice, specified as RGB, HSV or HCL, and assign them to the individual levels of factor(Me). How can I do that?
How can I assign specific colors to the unique(data$Me) values, like c(red, blue, green, yellow, orange, black, brown, magenta, ...)?
Thanks in advance.
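Add this to the plot, giving one color per level of factor(Me) (five levels here; fewer values than levels will raise an error):
+ scale_colour_manual(values = c('red', 'green', 'blue', 'orange', 'black'))
You can also use hex codes.
+ scale_colour_manual(values = c('#FF0000', '#00FF00', '#0000FF', '#FFA500', '#000000'))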
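To pin a particular color to a particular level, and to build colors from RGB, HSV or HCL coordinates, a named vector can be passed to scale_colour_manual(). The sketch below uses the base R functions rgb(), hsv() and hcl(), which all return hex strings. The specific colors and the toy data frame are arbitrary choices for illustration, and the requireNamespace() guard is only there so the snippet also runs where ggplot2 is not installed.

```r
# One color per level of factor(Me); the names must match the factor levels
my_cols <- c(
  "1" = "red",                          # named color
  "2" = "#0000FF",                      # hex code
  "3" = rgb(0, 0.6, 0),                 # RGB, components in [0, 1]
  "4" = hsv(h = 0.1, s = 1, v = 1),     # HSV
  "5" = hcl(h = 300, c = 80, l = 50)    # HCL
)

# Hypothetical toy data with the same columns as used in the plot
toy <- data.frame(
  Me = rep(1:5, each = 2),
  EE = rep(1:2, times = 5),
  PE = c(4.5, 2.4, 2.6, 2.7, 3.5, 3.8, 4.7, 5.3, 6.2, 6.5)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = EE, y = PE)) +
    geom_point(aes(color = factor(Me))) +
    scale_colour_manual(values = my_cols, name = "Me")
  print(p)
}
```

Using names rather than position means the mapping stays correct even if a level is missing from a subset of the data.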

Combine facet_wrap of line plot(1st dataset) with facet_wrap of point plot(2nd dataset) on single plot

I have two datasets, which I call 1. data and 2. data_other.
1.data
data <- read.table(text = "Me EE PE DE TE DEE CE
1 1 1 4.5 2000 0.50 0.2547 0.69
2 1 2 2.4 3000 NA 0.5896 2.56
3 1 3 6.5 2345 15.24 NA 1.85
4 1 4 NA NA 18.23 1.2594 2.06
5 2 1 2.6 NA 12.25 1.5943 2.34
6 2 2 NA 3145 10.25 NA NA
7 2 3 2.7 4235 NA NA 2.90
8 2 4 NA NA 6.32 2.5990 3.18
9 3 1 3.5 NA 8.25 2.9339 3.46
10 3 2 3.8 NA NA NA NA
11 3 3 NA NA NA 3.6037 1.58
12 3 4 4.4 4325 NA 3.9386 4.30
13 4 1 4.7 NA 15.24 4.2735 4.58
14 4 2 NA 4325 6.66 NA NA
15 4 3 5.3 NA 25.20 NA NA
16 4 4 5.6 3256 NA 5.2782 5.42
17 5 1 NA 4351 25.36 5.6131 5.70
18 5 2 6.2 2345 NA 5.9480 NA
19 5 3 6.5 NA 19.36 NA NA
20 5 4 NA 4643 17.25 6.6178 6.54", header = T)
2.data_other
Me EE PE DE TE DEE CE
1 1 1.3 1452 12.5 0.2587 1.25
1 2 1.7 2458 10.8 1.5469 0.69
1 3 0.8 3524 11.96 0.5874 0.87
1 4 0.7 1905 13.58 0.9654 0.98
2 1 0.5 2941 12.54 0.2548 1.65
2 2 0.2 3183.5 13.134 1.3658 1.11
2 3 0.01 3426 13.42 0.48692 1.320666667
2 4 1.52 3668.5 13.706 1.9547 1.386095238
3 1 1.98 3911 13.992 0.36906 1.45152381
3 2 2.1 4153.5 14.278 0.31013 1.516952381
3 3 2.44 6674 15.96 0.564 0.2512
3 4 2.736 6671 16.5 1.85 1.5687
4 1 1.05 4881 15.136 0.13334 1.713238095
4 2 1.52 5123.5 15.422 0.07441 1.778666667
4 3 0.95 5366 15.708 0.01548 1.844095238
4 4 1.073 5608.5 15.994 1.2548 1.90952381
5 1 1.0233 5851 16.28 1.9658 1.974952381
5 2 0.9733 6093.5 16.566 2.2154 2.040380952
5 3 0.9233 6336 16.852 1.2587 2.105809524
5 4 2.45 6578.5 17.138 1.9657 2.171238095
I am plotting a point graph with data, using the following code:
library(reshape2)  # provides melt()
library(ggplot2)
data <- read.table("data.txt")
datad <- melt(data,id.vars = c("Me","EE"),measure.vars= c("PE","DE","TE","DEE","CE"))
ggplot(subset(datad, !is.na(EE)),aes(x=EE,y=value), as.table = T) +
geom_point(aes(color=factor(Me)), size = 3) +
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")
I am plotting a line plot with data_other, using the following code:
data_other <- read.table("data_other.txt", header = T)
data_otherd <- melt(data_other, id.vars = c("Me","EE"), measure.vars = c("PE","DE","TE","DEE","CE"))
ggplot(subset(data_otherd, !is.na(EE)), aes(x=EE, y=value), as.table = T) +
geom_line(aes(color = factor (Me))) +
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")
Result.
data (point graph)
data_other (line graph)
I am trying to combine both datasets on the same plot (meaning the data_other line plot should be overlaid on the data point plot, or vice versa).
Both datasets have the same number of columns, and the column names are the same.
Without facet_wrap, I know I can give each geom its own dataset and draw the line plot on top of the point graph. But with facet_wrap I don't know how to combine the two datasets.
How can I put two datasets with different geoms onto a single faceted plot?
Please help. Thanks in advance.
You need just one call to ggplot() with the first data frame, and then add geom_line() with data=data_otherd. As all variable names are the same, geom_line() will use the same mapping that is provided in aes() of ggplot() (so also move color=factor(Me) into the aes() of the ggplot() call).
ggplot(datad,aes(x=EE,y=value,color=factor(Me)))+
geom_point()+
geom_line(data=data_otherd)+
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")

selecting column N of matrix/dataframe where N is based on another vector

I have a dataframe X with several columns and want to select column N for each row, where N differs from row to row according to some vector (in this example, the values in column sel).
A B C D sel
16/04/2012 NA -1.25 -1.25 0.25 1
17/04/2012 NA 20 21.25 17.25 1
18/04/2012 -5.25 -5.25 -5.75 -1 2
19/04/2012 -6 -6 -6.25 -12 2
20/04/2012 2.5 2.5 2.75 NA 2
23/04/2012 NA -12.25 -12 NA 2
24/04/2012 NA 7.25 7.5 7.25 2
25/04/2012 NA 17.5 17 18.25 4
26/04/2012 NA 9.5 10 11.5 4
27/04/2012 NA 2 1 -3.25 4
30/04/2012 NA -4.75 -4 -1 4
01/05/2012 NA 6.25 5.75 17 3
02/05/2012 NA -3 -2.75 -16 3
03/05/2012 NA -11.5 -11.5 -6.75 4
04/05/2012 NA -23.5 -23.75 -23 4
So I would end up with
16/04/2012 NA
17/04/2012 NA
18/04/2012 -5.25
19/04/2012 -6
20/04/2012 2.5
23/04/2012 -12.25
24/04/2012 7.25
25/04/2012 18.25
26/04/2012 11.5
27/04/2012 -3.25
30/04/2012 -1
01/05/2012 5.75
02/05/2012 -2.75
03/05/2012 -6.75
04/05/2012 -23
X[,X$sel]
gave me a square matrix with nrow(X) rows and columns, which is not quite what I need.
Is there some sort of Excel "INDEX"-type function I could use, maybe inside an apply function?
You can subset a data frame by passing a two-column matrix, with row numbers in the first column and column numbers in the second column. So:
X[matrix(ncol=2, c(1:nrow(X), X$sel))]
will give you a vector of those selected elements, which you can then build into whatever result data frame you're aiming for. Or just add it to the existing data frame like this:
X$selected_values <- X[matrix(ncol=2, c(1:nrow(X), X$sel))]
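A small self-contained sketch of the matrix-indexing idea, using three rows taken from the question. It assumes the dates are row names rather than a column. If the dates were a regular column, the column numbers in sel would be off by one and the values would be coerced to character, which is why the value columns are selected explicitly here.

```r
# Three rows from the question, with the dates stored as row names (an assumption)
X <- data.frame(
  A = c(NA, -5.25, NA),
  B = c(-1.25, -5.25, 17.5),
  C = c(-1.25, -5.75, 17),
  D = c(0.25, -1, 18.25),
  sel = c(1, 2, 4),
  row.names = c("16/04/2012", "18/04/2012", "25/04/2012")
)

# Two-column index matrix: (row number, column number) for each element to pick
idx <- cbind(seq_len(nrow(X)), X$sel)

# Index only the value columns so that sel refers to A..D by position
X$selected_values <- as.matrix(X[c("A", "B", "C", "D")])[idx]
```

For these three rows the picked values are A, B and D respectively, matching the first, third and eighth lines of the expected output in the question.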
