I'm new to this community but hope that someone will be able to help me with this issue:
I am trying to find the changes of plane efficiency data after the implementation of an intervention (nudge) in 2014. For this, I only need a single breakpoint in 2014 to be able to compare the phases before and after the implementation and then do the same with a control group.
Using fixed.psi = 2014 I was able to fix the first psi. However, R identifies another breakpoint later, in 2016, which should not be included. What I am trying to create is a plot that shows the linear regression from 2009 through 2014, followed by an independent linear regression from 2014 through 2019.
Here's my code so far:
# input data:
year period nudge fcost. fconsumption ask distance
1 2009 1 0 NA 396176200 34468768133 NA
2 2010 2 0 NA 403415300 33502639755 NA
3 2011 3 0 NA 381698000 35648670708 NA
4 2012 4 0 NA 409338200 37250324313 NA
5 2013 5 0 NA 398479550 39405973517 NA
6 2014 6 1 NA 406376750 40978703492
# packages used below
library(segmented)
library(ggplot2)
# get the data
setwd("/Users/username/R/Thesis")
vaa_data <- read.csv("dataset.csv", na="NA", sep=";")
vaa_data <- vaa_data[-12,]
vaa_data$efficiency <- with(vaa_data, ask/fconsumption)
# create a linear regression from 2009 to 2019 (all data)
model1 <- lm(efficiency ~ year, data=vaa_data)
# create a segmented regression with 2014 as the fixed psi/breakpoint
seg1 <- segmented(obj = model1,
seg.Z = ~ year,
psi = 2014,
fixed.psi = 2014)
# get the fitted data
fitted <- fitted(seg1)
segmodel <- data.frame(Year = vaa_data$year, Efficiency = fitted)
# plot the fitted model
ggplot(segmodel, aes(x = Year, y = Efficiency)) + geom_line()
# output dput(vaa_data):
structure(list(year = 2009:2019, period = 1:11, nudge = c(0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L), fcost. = c(NA, NA, NA,
1012000000L, 979000000L, 854700000L, 525500000L, 435200000L,
548600000L, 697900000L, 686300000L), fconsumption = c(517301594L,
486740423L, 502363575L, 511000000L, 498000000L, 482299406L, 460164739L,
423060756L, 413466614L, 426236232L, 434442862L), ask = c(4.87e+10,
4.65e+10, 4.92e+10, 5.0466e+10, 5.033e+10, 4.871e+10, 4.8385e+10,
4.7175e+10, 4.6154e+10, 4.7747e+10, 4.8832e+10), distance = c(148440000L,
138140000L, 145130000L, 149230000L, 154480000L, 149990000L, 150230000L,
142910000L, 138790000L, 150840000L, 159260000L), efficiency = c(94.1423737426179,
95.5334667159954, 97.9370369358487, 98.7592954990215, 101.064257028112,
100.995355569648, 105.147126451164, 111.508806550707, 111.626908768987,
112.020040567551, 112.40143243509)), row.names = c(NA, 11L), class = "data.frame")
Any suggestions on what to do? Thanks so much in advance!
This is what I got so far (plot not shown here). The first part is perfect, but I need to 'skip' the second breakpoint.
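One possible way to get a single, fixed breakpoint without estimating any others is to leave out segmented() and fit the kink directly with lm(). This is only a minimal sketch, assuming the two segments should meet at 2014; the names bp and after are made up for illustration, and the efficiency values are copied from the dput() output above.

```r
# Efficiency values copied from the dput() output in the question
vaa_data <- data.frame(
  year = 2009:2019,
  efficiency = c(94.1423737426179, 95.5334667159954, 97.9370369358487,
                 98.7592954990215, 101.064257028112, 100.995355569648,
                 105.147126451164, 111.508806550707, 111.626908768987,
                 112.020040567551, 112.40143243509)
)

bp <- 2014  # the one breakpoint we want, fixed in advance (not estimated)

# Hinge term: 0 up to the breakpoint, years elapsed since the breakpoint after it
vaa_data$after <- pmax(vaa_data$year - bp, 0)

# Slope before 2014 is coef "year"; slope after 2014 is "year" + "after"
fit <- lm(efficiency ~ year + after, data = vaa_data)
vaa_data$fitted <- fitted(fit)

# Base graphics plot of the data and the two-segment fit
plot(vaa_data$year, vaa_data$efficiency, xlab = "Year", ylab = "Efficiency")
lines(vaa_data$year, vaa_data$fitted)
abline(v = bp, lty = 2)
```

If the line is also allowed to jump at 2014 (a level change as well as a slope change), adding the nudge dummy as a further predictor would be one way to allow for that. The same model can then be fitted to the control group for comparison.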
I have two dataframes in R, recurrent and L1HS. I am trying to find a way to do this:
If a sequence in recurrent matches sequence in L1HS, paste a value from a column in recurrent into new column in L1HS.
The recurrent dataframe looks like this:
> head(recurrent)
chr start end X Y level unique
1: chr4 56707846 56708347 0 38 03 chr4_56707846_56708347
2: chr1 20252181 20252682 0 37 03 chr1_20252181_20252682
3: chr2 224560903 224561404 0 37 03 chr2_224560903_224561404
4: chr5 131849595 131850096 0 36 03 chr5_131849595_131850096
5: chr7 46361610 46362111 0 36 03 chr7_46361610_46362111
6: chr1 20251169 20251670 0 36 03 chr1_20251169_20251670
The L1HS dataset contains many columns containing genetic sequence basepairs and a column "Sequence" that should hopefully have some matches with "unique" in the recurrent data frame, like so:
> head(L1HS$Sequence)
"chr1_35031657_35037706"
"chr1_67544575_67550598"
"chr1_81404889_81410942"
"chr1_84518073_84524089"
"chr1_87144764_87150794"
I know how to search for matches using
test <- recurrent$unique %in% L1HS$Sequence
to get the Booleans:
> head(test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
But I have a couple of problems from here. If the sequence is found, I want to copy the "level" value from the recurrent dataset to the L1HS dataset in a new column. For example, if the sequence "chr4_56707846_56708347" from the recurrent data was found in the full-length data, I'd like the full-length data frame to look like:
Sequence level other_columns
chr4_56707846_56708347 03 gggtttcatgaccc....
I was thinking of trying something like:
for (i in L1HS){
if (recurrent$unique %in% L1HS$Sequence{
L1HS$level <- paste(recurrent$level[i])}
}
but of course this isn't working and I can't figure it out.
I am wondering what the best approach is here! I'm wondering if merge/intersect/apply might be easier/better, or just what best practice might look like for a somewhat simple question like this. I've found some similar examples for Python/pandas, but am stuck here.
Thanks in advance!
You can do a simple left_join to add level to L1HS with dplyr.
library(dplyr)
L1HS %>%
left_join(., recurrent %>% select(unique, level), by = c("Sequence" = "unique"))
Or with merge:
merge(x=L1HS,y=recurrent[, c("unique", "level")], by.x = "Sequence", by.y = "unique",all.x=TRUE)
Output
Sequence level
1 chr1_35031657_35037706 4
2 chr1_67544575_67550598 2
3 chr1_81404889_81410942 NA
4 chr1_84518073_84524089 3
5 chr1_87144764_87150794 NA
*Note: This will still retain all the columns in L1HS. I just didn't create any additional columns in the example data below.
Data
recurrent <- structure(list(chr = c("chr4", "chr1", "chr2", "chr5", "chr7",
"chr1"), start = c(56707846L, 20252181L, 224560903L, 131849595L,
46361610L, 20251169L), end = c(56708347L, 20252682L, 224561404L,
131850096L, 46362111L, 20251670L), X = c(0L, 0L, 0L, 0L, 0L,
0L), Y = c(38L, 37L, 37L, 36L, 36L, 36L), level = c(3L, 2L, 3L,
3L, 3L, 4L), unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
"chr2_224560903_224561404", "chr5_131849595_131850096", "chr1_84518073_84524089",
"chr1_35031657_35037706")), class = "data.frame", row.names = c(NA,
-6L))
L1HS <- structure(list(Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
"chr1_81404889_81410942", "chr1_84518073_84524089", "chr1_87144764_87150794"
)), class = "data.frame", row.names = c(NA, -5L))
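If you would rather not join at all, base R's match() does the same lookup in one line. A small sketch using a cut-down copy of the example data above (only the columns needed for the lookup are recreated here):

```r
# Cut-down copies of the example data: only the lookup columns
recurrent <- data.frame(
  unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
             "chr2_224560903_224561404", "chr5_131849595_131850096",
             "chr1_84518073_84524089", "chr1_35031657_35037706"),
  level = c(3L, 2L, 3L, 3L, 3L, 4L)
)
L1HS <- data.frame(
  Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
               "chr1_81404889_81410942", "chr1_84518073_84524089",
               "chr1_87144764_87150794")
)

# match() gives, for each Sequence, the row of recurrent holding the same
# string (or NA if there is none); indexing level by it copies the value over
L1HS$level <- recurrent$level[match(L1HS$Sequence, recurrent$unique)]
L1HS
```

Unlike merge(), this keeps the rows of L1HS in their original order, and it only ever picks the first match if a sequence appears more than once in recurrent.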
I am trying to forecast multiple time series data that are present in a single data frame.
The dataframe df looks like below. The dput(df) is given below as well to reproduce quickly.
Date Group Value
01-04-2019 Saffron 62.78
01-04-2019 Green 75.65
01-05-2019 Saffron 67.89
01-06-2019 Saffron 54.56
01-06-2019 Green 77.00
01-07-2019 Green 71.22
structure(list(Date = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("01-04-2019", "01-05-2019", "01-06-2019", "01-07-2019"), class = "factor"),
Group = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("Green",
"Saffron"), class = "factor"), Value = c(62.78, 75.65, 67.89,
54.56, 77, 71.22)), .Names = c("Date", "Group", "Value"), class = "data.frame", row.names = c(NA, -6L))
Objective: I want to forecast for each Group using forecast package.
So my approach was as follows:
col_name_date <- "Date"
col_name_measure <- "Value"
col_name_sku_depo <- "Group"
dates_to_forecast <- 3
for (v in unique(as.character(df$Group))) {
temp <-subset(data,Group == v)
assign(paste0("df_",tolower(v)),temp)
temp <- temp [order(temp[, col_name_date]), ]
start_date <- as.Date(temp[1, col_name_date], date_format) #< ---library(lubridate)
ts_historic <- ts(temp[, col_name_measure],
start = c(year(start_date), month(start_date)),
frequency = 12)
# ---- Forecasting process using forecast package, omitted as it is out of scope ----
forecast_mean <- rep(NA, dates_to_forecast)
forecast_mean <- ts_forecast$mean
forecast_upper <- ts_forecast$upper
forecast_lower <- ts_forecast$lower
dates_all_mean <- as.numeric(c(as.numeric(ts_historic), as.numeric(forecast_mean)))
dates_all_lower <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_lower)))
dates_all_upper <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_upper)))
result <- data.frame(
MONTH = dates_all,
MEASURETYPE = date_types,
GROUP = v,
MEASURE = dates_all_mean,
MEASURELOWER = dates_all_lower,
MEASUREUPPER = dates_all_upper,
MODEL = model_descr)
}
The above code works fine for a single Group i.e. Saffron. But this doesn't produce the result for Green group.
I am looking for the following output:
MONTH MEASURETYPE GROUP MEASURE MEASUREUPPER MEASURELOWER MODEL
01-04-2019 Actual Saffron 62.78 NA NA Test
01-05-2019 Actual Saffron 67.89 NA NA Test
01-06-2019 Actual Saffron 54.56 NA NA Test
01-07-2019 Forecast Saffron 55.35 56.15 54.23 Test
01-08-2019 Forecast Saffron 57.29 58.15 56.39 Test
01-04-2019 Actual Green 75.65 NA NA Test
01-05-2019 Actual Green 77.00 NA NA Test
01-06-2019 Actual Green 71.22 NA NA Test
01-07-2019 Forecast Green 76.35 77.15 75.23 Test
01-08-2019 Forecast Green 73.29 74.29 72.30 Test
As you can see from the code, I am able to generate the above output only for Saffron.
How can I also add Green as shown in the above output?
Where I am missing out in for loop?
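A common pattern for this kind of loop is to store each group's data frame in a list and bind them together at the end, so that no iteration overwrites the previous one. The sketch below is only an illustration of that pattern: naive_forecast() is a made-up stand-in for the real forecasting step (it just repeats the last value), the date format "%d-%m-%Y" is an assumption based on the expected output, and the interval columns are left out.

```r
df <- data.frame(
  Date  = c("01-04-2019", "01-04-2019", "01-05-2019",
            "01-06-2019", "01-06-2019", "01-07-2019"),
  Group = c("Saffron", "Green", "Saffron", "Saffron", "Green", "Green"),
  Value = c(62.78, 75.65, 67.89, 54.56, 77.00, 71.22)
)
df$Date <- as.Date(df$Date, format = "%d-%m-%Y")  # assumed day-month-year

dates_to_forecast <- 2

# Hypothetical placeholder for the forecast package call: repeats the last value
naive_forecast <- function(x, h) rep(x[length(x)], h)

results <- list()
for (v in unique(as.character(df$Group))) {
  temp <- df[df$Group == v, ]
  temp <- temp[order(temp$Date), ]
  # months following the last observed month of this group
  future <- seq(max(temp$Date), by = "month",
                length.out = dates_to_forecast + 1)[-1]
  # store this group's result under its own name instead of overwriting
  results[[v]] <- data.frame(
    MONTH       = c(temp$Date, future),
    MEASURETYPE = rep(c("Actual", "Forecast"),
                      c(nrow(temp), dates_to_forecast)),
    GROUP       = v,
    MEASURE     = c(temp$Value, naive_forecast(temp$Value, dates_to_forecast))
  )
}

# stack the per-group data frames into one
result <- do.call(rbind, results)
rownames(result) <- NULL
result
```

The key difference from a plain assignment inside the loop is results[[v]] plus the final do.call(rbind, ...), which keeps the rows for every group.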
As part of a project, I am using R to analyze some data. I am stuck on deriving a few values from an existing dataset that I imported from a CSV file.
The file looks like:
For my analysis, I want to create another column holding the difference between the current value of x and its previous value. For the first row of each unique value of i, the new column should simply equal the current x. I am new to R and have tried various approaches for some time now, but I still can't figure out how to do this. Any suggestions on an approach would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note that diff() returns a vector one element shorter than its input, so it has to be padded before it can be added as a new column to your existing data frame. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should input an NA value as the first entry in your new column and the remaining values will be the difference between sequential values in your "x" column
UPDATE:
You can create a simple loop by subsetting through every unique instance of "i" and then calculating the difference between your x values
# initialize a new data frame
newdf = NULL
values = unique(MyData$i)
for(i in seq_along(values)){
  # rows belonging to the current value of "i"
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original data frame with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
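For what it's worth, base R's ave() can do the per-group difference without an explicit loop or an extra package. A minimal sketch on made-up numbers (the data below is invented for illustration), keeping the first x of each i as asked in the question:

```r
# Small invented data set with two groups in column i
MyData <- data.frame(
  t = 1:6,
  x = c(10, 15, 12, 100, 90, 95),
  i = c(1, 1, 1, 2, 2, 2)
)

# ave() applies FUN to x within each value of i and returns the results
# in the original row order; the first element of each group is x itself
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))
MyData
```

Replacing v[1] with NA gives the NA-padded variant instead.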
I have a problem regarding results from an aggregate function in R. My aim is to select certain bird species from a data set and calculate the density
of observed individuals over the surveyed area. To that end, I took a subset of the main data file, then aggregated over area, calculating the
mean, and the number of individuals (represented by length of vector). Then I wanted to use the calculated mean area and number of individuals to
calculate density. That didn't work. The code I used is given below:
> head(data)
positionmonth positionyear quadrant Species Code sum_areainkm2
1 5 2014 1 Bar-tailed Godwit 5340 155.6562
2 5 2014 1 Bar-tailed Godwit 5340 155.6562
3 5 2014 1 Bar-tailed Godwit 5340 155.6562
4 5 2014 1 Bar-tailed Godwit 5340 155.6562
5 5 2014 1 Gannet 710 155.6562
6 5 2014 1 Bar-tailed Godwit 5340 155.6562
sub.gannet<-subset(data, Species == "Gannet")
sub.gannet<-data.frame(sub.gannet)
x<-sub.gannet
aggr.gannet<-aggregate(sub.gannet$sum_areainkm2, by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant, sub.gannet$Species, sub.gannet$Code), FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet)<-c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet<-data.frame(aggr.gannet)
> aggr.gannet
positionyear positionmonth quadrant species code x.observed_area x.NoInd
1 2014 5 4 Gannet 710 79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
Arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
R doesn't seem to view the results from the function (observed_area and NoInd) as numeric values in their own right. That was already apparent, when I couldn't give them a name each, but had to call them "x".
How can I calculate density under these circumstances? Or is there another way to aggregate with multiple functions over the same variable that will result in a usable output?
It's a quirk of aggregate with multiple aggregations that the results are stored as a matrix inside the single column for the aggregated variable, rather than as separate columns.
The easiest way to get rid of this is to go through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
It will still use x as the name. The way I discovered to fix this is to use the formula interface to aggregate, so your aggregate would look more like
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking it through (here I haven't taken the subset to illustrate the aggregation by species)
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
Which results in what you want:
> df.agg
positionyear positionmonth quadrant Species Code
1 2014 5 1 Gannet 710
2 2014 5 1 Bar-tailed Godwit 5340
sum_areainkm2.observed_area sum_areainkm2.NoInd
1 155.6562 1
2 155.6562 5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
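To get from the flattened result to the density asked about in the question, the two new columns can then be divided directly. A small sketch on a hand-typed copy of the six example rows (the names agg and flat are just for illustration):

```r
# Hand-typed copy of the six example rows
df <- data.frame(
  positionyear = 2014L, positionmonth = 5L, quadrant = 1L,
  Species = c(rep("Bar-tailed Godwit", 4), "Gannet", "Bar-tailed Godwit"),
  Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L),
  sum_areainkm2 = 155.6562
)

agg <- aggregate(
  sum_areainkm2 ~ positionyear + positionmonth + quadrant + Species + Code,
  data = df,
  FUN = function(x) c(observed_area = mean(x), NoInd = length(x)))

# Before flattening, both statistics sit in ONE matrix-valued column,
# which is why agg$sum_areainkm2.NoInd would be NULL at this point
is.matrix(agg$sum_areainkm2)

# After flattening they are ordinary numeric columns
flat <- as.data.frame(as.list(agg))
flat$density <- flat$sum_areainkm2.NoInd / flat$sum_areainkm2.observed_area
flat
```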
Obligatory note here, that dplyr and data.table are powerful libraries that allow doing this sort of aggregation very simply and efficiently.
dplyr
dplyr has some unusual syntax (the %>% operator), but ends up being quite readable and allows chaining more complex operations.
> require(dplyr)
> df %>%
group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
summarise(observed_area=mean(sum_areainkm2), NoInd = n())
data.table
data.table has a more compact syntax and may be faster with large datasets.
library(data.table)
dt <- as.data.table(df)
dt[,
  .(observed_area=mean(sum_areainkm2), NoInd=.N),
  by=.(positionyear, positionmonth, quadrant, Species, Code)]
I have the following table ordered group by first, second and name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that corresponds to each specific sample. Several groups don't have areas for all the samples, so if a sample isn't detected it should return "NA". Ideally, the final output should have one column for each sample.
I have tried the ifelse function to create one column to each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However, this doesn't take the group distribution into account. I want to do this, but within each group (a group has equal values in the first, second and Name columns): if Sample is PO1:1, return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the column Sample so that its levels are legal names, or else they might cause errors
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min of the range of myData$second
min.second <- 0 #
sprd <- 0.15 # with spread for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ This line will work when only one value per sample
# unlist(ifelse(res==0, NA, res))
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you.
> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598
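If you do want an (empty) PO3:1 column, one option is to tell R about that level yourself and then cross-tabulate with tapply(), which keeps unused factor levels and fills empty cells with NA. This is only a sketch under two assumptions: groups are defined by exact equality of first, second and Name (so 0.36 and 0.37 end up in different groups), and the full set of sample names is known in advance. The data is retyped from the dput() in the question.

```r
# Data retyped from the dput() output in the question
myData <- data.frame(
  first  = c(120L, 120L, 126L, 126L, 126L, 132L, 132L),
  second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
  Name   = c("Pentanone", "Pentanone", "Heptene", "Heptene",
             "Methylamine", "Benzene", "Ethene._trichloro-"),
  Area   = c(699468L, 153744L, 32913L, 4948619L, 83528L, 536339L, 105598L),
  Sample = c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1")
)

# Declare every sample that could occur, including the undetected PO3:1
myData$Sample <- factor(myData$Sample,
                        levels = c("PO1:1", "PO2:1", "PO3:1", "PO4:1"))

# One label per group, kept in order of first appearance
grp <- paste(myData$first, myData$second, myData$Name, sep = "-")
grp <- factor(grp, levels = unique(grp))

# Rows = groups, columns = samples; cells with no data stay NA.
# sum() only matters if a group has several areas for the same sample.
wide <- tapply(myData$Area, list(grp, myData$Sample), sum)
wide
```

The result is a numeric matrix rather than a data frame; as.data.frame(wide) converts it if needed.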