I have 2 data frames for 2 stacks that give information about potential emissions. One data frame gives the hours at which the system turns on and off in each of the 4 seasons. Each season starts on a specific date. The 2nd file gives me the details of the stacks.
I am testing how to do this with some sample files. So far, following a Stack Overflow example, I have managed to write a function that creates a data frame with the dates I want and a column with the season for each date. I am now really struggling with the programming concept: how do I combine the 3 data frames to create the output template that I am trying to set up?
To show you an example, my sample inputs are:
Stack_info File:
Example seasonal profile that shows when the system is on or off:
and the output I am after should be one data frame per year in the following format (only the black text is part of the output; the red text just explains what the values are):
What I find most difficult is that the output file for each year has a unique first row, the 2nd row repeats for each pollutant, and from the 3rd row on come the hourly data for all 8760 hours. This block needs to repeat for the next pollutant.
So far I have managed to create a function that helps me assign a season to each day of the year. For example:
#function to create seasons
d = function(month_day) which(lut$month_day == month_day)
lut = data.frame(all_dates = as.POSIXct("2012-1-1") + ((0:365) * 3600 * 24),
season = NA)
lut = within(lut, { month_day = strftime(all_dates, "%b-%d") })
lut[c(d("Jan-01"):d("Mar-15"), d("Nov-08"):d("Dec-31")), "season"] = "winter"
lut[c(d("Mar-16"):d("Apr-30")), "season"] = "spring"
lut[c(d("May-01"):d("Sep-27")), "season"] = "summer"
lut[c(d("Sep-28"):d("Nov-07")), "season"] = "autumn"
rownames(lut) = lut$month_day
## create date data frame and assign seasons
dates = data.frame(dates =seq(as.Date('2010-01-01'),as.Date('2012-12-31'),by = 1))
dates = within(dates, {
season = lut[strftime(dates, "%b-%d"), "season"]
})
This gives me a dates data frame. My other 2 sample data frames are (as shown in the image):
structure(list(`Source no` = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Source = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("Stack 1", "Stack 2"), class = "factor"),
Period = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Day = structure(c(2L,
6L, 7L, 5L, 1L, 3L, 4L, 2L, 6L, 7L, 5L, 1L, 3L, 4L, 2L, 6L,
7L, 5L, 1L, 3L, 4L), .Label = c("Fri", "Mon", "Sat", "Sun",
"Thu", "Tue", "Wed"), class = "factor"), `Spring On` = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 15L,
15L, 15L, 15L, 15L, 15L, 15L), `Spring Off` = c(23L, 23L,
23L, 23L, 23L, 23L, 23L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 18L,
18L, 18L, 18L, 18L, 18L, 18L), `Summer On` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "off", class = "factor"), `Summer Off` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "off", class = "factor"), `Autumn On` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "off", class = "factor"), `Autumn Off` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "off", class = "factor"), `Winter On` = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("0", "off"), class = "factor"),
`Winter Off` = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("23",
"off"), class = "factor")), .Names = c("Source no", "Source",
"Period", "Day", "Spring On", "Spring Off", "Summer On", "Summer Off",
"Autumn On", "Autumn Off", "Winter On", "Winter Off"), class = "data.frame", row.names = c(NA,
-21L)) -> profile
structure(list(SNAME = structure(1:2, .Label = c("Stack 1", "Stack 2"
), class = "factor"), ISVARY = c(1L, 4L), VELVOL = c(1L, 4L),
TEMPDENS = c(0L, 2L), `DUM 1` = c(999L, 999L), `DUM 2` = c(999L,
999L), NPOL = c(2L, 2L), `EXIT VEL` = c(26.2, 22.4), TEMP = c(341L,
328L), `STACK DIAM` = c(1.5, 2.5), W = c(0L, 15L), Nox = c(39,
33.3), Sox = c(15.5, 17.9)), .Names = c("SNAME", "ISVARY",
"VELVOL", "TEMPDENS", "DUM 1", "DUM 2", "NPOL", "EXIT VEL", "TEMP",
"STACK DIAM", "W", "Nox", "Sox"), class = "data.frame", row.names = c(NA,
-2L)) -> stack_info
Any guidance on how to proceed with the programming part would be really useful, as I am just not sure how to approach creating separate output data frames for the years 2010, 2011 and 2012.
The way your data is organised isn't ideal for processing. Maybe have a look at Hadley Wickham's paper about tidy data.
According to your desired output you need a dataframe with the number of lines equal to the number of hours a specific machine (stack n) is switched on. Therefore I suggest you create a dataframe containing every hour of a given year:
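# every hour of 2010, including all 24 hours of 31 December; UTC avoids daylight-saving gaps in the hourly sequence
d.out = data.frame(dates = seq(from=as.POSIXct("2010-01-01 00:00", tz="UTC"), by=3600, to=as.POSIXct("2010-12-31 23:00", tz="UTC")))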
d.out$year = as.numeric(format(d.out$dates, "%Y"))
d.out$month = as.numeric(format(d.out$dates, "%m"))
d.out$day = as.numeric(format(d.out$dates, "%d"))
d.out$hour = as.numeric(format(d.out$dates, "%H"))
d.out$weekday = as.character(format(d.out$dates, "%a"))
d.out$doj = as.numeric(format(d.out$dates, "%j"))
d.out$season = "Winter"
d.out$season[d.out$doj >= 75 & d.out$doj < 121] = "Spring"
d.out$season[d.out$doj >= 121 & d.out$doj < 271] = "Summer"
d.out$season[d.out$doj >= 271 & d.out$doj < 312] = "Autumn"
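One caveat with the day-of-year thresholds above: they match the season dates from the question only in non-leap years. In 2012 every date after February shifts by one day, so 15 March would already count as spring. A minimal sketch that compares month and day instead, using the season boundaries from the question (the helper name season_of is made up for illustration):

```r
# Assign a season from month and day, so leap years need no special handling.
# Boundaries are taken from the question: spring starts 16 Mar, summer 1 May,
# autumn 28 Sep, winter 8 Nov.
season_of <- function(x) {
  md <- as.integer(format(x, "%m%d"))  # e.g. 316 for 16 March
  ifelse(md >= 316 & md <= 430, "Spring",
  ifelse(md >= 501 & md <= 927, "Summer",
  ifelse(md >= 928 & md <= 1107, "Autumn", "Winter")))
}

check_dates <- as.Date(c("2012-03-15", "2012-03-16", "2012-09-28", "2012-11-08"))
season_of(check_dates)
```

The same function works on a POSIXct column, e.g. d.out$season = season_of(d.out$dates).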
The goal is to join this dataframe with your profile dataframe. Before joining, the profile-df has to be rearranged:
library(dplyr)
library(tidyr)
profile_new =
profile %>%
gather(season, hour, -c(`Source no`, Source, Period, Day)) %>%
extract(season, c("season", "status"), "(\\w+?)\\s(\\w+)") %>%
filter(hour != "off") %>%
mutate(Day = as.character(Day), hour=as.numeric(hour)) %>%
spread(status, hour)
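In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A rough equivalent of the reshaping step, shown on a made-up two-row profile (profile_small is a hypothetical stand-in with values stored as text, not the real profile data frame):

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the profile table; "off" marks a season without operation
profile_small <- data.frame(
  Source = c("Stack 1", "Stack 2"),
  Day = c("Mon", "Mon"),
  `Spring On` = c("0", "6"),
  `Spring Off` = c("23", "7"),
  `Winter On` = c("0", "off"),
  `Winter Off` = c("23", "off"),
  check.names = FALSE
)

profile_small_new <- profile_small %>%
  # split "Spring On" into season = "Spring" and status = "On"
  pivot_longer(-c(Source, Day), names_to = c("season", "status"),
               names_sep = " ", values_to = "hour") %>%
  filter(hour != "off") %>%
  mutate(hour = as.numeric(hour)) %>%
  # one row per source, day and season, with On and Off as columns
  pivot_wider(names_from = status, values_from = hour)

profile_small_new
```

With the real data the integer and factor columns would first need converting to character, since pivot_longer does not combine columns of different types silently.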
Now it's easy to join the three dataframes to put together all the information you need to create your output:
d.out %>%
inner_join(profile_new, by=c("weekday"="Day", "season"="season")) %>%
group_by(Source, dates, year, day, weekday, season, hour) %>%
summarise(status = any(hour >= On & hour <= Off)) %>%
inner_join(stack_info, by=c("Source"="SNAME")) %>%
mutate(Nox = ifelse(status, Nox, 0),
Sox = ifelse(status, Sox, 0)) %>%
arrange(Source, year, dates, hour) %>%
select(Source, year, day, weekday, season, hour, `EXIT VEL`, TEMP, `STACK DIAM`, W, Nox, Sox)
Obviously it's not quite the format you posted. From here you could write your dataframe to a csv file, stack by stack, using write.table() with append = TRUE.
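For the stack-by-stack export, note that write.csv() ignores append (with a warning), so appending needs write.table(). A minimal sketch, assuming the joined result is stored in a data frame called result (a made-up name, filled here with toy values) and written to a temporary file:

```r
# Toy stand-in for the joined output; in practice this would be the result of the pipeline
result <- data.frame(
  Source = c("Stack 1", "Stack 1", "Stack 2", "Stack 2"),
  hour = c(0, 1, 0, 1),
  Nox = c(39, 39, 0, 33.3)
)

out_file <- tempfile(fileext = ".csv")
stacks <- split(result, result$Source)

for (i in seq_along(stacks)) {
  write.table(stacks[[i]], out_file, sep = ",", row.names = FALSE,
              col.names = (i == 1),  # header only for the first stack
              append = (i > 1))      # later stacks are appended
}

read.csv(out_file)
```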
I have a data frame with different variables (columns).
I want to transform this data frame into a table with a different structure to make it more readable.
For example, I have a data frame like this:
myData = structure(list(X = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "20", class = "factor"),
Y = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("20", "100"), class = "factor"),
MethodType = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L,
2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L), .Label = c("E", "Q", "R",
"W"), class = "factor"), MethodType2 = structure(c(1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Metric1 = c(0.970017512487058, 0.969647220975651,
0.965873991040769, 0.966242788535318, 0.986725852301671,
0.98696657967457, 0.98252107117733, 0.982655296614757, 0.278826941542694,
-0.990926101696033, 0.194574672498287, 0.281916524368647,
0.152983364411985, 1.44135982835554, 0.330270447575806, -0.369627160641594
), Metric2 = c(0.987541353383459, 0.987007518796992, 0.980984962406015,
0.981646616541353, 0.984082706766917, 0.984481203007519,
0.988165413533835, 0.988375939849624, -0.109331599015822,
-0.148471161609603, 1.31331396089969, -1.34238564643737,
2.14014350779371, -0.422879539464588, -1.25706359685425,
1.09603324772565)), row.names = c(NA, -16L), class = "data.frame")
and I want to have a table like this:
What kind of manipulation can I use? Which tool can I use? I'm looking for something flexible that also works with more factors.
Can someone help me count occurrences of rows from one data frame in another data frame?
df1(out)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), class = "factor", .Label = "0S1576"), LC = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), class = "factor", .Label = c("MW92",
"OY01", "RM11")), Fiscal.Month = c("2019-M06", "2019-M07", "2019-M06",
"2019-M07", "2019-M08", "2019-M09", "2019-M06", "2019-M07", "2019-M08"
)), row.names = c(NA, -9L), class = "data.frame")
df2(tempdf)
structure(list(Item = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "0S1576", class = "factor"),
LC = structure(c(1L, 1L, 1L, 1L, 2L, 3L, 4L, 6L, 5L, 1L,
2L, 2L, 3L, 3L), .Label = c("MW92", "OY01", "RM11", "RS11",
"WK14", "WK15"), class = "factor"), Fiscal.Month = structure(c(1L,
2L, 3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("2019-M06",
"2019-M07", "2019-M08", "2019-M09"), class = "factor"), fcst = c(22L,
21L, 20L, 19L, 12L, 10L, 10L, 12L, 10L, 12L, 10L, 10L, 10L,
10L)), row.names = c(NA, -14L), class = "data.frame")
I want to count how often each Item, LC, Fiscal.Month combination of df1 occurs in df2.
You can count with table. Because you match on more than one column, combine the key columns with interaction, then convert df2's keys into a factor whose levels are the combinations in df1.
table(factor(interaction(df2[c("Item","LC","Fiscal.Month")]), levels=interaction(df1)))
#0S1576.MW92.2019-M06 0S1576.MW92.2019-M07 0S1576.OY01.2019-M06
# 2 1 3
#0S1576.OY01.2019-M07 0S1576.OY01.2019-M08 0S1576.OY01.2019-M09
# 0 0 0
#0S1576.RM11.2019-M06 0S1576.RM11.2019-M07 0S1576.RM11.2019-M08
# 3 0 0
Or a speed improved version using match and tabulate:
(df1$freq <- tabulate(match(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1)))
#[1] 2 1 3 0 0 0 3 0 0
Or sometimes even faster using fastmatch:
library(fastmatch)
df1$freq <- tabulate(fmatch(interaction(df2[c("Item","LC","Fiscal.Month")]), interaction(df1)), nrow(df1))
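If you prefer to keep the key columns separate instead of pasting them together with interaction, a merge-based count is another option. A minimal base R sketch on made-up toy data (the column names LC and Month and the helper column n are for illustration only); note that merge sorts the result by the key columns:

```r
# Toy stand-ins for df1 (combinations to look up) and df2 (rows to count)
df1_toy <- data.frame(LC = c("A", "A", "B"), Month = c("M06", "M07", "M06"))
df2_toy <- data.frame(LC = c("A", "A", "B", "C"), Month = c("M06", "M06", "M06", "M06"))

# count the rows of df2_toy per key combination
counts <- aggregate(n ~ LC + Month, data = transform(df2_toy, n = 1L), FUN = sum)

# keep every combination of df1_toy; combinations absent from df2_toy get 0
res <- merge(df1_toy, counts, by = c("LC", "Month"), all.x = TRUE)
res$n[is.na(res$n)] <- 0L
res
```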
I am trying to make a PCA plot using ggplot and geom_point.
I would like to illustrate 3 factors (Diet, Time, Antibiotics).
I thought I could outline the points in black for one factor.
However this isn't showing the third factor (Time) for the Fill color.
Here is a subset of my data:
> dput(dat.pcx.annot.test)
structure(list(PC1 = c(25.296379160162, 1.4703101394886, 11.4138097811008,
1.41798772574591, 23.7253675969881, 15.5683516005535, -34.6012195481675,
-25.7129281491955, -2.97230018393742, 4.83421092719293, -0.0274189140249825,
23.227939504077, 15.2002258785889, -35.2243685702227, -34.2537374460037,
-7.6380794043063), PC2 = c(27.2678813936857, -9.88577494210313,
-6.19394322321806, -8.88953660465497, 33.6791127012231, -13.2912233546802,
7.77877968081575, 2.7371646557436, -8.41929538502921, -11.5151849519265,
-9.40733576034963, 32.3549860618533, -11.2170071727855, 10.0455709347794,
3.05679707335492, -6.66218028060621), Diet = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), .Label = c("RC",
"WD"), class = "factor"), Time = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ZT14",
"ZT2"), class = "factor"), Antibiotics = structure(c(2L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("Antibiotics ",
"None"), class = "factor")), row.names = c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 18L, 19L, 20L, 21L, 22L), class = "data.frame")
Here is the plotting command :
ggplot(dat.pcx.annot.test,aes(x=PC1,y=PC2,color=Diet,shape=Antibiotics,Fill=Time))+
geom_point(size=3,alpha=0.5)+
scale_color_manual(values = c("black","white") )
And the plot it produces:
I thought if I had both color and fill specified then they would both show.
I would like black outlines for Antibiotics, and Fill color for Time.
Right now Time is not represented.
Any help on how to view all 3 factors simultaneously would be appreciated.
Thanks
Yes, I had a typo in fill (I wrote Fill). And I finally figured out how to get the legends to correspond. Here is my final answer.
ggplot(dat.pcx.annot,aes(x=PC1,y=PC2,color=Diet,shape=Antibiotics,fill=Time))+
geom_point(size=3)+
scale_shape_manual(values = c(21, 22) )+
scale_color_manual(values = c("black","white") )+
scale_fill_manual(values=c("#EC9DAE","#AEDE94"))+
xlab(PC1var)+
ylab(PC2var)+
guides(fill=guide_legend(override.aes=list(shape=21)))+
guides(color=guide_legend(override.aes=list(shape=21)))
guides(fill=guide_legend(override.aes=list(shape=21,fill=c("#EC9DAE","#AEDE94"),color=c("black","white"))))
ggsave("cohort2_pca.pdf")
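The reason this works is that only point shapes 21 to 25 have both an outline (color) and an interior (fill); the default shapes ignore fill. A stripped-down sketch on made-up toy data (the data frame toy and its values are invented for illustration):

```r
library(ggplot2)

# Toy data with the same three factors as in the question
toy <- data.frame(
  x = 1:4,
  y = c(2, 1, 4, 3),
  Diet = c("RC", "WD", "RC", "WD"),
  Time = c("ZT14", "ZT14", "ZT2", "ZT2"),
  Antibiotics = c("None", "Antibiotics", "Antibiotics", "None")
)

p <- ggplot(toy, aes(x = x, y = y, color = Diet, shape = Antibiotics, fill = Time)) +
  geom_point(size = 3, stroke = 1) +
  # 21 (circle) and 22 (square) are fillable shapes
  scale_shape_manual(values = c(21, 22)) +
  # draw the legend keys with a fillable shape too, so fill and outline are visible
  guides(fill = guide_legend(override.aes = list(shape = 21)),
         color = guide_legend(override.aes = list(shape = 21)))
p
```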
I have trouble figuring out how I can compute a simple mean with dplyr on Long Format data.
My data look like this :
hldid idno sex diary age
1 1294 1294_1 2 1 39
2 1294 1294_1 2 2 39
3 1294 1294_2 1 1 43
4 1294 1294_2 1 2 43
...
With 5 variables : hldid idno sex diary age
idno is the personal identifier but not the unique key.
Each individual appears twice, once for each diary filled in.
What I would like is to simply compute the mean age by sex.
Could you help me out ?
I tried something like :
dta %>%
group_by(sex) %>%
mutate( ng = n_distinct(idno)) %>%
group_by(age, add=TRUE) %>%
summarise(mean = n()/ng[1] )
But it does not work.
The data :
dta = structure(list(hldid = c(1294, 1294, 1294, 1294, 1352, 1352,
1352, 1352, 3741, 3741, 3741, 3741, 3809, 3809, 3809, 3809, 4037,
4037, 4037, 4037), idno = c("1294_1", "1294_1", "1294_2", "1294_2",
"1352_1", "1352_1", "1352_2", "1352_2", "3741_1", "3741_1", "3741_2",
"3741_2", "3809_1", "3809_1", "3809_2", "3809_2", "4037_1", "4037_1",
"4037_2", "4037_2"), sex = c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), diary = c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L), age = c(39L, 39L, 43L, 43L, 31L, 31L, 37L, 37L,
33L, 33L, 37L, 37L, 34L, 34L, 37L, 37L, 41L, 41L, 32L, 32L)), .Names = c("hldid",
"idno", "sex", "diary", "age"), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L), vars = list(hldid), drop = TRUE, indices = list(
0:3, 4:7, 8:11, 12:15, 16:19), group_sizes = c(4L, 4L, 4L,
4L, 4L), biggest_group_size = 4L, labels = structure(list(hldid = c(1294,
1352, 3741, 3809, 4037)), class = "data.frame", row.names = c(NA,
-5L), .Names = "hldid", vars = list(hldid)))
quick update
Maybe this does not apply to this example,
but the kind of issue I have in mind is the following:
Imagine we have data like this:
3 women and 2 men, and a dummy act variable.
If we compute the mean without taking the long format into account, we will have problems.
aggregate(act ~ sex, FUN = mean, data = dtaTime)
What we should do is this :
aggregate(act ~ sex, FUN = sum, data = dtaTime)
6 / 2 # men
10 / 3 # women
Data
dtaTime = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L),
sex = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), act = c(1L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L)), .Names = c("id", "sex",
"act"), class = "data.frame", row.names = c(NA, -25L))
You are making it too complicated,
dta %>%
group_by(sex) %>%
summarise(meanage = mean(age))
should give you the mean age by sex.
A base R alternative:
aggregate(age ~ sex, dta, mean)
A data.table alternative:
library(data.table)
setDT(dta)[, .(meanage = mean(age)), by = sex]
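These row-wise means equal the per-person means here because every individual has exactly two rows. If the number of rows per person can differ, one option is to keep one row per idno before averaging. A minimal base R sketch on made-up toy data (dta_toy is a hypothetical example, not the data from the question):

```r
# Toy long-format data: person "b" has three rows, "c" only one
dta_toy <- data.frame(
  idno = c("a", "a", "b", "b", "b", "c"),
  sex = c(1, 1, 2, 2, 2, 2),
  age = c(40, 40, 30, 30, 30, 50)
)

# one row per person, then the mean age by sex
persons <- dta_toy[!duplicated(dta_toy$idno), ]
by_person <- aggregate(age ~ sex, data = persons, FUN = mean)
by_person

# for comparison: averaging over all rows weights person "b" three times
by_row <- aggregate(age ~ sex, data = dta_toy, FUN = mean)
by_row
```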
So, I have a data frame with several continuous variables and several dummy variables. The survey that this data frame comes from uses 6,7,8 and 9 to denote different types of non-response. So, I would like to replace 6,7,8 and 9 with NA whenever they show up in a dummy variable column but leave them be in the continuous variable column.
Is there a concise way to go about doing this?
Here's my data:
> dput(head(sfsuse[c(4:16)]))
structure(list(famsize = c(3L, 1L, 2L, 5L, 3L, 5L), famtype = c(2L,
1L, 2L, 3L, 2L, 3L), cc = c(1L, 1L, 1L, 1L, 1L, 1L), nocc = c(1L,
1L, 1L, 3L, 1L, 1L), pdloan = c(2L, 2L, 2L, 2L, 2L, 2L), help = c(2L,
2L, 2L, 2L, 2L, 2L), budget = c(1L, 1L, 1L, 1L, 2L, 2L), income = c(340000L,
20500L, 0L, 165000L, 95000L, -320000L), govtrans = c(7500L, 15500L,
22000L, 350L, 0L, 9250L), childexp = c(0L, 0L, 0L, 0L, 0L, 0L
), homeown = c(1L, 1L, 1L, 1L, 1L, 2L), bank = c(2000L, 80000L,
25000L, 20000L, 57500L, 120000L), vehval = c(33000L, 7500L, 5250L,
48000L, 8500L, 50000L)), .Names = c("famsize", "famtype", "cc",
"nocc", "pdloan", "help", "budget", "income", "govtrans", "childexp",
"homeown", "bank", "vehval"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to substitute NA for 6,7,8 and 9 in columns 3:7 and column 11. I know how to do this one column at a time by the column names:
df$name[df$name %in% 6:9]<-NA
but I would have to do this for each column by name, is there a concise way to do it by column index?
Thanks
This function should work
f <- function(data,k) {
data[data[,k] %in% 6:9,k] <- NA
data
}
Now at the console:
> for (k in c(3:7,11)) { data <- f(data,k) }
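The same replacement can be done without an explicit loop by applying it to all selected columns at once. A minimal sketch on made-up toy data (df_toy and dummy_cols are illustrative names; with the data from the question the index vector would be c(3:7, 11)):

```r
# Toy data: two dummy columns and one continuous column
df_toy <- data.frame(
  cc = c(1L, 7L, 2L),
  help = c(9L, 2L, 1L),
  income = c(6L, 20500L, 8L)
)

dummy_cols <- c(1, 2)  # column indices of the dummy variables

# replace the non-response codes 6 to 9 with NA in the dummy columns only
df_toy[dummy_cols] <- lapply(df_toy[dummy_cols],
                             function(x) replace(x, x %in% 6:9, NA))
df_toy
```

The continuous column income keeps its values 6 and 8 because it is not in dummy_cols.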