I have a data frame (9000 x 304), but it looks like this:
date           a          b
1997-01-01     8.720551   10.61597
1997-01-02     NA         NA
1997-01-03     8.774251   NA
1997-01-04     8.808079   11.09641
I want to calculate differences between rows, such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if there is an NA, I want to use the last non-NA value in that column.
For example, for the second diff with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NA:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i-1, ] - data[i-2, ])
}
second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i, ] - data[i-1, ])
}
third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i, ] - data[i-2, ])
}
There may be a way to solve this with the aggregate function, but I need a solution that can be applied to big data, and I can't specify each column name separately. Moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the last non-missing value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but you can also replace the default value of lag when the lagged value does not exist (e.g. for the first row), using lag(.x, default = ...).
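For a reproducible starting point, the small example above can be reconstructed like this (a minimal sketch; your real data has 304 columns, this just mirrors the fragment shown):
# reconstruct the example fragment (date assumed to be a Date column)
data <- data.frame(
  date = as.Date(c("1997-01-01", "1997-01-02", "1997-01-03", "1997-01-04")),
  a = c(8.720551, NA, 8.774251, 8.808079),
  b = c(10.61597, NA, NA, 11.09641)
)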
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
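Since you mention roughly 300 columns that you cannot list by name, here is a minimal sketch of the same idea applied to every column except date (assuming date is the only non-value column; num_cols is a helper name introduced here):
# select all value columns without typing their names
num_cols <- setdiff(names(data), "date")
data %>%
  fill(all_of(num_cols)) %>%
  mutate(across(all_of(num_cols), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x), .names = "second_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x, n = 2), .names = "third_{.col}"))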
I would like to calculate enthalpy by using a steam table function.
I want to apply the function to a tibble that includes temp and pressure, but I failed.
For example, I want to add an enthalpy column.
sample_table
temp pressure
800 16
900 17
1000 18
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = hTp(temp, pressure))
The result is
temp pressure enthalpy
800 16 3375.08509
900 17 3375.08509
1000 18 3375.08509
In this case, the calculation only uses the values from the first row.
How can I calculate it for all rows using mutate?
After thinking more about your question, I now understand that you were not talking about multiple columns. Instead, it seems you would like a function that can process data for multiple rows.
Here I provide two solutions. The first is to use the Vectorize function to convert your function into a version that can handle vectorized input and produce vectorized output.
library(IAPWS95)
library(tidyverse)
hTp_vectorize <- Vectorize(hTp)
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = hTp_vectorize(temp, pressure))
sample_table_add_enthalpy
# temp pressure enthalpy
# 1 800 16 3375.08509
# 2 900 17 3636.88144
# 3 1000 18 3889.57761
The second is to use map2_dbl from the purrr package to vectorize the operation (map2_dbl returns a numeric vector, whereas plain map2 would give a list column):
sample_table_add_enthalpy <- sample_table %>%
  mutate(enthalpy = map2_dbl(temp, pressure, hTp))
sample_table_add_enthalpy
# temp pressure enthalpy
# 1 800 16 3375.08509
# 2 900 17 3636.88144
# 3 1000 18 3889.57761
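If you prefer to stay in base R, mapply does the same row-by-row pairing; a minimal sketch, assuming sample_table is the tibble shown above:
# Base-R alternative: mapply calls hTp once per (temp, pressure) pair and returns a numeric vector
sample_table$enthalpy <- mapply(hTp, sample_table$temp, sample_table$pressure)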
My data looks like this:
colnames(dati) <- c("grupa", "regions5", "regions6", "novads.rep", "pilseta.lt", "specialists", "limenis.1", "limenis.2", "cipari.3", "ratio", "gads", "KV", "DS")
and I have manually applied split to it in order to have 24 splits (12 including year and 12 without splitting by year). I did them the following way:
k1<-split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k2<-split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
...
k13<-split(dati$ratio,list(dati$grupa),drop=TRUE)
k14<-split(dati$ratio,list(dati$grupa,dati$regions5),drop=TRUE)
...etc
and what I want to do is to apply these splits to my function as follows:
function(k1,k13)
but instead of inserting the values manually, I would like to index them so that I can call my function similar to this:
for(i in 1:12){function(k[i],k[i+12])}
I just can't seem to find the right way to do it.
dati, after I split it, looks like this:
grupa regions5 regions6 novads.rep pilseta.lt specialists
1 1* Zemgales Zemgales Novads lauki Silva
2 1* Kurzemes Kurzemes Novads lauki Sniedze
3 3* Kurzemes Kurzemes REP pilsēta AnitaE
4 1* Vidzemes Vidzemes Novads pilsēta Dainis
limenis.1 limenis.2 cipari.3 ratio gads KV
1 Jelgavas nov. Svētes pag. 1 0.8682626 2011 2162
2 Ventspils nov. Vārves pag. 1 0.3923857 2011 27467
3 _Liepāja _Liepāja 4 0.4069100 2011 30107
4 Alūksnes nov. Alūksne 2 0.5641127 2011 8147
DS
1 2490.03
2 70000.00
3 73989.33
4 14442.15
...
and here is the output I'm looking for:
count mean lowermean uppermean median ...
2011.1*.Kurzemes 119 0.83322820 7.719323e-01 0.8945241 0.79888324
2012.1*.Kurzemes 171 0.82800498 7.836221e-01 0.8723879 0.84424821
2013.1*.Kurzemes 144 0.77551814 7.347631e-01 0.8162731 0.80745150
2014.1*.Kurzemes 180 0.78134649 7.396007e-01 0.8230923 0.81635065
2015.1*.Kurzemes 80 0.78146588 7.135070e-01 0.8494248 0.73659659
2011.10*.Kurzemes 16 1.09552970 6.930780e-01 1.4979814 1.02127841
2012.10*.Kurzemes 22 0.87442906 5.721409e-01 1.1767172 0.74787482
2013.10*.Kurzemes 25 0.84406131 6.947097e-01 0.9934129 0.91786319
2014.10*.Kurzemes 22 0.79385199 5.880507e-01 0.9996533 0.71708060
2015.10*.Kurzemes 12 1.19059850 8.213604e-01 1.5598365 1.25322750
2012.11*.Kurzemes 1 0.09461065 NA NA 0.09461065
2013.11*.Kurzemes 2 0.18134522 -1.823437e+00 2.1861274 0.18134522
2014.11*.Kurzemes 1 0.11097174 NA NA 0.11097174
2013.12*.Kurzemes 1 0.44620780 NA NA 0.44620780
...
You could use a list:
k <- list()
k[[1]] <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k[[2]] <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
# etc
Then the following is valid (writing myfunction as a placeholder for the function you want to apply):
for (i in 1:12) {
  myfunction(k[[i]], k[[i + 12]])
}
Note that k3 is the name of a variable, which could be x, myvar32, whatever. When you type k[3], you state that you want to access the third cell of the vector k. Note that k and k3 are totally distinct variables. If you want to be able to access your variables using k[i], you must first create the vector k and store what you need in k[i]...
The double bracket notation is used to access lists, which are basically handy containers that can store anything -- which is what you need in your case.
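To make the difference concrete, here is a tiny illustration (with made-up values) of the distinction between a standalone variable named k3 and the third element of a list k:
k3 <- c(1, 2, 3)       # an ordinary variable that just happens to be called "k3"
k <- list()            # a list whose elements can be addressed programmatically
k[[3]] <- c(1, 2, 3)   # third element of k (elements 1 and 2 are filled with NULL)
i <- 3
identical(k3, k[[i]])  # TRUE -- same contents, but only k[[i]] can be used inside a loop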
In the graph below, is it possible to create the same graph with fewer lines of code? Since each of Figs. A-D has different label settings, I have to write the settings for each figure, which makes the code longer.
The graph below was produced from the data using the pdf device.
Any help with these issues is highly appreciated (newbie to R!). Since all the code is too long to post here, I have posted only the part relevant to the problem, for Fig. C:
#FigC
library(Hmisc)  # needed for minor.tick()
label1 <- c(0, 100, 200, 300)
plot(data$TimeVariable2C, data$Variable2C, axes = FALSE, ylab = "", xlab = "",
     xlim = c(0, 24), ylim = c(0, 2.4), xaxs = "i", yaxs = "i", pch = 19)
lines(data$TimeVariable3C, data$Variable3C)
axis(2, tick = TRUE, at = seq(0.0, 2.4, by = 0.6), label = seq(0.0, 2.4, by = 0.6))
axis(1, tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6))
mtext("(C)", side = 1, outer = FALSE, line = -10, adj = 0.8)
minor.tick(nx = 5, ny = 5)
par(new = TRUE)
plot(data$TimeVariable1C, data$Variable1C, axes = FALSE, xlab = "", ylab = "", type = "l",
     ylim = c(800, 0), xaxs = "i", yaxs = "i")
axis(3, xlim = c(0, 24), tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6),
     col.axis = "violetred4", col = "violetred4")
axis(4, tick = TRUE, at = label1, label = label1, col.axis = "violetred4", col = "violetred4")
polygon(data$TimeVariable1C, data$Variable1C, col = 'violetred4', border = NA)
You ask many questions in the same post. I will try to answer just one: how to simplify your code, or rather how to call it once for each letter. I think it is better to put your data in long format. For example, this will create a list of 4 elements:
ll <- lapply(LETTERS[1:4], function(let){
  dat.let <- dat[, grepl(let, colnames(dat))]
  dd <- reshape(dat.let, direction = 'long',
                v.names = c('TimeVariable', 'Variable'),
                varying = 1:6)
  dd$time <- factor(dd$time)
  dd$Type <- let
  dd
})
ll is a list of 4 data.frames, where each one looks like:
head(ll[[1]])
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
Then you can use it like this, for example:
library(Hmisc)
layout(matrix(1:4, 2, 2, byrow = TRUE))
lapply(ll, function(data){
  label1 <- c(0, 100, 200, 300)
  Type <- unique(data$Type)
  ## points for the second series (Variable2X in the wide data)
  dat <- subset(data, time == 2)
  x.mm <- max(dat$TimeVariable, na.rm = TRUE)
  plot(dat$TimeVariable, dat$Variable, axes = FALSE, ylab = "", xlab = "",
       xlim = c(0, x.mm), ylim = c(0, 2.4), xaxs = "i", yaxs = "i", pch = 19)
  ## line for the third series
  dat <- subset(data, time == 3)
  lines(dat$TimeVariable, dat$Variable)
  axis(2, tick = TRUE, at = seq(0.0, 2.4, by = 0.6), label = seq(0.0, 2.4, by = 0.6))
  axis(1, tick = TRUE, at = seq(0, x.mm, by = 6), label = seq(0, x.mm, by = 6))
  mtext(Type, side = 1, outer = FALSE, line = -10, adj = 0.8)
  minor.tick(nx = 5, ny = 5)
  par(new = TRUE)
  ## polygon for the first series, drawn against a reversed secondary scale
  dat <- subset(data, time == 1)
  plot(dat$TimeVariable, dat$Variable, axes = FALSE, xlab = "", ylab = "", type = "l",
       ylim = c(800, 0), xaxs = "i", yaxs = "i")
  axis(3, xlim = c(0, 24), tick = TRUE, at = seq(0, 24, by = 6), label = seq(0, 24, by = 6),
       col.axis = "violetred4", col = "violetred4")
  axis(4, tick = TRUE, at = label1, label = label1, col.axis = "violetred4", col = "violetred4")
  polygon(dat$TimeVariable, dat$Variable, col = "violetred4", border = NA)
})
Another advantage of using the long data format is that you can use `ggplot2` and `facet_wrap`, for example.
## transform your data to a data.frame
dat.l <- do.call(rbind,ll)
library(ggplot2)
ggplot(subset(dat.l, time != 1)) +
  geom_line(aes(x = TimeVariable, y = Variable, group = time, color = time)) +
  geom_polygon(data = subset(dat.l, time == 1),
               aes(x = TimeVariable, y = 60 - Variable/10, fill = Type)) +
  geom_line(data = subset(dat.l, time == 1),
            aes(x = TimeVariable, y = Variable, fill = Type)) +
  facet_wrap(~ Type, scales = 'free')
I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length), but it counts all the dams that gave birth on a particular date:
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data = unique(mydf[c("dam", "bdate")]), FUN = length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam", "bdate")])$bdate)
2009-10-01 2009-10-03 
         2          2 
This is just an example of how to think about the problem and one approach to solving it.
split.mydf <- with(mydf, split(x = dam, f = bdate)) # each list element holds the dams that gave birth on one date
# it's just a matter of counting unique dams
unique.mydf <- lapply(X = split.mydf, FUN = unique)
# and then counting the number of unique elements
unilen.mydf <- lapply(unique.mydf, length)
# you can do these two last steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x)))
as.data.frame(unlist(unilen.mydf)) # a data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
2009-10-01 2
2009-10-03 2
In dplyr you can use n_distinct:
library(tidyverse)
mydf %>%
  group_by(bdate) %>%
  summarize(dam = n_distinct(dam))
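For the example data above this gives one row per birth date, with dam = 2 for both 2009-10-01 and 2009-10-03, matching the desired mydf2.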