I am new to R and struggling to understand its quirks. I'm trying to do something that should be really simple, but it is turning out to be surprisingly complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column that includes references to other columns and parameters. However, R doesn't seem to allow me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter "beta", to which I have assigned the value 0.0002.
All I want to do is assign a formula to rows 3:10, which is:
Tt[t] = beta * (Pallets[t] - Pallets[t-1]) + (1 - beta) * Tt[t-1]
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter, e.g. via beta <- 0.5, and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
I appreciate that these are basic questions, but I am very new to R.
In R, computations are not made "cell by cell" but are vectorised - in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at that moment, computes the resulting vector, and assigns it to Table[3:10,4].
If you want to make the computation "cell by cell", carrying each newly computed value forward, you have to use a for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
  df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As for your second question, R will never update any values on its own (imagine having set manual calculation in Excel), so you need to repeat the computation after changing beta.
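One way to make the re-run convenient is to wrap the update in a small function and call it again whenever beta changes. A minimal sketch (the function name update_Tt is just illustrative, and it assumes Tt in row 2 is already filled in, as in your table):
update_Tt <- function(Table, beta) {
  # recompute Tt row by row, each row using the value just written to the row above
  for (i in 3:10) {
    Table$Tt[i] <- beta * (Table$Pallets[i] - Table$Pallets[i-1]) +
      (1 - beta) * Table$Tt[i-1]
  }
  Table
}
Table <- update_Tt(Table, beta = 0.0002)
Table <- update_Tt(Table, beta = 0.5)    # re-run after changing beta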
Although it's generally bad design, you can also iterate over the rows in a loop:
Table$temp <- c(0, diff(Table$Pallets))
prevTt <- 0
for (i in 1:10) {
  Table$Tt[i] <- Table$temp[i] * beta + (1 - beta) * prevTt
  prevTt <- Table$Tt[i]
}
Table$temp <- NULL
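If you'd rather avoid writing the loop yourself, the same recurrence can be expressed with Reduce(), which carries the previous Tt value along the vector of differences. A sketch, assuming (as in the question) that Tt in row 2 is already filled in:
d <- diff(Table$Pallets)[2:9]    # Pallets[t] - Pallets[t-1] for t = 3..10
Table$Tt[3:10] <- Reduce(function(prev, x) beta * x + (1 - beta) * prev,
                         d, init = Table$Tt[2], accumulate = TRUE)[-1]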
I am having some difficulty solving a problem concerning the selection of values in a data frame. Here is the situation:
- I have a data frame containing these variables: x-coordinates, y-coordinates, diameter, G value, H value, Quality value, Ecological value. Each line corresponds to one individual (trees, in this exercise).
I need to find the individual with the best quality value - this I can do.
But then I have to find the next tree with a good quality value, which has to be within the next 10 meters of the reference tree (the one with the best quality value).
And this selection has to be repeated from every tree selected, each time 10 meters further on!
This should give me a selection of x-y coordinates which are 10 meters apart and represent good quality values.
Now, here is what I tried:
> kk<- function(x, y)
+ {
+ coordx<-data$x.Koordinate[data$Q==24] # I looked up the best quality value of the sample beforehand, which is 24
+ coordy<-data$y.Koordinate[data$Q==24]
+ x <- ifelse(data$x.Koordinate>coordx-11 & data$Q>15,data$x.Koordinate,0) # I decided not to accept a quality value of less than 15
+ y<-ifelse(data$y.Koordinate>coordy-11 & data$Q>15,data$y.Koordinate,0) # within 11 meters of the reference coordinates; the next tree selected has to be in between
+ return(c(x,y))
+ }
> kk(data$x.Koordinate, data$y.Koordinate)
[1] 0 0 0 0 0 205550 205550 0 205600 205600 0 0 0 0 0 0 0
[18] 604100 0 604150 604100 0
The problem here is that we cannot clearly tell the x coordinates apart from the y coordinates.
I tried this:
> kk<- function(x, y)
+ {
+ coordx<-data$x.Koordinate[data$Q==24]
+ coordy<-data$y.Koordinate[data$Q==24]
+ x <- ifelse(data$x.Koordinate>coordx-11 & data$Q>15,data$x.Koordinate," ")
+ y<-ifelse(data$y.Koordinate>coordy-11 & data$Q>15,data$y.Koordinate," ")
+ return(list(x,y))
+ }
> kk(data$x.Koordinate, data$y.Koordinate)
[[1]]
[1] " " " " " " " " " " "205550" "205550" " " "205600" "205600" " "
[[2]]
[1] " " " " " " " " " " " " "604100" " " "604150" "604100" " "
>
Here we can better see the two levels corresponding to the x and y coordinates.
The first question is simple: is it possible for this function to return the values in a form like x,y or x y (without any 0, « », or spaces)? Or should I use another R function to obtain this result?
The second question is more complex: how can I tell R to repeat this function starting from the coordinates it finds in this first attempt, and for the whole data set?
Thank you very much for your answer, it helps me a lot! The first part of my problem is solved; the second seems to have a bug somewhere... And I don't clearly see where the function tells R to go 10 meters further each time (in fact, every 50 meters, according to my data, see below)... But thank you anyway, it's a good start, and I will keep working on this problem :)
PS: I understand it is difficult without the data. Unfortunately, I cannot post the full data set online. However, I can show you part of it:
ID Bezeichnung x.Koordinate y.Koordinate Q N hdom V Mittelstamm Fi Ta Foe Lae ueN Bu Es Ei Ah ueL Struktur
1 10,809 62 205450 603950 8 1067 21 64 10 NA NA NA NA NA 100 NA NA NA NA NA
2 10,810 63 205450 604000 16 1333 22 128 12 NA NA NA NA NA 75 NA NA 25 NA NA
3 10,811 56 205500 604050 20 800 22 160 18 NA NA NA NA NA 60 NA NA NA 40 NA
4 10,812 55 205500 604000 12 1033 20 97 12 33 NA NA NA NA 67 NA NA NA NA NA
5 10,813 54 205500 603950 20 500 56 0 23 NA NA NA NA NA 100 NA NA NA NA NA
6 10,814 46 205550 604050 16 567 32 215 19 75 NA NA NA NA 25 NA NA NA NA NA
7 10,815 47 205550 604100 16 233 26 174 30 NA 25 NA NA NA 50 NA NA NA 25 NA
8 10,816 48 205550 604150 0 1167 16 0 0 NA NA NA NA NA NA NA NA NA NA NA
9 10,817 43 205600 604150 24 633 33 366 22 83 17 NA NA NA NA NA NA NA NA NA
10 10,818 42 205600 604100 16 1500 33 282 12 NA NA NA NA NA NA NA NA 75 25 NA
Here is the result with your answer for the second problem:
> Arbres<-kk(x.Koordinate, y.Koordinate, data=data)
> for (i in 1:length(Arbres[,1])
+ kk(Arbres(i,1),Arbres[i,2])
Error: unexpected symbol in:
"for (i in 1:length(Arbres[,1])
kk"
Sorry, I just renamed it "Arbres".
Thanks again,
C.
It's a bit hard for me to try this out without your dataset, or a small example of it, but I think the following should work for your first question.
The first time you use the function, you pass the x and y coordinates of the tree that has Quality 24 as the x and y arguments:
> kk<- function(x, y)
+ {
+ coordx<-x
+ coordy<-y
+ x1 <- ifelse(data$x.Koordinate>coordx-11 & data$Q>15,data$x.Koordinate,NA)
+ y1 <- ifelse(data$y.Koordinate>coordy-11 & data$Q>15,data$y.Koordinate,NA)
+ return(matrix(c(x1,y1),nrow=length(x1), ncol=2, dimnames=list(NULL, c("x","y"))))
+ }
That should give you a matrix with two columns corresponding to the x and y coordinates, and an NA if the condition is not met.
The second question is more difficult because, as your output already showed, there are multiple trees that meet the criteria you've set. If you want all of these checked again, you can use the output of your function in a loop. Something like this:
Tree1_friends <- kk(data$x.Koordinate[data$Q==24], data$y.Koordinate[data$Q==24])
for (i in 1:length(Tree1_friends[,1]))
  print(kk(Tree1_friends[i,1], Tree1_friends[i,2]))
Note that this code only prints the results, but with a suitable assignment strategy you can save them as well.
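For example, one way to collect them (the names friends and friend_results are just illustrative) would be:
friends <- Tree1_friends[complete.cases(Tree1_friends), , drop = FALSE]  # drop the all-NA rows
friend_results <- vector("list", nrow(friends))
for (i in seq_len(nrow(friends))) {
  friend_results[[i]] <- kk(friends[i, 1], friends[i, 2])
}
# friend_results[[1]], friend_results[[2]], ... now hold one matrix per starting tree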
I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type
I wanted to get the number of Hits per Type and Quarter across all of the data. Given that the Hits are either 0 or 1, a simple sum() using tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For some of the types that have no occurrences in a given quarter, tapply sometimes returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA with tapply using just sum() I get values I expect:
> sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you, depending on exactly what you are looking for. The NAs appear because tapply returns NA for Type/Quarter combinations that never occur in the data. If you are just interested in the number of positive Hits per Type and Quarter and don't need a record of combinations where no Hits exist, you can get an answer with
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
# keep the full set of levels so types/quarters with no hits still appear as 0 in the table
dataHit[["Type"]] <- factor(dataHit[["Type"]], levels = unique(data[["Type"]]))
dataHit[["Quarter"]] <- factor(dataHit[["Quarter"]], levels = sort(unique(data[["Quarter"]])))
table(dataHit[["Type"]], dataHit[["Quarter"]])
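If you would rather get the zero-filled cross-table in one step, xtabs() sums Hit over the Type/Quarter classification and puts 0 (not NA) in combinations that never occur; recent versions of R (3.4.0 and later) also let you pass a default value to tapply(). A sketch, using the same data and variables as above:
xtabs(Hit ~ Type + Quarter, data = data)
tapply(syno.h, list(syno.wrng, quarter.number), sum, default = 0)  # R >= 3.4.0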
This is my data frame:
ID <- c('TZ1','TZ2','TZ3','TZ4')
hr <- c(56,32,38,NA)
cr <- c(1,4,5,2)
data <- data.frame(ID,hr,cr)
ID hr cr
1 TZ1 56 1
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
I want to remove the rows where data$hr = 56. This is what I want the end product to be:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
This is what I thought would work:
data = data[data$hr !=56,]
However the resulting data frame looks like this:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
NA <NA> NA NA
How can I modify my code to account for the NA value so this doesn't happen? Thank you for your help, I can't figure it out.
EDIT: I also want to keep the NA value in the data frame.
The issue is that when we do == or != on a column that contains NA values, the comparison returns NA for those elements, and that produces an NA row in the subset. So one way to build a logical index with only TRUE/FALSE values is to also use is.na in the comparison.
data[!(data$hr==56 & !is.na(data$hr)),]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
We could also apply the reverse logic
subset(data, hr!=56|is.na(hr))
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
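If it helps, a third option is %in%, which never returns NA, so the index is already TRUE/FALSE:
data[!data$hr %in% 56, ]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2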
I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would interpolate the internal NA while leaving the leading and trailing NAs in place. But instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.
This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a workaround, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As @G.Grothendieck suggests in the comments, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)
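This works because NA propagates through arithmetic: wherever na.approx() leaves an NA (the leading and trailing runs), 0 * NA is NA and adding it blanks out the extrapolated spline value, while everywhere else adding 0 changes nothing.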
I am not sure how to do this, but I need to form clusters from this data frame mydf, omitting the Inf (infinite) values and the values greater than 50. I need a table that has no Inf values and no values greater than 50 (perhaps by blanking out those cells). The clustering part itself is not a problem, because I can do it with the mfuzz package; the only problem is that I want to scale the cluster within a 0-50 margin.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built-in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Anything you do downstream in R should have NA-handling methods, even if it's just na.omit().
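For instance, if a later step needs only the complete rows, a minimal illustration:
na.omit(mydf)   # drops every row that still contains an NA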