Something similar has probably been asked before, but I could only find answers for Python and SQL. Please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1, 100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value of n positions and subtracting it from the average value of the next n positions; let's say n = 5 for now.
What I tried
I used the method below, but when I apply it to a bigger data frame it takes a huge amount of time, so I wonder whether there is a faster way to do this.
library(dplyr)
calc <- function(pos) {
  this.five <- df %>% slice(pos:(pos + 4))        # current window of 5 positions
  next.five <- df %>% slice((pos + 5):(pos + 9))  # next window of 5 positions
  differ <- mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}
df %>%
group_by(position) %>%
do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
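For reference, here is a slightly more compact variant as a sketch; it assumes a data.table version recent enough to provide frollmean and shift (1.12 or later) and skips the unneeded rolling mean over position:
library(data.table)
setDT(df)
# rolling mean of each value and the four values after it
df[, rollmean := frollmean(value, n = 5, align = "left")]
# difference to the rolling mean starting 5 rows further down
df[, result := rollmean - shift(rollmean, n = 5, type = "lead")]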
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1, 100), 20))
RNGkind(sample.kind = "default")
I need to calculate a percentage change over 4 years per km. Is there a function that would allow this calculation?
df <- data.frame(km = c(100:111),
A2012 = c(12:23),
A2013 = c(14,25),
A2014 = c(10,21),
A2015 = c(18, 29),
Coef_Evol="?")
I don't think there is such a thing as one number that accounts for the overall changes over time. So I think you can either use the calculation you already mentioned, (FinalValue - StartValue) / StartValue, or you can create an additional df2 that shows the percentage change year over year:
df <- data.frame(km = c(100:111),
A2012 = c(12:23),
A2013 = c(14,25),
A2014 = c(10,21),
A2015 = c(18, 29))
df
km A2012 A2013 A2014 A2015
1 100 12 14 10 18
2 101 13 25 21 29
3 102 14 14 10 18
4 103 15 25 21 29
5 104 16 14 10 18
6 105 17 25 21 29
7 106 18 14 10 18
8 107 19 25 21 29
9 108 20 14 10 18
10 109 21 25 21 29
11 110 22 14 10 18
12 111 23 25 21 29
# divide each year's column by the previous year's column; A2012 has no
# predecessor, so its column is filled with NA
df2 <- data.frame(df[1], NA * df[2], 100 * (df[-(1:2)] / df[-c(1, ncol(df))] - 1))
df2
km A2012 A2013 A2014 A2015
1 100 NA 16.666667 -28.57143 80.00000
2 101 NA 92.307692 -16.00000 38.09524
3 102 NA 0.000000 -28.57143 80.00000
4 103 NA 66.666667 -16.00000 38.09524
5 104 NA -12.500000 -28.57143 80.00000
6 105 NA 47.058824 -16.00000 38.09524
7 106 NA -22.222222 -28.57143 80.00000
8 107 NA 31.578947 -16.00000 38.09524
9 108 NA -30.000000 -28.57143 80.00000
10 109 NA 19.047619 -16.00000 38.09524
11 110 NA -36.363636 -28.57143 80.00000
12 111 NA 8.695652 -16.00000 38.09524
Perhaps you can then add an additional column that shows the average percentage change, as sketched below.
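For example, a minimal sketch of both options (Coef_Evol and MeanChange are just illustrative column names, and the coefficient is expressed as a percentage to match df2):
# overall change from the first to the last year, as in the question
df$Coef_Evol <- 100 * (df$A2015 - df$A2012) / df$A2012
# average of the year-over-year percentage changes computed above
df2$MeanChange <- rowMeans(df2[-(1:2)])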
I have a longitudinal dataset in long form with around 2,800 rows and around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with values "Normal" or "Impaired", because my outcome variable "score" has a cutoff score indicating impairment according to a norm. The norm consists of different -1.5SD measures (the cutoff points) according to Sex, Edu (years of education), and Age.
Below is what I'm currently doing: checking an Excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.
# nested so the second condition does not overwrite the first result
data$cutoff <- ifelse(data$sex == 1 & data$age < 70
                      & data$edu < 3
                      & data$score < 19.91, "Impaired",
               ifelse(data$sex == 2 & data$age < 70
                      & data$edu < 3
                      & data$score < 18.39, "Impaired", "Normal"))
Additionally, I am wondering if I can import an Excel file stating the norm and create the column according to the values in it.
The Excel file is structured as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat", allocating each ID to the age and education groups used in the norm. Now I want to make use of these columns, matching them against the rows and columns of the Excel file above. One of my motivations is to create code that can be reused in further research with this test's scores.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
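A minimal sketch of that workflow, assuming the norm table has been reshaped in Excel to one row per group (the file name and the column names Sex, Edu, Age and cutoff are placeholders):
library(readxl)

# hypothetical reshaped norm table, one row per Sex/Edu/Age group, e.g.
#   Sex  Edu    Age    Mean   SD    cutoff
#   1    7-12   60-69  27.06  1.80  24.36
norm_df <- read_excel("norms.xlsx")

joined <- merge(x = long_df, y = norm_df,
                by = c("Sex", "Edu", "Age"), all.x = TRUE)
joined$status <- ifelse(joined$score < joined$cutoff, "Impaired", "Normal")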
If I understand correctly, the OP wants to mark a certain type of outlier in their dataset. So there are two tasks here:
Compute the statistics mean(score), sd(score), and the cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category educat.
Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the group statistics are rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
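If a dplyr solution is preferred, inequality conditions in joins are supported via join_by in recent dplyr versions (1.1.0 or later); a rough sketch of the second step, reusing the cutoffs table from above:
library(dplyr)
mydata %>%
  left_join(cutoffs, by = join_by(sex, agecat, educat, score < m1.5SD)) %>%
  mutate(cutoff = ifelse(is.na(m1.5SD), "Normal", "Impaired"))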
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)
I have time series data with N/As. The data are to end up in an animated scatterplot.
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200
For a smooth animation, the data will be supplemented with calculated, additional values/rows. For the X values this is simple arithmetic. No problem so far.
Week X Y
1 1 105
2
2 3 110
4
3 5 N/A
6
4 7 130
8
9
10
11
12
13
14
8 15 160
16
17
18
19
20
21
22
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
The Y values should be interpolated, with the additional requirement that interpolation should only happen between two consecutive values, and not between values that have an N/A between them.
Week X Value
1 1 105
2 interpolated value
2 3 110
4
3 5 N/A
6
4 7 130
8 interpolated value
9 interpolated value
10 interpolated value
11 interpolated value
12 interpolated value
13 interpolated value
14 interpolated value
8 15 160
16 interpolated value
17 interpolated value
18 interpolated value
19 interpolated value
20 interpolated value
21 interpolated value
22 interpolated value
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
I have already experimented with approx, converted the "original" N/As to placeholder values, and tried the zoo package with na.approx etc., but I can't manage to express a correct condition for this kind of "conditional interpolation" or "conditional gap filling". Any hint is welcome and very much appreciated.
Thanks in advance
Replace the NAs with Inf, interpolate and then revert infinite values to NA.
library(zoo)
DF2 <- DF
DF2$Y[is.na(DF2$Y)] <- Inf   # mark the original NAs
# expand Week to every week in the range
w <- merge(DF2, data.frame(Week = min(DF2$Week):max(DF2$Week)), by = 1, all.y = TRUE)
w$Value <- na.approx(w$Y)            # interpolation next to an Inf yields Inf or NaN
w$Value[!is.finite(w$Value)] <- NA   # revert those non-finite values to NA
This gives the following, where Week has been expanded to all weeks; in Y the original NAs are shown as Inf and the inserted NAs as NA. Value is the interpolated Y.
> w
Week X Y Value
1 1 1 105 105.0
2 2 3 110 110.0
3 3 5 Inf NA
4 4 7 130 130.0
5 5 NA NA 137.5
6 6 NA NA 145.0
7 7 NA NA 152.5
8 8 15 160 160.0
9 9 NA NA 165.0
10 10 NA NA 170.0
11 11 NA NA 175.0
12 12 23 180 180.0
13 13 NA NA NA
14 14 NA NA NA
15 15 NA NA NA
16 16 30 Inf NA
17 17 NA NA NA
18 18 NA NA NA
19 19 NA NA NA
20 20 37 200 200.0
Note: Input DF in reproducible form:
Lines <- "
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200"
DF <- read.table(text = Lines, header = TRUE, na.strings = "N/A")
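If the X column should also be filled in for the animation (the part the OP described as simple arithmetic), the same call can be applied to it; a one-line sketch:
w$X <- na.approx(w$X)   # X itself has no N/As, so every inserted gap is filled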
I want to merge the data frames OldData and NewData.
In this case, Nov-2015 and Dec-2015 are present in both.
Since NewData is the most accurate update available, I want to update the values of Nov-2015 and Dec-2015 using the values in NewData, and of course add the records for Jan-2016 and Feb-2016 as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks to @akrun, the problem is solved, and the following code works smoothly!
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Update: Now let's upgrade our problem a little bit.
Suppose our OldData and NewData have another column called "Type".
How do we merge/update it this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
JoinData is supposed to update all values from NewData as follows:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from NewData)
13 2015-12 C 45 (value updated from NewData)
14 2015-12 D 22 (value updated from NewData)
15 2016-01 A 32 (newly added from NewData)
16 2016-02 A 11 (newly added from NewData)
Thanks to @akrun, I have got the solution for the second question as well.
Thanks for the help from everyone here!
Here is the answer:
d1 <- merge(OldData, NewData, by = c("Month", "Type"), all = TRUE)
d2 <- transform(d1, Value.x = ifelse(!is.na(Value.y), Value.y, Value.x))[-4]
d2[!duplicated(d2[1:2], fromLast = TRUE), ]
Here is an option using data.table (a similar approach to the one @thelatemail mentioned in the comments):
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Or
rbindlist(list(OldData, NewData))[, if (.N > 1) .SD[.N] else .SD, Month]
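For the updated question with the extra Type column, the same idea extends to a two-column key; a sketch using data.table's unique() with its by and fromLast arguments:
library(data.table)
setDT(OldData); setDT(NewData)
# keep the last occurrence of each Month/Type pair, i.e. prefer NewData rows
joined <- unique(rbindlist(list(OldData, NewData)),
                 by = c("Month", "Type"), fromLast = TRUE)
setorder(joined, Month, Type)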