Slide along data frame rows and compare rows with next rows - r

I guess something similar has been asked before; however, I could only find answers for Python and SQL. So please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1, 100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value for n positions and subtract this from the average value of the next n positions, let's say n=5 for now.
What I tried
I used the method below, but when I apply it to a bigger data frame it takes a huge amount of time, so I wonder if there is a faster approach.
library(dplyr)

calc <- function(pos) {
  this.five <- df %>% slice(pos:(pos + 4))
  next.five <- df %>% slice((pos + 5):(pos + 9))
  differ <- mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}

df %>%
  group_by(position) %>%
  do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN

I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
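For completeness, the same computation can also be done in dependency-free base R with cumsum, which stays fast on large data; a sketch using the values shown above (hard-coded here so it is reproducible on any R version):

```r
# Rolling mean of width n via cumulative sums, left-aligned,
# then subtract the rolling mean starting n positions later.
vals <- c(27, 37, 57, 89, 20, 86, 97, 62, 58, 6,
          19, 16, 61, 34, 67, 43, 88, 83, 32, 63)
n <- 5
cs <- cumsum(c(0, vals))
roll <- (cs[(n + 1):length(cs)] - cs[seq_len(length(vals) - n + 1)]) / n
roll <- c(roll, rep(NA, n - 1))                  # pad to length(vals)
result <- roll - c(roll[-seq_len(n)], rep(NA, n))
head(round(result, 1))  # -15.8  9.4 37.6 38.8 37.4 22.4
```

The first values match the data.table output above (-15.8, 9.4, 37.6, ...), and positions without a full second window come out as NA.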
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1, 100), 20))
RNGkind(sample.kind = "default")

Related

replace NA with 0 and all other values/text as 1

airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
Hi there,
How do I replace values in Ozone to be binary? If NA then 0 and if a value then 1.
Thanks
H
Assuming your dataframe is called airquality:
airquality$Ozone <- ifelse(is.na(airquality$Ozone), 0, 1)
or, equivalently:
airquality$Ozone <- as.integer(!is.na(airquality$Ozone))
Alternatively
airquality$Ozone[!is.na(airquality$Ozone)] <- 1L
airquality$Ozone[is.na(airquality$Ozone)] <- 0L
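As a side note, the as.integer(!is.na(...)) idiom extends to recoding several columns at once; a minimal sketch on toy data (not the full airquality set):

```r
df <- data.frame(Ozone = c(41, NA, 12), Solar.R = c(190, NA, NA))
# Replace every column with a 0/1 indicator: 1 where a value is present, 0 where NA
df[] <- lapply(df, function(col) as.integer(!is.na(col)))
df$Ozone    # 1 0 1
df$Solar.R  # 1 0 0
```

Assigning into `df[]` keeps the object a data frame while overwriting all columns.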

how to calculate a percentage of evolution or trend over several years with R?

I need to calculate a percentage change over 4 years per km. Is there a function that would allow this calculation?
df <- data.frame(km = 100:111,
                 A2012 = 12:23,
                 A2013 = c(14, 25),
                 A2014 = c(10, 21),
                 A2015 = c(18, 29),
                 Coef_Evol = "?")
I don't think there is such a thing as one number to account for the overall changes over time. So you can either use the calculation you already mentioned, (FinalValue - StartValue) / StartValue, or you can create an additional df2 that shows the percentage change year over year:
df <- data.frame(km = 100:111,
                 A2012 = 12:23,
                 A2013 = c(14, 25),
                 A2014 = c(10, 21),
                 A2015 = c(18, 29))
df
km A2012 A2013 A2014 A2015
1 100 12 14 10 18
2 101 13 25 21 29
3 102 14 14 10 18
4 103 15 25 21 29
5 104 16 14 10 18
6 105 17 25 21 29
7 106 18 14 10 18
8 107 19 25 21 29
9 108 20 14 10 18
10 109 21 25 21 29
11 110 22 14 10 18
12 111 23 25 21 29
df2 <- data.frame(df[1], NA * df[2], 100 * (df[-(1:2)] / df[-c(1, ncol(df))] - 1))
df2
km A2012 A2013 A2014 A2015
1 100 NA 16.666667 -28.57143 80.00000
2 101 NA 92.307692 -16.00000 38.09524
3 102 NA 0.000000 -28.57143 80.00000
4 103 NA 66.666667 -16.00000 38.09524
5 104 NA -12.500000 -28.57143 80.00000
6 105 NA 47.058824 -16.00000 38.09524
7 106 NA -22.222222 -28.57143 80.00000
8 107 NA 31.578947 -16.00000 38.09524
9 108 NA -30.000000 -28.57143 80.00000
10 109 NA 19.047619 -16.00000 38.09524
11 110 NA -36.363636 -28.57143 80.00000
12 111 NA 8.695652 -16.00000 38.09524
Perhaps you can then add an additional column that shows the average percentage change...
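If the single overall coefficient is what is wanted after all, (FinalValue - StartValue) / StartValue can be applied per km directly; a sketch using the first and last year columns from the question (two toy rows only):

```r
df <- data.frame(km = 100:101, A2012 = c(12, 13), A2015 = c(18, 29))
# Overall evolution from 2012 to 2015, expressed in percent
df$Coef_Evol <- 100 * (df$A2015 - df$A2012) / df$A2012
round(df$Coef_Evol, 2)  # 50.00 123.08
```

This fills the Coef_Evol column the question sketched with "?".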

Find first previous lower value for each value in dataframe column

Given the following dataframe:
library(dplyr)
library(stringr)  # `words` comes from stringr

set.seed(1)
my_df = data.frame(x = rep(words[1:5], 50) %>% sort(),
                   y = 1:250,
                   z = sample(seq(from = 30, to = 90, by = 0.1), size = 250, replace = T))
my_df %>% head(30)
x y z
1 a 1 45.9
2 a 2 52.3
3 a 3 64.4
4 a 4 84.5
5 a 5 42.1
6 a 6 83.9
7 a 7 86.7
8 a 8 69.7
9 a 9 67.8
10 a 10 33.7
11 a 11 42.3
12 a 12 40.6
13 a 13 71.2
14 a 14 53.0
15 a 15 76.2
16 a 16 59.9
17 a 17 73.1
18 a 18 89.6
19 a 19 52.8
20 a 20 76.7
21 a 21 86.1
22 a 22 42.7
23 a 23 69.1
24 a 24 37.5
25 a 25 46.0
26 a 26 53.2
27 a 27 30.8
28 a 28 52.9
29 a 29 82.2
30 a 30 50.4
I would like to create the following column using dplyr::mutate:
for each value in column z, show the row index of the nearest previous value in z which is lower.
For example:
for row 8 show 5
for row 22 show 12
I'm not sure how to do this using dplyr, but here is a data.table attempt using a self non-equi join
library(data.table)
setDT(my_df) %>%  # convert to data.table
  # run a self non-equi join and find the closest lower value
  .[., .N - which.max(rev(z < i.z)) + 1L, on = .(y <= y), by = .EACHI] %>%
  # filter the cases where there are no such values
  .[y != V1] %>%
  # join the result back to the original data
  my_df[., on = .(y), res := V1]
head(my_df, 22)
# x y z res
# 1: a 1 45.9 NA
# 2: a 2 52.3 1
# 3: a 3 64.4 2
# 4: a 4 84.5 3
# 5: a 5 42.1 NA
# 6: a 6 83.9 5
# 7: a 7 86.7 6
# 8: a 8 69.7 5
# 9: a 9 67.8 5
# 10: a 10 33.7 NA
# 11: a 11 42.3 10
# 12: a 12 40.6 10
# 13: a 13 71.2 12
# 14: a 14 53.0 12
# 15: a 15 76.2 14
# 16: a 16 59.9 14
# 17: a 17 73.1 16
# 18: a 18 89.6 17
# 19: a 19 52.8 12
# 20: a 20 76.7 19
# 21: a 21 86.1 20
# 22: a 22 42.7 12
I have managed to find a dplyr solution, inspired by a solution given to one of my other questions using rollapply in this link.
library(dplyr)
library(stringr)  # for `words`
library(zoo)      # for rollapply

set.seed(1)
my_df = data.frame(x = rep(words[1:5], 50) %>% sort(),
                   y = 1:250,
                   z = sample(seq(from = 30, to = 90, by = 0.1), size = 250, replace = T))
my_df %>%
  mutate(First_Lower_z_Backwards = row_number() -
           rollapply(z,
                     width = list(0:(-n())),
                     FUN = function(x) which(x < x[1])[1] - 1,
                     fill = NA,
                     partial = T)) %>%
head(22)
x y z First_Lower_z_Backwards
1 a 1 45.9 NA
2 a 2 52.3 1
3 a 3 64.4 2
4 a 4 84.5 3
5 a 5 42.1 NA
6 a 6 83.9 5
7 a 7 86.7 6
8 a 8 69.7 5
9 a 9 67.8 5
10 a 10 33.7 NA
11 a 11 42.3 10
12 a 12 40.6 10
13 a 13 71.2 12
14 a 14 53.0 12
15 a 15 76.2 14
16 a 16 59.9 14
17 a 17 73.1 16
18 a 18 89.6 17
19 a 19 52.8 12
20 a 20 76.7 19
21 a 21 86.1 20
22 a 22 42.7 12
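For readers without data.table or zoo, the same lookup can be sketched in plain base R; it is quadratic in the number of rows, so only suitable for modest data. z below is the first eight values shown above:

```r
z <- c(45.9, 52.3, 64.4, 84.5, 42.1, 83.9, 86.7, 69.7)
# For each position, index of the nearest earlier value that is strictly lower
prev_lower <- vapply(seq_along(z), function(i) {
  lower <- which(z[seq_len(i - 1)] < z[i])
  if (length(lower)) max(lower) else NA_integer_
}, integer(1))
prev_lower  # NA 1 2 3 NA 5 6 5
```

The result matches the first eight rows of the outputs above (NA where no earlier value is lower).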

How to flatten out nested list into one list more efficiently instead of using unlist method?

I have a nested list which contains a set of data.frame objects, and now I want to flatten it out. I used the most common approach, the unlist method, but it did not properly flatten my list and the output was not well represented. How can I do this more efficiently? Does anyone know a trick for this operation? Thanks.
example:
mylist <- list(pass = list(Alpha.df1_yes = airquality[2:4,], Alpha.df2_yes = airquality[3:6,],
                           Alpha.df3_yes = airquality[2:5,], Alpha.df4_yes = airquality[7:9,]),
               fail = list(Alpha.df1_no = airquality[5:7,], Alpha.df2_no = airquality[8:10,],
                           Alpha.df3_no = airquality[13:16,], Alpha.df4_no = airquality[11:13,]))
I tried this; it works, but the output was not properly arranged.
res <- lapply(mylist, unlist)
After flattening, I would like to merge them without duplication:
out <- lapply(res, rbind.data.frame)
my desired output:
mylist[[1]]$pass:
Ozone Solar.R Wind Temp Month Day
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
How can this sort of flattened output be represented more cleanly? Can anyone propose a possible way of doing this in R? Thanks a lot.
Using lapply and duplicated (note that the deduplicated result must be assigned back to x, otherwise the duplicate rows survive):
res <- lapply(mylist, function(i){
  x <- do.call(rbind, i)
  x <- x[!duplicated(x), ]
  rownames(x) <- NULL
  x
})
res$pass
#   Ozone Solar.R Wind Temp Month Day
# 1    36     118  8.0   72     5   2
# 2    12     149 12.6   74     5   3
# 3    18     313 11.5   62     5   4
# 4    NA      NA 14.3   56     5   5
# 5    28      NA 14.9   66     5   6
# 6    23     299  8.6   65     5   7
# 7    19      99 13.8   59     5   8
# 8     8      19 20.1   61     5   9
The above still returns a list; if we want to keep everything in one dataframe with no lists, then:
res <- do.call(rbind, unlist(mylist, recursive = FALSE))
res <- res[!duplicated(res), ]
res
# Ozone Solar.R Wind Temp Month Day
# pass.Alpha.df1_yes.2 36 118 8.0 72 5 2
# pass.Alpha.df1_yes.3 12 149 12.6 74 5 3
# pass.Alpha.df1_yes.4 18 313 11.5 62 5 4
# pass.Alpha.df2_yes.5 NA NA 14.3 56 5 5
# pass.Alpha.df2_yes.6 28 NA 14.9 66 5 6
# pass.Alpha.df4_yes.7 23 299 8.6 65 5 7
# pass.Alpha.df4_yes.8 19 99 13.8 59 5 8
# pass.Alpha.df4_yes.9 8 19 20.1 61 5 9
# fail.Alpha.df2_no.10 NA 194 8.6 69 5 10
# fail.Alpha.df3_no.13 11 290 9.2 66 5 13
# fail.Alpha.df3_no.14 14 274 10.9 68 5 14
# fail.Alpha.df3_no.15 18 65 13.2 58 5 15
# fail.Alpha.df3_no.16 14 334 11.5 64 5 16
# fail.Alpha.df4_no.11 7 NA 6.9 74 5 11
# fail.Alpha.df4_no.12 16 256 9.7 69 5 12
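A small aside: unique() on a data frame drops duplicated rows directly, so the per-group step can also be written without an explicit duplicated() call; a toy sketch (not the airquality lists):

```r
mylist <- list(pass = list(a = data.frame(x = 1:2), b = data.frame(x = 2:3)))
res <- lapply(mylist, function(i) {
  x <- unique(do.call(rbind, i))  # rbind all pieces, then drop duplicate rows
  rownames(x) <- NULL
  x
})
res$pass$x  # 1 2 3
```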

dcast without ID variables

In "An Introduction to reshape2", Sean C. Anderson presents the following example.
He uses the airquality data and lower-cases the column names:
names(airquality) <- tolower(names(airquality))
The data look like
# ozone solar.r wind temp month day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Then he melts them by
aql <- melt(airquality, id.vars = c("month", "day"))
to get
# month day variable value
# 1 5 1 ozone 41
# 2 5 2 ozone 36
# 3 5 3 ozone 12
# 4 5 4 ozone 18
# 5 5 5 ozone NA
# 6 5 6 ozone 28
Finally he gets the original one (different column order) by
aqw <- dcast(aql, month + day ~ variable)
My Question
Assume now that we do not have ID variables (i.e. month and day) and have melted the data as follows
aql <- melt(airquality)
which look like
# variable value
# 1 ozone 41
# 2 ozone 36
# 3 ozone 12
# 4 ozone 18
# 5 ozone NA
# 6 ozone 28
My question is how can I get the original ones? The original ones would correspond to
# ozone solar.r wind temp
# 1 41 190 7.4 67
# 2 36 118 8.0 72
# 3 12 149 12.6 74
# 4 18 313 11.5 62
# 5 NA NA 14.3 56
# 6 28 NA 14.9 66
Another option is unstack
out <- unstack(aql,value~variable)
head(out)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
As the question is about dcast, we can create a sequence column and then use dcast
aql$indx <- with(aql, ave(seq_along(variable), variable, FUN=seq_along))
out1 <- dcast(aql, indx~variable, value.var='value')[,-1]
head(out1)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
If you are using data.table, the devel version of data.table, i.e. v1.9.5, also has a dcast function. Instructions to install the devel version are here.
library(data.table)#v1.9.5+
setDT(aql)[, indx:=1:.N, variable]
dcast(aql, indx~variable, value.var='value')[,-1]
One option using split,
out <- data.frame(sapply(split(aql, aql$variable), `[[`, 2))
Here, the data is split by the variable column, and then the second column of each group is combined back into a data frame (the `[[` function with the argument 2 is passed to sapply).
head(out)
#   ozone solar.r wind temp month day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6
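The same "add an index, then widen" trick also works in dependency-free base R with stats::reshape, for readers who want to avoid extra packages; a toy sketch with two melted variables:

```r
aql <- data.frame(variable = rep(c("ozone", "wind"), each = 3),
                  value = c(41, 36, 12, 7.4, 8.0, 12.6))
# Within-variable row index, so each long row maps to a unique wide cell
aql$indx <- ave(seq_along(aql$variable), aql$variable, FUN = seq_along)
wide <- reshape(aql, idvar = "indx", timevar = "variable", direction = "wide")
wide$value.ozone  # 41 36 12
```

reshape names the widened columns value.<level>, so a final rename (or dropping the "value." prefix) recovers the original headers.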
