R programming help in editing code - r

I've asked several questions about this and all the answers were really helpful, but once again my data is unusual and I need help. Basically, I want to find the average speed over a given range of time intervals; for example, from 6 s to 40 s my average speed would be 5 m/s, and so on.
I was pointed to this code:
library(IRanges)
idx <- seq(1, ncol(data), by=2)
# idx is now 1, 3, 5. It will be passed one value at a time to `i`.
# that is, `i` will take values 1 first, then 3 and then 5 and each time
# the code within is executed.
o <- lapply(idx, function(i) {
ir1 <- IRanges(start=seq(0, max(data[[i]]), by=401), width=401)
ir2 <- IRanges(start=data[[i]], width=1)
t <- findOverlaps(ir1, ir2)
d <- data.frame(mean=tapply(data[[i+1]], queryHits(t), mean))
cbind(as.data.frame(ir1), d)
})
which gives this output
# > o
# [[1]]
# start end width mean
# 1 0 400 401 1.05
#
# [[2]]
# start end width mean
# 1 0 400 401 1.1
#
# [[3]]
# start end width mean
# 1 0 400 401 1.383333
So if I wanted the interval to be every 100 s, I would just change by=401 in the ir1 <- ... line to by=100.
But my data is unusual for a few reasons:
My data doesn't always start at 0 s; sometimes it starts at 20 s, depending on the specimen and whether it moves.
My data is not collected at regular intervals such as every 1 s, 2 s or 3 s. Sometimes I get data for 1-20 s but nothing for 20-40 s, simply because the specimen does not move.
I think the findOverlaps portion of the code affects my output. How can I get rid of that without disturbing the output?
Here is some data to illustrate the problem, although all of my real data ends at 2000 s.
Time Speed Time Speed Time Speed
6.3 1.6 3.1 1.7 0.3 2.4
11.3 1.3 5.1 2.2 1.3 1.3
13.8 1.3 6.3 3.4 3.1 1.5
14.1 1.0 7.0 2.3 4.5 2.7
47.4 2.9 11.3 1.2 5.1 0.5
49.2 0.7 26.5 3.3 5.9 1.7
50.5 0.9 27.3 3.4 9.7 2.4
57.1 1.3 36.6 2.5 11.8 1.3
72.9 2.9 40.3 1.1 13.1 1.0
86.6 2.4 44.3 3.2 13.8 0.6
88.5 3.4 50.9 2.6 14.0 2.4
89.0 3.0 62.6 1.5 14.8 2.2
94.8 2.9 66.8 0.5 15.5 2.6
117.4 0.5 67.3 1.1 16.4 3.2
123.7 3.2 67.7 0.6 26.5 0.9
124.5 1.0 68.2 3.2 44.7 3.0
126.1 2.8 72.1 2.2 45.1 0.8
As you can see from the data, a series doesn't necessarily end at a round value such as 60 s; sometimes it ends at 57 s, for example.
EDIT: added dput of the data
structure(list(Time = c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5,
57.1, 72.9, 86.6, 88.5, 89, 94.8, 117.4, 123.7, 124.5, 126.1),
Speed = c(1.6, 1.3, 1.3, 1, 2.9, 0.7, 0.9, 1.3, 2.9, 2.4,
3.4, 3, 2.9, 0.5, 3.2, 1, 2.8), Time.1 = c(3.1, 5.1, 6.3,
7, 11.3, 26.5, 27.3, 36.6, 40.3, 44.3, 50.9, 62.6, 66.8,
67.3, 67.7, 68.2, 72.1), Speed.1 = c(1.7, 2.2, 3.4, 2.3,
1.2, 3.3, 3.4, 2.5, 1.1, 3.2, 2.6, 1.5, 0.5, 1.1, 0.6, 3.2,
2.2), Time.2 = c(0.3, 1.3, 3.1, 4.5, 5.1, 5.9, 9.7, 11.8,
13.1, 13.8, 14, 14.8, 15.5, 16.4, 26.5, 44.7, 45.1), Speed.2 = c(2.4,
1.3, 1.5, 2.7, 0.5, 1.7, 2.4, 1.3, 1, 0.6, 2.4, 2.2, 2.6,
3.2, 0.9, 3, 0.8)), .Names = c("Time", "Speed", "Time.1",
"Speed.1", "Time.2", "Speed.2"), class = "data.frame", row.names = c(NA,
-17L))

Sorry if I don't understand your question entirely. Could you explain why this example doesn't do what you're trying to do?
# use a pre-loaded data set
mtcars
# choose which variable to cut
var <- 'mpg'
# define groups, whether that be time or something else
# and choose how to cut it.
x <- cut( mtcars[ , var ] , c( -Inf , seq( 15 , 25 , by = 2.5 ) , Inf ) )
# look at your cut points, for every record
x
# you can merge them back on to the mtcars data frame if you like..
mtcars$cutpoints <- x
# ..but that's not necessary
# find the mean within those groups
tapply(
mtcars[ , var ] ,
x ,
mean
)
# find the mean within groups, using a different variable
tapply(
mtcars[ , 'wt' ] ,
x ,
mean
)
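The same cut() and tapply() idea can be sketched on the time/speed layout from the question. This is only a sketch under assumptions: speed_data, bin_width and binned_means are made-up names, the data frame holds just the first eight rows of the first two series, and the 20 s bins starting at 0 are an arbitrary choice.

```r
# First eight rows of the first two Time/Speed pairs from the question
speed_data <- data.frame(
  Time    = c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5, 57.1),
  Speed   = c(1.6, 1.3, 1.3, 1.0, 2.9, 0.7, 0.9, 1.3),
  Time.1  = c(3.1, 5.1, 6.3, 7.0, 11.3, 26.5, 27.3, 36.6),
  Speed.1 = c(1.7, 2.2, 3.4, 2.3, 1.2, 3.3, 3.4, 2.5)
)

bin_width <- 20                          # assumed interval length in seconds
breaks <- seq(0, 60, by = bin_width)     # must span the whole time range

# Time columns sit at odd positions, each followed by its Speed column
idx <- seq(1, ncol(speed_data), by = 2)

binned_means <- lapply(idx, function(i) {
  # right = FALSE gives half-open bins: [0,20), [20,40), [40,60)
  bins <- cut(speed_data[[i]], breaks = breaks, right = FALSE)
  # bins with no observations come back as NA instead of being dropped
  tapply(speed_data[[i + 1]], bins, mean)
})

binned_means
```

Because the breaks are fixed rather than derived from each series, a series that starts late or skips an interval simply gets NA for that bin, so the bins stay aligned across specimens.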

Related

Finding distance between a row and the row two above it in R

I would like to efficiently compute distances between every row in a matrix and the row two rows above it in R...
My attempts at finding a dplyr rowwise solution with lag(., n = 2) have failed, and I'm sure there's a better solution than this for loop.
Thoughts are much appreciated!
library(rdist)
library(tidyverse)
input_labs <- structure(list(sodium = c(140, 152.6, 138, 152.4, 140, 152.6,
141, 152.7, 141, 152.7), chloride = c(103, 148.9, 104, 149, 102,
148.8, 103, 148.9, 104, 149), potassium_plas = c(3.4, 0.34, 4.1,
0.41, 3.7, 0.37, 4, 0.4, 3.7, 0.37), co2_totl = c(31, 3.1, 22,
2.2, 23, 2.3, 27, 2.7, 20, 2), bun = c(11, 1.1, 5, 0.5, 8, 0.8,
21, 2.1, 10, 1), creatinine = c(0.84, 0.084, 0.53, 0.053, 0.69,
0.069, 1.04, 0.104, 1.86, 0.186), calcium = c(9.3, 0.93, 9.8,
0.98, 9.4, 0.94, 9.4, 0.94, 9.1, 0.91), glucose = c(102, 10.2,
99, 9.9, 115, 11.5, 94, 9.4, 122, 12.2), anion_gap = c(6, 0.599999999999989,
12, 1.20000000000001, 15, 1.50000000000001, 11, 1.09999999999998,
17, 1.69999999999999)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# pre-allocate the result; the first two rows have no row two above them
dist_prior <- rep(NA_real_, nrow(input_labs))
for (i in 3:nrow(input_labs)) {
  dist_prior[i] <- cdist(input_labs[i, ], input_labs[i - 2, ])
}
We could loop over the sequence of rows with map_dbl and apply the function, prepending NAs so that the result has the correct length:
library(dplyr)
library(rdist)
library(purrr)
input_labs %>%
mutate(dist_prior = c(NA_real_, NA_real_,
map_dbl(3:n(), ~ cdist(cur_data()[.x,], cur_data()[.x-2, ]))))
Output:
# A tibble: 10 × 10
sodium chloride potassium_plas co2_totl bun creatinine calcium glucose anion_gap dist_prior
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 140 103 3.4 31 11 0.84 9.3 102 6 NA
2 153. 149. 0.34 3.1 1.1 0.084 0.93 10.2 0.600 NA
3 138 104 4.1 22 5 0.53 9.8 99 12 13.0
4 152. 149 0.41 2.2 0.5 0.053 0.98 9.9 1.20 1.30
5 140 102 3.7 23 8 0.69 9.4 115 15 16.8
6 153. 149. 0.37 2.3 0.8 0.069 0.94 11.5 1.50 1.68
7 141 103 4 27 21 1.04 9.4 94 11 25.4
8 153. 149. 0.4 2.7 2.1 0.104 0.94 9.4 1.10 2.54
9 141 104 3.7 20 10 1.86 9.1 122 17 31.5
10 153. 149 0.37 2 1 0.186 0.91 12.2 1.70 3.15
Alternatively, split both the original data and the lagged data by row, then use map2_dbl to loop over the two lists and apply cdist:
input_labs$dist_prior <- map2_dbl(
asplit(lag(input_labs, n = 2), 1),
asplit(input_labs, 1),
~ cdist(as.data.frame.list(.x), as.data.frame.list(.y))[,1])
In base R you can use diff and rowSums as shown below:
c(NA, NA, sqrt(rowSums(diff(as.matrix(input_labs), 2)^2)))
[1] NA NA 12.955157 1.295516 16.832873 1.683287 25.381342 2.538134 31.493688 3.149369
You can cbind the results to the original dataframe.
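To see why the diff/rowSums one-liner gives Euclidean distances, here is a small worked sketch on a made-up 4 x 2 matrix (toy_points and lag2_dist are hypothetical names, not from the question). diff() with lag = 2 subtracts each row from the row two below it, so squaring, summing across columns and taking the square root yields the distance.

```r
# Toy matrix: each row is a point in 2-D
toy_points <- rbind(c(0, 0),
                    c(1, 1),
                    c(3, 4),
                    c(1, 2))

# Row 3 minus row 1 is (3, 4); row 4 minus row 2 is (0, 1)
row_differences <- diff(toy_points, lag = 2)

# Euclidean norm of each difference, padded with two leading NAs
lag2_dist <- c(NA, NA, sqrt(rowSums(row_differences^2)))
lag2_dist
# [1] NA NA  5  1
```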

Tidy Evaluation not working with mutate and stringr

I've been trying to use tidy evaluation and stringr together inside a mutate pipe, but every time I run it I get an undesirable result.
Instead of replacing the letter 'a' with the letter 'X', it overwrites the entire vector with the column name, as you can see in the example below, which uses the iris dataset.
text_col="Species"
iris %>%
mutate({{text_col}} := str_replace_all({{text_col}}, pattern = "a", replacement = "X"))
result:
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8), Sepal.Width = c(3.5, 3,
3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3, 3, 4), Petal.Length = c(1.4,
1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1,
1.2), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2,
0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2), Species = c("Species", "Species",
"Species", "Species", "Species", "Species", "Species", "Species",
"Species", "Species", "Species", "Species", "Species", "Species",
"Species")), row.names = c(NA, 15L), class = "data.frame")
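Doesn't stringr support tidy evaluation or the curly-curly ({{}}) operator?
Tidy evaluation depends entirely on how you pass your inputs. In your code text_col holds a string, so {{text_col}} injects the string itself and str_replace_all works on the literal text "Species" rather than on the column.
If you instead pass the column as an unquoted name to a function argument, your attempt works.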
library(dplyr)
library(stringr)
library(rlang)
change_fun <- function(df, text_col) {
df %>% mutate({{text_col}} := str_replace_all({{text_col}}, "a","X"))
}
change_fun(iris, Species) %>% head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosX
#2 4.9 3.0 1.4 0.2 setosX
#3 4.7 3.2 1.3 0.2 setosX
#4 4.6 3.1 1.5 0.2 setosX
#5 5.0 3.6 1.4 0.2 setosX
#6 5.4 3.9 1.7 0.4 setosX
To pass the column name as a string, first convert it to a symbol with sym and then unquote it with !!.
change_fun <- function(df, text_col) {
df %>% mutate(!!text_col := str_replace_all(!!sym(text_col), "a","X"))
}
change_fun(iris, "Species") %>% head
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosX
#2 4.9 3.0 1.4 0.2 setosX
#3 4.7 3.2 1.3 0.2 setosX
#4 4.6 3.1 1.5 0.2 setosX
#5 5.0 3.6 1.4 0.2 setosX
#6 5.4 3.9 1.7 0.4 setosX
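Another option for string column names, shown here only as a sketch, is the .data pronoun that dplyr makes available inside mutate. It looks the column up by name, so no conversion to a symbol is needed. The name replaced is made up for this example.

```r
library(dplyr)
library(stringr)

text_col <- "Species"  # column name held as a string, as in the question

replaced <- iris %>%
  # left-hand side: !! unquotes the string to name the output column
  # right-hand side: .data[[text_col]] fetches the column by its name
  mutate(!!text_col := str_replace_all(.data[[text_col]], "a", "X"))

head(replaced$Species)
```

This keeps the pipeline usable outside a function, which is the situation in the original question.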

Use numbers as column names while grouping them

Here is my sample data:
mydata<-structure(list(x1 = c(0, 8.6, 11.2, 8.4, 0, 0), x2 = c(0, 0,
7.8, 7.6, 1.2, 10.2), y1 = c(0, 0, 3.4, 21.4, 1.8, 1.4), y2 = c(7.8,
7.6, 1.2, 10.2, 7, 0), z1 = c(0, 1.6, 7.6, 23.6, 3.2, 0), z2 = c(8.6,
1.4, 0, 0, 0, 0)), .Names = c("x1", "x2", "y1", "y2", "z1", "z2"
), class = "data.frame", row.names = c(NA, -6L))
x1 x2 y1 y2 z1 z2
1 0.0 0.0 0.0 7.8 0.0 8.6
2 8.6 0.0 0.0 7.6 1.6 1.4
3 11.2 7.8 3.4 1.2 7.6 0.0
4 8.4 7.6 21.4 10.2 23.6 0.0
5 0.0 1.2 1.8 7.0 3.2 0.0
6 0.0 10.2 1.4 0.0 0.0 0.0
With the code below, it is possible to group columns as x, y and z.
grps <- unique(gsub("[0-9]", "", colnames(mydata)))
# [1] "x" "y" "z"
But when I rename the columns like this:
myd<-structure(list(X2005 = c(0, 8.6, 11.2, 8.4, 0, 0), X2005.1 = c(0,
0, 7.8, 7.6, 1.2, 10.2), X2006 = c(0, 0, 3.4, 21.4, 1.8, 1.4),
X2006.1 = c(7.8, 7.6, 1.2, 10.2, 7, 0), X2007 = c(0, 1.6,
7.6, 23.6, 3.2, 0), X2007.1 = c(8.6, 1.4, 0, 0, 0, 0)), .Names = c("X2005",
"X2005.1", "X2006", "X2006.1", "X2007", "X2007.1"), row.names = c(NA,
6L), class = "data.frame")
X2005 X2005.1 X2006 X2006.1 X2007 X2007.1
1 0.0 0.0 0.0 7.8 0.0 8.6
2 8.6 0.0 0.0 7.6 1.6 1.4
3 11.2 7.8 3.4 1.2 7.6 0.0
4 8.4 7.6 21.4 10.2 23.6 0.0
5 0.0 1.2 1.8 7.0 3.2 0.0
6 0.0 10.2 1.4 0.0 0.0 0.0
I want to see:
# [1] "2005" "2006" "2007"
We can use gsub to match either the letter 'X' at the beginning (^) of the string or (|) a . followed by digits at the end ($) of the string, and replace the match with an empty string ("").
names(myd) <- gsub("^X|\\.\\d+$", "", names(myd))
names(myd)
#[1] "2005" "2005" "2006" "2006" "2007" "2007"
unique(names(myd))
#[1] "2005" "2006" "2007"
If we know the number of digits and their position, then substr would be faster
substr(names(myd), 2, 5)
One option would be to use sub and convert the names to a factor with labels as needed.
names(mydata) <- factor(sub("[0-9]", "", names(mydata)), labels = 2005:2007)
and then check your column names
names(mydata)
#[1] "2005" "2005" "2006" "2006" "2007" "2007"
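If the goal is to work with the columns group by group, the cleaned names can serve as a grouping vector without renaming the data frame, which avoids duplicate column names. The following is a minimal sketch on a cut-down, two-row version of the data; myd_small, years and yearly_totals are names invented for the example.

```r
# Two rows and four columns taken from the renamed data in the question
myd_small <- data.frame(X2005   = c(0, 8.6),
                        X2005.1 = c(0, 0),
                        X2006   = c(0, 0),
                        X2006.1 = c(7.8, 7.6))

# Strip the leading "X" and any trailing ".<digits>" suffix
years <- gsub("^X|\\.\\d+$", "", names(myd_small))
unique(years)
# [1] "2005" "2006"

# split.default() splits a data frame by columns; sum each group row-wise
yearly_totals <- sapply(split.default(myd_small, years), rowSums)
yearly_totals
```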

Example of using dput()

Being a new user here, my questions are not being fully answered because they are not reproducible. I read the thread on producing reproducible code, but to no avail. I am specifically lost on how to use the dput() function.
Could someone provide a step-by-step guide on how to use dput(), for example with the iris data frame? It would be very helpful.
Using the iris dataset, which is conveniently included with R, we can see how dput() works:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now we can get the whole dataset using dput(iris). In most situations there is no need to provide a whole dataset in a Stack Overflow question, as a few rows of the relevant variables suffice as a working data example.
Two things come in handy: The head() function outputs only the first six rows of a dataframe/matrix. Also, the indexing in R (via brackets) allows you to select only specific columns.
Therefore, we can restrict the output of dput() using a combination of these two:
dput(head(iris[, c(1, 3)]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
will give us the code to reproduce the first (up to) six rows of columns 1 and 3 of the iris dataset.
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
If the first rows do not suffice, we can skip using head() and rely on indexing only:
dput(iris[1:20, c(1, 3)])
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
will give us the first twenty rows:
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
7 4.6 1.4
8 5.0 1.5
9 4.4 1.4
10 4.9 1.5
11 5.4 1.5
12 4.8 1.6
13 4.8 1.4
14 4.3 1.1
15 5.8 1.2
16 5.7 1.5
17 5.4 1.3
18 5.1 1.4
19 5.7 1.7
20 5.1 1.5
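A quick way to convince yourself that the dput() output really is a complete recipe for the data is a round trip: capture the text, evaluate it again, and compare. The sketch below uses made-up names (sample_rows, dput_text, rebuilt) and relies only on base R.

```r
# The subset we would paste into a question
sample_rows <- head(iris[, c(1, 3)])

# dput() prints R code; capture.output() collects it as character lines
dput_text <- capture.output(dput(sample_rows))

# Evaluating that code rebuilds the object, just as a reader would
# when pasting it into their own session
rebuilt <- eval(parse(text = dput_text))

# TRUE when the rebuilt data frame matches the original
isTRUE(all.equal(rebuilt, sample_rows))
```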

how to plot data in time series

I have data that looks like this:
time sucrose fructose glucose galactose molasses water
1 5 0.0 0.00 0.0 0.0 0.3 0
2 10 0.3 0.10 0.1 0.0 1.0 0
3 15 0.8 0.20 0.2 0.2 1.4 0
4 20 1.3 0.35 0.7 0.4 2.5 0
5 25 2.2 0.80 1.6 0.5 3.5 0
6 30 3.1 1.00 2.3 0.6 4.5 0
7 35 3.6 1.60 3.1 0.7 5.7 0
8 40 5.1 2.80 4.3 0.7 6.7 0
How can I make a time series plot that uses the time column? They are all increasing values.
I saw the post "multiple-time-series-in-one-plot", which uses ts.plot to achieve something similar to what I want to show.
Input data for the table above:
structure(list(time = c(5, 10, 15, 20, 25, 30, 35, 40), sucrose = c(0,
0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1), fructose = c(0, 0.1, 0.2,
0.35, 0.8, 1, 1.6, 2.8), glucose = c(0, 0.1, 0.2, 0.7, 1.6, 2.3,
3.1, 4.3), galactose = c(0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.7),
molasses = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7), water = c(0,
0, 0, 0, 0, 0, 0, 0)), .Names = c("time", "sucrose", "fructose",
"glucose", "galactose", "molasses", "water"), row.names = c(NA,
-8L), class = "data.frame")
It doesn't seem like a ts plot is necessary. Here's how you could do it in base R:
# empty plot; ylim covers every series so that none is cut off
with(df, plot(time, sucrose, type="n", ylab="contents", ylim=range(df[-1])))
var <- names(df)[-1]
for(i in var) lines(df$time, df[,i])
A more elegant solution, however, would use the tidyr and ggplot2 packages (with dplyr for the pipe):
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>%
gather(content, val, -time)
ggplot(df, aes(time, val, col=content)) + geom_line()
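For completeness, base R can also draw every column in one call with matplot(), which picks the y-axis range from all series at once. This is only a sketch: sugar_df and series are names invented here, and the colours and legend position are arbitrary choices.

```r
# Data from the question, rebuilt as a data frame
sugar_df <- data.frame(
  time      = c(5, 10, 15, 20, 25, 30, 35, 40),
  sucrose   = c(0, 0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1),
  fructose  = c(0, 0.1, 0.2, 0.35, 0.8, 1, 1.6, 2.8),
  glucose   = c(0, 0.1, 0.2, 0.7, 1.6, 2.3, 3.1, 4.3),
  galactose = c(0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.7),
  molasses  = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7),
  water     = c(0, 0, 0, 0, 0, 0, 0, 0)
)

# Every column except time becomes one line in the plot
series <- as.matrix(sugar_df[-1])
line_cols <- seq_len(ncol(series))

matplot(sugar_df$time, series, type = "l", lty = 1, col = line_cols,
        xlab = "time", ylab = "contents")
legend("topleft", legend = colnames(series), col = line_cols, lty = 1)
```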
