Subset a dataframe based on time variable - r

I have the following dataframe which is already a subset of a much larger dataframe:
Time X.N2O._ppm
1 15/05/2015 13:30:07.291 0.03941801
2 15/05/2015 13:30:08.307 0.01014003
3 15/05/2015 13:30:09.323 0.02577801
4 15/05/2015 13:30:10.338 0.02554231
5 15/05/2015 13:30:11.354 0.02489800
6 15/05/2015 13:30:12.370 0.02417584
7 15/05/2015 13:30:13.386 0.02489115
8 15/05/2015 13:30:14.402 0.02524912
9 15/05/2015 13:30:15.417 0.02556182
10 15/05/2015 13:30:16.433 0.02574274
I'm trying to subset based on the Time variable but get the following error:
subtime = subset(datasubcolrow, "Time" < 15/05/2015 13:30:15.417)
Error: unexpected numeric constant in "subtime = subset(datasubcolrow, "Time" < 15/05/2015 13"
Checking data types shows that Time is numeric:
sapply(datasubcolrow, mode)
Time X.N2O._ppm
"numeric" "numeric"
Do I need to convert it into a date format, and how do I go about doing this?
Thanks
Rory

You can change the 'Time' column to 'POSIXct' class and then subset
datasubcolrow$Time <- as.POSIXct(datasubcolrow$Time, format='%d/%m/%Y %H:%M:%OS')
subset(datasubcolrow, Time < as.POSIXct('15/05/2015 13:30:15.417',
format='%d/%m/%Y %H:%M:%OS'))
data
datasubcolrow <- structure(list(Time = c("15/05/2015 13:30:07.291",
"15/05/2015 13:30:08.307",
"15/05/2015 13:30:09.323", "15/05/2015 13:30:10.338",
"15/05/2015 13:30:11.354",
"15/05/2015 13:30:12.370", "15/05/2015 13:30:13.386",
"15/05/2015 13:30:14.402",
"15/05/2015 13:30:15.417", "15/05/2015 13:30:16.433"),
X.N2O._ppm = c(0.03941801,
0.01014003, 0.02577801, 0.02554231, 0.024898, 0.02417584, 0.02489115,
0.02524912, 0.02556182, 0.02574274)), .Names = c("Time", "X.N2O._ppm"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
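As a quick check of the approach above, here is a minimal sketch on three of the rows from the question. The tz = "UTC" argument and the digits.secs option are assumptions added here so that the result does not depend on the local time zone and the milliseconds are printed; they are not part of the original answer.

```r
# Three rows copied from the question's data
datasubcolrow <- data.frame(
  Time = c("15/05/2015 13:30:14.402", "15/05/2015 13:30:15.417",
           "15/05/2015 13:30:16.433"),
  X.N2O._ppm = c(0.02524912, 0.02556182, 0.02574274),
  stringsAsFactors = FALSE
)

fmt <- "%d/%m/%Y %H:%M:%OS"          # %OS parses fractional seconds
datasubcolrow$Time <- as.POSIXct(datasubcolrow$Time, format = fmt, tz = "UTC")

# Parse the cutoff with the same format so both sides are POSIXct
cutoff  <- as.POSIXct("15/05/2015 13:30:15.417", format = fmt, tz = "UTC")
subtime <- subset(datasubcolrow, Time < cutoff)

options(digits.secs = 3)             # show milliseconds when printing
print(subtime)
```

Because the comparison is strict, the row whose timestamp equals the cutoff is dropped; use <= to keep it.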

Related

Hierarchy across rows for the same id

So, I have a data set with many observations for X individuals, with multiple rows for some individuals. For each row, I have assigned a classification (the variable clinical_significance) that takes three values in prioritized order: definite disease, possible, colonization. Now, I would like to have only one row for each individual, carrying the "highest classification" across that individual's rows, i.e. definite disease if present, otherwise possible, and otherwise colonization. Any good suggestions on how to do this?
For instance, as seen in the example, I would like clinical_significance for all rows of ID #23 to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical_significance into a factor, which is stored as an integer instead of character (i.e. 1 = definite disease, 2 = possible, 3 = colonization). Then, for each ID, take the row with the lowest number.
library(dplyr)

df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  # with_ties = FALSE keeps exactly one row per id, even when several rows share the top rank
  slice_min(fct_clin_sig, n = 1, with_ties = FALSE)
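For comparison, the same ranking idea can be sketched in base R without dplyr. The rank column and the names lv, df_sorted and df_top are illustrative and do not come from the original posts; the data is a reduced copy of the question's example.

```r
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
  clinical_significance = c("possible", "possible", "possible", "possible",
                            "possible", "definite disease"),
  stringsAsFactors = FALSE
)

# Priority order: position 1 outranks 2, which outranks 3
lv <- c("definite disease", "possible", "colonization")
df$rank <- match(df$clinical_significance, lv)

# Sort by id and rank, then keep the first (highest-priority) row per id
df_sorted <- df[order(df$id, df$rank), ]
df_top <- df_sorted[!duplicated(df_sorted$id), ]
print(df_top)
```

This returns one row per id, with id 23 classified as "definite disease" and ids 18 and 20 as "possible".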
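I fixed it using the following, which copies "definite disease" to every row of an id that has at least one such row:
library(dplyr)

df <- df %>%
  group_by(id) %>%
  # if/else rather than ifelse(), so that the fallback keeps each row's own value
  mutate(clinical_significance_new = if (any(clinical_significance == "definite disease")) "definite disease" else as.character(clinical_significance))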

Operations with two lines in one data.frame for all columns in r

I would like to perform an operation between specific lines of a data.frame.
For example, considering the following data.frame:
structure(list(`2020` = c(1264, 5.23, 25475, 34454, 57011,
3,312), `2019` = c(1.57, 5,115, 24,811, 33,414, 50,883, 3,332
), `2018` = c(1,587, 5,373, 25,391, 33,589, 50,547, 3,952), `2017` = c(1,711,
5,675, 24,674, 33,978, 54,903, 4,288), `2016` = c(1,739, 4,507,
24,199, 35,015, 52,051, 4,419), `2015` = c(1,813, 5,631, 25,488,
35929, 56415, 3859)), row.names = c("3", "4", "5", "6", "7",
"8"), class = "data.frame")
I need to add a new row obtained by subtracting rows 5 and 4 from row 7, for every column, so that it looks like this:
structure(list(`2020` = c(1264, 5.23, 25475, 34454, 57011,
3,312, 26,306), `2019` = c(1.57, 5,115, 24,811, 33,414, 50,883,
3,332, 20,957), `2018` = c(1,587, 5,373, 25,391, 33,589, 50,547,
3,952, 25,391), `2017` = c(1,711, 5,675, 24,674, 33,978, 54,903,
4,288, 24,554), `2016` = c(1,739, 4,507, 24,199, 35,015, 52,051,
4,419, 23,345), `2015` = c(1,813, 5,631, 25,488, 35,929, 56,415,
3859, 25296)), row.names = c("3", "4", "5", "6", "7", "8",
"9"), class = "data.frame")
I need to perform this operation inside a loop, so a dplyr solution might be preferable, but any help is welcome.
I achieved it by doing this:
df['9',] <- df['7',] - df['5',] - df['4',]
The number I got in column 2020, row 9 doesn't match what is in your example, though.
2020 2019 2018 2017 2016 2015
3 1264.00 1.57 1587 1711 1739 1813
4 5.23 5115.00 5373 5675 4507 5631
5 25475.00 24811.00 25391 24674 24199 25488
6 34454.00 3414.00 33589 33978 35015 35929
7 57011.00 50883.00 50547 54903 52051 56415
8 3312.00 3332.00 3952 4288 4419 3859
9 31530.77 20957.00 19783 24554 23345 25296
Don't put commas in numeric columns in R or any other language: c(3,312) is read as two values, 3 and 312, not as 3312.
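To make the point about commas concrete, here is a small sketch that rebuilds two of the columns with the thousands separators removed and then adds the new row. The values are read from the question's 2020 and 2019 columns; restricting the example to two columns and using check.names = FALSE are choices made here for brevity.

```r
# check.names = FALSE keeps the year names as-is instead of turning them into X2020, X2019
df <- data.frame(
  `2020` = c(1264, 5.23, 25475, 34454, 57011, 3312),
  `2019` = c(1.57, 5115, 24811, 33414, 50883, 3332),
  row.names = c("3", "4", "5", "6", "7", "8"),
  check.names = FALSE
)

# Row "7" minus rows "5" and "4", column by column, stored as a new row "9"
df["9", ] <- df["7", ] - df["5", ] - df["4", ]
print(df["9", ])
```

Indexing by the row names "7", "5" and "4" rather than by position keeps the calculation tied to the labels used in the question.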

Creating New Columns out of strings in cell

I have a data frame that looks like
inx time
1 201566.202331.203500.203924.1628390 915.22
2 201571.202696.203095.203932.1628371 864.86
3 202329.203081.203090.203468.203994 743.54
4 201572.202339.203114.203507.1627763 597.34
5 101107.201587.202689.203087.203469 592.97
6 201152.201954.202711.203506.1626167 555.01
7 200768.201586.201980.202695.1627783 542.16
8 201143.202681.202694.203935.1628369 504.30
9 202357.202697.1626161.1627741.1628368 499.81
10 201937.202324.203497.204060.1628378 499.60
Instead of the 5 sets of numbers separated by periods, I want 5 separate columns, plus the time column at the end.
We can use read.table with sep="." to read that as five columns
cbind(read.table(text = df1$inx, header = FALSE, sep="."), df1['time'])
data
df1 <- structure(list(inx = c("201566.202331.203500.203924.1628390",
"201571.202696.203095.203932.1628371", "202329.203081.203090.203468.203994",
"201572.202339.203114.203507.1627763", "101107.201587.202689.203087.203469",
"201152.201954.202711.203506.1626167", "200768.201586.201980.202695.1627783",
"201143.202681.202694.203935.1628369", "202357.202697.1626161.1627741.1628368",
"201937.202324.203497.204060.1628378"), time = c(915.22, 864.86,
743.54, 597.34, 592.97, 555.01, 542.16, 504.3, 499.81, 499.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
A data.table option with tstrsplit
> library(data.table)
> setDT(df1)[, c(Map(as.numeric, tstrsplit(inx, "\\.")), time = .(time))]
V1 V2 V3 V4 V5 time
1: 201566 202331 203500 203924 1628390 915.22
2: 201571 202696 203095 203932 1628371 864.86
3: 202329 203081 203090 203468 203994 743.54
4: 201572 202339 203114 203507 1627763 597.34
5: 101107 201587 202689 203087 203469 592.97
6: 201152 201954 202711 203506 1626167 555.01
7: 200768 201586 201980 202695 1627783 542.16
8: 201143 202681 202694 203935 1628369 504.30
9: 202357 202697 1626161 1627741 1628368 499.81
10: 201937 202324 203497 204060 1628378 499.60
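If neither read.table nor data.table is wanted, the split can also be sketched with strsplit from base R. The names parts, mat and out are illustrative, and only the first three rows of the question's data are used.

```r
df1 <- data.frame(
  inx = c("201566.202331.203500.203924.1628390",
          "201571.202696.203095.203932.1628371",
          "202329.203081.203090.203468.203994"),
  time = c(915.22, 864.86, 743.54),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
parts <- strsplit(df1$inx, ".", fixed = TRUE)

# Each element has five pieces; stack them into a numeric matrix, one row per input row
mat <- do.call(rbind, lapply(parts, as.numeric))
colnames(mat) <- paste0("V", seq_len(ncol(mat)))

out <- cbind(as.data.frame(mat), time = df1$time)
print(out)
```

This assumes every inx value contains the same number of pieces; rows with a different count would need to be padded before rbind.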

Replicate and append to dataframe in R

I believe this is fairly simple, although I am new to using R and code. I have a dataset which has a single row for each rodent trap site. There were, however, 8 trapping occasions over 4 years. What I wish to do is to expand the trap site data and append a number from 1 to 8 to each row.
I can then label the rows with the trap visit for a subsequent join with the obtained trap data.
I have managed to replicate the rows with the code below. However, while the rows are expanded in the data frame with row names 1, 1.1...1.7, 2, 2.1...2.7 etc., I cannot figure out how to convert this into a usable column-based ID. The original data, gps_1:
gps_1 <- structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a column with Trap_ID currently being a unique identifier. I hope that after the replication I could append an iteration number to this to keep it as a unique ID.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3
Simply use a cross join (i.e., a join with no by columns, which returns the Cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by=NULL)
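Applied to the question's data, the cross join might look like the sketch below. It uses the first two rows of gps_1, and the visits data frame, the visit column and the expanded name are illustrative. The Trap_ID format follows the "IA1sA.1" pattern the question asks for.

```r
gps_1 <- data.frame(
  TrapCode = c("IA1sA", "IA2sA"),
  Y = c(-12.1355987315, -12.1356879776),
  X = c(-69.1335789865, -69.1335225279),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return the Cartesian product: every site x every visit
visits <- data.frame(visit = 1:8)
expanded <- merge(gps_1, visits, by = NULL)

# Order by site then visit, and build a unique ID such as "IA1sA.1"
expanded <- expanded[order(expanded$TrapCode, expanded$visit), ]
expanded$Trap_ID <- paste(expanded$TrapCode, expanded$visit, sep = ".")
rownames(expanded) <- NULL
print(head(expanded, 3))
```

Keeping visit as its own column as well as inside Trap_ID leaves either one available as a key for the later join with the trap data.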

Determine whether each level is monotonically increasing

I need to know whether the values within each level of a factor are increasing. I've seen How to check if a sequence of numbers is monotonically increasing (or decreasing)? but don't know how to apply it to each level separately.
Let's say there is a data frame df which is divided into persons, and each person has a height recorded over several years. I want to check that the data set is correct, so I need to know whether height is increasing for each person:
I tried
Results<- by(df, df$person,
function(x) {data = x,
all(x == cummax(height))
}
)
but it does not work. And
Results<- by(df, df$person,
all(height == cummax(height))
}
)
does not work either. I get an error saying that height cannot be found.
What am I doing wrong here?
A small data extraction:
Serial_number Amplification Voltage
1 608004648 111.997 379.980
2 608004648 123.673 381.968
3 608004648 137.701 383.979
4 608004648 154.514 385.973
5 608004648 175.331 387.980
6 608004648 201.379 389.968
7 608004649 118.753 378.080
8 608004649 131.739 380.085
9 608004649 147.294 382.082
10 608004649 166.238 384.077
11 608004649 189.841 386.074
12 608004649 220.072 388.073
13 608004650 115.474 382.066
14 608004650 127.838 384.063
15 608004650 142.602 386.064
16 608004650 160.452 388.056
17 608004650 182.732 390.060
18 608004650 211.035 392.065
Serial_number is the factor, and for each serial number I want to check whether the corresponding Amplification values are increasing.
We can do this with a group by operation
library(dplyr)
df %>%
group_by(Serial_number) %>%
summarise(index = all(sign(Amplification -
lag(Amplification, default = first(Amplification))) >= 0))
Or with by from base R. As we are passing the complete dataset, the x (anonymous function call object) is the dataset, from which we can extract the column of interest with $ or [[
by(df, list(df$Serial_number), FUN = function(x) all(sign(diff(x$Amplification))>=0))
Or using data.table
library(data.table)
setDT(df)[, .(index = all(sign(Amplification - shift(Amplification,
fill = first(Amplification))) >=0)), .(Serial_number)]
data
df <- structure(list(Serial_number = c(608004648L, 608004648L, 608004648L,
608004648L, 608004648L, 608004648L, 608004649L, 608004649L, 608004649L,
608004649L, 608004649L, 608004649L, 608004650L, 608004650L, 608004650L,
608004650L, 608004650L, 608004650L), Amplification = c(111.997,
123.673, 137.701, 154.514, 175.331, 201.379, 118.753, 131.739,
147.294, 166.238, 189.841, 220.072, 115.474, 127.838, 142.602,
160.452, 182.732, 211.035), Voltage = c(379.98, 381.968, 383.979,
385.973, 387.98, 389.968, 378.08, 380.085, 382.082, 384.077,
386.074, 388.073, 382.066, 384.063, 386.064, 388.056, 390.06,
392.065)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18"))
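A compact base R variant of the same per-group check can be sketched with tapply, which passes only the Amplification values of each serial number to the function. The data below is a reduced copy of the question's first two serial numbers, and the value 100 written into row 5 is a hypothetical out-of-order reading added purely to show a failing group.

```r
df <- data.frame(
  Serial_number = rep(c(608004648L, 608004649L), each = 3),
  Amplification = c(111.997, 123.673, 137.701, 118.753, 131.739, 147.294)
)

# diff() gives successive differences within each group; all >= 0 means non-decreasing
res <- tapply(df$Amplification, df$Serial_number,
              function(v) all(diff(v) >= 0))
print(res)

# Hypothetical out-of-order value: the second group should now fail the check
df$Amplification[5] <- 100
res_bad <- tapply(df$Amplification, df$Serial_number,
                  function(v) all(diff(v) >= 0))
print(res_bad)
```

Replace >= 0 with > 0 to require strictly increasing values.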
What about something like
vapply(unique(df$person),
function (k) all(diff(df$height[df$person == k]) >= 0), # or '> 0' if strictly mon. incr.
logical(1))
# returns
[1] TRUE FALSE FALSE
with
set.seed(123)
df <- data.frame(person = c("A","B", "C","A","A","C","B"), height = runif(7, 1.75, 1.85))
df
person height
1 A 1.778758
2 B 1.828831
3 C 1.790898
4 A 1.838302
5 A 1.844047
6 C 1.754556
7 B 1.802811
