I need to know whether the values within each level of a factor are increasing. I've seen "How to check if a sequence of numbers is monotonically increasing (or decreasing)?" but don't know how to apply it to each level separately.
Let's say there is a data frame df with one block of rows per person. Each person has a height recorded over several years. To check that the data set is correct, I need to know whether the height values are increasing for each person:
I tried
Results<- by(df, df$person,
function(x) {data = x,
all(x == cummax(height))
}
)
but it does not work. And
Results<- by(df, df$person,
all(height == cummax(height))
}
)
does not work either. I get an error that height cannot be found.
What am I doing wrong here?
A small data extraction:
Serial_number Amplification Voltage
1 608004648 111.997 379.980
2 608004648 123.673 381.968
3 608004648 137.701 383.979
4 608004648 154.514 385.973
5 608004648 175.331 387.980
6 608004648 201.379 389.968
7 608004649 118.753 378.080
8 608004649 131.739 380.085
9 608004649 147.294 382.082
10 608004649 166.238 384.077
11 608004649 189.841 386.074
12 608004649 220.072 388.073
13 608004650 115.474 382.066
14 608004650 127.838 384.063
15 608004650 142.602 386.064
16 608004650 160.452 388.056
17 608004650 182.732 390.060
18 608004650 211.035 392.065
Serial_number is the factor, and for each serial number I want to check whether the corresponding Amplification values are increasing.
We can do this with a group by operation
library(dplyr)
df %>%
group_by(Serial_number) %>%
summarise(index = all(sign(Amplification -
lag(Amplification, default = first(Amplification))) >= 0))
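Or with by from base R. Because by passes the rows of each group to the function as a data frame, the anonymous function's argument x is that subset, from which we can extract the column of interest with $ or [[. This is also why the attempts in the question fail: height is not a standalone object, so it has to be referenced through x (e.g. x$height).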
by(df, list(df$Serial_number), FUN = function(x) all(sign(diff(x$Amplification))>=0))
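As a variation on the by call, the same check can be run on the column alone with tapply, which avoids passing whole data frames to the function. This is only a sketch: the helper name is_increasing and the toy data are made up for illustration and are not from the question.

```r
# Toy data (hypothetical): serial 2 has one drop in Amplification
df <- data.frame(
  Serial_number = c(1L, 1L, 1L, 2L, 2L, 2L),
  Amplification = c(111.9, 123.6, 137.7, 118.7, 117.0, 147.2)
)

# Hypothetical helper: TRUE when no value is smaller than its predecessor
is_increasing <- function(v) all(diff(v) >= 0)

# Apply the helper to Amplification within each Serial_number
res <- tapply(df$Amplification, df$Serial_number, is_increasing)
res
```

Replacing >= with > in the helper would test for strictly increasing values instead.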
Or using data.table
library(data.table)
setDT(df)[, .(index = all(sign(Amplification - shift(Amplification,
fill = first(Amplification))) >=0)), .(Serial_number)]
data
df <- structure(list(Serial_number = c(608004648L, 608004648L, 608004648L,
608004648L, 608004648L, 608004648L, 608004649L, 608004649L, 608004649L,
608004649L, 608004649L, 608004649L, 608004650L, 608004650L, 608004650L,
608004650L, 608004650L, 608004650L), Amplification = c(111.997,
123.673, 137.701, 154.514, 175.331, 201.379, 118.753, 131.739,
147.294, 166.238, 189.841, 220.072, 115.474, 127.838, 142.602,
160.452, 182.732, 211.035), Voltage = c(379.98, 381.968, 383.979,
385.973, 387.98, 389.968, 378.08, 380.085, 382.082, 384.077,
386.074, 388.073, 382.066, 384.063, 386.064, 388.056, 390.06,
392.065)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18"))
What about something like
vapply(unique(df$person),
function (k) all(diff(df$height[df$person == k]) >= 0), # or '> 0' if strictly mon. incr.
logical(1))
# returns
[1] TRUE FALSE FALSE
with
set.seed(123)
df <- data.frame(person = c("A","B", "C","A","A","C","B"), height = runif(7, 1.75, 1.85))
df
person height
1 A 1.778758
2 B 1.828831
3 C 1.790898
4 A 1.838302
5 A 1.844047
6 C 1.754556
7 B 1.802811
I have a data set with many observations for X individuals, with several rows for some individuals. Each row has been assigned a classification (the variable clinical_significance) that takes three values, in prioritized order: definite disease, possible, colonization. Now I would like to keep only one row per individual, carrying the "highest classification" across that individual's rows, i.e. definite disease if present, otherwise possible, otherwise colonization. Any good suggestions on how to do this?
For instance, as seen in the example, I would like clinical_significance for all rows of ID #23 to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical significance into a factor, which is stored as an integer instead of character (i.e. 1 = definite, 2 = possible, 3 = colonization). Then, for each ID, take the row with lowest number.
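library(dplyr)

df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  # with_ties = FALSE keeps exactly one row per id even when several rows share the top rank
  slice_min(fct_clin_sig, with_ties = FALSE)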
I fixed it using
df <- df %>%
group_by(id) %>%
mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"), "definite disease", as.character(clinical_significance)))
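One caveat with the ifelse call: ifelse returns a result as long as its test, so with the length-one any(...) only the first value of each group is kept whenever no row is "definite disease". A base R sketch that ranks all three levels might look like the following; the names priority, code and best are illustrative assumptions, and the data are the reproducible example from above.

```r
# Reproducible example from the question (only the columns needed here)
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  clinical_significance = c("possible", "possible", "possible",
                            "possible", "possible", "definite disease")
)

# Priority order: earlier entries outrank later ones
priority <- c("definite disease", "possible", "colonization")

# Position of each row's classification in the priority order (1 = highest)
code <- match(df$clinical_significance, priority)

# Smallest (i.e. highest-priority) position within each id, repeated per row
best <- ave(code, df$id, FUN = min)

df$clinical_significance_new <- priority[best]
df
```

Keeping one row per individual afterwards could be done with df[!duplicated(df$id), ].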
I have a data frame that looks like
inx time
1 201566.202331.203500.203924.1628390 915.22
2 201571.202696.203095.203932.1628371 864.86
3 202329.203081.203090.203468.203994 743.54
4 201572.202339.203114.203507.1627763 597.34
5 101107.201587.202689.203087.203469 592.97
6 201152.201954.202711.203506.1626167 555.01
7 200768.201586.201980.202695.1627783 542.16
8 201143.202681.202694.203935.1628369 504.30
9 202357.202697.1626161.1627741.1628368 499.81
10 201937.202324.203497.204060.1628378 499.60
Instead of the 5 sets of numbers separated by periods, I want 5 separate columns, plus the time column at the end.
We can use read.table with sep="." to read that as five columns
cbind(read.table(text = df1$inx, header = FALSE, sep="."), df1['time'])
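For comparison, here is a minimal base R sketch of the same split done with strsplit instead of read.table. It assumes every inx value has exactly five period-separated parts; the object names parts and out are illustrative, and only the first two rows of the data are used.

```r
# Two rows taken from the question's data
df1 <- data.frame(
  inx  = c("201566.202331.203500.203924.1628390",
           "201571.202696.203095.203932.1628371"),
  time = c(915.22, 864.86)
)

# fixed = TRUE treats "." as a literal period rather than a regex wildcard
parts <- do.call(rbind, strsplit(df1$inx, ".", fixed = TRUE))

# Convert the character matrix to numeric columns and re-attach time
out <- data.frame(apply(parts, 2, as.numeric), time = df1$time)
out
```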
data
df1 <- structure(list(inx = c("201566.202331.203500.203924.1628390",
"201571.202696.203095.203932.1628371", "202329.203081.203090.203468.203994",
"201572.202339.203114.203507.1627763", "101107.201587.202689.203087.203469",
"201152.201954.202711.203506.1626167", "200768.201586.201980.202695.1627783",
"201143.202681.202694.203935.1628369", "202357.202697.1626161.1627741.1628368",
"201937.202324.203497.204060.1628378"), time = c(915.22, 864.86,
743.54, 597.34, 592.97, 555.01, 542.16, 504.3, 499.81, 499.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
A data.table option with tstrsplit
> setDT(df1)[, c(Map(as.numeric, tstrsplit(inx, "\\.")), time = .(time))]
V1 V2 V3 V4 V5 time
1: 201566 202331 203500 203924 1628390 915.22
2: 201571 202696 203095 203932 1628371 864.86
3: 202329 203081 203090 203468 203994 743.54
4: 201572 202339 203114 203507 1627763 597.34
5: 101107 201587 202689 203087 203469 592.97
6: 201152 201954 202711 203506 1626167 555.01
7: 200768 201586 201980 202695 1627783 542.16
8: 201143 202681 202694 203935 1628369 504.30
9: 202357 202697 1626161 1627741 1628368 499.81
10: 201937 202324 203497 204060 1628378 499.60
I want to classify my data by minimum distance to a set of known centers.
How can I implement this in R?
The centers:
> centers
X
1 -0.78998176
2 2.40331380
3 0.77320007
4 -1.64054294
5 -0.05343331
6 -1.14982180
7 1.67658736
8 -0.44575567
9 0.36314671
10 1.18697840
The data to be classified:
> Y
[1] -0.7071068 0.7071068 -0.3011463 -0.9128686 -0.5713978 NA
The result I expect:
1. Find the closest center (minimum absolute difference) for each item in Y.
2. Assign the index of that center as the class of each item in Y.
Expected result:
> Y
[1] 1 3 8 1 8 NA
Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
-0.912868615826101, -0.571397763410073, NA)
centers <- structure(c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
-1.64054294268215, -0.0534333085941505, -1.14982180092619, 1.67658736336158,
-0.445755672120908, 0.363146708827924, 1.18697840480949), .Dim = c(10L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), "X"))
sapply(Y, function(y) {r=which.min(abs(y-centers)); ifelse(is.na(y), NA, r)})
Essentially, you are applying which.min to each element of Y, and determining which center has the smallest absolute distance. Ties go to the earlier element on the list. NA values need to be handled separately, which is why I have a second statement with ifelse there.
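The same idea can be sketched with a full distance matrix, which makes the "smallest absolute distance per element" step explicit. This is an illustrative variant rather than the code above; the names d and nearest are assumptions, and centers is written as a plain vector here.

```r
Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
       -0.912868615826101, -0.571397763410073, NA)
centers <- c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
             -1.64054294268215, -0.0534333085941505, -1.14982180092619,
             1.67658736336158, -0.445755672120908, 0.363146708827924,
             1.18697840480949)

# Rows correspond to elements of Y, columns to centers
d <- abs(outer(Y, centers, "-"))

# Index of the closest center per row; rows that are all NA stay NA
nearest <- apply(d, 1, function(row) {
  if (anyNA(row)) NA_integer_ else which.min(row)
})
nearest
```

As with the sapply version, which.min returns the first minimum, so ties go to the earlier center.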
This is not clustering but nearest-neighbor classification.
See the knn function (in the class package).
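A minimal sketch of how knn from the class package might be applied here, assuming each center is its own class and k = 1. Since knn does not accept missing values, the NA in Y is set aside first; the names ok, res and pred are illustrative.

```r
library(class)

Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
       -0.912868615826101, -0.571397763410073, NA)
centers <- matrix(c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
                    -1.64054294268215, -0.0534333085941505, -1.14982180092619,
                    1.67658736336158, -0.445755672120908, 0.363146708827924,
                    1.18697840480949), ncol = 1)

ok  <- !is.na(Y)                       # knn cannot handle missing values
res <- rep(NA_integer_, length(Y))

# Each center is a one-row "training" case labelled with its own index
pred <- knn(train = centers,
            test  = matrix(Y[ok], ncol = 1),
            cl    = factor(seq_len(nrow(centers))),
            k     = 1)

res[ok] <- as.integer(as.character(pred))
res
```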
I have the following dataframe which is already a subset of a much larger dataframe:
Time X.N2O._ppm
1 15/05/2015 13:30:07.291 0.03941801
2 15/05/2015 13:30:08.307 0.01014003
3 15/05/2015 13:30:09.323 0.02577801
4 15/05/2015 13:30:10.338 0.02554231
5 15/05/2015 13:30:11.354 0.02489800
6 15/05/2015 13:30:12.370 0.02417584
7 15/05/2015 13:30:13.386 0.02489115
8 15/05/2015 13:30:14.402 0.02524912
9 15/05/2015 13:30:15.417 0.02556182
10 15/05/2015 13:30:16.433 0.02574274
I'm trying to subset based on the Time variable but get the following error:
subtime = subset(datasubcolrow, "Time" < 15/05/2015 13:30:15.417)
Error: unexpected numeric constant in "subtime = subset(datasubcolrow, "Time" < 15/05/2015 13"
Checking data types shows that Time is numeric:
sapply(datasubcolrow, mode)
Time X.N2O._ppm
"numeric" "numeric"
Do I need to convert it into a date format, and how do I go about doing this?
Thanks
Rory
You can change the 'Time' column to 'POSIXct' class and then subset
datasubcolrow$Time <- as.POSIXct(datasubcolrow$Time, format='%d/%m/%Y %H:%M:%OS')
subset(datasubcolrow, Time < as.POSIXct('15/05/2015 13:30:15.417',
format='%d/%m/%Y %H:%M:%OS'))
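A small self-contained sketch of the conversion and comparison, using three rows from the question. The as.character call and tz = "UTC" are assumptions added here: the first guards against Time being stored as a factor (which the "numeric" mode may hint at), the second keeps the result independent of the local time zone. The names fmt, cutoff and early are illustrative.

```r
datasubcolrow <- data.frame(
  Time = c("15/05/2015 13:30:14.402",
           "15/05/2015 13:30:15.417",
           "15/05/2015 13:30:16.433"),
  X.N2O._ppm = c(0.02524912, 0.02556182, 0.02574274)
)

fmt <- "%d/%m/%Y %H:%M:%OS"   # %OS reads seconds including the fractional part

datasubcolrow$Time <- as.POSIXct(as.character(datasubcolrow$Time),
                                 format = fmt, tz = "UTC")
cutoff <- as.POSIXct("15/05/2015 13:30:15.417", format = fmt, tz = "UTC")

# Rows strictly earlier than the cutoff
early <- datasubcolrow[datasubcolrow$Time < cutoff, ]

options(digits.secs = 3)      # print fractional seconds
early
```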
data
datasubcolrow <- structure(list(Time = c("15/05/2015 13:30:07.291",
"15/05/2015 13:30:08.307",
"15/05/2015 13:30:09.323", "15/05/2015 13:30:10.338",
"15/05/2015 13:30:11.354",
"15/05/2015 13:30:12.370", "15/05/2015 13:30:13.386",
"15/05/2015 13:30:14.402",
"15/05/2015 13:30:15.417", "15/05/2015 13:30:16.433"),
X.N2O._ppm = c(0.03941801,
0.01014003, 0.02577801, 0.02554231, 0.024898, 0.02417584, 0.02489115,
0.02524912, 0.02556182, 0.02574274)), .Names = c("Time", "X.N2O._ppm"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
I'm doing a handful of transformation steps for several dfs, so I have ventured into the beautiful world of apply, lapply, sweep, etc. Unfortunately I got stuck trying to use sweep for listed dfs.
What I would like to do is express each value as a percentage of the mean of its data frame's first row.
So I put my dfs into a list, which ends up looking something like this:
df1 <- read.table(header = TRUE, text = "a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620")
df2 <- read.table(header = TRUE, text = "a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690")
list.one <- list(df1,df2)
> list.one
[[1]]
a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620
[[2]]
a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690
Now I calculate the mean of each first row and store it
one.hundred <- lapply(list.one, function(i)
{rowMeans(i[1,], na.rm=T)})
> one.hundred
[[1]]
1
17.93325
[[2]]
1
4.432355
Now I calculate their percentage (as compared to the values stored in the second list) and the best I came up with is this rather tedious workaround:
df1.per<-sweep(list.one[[1]], 1, one.hundred[[1]],
function(x,y){100/y*x})
df2.per<-sweep(list.one[[2]], 1, one.hundred[[2]],
function(x,y){100/y*x})
list.new <- list(df1.per, df2.per)
If somebody could suggest a simpler, preferably list-based solution, that would be a great help.
Thanks a lot.
Here's another approach with sapply and Map that will also return a list of data.frames:
means <- sapply(list.one, function(df) rowMeans(df[1, ], na.rm = TRUE))
Map(function(vec, df) df/vec*100, means, list.one)
#$`1`
# a b
#1 90.69287 109.30713
#2 89.76315 102.84360
#3 96.20353 101.55109
#4 98.22553 98.40748
#5 102.34916 108.88266
#
#$`1`
# a b
#1 101.94702 98.05298
#2 95.46113 94.21299
#3 59.98378 112.79507
#4 75.53589 52.91981
#5 70.01646 57.68243
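The two steps can also be folded into a single lapply, so the means never need to be stored separately. This is a sketch on the first two rows of each data frame; the names base and list.per are illustrative.

```r
list.one <- list(
  data.frame(a = c(16.26418, 16.09745), b = c(19.60232, 18.44320)),
  data.frame(a = c(4.518654, 4.231176), b = c(4.346056, 4.175854))
)

list.per <- lapply(list.one, function(d) {
  base <- mean(unlist(d[1, ]), na.rm = TRUE)  # mean of the first row = 100 %
  d / base * 100                              # every value as a percentage of it
})
list.per
```

By construction, the first row of each result averages 100.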
data:
> dput(list.one)
list(structure(list(a = c(16.26418, 16.09745, 17.25242, 17.61503,
18.35453), b = c(19.60232, 18.4432, 18.21141, 17.64766, 19.5262
)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")), structure(list(a = c(4.518654, 4.231176,
2.658694, 3.348019, 3.103378), b = c(4.346056, 4.175854, 4.999478,
2.345594, 2.55669)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))