How can I calculate the max for these data in R

Here is part of my data:
dat <- read.table(text = "
Flower A1 A2 A3 TM MN B1 B2 B3
F1 12 9 11 12 0.56 19 1 12
F2 11 16 13 13 0.65 22 4 12
F3 10 12 14 11 0.44 29 9 12
", header=TRUE)
I want to calculate the max for column MN. For example, for the value 0.44, the max is max(0.44, 1 - 0.44) = 0.56.
I struggle to do this with a data frame.
Here is the outcome of interest:
Flower A TM B MN Max
F1 12 12 19 0.56 0.56
F2 11 13 22 0.65 0.65
F3 10 11 29 0.44 0.56
F1 9 12 1 0.56 0.56
F2 16 13 4 0.65 0.65
F3 12 11 9 0.44 0.56
F1 11 12 12 0.56 0.56
F2 13 13 12 0.65 0.65
F3 14 11 12 0.44 0.56
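In R terms, the desired column is the element-wise maximum pmax(MN, 1 - MN); a quick illustration on the three MN values:
pmax(c(0.56, 0.65, 0.44), 1 - c(0.56, 0.65, 0.44))
#> [1] 0.56 0.65 0.56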

Try the code below:
transform(
  reshape(
    setNames(dat, gsub("(\\d+)", ".\\1", names(dat))),
    direction = "long",
    idvar = c("Flower", "TM", "MN"),
    varying = -c(1, 5, 6)
  ),
  Max = pmax(MN, 1 - MN)
)
which gives
Flower TM MN time A B Max
F1.12.0.56.1 F1 12 0.56 1 12 19 0.56
F2.13.0.65.1 F2 13 0.65 1 11 22 0.65
F3.11.0.44.1 F3 11 0.44 1 10 29 0.56
F1.12.0.56.2 F1 12 0.56 2 9 1 0.56
F2.13.0.65.2 F2 13 0.65 2 16 4 0.65
F3.11.0.44.2 F3 11 0.44 2 12 9 0.56
F1.12.0.56.3 F1 12 0.56 3 11 12 0.56
F2.13.0.65.3 F2 13 0.65 3 13 12 0.65
F3.11.0.44.3 F3 11 0.44 3 14 12 0.56
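The setNames/gsub step is what lets reshape() detect the stacked columns automatically: it inserts a dot before each digit so the names follow the var.time convention reshape() expects:
gsub("(\\d+)", ".\\1", names(dat))
#> [1] "Flower" "A.1" "A.2" "A.3" "TM" "MN" "B.1" "B.2" "B.3"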

Using reshape and ave (here ave applies max(x, 1 - x) within each Flower group, matching the requested definition):
reshape(dat, varying = list(2:4, 7:9), direction = 'long', idvar = 'Flower') |>
  transform(Max = ave(MN, Flower, FUN = function(x) max(x, 1 - x)))
# Flower TM MN time A1 B1 Max
# F1.1 F1 12 0.56 1 12 19 0.56
# F2.1 F2 13 0.65 1 11 22 0.65
# F3.1 F3 11 0.44 1 10 29 0.56
# F1.2 F1 12 0.56 2 9 1 0.56
# F2.2 F2 13 0.65 2 16 4 0.65
# F3.2 F3 11 0.44 2 12 9 0.56
# F1.3 F1 12 0.56 3 11 12 0.56
# F2.3 F2 13 0.65 3 13 12 0.65
# F3.3 F3 11 0.44 3 14 12 0.56
Note: the base pipe |> requires R >= 4.1.
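For comparison, a tidyverse sketch of the same reshape (assuming tidyr >= 1.0 for pivot_longer):
library(dplyr)
library(tidyr)
dat |>
  pivot_longer(matches("^[AB]\\d$"),
               names_to = c(".value", "rep"),
               names_pattern = "([AB])(\\d)") |>
  mutate(Max = pmax(MN, 1 - MN))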

Clean a dataset that includes some characters in each row

I have a dataset like this:
structure(list(`Frequency
Percent` = c("car", "window", "ball",
"ups"), AI = c("2\n0.00", "3\n0.00", "1\n0.00", "2\n0.00"), BLK = c("0\n0.00",
"218\n0.29", "48\n0.06", "0\n0.00"), HIANIC = c("1\n0.00", "8\n0.01",
"4\n0.01", "0\n0.00"), NATRICAN = c("9\n0.01", "7\n0.01", "8\n0.01",
"0\n0.00"), UNK = c("15\n0.02", "83\n0.11", "36\n0.05", "0\n0.00"
), yy = c("111\n0.15", "897\n1.20", "756\n1.02", "1\n0.00")), class = "data.frame", row.names = c(NA,
-4L))
How can I split each cell by "\n" to make two new columns? For instance, from the car row's AI cell ("2\n0.00"), I will have 2 and 0.00 in two different columns.
One way is to use tidyr::separate in a for loop:
# df holds the question's data
for(i in names(df[, -1])){
  df <- tidyr::separate(df, i, sep = "\n", into = c(i, paste0(i, "_val")))
}
Output:
# Frequency\n Percent AI AI_val BLK BLK_val HIANIC HIANIC_val NATRICAN NATRICAN_val UNK UNK_val yy yy_val
# 1 car 2 0.00 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
# 2 window 3 0.00 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
# 3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
# 4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00
Using tidyr::separate_rows and tidyr::pivot_wider you could do:
library(tidyr)
library(dplyr)
dat |>
  mutate(unit = c("n\npct")) |>
  separate_rows(-1, sep = "\n") |>
  pivot_wider(names_from = "unit", values_from = -1)
#> # A tibble: 4 × 15
#> Frequency\n…¹ AI_n AI_pct BLK_n BLK_pct HIANI…² HIANI…³ NATRI…⁴ NATRI…⁵ UNK_n
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 car 2 0.00 0 0.00 1 0.00 9 0.01 15
#> 2 window 3 0.00 218 0.29 8 0.01 7 0.01 83
#> 3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36
#> 4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0
#> # … with 5 more variables: UNK_pct <chr>, yy_n <chr>, yy_pct <chr>,
#> # unit_n <chr>, unit_pct <chr>, and abbreviated variable names
#> # ¹​`Frequency\n Percent`, ²​HIANIC_n, ³​HIANIC_pct, ⁴​NATRICAN_n,
#> # ⁵​NATRICAN_pct
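The helper unit column is widened along with the data columns (hence the unit_n and unit_pct columns above); a small addition to the same pipeline drops it again:
dat |>
  mutate(unit = c("n\npct")) |>
  separate_rows(-1, sep = "\n") |>
  pivot_wider(names_from = "unit", values_from = -1) |>
  select(-unit_n, -unit_pct)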
A base R one-liner (the \(x) lambda shorthand needs R >= 4.1; DF is the question's data):
do.call(data.frame, lapply(DF, \(x) do.call(rbind, strsplit(x, "\n"))))
# Frequency.Percent AI.1 AI.2 BLK.1 BLK.2 HIANIC.1 HIANIC.2 NATRICAN.1
#1 car 2 0.00 0 0.00 1 0.00 9
#2 window 3 0.00 218 0.29 8 0.01 7
#3 ball 1 0.00 48 0.06 4 0.01 8
#4 ups 2 0.00 0 0.00 0 0.00 0
# NATRICAN.2 UNK.1 UNK.2 yy.1 yy.2
#1 0.01 15 0.02 111 0.15
#2 0.01 83 0.11 897 1.20
#3 0.01 36 0.05 756 1.02
#4 0.00 0 0.00 1 0.00
Or additionally apply a type conversion, so the split columns become numeric (as.is = TRUE keeps character columns as characters rather than factors):
type.convert(do.call(data.frame, lapply(DF, \(x) do.call(rbind, strsplit(x, "\n")))), as.is=TRUE)
There is also a base R solution:
dat = structure(list(`Frequency
Percent` = c("car", "window", "ball",
"ups"), AI = c("2\n0.00", "3\n0.00", "1\n0.00", "2\n0.00"), BLK = c("0\n0.00",
"218\n0.29", "48\n0.06", "0\n0.00"), HIANIC = c("1\n0.00", "8\n0.01",
"4\n0.01", "0\n0.00"), NATRICAN = c("9\n0.01", "7\n0.01", "8\n0.01",
"0\n0.00"), UNK = c("15\n0.02", "83\n0.11", "36\n0.05", "0\n0.00"
), yy = c("111\n0.15", "897\n1.20", "756\n1.02", "1\n0.00")), class = "data.frame", row.names = c(NA,
-4L))
transformed = data.frame(Freq_pc = dat[, 1])
for(col in seq(2, ncol(dat))){
  transformed = cbind(transformed, t(matrix(unlist(strsplit(dat[, col], "\n")), nrow = 2)))
  names(transformed)[c(2*(col-1), 2*(col-1)+1)] = c(paste0(names(dat)[col], "_n"),
                                                    paste0(names(dat)[col], "_pc"))
}
That results in:
Freq_pc AI_n AI_pc BLK_n BLK_pc HIANIC_n HIANIC_pc NATRICAN_n NATRICAN_pc UNK_n UNK_pc yy_n yy_pc
1 car 2 0.00 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
2 window 3 0.00 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00
We may use cSplit from splitstackshape (df1 is the question's data):
library(splitstackshape)
cSplit(df1, 2:ncol(df1), sep = "\n")
Output:
Frequency\nPercent AI_1 AI_2 BLK_1 BLK_2 HIANIC_1 HIANIC_2 NATRICAN_1 NATRICAN_2 UNK_1 UNK_2 yy_1 yy_2
1: car 2 0 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
2: window 3 0 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
3: ball 1 0 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
4: ups 2 0 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00

Create matrix from dataset in R

I want to create a matrix from my data. My data consists of two columns: the date, and my observation for each date. I want the matrix to have years as rows and days as columns, e.g.:
17 18 19 20 ... 31
1904 x11 x12 ...
1905
1906
.
.
.
2019
The days in this case are days of December each year. I would like missing values to equal NA.
Here's a sample of my data:
> head(cdata)
# A tibble: 6 x 2
Datum Snödjup
<dttm> <dbl>
1 1904-12-01 00:00:00 0.02
2 1904-12-02 00:00:00 0.02
3 1904-12-03 00:00:00 0.01
4 1904-12-04 00:00:00 0.01
5 1904-12-12 00:00:00 0.02
6 1904-12-13 00:00:00 0.02
I figured that the first thing I need to do is split the date into year, month and day (ISO format, YYYY-MM-DD), so I did that, got rid of the date column (the one that says Datum), and also dropped the irrelevant days, namely the ones < 17.
cd <- cdata %>%
  dplyr::mutate(year = lubridate::year(Datum),
                month = lubridate::month(Datum),
                day = lubridate::day(Datum)) %>%
  dplyr::select(-Datum)
cu <- cd[which(cd$day > 16
               & cd$day < 32
               & cd$month == 12), ]
and now it looks like this:
> cu
# A tibble: 1,284 x 4
Snödjup year month day
<dbl> <dbl> <dbl> <int>
1 0.01 1904 12 26
2 0.01 1904 12 27
3 0.01 1904 12 28
4 0.12 1904 12 29
5 0.12 1904 12 30
6 0.15 1904 12 31
7 0.07 1906 12 17
8 0.05 1906 12 18
9 0.05 1906 12 19
10 0.04 1906 12 20
# … with 1,274 more rows
Now I need to fit my data into a matrix with missing values as NA. Is there any way to do this?
Base R approach, using by: split dat by year (the first four characters of the date), take each year's value column as a row, and label the day columns 1 to 31.
r <- `colnames<-`(do.call(rbind, by(dat, substr(dat$date, 1, 4), function(x) x[[2]])), 1:31)
r[, 17:31]
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 1904 -0.28 -2.66 -2.44 1.32 -0.31 -1.78 -0.17 1.21 1.90 -0.43 -0.26 -1.76 0.46 -0.64 0.46
# 1905 1.44 -0.43 0.66 0.32 -0.78 1.58 0.64 0.09 0.28 0.68 0.09 -2.99 0.28 -0.37 0.19
# 1906 -0.89 -1.10 1.51 0.26 0.09 -0.12 -1.19 0.61 -0.22 -0.18 0.93 0.82 1.39 -0.48 0.65
Toy data
set.seed(42)
dat <- do.call(rbind, lapply(1904:1906, function(x)
  data.frame(date = seq(ISOdate(x, 12, 1, 0), ISOdate(x, 12, 31, 0), "day"),
             value = round(rnorm(31), 2))))
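As an aside, base tapply builds the year-by-day matrix directly, with NA wherever a (year, day) pair is absent; a sketch using the toy data above:
year <- format(dat$date, "%Y")
day <- as.integer(format(dat$date, "%d"))
m <- tapply(dat$value, list(year, day), mean)  # one value per cell, so mean() just returns it
m[, 17:31]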
You can try:
library(dplyr)
library(tidyr)
cdata %>%
  mutate(year = lubridate::year(Datum),
         day = lubridate::day(Datum)) %>%
  filter(day >= 17) %>%
  complete(day = 17:31) %>%
  select(year, day, Snödjup) %>%
  pivot_wider(names_from = day, values_from = Snödjup)
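This returns a tibble; if a true matrix is required, the year column can be moved into the row names (a sketch, assuming the pivot_wider result above is saved as wide):
m <- as.matrix(wide[, -1])  # drop the year column
rownames(m) <- wide$year    # label rows by year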

Why is the same category giving different frequencies in R

Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq),]
#output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it gives different frequencies for the same level, whereas if I print the levels of that feature I get the following output:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: this code was used to derive the first set of output:
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
low_list = Process_Table %>%
  filter(Percent < 5.50) %>%
  select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table should undergo another step of aggregation. Add the following to your final step of data aggregation:
Process_Table <- Process_Table %>%
  group_by(Process) %>%
  summarize(Freq = sum(Freq), Percent = sum(Percent))
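A base R equivalent of that aggregation step (a sketch):
aggregate(cbind(Freq, Percent) ~ Process, data = Process_Table, FUN = sum)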

Finding the row number of consecutive decreasing values in data

I need to detect in my data the first element of the first sequence of five consecutively decreasing numbers. There is a similar post here, but when I applied it to my data it failed.
set.seed(201)
az <- c(sort(runif(10, 0, 0.9), decreasing = TRUE),
        sort(runif(3, -0.3, 0), decreasing = TRUE),
        sort(runif(3, -0.3, 0), decreasing = FALSE),
        sort(runif(4, -0.3, 0), decreasing = TRUE),
        sort(runif(4, -0.3, 0), decreasing = FALSE),
        sort(runif(6, -0.3, 0), decreasing = TRUE))
tz <- seq(1,length(az))
df <- data.frame(tz,az=round(az,2))
In a plot of the data, the start of that sequence would be somewhere around tz = 25.
The linked post says that this function needs improvement, and so far I cannot get my desired result:
getFirstBefore <- function(x, len){
  r <- rle(sign(diff(x)))
  n <- which(r$lengths >= len & r$values < 0)
  if(length(n) == 0)
    return(-1)
  1 + sum(r$lengths[seq_len(n[1] - 1)])
}
library(dplyr)
df1 <- df %>%
  mutate(cns_tz = getFirstBefore(az, 5))
# tz az cns_tz
#1 1 0.56 4
#2 2 0.55 4
#3 3 0.33 4
#4 4 0.33 4
#5 5 0.26 4
#6 6 0.15 4
#7 7 0.12 4
#8 8 0.09 4
#9 9 0.04 4
#10 10 0.04 4
#11 11 -0.10 4
#12 12 -0.12 4
#13 13 -0.16 4
#14 14 -0.16 4
#15 15 -0.14 4
#16 16 -0.14 4
#17 17 -0.13 4
#18 18 -0.15 4
#19 19 -0.22 4
#20 20 -0.30 4
#21 21 -0.12 4
#22 22 -0.12 4
#23 23 -0.11 4
#24 24 -0.07 4
#25 25 -0.05 4
#26 26 -0.09 4
#27 27 -0.10 4
#28 28 -0.15 4
#29 29 -0.17 4
#30 30 -0.22 4
I would sort every 5 consecutive values and see if the result matches the unsorted data, then find the first occurrence of such a match:
set.seed(123)
test <- rnorm(100)
decr <- sapply(seq_along(test), function(x){
  all(sort(test[x:(x + 5)], decreasing = TRUE) == test[x:(x + 5)])
})
firstdecr <- min(which(decr)):(min(which(decr))+5)
plot(test)
lines(firstdecr, test[firstdecr], col="red")
The only flaw I can see is ties (equal values) within a window, but you could also test for that.
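To also rule out ties, the sort comparison can be replaced by a strict check on the differences (a sketch on the same test vector):
decr_strict <- sapply(seq_along(test), function(x){
  w <- test[x:(x + 5)]
  !anyNA(w) && all(diff(w) < 0)  # strictly decreasing across the window
})
min(which(decr_strict))  # start of the first strictly decreasing window, if any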
We can use rleid from data.table
library(data.table)
n <- 5
v1 <- setDT(df)[sign(az) < 0, .I[which(.N == n)], rleid(c(1, sign(diff(az))))]$V1[1L]
v1
#[1] 26
df[, cnz_tz := v1]
Another option is shift with Reduce:
setDT(df)[, cnz_tz := .I[Reduce(`&`, shift((az - shift(az, fill=az[1])) < 0,
0:4, type = "lead", fill=FALSE)) & sign(az) < 0][1]]
We can also use rleid within dplyr (rleid comes from data.table, loaded above):
library(dplyr)
v1 <- df %>%
  group_by(rl = rleid(c(1, sign(diff(az))))) %>%
  mutate(rn = sign(az) < 0 & n() == 5) %>%
  .$rn %>%
  which() %>%
  head(., 1)
v1
#[1] 26
df %>%
  mutate(cnz_tz = v1)
My naive pure-dplyr approach would be to compute a rolling sum of signs for differences and identify rows where the next five diffs have a negative sign. I say "naive" because this solution does not use rle for detecting streaks.
library(dplyr)
diff_details <- df %>%
  mutate(diff = c(0, diff(az)),
         diff_sign = sign(diff),
         rolling_signs = cumsum(diff_sign),
         next_five = lead(rolling_signs, 5) - rolling_signs)
diff_details
#> tz az diff diff_sign rolling_signs next_five
#> 1 1 0.56 0.00 0 0 -4
#> 2 2 0.55 -0.01 -1 -1 -4
#> 3 3 0.33 -0.22 -1 -2 -4
#> 4 4 0.33 0.00 0 -2 -5
#> 5 5 0.26 -0.07 -1 -3 -4
#> 6 6 0.15 -0.11 -1 -4 -4
#> 7 7 0.12 -0.03 -1 -5 -4
#> 8 8 0.09 -0.03 -1 -6 -4
#> 9 9 0.04 -0.05 -1 -7 -3
#> 10 10 0.04 0.00 0 -7 -2
#> 11 11 -0.10 -0.14 -1 -8 -1
#> 12 12 -0.12 -0.02 -1 -9 1
#> 13 13 -0.16 -0.04 -1 -10 1
#> 14 14 -0.16 0.00 0 -10 0
#> 15 15 -0.14 0.02 1 -9 -2
#> 16 16 -0.14 0.00 0 -9 -1
#> 17 17 -0.13 0.01 1 -8 -2
#> 18 18 -0.15 -0.02 -1 -9 0
#> 19 19 -0.22 -0.07 -1 -10 2
#> 20 20 -0.30 -0.08 -1 -11 4
#> 21 21 -0.12 0.18 1 -10 2
#> 22 22 -0.12 0.00 0 -10 1
#> 23 23 -0.11 0.01 1 -9 -1
#> 24 24 -0.07 0.04 1 -8 -3
#> 25 25 -0.05 0.02 1 -7 -5
#> 26 26 -0.09 -0.04 -1 -8 NA
#> 27 27 -0.10 -0.01 -1 -9 NA
#> 28 28 -0.15 -0.05 -1 -10 NA
#> 29 29 -0.17 -0.02 -1 -11 NA
#> 30 30 -0.22 -0.05 -1 -12 NA
Instead of identifying streaks in a sequence, we look at a cumulative sum of the signs of the differences in rolling_signs. next_five computes the difference in rolling_signs over the next five rows. When next_five is -5, then the next five rows have decreasing changes.
(diff_details$next_five %in% -5) %>% which %>% max  # rows 4 and 25 both qualify; max() keeps the last
#> [1] 25
Each of the steps/columns could be abstracted into a function, like:
cum_diff_signs <- function(xs, window) {
  rolling_signs <- cumsum(sign(c(0, diff(xs))))
  next_diffs <- dplyr::lead(rolling_signs, window) - rolling_signs
  next_diffs
}
cum_diff_signs(df$az, 5)
#> [1] -4 -4 -4 -5 -4 -4 -4 -4 -3 -2 -1 1 1 0 -2 -1 -2 0 2 4 2 1 -1
#> [24] -3 -5 NA NA NA NA NA
(cum_diff_signs(df$az, 5) %in% -5) %>% which %>% max
#> [1] 25

Function to return a suitably lagged and iterated divided value in R

I have time series data, and I want a function that returns suitably lagged and iterated divided values.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
new values: ID 23 = value(24)/value(23), ID 22 = value(23)/value(22), ID 21 = value(22)/value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NaN
To divide each element of a vector x into its successor, i.e. compute value(i+1)/value(i), use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
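In a data-frame setting, the same computation can be written with dplyr::lead (a sketch; the names df and value are hypothetical here, and note the final entry comes out as NA rather than NaN):
library(dplyr)
df %>% mutate(new_value = lead(value) / value)  # last row has no successor, so it becomes NA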
