Finding Row number of Consecutive decreasing values in data - r

I need to detect, in my data, the first element of the first sequence of 5 consecutively decreasing numbers. There is a similar post here, but when I applied it to my data it failed.
set.seed(201)
az <- c(sort(runif(10,0,0.9),decreasing = T),sort(runif(3,-0.3,0),decreasing = T),sort(runif(3,-0.3,0),decreasing = F),sort(runif(4,-0.3,0),decreasing = T),sort(runif(4,-0.3,0),decreasing = F),sort(runif(6,-0.3,0),decreasing = T))
tz <- seq(1,length(az))
df <- data.frame(tz,az=round(az,2))
In the figure above it would be somewhere around tz = 25.
The post says that this function needs improvement, and so far I cannot get my desired result!
getFirstBefore<-function(x,len){
r<-rle(sign(diff(x)))
n<-which(r$lengths>=len & r$values<0)
if(length(n)==0)
return(-1)
1+sum(r$lengths[seq_len(n[1]-1)])
}
df1 <- df%>%
mutate(cns_tz=getFirstBefore(az,5))
tz az cns_tz
#1 1 0.56 4
#2 2 0.55 4
#3 3 0.33 4
#4 4 0.33 4
#5 5 0.26 4
#6 6 0.15 4
#7 7 0.12 4
#8 8 0.09 4
#9 9 0.04 4
#10 10 0.04 4
#11 11 -0.10 4
#12 12 -0.12 4
#13 13 -0.16 4
#14 14 -0.16 4
#15 15 -0.14 4
#16 16 -0.14 4
#17 17 -0.13 4
#18 18 -0.15 4
#19 19 -0.22 4
#20 20 -0.30 4
#21 21 -0.12 4
#22 22 -0.12 4
#23 23 -0.11 4
#24 24 -0.07 4
#25 25 -0.05 4
#26 26 -0.09 4
#27 27 -0.10 4
#28 28 -0.15 4
#29 29 -0.17 4
#30 30 -0.22 4

I would sort every 5 consecutive values and see if that matches the unsorted data. Then find the first occurrence of such a match:
set.seed(123)
test <- rnorm(100)
decr <- sapply(seq_len(length(test) - 4),function(x){all(sort(test[x:(x+4)],decreasing = T) == test[x:(x+4)])})
firstdecr <- min(which(decr)):(min(which(decr))+4)
plot(test)
lines(firstdecr, test[firstdecr], col="red")
The only flaw I can see is when there are equal values within a 5-value window, but you could also test for that.
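To test for the tie case mentioned above, one option is to check the differences inside each window rather than compare against a sorted copy. This is only a sketch on a made-up vector; x, n, starts and strict are illustrative names, not part of the original answer.

```r
# Toy vector (hypothetical data): a tie at positions 2-3, then a strict run
x <- c(5, 4, 4, 3, 2, 1, 0, 2)
n <- 5  # window length

# A window of n values is strictly decreasing when all n - 1 differences are negative
starts <- seq_len(length(x) - n + 1)
strict <- vapply(starts, function(i) all(diff(x[i:(i + n - 1)]) < 0), logical(1))

first_start <- which(strict)[1]  # NA if no such window exists
first_start
```

The sort-and-compare test would accept the first window (5, 4, 4, 3, 2) because of the tie, while the diff-based test skips it.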

We can use rleid from data.table
library(data.table)
n <- 5
v1 <- setDT(df)[sign(az)<0, .I[which(.N==n)] , rleid(c(1, sign(diff(az))))]$V1[1L]
v1
#[1] 26
df[, cnz_tz := v1]
Or another option is shift with Reduce
setDT(df)[, cnz_tz := .I[Reduce(`&`, shift((az - shift(az, fill=az[1])) < 0,
0:4, type = "lead", fill=FALSE)) & sign(az) < 0][1]]
We can also use rleid in dplyr
library(dplyr)
v1 <- df %>%
group_by(rl= rleid(c(1, sign(diff(az))))) %>%
mutate(rn = sign(az) < 0 & n()==5) %>%
.$rn %>%
which() %>%
head(., 1)
v1
#[1] 26
df %>%
mutate(cnz_tz = v1)
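For readers without data.table, roughly the same idea can be sketched with base rle. The az values below are copied from the data frame printed in the question. The names r, ends, starts and hit are made up for this sketch, and it accepts runs of at least 5 rows rather than exactly 5.

```r
# az values copied from the question's printed data frame
az <- c(0.56, 0.55, 0.33, 0.33, 0.26, 0.15, 0.12, 0.09, 0.04, 0.04,
        -0.10, -0.12, -0.16, -0.16, -0.14, -0.14, -0.13, -0.15, -0.22, -0.30,
        -0.12, -0.12, -0.11, -0.07, -0.05, -0.09, -0.10, -0.15, -0.17, -0.22)
n <- 5

# Run-length encode the sign of each row's change (the first row gets +1 as filler)
r <- rle(c(1, sign(diff(az))))
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep decreasing runs of at least n rows that start on a negative az value
hit <- which(r$values < 0 & r$lengths >= n & az[starts] < 0)[1]
starts[hit]
```

On this data the sketch points at row 26, the same row the rleid versions report.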

My naive pure-dplyr approach would be to compute a rolling sum of signs for differences and identify rows where the next five diffs have a negative sign. I say "naive" because this solution does not use rle for detecting streaks.
library(dplyr)
diff_details <- df %>%
mutate(diff = c(0, diff(az)),
diff_sign = sign(diff),
rolling_signs = cumsum(diff_sign),
next_five = lead(rolling_signs, 5) - rolling_signs)
diff_details
#> tz az diff diff_sign rolling_signs next_five
#> 1 1 0.56 0.00 0 0 -4
#> 2 2 0.55 -0.01 -1 -1 -4
#> 3 3 0.33 -0.22 -1 -2 -4
#> 4 4 0.33 0.00 0 -2 -5
#> 5 5 0.26 -0.07 -1 -3 -4
#> 6 6 0.15 -0.11 -1 -4 -4
#> 7 7 0.12 -0.03 -1 -5 -4
#> 8 8 0.09 -0.03 -1 -6 -4
#> 9 9 0.04 -0.05 -1 -7 -3
#> 10 10 0.04 0.00 0 -7 -2
#> 11 11 -0.10 -0.14 -1 -8 -1
#> 12 12 -0.12 -0.02 -1 -9 1
#> 13 13 -0.16 -0.04 -1 -10 1
#> 14 14 -0.16 0.00 0 -10 0
#> 15 15 -0.14 0.02 1 -9 -2
#> 16 16 -0.14 0.00 0 -9 -1
#> 17 17 -0.13 0.01 1 -8 -2
#> 18 18 -0.15 -0.02 -1 -9 0
#> 19 19 -0.22 -0.07 -1 -10 2
#> 20 20 -0.30 -0.08 -1 -11 4
#> 21 21 -0.12 0.18 1 -10 2
#> 22 22 -0.12 0.00 0 -10 1
#> 23 23 -0.11 0.01 1 -9 -1
#> 24 24 -0.07 0.04 1 -8 -3
#> 25 25 -0.05 0.02 1 -7 -5
#> 26 26 -0.09 -0.04 -1 -8 NA
#> 27 27 -0.10 -0.01 -1 -9 NA
#> 28 28 -0.15 -0.05 -1 -10 NA
#> 29 29 -0.17 -0.02 -1 -11 NA
#> 30 30 -0.22 -0.05 -1 -12 NA
Instead of identifying streaks in a sequence, we look at a cumulative sum of the signs of the differences in rolling_signs. next_five computes the difference in rolling_signs over the next five rows. When next_five is -5, then the next five rows have decreasing changes.
(diff_details$next_five %in% -5) %>% which %>% max
#> [1] 25
Each of the steps/columns could be abstracted into a function, like:
cum_diff_signs <- function(xs, window) {
rolling_signs <- cumsum(sign(c(0, diff(xs))))
next_diffs <- dplyr::lead(rolling_signs, window) - rolling_signs
next_diffs
}
cum_diff_signs(df$az, 5)
#> [1] -4 -4 -4 -5 -4 -4 -4 -4 -3 -2 -1 1 1 0 -2 -1 -2 0 2 4 2 1 -1
#> [24] -3 -5 NA NA NA NA NA
(cum_diff_signs(df$az, 5) %in% -5) %>% which %>% max
#> [1] 25
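The same computation can be sketched without dplyr, which also makes it easy to see every matching row instead of only the last one. The az values are copied from the printed data frame; s, s_lead and hits are illustrative names, and the manual shift is just a stand-in for dplyr::lead.

```r
# az values copied from the question's printed data frame
az <- c(0.56, 0.55, 0.33, 0.33, 0.26, 0.15, 0.12, 0.09, 0.04, 0.04,
        -0.10, -0.12, -0.16, -0.16, -0.14, -0.14, -0.13, -0.15, -0.22, -0.30,
        -0.12, -0.12, -0.11, -0.07, -0.05, -0.09, -0.10, -0.15, -0.17, -0.22)
window <- 5

s <- cumsum(sign(c(0, diff(az))))
# Base R stand-in for dplyr::lead(): shift left by `window`, pad with NA
s_lead <- c(s[-seq_len(window)], rep(NA, window))
next_diffs <- s_lead - s

hits <- which(next_diffs == -window)
hits        # every row followed by five strict decreases
hits[1]     # first such row
max(hits)   # last such row, as used above
```

Rows 4 and 25 both qualify here, so taking max picks the run in the negative range while hits[1] would return the earlier one.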

Related

How to use one dataframe as a key for value limits in another dataframe?

I am trying to figure out a simple way to use one table of values for specific analyte and matrix combinations to evaluate test results in another table for Water Quality testing purposes.
I have created an example that has a "key" table showing maximum Water Quality values for three analytes in two different matrices.
Analytes: As (Arsenic), Cd (Cadmium), Cr (Chromium)
Matrices: Fish (fish tissue), Floc (flocculent)
The key table is produced with this code:
limits= matrix(c(30,33,9.79,
0.5,4.98,0.99,
0.88,111,43.4), nrow=3, ncol=3, byrow=TRUE)
colnames(limits) = c("wet_fish","dry_floc_PEC","dry_floc_TEC")
rownames(limits) = c("As","Cd","Cr")
limits=data.frame(limits)
> limits
wet_fish dry_floc_PEC dry_floc_TEC
As 30.00 33.00 9.79
Cd 0.50 4.98 0.99
Cr 0.88 111.00 43.40
Note that there is only one limit value for the Fish matrix, which uses the analyte concentration by wet weight. The Floc matrix has two limits, the lower TEC (Threshold Effect Concentration) and the higher PEC (Probable Effect Concentration). These two limits use the analyte concentration by dry weight.
The data would normally be imported from an Excel CSV file, but is replicated here using this code:
data = matrix(c("Floc","As","31","1",
"Floc","Cd","4.99","0.1",
"Floc","Cr","112","0.1",
"Fish","As","3","34",
"Fish","Cd","1","4.99",
"Fish","Cr","1","50",
"Floc","As","1","1",
"Floc","Cd","0.04","0.002",
"Floc","Cr","0.08","0.008",
"Fish","As","0.002","0.2",
"Fish","Cd","0.0005","0.05",
"Fish","Cr","0.001","5"), ncol=4, byrow=T)
colnames(data) = c("Matrix","Analyte","ResultDry","ResultWet")
data = data.frame(data)
> data
Matrix Analyte ResultDry ResultWet
1 Floc As 31 1
2 Floc Cd 4.99 0.1
3 Floc Cr 112 0.1
4 Fish As 3 34
5 Fish Cd 1 4.99
6 Fish Cr 1 50
7 Floc As 1 1
8 Floc Cd 0.04 0.002
9 Floc Cr 0.08 0.008
10 Fish As 0.002 0.2
11 Fish Cd 0.0005 0.05
12 Fish Cr 0.001 5
Joining the tables so that the analytes and matrices match up across both tables would result in new columns in the data table for the limit values and whether or not the data value in each row exceeds it.
That would result in a final table that looks something like this:
> data
Matrix Analyte ResultDry ResultWet LimitWet TECDry PECDry Exceed
1 Floc As 31 1 NA 9.79 33 TEC
2 Floc Cd 4.99 0.1 NA 0.99 4.98 PEC
3 Floc Cr 112 0.1 NA 43.4 111 PEC
4 Fish As 3 34 30 NA NA Fish
5 Fish Cd 1 4.99 0.5 NA NA Fish
6 Fish Cr 1 50 0.88 NA NA Fish
7 Floc As 1 1 NA 9.79 33 None
8 Floc Cd 0.04 0.002 NA 0.99 4.98 None
9 Floc Cr 0.08 0.008 NA 43.4 111 None
10 Fish As 0.002 0.2 30 NA NA None
11 Fish Cd 0.0005 0.05 0.5 NA NA None
12 Fish Cr 0.001 5 0.88 NA NA None
The closest I can get to this is to have 3 columns, each testing for the matrix and whether the result is higher than the limit:
Data_final = limits %>%
tibble::rownames_to_column("Analyte") %>%
full_join(data, by=c("Analyte"="Analyte")) %>%
mutate(ResultDry = as.numeric(ResultDry),
ResultWet = as.numeric(ResultWet),
wet_fish = as.numeric(wet_fish),
dry_floc_TEC = as.numeric(dry_floc_TEC),
dry_floc_PEC = as.numeric(dry_floc_PEC)) %>%
mutate(Exceed_Fish = ifelse(Matrix=="Fish",ResultWet>wet_fish,NA)) %>%
mutate(Exceed_Floc_TEC = ifelse(Matrix=="Floc",ResultDry>dry_floc_TEC,NA)) %>%
mutate(Exceed_Floc_PEC = ifelse(Matrix=="Floc",ResultDry>dry_floc_PEC,NA))
> Data_final
Analyte wet_fish dry_floc_PEC dry_floc_TEC Matrix ResultDry ResultWet Exceed_Fish Exceed_Floc_TEC Exceed_Floc_PEC
1 As 30.00 33.00 9.79 Floc 31.0000 1.000 NA TRUE FALSE
2 As 30.00 33.00 9.79 Fish 3.0000 34.000 TRUE NA NA
3 As 30.00 33.00 9.79 Floc 1.0000 1.000 NA FALSE FALSE
4 As 30.00 33.00 9.79 Fish 0.0020 0.200 FALSE NA NA
5 Cd 0.50 4.98 0.99 Floc 4.9900 0.100 NA TRUE TRUE
6 Cd 0.50 4.98 0.99 Fish 1.0000 4.990 TRUE NA NA
7 Cd 0.50 4.98 0.99 Floc 0.0400 0.002 NA FALSE FALSE
8 Cd 0.50 4.98 0.99 Fish 0.0005 0.050 FALSE NA NA
9 Cr 0.88 111.00 43.40 Floc 112.0000 0.100 NA TRUE TRUE
10 Cr 0.88 111.00 43.40 Fish 1.0000 50.000 TRUE NA NA
11 Cr 0.88 111.00 43.40 Floc 0.0800 0.008 NA FALSE FALSE
12 Cr 0.88 111.00 43.40 Fish 0.0010 5.000 TRUE NA NA
This is on the right track, but when I try to nest ifelse functions within a mutate to combine the three columns, it doesn't work correctly:
Data_combined = Data_final %>%
mutate(Exceed = ifelse(Exceed_Fish==TRUE,"Yes - Fish",
ifelse(Exceed_Floc_TEC==TRUE&Exceed_Floc_PEC==FALSE, "Yes - Floc TEC",
ifelse(Exceed_Floc_PEC==TRUE, "Yes - Floc PEC", "No"))))
> Data_combined
Analyte wet_fish dry_floc_PEC dry_floc_TEC Matrix ResultDry ResultWet Exceed_Fish Exceed_Floc_TEC Exceed_Floc_PEC Exceed
1 As 30.00 33.00 9.79 Floc 31.0000 1.000 NA TRUE FALSE <NA>
2 As 30.00 33.00 9.79 Fish 3.0000 34.000 TRUE NA NA Yes - Fish
3 As 30.00 33.00 9.79 Floc 1.0000 1.000 NA FALSE FALSE <NA>
4 As 30.00 33.00 9.79 Fish 0.0020 0.200 FALSE NA NA <NA>
5 Cd 0.50 4.98 0.99 Floc 4.9900 0.100 NA TRUE TRUE <NA>
6 Cd 0.50 4.98 0.99 Fish 1.0000 4.990 TRUE NA NA Yes - Fish
7 Cd 0.50 4.98 0.99 Floc 0.0400 0.002 NA FALSE FALSE <NA>
8 Cd 0.50 4.98 0.99 Fish 0.0005 0.050 FALSE NA NA <NA>
9 Cr 0.88 111.00 43.40 Floc 112.0000 0.100 NA TRUE TRUE <NA>
10 Cr 0.88 111.00 43.40 Fish 1.0000 50.000 TRUE NA NA Yes - Fish
11 Cr 0.88 111.00 43.40 Floc 0.0800 0.008 NA FALSE FALSE <NA>
12 Cr 0.88 111.00 43.40 Fish 0.0010 5.000 TRUE NA NA Yes - Fish
I think this is what you want. I changed the limits data frame a little (Analyte is now a column instead of row names) and then used case_when instead of ifelse:
limits= as.data.frame(matrix(c(30,33,9.79,
0.5,4.98,0.99,
0.88,111,43.4), nrow=3, ncol=3, byrow=TRUE))
colnames(limits) = c("wet_fish","dry_floc_PEC","dry_floc_TEC")
#rownames(limits) = c("As","Cd","Cr")
limits$Analyte <- c("As","Cd","Cr")
limits
data = matrix(c("Floc","As","31","1",
"Floc","Cd","4.99","0.1",
"Floc","Cr","112","0.1",
"Fish","As","3","34",
"Fish","Cd","1","4.99",
"Fish","Cr","1","50",
"Floc","As","1","1",
"Floc","Cd","0.04","0.002",
"Floc","Cr","0.08","0.008",
"Fish","As","0.002","0.2",
"Fish","Cd","0.0005","0.05",
"Fish","Cr","0.001","5"), ncol=4, byrow=T)
colnames(data) = c("Matrix","Analyte","ResultDry","ResultWet")
data = data.frame(data)
data
Data_final <-
limits %>%
full_join(data, by=c("Analyte"="Analyte")) %>%
mutate(ResultDry = as.numeric(ResultDry),
ResultWet = as.numeric(ResultWet),
wet_fish = as.numeric(wet_fish),
dry_floc_TEC = as.numeric(dry_floc_TEC),
dry_floc_PEC = as.numeric(dry_floc_PEC)) %>%
mutate(Exceed_Fish = ifelse(Matrix=="Fish",ResultWet>wet_fish,NA)) %>%
mutate(Exceed_Floc_TEC = ifelse(Matrix=="Floc",ResultDry>dry_floc_TEC,NA)) %>%
mutate(Exceed_Floc_PEC = ifelse(Matrix=="Floc",ResultDry>dry_floc_PEC,NA))
Data_combined <-
Data_final %>%
mutate(Exceed = case_when(Exceed_Fish==TRUE ~"Yes - Fish",
Exceed_Floc_TEC == TRUE & Exceed_Floc_PEC == FALSE ~ "Yes - Floc TEC",
Exceed_Floc_PEC==TRUE ~ "Yes - Floc PEC",
TRUE ~ "No"))
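As for why the nested ifelse in the question returns NA: a comparison such as NA == TRUE is itself NA, and ifelse passes an NA test straight through without consulting either branch. A small base R sketch with made-up flag vectors (exceed_fish and exceed_pec are illustrative, not the real columns):

```r
# Hypothetical flags: NA means "this limit does not apply to the row"
exceed_fish <- c(TRUE, NA, FALSE, NA)
exceed_pec  <- c(NA, TRUE, NA, FALSE)

# Nested ifelse: an NA test yields NA, and an NA in the inner test leaks out too
nested <- ifelse(exceed_fish == TRUE, "Fish",
                 ifelse(exceed_pec == TRUE, "PEC", "No"))
nested

# %in% never returns NA, so a missing flag is simply treated as FALSE
safe <- ifelse(exceed_fish %in% TRUE, "Fish",
               ifelse(exceed_pec %in% TRUE, "PEC", "No"))
safe
```

case_when sidesteps the problem in a similar way: a condition that evaluates to NA counts as not matched, so evaluation moves on to the next condition.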
I would calculate Exceed using case_when instead of nested ifelse. case_when runs its tests in order, so if the first condition is TRUE it never goes on to the next ones, which simplifies the conditionals (i.e., for the second case we can assume that ResultWet is not greater than LimitWet, and so don't have to test for that). In addition, by wrapping the tests in isTRUE, we automatically coerce any comparison involving an NA to FALSE.
library(dplyr)
library(tibble) # rownames_to_column()
data %>%
left_join(rownames_to_column(limits, 'Analyte'), by='Analyte') %>%
mutate_at(vars(starts_with('Result')), ~as.numeric(.)) %>%
# pick the limits that apply to each matrix (Fish: wet weight, Floc: dry weight)
mutate(LimitWet=if_else(Matrix=='Fish', wet_fish, NA_real_),
TECDry=if_else(Matrix=='Floc', dry_floc_TEC, NA_real_),
PECDry=if_else(Matrix=='Floc', dry_floc_PEC, NA_real_)) %>%
select(-wet_fish, -starts_with('dry_floc')) %>%
rowwise() %>%
mutate(Exceed=case_when(isTRUE(ResultWet>=LimitWet) ~ 'Fish',
isTRUE(ResultDry>=PECDry) ~ 'PEC',
isTRUE(ResultDry>=TECDry) ~ 'TEC',
TRUE ~ 'None'))
Matrix Analyte ResultDry ResultWet LimitWet TECDry PECDry Exceed
1 Floc As 31 1 NA 9.79 33 TEC
2 Floc Cd 4.99 0.1 NA 0.99 4.98 PEC
3 Floc Cr 112 0.1 NA 43.4 111 PEC
4 Fish As 3 34 30 NA NA Fish
5 Fish Cd 1 4.99 0.5 NA NA Fish
6 Fish Cr 1 50 0.88 NA NA Fish
7 Floc As 1 1 NA 9.79 33 None
8 Floc Cd 0.04 0.002 NA 0.99 4.98 None
9 Floc Cr 0.08 0.008 NA 43.4 111 None
10 Fish As 0.002 0.2 30 NA NA None
11 Fish Cd 0.0005 0.05 0.5 NA NA None
12 Fish Cr 0.001 5 0.88 NA NA Fish
An approach using dplyr (with tidyr and stringr). I'm not quite sure what the logic behind Exceed is, though ...
full_join(data, pivot_longer(limits, contains("_")) %>%
mutate(Matrix = str_to_title(gsub("^.{3}_|_.*", "", name)))) %>%
pivot_wider(names_from=name, values_from=value) %>%
rename(LimitWet = wet_fish, PECDry = dry_floc_PEC, TECDry = dry_floc_TEC)
Joining, by = c("Matrix", "Analyte")
# A tibble: 12 × 7
Matrix Analyte ResultDry ResultWet PECDry TECDry LimitWet
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Floc As 31 1 33 9.79 NA
2 Floc Cd 4.99 0.1 4.98 0.99 NA
3 Floc Cr 112 0.1 111 43.4 NA
4 Fish As 3 34 NA NA 30
5 Fish Cd 1 4.99 NA NA 0.5
6 Fish Cr 1 50 NA NA 0.88
7 Floc As 1 1 33 9.79 NA
8 Floc Cd 0.04 0.002 4.98 0.99 NA
9 Floc Cr 0.08 0.008 111 43.4 NA
10 Fish As 0.002 0.2 NA NA 30
11 Fish Cd 0.0005 0.05 NA NA 0.5
12 Fish Cr 0.001 5 NA NA 0.88
With Exceed, if it's just checking against ResultDry and ResultWet:
full_join(data, pivot_longer(limits, contains("_")) %>%
mutate(Matrix = str_to_title(gsub("^.{3}_|_.*", "", name)))) %>%
pivot_wider(names_from=name, values_from=value) %>%
rename(LimitWet = wet_fish, PECDry = dry_floc_PEC, TECDry = dry_floc_TEC) %>%
mutate(Exceed = case_when(
ResultWet >= LimitWet ~ "Fish",
ResultDry >= PECDry ~ "PEC",
ResultDry >= TECDry ~ "TEC", TRUE ~ "None"))
Joining, by = c("Matrix", "Analyte")
# A tibble: 12 × 8
Matrix Analyte ResultDry ResultWet PECDry TECDry LimitWet Exceed
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Floc As 31 1 33 9.79 NA TEC
2 Floc Cd 4.99 0.1 4.98 0.99 NA PEC
3 Floc Cr 112 0.1 111 43.4 NA PEC
4 Fish As 3 34 NA NA 30 Fish
5 Fish Cd 1 4.99 NA NA 0.5 Fish
6 Fish Cr 1 50 NA NA 0.88 Fish
7 Floc As 1 1 33 9.79 NA None
8 Floc Cd 0.04 0.002 4.98 0.99 NA None
9 Floc Cr 0.08 0.008 111 43.4 NA None
10 Fish As 0.002 0.2 NA NA 30 None
11 Fish Cd 0.0005 0.05 NA NA 0.5 None
12 Fish Cr 0.001 5 NA NA 0.88 Fish
Data
data <- structure(list(Matrix = c("Floc", "Floc", "Floc", "Fish", "Fish",
"Fish", "Floc", "Floc", "Floc", "Fish", "Fish", "Fish"), Analyte = c("As",
"Cd", "Cr", "As", "Cd", "Cr", "As", "Cd", "Cr", "As", "Cd", "Cr"
), ResultDry = c(31, 4.99, 112, 3, 1, 1, 1, 0.04, 0.08, 0.002,
5e-04, 0.001), ResultWet = c(1, 0.1, 0.1, 34, 4.99, 50, 1, 0.002,
0.008, 0.2, 0.05, 5)), row.names = c(NA, -12L), class = "data.frame")
limits <- structure(list(wet_fish = c(30, 0.5, 0.88), dry_floc_PEC = c(33,
4.98, 111), dry_floc_TEC = c(9.79, 0.99, 43.4), Analyte = c("As",
"Cd", "Cr")), class = "data.frame", row.names = c("As", "Cd",
"Cr"))

Clean a dataset that includes some character in each row

I have a dataset like this:
structure(list(`Frequency
Percent` = c("car", "window", "ball",
"ups"), AI = c("2\n0.00", "3\n0.00", "1\n0.00", "2\n0.00"), BLK = c("0\n0.00",
"218\n0.29", "48\n0.06", "0\n0.00"), HIANIC = c("1\n0.00", "8\n0.01",
"4\n0.01", "0\n0.00"), NATRICAN = c("9\n0.01", "7\n0.01", "8\n0.01",
"0\n0.00"), UNK = c("15\n0.02", "83\n0.11", "36\n0.05", "0\n0.00"
), yy = c("111\n0.15", "897\n1.20", "756\n1.02", "1\n0.00")), class = "data.frame", row.names = c(NA,
-4L))
How can I split each cell at the "\n" to make two new columns? For instance, for the car row and the AI column (cell "2\n0.00"), I want 2 and 0.00 in two different columns.
One way is to use tidyr::separate in a for loop:
for(i in names(df[,-1])){
df <- tidyr::separate(df, i, sep = "\n", into = c(i, paste0(i,"_val")))
}
Output:
# Frequency\n Percent AI AI_val BLK BLK_val HIANIC HIANIC_val NATRICAN NATRICAN_val UNK UNK_val yy yy_val
# 1 car 2 0.00 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
# 2 window 3 0.00 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
# 3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
# 4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00
Using tidyr::separate_rows and tidyr::pivot_wider you could do:
library(tidyr)
library(dplyr)
dat |>
mutate(unit = c("n\npct")) |>
separate_rows(-1, sep = "\n") |>
pivot_wider(names_from = "unit", values_from = -1)
#> # A tibble: 4 × 15
#> Frequency\n…¹ AI_n AI_pct BLK_n BLK_pct HIANI…² HIANI…³ NATRI…⁴ NATRI…⁵ UNK_n
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 car 2 0.00 0 0.00 1 0.00 9 0.01 15
#> 2 window 3 0.00 218 0.29 8 0.01 7 0.01 83
#> 3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36
#> 4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0
#> # … with 5 more variables: UNK_pct <chr>, yy_n <chr>, yy_pct <chr>,
#> # unit_n <chr>, unit_pct <chr>, and abbreviated variable names
#> # ¹​`Frequency\n Percent`, ²​HIANIC_n, ³​HIANIC_pct, ⁴​NATRICAN_n,
#> # ⁵​NATRICAN_pct
A base one liner:
do.call(data.frame, lapply(DF, \(x) do.call(rbind, strsplit(x, "\n"))))
# Frequency.Percent AI.1 AI.2 BLK.1 BLK.2 HIANIC.1 HIANIC.2 NATRICAN.1
#1 car 2 0.00 0 0.00 1 0.00 9
#2 window 3 0.00 218 0.29 8 0.01 7
#3 ball 1 0.00 48 0.06 4 0.01 8
#4 ups 2 0.00 0 0.00 0 0.00 0
# NATRICAN.2 UNK.1 UNK.2 yy.1 yy.2
#1 0.01 15 0.02 111 0.15
#2 0.01 83 0.11 897 1.20
#3 0.01 36 0.05 756 1.02
#4 0.00 0 0.00 1 0.00
Or also add a type conversion:
type.convert(do.call(data.frame, lapply(DF, \(x) do.call(rbind, strsplit(x, "\n")))), as.is=TRUE)
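To see what the one-liner does step by step, here is a sketch on a cut-down toy data frame. The names toy, parts and result and the _n/_pct suffixes are illustrative, and the sketch assumes every cell contains exactly one "\n".

```r
# Toy data in the same shape as the question (two rows, two value columns)
toy <- data.frame(item = c("car", "ball"),
                  AI   = c("2\n0.00", "1\n0.00"),
                  BLK  = c("0\n0.00", "48\n0.06"),
                  stringsAsFactors = FALSE)

# strsplit() returns one character vector per cell; rbind() stacks them
# into a two-column character matrix (count, percent)
parts <- do.call(rbind, strsplit(toy$AI, "\n", fixed = TRUE))
parts

# Repeat for every value column, naming and converting the pieces explicitly
result <- toy["item"]
for (nm in setdiff(names(toy), "item")) {
  m <- do.call(rbind, strsplit(toy[[nm]], "\n", fixed = TRUE))
  result[[paste0(nm, "_n")]]   <- as.integer(m[, 1])
  result[[paste0(nm, "_pct")]] <- as.numeric(m[, 2])
}
result
```

Naming the columns by hand avoids the automatic .1/.2 suffixes, at the cost of a few more lines.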
There is also a base R solution:
dat = structure(list(`Frequency
Percent` = c("car", "window", "ball",
"ups"), AI = c("2\n0.00", "3\n0.00", "1\n0.00", "2\n0.00"), BLK = c("0\n0.00",
"218\n0.29", "48\n0.06", "0\n0.00"), HIANIC = c("1\n0.00", "8\n0.01",
"4\n0.01", "0\n0.00"), NATRICAN = c("9\n0.01", "7\n0.01", "8\n0.01",
"0\n0.00"), UNK = c("15\n0.02", "83\n0.11", "36\n0.05", "0\n0.00"
), yy = c("111\n0.15", "897\n1.20", "756\n1.02", "1\n0.00")), class = "data.frame", row.names = c(NA,
-4L))
transformed = data.frame(Freq_pc = dat[,1])
for(col in seq(2, ncol(dat))){
transformed = cbind(transformed, t(matrix(unlist(strsplit(dat[,col], "\n")), nrow=2)))
names(transformed)[c(2*(col-1), 2*(col-1)+1)] = c(paste0(names(dat)[col], "_n"), paste0(names(dat)[col], "_pc"))
}
That results in:
Freq_pc AI_n AI_pc BLK_n BLK_pc HIANIC_n HIANIC_pc NATRICAN_n NATRICAN_pc UNK_n UNK_pc yy_n yy_pc
1 car 2 0.00 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
2 window 3 0.00 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
3 ball 1 0.00 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
4 ups 2 0.00 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00
We may use cSplit
library(splitstackshape)
cSplit(df1, 2:ncol(df1), sep = "\n")
-output
Frequency\nPercent AI_1 AI_2 BLK_1 BLK_2 HIANIC_1 HIANIC_2 NATRICAN_1 NATRICAN_2 UNK_1 UNK_2 yy_1 yy_2
1: car 2 0 0 0.00 1 0.00 9 0.01 15 0.02 111 0.15
2: window 3 0 218 0.29 8 0.01 7 0.01 83 0.11 897 1.20
3: ball 1 0 48 0.06 4 0.01 8 0.01 36 0.05 756 1.02
4: ups 2 0 0 0.00 0 0.00 0 0.00 0 0.00 1 0.00

How can I calculate the max for these data

Here is a part of my data
dat<-read.table (text="
Flower A1 A2 A3 TM MN B1 B2 B3
F1 12 9 11 12 0.56 19 1 12
F2 11 16 13 13 0.65 22 4 12
F3 10 12 14 11 0.44 29 9 12
", header=TRUE)
I want to calculate Max for column MN. For example, for value 0.44, the max is max(0.44,1-0.44)= 0.56.
I struggle to get it with a data frame.
Here is the desired outcome:
Flower A TM B MN Max
F1 12 12 19 0.56 0.56
F2 11 13 22 0.65 0.65
F3 10 11 29 0.44 0.56
F1 9 12 1 0.56 0.56
F2 16 13 4 0.65 0.65
F3 12 11 9 0.44 0.56
F1 11 12 12 0.56 0.56
F2 13 13 12 0.65 0.65
F3 14 11 12 0.44 0.56
Try the code below
transform(
reshape(
setNames(dat, gsub("(\\d+)", ".\\1", names(dat))),
direction = "long",
idvar = c("Flower", "TM", "MN"),
varying = -c(1, 5, 6)
),
Max = pmax(MN, 1 - MN)
)
which gives
Flower TM MN time A B Max
F1.12.0.56.1 F1 12 0.56 1 12 19 0.56
F2.13.0.65.1 F2 13 0.65 1 11 22 0.65
F3.11.0.44.1 F3 11 0.44 1 10 29 0.56
F1.12.0.56.2 F1 12 0.56 2 9 1 0.56
F2.13.0.65.2 F2 13 0.65 2 16 4 0.65
F3.11.0.44.2 F3 11 0.44 2 12 9 0.56
F1.12.0.56.3 F1 12 0.56 3 11 12 0.56
F2.13.0.65.3 F2 13 0.65 3 13 12 0.65
F3.11.0.44.3 F3 11 0.44 3 14 12 0.56
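The key detail is that pmax works element by element, whereas max collapses everything to a single number. A quick check on the three MN values from the question (overall_max and rowwise_max are just illustrative names):

```r
MN <- c(0.56, 0.65, 0.44)

# max() pools all values from both vectors and returns one number
overall_max <- max(MN, 1 - MN)
overall_max

# pmax() compares the two vectors position by position
rowwise_max <- pmax(MN, 1 - MN)
rowwise_max
```

That is why max inside a data frame column gives the same value on every row, while pmax gives 0.56 for the 0.44 row.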
Using reshape and pmax.
reshape(dat, varying=list(2:4, 7:9), direction='long', idvar='Flower') |>
transform(Max=pmax(MN, 1 - MN))
# Flower TM MN time A1 B1 Max
# F1.1 F1 12 0.56 1 12 19 0.56
# F2.1 F2 13 0.65 1 11 22 0.65
# F3.1 F3 11 0.44 1 10 29 0.56
# F1.2 F1 12 0.56 2 9 1 0.56
# F2.2 F2 13 0.65 2 16 4 0.65
# F3.2 F3 11 0.44 2 12 9 0.56
# F1.3 F1 12 0.56 3 11 12 0.56
# F2.3 F2 13 0.65 3 13 12 0.65
# F3.3 F3 11 0.44 3 14 12 0.56
Note: R >= 4.1 used.

Create matrix from dataset in R

I want to create a matrix from my data. My data consists of two columns, date and my observations for each date. I want the matrix to have year as rows and days as columns, e.g. :
17 18 19 20 ... 31
1904 x11 x12 ...
1905
1906
.
.
.
2019
The days in this case is for December each year. I would like missing values to equal NA.
Here's a sample of my data:
> head(cdata)
# A tibble: 6 x 2
Datum Snödjup
<dttm> <dbl>
1 1904-12-01 00:00:00 0.02
2 1904-12-02 00:00:00 0.02
3 1904-12-03 00:00:00 0.01
4 1904-12-04 00:00:00 0.01
5 1904-12-12 00:00:00 0.02
6 1904-12-13 00:00:00 0.02
I figured that the first thing I need to do is to split the date into year, month and day (European formatting, YYYY-MM-DD), so I did that, got rid of the date column (the one called Datum), and also dropped the irrelevant days, namely the ones < 17.
cd <- cdata %>%
dplyr::mutate(year = lubridate::year(Datum),
month = lubridate::month(Datum),
day = lubridate::day(Datum)) %>%
dplyr::select(-Datum)
cu <- cd[which(cd$day > 16
& cd$day < 32
& cd$month == 12),]
and now it looks like this:
> cu
# A tibble: 1,284 x 4
Snödjup year month day
<dbl> <dbl> <dbl> <int>
1 0.01 1904 12 26
2 0.01 1904 12 27
3 0.01 1904 12 28
4 0.12 1904 12 29
5 0.12 1904 12 30
6 0.15 1904 12 31
7 0.07 1906 12 17
8 0.05 1906 12 18
9 0.05 1906 12 19
10 0.04 1906 12 20
# … with 1,274 more rows
Now I need to fit my data into a matrix with missing values as NA. Is there any way to do this?
Base R approach, using by.
r <- `colnames<-`(do.call(rbind, by(dat, substr(dat$date, 1, 4), function(x) x[[2]])), 1:31)
r[,17:31]
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 1904 -0.28 -2.66 -2.44 1.32 -0.31 -1.78 -0.17 1.21 1.90 -0.43 -0.26 -1.76 0.46 -0.64 0.46
# 1905 1.44 -0.43 0.66 0.32 -0.78 1.58 0.64 0.09 0.28 0.68 0.09 -2.99 0.28 -0.37 0.19
# 1906 -0.89 -1.10 1.51 0.26 0.09 -0.12 -1.19 0.61 -0.22 -0.18 0.93 0.82 1.39 -0.48 0.65
Toy data
set.seed(42)
dat <- do.call(rbind, lapply(1904:1906, function(x)
data.frame(date=seq(ISOdate(x, 12, 1, 0), ISOdate(x, 12, 31, 0), "day" ),
value=round(rnorm(31), 2))))
You can try :
library(dplyr)
library(tidyr)
cdata %>%
mutate(year = lubridate::year(Datum),
month = lubridate::month(Datum),
day = lubridate::day(Datum)) %>%
filter(month == 12, day >= 17) %>%
select(year, day, Snödjup) %>%
complete(year, day = 17:31) %>%
pivot_wider(names_from = day, values_from = Snödjup)
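If an actual matrix is needed, another option is tapply with factors whose levels spell out every year and day, so that empty cells come back as NA. This is a sketch on a few made-up rows. The column name depth stands in for Snödjup, and it assumes at most one observation per date.

```r
# Hypothetical rows in the shape of `cu` (depth stands in for Snödjup)
cu <- data.frame(year  = c(1904, 1904, 1906, 1906),
                 day   = c(30, 31, 17, 18),
                 depth = c(0.12, 0.15, 0.07, 0.05))

# Factor levels define the full grid, including years and days with no data
years <- factor(cu$year, levels = 1904:1906)
days  <- factor(cu$day, levels = 17:31)

# One value per (year, day) cell; combinations without data stay NA
m <- tapply(cu$depth, list(years, days), function(v) v[1])
dim(m)
m["1904", c("30", "31")]
```

Because the year levels are given explicitly, a year with no observations at all (1905 here) still gets a row of NA.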

function to return suitably lagged and iterated divided value in R

I have time series data, and I want a function that returns a suitably lagged ratio: for each row, the next value divided by the current value.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
New values: ID 23 = value(24)/value(23), ID 22 = value(23)/value(22), ID 21 = value(22)/value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NAN
To divide the successor of each element of a vector x by that element (i.e. value(i+1)/value(i)), use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
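As a quick check, here is the same expression applied to the first four temperature values from the question. The wrapper name lead_ratio is made up for this sketch.

```r
# Hypothetical helper: next value divided by current value, padded with NaN
lead_ratio <- function(x) c(x[-1] / x[-length(x)], NaN)

temp <- c(-1.1923333, -0.2123333, -0.593, -0.7393333)
ratio <- lead_ratio(temp)
ratio
```

The result has the same length as the input, with the last position left as NaN because that row has no successor.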
