What is the meaning of class_loss and iou_loss in YOLOv4 training output? - yolov4

I don't understand the training output of YOLOv4. I understand that 4001 is the iteration number and 0.325970 is the loss for this iteration (with 0.277811 the running average loss). However, I don't understand the lines starting with v3; there are numerous v3 lines. I guess class_loss represents the loss in the classification of objects. What is iou_loss, and why is its value so large compared with class_loss? Also, I guess each v3 line corresponds to a layer; does the value in the last v3 line represent the final loss?
4001: 0.325970, 0.277811 avg loss, 0.002610 rate, 0.746759 seconds, 256064 images, 4.467991 hours left
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.590250), count: 11, class_loss = 0.213037, iou_loss = 70.602318, total_loss = 70.815353
total_bbox = 458829, rewritten_bbox = 8.034584 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.357880), count: 4, class_loss = 0.227196, iou_loss = 72.440552, total_loss = 72.667747
total_bbox = 458833, rewritten_bbox = 8.034513 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.517755), count: 1, class_loss = 0.050293, iou_loss = 11.369855, total_loss = 11.420148
total_bbox = 458834, rewritten_bbox = 8.034496 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.560250), count: 2, class_loss = 0.132711, iou_loss = 8.734029, total_loss = 8.866740
total_bbox = 458836, rewritten_bbox = 8.034461 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.599422), count: 3, class_loss = 0.215771, iou_loss = 10.993549, total_loss = 11.209321
total_bbox = 458839, rewritten_bbox = 8.034409 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.623037), count: 8, class_loss = 0.692446, iou_loss = 121.287407, total_loss = 121.979851
total_bbox = 458847, rewritten_bbox = 8.034268 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.604043), count: 6, class_loss = 0.330252, iou_loss = 60.332104, total_loss = 60.662357
total_bbox = 458853, rewritten_bbox = 8.034163 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.518494), count: 25, class_loss = 2.354674, iou_loss = 166.798553, total_loss = 169.153229
total_bbox = 458878, rewritten_bbox = 8.033944 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.528473), count: 5, class_loss = 0.496927, iou_loss = 34.804985, total_loss = 35.301910
total_bbox = 458883, rewritten_bbox = 8.034074 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.631476), count: 8, class_loss = 0.488647, iou_loss = 109.094719, total_loss = 109.583366
total_bbox = 458891, rewritten_bbox = 8.033934 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.566340), count: 9, class_loss = 0.637714, iou_loss = 85.766495, total_loss = 86.404205
total_bbox = 458900, rewritten_bbox = 8.033995 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.536185), count: 3, class_loss = 0.011424, iou_loss = 40.380566, total_loss = 40.391991
total_bbox = 458903, rewritten_bbox = 8.033942 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.557040), count: 4, class_loss = 0.802973, iou_loss = 12.161980, total_loss = 12.964952
total_bbox = 458907, rewritten_bbox = 8.033872 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.554378), count: 9, class_loss = 0.435504, iou_loss = 115.920799, total_loss = 116.356300
total_bbox = 458916, rewritten_bbox = 8.033714 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.418351), count: 21, class_loss = 0.921397, iou_loss = 160.036682, total_loss = 160.958084
total_bbox = 458937, rewritten_bbox = 8.034436 %
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.000000), count: 1, class_loss = 0.000000, iou_loss = 0.000000, total_loss = 0.000000
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.598399), count: 4, class_loss = 0.016917, iou_loss = 82.296219, total_loss = 82.313133
total_bbox = 458941, rewritten_bbox = 8.034366 %
Loaded: 6.379705 seconds - performance bottleneck on CPU or Disk HDD/SSD


Finding matches for multiple words with stringdist

I have test data as follows. I am trying to find (near) matches for a vector of words, using stringdist, as the actual database is large:
library(stringdist)
test_data <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000,
2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000,
2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,
2001, 2002, 2002, 2002), Municipality = c("Some", "Anything",
"Nothing", "Someth.", "Anything", "Not", "Something", "Anything",
"None", "Some", "Anything", "Nothing", "Someth.", "Anything",
"Not", "Something", "Anything", "None", "Some", "Anything", "Nothing",
"Someth.", "Anything", "Not", "Something", "Anything", "None"
), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 27 x 4
Province Year Municipality `Other Values`
<dbl> <dbl> <chr> <dbl>
1 1 2000 Some 0.41
2 1 2000 Anything 0.42
3 1 2000 Nothing 0.34
4 1 2001 Someth. 0.47
5 1 2001 Anything 0.0600
6 1 2001 Not 0.8
7 1 2002 Something 0.14
8 1 2002 Anything 0.15
9 1 2002 None 0.01
10 2 2000 Some 0.41
# ... with 17 more rows
I tried to run:
test_match_out <- amatch(c("Anything","Something"),test_data[,3],maxDist=2)
EDIT:
Following the comment of zx8754, I tried:
test_match_out <- amatch(c("Anything","Something"),test_data[[3]],maxDist=2)
And:
test_match_out <- amatch(c("Anything","Something"),test_data$Municipality,maxDist=2)
I was under the impression that the previous line (amatch) would give me something like a vector of indices, where there would be a match. But it just gives me a vector with two NA values. Am I misunderstanding what amatch does, or is there something wrong in the syntax?
I want to get the values for which amatch is a match and the word that is matched.
Desired output:
test_data_2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000,
2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000,
2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,
2001, 2002, 2002, 2002), Municipality = c("Some", "Anything",
"Nothing", "Someth.", "Anything", "Not", "Something", "Anything",
"None", "Some", "Anything", "Nothing", "Someth.", "Anything",
"Not", "Something", "Anything", "None", "Some", "Anything", "Nothing",
"Someth.", "Anything", "Not", "Something", "Anything", "None"
), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01), `Matched Values` = c(NA, 0.42, NA, NA, 0.06000,
NA, 0.14, 0.15, NA, NA, 0.42, NA, NA, 0.0600000000000001,
NA, 0.14, 0.15, NA, NA, 0.42, NA, NA, 0.0600000000000001,
NA, 0.14, 0.15, NA), `Matched Values` = c(NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA, NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA, NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA)), row.names = c(NA, -27L), class = c("tbl_df",
"tbl", "data.frame"))
Get the index of matches, then update all rows that match:
ix <- amatch(c("Anything","Something"), test_data[[ 3 ]], maxDist = 2)
# [1] 2 7
ifelse(test_data$Municipality %in% test_data$Municipality[ ix ],
test_data$`Other Values`, NA)
# [1] NA 0.42 NA NA 0.06 NA 0.14 0.15 NA NA 0.42
# [12] NA NA 0.06 NA 0.14 0.15 NA NA 0.42 NA NA
# [23] 0.06 NA 0.14 0.15 NA
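The desired output also carries the matched word itself (its second new column). The same membership mask can drive both columns; a minimal standalone sketch follows (the data frame is reconstructed inline so it runs on its own, and the column name `Matched Word` is my own label, since the desired output reuses the name `Matched Values` twice):

```r
library(stringdist)

# reconstruct the question's test_data (tibble class dropped for brevity)
test_data <- data.frame(
  Province = rep(1:3, each = 9),
  Year = rep(rep(2000:2002, each = 3), 3),
  Municipality = rep(c("Some", "Anything", "Nothing", "Someth.", "Anything",
                       "Not", "Something", "Anything", "None"), 3),
  check.names = FALSE
)
test_data$`Other Values` <- rep(c(0.41, 0.42, 0.34, 0.47, 0.06,
                                  0.8, 0.14, 0.15, 0.01), 3)

# amatch() returns, for each search word, the index of its closest
# match in the lookup vector (or NA if none is within maxDist)
ix <- amatch(c("Anything", "Something"), test_data$Municipality, maxDist = 2)

matched <- test_data$Municipality %in% test_data$Municipality[ix]
test_data$`Matched Values` <- ifelse(matched, test_data$`Other Values`, NA)
test_data$`Matched Word`   <- ifelse(matched, test_data$Municipality, NA)
```

Note that "Someth." stays NA: its edit distance to "Something" is 3, above maxDist = 2.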

How to save cat results as data.frame

Here I'm trying to save my results as a data.frame, but I couldn't; the only way I was able to show them was by using cat():
library(metafor)
id <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1978, 1983, 1974, 1989, 1974, 2002, 1990, 1974, 1998, 1989, 1974, 1983, 1978, 1998, 1978, 1974, 1974, 1998)
study.name <- c("Banninger 1978", "Roberts 1983", "Beard 1974", "Mahomed 1989", "Livingstone 1974", "Upton 2002", "Olah 1990", "Rogers 1974", "Mackrodt 1998",
"Mahomed 1989", "Beard 1974", "Roberts 1983", "Banninger 1978", "Mackrodt 1998", "Banninger 1978", "Beard 1974", "Livingstone 1974", "Mackrodt 1998")
y <- c(-0.81, -0.87, -0.67, -0.77, -0.03, -0.94, -0.78, -0.12, -0.34, -0.34, -0.76, -0.87, -0.55, -0.99, -0.44, -0.14, -0.34, -0.76)
s <- c(0.11, 0.19, 0.05, 0.17, 0.09, 0.03, 0.08, 0.22, 0.21, 0.27, 0.15, 0.04, 0.15, 0.09, 0.02, 0.11, 0.03, 0.09)
data <- as.data.frame(cbind(id, year, study.name, y, s))
data.ids <- unique(data$id)
n.ma.binary <- length(data.ids)
for(i in 1:n.ma.binary){
temp <- data[data$id == data.ids[i],]
temp <- temp[order(temp$year),] # sorting MA by year
list <- lapply(4:nrow(temp)-1, function(k) head(temp, k))
for(j in 1:length(list)){
dd <- as.data.frame(list[j])
result <- rma(yi = as.numeric(dd$y), vi = as.numeric(dd$s))
alpha <- 0.05
n <- result$k
# To get PI
lower.PI <- result$b - qt(1-alpha/2,n-2)*sqrt(2*result$tau2 + (result$se.tau2)^2)
upper.PI <- result$b + qt(1-alpha/2,n-2)*sqrt(2*result$tau2 + (result$se.tau2)^2)
remaining <- temp[ !(temp$study.name %in% dd$study.name), ]
decision <- as.numeric(ifelse(sapply(remaining$y, function(p)
any(lower.PI <= p & upper.PI >= p)),"1", "0"))
proportion <- mean(decision)
cat(list.num = j, new.id = unique(temp$id), proportion = proportion,
lower.PI = lower.PI, upper.PI = upper.PI, "\n")
}
}
Is anyone able to save the results as data.frame?
It is a bit more involved since you are doing multiple analyses within each group. Modify your code as follows. First, insert the following before entering the first loop:
results.all <- data.frame(list.num=NA, new.id=NA, proportion=NA, lower.PI=NA, upper.PI=NA)
idx <- 0
Now replace the cat statement with the following:
idx <- idx+1
results.all[idx,] <- c(list.num = j, new.id = unique(temp$id), proportion = proportion, lower.PI = lower.PI, upper.PI = upper.PI)
Now you can print the data frame:
results.all
# list.num new.id proportion lower.PI upper.PI
# 1 1 1 0 -5.75317346362513 5.07162070815196
# 2 2 1 0 -1.97747383048426 1.06346158846251
# 3 3 1 0 -1.44463494970073 0.414992978516971
# 4 4 1 0 -1.14935843511063 0.0548216056175898
# 5 5 1 0 -0.886747285977298 -0.295909268008411
# 6 6 1 0 -0.710099787395835 -0.441424896621657
# 7 1 2 0 -2.02192058321991 0.431485800611215
# 8 2 2 0 -1.13751587707834 -0.372704387238843
# 9 1 3 0 -0.80174083281866 0.0528883738022665
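As a side note, `results.all[idx,] <- c(...)` funnels each row through a single vector (which can coerce mixed types), and growing a data frame one row at a time copies it on every assignment. A common alternative, sketched here with stand-in values rather than the metafor numbers, is to collect one small data frame per iteration in a list and bind once at the end:

```r
rows <- list()
for (j in 1:5) {                          # stand-in for the inner loop body
  rows[[length(rows) + 1]] <- data.frame(
    list.num = j, new.id = 1, proportion = 0,
    lower.PI = -j, upper.PI = j
  )
}
results.all <- do.call(rbind, rows)       # one bind instead of repeated growth
```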

How to make calculations and comparisons with the next line in R

I got stuck on a problem.
I have this df:
df <- data.frame(station = c("A", "A", "A", "B", "B"),
Initial_height = c(20, 50, 100, 30, 60),
final_height = c(50, 100, 300, 60, 110),
initial_flow = c(0.5, 1.2, 1.9, 0.8, 0.7),
final_Flow = c(1.21, 1.92, 0.805, 0.7, 1))
Context: each height has a flow value, but it is calculated differently for each line of the data frame.
I would like to compare, for the same station, the flow value where the height is the same.
My perfect data frame:
df.answer <- data.frame(station = c("A", "A", "A", "B", "B"),
Initial_height = c(20, 50, 100, 30, 60),
final_height = c(50, 100, 300, 60, 110),
initial_flow = c(0.5, 1.2, 1.9, 0.8, 0.7),
final_Flow = c(1.21, 1.92, 0.805, 0.7, 1),
diff_flow = c(0.010, 0.020, NA, 0, NA))
NA can be replaced by any other character
EDIT: this can happen:
df <- data.frame(station = c("A", "A", "A", "B", "B"),
Initial_height = c(20, 51, 100, 30, 60),
final_height = c(50, 100, 300, 60, 110),
initial_flow = c(0.5, 1.2, 1.9, 0.8, 0.7),
final_Flow = c(1.21, 1.92, 0.805, 0.7, 1),
diff_flow = c(NA, 0.020, NA, 0, NA))
At station A, the initial and final height values do not match, so it should return NA.
We can subtract the lead, i.e. the next value, of 'initial_flow' from 'final_Flow' after grouping by 'station':
library(dplyr)
out <- df %>%
group_by(station) %>%
mutate(diff_flow = final_Flow - lead(initial_flow)) %>%
ungroup
Output:
out
# A tibble: 5 x 6
# station Initial_height final_height initial_flow final_Flow diff_flow
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 20 50 0.5 1.21 0.01
#2 A 50 100 1.2 1.92 0.02
#3 A 100 300 1.9 0.805 NA
#4 B 30 60 0.8 0.7 0
#5 B 60 110 0.7 1 NA
In data.table, you can use shift to get the next row in each group.
library(data.table)
setDT(df)[,diff_flow := final_Flow - shift(initial_flow, type = 'lead'), station]
# station Initial_height final_height initial_flow final_Flow diff_flow
#1: A 20 50 0.5 1.210 0.01
#2: A 50 100 1.2 1.920 0.02
#3: A 100 300 1.9 0.805 NA
#4: B 30 60 0.8 0.700 0.00
#5: B 60 110 0.7 1.000 NA
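Neither output above handles the question's EDIT, where the next row's Initial_height no longer equals the current final_height and the result should be NA. A hedged extension of the dplyr answer that adds this guard, using the edited df from the question:

```r
library(dplyr)

df <- data.frame(station = c("A", "A", "A", "B", "B"),
                 Initial_height = c(20, 51, 100, 30, 60),   # edited example
                 final_height = c(50, 100, 300, 60, 110),
                 initial_flow = c(0.5, 1.2, 1.9, 0.8, 0.7),
                 final_Flow = c(1.21, 1.92, 0.805, 0.7, 1))

out <- df %>%
  group_by(station) %>%
  # only diff when this row's final height matches the next row's initial height
  mutate(diff_flow = ifelse(final_height == lead(Initial_height),
                            final_Flow - lead(initial_flow), NA)) %>%
  ungroup()
```

With the original df (heights 20/50/100), the same code should reproduce the first desired output, since every condition there is either TRUE or NA.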

R: expand data frame columnwise with shifted rows of data

- Example Data to work with:
To create a reduced example, this is the output of dput(df):
df <- structure(list(SubjectID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L), .Label = c("1", "2", "3"), class = "factor"), EventNumber = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
EventType = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("A", "B"), class = "factor"), Param1 = c(0.3,
0.21, 0.87, 0.78, 0.9, 1.2, 1.4, 1.3, 0.6, 0.45, 0.45, 0.04,
0, 0.1, 0.03, 0.01, 0.09, 0.06, 0.08, 0.09, 0.03, 0.04, 0.04,
0.02), Param2 = c(45, 38, 76, 32, 67, 23, 27, 784, 623, 54,
54, 1056, 487, 341, 671, 859, 7769, 2219, 4277, 4060, 411,
440, 224, 57), Param3 = c(1.5, 1.7, 1.65, 1.32, 0.6, 0.3,
2.5, 0.4, 1.4, 0.67, 0.67, 0.32, 0.1, 0.15, 0.22, 0.29, 0.3,
0.2, 0.8, 1, 0.9, 0.8, 0.3, 0.1), Param4 = c(0.14, 0, 1,
0.86, 0, 0.6, 1, 1, 0.18, 0, 0, 0.39, 0, 1, 0.29, 0.07, 0.33,
0.53, 0.29, 0.23, 0.84, 0.61, 0.57, 0.59), Param5 = c(0.18,
0, 1, 0, 1, 0, 0.09, 1, 0.78, 0, 0, 1, 0.2, 0, 0.46, 0.72,
0.16, 0.22, 0.77, 0.52, 0.2, 0.68, 0.58, 0.17), Param6 = c(0,
1, 0.75, 0, 0.14, 0, 1, 0, 1, 0.27, 0, 1, 0, 0.23, 0.55,
0.86, 1, 0.33, 1, 1, 0.88, 0.75, 0, 0), AbsoluteTime = structure(c(1522533600,
1522533602, 1522533604, 1522533604, 1525125600, 1525125602,
1525125604, 1519254000, 1519254002, 1519254004, 1519254006,
1521759600, 1521759602, 1521759604, 1521759606, 1521759608,
1517353224, 1517353226, 1517353228, 1517353230, 1517439600,
1517439602, 1517439604, 1517439606), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, -24L), class = "data.frame")
df
The real data has 20 subjects, EventNumbers ranging from 1 to 100, and parameters from Param1 up to Param40 (depending on the experiment).
The row count is around 60,000 observations.
- What I want to achieve:
For df, create n * 40 new columns. # (40 or any number of parameters that will be chosen later.)
Think of n as "steps into the future".
Name the 40 * n newly created columns:
Param1_2, Param2_2, Param3_2, ..., Param39_2, Param40_2, ...,
Param1_3, Param2_3, Param3_3, ..., Param39_3, Param40_3, ...,
...,
Param1_n, Param2_n, Param3_n, ..., Param39_n, Param40_n
Resulting in columns
Param1_1, Param2_1, Param1_2, Param2_2, Param1_3, Param2_3, Param1_4, Param2_4, ... Param1_n, Param2_n
So every observation of subset df[X, c(4:9)] will get an additional set of variables with values from df[X+1, c(4:9)] to df[X+n, c(4:9)].
This is what the new df.extended should look like for n = 1:
df.extended <- structure(list(SubjectID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3), EventNumber = c(1, 1,
1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2,
2), EventType = c("A", "A", "A", "A", "B", "B", "B", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "A", "A", "A", "A", "B", "B",
"B", "B"), Param1 = c(0.3, 0.21, 0.87, 0.78, 0.9, 1.2, 1.4, 1.3,
0.6, 0.45, 0.45, 0.04, 0, 0.1, 0.03, 0.01, 0.05, 0.07, 0.06,
0.01, 0.01, 0.01, 0.07, 0.04), Param2 = c(45, 38, 76, 32, 67,
23, 27, 784, 623, 54, 54, 1056, 487, 341, 671, 859, 1858, 640,
8181, 220, 99, 86, 170, 495), Param3 = c(1.5, 1.7, 1.65, 1.32,
0.6, 0.3, 2.5, 0.4, 1.4, 0.67, 0.67, 0.32, 0.1, 0.15, 0.22, 0.29,
1.5, 0.9, 0.8, 0.9, 0.1, 0, 0.8, 0.1), Param4 = c(0.14, 0, 1,
0.86, 0, 0.6, 1, 1, 0.18, 0, 0, 0.39, 0, 1, 0.29, 0.07, 0.64,
0.11, 0.12, 0.32, 0.55, 0.67, 0.83, 0.82), Param5 = c(0.18, 0,
1, 0, 1, 0, 0.09, 1, 0.78, 0, 0, 1, 0.2, 0, 0.46, 0.72, 0.27,
0.14, 0.7, 0.67, 0.23, 0.44, 0.61, 0.76), Param6 = c(0, 1, 0.75,
0, 0.14, 0, 1, 0, 1, 0.27, 0, 1, 0, 0.23, 0.55, 0.86, 1, 0.56,
0.45, 0.5, 0, 0, 0.89, 0.11), AbsoluteTime = c("2018-04-01 00:00:00",
"2018-04-01 00:00:02", "2018-04-01 00:00:04", "2018-04-01 00:00:04",
"2018-05-01 00:00:00", "2018-05-01 00:00:02", "2018-05-01 00:00:04",
"2018-02-22 00:00:00", "2018-02-22 00:00:02", "2018-02-22 00:00:04",
"2018-02-22 00:00:06", "2018-03-23 00:00:00", "2018-03-23 00:00:02",
"2018-03-23 00:00:04", "2018-03-23 00:00:06", "2018-03-23 00:00:08",
"2018-01-31 00:00:24", "2018-01-31 00:00:26", "2018-01-31 00:00:28",
"2018-01-31 00:00:30", "2018-02-01 00:00:00", "2018-02-01 00:00:02",
"2018-02-01 00:00:04", "2018-02-01 00:00:06"), Param1_2 = c(0.21,
0.87, 0.78, NA, 1.2, 1.4, NA, 0.6, 0.45, 0.45, NA, 0, 0.1, 0.03,
0.01, NA, 0.07, 0.07, 0.08, NA, 0.09, 0.06, 0.01, NA), Param2_2 = c(38,
76, 32, NA, 23, 27, NA, 623, 54, 54, NA, 487, 341, 671, 859,
NA, 6941, 4467, 808, NA, 143, 301, 219, NA), Param3_2 = c(1.7,
1.65, 1.32, NA, 0.3, 2.5, NA, 1.4, 0.67, 0.67, NA, 0.1, 0.15,
0.22, 0.29, NA, 1, 1, 0.1, NA, 0.5, 1, 0.3, NA), Param4_2 = c(0,
1, 0.86, NA, 0.6, 1, NA, 0.18, 0, 0, NA, 0, 1, 0.29, 0.07, NA,
0.31, 0.16, 0.68, NA, 0.86, 0.47, 0.47, NA), Param5_2 = c(0,
1, 0, NA, 0, 0.09, NA, 0.78, 0, 0, NA, 0.2, 0, 0.46, 0.72, NA,
0.29, 0.26, 0.1, NA, 0.88, 0.86, 0.95, NA), Param6_2 = c(1, 0,
0, NA, 0, 1, NA, 1, 0.27, 0, NA, 0, 0.23, 0.55, 0.86, NA, 0.68,
0.66, 0, NA, 0.44, 1, 0.22, NA)), row.names = c(NA, 24L), class = "data.frame")
df.extended
How can this be solved without using loops, writing column indexes by hand, etc.? Write a function for trial 2 and use doBy?
My thoughts and what I have done so far to solve this:
Trial 1:
Cycle through the SubjectIDs in a for-loop
In an inner for-loop, cycle through the EventNumber
In another inner for-loop, cycle through the rows
Get the first row by grabbing df[1, ] and save into df.temp
Merge df.temp with df[2, parameters]
Merge df.temp with df[3, parameters], and so on
Save all resulting df.temps into df.final
Problems I ran into: Step 5:
df.temp <- df[1,]
df.temp <- merge(df.temp, df[2, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
df.temp <- merge(df.temp, df[3, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
df.temp <- merge(df.temp, df[4, !(colnames(df) == "AbsoluteTime")], by = c("SubjectID", "EventNumber", "EventType"))
Warning:
In merge.data.frame(df.temp, df[4, ], by = c("SubjectID", "EventNumber", :
column names ‘Param1.x’, ‘Param2.x’, ‘Param3.x’, ‘Param4.x’, ‘Param5.x’, ‘Param6.x’, ‘AbsoluteTime.x’, ‘Param1.y’, ‘Param2.y’,
‘Param3.y’, ‘Param4.y’, ‘Param5.y’, ‘Param6.y’, ‘AbsoluteTime.y’ are
duplicated in the result.
The column names are repeated, see the warning.
I can not figure out how to easily create the column names / rename the new columns based on a given column name and variable.
There must be a better way than this:
n <- 3
names_vector <- c()
for (n in seq(from = c(1), to = n)) {
for (i in names(df[4:9])) {
names_vector <- c(names_vector, paste0(i, "_", c(n+1)))
}
}
names(df.temp)[c(4:9)] <- parameters
names(df.temp)[c(11:ncol(df.temp))] <- names_vector
names(df.temp)
Also, how do I prevent the last n-1 rows from breaking the script? This is a lot of work to do by hand, and I think it is quite error prone!?
Trial 2:
Cycle through the SubjectIDs in a for-loop
In an inner for-loop, cycle through the EventNumber
Get all rows of parameters into a new data frame except the first row
Append a row with NAs
use cbind() to merge the rows
Repeat n times.
This is the code for one SubjectID and one EventNumber:
df.temp <- df[which(df$SubjectID == "1" & df$EventNumber == "1"), ]
df.temp2 <- df.temp[2:nrow(df.temp)-1, parameters]
df.temp2 <- rbind(df.temp2, NA)
df.temp <- cbind(df.temp, df.temp2)
df.temp2 <- df.temp[3:nrow(df.temp)-1, parameters]
df.temp2 <- rbind(df.temp2, NA, NA)
df.temp <- cbind(df.temp, df.temp2)
df.temp2 <- df.temp[4:nrow(df.temp)-1, parameters]
df.temp2 <- rbind(df.temp2, NA, NA, NA)
df.temp <- cbind(df.temp, df.temp2)
n <- 3
names_vector <- c()
for (n in seq(from = c(1), to = n)) {
for (i in names(df[4:9])) {
print(i)
print(n)
names_vector <- c(names_vector, paste0(i, "_", c(n+1)))
}
}
names(df.temp)[c(4:9)] <- parameters
names(df.temp)[c(11:ncol(df.temp))] <- names_vector
df.temp
That solves the problem with missing rows (NAs are acceptable in my case).
Still lots of work by hand and for-loops, and it seems error prone!?
What about something like this:
You can use the developer version of the package dplyr to add and rename variables according to various subsets of interest in your data. dplyr also provides the functions lead() and lag(), which can be used to find the "next" or "previous" values in a vector (or, here, row). You can use lead() in combination with the function mutate_at() to extract the values from the succeeding nth row and use them to create a new set of variables.
Here I use the data you provided in your example:
# load dplyr package
require(dplyr)
# create new data frame "df.extended"
df.extended <- df
# number of observations per group (e.g., SubjectID)
# or desired number of successions
obs = 3
# loop until number of successions achieved
for (i in 1:obs) {
# overwrite df.extended with new information
df.extended <- df.extended %>%
# group by subjects and events
group_by(SubjectID, EventNumber) %>%
# create new variable for each parameter
mutate_at( vars(Param1:Param6),
# using the lead function
.funs = funs(step = lead),
# for the nth followning row
n = i) %>%
# rename the new variables to show the succession number
rename_at(vars(contains("_step")), funs(sub("step", as.character(i), .)))
}
This should roughly recreate the data you posted as desired result.
# Look at first part of "df.extended"
> head(df.extended)
# A tibble: 6 x 28
# Groups: SubjectID, EventNumber [2]
SubjectID EventNumber EventType Param1 Param2 Param3 Param4 Param5 Param6 AbsoluteTime Param1_1 Param2_1 Param3_1 Param4_1 Param5_1 Param6_1
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 A 0.300 45. 1.50 0.140 0.180 0. 2018-04-01 00:00:00 0.210 38. 1.70 0. 0. 1.00
2 1 1 A 0.210 38. 1.70 0. 0. 1.00 2018-04-01 00:00:02 0.870 76. 1.65 1.00 1.00 0.750
3 1 1 A 0.870 76. 1.65 1.00 1.00 0.750 2018-04-01 00:00:04 0.780 32. 1.32 0.860 0. 0.
4 1 1 A 0.780 32. 1.32 0.860 0. 0. 2018-04-01 00:00:04 NA NA NA NA NA NA
5 1 2 B 0.900 67. 0.600 0. 1.00 0.140 2018-05-01 00:00:00 1.20 23. 0.300 0.600 0. 0.
6 1 2 B 1.20 23. 0.300 0.600 0. 0. 2018-05-01 00:00:02 1.40 27. 2.50 1.00 0.0900 1.00
# ... with 12 more variables: Param1_2 <dbl>, Param2_2 <dbl>, Param3_2 <dbl>, Param4_2 <dbl>, Param5_2 <dbl>, Param6_2 <dbl>, Param1_3 <dbl>,
# Param2_3 <dbl>, Param3_3 <dbl>, Param4_3 <dbl>, Param5_3 <dbl>, Param6_3 <dbl>
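`mutate_at()` and `funs()` have since been deprecated in dplyr; on dplyr >= 1.0 the same step can be written with `across()`. A sketch on a small stand-in data frame (two parameters, one group) rather than the full df from the question:

```r
library(dplyr)

# small stand-in for the question's df: one subject/event, two parameters
df <- data.frame(SubjectID = "1", EventNumber = "1",
                 Param1 = c(0.3, 0.21, 0.87, 0.78),
                 Param2 = c(45, 38, 76, 32))

obs <- 3                        # number of "steps into the future"
df.extended <- df
for (i in 1:obs) {
  df.extended <- df.extended %>%
    group_by(SubjectID, EventNumber) %>%
    # lead each parameter by i rows; name the new columns Param1_i, Param2_i, ...
    mutate(across(Param1:Param2, ~ lead(.x, n = i),
                  .names = paste0("{.col}_", i))) %>%
    ungroup()
}
```

As in the answer above, the suffix counts lead steps (`Param1_1` is one row ahead), which is offset by one from the `_2`, `_3`, ... naming in the question.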
For base R, consider by to slice by SubjectID, EventNumber, and EventType, and run a merge using a helper group_num. To run across a series of params, wrap the by process in an lapply, producing a list of data frames that you chain-merge on the outside before the final merge with the original data frame:
df_list <- lapply(2:3, function(i) {
# BUILD LIST OF DATAFRAMES
by_list <- by(df, df[c("SubjectID", "EventNumber", "EventType")], FUN=function(sub){
sub$grp_num <- 1:nrow(sub)
row_less_sub <- transform(sub, AbsoluteTime=NULL, grp_num=grp_num-(i-1))
merge(sub, row_less_sub, by=c("SubjectID", "EventNumber", "EventType", "grp_num"),
all.x=TRUE, suffixes = c("", paste0("_", i)))
})
# APPEND ALL DATAFRAMES IN LIST
grp_df <- do.call(rbind, by_list)
grp_df <- with(grp_df, grp_df[order(SubjectID, EventNumber),])
# KEEP NEEDED COLUMNS
grp_df <- grp_df[c("SubjectID", "EventNumber", "EventType", "grp_num",
names(grp_df)[grep("Param[0-9]_", names(grp_df))])]
row.names(grp_df) <- NULL
return(grp_df)
})
# ALL PARAMS_* CHAIN MERGE
params_df <- Reduce(function(x,y) merge(x, y, by=c("SubjectID", "EventNumber", "EventType", "grp_num")), df_list)
# ORIGINAL DF AND PARAMS MERGE
df$grp_num <- ave(df$Param1, df$SubjectID, df$EventNumber, df$EventType,
FUN=function(x) cumsum(rep(1, length(x))))
final_df <- transform(merge(df, params_df, by=c("SubjectID", "EventNumber", "EventType", "grp_num")), grp_num=NULL)
Output
head(final_df, 10)
# SubjectID EventNumber EventType Param1 Param2 Param3 Param4 Param5 Param6 AbsoluteTime Param1_2 Param2_2 Param3_2 Param4_2 Param5_2 Param6_2 Param1_3 Param2_3 Param3_3 Param4_3 Param5_3 Param6_3
# 1 1 1 A 0.30 45 1.50 0.14 0.18 0.00 2018-03-31 17:00:00 0.21 38 1.70 0.00 0.00 1.00 0.87 76 1.65 1.00 1.00 0.75
# 2 1 1 A 0.21 38 1.70 0.00 0.00 1.00 2018-03-31 17:00:02 0.87 76 1.65 1.00 1.00 0.75 0.78 32 1.32 0.86 0.00 0.00
# 3 1 1 A 0.87 76 1.65 1.00 1.00 0.75 2018-03-31 17:00:04 0.78 32 1.32 0.86 0.00 0.00 NA NA NA NA NA NA
# 4 1 1 A 0.78 32 1.32 0.86 0.00 0.00 2018-03-31 17:00:04 NA NA NA NA NA NA NA NA NA NA NA NA
# 5 1 2 B 0.90 67 0.60 0.00 1.00 0.14 2018-04-30 17:00:00 1.20 23 0.30 0.60 0.00 0.00 1.40 27 2.50 1.00 0.09 1.00
# 6 1 2 B 1.20 23 0.30 0.60 0.00 0.00 2018-04-30 17:00:02 1.40 27 2.50 1.00 0.09 1.00 NA NA NA NA NA NA
# 7 1 2 B 1.40 27 2.50 1.00 0.09 1.00 2018-04-30 17:00:04 NA NA NA NA NA NA NA NA NA NA NA NA
# 8 2 1 A 1.30 784 0.40 1.00 1.00 0.00 2018-02-21 17:00:00 0.60 623 1.40 0.18 0.78 1.00 0.45 54 0.67 0.00 0.00 0.27
# 9 2 1 A 0.60 623 1.40 0.18 0.78 1.00 2018-02-21 17:00:02 0.45 54 0.67 0.00 0.00 0.27 0.45 54 0.67 0.00 0.00 0.00
# 10 2 1 A 0.45 54 0.67 0.00 0.00 0.27 2018-02-21 17:00:04 0.45 54 0.67 0.00 0.00 0.00 NA NA NA NA NA NA

Nested reshape from wide to long

I keep on getting all sorts of error messages when trying to reshape an object into long format. Toy data:
d <- structure(c(0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015,
0.006, 0.186, 0.044, 0.016, 0.023, 0.251, 0.044, 0.02, 0.01,
0.268, 0.04, 0.007, 0.007, 0.208, 0.062, 0.027, 0.036, 0.272,
0.054, 0.006, 0.01, 0.274, 0.05, 0.011, 0.006, 0.28, 0.039, 0.007,
0.019, 1.93, 0.345, 0.087, 0.094, 2.007, 0.341, 0.064, 0.061,
1.733, 0.39, 0.131, 0.201, 0.094, 0.01, 0.004, 0, 0.096, 0.014,
0, 0.001, 0.081, 0.016, 0.002, 0.016, 0.062, 0.007, 0.011, 0.001,
0.07, 0.003, 0.005, 0.002, 0.043, 0.033, 0, 0.007, 0.081, 0.039,
0.007, 0, 0.085, 0.033, 0.008, 0, 0.086, 0.023, 0.007, 0.007,
0.083, 0.015, 0, 0, 0.09, 0.009, 0, 0, 0.049, 0.052, 0, 0.025,
2.779, 0.203, 0.098, 0.016, 2.801, 0.242, 0.135, 0.01, 2.12,
0.466, 0.177, 0.121, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 1,
2, 3, 0, 1, 2, 3, 0, 1, 2, 3), .Dim = c(12L, 11L), .Dimnames = list(
c("0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3"), c("age_77", "age_78", "age_79", "age_80", "age_81",
"age_82", "age_83", "age_84", "age_85", "item", "k")))
Basically I have different ages, for which 3 items have been reported with four response categories each. I would like to obtain a long-shaped object with colnames = age, item, k, proportion, like this:
structure(c(77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 1, 1, 1, 1, 2, 2,
2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 1, 2,
3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015, 0.006, 0.186,
0.044, 0.016, 0.023, 0.251, 0.044, 0.02, 0.01, 0.268, 0.04, 0.007,
0.007, 0.208, 0.062, 0.027, 0.036), .Dim = c(24L, 4L), .Dimnames = list(
c("0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3", "0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3"), c("age", "item", "k", "proportion")))
An example I tried:
reshape(as.data.frame(d), varying =1:9, sep = "_", direction = "long",
times = "k", idvar = "item")
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(ids, times[i], :
duplicate 'row.names' are not allowed
Any clue where's my mistake? Thanks a lot beforehand!
The object d as provided by the OP is not a data.frame but a matrix, which is causing the error:
str(d)
num [1:12, 1:11] 0.204 0.036 0.015 0.013 0.208 0.037 0.015 0.006 0.186 0.044 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12] "0" "1" "2" "3" ...
..$ : chr [1:11] "age_77" "age_78" "age_79" "age_80" ...
In addition, the row names are not unique, which causes an error as well when coercing d to a data.frame.
With data.table, d can be coerced to a data.table object and reshaped from wide to long format using melt(). Finally, age is extracted from the column names and stored as integer values as requested by the OP.
library(data.table)
melt(as.data.table(d), measure.vars = patterns("^age_"),
variable.name = "age", value.name = "proportion")[
, age := as.integer(stringr::str_replace(age, "age_", ""))][]
item k age proportion
1: 1 0 77 0.204
2: 1 1 77 0.036
3: 1 2 77 0.015
4: 1 3 77 0.013
5: 2 0 77 0.208
---
104: 2 3 85 0.010
105: 3 0 85 2.120
106: 3 1 85 0.466
107: 3 2 85 0.177
108: 3 3 85 0.121
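For completeness, base reshape() from the question can also be made to work, once the duplicated row names are dropped before coercion and all nine age_ columns are declared as one time-varying measure (a sketch; `times = 77:85` is read off the column names):

```r
# d as defined in the question (matrix with duplicated row names)
d <- structure(c(0.204, 0.036, 0.015, 0.013, 0.208, 0.037, 0.015,
0.006, 0.186, 0.044, 0.016, 0.023, 0.251, 0.044, 0.02, 0.01,
0.268, 0.04, 0.007, 0.007, 0.208, 0.062, 0.027, 0.036, 0.272,
0.054, 0.006, 0.01, 0.274, 0.05, 0.011, 0.006, 0.28, 0.039, 0.007,
0.019, 1.93, 0.345, 0.087, 0.094, 2.007, 0.341, 0.064, 0.061,
1.733, 0.39, 0.131, 0.201, 0.094, 0.01, 0.004, 0, 0.096, 0.014,
0, 0.001, 0.081, 0.016, 0.002, 0.016, 0.062, 0.007, 0.011, 0.001,
0.07, 0.003, 0.005, 0.002, 0.043, 0.033, 0, 0.007, 0.081, 0.039,
0.007, 0, 0.085, 0.033, 0.008, 0, 0.086, 0.023, 0.007, 0.007,
0.083, 0.015, 0, 0, 0.09, 0.009, 0, 0, 0.049, 0.052, 0, 0.025,
2.779, 0.203, 0.098, 0.016, 2.801, 0.242, 0.135, 0.01, 2.12,
0.466, 0.177, 0.121, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 0, 1,
2, 3, 0, 1, 2, 3, 0, 1, 2, 3), .Dim = c(12L, 11L), .Dimnames = list(
c("0", "1", "2", "3", "0", "1", "2", "3", "0", "1", "2",
"3"), c("age_77", "age_78", "age_79", "age_80", "age_81",
"age_82", "age_83", "age_84", "age_85", "item", "k")))

rownames(d) <- NULL             # the duplicated "0".."3" names broke coercion
dd <- as.data.frame(d)
long <- reshape(dd, varying = 1:9, v.names = "proportion",
                timevar = "age", times = 77:85,
                idvar = c("item", "k"), direction = "long")
```

This yields the same 108 rows of (item, k, age, proportion) as the data.table solution.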
