Hi, I have a bunch of hydrological data on streamflow (Q) that I want to standardize. The data is stored in a large nested table with a layout like the one below, which I need to keep:
Flowtestlist <- list(list("910" = data.frame( Q=c(650, 720, 550, 580, 800)),
"950" = data.frame( Q=c(550, 770, 520, 540, 790))),
list ("910" = data.frame( Q=c(450, 620, 750, 580, 800)),
"950" = data.frame( Q=c(650, 750, 580, 520, 890))))
The example has levels [[1]] and [[2]]; in reality I have 9 of them, and they correspond to model numbers. Within each model I have 18 subbasins numbered 910, 950, 1012, 1087 etc. (the example above uses just two subbasins, 910 and 950, for simplicity). Each subbasin contains data on streamflow (Q).
There's also a lookup table:
test_model <- c(1,1,2,2)
test_subbasin <- c(910,950,910,950)
Q_mean <- c(870,765,823,689)
FlowtestDF <- data.frame(test_model, test_subbasin, Q_mean)
This data frame holds the mean streamflow (Q_mean) over the reference period for each model and subbasin. I want to take each Q from the nested table, find the matching model number and subbasin in the lookup table, and divide Q by that Q_mean to get the standardized streamflow Q_st.
fun_st <- function(x, y=FlowtestDF) {
x$Q_st <- x$Q/y$Q_mean
x <- x
}
testresult <- lapply(Flowtestlist, lapply, fun_st)
It doesn't work. As I understand it, the function can't find the right entry in the lookup table (by model and subbasin). How can I make this work while keeping the nested table structure of the data?
Are you looking for this?
Map(\(x, y) Map(\(d, m) d / m,  # inner Map pairs each subbasin with its own Q_mean
                y[match(x$test_subbasin, names(y))], x$Q_mean),
    split(FlowtestDF, FlowtestDF$test_model),
    Flowtestlist)
# $`1`
# $`1`$`910`
#           Q
# 1 0.7471264
# 2 0.8275862
# 3 0.6321839
# 4 0.6666667
# 5 0.9195402
#
# $`1`$`950`
#           Q
# 1 0.7189542
# 2 1.0065359
# 3 0.6797386
# 4 0.7058824
# 5 1.0326797
#
#
# $`2`
# $`2`$`910`
#           Q
# 1 0.5467801
# 2 0.7533414
# 3 0.9113001
# 4 0.7047388
# 5 0.9720535
#
# $`2`$`950`
#           Q
# 1 0.9433962
# 2 1.0885341
# 3 0.8417997
# 4 0.7547170
# 5 1.2917271
Note: If you're (still) using R<4.1, instead of e.g. \(x, y) use function(x, y).
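If the goal is to keep `Q` and add `Q_st` as a new column inside each nested data frame, an explicit lookup by model and subbasin is one option. The following is only a sketch under assumptions: it assumes the position of each model in `Flowtestlist` equals `test_model` in the lookup table, and the helper name `standardize_nested` is made up for this illustration.

```r
# Sample data from the question
Flowtestlist <- list(
  list("910" = data.frame(Q = c(650, 720, 550, 580, 800)),
       "950" = data.frame(Q = c(550, 770, 520, 540, 790))),
  list("910" = data.frame(Q = c(450, 620, 750, 580, 800)),
       "950" = data.frame(Q = c(650, 750, 580, 520, 890)))
)
FlowtestDF <- data.frame(test_model    = c(1, 1, 2, 2),
                         test_subbasin = c(910, 950, 910, 950),
                         Q_mean        = c(870, 765, 823, 689))

# Hypothetical helper; ASSUMPTION: list position == test_model
standardize_nested <- function(flows, lookup) {
  lapply(seq_along(flows), function(m) {
    basins <- flows[[m]]
    out <- lapply(names(basins), function(s) {
      # pick the single reference mean for this model/subbasin pair
      ref <- lookup$Q_mean[lookup$test_model == m &
                           lookup$test_subbasin == as.numeric(s)]
      stopifnot(length(ref) == 1)
      d <- basins[[s]]
      d$Q_st <- d$Q / ref
      d
    })
    setNames(out, names(basins))  # keep the subbasin names
  })
}

testresult <- standardize_nested(Flowtestlist, FlowtestDF)
testresult[[2]][["950"]]
```

The `stopifnot()` call makes a missing or duplicated lookup entry fail loudly instead of silently recycling values.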
The processing is easier if the data is in a flat data frame. If for some reason you have to keep the nested structure, you can split the result again afterwards.
library(dplyr)
library(purrr)
map_df(Flowtestlist, ~bind_rows(., .id = 'test_subbasin'), .id = 'test_model') %>%
type.convert(as.is = TRUE) %>%
left_join(FlowtestDF, by = c('test_subbasin', 'test_model')) %>%
mutate(Q_st = Q/Q_mean) %>%
split(.$test_model) %>%
map(~.x %>% select(Q, Q_st) %>% split(.x$test_subbasin))
#$`1`
#$`1`$`910`
# Q Q_st
#1 650 0.7471264
#2 720 0.8275862
#3 550 0.6321839
#4 580 0.6666667
#5 800 0.9195402
#$`1`$`950`
# Q Q_st
#6 550 0.7189542
#7 770 1.0065359
#8 520 0.6797386
#9 540 0.7058824
#10 790 1.0326797
#$`2`
#$`2`$`910`
# Q Q_st
#11 450 0.5467801
#12 620 0.7533414
#13 750 0.9113001
#14 580 0.7047388
#15 800 0.9720535
#$`2`$`950`
# Q Q_st
#16 650 0.9433962
#17 750 1.0885341
#18 580 0.8417997
#19 520 0.7547170
#20 890 1.2917271
library(tidyr)
extr <- function(x){
a <- data.frame(x)
names(a) <- names(x)
a$test_model <- parent.frame()$i
a <- pivot_longer(a,setdiff(names(a),'test_model'),names_to = 'test_subbasin',values_to = 'Q')
a
}
to_df <- lapply(Flowtestlist,extr)
df <- do.call(rbind,to_df)
with_lookup <- merge(df,FlowtestDF,by =c('test_model','test_subbasin'))
with_lookup$Q_st <- with_lookup$Q/with_lookup$Q_mean
with_lookup
Output:
test_model test_subbasin Q Q_mean Q_st
<int> <chr> <dbl> <dbl> <dbl>
1 1 910 650 870 0.747
2 1 910 720 870 0.828
3 1 910 550 870 0.632
4 1 910 580 870 0.667
5 1 910 800 870 0.920
6 1 950 550 765 0.719
7 1 950 770 765 1.01
8 1 950 520 765 0.680
9 1 950 540 765 0.706
10 1 950 790 765 1.03
11 2 910 450 823 0.547
12 2 910 620 823 0.753
13 2 910 750 823 0.911
14 2 910 580 823 0.705
15 2 910 800 823 0.972
16 2 950 650 689 0.943
17 2 950 750 689 1.09
18 2 950 580 689 0.842
19 2 950 520 689 0.755
20 2 950 890 689 1.29
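Since the question asks to keep the nested layout, a flat result like this one can be split back by model and then by subbasin. A minimal base R sketch, using a small stand-in for `with_lookup` (two rows per model/subbasin pair, made up from the question's numbers) rather than the full table:

```r
# Small stand-in for `with_lookup` (ASSUMPTION: same column names as above)
with_lookup <- data.frame(
  test_model    = c(1, 1, 1, 1, 2, 2, 2, 2),
  test_subbasin = c("910", "910", "950", "950", "910", "910", "950", "950"),
  Q             = c(650, 720, 550, 770, 450, 620, 650, 750),
  Q_mean        = c(870, 870, 765, 765, 823, 823, 689, 689)
)
with_lookup$Q_st <- with_lookup$Q / with_lookup$Q_mean

# Split by model first, then by subbasin within each model
nested <- lapply(split(with_lookup, with_lookup$test_model), function(m) {
  split(m[c("Q", "Q_st")], m$test_subbasin)
})

names(nested)         # "1" "2"
names(nested[["1"]])  # "910" "950"
```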
The following will derive the required output ...
flat <- unlist(setNames(Flowtestlist, seq_along(Flowtestlist)))  # names like "1.910.Q1"
df <- data.frame(Q = flat, ref = names(flat), row.names = NULL)
key <- sub("\\.Q[0-9]+$", "", df$ref)  # "model.subbasin"
df$Q_st <- df$Q / FlowtestDF$Q_mean[match(key, paste(FlowtestDF$test_model, FlowtestDF$test_subbasin, sep = "."))]
df
#      Q      ref      Q_st
# 1  650 1.910.Q1 0.7471264
# 2  720 1.910.Q2 0.8275862
# 3  550 1.910.Q3 0.6321839
# 4  580 1.910.Q4 0.6666667
# 5  800 1.910.Q5 0.9195402
# 6  550 1.950.Q1 0.7189542
# 7  770 1.950.Q2 1.0065359
# 8  520 1.950.Q3 0.6797386
# 9  540 1.950.Q4 0.7058824
# 10 790 1.950.Q5 1.0326797
# 11 450 2.910.Q1 0.5467801
# 12 620 2.910.Q2 0.7533414
# 13 750 2.910.Q3 0.9113001
# 14 580 2.910.Q4 0.7047388
# 15 800 2.910.Q5 0.9720535
# 16 650 2.950.Q1 0.9433962
# 17 750 2.950.Q2 1.0885341
# 18 580 2.950.Q3 0.8417997
# 19 520 2.950.Q4 0.7547170
# 20 890 2.950.Q5 1.2917271
Thank you all for your answers. I thought I was smarter than I am and had hoped I would understand at least some of them. I think I also messed up the presentation of my data, so I have edited my post to show my sample data better. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about converting wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates the new columns from parts of the old column names, and names_pattern says which part to use. names_pattern takes a regular expression with a capture group. In this case, here is the breakdown:
(.+) # capture group: one or more of any character; whatever it captures becomes the ".value" column name
[0-9] # a single digit - the number at the end stays outside the group
# and is dropped from ".value". If the suffix can have several digits,
# use ([^0-9]+)[0-9]+ as the whole pattern, because a greedy (.+) would keep all but the last digit
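To see what the capture group picks up when a suffix has more than one digit, the two patterns can be compared with base R's `sub()`. This is only an illustration of the regex behaviour; the column names below are made up and `pivot_longer()` is not called here.

```r
# Hypothetical column names, two of them with a two-digit suffix
cols <- c("T1", "measurement1", "T10", "measurement10")

# Greedy (.+) followed by ONE digit: two-digit suffixes keep their first digit
greedy <- sub("(.+)[0-9]", "\\1", cols)
greedy
# "T" "measurement" "T1" "measurement1"

# Non-digit prefix followed by one or more digits: the suffix is removed entirely
strict <- sub("([^0-9]+)[0-9]+", "\\1", cols)
strict
# "T" "measurement" "T" "measurement"
```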
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together, but leaves us without pid, which we can add by using rep to repeat the original pid column 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not the order you expected. If that is a concern, you can sort the data.frame by pid and, in the same step, keep only the pid, Tdays and measurement columns (columns 1, 4 and 2):
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
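The stacked result still contains rows where nothing was recorded (row 37 above), whereas the desired output in the question lists only real measurements. One way to drop them is sketched below on a reduced stand-in for `result`; treating a row as empty when either `Time` or `Value` is `NA` is an assumption.

```r
# Reduced sample in the same shape as `result` after sorting and renaming
result <- data.frame(
  pid   = c(1, 1, 1, 1, 2, 2),
  Time  = c(1435, 1405, 1374, NA, 1848, 1818),
  Value = c(1356, 1483, 1563, NA, 943, 1173)
)

# Keep only rows where both Time and Value were actually recorded
result_clean <- result[complete.cases(result[c("Time", "Value")]), ]
rownames(result_clean) <- NULL  # renumber rows 1..n
result_clean
```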
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Consider a data frame df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
names(tobind) = names(finalDf)  # rbind() on data frames requires matching column names
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
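For reference, here is a runnable sketch of the same `reshape()` call on a reduced sample (three rows, two measurement/Tdays pairs), followed by an optional clean-up step that is not part of the call above: dropping empty pairs and renaming the columns to the layout asked for in the question.

```r
# Reduced sample with the same naming scheme as the question's data
data <- read.table(header = TRUE, text = "
pid measurement1 Tdays1 measurement2 Tdays2
1   1356         1435   1483         1405
2   943          1848   1173         1818
3   1590         185    NA           NA
")

long <- reshape(
  setNames(data, gsub("(\\d+)$", ".\\1", names(data))),  # measurement1 -> measurement.1
  direction = "long",
  varying = 2:ncol(data)
)

# Drop empty pairs and keep the columns asked for in the question
long <- long[!is.na(long$measurement), c("pid", "Tdays", "measurement")]
names(long) <- c("PID", "Time", "Value")
long <- long[order(long$PID), ]
long
```

The renaming step matters because `reshape()` guesses the variable names and time points by splitting the column names at the `.` separator.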
I am a beginner in R and I have never processed this type of data before. I have the following two sample data sets (df1 and df2), which look like this:
df1 <- c("{\"\"Wednesday\"\":4,\"\"Monday\"\":5,\"\"Saturday\"\":4,\"\"Thursday\"\":4,\"\"Tuesday\"\":5,\"\"Friday\"\":1,\"\"Sunday\"\":5,\"\"Missing day\"\":2}",
"{\"\"Wednesday\"\":6,\"\"Monday\"\":5,\"\"Saturday\"\":2,\"\"Thursday\"\":6,\"\"Tuesday\"\":0,\"\"Friday\"\":2,\"\"Sunday\"\":4,\"\"Missing day\"\":1}",
"{\"\"Wednesday\"\":5,\"\"Monday\"\":5,\"\"Saturday\"\":3,\"\"Thursday\"\":8,\"\"Tuesday\"\":4,\"\"Friday\"\":3,\"\"Sunday\"\":6,\"\"Missing day\"\":4}",
"{\"\"Wednesday\"\":3,\"\"Monday\"\":5,\"\"Saturday\"\":4,\"\"Thursday\"\":1,\"\"Tuesday\"\":5,\"\"Friday\"\":4,\"\"Sunday\"\":4,\"\"Missing day\"\":6}")
df2 <- c("[373,357,382,411,310,315,330,385,367,396,402,348,354,343,392,395,392,401,376,448,341,373,369,304,298,332,366,287,334,222]",
"[319,347,284,313,300,292,228,322,291,275,278,289,323,342,272,242,295,347,290,343,337,309,268,251,256,266,346,260,232,160]",
"[165,154,161,152,164,152,156,150,137,170,147,210,235,190,176,175,191,186,209,157,210,199,162,149,162,165,174,171,178,126]",
"[253,274,240,258,264,231,296,233,230,252,210,233,233,295,235,229,270,275,278,297,255,253,250,252,299,305,310,308,263,141]")
Now, I need to convert df1 into df1_final and df2 into df2_final. Here is what the final data sets should look like:
df1_final <- data.frame("Day"=c("Wednesday","Monday", "Saturday", "Thursday", "Tuesday", "Friday", "Sunday", "Missing day"),
"Count1"=c(4,5,4,4,5,1,5,2),
"Count2"=c(6,5,2,6,0,2,4,1),
"Count3"=c(5,5,3,8,4,3,6,4),
"Count4"=c(3,5,4,1,5,4,4,6))
df2_final <- data.frame("group1"=c(373,357,382,411,310,315,330,385,367,396,402,348,354,343,392,395,392,401,376,448,341,373,369,304,298,332,366,287,334,222), "group2"=c(319,347,284,313,300,292,228,322,291,275,278,289,323,342,272,242,295,347,290,343,337,309,268,251,256,266,346,260,232,160), "group3"=c(165,154,161,152,164,152,156,150,137,170,147,210,235,190,176,175,191,186,209,157,210,199,162,149,162,165,174,171,178,126), "group4"=c(253,274,240,258,264,231,296,233,230,252,210,233,233,295,235,229,270,275,278,297,255,253,250,252,299,305,310,308,263,141))
Can someone please help me figure this out? I would appreciate any help. Thank you!
You could use either reticulate or jsonlite. I will use jsonlite, as below:
For df1:
df1_f <- jsonlite::fromJSON(gsub('"+','"',sprintf("[%s]", paste0(df1, collapse = ","))))
data.frame(Day = names(df1_f), `colnames<-`(t(df1_f), paste0("count",1:4)), row.names = NULL)
Day count1 count2 count3 count4
1 Wednesday 4 6 5 3
2 Monday 5 5 5 5
3 Saturday 4 2 3 4
4 Thursday 4 6 8 1
5 Tuesday 5 0 4 5
6 Friday 1 2 3 4
7 Sunday 5 4 6 4
8 Missing day 2 1 4 6
For df2, the values are plain arrays rather than objects wrapped in {}, so fromJSON returns a matrix and we have to transform it into a data frame manually:
df2_fin <- jsonlite::fromJSON(sprintf("[%s]",paste0(df2, collapse = ",")))
(df2_final <- setNames(data.frame(t(df2_fin)), paste0("group",1:4)))
group1 group2 group3 group4
1 373 319 165 253
2 357 347 154 274
3 382 284 161 240
4 411 313 152 258
5 310 300 164 264
6 315 292 152 231
7 330 228 156 296
8 385 322 150 233
9 367 291 137 230
10 396 275 170 252
11 402 278 147 210
12 348 289 210 233
13 354 323 235 233
:
:
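If adding a package is not an option, the bracketed strings in `df2` can also be parsed with base R only. A rough sketch, using a shortened stand-in for `df2` (three values per string instead of thirty) and assuming every string has the same number of values:

```r
# Shortened stand-in for `df2`: one bracketed, comma-separated string per group
df2 <- c("[373,357,382]", "[319,347,284]", "[165,154,161]", "[253,274,240]")

# Strip the brackets, split on commas, convert each piece to numeric
values <- lapply(strsplit(gsub("\\[|\\]", "", df2), ",", fixed = TRUE), as.numeric)

# ASSUMPTION: all strings hold the same number of values
stopifnot(length(unique(lengths(values))) == 1)

# One column per input string
df2_final <- setNames(as.data.frame(do.call(cbind, values)),
                      paste0("group", seq_along(values)))
df2_final
```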