Get row names as column names and combine data from rows with the same name - r

I have a dataframe like this:
X ID X1 X2 X3 X4 X5
BIL 1 1 2 7 1 5
Date 1 12.2 13.5 1.1 26.9 7.9
Year 1 2012 2013 2020 1999 2017
BIL 2 7 9 2 1 5
Date 2 12.2 13.5 1.1 26.9 7.9
Year 2 2022 2063 2000 1989 2015
BIL 3 1 2 7 1 5
Date 3 12.2 13.5 1.1 26.9 7.9
Year 3 2012 2013 2020 1999 2017
I would like to transform it so that I get a new df with BIL, Date, and Year as column names and the values listed in the rows below, for example:
ID BIL Date Year
1 1 1 12.2 2012
2 1 2 13.5 2013
3 1 7
4 1 1
5 1 5
6 2 7 12.2 2022
7 2 9 13.5 2063
Any help would really be appreciated!
Edit: Is there any way to also add a grouping variable like I added above

This strategy will work:
1. Create an ID column by comparing column X with its first value (df[1,1]) and taking the cumulative sum of the matches.
2. Transform into long format, dropping the unwanted column of original variable names with the names_to = NULL argument.
3. Transform back into wide format, this time using the correct variable names.
4. Collect multiple instances into list columns using the values_fn = list argument of pivot_wider.
5. Unnest everything except ID.
df <- read.table(text = 'X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017', header = T)
library(tidyverse)
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID), names_to = NULL) %>%
pivot_wider(id_cols = c(ID), names_from = X, values_from = value, values_fn = list) %>%
unnest(!ID)
#> # A tibble: 15 x 4
#> ID BIL Date Year
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 12.2 2012
#> 2 1 2 13.5 2013
#> 3 1 7 1.1 2020
#> 4 1 1 26.9 1999
#> 5 1 5 7.9 2017
#> 6 2 7 12.2 2022
#> 7 2 9 13.5 2063
#> 8 2 2 1.1 2000
#> 9 2 1 26.9 1989
#> 10 2 5 7.9 2015
#> 11 3 1 12.2 2012
#> 12 3 2 13.5 2013
#> 13 3 7 1.1 2020
#> 14 3 1 26.9 1999
#> 15 3 5 7.9 2017
Created on 2021-05-17 by the reprex package (v2.0.0)
This will also give you the same result:
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID)) %>%
pivot_wider(id_cols = c(ID, name), names_from = X, values_from = value) %>%
select(-name)

Get the data in long format, create a unique row number for each value in the X column, and convert it back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -X) %>%
group_by(X) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = X, values_from = value) %>%
select(-row, -name)
# BIL Date Year
# <dbl> <dbl> <dbl>
# 1 1 12.2 2012
# 2 2 13.5 2013
# 3 7 1.1 2020
# 4 1 26.9 1999
# 5 5 7.9 2017
# 6 7 12.2 2022
# 7 9 13.5 2063
# 8 2 1.1 2000
# 9 1 26.9 1989
#10 5 7.9 2015
#11 1 12.2 2012
#12 2 13.5 2013
#13 7 1.1 2020
#14 1 26.9 1999
#15 5 7.9 2017
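If the grouping ID from the question's edit is also wanted here, the row counter can be folded into blocks of five (one block per original value column X1-X5); a sketch building on the same idea (the column names BIL, Date, Year come from the data):

```r
df %>%
  pivot_longer(cols = -X) %>%
  group_by(X) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = X, values_from = value) %>%
  # five value columns per original row, so blocks of five rows share an ID
  mutate(ID = ceiling(row / (ncol(df) - 1))) %>%
  select(ID, BIL, Date, Year)
```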
In data.table, with melt + dcast:
library(data.table)
dcast(melt(setDT(df), id.vars = 'X'), rowid(X)~X, value.var = 'value')
Data:
df <- structure(list(X = c("BIL", "Date", "Year", "BIL", "Date", "Year",
"BIL", "Date", "Year"), X1 = c(1, 12.2, 2012, 7, 12.2, 2022,
1, 12.2, 2012), X2 = c(2, 13.5, 2013, 9, 13.5, 2063, 2, 13.5,
2013), X3 = c(7, 1.1, 2020, 2, 1.1, 2000, 7, 1.1, 2020), X4 = c(1,
26.9, 1999, 1, 26.9, 1989, 1, 26.9, 1999), X5 = c(5, 7.9, 2017,
5, 7.9, 2015, 5, 7.9, 2017)), class = "data.frame", row.names = c(NA, -9L))

A base R option using reshape (first wide and then long)
p <- reshape(
transform(
df,
id = ave(X, X, FUN = seq_along)
),
direction = "wide",
idvar = "id",
timevar = "X"
)
q <- reshape(
setNames(p, gsub("(.*)\\.(.*)", "\\2.\\1", names(p))),
direction = "long",
idvar = "id",
varying = -1
)
and you will see
id time BIL Date Year
1.X1 1 X1 1 12.2 2012
2.X1 2 X1 7 12.2 2022
3.X1 3 X1 1 12.2 2012
1.X2 1 X2 2 13.5 2013
2.X2 2 X2 9 13.5 2063
3.X2 3 X2 2 13.5 2013
1.X3 1 X3 7 1.1 2020
2.X3 2 X3 2 1.1 2000
3.X3 3 X3 7 1.1 2020
1.X4 1 X4 1 26.9 1999
2.X4 2 X4 1 26.9 1989
3.X4 3 X4 1 26.9 1999
1.X5 1 X5 5 7.9 2017
2.X5 2 X5 5 7.9 2015
3.X5 3 X5 5 7.9 2017

You may unlist the data, reshape it with matrix, coerce it with as.data.frame, and use the first cells of the X column for setNames.
setNames(as.data.frame(t(matrix(unlist(dat[-1]), 3, 15))), dat[1:3, 1])
# BIL Date Year
# 1 1 12.2 2012
# 2 7 12.2 2022
# 3 1 12.2 2012
# 4 2 13.5 2013
# 5 9 13.5 2063
# 6 2 13.5 2013
# 7 7 1.1 2020
# 8 2 1.1 2000
# 9 7 1.1 2020
# 10 1 26.9 1999
# 11 1 26.9 1989
# 12 1 26.9 1999
# 13 5 7.9 2017
# 14 5 7.9 2015
# 15 5 7.9 2017
If you want it less hardcoded, use:
m <- length(unique(dat$X))
n <- ncol(dat[-1]) * nrow(dat) / m
setNames(as.data.frame(t(matrix(unlist(dat[-1]), m, n))), dat[1:m, 1])
Data:
dat <- read.table(header=T, text='X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
')

R: Cumulative Mean Excluding Current Value?

I am working with the R programming language.
I have a dataset that looks something like this:
id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)
my_data = data.frame(id, year,var)
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658
For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).
As an example:
row 1: NA
row 2: 12.186300/1
row 3: (12.186300 + 19.069836)/2
row 4: (12.186300 + 19.069836 + 7.45)/3
row 5: NA
row 6: 20.827933
row 7: (20.827933 + 5.029625)/2
I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:
transform(my_data, cmean = ave(var, id, FUN = cummean) - var)
id year var cmean
1 1 2010 12.186300 0.000000
2 1 2011 19.069836 -3.441768
3 1 2012 7.456078 5.447994
4 1 2013 14.875019 -1.478211
5 2 2012 20.827933 0.000000
6 2 2013 5.029625 7.899154
7 2 2014 -2.260658 10.126291
The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).
I then tried an answer provided here (Compute mean excluding current value):
my_data %>%
group_by(id) %>%
mutate(avg = (sum(var) - var)/(n() - 1))
# A tibble: 7 x 4
# Groups: id [2]
id year var avg
<dbl> <dbl> <dbl> <dbl>
1 1 2010 12.2 13.8
2 1 2011 19.1 11.5
3 1 2012 7.46 15.4
4 1 2013 14.9 12.9
5 2 2012 20.8 1.38
6 2 2013 5.03 9.28
7 2 2014 -2.26 12.9
But it is still not working.
Can someone please show me what I am doing wrong and what I can do to fix this problem?
Thanks!
my_data %>%
group_by(id) %>%
mutate(avg = lag(cummean(var)))
# A tibble: 7 × 4
# Groups: id [2]
id year var avg
<int> <int> <dbl> <dbl>
1 1 2010 12.2 NA
2 1 2011 19.1 12.2
3 1 2012 7.46 15.6
4 1 2013 14.9 12.9
5 2 2012 20.8 NA
6 2 2013 5.03 20.8
7 2 2014 -2.26 12.9
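For completeness, the same exclude-current running mean can be written in base R with ave, which the question already uses; a minimal sketch:

```r
# Base R: cumulative mean shifted down one row within each id
transform(my_data, avg = ave(var, id, FUN = function(x) {
  cm <- cumsum(x) / seq_along(x)  # running mean including the current row
  c(NA, cm[-length(cm)])          # shift so each row sees only earlier rows
}))
```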
With the help of some intermediate variables you can do it like so:
library(dplyr)
df <- read.table(text = "
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658", header=T)
df |>
group_by(id) |>
# equivalently: mutate(avg = lag(cummean(var)))
mutate(id_g = row_number()) |>
mutate(ms = cumsum(var)) |>
mutate(cm = lag(ms / id_g)) |>
select(-id_g, -ms)
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> id year var cm
#> <int> <int> <dbl> <dbl>
#> 1 1 2010 12.2 NA
#> 2 1 2011 19.1 12.2
#> 3 1 2012 7.46 15.6
#> 4 1 2013 14.9 12.9
#> 5 2 2012 20.8 NA
#> 6 2 2013 5.03 20.8
#> 7 2 2014 -2.26 12.9

creating a dummy variable with consecutive cases

I have a problem similar to this one:
How can I create a dummy variable over consecutive values by group id?
the difference is: as soon as I have Dummy = 1, I want my dummy to stay 1 for the rest of the group (ID), since year is in ascending order. So for example, out of df1:
df1 <- data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017), 3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0, 0,1,0,1, 1,0,0,0))
shall be:
df2 <- data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017), 3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,1, 0,1,1,1, 1,1,1,1))
I've tried something like this (and some other attempts) but it failed:
df2<- df1%>% group_by(ID) %>% arrange(ID , year) %>%
mutate(treated = case_when(Dummy == 1 ~ 1,
lag(Dummy, n= unique(n()), default = 0) == 1 ~ 1))
If your input data is as below then we can just use cummax():
library(dplyr)
df1 <-data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0 ,0,1,0,1, 1,0,0,0))
df1
#> ID year value Dummy
#> 1 1 2014 14.144996 0
#> 2 1 2015 20.621603 0
#> 3 1 2016 8.325170 1
#> 4 1 2017 21.725028 0
#> 5 2 2014 11.894383 0
#> 6 2 2015 13.445744 1
#> 7 2 2016 3.332338 0
#> 8 2 2017 2.984941 1
#> 9 3 2014 17.551266 1
#> 10 3 2015 5.250556 0
#> 11 3 2016 11.062577 0
#> 12 3 2017 20.169439 0
df1 %>%
group_by(ID) %>%
mutate(Dummy = cummax(Dummy))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID year value Dummy
#> <int> <dbl> <dbl> <dbl>
#> 1 1 2014 14.1 0
#> 2 1 2015 20.6 0
#> 3 1 2016 8.33 1
#> 4 1 2017 21.7 1
#> 5 2 2014 11.9 0
#> 6 2 2015 13.4 1
#> 7 2 2016 3.33 1
#> 8 2 2017 2.98 1
#> 9 3 2014 17.6 1
#> 10 3 2015 5.25 1
#> 11 3 2016 11.1 1
#> 12 3 2017 20.2 1
Created on 2022-10-14 by the reprex package (v2.0.1)
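The same per-group cumulative maximum is also available in base R via ave, if you'd rather avoid dplyr; a one-line sketch:

```r
# Base R: once Dummy hits 1 it stays 1 for the rest of each ID
df1$Dummy <- ave(df1$Dummy, df1$ID, FUN = cummax)
```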

Interpolate df column within each group

I have a data frame df and a sample vector years of the following kind:
> df <- data.frame(year = rep(c(2000, 2025, 2030, 2050), 2),
type = rep(c('a', 'b'), each = 4),
value = c(3, 9, 8, 6, 7, 5, 2, 10))
> years = seq(2010, 2050, 10)
> df
year type value
1 2000 a 3
2 2025 a 9
3 2030 a 8
4 2050 a 6
5 2000 b 7
6 2025 b 5
7 2030 b 2
8 2050 b 10
> years
[1] 2010 2020 2030 2040 2050
Now I would like to interpolate value within each group of type to get the values for years. My expected result looks like this (where values for 2010, 2020 and 2040 are interpolated):
> result
year type value
1 2010 a 5.4
2 2020 a 7.8
3 2030 a 8
4 2040 a 7
5 2050 a 6
6 2010 b 6.2
7 2020 b 5.4
8 2030 b 2
9 2040 b 6
10 2050 b 10
I have tried something like this but did not succeed as I am not allowed to change the length of the group. Any help is very much appreciated!
> result <- df %>%
group_by(type) %>%
mutate(year = years,
value = approx(year, value, years)$y)
Error: Problem with `mutate()` input `year`.
x Input `year` can't be recycled to size 4.
i Input `year` is `years`.
i Input `year` must be size 4 or 1, not 5.
i The error occurred in group 1: type = "a".
We can use complete to get the full sequence of years per 'type' and then apply approx:
library(dplyr)
library(tidyr)
df %>%
complete(year = years, type) %>%
group_by(type) %>%
mutate(value = approx(year, value, year)$y) %>%
ungroup %>%
arrange(type, year)
-output
# A tibble: 14 x 3
# year type value
# <dbl> <chr> <dbl>
# 1 2000 a 3
# 2 2010 a 5.4
# 3 2020 a 7.8
# 4 2025 a 9
# 5 2030 a 8
# 6 2040 a 7
# 7 2050 a 6
# 8 2000 b 7
# 9 2010 b 6.2
#10 2020 b 5.4
#11 2025 b 5
#12 2030 b 2
#13 2040 b 6
#14 2050 b 10
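Note that the output above also keeps the original years (2000 and 2025). If, as in the expected result, only the years in years should remain, a filter step can be appended; a sketch:

```r
df %>%
  complete(year = years, type) %>%
  group_by(type) %>%
  mutate(value = approx(year, value, year)$y) %>%
  ungroup() %>%
  filter(year %in% years) %>%  # keep only the requested years
  arrange(type, year)
```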

For loop in R to rewrite initial datasets

UPD:
Here is what I need.
Example of some datasets are here (I have 8 of them):
https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
What I need is:
For example, in those datasets there is lev variable. Let's say this is a snapshot of the data in these datasets:
ID Year lev
1 2011 0.19
1 2012 0.19
1 2013 0.21
1 2014 0.18
2 2013 0.39
2 2014 0.15
2 2015 0.47
2 2016 0.35
3 2013 0.30
3 2015 0.1
3 2017 0.13
3 2018 0.78
4 2011 0.13
4 2012 0.35
Now, in each of my datasets (EE_AB, EE_C, EE_H, etc.), I need to create variables ff1 and ff2, where each ID's value in a given year is divided by the median of that variable across all IDs in that year.
Let's take an example of the year 2011. The median of the variable lev in this dataset in 2011 is (0.19+0.13)/2 = 0.16, so ff1 for ID 1 in 2011 should be 0.19/0.16 = 1.1875, and for ID 4 in 2011 ff1 = 0.13/0.16 = 0.8125.
Now let's take the example of 2013. The median lev is 0.3. so ff1 for ID 1, 2, 3 will be 0.7, 1.3, 1 respectively.
The desired output should be the ff1 variable in each dataset (e.g., EE_AB, EE_C, EE_H) as:
ID Year lev ff1
1 2011 0.19 1.1875
1 2012 0.19 0.7037
1 2013 0.21 0.7
1 2014 0.18 1.0909
2 2013 0.39 1.3
2 2014 0.15 0.9091
2 2015 0.47 1.6491
2 2016 0.35 1
3 2013 0.30 1
3 2015 0.1 0.3509
3 2017 0.13 1
3 2018 0.78 1
4 2011 0.13 0.8125
4 2012 0.35 1.2963
And this should be in the same way for other dataframes.
Here's a tidyverse method:
library(dplyr)
# library(purrr)
data_frameAB %>%
group_by(Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>%
ungroup()
# # A tibble: 14 x 5
# ID Year c d ff1
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2011 10 12 2.2
# 2 1 2012 11 13 2.18
# 3 1 2013 12 14 2.17
# 4 1 2014 13 15 2.15
# 5 1 2015 14 16 2.14
# 6 1 2016 15 34 3.27
# 7 1 2017 16 25 2.56
# 8 1 2018 17 26 2.53
# 9 1 2019 18 56 4.11
# 10 15 2015 23 38 2.65
# 11 15 2016 26 25 1.96
# 12 15 2017 30 38 2.27
# 13 45 2011 100 250 3.5
# 14 45 2012 200 111 1.56
Without purrr, that inner expression would be
mutate(ff1 = (c+d) / mapply(median, c, d))
albeit without map2_dbl's type-safety.
Since you have multiple frames in your data management, I have two suggestions:
Combine them into a list. This recommendation stems from the assumption that whatever you do to one frame you will likely do to all of them. In that case, you can use lapply or purrr::map on the list of frames, doing all frames in one step. See https://stackoverflow.com/a/24376207/3358227.
list_of_frames <- list(AB=data_frameAB, C=data_frameC, F=data_frameF)
list_of_frames2 <- purrr::map(
list_of_frames,
~ .x %>%
group_by(Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>% ungroup()
)
Again, without purrr, that would be
list_of_frames2 <- lapply(
list_of_frames,
function(.x) group_by(.x, Year) %>%
mutate(ff1 = (c+d) / mapply(median, c, d)) %>%
ungroup()
)
Combine them into one frame, preserving the original data. Starting with list_of_frames,
bind_rows(list_of_frames, .id = "Frame") %>%
group_by(Frame, Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>%
ungroup()
# # A tibble: 42 x 6
# Frame ID Year c d ff1
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB 1 2011 10 12 2.2
# 2 AB 1 2012 11 13 2.18
# 3 AB 1 2013 12 14 2.17
# 4 AB 1 2014 13 15 2.15
# 5 AB 1 2015 14 16 2.14
# 6 AB 1 2016 15 34 3.27
# 7 AB 1 2017 16 25 2.56
# 8 AB 1 2018 17 26 2.53
# 9 AB 1 2019 18 56 4.11
# 10 AB 15 2015 23 38 2.65
# # ... with 32 more rows
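For the lev snapshot shown in the question, the same grouped idea reduces to dividing by the per-year median directly; a sketch on the question's lev column (the c and d columns above come from the linked files):

```r
library(dplyr)
df %>%
  group_by(Year) %>%
  mutate(ff1 = lev / median(lev)) %>%  # each lev relative to its year's median
  ungroup()
```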

how to replace missing values with previous year's binned mean

I have a data frame as below
Bin_p1 and Bin_f1 are calculated by me with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how do I calculate the mean for the previous two years and replace the NA values within that bin?
For example: at row 5, p1 is NA and f1 is 30 with corresponding bin 7. Now replace the NA with the mean of the previous 2 years for the same bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
