Transpose only certain columns - data formatting [duplicate] - r

I'm trying to reshape some data and currently have the following:
df <- data.frame(year = c(2010,2011,2010,2011),A = c(10,11,10,11),B = c(11,12,11,12))
year A B
1 2010 10 11
2 2011 11 12
3 2010 10 11
4 2011 11 12
I want it to look like this, but do not know how to do it. Can anyone help me?
company year Variable
1 A 2010 10
2 A 2011 11
3 B 2010 11
4 B 2011 12

We can use pivot_longer:
library(tidyr)
pivot_longer(df, cols = -year,
             names_to = 'company', values_to = 'Variable')
Output:
# A tibble: 8 × 3
year company Variable
<dbl> <chr> <dbl>
1 2010 A 10
2 2010 B 11
3 2011 A 11
4 2011 B 12
5 2010 A 10
6 2010 B 11
7 2011 A 11
8 2011 B 12
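Note that the output above has 8 rows, while the layout asked for in the question has 4, because the input repeats each year. A minimal sketch of one way to get the 4-row layout, assuming the repeated rows are exact duplicates that can be dropped (the name long and the use of distinct() and arrange() are my additions, not part of the original answer):

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = c(2010, 2011, 2010, 2011),
                 A = c(10, 11, 10, 11),
                 B = c(11, 12, 11, 12))

long <- df %>%
  distinct() %>%                          # assumption: repeated rows are true duplicates
  pivot_longer(cols = -year,
               names_to = "company", values_to = "Variable") %>%
  arrange(company, year) %>%              # all A rows first, then B
  select(company, year, Variable)         # column order used in the question

long
```

If the repeated rows are not meant to be duplicates, skip distinct() and keep all 8 rows.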

Related

Ranking of values in one quarter [duplicate]

I am trying to rank the Price values within separate partitions. Below you can see my data:
df<-data.frame( year=c(2010,2010,2010,2010,2010,2010),
quarter=c("q1","q1","q1","q2","q2","q2"),
Price=c(10,20,30,10,20,30)
)
df
Now I want to rank within each quarter, and I expect to get 1 for the smallest Price and 3 for the highest Price:
df %>% group_by(quarter) %>% mutate(id = row_number(Price))
Instead of the expected results, I got something different: rather than ranking within each quarter separately, the ranking runs across both quarters.
Can anybody help me solve this so that each quarter is ranked separately?
You probably want rank.
transform(df, id=ave(Price, year, quarter, FUN=rank))
# year quarter Price id
# 1 2010 q1 10 1
# 2 2010 q1 20 2
# 3 2010 q1 30 3
# 4 2010 q2 10 1
# 5 2010 q2 20 2
# 6 2010 q2 30 3
With dplyr, use dense_rank:
library(dplyr)
df %>%
  group_by(quarter) %>%
  mutate(id = dense_rank(Price)) %>%
  ungroup()
# A tibble: 6 × 4
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
In dplyr 1.1.0 and later, you can also use .by in mutate:
df %>%
  mutate(id = dense_rank(Price), .by = 'quarter')
year quarter Price id
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
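Alternatively, with row_number(). Called without an argument it numbers rows in their existing order within each group, so it matches the ranking here only because Price is already sorted within each quarter:
library(tidyverse)
df %>% group_by(year, quarter) %>% mutate(id=row_number())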
Created on 2023-02-12 with reprex v2.0.2
# A tibble: 6 × 4
# Groups: year, quarter [2]
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
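The answers above give identical results on this data because there are no ties and Price is already sorted. As a rough illustration of where they diverge, here is a small sketch on made-up data with a tie in q1 (the data frame prices and the column names rn, minr, dense and avg are hypothetical):

```r
library(dplyr)

# Hypothetical data: q1 contains a tie, q2 is sorted in descending order
prices <- data.frame(
  quarter = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Price   = c(10, 10, 30, 30, 20, 10)
)

ranked <- prices %>%
  group_by(quarter) %>%
  mutate(
    rn    = row_number(Price),  # ties broken by order of appearance
    minr  = min_rank(Price),    # ties share the lowest rank, gap afterwards
    dense = dense_rank(Price),  # ties share a rank, no gaps
    avg   = rank(Price)         # base R default: tied ranks are averaged
  ) %>%
  ungroup()

ranked
```

Which one you want depends on how ties should be treated; for data without ties, as in the question, all four agree.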

How to clean messy date formats in a dataframe using R

What is a quick way to clean a column with multiple date formats and obtain only the year?
Suppose in R there is a data frame (df) as below, which has a Date column of character strings in different date formats.
df <- data.frame(z= paste("Date",seq(1:10)), Date=c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
"9/27/2011","2019/2020", "2017-10/2018-12", "NA", "" ))
df:
z Date
1 Date 1 2000-10-22
2 Date 2 9/21/2001
3 Date 3 2003
4 Date 4 2017/2018
5 Date 5 9/28/2010
6 Date 6 9/27/2011
7 Date 7 2019/2020
8 Date 8 2017-10/2018-12
9 Date 9 NA
10 Date 10
What is a quick way in R to extract the years (e.g. 2003, 2010) from the Date column? For cells that contain two years, the first year should be selected.
The expected output would look like this:
z Date year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10
Use extract from tidyr. If there are two years it will use the first.
library(dplyr)
library(tidyr)
df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10 NA
If the second year is needed as well then:
df %>%
extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year Year2
1 Date 1 2000-10-22 2000 NA
2 Date 2 9/21/2001 2001 NA
3 Date 3 2003 2003 NA
4 Date 4 2017/2018 2017 2018
5 Date 5 9/28/2010 2010 NA
6 Date 6 9/27/2011 2011 NA
7 Date 7 2019/2020 2019 2020
8 Date 8 2017-10/2018-12 2017 2018
9 Date 9 NA NA NA
10 Date 10 NA NA
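If you would rather avoid the tidyr dependency, the first-year extraction can also be sketched in base R with regexpr and regmatches. This is only an illustration under the same assumption as above, namely that the first run of four digits in a cell is the year:

```r
df <- data.frame(z = paste("Date", 1:10),
                 Date = c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
                          "9/27/2011", "2019/2020", "2017-10/2018-12", "NA", ""))

# Position of the first run of four digits in each string (-1 where there is none)
m <- regexpr("[0-9]{4}", df$Date)

# Start with NA everywhere, then fill in only the rows that had a match
df$Year <- NA_integer_
df$Year[m != -1] <- as.integer(regmatches(df$Date, m))

df
```

regmatches returns only the matched pieces, which is why the assignment is restricted to the rows where m is not -1.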

Convert data from wide format to long format with multiple measure columns [duplicate]

I want to do this but the exact opposite. So say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A  1      4      7      10     13     16
B  2      5      8      11     14     17
C  3      6      9      12     15     18
but with a lot more measure variables (i.e. also Z_1990, etc.). How can I get it so that the year becomes a variable and it will keep the different measures, like this:
ID Year X Y
A  1990 1 10
A  2000 4 13
A  2010 7 16
B  1990 2 11
B  2000 5 14
B  2010 8 17
C  1990 3 12
C  2000 6 15
C  2010 9 18
You may use pivot_longer with names_sep argument.
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
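In the output above Year comes back as a character column. If a numeric year is wanted, one possible sketch is to pass names_transform to pivot_longer (available in tidyr 1.1.0 and later, as far as I know); the variable name long is just for illustration:

```r
library(tidyr)

df <- data.frame(ID = c("A", "B", "C"),
                 X_1990 = 1:3, X_2000 = 4:6, X_2010 = 7:9,
                 Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18)

long <- pivot_longer(df, cols = -ID,
                     names_to = c(".value", "Year"),   # part before "_" becomes the column
                     names_sep = "_",
                     names_transform = list(Year = as.integer))  # convert "1990" to 1990L

long
```

The same call should pick up further measures such as Z_1990 without changes, as long as every column name follows the measure_year pattern.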

Combine data from many rows into columns

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
Another option is to use functions from dplyr and tidyr (dt is the data frame from the question):
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
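spread() has since been superseded in tidyr by pivot_wider(). A rough equivalent of the pipeline above, assuming the data frame is called df and has the columns shown in the question (the name wide is hypothetical):

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = rep(2011:2013, each = 3),
                 Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3))

wide <- df %>%
  group_by(year) %>%
  mutate(ID = row_number()) %>%    # position of each value within its year
  ungroup() %>%
  pivot_wider(names_from = year, values_from = Male) %>%
  select(-ID)                      # the helper index is no longer needed

wide
```

The helper ID column is what lets pivot_wider line up the first, second and third value of each year in the same row; years with fewer values would get NA in the missing positions.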
1. If every year has the same number of observations, you could split the data and cbind it using base R:
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If the years do not all have the same number of observations, you could use rbind.fill.matrix from plyr:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert the data from long to wide format:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

How to replace missing values with previous year's binned mean

I have a data frame as below.
Bin_p1 and Bin_f1 were calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean of the previous two years within a bin and use it to replace the NA values in that bin?
For example, in row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of p1 over the previous two years for the same bin (7), i.e. df should become df1:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
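The rev(lag(...))[1:2] trick relies on the NA being the last row of its group. A slightly more explicit sketch of the same idea is shown below; fill_prev2 is a hypothetical helper (not part of dplyr), and the grouping follows the answer above, i.e. f1 is filled within Bin_p1 and p1 within Bin_f1:

```r
library(dplyr)

df <- data.frame(
  ID     = 1:6,
  year   = c(2013, 2013, 2014, 2014, 2015, 2016),
  p1     = c(20, 24, 10, 11, NA, 10),
  f1     = c(30, 29, 16, 17, 30, NA),
  Bin_p1 = c(5, 5, 2, 2, NA, 2),
  Bin_f1 = c(7, 7, 3, 3, 7, NA)
)

# Hypothetical helper: replace each NA with the mean of the (up to) two
# non-missing values that precede it; assumes x is already in year order
fill_prev2 <- function(x) {
  out <- x
  for (i in which(is.na(x))) {
    before <- x[seq_len(i - 1)]             # everything above position i
    prev <- tail(before[!is.na(before)], 2) # last two observed values
    if (length(prev) > 0) out[i] <- mean(prev)
  }
  out
}

filled <- df %>%
  arrange(year) %>%
  group_by(Bin_p1) %>%
  mutate(f1 = fill_prev2(f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = fill_prev2(p1)) %>%
  ungroup()

filled
```

An NA with no earlier observation in its group is left as NA rather than guessed.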
