Convert some rows to columns in r [duplicate] - r

I have a dataset like the one below:
test <- structure(list(SR = c(1L, 1L, 15L, 20L, 20L, 96L, 110L, 110L,
121L, 121L, 130L, 130L, 143L, 143L), Area = structure(c(3L, 3L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 4L, 2L, 2L), .Label = c("FH",
"MO", "TSC", "WMB"), class = "factor"), Period = structure(c(1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("First",
"Second"), class = "factor"), count = c(4L, 6L, 3L, 6L, 6L, 3L,
6L, 6L, 6L, 6L, 6L, 6L, 5L, 6L), countTotal = c(10L, 10L, 3L,
12L, 12L, 3L, 12L, 12L, 12L, 12L, 12L, 12L, 11L, 11L), SumTotal = c(1520,
5769.02, 29346.78, 13316.89, 11932.68, 10173.05, 13243.5, 17131.94,
111189.07, 84123.52, 79463.1, 120010.57, 7035.88, 11520)), .Names = c("SR",
"Area", "Period", "count", "countTotal", "SumTotal"), class = "data.frame", row.names = c(NA,
-14L))
SR Area Period count countTotal SumTotal
1 TSC First 4 10 1520.00
1 TSC Second 6 10 5769.02
15 FH First 3 3 29346.78
20 FH First 6 12 13316.89
20 FH Second 6 12 11932.68
96 FH First 3 3 10173.05
110 MO First 6 12 13243.50
110 MO Second 6 12 17131.94
121 FH First 6 12 111189.07
121 FH Second 6 12 84123.52
130 WMB First 6 12 79463.10
130 WMB Second 6 12 120010.57
143 MO First 5 11 7035.88
143 MO Second 6 11 11520.00
I want to convert some of the rows to columns to make the dataset look like this:
SR Area countTotal First.Count Second.Count First.SumTotal Second.SumTotal
1 TSC 10 4 6 1520.00 5769.02
15 FH 3 3 NA 29346.78 NA
20 FH 12 6 6 13316.89 11932.68
96 FH 3 3 NA 10173.05 NA
110 MO 12 6 6 13243.50 17131.94
121 FH 12 6 6 111189.07 84123.52
130 WMB 12 6 6 79463.10 120010.57
143 MO 11 5 6 7035.88 11520.00
I was trying to use spread from tidyr with this code:
test %>% spread(Period, SumTotal)
but I still get two lines for each SR and Area.
Can someone help?

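You first need to gather the columns you want to spread, then unite the Period column with the new variable column, and finally spread the combined column. Note that count:SumTotal also covers countTotal, so it ends up with First and Second versions as well:
library(dplyr)
library(tidyr)
test %>%
  gather(variable, value, count:SumTotal) %>%             # long format: one row per measure
  unite("variable", Period, variable, sep = ".") %>%      # e.g. "First.count"
  spread(variable, value)                                 # back to wide format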
Result:
SR Area First.count First.countTotal First.SumTotal Second.count Second.countTotal Second.SumTotal
1 1 TSC 4 10 1520.00 6 10 5769.02
2 15 FH 3 3 29346.78 NA NA NA
3 20 FH 6 12 13316.89 6 12 11932.68
4 96 FH 3 3 10173.05 NA NA NA
5 110 MO 6 12 13243.50 6 12 17131.94
6 121 FH 6 12 111189.07 6 12 84123.52
7 130 WMB 6 12 79463.10 6 12 120010.57
8 143 MO 5 11 7035.88 6 11 11520.00
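If countTotal should stay an identifier column, as in the desired output, one alternative is base R's reshape(). The following is a minimal sketch on a trimmed copy of the question's data (test_small and wide are names introduced here for illustration). The resulting columns are named count.First, SumTotal.First and so on, rather than First.Count.

```r
# Trimmed copy of the question's data (three SR values only); name is illustrative
test_small <- data.frame(
  SR = c(1L, 1L, 15L, 20L, 20L),
  Area = c("TSC", "TSC", "FH", "FH", "FH"),
  Period = c("First", "Second", "First", "First", "Second"),
  count = c(4L, 6L, 3L, 6L, 6L),
  countTotal = c(10L, 10L, 3L, 12L, 12L),
  SumTotal = c(1520, 5769.02, 29346.78, 13316.89, 11932.68),
  stringsAsFactors = FALSE
)

# One row per SR/Area/countTotal; count and SumTotal are spread by Period
wide <- reshape(
  test_small,
  idvar = c("SR", "Area", "countTotal"),
  timevar = "Period",
  v.names = c("count", "SumTotal"),
  direction = "wide"
)
rownames(wide) <- NULL
print(wide)
```

SR 15 has no Second period, so its count.Second and SumTotal.Second cells come out as NA.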

Related

Is there any R function to make this happen?

Hi, this is an Excel version of the data I want to be able to create in R.
To be clear: every time an observation date has 5 in its Group column, I need the column Group_fix to equal 5 for the observations in the next 12-month period.
How can this be done in R? Can we use the ifelse function?
Here is an approach with lag from dplyr.
library(dplyr)
# GroupFix is 5 when Group is 5 in the current row or in any of the previous 11 rows
data %>%
  mutate(GroupFix = case_when(Group == 5 |
                                lag(Group, 1) == 5 |
                                lag(Group, 2) == 5 |
                                lag(Group, 3) == 5 |
                                lag(Group, 4) == 5 |
                                lag(Group, 5) == 5 |
                                lag(Group, 6) == 5 |
                                lag(Group, 7) == 5 |
                                lag(Group, 8) == 5 |
                                lag(Group, 9) == 5 |
                                lag(Group, 10) == 5 |
                                lag(Group, 11) == 5 ~ 5,
                              TRUE ~ as.numeric(Group)))
Observation.Date Group GroupFix
1 12/31/19 1 1
2 1/31/20 2 2
3 2/29/20 2 2
4 3/31/20 2 2
5 4/30/20 3 3
6 5/31/20 4 4
7 6/30/20 5 5
8 7/31/20 5 5
9 8/31/20 4 5
10 9/30/20 3 5
11 10/31/20 2 5
12 11/30/20 3 5
13 12/31/20 4 5
14 1/31/21 5 5
15 2/28/21 5 5
16 3/31/21 4 5
17 4/30/21 3 5
18 5/31/21 2 5
19 6/30/21 1 5
20 7/31/21 1 5
21 8/31/21 1 5
22 9/30/21 1 5
23 10/31/21 1 5
24 11/30/21 1 5
25 12/31/21 1 5
26 1/31/22 1 5
27 2/28/22 1 1
Data
data <- structure(list(Observation.Date = structure(c(8L, 1L, 13L, 14L,
16L, 18L, 20L, 22L, 24L, 26L, 4L, 6L, 9L, 2L, 11L, 15L, 17L,
19L, 21L, 23L, 25L, 27L, 5L, 7L, 10L, 3L, 12L), .Label = c("1/31/20",
"1/31/21", "1/31/22", "10/31/20", "10/31/21", "11/30/20", "11/30/21",
"12/31/19", "12/31/20", "12/31/21", "2/28/21", "2/28/22", "2/29/20",
"3/31/20", "3/31/21", "4/30/20", "4/30/21", "5/31/20", "5/31/21",
"6/30/20", "6/30/21", "7/31/20", "7/31/21", "8/31/20", "8/31/21",
"9/30/20", "9/30/21"), class = "factor"), Group = c(1L, 2L, 2L,
2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-27L))
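The chain of lag() calls can also be written as a small helper that marks a fixed-length window after every hit. This is only a sketch in base R: fix_group, target and window are hypothetical names introduced here, and it assumes the rows are already sorted by date with one row per month.

```r
# Hypothetical helper: set `target` for each row where group == target
# and for the following (window - 1) rows
fix_group <- function(group, target = 5, window = 12) {
  hits <- which(group == target)
  # row indices covered by at least one window, clipped at the end of the vector
  covered <- as.integer(unlist(lapply(hits, function(i) {
    i:min(i + window - 1, length(group))
  })))
  out <- as.numeric(group)
  out[covered] <- target
  out
}

# Group column from the answer's data
group <- c(1, 2, 2, 2, 3, 4, 5, 5, 4, 3, 2, 3, 4, 5, 5,
           4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1)
group_fix <- fix_group(group)
print(group_fix)
```

With window = 12 this reproduces the GroupFix column shown above; changing window adjusts the length of the period without editing a list of lag() calls.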

Tidy data.frame with repeated column names

I have a program that gives me data in this format
toy
file_path Condition Trial.Num A B C ID A B C ID A B C ID
1 root/some.extension Baseline 1 2 3 5 car 2 1 7 bike 4 9 0 plane
2 root/thing.extension Baseline 2 3 6 45 car 5 4 4 bike 9 5 4 plane
3 root/else.extension Baseline 3 4 4 6 car 7 5 4 bike 68 7 56 plane
4 root/uniquely.extension Treatment 1 5 3 7 car 1 7 37 bike 9 8 7 plane
5 root/defined.extension Treatment 2 6 7 3 car 4 6 8 bike 9 0 8 plane
My goal is to tidy this into a format with unique column names, which would at least be easier to finish tidying with reshape:
tidy_toy
file_path Condition Trial.Num A B C ID
1 root/some.extension Baseline 1 2 3 5 car
2 root/thing.extension Baseline 2 3 6 45 car
3 root/else.extension Baseline 3 4 4 6 car
4 root/uniquely.extension Treatment 1 5 3 7 car
5 root/defined.extension Treatment 2 6 7 3 car
6 root/some.extension Baseline 1 2 1 7 bike
7 root/thing.extension Baseline 2 5 4 4 bike
8 root/else.extension Baseline 3 7 5 4 bike
9 root/uniquely.extension Treatment 1 1 7 37 bike
10 root/defined.extension Treatment 2 4 6 8 bike
11 root/some.extension Baseline 1 4 9 0 plane
12 root/thing.extension Baseline 2 9 5 4 plane
13 root/else.extension Baseline 3 68 7 56 plane
14 root/uniquely.extension Treatment 1 9 8 7 plane
15 root/defined.extension Treatment 2 9 0 8 plane
If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.
Here's the dput of both tables
structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension",
"root/else.extension", "root/some.extension", "root/thing.extension",
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L,
1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"),
Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L,
4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L,
5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L,
4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"),
A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L,
4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path",
"Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C",
"ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA,
-5L))
structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L,
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension",
"root/else.extension", "root/some.extension", "root/thing.extension",
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline",
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L,
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L,
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L,
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L,
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike",
"car", "plane"), class = "factor")), .Names = c("file_path",
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA,
-15L))
You can use the make.unique function to create unique column names. After that you can use melt from the data.table package, which can create multiple value columns based on patterns in the column names:
# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])
# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy),
                 measure.vars = patterns('^A','^B','^C','^ID'),
                 value.name = c('A','B','C','ID'))
which gives:
> tidy_toy
file_path condition Trial.Num variable A B C ID
1: root/some.extension Baseline 1 1 2 3 5 car
2: root/thing.extension Baseline 2 1 3 6 45 car
3: root/else.extension Baseline 3 1 4 4 6 car
4: root/uniquely.extension Treatment 1 1 5 3 7 car
5: root/defined.extension Treatment 2 1 6 7 3 car
6: root/some.extension Baseline 1 2 2 1 7 bike
7: root/thing.extension Baseline 2 2 5 4 4 bike
8: root/else.extension Baseline 3 2 7 5 4 bike
9: root/uniquely.extension Treatment 1 2 1 7 37 bike
10: root/defined.extension Treatment 2 2 4 6 8 bike
11: root/some.extension Baseline 1 3 4 9 0 plane
12: root/thing.extension Baseline 2 3 9 5 4 plane
13: root/else.extension Baseline 3 3 68 7 56 plane
14: root/uniquely.extension Treatment 1 3 9 8 7 plane
15: root/defined.extension Treatment 2 3 9 0 8 plane
Another option is to use a list of column-indexes for measure.vars:
tidy_toy <- melt(setDT(toy),
                 measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)),
                 value.name = c('A','B','C','ID'))
Making the column-names unique isn't necessary then.
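The same index-based idea can be sketched in base R by stacking the repeated column blocks with rbind. This is an illustration on a two-row copy of the data; toy_small, id_cols and blocks are names introduced here, and the sketch assumes the layout from the question (three identifier columns followed by blocks of four columns).

```r
# Two-row copy of the question's data; check.names = FALSE keeps the duplicated names
toy_small <- data.frame(
  file_path = c("root/some.extension", "root/thing.extension"),
  Condition = c("Baseline", "Baseline"),
  Trial.Num = 1:2,
  A = 2:3, B = c(3L, 6L), C = c(5L, 45L), ID = "car",
  A = c(2L, 5L), B = c(1L, 4L), C = c(7L, 4L), ID = "bike",
  A = c(4L, 9L), B = c(9L, 5L), C = c(0L, 4L), ID = "plane",
  check.names = FALSE, stringsAsFactors = FALSE
)

id_cols <- 1:3
# column positions of each A/B/C/ID block: 4:7, 8:11, 12:15
blocks <- split(4:15, rep(1:3, each = 4))

# select the id columns plus one block at a time, then stack the pieces
tidy_small <- do.call(rbind, lapply(blocks, function(idx) {
  toy_small[c(id_cols, idx)]
}))
rownames(tidy_small) <- NULL
print(tidy_small)
```

Because each piece contains only one block, the selected column names are unique within the piece and rbind can match them by name.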
A more complicated method that creates names that are better distinguishable by the patterns argument:
# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]
# make them unique
names(toy)[idx] <- paste(nms,
                         rep(seq(length(nms) / length(names(tt)[tt > 1])),
                             each = length(names(tt)[tt > 1])),
                         sep = '.')
# your column names are now unique:
> names(toy)
[1] "file_path" "Condition" "Trial.Num" "A.1" "B.1" "C.1" "ID.1" "A.2"
[9] "B.2" "C.2" "ID.2" "A.3" "B.3" "C.3" "ID.3"
# tidy the data into long format
# the dot is escaped so that it matches a literal '.' only
tidy_toy <- melt(setDT(toy),
                 measure.vars = patterns('^A\\.\\d','^B\\.\\d','^C\\.\\d','^ID\\.\\d'),
                 value.name = c('A','B','C','ID'))
which will give the same end-result.
As mentioned in the comments, the janitor package can be helpful for this problem as well. Its clean_names() function works similarly to make.unique.
With the tidyverse we can do:
library(tidyverse)
toy %>%
  repair_names(sep="_") %>%
  pivot_longer(-(1:3),names_to = c(".value","id"), names_sep="_") %>%
  select(-id)
#> # A tibble: 15 x 7
#> file_path Condition Trial.Num A B C ID
#> <fct> <fct> <int> <int> <int> <int> <fct>
#> 1 root/some.extension Baseline 1 2 3 5 car
#> 2 root/some.extension Baseline 1 2 1 7 bike
#> 3 root/some.extension Baseline 1 4 9 0 plane
#> 4 root/thing.extension Baseline 2 3 6 45 car
#> 5 root/thing.extension Baseline 2 5 4 4 bike
#> 6 root/thing.extension Baseline 2 9 5 4 plane
#> 7 root/else.extension Baseline 3 4 4 6 car
#> 8 root/else.extension Baseline 3 7 5 4 bike
#> 9 root/else.extension Baseline 3 68 7 56 plane
#> 10 root/uniquely.extension Treatment 1 5 3 7 car
#> 11 root/uniquely.extension Treatment 1 1 7 37 bike
#> 12 root/uniquely.extension Treatment 1 9 8 7 plane
#> 13 root/defined.extension Treatment 2 6 7 3 car
#> 14 root/defined.extension Treatment 2 4 6 8 bike
#> 15 root/defined.extension Treatment 2 9 0 8 plane
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 4].

Expand by ID for future periods only

Is there a way to fill in implicit missingness for future dates based on ID?
For example, imagine an experiment that starts in Jan-2016. I have 3 participants who join at different periods. Subject 1 joins in January and stays until August. Subject 2 joins in March and stays in the experiment until August. Subject 3 also joins in March but drops out sometime in May, so no observations are recorded for the periods May to August.
The question is, how do I fill in the dates after subject 3 dropped out of the experiment? Here is some mock data showing how things look:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L,
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("",
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16",
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Subject", "Date"))
And here is how the data should look:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L,
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16",
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L), .Names = c("Subject", "Date"))
I tried using expand from tidyr and TimeFill from DataCombine package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.
The complete function from tidyr is designed for turning implicit missing values into explicit missing values. It fills in every Subject and Date combination, so we then have to filter out the periods before each subject joined. The easiest way seems to be to join a table of starting dates:
library(dplyr)
library(tidyr)
df <- df %>%
  filter(Date != '') %>%
  droplevels()                          # drop the unused empty factor level
# first recorded month per subject
df2 <- df %>%
  group_by(Subject) %>%
  summarise(start = first(Date))
# complete() is run on ungrouped data, since recent tidyr versions
# do not allow completing a grouping column
df %>%
  complete(Subject, Date) %>%
  left_join(df2, by = "Subject") %>%
  mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
         start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>%
  filter(Date2 >= start) %>%            # keep only months on or after the start
  arrange(Subject, Date2) %>%
  select(-start, -Date2)
Result:
Subject Date
<int> <fctr>
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
I use date conversion to do a reliable comparison with the starting date, but you might be able to use the row_numbers somehow. A problem is that complete will rearrange the data.
p.s. Note that your example dput has an empty factor level (""), so I filter that out first.
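For comparison, here is a base R sketch of the same idea that avoids the locale-dependent %b format by looking months up in month.abb. The helpers to_date and from_date are hypothetical names introduced here. The sketch assumes monthly data and that every subject should run through the last month observed for anyone, so it regenerates each subject's full sequence from their first month, which would also fill gaps in the middle.

```r
df <- data.frame(
  Subject = c(rep(1L, 8), rep(2L, 6), rep(3L, 2)),
  Date = c("Jan-16", "Feb-16", "Mar-16", "Apr-16", "May-16", "Jun-16", "Jul-16", "Aug-16",
           "Mar-16", "Apr-16", "May-16", "Jun-16", "Jul-16", "Aug-16",
           "Mar-16", "Apr-16"),
  stringsAsFactors = FALSE
)

# "Mar-16" -> Date for 2016-03-01 (assumes years 2000-2099)
to_date <- function(x) {
  month <- match(substr(x, 1, 3), month.abb)
  year <- 2000L + as.integer(substr(x, 5, 6))
  as.Date(sprintf("%d-%02d-01", year, month))
}
# Date -> "Mar-16"
from_date <- function(d) {
  paste0(month.abb[as.integer(format(d, "%m"))], "-", format(d, "%y"))
}

dates <- to_date(df$Date)
last_month <- max(dates)

# for each subject: every month from their first observation to the overall last month
filled <- do.call(rbind, lapply(unique(df$Subject), function(s) {
  months <- seq(min(dates[df$Subject == s]), last_month, by = "month")
  data.frame(Subject = s, Date = from_date(months), stringsAsFactors = FALSE)
}))
print(filled)
```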

Creating Multi dimension pivot table in R [duplicate]

I have the following data frame:
Event Scenario Year Cost
1 1 1 10
2 1 1 5
3 1 2 6
4 1 2 6
5 2 1 15
6 2 1 12
7 2 2 10
8 2 2 5
9 3 1 4
10 3 1 5
11 3 2 6
12 3 2 5
I need to produce a pivot table (data frame) that sums the total cost per year for each scenario. The result would be:
Scenario Year Cost
1 1 15
1 2 12
2 1 27
2 2 15
3 1 9
3 2 11
I need to produce a ggplot line graph that plot the cost of each scenario per year. I know how to do that, I just can't get the right data frame.
Try
library(dplyr)
df %>% group_by(Scenario, Year) %>% summarise(Cost=sum(Cost))
Or
library(data.table)
setDT(df)[, list(Cost=sum(Cost)), by=list(Scenario, Year)]
Or
aggregate(Cost~Scenario+Year, df,sum)
data
df <- structure(list(Event = 1:12, Scenario = c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), Year = c(1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L), Cost = c(10L, 5L, 6L, 6L, 15L, 12L,
10L, 5L, 4L, 5L, 6L, 5L)), .Names = c("Event", "Scenario", "Year",
"Cost"), class = "data.frame", row.names = c(NA, -12L))
The following does it:
library(plyr)
ddply(df, .(Scenario, Year), summarize, Cost = sum(Cost))
#Scenario Year Cost
#1 1 1 15
#2 1 2 12
#3 2 1 27
#4 2 2 15
#5 3 1 9
#6 3 2 11

Remove duplicated 2 columns permutations

I can't find a good title for this question so feel free to edit it please.
I have this data.frame
section time to from
1 a 9 1 2
2 a 9 2 1
3 a 12 2 3
4 a 12 2 4
5 a 12 3 2
6 a 12 3 4
7 a 12 4 2
8 a 12 4 3
I want to remove rows that duplicate the same to and from pair regardless of the order of the two columns, e.g. (1,2) and (2,1) count as duplicates.
So final output would be:
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
I have a solution that constructs a new key column, e.g.
key <- paste(pmin(to, from), pmax(to, from))
and removes the duplicated keys using duplicated, but I think this is a dirty solution.
Here is the dput of my data:
structure(list(section = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "a", class = "factor"), time = c(9L, 9L, 12L,
12L, 12L, 12L, 12L, 12L), to = c(1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L), from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L)), .Names = c("section",
"time", "to", "from"), row.names = c(NA, -8L), class = "data.frame")
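# s is the data frame from the question
# smaller and larger value of each (to, from) pair
mn <- pmin(s$to, s$from)
mx <- pmax(s$to, s$from)
# one numeric code per unordered pair
int <- as.numeric(interaction(mn, mx))
# keep the first row for each code
s[match(unique(int), int),]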
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
Credit for the idea goes to this question: Remove consecutive duplicates from dataframe and specifically @MatthewPlourde's answer.
You can try using sort within the apply function to order the combinations.
mydf[!duplicated(t(apply(mydf[3:4], 1, sort))), ]
# section time to from
# 1 a 9 1 2
# 3 a 12 2 3
# 4 a 12 2 4
# 6 a 12 3 4
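A variant of the pmin/pmax idea is to pass the ordered pair to duplicated() as a two-column data frame, which avoids building a pasted key or an interaction. This is a minimal sketch; pair and deduped are names introduced here for illustration.

```r
s <- data.frame(
  section = "a",
  time = c(9L, 9L, 12L, 12L, 12L, 12L, 12L, 12L),
  to = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L),
  from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L)
)

# order each pair so that (1,2) and (2,1) become the same row
pair <- data.frame(lo = pmin(s$to, s$from), hi = pmax(s$to, s$from))

# duplicated() on a data frame compares whole rows
deduped <- s[!duplicated(pair), ]
print(deduped)
```

To treat pairs as duplicates only within the same section or time, those columns could be added to pair before calling duplicated().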
