Is there a way to shift a group of columns into their own row in R?
I currently have a large dataset that includes column headings like this:
| Month | Year | Tenant 1 Name | Tenant 1 Rate | Tenant 1 Vacate Date | Tenant 1 Notes | Tenant 1 Name | Tenant 2 Rate | Tenant 2 Vacate Date | Tenant 2 Notes |
|---|---|---|---|---|---|---|---|---|---|
| Jan | 2001 | Bob | 1 | 2 | 3 | Joe | 1 | 2 | 3 |
I want to reshape this information so that each tenant within each month and year has its own row. So the rows would look like this:
| Month | Year | Name | Rate | Date | Notes |
|---|---|---|---|---|---|
| Jan | 2001 | Bob | 1 | 2 | 3 |
| Jan | 2001 | Joe | 1 | 2 | 3 |
I assume this would be something like group_by() but for multiple columns somehow?
Sorry for the clumsy formatting!
First, let's generate an example like yours (your example had "Tenant 1 Name" twice, but I assume that was just a typo):
colnames<-c("Month","Year","Tenant 1 Name","Tenant 1 Rate","Tenant 1 Vacate Date","Tenant 1 Notes","Tenant 2 Name","Tenant 2 Rate","Tenant 2 Vacate Date","Tenant 2 Notes")
fields<-c("Jan","2001","Bob","1","2","3","Joe","1","2","3")
mat<-matrix(fields,nrow=1)
colnames(mat)<-colnames
View(mat)
It will look like this:
Now, identify which columns have "Name" in them:
cols<-grep("Name",colnames(mat))
cols
Then, extract names from those columns:
names<-mat[,cols]
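And finally, fill a new matrix:
newmat<-matrix(NA,nrow=0,ncol=6)
for(n in names){
  # position of the column holding this tenant's name
  whichcol<-which(mat[1,]==n)
  # Month and Year, followed by the tenant's four columns
  newline<-c(mat[,1:2],mat[,whichcol:(whichcol+3)])
  newmat<-rbind(newmat,newline)
}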
View(newmat)
It will result in what you are looking for:
However, I have a feeling that the dataset you are working with has more layers of complexity (e.g., multiple lines), requiring a more complex solution. Please let us know if that's the case!
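In case the real data does have multiple lines, here is a minimal base R sketch of the same block idea applied to a data frame with several rows. The two-row `wide` data frame is made up for illustration, and the code assumes every tenant block is exactly four adjacent columns that start at a "... Name" column.

```r
# Hypothetical two-row input with the same ten columns as in the question
wide <- data.frame(
  Month = c("Jan", "Feb"), Year = c(2001L, 2001L),
  `Tenant 1 Name` = c("Bob", "Bob"), `Tenant 1 Rate` = c(1L, 4L),
  `Tenant 1 Vacate Date` = c(2L, 5L), `Tenant 1 Notes` = c(3L, 6L),
  `Tenant 2 Name` = c("Joe", "Ann"), `Tenant 2 Rate` = c(1L, 7L),
  `Tenant 2 Vacate Date` = c(2L, 8L), `Tenant 2 Notes` = c(3L, 9L),
  check.names = FALSE
)

# Each "... Name" column is assumed to start a block of four tenant columns
starts <- grep("Name$", names(wide))

# For every block, keep Month and Year plus the four tenant columns
blocks <- lapply(starts, function(s) {
  block <- wide[, c(1:2, s:(s + 3))]
  names(block) <- c("Month", "Year", "Name", "Rate", "Date", "Notes")
  block
})

# Stack the blocks: one row per tenant per original row
long <- do.call(rbind, blocks)
long
```

All rows for tenant 1 come first, followed by all rows for tenant 2; sort afterwards if you need them grouped by month.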
If the column name for 'Joe' is 'Tenant 2 Name', use pivot_longer. Specify cols as everything except 'Month' and 'Year', and use names_pattern to capture the run of non-space characters (\\S+) at the end ($) of each column name. With names_to = ".value", each captured substring becomes its own output column.
library(tidyr)
pivot_longer(df1, cols = -c(Month, Year),
names_to = ".value", names_pattern = ".*\\s+(\\S+)$")
-output
# A tibble: 2 x 6
# Month Year Name Rate Date Notes
# <chr> <int> <chr> <int> <int> <int>
#1 Jan 2001 Bob 1 2 3
#2 Jan 2001 Joe 1 2 3
data
df1 <- structure(list(Month = "Jan", Year = 2001L, `Tenant 1 Name` = "Bob",
`Tenant 1 Rate` = 1L, `Tenant 1 Vacate Date` = 2L, `Tenant 1 Notes` = 3L,
`Tenant 2 Name` = "Joe", `Tenant 2 Rate` = 1L, `Tenant 2 Vacate Date` = 2L,
`Tenant 2 Notes` = 3L), class = "data.frame", row.names = c(NA,
-1L))
Thanks for the subtle tip from dear @akrun, as always. I added a $ to the last capturing group to make sure it always matches the last word.
This may be a bit verbose, but it also does the trick. I split each column name into three capture groups, discard the first two by mapping them to NA in names_to, and capture the third one as .value:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(!c(Month, Year), names_to = c(NA, NA, ".value"),
names_pattern = "(\\w+) (\\w+) (\\w+$)")
# A tibble: 2 x 6
Month Year Name Rate Date Notes
<chr> <int> <chr> <int> <int> <int>
1 Jan 2001 Bob 1 2 3
2 Jan 2001 Joe 1 2 3
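If you also want to keep track of which tenant each row came from, a variation on the same idea is to capture the tenant number as well. This is only a sketch: the `Tenant` column name and the regular expression are my own choices, and it assumes every reshaped column name starts with "Tenant", a space, and a number.

```r
library(tidyr)

# Same one-row example as in the answers above
df1 <- data.frame(
  Month = "Jan", Year = 2001L,
  `Tenant 1 Name` = "Bob", `Tenant 1 Rate` = 1L,
  `Tenant 1 Vacate Date` = 2L, `Tenant 1 Notes` = 3L,
  `Tenant 2 Name` = "Joe", `Tenant 2 Rate` = 1L,
  `Tenant 2 Vacate Date` = 2L, `Tenant 2 Notes` = 3L,
  check.names = FALSE
)

# First group: the tenant number, stored in a new "Tenant" column.
# Second group: the last word of the column name, used as the output column.
long <- pivot_longer(
  df1,
  cols = -c(Month, Year),
  names_to = c("Tenant", ".value"),
  names_pattern = "Tenant (\\d+) .*?(\\S+)$"
)
long
```

The lazy `.*?` skips any words between the tenant number and the last word, so "Tenant 1 Vacate Date" ends up in the Date column.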
Related
I have a data frame that looks like this:
| ID | Feature | Quality | Quantity | Condition |
|---|---|---|---|---|
| 21 | Shed | A | 1 | AV |
| 72 | Masonry |  | 1 |  |
| 72 | Shed | D | 1 | AV |
Currently the unit of observation in the data frame is the feature, not the ID number. I would like to pivot this to a data frame that looks like this:
| ID | ShedQuant | ShedQual | ShedCond | MasonryQuant | MasonryQual | MasonryCond |
|---|---|---|---|---|---|---|
| 21 | 1 | A | AV |  |  |  |
| 72 | 1 | D | AV | 1 |  |  |
In the new data frame, the unit of observation should be the ID number (i.e., each ID number is one row that lists all features associated with that ID, along with their quantities, qualities, and conditions).
I tried to combine several pivot_widers but it did not give me the intended result. Any help is appreciated!
Note: If the quantity of a certain feature is more than 1 for a certain ID, I want a sum for the quantity column and blanks for quality and condition.
library(tidyr)
data.frame(
stringsAsFactors = FALSE,
ID = c(21L, 72L, 72L),
Feature = c("Shed", "Masonry", "Shed"),
Quality = c("A", NA, "D"),
Quantity = c(1L, 1L, 1L),
Condition = c("AV", NA, "AV")
) %>%
pivot_wider(id_cols = ID, names_from = Feature, names_glue = "{Feature}_{.value}",
values_from = Quality:Condition, names_vary = "slowest")
Result
# A tibble: 2 × 7
ID Shed_Quality Shed_Quantity Shed_Condition Masonry_Quality Masonry_Quantity Masonry_Condition
<int> <chr> <int> <chr> <chr> <int> <chr>
1 21 A 1 AV NA NA NA
2 72 D 1 AV NA 1 NA
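The note in the question (sum the quantity and blank out quality and condition when a feature appears more than once for an ID) is not covered by the call above. One possible way to handle it, as a sketch, is to collapse duplicates before pivoting. The extra Shed row for ID 72 and the names `features`, `collapsed`, and `wide` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 72 now has two Shed rows
features <- data.frame(
  ID = c(21L, 72L, 72L, 72L),
  Feature = c("Shed", "Masonry", "Shed", "Shed"),
  Quality = c("A", NA, "D", "B"),
  Quantity = c(1L, 1L, 1L, 2L),
  Condition = c("AV", NA, "AV", "GD")
)

# One row per ID/Feature: keep Quality and Condition only when the
# combination is unique, otherwise set them to NA; always sum Quantity
collapsed <- features %>%
  group_by(ID, Feature) %>%
  summarise(
    Quality   = if (n() == 1) Quality else NA_character_,
    Condition = if (n() == 1) Condition else NA_character_,
    Quantity  = sum(Quantity),
    .groups = "drop"
  )

wide <- collapsed %>%
  pivot_wider(id_cols = ID, names_from = Feature,
              names_glue = "{Feature}_{.value}",
              values_from = c(Quality, Quantity, Condition))
wide
```

After the collapse step each ID/Feature pair is unique, so pivot_wider no longer has to deal with duplicate cells.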
I am trying to move data to a new column after certain points in the data. My data is spread across multiple data frames that have only some elements in common, so I would like to write a loop to clean the data sets. I am looking for a function that, after the first row containing certain text (for example "Total"), moves all the data below that row into new columns.
| first | second | third |
|---|---|---|
| One | 1 |  |
| One | 1 |  |
| Total | 2 |  |
| Two | 2 |  |
| Two | 2 |  |
| Total | 2 |  |
I want my data to look similar to this below, but due to the variability of the data I am having trouble finding a solution that can be reproduced easily.
| left | center | right | fourth |
|---|---|---|---|
| One | 1 | Two | 2 |
| One | 1 | Two | 2 |
| Total | 1 | Total | 2 |
In my opinion, cbind-ing the data side by side gets cumbersome when there are many rows. Still, you can divide the data into separate groups like this:
df <- read.table(text = "first second
One 1
One 1
Total 2
Two 2
Two 2
Total 2", header = TRUE)
df$dummy = rev(cumsum(rev(df$first == "Total")))
df
> df
first second dummy
1 One 1 2
2 One 1 2
3 Total 2 2
4 Two 2 1
5 Two 2 1
6 Total 2 1
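To see why the reversed cumulative sum produces these group labels, it may help to run the steps one at a time. This is just the one-liner above unrolled; the intermediate names `is_total` and `from_end` are mine.

```r
first <- c("One", "One", "Total", "Two", "Two", "Total")

# TRUE on the rows that close a group
is_total <- first == "Total"

# Counting "Total" rows from the bottom up gives 1 1 1 2 2 2 ...
from_end <- cumsum(rev(is_total))

# ... and reversing back puts the labels in the original row order: 2 2 2 1 1 1
dummy <- rev(from_end)
dummy
```

Counting from the bottom means every "Total" row gets the same label as the rows above it, at the cost of numbering the groups from last to first.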
You may notice that your data is now divided into two groups. You can still cbind() or bind_cols() them if you want:
library(dplyr)
df %>% group_split(d = rev(cumsum(rev(first == "Total")))) %>% bind_cols()
# A tibble: 3 x 6
first...1 second...2 d...3 first...4 second...5 d...6
<chr> <int> <int> <chr> <int> <int>
1 Two 2 1 One 1 2
2 Two 2 1 One 1 2
3 Total 2 1 Total 2 2
Here is another try
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
data <- structure(list(first = c("One", "One", "Total", "Two", "Two",
"Total"), second = c(1L, 1L, 2L, 2L, 2L, 2L)), row.names = c(NA,
-6L), class = "data.frame")
new_data <- data %>%
# create group using first == "Total"
mutate(total_group = cumsum(first == "Total")) %>%
mutate(total_group = if_else(first == "Total", total_group - 1L, total_group)) %>%
# split df into multiple df and bind cols
group_split(total_group, .keep = FALSE) %>%
bind_cols()
#> New names:
#> * first -> first...1
#> * second -> second...2
#> * first -> first...3
#> * second -> second...4
new_data
#> # A tibble: 3 x 4
#> first...1 second...2 first...3 second...4
#> <chr> <int> <chr> <int>
#> 1 One 1 Two 2
#> 2 One 1 Two 2
#> 3 Total 2 Total 2
# If you only have two groups this works as is; otherwise the approach needs
# some more work. Hopefully this gives you enough of a hint to develop it further.
names(new_data) <- c("left", "center", "right", "fourth")
new_data
#> # A tibble: 3 x 4
#> left center right fourth
#> <chr> <int> <chr> <int>
#> 1 One 1 Two 2
#> 2 One 1 Two 2
#> 3 Total 2 Total 2
Created on 2021-04-04 by the reprex package (v1.0.0)
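For more than two groups, or groups of different lengths, a rough base R sketch is shown below. The third group ("Three") is invented to show the padding, and the `first_1`, `second_1`, ... column names are an arbitrary choice rather than anything from the question.

```r
# Hypothetical input: three groups, the last one shorter than the others
data <- data.frame(
  first  = c("One", "One", "Total", "Two", "Two", "Total", "Three", "Total"),
  second = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L)
)

# A new group starts on the row after each "Total"
is_total <- data$first == "Total"
group <- cumsum(c(0L, head(is_total, -1)))

pieces <- split(data, group)

# Pad every piece with NA rows up to the longest one so they can be bound
n_max <- max(vapply(pieces, nrow, integer(1)))
padded <- lapply(pieces, function(p) p[seq_len(n_max), , drop = FALSE])

wide <- do.call(cbind, unname(padded))

# Number the repeated column names by group
names(wide) <- paste0(rep(names(data), times = length(pieces)), "_",
                      rep(seq_along(pieces), each = ncol(data)))
rownames(wide) <- NULL
wide
```

Because the grouping only depends on where "Total" appears, the same steps could be wrapped in a function and applied to each data frame in a loop.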
I have a data table:
| Id | Time | v1 | v2 | v3 |
|---|---|---|---|---|
|  | T1 | 2 | 1 | 2 |
|  | T2 | 3 | 1 | 2 |
|  | T3 | 1 | 3 | 3 |
Basically, I have data in three waves (T1, T2, etc.). I need to convert it to wide format so it looks like this:
| id | v1T1 | v2T1 | v3T1 | v1T2 | v2T2 | v3T2 | v1T3 | V2T3 |
|---|---|---|---|---|---|---|---|---|
|  | 2 | 1 | 2 | 3 | 1 | 2 | 1 | 3 |
I have tried the following code:
data %>%
group_by(id) %>%
mutate(id=paste0("id", row_number())) %>%
spread(id, v1, v2, v3)
What am I missing? I know how to do this with casetovars in SPSS, but I can't duplicate it in R.
You can use pivot_wider :
tidyr::pivot_wider(df, names_from = Time, values_from = v1:v3)
# Id v1_T1 v1_T2 v1_T3 v2_T1 v2_T2 v2_T3 v3_T1 v3_T2 v3_T3
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 2 3 1 1 1 3 2 2 3
Or using data.table :
library(data.table)
dcast(setDT(df), Id~Time, value.var = c('v1', 'v2', 'v3'))
data
df <- structure(list(Id = c(1, 1, 1), Time = c("T1", "T2", "T3"), v1 = c(2L,
3L, 1L), v2 = c(1L, 1L, 3L), v3 = c(2L, 2L, 3L)), row.names = c(NA,
-3L), class = "data.frame")
Since your question title mentions "reshape2", the function you'd be looking for in that package is recast. The recast function is basically a melt followed by a dcast, which is required in reshape2::dcast to cast multiple value variables (but not in data.table::dcast, which can accept a vector of value variables, as demonstrated in Ronak's answer).
Here's what it looks like:
library(reshape2)
recast(df, ... ~ variable + Time, id.var=1:2)
## Id v1_T1 v1_T2 v1_T3 v2_T1 v2_T2 v2_T3 v3_T1 v3_T2 v3_T3
## 1 1 2 3 1 1 1 3 2 2 3
For reference, you can also do this with reshape from base R:
reshape(df, direction = "wide", idvar = "Id", timevar = "Time")
## Id v1.T1 v2.T1 v3.T1 v1.T2 v2.T2 v3.T2 v1.T3 v2.T3 v3.T3
## 1 1 2 1 2 3 1 2 1 3 3
That said, many people find reshape very hard to learn, and while the "reshape2" package will be maintained, it's not being actively developed. Thus, while you can expect that things won't break, new features aren't going to be added to it. For that, you'll have to look at the "data.table" implementation or start using "tidyr" or other alternatives.
I want to combine rows that have almost the same values, while also combining the values that differ so that I don't lose information I want to analyse later.
I have the following dataset:
SessionId Client id Product_type Item quantity
1 1 Couch 1
1 1 Table 1
2 2 Couch 1
2 2 Chair 5
I want to have an output like:
SessionId Client id Product_type Item quantity
1 1 Couch, Table 2
2 2 Couch, Chair 6
So I need to merge rows based on the session id. But for the column product type I want to paste character names behind each other and for the item quantity I want to sum the quantities. I have way more columns, but those values can stay the same.
Maybe I need to do it in two steps, but I'm not sure how to begin. Hopefully someone can help me out.
Try this.
library(dplyr)
d %>% group_by(SessionId,Client_id) %>%
summarise(prod_type = toString(Product_type),
sum_item_q = sum(Item_quantity, na.rm = TRUE))
output as:
# A tibble: 2 x 4
# Groups: SessionId [2]
SessionId Client_id prod_type sum_item_q
<int> <int> <chr> <int>
1 1 1 Couch, Table 2
2 2 2 Couch, Chair 6
data
structure(list(SessionId = c(1L, 1L, 2L, 2L), Client_id = c(1L,
1L, 2L, 2L), Product_type = c("Couch", "Table", "Couch", "Chair"
), Item_quantity = c(1L, 1L, 1L, 5L)), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))->d
This can be achieved like so
df <- read.table(text = "SessionId 'Client id' Product_type 'Item quantity'
1 1 Couch 1
1 1 Table 1
2 2 Couch 1
2 2 Chair 5", header = TRUE)
library(dplyr)
df %>%
group_by(SessionId, Client.id) %>%
summarise(Product_type = paste(Product_type, collapse = ", "),
Item.quantity = sum(Item.quantity))
#> # A tibble: 2 x 4
#> # Groups: SessionId [2]
#> SessionId Client.id Product_type Item.quantity
#> <int> <int> <chr> <int>
#> 1 1 1 Couch, Table 2
#> 2 2 2 Couch, Chair 6
Created on 2020-05-23 by the reprex package (v0.3.0)
Base R solution:
# aggregate each column with its own function, then merge on the grouping keys
# (column names as created by the read.table() call above)
merge(aggregate(Product_type ~ SessionId + Client.id, df, toString),
aggregate(Item.quantity ~ SessionId + Client.id, df, sum))
I have a dataset in Excel that is structured as follows:
A B C
ID Start_date End_date
1 01/01/2000 05/01/2000
1 06/01/2000 15/05/2000
1 16/05/2000 07/04/2018
2 06/07/2016 09/10/2019
2 10/10/2019 14/12/2019
3 02/08/2000 06/08/2006
3 07/08/2006 15/02/2020
4 05/09/2012 09/11/2017
I would like to create a time series of the number of unique values in the above dataset that occur more than 3 times in the 12 months prior to any month in the date range covered by the dataset (in this case 01/01/2000 - 15/02/2020). So, for example, the number of unique values appearing more than three times in the 12 months prior to January 2001 would be 1 (ID = 1).
I've tried this in Excel using the following formula:
{=SUM(--(FREQUENCY(IF(($B$2:$B$8<=EOMONTH('Time Series'!A2,0))*($C$2:$C$8>=EOMONTH('Time Series'!A2,-12),$A$2:$A$8),$A$2:$A$8)>0))}
Where the value in 'Time Series'!A2 is January 2001.
However, this only returns the number of unique values that occur in the 12 months prior to January 2001, not how many unique values occur more than three times in the period.
Any help on this would be greatly appreciated - while I have been doing this in Excel so far, I would be open to performing the calculation in R if that would prove simpler.
I am not sure if I understood your question correctly:
1. Create a minimal reproducible example:
df <-structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L),
Start_date = c("01/01/2000", "06/01/2000", "16/05/2000", "06/07/2016", "10/10/2019", "02/08/2000", "07/08/2006", "05/09/2012"),
End_date = c("05/01/2000", "15/05/2000","07/04/2018", "09/10/2019", "14/12/2019", "06/08/2006", "15/02/2020", "09/11/2017")),
class = "data.frame", row.names = c(NA, -8L))
head(df)
Returns:
ID Start_date End_date
1 1 01/01/2000 05/01/2000
2 1 06/01/2000 15/05/2000
3 1 16/05/2000 07/04/2018
4 2 06/07/2016 09/10/2019
5 2 10/10/2019 14/12/2019
6 3 02/08/2000 06/08/2006
Suggested solution using dplyr
Format date columns as.Date:
library(dplyr)
df_formated <- df %>%
mutate(Start_date = as.Date(Start_date, "%d/%m/%Y"),
End_date = as.Date(End_date, "%d/%m/%Y"))
str(df)
Returns:
'data.frame': 8 obs. of 3 variables:
$ ID : int 1 1 1 2 2 3 3 4
$ Start_date: chr "01/01/2000" "06/01/2000" "16/05/2000" "06/07/2016" ...
$ End_date : chr "05/01/2000" "15/05/2000" "07/04/2018" "09/10/2019" ...
Filter by cutoff_date, count the occurrences per ID, and keep the IDs with at least min_number_of_occurences rows:
cutoff_date <- as.Date("01/01/2001", "%d/%m/%Y")
min_number_of_occurences <- 3
df_formated %>%
filter(Start_date < cutoff_date) %>%
group_by(ID) %>%
summarise(N = n()) %>%
filter(N >= min_number_of_occurences)
Returns:
# A tibble: 1 x 2
ID N
<int> <int>
1 1 3
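To turn this single cutoff into the monthly time series the question asks for, one option is to wrap the count in a function and apply it to every month. This is only a sketch under my own assumptions: a row counts towards a month if its interval overlaps the 12 months before the first day of that month, and "more than three times" is read as at least 3, to match the ID = 1 example in the question. The names `count_ids`, `months`, and `series` are made up.

```r
df <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L),
  Start_date = as.Date(c("01/01/2000", "06/01/2000", "16/05/2000", "06/07/2016",
                         "10/10/2019", "02/08/2000", "07/08/2006", "05/09/2012"),
                       "%d/%m/%Y"),
  End_date = as.Date(c("05/01/2000", "15/05/2000", "07/04/2018", "09/10/2019",
                       "14/12/2019", "06/08/2006", "15/02/2020", "09/11/2017"),
                     "%d/%m/%Y")
)

# Number of IDs with at least `min_n` rows overlapping the 12 months
# before `month_start` (assumed definition, see the note above)
count_ids <- function(df, month_start, min_n = 3) {
  window_start <- seq(month_start, by = "-1 year", length.out = 2)[2]
  in_window <- df$Start_date < month_start & df$End_date >= window_start
  per_id <- table(df$ID[in_window])
  sum(per_id >= min_n)
}

# One value per month over the range covered by the data
months <- seq(as.Date("2001-01-01"), as.Date("2020-02-01"), by = "month")
series <- data.frame(
  month = months,
  n_ids = vapply(seq_along(months),
                 function(i) count_ids(df, months[i]),
                 numeric(1))
)
head(series)
```

For January 2001 this counts only ID 1, which matches the example in the question; change `min_n` or the overlap condition if your definition differs.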