I have a data frame that looks like this:
ID  Feature  Quality  Quantity  Condition
21  Shed     A        1         AV
72  Masonry           1
72  Shed     D        1         AV
Currently the unit of observation in the data frame is the feature, not the ID number. I would like to pivot it into a data frame that looks like this:
ID  ShedQuant  ShedQual  ShedCond  MasonryQuant  MasonryQual  MasonryCond
21  1          A         AV
72  1          D         AV        1
In the new data frame, the unit of observation should be the ID number (i.e., each ID number is one row that lists all features associated with that ID number, along with their quantities, qualities, and conditions).
I tried to combine several pivot_widers but it did not give me the intended result. Any help is appreciated!
Note: If the quantity of a certain feature is more than 1 for a certain ID, I want a sum for the quantity column and blanks for quality and condition.
library(tidyr)

data.frame(
  stringsAsFactors = FALSE,
  ID = c(21L, 72L, 72L),
  Feature = c("Shed", "Masonry", "Shed"),
  Quality = c("A", NA, "D"),
  Quantity = c(1L, 1L, 1L),
  Condition = c("AV", NA, "AV")
) %>%
  # one row per ID; id_cols is named explicitly because recent tidyr
  # versions no longer accept it as a positional argument
  pivot_wider(id_cols = ID, names_from = Feature, names_glue = "{Feature}_{.value}",
              values_from = Quality:Condition, names_vary = "slowest")
Result
# A tibble: 2 × 7
ID Shed_Quality Shed_Quantity Shed_Condition Masonry_Quality Masonry_Quantity Masonry_Condition
<int> <chr> <int> <chr> <chr> <int> <chr>
1 21 A 1 AV NA NA NA
2 72 D 1 AV NA 1 NA
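The call above does not cover the note in the question about repeated features. A minimal sketch of one way to handle it, assuming "more than 1" means that the same ID/Feature pair appears on several rows: aggregate first, then pivot. The data frame `feat` and its extra Shed row are hypothetical, added only to trigger the case.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: ID 72 now has two Shed rows
feat <- data.frame(
  ID = c(21L, 72L, 72L, 72L),
  Feature = c("Shed", "Masonry", "Shed", "Shed"),
  Quality = c("A", NA, "D", "B"),
  Quantity = c(1L, 1L, 1L, 2L),
  Condition = c("AV", NA, "AV", "GD")
)

wide <- feat %>%
  group_by(ID, Feature) %>%
  summarise(
    # keep Quality/Condition only when the pair occurs once, otherwise blank (NA)
    Quality = if (n() == 1) Quality else NA_character_,
    Condition = if (n() == 1) Condition else NA_character_,
    # quantities are always summed
    Quantity = sum(Quantity),
    .groups = "drop"
  ) %>%
  pivot_wider(id_cols = ID, names_from = Feature,
              values_from = c(Quantity, Quality, Condition),
              names_glue = "{Feature}_{.value}", names_vary = "slowest")

wide
```

If "more than 1" is instead meant as a summed quantity above 1, the `n() == 1` condition would need to be replaced by a test on `sum(Quantity)`.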
Related
I have a table that is somewhat like this:
var         RC
distance50  2
distance20  4
precMax     5
precMin     1
total_prec  8
travelTime  5
travelTime  2
I want to sum all similar type variables, resulting in something like this:
var   sum
dist  6
prec  14
trav  7
Using 4 letters is enough to separate the different types. I have tried and tried but not figured it out. Could anyone please assist? I generally try to work with dplyr, so that would be preferred. The datasets are small (n<100) so speed is not required.
Base R solution:
aggregate(
  RC ~ var,
  data = transform(
    with(df, df[!(grepl("total", var)), ]),
    var = gsub("^(\\w+)([A-Z0-9]\\w+$)", "\\1", var)
  ),
  FUN = sum
)
Data:
df <- structure(list(var = c("distance50", "distance20", "precMax",
"precMin", "total_prec", "travelTime", "travelTime"), RC = c(2L,
4L, 5L, 1L, 8L, 5L, 2L)), class = "data.frame", row.names = c(NA,
-7L))
library(dplyr)
library(tidyr)
df %>%
separate(var, c("var", "b"), sep = "[_A-Z0-9]", extra = "merge") %>%
group_by(var = ifelse(b %in% var, b, var)) %>%
summarize(RC = sum(RC), .groups = "drop")
First, separate var into two columns by splitting at the first underscore (_), capital letter (A-Z), or digit (0-9).
In the group_by statement, if the value in the second column also occurs in the first column, use it as the group name; otherwise keep the first column.
Lastly, sum RC by group.
Output
var RC
<chr> <int>
1 distance 6
2 prec 14
3 travel 7
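To make the steps described above easier to follow, here is a small sketch that runs only the separate() step on the same data and inspects the intermediate columns. The name `parts` is introduced here for illustration only.

```r
library(tidyr)

df <- data.frame(
  var = c("distance50", "distance20", "precMax", "precMin",
          "total_prec", "travelTime", "travelTime"),
  RC = c(2L, 4L, 5L, 1L, 8L, 5L, 2L)
)

# Split at the first underscore, capital letter, or digit;
# extra = "merge" keeps the rest of the string in the second column
parts <- separate(df, var, c("var", "b"), sep = "[_A-Z0-9]", extra = "merge")
parts
```

The row for total_prec ends up with var = "total" and b = "prec". Because "prec" also occurs in the var column, the ifelse() in group_by() assigns that row to the prec group, which is why prec sums to 14.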
library(dplyr)    # tibble(), group_by(), summarise()
library(stringr)  # str_sub()

tibble(
  var = c('dista', 'distb', 'travelTime'),
  rc = 2:4) %>%
  print() %>%
  # A tibble: 3 x 2
  #  var           rc
  #  <chr>      <int>
  #1 dista          2
  #2 distb          3
  #3 travelTime     4
  group_by(var = str_sub(var, end = 4)) %>%
  print() %>%
  # A tibble: 3 x 2
  # Groups:   var [2]
  #  var      rc
  #  <chr> <int>
  #1 dist      2
  #2 dist      3
  #3 trav      4
  summarise(sum = sum(rc))
# A tibble: 2 x 2
#  var     sum
#  <chr> <int>
#1 dist      5
#2 trav      4
Is there a way to shift a group of columns into their own row in R?
I currently have a large dataset that includes column headings like this:
Month | Year | Tenant 1 Name | Tenant 1 Rate | Tenant 1 Vacate Date | Tenant 1 Notes | Tenant 1 Name | Tenant 2 Rate | Tenant 2 Vacate Date | Tenant 2 Notes
Jan | 2001 | Bob | 1 | 2 | 3 | Joe | 1 | 2 | 3
I want to combine this information so that each tenant within each month and year has its own row. The rows would then look like this:
Month  Year  Name  Rate  Date  Notes
Jan    2001  Bob   1     2     3
Jan    2001  Joe   1     2     3
I assume this would be something like group_by() but for multiple columns somehow?
Sorry for the clumsy formatting!
First, to generate an example like yours (your example had "Tenant 1 Name" twice, but I guess it was just a typo).
colnames<-c("Month","Year","Tenant 1 Name","Tenant 1 Rate","Tenant 1 Vacate Date","Tenant 1 Notes","Tenant 2 Name","Tenant 2 Rate","Tenant 2 Vacate Date","Tenant 2 Notes")
fields<-c("Jan","2001","Bob","1","2","3","Joe","1","2","3")
mat<-matrix(fields,nrow=1)
colnames(mat)<-colnames
View(mat)
It will look like this:
Now, identify which columns have "Name" in them:
cols<-grep("Name",colnames(mat))
cols
Then, extract names from those columns:
names<-mat[,cols]
And finally, fill a new matrix:
newmat<-matrix(NA,nrow=0,ncol=6)
for(n in names){
whichcol<-which(mat[1,]==n)
newline<-c(mat[,1:2],mat[,whichcol:(whichcol+3)])
newmat<-rbind(newmat,newline)
}
View(newmat)
It will result in what you are looking for:
However, I have a feeling that the dataset you are working with has more layers of complexity (e.g., multiple lines), requiring a more complex solution. Please let us know if that's the case!
If the column name for 'Joe' is 'Tenant 2 Name', use pivot_longer. Specify cols as everything except 'Month' and 'Year', and use names_pattern to capture the last run of non-space characters (\\S+) at the end ($) of each column name. With names_to = ".value", each captured substring becomes its own output column.
library(tidyr)
pivot_longer(df1, cols = -c(Month, Year),
names_to = ".value", names_pattern = ".*\\s+(\\S+)$")
-output
# A tibble: 2 x 6
# Month Year Name Rate Date Notes
# <chr> <int> <chr> <int> <int> <int>
#1 Jan 2001 Bob 1 2 3
#2 Jan 2001 Joe 1 2 3
data
df1 <- structure(list(Month = "Jan", Year = 2001L, `Tenant 1 Name` = "Bob",
`Tenant 1 Rate` = 1L, `Tenant 1 Vacate Date` = 2L, `Tenant 1 Notes` = 3L,
`Tenant 2 Name` = "Joe", `Tenant 2 Rate` = 1L, `Tenant 2 Vacate Date` = 2L,
`Tenant 2 Notes` = 3L), class = "data.frame", row.names = c(NA,
-1L))
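If the tenant number should survive the reshape, a variation of the same call can capture it as well. This is only a sketch under the assumption that every tenant column is named "Tenant <number> <field>"; the output column name `Tenant` and the object `long` are chosen here for illustration.

```r
library(tidyr)

df1 <- data.frame(
  Month = "Jan", Year = 2001L,
  `Tenant 1 Name` = "Bob", `Tenant 1 Rate` = 1L,
  `Tenant 1 Vacate Date` = 2L, `Tenant 1 Notes` = 3L,
  `Tenant 2 Name` = "Joe", `Tenant 2 Rate` = 1L,
  `Tenant 2 Vacate Date` = 2L, `Tenant 2 Notes` = 3L,
  check.names = FALSE  # keep the spaces in the column names
)

# First capture group: tenant number; second: field name (used as column name)
long <- pivot_longer(df1, cols = -c(Month, Year),
                     names_to = c("Tenant", ".value"),
                     names_pattern = "Tenant (\\d+) (.+)")
long
```

Here the field column keeps its full name, `Vacate Date`, and Tenant is stored as a character column ("1", "2").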
Thanks to @akrun for the subtle tip, as always. I added a $ to the last capturing group so that it always captures the last word of the column name.
This may be a bit verbose, but it also does the trick. I defined three capture groups, discarding the first two with NA and keeping the third as the column name:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(!c(Month, Year), names_to = c(NA, NA, ".value"),
names_pattern = "(\\w+) (\\w+) (\\w+$)")
# A tibble: 2 x 6
Month Year Name Rate Date Notes
<chr> <int> <chr> <int> <int> <int>
1 Jan 2001 Bob 1 2 3
2 Jan 2001 Joe 1 2 3
First I get a total amount for each category.
table <- my_data %>%
group_by(Category) %>%
summarise(Total = sum(count))
table
my result -
Category Total
<chr> <dbl>
1 Date Range 5
2 None 87
3 Product 1
4 Reset 2
What I want to do next is get each category's percentage of the overall total.
I tried it like this:
irisNew <- table %>% group_by(Total) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) # find percent of total
irisNew
but it gives me
Total count pct
<dbl> <int> <dbl>
1 1 3 0.3
2 2 1 0.1
3 3 1 0.1
4 5 1 0.1
I want something that gets percentage of each category compared to total amount
Category Total Percent
<chr> <dbl>
1 Date Range 5 %
2 None 87 %
3 Product 1 %
4 Reset 2 %
Starting from the table, we can ungroup and then divide 'Total' by the sum of 'Total':
library(dplyr)
table %>%
ungroup %>%
mutate(Percent = 100 * Total/sum(Total))
-output
# Category Total Percent
#1 Date Range 5 5.263158
#2 None 87 91.578947
#3 Product 1 1.052632
#4 Reset 2 2.105263
NOTE: table is a function name. It is better to avoid naming objects with function names
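As a small sketch of that advice, the same calculation can be written with a different object name, so that base R's table() is not masked. The name `totals` and the rounding to one decimal place are assumptions made here, not part of the question.

```r
library(dplyr)

# Same values as the summarised result, stored under a name that is not a function name
totals <- data.frame(
  Category = c("Date Range", "None", "Product", "Reset"),
  Total = c(5L, 87L, 1L, 2L)
)

totals <- totals %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))

totals
```

Rounding is optional; without it, the Percent column matches the unrounded output shown above.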
data
table <- structure(list(Category = c("Date Range", "None", "Product",
"Reset"), Total = c(5L, 87L, 1L, 2L)), class = "data.frame",
row.names = c("1",
"2", "3", "4"))
I want to combine rows that have almost the same values, while also combining the values that differ, so that I don't lose information I want to analyse later.
I have the following dataset:
SessionId Client id Product_type Item quantity
1 1 Couch 1
1 1 Table 1
2 2 Couch 1
2 2 Chair 5
I want to have an output like:
SessionId Client id Product_type Item quantity
1 1 Couch, Table 2
2 2 Couch, Chair 6
So I need to merge rows based on the session id. But for the column product type I want to paste character names behind each other and for the item quantity I want to sum the quantities. I have way more columns, but those values can stay the same.
Maybe I need to do it in two steps, but I'm not sure how to begin. Hopefully someone can help me out.
Try this.
library(dplyr)

d %>%
  group_by(SessionId, Client_id) %>%
  summarise(prod_type = toString(Product_type),            # "Couch, Table"
            sum_item_q = sum(Item_quantity, na.rm = TRUE)) # total per session
output as:
# A tibble: 2 x 4
# Groups: SessionId [2]
SessionId Client_id prod_type sum_item_q
<int> <int> <chr> <int>
1 1 1 Couch, Table 2
2 2 2 Couch, Chair 6
data
structure(list(SessionId = c(1L, 1L, 2L, 2L), Client_id = c(1L,
1L, 2L, 2L), Product_type = c("Couch", "Table", "Couch", "Chair"
), Item_quantity = c(1L, 1L, 1L, 5L)), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))->d
This can be achieved like so:
df <- read.table(text = "SessionId 'Client id' Product_type 'Item quantity'
1 1 Couch 1
1 1 Table 1
2 2 Couch 1
2 2 Chair 5", header = TRUE)
library(dplyr)
df %>%
group_by(SessionId, Client.id) %>%
summarise(Product_type = paste(Product_type, collapse = ", "),
Item.quantity = sum(Item.quantity))
#> # A tibble: 2 x 4
#> # Groups: SessionId [2]
#> SessionId Client.id Product_type Item.quantity
#> <int> <int> <chr> <int>
#> 1 1 1 Couch, Table 2
#> 2 2 2 Couch, Chair 6
Created on 2020-05-23 by the reprex package (v0.3.0)
Base R solution:
aggregate(.~SessionId+Client_Id, within(df, {Product_type <- as.character(Product_type)}),
FUN = function(x){if(is.integer(x)){sum(x)}else{toString(as.character(x))}})
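The question mentions many more columns whose values stay the same within a session. One possible way to carry them along, sketched here with dplyr, is to group by every column except the two being combined. The `Country` column and the object names `orders` and `combined` are hypothetical stand-ins for those extra columns.

```r
library(dplyr)

# Hypothetical data with one extra column that is constant within a session
orders <- data.frame(
  SessionId = c(1L, 1L, 2L, 2L),
  Client_id = c(1L, 1L, 2L, 2L),
  Country = c("NL", "NL", "BE", "BE"),
  Product_type = c("Couch", "Table", "Couch", "Chair"),
  Item_quantity = c(1L, 1L, 1L, 5L)
)

combined <- orders %>%
  # group by everything except the columns that get pasted or summed
  group_by(across(-c(Product_type, Item_quantity))) %>%
  summarise(Product_type = toString(Product_type),
            Item_quantity = sum(Item_quantity),
            .groups = "drop")

combined
```

This only works as intended if the remaining columns really are identical within each session; any column that varies would split the session into several rows.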
I have a household and member dataset in one long, flat format. There is a fixed number of members, and each member corresponds to a set of columns. For simplicity, assume 2 members per household and 2 questions asked of each member: age (Q1) and gender (Q2).
The file format looks as given below:
HHID, MEM_ID_1, MEM_ID_2, AGE_1, AGE_2, GENDER_1, GENDER_2
1 1 2 50 45 M F
And I want to convert it to the following format:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F
Let's say our data frame is test
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You could try the reshape function on this data frame as below:
reshape(test, direction = "long",
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2")),
v.names = c("MEM_ID","AGE","GENDER"),
idvar = 'HHID')
The reshape() function is part of base R. Broadly speaking, it can melt several sets of variables at once when you pass them through the varying parameter and set direction to "long".
In your case, we pass a list of three vectors of variable names to the varying argument:
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is below:
HHID time MEM_ID AGE GENDER
1.1 1 1 1 50 Male
1.2 1 2 2 45 Female
You can use tidyr::gather(), tidyr::separate(), and tidyr::spread() in order. Here household is the name of your data frame.
library(tidyverse)
1. gather
First, tidyr::gather(). Then you can get the below result.
household %>%
gather(-HHID, key = domestic, value = value)
#> HHID domestic value
#> 1 1 MEM_ID_1 1
#> 2 1 MEM_ID_2 2
#> 3 1 AGE_1 50
#> 4 1 AGE_2 45
#> 5 1 GENDER_1 M
#> 6 1 GENDER_2 F
Now all that is left to do is:
separate the domestic column at the underscore that precedes the digit (_[0-9]); as a regular expression with a lookahead, this is _(?=[0-9])
spread the data back into wide format, which gives the output you want.
2. Conclusion: entire code
household %>%
gather(-HHID, key = domestic, value = value) %>% # long data
separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
spread(domestic, value) %>% # wide format
select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#> HHID MEM_ID AGE GENDER
#> 1 1 1 50 M
#> 2 1 2 45 F
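Since gather() and spread() are superseded in current tidyr, the same reshape can also be sketched with a single pivot_longer() call. This assumes every member column ends in an underscore followed by the member number; the column name `member` and the object name `long` are chosen here for illustration.

```r
library(tidyr)

household <- data.frame(
  HHID = 1L,
  MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F"
)

# ".value" keeps the part before the final "_<digit>" as the column name;
# the digit itself goes into a new "member" column
long <- pivot_longer(household, cols = -HHID,
                     names_to = c(".value", "member"),
                     names_pattern = "(.+)_(\\d+)$")
long
```

Unlike the gather/spread route, this keeps the original column types, so AGE and MEM_ID stay integer instead of being converted to character.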