How to use gather in R instead of unite - r

I'm trying to use the gather function instead of unite to get the output. Is it possible to do so?
This is my data:
Description Temp
<fctr> <dbl>
1 location1:48:2018-10-23 -0.9381736
2 location2:83:2018-01-05 1.1714643
3 location3:73:2018-11-05 -0.7064954
4 location4:27:2018-07-26 0.4420571
5 location5:33:2018-02-03 0.9060360
6 location6:88:2018-04-27 1.9407284
I used separate to split the data with the following command:
library(tidyr)
sepData <- separate(data, Description, c("Location", "ID", "Date"), sep = ":")
Location ID Date Temp
<chr> <chr> <chr> <dbl>
1 location1 48 2018-10-23 -0.9381736
2 location2 83 2018-01-05 1.1714643
3 location3 73 2018-11-05 -0.7064954
4 location4 27 2018-07-26 0.4420571
5 location5 33 2018-02-03 0.9060360
6 location6 88 2018-04-27 1.9407284
Now I want to get the data back to its original form using gather.
Please help if possible.

If we check ?separate, it has an argument remove, which is TRUE by default. Changing it to FALSE keeps the original column in the dataset alongside the new ones:
separate(data, Description, c("Location", "ID", "Date"), sep = ":", remove = FALSE)
# Description Location ID Date Temp
#1 location1:48:2018-10-23 location1 48 2018-10-23 -0.9381736
#2 location2:83:2018-01-05 location2 83 2018-01-05 1.1714643
#3 location3:73:2018-11-05 location3 73 2018-11-05 -0.7064954
#4 location4:27:2018-07-26 location4 27 2018-07-26 0.4420571
#5 location5:33:2018-02-03 location5 33 2018-02-03 0.9060360
#6 location6:88:2018-04-27 location6 88 2018-04-27 1.9407284
data
data <- structure(list(Description = c("location1:48:2018-10-23",
"location2:83:2018-01-05",
"location3:73:2018-11-05", "location4:27:2018-07-26", "location5:33:2018-02-03",
"location6:88:2018-04-27"), Temp = c(-0.9381736, 1.1714643, -0.7064954,
0.4420571, 0.906036, 1.9407284)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

You can use paste:
library(dplyr)
togetherDate <- sepData %>%
  # Rebuild the original column, then keep only Description and Temp
  mutate(Description = as.factor(paste(Location, ID, Date, sep = ':'))) %>%
  select(Description, Temp)
This should return the same data.frame as before.
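If the goal is simply to reverse separate, tidyr's unite is the matching inverse; gather reshapes wide data to long and does not concatenate values. A minimal sketch, assuming tidyr is installed and using only the first two rows of the question's data:

```r
library(tidyr)

# First two rows of the question's data, kept as character for simplicity
data <- data.frame(
  Description = c("location1:48:2018-10-23", "location2:83:2018-01-05"),
  Temp = c(-0.9381736, 1.1714643),
  stringsAsFactors = FALSE
)

# Split Description into three columns, as in the question
sepData <- separate(data, Description, c("Location", "ID", "Date"), sep = ":")

# unite() pastes the three columns back together and drops them
restored <- unite(sepData, "Description", Location, ID, Date, sep = ":")
restored
```

Here restored has the same two columns as data, with Description holding the original strings (as character rather than factor).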

Related

Creating a new data frame in R based on unique values and time stamp

I'm new to R, hence this elementary question.
I have a data frame with ~700 rows and 25 columns. Each row is a single appointment with the information about that appointment (time, priority, gender). The rows have a unique identifier in the form of a 7 digit number and there are multiple rows for the same identifier (when the same person came in for more than one appointment).
ID   PRIORITY       TIME
234  Reading        10/29
546  Writing        10/30
678  Communication  10/29
546  Communication  11/1
234  Writing        11/1
What I would like to do is create a new dataframe that has each unique ID along with the priority of their first visit, second visit, etc.
ID   PRIORITY 1     PRIORITY 2
234  Reading        Writing
546  Writing        Communication
678  Communication
So far I have the list of all unique identifiers:
uniqueID <- unique(data$ID)
Now I would like to pull the data from PRIORITY based on these unique identifiers.
*Edit for better explanation of data:
ID      PRIORITY       TIME
581205  Communication  2021-08-28 10:00:00
938596  Reading        2021-08-30 09:00:00
582948  Writing        2021-09-01 05:00:00
535893  Reading        2021-09-01 12:00:00
938596  Writing        2021-09-02 08:00:00
582948  Communication  2021-09-02 08:30:00
581205  Writing        2021-09-03 09:00:00
482940  Reading        2021-09-03 09:30:00
*Edit
Adding dput format:
data<- structure(list(ID = c(581205, 938596, 582948, 535893, 938596, 582948, 581205, 482940), PRIORITY = c("Communication", "Reading", "Writing", "Reading", "Writing", "Communication", "Writing", "Reading"), TIME = structure(c(1630144800, 1630314000, 1630472400, 1630497600, 1630569600, 1630571400, 1630659600, 1630661400), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
You can do:
df <- data.frame(ID = c(234, 546, 678, 546, 234),
PRIORITY = c("Reading", "Writing", "Communication", "Communication", "Writing"),
TIME = c("10/29", "10/30", "10/29", "11/1", "11/1"))
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(ID_count = 1:n()) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID,
              values_from = c(PRIORITY, TIME),
              names_from = ID_count)
which gives:
# A tibble: 3 x 5
ID PRIORITY_1 PRIORITY_2 TIME_1 TIME_2
<dbl> <chr> <chr> <chr> <chr>
1 234 Reading Writing 10/29 11/1
2 546 Writing Communication 10/30 11/1
3 678 Communication <NA> 10/29 <NA>
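The same idea can be sketched in base R on the data from the question's edit, sorting by TIME first so that visit 1 really is the earliest appointment. This is only an illustration; the column name visit is introduced here and is not part of the original data.

```r
# Data from the question's edit, rebuilt as a plain data.frame
data <- data.frame(
  ID = c(581205, 938596, 582948, 535893, 938596, 582948, 581205, 482940),
  PRIORITY = c("Communication", "Reading", "Writing", "Reading",
               "Writing", "Communication", "Writing", "Reading"),
  TIME = as.POSIXct(c("2021-08-28 10:00:00", "2021-08-30 09:00:00",
                      "2021-09-01 05:00:00", "2021-09-01 12:00:00",
                      "2021-09-02 08:00:00", "2021-09-02 08:30:00",
                      "2021-09-03 09:00:00", "2021-09-03 09:30:00"),
                    tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order by ID and time so the visit counter follows chronological order
data <- data[order(data$ID, data$TIME), ]

# Number the visits within each ID: 1, 2, ...
data$visit <- ave(seq_along(data$ID), data$ID, FUN = seq_along)

# One row per ID, one PRIORITY_<n> column per visit; missing visits become NA
wide <- reshape(data[, c("ID", "PRIORITY", "visit")],
                idvar = "ID", timevar = "visit",
                direction = "wide", sep = "_")
wide
```

IDs with a single appointment get NA in PRIORITY_2.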
Another option is to pivot on TIME instead, which gives one column per date:
library(dplyr)
library(tidyr)
dummy_data <- data.frame(
"ID" = c(234,546,678,546,234),
"PRIORITY" = c("Reading","Writing","Communication","Communication","Writing"),
"TIME" = c("10/29","10/30","10/29","11/1","11/1"))
income_data_drop <- dummy_data %>% pivot_wider(names_from = "TIME", values_from = "PRIORITY")
income_data_drop
ID `10/29` `10/30` `11/1`
<dbl> <chr> <chr> <chr>
1 234 Reading <NA> Writing
2 546 <NA> Writing Communication
3 678 Communication <NA> <NA>

gather() with two key columns

I have a dataset with two header rows (Country and org_id) and want to tidy it using something like gather(), but I don't know how to mark both as key columns.
The data looks like:
Country US Canada US
org_id 332 778 920
02-15-20 25 35 54
03-15-20 30 10 60
And I want it to look like
country org_id date purchase_price
US 332 02-15-20 25
Canada 778 02-15-20 35
US 920 02-15-20 54
US 332 03-15-20 30
Canada 778 03-15-20 10
US 920 03-15-20 60
I know gather() can move the country row to a column, for example, but is there a way to move both the country and org_id rows to columns?
It is not a good idea to have duplicate column names in the data, so I'll rename one of them.
names(df)[4] <- 'US_1'
gather has been superseded by pivot_longer.
This is not a traditional reshape because the first row needs to be treated differently from the rest of the rows, so we can reshape the two parts separately and combine the results into one final data frame.
library(dplyr)
library(tidyr)
df1 <- df %>% slice(-1L) %>% pivot_longer(cols = -Country)
df %>%
  slice(1L) %>%
  pivot_longer(-Country, values_to = 'org_id') %>%
  select(-Country) %>%
  inner_join(df1, by = 'name') %>%
  rename(Country = name, date = Country) -> result
result
# Country org_id date value
# <chr> <int> <chr> <int>
#1 US 332 02-15-20 25
#2 US 332 03-15-20 30
#3 Canada 778 02-15-20 35
#4 Canada 778 03-15-20 10
#5 US_1 920 02-15-20 54
#6 US_1 920 03-15-20 60
data
df <- structure(list(Country = c("org_id", "02-15-20", "03-15-20"),
US = c(332L, 25L, 30L), Canada = c(778L, 35L, 10L), US = c(920L,
54L, 60L)), class = "data.frame", row.names = c(NA, -3L))
First, we paste together the column names (Country) and the first row (org_id); here the raw data frame is called data:
library(tidyverse)
data <- set_names(data, paste(names(data), data[1,], sep = "-"))
data
Country-org_id US-332 Canada-778 US-920
1 org_id 332 778 920
2 02-15-20 25 35 54
3 03-15-20 30 10 60
Then, we drop the first row, pivot the table and separate the column name.
df <- data %>%
  slice(2:n()) %>%
  rename(date = `Country-org_id`) %>%
  pivot_longer(cols = -date, values_to = "price") %>%
  separate(col = name, into = c("country", "org_id"), sep = "-")
df
# A tibble: 6 x 4
date country org_id price
<chr> <chr> <chr> <int>
1 02-15-20 US 332 25
2 02-15-20 Canada 778 35
3 02-15-20 US 920 54
4 03-15-20 US 332 30
5 03-15-20 Canada 778 10
6 03-15-20 US 920 60
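For comparison, the two-header-row layout can also be unpacked in base R by treating the column names and the first row as two parallel label vectors. This is a rough sketch that assumes the duplicate column has already been renamed to US_1; the names body, prices and tidy are made up for illustration.

```r
# Sample data with the duplicate column already renamed to US_1
df <- data.frame(
  Country = c("org_id", "02-15-20", "03-15-20"),
  US = c(332L, 25L, 30L),
  Canada = c(778L, 35L, 10L),
  US_1 = c(920L, 54L, 60L),
  stringsAsFactors = FALSE
)

# Header row 1: the column names, with the de-duplication suffix stripped
country <- sub("_\\d+$", "", names(df)[-1])

# Header row 2: the org_id values stored in the first data row
org_id <- unlist(df[1, -1], use.names = FALSE)

# Everything below the first row is date + prices
body <- df[-1, ]
prices <- as.matrix(body[, -1])

# Stack row by row: each date is repeated once per country/org_id pair
tidy <- data.frame(
  country = rep(country, times = nrow(body)),
  org_id = rep(org_id, times = nrow(body)),
  date = rep(body$Country, each = length(country)),
  purchase_price = as.vector(t(prices)),
  stringsAsFactors = FALSE
)
tidy
```

Stripping the suffix after reading the names gives back "US" for both US columns, so the result matches the layout requested in the question.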

How to rbind reshaped data tables of different column sizes and with different names

I checked similar entries in SO, none answers my question exactly.
My problem is this:
Let's say, User1 has 6 purchases, User2 has 2.
Purchase data is something like this:
set.seed(1234)
purchase <- data.frame(id = c(rep("User1", 6), rep("User2", 2)),
purchaseid = sample(seq(1, 100, 1), 8),
purchaseDate = seq(Sys.Date(), Sys.Date() + 7, 1),
price = sample(seq(30, 200, 10), 8))
#
users <- data.frame(id = c("User1","User2"),
uname = c("name1", "name2"),
uaddress = c("add1", "add2"))
> purchase
id purchaseid purchaseDate price
1 User1 12 2019-09-27 140
2 User1 62 2019-09-28 110
3 User1 60 2019-09-29 200
4 User1 61 2019-09-30 190
5 User1 83 2019-10-01 60
6 User1 97 2019-10-02 150
7 User2 1 2019-10-03 160
8 User2 22 2019-10-04 120
The end data should contain one row for each user, keeping the user name, address, etc., followed by the columns for up to 20 purchases. The purchase data needs to be placed one after another in the same row. This is the rule: only one row for each user. If the user does not have 20 purchases, the remaining fields should be empty.
End data should therefore look like this:
id uname uaddr p1id p1date p1price p2id p2date p2price p3id p3date p3price p4id
1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110 60 2019-09-29 200 61
2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120 NA <NA> NA NA
p4date p4price
1 2019-09-30 190
2 <NA> NA
enddata <- data.frame(id = c("User1", "User2"),
uname = c("name1", "name2"),
uaddr = c("add1", "add2"),
p1id = c(12,1),
p1date = c("2019-09-27","2019-10-03"),
p1price = c(140, 160),
p2id = c(62, 22),
p2date = c("2019-09-28", "2019-10-04"),
p2price = c(110, 120),
p3id = c(60, NA),
p3date = c("2019-09-29", NA),
p3price = c(200, NA),
p4id = c(61, NA),
p4date = c("2019-09-30", NA),
p4price = c(190, NA))
I used reshape to get the data for each user into wide format. The idea was to do it in a loop for each user id. Then I used rbindlist with fill = TRUE, but now I am having a problem with column names: after reshape, each user's table gets different column names, and without a fixed number of columns you cannot set the names either.
Any elegant solution to this?
There's no need to process each id separately. Instead we can operate by id within a single data frame. Below is a tidyverse approach. You can stop the chain at any point to see the intermediate output. I've added comments to explain what the code is doing, but let me know if anything is unclear.
library(tidyverse)
dat = users %>%
  # Join purchase data to user data
  left_join(purchase) %>%
  arrange(purchaseDate) %>%
  # Create a count column to assign a sequence number to each purchase within each id.
  # We'll use this later to create columns for each purchase event with a unique
  # sequence number for each purchase.
  group_by(id) %>%
  mutate(seq=1:n()) %>%
  ungroup %>%
  # Reshape data frame from "wide" to "long" format
  gather(key, value, purchaseid:price) %>%
  arrange(seq) %>%
  # Paste together the "key" and "seq" columns (the resulting column will still be
  # called "key"). This will allow us to spread the data frame to one row per id
  # with each purchase event properly numbered.
  unite(key, key, seq, sep="_") %>%
  mutate(key = factor(key, levels=unique(key))) %>%
  spread(key, value) %>%
  # Convert date columns back to Date class
  mutate_at(vars(matches("Date")), as.Date, origin="1970-01-01")
dat
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2
1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110
2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120
purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5
1 60 2019-09-29 200 61 2019-09-30 190 83 2019-10-01
2 NA <NA> NA NA <NA> NA NA <NA>
price_5 purchaseid_6 purchaseDate_6 price_6
1 60 97 2019-10-02 150
2 NA NA <NA> NA
Another option using data.table:
library(data.table)
# number each user's purchases, then pivot to wide format
setDT(users)
setDT(purchase)[, pno := rowid(id)]
ans <- dcast(purchase[users, on=.(id)], id + uname + uaddress ~ pno,
    value.var=c("purchaseid","purchaseDate", "price"))
# reorder columns by purchase number (compared as integers, so _10 sorts after _9)
nm <- grep("_[0-9]+$", names(ans), value=TRUE)
setcolorder(ans, c(setdiff(names(ans), nm), nm[order(as.integer(sub(".*_", "", nm)))]))
ans
output:
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 price_5 purchaseid_6 purchaseDate_6 price_6
1: User1 name1 add1 12 2019-09-30 140 62 2019-10-01 110 60 2019-10-02 200 61 2019-10-03 190 83 2019-10-04 60 97 2019-10-05 150
2: User2 name2 add2 1 2019-10-06 160 22 2019-10-07 120 NA <NA> NA NA <NA> NA NA <NA> NA NA <NA> NA
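When a user can have ten or more purchases, as with the 20-purchase requirement in the question, the numeric suffix has to be compared as a number, because a character sort places "10" before "2". A small base R sketch with made-up column names shows the difference:

```r
# Hypothetical wide column names, including a two-digit purchase number
nm <- c("price_1", "price_2", "price_10",
        "purchaseid_1", "purchaseid_2", "purchaseid_10")

# Extract the purchase number after the last underscore
suffix <- sub(".*_", "", nm)

# Sorting the suffix as text puts "10" between "1" and "2"
by_text <- nm[order(suffix)]

# Sorting it as an integer keeps purchases in their real order
by_number <- nm[order(as.integer(suffix))]

by_text
by_number
```

Because order() leaves ties in their original order, columns that share a purchase number keep the relative order they had before sorting.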

R: Loop through data frame, extract subset of multiple variables, then store in an aggregate dataset

I have an aggregate data table with about 60 million rows. Simplified, the data looks like this:
ServiceN Customer Product LValue EDate CovBDate CovEDate
1 1 12 3 2016-08-03 2016-07-07 2017-07-06
2 1 12 19 2016-07-07 2016-07-07 2017-07-06
3 2 23 222 2017-09-09 2016-10-01 2017-09-31
4 2 23 100 2017-10-01 2017-10-01 2018-09-31
I need to go through each row and subset the entire dataset by Customer with all entry dates(EDate) between CovBDate and CovEDate. Then, I need to find the sum of the LValue for each product (we're only looking at 10, so it's not terrible).
As an example, the final dataset would look something like this:
ServiceN Customer Product LValue EDate CovBDate CovEDate Prod12 Prod23
1 1 12 3 2016-08-03 2016-07-07 2017-07-06 22 0
2 1 12 19 2016-07-07 2016-07-07 2017-07-06 22 0
3 2 23 222 2017-09-09 2016-10-01 2017-09-31 0 222
4 2 23 100 2017-10-01 2017-10-01 2018-09-31 0 100
I don't know where to begin on this problem, however, I've started with this (which does not work):
for (i in 1:length(nrow)) {
tempdata<-dataset[Customer==Customer[i] & EDate>=CovBDate[i] &
EDate<=CovEDate[i]] #data.table subsetting
tempdata$Prod12<- with(tempdata, sum(LValue[Product== "12"], na.rm=T))
#I could make this a function, but I want to get this for loop automated first...
tempdata$Prod23<- with(tempdata, sum(LValue[Product=="23"], na.rm=T))
}
My questions, therefore, are:
1) How do I make this for loop work with so many variables?
2) How do I make the new variable get added to the original dataset (called dataset)?
Using dplyr you could do something like this:
library(dplyr)
dataset <- data.frame(ServiceN = c("1", "2", "3", "4"),
Customer = c("1", "1", "2", "2"),
Product = c("12", "12", "23", "23"),
LValue = c(3, 19, 222, 100),
EDate = c("2016-08-03", "2016-07-07", "2017-09-09", "2017-10-01"),
CovBDate = c("2016-07-07", "2016-07-07", "2016-10-01", "2017-10-01"),
CovEDate = c("2017-07-06", "2017-07-06", "2017-09-31", "2018-09-31"),
stringsAsFactors = FALSE)
## Group by customer and product so summary results are per-customer/product combination
dataset %>% group_by(Customer, Product) %>%
## Filter based on dates
filter(EDate >= CovBDate & EDate <= CovEDate) %>%
## Sum the LValue based on the defined groupings
summarise(Sum = sum(LValue))
## A tibble: 2 x 3
## Groups: Customer [?]
# Customer Product Sum
#<chr> <chr> <dbl>
#1 1 12 22
#2 2 23 322
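To get the per-row columns shown in the question (Prod12, Prod23), each row needs its own coverage window. The following is a rough base R sketch of that logic, not a tuned solution: the helper window_sum is hypothetical, and looping over rows like this would be slow on 60 million rows. The dates are compared as ISO-formatted strings because the sample contains 2017-09-31, which is not a real calendar date.

```r
dataset <- data.frame(
  Customer = c("1", "1", "2", "2"),
  Product = c("12", "12", "23", "23"),
  LValue = c(3, 19, 222, 100),
  EDate = c("2016-08-03", "2016-07-07", "2017-09-09", "2017-10-01"),
  CovBDate = c("2016-07-07", "2016-07-07", "2016-10-01", "2017-10-01"),
  CovEDate = c("2017-07-06", "2017-07-06", "2017-09-31", "2018-09-31"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total LValue of `product` for the customer of row i,
# counting only entries whose EDate falls inside row i's coverage window
window_sum <- function(i, product) {
  in_window <- dataset$Customer == dataset$Customer[i] &
    dataset$Product == product &
    dataset$EDate >= dataset$CovBDate[i] &
    dataset$EDate <= dataset$CovEDate[i]
  sum(dataset$LValue[in_window], na.rm = TRUE)
}

# Add one ProdXX column per product to the original data
for (p in unique(dataset$Product)) {
  dataset[[paste0("Prod", p)]] <-
    vapply(seq_len(nrow(dataset)), window_sum, numeric(1), product = p)
}
dataset
```

On the sample rows this gives Prod12 = 22, 22, 0, 0 and Prod23 = 0, 0, 222, 100, matching the expected table in the question.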

Function that returns NA if subset is empty

I would like an efficient function or code snippet that tries to subset a vector, and returns NA if there are no elements in the subset. For example, for
v1 = c(1, 1, NA)
The code unique(v1[!is.na(v1)]) returns one entry which is great, but for
v2 = c(NA, NA, NA)
the code unique(v2[!is.na(v2)]) returns logical(0) which is not great, when this subsetting operation is used as part of a dplyr chain containing summarise_each or summarise. I would like the second operation to return NA instead of logical(0).
The context behind this is that I am trying to solve this question using multiple spread commands. Example data taken from the previous question:
set.seed(10)
tmp_dat <- data_frame(
Person = rep(c("greg", "sally", "sue"), each=2),
Time = rep(c("Pre", "Post"), 3),
Score1 = round(rnorm(6, mean = 80, sd=4), 0),
Score2 = round(jitter(Score1, 15), 0),
Score3 = 5 + (Score1 + Score2)/2
)
> tmp_dat
Source: local data frame [6 x 5]
Person Time Score1 Score2 Score3
<chr> <chr> <dbl> <dbl> <dbl>
1 greg Pre 80 78 84.0
2 greg Post 79 80 84.5
3 sally Pre 75 74 79.5
4 sally Post 78 78 83.0
5 sue Pre 81 78 84.5
6 sue Post 82 81 86.5
Now, using multiple spreads we can achieve the desired output (albeit with different column names):
tmp_dat %>%
mutate(Time_2 = Time,
Time_3 = Time) %>%
spread(Time, Score1, sep = '.') %>%
spread(Time_2, Score2, sep = '.') %>%
spread(Time_3, Score3, sep = '.') %>%
group_by(Person) %>%
summarise_each(funs(((function(x)x[!is.na(x)])(.))))
Now, the problem arises if there are too many NA's:
# Replace last two entries in the last row with NA's
tmp_dat$Score2[6] <- NA
tmp_dat$Score3[6] <- NA
Now running the code snippet with the summarise_each produces the error:
Error in eval(substitute(expr), envir, enclos) : expecting a single value
This can be easily done with dcast from data.table, which can take multiple value.var columns:
library(data.table)
dcast(setDT(tmp_dat), Person ~paste0("Time.", Time),
value.var = c("Score1", "Score2", "Score3"))
# Person Score1_Time.Post Score1_Time.Pre Score2_Time.Post Score2_Time.Pre Score3_Time.Post Score3_Time.Pre
#1: greg 79 80 80 78 84.5 84.0
#2: sally 78 75 78 74 83.0 79.5
#3: sue 82 81 NA 78 NA 84.5
If we need to use dplyr/tidyr, an option would be to gather the 'Score' columns to 'long' format, unite columns to a single column ('Time1') and then do the spread
library(dplyr)
library(tidyr)
gather(tmp_dat, Var, Val, Score1:Score3) %>%
mutate(TimeN = 'Time', Var = sub("\\D+", "", Var)) %>%
unite(Time1, TimeN, Time, Var) %>%
spread(Time1, Val)
# # A tibble: 3 × 7
# Person Time_Post_1 Time_Post_2 Time_Post_3 Time_Pre_1 Time_Pre_2 Time_Pre_3
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 greg 79 80 84.5 80 78 84.0
#2 sally 78 78 83.0 75 74 79.5
#3 sue 82 NA NA 81 78 84.5
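For the literal question in the title, a small wrapper can fall back to NA when nothing is left after dropping missing values. This is a minimal sketch; unique_or_na is a made-up name, and it can still return more than one value when the input holds several distinct non-missing values.

```r
# Hypothetical helper: unique non-missing values, or a single NA if none remain
unique_or_na <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) {
    # Indexing with NA yields one NA of the same type as x
    x[NA_integer_]
  } else {
    vals
  }
}

v1 <- c(1, 1, NA)
v2 <- c(NA, NA, NA)

unique_or_na(v1)  # 1
unique_or_na(v2)  # NA
```

Returning x[NA_integer_] rather than a bare NA keeps the result the same type as the input, which helps when the function is used inside summarise on numeric columns.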
