I am trying to do something very simple in R: transpose a data set so I can create a primary key for joining with other tables that have many values.
I've tried dcast and aggregate, and haven't gotten them to work.
Here's what my data frame currently looks like:
[image: current R dataframe]
Here's what I would like it to look like:
[image: new R dataframe]
You can insert code in your post, so paste the code that creates your data.frame, like this:
df <- data.frame(
  Make = c('Ford', 'Ford', 'Ford', 'Chevy', 'Chrysler', 'Chrysler'),
  DateSold = c('2017-07-01', '2017-08-01', '2017-10-01', '2017-01-01', '2017-03-01', '2017-04-01'),
  Amount = c(30, 15, 25, 23, 22, 21) * 1e3
)
Now for your question: you can use the tidyverse, which has a lot of useful functions for manipulating data. You can execute the following code line by line to understand the steps that lead to the solution.
library(tidyverse)
df %>%
  gather(-Make, key = Column, value = Value) %>%
  group_by(Make, Column) %>%
  mutate(Count = 1:n()) %>%
  unite(Column_count, Column, Count) %>%
  spread(Column_count, Value)
# Make Amount_1 Amount_2 Amount_3 DateSold_1 DateSold_2 DateSold_3
# <fct> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Chevy 23000 NA NA 2017-01-01 NA NA
# 2 Chrysler 22000 21000 NA 2017-03-01 2017-04-01 NA
# 3 Ford 30000 15000 25000 2017-07-01 2017-08-01 2017-10-01
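Note that gather() and spread() have since been superseded by pivot_longer() and pivot_wider() in current tidyr. A sketch of the same reshape with the newer verbs; here pivot_wider() alone is enough, so the long-format detour is not needed:
library(tidyverse)
df %>%
  # Number each sale within its Make so the wide columns get unique names
  group_by(Make) %>%
  mutate(Count = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Count,
              values_from = c(DateSold, Amount))
Unlike the gather()/spread() version, this keeps Amount numeric instead of coercing every value to character.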
Using reshape, you can do something like the following, where ave(Amount, Make, FUN = seq_along) builds the within-group sequence that reshape picks up as its default timevar:
reshape(transform(df, time = ave(Amount, Make, FUN = seq_along)), dir = 'wide', idvar = 'Make')
Make DateSold.1 Amount.1 DateSold.2 Amount.2 DateSold.3 Amount.3
1 Ford 2017-07-01 30000 2017-08-01 15000 2017-10-01 25000
4 Chevy 2017-01-01 23000 <NA> NA <NA> NA
5 Chrysler 2017-03-01 22000 2017-04-01 21000 <NA> NA
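Since the question mentions trying dcast without success, here is a sketch of that route with data.table, assuming the df defined above; the trick is the same as in the other answers, i.e. numbering the rows within each Make before casting:
library(data.table)
# setDT converts df to a data.table in place; rowid(Make) numbers the rows
# within each Make
setDT(df)[, n := rowid(Make)]
dcast(df, Make ~ n, value.var = c("DateSold", "Amount"))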
I have over 100 tables. Each ID has multiple columns (ID, Date, Days, Mass, Float, Date 2, Days 2, pH).
I split the IDs from the data frame and made them the names of the tables as shown below.
data <- NN                 # NN is the original data frame
ID <- as.character(NN$ID)  # IDs as character, used as split labels
SD <- split(NN, ID)        # named list with one data frame per ID
SD
Each of the IDs looks as follows:
> SD$`4469912`
# A tibble: 5 × 8
ID Date Days Mass Float `Date 2` `Days 2` pH
<dbl> <dttm> <dbl> <dbl> <dbl> <dttm> <dbl> <chr>
1 4469912 2022-05-24 00:00:00 0 440 16.9 NA 0 NA
2 4469912 2022-05-27 00:00:00 3 813 NA NA 0 NA
3 4469912 2022-06-02 00:00:00 9 930 NA NA 0 NA
4 4469912 2022-06-03 00:00:00 10 914. NA NA 0 NA
5 4469912 2022-06-06 00:00:00 13 944 NA NA 0 NA
I would like to convert each ID to its own data frame, as shown below:
`4469912`<- data.frame(SD$`4469912`)
or, equivalently:
`4469912`<- data.frame(SD[9])
The problem I am running into is writing a loop that creates each table as its own data frame, with each data frame named after its corresponding ID; something along the lines of the code below (which doesn't work):
for (x in SD) {
  names(SD[x]) <- data.frame(SD[x])
}
EDIT: I will add that the end goal is to pull or select specific IDs and plot them on top of or against one another in ggplot, with each ID as its own geom_line. For example:
`4469912`<- data.frame(SD$`4469912`)
`4469822`<- data.frame(SD$`4469822`)
`4469222`<- data.frame(SD$`4469222`)
ggplot(data = NULL, aes(x = Date, y = Mass)) +
  # colour is set outside aes() so the literal colour names are used
  geom_line(data = `4469912`, colour = "red") +
  geom_line(data = `4469822`, colour = "blue") +
  geom_line(data = `4469222`, colour = "green")
Rather than plotting the entirety of my original data frame, I can then assess falloff or regression between selected IDs rather than across all of the data points.
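For what it's worth, a minimal sketch of one way around the naming loop, assuming SD is the named list returned by split() above: list2env() assigns every list element to a variable named after its list name. That said, keeping the data in one frame and letting ggplot do the grouping is usually more convenient for the stated end goal:
# Assign each element of SD to the global environment under its own name,
# e.g. SD$`4469912` becomes the variable `4469912`
list2env(SD, envir = globalenv())
# Alternatively, skip the per-ID variables entirely and filter the IDs of interest
library(dplyr)
library(ggplot2)
bind_rows(SD) %>%
  filter(ID %in% c(4469912, 4469822, 4469222)) %>%
  ggplot(aes(x = Date, y = Mass, colour = factor(ID))) +
  geom_line()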
I'm working with a large data set in RStudio that includes multiple test scores for the same individuals. I've filtered my data set to display the same individual's scores in two consecutive rows with the test date for each test administration in one column. My data appears as follows:
id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline
I would like to calculate the duration of time between the baseline 1 and baseline 2 administrations and store that value in a new column. So my first question is: what is the best way to calculate the duration between two dates? And second: what is the best way to condense each individual's data into one row, so that the difference between test scores is easier to calculate and store in a new column?
Thank you for any assistance!
This is a solution within the tidyverse. The packages we are going to use are dplyr and tidyr.
First, we create the dataset (you would read it from a file instead) and convert the date strings to Date format. Note the na.strings = "N/A" argument: it turns the literal "N/A" strings into real missing values, matching the output below.
library(dplyr)
library(tidyr)
dataset <- read.table(text = "id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline", header = TRUE, na.strings = "N/A")
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")
# id test_date score baseline_number_1 baseline_number_2
# 1 1 2017-08-15 21.18 Baseline <NA>
# 2 1 2019-08-28 28.55 <NA> Baseline
# 3 2 2017-11-22 33.38 Baseline <NA>
# 4 2 2019-11-06 35.30 <NA> Baseline
# 5 3 2018-07-25 30.77 Baseline <NA>
# 6 3 2019-07-31 33.42 <NA> Baseline
Condensing each individual's data into one row and computing the difference between the two baselines can be done as follows:
dataset %>%
  group_by(id) %>%
  mutate(number = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = number,
    values_from = c(test_date, score),
    names_glue = "{.value}_{number}"
  ) %>%
  mutate(
    time_between = test_date_2 - test_date_1
  )
Brief explanation: first we create the variable number, which indicates the baseline number in each row; then we use pivot_wider to make the dataset "wider", i.e. one row for each id along with its features; finally we create the variable time_between, which contains the difference in days between the two baselines. If you are not familiar with some of these functions, I suggest you break the pipeline after each operation and analyse it step by step.
Final output
# A tibble: 3 x 6
# id test_date_1 test_date_2 score_1 score_2 time_between
# <int> <date> <date> <dbl> <dbl> <drtn>
# 1 1 2017-08-15 2019-08-28 21.2 28.6 743 days
# 2 2 2017-11-22 2019-11-06 33.4 35.3 714 days
# 3 3 2018-07-25 2019-07-31 30.8 33.4 371 days
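Note that test_date_2 - test_date_1 returns a difftime (hence the <drtn> column above). If you need a plain number of days instead, for example to average durations later, you can convert it explicitly. A small sketch, assuming the pipeline above was assigned to a variable (dataset_wide is a name used here for illustration only):
library(dplyr)
dataset_wide <- dataset_wide %>%
  mutate(time_between_days = as.numeric(time_between, units = "days"))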
I'm working in the tidyverse. I read in several CSV files using read_csv (all have the same columns), e.g.
df <- read_csv("data.csv")
to obtain a series of data frames. After a bunch of data cleaning and calculations, I want to merge all the data frames.
There are a dozen dataframes of several hundred rows, and a few dozen columns. A minimal example is
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <lgl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu NA 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
Based on my cleaning requirements (where start is NA and stop is not NA), some of the NAs in start must become 00:00. I can enter a zero in those cells:
df <- within(df, start[is.na(start) & !is.na(stop)] <- 0)
This results in
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <dbl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu 0 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
I run into issues on merging, because start is sometimes a double (where I've made replacements), sometimes logical (where the column was all NAs with no replacements), and sometimes a time (where there were times in the original data).
merged_df <- bind_rows(DF1, DF2,...)
gives me the error: Error: Column `start` can't be converted from hms, difftime to numeric
How do I coerce the start column to be of the type time so that I may merge my data?
I think the important point is that the columns start and stop, which appear to be of type time, are based on the hms package. I wondered why time is displayed as the type, because I had not heard about this class before.
As I see it, these columns are actually of class hms and difftime. Such objects are stored not in minutes (as the printed tibble suggests) but in seconds. We can see this if we look at the data via View(df). Interestingly, if we print the tibble, the variable type is displayed as time.
To solve your problem, you have to convert all your start and stop columns consistently into hms/difftime columns, as in the example below.
Minimal reproducible example:
library(dplyr)
library(hms)
df1 <- tibble(id = 1:3,
              start = as_hms(as.difftime(c(1*60, NA, 8*60), units = "mins")),
              stop = as_hms(as.difftime(c(3*60, NA, 11*60), units = "mins")))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop = as_hms(as.difftime(c(NA, 3*60, NA), units = "mins")))
Or even easier (but with slightly different printing than in the question):
df1 <- tibble(id = 1:3,
              start = as_hms(c(1*60, NA, 8*60)),
              stop = as_hms(c(3*60, NA, 11*60)))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop = as_hms(c(NA, 3*60, NA)))
Solving the problem:
class(df1$start) # In df1 start has class hms and difftime
class(df2$start) # In df2 start has class logical
# We set start=0 if stop is not missing and turn the whole column into an hms object
df2 <- df2 %>% mutate(start = new_hms(ifelse(!is.na(stop), 0, NA)))
# Now that column types are consistent across tibbles we can easily bind them together
df <- bind_rows(df1, df2)
df
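If you have a dozen such data frames, here is a sketch of harmonizing them all in one pass before binding, assuming they are collected in a list (across() requires dplyr >= 1.0):
library(dplyr)
library(hms)
dfs <- list(df1, df2)  # in practice, the list of your cleaned data frames
# Coerce start/stop to hms in every frame, whether the column arrived as
# logical (all NA), double (after replacements), or hms already: as.numeric()
# yields seconds for hms columns and NA for all-NA logical ones.
dfs <- lapply(dfs, function(d) {
  mutate(d, across(c(start, stop), ~ as_hms(as.numeric(.x))))
})
merged_df <- bind_rows(dfs)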
I checked similar entries on SO; none answers my question exactly.
My problem is this:
Let's say, User1 has 6 purchases, User2 has 2.
Purchase data is something like this:
set.seed(1234)
purchase <- data.frame(id = c(rep("User1", 6), rep("User2", 2)),
                       purchaseid = sample(seq(1, 100, 1), 8),
                       purchaseDate = seq(Sys.Date(), Sys.Date() + 7, 1),
                       price = sample(seq(30, 200, 10), 8))

users <- data.frame(id = c("User1", "User2"),
                    uname = c("name1", "name2"),
                    uaddress = c("add1", "add2"))
> purchase
id purchaseid purchaseDate price
1 User1 12 2019-09-27 140
2 User1 62 2019-09-28 110
3 User1 60 2019-09-29 200
4 User1 61 2019-09-30 190
5 User1 83 2019-10-01 60
6 User1 97 2019-10-02 150
7 User2 1 2019-10-03 160
8 User2 22 2019-10-04 120
The required end data has one row for each user, holding the user name, address, etc., followed by columns for up to 20 purchases placed one after another in the same row. This is the rule: only one row for each user, and if the user does not have 20 purchases, the remaining fields should be empty.
End data should therefore look like this:
id uname uaddr p1id p1date p1price p2id p2date p2price p3id p3date p3price p4id
1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110 60 2019-09-29 200 61
2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120 NA <NA> NA NA
p4date p4price
1 2019-09-30 190
2 <NA> NA
enddata <- data.frame(id = c("User1", "User2"),
                      uname = c("name1", "name2"),
                      uaddr = c("add1", "add2"),
                      p1id = c(12, 1),
                      p1date = c("2019-09-27", "2019-10-03"),
                      p1price = c(140, 160),
                      p2id = c(62, 22),
                      p2date = c("2019-09-28", "2019-10-04"),
                      p2price = c(110, 120),
                      p3id = c(60, NA),
                      p3date = c("2019-09-29", NA),
                      p3price = c(200, NA),
                      p4id = c(61, NA),
                      p4date = c("2019-09-30", NA),
                      p4price = c(190, NA))
I used reshape to get the data for each user into the wide format, with the idea of doing it in a loop for each user id, and then combined the results with rbindlist(fill = TRUE). But this time I have a problem with column names: after reshape, each user's data gets different column names, and without a fixed number of columns you cannot set names either.
Is there an elegant solution to this?
There's no need to process each id separately. Instead we can operate by id within a single data frame. Below is a tidyverse approach. You can stop the chain at any point to see the intermediate output. I've added comments to explain what the code is doing, but let me know if anything is unclear.
library(tidyverse)
dat = users %>%
  # Join purchase data to user data
  left_join(purchase) %>%
  arrange(purchaseDate) %>%
  # Create a count column to assign a sequence number to each purchase within
  # each id. We'll use this later to create columns for each purchase event
  # with a unique sequence number for each purchase.
  group_by(id) %>%
  mutate(seq = 1:n()) %>%
  ungroup %>%
  # Reshape the data frame from "wide" to "long" format
  gather(key, value, purchaseid:price) %>%
  arrange(seq) %>%
  # Paste together the "key" and "seq" columns (the resulting column will still
  # be called "key"). This will allow us to spread the data frame to one row
  # per id with each purchase event properly numbered.
  unite(key, key, seq, sep="_") %>%
  mutate(key = factor(key, levels=unique(key))) %>%
  spread(key, value) %>%
  # Convert date columns back to Date class
  mutate_at(vars(matches("Date")), as.Date, origin="1970-01-01")
dat
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2
1 User1 name1 add1 12 2019-09-27 140 62 2019-09-28 110
2 User2 name2 add2 1 2019-10-03 160 22 2019-10-04 120
purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5
1 60 2019-09-29 200 61 2019-09-30 190 83 2019-10-01
2 NA <NA> NA NA <NA> NA NA <NA>
price_5 purchaseid_6 purchaseDate_6 price_6
1 60 97 2019-10-02 150
2 NA NA <NA> NA
Another option using data.table:
#pivot to wide format
setDT(users)
setDT(purchase)[, pno := rowid(id)]
ans <- dcast(purchase[users, on=.(id)], id + uname + uaddress ~ pno,
value.var=c("purchaseid","purchaseDate", "price"))
#reorder columns
nm <- grep("[1-9]$", names(ans), value=TRUE)
setcolorder(ans, c(setdiff(names(ans), nm), nm[order(gsub("(.*)_", "", nm))]))
ans
output:
id uname uaddress purchaseid_1 purchaseDate_1 price_1 purchaseid_2 purchaseDate_2 price_2 purchaseid_3 purchaseDate_3 price_3 purchaseid_4 purchaseDate_4 price_4 purchaseid_5 purchaseDate_5 price_5 purchaseid_6 purchaseDate_6 price_6
1: User1 name1 add1 12 2019-09-30 140 62 2019-10-01 110 60 2019-10-02 200 61 2019-10-03 190 83 2019-10-04 60 97 2019-10-05 150
2: User2 name2 add2 1 2019-10-06 160 22 2019-10-07 120 NA <NA> NA NA <NA> NA NA <NA> NA NA <NA> NA
I'm new to Stack Overflow and looked at similar posts, but couldn't find a solution that can capture time differences between multiple events for the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The Helper column is the StoreID and Unit combined, because I couldn't figure out how to group by both StoreID and Unit. I want to sort the data to show when the unit was disabled (value = 0) and enabled again (value = 1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0)   Time(v=1)   Time2(v=0)  Time2(v=1)
a        1    a1     2016-10-04  2016-10-18  2016-10-21  2016-10-22
b        2    b2     2016-10-04  2016-10-18
c        5    c5     2016-10-19  2016-10-28  2017-03-04
d        6    d6     2016-10-04  2016-10-19
Any thoughts?
I'm thinking something in dplyr, but am stumped about where to go from here.
Create a Headers column that combines the Value column with a row number to distinguish duplicates, then spread to wide format.
I didn't use the Helper column; I grouped by StoreID and Unit instead:
library(dplyr)
library(tidyr)

df <- data.frame(StoreID, Unit, Time, Value)
df %>%
  group_by(StoreID, Unit, Value) %>%
  mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
  ungroup() %>%
  select(-Value) %>%
  spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA
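For what it's worth, spread() has since been superseded by pivot_wider(); here is a sketch of the same reshape with the newer verb, where names_glue reproduces the header style above (the column order may differ):
library(dplyr)
library(tidyr)
df %>%
  group_by(StoreID, Unit, Value) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = c(n, Value),
              values_from = Time,
              names_glue = "Time {n} (v={Value})")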