Transforming a data frame in R [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I have a dataframe in R with client information and sales per product. Product is a field with multiple values, and sales is a separate field. I would like to convert the table so that the sales for each product have their own column, giving one row per client (rather than one row per client per product). I have seen information on how to transpose a table, but this is different. Below are two simplified examples of what I am starting with and the desired end result. The real data will have many more columns, clients and products.
Starting point:
start <- data.frame(client = c(1,1,1,2,2,2),
product=c("Product1","Product2","Product3","Product1","Product2","Product3"),
sales = c(100,500,300,200,400,600))
Output:
  client  product sales
1      1 Product1   100
2      1 Product2   500
3      1 Product3   300
4      2 Product1   200
5      2 Product2   400
6      2 Product3   600
Following is the desired end result:
end <- data.frame(client = c(1,2),
Product1 = c(100,200), Product2 = c(500,400),
Product3 = c(300,600))
Output:
  client Product1 Product2 Product3
1      1      100      500      300
2      2      200      400      600
How can I transform this data from the start to end in R? Thanks in advance for any assistance!

You can use dcast() from the reshape2 package. Passing value.var explicitly avoids the "Using sales as value column" message:
> install.packages("reshape2") # install 'reshape2' (only needed once)
> library(reshape2)
> dcast(start, client ~ product, value.var = "sales")
  client Product1 Product2 Product3
1      1      100      500      300
2      2      200      400      600
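If adding a package is not an option, the same long-to-wide step can probably be done with base R's reshape(). This is only a sketch on the example data from the question; the name `wide` and the renaming step are my own additions, not part of the original answer.

```r
start <- data.frame(client = c(1, 1, 1, 2, 2, 2),
                    product = c("Product1", "Product2", "Product3",
                                "Product1", "Product2", "Product3"),
                    sales = c(100, 500, 300, 200, 400, 600))

# One row per client; each distinct product becomes its own column.
wide <- reshape(start, idvar = "client", timevar = "product",
                direction = "wide")

# reshape() names the new columns "sales.Product1", ...; drop the prefix.
names(wide) <- sub("^sales\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

With many value columns, reshape() widens every column that is not listed in idvar or timevar, so it may be worth selecting the columns you need first.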

Related

How to create a dataset of ids that get treated at some point in time? (R) [duplicate]

This question already has answers here:
Select groups which have at least one of a certain value
(3 answers)
How to keep all instances of a column ID if a specific value is found in another column 1 or more times [duplicate]
(2 answers)
Closed 11 months ago.
I have a longitudinal dataset with 3 important variables: ID, Year and Treatment.
I would like to keep all the IDs that get treated at some point in time and drop all the IDs that never get treated. How do I do this in R?
Example:
ID   Year Treatment
0001 2000 0
0001 2001 0
0001 2002 0
0002 2000 0
0002 2001 0
0002 2002 1
I would like to keep all observations of ID 0002 (Treated at some point in time), but drop all of ID 0001 (Never treated). I have a very big dataset with more IDs than that so I can not do this manually.
Thanks in advance.
Find the IDs that have treatment, then subset those IDs:
d[ d$ID %in% unique(d[ d$Treatment == 1, "ID" ]), ]
# ID Year Treatment
# 4 0002 2000 0
# 5 0002 2001 0
# 6 0002 2002 1
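For anyone who wants to run the line above, here is a small reproducible sketch. The data frame `d` is rebuilt from the example table in the question (IDs stored as character so the leading zeros survive), and the names `treated_ids`, `kept` and `kept_ave` are only illustrative. The ave() variant is an alternative I am adding, not part of the original answer.

```r
# Example data from the question; ID is character to keep the leading zeros.
d <- data.frame(ID = rep(c("0001", "0002"), each = 3),
                Year = rep(2000:2002, times = 2),
                Treatment = c(0, 0, 0, 0, 0, 1),
                stringsAsFactors = FALSE)

# Step 1: IDs that were treated in at least one year.
treated_ids <- unique(d$ID[d$Treatment == 1])

# Step 2: keep every row belonging to one of those IDs.
kept <- d[d$ID %in% treated_ids, ]

# Alternative: a per-row flag that is TRUE if the row's ID is ever treated.
kept_ave <- d[ave(d$Treatment == 1, d$ID, FUN = any), ]
kept
```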

How can I do this split process with this sequence in R?

I'm trying to create a corpus in the BILOU format, and I want to reuse an existing table in which each sentence is split across columns, with each column holding one entity. How can I split every cell into words and tag them in sequence as B_xxx, I_xxx, L_xxx, restarting the sequence whenever the entity (column) changes?
Old dataframe:
first_entity <- c("Product and Other","Product2 and Second", "Product")
second_entity <- c("Price and Prices","Price2", "Price3 and example")
df <- data.frame(first_entity, second_entity)
df
         first_entity      second_entity
1   Product and Other   Price and Prices
2 Product2 and Second             Price2
3             Product Price3 and example
Desired dataframe:
Word Ent
1 Product B_pro
2 and I_pro
3 Other L_pro
4 Price B_pri
5 and I_pri
6 Prices L_pri
7 Product2 B_pro
8 and I_pro
9 Second L_pro
10 Price2 B_pri
11 Product B_pro
12 Price3 B_pri
13 and I_pri
14 example L_pri
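One possible way to get there in base R is sketched below. It rests on two assumptions that the question does not state: the tag suffixes ("pro", "pri") come from a hand-written lookup called `labels`, and a cell with a single word is tagged B_ (as in the desired output above) rather than with a separate U_ tag. The helper `tag_cell` is hypothetical.

```r
first_entity  <- c("Product and Other", "Product2 and Second", "Product")
second_entity <- c("Price and Prices", "Price2", "Price3 and example")
df <- data.frame(first_entity, second_entity, stringsAsFactors = FALSE)

# Assumed mapping from column name to tag suffix (not given in the question).
labels <- c(first_entity = "pro", second_entity = "pri")

# Tag the words of one cell: B_ for the first word, L_ for the last,
# I_ for everything in between.
tag_cell <- function(text, label) {
  words <- strsplit(text, " +")[[1]]
  n <- length(words)
  prefix <- rep("I", n)
  prefix[n] <- "L"
  prefix[1] <- "B"  # set last, so a single-word cell ends up as B_
  data.frame(Word = words, Ent = paste0(prefix, "_", label),
             stringsAsFactors = FALSE)
}

# Walk the table row by row and, within each row, column by column.
pieces <- list()
for (i in seq_len(nrow(df))) {
  for (col in names(labels)) {
    pieces[[length(pieces) + 1]] <- tag_cell(df[i, col], labels[[col]])
  }
}
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

If single-word entities should be tagged U_ instead, only the assignment to `prefix[1]` needs an extra branch for n == 1.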

R Create dummy datasets based on reference dataset

Context
I'd like to build two dummy survey dataframes for a project. One dataframe has responses to a Relationship survey, and the other to a Pulse survey.
Here are what each look like -
Relationship Dataframe
#Relationship Data
rel_data= data.frame(
TYPE=rep('Relationship',446),
SURVEY_ID = rep('SURVEY 2018 Z662700',446),
SITE_ID=rep('Z662700',446),
START_DATE= rep(as.Date('2018-07-01'),446),
END_DATE= rep(as.Date('2018-07-04'),446)
)
Pulse Dataframe
#Pulse Data
pulse_data= data.frame(
TYPE=rep('Pulse',525),
SURVEY_ID = rep('SURVEY 2018 W554800',525),
SITE_ID=rep('W554800',525),
START_DATE= rep(as.Date('2018-04-01'),525),
END_DATE= rep(as.Date('2018-04-04'),525)
)
My Objective
I'd like to add columns to each of these two dataframes, based on conditions from a reference table.
The reference table consists of the questions to be added to each of the two survey dataframes, along with further details on each question asked. This is what it looks like
Reference Table
#Reference Table - Question Bank
qbank= data.frame(QUEST_ID=c('QR1','QR2','QR3','QR4','QR5','QP1','QP2','QP3','QP4','QP5','QP6'),
QUEST_TYPE=c('Relationship','Relationship','Relationship','Relationship','Relationship',
'Pulse','Pulse','Pulse','Pulse','Pulse','Pulse'),
SCALE=c('Preference','Satisfaction','Satisfaction','Satisfaction','Preference','NPS',
'Satisfaction','Satisfaction','Satisfaction','Preference','Open-Ended'),
FOLLOWUP=c('No','No','No','No','No','No','Yes','No','Yes','No','No'))
The Steps
For each survey dataframe (Relationship and Pulse), I'd like to do the following -
1) Lookup their respective question codes in the reference table, and add only those questions to the dataframe. For example, the Relationship dataframe would have only question codes pertaining to TYPE = 'Relationship' from the reference table. And the same for the Pulse dataframe.
2) The responses to each question would be conditionally added to each dataframe. Here are the conditions -
If SCALE = 'Preference' in the Reference table, then responses would be either 150,100,50,0 or -50. Also, these numbers would be generated in any random order.
If SCALE = 'NPS' in the Reference table, then responses would range from 0 to 10. Numbers would be generated such that the Net Promoter Score (NPS) equals 50%. Reminder: NPS = Percentage of 9s & 10s minus Percentage of 0s to 6s.
If SCALE = 'Satisfaction' in the Reference table, then responses would range from 1 (Extremely Dissatisfied) to 5 (Extremely Satisfied). Numbers would be generated such that the percentage of 1s & 2s equal 90%.
If SCALE = 'Open-Ended' in the Reference table, then ensure the column is empty (i.e. contains no responses).
My Attempt
Using this previously asked question for the conditional response creation and this one to add columns from the reference table, I attempted to solve the problem. But I haven't got what I was looking for yet.
Any inputs on this would be greatly appreciated
Desired Output
My desired output tables would look like this -
Relationship Dataframe Output
TYPE SURVEY_ID SITE_ID START_DATE END_DATE QR1 QR2 QR3 QR4 QR5
1 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 150 5 1 2 2
2 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 100 1 2 2 2
3 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 100 4 5 2 2
4 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 150 1 1 2 2
and so on
And the Pulse Dataframe Output
TYPE SURVEY_ID SITE_ID START_DATE END_DATE QP1 QP2 QP3 QP4 QP5 QP6
1 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 7 1 3 3 100
2 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 8 5 3 1 100
3 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 3 1 4 3 100
4 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 1 2 4 3 100
and so on
Would something like this work for you?
library(dplyr)
library(tidyr)
rel_data %>%
  left_join(qbank, by = c("TYPE" = "QUEST_TYPE")) %>%
  select(-FOLLOWUP) %>%
  unique() %>%
  mutate(val = case_when(SCALE == "Preference" ~ "A",
                         SCALE == "Satisfaction" ~ "B",
                         SCALE == "NPS" ~ "C",
                         TRUE ~ NA_character_)) %>%
  select(-SCALE) %>%
  spread(key = QUEST_ID, value = val)
The values "A", "B" and "C" are placeholders; you can modify the case_when conditions to fit your needs.
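To make the conditions from the question concrete, here is a base R sketch that generates one response per respondent for every question of the matching survey type. The helpers `draw`, `gen_responses` and `add_questions` are hypothetical names. The 60% promoters / 10% detractors split is just one of many ways to reach an NPS of about 50%, and the shares only hit 50% and 90% as closely as rounding allows. The FOLLOWUP column is left out of `qbank` because the sketch does not use it.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

pulse_data <- data.frame(
  TYPE = rep("Pulse", 525),
  SURVEY_ID = rep("SURVEY 2018 W554800", 525),
  SITE_ID = rep("W554800", 525),
  START_DATE = rep(as.Date("2018-04-01"), 525),
  END_DATE = rep(as.Date("2018-04-04"), 525),
  stringsAsFactors = FALSE
)

qbank <- data.frame(
  QUEST_ID = c("QR1", "QR2", "QR3", "QR4", "QR5",
               "QP1", "QP2", "QP3", "QP4", "QP5", "QP6"),
  QUEST_TYPE = rep(c("Relationship", "Pulse"), times = c(5, 6)),
  SCALE = c("Preference", "Satisfaction", "Satisfaction", "Satisfaction",
            "Preference", "NPS", "Satisfaction", "Satisfaction",
            "Satisfaction", "Preference", "Open-Ended"),
  stringsAsFactors = FALSE
)

# Draw n values from `values` with replacement.
draw <- function(values, n) {
  values[sample.int(length(values), n, replace = TRUE)]
}

# n responses for one scale, following the rules listed in the question.
gen_responses <- function(scale, n) {
  out <- switch(scale,
    "Preference" = draw(c(150, 100, 50, 0, -50), n),
    "Satisfaction" = {
      n_low <- round(0.9 * n)            # about 90% of answers are 1 or 2
      c(draw(1:2, n_low), draw(3:5, n - n_low))
    },
    "NPS" = {
      n_pro <- round(0.6 * n)            # assumed share of promoters (9-10)
      n_det <- round(0.1 * n)            # assumed share of detractors (0-6)
      c(draw(9:10, n_pro), draw(0:6, n_det), draw(7:8, n - n_pro - n_det))
    },
    "Open-Ended" = rep(NA, n)            # empty column
  )
  out[sample.int(n)]                     # shuffle into random order
}

# Add one column per question whose QUEST_TYPE matches the survey's TYPE.
add_questions <- function(survey, qbank) {
  qs <- qbank[qbank$QUEST_TYPE == survey$TYPE[1], ]
  for (i in seq_len(nrow(qs))) {
    survey[[qs$QUEST_ID[i]]] <- gen_responses(qs$SCALE[i], nrow(survey))
  }
  survey
}

pulse_out <- add_questions(pulse_data, qbank)
head(pulse_out)
```

The same call, `add_questions(rel_data, qbank)`, should work for the Relationship dataframe, since the lookup is driven by the TYPE column.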

Joins in R while also spreading out information from one data frame

I am attempting to join together two data frames. One contains records of when certain events happened. The other contains daily information on values that occurred for a given organization.
My current challenge is how to join together the information in the "when certain events happened" data frame fully into the records data frame. Most of dplyr's joins appear to simply join one line together. I need to fully spread out the record information based on start and end dates.
In other words, I need to spread out information from one line into many lines, while simultaneously joining to the daily data table. It is important that I do this in R because the alternative is quite a bit of filtering and dragging in Excel (the information covers thousands of rows).
Below is a representation of the daily data table
value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4
Below is a representation of the records table
year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes
And finally, here is what I am aiming for in the end:
value month day org link event event_info
12 1 1 AA AA-1-1-1
45 1 2 AA AA-1-1-2 Buy Yes
31 1 3 AA AA-1-1-3 Buy Yes
10 1 4 AA AA-1-1-4
Is there any way to accomplish this in R? I have tried using dplyr joins but usually am only able to join together the initial link.
Edit: The second "end" link refers to an end date. In the records table this is all in one line, while the second data frame has daily information.
Edit: Below I have put together a cleaner look at my real data. The first image is of DAILY DATA while the second is of RECORDS OF EVENTS. The third is what I would like to see (ideally).
[Image 1: Daily data, which will have multiple orgs present]
[Image 2: Records data, note org id AA and the audience]
[Image 3: Ideal combined data]
First, build a date sequence for each record from its start and end links. Unnesting those sequences gives a long version of df2 with one row per day, which we then right-join onto df1:
library(tidyverse)
df2 %>%
separate(link,c("org1","year1","month1","day1")) %>%
separate(end_link,c("org2","year2","month2","day2")) %>%
rowwise %>%
transmute(org,event,event_info, date = list(
as.Date(paste0(year1,"-",month1,"-",day1)):as.Date(paste0(year2,"-",month2,"-",day2)))) %>%
unnest %>%
right_join(df1 %>% mutate(date=as.numeric(as.Date(paste0(year,"-",month,"-",day))))) %>%
select(value, month, day, org, link, event,event_info)
# # A tibble: 4 x 7
# value month day org link event event_info
# <int> <int> <int> <chr> <chr> <chr> <chr>
# 1 12 1 1 AA AA-1-1 <NA> <NA>
# 2 45 1 2 AA AA-1-2 Buy Yes
# 3 31 1 3 AA AA-1-3 Buy Yes
# 4 10 1 4 AA AA-1-4 <NA> <NA>
data
df1 <- read.table(text="value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text="year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes", header = TRUE, stringsAsFactors = FALSE)
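The same expand-then-join idea can also be sketched in base R, without the tidyverse. This is only an illustration on the example data: the helper `link_to_date` and the intermediate names `starts`, `ends`, `long` and `out` are mine, and I am assuming every link has the form org-year-month-day, as in the records table.

```r
df1 <- read.table(text = "value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes", header = TRUE, stringsAsFactors = FALSE)

# Turn a link such as "AA-1-1-2" (org-year-month-day) into a Date.
link_to_date <- function(link) {
  parts <- do.call(rbind, strsplit(link, "-"))
  as.Date(sprintf("%04d-%02d-%02d", as.integer(parts[, 2]),
                  as.integer(parts[, 3]), as.integer(parts[, 4])))
}

starts <- link_to_date(df2$link)
ends   <- link_to_date(df2$end_link)

# One row per record per day between its start and end date (inclusive).
long <- do.call(rbind, lapply(seq_len(nrow(df2)), function(i) {
  data.frame(org = df2$org[i],
             date = seq(starts[i], ends[i], by = "day"),
             event = df2$event[i],
             event_info = df2$event_info[i],
             stringsAsFactors = FALSE)
}))

# Give the daily table a matching date column.
df1$date <- as.Date(sprintf("%04d-%02d-%02d", df1$year, df1$month, df1$day))

# Left join: keep every daily row, fill in events where a record covers the day.
out <- merge(df1, long, by = c("org", "date"), all.x = TRUE)
out <- out[order(out$org, out$date),
           c("value", "month", "day", "org", "link", "event", "event_info")]
out
```

Joining on both org and date means events from one organization should not leak onto another organization's days once the daily table holds several orgs.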
I would use the data.table package; for me it is the best R package for data analysis. I hope I have understood the problem correctly; let me know if it does not work.
The first part creates the data set. I created the two data.table objects in two different ways just to show both alternatives. You could also read your data directly from Excel, .txt, .csv or similar files; let me know if you want to know how to do this.
library(data.table)
value<-c(12,45,31,10)
year<-c(1,1,1,1)
month<-c(1,1,1,1)
day<-c(1,2,3,4)
org<-c("AA","AA","AA","AA")
link<-c("AA-1-1","AA-1-2","AA-1-3","AA-1-4")
Daily_dt<-data.table(value, year,month,day,org,link)
Records_dt<-data.table(year=c(1,1),month=c(1,1),day=c(2,3),org=c("AA","BB"),link=c("AA-1-1-2","BB-1-2-7"),end_link=c("AA-1-1-3","BB-1-2-10"),
event=c("Buy","Buy"),event_info=c("Yes","Yes"))
Daily_dt[,Date:=as.Date(paste(year,"-",month,"-",day,sep=""))]
To achieve what you want you need these lines
Records_dt=rbind(Records_dt[,c("org","link","event","event_info")],
Records_dt[,list(org,link=end_link,event,event_info)])
Record_Dates<-as.data.table(tstrsplit(Records_dt$link,"-")[-1])
Record_Dates[,Dates:=as.Date(paste(V1,"-",V2,"-",V3,sep=""))]
Records_dt[,Date:=Record_Dates$Dates]
setkey(Records_dt,Date)
setkey(Daily_dt,Date)
Records_dt<-Records_dt[,c("Date","event","event_info")][Daily_dt,]
Records_dt<-Records_dt[,c("value","month","day","org","link","event","event_info")]
and this is the result
> Records_dt
value month day org link event event_info
1: 12 1 1 AA AA-1-1 NA NA
2: 45 1 2 AA AA-1-2 Buy Yes
3: 31 1 3 AA AA-1-3 Buy Yes
4: 10 1 4 AA AA-1-4 NA NA
If your input data had more than one event on the same day (with or without the same org), for example:
> Records_dt
year month day org link end_link event event_info
1: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
2: 1 1 3 BB BB-1-2-7 BB-1-2-10 Buy Yes
3: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
4: 1 1 3 AA AA-1-2-7 AA-1-2-10 Buy Yes
then some tweaks may be required. I am not sure whether you need this, so I did not add it.

Only changing a single variable in R

I have a dataframe dt:
Group Age Sales
A1234 12 1000
A2312 11 900
B2100 23 2100
...
I intend to create a new dataframe by modifying the Group variable, keeping only a substring of Group. At present, I can do it in 2 steps:
dt1<- dt
dt1$Group<- substr(dt$Group,1,2)
Is it possible to do the above in a single command? I guess the two-step approach would get tedious if I have to create and transform many intermediate dataframes along the way.
You can try:
dt1<-`$<-`(dt,"Group",substr(dt$Group,1,2))
dt1
# Group Age Sales
#1 A1 12 1000
#2 A2 11 900
#3 B2 23 2100
dt
# Group Age Sales
#1 A1234 12 1000
#2 A2312 11 900
#3 B2100 23 2100
The original table is unchanged and you get the new one with a single line.
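A perhaps more readable one-liner with the same effect is base R's transform(), which returns a modified copy and leaves its input alone. The sketch below rebuilds `dt` from the three example rows shown in the question; this is an alternative I am adding, not the approach from the answer above.

```r
dt <- data.frame(Group = c("A1234", "A2312", "B2100"),
                 Age = c(12, 11, 23),
                 Sales = c(1000, 900, 2100),
                 stringsAsFactors = FALSE)

# transform() evaluates the expression inside dt and returns a new data frame.
dt1 <- transform(dt, Group = substr(Group, 1, 2))

dt1$Group  # shortened codes
dt$Group   # original codes, unchanged
```

within(dt, Group <- substr(Group, 1, 2)) should behave the same way if you prefer assignment syntax.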
