Generate new variable based on table of conditions - r

I am trying to write a loop in R that creates a new variable based on a table of conditional outcomes.
I have four treatment groups (A, B, C, D). Each treatment group pays a different price at three different time periods (day, dinner, night).
Treatment Group Day Price Dinnertime Price Night Price
A 10 20 7
B 11 25 8
C 12 30 9
D 13 35 10
The time period is recorded as a given "hour" (day is hours 8-17, dinner is from 17-19 and night is from 19-0 and 0-8).
Hour Usage
Person 1 1 0
Person 1 2 0
Person 2 20 5
Person 3 17 6
Based on both treatment group (A, B, C and D) and time of day (night, day, dinnertime), I would like to create a new vector of prices.
Ideally, I would create dummy variables for each of the time periods (day, night and dinner) based on these hourly conditions. However, my data set is pretty large (24 observations per person per day) so I'm looking for a more elegant solution.
In plain language, I want this:
if group==A & time==night, then price=7 --> and this information saved in a new variable "price"
Any advice?
Edit: Question is about the loop with two conditions. Is there a way to refer this directly to the data-frame with the treatment groups and tariffs or do I just need to write it manually?

Assuming that you have some way of including a column for the group each person belongs to in the dataframe with the transactions on it. Then something like this may work for you.
df.pricing <- structure(list(Treatment.Group = c("A", "B", "C", "D"), Day.Price = 10:13,
Dinnertime.Price = c(20L, 25L, 30L, 35L), Night.Price = 7:10),
.Names = c("Treatment.Group", "Day.Price", "Dinnertime.Price", "Night.Price"),
class = "data.frame",
row.names = c(NA, -4L))
df.transactions <- structure(list(Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
Hour = c(1L, 2L, 20L, 17L, 9L),
Usage = c(0L, 0L, 5L, 6L, 2L)),
.Names = c("Person", "Hour", "Usage"),
class = "data.frame", row.names = c(NA, -5L))
# Add the group that each person belongs to
df.transactions$group <- c("A","A","B","C","D")
# Get the transaction price
df.transactions$price <- apply(df.transactions, 1, function(x){
hour <- as.numeric(x[["Hour"]])
price <- ifelse(hour >= 8 & hour <= 16, df.pricing[df.pricing$Treatment.Group == x[["group"]], "Day.Price"],
ifelse((hour > 16 & hour <= 18), df.pricing[df.pricing$Treatment.Group == x[["group"]], "Dinnertime.Price"],
df.pricing[df.pricing$Treatment.Group == x[["group"]], "Night.Price"]))})

Related

How to calculate elapsed times for the total duration of events?

I have collected a dataframe that models the duration of time for events in a group problem solving session in which the members Communicate (Discourse Code) and construct models (Modeling Code). Each minute that that occurs is captured in the Time_Processed column. Technically these events occur simultaneously. I would like to know how long the students are constructing each type of model which is the total duration of that model or the time elapsed before that model changes.
I have the following dataset:
Looks like this:
`Modeling Code` `Discourse Code` Time_Processed
<fct> <fct> <dbl>
1 OFF OFF 10.0
2 MA Q 11.0
3 MA AG 16.0
4 V S 18.0
5 V Q 20.0
6 MA C 21.0
7 MA C 23.0
8 MA C 25.0
9 V J 26.0
10 P S 28.0
# My explicit dataframe.
df <- structure(list(`Modeling Code` = structure(c(3L, 2L, 2L, 6L,
6L, 2L, 2L, 2L, 6L, 4L), .Label = c("A", "MA", "OFF", "P", "SM",
"V"), class = "factor"), `Discourse Code` = structure(c(7L, 8L,
1L, 9L, 8L, 2L, 2L, 2L, 6L, 9L), .Label = c("AG", "C", "D", "DA",
"G", "J", "OFF", "Q", "S"), class = "factor"), Time_Processed = c(10,
11, 16, 18, 20, 21, 23, 25, 26, 28)), row.names = c(NA, -10L), .Names = c("Modeling Code",
"Discourse Code", "Time_Processed"), class = c("tbl_df", "tbl",
"data.frame"))
For this dataframe I can find how often the students were constructing each type of model logically like this.
With Respect to the Modeling Code and Time_Processed columns,
At 10 minutes they are using the OFF model method, then at 11 minutes, they change the model so the duration of the OFF model is (11 - 10) minutes = 1 minute. There are no other occurrences of the "OFF" method so the duration of OFF = 1 min.
Likewise, for Modeling Code method "MA", the model is used from 11 minutes to 16 minutes (duration = 5 minutes) and then from 16 minutes to 18 minutes before the model changes to V with (duration = 2 minutes), then the model is used again at 21 minutes and ends at 26 minutes (duration = 5 minutes). So the total duration of "MA" is (5 + 2 + 5) minutes = 12 minutes.
Likewise the duration of Modeling Code method "V" starts at 18 minutes, ends at 21 minutes (duration = 3 minutes), resumes at 26 minutes, ends at 28 minutes (duration = 2) minutes. So total duration of "V" is 3 + 2 = 5 minutes.
Then the duration of Modeling Code P, starts at 28 minutes and there is no continuity so total duration of P is 0 minutes.
So the total duration (minutes) table of the Modeling Codes is this:
Modeling Code Total_Duration
OFF 1
MA 12
V 5
P 0
This models a barchart that looks like this:
How can the total duration of these modeling methods be constructed?
It would also be nice to know the duration of the combinations
such that the only visible combination in this small subset happens to be Modeling Code "MA" paired with Discourse Code "C" and this occurs for 26 - 21 = 5 minutes.
Thank you.
UPDATED SOLUTION
df %>%
mutate(dur = lead(Time_Processed) - Time_Processed) %>%
replace_na(list(dur = 0)) %>%
group_by(`Modeling Code`) %>%
summarise(tot_time = sum(dur))
(^ Thanks to Nick DiQuattro)
PREVIOUS SOLUTION
Here's one solution that creates a new variable, mcode_grp, which keeps track of discrete groupings of the same Modeling Code. It's not particularly pretty - it requires looping over each row in df - but it works.
First, rename columns for ease of reference:
df <- df %>%
rename(m_code = `Modeling Code`,
d_code = `Discourse Code`)
We'll update df with a few extra variables.
- lead_time_proc gives us the Time_Processed value for the next row in df, which we'll need when computing the total amount of time for each m_code batch
- row_n for keeping track of row number in our iteration
- mcode_grp is the unique label for each m_code batch
df <- df %>%
mutate(lead_time_proc = lead(Time_Processed),
row_n = row_number(),
mcode_grp = "")
Next, we need a way to keep track of when we've hit a new batch of a given m_code value. One way is to keep a counter for each m_code, and increment it whenever a new batch is reached. Then we can label all the rows for that m_code batch as belonging to the same time window.
mcode_ct <- df %>%
group_by(m_code) %>%
summarise(ct = 0) %>%
mutate(m_code = as.character(m_code))
This is the ugliest part. We loop over every row in df, and check to see if we've reached a new m_code. If so, we update accordingly, and register a value for mcode_grp for each row.
mc <- ""
for (i in 1:nrow(df)) {
current_mc <- df$m_code[i]
if (current_mc != mc) {
mc <- current_mc
mcode_ct <- mcode_ct %>% mutate(ct = ifelse(m_code == mc, ct + 1, ct))
current_grp <- mcode_ct %>% filter(m_code == mc) %>% select(ct) %>% pull()
}
df <- df %>% mutate(mcode_grp = ifelse(row_n == i, current_grp, mcode_grp))
}
Finally, group_by m_code and mcode_grp, compute the duration for each batch, and then sum over m_code values.
df %>%
group_by(m_code, mcode_grp) %>%
summarise(start_time = min(Time_Processed),
end_time = max(lead_time_proc)) %>%
mutate(total_time = end_time - start_time) %>%
group_by(m_code) %>%
summarise(total_time = sum(total_time)) %>%
replace_na(list(total_time=0))
Output:
# A tibble: 4 x 2
m_code total_time
<fct> <dbl>
1 MA 12.
2 OFF 1.
3 P 0.
4 V 5.
For any dplyr/tidyverse experts out there, I'd love tips on how to accomplish more of this without resorting to loops and counters!

how can I group based on similarity in strings

I have a data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to get the largest name (in letter) and then see how many smaller and similar names are and assign them to a group
then go for another next large name and assign them to another group
until no group left
at first I calculate the length of each so I will have the length of them
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings have the letters similar to this long string . we have these possibilities
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations but the order of letter should be the same (from left to right) for example it should be Afghand cannot be fAhg
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
it is because they should be exactly similar and not even a letter different (more than the largest string) to be assigned to the same group
The desire output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
why indiaAfghanestan is a seperate group? because it does not completely belong to another name (it has partially name from one or another). it should be part of a bigger name
I tried to use this one Find similar strings and reconcile them within one dataframe which did not help me at all
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.

How do I plot boxplots of two different series?

I have 2 dataframe sharing the same rows IDs but with different columns
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where I have in x axis the chrom and coord combined (so 3 points) and for each x value 2 boxplots side by side corresponding to the two dataframes ?
What is the best way of doing this ? Should I merge the two dataframes together somehow in order to get only one and loop over the boxplots rendering by 3 columns ?
Any idea on how this can be done ?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframe in order to get the same number of column but got lost on how to this properly
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape t melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot such using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what i'd do
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
dd<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(dd[order(dd[[1]],dd[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like

Combing two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")

split dataset by day and save it as data frame

I have a dataset with 2 months of data (month of Feb and March). Can I know how can I split the data into 59 subsets of data by day and save it as data frame (28 days for Feb and 31 days for Mar)? Preferably to save the data frame in different name according to the date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by year&month of the 'created' field. So here are two ways do do that: 1. is simpler if you don't plan on needing any more date-arithmetic
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string
df$created <- as.Date(df$created, '%M/%d/%Y')
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014

Resources