Using rbind.fill does not fill the correct value - r

I don’t understand how rbind.fill works, I guess. I have a data frame called main.df:
TLT PCY SHY VTI TIP VNQ VWO RWX VEA DBC GLD
Pct 0 0 0 0 0 0 0 0 0 0 0
I want to bind the following different-sized data frame named p.df to it:
VWO VEA VTI
Pct 0.3333333 0.3333333 0.3333333
When I execute rbind.fill(main.df, p.df) I get:
TLT PCY SHY VTI TIP VNQ VWO RWX VEA DBC GLD
1 0 0 0 0 0 0 0 0 0 0 0
2 NA NA NA 1 NA NA 1 NA 1 NA NA
which is not what I want. I expected to get:
TLT PCY SHY VTI TIP VNQ VWO RWX VEA DBC GLD
1 0 0 0 0 0 0 0 0 0 0 0
2 NA NA NA 0.333 NA NA 0.333 NA 0.333 NA NA
How do I do this? The dput of my objects are below.
main.df <- structure(list(
TLT = 0, PCY = 0, SHY = 0, VTI = 0, TIP = 0, VNQ = 0, VWO = 0, RWX = 0, VEA = 0, DBC = 0, GLD = 0),
.Names = c("TLT", "PCY", "SHY", "VTI", "TIP", "VNQ", "VWO", "RWX", "VEA", "DBC", "GLD"),
row.names = "Pct", class = "data.frame")
p.df <- structure(list(
VWO = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VWO"), class = "factor"),
VEA = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VEA"), class = "factor"),
VTI = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VTI"), class = "factor")),
.Names = c("VWO", "VEA", "VTI"), row.names = "Pct", class = "data.frame")

Judging by the dput output, the columns of p.df are factors, not numeric vectors, which is why rbind.fill ends up with the integer level codes (1) instead of the values. main.df is already numeric, so only p.df needs converting:
p.df[] <- lapply(p.df, function(f) as.numeric(levels(f))[f])
The levels also contain the column names ("VWO", "VEA", "VTI"), so the conversion will warn about NAs introduced by coercion; the values actually stored in the rows are converted correctly.
See "How to convert a factor to an integer\numeric without a loss of information" for details.
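The level-code behaviour behind the unexpected 1s can be reproduced on a single value. This is a minimal sketch, assuming a factor shaped like the VWO column in the dput output above; the variable name f is only for illustration.

```r
# A factor with the same levels as the VWO column in p.df
f <- factor("0.3333333", levels = c("0.3333333", "VWO"))

# Converting the factor directly returns the internal level code
code <- as.numeric(f)

# Going through the character label recovers the stored value
value <- as.numeric(as.character(f))

print(code)
print(value)
```

The stray "VWO" level hints that the header row may have been read in as data when p.df was built, so fixing the import step might make the conversion unnecessary.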

Related

Nested ifelse to output 3 responses in R

This is a related question from my original post found here: How to create a new variable based on condition from different dataframe in R
I have 2 data frames from an experiment. The 1st df reads a (roughly) continuous signal over 40 mins. There are 5 columns, 1:3 are binary - saying whether a button was pushed. The 4th column is a binary of if either from column 2 or 3 was pushed. The 5th column is an approximate time in seconds. Example from df below:
initiate  left  right  l or r  time
0         0     1      1       2.8225
0         0     1      1       2.82375
0         0     1      1       2.82500
0         0     1      1       2.82625
1         0     0      0       16.8200
1         0     0      0       16.8212
etc.
The 2nd data frame is session info where each row is a trial, usually 100-150 rows depending on the day. I have a column that marks trial start time and another column that marks trial end time in seconds. I have another column that states whether or not the trial had an intervention. Example from df below (I omitted several irrelevant columns):
trial  control  t start   t end
1      0        16.64709  35.49431
2      0        41.81843  57.74304
3      0        65.54510  71.16612
4      0        82.65743  87.30914
11     3        187.0787  193.5898
12     0        200.0486  203.1883
30     3        415.1710  418.0405
etc.
For the 1st data frame, I want to create a column that indicates whether or not the button was pushed within a trial. If the button was indeed pushed within a trial, I need to label it based on intervention. This is based on those start and end times in the 2nd df, along with the control info. In this table, 0 = intervention and 3 = control.
I would like it to look something like this (iti = inter-trial, wt_int = within trial & intervention, wt_control = within trial & control):
initiate  left  right  l or r  time      trial_type
0         0     1      1       2.8225    iti
0         0     1      1       2.82375   iti
0         0     1      1       2.82500   iti
0         0     1      1       2.82625   iti
1         0     0      0       16.82000  wt_int
1         0     0      0       16.82125  wt_int
1         0     0      0       187.0800  wt_control
etc.
Going off previous recommendations, I've tried nested ifelse statements with no success. I can get it to label all of the trials as either "iti" or "wt_int" with different failed attempts, or an error at row 1037 (when it changes from iti to wt). From my original question I have a "trial" column now in my 1st df which I'm using for the following code. Perhaps there is a more straightforward approach that combines the original code?
Errors out part way through:
df %>%
rowwise() %>%
mutate(trial_type = ifelse(any(trial == "wt" & df2$control == 0,
ifelse(trial == "wt" & df2$control == 3,
"wt_omission", "iti"), "wt_odor")))
Also tried this, which labels all as wt_int:
df$trial_type <- ifelse(df$trial == 'wt' && df2$control == 0,
ifelse(df$trial == 'wt' && df2$control == 3,
"wt_control", "iti"), "wt_int")
Thank you!
You could use cut to create intervals from the trial start and end times and check whether each time value falls into one of them:
library(dplyr)
df1 %>%
mutate(
check_1 = cut(time, breaks = df2$t_start, labels = FALSE),
check_2 = coalesce(cut(time, breaks = df2$t_end, labels = FALSE), 0),
check_3 = df2$control[check_1],
trial_type = case_when(
check_1 - check_2 == 1 & check_3 == 0 ~ "wt_int",
check_1 - check_2 == 1 & check_3 == 3 ~ "wt_control",
TRUE ~ "iti"
)
) %>%
select(-starts_with("check_"))
This returns
# A tibble: 7 x 6
initiate left right l_or_r time trial_type
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 0 0 1 1 2.82 iti
2 0 0 1 1 2.82 iti
3 0 0 1 1 2.82 iti
4 0 0 1 1 2.83 iti
5 1 0 0 0 16.8 wt_int
6 1 0 0 0 16.8 wt_int
7 1 0 0 0 187. wt_control
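The same interval check can also be written in base R with findInterval. This is an alternative sketch rather than the approach used above. It assumes t_start is sorted in ascending order and that trials do not overlap. The function name label_trials is hypothetical.

```r
# Label each time stamp as inter-trial ("iti") or within a trial,
# split by the control code of that trial (0 = intervention, 3 = control).
label_trials <- function(time, t_start, t_end, control) {
  # Index of the last trial that started at or before each time (0 if none)
  idx <- findInterval(time, t_start)
  safe <- pmax(idx, 1L)  # avoid indexing with 0
  # Inside a trial only if that trial has not ended yet
  inside <- idx > 0 & time <= t_end[safe]
  out <- rep("iti", length(time))
  out[inside & control[safe] == 0] <- "wt_int"
  out[inside & control[safe] == 3] <- "wt_control"
  out
}

time    <- c(2.8225, 2.82375, 2.825, 2.82625, 16.82, 16.8212, 187.08)
t_start <- c(16.64709, 41.81843, 65.5451, 82.65743, 187.0787, 200.0486, 415.171)
t_end   <- c(35.49431, 57.74304, 71.16612, 87.30914, 193.5898, 203.1883, 418.0405)
control <- c(0, 0, 0, 0, 3, 0, 3)

trial_type <- label_trials(time, t_start, t_end, control)
print(trial_type)
```

One design difference: this version also labels times that fall inside the last trial, which cut with breaks = t_start cannot do because the last start time has no upper break.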
Data
df1 <- structure(list(initiate = c(0, 0, 0, 0, 1, 1, 1), left = c(0,
0, 0, 0, 0, 0, 0), right = c(1, 1, 1, 1, 0, 0, 0), l_or_r = c(1,
1, 1, 1, 0, 0, 0), time = c(2.8225, 2.82375, 2.825, 2.82625,
16.82, 16.8212, 187.08)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(initiate = structure(list(), class = c("collector_double",
"collector")), left = structure(list(), class = c("collector_double",
"collector")), right = structure(list(), class = c("collector_double",
"collector")), l_or_r = structure(list(), class = c("collector_double",
"collector")), time = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(trial = c(1, 2, 3, 4, 11, 12, 30), control = c(0,
0, 0, 0, 3, 0, 3), t_start = c(16.64709, 41.81843, 65.5451, 82.65743,
187.0787, 200.0486, 415.171), t_end = c(35.49431, 57.74304, 71.16612,
87.30914, 193.5898, 203.1883, 418.0405)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(trial = structure(list(), class = c("collector_double",
"collector")), control = structure(list(), class = c("collector_double",
"collector")), t_start = structure(list(), class = c("collector_double",
"collector")), t_end = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

Make the leading column value NA if condition is met using R

I have a data frame such as
structure(list(id = c(15305, 15305, 15305, 6224, 6224), transfer = c(0,
1, 0, 1, 0), hosp = c(2182, 2452, 2846, 1474, 1476), out = c(2183,
NA, 2857, NA, 1486), Insti = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
And I want to insert NA in the leading "hosp" column if the lagging "out" and lagging "Insti" columns are NA AND the "transfer" column == 1
I want the df to look like this
structure(list(id2 = c(15305, 15305, 15305, 6224, 6224), transfer2 = c(0,
1, 0, 1, 0), hosp2 = c(2182, 2452, NA, 1474, NA), out2 = c(2183,
NA, 2857, NA, 1486), Insti2 = c(NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
You can use the following solution:
library(dplyr)
df %>%
mutate(hosp = case_when(
is.na(lag(out)) & is.na(lag(Insti)) & lag(transfer) == 1 ~ NA_real_,
TRUE ~ hosp
))
id transfer hosp out Insti
1 15305 0 2182 2183 NA
2 15305 1 2452 NA NA
3 15305 0 NA 2857 NA
4 6224 1 1474 NA NA
5 6224 0 NA 1486 NA
To get the "lag" in base R, drop the last value and prepend NA. Here is a base R solution using ifelse. The third condition tests the lagged transfer column, and %in% is used instead of == so that the NA in the first row evaluates to FALSE rather than propagating into hosp.
transform(df,
hosp=ifelse(is.na(c(NA, out[-nrow(df)])) & is.na(c(NA, Insti[-nrow(df)])) &
c(NA, transfer[-nrow(df)]) %in% 1, NA, hosp))
# id transfer hosp out Insti
# 1 15305 0 2182 2183 NA
# 2 15305 1 2452 NA NA
# 3 15305 0 NA 2857 NA
# 4 6224 1 1474 NA NA
# 5 6224 0 NA 1486 NA
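The shift-by-one idiom described above can be wrapped in a small helper. This is a minimal sketch. The name lag1 is hypothetical and does not appear in the answers. It also shows how a lagged comparison behaves on the first row, where no previous value exists.

```r
# Shift a vector down by one position: drop the last value, prepend NA
lag1 <- function(x) c(NA, x[-length(x)])

transfer <- c(0, 1, 0, 1, 0)
prev_transfer <- lag1(transfer)

# `==` propagates the NA of the first row ...
eq_result <- prev_transfer == 1
# ... while `%in%` treats it as "no match" and returns FALSE
in_result <- prev_transfer %in% 1

print(eq_result)
print(in_result)
```

Inside ifelse, an NA test produces an NA result, so the choice between == and %in% decides whether the first row keeps its original value.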

problem while changing col names with str_to_title

I have a data set that can be built with the following code:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to change the column names that do not start with a number to proper case. How should I do it? I thought I could use str_to_title, but I have tried several ways and cannot get it to work. Here is the code I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? There is no error message; the column names simply do not get updated. Could anyone help me identify the issue, or show me a better way?
Here the data set is small, so I can look up the column numbers by hand. How can I make it convert the column names to proper case automatically?
Thanks.
We can use
library(dplyr)
library(stringr)
df %>%
rename_at(3:5, ~ str_to_title(.))
-output
# Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1 DOCETAXEL 0 4 0 0
#2 BEVACIZUMAB 0 3 0 0
#3 CARBOPLATIN 0 3 0 0
#4 CETUXIMAB 0 3 0 0
#5 DOXORUBICIN 1 3 0 0
#6 IRINOTECAN 0 3 0 0
Or using rename_with
df %>%
rename_with(~ str_to_title(.), 3:5)
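The question also asks how to pick the columns automatically instead of by position. A minimal base R sketch, assuming "does not start with a digit" is the right rule; the helper to_title is a hypothetical stand-in for str_to_title, and the data frame is a shortened copy of the one in the question.

```r
df <- data.frame(
  Med = c("DOCETAXEL", "BEVACIZUMAB"),
  `2.4 mg` = c(0, 1),
  `PRIOR CANCER THERAPY` = c(4L, 3L),
  `PRIOR CANCER SURGERY` = c(0, 0),
  check.names = FALSE  # keep the names with spaces as they are
)

# Lower-case everything, then upper-case the first letter of each word
to_title <- function(x) gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)

# Select the names that do not start with a digit and rename them in place
non_numeric <- !grepl("^[0-9]", names(df))
names(df)[non_numeric] <- to_title(names(df)[non_numeric])

print(names(df))
```

Assigning to names(df) directly is what makes the change stick. As far as I can tell, try 3 fails because colnames(df[,3:5]) <- ... renames a temporary copy of the subset and then writes its values back into df, which leaves the names of df untouched. With dplyr, something like rename_with(str_to_title, .cols = !matches("^[0-9]")) should express the same selection.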

Weighted random sampling for Monte Carlo simulation in R

I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should return either X2002 or X2003; "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, with no chance of returning X2004. The only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is transposed - the sample() function takes a probability vector. However, your probability vector is rowwise which means it's harder to extract from a data.frame.
To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply() statement. Note the apply() will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.
mc_df <- read.table(
text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
, header = T
,row.names = 1)
From there, you can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(apply(mc_df
, 1
, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
,2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485
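The question asks for one sampled column per ID per run, whereas the calls above draw 200 values per row. A minimal sketch of the single-draw variant, assuming the same mc_df layout; the function name sample_once and the choice of 1000 runs are illustrative only.

```r
mc_df <- read.table(
  text = "ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5",
  header = TRUE, row.names = 1
)

# One Monte Carlo run: draw exactly one column name per row,
# using that row's values as the sampling weights
sample_once <- function(weights) {
  apply(weights, 1, function(w) sample(names(w), size = 1, prob = w))
}

set.seed(42)  # only to make the sketch reproducible
one_run <- sample_once(mc_df)

# Many runs: one row per ID, one column per run
runs <- replicate(1000, sample_once(mc_df))

print(one_run)
print(dim(runs))
```

Each run returns a named character vector (ID to sampled column), which can be tabulated per ID with table(runs["X33", ]) or similar.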

Row by row application in R [duplicate]

I have my data in the form of a data.table given below
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94,
57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp",
"len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)
I need to form four new columns: csi_begin, csi_end, IRQ and csi_order. When atp == 1, csi_begin and csi_end depend directly on the inv and GU values.
When atp is not equal to 1, csi_begin and csi_end depend on the inv and GU values and on the IRQ value of the previous row.
IRQ depends on the csi_order of the same row if atp == 1, otherwise it is 0, and csi_order depends on the csi_begin value from two rows earlier.
I have written these conditions using a for loop.
The code is given below:
lostsales<-function(transit)
{
if (transit$atp==1)
{
transit$csi_begin[i]<-(transit$inv)[i]
transit$csi_end[i]<-transit$csi_begin[i]-transit$GU[i]
}
else
{
transit$csi_begin[i]<-(transit$inv)[i]+transit$IRQ[i-1]
transit$csi_end[i]<-transit$csi_begin[i]-transit$GU[i]
}
if (transit$csi_begin[i-2]!= NA)
{
transit$csi_order[i]<-transit$csi_begin[i-2]
}
else
{ transit$csi_order[i]<-0}
if (transit$atp==1)
{
transit$IRQ[i]<-transit$csi_order[i]-transit$RUTL[i]
}
else
{
transit$IRQ[i]<-0
}
}
Can anyone help me loop efficiently over a data.table, for example using setkeys? My data set is very large, so a plain for loop would take far too long.
Adding the desired outcome to your example would be very helpful, as I'm having trouble following the if/then logic. But I took a stab at it anyway:
library(data.table)
# Example data:
dt <- structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94, 57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp", "len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table", "data.frame"), .internal.selfref = "<pointer: 0x0000000000320788>")
# Add a row number:
dt[,rn:=.I]
# Use this function to get the value from a previous (shiftLen is negative) or future (shiftLen is positive) row:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
# My attempt to follow the seemingly circular if/then rules:
lostsales2 <- function(transit) {
# If atp==1, set csi_begin to inv and csi_end to csi_begin - GU:
transit[atp==1, `:=`(csi_begin=inv, csi_end=inv-GU)]
# Set csi_order to the value of csi_begin from two rows prior:
transit[, csi_order:=rowShift(csi_begin,-2)]
# Set csi_order to 0 if csi_begin from two rows prior was NA
transit[is.na(csi_order), csi_order:=0]
# Initialize IRQ to 0
transit[, IRQ:=0]
# If ATP==1, set IRQ to csi_order - RUTL
transit[atp==1, IRQ:=csi_order-RUTL]
# If ATP!=1, set csi_begin to inv + IRQ value from previous row, and csi_end to csi_begin - GU
transit[atp!=1, `:=`(csi_begin=inv+rowShift(IRQ,-1), csi_end=inv+rowShift(IRQ,-1)-GU)]
return(transit)
}
lostsales2(dt)
## atp len inv GU RUTL rn csi_begin csi_end csi_order IRQ
## 1: 1 2 593 36 100 1 593 557 0 -100
## 2: 0 NA 823 94 NA 2 NA NA 0 0
## 3: 1 3 668 57 173 3 668 611 593 420
## 4: 0 NA 640 105 NA 4 640 535 0 0
## 5: 0 NA 593 48 NA 5 593 545 668 0
## 6: 1 1 745 67 7 6 745 678 640 633
Is this output close to what you were expecting?
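The rowShift helper defined in the answer can be checked on its own with a short vector. This is a minimal sketch in base R; the example vector x is made up for illustration.

```r
# Same helper as in the answer: negative shiftLen looks back,
# positive shiftLen looks ahead; positions outside the vector become NA
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  x[r]
}

x <- c(593, NA, 668, 640, 593, 745)

two_back  <- rowShift(x, -2)  # value from two rows earlier
one_ahead <- rowShift(x, 1)   # value from the next row

print(two_back)
print(one_ahead)
```

Recent versions of data.table also ship a built-in shift() function (for example shift(x, n = 2, type = "lag")), which may be preferable to a hand-written helper.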
