problem while changing col names with str_to_title - r

I have a data set that can be built with the following code:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to convert the column names that do not start with a number to proper (title) case. How should I do it? I thought I could use str_to_title, but I have tried several ways and cannot get it to work. Here is the code that I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? There is no error message; the column names just did not get updated. Could anyone help me identify the issue, or show me a better way if you have one?
Here the data is small, so I can look up the column numbers by hand. If I want the column names converted to proper case automatically, how can I do that?
Thanks.

We can use rename_at (superseded by rename_with since dplyr 1.0.0, but it still works):
library(dplyr)
library(stringr)
df %>%
rename_at(3:5, ~ str_to_title(.))
-output
#          Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1   DOCETAXEL      0                    4                    0                       0
#2 BEVACIZUMAB      0                    3                    0                       0
#3 CARBOPLATIN      0                    3                    0                       0
#4   CETUXIMAB      0                    3                    0                       0
#5 DOXORUBICIN      1                    3                    0                       0
#6  IRINOTECAN      0                    3                    0                       0
Or using rename_with
df %>%
rename_with(~ str_to_title(.), 3:5)
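As for why the three attempts in the question leave the names unchanged: try1 builds a renamed copy but never assigns it, while try2 and try3 assign back into df[,3:5], which replaces the column values and keeps the existing names. The following is a minimal base R sketch of renaming in place and of picking the columns automatically. It assumes "does not start with a number" means the name does not begin with a digit, and to_title is a hypothetical stand-in for str_to_title, not a function from any package.

```r
# Sample data from the question, abbreviated to three rows.
df <- data.frame(
  Med = c("DOCETAXEL", "BEVACIZUMAB", "CARBOPLATIN"),
  `2.4 mg` = c(0, 0, 0),
  `PRIOR CANCER THERAPY` = c(4L, 3L, 3L),
  `PRIOR CANCER SURGERY` = c(0, 0, 0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case everything, then capitalise each word.
to_title <- function(x) gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)

# Columns whose names do not start with a digit.
target <- !grepl("^[0-9]", names(df))

# Assign to names(df) itself, not to the names of a subset copy.
names(df)[target] <- to_title(names(df)[target])
names(df)
```

With dplyr and stringr loaded, the same selection should be expressible as rename_with(df, str_to_title, .cols = !matches("^[0-9]")).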

Nested ifelse to output 3 responses in R

This is a related question from my original post found here: How to create a new variable based on condition from different dataframe in R
I have 2 data frames from an experiment. The 1st df records a (roughly) continuous signal over 40 mins. There are 5 columns: columns 1:3 are binary, indicating whether a button was pushed. The 4th column is a binary flag for whether either of columns 2 or 3 was pushed. The 5th column is an approximate time in seconds. Example from the df below:
initiate  left  right  l or r  time
0         0     1      1       2.8225
0         0     1      1       2.82375
0         0     1      1       2.82500
0         0     1      1       2.82625
1         0     0      0       16.8200
1         0     0      0       16.8212
etc.
The 2nd data frame is session info where each row is a trial, usually 100-150 rows depending on the day. I have a column that marks trial start time and another column that marks trial end time in seconds. I have another column that states whether or not the trial had an intervention. Example from df below (I omitted several irrelevant columns):
trial  control  t start    t end
1      0        16.64709   35.49431
2      0        41.81843   57.74304
3      0        65.54510   71.16612
4      0        82.65743   87.30914
11     3        187.0787   193.5898
12     0        200.0486   203.1883
30     3        415.1710   418.0405
etc.
For the 1st data frame, I want to create a column that indicates whether or not the button was pushed within a trial. If the button was indeed pushed within a trial, I need to label it based on intervention. This is based on those start and end times in the 2nd df, along with the control info. In this table, 0 = intervention and 3 = control.
I would like it to look something like this (iti = inter-trial, wt_int = within trial & intervention, wt_control = within trial & control):
initiate  left  right  l or r  time      trial_type
0         0     1      1       2.8225    iti
0         0     1      1       2.82375   iti
0         0     1      1       2.82500   iti
0         0     1      1       2.82625   iti
1         0     0      0       16.82000  wt_int
1         0     0      0       16.82125  wt_int
1         0     0      0       187.0800  wt_control
etc.
Going off previous recommendations, I've tried nested ifelse statements with no success. I can get it to label all of the trials as either "iti" or "wt_int" with different failed attempts, or an error at row 1037 (when it changes from iti to wt). From my original question I have a "trial" column now in my 1st df which I'm using for the following code. Perhaps there is a more straightforward approach that combines the original code?
Errors out part way through:
df %>%
rowwise() %>%
mutate(trial_type = ifelse(any(trial == "wt" & df2$control == 0,
ifelse(trial == "wt" & df2$control == 3,
"wt_omission", "iti"), "wt_odor")))
Also tried this, which labels all as wt_int:
df$trial_type <- ifelse(df$trial == 'wt' && df2$control == 0,
ifelse(df$trial == 'wt' && df2$control == 3,
"wt_control", "iti"), "wt_int")
Thank you!
You could use cut to create intervals and check whether a value falls into them:
library(dplyr)
df1 %>%
  mutate(
    check_1 = cut(time, breaks = df2$t_start, labels = FALSE),
    check_2 = coalesce(cut(time, breaks = df2$t_end, labels = FALSE), 0),
    check_3 = df2$control[check_1],
    trial_type = case_when(
      check_1 - check_2 == 1 & check_3 == 0 ~ "wt_int",
      check_1 - check_2 == 1 & check_3 == 3 ~ "wt_control",
      TRUE ~ "iti"
    )
  ) %>%
  select(-starts_with("check_"))
This returns
# A tibble: 7 x 6
  initiate  left right l_or_r   time trial_type
     <dbl> <dbl> <dbl>  <dbl>  <dbl> <chr>
1        0     0     1      1   2.82 iti
2        0     0     1      1   2.82 iti
3        0     0     1      1   2.82 iti
4        0     0     1      1   2.83 iti
5        1     0     0      0  16.8  wt_int
6        1     0     0      0  16.8  wt_int
7        1     0     0      0 187.   wt_control
Data
df1 <- structure(list(initiate = c(0, 0, 0, 0, 1, 1, 1), left = c(0,
0, 0, 0, 0, 0, 0), right = c(1, 1, 1, 1, 0, 0, 0), l_or_r = c(1,
1, 1, 1, 0, 0, 0), time = c(2.8225, 2.82375, 2.825, 2.82625,
16.82, 16.8212, 187.08)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(initiate = structure(list(), class = c("collector_double",
"collector")), left = structure(list(), class = c("collector_double",
"collector")), right = structure(list(), class = c("collector_double",
"collector")), l_or_r = structure(list(), class = c("collector_double",
"collector")), time = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(trial = c(1, 2, 3, 4, 11, 12, 30), control = c(0,
0, 0, 0, 3, 0, 3), t_start = c(16.64709, 41.81843, 65.5451, 82.65743,
187.0787, 200.0486, 415.171), t_end = c(35.49431, 57.74304, 71.16612,
87.30914, 193.5898, 203.1883, 418.0405)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(trial = structure(list(), class = c("collector_double",
"collector")), control = structure(list(), class = c("collector_double",
"collector")), t_start = structure(list(), class = c("collector_double",
"collector")), t_end = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
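One caveat with the cut approach: a time later than the last t_start lies outside every break interval, so cut returns NA and the row appears to be labelled "iti" even when it falls inside the final trial. As an alternative, here is a minimal base R sketch built on findInterval. The helper name label_trials is hypothetical, and the sketch assumes the trials are sorted by start time and do not overlap.

```r
# Trial windows and control codes from the question's second data frame.
t_start <- c(16.64709, 41.81843, 65.5451, 82.65743, 187.0787, 200.0486, 415.171)
t_end   <- c(35.49431, 57.74304, 71.16612, 87.30914, 193.5898, 203.1883, 418.0405)
control <- c(0, 0, 0, 0, 3, 0, 3)

# Hypothetical helper; assumes sorted, non-overlapping trials.
label_trials <- function(time, t_start, t_end, control) {
  # Index of the last trial that starts at or before `time` (0 if none).
  idx  <- findInterval(time, t_start)
  # Replace 0 by 1 so that indexing below never drops elements.
  safe <- pmax(idx, 1L)
  # Inside a trial only if that trial has not ended yet.
  in_trial <- idx > 0 & time <= t_end[safe]
  ifelse(!in_trial, "iti",
         ifelse(control[safe] == 0, "wt_int", "wt_control"))
}

# Before any trial, inside trial 1, between trials, inside trial 11, inside trial 30.
time <- c(2.8225, 16.82, 36, 187.08, 416)
labels <- label_trials(time, t_start, t_end, control)
labels
# "iti" "wt_int" "iti" "wt_control" "wt_control"
```

The result can be attached with df1$trial_type <- label_trials(df1$time, df2$t_start, df2$t_end, df2$control).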

Complex summary table using gtsummary in R

I have selected a few columns from my data set and want to make a table using gtsummary. I have run into some issues and am not sure how to make it work.
Part of the data, in reproducible form, is here:
structure(list(country = c("SGP", "JPN", "THA", "CHN", "JPN",
"CHN", "CHN", "JPN", "JPN", "JPN"), Final_Medal = c(NA, NA, NA,
NA, NA, "GOLD", NA, NA, NA, NA), Success = c(0, 0, 0, 0, 0, 1,
0, 0, 0, 0)), row.names = c(NA, 10L), class = "data.frame")
And it looks like this:
country  Final_Medal  Success
SGP      NA           0
JPN      NA           0
THA      NA           0
Final_Medal contains NA, GOLD, SILVER and BRONZE.
Success contains 0 and 1.
All I want for the output is to group by country and count the number of medals and successes for each country.
Desired output:
Country  GOLD  Silver  Bronze  Success  Total_Entry
SGP      5     2       10      17       50
JPN      4     3       5       12       60
CHN      5     2       6       13       60
Success should count only the 1s, while Total_Entry should count every row, whether Success is 0 or 1.
I have code that looks like this, but it doesn't work and I am not sure what needs to be done.
library(gtsummary)
example%>%tbl_summary(
by = country,
missing = "no" # don't list missing data separately
) %>%
bold_labels()
You may do the aggregation in dplyr and use gt/gtsummary for display purposes (df here is the data frame from the question):
library(dplyr)
library(gt)
df %>%
  group_by(country) %>%
  summarise(Gold = sum(Final_Medal == 'GOLD', na.rm = TRUE),
            Silver = sum(Final_Medal == 'SILVER', na.rm = TRUE),
            Bronze = sum(Final_Medal == 'BRONZE', na.rm = TRUE),
            Success = sum(Success),
            Total_Entry = n()) %>%
  gt()
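For comparison, the same counts can be sketched in base R with table and tapply, which makes it easy to check the numbers on the ten sample rows before handing the result to a display package. The object name example follows the question's code; the rest is an illustrative assumption, not part of the original answer.

```r
# The ten sample rows from the question.
example <- data.frame(
  country = c("SGP", "JPN", "THA", "CHN", "JPN", "CHN", "CHN", "JPN", "JPN", "JPN"),
  Final_Medal = c(NA, NA, NA, NA, NA, "GOLD", NA, NA, NA, NA),
  Success = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Fix the medal levels so all three columns appear even if a medal never occurs.
medal <- factor(example$Final_Medal, levels = c("GOLD", "SILVER", "BRONZE"))

# Cross-tabulate country by medal; rows with an NA medal are skipped by default.
counts <- as.data.frame.matrix(table(example$country, medal))

# Per-country number of successes and total number of entries.
counts$Success     <- as.vector(tapply(example$Success, example$country, sum)[rownames(counts)])
counts$Total_Entry <- as.vector(table(example$country)[rownames(counts)])
counts
```

On this sample, CHN has one GOLD, one success, and three entries, while JPN has five entries and no medals.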

Need to separate out variable names from a column in r

So I have a pretty bad dataset that I am not allowed to change. I would like to take the column "Draw_CashFlow" and turn only certain values into their own columns. Additionally, I need to gather the period columns into a single column (Period), i.e. go from wide to tidy.
In the dataset below, the column Draw_CashFlow begins with the variable in question, followed by a list of IDs, and then repeats for the next variable. Some variables may have NA entries.
structure(list(Draw_CashFlow = c("Principal", "R01",
"R02", "R03", "Workout Recovery Principal",
"Prepaid Principal", "R01", "R02", "R03",
"Interest", "R01", "R02"), `PERIOD 1` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 366627.74, 0, 0, NA, 317521.73,
29175.1), `PERIOD 2` = c(NA, 834659.51, 85800.18, 27540.31, NA,
NA, 306125.98, 0, 0, NA, 302810.49, 28067.8), `PERIOD 3` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 269970.12, 0, 0, NA, 298529.92,
27901.36), `PERIOD 4` = c(NA, 834659.51, 85800.18, 27540.31,
NA, NA, 307049.06, 0, 0, NA, 293821.89, 27724.4)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
The list of variables needed is finite (Principal, Workout Recovery Principal, Prepaid Principal, and Interest), so I tried to write a loop that checks whether each one exists and then gathers, but that was not correct.
After the variables are separated out of Draw_CashFlow, I hope it looks something like this (first four rows; ignore the variable abbreviations):
ID   Period  Principal  Wrk_Reco_Principal  Prepaid_Principal  Interest
R01  1       834659.51  NA                  366627.74          317521.73
R02  1       85800.18   NA                  0.00               29175.10
R03  1       27540.31   NA                  0.00               NA
R01  2       834659.51  NA                  306125.98          302810.49
Notes: Wrk_Reco_Principal is NA because there are no IDs under this variable in Draw_CashFlow. Keep in mind this needs to handle any number of IDs, but the variable names in the Draw_CashFlow column will always be the same.
Here's an approach which assumes the Draw_CashFlow values that start with an R are ID numbers. You might need a different method (e.g. !Draw_CashFlow %in% LIST_OF_VARIABLES) if that doesn't hold up.
library(dplyr)   # mutate, if_else, select, %>%
library(tidyr)   # fill, gather, drop_na, spread
library(stringr) # str_starts
library(readr)   # parse_number
df %>%
  # create separate columns for ID and Variable
  mutate(ID = if_else(Draw_CashFlow %>% str_starts("R"),
                      Draw_CashFlow, NA_character_),
         Variable = if_else(!Draw_CashFlow %>% str_starts("R"),
                            Draw_CashFlow, NA_character_)) %>%
  fill(Variable) %>% # Fill down Variable in NA rows from above
  select(-Draw_CashFlow) %>%
  gather(Period, value, -c(ID, Variable)) %>% # Gather into long form
  drop_na() %>% # Drop header rows (no ID) and missing values
  spread(Variable, value, fill = 0) %>% # Spread based on Variable
  mutate(Period = parse_number(Period))
# A tibble: 12 x 5
   ID    Period Interest `Prepaid Principal` Principal
   <chr>  <dbl>    <dbl>               <dbl>     <dbl>
 1 R01        1  317522.             366628.   834660.
 2 R01        2  302810.             306126.   834660.
 3 R01        3  298530.             269970.   834660.
 4 R01        4  293822.             307049.   834660.
 5 R02        1   29175.                  0     85800.
 6 R02        2   28068.                  0     85800.
 7 R02        3   27901.                  0     85800.
 8 R02        4   27724.                  0     85800.
 9 R03        1       0                   0     27540.
10 R03        2       0                   0     27540.
11 R03        3       0                   0     27540.
12 R03        4       0                   0     27540.
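If the IDs do not reliably start with "R", the alternative mentioned above (matching against the known list of variable names) can be sketched in base R as follows. This is only an illustration on a shortened, single-period copy of the data. The names variables, is_header and last_header are hypothetical, and the sketch assumes the first row of Draw_CashFlow is always a variable name.

```r
# Shortened copy of the question's data with a single period column.
df <- data.frame(
  Draw_CashFlow = c("Principal", "R01", "R02", "Workout Recovery Principal",
                    "Prepaid Principal", "R01", "Interest", "R01"),
  `PERIOD 1` = c(NA, 834659.51, 85800.18, NA, NA, 366627.74, NA, 317521.73),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
variables <- c("Principal", "Workout Recovery Principal",
               "Prepaid Principal", "Interest")

# Rows that hold a variable name rather than an ID.
is_header <- df$Draw_CashFlow %in% variables

# For every row, the position of the most recent header row at or above it.
last_header <- cummax(ifelse(is_header, seq_along(is_header), 0L))

# Keep only the ID rows and attach the variable each one belongs to.
labelled <- data.frame(
  ID = df$Draw_CashFlow[!is_header],
  Variable = df$Draw_CashFlow[last_header][!is_header],
  `PERIOD 1` = df$`PERIOD 1`[!is_header],
  check.names = FALSE,
  stringsAsFactors = FALSE
)
labelled
```

The labelled table has one row per ID and variable, so it can go through the same gather and spread steps as before.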

R - matching values in a dataframe column to another dataframes row names

I think this is a fairly challenging data manipulation problem in R, and have struggled constructing a function that can achieve this. The context is organizing basketball players who play different positions into a lineup together, subject to what position each player plays. For some clarity, here is an example of the dataframe I am working with, in two different forms:
dput(my_df)
structure(list(Name = c("C.J. McCollum", "DeMar DeRozan", "Jimmy Butler",
"Jonas Valanciunas", "Kevin Durant", "Markieff Morris", "Pascal Siakam",
"Pau Gasol"), Pos1 = c("PG", "SG", "SG", "C", "SF", "SF", "PF",
"C"), Pos2 = c("SG", "", "SF", "", "PF", "PF", "", "")), .Names = c("Name",
"Pos1", "Pos2"), class = "data.frame", row.names = c(18L, 33L,
62L, 68L, 78L, 92L, 106L, 111L))
my_df
                 Name Pos1 Pos2
18      C.J. McCollum   PG   SG
33      DeMar DeRozan   SG
62       Jimmy Butler   SG   SF
68  Jonas Valanciunas    C
78       Kevin Durant   SF   PF
92    Markieff Morris   SF   PF
106     Pascal Siakam   PF
111         Pau Gasol    C
dput(my_df2)
structure(list(Name = c("C.J. McCollum", "DeMar DeRozan", "Jimmy Butler",
"Jonas Valanciunas", "Kevin Durant", "Markieff Morris", "Pascal Siakam",
"Pau Gasol"), Pos1 = c("PG", "SG", "SG", "C", "SF", "SF", "PF",
"C"), Pos2 = c("SG", "", "SF", "", "PF", "PF", "", ""), PG = c(1,
0, 0, 0, 0, 0, 0, 0), SG = c(1, 1, 1, 0, 0, 0, 0, 0), SF = c(0,
0, 1, 0, 1, 1, 0, 0), PF = c(0, 0, 0, 0, 1, 1, 1, 0), C = c(0,
0, 0, 1, 0, 0, 0, 1), BackupG = c(1, 1, 1, 0, 0, 0, 0, 0), BackupF = c(0,
0, 1, 0, 1, 1, 1, 0), Man8 = c(1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("Name",
"Pos1", "Pos2", "PG", "SG", "SF", "PF", "C", "BackupG", "BackupF",
"Man8"), row.names = c(18L, 33L, 62L, 68L, 78L, 92L, 106L, 111L
), class = "data.frame")
my_df2
                 Name Pos1 Pos2 PG SG SF PF C BackupG BackupF Man8
18      C.J. McCollum   PG   SG  1  1  0  0 0       1       0    1
33      DeMar DeRozan   SG       0  1  0  0 0       1       0    1
62       Jimmy Butler   SG   SF  0  1  1  0 0       1       1    1
68  Jonas Valanciunas    C       0  0  0  0 1       0       0    1
78       Kevin Durant   SF   PF  0  0  1  1 0       0       1    1
92    Markieff Morris   SF   PF  0  0  1  1 0       0       1    1
106     Pascal Siakam   PF       0  0  0  1 0       0       1    1
111         Pau Gasol    C       0  0  0  0 1       0       0    1
In a basketball lineup, we want 1 player for each of the 5 positions (PG, SG, PF, SF, C). We also want 1 backup guard (a PG or SG is a guard), 1 backup forward (a PF or SF is a forward), and an 8th player who can play any position. With this group of 8 players, we could construct the lineup this way:
          Name
PG        C.J. McCollum
SG        DeMar DeRozan
PF        Kevin Durant
SF        Markieff Morris
C         Pau Gasol
Backup G  Jimmy Butler
Backup F  Pascal Siakam
8th Man   Jonas Valanciunas
Of course there is some flexibility here (Kevin Durant and Markieff Morris could have been switched; in fact there are several players who could have switched spots in the 2nd dataframe). I would like to be able to organize my_df into this 2nd dataframe's format fairly quickly: something that takes the Pos1 and Pos2 columns from my_df, checks them against the row names of the 2nd dataframe, and then fills in the player names.
There is a puzzle aspect to this, however. Not all players have a second position, but those who do can be listed at either of their two positions. (For example, Jimmy Butler can be set as a SG, a SF, a Backup G, a Backup F, or the 8th Man, whereas Pau Gasol can only be set as a C or as the 8th Man.) Additionally, while C.J. McCollum is listed as a PG and SG, he is the only player in my_df who is listed as a PG, and therefore must go in the PG row of the second dataframe.
Any thoughts are appreciated with this! I can provide more context if needed.
(Edit: adding Pos3, Pos4 and Pos5 columns to my_df for whether a player can be a Backup G, Backup F, or 8th Man may help as well, and is something I am currently working with.)
Edit: see "Simplify this grid such that each row and column has 1 value" for a revised version of my question. It is a simpler problem to solve, but its solution will also answer this question.
This approach is guaranteed to return a result if there is one. In fact, it will return all viable combinations.
st<-as.matrix(my_df2[4:dim(my_df2)[2]]) # Make a numeric matrix (players x lineup slots)
x<-seq_len(nrow(st)) # Row indices of the players to permute
## allCombinationsAux may not be necessary if you are using a combinatorics library
allCombinationsAux<-function(z,nreg,x){
  if(sum(nreg)>1){
    ## Recurse over the rows that are still unused, excluding the current row z
    innerLoop<-do.call(rbind,lapply(x[nreg&(z!=x)], allCombinationsAux,nreg&(z!=x),x))
    ret<-cbind(z,innerLoop )
  }
  else{
    ret<-x[nreg]
  }
  ret
}
## Find all the possible row combinations for the matrix
combs<-do.call(rbind,lapply(x,function(y) allCombinationsAux(y,y!=x,x)))
## Identify which combinations are valid (every slot on the diagonal is filled by an eligible player)
inds<-which(apply(combs,1,function(perm) sum(diag(st[perm,]))==ncol(st)))
## Select valid matrices
validChoices<-lapply(inds,function(i) st[combs[i,],])
1. Make a matrix out of my_df2
2. Find all possible matrix substitutions
3. Iterate through all possible matrices testing if the diags are all 1
4. Select those matrices that are valid
To get the output to look like your example you can run
validChoices<-lapply(inds,function(x) {
  matr<-st[combs[x,],]
  retVal<-data.frame(Name=my_df2[combs[x,],"Name"])
  rownames(retVal)<-colnames(matr)
  retVal
})
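Enumerating every ordering of 8 players means checking 8! = 40320 permutations, which is fine here but grows quickly with more players. If a single valid lineup is enough, a backtracking search is one possible alternative. The sketch below is an assumption-laden illustration, not part of the approach above: assign_lineup and elig are hypothetical names, and elig reproduces the eligibility columns of my_df2 (PG through Man8).

```r
players <- c("C.J. McCollum", "DeMar DeRozan", "Jimmy Butler", "Jonas Valanciunas",
             "Kevin Durant", "Markieff Morris", "Pascal Siakam", "Pau Gasol")

# Eligibility matrix: one row per player, one column per lineup slot.
elig <- cbind(
  PG = c(1, 0, 0, 0, 0, 0, 0, 0), SG = c(1, 1, 1, 0, 0, 0, 0, 0),
  SF = c(0, 0, 1, 0, 1, 1, 0, 0), PF = c(0, 0, 0, 0, 1, 1, 1, 0),
  C = c(0, 0, 0, 1, 0, 0, 0, 1), BackupG = c(1, 1, 1, 0, 0, 0, 0, 0),
  BackupF = c(0, 0, 1, 0, 1, 1, 1, 0), Man8 = rep(1, 8)
)

# Fill the slots left to right; return player row indices, or NULL if impossible.
assign_lineup <- function(elig, slot = 1, used = rep(FALSE, nrow(elig))) {
  if (slot > ncol(elig)) return(integer(0))        # every slot is filled
  for (p in which(elig[, slot] == 1 & !used)) {    # eligible, unused players
    used[p] <- TRUE
    rest <- assign_lineup(elig, slot + 1, used)
    if (!is.null(rest)) return(c(p, rest))         # the remaining slots worked out
    used[p] <- FALSE                               # undo and try the next player
  }
  NULL                                             # dead end for this slot
}

pick <- assign_lineup(elig)
lineup <- data.frame(Name = players[pick], row.names = colnames(elig))
lineup
```

The search abandons a partial lineup as soon as a slot has no eligible player left, so it never builds the full list of permutations.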

Using rbind.fill does not fill the correct value

I don’t understand how rbind.fill works, I guess. I have a data frame called main.df:
    TLT PCY SHY VTI TIP VNQ VWO RWX VEA DBC GLD
Pct   0   0   0   0   0   0   0   0   0   0   0
I want to bind the following different-sized data frame named p.df to it:
          VWO       VEA       VTI
Pct 0.3333333 0.3333333 0.3333333
When I execute rbind.fill(main.df, p.df) I get:
  TLT PCY SHY VTI TIP VNQ VWO RWX VEA DBC GLD
1   0   0   0   0   0   0   0   0   0   0   0
2  NA  NA  NA   1  NA  NA   1  NA   1  NA  NA
which is not what I want. I expected to get:
  TLT PCY SHY   VTI TIP VNQ   VWO RWX   VEA DBC GLD
1   0   0   0     0   0   0     0   0     0   0   0
2  NA  NA  NA 0.333  NA  NA 0.333  NA 0.333  NA  NA
How do I do this? The dput of my objects are below.
main.df <- structure(list(
TLT = 0, PCY = 0, SHY = 0, VTI = 0, TIP = 0, VNQ = 0, VWO = 0, RWX = 0, VEA = 0, DBC = 0, GLD = 0),
.Names = c("TLT", "PCY", "SHY", "VTI", "TIP", "VNQ", "VWO", "RWX", "VEA", "DBC", "GLD"),
row.names = "Pct", class = "data.frame")
p.df <- structure(list(
VWO = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VWO"), class = "factor"),
VEA = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VEA"), class = "factor"),
VTI = structure(1L, .Names = "Pct", .Label = c("0.3333333", "VTI"), class = "factor")),
.Names = c("VWO", "VEA", "VTI"), row.names = "Pct", class = "data.frame")
It would help if you provided a reproducible example using dput(main.df) and dput(p.df), but it appears that one or both of those objects contain factor vectors, not numeric vectors. So you need to convert the factor columns; columns that are already numeric should be left as they are.
to_numeric <- function(f) if (is.factor(f)) as.numeric(levels(f))[f] else f
main.df[] <- lapply(main.df, to_numeric)
p.df[] <- lapply(p.df, to_numeric)
See How to convert a factor to an integer\numeric without a loss of information for details.
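To see why the factor columns matter, here is a small base R illustration. The factor f is built by hand to mimic one column of p.df from the dput above. Coercing it directly yields the internal level code rather than the number in the label, which is consistent with the 1s in the rbind.fill output.

```r
# One column of p.df rebuilt by hand: a factor whose label looks like a number.
f <- factor("0.3333333", levels = c("0.3333333", "VWO"))

code  <- as.numeric(f)                # internal level code, not the label
value <- as.numeric(as.character(f))  # the number the label represents

code   # 1
value  # 0.3333333
```

Going through as.character gives the same result as as.numeric(levels(f))[f] here, and it avoids the coercion warning that the non-numeric level "VWO" would otherwise trigger.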
