Need to separate out variable names from a column in R

I have a messy dataset that I am not allowed to change. I would like to take the column "Draw_CashFlow" and turn only certain values into their own columns. Additionally, I need to gather the period columns into a single column (Period), i.e. go from wide to tidy.
In the dataset below we have a column (Draw_CashFlow) which begins with the variable in question followed by a list of IDs, then repeats for the next variable. Some variables may have NA entries.
structure(list(Draw_CashFlow = c("Principal", "R01",
"R02", "R03", "Workout Recovery Principal",
"Prepaid Principal", "R01", "R02", "R03",
"Interest", "R01", "R02"), `PERIOD 1` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 366627.74, 0, 0, NA, 317521.73,
29175.1), `PERIOD 2` = c(NA, 834659.51, 85800.18, 27540.31, NA,
NA, 306125.98, 0, 0, NA, 302810.49, 28067.8), `PERIOD 3` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 269970.12, 0, 0, NA, 298529.92,
27901.36), `PERIOD 4` = c(NA, 834659.51, 85800.18, 27540.31,
NA, NA, 307049.06, 0, 0, NA, 293821.89, 27724.4)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
The list of variables needed is finite (Principal, Workout Recovery Principal, Prepaid Principal, and Interest), so I tried to write a loop that checks whether each one exists and then gathers, but that was not correct.
After the variables are set apart from Draw_CashFlow I hope it looks something like this (First four rows, ignore variable abbreviations).
ID Period Principal Wrk_Reco_Principal Prepaid_Principal Interest
R01 1 834659.51 NA 366627.74 317521.73
R02 1 85800.18 NA 0.00 29175.10
R03 1 27540.31 NA 0.00 NA
R01 2 834659.51 NA 306125.98 302810.49
Notes: Wrk_Reco_Principal is NA because there are no IDs within Draw_CashFlow for this variable. Keep in mind this is supposed to handle any number of IDs, but the variable names in the Draw_CashFlow column will always be the same.

Here's an approach which assumes the Draw_CashFlow values that start with an R are ID numbers. You might need a different method (e.g. !Draw_CashFlow %in% LIST_OF_VARIABLES) if that doesn't hold up.
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
df %>%
  # create separate columns for ID and Variable
  mutate(ID = if_else(Draw_CashFlow %>% str_starts("R"),
                      Draw_CashFlow, NA_character_),
         Variable = if_else(!Draw_CashFlow %>% str_starts("R"),
                            Draw_CashFlow, NA_character_)) %>%
  fill(Variable) %>% # Fill down Variable in NA rows from above
  select(-Draw_CashFlow) %>%
  gather(Period, value, -c(ID, Variable)) %>% # Gather into long form
  drop_na() %>% # Drop the variable header rows (no ID, no values)
  spread(Variable, value, fill = 0) %>% # Spread based on Variable; missing combinations become 0
  mutate(Period = parse_number(Period)) # "PERIOD 1" -> 1
# A tibble: 12 x 5
ID Period Interest `Prepaid Principal` Principal
<chr> <dbl> <dbl> <dbl> <dbl>
1 R01 1 317522. 366628. 834660.
2 R01 2 302810. 306126. 834660.
3 R01 3 298530. 269970. 834660.
4 R01 4 293822. 307049. 834660.
5 R02 1 29175. 0 85800.
6 R02 2 28068. 0 85800.
7 R02 3 27901. 0 85800.
8 R02 4 27724. 0 85800.
9 R03 1 0 0 27540.
10 R03 2 0 0 27540.
11 R03 3 0 0 27540.
12 R03 4 0 0 27540.
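The alternative mentioned above (matching against a known list of variables rather than relying on IDs starting with "R") can be sketched roughly as follows. This is only a sketch under some assumptions: the vector name `variables` and the helper column `is_var` are made up for illustration, the data is a two-period subset of the question's dataset, and `names_expand` needs tidyr 1.2.0 or later. Unlike the answer above, it leaves missing combinations as NA and keeps a column for a variable that has no IDs under it.

```r
library(dplyr)
library(tidyr)
library(readr)

# Known variable names (assumed fixed, as stated in the question)
variables <- c("Principal", "Workout Recovery Principal",
               "Prepaid Principal", "Interest")

# Two-period subset of the question's data
df <- tibble(
  Draw_CashFlow = c("Principal", "R01", "R02", "R03",
                    "Workout Recovery Principal",
                    "Prepaid Principal", "R01", "R02", "R03",
                    "Interest", "R01", "R02"),
  `PERIOD 1` = c(NA, 834659.51, 85800.18, 27540.31, NA, NA,
                 366627.74, 0, 0, NA, 317521.73, 29175.1),
  `PERIOD 2` = c(NA, 834659.51, 85800.18, 27540.31, NA, NA,
                 306125.98, 0, 0, NA, 302810.49, 28067.8)
)

res <- df %>%
  # flag header rows and carry the variable name down to the ID rows
  mutate(is_var = Draw_CashFlow %in% variables,
         Variable = if_else(is_var, Draw_CashFlow, NA_character_)) %>%
  fill(Variable) %>%
  filter(!is_var) %>%
  select(-is_var) %>%
  rename(ID = Draw_CashFlow) %>%
  # long form: one row per ID, variable and period
  pivot_longer(starts_with("PERIOD"),
               names_to = "Period", values_to = "value") %>%
  mutate(Period = parse_number(Period),
         Variable = factor(Variable, levels = variables)) %>%
  # wide form: one column per variable, including unused factor levels
  pivot_wider(names_from = Variable, values_from = value,
              names_expand = TRUE)

res
```

Turning `Variable` into a factor with all four levels is what lets `names_expand = TRUE` create the empty "Workout Recovery Principal" column.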

Related

Complex summary table using gtsummary in R

I have selected a few columns from the data set and I want to make a table using gtsummary. I have run into some issues and am not sure how to make it work.
Part of the reproducible data is here:
structure(list(country = c("SGP", "JPN", "THA", "CHN", "JPN",
"CHN", "CHN", "JPN", "JPN", "JPN"), Final_Medal = c(NA, NA, NA,
NA, NA, "GOLD", NA, NA, NA, NA), Success = c(0, 0, 0, 0, 0, 1,
0, 0, 0, 0)), row.names = c(NA, 10L), class = "data.frame")
And it looks like this :
country Final_Medal Success
SGP NA 0
JPN NA 0
THA NA 0
Final_Medal contains NA, GOLD, SILVER and BRONZE.
Success contains 0 and 1.
All I want for the output is to group by country and count the number of medals and successes for each country.
Desired output:
Country GOLD Silver Bronze Success Total_Entry
SGP 5 2 10 17 50
JPN 4 3 5 12 60
CHN 5 2 6 13 60
Success should only count the 1s, while Total_Entry should count every row regardless of whether Success is 0 or 1.
I have code that looks like this, but it doesn't work and I am not sure what needs to be done.
library(gtsummary)
example%>%tbl_summary(
by = country,
missing = "no" # don't list missing data separately
) %>%
bold_labels()
You may do the aggregation in dplyr and use gt/gtsummary for display purposes.
library(dplyr)
library(gt)
df %>%
group_by(country) %>%
summarise(Gold = sum(Final_Medal == 'GOLD', na.rm = TRUE),
Silver = sum(Final_Medal == 'SILVER', na.rm = TRUE),
Bronze = sum(Final_Medal == 'BRONZE', na.rm = TRUE),
Success = sum(Success),
Total_Entry = n()) %>%
gt()

problem while changing col names with str_to_title

I have a data set that can be built using this code:
df<- structure(list(`Med` = c("DOCETAXEL",
"BEVACIZUMAB", "CARBOPLATIN", "CETUXIMAB", "DOXORUBICIN", "IRINOTECAN"
), `2.4 mg` = c(0, 0, 0, 0, 1, 0), `PRIOR CANCER THERAPY` = c(4L,
3L, 3L, 3L, 3L, 3L), `PRIOR CANCER SURGERY` = c(0, 0, 0, 0, 0,
0), `PRIOR RADIATION THERAPY` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
6L), class = "data.frame")
Now I would like to change the column names that do not start with a number to proper case. How should I do it? I thought I could use str_to_title, but I have tried many ways and cannot get it to work. Here is the code that I tried:
# try1:
df[,3:5] %>% setNames(str_to_title(colnames(df[,3:5])))
#try2:
df[,3:5] <- df[,3:5]%>% rename_with (str_to_title)
# try3:
colnames(df[,3:5])<- str_to_title(colnames(df[,3:5]))
What did I do wrong? There is no error message; the column names just did not get updated. Could anyone help me identify the issue, or show me a better way?
Here the data is small, so I can find the column numbers by hand. If I want to convert the column names to proper case automatically, how can I do that?
Thanks.
We can use
library(dplyr)
library(stringr)
df %>%
rename_at(3:5, ~ str_to_title(.))
-output
# Med 2.4 mg Prior Cancer Therapy Prior Cancer Surgery Prior Radiation Therapy
#1 DOCETAXEL 0 4 0 0
#2 BEVACIZUMAB 0 3 0 0
#3 CARBOPLATIN 0 3 0 0
#4 CETUXIMAB 0 3 0 0
#5 DOXORUBICIN 1 3 0 0
#6 IRINOTECAN 0 3 0 0
Or using rename_with
df %>%
rename_with(~ str_to_title(.), 3:5)
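To avoid hard-coding the column positions, one possible sketch is to select the columns by name pattern instead. This assumes dplyr 1.0.0 or later (for rename_with and tidyselect's `!` negation) and uses a cut-down version of the question's data; the object name `renamed` is only illustrative.

```r
library(dplyr)
library(stringr)

# Cut-down version of the question's data
df <- data.frame(
  Med = c("DOCETAXEL", "BEVACIZUMAB"),
  `2.4 mg` = c(0, 1),
  `PRIOR CANCER THERAPY` = c(4L, 3L),
  `PRIOR CANCER SURGERY` = c(0, 0),
  check.names = FALSE
)

# Title-case every column whose name does not start with a digit
renamed <- df %>%
  rename_with(str_to_title, .cols = !matches("^[0-9]"))

names(renamed)
```

The result has to be assigned back (here to `renamed`) for the new names to be kept, since rename_with returns a modified copy rather than changing `df` in place.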

How to extract the previous n rows where a certain column value cannot be a particular value?

I've been searching for quite some time now with no luck. Essentially, I'm trying to figure out a way in R to extract the previous n rows where the "LTO Column" is a 0 but starting from where the "LTO Column" is a 1.
Data table:
Week Price LTO
1/1/2019 11 0
2/1/2019 12 0
3/1/2019 11 0
4/1/2019 11 0
5/1/2019 9.5 1
6/1/2019 10 0
7/1/2019 8 1
Then, say n = 3: starting from 5/1/2019, where LTO = 1, I want to be able to pull the rows 4/1/2019, 3/1/2019, 2/1/2019.
But then for 7/1/2019, where LTO is also equal to 1, I want to grab the rows 6/1/2019, 4/1/2019, 3/1/2019. In this situation it skips the row 5/1/2019 because it has a 1 in the LTO column.
Any help would be much appreciated.
There could be a better way to do this; here is one attempt using base R.
#Number of rows to look back
n <- 3
#Find row index where LTO is 1.
inds <- which(df$LTO == 1)
#Remove row index where LTO is 1
remaining_rows <- setdiff(seq_len(nrow(df)), inds)
#For every inds find the previous n rows from remaining_rows
#use it to subset from the dataframe and add a new column week2
#with its corresponding date
do.call(rbind, lapply(inds, function(x) {
o <- match(x - 1, remaining_rows)
transform(df[remaining_rows[o:(o - (n -1))], ], week2 = df$Week[x])
}))
# Week Price LTO week2
#4 4/1/2019 11 0 5/1/2019
#3 3/1/2019 11 0 5/1/2019
#2 2/1/2019 12 0 5/1/2019
#6 6/1/2019 10 0 7/1/2019
#41 4/1/2019 11 0 7/1/2019
#31 3/1/2019 11 0 7/1/2019
data
df <- structure(list(Week = structure(1:7, .Label = c("1/1/2019",
"2/1/2019", "3/1/2019", "4/1/2019", "5/1/2019", "6/1/2019", "7/1/2019"), class =
"factor"), Price = c(11, 12, 11, 11, 9.5, 10, 8), LTO = c(0L, 0L, 0L,
0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -7L))
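The approach above looks up row `x - 1` among the non-LTO rows, so it may fail when two LTO rows are adjacent or when fewer than n non-LTO rows precede an LTO row. A possible variant that tolerates those cases is sketched below; the names `lto_rows`, `non_lto_rows` and `res` are illustrative, and Week is stored as character here rather than as a factor.

```r
df <- data.frame(
  Week  = c("1/1/2019", "2/1/2019", "3/1/2019", "4/1/2019",
            "5/1/2019", "6/1/2019", "7/1/2019"),
  Price = c(11, 12, 11, 11, 9.5, 10, 8),
  LTO   = c(0L, 0L, 0L, 0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

n <- 3                                # number of rows to look back
lto_rows     <- which(df$LTO == 1)    # rows that trigger a lookup
non_lto_rows <- which(df$LTO != 1)    # candidate rows to return

res <- do.call(rbind, lapply(lto_rows, function(x) {
  # up to n closest non-LTO rows before row x, nearest first
  prev <- rev(tail(non_lto_rows[non_lto_rows < x], n))
  cbind(df[prev, ], week2 = rep(df$Week[x], length(prev)))
}))

res
```

If fewer than n earlier non-LTO rows exist, this simply returns as many as are available for that LTO row.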

Assign max value of group to all rows in that group

I would like to assign the max value of a group to all rows within that group. How do I do that?
I have a dataframe containing the names of the group and the max number of credits that belongs to it.
course_credits <- aggregate(bsc_academic$Credits, by = list(bsc_academic$Course_code), max)
which gives
Course Credits
1 ABC1000 6.5
2 ABC1003 6.5
3 ABC1004 6.5
4 ABC1007 5.0
5 ABC1010 6.5
6 ABC1021 6.5
7 ABC1023 6.5
The main dataframe looks like this:
Appraisal.Type Resits Credits Course_code Student_ID
Final result 0 6.5 ABC1000 10
Final result 0 6.5 ABC1003 10
Grade supervisor 0 0 ABC1000 10
Grade supervisor 0 0 ABC1003 10
Final result 0 12 ABC1294 23
Grade supervisor 0 0 ABC1294 23
As you see, student 10 took course ABC1000, worth 6.5 credits. For each course (per student), however, two rows exist: Final result and Grade supervisor. In the end, Final result should be deleted, but the credits should be kept. Therefore, I want to assign the max value of 6.5 to the Grade supervisor row.
Likewise, student 23 has followed course ABC1294, worth 12 credits.
In the end, this should be the result:
Appraisal.Type Resits Credits Course_code Student_ID
Grade supervisor 0 6.5 ABC1000 10
Grade supervisor 0 6.5 ABC1003 10
Grade supervisor 0 12 ABC1294 23
How do I go about this?
An option would be to group by 'Student_ID', mutate the 'Credits' with max of 'Credits' and filter the rows with 'Appraisal.Type' as "Grade supervisor"
library(dplyr)
df1 %>%
group_by(Student_ID) %>%
dplyr::mutate(Credits = max(Credits)) %>%
ungroup %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 2 x 5
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
If we also need 'Course_code' to be included in the grouping
df2 %>%
group_by(Student_ID, Course_code) %>%
dplyr::mutate(Credits = max(Credits)) %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 3 x 5
# Groups: Student_ID, Course_code [3]
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
#3 Grade supervisor 0 12 ABC1294 23
NOTE: If the plyr package is also loaded, some functions can be masked, especially summarise/mutate, which also exist in plyr. To prevent this, either run the code in a fresh session without loading plyr or explicitly call dplyr::mutate.
data
df1 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor"), Resits = c(0L, 0L, 0L,
0L), Credits = c(6.5, 6.5, 0, 0), Course_code = c("ABC1000",
"ABC1003", "ABC1000", "ABC1003"), Student_ID = c(10L, 10L, 10L,
10L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor", "Final result", "Grade supervisor"
), Resits = c(0L, 0L, 0L, 0L, 0L, 0L), Credits = c(6.5, 6.5,
0, 0, 12, 0), Course_code = c("ABC1000", "ABC1003", "ABC1000",
"ABC1003", "ABC1294", "ABC1294"), Student_ID = c(10L, 10L, 10L,
10L, 23L, 23L)), class = "data.frame", row.names = c(NA, -6L))
Generate a sample dataset.
data <- as.data.frame(list(Appraisal.Type = c(rep("Final result", 2), rep("Grade supervisor", 2)),
Resits = rep(0, 4),
Credits = c(rep(6.5, 2), rep(0, 2)),
Course_code = rep(c("ABC1000", "ABC1003"), 2),
Student_ID = rep(10, 4)))
Assign the max value of a group to all rows in this group and then delete the rows that contain "Final result".
##Reassign the values of "Credits" column
for (i in 1: nlevels(as.factor(data$Course_code))) {
Course_code <- unique(data$Course_code)[i]
data$Credits [data$Course_code == Course_code] <- max (data$Credits [data$Course_code == Course_code])
}
##New dataset without "Final result" rows
data <- data[data$Appraisal.Type != "Final result",]
Here is the result.
data
Appraisal.Type Resits Credits Course_code Student_ID
3 Grade supervisor 0 6.5 ABC1000 10
4 Grade supervisor 0 6.5 ABC1003 10
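Here's a data.table solution (assuming DT is the data as a data.table, e.g. built from df1 above):
library(data.table)
DT <- as.data.table(df1)
DT[, Credits := max(Credits), by = Student_ID]
Result <- DT[Appraisal.Type == "Grade supervisor"]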
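For completeness, the same idea can be sketched in base R with ave(), which returns the group-wise result aligned to the original rows. This is only a sketch: it groups by student and course together (as in the df2 example above), and the pasted grouping key and the name `result` are illustrative choices.

```r
df2 <- data.frame(
  Appraisal.Type = c("Final result", "Final result", "Grade supervisor",
                     "Grade supervisor", "Final result", "Grade supervisor"),
  Resits      = c(0L, 0L, 0L, 0L, 0L, 0L),
  Credits     = c(6.5, 6.5, 0, 0, 12, 0),
  Course_code = c("ABC1000", "ABC1003", "ABC1000",
                  "ABC1003", "ABC1294", "ABC1294"),
  Student_ID  = c(10L, 10L, 10L, 10L, 23L, 23L),
  stringsAsFactors = FALSE
)

# One grouping key per student/course pair; a single pasted key avoids
# empty student-by-course combinations being passed to max()
key <- paste(df2$Student_ID, df2$Course_code)

# Overwrite Credits with the group maximum, row by row
df2$Credits <- ave(df2$Credits, key, FUN = max)

# Keep only the "Grade supervisor" rows
result <- df2[df2$Appraisal.Type == "Grade supervisor", ]
result
```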

How to drop NA variables in a data frame by row

Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NA. Now I want to apply the same operations to every row:
Drop the 4 variables that are NA
Calculate diversity for the remaining 4 variables (it's just some computation on the remaining values; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How to do drop NA variables using dplyr? I don't know whether select() can make it.
How to apply operations to every row of a data frame?
It seems that drop_na() will remove the entire row for my dataset, any suggestion?
With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, group by row number ('rn'), change the 'key' values to common values, and then spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
library(vegan) # for diversity()
res <- rownames_to_column(df1, 'rn') %>%
gather(key, val, -rn) %>%
filter(!is.na(val)) %>%
group_by(rn) %>%
mutate(key = LETTERS[1:4]) %>%
spread(key, val) %>%
ungroup %>%
select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can replace NA with 0 and apply on the whole dataset i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
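On the question of applying an operation to every row, a base R sketch with apply() is shown below. It relies on the same assumption as above (exactly 4 non-NA values in every row), and the name `compact` is illustrative.

```r
df1 <- data.frame(
  Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA, 266.7),
  L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA),
  C = c(NA, 23.8, NA, 8.47, 20.89, 18.72, 14.94, NA),
  X = c(40.56, NA, 26.05, 3.08, 23.77, 59.37, NA, NA),
  W = c(29.47, NA, NA, NA, 36.08, NA, 27.34, 28.19),
  S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34, NA),
  Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19),
  H = c(0, NA, NA, NA, NA, 0, NA, 0)
)

# For each row, keep only the non-NA values (in column order).
# apply() returns one column per input row, so transpose the result.
compact <- t(apply(df1, 1, function(r) unname(r[!is.na(r)])))
colnames(compact) <- LETTERS[1:4]

compact
```

The resulting 8 x 4 matrix can then be passed to a row-wise function such as diversity() from vegan, or processed further with apply(compact, 1, ...).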
