This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that I have 2 instances where b=0 in the random sample, and 8 instances where b=1 in the dataframe?
Right now, I know that I can do df[sample(nrow(df),10,] to get part of the answer, but that would give me a random amount of 0 and 1 instances. How can I specify a specific amount of 0 and 1 instances while still taking a random sample?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
c(sample(pop[pop == 0], n_zero),
sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is your data you are summarizing with mean and sd
dat <- data.frame(
id = 1:100,
val = sample(c(0,1), 100, replace = TRUE),
d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
c(sample(dat[dat$val == 0,"id"], n_zero),
sample(dat[dat$val == 1,"id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])
Here is a suggestion:
First create a sample of 0 and 1 with id column.
Then sample 2:8 df's with condition and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1,size=50,replace=TRUE)) %>%
mutate(id = row_number())
df1 <- df[ sample(which (df$value ==0) ,2), ]
df2 <- df[ sample(which (df$value ==1), 8), ]
df_final <- bind_rows(df1, df2)
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
b = sample(c(0,1),26,T))
bind_rows(
df %>%
filter(b == 0) %>%
sample_n(2),
df %>%
filter(b == 1) %>%
sample_n(8)
) %>%
arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1
I have a data frame where I would like to put in front of a column name the following words: "high_" and "low_". The name of the columns from X2-X4 should be renamed eg.high_X2 and X5-X7 eg. low_X6.
Please see an example below.
X1 X2 X3 X4 X5 X6 X7
a 1 0 1 1 1 1 0
b 2 2 1 1 1 1 0
result
X1 high_X2 high_X3 high_X4 low_X5 low_X6 low_X7
a 1 0 1 1 1 1 0
b 2 2 1 1 1 1 0
You can use rep and paste -
names(df)[-1] <- paste(rep(c('high', 'low'), each = 3), names(df)[-1], sep = '_')
df
# X1 high_X2 high_X3 high_X4 low_X5 low_X6 low_X7
#a 1 0 1 1 1 1 0
#b 2 2 1 1 1 1 0
If you want to rely on range of columns then dplyr code would be easier.
library(dplyr)
df %>%
rename_with(~paste('high', ., sep = '_'), X2:X4) %>%
rename_with(~paste('low', ., sep = '_'), X5:X7)
The base solution (which is more straitforward for these kind of things imo)
df <- data.frame(X1=c(a=1L,b=2L),
X2=c(a=0L,b=2L),
X3=c(a=1L,b=1L),
X4=c(a=1L,b=1L),
X5=c(a=1L,b=1L),
X6=c(a=1L,b=1L),
X7=c(a=1L,b=1L))
cn <- colnames(df)
cond <- as.integer(substr(cn,2L,nchar(cn))) %% 2L == 0L
colnames(df)[cond] <- paste0(cn[cond],"_is_pair")
A tidyverse solution (a bit more awkward due to the tidyeval)
library(dplyr)
library(stringr)
library(tidyselect)
df <- data.frame(X1=c(a=1L,b=2L),
X2=c(a=0L,b=2L),
X3=c(a=1L,b=1L),
X4=c(a=1L,b=1L),
X5=c(a=1L,b=1L),
X6=c(a=1L,b=1L),
X7=c(a=1L,b=1L))
is_pair <- function(vars = peek_vars(fn = "is_pair")) {
vars[as.integer(str_sub(vars,2L,nchar(vars))) %% 2L == 0L]
}
df %>% rename_with(~paste0(.x,"_is_pair"),
is_pair())
I have below-mentioned dataframe in R:
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
Using the above mentioned two dataframe, I need the output in the following format:
Where I want to identify the exclusive count of Check and Remark for each unique ID and combination of each Check with another Check and similar with Remark.
Note - The order of rows should be highest to lowest based on Exclusive_Count of Check. It is quite possible that the number of unique Check and Remark may differ in my actual dataframe. (i.e 10 unique Remark and 5 Check, something like this)
DF_Output<-
Remark Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 OT LP
Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
X1 1 33.33% 0 2 2 1 A1 1 25.00 0 1 1 0
X2 0 0.00% 2 0 1 0 A2 1 25.00% 1 0 1 1
XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 0 1
XT 1 33.33% 1 0 1 0 LP 0 0.00% 0 1 1 0
Total 3 100.00% 5 4 4 2 Total 4 100.00% 2 3 3 2
The OP has requested a canonical answer. So, I have created a function get_exclusive_counts() which takes the first two columns of any tibble, data.frame, or data.table where the first column contains IDs and the second column contains the payload, e.g., Check, in long format.
The function is independent of column names and will work with an arbitrary number of different items in the payload column. It returns a data.table for each input tibble:
get_exclusive_counts(DF)
Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0
6: Totals 4 100.00% 2 3 2 3
For the second use case DF2, the payload needs to be split into separate rows beforehand:
library(magrittr)
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks) %>%
get_exclusive_counts()
Remarks Exclusive_Count % X1 X2 XR XT
1: Blank 1 33.33% 0 0 0 0
2: X1 1 33.33% 0 2 2 1
3: XT 1 33.33% 1 0 1 0
4: X2 0 0.00% 2 0 2 0
5: XR 0 0.00% 2 2 0 1
6: Totals 3 100.00% 5 4 5 2
Note that the name of the first column of the result table has been retained from the input data.frame.
The OP has mentioned that the number of Remarks and Check may differ. Therefore, it doesn't really make sense to cbind() the two result tables because this only will give a reasonable result in case the number of rows is the same.
Also, OP's expected result has some column names repeated (at least Exclusive_Count, %, perhaps more) which indicates that the result may not be used for further processing but for display / print only.
Printing results side by side
However, I have created a function get_exclusive_counts_side_by_side() which prints the results from calling get_exclusive_counts()
for an arbitray number of input datasets,
with differing numbers of rows, and
with the last rows (Totals) aligned.
The function returns a data.table with character columns.
The call below will reproduce OP'S expected result:
get_exclusive_counts_side_by_side(
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Remarks Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 LP OT
1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1
3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1
4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1
5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0
6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3
Here is another use case to demonstrate that it will work with differing rows and an arbitrary number of input datasets:
get_exclusive_counts_side_by_side(
DF,
DF3 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Check Exclusive_Count % A1 A2 LP OT Remarks Exclusive_Count % X1 X2 XR XT Y2 Y3 Y4 Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0
6: Y2 0 0.00% 1 0 0 0 0 1 1
7: Y3 0 0.00% 1 0 0 0 1 0 0
8: Y4 0 0.00% 0 0 0 0 1 0 0
9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3
Function definitions
The code looks rather bulky but half of the lines are comments. So, the code should be fairly self-explanatory.
Also, about half of the lines of code are due to OP's additional requirements, like a % column or a Totals row.
get_exclusive_counts <- function(DF) {
library(data.table)
library(magrittr)
# make copy of first 2 cols to preserve original attributes of DF
DT <- as.data.table(DF[, 1:2])
# retain original column names
old <- colnames(DT)[1:2]
# rename colnames in copy for convenience of programming
setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
# aggregate by id to find exclusive counts = ids with only one element
tmp <- DT[, .N, keyby = id][N == 1L]
# create table of exclusive counts by joining and aggregating
excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
# append column of proportions, will be formatted after computing Totals
.[, `%` := Exclusive_Count / sum(Exclusive_Count)]
# anti-join to find remaining rows
rem <- DT[!tmp, on = .(id)]
# create co-occurrence matrix in long format by a self-join
coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
# reshape to wide format and compute counts of co-occurrences w/o diagonals
dcast(val ~ i.val, length, subset = .(val != i.val))
# build final result table by merging both subresults
merge(excl, coocc, by = "val", all = TRUE) %>%
# replace NA counts by 0
.[, lapply(.SD, nafill, fill = 0L), by = val] %>%
# clean-up: order by decreasing Exclusive_Counts %>%
.[order(-Exclusive_Count)] %>%
# append Totals row
rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
# clean-up: format proportion as percentage
.[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
# clean-up: Replace <NA> by "Blank" in val column
.[is.na(val), val := "Blank"] %>%
# rename val column
setnames("val", old[2]) %>%
# return result visibly
.[]
}
Here is the code for get_exclusive_counts_side_by_side():
get_exclusive_counts_side_by_side <- function(...) {
library(data.table)
library(magrittr)
# process input, return list of subresults
ec_list<- list(...) %>%
lapply(get_exclusive_counts)
# create row indices for maximum rows
rid <- ec_list %>%
lapply(nrow) %>%
Reduce(max, .) %>%
{data.table(.rowid = 1:.)}
# combine subresults
ec_list %>%
# insert empty rows if necessary
lapply(function(.x) .x[
, .rowid := .I][
# but align last row
.rowid == .N, .rowid := nrow(rid)][
rid, on =.(.rowid)][
, .rowid := NULL]
) %>%
# all data.tables have the same number of rows, now cbind()
do.call(cbind, .) %>%
# replace all NA by empty character strings
.[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
Additional explanation
If I understand correctly, exclusive counts refers to IDs which have only of one item (or NA) assigned to it. This is fairly straight forward to compute by
counting the number of items per ID,
picking the IDs with only one item,
picking the rows in the input data.frame which belong to those IDs (using a join), and
counting the appearances of the items in the subset of exclusive rows.
Furthermore, the function deals with OP's additional requirements which go beyond the identification of exclusive counts:
adding a matrix of co-occurrence counts of the remaining, non-exclusive
rows,
adding a column of proportions of exclusive counts at a specific position and formatting it as percent,
adding a Totals row,
replacing NAs by zero or "Blank", resp.
Data
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
DF3 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}",
"I-8", "{X1,Y2,Y3}",
"I-9", "{Y2,Y4}",
"I10", "{X1}",
)
I think this will do what you're looking for... Likely not the most succinct, but seems to do the trick.
# Load Library
library('tidyverse')
### CHECK ###
# Load Check Table
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
# Count by ID
DF <- DF %>%
group_by(ID) %>%
mutate(count = n())
# Count by Check
DF_X <- DF %>% dplyr::filter(count == 1) %>%
group_by(Check) %>%
dplyr::summarize("Count" = sum(count))
# Identify unique values of Check
DF_UNIQUE <- unique(DF$Check)
DF_FIN <- data.frame("Check" = DF_UNIQUE,stringsAsFactors = FALSE)
# Join Counts by Check with unique list of Checks
DF_FIN <- left_join(x = DF_FIN, y = DF_X, by = "Check")
# Replace NA's with zeros
DF_FIN[is.na(DF_FIN$Count),2] <- 0
# Calculate Percentages
DF_FIN <- DF_FIN %>%
mutate("Check Percentage" = `Count`/sum(`Count`))
# Rename Columns
colnames(DF_FIN) <- c("Check", "Exclusive Count", "Check Percentage")
# Replace NA value with the word "BLANK"
DF_FIN[is.na(DF_FIN$Check),1] <- "BLANK"
# Sort by Exclusive Count and then by Check (alphabetical)
DF_FIN <- DF_FIN %>%
arrange(desc(`Exclusive Count`), Check)
# Join Checks to itself and count instances
DF_CHECKS <- full_join(x = DF, y = DF, by = "ID")
DF_CHECKS <- DF_CHECKS %>%
group_by(Check.x, Check.y) %>%
dplyr::summarize("N" = n())
DF_CHECKS_SPREAD <- DF_CHECKS %>%
tidyr::pivot_wider(names_from = Check.y, values_from = N)
check_order <- DF_CHECKS_SPREAD$Check.x
check_order[is.na(check_order)] <- 'NA'
DF_CHECKS_SPREAD <- DF_CHECKS_SPREAD %>% select(check_order)
# Set the diagonal to zeros
for (i in 1:nrow(DF_CHECKS_SPREAD)){
DF_CHECKS_SPREAD[i,i+1] <-0
}
# Rename Columns
colnames(DF_CHECKS_SPREAD)[1] <- "Check"
colnames(DF_CHECKS_SPREAD)[colnames(DF_CHECKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_CHECKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD$Check),1] <- "BLANK"
# Replace all other NA's with zero
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD)] <- 0
# Join the two Checks data sets together & calculate grand totals
FINAL_TABLE_CHECKS <- left_join(x = DF_FIN, y = DF_CHECKS_SPREAD, by = "Check")
FINAL_TABLE_CHECKS <- FINAL_TABLE_CHECKS %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))
### REMARKS ###
# Load Remarks table
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
# Remove the {} from the Remarks string
DF2$Remarks <- str_replace_all(string = DF2$Remarks, c("\\{" = "", "\\}" = ""))
# Expand string into rows
DF2 <- separate_rows(DF2, Remarks, convert = TRUE)
# Group and count by ID
DF2 <- DF2 %>%
group_by(ID) %>%
mutate(count = n())
# Count by Remarks
DF2_X <- DF2 %>% dplyr::filter(count == 1) %>%
group_by(Remarks) %>%
dplyr::summarize("Count" = sum(count))
# Identify unique Remarks
DF2_UNIQUE <- unique(DF2$Remarks)
DF2_FIN <- data.frame("Remarks" = DF2_UNIQUE,stringsAsFactors = FALSE)
# Join count of Remarks with unique list of Remarks
DF2_FIN <- left_join(x = DF2_FIN, y = DF2_X, by = "Remarks")
# Replace NA's with zeros
DF2_FIN[is.na(DF2_FIN$Count),2] <- 0
# Calculate Percentages
DF2_FIN <- DF2_FIN %>%
mutate("Remarks Percentage" = `Count`/sum(`Count`))
# Rename columns
colnames(DF2_FIN) <- c("Remarks", "Exclusive Count", "Remarks Percentage")
# Replace NA value with the word "BLANK"
DF2_FIN[is.na(DF2_FIN$Remarks),1] <- "BLANK"
# Sort by Exclusive Count and then by Check (alphabetical)
DF2_FIN <- DF2_FIN %>%
arrange(desc(`Exclusive Count`), Remarks)
# Join Remarks to itself and count instances
DF_REMARKS <- full_join(x = DF2, y = DF2, by = "ID")
DF_REMARKS <- DF_REMARKS %>%
group_by(Remarks.x, Remarks.y) %>%
dplyr::summarize("N" = n())
DF_REMARKS_SPREAD <- DF_REMARKS %>%
tidyr::pivot_wider(names_from = Remarks.y, values_from = N)
check_order <- DF_REMARKS_SPREAD$Remarks.x
check_order[is.na(check_order)] <- 'NA'
DF_REMARKS_SPREAD <- DF_REMARKS_SPREAD %>% select(check_order)
# Set the diagonal to zeros
for (i in 1:nrow(DF_REMARKS_SPREAD)){
DF_REMARKS_SPREAD[i,i+1] <-0
}
# Rename Columns
colnames(DF_REMARKS_SPREAD)[1] <- "Remarks"
colnames(DF_REMARKS_SPREAD)[colnames(DF_CHECKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_REMARKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD$Remarks),1] <- "BLANK"
# Replace all other NA's with zero
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD)] <- 0
# Join the two Remarks data sets together & calculate grand totals
FINAL_TABLE_REMARKS <- left_join(x = DF2_FIN, y = DF_REMARKS_SPREAD, by = "Remarks")
FINAL_TABLE_REMARKS <- FINAL_TABLE_REMARKS %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))
# Count Rows in Check and Remarks dataframes and add rows in dataframe
# with less rows to match # of rows in other.
checkRows <- nrow(FINAL_TABLE_CHECKS)
remarksRows <- nrow(FINAL_TABLE_REMARKS)
rowDiff <- abs(checkRows - remarksRows)
if(checkRows < remarksRows){
cat("Adding", rowDiff , "rows to the Checks dataframe.\n\n")
FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS)+rowDiff,] <- NA
FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS),] <- FINAL_TABLE_CHECKS[checkRows,]
FINAL_TABLE_CHECKS[checkRows,] <- NA
}else if(remarksRows < checkRows){
cat("Adding", rowDiff , "rows to the Remarks dataframe.\n\n")
FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS)+rowDiff,] <- NA
FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS),] <- FINAL_TABLE_REMARKS[remarksRows,]
FINAL_TABLE_REMARKS[remarksRows,] <- NA
}else{
print("There is no difference in number of rows between Checks and Remarks.\n\n")
}
# Combine columns from Checks and Remarks into one table.
RESULTS <- cbind(FINAL_TABLE_REMARKS, FINAL_TABLE_CHECKS)
RESULTS$`Check Percentage` <- paste(round(100*RESULTS$`Check Percentage`,2), "%", sep="")
RESULTS$`Remarks Percentage` <- paste(round(100*RESULTS$`Remarks Percentage`,2), "%", sep="")
RESULTS
I have a data-frame with these characteristics:
Z Y X1 X2 X3 X4 X5 ... X30
A n1 1 2 1 2 1 2 1 2
B n2 1 2 1 2 1 2 1 2
C n3 1 2 1 2 1 2 1 2
D n4 1 2 1 2 1 2 1 2
.
.
.
My purpose is to stack the column x1, x2, … x30, and associated the new column with columns z, y, and x. Some like this:
Newcolumn zyx
1 x-y-z
... I need a data-frame like this:
colum1 colum2
1 A+n1+X1.headername 1
2 B+n2+X2.headernam 2
3 C+n3X3.headername 1
4 D+n4X4.headername 2
. .
. .
. .
I’m trying to build a function, but I have some troubles
I follow this code for the data-frame:
df$zy <- paste(df$z,"-",df$y)
After that, I eliminate the columns “z” and “y”:
df$z <- NULL
df$y <- NULL
And save column df$zy as data-frame for use later:
df_zy <- as.data.frame(df$zy)
Then eliminate df$xy of original dataframe:
df$xy <- NULL
After that, I save as data-frame the column x1, and incorporate df_zy and name of column x1 (the name is “1”):
a <- as.data.frame(df$`1`)
b <- cbind(a, df_xy, x_column= 1)
b$zy <- paste(b$x_column,"-",b$` df$zy`)
b$` df$zy ` <- NULL
b$ x_column <- NULL
colnames(b)
names(b)[names(b) == "b$`1`"] <- "new_column"
This works, but only for the column x1 and I need this for x1 to x30, and stack all new column
Does anybody have an answer to this problem? Thanks!
You can use tidyr and dplyr librairies:
library(dplyr)
library(tidyr)
df_zy = df %>% pivot_longer(., cols = starts_with("X"), names_to = "Variables", values_to = "Value") %>%
mutate(NewColumn = paste0(Z,"-",Y,"-",Variables)) %>% select(NewColumn, Value)
And you get:
> df_zy
# A tibble: 8 x 2
NewColumn Value
<chr> <dbl>
1 A-n1-X1 1
2 A-n1-X2 2
3 B-n2-X1 1
4 B-n2-X2 2
5 C-n3-X1 1
6 C-n3-X2 2
7 D-n4-X1 1
8 D-n4-X2 2
Data
df = data.frame("Z" = LETTERS[1:4],
"Y" = c("n1","n2","n3","n4"),
"X1" = c(1,1,1,1),
"X2" = c(2,2,2,2))
Is it what you are looking for ?