Dynamically Create Columns in R

I have a dataset for which I'd like to be able to dynamically create columns, including their names.
An example simplified dataset is:
dataset <- data.frame(data = c(0, 1, 2, 3),
                      signups = c(100, 150, 200, 210),
                      leads = c(10, 12, 15, 18),
                      opportunities = c(2, 4, 5, 3),
                      closed = c(1, 4, 2, 1))
I'd like the following additional fields for the dataset, defined as such:
lead_percentage <- dataset$leads / dataset$signups
opportunity_percentage <- dataset$opportunities / dataset$signups
closed_percentage <- dataset$closed / dataset$signups
I have many columns for which this happens, and can't figure out how to loop through in order to do this.
So far, I know I can create a list of the column names using this code:
colnames_list <- character(0)
# Columns 3 onwards hold the counts (leads, opportunities, closed)
for (i in 3:ncol(dataset)) {
  colnames_list <- c(colnames_list, paste(colnames(dataset)[i], "percentage", sep = "_"))
}
I also know how to dynamically define the values of the new columns, but can't seem to figure out how to get the new column names from the list to the dataframe.

This could be what you need:
l <- lapply(dataset[,3:5], "/", dataset$signups)
names(l) <- paste(names(dataset[,3:5]), "percentage", sep = "_")
dataset <- cbind(dataset,l)
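If you would rather keep an explicit loop, as in the question, a minimal sketch could look like the one below. It assumes the count columns are known by name (the vector count_cols is just an illustrative helper, not something from the question) and uses [[<- so that the new column name can be built as a string.

```r
dataset <- data.frame(data = c(0, 1, 2, 3),
                      signups = c(100, 150, 200, 210),
                      leads = c(10, 12, 15, 18),
                      opportunities = c(2, 4, 5, 3),
                      closed = c(1, 4, 2, 1))

# Assumed helper: the columns that should be divided by signups
count_cols <- c("leads", "opportunities", "closed")

for (col in count_cols) {
  new_name <- paste(col, "percentage", sep = "_")
  # dataset[[name]] <- value creates the column when it does not exist yet
  dataset[[new_name]] <- dataset[[col]] / dataset$signups
}

names(dataset)
```

The point is that dataset[[new_name]] accepts a character string, whereas dataset$new_name would create a column literally called new_name.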

Related

Convert R dataframe into a 3-dimensional array

I have an R dataframe that has N rows and 6 columns. For illustration, I will use the following column names: "theDate","theIndex","Component_1","Component_2","Component_3","Component_4"
I am trying to convert it to a 3-dimensional array, with the first dimension corresponding to "theDate", the second dimension to "theIndex" and the third dimension to the values of the components.
To give an example, the expression NewArray[2,4,3] should return the value of Component_3 from the row that holds the 2nd distinct value of "theDate" and the 4th distinct value of "theIndex".
I have looked into using abind, narray, and a combination of apply/split/abind, without full success.
The closest question I found on SO is this one: Link SO, but I could not generalize it along the same lines as the answer found there.
The desired multidimensional array has dimensions (5, 7, 4). The first two dimensions correspond to the 5 distinct elements in the "theDate" column and the 7 distinct elements in the "theIndex" column, while the third dimension corresponds to the 4 additional columns in the dataframe: Component_1,...,Component_4.
Here is a small piece of code to create the dataframe and to build a multidimensional array of the desired dimensions.
EDIT: I have also added a piece of code which appears to work, and I would be interested in other solutions:
`%>%` <- dplyr::`%>%`
base::set.seed(seed = 1785)
setOfComponents <-c("Component_1","Component_2","Component_3","Component_4")
setOfDates <- c(234, 342, 456, 678, 874)
setOfIndices <- c(2, 7, 11, 15, 24, 36, 56)
numIndices <- length(setOfIndices)
numDates <- length(setOfDates)
numElementsComponent <- numIndices * numDates
theDF <- base::data.frame(
  theDate = c(base::rep(x = setOfDates[1],times = numIndices),
              base::rep(x = setOfDates[2],times = numIndices),
              base::rep(x = setOfDates[3],times = numIndices),
              base::rep(x = setOfDates[4],times = numIndices),
              base::rep(x = setOfDates[5],times = numIndices)),
  theIndex = base::rep(x = setOfIndices,times = numDates),
  Component_1 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_2 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_3 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_4 = stats::runif(n = numElementsComponent, min = 0, max = 100) )
theNewDF <- theDF %>%
  tidyr::gather(key = "IdxComp", value = "ValueComp", Component_1, Component_2, Component_3, Component_4)
# theIndex varies fastest in the rows, so fill as (index, date, component), then swap the first two dimensions
newArray <- aperm(array(theNewDF$ValueComp, dim = c(numIndices, numDates, length(setOfComponents))), c(2, 1, 3))
Check out the tidyr package.
I think you want the gather function.
See the package, or the descriptions here:
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
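As an alternative that does not depend on the row order of the dataframe, here is a rough base R sketch using tapply() and sapply(..., simplify = "array"). The toy dataframe (toyDF) and its values are made up for illustration and are much smaller than the one in the question; each (theDate, theIndex) pair is assumed to occur exactly once, so sum() just picks up that single value.

```r
# Small stand-in for theDF: 2 dates x 3 indices, 2 component columns
toyDF <- expand.grid(theIndex = c(2, 7, 11), theDate = c(234, 342))
toyDF$Component_1 <- toyDF$theDate + toyDF$theIndex / 100
toyDF$Component_2 <- -toyDF$Component_1
components <- c("Component_1", "Component_2")

# One date-by-index matrix per component, stacked along a third dimension
toyArray <- sapply(
  components,
  function(cmp) tapply(toyDF[[cmp]], list(toyDF$theDate, toyDF$theIndex), sum),
  simplify = "array"
)

dim(toyArray)                            # dates x indices x components
toyArray["342", "11", "Component_1"]     # value for date 342, index 11
```

Because the array carries dimnames, elements can be looked up either by position or by the date, index and component labels.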

R: create a new categorical variable from a categorical variable based on a continuous variable

I already had a look here, where the cut function is used. However, I haven't been able to come up with a clever solution given my situation.
First some example data that I currently have:
df <- data.frame(
Category = LETTERS[1:20],
Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90)
)
I would like to make a third column that forms a new category based on the Nber_within_category column. In this example, how can I make e.g. Category_new such that the Nber_within_category in each new category adds up to at least 5, with the constraint that if a Category already has Nber_within_category >= 5, the original category is kept?
So for example, it should look like this:
df <- data.frame(
Category = LETTERS[1:20],
Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90),
Category_new = c(rep('a',5), rep('b', 4), rep('c',2), LETTERS[12:20])
)
It's a bit of a hack, but it works:
library(dplyr)
df %>%
  mutate(tmp = floor((cumsum(Nber_within_category) - 1)/5)) %>%
  # as.character() guards against Category being stored as a factor
  mutate(new_category = ifelse(Nber_within_category >= 5,
                               as.character(Category),
                               letters[tmp+1]))
The line floor((cumsum(Nber_within_category) - 1)/5) categorises the cumulative sum into bins of size 5 (the -1 makes sure rows where the sum is exactly 5 fall into the lower bin). I'm using the result as an index to get new categories for the rows where Nber_within_category < 5.
It might be easier to understand how the column tmp is defined if you run:
x <- 1:100
data.frame(x, y = floor((x- 1)/5))
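For reference, the same binning idea can be written in base R and checked against the Category_new column from the question. This is only a sketch: it relies on the small categories coming first, as they do in the example data, and the last bin can still end up below 5 (the 'c' group here only sums to 4).

```r
df <- data.frame(
  Category = LETTERS[1:20],
  Nber_within_category = c(rep(1, 8), rep(2, 3), rep(6, 2), rep(10, 3), 30, 50, 77, 90),
  stringsAsFactors = FALSE
)

# Bin number of each row, based on the running total in steps of 5
bin <- floor((cumsum(df$Nber_within_category) - 1) / 5)

# Keep the original category when it is already large enough,
# otherwise use a lower-case letter for the bin
small <- df$Nber_within_category < 5
df$Category_new <- ifelse(small, letters[bin + 1], df$Category)

df
```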

Making Calculations on Several Textfiles and making a Dataframe from it R

I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops, but none seem to be working. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
header=T)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
>View(df)
               Sum
10       0.1745092
11-100   0.2926735
101-1000 0.4211533
1000+    0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several files of the same kind. All of them have the same column names, so I know that all of them will use the freq column for the calculations. How do I make a table after doing the calculations on each file? I want to keep the names of the text files as the sample names, so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help, any explanation would be most welcome. Thank you.
How about something like this? Your example is not reproducible, so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case adjust this to repeat for the grouping you want, i.e replace the each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
gather(value = value, key = key, - id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
group_by(id, key) %>%
summarise(sigma = sum(value))
df_gathered_sum
You might run into issues with the id column if your dataframes are not of equal length, so this is only a partial answer. I can do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? I may have sorted it with a couple of edits...
I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
# Read one file and return the four partial sums of its freq column
sumfunction <- function(x) {
  wow <- read.table(x, header = TRUE)
  ten <- sum(wow$freq[1:10])          # top 10
  to100 <- sum(wow$freq[11:100])      # 11-100
  to1000 <- sum(wow$freq[101:1000])   # 101-1000
  rest <- sum(wow$freq[-c(1:1000)])   # 1001+
  c(ten, to100, to1000, rest)
}
library(data.table)
library(tools)
data_dir <- "C:/temp/"
# full.names = TRUE so read.table() finds the files regardless of the working directory
filenames <- list.files(path = data_dir, pattern = "\\.txt$", full.names = TRUE)
alltogether <- lapply(filenames, sumfunction)
data <- as.data.frame(data.table::transpose(alltogether),
                      col.names = c("Top 10 ", "From 11 to 100", "From 101 to 1000", "From 1000 on "),
                      row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I changed them to column names and made the name of each text file a row name. The file_path_sans_ext(basename(filenames)) call makes sure to keep just the file name and remove the extension.
I hope this helps anyone that reads this! Thank you again! I love this platform because just being part of this environment gets me thinking and always striving to better myself at R.
If anyone has any input, that would be great!!! <3
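One possible variation, sketched below with base R only: let cut() assign each row to a rank bin and tapply() do the sums, then collect one row per file. The two files written to a temporary folder, their names (sampleA, sampleB) and their values are made up so the example runs on its own; with real data you would point list.files() at your own folder instead.

```r
# Hypothetical stand-in data: two small files with a 'freq' column
demo_dir <- file.path(tempdir(), "freq_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(freq = rep(0.0008, 1250)),
            file.path(demo_dir, "sampleA.txt"), row.names = FALSE)
write.table(data.frame(freq = rep(0.0005, 2000)),
            file.path(demo_dir, "sampleB.txt"), row.names = FALSE)

# Sum 'freq' over the rank bins 1-10, 11-100, 101-1000 and 1001+
bin_sums <- function(path) {
  freq <- read.table(path, header = TRUE)$freq
  bins <- cut(seq_along(freq), breaks = c(0, 10, 100, 1000, Inf),
              labels = c("1-10", "11-100", "101-1000", "1001+"))
  tapply(freq, bins, sum, default = 0)   # empty bins count as 0
}

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
sums <- t(sapply(files, bin_sums))       # one row per file, one column per bin
rownames(sums) <- tools::file_path_sans_ext(basename(files))
sums
```

Keeping one row per sample and one column per bin means t(sums) can be passed straight to barplot() for a stacked plot with the file names as bar labels.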

Trying to compare two dataframes, and writing a logical result to a new dataframe in R

I have an R dataframe that contains 18 columns. I would like to write a function that compares column 1 to column 2 and writes a logical result (T or F) to a new column depending on whether both columns contain the same value (this part is not too hard for me). However, I would also like to repeat this process for the following pairs of columns, writing T/F to a new column each time.
In other words: values in col 1 == values in col 2, write T/F to a new column; values in col 3 == values in col 4, write T/F to a new column (or write the results to a new dataframe).
I have been trying to do this with the purrr package, and use the pmap/map function, but I know I am making a mistake and missing some important part.
This function should work if I understand your problem correctly.
df <-
data.frame(a = c(18, 6, 2 ,0),
b = c(0, 6, 2, 18),
c = c(1, 5, 6, 8),
d = c(3, 5, 9, 2))
compare_columns <-
  function(x){
    n_columns <- ncol(x)
    odd_columns <- 2*1:(n_columns/2) - 1
    even_columns <- 2*1:(n_columns/2)
    # Compare each odd-numbered column of x with the column to its right
    comparisons_list <-
      lapply(seq_len(n_columns/2),
             function(y){
               x[, odd_columns[y]] == x[, even_columns[y]]
             })
    comparisons_df <-
      as.data.frame(comparisons_list,
                    col.names = paste0("column", odd_columns, "_column", even_columns))
    return(cbind(x, comparisons_df))
  }
compare_columns(df)
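A shorter, vectorised variant is sketched below. It assumes an even number of columns, paired as (1,2), (3,4) and so on; the "_eq_" naming of the result columns is just an illustrative choice.

```r
df <- data.frame(a = c(18, 6, 2, 0),
                 b = c(0, 6, 2, 18),
                 c = c(1, 5, 6, 8),
                 d = c(3, 5, 9, 2))

odd  <- seq(1, ncol(df), by = 2)
even <- seq(2, ncol(df), by = 2)

# Element-wise comparison of all odd columns against all even columns at once
same <- as.matrix(df[odd]) == as.matrix(df[even])
colnames(same) <- paste0(names(df)[odd], "_eq_", names(df)[even])

result <- cbind(df, same)
result
```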

Referring to a data frame by a variable name when creating a new column in R

I have a series of ten data frames containing two columns, x and y. I want to add a new column to each data frame containing the name of the data frame. The problem I am running into is how to refer to the data frame using a variable so I can perform this task iteratively. In addition to just referring to it by the variable name, I have also tried get() as follows:
for(i in 1:10){
name <- paste(substr(fileList, 3, 7),i, sep = "")
assign(newName, as.data.frame(get(name)))
get(newName)$Species = c(paste(substr(fileList, 3, 7),i, sep = ""))
}
However, I get the following error when I do so:
Error in get(newName)$Species = c(paste(substr(fileList[a], 3, 7), i, :
could not find function "get<-"
Is there another way to phrase the column assignment command so that I can get around this error, or is the solution more complex?
Here are three different options if you put all your data frames into a named list:
df_list <- list(a = data.frame(x = 1:5),
b = data.frame(x = 1:5))
#Option 1
for (i in seq_along(df_list)){
df_list[[i]][,'Species'] <- names(df_list)[i]
}
#Option 2
tmp <- do.call(rbind,df_list)
tmp$Species <- rep(names(df_list),times = sapply(df_list,nrow))
split(tmp,tmp$Species)
#Option 3
mapply(function(x,y) {x$Species <- y; x},df_list,names(df_list),SIMPLIFY = FALSE)
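As for the error itself: an assignment like get(newName)$Species <- value makes R look for a replacement function called get<-, which does not exist. If you want to stay with separate data frames rather than a list, a minimal sketch is to copy the object, modify the copy, and assign() it back. The data frame names df_a and df_b below are made up for illustration.

```r
df_a <- data.frame(x = 1:3, y = 4:6)
df_b <- data.frame(x = 7:9, y = 1:3)

for (nm in c("df_a", "df_b")) {
  tmp <- get(nm)        # fetch the data frame by its name
  tmp$Species <- nm     # add the name as a new column
  assign(nm, tmp)       # write the modified copy back under the same name
}

df_a
```

A named list, as in the options above, is usually easier to work with; mget(c("df_a", "df_b")) would collect existing data frames into one.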

Resources