I have an R data frame with N rows and 6 columns. For illustration I will use the following column names: "theDate", "theIndex", "Component_1", "Component_2", "Component_3", "Component_4".
I am trying to convert it to a 3-dimensional array, with the first dimension corresponding to "theDate", the second dimension to "theIndex", and the third dimension to the values of the components.
To give an example, the expression NewArray[2, 4, 3] should return the value of Component_3 from the row whose "theDate" entry is the 2nd distinct date and whose "theIndex" entry is the 4th distinct index.
I have looked into using abind, narray, and a combination of apply/split/abind, without full success.
The closest question I found on SO is this one: Link SO, but I could not generalize its answer along the same lines.
The desired multidimensional array has dimensions (5, 7, 4): the first two dimensions correspond to the 5 distinct elements in the "theDate" column and the 7 distinct elements in the "theIndex" column, while the third dimension corresponds to the 4 component columns in the data frame: Component_1, ..., Component_4.
Here is a small piece of code to create the data frame and a multidimensional array of the desired dimensions.
EDIT: I have also added a piece of code which appears to work, and I would be interested in other solutions
`%>%` <- dplyr::`%>%`
base::set.seed(seed = 1785)
setOfComponents <- c("Component_1", "Component_2", "Component_3", "Component_4")
setOfDates <- c(234, 342, 456, 678, 874)
setOfIndices <- c(2, 7, 11, 15, 24, 36, 56)
numIndices <- length(setOfIndices)
numDates <- length(setOfDates)
numElementsComponent <- numIndices * numDates
theDF <- base::data.frame(
  theDate = c(base::rep(x = setOfDates[1], times = numIndices),
              base::rep(x = setOfDates[2], times = numIndices),
              base::rep(x = setOfDates[3], times = numIndices),
              base::rep(x = setOfDates[4], times = numIndices),
              base::rep(x = setOfDates[5], times = numIndices)),
  theIndex = base::rep(x = setOfIndices, times = numDates),
  Component_1 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_2 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_3 = stats::runif(n = numElementsComponent, min = 0, max = 100),
  Component_4 = stats::runif(n = numElementsComponent, min = 0, max = 100))
theNewDF <- theDF %>%
  tidyr::gather(key = "IdxComp", value = "ValueComp", Component_1, Component_2, Component_3, Component_4)
# theIndex varies fastest within each component in theNewDF, so fill an
# (index, date, component) array first and then permute to (date, index, component)
newArray <- aperm(array(theNewDF$ValueComp, dim = c(numIndices, numDates, length(setOfComponents))), c(2, 1, 3))
Check out the tidyr package.
I think you want the gather function.
See the package, or the descriptions here:
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
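To make that concrete, here is a minimal sketch along those lines, assuming the example objects from the question (theDF, setOfDates, setOfIndices, setOfComponents, numDates, numIndices): gather the component columns into long format, order the rows explicitly, and then fill the array so that each margin matches its intended dimension.
library(dplyr)
library(tidyr)
longDF <- theDF %>%
  gather(key = "IdxComp", value = "ValueComp", Component_1, Component_2, Component_3, Component_4) %>%
  arrange(IdxComp, theIndex, theDate)   # theDate varies fastest, IdxComp slowest
newArray <- array(longDF$ValueComp,
                  dim = c(numDates, numIndices, length(setOfComponents)),
                  # assumes setOfDates and setOfIndices are in ascending order, as in the question
                  dimnames = list(as.character(setOfDates), as.character(setOfIndices), setOfComponents))
newArray["342", "15", "Component_3"]   # Component_3 for theDate == 342, theIndex == 15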
I want to create multiple lag variables for a column in a data frame, for a range of lag values. The code below does what I want, but it is not scalable to what I need (hundreds of iterations):
Lake_Lag <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
  group_by(StationID, Test) %>%
  arrange(StationID, Test, VisitDate) %>%
  mutate(lag.Result1 = dplyr::lag(Result, n = 1, default = NA)) %>%
  mutate(lag.Result5 = dplyr::lag(Result, n = 5, default = NA)) %>%
  mutate(lag.Result10 = dplyr::lag(Result, n = 10, default = NA)) %>%
  mutate(lag.Result15 = dplyr::lag(Result, n = 15, default = NA)) %>%
  mutate(lag.Result20 = dplyr::lag(Result, n = 20, default = NA))
I would like to be able to use a list such as c(1, 5, 10, 15, 20), or a range such as 1:150, to create the lagged variables for my data frame.
Here's an approach that makes use of some 'tidy eval helpers' included in dplyr that come from the rlang package.
The basic idea is to create a new column in mutate() whose name is based on a string supplied by a for-loop.
library(dplyr)
grouped_data <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
  group_by(StationID, Test) %>%
  arrange(StationID, Test, VisitDate)

for (lag_size in c(1, 5, 10, 15, 20)) {
  new_col_name <- paste0("lag_result_", lag_size)
  grouped_data <- grouped_data %>%
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}
The !!sym(new_col_name) := syntax is a dynamic way of writing lag_result_1 =, lag_result_5 =, etc. when using functions like mutate() or summarize() from the dplyr package.
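As a tiny illustration on toy data (assumed here for demonstration, not the question's data frame), the two mutate() calls below create the same column:
library(dplyr)
toy <- tibble(Result = 1:6)
name <- "lag_result_2"
toy %>% mutate(!!sym(name) := lag(Result, n = 2))  # dynamic column name from a string
toy %>% mutate(lag_result_2 = lag(Result, n = 2))  # equivalent literal spelling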
We can use shift from data.table, which can take multiple values for n. According to ?shift:
n - Non-negative integer vector denoting the offset to lead or lag the input by. To create multiple lead/lag vectors, provide multiple values to n
Convert the 'data.frame' to 'data.table' (setDT), order by 'StationID', 'Test', 'VisitDate' in i, and, grouped by 'StationID' and 'Test', get the lag of 'Result' (the default type of shift is "lag") with n as a vector of values, assigning (:=) the output to a vector of column names (created with paste0):
library(data.table)
i1 <- c(1, 5, 10, 15, 20)
setDT(Lake_Champlain_long.term_monitoring_1992_2016)[order(StationID,
  Test, VisitDate), paste0("lag.Result", i1) := shift(Result, n = i1),
  by = .(StationID, Test)][]
NOTE: This is a much more efficient solution.
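As a quick illustration of how shift() handles a vector of n values, here is a toy vector (not the question's data):
library(data.table)
shift(1:6, n = c(1, 3))  # one lagged vector per value of n
# [[1]]  NA  1  2  3  4  5
# [[2]]  NA NA NA  1  2  3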
I already had a look here, where the cut function is used. However, I haven't been able to come up with a clever solution given my situation.
First some example data that I currently have:
df <- data.frame(
Category = LETTERS[1:20],
Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90)
)
I would like to make a third column that forms a new category based on the Nber_within_category column. In this example, how can I make e.g. Category_new such that each new category has a Nber_within_category total of at least 5, with the constraint that if a Category already has Nber_within_category >= 5, the original category is kept?
So for example, it should look like this:
df <- data.frame(
Category = LETTERS[1:20],
Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90),
Category_new = c(rep('a',5), rep('b', 4), rep('c',2), LETTERS[12:20])
)
It's a bit of a hack, but it works:
library(dplyr)

df %>%
  mutate(tmp = floor((cumsum(Nber_within_category) - 1) / 5)) %>%
  mutate(new_category = ifelse(Nber_within_category >= 5,
                               Category,
                               letters[tmp + 1]))
The line floor((cumsum(Nber_within_category) - 1)/5) bins the cumulative sum into bins of size 5 (the -1 includes the rows where the sum is exactly 5), and I use it as an index to assign new categories to the rows where Nber_within_category < 5.
It might be easier to understand how the column tmp is defined if you run:
x <- 1:100
data.frame(x, y = floor((x- 1)/5))
I have several (named) vectors in a list:
data = list(a=runif(n = 50, min = 1, max = 10), b=runif(n = 50, min = 1, max = 10), c=runif(n = 50, min = 1, max = 10), d=runif(n = 50, min = 1, max = 10))
I want to play around with different combinations of them depending on the row from another array called combs:
var <- letters[1:length(data)]
combs <- do.call(expand.grid, lapply(var, function(x) c("", x)))[-1,]
I would like to be able to extract each combination so that I can use the vectors created by these combinations.
All this is so that I can apply functions across the rows of each extracted combination, and then combine the results from these data frames. So for example:
# Row 5 is "a", "c"
combs[5,]
# Use this information to extract this particular combination from my data:
# by hand it would be:
res_row5 = cbind(data[["a"]], data[["c"]])
# Extract another combination
# Row 11 is "a", "b", "d"
combs[11,]
res_row11 = cbind(data[["a"]], data[["b"]], data[["d"]])
# So that I can apply functions to each row across all these vectors
res_row_5_func = apply(res_row5, 1, sum)
# Apply another function to res_row11
res_row_11_func = apply(res_row11, 1, prod)
# Multiply the two, or do other computations, which I can do as long as I have extracted the right vectors
I had already asked a very similar question here: Is there an easy way to match values of a list to array in R?
But can't figure out how to extract the actual data...
Thanks so much!
What you could do is first generate a list of vectors indexing the relevant entries in data:
library(magrittr)
combList <- lapply(1:nrow(combs), function(ii) combs[ii,] %>% unlist %>% setdiff(""))
You could then use this list to index the columns in data and generate a new list of the desired matrices:
dataMatrixList <- lapply(combList, function(indVec) data[indVec] %>% do.call('cbind', .))
The i-th entry in your dataMatrixList then contains a matrix with columns corresponding to the i-th row in combs. You can then compute sums, products etc. using
rowSumsList <- lapply(dataMatrixList, function(x) apply(x, 1, sum))
Here is another approach that I think gives what you want. It returns a list of your data frames by subsetting your data list by the (non-empty) elements of each row of combs:
data_sets <- apply(combs, 1,
                   function(x) do.call(cbind.data.frame, data[unlist(x[x != ""])]))
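To tie this back to the question, here is a short usage sketch (assuming the data, combs, and data_sets objects defined above):
# Row 5 of combs is "a", "c"; row 11 is "a", "b", "d"
res_row5 <- data_sets[[5]]
res_row11 <- data_sets[[11]]
row5_sums <- rowSums(res_row5)            # same as apply(res_row5, 1, sum)
row11_prod <- apply(res_row11, 1, prod)
row5_sums * row11_prod                    # element-wise product of the two results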
I have a dataset for which I'd like to be able to dynamically create columns, including their names.
An example simplified dataset is:
dataset <- data.frame(data = c(0, 1, 2, 3),
                      signups = c(100, 150, 200, 210),
                      leads = c(10, 12, 15, 18),
                      opportunities = c(2, 4, 5, 3),
                      closed = c(1, 4, 2, 1))
I'd like the following additional fields for the dataset, defined as such:
lead_percentage <- dataset$leads / dataset$signups
opportunity_percentage <- dataset$opportunities / dataset$signups
closed_percentage <- dataset$closed / dataset$signups
I have many columns for which this happens, and can't figure out how to loop through in order to do this.
So far, I know I can create a list of the column names using this code:
colnames_list <- character(0)
for (c in 3:ncol(dataset)) {
  colnames_list[c] <- paste(colnames(dataset)[c], "percentage", sep = "_")
}
I also know how to dynamically define the values of the new columns, but can't seem to figure out how to get the new column names from the list to the dataframe.
This could be what you need:
l <- lapply(dataset[,3:5], "/", dataset$signups)
names(l) <- paste(names(dataset[,3:5]), "percentage", sep = "_")
dataset <- cbind(dataset,l)
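If you prefer a tidyverse route, here is a sketch of the same idea using dplyr::across() (assumes dplyr >= 1.0.0 and the column names from the example dataset above):
library(dplyr)
dataset <- dataset %>%
  mutate(across(c(leads, opportunities, closed),
                ~ .x / signups,
                .names = "{.col}_percentage"))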
Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the student into groups, based on the criteria of obtained marks, so that difference between marks in each group should be more than 10. I tried to use the function table, which gives the number of students in each range from say 20-30, 30-40, but I want it to pick those students that have marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean by "put all their information together in a group", but here is a way to obtain a list of data frames, split from your original data frame, where each element contains the students within a mark range of 10:
mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))
split(mydata,cut(mydata$marks,seq(20,100,by=10)))
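If you then want to pull out one of those groups, a quick sketch (the bin labels follow cut()'s default "(lower,upper]" format):
groups <- split(mydata, cut(mydata$marks, seq(20, 100, by = 10)))
groups[["(30,40]"]]   # all students with marks in (30, 40]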
I think that #Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially #Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) split(x, cut(x$marks,
                                breaks = c(0, 30, 50, 75, 100),
                                include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes the column names are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
I had the same kind of issue and, after researching some answers on Stack Overflow, I came up with the following solution:
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
Sample code is shown below:
# Note: 'all' is the answerer's own data set here; 'downlink' is the numeric column being binned
range <- NULL
for (i in seq(0, max(all$downlink), 2000)) {
  range <- c(range, i)
}
# count how many values fall into each bin
counts <- numeric(length(range) - 1)
for (i in 1:length(counts)) {
  counts[i] <- length(which(all$downlink >= range[i] & all$downlink < range[i + 1]))
}
countmax <- max(counts)
a <- round(countmax / 1000) * 1000   # round the y-axis limit to the nearest 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))