Create new columns based on similar row values in R
How do I create a new set of data frame columns based on matched row values?
For instance, for this sample data frame:
x <- data.frame(cbind(numsp   = rep(c(16, 64, 256), each = 12),
                      Colless = rep(c("loIc", "midIc", "hiIc"), each = 4, times = 3),
                      lambdaE = rep(c(TRUE, FALSE), each = 2, times = 9),
                      ntree   = rep(c(1, 2), length.out = 36),
                      metric1 = seq(1:36),
                      metric2 = seq(1:36)))
For a given parameter, e.g., lambdaE, I'd like to create new columns for metric1 and metric2 based on whether lambdaE is TRUE or FALSE.
The data frame would look something like this:
x2 <- data.frame(cbind(numsp   = rep(c(16, 64, 256), each = 6),
                       Colless = rep(c("hiIc", "loIc", "midIc"), each = 2, times = 3),
                       ntree   = rep(c(1, 2), length.out = 18),
                       metric1.lambdaE.FALSE = c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),
                       metric2.lambdaE.FALSE = c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),
                       metric1.lambdaE.TRUE  = c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18),
                       metric2.lambdaE.TRUE  = c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18)))
Or alternatively for the parameter "Colless", a new set of columns for metric1 and metric2 for each level of Colless.
Thanks in advance!
Okay, it looks like base R's reshape() (from the stats package, not reshape2) provides a quick solution:
reshape(x, direction="wide", idvar=c("numsp","Colless","ntree"), timevar="lambdaE")
melt() and dcast() from reshape2 can also be used:
library(reshape2)
mm <- melt(x, id = c('numsp', 'Colless', 'lambdaE', 'ntree'))
dcast(mm, numsp+Colless+ntree~lambdaE+variable)
numsp Colless ntree FALSE_metric1 FALSE_metric2 TRUE_metric1 TRUE_metric2
1 16 hiIc 1 11 11 9 9
2 16 hiIc 2 12 12 10 10
3 16 loIc 1 3 3 1 1
4 16 loIc 2 4 4 2 2
5 16 midIc 1 7 7 5 5
6 16 midIc 2 8 8 6 6
7 256 hiIc 1 35 35 33 33
8 256 hiIc 2 36 36 34 34
9 256 loIc 1 27 27 25 25
10 256 loIc 2 28 28 26 26
11 256 midIc 1 31 31 29 29
12 256 midIc 2 32 32 30 30
13 64 hiIc 1 23 23 21 21
14 64 hiIc 2 24 24 22 22
15 64 loIc 1 15 15 13 13
16 64 loIc 2 16 16 14 14
17 64 midIc 1 19 19 17 17
18 64 midIc 2 20 20 18 18
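For a more recent alternative to reshape2, tidyr's pivot_wider() should produce an equivalent wide layout (a minimal sketch, assuming tidyr >= 1.0.0 is installed; the column types of x may still need cleaning up because it was built via cbind()):

library(tidyr)
pivot_wider(x,
            id_cols     = c(numsp, Colless, ntree),
            names_from  = lambdaE,
            values_from = c(metric1, metric2))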
Related
How to delete values from a column only from a threshold on? [duplicate]
I have a dataframe (k by 4). I have ordered one of the four columns in descending order (from 19 to -9, let's say). I would like to throw away the values that are smaller than 1.5. I tried, unsuccessfully, various combinations of the following code:

subset(w, select = -c(columnofinterest, <=1.50))

Can anyone help me? Thanks a lot!
You can use arrange() and filter() from the dplyr package:

library(dplyr)
w <- data.frame(use_this = round(runif(100, min = -9, max = 19)),
                second = runif(100),
                third = runif(100),
                fourth = runif(100)) %>%
  arrange(desc(use_this)) %>%
  filter(use_this >= 1.5)

Output:
> w
   use_this      second      third     fourth
1        19 0.264306555 0.11234097 0.30149863
2        19 0.574675520 0.50406805 0.71502833
3        19 0.376586752 0.21530618 0.35323250
4        18 0.949974135 0.46726122 0.36008741
5        17 0.339737597 0.11358402 0.04035303
6        16 0.180291264 0.81855913 0.16109650
7        16 0.958398058 0.94827266 0.54693974
8        16 0.297317238 0.28726682 0.63560208
9        16 0.653006870 0.15175848 0.69305851
10       16 0.685338886 0.30493976 0.89360112
11       16 0.493931093 0.52830391 0.68391458
12       16 0.945083084 0.19880501 0.66769341
13       16 0.910927578 0.86032225 0.73062990
14       15 0.662130980 0.19207451 0.44240610
15       15 0.730482762 0.92418574 0.46387086
16       15 0.547101759 0.87847767 0.27973739
17       15 0.487773258 0.05870471 0.40147753
18       15 0.695824922 0.91289504 0.94897518
19       14 0.576095914 0.42914670 0.27707368
20       14 0.156691824 0.02187951 0.31940887
21       13 0.079037019 0.16993999 0.53232350
22       13 0.944372064 0.63485350 0.23548337
23       13 0.016378244 0.42772076 0.76618218
24       13 0.606340182 0.33611591 0.36017352
25       13 0.170346203 0.43325314 0.16285515
26       13 0.605379012 0.95574187 0.23941377
27       12 0.157352454 0.90963650 0.01611328
28       12 0.353934785 0.80058806 0.13782414
29       12 0.464950823 0.81835421 0.12771521
30       12 0.624139506 0.69472154 0.02833191
31       11 0.362033514 0.98849181 0.37684822
32       11 0.067974815 0.24154922 0.49300890
33       11 0.522271380 0.03502680 0.50665790
34       10 0.810183210 0.56598130 0.41279787
35       10 0.609560713 0.46745813 0.34939724
36       10 0.087748839 0.56531646 0.02249387
37       10 0.008262635 0.68432285 0.35648525
38       10 0.757824842 0.57826099 0.89973902
39       10 0.428174539 0.12538288 0.69233083
40       10 0.785175550 0.21516237 0.36578714
41       10 0.631388832 0.63700087 0.40933640
42       10 0.171396873 0.37925970 0.27935731
43       10 0.773437320 0.24710107 0.23902388
44        8 0.443778088 0.77238651 0.08517639
45        8 0.954302451 0.87102748 0.52031446
46        8 0.347608835 0.79912385 0.36169856
47        8 0.839238717 0.54200177 0.52221408
48        8 0.235710838 0.85575923 0.78092366
49        7 0.610772265 0.16833538 0.94704562
50        7 0.242917834 0.02852729 0.87131760
51        7 0.875879507 0.04537683 0.81000861
52        7 0.577880660 0.54259171 0.43301336
53        6 0.541772984 0.06164861 0.62867700
54        6 0.071746509 0.51758874 0.70365933
55        5 0.103953563 0.99147043 0.33944620
56        5 0.504618656 0.95827073 0.65527417
57        5 0.726648637 0.37460291 0.47072657
58        5 0.796268586 0.09644167 0.93960812
59        5 0.796498528 0.68346948 0.23290885
60        5 0.490859592 0.76727730 0.39888256
61        5 0.949232913 0.02954981 0.56672834
62        4 0.360401806 0.62879833 0.31107107
63        4 0.926329930 0.87624801 0.91260914
64        4 0.922783983 0.11524112 0.06240194
65        3 0.518727534 0.23927630 0.37114683
66        3 0.951288192 0.58672287 0.45337659
67        3 0.767943126 0.76102957 0.24347122
68        2 0.786254279 0.39824869 0.58548193
69        2 0.321557042 0.75393236 0.43273743
70        2 0.872124621 0.89918160 0.55623725
71        2 0.242389529 0.85453423 0.78540085
72        2 0.013294874 0.61593974 0.70549476
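A base-R equivalent of the same ordering and filtering should also work (a minimal sketch, assuming a data frame w whose column of interest is called use_this):

# sort the column of interest in descending order, then keep rows at or above the threshold
w_sorted   <- w[order(w$use_this, decreasing = TRUE), ]
w_filtered <- subset(w_sorted, use_this >= 1.5)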
Adding a name to fields in a newly created data frame
I have created a new data.frame from another data.frame, for example:

aaa = cbind(bb1[,1], bb1[,2], ay, ax)

I want to name bb1[,1] as prob, bb1[,2] as recommendation, and keep the remaining names as they are. Can someone tell me the syntax for doing this? Thanks
bb1 <- data.frame(c(1:10), c(11:20))
ax  <- c(21:30)
ay  <- c(31:40)
aaa <- data.frame(cbind(bb1[,1], bb1[,2], ay, ax))
colnames(aaa) <- c("prob", "recommendation", "ay", "ax")

Output
> aaa
   prob recommendation ay ax
1     1             11 31 21
2     2             12 32 22
3     3             13 33 23
4     4             14 34 24
5     5             15 35 25
6     6             16 36 26
7     7             17 37 27
8     8             18 38 28
9     9             19 39 29
10   10             20 40 30
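Alternatively, the names can be set while the data frame is being built, which avoids the separate colnames() call (a minimal sketch using the same bb1, ay, and ax objects as above):

# name the new columns directly in data.frame(); ay and ax keep their own names
aaa <- data.frame(prob = bb1[, 1], recommendation = bb1[, 2], ay = ay, ax = ax)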
Trying to integrate over discrete points from a data frame
I have several months of weather data; an example day is here:

   Hour Avg.Temp
1     1       11
2     2       11
3     3       11
4     4       10
5     5       10
6     6       11
7     7       12
8     8       14
9     9       15
10   10       17
11   11       19
12   12       21
13   13       22
14   14       24
15   15       23
16   16       22
17   17       21
18   18       18
19   19       16
20   20       15
21   21       14
22   22       12
23   23       11
24   24       10

I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing degree days, a concept in agriculture that gives valuable information about relative growth rate; for example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this? Another column could potentially work with a simple subtraction; then I would have to take a cumulative sum after cancelling out all the negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write to get the answer in one step?
This solution subtracts your threshold (i.e., 15°), fits a function to the result, and then integrates that function. Note that if the temperature is below the threshold it contributes zero to the total rather than a negative value.

df <- read.table(text = "Hour Avg.Temp
1   1  11
2   2  11
3   3  11
4   4  10
5   5  10
6   6  11
7   7  12
8   8  14
9   9  15
10 10  17
11 11  19
12 12  21
13 13  22
14 14  24
15 15  23
16 16  22
17 17  21
18 18  18
19 19  16
20 20  15
21 21  14
22 22  12
23 23  11
24 24  10", header = TRUE)

with(df, integrate(approxfun(Hour, pmax(Avg.Temp - 15, 0)),
                   lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039

Created on 2019-02-08 by the reprex package (v0.2.1.9000)
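If you prefer to avoid approxfun()/integrate(), the trapezoidal rule applied directly to the hourly points gives essentially the same number (a minimal sketch on the same df; with hourly spacing it evaluates to 53, matching the 53.00017 above):

# clip temperatures below the threshold to zero
excess <- pmax(df$Avg.Temp - 15, 0)
# trapezoidal rule over the hourly grid
sum(diff(df$Hour) * (head(excess, -1) + tail(excess, -1)) / 2)
#> [1] 53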
The OP has asked to figure out the total number of hours above 15 degrees by integrating in R. It is not fully clear to me what the expected result is: does the OP want to count the number of hours above 15 degrees, or sum up the degrees above 15 ("integrate")? The code below produces both figures. Assuming the data is sampled at every hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:

library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") := .(cumsum(pmax(0, Avg.Temp - 15)),
                                                 cumsum(Avg.Temp > 15))]

    Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
 1:    1       11           0           0
 2:    2       11           0           0
 3:    3       11           0           0
 4:    4       10           0           0
 5:    5       10           0           0
 6:    6       11           0           0
 7:    7       12           0           0
 8:    8       14           0           0
 9:    9       15           0           0
10:   10       17           2           1
11:   11       19           6           2
12:   12       21          12           3
13:   13       22          19           4
14:   14       24          28           5
15:   15       23          36           6
16:   16       22          43           7
17:   17       21          49           8
18:   18       18          52           9
19:   19       16          53          10
20:   20       15          53          10
21:   21       14          53          10
22:   22       12          53          10
23:   23       11          53          10
24:   24       10          53          10
    Hour Avg.Temp deg_hrs_sum deg_hrs_cnt

Alternatively,

setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
              deg_hrs_cnt = sum(Avg.Temp > 15))]

returns only the final result (last row):

   deg_hrs_sum deg_hrs_cnt
1:          53          10

Data

library(data.table)
DT <- fread("
rn Hour Avg.Temp
 1  1 11
 2  2 11
 3  3 11
 4  4 10
 5  5 10
 6  6 11
 7  7 12
 8  8 14
 9  9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
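If data.table is not already part of the workflow, the same two summary figures can be computed with plain base R (a minimal sketch, run against the same DT data):

# total degree-hours above 15 degrees
sum(pmax(DT$Avg.Temp - 15, 0))   # 53
# number of hours strictly above 15 degrees
sum(DT$Avg.Temp > 15)            # 10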
Create partition based on two variables
I have a data set with two outcome variables, case1 and case2. case1 has 4 levels, while case2 has 50 (the number of levels in case2 could increase later). I would like to create a train/test data partition that keeps the class ratios for both cases. The real data is imbalanced for both case1 and case2. As an example:

library(caret)
set.seed(123)
matris = matrix(rnorm(10), 1000, 20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2

split1 <- createDataPartition(df$case1, p = 0.2)[[1]]
train1 <- df[-split1, ]
test1  <- df[split1, ]
length(split1)
# 201

split2 <- createDataPartition(df$case2, p = 0.2)[[1]]
train2 <- df[-split2, ]
test2  <- df[split2, ]
length(split2)
# 220

If I do two separate splits, I get test sets of different sizes. If I do one split based on case2 (the one with more classes), I lose the class ratio for case1. I will be predicting the two cases separately, but in the end my accuracy will be given by having an exact match for both cases (e.g., ix = which(pred1 == case1 & pred2 == case2)), so I need the arrays to be the same size. Is there a smart way to do this? Thank you!
If I understand correctly (which I do not guarantee), I can offer the following approach.

Group by case1 and case2 and get the group indices:

library(tidyverse)
df %>%
  select(case1, case2) %>%
  group_by(case1, case2) %>%
  group_indices() -> indeces

Use these indeces as the outcome variable in createDataPartition():

split1 <- createDataPartition(as.factor(indeces), p = 0.2)[[1]]

Check if the split is satisfactory:

table(df[split1, 22])
#output
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 5  6  5  8  5  5  6  6  4  6  6  6  6  6  5  5  5  4  4  7  5  6  5  6  7  5  5  8  6  7  6  6  7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 4  5  6  6  6  5  5  6  5  6  6  5  4  5  6  4  6

table(df[-split1, 22])
#output
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15  8 13 13 15 21 14 11 13 12  9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 9  6 12 19 14 10 16 19 17 17 16 14  4 15 14  9 19

table(df[split1, 21])
#output
 1  2  3  4
71 70 71 67

table(df[-split1, 21])
  1   2   3   4
176 193 174 178
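The same grouping can be expressed without the tidyverse by collapsing case1 and case2 into a single factor with base R's interaction() and stratifying on that (a minimal sketch on the same df):

library(caret)
# one factor level per observed (case1, case2) combination
combo  <- interaction(df$case1, df$case2, drop = TRUE)
split1 <- createDataPartition(combo, p = 0.2)[[1]]
train  <- df[-split1, ]
test   <- df[split1, ]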
How to obtain all possible sub-samples of size n from a dataframe of size N in R?
I have a dataframe with 20 classroom indexes [1 to 20] and a different number of students in each class. How can I obtain all sub-samples of size n = 8 and store them, since I want to use them later for calculations? I used combn(), but that takes only one vector; can I use it with a dataframe, and how? (Sorry, I'm new to R.) The dataframe is below:

   classrooms students
1           1       29
2           2       30
3           3       35
4           4       28
5           5       32
6           6       20
7           7       25
8           8       22
9           9       32
10         10       26
11         11       27
12         12       34
13         13       27
14         14       28
15         15       33
16         16       21
17         17       36
18         18       24
19         19       19
20         20       32
It is as simple as passing a function to combn(); simplify = FALSE means that a list will be returned. Assuming you want all possible combinations of 8 classrooms from the dataset classrooms:

combinations <- combn(nrow(classrooms), 8,
                      function(x, data) data[x, ],
                      simplify = FALSE, data = classrooms)

head(combinations, n = 2)
[[1]]
  classrooms students
1          1       29
2          2       30
3          3       35
4          4       28
5          5       32
6          6       20
7          7       25
8          8       22

[[2]]
  classrooms students
1          1       29
2          2       30
3          3       35
4          4       28
5          5       32
6          6       20
7          7       25
9          9       32
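Keep in mind that the number of sub-samples grows quickly, so it can be worth checking the count before materialising the whole list (a small sketch for the 20-row classrooms data frame):

# number of distinct 8-classroom subsets out of 20 classrooms
choose(20, 8)
#> [1] 125970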