Adding values of two columns on the same row to get a new value - r

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond with time duration (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject- I have only included first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below which is for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (hdt_lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the number of row and have the data pulled across from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following warning:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.

df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166",h=T,strin=F)

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data [row, column]. But when I want to use only the rows in which Quarter equals (for example) 4. I use data [Quarter = 4,] Why is it not data [, Quarter = 4] since Quarter is a column and not a row?
I did not find an appropriate answer in the internet which really explains the why.
Thank you.
You have used 'data.table' function to import your data, so you could write :
sales[Customer != 891,]
The data[Quarter = 4, ], ensures that all columns should be returned for the rows where Quarter is equal to 4. The comma(,) is necessary to only select the rows, and not the column Quarter = 4.
When you use indexing, ie, data[row, column] you are telling R to look for either a specific row or column index.
Row: sales[sales$Customer %in% c(192,964),] translates to "search the specific column Customer in the data frame (or table) for any rows that have values that contain 192 or 964 and isolate them. Note that data.table will allow for sales[Customer %in% c(192, 964),] but data frames cant (use sales[sales$Customer %in% c(192,964),])
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns sales[, "Customer"] translates to "search the data frame (or table) for columns named "Customer" and isolate all its rows
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine - if you did, sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] you would isolate all values of Customer and Producttype which were in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

Stratifying multiple columns for cross-validation

There are many ways I've seen to stratify a sample by a single variable to use for cross-validation. The caret package does this nicely with the createFolds() function. By default it seems that caret will partition such that each fold has roughly the same target event rate.
What I want to do though is stratify by the target rate and by time. I've found a function that can partially do this, it's the splitstackshape package and uses the stratified() function. The issue with that function though is it returns a single sample, it doesn't split the data into k groups under the given conditions.
Here's some dummy data to reproduce.
set.seed(123)
time = rep(seq(1:10),100)
target = rbinom(n=100, size=1, prob=0.3)
data = as.data.frame(cbind(time,target))
table(data$time,data$target)
0 1
1 60 40
2 80 20
3 80 20
4 60 40
5 80 20
6 80 20
7 60 40
8 60 40
9 70 30
10 80 20
As you can see, the target event rate is not the same across time. It's 40% in time 1 and 20% in time 2, etc. I want to preserve this when creating the folds used for cross-validation. If I understand correctly, caret will partition by the overall event rate.
table(data$target)
0 1
710 290
This rate of ~30% will be preserved overall, but target event rate over time will not.
We can get one sample like this:
library(splitstackshape)
train.index <- stratified(data,c("target","time"),size=.2)
I need to repeat this though 4 more times for a 5-fold cross validation and it needs to be done such that once a row is assigned it can't be assigned again. I feel like there should be a function designed for this already. Any ideas?
I know this post is old but I just had the same problem and I couldn't find another solution. In case anyone else needs an answer, here's the solution I'm implementing.
library(data.table)
mystratified <- function(indt, group, NUM_FOLDS) {
indt <- setDT(copy(indt))
if (is.numeric(group))
group <- names(indt)[group]
temp_grp <- temp_ind <- NULL
indt[, `:=`(temp_ind, .I)]
indt[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
samp_sizes <- indt[, .N, by = group]
samp_sizes[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
inds <- split(indt$temp_ind, indt$temp_grp)[samp_sizes$temp_grp]
z = unlist(inds,use.names=F)
model_folds <- suppressWarnings(split(z, 1:NUM_FOLDS))
}
Which is basically a rewriting of splitstackshape::stratified. It works like the following, giving as output a list of validation indeces for each fold.
myfolds = mystratified(indt = data, group = colnames(data), NUM_FOLDS = 5)
str(myfolds)
List of 5
$ 1: int [1:200] 1 91 181 261 351 441 501 591 681 761 ...
$ 2: int [1:200] 41 101 191 281 361 451 541 601 691 781 ...
$ 3: int [1:200] 51 141 201 291 381 461 551 641 701 791 ...
$ 4: int [1:200] 61 151 241 301 391 481 561 651 741 801 ...
$ 5: int [1:200] 81 161 251 341 401 491 581 661 751 841 ...
So, for instance the train and validation data for each fold are:
# first fold
train = data[-myfolds[[1]],]
valid = data[myfolds[[1]],]
# second fold
train = data[-myfolds[[2]],]
valid = data[myfolds[[2]],]
# etc...

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but i wasn't sure how to title what i am trying to do. My objective is to create a dataset of 1000 obs each would be the length of the run. I have created a phase1 dataset, from which a set of control limits are produced. What i am trying to do now is create a phase2 dataset most likely using rnorm. what im trying to do is create a repeat loop that will continuously create values in the phase2 dataset until one of those values is outside of the control limits produced from the phase1 dataset. for example if i had 3.0 and -3.0 as control limits the phase2 dataset would create a bunch of observations until obs 398 when the value here happens to be 3.45, thus stopping the creation of data. my objective is then to record the number 398. Furthermore, I am then trying to loop the code back to the phase1 dataset/ control limits portion and create a new set of control limits and then run another phase2, until i have 1000 run lengths recorded. the code i have for the phase1/ control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this code in SAS and it looks like this. might be a better reference for what i am trying to achieve then me trying to explain it.
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
Also been doing research about using a while loop in replace of the repeat loop.
Im new to R so Any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like creating more numbers with rnorm() at a time and then counting how many are necessary to exceed the limits and repeat if necessary.

Combining Two Rows with Different Levels according to Some Conditions into One in R

This is a part of my data: (The actual data contains about 10,000 observations with about 500 levels of SalesItem)
s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
s2<-c(155,153,154,150,176,165,159,143,179,150)
S<-data.frame(SalesItem=factor(s1), Sales=s2)
> str(S)
'data.frame': 10 obs. of 2 variables:
$ SalesItem: Factor w/ 10 levels "1008","1009",..: 1 2 3 4 5 6 7 8 9 10
$ Sales : num 155 153 154 150 176 165 159 143 179 150`
What I want to do is, if diff(SalesItem)=1, I want to combine the level of SalesItem into 1, for example: diff between SalesItem 1008 and 1009 equal to one, so, I want to rename SalesItem 1009 to 1008. So, later I can compute the sum of Sales for this SalesItem as one, because of my actual data=10,000, so, it is quite hard for me to do this one by one.
Is there any simplest way for me to do that?
Clearly the fact that you have converted the first column to a factor indicates that you might need those factors in some place. so i would suggest that instead of changing any of the columns, add a third column to your data frame which will help you maintain the SalesItem relevant to that value. here are the steps for it :
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> s3 = ifelse((s1-1) %in% s1, s1-1, s1)
> S <- data.frame(SalesItem=s1, Sales=s2, ItemId=s3)
then you can just count on the basis of the ItemId column.
This is not a terribly efficient solution, but since your data only contains 10000 records, it is not going to be a big problem.
Set up provided example data, but convert the SalesItem field to an integer so that the diff() operation makes sense.
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> S<-data.frame(SalesItem=s1, Sales=s2)
Reorder data frame so that the SalesItem field is in ascending order (not necessary for current data set, but required for solution) then find the differences.
> S = S[order(S$SalesItem),]
> d = c(0, diff(S$SalesItem))
Duplicate the SalesItem data and then filter based on the values of the differences.
> labels = s1
> #
> for (n in 1:nrow(S)) {if (d[n] == 1) labels[n] = labels[n-1]}
> S$labels = labels
The (temporary) labels field now has the required new values for the SalesItem field. Once you are happy that this is doing the right thing, you can modify last line in above code to simply over-write the existing SalesItem field.
> S
SalesItem Sales labels
1 1008 155 1008
2 1009 153 1008
3 1012 154 1012
4 1013 150 1012
5 1016 176 1016
6 1017 165 1016
7 1018 159 1016
8 1019 143 1016
9 1054 179 1054
10 1055 150 1054

Find the non zero values and frequency of those values in R

I have a data which has two parameters, they are data/time and flow. The flow data is intermittent flow. Lets say at times there is zero flow and suddenly the flow starts and there will be non-zero values for sometime and then the flow will be zero again. I want to understand when the non-zero values occur and how long does each non-zero flow last. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made plot to see the distribution but couldn't get a clue where to start to get the frequency of each non zero values. I would like to see a output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the non-zero value first and then find how many non zero values are there continuously before it reaches zero value again. What I want to understand is the flow release durations. For eg. in one day there might be multiple releases and I want to note at what time did the release start and how long did it continue before coming to value zero. I hope this explain the problem little better.
The first point is that you have too many NA in your data. In case you want to look into it.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by #mnel under comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked a a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you jsut wante the number of flow-postive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278

Resources