R: Comparing values in vector to column in data frame

Apologies if this has been asked before, but I've searched for a while and can't find anything to answer my question. I'm somewhat comfortable using R but never really learned the fundamentals. Here's what I'm trying to do.
I've got a vector (call it "responseTimes") that looks something like this:
150 50 250 200 100 150 250
(It's actually much longer, but I'm truncating it here.)
I've also got a data frame where one column, timeBin, is essentially counting up by 50 from 0 (so 0 50 100 150 200 250 etc).
What I'm trying to do is to count how many values in responseTimes are less than or equal to each row in the data frame. I want to store these counts in a new column of my data frame. My output should look something like this:
timeBin counts
      0      0
     50      1
    100      2
    150      4
    200      5
    250      7
I know I can use the sum function to compare vector elements to some constant (e.g., sum(responseTimes>100) would give me 5 for the data I've shown here) but I don't know how to do this to compare to a changing value (that is, to compare to each row in the timeBin column).
I'd prefer not to use a loop, as I'm told those can be particularly slow in R and I have quite a large data set that I'm working with. Any suggestions would be much appreciated! Thanks in advance.

You can use sapply this way:
> timeBin <- seq(0, 250, by=50)
> responseTimes <- c(150, 50, 250, 200, 100, 150, 250 )
>
> # using sapply (after all `sapply` is a loop)
> ans <- sapply(timeBin, function(x) sum(responseTimes<=x))
> data.frame(timeBin, counts=ans) # your desired output.
  timeBin counts
1       0      0
2      50      1
3     100      2
4     150      4
5     200      5
6     250      7
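If you would rather avoid the implicit loop altogether, a fully vectorized alternative (not from the original answer) is findInterval, which counts, for each bin, how many elements of the sorted vector are less than or equal to it:

counts <- findInterval(timeBin, sort(responseTimes))  # findInterval requires a sorted vector
data.frame(timeBin, counts)                           # same result as above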

This might help:
responseTimes <- c(150, 50, 250, 200, 100, 150, 250)
bins1 <- seq(0, 250, by = 50)
sahil1 <- function(input = responseTimes, binsx = bins1) {
  tablem <- table(cut(input, binsx))  # count of input values in each bin
  tablem <- cumsum(tablem)            # cumulative sums
  return(as.data.frame(tablem))       # table to data frame
}
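A quick usage check (note that, unlike the desired output, the rows are labelled by the cut() intervals and there is no row for timeBin = 0):

sahil1(responseTimes, bins1)  # cumulative counts 1, 2, 4, 5, 7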

Related

Unable to create exactly equal data partitions using createDataPartition in R- getting 1396 and 1398 observations each but need 1397

I am quite familiar with R, but I have never had this requirement before: I need to create exactly equal data partitions randomly using createDataPartition in R.
index = createDataPartition(final_ts$SAR,p=0.5, list = F)
final_test_data = final_ts[index,]
final_validation_data = final_ts[-index,]
This code creates two datasets with sizes 1396 and 1398 observations respectively.
I am surprised that p=0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting dataset not having an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split:
train <- y[caret::createDataPartition(y, p=0.5,list=F)]
table(train) # we have 10 obs
train
0 1
5 5
test <- y[-caret::createDataPartition(y, p=0.5,list=F)]
table(test) # we have 10 obs.
test
0 1
5 5
If we instead build an example with an odd number of cases:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
train <- y[caret::createDataPartition(y, p=0.5,list=F)]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-caret::createDataPartition(y, p=0.5,list=F)]
table(test) # we have 10 obs.
test
0 1
5 5
Here is another thread which explains why the number returned from createDataPartition might seem "off" to us, although it is consistent with what this function is trying to do.
So, it depends on what you have in final_ts$SAR and the spread of the data.
If it is a categorical variable, e.g. T and F, and you have 100 observations in total of which 55 are T and 45 are F, then invoking it the way you do in your code will return 51 indices, because:
55 * 0.5 = 27.5 and 45 * 0.5 = 22.5; each result is rounded up, so 28 + 23 = 51.
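You can check that arithmetic directly (this only illustrates the per-class rounding described above; it does not call caret):

ceiling(55 * 0.5) + ceiling(45 * 0.5)
# [1] 51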
You can refer to the thread below, which has a great explanation of this behaviour when the values you want to split are numbers.
R - caret createDataPartition returns more samples than expected

Running a interaction matrix between many variables

I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I am looking to see how often observations with a 'success' in one variable are matched with a success in another variable. In other words, if obs 1 has a success dummy in variable one, how often does it also have a success in variable 2, and so on for all the variables. I have found how to create a matrix table showing interactions when only two columns are involved, but I can't find anything involving many columns. Ideally I'd like to present this in an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX  1 1 1 1
XY  0 1 0 1
XZ  0 0 1 1
The output I'm hoping for would be:
Out  A  B  C  D
A    0  1  1  1
B       0  1  2
C          0  2
D             0
showing the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but it seems these require data organized as two columns and cannot handle the case of many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks.
Here's how to create a correlation matrix of any size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)
m <- diag(x = 0, ncol(df) - 1)
m[lower.tri(m)] <- res   # combn() enumerates pairs in the same order as the lower triangle
m <- t(m)                # move the counts into the upper triangle
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
m
   A  B  C  D
A  0  1  1  1
B NA  0  1  2
C NA NA  0  2
D NA NA NA  0
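Since the variables are 0/1 dummies, another option (not from the answers above) is crossprod, which computes all pairwise co-occurrence counts in a single step. This assumes, as above, that the first column of df is the ID column:

X <- as.matrix(df[, -1])   # drop the ID column, keep the 0/1 dummies
co <- crossprod(X)         # t(X) %*% X: co-occurrence count for every pair of variables
co[lower.tri(co)] <- NA    # keep only the upper triangle, as in the desired output
diag(co) <- 0
co

The diagonal of crossprod(X), before it is zeroed here, holds each variable's total number of successes.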

Convert xy dataset to array

I have a dataset of time/value pairs, e.g.
> df <- data.frame(time=c(1,4,5,6), speed=c(100, 500, 600, 800))
I want to convert it to an array with one item per time tick:
> some_func(df, speed ~ time, step=1)
[1] 100 100 100 500 600 800
Notice that values for time == 2 and 3 were added.
Then I can use it in the cross correlation function.
A possible way is to use stepfun to interpolate, as in your example above.
sf<-stepfun(df$time,c(df$speed[1],df$speed))
sf(1:6)
This will return the values that you want.
df$speed[rep(seq_len(nrow(df)), c(diff(df$time), 1))]
[1] 100 100 100 500 600 800
Here, rep takes a vector as its second (times) argument, which gives the indices into df$speed. We build that vector with diff, which calculates the gaps between time points, and append a 1 at the end so the final value is included.
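Another base R option (not in the answers above) is constant interpolation with approx, which likewise fills in the missing time ticks:

approx(df$time, df$speed, xout = seq(min(df$time), max(df$time), by = 1),
       method = "constant", f = 0)$y
# [1] 100 100 100 500 600 800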

Generate a set of random unique integers from an interval

I am trying to build some machine learning models, so I need training data and validation data.
Suppose I have N examples; I want to select x random examples from a data frame.
For example, suppose I have 100 examples and I need 10 random numbers: is there a way to (efficiently) generate 10 unique random INTEGERS so I can extract the training data out of my sample data?
I tried using a while loop, replacing the repeated numbers one by one, but the running time is not ideal, so I am looking for a more efficient way to do it.
Can anyone help, please?
sample (or sample.int) does this:
sample.int(100, 10)
# [1] 58 83 54 68 53 4 71 11 75 90
will generate ten random numbers from the range 1–100. The default replace = FALSE guarantees the values are unique; set replace = TRUE only if you want to sample with replacement:
sample.int(20, 10, replace = TRUE)
# [1] 10 2 11 13 9 9 3 13 3 17
More generally, sample samples n observations from a vector of arbitrary values.
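Applied to your training-data scenario, a minimal sketch (mydata is a hypothetical stand-in for your own data frame):

mydata <- data.frame(id = 1:100, value = rnorm(100))  # hypothetical example data
idx <- sample.int(nrow(mydata), 10)   # 10 unique row indices
training <- mydata[idx, ]             # the sampled training examples
rest     <- mydata[-idx, ]            # everything else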
If I understand correctly, you are trying to create a hold-out sample. This is usually done using probabilities. So if you have n.rows samples and want a fraction training.fraction of them to be used for training, you may do something like this:
select.training <- runif(n=n.rows) < training.fraction
data.training <- my.data[select.training, ]
data.testing <- my.data[!select.training, ]
If you want to specify the EXACT number of training cases, you may do something like:
indices.training <- sample(x=seq(n.rows), size=training.size, replace=FALSE) #replace=FALSE makes sure the indices are unique
data.training <- my.data[indices.training, ]
data.testing <- my.data[-indices.training, ] #note that index negation means "take everything except for those"
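For completeness, a self-contained setup for the placeholders used above (all of these names are hypothetical):

my.data <- data.frame(x = rnorm(100), y = rnorm(100))  # hypothetical data
n.rows <- nrow(my.data)
training.fraction <- 0.7   # fraction of rows used for training
training.size <- 70        # exact number of training rows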
From the raster package:
raster::sampleInt(242, 10, replace = FALSE)
## 95 230 148 183 38 98 137 110 188 39
Base R's sample.int may fail if the limits are too large:
sample.int(1e+12, 10)

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries; how do I select 3 such that the sum of their salaries does not exceed some value? I'm just using a while loop, but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates the selection of groups of 3 EEs from a set of 20 with salaries at a mean of 60 and an sd of 20. It shows that, of the 1140 possible combinations, only 263 have a sum of salaries below 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
  EEnams     sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166167000  (i.e. 166,167,000; commas added for readability)
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
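A minimal sketch of that sequential procedure (assuming a data frame like the salry built above, with a numeric sals column, a cap of 150 and a group size of 3; it restarts from scratch whenever it hits a dead end, as described):

pick_under_cap <- function(df, cap = 150, k = 3) {
  repeat {
    chosen <- integer(0)
    remaining <- seq_len(nrow(df))
    budget <- cap
    for (i in seq_len(k)) {
      candidates <- remaining[df$sals[remaining] < budget]
      if (length(candidates) == 0) break          # dead end: give up and restart
      pick <- if (length(candidates) == 1) candidates else sample(candidates, 1)
      chosen <- c(chosen, pick)
      budget <- budget - df$sals[pick]
      remaining <- setdiff(remaining, pick)
    }
    if (length(chosen) == k) return(df[chosen, ])
  }
}
pick_under_cap(salry)   # three rows of salry whose salaries sum to less than 150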
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create example data
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## sums of each group
sx <- colSums(x)
## keep only the sums that satisfy the constraint (here, < 1)
sxc <- sx[sx < 1]
## sampling with replacement
sample(sxc, 10, replace=TRUE)
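If you want the actual groups rather than just their sums, you can sample column indices of the combn matrix instead (same assumptions as above):

idx  <- combn(seq_len(N), 3)                        # all index triples
sums <- apply(idx, 2, function(i) sum(salary[i]))   # sum of each triple
keep <- which(sums < 1)                             # triples meeting the constraint
salary[idx[, keep[sample(length(keep), 1)]]]        # one randomly chosen valid triple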
