bin range and form a data frame using the boundary.


Suppose I want to generate bins for the range 1 to 20:
round(seq(1,20,length.out=5))
the output is
1 6 10 15 20
I want to form a data.frame as
[,1] [,2]
[1,] 1 6
[2,] 7 10
[3,] 11 15
[4,] 16 20
So the starts will be 1, 7, 11, 16 and the ends 6, 10, 15, 20, respectively.
Any solution for this?

x = round(seq(1,20,length.out=5))
df = data.frame(a = c(x[1], head(x[-1],-1) + 1), b = x[-1])
df
# a b
#1 1 6
#2 7 10
#3 11 15
#4 16 20
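
The same construction can be wrapped in a small helper so it works for any vector of increasing boundaries. This is only a sketch: make_bins is a made-up name (not a base R function), and it assumes the boundaries are whole numbers sorted in increasing order.

```r
# Hypothetical helper (not part of base R): turn sorted integer boundaries
# into non-overlapping [start, end] bins.
make_bins <- function(breaks) {
  ends <- breaks[-1]                          # every boundary except the first closes a bin
  starts <- c(breaks[1], head(ends, -1) + 1)  # each later bin starts one past the previous end
  data.frame(start = starts, end = ends)
}

bins <- make_bins(round(seq(1, 20, length.out = 5)))
bins
```

Each bin after the first starts at the previous end plus one, so the bins never share a value.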

I am not sure whether this is the solution you are looking for. If it is, you can use the cut and sub functions, as in my earlier post:
mydata<-round(seq(1,20,length.out=5))
mydata<-as.data.frame(mydata)
names(mydata)<-"V" #name the column as V
mydata$V1<-cut(mydata$V,5) #break the data into five intervals and name that as col V1
mydata$lower<-with(mydata,as.numeric( sub("\\((.+),.*", "\\1", V1))) #extract lower value
mydata$upper<-with(mydata,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",V1))) # extract upper value
myfinaldata<-mydata[,c("lower","upper")] #create data frame of lower and upper values
> myfinaldata
lower upper
1 0.981 4.79
2 4.790 8.60
3 8.600 12.40
4 12.400 16.20
5 16.200 20.00
Note: Although these look like overlapping intervals, they are not. By default cut() builds intervals that are open on the left and closed on the right, so the first row covers all data > 0.981 and <= 4.79, whereas the second row covers data > 4.79 and <= 8.60.

Related

divide multiple column by a value based on each condition

I have a dataset that has 3 different conditions. Data within condition 1 needs to be divided by 15; data within conditions 2 and 3 needs to be divided by 10. I tried using a for() loop to create separate datasets for each condition and then merge the two groups (group 1 is composed of condition 1, group 2 is composed of conditions 2 and 3). This is what I have so far for condition 1. Is there an easier way to do this that does not require creating subgroups?
Group1 <- NULL
for (val in ParticipantID) {
ParticipantID_subset_Group1 <- subset(PronounData, ParticipantID == val & Condition == "1")
I_Words_PPM <- (ParticipantID_subset_Group1$I_Words/"15")
YOU_Words_PPM <- (ParticipantID_subset_Group1$YOU_Words/"15")
WE_Words_PPM <- (ParticipantID_subset_Group1$WE_Words/"15")
df <- data.frame(val, Group, I_Words_PPM, YOU_Words_PPM, WE_Words_PPM)
Group1 <- rbind(Group1, df)
}
dim(Group1)
colnames(Group1) <- c("ParticipantID", "Condition", "I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")
View(Group1)

Couldn't fully test this solution without example data, but this should do what you want:
# make some fake data
PronounData <- data.frame(
ParticipantID = 1:9,
Condition = rep(1:3, 3),
I_Words = sample(0:20, 9, replace = TRUE),
YOU_Words = sample(0:40, 9, replace = TRUE),
WE_Words = sample(0:10, 9, replace = TRUE)
)
# if Condition 1, divide by 15
PronounData[PronounData$Condition == 1, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition == 1, c("I_Words", "YOU_Words", "WE_Words")] / 15
# if Condition 2 or 3, divide by 10
PronounData[PronounData$Condition %in% 2:3, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition %in% 2:3, c("I_Words", "YOU_Words", "WE_Words")] / 10
# result
PronounData
# ParticipantID Condition I_Words YOU_Words WE_Words I_Words_PPM YOU_Words_PPM WE_Words_PPM
# 1 1 1 17 40 6 1.1333 2.6667 0.4000
# 2 2 2 14 1 6 1.4000 0.1000 0.6000
# 3 3 3 2 34 8 0.2000 3.4000 0.8000
# 4 4 1 0 33 1 0.0000 2.2000 0.0667
# 5 5 2 4 15 0 0.4000 1.5000 0.0000
# 6 6 3 1 7 6 0.1000 0.7000 0.6000
# 7 7 1 6 10 1 0.4000 0.6667 0.0667
# 8 8 2 1 33 9 0.1000 3.3000 0.9000
# 9 9 3 9 40 0 0.9000 4.0000 0.0000
NB, R is built on vectorized operations, so looping through each row is rarely the best solution. Instead, you generally want to find a way of modifying whole vectors/columns at once, or at least subsets of them. This will usually be faster and simpler.
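
As one possible illustration of the vectorized style described above, the divisor itself can be built as a vector with ifelse() and applied to all word columns in one step. This is a sketch under assumptions: the data below is made up, and word_cols and divisor are names introduced here only for the example.

```r
# Made-up example data using the column names from the question
PronounData <- data.frame(
  ParticipantID = 1:6,
  Condition = rep(1:3, 2),
  I_Words = c(15, 10, 20, 30, 5, 0),
  YOU_Words = c(45, 20, 10, 0, 15, 30),
  WE_Words = c(3, 1, 2, 6, 4, 5)
)

word_cols <- c("I_Words", "YOU_Words", "WE_Words")

# One divisor per row: 15 for condition 1, 10 for conditions 2 and 3
divisor <- ifelse(PronounData$Condition == 1, 15, 10)

# Dividing a data frame by a vector of length nrow() divides each row
# by its own divisor; the results become the new *_PPM columns
PronounData[paste0(word_cols, "_PPM")] <- PronounData[word_cols] / divisor
PronounData
```

This avoids writing one assignment per condition, at the cost of hiding the per-condition logic inside the ifelse() call.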

How can I get R to format diff(x) as rows?

I would like the output of the code below to express diff(x) of a series of numbers in rows, not columns. It currently calculates diff(x) of the rows of a data frame, which is what I want, so the values are correct but they are formatted in columns. This is some example code that generates diff(x) of some series of numbers:
x <- c(19, 26, 39)
y <- c(34, 47, 51)
z <- c(45,50,60)
B <- data.frame(x, y, z)
B
f1 = function(x){return(diff(x))}
apply(B,1,f1)
>[,1] [,2] [,3]
y 15 21 12
z 11 3 9
#this seems to give diff(x) as columns
#want as rows, i.e. the transpose:
15 11
21 3
12 9
Many thanks

Maybe you can try
Bout <- B[-1]- B[-ncol(B)]
such that
> Bout
y z
1 15 11
2 21 3
3 12 9
or
Bout <- t(diff(t(B)))
such that
> Bout
y z
[1,] 15 11
[2,] 21 3
[3,] 12 9
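
For completeness, the original apply() call can also be kept and simply transposed. When the function returns a vector, apply() places each result in a column, which is why the row-wise differences came out sideways. A minimal sketch using the data from the question (res is a name chosen here for illustration):

```r
x <- c(19, 26, 39)
y <- c(34, 47, 51)
z <- c(45, 50, 60)
B <- data.frame(x, y, z)

# apply() over rows returns one column per row of B, so transpose the result
res <- t(apply(B, 1, diff))
res
```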

From a sequence of numbers, how do I find an immediate smaller (and an immediate bigger) number than a particular random number, In R?

I have 10 increasing sequences of numbers, each of which looks like (say) x(i) <- c(2, 3, 5, 6, 8, 10, 11, 17) for i ranging from 1 to 10, and I have a randomly sampled number, say p=9.
Now for each sequence x(i), I need to find the number immediately smaller than p and immediately bigger than p, and then for each i (from 1 to 10) , I need to take the difference of these two numbers and store them in a string.
For the x(i) that I have given here, the immediate smaller number than p=9 would be 8 and the immediate bigger number than p=9 would be 10, the difference of these would be (10-8)=2.
I am trying to get a code that would create a string of these differences, where first number of the string would mean the difference for i=1, second number would mean the difference for i=2 and so on. The string would have i numbers.
I am relatively new to R, so anywhere connected to loops throws me off a little bit. Any help would be appreciated. Thanks.
EDIT: I am putting the code I am working with for clarification.
fr = 100
dt = 1/1000 #dt = 1 millisecond, expressed in seconds
duration = 2 #no of duration in s
nBins = 2000 #SpikeTrain
nTrials = 20 #NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr= 100) {
p = runif(nBins)
q = ifelse(p < fr*dt, 1, 0)
return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
Spike_times <- function(i) {
c(dt*which( SpikeMat[i, ]==1))}
set.seed(4)
RT <- runif(1, 0 , 2)
for (i in 1:nTrials){
The explanation for this code is given in my previous question. I have 20 strings (one per trial, nTrials), named Spike_times(i) here. Each Spike_times(i) is a string of time stamps between 0 and 2 seconds at which spikes occurred, and they have different numbers of entries. I also have a random time sample, RT, which is a random number between 0 and 2 seconds. Say RT is 1.17 seconds and each Spike_times(i) is a sequence of increasing time stamps between 0 and 2 seconds.
Let me give you an example, Spike_times(3) looks like 0.003 0.015 0.017 ... 1.169 1.176 1.189 ... 1.985 1.990 1.997 then I need a code that picks out 1.169 and 1.176 and gives me the difference of these entries 0.007 and stores it in another string say W as the third entry c(_, _, 0.007, ...) and does this for all 20 strings Spike_times(i) and gives me W with 20 entries.
I hope my question is clear enough. Please let me know if I need to correct something.

This approach should do what you want. I am making a function that extracts the desired result from a single sequence and then applying it to each sequence. I am assuming here that your sequences are row-vectors and are stacked in a matrix. If your actual data structure is different the code can be adapted, but you need to indicate how your sequences are actually stored.
x <- matrix(rep(c(2,3,5,6,8,10,11,17), 10), nrow=10, byrow = TRUE)
x
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 2 3 5 6 8 10 11 17
#> [2,] 2 3 5 6 8 10 11 17
#> [3,] 2 3 5 6 8 10 11 17
#> [4,] 2 3 5 6 8 10 11 17
#> [5,] 2 3 5 6 8 10 11 17
#> [6,] 2 3 5 6 8 10 11 17
#> [7,] 2 3 5 6 8 10 11 17
#> [8,] 2 3 5 6 8 10 11 17
#> [9,] 2 3 5 6 8 10 11 17
#> [10,] 2 3 5 6 8 10 11 17
set.seed(123)
p = sample(10, 1)
# write a function to do what you want on one sequence:
# NOTE: If p appears in the sequence I assume you want the
# closest numbers not equal to p! If you want the closest
# numbers to p including p itself change the less than/
# greater than to <= / >=
get_l_r_diff <- function(row, p) {
temp <- row - p
lower <- max(row[temp < 0])
upper <- min(row[temp > 0])
upper - lower
}
apply(x, 1, function(row)get_l_r_diff(row, p))
#> [1] 3 3 3 3 3 3 3 3 3 3
apply(x, 1, function(row) get_l_r_diff(row, 9))
#> [1] 2 2 2 2 2 2 2 2 2 2
# if the result really needs to be a string
paste(apply(x, 1, function(row) get_l_r_diff(row, 9)), collapse = "")
#> [1] "2222222222"
For your case you can just apply the two functions to your indices:
spikes <- sapply(1:20, function(i){get_l_r_diff(Spike_times(i), RT)})

By making a small change to your Spike_times function you can do this with sapply returning a vector of all calculated values
Spike_times <- function(i) {
x <- c(dt*which( SpikeMat[i, ]==1))
min(x[x > RT]) - max(x[x < RT])
}
set.seed(4)
RT <- runif(1, 0 , 2)
results <- sapply(1:20, Spike_times)
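
Both approaches above call max() or min() on a subset that can be empty, for example when RT falls before the first spike or after the last one; in that case R returns -Inf or Inf with a warning. One way to handle that, sketched here with findInterval(), is to return NA instead. gap_around is a hypothetical name, the input is assumed to be sorted in increasing order, and unlike get_l_r_diff this version treats a value equal to p as the lower neighbour.

```r
# Hypothetical helper: gap between the neighbours of p in a sorted vector x.
# Returns NA when p has no neighbour on one side.
gap_around <- function(x, p) {
  i <- findInterval(p, x)  # number of elements of x that are <= p
  if (i == 0 || i == length(x)) {
    return(NA_real_)       # p lies before the first or at/after the last element
  }
  x[i + 1] - x[i]
}

x <- c(2, 3, 5, 6, 8, 10, 11, 17)
gap_around(x, 9)   # neighbours are 8 and 10
gap_around(x, 1)   # no smaller element
gap_around(x, 20)  # no larger element
```

Applied to the spike trains, this could be used as sapply(1:20, function(i) gap_around(dt * which(SpikeMat[i, ] == 1), RT)), assuming the objects from the question exist.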

Calculate the Mean and STD sub-setting a dynamic set of columns

I want to calculate the mean and standard deviation over a dynamic subset of columns. Here is an example:
sales <- data.frame(ItemID=c("1A","1B","1C"),
Jul=c(0,1,5),
Aug=c(1,2,6),
Sep=c(0,3,7),
Oct=c(1,4,8),
Nov=c(1,4,8),
Dec=c(1,4,8),
Jan=c(1,4,8),
Nmon=c(7,4,6))
The above test data gives the table below. What I need is to apply functions to each row based on the value in the "Nmon" column.
ItemID Jul Aug Sep Oct Nov Dec Jan Nmon
1 1A 0 1 0 1 1 1 1 7
2 1B 1 2 3 4 4 4 4 4
3 1C 5 6 7 8 8 8 8 6
For example, the first record has an Nmon value of 7, so I need to calculate the mean and standard deviation of all the values from Jul to Jan (mean = 0.71, std = 0.49).
For the second record, where the Nmon value is 4, the mean and standard deviation should be calculated only for the months from Oct to Jan (mean = 4, std = 0).
The number of month columns may increase or decrease, but the first (ItemID) and last (Nmon) columns will remain the same.
I have a large data set of items and need an efficient way to do these calculations.

Perhaps this helps
t(apply(sales[-1], 1, function(x) {i1 <- length(x)
x2 <- x[(i1 -x[i1]):(i1-1)]
c(mean = mean(x2), sd = sd(x2))}))
# mean sd
#[1,] 0.7142857 0.48795
#[2,] 4.0000000 0.00000
#[3,] 7.5000000 0.83666

Here is another solution with base R:
sales <- data.frame(ItemID=c("1A","1B","1C"),
Jul=c(0,1,5),
Aug=c(1,2,6),
Sep=c(0,3,7),
Oct=c(1,4,8),
Nov=c(1,4,8),
Dec=c(1,4,8),
Jan=c(1,4,8),
Nmon=c(7,4,6))
my.m.sd <- function(i) {
n <- sales$Nmon[i]
x <- unlist(sales[i, seq(to=8, length.out = n)])
c(m=mean(x), s=sd(x))
}
sapply(1:3, my.m.sd)
# > sapply(1:3, my.m.sd)
# [,1] [,2] [,3]
# m 0.7142857 4 7.50000
# s 0.4879500 0 0.83666
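
Since the question mentions a large data set, a fully vectorized variant may be worth trying: mask the months that fall outside each row's window with NA and then use rowMeans() and rowSums(). This is a sketch, not a benchmarked solution; m, keep, means and sds are names introduced here, and it assumes the month columns are everything between the first and last column.

```r
sales <- data.frame(ItemID = c("1A", "1B", "1C"),
                    Jul = c(0, 1, 5), Aug = c(1, 2, 6), Sep = c(0, 3, 7),
                    Oct = c(1, 4, 8), Nov = c(1, 4, 8), Dec = c(1, 4, 8),
                    Jan = c(1, 4, 8), Nmon = c(7, 4, 6))

# Month columns only (assumed: all columns except the first and the last)
m <- as.matrix(sales[, -c(1, ncol(sales))])
k <- ncol(m)

# Keep only the last Nmon months of each row; everything else becomes NA
keep <- col(m) > (k - sales$Nmon)
m[!keep] <- NA

means <- rowMeans(m, na.rm = TRUE)
# Sample standard deviation: squared deviations summed per row, divided by n - 1
sds <- sqrt(rowSums((m - means)^2, na.rm = TRUE) / (sales$Nmon - 1))

data.frame(ItemID = sales$ItemID, mean = means, sd = sds)
```

A row with Nmon equal to 1 would give a division by zero here, just as sd() of a single value returns NA, so such rows would need separate handling.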

How to select/find coordinates within a distance from a list (X/Y) using R

I have a data frame with list of X/Y locations (>2000 rows). What I want is to select or find all the rows/locations based on a max distance. For example, from the data frame select all the locations that are between 1-100 km from each other. Any suggestions on how to do this?

You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix:
# Assuming thresh is your threshold
thresh <- 10
# create some sample data
library(data.table)
set.seed(123)
DT <- data.table(X=sample(-10:10, 5, TRUE), Y=sample(-10:10, 5, TRUE))
# create the distance matrix (createTable() and distance() are defined at the end of this answer)
distTable <- matrix(apply(createTable(DT), 1, distance), nrow=nrow(DT))
# remove the lower.triangle since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA
# Show which rows are above the threshold
pairedRows <- which(distTable >= thresh, arr.ind=TRUE)
colnames(pairedRows) <- c("RowA", "RowB") # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix
# pair-up all of the rows
createTable <- function(DT)
expand.grid(apply(DT, 1, list), apply(DT, 1, list))
# simple cartesian/pythagorean distance
distance <- function(CoordPair)
sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))
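
As an alternative sketch, base R's dist() computes all pairwise Euclidean distances directly, so the two helper functions are not strictly needed. The coordinates below are copied from the DT printed above; coords, d and close_pairs are names introduced here, and this version selects pairs at or below the threshold, which is what the question asks for (the code above lists the pairs at or above it).

```r
thresh <- 10

# Coordinates copied from the example table above
coords <- data.frame(X = c(-4, 6, -2, 8, 9), Y = c(-10, 1, 8, 1, -1))

# Full symmetric matrix of Euclidean distances between rows
d <- as.matrix(dist(coords))

# Drop the lower triangle and the diagonal so each pair appears once
d[lower.tri(d, diag = TRUE)] <- NA

# Pairs of rows that lie within the threshold of each other
close_pairs <- which(d <= thresh, arr.ind = TRUE)
colnames(close_pairs) <- c("RowA", "RowB")
close_pairs
```

For real geographic coordinates in kilometres, a great-circle distance would be needed instead of the plain Euclidean distance used in this sketch.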

I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
"x" = x,
"y" = y)
# Loop through all rows
lapply(1:nrow(coords), function(i) {
dis <- sqrt(
(coords[i,"x"] - coords[, "x"])^2 + # insert your preferred
(coords[i,"y"] - coords[, "y"])^2 # distance calculation here
)
names(dis) <- 1:nrow(coords) # replace this part with an index or
# row names if you have them
dis[dis > 0 & dis <= 100] # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930