Calculate the mean and standard deviation over a dynamic subset of columns.
To show it with an example:
sales <- data.frame(ItemID = c("1A","1B","1C"),
                    Jul = c(0,1,5),
                    Aug = c(1,2,6),
                    Sep = c(0,3,7),
                    Oct = c(1,4,8),
                    Nov = c(1,4,8),
                    Dec = c(1,4,8),
                    Jan = c(1,4,8),
                    Nmon = c(7,4,6))
The above test data gives the table below. What I need is to apply functions to each row based on the value in the "Nmon" column.
ItemID Jul Aug Sep Oct Nov Dec Jan Nmon
1 1A 0 1 0 1 1 1 1 7
2 1B 1 2 3 4 4 4 4 4
3 1C 5 6 7 8 8 8 8 6
e.g. the first record has an Nmon value of 7, so I need to calculate the mean and standard deviation of all the values from Jul to Jan (mean = 0.71, sd = 0.49).
In the case of the second record, where the Nmon value is 4, the mean and standard deviation should only be calculated over the months Oct-Jan (mean = 4, sd = 0).
Here the number of months may increase or decrease, but the first (ItemID) and last (Nmon) columns will remain the same.
I have a large data set of items and need an efficient way to do these calculations.
Perhaps this helps:
t(apply(sales[-1], 1, function(x) {
  i1 <- length(x)                  # position of Nmon (the last element)
  x2 <- x[(i1 - x[i1]):(i1 - 1)]   # the last Nmon month values, ending at Jan
  c(mean = mean(x2), sd = sd(x2))
}))
# mean sd
#[1,] 0.7142857 0.48795
#[2,] 4.0000000 0.00000
#[3,] 7.5000000 0.83666
Here is another solution with base R, using the same sales data as defined in the question:
my.m.sd <- function(i) {
  n <- sales$Nmon[i]
  # columns 2..8 hold Jul..Jan; take the last n of them, ending at column 8 (Jan)
  x <- unlist(sales[i, seq(to = 8, length.out = n)])
  c(m = mean(x), s = sd(x))
}
sapply(1:3, my.m.sd)
# [,1] [,2] [,3]
# m 0.7142857 4 7.50000
# s 0.4879500 0 0.83666
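Since the question asks for efficiency on a large data set, a fully vectorized sketch may also help; it assumes the month columns always sit in columns 2 through 8 and derives the row-wise sd from the identity var = (sum(x^2) - n*mean^2)/(n - 1):
# a vectorized sketch, assuming the months are columns 2:8
m <- as.matrix(sales[, 2:8])
k <- ncol(m)
n <- sales$Nmon
# mask is TRUE for the last n months of each row
mask <- outer(n, seq_len(k), function(ni, j) j > k - ni)
means <- rowSums(m * mask) / n
sds <- sqrt((rowSums(m^2 * mask) - n * means^2) / (n - 1))
cbind(mean = means, sd = sds)
#           mean        sd
# [1,] 0.7142857 0.4879500
# [2,] 4.0000000 0.0000000
# [3,] 7.5000000 0.8366600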
I have a dataset that has 3 different conditions. Data within condition 1 needs to be divided by 15; data within conditions 2 and 3 needs to be divided by 10. I tried a for() loop to create separate datasets for each condition and then merge the two groups (group 1 is composed of condition 1, group 2 is composed of conditions 2 and 3). This is what I have so far for condition 1. Is there an easier way to do this that does not require creating subgroups?
Group1 <- NULL
for (val in ParticipantID) {
  ParticipantID_subset_Group1 <- subset(PronounData, ParticipantID == val & Condition == "1")
  I_Words_PPM <- ParticipantID_subset_Group1$I_Words / 15
  YOU_Words_PPM <- ParticipantID_subset_Group1$YOU_Words / 15
  WE_Words_PPM <- ParticipantID_subset_Group1$WE_Words / 15
  df <- data.frame(val, Group, I_Words_PPM, YOU_Words_PPM, WE_Words_PPM)
  Group1 <- rbind(Group1, df)
}
dim(Group1)
colnames(Group1) <- c("ParticipantID", "Condition", "I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")
View(Group1)
Couldn't fully test this solution without example data, but this should do what you want:
# make some fake data
PronounData <- data.frame(
  ParticipantID = 1:9,
  Condition = rep(1:3, 3),
  I_Words = sample(0:20, 9, replace = TRUE),
  YOU_Words = sample(0:40, 9, replace = TRUE),
  WE_Words = sample(0:10, 9, replace = TRUE)
)
# if Condition 1, divide by 15
PronounData[PronounData$Condition == 1, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
  PronounData[PronounData$Condition == 1, c("I_Words", "YOU_Words", "WE_Words")] / 15
# if Condition 2 or 3, divide by 10
PronounData[PronounData$Condition %in% 2:3, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
  PronounData[PronounData$Condition %in% 2:3, c("I_Words", "YOU_Words", "WE_Words")] / 10
# result
PronounData
# ParticipantID Condition I_Words YOU_Words WE_Words I_Words_PPM YOU_Words_PPM WE_Words_PPM
# 1 1 1 17 40 6 1.1333 2.6667 0.4000
# 2 2 2 14 1 6 1.4000 0.1000 0.6000
# 3 3 3 2 34 8 0.2000 3.4000 0.8000
# 4 4 1 0 33 1 0.0000 2.2000 0.0667
# 5 5 2 4 15 0 0.4000 1.5000 0.0000
# 6 6 3 1 7 6 0.1000 0.7000 0.6000
# 7 7 1 6 10 1 0.4000 0.6667 0.0667
# 8 8 2 1 33 9 0.1000 3.3000 0.9000
# 9 9 3 9 40 0 0.9000 4.0000 0.0000
NB, R is built on vectorized operations, so looping through each row is rarely the best solution. Instead, you generally want to find a way of modifying whole vectors/columns at once, or at least subsets of them. This will usually be faster and simpler.
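For instance, the two conditional assignments above could be collapsed into a single vectorized step with a per-row divisor (a minimal sketch, assuming the same column names):
# build a per-row divisor, then divide all three columns at once
divisor <- ifelse(PronounData$Condition == 1, 15, 10)
PronounData[c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
  PronounData[c("I_Words", "YOU_Words", "WE_Words")] / divisor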
So I have 10 increasing sequences of numbers, each of which looks like, say, x(i) <- c(2, 3, 5, 6, 8, 10, 11, 17) for i ranging from 1 to 10, and I have a random sampling number, say p=9.
Now for each sequence x(i), I need to find the number immediately smaller than p and the number immediately bigger than p, and then for each i (from 1 to 10) take the difference of these two numbers and store the differences in a string.
For the x(i) given here, the number immediately smaller than p=9 would be 8 and the number immediately bigger than p=9 would be 10, so the difference would be (10-8)=2.
I am trying to write code that would create a string of these differences, where the first number of the string would be the difference for i=1, the second number would be the difference for i=2, and so on. The string would have one number for each i.
I am relatively new to R, so anywhere connected to loops throws me off a little bit. Any help would be appreciated. Thanks.
EDIT: I am putting the code I am working with for clarification.
fr = 100
dt = 1/1000 # time step: 1 ms, expressed in seconds
duration = 2 # duration in s
nBins = 2000 # SpikeTrain
nTrials = 20 # NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr = 100) {
  p = runif(nBins)  # note: the p argument is overwritten here
  q = ifelse(p < fr*dt, 1, 0)
  return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
Spike_times <- function(i) {
  c(dt*which(SpikeMat[i, ] == 1))
}
set.seed(4)
RT <- runif(1, 0 , 2)
for (i in 1:nTrials){
The explanation for this code is given in my previous question. I have 20 (the number of trials, aka nTrials) strings named Spike_times(i) here. Each Spike_times(i) is a string of time stamps between 0 and 2 seconds where spikes occurred, and they have different numbers of entries. Now I have a random time sample in the form of RT, which is a random number between 0 and 2 seconds. Say RT is 1.17 seconds and the Spike_times(i) are sequences of increasing time stamps between 0 and 2 seconds.
Let me give you an example: Spike_times(3) looks like 0.003 0.015 0.017 ... 1.169 1.176 1.189 ... 1.985 1.990 1.997. I need code that picks out 1.169 and 1.176, takes the difference of these entries (0.007), and stores it in another string, say W, as the third entry c(_, _, 0.007, ...), and does this for all 20 strings Spike_times(i), giving me W with 20 entries.
I hope my question is clear enough. Please let me know if I need to correct something.
This approach should do what you want. I am making a function that extracts the desired result from a single sequence and then applying it to each sequence. I am assuming here that your sequences are row-vectors and are stacked in a matrix. If your actual data structure is different the code can be adapted, but you need to indicate how your sequences are actually stored.
x <- matrix(rep(c(2,3,5,6,8,10,11,17), 10), nrow=10, byrow = T)
x
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 2 3 5 6 8 10 11 17
#> [2,] 2 3 5 6 8 10 11 17
#> [3,] 2 3 5 6 8 10 11 17
#> [4,] 2 3 5 6 8 10 11 17
#> [5,] 2 3 5 6 8 10 11 17
#> [6,] 2 3 5 6 8 10 11 17
#> [7,] 2 3 5 6 8 10 11 17
#> [8,] 2 3 5 6 8 10 11 17
#> [9,] 2 3 5 6 8 10 11 17
#> [10,] 2 3 5 6 8 10 11 17
set.seed(123)
p = sample(10, 1)
# write a function to do what you want on one sequence:
# NOTE: If p appears in the sequence I assume you want the
# closest numbers not equal to p! If you want the closest
# numbers to p including p itself change the less than/
# greater than to <= / >=
get_l_r_diff <- function(row, p) {
temp <- row - p
lower <- max(row[temp < 0])
upper <- min(row[temp > 0])
upper - lower
}
apply(x, 1, function(row)get_l_r_diff(row, p))
#> [1] 3 3 3 3 3 3 3 3 3 3
apply(x, 1, function(row) get_l_r_diff(row, 9))
#> [1] 2 2 2 2 2 2 2 2 2 2
# if the result really needs to be a string
paste(apply(x, 1, function(row) get_l_r_diff(row, 9)), collapse = "")
#> [1] "2222222222"
For your case you can just apply the two functions to your indices:
spikes <- sapply(1:20, function(i){get_l_r_diff(Spike_times(i), RT)})
By making a small change to your Spike_times function, you can do this with sapply, returning a vector of all the calculated values:
Spike_times <- function(i) {
  x <- c(dt*which(SpikeMat[i, ] == 1))
  min(x[x > RT]) - max(x[x < RT])
}
set.seed(4)
RT <- runif(1, 0 , 2)
results <- sapply(1:20, Spike_times)
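As a side note, when each sequence is sorted increasing, base R's findInterval() locates the straddling pair directly; a minimal sketch on the matrix example above (assuming p itself never occurs in a sequence and lies inside its range):
gap_at <- function(row, p) {
  i <- findInterval(p, row)  # largest index i with row[i] < p under these assumptions
  row[i + 1] - row[i]
}
apply(x, 1, gap_at, p = 9)
#> [1] 2 2 2 2 2 2 2 2 2 2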
I've been breaking my head all morning over how to do this.
So let's say this is my data set:
set.seed(1)
temp <- as.data.frame(cbind(Key = letters[1:5], sapply(1:12, function(x) sample(c(0, 1), 5, replace = T))))
names(temp)[2:13] <- month.abb
temp
# Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 a 0 1 0 0 1 0 0 1 1 1 0 0
# 2 b 0 1 0 1 0 0 1 1 1 0 1 0
# 3 c 1 1 1 1 1 0 0 0 1 0 0 1
# 4 d 1 1 0 0 0 1 0 1 1 1 0 1
# 5 e 0 0 1 1 0 0 1 0 1 1 0 0
What I'm trying to do is to calculate the percentage of occurrences (1s) in two consecutive months.
For example, c and d had an occurrence in Jan. Both also had an occurrence in Feb, so the output will be 1 for this month. In Feb, a-d had occurrences but only c also had an occurrence in Mar, so the output will be .25 for that month, etc.
Desired output for that mini example:
data.frame(Month = month.abb[1:11], OverlapPercent = c(1, 1/4, 1, 1/3, 0, 0, 1/2, 1, 3/5, 0, 0))
# Month OverlapPercent
# 1 Jan 1.0000000
# 2 Feb 0.2500000
# 3 Mar 1.0000000
# 4 Apr 0.3333333
# 5 May 0.0000000
# 6 Jun 0.0000000
# 7 Jul 0.5000000
# 8 Aug 1.0000000
# 9 Sep 0.6000000
# 10 Oct 0.0000000
# 11 Nov 0.0000000
I was thinking of somehow using rle for it, but I am not sure how to force it to stop at two occurrences each time.
Unless I'm missing something, the following looks valid:
#just to remove 'factor's from "temp"
tmp = do.call(cbind.data.frame, c(temp[1], lapply(temp[-1], function(x) as.numeric(as.character(x)))))
sapply(head(seq_len(ncol(tmp))[-1], -1),
       function(i) sum(tmp[[i]] & tmp[[i+1]]) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
EDIT:
Out of curiosity I checked @Bathsheba's "bitwise AND" speed, and it seems to be faster than the logical AND:
#identical results
sapply(head(seq_len(ncol(tmp))[-1], -1),
       function(i) sum(bitwAnd(tmp[[i]], tmp[[i+1]])) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
#twice as fast
x1 = sample(0:1, 1e6, T); x2 = sample(0:1, 1e6, T)
identical(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1))
#[1] TRUE
library(microbenchmark)
microbenchmark(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1), times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# sum(x1 & x2)/sum(x1) 23.95648 25.32448 25.78471 26.56232 49.18491 50
# sum(bitwAnd(x1, x2))/sum(x1) 10.97982 11.07309 11.20237 13.00450 35.67963 50
First fix up temp so that the 0/1 columns are numeric rather than factor. Then apply overlap to each pair of columns:
temp[-1] <- lapply(temp[-1], function(x) as.numeric(as.character(x)))
overlap <- function(x, y) mean(y[x == 1])
data.frame(Month = month.abb[-12],
           Overlap = sapply(2:12, function(i) overlap(temp[,i], temp[,i+1])))
The above is preferred as it keeps the independent parts of the solution separate; however, as an alternative we could omit the first line above (which fixes up the factors) and instead incorporate that into overlap like this:
overlap <- function(x, y) mean(as.numeric(as.character(y))[x == 1])
Note that the Overlaps are fractions (as per the output shown in the question) and not percents as the heading in the question suggests.
In pseudocode, represent each column as a binary number.
E.g. Jan = 0b00110 and Feb = 0b11110.
Your formula for Jan is then
Bitcount(Jan AND Feb) / Bitcount(Jan)
Where AND is the bitwise AND operator and Bitcount counts the number of 1 bits in the number. (I can supply a way of bit counting if you need it). Of course, the formula for other months is a trivial generalisation.
Obviously you'll need a branch for the denominator being zero: not well defined in your question.
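For concreteness, here is a hedged R sketch of the pseudocode, treating the 0/1 columns directly as bit vectors, so that Bitcount is just sum() and bitwAnd() supplies the AND (the example columns are made up):
jan <- c(0, 0, 1, 1, 0)  # hypothetical Jan column
feb <- c(1, 1, 1, 1, 0)  # hypothetical Feb column
sum(bitwAnd(jan, feb)) / sum(jan)
# [1] 1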
length(which(!xor(data["Feb"],data["Mar"]) & data["Feb"]==1)) / length(which(data["Feb"]==1))
!xor is the negated exclusive or.
length(which(...)) gives the number of true values in a logical vector.
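Equivalently, since sum() on a logical vector counts its TRUE values, the same quantity can be written a bit more compactly (a small sketch against the same hypothetical data frame):
sum(data["Feb"] == 1 & data["Mar"] == 1) / sum(data["Feb"] == 1)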
Suppose I want to generate bins for the range 1 to 20:
round(seq(1,20,length.out=5))
the output is
1 6 10 15 20
I want to form a data.frame as
[,1] [,2]
[1,] 1 6
[2,] 7 10
[3,] 11 15
[4,] 16 20
so the start will be 1,7, 11, 16, and ends are 6, 10, 15, 20, respectively.
Any solution for this?
x = round(seq(1,20,length.out=5))
df = data.frame(a = c(x[1], head(x[-1],-1) + 1), b = x[-1])
df
# a b
#1 1 6
#2 7 10
#3 11 15
#4 16 20
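As a side note, embed() pairs consecutive breakpoints and gives the same result (a small sketch):
x <- round(seq(1, 20, length.out = 5))
bounds <- embed(x, 2)[, 2:1]        # rows of (start, end) from consecutive breaks
bounds[-1, 1] <- bounds[-1, 1] + 1  # shift starts so the intervals don't overlap
bounds
#      [,1] [,2]
# [1,]    1    6
# [2,]    7   10
# [3,]   11   15
# [4,]   16   20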
I am not sure if you are looking for the following solution. If you are, you can use the cut and sub functions as in my earlier post:
mydata <- round(seq(1, 20, length.out = 5))
mydata <- as.data.frame(mydata)
names(mydata) <- "V"  # name the column as V
mydata$V1 <- cut(mydata$V, 5)  # break the data into five intervals, stored in col V1
mydata$lower <- with(mydata, as.numeric(sub("\\((.+),.*", "\\1", V1)))  # extract lower value
mydata$upper <- with(mydata, as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", V1)))  # extract upper value
myfinaldata <- mydata[, c("lower", "upper")]  # create data frame of lower and upper values
> myfinaldata
lower upper
1 0.981 4.79
2 4.790 8.60
3 8.600 12.40
4 12.400 16.20
5 16.200 20.00
Note: although these look like overlapping intervals, they are not. Since cut defaults to right-closed intervals, the first row covers all data > 0.981 and <= 4.79, whereas the second row covers data > 4.79 and <= 8.60.
I have a data frame with list of X/Y locations (>2000 rows). What I want is to select or find all the rows/locations based on a max distance. For example, from the data frame select all the locations that are between 1-100 km from each other. Any suggestions on how to do this?
You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix:
# Assuming thresh is your threshold
thresh <- 10
# create some sample data
library(data.table)
set.seed(123)
DT <- data.table(X = sample(-10:10, 5, TRUE), Y = sample(-10:10, 5, TRUE))
# create the distance matrix
distTable <- matrix(apply(createTable(DT), 1, distance), nrow=nrow(DT))
# remove the lower.triangle since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA
# Show which rows are above the threshold
pairedRows <- which(distTable >= thresh, arr.ind=TRUE)
colnames(pairedRows) <- c("RowA", "RowB") # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix
# pair-up all of the rows
createTable <- function(DT)
  expand.grid(apply(DT, 1, list), apply(DT, 1, list))
# simple cartesian/pythagorean distance
distance <- function(CoordPair)
  sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))
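As a side note, base R's dist() computes the same pairwise Euclidean distances without the helper functions; a minimal sketch reusing DT and thresh from above:
# dist() returns all pairwise Euclidean distances between rows
distTable2 <- as.matrix(dist(DT))
# keep only the upper triangle to avoid duplicate pairs
distTable2[lower.tri(distTable2, diag = TRUE)] <- NA
which(distTable2 >= thresh, arr.ind = TRUE)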
I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
  "x" = x,
  "y" = y)
# Loop through all rows
lapply(1:nrow(coords), function(i) {
dis <- sqrt(
(coords[i,"x"] - coords[, "x"])^2 + # insert your preferred
(coords[i,"y"] - coords[, "y"])^2 # distance calculation here
)
names(dis) <- 1:nrow(coords) # replace this part with an index or
# row names if you have them
dis[dis > 0 & dis <= 100] # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930