I have a data frame, simplified as follows:
Day Place dendrometer max
1 1 1 4684
2 1 1 4831
1 1 2 2486
2 1 2 2596
1 2 1 6987
2 2 1 6824
I need the first element of each dendrometer to be NA, so that every time R calculates "max" for a new dendrometer (in any place), it starts with NA, like this:
Day Place dendrometer max
1 1 1 NA
2 1 1 4831
1 1 2 NA
2 1 2 2596
1 2 1 NA
2 2 1 6824
Could you also let me know how I could calculate the MEAN of the max column for each dendrometer within each place (sapply, aggregate?) instead of calculating the mean of the entire max column?
NOTE: dendro 1 in place 1 is different from dendro 1 in place 2; I need separate results for each of them.
library(data.table)
myDat <- data.table(myDat, key="Day")
# using the `mult` argument, set the first instance of each Day to NA
myDat[.(unique(myDat$Day)), dendrometer := NA, mult="first"]
# add mean
myDat[, mean := mean(dendrometer, na.rm=TRUE), by=Day]
# add max
myDat[, max := max(dendrometer, na.rm=TRUE), by=Day]
Results:
> myDat
Day Place dendrometer mean max
1: 1 1 NA 3304.333 4831
2: 1 1 4831 3304.333 4831
3: 1 2 2486 3304.333 4831
4: 1 2 2596 3304.333 4831
5: 2 1 NA 6824.000 6824
6: 2 1 6824 6824.000 6824
Sample Data Used:
read.table(text=
"Day Place dendrometer
1 1 4684
1 1 4831
1 2 2486
1 2 2596
2 1 6987
2 1 6824", header=TRUE, stringsAsFactors=FALSE) -> myDat
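The answer above groups by Day. If, as the question asks, the NA should instead go on the first row of each dendrometer within each place, a minimal sketch on the question's original four-column data could look like this (column names assumed from the question):
library(data.table)
dat <- data.table(Day         = c(1, 2, 1, 2, 1, 2),
                  Place       = c(1, 1, 1, 1, 2, 2),
                  dendrometer = c(1, 1, 2, 2, 1, 1),
                  max         = c(4684, 4831, 2486, 2596, 6987, 6824))
# blank the first row of each Place/dendrometer group
dat[, max := replace(max, 1L, NA), by = .(Place, dendrometer)]
# per-group mean of the remaining (non-NA) max values
dat[, mean := mean(max, na.rm = TRUE), by = .(Place, dendrometer)]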
Do you always have only two measurements of one dendrometer in one place? If so, then you could just set every other value as NA:
# x is your data.frame
x <- read.table("clipboard", header=TRUE)
x[seq(1, nrow(x), by=2), 4] <- NA
and the max values are the non-NA values
x[seq(2, nrow(x), by=2), 4]
If your data is more complicated, this should work:
dup <- duplicated(x[, 2:3])  # find the non-unique Place/dendrometer cases
x[!dup, 4] <- NA             # set the first measurement of each group to NA
tapply(x[dup, 4], interaction(x[dup, 2], x[dup, 3], drop=TRUE), max)  # compute the max from the others
Note that for computing the mean you do not need to set the first measurements to NA.
First, the mean of max for each dendrometer can be calculated with ave, grouping on changes in the dendrometer column:
dat <- transform(dat,
    mean = ave(max, c(0, cumsum(abs(diff(dendrometer)))), FUN = mean))
  Day Place dendrometer  max   mean
1   1     1           1 4684 4757.5
2   2     1           1 4831 4757.5
3   1     1           2 2486 2541.0
4   2     1           2 2596 2541.0
5   1     2           1 6987 6905.5
6   2     2           1 6824 6905.5
You can then use the diff function to find changes in dendrometer, and the is.na<- replacement function to set the corresponding values of max to NA:
is.na(dat$max) <- c(TRUE, diff(dat$dendrometer) != 0)
  Day Place dendrometer  max   mean
1   1     1           1   NA 4757.5
2   2     1           1 4831 4757.5
3   1     1           2   NA 2541.0
4   2     1           2 2596 2541.0
5   1     2           1   NA 6905.5
6   2     2           1 6824 6905.5
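One caveat: diff(dendrometer) only detects a group change when adjacent groups carry different dendrometer numbers. Grouping on the Place/dendrometer pair directly avoids that assumption; a sketch using the same dat:
grp <- interaction(dat$Place, dat$dendrometer, drop = TRUE)
dat$mean <- ave(dat$max, grp, FUN = mean)  # per-group mean of max
is.na(dat$max) <- !duplicated(grp)         # NA out the first row of each group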
Related
I am a newbie to R and I have a data frame which contains the following fields:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
My aim is to calculate the time range in which 75% of the vehicles cross each place. From this data frame I generate the one below:
day place hour time_spent count cum_count percentage
1 1 1 1 120 120 30.7%
1 1 1 2 100 220 56.4%
1 1 1 3 90 310 79%
1 1 1 4 80 390 100%
using this code:
df$cum_count <- cumsum(df$count)
df$percentage <- 100 * df$cum_count / sum(df$count)
for (i in 1:length(df$percentage)) {
  if (df$percentage[i] > 75) {
    low_time  <- df$time_spent[i - 1]
    high_time <- df$time_spent[i]
    break
  }
}
This means that 75% of vehicles spend 2-3 minutes in place 1. But now I have a data frame like the one below, which covers all the places and all the days:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
1 2 1 1 220
1 2 1 2 100
1 2 1 3 90
1 2 1 4 80
1 3 1 1 100
1 3 1 2 80
1 3 1 3 90
1 3 1 4 100
2 1 1 1 120
2 1 1 2 100
2 1 1 3 90
2 1 1 4 80
2 2 1 1 220
2 2 1 2 100
2 2 1 3 90
2 2 1 4 80
2 3 1 1 100
2 3 1 2 80
2 3 1 3 90
2 3 1 4 100
How is it possible to calculate the high time and low time for each place? Any help is appreciated.
The max and min functions ought to do the trick here, although you could also use summary to get the median, mean, etc. in one go. I'd also recommend the quantile function for these percentages. As is usually the case with R, the tricky part is getting the data into the correct format.
Say you want the times spent at each place as one vector per place:
index <- sort(unique(df$place))
times <- vector("list", length(index))
names(times) <- index
for (ii in seq_along(index)) {
  sub <- df[df$place == index[ii], ]
  # repeat each time_spent value according to its vehicle count
  times[[ii]] <- rep(sub$time_spent, sub$count)
}
Now for each place you can compute the max and min with:
lapply(times, max)
lapply(times, min)
Similarly you can compute the mean:
lapply(times, function(x) sum(x)/length(x))
lapply(times, mean)
I think what you want are the quantiles:
lapply(times, quantile, 0.75)
This would be the time by which at least 75% of vehicles had passed through a place, i.e., 75% of vehicles took this time or less to pass through.
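For what it's worth, a compact equivalent of the expansion loop above, assuming the question's column names: rep() expands each time_spent value by its count, and split() groups the result by place.
times <- with(df, split(rep(time_spent, count), rep(place, count)))
sapply(times, quantile, probs = 0.75)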
We can use a group-by operation:
library(dplyr)
df %>%
  group_by(day, place) %>%
  mutate(cum_count = cumsum(count),
         percentage = 100 * cum_count / sum(count),
         low_time = time_spent[which.max(percentage > 75) - 1],
         high_time = time_spent[which.max(percentage > 75)])
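In case the which.max trick is unclear: on a logical vector it returns the position of the first TRUE, so it selects the first row whose cumulative percentage passes 75%. For example:
which.max(c(FALSE, FALSE, TRUE, TRUE))  # returns 3, the first TRUE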
If I understood your question correctly (you want the min and max values of time_spent in each place):
df %>%
group_by(place) %>%
summarise(min(time_spent),
max(time_spent))
will give you this:
place min(time_spent) max(time_spent)
    1               1               4
    2               1               4
    3               1               4
I have the following data frame:
dt<-data.table(Game=c(rep(1,9),rep(2,3)),
Round=rep(1:3,4),
Participant=rep(1:4,each=3),
Left_Choice=c(1,0,0,1,1,0,0,0,1,1,1,1),
Total_Points=c(5,15,12,16,83,7,4,8,23,6,9,14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First of all, for each of the participants in each of the games I need to calculate the mean "left choice rate".
After that I want to break the results into 5 groups (left choice < 20%, left choice between 20% and 40%, etc.).
For each group (in each of the games), I want to calculate the mean of Total_Points in the last round (round 3 in this simple example), using ONLY the value of round 3. So, for example, for participant 1 in game 1 the total points in round 3 are 12, and for participant 4 in game 2 they are 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1    <35%        17.5 = (12+23)/2
1    35%-70%     7
1    >70%        NA
2    <35%        NA
2    35%-70%     NA
2    >70%        14
Please help! :)
Working in data.table:
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut; it's not totally clear, but I think you want include.lowest=TRUE.
dt[, pct_grp := cut(pct_left, breaks=seq(0, 1, by=.2), include.lowest=TRUE)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(If you just want the reduced table, use .(end_mean=mean(Total_Points)) instead.)
You didn't make it clear whether there is a global maximum number of rounds (i.e., whether all games end after the same number of rounds); this was assumed above. You'll have to be clearer about this for an exact alternative, but I suggest starting by just defining the mean round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
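Putting the three steps together, a sketch of the full chain that returns the reduced table (still assuming all games end at the same maximum Round):
res <- dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)
        ][, pct_grp := cut(pct_left, breaks = seq(0, 1, by = .2), include.lowest = TRUE)
        ][Round == max(Round), .(end_mean = mean(Total_Points)), by = .(Game, pct_grp)]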
I am trying to number, in sequence, locations gathered within a certain time period (a new series starts whenever the time since the previous location exceeds 60 seconds). I've eliminated columns irrelevant to this question, so the example data looks like:
TimeSincePrev
1
1
1
1
511
1
2
286
1
My desired output looks like this:
TimeSincePrev NoInSeries
            1          1
            1          2
            1          3
            1          4
          511          1
            1          2
            2          3
          286          1
            1          2
...and so on for another 3500 lines
I have tried a couple of ways to approach this unsuccessfully:
First, I tried an ifelse, where I would make NoInSeries 1 if TimeSincePrev was more than a minute, or else the previous row's value + 1. (In this case, I first insert a line-number column to help me reference the previous row, but I suspect there is an easier way to do this?)
df$NoInSeries <- ifelse(df$TimeSincePrev > 60, 1, df[df$LineNo - 1, "NoInSeries"] + 1)
I don't get any errors, but it only gives me the 1s where I want to restart sequences and does not fill in any of the other values:
TimeSincePrev NoInSeries
            1         NA
            1         NA
            1         NA
            1         NA
          511          1
            1         NA
            2         NA
          286          1
            1         NA
I assume this has something to do with trying to reference back to itself?
My other approach was to try to get it to do sequences of numbers (max 15), restarting every time there is a change in the TimeSincePrev value:
df$NoInSeries <- ave(df$TimeSincePrev, df$TimeSincePrev, FUN=function(y) 1:15)
I still get no errors but exactly the same output as before, with NAs in place and no other numbers filled in.
Thanks for any help!
Use ave after creating a grouping variable that detects series changes (with cumsum):
dt$NoInSeries <-
ave(dt$TimeSincePrev,
cumsum(dt$TimeSincePrev >60),
FUN=seq)
The result is:
dt
# TimeSincePrev NoInSeries
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 511 1
# 6 1 2
# 7 2 3
# 8 286 1
# 9 1 2
Step-by-step explanation:
## detect time change > 60 seconds
## group value by the time change
(gg <- cumsum(dt$TimeSincePrev >60))
[1] 0 0 0 0 1 1 1 2 2
## get the sequence by group
ave(dt$TimeSincePrev, gg, FUN=seq)
[1] 1 2 3 4 1 2 3 1 2
Using data.table
library(data.table)
setDT(dt)[,NoInSeries:=seq_len(.N), by=cumsum(TimeSincePrev >60)]
dt
# TimeSincePrev NoInSeries
#1: 1 1
#2: 1 2
#3: 1 3
#4: 1 4
#5: 511 1
#6: 1 2
#7: 2 3
#8: 286 1
#9: 1 2
Or
indx <- c(which(dt$TimeSincePrev >60)-1, nrow(dt))
sequence(c(indx[1], diff(indx)))
#[1] 1 2 3 4 1 2 3 1 2
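For comparison, the same cumsum grouping idea sketched in dplyr (using dt from the data section below; the temporary grp column name is my own):
library(dplyr)
dt %>%
  group_by(grp = cumsum(TimeSincePrev > 60)) %>%  # new group after each gap > 60s
  mutate(NoInSeries = row_number()) %>%
  ungroup() %>%
  select(-grp)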
data
dt <- data.frame(TimeSincePrev=c(1,1,1,1,511, 1,2, 286,1))
I have an R script that allows me to select a sample size and take fifty individual random samples with replacement. Below is an example of this code:
## Convert the data to a data.table
df = as.data.table(data)
## Select sample size
sample.size = 5
## Creates Sample 1 (Size 5)
Sample.1<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.1$Sample <- c("01")
According to the R script above, I first created a data frame. I then select my sample size, which in this case is 5. This represents just one sample. Due to my lack of experience with R, I repeat this code 49 more times. The last piece of code looks like this:
## Creates Sample 50 (Size 5)
Sample.50<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.50$Sample <- c("50")
The sample output would look something like this (Sample Range 1 - 50):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 8800 50
1 3800 50
1 10400 50
1 2200 50
1 29000 50
It should be noted that the variable 'Num' was created for grouping purposes and has little to no influence on my overall question (which is posted below).
Instead of repeating this code fifty times to get fifty individual samples (each of size 5), is there a loop I can create to reduce my code? I have recently been asked to create ten thousand random samples, each of size 5. I obviously cannot repeat this code ten thousand times, so I need some sort of loop.
A sample of my final output should look something like this (Sample Range 1 - 10,000):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 9900 10000
1 8300 10000
1 10700 10000
1 6800 10000
1 31000 10000
Thank you all in advance for your help, its greatly appreciated.
Here is some sample data if needed:
Num Dollars
1 31002
1 13728
1 23526
1 80068
1 86244
1 9330
1 27169
1 13694
1 4781
1 9742
1 20060
1 35230
1 15546
1 7618
1 21604
1 8738
1 5299
1 12081
1 7652
1 16779
A very simple method would be to use a for loop and store the results in a list:
lst <- list()
for (i in seq_len(3)) {
  # draw 5 rows with replacement and record which sample they belong to
  lst[[i]] <- df[sample(seq_len(nrow(df)), 5, replace = TRUE), ]
  lst[[i]]["Sample"] <- i
}
> lst
[[1]]
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
[[2]]
Num Dollars Sample
9 1 4781 2
13 1 15546 2
12 1 35230 2
17 1 5299 2
12.1 1 35230 2
[[3]]
Num Dollars Sample
1 1 31002 3
7 1 27169 3
17 1 5299 3
5 1 86244 3
6 1 9330 3
Then, to create a single data.frame, use do.call to rbind the list elements together:
do.call(rbind, lst)
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
9 1 4781 2
13 1 15546 2
121 1 35230 2
17 1 5299 2
12.1 1 35230 2
11 1 31002 3
7 1 27169 3
171 1 5299 3
5 1 86244 3
6 1 9330 3
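If the list grows large (e.g., 10,000 samples), data.table::rbindlist is typically much faster than do.call(rbind, ...); a possible drop-in, noting that it returns a data.table and drops the row names:
library(data.table)
result <- rbindlist(lst)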
It's worth noting that if you're sampling with replacement, then drawing 50 (or 10,000) samples of size 5 is equivalent to drawing one sample of size 250 (or 50,000). Thus I would do it like this (you'll see I stole a line from #beginneR's answer):
df = as.data.table(data)
## Select sample size
sample.size = 5
n.samples = 10000
# Sample and assign groups
draws <- df[sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE), ]
draws[, Sample := rep(1:n.samples, each = sample.size)]
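If the per-Num grouping from the original code matters, a sketch that keeps it (same assumptions as above):
# within each Num group, draw all samples at once, then label them 1..n.samples
samples <- df[, .(Dollars = Dollars[sample(.N, sample.size * n.samples, replace = TRUE)],
                  Sample  = rep(1:n.samples, each = sample.size)),
              by = Num]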
I have this data frame and I want to count the frequency (number of occurrences) of each unique value in a column.
userID bookmarkID tagID value
228 1 1 0.0005
255 1 1 0.0007
5 2 1 0.0068
66 2 1 0.0008
99 2 1 0.0006
206 2 1 0.0006
3 3 1 -0.0007
5 3 1 0.0633
7 3 1 -0.0012
For example, for the column bookmarkID, I want to get two vectors: one with the unique values [1, 2, 3], the other with the corresponding counts [2, 4, 3]. How can I do this?
I think you're looking for table and unique. Assuming your data.frame is df:
> table(df$bookmarkID)
1 2 3
2 4 3
> unique(df$bookmarkID)
[1] 1 2 3
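To get the two vectors in exactly the form the question asks for, a small sketch:
tab    <- table(df$bookmarkID)
values <- as.numeric(names(tab))  # unique values: 1 2 3
counts <- as.vector(tab)          # corresponding counts: 2 4 3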