How to calculate percentage in R?

I am a newbie to R and I have a data frame which contains the following fields:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
My aim is to calculate, for each place, the time spent by which 75% of the vehicles have crossed that place. From this data frame I generate the data frame below with:
day place hour time_spent count cum_count percentage
1 1 1 1 120 120 30.7%
1 1 1 2 100 220 56.4%
1 1 1 3 90 310 79%
1 1 1 4 80 390 100%
df$cum_count  <- cumsum(df$count)
df$percentage <- 100 * df$cum_count / sum(df$count)
for (i in seq_along(df$percentage)) {
  if (df$percentage[i] > 75) {
    low_time  <- df$time_spent[i - 1]
    high_time <- df$time_spent[i]
    break  # stop at the first time bin that crosses 75%
  }
}
This means that 75% of the vehicles spend 2-3 minutes in place 1. But now I have a data frame like the one below, covering all the places and all the days.
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
1 2 1 1 220
1 2 1 2 100
1 2 1 3 90
1 2 1 4 80
1 3 1 1 100
1 3 1 2 80
1 3 1 3 90
1 3 1 4 100
2 1 1 1 120
2 1 1 2 100
2 1 1 3 90
2 1 1 4 80
2 2 1 1 220
2 2 1 2 100
2 2 1 3 90
2 2 1 4 80
2 3 1 1 100
2 3 1 2 80
2 3 1 3 90
2 3 1 4 100
How can I calculate the high time and low time for each place and day? Any help is appreciated.

The max and min functions ought to do the trick here, although you could also use summary to get the median, mean, etc. in one go. I'd also recommend the quantile function for these percentages. As is usually the case with R, the tricky part is getting the data into the correct format.
Say you want the total time spent at each place:
index <- sort(unique(df$place))
times <- vector("list", length(index))
names(times) <- index
for (ii in index) {
  sub <- df[df$place == ii, ]  # rows for this place
  counter <- c()
  for (jj in seq_len(nrow(sub))) {
    # repeat each time_spent value once per vehicle counted
    counter <- c(counter, rep(sub$time_spent[jj], sub$count[jj]))
  }
  times[[as.character(ii)]] <- counter
}
Now for each place you can compute the max and min with:
lapply(times, max)
lapply(times, min)
Similarly you can compute the mean:
lapply(times, function(x) sum(x)/length(x))
lapply(times, mean)
I think what you want are the quantiles:
lapply(times, quantile, 0.75)
This gives the time by which at least 75% of vehicles had passed through a place, i.e., 75% of vehicles took this time or less to pass through.
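If you'd rather avoid the nested loops, here is a hedged vectorized sketch of the same expansion (assuming the place, time_spent and count columns from the question): repeat each time_spent value count times, split by place, then take the 75% quantile per place.
# vectorized version of the expansion above
times <- split(rep(df$time_spent, df$count), rep(df$place, df$count))
sapply(times, quantile, probs = 0.75)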

We can use a group-by operation:
library(dplyr)
dfN %>%
  group_by(day, place) %>%
  mutate(cum_count  = cumsum(count),
         percentage = 100 * cum_count / sum(count),
         low_time   = time_spent[which.max(percentage > 75) - 1],
         high_time  = time_spent[which.max(percentage > 75)])
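If one summary row per day/place is preferred over annotating every row, here is a hedged variant using summarise (same assumed column names; it also assumes at least one cumulative bin falls at or below the 75% mark in each group):
dfN %>%
  group_by(day, place) %>%
  summarise(low_time  = time_spent[max(which(cumsum(count) / sum(count) <= 0.75))],
            high_time = time_spent[min(which(cumsum(count) / sum(count) > 0.75))],
            .groups = "drop")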

If I understood your question correctly (you want the min and max value of time_spent in a place):
df %>%
  group_by(place) %>%
  summarise(min(time_spent),
            max(time_spent))
will give you this:
place min(time_spent) max(time_spent)
1 1 4
2 1 4
3 1 4

Related

Create a new dataframe in R resulting from comparison of differently ordered columns from two other databases with different lengths

I have these two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables have the same "Species_name" and "Biom" values (areas of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, "Species_name" does not always appear in both tables (e.g. Abrocomo_ben); furthermore, the two tables do not always have the same combinations of "Species_name" and "Biom" (combinations are simply populations of the same species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1        1        13         10   Abrocomo_ben 0.076923
1        1        8          1    Abrocomo_cin 0.125000
1        1        30         10   Abrocomo_cin 0.033333
1        2        10         1    Abrothrix_an 0.200000
1        1        44         10   Abrothrix_an 0.022727
1        3        6          2    Abrothrix_je 0.500000
1        1        7          12   Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1      1        8          1    Abrocomo_cin 0.125000
1      5        30         10   Abrocomo_cin 0.166666
1      1        5          2    Abrocomo_cin 0.200000
1      1        10         1    Abrothrix_an 0.100000
1      1        44         10   Abrothrix_an 0.022727
1      6        18         1    Abrothrix_je 0.333333
1      1        23         4    Abrothrix_lo 0.130434
I want to highlight rows that have the same combination of "Species_name" and "Biom": in the example these are lines 3, 4, 5 from CDD26_FF matching lines 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the compare() function seems to do), but all the other columns as well.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" from the highlighted lines.
How can I do that?
Aside from compare(), I tried a for loop, but the lengths of the tables differ, so I tried a 3-nested for loop, still without results. I also tried compareDF() and semi_join(). No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames under the matching conditions (in this case: matching "Biom" and "Species_name").
Subsequently, it's easy to calculate the ratio using mutate:
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: remove the select part if you need all columns, or adjust it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
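As a hedged side note on the "highlighting" step itself: semi_join() (which the question mentions trying) keeps every column of cdd26_f but only the rows whose Biom/Species_name combination also occurs in cdd_hi; it does not add the columns from cdd_hi, so the ratio still needs the inner_join shown above.
# rows of cdd26_f that have a match in cdd_hi, all columns kept
matched_ff <- semi_join(cdd26_f, cdd_hi, by = c("Biom", "Species_name"))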

Check if a variable returns to a previous level in a rolling window in R

I am trying to create a function, applied to a variable in a data frame, that looks at a window of 3 days forward from the current observation and determines whether the current price decreases and then returns to the original price. The data frame looks like this:
VarA VarB Date Price Diff VarD
1 1 2007-04-09 50 NA 0
1 1 2007-04-10 50 0 0
1 1 2007-04-11 48 -2 1
1 1 2007-04-12 48 0 1
1 1 2007-04-13 50 2 0
1 1 2007-04-14 50 0 0
1 1 2007-04-15 45 -5 1
1 1 2007-04-16 50 5 0
1 1 2007-04-17 45 -5 0
1 1 2007-04-18 48 3 0
1 1 2007-04-19 48 0 0
1 1 2007-04-20 50 2 0
Here VarA and VarB are grouping variables (in this example they do not change), Price is the variable for which I wish to detect whether it decreases and then increases again to the starting level, and Diff is the lagged price difference (if that is of any help).
VarD shows the result of applying the function I am trying to work out. There are two conditions for VarD to take the value 1: 1) the price decreases from a level and then, within a window of the two following days, returns to the original level (i.e., 50 to 48 and back to 50 in rows 2 to 5, or 50 to 45 and back to 50 in rows 6 to 8); 2) there is a maximum of two days for the price to increase back to the starting level. Otherwise, VarD should take the value 0.
I do not have any clue how to start.
The dataframe db is:
db <- read.table(header = TRUE, sep = ",", text = "VarA,VarB,Date,Price,Diff
1,1,2007-04-09,50,NA
1,1,2007-04-10,50,0
1,1,2007-04-11,48,-2
1,1,2007-04-12,48,0
1,1,2007-04-13,50,2
1,1,2007-04-14,50,0
1,1,2007-04-15,45,-5
1,1,2007-04-16,50,5
1,1,2007-04-17,45,-5
1,1,2007-04-18,48,3
1,1,2007-04-19,48,0
1,1,2007-04-20,50,2")
Thanks in advance.
Hope I understood your requirements correctly:
library(dplyr)
db %>%
  # create Diff.2 as a helper variable: increase in price from the current day to 2 days later
  mutate(Diff.2 = diff(c(Price, NA, NA), lag = 2)) %>%
  mutate(Var.D = ifelse(
    Diff.2 + lag(Diff.2, 2) == 0 & # condition 1: the price increase from the current day to 2 days later
                                   # cancels out the price decrease from 2 days ago to the current day
      Diff.2 > 0,                  # condition 2: the price increases from the current day to 2 days later
    1, 0)) %>%
  mutate(Var.D = ifelse(is.na(Var.D), 0, Var.D)) %>%
  select(-Diff.2)
VarA VarB Date Price Diff Var.D
1 1 1 2007-04-09 50 NA 0
2 1 1 2007-04-10 50 0 0
3 1 1 2007-04-11 48 -2 1
4 1 1 2007-04-12 48 0 1
5 1 1 2007-04-13 50 2 0
6 1 1 2007-04-14 50 0 0
7 1 1 2007-04-15 48 -2 0
8 1 1 2007-04-16 49 1 0
9 1 1 2007-04-17 45 -4 0
10 1 1 2007-04-18 45 0 0
11 1 1 2007-04-19 45 0 0
12 1 1 2007-04-20 50 0 0
I think I found the solution, if it is of interest. I used inputs from @G. Grothendieck, so he deserves most of the credit (but not the blame for errors). The solution is in four steps:
Step 1: create a dummy variable equal to 1 if the price decreases, and keep it at 1 while the price stays low.
db$Tmp1 <- 0
for (n in seq_along(db$Date)) {
  db$Tmp1[n] <- ifelse(isTRUE(db$Diff[n] < 0), 1,
                       ifelse(n > 1 && db$Tmp1[n - 1] == 1 && db$Diff[n] == 0,
                              1, 0))
}
The first part of the ifelse says that if the price at date [n] decreases, or if the previous value of Tmp1 is equal to 1 and the price does not change, then assign the value 1, else 0.
Step 2: restrict the number of days the price can stay lower in Step 1 to two days (thanks to @G. Grothendieck).
library(zoo)  # for rollapply
loop <- function(x) if (all(x[-1] == 1)) 0 else x[1]
roll <- rollapply(db$Tmp1, 3, loop, partial = TRUE, align = "left")
db$Tmp2 <- 0
for (n in seq_along(db$Date))
  db$Tmp2[n] <- ifelse(isTRUE(db$Diff[n] < 0), roll[n],
                       ifelse(n > 1 && db$Diff[n] == 0 && db$Tmp2[n - 1] == 1, 1, 0))
loop is a function that returns 0 if all values except the current date's are equal to 1, and otherwise returns the current value of Tmp1. Then, if the price decreases (db$Diff < 0), take the value of loop applied to 3 values of Tmp1 forward; if instead the price does not change and the previous value of Tmp2 is 1, assign 1. Otherwise assign 0.
Step 3: check whether the price from just before the decrease reappears within a three-day window.
loop2 <- function(x) if (any(x[-1] == x[1])) 1 else 0
db$Tmp3 <- rollapply(db$Price, 4, loop2, partial = TRUE, align = "left")
The function loop2 checks whether any price is repeated within 3 days of the current date (hence the width of 4 in the rollapply call). Tmp3 then applies loop2 to the Price vector (following this thread: Ifelse statement with dataframe subset using date).
Step 4: Multiply Tmp2 and Tmp3 to obtain the result (and delete the auxiliary variables).
db$Sale <- db$Tmp2 * db$Tmp3
db$Tmp1 <- db$Tmp2 <- db$Tmp3 <- NULL
Now, Sale is simply Tmp2 times Tmp3: the first restricts the drop to a 3-day window, and the second shows whether the original price at the start of the decrease reappears within a window of 3 days.
Hope it's useful to someone. If anyone has corrections or suggestions, they are very welcome. Lastly, each of the steps should be applied to each VarA and VarB, so each step should be wrapped in the following code:
db <- db %>%
  group_by(VarA, VarB) %>%
  mutate(
    # code for each step goes here
  )
The output is:
VarA VarB Date Price Diff Tmp1 Tmp2 Tmp3 Sale
<int> <int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2007-04-09 50 0 0 NA 1 NA
2 1 1 2007-04-10 50 0 0 0 1 0
3 1 1 2007-04-11 48 -2 1 1 1 1
4 1 1 2007-04-12 48 0 1 1 1 1
5 1 1 2007-04-13 50 2 0 0 1 0
6 1 1 2007-04-14 50 0 0 0 0 0
7 1 1 2007-04-15 48 -2 1 1 0 0
8 1 1 2007-04-16 49 1 0 0 0 0
9 1 1 2007-04-17 45 -4 1 0 1 0
10 1 1 2007-04-18 45 0 1 0 1 0
11 1 1 2007-04-19 45 0 1 0 0 0
12 1 1 2007-04-20 50 5 0 0 0 0
Thanks a lot.
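For comparison, a minimal, hedged sketch of the simple drop-and-recover case using dplyr's lag()/lead(): it flags a 1 only on the day the price falls below the previous day's level and that level reappears within the next two days, so days where the price stays low before recovering (e.g. row 4 in the question) are not flagged and still need the run-tracking logic from the steps above.
library(dplyr)
db %>%
  group_by(VarA, VarB) %>%
  mutate(
    prev_price  = lag(Price),  # yesterday's level
    recovers    = lead(Price, 1) >= prev_price | lead(Price, 2) >= prev_price,
    VarD_simple = as.integer(coalesce(Price < prev_price & recovers, FALSE))
  ) %>%
  ungroup()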

Mean of Groups of means in R

I have the following data
Exp = my data frame
dt<-data.table(Game=c(rep(1,9),rep(2,3)),
Round=rep(1:3,4),
Participant=rep(1:4,each=3),
Left_Choice=c(1,0,0,1,1,0,0,0,1,1,1,1),
Total_Points=c(5,15,12,16,83,7,4,8,23,6,9,14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First of all, for each of the participants in each of the games I need to calculate the mean "left choice rate".
After that I want to break the results into 5 groups (left choice < 20%, left choice between 20% and 40%, etc.).
For each group (in each of the games), I want to calculate the mean of Total_Points in the last round (round 3 in this simple example), using only the value of round 3. So, for example, for participant 1 in game 1 the total points in round 3 are 12, and for participant 4 in game 2 they are 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1    <35%        17.5 = (12+23)/2
1    35%-70%     7
1    >70%        NA
2    <35%        NA
2    35%-70%     NA
2    >70%        14
Please help! :)
Working in data.table
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut; not totally clear, but I think you want include.lowest=T.
dt[,pct_grp:=cut(pct_left,breaks=seq(0,1,by=.2),include.lowest=T)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(If you just want the reduced table, use .(end_mean = mean(Total_Points)) instead.)
You didn't make it clear whether there is a global maximum number of rounds (i.e. whether all games end in the same number of rounds); this was assumed above. You'll have to be more clear about this in order to provide an exact alternative, but I suggest starting with just defining it round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
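Putting the three steps together, here is a hedged sketch that returns only the reduced table (one row per Game and percentage group), under the same assumption that each game ends at its own max(Round):
dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)]
dt[, pct_grp  := cut(pct_left, breaks = seq(0, 1, by = .2), include.lowest = TRUE)]
dt[, .SD[Round == max(Round)], by = .(Game, Participant)][
  , .(end_mean = mean(Total_Points)), by = .(Game, pct_grp)]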

calculate the mean of trials for each subject in R

I have a dataframe here: each subject does 6 trials, there are 105 subjects.
I want to find the mean of 'skip' for 6 trials for each subj.
How do I start?
> subj entropy n_gambles trial choice
1 0 high 2 0 skip
2 0 high 2 1 skip
3 0 high 2 2 skip
4 0 high 2 3 skip
5 0 high 2 4 skip
6 0 high 2 5 skip
7 1 high 32 0 buy
8 1 high 32 1 buy
9 1 high 32 2 buy
10 1 high 32 3 buy
11 1 high 32 4 buy
12 1 high 32 5 buy
You can use ddply from the plyr package. (You mentioned that there will always be six trials, so the mean is computed by dividing the number of observations with choice == "skip" by 6, for each subject.)
library(plyr)
ddply(df,.(subj),summarise,mymean=(length(which(choice=="skip")))/6)
subj mymean
1 0 1
2 1 0
Note: df is your data
If I had to guess, I'd say you intend to get the mean of n_gambles for each subject where choice == "skip"; in that case this might work:
# Data
df<- read.table(text="subj entropy n_gambles trial choice
0 high 2 0 skip
0 high 2 1 skip
0 high 2 2 skip
0 high 2 3 skip
0 high 2 4 skip
0 high 2 5 skip
1 high 32 0 buy
1 high 32 1 buy
1 high 32 2 buy
1 high 32 3 buy
1 high 32 4 buy
1 high 32 5 buy",header=T)
# Get mean
aggregate(df[df$choice == "skip", "n_gambles"],
          list(subj = df[df$choice == "skip", "subj"]),
          mean)
# Output
# subj x
# 1 0 2
EDIT:
As I understand it, you want the frequency of "skip" per subj.
Try this:
# Get counts
result <- as.data.frame(table(df$subj,df$choice))
colnames(result) <- c("subj","choice","Freq")
# Subset for "skip" and divide by 6
result <- result[ result$choice == "skip",]
result$Freq <- result$Freq/6
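Since the other answers on this page lean on dplyr, here is a hedged dplyr equivalent of the same idea: the proportion of "skip" trials per subject, assuming every subject really has exactly 6 trials (so the mean of the logical equals the skip count divided by 6).
library(dplyr)
df %>%
  group_by(subj) %>%
  summarise(skip_rate = mean(choice == "skip"))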

Convert first element of each factor to NA in R

I have a data frame, simplified as follows:
Day Place dendrometer max
1 1 1 4684
2 1 1 4831
1 1 2 2486
2 1 2 2596
1 2 1 6987
2 2 1 6824
I need the first element for each dendrometer to be NA, so that every time R calculates "max" for a new dendrometer (independently of the place), it starts with NA, like this:
Day Place dendrometer max
1 1 1 NA
2 1 1 4831
1 1 2 NA
2 1 2 2596
1 2 1 NA
2 2 1 6824
Could you also let me know how I could calculate the MEAN of the max column for each dendrometer within each ring (sapply, aggregate?) instead of calculating the mean of the entire max column?
NOTE: dendrometer 1 in place 1 is different from dendrometer 1 in place 2; I need separate information for each of them.
library(data.table)
myDat <- data.table(myDat, key = "Day")
# using the `mult` argument, make the first instance of each Day NA
myDat[.(unique(Day)), dendrometer := NA, mult = "first"]
# add mean
myDat[, mean := mean(dendrometer, na.rm = TRUE), by = Day]
# add max
myDat[, max := max(dendrometer, na.rm = TRUE), by = Day]
Results:
> myDat
Day Place dendrometer mean max
1: 1 1 NA 3304.333 4831
2: 1 1 4831 3304.333 4831
3: 1 2 2486 3304.333 4831
4: 1 2 2596 3304.333 4831
5: 2 1 NA 6824.000 6824
6: 2 1 6824 6824.000 6824
Sample Data Used:
read.table(text=
"Day Place dendrometer
1 1 4684
1 1 4831
1 2 2486
1 2 2596
2 1 6987
2 1 6824", header=TRUE, stringsAsFactors=FALSE) -> myDat
Do you always have only two measurements of one dendrometer in one place? If so, then you could just set every other value as NA:
#x is your data.frame
x<-read.table("clipboard",header=TRUE)
x[seq(1,nrow(x),by=2),4]<-NA
and the max values are the non-NA values
x[seq(2,nrow(x),by=2),4]
If your data is more complicated, this should work:
dup<-duplicated(x[,2:3]) #find the non-unique cases
x[!dup,4]<-NA #set the first measurements as NA
tapply(x[dup,4],which(dup),max) #compute max from others.
Note that for computing the mean you do not need to set the first measurements to NA.
First, the mean values of max can be calculated with tapply:
dat <- transform(dat,
                 mean = tapply(max, c(0, cumsum(abs(diff(dendrometer)))), mean))
Day Place dendrometer max mean
1 1 1 1 4684 4757.5
2 2 1 1 4831 2541.0
3 1 1 2 2486 6905.5
4 2 1 2 2596 4757.5
5 1 2 1 6987 2541.0
6 2 2 1 6824 6905.5
You can use the diff function to find differences between consecutive dendrometer values, and the is.na<- replacement function to set values in max to NA.
is.na(dat$max) <- c(TRUE, diff(dat$dendrometer) != 0)
Day Place dendrometer max mean
1 1 1 1 NA 4757.5
2 2 1 1 4831 2541.0
3 1 1 2 NA 6905.5
4 2 1 2 2596 4757.5
5 1 2 1 NA 2541.0
6 2 2 1 6824 6905.5
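Because the question notes that dendrometer 1 in place 1 is a different device from dendrometer 1 in place 2, here is a hedged grouped sketch that makes the grouping explicit, assuming dat has the Day/Place/dendrometer/max columns from the question: within each Place and dendrometer combination it blanks the first max value (rows assumed ordered by Day) and computes the mean of the remaining ones.
library(dplyr)
dat %>%
  group_by(Place, dendrometer) %>%
  mutate(max      = replace(max, 1L, NA_real_),  # first reading per device set to NA
         mean_max = mean(max, na.rm = TRUE)) %>%
  ungroup()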
