Calculate the mean of trials for each subject in R

I have a data frame: each subject does 6 trials, and there are 105 subjects.
I want to find the mean of 'skip' over the 6 trials for each subject.
How do I start?
   subj entropy n_gambles trial choice
1     0    high         2     0   skip
2     0    high         2     1   skip
3     0    high         2     2   skip
4     0    high         2     3   skip
5     0    high         2     4   skip
6     0    high         2     5   skip
7     1    high        32     0    buy
8     1    high        32     1    buy
9     1    high        32     2    buy
10    1    high        32     3    buy
11    1    high        32     4    buy
12    1    high        32     5    buy

You can use ddply from the plyr package. (You mentioned that there will be six trials, so the mean is computed by dividing the number of observations with choice == "skip" by 6 for each subject.)
library(plyr)
# proportion of the 6 trials on which each subject chose "skip"
ddply(df, .(subj), summarise, mymean = length(which(choice == "skip")) / 6)
subj mymean
1 0 1
2 1 0
Note: df is your data frame.

If I have to guess, you intend to get the mean of n_gambles for each subject where choice == "skip"; then this might work:
# Data
df<- read.table(text="subj entropy n_gambles trial choice
0 high 2 0 skip
0 high 2 1 skip
0 high 2 2 skip
0 high 2 3 skip
0 high 2 4 skip
0 high 2 5 skip
1 high 32 0 buy
1 high 32 1 buy
1 high 32 2 buy
1 high 32 3 buy
1 high 32 4 buy
1 high 32 5 buy",header=T)
# Get mean
aggregate(df[df$choice == "skip", "n_gambles"],
          list(subj = df[df$choice == "skip", "subj"]),
          mean)
# Output
# subj x
# 1 0 2
EDIT:
As I understand it, you want the frequency of "skip" per subj. Try this:
# Get counts
result <- as.data.frame(table(df$subj,df$choice))
colnames(result) <- c("subj","choice","Freq")
# Subset for "skip" and divide by 6
result <- result[ result$choice == "skip",]
result$Freq <- result$Freq/6
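For completeness, here is a minimal dplyr sketch of the same computation (an alternative, not from the original answer); it assumes df has the subj and choice columns shown above, and it uses the observed number of trials per subject instead of hard-coding 6:
library(dplyr)
df %>%
  group_by(subj) %>%
  summarise(mymean = mean(choice == "skip"))  # proportion of trials skipped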

Related

Data Frame - Add number of occurrences with a condition in R

I'm having a bit of a struggle trying to figure out how to do the following. I want to count how many consecutive days of high sales occur before a price change. For example, there is a price change on day 10, and the high-sales indicator marks any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days before the change.
In this case it should return 5 (days 5 to 9).
For example purposes, the dataframe is called df. Code:
# trying to create a while loop that will check if lag(high_sales) is 1;
# if yes it will count until there's a lag(high_sales) == 0
# loop is just my dummy variable that will take me out of the while loop
count_sales <- 0
loop <- 0
df <- df %>% mutate(consec_high_days = ifelse(price_change > 0, while(loop == 0){
  if(lag(High_sales_ind) == 1){
    count_sales <- count_sales + 1}
  else{loop <- 0}
  count_sales}, 0))
day price price_change sales High_sales_ind
1   5     0            12    1
2   5     0            6     0
3   5     0            5     0
4   5     0            4     0
5   5     0            10    1
6   5     0            10    1
7   5     0            10    1
8   5     0            12    1
9   5     0            14    1
10  7     2            3     0
11  7     0            2     0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, and hopefully someone can provide a more elegant one, but to get the desired dataset you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr)
# number consecutive instances of the high-sales flag (sales >= 10)
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# find how many consecutive days occurred before each row (lag the run counter)
df <- df %>% mutate(lseq = lag(seq))
# length of the run immediately before the price change
keepz <- df[df$price_change != 0, "lseq"]
# row index of the last day before the price change
end <- as.numeric(rownames(df[df$price_change != 0, ])) - 1
# keep that run of rows, dropping the two helper columns
df_want <- df[(end - keepz + 1):end, -c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
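As an aside, here is a more compact sketch of the same idea built directly on rle(); it assumes, as above, that "high sales" means sales >= 10 and that you want the run ending the day before the first price change:
# row of the first price change
change_row <- which(df$price_change != 0)[1]
# run-length encode the high-sales flag up to the day before the change
r <- rle(df$sales[seq_len(change_row - 1)] >= 10)
# length of the final run, if it is a high-sales run
consec <- if (tail(r$values, 1)) tail(r$lengths, 1) else 0
consec
# [1] 5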

How to calculate percentage in R?

I am a newbie to R and I have a data frame which contains the following fields:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
My aim is to calculate the time spent at each place by which 75% of the vehicles have crossed it. From this data frame I generate the one below with:
day place hour time_spent count cum_count percentage
1 1 1 1 120 120 30.7%
1 1 1 2 100 220 56.4%
1 1 1 3 90 310 79%
1 1 1 4 80 390 100%
df$cum_count <- cumsum(df$count)
df$percentage <- 100 * df$cum_count / sum(df$count)
for (i in 1:length(df$percentage)) {
  if (df$percentage[i] > 75) {
    low_time <- df$time_spent[i - 1]
    high_time <- df$time_spent[i]
    break  # stop at the first crossing of 75%
  }
}
This means that 75% of vehicles spend 2-3 minutes at place 1. But now I have a data frame like the one below, which covers all the places and all the days.
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
1 2 1 1 220
1 2 1 2 100
1 2 1 3 90
1 2 1 4 80
1 3 1 1 100
1 3 1 2 80
1 3 1 3 90
1 3 1 4 100
2 1 1 1 120
2 1 1 2 100
2 1 1 3 90
2 1 1 4 80
2 2 1 1 220
2 2 1 2 100
2 2 1 3 90
2 2 1 4 80
2 3 1 1 100
2 3 1 2 80
2 3 1 3 90
2 3 1 4 100
How is it possible to calculate the high time and low time for each place? Any help is appreciated.
The max and min functions ought to do the trick here, although you could also use summary to get the median, mean, etc. in one go. I'd also recommend the quantile function for these percentages. As is usually the case with R, the tricky part is getting the data into the correct format.
Say you want the total time spent at each place:
index <- sort(unique(df$place))
times <- as.list(rep(NA, length(index)))
names(times) <- index
for (ii in index) {
  counter <- c()
  # expand each time_spent value by its count, one entry per vehicle
  for (jj in df[df$place == ii, ]$time_spent) {
    counter <- c(counter, rep(jj, df[df$place == ii, ]$count[jj]))
  }
  times[[ii]] <- counter
}
Now for each place you can compute the max and min with:
lapply(times, max)
lapply(times, min)
Similarly you can compute the mean:
lapply(times, function(x) sum(x)/length(x))
lapply(times, mean)
I think what you want are the quantiles:
lapply(times, quantile, 0.75)
This would be the time by which at least 75% of vehicles had passed through a place, i.e., 75% of vehicles took this time or less to pass through.
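As a side note, the same per-place vectors can be built without the double loop. A minimal sketch, assuming df has the place, time_spent, and count columns from the question:
# one entry per vehicle: repeat each time_spent value by its count
times <- lapply(split(df, df$place), function(d) rep(d$time_spent, d$count))
sapply(times, quantile, 0.75)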
We can use a group by operation
library(dplyr)
dfN %>%
  group_by(day, place) %>%
  mutate(cum_count = cumsum(count),
         percentage = 100 * cum_count / sum(count),
         low_time = time_spent[which.max(percentage > 75) - 1],
         high_time = time_spent[low_time + 1])
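A slightly more defensive variant (a sketch, not from the original answer): index by position rather than by value, so it does not rely on time_spent equalling its own row number within each group (it still assumes the 75% threshold is first crossed after the first row):
dfN %>%
  group_by(day, place) %>%
  mutate(cum_count = cumsum(count),
         percentage = 100 * cum_count / sum(count),
         idx = which.max(percentage > 75),  # first row past 75%
         low_time = time_spent[idx - 1],
         high_time = time_spent[idx]) %>%
  select(-idx)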
If I understood your question correctly (you want the min and max value of time_spent at a place):
df %>%
  group_by(place) %>%
  summarise(min(time_spent),
            max(time_spent))
will give you this:
place min(time_spent) max(time_spent)
1 1 4
2 1 4
3 1 4

Data Cleaning for Survival Analysis

I’m in the process of cleaning some data for a survival analysis, and I am trying to make it so that an individual has only a single, sustained transition from symptom present (ss = 1) to symptom remitted (ss = 0). An individual must have a complete, sustained remission for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep coming to force me to use conditional formatting based on the rows immediately above and below a missing value, and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
I don't really think you have considered all the "edge cases": what to do with two NAs in a row at the end of a period, or 4 or 5 NAs in a row. This will give you the requested solution in your tiny test case, however, using the na.locf function:
require(zoo)
# fill NAs with the last observation carried forward, but leave a vector
# untouched when it ends in NA (that row will be deleted later);
# na.rm = FALSE keeps any leading NAs so the length stays equal
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) { vec } else { na.locf(vec, na.rm = FALSE) }
}
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
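For reference, the same per-id logic can be written with dplyr (a minimal sketch under the same rule: a group whose last value is NA is left untouched):
library(dplyr)
library(zoo)
mydat %>%
  group_by(id) %>%
  mutate(locf = if (is.na(last(ss))) ss else na.locf(ss, na.rm = FALSE)) %>%
  ungroup()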

Mean of Groups of means in R

I have the following data (dt is my data frame):
library(data.table)
dt <- data.table(Game = c(rep(1, 9), rep(2, 3)),
                 Round = rep(1:3, 4),
                 Participant = rep(1:4, each = 3),
                 Left_Choice = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1),
                 Total_Points = c(5, 15, 12, 16, 83, 7, 4, 8, 23, 6, 9, 14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First, for each participant in each game I need to calculate the mean "left choice rate".
After that I want to break the results into 5 groups (left choice < 20%, left choice between 20% and 40%, etc.).
For each group (in each of the games), I want to calculate the mean of Total_Points in the last round (round 3 in this simple example; only the round-3 value counts). So, for example, for participant 1 in game 1 the round-3 total points are 12, and for participant 4 in game 2 they are 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1    <35%        17.5 = (12+23)/2
1    35%-70%     7
1    >70%        NA
2    <35%        NA
2    35%-70%     NA
2    >70%        14
Please help! :)
Working in data.table
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut; it's not totally clear, but I think you want include.lowest=TRUE.
dt[,pct_grp:=cut(pct_left,breaks=seq(0,1,by=.2),include.lowest=T)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(if you just want the reduced table, use .(end_mean = mean(Total_Points)) instead).
You didn't make it clear whether there is a global maximum number of rounds (i.e. whether all games end after the same number of rounds); this was assumed above. You'll have to be more specific about this for an exact alternative, but I suggest starting by defining it round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
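Putting the three steps together (a sketch; it assumes all games end at the same maximum Round, as noted above, and that dt is the data.table from the question):
library(data.table)
dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)]
dt[, pct_grp := cut(pct_left, breaks = seq(0, 1, by = .2), include.lowest = TRUE)]
# reduced table of mean last-round points per group and game
dt[Round == max(Round), .(end_mean = mean(Total_Points)), by = .(Game, pct_grp)]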

Get frequencies (absolute and relative) of levels of a categorical variable from incidence binary data by combination of columns factors

I would like to get the frequencies of each level of a categorical variable (row vector) denoting the ecological type (3 levels: H, F, T) of a set of 93 herbaceous plants, counting the species observed present (= 1), conditioned on sites (3 levels: A, B, C), habitats (4 levels: 1, 2, 3, 4) and years (3 levels: 1, 2, 3).
I know the procedure goes through tapply(), but the messy part is the logic for linking the levels of the categorical variable (H, F, T) to the present species (= 1) across all of the species, conditioned on the combination of column factors.
This could be summarized by a 12 x 3 contingency table giving the number of species of each ecological type (3) per site (3) and habitat (4).
Example of my data (each habitat contains 20 lines): for each species (Sp1 to Sp93), 0 for absent and 1 for present. The vector "type" contains the ecological type of each species.
Site,Habitat,Year,Sp1,Sp2,Sp3,Sp4,Sp5,Sp6,...,Sp93
type = c(H, H, F, T, F, T, H, ..., T) # vector of length 93
Thank you in advance.
I hope this describes my data objects better.
data = read.csv(file = "Veg_06.csv", header = TRUE)
data = data[1:240, -c(1,4:7)]
Ilot: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ... (each level has 4 sublevels from "Site", with 20 lines each, adding up to 80 lines per level)
Site: Factor w/ 4 levels "Am","Av","CP","CS": 2 2 2 2 2 2 2 2 2 2 ...
Sp: int [1:240] 0 0 0 0 0 0 0 0 0 0 ... (either 0 or 1 for absence or presence of a species)
veg: Factor w/ 3 levels "H","F","T": 3 3 2 2 3 1 2 1 2 1 ... (categorical factor indicating the type of species)
First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.
It sounds like, fundamentally, your data is a collection of (year, site, habitat, quadrant (maybe line? not sure from the description), species) observations, each recording that a species was observed at that site, habitat, quadrant, and year. For simplicity, a row is present only if the species is present.
In addition, there's the concept of type, which is associated with each species.
Analyzing and contingency table
Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.
> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
species quadrant habitat site year
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 1
4 4 1 1 1 1
5 5 1 1 1 1
6 6 1 1 1 1
And let's take a small sample and a large sample
> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]
We can use the ftable function to get this into the shape we want, a contingency table:
> ftable(habitat ~ year + site, data=d.small)
habitat 1 2 3 4
year site
1 1 0 0 1 0
2 0 0 1 1
3 0 1 1 1
2 1 2 1 1 0
2 1 1 0 2
3 0 0 1 0
3 1 2 0 0 1
2 0 1 0 1
3 0 0 0 0
This will count the same species twice if it occurs in two different quadrants of the same site/habitat combination. We can discard the quadrant and unique-ify to get the count across all of them:
> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))
Transforming (tidying the source data)
To transform the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well)
> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1 1 1 2 3 0 1 0 0 1
Here I'm making assumptions about what your data look like that seem reasonable
> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
year site habitat quadrant species present
2 1 1 2 3 sp2 1
5 1 1 2 3 sp5 1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
year site habitat quadrant species
2 1 1 2 3 sp2
5 1 1 2 3 sp5
And now you can proceed with the analysis above.
Merging in the species type data
Looking at your description a little closer, I think you also want to merge in a parallel vector of species type information.
> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=T)))
> merge(d.small, sp.type)
species quadrant habitat site year type
1 6 16 4 2 3 2
2 27 9 2 2 2 4
3 27 8 4 2 1 4
4 32 18 1 2 2 4
5 33 18 1 1 2 2
6 45 14 4 2 2 3
7 49 6 2 3 1 1
8 54 3 3 2 1 2
9 55 2 1 1 3 3
10 56 2 4 3 1 2
11 56 1 3 1 1 2
12 57 7 2 1 2 1
13 62 18 4 2 2 3
14 70 19 1 1 2 3
15 77 2 3 3 1 4
16 80 7 3 1 2 1
17 81 17 1 1 3 2
18 82 5 2 2 3 3
19 86 9 4 1 3 3
20 87 10 3 3 2 3
And now you can use the subset, unique, and ftable approach above to get the data you need.
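Putting the pieces together (a sketch using the simulated d.small and sp.type objects above): drop quadrant duplicates, merge in the types, and tabulate species counts by type:
m <- merge(unique(d.small[c('species', 'habitat', 'year', 'site')]), sp.type)
ftable(type ~ year + site, data = m)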
Assuming you had a dataframe with (among other things) the columns named: "sites", "habitats", "years":
dfrm <- data.frame( sites = sample( LETTERS[1:3], 20, replace=TRUE),
habitats= sample( factor(1:4), 20, replace=TRUE),
years = sample( factor(paste("Y",1:4, sep="_")), 20, replace=TRUE) )
Then this will give you an additional factor-mode column that encodes the various levels of each row.
dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years))
If you want to keep non-populated levels, then do nothing else; if you want to drop combinations that have no instances, then use drop=TRUE. You can then analyze these within individual levels of the three classification variables.
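Example use (a small sketch): tabulate counts per combined level; by default, empty combinations are retained:
table(dfrm$three.way.inter)
# or drop combinations with no instances when building the factor
dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years, drop = TRUE))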
