How to remove error and post-error trials in R

For my research, I have to remove certain trials to limit contamination in the data. Here are the rules (a sketch implementing them follows the data below):
Remove the first trial.
For the RT calculation, remove error trials and trials following an error. Say we have 20 trials and 3 errors: we have to remove 6 trials from the final data, i.e., the mean RT will be calculated from 14 trials. If the errors are in a row, say 3 errors in a row, the RT will be calculated from 17 trials.
For the error rate calculation, remove trials following an error. Say we have 20 trials and 3 errors: we have to remove the 3 trials following the errors from the final data, i.e., the error rate will be calculated as 3/17. If the errors are in a row, say 3 errors in a row, only the first error is retained and the next two errors are excluded, so the error rate will be calculated as 1/18.
trial_no. correct
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
14 0
15 1
16 1
17 1
18 1
19 0
20 1
21 1
22 1
23 0
24 0
25 1
26 1
27 1
28 1
29 1
30 1
31 1
32 0
33 0
34 1
35 1
36 1
37 1
38 1
39 0
40 1
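Here is a minimal sketch of those rules (the data frame name d and the RT column rt are assumptions, since the post only shows trial_no and correct; it applies the rules literally, so runs of consecutive errors may count slightly differently from the worked examples above):
library(dplyr)

d <- d %>% slice(-1)  # rule 1: remove the first trial

# mean RT: drop error trials and any trial immediately following an error
rt_kept <- d %>% filter(correct == 1, lag(correct, default = 1) == 1)
mean_rt <- mean(rt_kept$rt)

# error rate: drop only trials immediately following an error, so in a run
# of errors only the first error is retained
er_kept <- d %>% filter(lag(correct, default = 1) == 1)
error_rate <- mean(er_kept$correct == 0)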


Get the average of the values of one column for the values in another

I was not so sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are mentioned. My data frame (df) has many mentions of 70 initiatives (rows), meaning my df has 500+ rows of data but only 70 initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean or average Tone when an Initiative is mentioned, as well as the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (coded 1), neutral (2), negative (coded 3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
  group_by(Initiative, Topic, Goals, Tone) %>%
  summarize(averagetone = mean(Tone))
with the following output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that for Initiative, the value 0 means "other initiative".
I've also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with solution output
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances I do not get an average for Tone but instead get NAs.
I have removed the NAs from the "Tone" column in the df, and have also tried to remove all the other missing values in the df (it's only about 30 values that I deleted).
I have also re-coded the values for Tone:
tabmean <- Meantable %>% mutate(Tone = recode(Tone,
  `1` = "1",
  `2` = "0",
  `3` = "-1",
  `4` = "2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than I think, but I have gotten stuck and have no idea how to proceed. I'd be super grateful for a better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say you want the average tone for when Initiative == 1; you could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to add na.rm = TRUE to the summarise call if you have missing values in the column you are summarizing, otherwise it will only produce NAs, and (2) you should check that the columns are of type numeric (you can check that with str(tabmean) and, for example, change Tone to numeric with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone))).
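If instead you want the average Tone for every Initiative at once, here is a hedged sketch of a grouped version (assuming dplyr is loaded and Tone can be coerced to numeric; the as.character() step guards against Tone being a factor):
library(dplyr)

GoalMeanTone <- tabmean %>%
  mutate(Tone = as.numeric(as.character(Tone))) %>%  # coerce factor/character Tone
  group_by(Initiative) %>%
  summarise(avg_tone = mean(Tone, na.rm = TRUE))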

How to get a value from an upcoming row if a condition is met?

I searched on Google and SO but could not find an answer to my question.
I am trying to get a value from the first upcoming row where a condition is met.
Example:
Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33
to
Pupil participation bonus bonusAtNoParti sumBonusTillParticipation=0
2     55            6     -94            6+3+9 = 18
2     33            3     -97            3+9 = 12
2     88            9     -91            9
2     0             -100  0              0
2     44            4     -29            4+7 = 11
2     66            7     -26            7
2     0             -33   0              0
So I need to do this: iterate through the data frame and check the following rows until participation equals 0, take the bonus from that line, add the bonus from the current line, and write the result to bonusAtNoParti.
My problem is the "check the following rows until participation equals 0 and get the bonus from that line" part.
I know how to iterate through the whole list, but not how to start from the current point (row) onwards.
I would need to do this for the whole list, where participation values can appear in any random order.
Has anyone any idea how to realize this?
Edit: I also added another column ("sumBonusTillParticipation=0"; only the sum value is required), which is even harder to realize. R is such a hard language to learn =(
You can use which() to get the row numbers where participation is 0.
df <- read.table(text = 'Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33', header = TRUE)

# row numbers where participation is 0, with a leading 0 as a sentinel
# (indexing with 0 is silently dropped in R)
index <- c(0, which(df$participation == 0))
# length of each block that ends in a participation == 0 row
diffs <- diff(index)
# repeat each block's terminal bonus across the rows of that block
df$tp <- rep(df$bonus[index], times = diffs)
df$bonusAtNoParti <- df$bonus + df$tp
df$bonusAtNoParti[index] <- 0  # the participation == 0 rows themselves get 0
df$tp <- NULL
Pupil participation bonus bonusAtNoParti
1 2 55 6 -94
2 2 33 3 -97
3 2 88 9 -91
4 2 0 -100 0
5 2 44 4 -29
6 2 66 7 -26
7 2 0 -33 0
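For the requested sumBonusTillParticipation=0 column, here is a hedged sketch in the same spirit (the shortened column name sumBonusTillParti is an assumption, and it assumes, as above, that every block ends in a participation == 0 row): the reverse cumulative sum of bonus within each block, minus the block's terminal bonus, gives the running sum up to but excluding the next participation == 0 row.
# block id per row, counted from the end of the data frame
grp <- rev(cumsum(rev(df$participation == 0)))
# reverse cumulative sum within each block, excluding the terminal bonus,
# so the participation == 0 rows themselves come out as 0
df$sumBonusTillParti <- ave(df$bonus, grp,
                            FUN = function(x) rev(cumsum(rev(x))) - x[length(x)])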

Select specific rows based on previous row value (in the same column)

I've been trying to figure out a way to script this in R, but just can't get it. I have a dataset like this:
Trial Type Correct Latency
1 55 0 0
3 30 1 766
4 10 1 344
6 40 1 716
7 10 1 326
9 30 1 550
10 10 1 350
11 64 0 0
13 30 1 683
14 10 1 270
16 30 1 666
17 10 1 297
19 40 1 616
20 10 1 315
21 64 0 0
23 40 1 850
24 10 1 322
26 30 1 566
27 20 0 766
28 40 1 500
29 20 1 230
which goes on for much longer (around 1,000 rows).
From this one dataset, I would like to create 4 separate data.frames/tables that I can export, as well as do my own calculations on.
I would like one data.frame (4 in total) for each of these bullet points:
type 10 rows which are preceded by a type 30 row
type 10 rows which are preceded by a type 40 row
type 20 rows which are preceded by a type 30 row
type 20 rows which are preceded by a type 40 row
I would like all the columns of the relevant rows to be placed into these new tables, but including only the rows of type 10 or 20 themselves.
For example, the first table (type 10 preceded by type 30) would look like this, based on the sample data:
Trial Type Correct Latency
4 10 1 344
10 10 1 350
14 10 1 270
17 10 1 297
Second table (type 10 preceded by type 40):
Trial Type Correct Latency
7 10 1 326
20 10 1 315
24 10 1 322
Third table (type 20 preceded by type 30):
Trial Type Correct Latency
27 20 0 766
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
29 20 1 230
I can subset just fine to get one table of only type 10 rows and another of only type 20 rows, but I can't figure out how to create separate tables for the type 10 and type 20 rows based on the previous Type value. A further complication is that Trial is not in order (it skips numbers).
Any help would be greatly appreciated. Thank you.
Also, is there a way to include the previous row as well, so that the output for the fourth table would look something like this?
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
28 40 1 500
29 20 1 230
For the fourth example, you could use which() in combination with lag() from dplyr to obtain the indices of the rows that meet your criteria. Then you can use these to subset the data.frame.
# Get the indices of rows that meet the condition
ind2 <- which(df$Type == 20 & dplyr::lag(df$Type) == 40)
# Get the indices of the rows immediately before those
ind1 <- ind2 - 1
# Subset the data (note the comma: we are selecting rows of a data.frame)
> df[sort(c(ind1, ind2)), ]
Trial Type Correct Latency
1: 28 40 1 500
2: 29 20 1 230
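To build all four tables at once, here is a hedged sketch along the same lines (assuming the data is in df and dplyr is loaded):
library(dplyr)

# (previous Type, current Type) pairs for the four requested tables
pairs <- list(c(30, 10), c(40, 10), c(30, 20), c(40, 20))
tables <- lapply(pairs, function(p) {
  df %>% filter(Type == p[2], lag(Type) == p[1])
})
# tables[[1]]: type 10 preceded by 30; tables[[2]]: type 10 preceded by 40;
# tables[[3]]: type 20 preceded by 30; tables[[4]]: type 20 preceded by 40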
Here is some example code for the case where you always want to delete the first trial(s) of your data.
library(dplyr)  # for filter() and lag()
var1 <- c(1,2,1,2,1,2,1,2,1,2)
var2 <- c(1,1,1,2,2,2,2,3,3,3)
dat <- data.frame(var1, var2)
var1 var2
1 1 1
2 2 1
3 1 1
4 2 2
5 1 2
6 2 2
7 1 2
8 2 3
9 1 3
10 2 3
# delete only the first trial of each block directly
filter(dat, lag(var2) == var2)
  var1 var2
1    2    1
2    1    1
3    1    2
4    2    2
5    1    2
6    1    3
7    2    3
# delete the first 2 trials of the experiment as well as the first 2 trials of
# each new block: rows where var2[n-1] != var2[n] (using lag from dplyr), plus the row after
drops <- c(1, 2, which(lag(dat$var2) != dat$var2), which(lag(dat$var2) != dat$var2) + 1)
if (length(drops) > 0) { dat <- dat[-drops, ] }
var1 var2
3 1 1
6 2 2
7 1 2
10 2 3

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. First I generate a matrix of values, and then I add up the number of times 0 occurs in each column; a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example:
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into 100 minus the numbers shown? If I could plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
# replace every column except the first (Divisions) with 100 minus its value
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
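With the transformed table in hand, here is a hedged sketch of the degradation plot itself, reusing the melt() call from the question (reshape2, ggplot2, and the column names Remaining/Size are assumptions):
library(reshape2)
library(ggplot2)

# long format: one row per (Divisions, Size) with the remaining percentage
plot_data <- melt(fld, id.vars = c("Divisions", "Accuracy"),
                  value.name = "Remaining", variable.name = "Size")
ggplot(plot_data, aes(x = Divisions, y = Remaining, colour = Size)) +
  geom_line()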

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # number of individuals in the toy example
pop <- data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                  age = round(rnorm(n, mean = 40, 10)), disType = rbinom(n, 1, .2),
                  rs123 = c(1,3,1,3,3,1,1,1,3,1), rs123.1 = rep(1, n),
                  rs157 = c(2,4,2,2,2,4,4,4,2,2), rs157.1 = c(4,4,4,2,4,4,4,4,2,2),
                  rs132 = c(4,4,4,4,4,4,4,4,2,2), rs132.1 = c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info, and the rest of the columns are biallelic SNP info. For example, rs123 is allele 1 of SNP rs123 and rs123.1 is its second allele.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter, then:
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
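For parts 2) and 3), here is a hedged sketch building on pop2 (the column range 5:7 and breaking ties by the first minimum are assumptions):
snp_cols <- 5:7
pop2[snp_cols] <- lapply(pop2[snp_cols], function(x) {
  counts <- table(x)
  rare <- names(counts)[which.min(counts)]  # least frequent value (first on ties)
  as.integer(x == rare)                     # 1 = least frequent, 0 = everything else
})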
