Identifying and extracting the prior observation for a specific source in R

I have data in the form:
id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1
Now, for every id, I want to identify where l1 occurs in the source column and extract the observation prior to each l1.
For example: for id 1, the 3rd source is l1 and the observation prior to that is p.
So my data should look like this:
id source
1 p
3 p
3 n
How can I create this in R?

A data.table solution
library(data.table)
dd <- data.table(df)
dd[, source[match('l1', source) - 1L], by = id]
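Note that match() returns only the position of the first "l1" within each id, so this extracts at most one prior observation per id (and an NA for an id with no "l1" at all). A sketch that instead collects every prior observation, assuming data.table 1.9.6+ for shift():
# lag source within each id, then keep the value before every "l1";
# rows whose prior value is itself "l1" are dropped, matching the desired output
dd[, prev := shift(source), by = id][source == "l1" & prev != "l1", .(id, source = prev)]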

There might be a more direct method, but try this:
#get your data
test <- read.table(text="id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1",header=TRUE)
# do some picking of the cases
result <- do.call(rbind,by(test,test$id,function(x) x[which(x$source=="l1")-1,]))
result <- result[result$source!="l1",]
Which gives:
> result
id source
2 1 p
7 3 p
9 3 n

Here is another data.table solution. I wasn't able to get what seemed like a correct answer with the earlier version from @mnel.
library(data.table)
## Create the test data table:
dt <- data.table(id = c(1,1,1,1,2,2,3,3,3,3),
                 source1 = c("m","p","l1","l1","t","q","p","l1","n","l1"))
dt[,list(id, source1, source0=c(NA,source1[seq_len(.N-1L)]))][source1=="l1"]
## id source1 source0
## 1: 1 l1 p
## 2: 1 l1 l1
## 3: 3 l1 p
## 4: 3 l1 n
This adds a column source0 to the data table containing the previous row's value of source1 (or NA for the first row). .N is the number of rows, and seq_len(.N - 1L) takes every row number but the last, so the values line up as a lag. The final step subsets the result to rows where the original source1 has the value "l1".
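One caveat: the lag here is computed over the whole table rather than within each id, so an "l1" in the first row of an id would inherit the previous id's last value. Computing the lag per group avoids that; a minimal sketch using the same idiom, just grouped by id:
dt[, source0 := c(NA, source1[seq_len(.N - 1L)]), by = id][source1 == "l1"]
For this particular data the result is identical, but the grouped version stays correct when an id begins with "l1".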

Here is a vectorized solution using only simple functions from the base of R.
If DF is the input data frame, then sel is a logical vector whose TRUE components select out the required rows. The three terms connected by & signs select those rows:
for which the following row's source column equals "l1", and
whose source column is not "l1", and
which are such that the following row is not the first with that id.
The length of sel is one less than the number of rows in DF, so we use which to avoid recycling of sel.
is.l1 <- DF$source == "l1"
sel <- is.l1[-1] & !is.l1[-nrow(DF)] & duplicated(DF$id)[-1]
DF[which(sel),]
The result of the last line is:
id source
2 1 p
7 3 p
9 3 n

Related

How to transpose a long data frame every n rows

I have a data frame like this:
x = data.frame(type = c('a','b','c','a','b','a','b','c'),
               value = c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, c, while some items may be missing records, i.e. only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups are missing a, is the following. It assumes that x lists the type values in order of their levels within each group, so that a level lower than the prior level must start a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then whether b and c in the second group actually form a second group or are part of the first group is not determinable.
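For comparison, a tidyverse sketch of the same first-occurrence-of-a grouping (assuming dplyr and tidyr >= 1.0.0 are available):
library(dplyr)
library(tidyr)
x %>%
  mutate(item = cumsum(type == "a")) %>%
  pivot_wider(names_from = type, values_from = value)
This yields the same three rows as the tapply() version, with item kept as an explicit column.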

creating a summary data frame from long formated data

Following this worked example:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
score <- c(1,1,3,4,2,3,2,2,1,1,3,3,2)
df1 <- data.frame(case, ID, score)
identifier <- c('aa','bb','ff')
For each unique case (that is a, b, c, d, ...), I want to scan the ID column and see how often we have an identifier value.
So we look at the 3 rows with case == a: how many times does ID match identifier? (in this case 2 times)
We then look at the 2 rows with case == b and also count how many times ID matches identifier (in this case 1 time).
We do this for all unique cases.
I have used the following command, but this is for the whole sample, not separated per unique case:
df1$ID %in% identifier
And what I want as an end result is a table, with one column listing each unique case and a second column with the number of times ID and identifier were equal.
So I want to loop/automate the process and return an output like:
data.frame(c('a','b','c','d','e'), c(2,1,1,1,0))
You can use tapply():
tapply(df1$ID, df1$case, FUN = function(id) sum(id %in% identifier))
a b c d e
2 1 1 1 0
but as @Jaap pointed out, you can use aggregate() to get a data.frame:
aggregate(ID ~ case, data = df1, FUN = function(id) sum(id %in% identifier))
case ID
1 a 2
2 b 1
3 c 1
4 d 1
5 e 0
And if you want more grouping you can do:
df <- aggregate(ID ~ case+(score>1), data = df1, FUN = function(id) sum(id %in% identifier))
df[df$`score > 1`,c(1,3)]
case ID
4 a 0
5 b 1
6 c 1
7 d 0
8 e 0
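For completeness, a dplyr sketch of the same per-case count (assuming dplyr is installed; matches is just a chosen column name):
library(dplyr)
df1 %>%
  group_by(case) %>%
  summarise(matches = sum(ID %in% identifier))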

'Random' Sorting with a condition in R for Psychology Research

I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row, i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data, i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1, because I'm not allowed to have a pattern.
I tried various packages and functions like dplyr, sample, order, and sort, and nothing so far solves the problem.
I think there are a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
  # take an initial sample
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while(bad.seq){
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for(i in cumsum(l1)[ind]){
      # take it out of the original sample
      tmp <- out[-i]
      # pick a new position at random
      pos <- sample(2:(l0-2), 1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos-1)], out[i], tmp[pos:(l0-1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the valence categories appropriately. I've called the valence category treat.
#Generate example data
s1 <- data.frame(id = 1:10, treat = NA)
#Setting the first two rows
s1[1, "treat"] <- sample(1:3, 1)
s1[2, "treat"] <- sample(1:3, 1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id)) {
  s1[i, "treat"] <- sample(1:3, 1)
  #Check if the treat value is equal to the previous two values.
  if (s1[i, "treat"] == s1[i-1, "treat"] & s1[i-1, "treat"] == s1[i-2, "treat"]) {
    #If so, draw one of the values not equal to that value
    a <- 1:3
    remove <- s1[i, "treat"]
    a <- a[!a == remove]
    s1[i, "treat"] <- sample(a, 1)
  }
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
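Whichever approach you use, the run-length encoding trick from the first answer doubles as a quick check that the constraint actually holds:
# TRUE if no valence value occurs more than twice in a row
max(rle(s1$treat)$lengths) <= 2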

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which equals 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R, d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the apply family of functions in R, such as sapply or lapply.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
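Alternatively, since sum(1:n) is just the triangular number n*(n+1)/2, the whole column can be filled in a single vectorized step, with no grouping or apply at all:
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
For the example data this gives c(6, 6, 6, 3, 3), the same as the ave() result.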

Random row selection in R

I have this dataframe
id <- c(1,1,1,2,2,3)
name <- c("A","A","A","B","B","C")
value <- c(7:12)
df<- data.frame(id=id, name=name, value=value)
df
This function selects a random row from it:
randomRows = function(df, n){
  return(df[sample(nrow(df), n), ])
}
i.e.
randomRows(df,1)
But I want to randomly select one row per 'name' (or per 'id', which is the same) and concatenate that entire row into a new table, so in this case, three rows. This has to loop through a 2000+ row data frame. Please show me how?!
I think you can do this with the plyr package:
library("plyr")
ddply(df,.(name),randomRows,1)
which gives you for example:
id name value
1 1 A 8
2 2 B 11
3 3 C 12
Is this what you are looking for?
Here's one way of doing it in base R.
> df.split <- split(df, df$name)
> df.sample <- lapply(df.split, randomRows, 1)
> df.final <- do.call("rbind", df.sample)
> df.final
id name value
A 1 A 7
B 2 B 11
C 3 C 12
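In current dplyr (1.0.0 or later), the same per-group sampling is a short pipeline built on slice_sample(); a sketch assuming dplyr is installed:
library(dplyr)
df %>%
  group_by(name) %>%
  slice_sample(n = 1) %>%
  ungroup()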
