My data set has about 54,000 rows. I want to set a value (First_Pass) to either T or F depending upon both a value in another column and also whether or not that other column's value has been seen before. I have a for loop that does exactly what I need it to do. However, that loop is only for a subset of the data. I need that same for loop to be run individually for different subsets based upon factor levels.
This seems like the perfect case for the plyr functions as I want to split the data into subsets, apply a function (my for loop) and then rejoin the data. However, I cannot get it to work. First, I give a sample of the df, called char.data.
session_id list Sent_Order Sentence_ID Cond1 Cond2 Q_ID Was_y CI CI_Delta character tsle tsoc Direct
5139 2 b 9 25 rc su 25 correct 1 0 T 995 56 R
5140 2 b 9 25 rc su 25 correct 2 1 h 56 56 R
5141 2 b 9 25 rc su 25 correct 3 1 e 56 56 R
5142 2 b 9 25 rc su 25 correct 4 1 56 37 R
There is some clutter in there. The key columns are session_id, Sentence_ID, CI, and CI_Delta.
I then initialise a column called First_Pass to "F"
char.data$First_Pass <- "F"
I want to now calculate when First_Pass is actually "T" for each combination of session_id and Sentence_ID. I created a toy set, which is just one subset to work out the overall logic. Here's the code of a for loop that gives me just what I want for the toy data.
char.data.toy$First_Pass <- "F"
l <-c(200)
for (i in 1:nrow(char.data.toy)) {
if(char.data.toy[i,]$CI_Delta >= 0 & char.data.toy[i,]$CI %nin% l){
char.data.toy[i,]$First_Pass <- "T"
l <- c(l,char.data.toy[i,]$CI)}
}
I now want to take this loop and run it for every session_id and Sentence_ID subset. I've created a function called set_fp and then called it inside ddply. Here is that code:
#define function
set_fp <- function (df){
l <- 200
for (i in 1:nrow(df)) {
if(df[i,]$CI_Delta >= 0 & df[i,]$CI %nin% l){
df[i,]$First_Pass <- "T"
l <- c(l,df[i,]$CI)}
else df[i,]$First_Pass <- "F"
return(df)
}
}
char.data.fp <- ddply(char.data,c("session_id","Sentence_ID"),function(df)set_fp(df))
Unfortunately, this is not quite right. For a long time, I was getting all "F" values for First_Pass. Now I'm getting 24 T values, when it should be many more, so I suspect, it's only keeping the last subset or something similar. Help?
This is a little hard to test with only the four rows that you've provided. I created random data to see if it works and it seems to work for me. Try it on you data too.
This uses the data.table library and doesn't try to run loops inside a ddply. I'm assuming the means aren't important.
library(data.table)
dt <- data.table(df)
l <- c(200)
# subsetting to keep only the important fields
dt <- dt[,list(session_id, Sentence_ID, CI, CI_Delta)]
# Initialising First_Pass
dt[,First_Pass := 'F']
# The next two lines are basically rewording your logic -
# Within each group of session_id, Sentence_ID, identify the duplicate CI entries. These would have been inserted in l. The first time occurence of these CI entries is marked false as they wouldn't have been in l when that row was being checked
dt[CI_Delta >= 0,duplicatedCI := duplicated(CI), by = c("session_id", "Sentence_ID")]
# So if the CI value hasn't occurred before within the session_id,Sentence_ID group, and it doesn't appear in l, then mark it as "T"
dt[!(CI %in% l) & !(duplicatedCI), First_Pass := "T"]
# Just for curiosity's sake, calculating l too
l <- c(l,dt[duplicatedCI == FALSE,CI])
Related
As an exercise I was given two samples from a seed called u and v and asked to show how many values are in v but not in u fell into the bins [1,50] and [51,100]. Then I am asked to add a line of code in to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50))
But for the second part, i don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12
If we want to use comparative operators and sum, create a logical vector and get the sum of logical vector
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as in setdiff), then get the unique
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
-checking with the output of table
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out1, out2), check.attributes = FALSE)
#[1] TRUE
Since there is only one number that you are cutting the intervals in, you can verify your answer using > directly.
This is your code
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9
I have managed to do chisq-test using loop in R but it is very slow for a large data and I wonder if you could help me out doing it faster with something like dplyr? I've tried with dplyr but I ended up getting an error all the time which I am not sure about the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is to run chisq-test between each row of the df and cs. Then giving me the statistics and p.values as well as row names.
here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w)) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = rownames(df) )
for (i in 1:col(df)) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columsn to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)
I have a whole bunch of data.frames with irregular time spacing.
I would like to make a new data.frame and join the others to it, for each data.frame being joined picking the latest value out of the new data.frame.
For example, listOfDataFrames below contains a list of data.frames each of which has a time column in seconds. I find the total range, mod the range by 60 and seqn it by to obtain an increasing seqn of full minutes. Now I need to merge the list of data.frames to the left of this new seqn. e.g. if the value in mypoints is 60, the value joined to it should be the latest value <= 60.
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*do.call(seq,as.list(xrange%/%60))
I believe this is sometimes called an asof join.
Is there a simple procedure to do this?
Thanks
EDIT: this is what I currently use
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*seq(xrange[1]%/%60,1+xrange[2]%/%60)
result <- data.frame(Time=mypoints)
for(index in 1:length(listOfDataFrames))
{
x<-listOfDataFrames[[index]]
indices <- which(sort(c(mypoints,x$Time)) %in% mypoints) - 1:length(mypoints)
indices[indices==0] <- NA
newdf<-data.frame(new=x$Result[indices])
colnames(newdf)<-paste("S",index,sep="")
result <- cbind(result,newdf)
}
EDIT: full example
AsOfJoin <- function (listOfDataFrames) {
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*seq(xrange[1]%/%60,1+xrange[2]%/%60)
result <- data.frame(Time=mypoints)
for(index in 1:length(listOfDataFrames))
{
x<-listOfDataFrames[[index]]
indices <- which(sort(c(mypoints,x$Time)) %in% mypoints) - 1:length(mypoints)
indices[indices==0] <- NA
newdf<-data.frame(new=x$Result[indices])
colnames(newdf)<-paste("S",index,sep="")
result <- cbind(result,newdf)
}
result[is.na(result)]<-0
result
}
a<-data.frame(Time=c(28947.5,28949.6,29000),Result=c(10,15,9))
b<-data.frame(Time=c(28947.8,28949.5),Result=c(14,19))
listOfDataFrames <- list(a,b)
result<-AsOfJoin(listOfDataFrames)
> a
Time Result
1 28947.5 10
2 28949.6 15
3 29000.0 9
> b
Time Result
1 28947.8 14
2 28949.5 19
> result
Time S1 S2
1 28920 0 0
2 28980 15 19
3 29040 9 19
data.table provide very fast asof joins out of the box.
See also This post for an example
See my edit for answer. Apparently the best way.
Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (values from DF2$col3 should be associated to values in DF2col1), but that's where I'm stuck. I currently have
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this as #Aaron suggested in his comment above, but if you insist on writing your own loop, than you have the problem in your last line, as you assign g value to the whole col3 column. You should use the j index there also, like:
for(j in 1:length(DF2$col1))
{
DF2$col3[j] = DF1[(which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this would not work out, than please also post some sample database to be able to help in more details (as I do not know, but have a gues what could be "pred").
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command use recycling (it could be writtent as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17