Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this, but I'm going round in circles. This is on a similar note to my previous, well-answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (DOV) was less than 180 days after their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs have their first visit within 180 days of their diagnosis. I could then go through my original test data frame and make an indicator variable from it, but I'd have thought I should be able to do this in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!

After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
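For reference, a minimal self-contained sketch of the same idea, assuming the date columns are built with as.Date, and using ave to spread the per-ID result back over every row. The names gap_days and within180 are made up for illustration. Like the answer above, it tests the smallest gap per ID, which equals the first visit only when visits are in date order.

```r
# Question's data, with the date columns created as Date objects
test <- data.frame(
  ID  = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = as.Date(c(rep("2000-03-01", 3), rep("2002-05-01", 4),
                  rep("2006-09-01", 2))),
  DOV = as.Date(c("2000-03-05", "2000-06-05", "2000-09-05",
                  "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
                  "2006-10-03", "2007-02-05"))
)

# Date - Date gives a difftime in days; convert to plain numbers
gap_days <- as.numeric(test$DOV - test$DOD)

# ave() applies the function within each ID and recycles the single
# result across all rows of that ID, so the output has one value per row
test$within180 <- ave(gap_days, test$ID,
                      FUN = function(g) as.numeric(min(g) < 180))
test$within180
```

Because ave returns a vector as long as its input, no separate lookup table or merge step is needed, and the rows do not have to be sorted by ID.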

I added stringsAsFactors=FALSE to the data.frame call:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
Code:
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)

Related

How to run Chisq test for multiple rows FASTER in R?

I have managed to run a chi-squared test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried with dplyr but I keep getting an error, and I am not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chi-squared test between each row of df and cs, then collect the statistics and p-values along with the row names.
Here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.

I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = colnames(df) ) # one row per column tested
for (i in seq_len(ncol(df))) { # ncol() gives the column count; col() returns a matrix of indices
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columns to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)
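A hedged sketch of collecting the results without filling an object row by row: each column yields a one-row data frame and the pieces are bound together at the end. The toy data and the names toy, ref and res are invented for illustration and stand in for the real expression data. suppressWarnings only hides the small-expected-count warning that this toy data triggers.

```r
# Toy stand-ins: two columns to test against a reference grouping
toy <- data.frame(a = rep(c(1, 2), 15), b = rep(c(1, 2, 3), 10))
ref <- rep(c(1, 1, 2, 2, 3, 3), 5)

# One small data frame per column, so numeric and character values
# keep their own types instead of being coerced inside a matrix
res <- do.call(rbind, lapply(names(toy), function(nm) {
  tst <- suppressWarnings(chisq.test(toy[[nm]], ref))
  data.frame(column = nm,
             p_val  = tst$p.value,
             stat   = unname(tst$statistic))
}))
res
```

This still calls chisq.test once per column, so it is tidier rather than fundamentally faster.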

trouble setting up iteration on multiple data.frames in r

I am having a recurring issue of performing specific tasks on multiple data.frames. Here is my working example data.frame, which was imported from text files.
cellID X Y Area AVGFP DeviationGFP AvgRFP DeviationsRFP Slice GUI.ID
1 1 18.20775 26.309859 568 5.389085 7.803248 12.13028 5.569880 0 1
2 2 39.78755 9.505495 546 5.260073 6.638375 17.44505 17.220153 0 1
3 3 30.50000 28.250000 4 6.000000 4.000000 8.50000 1.914854 0 1
4 4 38.20233 132.338521 257 3.206226 5.124264 14.04669 4.318130 0 1
5 5 43.22467 35.092511 454 6.744493 9.028574 11.49119 5.186897 0 1
6 6 57.06534 130.355114 352 3.781250 5.713022 20.96591 14.303546 0 1
7 7 86.81765 15.123529 1020 6.043137 8.022179 16.36471 19.194279 0 1
8 8 75.81932 132.146417 321 3.666667 5.852172 99.47040 55.234726 0 1
9 9 110.54277 36.339233 678 4.159292 6.689660 12.65782 4.264624 0 1
10 10 127.83480 11.384886 569 4.637961 6.992881 11.39192 4.287963 0 1
As in previous questions I have posted, there are 40 of these data.frames named slice1...slice40.
What I want to do is add a new column to each of these data.frames that contains the product of AVGFP and Area. I can perform this on one data.frame easily by using
stats[[1]]$totalGFP <- stats[[1]]$AVGFP * stats[[1]]$Area
I am stuck trying to apply this command to every data.frame in stats
I appreciate any and all help. To help moving forward when you post a solution can you please describe the details of the commands used to help me follow along, thank you!

Like this:
stats <- lapply(stats, transform, totalGFP = AVGFP * Area)
I'll do my best to explain but please refer to ?lapply and ?transform for the full docs.
transform is a function to add columns to a data.frame, according to formulas of the type totalGFP = AVGFP * Area passed as arguments. For example, to add the totalGFP column to your first data.frame, you could run transform(stats[[1]], totalGFP = AVGFP * Area).
lapply applies a function (here transform) to each element of a list or a vector (here stats), and returns a list. If the function to be applied requires more arguments, they can be passed at the end of the lapply call, here totalGFP = AVGFP * Area. So here lapply is an elegant way of running transform on each element of stats.
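To try the one-liner without the real data, here is a small sketch on a made-up list. The element names and values are invented; only the column names AVGFP and Area follow the question.

```r
# A toy version of `stats`: a list of data frames with the same columns
stats <- list(
  slice1 = data.frame(AVGFP = c(5.4, 5.3), Area = c(568, 546)),
  slice2 = data.frame(AVGFP = c(6.0, 3.2), Area = c(4, 257))
)

# transform() evaluates AVGFP * Area inside each data frame in turn
stats <- lapply(stats, transform, totalGFP = AVGFP * Area)

str(stats)
```

The result is again a list, so stats[[1]]$totalGFP or stats$slice1$totalGFP gives the new column for the first data frame.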

Given that you wrote "please describe the details of the commands", try this simple example:
# create two small data frames
df1 <- data.frame(AVGFP = 1:3, Area = 4:6)
df2 <- data.frame(AVGFP = 7:9, Area = 1:3)
# create a list with named objects: the two data frames.
# ?list: "The arguments to list [...] of the form [...] tag = value"
ll <- list(df1 = df1, df2 = df2)
str(ll)
# apply a function on each element in the list
# each element is a single data frame
# Use an 'anonymous function', function(x), where 'x' corresponds to each single data frame
# The function does this:
# (1) calculate the new variable 'total', and (2) add it to the data frame
ll2 <- lapply(X = ll, FUN = function(x){
total <- x$AVGFP * x$Area
data.frame(x, total) # the last expression is the function's return value
})
# check ll2
str(ll2)

Add a column of ranks

I have some data:
test <- data.frame(A=c("aaabbb",
"aaaabb",
"aaaabb",
"aaaaab",
"bbbaaa")
)
and so on. All the elements are the same length, and are already sorted before I get them.
I need to make a new column of ranks, "First", "Second", "Third", anything after that can be left blank, and it needs to account for ties. So in the above case, I'd like to get the following output:
A B
aaabbb First
aaaabb Second
aaaabb Second
aaaaab Third
bbbaaa
bbbbaa
I looked at rank() and some other posts that used it, but I wasn't able to get it to do what I was looking for.

How about this:
test$B <- match(test$A , unique(test$A)[1:3] )
test
A B
1 aaabbb 1
2 aaaabb 2
3 aaaabb 2
4 aaaaab 3
5 bbbaaa NA
6 bbbbaa NA
One of many ways to do this. Possibly not the best, but one that readily springs to mind and is fairly intuitive. You can use unique because you receive the data pre-sorted.
As the data is sorted, another suitable function worth considering is rle, although it's slightly more obscure in this example:
rnk <- rle(as.character(test$A))$lengths # as.character works whether A is a factor or a character column
rnk
# [1] 1 2 1 1 1
test$B <- c( rep( 1:3 , times = rnk[1:3] ) , rep(NA, sum( rnk[-c(1:3)] ) ) )
rle computes the lengths (and values which we don't really care about here) of runs of equal values in a vector - so again this works because your data are already sorted.
And if you don't have to have blanks after the third ranked item it's even simpler (and more readable):
test$B <- rep(1:length(rnk),times=rnk)

This seems like a good application for factors:
test$B <- as.numeric(factor(test$A, levels = unique(test$A)))
cumsum also comes to mind, where we add 1 every time the value changes:
test$B <- cumsum(c(TRUE, tail(test$A, -1) != head(test$A, -1)))
(Like @Simon said, there are many ways to do this...)
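The answers above produce numeric ranks. If the text labels from the question are needed, one possible sketch is to index a label vector with the rank and blank out everything past the third. The names rank_labels and idx are made up, and the sixth row is taken from the question's expected output.

```r
test <- data.frame(
  A = c("aaabbb", "aaaabb", "aaaabb", "aaaaab", "bbbaaa", "bbbbaa"),
  stringsAsFactors = FALSE
)

rank_labels <- c("First", "Second", "Third")

# Position of each value among the distinct values, in order of appearance;
# ties share a position because the data arrive pre-sorted
idx <- match(test$A, unique(test$A))

# Ranks beyond the third get an empty string instead of a label
test$B <- ifelse(idx <= length(rank_labels), rank_labels[idx], "")
test
```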

Syntax (and/or functions) for applying an op over elements of one vector, using as arg elements of a 2nd vector

I am trying to find the right expression for creating a result vector by applying an operation over a vector, using, in a vectorised way, elements of a 2nd vector. The use case is that I have a vector of raw values and a vector of breakpoints. What I want is an expression that, for each breakpoint, sums a logical comparison of that breakpoint against the values in the values vector. In other words:
Given:
rawfoo <- c(30, 4, 22, 77, 1,169, 10)
breaksfoo <- c(10,50, 80)
resultfoo <- data.frame(breaks=breaksfoo, matching=numeric(length(breaksfoo)))
I want to write a single expression that delivers the column values for resultfoo$matching, which is: for each value in breaksfoo, sum(rawfoo > breaksfoo[i]),
resultfoo
breaks nmatching
1 10 3
2 50 2
3 80 1
I have been trying various forms of apply and having problems with how to express the function. Perhaps I am barking up the wrong tree? Can supply multiple demonstration of failure if required. (But my guess is that this question is so simple it doesn't need error messages to disambiguate it ;-)

You can do it in three steps:
Write a function that, given a break, returns a list of two elements: the break itself and the result of sum(break > rawfoo).
Then you can use sapply to apply this function to breaksfoo.
Finally, you need to transform the result of sapply, which is a matrix, to get the data frame you need.
The following code does all of these three steps in one statement:
as.data.frame(t(sapply(breaksfoo,
function(x) list(breaks = x, nmatching = sum(x > rawfoo)))))
returns
breaks nmatching
1 10 2
2 50 5
3 80 6
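Note that the code above counts values below each break. For the comparison exactly as written in the question, sum(rawfoo > breaksfoo[i]), a minimal sketch with vapply or outer could look like this (the names counts_vapply and counts_outer are made up):

```r
rawfoo    <- c(30, 4, 22, 77, 1, 169, 10)
breaksfoo <- c(10, 50, 80)

# One count per break: how many raw values are strictly greater than it
counts_vapply <- vapply(breaksfoo, function(b) sum(rawfoo > b), numeric(1))

# Same counts via a logical matrix: rows are raw values, columns are breaks
counts_outer <- colSums(outer(rawfoo, breaksfoo, ">"))

resultfoo <- data.frame(breaks = breaksfoo, nmatching = counts_vapply)
resultfoo
```

Both give 4, 2 and 1 for breaks 10, 50 and 80. That differs from the 3, 2, 1 shown in the question, which looks like counts per interval rather than counts above each break, so it is worth checking which of the two is actually wanted.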

Combining findInterval with table might get you what you're looking for.
#finds which interval rawfoo is in
x <- findInterval(rawfoo,breaksfoo)
#[1] 1 0 1 2 0 3 1
#tabulates the information
table(x)
#0 1 2 3
#2 3 1 1
#cuts off the last element
head(table(x),-1)
#0 1 2
#2 3 1
resultfoo$nmatching <- head(table(x),-1)
This is almost what you want, except that 10 is being placed in the second bucket because findInterval's intervals are inclusive on the lower end, while your example puts it in the first bucket because you want a strict inequality. You can add a corrective vector that will reassign to the right bucket:
y <- table(rawfoo)[as.character(breaksfoo)]
y[is.na(y)] <- 0
y <- y - c(0,head(y,-1))
resultfoo$nmatching <- resultfoo$nmatching + y
To make this easier to do, you can wrap it into a function.
fnfoo <- function(raw,breaks) {
# use the function's own arguments rather than the global rawfoo/breaksfoo
x <- head(table(findInterval(raw,breaks)),-1)
y <- table(raw)[as.character(breaks)]
y[is.na(y)] <- 0
x + y - c(0,head(y,-1))
}
resultfoo$nmatching <- fnfoo(rawfoo,breaksfoo)
EDIT: I was browsing another question and realized that cut works better here.
data.frame(table(cut(rawfoo,c(-Inf,breaksfoo),right=TRUE)))
# Var1 Freq
# 1 (-Inf,10] 3
# 2 (10,50] 2
# 3 (50,80] 1

R: Stacking Multiple Punch Question Data

Suppose we have 2 questions in a survey, one is about how likely an individual is to recommend a company (let's say there's 2 companies for simplicity).
So, I have one data.frame with 2 columns for this question:
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")
And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.
So, I have another data.frame with 4 columns for this question:
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"Attribute1.Company1",
"Attribute2.Company1",
"Attribute1.Company2",
"Attribute2.Company2")
Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelihood to recommend question, for all companies (company independent). Just to get an idea of what inertia lies between those people that are highly likely to recommend and attribute 1, for example.
So, I start off by binding the two questions together:
df <- cbind(df.recommend, df.attribute)
My problem is trying to figure out how to stack these data such that the columns look something like:
df.stacked <- data.frame(c(df$Company1,df$Company2),
c(df$Attribute1.Company1,df$Attribute1.Company2),
c(df$Attribute2.Company1,df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood","Attribute1","Attribute2")
This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.
Could you think of a way to stack them effectively, without having to type out all the c() statements?
Note: The column pattern for likelihood is Co1,Co2,Co3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...

For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).
Here is what I believe to be a complete solution in about a dozen lines of code.
First, create some data
library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)
# Create data frame
# Important: note the addition of a respondent id column
df_comp <- data.frame(
RespID = 1:10,
Company1 = rep(1:5, 2),
Company2 = rep(1:5, 2)
)
df_attr <- data.frame(
RespID = 1:10,
Attribute1.Company1 = rep(0:1,5),
Attribute2.Company1 = rep(1:0,5),
Attribute1.Company2 = rep(0:1,5),
Attribute2.Company2 = rep(1:0,5)
)
Now start the data manipulation:
# Use melt to convert data from wide to tall
melt_comp <- melt(df_comp, id.vars="RespID")
melt_comp <- rename(melt_comp, c(variable="comp", value="likelihood"))
melt_attr <- melt(df_attr, id.vars="RespID")
# Use str_split to split attribute variables into attribute and company
# "." period needs to be escaped
# EDIT 2: reshape2::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names=c("attr", "comp"))
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL
# Use dcast (reshape2's data-frame version of cast) to convert from tall to somewhat tall
cast_attr <- dcast(melt_attr, RespID + comp ~ attr, mean)
# Combine data frames using join() in package plyr
df <- join(melt_comp, cast_attr)
head(df)
And the output:
RespID comp likelihood Attribute1 Attribute2
1 1 Company1 1 0 1
2 2 Company1 2 1 0
3 3 Company1 3 0 1
4 4 Company1 4 1 0
5 5 Company1 5 0 1
6 6 Company1 1 1 0

Something I quickly cooked up. Doesn't look the best and uses a for-loop but that shouldn't be a problem with only 24 values
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"At1.Co1",
"At2.Co1",
"At1.Co2",
"At2.Co2")
df.stacked <- data.frame(
likelihood = unlist(df.recommend)
)
str <- strsplit(names(df.attribute),split="\\.")
atts <- unique(sapply(str,function(x)x[1]))
for (i in 1:length(atts))
{
df.stacked[,i+1] <- unlist(df.attribute[sapply(str,function(x)x[1]==atts[i])])
}
names(df.stacked) <- c("likelihood",paste("attribute",1:length(atts),sep=""))
EDIT: It assumes that companies are in the same order for each attribute
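The same stacking can be sketched in base R without an explicit loop by grouping the attribute columns on the part of the name before the dot. This assumes the At<n>.Co<m> naming from the question's note and, like the loop above, that companies appear in the same order for every attribute. The names att_of and stacked_attrs are made up.

```r
df.recommend <- data.frame(Co1 = rep(1:5, 20), Co2 = rep(1:5, 20))
df.attribute <- data.frame(At1.Co1 = rep(0:1, 50), At2.Co1 = rep(1:0, 50),
                           At1.Co2 = rep(0:1, 50), At2.Co2 = rep(1:0, 50))

# Attribute label of each column: everything before the first dot
att_of <- sub("\\..*$", "", names(df.attribute))

# For each attribute, stack its columns (one per company) end to end
stacked_attrs <- lapply(split(names(df.attribute), att_of),
                        function(cols) unlist(df.attribute[cols],
                                              use.names = FALSE))

df.stacked <- data.frame(likelihood = unlist(df.recommend, use.names = FALSE),
                         stacked_attrs)
str(df.stacked)
```

split orders the groups alphabetically, so with many attributes the columns come out as At1, At10, At11, ... rather than in numeric order; the values within each column are unaffected.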
