Labeling contiguous chunks of observations without a for loop - r

I have a standard 'can-I-avoid-a-loop' problem, but cannot find a solution.
I answered this question by #splaisan but I had to resort to some ugly contortions in the middle section, with a for and multiple if tests. I simulate a simpler version here in the hope that someone can give a better answer...
THE PROBLEM
Given a data structure like this:
df <- read.table(text = 'type
a
a
a
b
b
c
c
c
c
d
e', header = TRUE)
I want to identify contiguous chunks of the same type and label them in groups. The first chunk should be labelled 0, the next 1, and so on. There is an indefinite number of chunks, and each chunk may be as short as only one member.
type label
a 0
a 0
a 0
b 1
b 1
c 2
c 2
c 2
c 2
d 3
e 4
MY SOLUTION
I had to resort to a for loop to do this, here is the code:
label <- 0
df$label <- label
# LOOP through the label column and increment the label
# whenever a new type is found
for (i in 2:length(df$type)) {
if (df$type[i-1] != df$type[i]) { label <- label + 1 }
df$label[i] <- label
}
MY QUESTION
Can anyone do this without the loop and conditionals?

Using rle
r <- rle(as.numeric(df$type))
df$label <- rep(seq(from=0, length=length(r$lengths)), times=r$lengths)
Not using rle, but cumsum over logicals that are coerced to numeric.
df$label <- c(0,cumsum(df$type[-1] != df$type[-length(df$type)]))
Both give:
> df
type label
1 a 0
2 a 0
3 a 0
4 b 1
5 b 1
6 c 2
7 c 2
8 c 2
9 c 2
10 d 3
11 e 4

My crack at it:
as.numeric(df[, 1])-1

This just occurred to me as well, you can simply convert to a factor, then back to integers and subtract one:
as.integer(as.factor(df$type))-1
If type is already a factor, you can skip that step.

Related

How to get the sum shared values of all the randomly picked two columns in a dataframe

I'm quite new to R, so please forgive me. I even don't know how to ask this question...The purpose of this question is to figure out which two or three factors shared most.
I have a dataframe like this:
mydata<-read.table(header=TRUE, text="
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")
A,B,C and D are four factors. Hopefully this table can be displayed well in your R.
I want to figure out the number of shared value (but not 0) between every two columns. I'm expecting results will be displayed like below:
myresuts<-read.table(header=TRUE, text = "
factor_1 factor_2 number_of_shared
A B 2
A C 2
A D 3
B C 1
B D 1
C D 2")
For this small table, I can do the intersection manually. But in fact I have a quite big table with more than 100 columns to do such calculation. I wonder how to write a function to solve this problem.
Also, if I want to figure out the sum of shared values in every three column (hopefully this can be solved in the same way).
Thanks!
A useful function for calculating combinations and permutations can be found in the gtools library.
library(gtools)
cbn <- data.frame(combinations(ncol(mydata),2,names(mydata)))
cbn$num_shared = apply(cbn, 1, function(i) sum(mydata[,i[1]] == mydata[,i[2]]))
cbn
X1 X2 num_shared
1 A B 2
2 A C 3
3 A D 4
4 B C 4
5 B D 3
6 C D 4
If you do not want to compare zeroes, convert them to NA using mydata[mydata == 0] <- NA and place na.rm = T inside the sum.
Your desired results suggest that you don't want to count zero values in the comparison. I'm doing this by converting zeros to NA first (I also convert to character so we can compare columns with non-overlapping values).
mydata <- lapply(mydata,
function(x) {
x[x==0] <- NA
as.character(x)
})
cc <- combn(names(mydata),2,
FUN=function(x) {
data.frame(matrix(x,nrow=1),
val=sum(mydata[[x[1]]]==mydata[[x[2]]],na.rm=TRUE))
},
simplify=FALSE)
do.call(rbind,cc)
This should work for 3 columns if you change the condition in the function appropriately ...

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the total in a table (table = column 1: A, B, C, D column 2: sum of column A, B, C, D) - something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
in order to apply the function only on numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))

R: operate over subset of columns in data.table

I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my klunky histogram function:
histo<-function(vec){
foo<-c("a"=0,"b"=0,"c"=0,"d"=0)
for(i in vec){foo[i]=foo[i]+1}
return(foo)}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
pseduocode of desired function and output
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
list(
sapply(letters[1:4], function(d) {
b <- sum(!is.na(grep(d,x)))
assign(d, b)
}))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,2,0
2: d a a b a 3,1,1,1
3: a a a c c 3,0,3,0
> df$hist
[[1]]
a b c d
3 1 2 0
[[2]]
a b c d
3 1 1 1
[[3]]
a b c d
3 0 3 0
NOTE: Answer has been updated to to OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
B=c("b","a","a"),
C=c("c","a","a"),
D=c("a","b","c"),
E=c("a","a","c"))
fun <- function(vec, char) {
sum(vec==char)
}
DT[, Vec_Nr:= paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
by=1:nrow(DT),
.SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences for one character. To see how
that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function because you don't want to know that for only one character "b", but for many. To pass a vector of arguments to a function,
use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberatly used the letters and LETTERS here to show you how make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this iterates over columns. This works because .SD is a special variable holding the subset of data associated with the call to by. In this case, that subset is the data.table consisting of one of the rows. histo2(DT[1]) works the same way.
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0

Rep values from a data frame to another data frame. apply? sapply?

I have the following data frame
data<-data.frame(ID=c("a", "b", "c", "d"), zeros=c(3,2,5,4), ones=c(1,1,2,1))
ID zeros ones
1 a 3 1
2 b 2 1
3 c 5 2
4 d 4 1
and I wish to create another data frame with 2 columns:
First column(id) the ID is repeated (zero+ones) times
Second column value should be the c(rep(0, zeros), rep(1, ones))
so that the result would be
id value
1 a 0
2 a 0
3 a 0
4 a 1
5 b 0
6 b 0
7 b 1
8 c 0
9 c 0
10 c 0
11 c 0
12 c 0
13 c 1
14 c 1
15 d 0
16 d 0
17 d 0
18 d 0
19 d 1
I tried data.frame(id=(rep(data$ID, (data$zeros+data$ones))), value=c(rep(0, data$zeros), rep(1, data$ones))) but doesnt work. Any ideas? Thank you in advance
This is perhaps overkill, using ddply from the plyr package, but it's the first thing that came to me:
ddply(dat,.(ID),function(x){data.frame(value = rep(c(0,1),times = c(x$zeros,x$ones)))})
Oh and I changed the name of your data frame to dat to avoid a bad habit (data is the name of an oft used function).
Here's a base R solution. I prefer the overkill of plyr myself:
dat <- data.frame(ID = letters[1:4], zeros = c(3,2,5,4), ones = c(1,1,2,1))
do.call("rbind"
, apply(dat, 1, function(x)
data.frame(cbind(id = x[1], value = rep(0:1, times = x[2:3])))
)
)
Since you've already got a base R solution for the first column, this is one for your second column:
lengths<-as.vector(t(as.matrix(data[,2:3]))) #notice the t
what<-rep(c(0,1), nrow(data))
times<-rep(what, lengths)
Edit: changed a minor thing above and tested it. It works now.
I also prefer the plyr method, but I thought I'd throw another base R solution related to reshaping the data first, and then replicating it. (also using dat instead of data):
names(dat)[2:3] <- c("times.0", "times.1")
tmp <- reshape(dat, varying=2:3, direction="long")
tmp <- tmp[rep(seq(length=nrow(tmp)),tmp$times),c("ID","time")]
names(tmp) <- c("id","value")
tmp <- tmp[order(tmp$id, tmp$value),]
rownames(tmp) <- NULL
Not as elegant as some of the other base solutions because it requires intermediate storage, but possibly interesting.

Multiply various subsets of a data frame by different vectors

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?
Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
Here's a non-vectorized, but easy to understand solution:
replaceFunction <- function(v){
m <- as.numeric(v[3:5])
if (v[6]=="A")
out <- m * c(1,2,3)
else if (v[6]=="B")
out <- m * c(4,5,6)
else
out <- m
return(out)
}
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df
Edited to reflect some notes from the comments
Assuming that Condition is a factor, you could do this:
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that every subsetting you can do with subset you can also do with [, and crucially, [ support assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by vector multiplication
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=T))

Resources