Rep values from a data frame to another data frame. apply? sapply?

I have the following data frame
data<-data.frame(ID=c("a", "b", "c", "d"), zeros=c(3,2,5,4), ones=c(1,1,2,1))
ID zeros ones
1 a 3 1
2 b 2 1
3 c 5 2
4 d 4 1
and I wish to create another data frame with 2 columns:
First column (id): each ID repeated (zeros + ones) times
Second column (value): c(rep(0, zeros), rep(1, ones)) for each row
so that the result would be
id value
1 a 0
2 a 0
3 a 0
4 a 1
5 b 0
6 b 0
7 b 1
8 c 0
9 c 0
10 c 0
11 c 0
12 c 0
13 c 1
14 c 1
15 d 0
16 d 0
17 d 0
18 d 0
19 d 1
I tried data.frame(id=(rep(data$ID, (data$zeros+data$ones))), value=c(rep(0, data$zeros), rep(1, data$ones))) but it doesn't work. Any ideas? Thank you in advance.

This is perhaps overkill, using ddply from the plyr package, but it's the first thing that came to me:
library(plyr)
ddply(dat,.(ID),function(x){data.frame(value = rep(c(0,1),times = c(x$zeros,x$ones)))})
Oh and I changed the name of your data frame to dat to avoid a bad habit (data is the name of an oft used function).

Here's a base R solution. I prefer the overkill of plyr myself:
dat <- data.frame(ID = letters[1:4], zeros = c(3,2,5,4), ones = c(1,1,2,1))
do.call("rbind"
, apply(dat, 1, function(x)
data.frame(cbind(id = x[1], value = rep(0:1, times = x[2:3])))
)
)

Since you've already got a base R solution for the first column, this is one for your second column:
lengths<-as.vector(t(as.matrix(data[,2:3]))) #notice the t
what<-rep(c(0,1), nrow(data))
times<-rep(what, lengths)
Edit: changed a minor thing above and tested it. It works now.
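Putting both columns together, here is a minimal base R sketch. It assumes the data frame is called dat, as in the other answers; counts and result are names made up for illustration. The original attempt fails because rep(0, data$zeros) passes a length-4 times vector for a length-1 x, which rep() rejects. Interleaving the counts row by row avoids that.

```r
dat <- data.frame(ID = c("a", "b", "c", "d"),
                  zeros = c(3, 2, 5, 4),
                  ones = c(1, 1, 2, 1))

# Interleave the counts row by row: zeros[1], ones[1], zeros[2], ones[2], ...
counts <- c(rbind(dat$zeros, dat$ones))

result <- data.frame(
  # Each ID appears once per zero and once per one
  id = rep(dat$ID, times = dat$zeros + dat$ones),
  # The pattern 0, 1, 0, 1, ... expanded by the interleaved counts
  value = rep(rep(0:1, times = nrow(dat)), times = counts)
)
result
```

rbind() stacks the two count vectors as rows, and c() then reads the matrix column by column, which is what produces the interleaving.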

I also prefer the plyr method, but I thought I'd throw another base R solution related to reshaping the data first, and then replicating it. (also using dat instead of data):
names(dat)[2:3] <- c("times.0", "times.1")
tmp <- reshape(dat, varying=2:3, direction="long")
tmp <- tmp[rep(seq(length=nrow(tmp)),tmp$times),c("ID","time")]
names(tmp) <- c("id","value")
tmp <- tmp[order(tmp$id, tmp$value),]
rownames(tmp) <- NULL
Not as elegant as some of the other base solutions because it requires intermediate storage, but possibly interesting.

Related

How to get the sum shared values of all the randomly picked two columns in a dataframe

I'm quite new to R, so please forgive me; I'm not even sure how to phrase this question. The goal is to figure out which two or three factors share the most values.
I have a dataframe like this:
mydata<-read.table(header=TRUE, text="
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")
A, B, C and D are four factors. Hopefully this table displays well in your R session.
I want to count the shared values (ignoring 0) between every pair of columns. I expect the results to look like this:
myresults<-read.table(header=TRUE, text = "
factor_1 factor_2 number_of_shared
A B 2
A C 2
A D 3
B C 1
B D 1
C D 2")
For this small table, I can do the intersections manually. But my real table has more than 100 columns, so I wonder how to write a function to solve this problem.
I would also like to count the shared values for every set of three columns (hopefully this can be solved in the same way).
Thanks!
A useful function for calculating combinations and permutations can be found in the gtools library.
library(gtools)
cbn <- data.frame(combinations(ncol(mydata),2,names(mydata)))
cbn$num_shared = apply(cbn, 1, function(i) sum(mydata[,i[1]] == mydata[,i[2]]))
cbn
X1 X2 num_shared
1 A B 2
2 A C 3
3 A D 4
4 B C 4
5 B D 3
6 C D 4
If you do not want to compare zeroes, convert them to NA using mydata[mydata == 0] <- NA and add na.rm = TRUE to the sum() call.
Your desired results suggest that you don't want to count zero values in the comparison. I'm doing this by converting zeros to NA first (I also convert to character so we can compare columns with non-overlapping values).
mydata <- lapply(mydata,
function(x) {
x[x==0] <- NA
as.character(x)
})
cc <- combn(names(mydata),2,
FUN=function(x) {
data.frame(matrix(x,nrow=1),
val=sum(mydata[[x[1]]]==mydata[[x[2]]],na.rm=TRUE))
},
simplify=FALSE)
do.call(rbind,cc)
This should work for 3 columns if you change the condition in the function appropriately ...
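One way the three-column condition could look, as a sketch rather than the answerer's own code: shared_counts is a hypothetical helper that counts rows where all k chosen columns hold the same non-zero value. It assumes the columns are read as character (the default in R 4.0 and later, forced here with stringsAsFactors = FALSE).

```r
mydata <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")

# Hypothetical helper: rows shared by every set of k columns
shared_counts <- function(df, k) {
  df[df == "0"] <- NA                    # zeros never count as shared
  combos <- combn(names(df), k)          # one column of names per combination
  counts <- apply(combos, 2, function(cols) {
    first <- df[[cols[1]]]
    # TRUE only where every other column equals the first one
    same <- Reduce(`&`, lapply(cols[-1], function(cn) df[[cn]] == first))
    sum(same, na.rm = TRUE)
  })
  data.frame(t(combos), number_of_shared = counts)
}

pairs <- shared_counts(mydata, 2)
triples <- shared_counts(mydata, 3)
```

With 100 columns there are 4950 pairs but 161700 triples, so the three-column version will be noticeably slower.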

Perform ifelse() on every element of a data frame, but different test for each column in R

I've got a large data frame [4000,600] and I'd like to convert elements to 0 if they are more than three orders of magnitude smaller than their column maximum. That is, each element is compared to the maximum of its column: if element < 0.001 * column_max it becomes 0, otherwise it stays the same.
I am having a tough time getting apply() to let me use an ifelse() function. Is there a better approach or function I am missing? I'm fairly new to R.
Use lapply to loop over each column with a replace call:
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
dat
# a b
#1 1 3
#2 2 4
#3 1001 3003
dat[] <- lapply(dat, function(x) replace(x, x < max(x)/10^3, 0) )
dat
# a b
#1 0 0
#2 2 4
#3 1001 3003
This should work with ifelse if you use apply column-wise:
df <- data.frame(a = c(1:10, 4000), b = c(4:13, 7000))
apply(df, 2, function(x){ifelse(x < 0.001*max(x), 0, x)})
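One thing to be aware of with the apply() version: it returns a matrix, not a data frame. A small sketch of the difference, reusing the same example data; assigning into df[] via lapply() is one way to keep the data frame structure (res_matrix is just an illustrative name).

```r
df <- data.frame(a = c(1:10, 4000), b = c(4:13, 7000))

# apply() coerces the data frame to a matrix and returns a matrix
res_matrix <- apply(df, 2, function(x) ifelse(x < 0.001 * max(x), 0, x))
is.matrix(res_matrix)

# Assigning into df[] keeps df a data frame while replacing every column
df[] <- lapply(df, function(x) ifelse(x < 0.001 * max(x), 0, x))
is.data.frame(df)
```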
We could do this without using ifelse
library(dplyr)
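# mutate_each() and funs() are deprecated in current dplyr; across() replaces them
dat %>%
  mutate(across(everything(), ~ (.x >= 0.001 * max(.x)) * .x))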
# a b
#1 0 0
#2 2 4
#3 1001 3003
data
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))

Labeling contiguous chunks of observations without a for loop

I have a standard 'can-I-avoid-a-loop' problem, but cannot find a solution.
I answered this question by #splaisan but I had to resort to some ugly contortions in the middle section, with a for and multiple if tests. I simulate a simpler version here in the hope that someone can give a better answer...
THE PROBLEM
Given a data structure like this:
df <- read.table(text = 'type
a
a
a
b
b
c
c
c
c
d
e', header = TRUE)
I want to identify contiguous chunks of the same type and label them in groups. The first chunk should be labelled 0, the next 1, and so on. There is an indefinite number of chunks, and each chunk may be as short as only one member.
type label
a 0
a 0
a 0
b 1
b 1
c 2
c 2
c 2
c 2
d 3
e 4
MY SOLUTION
I had to resort to a for loop to do this, here is the code:
label <- 0
df$label <- label
# LOOP through the label column and increment the label
# whenever a new type is found
for (i in 2:length(df$type)) {
if (df$type[i-1] != df$type[i]) { label <- label + 1 }
df$label[i] <- label
}
MY QUESTION
Can anyone do this without the loop and conditionals?
Using rle
r <- rle(as.character(df$type)) # as.character works whether type is a factor or a character column
df$label <- rep(seq(from=0, length=length(r$lengths)), times=r$lengths)
Not using rle, but cumsum over logicals that are coerced to numeric.
df$label <- c(0,cumsum(df$type[-1] != df$type[-length(df$type)]))
Both give:
> df
type label
1 a 0
2 a 0
3 a 0
4 b 1
5 b 1
6 c 2
7 c 2
8 c 2
9 c 2
10 d 3
11 e 4
My crack at it:
as.numeric(df[, 1])-1
This just occurred to me as well, you can simply convert to a factor, then back to integers and subtract one:
as.integer(as.factor(df$type))-1
If type is already a factor, you can skip that step.
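A caveat worth checking before relying on the factor-code shortcut: factor codes identify distinct values, not contiguous runs, so the two approaches only agree when no type reappears later in the data. A small sketch with a made-up vector in which "a" comes back after "b":

```r
# Hypothetical input: "a" appears in two separate chunks
type <- c("a", "a", "b", "a", "c", "c")

# Run-based labels: every new chunk gets a new label
runs <- rle(type)
label_rle <- rep(seq_along(runs$lengths) - 1, times = runs$lengths)
label_cumsum <- c(0, cumsum(type[-1] != type[-length(type)]))

# Factor-code labels: both "a" chunks get the same label
label_factor <- as.integer(as.factor(type)) - 1

rbind(label_rle, label_cumsum, label_factor)
```

For the example in the question the types never repeat and happen to be in alphabetical order, which is why all of the answers agree there.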

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a data frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply() call instead of calling replace() column by column, because the real data has many variables that need replacing. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. Is there a way to make this work by passing a vector of column names from a data.frame into a generic apply / plyr function (or maybe some completely different way)? Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.
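For completeness, a sketch of the same idea with lapply() and replace(), which is close to what the question was attempting. The within()/sapply() attempt does nothing because sapply(var, ...) iterates over the strings "b" and "c" rather than the columns, and the assignment inside the anonymous function is local to it. The b column is fixed here (instead of sampled) so the result is reproducible.

```r
df <- data.frame(a = 1:5, b = c(4, 3, 5, 2, 1), c = 5:1)
var <- c("b", "c")

# Index the data frame by name so lapply() sees the columns themselves,
# then assign the modified columns back into the same positions
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))
df
```

Column a is left untouched because it is not listed in var, even though it also contains a 5.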

Multiply various subsets of a data frame by different vectors

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?
Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
Here's a non-vectorized, but easy to understand solution:
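replaceFunction <- function(v){
# apply() passes each row as a character vector, so convert the values back
m <- as.numeric(v[3:5])
if (v[6]=="A")
out <- m * c(1,2,3)
else if (v[6]=="B")
out <- m * c(4,5,6)
else if (v[6]=="C")
out <- m * c(0,1,0)
else
out <- m # any other condition: leave the values unchanged
return(out)
}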
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df
Edited to reflect some notes from the comments
Assuming that Condition is a factor, you could do this:
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that every subsetting you can do with subset you can also do with [, and crucially, [ supports assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
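The lookup-matrix answer above indexes m by the factor's integer codes, so it depends on Condition being a factor with levels in the order A, B, C. A variant that does not rely on that, sketched here with named rows (an adaptation for illustration, not the answerer's code), looks rows up by name and so also works when Condition is a character column, which is the default in R 4.0 and later:

```r
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species = rep(1:4, each = 2),
  Value1 = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2 = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3 = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C")
)

# One row of multipliers per condition, looked up by row name
m <- rbind(A = c(1, 2, 3),
           B = c(4, 5, 6),
           C = c(0, 1, 0))

# m[...] expands to one multiplier row per row of df
df[, 3:5] <- df[, 3:5] * m[as.character(df$Condition), ]
df
```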
# Note: both versions below cover only conditions A and B (any other condition is treated like A)
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by vector multiplication
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=TRUE))
