My question is very simple. I have a data frame with various numbers in each row and more than 100 columns. The first column is always a nonzero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the first number in the row (the value of the first column).
I would think along the lines of an ifelse and a for loop that iterates through rows, but there must be a simpler vectorised way to do it...
Another approach is to use sapply over the columns, which avoids an explicit loop over the rows. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
First, find the row indices of the column that are nonzero using which.
Set these rows in x to the corresponding values in the rows of the first column of df.
Return the column.
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
See this for an excellent review of the *apply family of functions.
Hope this helps.
Since your data is not that big, I suggest you use a simple loop
for (i in 1:nrow(mydata))
{
for (j in 2:ncol(mydata))
{
mydata[i,j]<- ifelse(mydata[i,j]==0 ,0 ,mydata[i,1])
}
}
Suppose your data frame is dat, I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
My solution is about 19 times faster (0.410 s versus 7.919 s elapsed)!
I have 2 datasets, and for each row in dataset1 I want to calculate the difference with every row in dataset2. I also replace any negative difference by 0. Here is a simple example of my 2 datasets (the real ones are around 1000*1000).
df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var=c(2.0,5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0),var=c(7.0,3.0))
df1
ID Obs var
1 1 1 2
2 2 2 5
df2
ID Obs var
1 2 3 7
2 1 2 3
for(i in 1:nrow(df1)){
b1=as.matrix(df1)
b2=as.matrix(df2)
diff= b1-b2
diff[which(diff < 0 )] <- 0
diff.data= data.frame(cbind(diff, total = rowSums(diff)))
}
diff.data
ID Obs var total
1 0 0 0 0
2 1 0 2 3
This is what I have been able to do: I took the difference between the 2 datasets, replaced the negative values by 0, and also summed the values across the columns afterwards. What I actually want is, for the first row in df1, the difference with every row in df2, then the same for the second row in df1 (and so on). Note that I should not calculate the difference between the IDs (I don't know how to do it, maybe changing diff= b1-b2 to diff= b1[,-1]-b2[,-1]?). I want to keep the ID from df1 to keep track of my patients (the case of my dataset). I would like to have something like this:
diff.data
ID Obs var total
1 0 0 0
1 0 0 0
2 0 0 0
2 0 2 2
I thank you in advance for your help.
Here is what I have using your answer; I wanted to create a simple function. I would like my datasets to be allowed to be either matrices or data frames, but so far I was only able to raise an error if they are not data frames:
difference=function(df1,df2){
if(class(df1) != "data.frame" || class(df2) != "data.frame") stop(" df1 or df2 is not a dataframe!")
df1=data.frame(df1)
df2=data.frame(df2)
ID1=seq(nrow(df1))
ID2=seq(nrow(df2))
new_df1 = df1[rep(ID1, each = nrow(df2)), ]
new_df1[-1] = new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
new_df1[new_df1 < 0] = 0
new_df1$total = rowSums(new_df1[-1])
rownames(new_df1) = NULL
output=new_df1
return(output)
}
I know that df1=data.frame(df1) already coerces the input to a data frame; I just don't know how to also allow a matrix as input.
Thank you again in advance for your help.
You can repeat each row in df1 nrow(df2) times and the whole of df2 nrow(df1) times, so that both data frames have the same number of rows and we can subtract the values directly.
#Repeat each row of df1 nrow(df2) times (index by row position, not by ID value)
new_df1 <- df1[rep(seq_len(nrow(df1)), each = nrow(df2)), ]
#Repeat rows of df2 and subtract
new_df1[-1] <- new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
#Replace negative values with 0
new_df1[new_df1 < 0] <- 0
#Add row-wise sum
new_df1$total <- rowSums(new_df1[-1])
#Remove rownames
rownames(new_df1) <- NULL
new_df1
# ID Obs var total
#1 1 0 0 0
#2 1 0 0 0
#3 2 0 0 0
#4 2 0 2 2
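For the follow-up about accepting either matrices or data frames, one possible sketch is to test with is.data.frame and is.matrix and then coerce with as.data.frame. This is only an illustration, not a definitive implementation: it assumes the first column is the ID, that all other columns are numeric, and that both inputs have the same number of columns.

```r
# Sketch of a helper that accepts data frames or matrices.
# Assumption: column 1 is the ID, the remaining columns are numeric.
difference <- function(df1, df2) {
  ok <- function(x) is.data.frame(x) || is.matrix(x)
  if (!ok(df1) || !ok(df2)) {
    stop("df1 and df2 must be data frames or matrices")
  }
  df1 <- as.data.frame(df1)
  df2 <- as.data.frame(df2)

  # Row positions: each row of df1 paired with every row of df2
  i1 <- rep(seq_len(nrow(df1)), each = nrow(df2))
  i2 <- rep(seq_len(nrow(df2)), times = nrow(df1))

  out <- df1[i1, , drop = FALSE]
  d <- as.matrix(df1[i1, -1, drop = FALSE]) -
       as.matrix(df2[i2, -1, drop = FALSE])
  out[-1] <- pmax(d, 0)            # negative differences become 0
  out$total <- rowSums(out[-1])    # row sum, ID column excluded
  rownames(out) <- NULL
  out
}

df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var = c(2.0, 5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0), var = c(7.0, 3.0))
res <- difference(df1, df2)
res_m <- difference(as.matrix(df1), as.matrix(df2))  # matrix input
res
```

Using pmax only on the non-ID columns also avoids zeroing an ID that happens to be negative, which new_df1[new_df1 < 0] would do.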
I've got a large data frame [4000,600] and I'd like to convert elements to 0 if they are smaller than three orders of magnitude less than each column maximum. So each element would need to be compared to the maximum value of its column and if the element < 0.001*$column_max then it should be converted to 0 and if it isn't, it should remain the same.
I am having a tough time getting apply() to let me use an ifelse() function. Is there a better approach or function I am missing?? I'm fairly new to R.
Use lapply to loop over each column with a replace call:
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
dat
# a b
#1 1 3
#2 2 4
#3 1001 3003
dat[] <- lapply(dat, function(x) replace(x, x < max(x)/10^3, 0) )
dat
# a b
#1 0 0
#2 2 4
#3 1001 3003
This should work with ifelse if you use apply column-wise:
df <- data.frame(a = c(1:10, 4000), b = c(4:13, 7000))
apply(df, 2, function(x){ifelse(x < 0.001*max(x), 0, x)})
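Note that apply returns a matrix rather than a data frame. As a rough sketch of a matrix-level variant, assuming every column is numeric (the names m and thr are only illustrative), sweep can compare each element with its own column's threshold in one step:

```r
dat <- data.frame(a = c(1, 2, 1001), b = c(3, 4, 3003))

m <- as.matrix(dat)
# Per-column threshold: 0.001 * column maximum
thr <- 0.001 * apply(m, 2, max)
# sweep() applies "<" between every element and its column's threshold,
# giving a logical matrix that is then used as an index
m[sweep(m, 2, thr, "<")] <- 0

dat[] <- m   # write back, keeping the data frame and its column names
dat
```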
We could do this without using ifelse
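library(dplyr)
# mutate_each() and funs() are deprecated; across() (dplyr >= 1.0.0) is the current equivalent
dat %>%
mutate(across(everything(), ~ (.x >= 0.001*max(.x))*.x))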
# a b
#1 0 0
#2 2 4
#3 1001 3003
data
dat <- data.frame(a=c(1,2,1001),b=c(3,4,3003))
I have a dataframe containing (surprise) data. I have one column which I wish to populated on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need 'apply', or one of its close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function (ar1, ar2, ar3){
if(ar1==1 && ar2==1 && ar3==1){
return(1)
} else if(ar1==0 && ar2==0 && ar3==0){
return(0)
}
return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply would be along the dataframe itself (i.e. apply(df, 1, someFunc)), and it's more idiomatic to access columns by name using the $ operator. So if I have a dataframe named df with a column named a, I access a with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
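The same row-by-row call can also be written with mapply, which walks several vectors in parallel and passes one element of each to the function. A minimal sketch, reusing the original scalar getCode from the question unchanged:

```r
# Scalar function from the question
getCode <- function(ar1, ar2, ar3) {
  if (ar1 == 1 && ar2 == 1 && ar3 == 1) {
    return(1)
  } else if (ar1 == 0 && ar2 == 0 && ar3 == 0) {
    return(0)
  }
  return(2)
}

df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

# mapply calls getCode(a[i], b[i], c[i]) for each i
df$x <- mapply(getCode, df$a, df$b, df$c)
df
```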
As @devmacrile mentioned above, I would just modify the function to be able to get a vector with 3 elements as input and use it within an apply command as you mentioned.
#Example function
getCode <- function (x){
ifelse(x[1]==1 & x[2]==1 & x[3]==1,
1,
ifelse(x[1]==0 & x[2]==0 & x[3]==0,
0,
2)) }
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0
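If the rule really is "all ones, all zeros, or anything else", the per-row function call can be dropped entirely. This is only a sketch under that assumption, using rowSums on logical comparisons so the whole column is computed at once:

```r
df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

abc <- df[c("a", "b", "c")]
all_one  <- rowSums(abc == 1) == ncol(abc)   # every value in the row is 1
all_zero <- rowSums(abc == 0) == ncol(abc)   # every value in the row is 0

# ifelse works on whole vectors here, one element per row
df$x <- ifelse(all_one, 1, ifelse(all_zero, 0, 2))
df
```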
I have two distance matrices, but either of them can have items missing, and they can be out of order. For example:
matrix #1 (missing item c)
a b d
a 0 2 3
b 2 0 4
d 3 4 0
matrix #2 (missing item b, and items out of order)
d c a
d 0 1 2
c 1 0 1
a 2 1 0
I want to find the difference between the matrices, while assuming that any missing items are 0. So, my resulting matrix should be:
a b c d
a 0 2 1 1
b 2 0 0 4
c 1 0 0 1
d 1 4 1 0
What's the best way to go about this? Should I be sorting both matrices and then filling in missing columns/rows so that I can then just abs(m1-m2), or is there a way to use row/column headings to have them automatically "match up" when subtracting?
These matrices are 5000x5000 or so, and I'll have about 1000 of them to compare pairwise, so I'd rather take a hit on preprocessing the data if that will make each computation significantly faster.
Any hints or suggestions are welcome. I'm usually a non-R programmer, so the iterative solution that I would normally come up with would take forever. I'm hoping for the "R way" of doing things that will be significantly faster.
We create a names index ('Un1') which is the union of names of the first ('m1') and second ('m2') matrix. Two new 0 matrices ('m1N', 'm2N') are created by specifying the dimensions and dim names based on 'Un1'. By row/column indexing, we change the 0 values in these matrices to the values in 'm1', 'm2', subtract and get the absolute.
Un1 <- sort(union(colnames(m1), colnames(m2)))
m1N <- matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1))
m2N <- m1N
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2
abs(m1N-m2N)
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
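To make the steps above reproducible, here is a self-contained sketch that first builds the two example matrices from the question (the construction of m1 and m2 is my own transcription of the tables shown there) and then applies the same padding by name:

```r
# Example matrices transcribed from the question
m1 <- matrix(c(0, 2, 3,
               2, 0, 4,
               3, 4, 0), nrow = 3,
             dimnames = list(c("a", "b", "d"), c("a", "b", "d")))
m2 <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0), nrow = 3,
             dimnames = list(c("d", "c", "a"), c("d", "c", "a")))

# Union of all item names, sorted
Un1 <- sort(union(colnames(m1), colnames(m2)))

# Zero matrices with the full set of names
m1N <- matrix(0, nrow = length(Un1), ncol = length(Un1),
              dimnames = list(Un1, Un1))
m2N <- m1N

# Indexing by name places each value in the right cell, whatever the order
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2

res <- abs(m1N - m2N)
res
```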
Update
If we have several matrices with object names m followed by numbers, we can place them in a list. We get the object names using ls and the values in a list with mget. Loop through the list with lapply to get the column names, use union as f in Reduce, sort to get the unique elements.
lst <- mget(ls(pattern='^m\\d+$')) #anchored so helper objects such as m1N are not picked up; change the pattern accordingly
Un1 <- sort(Reduce(union, lapply(lst, colnames)))
We can create another list with matrix of 0s.
lst1 <- lapply(seq_along(lst), function(i)
matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1)))
We can change the corresponding elements of 'lst1' using the row/column index of corresponding matrices of 'lst' using Map.
lst2 <- Map(function(x,y) {x[rownames(y), colnames(y)] <- y; x}, lst1, lst)
If we need pairwise difference, combn may be an option
lst3 <- combn(seq_along(lst2),2, FUN=function(x)
list(abs(lst2[[x[1]]]-lst2[[x[2]]])))
names(lst3) <- combn(seq_along(lst2), 2, FUN=paste, collapse='_')
Another approach using match (beginning is similar to @akrun):
func = function(cols, m)
{
res = `dimnames<-`(m[match(cols,rownames(m)), match(cols,colnames(m))],
list(cols, cols))
ifelse(is.na(res), 0, res)
}
cols = sort(union(colnames(m1), colnames(m2)))
abs(func(cols,m1) - func(cols,m2))
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
I have a matrix mat.
mat<-matrix(
c('a','a','b','a','b','b'),
nrow=3, ncol=2)
I want to make a vector of the number of matches in each row of the matrix. For example, let's say I wanted to count the number of matches of the letter a in each row. The first row of the matrix has an a,a: two matches of a. The second row of the matrix has an a,b: one match of a.
I can count the number of matches of the character a in a row with this line of code:
sum(!is.na(charmatch(mat[1,c(1,2)],"a"))) # first row, returns 2
sum(!is.na(charmatch(mat[2,c(1,2)],"a"))) # second row, returns 1
I want to vectorize this counting procedure. In other words, I want to do something like this
as.vector(rowsum(!is.na(charmatch(mat[,c(1,2)], "a"))))
So that it returns a vector like this 2,1,0 which means 2 matches of a in row 1 of the matrix, 1 match of a in row 2 of the matrix, 0 matches of a in row 3 of the matrix.
You can just do
rowSums(mat=='a', na.rm=TRUE)
#[1] 2 1 0
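The one-liner works because the comparison is itself vectorised over the whole matrix. A short sketch with the intermediate step made explicit (the names is_a and counts are only illustrative):

```r
mat <- matrix(c('a', 'a', 'b', 'a', 'b', 'b'), nrow = 3, ncol = 2)

# Comparing a matrix with a single value gives a logical matrix
# of the same shape
is_a <- mat == 'a'

# rowSums treats TRUE as 1 and FALSE as 0, so this counts matches per row
counts <- rowSums(is_a, na.rm = TRUE)
counts
```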
For all unique values
Un <- sort(unique(c(mat)))
res <- sapply(Map(`==`, list(mat), Un), rowSums, na.rm=TRUE)
colnames(res) <- Un
res
# a b
#[1,] 2 0
#[2,] 1 1
#[3,] 0 2
Or as contributed by @Ananda Mahto, a faster approach would be
lvl <- sort(unique(c(mat)))
vapply(lvl, function(x) rowSums(mat == x, na.rm = TRUE), numeric(nrow(mat)))
If you wanted to do this for all values, you can try one of the following:
table with factor in apply
levs <- unique(c(mat))
t(apply(mat, 1, function(x) table(factor(x, levs))))
# a b
# [1,] 2 0
# [2,] 1 1
# [3,] 0 2
melt and dcast with fun.aggregate = length from "reshape2"
library(reshape2)
dcast(melt(mat), Var1 ~ value, value.var = "Var2")
# Aggregation function missing: defaulting to length
# Var1 a b
# 1 1 2 0
# 2 2 1 1
# 3 3 0 2
Better yet would just be table after manually creating the values to tabulate:
table(rep(sequence(nrow(mat)), ncol(mat)), c(mat))
#
# a b
# 1 2 0
# 2 1 1
# 3 0 2