I am trying to generate dummy variables (must be 1/0) using a loop based on the most frequent response of a variable. After lots of googling, I haven't managed to come up with a solution. I have extracted the most frequent responses (strings, say the top 5 are "A","B",...,"E") using
top5 <- names(head(sort(table(data$var1), decreasing = TRUE), 5))
I would like the loop to check whether another variable ("var2") equals each of these values in turn (e.g. "A"), set the dummy to 1 if so and 0 otherwise, then give a summary using aggregate(). In Stata, I can refer to the looped variable i using `i' but not in R... The code that does not work is:
for(i in top5) {
data$i.dummy <- ifelse(data$var2=="i",1,0)
aggregate(data$i.dummy~data$age+data$year,data,mean)
}
Any suggestions?
If you want one column per item in your top 5, then I would use sapply along the elements of top5. There is no need for ifelse, because == already returns a logical vector (TRUE where the comparison holds, FALSE otherwise), which converts directly to 1 and 0.
Here we cbind a matrix of 5 columns, one each for each element of top5 containing 1 if the row in data$var2 equals the respective element of 'top5':
data <- cbind( data , sapply( top5 , function(x) as.integer( data$var2 == x ) ) )
If you want one column for matches of any of top5 it's even easier:
data$dummies <- as.integer( data$var2 %in% top5 )
The as.integer() in both cases is used to turn TRUE or FALSE to 1 and 0 respectively.
A cut down example to illustrate how it works:
set.seed(123)
top2 <- c("A","B")
data <- data.frame( var2 = sample(LETTERS[1:4],6,repl=TRUE) )
# Make dummy variables, one column for each element in topX vector
data <- cbind( data , sapply( top2 , function(x) as.integer( data$var2 == x ) ) )
data
# var2 A B
#1 B 0 1
#2 D 0 0
#3 B 0 1
#4 D 0 0
#5 D 0 0
#6 A 1 0
# Make single column for all elements in topX vector
data$ANY <- as.integer( data$var2 %in% top2 )
data
# var2 A B ANY
#1 B 0 1 1
#2 D 0 0 0
#3 B 0 1 1
#4 D 0 0 0
#5 D 0 0 0
#6 A 1 0 1
See fortune(312), then read the help ?"[[" and possibly the help for paste0.
Then possibly consider using other tools like model.matrix and sapply rather than doing everything yourself using loops.
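As a rough sketch of what those hints point at: `[[` accepts a computed column name, so the loop from the question can be made to work, and model.matrix() builds all the dummies in one call. The toy data and the names `top`, `col`, `results` and `mm` are made up for illustration and are not from the question.

```r
# Toy data (hypothetical); the column names mirror the question.
data <- data.frame(
  var2 = c("A", "B", "A", "C"),
  age  = c(30, 30, 40, 40),
  year = c(2001, 2001, 2002, 2002)
)
top <- c("A", "B")  # stands in for top5

results <- list()
for (i in top) {
  col <- paste0(i, ".dummy")                 # e.g. "A.dummy"
  data[[col]] <- as.integer(data$var2 == i)  # [[ ]] accepts a computed name
  # mean of the dummy within each age/year group
  results[[i]] <- aggregate(data[col], by = data[c("age", "year")], FUN = mean)
}

# model.matrix() builds one 0/1 column per level of var2 in a single call;
# "- 1" drops the intercept so that every level gets its own column.
mm <- model.matrix(~ var2 - 1, data = data)
```

Note that the results of aggregate() are stored in a list, because a value computed inside a for loop is not printed automatically.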
df <- data.frame(class=c('A', 'B'),
var2=c(1, 0),
var3=c(0, 1))
for (i in colnames(df)[2:3]) {
#print(i)
table(paste0('df$', i), df$class)
}
results in
Error in table(paste0("df$", i), df$class) :
all arguments must have the same length
Also tried putting
get(paste0('df$',i))
Is there a way to loop through these columns and tabulate?
The issue with your code is that paste0() returns a character string, e.g. 'df$var2', not the column itself, so it is not a valid argument for table(). You can use the double bracket '[[' to extract the columns by name:
# create a list to save the results from loop
tl<-vector(mode = 'list')
# run the loop and add the results for each column in the corresponding element of 'tl'
for (i in colnames(df)[2:3]) {
tl[[i]]<-table(df[[i]], df$class)
}
output
tl
$var2
A B
0 0 1
1 1 0
$var3
A B
0 1 0
1 0 1
Alternatively, you can use the lapply() function:
lapply(df[, 2:3], function(x) table(x, df$class))
$var2
x A B
0 0 1
1 1 0
$var3
x A B
0 1 0
1 0 1
Not much info on what exactly your preferred output is other than it tabulates the columns, but here's a potential solution:
# This is your df:
class<- c('A','B')
var2<- c(1,0)
var3 <- c(0,1)
df<- data.frame(class,var2,var3)
# Using lapply to tabulate each column. The output is a list of tables:
dftb <- lapply(df, table)
The output looks like this:
> dftb
$class
A B
1 1
$var2
0 1
1 1
$var3
0 1
1 1
The map() function from the purrr package (part of the tidyverse) can also be used:
library(tidyverse)
dftb <- map(df, table)
I have the following data:
Letters <- c("A","B","C")
Numbers <- c(1,0,1)
Numbers <- as.integer(Numbers)
Data.Frame <- data.frame(Letters,Numbers)
I want to create a Dummy Variable for the Letters and wrote the following for-loop:
for(level in unique(Data.Frame$Letters)){
  Data.Frame[paste("", level, sep = "")] <- ifelse(Data.Frame$Letters == level, 1, 0)
}
Is there a way to vectorize this for-loop? Is the following use of dcast already vectorized?
dt <- data.table(Letters,Numbers)
dcast.data.table(dt, Letters+Numbers~Letters,fun.aggregate=length)
You could use outer
cbind(Data.Frame, +outer(Letters, setNames(nm=Letters), "=="))
# Letters Numbers A B C
# 1 A 1 1 0 0
# 2 B 0 0 1 0
# 3 C 1 0 0 1
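A small sketch of how the outer() call behaves when the input contains repeated values. The answer's example has one row per letter, so this assumes you would compare against unique() of the column instead; the names `x`, `lev` and `dummies` are made up for illustration.

```r
x <- c("A", "B", "A", "C")   # hypothetical input with a repeated level
lev <- unique(x)             # one dummy column per distinct value

# outer() compares every element of x with every level;
# setNames(nm = lev) names the levels so they become column names,
# and the unary + turns the logical matrix into 0/1 integers.
dummies <- +outer(x, setNames(nm = lev), "==")
dummies
```

The result is an integer matrix with one row per element of x, which can be attached to a data frame with cbind() as in the answer.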
My question is very simple. I have a data frame with various numbers in each row, more than 100 columns. First column is always a non zero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the first number in the row (the value of the first column)
I would think along the lines of an ifelse and a for loop that iterates through the rows, but there must be a simpler vectorised way to do it...
Another approach is to use sapply, which avoids an explicit loop over the rows. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
1. Find the row indices of the column that are nonzero, using which.
2. Set these rows in x to the corresponding values in the rows of the first column of df.
3. Return the column.
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
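For a concrete picture, here is the same call run on a tiny, made-up data frame (the values are illustrative only and the function body is split over several lines for readability):

```r
# Hypothetical 3-row example; the first column holds the replacement values.
df <- data.frame(val = c(2, 5, 7), a = c(0, 1, 3), b = c(4, 0, 0))

df[, -1] <- sapply(df[, -1], function(x) {
  ind <- which(x != 0)   # rows where this column is nonzero
  x[ind] <- df[ind, 1]   # overwrite them with the first-column value
  x                      # return the modified column
})
df
```

Column a becomes 0, 5, 7 and column b becomes 2, 0, 0, while val is untouched.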
See this for an excellent review of the *apply family of functions.
Hope this helps.
Since your data is not that big, I suggest you use a simple loop:
for (i in 1:nrow(mydata))
{
for (j in 2:ncol(mydata))
{
mydata[i,j]<- ifelse(mydata[i,j]==0 ,0 ,mydata[i,1])
}
}
Suppose your data frame is dat, I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
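A different way to get the same result, offered only as a sketch and not as the method used above, relies on R recycling a vector down the columns of a matrix: multiplying the logical "is nonzero" matrix by the first column puts the row's value wherever the entry was nonzero. It assumes all columns are numeric and contain no NA values; the toy data is made up.

```r
# Hypothetical toy data: 'val' is the first column, never zero.
dat <- data.frame(val = 1:3, a = c(1, 0, 1), b = c(0, 2, 0))

# (dat[, -1] != 0) is a logical matrix; multiplying by dat[[1]] recycles
# the first column down each column, so entry [i, j] becomes val[i] or 0.
dat[, -1] <- (dat[, -1] != 0) * dat[[1]]
dat
```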
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
In this run my solution is roughly 20 times faster (0.41 s versus 7.9 s elapsed).
I have a dataframe containing (surprise) data. I have one column which I wish to populate on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need 'apply', or one of its close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function (ar1, ar2, ar3){
if(ar1==1 && ar2==1 && ar3==1){
return(1)
} else if(ar1==0 && ar2==0 && ar3==0){
return(0)
}
return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
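A few things: apply should be called on the dataframe itself (i.e. apply(df, 1, someFunc)), and its third argument must be a function, not the result of calling one, which is what the error message is telling you. It's also more idiomatic to access columns by name using the $ operator, so if I have a dataframe named df with a column named a, I access a with df$a.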
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
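Two other ways to express the same thing, sketched here with the data from the question: mapply() walks the three columns in parallel so no index is needed, and nested ifelse() calls avoid calling getCode row by row at all. The column name `x_vec` is made up for illustration.

```r
# getCode and the data frame as defined in the question
getCode <- function(ar1, ar2, ar3) {
  if (ar1 == 1 && ar2 == 1 && ar3 == 1) {
    return(1)
  } else if (ar1 == 0 && ar2 == 0 && ar3 == 0) {
    return(0)
  }
  return(2)
}
df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

# mapply() calls getCode once per row, taking one element from each column
df$x <- mapply(getCode, df$a, df$b, df$c)

# Vectorised equivalent: the same conditions evaluated on whole columns
df$x_vec <- ifelse(df$a == 1 & df$b == 1 & df$c == 1, 1,
            ifelse(df$a == 0 & df$b == 0 & df$c == 0, 0, 2))
df
```

Both new columns should hold 1, 2, 0 for the three rows.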
As @devmacrile mentioned above, I would just modify the function so that it takes a vector with 3 elements as input, and use it within an apply command as you mentioned.
#Example function
getCode <- function (x){
ifelse(x[1]==1 & x[2]==1 & x[3]==1,
1,
ifelse(x[1]==0 & x[2]==0 & x[3]==0,
0,
2)) }
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0
I've two distance matrices.. but either of them can have items missing, and they can be out of order -- for example:
matrix #1 (missing item c)
a b d
a 0 2 3
b 2 0 4
d 3 4 0
matrix #2 (missing item b, and items out of order)
d c a
d 0 1 2
c 1 0 1
a 2 1 0
I want to find the difference between the matrices, while assuming that any missing items are 0. So, my resulting matrix should be:
a b c d
a 0 2 1 1
b 2 0 0 4
c 1 0 0 1
d 1 4 1 0
What's the best way to go about this? Should I be sorting both matrices and then filling in missing columns/rows so that I can then just abs(m1-m2), or is there a way to use row/column headings to have them automatically "match up" when subtracting?
These matrices are 5000x5000 or so, and I'll have about 1000 of them to compare pairwise, so I'd rather take a hit on preprocessing the data if that will make each computation significantly faster.
Any hints or suggestions are welcome. I'm usually a non-R programmer, so an iterative solution that I would normally come up would take forever -- I'm hoping for the "R way" of doing things that will be significantly faster.
We create a names index ('Un1') which is the sorted union of the names of the first ('m1') and second ('m2') matrix. Two new matrices of 0s ('m1N', 'm2N') are created by specifying the dimensions and dim names based on 'Un1'. By row/column indexing, we change the 0 values in these matrices to the values in 'm1' and 'm2', then subtract and take the absolute value.
Un1 <- sort(union(colnames(m1), colnames(m2)))
m1N <- matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1))
m2N <- m1N
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2
abs(m1N-m2N)
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
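Since the post does not show how 'm1' and 'm2' are built, here is a self-contained sketch that constructs the two matrices from the question and wraps the padding step in a small helper. The helper name `pad` and the variable `labels` are made up for illustration; the logic is the same as above.

```r
# The two matrices from the question
n1 <- c("a", "b", "d")
m1 <- matrix(c(0, 2, 3,
               2, 0, 4,
               3, 4, 0), nrow = 3, dimnames = list(n1, n1))
n2 <- c("d", "c", "a")
m2 <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0), nrow = 3, dimnames = list(n2, n2))

# Hypothetical helper: place m into a zero matrix indexed by all labels
pad <- function(m, labels) {
  out <- matrix(0, length(labels), length(labels),
                dimnames = list(labels, labels))
  out[rownames(m), colnames(m)] <- m   # fill by name, so order does not matter
  out
}

labels <- sort(union(colnames(m1), colnames(m2)))
d <- abs(pad(m1, labels) - pad(m2, labels))
d
```

With many matrices, padding each one once up front means every pairwise comparison is then a plain abs(x - y).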
Update
If we have several matrices whose object names are m followed by digits, we can place them in a list: ls finds the object names and mget returns their values as a list. We then loop through the list with lapply to get the column names, combine them with union via Reduce, and sort the result to get the full set of unique names. The pattern is anchored so that it does not also pick up 'm1N' and 'm2N' from above.
lst <- mget(ls(pattern='^m\\d+$')) #change the pattern accordingly
Un1 <- sort(Reduce(union, lapply(lst, colnames)))
We can create another list with matrix of 0s.
lst1 <- lapply(seq_along(lst), function(i)
matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1)))
We can change the corresponding elements of 'lst1' using the row/column index of corresponding matrices of 'lst' using Map.
lst2 <- Map(function(x,y) {x[rownames(y), colnames(y)] <- y; x}, lst1, lst)
If we need pairwise difference, combn may be an option
lst3 <- combn(seq_along(lst2),2, FUN=function(x)
list(abs(lst2[[x[1]]]-lst2[[x[2]]])))
names(lst3) <- combn(seq_along(lst2), 2, FUN=paste, collapse='_')
Another approach using match (the beginning is similar to @akrun's answer):
func = function(cols, m)
{
res = `dimnames<-`(m[match(cols,rownames(m)), match(cols,colnames(m))],
list(cols, cols))
ifelse(is.na(res), 0, res)
}
cols = sort(union(colnames(m1), colnames(m2)))
abs(func(cols,m1) - func(cols,m2))
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0