I have the following data:
Letters <- c("A","B","C")
Numbers <- c(1,0,1)
Numbers <- as.integer(Numbers)
Data.Frame <- data.frame(Letters,Numbers)
I want to create a Dummy Variable for the Letters and wrote the following for-loop:
for(level in unique(Data.Frame$Letters)){Data.Frame[paste("", level, sep = "")]
<- ifelse(Data.Frame$Letters == level, 1, 0)}
Is there a way to vectorize this for-loop? Is the following use of dcast alredy vectorized?
dt <- data.table(Letters,Numbers)
dcast.data.table(dt, Letters+Numbers~Letters,fun.aggregate=length)
You could use outer
cbind(Data.Frame, +outer(Letters, setNames(nm=Letters), "=="))
# Letters Numbers A B C
# 1 A 1 1 0 0
# 2 B 0 0 1 0
# 3 C 1 0 0 1
Related
df <- data.frame(class=c('A', 'B'),
var2=c(1, 0),
var3=c(0, 1))
for (i in colnames(df)[2:3]) {
#print(i)
table(paste0('df$', i), df$class)
}
results in
Error in table(paste0("df$", i), df$class) :
all arguments must have the same length
Also tried putting
get(paste0('df$',i))
Is there a way to loop through these columns and tabulate?
The issue with your code is that because paste0() returns a character vector e.g, 'var2' and is not a correct argument for table() function. You can use the double bracket '[[' to extract the columns:
# create a list to save the results from loop
tl<-vector(mode = 'list')
# run the loop and add the results for each column in the corresponding element of 'tl'
for (i in colnames(df)[2:3]) {
tl[[i]]<-table(df[[i]], df$class)
}
output
tl
$var2
A B
0 0 1
1 1 0
$var3
A B
0 1 0
1 0 1
alternatively you can use lapply() function:
lapply(df[, 2:3], function(x) table(x, df$class))
var2
x A B
0 0 1
1 1 0
$var3
x A B
0 1 0
1 0 1
Not much info on what exactly your preferred output is other than it tabulates the columns, but here's a potential solution:
# This is your df:
class<- c('A','B')
var2<- c(1,0)
var3 <- c(0,1)
df<- data.frame(class,var2,var3)
# Using lapply to tabulate each column. The output is a list of tables:
dftable <- lapply(df, table)
The output looks like this:
> dftb
$class
A B
1 1
$var2
0 1
1 1
$var3
0 1
1 1
The map() function from the purr package (part of tidyverse) can also be used:
library(tidyverse)
dftb <- lapply(df, table)
My question is very simple. I have a data frame with various numbers in each row, more than 100 columns. First column is always a non zero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the first number in the row (the value of the first column)
I would think in the lines of an ifelse and a for loop that iterates through rows but there must be a simpler vectorised way to do it...
Another approach is to use sapply, which is more efficient than looping. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
First find the row indices of the column that are zeroes using which.
Set these rows in x to the corresponding values in the rows of the first column of df.
Returns the column
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
See this for an excellent review of the *apply family of functions.
Hope this helps.
Since you're data is not that big, I suggest you use a simple loop
for (i in 1:nrow(mydata))
{
for (j in 2:ncol(mydata)
{
mydata[i,j]<- ifelse(mydata[i,j]==0 ,0 ,mydata[i,1])
}
}
Suppose your data frame is dat, I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
My solution is 20 times faster!
I have a dataframe containing (surprise) data. I have one column which I wish to populated on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need 'apply', or one of it's close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function (ar1, ar2, ar3){
if(ar1==1 && ar2==1 && ar3==1){
return(1)
} else if(ar1==0 && ar2==0 && ar3==0){
return(0)
}
return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply would be along the dataframe itself (i.e. apply(df, 1, someFunc)); it's more idiomatic to access columns by name using the $ operator.. so if I have a dataframe named df with a column named a, access a with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
As #devmacrile mentioned above, I would just modify the function to be able to get a vector with 3 elements as input and use it within an apply command as you mentioned.
#Example function
getCode <- function (x){
ifelse(x[1]==1 & x[2]==1 & x[3]==1,
1,
ifelse(x[1]==0 & x[2]==0 & x[3]==0,
0,
2)) }
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0
I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates if each observation's value for b is present at least once in c as grouped by a. Such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
for (j in b[a == i]) {
df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
}
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to #akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via #thelatemail below)
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
df$d[df$a==x] <<- 0L + (df$b[df$a==x] %in% df$c[df$a==x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df will otherwise be modified just in the scope of the apply and not in .GlobalEnv. With <<- it finds the parent environment where df is defined and sets df there.
Also, note the slightly different version of the + "trick" where a leading 0 which makes it clearer to the reader that the resulting vector is an integer because it must be cast that way for the addition to work. The L after the 0 indicates that 0 is in integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same results (a column of class integer).
I am trying to generate dummy variables (must be 1/0) using a loop based on the most frequent response of a variable. After lots of googling, I haven't managed to come up with a solution. I have extracted the most frequent responses (strings, say the top 5 are "A","B",...,"E") using
top5<-names(head(sort(table(data$var1), decreasing = TRUE),5)
I would like the loop to check if another variable ("var2") equals A, if so set =1, OW =0, then give a summary using aggregate(). In Stata, I can refer to the looped variable i using `i' but not in R... The code that does not work is:
for(i in top5) {
data$i.dummy <- ifelse(data$var2=="i",1,0)
aggregate(data$i.dummy~data$age+data$year,data,mean)
}
Any suggestions?
If you want one column per item in your top 5 then I would use sapply along the elements in top5. No need for ifelse because == compares and gives TRUE or 1 if the comparison is TRUE and 0 otherwise
Here we cbind a matrix of 5 columns, one each for each element of top5 containing 1 if the row in data$var2 equals the respective element of 'top5':
data <- cbind( data , sapply( top5 , function(x) as.integer( data$var2 == x ) ) )
If you want one column for matches of any of top5 it's even easier:
data$dummies <- as.integer( data$var2 %in% top5 )
The as.integer() in both cases is used to turn TRUE or FALSE to 1 and 0 respectively.
A cut down example to illustrate how it works:
set.seed(123)
top2 <- c("A","B")
data <- data.frame( var2 = sample(LETTERS[1:4],6,repl=TRUE) )
# Make dummy variables, one column for each element in topX vector
data <- cbind( data , sapply( top2 , function(x) as.integer( data$var2 == x ) ) )
data
# var2 A B
#1 B 0 1
#2 D 0 0
#3 B 0 1
#4 D 0 0
#5 D 0 0
#6 A 1 0
# Make single column for all elements in topX vector
data$ANY <- as.integer( data$var2 %in% top2 )
data
# var2 ANY A B
#1 B 1 0 1
#2 D 0 0 0
#3 B 1 0 1
#4 D 0 0 0
#5 D 0 0 0
#6 A 1 1 0
See fortune(312), then read the help ?"[[" and possibly the help for paste0.
Then possibly consider using other tools like model.matrix and sapply rather than doing everything yourself using loops.