I want to create a matrix with 3 columns and many rows, assigning 1 if a condition is satisfied and 0 otherwise.
I have data stored in 3 variables:
df1 <- data.frame(names=c("A","B","C","D","E","F"))
df2 <- data.frame(names=c("A","B","C","F"))
df3 <- data.frame(names=c("E","F","H"))
The output should be:
  df1 df2 df3
A   1   1   0
B   1   1   0
C   1   1   0
D   1   0   0
E   1   0   1
F   1   1   1
H   0   0   1
In the first row, A gets a 1 under each data set that contains it and a 0 under each data set that does not.
Here is what I have tried:
DF <- rbind(df1,df2,df3)
for (i in DF) {
  for (j in 1:length(df1$names)) {
    if(i == df1$names[j]){
      A3 <-data.frame(paste0("",i),paste0(1),paste0(0),paste0(0))
      names(A3) <- NULL
    }
    else{
      A3 <-data.frame(paste0("",i),paste0(0),paste0(0),paste0(0))
    }
  }
}
I have written this code only for df1, but it is very slow because I have more than 1500 rows in my original data set. What would be the fastest way to do it?
Add a grouping variable to each dataframe:
df1 <- data.frame(names=c("A","B","C","D","E","F"),group="df1")
df2 <- data.frame(names=c("A","B","C","F"),group="df2")
df3 <- data.frame(names=c("E","F","H"),group="df3")
DF <- rbind(df1,df2,df3)
Then do this:
res <- table(DF)
> res
     group
names df1 df2 df3
    A   1   1   0
    B   1   1   0
    C   1   1   0
    D   1   0   0
    E   1   0   1
    F   1   1   1
    H   0   0   1
Or, if you want a data frame:
library(reshape2)
dcast(DF, names ~ group, fun.aggregate = length)
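If you would rather not load reshape2, a base R route is possible too. The sketch below assumes the same stacked DF as above and uses as.data.frame.matrix() to keep the wide layout of the table; the name wide is just for illustration.

```r
# Rebuild the stacked data with a grouping column, as in the answer above
df1 <- data.frame(names = c("A", "B", "C", "D", "E", "F"), group = "df1")
df2 <- data.frame(names = c("A", "B", "C", "F"), group = "df2")
df3 <- data.frame(names = c("E", "F", "H"), group = "df3")
DF <- rbind(df1, df2, df3)

# table() counts each (name, group) pair; as.data.frame.matrix() keeps the
# wide layout (plain as.data.frame() would return the long format instead)
wide <- as.data.frame.matrix(table(DF$names, DF$group))
wide
```

The names end up as row names here rather than in a column of their own, which may or may not be what you want.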
When using the idcol parameter of rbindlist from the data.table package, there is no need to create a grouping column for each data frame separately:
library(data.table) # I used v1.9.5 for this
DT <- rbindlist(list(df1, df2, df3), idcol="id")
dcast(DT[, .N , by=.(id,names)], names ~ id, fill=0)
which gives:
   names 1 2 3
1:     A 1 1 0
2:     B 1 1 0
3:     C 1 1 0
4:     D 1 0 0
5:     E 1 0 1
6:     F 1 1 1
7:     H 0 0 1
The %in% operator lets you check whether a string is present in a vector of strings. It is also vectorised, so it is quite fast:
x=c(LETTERS[c(1:6,8)])
df=data.frame(x=x,df1=as.numeric(x %in% df1$names),
df2=as.numeric(x %in% df2$names),
df3=as.numeric(x %in% df3$names))
df
If speed is crucial, the {data.table} package gives a small speed boost with its %chin% operator:
library(data.table)
x=c(LETTERS[c(1:6,8)])
dt=data.table(x=x,df1=as.numeric(x %chin% as.character(df1$names)),
df2=as.numeric(x %chin% as.character(df2$names)),
df3=as.numeric(x %chin% as.character(df3$names)))
dt
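In both versions above, x is typed in by hand. With real data you would probably want to derive it from the data frames themselves. A minimal sketch, assuming the names columns are character vectors (hence the explicit stringsAsFactors = FALSE); res is just an illustrative name:

```r
df1 <- data.frame(names = c("A", "B", "C", "D", "E", "F"), stringsAsFactors = FALSE)
df2 <- data.frame(names = c("A", "B", "C", "F"), stringsAsFactors = FALSE)
df3 <- data.frame(names = c("E", "F", "H"), stringsAsFactors = FALSE)

# Every name that occurs in at least one data frame, each listed once
x <- sort(unique(c(df1$names, df2$names, df3$names)))

# One vectorised %in% test per data frame
res <- data.frame(x = x,
                  df1 = as.integer(x %in% df1$names),
                  df2 = as.integer(x %in% df2$names),
                  df3 = as.integer(x %in% df3$names))
res
```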
The code below is slightly more general than the other answers. Also, I think it's useful to know how to dynamically create commands...
I use the data frames as you prepared them:
df1 <- data.frame( names = c( "A", "B", "C", "D", "E", "F") )
df2 <- data.frame( names = c( "A", "B", "C", "F") )
df3 <- data.frame( names = c( "E", "F", "H") )
DF <- rbind( df1, df2, df3 )
nDF <- unique( DF ) #we don't want to duplicate tests.
Then the main loop is just like this:
n_ <- 3
for( ii in 1 : n_){
  nDF[ paste0( "df", ii ) ] <- as.logical( NA ) # dynamically creates a new variable in your data frame
  cmnd <- paste0("nDF$names %in% df",ii,"$names") # dynamically creates the appropriate command (here, e.g., "nDF$names %in% df1$names")
  nDF[ paste0("df",ii)] <- eval( parse( text = cmnd ) ) # evaluates the dynamically created command and saves the result in the previously created variable
}
This should be relatively fast. But if you don't have duplicates in your data, then heroka's suggestion for this question is probably the way to go.
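If building commands as strings feels fragile, the same loop can be written without eval(parse()) by keeping the data frames in a named list. This is only a sketch of that alternative, not a claim that it is faster; the list name frames is made up for the example.

```r
df1 <- data.frame(names = c("A", "B", "C", "D", "E", "F"))
df2 <- data.frame(names = c("A", "B", "C", "F"))
df3 <- data.frame(names = c("E", "F", "H"))
nDF <- unique(rbind(df1, df2, df3))

# Hypothetical helper list: the list names become the new column names
frames <- list(df1 = df1, df2 = df2, df3 = df3)

for (nm in names(frames)) {
  # [[ accepts a column name held in a variable, so no command string is needed
  nDF[[nm]] <- nDF$names %in% frames[[nm]]$names
}
nDF
```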
I am attempting to convert the rows of org_type into columns. I tried spread, but I have not had success. Does anyone know how to do this?
You can do it using the mltools and data.table R packages. First you need to convert the data frame to a data.table. Then you use the one_hot() function from mltools. Here is an example with dummy data.
# demo data frame
df <- data.frame(
org_name = c("A", "B", "C", "D", "E", "F"),
org_type = c("Tech", "Tech", "Business", "Business", "Bank", "Bank")
)
df
# load the libraries
library(data.table)
library(mltools)
# convert df to data.table
df_dt <- as.data.table(df)
# use one_hot specify the column name in cols
df_one_hot <- one_hot(df_dt, cols = "org_type")
df_one_hot
Output before:
  org_name org_type
1        A     Tech
2        B     Tech
3        C Business
4        D Business
5        E     Bank
6        F     Bank
Output after:
   org_name org_type_Bank org_type_Business org_type_Tech
1:        A             0                 0             1
2:        B             0                 0             1
3:        C             0                 1             0
4:        D             0                 1             0
5:        E             1                 0             0
6:        F             1                 0             0
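If installing mltools is not an option, base R's model.matrix() can produce similar indicator columns. This is a sketch of that alternative, not of one_hot() itself; note that the column names come out without the underscore (org_typeBank rather than org_type_Bank), and df_wide is an illustrative name.

```r
df <- data.frame(
  org_name = c("A", "B", "C", "D", "E", "F"),
  org_type = c("Tech", "Tech", "Business", "Business", "Bank", "Bank")
)

# "- 1" drops the intercept, so every level of org_type gets its own 0/1 column
dummies <- model.matrix(~ org_type - 1, data = df)

# Put the indicator columns next to the organisation names
df_wide <- cbind(df["org_name"], dummies)
df_wide
```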
I have the following data:
Letters <- c("A","B","C")
Numbers <- c(1,0,1)
Numbers <- as.integer(Numbers)
Data.Frame <- data.frame(Letters,Numbers)
I want to create a Dummy Variable for the Letters and wrote the following for-loop:
for(level in unique(Data.Frame$Letters)){
  Data.Frame[paste("", level, sep = "")] <- ifelse(Data.Frame$Letters == level, 1, 0)
}
Is there a way to vectorize this for-loop? Is the following use of dcast already vectorized?
dt <- data.table(Letters,Numbers)
dcast.data.table(dt, Letters+Numbers~Letters,fun.aggregate=length)
You could use outer:
cbind(Data.Frame, +outer(Letters, setNames(nm=Letters), "=="))
# Letters Numbers A B C
# 1 A 1 1 0 0
# 2 B 0 0 1 0
# 3 C 1 0 0 1
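To see why that one-liner works, it may help to split it into steps. The intermediate names (levels_named, is_match, dummies) are only for illustration.

```r
Letters <- c("A", "B", "C")

# setNames(nm = Letters) returns the vector named by itself;
# outer() uses those names as the column names of its result
levels_named <- setNames(nm = Letters)

# Compare every element of Letters with every level: a logical 3 x 3 matrix
is_match <- outer(Letters, levels_named, "==")

# Unary plus turns TRUE/FALSE into 1/0 while keeping the dimnames
dummies <- +is_match
dummies
```

Every comparison happens inside a single call to ==, so there is no explicit loop over the levels.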
I have several data frames in a list, which I want to merge into one big data frame. The actual list contains several thousand of these data frames, so I am looking for an efficient solution.
The list looks similar to this:
v <- data.frame(answer = c(1,1,1))
rownames(v) <- c("A","B","C")
w <- data.frame(answer = c(1,0,0))
rownames(w) <- c("A","B","D")
x <- data.frame(answer = c(1,1,1))
rownames(x) <- c("A","B","C")
y <- data.frame(answer = c(0,0,0))
rownames(y) <- c("A","C","D")
z <- data.frame(answer = c(0,0,0,1))
rownames(z) <- c("A","B","C","D")
l <- list(v,w,x,y,z)
names(l) <- c("V","W","X","Y","Z")
The final output should look like this:
   V  W  X  Y  Z
A  1  1  1  0  0
B  1  0  1 NA  0
C  1 NA  1  0  0
D NA  0 NA  0  1
What I have tried already (feel free to ignore this part if you already have a working solution):
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T),stringsAsFactors=FALSE)
and
df <- do.call(rbind.data.frame, l)
and
df <- rbindlist(l) (from library("data.table"))
Those all lose the information contained in the row names and only seemed to work if all data frames have the same length and the same order.
The only one that kind of worked with my actual data was something along the lines of:
df <- suppressWarnings(Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "answer", all = TRUE), l))
but I am not able to make it work with my example list, and even when it worked it was extremely inefficient and took ages once the list got longer.
Here is a base R solution using merge and Reduce:
df <- Reduce(
  function(x, y) merge(x, y, by = "id", all = TRUE),
  lapply(l, function(x) { x$id <- rownames(x); x }))
colnames(df) <- c("id", names(l))
# id V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
We create a row-names column and then do the join: loop through the list with map, create the row-names column with rownames_to_column, reduce to a single dataset with a full_join by that column, and rename the columns if needed.
library(tidyverse)
l %>%
map( ~ .x %>%
rownames_to_column('rn')) %>%
reduce(full_join, by = 'rn') %>%
rename_at(2:6, ~ names(l))
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
Or another option is to bind_rows and then spread
l %>%
map(rownames_to_column, 'rn') %>%
bind_rows(.id = 'grp') %>%
spread(grp, answer)
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
One way of doing this using something similar to what kind of already worked for you is to first declare the rownames as a variable, then rename the columns of your data frames to match their names in the list, and then merge.
df_l <- l %>% Map(setNames, ., names(.)) %>%
map(~mutate(., r=rownames(.))) %>%
Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="r"), .)
rownames(df_l) <- df_l$r
df_l$r <- NULL
To be honest, I'm not sure it is efficient though, and like you said it will probably take long as the list grows.
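If repeated joins turn out to be the bottleneck, one alternative is to skip merging altogether: allocate the result matrix once and fill one column per list element, using the row names as the index. This is a base R sketch under the assumption that every data frame has a single answer column and unique row names; I have not timed it on a list of several thousand elements.

```r
v <- data.frame(answer = c(1, 1, 1), row.names = c("A", "B", "C"))
w <- data.frame(answer = c(1, 0, 0), row.names = c("A", "B", "D"))
x <- data.frame(answer = c(1, 1, 1), row.names = c("A", "B", "C"))
y <- data.frame(answer = c(0, 0, 0), row.names = c("A", "C", "D"))
z <- data.frame(answer = c(0, 0, 0, 1), row.names = c("A", "B", "C", "D"))
l <- list(V = v, W = w, X = x, Y = y, Z = z)

# All row names that occur anywhere in the list
ids <- sort(unique(unlist(lapply(l, rownames), use.names = FALSE)))

# Result matrix pre-filled with NA: one row per id, one column per list element
out <- matrix(NA_real_, nrow = length(ids), ncol = length(l),
              dimnames = list(ids, names(l)))

# Fill each column by row name; ids missing from a data frame stay NA
for (nm in names(l)) {
  out[rownames(l[[nm]]), nm] <- l[[nm]]$answer
}
out
```

Wrap the result in as.data.frame(out) if you need a data frame rather than a matrix.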
I have two data frames:
id <- c("a", "b", "c")
a <- 0
b <- 0
c <- 0
df1 <- data.frame(id, a, b, c)
  id a b c
1  a 0 0 0
2  b 0 0 0
3  c 0 0 0
num <- c("a", "c", "c")
partner <- c("b", "b", "a")
value <- c("10", "20", "30")
df2 <- data.frame(num, partner, value)
  num partner value
1   a       b    10
2   c       b    20
3   c       a    30
I'd like to replace zeroes in df1 with df2$value in every instance df1$id==df2$num & colnames(df1)==df2$partner. So the output should look like:
a <- c(0, 0, 30)
b <- c(10, 0, 20)
c <- c(0, 0, 0)
df.nice <- data.frame(id, a, b, c)
  id  a  b c
1  a  0 10 0
2  b  0  0 0
3  c 30 20 0
I can replace individual cells with the following:
df1$b[df1$id=="a"] <- ifelse(df2$num=="a" & df2$partner=="b", df2$value, 0)
but I need to cycle through all possible df1 row/column combinations for a large data frame. I suspect this involves plyr and match together, but can't quite figure out how.
Update
Thanks to @MikeH., I've turned to using reshape. This seems to work:
df.nice <- melt(df2, id=c("num", "partner"))
df.nice <- dcast(df.nice, num ~ partner, value.var="value")
to produce this:
  num    a  b
1   a <NA> 10
2   c   30 20
I do need all possible row/column combinations, however, with the missing ones represented as zero. Is there a way to ask reshape to obtain rows and columns from another data frame (e.g., df1), or should I bind those after reshaping?
If you want to replace values in place (rather than reshape), I think a simple base R solution would be:
idxs <- t(mapply(cbind, match(df2$num, df1$id), match(df2$partner, names(df1))))
df1[idxs] <- df2$value
df1
  id  a  b c
1  a  0 10 0
2  b  0  0 0
3  c 30 20 0
Note that I build the row/column index pairs to replace using t(mapply(...)), which yields a two-column matrix of (row, column) positions. When you index with df1[idxs], the data frame is treated as a matrix (to select specific row/column combinations) and the result is converted back to a data frame.
I had to read in your data using stringsAsFactors = FALSE so the values would be assigned as written (instead of as the numeric factor codes).
Data:
df2 <- data.frame(num, partner, value, stringsAsFactors = FALSE)
df1 <- data.frame(id, a, b, c, stringsAsFactors = FALSE)
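A variation on the same idea, in case the values should stay numeric: convert the value columns to a matrix with row and column names, then index it with a two-column character matrix of (row name, column name) pairs. This sketch assumes value is stored as numbers rather than strings, which differs from the question's data.

```r
id <- c("a", "b", "c")
df1 <- data.frame(id = id, a = 0, b = 0, c = 0, stringsAsFactors = FALSE)
df2 <- data.frame(num = c("a", "c", "c"),
                  partner = c("b", "b", "a"),
                  value = c(10, 20, 30),   # numeric here (assumption)
                  stringsAsFactors = FALSE)

# Numeric matrix of the value columns, with the ids as row names
m <- as.matrix(df1[, -1])
rownames(m) <- df1$id

# Each row of the character matrix names one cell: (row name, column name)
m[cbind(df2$num, df2$partner)] <- df2$value

# Back to a data frame with id as a regular column
df.nice <- data.frame(id = rownames(m), m, row.names = NULL)
df.nice
```

Indexing by name avoids the two match() calls, but it relies on every num appearing in df1$id and every partner appearing among the column names.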
I am trying to generate dummy variables (must be 1/0) using a loop based on the most frequent response of a variable. After lots of googling, I haven't managed to come up with a solution. I have extracted the most frequent responses (strings, say the top 5 are "A","B",...,"E") using
top5 <- names(head(sort(table(data$var1), decreasing = TRUE), 5))
I would like the loop to check whether another variable ("var2") equals A and, if so, set the dummy to 1, otherwise 0, then give a summary using aggregate(). In Stata, I can refer to the looped variable i using `i', but not in R. The code that does not work is:
for(i in top5) {
data$i.dummy <- ifelse(data$var2=="i",1,0)
aggregate(data$i.dummy~data$age+data$year,data,mean)
}
Any suggestions?
If you want one column per item in your top 5, I would use sapply along the elements of top5. There is no need for ifelse, because == already returns TRUE where the comparison holds and FALSE otherwise, and these convert to 1 and 0.
Here we cbind a matrix of 5 columns, one for each element of top5, containing 1 if the row in data$var2 equals the respective element of top5:
data <- cbind( data , sapply( top5 , function(x) as.integer( data$var2 == x ) ) )
If you want one column for matches of any of top5 it's even easier:
data$dummies <- as.integer( data$var2 %in% top5 )
The as.integer() in both cases is used to turn TRUE or FALSE to 1 and 0 respectively.
A cut down example to illustrate how it works:
set.seed(123)
top2 <- c("A","B")
data <- data.frame( var2 = sample(LETTERS[1:4],6,repl=TRUE) )
# Make dummy variables, one column for each element in topX vector
data <- cbind( data , sapply( top2 , function(x) as.integer( data$var2 == x ) ) )
data
# var2 A B
#1 B 0 1
#2 D 0 0
#3 B 0 1
#4 D 0 0
#5 D 0 0
#6 A 1 0
# Make single column for all elements in topX vector
data$ANY <- as.integer( data$var2 %in% top2 )
data
# var2 A B ANY
#1 B 0 1 1
#2 D 0 0 0
#3 B 0 1 1
#4 D 0 0 0
#5 D 0 0 0
#6 A 1 0 1
See fortune(312), then read the help ?"[[" and possibly the help for paste0.
Then possibly consider using other tools like model.matrix and sapply rather than doing everything yourself using loops.
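For the loop as originally written, the [[ hint amounts to roughly the following. It is a sketch with made-up data (the age and year values are invented for the example), and it uses reformulate() to build the aggregate formula from the column name.

```r
# Made-up data standing in for the real data frame
data <- data.frame(
  var2 = rep(c("A", "B", "C", "D"), times = 5),
  age  = rep(c(30, 40), times = 10),
  year = rep(c(2001, 2002), each = 10)
)
top2 <- c("A", "B")

summaries <- list()
for (i in top2) {
  col <- paste0(i, ".dummy")                 # e.g. "A.dummy"
  # data$i.dummy would create a column literally called "i.dummy";
  # [[ takes the name stored in col instead
  data[[col]] <- as.integer(data$var2 == i)
  # Build the formula A.dummy ~ age + year from the column name
  f <- reformulate(c("age", "year"), response = col)
  summaries[[i]] <- aggregate(f, data = data, FUN = mean)
}
summaries$A
```

Storing each summary in a list matters because a bare aggregate() call inside a for loop is evaluated but never printed or kept.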