I have the following dataset:
#dataset
id attributes value
1 a,b,c 1
2 c,d 0
3 b,e 1
I want to pivot it so that each attribute becomes its own binary column: 1 if the attribute appears in that row's attributes string, 0 otherwise. My ideal output would be the following:
#output
id a b c d e Value
1 1 1 1 0 0 1
2 0 0 1 1 0 0
3 0 1 0 0 1 1
Any tip is really appreciated.
We split the 'attributes' column by ',', get the frequency with mtabulate from qdapTools and cbind with the first and third column.
library(qdapTools)
# as.character() guards against 'attributes' being stored as a factor
cbind(df1[1], mtabulate(strsplit(as.character(df1$attributes), ",")), df1[3])
# id a b c d e value
#1 1 1 1 1 0 0 1
#2 2 0 0 1 1 0 0
#3 3 0 1 0 0 1 1
With base R:
attributes <- sort(unique(unlist(strsplit(as.character(df$attributes), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(attributes)), ncol=length(attributes)))
names(cols) <- attributes
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x) {
  present <- unlist(strsplit(x['attributes'], split=','))  # attributes listed in this row
  x[present] <- 1                                          # flag the matching columns
  x
})))[c('id', attributes, 'value')]
df
id a b c d e value
1 1 1 1 1 0 0 1
2 2 0 0 1 1 0 0
3 3 0 1 0 0 1 1
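For comparison, here is a minimal base R sketch of the same idea that skips apply() and keeps the indicator columns as integers. The data frame is rebuilt from the question; the names parts, all_attrs and indicators are illustrative and do not come from the answers above.

```r
# Rebuild the example data from the question
df1 <- data.frame(id = 1:3,
                  attributes = c("a,b,c", "c,d", "b,e"),
                  value = c(1, 0, 1),
                  stringsAsFactors = FALSE)

# Split each delimited string into its individual attributes
parts <- strsplit(df1$attributes, ",", fixed = TRUE)
all_attrs <- sort(unique(unlist(parts)))

# One row per record, one 0/1 column per attribute
indicators <- t(vapply(parts,
                       function(p) as.integer(all_attrs %in% p),
                       integer(length(all_attrs))))
colnames(indicators) <- all_attrs

result <- cbind(df1["id"], indicators, df1["value"])
result
```

vapply() is used here only because it fixes the length and type of each row in advance; sapply() would behave the same on this data.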
I am a beginner in R and want to add dummy variables to a dataset.
I have a data set with a few columns, like the one below:
Dataset1
T1 T2 T3
A C B
A C B
A C B
A D C
B D C
B E F
I want to add dummy variables named dummy,A; dummy,B; dummy,C and so on, each set to 1 if that value is present in any of T1, T2 or T3, and 0 otherwise.
So the final data set should look like -
T1 T2 T3 dummy,A dummy,B dummy,C dummy,D dummy,E dummy,F
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A D C 1 0 1 1 0 0
B D C 0 1 1 1 0 0
B E F 0 1 0 0 1 1
So can anyone please suggest how I can achieve this?
Any help in this regard is really appreciated. Thanks!
We can use mtabulate from qdapTools: transpose 'Dataset1', convert it to a data.frame, apply mtabulate, change the column names (if needed), and cbind the result with the original 'Dataset1'.
library(qdapTools)
d1 <- mtabulate(as.data.frame(t(Dataset1)))
row.names(d1) <- NULL
names(d1) <- paste0("dummy.", names(d1))
cbind(Dataset1, d1)
# T1 T2 T3 dummy.A dummy.B dummy.C dummy.D dummy.E dummy.F
#1 A C B 1 1 1 0 0 0
#2 A C B 1 1 1 0 0 0
#3 A C B 1 1 1 0 0 0
#4 A D C 1 0 1 1 0 0
#5 B D C 0 1 1 1 0 0
#6 B E F 0 1 0 0 1 1
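If installing qdapTools is not an option, a rough base R sketch of the same result could look like the following. Dataset1 is rebuilt from the question; lv, m and dummies are illustrative names.

```r
# Rebuild the example data from the question
Dataset1 <- data.frame(T1 = c("A", "A", "A", "A", "B", "B"),
                       T2 = c("C", "C", "C", "D", "D", "E"),
                       T3 = c("B", "B", "B", "C", "C", "F"),
                       stringsAsFactors = FALSE)

# Every distinct value that occurs in any column
lv <- sort(unique(unlist(Dataset1, use.names = FALSE)))

# For each value, flag the rows where it appears in at least one column
m <- as.matrix(Dataset1)
dummies <- sapply(lv, function(l) as.integer(rowSums(m == l) > 0))
colnames(dummies) <- paste0("dummy.", lv)

result <- cbind(Dataset1, dummies)
result
```

Comparing the whole character matrix against one value at a time keeps the code short; for very wide data a tabulation-based approach may be faster.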
I have a data frame where each Item has three categories (a, b, c) and a numeric Answer (either 0 or 1) is recorded for each category. I would like to create a new column that depends on a group of rows in the Answer column. This is what my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever the Answer is 1 for all three categories of an Item, every row of that Item should receive a 1 in the New column. In any other case, the column should have a 0. The desired output should look like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
group_by(Item)%>%
mutate(New = as.numeric(all(as.logical(Answer))))
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by= Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1
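To make the grouping step explicit, here is a small base R sketch that first computes one flag per Item and then maps it back onto the rows. The data are rebuilt from the question; flag is an illustrative name.

```r
# Rebuild the example data from the question
Item   <- rep(1:3, each = 3)
Option <- rep(c("a", "b", "c"), times = 3)
Answer <- c(1, 1, 0, 1, 0, 1, 1, 1, 1)
df <- data.frame(Item, Option, Answer)

# One logical per Item: TRUE only if every Answer in that Item is 1
flag <- tapply(df$Answer == 1, df$Item, all)

# Look up each row's Item in the per-Item result
df$New <- as.integer(flag[as.character(df$Item)])
df$New
```

Splitting the work into "summarise per group" and "look up per row" is the same thing ave() and group_by() + mutate() do in a single step.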
I have a data frame in R that I need to manipulate (pivot). At the simplest level the first few rows would look like the following:
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A
I would like to pivot this data so that the column names would be InputGrouping and the values would be 1 if it exists and 0 if not. Using above:
Batch Unit Success A B C D
1 1 1 1 0 0 1
2 5 1 1 1 0 0
3 4 0 0 0 1 0
I've looked at reshape/cast but can't figure out if this transformation is possible with the package. Any advice would be very much appreciated.
This is indeed possible using reshape2 with the function dcast().
Recreate your data:
dat <- read.table(header=TRUE, text="
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A")
Now recast the data:
library("reshape2")
dcast(Batch + Unit + Success ~ InputGrouping, data=dat, fun.aggregate = length)
The results:
Using InputGrouping as value column: use value.var to override.
Batch Unit Success A B C D
1 1 1 1 1 0 0 1
2 2 5 1 1 1 0 0
3 3 4 0 0 0 1 0
Here's a possible solution using the data.table package (df is the data frame from the question):
library(data.table)
setDT(df)[, as.list(table(InputGrouping)), by = .(Batch, Unit, Success)]
# Batch Unit Success A B C D
# 1: 1 1 1 1 0 0 1
# 2: 2 5 1 1 1 0 0
# 3: 3 4 0 0 0 1 0
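Both answers above count occurrences, so a group in which the same InputGrouping appears twice would get a 2 rather than a 1. If a strict 0/1 indicator is wanted, one possible base R sketch is to build indicator columns with model.matrix() and take the maximum per group. The names ind and wide are illustrative.

```r
# Recreate the data from the question
dat <- read.table(header = TRUE, text = "
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A")

# One 0/1 column per InputGrouping level (no intercept column)
ind <- model.matrix(~ InputGrouping - 1, data = dat)
colnames(ind) <- sub("^InputGrouping", "", colnames(ind))

# Collapse to one row per Batch/Unit/Success; max() keeps the result binary
wide <- aggregate(ind, by = dat[c("Batch", "Unit", "Success")], FUN = max)
wide
```

The row order of the aggregate() result follows its own sorting of the grouping columns, so it may differ from the order shown in the question.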
I have two data frames:
DATA1:
ID com_alc_cd com_liv_cd com_hyee_cd
A 1 0 0
B 0 0 1
D 0 0 0
C 0 1 0
DATA2:
ID com_alc_dd com_liv_dd com_hyee_dd
B 0 2 0
A 1 0 2
C 0 1 0
D 0 1 0
I want to combine the two data frames, so as to obtain the sum of the two:
SUM(DATA1, DATA2):
ID com_alc com_liv com_hyee
A 2 0 2
B 0 2 1
C 0 2 0
D 0 1 0
Try this, for example (assuming your data frames have the same dimensions and contain the same IDs):
d1 <- DATA1[order(DATA1$ID),]
d2 <- DATA2[order(DATA2$ID),]
data.frame(ID=d1$ID,as.matrix(subset(d1,select=-ID)) +
as.matrix(subset(d2,select=-ID)))
ID com_alc_cd com_liv_cd com_hyee_cd
1 A 2 0 2
2 B 0 2 1
4 C 0 2 0
3 D 0 1 0
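A variation on the same approach aligns the rows with match() instead of sorting both data frames. This is only a sketch under the same assumption (each ID appears exactly once in both data frames); the data are rebuilt from the question, and idx, summed and result are illustrative names.

```r
# Rebuild the example data from the question
DATA1 <- data.frame(ID = c("A", "B", "D", "C"),
                    com_alc_cd  = c(1, 0, 0, 0),
                    com_liv_cd  = c(0, 0, 0, 1),
                    com_hyee_cd = c(0, 1, 0, 0),
                    stringsAsFactors = FALSE)
DATA2 <- data.frame(ID = c("B", "A", "C", "D"),
                    com_alc_dd  = c(0, 1, 0, 0),
                    com_liv_dd  = c(2, 0, 1, 1),
                    com_hyee_dd = c(0, 2, 0, 0),
                    stringsAsFactors = FALSE)

# Position in DATA2 of each ID in DATA1
idx <- match(DATA1$ID, DATA2$ID)

# Add the value columns element-wise after aligning DATA2 to DATA1
summed <- as.matrix(DATA1[-1]) + as.matrix(DATA2[idx, -1])
colnames(summed) <- sub("_cd$", "", colnames(summed))  # drop the suffix

result <- data.frame(ID = DATA1$ID, summed)
result <- result[order(result$ID), ]
result
```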
EDIT: a more general solution
library(reshape2)
## put the data in the long format
res <- do.call(rbind,lapply(list(DATA1,DATA2),melt,id.vars='ID'))
## polish names
res$variable <- gsub('(.*_.*)_.*','\\1',res$variable)
## wide format and aggregate using sum
dcast(ID~variable,data=res,fun.aggregate=sum)
ID com_alc com_hyee com_liv
1 A 2 2 0
2 B 0 1 2
3 C 0 0 2
4 D 0 0 1
You can also use aggregate (here df1 and df2 stand for DATA1 and DATA2):
names(df1) <- names(df2)
df3 <- rbind(df1, df2)
res <- aggregate(df3[,-1], by=list(df3$ID), sum)
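As a self-contained illustration of the aggregate approach, the sketch below rebuilds the two data frames from the question and gives both a shared set of column names before stacking them. The shortened column names and the ID label on the grouping column are choices made for this example.

```r
# Rebuild the example data from the question
df1 <- data.frame(ID = c("A", "B", "D", "C"),
                  com_alc_cd  = c(1, 0, 0, 0),
                  com_liv_cd  = c(0, 0, 0, 1),
                  com_hyee_cd = c(0, 1, 0, 0),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("B", "A", "C", "D"),
                  com_alc_dd  = c(0, 1, 0, 0),
                  com_liv_dd  = c(2, 0, 1, 1),
                  com_hyee_dd = c(0, 2, 0, 0),
                  stringsAsFactors = FALSE)

# rbind() needs identical column names in both data frames
common <- c("ID", "com_alc", "com_liv", "com_hyee")
names(df1) <- common
names(df2) <- common

# Stack the rows, then sum every value column within each ID
df3 <- rbind(df1, df2)
res <- aggregate(df3[-1], by = list(ID = df3$ID), FUN = sum)
res
```

Unlike the element-wise sum, this version does not require both data frames to contain the same set of IDs.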
I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not contain complete information, by which I mean all cases that are not repeated 8 times. Can someone help me find a solution?
Below is an example: I have three subjects (A, B, and C). I have 8 observations for A and B, but only 2 for C. How can I delete the rows for C, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
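The comparison with "8" works because ave() returns a vector of the same type as its first argument, so the counts come back as character. A numeric variant, sketched here on a reduced stand-in for the data (only the subject column and one made-up value column), passes a row index as the first argument instead. The names n_per_id and complete are illustrative.

```r
# Reduced stand-in for the example: 8 rows for A and B, 2 rows for C
temp <- data.frame(V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)),
                   V2 = 1,
                   stringsAsFactors = FALSE)

# Number of rows in each subject's group, repeated for every row
n_per_id <- ave(seq_along(temp$V1), temp$V1, FUN = length)

# Keep only subjects with exactly 8 measurements
complete <- temp[n_per_id == 8, ]
table(complete$V1)
```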
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])