Reshape Panel Data Wide Format to Long Format - r

I am struggling with transformation of a Panel Dataset from wide to long format. The Dataset looks like this:
ID | KP1_430a | KP1_430b | KP1_430c | KP2_430a | KP2_430b | KP2_430c | KP1_1500a | ...
1 ....
2 ....
KP1; KP2 up to KP7 describe the Waves.
a,b up to f describe a specific Item. (E.g. left to right right placement of Party a)
I would like to have this data in long format. Like this:
ID | Party | Wave | 430 | 1500
1 1 1 .. ..
1 2 1 .. ..
. . .
1 1 2 .. ..
. . .
2 1 1 .. ..
I tried to use the reshape function. But I had problems reshaping it over time and over the parties simultaneously.
Here is a small data.frame example.
data <- data.frame(matrix(rnorm(10),2,10))
data[,1] <- 1:2
names(data) <- c("ID","KP1_430a" , "KP1_430b" , "KP1_430c" , "KP2_430a" , "KP2_430b ", "KP2_430c ", "KP1_1500a" ,"KP1_1500b", "KP1_1500c")
And this is how far I got.
data_long <- reshape(data,varying=list(names(data)[2:4],names(data)[5:7], names(data[8:10]),
v.names=c("KP1_430","KP2_430","KP1_1500"),
direction="long", timevar="Party")
The question remains: how I can get the time varying variables in long format as well? And is there a more elegant way to reshape this data? In the code above I would have to enter the names (names(data)[2:4]) for each wave and variable. With this small data.frame it is Ok, but the Dataset is a lot larger.
EDIT: How this transformation could be done by hand: I actually have done this, which leaves me with a page-long code file.
First, Bind KP1_430a and KP1_1500a with IDs, Time=1 and Party=1 column wise. Second create the same object for all parties [b-f], changing the party index respectively, and append it row-wise. Do step one and two for the rest of the waves [2-7], respectively changing party and time var, and append them row-wise.

It is usually easier to proceed in two steps: first use melt to put your data into a "tall" format (unless it is already the case) and then use dcast to convert ti to a wider format.
library(reshape2)
library(stringr)
# Tall format
d <- melt(data, id.vars="ID")
# Process the column containing wave and party
d1 <- str_match_all(
as.character( d$variable ),
"KP([0-9])_([0-9]+)([a-z])"
)
d1 <- do.call( rbind, d1 )
d1 <- d1[,-1]
colnames(d1) <- c("wave", "number", "party")
d1 <- as.data.frame( d1)
d <- cbind( d, d1 )
# Convert to the desired format
d <- dcast( d, ID + wave + party ~ number )

At the moment your Wave data is in your variable names and you need to extract it with some string processing. I had no trouble with melt
mdat <- melt(data, id.vars="ID")
mdat$wave=sub("KP", "", sub("_.+$", "", mdat$variable)) # remove the other stuff
mdat
Your description is too sketchy (so far) for me to figure out the rule for deriving a "Party" variable, so perhaps you can edit you question to show how that might be done by a human being .... and then we can show the computer how to to do it.
EDIT: If the last lower-case letter in the original column names is Party as Vincent thinks, then you could trim the trailing spaces in those names and extract:
mdat$var <- sub("\\s", "", (as.character(mdat$variable)))
mdat$party=substr( mdat$var, nchar(mdat$var), nchar(mdat$var))
#--------------
> mdat
ID variable value wave party var
1 1 KP1_430a 0.7220627 1 a KP1_430a
2 2 KP1_430a 0.9585243 1 a KP1_430a
3 1 KP1_430b -1.2954671 1 b KP1_430b
4 2 KP1_430b 0.3393617 1 b KP1_430b
5 1 KP1_430c -1.1477627 1 c KP1_430c
6 2 KP1_430c -1.0909179 1 c KP1_430c
<snipped output>

Related

Using value in 1 column to fill in values in 2 other columns

When entering behavior data in a different system, I wrote the subjects in a form such as 3-2 (to mean rank 3 to rank 2). I exported these to Excel, which took these entries as dates (so 2-Mar for this example).
I now have thousands of entries in this format. I have added two columns ("Actor" and "Recipient") and would like to fill in the rank numbers for these, based on what is in the "Subject" column.
A couple of lines of what I'm hoping my R output will give me:
Subject Actor Recipient
2-Mar 3 2
5-Jun 6 5
6-Feb 2 6
etc.
So I already have the "Subject" columns and need help figuring out code to fill in the "Actor" and "Recipient" columns. Rank numbers only go up to 6.
I've tried a couple of things but just keep getting error messages... Any help with this would be GREATLY appreciated!
Here you can use tstrsplit() after converting to date format
# Recreate your data
x <- data.frame("Subject" = c("2-Mar", "5-Jun", "6-Feb"))
# Change the format of your Subject coumn
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%d-%b"), "%m %d")
# Split into the two strings
library(data.table) # to get tstrsplit() function
x[, c("Actor", "Recipient")] <- tstrsplit(x[, "Subject"], " ")
# Convert to numeric
x[, "Actor"] <- as.numeric(x[, "Actor"])
x[, "Recipient"] <- as.numeric(x[, "Recipient"])
This returns
> x
Subject Actor Recipient
1 02 03 3 2
2 05 06 6 5
3 06 02 2 6
And if you want Subject in its original format
# Return Subject to original format
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%m %d"), "%d-%b")
Giving
> x
Subject Actor Recipient
1 02-Mar 3 2
2 05-Jun 6 5
3 06-Feb 2 6
Explained:
Your vector/variable "Subject" was imported as a character-type atomic vector (atomic vectors are a 1 dimensional structure of one or more elements, where all elements must be the same type). The solution was to convert that something that R would interpret as a date using the as.POSIXct(..., format = "...") function, where format is telling R how the string is formatted (see codes here). I then wrapped that in the format() function, telling it to change the format to numeric months. That was then split into two columns using the tstrsplit() function, but R interpreted those as character-type data, so I converted them using the as.numeric() function to double-type data.
You could convert Subject to date and extract month and year from it.
temp <- as.Date(df$Subject, "%d-%b")
df$Actor <- as.integer(format(temp, "%m"))
df$Recipient <- as.integer(format(temp, "%d"))
df
# Subject Actor Recipient
#1 2-Mar 3 2
#2 5-Jun 6 5
#3 6-Feb 2 6
This can also be done using lubridate functions.
df$Actor <- month(temp)
df$Recipient <- day(temp)

Split string using delimiter alternatively

I have a list of urls like this:
mydata <- read.table(header=TRUE, text="
Id
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4
")
I need to pull out value of parameters like the brand, verticalcolorfamily, q= etc from the urls. These parameters are the filters applied on the website.
The output which i am looking for is a data frame with three columns:parameter,value and the frequency of occurrence of the value. For Ex:
parameter | value | frequency
----------|----------------|----------
brand | FLYING+MACHINE | 2
q= | relevance | 5
price | 300%2C10500 | 2
brand | BASICS | 1
Currently i am able to think of is to collect each urls as a string vectors separated by alternating values of "%3A" as a delimiter:[q=%3Arelevance ,brickpattern%3ADecorative%2FArt+Deco,brickpattern%3AFloral , brickpattern%3AGeometric , brickpattern%3AGraphic , brickpattern%3ATropical , price%3A300%2C10500].
Then place each element in a column of a data frame and then again split by '%3A' and do a group by.
Suggestions on an other approach will be really appreciated.
Also if i am supposed to use this approach i am unaware of the method of using alternating '%3A' as delimiter .
urltools looks like an awesome package for what you want to do. Here's a hacked answer in the meantime. Starting with your data.frame:
# Convert to character list
# Get rid of url
# Split by "%3A" and convert to "long" list
L <- as.character(mydata$Id)
L <- gsub("https://www.example.com/dp/c/830216013\\?", "", L)
L <- unlist(strsplit(L, "%3A"))
head(L)
[1] "q=" "relevance" "brickpattern"
[4] "Decorative%2FArt+Deco" "brickpattern" "Floral"
Then:
# Convert to 2-column data frame
# Count unique parameter:value pairs
df <- data.frame(parameter = L[seq(1,length(L),2)], value = L[seq(2,length(L),2)]) %>%
group_by(parameter, value) %>%
summarize(frequency=sum(!is.na(value)))
I will show only the following entries where frequency >= 2:
# Show only entries with frequency >= 2
filter(df, frequency >= 2)
parameter value frequency
<fctr> <fctr> <int>
1 brand Celio 2
2 bricksleeve Short 2
3 q= relevance 6
4 verticalcolorfamily Black 2
5 verticalcolorfamily White 2
Note that brand::FLYING+MACHINE != 2 because FLYING+MACHINE occurs as FLYING%20MACHINE and FLYING+MACHINE.

Custom function within subset of data, base functions, vector output?

Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)

R: losing column names when adding rows to an empty data frame

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.
The rbind help pages specifies that :
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
was almost surrendering to this issue.
1) create data frame with stringsAsFactor set to FALSE or you run straight into the next issue
2) don't use rbind - no idea why on earth it is messing up the column names. simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warnmeldungen:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)
FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
> X1 X2
> 1 1 2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"
You can do this:
give one row to the initial data frame
df=data.frame(matrix(nrow=1,ncol=length(newrow))
add your new row and take out the NAS
newdf=na.omit(rbind(newrow,df))
but watch out that your newrow does not have NAs or it will be erased too.
Cheers
Agus
I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg
Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND coverts all variables to factors
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95
Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.

R: Stacking Multiple Punch Question Data

Suppose we have 2 questions in a survey, one is about how likely an individual is to recommend a company (let's say there's 2 companies for simplicity).
So, I have one data.frame with 2 columns for this question:
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")
And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.
So, I have another data.frame with 4 columns for this question:
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"Attribute1.Company1",
"Attribute2.Company1",
"Attribute1.Company2",
"Attribute2.Company2")
Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelyhood to recommend question, for all companies (company independent). Just to get an idea of what inertia lies between those people that are highly likely to recommend and attribute 1 for example.
So, I start off by binding the two questions together:
df <- cbind(df.recommend, df.attribute)
My problem is trying to figure out how to stack these data such that the columns look something like:
df.stacked <- data.frame(c(df$Company1,df$Company2),
c(df$Attribute1.Company1,df$Attribute1.Company2),
c(df$Attribute2.Company1,df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood","Attribute1","Attribute2")
This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.
Could you think of a way to stack them effectively, without having to type out all the c() statements?
Note: The column pattern for likelyhood is Co1,Co2,Co3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...
For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).
Here is what I believe to be a complete solution in about a dozen lines of code.
First, create some data
library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)
# Create data frame
# Important: note the addition of a respondent id column
df_comp <- data.frame(
RespID = 1:10,
Company1 = rep(1:5, 2),
Company2 = rep(1:5, 2)
)
df_attr <- data.frame(
RespID = 1:10,
Attribute1.Company1 = rep(0:1,5),
Attribute2.Company1 = rep(1:0,5),
Attribute1.Company2 = rep(0:1,5),
Attribute2.Company2 = rep(1:0,5)
)
Now start the data manipulation:
# Use melt to convert data from wide to tall
melt_comp <- melt(df_comp, id.vars="RespID")
melt_comp <- rename(melt_comp, c(variable="comp", value="likelihood"))
melt_attr <- melt(df_attr, id.vars="RespID")
# Use str_split to split attribute variables into attribute and company
# "." period needs to be escaped
# EDIT 2: reshape::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names=c("attr", "comp"))
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL
# Use cast to convert from tall to somewhat tall
cast_attr <- cast(melt_attr, RespID + comp ~ attr, mean)
# Combine data frames using join() in package plyr
df <- join(melt_comp, cast_attr)
head(df)
And the output:
RespID comp likelihood Attribute1 Attribute2
1 1 Company1 1 0 1
2 2 Company1 2 1 0
3 3 Company1 3 0 1
4 4 Company1 4 1 0
5 5 Company1 5 0 1
6 6 Company1 1 1 0
Something I quickly cooked up. Doesn't look the best and uses a for-loop but that shouldn't be a problem with only 24 values
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"At1.Co1",
"At2.Co1",
"At1.Co2",
"At2.Co2")
df.stacked <- data.frame(
likelihood <- unlist(df.recommend)
)
str <- strsplit(names(df.attribute),split="\\.")
atts <- unique(sapply(str,function(x)x[1]))
for (i in 1:length(atts))
{
df.stacked[,i+1] <- unlist(df.attribute[sapply(str,function(x)x[1]==atts[i])])
}
names(df.stacked) <- c("likelihood",paste("attribute",1:length(atts),sep=""))
EDIT: It assumes that companies are in the same order for each attribute

Resources