All Levels of a Factor in a Model Matrix in R - r

I have a data.frame consisting of numeric and factor variables as seen below.
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
stringsAsFactors=TRUE) # needed in R >= 4.0 so that Fourth and Fifth are factors
I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.
model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)
As expected when running lm this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet so I am not worried about multicollinearity.
Is there a way to have model.matrix create the dummy for every level of the factor?

(Trying to redeem myself...) In response to Jared's comment on @fabians's answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a factor and returns its contrast matrix, so we can use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided:
> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
Alice Bob Charlie David
Alice 1 0 0 0
Bob 0 1 0 0
Charlie 0 0 1 0
David 0 0 0 1
$Fifth
Edward Frank Georgia Hank Isaac
Edward 1 0 0 0 0
Frank 0 1 0 0 0
Georgia 0 0 1 0 0
Hank 0 0 0 1 0
Isaac 0 0 0 0 1
Which slots nicely into @fabians's answer:
model.matrix(~ ., data=testFrame,
contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
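A minimal, self-contained sketch of the same idea, assuming you would rather pick out the factor columns programmatically than by position. The names is_fac and full_contrasts are only illustrative, and the data frame is a cut-down stand-in for testFrame:

```r
set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# flag the factor columns instead of hard-coding 4:5
is_fac <- vapply(testFrame, is.factor, logical(1))

# one identity ("no contrasts") matrix per factor, as a named list
full_contrasts <- lapply(testFrame[, is_fac, drop = FALSE],
                         contrasts, contrasts = FALSE)

mm <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)
colnames(mm)
```

The intercept column is still there; drop it with mm[, -1] if it is not wanted.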

You need to reset the contrasts for the factor variables:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F),
Fifth=contrasts(testFrame$Fifth, contrasts=F)))
or, with a little less typing and without the proper names:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)),
Fifth=diag(nlevels(testFrame$Fifth))))

caret implemented a nice function dummyVars to achieve this with 2 lines:
library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))
Checking the final columns:
colnames(testFrame2)
"First" "Second" "Third" "Fourth.Alice" "Fourth.Bob" "Fourth.Charlie" "Fourth.David" "Fifth.Edward" "Fifth.Frank" "Fifth.Georgia" "Fifth.Hank" "Fifth.Isaac"
The nice part is that you get the original numeric columns back together with the dummy variables, while the original factor columns used for the transformation are dropped.
More info: http://amunategui.github.io/dummyVar-Walkthrough/

dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

Ok. Just reading the above and putting it all together. Suppose you wanted the matrix e.g. 'X.factors' that multiplies by your coefficient vector to get your linear predictor. There are still a couple extra steps:
X.factors =
model.matrix( ~ ., data=X, contrasts.arg =
lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
contrasts, contrasts = FALSE))
(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)
Then say you get something like this:
attr(X.factors,"assign")
[1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added
We want to get rid of the **'d reference levels of each factor
att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

A tidyverse answer:
library(dplyr)
library(tidyr)
result <- testFrame %>%
mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>%
mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")
yields the desired result (same as @Gavin Simpson's answer):
> head(result, 6)
First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1 1 5 4 0 0 1 0 0 1 0 0 0
2 1 14 10 0 0 0 1 0 0 1 0 0
3 2 2 9 0 1 0 0 1 0 0 0 0
4 2 5 4 0 0 0 1 0 1 0 0 0
5 2 13 5 0 0 1 0 1 0 0 0 0
6 2 15 7 1 0 0 0 1 0 0 0 0

Using the R package 'CatEncoders'
library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))
fit <- OneHotEncoder.fit(testFrame)
z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

I am currently learning the Lasso model with glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, model.matrix() can be very slow, as noted by the author of glmnet).
Just sharing a tidy way to get the same result as @fabians's and @Gavin's answers. @asdf123 has also introduced another package, CatEncoders.
> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
>
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))
Source: R for Everyone: Advanced Analytics and Graphics (page 273)
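Since Matrix::sparse.model.matrix() is mentioned above, here is a rough sketch of how the contrasts.arg trick from the other answers might carry over to the sparse case. It assumes the Matrix package is installed; full_contrasts is just an illustrative name and the data is a reduced version of testFrame:

```r
library(Matrix)

set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# identity contrast matrices so that no level is treated as the reference
full_contrasts <- lapply(testFrame[, c("Fourth", "Fifth")],
                         contrasts, contrasts = FALSE)

# same formula interface as model.matrix(), but the result is a sparse matrix
sm <- sparse.model.matrix(~ ., data = testFrame,
                          contrasts.arg = full_contrasts)
colnames(sm)
```

The resulting sparse matrix can be passed to glmnet in the same way as a dense one.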

I wrote a package called ModelMatrixModel to improve the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns an object containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned object also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.
#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 7 17 1 0 0 0
## 2 9 7 0 1 0 0
#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 6 3 0 1 0 0
## 2 2 12 0 0 1 0

You can use tidyverse to achieve this without specifying each column manually.
The trick is to make a "long" dataframe.
Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.
Code:
library(tidyverse)
## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)
testFrame %>%
## pivot to "long" format
gather(feature, value, -id) %>%
## add indicator value
mutate(indicator=1) %>%
## create feature name that unites a feature and its value
unite(feature, value, col="feature_value", sep="_") %>%
## convert to wide format, filling missing values with zero
spread(feature_value, indicator, fill=0)
The output:
id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1 1 1 0 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0
3 3 0 0 1 0 0 0 0 0
4 4 0 0 0 1 0 0 0 0
5 5 0 0 0 0 1 0 0 0
6 6 1 0 0 0 0 0 0 0
7 7 0 1 0 0 0 0 1 0
8 8 0 0 1 0 0 1 0 0
9 9 0 0 0 1 0 0 0 0
10 10 0 0 0 0 1 0 0 0
11 11 1 0 0 0 0 0 0 0
12 12 0 1 0 0 0 0 0 0
...

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)
or
model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)
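should be the most straightforward, though note that dropping the intercept only keeps every level of the first factor in the formula (Fourth here); later factors such as Fifth still lose their reference level.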
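One thing worth checking with the no-intercept formula is which indicator columns actually come out. A small sketch on a stand-in for testFrame (only the two factor columns, for brevity):

```r
testFrame <- data.frame(
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# "- 1" removes the intercept, so the first factor keeps all of its levels
mm <- model.matrix(~ Fourth + Fifth - 1, data = testFrame)
colnames(mm)

# the second factor is still coded with treatment contrasts
"FifthEdward" %in% colnames(mm)
```

So for a data set with several factors, the contrasts.arg approach is the one that gives a column for every level of every factor.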

Related

Creating a boolean data frame from a data frame in R

I have a data frame and I want to create a boolean data frame from it. I want every unique value of every column in the original data frame to become a column name in the boolean data frame. To show it using an example:
mydata =
sex route
m oral
f oral
m topical
f unknown
Then, I want to create
m f oral topical unknown
1 0 1 0 0
0 1 1 0 0
1 0 0 1 0
0 1 0 0 1
I am using the code below to create the boolean data frame. It works in R but not in shiny. What could be the problem?
col_names=c()
for(i in seq(1,ncol(mydata))){
col_names=c(col_names,unique(mydata[i]))
}
col_names= as.vector(unlist(col_names))
my_boolean= data.frame(matrix(0, nrow = nrow(mydata), ncol = length(col_names)))
colnames( my_boolean)=col_names
for(i in seq(1,nrow(mydata))){
for(j in seq(1,ncol(mydata)))
{
my_boolean[i,which(mydata[i,j]==colnames(my_boolean))]=1
}}
There are a few ways you can do this, but I always find table the easiest to understand. Here's an approach with table:
do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x)))
## f m oral topical unknown
## 1 0 1 1 0 0
## 2 1 0 1 0 0
## 3 0 1 0 1 0
## 4 1 0 0 0 1
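A self-contained sketch of the table() approach, assuming mydata holds the two character columns shown in the question. Wrapping each column in factor(x, levels = unique(x)) is an addition here, so that the columns appear in order of first appearance (m before f) as in the desired output rather than alphabetically:

```r
mydata <- data.frame(
  sex   = c("m", "f", "m", "f"),
  route = c("oral", "oral", "topical", "unknown"),
  stringsAsFactors = FALSE
)

indicators <- do.call(cbind, lapply(mydata, function(x) {
  f <- factor(x, levels = unique(x))  # keep first-appearance order
  table(seq_along(f), f)              # one row per observation, one column per value
}))
indicators
```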

how to subset a data frame based on list of multiple match case in columns

So I have a list that contains certain characters as shown below
list <- c("MY","GM+" ,"TY","RS","LG")
And I have a variable named "CODE" in the data frame as follows
code <- c("MY GM+","","LGTY", "RS","TY")
df <- data.frame(1:5,code)
df
code
1 MY GM+
2
3 LGTY
4 RS
5 TY
Now I want to create 5 new variables named "MY","GM+","TY","RS","LG",
each taking a binary value: 1 if that string occurs in the CODE variable and 0 otherwise.
df
code MY GM+ TY RS LG
1 MY GM+ 1 1 0 0 0
2 0 0 0 0 0
3 LGTY 0 0 1 0 1
4 RS 0 0 0 1 0
5 TY 0 0 1 0 0
Really appreciate your help. Thank you.
Since you know how many values will be returned (5), and what you want their types to be (integer), you could use vapply() with grepl(). We can turn the resulting logical matrix into integer values by using integer() in vapply()'s FUN.VALUE argument.
cbind(df, vapply(List, grepl, integer(nrow(df)), df$code, fixed = TRUE))
# code MY GM+ TY RS LG
# 1 MY GM+ 1 1 0 0 0
# 2 0 0 0 0 0
# 3 LGTY 0 0 1 0 1
# 4 RS 0 0 0 1 0
# 5 TY 0 0 1 0 0
I think your original data has a couple of typos, so here's what I used:
List <- c("MY", "GM+" , "TY", "RS", "LG")
df <- data.frame(code = c("MY GM+", "", "LGTY", "RS", "TY"))
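The fixed = TRUE argument matters here because "GM+" contains a regular-expression metacharacter. A short illustration (the codes vector is made up for the purpose):

```r
codes <- c("MY GM+", "GM", "LGTY")

# as a regular expression, "GM+" means "G followed by one or more M"
as_regex <- grepl("GM+", codes)

# with fixed = TRUE the pattern is matched literally, plus sign included
as_fixed <- grepl("GM+", codes, fixed = TRUE)

rbind(as_regex, as_fixed)
```

Only the literal match rejects the bare "GM" entry.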

Looking for a more concise way to recategorise a variable

I have a vector of integer ages that I want to turn into multiple categories:
ages <- round(runif(10, 0, 99))
Now I want this variable to be binned into three categories, depending on age. I want an output object, ages.cat to look like this:
young mid old
1 0 0 1
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 0 1 0
7 1 0 0
8 0 0 1
9 0 1 0
10 0 1 0
At present I am creating this object with the following code:
ages.cat <- array(0, dim=c(10,3)) # create categorical object for 3 bins
ages.cat[ages < 30, 1] <- 1
ages.cat[ages >= 30 & ages < 60, 2] <- 1
ages.cat[ages >= 60, 3] <- 1
ages.cat <- data.frame(ages.cat)
names(ages.cat) <- c("young", "mid", "old")
There must be a faster and more concise way to recode this data. I had a play with dplyr but couldn't see a solution to this particular problem with its functions. Any ideas? What would be the 'canonical' solution to this problem in base R or using a package? Whatever the alternatives, I'm certain they'll be more concise than my clunky code!
It's two one-liners.
Use cut to create a factor:
ages <- round(runif(10, 0, 99))
ageF=cut(ages,c(-Inf,30,60,Inf),labels=c("young","mid","old"),right=FALSE) # right=FALSE gives [30,60), matching ages >= 30 & ages < 60
> ageF
[1] young mid young young old mid old young old old
Levels: young mid old
Usually you'd leave that as a factor and work with it; if you are using R's modelling functions, they'll work out the matrix for you. But if you are doing it yourself:
Use model.matrix to create the matrix, with a -1 to remove the intercept and create columns for each level:
> m = model.matrix(~ageF-1)
> m
ageFyoung ageFmid ageFold
1 1 0 0
2 0 1 0
3 1 0 0
4 1 0 0
5 0 0 1
6 0 1 0
7 0 0 1
8 1 0 0
9 0 0 1
10 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$ageF
[1] "contr.treatment"
You can ignore all the contrasty stuff at the end; it's just a matrix with some extra attributes for modelling.
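Because the question's bins are closed on the left (an age of exactly 30 should be "mid"), it is worth checking how cut() treats values that sit on a break. A small sketch with hand-picked boundary ages:

```r
edge_ages <- c(29, 30, 59, 60)
breaks <- c(-Inf, 30, 60, Inf)
labs <- c("young", "mid", "old")

# default right = TRUE: intervals are (a, b], so 30 falls in the first bin
default_bins <- cut(edge_ages, breaks, labels = labs)

# right = FALSE: intervals are [a, b), so 30 starts the second bin
left_closed <- cut(edge_ages, breaks, labels = labs, right = FALSE)

data.frame(edge_ages, default_bins, left_closed)
```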
Try this:
library(dplyr)
library(reshape2) # provides dcast()
ages <-
data.frame(ages = round(runif(10, 0, 99))) %>% # %.% is no longer available in dplyr; use %>%
mutate(id = 1:n(),
cat = factor(ifelse(ages < 30, "young",
ifelse(ages >= 30 & ages < 60,
"mid", "old")))) %>%
dcast(id ~ cat, value.var = 'ages', fun.aggregate = length)

R: Converting multiple binary columns into one factor variable whose factors are binary column names

I am a new R user. Currently I am working on a dataset in which I have to transform multiple binary columns into a single factor column.
Here is the example:
The current dataset looks like this:
$ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
$ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
$ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
$ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
1 0 0 0
0 1 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Recoded column should be:
Property
1 Real estate
2 Insurance
3 Real estate
4 Insurance
5 CarOther
6 Unknown
It is basically a reverse of melt.matrix function.
Thank you all for your input. It does work.
One issue remains, though:
I have some rows which are all zeros:
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
0 0 0 0
I want these to be marked as NA or NULL.
It would help if you could suggest something for this as well.
Thank you
> mat <- matrix(c(0,1,0,0,0,
+ 1,0,0,0,0,
+ 0,0,0,1,0,
+ 0,0,1,0,0,
+ 0,0,0,0,1), ncol = 5, byrow = TRUE)
> colnames(mat) <- c("Level1","Level2","Level3","Level4","Level5")
> mat
Level1 Level2 Level3 Level4 Level5
[1,] 0 1 0 0 0
[2,] 1 0 0 0 0
[3,] 0 0 0 1 0
[4,] 0 0 1 0 0
[5,] 0 0 0 0 1
Create a new factor based upon the index of each 1 in each row
Use the matrix column names as the labels for each level
NewFactor <- factor(apply(mat, 1, function(x) which(x == 1)),
labels = colnames(mat))
> NewFactor
[1] Level2 Level1 Level4 Level3 Level5
Levels: Level1 Level2 Level3 Level4 Level5
also you can try:
factor(mat%*%(1:ncol(mat)), labels = colnames(mat))
You can also use Tomas's solution, which I found somewhere on SO:
as.factor(colnames(mat)[mat %*% 1:ncol(mat)])
Melt is certainly a solution. I'd suggest using the reshape2 melt as follows:
library(reshape2)
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
#add id column (presumably you have ids more meaningful than row numbers)
df$row=1:nrow(df)
#melt to "long" format
long=melt(df,id="row")
#only keep 1's
long=long[which(long$value==1),]
#merge in ids for NA entries
long=merge(df[,"row",drop=F],long,all.x=T)
#clean up to match example output
long=long[order(long$row),"variable",drop=F]
names(long)="Property"
long$Property=gsub("Property.","",long$Property,fixed=T)
#results
long
Alternately, you can just do it in the naïve way. I think it's more transparent than any of the other suggestions (including my other suggestion).
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
propcols=c("Property.RealEstate", "Property.Insurance", "Property.CarOther", "Property.Unknown")
df$Property=NA
for(colname in propcols)({
coldata=df[,colname]
df$Property[which(coldata==1)]=colname
})
df$Property=gsub("Property.","",df$Property,fixed=T)
Something different:
Get the data:
dat <- data.frame(Property.RealEstate=c(1,0,1,0,0,0),Property.Insurance=c(0,1,0,1,0,0),Property.CarOther=c(0,0,0,0,1,0),Property.Unknown=c(0,0,0,0,0,1))
Reshape it:
names(dat)[row(t(dat))[t(dat)==1]]
#[1] "Property.RealEstate" "Property.Insurance" "Property.RealEstate"
#[4] "Property.Insurance" "Property.CarOther" "Property.Unknown"
If you want it cleaned up, do:
gsub("Property\\.","",names(dat)[row(t(dat))[t(dat)==1]])
#[1] "RealEstate" "Insurance" "RealEstate" "Insurance" "CarOther" "Unknown"
If you prefer a factor output:
factor(row(t(dat))[t(dat)==1],labels=names(dat))
...and cleaned up:
factor(row(t(dat))[t(dat)==1],labels=gsub("Property\\.","",names(dat)) )
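For the follow-up about rows that are all zeros, one possible sketch uses base R's max.col() and then blanks out the empty rows. The small matrix below is made up for illustration and assumes at most one 1 per row:

```r
mat <- rbind(c(1, 0, 0, 0),
             c(0, 1, 0, 0),
             c(0, 0, 0, 0),   # no property recorded
             c(0, 0, 0, 1))
colnames(mat) <- c("RealEstate", "Insurance", "CarOther", "Unknown")

# position of the 1 in each row (for an all-zero row this is a meaningless tie)
idx <- max.col(mat, ties.method = "first")

# rows without any 1 get NA instead of a spurious level
idx[rowSums(mat) == 0] <- NA

Property <- factor(colnames(mat)[idx], levels = colnames(mat))
Property
```

Setting levels explicitly keeps unused categories such as CarOther as levels of the factor.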

splitting dataframe with collated points in to individuals in R

I have a dataframe (.txt) which looks like this [where "dayX" = the day of death in a survival assay in fruit flies, the numbers beneath are the number of flies that died in that treatment combination on that day, X or A are treatments, m & f are also treatments, the first number is the line, the second number is the block]
line day1 day2 day3 day4 day5
1 Xm1.1 0 0 0 2 0
2 Xm1.2 0 0 1 0 0
3 Xm2.1 1 1 0 0 0
4 Xm2.2 0 0 0 3 1
5 Xf1.1 0 3 0 0 1
6 Xf1.2 0 0 1 0 0
7 Xf2.1 2 0 2 0 0
8 Xf2.2 1 0 1 0 0
9 Am1.1 0 0 0 0 2
10 Am1.2 0 0 1 0 0
11 Am2.1 0 2 0 0 1
12 Am2.2 0 2 0 0 0
13 Af1.1 3 0 0 1 0
14 Af1.2 0 1 3 0 0
15 Af2.1 0 0 0 1 0
16 Af2.2 1 0 0 0 0
and want it to become this using R->
XA mf line block individual age
1 X m 1 1 1 4
2 X m 1 1 2 4
3 X m 1 2 1 3
and so on...
the resulting dataframe collects the "age" value from the day the individual died, as scored in the upper dataframe, for example there were two flies that died on the 4th day (day4) in treatment Xm1.1 therefore R creates two rows, one containing information extracted regarding the first individual and thus being labelled as individual "1", then another row with the same information except labelled as individual "2".. if a 3rd individual died in the same treatment on day 5, there would be a third row which is the same as the above two rows except the "age" would be "5" and individual would be "3". When it moves on to the next treatment row, in this case Xm1.2, the first individual to die within that treatment set would be labelled as individual "1" (which in this case dies on day 3). In my example there is a total of 38 deaths, therefore I am trying to get R to build a df which is 38*6 (excl. headers).
is there a way to take my dataframe [the real version is approx 50*640 with approx 50 individuals per unique combination of X/A, m/f, line (1:40), block (1-4) so ~32000 individual deaths] to an end dataframe of 6*~32000 in an automated way?
both of these example dataframes can be built using this code if it helps you to try out solutions:
test<-data.frame(1:16);colnames(test)=("line")
test$line=c("Xm1.1","Xm1.2","Xm2.1","Xm2.2","Xf1.1","Xf1.2","Xf2.1","Xf2.2","Am1.1","Am1.2","Am2.1","Am2.2","Af1.1","Af1.2","Af2.1","Af2.2")
test$day1=rep(0,16);test$day2=rep(0,16);test$day3=rep(0,16);test$day4=rep(0,16);test$day5=rep(0,16)
test$day4[1]=2;test$day3[2]=1;test$day2[3]=1;test$day4[4]=3;test$day5[5]=1;
test$day3[6]=1;test$day1[7]=2;test$day1[8]=1;test$day5[9]=3;test$day3[10]=1;
test$day2[11]=2;test$day2[12]=2;test$day4[13]=1;test$day3[14]=3;test$day4[15]=1;
test$day1[16]=1;test$day3[7]=2;test$day3[8]=1;test$day2[5]=3;test$day1[3]=1;
test$day5[11]=1;test$day5[9]=2;test$day5[4]=1;test$day1[13]=3;test$day2[14]=1;
test2=data.frame(rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3))
colnames(test2)=c("XA","mf","line","block","individual","age")
test2$XA[1]="X";test2$mf[1]="m";test2$line[1]=1;test2$block[1]=1;test2$individual[1]=1;test2$age[1]=4;
test2$XA[2]="X";test2$mf[2]="m";test2$line[2]=1;test2$block[2]=1;test2$individual[2]=2;test2$age[2]=4;
test2$XA[3]="X";test2$mf[3]="m";test2$line[3]=1;test2$block[3]=2;test2$individual[3]=1;test2$age[3]=3;
apologies for the awfully long way of making this dummy dataset, suffering from sleep deprivation and jetlag and haven't used R for months, if you run the code in R you will hopefully see better what I aim to do
-------------------------------------------------------------------------------------
By Rg255:
Currently stuck at this, derived from @Arun's answer (I have added the strsplit(as.character(dt$line), "") section to get around one error)
df=read.table("C:\\Users\\...\\data.txt",header=T)
require(data.table)
head(df[1:20])
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
Produces the following output:
> df=read.table("C:\\Users\\..\\data.txt",header=T)
> require(data.table)
> head(df[1:20])
line Day4 Day6 Day8 Day10 Day12 Day14 Day16 Day18 Day20 Day22 Day24 Day26 Day28 Day30 Day32 Day34 Day36 Day38 Day40
1 Xm1.1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 4 2
2 Xm2.1 0 0 0 0 0 0 0 0 0 2 0 0 0 1 2 1 0 2 0
3 Xm3.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1
4 Xm4.1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 2 3 8
5 Xm5.1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 3 3 3 6
6 Xm6.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> dt <- as.data.table(df)
> dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
+ list(individual = sequence(dd[dd>0]),
+ age = rep(which(dd>0), dd[dd>0])
+ )}, by=line]
> out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
Warning message:
In function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)
> setnames(out, c("XA", "mf", "line", "block"))
> out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
Error in `[.data.table`(out, , `:=`(line = as.numeric(line), block = as.numeric(block))) :
LHS of := must be a single column name, when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
> out <- cbind(out, dt[, list(individual, age)])
>
Here goes a data.table solution. The line column must have unique values.
require(data.table)
df <- read.table("data.txt", header=TRUE, stringsAsFactors=FALSE)
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind,
strsplit(gsub("([[:alpha:]])([[:alpha:]])([0-9]+)\\.([0-9]+)$",
"\\1 \\2 \\3 \\4", dt$line), " ")), stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
This works on your data.txt file.
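For comparison, here is a base R sketch of the same expansion without data.table. It runs on a tiny made-up subset of the example (three lines, five days), and names such as cnt, row_id and day_id are purely illustrative:

```r
test <- data.frame(
  line = c("Xm1.1", "Xm1.2", "Af2.2"),
  day1 = c(0, 0, 1), day2 = c(0, 0, 0), day3 = c(0, 1, 0),
  day4 = c(2, 0, 0), day5 = c(1, 0, 0),
  stringsAsFactors = FALSE
)

counts <- as.matrix(test[, -1])
cnt    <- as.vector(t(counts))   # walk the counts row by row
row_id <- rep(seq_len(nrow(counts)), each  = ncol(counts))
day_id <- rep(seq_len(ncol(counts)), times = nrow(counts))
keep   <- cnt > 0

# repeat each (line, day) pair once per fly that died
out <- data.frame(
  id  = rep(test$line[row_id[keep]], cnt[keep]),
  age = rep(day_id[keep], cnt[keep]),
  stringsAsFactors = FALSE
)
# number the individuals within each treatment line
out$individual <- ave(out$age, out$id, FUN = seq_along)

# split e.g. "Xm1.1" into its four components
parts <- do.call(rbind, regmatches(out$id,
  regexec("^([A-Za-z])([A-Za-z])([0-9]+)\\.([0-9]+)$", out$id)))

result <- data.frame(
  XA = parts[, 2], mf = parts[, 3],
  line = as.numeric(parts[, 4]), block = as.numeric(parts[, 5]),
  individual = out$individual, age = out$age
)
result
```

As with the data.table version, this relies on the line identifiers being unique.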
