reshaping a data frame into long format in R - r

I'm struggling with a reshape in R. I have 2 types of error (err and rel_err) that have been calculated for 3 different models. This gives me a total of 6 error variables (i.e. err_1, err_2, err_3, rel_err_1, rel_err_2, and rel_err_3). For each of these types of error I have 3 different types of predivtive validity tests (ie random holdouts, backcast, forecast). I would like to make my data set long so I keep the 4 types of test long while also making the two error measurements long. So in the end I will have one variable called err and one called rel_err as well as an id variable for what model the error corresponds to (1,2,or 3)
Here is my data right now:
iter err_1 rel_err_1 err_2 rel_err_2 err_3 rel_err_3 test_type
1 -0.09385732 -0.2235443 -0.1216982 -0.2898543 -0.1058366 -0.2520759 random
1 0.16141630 0.8575728 0.1418732 0.7537442 0.1584816 0.8419816 back
1 0.16376930 0.8700738 0.1431505 0.7605302 0.1596502 0.8481901 front
1 0.14345986 0.6765194 0.1213689 0.5723444 0.1374676 0.6482615 random
1 0.15890059 0.7435382 0.1589823 0.7439204 0.1608709 0.7527580 back
1 0.14412360 0.6743928 0.1442039 0.6747684 0.1463520 0.6848202 front
and here is what I would like it to look like:
iter model err rel_err test_type
1 1 -0.09385732 (#'s) random
1 2 -0.1216982 (#'s) random
1 3 -0.1216982 (#'s) random
and on...
I've tried playing around with the syntax but can't quite figure out what to put for the time.varying argument
Thanks very much for any help you can offer.

You could do it the "hard" way. For transparency you can use names.
with( dat, data.frame(iter = rep(iter, 3),
model = rep(1:3, each = nrow(dat)),
err = c(err_1, err_2, err_3),
rel_err = c(rel_err_1, rel_err_2, rel_err_3),
test_type = rep(test_type, 3)) )
Or, for conciseness, indexes.
data.frame(iter = dat[,1], model = rep(1:3, each = nrow(dat)), err = dat[,c(2, 4, 6)],
rel_err = dat[,c(3, 5, 7)], test_type = dat[,8]) )
If you had a LOT of columns the hard way might involve grepping the column names.
This "hard" way was about as concise as reshape and required less thinking about how to use the commands. Sometimes I just skip thinking about reshape.

The base function reshape will let you do this
reshape(DT, direction = 'long', varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')), v.names = c('err','rel_err'), timevar = 'model')
iter test_type model err rel_err id
1.1 1 random 1 -0.09385732 -0.2235443 1
2.1 1 back 1 0.16141630 0.8575728 2
3.1 1 front 1 0.16376930 0.8700738 3
4.1 1 random 1 0.14345986 0.6765194 4
5.1 1 back 1 0.15890059 0.7435382 5
6.1 1 front 1 0.14412360 0.6743928 6
1.2 1 random 2 -0.12169820 -0.2898543 1
2.2 1 back 2 0.14187320 0.7537442 2
3.2 1 front 2 0.14315050 0.7605302 3
4.2 1 random 2 0.12136890 0.5723444 4
5.2 1 back 2 0.15898230 0.7439204 5
6.2 1 front 2 0.14420390 0.6747684 6
1.3 1 random 3 -0.10583660 -0.2520759 1
2.3 1 back 3 0.15848160 0.8419816 2
3.3 1 front 3 0.15965020 0.8481901 3
4.3 1 random 3 0.13746760 0.6482615 4
5.3 1 back 3 0.16087090 0.7527580 5
6.3 1 front 3 0.14635200 0.6848202 6
I agree that the syntax for reshape hard to get your head around sometimes. I will spell out how this call works
direction = 'long' -- reshaping to long format
varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')) -- We pass a list of length 2 because we are trying to stack into two different variables. The columns paste('err',1:3,sep ='_') will become the first new variable in long format and
paste('rel_err',1:3,sep ='_')) will become the second new variable in long format
v.names = c('err','rel_err') sets the names of the two new variables in long format
timevar = 'model' sets the name of the time identifier (here the _1 from the columns in wide format.
I hope this is somewhat clearer.

Related

lcmm - knowing which IDs are in which class

I am using the LCMM package with the function hlme.
I want to know which IDs are classified in each class. Some of my cases are being dropped from the model. So when I use $pprob, the id variable is basically just starting at 1 and increasing. It is not the original study ID. So I am not able to merge it with the original dataset.
Does anyone know how to do that?
Here is the code:
q4r.fv <- hlme(i_rest~days+days2, subject = 'ninclu', ng=4, idiag=T, mixture=~days+days2,
data=patch, maxiter = 100, returndata = TRUE,
classmb=~tttantal+sexe+atchroniq+hads_anxiI+hads_depI+pcs_totI)
summary(q4r.fv)
fmq4r.fv <-q4r.fv$pprob
write.csv(fmq4r.fv,file="fmq4r.fv",row.names=F)
With the csv file, I obtain the following. The ninclu should be my ID but it no longer matches my original minclu variables that is a string participant ID variable
> print.data.frame(fmq4r.fv)
ninclu class prob1 prob2 prob3 prob4
1 1 2 7.416779e-09 9.635142e-01 3.630078e-02 1.850563e-04
2 2 2 5.479232e-02 8.710804e-01 7.412726e-02 9.118352e-16
3 3 1 9.933911e-01 6.607882e-03 9.830514e-07 1.110920e-23
4 4 2 2.620132e-07 9.991825e-01 8.155809e-04 1.631318e-06
5 5 2 4.382259e-04 9.877001e-01 1.186168e-02 1.166050e-11
6 6 3 4.239271e-09 2.361263e-01 7.634902e-01 3.835313e-04
You mention that your subject ID is a string participant ID variable.
If you look at the source code for lcmm, you can see that lcmm requires a numeric subject:
if(!is.numeric(data[,subject])) stop("The argument subject must be numeric")
I suspect the character string participant ID is coerced to factor, and it is the default factor coding that is appearing as "subj" in the model output predictions.
Try using a numeric participant ID and see if your results make more sense.

R package arules: read.transactions file format

I have a .csv file with the following type of data:
Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47,
And I can't get the read.transactions to read it properly.
The data set is based on several item selection for each day (more than one time per day, if necessary). For instance, the third selection on day 1, returned items 16,28,32, and 45.
Shouldn't this be enough?
library(arules)
dataset <- read.transactions("file.csv", format = 'basket')
I have tried to create a sample data using data provided by you
data <- read.table(text="Day Item
1 12,19,24,31,48,
1 1,19,
1 16,28,32,45,
1 19,36,41,43,44,
1 7,24,27,
1 21,31,33,41,
1 46
1 50
2 12,31,36,48,
2 17,29,47,
2 2,18,20,29,38,39,40,41
2 17,29,47",header = T)
data <- as(data[-1], "transactions") ##removing 1st header column for the transactional data
inspect(data)
## apply apriori algorithm ###
rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.80))
### Arrange top 10 rules by lift ####
inspect(rules[1:10])
Please try this method hope it helps

R aggregate by a variable then find out proportion of a each column

Sorry, I've tried my best but I didn't find the answer. As beginner, I'm not sure that I'm able to put the question clearly. Thanks in advance.
So I have a dataframe with data about consumption with 24000 rows.
In this dataframe, there is a series of variable about the number of objects bought within the last two months :
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession" registered by number.
So now the data looks looks like this
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if it is clear, but finally I want to be able to say :
10% of clothing products that doctors bought are Tshirts whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose that we can you dplyr ?
Thank you very much !!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
or also using prop.table
As other people noted, it is always better to post a reproducible example, I´ll try to post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # I set a seed cause I´ll use the sample() function
n <- 1:100 # vector from 1 to 100 to obtain the number of products bought
p <- 1:8 # vector for obtaining id of professions
profession <- sample(p,50, replace = TRUE)
NumberOfCoat <- sample(n,50, replace = TRUE)
NumberOfShirt <- sample(n,50, replace = TRUE)
NumberOfShoes <- sample(n,50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
NumberOfShirt, NumberOfShoes))
Once you got the dataframe, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
df <- df %>% group_by(profession) %>% summarize(coats = sum(NumberOfCoat),
shirts = sum(NumberOfShirt),
shoes = sum(NumberOfShoes)) %>%
mutate(tot_prod = coats + shirts + shoes,
ProportionOfCoat = coats/tot_prod,
ProportionOfShirt = shirts/tot_prod,
ProportionofShoes = shoes/tot_prod) %>%
select(profession, ProportionOfCoat, ProportionOfShirt,
ProportionofShoes)
dfcorresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to a long format in order to be able to use ggplot2. As #alistaire noted, you can do it with the gather function from the tidyr package.
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
geom_bar(stat="identity")

names of a dataset returns NULL- R version 3.2.4 Revised-Ubuntu 14.04 LTS

I have a small issue regarding a dataset I am using. Suppose I have a dataset called mergedData2 defined using those command lines from a subset of mergedData:
mergedData=rbind(test_set,training_set)
lookformean<-grep("mean()",names(mergedData),fixed=TRUE)
lookforstd<-grep("std()",names(mergedData),fixed=TRUE)
varsofinterests<-sort(c(lookformean,lookforstd))
mergedData2<-mergedData[,c(1:2,varsofinterests)]
If I do names(mergedData2), I get:
[1] "volunteer_identifier" "type_of_experiment"
[3] "body_acceleration_mean()-X" "body_acceleration_mean()-Y"
[5] "body_acceleration_mean()-Z" "body_acceleration_std()-X"
(I takes this 6 first names as MWE but I have a vector of 68 names)
Now, suppose I want to take the average of each of the measurements per volunteer_identifier and type_of_experiment. For this, I used a combination of split and lapply:
mylist<-split(mergedData2,list(mergedData2$volunteer_identifier,mergedData2$type_of_experiment))
average_activities<-lapply(mylist,function(x) colMeans(x))
average_dataset<-t(as.data.frame(average_activities))
As average_activities is a list, I converted it into a data frame and transposed this data frame to keep the same format as mergedData and mergedData2. The problem now is the following: when I call names(average_dataset), it returns NULL !! But, more strangely, when I do:head(average_dataset) ; it returns :
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 1 0.2773308 -0.01738382
2 1 0.2764266 -0.01859492
3 1 0.2755675 -0.01717678
4 1 0.2785820 -0.01483995
5 1 0.2778423 -0.01728503
6 1 0.2836589 -0.01689542
This is just a small sample of the output, to say that the names of the variables are there. So why names(average_dataset) returns NULL ?
Thanks in advance for your reply, best
EDIT: Here is an MWE for mergedData2:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 2 5 0.2571778 -0.02328523
2 2 5 0.2860267 -0.01316336
3 2 5 0.2754848 -0.02605042
4 2 5 0.2702982 -0.03261387
5 2 5 0.2748330 -0.02784779
6 2 5 0.2792199 -0.01862040
body_acceleration_mean()-Z body_acceleration_std()-X body_acceleration_std()-Y body_acceleration_std()-Z
1 -0.01465376 -0.9384040 -0.9200908 -0.6676833
2 -0.11908252 -0.9754147 -0.9674579 -0.9449582
3 -0.11815167 -0.9938190 -0.9699255 -0.9627480
4 -0.11752018 -0.9947428 -0.9732676 -0.9670907
5 -0.12952716 -0.9938525 -0.9674455 -0.9782950
6 -0.11390197 -0.9944552 -0.9704169 -0.9653163
gravity_acceleration_mean()-X gravity_acceleration_mean()-Y gravity_acceleration_mean()-Z
1 0.9364893 -0.2827192 0.1152882
2 0.9274036 -0.2892151 0.1525683
3 0.9299150 -0.2875128 0.1460856
4 0.9288814 -0.2933958 0.1429259
5 0.9265997 -0.3029609 0.1383067
6 0.9256632 -0.3089397 0.1305608
gravity_acceleration_std()-X gravity_acceleration_std()-Y gravity_acceleration_std()-Z
1 -0.9254273 -0.9370141 -0.5642884
2 -0.9890571 -0.9838872 -0.9647811
3 -0.9959365 -0.9882505 -0.9815796
4 -0.9931392 -0.9704192 -0.9915917
5 -0.9955746 -0.9709604 -0.9680853
6 -0.9988423 -0.9907387 -0.9712319
My duty is to get this average_dataset (which is a dataset which contains the average value for each physical quantity (column 3 and onwards) for each volunteer and type of experiment (e.g 1 1 mean1 mean2 mean3...mean68
2 1 mean1 mean2 mean3...mean68, etc)
After this I will have to export it as a txt file (so I think using write.table with row.names=F, and col.names=T). Note that for now, if I do this and import the dataset generated using read.table, I don't recover the names of the columns of the dataset; even while specifying col.names=T.

Multiple plots in R with different settings for each axis with less lines of code

In the graph below,
Is it possible to create same graph with less lines of codes? I mean, since each Figs. A-D has different label settings, I have to write settings for each Fig. which makes it longer.
The graph below is produced with the data in pdf device.
Any help with these issues is highly appreciated.(Newbie to R!). Since all the code is too long to post here, I have posted a part relevant to the problem here for Fig.C
#FigC
label1=c(0,100,200,300)
plot(data$TimeVariable2C,data$Variable2C,axes=FALSE,ylab="",xlab="",xlim=c(0,24),
ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
lines(data$TimeVariable3C,data$Variable3C)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(1,tick=T,at=seq(0,24,by=6),label=seq(0,24,by=6))
mtext("(C)",side=1,outer=F,line=-10,adj=0.8)
minor.tick(nx=5,ny=5)
par(new=TRUE)
plot(data$TimeVariable1C,data$Variable1C,axes=FALSE,xlab="",ylab="",type="l",
ylim=c(800,0),xaxs="i",yaxs="i")
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
polygon(data$TimeVariable1C,data$Variable1C,col='violetred4',border=NA)
You ask many questions in the same OP. I will try to answer to just one : How to simplify your code or rather how to call it once for each letter. I think it is better to put your data in the long format. For example, This will create a list of 4 elements
ll <- lapply(LETTERS[1:4],function(let){
dat.let <- dat[,grepl(let,colnames(dat))]
dd <- reshape(dat.let,direction ='long',
v.names=c('TimeVariable','Variable'),
varying=1:6)
dd$time <- factor(dd$time)
dd$Type <- let
dd
}
)
ll is a list of 4 data.frame, where each one that looks like :
head(ll[[1]])
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
Then you can use it like this for example :
library(Hmisc)
layout(matrix(1:4, 2, 2, byrow = TRUE))
lapply(ll,function(data){
label1=c(0,100,200,300)
Type <- unique(dat$Type)
dat <- subset(data,time==2)
x.mm <- max(dat$Variable,na.rm=TRUE)
plot(dat$TimeVariable,dat$Variable,axes=FALSE,ylab="",xlab="",xlim=c(0,x.mm),
ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
dat <- subset(data,time==2)
lines(dat$TimeVariable,dat$Variable)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(1,tick=T,at=seq(0,x.mm,by=6),label=seq(0,x.mm,by=6))
mtext(Type,side=1,outer=F,line=-10,adj=0.8)
minor.tick(nx=5,ny=5)
par(new=TRUE)
dat <- subset(data,time==1)
plot(dat$TimeVariable,dat$Variable,axes=FALSE,xlab="",ylab="",type="l",
ylim=c(800,0),xaxs="i",yaxs="i")
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
polygon(dat$TimeVariable,dat$Variable,col='violetred4',border=NA)
})
Another advantage of using the long data format is to use ``ggplot2andfacet_wrap` for example .
## transform your data to a data.frame
dat.l <- do.call(rbind,ll)
library(ggplot2)
ggplot(subset(dat.l,time !=1)) +
geom_line(aes(x=TimeVariable,y=Variable,group=time,color=time))+
geom_polygon(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=60-Variable/10,fill=Type))+
geom_line(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=Variable,fill=Type))+
facet_wrap(~Type,scales='free')

Resources