Keep the ID variable when creating seqelist in TraMineR - r

I am creating a list of sequences with TraMineR with the following code:
cj.seqe <- seqecreate(id=cj$party_id, time=cj$DATE_IN_num, event=cj$EVT_CD)
However, this list only contains the events and drops the id variable. I would like to merge the event sequences back to the original data. Is there a way to do this? I couldn't find anything in the docs. Thanks!

Maybe you could try with the following:
cj.seqe <- seqecreate(data=cj, id="party_id", time=cj$DATE_IN_num, event=cj$EVT_CD)
I suggest this because, in my experience, passing the id to seqecreate as
id=dataset.name$variable_id didn't work,
whereas passing it as id="variable_id" did.
I hope this helps, although I have no rigorous explanation.

In R, dataframe[-NULL] returns an empty dataframe

I'm creating some routines in R to ease model creation and to distinguish several groups based on several parameters (e.g. original watches vs. fake ones, using common watch attributes).
During the process, I keep track of the rows to exclude in a vector (empty at first), and I get rid of them at the end using:
model$var <- raw_data[-line_excluded,]
The problem is that if line_excluded is c() (i.e. no row excluded), model$var is an empty dataframe, whereas in that case I want all the rows of the dataframe.
The only solution I have thought of is to use
if (!is.null(line_excluded)){
model$var <- raw_data[-line_excluded,]}
But that's not really pretty, and I have several tracking variables like line_excluded that need the same treatment.
Thanks for the help
You can do it another way using setdiff(), which can deal with an empty line_excluded, i.e.,
model$var <- raw_data[setdiff(seq(nrow(raw_data)),line_excluded),]
You can also try:
model$var <- raw_data[!(1:nrow(raw_data) %in% line_excluded),]
This is similar to what @ThomasIsCoding suggested: you keep the row numbers that are not in your line_excluded.
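To see why the original indexing fails and how the two answers behave, here is a minimal sketch on made-up data. The data frame contents and the helper name drop_rows are assumptions for illustration only; the helper is just one way to avoid repeating the expression for each tracking vector.

```r
# Toy data standing in for raw_data (values are invented for illustration)
raw_data <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

# Negating an empty vector gives a zero-length index, which selects no rows
line_excluded <- c()
empty_result <- raw_data[-line_excluded, ]

# Hypothetical helper: keep every row whose number is not in `excluded`
drop_rows <- function(df, excluded) {
  df[setdiff(seq_len(nrow(df)), excluded), , drop = FALSE]
}

keep_all  <- drop_rows(raw_data, line_excluded)  # nothing excluded
keep_some <- drop_rows(raw_data, c(2, 4))        # rows 2 and 4 removed

# The %in% variant from the second answer selects the same rows
keep_some_in <- raw_data[!(seq_len(nrow(raw_data)) %in% c(2, 4)), ]
```

Using seq_len(nrow(df)) rather than 1:nrow(df) also behaves sensibly when the data frame itself has zero rows.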

Using semi_join to find similarities but returns none mistakenly

I am trying to find the genes common to two columns so that I can later work with just those genes. Below is my code:
top100_1Beta <- data.frame(grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta<- data.frame(grp2_Beta$my_data.SYMBOL[1:100])
common100_Beta <- semi_join(top100_1Beta,top100_2Beta)
When I run the code I get the following error:
Error: by required, because the data sources have no common variables
This seems wrong, since when I open top100_1Beta and top100_2Beta I can see that at least the first few rows list the exact same genes: ATP2A1, SLMAP, MEOX2,...
I am confused as to why it reports that there are no common variables.
Any help would be greatly appreciated.
Thanks!
I don't think you need any form of *_join here; instead it seems you're looking for intersect
intersect(grp1_Beta$my_data.SYMBOL[1:100], grp2_Beta$my_data.SYMBOL[1:100])
This returns a vector of common entries amongst the first 100 entries of grp1_Beta$my_data.SYMBOL and grp2_Beta$my_data.SYMBOL.
Without a full working example, I'm guessing that your top100_1Beta and top100_2Beta dataframes do not have the same column names. They are probably grp1_Beta.my_data.SYMBOL.1.100. and grp2_Beta.my_data.SYMBOL.1.100.. This means the semi_join function doesn't know where to match the dataframes up. Renaming the columns should fix the issue.
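The column-name explanation can be checked with base R alone. The sketch below uses invented gene lists (GENE_A and GENE_B are placeholders, and grp1/grp2 are simplified stand-ins for the question's objects). It shows that wrapping an expression in data.frame() derives the column name from that expression, so the two data frames end up with different names; giving both columns the same explicit name, or using intersect() on the vectors, avoids the problem.

```r
# Invented stand-ins for the two gene tables
grp1 <- data.frame(SYMBOL = c("ATP2A1", "SLMAP", "MEOX2", "GENE_A"),
                   stringsAsFactors = FALSE)
grp2 <- data.frame(SYMBOL = c("ATP2A1", "SLMAP", "MEOX2", "GENE_B"),
                   stringsAsFactors = FALSE)

# Column names are derived from the expression, so they differ
top1 <- data.frame(grp1$SYMBOL[1:4])
top2 <- data.frame(grp2$SYMBOL[1:4])
shared_names <- intersect(names(top1), names(top2))  # no shared column name

# Naming the column explicitly gives both data frames a common key
top1_named <- data.frame(SYMBOL = grp1$SYMBOL[1:4], stringsAsFactors = FALSE)
top2_named <- data.frame(SYMBOL = grp2$SYMBOL[1:4], stringsAsFactors = FALSE)

# Genes present in both lists, as in the intersect() answer
common <- intersect(top1_named$SYMBOL, top2_named$SYMBOL)
```

With identically named columns, a join such as semi_join(top1_named, top2_named, by = "SYMBOL") has a variable to match on.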

update function in R

I am trying to use the update function on a survey.design object. For instance, I want to create a variable that is the mean of several other variables, as follows
x1<-runif(3)
x2<-runif(3)
x3<-runif(3)
population=10000
testdf<-data.frame(x1,x2,x3,population)
testsvy<-svydesign(id=~1,weights=c(30,30,30),data=testdf)
testsvy<-update(testsvy,avg=mean(c(x1,x2,x3)))
However, this returns a vector with the same number for every person. There must be something wrong. Alternatively I can modify testsvy$variables directly, but I don't feel that this is the easiest way...
OK, I found the answer myself... I hope it could be simpler, since I have to type the object name three times...
testsvy<-update(testsvy,avg2=rowMeans(testsvy$variables[,c("x1","x2","x3")],na.rm=TRUE))
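The behaviour can be reproduced without the survey package. The sketch below uses base R's transform() as a stand-in for update() (an assumption made so the example runs on its own): mean(c(x1, x2, x3)) collapses everything into one number that is then recycled to every row, while rowMeans() over the bound columns gives one value per row.

```r
set.seed(42)  # fixed seed so the toy data are reproducible
testdf <- data.frame(x1 = runif(3), x2 = runif(3), x3 = runif(3))

# One grand mean, recycled to every row (the behaviour seen in the question)
wrong <- transform(testdf, avg = mean(c(x1, x2, x3)))

# One mean per row
right <- transform(testdf, avg = rowMeans(cbind(x1, x2, x3)))
```

Since update() on a survey design also evaluates its expressions using the design's variables, something like update(testsvy, avg = rowMeans(cbind(x1, x2, x3))) should avoid repeating the object name, though that call is not run here.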

Declaration of mass variables in column headings in R

I cannot figure out how to assign the column headers from my imported xlsx sheet as variables. I have several column headers, for example DAY_CHG and INPUT_CHG. So far, I can only run gls(DAY_CHG~INPUT_CHG) by first assigning the values to variables, e.g. X<-mydata$DAY_CHG. Is there some command to get these variables assigned automatically when I import?
I had horrible problems getting the program up and running, by the way, due to firewalls at the firm for which I'm working, wondering if that's causing some of the issue.
Any help is much appreciated. Thanks!
attach(mydata) will allow you to directly use the variable names. However, attach may cause problems, especially with more complex data/analyses (see Do you use attach() or call variables by name or slicing? for a discussion)
An alternative would be to use with, such as with(mydata, gls(DAY_CHG~INPUT_CHG))
I would suggest using the $ in order to use the headers as variables and still be able to use other data sets. All that needs to be done is assign the data to an object such as your mydata and by putting a $ immediately following, you will be able to refer to your headers as variables.
As an example for your case, instead of creating a new object x, simply take what you assigned x to and put it directly into your command.
gls(mydata$DAY_CHG ~ mydata$INPUT_CHG)
When things become more complicated, with more data sets, this lets you keep access to all of them rather than limiting yourself to the one data set you attach().
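The approaches above can be compared side by side. This sketch uses lm() from base R in place of gls() so it runs without loading nlme, and the data are simulated; both are assumptions for illustration. It also shows a third option, the data argument that most modelling functions (gls() included) accept, which keeps the formula short without attach().

```r
set.seed(1)  # simulated data standing in for the imported sheet
mydata <- data.frame(INPUT_CHG = rnorm(20))
mydata$DAY_CHG <- 2 * mydata$INPUT_CHG + rnorm(20)

# Option 1: pass the data frame through the data argument
fit_data <- lm(DAY_CHG ~ INPUT_CHG, data = mydata)

# Option 2: evaluate the call inside the data frame using with()
fit_with <- with(mydata, lm(DAY_CHG ~ INPUT_CHG))

# Option 3: refer to each column explicitly using $
fit_dollar <- lm(mydata$DAY_CHG ~ mydata$INPUT_CHG)
```

All three fits estimate the same coefficients; only the labels on the coefficients differ in the $ version.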

Working with "..." input in R function

I am putting together an R function that takes some undefined input through the ... argument described in the docs as:
"..." the special variable length argument ***
The idea is that the user will enter a number of column names here, each belonging to a dataset also specified by the user. These columns will then be cross-tabulated against the dependent variable by tapply. The function is to return a table (independent variable x independent variable).
Thus, I tried:
plotter=function(dataset, dependent_variable, ...)
{
indi_variables=list(...); # making a list of the ... input as described in the docs
result=with (dataset, tapply(dependent_variable, indi_variables, mean)); # this fails
}
I figured this should work as tapply can take a list as input.
But it does not in this case ('Error in tapply...arguments must have same length') and I think it is because indi_variables is a list of strings.
If I input the contents of the list by hand and leave out the quotation marks, everything works just fine.
However, if the user feeds the function the column names as non-strings, R will interpret them as variable names; and I cannot figure out how to transform the list indi_variables in the right way, unsuccessfully trying things like this:
indi_variables=lapply(indi_variables, as.factor)
So I am wondering
What causes the error described above? Is my interpretation correct?
How would one go about transforming the list created through ... in the right way?
Is there an overall better way of doing this, in the input or the implementation of tapply?
Any help is much appreciated!
Thanks to Joran's helpful reading, I have come up with these improvements that make things work out...
indi_variables=substitute(list(...));
result=with (dataset, tapply(dependent_variable, eval(indi_variables, dataset), FUN=mean));
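Put together as a complete function, the substitute() approach might look like the sketch below. It is a variation rather than the exact code above: it also captures dependent_variable unevaluated, so that all columns can be given as bare names, and it uses the built-in mtcars data purely as an example.

```r
plotter <- function(dataset, dependent_variable, ...) {
  # Capture the arguments unevaluated, then evaluate them inside the dataset,
  # falling back to the caller's environment for anything not found there
  dep    <- eval(substitute(dependent_variable), dataset, parent.frame())
  groups <- eval(substitute(list(...)), dataset, parent.frame())
  # Mean of the dependent variable for every combination of grouping values
  tapply(dep, groups, FUN = mean)
}

# Mean mpg cross-tabulated by number of cylinders and transmission type
res <- plotter(mtcars, mpg, cyl, am)
```

Here substitute(list(...)) builds the call list(cyl, am) without evaluating it, and eval() then looks up cyl and am as columns of the dataset, which is why bare column names work.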
