Determine when columns of a data.frame change value and return indices of the change - r

I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point; please consider the following example:
x <- data.frame(cnt = 1:10, code = rep('ELEMENT 1', 10), val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x[4, ]$val0 <- 6
The cnt column is a unique ID (it could be a date or time column; for simplicity it's an int here).
The code column is like a code for the set of rows (imagine several such groups, each with a different code). The code and cnt columns are the keys in my data.table.
The val0, val1, val2 columns are something like scores.
The data.frame above should be read as: the scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4th iteration when they changed to 6,6,3, and then changed back to 5,6,3.
My question is: is there a way to get the 1st, 4th, and 5th rows of the data.frame? Is there a way to detect when the columns change? (There are 12 such columns, btw.)
I tried using the duplicated method of data.table (which worked perfectly in the majority of cases), but in this case it removes all duplicates and leaves only rows 1 and 4 (dropping the 5th).
Do you have any suggestions? I would rather not use a for loop, as there are approx. 2M lines.
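To make that behaviour concrete, here is what plain duplicated() does on the example above (a base-R sketch of the same idea): it keeps only the first occurrence of each score combination, so row 5 is dropped even though a change happened between rows 4 and 5.
x[!duplicated(x[, c("val0", "val1", "val2")]), ]
#   cnt      code val0 val1 val2
# 1   1 ELEMENT 1    5    6    3
# 4   4 ELEMENT 1    6    6    3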

In data.table version 1.8.10 (the stable version on CRAN), there's an (unexported) function called duplist that does exactly this. And it's also written in C and is therefore terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table (1.8.11), then there's a more efficient version (in terms of memory), renamed uniqlist, that does exactly the same job. This should probably be exported for the next release; it seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
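For what it's worth, later data.table releases (1.9.6 and up) export rleidv(), which assigns a run-length id to consecutive identical rows; combined with duplicated() it gives the same indices without reaching into the namespace. A sketch, assuming a current data.table:
require(data.table) # >= 1.9.6
which(!duplicated(rleidv(x[, 3:5])))
# [1] 1 4 5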

Totally unreadable, but:
c(1, which(rowSums(sapply(x[, grep('val', names(x))], diff) != 0) > 0) + 1)
# [1] 1 4 5
Basically, run diff down each val column to find the row-to-row changes, then flag the rows where any column changed. (Comparing the diffs to zero before summing means offsetting changes in different columns can't cancel each other out.)
Also, without the sapply:
c(1, which(rowSums(diff(as.matrix(x[, grep('val', names(x))])) != 0) > 0) + 1)
# [1] 1 4 5

Related

R set column value to be other column value based on string search

I'm trying to find a clean way to set the first column of my DT, for each row, equal to the user_id found in the other columns. That is, I must search for "user_id" across each row and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
   user_id          1             2
1:     N/A        300    user_id154
2:     N/A user_id301 user_id125040
3:     N/A        302      user_id2
For instance, I want to obtain the following
user_id
user_id154
user_id301
user_id2
Please bear in mind that I am new to this kind of data formatting in R (most of the work I do does not involve cleaning JSON files..), and that my data.table has over 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes, or it will be considered too slow by my boss.
Hopefully it is understandable
I'm sure someone will provide a more elegant solution, but this does the trick:
dt[, user_id := str_extract(str_c(`1`, `2`), "user_id[0-9]*")]
This first combines the two columns row by row (note the backticks, needed because the column names are the numbers 1 and 2), then, for each row, extracts the first user_id from the combined value.
(Requires the stringr package.)
For every row in your table, grep the first value that contains "user_id" and put the result into the user_id column:
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])
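On the three example rows above, either approach should fill the user_id column as follows (assuming the N/A cells are plain strings that don't themselves match "user_id"):
df$user_id
# [1] "user_id154" "user_id301" "user_id2"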

R commands for finding mode in R seem to be wrong

I watched a video on YouTube about finding the mode in R from a list of numerics. When I enter the commands they do not work; R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then the instructor says to use
temp <- table(as.vector(x))
to basically sort all unique values in the list. R should give me 1,2,3,4,5,6,7,8,9 from this command, but nothing happens, except when the instructor does it the list is given. Then he says to use the command
names(temp)[temp--max(temp)]
which should basically give me this: 1,3,1,1,1,1,1,1,1, where the 3 shows that the mode is 2 because it is repeated 3 times in the list. I would like to stay with these commands as far as possible, as the instructor explains them in detail. Am I making a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
"to basically sort all unique values in the list."
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, it's best not to use the words interchangeably). What table() does is to count the number of instances of each unique value (in sorted order); also, as.vector() is redundant.
"R should give me 1,2,3,4,5,6,7,8,9 from this command, but nothing happens, except when the instructor does it the list is given."
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
"Then he says to use the command names(temp)[temp--max(temp)], which should basically give me this: 1,3,1,1,1,1,1,1,1, where the 3 shows that the mode is 2 because it is repeated 3 times in the list."
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...
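To see that intermediate step for yourself, print the logical vector on its own:
temp == max(temp)
    1     2     3     4     5     6     7     8     9
FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE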

Convert/transform an abundance (OTU) table/data.frame (to a fasta file) in R

I'm working on a large dataset at the moment, and so far I have been able to solve all my ideas/problems via countless Google searches and long trial & error sessions. I've managed to use plyr and reshape functions for some transformations of my different datasets and learned a lot, but I think I've reached a point where my present R knowledge won't help me anymore.
Even if my question sounds very specific (i.e. OTU table and fasta file) I guess my attempt is a common R application across many different fields (and not just bioinformatics).
Right now, I have merged a reference sequence file with an abundance table, and I would like to generate a specific file based on the information in this data.frame: a fasta file.
My df looks a bit like this at the moment:
repSeq     sw.1.102  sw.3.1021  sw.30.101  sw.5.1042  ...
ACCT-AGGA         3          0          1          0
ACCT-AGGG         1          1          2          0
ACTT-AGGG         0          1          0         25
...
The resulting file should look like this:
>sw.1.102_1
ACCT-AGGA
>sw.1.102_2
ACCT-AGGA
>sw.1.102_3
ACCT-AGGA
>sw.1.102_4
ACCT-AGGG
>sw.3.1021_1
ACCT-AGGG
>sw.3.1021_2
ACTT-AGGG
>sw.30.101_1
ACCT-AGGA
>sw.30.101_2
ACCT-AGGG
...
As you can see, I would like to use the information about the number of (reference) sequences for each sample (i.e. the sw.n columns) to create a (fasta) file.
I have no experience with loops in R (I have only used basic loops during simple processing attempts), but I assume a loop could do the trick here. I have found the write.fasta function from the SeqinR package, but I could not find a solution there. The deunique.seqs command in mothur won't work, because it needs a fasta file as input (which I obviously don't have). It is quite possible that there is something on Bioconductor (OTUbase?), but to be honest, I don't know where to begin, and I'm glad about any help.
And I really would like to do this in R, since I enjoy working with it, but any other ideas are also very welcome.
//small edit:
Both answers below work very well (see my comments). I also found two possible not-so-elegant, non-R workarounds (not tested yet):
since I already have a taxonomy file and an abundance OTU table, I think the mothur command make.biom could be used to create a biom-format file. I haven't worked with biom files yet, but I think there are some tools and scripts available to save the biom-file data as fasta again
convert Qiime files to the oligotyping format - this also needs a taxonomy file and an OTU table
I'm not sure whether both ways work - please correct me if I'm wrong.
Here's your data, coerced to a matrix (which is a more natural representation for rectangular data of homogeneous type).
df <- read.delim(textConnection(
"repSeq sw.1.102 sw.3.1021 sw.30.101 sw.5.1042
ACCT-AGGA 3 0 1 0
ACCT-AGGG 1 1 2 0
ACTT-AGGG 0 1 0 25"
), sep="", row.names=1)
m <- as.matrix(df)
The tricky part is figuring out how to number the duplicated column-name entries. I did this by creating sequences of the appropriate lengths and un-listing. I then created a matrix with two rows: the first (from replicating colnames() as required by the entries in the original matrix) is the id, and the second is the sequence.
csum <- colSums(m)
idx <- unlist(lapply(csum, seq_len), use.names=FALSE)
res <- matrix(c(sprintf(">%s_%d", rep(colnames(m), csum), idx),  # id
                rep(rownames(m)[row(m)], m)),                    # sequence
              nrow=2, byrow=TRUE)
Use writeLines(res, "your.fasta") to write out the results, or setNames(res[2,], res[1,]) to get a named vector of sequences.
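As a quick sanity check on the example data: flattening res column-wise interleaves ids and sequences, so the first few lines of the file should be
writeLines(head(c(res), 8))
# >sw.1.102_1
# ACCT-AGGA
# >sw.1.102_2
# ACCT-AGGA
# >sw.1.102_3
# ACCT-AGGA
# >sw.1.102_4
# ACCT-AGGG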
Try this; it goes through the data frame row by row and concatenates repetitions of sequences. Note that it assumes repSeq is an actual column of df (the first one, not row names), and that lag() is dplyr::lag() (base R's stats::lag() behaves differently), so load dplyr first:
library(dplyr)
fasta_seq <- apply(df, 1, function(x) {
    p <- x[1]  # the sequence for this row
    paste(unlist(mapply(function(x, y, z) {
        if (as.numeric(y) > 0) paste(">", x, "_", (z + 1):(z + y), "\n", p, "\n", sep = "")
    }, colnames(df)[-1], as.numeric(x[-1]),
       c(0, lag(cumsum(as.numeric(x[-1])))[-1]), USE.NAMES = FALSE)),
    collapse = "")
})
write(paste(fasta_seq, collapse = ""), "your_file.txt")

for loop in R using if & print [closed]

Maybe I'm thinking too hard about this, but I need to create a for loop & if statement to find the highest value in my data set. We also have to write a print statement that prints it out along with the day. There are 93 rows & 4 columns in the initial matrix. Column 4 has the needed data; the days are in column 1.
I don't know programming at all. So far this is what I've got:
I created a vector out of the column with the data:
only.data <- c(data[,4])
Here's my feeble attempt at a for & if statement:
for (counter in 1:93) {
  if (only.data >= data[,4])
    print(only.data)
}
How do I get it to spit out the highest value using this method? It prints the max value 93 times, and that's not what I want. Do I need to create the only.data vector, or can I use the original matrix? I also need to print out the corresponding date next to the highest value.
ps - I know I can use the max function which is much quicker but that's not the assignment.
It seems like you are cheating, thus I won't post a full solution here, but only point you in the right direction
data[,4] is already a vector, and there is no reason whatsoever to use c() on it. There is also no reason to save it in a new object only.data, although doing so can potentially make your loop faster, since it won't need to index in each iteration.
The idea of a loop is that you use an index in it (you don't have to, but there is no real reason not to), and that index is what you specify in for(). Although you specified an index (counter), you never used it, so your loop prints only.data regardless of anything you are doing.
All your if does is check whether only.data >= only.data in every iteration (which is obviously unnecessary).
Calculating the maximum in a loop is not such an obvious thing, as you compare a single value in each iteration, so you'll need some strategy. For example, you could create a dummy variable that is compared against only.data[counter] in each iteration and replaced whenever the current value is bigger.
To illustrate my last point, consider a toy example
set.seed(1)
only.data <- sample(10,10)
only.data
#[1] 3 4 5 7 2 8 9 6 10 1
You can see that the maximum value is in the 9th position. Now we will assign the first value of this vector to a dummy variable and use a for loop to find the maximum:
dummy <- only.data[1]
dummy
## [1] 3
for (counter in only.data) {
  if (counter > dummy) dummy <- counter
}
dummy
## [1] 10
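If you also need the position of the maximum (so you can look up the corresponding day in column 1), a minimal sketch along the same lines (my extension, not part of the original toy example) loops over indices instead of values and remembers where the largest value was seen:
best <- 1  # index of the largest value seen so far
for (counter in 2:length(only.data)) {
  if (only.data[counter] > only.data[best]) best <- counter
}
only.data[best]
## [1] 10
best
## [1] 9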

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find that it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried the solution at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output of my actual data (original 37 firms)
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, even though I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the problem with a simple example. For such an example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad <- cbind(as.data.frame(matrix(1:40, 8, 5)),
             factors = c("q", "w", "e", "r"),
             year = c("1991", "1992", "1993", "1994"))
dad <- plm.data(dad, index = c("factors", "year"))
kid <- dad[tapply(dad$V5, dad$factors, sum) <= 70, ]
tapply(kid$V1, kid$factors, mean)
kid <- droplevels(dad[tapply(dad$V5, dad$factors, sum) <= 70, ])
tapply(kid$V1, kid$factors, mean)
So I create a dad and a kid data frame based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad: it reports NA for two of the original factor levels. In itself this is not much of a problem, but in my real dataset, with many more variables and more subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factor levels q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why something that works perfectly in a small data.frame would behave differently in a larger one? For p.data (N = 592, T = 16 and n = 37), I find that when I run two identical tapply functions, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared; literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could also solve the mystery of the factors that refuse to drop.
Thanks
Simon
