I have a data.table that is both wide and long, and also sparse. This is the simplest example:
Row Val1 Val2
  1    1   NA
  2   NA    1
Reshaping from wide to long yields:
Row Idx Val
  1   1   1
  1   2  NA
  2   1  NA
  2   2   1
Reshaping from long to wide (the long index is implicit, based on non-missing rows, in this case row numbers) yields:
Row Val1.1 Val2.1 Val1.2 Val2.2
  1      1     NA     NA      1
What I want is:
Row Idx Val
  1   1   1
  2   2   1
Missing values are structurally missing and should be discarded.
The data set is very complex (400+ columns); it comes from a survey in which one question was replicated in six different ways for six different cases, with answers selectively filled in depending on the case. Each variant of the question has six binary answers, making 36 columns. These need to be collapsed into eight columns representing the eight unique binary answers (the variants' answer sets overlap), plus a new column identifying the case.
There are several other questions with similar issues so I need to find an algorithm to do this, and I don't have the vocabulary to explain it to Google. What is the right way to do this?
Try
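First, a reproducible version of the example data (the name df1 and the NA encoding of the blank cells are assumptions on my part):
library(data.table)
df1 <- data.frame(Row = 1:2, Val1 = c(1, NA), Val2 = c(NA, 1))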
# per Row, Idx is the position of the non-missing value across Val1/Val2
setDT(df1)[, .(Idx = which.max(.SD), Val = 1), by = Row]
# Row Idx Val
#1: 1 1 1
#2: 2 2 1
I have a dataframe whose rows represent people. For a given family, the first row has the value 1 in column A, and all following rows contain members of the same family until another row has the value 1 in column A. Then, a new family starts.
I would like to assign IDs to all families in my dataset. In other words, I would like to take:
A
1
2
3
1
3
3
1
4
And turn it into:
A family_id
1 1
2 1
3 1
1 2
3 2
3 2
1 3
4 3
I'm working with a dataframe of 3 million rows, so the simple for-loop solution I came up with is not efficient enough. Also, the family_id values need not be sequential.
I'll take a dplyr solution.
data:
df <- data.frame(A = c(1:3,1,3,3,1,4))
code:
# a new family starts wherever A drops below the previous value
df$family_id <- cumsum(c(-1, diff(df$A)) < 0)
result:
# A family_id
#1 1 1
#2 2 1
#3 3 1
#4 1 2
#5 3 2
#6 3 2
#7 1 3
#8 4 3
Please note:
This solution starts a new group whenever a number occurs that is smaller than the previous one.
If it's 100% certain that a new group always begins with a 1, then Ronak's solution is perfect.
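Since a dplyr solution was requested, the same cumsum logic drops straight into mutate() (a minimal sketch using the df defined above; nothing about the logic is dplyr-specific):
library(dplyr)
df %>% mutate(family_id = cumsum(c(-1, diff(A)) < 0))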
Please help with the following:
# create a sequence of numbers
a <- matrix(0:16384, 1, 16385)
# fill a matrix with this same sequence across 20,000 rows
b <- matrix(a, 20000, ncol(a), byrow = TRUE)
# find the rank of every element, by row
c <- t(apply(b, 1, function(x) rank(x, ties.method = "first")))
This is just a sample; the real b will have a random sequence in each row.
Here is an example of the output:
If b had this (2 rows by 3 columns for now):
1 5 10
2 1 6
I want this to be in my output:
1 2 3
2 1 3
It's taking a very long time to process. Please, please help.
Thank you.
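No answer was attached to this thread. One common speed-up (my suggestion, not from the original thread) is matrixStats::rowRanks(), which ranks each row in compiled code and avoids apply()'s per-row function calls and the final transpose. A sketch, assuming a recent matrixStats release with ties.method = "first" support:
library(matrixStats)
# with b as constructed above, this replaces the t(apply(...)) line
r <- rowRanks(b, ties.method = "first")   # same dimensions as b
# the question's 2 x 3 example:
rowRanks(matrix(c(1, 5, 10, 2, 1, 6), 2, 3, byrow = TRUE), ties.method = "first")
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    2    1    3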
I'm sure this question has been asked before, but I can't seem to find an answer anywhere, so I apologize if this is a duplicate.
I'm looking for R code that aggregates a variable while creating new columns that count the instances of each level of a factor.
For example, let's say I have the data below:
Week Var1
1 a
1 b
1 a
1 b
1 b
2 c
2 c
2 a
2 b
2 c
3 b
3 a
3 b
3 a
First, I want to aggregate by week; I'm sure this can be done with group_by in dplyr. I then need the code to create a new column each time a new level appears in Var1. Finally, I need counts of each level of Var1 within each week. I could probably do this manually, but I'm looking for an automated solution, as I will have thousands of unique values in Var1. The result would be something like this:
Week a b c
1 2 3 0
2 1 1 3
3 2 2 0
I think from the way you worded your question, you've been looking for the wrong thing/something too complicated. It's a simple data-reshaping problem, and as such can be solved with reshape2:
library(reshape2)
# cast from long to wide, counting occurrences of each Var1 level per Week
res <- dcast(data, Week ~ Var1, value.var = "Var1",
             fun.aggregate = length)
> res
Week a b c
1 1 2 3 0
2 2 1 1 3
3 3 2 2 0
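Since the question mentions group_by, here is an equivalent dplyr/tidyr sketch (assuming the input data frame is named data, as in the dcast() call above):
library(dplyr)
library(tidyr)
data %>%
  count(Week, Var1) %>%                  # one row per Week/level with its count
  pivot_wider(names_from = Var1, values_from = n, values_fill = 0)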
So I have a "by" class object (which is essentially a list).
It is indexed by 2 factors [id1,id2], with a list associated with each unique pair.
e.g.
id1:1
id2:1
1,2,3
------
id1:1
id2:2
4,4,NA
------
id1:2
id2:1
NA
I would like to convert this to a data frame which has 3 columns {id1,id2,value} and would take the above and return
id1, id2, value
1 1 1
1 1 2
1 1 3
1 2 4
1 2 4
1 2 NA
2 1 NA
This can be done with a for loop but is obviously slow. I am looking to merge the resulting value column back into a data frame indexed by id1 and id2.
Answer: Use the data.table package. It is ridiculously quick for these sorts of problems.
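To make that concrete (a minimal sketch with column names and values taken from the example; my illustration, not code from the original answer): rather than building a "by" object and then flattening it, do the grouped computation inside data.table, which returns the long id1/id2/value shape directly and can also write grouped results straight back onto the original rows.
library(data.table)
dt <- data.table(id1   = c(1, 1, 1, 1, 1, 1, 2),
                 id2   = c(1, 1, 1, 2, 2, 2, 1),
                 value = c(1, 2, 3, 4, 4, NA, NA))
# a grouped summary is already a three-column data.table, no list to unpack
dt[, .(mean_value = mean(value, na.rm = TRUE)), by = .(id1, id2)]
# or merge a grouped result back into the original rows by reference
dt[, group_mean := mean(value, na.rm = TRUE), by = .(id1, id2)]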
I asked a similar yet different question before (here). Now I want to change this dataset:
library(data.table)
dt <- data.table(a = c("A", "A", "A"), b = 1:3, c = c(0, 1, 0))
dt
a b c
1: A 1 0
2: A 2 1
3: A 3 0
to this one:
a 1 2 3
1: A 0 1 0
So the values of column "b" should become columns, each holding the corresponding value of column "c". The values of "a" can be seen as participants (here just one person, "A"). The original dataset continues with multiple values of "b" and so on. After "transposing", column "a" should contain the unique values (e.g. A, B, C, etc.).
Any suggestions?
I think I can see where you're going with this.
The following should cope with varying unique items in b as well and line them up.
.u <- unique(dt$b)                             # find the unique columns first
ans <- dt[, as.list(c[match(.u, b)]), by = a]  # as.list() is the way to make columns
setnames(ans, c("a", .u))                      # renaming once afterwards is faster
ans
a 1 2 3
1: A 0 1 0
Untested on more complex cases.
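For what it's worth, the same reshape can be written as a single call with data.table's own dcast() method (equivalent output on this example, with dt as above; untested on the more complex cases):
dcast(dt, a ~ b, value.var = "c")
#    a 1 2 3
# 1: A 0 1 0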