Transpose-like procedure in data.table

I asked a similar yet different question before (here)
Now I want to change this dataset:
dt <- data.table(a=c("A","A","A"),b=1:3,c=c(0,1,0))
dt
a b c
1: A 1 0
2: A 2 1
3: A 3 0
to this one:
a 1 2 3
1: A 0 1 0
So the values of column "b" should become columns, each holding the corresponding value of column "c". The values of "a" can be seen as participants (here just one person, "A"). The original dataset continues with more participants, and so on. After "transposing", column "a" should contain the unique values (e.g. A, B, C, etc.).
Any suggestions?

I think I can see where you're going with this.
The following should cope with varying unique items in b as well and line them up.
.u = unique(dt$b) # find first the unique columns
ans = dt[,as.list(c[match(.u,b)]),by=a] # as.list is the way to make columns
setnames(ans,c("a",.u)) # rename once afterwards is faster
ans
a 1 2 3
1: A 0 1 0
Untested on more complex cases.
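For what it's worth, on current data.table versions the same reshape is a one-liner with dcast; a minimal sketch, assuming (as in the example) there is one value of "c" per (a, b) pair:
library(data.table)
dcast(dt, a ~ b, value.var = "c")   # spread the values of b into columns, filled with c
#    a 1 2 3
# 1: A 0 1 0
Missing (a, b) combinations come back as NA, which matches the match()-based lining-up above.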

Find the place where the variable in a data frame changes its value

I have a lot of data frames in R which look like this:
A B
1 0
2 0
3 0
4 1
5 1
6 1
So between 3 and 4, B changes value from 0 to 1. What is the most idiomatic R way of returning the value of A where B changes value?
In the data, B changes value only once, and A is sorted (from 1 to n).
Here is a possible way. Use diff to find where column b changes, but be careful: by definition of change, the first value of b hasn't changed, and diff returns a vector with one element fewer, so the result is padded with FALSE.
data <- data.frame(a = 1:6, b = c(0,0,0,1,1,1))
inx <- c(FALSE, diff(data$b) != 0)
data[inx, ]
# a b
#4 4 1
After seeing the OP's comment on another post, the following code shows that this method also works when b starts with any value, not just zero.
data2 <- data.frame(a=c(1,2,3,4,5,6),b=c(1,1,1,0,0,0))
inx <- c(FALSE, diff(data2$b) != 0)
data2[inx, ]
# a b
#4 4 0
As OP mentioned,
In the data B changes the value only once
We can use cumsum with duplicated and which.max:
df <- data.frame(A = 1:6, B = c(0,0,0,1,1,1))
which.max(cumsum(!duplicated(df$B)))
#[1] 4
If the value changes multiple times, this gives the index where the last previously unseen value first appears (duplicated only flags the first occurrence of each distinct value), not necessarily the last change.
If we need to subset the row, then we can do
df[which.max(cumsum(!duplicated(df$B))), ]
# A B
#4 4 1
To break it down further, for better understanding:
!duplicated(df$B)
#[1] TRUE FALSE FALSE TRUE FALSE FALSE
cumsum(!duplicated(df$B))
#[1] 1 1 1 2 2 2
which.max(cumsum(!duplicated(df$B)))
#[1] 4
In order to identify a change in a sequence, one may use diff, as in the following code (the == 1 comparison catches the 0-to-1 step; != 0 would catch a change in either direction):
my_df <- data.frame(A = 1:6, B = c(0,0,0,1,1,1))
which(diff(my_df$B)==1)+1
[1] 4
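If the single-change assumption is ever dropped, rle generalizes both answers; a minimal sketch (change_points is a hypothetical helper name):
change_points <- function(x) {
  r <- rle(x)                       # run-length encoding: runs of equal values
  head(cumsum(r$lengths), -1) + 1   # first index of each new run
}
change_points(c(0, 0, 0, 1, 1, 1))
# [1] 4
change_points(c(0, 0, 1, 1, 0, 0))
# [1] 3 5
Each returned index is the first position of a new run, i.e. every place where the value changes.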

Countif function (Excel) in R

I have a dataset "a" with a column "id" of about 23,000 rows, whose values are unique in this data frame. I want to count how often these unique values appear in two other datasets, "b" and "c".
To do this, I tried the following code, since "id" is the 45th column in the dataframe "b":
count1 <- as.data.frame(apply(a, 1, function(x) sum(b$id == x[45])))
a <- cbind(a, count1)
The code works for counting in "b", but then I tried the same code to count the frequency of "id" in dataframe "c", where "id" is in the 17th column:
count2 <- as.data.frame(apply(a, 1, function(x) sum(c$id == x[17])))
This time the frequencies of all "id"s are counted as 0, which is not what they should be. Could anyone suggest where the problem is or how to fix it?
We can actually do this in a way that might at first seem a little weird, but is relatively straightforward. Let's start by working with just data frames a and b, and let's simplify things a bit. Let's assume that the id variables in a and b are the following:
a_id <- 1:5
b_id <- 1:5
In this simple example, a_id and b_id are exactly identical. What we want to know is how many times each of the values in a_id shows up in b_id. We obviously know the answer is one time each, but how do we get R to tell us that? That's where the table function can come in handy:
table(a_id, b_id)
# b_id
# a_id 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 1 0
# 5 0 0 0 0 1
That might look a little ugly, but you can see that we have our b_ids on the top (1-5) and our a_ids on the left-hand side. Down the diagonal, we see the counts for how many times each value of a_id shows up in b_id, and it's 1 each just like we already knew. So how do we get just that information? R has a nice function called diag that gets the main diagonal for us:
diag(table(a_id, b_id))
# 1 2 3 4 5
# 1 1 1 1 1
And there we have it. A vector with our "countif" values. But what if b_id doesn't have all of the values that are in a_id? If we try to do what we just did, we'll get an error because table doesn't like it when two vectors have different lengths. So we modify it a bit:
a_id <- 1:10
b_id <- 4:8
table(b_id[b_id %in% a_id])
# 4 5 6 7 8
# 1 1 1 1 1
A couple of new things here. The use of %in% just asks R to tell us whether a value exists in a vector. For example, 1 %in% 1:3 returns TRUE, but 4 %in% 1:3 returns FALSE. Next, you'll notice that we indexed b_id using [. This only returns the values of b_id where b_id %in% a_id is TRUE, which in this case is all of b_id.
So what does this look like if we expect more than one occurrence of each a_id in b_id, but not every a_id value to be in b_id? Let's look at a more realistic example:
a_id <- 1:10
b_id <- sample(3:7, 1000, replace=TRUE)
table(b_id[b_id %in% a_id])
# 3 4 5 6 7
# 210 182 216 177 215
Like I said, it might seem a little weird at first, but it's relatively straightforward. Hopefully this helps you more than it confuses you.
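If the table/diag detour feels indirect, the count can also be computed head-on; a minimal sketch, assuming a, b and c each have an id column (count_in is a hypothetical helper):
count_in <- function(x, y) vapply(x, function(v) sum(y == v, na.rm = TRUE), integer(1))
a$count_b <- count_in(a$id, b$id)   # how often each a$id appears in b$id
a$count_c <- count_in(a$id, c$id)   # same against dataframe c
Referring to the columns by name also sidesteps the positional indexing (x[45], x[17]) used in the question, which is a plausible source of the all-zero counts if the position is off.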

Using table on a data frame by multiple variables

I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
This creates a sample table:
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times a location was visited uniquely. ID #1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format, but I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates. The actual date itself otherwise does not matter; it only needs to identify the duplicate entries.
My current solution would be poor form in R (iterative loops over the locations found within each unique date). I was hoping reshape, apply, aggregate, or perhaps another package may be of more help. I've looked through a bunch of other reshape guides, but am still a bit stuck on a clever way to do this.
By the sounds of it, you should be able to do what you need with:
table(unique(dfTest)[-2])
##    loc
## id  X Y Z
##   1 2 0 0
##   2 1 1 1
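To see why this works, look at the intermediate steps: unique(dfTest) drops the duplicated (id, date, loc) rows, [-2] removes the date column, and table() then cross-tabulates id by loc, counting each visit on a distinct date exactly once.
unique(dfTest)
##   id date loc
## 1  1    A   X
## 3  1    B   X
## 4  2    C   X
## 5  2    C   Y
## 6  2    C   Z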
We can group by 'loc', 'id', get the length of unique elements of 'date' and use dcast to get the expected output.
library(data.table)#v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
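For completeness, base R's xtabs expresses the same cross-tabulation of the de-duplicated rows with a formula interface (a minor variation on the first answer, not a new method):
xtabs(~ id + loc, data = unique(dfTest))
##    loc
## id  X Y Z
##   1 2 0 0
##   2 1 1 1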

Reshaping R data.table that is both wide and long (and sparse)

I have a data.table that is both wide and long, and also sparse. This is the simplest example:
Row Val1 Val2
1   1
2        1
Reshaping from wide to long yields:
Row Idx Val
1   1   1
1   2
2   1
2   2   1
Reshaping from long (index is implicit based on non-missing rows, in this case row numbers) to wide yields:
Row Val1.1 Val2.1 Val1.2 Val2.2
1   1                         1
What I want is:
Row Idx Val
1 1 1
2 2 1
Missing values are structurally missing and should be discarded.
The data set is very complex (400+ columns); it is from a survey in which one question was replicated in six different ways for six different cases, with answers selectively filled based on the case. Each question has six binary answers, making 36 columns. These need to be collapsed into the eight columns representing the eight unique binary answers, plus a new column identifying the case.
There are several other questions with similar issues so I need to find an algorithm to do this, and I don't have the vocabulary to explain it to Google. What is the right way to do this?
Try
setDT(df1)[, list(Idx = which.max(.SD), Val = 1), Row]
# Row Idx Val
#1: 1 1 1
#2: 2 2 1
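An alternative sketch using melt, which makes the structural missingness explicit before dropping it. The construction of df1 below is hypothetical, since the post does not show it:
library(data.table)
df1 <- data.frame(Row = 1:2, Val1 = c(1, NA), Val2 = c(NA, 1))  # assumed layout
long <- melt(setDT(df1), id.vars = "Row",
             variable.name = "Idx", value.name = "Val")
long[!is.na(Val)][, Idx := as.integer(Idx)][]
#    Row Idx Val
# 1:   1   1   1
# 2:   2   2   1
Here as.integer on the Idx factor maps Val1, Val2 to 1, 2 because the melted columns come out in order; with 36 real columns you would parse the case and answer out of the variable name instead.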

Using Merge with an R By class object

So I have a "by" class object (which is essentially a list).
It is indexed by 2 factors [id1,id2], with a list associated with each unique pair.
e.g.
id1:1
id2:1
1,2,3
------
id1:1
id2:2
4,4,NA
------
id1:2
id2:1
NA
I would like to convert this to a data frame which has 3 columns {id1, id2, value}; applied to the above it would return:
id1, id2, value
1 1 1
1 1 2
1 1 3
1 2 4
1 2 4
1 2 NA
2 1 NA
This can be done with a for loop, but that is obviously slow. I am looking to merge the value column back onto a data frame keyed by the two indices, id1 and id2.
Answer: Use the data.table package. It is ridiculously quick for these sorts of problems.
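A hedged sketch of one way to do that, assuming res is the by() object, its two dimensions correspond to id1 and id2, and each cell holds a vector of values (or NULL):
library(data.table)
dn <- dimnames(res)
out <- rbindlist(lapply(seq_along(res), function(k) {
  v <- res[[k]]
  if (is.null(v)) return(NULL)    # skip empty cells
  ij <- arrayInd(k, dim(res))     # recover the (id1, id2) position
  data.table(id1 = dn[[1]][ij[1]], id2 = dn[[2]][ij[2]], value = v)
}))
rbindlist silently drops the NULL entries, and the value vector is recycled against the two scalar ids, giving one row per value, i.e. the long format asked for.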
