How to keep User ID using Rtsne package - r

I want to use t-SNE to visualize users' variables, but I also want to be able to join the result to each user's social information.
Unfortunately, the output of Rtsne doesn't seem to include the user ID.
The data looks like this:
client_id recency frequen monetary
1 2 1 1 1
2 3 3 1 2
3 4 1 1 2
4 5 3 1 1
5 6 4 1 2
6 7 5 1 1
and the Rtsne output:
x y
1 -6.415009 -0.4726438
2 -16.027732 -9.3751709
3 17.947615 0.2561859
4 1.589996 13.8016613
5 -9.332319 -13.2144419
6 10.545698 8.2165265
and the code:
tsne <- Rtsne(rfm[, -1], dims = 2, check_duplicates = FALSE)

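Rtsne preserves the row order of the data frame you pass to it, so the rows of the embedding (stored in the Y element of the result, with a capital Y) line up with the rows of rfm.
Try:
Tsne_with_ID = cbind.data.frame(rfm[,1],tsne$Y)
and then just fix the first column name:
colnames(Tsne_with_ID)[1] <- colnames(rfm)[1]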
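Since the question also asks about joining to the social information, here is a minimal sketch of that step with base R's merge(). To keep it runnable without the Rtsne package, the matrix Y below is only a stand-in for tsne$Y, filled with the coordinates printed in the question. The social data frame and its city column are hypothetical.

```r
# Input data as printed in the question
rfm <- data.frame(client_id = 2:7,
                  recency   = c(1, 3, 1, 3, 4, 5),
                  frequen   = rep(1, 6),
                  monetary  = c(1, 2, 2, 1, 2, 1))

# Stand-in for tsne$Y (assumption): the coordinates shown in the question
Y <- matrix(c(-6.415009, -16.027732, 17.947615, 1.589996, -9.332319, 10.545698,
              -0.4726438, -9.3751709, 0.2561859, 13.8016613, -13.2144419, 8.2165265),
            ncol = 2)

# Rows of Y are in the same order as rows of rfm, so the IDs can be attached directly
embedding <- data.frame(client_id = rfm$client_id, x = Y[, 1], y = Y[, 2])

# Hypothetical social information, deliberately incomplete and out of order
social <- data.frame(client_id = c(7, 2, 4), city = c("A", "B", "C"))

# Left join: keep every embedded user, fill NA where no social info exists
joined <- merge(embedding, social, by = "client_id", all.x = TRUE)
print(joined)
```

Joining on client_id rather than on row position means the social table can be in any order and may miss some users.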

Related

Subsetting columns by the row names of another dataframe

I need to subset the columns of a dataframe according to the row names of another dataframe (in R).
I'm trying to select the representative species of the Brazilian Amazon by subsetting a large Brazilian database according to the percentage of representative locations, which is stored in another dataframe.
> a <- data.frame("John" = c(2,1,1,2), "Dora" = c(1,1,3,2), "camilo" = c(1:4),"alex"=c(1,2,1,2))
> a
John Dora camilo alex
1 2 1 1 1
2 1 1 2 2
3 1 3 3 1
4 2 2 4 2
> b <- data.frame("SN" = 1:3, "Age" = c(15,31,2), "Name" = c("John","Dora","alex"))
> b
SN Age Name
1 1 15 John
2 2 31 Dora
3 3 2 alex
> result <- a[,rownames(b)[1:3]]
Error in `[.data.frame`(a, , rownames(b)[1:3]) :
undefined columns selected
I want to get this dataframe
John Dora alex
1 2 1 1
2 1 1 2
3 1 3 1
4 2 2 2
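The simple a[,b$Name] does not work because b$Name is a factor (the default for strings in data.frame() before R 4.0.0), so the columns are picked by the factor's integer codes instead of by name. Be careful: it won't throw an error, but you will get the wrong answer!
But this is easy to fix by using a[,as.character(b$Name)] instead!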
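A minimal sketch of the fix on the data from the question. Here stringsAsFactors = TRUE is set explicitly (an assumption made for the sketch) so that Name is a factor even on recent R versions, where strings stay character by default.

```r
a <- data.frame(John = c(2, 1, 1, 2), Dora = c(1, 1, 3, 2),
                camilo = 1:4, alex = c(1, 2, 1, 2))

# stringsAsFactors = TRUE reproduces the old default, where Name is a factor
b <- data.frame(SN = 1:3, Age = c(15, 31, 2),
                Name = c("John", "Dora", "alex"),
                stringsAsFactors = TRUE)

# Convert to character so columns are matched by name, not by factor code
result <- a[, as.character(b$Name)]
print(result)
```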

Generating Sequences with Irregular Patterns in R

I'm attempting to generate a column that shows persistence throughout a field. The field is sequential and numeric, but not conventionally increasing. Essentially, it goes up by 7 (when it ends in 2) and then 3 (when it ends in 9) by each ID. It's possible for an ID to miss one or more of the sequence, but then return to the same pattern. The data looks like this:
ID Col
1 0769
1 0772
1 0779
1 0782
1 0799
1 0802
1 0812
2 0769
2 0772
2 0779
3 0782
3 0799
3 0802
3 0812
What I'm trying to do is generate this:
ID Col Persistence
1 0769 1
1 0772 1
1 0779 1
1 0782 1
1 0799 2
1 0802 2
1 0812 3
2 0769 1
2 0772 1
2 0779 1
3 0782 1
3 0799 2
3 0802 2
3 0812 3
If you just want to make sure the jump is either 3 or 7, you can write a helper function that increments whenever a jump of a different size occurs (this assumes Col is numeric):
jumpchange <- function(x) c(0,cumsum(!diff(x) %in% c(3,7)))+1
Then you can apply this to each group most easily with dplyr:
library(dplyr)
dd %>% group_by(ID) %>%
  mutate(persistence = jumpchange(Col))
Or you can use transform/ave with just base R:
transform(dd, persistence=ave(Col, ID, FUN=jumpchange))
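For a reproducible check, here is a sketch of the base R route on the sample data. It assumes Col is stored as text to keep the leading zeros, which is why it goes through as.numeric() first. That storage type is a guess, since the question does not say.

```r
# Sample data from the question; Col kept as character (assumption)
dd <- data.frame(
  ID  = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Col = c("0769", "0772", "0779", "0782", "0799", "0802", "0812",
          "0769", "0772", "0779", "0782", "0799", "0802", "0812"),
  stringsAsFactors = FALSE
)

# Start at 1 and add 1 every time the step is neither 3 nor 7
jumpchange <- function(x) c(0, cumsum(!(diff(x) %in% c(3, 7)))) + 1

# ave() applies jumpchange within each ID and keeps the original row order
dd$Persistence <- ave(as.numeric(dd$Col), dd$ID, FUN = jumpchange)
print(dd)
```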

How to BiCluster with constant values in columns - in R

My problem in general:
I have a data frame in which I would like to find all biclusters with constant values in columns.
For example, the initial data frame:
> df
v1 v2 v3
1 0 2 1
2 1 3 2
3 2 4 3
4 3 3 4
5 4 2 3
6 5 2 4
7 2 2 3
8 3 1 2
And, for example, I would like to find a cluster like this:
> cluster1
v1 v3
1 2 3
2 2 3
I tried the biclust package and tested several of its functions, but the result was never what I want to achieve.
I figured out that I may be able to use the BCPlaid function with fit.model = y ~ m, but that also seems to produce different results.
Is there a way to achieve this efficiently?
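One simple reading of the task, independent of the biclust package, is sketched below in base R: for a chosen set of columns, group the rows that share identical values in all of them. The helper name constant_blocks and its min_rows parameter are hypothetical, and the sketch only checks the column subset you hand it instead of searching all subsets.

```r
df <- data.frame(v1 = c(0, 1, 2, 3, 4, 5, 2, 3),
                 v2 = c(2, 3, 4, 3, 2, 2, 2, 1),
                 v3 = c(1, 2, 3, 4, 3, 4, 3, 2))

# Hypothetical helper: rows that agree on every column in `cols`
constant_blocks <- function(data, cols, min_rows = 2) {
  # Build one key per row from the selected columns
  key <- do.call(paste, c(data[cols], sep = "|"))
  # Collect row indices per key and keep keys shared by enough rows
  groups <- split(seq_len(nrow(data)), key)
  groups <- groups[lengths(groups) >= min_rows]
  lapply(groups, function(idx) data[idx, cols, drop = FALSE])
}

blocks <- constant_blocks(df, c("v1", "v3"))
print(blocks)
```

On this data the only block for v1 and v3 is the pair of rows where v1 is 2 and v3 is 3, which matches cluster1 in the question.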

Pulling Specific Row Values based on Another Column

Simple question here -
if I have a dataframe such as:
> dat
typeID ID modelOption
1 2 1 good
2 2 2 avg
3 2 3 bad
4 2 4 marginCost
5 1 5 year1Premium
6 1 6 good
7 1 7 avg
8 1 8 bad
and I wanted to pull only the modelOption values based on the typeID. I know I can subset out all rows corresponding with the typeID, but I just want to pull the modelOption values in this case.
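A minimal sketch of one way to do this with logical indexing, assuming the data frame is built as printed above. Picking typeID 2 is just for illustration.

```r
dat <- data.frame(
  typeID = c(2, 2, 2, 2, 1, 1, 1, 1),
  ID = 1:8,
  modelOption = c("good", "avg", "bad", "marginCost",
                  "year1Premium", "good", "avg", "bad"),
  stringsAsFactors = FALSE
)

# Only the modelOption values for one typeID
opts <- dat$modelOption[dat$typeID == 2]
print(opts)

# Or all typeIDs at once, as a named list of vectors
opts_by_type <- split(dat$modelOption, dat$typeID)
print(opts_by_type)
```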

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. Actually my real 'df' dataframe is much bigger than this one, but I really do not want to confuse anybody, so I have tried to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (and regarding only those records which have number '1' in column 'id') we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column 'a' (and regarding only those observations which have number '2' in column 'id') we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task but the PROBLEM is that I'm gonna have to change the input 'df' dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
The table is indexed as [id, variable, value], so to get the number of '1's in column 'a' for group '3'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a grouping doesn't have all the values in it, as with column a in group 1 (which contains no 2s), the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
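The ragged result for group 1 can be avoided by tabulating each column as a factor with a fixed set of levels, so that missing values show up as zero counts. This is a sketch of that idea and not part of the original answer. The names lv and counts are made up for the example.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <- data.frame(id, a, b, c, d, e)

# All values that occur anywhere in the non-id columns
lv <- sort(unique(unlist(df[-1])))

# One matrix per id: rows are values, columns are variables
counts <- lapply(split(df[-1], df$id), function(g) {
  sapply(g, function(v) table(factor(v, levels = lv)))
})
print(counts[["1"]])
```

Because the levels are computed from the data, this keeps working when rows or columns are added to df.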
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
# Tabulate every column except id for one group's rows
ColTables <- function(df) {
  counts <- list()
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
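If the counts should end up in a single data frame, as the question mentions, one possible base R sketch is to cross-tabulate id against each column and stack the results. The names value_cols, counts_long and the count column n are chosen here for illustration.

```r
id <- rep(1:3, each = 10)
a <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <- data.frame(id, a, b, c, d, e)

# Every column except id, so new columns are picked up automatically
value_cols <- setdiff(names(df), "id")

# One id-by-value table per column, flattened and stacked into long format
counts_long <- do.call(rbind, lapply(value_cols, function(v) {
  tab <- as.data.frame(table(id = df$id, value = df[[v]]), responseName = "n")
  cbind(variable = v, tab)
}))
head(counts_long)
```

Each row of counts_long holds one (variable, id, value) combination with its count, including zero counts for values that never occur in a group.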
