Aggregating by unique identifier and putting unique values into a string in R

I have a need that I imagine could be satisfied by aggregate or reshape, but I can't quite figure it out.
I have a list of names with the colour of the car that each person owns. The data is in long form, so a name can have multiple colours. I'd like to aggregate by name and get a single colour per name (in the example below, the alphabetically first one).
For example,
Name car_colour
Euler blue
Gauss red
Hilbert white
Hilbert green
Knuth yellow
Knuth orange
Knuth cyan
Knuth violet
Knuth darkblue
Would become...
Name car_colour
Euler blue
Gauss red
Hilbert green
Knuth cyan
How would I accomplish this?

Sorry guys but the answer was very simple:
> Name=c('Euler','Gauss','Hilbert','Hilbert','Knuth','Knuth','Knuth','Knuth','Knuth')
> car_colour=c('blue','red','white','green','yellow','orange','cyan','violet','darkblue')
> nc=as.data.frame(cbind(Name,car_colour))
> nc
Name car_colour
1 Euler blue
2 Gauss red
3 Hilbert white
4 Hilbert green
5 Knuth yellow
6 Knuth orange
7 Knuth cyan
8 Knuth violet
9 Knuth darkblue
> nc.agg <- aggregate( as.character(car_colour) ~ Name, nc, FUN = "min")
> nc.agg
Name as.character(car_colour)
1 Euler blue
2 Gauss red
3 Hilbert green
4 Knuth cyan
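
The same result can be reached a little more directly. The following is a minimal sketch, assuming the data are built with data.frame() so that car_colour stays a character column; the names first_colour and all_colours are illustrative and not from the original post. The second call shows one way to collapse all unique colours per name into a single string, as the title asks.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps colours as character
nc <- data.frame(
  Name = c("Euler", "Gauss", "Hilbert", "Hilbert",
           "Knuth", "Knuth", "Knuth", "Knuth", "Knuth"),
  car_colour = c("blue", "red", "white", "green",
                 "yellow", "orange", "cyan", "violet", "darkblue"),
  stringsAsFactors = FALSE
)

# Alphabetically first colour per name (min() on character vectors)
first_colour <- aggregate(car_colour ~ Name, data = nc, FUN = min)

# All unique colours per name, pasted into one comma-separated string
all_colours <- aggregate(car_colour ~ Name, data = nc,
                         FUN = function(x) paste(unique(x), collapse = ", "))
```

Because the formula names car_colour directly, the result column is called car_colour rather than as.character(car_colour).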

Quick, super awful method!
I made an example called test.
test
letter color
[1,] "a" "blue"
[2,] "a" "red"
[3,] "a" "red"
[4,] "b" "orange"
[5,] "c" "green"
testTable=table(test[,1], test[,2])
testNames=colnames(testTable)[apply(testTable,1,which.max)]
testOut=data.frame(letter=rownames(testTable), color=testNames) # rownames(testTable) keeps letters in the same (sorted) order as testNames
I'm thinking someone else might have a less silly way to get to the same answer. I'll happily upvote someone who one-liners it!
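
For what it's worth, here is one possible one-liner in base R. It is a sketch that assumes test is the character matrix shown above (rebuilt here so the snippet is self-contained); the name mostCommon is made up for illustration. Ties are broken by whichever colour table() lists first.

```r
# Rebuild the example matrix from the post
test <- cbind(letter = c("a", "a", "a", "b", "c"),
              color  = c("blue", "red", "red", "orange", "green"))

# For each letter, tabulate its colours and keep the name of the largest count
mostCommon <- tapply(test[, "color"], test[, "letter"],
                     function(x) names(which.max(table(x))))
```

mostCommon is a named array with one entry per letter; data.frame(letter = names(mostCommon), color = as.vector(mostCommon)) turns it back into a table.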

Related

How to use R to sort two groups based on shared elements?

I have 2 groups (alpha & beta) and want to use R to get 3 lists of the elements present in 1. only alpha, 2. only beta, 3. both groups. So basically a Venn diagram in list form. Here is an example:
group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black
As a result, the lists should be something like:
alpha: red, blue, orange
beta: green, purple, yellow
both: black, white
Assuming I have the data saved in a (tab-separated) .txt-file or a .csv-file (e.g. FILE.txt), how would I have to import/preprocess the data and how could I get the elements sorted as described? Are there any packages that need to be installed beforehand? Sorry, I know some steps likely seem obvious, but my R-skills are somewhat limited.
Thanks a lot for the help!
p.s. Not essential, but "nice to have": What if I wanted to sort 3 different groups?
You can use setdiff and intersect:
y <- split(x$color, x$group)
z <- list(setdiff(y[[1]], y[[2]]), setdiff(y[[2]], y[[1]]))
names(z) <- names(y)
z[["both"]] <- intersect(y[[1]], y[[2]])
z
#$alpha
#[1] "red" "blue" "orange"
#
#$beta
#[1] "green" "purple" "yellow"
#
#$both
#[1] "black" "white"
Data:
x <- read.table(header=TRUE, text="group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black")
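
The import and the "3 groups" part of the question could be handled along these lines. This is a sketch only: it assumes FILE.txt is tab-separated with a header row (the read.delim call is left commented out so the snippet runs without the file), and the names membership and result are illustrative. No extra packages are needed; everything used here is base R.

```r
# Reading the file (assumption: tab-separated, header row "group<TAB>color"):
# x <- read.delim("FILE.txt", stringsAsFactors = FALSE)
# For a comma-separated file, read.csv("FILE.csv", stringsAsFactors = FALSE)

# Inline copy of the example data so the snippet is self-contained
x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black")

# For every colour, record the set of groups it occurs in, e.g. "alpha & beta"
membership <- sapply(split(x$group, x$color),
                     function(g) paste(sort(unique(g)), collapse = " & "))

# Group the colours by that membership label; works for any number of groups
result <- split(names(membership), membership)
```

With a third group in the data, labels such as "alpha & gamma" or "alpha & beta & gamma" would appear as additional list elements without changing the code.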

Is it possible to use the whole data frame as predictor in R?

For example, I have a 4x3 data frame:
Weight Color Shape
Apple 0.1 Pink heart
Orange 0.2 Orange sphere
Strawberry 0.01 White heart
Watermelon 1.72 Green square
and I would like the output to be Japan (a country). All the information is important here, so is it possible to use the whole data frame to predict one value (Japan)?

Randomly assign objects to `K` clusters, according to a Dirichlet Multinomial Distribution

I'm trying to cluster short documents such as the following:
sentences<-c("The color blue neutralizes orange yellow reflections.",
"Zod stabbed me with blue Kryptonite.",
"Because blue is your favourite colour.",
"Red is wrong, blue is right.",
"You and I are going to yellowstone.",
"Van Gogh looked for some yellow at sunset.",
"You ruined my beautiful green dress.",
"You do not agree.",
"There's nothing wrong with green.")
In the initialization step of my code, I should randomly assign the documents to K clusters, according to a Dirichlet Multinomial Distribution.
How could I perform this task?
Edit: Thanks to @ags29's comment, I found the following in "Sampling from Dirichlet-Multinomial":
D=9 # number of documents in the corpus; I have 9 sentences in my example
k=2 # number of clusters (e.g. 2)
alpha=runif(D) # value of alpha, here chosen at random
p=rgamma(D,alpha) # pre-simulation of the Dirichlet
x=rmultinom(1,k,p)
What do you think?
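
One possible reading of the initialization step, offered as a sketch rather than a definitive answer: draw a vector of K cluster probabilities from a Dirichlet distribution (by normalizing independent gamma draws), then assign each of the D documents to a cluster with those probabilities. The per-cluster counts then follow a Dirichlet-multinomial distribution. The symmetric alpha of 1 and the names g, z and counts are assumptions made for illustration.

```r
set.seed(1)                # only to make the sketch reproducible

D <- 9                     # number of documents
K <- 2                     # number of clusters
alpha <- rep(1, K)         # assumed symmetric concentration, one value per cluster

# Dirichlet(alpha) draw: normalize independent Gamma(alpha_k, 1) variables
g <- rgamma(K, shape = alpha, rate = 1)
p <- g / sum(g)

# Assign every document to one of the K clusters with probabilities p
z <- sample.int(K, size = D, replace = TRUE, prob = p)

# Cluster sizes; marginally these are Dirichlet-multinomial distributed
counts <- tabulate(z, nbins = K)
```

Note that here alpha and p have length K (one entry per cluster), and the draw produces D assignments, whereas the snippet above uses length D for alpha and p and distributes only k items.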

apply with subset function (or custom function based on subset)

I am trying to find a way to use an apply function together with subset (or a custom function based on subset). I know similar questions have already been asked, but mine is a little more specific. I need to subset certain parts of multiple data sets based on more than one variable. I have a couple of "types" of data frame structures; one of them looks similar to this:
colour shade value
RED LIGHT -1.05
RED LIGHT -1.37
RED LIGHT -0.32
RED LIGHT 0.87
RED LIGHT -0.2
RED DARK 0.52
RED DARK -0.2
RED DARK 0.64
RED DARK 1.12
RED DARK 4
BLUE LIGHT 0.93
BLUE LIGHT 0.78
BLUE LIGHT -1.84
BLUE LIGHT -0.5
BLUE LIGHT -1.11
BLUE DARK -4.86
BLUE DARK 1.11
BLUE DARK 0.14
BLUE DARK 0.12
BLUE DARK -1.65
GREEN LIGHT 3.13
GREEN LIGHT 2.65
GREEN LIGHT -2.36
GREEN LIGHT -3.11
GREEN LIGHT 3.49
GREEN DARK 1.91
GREEN DARK -1.1
GREEN DARK -1.93
GREEN DARK 1
GREEN DARK -0.23
I have a lot of those. Their names are stored in
list.dfs.names=df1,df2,df3
Based on this I need to use subset or a custom function based on it:
customSubset=function(df,col,shade){subset(df,df$colour %in% col & df$shade %in% shade)}
I use custom functions like this because, as I said, I have a couple of types of df structures and it speeds up my work a little. It works like this:
example=customSubset(df1,"BLUE","DARK")
and output is:
colour shade value
11 BLUE LIGHT 0.93
12 BLUE LIGHT 0.78
13 BLUE LIGHT -1.84
14 BLUE LIGHT -0.50
15 BLUE LIGHT -1.11
16 BLUE DARK -4.86
17 BLUE DARK 1.11
18 BLUE DARK 0.14
19 BLUE DARK 0.12
20 BLUE DARK -1.65
Until now I have been using for loops, but I want to switch to apply, which seems more convenient, especially where nested loops are required. So I tried:
lapply(customSubset(list.dfs.names, "BLUE","DARK") )
and
lapply(list.dfs.names, customSubset("BLUE","DARK") )
with no success. Could anyone give me a hand with this issue? I don't think I clearly understand how apply loops work. However, I am quite familiar with the for method, so any additional explanation of the differences would be appreciated.
If it is not possible with customSubset, it's OK for me to use regular subset or any other method that produces the same result as the example presented above.
Thank you in advance.
EDIT: here is code to produce a df similar to the example I posted:
data.frame(colour=c(rep("RED",10),rep("BLUE",10),rep("GREEN",10))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,value=runif(30,min=0,max=1))
EDIT2: As requested, I am editing my post to expand on my year problem. My dfs come from different years (several from each), for example df.1.2012, df.2.2012, df.1.2011 and so on. The main issue is that I never need to refer to the same year in all of the dfs (that would be very easy); instead I need to subset the data based on a certain horizon (for example year+2 or year-1). I used to create a list of desired years (for year+2 it would be list.year=c(2014,2014,2013)), which was paired with the list of my dfs (that is how it worked with the for loop).
I need to find a similar method for the apply approach. Here is an example:
set.seed(200)
df_2014=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
df_2013=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
horizon=+1
subset(df_2014, df_2014$colour %in% "BLUE" & df_2014$shade %in% "DARK" & df_2014$year %in% c(2014+horizon))
subset(df_2013, df_2013$colour %in% "BLUE" & df_2013$shade %in% "DARK" & df_2013$year %in% c(2013+horizon))
So I added a column with years, called it year, and named the dfs after their year (so year+1 would be 2014+1 here). Horizon is self-explanatory. The result is:
#df_2014
colour shade year value
20 BLUE DARK 2015 0.6463296
#df_2013
colour shade year value
20 BLUE DARK 2015 0.6532767
I need to apply a function to a list of data frames (in this edit, list.df=list(df_2014,df_2013)) as in the previous example, but this time add the subset condition year+horizon (and possibly put all results in one df, but this is not the main issue here).
In conclusion: in the year+horizon part of both subset calls above, year has to change depending on which df from the list the loop is currently processing, while horizon stays constant.
If you have trouble understanding what I mean, please let me know; I tried to be very specific.
The problem seems to be the construct
subset(df,df$colour %in% col & df$shade %in% shade)
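You are using subset, which evaluates the logical expression inside the data frame given as its first argument, df. Because df has a column named shade, the name shade in df$shade %in% shade refers to that column rather than to the function argument, so the condition amounts to df$shade %in% df$shade, which is always TRUE. (The colour filter still works because the argument is named col, not colour.) Rewriting the function with argument names that differ from the column names does the trick.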
customSubset <- function(DF, COL, SHADE){
  subset(DF, colour %in% COL & shade %in% SHADE)
}
Now everything works as expected.
set.seed(5601) # make the results reproducible
df1 <- data.frame(colour = sample(c("RED", "GREEN", "BLUE"), 30, TRUE),
shade = sample(c("LIGHT", "DARK"), 30, TRUE),
value = rnorm(30, sd = 9))
df2 <- data.frame(colour = c(rep("RED",10), rep("BLUE",10), rep("GREEN",10))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)), 3))
, value = runif(30,min=0,max=1))
list.dfs <- list(df1, df2)
customSubset(df1,"BLUE","DARK")
# colour shade value
#5 BLUE DARK 4.288107
#6 BLUE DARK 2.860724
#8 BLUE DARK -10.720379
#10 BLUE DARK -15.407090
#14 BLUE DARK -2.259848
#30 BLUE DARK -18.364494
# apply the function to all df's in the list
# both forms are equivalent
lapply(list.dfs, function(x) customSubset(x, "BLUE", "DARK"))
lapply(list.dfs, customSubset, "BLUE", "DARK")
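
For the year+horizon case from EDIT2, where one value has to change together with the data frame, mapply can walk over the list of data frames and a parallel vector of base years. This is a sketch under stated assumptions: the helper subsetByHorizon, the vector base.years and the generator make_df are hypothetical names introduced here, and the data are random stand-ins with the same layout as in the question.

```r
set.seed(200)

# Hypothetical generator for data frames shaped like those in the question
make_df <- function() {
  data.frame(colour = rep(c("RED", "BLUE", "GREEN"), each = 10),
             shade  = rep(rep(c("LIGHT", "DARK"), each = 5), times = 3),
             year   = rep(2011:2015, times = 6),
             value  = runif(30, min = 0, max = 1))
}

list.df    <- list(df_2014 = make_df(), df_2013 = make_df())
base.years <- c(2014, 2013)   # one base year per data frame, in the same order
horizon    <- 1

# Plain logical indexing avoids the name-masking issue described above
subsetByHorizon <- function(DF, BASE, COL, SHADE, HORIZON) {
  keep <- DF$colour %in% COL & DF$shade %in% SHADE & DF$year %in% (BASE + HORIZON)
  DF[keep, , drop = FALSE]
}

# DF and BASE vary together; COL, SHADE and HORIZON stay constant via MoreArgs
res <- mapply(subsetByHorizon, list.df, base.years,
              MoreArgs = list(COL = "BLUE", SHADE = "DARK", HORIZON = horizon),
              SIMPLIFY = FALSE)

# Optionally stack all pieces into one data frame
combined <- do.call(rbind, res)
```

lapply iterates over a single list, which is why it fits the constant "BLUE"/"DARK" case; mapply (or Map) iterates over several inputs in parallel, much like a for loop indexing two lists with the same i.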

probability of urnsample gives 0?

An urn contains 10 balls, of which 3 are white, 4 blue and 3 black. Three balls are drawn at random from the urn. I assign this to a sample space using the following code:
require(prob)
L<-rep(c("White","Blue","Black"),times=c(3,4,3))
M<-urnsamples(L,size=3,replace=FALSE, ordered=FALSE)
N<-probspace(M)
When calculating the probability that the draw contains at least one white and one black ball, I get the right answer.
> Prob(N,isin(N,c("White","Black")))
[1] 0.45
But when I try to calculate the probability of drawing one ball of each colour, or two white balls and one black ball, I get 0:
> Prob(N,isrep(N,"White","Blue","Black",1,1,1))
[1] 0
> Prob(N,isrep(N,"White","Black",2,1))
[1] 0
Is there something wrong with the code? Logically the answers should be 0.3 and 0.075 respectively. And if it works in the first case, why not in the second and third, since all three should use the same kind of code?
You want to be able to specify the number of times that a certain colour appears in your results.
Bear in mind that we are limited by the sample size that you set, which was 3. We can see the list of possible combinations of 3 colours and their probabilities in an easy-to-read format using noorder (the output below labels the three colours "Ash Gray", "Blue" and "Ghost White"):
noorder(N)
X1 X2 X3 probs
1 Ash Gray Ash Gray Ash Gray 0.008333333
2 Ash Gray Ash Gray Blue 0.100000000
3 Ash Gray Blue Blue 0.150000000
4 Blue Blue Blue 0.033333333
5 Ash Gray Ash Gray Ghost White 0.075000000
6 Ash Gray Blue Ghost White 0.300000000
7 Blue Blue Ghost White 0.150000000
8 Ash Gray Ghost White Ghost White 0.075000000
9 Blue Ghost White Ghost White 0.100000000
10 Ghost White Ghost White Ghost White 0.008333333
So from that table you can see that the probability of drawing 3 "Ash Gray" balls, for instance, is 0.008333333.
If we want to find the probability of having at least 2 "Ghost White" balls in the sample:
Q <- noorder(N)
Prob(Q,isin(Q,c("Ghost White", "Ghost White")))
[1] 0.1833333
We can verify this answer using the table above:
> 0.100000000+0.008333333+0.075000000
[1] 0.1833333
Let's make the sample size bigger and experiment some more.
M<-urnsamples(L,size=7,replace=FALSE, ordered=FALSE)
N<-probspace(M)
Q <- noorder(N)
With a sample size of 7, the probability of at least 2 "Ash Gray" and at least 1 "Ghost White" is:
Prob(Q,isin(Q,c("Ash Gray", rep(c("Ghost White", "Ash Gray"),1))))
[1] 0.8083333
and the probability of at least 3 "Ash Gray" and at least 2 "Ghost White" is:
> Prob(Q,isin(Q,c("Ash Gray", rep(c("Ghost White", "Ash Gray"),2)))
[1] 0.1833333
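
The two probabilities asked about in the question (exact counts in a sample of size 3) can be cross-checked without the prob package. The sketch below uses base R only and the question's original labels; it enumerates all equally likely draws with combn and compares the result with the closed-form counts from choose(). The variable names are illustrative.

```r
L <- rep(c("White", "Blue", "Black"), times = c(3, 4, 3))

# All choose(10, 3) = 120 unordered draws of 3 balls, one per column
draws <- combn(length(L), 3)

# Number of balls of each colour in every draw
n_white <- apply(draws, 2, function(idx) sum(L[idx] == "White"))
n_blue  <- apply(draws, 2, function(idx) sum(L[idx] == "Blue"))
n_black <- apply(draws, 2, function(idx) sum(L[idx] == "Black"))

# Exactly one ball of each colour
p_one_each <- mean(n_white == 1 & n_blue == 1 & n_black == 1)

# Exactly two white balls and one black ball
p_2w_1b <- mean(n_white == 2 & n_black == 1)

# Closed-form check: favourable combinations over all combinations
p_one_each_exact <- 3 * 4 * 3 / choose(10, 3)
p_2w_1b_exact    <- choose(3, 2) * choose(3, 1) / choose(10, 3)
```

Both routes give 0.3 for one ball of each colour and 0.075 for two white and one black.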
