apply with subset function (or custom function based on subset) - r

I am trying to find a way to use an apply function together with subset (or a custom function based on subset). I know similar questions have already been asked, but mine is a little more specific. I need to subset a certain part of multiple data sets based on more than one variable. I have a couple of "types" of data frame structures; one of them looks similar to this:
colour shade value
RED LIGHT -1.05
RED LIGHT -1.37
RED LIGHT -0.32
RED LIGHT 0.87
RED LIGHT -0.2
RED DARK 0.52
RED DARK -0.2
RED DARK 0.64
RED DARK 1.12
RED DARK 4
BLUE LIGHT 0.93
BLUE LIGHT 0.78
BLUE LIGHT -1.84
BLUE LIGHT -0.5
BLUE LIGHT -1.11
BLUE DARK -4.86
BLUE DARK 1.11
BLUE DARK 0.14
BLUE DARK 0.12
BLUE DARK -1.65
GREEN LIGHT 3.13
GREEN LIGHT 2.65
GREEN LIGHT -2.36
GREEN LIGHT -3.11
GREEN LIGHT 3.49
GREEN DARK 1.91
GREEN DARK -1.1
GREEN DARK -1.93
GREEN DARK 1
GREEN DARK -0.23
I have a lot of those. Their names are stored in
list.dfs.names=df1,df2,df3
Based on this I need to use subset or custom function based on it:
customSubset=function(df,col,shade){subset(df,df$colour %in% col & df$shade %in% shade)}
I use custom functions like this because, as I said, I have a couple of types of df structures and it speeds up my work a little. It works like this:
example=customSubset(df1,"BLUE","DARK")
and output is:
colour shade value
11 BLUE LIGHT 0.93
12 BLUE LIGHT 0.78
13 BLUE LIGHT -1.84
14 BLUE LIGHT -0.50
15 BLUE LIGHT -1.11
16 BLUE DARK -4.86
17 BLUE DARK 1.11
18 BLUE DARK 0.14
19 BLUE DARK 0.12
20 BLUE DARK -1.65
Until now I was using for loops, but I want to change my approach to apply, which seems more convenient, especially where nested loops are required. So I tried:
lapply(customSubset(list.dfs.names, "BLUE","DARK") )
and
lapply(list.dfs.names, customSubset("BLUE","DARK") )
with no success. Could anyone give me a hand with this? I don't think I clearly understand how apply loops work. However, I am quite familiar with the for approach, so any additional explanation of the differences would be appreciated.
If it is not possible with customSubset, it's OK for me to use regular subset or any other method that produces the same result as the example presented above.
Thank you in advance
EDIT: here is code to produce similar df to example i posted:
`data.frame("colour"=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,"shade"=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
, runif(30,min=0,max=1))`
EDIT2: As requested, I am expanding my post to cover the year part of my problem. My dfs come from different years (several per year), for example df.1.2012, df.2.2012, df.1.2011 and so on. The main issue is that I never need the same year in all dfs (that would be very easy); instead I need to subset the data based on a certain horizon (for example year+2 or year-1). I used to create a list of the desired years (with year+2 it would be list.year=c(2014,2014,2013)), which was paired with the list of my dfs (that is how it worked with the for loop).
I need to find a similar method for the apply approach. Here is an example:
set.seed(200)
df_2014=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
df_2013=data.frame(colour=(c(rep("RED",10),rep("BLUE",10),rep("GREEN",10)))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)),3))
,year=c(rep(2011:2015,6))
,value=runif(30,min=0,max=1))
horizon=+1
subset(df_2014, df_2014$colour %in% "BLUE" & df_2014$shade %in% "DARK" & df_2014$year %in% c(2014+horizon))
subset(df_2013, df_2013$colour %in% "BLUE" & df_2013$shade %in% "DARK" & df_2013$year %in% c(2013+horizon))
So I added a column with years, called it year, and named the dfs after their year (so year+1 would be 2014+1 here). Horizon is self-explanatory. The result is:
#df_2014
colour shade year value
20 BLUE DARK 2015 0.6463296
#df_2013
colour shade year value
20 BLUE DARK 2015 0.6532767
I need to apply a function to a list of data frames (in this edit, list.df=list(df_2014,df_2013)) as in the previous example, but this time with the added subset condition year+horizon (and possibly put all results in one df, but that is not the main issue here).
In conclusion: in both subset calls above, in the year+horizon part, year has to change depending on which df from the list the loop is currently on, while horizon stays constant.
If you have trouble understanding what I mean, please let me know; I tried to be very specific.

The problem seems to be the construct
subset(df,df$colour %in% col & df$shade %in% shade)
subset evaluates its logical expression inside the data frame given as its first argument, so within that expression the name shade refers to the column df$shade, not to the function argument shade. The test df$shade %in% shade therefore compares the column with itself and is always TRUE, which is why the LIGHT rows show up in your output. Renaming the function arguments so that they do not clash with column names does the trick:
customSubset <- function(DF, COL, SHADE){
subset(DF, colour %in% COL & shade %in% SHADE)
}
Now everything works as expected.
set.seed(5601) # make the results reproducible
df1 <- data.frame(colour = sample(c("RED", "GREEN", "BLUE"), 30, TRUE),
shade = sample(c("LIGHT", "DARK"), 30, TRUE),
value = rnorm(30, sd = 9))
df2 <- data.frame(colour = c(rep("RED",10), rep("BLUE",10), rep("GREEN",10))
,shade=c(rep(c(rep("LIGHT",5),rep("DARK",5)), 3))
, value = runif(30,min=0,max=1))
list.dfs <- list(df1, df2)
customSubset(df1,"BLUE","DARK")
# colour shade value
#5 BLUE DARK 4.288107
#6 BLUE DARK 2.860724
#8 BLUE DARK -10.720379
#10 BLUE DARK -15.407090
#14 BLUE DARK -2.259848
#30 BLUE DARK -18.364494
# apply the function to all df's in the list
# both forms are equivalent
lapply(list.dfs, function(x) customSubset(x, "BLUE", "DARK"))
lapply(list.dfs, customSubset, "BLUE", "DARK")
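For the year+horizon part of the question (EDIT2), one possible approach is Map, which walks over the list of data frames and a vector of base years in parallel. This is only a sketch: customSubsetYear, make_df and base.years are hypothetical names introduced here, and the data frames are rebuilt in the shape shown in the question.

```r
# Hypothetical helper: same idea as customSubset, plus a target year.
# Argument names are upper case so they cannot clash with column names.
customSubsetYear <- function(DF, COL, SHADE, YEAR) {
  subset(DF, colour %in% COL & shade %in% SHADE & year %in% YEAR)
}

# Rebuild two data frames with the structure from the question
set.seed(200)
make_df <- function() {
  data.frame(colour = rep(c("RED", "BLUE", "GREEN"), each = 10),
             shade  = rep(rep(c("LIGHT", "DARK"), each = 5), times = 3),
             year   = rep(2011:2015, times = 6),
             value  = runif(30))
}
list.df    <- list(df_2014 = make_df(), df_2013 = make_df())
base.years <- c(2014, 2013)  # assumption: one base year per df, same order
horizon    <- 1

# Map passes the i-th data frame together with the i-th base year
res <- Map(function(d, y) customSubsetYear(d, "BLUE", "DARK", y + horizon),
           list.df, base.years)

# Optionally stack all results into a single data frame
combined <- do.call(rbind, res)
combined
```

Map(f, a, b) is a thin wrapper around mapply(f, a, b, SIMPLIFY = FALSE), so it plays the role of a for loop over paired indices while horizon stays constant.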

Related

Adapt given color pairs to adhere to W3C Accessibility standard for ePubs

We are trying to produce ePub publications that adhere to the W3C accessibility standards. One of the remaining issues is insufficient color contrast between the text color and the background color. We use Ace by DAISY (great tool!) which provides information about this sort of issue in both textual form and JSON:
And here is the JSON (it's not very straightforward to extract the two color values from the dct:description, but with a regular expression we manage):
{
"#type": "earl:assertion",
"earl:result": {
"earl:outcome": "fail",
"dct:description": "Element has insufficient color contrast of 2.86 (foreground color: #fd5f07, background color: #fff5ea, font size: 28.1pt (37.52px), font weight: normal). Expected contrast ratio of 3:1",
"earl:pointer": {
"cfi": [
"/4/2/6/2",
"/4/2/6"
],
"css": [
".inbrief-title",
".inbrief"
]
},
"html": "<div xmlns=\"http://www.w3.org/1999/xhtml\" class=\"inbrief-title\">In Brief</div> <!--##--> <div xmlns=\"http://www.w3.org/1999/xhtml\" class=\"box-container inbrief\">"
},
"earl:assertedBy": "aXe",
"earl:mode": "automatic",
"earl:test": {
"earl:impact": "serious",
"dct:title": "color-contrast",
"dct:description": "Ensures the contrast between foreground and background colors meets WCAG 2 AA contrast ratio thresholds",
"help": {
"url": "http://kb.daisy.org/publishing/docs/css/color.html",
"dct:title": "Color",
"dct:description": "Elements must have sufficient color contrast"
},
"rulesetTags": [
"cat.color",
"wcag2aa",
"wcag143"
]
}
},
However, I'd like to adapt these colors programmatically to create sufficient contrast between background and text colors by changing the colors as little as possible. For example, if there is a light yellow with white text, change the yellow to a darker hue until it meets the contrast requirement. Or, conversely, make the text black.
Is there an algorithm that one could implement that can do this? I couldn't find anything that seemed useful. When describing the use case above I found myself noticing that it is probably quite a subjective decision and therefore hard to automate.
The first thing you'd have to do is decide which color to change, the foreground or background. What I'd probably do is decide which color is the furthest from white or black because its value could be changed the most.
But before that, you have to figure out which color is light and which is dark. Fortunately, that's pretty easy. Since white is #fff and black is #000, whichever color value is the smallest (ie, closer to #000) is the darker one.
Then just subtract the light color from #fff (hexadecimal subtraction or convert the colors to decimal) and compare that to the darker color.
If the subtracted value for the light color is smaller than the dark color, then the light color is closer to white than the darker color is to black so you'll want to start modifying the darker color.
If the subtracted value for the light color is larger than the dark color, then the light color is further from white than the darker color is from black so you'll want to start modifying the light color.
When changing the color, just add or subtract 1 from each RGB component. Add 1 if you're making the light color lighter or subtract 1 if you're making the dark color darker.
After you add or subtract 1 from each RGB, compute the luminance contrast to see if you're above 4.5 (if the font is small), or above 3 (if the font is large - where "large" is defined as 14pt bold or 18pt normal).
The contrast ratio formula is kind of messy but doable.
(L1 + 0.05) / (L2 + 0.05)
L1 is the relative luminance of the lighter of the colors, and
L2 is the relative luminance of the darker of the colors.
You already know which color is lighter and which is darker.
The relative luminance value (L) is the messy part, although it initially looks simple. It's just:
L = 0.2126 * R + 0.7152 * G + 0.0722 * B
Essentially, it's 21% red, 71% green, 7% blue. But it's not the straight values of R, G, and B from your color. First you have to take each R, G, B value and divide by 255. Then depending on that value, you do some magic on the values.
R = R / 255
if R <= 0.03928 then R = R/12.92
else R = ((R+0.055)/1.055) ^ 2.4
Do that for G and B too. Now that you have new values for R, G, and B, then you can apply the 21%/71%/7% relative luminance above and that gives you L1 or L2, and then you can add 0.05 to both values and divide L1 by L2.
I told you it was messy but doable. This will give you the contrast value, which should be between 1 and 21. (1 is when the two colors are the same [white on white, red on red, orange on orange, etc.] and 21 is when the two colors are black (#000) and white (#fff).)
If you haven't reached 4.5 (or 3.0, depending on the font size), then add or subtract 1 from each R/G/B again and compute the ratio again.
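The luminance and ratio formulas above could be sketched in R roughly like this. The function names relative_luminance and contrast_ratio are made up for this example; col2rgb comes from the grDevices package that ships with R. The colours checked at the end are the foreground/background pair from the Ace report in the question.

```r
# Relative luminance of a hex colour such as "#fd5f07",
# following the steps described above.
relative_luminance <- function(hex) {
  chan <- as.vector(grDevices::col2rgb(hex)) / 255   # R, G, B scaled to 0..1
  lin  <- ifelse(chan <= 0.03928,
                 chan / 12.92,
                 ((chan + 0.055) / 1.055)^2.4)
  sum(c(0.2126, 0.7152, 0.0722) * lin)
}

# Contrast ratio: the lighter colour always goes in the numerator.
contrast_ratio <- function(hex1, hex2) {
  l <- sort(c(relative_luminance(hex1), relative_luminance(hex2)),
            decreasing = TRUE)
  (l[1] + 0.05) / (l[2] + 0.05)
}

contrast_ratio("#000000", "#ffffff")  # 21
contrast_ratio("#fd5f07", "#fff5ea")  # about 2.86, as reported by Ace
```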
For funsies, here's an example:
color1 = #5EBA7D (the green background color on stackoverflow that shows the upvote value)
color2 = #FDF7E2 (the yellow background color on stackoverflow for the sidebar)
(The upvote text color is white but that's boring for an example so I chose yellow instead).
It doesn't matter which color is foreground or background. You get the same ratio. So let's use C1 as foreground and C2 as background. If you use a tool, such as CCA (Colour Contrast Analyser), it shows a value of 2.2:1, which fails WCAG 1.4.3.
Let's plug those numbers into the formula.
C1
#5EBA7D
R = #5E (94)
G = #BA (186)
B = #7D (125)
Divide each one by 255
R = 94/255 = 0.368
G = 186/255 = 0.729
B = 125/255 = 0.490
None of those values are less than 0.03928 so we have to apply the messy part. Add 0.055 then divide by 1.055 then raise that to the power of 2.4.
R = ((0.368 + 0.055)/1.055) ^ 2.4 = 0.112
G = ((0.729 + 0.055)/1.055) ^ 2.4 = 0.491
B = ((0.490 + 0.055)/1.055) ^ 2.4 = 0.205
Now you apply the percentages (21%/71%/7%) to get the first color's luminance value.
L1 = 0.2126 * R + 0.7152 * G + 0.0722 * B
L1 = 0.2126 * (.112) + 0.7152 * (.491) + 0.0722 * (.205)
L1 = 0.024 + 0.351 + 0.015
L1 = 0.390
Then you do it for the other color to get L2 = 0.929
L2 was the yellow color so is actually the lighter color and should be L1 in the original formula. So we'll make the green, which is darker, L2.
(L1 + 0.05) / (L2 + 0.05)
= (0.929 + 0.05) / (0.390 + 0.05)
= 2.226
If you use a color contrast tool, you should get the same value, although most tools typically round to one decimal place so you'll get 2.2.
Now that you know the ratio, which color is further away from black or white so you know which color to adjust?
color1 = #5EBA7D (green)
color2 = #FDF7E2 (yellow)
The smaller number is darker (closer to #000 black) so that would be #5EBA7D (green).
How far is the lighter color (#FDF7E2, yellow) from white (#fff)?
#FFFFFF - #FDF7E2 = #02081D
Now, which is bigger? The dark color, #5EBA7D (green), or the distance from yellow to white, #02081D? In this case, the green is bigger so it's further away from black than the yellow is from white.
Doing a gut check, that makes sense. The green isn't very dark, kind of a light-ish green so it's not very close to black. But yellow is very light and is close to white.
Since green is further from black than the yellow is from white, you can start adjusting the green and making it darker. Do this by subtracting 1 from each R/G/B value then recomputing the luminance. It probably won't change much so you'll have to keep adjusting the green and making it darker until you reach 4.5 (or 3.0, depending on font size).
If you program out this case, the green should eventually get to #258144 (which ends up subtracting 57 from each RGB value), which has a color ratio of 4.6 to the yellow.
If any of the RGB values reach 0 before you get to a decent contrast ratio, then you have to start making the light color lighter by adding 1 to each RGB value and recomputing the contrast. By adding or subtracting 1 from each RGB, you keep the color hue. The green is still green but is darker, or the yellow is still yellow but is lighter.
Are you wishing you hadn't asked the question now?
Minor Update:
Adjusting the RGB values by 1 each time and recomputing the luminance might not be very efficient. In my example, that would have to be done 57 times. You might want to add or subtract 10 from each RGB component instead of 1 and see how close to 4.5 you get. If not there yet, adjust by 10 again. Once you get greater than 4.5, you can tweak it back a bit and adjust by 2 in the opposite direction until you get as close to 4.5 without going under.
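If you want to program out the green/yellow case, a minimal R sketch of the "subtract from each RGB component until the ratio is reached" loop might look like this. darken_until is a hypothetical name, the luminance helpers are repeated so the block runs on its own, and only the "make the dark color darker" branch is covered.

```r
relative_luminance <- function(hex) {
  chan <- as.vector(grDevices::col2rgb(hex)) / 255
  lin  <- ifelse(chan <= 0.03928, chan / 12.92, ((chan + 0.055) / 1.055)^2.4)
  sum(c(0.2126, 0.7152, 0.0722) * lin)
}

contrast_ratio <- function(hex1, hex2) {
  l <- sort(c(relative_luminance(hex1), relative_luminance(hex2)),
            decreasing = TRUE)
  (l[1] + 0.05) / (l[2] + 0.05)
}

# Subtract `step` from every RGB channel of `dark` until the contrast
# against `light` reaches `target`, or until a channel would drop below 0.
darken_until <- function(dark, light, target = 4.5, step = 1) {
  chan  <- as.vector(grDevices::col2rgb(dark))
  hex   <- dark
  steps <- 0
  while (contrast_ratio(hex, light) < target && all(chan - step >= 0)) {
    chan  <- chan - step
    steps <- steps + 1
    hex   <- grDevices::rgb(chan[1], chan[2], chan[3], maxColorValue = 255)
  }
  list(colour = hex, steps = steps, ratio = contrast_ratio(hex, light))
}

res <- darken_until("#5EBA7D", "#FDF7E2")
res$colour  # "#258144"
res$steps   # 57
res$ratio   # about 4.55
```

If the loop stops because a channel hit 0 while the ratio is still below the target, the remaining gap would have to be closed by lightening the other color, as described above.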

How to use R to sort two groups based on shared elements?

I have 2 groups (alpha & beta) and want to use R to get 3 lists of the elements present in 1. only alpha, 2. only beta, 3. both groups. So basically a Venn diagram in list form. Here is an example:
group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black
As a result, the lists should be something like:
alpha: red, blue, orange
beta: green, purple, yellow
both: black, white
Assuming I have the data saved in a (tab-separated) .txt-file or a .csv-file (e.g. FILE.txt), how would I have to import/preprocess the data and how could I get the elements sorted as described? Are there any packages that need to be installed beforehand? Sorry, I know some steps likely seem obvious, but my R-skills are somewhat limited.
Thanks a lot for the help!
p.s. Not essential, but "nice to have": What if I wanted to sort 3 different groups?
You can use setdiff and intersect:
y <- split(x$color, x$group)
z <- list(setdiff(y[[1]], y[[2]]), setdiff(y[[2]], y[[1]]))
names(z) <- names(y)
z[["both"]] <- intersect(y[[1]], y[[2]])
z
#$alpha
#[1] "red" "blue" "orange"
#
#$beta
#[1] "green" "purple" "yellow"
#
#$both
#[1] "black" "white"
Data:
x <- read.table(header=TRUE, text="group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black")
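For the "nice to have" with 3 (or more) groups, the same idea can be generalised. The following is a sketch under assumptions: the gamma group and its colors are invented for illustration, and only and shared are hypothetical names. For a file, read.table("FILE.txt", header = TRUE, sep = "\t") should give the same kind of data frame, and no extra packages are needed.

```r
x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "group color
alpha red
alpha blue
alpha black
alpha white
alpha orange
beta green
beta white
beta purple
beta yellow
beta black
gamma black
gamma pink
gamma green")

y <- split(x$color, x$group)

# Elements found in exactly one group: drop everything that
# occurs in any of the other groups.
only <- lapply(names(y), function(g) {
  setdiff(y[[g]], unlist(y[names(y) != g]))
})
names(only) <- names(y)

# Elements shared by all groups
shared <- Reduce(intersect, y)

only
shared
```

With this made-up data, only contains red, blue, orange for alpha, purple and yellow for beta, and pink for gamma, while shared is black. Pairwise overlaps (for example alpha and beta only) would need an extra intersect/setdiff step.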

Is it possible to use the whole data frame as predictor in R?

For example, I have a 4x3 data frame:
Weight Color Shape
Apple 0.1 Pink heart
Orange 0.2 Orange sphere
Strawberry 0.01 White heart
Watermelon 1.72 Green square
and I would like the output to be Japan (country). All the information is important here, so is it possible to use the whole data frame to predict one value (Japan)?

probability of urnsample gives 0?

An urn contains 10 balls, in which 3 are white, 4 blue and 3 black. Three balls are drawn at random from the urn. I assign this to a sample space using the following code:
require(prob)
L<-rep(c("White","Blue","Black"),times=c(3,4,3))
M<-urnsamples(L,size=3,replace=FALSE, ordered=FALSE)
N<-probspace(M)
While calculating the probability of drawing at least one white and at least one black ball, I get the right answer.
> Prob(N,isin(N,c("White","Black")))
[1] 0.45
But when I try to calculate the probability of drawing two white balls and one black ball, or one ball of each colour, I get 0 as the answer:
> Prob(N,isrep(N,"White","Blue","Black",1,1,1))
[1] 0
> Prob(N,isrep(N,"White","Black",2,1))
[1] 0
Is there something wrong with the code? Logically, the answers should be 0.3 (one ball of each colour) and 0.075 (two white and one black) respectively. And if it works in the first case, why not in the second and third, since all three should use the same kind of code?
You want to be able to specify the number of times that a certain color will appear in your results.
Bear in mind that we are somewhat limited by the sample size that you set, which was 3. We can see the list of possible combinations of 3 colors and their probabilities in an easy-to-read format using noorder:
noorder(N)
X1 X2 X3 probs
1 Ash Gray Ash Gray Ash Gray 0.008333333
2 Ash Gray Ash Gray Blue 0.100000000
3 Ash Gray Blue Blue 0.150000000
4 Blue Blue Blue 0.033333333
5 Ash Gray Ash Gray Ghost White 0.075000000
6 Ash Gray Blue Ghost White 0.300000000
7 Blue Blue Ghost White 0.150000000
8 Ash Gray Ghost White Ghost White 0.075000000
9 Blue Ghost White Ghost White 0.100000000
10 Ghost White Ghost White Ghost White 0.008333333
So from that table you can see that the probability of having 3 "Ash Gray" balls for instance is 0.008333333.
If we want to find the probability of having 2 "Ghost White" balls in the sample:
Q <- noorder(N)
Prob(Q,isin(Q,c("Ghost White", "Ghost White")))
[1] 0.1833333
We can verify this answer using the table above:
> 0.100000000+0.008333333+0.075000000
[1] 0.1833333
Let's make the sample size bigger and experiment some more.
M<-urnsamples(L,size=7,replace=FALSE, ordered=FALSE)
N<-probspace(M)
Q <- noorder(N)
With a sample size of 7, the probability of at least 2 "Ash Gray" and at least 1 "Ghost White" is:
Prob(Q,isin(Q,c("Ash Gray", rep(c("Ghost White", "Ash Gray"),1))))
[1] 0.8083333
and the probability of at least 3 "Ash Gray" and at least 2 "Ghost White" is:
Prob(Q,isin(Q,c("Ash Gray", rep(c("Ghost White", "Ash Gray"),2))))
[1] 0.1833333
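As a cross-check that does not depend on the prob package, the probabilities for the urn in the question (3 white, 4 blue, 3 black, drawing 3 without replacement) can be counted directly with choose() in base R. This is just a sanity check of the expected values; the variable names are made up for this example.

```r
total <- choose(10, 3)   # 120 equally likely unordered samples of size 3

# Exactly one ball of each colour
p_one_each <- choose(3, 1) * choose(4, 1) * choose(3, 1) / total

# Exactly two white and one black
p_2w_1k <- choose(3, 2) * choose(3, 1) / total

# At least one white and at least one black
# (what isin(N, c("White", "Black")) tests for)
p_w_and_k <- (choose(3, 1) * choose(4, 1) * choose(3, 1) +  # 1 white, 1 blue, 1 black
              choose(3, 2) * choose(3, 1) +                 # 2 white, 1 black
              choose(3, 1) * choose(3, 2)) / total          # 1 white, 2 black

p_one_each  # 0.3
p_2w_1k     # 0.075
p_w_and_k   # 0.45
```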

Tableau, orient color of datapoints on scatterplot

I am plotting two undesirable statistics
Columns: AGG(%SEP11)
Row: AGG(%Outdated_Defs)
This is how my graph looks. Only the points that have 50% or more installations of SEP 11 are red, even if they have high % of outdated defs.
I wish to make it such that sites with a high % of outdated defs are also red.
In other words, only the bottom left of the scatterplot should have green dots; the rest should have shades of red, with the top right quadrant having the deepest red dots.
Please help!
One option is to create a calculated field called bad_color:
IF AGG(%SEP11) >= 0.5 OR AGG(%Outdated_Defs) >= 0.5 THEN 1 ELSE 0 END
Then drag bad_color to the color field. Double-click on the color field and select red for 1 and green for 0.
