Is there a way to convert a list of cosine similarities to percentages? I've tried to wrap my brain around this, but I'm in great doubt.
Would it make sense to normalize the cosine values of the four documents like so:
Doc #1 0.9600
Doc #2 0.9300
Doc #3 0.8800
Doc #4 0.8500
Summing them all up
0.9600 + 0.9300 + 0.8800 + 0.8500 = 3.6200
And normalize them.
Doc #1 0.9600 / 3.6200 = 0.2652
Doc #2 0.9300 / 3.6200 = 0.2569
Doc #3 0.8800 / 3.6200 = 0.2431
Doc #4 0.8500 / 3.6200 = 0.2348
Or is there a more accepted way of displaying this?
I guess it depends on your use case, but in general I don't think there's much need to normalize cosine similarity scores: for non-negative vectors (e.g. term-frequency vectors) they already lie on a 0-to-1 scale. (In general, cosine similarity ranges from -1 to 1.)
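If percentages are what's wanted for display, the sum-normalization from the question vectorizes nicely; a minimal sketch in R (the document names are made up):

```r
# Cosine similarities for the four documents
sims <- c(doc1 = 0.96, doc2 = 0.93, doc3 = 0.88, doc4 = 0.85)

# Normalize so the shares sum to 100 and express as percentages
pct <- sims / sum(sims) * 100
round(pct, 2)
#>  doc1  doc2  doc3  doc4
#> 26.52 25.69 24.31 23.48
```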
I am using the nba_ht_wt dataset, which can be imported with readr from the URL http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv . The question I am trying to tackle is: "What percentage of players have a BMI over 25, which is considered 'overweight'?"
I already created a new variable in the table called highbmi, which corresponds to bmi > 25. This is my code, but the table is hard to read, how could I get a more concise and easier to read table?
nba_ht_wt = nba_ht_wt %>% mutate(highbmi = bmi>25)
tab = table(nba_ht_wt$highbmi, nba_ht_wt$Player)
100*prop.table(tab,1)
I am using R programming.
There is no variable called bmi in the data provided, so I will take a guess that it is calculated via the formula weight / height^2, where height is in meters.
data <- read.csv("http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv")
head(data)
Player Pos Height Weight Age
1 Nate Robinson G 69 180 29
2 Isaiah Thomas G 69 185 24
3 Phil Pressey G 71 175 22
4 Shane Larkin G 71 176 20
5 Ty Lawson G 71 195 25
6 John Lucas III G 71 157 30
I am no expert, but it looks to me like the height and weight columns have their names swapped for some reason.
So I will make this adjustment to calculate bmi:
data$bmi <- data$Height / (data$Weight / 100)^2
And now we can answer "What percentage of players have a BMI over 25, which is considered 'overweight'?" with a simple line of code:
mean(data$bmi > 25)
Multiply this number by 100 to get the answer as a percentage, so the answer will be 1.782178%.
Assuming the formula: weight (lb) / [height (in)]^2 * 703 (source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html), you could do:
library(data.table)
nba_ht_wt <- fread("http://users.stat.ufl.edu/%7Ewinner/data/nba_ht_wt.csv")
nba_ht_wt[, highbmi := (Weight / Height^2 * 703) > 25][
  , .(`% of Players` = round(.N / nrow(nba_ht_wt) * 100, 2)), by = "highbmi"][]
#> highbmi % of Players
#> 1: TRUE 45.35
#> 2: FALSE 54.65
... or plug the formula into the previous response for a base R solution.
This simple formula might not be really appropriate for basketball players, obviously.
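For reference, here is what that base R version could look like with the CDC imperial formula, taking the Height and Weight columns at face value (inches and pounds); the three hard-coded rows just mirror head(data) so the snippet runs standalone. On the full file this formula gives the 45.35% shown above.

```r
# First three rows of nba_ht_wt, hard-coded for a self-contained example
nba <- data.frame(
  Player = c("Nate Robinson", "Isaiah Thomas", "Phil Pressey"),
  Height = c(69, 69, 71),    # inches
  Weight = c(180, 185, 175)  # pounds
)

# CDC imperial BMI: weight (lb) / height (in)^2 * 703
nba$bmi <- nba$Weight / nba$Height^2 * 703

# Percentage of players with BMI over 25
round(mean(nba$bmi > 25) * 100, 2)
#> [1] 66.67
```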
I'm trying to extract some specific characters from a column that are formatted very differently, and I'm having some problems with the code. I'm using the following data frame:
details<-data.frame(details=c("MG/0,9 ML SOL. INY. JRP",
"MG CM REC",
"MG LIOFIL P/INF. IV FAM",
"MG/ 5ML SOL. INY",
"MG/ML SOL.ORAL FC 100-200ML"))
I'm trying to use the extract() function, but I don't know how to write the regex part:
extract(details,"details",c("detail_1","detail_2"),regex = ??)
I finally want to get the following two columns:
  detail_1   detail_2
1 MG/0,9 ML  SOL. INY. JRP
2 MG         CM REC
3 MG         LIOFIL P/INF. IV FAM
4 MG/ 5ML    SOL. INY
5 MG/ML      SOL.ORAL FC 100-200ML
Any help is highly appreciated. Thank you in advance!
Using extract we can do :
tidyr::extract(details, details, c("detail_1","detail_2"),
regex = '(.*(?:MG|ML)[^.$])(.*)')
#   detail_1   detail_2
#1  MG/0,9 ML  SOL. INY. JRP
#2  MG         CM REC
#3  MG         LIOFIL P/INF. IV FAM
#4  MG/ 5ML    SOL. INY
#5  MG/ML      SOL.ORAL FC 100-200ML
For detail_1 we capture everything up to and including the last "MG" or "ML" that is followed by one more (non-period) character, i.e. one that is not at the end of the string; detail_2 captures the remainder.
Another option using dplyr and stringr would be :
library(dplyr)
library(stringr)
details %>%
mutate(detail_1 = str_extract(details, ".*(MG|ML)[^.$]"),
detail_2 = str_remove(details, detail_1))
I have been doing some research on a solution to my problem, and I think it lies somewhere in daply. I need to take my data frame, split it by boat name, add 0.1 to the net # every time the activity changes, and then combine the data sets. My data frame looks like this:
Boat Net # Activity
Ray F 40 Lift
Dawn 67 Lift
Ray F 40 Set
Dawn 67 Set
Ray F 40 Lift
Ray F 40 Set
Ray F 40 Lift
Dawn 67 Lift
After I apply the functions, I need the frame to look like this. Essentially I am adding 0.1 to the net # each time Activity = Set, but the boats are independent of each other.
Boat Net # Activity
Ray F 40.0 Lift
Dawn 67.0 Lift
Ray F 40.1 Set
Dawn 67.1 Set
Ray F 40.1 Lift
Ray F 40.2 Set
Ray F 40.2 Lift
Dawn 67.1 Lift
I have been using this function to add 0.1 to the net # for every change in Activity, and it has worked really well, but it does not take the boat name into consideration.
df$`Net #` <- df$`Net #` + seq(0, 1, by = 0.1)[with(df, cumsum(c(TRUE, Activity[-1]!= Activity[-length(Activity)])))] + 1
Initially I tried to use split() and then apply the function, but that did nothing, so I switched to daply. I tried this and got the following error:
daply(df, df$Boat, .fun = df$`Net #` + seq(0, 1, by = 0.1)[with(df, cumsum(c(TRUE, Activity[-1]!= Activity[-length(Activity)])))] + 1)
Error in parse(text = x) : <text>:1:6: unexpected symbol
1: Dawn Marie
^
I think I am on the right path but any help would be great.
Using dplyr package and the %>% operator:
df <- df %>%
  group_by(Boat) %>%
  mutate(Net = Net + cumsum(Activity == "Set") * 0.1) %>%
  ungroup()
we have the answer:
Boat Net Activity
1 Ray F 40.0 Lift
2 Dawn 67.0 Lift
3 Ray F 40.1 Set
4 Dawn 67.1 Set
5 Ray F 40.1 Lift
6 Ray F 40.2 Set
7 Ray F 40.2 Lift
8 Dawn 67.1 Lift
The same code but without the %>% if you prefer:
df <- ungroup(mutate(group_by(df, Boat), Net = Net + cumsum(Activity == "Set") * 0.1))
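If you'd rather avoid dplyr altogether, ave() from base R can do the same grouped cumulative count; a sketch assuming the columns are named Boat, Net, and Activity as above:

```r
# Sample data from the question
df <- data.frame(
  Boat = c("Ray F", "Dawn", "Ray F", "Dawn", "Ray F", "Ray F", "Ray F", "Dawn"),
  Net = c(40, 67, 40, 67, 40, 40, 40, 67),
  Activity = c("Lift", "Lift", "Set", "Set", "Lift", "Set", "Lift", "Lift")
)

# Within each boat, count the "Set" rows seen so far and add 0.1 per Set
df$Net <- df$Net + ave(df$Activity == "Set", df$Boat, FUN = cumsum) * 0.1
df$Net
#> [1] 40.0 67.0 40.1 67.1 40.1 40.2 40.2 67.1
```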
So I've got some golf data that I'm messing with in R:
player rd hole shot distToPin distShot
E. Els 1 1 1 525 367.6
E. Els 1 1 2 157.4 130.8
E. Els 1 1 3 27.5 27.4
E. Els 1 1 4 1.2 1.2
E. Els 1 2 1 222 216.6
E. Els 1 2 2 6.8 6.6
E. Els 1 2 3 0.3 0.3
E. Els 2 1 1 378 244.4
E. Els 2 1 2 135.9 141.6
E. Els 2 1 3 6.7 6.9
E. Els 2 1 4 0.1 0.1
I'm trying to make an "efficiency" computation. Basically, I want to compute the following formula (which I made up, if you can't tell) by round:
E = hole yardage / (sum(distance of all shots) - hole yardage)
And ultimately, I want my results to look like this:
player   rd  efficiency
E. Els    1      205.25
E. Els    2       25.2
That efficiency column is the averaged result of the efficiency for each hole over the entire round. The issue that I'm having is I can't quite figure out how to do such a complex calculation using dplyr::summarize():
efficiency <- df %>%
group_by(player, rd) %>%
summarize(efficiency = (sum(distShot) - distToPin))
But the problem with that particular script is that it returns the error:
Error: expecting a single value
I think my problem is that were it to run, it wouldn't be able to tell WHICH distToPin to subtract, and the one I want is obviously the first distToPin of each hole, i.e. the actual hole length (unfortunately, I don't have a column of just "hole yardage"). I want to pull that first distToPin of each hole out and use it within my summarize() arithmetic. Is this even possible?
I'm guessing that there is a way to do these types of complex, multi-step calculations within the summarize function, But maybe there's not! Any ideas or advice?
You seem to be missing some steps. Here is a deliberately labored version to show them, using dplyr. It assumes that your data frame is named golfdf. Note that the per-round efficiency is the mean of the per-hole efficiencies, which is what reproduces your expected output:
golfdf %>%
  group_by(player, rd, hole) %>%
  summarise(hole.length = first(distToPin), shots.length = sum(distShot)) %>%
  group_by(player, rd) %>%
  summarise(efficiency = mean(hole.length / (shots.length - hole.length)))
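Hard-coding the sample rows makes it easy to check a pipeline against the asker's expected output; the per-hole average is what reproduces 205.25 and 25.2 (a sketch, assuming the column names from the question):

```r
library(dplyr)

# Sample data from the question
golfdf <- data.frame(
  player = "E. Els",
  rd = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  hole = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1),
  shot = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  distToPin = c(525, 157.4, 27.5, 1.2, 222, 6.8, 0.3, 378, 135.9, 6.7, 0.1),
  distShot = c(367.6, 130.8, 27.4, 1.2, 216.6, 6.6, 0.3, 244.4, 141.6, 6.9, 0.1)
)

eff <- golfdf %>%
  group_by(player, rd, hole) %>%
  summarise(hole.length = first(distToPin),   # first distToPin = hole yardage
            shots.length = sum(distShot), .groups = "drop") %>%
  group_by(player, rd) %>%
  summarise(efficiency = mean(hole.length / (shots.length - hole.length)),
            .groups = "drop")
```

On this sample, eff$efficiency comes out to 205.25 for round 1 and 25.2 for round 2.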
I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:
MARKER GROUP1_A1 GROUP1_A2 GROUP1_FREQ GROUP1_N GROUP2_A1 GROUP2_A2 GROUP2_FREQ GROUP2_N
rs10 A C 0.055 1232 A C 0.055 3221
rs1000 A G 0.208 1232 A G 0.208 3221
rs10000 G C 0.134 1232 C G 0.8624 3221
rs10001 C A 0.229 1232 A C 0.775 3221
I would like to create a weighted average of the frequency (FREQ) variable (which in itself is straightforward); however, in this case some of the rows are mismatched (rows 3 & 4). If the letters do not line up, then the frequency of the second group needs to be subtracted from 1 before the weighted mean of that marker is calculated.
I would like to set up a simple IF statement, but I am unsure of the syntax of such a task.
Any insight or direction is appreciated!
Say you've read your data into a data frame called mydata. Then do the following:
mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)
It works because R treats TRUE values as 1 and FALSE values as 0.
EDIT: The line above can produce negative values for mismatched rows; try the following instead:
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
as.character(mydata$GROUP2_A1)) -
as.numeric(mydata$GROUP2_FREQ) )
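Once the frequencies are aligned, the weighted mean itself can use the N columns as weights; a sketch with two toy rows mirroring the data above (rs10 matched, rs10000 mismatched):

```r
# Toy rows mirroring the question's data
mydata <- data.frame(
  MARKER = c("rs10", "rs10000"),
  GROUP1_A1 = c("A", "G"), GROUP1_FREQ = c(0.055, 0.134), GROUP1_N = c(1232, 1232),
  GROUP2_A1 = c("A", "C"), GROUP2_FREQ = c(0.055, 0.8624), GROUP2_N = c(3221, 3221)
)

# Flip GROUP2_FREQ where the A1 alleles disagree: |mismatch - freq|
mydata$GROUP2_FREQ <- abs((as.character(mydata$GROUP1_A1) !=
                           as.character(mydata$GROUP2_A1)) -
                          mydata$GROUP2_FREQ)

# Sample-size-weighted mean frequency per marker
mydata$WFREQ <- with(mydata,
  (GROUP1_FREQ * GROUP1_N + GROUP2_FREQ * GROUP2_N) / (GROUP1_N + GROUP2_N))
```

For rs10 the alleles match, so the weighted mean is just 0.055; for rs10000 the flipped frequency 1 - 0.8624 = 0.1376 is averaged with 0.134.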