Calculate win/loss ratio - R

I want to create a win/loss ratio from a dataset that has a series of winning and losing team IDs. The dataset looks something like this:
WTeamID LTeamID
11 12
12 13
11 13
I'm trying to get a dataset that would look like this:
TeamID WLRatio
11 1.0
12 0.5
13 0.0

A straightforward way is to divide each team's count in the winner column by its count of appearances across both columns:
res <- table(factor(df$WTeamID, levels = unique(unlist(df)))) / table(factor(unlist(df)))
as.data.frame(res)
Var1 Freq
1 11 1.0
2 12 0.5
3 13 0.0
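For reference, a tidyverse version of the same calculation might look like the sketch below (this assumes the winners/losers table is called df and that dplyr and tidyr are installed): reshape to long form, then take each team's proportion of rows in which it appears as the winner:
library(dplyr)
library(tidyr)
df <- data.frame(WTeamID = c(11, 12, 11), LTeamID = c(12, 13, 13))
df %>%
  pivot_longer(everything(), names_to = "result", values_to = "TeamID") %>%
  group_by(TeamID) %>%
  summarise(WLRatio = mean(result == "WTeamID"))
# should give TeamID 11, 12, 13 with WLRatio 1.0, 0.5, 0.0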

Related

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution with age groups 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below (variable names can differ as you wish).
Could I use binCounts for this? If yes, could you help me out with the code? I'm not sure what bx and idxs are in binCounts(x, idxs = NULL, bx, right = FALSE).
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or even the package it comes from, but here is a plain base R approach:
data.frame(table(cut(Ages, 0:7 * 9 + 37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit <- c(37, 46, 55, 64, 73, 82, 91, 101)
Labels <- paste(head(lowerlimit, -1) + 1, lowerlimit[-1], sep = "-")  # add one to the lower limits to get 38, 47, etc.
group <- cut(Ages, lowerlimit, Labels)  # determine which group each age belongs to
tab <- table(group)                     # form a frequency table
as.data.frame(tab)                      # turn the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages, s <- 0:7 * 9 + 37, paste(head(s + 1, -1), s[-1], sep = "-"))))
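If you do want binCounts(), it comes from the matrixStats package if I remember correctly: bx takes the bin boundaries (analogous to breaks in cut()), idxs can be left at its default to use all of x, and right = TRUE makes the bins right-closed like cut()'s default. A sketch using the same boundaries as above:
library(matrixStats)
binCounts(Ages, bx = 0:7 * 9 + 37, right = TRUE)
# should give the counts 3 7 7 14 10 6 3 (unnamed, one per bin)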

Imputation for longitudinal data using observation before and after missing data

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I've been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep arriving at force me to use conditional formatting based on the rows immediately above and below a missing value, and, quite frankly, I'm at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to fill each single NA in ss with the mean of the value before and after it, within each id:
ID 1: the NA sits between 3 and 0, so it should become 1.5, giving 1, 3, 2, 3, 1.5, 0, 0.
ID 2: should look like this: 2, 4, 0, 0, 0, 0, 0.
ID 3: should use a last-observation-carried-forward approach, since the NA is the final observation, so it would need to look like this: 4, 1, 2, 4, 2, 3, 3.
ID 4: has two consecutive NA values and should not be changed; it will be flagged for a different analysis later in my project, so it stays 2, 1, 0, NA, NA, 0, 0 (no change).
I use a package, smwrBase; the syntax for filling in only one missing value is below, but it doesn't account for id:
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, though it has the same issue:
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches don't like to fill in the last value, so I added a manual if statement for that. A bit brute force, as there might be a tapply approach out there.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss, ss2 = NA_real_)
for (i in unique(id)) {
  # interpolate single-value gaps within this id
  mydat$ss2[mydat$id == i] <- zoo::na.approx(ss[mydat$id == i], maxgap = 1, na.rm = FALSE)
  # if the gap is the last value of the id, carry the previous value forward
  last <- length(mydat$ss2[mydat$id == i])
  if (is.na(mydat$ss2[mydat$id == i][last])) {
    mydat$ss2[mydat$id == i][last] <- mydat$ss2[mydat$id == i][last - 1]
  }
}
mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value for id=1 is 1.5 (the average of 3 and 0), for id=2 it is 0 (the average of 0 and 0), and for id=3 it is 3 (the preceding value, since there is no following value).
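For comparison, a grouped dplyr version of the same idea might look like this sketch (assuming dplyr and zoo are installed; the last-value carry-forward is handled with lag() on the final row of each id):
library(dplyr)
mydat %>%
  group_by(id) %>%
  mutate(ss2 = zoo::na.approx(ss, maxgap = 1, na.rm = FALSE),
         # fill a trailing NA with the previous value within the id
         ss2 = ifelse(row_number() == n() & is.na(ss2), lag(ss2), ss2)) %>%
  ungroup()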

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I post my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell, and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs):
rank <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID <- rep(seq.int(1, 2), times = 5)
df <- data.frame(rank, Esvalue, zoneID)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value: for each rank, including all lower ranks, the cumulative share of the total Esvalue relative to the Esvalue of all zones:
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone: for each rank, including all lower ranks, the cumulative share of the Esvalue relative to the total Esvalue of a given zoneID, for each zone. I tried this using mutate in dplyr, but so far it only gives me the cumulative sum, not the share. In the end this will generate a variable for each zoneID:
df %>%
mutate(cA=cumsum(ifelse(!is.na(zoneID) & zoneID==1,Esvalue,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(zoneID) & zoneID==2,Esvalue,0)))
These two variables I want to combine by:
1) calculating the absolute difference between the two for each zoneID, and
2) for each rank, calculating the mean of those absolute differences over all zoneIDs.
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
I created the last column above using some intermediate steps in Excel, but my final dataset will be far too big for Excel to handle. Any advice on how to proceed would be appreciated.
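In case it helps, here is a base R sketch of the calculation as described (overall cumulative share, per-zone cumulative share, then the mean absolute difference per rank); it appears to reproduce the mean_abs_diff values above, though with 40,000 zoneIDs the intermediate zone-by-rank matrix could get large:
df <- df[order(df$rank), ]
total_share <- cumsum(df$Esvalue) / sum(df$Esvalue)
# one cumulative-share column per zone; rows belonging to other zones contribute 0
zone_share <- sapply(unique(df$zoneID), function(z) {
  v <- ifelse(df$zoneID == z, df$Esvalue, 0)
  cumsum(v) / sum(v)
})
# mean over zones of the absolute difference to the overall share, per rank
df$mean_abs_diff <- rowMeans(abs(zone_share - total_share))
df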

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info data frame. Year contains the period over which each life table was taken, as an interval. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals; thankfully none of them overlap. I want to simplify the intervals so that intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to take the modulo of each interval's endpoints and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate this nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           bottom[findInterval(year2, bottom) + 1] - 1))
View(new_df)
So what this does is create 5-year bins and output a new column (new_year) labelled by the bin that each interval's end year falls into. So everything ending in 1750-1754 will correspond to a new value of 1750-1754 (in string form; the original is an integer type, and I'm not sure how to fix that). Does this do what you want? Double-check the results, but it looks right to me.
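A minimal, self-contained illustration of the binning step on a few made-up interval strings (stand-ins for the Year column), in case you want to check the logic without loading the package data:
years <- c("1751-1754", "1838-1839", "1991-1994", "2005-2007")
bottom <- seq(1750, 2010, 5)
end_year <- as.numeric(substr(years, 6, 9))
paste0(bottom[findInterval(end_year, bottom)], "-",
       bottom[findInterval(end_year, bottom) + 1] - 1)
# "1750-1754" "1835-1839" "1990-1994" "2005-2009"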

Count values in a data set that exceed a threshold in R

I have 2 data sets. The first data set has a vector of p-values from 0.5 to 0.001 and the corresponding threshold that meets each p-value. For example, for 0.05 the threshold is 13: any value greater than 13 has a p-value of <0.05. This data set contains all the thresholds I'm interested in, like so:
V1 V2
1 0.500 10
2 0.200 11
3 0.100 12
4 0.050 13
5 0.010 14
6 0.001 15
The 2nd data set is just one long list of values. I need to write an R script that counts the number of values in this set that exceed each threshold. For example, count how many values in the 2nd data set exceed 13 and therefore have a p-value of <0.05, and do this for each threshold value.
Here are the first 15 values of the 2nd data set (1000 total):
1 11.100816
2 8.779858
3 10.510090
4 9.503772
5 9.392222
6 10.285920
7 8.317523
8 10.007738
9 11.021283
10 9.964725
11 9.081947
12 11.253643
13 10.896120
14 10.272814
15 10.282408
A function which will help you:
length(which(data$V1 > 3 & data$V2 < 0.05))
Assuming dat1 and dat2 both have a V2 column, something like this:
colSums(outer(dat2$V2, setNames(dat1$V2, dat1$V2), ">"))
# 10 11 12 13 14 15
# 9 3 0 0 0 0
(reads as follows: 9 items have a value greater than 10, 3 items have a value greater than 11, etc.)
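An equivalent, perhaps more readable, way to get the same counts (a sketch, again assuming dat1 holds the thresholds in V2 and dat2 holds the values in V2):
sapply(setNames(dat1$V2, dat1$V2), function(th) sum(dat2$V2 > th))
# 10 11 12 13 14 15
#  9  3  0  0  0  0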
