I've been racking my brain all morning over how to do this.
So let's say this is my data set:
set.seed(1)
temp <- as.data.frame(cbind(Key = letters[1:5], sapply(1:12, function(x) sample(c(0, 1), 5, replace = T))))
names(temp)[2:13] <- month.abb
temp
# Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 a 0 1 0 0 1 0 0 1 1 1 0 0
# 2 b 0 1 0 1 0 0 1 1 1 0 1 0
# 3 c 1 1 1 1 1 0 0 0 1 0 0 1
# 4 d 1 1 0 0 0 1 0 1 1 1 0 1
# 5 e 0 0 1 1 0 0 1 0 1 1 0 0
What I'm trying to do is to calculate the percentage of occurrences (1s) in two consecutive months.
For example, c and d had an occurrence in Jan. Both had an occurrence in Feb too, so the output will be 1 for this month. In Feb, a-d had occurrences but only c also had an occurrence in Mar, so the output will be .25 for that month, etc.
Desired output for that mini example:
data.frame(Month = month.abb[1:11], OverlapPercent = c(1, 1/4, 1, 1/3, 0, 0, 1/2, 1, 3/5, 0, 0))
# Month OverlapPercent
# 1 Jan 1.0000000
# 2 Feb 0.2500000
# 3 Mar 1.0000000
# 4 Apr 0.3333333
# 5 May 0.0000000
# 6 Jun 0.0000000
# 7 Jul 0.5000000
# 8 Aug 1.0000000
# 9 Sep 0.6000000
# 10 Oct 0.0000000
# 11 Nov 0.0000000
I was thinking of somehow using rle for it, but I'm not sure how to force it to stop at two occurrences each time.
Unless I'm missing something, the following looks valid:
#just to remove 'factor's from "temp"
tmp = do.call(cbind.data.frame, c(temp[1], lapply(temp[-1], function(x) as.numeric(as.character(x)))))
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(tmp[[i]] & tmp[[i+1]]) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
EDIT:
Out of curiosity I checked the speed of @Bathsheba's "bitwise AND", and it seems to be faster than the "logical AND":
#identical results
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(bitwAnd(tmp[[i]], tmp[[i+1]])) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
#twice as fast
x1 = sample(0:1, 1e6, T); x2 = sample(0:1, 1e6, T)
identical(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1))
#[1] TRUE
library(microbenchmark)
microbenchmark(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1), times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# sum(x1 & x2)/sum(x1) 23.95648 25.32448 25.78471 26.56232 49.18491 50
# sum(bitwAnd(x1, x2))/sum(x1) 10.97982 11.07309 11.20237 13.00450 35.67963 50
First fix up temp so that the 0/1 columns are numeric rather than factor. Then apply overlap to each pair of columns:
temp[-1] <- lapply(temp[-1], function(x) as.numeric(as.character(x)))
overlap <- function(x, y) mean(y[x == 1])
data.frame(Month = month.abb[-12],
Overlap = sapply(2:12, function(i) overlap(temp[,i], temp[,i+1])))
The above is preferred as it keeps the independent parts of the solution separate; however, as an alternative we could omit the first line above (which fixes up the factors) and instead incorporate that into overlap like this:
overlap <- function(x, y) mean(as.numeric(as.character(y))[x == 1])
Note that the Overlaps are fractions (as per the output shown in the question) and not percents as the heading in the question suggests.
In pseudocode, represent each column as a binary number.
E.g. Jan = 0b00110 and Feb = 0b11110.
Your formula for Jan is then
Bitcount(Jan AND Feb) / Bitcount(Jan)
Where AND is the bitwise AND operator and Bitcount counts the number of 1 bits in the number. (I can supply a way of bit counting if you need it). Of course, the formula for other months is a trivial generalisation.
Obviously you'll need a branch for the denominator being zero: not well defined in your question.
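The pseudocode above could be sketched in R roughly as follows. This is only an illustration under some assumptions: `pack_bits`, `bit_count` and `overlap_bits` are hypothetical helper names, each month column is taken to be a 0/1 numeric vector with at most 31 rows (so that it fits in one R integer), and NA is returned when the denominator is zero, which is just one possible choice since the question does not define that case.

```r
# Pack a 0/1 vector into a single integer (first element = highest bit).
pack_bits <- function(x) {
  stopifnot(length(x) <= 31)
  as.integer(sum(x * 2^(rev(seq_along(x)) - 1)))
}

# Count the 1 bits of a non-negative integer.
bit_count <- function(n) sum(as.integer(intToBits(n)))

# Bitcount(this AND next) / Bitcount(this); NA if 'this' has no 1s.
overlap_bits <- function(this, nxt) {
  a <- pack_bits(this)
  b <- pack_bits(nxt)
  denom <- bit_count(a)
  if (denom == 0) return(NA_real_)
  bit_count(bitwAnd(a, b)) / denom
}

jan <- c(0, 0, 1, 1, 0)  # 0b00110
feb <- c(1, 1, 1, 1, 0)  # 0b11110
overlap_bits(jan, feb)
# [1] 1
```

For data with more than 31 rows, the vector-based `sum(bitwAnd(x, y)) / sum(x)` form is likely the more practical one, since it needs no packing step.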
length(which(!xor(data["Feb"],data["Mar"]) & data["Feb"]==1)) / length(which(data["Feb"]==1))
!xor is the negated exclusive or.
length(which(...)) gives the number of true values in a logical vector.
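As a small check of the logic: on 0/1 vectors the condition `!xor(x, y) & x == 1` is TRUE exactly where both values are 1, so it should select the same rows as a plain `x & y`. The vectors `feb` and `mar` below are copied by hand from the example data, and the variable names are only for illustration.

```r
feb <- c(1, 1, 1, 1, 0)
mar <- c(0, 0, 1, 0, 1)

both_xor <- !xor(feb, mar) & feb == 1  # negated exclusive or, restricted to Feb == 1
both_and <- feb & mar                  # plain logical AND

identical(both_xor, both_and)
# [1] TRUE
sum(both_xor) / sum(feb == 1)
# [1] 0.25
```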
If I have a data frame in R with two UK postcode fields (both in upper case), is there an easy, convenient way to define a score that is based on some kind of fuzzy comparison between these two fields?
Have done some googling but everything I found was some kind of "fuzzy join", and I don't need the join bit here.
So for example if I had:
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"))
then I might like to do something like
my_df <- my_df |>
mutate(score = fuzzy_string_compare_thingy(postcode_1, postcode_2))
to give me (for example):
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"),
score = c(1, 0.1, 0.9, 0.7))
(the values in the score field are made up of course)
Choose your desired distance method using e.g. stringdist
library(stringdist)
cbind(my_df,
sapply(c("osa", "lv", "dl", "hamming", "lcs", "qgram",
"cosine", "jaccard", "jw", "soundex" ), function(m)
apply(my_df, 1, function(x) stringdist(x[1], x[2], method=m))))
postcode_1 postcode_2 osa lv dl hamming lcs qgram cosine jaccard
1 AB1 1AB AB1 1AB 0 0 0 0 0 0 0.0000000 0.0000000
2 DN17 2DF EC1X 3WW 6 6 6 6 12 12 0.8000000 0.8333333
3 TN38 8LE TN38 9LE 1 1 1 1 2 2 0.1055728 0.1250000
4 FK1 2ZZ FK2 1ZZ 2 2 2 2 4 0 0.0000000 0.0000000
jw soundex
1 0.00000000 0
2 0.50000000 1
3 0.08333333 0
4 0.04761905 0
Using "jw" and converting the distance into a similarity score (so that 1 means identical), with dplyr:
library(dplyr)
my_df %>%
mutate(score_1 = 1 - stringdist(postcode_1, postcode_2, method="jw"))
postcode_1 postcode_2 score_1
1 AB1 1AB AB1 1AB 1.0000000
2 DN17 2DF EC1X 3WW 0.5000000
3 TN38 8LE TN38 9LE 0.9166667
4 FK1 2ZZ FK2 1ZZ 0.9523810
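If installing stringdist is not an option, a rough equivalent could be sketched with base R's `adist()` (generalized Levenshtein distance, from the utils package). Note that this swaps in a different metric from the "jw" method above, `lev_similarity` is a hypothetical helper name, and dividing by the longer string's length is just one possible way of scaling the score into the range 0 to 1.

```r
lev_similarity <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  # adist() returns a full distance matrix, so compare pair by pair instead.
  d <- mapply(function(s, t) utils::adist(s, t)[1, 1], x, y, USE.NAMES = FALSE)
  1 - d / pmax(nchar(x), nchar(y))
}

my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
                    postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"))
my_df$score <- lev_similarity(my_df$postcode_1, my_df$postcode_2)
my_df
```

With the "lv" distances shown in the table above (0, 6, 1, 2), this gives scores of 1, 0.25, 0.875 and about 0.714.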
I am trying to calculate row means to create an average variable from 3 assessment points. I want to include cases that have 2 or 3 measurement points, but not those that have only one.
For example,
> a <- c(1,0,NA,1,NA,0,1,0,NA,0,NA)
> b <- c(1,0,NA,1,0,1,1,1,NA,0,1)
> c <- c(1,NA,NA,0,NA,0,1,1,1,0,0)
> mydata <- data.frame(a,b,c)
> mydata$M <- rowMeans(subset(mydata, select = c(1:3)), na.rm = TRUE)
> mydata$M
The current output produces a list of means for all the rows except one, which had 3 NAs:
[1] 1.00 0.00 NaN 0.66 0.00 0.33 1.00 0.66 1.00 0.00 0.50
However, my desired output would be:
[1] 1.00 0.00 NaN 0.66 NaN 0.33 1.00 0.66 NaN 0.00 0.50
Such that only rows with at least two data points would be used to calculate the mean, rather than returning the single data point as the row mean.
This is a complicated rule and I'm not sure how to define it. Any help would be appreciated.
(This is a data set with several thousand rows, so doing it manually is unthinkable!)
Thank you!
Sophie
You could make a function that applies a mean to a row based on some condition. In your example, if there are two or more valid measurements, calculate mean.
a <- c(1,0,NA,1,NA,0,1,0,NA,0,NA)
b <- c(1,0,NA,1,0,1,1,1,NA,0,1)
c <- c(1,NA,NA,0,NA,0,1,1,1,0,0)
mydata <- data.frame(a,b,c)
Functions are best read from the inside out. This one takes a vector x and counts how many of its values are not NA: is.na(x) gives TRUE/FALSE, ! negates it, and sum() treats TRUE as 1 and FALSE as 0. It then tests whether more than 1 (so 2 or more) values are not NA, and only in that case computes the mean.
conditionalMean <- function(x) {
if (sum(!is.na(x)) > 1) {
mean(x, na.rm = TRUE)
} else {
NA
}
}
We apply this function to your data.frame row-wise, as denoted by MARGIN = 1. If you had a function that worked column-wise, you would use MARGIN = 2. You can try it out. Compare apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE) and colMeans(mydata, na.rm = TRUE).
apply(mydata, MARGIN = 1, FUN = conditionalMean)
[1] 1.0000000 0.0000000 NA 0.6666667 NA 0.3333333 1.0000000
[8] 0.6666667 NA 0.0000000 0.5000000
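For several thousand rows, a vectorised variant may be faster than calling a function once per row. The sketch below assumes the same `mydata` as above; `row_mean_min` and its `min_valid` argument are hypothetical names introduced for illustration.

```r
row_mean_min <- function(df, min_valid = 2) {
  n_valid <- rowSums(!is.na(df))    # number of non-missing values per row
  m <- rowMeans(df, na.rm = TRUE)   # mean of whatever is present
  m[n_valid < min_valid] <- NA      # blank out rows with too few values
  m
}

a <- c(1,0,NA,1,NA,0,1,0,NA,0,NA)
b <- c(1,0,NA,1,0,1,1,1,NA,0,1)
c <- c(1,NA,NA,0,NA,0,1,1,1,0,0)
mydata <- data.frame(a, b, c)
mydata$M <- row_mean_min(mydata[1:3])
mydata$M
```

Rows 3, 5 and 9 come out as NA, matching the result of the apply() version.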
You can try something like this with dplyr:
library(dplyr)
mydata %>%
mutate(row_mean = ifelse((is.na(a)+is.na(b)+is.na(c)) %in% 2:3, NaN,
rowMeans(.[1:3], na.rm = TRUE)))
Result:
a b c row_mean
1 1 1 1 1.0000000
2 0 0 NA 0.0000000
3 NA NA NA NaN
4 1 1 0 0.6666667
5 NA 0 NA NaN
6 0 1 0 0.3333333
7 1 1 1 1.0000000
8 0 1 1 0.6666667
9 NA NA 1 NaN
10 0 0 0 0.0000000
11 NA 1 0 0.5000000
I have a vector made of 0 and non-zero numbers. I would like to know the length and starting-position of each of the non-zero number series:
a = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 2.6301334, 1.8372030, 0.0000000, 0.0000000, 0.0000000, 1.5632647, 1.1433757, 0.0000000, 1.5412216, 0.8762267, 0.0000000, 1.3087967, 0.0000000, 0.0000000, 0.0000000)
Based on a previous post, it is easy to find the starting positions of the non-zero regions:
Finding the index of first changes in the elements of a vector in R
c(1,1+which(diff(a)!=0))
However I cannot seem to configure a way of finding the length of these regions....
I have tried the following:
dif=diff(which(a==0))
dif_corrected=dif-1 # to correct for the added lengths
row=rbind(position=seq(length(a)), length=c(1, dif_corrected))
position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
length 1 0 0 0 0 2 0 0 2 2 1 0 0 1 0
NOTE: not all columns are displayed ( there are actually 20)
Then I subset this to take away 0 values:
> row[,-which(row[2,]==0)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
position 1 6 9 10 11 14 19
length 1 2 2 2 1 1 2
This seems like a decent way of coming up with the positions and lengths of each non-zero series in the series, but it is incorrect:
The position 9 (identified as the start of a non-zero series) is a 0 and instead 10 and 11 are non-zero so I would expect the position 10 and a length of 2 to appear here....
The only result that is correct is position 6 which is the start of the first non-zero series- it is correctly identified as having a length of 2- all other positions are incorrect.
Can anyone tell me how to index correctly to identify the starting-position of each of the non-zero series and the corresponding lengths?
NOTE: I only did this in R because of the usefulness of the which command, but it would also be good to know how to do this in numpy and create a dictionary of positions and length values.
It seems like rle could be useful here.
# a slightly simpler vector
a <- c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1)
# runs of zero and non-zero elements
r <- rle(a != 0)
# lengths of non-zero elements
r$lengths[r$values]
# [1] 2 3 1
# start of non-zero runs
cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
# [1] 3 6 12
This also works on vectors with only 0 or non-0, and does not depend on whether or not the vector starts/ends with 0 or non-0. E.g.:
a <- c(1, 1)
a <- c(0, 0)
a <- c(1, 1, 0, 1, 1)
a <- c(0, 0, 1, 1, 0, 0)
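The same steps could be wrapped in a small helper and tried on those edge cases. This is a minimal sketch; `nonzero_runs` is a hypothetical name, and returning a data frame is just one convenient choice.

```r
nonzero_runs <- function(x) {
  r <- rle(x != 0)           # runs of zero (FALSE) and non-zero (TRUE)
  ends <- cumsum(r$lengths)  # last index of every run
  data.frame(start  = (ends - r$lengths + 1)[r$values],
             length = r$lengths[r$values])
}

nonzero_runs(c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1))
#   start length
# 1     3      2
# 2     6      3
# 3    12      1
nonzero_runs(c(1, 1))  # one run: start 1, length 2
nonzero_runs(c(0, 0))  # no non-zero runs: a data frame with zero rows
```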
A possible data.table alternative, using rleid to create groups and .I to get the start index and calculate the length.
library(data.table)
d <- data.table(a)
d[ , .(start = min(.I), len = max(.I) - min(.I) + 1, nonzero = (a != 0)[1]),
by = .(run = rleid(a != 0))]
# run start len nonzero
# 1: 1 1 2 FALSE
# 2: 2 3 2 TRUE
# 3: 3 5 1 FALSE
# 4: 4 6 3 TRUE
# 5: 5 9 3 FALSE
# 6: 6 12 1 TRUE
If desired, the runs can then easily be sliced by the 'nonzero' column.
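For numpy this is a parallel method to @Maple's (with a fix for arrays ending with a nonzero):
import numpy as np

def subSeries(a):
    # 1 where the element is non-zero, 0 otherwise
    d = np.logical_not(np.isclose(a, np.zeros_like(a))).astype(int)
    # pad with zeros so runs touching either end are detected
    edges = np.diff(np.r_[0, d, 0])
    starts = np.where(edges == 1)[0]   # 0-based start index of each run
    ends = np.where(edges == -1)[0]    # 0-based index just past each run
    return np.c_[starts, ends - starts]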
Definition:
sublistLen = function(list) {
z_list <- c(0, list, 0)
ids_start <- which(diff(z_list != 0) == 1)
ids_end <- which(diff(z_list != 0) == - 1)
lengths <- ids_end - ids_start
return(
list(
'ids_start' = ids_start,
'ids_end' = ids_end - 1,
'lengths' = lengths)
)
}
Example:
> a <- c(-2,0,0,12,5,0,124,0,0,0,0,4,48,24,12,2,0,9,1)
> sublistLen(a)
$ids_start
[1] 1 4 7 12 18
$ids_end
[1] 1 5 7 16 19
$lengths
[1] 1 2 1 5 2
Calculate the Mean and STD sub-setting a dynamic set of columns.
To show it as an example:
sales <- data.frame(ItemID=c("1A","1B","1C"),
Jul=c(0,1,5),
Aug=c(1,2,6),
Sep=c(0,3,7),
Oct=c(1,4,8),
Nov=c(1,4,8),
Dec=c(1,4,8),
Jan=c(1,4,8),
Nmon=c(7,4,6))
The above test data gives the table below. What I need is to apply functions to each row based on the value in the "Nmon" column.
ItemID Jul Aug Sep Oct Nov Dec Jan Nmon
1 1A 0 1 0 1 1 1 1 7
2 1B 1 2 3 4 4 4 4 4
3 1C 5 6 7 8 8 8 8 6
E.g. the first record has an Nmon value of 7, so I need to calculate the mean and standard deviation of all the values from Jul to Jan (mean = 0.71, std = 0.49).
For the second record, where the Nmon value is 4, the mean and standard deviation should only be calculated for the months from Oct to Jan (mean = 4, std = 0).
The number of month columns may increase or decrease, but the first (ItemID) and last (Nmon) columns will remain the same.
I have a large data set of items and need an efficient way to do these calculations.
Perhaps this helps
t(apply(sales[-1], 1, function(x) {i1 <- length(x)
x2 <- x[(i1 -x[i1]):(i1-1)]
c(mean = mean(x2), sd = sd(x2))}))
# mean sd
#[1,] 0.7142857 0.48795
#[2,] 4.0000000 0.00000
#[3,] 7.5000000 0.83666
Here is another solution with base R:
sales <- data.frame(ItemID=c("1A","1B","1C"),
Jul=c(0,1,5),
Aug=c(1,2,6),
Sep=c(0,3,7),
Oct=c(1,4,8),
Nov=c(1,4,8),
Dec=c(1,4,8),
Jan=c(1,4,8),
Nmon=c(7,4,6))
my.m.sd <- function(i) {
n <- sales$Nmon[i]
x <- unlist(sales[i, seq(to=8, length.out = n)])
c(m=mean(x), s=sd(x))
}
sapply(1:3, my.m.sd)
# > sapply(1:3, my.m.sd)
# [,1] [,2] [,3]
# m 0.7142857 4 7.50000
# s 0.4879500 0 0.83666
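Since the number of month columns may change, a variant that does not hard-code the column positions could look like the sketch below. It assumes, as stated in the question, that the first column is the item ID and the last column is Nmon; `last_n_stats`, `months` and `stats` are hypothetical names used only for illustration.

```r
sales <- data.frame(ItemID = c("1A", "1B", "1C"),
                    Jul = c(0, 1, 5), Aug = c(1, 2, 6), Sep = c(0, 3, 7),
                    Oct = c(1, 4, 8), Nov = c(1, 4, 8), Dec = c(1, 4, 8),
                    Jan = c(1, 4, 8), Nmon = c(7, 4, 6))

# Month columns are everything between the first (ItemID) and last (Nmon) column.
months <- as.matrix(sales[, -c(1, ncol(sales))])

last_n_stats <- function(values, n) {
  x <- tail(values, n)  # keep only the last n months
  c(mean = mean(x), sd = sd(x))
}

# One row of statistics per item, using that item's own Nmon.
stats <- t(mapply(function(i, n) last_n_stats(months[i, ], n),
                  seq_len(nrow(months)), sales$Nmon))
cbind(sales["ItemID"], stats)
```

Using tail() keeps the "last n months" logic in one place, so adding or removing month columns should not require changing the code.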
I have a data frame like the one below, but with a lot more rows
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
From df I want a data frame where the rows are the unique values of df$x2 and col1 is the proportion of 1s associated with each letter, and col2 is the count of each letter. So, my output would be
> getprops(df)
prop count
a .6666 3
b 0 1
c 0.5 2
I can think of some elaborate, dirty ways to do this, but I'm looking for something short and efficient. Thanks
I like @RicardoSaporta's solution (+1), but you can use ?prop.table as well:
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
> tab <- table(df$x2, df$x1)
> tab
0 1
a 1 2
b 1 0
c 1 1
> ptab <- prop.table(tab, margin=1)
> ptab
0 1
a 0.3333333 0.6666667
b 1.0000000 0.0000000
c 0.5000000 0.5000000
> dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
> dframe
values prop count
a a 0.6666667 3
b b 0.0000000 1
c c 0.5000000 2
If you'd like, you can put this together into a single function:
getprops <- function(values, indicator){
  tab <- table(values, indicator)
  ptab <- prop.table(tab, margin=1)
  # rowSums(tab) counts all rows per value, not just the rows where indicator is 1
  dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
  return(dframe)
}
> getprops(values=df$x2, indicator=df$x1)
values prop count
a a 0.6666667 3
b b 0.0000000 1
c c 0.5000000 2
Try installing plyr and running
library(plyr)
df <- data.frame(x1=c(1, 1, 0, 0, 1, 0),
label=c("a", "a", "b", "a", "c", "c"))
ddply(df, .(label), summarize, prop = mean(x1), count = length(x1))
# label prop count
# 1 a 0.6666667 3
# 2 b 0.0000000 1
# 3 c 0.5000000 2
which under the hood applies a split/apply/combine method similar to this in base R:
do.call(rbind, lapply(split(df, df$label),
                      with, list(prop = mean(x1),
                                 count = length(x1))))
Here is a one-liner in data.table:
> DT[, list(props=sum(x1) / .N, count=.N), by=x2]
x2 props count
1: a 0.6666667 3
2: b 0.0000000 1
3: c 0.5000000 2
where DT <- data.table(df)
I am not sure if this does what you want.
df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
ones <- with(df, aggregate(x1 ~ x2, FUN = sum))
count <- table(df$x2)
prop <- ones$x1 / count
df2 <- data.frame(prop, count)
df2
rownames(df2) <- df2[,3]
df2 <- df2[,c(2,4)]
colnames(df2) <- c('prop', 'count')
df2
prop count
a 0.6666667 3
b 0.0000000 1
c 0.5000000 2
Try using table
tbl <- table(df$x1, df$x2)
# a b c
# 0 1 1 1
# 1 2 0 1
tbl["1",] / colSums(tbl)
# a b c
# 0.6666667 0.0000000 0.5000000
For nice output use:
data.frame(proportions=tbl["1",] / colSums(tbl))
proportions
a 0.6666667
b 0.0000000
c 0.5000000
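To get both the proportion and the count in one data frame with base R only, something along these lines might work. This is a sketch rather than a definitive implementation: it reuses the name `getprops` from the question, and it assumes the columns are called `x1` (0/1 indicator) and `x2` (letter) as in the example.

```r
df <- data.frame(x1 = c(1, 1, 0, 0, 1, 0),
                 x2 = c("a", "a", "b", "a", "c", "c"))

getprops <- function(d) {
  groups <- split(d$x1, d$x2)  # one vector of 0/1 values per letter
  data.frame(prop  = vapply(groups, mean, numeric(1)),  # share of 1s
             count = lengths(groups))                   # rows per letter
}

getprops(df)
```

The mean of a 0/1 vector is the proportion of 1s, so no separate sum and division is needed.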