Add jitter to column value using dplyr - r

I have a data frame of the following format.
author year stages
1 A 1150 1
2 B 1200 1
3 C 1200 1
4 D 1300 1
5 D 1300 1
6 E 1390 3
7 F 1392 3
8 G 1400 3
9 G 1400 3
...
I want to jitter each author-and-year combination by a small amount. Documents by different authors in the same year should be jittered by distinct values: for example, the tokens from authors B and C fall in the same year but should be jittered by different amounts. All tokens from the same author, for example the two tokens from author G at 1400, should be jittered by the same amount.
I've tried the following, but get a unique jitter amount for each and every row.
data %>% group_by(author) %>% mutate(year = jitter(year, amount=.5))
The output of this code is the following.
author year stages
1 A 1150.400 1
2 B 1200.189 1
3 C 1200.222 1
4 D 1300.263 1
5 D 1299.788 1
6 E 1390.045 3
7 F 1391.964 3
8 G 1399.982 3
9 G 1399.783 3
However, I would like the following, where the crucial difference is that both tokens from author G are shifted by the same amount.
author year stages
1 A 1150.400 1
2 B 1200.189 1
3 C 1200.222 1
4 D 1300.263 1
5 D 1299.788 1
6 E 1390.045 3
7 F 1391.964 3
8 G 1399.982 3
9 G 1399.982 3

Calculate the jitter for one case and add the difference to all cases:
dat %>%
  group_by(author) %>%
  mutate(year = year + (year[1] - jitter(year[1], amount = .5)))
# author year stages
#1 A 1149.720 1
#2 B 1200.385 1
#3 C 1199.888 1
#4 D 1299.589 1
#5 D 1299.589 1
#6 E 1389.866 3
#7 F 1392.225 3
#8 G 1400.147 3
#9 G 1400.147 3
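An equivalent sketch (assuming dplyr is attached, with the question's data typed in by hand): because mutate() evaluates its expression once per group, a length-1 runif() draw produces a single offset that is recycled across all rows of each author.

```r
library(dplyr)

data <- data.frame(
  author = c("A", "B", "C", "D", "D", "E", "F", "G", "G"),
  year   = c(1150, 1200, 1200, 1300, 1300, 1390, 1392, 1400, 1400),
  stages = c(1, 1, 1, 1, 1, 3, 3, 3, 3)
)

set.seed(1)  # reproducible offsets
jittered <- data %>%
  group_by(author) %>%
  mutate(year = year + runif(1, -0.5, 0.5)) %>%  # one draw per author, recycled
  ungroup()
```

Both rows for author G now carry an identical offset, since the draw happens once for that group.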

R language, how to sum values by group, skipping the same numbers (nested in another group)?

I want to add a new column calculating the average bonus per employee for each company. For example, the expected output for company A is (18+8+2)/3, and that value should fill every row of company A. The same logic applies to companies B, C, and D. BTW, the rows with duplicated values cannot be dropped. At first I was thinking to calculate the sum of the mean of the bonus, but the code didn't work. Then I was thinking to add a loop skipping the same values, but that didn't work either. Anyone have some thoughts? I appreciate it a lot!
Lacking readable input, I made some up:
library(dplyr)
set.seed(30258)
df <- tibble(COMPANY.ID = sample(LETTERS[1:4], 20, replace = TRUE),
             EMP.ID = sample(1:5, 20, replace = TRUE),
             BONUS = sample(2:20, 20, replace = TRUE)) %>%
  arrange(COMPANY.ID, EMP.ID, BONUS)
# A tibble: 20 x 3
COMPANY.ID EMP.ID BONUS
<chr> <int> <int>
1 A 1 3
2 A 2 13
3 A 2 16
4 B 1 10
5 B 1 18
6 B 2 20
7 B 3 3
8 B 4 20
9 B 5 7
10 B 5 10
11 B 5 10
12 C 2 4
13 C 3 4
14 C 3 16
15 C 5 4
16 C 5 13
17 D 1 8
18 D 1 9
19 D 3 8
20 D 4 12
A formula for the company's average bonus: if an employee receives multiple bonuses from the same company, they are added together.
avgCoBonus <- df %>%
group_by(COMPANY.ID) %>%
summarise(AVG.BONUS = round(sum(BONUS) / length(unique(EMP.ID)), 2))
# A tibble: 4 x 2
COMPANY.ID AVG.BONUS
<chr> <dbl>
1 A 16
2 B 17.6
3 C 13.7
4 D 12.3
I think that's what you had in mind.
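If the goal is to fill the average back onto every row (as the question asks) rather than produce a summary table, the same formula can go inside mutate(); a sketch with a small hand-made data frame:

```r
library(dplyr)

df <- data.frame(
  COMPANY.ID = c("A", "A", "A", "B", "B"),
  EMP.ID     = c(1, 2, 2, 1, 3),
  BONUS      = c(3, 13, 16, 10, 3)
)

df <- df %>%
  group_by(COMPANY.ID) %>%
  mutate(AVG.BONUS = round(sum(BONUS) / n_distinct(EMP.ID), 2)) %>%
  ungroup()
# company A: (3 + 13 + 16) / 2 distinct employees = 16, repeated on each A row
```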

Compare two columns of two data frames with different numbers of rows and return a third column

I have two different df which share the same columns: "O" for place and "date" for time.
Df 1 gives several pieces of information for a certain place (O) and time (date) in a single row, while df 2 has many rows of information for the same year and place. Now I want to extract a value from the first df and apply it to all rows of the second df where the values of "O" and "date" are equal.
To make it more clear:
I have one line in df 1: krnqm = 250 for O = 1002 and date = 1885. Now I want a new column "krnqm" in df 2 where df2$krnqm = 250 for all rows where df2$O == 1002 and df2$date == 1885.
Unfortunately I have no idea how to put that condition into a code line and would be grateful for your help.
You can do this quite easily in base R using the merge function. Here's an example.
Simulate some data from your description:
df1 <- expand.grid(O = letters[c(2:4,7)], date = c(1,3))
df2 <- data.frame(O = rep(letters[1:6], c(2,3,3,6,2,2)), date = rep(1:3, c(3,2,4)))
df1$krnqm <- sample(1:1000, size = nrow(df1), replace=T)
> df1
O date krnqm
1 b 1 833
2 c 1 219
3 d 1 773
4 g 1 514
5 b 3 118
6 c 3 969
7 d 3 704
8 g 3 914
> df2
O date
1 a 1
2 a 1
3 b 1
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
9 d 3
10 d 1
11 d 1
12 d 1
13 d 2
14 d 2
15 e 3
16 e 3
17 f 3
18 f 3
Now let's combine the two data frames in the manner you describe.
df2 <- merge(df2, df1, all.x=T)
> df2
O date krnqm
1 a 1 NA
2 a 1 NA
3 b 1 833
4 b 2 NA
5 b 2 NA
6 c 3 969
7 c 3 969
8 c 3 969
9 d 1 773
10 d 1 773
11 d 1 773
12 d 2 NA
13 d 2 NA
14 d 3 704
15 e 3 NA
16 e 3 NA
17 f 3 NA
18 f 3 NA
So you can see, the krnqm column in the resulting data frame contains NAs for any combination of 'O' and 'date' that was not found in the data frame the krnqm values were extracted from. If your df1 has other columns that you do not want included in the merge, just change the merge call slightly to use only the columns you want: df2 <- merge(df2, df1[, c("O", "date", "krnqm")], all.x=T).
Good luck!
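For completeness, here is the same left join with dplyr (a sketch; the df1/df2 values are made up to mirror the question's example of krnqm = 250 at O = 1002, date = 1885):

```r
library(dplyr)

df1 <- data.frame(O = c(1002, 1003), date = c(1885, 1890), krnqm = c(250, 300))
df2 <- data.frame(O    = c(1002, 1002, 1003, 1004),
                  date = c(1885, 1885, 1890, 1885))

# keeps every row of df2; unmatched O/date pairs get NA, like merge(..., all.x = TRUE)
df2 <- left_join(df2, df1, by = c("O", "date"))
```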

summation for multiple columns dynamically

Hi I have dataframe with multiple columns ,I.e first 5 columns are my metadata and remaing
columns (columns count will be even) are actual columns which need to be calculated
formula : (col6*col9) + (col7*col10) + (col8*col11)
country<-c("US","US","US","US")
name <-c("A","B","c","d")
dob<-c(2017,2018,2018,2010)
day<-c(1,4,7,9)
hour<-c(10,11,2,4)
a <-c(1,3,4,5)
d<-c(1,9,4,0)
e<-c(8,1,0,7)
f<-c(10,2,5,6)
j<-c(1,4,2,7)
m<-c(1,5,7,1)
df=data.frame(country,name,dob,day,hour,a,d,e,f,j,m)
How do I get the final summation if I have more columns? I have tried the below code, which works for this example but is hard-coded:
df$final <-(df$a*df$f)+(df$d*df$j)+(df$e*df$m)
Here is one way to generalize the computation:
x <- ncol(df) - 5
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
# country name dob day hour a d e f j m final
# 1 US A 2017 1 10 1 1 8 10 1 1 19
# 2 US B 2018 4 11 3 9 1 2 4 5 47
# 3 US c 2018 7 2 4 4 0 5 2 7 28
# 4 US d 2010 9 4 5 0 7 6 7 1 37
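The index arithmetic above can be made easier to read by naming the two halves of the value columns explicitly (same result, assuming the metadata is always the first 5 columns; the data is the question's example):

```r
df <- data.frame(country = c("US", "US", "US", "US"),
                 name = c("A", "B", "c", "d"),
                 dob = c(2017, 2018, 2018, 2010),
                 day = c(1, 4, 7, 9), hour = c(10, 11, 2, 4),
                 a = c(1, 3, 4, 5), d = c(1, 9, 4, 0), e = c(8, 1, 0, 7),
                 f = c(10, 2, 5, 6), j = c(1, 4, 2, 7), m = c(1, 5, 7, 1))

k <- (ncol(df) - 5) / 2            # number of column pairs (3 here)
left  <- df[, 6:(5 + k)]           # a, d, e
right <- df[, (6 + k):ncol(df)]    # f, j, m
df$final <- rowSums(left * right)  # (a*f) + (d*j) + (e*m)
df$final  # 19 47 28 37
```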

Find co-occurrence of values in large data set

I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month Customer ID Store
Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, since customer 1 visited Store A in one month and Store B in the next, they would be added to the tally. I'm interested in the number of shared customers, not the number of visits.
I tried the sparse matrix approach in this thread (Creating co-occurrence matrix), but the numbers returned don't match up for a reason I cannot understand.
Any ideas would be greatly appreciated!
Update:
The original solution that I posted worked for your data. But your data has the unusual property that no customer ever visited the same store in two different months. Presuming that could happen, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months each customer visited the store. With different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE = 1 and FALSE = 0, that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
A B C
A 4 2 2
B 2 4 2
C 2 2 4
We can try this too:
library(reshape2)
df <- dcast(df,CustomerID~Store, length, value.var='Store')
# CustomerID A B C
#1 1 1 1 1
#2 2 1 1 0 # Customer 2 went to stores A,B but not to C
#3 3 1 0 1
#4 4 1 0 0
#5 7 0 1 0
#6 11 0 0 1
#7 12 0 1 1
crossprod(as.matrix(df[-1]))
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
With the arules package:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
# items transactionID
#[1] {A,B,C} 1
#[2] {C} 11
#[3] {B,C} 12
#[4] {A,B} 2
#[5] {A,C} 3
#[6] {A} 4
#[7] {B} 7
image(tr)
crossTable(tr, sort=TRUE)
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5 + 5 + 4 + 14 = 28.
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8, function(j) sum(m[row(m) + col(m) == j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m), LETTERS[row(m) + col(m) - 1], sum))
as.matrix() is required to make split() work correctly ...
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance (dat here is the data frame of counts):
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[row(dat) + col(dat) - 1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution, using @bgoldst's definitions of df1 and df2:
sapply(unique(c(as.matrix(df2))), function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack() and aggregate(), although it requires that the second data.frame contain character vectors rather than factors (which could be forced with lapply(df2, as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2
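The approaches above can be checked with a tiny self-contained sketch (base R only; the matrix is the question's count table typed in by hand):

```r
m <- matrix(c(8, 12, 5, 14,
              1,  6, 4,  3,
              2,  5, 4, 11,
              5,  3, 7,  2), nrow = 4, byrow = TRUE)

# row(m) + col(m) is constant along each anti-diagonal,
# so it works as a grouping variable for tapply()
vals <- tapply(m, LETTERS[row(m) + col(m) - 1], sum)
vals  # A=8 B=13 C=13 D=28 E=10 F=18 G=2
```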
