I'm not sure what I'm missing here:
library(dplyr)
df1 <- data.frame(n = c(1, 1, 1, 2, 1, 1, 2))
mutate(df1, foo = n/mean(c(n, lag(n)), na.rm = TRUE))
n foo
1 1 0.8125
2 1 0.8125
3 1 0.8125
4 2 1.6250
5 1 0.8125
6 1 0.8125
7 2 1.6250
What on earth is going on? The first row should be, basically, 1/mean(1), i.e., '1'. Why am I getting 0.8125? What's even stranger is in my original dataset, I'm getting yet another number - 0.608, for basically the same calculation. What am I missing?
Try summarise(df1, length(c(n, lag(n)))) — the vector c(n, lag(n)) stacks n and lag(n) end to end, so its length is twice the number of rows, and mean() collapses the whole thing to a single scalar, 1.230769. Every row of foo is then n divided by that same constant.
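You can check that scalar directly on df1 as defined above:
mean(c(df1$n, lag(df1$n)), na.rm = TRUE)
# [1] 1.230769  (13 non-NA values summing to 16)
1/1.230769
# [1] 0.8125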
What I believe you want to do is:
mutate(df1, foo = n/rowMeans(cbind(n, lag(n)), na.rm = TRUE))
n foo
1 1 1.0000000
2 1 1.0000000
3 1 1.0000000
4 2 1.3333333
5 1 0.6666667
6 1 1.0000000
7 2 1.3333333
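An equivalent arithmetic spelling, if you prefer to avoid cbind (a sketch using dplyr's coalesce to fall back to n itself where lag(n) is NA, which reproduces the na.rm = TRUE behaviour in row 1):
mutate(df1, foo = n/((n + coalesce(lag(n), n))/2))
Both versions average each n with the previous row's n, row by row, instead of collapsing the whole column to one number.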
The details of the data frame are:
ID Price Result
1 20 -0.1
2 18 0.1667
3 21 -0.2381
4 16 0.1875
5 19 -1
So I have to subtract each row's Price from the next row's Price and then divide by the current row's Price: (18 - 20)/20 = -0.1. For the last row, as there is no next value, it is treated as 0: (0 - 19)/19 = -1.
Please help me with this. I am getting NA at the end.
Append a trailing 0 to Price so that diff returns one value per row:
transform(df, Result = diff(c(Price, 0))/Price)
ID Price Result
1 1 20 -0.1000000
2 2 18 0.1666667
3 3 21 -0.2380952
4 4 16 0.1875000
5 5 19 -1.0000000
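If you are already in dplyr, a lead-based sketch of the same idea also works; lead's default = 0 plays the same role as the 0 appended in c(Price, 0):
library(dplyr)
df %>% mutate(Result = (lead(Price, default = 0) - Price)/Price)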
I have the following data (dt below is my data.table):
library(data.table)
dt <- data.table(Game = c(rep(1, 9), rep(2, 3)),
                 Round = rep(1:3, 4),
                 Participant = rep(1:4, each = 3),
                 Left_Choice = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1),
                 Total_Points = c(5, 15, 12, 16, 83, 7, 4, 8, 23, 6, 9, 14))
> dt
Game Round Participant Left_Choice Total_Points
1: 1 1 1 1 5
2: 1 2 1 0 15
3: 1 3 1 0 12
4: 1 1 2 1 16
5: 1 2 2 1 83
6: 1 3 2 0 7
7: 1 1 3 0 4
8: 1 2 3 0 8
9: 1 3 3 1 23
10: 2 1 4 1 6
11: 2 2 4 1 9
12: 2 3 4 1 14
Now, I need to do the following:
First, for each participant in each game, I need to calculate the mean "Left Choice" rate.
After that I want to break the results into 5 groups (left choice < 20%, left choice between 20% and 40%, etc.).
For each group (in each game), I want to calculate the mean of the Total_Points in the last round (round 3 in this simple example), using only the round-3 value. So, for example, for participant 1 in game 1 the round-3 total points are 12, and for participant 4 in game 2 they are 14.
So in the first stage I think I should calculate the following:
Game Participant Percent_left Total_Points (in last round)
1 1 33% 12
1 2 66% 7
1 3 33% 23
2 4 100% 14
And the final result should look like this:
Game Left_Choice Total_Points (average)
1    <35%        17.5 = (12+23)/2
1    35%-70%     7
1    >70%        NA
2    <35%        NA
2    35%-70%     NA
2    >70%        14
Please help! :)
Working in data.table
1: simple group mean with by
dt[,pct_left:=mean(Left_Choice),by=.(Game,Participant)]
2: use cut; it's not totally clear, but I think you want include.lowest = TRUE.
dt[,pct_grp:=cut(pct_left,breaks=seq(0,1,by=.2),include.lowest=TRUE)]
3: slightly more complicated group mean with by
dt[Round==max(Round),end_mean:=mean(Total_Points),by=.(pct_grp,Game)]
(If you just want the reduced table, use .(end_mean = mean(Total_Points)) in j instead.)
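Putting steps 1-3 together, a sketch of the full pipeline ending in the reduced table:
dt[, pct_left := mean(Left_Choice), by = .(Game, Participant)]
dt[, pct_grp := cut(pct_left, breaks = seq(0, 1, by = .2), include.lowest = TRUE)]
dt[Round == max(Round), .(end_mean = mean(Total_Points)), by = .(pct_grp, Game)]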
You didn't make it clear whether there is a global maximum number of rounds (i.e. whether all games end in the same number of rounds); this was assumed above. You'll have to be more clear about this in order to provide an exact alternative, but I suggest starting with just defining it round by round:
dt[,end_mean:=mean(Total_Points),by=.(pct_grp,Game,Round)]
I have this data frame, and I want to count the frequency (number of occurrences) of each unique value in a column.
userID bookmarkID tagID value
228 1 1 0.0005
255 1 1 0.0007
5 2 1 0.0068
66 2 1 0.0008
99 2 1 0.0006
206 2 1 0.0006
3 3 1 -0.0007
5 3 1 0.0633
7 3 1 -0.0012
For example, for the column bookmarkID, I want to get two vectors: one with the unique values [1, 2, 3], the other with the corresponding counts [2, 4, 3]. How can I do this?
I think you're looking for table and unique. Assuming your data.frame is df:
> table(df$bookmarkID)
1 2 3
2 4 3
> unique(df$bookmarkID)
[1] 1 2 3
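If you want the two results as ordinary vectors, you can pull them out of the table object (tab is just an illustrative name):
tab <- table(df$bookmarkID)
ids <- as.numeric(names(tab))  # unique values: 1 2 3
counts <- as.vector(tab)       # corresponding counts: 2 4 3
Note that unique() returns values in order of first appearance while table() sorts them, so pairing the counts with names(tab) is the safer route.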
I'd like to do a cut with a guaranteed number of levels returned, so I'd like to take any vector of cumulative percentages and cut it into deciles. I've tried using cut, and it works well in most situations, but when large gaps in the values leave some deciles empty it fails to return the desired number of unique bins, which is 10. Any ideas on how to ensure that the number of bins is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(Hmisc::cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, defined in such a way that I am assured of getting 10. cut et al. give 9 non-empty bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between 58.03 and 72.46 is large: instead of assigning the cases 58.03, 72.46, 73.4 to bins 6, 8, 8, it would assign them to bins 6, 7, 8.
xx <- cut(x, breaks = quantile(x, 0:10/10, na.rm = TRUE), include.lowest = TRUE)
table(xx)
#------------------------
xx
 [0.04,0.256] (0.256,0.58] (0.58,0.718] (0.718,6.76]
            4            4            4            4
  (6.76,20.5]  (20.5,35.7]  (35.7,49.7]  (49.7,75.1]
            4            3            4            4
  (75.1,85.5]   (85.5,100]
            4            4
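If the hard requirement is simply ten non-empty groups of near-equal size, a rank-based assignment sidesteps gaps in the values entirely (a sketch; ties are broken by position):
bins <- ceiling(10*rank(x, ties.method = "first")/length(x))
table(bins)  # ten groups of 3-4 observations each
This ignores the spacing of the values altogether, so like the quantile approach it is only appropriate if you care about group sizes rather than interval widths.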
numBins <- 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1),
    include.lowest = TRUE)
Output:
...
...
...
10 Levels: [0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced; include.lowest = TRUE keeps min(x) itself from falling outside the first bin. Note that by changing the numBins variable, you may obtain any number of approximately equally spaced bins.
Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?
I have a data frame in R containing the columns ID.A, ID.B and DISTANCE, where DISTANCE represents the distance between ID.A and ID.B. For each value (1 to n) of ID.A there may be multiple values of ID.B and DISTANCE, i.e. ID.A may contain duplicates (e.g. several rows with the value 4, each with a different ID.B and distance).
I would like to remove the rows where ID.A is duplicated, conditional upon the distance value, such that I am left with the smallest distance for each ID.A record.
Hopefully that makes sense?
Many thanks in advance
EDIT
Hopefully an example will prove more useful than my text. Here I would like to remove the second and third rows where ID.A = 3:
myDF <- read.table(text="ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)
You can also do it easily in base R. If dat is your data frame:
do.call(rbind,
by(dat, INDICES=list(dat$ID.A),
FUN=function(x) head(x[order(x$DISTANCE), ], 1)))
One possibility:
myDF <- myDF[order(myDF$ID.A, myDF$DISTANCE), ]
newdata <- myDF[which(!duplicated(myDF$ID.A)),]
Which gives:
ID.A ID.B DISTANCE
1 1 3 1.0
2 2 6 8.0
3 3 2 0.4
6 4 8 7.0
7 5 2 11.0
You can use the plyr package for that. For example, if your data look like this:
d <- data.frame(id.a = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                id.b = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                dist = c(12, 10, 15, 20, 18, 16, 17, 25, 9))
id.a id.b dist
1 1 1 12
2 1 2 10
3 1 3 15
4 2 1 20
5 2 2 18
6 3 1 16
7 3 2 17
8 3 3 25
9 3 4 9
You can use the ddply function like this:
library(plyr)
ddply(d, "id.a", function(df) return(df[df$dist==min(df$dist),]))
Which gives:
id.a id.b dist
1 1 2 10
2 2 2 18
3 3 4 9
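For completeness, a dplyr-flavoured sketch of the same grouped minimum (plyr's successor; slice_min needs dplyr >= 1.0):
library(dplyr)
d %>% group_by(id.a) %>% slice_min(dist, n = 1) %>% ungroup()
Like the ddply version, this keeps all rows tied for the minimum; add with_ties = FALSE to keep exactly one row per id.a.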