I have count data from different regions per year. The original data is structured like this:
count region year
1 1 A 2011
2 2 A 2010
3 1 A 2009
4 5 A 2008
5 4 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
16 4 D 2011
17 3 D 2010
18 2 D 2009
19 1 D 2008
20 4 D 2007
I now need to sum the values for regions A and D per year and label the resulting sums as region A. The output should look like this:
count region year
1 5 A 2011
2 5 A 2010
3 3 A 2009
4 6 A 2008
5 8 A 2007
6 2 B 2011
7 2 B 2010
8 1 B 2009
9 5 B 2008
10 3 B 2007
11 3 C 2011
12 3 C 2010
13 2 C 2009
14 1 C 2008
15 3 C 2007
The counts for regions B and C should not change. I tried several approaches but never got the needed output. Does anyone have a tip? I would be very grateful.
We can replace D with A in the region column and then do a grouped sum by region and year:
library(dplyr)
df1 %>%
group_by(region = replace(region, region == 'D', 'A'), year) %>%
summarise(count = sum(count), .groups = 'drop')
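For readers without dplyr, the same idea can be sketched in base R. This is a minimal sketch, assuming the data frame is called df1 as in the answer above and is rebuilt here from the question's table; the final reordering step is only there to match the row order shown in the question.

```r
# Rebuild the question's data (df1 is the name used in the answer above).
df1 <- data.frame(
  count  = c(1, 2, 1, 5, 4, 2, 2, 1, 5, 3, 3, 3, 2, 1, 3, 4, 3, 2, 1, 4),
  region = rep(c("A", "B", "C", "D"), each = 5),
  year   = rep(2011:2007, times = 4)
)

# Relabel region D as A, then sum counts within each region/year pair.
df1$region[df1$region == "D"] <- "A"
res <- aggregate(count ~ region + year, data = df1, FUN = sum)

# Order by region, then by year descending, to mirror the desired output.
res <- res[order(res$region, -res$year), c("count", "region", "year")]
rownames(res) <- NULL
res
```

Regions B and C pass through unchanged because each of their region/year pairs contains a single row.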
I got this df:
df <- data.frame(flow = c(1,2,3,4,5,6,7,8,9,10,11))
flow
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
and I want the 7-row (weekly) average starting from the current row, like this:
flow flow7mean
1 1 4 (mean of 1,2,3,4,5,6,7)
2 2 5 (mean of 2,3,4,5,6,7,8)
3 3 6 (mean of 3,4,5,6,7,8,9)
4 4 7 (mean of 4,5,6,7,8,9,10)
5 5 8 (mean of 5,6,7,8,9,10,11)
6 6 NA (that's OK, because only 6 flow values remain)
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
I have tried some loop solutions, but I think a vectorized solution would be better.
Try this using rollmean() from zoo package:
library(zoo)
#Code
df$M <- rollmean(df$flow,k = 7,align = 'left',fill=NA)
Output:
df
flow M
1 1 4
2 2 5
3 3 6
4 4 7
5 5 8
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
We can use roll_mean from RcppRoll
library(RcppRoll)
df$flow7mean <- roll_mean(df$flow, 7, fill = NA, align = 'left')
Output:
df
# flow flow7mean
#1 1 4
#2 2 5
#3 3 6
#4 4 7
#5 5 8
#6 6 NA
#7 7 NA
#8 8 NA
#9 9 NA
#10 10 NA
#11 11 NA
Here is a base R option using embed
within(df,flow7mean <- `length<-`(rowMeans(embed(flow,7)),length(flow)))
which gives
flow flow7mean
1 1 4
2 2 5
3 3 6
4 4 7
5 5 8
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
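To see why the embed one-liner works, it may help to unpack it. The following is a minimal sketch using the same flow values; the intermediate names (k, windows, means) are introduced here for illustration only.

```r
flow <- 1:11
k <- 7  # window length (one week)

# embed() returns a matrix with one row per complete window.
# Row i holds flow[i + k - 1], ..., flow[i] (the window in reversed order),
# which does not matter for a mean.
windows <- embed(flow, k)

# One mean per complete window, aligned to the first element of the window.
means <- rowMeans(windows)

# Pad with NA for the trailing rows that have fewer than k values left.
flow7mean <- c(means, rep(NA, length(flow) - length(means)))
flow7mean
```

The `length<-` call in the answer above does the same padding in one step, since extending a vector's length fills the new positions with NA.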
A very basic question! I tried searching a lot and using my own brain, but eventually had to come here. :)
Well here is a sample dataframe
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.90,2.90,2.90,2.90,2.21,2.21,2.21,2.21))
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 2.75
3 1 3 2015 2.75
4 1 4 2015 2.75
5 2 1 2015 2.90
6 2 2 2015 2.90
7 2 3 2015 2.90
8 2 4 2015 2.90
9 3 1 2015 2.21
10 3 2 2015 2.21
11 3 3 2015 2.21
12 3 4 2015 2.21
I need a unique value per id, so I use this:
df$value[duplicated(df$value)]<-NA
And I get what I need.
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
Now let's say that I have a new dataframe in which the same value appears under more than one id:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
If I use the same code, the value for ID 2 is set to NA as well, because it duplicates the value of ID 1.
How can I retain the unique values for every ID per year?
Any help is much appreciated.
Here is a base R solution using ave + duplicated
df <- within(df,value <- ave(value,
id,
year,
FUN = function(v) ifelse(duplicated(v),NA,v)))
such that
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
Applying duplicated to id and year (combined with cbind) instead of to value should give you the desired result:
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Using this solution on your second data.frame that gave you missing rows:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Returns:
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2016 2.75
6 2 2 2016 NA
7 2 3 2016 NA
8 2 4 2016 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
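Note that duplicated on id and year alone blanks every row after the first in each group, whatever its value. If a group could hold more than one distinct value, a variant that also includes value in the key might be closer to the intent. This is a sketch under that assumption, using the second data frame from the question; the helper name dup is introduced here for illustration.

```r
df <- data.frame(
  id      = rep(1:3, each = 4),
  quarter = rep(1:4, times = 3),
  year    = rep(c(2015, 2016, 2015), each = 4),
  value   = rep(c(2.75, 2.75, 2.21), each = 4)
)

# A row is a duplicate only if the same id, year AND value appeared earlier.
dup <- duplicated(df[c("id", "year", "value")])
df$value[dup] <- NA
df
```

On this data the result is the same as the cbind version, because each id/year group contains a single distinct value.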
I have a dataframe with counts of different items, in different years:
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
And I would like to add a "year.rank" column, which gives an item's rank within a given year, where a higher count leads to a higher "rank". With the above, it would look like:
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
I know I could do this for the whole data frame using order(df$count), but I'm not sure how I would do it by year.
There is a rank function to help you with that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
data.table version for practice:
library(data.table)
DT <- as.data.table(df)
DT[,yrrank:=rank(-count,ties.method="first"),by=year]
item year count yrrank
1: a 2010 1 3
2: b 2010 4 2
3: c 2010 6 1
4: a 2011 3 2
5: b 2011 8 1
6: c 2011 3 3
7: a 2012 5 3
8: b 2012 7 2
9: c 2012 9 1
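Using the order function. Note that order returns the permutation that sorts the vector rather than ranks, so it is applied twice to turn that permutation into ranks:
transform(df, x = ave(count, year, FUN = function(x) order(order(-x))))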
item year count x
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
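One caveat worth checking when order is used for ranking: order gives the indices that would sort the vector, which is the inverse permutation of the ranks. The two coincide for the counts in this question, but not in general. A small sketch with a made-up vector x (not from the question) shows the difference:

```r
x <- c(1, 3, 2)  # illustrative values, chosen so that order and rank differ

# Indices of the elements from largest to smallest.
ord <- order(x, decreasing = TRUE)

# Rank of each element, with 1 for the largest.
rnk <- rank(-x, ties.method = "first")

# Applying order twice inverts the permutation and recovers the ranks.
rnk2 <- order(order(-x))

ord   # 2 3 1
rnk   # 3 1 2
rnk2  # 3 1 2
```

So a single order call is only safe as a rank when the sorting permutation happens to be its own inverse, as it is for every year in the example data.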
EDIT
You can use plyr here also (again applying order twice to get ranks):
library(plyr)
ddply(df, .(year), transform, x = order(order(-count)))
Using dplyr you could do it as follows:
library(dplyr) # 0.4.1
df %>%
group_by(year) %>%
mutate(yrrank = row_number(-count))
#Source: local data frame [9 x 4]
#Groups: year
#
# item year count yrrank
#1 a 2010 1 3
#2 b 2010 4 2
#3 c 2010 6 1
#4 a 2011 3 2
#5 b 2011 8 1
#6 c 2011 3 3
#7 a 2012 5 3
#8 b 2012 7 2
#9 c 2012 9 1
It is the same as:
df %>%
group_by(year) %>%
mutate(yrrank = rank(-count, ties.method = "first"))
Note that the resulting data is still grouped by "year". If you want to remove the grouping you can simply extend the pipe with %>% ungroup().
While using the answers given by others, I found that the following performs faster than the transform and dplyr variants:
df$year.rank <- ave(df$count, df$year, FUN = function(x) rank(-x, ties.method = "first"))
Say I have two data frames, A and B:
mth <- c(rep(1:5,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
year <- c(2008:2012)
mth <- c(1:5)
B <- data.frame(cbind(year,mth))
What I want should look like this:
mth <- c(rep(2008:2012,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
Basically, I need to replace the values in column mth of A with the matching values from column year of B. Maybe I didn't search for the right keyword, but I was not able to get what I want (I tried which()). Please help, thank you.
A2 <- merge(A,B, by = "mth")[ , -1]
names(A2)[(which(names(A2)=="year"))] <- "mth"
> A2
day hr v mth
1 10 3 3 2008
2 11 3 3 2008
3 11 4 4 2009
4 10 4 4 2009
5 11 5 5 2010
6 10 5 5 2010
7 11 6 4 2011
8 10 6 4 2011
9 10 7 3 2012
10 11 7 3 2012
Probably the easiest solution is to use merge, which is equivalent to a sql join in a lot of ways:
merge(A,B)
#-----
merge(A, B)
mth day hr v year
1 1 10 3 3 2008
2 1 11 3 3 2008
3 2 11 4 4 2009
4 2 10 4 4 2009
5 3 11 5 5 2010
6 3 10 5 5 2010
7 4 11 6 4 2011
8 4 10 6 4 2011
9 5 10 7 3 2012
10 5 11 7 3 2012
You could also probably use match like this to replace mth in place:
A$mth <- B[match(A$mth, B$mth),1]
#-----
mth day hr v
1 2008 10 3 3
2 2009 10 4 4
3 2010 10 5 5
4 2011 10 6 4
5 2012 10 7 3
6 2008 11 3 3
7 2009 11 4 4
8 2010 11 5 5
9 2011 11 6 4
10 2012 11 7 3
While a little dense, that code indexes B by matching the two mth columns from A and B and then grabs the first column (year).
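The match approach can be unpacked into two steps. This is a minimal sketch on a cut-down version of A (only the mth and day columns); the intermediate name idx is introduced here for illustration, and B$year is used in place of the positional column index 1.

```r
A <- data.frame(mth = rep(1:5, 2), day = rep(c(10, 11), each = 5))
B <- data.frame(year = 2008:2012, mth = 1:5)

# For each row of A, find the row of B that has the same mth.
idx <- match(A$mth, B$mth)

# Use those row positions to pull the corresponding year out of B.
A$mth <- B$year[idx]
A
```

Unlike merge, this keeps the original row order of A. Any mth value in A with no match in B would come back as NA, since match returns NA for unmatched entries.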