This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a data frame like this:
Date Amount Category
1 02.07.15 1 1
2 02.07.15 2 1
3 02.07.15 3 1
4 02.07.15 4 2
5 03.07.15 5 2
6 04.07.15 6 3
7 05.07.15 7 3
8 06.07.15 8 3
9 07.07.15 9 4
10 08.07.15 10 5
11 09.07.15 11 6
12 10.07.15 12 4
13 11.07.15 13 4
14 12.07.15 14 5
15 13.07.15 15 5
16 14.07.15 16 6
17 15.07.15 17 6
18 16.07.15 18 5
19 17.07.15 19 4
I would like to calculate the sum of the amount for each single day within each category. Both of my attempts (see the code below) fall short.
summarise(group_by(testData, Category), sum(Amount))
Wrong output --> here the sum is calculated per category only; the days are ignored
Category sum(Amount)
1 1 6
2 2 9
3 3 21
4 4 53
5 5 57
6 6 44
summarise(group_by(testData, Date), sum(Amount), categories = toString(Category))
Wrong output --> here the sum is calculated per day, but the categories are not separated
Date sum(Amount) categories
1 02.07.15 10 1, 1, 1, 2
2 03.07.15 5 2
3 04.07.15 6 3
4 05.07.15 7 3
5 06.07.15 8 3
6 07.07.15 9 4
7 08.07.15 10 5
8 09.07.15 11 6
9 10.07.15 12 4
10 11.07.15 13 4
11 12.07.15 14 5
12 13.07.15 15 5
13 14.07.15 16 6
14 15.07.15 17 6
15 16.07.15 18 5
16 17.07.15 19 4
So far I did not succeed in combining both statements.
How can I nest both group_by statements to calculate the sum of the amount for each single day in each category?
Nesting the groups like:
summarise(group_by(group_by(testData, Date), Category), sum(Amount), dates = toString(Date))
Category sum(Amount) dates
1 1 6 02.07.15, 02.07.15, 02.07.15
2 2 9 02.07.15, 03.07.15
3 3 21 04.07.15, 05.07.15, 06.07.15
4 4 53 07.07.15, 10.07.15, 11.07.15, 17.07.15
5 5 57 08.07.15, 12.07.15, 13.07.15, 16.07.15
6 6 44 09.07.15, 14.07.15, 15.07.15
does not work as intended: the second group_by replaces the first grouping instead of adding to it, so the result is grouped by Category only.
I have also heard of summarise_each (via "dplyr - summarise weighted data") but could not get it to work:
summarise_each(testData, funs(Category))
Error: could not find function "Category"
You can try
testData %>%
group_by(Date,Category) %>%
summarise(Amount= sum(Amount))
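For reference, here is a minimal reproducible sketch that rebuilds testData from the table above (column types assumed) and runs the chain:
library(dplyr)
testData <- data.frame(
  Date = c(rep("02.07.15", 4), "03.07.15", "04.07.15", "05.07.15", "06.07.15",
           "07.07.15", "08.07.15", "09.07.15", "10.07.15", "11.07.15",
           "12.07.15", "13.07.15", "14.07.15", "15.07.15", "16.07.15", "17.07.15"),
  Amount = 1:19,
  Category = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 4, 4, 5, 5, 6, 6, 5, 4))
testData %>%
  group_by(Date, Category) %>%
  summarise(Amount = sum(Amount))
# 02.07.15 now yields two rows: Category 1 (Amount 6) and Category 2 (Amount 4),
# while every other date keeps its single category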
Related
I would like to order a data frame based on an alphanumeric variable. Here is how my dataset looks:
sample.data <- data.frame(Grade=c(4,4,4,4,3,3,3,3,3,3,3,3),
ItemID = c(15,15,15,15,17,17,17,17,16,16,16,16),
common.names = c("15_AS_SA1_Correct","15_AS_SA10_Correct","15_AS_SA2_Correct","15_AS_SA3_Correct",
"17_AS_2_B2","17_AS_2_B1","17_AS_5_C1","17_AS_4_D1",
"16_AS_SA1_Negative","16_AS_SA11_Prediction","16_AS_SA12_UnitMeaning","16_AS_SA3_Complete"))
> sample.data
Grade ItemID common.names
1 4 15 15_AS_SA1_Correct
2 4 15 15_AS_SA10_Correct
3 4 15 15_AS_SA2_Correct
4 4 15 15_AS_SA3_Correct
5 3 17 17_AS_2_B2
6 3 17 17_AS_2_B1
7 3 17 17_AS_5_C1
8 3 17 17_AS_4_D1
9 3 16 16_AS_SA1_Negative
10 3 16 16_AS_SA11_Prediction
11 3 16 16_AS_SA12_UnitMeaning
12 3 16 16_AS_SA3_Complete
I need to order by Grade and ItemID, then by the common.names variable, which contains alphanumeric strings.
I used this:
sample.data.ordered <- sample.data %>%
arrange(Grade, ItemID,common.names)
but it did not work for the whole set: the numeric part of common.names is sorted lexicographically (e.g. SA10 before SA2).
My desired output is:
> sample.data.ordered
Grade ItemID common.names
1 3 16 16_AS_SA1_Negative
2 3 16 16_AS_SA3_Complete
3 3 16 16_AS_SA11_Prediction
4 3 16 16_AS_SA12_UnitMeaning
5 3 17 17_AS_2_B1
6 3 17 17_AS_2_B2
7 3 17 17_AS_4_D1
8 3 17 17_AS_5_C1
9 4 15 15_AS_SA1_Correct
10 4 15 15_AS_SA2_Correct
11 4 15 15_AS_SA3_Correct
12 4 15 15_AS_SA10_Correct
Any thoughts?
Thanks!
A base R solution using order, with a more involved treatment of common.names: gsub with a regular expression and multiple backreferences extracts the numbers from the strings, so the column can be ordered numerically:
sample.data[order(sample.data$Grade,
sample.data$ItemID,
as.numeric(gsub(".*(SA|AS_)(\\d+)_(\\w)?(\\d)?.*", "\\2\\4", sample.data$common.names))),]
Grade ItemID common.names
9 3 16 16_AS_SA1_Negative
12 3 16 16_AS_SA3_Complete
10 3 16 16_AS_SA11_Prediction
11 3 16 16_AS_SA12_UnitMeaning
6 3 17 17_AS_2_B1
5 3 17 17_AS_2_B2
8 3 17 17_AS_4_D1
7 3 17 17_AS_5_C1
1 4 15 15_AS_SA1_Correct
3 4 15 15_AS_SA2_Correct
4 4 15 15_AS_SA3_Correct
2 4 15 15_AS_SA10_Correct
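If you would rather stay in dplyr, the same numeric sort key should work inside arrange (a sketch reusing the regular expression above):
library(dplyr)
sample.data %>%
  arrange(Grade, ItemID,
          as.numeric(gsub(".*(SA|AS_)(\\d+)_(\\w)?(\\d)?.*", "\\2\\4", common.names)))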
I have coordinates for each site and the year each site was sampled (fake dataframe below).
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
> dfA
LAT LONG YEAR
1 1 11 2001
2 2 12 2002
3 3 13 2003
4 4 14 2004
5 5 15 2005
6 1 11 2006
7 2 12 2007
8 3 13 2008
9 4 14 2009
10 5 15 2010
11 1 11 2011
12 2 12 2012
13 3 13 2013
14 4 14 2014
15 5 15 2015
16 1 16 2016
17 2 17 2017
18 3 18 2018
19 4 19 2019
20 5 20 2020
I'm trying to pull out the year each unique location was sampled. So I first pulled out each unique location and the number of times it was sampled, using the following code:
dfB <- dfA %>%
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB)
LAT LONG Freq
1 1 11 3
2 1 16 1
3 2 12 3
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
I am now trying to get the year for each unique location. I.e. I ultimately want this:
LAT LONG Freq . Year
1 1 11 3 . 2001,2006,2011
2 1 16 1 . 2016
3 2 12 3 . 2002,2007,2012
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
This is what I've tried:
1) Find which rows in dfA that corresponds with dfB:
dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))
> idx
[1] 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 2 4 6 8 10
So idx[1] means row 1 of dfA matches row 1 of dfB, and rows 6 and 11 of dfA also match row 1 of dfB.
I've tried this to extract info:
for (row in 1:20){
year<-as.character(dfA$YEAR[row])
tmp<-dfB$obs_Year[idx[row]]
if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-year
}
if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
}
}
I keep getting this error code:
number of items to replace is not a multiple of replacement length
Does anyone know how to extract the years for the matching pairs of dfA and dfB? I don't know if this is the most efficient approach, but this is as far as I've gotten. Thanks in advance!
You can do this with a dplyr chain that first builds your year column and then filters down to only unique observations.
The logic is to build the year variable by grouping your data by location, and then pasting all the years for a given location into a single string variable which we call year_string. We then also compute the frequency, but this is not strictly necessary.
The only column in your data that varies over time is YEAR, so once we exclude that column (year_string and Freq are constant within a location) the rows for a given location are exact duplicates. We therefore drop YEAR and ask R to return the unique() values of the data.frame. Where a location occurs several times, unique() keeps one of the rows; since they are identical, that doesn't matter.
Code below:
library(dplyr)
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
# We assign the output to dfB
dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
mutate( # The mutate verb is for building new variables.
year_string = paste(YEAR, collapse = ","), # the function paste()
# collapses the vector YEAR into a string
# the argument collapse = "," says to
# separate each element of the string with a comma
Freq = n()) %>% # I compute the frequency as you did
select(LAT, LONG, Freq, year_string) %>%
# Now I select only the columns that index
# location, frequency and the combined years
unique() # Now I filter for only unique observations. Since I have not picked
# YEAR in the select function only unique locations are retained
dfB
#> # A tibble: 10 x 4
#> # Groups: LAT, LONG [10]
#> LAT LONG Freq year_string
#> <int> <int> <int> <chr>
#> 1 1 11 3 2001,2006,2011
#> 2 2 12 3 2002,2007,2012
#> 3 3 13 3 2003,2008,2013
#> 4 4 14 3 2004,2009,2014
#> 5 5 15 3 2005,2010,2015
#> 6 1 16 1 2016
#> 7 2 17 1 2017
#> 8 3 18 1 2018
#> 9 4 19 1 2019
#> 10 5 20 1 2020
Created on 2019-01-21 by the reprex package (v0.2.1)
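If you don't need to keep any other columns, a shorter variant of the same idea (a sketch) does everything in summarise, which avoids the mutate/select/unique() round trip:
library(dplyr)
dfA %>%
  group_by(LAT, LONG) %>%
  summarise(Freq = n(),
            year_string = paste(YEAR, collapse = ","))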
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went. To check this, I want to extract a sample of n pairs of cases that are grouped together by variable y, and n pairs of cases that are not grouped together by variable y, in order to count the false positives and false negatives (pairs falsely grouped or falsely separated). How do I extract a sample of grouped pairs and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df), nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from each subset x that contains more than one row; rbinding the results gives the grouped df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped df.ungrouped.
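If you use dplyr, a rough equivalent is sketched below (assuming dplyr >= 1.0, where slice_sample() silently truncates n to the group size):
library(dplyr)
set.seed(2017)
# grouped sample: only y values shared by more than one row
df.grouped2 <- df %>%
  group_by(y) %>%
  filter(n() > 1) %>%
  slice_sample(n = 3) %>%
  ungroup()
# ungrouped sample: the same number of rows drawn from the whole data
df.ungrouped2 <- df %>% slice_sample(n = nrow(df.grouped2))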
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
I am an R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid ids close to the store grid id.
I.e. basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been to try different things like merge, sapply, etc., but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e. for Store1 it is the sum of TOT_P over the grid fields X = [2-4] and Y = [5-7]
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (df.pop is assumed to hold the OP's grid table
#from the question, with columns TOT_P, GridX and GridY)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
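Since you mentioned sapply: the SUMIF condition also translates directly into base R. A sketch, assuming df.store and df.pop as defined above:
df.store$SUM_P <- sapply(seq_len(nrow(df.store)), function(i) {
  # keep grid cells within +/- 1 of the store's coordinates
  keep <- abs(df.pop$GridX - df.store$StoreX[i]) <= 1 &
          abs(df.pop$GridY - df.store$StoreY[i]) <= 1
  sum(df.pop$TOT_P[keep])
})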
I'm using the aggregate function to calculate the difference between two variables for every observation (and I want to save the result as a new variable), so something like this:
data1
Group Points_Attempt1 Points_Attempt2
1 1 10 5
2 1 34 23
3 1 50 5
4 1 10 12
5 2 11 21
6 2 23 23
7 2 32 10
8 2 12 10
I'm able to do something like this:
aggregate(data1[c("Points_Attempt1","Points_Attempt2")], list(data1$Group), diff)
But I want it for every single observation, and I do not know how to select the observations, i.e. the row numbers (here 1-8).
So I'm searching for the following fourth column (Difference), which I would then like to save as a new variable:
Group Points_Attempt1 Points_Attempt2 Difference
1 1 10 5 5
2 1 34 23 11
3 1 50 5 45
4 1 10 12 -2
5 2 11 21 -10
6 2 23 23 0
7 2 32 10 22
8 2 12 10 2
I would be highly thankful, if someone could help me with this.
We can use mutate_each
library(dplyr)
data1 %>%
group_by(Group) %>%
mutate_each(funs(c(NA, diff(.))), 2:3)
Or if we need to subtract between the variables,
data1 %>%
mutate(Difference = Points_Attempt1 - Points_Attempt2)
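Note that mutate_each has since been deprecated; on dplyr >= 1.0 the per-group version can be written with across (a sketch, assuming the Points columns are the ones to transform):
library(dplyr)
data1 %>%
  group_by(Group) %>%
  mutate(across(starts_with("Points"), ~ c(NA, diff(.x))))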