I'd like to create a new variable called POPULATION that contains the sum of the values of the variable P1, grouped by the variable CODASC. It seemed easy at first, but I'm struggling with it. Since I have to do this for many variables and for several datasets, I really need a quick way of doing it! If anyone can help me, I would really appreciate it!
Many thanks,
Ilaria
My data frame looks like this:
PROCOM SEZ2011 SEZ CODASC P1 P47 P62 P131 E1 E3 ST15 A46
<int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 48017 480000000000 60001 4 251 25 9 20 70 40 19 20
2 48017 480000000000 60002 3 15 1 0 1 4 4 0 3
3 48017 480000000000 60003 2 20 7 2 1 1 1 1 1
4 48017 480000000000 60004 3 253 21 4 10 63 40 49 22
5 48017 480000000000 60005 5 3 0 1 0 1 1 0 2
6 48017 480000000000 60006 1 161 19 7 5 27 17 26 13
And my code looks like this:
df <- df %>%
group_by(CODASC) %>%
mutate(POPULATION = sum(P1, na.rm = TRUE))
To apply sum within a group across multiple variables you could do, as an example:
library(dplyr)
df %>%
group_by(CODASC) %>%
mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>%
ungroup()
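If you also want na.rm = TRUE, as in your original mutate() call, you could pass an anonymous function instead (a sketch, assuming dplyr >= 1.0):
library(dplyr)

df %>%
  group_by(CODASC) %>%
  mutate(across(P1:last_col(), ~ sum(.x, na.rm = TRUE), .names = "{.col}_sum")) %>%
  ungroup()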
To apply this across multiple data frames (if you're grouping by the same variable and summing the same columns), you can iterate through them easily if they're in a list, using the purrr package:
library(purrr)
library(dplyr)
l <- list(df, df, df)
map(l, ~ .x %>%
group_by(CODASC) %>%
mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>%
ungroup())
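If your data frames live in separate objects, you could also name the list so the results stay identifiable (the names below are purely illustrative):
library(dplyr)
library(purrr)

# "area_a" and "area_b" are placeholder names for your own data frames
l <- list(area_a = df, area_b = df)

results <- map(l, ~ .x %>%
  group_by(CODASC) %>%
  mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>%
  ungroup())

names(results)  # "area_a" "area_b"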
Your code looks like it does what you want; are you just looking for a way to streamline it across multiple columns?
It looks like your first four columns are identifiers. If you want to summarise all the remaining columns, you can do something like:
df <- df %>%
group_by(PROCOM, SEZ2011, SEZ, CODASC) %>%
summarise_all(sum) ## or whatever function you want here
See https://dplyr.tidyverse.org/reference/summarise_all.html for more details on summarise_all() and summarise_at().
If you want to create a function to apply to many datasets, check out writing your own functions (https://swcarpentry.github.io/r-novice-inflammation/02-func-R/) and the apply family of functions.
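As a rough sketch of that idea (assuming every dataset has the same identifier columns and you want to sum all remaining columns; summarise_by_group is just an illustrative name):
library(dplyr)

# Sum every non-identifier column within each group of a single data frame
summarise_by_group <- function(dat) {
  dat %>%
    group_by(PROCOM, SEZ2011, SEZ, CODASC) %>%
    summarise_all(sum)
}

datasets <- list(df, df)               # put your data frames in a list
results  <- lapply(datasets, summarise_by_group)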
I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp<=0))
but it only returns the first value, and not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- cbind(season,temp,ID)
Ideally I want a table that looks like this from the above dummy code:
table
season id_firstnegative
[1,] 1 2
[2,] 2 4
[3,] 3 8
[4,] 4 10
[5,] 5 13
A base R option using subset and aggregate
aggregate(ID ~ season, subset(df, temp < 0), head, 1)
# season ID
#1 1 2
#2 2 4
#3 3 8
#4 4 10
#5 5 13
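One caveat: since df was created with cbind() in the question, it is a matrix rather than a data frame, so it is safer to convert first (or build it with data.frame()) before using the formula interface:
df <- data.frame(season, temp, ID)   # a proper data frame
aggregate(ID ~ season, subset(df, temp < 0), head, 1)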
library(dplyr)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<-as.data.frame(cbind(season,temp,ID))
df %>%
dplyr::filter(temp < 0) %>%
group_by(season) %>%
dplyr::filter(row_number() == 1) %>%
ungroup()
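If you only want the two-column table from your example, a small variation could end in summarise() instead of keeping every column:
library(dplyr)

df %>%
  filter(temp < 0) %>%
  group_by(season) %>%
  summarise(id_firstnegative = first(ID))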
As you said, I believe you could solve this by simply grouping by season and taking the first ID whose temperature is below zero within each group. However, the ordering of your data matters, so ensure that each season is ordered correctly before using this possible solution (a sketch using arrange() follows the output below).
library(dplyr)
library(tibble)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<- tibble(season,temp,ID)
df <- df %>%
group_by(season) %>%
mutate(firstNeg = ID[which(temp<0)][1]) %>%
distinct(season, firstNeg) # Combine only unique values of these columns for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups: season [5]
season firstNeg
<dbl> <dbl>
1 1 2
2 2 4
3 3 8
4 4 10
5 5 13
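If you are not sure the rows are already in time order within each season, you could sort first, for example (assuming ID reflects the time order):
df %>%
  arrange(season, ID) %>%               # ensure rows are ordered within each season
  group_by(season) %>%
  mutate(firstNeg = ID[which(temp < 0)][1]) %>%
  distinct(season, firstNeg)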
I have a dataset with four columns, as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use a dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with the sum of valueC for each group.
The result should look like this:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120).
How do I include valueC only once per unique value of idB? Is there anything else I can use instead of group_by in the dplyr pipe?
I can't use summarise, as I need to keep the values in otherD intact for later use.
The other option I have is to create newCol separately through SQL and then merge it with a left join, but I am looking for a better inline solution.
If this has been answered before, please refer me to the link, as I could not find a relevant answer to this issue.
We need unique() with match():
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
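An equivalent way to keep just one valueC per idB within each group, which some may find more readable, is !duplicated():
library(dplyr)

data %>%
  group_by(grpA) %>%
  mutate(ind = sum(valueC[!duplicated(idB)]))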
Or, another option is to get the distinct rows by 'grpA' and 'idB', group by 'grpA', get the sum of 'valueC', and then left_join the result with the original data:
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
I have the following data in R:
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
table1<-table1 %>% arrange(price) %>% split(.$group)
$`1`
group price visits
1 1 10 100
3 1 20 200
5 1 30 300
$`2`
group price visits
2 2 10 150
4 2 20 250
6 2 30 350
group_1<-data.frame(case_1=c(0.2,0.3,0.4),case_2=c(0.22,0.33,0.44))
group_2<-data.frame(case_1=c(0.3,0.4,0.5),case_2=c(0.33,0.44,0.55))
So, the question is: how can I do the following operation without repeating it four times? I suppose an apply function, or something similar, would suit better.
sum(table1$`1`[,c("group")] * group_1[,c("case_1")])
sum(table1$`1`[,c("group")] * group_1[,c("case_2")])
sum(table1$`2`[,c("group")] * group_2[,c("case_1")])
sum(table1$`2`[,c("group")] * group_2[,c("case_2")])
After going through the data you have provided step by step and understanding what you are trying to do, here is a suggestion using mapply:
group_list <- list(group_1, group_2)
mapply(function(x, y) colSums(x * y), split(table1$group, table1$group), group_list)
# 1 2
#case_1 0.90 2.40
#case_2 0.99 2.64
We put the group data frames in one list, say group_list, split table1's group column by group, perform the multiplication between them using mapply, and take the column-wise sums. If I have understood you correctly, this is what you needed; let me know if it is otherwise.
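Note that this assumes table1 is still the original (unsplit) data frame. If you have already run the split() shown above, table1 is a list of data frames, so you could pull the group vectors out of the list instead, for example:
# table1 is now a named list of data frames ("1", "2"); extract each group
# column and pair it with the matching group_x data frame
group_vectors <- lapply(table1, `[[`, "group")
mapply(function(x, y) colSums(x * y), group_vectors, group_list)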
Based on the initial dataset, we can do this using group_by operations
library(tidyverse)
bind_rows(group_1, group_2) %>%
bind_cols(table1['group'], .) %>%
mutate(case_1 = group*case_1, case_2 = group*case_2) %>%
group_by(group) %>%
summarise(across(everything(), sum))
# A tibble: 2 × 3
# group case_1 case_2
# <dbl> <dbl> <dbl>
#1 1 0.9 0.99
#2 2 2.4 2.64
data
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
I have a data set like this:
df <- data.frame(situation1=rnorm(30),
situation2=rnorm(30),
situation3=rnorm(30),
models=c(rep("A",10), rep("B",10), rep("C", 10)))
where I compare three models (A,B,C) in three situations. I have 10 measurements for each model.
I now want to summarise this into ranks, i.e. how often each model wins in each situation. A win is defined by the highest value.
A final output could be something like this:
model situation1 situation2 situation3
A 4 3 3
B 7 1 2
C 1 4 5
In base R:
table(df$models,colnames(df[-4])[max.col(df[-4])])
# situation1 situation2 situation3
# A 2 4 4
# B 4 5 1
# C 2 4 4
Results may change from your OP, since you didn't set a seed.
Here is an option using data.table
library(data.table)
setDT(df)[, lapply(Map(`==`, .SD, list(do.call(pmax, .SD))), sum), models]
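The one-liner is compact; roughly the same logic spelled out step by step might look like this (a sketch):
library(data.table)

dt <- as.data.table(df)
dt[, {
  row_max <- do.call(pmax, .SD)                    # row-wise max over the situation columns
  lapply(.SD, function(col) sum(col == row_max))   # count how often each column hits that max
}, by = models]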
Here's a dplyr option: compute the row-wise maximum first, then compare each situation column against it and count the TRUEs per model:
df %>%
  group_by(models) %>%
  mutate(row_max = pmax(situation1, situation2, situation3)) %>%
  mutate_at(vars(starts_with("situation")), ~ . == row_max) %>%
  select(-row_max) %>%
  summarise_all(sum)
Or possibly a little more efficient:
df %>%
  mutate(row_max = pmax(situation1, situation2, situation3)) %>%
  mutate_at(vars(starts_with("situation")), ~ . == row_max) %>%
  select(-row_max) %>%
  group_by(models) %>%
  summarise_all(sum)
# A tibble: 3 × 4
# models situation1 situation2 situation3
# <chr> <int> <int> <int>
#1 A 3 3 3
#2 B 3 5 1
#3 C 6 1 2
If you're looking for the minimum, use pmin instead of pmax. And in case there may be NAs, use the na.rm-argument in pmax/pmin.
Final note: the result doesn't match OP's because the sample data was generated without setting a seed.
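The scoped verbs (mutate_at()/summarise_all()) are superseded in current dplyr; an equivalent sketch using across() might look like:
library(dplyr)

df %>%
  mutate(row_max = pmax(situation1, situation2, situation3)) %>%
  mutate(across(starts_with("situation"), ~ .x == row_max)) %>%
  select(-row_max) %>%
  group_by(models) %>%
  summarise(across(everything(), sum))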