Aggregate and adding new column - r

I have a dataset with district name, household latitude, and longitude. The dataset has 2000 household locations. I want to calculate the mean of latitude and longitude based on district name. Next, I want to add two new columns (i.e. Lat_mean, Long_mean) in which the mean Lat and Long will be stored for each household.
I was only able to aggregate the mean values for latitude and longitude. I don't know how to attach the summarized values as new columns for each ID (see code).
id <- c(1,2,3,4,5,6)
district <- c("A", "B", "C", "A", "A", "B")
lat <- c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5)
long <- c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
df <- data.frame(id, district, lat, long)
library(dplyr)
df_group <- df %>% group_by(district) %>% summarise_at(vars(lat:long), mean)
I am expecting the following: Lat_mean and Long_mean columns are added to 'df', and each ID gets the mean values of its district.

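We can use mutate_at instead of summarise_at, which keeps every row instead of collapsing each group to one. Naming the function inside the list (here mean = mean) makes dplyr append that name as a suffix, creating the new columns lat_mean and long_mean: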
library(dplyr)
df %>%
group_by(district) %>%
mutate_at(vars(lat, long), list(mean = mean))
# A tibble: 6 x 6
# Groups: district [3]
# id district lat long lat_mean long_mean
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 A 28.6 77.5 28 77.2
#2 2 B 30.2 85.2 30.8 87
#3 3 C 35.9 66.5 35.9 66.5
#4 4 A 27.5 75 28 77.2
#5 5 A 27.9 79.2 28 77.2
#6 6 B 31.5 88.8 30.8 87
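In dplyr 1.0.0 and later, the scoped verbs such as mutate_at are superseded by across(). The following is a minimal sketch of the same grouped mutate in that style, assuming a recent dplyr is installed; the .names pattern and the df_means name are choices made here for illustration, not part of the answer above.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  district = c("A", "B", "C", "A", "A", "B"),
  lat = c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5),
  long = c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
)

# Compute the per-district mean of lat and long and keep every row;
# .names controls the suffix of the new columns
df_means <- df %>%
  group_by(district) %>%
  mutate(across(c(lat, long), mean, .names = "{.col}_mean")) %>%
  ungroup()
```

The ungroup() at the end drops the grouping so that later operations on df_means are not silently applied per district.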

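Another option is base R's ave inside mutate, which returns the group mean repeated for every row of the group:
df %>%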
mutate(lat_mean = ave(lat, district, FUN=mean),
lon_mean = ave(long, district, FUN=mean))
id district lat long lat_mean lon_mean
1 1 A 28.6 77.5 28.00 77.23333
2 2 B 30.2 85.2 30.85 87.00000
3 3 C 35.9 66.5 35.90 66.50000
4 4 A 27.5 75.0 28.00 77.23333
5 5 A 27.9 79.2 28.00 77.23333
6 6 B 31.5 88.8 30.85 87.00000
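The question's original idea, summarising first and then attaching the summary to each household, also works if the summary is joined back by district. This is a sketch in base R under that assumption; district_means and df_with_means are names introduced here.

```r
# Data from the question
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  district = c("A", "B", "C", "A", "A", "B"),
  lat = c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5),
  long = c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
)

# One row per district holding the mean latitude and longitude
district_means <- aggregate(cbind(lat, long) ~ district, data = df, FUN = mean)
names(district_means) <- c("district", "lat_mean", "long_mean")

# Join the means back onto every household row, then restore the id order
# (merge sorts its result by the join column)
df_with_means <- merge(df, district_means, by = "district")
df_with_means <- df_with_means[order(df_with_means$id), ]
```

Compared with the grouped mutate or ave, this keeps the per-district table around as a separate object, which may be handy if the means are also needed on their own.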

Related

Summing multiple observation rows in R

I have a dataset with 4 observations for 90 variables. The observations are answer to a questionnaire of the type "completely agree" to "completely disagree", expressed in percentages. I want to sum the two positive observations (completely and somewhat agree) and the two negative ones (completely and somewhat disagree) for all variables. Is there a way to do this in R?
My dataset looks like this:
Albania Andorra Azerbaijan etc.
1 13.3 18.0 14.9 ...
2 56.3 45.3 27.2 ...
3 21.3 27.2 28.0 ...
4 8.9 9.4 5.2 ...
And I want to sum rows 1+2 and 3+4 to look something like this:
Albania Andorra Azerbaijan etc.
1 69.6 63.3 42.1 ...
2 30.2 36.6 33.2 ...
I am really new to R, so I have no idea how to go about this. All answers to similar questions I found on this website and others either have character-type observations, multiple rows for the same observation (with missing data), or combine all the rows into just one row. My problem falls into none of these categories; I just want to collapse some of the observations.
Since you only have four rows, it's probably easiest to just add the first two rows together and the second two rows together. You can use rbind to stick the two resulting rows together into the desired data frame:
rbind(df[1,] + df[2, ], df[3,] + df[4,])
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 3 30.2 36.6 33.2
Data taken from question
df <- structure(list(Albania = c(13.3, 56.3, 21.3, 8.9), Andorra = c(18,
45.3, 27.2, 9.4), Azerbaijan = c(14.9, 27.2, 28, 5.2)), class = "data.frame",
row.names = c("1", "2", "3", "4"))
Another option is to sum every two rows with rowsum, using gl with k = 2 to build the grouping factor, as in the following code:
rowsum(df, gl(n = nrow(df), k = 2, length = nrow(df)))
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 2 30.2 36.6 33.2
Created on 2023-01-06 with reprex v2.0.2
Using dplyr
library(dplyr)
df %>%
group_by(grp = gl(n(), 2, n())) %>%
summarise(across(everything(), sum))
Output:
# A tibble: 2 × 4
grp Albania Andorra Azerbaijan
<fct> <dbl> <dbl> <dbl>
1 1 69.6 63.3 42.1
2 2 30.2 36.6 33.2
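When the rows to be combined are not simply consecutive pairs, rowsum can take an explicit grouping vector instead of one generated by gl. A small sketch, assuming the four rows are ordered from "completely agree" to "completely disagree" as the question describes; the labels "agree" and "disagree" and the names response and collapsed are invented here.

```r
# Data from the question
df <- data.frame(
  Albania = c(13.3, 56.3, 21.3, 8.9),
  Andorra = c(18, 45.3, 27.2, 9.4),
  Azerbaijan = c(14.9, 27.2, 28, 5.2)
)

# One label per row: rows sharing a label are summed together
response <- c("agree", "agree", "disagree", "disagree")

# rowsum adds up the rows within each label and uses the labels as row names
collapsed <- rowsum(df, group = response)
```

The labels end up as the row names of collapsed, so the result documents which rows were merged.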

Merge two datasets but one of them is year_month and the other is year_month_week

I am practicing data merging in R. Here are two simple data frames, df1 and df2.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month=c(202205,202206,202207,202204,202205,202206,202207),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 202205 in year_month column means May 2022.
For df2, 2022052 in year_month_week column means 2nd week of May, 2022.
I want to merge df1 and df2 with respect to year_month_week, so that all rows of df2 are kept and values from df1 are repeated where needed.
For example, 202205 in year_month covers both 2022052 and 2022053. There is no points column in df2, so the value 65 from df1 is copied to both rows. My expected output looks like this:
df<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3),
points=c(65,65,58,21,25,25))
Create a temporary year_month column in df2 by taking the first six characters of year_month_week, then do a left join on df1 by year_month and id before removing the temporary column.
Using tidyverse, we could do this as follows:
library(tidyverse)
df2 %>%
mutate(year_month = as.numeric(substr(year_month_week, 1, 6))) %>%
left_join(df1, by = c('year_month', 'id')) %>%
select(-year_month)
#> id year_month_week temperature points
#> 1 1 2022052 36.1 65
#> 2 1 2022053 36.3 65
#> 3 1 2022061 36.6 58
#> 4 2 2022043 34.3 21
#> 5 2 2022051 34.9 25
#> 6 2 2022052 35.3 25
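Or in base R using merge (all.x = TRUE keeps any df2 rows without a match in df1, mirroring the left join above):
df2$year_month <- substr(df2$year_month_week, 1, 6)
merge(df2, df1, by = c('year_month', 'id'), all.x = TRUE)[-1]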
#> id year_month_week temperature points
#> 1 2 2022043 34.3 21
#> 2 1 2022052 36.1 65
#> 3 1 2022053 36.3 65
#> 4 2 2022051 34.9 25
#> 5 2 2022052 35.3 25
#> 6 1 2022061 36.6 58
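Because the week is always the last digit of year_month_week in this data, the month key can also be derived arithmetically, which keeps it numeric like df1$year_month. This is a sketch under that single-digit-week assumption; the merged name and the final reordering are choices made here.

```r
# Data from the question
df1 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2),
  year_month = c(202205, 202206, 202207, 202204, 202205, 202206, 202207),
  points = c(65, 58, 47, 21, 25, 27, 43)
)
df2 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  year_month_week = c(2022052, 2022053, 2022061, 2022043, 2022051, 2022052),
  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3)
)

# Integer division by 10 drops the trailing week digit: 2022052 -> 202205
df2$year_month <- df2$year_month_week %/% 10

# Left join: keep every row of df2 and copy the matching points from df1
merged <- merge(df2, df1, by = c("id", "year_month"), all.x = TRUE)

# Restore a readable order and drop the temporary key
merged <- merged[order(merged$id, merged$year_month_week),
                 c("id", "year_month_week", "temperature", "points")]
```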

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach: filter for the required parties and then summarise. After that, you can reshape to wide format so that both parties appear side by side for each date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
# Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv', stringsAsFactors = FALSE)
# Code
newdf <- df %>%
filter(party %in% c('R', 'D')) %>%
group_by(termstart, party) %>%
summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
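The same filter, group and reshape steps can be sketched in base R with tapply, which returns the wide layout directly when given two grouping vectors. The toy data frame below is invented for illustration and its values are not taken from the congress-terms file; only the column names termstart, party and age follow the answer above.

```r
# Toy data with made-up ages (NOT the real congress-terms values)
toy <- data.frame(
  termstart = c("1947-01-03", "1947-01-03", "1947-01-03",
                "1949-01-03", "1949-01-03", "1949-01-03"),
  party = c("D", "R", "D", "R", "D", "I"),
  age = c(50, 60, 54, 58, 48, 70),
  stringsAsFactors = FALSE
)

# Keep only the two parties of interest
major <- toy[toy$party %in% c("D", "R"), ]

# Mean age per term start (rows) and party (columns)
mean_age <- tapply(major$age, list(major$termstart, major$party),
                   mean, na.rm = TRUE)
```

The result is a matrix with one row per term start and one column per party, so mean_age["1947-01-03", "D"] looks up a single cell.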

Compare one group to the rest of the groups as a whole in R

Here is some sample data:
movie_df <- data.frame("ID" = c(1,2,3,4,5,6,7,8,9,10),
"movie_type" = c("Action", "Horror", "Comedy", "Thriller", "Comedy",
"Action","Thriller", "Horror", "Action", "Comedy"),
"snack_type" = c("Chocolate", "Popcorn", "Candy", "Popcorn", "Popcorn",
"Candy","Chocolate", "Candy", "Popcorn", "Chocolate"),
"event_type" = c("Solo", "Family", "Date", "Friends", "Solo",
"Family","Date", "Date", "Friends", "Friends"),
"total_cost" = c(50, 35, 20, 50, 30,
60, 25, 35, 20, 50))
What I want to do is go through each column and compare each group to the rest of the groups on total_cost. For example, I want to see how movie_type == 'Action' compares to movie_type != 'Action' for total_cost. I want to do that for every type in movie_type then every type in snack_type and event_type.
What I ultimately want is the table below, where sd is the standard deviation. Ideally this would be done with a tidyverse method in R (e.g. dplyr or tidyr):
> results_df
# A tibble: 11 x 11
Group Grp_1 Grp_2 Grp_1_mean Grp_2_mean Grp_1_sd Grp_2_sd Grp_1_n Grp_2_n Mean_Diff `t-test`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 movie_type Action Rest of group 43.3 35 20.8 11.5 3 7 8.33 2.84
2 movie_type Horror Rest of group 35 38.1 0 16.0 2 8 -3.12 -2.21
3 movie_type Thriller Rest of group 37.5 37.5 17.7 14.6 2 8 0 0
4 movie_type Comedy Rest of group 33.3 39.3 15.3 14.6 3 7 -5.95 -2.22
5 snack_type Chocolate Rest of group 41.7 35.7 14.4 14.8 3 7 5.95 2.26
6 snack_type Candy Rest of group 38.3 37.1 20.2 12.9 3 7 1.19 0.407
7 snack_type Popcorn Rest of group 33.8 40 12.5 15.8 4 6 -6.25 -2.60
8 event_type Date Rest of group 26.7 42.1 7.64 14.1 3 7 -15.5 -7.25
9 event_type Family Rest of group 47.5 35 17.7 13.4 2 8 12.5 3.86
10 event_type Friends Rest of group 40 36.4 17.3 14.1 3 7 3.57 1.28
11 event_type Solo Rest of group 40 36.9 14.1 15.1 2 8 3.12 1.04
This uses the same logic as Daniel's answer, with purrr::map and purrr::map2.
library(dplyr)
library(tibble)
library(purrr)
library(stringr)
needed_cols <- c("movie_type", "snack_type", "event_type")
new_names <- 1:2 %>%
map(~str_c(c("group", "mean", "sd", "n"), "_", .x)) %>%
unlist()
my_data <- needed_cols %>%
map(function(df_c)
map(unique(movie_df[[df_c]]),
function(v){
df <- movie_df %>%
mutate(group = ifelse(get(df_c) == v, v, "rest_of_group")) %>%
group_by(group) %>%
summarize(mean = mean(total_cost), sd = sd(total_cost), n = n()) %>%
.[match(.$group, c(v, "rest_of_group")),]
df <- bind_cols(df[1, ], df[2,])
names(df) <- new_names
df
}
)
) %>%
map2(needed_cols, ~bind_rows(.x) %>% mutate(group = .y)) %>%
bind_rows() %>%
select(
str_subset(names(.), "group") %>% sort(),
str_subset(names(.), "mean"),
str_subset(names(.), "sd"),
str_subset(names(.), "n")
) %>%
mutate(mean_diff = mean_1 - mean_2)
Sorry, it's not in pipes, but in base R we can do:
results_df <- do.call(rbind,unlist(
apply(movie_df[,2:4],2,function(u)
lapply(unique(u), function(x)
data.frame(
group1 = as.character(x),
group2 = "rest",
grp1_mean = mean(movie_df$total_cost[u == x]),
grp2_mean = mean(movie_df$total_cost[u != x]),
grp1_sd = sd(movie_df$total_cost[u == x]),
grp2_sd = sd(movie_df$total_cost[u != x])
)
)
),recursive=F)
)
#add mean differences
results_df$meandiff <- with(results_df, grp1_mean - grp2_mean)
> results_df
group1 group2 grp1_mean grp2_mean grp1_sd grp2_sd meandiff
movie_type1 Action rest 43.33333 35.00000 20.816660 11.54701 8.333333
movie_type2 Horror rest 35.00000 38.12500 0.000000 16.02175 -3.125000
movie_type3 Comedy rest 33.33333 39.28571 15.275252 14.55695 -5.952381
movie_type4 Thriller rest 37.50000 37.50000 17.677670 14.63850 0.000000
snack_type1 Chocolate rest 41.66667 35.71429 14.433757 14.84042 5.952381
snack_type2 Popcorn rest 33.75000 40.00000 12.500000 15.81139 -6.250000
snack_type3 Candy rest 38.33333 37.14286 20.207259 12.86375 1.190476
event_type1 Solo rest 40.00000 36.87500 14.142136 15.10381 3.125000
event_type2 Family rest 47.50000 35.00000 17.677670 13.36306 12.500000
event_type3 Date rest 26.66667 42.14286 7.637626 14.09998 -15.476190
event_type4 Friends rest 40.00000 36.42857 17.320508 14.05770 3.571429
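Neither answer above fills in the t-test column from the desired output. One way to add a test statistic is base R's t.test, sketched here for movie_type only. The question does not say which test produced its `t-test` column, so the Welch statistic computed below is an assumption and need not reproduce those numbers; compare_to_rest is a hypothetical helper name.

```r
# Relevant columns from the question's sample data
movie_df <- data.frame(
  movie_type = c("Action", "Horror", "Comedy", "Thriller", "Comedy",
                 "Action", "Thriller", "Horror", "Action", "Comedy"),
  total_cost = c(50, 35, 20, 50, 30, 60, 25, 35, 20, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: compare one level against all remaining rows
compare_to_rest <- function(values, labels, level) {
  in_group <- labels == level
  # t.test defaults to Welch's unequal-variance two-sample test
  tt <- t.test(values[in_group], values[!in_group])
  data.frame(
    group = level,
    mean_diff = mean(values[in_group]) - mean(values[!in_group]),
    t_stat = unname(tt$statistic),
    p_value = tt$p.value
  )
}

# One row per movie type, each compared with the rest
movie_results <- do.call(rbind, lapply(
  unique(movie_df$movie_type),
  function(lv) compare_to_rest(movie_df$total_cost, movie_df$movie_type, lv)
))
```

Looping the same helper over snack_type and event_type would cover the other grouping columns. With groups of two or three rows, as in the sample data, the resulting p-values carry little weight.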

R: using customised function in dplyr

Sample data:
library(tidyverse)
set.seed(123)
dat <- tibble(
year = rep(1980:2015, each = 100),
day = rep(200:299, times = 36),
rain = sample(0:17, size = 100*36,replace = T),
PETc = sample(rnorm(100*36)),
ini.t = rep(10:45, each = 100 ))
I have a function that operates on a data frame:
my.func <- function(df, initial, thres, upper.limit){
df$paw <- rep(NA, nrow(df))
df$aetc <- rep(NA, nrow(df))
df$sw <- rep(NA, nrow(df))
for(n in 1:nrow(df)){
df$paw[n] <- df$rain[n] + initial
df$aetc[n] <- ifelse(df$paw[n] >= thres, df$PETc[n], (df$paw[n]/thres) * df$PETc[n])
df$aetc[n] <- ifelse(df$aetc[n] > df$paw[n], df$paw[n], df$aetc[n])
df$sw[n] <- initial + df$rain[n] - df$aetc[n]
df$sw[n] <- ifelse(df$sw[n] > upper.limit,upper.limit,ifelse(df$sw[n] < 0, 0,df$sw[n]))
initial <- df$sw[n]
}
return(df)
}
thres <- 110
upper.limit <- 200
Applying the above function for a single year:
dat.1980 <- dat[dat$year == 1980,]
my.func(dat.1980, initial = dat.1980$ini.t[1], thres, upper.limit)
How do I apply this function to each year? I thought of using dplyr, along these lines:
dat %>% group_by(year) %>% # run my function on each year
Also, since there are 36 years, 36 data frames will be returned. How do I bind these data frames row-wise?
You were on the right track. dplyr's do lets you apply a function to each group and binds the results row-wise:
dat %>%
group_by(year) %>%
do(my.func(., initial = head(.$ini.t, 1), thres, upper.limit))
# Groups: year [36]
# year day rain PETc ini.t paw aetc sw
# <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 1980 200 5 0.968 10 15.0 0.132 14.9
# 2 1980 201 14 0.413 10 28.9 0.108 28.8
# 3 1980 202 7 -0.912 10 35.8 -0.296 36.1
# 4 1980 203 15 -0.337 10 51.1 -0.156 51.2
# 5 1980 204 16 0.412 10 67.2 0.252 67.0
# 6 1980 205 0 -0.923 10 67.0 -0.562 67.5
# 7 1980 206 9 1.17 10 76.5 0.813 75.7
# 8 1980 207 16 0.0542 10 91.7 0.0452 91.7
# 9 1980 208 9 -0.293 10 101 -0.268 101
# 10 1980 209 8 0.0788 10 109 0.0781 109
# ... with 3,590 more rows
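do() is superseded as of dplyr 1.0.0. One way to express the same per-group call in newer code is group_modify(), sketched below on a toy data set so the example stays short. toy_func is a simplified stand-in for my.func (it only carries a running total forward from the group's initial value), and toy_dat is invented here.

```r
library(dplyr)

# Simplified stand-in for my.func: running total that starts at `initial`
toy_func <- function(df, initial) {
  df$sw <- initial + cumsum(df$rain)
  df
}

# Invented data: two years, three days each
toy_dat <- data.frame(
  year = rep(c(1980, 1981), each = 3),
  rain = c(1, 2, 3, 4, 5, 6),
  ini.t = rep(c(10, 20), each = 3)
)

# .x is the data for one year (without the year column);
# the per-year results are bound row-wise with year added back
by_year <- toy_dat %>%
  group_by(year) %>%
  group_modify(~ toy_func(.x, initial = .x$ini.t[1])) %>%
  ungroup()
```

For the function in the question, the call inside group_modify would presumably become my.func(.x, initial = .x$ini.t[1], thres, upper.limit), since my.func does not use the year column.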
The purrr::map functions are the method du jour, but I think in this case it's a stylistic choice.
We can split by 'year' and then use map_df to apply my.func to each of the split datasets in the list and bind the results row-wise:
library(purrr)
dat %>%
split(.$year) %>%
map_df(~my.func(.x, initial = .x$ini.t[1], thres, upper.limit))
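The split-apply-combine pattern behind both answers can also be written without any packages. This is a minimal base R sketch on invented data; toy_func is a simplified stand-in for my.func that only carries a running total forward from each group's initial value.

```r
# Simplified stand-in for my.func: running total that starts at `initial`
toy_func <- function(df, initial) {
  df$sw <- initial + cumsum(df$rain)
  df
}

# Invented data: two years, three days each
toy_dat <- data.frame(
  year = rep(c(1980, 1981), each = 3),
  rain = c(1, 2, 3, 4, 5, 6),
  ini.t = rep(c(10, 20), each = 3)
)

# Split into one data frame per year
pieces <- split(toy_dat, toy_dat$year)

# Apply the function to each piece, using that year's first ini.t value
results <- lapply(pieces, function(d) toy_func(d, initial = d$ini.t[1]))

# Bind the per-year results row-wise into a single data frame
combined <- do.call(rbind, results)
rownames(combined) <- NULL
```

The intermediate results list holds one data frame per year, which answers the second part of the question: do.call(rbind, ...) is the step that stacks them.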

Resources