Summing multiple observation rows in R

I have a dataset with 4 observations for 90 variables. The observations are answers to a questionnaire of the type "completely agree" to "completely disagree", expressed as percentages. I want to sum the two positive observations (completely and somewhat agree) and the two negative ones (completely and somewhat disagree) for all variables. Is there a way to do this in R?
My dataset looks like this:
Albania Andorra Azerbaijan etc.
1 13.3 18.0 14.9 ...
2 56.3 45.3 27.2 ...
3 21.3 27.2 28.0 ...
4 8.9 9.4 5.2 ...
And I want to sum rows 1+2 and 3+4 to look something like this:
Albania Andorra Azerbaijan etc.
1 69.6 63.3 42.1 ...
2 30.2 36.6 33.2 ...
I am really new to R so I have no idea how to go about this. All answers to similar questions I found on this website and others either have character-type observations, multiple rows for the same observation (with missing data), or combine all the rows into just one row. My problem falls into none of these categories; I just want to collapse some of the observations.

Since you only have four rows, it's probably easiest to just add the first two rows together and the last two rows together. You can use rbind to stick the two resulting rows together into the desired data frame:
rbind(df[1, ] + df[2, ], df[3, ] + df[4, ])
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 3 30.2 36.6 33.2
Data (taken from the question):
df <- structure(list(Albania = c(13.3, 56.3, 21.3, 8.9), Andorra = c(18,
45.3, 27.2, 9.4), Azerbaijan = c(14.9, 27.2, 28, 5.2)), class = "data.frame",
row.names = c("1", "2", "3", "4"))

Another option is to sum every two rows with rowsum, using gl with k = 2 to build the grouping factor:
rowsum(df, gl(n = nrow(df), k = 2, length = nrow(df)))
#> Albania Andorra Azerbaijan
#> 1 69.6 63.3 42.1
#> 2 30.2 36.6 33.2
Created on 2023-01-06 with reprex v2.0.2
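The grouping factor is the only moving part here; printing it makes the pairing explicit (a quick standalone sketch, not tied to the data):

```r
# gl(n = 4, k = 2, length = 4) labels the four rows 1, 1, 2, 2,
# so rowsum() adds rows 1+2 and rows 3+4
gl(n = 4, k = 2, length = 4)
#> [1] 1 1 2 2
#> Levels: 1 2 3 4

# an equivalent grouping vector without gl()
rep(1:2, each = 2)
#> [1] 1 1 2 2
```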

Using dplyr
library(dplyr)
df %>%
  group_by(grp = gl(n(), 2, n())) %>%
  summarise(across(everything(), sum))
Output:
# A tibble: 2 × 4
grp Albania Andorra Azerbaijan
<fct> <dbl> <dbl> <dbl>
1 1 69.6 63.3 42.1
2 2 30.2 36.6 33.2

Related

R: convert numbers saved as character in a tibble to double without introducing NAs

I'd like to convert the relevant columns in the following tibble to numeric (double precision):
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <chr> <chr> <chr> <chr> <chr>
1 16.04.2021 53,64 54,12 53,64 54,12 50
2 15.04.2021 53,19 53,19 53,19 53,19 -
3 14.04.2021 53,29 53,29 53,29 53,29 -
4 13.04.2021 52,86 52,86 52,86 52,86 -
5 12.04.2021 53,17 53,17 53,17 53,17 -
6 09.04.2021 53,18 53,18 53,18 53,18 -
However, if I apply as.numeric to the relevant columns, NAs are introduced.
What is the most efficient way to convert the entries in the relevant columns to double without generating NAs?
Reproducible sample data:
df <- tribble(
~Date, ~Open, ~High, ~Low, ~Close, ~Shares,
"16.04.2021", "53,64", "54,12", "53,64", "54,12", 50,
"15.04.2021", "53,19", "53,19", "53,19", "53,19", NA,
"14.04.2021", "53,29", "53,29", "53,29", "53,29", NA,
"13.04.2021", "52,86", "52,86", "52,86", "52,86", NA,
"12.04.2021", "53,17", "53,17", "53,17", "53,17", NA,
"09.04.2021", "53,18", "53,18", "53,18", "53,18", NA
)
You can replace the comma with a dot and convert to numeric. Use lapply to apply the function to multiple columns.
df[2:5] <- lapply(df[2:5], function(x) as.numeric(sub(',', '.', x)))
Using dplyr and readr:
library(dplyr)
library(readr)
df %>%
  mutate(across(Open:Close, ~ parse_number(., locale = locale(decimal_mark = ","))))
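A quick check of what the locale argument buys you: parse_number() reads the comma as a decimal mark directly, with no manual substitution (minimal sketch on a single value):

```r
library(readr)

# declaring "," as the decimal mark makes "53,64" parse as 53.64;
# readr then defaults the grouping mark to "." to avoid a clash
parse_number("53,64", locale = locale(decimal_mark = ","))
#> [1] 53.64
```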
The reason you can't turn them into numeric values is that they use a comma as the decimal separator instead of a period. You can use the following code:
library(dplyr)
library(stringr)
df %>%
  mutate(across(Open:Close, ~ str_replace(., ",", "\\.")),
         across(Open:Close, as.numeric))
# A tibble: 6 x 6
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 NA
3 14.04.2021 53.3 53.3 53.3 53.3 NA
4 13.04.2021 52.9 52.9 52.9 52.9 NA
5 12.04.2021 53.2 53.2 53.2 53.2 NA
6 09.04.2021 53.2 53.2 53.2 53.2 NA
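A side note on the escaping: only the pattern argument is a regular expression, so in the replacement "\\." and "." both produce a literal dot, and since "," contains no regex metacharacters a fixed-string substitution works just as well. A base-R sketch on standalone values:

```r
x <- c("53,64", "54,12")

# fixed = TRUE treats pattern and replacement as plain strings
as.numeric(gsub(",", ".", x, fixed = TRUE))
#> [1] 53.64 54.12
```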
First, escape the "." in your regular expression. Second, replace the commas with "." before converting to numeric:
df %>%
  mutate(across(2:5, ~ as.numeric(gsub(",", ".", gsub("\\.", "", .)))))
Output:
Date Open High Low Close Shares
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 16.04.2021 53.6 54.1 53.6 54.1 50
2 15.04.2021 53.2 53.2 53.2 53.2 -
3 14.04.2021 53.3 53.3 53.3 53.3 -
4 13.04.2021 52.9 52.9 52.9 52.9 -
5 12.04.2021 53.2 53.2 53.2 53.2 -
6 09.04.2021 53.2 53.2 53.2 53.2 -
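If the Shares column should become numeric as well, the "-" placeholder has to be turned into NA first; a sketch assuming Shares is a character column as in the printed tibble:

```r
shares <- c("50", "-", "-")

# map the "-" placeholder to NA before converting, so as.numeric()
# raises no "NAs introduced by coercion" warning
as.numeric(replace(shares, shares == "-", NA))
#> [1] 50 NA NA
```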

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach: filter for the required parties, then summarise. After that you can reshape to wide format to get both parties on each date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv',stringsAsFactors = F)
#Code
newdf <- df %>%
  filter(party %in% c('R', 'D')) %>%
  group_by(termstart, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
  pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows

How do I return a list of IDs based on missing values of another variable?

It's been a while since I used R so apologies for asking probably such a basic question :s
I have a variable that has data in baseline, 4 months, and 12 months for the same IDs. I'm essentially trying to figure out which IDs have missing data in 4 months so I can delete those IDs from the entire dataset.
ID Baseline 4MOS 12MOS
123_ABC 53.5 NA NA
456_DEF 45.1 32.5 12.2
789_GHI 45.4 NA NA
923_JKL 88.4 11.1 23.1
734_BBB 45.4 20.1 NA
343_CHF 22.1 16.1 NA
I've gotten as far as identifying the row numbers where the 4-month data is missing:
clean <- which(is.na(df$4MONTHS))
This is code I tried afterwards to return the IDs, but it just gave me the message "Error: attempt to apply non-function":
clean <- list(df$ID(which(is.na(df$4MOS))))
Gladly appreciate any help re: this!
EDIT:
To get the IDs with NAs (here we assume all values for an ID are NA, not just any; in the latter case, use anyNA instead):
df %>%
  group_by(ID) %>%
  filter(all(is.na(X4MOS))) %>%
  pull(ID)
[1] "123_ABC" "789_GHI"
Base R (no grouping):
df[is.na(df["X4MOS"]), "ID"]
[1] "123_ABC" "789_GHI"
ORIGINAL: returns the rows where not all values are NA.
A dplyr solution:
df %>%
  group_by(ID) %>%
  filter(!all(is.na(X4MOS)))
# A tibble: 4 x 4
# Groups: ID [4]
ID Baseline X4MOS X12MOS
<chr> <dbl> <dbl> <dbl>
1 456_DEF 45.1 32.5 12.2
2 923_JKL 88.4 11.1 23.1
3 734_BBB 45.4 20.1 NA
4 343_CHF 22.1 16.1 NA
With base R (no grouping):
df[!is.na(df["X4MOS"]), ]
ID Baseline X4MOS X12MOS
2 456_DEF 45.1 32.5 12.2
4 923_JKL 88.4 11.1 23.1
5 734_BBB 45.4 20.1 NA
6 343_CHF 22.1 16.1 NA
Data:
df <- structure(list(ID = c("123_ABC", "456_DEF", "789_GHI", "923_JKL",
"734_BBB", "343_CHF"), Baseline = c(53.5, 45.1, 45.4, 88.4, 45.4,
22.1), X4MOS = c(NA, 32.5, NA, 11.1, 20.1, 16.1), X12MOS = c(NA,
12.2, NA, 23.1, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
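Since the end goal is deleting those IDs from the entire dataset, the ID vector from either answer plugs straight into a subset; a base-R sketch on the data above:

```r
df <- data.frame(
  ID = c("123_ABC", "456_DEF", "789_GHI", "923_JKL", "734_BBB", "343_CHF"),
  Baseline = c(53.5, 45.1, 45.4, 88.4, 45.4, 22.1),
  X4MOS = c(NA, 32.5, NA, 11.1, 20.1, 16.1),
  X12MOS = c(NA, 12.2, NA, 23.1, NA, NA)
)

# IDs with missing 4-month data
bad_ids <- df$ID[is.na(df$X4MOS)]
bad_ids
#> [1] "123_ABC" "789_GHI"

# drop those IDs from the whole dataset
clean <- df[!df$ID %in% bad_ids, ]
```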

Aggregate and adding new column

I have a dataset with district name, household latitude, and longitude. The dataset has 2000 household locations. I want to calculate the mean of latitude and longitude based on district name. Next, I want to add two new columns (i.e. Lat_mean, Long_mean) in which the mean Lat and Long will be stored for each household.
I was just able to aggregate the mean values for latitude and longitude. I don't know how to paste the summarized data as a new column for each ID (see code)
id <- c(1,2,3,4,5,6)
district <- c("A", "B", "C", "A", "A", "B")
lat <- c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5)
long <- c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
df <- data.frame(id, district, lat, long)
df_group <- df %>% group_by(district) %>% summarise_at(vars(lat:long), mean)
I am expecting the following: Lat_mean and Long_mean columns will be added to 'df', and each ID will get the values for its district.
We can use mutate_at instead of summarise_at. Within list(), specify a name so that a new column is created with that name as a suffix:
library(dplyr)
df %>%
  group_by(district) %>%
  mutate_at(vars(lat, long), list(mean = mean))
# A tibble: 6 x 6
# Groups: district [3]
# id district lat long lat_mean long_mean
# <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 A 28.6 77.5 28 77.2
#2 2 B 30.2 85.2 30.8 87
#3 3 C 35.9 66.5 35.9 66.5
#4 4 A 27.5 75 28 77.2
#5 5 A 27.9 79.2 28 77.2
#6 6 B 31.5 88.8 30.8 87
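mutate_at() has since been superseded; the same result in current dplyr (assuming version >= 1.0) uses across() with the .names argument controlling the suffix:

```r
library(dplyr)

df <- data.frame(
  id = 1:6,
  district = c("A", "B", "C", "A", "A", "B"),
  lat = c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5),
  long = c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
)

# one *_mean column per input column, computed within each district
df %>%
  group_by(district) %>%
  mutate(across(c(lat, long), mean, .names = "{.col}_mean")) %>%
  ungroup()
```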
df %>%
  mutate(lat_mean = ave(lat, district, FUN = mean),
         lon_mean = ave(long, district, FUN = mean))
id district lat long lat_mean lon_mean
1 1 A 28.6 77.5 28.00 77.23333
2 2 B 30.2 85.2 30.85 87.00000
3 3 C 35.9 66.5 35.90 66.50000
4 4 A 27.5 75.0 28.00 77.23333
5 5 A 27.9 79.2 28.00 77.23333
6 6 B 31.5 88.8 30.85 87.00000
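Another base-R route, useful if the per-district means are wanted as a separate summary table anyway: compute them with aggregate() and merge() them back. A sketch on the question's data; note that merge() sorts by the key, so reorder by id afterwards if the original row order matters:

```r
df <- data.frame(
  id = 1:6,
  district = c("A", "B", "C", "A", "A", "B"),
  lat = c(28.6, 30.2, 35.9, 27.5, 27.9, 31.5),
  long = c(77.5, 85.2, 66.5, 75.0, 79.2, 88.8)
)

# per-district means, named inside cbind() so the merged
# columns do not clash with the original lat/long
means <- aggregate(cbind(lat_mean = lat, long_mean = long) ~ district,
                   data = df, FUN = mean)

merged <- merge(df, means, by = "district")
```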

Make table using mean of a column in R

I have following data frame:
test <- data.frame(Gender = rep(c("M", "F"), 5), Death = 1981:1985, Age = 21:30)
and I wanted to know how I can reproduce the following results using the table command rather than ddply:
library(plyr)
ddply(test, "Death", summarise, AgeMean = mean(Age))
Death AgeMean
1 1981 23.5
2 1982 24.5
3 1983 25.5
4 1984 26.5
5 1985 27.5
I think you mean aggregate...
aggregate(Age ~ Death, data = test, FUN = mean)
# Death Age
#1 1981 23.5
#2 1982 24.5
#3 1983 25.5
#4 1984 26.5
#5 1985 27.5
Or you could also use summaryBy from the doBy package:
library(doBy)
summaryBy(Age ~ Death, data = test, FUN = mean)
Death Age.mean
1981 23.5
1982 24.5
1983 25.5
1984 26.5
1985 27.5
The variable(s) to the left of the ~ is the variable(s) you want to perform the function FUN= on (in this case mean) and the variable(s) to the right of the ~ is the new level of aggregation you want.
You can also do this using dplyr:
library(dplyr)
test %>%
  group_by(Death) %>%
  summarise(Age.mean = mean(Age))
I find dplyr's chaining syntax results in very readable code, but that's a personal preference.
Source: local data frame [5 x 2]
Death Age.mean
1 1981 23.5
2 1982 24.5
3 1983 25.5
4 1984 26.5
5 1985 27.5
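On the question's original wording: table() itself only cross-tabulates counts, so it cannot produce means; the closest base tool with table-like output is tapply(). A sketch on the test data:

```r
test <- data.frame(Gender = rep(c("M", "F"), 5),
                   Death = 1981:1985, Age = 21:30)

# named vector of mean ages, one entry per Death year
tapply(test$Age, test$Death, mean)
#> 1981 1982 1983 1984 1985
#> 23.5 24.5 25.5 26.5 27.5
```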
