Subtract columns when column name is a year - R

I'm certain the answer to this is simple, but I can't figure it out.
I have a pivot table where the column headings are years.
# A tibble: 3 x 5
country  2012  2013  2014  2015
<chr>   <dbl> <dbl> <dbl> <dbl>
USA        45    23    12    42
Canada     67    98    14    25
Mexico     89   104    78     3
I want to create a new column that calculates the difference between two other columns. Rather than recognizing the years as column names, however, R takes the difference between the two year numbers themselves.
Below is a sample of my code. If I put quotes around the years, I get the error "non-numeric argument to binary operator". Without quotes, R creates a new column containing -3, simply subtracting the year numbers.
df %>%
  pivot_wider(names_from = year, values_from = value) %>%
  mutate(diff = 2012 - 2015)
How do I re-write this to get the following table:
# A tibble: 3 x 6
country  2012  2013  2014  2015  diff
<chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
USA        45    23    12    42     3
Canada     67    98    14    25    42
Mexico     89   104    78     3    86

You may try
df %>%
  pivot_wider(names_from = year, values_from = value) %>%
  mutate(diff = .$`2012` - .$`2015`)
With your data:
df <- read.table(text = "country 2012 2013 2014 2015
USA 45 23 12 42
Canada 67 98 14 25
Mexico 89 104 78 3
", header = T)
names(df) <- c("country", 2012, 2013, 2014, 2015 )
df %>%
  mutate(diff = .$`2012` - .$`2015`)
country 2012 2013 2014 2015 diff
1 USA 45 23 12 42 3
2 Canada 67 98 14 25 42
3 Mexico 89 104 78 3 86
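For what it's worth, backticks also work directly inside mutate() for non-syntactic column names, so the .$ form isn't strictly needed (same result, just a stylistic variation):
df %>%
  pivot_wider(names_from = year, values_from = value) %>%
  mutate(diff = `2012` - `2015`)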


How best to do row operations in R

Below is the sample data
year <- c(2016,2017,2018,2019,2020,2021, 2016,2017,2018,2019,2020,2021,
          2016,2017,2018,2019,2020,2021, 2016,2017,2018,2019,2020,2021)
indcode <- c(71,71,71,71,71,71, 72,72,72,72,72,72, 44,44,44,44,44,44, 45,45,45,45,45,45)
avgemp <- c(44,46,48,50,55,56, 10,11,12,13,14,15, 21,22,22,23,25,25, 61,62,62,63,69,77)
ownership <- rep(50, 24)
test3 <- data.frame(year, indcode, avgemp, ownership)
The desired result is to sum avgemp for two specific indcode combinations (71+72 and 44+45), producing one additional row per year. The numbers in parentheses below are only there to illustrate which values get added. My main source of confusion is how to select, and then add, only certain indcode combinations. My initial thought was to pivot wider, add the columns, and then pivot_longer again, but I'm hoping for something a bit less convoluted.
year  indcode  avgemp        ownership
2016  71+72    54 (44+10)    50
2016  71       44            50
2016  72       10            50
2017  71+72    57 (46+11)    50
2018  71+72    60 (48+12)    50
2019  71+72    63 (50+13)    50
2020  71+72    69 (55+14)    50
2021  71+72    71 (56+15)    50
I know that it would start something like this:
test3 <- test3 %>% group_by(indcode) %>% mutate("71+72" = (something that filters out 71 and 72))

One dplyr approach is to group by year and by indcode %/% 10 (so 71 and 72 fall into one group, 44 and 45 into another), then summarise:
test3 %>%
  group_by(year, gr = indcode %/% 10) %>%
  summarise(indcode = paste(unique(indcode), collapse = '+'),
            avgemp = sum(avgemp), ownership = ownership[1], .groups = 'drop') %>%
  select(-gr) %>%
  arrange(indcode)
# A tibble: 12 x 4
year indcode avgemp ownership
<dbl> <chr> <dbl> <dbl>
1 2016 44+45 82 50
2 2017 44+45 84 50
3 2018 44+45 84 50
4 2019 44+45 86 50
5 2020 44+45 94 50
6 2021 44+45 102 50
7 2016 71+72 54 50
8 2017 71+72 57 50
9 2018 71+72 60 50
10 2019 71+72 63 50
11 2020 71+72 69 50
12 2021 71+72 71 50
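If, as in the desired output above, you want the combined rows added alongside the original rows rather than replacing them, one possible follow-up (a sketch; the combined object name is mine) is to stack the summary back onto test3:
library(dplyr)
combined <- test3 %>%
  group_by(year, gr = indcode %/% 10) %>%
  summarise(indcode = paste(unique(indcode), collapse = '+'),
            avgemp = sum(avgemp), ownership = ownership[1], .groups = 'drop') %>%
  select(-gr)
test3 %>%
  mutate(indcode = as.character(indcode)) %>%   # match column types before stacking
  bind_rows(combined) %>%
  arrange(year, indcode)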
Using data.table: convert the data.frame to a data.table with setDT, group by 'year', 'ownership', and an 'indcode' recoded with ifelse/fcase, then get the sum of 'avgemp' as the summarised output.
library(data.table)
setDT(test3)[, .(avgemp = sum(avgemp)),
             .(year, ownership,
               indcode = fcase(indcode %in% c(71, 72), '71+72', default = '44+45'))]
Output:
year ownership indcode avgemp
<num> <num> <char> <num>
1: 2016 50 71+72 54
2: 2017 50 71+72 57
3: 2018 50 71+72 60
4: 2019 50 71+72 63
5: 2020 50 71+72 69
6: 2021 50 71+72 71
7: 2016 50 44+45 82
8: 2017 50 44+45 84
9: 2018 50 44+45 84
10: 2019 50 44+45 86
11: 2020 50 44+45 94
12: 2021 50 44+45 102
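For comparison, the same recode-then-summarise idea in dplyr (a sketch; case_when plays the role of fcase here):
library(dplyr)
test3 %>%
  group_by(year, ownership,
           indcode = case_when(indcode %in% c(71, 72) ~ "71+72", TRUE ~ "44+45")) %>%
  summarise(avgemp = sum(avgemp), .groups = "drop")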

How can I group and count observations made by decade in a dataset that has the years set as individual values?

I have the following dataset:
# A tibble: 90 × 2
decade n
<dbl> <int>
1 1930 13
2 1931 48
3 1932 44
4 1933 76
5 1934 73
6 1935 63
7 1936 54
8 1937 51
9 1938 41
10 1939 42
# … with 80 more rows
The years continue until 2010. I wish to group by decade (1930s, 1940s, etc.) and to have, in another column, the sum of n over the years of each decade.
For example:
# A tibble: 9 × 2
decade n
<dbl> <int>
1 1930-1939 449
2 1940-1949 516
Thanks!
We could use the modulo operator %%:
library(dplyr)
df %>%
  group_by(decade = decade - decade %% 10) %>%
  summarise(n = sum(n))
decade n
<dbl> <int>
1 1930 505
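If you want the labels shown in the question (e.g. 1930-1939) rather than just the decade start, a small extension of the same %% idea (a sketch, not part of the original answer):
library(dplyr)
df %>%
  group_by(decade = paste0(decade - decade %% 10, "-", decade - decade %% 10 + 9)) %>%
  summarise(n = sum(n))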

Performing row-wise normalization of data in R

This is a simple operation I would like to do, which is proving not to be so simple. I have a time-series data set, and I would like to perform row-wise normalization: for each observation, (x - mean(row)) / sd(row).
Below is one attempt, to no avail; I've also already replaced NA values with 0, so that doesn't seem to be the issue.
norm <- for (i in 1:nrow(clusterdatairaq2)){
for(j in 2:ncol(clusterdatairaq2)) {
clusterdatairaq2[i,j] <- (clusterdatairaq2[i,j] - mean(clusterdatairaq2[i,]))/ sd(clusterdatairaq2[i,])
}
}
Thanks in advance for any help!!
Assuming we have a data frame like this:
library(dplyr)
df = tibble(
Destination = c("Belgium", "Bulgaria", "Czechia"),
`Jan 2008` = sample(1:1000, size=3),
`Feb 2008` = sample(1:1000, size=3),
`Mar 2008` = sample(1:1000, size=3)
)
df
# A tibble: 3 × 4
Destination `Jan 2008` `Feb 2008` `Mar 2008`
<chr> <int> <int> <int>
1 Belgium 811 299 31
2 Bulgaria 454 922 421
3 Czechia 638 709 940
The tidyverse way to do this (which I think is better than base R here):
library(dplyr)
library(tidyr)
scaled = df %>%
  pivot_longer(`Jan 2008`:`Mar 2008`) %>%
  group_by(Destination) %>%
  mutate(value = as.numeric(scale(value))) %>%
  ungroup()
scaled
Destination name value
<chr> <chr> <dbl>
1 Belgium Jan 2008 1.09
2 Belgium Feb 2008 -0.205
3 Belgium Mar 2008 -0.881
4 Bulgaria Jan 2008 -0.517
5 Bulgaria Feb 2008 1.15
6 Bulgaria Mar 2008 -0.635
7 Czechia Jan 2008 -0.787
8 Czechia Feb 2008 -0.338
9 Czechia Mar 2008 1.13
Now, you could pivot it back to the original form, but there's not much point, because analysis will be much easier in long form:
scaled %>% pivot_wider(names_from=name, values_from=value)
# A tibble: 3 × 4
Destination `Jan 2008` `Feb 2008` `Mar 2008`
<chr> <dbl> <dbl> <dbl>
1 Belgium 1.09 -0.205 -0.881
2 Bulgaria -0.517 1.15 -0.635
3 Czechia -0.787 -0.338 1.13
A base R alternative, applying the normalization row-wise with apply():
set.seed(42)
mtx <- matrix(sample(99, size = 6*5, replace = TRUE), nrow = 6)
colnames(mtx) <- LETTERS[1:5]   # name the columns A-E, to match the output below
df <- cbind(data.frame(id = letters[1:6]), mtx)
df
# id A B C D E
# 1 a 49 47 26 95 58
# 2 b 65 24 3 5 97
# 3 c 25 71 41 84 42
# 4 d 74 89 89 34 24
# 5 e 18 37 27 92 30
# 6 f 49 20 36 3 43
out <- t(apply(df[,-1], 1, function(X) (X-mean(X)) / sd(X)))
colnames(out) <- paste0(colnames(df[,-1]), "_norm")
df <- cbind(df, out)
df
# id A B C D E A_norm B_norm C_norm D_norm E_norm
# 1 a 49 47 26 95 58 -0.2376354 -0.3168472 -1.1485711 1.5842361 0.1188177
# 2 b 65 24 3 5 97 0.6393668 -0.3611690 -0.8736386 -0.8248320 1.4202728
# 3 c 25 71 41 84 42 -1.1427812 0.7618541 -0.4802994 1.3001207 -0.4388942
# 4 d 74 89 89 34 24 0.3878036 0.8725581 0.8725581 -0.9048751 -1.2280448
# 5 e 18 37 27 92 30 -0.7749098 -0.1291516 -0.4690243 1.7401483 -0.3670625
# 6 f 49 20 36 3 43 1.0067737 -0.5462283 0.3106004 -1.4566088 0.6854630
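For reference, the same row-wise centering and scaling can also be written with base R's sweep(), using the mtx matrix from above (a sketch; it produces the same values as the *_norm columns):
ctr  <- sweep(mtx, 1, rowMeans(mtx), "-")        # subtract each row's mean
out2 <- sweep(ctr, 1, apply(mtx, 1, sd), "/")    # divide by each row's sd
out2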
I used the mtcars dataset as an example:
library(tidyverse)
mtcars %>%                                        # the dataset
  select(disp) %>%                                # disp is the column we normalize, just as an example
  mutate(disp2 = (disp - mean(disp)) / sd(disp))  # disp2 is the name of the normalized column
A dplyr solution, re-using @Migwell's toy example (please provide a reproducible example in your question):
library(dplyr)
library(data.table)  # needed for data.table()
df = data.table(
  Destination = c("Belgium", "Bulgaria", "Czechia"),
  `Jan 2008` = sample(1:1000, size = 3),
  `Feb 2008` = sample(1:1000, size = 3),
  `Mar 2008` = sample(1:1000, size = 3))
> df
Destination Jan 2008 Feb 2008 Mar 2008
1: Belgium 443 114 628
2: Bulgaria 755 801 493
3: Czechia 123 512 517
You can use:
df2 <- df %>%
  select(`Jan 2008`:`Mar 2008`) %>%
  mutate(normJan2008 = (`Jan 2008` - rowMeans(., na.rm = TRUE)) / apply(., 1, sd))
> df2
Jan 2008 Feb 2008 Mar 2008 normJan2008
1: 443 114 628 0.1843742
2: 755 801 493 0.4333577
3: 123 512 517 -1.1546299
And do this for every variable you need to normalize.
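If there are many columns to normalize, one way to avoid repeating this per variable (a sketch, assuming a df with the month columns from the toy examples above) is to precompute the row means and SDs once and use across():
library(dplyr)
num_cols <- c("Jan 2008", "Feb 2008", "Mar 2008")
row_mean <- rowMeans(select(df, all_of(num_cols)))
row_sd   <- apply(select(df, all_of(num_cols)), 1, sd)
df %>%
  mutate(across(all_of(num_cols), ~ (.x - row_mean) / row_sd, .names = "norm_{.col}"))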

Subset a dataframe on multiple conditions in R

My dataset has several variables, and I want to build a subset as well as create new variables based on those conditions.
dat1
S1 S2 H1 H2 Month1 Year1 Month2 Year2
16 17 81 70 09 2017 07 2017
17 16 80 70 08 2017 08 2016
14 16 81 81 09 2016 05 2016
18 15 70 81 07 2016 09 2017
17 16 80 80 08 2016 05 2016
18 18 81 70 05 2017 04 2016
I want to subset such that if S1 is 16, 17, or 18 and H1 is 80 or 81, then I create new variables Hist = H1, Date = paste(Month1, Year1), and Sip = S1.
The same goes for the S2, H2 set.
My output should be as follows (the first four rows come from S1, H1, Month1, Year1 and the last two rows come from S2, H2, Month2, Year2):
Hist Sip Date
81 16 09-2017
80 17 08-2017
80 17 08-2016
81 18 05-2017
81 16 05-2016
80 16 05-2016
My code:
datnew <- dat1 %>%
  mutate(Date = ifelse((S1==16|S1==17|S1==18) & (H1==80|H1==81), paste(01, Month1, Year1, sep="-"),
                ifelse((S2==16|S2==17|S2==18) & (H2==80|H2==81), paste(Month2, Year2, sep="-"), "NA")),
         hist = ifelse((S1==16|S1==17|S1==18) & (H1==80|H1==81), H1,
                ifelse((S2==16|S2==17|S2==18) & (H2==80|H2==81), H2, "NA")),
         sip  = ifelse((S1==16|S1==17|S1==18) & (H1==80|H1==81), S1,
                ifelse((S2==16|S2==17|S2==18) & (H2==80|H2==81), S2, "NA")))
In the original data I have 10 sets of such columns, i.e. S1-S10, H1-H10, Month1-Month10... and for each variable I have many more conditions on the values.
With this approach the code goes on and on. Is there a better way to do this?
Thanks in advance
Here is a tidyverse solution. Separate into two data frames and bind the rows together. (This assumes a patientId column identifying each row; add one with row_number() if your data doesn't have an id.)
library(tidyverse)
bind_rows(
dat1 %>% select(patientId, ends_with("1")) %>% rename_all(str_remove, "1"),
dat1 %>% select(patientId, ends_with("2")) %>% rename_all(str_remove, "2")
) %>%
transmute(
patientId,
Hist = H,
Sip = S,
date = paste0(Month, "-", Year)
) %>%
filter(
Sip %in% 16:18,
Hist %in% 80:81
)
#> # A tibble: 6 x 4
#> patientId Hist Sip date
#> <int> <dbl> <dbl> <chr>
#> 1 1 81 16 09-2017
#> 2 2 80 17 08-2017
#> 3 5 80 17 08-2016
#> 4 6 81 18 05-2017
#> 5 3 81 16 05-2016
#> 6 5 80 16 05-2016
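Since the real data has ten such column sets, a possible generalisation (a sketch, assuming all the columns follow the S1/H1/Month1/Year1 naming pattern; the row id is added only so each original row stays identifiable) is to stack every set at once with pivot_longer():
library(dplyr)
library(tidyr)
dat1 %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row,
               names_to = c(".value", "set"),
               names_pattern = "([A-Za-z]+)(\\d+)") %>%
  transmute(row, set, Hist = H, Sip = S, date = paste0(Month, "-", Year)) %>%
  filter(Sip %in% 16:18, Hist %in% 80:81)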

Remove rows with NA values and delete those observations in another year [duplicate]

This question already has answers here:
Filter rows in R based on values in multiple rows
(2 answers)
Closed 5 years ago.
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009, but then I also want to remove the rows for those same countries in 2017. I would like to get the following result:
# A tibble: 6 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can use any() after grouping by 'country':
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
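The same idea in two explicit steps, if you prefer to see which countries are kept (a sketch using the same df1):
library(dplyr)
complete_countries <- df1 %>%
  group_by(country) %>%
  summarise(keep = !any(is.na(conf_perc))) %>%
  filter(keep) %>%
  pull(country)
df1 %>% filter(country %in% complete_countries)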
A base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
