How to perform a calculation with chr and dbl columns - R

Let's say I run this code:
df_customer %>%
separate(DOB,sep = "-",into = c("D", "M","Y")) %>%
mutate(Age=2021)
then this dataframe comes out
ID D M Y G C Age
<int> < chr > <dbl>
268408 02 01 1970 M 4 2021
269696 07 01 1970 F 8 2021
268159 08 01 1970 F 8 2021
270181 10 01 1970 F 2 2021
268073 11 01 1970 M 1 2021
273216 15 01 1970 F 5 2021
266929 15 01 1970 M 8 2021
275152 16 01 1970 M 4 2021
275034 18 01 1970 F 4 2021
273966 21 01 1970 M 8 2021
Then I want to change the column created by mutate.
How can I calculate something like 2021 minus the "Y" column?
2021 is dbl and Y is chr.

Adding convert = TRUE in separate should give you numeric values. You can also use as.numeric to convert character to numbers.
library(dplyr)
library(tidyr)
df_customer %>%
separate(DOB,sep = "-",into = c("D", "M","Y"), convert = TRUE) %>%
mutate(Age=2021 - as.numeric(Y))

We could do this in base R
transform(cbind(df_customer, read.table(text = df_customer$DOB, sep = "-",
col.names = c("D", "M", "Y"))), Age = 2021 - Y)
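To see where the chr/dbl mismatch comes from, here is a minimal base R sketch. The two-row df_customer is made-up sample data standing in for the real one, and splitting with strsplit is only one option, not the method used in the answers above.

```r
# Hypothetical two-row stand-in for df_customer (DOB as day-month-year text)
df_customer <- data.frame(
  ID  = c(268408, 269696),
  DOB = c("02-01-1970", "07-01-1970"),
  stringsAsFactors = FALSE
)

# Split each DOB on "-" and keep the third piece (the year), still as character
parts <- strsplit(df_customer$DOB, "-", fixed = TRUE)
year_chr <- vapply(parts, function(p) p[3], character(1))

# 2021 - year_chr would stop with "non-numeric argument to binary operator",
# so convert the character year to numeric before subtracting
df_customer$Age <- 2021 - as.numeric(year_chr)
df_customer$Age
```

The same as.numeric() call works inside mutate() on the Y column that separate() creates.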

Related

How best to parse fields in R?

Below is the sample data. This is how it comes from the current population survey. There are 115 columns in the original. Below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.
For some context, the first character is the race, a = all, b = black, w = white, and h = hispanic. The second character is the gender, x = all, m = male, and f = female. The third variable, which does not appear in all columns, is the age. These values are 2024 for ages 20-24, 3039 for 30-39, and so on. Each column name ends in one of the terms laborforce, unemp, or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- c(01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c(5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19)
ba2024unemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame(stfips, areatype, periodyear, period, xalaborforce, xaunemp, xaunemprate, walaborforce, waunemp, waunemprate, balaborforce, ba2024laborforce, ba2024unemp, ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age laborforce unemp unemprate
32 01 2021 01 x a all 1210.9 55.7 2.3
32 01 2021 02 x a all 1215.3 55.2 2.5
..... (the other six rows for race = x and gender = a)
32 01 2021 01 w a all 1000.0 50.2 3.4
32 01 2021 02 w a all 999.2 49.5 3.6
..... (the other six rows for race = w and gender = a)
32 01 2021 01 b a 2024 1.2 .2 2.1
Edit -- added handling for columns with age prefix. Mostly there, but would be nice to have a concise way to add the - to make 2024 into 20-24....
library(dplyr)
library(tidyr)
library(readr)    # parse_number()
library(stringr)  # str_remove_all()
test2 %>%
pivot_longer(xalaborforce:ba2024laborforce) %>%
separate(name, c("race", "gender", "stat"), sep = c(1,2)) %>%
mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
stat = str_remove_all(stat, "[0-9]")) %>%
pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
stfips areatype periodyear period race gender age laborforce unemp unemprate
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32 1 2021 1 x a all 1211. 55.7 2.3
2 32 1 2021 1 w a all 1000 50.2 3.4
3 32 1 2021 1 b a all 5.5 NA NA
4 32 1 2021 1 b a 2024 1.2 NA NA
5 32 1 2021 2 x a all 1215. 55.2 2.5
6 32 1 2021 2 w a all 999. 49.5 3.6
7 32 1 2021 2 b a all 5.7 NA NA
8 32 1 2021 2 b a 2024 1.4 NA NA
9 32 1 2021 3 x a all 1201. 65.2 2.7
10 32 1 2021 3 w a all 1000. 51.6 3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
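For the open point about turning 2024 into 20-24, one possible approach is a regular expression with two capture groups. This is only a sketch: the age vector below is illustrative, and it assumes every age code is exactly four digits (two for the lower bound, two for the upper).

```r
# Illustrative age codes as they would appear in the age column
age <- c("all", "2024", "3039")

# Capture two pairs of digits and put a hyphen between them;
# values that are not exactly four digits (such as "all") pass through unchanged
age_range <- sub("^([0-9]{2})([0-9]{2})$", "\\1-\\2", age)
age_range
```

The same sub() call could be applied to the age column inside the mutate() step.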

R moving average between data frame variables

I have been trying to find a solution but haven't found one yet.
I have a dataframe structured as follows:
country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54
I want to find the moving average for every 3 years (i.e. 2014-16, 2015-17, etc) to be placed in ad-hoc columns.
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France Paris 23 34 54 12 23 21 37 33.3 29.7 18.7
US NYC 1 2 2 12 95 54 etc etc etc etc
Any hint?
1) Using the data shown reproducibly in the Note at the end we apply rollmean to each column in the transpose of the data and then transpose back. We rollapply the appropriate paste command to create the names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
pivot_longer(-c(country, City)) %>%
group_by(country, City) %>%
mutate(value = rollmean(value, 3, fill = NA),
name = rollapply(name, 3, function(x) paste(range(x), collapse="-"), fill=NA)) %>%
ungroup %>%
drop_na %>%
pivot_wider %>%
left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)
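If zoo is not available, the same windows can be computed in base R. The following is a rough sketch rather than a replacement for the answers above: it assumes the first two columns are the identifiers and all remaining columns are consecutive years, and the variable names (years, k, win) are chosen here for illustration.

```r
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)

years <- names(DF)[-(1:2)]  # year columns, assumed to be in order
k <- 3                      # window width

# Slide a window of k years across the columns and store each row mean
# in a new column named "<first year>-<last year>"
for (i in seq_len(length(years) - k + 1)) {
  win <- years[i:(i + k - 1)]
  DF[[paste(win[1], win[k], sep = "-")]] <- rowMeans(DF[win])
}
DF
```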

Divide 2 columns from 2 different dataframes

Does anybody know how to divide two columns from two different dataframes when there are multiple columns that identify each row?
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
var1 = c("99", "99",
"99", "99")
value <- seq(1:length(month))
df1 = data.frame(name, month, var1, value)
df2 = df1
df2["var1"] = c("992", "992", "992", "992")
df2["value"] = c(2, 4, 6, 8)
df1
df2
Output
> df1
name month var1 value
1 A oct 2018 99 1
2 A nov 2018 99 2
3 B oct 2018 99 3
4 B nov 2018 99 4
> df2
name month var1 value
1 A oct 2018 992 2
2 A nov 2018 992 4
3 B oct 2018 992 6
4 B nov 2018 992 8
Does anybody know how to create a new dataframe that divides the "value" column in df2 by the "value" column of df1? The method should also work when there are more columns than in the current example.
In base R, we can do merge
df3 <- merge(df1, df2, by = c("name", "month"))
df3$value <- df3$value.x/df3$value.y
df3
# name month var1.x value.x var1.y value.y value
#1 A nov 2018 99 2 992 4 0.5
#2 A oct 2018 99 1 992 2 0.5
#3 B nov 2018 99 4 992 8 0.5
#4 B oct 2018 99 3 992 6 0.5
You can drop the value.x and value.y columns if they are not needed.
Join the two data frames, perform the division, and then drop the unwanted columns generated by the join (assuming you want the computed value column to replace the value columns from the original data frames). Depending on what you want, you may need a different *_join.
library(dplyr)
df1 %>%
inner_join(df2, by = c("name", "month")) %>%
mutate(value = value.x / value.y) %>%
select(-value.x, -value.y)
giving:
name month var1.x var1.y value
1 A oct 2018 99 992 0.5
2 A nov 2018 99 992 0.5
3 B oct 2018 99 992 0.5
4 B nov 2018 99 992 0.5
We can use data.table as well to do a join and create the column 'value' by dividing the 'value' column by the corresponding column in the other dataset while joining on 'name' and 'month'
library(data.table)
df3 <- copy(df1)
setDT(df3)[df2, value := value/i.value, on = .(name, month)]
df3
# name month var1 value
#1: A oct 2018 99 0.5
#2: A nov 2018 99 0.5
#3: B oct 2018 99 0.5
#4: B nov 2018 99 0.5
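One detail worth checking is the direction of the division. The question asks for df2's value divided by df1's, while value.x / value.y (with df1 as the left table) is df1 over df2. A small base R sketch with explicit suffixes makes the direction visible. The data frames are rebuilt here without var1 for brevity, and the suffixes and the ratio column name are choices made for this illustration.

```r
df1 <- data.frame(
  name  = c("A", "A", "B", "B"),
  month = c("oct 2018", "nov 2018", "oct 2018", "nov 2018"),
  value = 1:4,
  stringsAsFactors = FALSE
)
df2 <- transform(df1, value = c(2, 4, 6, 8))

# Join on every identifying column; add more names to keys as needed
keys <- c("name", "month")
m <- merge(df1, df2, by = keys, suffixes = c(".df1", ".df2"))

# df2 divided by df1, as the question asks
m$ratio <- m$value.df2 / m$value.df1
m[c(keys, "ratio")]
```

With more identifying columns, only the keys vector needs to change.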

R, dplyr: How to divide data frame elements by specific elements

edit: Solution at the end.
I have a dataframe that contains different variables and the sum of these different variables as a variable called "total".
I want to add a new column that calculates each variable's share of the "total" variable.
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
value <- seq(1:length(month))
df = data.frame(name, month, value)
# Create total variable
dfTotal =
df%>%
group_by(month)%>%
summarize(value = sum(value, na.rm = TRUE))
dfTotal[["name"]] <- "Total"
dfTotal = as.data.frame(dfTotal)
# Add total column to dataframe
df2 = rbind(df, dfTotal)
df2
which gives the dataframe
name month value
1 A oct 2018 1
2 A nov 2018 2
3 B oct 2018 3
4 B nov 2018 4
5 Total nov 2018 6
6 Total oct 2018 4
What I want is to produce a new column with the shares of the total for each month in the above dataframe, so that I get something like
name month value share
1 A oct 2018 1 0.25 (=1/4)
2 A nov 2018 2 0.33 (=2/6)
3 B oct 2018 3 0.75 (=3/4)
4 B nov 2018 4 0.67 (=4/6)
5 Total nov 2018 6 1.00 (=6/6)
6 Total oct 2018 4 1.00 (=4/4)
Does anybody know how I can produce the last column of the second dataframe from the first dataframe?
Solution:
Based on tmfmnk's comment, the following solves the problem:
df2 =
df2 %>%
group_by(month) %>%
mutate(share = value/max(value))
df2
which gives
name month value share
<fct> <fct> <int> <dbl>
1 A oct 2018 1 0.25
2 A nov 2018 2 0.333
3 B oct 2018 3 0.75
4 B nov 2018 4 0.667
5 Total nov 2018 6 1
6 Total oct 2018 4 1
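Dividing by max(value) works here because the Total row is always the largest value within its month. If that cannot be guaranteed (for example with negative values), a variant that looks up the Total row explicitly may be safer. This is a base R sketch under that assumption, with df2 rebuilt from the table above and totals being a helper name introduced here.

```r
df2 <- data.frame(
  name  = c("A", "A", "B", "B", "Total", "Total"),
  month = c("oct 2018", "nov 2018", "oct 2018", "nov 2018", "nov 2018", "oct 2018"),
  value = c(1, 2, 3, 4, 6, 4),
  stringsAsFactors = FALSE
)

# Named lookup vector: month -> value of that month's Total row
is_total <- df2$name == "Total"
totals <- setNames(df2$value[is_total], df2$month[is_total])

# Divide every row by the total of its own month
df2$share <- unname(df2$value / totals[df2$month])
df2
```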

Renaming Columns in R According to Repeating Sequence

I have a wide data frame in R and I am trying to rename the column names so that I can reshape it to a long format.
Currently, the data is structured like this:
long lat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V477
I'd like to rename the columns so that they are:
long lat Jan_1979 Feb_1979 Mar_1979 Apr_1979 ... Sept_2018
I'm not sure how to go about doing this. Any help would be appreciated.
There are multiple ways you could do this.
One way in base R is to use seq to create monthly dates in the format you need. For example, you could create the first 10 names of the sequence starting from 1979-01-01 with
format(seq(as.Date('1979-01-01'), length.out = 10, by = "1 month"), "%b_%Y")
#[1] "Jan_1979" "Feb_1979" "Mar_1979" "Apr_1979" "May_1979" "Jun_1979" "Jul_1979"
#[8] "Aug_1979" "Sep_1979" "Oct_1979"
For your case, this should work
names(df)[3:479] <- format(seq(as.Date('1979-01-01'),
length.out = 477, by = "1 month"), "%b_%Y")
We can use expand.grid to get all month year combinations:
name_combn <- expand.grid(month.abb, 1979:2018)[1:477,]
names(df) <- c('long', 'lat', paste(name_combn$Var1, name_combn$Var2, sep = "_"))
Output:
> head(name_combn, 20)
Var1 Var2
1 Jan 1979
2 Feb 1979
3 Mar 1979
4 Apr 1979
5 May 1979
6 Jun 1979
7 Jul 1979
8 Aug 1979
9 Sep 1979
10 Oct 1979
11 Nov 1979
12 Dec 1979
13 Jan 1980
14 Feb 1980
15 Mar 1980
16 Apr 1980
17 May 1980
18 Jun 1980
19 Jul 1980
20 Aug 1980
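Note that format(..., "%b") returns month abbreviations in the current locale, so the first approach can produce non-English names on some systems, whereas month.abb is always English. A compact, locale-independent sketch using integer arithmetic on the column index is shown below; idx and new_names are names introduced here for illustration.

```r
n <- 477                 # number of monthly columns, V1 to V477
idx <- seq_len(n) - 1    # 0-based month offset from Jan 1979

# Month cycles every 12 columns; year advances once per 12 columns
new_names <- paste(month.abb[idx %% 12 + 1], 1979 + idx %/% 12, sep = "_")

head(new_names, 3)
tail(new_names, 1)
```

The result could then be assigned with names(df)[3:479] <- new_names, as in the first answer.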
