Sum consecutive strings in multiple columns - r

I have a table with flight ids, arrivals, and departures:
> test
   arrival departure flight_id
1                  9      2233
2                  8      1982
3        1                2164
4                  9      2081
5                         2130
6        2                2040
7        9                2030
8                         2130
9                  4      3169
10       6                2323
11                 8      2130
12                        2220
13                        3169
14                 9      2204
15       1                1910
16                 2       837
17                        1994
18       9         8      1994
19                        1994
20                        1994
21       9         1      2338
22       1         8      1981
23       9                2365
24                 8      2231
25       9                2048
My objective is to count only the rows where arrival and departure are blank, and then to aggregate by flight_id. But there is a catch. I believe this cannot be done with table(), aggregate() or rle() because they do not account for breaks.
For example, only consecutive rows of a flight id where arrival == "" and departure == "" should be counted together, and the count should start again from zero when a row of that flight id has a non-blank value. NOTE: Rows of other flight ids appearing in between don't matter - each flight id is treated separately, which is why both blank rows of flight 2130 count towards a single run of 2 even though other flights sit between them.
In other words, the resulting output from the test should look exactly like this:
output
  flight_id count
1      2130     2
2      2220     1
3      3169     1
4      1994     1
5      1994     2
Notice that flight id 1994 occurs three times with blank arrival and departure, but there is a break in between at row 18. Therefore, it must appear twice in the output (with counts 1 and 2).
I have tried writing a for loop, but I get the error "missing value where TRUE/FALSE needed":
raw_data <- test
unique_id <- unique(raw_data$flight_id)
output <- data.frame("flight_id" = integer(0), "count" = integer(0), stringsAsFactors = FALSE)
for (flight_id in unique_id) {
  oneflight <- raw_data[which(raw_data$flight_id == flight_id), ]
  if (nrow(oneflight) >= 1) {
    for (i in 2:nrow(oneflight)) {
      if (oneflight[i, "arrival"] == "" & oneflight[i, "departure"] == "") {
        new_row <- c(flight_id, sum(flight_id)[i])
        output[nrow(output) + 1, ] <- new_row
      }
    }
  }
}
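The error most likely comes from the inner loop index rather than from the string comparison itself: many flights have only one row, so 2:nrow(oneflight) counts down as c(2, 1), and indexing a row that does not exist returns NA, which if() cannot handle. A small illustration with a hypothetical one-row subset (using seq_len(nrow(oneflight)) as the loop range would avoid both the countdown and skipping row 1):
oneflight <- data.frame(arrival = "", departure = "9")  # hypothetical one-row subset
2:nrow(oneflight)              # 2 1 -- the loop visits a non-existent row 2 first
oneflight[2, "arrival"] == ""  # NA, so if(NA & ...) fails with the reported error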
How could one improve the above code or could someone suggest a quicker method with dplyr for example? Here is a sample of the data:
> dput(test)
structure(list(arrival = c("", "", "1", "", "", "2", "9", "",
"", "6", "", "", "", "", "1", "", "", "9", "", "", "9", "1",
"9", "", "9"), departure = c("9", "8", "", "9", "", "", "", "",
"4", "", "8", "", "", "9", "", "2", "", "8", "", "", "1", "8",
"", "8", ""), flight_id = c(2233, 1982, 2164, 2081, 2130, 2040,
2030, 2130, 3169, 2323, 2130, 2220, 3169, 2204, 1910, 837, 1994,
1994, 1994, 1994, 2338, 1981, 2365, 2231, 2048)), .Names = c("arrival",
"departure", "flight_id"), row.names = c(NA, 25L), class = "data.frame")

A base R approach:
do.call("rbind", lapply(split(test, test$flight_id), function(x) {
  o <- rle(x[["arrival"]] == "" & x[["departure"]] == "")
  data.frame(flight_id = rep(unique(x[["flight_id"]]), sum(o$values)),
             count = o$lengths[o$values])
}))
#  flight_id count
#       1994     1
#       1994     2
#       2130     2
#       2220     1
#       3169     1
We split the data frame by flight_id, and for every group we apply rle() to find runs of consecutive rows where both arrival and departure are empty, returning a data frame with the flight_id and the length of each such run.

If I understand your question, one trick you could use is to add a decimal to the flight_id, indicating a group.
For example, within each flight's subset, get an index vector of the blank rows:
i <- which(oneflight$arrival == "" & oneflight$departure == "")
Then take cumsum(c(0, diff(i) != 1)) / 100 (or divide by a sufficiently large power of ten), add it to the flight IDs, and you have per-run flight groups that can be counted with table().
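A minimal sketch of this idea, applied per flight to the test data from the question (the divisor 100 is arbitrary; it only needs to exceed the number of runs per flight):
res <- lapply(split(test, test$flight_id), function(oneflight) {
  i <- which(oneflight$arrival == "" & oneflight$departure == "")
  if (length(i) == 0) return(NULL)                       # flight has no blank rows
  grp_id <- unique(oneflight$flight_id) + cumsum(c(0, diff(i) != 1)) / 100
  table(grp_id)                                          # one count per run
})
res[!sapply(res, is.null)]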

Here's a solution using data.table:
library(data.table)
flights <- test$flight_id[test$arrival == "" & test$departure == ""]
setDT(test)[flight_id %in% flights, grp := rleid(arrival == "", departure == "")][
  arrival == "" & departure == "", .(count = .N), .(flight_id, grp)]
#    flight_id grp count
# 1:      2130   1     2
# 2:      2220   3     1
# 3:      3169   3     1
# 4:      1994   3     1
# 5:      1994   5     2
Explanation:
First we obtain the flight_ids that have at least one record with empty arrival and departure values. Then we use this vector, flights, to subset the data and generate a run-length id column called "grp" based on arrival == "" and departure == "". Lastly, we compute the count of records (i.e. .N) where arrival == "" & departure == "", grouped by the columns flight_id and grp.
You can subsequently drop the grp column if needed.
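For example, assuming the result above is assigned to res (a name used here only for illustration):
res[, grp := NULL][]         # drop the helper column by reference
# or, equivalently, keep only the columns of interest:
res[, .(flight_id, count)]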


Is there a way to write this in a single Dplyr statement / more efficiently?

My (simplified) dataset consists of donor occupations and contribution amounts. I'm trying to determine the average contribution amount by occupation (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements: one to get the sum of contribution amounts for each occupation and another to get the count of donations from that occupation. I then bind the data frames with cbind and create a new column with mutate, dividing the sum by the count.
Data example:
contributor_occupation contribution_receipt_amount
1 LISTING COORDINATOR 5.00
2 NOT EMPLOYED 2.70
3 TEACHER 2.70
4 ELECTRICAL DESIGNER 2.00
5 STUDENT 50.00
6 SOFTWARE ENGINEER 10.00
7 TRUCK DRIVER 2.70
8 NOT EMPLOYED 50.00
9 CONTRACTOR 5.00
10 ENGINEER 6.00
11 FARMER 2.70
12 ARTIST 50.00
13 CIRCUS ARTIST 100.00
14 CIRCUS ARTIST 27.00
15 INFORMATION SECURITY ANALYST 2.00
16 LAWYER 5.00
occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))

occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column

occ_avg <- final_occ %>%
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total / n) %>%
  rename("Number of Donations" = n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))
This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:
avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount) / n()) %>%
  arrange(desc(avg_donation_by_occupation))
Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?
Thanks for the help clearing up any confusion!
We may need both sum and mean along with n(), which gives the number of observations in the grouped data. Because summarise() is evaluated per group after group_by(), n() returns the current group's size (here, the number of donations for that occupation), not the number of rows of the whole data. According to ?context
n() gives the current group size.
and ?mean
mean - Generic function for the (trimmed) arithmetic mean.
which is basically the sum of the observations divided by the number of observations.
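A tiny illustration (toy data, not from the question) that n() is evaluated per group inside summarise():
library(dplyr)
tibble(g = c("a", "a", "b")) %>%
  group_by(g) %>%
  summarise(rows_in_group = n())
# A tibble: 2 × 2
#   g     rows_in_group
#   <chr>         <int>
# 1 a                 2
# 2 b                 1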
library(dplyr)
out <- b %>%
  group_by(Occupation = contributor_occupation) %>%
  summarise(`Total Donated` = sum(contribution_receipt_amount),
            `Number of Donations` = n(),
            `Average Donation` = mean(contribution_receipt_amount),
            # or
            # `Average Donation` = `Total Donated` / `Number of Donations`,
            .groups = 'drop') %>%
  arrange(desc(`Average Donation`))
-output
out
# A tibble: 14 × 4
Occupation `Total Donated` `Number of Donations` `Average Donation`
<chr> <dbl> <int> <dbl>
1 CIRCUS ARTIST 127 2 63.5
2 ARTIST 50 1 50
3 STUDENT 50 1 50
4 NOT EMPLOYED 52.7 2 26.4
5 SOFTWARE ENGINEER 10 1 10
6 ENGINEER 6 1 6
7 CONTRACTOR 5 1 5
8 LAWYER 5 1 5
9 LISTING COORDINATOR 5 1 5
10 FARMER 2.7 1 2.7
11 TEACHER 2.7 1 2.7
12 TRUCK DRIVER 2.7 1 2.7
13 ELECTRICAL DESIGNER 2 1 2
14 INFORMATION SECURITY ANALYST 2 1 2
data
b <- structure(list(contributor_occupation = c("LISTING COORDINATOR",
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT",
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR",
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST",
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5,
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16"))

Select value from comma separated values in cell based on previous and next values

I have a large database, a subset of which looks like this
ID year value1 value2
1 2000 203,305,701 1, 2, 1
1 2001 203,504 1, 1
1 2002 203 1
2 2010 245 3
2 2011 245,332 2, 1
2 2012 332 3
2 2013 332 2
2 2014 245,332 2, 1
Reproducible code:
structure(list(
ID = c("1", "1", "1", "2", "2", "2", "2", "2"),
year = c("2000", "2001", "2002", "2010", "2011", "2012",
"2013", "2014"), value1 = c("203, 305, 701",
"203, 504", "203", "245", "245, 332",
"332", "332", "245, 332"), value2 = c("1, 2, 1",
"1, 1", "1", "3", "2, 1", "3", "2", "2, 1")), class = "data.frame", row.names = c(NA, -8L))
"value1" and "value2" contain comma-separated values. The objective is to simplify the "value1" column to a single value. The algorithm I've thought out goes like this:
Check for previous and next values for each row while grouping by ID (taking intersections: i.e. the common value in two consecutive rows).
For example, for row 5: The intersection of {245, 332} with the previous row {245} for value1 is 245, while with the next row {332} it is 332
Prefer next value over previous value for selection.
I want to prioritise the next value i.e. {332} in this split decision.
If neither intersection narrows things down to a single value, select value1 based on max(value2). If value2 does not have a unique maximum, select randomly.
The third step does not come into play here, since a single value is already selected by the first two steps.
The algorithm moves on to the next row as soon as a single value is reached. "Previous" and "next" refer to the preceding and the following row respectively. (A rough sketch implementing these steps is included after the desired output below.)
Similarly, for row 1:
The intersection with the next row alone gives 203; there is no previous row, and the algorithm stops as soon as a single value is reached.
The final data should look like this
ID year value1 value2
1 2000 203 1, 2, 1
1 2001 203 1, 1
1 2002 203 1
2 2010 245 3
2 2011 332 2, 1
2 2012 332 3
2 2013 332 2
2 2014 332 2, 1
I tried writing basic R code to loop over each row, grouping by "ID" and going through each year case by case, since I have no idea which package to use for this, but it seems to me that this might not be the most efficient method. (I am also very new to R.)
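For reference, a minimal sketch of the steps described above, using dplyr and stringr and assuming the reproducible data is stored in df; pick_value is just an illustrative helper name, and ties in value2 are broken by the first maximum rather than randomly:
library(dplyr)
library(stringr)

pick_value <- function(v1, v2) {
  v1 <- str_split(v1, ",\\s*")                 # candidate values per row
  v2 <- str_split(v2, ",\\s*")
  out <- character(length(v1))
  for (i in seq_along(v1)) {
    cand <- v1[[i]]
    if (length(cand) == 1) { out[i] <- cand; next }    # already a single value
    nxt <- if (i < length(v1)) intersect(cand, v1[[i + 1]]) else character(0)
    prv <- if (i > 1) intersect(cand, v1[[i - 1]]) else character(0)
    if (length(nxt) == 1) {              # prefer the intersection with the next row
      out[i] <- nxt
    } else if (length(prv) == 1) {       # otherwise use the previous row
      out[i] <- prv
    } else {                             # fall back to the candidate with max value2
      out[i] <- cand[which.max(as.numeric(v2[[i]]))]
    }
  }
  out
}

df %>%
  group_by(ID) %>%
  mutate(value1 = pick_value(value1, value2)) %>%
  ungroup()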

Separating values into existing column in R

I'm tidying some data that I read into R from a PDF using tabulizer. Unfortunately some cells haven't been read properly. In column 9 (Split 5 at 37.1km) rows 3 and 4 contain information that should have ended up in column 10 (Final Time).
How do I separate that column (9) just for these rows and paste the necessary data into an already existing column (10)?
I know how to use the tidyr::separate function but can't figure out how (and if) to apply it here. Any help and guidance will be appreciated.
structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))
My answer is not really fancy, but it does the job for any number in the final time column. It works as long as there are always numbers in brackets at the end.
# dummy df
df <- data.frame("split" = c("49:49.65(24)", "49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"),
                 "final" = c("59:51.68 (23)", "59:57.73 (24)", "", ""))
# combining & splitting strings
merge_strings <- paste0(df$split, df$final)
split_strings <- strsplit(merge_strings, ")")
df$split <- paste0(unlist(lapply(split_strings, "[[", 1)), ")")
df$final <- paste0(unlist(lapply(split_strings, "[[", 2)), ")")
This gives:
split final
1 49:49.65(24) 59:51.68 (23)
2 49:40.49(23) 59:57.73 (24)
3 50:21.82(25) 1:00:28.39 (25)
4 50:30.02(26) 1:00:41.55 (26)
Calling your dataframe df:
library(tidyr)
library(dplyr)
df %>%
  separate(`Split 5 at 37.1km`, into = c("Split 5 at 37.1km", "aux"), sep = "\\)") %>%
  mutate(`Final Time` = coalesce(if_else(`Final Time` != "", `Final Time`, NA_character_), paste0(aux, ")")),
         aux = NULL,
         `Split 5 at 37.1km` = paste0(`Split 5 at 37.1km`, ")"))
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
Time Behind Average Speed
1 +4:47.49 44.302
2 +4:53.54 44.228
3 +5:24.20 43.854
4 +5:37.36 43.696
You could use dplyr and stringr:
library(dplyr)
library(stringr)
data %>%
  mutate(`Final Time` = ifelse(`Final Time` == "", str_remove(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"), `Final Time`),
         `Split 5 at 37.1km` = str_extract(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"))
which returns
Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km
1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22) 29:17.44(24) 44:06.82(24)
2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18) 29:01.94(20) 43:51.67(23)
3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31) 29:58.88(28) 44:40.28(25)
4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32) 29:58.09(27) 44:42.74(26)
Split 5 at 37.1km Final Time Time Behind Average Speed
1 49:49.65(24) 59:51.68 (23) +4:47.49 44.302
2 49:40.49(23) 59:57.73 (24) +4:53.54 44.228
3 50:21.82(25) 1:00:28.39 (25) +5:24.20 43.854
4 50:30.02(26) 1:00:41.55 (26) +5:37.36 43.696
I like to use regex and stringr. Whilst there's some suboptimal code here, the key step is str_match(). Using it we can select the two substrings we want: the first time and the second time. If either time is missing, we get a missing value, so we can then fill in the columns based on where the missingness occurs.
The regex string is ^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$. Here we have four capture groups: the first and third capture the two whole times respectively, while the second and fourth match the optional hour component (this ensures that times over an hour are completely captured). Additionally, we allow an optional space before the final bracketed rank.
My code is as follows:
library(tidyverse)
data <- structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13",
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon",
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA",
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)",
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)",
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)",
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)",
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)",
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""),
`Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
)), class = "data.frame", row.names = c(NA, -4L))
# Take data and match the split column against the regex pattern
data |>
  mutate(match = map(`Split 5 at 37.1km`, ~unlist(str_match(., "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$")))) |>
  # Grab the strings that match the whole first and second/final times
  mutate(match1 = map(match, ~.[[2]]), match2 = map(match, ~.[[4]]), .keep = "unused") |>
  # Check where the NAs are and fill in the dataframe accordingly
  mutate(`Split 5 at 37.1km` = ifelse(is.na(match1), `Split 5 at 37.1km`, match1),
         `Final Time` = ifelse(is.na(match2), `Final Time`, match2), .keep = "unused")
#> Rank Race Number Name NOC Code Split 1 at 9.7km Split 2 at 15.0km
#> 1 23 13 FOSS Tobias S. NOR 13:47.65(22) 19:21.16(22)
#> 2 24 11 McNULTY Brandon USA 13:28.23(15) 19:04.80(18)
#> 3 25 29 BENNETT George NZL 14:05.46(30) 19:47.53(31)
#> 4 26 30 KUKRLE Michael CZE 14:05.81(32) 19:48.77(32)
#> Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km Final Time
#> 1 29:17.44(24) 44:06.82(24) 49:49.65(24) 59:51.68 (23)
#> 2 29:01.94(20) 43:51.67(23) 49:40.49(23) 59:57.73 (24)
#> 3 29:58.88(28) 44:40.28(25) 50:21.82(25) 1:00:28.39 (25)
#> 4 29:58.09(27) 44:42.74(26) 50:30.02(26) 1:00:41.55 (26)
#> Time Behind Average Speed
#> 1 +4:47.49 44.302
#> 2 +4:53.54 44.228
#> 3 +5:24.20 43.854
#> 4 +5:37.36 43.696
Created on 2021-07-28 by the reprex package (v2.0.0)
Note that in the above I use the base pipe |> (available from R 4.1 onwards); this can be replaced simply with the magrittr pipe %>% if you are on an earlier R version.
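For example, these two calls are interchangeable here (toy illustration):
mtcars |> head(2)    # base pipe, R >= 4.1
library(magrittr)
mtcars %>% head(2)   # magrittr pipe, same result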

Loop Output Stored as List

I have wide supervisory data where a single observation consists of a level 1 employee and their department all the way down to level 8. I use a loop with other commands to produce a list of all employees and the departments beneath them in long format, so that I can see which departments employees are responsible for at all levels. There may be a more elegant way to do this than a loop, but it works fine. Sample data (through level 3 for succinctness):
data <- tibble(
  LV1_Employee_Name = "Chuck", LV1_Employee_Nbr = "1", LV1_Department = "Tha Boss", LV1_Department_Nbr = "90",
  LV2_Employee_Name = c("Alex", "Alex", "Paul", "Paul", "Jennifer", "Jennifer"), LV2_Employee_Nbr = c("2", "2", "3", "3", "4", "4"),
  LV2_Department = c("Leadership", "Leadership", "Finance", "Finance", "Philanthropy", "Philanthropy"), LV2_Department_Nbr = c("91", "91", "92", "92", "93", "93"),
  LV3_Employee_Name = c("Dan", "Wendy", "Sarah", "Monique", "Miguel", "Brandon"), LV3_Employee_Nbr = c("2", "2", "3", "3", "4", "4"),
  LV3_Department = c("Analytics", "Pop Health", "Acounting", "Investments", "Yacht Aquisitions", "Golf Junkets"), LV3_Department_Nbr = c("94", "95", "96", "97", "98", "99"))
The loop below first produces six tibbles named level1_1, level1_2, level1_3, level2_2, level2_3, level3_3. Each tibble contains an employee name, number, and the department at the same department level or below. The code then lists and binds the rows of these tibbles with ls() and bind_rows(), then applies the distinct() command, and I've got what I need.
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
for (i in 1:3) {
  for (k in first_department:3) {
    assign(paste0("level", i, "_", k),
           setNames(distinct(as_tibble(c(data[, paste0("LV", i, "_", "Employee_Name")],
                                         data[, paste0("LV", i, "_", "Employee_Nbr")],
                                         data[, paste0("LV", k, "_", "Department")],
                                         data[, paste0("LV", k, "_", "Department_Nbr")]))),
                    data_colnames))
  }
  first_department <- first_department + 1
}
employees_departments <- distinct(bind_rows(mget(ls(pattern = "^level")))) %>%
  filter(is.na(Department) == FALSE)
rm(list = ls(pattern = "^level"))
What I'd like to do is, rather than produce an initial output of six tibbles, have the loop itself output the list. This will save me from having a pile of intermediate tibbles in my environment, which, I'm told, is not very "R-like".
Here is a revised version that stores the results in a list within your loop, using an index idx that is incremented each time through the loop. Afterwards, you can use bind_rows on this list to get the complete result.
library(tidyverse)
idx <- 1
first_department <- 1
data_colnames <- c("Employee", "Employee_Id", "Department", "Department_Number")
data_lst <- list()
for (i in 1:3) {
  for (k in first_department:3) {
    data_lst[[idx]] <- setNames(
      distinct(as_tibble(
        c(data[, paste0("LV", i, "_", "Employee_Name")],
          data[, paste0("LV", i, "_", "Employee_Nbr")],
          data[, paste0("LV", k, "_", "Department")],
          data[, paste0("LV", k, "_", "Department_Nbr")]))),
      data_colnames)
    idx <- idx + 1
  }
  first_department <- first_department + 1
}

distinct(bind_rows(data_lst)) %>%
  filter(!is.na(Department))
Output
Employee Employee_Id Department Department_Number
<chr> <chr> <chr> <chr>
1 Chuck 1 Tha Boss 90
2 Chuck 1 Leadership 91
3 Chuck 1 Finance 92
4 Chuck 1 Philanthropy 93
5 Chuck 1 Analytics 94
6 Chuck 1 Pop Health 95
7 Chuck 1 Acounting 96
8 Chuck 1 Investments 97
9 Chuck 1 Yacht Aquisitions 98
10 Chuck 1 Golf Junkets 99
11 Alex 2 Leadership 91
12 Paul 3 Finance 92
13 Jennifer 4 Philanthropy 93
14 Alex 2 Analytics 94
15 Alex 2 Pop Health 95
16 Paul 3 Acounting 96
17 Paul 3 Investments 97
18 Jennifer 4 Yacht Aquisitions 98
19 Jennifer 4 Golf Junkets 99
20 Dan 2 Analytics 94
21 Wendy 2 Pop Health 95
22 Sarah 3 Acounting 96
23 Monique 3 Investments 97
24 Miguel 4 Yacht Aquisitions 98
25 Brandon 4 Golf Junkets 99
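If you would rather avoid the explicit index bookkeeping altogether, here is a sketch of the same idea using purrr: expand_grid() enumerates the same (i, k) level pairs as the nested loops, and pmap_dfr() row-binds the per-pair tibbles directly (names and structure follow the answer above; treat it as illustrative rather than definitive).
library(tidyverse)

# The (i, k) combinations the nested loops visit (k >= i)
pairs <- expand_grid(i = 1:3, k = 1:3) %>% filter(k >= i)

employees_departments <- pmap_dfr(pairs, function(i, k) {
  setNames(
    distinct(as_tibble(c(data[, paste0("LV", i, "_Employee_Name")],
                         data[, paste0("LV", i, "_Employee_Nbr")],
                         data[, paste0("LV", k, "_Department")],
                         data[, paste0("LV", k, "_Department_Nbr")]))),
    c("Employee", "Employee_Id", "Department", "Department_Number"))
}) %>%
  distinct() %>%
  filter(!is.na(Department))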

Multiplication in R of specific portion of a dataframe

I have a dataset from 1966 to 2002. I want to change the units (multiply the values by 0.305) of some of the values in the dataframe, from 1967 to 1973, and want the rest of the values to remain as they are.
Sample Data
Date A01
1 1966/05/07 4.870000
2 1966/05/08 4.918333
3 1966/05/09 4.892000
4 1966/05/10 4.858917
5 1966/05/11 4.842000
6 1967/03/18 4.89517
7 1966/05/07 4.870000
8 1966/05/08 4.918333
9 1966/05/09 4.892000
10 2000/05/10 2.858917
11 2001/05/11 1.842000
12 2002/03/18 0.89517
Desired Outcome
Date A01
1 1966/05/07 1.4843
2 1966/05/08 1.4990
3 1966/05/09 1.49108
4 1966/05/10 1.480992
5 1966/05/11 1.48565
6 1967/03/18 1.4920
7 1966/05/07 1.4843
8 1966/05/08 1.4991
9 1966/05/09 1.4910
10 2000/05/10 2.858917
11 2001/05/11 1.842000
12 2002/03/18 0.89517
An option in base R would be to get the 'year' part from the Date column, create a logical index ('i1') for the target years, subset the 'A01' column, multiply by 0.305 and assign it back to the original column:
i1 <- as.numeric(format(as.Date(df1$Date, '%Y/%m/%d'), "%Y")) %in% 1966:1973
df1$A01[i1] <- df1$A01[i1] * 0.305
data
df1 <- structure(list(Date = c("1966/05/07", "1966/05/08", "1966/05/09",
"1966/05/10", "1966/05/11", "1967/03/18", "1966/05/07", "1966/05/08",
"1966/05/09", "2000/05/10", "2001/05/11", "2002/03/18"), A01 = c(4.87,
4.918333, 4.892, 4.858917, 4.842, 4.89517, 4.87, 4.918333, 4.892,
2.858917, 1.842, 0.89517)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Convert Date to Date class, extract the year from it, and multiply A01 by 0.305 if the year is between 1967 and 1974, or by 1 otherwise.
library(dplyr)
library(lubridate)
df %>%
  mutate(Date = ymd(Date),
         A01 = A01 * c(1, 0.305)[(between(year(Date), 1967, 1974)) + 1])
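The c(1, 0.305)[... + 1] part is a lookup trick: the logical condition becomes an index of 1 (FALSE) or 2 (TRUE) into the multiplier vector. A small illustration:
cond <- c(TRUE, FALSE, TRUE)   # e.g. "year falls in the target range"
cond + 1                       # 2 1 2
c(1, 0.305)[cond + 1]          # 0.305 1.000 0.305 -> per-row multiplier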
Another base R option using ifelse
transform(
df,
A01 = A01 * ifelse(as.numeric(format(as.Date(Date), "%Y")) %in% 1967:1973, 0.305, 1)
)
