I have data in a wide format containing a Category column listing types of transport and then columns with the name of that type of transport and totals.
I want to create the Calc column, where each row is summed across the columns but the value where the category and the column name match is excluded.
So for the Car row, the sum would be Train + Bus. For the Train row, it would be Car + Bus.
If the Category column contains a type of transport that isn't listed as a column name, then there should be an NA in the Calc column.
The dataframe is as below, with the Calc column with the results added as expected.
Category<-c("Car","Train","Bus","Bicycle")
Car<-c(9,15,25,5)
Train<-c(8,22,1,7)
Bus<-c(5,2,4,8)
Calc<-c(13, 17,26,NA)
df<-data.frame(Category,Car,Train,Bus,Calc, stringsAsFactors = FALSE)
Can anyone suggest how to add the Calc column as per above? Ideally a vectorised calculation without a loop.
Here is an alternative in base R. You can use apply row-wise through your data.frame. If the Category is one of your columns, then calculate the sum by excluding both the Category column as well as the column corresponding to the Category column. Otherwise, use NA.
df$Calc <- apply(
  df,
  1,
  \(x) {
    if (x["Category"] %in% names(x)) {
      sum(as.numeric(x[setdiff(names(x), c(x["Category"], "Category"))]))
    } else {
      NA_integer_
    }
  }
)
df
Output
Category Car Train Bus Calc
1 Car 9 8 5 13
2 Train 15 22 2 17
3 Bus 25 1 4 26
4 Bicycle 5 7 8 NA
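One caveat with the apply() version, as far as I can tell: it sums every column other than Category and the matching one, so if df already contains a Calc column (as the data frame in the question does), that column is added into the total as well. A minimal sketch of a variant that names the value columns explicitly is below. value_cols is an illustrative name I introduced, not part of the original answer, and it assumes those three columns are the only ones that should enter the sum.

```r
df <- data.frame(
  Category = c("Car", "Train", "Bus", "Bicycle"),
  Car   = c(9, 15, 25, 5),
  Train = c(8, 22, 1, 7),
  Bus   = c(5, 2, 4, 8),
  stringsAsFactors = FALSE
)

# Assumption: these are the only columns that should enter the sum.
value_cols <- c("Car", "Train", "Bus")

df$Calc <- vapply(seq_len(nrow(df)), function(i) {
  own <- df$Category[i]
  # Categories without a matching column get NA.
  if (!own %in% value_cols) return(NA_real_)
  # Sum the value columns other than the row's own category.
  sum(unlist(df[i, setdiff(value_cols, own)]))
}, numeric(1))

df$Calc
```

Because the columns are picked by name, extra columns such as an old Calc or an ID column are simply ignored.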
Here is a tidyverse solution:
df<-data.frame(Category,Car,Train,Bus, stringsAsFactors = FALSE)
library(dplyr)
library(tidyr)
df |>
  pivot_longer(cols = !Category,
               names_to = "cat2",
               values_to = "value") |>
  group_by(Category) |>
  mutate(value = case_when((Category %in% cat2) ~ value,
                           TRUE ~ NA_real_)) |>
  filter(cat2 != Category) |>
  summarize(Calc = sum(value)) |>
  left_join(df)
# A tibble: 4 × 5
Category Calc Car Train Bus
<chr> <dbl> <dbl> <dbl> <dbl>
1 Bicycle NA 5 7 8
2 Bus 26 25 1 4
3 Car 13 9 8 5
4 Train 17 15 22 2
Using rowSums and a matrix for indexing.
# Example data
Category <- c("Car","Train","Bus","Bicycle")
Car <- c(9,15,25,5)
Train <- c(8,22,1,7)
Bus <- c(5,2,4,8)
df <- data.frame(Category,Car,Train,Bus, stringsAsFactors = FALSE)
# add the "Calc" column
df$Calc <- rowSums(df[,2:4]) - df[,2:4][matrix(c(1:nrow(df), match(df$Category, colnames(df)[2:4])), ncol = 2)]
df
#> Category Car Train Bus Calc
#> 1 Car 9 8 5 13
#> 2 Train 15 22 2 17
#> 3 Bus 25 1 4 26
#> 4 Bicycle 5 7 8 NA
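To make the one-liner easier to follow, here is the same idea unpacked into steps. This is only a sketch; the intermediate names (m, own_col, idx, own_value) are mine. A two-column matrix used as an index picks one cell per row (row number, column number), and a row of the index that contains NA yields NA, which is what produces the NA for Bicycle.

```r
df <- data.frame(
  Category = c("Car", "Train", "Bus", "Bicycle"),
  Car   = c(9, 15, 25, 5),
  Train = c(8, 22, 1, 7),
  Bus   = c(5, 2, 4, 8),
  stringsAsFactors = FALSE
)

# Numeric matrix of the value columns only.
m <- as.matrix(df[, c("Car", "Train", "Bus")])

# Column position of each row's own category; NA when there is no such column.
own_col <- match(df$Category, colnames(m))

# One (row, column) pair per row of the data.
idx <- cbind(seq_len(nrow(m)), own_col)

# Value to exclude for each row; an NA in idx gives NA here.
own_value <- m[idx]

# Row total minus the row's own category value.
Calc <- rowSums(m) - own_value
Calc
```

Selecting the columns by name rather than as 2:4 is a small design choice that keeps the code working if the column order changes.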
Related
I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat, similar to these examples: "In data.table: iterating over the rows of another data.table" and "R multiply columns by values in second dataframe". However, I am hoping to operate on these columns with an extended expression for every value in its corresponding row in cb. The solution should be applicable to a large dataset. I have created this for-loop so far.
dat.loop <- dat
for (i in seq_len(nrow(cb))) {
  # create new columns named after the Digits column of `cb`
  # (paste0() has no `sep` argument, so the "." goes into the prefix)
  dat.loop[paste0("disp.", cb$Digits[i])] <-
    # some operation using every value in a column in `dat` with its corresponding row in `cb`
    (dat.loop[, cb$Digits[i]] - cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the data that I appended to dat in dat.loop, applying a similar for-loop, and then perform yet another operation on those values. My dataset is very large, and I imagine my use of for-loops will become cumbersome. I am wondering:
Would another method, such as data.table or tidyverse, improve efficiency?
How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code to perform operations on columns in dat with corresponding rows in cb. Ideally, I would split my for-loop into multiple functions that would, for example, avoid indexing into cb for the same values over and over again or appending unnecessary data to my dataframe, but I'm not really sure how to do this.
Any help is appreciated!
EDIT:
I've modified the code @Desmond provided to make it more generic, since dat and cb will come from user-supplied files, and dat can have a varying number of columns and column names that I will be operating on (columns in dat will always start with "Digits_" and will be specified in the "Digits" column of cb).
library(tidytable)
results <- dat %>%
  crossing.(cb) %>%
  mutate_rowwise.(disp = (get(`Digits`) - y) * x) %>%
  pivot_wider.(names_from = Digits,
               values_from = disp,
               names_prefix = "disp_")
results2 <- results %>%
  fill.(starts_with("disp"), .direction = c("downup"), .by = 'random') %>%
  select.(-c(x, y)) %>%
  distinct.()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Here's a tidyverse solution:
crossing generates combinations from both datasets
case_when to apply your logic
pivot_wider, filter and bind_cols to clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), and so on.
library(tidyverse)
dat <- data.frame(
Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24
)
cb <- data.frame(
Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4
)
results <- dat |>
  crossing(cb) |>
  mutate(disp = case_when(
    Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
    Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
  )) |>
  pivot_wider(names_from = Digits,
              values_from = disp,
              names_prefix = "disp_")
results |>
  filter(!is.na(disp_Digits_Lower)) |>
  select(-c(x, y, disp_Digits_Upper)) |>
  bind_cols(results |>
              filter(!is.na(disp_Digits_Upper)) |>
              select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)
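For comparison, a base R sketch that avoids both the explicit loop and the cross join. This is a different technique from the answer above, not a restatement of it: it treats the selected columns as a matrix and uses sweep() to apply each row of cb to its matching column. The names cols, vals and disp are illustrative, and the sketch assumes every entry of cb$Digits is a column name in dat.

```r
dat <- data.frame(Digits_Lower = 1:5, Digits_Upper = 6:10, random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"), x = 1:2, y = 3:4)

# Columns to operate on, in the same order as the rows of cb.
cols <- as.character(cb$Digits)
vals <- as.matrix(dat[cols])

# Column j gets (value - y[j]) * x[j]; MARGIN = 2 means "per column".
disp <- sweep(sweep(vals, 2, cb$y, "-"), 2, cb$x, "*")
colnames(disp) <- paste0("disp_", cols)

result <- cbind(dat, disp)
result
```

Since cols is taken from cb, this works for any number of "Digits_" columns without writing one case per column.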
My data looks like this:
I have 5 different levels with nested data:
Categories (e.g., "Countries")
Countries (e.g., "USA")
Cities (e.g., "New York")
Counties (e.g., "Manhattan")
Places (e.g., "Times Square")
Each row in my df (except for LVL 1 entries) is linked to a parent (a level above). For example: Times Square -> Manhattan -> New York -> USA -> Countries
For each Name, there is a corresponding n_values column, indicating the number of data entries.
My goal: I want to form groups with >=8 data entries. For groups with n_values <8, I want to merge them with the Parent column a level above. This new allocation should be expressed in a new variable new_group.
It is important to start in the lower levels first! For example, there are only 2 data entries for "Times Square", so we want to merge those entries with the parent "Manhattan". Manhattan now has 3+2=5 data entries. This is still <8, so we merge those 5 entries with the next parent "New York", which now has 16+5=21 entries, so we're good.
I have tried to write a loop like this:
for (i in 5:1) {
  df %>% filter(Level == i) %>% group_by(ID) %>% summarize(n = n())
}
However, I fail to merge that information with the original data to create the dataset I want. Can anyone help?
The data:
structure(list(ID = c(19,12,3,41,50,6,77,83,9,105,11),
Parent = c(NA,19,12,3,41,12,19,77,77,19,105),
Level = c(1,2,3,4,5,3,2,3,3,2,3),
Name = c("Countries","USA","New York","Manhattan","Times Square",
"Boston","UK","London","Oxford","Canada","Vancouver"),
n_values = c(NA,17,16,3,2,13,12,7,8,9,8)),
class = "data.frame",
row.names = c(NA, -11L))
Let's assume that your data is stored in a data frame called df. The most straightforward approach would be to first sort the rows of the table by "Level" in descending order and set "new_group" to the values of "Name". We'll also track the per-group totals in a column called "new_values". Then iterate through the rows until a row with new_values < 8 is encountered, at which point that row's "new_group" is changed to that of its parent, and its "Parent" is also updated to match its parent's "Parent". At that point, the row loop restarts. The outer loop terminates when no "new_group"s have new_values < 8:
library(tidyverse)
df_sorted <- df %>%
  arrange(desc(Level)) %>%
  mutate(new_group = Name) %>%
  group_by(new_group) %>%
  mutate(new_values = sum(n_values)) %>%
  ungroup()
while (any(df_sorted$new_values < 8, na.rm = TRUE)) {
  for (i in seq_len(nrow(df_sorted))) {
    if (df_sorted$new_values[i] < 8) {
      to_id <- df_sorted$Parent[i]
      to_row <- which(df_sorted$ID == to_id)
      df_sorted$new_group[i] <- df_sorted$Name[to_row]
      df_sorted$Parent[i] <- df_sorted$Parent[to_row]
      df_sorted <- df_sorted %>%
        group_by(new_group) %>%
        mutate(new_values = sum(n_values)) %>%
        ungroup()
      break # terminate the for loop immediately and return to the outer while loop
    }
  }
}
ID Parent Level Name n_values new_group new_values
<dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 50 12 5 Times Square 2 New York 21
2 41 12 4 Manhattan 3 New York 21
3 3 12 3 New York 16 New York 21
4 6 12 3 Boston 13 Boston 13
5 83 19 3 London 7 UK 19
6 9 77 3 Oxford 8 Oxford 8
7 11 105 3 Vancouver 8 Vancouver 8
8 12 19 2 USA 17 USA 17
9 77 19 2 UK 12 UK 19
10 105 19 2 Canada 9 Canada 9
11 19 NA 1 Countries NA Countries NA
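As a rough alternative sketch (not the method above, and only checked against the data from the question), the same bottom-up idea can be done in a single pass in base R: visit rows from the deepest level upward, push any running total below the threshold onto the parent, and afterwards follow the chain of merges to find each row's final group. The names min_n, total, parent_row, merged_into and final_row are all illustrative. It assumes every child has a larger Level than its parent and treats a missing n_values as 0.

```r
df <- data.frame(
  ID = c(19, 12, 3, 41, 50, 6, 77, 83, 9, 105, 11),
  Parent = c(NA, 19, 12, 3, 41, 12, 19, 77, 77, 19, 105),
  Level = c(1, 2, 3, 4, 5, 3, 2, 3, 3, 2, 3),
  Name = c("Countries", "USA", "New York", "Manhattan", "Times Square",
           "Boston", "UK", "London", "Oxford", "Canada", "Vancouver"),
  n_values = c(NA, 17, 16, 3, 2, 13, 12, 7, 8, 9, 8),
  stringsAsFactors = FALSE
)

min_n <- 8                                            # threshold from the question
total <- ifelse(is.na(df$n_values), 0, df$n_values)   # running total per row
parent_row <- match(df$Parent, df$ID)                 # row index of each parent
merged_into <- rep(NA_integer_, nrow(df))             # parent row, if merged

# Deepest levels first, so children are settled before their parent is checked.
for (i in order(df$Level, decreasing = TRUE)) {
  p <- parent_row[i]
  if (!is.na(p) && total[i] < min_n) {
    total[p] <- total[p] + total[i]
    merged_into[i] <- p
  }
}

# Follow the merge chain upward to the row that finally holds the entries.
final_row <- vapply(seq_len(nrow(df)), function(i) {
  while (!is.na(merged_into[i])) i <- merged_into[i]
  i
}, integer(1))

df$new_group <- df$Name[final_row]
df$new_values <- total[final_row]
df[, c("Name", "n_values", "new_group", "new_values")]
```

On this data it assigns Times Square and Manhattan to New York (21 entries) and London to UK (19 entries), in line with the output above; the Countries row ends up with 0 rather than NA because of the NA-as-0 assumption.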
Edit: The version below adds a "touched" column to track rows that have been modified in the loop, and also adds some checks for NA values. For the data set used above, it produces an identical result to the previous version. It also appears to work correctly on the data set below.
df <- structure(list(ID = c(19,12,3,41,50,6,77,83,9,105,11), Parent = c(NA,19,12,3,41,12,19,77,77,19,105), Level = c(1,2,3,4,5,3,2,3,3,2,3), Name = c("Countries","USA","New York","Manhattan","Times Square", "Boston","UK","London","Oxford","Canada","Vancouver"), n_values = c(NA,0,0,3,2,0,12,7,8,9,8)), class = "data.frame", row.names = c(NA, -11L))
df_sorted <- df %>%
  arrange(desc(Level)) %>%
  mutate(new_group = Name) %>%
  group_by(new_group) %>%
  mutate(
    new_values = sum(n_values),
    touched = is.na(n_values) | n_values >= 8
  ) %>%
  ungroup()
while (any(!df_sorted$touched)) {
  for (i in seq_len(nrow(df_sorted))) {
    if (df_sorted$new_values[i] < 8 & !is.na(df_sorted$Parent[i]) & any(!df_sorted$touched)) {
      to_id <- df_sorted$Parent[i]
      to_row <- which(df_sorted$ID == to_id)
      df_sorted$new_group[i] <- df_sorted$Name[to_row]
      df_sorted$Parent[i] <- df_sorted$Parent[to_row]
      df_sorted$touched[i] <- TRUE
      df_sorted <- df_sorted %>%
        group_by(new_group) %>%
        mutate(new_values = sum(n_values, na.rm = TRUE)) %>%
        ungroup()
      break # terminate the for loop immediately and return to the outer while loop
    }
  }
}
ID Parent Level Name n_values new_group new_values touched
<dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <lgl>
1 50 NA 5 Times Square 2 Countries 5 TRUE
2 41 NA 4 Manhattan 3 Countries 5 TRUE
3 3 NA 3 New York 0 Countries 5 TRUE
4 6 NA 3 Boston 0 Countries 5 TRUE
5 83 19 3 London 7 UK 19 TRUE
6 9 77 3 Oxford 8 Oxford 8 TRUE
7 11 105 3 Vancouver 8 Vancouver 8 TRUE
8 12 NA 2 USA 0 Countries 5 TRUE
9 77 19 2 UK 12 UK 19 TRUE
10 105 19 2 Canada 9 Canada 9 TRUE
11 19 NA 1 Countries NA Countries 5 TRUE
I have a range of columns containing the numerators of certain diseases, and a range of columns containing the denominators of the same diseases. I want to loop through each of the numerator columns dividing by the appropriate denominator column creating a percentage column for each disease.
All my columns follow the same name format: disease1_num, disease2_num, disease1_den, disease2_den.
I want to divide disease1_num/disease1_den*100 to create disease1_perc, then disease2_num/disease2_den*100 to create disease2_perc etc.
There are approximately 20 diseases in my dataset.
I am mainly using tidyverse commands.
I have tried using gather to create two datasets, one with the numerators and one with the denominators. I then extracted the disease name, joined them together, calculated the percentage, spread the dataset again, and added this back to the original dataset. This works, but it is a bit long-winded; ideally I would like to do this in place in the original dataset.
# A tibble: 3 x 5
id disease1_num disease2_num disease1_den disease2_den
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 4 12 15
2 2 8 6 14 16
3 3 10 8 17 18
df_num <- df %>%
  select(id, disease1_num:disease2_num) %>%
  gather(key = "num_indicator", value = "num", disease1_num:disease2_num) %>%
  mutate(indicator = str_remove(num_indicator, '_num'))
df_den <- df %>%
  select(id, disease1_den:disease2_den) %>%
  gather(key = "den_indicator", value = "den", disease1_den:disease2_den) %>%
  mutate(indicator = str_remove(den_indicator, '_den'))
df_numden <- left_join(df_num, df_den, c('id', 'indicator'))
df_perc <- df_numden %>%
  mutate(perc_indicator = str_replace(den_indicator, 'den', 'perc'),
         perc = num / den * 100) %>%
  select(id, perc_indicator:perc) %>%
  spread(perc_indicator, perc)
df_final <- left_join(df, df_perc, 'id')
We can use grep (with value = TRUE) to get the column names and divide directly.
num_cols <- grep("num$", names(df), value = TRUE)
den_cols <- grep("den$", names(df), value = TRUE)
df[sub("_num","_perc", num_cols)]<- df[num_cols]/df[den_cols] * 100
df
# id disease1_num disease2_num disease1_den disease2_den disease1_perc disease2_perc
#1 1 5 4 12 15 41.7 26.7
#2 2 8 6 14 16 57.1 37.5
#3 3 10 8 17 18 58.8 44.4
Note that this relies on num_cols and den_cols having the same length and listing the diseases in the same order, because the division pairs the columns by position.
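One way to guard against a mismatch, sketched under the assumption that every numerator column is named <disease>_num and has a partner named <disease>_den: build the denominator names from the numerator names instead of grepping for them separately, and stop early if any partner is missing. perc_cols is just an illustrative name.

```r
df <- data.frame(
  id = 1:3,
  disease1_num = c(5, 8, 10),
  disease2_num = c(4, 6, 8),
  disease1_den = c(12, 14, 17),
  disease2_den = c(15, 16, 18)
)

num_cols <- grep("_num$", names(df), value = TRUE)

# Derive the matching denominator and output names from the numerators,
# so each pair lines up regardless of the column order in df.
den_cols  <- sub("_num$", "_den", num_cols)
perc_cols <- sub("_num$", "_perc", num_cols)

# Fail early if a numerator has no denominator column.
stopifnot(all(den_cols %in% names(df)))

df[perc_cols] <- df[num_cols] / df[den_cols] * 100
df
```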
I have many columns in a table where there is missing data. I want to be able to pull in the information from another table if the data is missing for a particular record based on ID. I thought about possibly joining the two tables and writing a for loop where if column X is NA then pull in information from column Y, however, I have many columns and would require writing many of these conditions.
I want to create a function or a loop where I can pass in the data column names with the missing data and be able to pass in the column name from another table to get the information from.
Reproducible Example:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,NA,NA,1968,1992)
Month <- c(1,NA,8,12,NA,5)
Day <- c(3,NA,NA,NA,NA,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
ID <- c(2,3,4,5)
Year <- c(NA,1994,1967,NA)
Month <- c(4,NA,NA,10)
Day <- c(23,12,16,9)
Old_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Expected Output:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,1994,1967,1968,1992)
Month <- c(1,4,8,12,10,5)
Day <- c(3,23,12,16,9,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Use rbind to combine the two data frames, then group_by with summarise_all to keep the first non-missing value per ID:
library(dplyr)
rbind(New_Data, Old_Data) %>%
  group_by(ID) %>%
  dplyr::summarise_all(function(x) x[!is.na(x)][1])
# A tibble: 6 x 4
ID Year Month Day
<dbl> <dbl> <dbl> <dbl>
1 1 1990 1 3
2 2 1987 4 23
3 3 1994 8 12
4 4 1967 12 16
5 5 1968 10 9
6 6 1992 5 30
Another option uses dplyr::left_join and dplyr::coalesce:
library(dplyr)
New_Data %>% left_join(Old_Data, by="ID") %>%
mutate(Year = coalesce(Year.x, Year.y),
Month = coalesce(Month.x, Month.y),
Day = coalesce(Day.x, Day.y)) %>%
select(ID, Year, Month, Day)
# ID Year Month Day
# 1 1 1990 1 3
# 2 2 1987 4 23
# 3 3 1994 8 12
# 4 4 1967 12 16
# 5 5 1968 10 9
# 6 6 1992 5 30
Here's a solution using only base functions from another SO question
I modified it to your needs (created a function, and made an argument for the key column name):
fill_missing_data = function(df1, df2, keyColumn) {
  # columns present in both data frames, excluding the key
  commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
  commonNames <- commonNames[commonNames != keyColumn]
  # join on the key passed in (previously hard-coded as "ID")
  dfmerge <- merge(df1, df2, by = keyColumn, all = TRUE)
  for (i in commonNames) {
    left <- paste(i, ".x", sep = "")
    right <- paste(i, ".y", sep = "")
    # where the df1 value is missing, take the df2 value
    dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
    dfmerge[right] <- NULL
    colnames(dfmerge)[colnames(dfmerge) == left] <- i
  }
  return(dfmerge)
}
result = fill_missing_data(New_Data, Old_Data, "ID")
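If the key is unique in the second table, the merge step can arguably be skipped altogether. The sketch below looks up each row's position in the source table with match() and fills only the missing cells, column by column. fill_from is a hypothetical helper name, and unlike merge(..., all = TRUE) it never adds rows for IDs that exist only in the source table.

```r
New_Data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Year = c(1990, 1987, NA, NA, 1968, 1992),
  Month = c(1, NA, 8, 12, NA, 5),
  Day = c(3, NA, NA, NA, NA, 30)
)
Old_Data <- data.frame(
  ID = c(2, 3, 4, 5),
  Year = c(NA, 1994, 1967, NA),
  Month = c(4, NA, NA, 10),
  Day = c(23, 12, 16, 9)
)

# Hypothetical helper: fill NAs in `target` from `source`, matched on `key`.
# Assumes `key` values are unique in `source`.
fill_from <- function(target, source, key, cols) {
  # Row of `source` for each row of `target` (NA when the key is absent).
  idx <- match(target[[key]], source[[key]])
  for (col in cols) {
    miss <- is.na(target[[col]])
    # Unmatched keys index with NA and therefore stay NA.
    target[[col]][miss] <- source[[col]][idx[miss]]
  }
  target
}

result <- fill_from(New_Data, Old_Data, key = "ID",
                    cols = c("Year", "Month", "Day"))
result
```

The cols argument is where you pass the names of the columns with missing data, which is roughly the interface the question asks for.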
I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I count each unique value of idB only once? Is there anything else I can use instead of group_by in a dplyr %>% pipe?
I can't use summarise as I need to keep the values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
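We need unique with match: within each group, match(unique(idB), idB) gives the position of the first row for each idB, so each valueC is summed only once.
data %>%
  group_by(grpA) %>%
  mutate(ind = sum(valueC[match(unique(idB), idB)]))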
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
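To see what the match/unique step does, here is a small base R sketch on the first group only, followed by a base R version for the whole data set. The variable names (idB_1, valueC_1, first_pos, dup) are illustrative, and the ave() line is my own alternative rather than part of the answer above.

```r
data <- data.frame(
  grpA = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Group 1 only, to inspect the indexing.
idB_1 <- data$idB[data$grpA == 1]
valueC_1 <- data$valueC[data$grpA == 1]

# Position of the first row for each distinct idB.
first_pos <- match(unique(idB_1), idB_1)
sum(valueC_1[first_pos])

# !duplicated() selects the same first occurrences.
sum(valueC_1[!duplicated(idB_1)])

# Whole data set in base R: zero out repeated (grpA, idB) rows, then take
# the per-group sum with ave(), which keeps one value per original row.
dup <- duplicated(data[c("grpA", "idB")])
data$newCol <- ave(ifelse(dup, 0, data$valueC), data$grpA, FUN = sum)
data$newCol
```

All rows, including otherD, are kept because nothing is summarised away.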
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
  distinct(grpA, idB, .keep_all = TRUE) %>%
  group_by(grpA) %>%
  summarise(newCol = sum(valueC)) %>%
  left_join(data, ., by = 'grpA')