Hi, I am trying to figure out how to match data frames by column name and then rename the columns. If a column name has no match, I would want to drop that column instead.
For example, I would use this main dataset, call it DF1:
Name    Reference  Good  Fair  Bad  Great  Poor
George  Hill       34    21    33   21     32
Frank   Stairs     29    28    29   30     29
Bertha  Trail      25    25    24   21     26
Then another DF, call it DF2, that allows me to replace the column names of DF1:
Name   Adjusted_Name
Good   good_run
Great  very_great_work
Bad    bad run
Fair   fair run decent
Essentially, the words being substituted don't follow any pattern. I want to match the first column of DF2 against the column names of DF1: if a column name of DF1 matches a value in DF2$Name, I replace that name with the corresponding row of DF2$Adjusted_Name. If there is no match, that column of DF1 is dropped.
So the final goal would be to achieve:
Name    Reference  good_run  fair run decent  bad run  very_great_work
George  Hill       34        21               33       21
Frank   Stairs     29        28               29       30
Bertha  Trail      25        25               24       21
In this case, "Poor" was dropped because it has no match in DF2$Name.
How should I go about this? How would I account for there being thousands of columns? Does that change anything in how I code? I am a bit new to R and would appreciate any tips. Thank you!
If you are open to a tidyverse solution, you could use
library(dplyr)
library(tibble)
df %>%
  rename_with(~ deframe(df2)[.x], .cols = df2$Name) %>%
  select(Name, Reference, any_of(df2$Adjusted_Name))
This returns
# A tibble: 3 x 6
Name Reference good_run very_great_work bad_run fair_run_decent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 George Hill 34 21 33 21
2 Frank Stairs 29 30 29 28
3 Bertha Trail 25 21 24 25
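If you prefer base R (and regarding your question about thousands of columns: this is just a named-vector lookup, so the number of columns is not a problem), here is a rough sketch of the same idea, assuming df and df2 as in the Data section below:
lookup <- setNames(df2$Adjusted_Name, df2$Name)                      # old name -> new name
out <- df[, c("Name", "Reference", intersect(names(df), df2$Name))]  # keep only matched columns
hit <- names(out) %in% names(lookup)                                 # which columns have a match
names(out)[hit] <- lookup[names(out)[hit]]
out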
Data
df <- structure(list(Name = c("George", "Frank", "Bertha"), Reference = c("Hill",
"Stairs", "Trail"), Good = c(34, 29, 25), Fair = c(21, 28, 25
), Bad = c(33, 29, 24), Great = c(21, 30, 21), Poor = c(32, 29,
26)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Reference = structure(list(), class = c("collector_character",
"collector")), Good = structure(list(), class = c("collector_double",
"collector")), Fair = structure(list(), class = c("collector_double",
"collector")), Bad = structure(list(), class = c("collector_double",
"collector")), Great = structure(list(), class = c("collector_double",
"collector")), Poor = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Name = c("Good", "Great", "Bad", "Fair"), Adjusted_Name = c("good_run",
"very_great_work", "bad_run", "fair_run_decent")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Adjusted_Name = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Try the following - using the list of adjusted names, you can grep the list of desired words against column names and subset the data frame on it:
Data
df <- read.table(header = TRUE, text = "Name Reference Good Fair Bad Great Poor
George Hill 34 21 33 21 32
Frank Stairs 29 28 29 30 29
Bertha Trail 25 25 24 21 26")
adj_name <- c("good_run","very_great_work","bad run","fair run decent")
Index the columns with grep, matching the string of desired words against the lowercased column names (note the tolower()):
desired_words <- paste(unlist(strsplit(adj_name, "_| ")), collapse = "|")
df[,c(1:2,grep(desired_words, tolower(names(df))))]
Output
# Name Reference Good Fair Bad Great
#1 George Hill 34 21 33 21
#2 Frank Stairs 29 28 29 30
#3 Bertha Trail 25 25 24 21
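If you also need to apply the adjusted names afterwards (the final goal in the question), one possible follow-up sketch, reusing the df2 lookup table defined in the previous answer:
out <- df[, c(1:2, grep(desired_words, tolower(names(df))))]
idx <- match(names(out), df2$Name)                 # NA where there is no adjusted name
names(out)[!is.na(idx)] <- df2$Adjusted_Name[idx[!is.na(idx)]]
out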
I have a table with prefixes (here in csv format):
PREFIX,LABEL
A,Infectious diseases
B,Infectious diseases
C,Tumor
D1,Tumor
D2,Tumor
D31,Tumor
D32,Tumor
D33,Blood disorder
D4,Blood disorder
D5,Blood disorder
And I want to join it with this one:
AGE,DEATH_CODE
67,A02
85,D318
75,C007+X
62,D338
To get obviously:
AGE,LABEL
67,Infectious diseases
85,Tumor
75,Tumor
62,Blood disorder
I know how to do that with SQL and LIKE but not with tidyverse left_join or base R.
Dput of data
Table 1: CIM_CODES
structure(list(PREFIX = c("A", "B", "C", "D1", "D2", "D31", "D32",
"D33", "D4", "D5"), LABEL = c("Infectious diseases", "Infectious diseases",
"Tumor", "Tumor", "Tumor", "Tumor", "Tumor", "Blood disorder",
"Blood disorder", "Blood disorder")), row.names = c(NA, -10L), spec = structure(list(
cols = list(PREFIX = structure(list(), class = c("collector_character",
"collector")), LABEL = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000002527d306190>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Table 2: DEATH_CAUSES
structure(list(AGE = c(67, 85, 75, 62), DEATH_CODE = c("A02",
"D318", "C007+X", "D338")), row.names = c(NA, -4L), spec = structure(list(
cols = list(AGE = structure(list(), class = c("collector_double",
"collector")), DEATH_CODE = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x0000025273898c60>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
You could do a partial string match, picking for each death code the prefix with the lowest string distance:
library(tidyverse)
DEATH_CAUSES |>
  mutate(LABEL = map_chr(DEATH_CODE,
                         ~ CIM_CODES$LABEL[
                           which.min(stringdist::stringdist(.x, CIM_CODES$PREFIX))
                         ]))
#> # A tibble: 4 x 3
#> AGE DEATH_CODE LABEL
#> <dbl> <chr> <chr>
#> 1 67 A02 Infectious diseases
#> 2 85 D318 Tumor
#> 3 75 C007+X Tumor
#> 4 62 D338 Blood disorder
UPDATE
Not using the stringdist package, as requested:
library(tidyverse)
get_match <- function(code, prefix, target){
  map(code, \(x){
    map(prefix, \(y){
      grepl(paste0("^", y), x)
    })
  }) |>
    map_chr(\(z) target[unlist(z) |> which()])
}

DEATH_CAUSES |>
  mutate(LABEL = get_match(DEATH_CAUSES$DEATH_CODE,
                           CIM_CODES$PREFIX,
                           CIM_CODES$LABEL))
#> # A tibble: 4 x 3
#> AGE DEATH_CODE LABEL
#> <dbl> <chr> <chr>
#> 1 67 A02 Infectious diseases
#> 2 85 D318 Tumor
#> 3 75 C007+X Tumor
#> 4 62 D338 Blood disorder
EDIT
How to do this with a join:
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(DEATH_CAUSES,
                CIM_CODES,
                by = c("DEATH_CODE" = "PREFIX"),
                match_fun = str_detect)
#> # A tibble: 4 x 4
#> AGE DEATH_CODE PREFIX LABEL
#> <dbl> <chr> <chr> <chr>
#> 1 67 A02 A Infectious diseases
#> 2 85 D318 D31 Tumor
#> 3 75 C007+X C Tumor
#> 4 62 D338 D33 Blood disorder
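If you only want AGE and LABEL, as in the desired output, you could drop the helper columns after the join, e.g.:
fuzzy_left_join(DEATH_CAUSES,
                CIM_CODES,
                by = c("DEATH_CODE" = "PREFIX"),
                match_fun = str_detect) |>
  select(AGE, LABEL)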
My code is below; I used MySQL:
select a.age, p.label
from prefix p
left join age a on a.death_code like CONCAT("%",p.prefix,"%");
You can refer here: how to use a like with a join in sql?
I have my data in wide format. What is the easiest way to change it to long format without getting an X in front of the time variables?
Sample data:
structure(list(X1 = c("01/12/2019", "02/12/2019"), `00:30` = c(41.95,
39.689), `01:00` = c(44.96, 40.47), `01:30` = c(42.939, 38.95
), `02:00` = c(43.221, 40.46), `02:30` = c(44.439, 41.97)), class = "data.frame", row.names = c(NA,
-2L), spec = structure(list(cols = list(X1 = structure(list(), class = c("collector_character",
"collector")), `00:30` = structure(list(), class = c("collector_double",
"collector")), `01:00` = structure(list(), class = c("collector_double",
"collector")), `01:30` = structure(list(), class = c("collector_double",
"collector")), `02:00` = structure(list(), class = c("collector_double",
"collector")), `02:30` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
With pivot_longer and pivot_wider from tidyr:
dat |>
  pivot_longer(names_to = "time", values_to = "val", `00:30`:`02:30`) |>
  pivot_wider(names_from = "X1", values_from = "val")
Output:
# A tibble: 5 x 3
time `01/12/2019` `02/12/2019`
<chr> <dbl> <dbl>
1 00:30 42.0 39.7
2 01:00 45.0 40.5
3 01:30 42.9 39.0
4 02:00 43.2 40.5
5 02:30 44.4 42.0
In this special case, you could transpose the part of your data.frame containing numbers and assign the column names:
df_new <- data.frame(t(df[,-1]))
colnames(df_new) <- df[, 1]
This returns a data.frame df_new:
01/12/2019 02/12/2019
00:30 41.950 39.689
01:00 44.960 40.470
01:30 42.939 38.950
02:00 43.221 40.460
02:30 44.439 41.970
Edit (Thanks to jay.sf)
For versions of R >= 4.1, you could use the native pipe:
t(df[, -1]) |>
data.frame() |>
`colnames<-`(df[, 1])
I need a chart of accounts to stay in order when new accounts are added or dropped in future years. This is because in accounting the accounts are sorted by type (for example Asset, Liability, Equity), but this is not explicit in the dataset. This is an example of the code that is putting new "Accounts" from Year2 and Year3 at the bottom.
XYZCompany_Consolidated <- XYZCompany_Year1 %>%
  full_join(XYZCompany_Year2, by = "Account") %>%
  full_join(XYZCompany_Year3, by = "Account")
Example: This picture is just to give a simplified example. The highlight in orange is where the new accounts are going and to the right is the code I'm using; the green is what I'm trying to achieve.
Perhaps I'm overthinking this problem but I find it hard to solve. Let's define some data first:
df_year1 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"JaneDoe"), Year_1 = c(100, 1000, 20, 300, 500)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_1 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year2 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Laptop", "JaneDoe"), Year_2 = c(80, 1200, 50, 300, 500, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year3 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Rent", "JaneDoe"), Year_3 = c(80, 1200, 50, 300, 1000, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_3 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Those are similar to the data shown in the OP's picture, e.g. df_year1 looks like
# A tibble: 5 x 2
Account Year_1
<chr> <dbl>
1 Cash 100
2 Accounts 1000
3 Loan1 20
4 Auto 300
5 JaneDoe 500
Next we transform those data a little bit, namely
library(dplyr)
library(tidyr)
df_y1 <- df_year1 %>%
  mutate(Year = 1,
         no = row_number()) %>%
  rename(value = Year_1)
which returns
# A tibble: 5 x 4
Account value Year no
<chr> <dbl> <dbl> <int>
1 Cash 100 1 1
2 Accounts 1000 1 2
3 Loan1 20 1 3
4 Auto 300 1 4
5 JaneDoe 500 1 5
The new column no stores the account's original position, column Year stores the chart's year. All three data.frames are processed like this, so we get df_y1, df_y2, df_y3.
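For completeness, the analogous steps for the other two years might look like this (same pattern, only the year number and the renamed column change):
df_y2 <- df_year2 %>%
  mutate(Year = 2,
         no = row_number()) %>%
  rename(value = Year_2)
df_y3 <- df_year3 %>%
  mutate(Year = 3,
         no = row_number()) %>%
  rename(value = Year_3)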
Finally we bind them together
bind_rows(df_y1, df_y2, df_y3) %>%
  mutate(num_years = max(Year)) %>%
  group_by(Account) %>%
  mutate(rank = sum((num_years - n() + 1) * no), .keep = "unused") %>%
  pivot_wider(names_from = Year, names_prefix = "Year_") %>%
  arrange(rank) %>%
  select(-rank) %>%
  ungroup()
and calculate a rank for each account. The accounts are ordered by this rank. As a result, we get
# A tibble: 7 x 4
Account Year_1 Year_2 Year_3
<chr> <dbl> <dbl> <dbl>
1 Cash 100 80 80
2 Accounts 1000 1200 1200
3 Loan1 20 50 50
4 Auto 300 300 300
5 Laptop NA 500 NA
6 Rent NA NA 1000
7 JaneDoe 500 0 0
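For intuition: an account's rank is the sum of its original row positions, weighted by (num_years - n() + 1). For example, Cash appears in all three years at position 1, giving (3 - 3 + 1) * (1 + 1 + 1) = 3, while Laptop appears only once (year 2, position 5), giving (3 - 1 + 1) * 5 = 15. That is why the one-off accounts Laptop and Rent land between Auto (rank 12) and JaneDoe (rank 17).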
Note
I believe there are better approaches, but at least this works for the example data.
I'm not sure about the calculated rank's stability. Take care.
I have a dataframe that looks like this (but for every US county)
county          state  n_state_1  n_state_2  n_state_3  n_state_4
Autauga County  AL     NA         FL         NA         NA
Baldwin County  AL     GA         NA         TN         NA
Catron County   AL     FL         GA         NA         CA
I want to move the non-missing values (FL, GA, TN, etc.) to the first columns, starting from n_state_1, and then delete the columns that contain only missing values, to get:
county          state  n_state_1  n_state_2  n_state_3
Autauga County  AL     FL         NA         NA
Baldwin County  AL     GA         TN         NA
Catron County   AL     FL         GA         CA
I am struggling with the first step. I thought about using the function distinct but it doesn't work because there are non-empty elements in each column.
You could use dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("n_state")) %>%
  drop_na() %>%
  group_by(county, state) %>%
  mutate(name = row_number()) %>%
  pivot_wider(names_prefix = "n_state_")
which returns
county state n_state_1 n_state_2 n_state_3
<chr> <chr> <chr> <chr> <chr>
1 Autauga_County AL FL NA NA
2 Baldwin_County AL GA TN NA
3 Catron_County AL FL GA CA
What happened here?
pivot_longer takes the n_state_{n}-columns and collapses them into two columns: the name-column contains the original column name (n_state_1, n_state_2 etc), the value-column contains the states (FL, GA or <NA> in many cases).
Next we remove every <NA> entry. (Note: I use <NA> to make clear it's an NA value.)
After grouping by county and state we add a row number. These numbers are later used to create the new column names.
pivot_wider now takes these row numbers and prefixes them with n_state_ to get the new columns. The values are taken from the value-column created in the second line of code. pivot_wider fills the missing values with <NA>-values (default behaviour).
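For illustration, the intermediate result after the pivot_longer() and drop_na() steps looks roughly like this (using the Data below):
df %>%
  pivot_longer(starts_with("n_state")) %>%
  drop_na()
# A tibble: 6 x 4
#   county         state name      value
# 1 Autauga_County AL    n_state_2 FL
# 2 Baldwin_County AL    n_state_1 GA
# 3 Baldwin_County AL    n_state_3 TN
# 4 Catron_County  AL    n_state_1 FL
# 5 Catron_County  AL    n_state_2 GA
# 6 Catron_County  AL    n_state_4 CA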
Data
structure(list(county = c("Autauga_County", "Baldwin_County",
"Catron_County"), state = c("AL", "AL", "AL"), n_state_1 = c(NA,
"GA", "FL"), n_state_2 = c("FL", NA, "GA"), n_state_3 = c(NA,
"TN", NA), n_state_4 = c(NA, NA, "CA")), problems = structure(list(
row = 3L, col = "n_state_4", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L), spec = structure(list(
cols = list(county = structure(list(), class = c("collector_character",
"collector")), state = structure(list(), class = c("collector_character",
"collector")), n_state_1 = structure(list(), class = c("collector_character",
"collector")), n_state_2 = structure(list(), class = c("collector_character",
"collector")), n_state_3 = structure(list(), class = c("collector_character",
"collector")), n_state_4 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Another option is dapply from collapse to move the non-NA elements to the front of each row, then selecting only the columns that have any non-NA elements:
library(collapse)
library(dplyr)
dapply(df1, MARGIN = 1, FUN = function(x) c(x[!is.na(x)], x[is.na(x)])) %>%
select(where(~ any(complete.cases(.))))
# A tibble: 3 x 5
county state n_state_1 n_state_2 n_state_3
<chr> <chr> <chr> <chr> <chr>
1 Autauga_County AL FL <NA> <NA>
2 Baldwin_County AL GA TN <NA>
3 Catron_County AL FL GA CA
I have a dataframe with the following structure:
Timestamp Value1 Value2 Problem1 Problem2
00:00 32 40 No No
00:05 12 55 No No
00:10 14 42 Yes No
00:15 50 33 Yes No
00:20 78 47 No No
Where Problem1 defines if there is a problem with Value1, and Problem2 defines if there is a problem with Value2. If Problem1 is Yes, I need to replace Value1 with Value2. If both have problems, they should stay unchanged.
My problem here is that I won't know how many Value and Problem columns I'll have. So, if there are more than 2, I'd need to replace each problematic value with the average of the values without problems.
So, in another example:
Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
00:00 32 40 45 No No No
00:05 12 55 48 No No No
00:10 14 42 55 Yes No No
00:15 50 33 13 Yes No Yes
00:20 78 47 75 No No No
Here I'd need to replace Value1 at 00:10 by the average of Value2 and Value3. Also, I'd need to replace Value1 and Value3 at 00:15 by Value2.
I bet there is a more elegant solution.
library(tidyr)
library(dplyr)
df %>%
  mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
  pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(.*)(\\d+)") %>%
  pivot_wider() %>%
  group_by(Timestamp) %>%
  mutate(Value = case_when(sum(Problem) == 0 | sum(Problem) == n() | !Problem ~ Value,
                           TRUE ~ sum(Value * (1 - Problem)) / sum(1 - Problem))) %>%
  pivot_longer(cols = c("Value", "Problem")) %>%
  mutate(name = paste0(name, id), .keep = "unused") %>%
  pivot_wider() %>%
  ungroup() %>%
  mutate(across(starts_with("Problem"), ~ ifelse(.x == 1, "Yes", "No")))
returns
# A tibble: 5 x 7
Timestamp Value1 Problem1 Value2 Problem2 Value3 Problem3
<time> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 00'00" 32 No 40 No 45 No
2 05'00" 12 No 55 No 48 No
3 10'00" 48.5 Yes 42 No 55 No
4 15'00" 33 Yes 33 No 33 Yes
5 20'00" 78 No 47 No 75 No
What approach did I use?
Transform your Problem variables into booleans. R can use booleans in calculations; technically they are coerced to doubles (see the small example after this list).
Turn the numbers in the Value/Problem column names into an id, so for every timestamp there are several rows of Value and Problem.
Calculate the new value based on the number of problems and if the value is problematic.
Restore the shape of your data.frame.
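As a small illustration of the first point, logical values are treated as 0/1 in arithmetic, which is what the case_when() line relies on:
problem <- c(TRUE, FALSE, FALSE)   # the Problem flags at 00:10
value   <- c(14, 42, 55)           # the corresponding Values
sum(value * (1 - problem)) / sum(1 - problem)
# [1] 48.5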
Data
df <- structure(list(Timestamp = structure(c(0, 300, 600, 900, 1200
), class = c("hms", "difftime"), units = "secs"), Value1 = c(32,
12, 14, 50, 78), Value2 = c(40, 55, 42, 33, 47), Value3 = c(45,
48, 55, 13, 75), Problem1 = c("No", "No", "Yes", "Yes", "No"),
Problem2 = c("No", "No", "No", "No", "No"), Problem3 = c("No",
"No", "No", "Yes", "No")), problems = structure(list(row = 5L,
col = "Problem3", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Timestamp = structure(list(format = ""), class = c("collector_time",
"collector")), Value1 = structure(list(), class = c("collector_double",
"collector")), Value2 = structure(list(), class = c("collector_double",
"collector")), Value3 = structure(list(), class = c("collector_double",
"collector")), Problem1 = structure(list(), class = c("collector_character",
"collector")), Problem2 = structure(list(), class = c("collector_character",
"collector")), Problem3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
I use the data.table package. I call your data given in the second example "din".
I hope this code helps you:
#I use the library data.table; get data in data.table format
library(data.table)
din <- data.table(din)
din[,Value1:=as.numeric(Value1)]
din[,Value2:=as.numeric(Value2)]
din[,Value3:=as.numeric(Value3)]
#set Values to NA if there is a Problem
din[Problem1=="Yes", Value1:=NA]
din[Problem2=="Yes", Value2:=NA]
din[Problem3=="Yes", Value3:=NA]
#print table with NA replaced if we have a Problem
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32 40 45 No No No
#2: 00:05 12 55 48 No No No
#3: 00:10 NA 42 55 Yes No No
#4: 00:15 NA 33 NA Yes No Yes
#5: 00:20 78 47 75 No No No
#use the mean function to replace if I have an NA in the table (just working if Timestamp is a unique id, otherwise you need to generate one and use this in the by argument)
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=T), by=Timestamp]
din[is.na(Value2), Value2:=mean(c(Value1,Value3), na.rm=T), by=Timestamp]
din[is.na(Value3), Value3:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]
#print final table
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32.0 40 45 No No No
#2: 00:05 12.0 55 48 No No No
#3: 00:10 48.5 42 55 Yes No No
#4: 00:15 33.0 33 33 Yes No Yes
#5: 00:20 78.0 47 75 No No No
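If Timestamp is not a unique id, one sketch (as hinted in the comment above) is to generate a row id and group by that instead:
din[, row_id := .I]
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=T), by=row_id]
and analogously for Value2 and Value3.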
I made a sample for the data:
df <- data.frame(value1 = runif(10, min = 0, max = 100),
value2 = runif(10, min = 0, max = 100),
value3 = runif(10, min = 0, max = 100))
df_problem <- data.frame(problem1 = sample(c('yes','no'), 10, replace = T),
problem2 = sample(c('yes','no'), 10, replace = T),
problem3 = sample(c('yes','no'), 10, replace = T))
See that I separated the values from the problems. Then:
df_problem[df_problem == 'yes'] <- NA # problematic values will be blanked out
df_problem[df_problem == 'no'] <- 1   # values without problems are kept as-is
df_problem <- matrix(as.numeric(unlist(df_problem)), nrow = nrow(df)) #rebuild matrix
Finally:
df <- df * df_problem # problematic entries become NA
for (i in 1:nrow(df)){
  if (TRUE %in% is.na(df[i, ])){
    # replace the NAs (problematic values) with the row mean of the remaining values
    df[i, which(is.na(df[i, ]))] <- mean(unlist(df[i, ]), na.rm = TRUE)
  }
}
df