I have a list of lists of data.frames, which I would like to convert to a data.frame. The structure is as follows:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
The data has two peculiarities that I tried to mimic above:
- the column names can differ by 1-2 characters (e.g. date vs. dates)
- some columns are only present in some data frames (e.g. another_col)
I would now like to convert this to a single data frame (I tried different calls to rbind and also do.call, without success) and would like to
- match on column names tolerantly (if the column names differ by only 1-2 characters, I want them to be matched) and
- fill columns that are missing from a data frame with NA.
I want a data frame similar to the following
year level date type another_col
1 one "Jan-10" "type 1" NA
1 one "Jan-22" "type 2" NA
1 two "Feb-1" "type 2" NA
1 two "Feb-28" "type 3" NA
1 three "Mar-10" "type 1" NA
1 three "Mar-15" "type 4" NA
2 one "Jan-22" "type 2" "entry 2"
2 two "Feb-10" "type 2" "entry 2"
2 two "Feb-18" "type 3" "entry 3"
2 three "Mar-10" "type 1" "entry 4"
2 three "Mar-15" "type 4" "entry 5"
3 one "Jan-10" "type 1" NA
3 one "Jan-12" "type 2" NA
3 two "Feb-8" "type 2" NA
3 two "Jan-28" "type 3" NA
Can someone point out if rbind is the correct path here - and what I am missing?
You could do something like the following using purrr and dplyr:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
# add libraries
library(dplyr)
library(purrr)
# Map bind_rows to each list within the list
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year")
This will yield:
year level date type dates another_col
1 year1 one Jan-10 type 1 <NA> <NA>
2 year1 one Jan-22 type 2 <NA> <NA>
3 year1 two Feb-1 type 2 <NA> <NA>
4 year1 two Feb-28 type 3 <NA> <NA>
5 year1 three Mar-10 type 1 <NA> <NA>
6 year1 three Mar-15 type 4 <NA> <NA>
7 year2 one <NA> type 2 Jan-22 entry 2
8 year2 two Feb-10 type 2 <NA> entry 2
9 year2 two Feb-18 type 3 <NA> entry 3
10 year2 three Mar-10 type 1 <NA> entry 4
11 year2 three Mar-15 type 4 <NA> entry 5
12 year3 one Jan-10 type 1 <NA> <NA>
13 year3 one Jan-12 type 2 <NA> <NA>
14 year3 two Feb-8 type 2 <NA> <NA>
15 year3 two Jan-28 type 3 <NA> <NA>
Then of course you can do some regex parsing to keep only the numeric year:
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year") %>%
mutate(year = substring(year, regexpr("\\d", year)))
If you know that date and dates hold the same information, you can use mutate to fill date from dates wherever date is missing, e.g. mutate(date = ifelse(!is.na(date), date, dates)).
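bind_rows only matches columns whose names are identical, so the tolerant matching asked for in the question needs an extra step. One possible approach, shown here as a minimal sketch rather than a definitive solution, is to rename each column to its closest canonical name before binding. The helper harmonise_names, the canonical vector and the max_dist threshold of 2 are assumptions made for illustration; utils::adist computes the edit distance between the names.

```r
library(dplyr)
library(purrr)

# Trimmed-down version of the example data (one deliberate "dates" column)
l_small <- list(
  year1 = list(
    one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
    two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3"))
  ),
  year2 = list(
    one = data.frame(dates = "Jan-22", type = "type 2", another_col = "entry 2"),
    two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"),
                     another_col = c("entry 2", "entry 3"))
  )
)

# Assumption: the set of "true" column names is known in advance
canonical <- c("date", "type", "another_col")

# Hypothetical helper: rename a column to the closest canonical name
# when the edit distance is at most max_dist
harmonise_names <- function(df, canonical, max_dist = 2) {
  d <- utils::adist(names(df), canonical)   # rows: columns of df, cols: canonical names
  best <- apply(d, 1, which.min)            # index of the closest canonical name
  close <- d[cbind(seq_along(best), best)] <= max_dist
  names(df)[close] <- canonical[best[close]]
  df
}

# Harmonise the names of every data frame, then bind within and across years
result <- l_small %>%
  map(~ bind_rows(map(.x, function(df) harmonise_names(df, canonical)), .id = "level")) %>%
  bind_rows(.id = "year")

result
```

With this sketch the dates column of year2$one ends up in date, so no separate mutate step is needed. Note that it would produce duplicate column names if a single data frame contained both date and dates, so that case would need separate handling.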
My nested list looks like this:
myList <- list(structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "943", "3a0"),
product = c("Product 1", "Product 1", "Product 1")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I want to remove all list elements that contain the same code more than once. For example, the first element [[1]] has two rows with the code 943. I want to remove that entire element and keep only the elements that have no duplicates.
The expected outcome would therefore be:
myList <- list(
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I was thinking of using the following with lapply, but I couldn't get it to work:
any(duplicated(myList[[1]]$code))
Any ideas or suggestions?
This seems like a relatively simple problem, but I can't figure it out.
Your code any(duplicated(myList[[1]]$code)) can be used in Filter:
Filter(function(x) !any(duplicated(x$code)), myList)
#[[1]]
# id value code product
#1: 1 22 943 Product 2
#2: 2 33 94f Product 2
#3: 3 44 3a0 Product 2
#[[2]]
# id value code product
#1: 1 22 977 Product 3
#2: 2 33 943 Product 3
#3: 3 44 3a0 Product 3
Or with purrr:
purrr::keep(myList, ~!any(duplicated(.x$code)))
purrr::discard(myList, ~any(duplicated(.x$code)))
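As a small variation on the Filter approach above, base R's anyDuplicated() can replace any(duplicated()). It returns 0 when nothing is repeated. The sketch below uses simplified data frames as stand-ins for the data.tables in the question (an assumption made to keep the example free of dependencies).

```r
# Simplified stand-ins for the data.tables in the question
myList <- list(
  data.frame(id = 1:3, code = c("943", "943", "3a0")),
  data.frame(id = 1:3, code = c("943", "94f", "3a0")),
  data.frame(id = 1:3, code = c("977", "943", "3a0"))
)

# Keep only the elements whose code column has no repeated value
kept <- Filter(function(x) anyDuplicated(x$code) == 0L, myList)
length(kept)
```

The first element is dropped because 943 appears twice in it, and the other two are kept.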
Does this work:
myList[sapply(lapply(myList, function(x) +duplicated(x$code)), function(x) sum(x) == 0)]
[[1]]
id value code product
1: 1 22 943 Product 2
2: 2 33 94f Product 2
3: 3 44 3a0 Product 2
[[2]]
id value code product
1: 1 22 977 Product 3
2: 2 33 943 Product 3
3: 3 44 3a0 Product 3
I have a data.frame that looks like this:
I want to generate a new df as a codebook, where the numbers in the Label column are replaced using the information from ID and Subject.
What should I do?
The codebook that I want to achieve looks something like this:
Sample data can be built using this code:
df<-structure(list(Var = c("Subject1", "Subject2", "Subject4", "Subject5",
"Subject6", "Score1", "Score2", "Score3", "Score4", "Score5",
"Score6", "TestDate1", "TestDate2", "TestDate3", "TestDate4",
"TestDate5", "TestDate6"), Label = c("Subject 1", "Subject 2",
"Subject 4", "Subject 5", "Subject 6", "Score for Subject 1",
"Score for Subject 2", "Score for Subject 3", "Score for Subject 4",
"Score for Subject 5", "Score for Subject 6", "Date for test Subject 1",
"Date for test Subject 2", "Date for test Subject 3", "Date for test Subject 4",
"Date for test Subject 5", "Date for test Subject 6"), ID = c(1,
2, 3, 4, 5, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Subject = c("Math",
"ELA", "PE", "Art", "Physic", "Chemistry", NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
We can use str_replace_all with a named vector:
library(dplyr)
library(stringr)
df1 <- df %>%
transmute(Var, Label = str_replace_all(Label,
setNames(na.omit(Subject), na.omit(ID))))
Output:
df1
# A tibble: 17 x 2
# Var Label
# <chr> <chr>
# 1 Subject1 Subject Math
# 2 Subject2 Subject ELA
# 3 Subject4 Subject Art
# 4 Subject5 Subject Physic
# 5 Subject6 Subject Chemistry
# 6 Score1 Score for Subject Math
# 7 Score2 Score for Subject ELA
# 8 Score3 Score for Subject PE
# 9 Score4 Score for Subject Art
#10 Score5 Score for Subject Physic
#11 Score6 Score for Subject Chemistry
#12 TestDate1 Date for test Subject Math
#13 TestDate2 Date for test Subject ELA
#14 TestDate3 Date for test Subject PE
#15 TestDate4 Date for test Subject Art
#16 TestDate5 Date for test Subject Physic
#17 TestDate6 Date for test Subject Chemistry
Or using gsubfn:
library(gsubfn)
df$Label <- with(df, gsubfn("(\\d+)",
setNames(as.list(na.omit(Subject)), na.omit(ID)), Label))
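One caveat with the named-vector approach: each name is used as a regular expression, so a pattern such as "1" would also match the first digit of "10". That does not matter for the single-digit IDs in the sample data. If the IDs could have more than one digit, a base R sketch like the following looks up the whole trailing number instead. The lookup vector, including the entry for ID 10, is hypothetical, and the code assumes every label ends in a number.

```r
# Hypothetical lookup: names are the IDs (as strings), values are the subjects
lookup <- c("1" = "Math", "2" = "ELA", "3" = "PE", "10" = "History")

labels <- c("Subject 1", "Score for Subject 3", "Date for test Subject 10")

# Split each label into its text prefix and its trailing number
prefix <- sub("[0-9]+$", "", labels)
id <- regmatches(labels, regexpr("[0-9]+$", labels))

# Look up the whole number, so "10" is not confused with "1"
new_labels <- paste0(prefix, unname(lookup[id]))
new_labels
```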
I am trying to find accounts/customers that deliver recurring invoices within a time series dataset. The sample input is as follows:
yearmonth <- c("2019-11", "2019-11", "2019-12", "2020-01", "2020-02", "2020-02", "2020-03", "2020-03", "2020-04", "2020-05")
receivables <- c("Cust A", "Cust B", "Cust A", "Cust A", "Cust B", "Cust C", "Cust A", "Cust B", "Cust D", "Cust E")
category_group_name <- c("Expense", "Expense", "Expense", "Expense", "Expense", "Expense","Expense", "Expense","Expense","Expense")
What I would like to create is a mutated category_group_name, in which recurring invoices are classified as "Fixed Expense" and one-time invoices as "Variable Expense".
I am a bit stuck here. Could anyone help?
Many thanks!
Does this answer your question?
> library(dplyr)
> dat %>%
+   group_by(receivables) %>%
+   mutate(recurring = n()) %>%
+   mutate(category_group_name = case_when(recurring > 1 ~ "Fixed Expense", TRUE ~ "Variable Expense")) %>%
+   select(-recurring)
# A tibble: 10 x 3
# Groups: receivables [5]
yearmonth receivables category_group_name
<chr> <chr> <chr>
1 2019-11 Cust A Fixed Expense
2 2019-11 Cust B Fixed Expense
3 2019-12 Cust A Fixed Expense
4 2020-01 Cust A Fixed Expense
5 2020-02 Cust B Fixed Expense
6 2020-02 Cust C Variable Expense
7 2020-03 Cust A Fixed Expense
8 2020-03 Cust B Fixed Expense
9 2020-04 Cust D Variable Expense
10 2020-05 Cust E Variable Expense
>
Data used:
> dat
yearmonth receivables category_group_name
1 2019-11 Cust A Expense
2 2019-11 Cust B Expense
3 2019-12 Cust A Expense
4 2020-01 Cust A Expense
5 2020-02 Cust B Expense
6 2020-02 Cust C Expense
7 2020-03 Cust A Expense
8 2020-03 Cust B Expense
9 2020-04 Cust D Expense
10 2020-05 Cust E Expense
> dput(dat)
structure(list(yearmonth = c("2019-11", "2019-11", "2019-12",
"2020-01", "2020-02", "2020-02", "2020-03", "2020-03", "2020-04",
"2020-05"), receivables = c("Cust A", "Cust B", "Cust A", "Cust A",
"Cust B", "Cust C", "Cust A", "Cust B", "Cust D", "Cust E"),
category_group_name = c("Expense", "Expense", "Expense",
"Expense", "Expense", "Expense", "Expense", "Expense", "Expense",
"Expense")), class = "data.frame", row.names = c(NA, -10L
))
>
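The answer above counts rows per customer, so a customer with two invoices in the same month would also be labelled "Fixed Expense". If "recurring" is taken to mean "appears in more than one month" (an assumption about the intended definition, not something the question states), n_distinct(yearmonth) can be used instead. A minimal sketch on the question's data:

```r
library(dplyr)

dat <- data.frame(
  yearmonth = c("2019-11", "2019-11", "2019-12", "2020-01", "2020-02",
                "2020-02", "2020-03", "2020-03", "2020-04", "2020-05"),
  receivables = c("Cust A", "Cust B", "Cust A", "Cust A", "Cust B",
                  "Cust C", "Cust A", "Cust B", "Cust D", "Cust E"),
  category_group_name = "Expense"
)

res <- dat %>%
  group_by(receivables) %>%
  # A customer is "fixed" when it appears in more than one distinct month
  mutate(category_group_name = if_else(n_distinct(yearmonth) > 1,
                                       "Fixed Expense", "Variable Expense")) %>%
  ungroup()

res
```

On this sample the result is the same as with n(), because no customer has two invoices in one month. The final ungroup() drops the grouping so that later steps operate on the whole table.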
I'm trying to display my data in table format and I can't figure out how to rearrange my data to display it in the proper format. I'm used to wrangling data for plots, but I'm finding myself a little lost when it comes to preparing tables. This seems like something really basic, but I haven't been able to find an explanation on what I'm doing wrong here.
I have 3 columns of data, Type, Year, and n. The data formatted as it is now produces a table that looks like this:
Type Year n
Type C 1 5596
Type D 1 1119
Type E 1 116
Type A 1 402
Type F 1 1614
Type B 1 105
Type C 2 26339
Type D 2 14130
Type E 2 98
Type A 2 3176
Type F 2 3071
Type B 2 88
What I want to do is to have Type as row names, Year as column names, and n populating the table contents like this:
1 2
Type A 402 3176
Type B 105 88
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type F 1614 3071
The mistake might have been made upstream from this point. Using the full original data set I arrived at this output by doing the following:
exampletable <- df %>%
group_by(Year) %>%
count(Type) %>%
select(Type, Year, n)
Here is the dput() output
structure(list(Type = c("Type C", "Type D", "Type E", "Type A",
"Type F", "Type B", "Type C", "Type D", "Type E", "Type A", "Type F",
"Type B", "Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
"Type C", "Type D", "Type E", "Type A", "Type F", "Type B", "Type C",
"Type D", "Type E"), Year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5), n = c(5596,
1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88,
40958, 17578, 104, 3904, 3170, 102, 33145, 23800, 93, 1264, 7084,
1262, 34642, 24911, 504)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -27L), spec = structure(list(
cols = list(Type = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
You can reshape the data to wide format and turn the Type column into row names:
library(dplyr) # provides %>%
tidyr::pivot_wider(df, names_from = Year, values_from = n) %>%
tibble::column_to_rownames('Type')
# 1 2 3 4 5
#Type C 5596 26339 40958 33145 34642
#Type D 1119 14130 17578 23800 24911
#Type E 116 98 104 93 504
#Type A 402 3176 3904 1264 NA
#Type F 1614 3071 3170 7084 NA
#Type B 105 88 102 1262 NA
You can use the tidyr package to reshape to wide format and the tibble package to convert a column to row names:
dataset <- read.csv(file_location)
dataset <- tidyr::pivot_wider(dataset, names_from = Year, values_from = n)
tibble::column_to_rownames(dataset, var = 'Type')
1 2
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type A 402 3176
Type F 1614 3071
Type B 105 88
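For comparison, a base R sketch using xtabs gives the same layout without extra packages and sorts the rows alphabetically by Type, as in the desired table. It uses the first two years of the question's data. One difference to keep in mind: combinations that do not occur in the data show up as 0 here rather than NA.

```r
df <- data.frame(
  Type = rep(c("Type C", "Type D", "Type E", "Type A", "Type F", "Type B"), times = 2),
  Year = rep(c(1, 2), each = 6),
  n = c(5596, 1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88)
)

# Cross-tabulate: Type in rows, Year in columns, n summed per cell
tab <- xtabs(n ~ Type + Year, data = df)

# Convert the table to a plain data.frame with Type as row names
wide <- as.data.frame.matrix(tab)
wide
```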
I want to select certain values from multiple columns using conditions (let's also label row 1 as ID#1, ..., row 5 as ID#5).
column1 <- c("rice 2", "apple 4", "melon 6", "blueberry 4", "orange 6")
column2 <- c("rice 8", "blueberry 8", "grape 10", "water 10", "mango 3")
column3 <- c("rice 6", "apple 8", "blueberry 12", "pineapple 8", "mango 3")
I want to build a new column of the IDs that meet at least one of these conditions: rice > 5, blueberry > 7 or orange > 5.
First, I would like to get ID#1, ID#2, ID#3 and ID#5.
Second, I would like to count how many conditions are met per ID.
I would like to get these results:
ID#1 -> 2 conditions met
ID#2 -> 1 conditions met
ID#3 -> 1 conditions met
ID#4 -> 0 conditions met
ID#5 -> 1 conditions met
If I understood the question correctly, then one approach could be:
library(dplyr)
cols <- names(df)[-1]
df1 <- df %>%
mutate_if(is.factor, as.character) %>%
mutate(rice_gt_5 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='rice' & as.numeric(strsplit(., split=" ")[[1]][2]) > 5)) %>%
rowSums)) %>%
mutate(blueberry_gt_7 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='blueberry' & as.numeric(strsplit(., split=" ")[[1]][2]) > 7)) %>%
rowSums)) %>%
mutate(orange_gt_5 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='orange' & as.numeric(strsplit(., split=" ")[[1]][2]) > 5)) %>%
rowSums))
#IDs which satisfy at least one of your conditions i.e. rice > 5 OR blueberry > 7 OR orange > 5
df1$ID[which(df1 %>% select(rice_gt_5, blueberry_gt_7, orange_gt_5) %>% rowSums() >0)]
#[1] 1 2 3 5
#How many conditions are met per ID
df1 %>%
mutate(no_of_cond_met = rowSums(select(., one_of(c("rice_gt_5", "blueberry_gt_7", "orange_gt_5"))))) %>%
select(ID, no_of_cond_met)
# ID no_of_cond_met
#1 1 2
#2 2 1
#3 3 1
#4 4 0
#5 5 1
Sample data:
df <- structure(list(ID = 1:5, column1 = structure(c(5L, 1L, 3L, 2L,
4L), .Label = c("apple 4", "blueberry 4", "melon 6", "orange 6",
"rice 2"), class = "factor"), column2 = structure(c(4L, 1L, 2L,
5L, 3L), .Label = c("blueberry 8", "grape 10", "mango 3", "rice 8",
"water 10"), class = "factor"), column3 = structure(c(5L, 1L,
2L, 4L, 3L), .Label = c("apple 8", "blueberry 12", "mango 3",
"pineapple 8", "rice 6"), class = "factor")), .Names = c("ID",
"column1", "column2", "column3"), row.names = c(NA, -5L), class = "data.frame")