Subsetting a nested list based on a condition in R - r

My nested list looks like this:
myList <- list(structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "943", "3a0"),
product = c("Product 1", "Product 1", "Product 1")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I want to remove all list objects that have more than one list element with the same code. For example, the first object [[1]] has two entries with the code 943. I want to remove that entire object and keep only the objects that do not contain any duplicates.
The expected outcome would therefore be:
myList <- list(
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("943", "94f", "3a0"),
product = c("Product 2", "Product 2", "Product 2")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")),
structure(list(id = 1:3, value = c(22, 33, 44),
code = c("977", "943", "3a0"),
product = c("Product 3", "Product 3", "Product 3")),
row.names = c(NA,-3L),
class = c("data.table", "data.frame")))
I was thinking of using lapply, but I couldn't get it to work:
any(duplicated(myList[[1]]$code))
Any ideas or suggestions?
This seems like a relatively simple problem, but I can't figure it out.

Your code any(duplicated(myList[[1]]$code)) can be used in Filter:
Filter(function(x) !any(duplicated(x$code)), myList)
#[[1]]
# id value code product
#1: 1 22 943 Product 2
#2: 2 33 94f Product 2
#3: 3 44 3a0 Product 2
#[[2]]
# id value code product
#1: 1 22 977 Product 3
#2: 2 33 943 Product 3
#3: 3 44 3a0 Product 3
Or with purrr:
purrr::keep(myList, ~!any(duplicated(.x$code)))
purrr::discard(myList, ~any(duplicated(.x$code)))
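For reference, here is a self-contained base R sketch of the same idea. It uses plain data frames rather than data.table objects, and anyDuplicated() as a shortcut for any(duplicated()). The toy list `frames` and the helper `has_unique_codes` are names made up for this illustration; they are not from the question.

```r
# Toy stand-in for myList (plain data frames; names are illustrative only)
frames <- list(
  data.frame(id = 1:3, code = c("943", "943", "3a0")),
  data.frame(id = 1:3, code = c("943", "94f", "3a0")),
  data.frame(id = 1:3, code = c("977", "943", "3a0"))
)

# anyDuplicated() returns 0 when a vector has no repeated values
has_unique_codes <- function(x) anyDuplicated(x$code) == 0

# vapply() builds one TRUE/FALSE per list element, used as a logical index
keep <- vapply(frames, has_unique_codes, logical(1))
kept <- frames[keep]

length(kept)  # 2: the first element is dropped
```

Indexing with a logical vector keeps the step that decides (keep) separate from the step that subsets, which can make it easier to inspect which elements were rejected.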

Does this work:
myList[sapply(lapply(myList, function(x) +duplicated(x$code)), function(x) sum(x) == 0)]
[[1]]
id value code product
1: 1 22 943 Product 2
2: 2 33 94f Product 2
3: 3 44 3a0 Product 2
[[2]]
id value code product
1: 1 22 977 Product 3
2: 2 33 943 Product 3
3: 3 44 3a0 Product 3

Related

Remove Rows occurring after a String R Data frame

I want to remove all rows after a certain string occurrence in a data frame column. I want to only return the 3 rows that appear above "total" appearing in column A. The 2 rows appearing below "total" would be excluded.
A B
Bob Smith 01005
Carl Jones 01008
Syndey Lewis 01185
total
Adam Price 01555
Megan Watson 02548
We can subset with row_number and which:
library(dplyr)
df %>% filter(row_number() < which(A=='total'))
A B
1 Bob Smith 01005
2 Carl Jones 01008
3 Syndey Lewis 01185
You could use
library(dplyr)
df %>%
filter(cumsum(A == "total") == 0)
This returns
# A tibble: 3 x 2
A B
<chr> <chr>
1 Bob Smith 01005
2 Carl Jones 01008
3 Syndey Lewis 01185
Data
structure(list(A = c("Bob Smith", "Carl Jones", "Syndey Lewis",
"total", "Adam Price", "Megan Watson"), B = c("01005", "01008",
"01185", NA, "01555", "02548")), problems = structure(list(row = 4L,
col = NA_character_, expected = "2 columns", actual = "1 columns",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(A = structure(list(), class = c("collector_character",
"collector")), B = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
A <- c('Bob Smith','Carl Jones','Syndey Lewis','total','Adam Price','Megan Watson')
B <- c('01005','01008','01185','','01555','02548')
df <- data.frame(A, B)
val <- which(df$A == "total") # index of the "total" row
C <- df[seq_len(val - 1), ] # rows above "total"; note that 1:val-1 parses as (1:val)-1
It's a little clunky, but this should do what you want:
library(dplyr)
df <- data.frame(A = c("Bob Smith", "Carl Jones", "Sydney Lewis", "total", "Adam Price", "Megan Watson"),
B = c("01005", "01008", "01185", NA, "01555", "02548"))
index <- which(df$A == "total") # row position of "total"
df %>% slice(seq_len(index - 1)) # keep only the rows above it
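The cumsum idea from the answers above also works without dplyr. Here is a minimal base R sketch, assuming the same example data as in the question. The variable names `before_total`, `above` and `no_marker` are made up here, and "subtotal" is just a hypothetical marker that does not occur in the data.

```r
df <- data.frame(
  A = c("Bob Smith", "Carl Jones", "Syndey Lewis", "total", "Adam Price", "Megan Watson"),
  B = c("01005", "01008", "01185", NA, "01555", "02548")
)

# cumsum() of the match flag is 0 before the first "total" and >= 1 from it onward
before_total <- cumsum(df$A == "total") == 0
above <- df[before_total, ]

# If the marker never occurs, every row is kept instead of raising an error
no_marker <- df[cumsum(df$A == "subtotal") == 0, ]

nrow(above)      # 3
nrow(no_marker)  # 6
```

One reason to prefer this over an index from which() is that it behaves predictably when the marker is missing or appears more than once: everything from the first occurrence onward is dropped.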

Reorder Rows Multiple Values

I am trying to arrange my current data set so that all the Visits are arranged for all the Individuals.
I tried the method suggested in this question and it works but only shows the values for the first individual.
Data:
structure(list(Individual = c("John", "John", "John", "Anna",
"Anna", "Anna", "Seth", "Seth", "Seth"), Visit = c("Last", "First",
"Review", "Last", "First", "Review", "Last", "First", "Review"
), Amount = c(25, 100, 75, 25, 100, 75, 25, 100, 75)), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
Attempted code:
target <- c("First","Review","Last")
Visit <- Visit[match(target, Visit$Visit),]
You can use:
Visit[with(Visit, order(Individual, match(Visit, target))), ]
Or using dplyr:
library(dplyr)
df %>% arrange(Individual, match(Visit, target))
# Individual Visit Amount
# <chr> <chr> <dbl>
#1 Anna First 100
#2 Anna Review 75
#3 Anna Last 25
#4 John First 100
#5 John Review 75
#6 John Last 25
#7 Seth First 100
#8 Seth Review 75
#9 Seth Last 25
I think you need to convert the Visit column into an ordered factor whose levels follow the target order.
target <- c("First","Review","Last")
df$Visit <- factor(df$Visit, levels = target, ordered = TRUE)
dplyr::arrange(df, Individual, Visit)
> dplyr::arrange(df, Individual, Visit)
# A tibble: 9 x 3
Individual Visit Amount
<chr> <ord> <dbl>
1 Anna First 100
2 Anna Review 75
3 Anna Last 25
4 John First 100
5 John Review 75
6 John Last 25
7 Seth First 100
8 Seth Review 75
9 Seth Last 25
dput used
df <- structure(list(Individual = c("John", "John", "John", "Anna",
"Anna", "Anna", "Seth", "Seth", "Seth"), Visit = c("Last", "First",
"Review", "Last", "First", "Review", "Last", "First", "Review"
), Amount = c(25, 100, 75, 25, 100, 75, 25, 100, 75)), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
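To see why the attempted code returns only one individual: match(target, Visit$Visit) looks up each target value once and returns the position of its first occurrence, so it can never yield more than three rows. Swapping the arguments gives a rank for every row instead. Here is a small base R sketch, assuming a plain data frame with the same values as the question; `visit_rank` and `sorted` are names chosen for this example.

```r
df <- data.frame(
  Individual = rep(c("John", "Anna", "Seth"), each = 3),
  Visit = rep(c("Last", "First", "Review"), times = 3),
  Amount = rep(c(25, 100, 75), times = 3)
)
target <- c("First", "Review", "Last")

# match() turns each Visit into its position in target: First = 1, Review = 2, Last = 3
visit_rank <- match(df$Visit, target)

# order() sorts by Individual first and breaks ties with the custom rank
sorted <- df[order(df$Individual, visit_rank), ]
```

Here the data frame keeps its original column types, whereas the factor approach changes Visit into an ordered factor, which may or may not be wanted downstream.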

Formatting grouped data for tables in R

I'm trying to display my data in table format and I can't figure out how to rearrange my data to display it in the proper format. I'm used to wrangling data for plots, but I'm finding myself a little lost when it comes to preparing tables. This seems like something really basic, but I haven't been able to find an explanation on what I'm doing wrong here.
I have 3 columns of data, Type, Year, and n. The data formatted as it is now produces a table that looks like this:
Type Year n
Type C 1 5596
Type D 1 1119
Type E 1 116
Type A 1 402
Type F 1 1614
Type B 1 105
Type C 2 26339
Type D 2 14130
Type E 2 98
Type A 2 3176
Type F 2 3071
Type B 2 88
What I want to do is to have Type as row names, Year as column names, and n populating the table contents like this:
1 2
Type A 402 3176
Type B 105 88
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type F 1614 3071
The mistake might have been made upstream from this point. Using the full original data set I arrived at this output by doing the following:
exampletable <- df %>%
group_by(Year) %>%
count(Type) %>%
select(Type, Year, n)
Here is the dput() output
structure(list(Type = c("Type C", "Type D", "Type E", "Type A",
"Type F", "Type B", "Type C", "Type D", "Type E", "Type A", "Type F",
"Type B", "Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
"Type C", "Type D", "Type E", "Type A", "Type F", "Type B", "Type C",
"Type D", "Type E"), Year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5), n = c(5596,
1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88,
40958, 17578, 104, 3904, 3170, 102, 33145, 23800, 93, 1264, 7084,
1262, 34642, 24911, 504)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -27L), spec = structure(list(
cols = list(Type = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
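You can pivot the data to wide format and turn the Type column into row names.
library(magrittr) # provides %>%, which the tidyr:: and tibble:: prefixes alone do not attach
tidyr::pivot_wider(df, names_from = Year, values_from = n) %>%
tibble::column_to_rownames('Type')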
# 1 2 3 4 5
#Type C 5596 26339 40958 33145 34642
#Type D 1119 14130 17578 23800 24911
#Type E 116 98 104 93 504
#Type A 402 3176 3904 1264 NA
#Type F 1614 3071 3170 7084 NA
#Type B 105 88 102 1262 NA
You can use the tidyr package to pivot to wide format and the tibble package to convert a column to row names:
dataset <- read.csv(file_location)
dataset <- tidyr::pivot_wider(dataset, names_from = Year, values_from = n)
tibble::column_to_rownames(dataset, var = 'Type')
1 2
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type A 402 3176
Type F 1614 3071
Type B 105 88
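If you would rather avoid extra packages, a similar layout can be sketched in base R with tapply(). This is a different technique from the pivot_wider() answers above, and it returns a matrix rather than a data frame. The small `counts` data frame below is an abbreviated, made-up subset of the question's data, used only for illustration.

```r
counts <- data.frame(
  Type = c("Type C", "Type D", "Type A", "Type C", "Type D"),
  Year = c(1, 1, 1, 2, 2),
  n = c(5596, 1119, 402, 26339, 14130)
)

# One cell per Type/Year pair; combinations with no rows become NA
wide <- tapply(counts$n, list(counts$Type, counts$Year), sum)

wide["Type C", "2"]  # 26339
wide["Type A", "2"]  # NA
```

The rows come out sorted alphabetically by Type, and as.data.frame(wide) converts the result to a data frame with Type as row names if a matrix is not convenient.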

create dataframe from list of lists of data.frames

I have a list of lists of data.frames, which I would like to convert to a data.frame. The structure is as follows:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
The data has two particularities that I tried to mimic above:
- the column names differ by 1-2 characters (e.g. date vs. dates)
- some columns are only present in some data frames (e.g. another_col)
I would now like to convert this to a data frame (I tried different calls to rbind and also do.call, as described e.g. here, without success) and would like to
- match on column names tolerantly (if the column names differ by only 1-2 characters, I want them to be matched) and
- fill columns that do not exist in a data frame with NA.
I want a data frame similar to the following
year level date type another_col
1 one "Jan-10" "type 1" NA
1 one "Jan-22" "type 2" NA
1 two "Feb-1" "type 2" NA
1 two "Feb-28" "type 3" NA
1 three "Mar-10" "type 1" NA
1 three "Mar-15" "type 4" NA
2 one "Jan-22" "type 2" "entry 2"
2 two "Feb-1" "type 2" "entry 2"
2 two "Feb-28" "type 3" "entry 3"
2 three "Mar-10" "type 1" "entry 4"
2 three "Mar-15" "type 4" "entry 5"
3 one "Jan-10" "type 1" NA
3 one "Jan-12" "type 2" NA
3 two "Feb-8" "type 2" NA
3 two "Feb-28" "type 3" NA
Can someone point out if rbind is the correct path here - and what I am missing?
You could do something like the following using purrr and dplyr:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
# add libraries
library(dplyr)
library(purrr)
# Map bind_rows to each list within the list
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year")
This will yield:
year level date type dates another_col
1 year1 one Jan-10 type 1 <NA> <NA>
2 year1 one Jan-22 type 2 <NA> <NA>
3 year1 two Feb-1 type 2 <NA> <NA>
4 year1 two Feb-28 type 3 <NA> <NA>
5 year1 three Mar-10 type 1 <NA> <NA>
6 year1 three Mar-15 type 4 <NA> <NA>
7 year2 one <NA> type 2 Jan-22 entry 2
8 year2 two Feb-10 type 2 <NA> entry 2
9 year2 two Feb-18 type 3 <NA> entry 3
10 year2 three Mar-10 type 1 <NA> entry 4
11 year2 three Mar-15 type 4 <NA> entry 5
12 year3 one Jan-10 type 1 <NA> <NA>
13 year3 one Jan-12 type 2 <NA> <NA>
14 year3 two Feb-8 type 2 <NA> <NA>
15 year3 two Jan-28 type 3 <NA> <NA>
Then of course you can do some regex parsing to keep only the numeric year:
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year") %>%
mutate(year = substring(year, regexpr("\\d", year)))
If you know that date and dates are the same, you can always use mutate to replace missing values in date with the values from dates (i.e. mutate(date = ifelse(!is.na(date), date, dates))).
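The question also asked about matching column names tolerantly. One possible way to approach that, sketched here in base R, is to rename columns before binding, using adist() from the utils package to measure edit distance. Everything named below (`canonical`, `fix_names`, `fill_missing`, `max_edits`) is hypothetical and not part of the answer above, and the threshold of one edit is an assumption that would need tuning for real column names. The sketch does not handle the case where two columns of one data frame map to the same canonical name.

```r
# Canonical column names (an assumption for this sketch)
canonical <- c("date", "type", "another_col")

# Rename each column to the closest canonical name when it is at most max_edits away
fix_names <- function(d, max_edits = 1) {
  dist <- adist(names(d), canonical)  # rows: current names, columns: canonical names
  nearest <- apply(dist, 1, which.min)
  close_enough <- dist[cbind(seq_along(nearest), nearest)] <= max_edits
  names(d)[close_enough] <- canonical[nearest[close_enough]]
  d
}

# Add any canonical column that is missing, filled with NA, and fix the column order
fill_missing <- function(d) {
  for (col in setdiff(canonical, names(d))) d[[col]] <- NA_character_
  d[canonical]
}

a <- data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2"))
b <- data.frame(dates = "Jan-22", type = "type 2", another_col = "entry 2")

combined <- do.call(rbind, lapply(list(a, b), function(d) fill_missing(fix_names(d))))
```

With the names normalised up front, no separate dates column is created, so the ifelse step is not needed for this case.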

Calculate average over many data frames having same format?

I have 15 data frames with exactly the same structure, but each has different values stored in its columns. The header row is the same in every one.
Here's an example data frame, call it "A":
Product Q1 Q2
1 Product X 10 15
2 Product Y 20 40
3 Product Z 30 50
And here's another, call it "B":
Product Q1 Q2
1 Product X 12 5
2 Product Y 25 44
3 Product Z 32 51
I would like to calculate the average value across all 15 data frames. Using my two examples, the output would be a similar data frame but with averages. Something like this:
Product Q1 Q2
1 Product X 11.0 10.0
2 Product Y 22.5 42.0
3 Product Z 31.0 50.5
I've searched around for a solution, but to no avail. It seems like the mapply function might be what I need, but I'm not sure how best to put it to use here.
aggregate(.~Product, rbind(A, B), mean)
# Product Q1 Q2
#1 Product X 11.0 10.0
#2 Product Y 22.5 42.0
#3 Product Z 31.0 50.5
DATA
A = structure(list(Product = c("Product X", "Product Y", "Product Z"
), Q1 = c(10L, 20L, 30L), Q2 = c(15L, 40L, 50L)), .Names = c("Product",
"Q1", "Q2"), class = "data.frame", row.names = c("1", "2", "3"
))
B = structure(list(Product = c("Product X", "Product Y", "Product Z"
), Q1 = c(12L, 25L, 32L), Q2 = c(5L, 44L, 51L)), .Names = c("Product",
"Q1", "Q2"), class = "data.frame", row.names = c("1", "2", "3"
))
Since the headers match, let's put all of your data frames into one data frame.
df <- rbind(A,B,... O)
Then we'll use dplyr to summarize:
library(dplyr)
df %>% group_by(Product) %>%
summarize(Q1_Avg= mean(Q1), Q2_Avg= mean(Q2))
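With 15 data frames it is usually easier to keep them in a list than to name each one in rbind(). Here is a minimal base R sketch, assuming the two example frames A and B from the question; the names `frames`, `stacked` and `averages` are made up for illustration, and the remaining frames would simply be added to the list.

```r
A <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(10, 20, 30), Q2 = c(15, 40, 50))
B <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(12, 25, 32), Q2 = c(5, 44, 51))

# Collect the frames in a list; with 15 frames, list all 15 here
frames <- list(A, B)

# Stack every frame, then average each numeric column per Product
stacked <- do.call(rbind, frames)
averages <- aggregate(. ~ Product, data = stacked, FUN = mean)

averages$Q1  # 11.0 22.5 31.0
averages$Q2  # 10.0 42.0 50.5
```

Because the grouping is done on Product, this does not rely on the rows being in the same order in every data frame.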
