I've got a pile of nested tibbles that come from the tidyRSS package. The data looks like this:
What I'm trying to do is take the four common items from each tibble and tidy them, so that the output looks like this:
item_title            item_link  item_description  item_pub_date
title from article 1  some url   longer text       posix date
title from article 2  some url   longer text       posix date
title from article 3  some url   longer text       posix date
title from article 4  some url   longer text       posix date
Thus far I've tried unlist() and deframe() and both of those just make a general mess of things - and an added twist is not all the list items are tibbles. Some are functions, and I want to ignore those. What's the best tidyverse approach to tackle this task?
map_dfr seems to do what you want! It loops over a list, applies a function to each element, and row-binds the results. In this case, the only "function" we want to apply is returning the data frame/tibble unchanged, which also lets us skip the functions:
library(purrr)  # map_dfr() also needs dplyr to be installed

# Example list: two data frames plus a function that should be skipped
clean_feed_df <- list(
  data.frame(item_title=sample(letters, 3),
             item_link=sample(letters, 3),
             item_desc=sample(letters, 3),
             item_date=sample(letters, 3)),
  data.frame(item_title=as.character(sample(1:100, 5)),
             item_link=as.character(sample(1:100, 5)),
             item_desc=as.character(sample(1:100, 5)),
             item_date=as.character(sample(1:100, 5))),
  function(x)sum(x)
)

# Data frames are returned unchanged; anything else yields NULL and is dropped
map_dfr(clean_feed_df, function(rssentry){
  if(is(rssentry, "data.frame")){
    return(rssentry)
  }
})
which returns
item_title item_link item_desc item_date
1 s s u i
2 x d o x
3 t x d h
4 40 51 21 91
5 4 25 37 34
6 5 44 18 71
7 65 70 83 90
8 32 85 76 89
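For comparison, here is a minimal base R sketch of the same idea. It assumes each tibble carries the four columns named in the question and may carry extras. The `feed` list and its values are made up for illustration; they are not real tidyRSS output.

```r
# Hypothetical stand-in for the nested feed list (not real tidyRSS output)
feed <- list(
  data.frame(item_title = c("title 1", "title 2"),
             item_link = c("url 1", "url 2"),
             item_description = c("text 1", "text 2"),
             item_pub_date = as.POSIXct(c("2021-01-01", "2021-01-02"), tz = "UTC"),
             feed_title = "extra column",
             stringsAsFactors = FALSE),
  function(x) sum(x),  # a non-table element that should be ignored
  data.frame(item_title = "title 3",
             item_link = "url 3",
             item_description = "text 3",
             item_pub_date = as.POSIXct("2021-01-03", tz = "UTC"),
             stringsAsFactors = FALSE)
)

wanted <- c("item_title", "item_link", "item_description", "item_pub_date")

# Keep only the list elements that are data frames (tibbles qualify too)
only_tables <- Filter(is.data.frame, feed)

# Keep the four common columns, then stack everything row-wise
combined <- do.call(rbind, lapply(only_tables, function(d) d[wanted]))
```

Selecting the columns before binding avoids the error `rbind()` raises when the tables do not share the same set of columns. As a side note, purrr 1.0.0 and later mark `map_dfr()` as superseded in favour of `map()` followed by `list_rbind()`.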
I am using the pivot function from the lessR package, to create an Excel-like pivot table with two categorical variables that make up the vertical and horizontal categories, and a mean in each cell. (Hope this makes sense).
I followed the code that the documentation (https://cran.r-project.org/web/packages/lessR/vignettes/pivot.html) gives. Let's follow their example:
d <- Read("Employee")
a <- pivot(d, mean, Salary, Dept, Gender)
The data d is like this:
                 Years Gender Dept    Salary JobSat Plan Pre Post
Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
Wu, James           NA      M SALE  94494.58    low    1  62   74
Hoang, Binh         15      M SALE 111074.86    low    3  96   97
Jones, Alissa        5      F <NA>  53772.58   <NA>    1  65   62
Downs, Deborah       7      F FINC  57139.90   high    2  90   86
Afshari, Anbar       6      F ADMN  69441.93   high    2 100  100
Knox, Michael       18      M MKTG  99062.66    med    3  81   84
Campagna, Justin     8      M SALE  72321.36    low    1  76   84
Kimball, Claire      8      F MKTG  61356.69   high    2  93   92
The pivot table a is a nice table, exactly as I want it to look in terms of cell contents, etc. It appears to be a knitr_kable.
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.35
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
Next, I would like to make a dataframe out of this, for easier manipulation in my code and for copying it to the clipboard. However, I don't know how to convert a knitr_kable to a dataframe. Here is my code and the error it results in:
as.data.frame(a)
Error in as.data.frame.default(a) :
cannot coerce class ‘"knitr_kable"’ to a data.frame
The knitr documentation does not say anything about this conversion. It only covers converting a dataframe to a knitr_kable, which is the opposite of what I want.
I have also tried pivottabler, but this has similar issues: the resulting class cannot be coerced to a dataframe either.
Here are two potential answers:
Most direct: Wrangle the data yourself
If you're open to a tidyverse-style approach, it only takes a few lines to do the wrangling and summarising yourself. That will give you a data frame (a tibble) that you can work with right away.
# load packages
library(lessR)
library(dplyr)
library(tidyr)
# load data
d <- Read("Employee")
# use tidyverse-style code to pivot and summarise the data yourself
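d %>%
  group_by(Gender, Dept) %>%
  # .groups = "drop" returns an ungrouped result and silences the grouping message
  summarise(Salary_mean = mean(Salary), .groups = "drop") %>%
  pivot_wider(names_from = "Gender", values_from = "Salary_mean")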
Read the knitr::kable() markdown output into a data frame
If you prefer to work backwards from a knitr::kable() output to a dataframe, this is addressed in this SO question: Markdown table to data frame in R
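As a rough sketch of that second route, assuming the table is available as pipe-format markdown lines (the layout `knitr::kable(x, format = "pipe")` produces), base R can parse it. The `md` lines below are typed by hand from part of the pivot output above; they are not generated by lessR, whose printed layout differs.

```r
# Hypothetical pipe-format markdown table, typed by hand for illustration
md <- c(
  "|Dept |        F|        M|",
  "|:----|--------:|--------:|",
  "|ACCT | 63237.16| 59626.20|",
  "|ADMN | 81434.00| 80963.35|"
)

# Drop the alignment row (second line), then split the rest on "|"
tab <- read.table(text = md[-2], sep = "|", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)

# The leading and trailing pipes create empty first and last columns
tab <- tab[, -c(1, ncol(tab))]
```

After this, `tab` is an ordinary data frame with a character `Dept` column and numeric `F` and `M` columns, so the usual data frame tools apply.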
I have a column with text strings in it, and I would like to extract not just a specific string, but also the string or number/s following this specified string. What is a good solution for this?
In the example below- I would like to create a column "extract" and str_extract the words "lot" and "unit" AND also extract the subsequent numbers following this text.
id  notes                                       extract
1   LOT 56, STRATA TITLE, 56/SP77100,           LOT 56
2   18/SP71866, COMMERCIAL, 17/SP71866, lot 18  lot 18
3   unit 9; 3R/PS732002                         unit 9
4   V1602 F63, Section 8 Block 68 Unit 3        Unit 3
Have looked at a lot of regex code but nothing helpful to find how to extract subsequent values from the specified target text string.
Tried this so far from another StackOverflow problem-
result <- table %>%
mutate(extract = str_extract(notes, "(?lot\\s)\\W\\s?\\d+\\")) %>%
mutate(lot = str_squish(lot))
You can use
str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+")
See the regex demo.
Details
(?i) - case insensitive flag
\b - a word boundary
(?:lot|unit) - either lot or unit
\W* - any zero or more non-word chars
\d+ - one or more digits.
R test:
library(dplyr)
library(stringr)
df <- data.frame(notes=c("LOT 56, STRATA TITLE, 56/SP77100,","18/SP71866, COMMERCIAL, 17/SP71866, lot 18","unit 9; 3R/PS732002", "V1602 F63, Section 8 Block 68 Unit 3"))
df %>%
  mutate(extract = str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+"))
notes extract
1 LOT 56, STRATA TITLE, 56/SP77100, LOT 56
2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18 lot 18
3 unit 9; 3R/PS732002 unit 9
4 V1602 F63, Section 8 Block 68 Unit 3 Unit 3
A base R option using regmatches
transform(
df,
extract = unlist(regmatches(notes, gregexpr("\\b(lot|unit)\\s\\d+", notes, ignore.case = TRUE)))
)
gives
id notes extract
1 1 LOT 56, STRATA TITLE, 56/SP77100 LOT 56
2 2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18 lot 18
3 3 unit 9; 3R/PS732002 unit 9
4 4 V1602 F63, Section 8 Block 68 Unit 3 Unit 3
Data
> dput(df)
structure(list(id = 1:4, notes = c("LOT 56, STRATA TITLE, 56/SP77100",
"18/SP71866, COMMERCIAL, 17/SP71866, lot 18", "unit 9; 3R/PS732002",
"V1602 F63, Section 8 Block 68 Unit 3")), class = "data.frame", row.names = c(NA,
-4L))
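One caveat with the unlist(regmatches(..., gregexpr(...))) form: it returns one element per match, so a row with no match or with two matches makes the result a different length from the data frame and transform() fails. A minimal sketch that keeps exactly one value per row, taking the first match and NA otherwise, could look like this. The middle note is a made-up example of a row without a match.

```r
notes <- c("LOT 56, STRATA TITLE, 56/SP77100,",
           "COMMERCIAL, no parcel reference",   # hypothetical row with no match
           "V1602 F63, Section 8 Block 68 Unit 3")

# regexpr() finds only the first match per string; -1 means no match
m <- regexpr("\\b(?:lot|unit)\\W*\\d+", notes, ignore.case = TRUE, perl = TRUE)

# Start with NA everywhere, then fill in the rows that matched
extract <- rep(NA_character_, length(notes))
extract[m != -1] <- regmatches(notes, m)
```

This mirrors what str_extract() does: first match per element, NA where nothing matches.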
Looking for advice on refining my code and also trimming to a date range.
The spreadsheet itself is pulled from another system, so the structure of the Excel file cannot be changed. When you pull the data it basically starts at E2, with the first date column in F2, and the first item in E3. The data will continue to populate to the right for as long as it goes on. I have replicated the structure below.
And I want it to look like:
I have come up with the below, which works, but I was looking for advice on refining it down to fewer individual steps.
In the below code:
#1 = extracting data
#2 = pulling the dates out
#3 = formatting from Excel number to an actual date
#4 = grabbing the item names
#5 = transposing data and skipping some parts
#6 = adding in dates to the row names
#1
library(readxl)
gtb <- data.frame(read_excel("C:/example.xlsx",
                             sheet = "Sheet1"))
#2
dfdate <- gtb[1, -c(1,2,3,4,5)]
#3
dfdate <- format(as.Date(as.numeric(dfdate),
origin = "1899-12-30"), "%d/%m/%Y")
#4
rownames(gtb) <- gtb[,1]
#5
gtb <- as.data.frame(t(gtb[, -c(1,2,3,4,5)]))
#6
rownames(gtb) <- dfdate
After the row names have been added the structure is such that I am happy to start creating the visuals where needed.
thanks for your advice
David
Here is one suggestion. I don't have easy access to your data, but I am including code that removes those columns based on their names, which can be nicer than removing them by index as you do.
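library(dplyr)
library(tibble)
library(lubridate)

# Stand-in for the spreadsheet: one item column followed by date columns
df <- read.table( text=
"Item_Code 01/01/2018 01/02/2018 01/03/2018 01/04/2018
Item 99 51 60 69
Item2 42 47 88 2
Item3 36 81 42 48
",header=TRUE, check.names=FALSE) %>%
  rename( `Item Code` = Item_Code )

# Drop unwanted columns by name, stow the item names as row names,
# transpose, then turn the old column headers into a real Date column
x <- df %>% select( -matches("Code \\d|Internal Code") ) %>%
  column_to_rownames("Item Code") %>%
  t %>% as.data.frame %>%
  rownames_to_column("Item Code") %>%
  mutate( `Item Code` = dmy(`Item Code`) )
x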
Output:
> x
Item Code Item Item2 Item3
1 2018-01-01 99 42 36
2 2018-02-01 51 47 81
3 2018-03-01 60 88 42
4 2018-04-01 69 2 48
I went back and forth a bit with this solution, but it can be nice to also showcase how to remove columns by a regex on their column names, since you are removing several similarly named columns.
The t trick, which you also use, works because there is really only one more column there that would cause problems with this, as others have commented, and this can be temporarily stowed away as row names. If that weren't the case, you would be looking at a more complex solution involving pivot_longer and pivot_wider, or splitting the data.frame and transposing only one of the halves.
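The question also asks about trimming to a date range. Once the dates sit in a proper Date column, a plain logical filter is enough. A minimal sketch, assuming a result shaped like the output above; the `start` and `end` values are arbitrary examples, and `x` is rebuilt by hand here so the snippet runs on its own.

```r
# Hand-built stand-in for the transposed result shown above
x <- data.frame(
  `Item Code` = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01")),
  Item  = c(99, 51, 60, 69),
  Item2 = c(42, 47, 88, 2),
  Item3 = c(36, 81, 42, 48),
  check.names = FALSE
)

# Arbitrary example range; both ends are inclusive
start <- as.Date("2018-02-01")
end   <- as.Date("2018-03-31")

# Date objects compare chronologically, so this keeps only rows inside the range
in_range  <- x[["Item Code"]] >= start & x[["Item Code"]] <= end
x_trimmed <- x[in_range, ]
```

Filtering on Date values rather than on "%d/%m/%Y" strings matters here, because formatted strings compare alphabetically and would sort by day of the month first.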
I have scraped some data from a URL to analyse cycling results. Unfortunately, the name column consists of the rider's name and the name of the team in one field. I would like to separate these from each other. Here's the code (the last part doesn't work):
library(rvest)  # read_html(), html_nodes(), html_text()
library(tidyr)  # separate()
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
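Both answers pass the team name to a function that treats it as a regular expression, so a team name containing characters such as `+`, `(` or `.` could match incorrectly or raise an error. A minimal sketch of a literal (non-regex) removal in base R follows. The two rows are made up for illustration, and the second team name is hypothetical, chosen only because it contains regex metacharacters.

```r
# Made-up rows; the second team name is hypothetical
results <- data.frame(
  name = c(" van Aert WoutTeam Jumbo-Visma", "Riderson Rita Team A+B (Dev)"),
  team = c("Team Jumbo-Visma", "Team A+B (Dev)"),
  stringsAsFactors = FALSE
)

# fixed = TRUE makes sub() treat the team name as plain text, row by row
results$name <- trimws(mapply(
  function(nm, tm) sub(tm, "", nm, fixed = TRUE),
  results$name, results$team,
  USE.NAMES = FALSE
))
```

With stringr, wrapping the pattern as `str_remove(name, fixed(team))` should give the same literal matching.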
Can anyone help with rearranging long data into wide data? It is complicated by linked results: each patient is identified by a Study Number (SN), and that patient's results should be listed in wide format after the SN, with the columns LabTest, LabDate, Result, Lower, Upper repeated for each result. (I've shown an abbreviated table; there are more results per patient.) I've tried melt and recast, and binding columns, but can't seem to get it to work. There are over 1000 results to reformat, so I can't input them manually; I need to reformat a long-format Excel document into wide format in R. Thank you.
Original data looks like this
SN LabTest LabDate Result Lower Upper
TD62 Creat 05/12/2004 22 30 90
TD62 AST 06/12/2004 652 6 45
TD58 Creat 26/05/2007 72 30 90
TD58 Albumin 26/05/2005 22 25 35
TD14 AST 28/02/2007 234 6 45
TD14 Albumin 26/02/2007 15 25 35
Formatted data should look like this
SN LabTCode LabDate Result Lower Upper LabCode LabDate Result Lower Upper
TD62 Creat 05/12/04 22 30 90 AST 06/12/04 652 6 45
TD58 Creat 26/05/05 72 30 90 Alb 26/05/05 22 25 35
TD14 AST 28/02/07 92 30 90 Alb 26/02/07 15 25 35
So far I have tried:
data_wide2 <- dcast(tdl, SN + LabDate ~ LabCode, value.var="Result")
and
melt(tdl, id = c("SN", "LabDate"), measured= c("Result", "Upper", "Lower"))
Your issue is that R won't like the final table because it has duplicate column names. Maybe you need the data in that format but it's a bad way to store data because it would be difficult to put the columns back into rows again without a load of manual work.
That said, if you want to do it you'll need a new column to help you transpose the data.
I've used dplyr and tidyr below, which are worth looking at rather than reshape. They're by the same author but more modern and designed to fit together as part of the 'tidyverse'.
library(dplyr)
library(tidyr)
#Recreate your data (not doing this bit in your question is what got you downvoted)
df <- data.frame(
SN = c("TD62","TD62","TD58","TD58","TD14","TD14"),
LabTest = c("Creat","AST","Creat","Albumin","AST","Albumin"),
LabDate = c("05/12/2004","06/12/2004","26/05/2007","26/05/2005","28/02/2007","26/02/2007"),
Result = c(22,652,72,22,234,15),
Lower = c(30,6,30,25,6,25),
Upper = c(90,45,90,35,45,35),
stringsAsFactors = FALSE
)
output <- df %>%
group_by(SN) %>%
mutate(id_number = row_number()) %>% #create an id number to help with tracking the data as it's transposed
gather("key", "value", -SN, -id_number) %>% #flatten the data so that we can rename all the column headers
mutate(key = paste0("t",id_number, key)) %>% #add id_number to the column names. 't' for 'test' to start name with a letter.
select(-id_number) %>% #don't need id_number anymore
spread(key, value)
SN t1LabDate t1LabTest t1Lower t1Result t1Upper t2LabDate t2LabTest t2Lower t2Result t2Upper
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 TD14 28/02/2007 AST 6 234 45 26/02/2007 Albumin 25 15 35
2 TD58 26/05/2007 Creat 30 72 90 26/05/2005 Albumin 25 22 35
3 TD62 05/12/2004 Creat 30 22 90 06/12/2004 AST 6 652 45
And you're there, possibly with some sorting issues still to crack if you need the columns in a specific order.
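If the column order and column types matter, a base R alternative is `reshape()`, which keeps each test's columns together and does not convert numbers to character. This is only a sketch on the same six rows; the `id_number` helper column is my own naming, mirroring the one used above.

```r
# Same six rows as in the answer above
df <- data.frame(
  SN = c("TD62", "TD62", "TD58", "TD58", "TD14", "TD14"),
  LabTest = c("Creat", "AST", "Creat", "Albumin", "AST", "Albumin"),
  LabDate = c("05/12/2004", "06/12/2004", "26/05/2007", "26/05/2005",
              "28/02/2007", "26/02/2007"),
  Result = c(22, 652, 72, 22, 234, 15),
  Lower = c(30, 6, 30, 25, 6, 25),
  Upper = c(90, 45, 90, 35, 45, 35),
  stringsAsFactors = FALSE
)

# Number the results within each SN: 1, 2, ... in order of appearance
df$id_number <- ave(seq_along(df$SN), df$SN, FUN = seq_along)

# One row per SN; every other column is repeated once per id_number
wide <- reshape(df, idvar = "SN", timevar = "id_number", direction = "wide")
```

The resulting columns are named LabTest.1, LabDate.1, Result.1, Lower.1, Upper.1, then LabTest.2 and so on, in the original column order, with one row per SN.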