Manipulating a Column with a Concatenated List in R

I found a way to make it work, but it seems clumsy. There has to be a better way.
The question I am trying to answer is:
If I wanted to find out how often each language was selected in a given country, how would I do that efficiently?
This works, but what would be better?
library(tidyverse)
data(SO_survey, package = "crunch")
# what does it look like now?
SO_survey %>% select(Country, WantWorkLanguage) %>% head()
It's set up like this
# Country WantWorkLanguage
# 4 United States Matlab; Python; R; SQL
# 11 United States C#; R; SQL
# 36 Italy JavaScript; Python; R
# 125 Denmark Groovy; Java; JavaScript; Lua; SQL; TypeScript
# 242 United States C++; Python
# 298 Dominican Republic C; C#; CoffeeScript; Go; Haskell; JavaScript; Perl; PHP; Python; R; Ruby; SQL
I made a unique list of languages:
# extract unique languages
(wanted = SO_survey %>%
  select(WantWorkLanguage) %>%
  unlist() %>%
  strsplit("; ", fixed = T) %>%
  unlist() %>%
  unique())
To extract the counts for one country:
# how often did a respondent pick a particular language in the US?
SO_survey %>%
filter(Country == "United States") %>%
{strsplit(unlist(.$WantWorkLanguage),"; ",fixed = T)} %>%
unlist() %>%
table(. %in% wanted)
If I want it arranged/sorted
# If I want it sorted or arranged
SO_survey %>%
filter(Country == "United States") %>%
{strsplit(unlist(.$WantWorkLanguage),"; ",fixed = T)} %>%
unlist() %>%
table(. %in% wanted) %>%
data.frame() %>%
select(-Var2) %>%
arrange(-Freq)
The output:
# . Freq
# 1 Python 321
# 2 R 288
# 3 SQL 209
# 4 JavaScript 199
# 5 C++ 136
# 6 Java 115
# 7 Go 101
# 8 C# 100
# 9 Scala 81
# 10 C 74
# 11 Swift 57
# 12 Julia 56
# 13 TypeScript 55
# 14 Haskell 52
# 15 Rust 38
# 16 F# 36
# 17 PHP 32
# 18 Ruby 32
# 19 Assembly 29
# 20 Clojure 29
# 21 Matlab 29
# 22 Elixir 23
# 23 Perl 18
# 24 Objective-C 17
# 25 CoffeeScript 16
# 26 Erlang 16
# 27 Lua 13
# 28 Common Lisp 12
# 29 VBA 11
# 30 Groovy 7
# 31 Dart 5
# 32 Smalltalk 3
# 33 VB.NET 3
# 34 Hack 2
# 35 Visual Basic 6 1

Your tidyverse solution seems pretty good. For something more concise you could try base R or data.table:
library(data.table)
setDT(SO_survey)
setkey(SO_survey, Country)
SO_survey['United States', .(lang = unlist(strsplit(WantWorkLanguage, '; ')))
][, .N, keyby = lang][order(-N)]
# lang N
# 1: Python 321
# 2: R 288
# 3: SQL 209
# 4: JavaScript 199
# 5: C++ 136
# ...
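The answer above mentions base R but only shows data.table. A minimal base R sketch, using a small made-up data frame (`survey`, a stand-in for SO_survey with the same two columns), might look like this. It assumes the language column is a character vector; a factor would need as.character() first.

```r
# Made-up toy data standing in for SO_survey (assumption: character columns)
survey <- data.frame(
  Country = c("United States", "United States", "Italy"),
  WantWorkLanguage = c("Matlab; Python; R; SQL", "C#; R; SQL", "JavaScript; Python; R"),
  stringsAsFactors = FALSE
)

# Counts for one country: filter, split on "; ", flatten, tabulate, sort
us <- survey[survey$Country == "United States", ]
langs <- unlist(strsplit(us$WantWorkLanguage, "; ", fixed = TRUE))
counts <- sort(table(langs), decreasing = TRUE)

# Counts for every country at once: repeat each country once per language it lists
parts <- strsplit(survey$WantWorkLanguage, "; ", fixed = TRUE)
by_country <- table(
  Country = rep(survey$Country, lengths(parts)),
  Language = unlist(parts)
)
```

Here `counts` is a sorted one-way table for the United States, and `by_country` is a country-by-language cross table that can be turned into long format with as.data.frame().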

Related

Manipulate/rearrange intervals in R columns

I have a data frame in R with a column (sni) with numbers that looks like this:
bransch        sni
name           15
name           15
name           16-18
somename       16-18
name           241-3
someothername  241-3
where I have to create a new column with just one number per row, i.e. no intervals. Every individual value in an interval should get its own row, so the result should look like this:
bransch        sni
name           15
name           15
name           16
name           17
name           18
somename       16
somename       17
somename       18
name           241
name           242
name           243
someothername  241
someothername  242
someothername  243
I'm a bit unsure which function can do this best, or whether someone has stumbled upon a similar problem/solution. So far I have tried to split the sni column at the "-" into two new columns, but then I'm stuck, since many rows in one of the new columns have no value. Also, the column is currently of type character.
Any advice?
Sincerely,
TS
It took a while. Here is a tidyverse approach:
library(dplyr)
library(tidyr)
df %>%
separate(sni, c("x", "y")) %>%
as_tibble() %>%
mutate(y = ifelse(as.numeric(y)<=9, paste0(substr(x, 1, nchar(x)-1), y),
y)) %>%
mutate(id = row_number()) %>%
pivot_longer(c(x,y)) %>%
mutate(value = as.numeric(value)) %>%
group_by(col2 =as.integer(gl(n(),2,n()))) %>%
fill(value, .direction = "down") %>%
complete(value = seq(first(value), last(value), by=1)) %>%
fill(bransch, .direction = "down") %>%
select(bransch, sni=value) %>%
group_by(col2, sni) %>%
slice(1)
col2 bransch sni
<int> <chr> <dbl>
1 1 name 15
2 2 name 15
3 3 name 16
4 3 name 17
5 3 name 18
6 4 somename 16
7 4 somename 17
8 4 somename 18
9 5 name 241
10 5 name 242
11 5 name 243
12 6 someothername 241
13 6 someothername 242
14 6 someothername 243
Let's try this.
Assume that only intervals starting with a three-digit number use the shorthand pattern 123-5 instead of 123-125. In the ifelse, we therefore rewrite this special pattern (e.g. 123-5) into the regular one (123-125), then split the interval into its integer endpoints using separate_rows.
We can then use complete to fill in the missing values of the sequence within each interval.
library(tidyverse)
df %>%
group_by(sni,bransch) %>%
mutate(sni2 = ifelse(grepl("-", sni) & nchar(sub("-.*$", "", sni)) >= 3,
sub("^(\\d\\d)(.)-", "\\1\\2-\\1", sni),
sni)) %>%
separate_rows(sni2, convert = T) %>%
complete(sni2 = min(sni2):max(sni2)) %>%
ungroup() %>%
select(-sni)
# A tibble: 14 × 2
bransch sni2
<chr> <int>
1 name 15
2 name 15
3 name 16
4 name 17
5 name 18
6 somename 16
7 somename 17
8 somename 18
9 name 241
10 name 242
11 name 243
12 someothername 241
13 someothername 242
14 someothername 243
If I understood correctly
tmp=setNames(strsplit(df$sni,"-"),df$bransch)
tmp=unlist(
lapply(tmp,function(x){
x=as.numeric(x)
if (length(x)>1) {
if (x[1]<x[2]) {
seq(x[1],x[2],1)
} else {
seq(x[1],x[1]+x[2]-1,1)
}
} else {
x
}
})
)
data.frame(
"bransch"=names(tmp),
"sni"=tmp
)
bransch sni
1 name 15
2 name 15
3 name1 16
4 name2 17
5 name3 18
6 somename1 16
7 somename2 17
8 somename3 18
9 name1 241
10 name2 242
11 name3 243
12 someothername1 241
13 someothername2 242
14 someothername3 243
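A base R variant that handles the shorthand by digit replacement can be sketched as follows. It assumes that a shorter end value (the 3 in "241-3") replaces the trailing digits of the start value; `expand_sni` is a hypothetical helper name, not something from the question.

```r
# Expand one string such as "15", "16-18" or "241-3" into an integer vector.
# Assumption: an end value with fewer digits replaces the last digits of the start.
expand_sni <- function(s) {
  parts <- strsplit(s, "-", fixed = TRUE)[[1]]
  from <- parts[1]
  if (length(parts) == 1) {
    return(as.integer(from))
  }
  to <- parts[2]
  if (nchar(to) < nchar(from)) {
    # "241-3": keep "24" from the start and append "3" to get "243"
    to <- paste0(substr(from, 1, nchar(from) - nchar(to)), to)
  }
  seq(as.integer(from), as.integer(to))
}

# Example data as shown in the question
df <- data.frame(
  bransch = c("name", "name", "name", "somename", "name", "someothername"),
  sni = c("15", "15", "16-18", "16-18", "241-3", "241-3"),
  stringsAsFactors = FALSE
)

# One row per expanded value, repeating bransch as often as needed
expanded <- lapply(df$sni, expand_sni)
result <- data.frame(
  bransch = rep(df$bransch, lengths(expanded)),
  sni = unlist(expanded)
)
```

Repeating `bransch` with rep() and lengths() keeps the names intact, which avoids the numbered names (name1, name2, ...) that unlist() produces on a named list.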
Using separate to get the start and end of the sequence, then we can map and unnest to get the result.
library (tidyverse)
data %>%
separate(
sni,
into = c("from", "to"),
fill = "right",
convert = TRUE) %>%
mutate(to = if_else(is.na(to), from, to)) %>%
transmute(
bransch,
sni = map2(from, to, `:`)) %>%
unnest_longer(sni)
# A tibble: 14 x 2
bransch sni
<chr> <int>
1 name 15
2 name 15
3 name 16
4 name 17
5 name 18
6 some name 16
7 some name 17
8 some name 18
9 name 241
10 name 242
11 name 243
12 someothername 241
13 someothername 242
14 someothername 243
Data
data <- tibble(
bransch = c("name","name","name","some name","name","someothername"),
sni =c("15","15","16-18","16-18","241-243","241-243"))

Web scraping with R (rvest)

I'm new to R and am having some trouble creating a good web scraper with it. I started studying the language only 5 days ago, so I'd appreciate any help!
Idea
I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, so that I can group the teams later and analyze them.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page and extract the HTML nodes that I selected with the "SelectorGadget" extension for Google Chrome. Some considerations:
The page I'm accessing is that of the 2002 championship. I did that because it made it easier to extract the links to the other seasons, which are listed in a box at the end of the page: a single selector (tr:nth-child(9) div a) matches all of them, and I read their links from the HTML attribute "href";
The CSS selector for the tables was chosen on the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and save them in a data frame!" However, it went wrong and I don't understand why. When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think it is reading a string instead of an XML document. What am I misunderstanding here? Where is my error? The "sapply" call? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by class and by a substring that the table contains. You can then use html_table() to extract the matched node in tabular format and subset it on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; you can adjust the columns vector.
If you wrap the years and the constructed URLs inside a map2_dfr() call, you can return a single data frame for all desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
read_html(.x, encoding = "utf-8") %>%
html_element('.wikitable:contains("ou rebaixamento")') %>%
html_table() %>%
.[columns] %>%
mutate(year = .y, SG = as.character(SG)))
You can get all the tables from those links by doing this:
tabela <- function(link){
read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables = lapply(links_temporadas, tabela)
names(all_tables)<-2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link in links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
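As for the error in the question: the original function parses the page into pagina_tabela but then pipes link, which is still a character string, into html_nodes(). A small offline sketch of the difference, using a made-up inline HTML snippet (`raw_html`) instead of a real Wikipedia page:

```r
library(rvest)

# A tiny inline page standing in for a season page (made-up content)
raw_html <- "<html><body>
  <table class='wikitable'>
    <tr><th>Pos.</th><th>Equipes</th></tr>
    <tr><td>1</td><td>Team A</td></tr>
  </table>
</body></html>"

# Selecting on the *string* fails: there is no xml_find_all method for character
on_string <- try(html_elements(raw_html, "table.wikitable"), silent = TRUE)

# Selecting on the *parsed document* works
page <- read_html(raw_html)
tables <- page %>%
  html_elements("table.wikitable") %>%
  html_table()
```

In the question's function, the equivalent change would be to start the pipe from pagina_tabela rather than from link.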
Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the first table that has a column named "Pos.".
Then, before rbinding the tables, keep only the common columns, since the older tables have one column fewer (column "M").
suppressPackageStartupMessages({
library("dplyr")
library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
lista_wiki <- pagina_tabela %>%
html_elements("table.wikitable") %>%
html_table()
i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
i <- which(i)[1]
lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To have all columns, including the "M" columns:
data.table::rbindlist(tabela_geral, fill = TRUE)
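The "keep only the common columns" step can be tried offline on two made-up tables. This is only a sketch; the team names and values below are invented and the real season tables have more columns.

```r
# Toy tables standing in for scraped season tables (made-up values)
older <- data.frame(
  Pos. = 1:2, Equipes = c("Team A", "Team B"), P = c(100, 87)
)
newer <- data.frame(
  Pos. = 1:2, Equipes = c("Team C", "Team B"), P = c(90, 74), M = c(2.4, 1.9)
)
tables <- list(older, newer)

# Columns present in every table
common_names <- Reduce(intersect, lapply(tables, names))

# Keep only those columns, then stack the tables row-wise
combined <- do.call(rbind, lapply(tables, `[`, common_names))
```

Because rbind() on data frames requires matching column names, dropping the extra "M" column first is what makes the final do.call(rbind, ...) succeed.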

Performing operation among levels of grouped variable in R/dplyr

I want to perform a calculation among the levels of a grouping variable and fit this into a dplyr/tidyverse-style workflow. I know this wording is confusing, but I hope the example below helps to clarify.
Below, I want to find the difference between levels "A" and "B" for each year for which I have data. One solution is to cast the data from long to wide format and use mutate() to find the difference between A and B, creating a new column with the results.
Ultimately, I'm working with a much larger dataset in which, for each of N species and for every year of sampling, I want to find the response ratio of some measured variable. Being able to keep the calculation in a long-format workflow would greatly help with later uses of the data.
library(tidyverse)
library(reshape)
set.seed(34)
test = data.frame(Year = rep(seq(2011,2020),2),
Letter = rep(c('A','B'),each = 10),
Response = sample(100,20))
test.results = test %>%
cast(Year ~ Letter, value = 'Response') %>%
mutate(diff = A - B)
#test.results
Year A B diff
2011 93 48 45
2012 33 44 -11
2013 9 80 -71
2014 10 61 -51
2015 50 67 -17
2016 8 43 -35
2017 86 20 66
2018 54 99 -45
2019 29 100 -71
2020 11 46 -35
Is there some solution where I could group by Year and then use a function like summarise() to calculate between the levels of the variable "Letter"?
test %>%
group_by(Year) %>%
summarise(...) # something here to perform a calculation between levels A and B of the variable "Letter"
You can subset the Response values for "A" and "B" and then take the difference.
library(dplyr)
test %>%
group_by(Year) %>%
summarise(diff = Response[Letter == 'A'] - Response[Letter == 'B'])
# Year diff
# <int> <int>
# 1 2011 45
# 2 2012 -11
# 3 2013 -71
# 4 2014 -51
# 5 2015 -17
# 6 2016 -35
# 7 2017 66
# 8 2018 -45
# 9 2019 -71
#10 2020 -35
In this example, we can also take advantage of ordering: if we arrange the data so that "B" comes before "A" within each year (hence desc(Letter)), diff() returns A - B:
test %>%
arrange(Year, desc(Letter)) %>%
group_by(Year) %>%
summarise(diff = diff(Response))
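The same subsetting idea should extend to the response ratio per species and year mentioned in the question. A minimal sketch, assuming a hypothetical Species column and made-up Response values, with exactly one "A" and one "B" row per species and year:

```r
library(dplyr)

# Made-up long-format data: one A and one B row per Species and Year
toy <- data.frame(
  Species = rep(c("sp1", "sp2"), each = 4),
  Year = rep(c(2011, 2011, 2012, 2012), times = 2),
  Letter = rep(c("A", "B"), times = 4),
  Response = c(10, 5, 8, 4, 9, 3, 6, 12)
)

# Response ratio A / B within each Species-Year group, kept in long format
ratios <- toy %>%
  group_by(Species, Year) %>%
  summarise(
    ratio = Response[Letter == "A"] / Response[Letter == "B"],
    .groups = "drop"
  )
```

If a group is missing one of the two levels, the subset has length zero and the result for that group will not be a single value, so it may be worth checking group sizes first.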

Using str_split to fill rows down data frame with number ranges and multiple numbers

I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
Using the following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split the codes on the commas. After that, we can loop through FAOCODE using map and ~eval(parse(text = .x)) to evaluate each number range. Finally, we can use unnest to expand the data frame.
library(tidyverse)
dat2 <- dat %>%
separate_rows(FAOCODE, sep = ",") %>%
mutate(FAOCODE = map(FAOCODE, ~eval(parse(text = .x)))) %>%
unnest(cols = FAOCODE)
dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
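If evaluating text with eval(parse()) feels too permissive for data read from a file, the ranges can also be expanded by splitting on ":" directly. A base R sketch under that assumption, where `expand_codes` is a hypothetical helper and the data is a shortened version of the example above:

```r
# Expand a string such as "68,71" or "260:262,312:313" into an integer vector
expand_codes <- function(code_string) {
  pieces <- strsplit(code_string, ",", fixed = TRUE)[[1]]
  unlist(lapply(pieces, function(p) {
    bounds <- as.integer(strsplit(p, ":", fixed = TRUE)[[1]])
    # Two bounds mean a range; a single value is returned as is
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  }))
}

# Shortened toy version of the example data
dat <- data.frame(
  SPAM_full_name = c("wheat", "other cereals", "other oil crops"),
  FAOCODE = c("15", "68,71,75", "260:262,312:313"),
  stringsAsFactors = FALSE
)

# One row per code, repeating the crop name as often as needed
codes <- lapply(dat$FAOCODE, expand_codes)
long <- data.frame(
  SPAM_full_name = rep(dat$SPAM_full_name, lengths(codes)),
  FAOCODE = unlist(codes)
)
```

This only understands plain integers and "from:to" ranges, which is all the example data contains; anything else becomes NA instead of being executed as code.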

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to Pivot tables in Excel, and trying to replicate a pivot I have done in Excel in R. I have spent a long time searching the internet/ YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column shows a number of locations and the top row shows the different pages that have been viewed. The cells should show the number of views per location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that field names with spaces in them are going to make your life really hard. You should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
data.frame(
month = sample(1:12, size = n, replace = TRUE),
team = letters[1:5],
value = sample(1:100, size = n, replace = TRUE)
)
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
filter(month == 10) %>%
group_by(team) %>%
summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
group_by(team, month) %>%
summarize(total_value = sum(value)) %>%
spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(dplyr)
library(rpivotTable)
specificreportsLocal %>%
filter(Month == "October") %>%
rpivotTable(rows = "Employee Team", cols = "page", vals = "views", aggregatorName = "Sum")
Otherwise, if you do not need it to be interactive (like the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>% filter(Month == "October") %>%
group_by_at(c("Employee Team", "page")) %>%
summarise(nr_views = sum(views, na.rm=TRUE))
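For a result laid out like an Excel pivot (teams down the side, pages across the top), base R's xtabs() is another option. A sketch on made-up data, where the column names month, team, page and views are assumptions about the real sheet:

```r
# Made-up data: one row per team, page and month with a view count
reports <- data.frame(
  month = c("October", "October", "October", "September", "October"),
  team = c("North", "North", "South", "South", "South"),
  page = c("home", "sales", "home", "home", "home"),
  views = c(3, 5, 2, 7, 4)
)

# Keep October only, then sum views for every team/page combination
october <- reports[reports$month == "October", ]
pivot <- xtabs(views ~ team + page, data = october)
```

Combinations that never occur, such as South and sales here, show up as 0 rather than NA.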
