I have a dataframe as follows:
df <- tibble::tribble(~home, ~visitor, ~hcountry, ~vcountry,
"Milan", "Manchester", "ITA", "ENG",
"LIVERPOOL", "MILAN", "ENG", "ITA",
"Real Madrid", "Juventus", "SPA", "ITA")
#> # A tibble: 3 x 4
#> home visitor hcountry vcountry
#> <chr> <chr> <chr> <chr>
#> 1 Milan Manchester ITA ENG
#> 2 LIVERPOOL MILAN ENG ITA
#> 3 Real Madrid Juventus SPA ITA
and would like to get only the Italian teams, i.e. Milan, Milan, Juventus. How is this possible without using loops?
First off, I recommend a basic R tutorial to familiarise yourself with basic R data operations such as subsetting. See, for example, R for Beginners on CRAN.
In your case you can do:
df[df$hcountry == "ITA" | df$vcountry == "ITA", ]
# home visitor hcountry vcountry
#1 Milan Manchester ITA ENG
#2 LIVERPOOL MILAN ENG ITA
#3 Real Madrid Juventus SPA ITA
Or
subset(df, hcountry == "ITA" | vcountry == "ITA")
Sample data
df <- read.table(text =
"home visitor hcountry vcountry
Milan Manchester ITA ENG
LIVERPOOL MILAN ENG ITA
'Real Madrid' Juventus SPA ITA", header = TRUE)
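Note that both calls above return whole rows, not team names. If only the names are wanted, as the question asks, one possible base R sketch is to index each team column by its own country column and concatenate the results (the name `italian` is just an illustrative choice):

```r
# Same sample data as above, built directly as a data frame
df <- data.frame(
  home     = c("Milan", "LIVERPOOL", "Real Madrid"),
  visitor  = c("Manchester", "MILAN", "Juventus"),
  hcountry = c("ITA", "ENG", "SPA"),
  vcountry = c("ENG", "ITA", "ITA"),
  stringsAsFactors = FALSE
)

# Home teams whose home country is ITA, then visiting teams whose country is ITA
italian <- c(df$home[df$hcountry == "ITA"],
             df$visitor[df$vcountry == "ITA"])
italian
```

The names keep their original capitalisation, so "Milan" and "MILAN" stay distinct unless they are normalised first.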
Alternatively, you could stack the home and visitor columns so that each row holds one team with its country, and then keep the unique Italian teams:
library(dplyr)
library(tidyr)
df %>% gather(key1, country, -c(home, visitor)) %>%
gather(key2, team, -c(key1, country)) %>%
mutate_at(vars(key1, key2), substr, start=1, stop=1) %>%
filter(key1==key2) %>% select(-key1, -key2) %>%
mutate(team=tools::toTitleCase(tolower(team))) %>%
filter(country=="ITA") %>%
distinct()
#> # A tibble: 2 x 2
#> country team
#> <chr> <chr>
#> 1 ITA Milan
#> 2 ITA Juventus
Remove the final distinct() if you want to keep the duplicated Milan row.
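The same stacking idea can probably be written more directly by selecting each team/country pair under common column names and binding the two pieces together. This is only a sketch; the column names `team` and `country` are chosen here for illustration:

```r
library(dplyr)

df <- tibble::tribble(
  ~home,         ~visitor,     ~hcountry, ~vcountry,
  "Milan",       "Manchester", "ITA",     "ENG",
  "LIVERPOOL",   "MILAN",      "ENG",     "ITA",
  "Real Madrid", "Juventus",   "SPA",     "ITA"
)

# Stack (home, hcountry) on top of (visitor, vcountry), then filter
teams <- bind_rows(
  select(df, team = home,    country = hcountry),
  select(df, team = visitor, country = vcountry)
) %>%
  filter(country == "ITA") %>%
  mutate(team = tools::toTitleCase(tolower(team)))

teams
```

Adding distinct() at the end would collapse the two Milan rows into one, as in the answer above.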
We can use filter from dplyr
library(dplyr)
df %>%
filter(hcountry == "ITA" | vcountry == "ITA")
Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
Split column City into three columns: City, Country, State (if available, NA otherwise)
Make sure Milwaukee ends up with both a state and a population: the Milwaukee row whose population is NA should take the value 400000 from the other Milwaukee row, and the result should then be split into City, State and Country.
Could you please suggest the easiest way to do this?
Here's another solution that uses extract to pull out Country, City, and State in a single step, with State captured by an optional capture group (the rest of the task is done as in @Allen Cameron's answer):
library(tidyr)
library(dplyr)
df1 %>%
extract(City,
into = c("Country", "City", "State"),
regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
) %>%
# as in @Allen Cameron's answer:
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
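To see what that regular expression captures, it may help to run it outside of extract. The sketch below uses base R's regexec and regmatches on two of the strings from the question; `pattern`, `x` and `parts` are illustrative names:

```r
pattern <- "([^>]+) > ([^,]+),? ?([A-Z]+)?"
x <- c("United Kingdom > Leeds", "Canada > Nanaimo, BC")

# Each list element holds the full match followed by the capture groups
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))

parts[[1]][2:3]  # country and city; the optional state group has nothing to capture
parts[[2]][2:4]  # country, city and state
```

The first group takes everything before " > ", the second takes everything up to an optional comma, and the third only matches when capital letters follow.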
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
separate(City, sep = " > ", into = c("Country", "City")) %>%
separate(City, sep = ', ', into = c('City', 'State')) %>%
group_by(Country, City) %>%
summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000
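The summarize step is what merges the two Milwaukee rows. A minimal sketch of that idea on a toy two-row table (the table `toy` is made up for illustration, and first(na.omit(...)) is a swapped-in way of picking the first non-missing value, not the code used above):

```r
library(dplyr)

# Two partial records for the same city: one has the state, the other the population
toy <- tibble(
  City       = c("Milwaukee", "Milwaukee"),
  State      = c("WI", NA),
  Population = c(NA, 400000)
)

res <- toy %>%
  group_by(City) %>%
  summarise(
    State      = first(na.omit(State)),       # first non-missing state, NA if none
    Population = first(na.omit(Population))   # first non-missing population
  )

res
```

This assumes each group has at most one distinct non-missing value per column; if a city had two different populations, only the first would be kept.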
I want to scrape the contents of a multi-page website using R. Currently I'm able to scrape the first page. How do I scrape all pages and store them in a CSV file?
Here's my code so far:
library(rvest)
library(tibble)
library(tidyr)
library(dplyr)
df = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100' %>%
read_html() %>% html_table()
df
write.csv(df,"Contracts_test_taneps.csv")
Scrape multiple pages. Change 1:2 to 1:n, where n is the number of pages you want.
library(tidyverse)
library(rvest)
get_taneps <- function(page) {
str_c("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100") %>%
read_html() %>%
html_table() %>%
getElement(1) %>%
janitor::clean_names()
}
map_dfr(1:2, get_taneps)
# A tibble: 200 x 7
tender_no procuring_entity suppl~1 award~2 award~3 lot_n~4 notic~5
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Municipal Council SHIBAM~ 08/11/~ "66200~ N/A NA
2 AE/005/2022-2023/DODOMA/FA/NC/02 Ministry of Livestock and Fish~ NINO G~ 04/11/~ "46511~ N/A NA
3 LGA/014/2022/2023/G/01 UTAWALA Bagamoyo District Council VILANG~ 02/11/~ "90000~ N/A NA
4 LGA/014/014/2022/2023/G/01 FEDHA 3EPICAR Bagamoyo District Council VILANG~ 02/11/~ "88100~ N/A NA
5 LGA/014/2022/2023/G/01/ARDHI Bagamoyo District Council VILANG~ 31/10/~ "16088~ N/A NA
6 LGA/014/2022/2023/G/11 VIFAA VYA USAFI SOKO LA SAMAKI Bagamoyo District Council MBUTUL~ 31/10/~ "10000~ N/A NA
7 DCD - 000899- 400E - ANIMAL FEEDS Kibaha Education Centre ALOYCE~ 29/10/~ "82400~ N/A NA
8 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Regional Referral Hos~ JIGABH~ 02/11/~ "17950~ N/A NA
9 IE/023/2022-23/HQ/G/13 Commission for Mediation and A~ AKO GR~ 27/10/~ "42500~ N/A NA
10 AE/005/2022-2023/MOROGORO/FA/G/05 Morogoro Municipal Council THE GR~ 01/11/~ "17247~ N/A NA
# ... with 190 more rows, and abbreviated variable names 1: supplier_name, 2: award_date, 3: award_amount, 4: lot_name,
# 5: notice_pdf
# i Use `print(n = ...)` to see more rows
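Write as .csv (the combined result has to be stored or piped first, since the call above only prints it)
map_dfr(1:2, get_taneps) %>%
write_csv("Contracts_test_taneps.csv")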
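The only thing that changes between pages is the d-3998960-p query parameter. As a rough sketch, the page URLs can be built up front with base R and inspected before any request is made (`base_url` and `page_url` are illustrative names; nothing here touches the network):

```r
base_url <- "https://www.taneps.go.tz/epps/viewAllAwardedContracts.do"

# Build the URL for one page number, using the query string from the question
page_url <- function(page) {
  sprintf("%s?d-3998960-p=%d&selectedItem=viewAllAwardedContracts.do&T01_ps=100",
          base_url, as.integer(page))
}

urls <- vapply(1:3, page_url, character(1))
urls
```

Each element of `urls` could then be passed to read_html() in a loop or map call, ideally with a short Sys.sleep() between requests to avoid hammering the server.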
I have a dataframe that looks like this:
+------------+
|site |
+------------+
|JPN Tokyo |
|AUS Sydney |
|CHN Beijing |
But I'd like to make duplicate rows of the existing rows but with the 2nd and 3rd character changed to lowercase such that the dataframe becomes like this:
+------------+
|site |
+------------+
|JPN Tokyo |
|Jpn Tokyo |
|AUS Sydney |
|Aus Sydney |
|CHN Beijing |
|Chn Beijing |
Would anyone have an idea how to do that?
We expand the rows with uncount, use duplicated on 'site' to flag the second copy of each row, and in those copies lower-case everything after the first character of the leading word using sub inside case_when:
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df1 %>%
uncount(2) %>%
mutate(site = case_when(duplicated(site)
~ sub("^(.)(\\w+)", "\\1\\L\\2", site, perl = TRUE),
TRUE ~ site))
-output
df1
# A tibble: 6 x 1
site
<chr>
1 JPN Tokyo
2 Jpn Tokyo
3 AUS Sydney
4 Aus Sydney
5 CHN Beijing
6 Chn Beijing
data
df1 <- structure(list(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"
)), class = "data.frame", row.names = c(NA, -3L))
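The replacement string relies on the Perl-style \\L escape, which only works with perl = TRUE. A small standalone sketch of just that step (`site` and `lowered` are illustrative names):

```r
site <- c("JPN Tokyo", "AUS Sydney", "CHN Beijing")

# \\1 keeps the first character as is; \\L lower-cases what follows it in the
# replacement, i.e. the rest of the first word captured by (\\w+)
lowered <- sub("^(.)(\\w+)", "\\1\\L\\2", site, perl = TRUE)
lowered
```

Because \\w+ stops at the first space, the city name after the country code is left untouched.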
Edit: @AnilGoyal suggested using map_dfr, which reduces the call to a single line.
library(tidyverse)
data <-
tribble(
~site,
'JPN Tokyo',
'AUS Sydney',
'CHN Beijing' )
#option1
map_dfr(data$site, ~list(sites = c(.x, str_to_title(.x))))
#> # A tibble: 6 x 1
#> sites
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
#option2
map(data$site, ~rbind(.x, str_to_title(.x))) %>%
reduce(rbind) %>%
tibble(site = .)
#> # A tibble: 6 x 1
#> site[,1]
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
Created on 2021-06-08 by the reprex package (v2.0.0)
You can use substr to replace characters at specific positions.
df1 <- df
substr(df1$site, 2, 3) <- tolower(substr(df1$site, 2, 3))
df1
# site
#1 Jpn Tokyo
#2 Aus Sydney
#3 Chn Beijing
res <- rbind(df1, df)
res[order(res$site), , drop = FALSE]
# site
#2 Aus Sydney
#5 AUS Sydney
#3 Chn Beijing
#6 CHN Beijing
#1 Jpn Tokyo
#4 JPN Tokyo
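Sorting by site puts the rows in alphabetical order, and how upper and lower case sort against each other depends on the locale. If the original row order from the question matters, one possible variation is to interleave the two copies by row position instead. This is a sketch; `lower` and `res` are illustrative names:

```r
df <- data.frame(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"),
                 stringsAsFactors = FALSE)

# Copy with characters 2 and 3 lower-cased
lower <- df
substr(lower$site, 2, 3) <- tolower(substr(lower$site, 2, 3))

# Stack originals on top of the copies, then order by original row position
# so that each original row is followed directly by its lower-cased copy
res <- rbind(df, lower)
res <- res[order(rep(seq_len(nrow(df)), times = 2)), , drop = FALSE]
rownames(res) <- NULL
res
```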
I want to display a frequency table with the total for domestic (Boston + Salt Lake City) and the total for international (London + Shanghai), but it prints like this:
table$Category<-c("Domestic","International")
> table
problem.6.data Freq Category
1 Boston 136 Domestic
2 London 102 International
3 Salt Lake City 277 Domestic
4 Shanghai 184 International
I want an output of:
1. Domestic: 136+277
2. International: 102+ 184
so, in the end the table should look like:
Domestic: 413
International: 286
What am I doing wrong?
If you don't mind using the tidyverse, you could use group_by() and summarize():
library(tidyverse)
df <-
data.frame(
stringsAsFactors = FALSE,
problem.6.data = c("Boston", "London", "Salt Lake City", "Shanghai"),
Freq = c(136L, 102L, 277L, 184L),
Category = c("Domestic", "International", "Domestic", "International")
)
df %>%
group_by(Category) %>%
summarise(sum = sum(Freq))
#> # A tibble: 2 x 2
#> Category sum
#> <chr> <int>
#> 1 Domestic 413
#> 2 International 286
Created on 2020-03-19 by the reprex package (v0.3.0)
Maybe aggregate from base R can give the desired output
dfout <- aggregate(Freq ~ Category, df, sum)
such that
> dfout
Category Freq
1 Domestic 413
2 International 286
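In the question, the Category column is filled by recycling c("Domestic", "International"), which only works because the cities happen to alternate. A possibly more robust sketch is to map each city to its category with a named lookup vector and then total with tapply (`lookup` and `totals` are illustrative names):

```r
df <- data.frame(
  problem.6.data = c("Boston", "London", "Salt Lake City", "Shanghai"),
  Freq = c(136L, 102L, 277L, 184L),
  stringsAsFactors = FALSE
)

# Explicit city-to-category mapping instead of relying on row order
lookup <- c("Boston"         = "Domestic",
            "Salt Lake City" = "Domestic",
            "London"         = "International",
            "Shanghai"       = "International")
df$Category <- unname(lookup[df$problem.6.data])

# Sum the frequencies within each category
totals <- tapply(df$Freq, df$Category, sum)
totals
```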
I'm trying to find specific words, listed in the tibble arbeit, in the text column of another tibble, rawEng$Text. If one or more words are found, I want to create (or mutate) a new data frame iDataArbeit with two new columns: wArbeit for the found word(s), and iArbeit for the sum of their tf-idf scores from arbeit$tfidf.
My Data:
arbeit:
X1 feature tfidf
<dbl> <chr> <dbl>
1 0 sick 0.338
2 2 contract 0.188
3 3 pay 0.175
4 4 job 0.170
5 5 boss 0.169
6 6 sozialversicherungsnummer 0.169
rawEng:
Gender Gruppe Datum Text
<chr> <chr> <dttm> <chr>
1 F Berlin Expats 2017-07-07 00:00:00 Anyone out there who's had to apply for Führung~
2 F FAB 2018-01-18 00:00:00 Dear FAB, I am in need of a Führungszeugnis no ~
3 M Free Advice ~ 2017-01-30 00:00:00 Dear Friends, i would like to ask you how can I~
4 M FAB 2018-04-12 00:00:00 "Does anyone know why the \"Standesamt Pankow (~
5 F Berlin Expats 2018-11-12 00:00:00 having trouble finding consistent information a~
6 F Toytown Berl~ 2017-06-08 00:00:00 "Hello\r\n\r\nI have a question regarding Airbn~
I've tried with dplyr::mutate, using this code:
idataEnArbeit <- mutate(rawEng, wArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$feature, NA),
iArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$tfidf, NA))
but all I get is a single word and its tf-idf score in the new columns iDataArbeit$wArbeit and iDataArbeit$iArbeit:
Gender Gruppe Datum Text wArbeit iArbeit
<chr> <chr> <dttm> <chr> <chr> <dbl>
1 F Berlin | Girl ~ 2018-09-11 13:22:05 "11 septembre, 13:21 GGI ~ sick 0.338
2 F ExpatBabies Be~ 2017-10-19 16:24:23 "16:24 Babysitter needed! B~ sick 0.338
3 F Berlin | Girl ~ 2018-06-22 18:24:19 "gepostet. Leonor Valen~ sick 0.338
4 F 'Neu in Berlin' 2018-09-18 23:19:51 "Hello guys, I am working wit~ sick 0.338
5 M Free Advice Be~ 2018-04-27 08:49:24 "In need of legal advice: Wha~ sick 0.338
6 F Free Advice Be~ 2018-07-04 18:33:03 "Is there somebody I can pay ~ sick 0.338
In summary: I want all words from arbeit$feature that are found in rawEng$Text to be added to iDataArbeit$wArbeit, and the sum of their tf-idf scores to be added to iDataArbeit$iArbeit.
Since I don't have your data, I'll use the gutenbergr library and work with Treasure Island.
library(dplyr)
library(tidytext)
library(gutenbergr)
## Now get the dataset
Treasure_Island <- gutenberg_works(title == "Treasure Island") %>% pull(gutenberg_id) %>%
gutenberg_download(.)
## and construct a toy arbeit:
arbeit <- data.frame(feature = c("island", "treasure", "to"),
tfidf = c(0.3,0.5,0.6))
## Break each line of text into its individual words (head() just keeps the example short; omit it for real data)
tidy_treasure <- unnest_tokens(Treasure_Island, feature, text, drop = FALSE) %>%
head(500)
## now bring the tfidf into tidy_treasure
df <- left_join(tidy_treasure, arbeit, by = "feature")
## and now you can sum the tfidf by line of text.
## To get the words we have to throw out the words that don't contribute to our tfidf.
## Two options:
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>% summarize(SumTFIDF = sum(tfidf, na.rm = TRUE),
Words = paste(feature, collapse = ";"))
## Or if you want to keep a row for each found word, we can't use summarize, but we can still add them all up.
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>% mutate(SumTFIDF = sum(tfidf, na.rm = TRUE))
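For the layout asked for in the question (one row per text, with the found words and their summed score), a base R sketch without tokenising might look like the following. The three texts are invented for illustration, only the first three rows of arbeit are used, and `match_features` is a hypothetical helper name:

```r
arbeit <- data.frame(
  feature = c("sick", "contract", "pay"),
  tfidf   = c(0.338, 0.188, 0.175),
  stringsAsFactors = FALSE
)

# Made-up example texts standing in for rawEng$Text
texts <- c("I am sick and need my pay",
           "no match here",
           "My contract ended")

# For one text, find which features occur as whole words and sum their scores
match_features <- function(txt) {
  found <- vapply(arbeit$feature, function(w) {
    grepl(paste0("\\b", w, "\\b"), txt, ignore.case = TRUE, perl = TRUE)
  }, logical(1))
  data.frame(
    wArbeit = if (any(found)) paste(arbeit$feature[found], collapse = ";") else NA_character_,
    iArbeit = if (any(found)) sum(arbeit$tfidf[found]) else NA_real_,
    stringsAsFactors = FALSE
  )
}

result <- data.frame(Text = texts,
                     do.call(rbind, lapply(texts, match_features)),
                     stringsAsFactors = FALSE)
result
```

Each feature is counted once per text here even if it appears several times, which differs from the token-based join above, where every occurrence adds its score.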