Is it possible to convert lines from a text file into columns to get a dataframe? - r

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"

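There are built-in functions for dealing with this kind of data. Here xx holds the file contents as a single string; to read from disk, pass the file name as scan's first argument instead of text = xx: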
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
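The question also asks for an ID column, which the scan call above does not produce. A minimal sketch of one way to add it, assuming a shortened two-record version of the sample text (the names xx_small and books are made up for this illustration):

```r
# Shortened sample text: three lines per record (author, book, country).
xx_small <- "Oscar Wilde\nDe Profundis\nIreland\nNathaniel Hawthorn\nBirthmark\nUSA"

# scan() fills the three list slots in turn, one line per field.
books <- data.frame(scan(text = xx_small, multi.line = TRUE,
                         what = list(Author = "", Book = "", Country = ""),
                         sep = "\n", quiet = TRUE))

# Prepend a running ID, one per record.
books <- cbind(ID = seq_len(nrow(books)), books)
books
```

seq_len(nrow(books)) is used rather than 1:nrow(books) so that an empty input yields zero IDs instead of c(1, 0).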

You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
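One caveat with read.table(sep = ',') is that a title containing a comma, an apostrophe, or a # would be split or misread, because read.table treats those characters specially by default. A hedged sketch of the same matrix idea using readLines instead, which takes each line verbatim (the temporary file and the shortened sample are assumptions for the illustration):

```r
# Write a small sample file: three lines per record.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Oscar Wilde", "De Profundis", "Ireland",
             "James Joyce", "Ulysses", "Ireland"), tmp)

# readLines() returns one string per line, with no quote or separator handling.
lines <- readLines(tmp)

# Guard against a truncated file before reshaping.
stopifnot(length(lines) %% 3 == 0)

# Fill a 3-column matrix row by row, then convert to a data frame.
result <- as.data.frame(
  matrix(lines, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("Author", "Book", "Country"))),
  stringsAsFactors = FALSE
)
result <- cbind(ID = seq_len(nrow(result)), result)
result
```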

Another option is to import the file one line per row and reshape the data afterwards.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
mutate(
rid = ((row_number()-1) %% 3)+1,
pid = (row_number()-1) %/%3 + 1) %>%
mutate(col=case_when(rid==1~"Author",rid==2~"Book", rid==3~"Country")) %>%
select(-rid) %>%
pivot_wider(names_from=col, values_from=V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

Related

data wrangling in R with names_pattern for pivoting on ENDING pattern?

I have a dataset with a column, CatSex, that holds values of the form "American.Indian.or.Alaska.Native.men". I want to split off the characters after the last period, so that I end up with two columns: Cat, with only the demographic info, and Sex, with the sex. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it seems to cope better than Tableau Prep with large data sets. I used this code to get part of the way there (the original data had around 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to handle the pattern I'm now left with in the column CatSex:
pivot_longer(
cols = -c(UnitID, Institution.Name),
names_to = c("CatSex", "Disc"),
names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
values_to = "Count",
values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code--I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Using extract from tidyr package (it is in tidyverse)
Capture 2 groups with ()
Define second group to have one or more characters that are not . up to the end $
library(dplyr)
library(tidyr)
df %>%
extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University Black.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication..journalism..and.related.programs
2 Communication..journalism..and.related.programs
3 Communication..journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication..journalism..and.related.programs
7 Computer.and.information.sciences.and.support.serv
8 Computer.and.information.sciences.and.support.servi
9 Computer.and.information.sciences.and.support.servi
10 Computer.and.information.sciences.and.support.servi
11 Computer.and.information.sciences.and.support.servi
12 Computer.and.information.sciences.and.support.servi.
13 Computer.and.information.sciences.and.support.serv
14 Computer.and.information.sciences.and.support.servi.
15 Computer.and.information.sciences.and.support.servi
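To see what the pattern "(.*)\\.([^.]+)$" captures without loading tidyr, here is a small base R sketch using regexec and regmatches on two made-up values (the vector and variable names are assumptions for the illustration). The greedy (.*) takes everything up to the last period, and ([^.]+)$ takes the period-free tail.

```r
cat_sex <- c("American.Indian.or.Alaska.Native.men", "Grand.total.women")

# regexec() returns the full match followed by each capture group.
parts <- regmatches(cat_sex, regexec("(.*)\\.([^.]+)$", cat_sex))

# Element 1 is the whole match, 2 is the first group, 3 is the second group.
cat_part <- vapply(parts, `[`, character(1), 2)
sex_part <- vapply(parts, `[`, character(1), 3)

data.frame(Cat = cat_part, Sex = sex_part)
```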
pivot_longer is not the right function in this context.
Here are a few options:
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
# Cat Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
Using stringr functions
library(dplyr)
library(stringr)
df %>%
mutate(Sex = str_extract(CatSex, '(men|women)$'), # anchor at the end so only the suffix matches
Cat = str_remove(CatSex, '\\.(men|women)$'))
In base R
transform(df, Sex = sub('.*\\.(men|women)$', '\\1', CatSex),
Cat = sub('\\.(men|women)$', '', CatSex))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
"White.non.Hispanic.men", "White.non.Hispanic.women",
"Black.non.Hispanic.men", "Black.non.Hispanic.women"))

Selecting a column with a dot in R (nested object)

I'm new to R and I'm not sure how to rephrase the question, but basically, I have this dataset coming from the following code:
library(dplyr) # for %>%, bind_rows, inner_join, select
data_url <- 'https://prod-scores-api.ausopen.com/year/2021/stats'
dat <- jsonlite::fromJSON(data_url)
men_aces <- bind_rows(dat$statistics$rankings[[1]]$players[1])
men_aces_table <- dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>% select(full_name, nationality)
Which resulted in this data frame:
full_name nationality.uuid nationality.name nationality.code
1 Novak Djokovic 99da9b29-eade-4ac3-a7b0-b0b8c2192df7 Serbia SRB
2 Alexander Zverev 99d83e85-3173-4ccc-9d91-8368720f4a47 Germany GER
3 Milos Raonic 07779acb-6740-4b26-a664-f01c0b54b390 Canada CAN
4 Daniil Medvedev fa925d2d-337f-4074-a0bd-afddb38d66e1 Russia RUS
5 Nick Kyrgios 9b11f78c-47c1-43c4-97d0-ba3381eb9f07 Australia AUS
If you check the JSON URL, nationality is a nested object inside the player object; it contains the properties above (uuid, name, code). If I select the full_name property, I get the value (which is of type character) right back.
I'm not sure how to select just the name from that nested data frame (nationality) and rename it to country.
My expected outcome is:
full_name country
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
I would appreciate some help. Sorry I was unclear.
Use purrr::pmap_chr
library(tidyverse)
dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
select(full_name, nationality) %>%
mutate(nationality = pmap_chr(nationality, ~ ..2))
full_name nationality
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
11 Aslan Karatsev Russia
12 Taylor Fritz United States of America
13 Matteo Berrettini Italy
14 Grigor Dimitrov Bulgaria
15 Feliciano Lopez Spain
16 Stefanos Tsitsipas Greece
17 Felix Auger-Aliassime Canada
18 Thanasi Kokkinakis Australia
19 Ugo Humbert France
20 Borna Coric Croatia
You could do:
bind_cols(full_name = dat$players$full_name, country = dat$players$nationality$name)
# A tibble: 169 x 2
full_name country
<chr> <chr>
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
Just add this line at the end:
newdf <- data.frame(full_name = men_aces_table$full_name, country = men_aces_table$nationality$name)
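The answers above all rely on nationality being a data frame stored as a single column of another data frame, which is how jsonlite represents nested JSON objects. A minimal offline sketch of that structure, assuming a made-up two-player table (the uuid values here are placeholders, not real API data):

```r
# Outer data frame with one ordinary column.
players <- data.frame(full_name = c("Novak Djokovic", "Milos Raonic"),
                      stringsAsFactors = FALSE)

# Assigning a data frame with `$<-` stores it as one nested column.
players$nationality <- data.frame(uuid = c("id-1", "id-2"),
                                  name = c("Serbia", "Canada"),
                                  code = c("SRB", "CAN"),
                                  stringsAsFactors = FALSE)

# The outer frame still has only two columns; printing shows the inner ones
# with a "nationality." prefix, which is why they look like dotted names.
names(players)

# Chain `$` to reach the inner column and rename it on the way out.
result <- data.frame(full_name = players$full_name,
                     country = players$nationality$name,
                     stringsAsFactors = FALSE)
result
```

Because "nationality.name" is only a printing convention, selecting a column by that dotted name fails; the inner column has to be reached through the nested data frame.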

Find airports that have flights connected to them

library(tidyverse)
library(nycflights13)
I want to find out which airports have flights to them. My attempt is shown below, but it is not correct (it yields a number that is far larger than the number of airports):
airPortFlights <- airports %>% rename(dest=faa) %>% left_join(flights, "dest"=faa)
If anyone wonders why I do the rename above, that's because it won't let me do
airports %>% left_join(flights, "dest"=faa)
It gives
Error: by required, because the data sources have no common variables
I even tried airports %>% left_join(flights, by=c("dest"=faa)) and several other attempts, which are also not working.
Thanks in advance.
You want an inner_join and then either count the distinct flights, or just list the airports using distinct. Here I count them.
library(dplyr)
inner_join(airports, flights, by=c("faa"="dest")) %>%
count(faa, name) %>% # number of flights
arrange(-n)
# A tibble: 101 x 3
faa name n
<chr> <chr> <int>
1 ORD Chicago Ohare Intl 17283
2 ATL Hartsfield Jackson Atlanta Intl 17215
3 LAX Los Angeles Intl 16174
4 BOS General Edward Lawrence Logan Intl 15508
5 MCO Orlando Intl 14082
6 CLT Charlotte Douglas Intl 14064
7 SFO San Francisco Intl 13331
8 FLL Fort Lauderdale Hollywood Intl 12055
9 MIA Miami Intl 11728
10 DCA Ronald Reagan Washington Natl 9705
# ... with 91 more rows
So 101 of the 1,458 airports in this dataset have at least 1 record in the flights dataset, with Chicago's O'Hare Intl airport having the most flights from New York.
And just for fun, the following lists the airports that don't have any flights from NY:
anti_join(airports, flights, by=c("faa"="dest"))
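The original left_join returned far more rows than there are airports because a mutating join emits one row per matching flight. If only the list of airports is needed, a filtering join keeps each airport at most once. A small sketch with made-up toy tables (airports_toy and flights_toy are assumptions standing in for the nycflights13 data):

```r
library(dplyr)

# Toy stand-ins for the airports and flights tables.
airports_toy <- tibble(faa  = c("ORD", "ATL", "XYZ"),
                       name = c("Chicago Ohare Intl",
                                "Hartsfield Jackson Atlanta Intl",
                                "Hypothetical Airfield"))
flights_toy <- tibble(dest = c("ORD", "ORD", "ATL"))

# semi_join() keeps airports with at least one matching flight,
# without duplicating rows or adding columns from flights_toy.
with_flights <- semi_join(airports_toy, flights_toy, by = c("faa" = "dest"))

# anti_join() is the complement: airports with no matching flight.
without_flights <- anti_join(airports_toy, flights_toy, by = c("faa" = "dest"))

with_flights
```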

Swap Misplaced cells in R?

I have a huge database (more than 65M rows) and I noticed that some cells are misplaced. As an example, let's say I have this:
library("tidyverse")
DATA <- tribble(
~SURNAME,~NAME,~STATE,~COUNTRY,
'Smith','Emma','California','USA',
'Johnson','Oliia','Texas','USA',
'Williams','James','USA','California',
'Jones','Noah','Pennsylvania','USA',
'Williams','Liam','Illinois','USA',
'Brown','Sophia','USA','Louisiana',
'Daves','Evelyn','USA','Oregon',
'Miller','Jacob','New Mexico','USA',
'Williams','Lucas','Connecticut','USA',
'Daves','John','California','USA',
'Jones','Carl','USA','Illinois'
)
=====
> DATA
# A tibble: 11 x 4
SURNAME NAME STATE COUNTRY
<chr> <chr> <chr> <chr>
1 Smith Emma California USA
2 Johnson Oliia Texas USA
3 Williams James USA California
4 Jones Noah Pennsylvania USA
5 Williams Liam Illinois USA
6 Brown Sophia USA Louisiana
7 Daves Evelyn USA Oregon
8 Miller Jacob New Mexico USA
9 Williams Lucas Connecticut USA
10 Daves John California USA
11 Jones Carl USA Illinois
As you can see, COUNTRY and STATE are swapped in some rows. How can I efficiently swap them back?
Kind Regards,
Luiz.
Using data.table and the built-in state.name vector:
library(data.table)
setDT(DATA)
DATA[COUNTRY %in% state.name, `:=`(COUNTRY = STATE, STATE = COUNTRY)]
DATA
# SURNAME NAME STATE COUNTRY
# 1: Smith Emma California USA
# 2: Johnson Oliia Texas USA
# 3: Williams James California USA
# 4: Jones Noah Pennsylvania USA
# 5: Williams Liam Illinois USA
# 6: Brown Sophia Louisiana USA
# 7: Daves Evelyn Oregon USA
# 8: Miller Jacob New Mexico USA
# 9: Williams Lucas Connecticut USA
# 10: Daves John California USA
# 11: Jones Carl Illinois USA
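The same swap can be sketched in base R with a logical index, assuming (as the data.table answer does) that a misplaced row is one whose COUNTRY value appears in the built-in state.name vector. The three-row data frame is a made-up sample:

```r
DATA_small <- data.frame(STATE   = c("California", "USA", "Texas"),
                         COUNTRY = c("USA", "Oregon", "USA"),
                         stringsAsFactors = FALSE)

# Rows where a US state name ended up in the COUNTRY column.
swapped <- DATA_small$COUNTRY %in% state.name

# Assign the two columns in reversed order for those rows only;
# the right-hand side is evaluated before anything is overwritten.
DATA_small[swapped, c("STATE", "COUNTRY")] <-
  DATA_small[swapped, c("COUNTRY", "STATE")]

DATA_small
```

This only covers US states; for other countries the lookup vector would need to come from elsewhere.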
Check this solution (it assumes that COUNTRY column is in ISO3 format e.g. MEX, CAN):
DATA %>%
mutate(
COUNTRY_TMP = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), COUNTRY, STATE), # anchored: exactly three capitals
STATE = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), STATE, COUNTRY),
COUNTRY = COUNTRY_TMP
) %>%
select(-COUNTRY_TMP)
Assuming all country names follow the ISO3 format, we can first install the countrycode package. The package includes a data frame called codelist with a column iso3c that holds the ISO3 country codes. We can use it as follows to swap the country and state back.
library(tidyverse)
library(countrycode)
DATA2 <- DATA %>%
mutate(STATE2 = ifelse(STATE %in% codelist$iso3c &
!COUNTRY %in% codelist$iso3c, COUNTRY, STATE),
COUNTRY2 = ifelse(!STATE %in% codelist$iso3c &
COUNTRY %in% codelist$iso3c, COUNTRY, STATE)) %>%
select(-STATE, -COUNTRY) %>%
rename(STATE = STATE2, COUNTRY = COUNTRY2)
DATA2
# # A tibble: 11 x 4
# SURNAME NAME STATE COUNTRY
# <chr> <chr> <chr> <chr>
# 1 Smith Emma California USA
# 2 Johnson Oliia Texas USA
# 3 Williams James California USA
# 4 Jones Noah Pennsylvania USA
# 5 Williams Liam Illinois USA
# 6 Brown Sophia Louisiana USA
# 7 Daves Evelyn Oregon USA
# 8 Miller Jacob New Mexico USA
# 9 Williams Lucas Connecticut USA
# 10 Daves John California USA
# 11 Jones Carl Illinois USA

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
# tibble() replaces the deprecated data_frame()
node_list <- tibble(name = c("Joe", "Frank", "Peter"),
                    city = c("New York", "Detroit", "Maimi"),
                    age = c(24, 55, 65))
edge_list <- tibble(from = c("Joe", "Frank", "Peter", "Albert"),
                    to = c("Frank", "Albert", "James", "Tony"),
                    to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
full_join(select(edge_list, name = to, city = to_city)) %>%
full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
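An alternative sketch, under the same sample data, starts from the full set of names and fills in the city with coalesce. Explicit by arguments avoid the "Joining, by" messages. The intermediate names all_names and master are made up for this illustration, and the approach assumes each name appears at most once in the to column.

```r
library(dplyr)

node_list <- tibble(name = c("Joe", "Frank", "Peter"),
                    age  = c(24, 55, 65),
                    city = c("New York", "Detroit", "Maimi"))
edge_list <- tibble(from    = c("Joe", "Frank", "Peter", "Albert"),
                    to      = c("Frank", "Albert", "James", "Tony"),
                    to_city = c("Detroit", "St. Louis", "New York", "Carson City"))

# Every name that appears on either end of an edge.
all_names <- tibble(name = union(edge_list$from, edge_list$to))

master <- all_names %>%
  left_join(node_list, by = "name") %>%                                # age and known city
  left_join(select(edge_list, name = to, to_city), by = "name") %>%   # city from edges
  mutate(city = coalesce(city, to_city)) %>%                           # prefer node_list city
  select(-to_city)

master
```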
