How do I change the way my data is organized? - r

My data is currently organized in the following method (see first table below for actual data). I am only showing a portion of the overall data as the full image is quite large (over 100 rows).
Row September October November December January February March April May June July
1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
I want to change the way it is organized so that it shows the following. I want to emphasize that it would show all of the cities, not just the ones I have chosen to list. This is an incomplete diagram, but it gets the idea across:
+-------------+------------------+--------+----------+
| Chino Hills | Huntington Beach | Irvine | Glendale |
+-------------+------------------+--------+----------+
| Row 1 | Row 1 | Row 2 | Row 3 |
| Row 2 | | | Row 5 |
| | | | Row 5 |
+-------------+------------------+--------+----------+
I have tried tidyr::separate_rows(dfl, col), but this only works if the cities are in one cell; however, they are in multiple cells in multiple rows. This is what happens when I try the tidyr::separate_rows(dfl, col):
Row September October November December January February March April May June July
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
As you can see, the only thing it does is add in another row of numbers which I do not need.
In summary, I need the Program R to find all of the cities and tell me what row they are in. The row may appear more than once if the city is in that row more than once. It will organize more than one column, not just the standard one column as used in tidyr. The number of columns will depend on the number of different cities.

We can get the data in long format, keep only unique values for each Row and value and get data in wide format. Assuming df is the dataframe name.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -Row, values_drop_na = TRUE) %>%
distinct(Row, value) %>%
group_by(value) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = value, values_from = Row)

Related

data wrangling in R with names_pattern for pivoting on ENDING pattern?

I have a dataset with a column, CatSex, that's got data in it in a form similar to "American.Indian.or.Alaska.Native.men"--the characters after the last period, I want to turn into a new pivoted column, so I have two columns, one called Cat with only the demographic info in it, and one called Sex with the sex in it. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it's better than Tableau Prep with large data sets, it seems. What I ultimately want is to pivot the data so that I have two distinct columns for the different categories here. I used this code to get part of the way there (the original data held like 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to do this with the pattern I'm now left with in the column CatSex:
pivot_longer(
cols = -c(UnitID, Institution.Name),
names_to = c("CatSex", "Disc"),
names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
values_to = "Count",
values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code--I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Using extract from tidyr package (it is in tidyverse)
Capture 2 groups with ()
Define second group to have one or more characters that are not . up to the end $
library(dplyr)
library(tidyr)
df %>%
extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University lack.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication journalism..and.related.programs
2 Communication journalism and.related.programs
3 Communication journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication .journalism..and.related.program
7 Computer.and.information.sciences.and.support.serv
8 computer.and.information.sciences.and.support.servi
9 Computer.and.information.sciences.and.support.servi
10 Computer.and.information.sciences.and.support.servi
11 Computer.and.information.sciences.and.support.servi
12 Computer.and.information.sciences.and.support.servi.
13 Computer.and.information.sciences.and.support.serv
14 Computer.and.information.sciences.and.support.servi.
15 Computer.and.information.sciences.and.support.servi
pivot_longer is not the right function in this context.
Here are few options -
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
#. Cat Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
Using stringr functions
library(dplyr)
library(stringr)
df %>%
mutate(Sex = str_extract(CatSex, 'men|women'),
Cat = str_remove(CatSex, '\\.(men|women)'))
In base R
transform(df, Sex = sub('.*\\.(men|women)', '\\1', CatSex),
Cat = sub('\\.(men|women)', '', CatSex))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
"White.non.Hispanic.men", "White.non.Hispanic.women",
"Black.non.Hispanic.men", "Black.non.Hispanic.women"))

Find city names within affiliations and add them with their corresponding countries in new columns of a dataframe

I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?
NB: the goal is to add full city names but not part of them. Below in row 7, an example of what is not wanted: the AGRI city of TURKEY is inappropriately associated with BERLIN because this row includes the 'AGRICULTURE' word.
Is there a simple way to do that, ideally using dplyr?
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
A combination of str_extract and either a join or another str_extract is one option to get you there.
str_extract will get the first value it encounters, with a paste0 to collapse the cities into a long or string to check against.
library(dplyr)
library(stringr)
dfa %>%
mutate(city = str_extract(dfa$affiliation, paste0("\\b", dfb$city, collapse = "\\b|"))) %>%
left_join(dfb, by = "city")
Edit: added word bounderies in the paste0 so that only whole city names are matched and partial matching is avoided.
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
This approach accounts for the possibility that an affiliation could match more than one city name.
library(tidyverse)
dfa %>%
mutate(city = map(affiliation, ~ str_extract(.x, dfb$city))) %>%
unnest(cols = c(city)) %>%
group_by(affiliation) %>%
mutate(nmatches = sum(!is.na(city))) %>%
filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
ungroup() %>%
left_join(dfb, by = "city") %>%
mutate(country_match = str_detect(affiliation, country))
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You could then double-check cases with 1 nmatches but country_match == F, and when there are 2 or more nmatches you can keep the one with country_match == T.

Selecting a column with a dot in R (nested object)

I'm new to R and I'm not sure how to rephrase the question, but basically, I have this dataset coming from the following code:
data_url <- 'https://prod-scores-api.ausopen.com/year/2021/stats'
dat <- jsonlite::fromJSON(data_url)
men_aces <- bind_rows(dat$statistics$rankings[[1]]$players[1])
men_aces_table <- dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>% select(full_name, nationality)
Which resulted in this data frame:
full_name nationality.uuid nationality.name nationality.code
1 Novak Djokovic 99da9b29-eade-4ac3-a7b0-b0b8c2192df7 Serbia SRB
2 Alexander Zverev 99d83e85-3173-4ccc-9d91-8368720f4a47 Germany GER
3 Milos Raonic 07779acb-6740-4b26-a664-f01c0b54b390 Canada CAN
4 Daniil Medvedev fa925d2d-337f-4074-a0bd-afddb38d66e1 Russia RUS
5 Nick Kyrgios 9b11f78c-47c1-43c4-97d0-ba3381eb9f07 Australia AUS
nationality is the nested object inside the player object if you check the JSON url, it contains the above properties (uuid, name, code), if I select the full_name property I would get the value (which is of type character) right back.
I'm not sure how to select the name and from that data frame (nationality) and rename it to country.
My expected outcome is:
full_name country
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
I would appreciate some help. Sorry I was unclear.
Use purrr::pmap_chr
library(tidyverse)
dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
select(full_name, nationality) %>%
mutate(nationality = pmap_chr(nationality, ~ ..2))
full_name nationality
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
11 Aslan Karatsev Russia
12 Taylor Fritz United States of America
13 Matteo Berrettini Italy
14 Grigor Dimitrov Bulgaria
15 Feliciano Lopez Spain
16 Stefanos Tsitsipas Greece
17 Felix Auger-Aliassime Canada
18 Thanasi Kokkinakis Australia
19 Ugo Humbert France
20 Borna Coric Croatia
You could do:
bind_cols(full_name = dat$players$full_name, country = dat$players$nationality$name)
# A tibble: 169 x 2
full_name country
<chr> <chr>
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
just add this line at the end
newdf <- data.frame(full_name = men_aces_table$full_name, country = men_aces_table$nationality$name)

Interpolate based on multiple conditions in r

Beginner r user here. I have a dataset of yearly employment numbers for different industry classifications and different subregions. For some observations, the number of employees is null. I would like to fill these values through linear interpolation (using na.approx or some other method). However, I only want to interpolate within the same industry classification and subregion.
For example, I have this:
subregion <- c("East Bay", "East Bay", "East Bay", "East Bay", "East Bay", "South Bay")
industry <-c("A","A","A","A","A","B" )
year <- c(2013, 2014, 2015, 2016, 2017, 2002)
emp <- c(50, NA, NA, 80,NA, 300)
data <- data.frame(cbind(subregion,industry,year, emp))
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 <NA>
3 East Bay A 2015 <NA>
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
I need to generate this table, skipping interpolating the fifth observation because subregion and industry do not match the previous observation.
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 60
3 East Bay A 2015 70
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
Articles like this have been helpful, but I cannot figure out how to adapt the solution to match the requirement that two columns be the same for interpolation to occur, instead of one. Any help would be appreciated.
We could do a group by na.approx (from zoo)
library(tidyverse)
data %>%
group_by(subregion, industry) %>%
mutate(emp = zoo::na.approx(emp, na.rm = FALSE))
# A tibble: 6 x 4
# Groups: subregion, industry [2]
# subregion industry year emp
# <fct> <fct> <dbl> <dbl>
#1 East Bay A 2013 50
#2 East Bay A 2014 60
#3 East Bay A 2015 70
#4 East Bay A 2016 80
#5 East Bay A 2017 NA
#6 South Bay B 2002 300
data
data <- data.frame(subregion,industry,year, emp)

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medal, based on the Team - while excluding NA as an variable. But at the same time, the total frequency of each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

Resources