I have a set of data where the columns I'm trying to pivot vertically are stored as such:
testdata <- structure(list(id = c(723L, 621L, NA, NA, NA, NA, NA, NA, NA),
fullName = c("Will Smith", "Chris Rock", "", "", "", "",
"", "", ""), latestPosts.0.locationId = c(212928653L, 34505L,
NA, NA, NA, NA, NA, NA, NA), latestPosts.0.locationName = c("Miami",
"Atlanta", "", "", "", "", "", "", ""), latestPosts.1.locationId = c(1040683L,
20326736L, NA, NA, NA, NA, NA, NA, NA), latestPosts.1.locationName = c("New York",
"London", "", "", "", "", "", "", ""), latestPosts.2.locationId = c(NA,
215307317L, NA, NA, NA, NA, NA, NA, NA), latestPosts.2.locationName = c("",
"Paris", "", "", "", "", "", "", ""), latestPosts.3.locationId = c(1147378L,
34505L, NA, NA, NA, NA, NA, NA, NA), latestPosts.3.locationName = c("Seattle",
"Atlanta", "", "", "", "", "", "", ""), latestPosts.4.locationId = c(1147378L,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.4.locationName = c("Seattle",
"", "", "", "", "", "", "", ""), latestPosts.5.locationId = c(238334931,
9432076525, NA, NA, NA, NA, NA, NA, NA), latestPosts.5.locationName = c("San Francisco",
"Brooklyn", "", "", "", "", "", "", ""), latestPosts.6.locationId = c(881699386L,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.6.locationName = c("San Diego",
"", "", "", "", "", "", "", ""), latestPosts.7.locationId = c(NA,
234986797L, NA, NA, NA, NA, NA, NA, NA), latestPosts.8.locationId = c(1147378,
9021444765, NA, NA, NA, NA, NA, NA, NA), latestPosts.8.locationName = c("Seattle",
"Cleveland", "", "", "", "", "", "", ""), latestPosts.9.locationId = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.9.locationName = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA), latestPosts.10.locationId = c(408631288L,
234986797L, NA, NA, NA, NA, NA, NA, NA), latestPosts.10.locationName = c("Portland",
"Orlando", "", "", "", "", "", "", ""), latestPosts.11.locationId = c(52043757619,
34505, NA, NA, NA, NA, NA, NA, NA), latestPosts.11.locationName = c("Nashville",
"Atlanta", "", "", "", "", "", "", "")), class = "data.frame", row.names = c(NA,
-9L))
I want to pivot the data so that every latestPosts.n pair (n is a placeholder for the number in the column name) becomes its own row whenever its latestPosts.n.locationId or latestPosts.n.locationName is neither blank nor NA. The final output should look like this:
testdata_exp <- structure(list(id = c(723L, 724L, 725L, 726L, 727L, 728L, 729L,
730L, 731L, 621L, 622L, 623L, 624L, 625L, 626L, 627L, 628L, 629L
), fullName = c("Will Smith", "Will Smith", "Will Smith", "Will Smith",
"Will Smith", "Will Smith", "Will Smith", "Will Smith", "Will Smith",
"Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock",
"Chris Rock", "Chris Rock", "Chris Rock", "Chris Rock"), locationId = c(212928653,
1040683, 1147378, 1147378, 238334931, 881699386, 1147378, 408631288,
52043757619, 34505, 20326736, 215307317, 34505, 9432076525, 234986797,
9021444765, 234986797, 34505), locationName = c("Miami Beach, Florida",
"Starbucks", "University of Evansville", "University of Evansville",
"Downtown Evansville", "Garden Of The Gods", "University of Evansville",
"Phi Gamma Delta - Epsilon Iota", "Nashville Pride", "University of the South",
"Riverview Camp For Girls", "Chattanooga, Tennessee", "University of the South",
"Grand Sirenis Riviera Maya Resort", "", "Sleepyhead Coffee",
"Sewanee, Tennessee", "University of the South")), class = "data.frame", row.names = c(NA,
-18L))
A couple things to keep in mind:
The number of latestPosts.n.locationId OR latestPosts.n.locationName might change from dataset to dataset, so it's best to account for not knowing how many there will be. This example goes up to 11, but other times it might be more or less.
A locationId column does not always have a matching locationName column. In this data, for example, there is a latestPosts.7.locationId field but no corresponding latestPosts.7.locationName field.
Here's another variation (also using pivot_longer)
library(dplyr)
library(tidyr)
testdata %>%
pivot_longer(-c(id, fullName),
names_to = c("n", ".value"),
names_pattern = "latestPosts\\.([0-9]+)\\.(.+)") %>%
select(-n) %>%
filter(!((is.na(locationId) | locationId == '') & (is.na(locationName) | locationName == '')))
#> # A tibble: 18 × 4
#> id fullName locationId locationName
#> <int> <chr> <dbl> <chr>
#> 1 723 Will Smith 212928653 Miami
#> 2 723 Will Smith 1040683 New York
#> 3 723 Will Smith 1147378 Seattle
#> 4 723 Will Smith 1147378 Seattle
#> 5 723 Will Smith 238334931 San Francisco
#> 6 723 Will Smith 881699386 San Diego
#> 7 723 Will Smith 1147378 Seattle
#> 8 723 Will Smith 408631288 Portland
#> 9 723 Will Smith 52043757619 Nashville
#> 10 621 Chris Rock 34505 Atlanta
#> 11 621 Chris Rock 20326736 London
#> 12 621 Chris Rock 215307317 Paris
#> 13 621 Chris Rock 34505 Atlanta
#> 14 621 Chris Rock 9432076525 Brooklyn
#> 15 621 Chris Rock 234986797 <NA>
#> 16 621 Chris Rock 9021444765 Cleveland
#> 17 621 Chris Rock 234986797 Orlando
#> 18 621 Chris Rock 34505 Atlanta
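To see why a missing column such as latestPosts.7.locationName does not break this approach, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not taken from the question). The ".value" entry in names_to tells pivot_longer to use the second captured group as the name of the output column, so a post index that lacks one half of the pair simply gets NA there.

```r
library(tidyr)

# Hypothetical toy data: post 1 has a locationId but no locationName column
toy <- data.frame(
  id = 1L,
  latestPosts.0.locationId = 10,
  latestPosts.0.locationName = "Miami",
  latestPosts.1.locationId = 20,
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  toy,
  cols = -id,
  # first capture group -> column "n", second -> name of the value column
  names_to = c("n", ".value"),
  names_pattern = "latestPosts\\.([0-9]+)\\.(.+)"
)

long
```

Here long has one row per post index, and the row for post 1 carries NA in locationName. This is also why the filter step has to check both columns before dropping a row.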
You could do
library(tidyverse)
testdata %>%
rename_all(~sub("latestPosts\\.", "", .x)) %>%
mutate(across(contains("location"), as.character)) %>%
mutate(rownum = row_number()) %>%
pivot_longer(contains("location")) %>%
separate(name, into = c("group", "var")) %>%
group_by(id, fullName, group, rownum) %>%
summarise(var = c("locationId", "locationName"),
value = if(n() == 1) c(value, NA) else value, .groups = "drop") %>%
pivot_wider(names_from = var, values_from = value) %>%
select(id, fullName, locationId, locationName) %>%
filter((!is.na(locationName) & nzchar(locationName)) | !is.na(locationId)) %>%
mutate(locationId = as.numeric(locationId))
#> # A tibble: 18 x 4
#> id fullName locationId locationName
#> <int> <chr> <dbl> <chr>
#> 1 621 Chris Rock 34505 Atlanta
#> 2 621 Chris Rock 20326736 London
#> 3 621 Chris Rock 234986797 Orlando
#> 4 621 Chris Rock 34505 Atlanta
#> 5 621 Chris Rock 215307317 Paris
#> 6 621 Chris Rock 34505 Atlanta
#> 7 621 Chris Rock 9432076525 Brooklyn
#> 8 621 Chris Rock 234986797 NA
#> 9 621 Chris Rock 9021444765 Cleveland
#> 10 723 Will Smith 212928653 Miami
#> 11 723 Will Smith 1040683 New York
#> 12 723 Will Smith 408631288 Portland
#> 13 723 Will Smith 52043757619 Nashville
#> 14 723 Will Smith 1147378 Seattle
#> 15 723 Will Smith 1147378 Seattle
#> 16 723 Will Smith 238334931 San Francisco
#> 17 723 Will Smith 881699386 San Diego
#> 18 723 Will Smith 1147378 Seattle
Related
I have data that is structured like this:
actor_data <- structure(list(id = c(123L, 456L, 789L, 912L, 235L), name = c("Tom Cruise",
"Will Smith", "Ryan Reynolds", "Chris Rock", "Emma Stone"), locationid1 = c(5459L,
NA, 6114L, NA, NA), location1 = c("Paris, France", "", "Brooklyn, NY",
"", ""), locationid2 = c(NA, 5778L, NA, NA, 4432L), location3 = c("",
"Dolby Theater", "", "", "Hollywood"), locationid3 = c(NA, 2526L,
3101L, NA, NA), location3.1 = c("", "London", "Boston", "", ""
), locationid4 = c(6667L, 2333L, 1118L, NA, NA), location4 = c("Virginia",
"Maryland", "Washington", "", "")), class = "data.frame", row.names = c(NA,
-5L))
I am trying to make the location data run vertically instead of horizontally, while also making sure that blank fields are excluded.
So the final result will look like this:
actor_data_exp <- structure(list(id = c(123L, 123L, 456L, 456L, 456L, 789L, 789L,
789L, 235L), name = c("Tom Cruise", "Tom Cruise", "Will Smith",
"Will Smith", "Will Smith", "Ryan Reynolds", "Ryan Reynolds",
"Ryan Reynolds", "Emma Stone"), locationid = c(5459L, 6667L,
5778L, 2526L, 2333L, 6114L, 3101L, 1118L, 4432L), location = c("Paris, France",
"Virginia", "Dolby Theater", "London", "Maryland", "Brooklyn, NY",
"Boston", "Washington", "Hollywood")), class = "data.frame", row.names = c(NA,
-9L))
I would rename the columns that start with "location" so that there is an underscore before the number in each column name. Then use pivot_longer with the underscore as names_sep and c(".value", "var") as the names_to argument, which ensures that location and locationid each get their own column. This also creates a redundant column var containing the numbers that were appended to the original column names.
Finally, filter out missing values and remove the redundant var column.
library(tidyverse)
actor_data %>%
rename_with(~ ifelse(grepl("location", .x),
sub("^(.*?)([0-9]\\.?\\d?)$", "\\1_\\2", .x), .x)) %>%
pivot_longer(starts_with("location"),
names_sep = "_", names_to = c(".value", "var")) %>%
filter(!is.na(locationid) & !is.na(location) & nzchar(location)) %>%
select(-var)
#> # A tibble: 6 x 4
#> id name locationid location
#> <int> <chr> <int> <chr>
#> 1 123 Tom Cruise 5459 Paris, France
#> 2 123 Tom Cruise 6667 Virginia
#> 3 456 Will Smith 2526 Dolby Theater
#> 4 456 Will Smith 2333 Maryland
#> 5 789 Ryan Reynolds 6114 Brooklyn, NY
#> 6 789 Ryan Reynolds 1118 Washington
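The output above has 6 rows, while actor_data_exp in the question has 9. The difference comes from the sample data itself: its columns are named location3 (after locationid2) and location3.1 (after locationid3), so some ids and names end up in different groups. If one is willing to assume that the location columns always come in adjacent (locationid, location) pairs, a possible workaround is to rename them by position first. This is only a sketch under that assumption; pair_idx and kind are hypothetical helper names.

```r
library(dplyr)
library(tidyr)

actor_data <- structure(list(id = c(123L, 456L, 789L, 912L, 235L),
  name = c("Tom Cruise", "Will Smith", "Ryan Reynolds", "Chris Rock", "Emma Stone"),
  locationid1 = c(5459L, NA, 6114L, NA, NA),
  location1 = c("Paris, France", "", "Brooklyn, NY", "", ""),
  locationid2 = c(NA, 5778L, NA, NA, 4432L),
  location3 = c("", "Dolby Theater", "", "", "Hollywood"),
  locationid3 = c(NA, 2526L, 3101L, NA, NA),
  location3.1 = c("", "London", "Boston", "", ""),
  locationid4 = c(6667L, 2333L, 1118L, NA, NA),
  location4 = c("Virginia", "Maryland", "Washington", "", "")),
  class = "data.frame", row.names = c(NA, -5L))

# ASSUMPTION: columns 3 onward come in adjacent (locationid, location) pairs
loc_cols <- names(actor_data)[-(1:2)]
pair_idx <- rep(seq_len(length(loc_cols) / 2), each = 2)
kind <- ifelse(grepl("^locationid", loc_cols), "locationid", "location")
names(actor_data)[-(1:2)] <- paste(kind, pair_idx, sep = "_")

result <- actor_data %>%
  pivot_longer(starts_with("location"),
               names_sep = "_", names_to = c(".value", "pair")) %>%
  # keep only rows where both the id and a non-empty name are present
  filter(!is.na(locationid), !is.na(location), nzchar(location)) %>%
  select(-pair)

result
```

With the positional renaming, each locationid is paired with the name column that physically follows it, regardless of the number in the original column name.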
I can't find this anywhere: I have the data.frame below and need it to look like the second data.frame, but I am struggling with the first row. Any ideas? (In the original .csv I have 18 variables with 28 observations.)
Here is a data.frame example of what I have:
# Have this
cnames_have <- data.frame(names = c(NA, "Name", "BMC", "MFH", "MCHHS", "CIH"), Official.use.only = c( NA, "Last Updated", "2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"), X = c("Adult Unit", "Staffed", 8, NA, 0, 62), X1 = c(NA, "Current Available", 3, NA, 0,13), X2 = c("Pediatric Unit", "Staffed", 8, NA, 0, 62), X3 = c(NA, "Current Available", 3, NA, 0,13))
Here is an example of what I need:
# Need this
cnames <- data.frame(names = c("BMC", "MFH", "MCHHS", "CIH", "BMC", "MFH", "MCHHS", "CIH"), Last_Updated = c("2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"), beds = c("Adult Unit", "Adult Unit", "Adult Unit", "Adult Unit", "Pediatric Unit", "Pediatric Unit", "Pediatric Unit", "Pediatric Unit"), Staffed = c(8, NA, 0, 62, 8, NA, 0, 62), Current_Available = c(3, NA, 0,13, 3, NA, 0,13))
I have tried transpose, melt, dcast, gather, etc. Here is as far as I was able to get but then I couldn't think of where to go from there or if I just code-blocked myself.
df <- as.data.frame(t(cnames_have)) # from the "Have this" cnames_have data frame
df <- df %>% rename(col_1 = "2") %>% fill(col_1)
Any help would be awesome as I need to figure this out so I can include visuals in my situational reports. Thank you in advance!
A dplyr and tidyr solution that pivots long and then back to wide:
library(dplyr)
cnames_have <- data.frame(names = c(NA, "Name", "BMC", "MFH", "MCHHS", "CIH"), Official.use.only = c( NA, "Last Updated", "2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"), X = c("Adult Unit", "Staffed", 8, NA, 0, 62), X1 = c(NA, "Current Available", 3, NA, 0,13), X2 = c("Pediatric Unit", "Staffed", 8, NA, 0, 62), X3 = c(NA, "Current Available", 3, NA, 0,13))
cnames_have %>% rename(Last_Updated = Official.use.only,
`Adult Staffed` = X,
`Adult Available` = X1,
`Pediatric Staffed` = X2,
`Pediatric Available` = X3) %>%
slice(-1:-2) %>%
tidyr::pivot_longer(`Adult Staffed`:`Pediatric Available`) %>%
tidyr::separate(., name, into = c("beds", "type")) %>%
tidyr::pivot_wider(names_from = type) %>%
arrange(beds)
#> # A tibble: 8 x 5
#> names Last_Updated beds Staffed Available
#> <chr> <chr> <chr> <chr> <chr>
#> 1 BMC 2020-11-10 Adult 8 3
#> 2 MFH 2020-10-10 Adult <NA> <NA>
#> 3 MCHHS 2020-11-10 Adult 0 0
#> 4 CIH 2020-11-09 Adult 62 13
#> 5 BMC 2020-11-10 Pediatric 8 3
#> 6 MFH 2020-10-10 Pediatric <NA> <NA>
#> 7 MCHHS 2020-11-10 Pediatric 0 0
#> 8 CIH 2020-11-09 Pediatric 62 13
This seems to work but is not as elegant as I would like; I'm sure there is a way to do this using pivot_longer. The data wrangling block of code could be put into a function.
library(dplyr)
library(tidyr)
library(purrr)
# Extract unique names from data
vec_names <- cnames_have[2, 1:4]
au <-
cnames_have %>%
select(names:X1) %>%
set_names(vec_names) %>%
mutate(beds = "Adult Unit") %>%
filter(row_number() > 2)
cnames <-
cnames_have %>%
select(names, Official.use.only, X2, X3) %>%
set_names(vec_names) %>%
mutate(beds = "Pediatric Unit") %>%
filter(row_number() > 2) %>%
bind_rows(au)
cnames
#> Name Last Updated Staffed Current Available beds
#> 1 BMC 2020-11-10 8 3 Pediatric Unit
#> 2 MFH 2020-10-10 <NA> <NA> Pediatric Unit
#> 3 MCHHS 2020-11-10 0 0 Pediatric Unit
#> 4 CIH 2020-11-09 62 13 Pediatric Unit
#> 5 BMC 2020-11-10 8 3 Adult Unit
#> 6 MFH 2020-10-10 <NA> <NA> Adult Unit
#> 7 MCHHS 2020-11-10 0 0 Adult Unit
#> 8 CIH 2020-11-09 62 13 Adult Unit
data
cnames_have <-
data.frame(names = c(NA, "Name", "BMC", "MFH", "MCHHS", "CIH"),
Official.use.only = c( NA, "Last Updated", "2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"),
X = c("Adult Unit", "Staffed", 8, NA, 0, 62),
X1 = c(NA, "Current Available", 3, NA, 0,13),
X2 = c("Pediatric Unit", "Staffed", 8, NA, 0, 62),
X3 = c(NA, "Current Available", 3, NA, 0,13))
Created on 2020-11-11 by the reprex package (v0.3.0)
cnames_have <- data.frame(names = c(NA, "Name", "BMC", "MFH", "MCHHS", "CIH"), Official.use.only = c( NA, "Last Updated", "2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"), X = c("Adult Unit", "Staffed", 8, NA, 0, 62), X1 = c(NA, "Current Available", 3, NA, 0,13), X2 = c("Pediatric Unit", "Staffed", 8, NA, 0, 62), X3 = c(NA, "Current Available", 3, NA, 0,13))
colnames(cnames_have) <- c("Name", "Last Updated", "Adult Unit", "Adult Unit Current Available", "Pediatric Unit", "Pediatric Unit Current Available")
cnames_have <- cnames_have[-1, ]
cnames_have <- cnames_have[-1, ]
cnames_have <- cnames_have[, c(1, 2, 3, 5, 4, 6)]
library(tidyr)
cnames_have <- gather(cnames_have, 'Unit', 'Staffed', 3:4)
cnames_have <- cnames_have[, -3]
colnames(cnames_have) <- c("names", "Last_Updated", "Current_Available", "beds", "Staffed")
cnames_have <- cnames_have[, c(1, 2, 4, 5, 3)]
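The answers above hard-code the new column names, which becomes tedious with the 18 variables mentioned in the question. As a hedged alternative, the sketch below builds the names from the two header rows themselves: the unit label in row 1 is carried forward over the NA that follows it, pasted to the sub-header in row 2, and pivot_longer then splits the combined name again. The "__" separator and the names top, sub, body and tidy are assumptions made for this illustration.

```r
library(tidyr)

cnames_have <- data.frame(
  names = c(NA, "Name", "BMC", "MFH", "MCHHS", "CIH"),
  Official.use.only = c(NA, "Last Updated", "2020-11-10", "2020-10-10", "2020-11-10", "2020-11-09"),
  X = c("Adult Unit", "Staffed", 8, NA, 0, 62),
  X1 = c(NA, "Current Available", 3, NA, 0, 13),
  X2 = c("Pediatric Unit", "Staffed", 8, NA, 0, 62),
  X3 = c(NA, "Current Available", 3, NA, 0, 13),
  stringsAsFactors = FALSE
)

top <- unlist(cnames_have[1, ], use.names = FALSE)  # unit labels, with gaps
sub <- unlist(cnames_have[2, ], use.names = FALSE)  # per-column sub-headers
sub <- gsub(" ", "_", sub)

# Carry each unit label forward over the NA cells to its right
for (i in seq_along(top)[-1]) {
  if (is.na(top[i])) top[i] <- top[i - 1]
}

# Columns without a unit label keep the plain sub-header as their name
new_names <- ifelse(is.na(top), sub, paste(top, sub, sep = "__"))

body <- cnames_have[-(1:2), ]   # drop the two header rows
names(body) <- new_names

tidy <- pivot_longer(
  body,
  cols = contains("__"),
  names_to = c("beds", ".value"),
  names_sep = "__"
)

tidy
```

All values remain character, as in the other answers, so Staffed and Current_Available would still need as.numeric() before plotting.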
I am cleaning my data; its dput output is as follows.
DF <- structure(list(toberevised = c("[Money amounts are in thousands of dollars]",
NA, NA, NA, "Item", NA, NA, NA, NA, "Number of returns", "Number of joint returns",
"Number with paid preparer's signature", "Number of exemptions",
"Adjusted gross income (AGI) [3]", "Salaries and wages in AGI: [4] Number",
"Salaries and wages in AGI: Amount", "Taxable interest: Number",
"Taxable interest: Amount", "Ordinary dividends: Number", "Ordinary dividends: Amount"
), ...2 = c("UNITED STATES [2]", NA, NA, NA, "All returns", NA,
NA, "1", NA, "135257620", "52607676", "80455243", "273738434",
"7364640131", "114060887", "5161583318", "59553985", "161324824",
"31158675", "164247298"), ...3 = c(NA, NA, NA, NA, "Under", "$50,000 [1]",
NA, "2", NA, "92150166", "20743943", "53622647", "159649737",
"1797097083", "75422766", "1541276272", "28527550", "39043002",
"13174923", "23867893"), ...4 = c(NA, NA, "Size of adjusted gross income",
NA, "50000", "under", "75000", "3", NA, "18221115", "11329459",
"11025624", "44189517", "1119634632", "16299827", "896339313",
"10891905", "16353293", "5255958", "12810282"), ...5 = c(NA,
NA, NA, NA, "75000", "under", "100000", "4", NA, "10499106",
"8296546", "6260725", "28555195", "905336768", "9520214", "721137490",
"7636612", "12852148", "4095938", "11524298"), ...6 = c(NA, NA,
NA, NA, "100000", "under", "200000", "5", NA, "10797979", "9193700",
"6678965", "30919226", "1429575727", "9782173", "1083175205",
"9092673", "23160862", "5824522", "25842394"), ...7 = c(NA, NA,
NA, NA, "200000", "or more", NA, "6", NA, "3589254", "3044028",
"2867282", "10424759", "2112995921", "3035907", "919655038",
"3405245", "69915518", "2807334", "90202431")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
In the first and third rows, I would like to use something like na.locf from zoo, but across the columns rather than down the rows, so that the result is equivalent to:
DF[1,3:7] <- "UNITED STATES [2]"
DF[3,5:7] <- "Size of adjusted gross income"
Apply na.locf row-wise:
DF[] <- t(apply(DF, 1, zoo::na.locf, na.rm = FALSE))
DF
# A tibble: 20 x 7
# toberevised ...2 ...3 ...4 ...5 ...6 ...7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 [Money amounts are in th… UNITED ST… UNITED ST… UNITED STATES … UNITED STATES … UNITED STATES … UNITED STATES…
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA Size of adjust… Size of adjust… Size of adjust… Size of adjus…
# 4 NA NA NA NA NA NA NA
# 5 Item All retur… Under 50000 75000 100000 200000
# 6 NA NA $50,000 [… under under under or more
# 7 NA NA NA 75000 100000 200000 200000
# 8 NA 1 2 3 4 5 6
# 9 NA NA NA NA NA NA NA
#10 Number of returns 135257620 92150166 18221115 10499106 10797979 3589254
#11 Number of joint returns 52607676 20743943 11329459 8296546 9193700 3044028
#12 Number with paid prepare… 80455243 53622647 11025624 6260725 6678965 2867282
#13 Number of exemptions 273738434 159649737 44189517 28555195 30919226 10424759
#14 Adjusted gross income (A… 7364640131 1797097083 1119634632 905336768 1429575727 2112995921
#15 Salaries and wages in AG… 114060887 75422766 16299827 9520214 9782173 3035907
#16 Salaries and wages in AG… 5161583318 1541276272 896339313 721137490 1083175205 919655038
#17 Taxable interest: Number 59553985 28527550 10891905 7636612 9092673 3405245
#18 Taxable interest: Amount 161324824 39043002 16353293 12852148 23160862 69915518
#19 Ordinary dividends: Num… 31158675 13174923 5255958 4095938 5824522 2807334
#20 Ordinary dividends: Amou… 164247298 23867893 12810282 11524298 25842394 90202431
As suggested by @G. Grothendieck, na.locf0 is a better candidate here, because it keeps leading NAs without needing na.rm = FALSE.
DF[] <- t(apply(DF, 1, zoo::na.locf0))
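For readers who want to see what the row-wise fill does without installing zoo, here is a small base R sketch. locf_vec is a hypothetical hand-rolled helper, not zoo's implementation; it carries the last non-NA value forward and leaves leading NAs in place, so the vector length never changes. The toy data frame is made up for the illustration.

```r
# Hypothetical helper: last observation carried forward, leading NAs kept
locf_vec <- function(x) {
  idx <- cumsum(!is.na(x))        # index of the most recent non-NA value (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 holds the NA used for leading gaps
}

locf_vec(c(NA, "a", NA, "b", NA))

# Row-wise use on a made-up data frame, mirroring the apply() call above
toy <- data.frame(
  a = c("x", NA),
  b = c(NA_character_, NA_character_),
  c = c(NA, "y"),
  stringsAsFactors = FALSE
)
# apply() returns one column per input row, hence the transpose
toy[] <- t(apply(toy, 1, locf_vec))
toy
```

Because the helper always returns a vector of the same length as its input, apply() yields a matrix that can be assigned straight back into the data frame.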
I have a dataset that looks like this:
Code used to create the Starting dataset:
dataset<-data.frame(Attorney=c("John Doe", "Client #1","274", "296",
"297", "Client #2", "633", "Jane Doe",
"Client #1", "309", "323"),
Date=c(NA, NA, "2019/4/4", "2019/4/4", "2019/4/12",
NA, " 2019/2/3", NA, NA, "2019/12/1", "2019/12/4"),
Code=c(NA, NA, "7NP/7NP", "1UE/1UE", "2C1/2C1",NA,
"7NP/7NP", NA, NA, "7NP/7NP", "7FU/7FU"),
Billed_Amount=c(NA, NA, 1200.00, 4000.00, 2775.00,
NA, 1200.00, NA, NA, 1200.00, 385),
Amount= c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"),
Current =c(NA, NA, 0, 0, 0, NA, 0, NA, NA, 0, 0),
X.120=c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"))
My goal is to end up with a dataset that looks like:
Code used to create Goal dataset:
dataset<-data.frame(Attorney=c("John Doe", "John Doe", "John Doe",
"John Doe", "Jane Doe", "Jane Doe"),
Date=c("2019/4/4", "2019/4/4", "2019/4/12", " 2019/2/3",
"2019/12/1","2019/12/4" ),
Code=c("7NP/7NP", "1UE/1UE","2C1/2C1", "7NP/7NP",
"7NP/7NP", "7FU/7FU"),
Billed_Amount=c(1200.00, 4000.00,2775.00, 1200.00,
1200.00, 385),
Amount= c(1200, 4000, 2775, 1200,1200, 385),
Current= c(0, 0, 0, 0, 0, 0),
X.120=c(1200, 4000, 2775,1200, 1200, 385))
I want to rename the rows underneath each attorney with the attorney's name while not worrying about preserving the client's name. My original dataset has a number of attorneys and they have a varying number of clients and those clients have a various number of codes, dates, and amounts associated with them.
I tried to use an if/else statement but encountered an error message.
I appreciate any help you can give me. Thanks!
Edit: I have edited my question to include hypothetical attorney names.
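One option is to create a grouping variable based on the presence of the 'Attorney' substring in the 'Attorney' column, then mutate the 'Attorney' column to the first element of each group (after grouping by 'grp'), and finally filter out the rows containing NA. Note that this assumes the attorney rows start with the literal text "Attorney" (e.g. "Attorney #1"), as in the output shown below.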
library(dplyr)
library(stringr)
dataset %>%
group_by(grp = cumsum(str_detect(Attorney, "^Attorney"))) %>%
mutate(Attorney = first(Attorney)) %>%
filter_at(vars(Date:X.120), all_vars(!is.na(.))) %>%
ungroup %>%
select(-grp)
We can also use na.omit here
dataset %>%
group_by(grp = cumsum(str_detect(Attorney, "^Attorney"))) %>%
mutate(Attorney = first(Attorney)) %>%
ungroup %>%
select(-grp) %>%
na.omit
# A tibble: 6 x 7
# Attorney Date Code Billed_Amount Amount Current X.120
# <fct> <fct> <fct> <dbl> <fct> <dbl> <fct>
#1 Attorney #1 "2019/4/4" 7NP/7NP 1200 1200 0 1200
#2 Attorney #1 "2019/4/4" 1UE/1UE 4000 4000 0 4000
#3 Attorney #1 "2019/4/12" 2C1/2C1 2775 2775 0 2775
#4 Attorney #1 " 2019/2/3" 7NP/7NP 1200 1200 0 1200
#5 Attorney #2 "2019/12/1" 7NP/7NP 1200 1200 0 1200
#6 Attorney #2 "2019/12/4" 7FU/7FU 385 385 0 385
Another option is to replace the elements of 'Attorney' that do not contain the 'Attorney' substring with NA, fill the column so that each NA takes the previous non-NA value, and then apply na.omit.
library(tidyr)
dataset %>%
mutate(Attorney = replace(Attorney, !str_detect(Attorney, "Attorney"), NA)) %>%
fill(Attorney) %>%
na.omit
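Both pipelines above detect attorney rows through the string "Attorney". With the data as posted in the edited question, where those rows hold actual names, a possible adaptation is to identify attorney rows structurally instead. The sketch below assumes that an attorney row has NA in Date and does not start with "Client"; that rule and the is_attorney column are assumptions for this illustration, not something stated in the question.

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(
  Attorney = c("John Doe", "Client #1", "274", "296", "297", "Client #2",
               "633", "Jane Doe", "Client #1", "309", "323"),
  Date = c(NA, NA, "2019/4/4", "2019/4/4", "2019/4/12",
           NA, " 2019/2/3", NA, NA, "2019/12/1", "2019/12/4"),
  Code = c(NA, NA, "7NP/7NP", "1UE/1UE", "2C1/2C1", NA,
           "7NP/7NP", NA, NA, "7NP/7NP", "7FU/7FU"),
  Billed_Amount = c(NA, NA, 1200, 4000, 2775, NA, 1200, NA, NA, 1200, 385),
  stringsAsFactors = FALSE
)

result <- dataset %>%
  mutate(
    # ASSUMPTION: attorney rows have no Date and are not "Client ..." rows
    is_attorney = is.na(Date) & !grepl("^Client", Attorney),
    Attorney = ifelse(is_attorney, Attorney, NA_character_)
  ) %>%
  fill(Attorney) %>%          # carry each attorney name down to its rows
  filter(!is.na(Date)) %>%    # drop the attorney and client header rows
  select(-is_attorney)

result
```

Only a subset of the question's columns is reproduced here to keep the example short; the remaining columns pass through the pipeline unchanged.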
Base R solution (using @akrun's logic):
data.frame(do.call("rbind",
lapply(split(dataset, cumsum(!(grepl("\\d+", dataset$Attorney)))),
function(x){
non_att_cols <- names(x)[names(x) != "Attorney"]
y <- data.frame(na.omit(x[,non_att_cols]))
y$Attorney <- x$Attorney[1]
return(y[,c("Attorney", non_att_cols)])
}
)
),
row.names = NULL
)
I have been unable to find an answer. There probably is one on stackoverflow... but I have not found one that I can use.
I have two data frames (db.1 and db.larger). What I need to do is:
if db.1$ID == db.larger$ID
db.1$Gender <- db.larger$Gender
That is, I need to copy the Gender value from db.larger to db.1 wherever the ID matches.
Both data frames have between 500,000 and six million rows.
db.1 contains duplicate IDs, because other columns not shown in this example hold unique and vital information that I must keep.
Both data frames contain more columns than shown.
The ID values are character strings, as they can contain leading zeros.
I have been unable to use match, as some persons occur more than once in db.1.
merge has not worked for me, as it adds more columns to the data frames than I want.
Here are the example output files:
db.1 <- structure(list(ID = c("453", "286", "345", "853", "675", "754","445", "564", "651", "685", "453", "286", "345"), Gender = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Name = c("Rashad Lawrence", "Ali Santana", "Cordell Cobb", "Amani Bennett", "Donavan Frank", "Jeffrey Michael", "Aliana Trujillo", "Cheyanne Wyatt", "Kayden Padilla", "Jasmine Glass", "Rashad Lawrence", "Ali Santana", "Cordell Cobb"), Score = c(0, 0.044, 0.822, 0.322, 0.394, 0.309, 0.826, 0.729, 0.318, 0.6, 0.648, 0.547, 0.53)), .Names = c("ID", "Gender","Name", "Score"), row.names = c(NA, -13L), class = "data.frame")
and
db.larger <- structure(list(ID = c("123", "158", "286", "345", "445", "453", "469", "546", "564", "566", "651", "675", "682", "685", "741", "754", "789", "852", "853", "963"), Gender = c(1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1), Name = c("Dexter Holmes", "Roman Macias", "Ali Santana", "Cordell Cobb", "Aliana Trujillo", "Rashad Lawrence", "Preston Mckee", "Kyra Howe", "Cheyanne Wyatt", "Tobias Hart", "Kayden Padilla", "Donavan Frank", "Jamie Yoder", "Jasmine Glass", "Jamar Carter", "Jeffrey Michael", "Erick Tate", "Darion Graves", "Amani Bennett", "Regina Sanders")), .Names = c("ID", "Gender", "Name"), row.names = c(NA, 20L), class = "data.frame")
Since you always have missing values in db.1$Gender, you can delete this column and then perform an inner_join from dplyr. This procedure keeps the duplicates in db.1.
library(dplyr)
db.1 <- db.1 %>%
select(-Gender)
db.combine <- inner_join(db.1,db.larger, by = "ID")
db.combine
ID Name.x Gender Name.y
1 453 Rashad Lawrence 1 Rashad Lawrence
2 286 Ali Santana 2 Ali Santana
3 345 Cordell Cobb 1 Cordell Cobb
4 853 Amani Bennett 1 Amani Bennett
5 675 Donavan Frank 2 Donavan Frank
6 754 Jeffrey Michael 2 Jeffrey Michael
7 445 Aliana Trujillo 1 Aliana Trujillo
8 564 Cheyanne Wyatt 2 Cheyanne Wyatt
9 651 Kayden Padilla 2 Kayden Padilla
10 685 Jasmine Glass 2 Jasmine Glass
11 453 Rashad Lawrence 1 Rashad Lawrence
12 286 Ali Santana 2 Ali Santana
13 345 Cordell Cobb 1 Cordell Cobb
Because Name exists in both data frames but is not used as a join key, it appears twice in the result (as Name.x and Name.y). You can simply drop either one using select.
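To avoid the extra columns altogether, one option is to pass only ID and Gender from db.larger into the join. A base R lookup with match also works even though db.1 has repeated IDs, as long as each ID occurs only once in db.larger. The sketch below uses a trimmed-down subset of the question's data; joined is a hypothetical name.

```r
library(dplyr)

# Subset of the question's data, kept small for illustration
db.1 <- data.frame(
  ID = c("453", "286", "345", "853", "453"),
  Gender = NA,
  Name = c("Rashad Lawrence", "Ali Santana", "Cordell Cobb",
           "Amani Bennett", "Rashad Lawrence"),
  Score = c(0, 0.044, 0.822, 0.322, 0.648),
  stringsAsFactors = FALSE
)
db.larger <- data.frame(
  ID = c("286", "345", "453", "853", "963"),
  Gender = c(2, 1, 1, 1, 1),
  Name = c("Ali Santana", "Cordell Cobb", "Rashad Lawrence",
           "Amani Bennett", "Regina Sanders"),
  stringsAsFactors = FALSE
)

# Option 1: join only the columns that are needed, so no Name.x / Name.y
joined <- db.1 %>%
  select(-Gender) %>%
  left_join(select(db.larger, ID, Gender), by = "ID")

# Option 2: base R lookup; match() returns the position of each db.1 ID
# in db.larger, so repeated IDs in db.1 are handled naturally
db.1$Gender <- db.larger$Gender[match(db.1$ID, db.larger$ID)]

joined
db.1
```

left_join keeps every row of db.1, including IDs that have no match in db.larger (their Gender stays NA), whereas inner_join would drop such rows.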