How to split cells into column headers and rows in R

I have a dataframe with three variables: ID, VAR1 and VAR2. VAR1 and VAR2 contain multiple values per cell that can be broken out into rows. I would like to turn the VAR1 values into column headers and link each VAR2 value to the corresponding VAR1 value. My data looks like this:
ID VAR1                             VAR2
1  Code Employee number Personal ID 132 12345 12452
2  Employee number Personal ID      32145 13452
3  Code Employee number             444 56743
4  Code Employee number Personal ID 546 89642 14667
I would like to obtain:
ID  Code  Employee number  Personal ID
1   132   12345            12452
2         32145            13452
3   444   56743
4   546   89642            14667

Here's a tidyverse approach.
First you need to collapse the multi-word labels in VAR1 into single tokens (e.g. "Personal ID" -> "PersonalID"): separate_rows() splits on whitespace, so the spaces inside the labels would otherwise break them apart, and R column names are easier to work with without spaces anyway.
# example dataset
df = data.frame(ID = 1:2,
                VAR1 = c("Code Employee number Personal ID", "Employee number Personal ID"),
                VAR2 = c("132 12345 12452", "32145 13452"))

library(tidyverse)
df %>%
  mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
         VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
  separate_rows(VAR1, VAR2) %>%
  spread(VAR1, VAR2)
#   ID Code EmployeeNumber PersonalID
# 1  1  132          12345      12452
# 2  2 <NA>          32145      13452
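Note that spread() is superseded in current tidyr; the same reshape works with pivot_wider(). A minimal sketch, assuming the df from above:
df %>%
  mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
         VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
  separate_rows(VAR1, VAR2) %>%
  pivot_wider(names_from = VAR1, values_from = VAR2)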

Left-join two data frames by one column. If no matches are returned, join by another column

I want to join one data frame to another, first by one column; then, for any rows that don't match, I want to try joining by another column. The problem is similar to this question, but I'm trying to get a slightly different output.
Here are my 'observations'
#my df
my_plants <- data.frame(scientific_name = c("Abelmoschus esculentus",
                                            "Abies balsamea",
                                            "Ammophila breviligulata",
                                            "Zigadenus glaucus"),
                        percent_cover = c(90, 80, 10, 60))
and here is the main list with some data that I want to extract for each of my observations. Obviously this is simplified.
#hypothetical database
plant_database <- data.frame(scientific_name = c("Abelmoschus esculentus",
                                                 "Abies balsamea",
                                                 "Ammophila breviligulata",
                                                 "Anticlea elegans"),
                             synonym = c(NA_character_,
                                         NA_character_,
                                         NA_character_,
                                         "Zigadenus glaucus"),
                             score = c(1, 1, 2, 6))
Here is a function to join my observations to the main list. Note: I'm using a left_join because I want to know which observations were not matched.
#joining function
joining_fun <- function(plants, database) {
  database_long <- database %>%
    dplyr::mutate(ID = row.names(.)) %>%
    tidyr::pivot_longer(., cols = c(scientific_name, synonym),
                        values_to = "scientific_name")
  join <- dplyr::left_join(plants, database_long, by = "scientific_name") %>%
    dplyr::select(-name)
  return(join)
}
Which gets me here:
joining_fun(my_plants, plant_database)
          scientific_name percent_cover score ID
1  Abelmoschus esculentus            90     1  1
2          Abies balsamea            80     1  2
3 Ammophila breviligulata            10     2  3
4       Zigadenus glaucus            60     6  4
but I want something like this:
scientific_name         synonym           percent_cover score ID
Abelmoschus esculentus  NA                           90     1  1
Abies balsamea          NA                           80     1  2
Ammophila breviligulata NA                           10     2  3
Anticlea elegans        Zigadenus glaucus            60     6  4
Thanks!
1. Use inner_join() to create a df of only the cases that match on scientific_name.
2. Use anti_join() to get the plants that don't match on scientific_name.
3. Do another inner_join() of database with these unmatched cases, using the key "synonym" = "scientific_name".
4. Do one more anti_join() to find the cases without a match in either column.
5. Finally, bind all the results together.
library(dplyr)

# add test case with no match in either column
my_plants <- add_row(
  my_plants,
  scientific_name = "Stackus overflovius",
  percent_cover = 0
)

joining_fun <- function(plants, database) {
  by_sci_name <- inner_join(plants, database, by = "scientific_name")
  no_sci_match <- anti_join(plants, database, by = "scientific_name")
  by_syn <- inner_join(database, no_sci_match, by = c("synonym" = "scientific_name"))
  no_match <- anti_join(no_sci_match, database, by = c("scientific_name" = "synonym"))
  bind_rows(by_syn, by_sci_name, no_match)
}
joining_fun(my_plants, plant_database)
          scientific_name           synonym score percent_cover
1        Anticlea elegans Zigadenus glaucus     6            60
2  Abelmoschus esculentus              <NA>     1            90
3          Abies balsamea              <NA>     1            80
4 Ammophila breviligulata              <NA>     2            10
5     Stackus overflovius              <NA>    NA             0
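If the order of observations matters downstream, you can restore it after the join. A minimal sketch, assuming (as here) that rows matched on scientific_name have NA synonyms, so coalesce() maps each row back to the name as it was reported:
# hypothetical post-processing step, not part of the answer above
joining_fun(my_plants, plant_database) %>%
  arrange(match(coalesce(synonym, scientific_name), my_plants$scientific_name))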

How to split a column into different columns in R

I have data in which the Education column takes values from 1 to 3, and the Payment column also takes values from 1 to 3. I wanted to group them in pairs: for each Education/Payment combination I have a count, and I'd like to reshape the table so that the counts are spread across 3 separate columns, one per education level. I would like the table to look like this:
(desired layout: one row per payment tier, one column per education level)
I tried to do this but it gave me an error.
The function pivot_wider from the tidyr package is the way to go:
library(tidyr)
library(dplyr)

df %>%
  pivot_wider(names_from = Education,
              values_from = count,
              names_glue = "Education = {Education}")
# A tibble: 3 × 4
  PaymentTier `Education = 1` `Education = 2` `Education = 3`
        <dbl>           <dbl>           <dbl>           <dbl>
1           1            1000             666            6543
2           2              33            2222            9999
3           3             455            1111            5234
Data:
df <- data.frame(
  Education = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  PaymentTier = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  count = c(1000, 33, 455, 666, 2222, 1111, 6543, 9999, 5234)
)
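If you prefer base R, xtabs() produces the same counts as a contingency table. A minimal sketch, assuming the df above:
# rows = PaymentTier, columns = Education, cells = count
xtabs(count ~ PaymentTier + Education, data = df)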

Join with closest value between two values in R

I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1', '2'), reported_value = c(1200, 31000), anchor_month = c(3, 5))

ID reported_value anchor_month
1            1200            3
2           31000            5
So, the first row was reported in March, but there's no way to know whether it refers to March or February values, and the reported value may also be an approximation of the real one. I've also got a table with the actual values for each ID; let's call it df2:
df2 = tibble(ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
             month = c(1, 2, 3, 4, 2, 3, 4, 5))

ID real_value month
1        1200     1
1        1230     2
1       11000     3
1          10     4
2       25000     2
2        3100     3
2         100     4
2       31030     5
So there are two challenges: first, for each ID I only care about the anchor month OR the month before it; second, I want to match to the closest value (sounds like a fuzzy join). My first step was to filter the second table down to the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))

df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1        1230     2
1       11000     3
2         100     4
2       31030     5
Now I tried a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a non-numeric argument to binary operator error, I'm guessing because ID is a string in my actual data and the difference is computed on every join column. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1            1200            3          1230     2
2           31000            5         31030     5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
  ID    reported_value anchor_month real_value month   dif
  <chr>          <dbl>        <dbl>      <dbl> <dbl> <dbl>
1 1               1200            3       1230     2    30
2 2              31000            5      31030     5    30
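On current dplyr (>= 1.0.0), slice_min() expresses the same "keep the closest row per ID" step a bit more directly. A minimal sketch, assuming the filtered df2 from above:
df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  slice_min(dif, with_ties = FALSE) %>%  # one closest row per ID
  ungroup()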

R append last row to a data frame

I have a data frame (df) that shares a key column ($Name) with a list of data frames:
head(df)
# A tibble: 6 x 3   ## truncated: showing first 2 rows only
Name var1 var2
<chr> <chr> <chr>
1 Tom Marks LAX ORD
2 Bob Sells MIA CHI
I have a list of data frames that contains historical data for each person contained in df$Name.
head(employees$'Tom Marks')
Name date var3
Tom Marks 2017-01-01 250
Tom Marks 2017-01-02 457
head(employees$'Bob Sells')
Name date var3
Bob Sells 2017-01-01 385
Bob Sells 2017-01-02 273
I would like to append the var3 value from the employees list to df, using the most recent date (which is always the last row of an employee's data frame). For example, after matching Tom Marks from df$Name to employees$'Tom Marks', the output would look like this:
head(df)
Name var1 var2 var3
<chr> <chr> <chr> <num>
1 Tom Marks LAX ORD 457
2 Bob Sells MIA CHI 273
I have spent a decent amount of time researching filtering joins, mutating joins, bind_rows, and reduce(), but I have been unsuccessful in accomplishing what is probably an easy task for a decent programmer. I'm hoping someone out there can put me out of my misery and provide some direction, or better yet, an answer!
Thank you!
If you're always after the last row, you can use tail to get it:
library(tidyverse)

left_join(
  df,
  map_df(employees, ~ tail(.x, 1))  # keep only each employee's last row; joins by the shared Name column
)
This solution relies on your data being arranged as you said it is, but you can easily arrange each list element by date first if it's not.
library(tidyverse)

df %>% left_join(
  df_list$employees %>%
    bind_rows() %>%
    group_by(Name) %>%
    summarise_at(vars(var3), last))
#        Name var1 var2 var3
# 1 Tom Marks  LAX  ORD  457
# 2 Bob Sells  MIA  CHI  273
Data
df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
                 var1 = c("LAX", "MIA"),
                 var2 = c("ORD", "CHI"))

df_list <- list(employees = list(
  `Tom Marks` = data.frame(Name = "Tom Marks",
                           date = c("2017-01-01", "2017-01-02"),
                           var3 = c(250, 457)),
  `Bob Sells` = data.frame(Name = "Bob Sells",
                           date = c("2017-01-01", "2017-01-02"),
                           var3 = c(385, 273))
))
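summarise_at() is superseded in current dplyr; a minimal sketch of the same idea with slice_tail(), assuming the df_list above:
df %>%
  left_join(
    bind_rows(df_list$employees) %>%
      group_by(Name) %>%
      slice_tail(n = 1) %>%  # last row per employee, i.e. the most recent date
      ungroup() %>%
      select(Name, var3),
    by = "Name"
  )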

Select grouped rows with at least one matching criterion

I want to select all the groupings that contain at least one of the elements I am interested in. I was able to do this by creating an intermediate data frame, but I am looking for something simpler and faster: my actual data set has over 1M rows (and 20 columns), so I am not sure I will have sufficient memory for the intermediate step, and, more importantly, the method below takes a long time on the original file.
Here's my code and data:
a) Data
dput(Data_File)
structure(list(Group_ID = c(123, 123, 123, 123, 234, 345, 444,
444), Product_Name = c("ABCD", "EFGH", "XYZ1", "Z123", "ABCD",
"EFGH", "ABCD", "ABCD"), Qty = c(2, 3, 4, 5, 6, 7, 8, 9)), .Names = c("Group_ID",
"Product_Name", "Qty"), row.names = c(NA, 8L), class = "data.frame")
b) Code: I want to select every Group_ID that has at least one Product_Name == "ABCD"
#Find out transactions
Data_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Product_Name == "ABCD") %>%
  select(Group_ID) %>%
  distinct()

#Now filter them
Filtered_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Group_ID %in% Data_T$Group_ID)
c) Expected output is
Group_ID Product_Name Qty
<dbl> <chr> <dbl>
123 ABCD 2
123 EFGH 3
123 XYZ1 4
123 Z123 5
234 ABCD 6
444 ABCD 8
444 ABCD 9
I've been struggling with this for over 3 hours now. I looked at the thread auto-suggested by SO, Select rows with at least two conditions from all conditions, but my question is quite different.
I would do it like this:
Data_File %>%
  group_by(Group_ID) %>%
  filter(any(Product_Name %in% "ABCD"))
# Source: local data frame [7 x 3]
# Groups: Group_ID [3]
#
# Group_ID Product_Name Qty
# <dbl> <chr> <dbl>
# 1 123 ABCD 2
# 2 123 EFGH 3
# 3 123 XYZ1 4
# 4 123 Z123 5
# 5 234 ABCD 6
# 6 444 ABCD 8
# 7 444 ABCD 9
Explanation: any() returns TRUE if any rows within the group match the condition. The length-1 result is then recycled to the full length of the group, so the entire group is kept. You could also use sum(Product_Name %in% "ABCD") > 0 as the condition, but the any() version reads very nicely. Use sum() instead if you want a more complicated condition, like 3 or more matching product names.
I prefer %in% to == for things like this because it behaves better with NA and is easy to extend if you want to check for any of several products per group.
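For instance, a sketch of the "3 or more" variant mentioned above:
Data_File %>%
  group_by(Group_ID) %>%
  filter(sum(Product_Name %in% "ABCD") >= 3)  # keep groups with at least 3 matching rows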
If speed and efficiency are an issue, data.table will be faster. I would do it like this, which relies on a keyed join for the filtering and uses no non-data.table operations, so it should be very fast:
library(data.table)

df = as.data.table(df)
setkey(df)
groups = unique(subset(df, Product_Name %in% "ABCD", Group_ID))
df[groups, nomatch = 0]
# Group_ID Product_Name Qty
# 1: 123 ABCD 2
# 2: 123 EFGH 3
# 3: 123 XYZ1 4
# 4: 123 Z123 5
# 5: 234 ABCD 6
# 6: 444 ABCD 8
# 7: 444 ABCD 9
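The same "any per group" idea also works directly inside data.table's j with by; a minimal sketch (same result, no key needed):
library(data.table)
dt <- as.data.table(Data_File)
dt[, if (any(Product_Name %in% "ABCD")) .SD, by = Group_ID]  # keep whole groups containing ABCD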
