I have a data frame (df) that shares a key column ($Name) with a list of data frames:
head(df)
# A tibble: 6 x 3 ##truncating to show first 2 rows only
Name var1 var2
<chr> <chr> <chr>
1 Tom Marks LAX ORD
2 Bob Sells MIA CHI
I have a list of data frames that contains historical data for each person contained in df$Name.
head(employees$'Tom Marks')
Name date var3
Tom Marks 2017-01-01 250
Tom Marks 2017-01-02 457
head(employees$'Bob Sells')
Name date var3
Bob Sells 2017-01-01 385
Bob Sells 2017-01-02 273
I would like to append to df the var3 value from the most recent date in each employees data frame (which is always its last row). For example, the output, after matching Tom Marks from df$Name to employees$'Tom Marks', would look like this:
head(df)
Name var1 var2 var3
<chr> <chr> <chr> <num>
1 Tom Marks LAX ORD 457
2 Bob Sells MIA CHI 273
I have spent a decent amount of time researching filtering joins, mutating joins, bind_rows, reduce() functions but have been unsuccessful in accomplishing what is probably an easy task for a decent programmer. I'm hoping someone out there can put me out of my misery and provide some better direction or better yet, an answer!
Thank you!
If you're always after the last row, you can use tail to get it:
library(tidyverse)
left_join(
df,
map_df(employees, ~ tail(.x, 1))
)
This solution relies on your data being arranged as you described (most recent date last); if they were not, you could easily sort each data frame in the list by date first.
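If the rows were not already in date order, one way to make this robust is to sort each data frame before taking its last row. This is only a minimal sketch, assuming the date column parses with as.Date; the out-of-order sample rows and the names latest and result are made up for illustration.

```r
library(dplyr)
library(purrr)

df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
                 var1 = c("LAX", "MIA"),
                 var2 = c("ORD", "CHI"),
                 stringsAsFactors = FALSE)

# Hypothetical data: Tom's rows are deliberately out of date order
employees <- list(
  `Tom Marks` = data.frame(Name = "Tom Marks",
                           date = as.Date(c("2017-01-02", "2017-01-01")),
                           var3 = c(457, 250),
                           stringsAsFactors = FALSE),
  `Bob Sells` = data.frame(Name = "Bob Sells",
                           date = as.Date(c("2017-01-01", "2017-01-02")),
                           var3 = c(385, 273),
                           stringsAsFactors = FALSE)
)

# Sort each data frame by date, keep its last (most recent) row, stack the rows
latest <- map_dfr(employees, ~ .x %>% arrange(date) %>% slice_tail(n = 1))

# Join only the column we want to append
result <- left_join(df, select(latest, Name, var3), by = "Name")
```

Selecting only Name and var3 before the join keeps the date column out of the result, which matches the output requested in the question.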
library(tidyverse)
df %>% left_join(
df_list$employees %>%
bind_rows() %>%
group_by(Name) %>%
summarise_at(vars(var3), last))
# Name var1 var2 var3
# 1 Tom Marks LAX ORD 457
# 2 Bob Sells MIA CHI 273
Data
df <- data.frame(Name = c("Tom Marks", "Bob Sells"),
var1 = c("LAX", "MIA"),
var2 = c("ORD", "CHI"))
df_list <- list(employees = list(
`Tom Marks` = data.frame(Name = "Tom Marks",
date = c("2017-01-01", "2017-01-02"),
var3 = c(250, 457)),
`Bob Sells` = data.frame(Name = "Bob Sells",
date = c("2017-01-01", "2017-01-02"),
var3 = c(385, 273))
))
I have a large dataset with multiple instances of 'Player', and I want to collapse their tournament scores into a single row but keep the row with the highest earnings. It's a large data set so I can't exactly paste the whole thing but a small example table looks like this:
Player    Earned  T1.  T2   T3
John Doe  2100    5    N/A  N/A
John Doe  1900    N/A  12   N/A
John Doe  500     N/A  N/A  16
I'd like to eliminate the N/A values and the duplicate rows by combining Tournament 1, 2, 3 into a single row, and also keeping the highest earnings value (2100), so that it looks more like this:
Player    Earned  T1.  T2  T3
John Doe  2100    5    12  16
So far I've used top_n(1, Earned) to keep the row with the highest earnings, but it only keeps the score of Tournament1, and I need to fill in the other columns with their scores.
From your reference to top_n, I'm inferring dplyr and related packages.
If your columns are strings (since "N/A" is not NA), then
func <- function(x, na.rm = TRUE, na = c("NA", "N/A")) {
if (is.numeric(x)) max(x, na.rm = na.rm) else head(na.omit(setdiff(x, na)), 1)
}
library(dplyr)
dat %>%
group_by(Player) %>%
summarize(across(everything(), func))
# # A tibble: 1 x 5
# Player Earned T1. T2 T3
# <chr> <int> <chr> <chr> <chr>
# 1 John Doe 2100 5 12 16
If your columns are numeric, though, then we can simplify that to
dat %>%
# an interim line to change your strings to numbers
mutate(across(-Player, ~ suppressWarnings(as.numeric(.)))) %>%
# pick up from here
group_by(Player) %>%
summarize(across(everything(), ~ max(., na.rm = TRUE)))
# # A tibble: 1 x 5
# Player Earned T1. T2 T3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 John Doe 2100 5 12 16
Data
dat <- structure(list(Player = c("John Doe", "John Doe", "John Doe"), Earned = c(2100L, 1900L, 500L), T1. = c("5", "N/A", "N/A" ), T2 = c("N/A", "12", "N/A"), T3 = c("N/A", "N/A", "16")), class = "data.frame", row.names = c(NA, -3L))
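The distinction between the string "N/A" and a real missing value is what drives both variants above. A small base-R sketch of that distinction, using a made-up vector named scores (not part of the original data):

```r
# A character column as it might arrive from a spreadsheet export
scores <- c("5", "N/A", "N/A")

# The string "N/A" is ordinary text, so is.na() does not flag it
any_na_before <- any(is.na(scores))

# Recode the placeholder strings to real missing values first
scores[scores %in% c("NA", "N/A")] <- NA

# Now the conversion is clean (no coercion warning) and na.rm works as expected
scores_num <- as.numeric(scores)
best <- max(scores_num, na.rm = TRUE)
```

Recoding the placeholders explicitly avoids having to wrap as.numeric() in suppressWarnings(), at the cost of listing every placeholder spelling you expect.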
I could not think of any other way to get rid of those N/A values, and since you were trying to collapse scores into a single row, I guessed they might be numeric values. Hence I converted them to numeric:
library(dplyr)
dat %>%
group_by(Player) %>%
# max() rather than cummax(), so the result does not depend on row order
mutate(Earned = max(Earned),
# wrap the call in a lambda so the coercion warnings are actually suppressed
across(T1:T3, ~ suppressWarnings(as.numeric(.x)))) %>%
group_by(Player, Earned) %>%
summarise(across(T1:T3, ~ na.omit(.x)))
# A tibble: 1 x 5
Player Earned T1 T2 T3
<chr> <int> <dbl> <dbl> <dbl>
1 John Doe 2100 5 12 16
I used the reproducible data shared by @r2evans, so I would like to thank him for that.
(P.S. I renamed T1. to T1 before running this.)
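An alternative to the top answer is to use lead and slice. Note that this relies on the layout shown in the question: after sorting by Earned, the T2 score sits one row below the top row and the T3 score two rows below.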
dat %>%
arrange(desc(Earned)) %>%
mutate(T2 = lead(T2),
T3 = lead(T3,2)) %>%
slice(which.max(Earned))
Output:
Player Earned T1. T2 T3
1 John Doe 2100 5 12 16
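Here is a data.table option (this assumes the N/A entries are real NA values in numeric columns, not the string "N/A"):
library(data.table)
setDT(dat)[, lapply(.SD, function(x) max(x, na.rm = TRUE)), by = Player]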
Player Earned T1. T2 T3
1: John Doe 2100 5 12 16
I'm trying to count bigrams independently of word order, so that 'John Doe' and 'Doe John' are counted together as 2.
I have already tried some text-mining examples, such as those provided at https://www.oreilly.com/library/view/text-mining-with/9781491981641/ch04.html, but couldn't find any way of counting that ignores order of appearance.
library('widyr')
word_pairs <- austen_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
It counts them separately, like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 144
2 elizabeth darcy 144
It should look like this:
item1 item2 n
<chr> <chr> <dbl>
1 darcy elizabeth 288
Thanks if anyone can help me.
This code works. There is probably something more efficient out there though.
# Create sample dataframe
df <- data.frame(name = c('darcy elizabeth', 'elizabeth darcy', 'John Doe', 'Doe John', 'Steve Smith'))
# Break out first and last names
library(stringr)
df$first <- word(df$name,1); df$second <- word(df$name,2);
# Reorder alphabetically
df$a <- ifelse(df$first<df$second, df$first, df$second); df$b <- ifelse(df$first>df$second, df$first, df$second)
library(dplyr)
summarize(group_by(df, a, b), n())
# Yields
# a b `n()`
# <chr> <chr> <int>
#1 darcy elizabeth 2
#2 Doe John 2
#3 Smith Steve 1
Thanks, guys.
I considered your suggestions and tried a similar approach:
library(dplyr)
# Function to check whether 2 values are already in alphabetical order.
# I got the function below from another post; I couldn't remember the author ;(.
alphabetical <- function(x,y){x < y}
#Created a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Create an empty dataframe
dfCreated <- data.frame(col1=character(),col2=character())
#for each row, I reorder the columns and append to a new dataframe
#Tks to Gregor
for(i in seq_len(nrow(dfSample))) {
# as.character() is base R; as.String() would need the NLP package
row <- c(as.character(dfSample[i,1]), as.character(dfSample[i,2]))
if(!alphabetical(row[1],row[2])){
row <- c(row[2],row[1])
}
dfCreated<-rbind(dfCreated,c(row[1],row[2]),stringsAsFactors=FALSE)
}
colnames(dfCreated)<-c("col1","col2")
dfCreated
#tks to Monk
summarize(group_by(dfCreated, col1, col2), n())
col1 col2 `n()`
<chr> <chr> <int>
1 darcy elizabeth 4
2 doe john 2
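The row-by-row reordering above can probably be done without a loop: pmin() and pmax() compare two character vectors element-wise, so they yield an order-independent key for each pair. The sketch below is an illustration rather than the posters' method; it uses a made-up word_pairs table shaped like pairwise_count() output and sums the n column, which is what the question asked for.

```r
# Hypothetical pair counts in the item1/item2/n layout shown in the question
word_pairs <- data.frame(item1 = c("darcy", "elizabeth", "doe"),
                         item2 = c("elizabeth", "darcy", "john"),
                         n = c(144, 144, 3),
                         stringsAsFactors = FALSE)

# Element-wise alphabetical minimum and maximum give a canonical ordering,
# so ("elizabeth", "darcy") and ("darcy", "elizabeth") share the same key
word_pairs$a <- pmin(word_pairs$item1, word_pairs$item2)
word_pairs$b <- pmax(word_pairs$item1, word_pairs$item2)

# Add up the counts for each canonical pair
combined <- aggregate(n ~ a + b, data = word_pairs, FUN = sum)
```

To count occurrences instead of summing an existing n column, the same a and b keys can be passed to dplyr::count() or table().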
I have a dataframe with 3 variables: ID, VAR1 and VAR2. VAR1 and VAR2 each contain multiple values that can be broken down into rows. I would like to turn the VAR1 values into column headers and link each VAR2 value to the corresponding VAR1 entry. My data looks like this:
ID  VAR1                              VAR2
1   Code Employee number Personal ID  132 12345 12452
2   Employee number Personal ID       32145 13452
3   Code Employee number              444 56743
4   Code Employee number Personal ID  546 89642 14667
I would like to obtain:
ID  Code  Employee number  Personal ID
1   132   12345            12452
2         32145            13452
3   444   56743
4   546   89642            14667
Here's a tidyverse approach.
First you need to update the values that represent your future column names, as R doesn't like spaces in column names.
# example dataset
df = data.frame(ID = 1:2,
VAR1 = c("Code Employee number Personal ID", "Employee number Personal ID"),
VAR2 = c("132 12345 12452", "32145 13452"))
library(tidyverse)
df %>%
mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
separate_rows(VAR1, VAR2) %>%
spread(VAR1, VAR2)
# ID Code EmployeeNumber PersonalID
# 1 1 132 12345 12452
# 2 2 <NA> 32145 13452
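In current tidyr, spread() is superseded by pivot_wider(). The following is a sketch of the same idea with the newer function, assuming a tidyr version that provides pivot_wider(); the result name wide is just for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = 1:2,
                 VAR1 = c("Code Employee number Personal ID",
                          "Employee number Personal ID"),
                 VAR2 = c("132 12345 12452", "32145 13452"),
                 stringsAsFactors = FALSE)

wide <- df %>%
  # Collapse multi-word labels so that a space only ever separates entries
  mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
         VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
  # One row per (label, value) pair; both columns are split in parallel
  separate_rows(VAR1, VAR2, sep = " ") %>%
  # Labels become column names; missing combinations are filled with NA
  pivot_wider(names_from = VAR1, values_from = VAR2)
```

The renaming step matters because the entries are split on spaces: without it, "Employee number" would be split into two labels and no longer line up with the values in VAR2.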
Good morning,
I've got a two-column dataset which I'd like to spread to more columns based on a group_by in Dplyr but I'm not sure how.
My data looks like:
Person Case
John A
John B
Bill C
David F
I'd like to be able to transform it to the following structure:
Person Case_1 Case_2 ... Case_n
John A B
Bill C NA
David F NA
My original thought was along the lines of:
data %>%
group_by(Person) %>%
spread()
Error: Please supply column name
What's the easiest, or most R-like way to achieve this?
You should first add a case id to the dataset, which can be done with a combination of group_by and mutate:
dat = data.frame(Person = c('John', 'John', 'Bill', 'David'), Case = c('A', 'B', 'C', 'F'))
dat = dat %>% group_by(Person) %>% mutate(id = sprintf('Case_%d', row_number()))
dat %>% head()
# A tibble: 4 × 3
Person Case id
<fctr> <fctr> <chr>
1 John A Case_1
2 John B Case_2
3 Bill C Case_1
4 David F Case_1
Now you can use spread to transform the data:
dat %>% spread(Person, Case)
# A tibble: 2 × 4
id Bill David John
* <chr> <fctr> <fctr> <fctr>
1 Case_1 C F A
2 Case_2 NA NA B
You can get the structure you list above using:
res = dat %>% spread(Person, Case) %>% select(-id) %>% t() %>% as.data.frame()
names(res) = unique(dat$id)
res
Case_1 Case_2
Bill C <NA>
David F <NA>
John A B
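The transpose step can probably be avoided by spreading on the id column instead of on Person. A sketch with pivot_wider(), assuming a reasonably recent tidyr; the name wide is illustrative:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Person = c("John", "John", "Bill", "David"),
                  Case = c("A", "B", "C", "F"),
                  stringsAsFactors = FALSE)

wide <- dat %>%
  group_by(Person) %>%
  # Number the cases within each person: Case_1, Case_2, ...
  mutate(id = paste0("Case_", row_number())) %>%
  ungroup() %>%
  # One column per case number, one row per person
  pivot_wider(names_from = id, values_from = Case)
```

This keeps Person as an ordinary column rather than as row names, which tends to be easier to work with in later dplyr steps.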
I'm trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector containing the first 3 characters of the ICD9 codes.
The example database contains 3 character variables holding ICD9 codes for each patient visit (var1 to var3).
Below is an example of the data
patient<-c("a","b","c")
var1<-c("8661", "865","8651")
var2<-c("8651","8674","2866")
var3<-c("2430","3456","9089")
observations<-data_frame(patient,var1,var2,var3)
patient var1 var2 var3
1 a 8661 8651 2430
2 b 865 8674 3456
3 c 8651 2866 9089
#diagnosis of interest: all beginning with "866" and "867"
dx<-c("866","867")
filtered_data<- filter(observations, var1 %like% dx | var2 %like% dx | var3 %like% dx)
I have tried several approaches including the grep and the %like% functions as you can see above but I haven’t been able to get it working for my case. I would appreciate any help you can provide.
Happy Thanksgiving
Albit
You can build a regex pattern from the vector of codes of interest and apply it to every column of your data frame except the patient ID, then use rowSums to check whether any variable in a row matches the pattern:
library(dplyr)
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867)"
filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)
# A tibble: 2 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
Another option is to use Reduce with lapply:
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 2 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
This approach works when you have more than two patterns and when the patterns have different lengths; for instance, if dx is c("866", "867", "9089"):
dx<-c("866","867", "9089")
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867|9089)"
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 3 × 4
# patient var1 var2 var3
# <chr> <chr> <chr> <chr>
#1 a 8661 8651 2430
#2 b 865 8674 3456
#3 c 8651 2866 9089
Check this and this stack answer for more about multiple or conditions in regex.
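With newer dplyr releases the same row-wise "any column matches" test can be written with if_any(). This is a sketch under the assumption that dplyr 1.0.4 or later is installed (that is where if_any() was introduced); it is not part of the original answer.

```r
library(dplyr)

observations <- tibble(patient = c("a", "b", "c"),
                       var1 = c("8661", "865", "8651"),
                       var2 = c("8651", "8674", "2866"),
                       var3 = c("2430", "3456", "9089"))

dx <- c("866", "867")

# Anchor at the start of the string so that "2866" does not match "866"
pattern <- paste0("^(", paste(dx, collapse = "|"), ")")

# Keep a row if any of var1..var3 starts with one of the codes
filtered <- observations %>%
  filter(if_any(var1:var3, ~ grepl(pattern, .x)))
```

The ^ anchor is the important part: without it, patient c would be kept because "2866" contains "866".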
This looks close to what you're looking for, but requires a bit more manipulation:
library(dplyr)
library(stringr)
library(tidyr)
obs2 <- observations %>%
gather(vars, value, -patient) %>%
filter(str_sub(value, 1, 3) %in% dx)
# A tibble: 2 × 3
patient vars value
<chr> <chr> <chr>
1 a var1 8661
2 b var2 8674
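The extra manipulation could be a semi_join() back onto the original table, which returns the full rows for every patient with at least one matching code. A sketch, using pivot_longer() in place of gather() and assuming a tidyr version that provides it; the names matched and filtered are illustrative:

```r
library(dplyr)
library(tidyr)

observations <- tibble(patient = c("a", "b", "c"),
                       var1 = c("8661", "865", "8651"),
                       var2 = c("8651", "8674", "2866"),
                       var3 = c("2430", "3456", "9089"))
dx <- c("866", "867")

# Long format: one row per patient and code, keeping only the matching codes
matched <- observations %>%
  pivot_longer(-patient, names_to = "vars", values_to = "value") %>%
  filter(substr(value, 1, 3) %in% dx)

# Keep the original wide rows for patients that have at least one match
filtered <- semi_join(observations, matched, by = "patient")
```

Because semi_join() never duplicates rows, a patient with several matching codes still appears only once in filtered.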
You can use apply and ldply
library(plyr)
filtered_obs <- apply(observations, 1, function(x) if(sum(substr(x,1,3) %in% dx)>0){x})
filtered_obs <- plyr::ldply(filtered_obs,rbind)
If the codes have a variable number of characters, then this should work:
filtered_obs <- lapply(dx, function(y)
{
plyr::ldply(apply(observations, 1, function(x)
{
if(sum(substr(x,1,nchar(y)) %in% y)>0){x}
}), rbind)
})
filtered_obs <- unique(plyr::ldply(filtered_obs,rbind))
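For prefixes of varying length, a base-R alternative that needs neither plyr nor a regex is startsWith(). The sketch below is an assumption-laden illustration: starts_with_any is a hypothetical helper, not an existing function, and the columns must be character vectors.

```r
observations <- data.frame(patient = c("a", "b", "c"),
                           var1 = c("8661", "865", "8651"),
                           var2 = c("8651", "8674", "2866"),
                           var3 = c("2430", "3456", "9089"),
                           stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where x starts with at least one of the prefixes
starts_with_any <- function(x, prefixes) {
  Reduce(`|`, lapply(prefixes, function(p) startsWith(x, p)))
}

dx <- c("866", "867", "9089")

# Test every code column, then OR the per-column results row by row
keep <- Reduce(`|`, lapply(observations[-1], starts_with_any, prefixes = dx))
filtered_obs <- observations[keep, ]

# The same call with only the two 3-character prefixes
keep_short <- Reduce(`|`, lapply(observations[-1], starts_with_any,
                                 prefixes = c("866", "867")))
```

Since startsWith() compares literal prefixes, codes containing regex metacharacters (such as a dot) are matched as plain text without escaping.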