Suppose I have a DT as -
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types tells me what are the valid types I need. There are 4 total types(100, 200, 500, 2000). An entry specifies their valid types and their corresponding values with | separated character values.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both the columns and split them on | which would give me two lists. I would then combine them by setting the keys as names of the types list and then convert the final list to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this ?
Here is another data.table approach:
dcast(
DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
id ~ valid_types, value.var = "values"
)
Using tidyr library you can use separate_rows with pivot_wider :
library(tidyr)
df %>%
separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be :
library(data.table)
library(splitstackshape)
setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
id~valid_types, value.var = 'values')
Related
when I execute the following code:
data_ikea_wider <- data_ikea_longer %>%
pivot_wider(id_cols = c(Record_no
, Geography
, City
, Country
, City.Country
, Year)
, names_from = Category, values_from = Value)
The columns just have n/a's as shown in the attached print screen.
What am I doing wrong? Thanks!
We could use dcast from data.table
library(data.table)
setDT(dat)[, col1 ~ col2, value.var = 'val')
Getting NAs from a pivot is not unexpected, it means that not all of your id columns have all "columns".
For example,
dat <- data.frame(col1 = c(1,1,2), col2 = c('a', 'b', 'a'), val = 1:3)
dat
# col1 col2 val
# 1 1 a 1
# 2 1 b 2
# 3 2 a 3
If we want to pivot keeping col1 as an id, and turning col2 values into new columns, then it should be apparent that we'll end up with two rows (ida 1 and 2), and two new columns (a and b) to replace col2 and val. Unfortunately, since we only have three rows, the 2 rows 2 columns = 4 cells will not be completely filled with 3 values, so one will be NA:
pivot_wider(dat, col1, names_from = col2, values_from = val)
# # A tibble: 2 x 3
# col1 a b
# <dbl> <int> <int>
# 1 1 1 2
# 2 2 3 NA
If you see this and are surprised, thinking that you actually have the data ... then you should check your data importing and filtering to make sure you did not inadvertently remove it (or it was not provided initially).
I have a dataframe like this
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97
that I would like to merge with this other dataframe
id date
1 26/08/88
2 10/05/96
The resulting merged dataframe should look like this
id start end date
1 20/06/88 24/07/89 26/06/88
1 27/07/89 13/04/93 NA
1 14/04/93 6/09/95 NA
2 3/01/92 11/02/94 NA
2 30/03/94 16/04/96 NA
2 17/04/96 18/08/97 10/05/96
In practice I want to merge the two dataframes based on id and on the fact that date must lie within the interval spanned by the start and end vars of the first dataframe.
Do you have any suggestion on how to do this? I tried to use the fuzzyjoin package, but I have some memory issue..
Many thanks to everyone
Might be a dupe, will remove when I found a good target. In the meantime, we could use fuzzyjoin
library(tidyverse)
library(fuzzyjoin)
df1 %>%
mutate_at(2:3, as.Date, "%d/%m/%y") %>%
fuzzy_left_join(
df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
by = c("id" = "id", "start" = "date", "end" = "date"),
match_fun = list(`==`, `<`, `>`))
# id.x start end id.y date
#1 1 1988-06-20 1989-07-24 1 1988-08-26
#2 1 1989-07-27 1993-04-13 NA <NA>
#3 1 1993-04-14 1995-09-06 NA <NA>
#4 2 1992-01-03 1994-02-11 NA <NA>
#5 2 1994-03-30 1996-04-16 NA <NA>
#6 2 1996-04-17 1997-08-18 2 1996-05-10
All that remains is tidying up the id columns.
Sample data
df1 <- read.table(text = "
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97", header = T)
df2 <- read.table(text = "
id date
1 26/08/88
2 10/05/96 ", header = T)
You can use sqldf for complex joins:
require(sqldf)
sqldf("SELECT df1.*,df2.date,df2.id as id2
FROM df1
LEFT JOIN df2
ON df1.id = df2.id AND
df1.start < df2.date AND
df1.end > df2.date")
Say I have two dataframes df1 and df2 as follow:
df1:
EmployeeID Skill
1 A
1 B
1 C
2 B
2 D
2 C
2 F
3 A
3 J
df2:
Opportunity.ID Skill
12345 A
12345 B
56788 C
56788 B
56788 F
09988 H
What I'm looking to do is to have a new data frame with all the EmployeeID that have all the skills required for a specific Opportunity.ID, and not only one of them. This is why a simple merge or left/right join will not be enought.
In our case, what I would like to have is:
Opportunity.ID Employee.ID
12345 1
56788 2
09988 NA
Note that employee 3 should not be assigned to opportunity 12345 because he only has one skill among the two required.
Any help would be greatly appreciated.
Here's one way using dplyr -
df2 %>%
left_join(df1, by = "Skill") %>%
group_by(Opportunity.ID) %>%
mutate(test = ave(Skill, EmployeeID, FUN = function(x) all(Skill %in% x))) %>%
ungroup() %>%
filter(test != "FALSE") %>%
distinct(Opportunity.ID, EmployeeID)
# A tibble: 3 x 2
Opportunity.ID EmployeeID
<int> <int>
1 12345 1
2 56788 2
3 9988 NA
There is probably a better solution, but with the data.table-package I came to the following approach:
library(data.table) # load the package
setDT(df1) # convert 'df1' to a 'data.table'
setDT(df2) # convert 'df2' to a 'data.table'
df2[, .(EmployeeID = df1[.SD[, .(Skill, n = .N)], on = .(Skill)
][, .(ne = .N), by = .(EmployeeID, n)
][n == ne, EmployeeID])
, by = Opportunity.ID]
which gives:
Opportunity.ID EmployeeID
1: 12345 1
2: 56788 2
3: 9988 NA
I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
We create a named vector ('nm1'), split the 'shotchart' to extract only 'M' and then use the named vector to change the values to get the sum
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and base the result on the count of "M" characters -1 as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
score = NA_integer_,
stringsAsFactors = F)
df$score <- lapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
I would like to split a certain format of data from one column into multiple columns. Below are my sample data:
df = data.frame(id=c(1,2),data=c('apple:A%1^B%2^C%3_orange:A%1^B%2',
'apple:A%1^B%2^D%3_orange:A%3^B%2'))
# id data
# 1 apple:A%1^B%2^C%3_orange:A%1^B%2
# 2 apple:A%1^B%2^D%3_orange:C%3^B%2
which will then gives the following output
id data_apple_A data_apple_B data_apple_C data_apple_D data_orange_A data_orange_B
1 1 2 3 1 2
2 1 2 3 1 2
I have been able to do this but the method that I use involves looping through each of the row and perform the str_split by each of the separator in order to get the data for each row and append it to the final output dataframe which is very slow considering I will have 500k rows by 20 input column.
I don't think my for loop is a proper R way to code for this use case. Any help will be appreciated.
We can use cSplit with str_extract
library(splitstackshape)
library(zoo)
library(stringr)
dt <- cSplit(df, 'data', "\\^|_", fixed = FALSE, "long")[, c('grp', 'grp2', 'val')
:= .(na.locf(str_extract(data, "^[A-Za-z]+(?=:)")),
str_extract(data, "[A-Z](?=[%])"), as.numeric(str_extract(data, "\\d+"))) ][]
dcast(dt, id ~ paste0("data_", grp) + grp2, value.var = 'val', sep = "_", fill = 0)
# id data_apple_A data_apple_B data_apple_C data_apple_D data_orange_A data_orange_B
#1: 1 1 2 3 0 1 2
#2: 2 1 2 0 3 3 2