I have two data.tables in R, like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that:
first, last uniquely identifies a person
truncation of the last name does not introduce ambiguity
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in first_table, I need to find a row in second_table that back-fills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
One approach is to join on first, then filter on the substring match:
first_table[
  unique(second_table[, .(first, last)])
  , on = "first"
  , nomatch = 0
][
  substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or, truncate last in second_table to match first_table's trunc (here a fixed width of 5 characters), then join on both columns:
first_table[
  unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
  , on = c("first", "trunc")
  , nomatch = 0
]
## yields the same answer
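If the truncation width is not a fixed 5 characters, the hard-coded substr(last, 1, 5) in the second approach no longer applies, but the filter in the first approach generalizes. A minimal sketch, rebuilding the example tables and using base R's startsWith() in place of the substr() comparison (variable names are just for illustration):

```r
library(data.table)

first_table <- data.table(
  id = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1 = c(10, 20, 30, 40)
)
second_table <- data.table(
  id = 1:8,
  first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
  last = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
  val2 = rep(c("A", "B"), 4)
)

# Join on first name, then keep only rows where trunc is a prefix of last,
# whatever length the truncation happens to be.
res <- first_table[
  unique(second_table[, .(first, last)]),
  on = "first", nomatch = 0
][startsWith(last, trunc)]
```

startsWith(last, trunc) is equivalent to the substr(last, 1, nchar(trunc)) == trunc filter above, but handles per-row truncation lengths without recomputing nchar().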
I am trying to determine repeat IDs based on date and an initial event. Below is a sample data set
+----+------------+-------------------------+
| ID | Date | Investigation or Intake |
+----+------------+-------------------------+
| 1 | 1/1/2019 | Investigation |
| 2 | 1/2/2019 | Investigation |
| 3 | 1/3/2019 | Investigation |
| 4 | 1/4/2019 | Investigation |
| 1 | 1/2/2019 | Intake |
| 2 | 12/31/2018 | Intake |
| 3 | 1/5/2019 | Intake |
+----+------------+-------------------------+
I want to write R code to go through IDs 1 to 4 (IDs that have investigations) and see if they have a subsequent intake (an intake that happens at a later date than the investigation). So the expected output looks like this:
+----+------------+-------------------------+------------+
| ID | Date | Investigation or Intake | New Column |
+----+------------+-------------------------+------------+
| 1 | 1/1/2019 | Investigation | Sub Intake |
| 2 | 1/2/2019 | Investigation | None |
| 3 | 1/3/2019 | Investigation | Sub Intake |
| 4 | 1/4/2019 | Investigation | None |
| 1 | 1/2/2019 | Intake | |
| 2 | 12/31/2018 | Intake | |
| 3 | 1/5/2019 | Intake | |
+----+------------+-------------------------+------------+
What will the code look like to solve this? I am guessing it will be some loop function?
Thanks!
You can do this with the dplyr package, using ifelse statements to create the new column. Instead of looping, check the next entry in each group with the lead function.
This solution assumes that each group has one "Investigation" row followed by zero or more "Intake" rows.
library(dplyr)
df <- data.frame(
  ID = c(1, 2, 3, 4, 1, 2, 3),
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                   "2019-01-02", "2018-12-31", "2019-01-05")),
  Investigation_or_Intake = c("Investigation", "Investigation", "Investigation",
                              "Investigation", "Intake", "Intake", "Intake"),
  stringsAsFactors = FALSE
)
df %>%
  group_by(ID) %>% # make groups according to the ID column
  mutate(
    newcol = ifelse(lead(Date) > Date, "Sub Intake", "None"), # is the next entry in the group later than the current one?
    newcol = ifelse(Investigation_or_Intake == "Investigation" & is.na(newcol), "None", newcol) # "Investigation" entries with no Intake get "None"
  )
This gives
ID Date Investigation_or_Intake newcol
<dbl> <date> <chr> <chr>
1 1 2019-01-01 Investigation Sub Intake
2 2 2019-01-02 Investigation None
3 3 2019-01-03 Investigation Sub Intake
4 4 2019-01-04 Investigation None
5 1 2019-01-02 Intake NA
6 2 2018-12-31 Intake NA
7 3 2019-01-05 Intake NA
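If Intake rows are not guaranteed to sit immediately after their Investigation, lead() will miss them. A sketch (same df as above, still assuming one Investigation per ID) that instead checks whether any later Intake exists anywhere in the group:

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 2, 3, 4, 1, 2, 3),
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                   "2019-01-02", "2018-12-31", "2019-01-05")),
  Investigation_or_Intake = c(rep("Investigation", 4), rep("Intake", 3)),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(ID) %>%
  mutate(
    # date of this ID's Investigation (assumed unique within the group)
    inv_date = Date[Investigation_or_Intake == "Investigation"][1],
    # flag Investigations that have any Intake dated after them
    newcol = ifelse(
      Investigation_or_Intake == "Investigation",
      ifelse(any(Investigation_or_Intake == "Intake" & Date > inv_date),
             "Sub Intake", "None"),
      NA_character_
    )
  ) %>%
  select(-inv_date) %>%
  ungroup()
```

This gives the same output as the lead() version for the sample data, but does not depend on row order within each group.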
I have a data frame that looks like this:
+-----------+------------+-----------+-----+----------------+
| Unique ID | First Name | Last Name | Age | Characteristic |
+-----------+------------+-----------+-----+----------------+
| 1 | Bob | Smith | 25 | Intelligent |
| 1 | Bob | Smith | 25 | Funny |
| 1 | Bob | Smith | 25 | Short |
| 2 | Jim | Murphy | 62 | Tall |
| 2 | Jim | Murphy | 62 | Funny |
| 3 | Kelly | Green | 33 | Tall |
+-----------+------------+-----------+-----+----------------+
I want to convert the "Characteristic" column into column headers and, for each record, populate each new column with 1 if the record has that characteristic and 0 if it doesn't, so that I have one row per record and my output looks like:
+-----------+------------+-----------+-----+-------------+-------+-------+------+
| Unique ID | First Name | Last Name | Age | Intelligent | Funny | Short | Tall |
+-----------+------------+-----------+-----+-------------+-------+-------+------+
| 1 | Bob | Smith | 25 | 1 | 1 | 1 | 0 |
| 2 | Jim | Murphy | 62 | 0 | 1 | 0 | 1 |
| 3 | Kelly | Green | 33 | 0 | 0 | 0 | 1 |
+-----------+------------+-----------+-----+-------------+-------+-------+------+
More consumable data, and a solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
read.table(header=TRUE, stringsAsFactors=FALSE, text="
Unique_ID First_Name Last_Name Age Characteristic
1 Bob Smith 25 Intelligent
1 Bob Smith 25 Funny
1 Bob Smith 25 Short
2 Jim Murphy 62 Tall
2 Jim Murphy 62 Funny
3 Kelly Green 33 Tall") %>%
  mutate(v = 1L) %>%
  tidyr::spread(Characteristic, v, fill=0L)
# Unique_ID First_Name Last_Name Age Funny Intelligent Short Tall
# 1 1 Bob Smith 25 1 1 1 0
# 2 2 Jim Murphy 62 1 0 0 1
# 3 3 Kelly Green 33 0 0 0 1
Most of the work is done by spread. Without the fill argument it would leave NA instead of 0 in all of the empty spots; fill=0L puts 0 there instead. (Edited based on @www's suggestion.)
Here is another tidyverse solution.
df %>%
  mutate(ind = 1L) %>%
  spread(Characteristic, ind, fill = 0L)
# Unique.ID First.Name Last.Name Age Funny Intelligent Short Tall
# 1 1 Bob Smith 25 1 1 1 0
# 2 2 Jim Murphy 62 1 0 0 1
# 3 3 Kelly Green 33 0 0 0 1
You can also use reshape2, which handles the case where a person has more than one instance of the same characteristic:
library(reshape2)
dcast(df, ...~Characteristic, fun.aggregate = length)
The data
df <- read.table(text = "Unique ID | First Name | Last Name | Age | Characteristic
1 | Bob | Smith | 25 | Intelligent
1 | Bob | Smith | 25 | Funny
1 | Bob | Smith | 25 | Short
2 | Jim | Murphy | 62 | Tall
2 | Jim | Murphy | 62 | Funny
3 | Kelly | Green | 33 | Tall ", sep = "|", header = T, strip.white = T, stringsAsFactors = F)
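Note that in current tidyr, spread is superseded by pivot_wider. A sketch of the same reshape with pivot_wider, assuming tidyr >= 1.1 (for a scalar values_fill) and rebuilding the data inline:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Unique_ID = c(1, 1, 1, 2, 2, 3),
  First_Name = c("Bob", "Bob", "Bob", "Jim", "Jim", "Kelly"),
  Last_Name = c("Smith", "Smith", "Smith", "Murphy", "Murphy", "Green"),
  Age = c(25, 25, 25, 62, 62, 33),
  Characteristic = c("Intelligent", "Funny", "Short", "Tall", "Funny", "Tall"),
  stringsAsFactors = FALSE
)

# spread the Characteristic values into indicator columns, filling gaps with 0
wide <- df %>%
  mutate(v = 1L) %>%
  pivot_wider(names_from = Characteristic, values_from = v, values_fill = 0L)
```

All columns other than Characteristic and v act as the row identifier, so the result has one row per person.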
I have these tables in my database:
TOTAL TABLE:
total_table_id | person_id | car_id | year
----------------|-----------|-----------|------
0 | 1 | 4 | 2015
1 | 1 | 2 | 2017
2 | 2 | 0 | 2017
3 | 3 | 3 | 2017
4 | 3 | 4 | 2015
PERSON TABLE:
person_id | name | age
------------|-------|-----
0 | John | 26
1 | Anna | 41
2 | Sam | 33
3 | Tim | 33
CAR TABLE:
car_id | model | color
--------|-------|-------
0 | A | red
1 | B | blue
2 | B | white
3 | D | red
4 | C | black
What I want, after selecting the year in a dropdown, is to get something like this:
2017:
color | age | cars_count
--------|-------|------------
red | 33 | 2
white | 41 | 1
This is the query that I have for the moment:
(from a in total
 join b in person
   on a.person_id equals b.person_id
 join c in car
   on a.car_id equals c.car_id
 select new
 {
     color = c.color,
     age = b.age,
     cars_count = ? // <-------- This is what I don't know how to get
 }).ToList();
Any tips?
You should use a group by statement:
var answer = (from total in totalTable
              where total.year == 2017
              join car in carTable on total.car_id equals car.car_id
              join person in personTable on total.person_id equals person.person_id
              group person by car.color into sub
              select new
              {
                  color = sub.Key,
                  age = sub.Max(x => x.age),
                  // or age = sub.Min(x => x.age),
                  // or age = sub.First().age,
                  count = sub.Count()
              }).ToList();
I need help taking an annual total (for each of many initiatives) and breaking it down to each month using a simple division formula. I need to do this for each distinct combination of a few columns, while copying down the columns that are broken out from annual to monthly totals. The loop will apply the formula to two columns and loop through each distinct group in a vector. I have tried to explain with an example below, as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I want to loop a function over each row that takes "Total Savings" and "Total Costs" and divides by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to Sept), creating an individual row for each month. I'm essentially breaking an annual total out into one row per month of the year. I need the loop to also copy the "Init" and "Name" columns for each expanded row. Note that the division formula differs by year; I suppose I could separate the 2015 and 2016 datasets, use two different functions, and merge, if that would be easier. Below should be the output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | MonthE |
| A | Mike | 02-01-2016 | MonthE | MonthE |
| A | Mike | 03-01-2016 | MonthE | MonthE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthG |
| B | John | 02-01-2015 | MonthG | MonthG |
| B | John | 03-01-2015 | MonthG | MonthG |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
  data.frame(
    Init = rep(LETTERS[1:3], each = 4)
    , Name = rep(c("John", "Mike"), each = 2)
    , Date = 2015:2016
    , Savings = (1:12)*1200
    , Cost = (1:12)*2400
  )
Next, set the divisor to be used for each year:
toDivide <-
  c("2015" = 12, "2016" = 9)
Then, using the magrittr pipe, I split the data into single rows, loop through them with lapply to expand each row into the appropriate number of rows (12 or 9) with the savings and costs divided by the number of months, and finally stitch the rows back together with dplyr's bind_rows.
library(dplyr)

myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
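The same expansion can also be done without split/lapply by repeating row indices, which is usually faster on large data. A vectorized sketch, using the same myData and toDivide as above:

```r
myData <- data.frame(
  Init = rep(LETTERS[1:3], each = 4),
  Name = rep(c("John", "Mike"), each = 2),
  Date = 2015:2016,
  Savings = (1:12) * 1200,
  Cost = (1:12) * 2400
)
toDivide <- c("2015" = 12, "2016" = 9)

n <- toDivide[as.character(myData$Date)]       # months per original row
idx <- rep(seq_len(nrow(myData)), times = n)   # repeat each row index n times
out <- myData[idx, ]
out$Savings <- out$Savings / n[idx]            # divide by that row's month count
out$Cost <- out$Cost / n[idx]
# sequence(n) counts 1..12 or 1..9 within each expanded row
out$Date <- as.Date(sprintf("%d-%02d-01", out$Date, sequence(n)))
rownames(out) <- NULL
```

rep() and sequence() build the row expansion and the month numbers in one pass, so no per-row function calls are needed.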
I have one table 'positions' with columns:
id | session_id | keyword_id | position
and some rows in it:
10 rows with session_id = 1
and 10 with session_id = 2.
As a result of the query I need a table like this:
id | keyword_id | position1 | position2
where 'position1' is a column with values that had session_id = 1 and 'position2' is a column with values that had session_id = 2.
The result set should contain 10 records.
Sorry for my bad English.
Data example:
id | session_id | keyword_id | position
1 | 1 | 1 | 2
2 | 1 | 2 | 3
3 | 1 | 3 | 0
4 | 1 | 4 | 18
5 | 2 | 5 | 9
6 | 2 | 1 | 0
7 | 2 | 2 | 14
8 | 2 | 3 | 2
9 | 2 | 4 | 8
10 | 2 | 5 | 19
Assuming that you wish to combine positions with the same keyword_id from the two sessions, the following query should do the trick:
SELECT T1.keyword_id
, T1.position as Position1
, T2.position as Position2
FROM positions T1
INNER JOIN positions T2
ON T1.keyword_id = T2.keyword_id -- this will match positions by [keyword_id]
AND T1.session_id = 1
AND T2.session_id = 2