If my query is returning:
Id   Column1  Column2
123  Value
123           Value
456  Value
456           Value
and I have a second query that returns:
Id   Column3
123  50
456  75
How can I join the two queries on Id so that the Column3 value does not appear on every row with a matching Id, but only on rows where Column1 has a value? For example:
Id   Column1  Column2  Column3
123  Value             50
123           Value
456  Value             75
456           Value
You can calculate Column3 using the case() function with the logic you've described.
For example:
let q1 = datatable(Id:long, Column1:string, Column2:string)
[
    123, 'Value', '',
    123, '', 'Value',
    456, 'Value', '',
    456, '', 'Value',
];
let q2 = datatable(Id:long, Column3:long)
[
    123, 50,
    456, 75,
];
q1
| join kind=inner q2 on Id
| project Id, Column1, Column2, Column3 = case(isempty(Column1), long(null), Column3)
Output:

Id   Column1  Column2  Column3
123           Value
123  Value             50
456           Value
456  Value             75
I have a simple dataframe in R:
df1 <- data.frame(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
The basic idea is to have a column that indicates, for each row, the sum of Value for that user on questions they asked before (lower-numbered questions).
I tried using this function (after first trying a piped sum, which just gave the non-numeric argument errors everybody seems to run into):
f2 <- function(x){
  Value_out <- filter(df1, questionID < x['questionID'] & userID == x['userID']) %>%
    select(Value) %>%
    summarize_if(is.numeric, sum, na.rm = TRUE)
}
out = mutate(df1, Expert = apply(df1, 1, f2))
While this works if you print it out, the Expert column is saved as a list of data frames. All I want is an int or numeric sum of Value. Is there any way to do this? By the way, yes, I've looked all over for ways to do this; most answers just summarize the whole column in a manner that won't work for me.
I think I would avoid writing my own function altogether and use data.table on this one. You can do what you want in just a couple of lines, and I wouldn't be surprised if there were a way to golf it down even further.
Given your same data, we create a data.table object:
library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
Next, we shift our values by 1 (lag) within each userID:
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
And finally, we cumsum those by userID, and then, within each userID x questionID group, replace Expert with the group minimum. This stops rows of the same question from counting each other's values; for a user's first question that minimum is 0, because we used fill = 0 in shift above before the cumsum:
dt[, Expert := cumsum(lastVal), by = .(userID)][
  , Expert := min(Expert), by = .(userID, questionID)]
So, putting that all together, we have:
library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
dt[, Expert := cumsum(lastVal), by = .(userID)][
  , Expert := min(Expert), by = .(userID, questionID)]
dt
questionID userID Value lastVal Expert
1: 1 101 10 0 0
2: 1 101 20 10 0
3: 3 102 30 0 0
4: 4 101 40 20 30
5: 5 102 50 30 30
6: 5 101 10 40 70
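If you don't want to keep the intermediate lag column around, you can drop it by reference once Expert is computed:

# remove the helper column created by shift()
dt[, lastVal := NULL]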
Using dplyr and purrr::map_dbl, one approach is to group_by userID and, for each row, sum Value over every questionID less than the current one.
library(dplyr)
df1 %>%
group_by(userID) %>%
mutate(Expert = purrr::map_dbl(questionID, ~sum(Value[questionID < .x])))
# questionID userID Value Expert
# <dbl> <dbl> <dbl> <dbl>
#1 1 101 10 0
#2 1 101 20 0
#3 3 102 30 0
#4 4 101 40 30
#5 5 102 50 30
#6 5 101 10 70
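For comparison, a base R sketch of the same per-row logic, without grouping (it scans the whole data frame for each row, so it is simple rather than fast):

# for each row, sum Value over the same user's earlier (lower-numbered) questions
df1$Expert <- mapply(function(q, u) sum(df1$Value[df1$userID == u & df1$questionID < q]),
                     df1$questionID, df1$userID)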
I need to update one column in a data frame where the empID matches that of another data frame.
For example:
df1:
empID salary
    1  10000
    2  15000
    3      0

df2:
empID salary2
    1   10000
    2   15000
    3   20000
Where df1$salary is 0, I want to fill it in from df2, matching rows on empID.
I tried this but received a "No such column: salary2" error:
df1$salary <- ifelse(df1$salary == 0,sqldf("UPDATE df1 SET salary = salary2 WHERE df1.empID = df2.empID"),df1$salary)
Here is another option with merge:
transform(merge(df1, df2, by = 'empID'),
salary = replace(salary, salary == 0, salary2[salary == 0]),
salary2 = NULL)
# empID salary
#1 1 10000
#2 2 15000
#3 3 20000
You can also use ifelse instead of replace for salary, i.e.
salary = ifelse(salary == 0, salary2, salary)
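Putting that together, the full call would read:

transform(merge(df1, df2, by = 'empID'),
          salary = ifelse(salary == 0, salary2, salary),
          salary2 = NULL)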
We could do
# find the rows in df1 where salary is 0
inds <- which(df1$salary == 0)
# match those rows' empID against df2 and pull the corresponding salary2
df1$salary[inds] <- df2$salary2[match(df1$empID[inds], df2$empID)]
df1
# empID salary
#1 1 10000
#2 2 15000
#3 3 20000
This should also work if you have multiple entries with 0 in df1.
We can do the same using ifelse:
df1$salary <- ifelse(df1$salary == 0, df2$salary2[match(df1$empID, df2$empID)], df1$salary)
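For completeness, the same update sketched with dplyr (this assumes every zero-salary empID in df1 has a match in df2, so the left join fills in a salary2 wherever it is needed):

library(dplyr)
df1 <- df1 %>%
  left_join(df2, by = "empID") %>%
  mutate(salary = if_else(salary == 0, salary2, salary)) %>%
  select(-salary2)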
I have a dataframe with 3 variables: ID, VAR1 and VAR2. VAR1 and VAR2 contain multiple values that can be broken down into rows. I would like to turn the VAR1 entries into headers and link each VAR2 value to the correct entry of VAR1. My data looks like this:
ID  VAR1                               VAR2
 1  Code Employee number Personal ID   132 12345 12452
 2  Employee number Personal ID        32145 13452
 3  Code Employee number               444 56743
 4  Code Employee number Personal ID   546 89642 14667
I would like to obtain:
ID  Code  Employee number  Personal ID
 1  132   12345            12452
 2        32145            13452
 3  444   56743
 4  546   89642            14667
Here's a tidyverse approach.
First you need to update the values that represent your future column names, as R doesn't like spaces in column names.
# example dataset
df = data.frame(ID = 1:2,
VAR1 = c("Code Employee number Personal ID", "Employee number Personal ID"),
VAR2 = c("132 12345 12452", "32145 13452"))
library(tidyverse)
df %>%
  mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
         VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
  separate_rows(VAR1, VAR2) %>%
  spread(VAR1, VAR2)
#   ID Code EmployeeNumber PersonalID
# 1  1  132          12345      12452
# 2  2 <NA>          32145      13452
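Since spread() has been superseded in later tidyr versions, the same pipeline can be sketched with pivot_wider (assuming tidyr >= 1.0):

df %>%
  mutate(VAR1 = gsub("Personal ID", "PersonalID", VAR1),
         VAR1 = gsub("Employee number", "EmployeeNumber", VAR1)) %>%
  separate_rows(VAR1, VAR2) %>%
  pivot_wider(names_from = VAR1, values_from = VAR2)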
I am very new to R and sqldf and can't seem to solve a basic problem. I have a file with transactions where each row represents a product purchased.
The file looks like this:
customer_id,order_number,order_date, amount, product_name
1, 202, 21/04/2015, 58, "xlfd"
1, 275, 16/08/2015, 74, "ghb"
1, 275, 16/08/2015, 36, "fjk"
2, 987, 12/03/2015, 27, "xlgm"
3, 376, 16/05/2015, 98, "fgt"
3, 368, 30/07/2015, 46, "ade"
I need to find the maximum amount spent in a single transaction (same order_number) by each customer_id. For example, in the case of customer_id 1 it would be (74+36) = 110.
In case sqldf is not a strict requirement, and considering your input is in dft, you can try:
require(dplyr)
require(magrittr)
dft %>%
group_by(customer_id, order_number) %>%
summarise(amt = sum(amount)) %>%
group_by(customer_id) %>%
summarise(max_amt = max(amt))
which gives:
Source: local data frame [3 x 2]
Groups: customer_id [3]
customer_id max_amt
<int> <int>
1 1 110
2 2 27
3 3 98
Assuming the dataframe is named orders, the following computes the total amount of each order per customer:
sqldf("select customer_id, order_number, sum(amount)
from orders
group by customer_id, order_number")
Update: using a nested query, the following gives the desired output:
sqldf("select customer_id, max(total)
from (select customer_id, order_number, sum(amount) as total
from orders
group by customer_id, order_number)
group by customer_id")
Output:
customer_id max(total)
1 1 110
2 2 27
3 3 98
We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'customer_id' and 'order_number', get the sum of 'amount', then group again by 'customer_id' and take the max of 'Sumamount':
library(data.table)
setDT(df1)[, .(Sumamount = sum(amount)) , .(customer_id, order_number)
][,.(MaxAmount = max(Sumamount)) , customer_id]
# customer_id MaxAmount
#1: 1 110
#2: 2 27
#3: 3 98
Or, making it more compact: after grouping by 'customer_id', we split 'amount' by 'order_number', loop through the list to get each sum, and take the max as 'MaxAmount':
setDT(df1)[, .(MaxAmount = max(unlist(lapply(split(amount,
order_number), sum)))), customer_id]
# customer_id MaxAmount
#1: 1 110
#2: 2 27
#3: 3 98
Or using base R:
aggregate(amount ~ customer_id,
          aggregate(amount ~ customer_id + order_number, df1, sum),
          FUN = max)
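Yet another base R sketch: tapply can build a customer-by-order matrix of order totals, and the row maxima give the answer (na.rm drops the customer/order combinations that never occur):

# sum amount per (customer_id, order_number), then take each customer's max
totals <- tapply(df1$amount, list(df1$customer_id, df1$order_number), sum)
apply(totals, 1, max, na.rm = TRUE)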
I want to select all the groupings that contain at least one of the elements I am interested in. I was able to do this by creating an intermediate array, but I am looking for something simpler and faster: my actual data set has over 1M rows (and 20 columns), so I am not sure I will have enough memory for an intermediate array, and more importantly the method below takes a long time on the original file.
Here's my code and data:
a) Data
dput(Data_File)
structure(list(Group_ID = c(123, 123, 123, 123, 234, 345, 444,
444), Product_Name = c("ABCD", "EFGH", "XYZ1", "Z123", "ABCD",
"EFGH", "ABCD", "ABCD"), Qty = c(2, 3, 4, 5, 6, 7, 8, 9)), .Names = c("Group_ID",
"Product_Name", "Qty"), row.names = c(NA, 8L), class = "data.frame")
b) Code: I want to select every Group_ID that has at least one Product_Name = "ABCD"
#Find out transactions
Data_T <- Data_File %>%
group_by(Group_ID) %>%
dplyr::filter(Product_Name == "ABCD") %>%
select(Group_ID) %>%
distinct()
#Now filter them
Filtered_T <- Data_File %>%
group_by(Group_ID) %>%
dplyr::filter(Group_ID %in% Data_T$Group_ID)
c) Expected output is
Group_ID Product_Name Qty
<dbl> <chr> <dbl>
123 ABCD 2
123 EFGH 3
123 XYZ1 4
123 Z123 5
234 ABCD 6
444 ABCD 8
444 ABCD 9
I've been struggling with this for over 3 hours now. I looked at the thread auto-suggested by SO, Select rows with at least two conditions from all conditions, but my question is very different.
I would do it like this:
Data_File %>% group_by(Group_ID) %>%
filter(any(Product_Name %in% "ABCD"))
# Source: local data frame [7 x 3]
# Groups: Group_ID [3]
#
# Group_ID Product_Name Qty
# <dbl> <chr> <dbl>
# 1 123 ABCD 2
# 2 123 EFGH 3
# 3 123 XYZ1 4
# 4 123 Z123 5
# 5 234 ABCD 6
# 6 444 ABCD 8
# 7 444 ABCD 9
Explanation: any() will return TRUE if there are any rows (within the group) that match the condition. The length-1 result will then be recycled to the full length of the group and the entire group will be kept. You could also do it with sum(Product_name %in% "ABCD") > 0 as the condition, but the any reads very nicely. Use sum instead if you wanted a more complicated condition, like 3 or more matching product names.
I prefer %in% to == for things like this because it behaves better with NA, and it is easy to expand if you want to check for any of several products per group.
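For instance, a sketch of that expansion, keeping every group that contains any of several products (the product list here is illustrative):

# keep whole groups containing at least one of the products of interest
Data_File %>%
  group_by(Group_ID) %>%
  filter(any(Product_Name %in% c("ABCD", "EFGH")))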
If speed and efficiency are an issue, data.table will be faster. I would do it like this, which relies on a keyed join for the filtering and uses no non-data.table operations, so it should be very fast:
library(data.table)
dt = as.data.table(Data_File)
setkey(dt)
groups = unique(subset(dt, Product_Name %in% "ABCD", Group_ID))
dt[groups, nomatch = 0]
# Group_ID Product_Name Qty
# 1: 123 ABCD 2
# 2: 123 EFGH 3
# 3: 123 XYZ1 4
# 4: 123 Z123 5
# 5: 234 ABCD 6
# 6: 444 ABCD 8
# 7: 444 ABCD 9
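As an alternative sketch without keys, the same filter can be written with .SD, mirroring the dplyr version above (note the result puts Group_ID first):

# keep whole groups in which any row matches the product of interest
dt[, .SD[any(Product_Name %in% "ABCD")], by = Group_ID]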