Merge by multiple criteria and split out duplicates into separate columns?

I'm quite sure this has been asked and answered at some point but I'm a novice and really lack the vocabulary to effectively find the question and solution. I have a simple task that I can't perform in Excel because of the internal memory limitations, but I don't know enough about SQL or R to figure out how to do it in either of those platforms.
I have two tables, one has unique entries with unique ID numbers, the other has multiple duplicates of those ID numbers, each showing a different number (representing each new salary over the course of a career). I'm trying to map each of the salaries to the original unique ID table, creating new columns for every possible change (Salary1:Salary50). Ultimately I'll also need to map on the dates and differences of each change for analysis. Here's an example:
This is the unique ID table:
Table 1
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 ? ? ? ? ?
2 ? ? ? ? ?
3 ? ? ? ? ?
4 ? ? ? ? ?
5 ? ? ? ? ?
Here's the salary table with duplicate IDs and the info I want:
Table2
ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016
And the end state should look like this:
Table3
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 10 11 12 0 0
2 12 13 0 0 0
3 10 0 0 0 0
4 10 12 14 0 0
5 10 0 0 0 0
I built a multiple-criteria VLOOKUP to pull everything into the right columns, but the dataset has well over 100,000 rows to check, so Excel runs out of memory before it can finish. Can anyone advise on how I can do the same thing in Access, R, SPSS, or with some efficient Excel VBA code?
Thanks for any help!

I have no idea what a "Vlookup" is, but apparently you are looking for something like this:
DF <- read.table(text = "ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016", header = TRUE)
# years of employment, assuming the table is sorted by date within each ID
DF$y <- ave(DF$ID, DF$ID, FUN = seq_along)
#reshape
library(reshape2)
dcast(DF, ID ~ y, value.var = "Salary", fill = 0)
# ID 1 2 3
#1 1 10 11 12
#2 2 12 13 0
#3 3 10 0 0
#4 4 10 12 14
#5 5 10 0 0
Note that this is not a very useful data format in R. Your original data format seems much more useful for further analyses.
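For reference, here is a sketch of the same reshape with the newer tidyr/dplyr tooling (reshape2 is retired; this assumes tidyr >= 1.1 for the scalar values_fill):
library(dplyr)
library(tidyr)
DF %>%
  group_by(ID) %>%
  mutate(y = row_number()) %>%   # same running count per ID as above
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = y, names_prefix = "Salary",
              values_from = Salary, values_fill = 0)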

Assume that the IDs in Table1 are a subset of the IDs in Table2 and that we want just those. Also, we want the first salary for any ID in the Salary1 result column, the second salary in the Salary2 result column, and so on. First compute Seq, which is 1 for the first date within any ID, 2 for the second, and so on. Then create a factor out of those sequence numbers whose levels are labelled by the Salary columns of Table1. In the last statement, subset Table2 to the ID values of Table1 (for the data shown they are the same, so it has no effect) and reshape from long to wide form using xtabs. No packages are used.
Seq <- ave(1:nrow(Table2), Table2$ID, FUN = seq_along)
Table0 <- Table1[-1] # Table0 is Table1 without ID column
Table2$SalaryNo <- factor(Seq, levels = 1:ncol(Table0), labels = colnames(Table0))
xtabs(Salary ~ ID + SalaryNo, data = subset(Table2, ID %in% Table1$ID))
giving:
SalaryNo
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 10 11 12 0 0
2 12 13 0 0 0
3 10 0 0 0 0
4 10 12 14 0 0
5 10 0 0 0 0
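If a plain data frame is preferred over the table object that xtabs returns, it can be converted afterwards (a sketch):
tab <- xtabs(Salary ~ ID + SalaryNo, data = subset(Table2, ID %in% Table1$ID))
Table3 <- as.data.frame.matrix(tab)                  # columns Salary1..Salary5
Table3 <- cbind(ID = as.integer(rownames(Table3)), Table3)
rownames(Table3) <- NULL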
Note: The tables were not provided in reproducible form and the solution may depend on specifically what they are so we have assumed this:
Lines1 <- "
ID Salary1 Salary2 Salary3 Salary4 Salary5
1 ? ? ? ? ?
2 ? ? ? ? ?
3 ? ? ? ? ?
4 ? ? ? ? ?
5 ? ? ? ? ?"
Table1 <- read.table(text = Lines1, header = TRUE)
Lines2 <- "
ID Salary SalaryDate
1 10 1/1/2014
1 11 1/1/2015
1 12 1/1/2016
2 12 1/1/2015
2 13 1/1/2016
3 10 1/1/2016
4 10 1/1/2014
4 12 1/1/2015
4 14 1/1/2016
5 10 1/1/2016"
Table2 <- read.table(text = Lines2, header = TRUE)
Update: Changed assumptions and code correspondingly. Also fixed an error (that did not affect the data shown but could affect other data).

Related

How to add columns from another data frame where there are multiple matching rows

I'm new to R and I'm stuck.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. Original DF has more columns but they are not relevant for the example.
Person_ID curriculum_ID School_ID
1         100           10
2         100           10
2         200           10
3         300           12
4         100           10
4         200           12
Occupations
Not all graduates have jobs. Everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns but they are not relevant currently.
Person_ID JOB_ID JOB_Type
1         1223   1
3         3334   1
3         2122   0
3         7843   0
4         4522   0
4         1240   1
End result:
New DF named "Result" containing the information of all graduations from the first DF(Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains but added columns remain empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that person "4" has both multiple jobs and multiple graduations, so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID Curriculum_ID School_ID JOB_ID JOB_Type
1         100           10        1223   1
2         100           10
2         200           10
3         300           12        3334   1
3         300           12        2122   0
3         300           12        7843   0
4         100           10        4522   0
4         100           10        1240   1
4         200           12        4522   0
4         200           12        1240   1
For me the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one. Probably I did not use the right keywords.
I will be very grateful if you could give me examples of how to code it.
You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join from dplyr (by default it joins on all shared column names, here Person_ID):
library(dplyr)
left_join(Graduations, Occupations, by = "Person_ID")
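As a side note, with dplyr 1.1.0 or later this join produces a many-to-many warning, because person 4 is duplicated in both data frames; if that relationship is intended, it can be declared explicitly (a sketch):
library(dplyr)
Result <- left_join(Graduations, Occupations, by = "Person_ID",
                    relationship = "many-to-many")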

How to subset a dataframe by name, by column index and by dropping columns at the same time

I have the dataframe below:
product<-c("ab","ab","ab","ac","ac","ac")
shop<-c("sad","sad","sad","sadas","fghj","xzzv")
week<-c(31,31,32)
category<-c("a","a","a","b","b","b")
tempr<-c(35,35,14,24,14,5)
value<-c(0,0,-6,8,4,0)
store<-data.frame(product,shop,category,tempr,value)
which looks like this:
product shop category tempr value
ab sad a 35 0
ab sad a 35 0
ab sad a 14 -6
ac sadas b 24 8
ac fghj b 14 4
ac xzzv b 5 0
My question is probably ill-posed, but I want to subset this data frame by selecting the second column by name, then dropping certain columns (1, 3, 4) by index number, and then selecting the rest of the columns by index numbers without knowing the upper limit. Something like:
store2<-store2[,input$lux2,-c(1,3),4:]
Let's say that I drop columns 1 and 3, keep a column by passing "shop" from my selectInput, and then I want all the columns that are left. This would be the result:
shop tempr value
sad 35 0
sad 35 0
sad 14 -6
sadas 24 8
fghj 14 4
xzzv 5 0
You can't mix selection by column name, negative selection, and selection by column index in a single base R subset.
You have to decide whether to use names or index numbers.
Something like this would work:
colByName = "shop"
removeByInd = c(1,3)
fromNtoEnd = 4
ind <- setdiff(
  c(match(colByName, names(store)), fromNtoEnd:ncol(store)),
  removeByInd
)
store2 <- store[, ind]
# shop tempr value
#1 sad 35 0
#2 sad 35 0
#3 sad 14 -6
#4 sadas 24 8
#5 fghj 14 4
#6 xzzv 5 0
If you consider using dplyr you can use:
store %>% select(-c(1,3),"shop",4:ncol(.))
which is very close to what you imagined in your question.
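Since the rules for mixing positive and negative selections have varied across dplyr/tidyselect versions, a more version-robust sketch does the two steps separately:
library(dplyr)
store %>%
  select(-c(1, 3)) %>%                  # drop columns 1 and 3 first
  select(all_of("shop"), everything())  # then put the kept column first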
I drop out columns 1 and 3 keep column by giving the "shop" from my
selectInput and then I want all the columns left.
As I understand from your explanation and your output, whether you use a selectInput or do not specify anything, those columns will show up in the list. So we only need to drop the columns you do not need:
# dropping columns 1 and 3
store[, -c(1, 3)]
and the result is:
shop tempr value
sad 35 0
sad 35 0
sad 14 -6
sadas 24 8
fghj 14 4
xzzv 5 0

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as:
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
The calculation should work as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (month 1) and February (month 2), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
library(dplyr)
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to there being no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments appropriately.
For more information on the dplyr package, follow this link.
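For a self-contained test of the above, the posted table can be read in like this (a sketch; values copied from the question):
data <- read.table(text = "KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0", header = TRUE)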

How to lookup value higher in data.frame in R

I have a time series data.frame where all values are below each other. But on every date there are more cases that come back regularly. Based on the time series I am adding a column with some calculations. These calculations are done case-specifically, but for them I need the value of the previous date of that case. I have no idea which function to use. Can anybody point me to a function or an example somewhere on the net? Thanks!!
To be clear, this is what I mean. On date 1 the old value (before the score) for case 'a' is 1200. Based on the score of 1, the new value becomes 1250. On date 2, I want this new value 1250 placed in the column 'old_value' (and then do some calculations to arrive at the new value; that new value has to be placed in the old_value column again on date 4 or so, et cetera). For case 'b' the same: the new value after the score on date 1 is 1190 and has to be placed in the correct row on date 3 (on date 2 there is no case 'b'), et cetera, for thousands of cases and dates.
date name_case score old_value new_value
1 a 1 1200 1250
1 b 2 1275 1190
1 c 1 1300 1310
2 a 3 1250
2 c 1 1310
3 B 1 1190
Maybe this will do it. Assuming that we start with:
> dat
  date name_case score old_value new_value
1    1         a     1      1200      1250
2    1         b     2      1275      1190
3    1         c     1      1300      1310
4    2         a     3        NA        NA
5    2         c     1        NA        NA
6    3         b     1        NA        NA  # note: fixed the capitalisation ("B" -> "b")
And then make a subset of the rows that already have old and new values:
dat1 <- dat[ !is.na(dat$old_value), ]
And then replace the NA old_values with the results from new_values in the subset by matching on name_case with match:
# look up each still-missing row's case in the subset that has new values
dat[ is.na(dat$old_value), "old_value" ] <-
  dat1$new_value[ match(dat[ is.na(dat$old_value), "name_case" ],
                        dat1$name_case) ]
match generates a numeric vector that is used to index the new_values.
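For illustration, a minimal example of the index vector match() produces:
match(c("a", "c", "b"), c("a", "b", "c"))
# [1] 1 3 2   -- position of each query value in the lookup vector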
> dat
  date name_case score old_value new_value
1    1         a     1      1200      1250
2    1         b     2      1275      1190
3    1         c     1      1300      1310
4    2         a     3      1250        NA
5    2         c     1      1310        NA
6    3         b     1      1190        NA

Merging columns from different data frames

I have a problem....
I have two data frames
>anna1
name from to result
11 66607 66841 0
11 66846 67048 0
11 67053 67404 0
11 67409 68216 0
11 68221 68786 0
11 68791 69020 0
11 69025 69289 0
11 69294 70167 0
11 70172 70560 0
and the second data frame is
>anna2
name from to result
11 66607 66841 5
11 66846 67048 6
11 67409 68216 7
11 69025 69289 12
11 70172 70560 45
What I want is to create a new data frame, similar to anna1, where all the 0 values are replaced by the correct results from the correct rows of anna2.
You will notice that in anna2 the from and to columns share only some of their values with the respective columns in anna1; the intermediate rows are missing.
So I need to somehow take the numbers from the result column of anna2 and put them in the correct rows of anna1.
thank you in advance
Best regards
Anna
A simpler merge:
anna3 <- merge(anna2, anna1[, 1:3], all.y = TRUE)
anna3[is.na(anna3)] <- 0
Gives:
> anna3
name from to result
1 11 66607 66841 5
2 11 66846 67048 6
3 11 67053 67404 0
4 11 67409 68216 7
5 11 68221 68786 0
6 11 68791 69020 0
7 11 69025 69289 12
8 11 69294 70167 0
9 11 70172 70560 45
If the "from" column is guaranteed to be unique in both anna1 and anna2, AND every row in anna2 has a matching row in anna1 (though not vice versa), a simple solution is
row.index = function(d) which(anna1$from == d)[1]
indices = sapply(anna2$from, row.index)
anna1$result[indices] = anna2$result
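Under the same uniqueness assumptions, the sapply loop can be collapsed into one vectorised step with match() (a sketch):
# for each anna2$from, find the first matching row in anna1 and copy the result
anna1$result[match(anna2$from, anna1$from)] <- anna2$result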
Another approach
require(plyr)
anna <- rbind(anna1, anna2)
ddply(anna, .(name, from, to), summarize, result = sum(result))
EDIT. If the data frames are large, and speed is an issue, think of using data.table
require(data.table)
data.table(anna)[,list(result = sum(result)),'name, from, to']
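For completeness, here is the same aggregation in dplyr (a sketch, assuming dplyr >= 1.0 for the .groups argument):
library(dplyr)
rbind(anna1, anna2) %>%
  group_by(name, from, to) %>%
  summarise(result = sum(result), .groups = "drop")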
You can use merge, but you have to explicitly specify what should be done with the two result columns.
d <- merge(anna1, anna2, by=c("name", "from", "to"), all=TRUE)
d$result <- ifelse(d$result.x == 0 & !is.na( d$result.y ), d$result.y, d$result.x)
d <- d[,c("name", "from", "to", "result")]
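For reference, reproducible versions of the two inputs (copied from the question):
anna1 <- read.table(header = TRUE, text = "name from to result
11 66607 66841 0
11 66846 67048 0
11 67053 67404 0
11 67409 68216 0
11 68221 68786 0
11 68791 69020 0
11 69025 69289 0
11 69294 70167 0
11 70172 70560 0")
anna2 <- read.table(header = TRUE, text = "name from to result
11 66607 66841 5
11 66846 67048 6
11 67409 68216 7
11 69025 69289 12
11 70172 70560 45")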
