Aggregating the data in different columns based on condition in R

order_id customer_id extension_of quantity cost duration
1 123 srujan NA 1 100 30
2 456 teja NA 1 100 30
3 789 srujan 123 1 100 30
I have sample data containing order information. I need to summarize the data (sum of cost, etc.) for rows where the value of extension_of matches an order_id.

I take it you want to add rows #1 and #3 as #1 is like the parent order and #3 is its extension. Here's one way you can do it (assuming dft is your data frame):
library(purrr)
dft$parent_id <- map2_dbl(dft$order_id, dft$extension_of, function(order_id, extension_of) if(is.na(extension_of)) order_id else extension_of)
aggregate(cost~parent_id, data=dft, FUN=sum)
This cannot handle recursion (i.e. an extension that is itself extended), but it does sum up one level.
If you need several levels, you can go for something like this:
library(purrr)
find_root <- function(order_id, extension_of){
  if(is.na(extension_of)){
    return(order_id)
  }else{
    parent_ext <- dft$extension_of[dft$order_id==extension_of]
    return(find_root(extension_of, parent_ext))
  }
}
dft$parent_id <- map2_dbl(dft$order_id, dft$extension_of, find_root)
aggregate(cost~parent_id, data=dft, FUN=sum) # same as before
(Note that it looks up the parent's parent, if necessary.) One can certainly optimize for performance if really needed; the principle will stay the same, however.
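For reference, the same multi-level lookup can also be written without recursion or purrr. The following is only a sketch in base R: the data frame mirrors the sample from the question, the fourth order (900, extending 789) is a made-up row added to show a second level, and the code assumes there are no circular extensions.

```r
# Sample data from the question plus one hypothetical second-level extension (900)
dft <- data.frame(
  order_id     = c(123, 456, 789, 900),
  extension_of = c(NA, NA, 123, 789),
  quantity     = c(1, 1, 1, 1),
  cost         = c(100, 100, 100, 100)
)

# Start with the direct parent (or the order itself if it is not an extension)
parent <- ifelse(is.na(dft$extension_of), dft$order_id, dft$extension_of)

# Walk up one level per pass; nrow(dft) passes are enough if there are no cycles
for (i in seq_len(nrow(dft))) {
  up   <- dft$extension_of[match(parent, dft$order_id)]
  nxt  <- ifelse(is.na(up), parent, up)
  if (identical(nxt, parent)) break
  parent <- nxt
}
dft$parent_id <- parent

# Sum several columns at once per root order
totals <- aggregate(cbind(quantity, cost) ~ parent_id, data = dft, FUN = sum)
totals
```

Using cbind() on the left-hand side of the formula is one way to cover the "sum of cost etc." part of the question in a single aggregate() call.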

Related

How to show the rows that have the same variable?

I would like to show the departments that use the same vendor, identified by the vendor code, in a very big dataset. I guess I will need a loop for that, but I am not really sure how to start.
For example, for each vendor code I want to see all the departments that use it, but only if it is used by 3 or more departments.
see the sample of data here
Here's a base R solution.
# get the repeated values
dat_tb <- table(dat$vendor_code)
# select for the condition and print from the whole data set
dat[ dat$vendor_code %in% names(dat_tb[ dat_tb > 2 ]), ]
vendor_code department
2 9966 dept2
3 9966 dept3
8 9966 dept8
9 9966 dept9
Data:
dat <- data.frame( vendor_code=rep(c(3344,9966,9966,3444,5566,3388),2),
department=paste0("dept",1:12))
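The table() approach above counts rows per vendor code, which equals the number of departments only if each department appears once per vendor. If a department can repeat, a variant that counts distinct departments might look like the sketch below; it reuses the same example data, and the names n_dept, keep and by_vendor are only illustrative.

```r
dat <- data.frame(vendor_code = rep(c(3344, 9966, 9966, 3444, 5566, 3388), 2),
                  department  = paste0("dept", 1:12),
                  stringsAsFactors = FALSE)

# Number of distinct departments per vendor code
n_dept <- tapply(dat$department, dat$vendor_code, function(d) length(unique(d)))

# Vendor codes used by 3 or more departments
keep <- names(n_dept)[n_dept >= 3]

# One character vector of departments per qualifying vendor code
sel <- dat$vendor_code %in% keep
by_vendor <- split(dat$department[sel], dat$vendor_code[sel])
by_vendor
```

The result is a named list, so by_vendor[["9966"]] gives the departments for that vendor directly, and no explicit loop is needed.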

Manipulating with rows where data needs to match with a partial string of a column and one other condition

I have a dataset where I need to filter a specific data. In order to do this I need my filter to match two conditions:
1) needs to match a partial string
2) a value of one column needs to be more than 1
The dataset needs to stay exactly the same: I could do this in two steps with the grep function, but then I would get a new dataset containing only the rows I want to change, and I do not want that.
my data looks like this:
name nr_item price content end_nr_item
MINI HVLP Spritzpistole 1 20,16 1 1
LED G4 G9 MR16 GU10 1 13,09 6 6
Trennscheiben Ø115 Ø125 Ø230 2 12,53 30 60
Trennscheiben Ø115 Ø125 Ø230 2 12,53 1 2
LED G4 G9 MR16 GU10 3 35,49 20 60
Trennscheiben Ø115 Ø125 Ø230 1 10,18 4 4
I want to select the rows where data$name contains "Trennscheiben" and data$content > 1.
What I have managed to program so far (it does not work):
for(i in 1:length(data$content)){
if(data[grep("Trennscheiben", data$name[i]), ] & data$content[i] > 1){
data$end_nr_item[i] <- data$end_nr_item[i] / 10
}
}
I am stuck at this point... I would appreciate the help.
If you are looking for an answer added to your data frame then I would suggest adding a logical vector as a new column.
something like:
library('stringr')
data$answer_to_my_filter_question <- str_detect(data$name,'Trennscheiben')== TRUE & data$content > 1
This will return a logical vector (TRUE for the rows that meet both conditions).
You could do the same thing using the mutate() function in the dplyr package:
library('dplyr')
library('stringr')
data <- mutate(data, answer_to_my_filter_question = (str_detect(name,'Trennscheiben') == TRUE & content > 1))
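If the goal is the in-place update from the question's loop (dividing end_nr_item by 10 for the matching rows), the same logical vector can be used as a row index. This is a sketch in base R under some assumptions: the data frame is a cut-down copy of the sample with ASCII-only names and without the price column, and the division by 10 is taken from the loop in the question.

```r
data <- data.frame(
  name        = c("MINI HVLP Spritzpistole", "LED G4 G9 MR16 GU10",
                  "Trennscheiben 115 125 230", "Trennscheiben 115 125 230",
                  "LED G4 G9 MR16 GU10", "Trennscheiben 115 125 230"),
  nr_item     = c(1, 1, 2, 2, 3, 1),
  content     = c(1, 6, 30, 1, 20, 4),
  end_nr_item = c(1, 6, 60, 2, 60, 4),
  stringsAsFactors = FALSE
)

# TRUE where the name contains the substring AND content is greater than 1
hit <- grepl("Trennscheiben", data$name, fixed = TRUE) & data$content > 1

# Modify only the matching rows; all other rows and columns stay untouched
data$end_nr_item[hit] <- data$end_nr_item[hit] / 10
data
```

Because the assignment is indexed by hit, the data frame keeps its original shape and row order, which is what the question asks for.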

merging multiple dataframes with duplicate rows in R

Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
if(!exists("MergedData")){
MergedData <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(MergedData) <- c("id", "name", file)
}
else if(exists("MergedData")){
temp_data <- read.csv(file, skip=5)[ ,c("id", "name", "value")]
colnames(temp_data) <- c("id", "name", file)
MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
rm(temp_data)
}
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with Aggregation function missing: defaulting to length warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2,2),
name=rep(c("A","B"), 2),
source.file=rep(c("001","002"), each=2),
value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
mutate(counter=1:n())
As you can see, counter is 1 wherever a combination of id, name, and source.file occurs only once, but takes the values 1 and 2 for the one combination that has two rows (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).
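The question asked for keeping only the first value and ignoring further duplicates. One possible way to do that, sketched here in base R on the same fake data, is to drop repeated id/name/source.file combinations with duplicated() before going wide. The reshape() call is used only to keep the example free of extra packages; after the de-duplication step, dcast() should work as well without needing to aggregate.

```r
df <- data.frame(id = rep(1:2, 2),
                 name = rep(c("A", "B"), 2),
                 source.file = rep(c("001", "002"), each = 2),
                 value = 11:14,
                 stringsAsFactors = FALSE)
# The extra duplicate row from the example above
df <- rbind(df, data.frame(id = 1, name = "A", source.file = "002", value = 50,
                           stringsAsFactors = FALSE))

# Keep only the first row for each id/name/source.file combination
first_only <- df[!duplicated(df[c("id", "name", "source.file")]), ]

# Long to wide: one value column per source file
wide <- reshape(first_only,
                idvar = c("id", "name"),
                timevar = "source.file",
                v.names = "value",
                direction = "wide")
wide
```

Here the duplicate value 50 is discarded and 13 (the first entry for id 1, name "A", file "002") is kept.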

Generate fixed length random id by year as character

I would like to create a random id with a fixed length of 8.
Here is sample data:
x <- data.frame(id=c(1,1,1,2,2,3,3,3,3,4,4), year=c(2001,2001,2001,2010,2010,2002,2002,2002,2002,2005,2005),x=seq(0,0.1,0.01))
My attempt:
x$new.id <- ave(x$id, x$year, FUN = function(x) rnorm(x,90000000,100000))
The randomly generated new.id should be the same for a given id and year.
There must be a simple solution, yet I cannot find one. Thanks.
EDIT: Or otherwise, how to create a new 8-digit id for a given number of rows.
Desired output: the column new.id should be of class character
id year new.id
1 1 2001 89957391
2 1 2001 89957391
3 1 2001 89957391
4 2 2010 90331214
5 2 2010 90331214
6 3 2002 89995435
7 3 2002 89995435
8 3 2002 89995435
9 3 2002 89995435
10 4 2005 90058279
11 4 2005 90058279
You were pretty close with your approach (using ave in that manner), though if you want to generate only one value per group, you should pass 1 as rnorm's n parameter.
The biggest problem, as I see it, is that you want to generate a random integer (and then convert it to character), while rnorm returns a double by definition.
So you could potentially do this (using round, floor or ceiling):
transform(x, new.id = ave(id,
year,
FUN = function(x) as.character(round(rnorm(1, 9e7, 1e5)))))
But it seems to me that a more appropriate way would be to use sample instead:
indx <- 1e7:(1e8 - 1)
transform(x, new.id = ave(id, year, FUN = function(x) as.character(sample(indx, 1))))
Edit: Now that I think about it a little more, it is possible that for a large enough data set you will get duplicated new.ids, because sample is called independently for each group. It seems to me that the best solution would be to first create a data set with a new index for each id, generated by a single sample call, and then merge it back into the data set. This operation is probably best done using the data.table package (because of its efficient joins and the ability to add just a single column while joining); something like the following should work:
library(data.table)
y <- data.table(id = unique(x$id),
new.id = as.character(sample(indx, length(unique(x$id)))))
setkey(setDT(x), id) ; setkey(y, id)
x[y, new.id := i.new.id]
This will update your original data set by reference (without the need for a <- assignment). You can convert back to a data.frame (if you wish) by simply calling setDF(x).
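The same idea (draw all codes in one sample call, then map them back to the rows) can also be sketched in base R with match(). This is only an illustration: the names ids and codes are made up here, the seed is set purely for reproducibility, and sprintf() is used instead of as.character() so that round numbers cannot end up in scientific notation.

```r
x <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
                year = c(2001, 2001, 2001, 2010, 2010, 2002, 2002, 2002, 2002, 2005, 2005),
                x = seq(0, 0.1, 0.01))

set.seed(42)  # only so the example is reproducible

ids <- unique(x$id)

# One draw without replacement, shifted into the 8-digit range 10000000..99999999
codes <- sample.int(90000000L, length(ids)) + 9999999L

# Look up each row's id and store the code as a character string
x$new.id <- sprintf("%d", codes)[match(x$id, ids)]
x
```

Sampling without replacement in a single call guarantees that two different ids never share a new.id.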

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previous, well-answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (as given by DOV) was less than 180 days from their diagnosis (DOD). I have the following, using the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs have their first visit within 180 days of their diagnosis. I could then go through my original test data frame and make an indicator variable, but I'd have thought I should be able to do this in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. I have been trying with aggregate but getting errors like 'by has to be a list', then 'it's not the same length', and with the formula method of input, 'cbind(DOV,DOD) ~ ID', I'm stumped...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
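A more compact base R variant, given here only as a sketch, converts the dates once and lets ave() broadcast a per-ID flag to every row. It assumes DOD is constant within each ID, so the smallest DOV - DOD gap corresponds to the first visit; the column name within_180 is made up for the example.

```r
test <- data.frame(
  ID  = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = c(rep("2000-03-01", 3), rep("2002-05-01", 4), rep("2006-09-01", 2)),
  DOV = c("2000-03-05", "2000-06-05", "2000-09-05",
          "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
          "2006-10-03", "2007-02-05"),
  stringsAsFactors = FALSE
)

# Days between diagnosis and each visit
gap <- as.numeric(as.Date(test$DOV) - as.Date(test$DOD))

# One flag per ID (1 if the earliest visit is within 180 days), repeated for every row of that ID
test$within_180 <- ave(gap, test$ID, FUN = function(d) as.numeric(min(d) < 180))
test$within_180
```

Using min() rather than the first element means the rows do not have to be sorted by visit date within each ID.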
