I am trying to write a function that merges based on two columns both found in two dataframes. One of the columns is an identifier string and the other is a date.
The first df ("model") includes identifiers, starting dates, and some other relevant info.
The second df ("futurevalues") is a melted df that includes the identifier, multiple months for each identifier, and the relevant value for each identifier-month pair.
I would like to merge in values for each identifier over a certain window in the future. So, for instance, for identifier = Mary and starting month = "2005-01-31" in "model", I would like to pull in the relevant value for the next month and the 11 months after that (so 12 data points for Mary, covering starting month + 1 through starting month + 12).
I can merge my dfs by the two columns to get the as-of date value (see below), but this isn't what I need.
testmerge <- merge(model, futurevalues, by = c("month", "identifier"), all = TRUE)
To solve this, I am trying to use the lubridate date functions. For instance, the function below lets me enter a starting month (which I could then lapply across the df) and get the twelve following months for it; the starting months vary across the df, so there is no single fixed window for the whole table.
monthiterate <- function(x) {
  x %m+% months(1:12)
}
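For reference, a quick check of what this returns for the example starting month above (assuming lubridate is loaded):
library(lubridate)
# %m+% rolls back to the last day of shorter months, so a month-end
# starting date stays a month-end
monthiterate(as.Date("2005-01-31"))
# e.g. "2005-02-28" "2005-03-31" "2005-04-30" ... up to "2006-01-31"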
Thanks a lot for your help.
EDIT: adding toy data (first one is model, second one is futurevalues)
structure(list(month = structure(c(12814, 12814, 12814, 12814,
12814, 12814, 12814, 12814, 12814, 12814), class = "Date"), identifier = structure(c(1L,
3L, 2L, 4L, 5L, 7L, 8L, 6L, 9L, 10L), .Label = c("AB1", "AC5",
"BB9", "C99", "D81", "GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"),
value = c(0.831876072999969, 0.218494398256579, 0.550872926656984,
1.81882711231324, -0.245597705276932, -0.964277509916354,
-1.84714556574606, -0.916239506529079, -0.475649743547525,
-0.227721186387637)), .Names = c("month", "identifier", "value"
), class = "data.frame", row.names = c(NA, 10L))
structure(list(identifier = structure(c(1L, 3L, 2L, 4L, 5L, 7L,
8L, 6L, 9L, 10L), .Label = c("AB1", "AC5", "BB9", "C99", "D81",
"GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"), month = structure(c(12814,
13238, 12814, 12814, 12964, 12903, 12903, 12842, 13148, 13148
), class = "Date"), futurereturns = c(-0.503033205660682, 1.22446988772542,
-0.825490985851348, 1.03902417581908, 0.172595565260429, 0.894967582911769,
-0.242324006922964, 0.415520398113024, -0.734437328639625, 2.64184935856802
)), .Names = c("identifier", "month", "futurereturns"), class = "data.frame", row.names
= c(NA, 10L))
You need to create a table of all the combinations of ID and month that you want. Starting with a table of each ID and their starting month:
library(lubridate)
set.seed(1834)
# 3 people, each with a different starting month
x <- data.frame(id = sample(LETTERS, 3)
, month = ymd("2005-01-01") + months(sample(0:11, 3)) - days(1))
> x
id month
1 D 2005-03-31
2 R 2005-07-31
3 Y 2005-02-28
Now add rows for the following two months, per ID. I use dplyr for this kind of thing.
library(dplyr)
y <- x %>%
rowwise %>%
do(data.frame(id = .$id
, month = seq(.$month + days(1)
, by = "1 month"
, length.out = 3) - days(1)))
> y
Source: local data frame [9 x 2]
Groups: <by row>
id month
1 D 2005-03-31
2 D 2005-04-30
3 D 2005-05-31
4 R 2005-07-31
5 R 2005-08-31
6 R 2005-09-30
7 Y 2005-02-28
8 Y 2005-03-31
9 Y 2005-04-30
Now you can use merge() (or left_join() from dplyr) to retrieve the rows you want from the full dataset.
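For example, a minimal sketch of that final join (assuming the expanded table is built from your real model identifiers, so its identifier and month columns line up with futurevalues):
library(dplyr)
# align the column names of the expanded table with futurevalues, then join;
# identifier-month pairs with no future value come back as NA
y <- rename(y, identifier = id)
merged <- left_join(y, futurevalues, by = c("identifier", "month"))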
I know there are like a million questions regarding duplicate removal, but unfortunately
none of them helped me so far. I struggle with the following:
I have a data frame (loc) that includes data from citizen science observations of nature (animals, plants, etc.). It has about 90,000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, lat and long are important. Datum is a date in POSIXct format, while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter and remove those. So I have to check three separate columns and keep only one of each "same-place-same-day" observation.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm not actually sure whether that's how it is done correctly.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other information for the unique rows still left.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!
This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]
To get the full data frame back after finding the duplicates, you could do something like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurrence and delete the duplicates in subsequent rows. If you want to keep a certain ID (e.g. the second one, not the first one), we need a different solution.
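If "a certain ID" simply means keeping the last observation of each day/location pair rather than the first, one option (just a sketch) is duplicated() with fromLast = TRUE:
# keep the *last* occurrence of each Datum/lat/long combination instead of the first
loc[!duplicated(loc[, 2:4], fromLast = TRUE), ]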
I have a data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (in letters), see how many smaller, similar names there are, and assign them all to one group;
then go to the next-longest unassigned name and assign its matches to another group,
and so on until no names are left.
First, I calculate the length of each name so I have the lengths available:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I need to see which other strings share letters with this longest string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations, but the order of the letters has to stay the same (left to right): for example it can be Afgh but cannot be fAhg
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
That is because a string has to match exactly, without even a single letter differing (apart from being shorter than the longest string), to be assigned to the same group.
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not completely belong to another name (it only partially matches one or the other); to share a group it would have to be contained in a bigger name.
I tried to use this question, Find similar strings and reconcile them within one dataframe, but it did not help me at all.
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
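As a rough illustration of why Jaro-Winkler with a prefix weight fits this problem (a small sketch, assuming stringdist is installed): strings that share a long prefix get a much smaller distance than strings with little in common.
library(stringdist)
# long shared prefix -> small distance
stringdist("afghanestan", "afghanestankabol", method = "jw", p = 0.1)
# little in common -> much larger distance
stringdist("afghanestan", "usa", method = "jw", p = 0.1)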
I am trying to write a loop in R that creates a new variable based on a table of conditional outcomes.
I have four treatment groups (A, B, C, D). Each treatment group pays a different price at three different time periods (day, dinner, night).
Treatment Group   Day Price   Dinnertime Price   Night Price
A                 10          20                 7
B                 11          25                 8
C                 12          30                 9
D                 13          35                 10
The time period is recorded as a given "hour" (day is hours 8-17, dinner is from 17-19 and night is from 19-0 and 0-8).
           Hour   Usage
Person 1      1       0
Person 1      2       0
Person 2     20       5
Person 3     17       6
Based on both treatment group (A, B, C and D) and time of day (night, day, dinnertime), I would like to create a new vector of prices.
Ideally, I would create dummy variables for each of the time periods (day, night and dinner) based on these hourly conditions. However, my data set is pretty large (24 observations per person per day) so I'm looking for a more elegant solution.
In plain language, I want this:
if group==A & time==night, then price=7 --> and this information saved in a new variable "price"
Any advice?
Edit: my question is about the loop with two conditions. Is there a way to reference the data frame with the treatment groups and tariffs directly, or do I just need to write it out manually?
Assuming that you have some way of adding a column for the group each person belongs to in the data frame with the transactions, something like this may work for you.
df.pricing <- structure(list(Treatment.Group = c("A", "B", "C", "D"), Day.Price = 10:13,
Dinnertime.Price = c(20L, 25L, 30L, 35L), Night.Price = 7:10),
.Names = c("Treatment.Group", "Day.Price", "Dinnertime.Price", "Night.Price"),
class = "data.frame",
row.names = c(NA, -4L))
df.transactions <- structure(list(Person = c("Person1", "Person1", "Person2", "Person3", "Person4"),
Hour = c(1L, 2L, 20L, 17L, 9L),
Usage = c(0L, 0L, 5L, 6L, 2L)),
.Names = c("Person", "Hour", "Usage"),
class = "data.frame", row.names = c(NA, -5L))
# Add the group that each person belongs to
df.transactions$group <- c("A","A","B","C","D")
# Get the transaction price
df.transactions$price <- apply(df.transactions, 1, function(x){
hour <- as.numeric(x[["Hour"]])
price <- ifelse(hour >= 8 & hour <= 16, df.pricing[df.pricing$Treatment.Group == x[["group"]], "Day.Price"],
ifelse((hour > 16 & hour <= 18), df.pricing[df.pricing$Treatment.Group == x[["group"]], "Dinnertime.Price"],
df.pricing[df.pricing$Treatment.Group == x[["group"]], "Night.Price"]))})
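As an alternative sketch (assuming reshape2 is available, and starting from df.pricing and df.transactions as defined above, before the price column is added), you could put the pricing table in long form, label each transaction with its period, and do a single merge:
library(reshape2)
# long-form pricing table: one row per Treatment.Group / period combination
pricing.long <- melt(df.pricing, id.vars = "Treatment.Group",
                     variable.name = "period", value.name = "price")
# derive the period label from the hour (same cut-offs as the apply() version)
df.transactions$period <- with(df.transactions,
                               ifelse(Hour >= 8 & Hour <= 16, "Day.Price",
                                      ifelse(Hour > 16 & Hour <= 18,
                                             "Dinnertime.Price", "Night.Price")))
priced <- merge(df.transactions, pricing.long,
                by.x = c("group", "period"),
                by.y = c("Treatment.Group", "period"))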
I imagine that there's some way to do this with sqldf, though I'm not familiar enough with that package's syntax to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
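In case it helps, here is a minimal base-R sketch of one way the "falls within" test could be expressed (not the findInterval() route; for large data, interval-join tools such as data.table's foverlaps() or the GenomicRanges package would scale better):
# for each row of g, keep the l rows whose [start, end] range contains
# either g's start_position or its end_position, and bind them side by side
hits <- lapply(seq_len(nrow(g)), function(i) {
  inside <- (g$start_position[i] >= l$start & g$start_position[i] <= l$end) |
            (g$end_position[i]   >= l$start & g$end_position[i]   <= l$end)
  if (any(inside)) cbind(g[i, , drop = FALSE], l[inside, , drop = FALSE])
})
merged <- do.call(rbind, hits)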
I have a dataframe with some observations of when lines were attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID had a line/catheter attached.
The problem is confounded by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
How I intended to handle this is to subtract the gap (number of days) from the number of days of continuous time between min(Start Date) and max(end date).
There are over 20,000 records in the data set.
Here is what I have done so far:
Converted the DF to a list of DFs based on ID.
I then intended to apply a function to each DF along these lines:
if the difference in time (days) between the next row's start date and the current row's end date exceeds 0, flag that row with TRUE (or some arbitrary column value).
function(y) {
  y$test <- FALSE
  for (i in seq_len(nrow(y) - 1)) {
    # flag rows followed by a gap: the next line starts after this one ends
    if (difftime(y$Start[i + 1], y$End[i], units = "days") > 0) {
      y$test[i] <- TRUE
    }
  }
  y
}
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the Days column; it is of no use. I intend to aggregate monthly line counts from the unique cases.
I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID), function(x) {
  # rows are assumed ordered by dateOrder within each ID, as in the dput
  # total span from earliest insertion to latest removal, in days
  totaldays <- as.numeric(difftime(max(x$End), min(x$Start), units = "days"))
  # gaps between each removal and the next insertion (keep positive gaps only)
  gaps <- as.numeric(difftime(x$Start[-1], x$End[-nrow(x)], units = "days"))
  gaps <- gaps[gaps > 0]
  # subtract the time with no line attached; sum() copes with zero or several gaps
  totaldays - sum(gaps)
}))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.
If I understand correctly, you want the total number of days that the catheter was present. To do that, I would use plyr:
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x)
  data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194