I have a data frame consisting of an ID that is the same for each element in a group, two datetimes, and the time interval between them. One of the datetime columns is my relevant time marker. I would like to get a subset of the data frame that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame according to 1. ID and 2. relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but in neither could I find an option that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given the fact that the dataframe is sorted in a way that the first row of each new group is the row I am looking for, it would suffice to just return a subset with each row that has a different ID than the one before (which is the start-row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
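A minimal, self-contained sketch of the same idea on toy data (the data frame `toy` and its values are made up for illustration, they are not the data from the question). Sorting by ID and then by Start puts the earliest row of each group first, and `!duplicated()` keeps only that first row, so the other columns are carried over unchanged.

```r
# Toy data (hypothetical values, not the asker's data)
toy <- data.frame(
  ID       = c(2L, 1L, 2L, 1L),
  Start    = as.POSIXct(c("2013-01-03 10:00:00", "2013-01-05 09:00:00",
                          "2013-01-01 08:00:00", "2013-01-02 12:00:00"),
                        tz = "UTC"),
  Interval = c(10.5, 20.0, 30.25, 40.75)
)

# Sort by group, then by the relevant time marker
ordered_toy <- toy[order(toy$ID, toy$Start), ]

# duplicated() flags every repeat of an ID, so negating it
# keeps only the first (earliest) row of each ID
first_rows <- ordered_toy[!duplicated(ordered_toy$ID), ]
first_rows
```

Because no aggregation function is involved, the Interval values in `first_rows` are exactly those of the selected rows.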
As you don't provide any data, here is an example using base R with a sample data frame:
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply:
library(plyr)
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
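This can also be done with dplyr using group_by and the slice family of functions. Note that slice_head keeps the first row of each group in the current row order, so this assumes the data is already sorted by Start within each ID; otherwise add arrange(Start) before slicing: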
data %>%
group_by(ID) %>%
slice_head(n = 1)
I am analysing IDs from the RePEc database. Each ID matches a unique publication and sometimes publications are linked because they are different versions of each other (e.g. a working paper that becomes a journal article). I have a database of about 250,000 entries that shows the main IDs in one column and then the previous or alternative IDs in another. It looks like this:
df$repec_id <- c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350","RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019", "RePEc:tqu:vishdizi:d8z7-200x", "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050", "RePEc:cid:wgha:353|RePEc:hgd:wpfacu:350")
I want to find out which IDs from the repec_id column are also present in the alt_repec_id column and create a dataframe that only has rows matching this condition. I tried to strsplit at "|" and use the %in% function like this:
df <- separate_rows(df, alt_repec_id, sep = "\\|")
df1 <- df[trimws(df$alt_repec_id) %in% trimws(df$repec_id), ]
df1<- data.frame(df1)
df1 <- na.omit(df1)
df1 <- df1[!duplicated(df1$repec_id),]
It works but I'm worried that by eliminating duplicate rows based on the values in the repec_id column, I'm randomly eliminating matches. Is that right?
Ultimately, I want a dataframe that only contains values in which strings in the repec_id column match the partial strings in the alt_repec_id column. Using the example above, I want the following result:
df$repec_id <- c("RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050")
Does anyone have a solution to my problem? Thanks in advance for your help!
Try using str_detect() from stringr to identify whether the repec_id exists in the larger alt_repec_id string.
Then filter() down to the rows where it was found. If this does not return what you expect, try looking at and posting a few examples where found_match == FALSE but a match was expected.
library(stringr)
library(dplyr)
df %>%
mutate(found_match = str_detect(alt_repec_id, repec_id)) %>%
filter(found_match == TRUE)
Here is a base R solution using grepl() + apply() + subset()
dfout <- subset(df,apply(df, 1, function(v) grepl(v[1],v[2])))
such that
> dfout
repec_id alt_repec_id
3 RePEc:cpi:dynxce:050 RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050
DATA
df <- structure(list(repec_id = structure(c(1L, 3L, 2L), .Label = c("RePEc:cid:wgha:353",
"RePEc:cpi:dynxce:050", "RePEc:hgd:wpfacu:350"), class = "factor"),
alt_repec_id = structure(c(2L, 3L, 1L), .Label = c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050",
"RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
"RePEc:tqu:vishdizi:d8z7-200x"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
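Both answers above test whether repec_id occurs anywhere inside the alt_repec_id string, so an ID that is a prefix of a longer ID would also count as a match. A hedged base R sketch of an exact, row-wise comparison follows; the IDs and the names `df_toy`, `alt_list` and `is_match` are hypothetical and chosen so that the first row shows the prefix problem.

```r
# Hypothetical IDs, chosen so that one ID is a prefix of another
df_toy <- data.frame(
  repec_id     = c("RePEc:aaa:bbb:1", "RePEc:ccc:ddd:2", "RePEc:eee:fff:3"),
  alt_repec_id = c("RePEc:xxx:yyy:9|RePEc:aaa:bbb:10",
                   "RePEc:zzz:zzz:5",
                   "RePEc:qqq:rrr:7|RePEc:eee:fff:3"),
  stringsAsFactors = FALSE
)

# Split each alt_repec_id string on the literal "|" character
alt_list <- strsplit(df_toy$alt_repec_id, "|", fixed = TRUE)

# For each row, test whether repec_id equals one of the split pieces exactly
is_match <- mapply(function(id, alts) id %in% trimws(alts),
                   df_toy$repec_id, alt_list, USE.NAMES = FALSE)

# Keep only the rows with an exact match; alt_repec_id stays unsplit
df_match <- df_toy[is_match, ]
df_match
```

Only the third row is kept. The first row is dropped even though "RePEc:aaa:bbb:1" is a substring of "RePEc:aaa:bbb:10", and no rows are duplicated, so no de-duplication step is needed afterwards.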
I have the following dataset:
dataset=structure(list(var1 = c(28.5627505742013, 22.8311421908438, 95.2216156944633,
43.9405107684433, 97.11211245507, 48.4108281508088, 77.1804554760456,
27.1229329891503, 69.5863061584532, 87.2112890332937), var2 = c(32.9009465128183,
54.1136392951012, 69.3181485682726, 70.2100433968008, 44.0986660309136,
62.8759404085577, 79.4413498230278, 97.4315509572625, 62.2505457513034,
76.0133410431445), var3 = c(89.6971945464611, 67.174579706043,
37.0924087055027, 87.7977314218879, 29.3221596442163, 37.5143952667713,
62.6237869635224, 71.3644423149526, 95.3462834469974, 27.4587387405336
), var4 = c(41.5336912125349, 98.2095112837851, 80.7970978319645,
91.1278881691396, 66.4086666144431, 69.2618868127465, 67.7560870349407,
71.4932355284691, 21.345994155854, 31.1811877787113), var5 = c(33.9312525652349,
88.1815139763057, 98.4453701227903, 25.0217059068382, 41.1195872165263,
37.0983888953924, 66.0217586159706, 23.8814191706479, 40.9594196081161,
79.7632974945009), var6 = c(39.813664201647, 80.6405956856906,
30.0273275375366, 34.6203793399036, 96.5195455029607, 44.5830867439508,
78.7370151281357, 42.010761089623, 23.0079878121614, 58.0372223630548
), kmeans = structure(c(2L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 2L, 3L
), .Label = c("1", "2", "3"), class = "factor")), .Names = c("var1",
"var2", "var3", "var4", "var5", "var6", "kmeans"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
And the following function:
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
With dplyr::summarise only, the result is ok:
library(tidyverse)
my1<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(sum,mean,sd))
But with myfun it doesn't work:
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(myfun))
Error in summarise_impl(.data, dots) :
Column var1 must be length 1 (a summary value), not 3
What's the problem?
You can try this approach. Your version fails because summarise cannot fit the three values returned by your custom function into a single cell. To circumvent the problem, I used enframe with list in the custom function:
library(tidyverse)
myfun<-function(x){
return(list(enframe(c('sum' = sum(x),'mean' = mean(x),'sd' = sd(x)))))
}
For example with mtcars data:
my2<-mtcars%>%
summarise_at(c('mpg','drat'), function(x) myfun(x)) %>%
unnest() %>%
select(-name1) %>%
set_names(nm = c('name', 'mpg', 'drat'))
it will yield:
name mpg drat
1 sum 642.900000 115.0900000
2 mean 20.090625 3.5965625
3 sd 6.026948 0.5346787
Also, there is one alternate way in which you can try solving it using purrr.
For example:
f <- function(x,...){
list('mean' = mean(x, ...),'sum' = sum(x, ...))
}
mtcars %>%
select(mpg, drat) %>%
map_dfr(~ f(.x, na.rm=T), .id ="Name") %>%
data.frame()
When you run this call
dataset%>% summarise_if(is.numeric,.funs=funs(sum,mean,sd))
you are applying three different functions (sum, mean and sd) to each numeric column individually. Each function returns a single value per column, so you get three separate one-value summaries per column.
Regarding your function, I think what you were trying to do was
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
Now, when this function is applied to one column it returns three values, so here a single function is returning three values instead.
myfun(dataset$var1)
#[1] 597.17994 59.71799 29.03549
As @NelsonGon mentioned in the comments, you are trying to store three values in a single cell. You could return them as a list, as @Pkumar showed, or some variation of do would also help you achieve that. If you break the function down into three separate functions, it works the same way as the call you showed earlier.
myfun1 <- function(x) sum(x)
myfun2 <- function(x) mean(x)
myfun3 <- function(x) sd(x)
dataset %>% summarise_if(is.numeric,.funs=funs(myfun1,myfun2,myfun3))
It's not the most elegant way, but if your external function is just a collection of other functions, maybe you can simply pass those functions as a list:
myfun_ls <- list(sum,mean,sd)
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=myfun_ls)
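For comparison, a base R sketch that keeps a single multi-value function: `sapply()` has no one-value-per-cell restriction, so it can bind the three results of each column into a matrix. It uses the built-in mtcars data, as one of the answers above does; the names `dat`, `num_cols` and `stats` are only illustrative.

```r
# Named vector so each statistic keeps a label
myfun <- function(x) c(sum = sum(x), mean = mean(x), sd = sd(x))

# mtcars is a built-in dataset; two columns are chosen for brevity
dat <- mtcars[c("mpg", "drat")]

# Select the numeric columns, mirroring summarise_if(is.numeric, ...)
num_cols <- vapply(dat, is.numeric, logical(1))

# sapply() binds the three-element results column-wise into a 3 x 2 matrix
stats <- sapply(dat[num_cols], myfun)
stats
```

The rows of `stats` are named sum, mean and sd, and the columns carry the variable names, which is the same layout as the enframe/unnest output shown above.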
I'm having an issue with separating rows in a dataframe that I'm working in.
In my dataframe, there's a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indexes to indicate which rows have the same data. For example: indices 2:3 means that rows 2:3 have the same data.
Here is the code that I am working with.
offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like
This is what it looks like after I try to separate the rows
My code works fine when it has indices 2:3, since there is a difference of 1. However, on indices like 7:10, it separates the rows as 7 and 10 instead of 7, 8, 9, 10, which is how I want it to be done. How would I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")
This should work. I expect it will work for further rows too, since I tested for ranges greater than two in officialIndices.
First I extracted the start and end rows, and used their difference to determine how many rows are needed. Then tidyr::uncount() will add that many copies.
library(dplyr); library(tidyr)
data_sep <- data %>%
separate(officialIndices, into = c("start", "end"), sep = ":") %>%
# Use 1 row, and more if "end" is defined and larger than "start"
mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
uncount(rows)
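The uncount() approach above produces the right number of rows but does not by itself fill in the intermediate index values. A hedged base R sketch that also generates them is shown below. It assumes the indices are available as "start:end" strings (or a single number); the data frame `offices` and the helper `expand_range` are hypothetical, not part of the original code.

```r
# Hypothetical input: indices stored as "start:end" or as a single number
offices <- data.frame(
  position        = c("Senate", "Governor", "Council"),
  officialIndices = c("2:3", "4", "7:10"),
  stringsAsFactors = FALSE
)

# Turn "7:10" into 7, 8, 9, 10 and "4" into 4
expand_range <- function(x) {
  bounds <- as.integer(strsplit(x, ":", fixed = TRUE)[[1]])
  seq(bounds[1], bounds[length(bounds)])
}

index_list <- lapply(offices$officialIndices, expand_range)

# Repeat each office row once per index, then attach the expanded values
row_ids  <- rep(seq_len(nrow(offices)), lengths(index_list))
expanded <- offices[row_ids, "position", drop = FALSE]
expanded$officialIndex <- unlist(index_list)
rownames(expanded) <- NULL
expanded
```

Here "Council" ends up on four rows with indices 7, 8, 9 and 10, rather than only on 7 and 10.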
I have two data tables (CSV) that contain information about a MOOC course.
The first table contains information about mouse movements (distance), like this:
1-2163.058../2-20903.66351.../3-25428.5415..
The first number is the day on which it happened (1 = first day, 2 = second day, etc.); the second number is the distance in pixels (2163.058, 20903.66351, etc.).
The second table contains the same information, but with the recorded time instead of the distance, like this:
1-4662.0/2-43738.0/3-248349.0....
The first number is the day on which it happened (1 = first day, 2 = second day, etc.); the second number is the time in milliseconds.
In each table, every column records data from a specific web page, and every row records one user's behaviour on that page.
I want to create a new table in the same format that contains the speed in pixels per millisecond, that is, the distance table divided by the time table, keeping the same order and shape.
Here are two links for the two tables goo.gl/AVQW7D goo.gl/zqzgaQ
How can I do this with raw csv?
> dput(distancestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("1-0", "1-42522.28760403924"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id", "http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
> dput(timestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("0", "1-189044.0"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id",
"http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
This may not be the most efficient method, but I believe it should yield the result you are looking for.
# Set file paths
dist.file <- "C:/Path/To/Distance/File.csv"  # placeholder, replace with your path
time.file <- "C:/Path/To/Time/File.csv"      # placeholder, replace with your path
# Read data files
dist <- read.csv(dist.file, stringsAsFactors = FALSE)
time <- read.csv(time.file, stringsAsFactors = FALSE)
# Create dataframe for speed values
speed <- dist
speed[,2:ncol(speed)] <- NA
# Create progress bar
pb <- txtProgressBar(min = 0, max = ncol(dist) * nrow(dist), initial = 0, style = 3, width = 20)
item <- 0
# Loop through all columns and rows of distance data
for(col in 2:ncol(dist)){
for(r in 1:nrow(dist)){
# Check that current item has data to be calculated
if(dist[r,col] != 0 && dist[r,col] != "1-0" && !is.na(time[r,col])){
# Split the data into its separate day values
dists <- lapply(strsplit(strsplit(dist[r,col], "/")[[1]], "-"), as.numeric)
times <- lapply(strsplit(strsplit(time[r,col], "/")[[1]], "-"), as.numeric)
# Calculate the speeds for each day
speeds <- sapply(dists, "[[", 2) / sapply(times, "[[", 2)
# Paste together the day values and assign to the current item in speed dataframe
speed[r,col] <- paste(sapply(dists, "[[", 1), format(speeds, digits = 20), sep = "-", collapse = "/")
} else{
# No data to calculate, assign 0 to current item in speed dataframe
speed[r,col] <- 0
}
# Increase progress bar counter
item <- item + 1
setTxtProgressBar(pb,item)
}
}
# Create a csv for speed data
write.csv(speed, "speed.csv")
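The per-cell step of the loop can also be isolated into a small helper, which makes it easy to check on a single pair of strings. The sketch below is only an illustration under the assumption that every cell follows the "day-value/day-value" format described in the question; `parse_cell` and `cell_speed` are hypothetical names, and days are matched by number rather than by position.

```r
# Parse "1-100/2-300" into a data frame with columns day and value
parse_cell <- function(cell) {
  pairs <- strsplit(strsplit(cell, "/", fixed = TRUE)[[1]], "-", fixed = TRUE)
  data.frame(day   = as.integer(vapply(pairs, `[`, character(1), 1)),
             value = as.numeric(vapply(pairs, `[`, character(1), 2)))
}

# Divide distance by time day by day and rebuild the "day-value" string
cell_speed <- function(dist_cell, time_cell) {
  d  <- parse_cell(dist_cell)
  tm <- parse_cell(time_cell)
  # Join on day so a day missing from either table is simply dropped
  both <- merge(d, tm, by = "day", suffixes = c("_dist", "_time"))
  paste(both$day, both$value_dist / both$value_time, sep = "-", collapse = "/")
}

cell_speed("1-100/2-300", "1-50/2-100")
# → "1-2/2-3"
```

Cells that hold a bare 0 or "1-0" would still need the same guard as in the loop above before being passed to `cell_speed`.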
I have this dataframe called mydf. I want to match its current column against the key.genomloc column in another data structure called secondf, extract the corresponding key.wesmut.genom values, and use them as row names, as shown in the result.
This is what I have tried, but it does not work as desired:
current <- secondf[,"key.genomloc"]
replacement <- secondf[,"key.wesmut.genom"]
v <- mydf[,"current"] %in% current
w <- current %in% mydf[,"current"]
rownames(mydf)<-mydf[,"current"]
rownames(mydf)[v] <- replacement[w]
Data:
mydf <-structure(list(current = structure(c(5L, 2L), .Label = c("chr1:115256529:T:C",
"chr1:115256530:G:T", "chr1:115258744:C:A", "chr1:115258744:C:T",
"chr1:115258747:C:T", "chr11:32417945:T:C", "chr12:25398284:C:A",
"chr12:25398284:C:T", "chr13:28592640:A:C", "chr13:28592641:T:A",
"chr13:28592642:C:A", "chr13:28592642:C:G", "chr15:90631838:C:T",
"chr15:90631934:C:T", "chr2:209113112:C:T", "chr2:209113113:G:A",
"chr2:209113113:G:C", "chr2:209113113:G:T", "chr2:25457242:C:T",
"chr2:25457243:G:A", "chr2:25457243:G:T", "chr4:55599320:G:T"
), class = "factor"), `index` = c(1451738, 1451718)), .Names = c("current",
"index"), row.names = 1:2, class = "data.frame")
secondf<-structure(c("WES:FLT3:p.D835H", "WES:FLT3:p.D835N", "WES:FLT3:p.D835Y",
"WES:FLT3:p.D835A", "WES:FLT3:p.D835V", "chr1:115256530:G:T",
"chr13:28592642:C:T", "chr13:28592642:C:A", "chr1:115258747:C:T",
"chr13:28592641:T:A"), .Dim = c(5L, 2L), .Dimnames = list(NULL,
c("key.wesmut.genom", "key.genomloc")))
Result
rowname current index
WES:FLT3:p.D835A chr1:115258747:C:T 1451738
WES:FLT3:p.D835H chr1:115256530:G:T 1451718
We can use match
mydf$rowname <- secondf[,1][match(mydf$current,secondf[,2])]
mydf[c(3,1:2)]
# rowname current index
#1 WES:FLT3:p.D835A chr1:115258747:C:T 1451738
#2 WES:FLT3:p.D835H chr1:115256530:G:T 1451718
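To see why match() works here, a toy sketch with made-up vectors (`keys`, `labels` and `x` are hypothetical): match() returns, for each element of its first argument, the position of that element in the second argument, and that position is then used to index the vector of replacement values.

```r
# Hypothetical lookup: keys in one vector, labels in another
keys   <- c("k1", "k2", "k3")
labels <- c("label_one", "label_two", "label_three")

# Values to translate; "k9" has no entry in the lookup
x <- c("k3", "k1", "k9")

# Position of each x in keys (NA when there is no match)
pos <- match(x, keys)

# Index the labels by those positions; the order of x is preserved
labels[pos]
```

Values without a match come back as NA. That matters if the result is assigned with rownames(), since data frame row names may not contain missing values, which is one reason to store the result in an ordinary column as the answer does.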