I have read many posts on the site about randomly subsetting a large dataset for observations based on date (for the first, last, or a specific date). However, I have a different challenge: I need to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but include only one date observation per site.
More specifically, I have a large dataset (for community ecology!) of insect community observations (n=2000) across 4 years. They were observed at ~900 sites, but each site has between 1 and 6 date observations within a year, with no sites repeated between years (this is why previous posts looking to subset a specific date range do not apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means I can only include one observation per site.
So the full dataset looks something like:
Site Date Ladybug
Baumgarten 6/24/2014 2
Baumgarten 8/6/2014 0
Baumgarten 8/20/2014 3
Baumgarten 7/8/2014 0
Baumgarten 7/22/2014 1
Berkevich 7/9/2014 0
Berkevich 7/23/2014 4
Berkevich 8/8/2014 0
Berkevich 8/22/2014 0
Boehm 6/24/2014 2
# dput(data)
dd <- structure(list(
  Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L),
                   .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"),
  Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L),
                   .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014", "7/9/2014",
                              "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014"),
                   class = "factor"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)),
  .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA, -10L))
And my desired subsetted dataset would look something like:
Site Date Ladybug
Baumgarten 8/20/2014 3
Berkevich 7/9/2014 0
Boehm 6/24/2014 2
I have dates entered in both MM/DD/YYYY and DOY (day of year) format (since sites don't repeat between years, DOY x site subsetting will still ensure no repeated sites), so code that uses either could work.
Any advice would be much appreciated. Thanks.
Assuming your data is a data.frame named df, you could use dplyr and do the following:
library(dplyr)
df %>%
group_by(Site) %>%
sample_n(1)
# Source: local data frame [3 x 3]
# Groups: Site [3]
#
# Site Date Ladybug
# (fctr) (fctr) (int)
# 1 Baumgarten 8/20/2014 3
# 2 Berkevich 8/22/2014 0
# 3 Boehm 6/24/2014 2
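Note that sample_n() draws at random, so set a seed if you need the subset to be reproducible. In recent dplyr (>= 1.0) sample_n() has been superseded by slice_sample(); a sketch of the same operation:
library(dplyr)
set.seed(42)          # any fixed seed makes the draw reproducible
df %>%
  group_by(Site) %>%
  slice_sample(n = 1) # modern equivalent of sample_n(1)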
Using data.table you can do:
require(data.table)
setDT(DT)  # assumes your data is in a data.frame named DT; converts it by reference
DT[, .SD[sample(.N, 1)], by = Site]
This gives you
Site Date Ladybug
1: Baumgarten 8/20/2014 3
2: Berkevich 7/23/2014 4
3: Boehm 6/24/2014 2
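As a side note, a common data.table idiom avoids materializing .SD for each group by sampling row numbers with .I instead; a sketch against the same DT:
library(data.table)
set.seed(42)  # reproducible draw
# .I holds row numbers; pick one per Site, then subset DT once
DT[DT[, .I[sample(.N, 1)], by = Site]$V1]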
You could also use base R for this: split the data by site, sample one row from each piece, then bind the results back together.
set.seed(123)
# dat is your data.frame; pick one random row within each Site
res <- do.call(rbind, lapply(split(dat, dat$Site), function(x) x[sample(nrow(x), 1), ]))
Another possibility is data.table:
library(data.table)
setDT(dat)
set.seed(123)
res <- dat[,.SD[sample(.N,1)],Site]
A possibly inefficient method, but it gets the job done.
sites <- unique(data$Site)
rowselect <- sapply(seq_along(sites), function(x) {
  elem <- which(data$Site == sites[x])  # rows belonging to this site
  if (length(elem) < 2) {
    return(elem)
  } else {
    return(sample(elem, 1))
  }
})
This gives the row index of one randomly selected row for each site; data[rowselect, ] then returns the subsetted dataset.
I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference between each student's scores and those of every other student in the same class, stored as (id1, id2 = difference).
cheating:
This column lists any ids from the previously created column for which the difference was zero (or below some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate differences; stringdist() may be better, or maybe another option is out there to try.
df<- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with more than 2 people in them
suspectgroups <- table(groups)[table(groups) >=2]
library(dplyr)  # for %>%, filter() and arrange() below
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
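If you do want to stay in dplyr as the question asks, here is one possible sketch (assuming the df from the dput above, and using Manhattan distance, which reproduces the differences in the desired output): compute pairwise distances within each class and flag classmates at distance zero.
library(dplyr)
flagged <- df %>%
  group_by(class) %>%
  group_modify(function(d, ...) {
    # pairwise Manhattan distance across q1..q4 within this class
    m <- as.matrix(dist(d[, c("q1", "q2", "q3", "q4")], method = "manhattan"))
    d$cheating <- sapply(seq_len(nrow(d)), function(i) {
      hits <- d$id[m[i, ] == 0 & seq_len(nrow(d)) != i]  # classmates at distance 0
      if (length(hits)) paste(hits, collapse = ",") else NA_character_
    })
    d
  }) %>%
  ungroup()
For the sample data this flags 24/35 (Ali) and 16/36 (Tom), matching the desired cheating column.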
This question already has answers here:
How to reshape data from long to wide format
I am trying to collapse some rows of my data into another row, in the form of additional columns. Is there a way I can group rows together by a certain variable?
I have tried using the group_by() statement from the dplyr package, but it does not seem to solve my issue.
library(dplyr)
late <- read.csv(file.choose())
late <- group_by(late, state, add = FALSE)
The data set I have (named "late") now is in this form:
ontime state count
0 AL 1
1 AL 44
null AL 3
0 AR 5
1 AR 50
...
But I would like it to be:
state count0 count1 countnull
AL 1 44 3
AR 5 50 null
...
Ultimately, I want to calculate count0/count1 for each state. So if there is a better way of going about this, I would be open to any suggestions.
You could do this with dcast() from the reshape2 package:
library(reshape2)
df = data.frame(
ontime = c(0,1,NA,0,1),
state = c("AL","AL","AL","AR","AR"),
count = c(1,44,3,5,50)
)
dcast(df, state ~ ontime, value.var = "count")
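To get the count0/count1 ratio the question ultimately asks for, one possible follow-up (the cast columns come out named after the raw ontime values, so they are renamed here first; this assumes the columns arrive in 0/1/NA order):
wide <- dcast(df, state ~ ontime, value.var = "count")
names(wide) <- c("state", "count0", "count1", "countnull")  # rename raw value columns
wide$ratio <- wide$count0 / wide$count1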
With spread:
library(dplyr)
library(tidyr)
df %>%
mutate(ontime = paste0('count', ontime)) %>%
spread(ontime, count)
Output:
state count0 count1 countnull
1 AL 1 44 3
2 AR 5 50 NA
Data:
df <- structure(list(ontime = structure(c(1L, 2L, 3L, 1L, 2L), .Label = c("0",
"1", "null"), class = "factor"), state = structure(c(1L, 1L,
1L, 2L, 2L), .Label = c("AL", "AR"), class = "factor"), count = c(1L,
44L, 3L, 5L, 50L)), class = "data.frame", row.names = c(NA, -5L
))
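For what it's worth, spread() has since been superseded by pivot_wider() (tidyr >= 1.0); a sketch of the same reshape:
library(dplyr)
library(tidyr)
df %>%
  mutate(ontime = paste0("count", ontime)) %>%
  pivot_wider(names_from = ontime, values_from = count)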
I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside the plot to render a new geom_line() for each returned value of the input$productId vector; however, after some research it seems that's the wrong way to go about things.
Currently I'm trying to melt() the data down to something useful and then use aes(..., group = product_id), but I'm getting errors with whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axes and values are all wonky, and not what I'm expecting when setting up the plot.
If I'm understanding it correctly, you don't need any melting; you just need to aggregate your data, summing up count by date and product_id. You can use data.table for this purpose:
library(data.table)
testSet <- as.data.table(testSet)
aggrSet <- testSet[, .(count = sum(count)), by = .(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
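For instance, a minimal sketch of the plotting step (input$productId is the hypothetical checkboxGroupInput value; outside shiny, just drop the filter):
library(ggplot2)
# one line per product; inside the app, first filter to the checked products,
# e.g. aggrSet[product_id %in% input$productId]
ggplot(aggrSet, aes(x = date, y = count, colour = factor(product_id))) +
  geom_line()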
When you melt the way you did, variables of different types (date, which is a Date, plus store_id and count, which are integers) get stacked into a single value column, which is why the axes come out wonky.
Consider I have the below dataframe
AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12
I want to group it based on AccountId and then add another column, named date_diff, which will contain the difference in CloseDate between the current row and the previous row. Please note that I want this date_diff to be calculated only for rows having the same AccountId, so I need to group the data before adding the column.
Below is the R code that I am using
df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
df$CloseDate <- to_date(df$CloseDate)
groupedData <- SparkR::group_by(df, df$AccountId)
SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate),(SparkR::lag(CloseDate,1)))))
To add another column I am using mutate. But as group_by returns a GroupedData object, I am not able to use mutate here. I am getting the below error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’
So how can I convert GroupedData into Dataframe so that I can add columns using mutate?
What you want is not possible to achieve using group_by. As already explained quite a few times on SO:
Using groupBy in Spark and getting back to a DataFrame
How to do custom operations on GroupedData in Spark?
DataFrame groupBy behaviour/optimization
group_by on a DataFrame doesn't physically group the data. Moreover, the order in which rows are processed after applying group_by is nondeterministic.
To achieve desired output you'll have to use window functions and provide an explicit ordering:
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L,
5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02",
"2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")),
.Names = c("AccountId", "CloseDate"),
class = "data.frame", row.names = c(NA, -12L))
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")
query <- "SELECT *, LAG(CloseDate, 1) OVER (
PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"
dfWithLag <- sql(hiveContext, query)
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
head()
## AccountId CloseDate DateLag diff
## 1 1 2015-05-07 <NA> NA
## 2 1 2015-05-09 2015-05-07 2
## 3 1 2015-05-12 2015-05-09 3
## 4 1 2015-05-12 2015-05-12 0
## 5 2 2015-05-09 <NA> NA
## 6 2 2015-05-12 2015-05-09 3
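If you would rather avoid raw SQL, later SparkR releases (2.0+) expose a window-function API directly; a hedged sketch of the same lag, assuming the sdf defined above:
# same window spec as the SQL: partition by AccountId, order by CloseDate
ws <- orderBy(windowPartitionBy("AccountId"), "CloseDate")
sdfLag <- withColumn(sdf, "DateLag", over(lag(sdf$CloseDate, 1), ws))
sdfLag <- withColumn(sdfLag, "diff", datediff(sdfLag$CloseDate, sdfLag$DateLag))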
I'm trying to use R to find the max value of each day, for 1 to n days. My issue is that there are multiple values within each day. Here's my code; after I run it I get an "incorrect number of dimensions" error.
Any suggestions?
Days <- unique(theData$Date) #Gets each unique Day
numDays <- length(Days)
Time <- unique(theData$Time) #Gets each unique time
numTime <- length(Time)
rowCnt <- 1
for (i in 1:numDays) #Do something for each individual day. In this case find max
{
temp <- which(theData[i]$Date == numDays[i])
temp <- theData[[i]][temp,]
High[rowCnt, (i-2)+2] <- max(temp$High) #indexing for when I print to CSV
rowCnt <- rowCnt + 1
}
Here's what the data looks like, except with 1 to n days and times:
Day Time Value
20130310 09:30:00 5
20130310 09:31:00 1
20130310 09:32:00 2
20130310 09:33:00 3
20130311 09:30:00 12
20130311 09:31:00 0
20130311 09:32:00 1
20130311 09:33:00 5
so this should return:
day time value
20130310 09:30:00 5
20130311 09:30:00 12
Any help would be greatly appreciated! Thanks!
Here is a solution using the plyr package:
mydata<-structure(list(Day = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), .Label = c("", "x", "y"), class = "factor"), Value = c(0L,
1L, 2L, 3L, 12L, 0L, 1L, 5L), Time = c(5L, 6L, 7L, 8L, 1L, 2L,
3L, 4L)), .Names = c("Day", "Value", "Time"), row.names = c(NA,
8L), class = "data.frame")
library(plyr)
ddply(mydata,.(Day),summarize,max.value=max(Value))
Day max.value
1 x 3
2 y 12
Update 1: If your day is, say, 10/02/2012 12:00:00 AM, then you need to convert it to a Date first:
mydata$Day<-with(mydata,as.Date(Day, format = "%m/%d/%Y"))
ddply(mydata,.(Day),summarize,max.value=max(Value))
Update 2: With the data as you posted it, you don't need to convert anything. You can just use the code as follows:
mydata1<-structure(list(Day = c(20130310L, 20130310L, 20130310L, 20130310L,
20130311L, 20130311L, 20130311L, 20130311L), Time = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("9:30:00", "9:31:00",
"9:32:00", "9:33:00"), class = "factor"), Value = c(5L, 1L, 2L,
3L, 12L, 0L, 1L, 5L)), .Names = c("Day", "Time", "Value"), class = "data.frame", row.names = c(NA,
-8L))
ddply(mydata1, .(Day), summarize, Time = Time[which.max(Value)], max.value = max(Value))
Day Time max.value
1 20130310 9:30:00 5
2 20130311 9:30:00 12
If you want the time to appear in the output, then just use Time=Time[which.max(Value)] which gives the time at the maximum value.
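A dplyr equivalent of the same group-wise maximum, for comparison (a sketch using mydata1; slice_max() needs dplyr >= 1.0):
library(dplyr)
mydata1 %>%
  group_by(Day) %>%
  slice_max(Value, n = 1, with_ties = FALSE) %>%  # keep one row per Day at the max
  ungroup()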
This is a base function approach:
> do.call( rbind, lapply(split(dfrm, dfrm$Day),
function (df) df[ which.max(df$Value), ] ) )
Day Time Value
20130310 20130310 09:30:00 5
20130311 20130311 09:30:00 12
To explain what's happening it's good to learn to read R functions from the inside out (since they are often built around each other.) You wanted lines from a dataframe, so you would either need to build a numeric or logical vector that spanned the number of rows, .... or you can take the route I did and break the problem up by Day. That's what split does with dataframes. Then within each dataframe I applied a function, which.max to just a single day's subset of the data. Since I only got the results back from lapply as a list of dataframes, I needed to squash them back together, and the typical method for doing so is do.call(rbind, ...).
If I took the other route of making a vector for selection that applied to the whole dataframe I would use ave:
> dfrm[ with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) ) , ]
Day Time Value
1 20130310 09:30:00 5
1.1 20130310 09:30:00 5
Huh? That's not right... What's the problem?
with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) )
[1] 1 0 0 0 1 0 0 0
So despite asking for a logical vector with the "==" function, I got conversion to a numeric vector. (This happens because ave() returns a vector of the same mode as its first argument, here numeric, so the logical result is coerced back to 0/1.) Converting that result to logical outside the call succeeds:
> dfrm[ as.logical( with(dfrm, ave(Value, Day,
FUN=function(v) v==max(v) ) ) ), ]
Day Time Value
1 20130310 09:30:00 5
5 20130311 09:30:00 12
Also note that the ave function (unlike tapply or aggregate) requires that you offer the function as a named argument with FUN=function(.). That is a common error I make. The error message "unique() applies only to vectors" seems to come out of the blue, but it means that ave tried to group an argument that it expected to be discrete and you gave it a function.
Unlike in many other programming languages, in R it is considered good practice to avoid explicit for loops when a vectorized or apply-family alternative exists. Instead try something like:
index <- sapply(Days, function(x) {
  rows <- which(theData$Date == x)       # rows belonging to this day
  rows[which.max(theData$Value[rows])]   # row index of that day's maximum
})
theData[index, c("Date", "Time", "Value")]
This means: for each value of Days, find the row with the maximum Value and return its index. Then you can select the rows and columns of interest.
I recommend reading the help documentation for apply(), lapply(), sapply(), tapply(), mapply() (I'm probably forgetting one of them…) in base R, as well as the plyr package.