Selecting observations based on a condition dependent on a grouping variable - R

I have a question that I am hoping someone will help me answer. I have a data set, ordered by parasites and year, that looks something like this (the actual dataset is much larger):
parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157
What I would like to do is, for every year, select the 2 samples with the highest number of parasites, to give an output that looks like this:
parasites year samples
1000 2000 11
910 2000 22
999 2002 64
910 2002 75
890 2004 29
810 2004 10
876 2005 120
750 2005 12
I am new to programming as a whole and still trying to find my way around R. Can someone please explain to me how I would go about this? Thanks so much.

How about with data.table:
parasites<-read.table(header=T,text="parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157")
EDIT - sorry, sorted by parasites, not samples:
require(data.table)
data.table(parasites)[,.SD[order(-parasites)][1:2],by="year"]
Note that .SD is the sub-table for each year value, as set by by=.
year parasites samples
1: 2000 1000 11
2: 2000 910 22
3: 2002 999 64
4: 2002 910 75
5: 2004 890 29
6: 2004 810 10
7: 2005 876 120
8: 2005 750 12

Here is a base R solution (if you need it):
data = data.frame("parasites"=c(1000,910,878,999,910,710,890,810,789,876,750,624),
                  "year"=c(2000,2000,2000,2002,2002,2002,2004,2004,2004,2005,2005,2005),
                  "samples"=c(11,22,13,64,75,16,29,10,9,120,12,157))
# order by year, then parasites, so the last 2 rows of each year have the highest parasite counts
data = data[order(data$year,data$parasites),]
data_list = lapply(unique(data$year),function(x) tail(data[data$year==x,],n=2))
final_data = do.call(rbind, data_list)
Hope that helps!
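For completeness, here is a dplyr sketch of the same selection (this assumes the parasites data frame read in above; slice_max() needs dplyr >= 1.0.0):
library(dplyr)
parasites %>%
  group_by(year) %>%
  slice_max(parasites, n = 2, with_ties = FALSE) %>%  # top 2 parasite counts per year
  ungroup()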

Related

Need to find a number whose column name matches with a row vector

I have a data set (df) below that needs a new variable which, for each row, takes the value of start_year, finds the column whose name is that year (the column names are years below), and pulls out that column's value for that row and type (newstart). I have many more years and types in my dataset than shown below, so something that works across any number of years is needed.
Here's what I have for data (a simplification):
st type_num start_year end_year 2000 2001 2002 2003 2004
il 1 2000 2004 10 220 9 10 100
il 2 2001 2004 220 100 220 100 100
il 3 2000 2004 400 400 10 220 220
ak 1 2001 2003 10 220 9 10 100
ak 2 2001 2004 220 100 220 100 100
ak 3 2000 2003 400 400 10 220 220
wa 1 2001 2003 10 220 9 10 100
wa 2 2001 2004 220 100 220 100 100
wa 3 2000 2003 400 400 10 220 220
wa 4 2002 2003 500 600 700 800 900
Here's what I need:
st type_num start_year end_year 2000 2001 2002 2003 2004 newstart newend
il 1 2000 2004 10 220 9 10 100 10 100
il 2 2001 2004 220 100 220 100 100 100 100
il 3 2000 2004 400 400 10 220 220 400 220
ak 1 2001 2003 10 220 9 10 100 10 10
ak 2 2001 2004 220 100 220 100 100 100 100
ak 3 2000 2003 400 400 10 220 220 400 220
wa 1 2001 2003 10 220 9 10 100 220 10
wa 2 2001 2004 220 100 220 100 100 100 100
wa 3 2000 2003 400 400 10 220 220 400 220
wa 4 2002 2003 500 600 700 800 900 700 800
I was trying to get those variables using a couple of indexes. I tried this:
which(colnames(df)==df$end_year[1])
which seems to grab the column number of the matching date column, but I wasn't able to figure out how to use it inside an apply() to do what this variable needs to do.
I also tried a less specific data set, for which I got suggestions to use rowwise() and get(), but that didn't quite work, perhaps because the data was less specific. Here I have tried to make something almost exactly like what I intend to use for my real output.
In the end I used pivot_longer() on the original data set I was merging in, which solved the issue.
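For reference, here is a minimal base R sketch of the row-wise lookup the question is attempting. The toy data frame below is an assumption (just two rows with the same shape as the data above); the matrix-indexing step works for any number of year columns:
# toy df with year columns named "2000".."2004" (assumed)
df <- data.frame(st = c("il", "ak"), type_num = c(1, 1),
                 start_year = c(2000, 2001), end_year = c(2004, 2003),
                 "2000" = c(10, 10), "2001" = c(220, 220), "2002" = c(9, 9),
                 "2003" = c(10, 10), "2004" = c(100, 100),
                 check.names = FALSE)
year_cols <- as.character(2000:2004)
m <- as.matrix(df[year_cols])
# for each row, pick the year column whose name matches start_year / end_year
df$newstart <- m[cbind(seq_len(nrow(df)), match(df$start_year, year_cols))]
df$newend   <- m[cbind(seq_len(nrow(df)), match(df$end_year, year_cols))]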

Selecting unique non-repeating values

I have some panel data from 2004-2007 which I would like to subset according to unique values. To be more precise, I'm trying to find the entries and exits of individual stores throughout the period. Data sample:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
As well as how many stores have entered the market, which would imply:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
UPDATE:
I didn't mention that it should also identify the incumbent stores, such as:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
Since I'm pretty new to R, I've been struggling to do this even on a year-by-year basis. Any suggestions?
Using the data.table package, if your data.frame is called df:
dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table
enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2004] #Or whatever the first year is for this table
UPDATE
To get all columns instead of just the year and store, you can do:
exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
store year rev space market
1: 2 2004 35000 800 136
2: 6 2006 7000 345 191
Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"],df["store"],max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(df["year"],df["store"],min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
or using formula syntax:
> subset(aggregate(year~store,df,max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(year~store,df,min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
Update: getting all the columns isn't possible with aggregate, so we can use base by instead. by isn't as clever at reassembling the result, though:
Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))
$`2`
store year rev space market
5 2 2004 35000 800 136
$`6`
store year rev space market
18 6 2006 7000 345 191
So we need to take that step - let's build a little wrapper:
by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}
And now use that instead:
> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}
> subset(by2(df,"store",statmatch("year",max)),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
Dplyr
Using all of these base functions, which don't really resemble each other, gets fiddly after a while, so it's a good idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
arrange(-year) %>% slice(1) %>%
filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]
store year rev space market
1 2 2004 35000 800 136
2 6 2006 7000 345 191
and
> df %>% group_by(store) %>%
arrange(+year) %>% slice(1) %>%
filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]
store year rev space market
1 4 2005 17500 320 136
2 5 2005 45000 580 191
3 7 2007 10000 500 191
NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.
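The question's UPDATE also asks for the incumbent stores; a possible dplyr sketch, keeping the assumption from above that 2004 and 2007 are the first and last years of the panel:
df %>% group_by(store) %>%
  filter(min(year) == 2004, max(year) == 2007) %>%
  ungroup()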

Subsetting panel data via unique values

I would like to segment panel data according to specific criterion and perform summary statistics on each segment. Data:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
From the example above I want to separate stores into entrants, exits and incumbents. So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
Have entered the market:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
And remained incumbent throughout the period:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
I'm not experienced enough with R to perform such a task, so any input would be appreciated.
Looks like a good excuse to use data.table:
library(data.table)
setDT(dat)
dat[, if(!max(dat$year) %in% year) tail(.SD,1) , by=store]
# store year rev space market
#1: 2 2004 35000 800 136
#2: 6 2006 7000 345 191
dat[, if(!min(dat$year) %in% year) head(.SD,1) , by=store]
# store year rev space market
#1: 4 2005 17500 320 136
#2: 5 2005 45000 580 191
#3: 7 2007 10000 500 191
dat[, if(min(dat$year) %in% year & max(dat$year) %in% year) .SD , by=store]
# store year rev space market
#1: 1 2004 110000 1095 136
#2: 1 2005 110000 1095 136
#3: 1 2006 110000 1095 136
#4: 1 2007 120000 1095 136
#5: 3 2004 45000 1000 136
#6: 3 2005 45000 1000 136
#7: 3 2006 45000 1000 136
#8: 3 2007 45000 1000 136
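A slightly more explicit variant of the same idea (a sketch under the same assumptions) precomputes the panel's first and last years; note that dat$year refers to the whole table, while year inside j refers to the current store's group:
first_year <- min(dat$year)
last_year <- max(dat$year)
dat[, if(!last_year %in% year) tail(.SD,1), by=store]                # exits
dat[, if(!first_year %in% year) head(.SD,1), by=store]               # entrants
dat[, if(first_year %in% year & last_year %in% year) .SD, by=store]  # incumbents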

Automating finding and converting values in r

I have a sample dataset with 45 rows, which is given below.
itemid title release_date
16 573 Body Snatchers 1993
17 670 Body Snatchers 1993
41 1645 Butcher Boy, The 1998
42 1650 Butcher Boy, The 1998
1 218 Cape Fear 1991
18 673 Cape Fear 1962
27 1234 Chairman of the Board 1998
43 1654 Chairman of the Board 1998
2 246 Chasing Amy 1997
5 268 Chasing Amy 1997
11 309 Deceiver 1997
37 1606 Deceiver 1997
28 1256 Designated Mourner, The 1997
29 1257 Designated Mourner, The 1997
12 329 Desperate Measures 1998
13 348 Desperate Measures 1998
9 304 Fly Away Home 1996
15 500 Fly Away Home 1996
26 1175 Hugo Pool 1997
39 1617 Hugo Pool 1997
31 1395 Hurricane Streets 1998
38 1607 Hurricane Streets 1998
10 305 Ice Storm, The 1997
21 865 Ice Storm, The 1997
4 266 Kull the Conqueror 1997
19 680 Kull the Conqueror 1997
22 876 Money Talks 1997
24 881 Money Talks 1997
35 1477 Nightwatch 1997
40 1625 Nightwatch 1997
6 274 Sabrina 1995
14 486 Sabrina 1954
33 1442 Scarlet Letter, The 1995
36 1542 Scarlet Letter, The 1926
3 251 Shall We Dance? 1996
30 1286 Shall We Dance? 1937
32 1429 Sliding Doors 1998
45 1680 Sliding Doors 1998
20 711 Substance of Fire, The 1996
44 1658 Substance of Fire, The 1996
23 878 That Darn Cat! 1997
25 1003 That Darn Cat! 1997
34 1444 That Darn Cat! 1965
7 297 Ulee's Gold 1997
8 303 Ulee's Gold 1997
What I am trying to do is convert the itemid based on the movie title, but only if the release date of the movie is the same. For example, the movie 'Ulee's Gold' has two item ids, 297 & 303. I am trying to find a way to automate the process of checking the release date of the movie and, if it is the same, replacing itemid[2] of that movie with itemid[1]. For the time being I have done it manually by extracting the itemids into two vectors x & y and then changing them using vectorization. I want to know if there is a better way of getting this task done, because there are only 18 movies with multiple ids in this sample but the full dataset has a few hundred; finding and processing these manually would be very time consuming.
I am providing the code that I have used to get this task done.
x <- c(670,1650,1654,268,1606,1257,348,500,1617,1607,865,680,881,1625,1680,1658,1003,303)
y <- c(573,1645,1234,246,309,1256,329,304,1175,1395,305,266,876,1477,1429,711,878,297)
for (i in 1:18) {
  # replace each old itemid (x) with its counterpart (y), matching on value rather than row position
  df$itemid[df$itemid == x[i]] <- y[i]
}
Is there a better way to get this done?
I think you can do it in dplyr straightforwardly:
Using your comment above, a brief example:
itemid <- c(878,1003,1444,297,303)
title <- c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2))
year <- c(1997,1997,1965,1997,1997)
temp <- data.frame(itemid,title,year)
temp
library(dplyr)
temp %>% group_by(title,year) %>% mutate(itemid1 = min(itemid))
(I changed 'release_date' to 'year' here.) This basically groups by title and year, finds the minimum itemid, and the mutate creates a new variable, itemid1, with this lowest itemid.
which gives:
# itemid title year itemid1
#1 878 That Darn Cat! 1997 878
#2 1003 That Darn Cat! 1997 878
#3 1444 That Darn Cat! 1965 1444
#4 297 Ulee's Gold 1997 297
#5 303 Ulee's Gold 1997 297
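Applied back to the original data frame from the question, a sketch (assuming it is called df and that you want to overwrite itemid in place rather than add a new column):
library(dplyr)
df <- df %>%
  group_by(title, release_date) %>%
  mutate(itemid = min(itemid)) %>%
  ungroup()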

R - cumulative sum by condition

So I have a dataset which, simplified, looks something like this:
Year ID Sum
2009 999 100
2009 123 85
2009 666 100
2009 999 100
2009 123 90
2009 666 85
2010 999 100
2010 123 100
2010 666 95
2010 999 75
2010 123 100
2010 666 85
I'd like to add a column with the cumulative sum, by year and ID. Like this:
Year ID Sum Cum.Sum
2009 999 100 100
2009 123 85 85
2009 666 100 100
2009 999 100 200
2009 123 90 175
2009 666 85 185
2010 999 100 100
2010 123 100 100
2010 666 95 95
2010 999 75 175
2010 123 100 200
2010 666 85 180
I think this should be pretty straightforward, but somehow I haven't been able to figure it out. How do I do this? Thanks for the help!
Using data.table:
require(data.table)
DT <- data.table(DF)
DT[, Cum.Sum := cumsum(Sum), by=list(Year, ID)]
Year ID Sum Cum.Sum
1: 2009 999 100 100
2: 2009 123 85 85
3: 2009 666 100 100
4: 2009 999 100 200
5: 2009 123 90 175
6: 2009 666 85 185
7: 2010 999 100 100
8: 2010 123 100 100
9: 2010 666 95 95
10: 2010 999 75 175
11: 2010 123 100 200
12: 2010 666 85 180
Another way is plyr's ddply, which can apply cumsum within each (Year, ID) group directly (note that ddply returns the rows reordered by the grouping variables):
require(plyr)
dataset <- ddply(dataset, .(Year, ID), transform, Cum.Sum = cumsum(Sum))
You can use dplyr, and the base function cumsum:
require(dplyr)
dataset %>%
group_by(Year, ID) %>%
mutate(cumsum = cumsum(Sum)) %>%
ungroup()
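For reference, a base R sketch using ave() (assuming the data frame is called DF, as in the data.table answer above):
DF$Cum.Sum <- ave(DF$Sum, DF$Year, DF$ID, FUN = cumsum)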
