Selecting unique non-repeating values in R

I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise, I'm trying to identify entries and exits of individual stores throughout the period. Data sample:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
As well as how many stores have entered the market, which would imply:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
UPDATE:
I didn't mention that it should also identify incumbent stores (present throughout the whole period), such as:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
Since I'm pretty new to R, I've been struggling to get this right even on a year-by-year basis. Any suggestions?
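For reference, a minimal sketch that reconstructs the sample above as a data frame named df (the name the answers below assume), so the snippets can be run as-is:
df <- read.table(header = TRUE, text = "
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191")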

Using the data.table package, if your data.frame is called df:
library(data.table)
dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table
enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2004] #Or whatever the first year covered by this table is
UPDATE
To get all columns instead of just the year and store, you can do:
exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
store year rev space market
1: 2 2004 35000 800 136
2: 6 2006 7000 345 191
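For the incumbents mentioned in the update, a sketch along the same lines (assuming 2004 and 2007 are the first and last years of the panel):
# stores observed in both the first (2004) and last (2007) year of the panel
incumbent = dt[, if (min(year) == 2004 & max(year) == 2007) .SD, by = store]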

Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"],df["store"],max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(df["year"],df["store"],min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
or using formula syntax:
> subset(aggregate(year~store,df,max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(year~store,df,min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
Update: Getting all the columns isn't possible with aggregate, so we can use base by instead. by isn't as clever at reassembling the result into a data frame:
Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))
$`2`
store year rev space market
5 2 2004 35000 800 136
$`6`
store year rev space market
18 6 2006 7000 345 191
So we need to do that reassembly ourselves. Let's build a little wrapper:
by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}
And now use that instead:
> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}
> subset(by2(df,"store",statmatch("year",max)),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
dplyr
Using all of these base functions, which don't really resemble each other, starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
arrange(-year) %>% slice(1) %>%
filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]
store year rev space market
1 2 2004 35000 800 136
2 6 2006 7000 345 191
and
> df %>% group_by(store) %>%
arrange(+year) %>% slice(1) %>%
filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]
store year rev space market
1 4 2005 17500 320 136
2 5 2005 45000 580 191
3 7 2007 10000 500 191
NB: The ungroup is not strictly necessary here, but it puts the table back in a default state for further calculations.
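For the incumbents from the question's update, a similar dplyr sketch (assuming 2004 and 2007 are the first and last years in the data):
df %>% group_by(store) %>%
  filter(min(year) == 2004, max(year) == 2007) %>%
  ungroup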

Related

Need to find a number whose column name matches with a row vector

I have a data set (df) below and need a new variable which, for each row, looks at start_year, finds the column whose name matches that year, and grabs the value in that column for that type (newstart). I have many more years and types in my dataset than shown below, so I need something that works across any number of years.
Here's what I have for data (a simplification):
st  type_num  start_year  end_year  2000  2001  2002  2003  2004
il  1         2000        2004      10    220   9     10    100
il  2         2001        2004      220   100   220   100   100
il  3         2000        2004      400   400   10    220   220
ak  1         2001        2003      10    220   9     10    100
ak  2         2001        2004      220   100   220   100   100
ak  3         2000        2003      400   400   10    220   220
wa  1         2001        2003      10    220   9     10    100
wa  2         2001        2004      220   100   220   100   100
wa  3         2000        2003      400   400   10    220   220
wa  4         2002        2003      500   600   700   800   900
Here's what I need:
st  type_num  start_year  end_year  2000  2001  2002  2003  2004  newstart  newend
il  1         2000        2004      10    220   9     10    100   10        100
il  2         2001        2004      220   100   220   100   100   100       100
il  3         2000        2004      400   400   10    220   220   400       220
ak  1         2001        2003      10    220   9     10    100   10        10
ak  2         2001        2004      220   100   220   100   100   100       100
ak  3         2000        2003      400   400   10    220   220   400       220
wa  1         2001        2003      10    220   9     10    100   220       10
wa  2         2001        2004      220   100   220   100   100   100       100
wa  3         2000        2003      400   400   10    220   220   400       220
wa  4         2002        2003      500   600   700   800   900   700       800
I was trying to get those variables using a couple of indexes, and tried this:
which(colnames(df) == df$end_year[1])
which seems to grab the column number of the matching date column, but I wasn't able to figure out how to use it in an apply() to do what this variable needs to do.
I also posted a less specific version of this data set and got suggestions to use rowwise() and get(), but those didn't quite work, perhaps because that data was too simplified. The example here is almost exactly what I intend to use for my real output.
I ended up using pivot_longer() on the original data set I was merging in, which solved the issue.
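For reference, a sketch of the rowwise()/get() idea mentioned above. It assumes the year columns are literally named 2000 ... 2004 (e.g. a tibble, or a data frame read with check.names = FALSE) and is one possible approach rather than the exact solution used:
library(dplyr)
df %>%
  rowwise() %>%
  # look up the column whose name matches this row's start_year / end_year
  mutate(newstart = get(as.character(start_year)),
         newend   = get(as.character(end_year))) %>%
  ungroup()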

Importing Data in R

I want to import data into R but I am getting a few errors. I downloaded my .csv file to my computer, set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), and then ran read.data <- read.csv("file1.csv"), but the console returns an error like this:
"read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do about this? I also tried reading the data from a web link, but again I encountered a problem.
I wrote like this:
install.packages("XML")
install.packages("RCurl")
and then, to load the packages:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console wrote me this error;
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this.
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
  # convert years 2017 - 2020 to character because pivot_longer()
  # requires all columns to be of the same data type
  mutate_at(3:6, as.character) %>%
  pivot_longer(-c(Classification, Jurisdiction),
               names_to = "Year", values_to = "Rank") %>%
  # convert Rank and Year to numeric values (introducing NA values)
  mutate_at(c("Rank", "Year"), as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
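As a quick check of the tidy data, we can, for example, pull out a single jurisdiction's trajectory over time (a usage sketch based on the rankings object created above):
# New Zealand's rank by year, using the dplyr verbs already loaded
rankings %>%
  filter(Jurisdiction == "New Zealand") %>%
  arrange(Year)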

Creating new column based on row values of multiple data subsetting conditions

I have a dataframe that looks more or less as follows (the original one has 12 years of data):
Year Quarter Age_1 Age_2 Age_3 Age_4
2005 1 158 120 665 32
2005 2 257 145 121 14
2005 3 68 69 336 65
2005 4 112 458 370 101
2006 1 75 457 741 26
2006 2 365 134 223 45
2006 3 257 121 654 341
2006 4 175 124 454 12
2007 1 697 554 217 47
2007 2 954 987 118 54
2007 4 498 235 112 65
Where the numbers in the age columns represents the amount of individuals in each age class for a specific quarter within a specific year. It is noteworthy that sometimes not all quarters in a specific year have data (e.g., third quarter is not represented in 2007). Also, each row represents a sampling event. Although not shown in this example, in the original dataset I always have more than one sampling event for a specific quarter within a specific year. For example, for the first quarter in 2005 I have 47 sampling events, leading therefore to 47 rows.
What I'd like to have now is a dataframe structured like this:
Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort
2005 1 158 120 665 32 158
2005 2 257 145 121 14 257
2005 3 68 69 336 65 68
2005 4 112 458 370 101 112
2006 1 75 457 741 26 457
2006 2 365 134 223 45 134
2006 3 257 121 654 341 121
2006 4 175 124 454 12 124
2007 1 697 554 217 47 47
2007 2 954 987 118 54 54
2007 4 498 235 112 65 65
In this case, I want to create a new column (Cohort) in my original dataset which basically follows my cohorts through the dataset. In other words, in my first year of data (2005, all quarters), I take the row values of Age_1 and paste them into the new column. When I move to the next year (2006), I take all the row values of Age_2 and paste them into the new column, and so on.
I have tried to use the following function, but somehow it only works for the first couple of years:
extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {
  ny <- 1:nlevels(d$Year)      # no. of Year levels in the dataset
  nq <- 1:nlevels(d$Quarter)
  age0 <- paste("age", ny, sep="_")
  year0 <- as.character(yearclass + ny - 1)
  quarter <- as.character(rep(1:4, length(age0)))
  age <- rep(age0, each=4)
  year <- rep(year0, each=4)
  df <- data.frame(year, age, quarter, stringsAsFactors=FALSE)
  n <- nrow(df)
  dnew <- NULL
  for (i in 1:n) {
    tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
    tmp$Cohort <- tmp[[age[i]]]
    dnew <- rbind(dnew, tmp)
  }
  levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
                             year, ":", quarter, ":", age, sep="")
  dnew
}
I have plenty of data from age_1 to age_12 for all the years and quarters, so I don't think that it's something related to the data structure itself.
Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.
-M
I have a simple solution, but it demands a bit of knowledge of the data.table library. I think you can easily adapt it to your further needs.
Here is the data:
library(data.table)
DT <- as.data.table(list(Year    = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007),
                         Quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 4),
                         Age_1   = c(158, 257, 68, 112, 75, 365, 257, 175, 697, 954, 498),
                         Age_2   = c(120, 145, 69, 458, 457, 134, 121, 124, 554, 987, 235),
                         Age_3   = c(665, 121, 336, 370, 741, 223, 654, 454, 217, 118, 112),
                         Age_4   = c(32, 14, 65, 101, 26, 45, 341, 12, 47, 54, 65)))
Here is the code:
DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]
and the output:
> DT
Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
1: 2005 1 158 120 665 32 1 158
2: 2005 2 257 145 121 14 1 257
3: 2005 3 68 69 336 65 1 68
4: 2005 4 112 458 370 101 1 112
5: 2006 1 75 457 741 26 2 457
6: 2006 2 365 134 223 45 2 134
7: 2006 3 257 121 654 341 2 121
8: 2006 4 175 124 454 12 2 124
9: 2007 1 697 554 217 47 3 217
10: 2007 2 954 987 118 54 3 118
11: 2007 4 498 235 112 65 3 112
What it does:
DT[,index := .GRP, by = Year]
creates an index for each distinct year in your table (by = Year performs the operation per group of Year; .GRP creates an index following the grouping sequence).
I then use that index to fetch the column named Age_ followed by the number created:
DT[,cohort := get(paste0("Age_",index)),by = Year]
You can even do everything in a single line:
DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
I hope it helps
Here is an option using the tidyverse:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, Cohort, -Year, -Quarter) %>%
  separate(key, into = c('key1', 'key2')) %>%
  mutate(ind = match(Year, unique(Year))) %>%
  group_by(Year) %>%
  filter(key2 == Quarter[ind]) %>%
  mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
  ungroup %>%
  select(Cohort, newcol) %>%
  bind_cols(df1, .)
# Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort newcol
#1 2005 1 158 120 665 32 158 2005:1:Age_1
#2 2005 2 257 145 121 14 257 2005:2:Age_1
#3 2005 3 68 69 336 65 68 2005:3:Age_1
#4 2005 4 112 458 370 101 112 2005:4:Age_1
#5 2006 1 75 457 741 26 457 2006:1:Age_2
#6 2006 2 365 134 223 45 134 2006:2:Age_2
#7 2006 3 257 121 654 341 121 2006:3:Age_2
#8 2006 4 175 124 454 12 124 2006:4:Age_2
#9 2007 1 697 554 217 47 47 2007:1:Age_3
#10 2007 2 954 987 118 54 54 2007:2:Age_3
#11 2007 4 498 235 112 65 65 2007:4:Age_3
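A shorter variant of the same idea, using a row-wise get() lookup that mirrors the data.table answer above (a sketch; it assumes the age columns are named Age_1, Age_2, ... and that there is one for every year index):
library(dplyr)
df1 %>%
  mutate(ind = match(Year, unique(Year))) %>%   # 1 for the first year, 2 for the second, ...
  rowwise() %>%
  mutate(Cohort = get(paste0("Age_", ind))) %>% # pick that row's Age_<ind> value
  ungroup() %>%
  select(-ind)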

Subsetting panel data via unique values

I would like to segment panel data according to specific criteria and perform summary statistics on each segment. Data:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
From the example above I want to separate stores into entrants, exits and incumbents. So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
Stores that have entered the market:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
And stores that remained incumbent throughout the period:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
I'm not experienced enough with R to perform such a task, so any input would be appreciated.
Looks like a good excuse to use data.table:
library(data.table)
setDT(dat)
dat[, if(!max(dat$year) %in% year) tail(.SD,1) , by=store]
# store year rev space market
#1: 2 2004 35000 800 136
#2: 6 2006 7000 345 191
dat[, if(!min(dat$year) %in% year) head(.SD,1) , by=store]
# store year rev space market
#1: 4 2005 17500 320 136
#2: 5 2005 45000 580 191
#3: 7 2007 10000 500 191
dat[, if(min(dat$year) %in% year & max(dat$year) %in% year) .SD , by=store]
# store year rev space market
#1: 1 2004 110000 1095 136
#2: 1 2005 110000 1095 136
#3: 1 2006 110000 1095 136
#4: 1 2007 120000 1095 136
#5: 3 2004 45000 1000 136
#6: 3 2005 45000 1000 136
#7: 3 2006 45000 1000 136
#8: 3 2007 45000 1000 136
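For readers more comfortable with dplyr, a sketch of the same three subsets (assuming dat is the data shown above, with min/max of year giving the panel's first and last years):
library(dplyr)
first_year <- min(dat$year)
last_year  <- max(dat$year)
exits <- dat %>%
  group_by(store) %>%
  filter(max(year) < last_year) %>%   # stores not observed in the last year
  slice_max(year, n = 1) %>%          # keep their final observation
  ungroup()
entrants <- dat %>%
  group_by(store) %>%
  filter(min(year) > first_year) %>%  # stores not observed in the first year
  slice_min(year, n = 1) %>%          # keep their first observation
  ungroup()
incumbents <- dat %>%
  group_by(store) %>%
  filter(min(year) == first_year, max(year) == last_year) %>%  # present throughout
  ungroup()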

Selecting observations based on a condition dependent on a grouped variable

I have a question that I am hoping someone will help me answer. I have a data set ordered by parasites and year that looks something like this (the actual dataset is much larger):
parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157
What I would like to do is, for every year, select the 2 samples with the highest number of parasites, to give an output that looks like this:
parasites year samples
1000 2000 11
910 2000 22
999 2002 64
910 2002 75
890 2004 29
810 2004 10
876 2005 120
750 2005 12
I am new to programming as a whole and still trying to find my way around R. Can someone please explain how I would go about this? Thanks so much.
How about with data.table:
parasites<-read.table(header=T,text="parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157")
EDIT: sorry, sorted by parasites, not samples
require(data.table)
data.table(parasites)[,.SD[order(-parasites)][1:2],by="year"]
Note .SD is the sub-table for each year value as set in by=
year parasites samples
1: 2000 1000 11
2: 2000 910 22
3: 2002 999 64
4: 2002 910 75
5: 2004 890 29
6: 2004 810 10
7: 2005 876 120
8: 2005 750 12
Here is a base R solution (if you need it):
data = data.frame("parasites"=c(1000,910,878,999,910,710,890,810,789,876,750,624),
                  "year"=c(2000,2000,2000,2002,2002,2002,2004,2004,2004,2005,2005,2005),
                  "samples"=c(11,22,13,64,75,16,29,10,9,120,12,157))
# sort by year, then parasites, so the last two rows within each year are the top 2
data = data[order(data$year, data$parasites),]
data_list = lapply(unique(data$year), function(x) tail(data[data$year == x,], n = 2))
final_data = do.call(rbind, data_list)
Hope that helps!
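For completeness, a dplyr sketch of the same selection, reusing the parasites data frame built in the first answer (slice_max() needs dplyr >= 1.0):
library(dplyr)
parasites %>%
  group_by(year) %>%
  slice_max(parasites, n = 2) %>%   # top 2 rows by parasites within each year
  ungroup()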
