R - create a panel dataset from 2 cross-sectional datasets

Could you please help me with the following task of creating a panel dataset from two cross-sectional datasets?
Specifically, here is a small portion of each cross-sectional dataset:
1) data1

ID | Yr   | D | X
-------------------
1  | 2002 | F | 25
2  | 2002 | T | 27

2) data2

ID | Yr   | D | X
-------------------
1  | 2003 | T | 45
2  | 2003 | F | 35
And I would like to create a panel of the form:

ID | Yr   | D | X
-------------------
1  | 2002 | F | 25
1  | 2003 | T | 45
2  | 2002 | T | 27
2  | 2003 | F | 35
The code I have tried so far is:
IDvec <- data1[, 1]
ID_panel <- c()
for (i in 1:length(IDvec)) {
  x <- rep(IDvec[i], 2)
  ID_panel <- append(ID_panel, x)
}
Years_panel <- rep(2002:2003, length(IDvec))
But I cannot quite figure out how to link the 3rd and 4th columns. Any help would be greatly appreciated. Thank you.

Assuming that you want to concatenate the data frames and then order by ID and Yr, here's a dplyr approach:
library(dplyr)

data1 %>%
  bind_rows(data2) %>%
  arrange(ID, Yr)
  ID   Yr D  X
1  1 2002 F 25
2  1 2003 T 45
3  2 2002 T 27
4  2 2003 F 35
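
For reference, the same result in base R — a minimal sketch, assuming both data frames have identical column names:

# Stack the rows of the two cross-sections, then sort by ID and year
panel <- rbind(data1, data2)
panel <- panel[order(panel$ID, panel$Yr), ]
panel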

Groupby and filter by max value in pandas

I am working on a dataframe that looks like the following (the "ID 2" column was added in an edit):
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
|     1 |  5 | 2012 |  111 |
|     1 |  5 | 2013 |  112 |
|     0 | 12 | 2017 |  113 |
|     0 | 12 | 2022 |  114 |
|     1 | 27 | 2005 |  115 |
|     1 | 27 | 2011 |  116 |
+-------+----+------+------+
Using only the rows where "Value" == 1 ("Value" is boolean), I would like to group the dataframe by ID and write the string "Latest" into a new (blank) column, giving the following output:
+-------+----+------+--------+
| Value | ID | Date | Latest |
+-------+----+------+--------+
|     1 |  5 | 2012 |        |
|     1 |  5 | 2013 | Latest |
|     0 | 12 | 2017 |        |
|     0 | 12 | 2022 |        |
|     1 | 27 | 2005 |        |
|     1 | 27 | 2011 | Latest |
+-------+----+------+--------+
The syntax of pandas is throwing me off, as I am fairly new to Python.
In R I suppose I would be trying something like
df %>% filter(Value == 1) %>% group_by(ID) %>% filter(Date == max(Date))
but I am not sure of the syntax in pandas. I am trying to first select the subset of rows that meets the criterion "Value == 1" by using
q = df.query('Value == 1')
my_query_index = q.index
my_query_index
This returns the index of all the matching rows, but I am not sure how to incorporate it into the dataframe before grouping and filtering by max(Date).
All help appreciated. Thank you.
EDIT: I used the pinned answer as follows:
latest = df.query('Value == 1').groupby("ID").max("Date").assign(Latest = "Latest")
df = pd.merge(df, latest, how="outer")
df
But I have since realized that some of the max dates are tied, i.e. there could be 4 rows all sharing the max year 2017. As a tiebreaker, I need to use the max "ID 2" within each group. I have added .groupby("ID 2").max("ID 2") to the line of code, giving
latest = df.query('Value == 1').groupby("ID").max("Date").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df, latest, how="outer")
df
but it gives me a dataframe completely different from the one desired.
Thank you for your help; all is appreciated.
You can do this:
# take the max Date per ID among the Value == 1 rows, tag it, and merge back
latest = df.query('Value == 1').groupby('ID', as_index=False)['Date'].max().assign(Latest='Latest')
pd.merge(df, latest, how='left')
   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest
Another approach: sort by 'ID' and then by 'Date'; use duplicated(keep='last') to identify the last row in each group; then use loc to assign in the right spot.
df = df.sort_values(['ID', 'Date'])
mask1 = df.Value.eq(1)
mask2 = ~df.ID.duplicated(keep='last')
df.loc[mask1 & mask2, 'Latest'] = 'Latest'
df
   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest
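For the tie-breaking case described in the edit, this approach extends naturally. A sketch, assuming ties on Date should be broken by the larger "ID 2" value, and that (as in the sample) Value is constant within each ID:
df = df.sort_values(['ID', 'Date', 'ID 2'])   # "ID 2" breaks ties on Date
mask1 = df.Value.eq(1)
mask2 = ~df.ID.duplicated(keep='last')        # last row per ID = max Date, then max "ID 2"
df.loc[mask1 & mask2, 'Latest'] = 'Latest'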
One option is to group by ID, use transform to get each group's max Date, then build the new column with a conditional np.where:
import numpy as np

max_values = df.groupby("ID").Date.transform("max")
df.assign(Latest=np.where(df.Date.eq(max_values) & df.Value.eq(1), "Latest", ""))
   Value  ID  Date  Latest
0      1   5  2012
1      1   5  2013  Latest
2      0  12  2017
3      0  12  2022
4      1  27  2005
5      1  27  2011  Latest

Subsetting a table in R

In R, I've created a three-dimensional table from a dataset. The three variables are all factors, labelled H, O, and S. This is the code I used to create the table:
attach(df)
test <- table(H, O, S)
Outputting the flattened table produces the table below. The two values of S were split into separate columns, labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H         | O         | S1  | S2  |
+-----------+-----------+-----+-----+
| Isolation | Dead      |   2 |  15 |
|           | Sick      |  64 |  20 |
|           | Recovered | 153 | 379 |
| ICU       | Dead      |   0 |  15 |
|           | Sick      |   0 |   2 |
|           | Recovered |   1 |   9 |
| Other     | Dead      |   7 | 133 |
|           | Sick      |   4 |  20 |
|           | Recovered |  17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
|           | Sick | Recovered |
+-----------+------+-----------+
| Isolation |   64 |       153 |
| ICU       |    0 |         1 |
+-----------+------+-----------+
S = S1
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I found a much simpler method. All I needed to do was index the specific levels in each dimension. The solution is below:
> test[1:2, 2:3, 1]
           O
H           Sick Recovered
  Isolation   64       153
  ICU          0         1
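Since a table object is just an array with dimnames, the same subset can also be written with level names, which reads better. A sketch using the labels from the tables above ("S1" is my assumption for the label of the first level of S; the exact dimnames depend on how the factors were defined):
# Index by level names rather than positions
test[c("Isolation", "ICU"), c("Sick", "Recovered"), "S1"]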
Subset the data before running table, for example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   3        0  1
#     4        0  8
#     5        1  1
# 6   3        0  2
#     4        2  2
#     5        1  0
# 8   3       12  0
#     4        0  0
#     5        2  0

# subset, then run table
ftable(table(mtcars[mtcars$gear == 4, c("cyl", "gear", "vs")]))
#          vs  0  1
# cyl gear
# 4   4        0  8
# 6   4        2  2

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications and then convert the subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I have no data; hence no proportion table can be calculated. What I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here are some insights into my dataframe and function.
My raw dataframe looks more or less as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year & quarter) and area for which X number of individuals were measured (no_individuals). For example, the first row says that in the first quarter of 2005, in area 24, 8 individuals were measured belonging to a length class (lenCls) of 380 mm with age = 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # expand each row by its number of individuals
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  # plot, then return the key
  if (alkplot) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year = 2005 & quarter = 3 & area = 21 I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year) and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that in some cases I have no data for a particular year at all! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter and would need to borrow from the previous year instead. This would mean that for year = 2007 & area = 24 & quarter = 1 I would borrow the information from year = 2006 & area = 24 & quarter = 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is the LAK function I'm trying to update:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # in case of an empty subset (no rows, or only NA placeholder rows)
  if (nrow(sALK) == 0 || any(rowSums(is.na(sALK)) > 0)) {
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  if (alkplot) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
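One way to start filling in that FIXME, sketched as a hypothetical helper (the function name, the NA convention and the quarter-distance rule are my assumptions, not part of the original code): find the nearest quarter of the same year and area that actually has measurements, and subset that instead.
# Hypothetical helper: nearest non-empty quarter for a given year/area.
# "Non-empty" here means rows where no_individuals is not NA.
nearest_quarter_subset <- function(df, Year, Quarter, Area) {
  cand <- subset(df, year == Year & area == Area & !is.na(no_individuals))
  if (nrow(cand) == 0) return(cand)  # nothing to borrow within this year/area
  q <- as.numeric(as.character(cand$quarter))
  best <- q[which.min(abs(q - as.numeric(Quarter)))]  # smallest quarter distance wins
  cand[q == best, ]
}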
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter, area and species
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  print(sALK)
  if (nrow(sALK) == 1) {
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year == syear & area == sarea)
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"
    idx <- which(vals$Freq > 1)
    quarterId <- as.numeric(as.character(vals[idx, "comb_index"]))
    imput <- subset(df, year == syear & area == sarea & comb_index == quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin = 1), 3)
    print(key2)
    if (alkplot) {
      alkPlot(key2, "area", xlab = "Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin = 1), 3)
    print(key)
    if (alkplot) {
      alkPlot(key, "area", xlab = "Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet, I'm still struggling to figure out what to do when I have no data for a particular year & area combination at all. In that case I need to borrow data from the closest year that contains data for all the quarters of the same area.
For the example above, this would mean that for year = 2007 & area = 24 & quarter = 1 I would borrow the information from year = 2006 & area = 24 & quarter = 1, and so on and so forth.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data are distributed, so that you can choose the method best suited to your problem. Check the brief explanation and the original description of the mice package.
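A minimal sketch of what that could look like on the dataframe above (the column selection and method choice are illustrative assumptions; mice picks sensible defaults per column type):
library(mice)

# Impute the measurement columns; pmm (predictive mean matching) is
# mice's default for numeric data. m = 5 completed datasets are built;
# complete(imp, 1) extracts the first one.
imp <- mice(df[, c("no_individuals", "lenCls", "age")],
            method = "pmm", m = 5, seed = 1)
df_imputed <- complete(imp, 1)
Note that this imputes from the other columns rather than from the nearest time step, so it is a different modelling choice than the borrowing rule described in the question.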

Calculating percentage changes in data frame subsets using a base year

I have the following dataset with information on the sales figures of two companies over a period of five years. I want to use the first year's figure as a baseline and calculate the percentage change in sales for each subsequent year for each company. I use the following:
transform(dataset, SalesD = unlist(aggregate(Sales ~ Company, function(x) ((x - x[1]) / x[1]) * 100, data=Dataset)$Sales))
yet I do not get the correct values for the second company (I expect the value in row 6 to be zero, as this is its base year). Here are the results:
+----+---------+------+--------+--------+
| ID | Company | Year | Sales  | SalesD |
+----+---------+------+--------+--------+
|  1 | LSL     | 2015 | 100000 |      0 |
|  2 | LSL     | 2016 | 120000 |     20 |
|  3 | LSL     | 2017 | 150000 |     50 |
|  4 | LSL     | 2018 | 100000 |      0 |
|  5 | LSL     | 2019 |  50000 |    -50 |
|  6 | IDA     | 2015 | 150000 |     50 |
|  7 | IDA     | 2016 | 180000 |     80 |
|  8 | IDA     | 2017 | 200000 |    100 |
|  9 | IDA     | 2018 | 180000 |     80 |
| 10 | IDA     | 2019 | 160000 |     60 |
+----+---------+------+--------+--------+
Could you help me point out what is wrong in the code?
Many thanks!
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by "Company", compute the percentage change by subtracting the base-year "Sales" (the value corresponding to the minimum "Year") from "Sales", dividing by that base value, multiplying by 100, rounding, and assigning (:=) the result to create "SalesD".
library(data.table)
setDT(df1)[, SalesD := round(100 * (Sales - Sales[which.min(Year)]) /
                               Sales[which.min(Year)]), Company]
df1
#     ID Company Year  Sales SalesD
#  1:  1     LSL 2015 100000      0
#  2:  2     LSL 2016 120000     20
#  3:  3     LSL 2017 150000     50
#  4:  4     LSL 2018 100000      0
#  5:  5     LSL 2019  50000    -50
#  6:  6     IDA 2015 150000      0
#  7:  7     IDA 2016 180000     20
#  8:  8     IDA 2017 200000     33
#  9:  9     IDA 2018 180000     20
# 10: 10     IDA 2019 160000      7
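
For comparison, a dplyr sketch of the same per-group computation (assuming the data frame is df1, as above; the base year is the minimum Year within each Company, matching the data.table code):
library(dplyr)

df1 %>%
  group_by(Company) %>%
  mutate(SalesD = round(100 * (Sales - Sales[which.min(Year)]) /
                          Sales[which.min(Year)])) %>%
  ungroup()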

Interpolate variables on subsets of dataframe

I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA    | 2000        | 1     | 23   | 1.2   | 43
CA    | 2001        | 2     | 43   | 1.4   | 52
CA    | 2002        | 5     | 53   | 3.2   | 61
...
CA    | 1998        | 3     | 12   | 2.3   | 20
CA    | 1999        | 4     | 14   | 2.8   | 25
CA    | 2003        | 5     | 19   | 4.3   | 29
...
ND    | 2000        | 2     | 223  | 3.2   | 239
ND    | 2001        | 4     | 233  | 4.2   | 321
ND    | 2003        | 7     | 256  | 7.9   | 387
For each state/survey.year combination, I would like to interpolate obs2 so that its time location lines up with (time1, obs1).
I.e., I would like to break the dataframe up into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this, but I keep getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5 * (ncol(indirect) - 1))) {
  df[, "new.obs2"] <- approxExtrap(df[, "time1"],
                                   df[, "obs1"],
                                   xout = df[, "obs2"],
                                   method = "linear",
                                   rule = 2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially, I am just trying to interpolate obs2 within each state/survey.year combination so that its time references line up with those of obs1.
Of course, if there's a slick way to do this without invoking plyr functions, I'd be open to that too.
Thank you!
This should be as simple as
library(plyr)
library(Hmisc)
ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time1, obs1, xout = obs2,
                              method = "linear",
                              rule = 2)$y)  # approxExtrap() returns list(x, y); keep the y values
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)
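
If plyr feels dated, here is a dplyr sketch of the same split-apply-combine, under the assumption that the stated goal is to evaluate the (time2, obs2) series at the time1 points (note this swaps the arguments relative to the ddply call above):
library(dplyr)
library(Hmisc)

df %>%
  group_by(state, survey.year) %>%
  mutate(new.obs2 = approxExtrap(time2, obs2, xout = time1,
                                 method = "linear", rule = 2)$y) %>%
  ungroup()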
