I'm trying to create two different subsets from a big data frame. Here's the code:
amelioration <- subset(resilience,
(subset(resilience, resilience$YEAR==2010)$SCORE>
subset(resilience, resilience$YEAR==2009)$SCORE))
The purpose is to select the observations which have a better score in 2010 than in 2009.
The problem is that in the new table (amelioration), the rows for 2009 and 2010 are missing, but I want to keep them.
It should look like this:
ID | YEAR | SCORE
--------------------
177 | 2008 | 14
--------------------
177 | 2009 | 11
--------------------
177 | 2010 | 17
--------------------
177 | 2011 | 16
But it looks like this instead:
ID | YEAR | SCORE
--------------------
177 | 2008 | 14
--------------------
177 | 2011 | 16
I tried with the which() function, but it doesn't work either. What should I do?
Using data.table you can recast YEAR over SCORE, filter directly on the year columns, and then keep every row for the qualifying IDs:
library(data.table)
resilience <- data.table(resilience)
castedDF <- dcast(resilience, ID ~ YEAR, value.var = "SCORE", fun.aggregate = sum)
targetIDs <- castedDF[`2010` > `2009`, ID]
amelioration <- resilience[ID %in% targetIDs]
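If you'd rather stay in base R, here is a sketch of the same idea (assuming each ID has at most one row per year): align the two years with match() and keep every row for the qualifying IDs.
s10 <- subset(resilience, YEAR == 2010)
s09 <- subset(resilience, YEAR == 2009)
# IDs whose 2010 score beats their 2009 score (which() drops NA comparisons)
better <- s10$ID[which(s10$SCORE > s09$SCORE[match(s10$ID, s09$ID)])]
amelioration <- subset(resilience, ID %in% better)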
Suppose I have a dataframe such as the one below:
people.dat <- data.frame("ID" = c(2001, 1001, 2005, 2001, 5000),
                         "Data" = c(100, 300, 500, 900, 200))
Which looks something like this
+------+------+
| ID | Data |
+------+------+
| 2001 | 100 |
| 1001 | 300 |
| 2005 | 500 |
| 2001 | 900 |
| 5000 | 200 |
+------+------+
Suppose the first thing I do is work out how many unique ID values are in the dataframe (this is necessary, due to the size of the real dataset in question)
unique_ids <- sort(c(unique(people.dat$ID)))
Which gives
[1] 1001 2001 2005 5000
Where I get stuck is that I would like to add a new column, say "new_id", which looks at the "ID" value in each row, finds its position in unique_ids, and assigns that position (so the "new_id" column holds values ranging from 1 to length(unique_ids)).
An example of the output would be as follows
+------+------+--------+
| ID | Data | new_id |
+------+------+--------+
| 2001 | 100 | 2 |
| 1001 | 300 | 1 |
| 2005 | 500 | 3 |
| 2001 | 900 | 2      |
| 5000 | 200 | 4 |
+------+------+--------+
I thought about using a for loop with if statements, but my first attempts didn't quite hit the mark. If I just wanted to replace "ID" with a sequential value, the following code would work; where I get stuck is that I want to keep "ID" and add a separate "new_id" column:
for (i in seq_along(unique_ids)){
  people.dat$ID[people.dat$ID == unique_ids[i]] <- i
}
Thank you for any help. I hope I have made the question as clear as possible (I struggled to phrase some of it, so please let me know if anything specific needs clarifying).
This is essentially a 'rank' problem:
people.dat$new_id <- as.numeric(factor(people.dat$ID))
people.dat
    ID Data new_id
1 2001  100      2
2 1001  300      1
3 2005  500      3
4 2001  900      2
5 5000  200      4
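Alternatively, since unique_ids is already sorted, match() gives the same positional lookup directly (factor() numbers its levels in the same sorted order); a one-line sketch:
# position of each ID within the sorted vector of unique IDs
people.dat$new_id <- match(people.dat$ID, unique_ids)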
I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications, and then convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsets yield an empty dataframe, as for some particular specifications I have no data; hence no proportion table can be calculated. What I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input in place of the empty one.
Here are some insights into my dataframe and function:
My raw dataframe looks more or less as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which X no. of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand each row by its number of individuals, then build the proportion table
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
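For the sample data above, a call like this (a hypothetical invocation; fish.dat stands in for whatever the raw dataframe is called) would return the proportion-at-age key for 2005, quarter 1, area 24:
key <- LAK(fish.dat, Year="2005", Quarter="1", Area="24", alkplot=FALSE)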
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year), and fill in the NA values in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year at all! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is the LAK function I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # in case of an empty combination (no rows, or only an all-NA placeholder row)
  if(nrow(sALK) == 0 || anyNA(sALK$no_individuals)){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
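For what it's worth, one possible shape for those imputation rules (my own sketch, not part of the original function; it assumes the column names from the example table and that empty combinations appear as all-NA rows): walk outwards from the requested quarter within the same year and area, then fall back to the previous year.
nearest_subset <- function(df, Year, Quarter, Area) {
  Quarter <- as.numeric(Quarter)
  for (y in c(as.numeric(Year), as.numeric(Year) - 1)) {  # this year, then the previous one
    for (q in order(abs(1:4 - Quarter))) {                # quarters sorted by distance from the target
      s <- subset(df, year == y & quarter == q & area == Area & !is.na(no_individuals))
      if (nrow(s) > 0) return(s)
    }
  }
  stop("No non-empty year/quarter/area combination found")
}
The FIXME branch above could then simply replace sALK with nearest_subset(df, Year, Quarter, Area).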
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  # a single row means the combination only matched its all-NA placeholder row
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    sALK2 <- subset(df, year==syear & area==sarea)
    # find the first time_comb with more than one row, i.e. with real data
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    idx <- which(vals$Freq > 1)[1]
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), ]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet, I'm still struggling to figure out how to handle the case where I have no data for a particular year & area combination at all. In this case I need to borrow data from the closest year that contains data for all the quarters for the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem. See the mice package vignette and the original package description on CRAN.
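A minimal sketch of what that could look like here (the column names come from the example table; m, method and seed are illustrative choices, not recommendations):
library(mice)
# impute only the measurement columns, keeping the key columns aside
imp <- mice(df[, c("no_individuals", "lenCls", "age")],
            m = 5, method = "pmm", seed = 1)
df_imputed <- cbind(df[, c("year", "quarter", "area", "time_comb")],
                    complete(imp, 1))  # first completed dataset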
I have a dataframe made up of multiple volume columns only, and I want to create a total column called test which sits in the dataframe. The code below works if I just assign test <- ..., but if I add summary_transposed_no_time$ on the left-hand side, the column doesn't seem to be added to the dataframe.
I would also like to know how I could evolve this piece of code so that test is the sum of all columns minus column 1, and then later create another column (maybe called test2) which is the sum of all columns minus column 2. I can hard-code the column positions but not the column names (as they can change in naming convention each time the code is run), so I haven't included them here.
w <- ncol(summary_transposed_no_time)
summary_transposed_no_time$test <- apply(summary_transposed_no_time[,c(1:w)], 1, sum)
Example of summary_transposed_no_time:
postal_dist_a | postal_dist_b | postal_dist_c
------------- | ------------- | -------------
20 | 25 | 15
25 | 40 | 23
31 | 32 | 19
24 | 39 | 17
37 | 19 | 26
Desired result columns within summary_transposed_no_time:
postal_dist_a | postal_dist_b | postal_dist_c | test
------------- | ------------- | ------------- | -------------
20 | 25 | 15 | 60
25 | 40 | 23 | 88
31 | 32 | 19 | 82
24 | 39 | 17 | 80
37 | 19 | 26 | 82
You should provide a reproducible example.
But if your question is really just about how to do row sums, I would rather use the built-in function rowSums. Your code would be:
set.seed(1)
# recreate a table more or less like yours
summary_transposed_no_time <- data.frame(matrix(rnorm(1000), ncol=5))
n <- ncol(summary_transposed_no_time)
# test contains the row sum over all columns
summary_transposed_no_time$test <- rowSums(summary_transposed_no_time)
# testm1: row sum minus column 1
summary_transposed_no_time$testm1 <- rowSums(summary_transposed_no_time[, 2:n])
# testm2: row sum minus column 2
summary_transposed_no_time$testm2 <- rowSums(summary_transposed_no_time[, c(1, 3:n)])
# testmi: row sum minus column i
i <- 3
summary_transposed_no_time$testmi <- rowSums(summary_transposed_no_time[, c(1:n)][, -i])
# check on the first row:
sum(summary_transposed_no_time[1, 1:n]) == summary_transposed_no_time$test[1]
sum(summary_transposed_no_time[1, 2:n]) == summary_transposed_no_time$testm1[1]
sum(summary_transposed_no_time[1, c(1, 3:n)]) == summary_transposed_no_time$testm2[1]
sum(summary_transposed_no_time[1, c(1:2, 4:n)]) == summary_transposed_no_time$testmi[1]
I've found how to create a "total_" column for every column within the df: total_1 is the sum of all columns minus column 1, total_2 the sum of all columns minus column 2, etc.
n <- ncol(summary_transposed_no_time)
for (h in 1:n) {
  summary_transposed_no_time[, paste0("total_", h)] <- rowSums(summary_transposed_no_time[, c(1:n)][, -h])
  m <- ncol(summary_transposed_no_time)
  print(paste("added a total column for region", h, "so the column count is now:", m))
} # end for (h in 1:n)
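For what it's worth, the same columns can be built in one go without growing the dataframe inside a loop; a compact sketch along the same lines:
n <- ncol(summary_transposed_no_time)
# one new column per h: row sums over all columns except column h
totals <- sapply(1:n, function(h) rowSums(summary_transposed_no_time[, (1:n)[-h]]))
colnames(totals) <- paste0("total_", 1:n)
summary_transposed_no_time <- cbind(summary_transposed_no_time, totals)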
I have a data set as follows:
Year | Sale
2010 | 100
2011 | 60
2012 | 10
2013 | 1400
2010 | 900
2012 | 730
2014 | 300
First I want to sum the Sale column grouped by Year, so the result would be:
key | value
2010 | 1000
2011 | 60
2012 | 740
2013 | 1400
2014 | 300
and then I want the difference between consecutive years, as follows:
key | value
2010 | 0
2011 | -940
2012 | 680
2013 | 660
2014 | -1100
Can you help me do this in a crossfilter reduce, using the add, remove, and initial methods?
I'd strongly recommend transforming your data before building your Crossfilter, so that each record includes the sales for the current and subsequent year. So you would have a record that looks like
Year | Sales | NextYearSales
2010 | 100 | 60
Then you can just do
var cf = crossfilter(data);
var dim = cf.dimension(function(d) { return d.Year });
var group = dim.group();
group.reduceSum(function(d) { return d.Sales - d.NextYearSales; });
Alternatively, you could use Crossfilter's groupAll method, or Reductio's somewhat nicer wrapper of it.
I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that its time-location is lined up with (time1, obs1).
I.e. I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this, but I keep getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
df[,"new.obs2"] <- approxExtrap(df[,"time1"],
df[,"obs1"],
xout = df[,"obs2"],
method="linear",
rule=2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially, I am just trying to interpolate obs2, within each state/survey.year combination, so that its time references line up with those of obs1.
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!
This should be as simple as,
ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time2, obs2, xout = time1,
                              method = "linear",
                              rule = 2)$y)
(approxExtrap returns a list, hence the $y; and it is (time2, obs2) that gets interpolated at the time1 locations, to match your stated goal.)
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)
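As a toy check of that pattern (made-up data; new.obs2 ends up holding obs2 re-interpolated onto the time1 grid):
library(plyr)
library(Hmisc)
df <- data.frame(state = "CA", survey.year = 2000,
                 time1 = c(1, 2, 5), obs1 = c(23, 43, 53),
                 time2 = c(1.2, 1.4, 3.2), obs2 = c(43, 52, 61))
ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time2, obs2, xout = time1, rule = 2)$y)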