Deleting a subset of rows based on other variables - r

I followed the example in "Remove last N rows in data frame with the arbitrary number of rows", but it deletes only the last 50 rows of the whole data frame rather than the last 50 rows of every study site within it. I have a really big data set with multiple study sites; within each study site there are multiple depths, and for each depth a concentration of nutrients.
I want to delete just the last 50 rows of depth for each station.
E.g.
station 1 has 250 depths
station 2 has 1000 depths
station 3 has 150 depths
but keep all the other data consistent.
This just removes the last 50 rows from the data frame as a whole rather than the last 50 from every station:
df <- df[-seq(nrow(df), nrow(df) - 49), ]
What should I do to add more variables (study site) to filter by?

A potential base R solution (assuming the rows are already ordered by station, as in this example) would be:
d <- data.frame(station = rep(paste("station", 1:3), c(250, 1000, 150)),
depth = rnorm(250 + 1000 + 150, 100, 10))
d$grp_counter <- do.call("c", lapply(tapply(d$depth, d$station, length), seq_len))
d$grp_length <- rep(tapply(d$depth, d$station, length), tapply(d$depth, d$station, length))
d <- d[d$grp_counter <= (d$grp_length - 50),]
d
# OR w/o auxiliary vars: subset(d, select = -c(grp_counter, grp_length))

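We can use the slice() function from the dplyr package:
library(dplyr)
df2 <- df %>% group_by(Col1) %>% slice(1:(n() - 4))
This first groups by the category column (Col1 here) and then, provided the rows are in the right order within each group, drops the last n rows of each group (4 in this case; use 50 for the data in the question). Note that it assumes every group has more than n rows.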
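For comparison, the same per-station trimming can be sketched in base R with ave(). This is only an illustration under assumptions: the toy data mirrors the station sizes from the question, and the names d, n_drop, pos, size and trimmed are made up here.

```r
# Toy data with the station sizes from the question (depth values are made up).
d <- data.frame(
  station = rep(paste("station", 1:3), c(250, 1000, 150)),
  depth   = seq_len(250 + 1000 + 150)
)

n_drop <- 50  # number of rows to drop from the end of each station

# Position of each row within its station, and the size of that station.
pos  <- ave(seq_len(nrow(d)), d$station, FUN = seq_along)
size <- ave(seq_len(nrow(d)), d$station, FUN = length)

# Keep a row only if it is not among the last n_drop rows of its station.
trimmed <- d[pos <= size - n_drop, ]
print(table(trimmed$station))
```

Unlike the tapply() version, this does not need the rows to be sorted by station, only to be in the desired order within each station. A station with n_drop rows or fewer is removed entirely.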

Related

How to match two observation values in a data frame and assign values in a third column in r

Before I begin explaining, my data is super messy and I am not an advanced programmer, so bear with me.
I have a large data frame with two columns containing data on each row and one column containing spread data and NA's. I want to identify a value in the NA column, using the two other columns, and assign that value to the other rows in the NA column that have the matching other two columns.
To explain what I mean I have created a small sample code with a data frame 270 x 3:
Sample Code:
library(magrittr) # provides the %>% pipe used below
year <- c()
for (i in 1988:1990){
y1 <- i %>% rep(30)
year <- c(year, y1)
}
year <- year %>% rep(3)
stock <- c()
for (i in 1:18) {
s1 <- i %>% rep(15)
stock<- c(stock, s1)
}
df <- data.frame(year,stock)
df$group <- NA
df$group[c(24, 130, 160, 212)] <- 11
df$group[c(60, 88, 140, 240)] <- 12
df$group[c(2, 72, 100, 183)] <- 13
df$group[c(16, 47, 203, 262)] <- 14
I have years and stocks data in every row, but the group is random. The years can be repeated because they are years of a collection of monthly dates and different stocks on the same dates.
I want to identify the group number of a certain row and see if that stock "persists" 12 months by having the same yearly value for i + 11 rows. I then want to assign the value of the stock's group in row i to i:(i+11). In other words, I want to see if the stock in row i and row i+11 is the same stock and the year is the same in row i and i+11
I tried to overcome this problem by only looking at the rows where we have data and create if-else conditions inside a for loop:
identifier <- which(!is.na(df$group))
for (i in identifier) {
if (df$stock[i] == df$stock[i + 11]) {
df$group[(i+1):(i+11)] <- df$group[i]
} else {
df$group[i] <- NA
}
}
I have two problems (that I have found so far) with the code:
It does not look at all 12 values from i to i+11, so it can cause overlapping groups to become NA and assign false groups to certain stocks and periods.
My original data frame consists of >3m observations, and the for loop takes more than 30 minutes to run.
My sample code does not fully represent my original data set. Some stocks exist for 15 years and some 20 months. The stocks can also change group from year to year.
Here is an example of the error and how I want it to look:
df$group[24] # is NA, but the surrounding values are not
df$group[16:27]
# They should be NA, in this example
df$group[16:27] <- NA
df$group[262] <-NA # also identified error
# I then want to remove the NA's and get a clean data set containing only stocks with assigned groups
df <- df[!is.na(df$group), ] # unlike -which(...), this is also safe when there are no NA's
df
Do you have any suggestions on how to make this more efficient?
Ways I can structure the data better?
More efficiently look for values in a column and then assign it to the other rows?

How do I reformat a data set to have this particular structure without for loops?

I am trying to reorganize some raw data into a more condensed form. Currently the data looks like the output of the R code below. I would like the final output to have columns for time, ID, and all possible desired prices. Then, I want each ID to have only one row for each time, with the quantity entered under the different desired prices (so how many an ID wants at a particular price during this time). For example, a particular ID might have a quantity of 1 at 100 and a quantity of 2 at 101. If it is a buy, the value should be negative, and if it is a sell, positive. For example, -1 for a buy at 100 and 2 for a sell at 101.
I originally tried doing it through a double for loop with the first loop being time and then the second loop being the ID. Then I was able to look at the quantity column and desired price for an ID and put them into a vector. Afterwards, I combined all the vectors together for that time and then repeated this. When I tried to use this in practice, it was not feasible because the code was too slow as there are hundreds of IDs and thousands of times.
Can someone help me do this in a faster and cleaner way?
set.seed(1)
time <- rep(seq(1, 5), each = 15)
id <- sample(342:450, 75, replace = TRUE)
price <- sample(99:103, 75, replace = TRUE)
Desire.Price <- sample(97:105, 75, replace = TRUE)
quantity <- sample(1:4, 75, replace = TRUE)
data <- data.frame(time = time, id = id, price = price, Desire.Price = Desire.Price, quantity = quantity)
# BUY when the desired price is at or below the current price, otherwise SELL
data$buysell <- ifelse(data$Desire.Price <= data$price, "BUY", "SELL")
I expect the final data set would look something like this.
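Final.df <- data.frame(time = NA, id = NA, "97" = NA, "98" = NA, "99" = NA, "100" = NA, "101" = NA, "102" = NA, "103" = NA,
"104" = NA, "105" = NA, check.names = FALSE) # check.names = FALSE keeps the numeric column names as written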
It would basically condense the original raw data to have all the information for a particular ID in a row during each time period.
Edit: If an ID did not get sampled in a given time (for example, ID 342 is not in time 1), it should have a row of NA in that time period (so ID 342 would have a row of NA in time 1). I edited the code that generates the samples to have more IDs to reflect this (so that they cannot all possibly be sampled in every time period).
Here's a tidyverse approach. First, make quantity signed based on BUY/SELL, then sum quantity for each id / time / Desire.Price, then spread those into wide format with a column for each Desire.Price.
library(dplyr); library(tidyr)
data %>%
mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
count(id, time, Desire.Price, wt = quantity_signed) %>%
complete(id, time) %>% # EDIT to bring in all times for all id's
spread(Desire.Price, n) %>% View("output")
I think this approach is comparatively simple.
# Code
library(reshape2)
# Turn BUY quantities negative.
buy <- data$buysell == "BUY"
data$quantity[buy] <- -data$quantity[buy]
# Use dcast() to get one column per desired price.
final.df <- dcast(data, time + id ~ Desire.Price, fun.aggregate = sum, value.var = "quantity")
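If installing extra packages is not an option, roughly the same reshaping can be sketched in base R with aggregate() and reshape(). This is a sketch under assumptions rather than a drop-in replacement: the names orders, signed_qty, sums and wide are made up here, and it does not add the all-NA rows for IDs that were not sampled in a time period.

```r
set.seed(1)
# Same sampling scheme as in the question.
orders <- data.frame(
  time         = rep(1:5, each = 15),
  id           = sample(342:450, 75, replace = TRUE),
  price        = sample(99:103, 75, replace = TRUE),
  Desire.Price = sample(97:105, 75, replace = TRUE),
  quantity     = sample(1:4, 75, replace = TRUE)
)

# Buys are negative, sells are positive (same rule as the question).
orders$signed_qty <- ifelse(orders$Desire.Price <= orders$price,
                            -orders$quantity, orders$quantity)

# Sum the signed quantity per time / id / desired price ...
sums <- aggregate(signed_qty ~ time + id + Desire.Price, data = orders, FUN = sum)

# ... then spread the desired prices into columns, one row per time / id.
wide <- reshape(sums, idvar = c("time", "id"), timevar = "Desire.Price",
                direction = "wide")
print(head(wide))
```

The price columns come out named signed_qty.97, signed_qty.98 and so on, and combinations that never occur are left as NA rather than 0.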

using aggregate to generate report based on multiple categories in r

I have a .dbf containing roughly 2.8 million records that contain residential parcel data with a year built category field, a county code field, and a windzone field (for building code restrictions). There are 3 year built categories and 5 wind zones. I need to get the number of parcels for each year built category in each windzone for each county. Basically I have a county (CNTY_ID = 11) with three year built categories (BUILT_CAT = "1" , "2" , "3") each that are also assigned to one of five windspeed categories (WINDSPEED = "100", "110", "120", etc.). I think I need to use the aggregate() function but haven't had any luck. Optimally the generated table would look something like:
CNTY_ID = 11
               BUILT_CAT
               1   2   3
WINDSPEED
      100      x   x   x
      120      x   x   x
        .
        .
        .
      150      x   x   x
CNTY_ID = 12
               BUILT_CAT
               1   2   3
WINDSPEED
      100      x   x   x
      120      x   x   x
        .
        .
        .
      150      x   x   x
Is this kind of task possible to accomplish?
Actually, you're better off using table(): it's less hassle and performs better. You get an array back, which is easily converted to a data frame.
Some test data:
n <- 10000
df <- data.frame(
windspeed = sample(c(110,120,130), n, TRUE),
built_cat = sample(c(1,2,3),n,TRUE),
cnty_id = sample(1:20,n,TRUE)
)
Constructing the table and converting to a data frame:
tbl <- with(df, table(windspeed, built_cat, cnty_id))
as.data.frame(tbl)
Note that I use with here so I have the variable names automatically as the dimnames of my table. That helps with the conversion.
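To get the per-county layout sketched in the question, the three-way table can be sliced by county or flattened with ftable(). A minimal sketch on a tiny made-up data set follows; parcels and county_11 are names invented for this example.

```r
# Tiny made-up parcel data, just enough to check the counts by hand.
parcels <- data.frame(
  cnty_id   = c(11, 11, 11, 11, 12, 12),
  windspeed = c(100, 100, 120, 120, 100, 120),
  built_cat = c(1, 1, 2, 3, 2, 2)
)

tbl <- with(parcels, table(windspeed, built_cat, cnty_id))

# One WINDSPEED x BUILT_CAT table for a single county: index the third
# dimension by the county code (dimnames are character strings).
county_11 <- tbl[, , "11"]
print(county_11)

# Or print all counties at once in a flat layout.
print(ftable(tbl, row.vars = c("cnty_id", "windspeed")))
```

Indexing with "11" rather than 11 matters here, because a number would be read as a position along the dimension, not as the county code.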
What you essentially need is a way to group your data.
I think dplyr is the way to go. You can use aggregate too.
Using dplyr
library(dplyr)
library(datasets)
temp <- airquality %>%
group_by(Month, Day) %>%
summarise(TOT = sum(Ozone))
View(temp)
This gives you the data in a normalized (long) format: it is grouped first by Month and then by Day of the month, and the chosen variable (Ozone in this case) is summed within each group. You can also count the values by using length instead.
Using aggregate
temp2 <- aggregate(Ozone ~ Month + Day, data = airquality, sum)
View(temp2)
The key difference between the two approaches is the treatment of NA.
With dplyr, sum(Ozone) returns NA for any group that contains an NA unless you pass na.rm = TRUE, so the group stays in the result with an NA total. The formula interface of aggregate() uses na.action = na.omit by default, so rows with an NA are removed before grouping; a group whose rows are all NA is therefore dropped from the result.
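The NA behaviour is easy to check with base R alone. The sketch below groups the built-in airquality data by Month only, to keep the output short; raw_sum, clean_sum and agg_sum are names chosen for this example.

```r
# airquality ships with R; its Ozone column contains missing values.

# Plain sum() per month: an NA anywhere in a group makes that total NA.
raw_sum <- tapply(airquality$Ozone, airquality$Month, sum)

# Same grouping, but telling sum() to ignore NAs.
clean_sum <- tapply(airquality$Ozone, airquality$Month, sum, na.rm = TRUE)

# aggregate() with a formula drops rows containing NA before grouping
# (na.action = na.omit is the default for the formula method).
agg_sum <- aggregate(Ozone ~ Month, data = airquality, FUN = sum)

print(raw_sum)
print(clean_sum)
print(agg_sum)
```

The totals from aggregate() match the na.rm = TRUE version, which is why the two approaches in the answer above can disagree on the same data.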
Here is a link to other group by treatments using data.table or ddply. You can also achieve this by plyr or tapply.

In R, select rows that have one column that exists in another list

I'm new to R; have a simple stumbling block for which I've been searching for an answer for too long.
The data frame includes a list of individuals with their performance over a five-year period. The analysis needs to include only those individuals who participated in the most recent year, so I need to identify those individuals and then select all records from the original data frame for those individuals, with all columns (there are 50 or more other columns).
Original data frame is performance_fiveyr; variables I'm working with are person_id and year. I have tried any number of possible ways to get what I need; I'm listing one of those ways here...
First step is to create the list of individuals that participated this past year
person_current <- subset (x = performance_fiveyr,
subset = year==2015, # keep only records from 2015
select = person_id # keep only the person_id variable
)
Next step then is to select from performance_fiveyr all rows that have a person_id that exists in person_current and return all other columns (more than 50 columns total).
performance_current <- performance_fiveyr[performance_fiveyr$person_id
%in% person_current, ]
I've tried more than a few variations of this and end up with either all columns and no rows or all rows and no variables.
Here is some example data:
set.seed(0)
p5 <- data.frame(id = sample(5, 20, replace=TRUE), year = sample(2010:2015, 20, replace=TRUE))
p5 <- p5[order(p5$id, p5$year), ]
I think you were on the right track. I think the below does what you are after:
current <- unique(p5[p5$year==2015, 'id'])
p_current <- p5[p5$id %in% current, ]
p_current
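As for why the attempt in the question misbehaves: subset(..., select = person_id) returns a one-column data frame, not a vector, so %in% compares against the column as a whole instead of its individual values. A small sketch with made-up data (the score column and its values are invented) shows the difference and one way around it.

```r
# Made-up stand-in for the real data; only person_id and year matter here.
performance_fiveyr <- data.frame(
  person_id = c(1, 1, 2, 3, 3),
  year      = c(2014, 2015, 2014, 2013, 2015),
  score     = c(10, 12, 9, 7, 8)
)

# As in the question: this is still a data frame with one column.
person_current <- subset(performance_fiveyr,
                         subset = year == 2015,
                         select = person_id)
print(is.data.frame(person_current))

# Pull the column out as a plain vector before using %in%.
current_ids <- unique(person_current$person_id)
performance_current <-
  performance_fiveyr[performance_fiveyr$person_id %in% current_ids, ]
print(performance_current)
```

Here persons 1 and 3 took part in 2015, so all four of their rows are kept with every column, while person 2 is dropped.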

R find X percent growth within Y time frame

I have a R data frame with 3 columns
timestamp
Category
Value
I am trying to find an (ideally elegant) way to find where the values increased, or decreased, by X percent within a specified time frame. For example, I'd like to know all points in the data where Value increased by 50% or more within 1 week.
Are there any built-in functions or packages where I can just pass a percentage and a number of days and have it return which rows in the data frame are a match?
Something along these lines (pseudocode below):
RowsThatareAMatch <- findmatches(date=MyDF$Timestamp, grouping=MyDF$Category, data=MyDF$Value, growth=0.5, range=7)
The thing that is throwing me off is that I want it to return the rows for each Category that has matching values, and not just look at every value in the data frame. So if Categories A and B had growth of 50% or more within 7 days 8 times in my data, I want those rows returned, and if categories C, D, and E didn't ever have that kind of increase, I don't want data from those categories returned at all.
Right now I am looking at systematically splitting the data frame into multiple data frames for each category and then doing the analysis on each individual data frame. While that approach could work, something is telling me that R has an easier way to do this.
Thoughts?
edit: Ideally what I am looking for returned is a data frame with 3 columns, and 1 row for every match in my data.
Category
Start timestamp of the match
End timestamp of the match.
Based on my experience with R, I would need to identify the row numbers for each grouping and then extract the above data from the original data frame, but if there's a good way to go straight to the above output, that would be awesome too!
Sample Data
So I have a CSV like this:
Timestamp,Category,Value
2015-01-01,A,1
2015-01-02,A,1.2
2015-01-03,A,1.3
2015-01-04,A,8
2015-01-05,A,8.2
2015-01-06,A,9
2015-01-07,A,9.2
2015-01-08,A,10
2015-01-09,A,11
2015-01-01,B,12
2015-01-02,B,12.75
2015-01-03,B,15
2015-01-04,B,60
2015-01-05,B,62.1
2015-01-06,B,63
2015-01-07,B,12.3
2015-01-08,B,10
2015-01-09,B,11
2015-01-01,C,100
2015-01-02,C,100000
2015-01-03,C,200
2015-01-04,C,350
2015-01-05,C,780
2015-01-06,C,780.2
2015-01-07,C,790
2015-01-08,C,790.3
2015-01-09,C,791
2015-01-01,D,0.5
2015-01-02,D,0.8
2015-01-03,D,0.83
2015-01-04,D,2
2015-01-05,D,0.01
2015-01-06,D,0.03
2015-01-07,D,0.99
2015-01-08,D,1.23
2015-01-09,D,5
I would read that into R like this
df <- read.csv("CategoryMeasurements.csv", header=TRUE)
Say your data.frame is called df. You can do something like this using data.table, which returns, per category, a column that reads "increase over 50%" if the value grew by more than 50% relative to the previous row (which you can then filter):
# Simple lag helper; note that it masks stats::lag.
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
library(data.table)
# Column names match the CSV header (Value, Category).
setDT(df)[, ifelse(Value/lag(Value, 1) - 1 > 0.5, "increase over 50%", "Other"), by = Category]
Well, I'm not sure how elegant this is, but it works. I wound up having to subset by category before passing the data frame to my function, so I will need a loop or one of the apply functions to pass each category to it, but it should get the job done.
Mydf <- read.csv("CategoryMeasurements.csv", header=TRUE)
GetIncreasesWithinRange <- function(df, growth, days) {
  # df = data frame with the data you want processed. 1st column should be a date, 2nd column should be the data.
  # growth = growth you are looking for, as a fraction (0.5 = 50%)
  # days = the number of days that the growth should occur in to be a match.
  df <- df[order(df[,1]), ] # Sort the df by the date column. This is important for the loop logic.
  # Initialize an empty data frame to hold the results returned from this function.
  ReturnDF <- data.frame(StartDate=as.Date(character()),
                         EndDate=as.Date(character()),
                         Growth=double(),
                         stringsAsFactors=FALSE)
  TotalRows <- nrow(df)
  for(i in 1:TotalRows) {
    StartDate <- toString(df[i,1])
    StartValue <- df[i,2]
    for(x in i:(TotalRows)) {
      NextDate <- toString(df[x,1])
      DayDiff <- as.numeric(difftime(NextDate, StartDate, units = c("days")))
      # Only the first row that is at least `days` days later is compared.
      if(DayDiff >= days) {
        NextValue <- df[x,2]
        # Growth relative to the starting value.
        PercentChange <- (NextValue - StartValue)/StartValue
        if(PercentChange >= growth) {
          ReturnDF[(nrow(ReturnDF)+1),] <- list(StartDate, NextDate, PercentChange)
        }
        break
      }
    }
  }
  return(ReturnDF)
}
subDF <- Mydf[which(Mydf$Category=='A'), ]
subDF$Category <- NULL # Nuke the category column from the subsetting DF. It's not relevant for this.
X <- GetIncreasesWithinRange(subDF, 0.5, 4)
print(X)
Which outputs
   StartDate    EndDate   Growth
1 2015-01-01 2015-01-05 7.200000
2 2015-01-02 2015-01-06 6.500000
3 2015-01-03 2015-01-07 6.076923
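To avoid subsetting each category by hand, the per-category work can be wrapped in split() and lapply(). The following is a rough base R sketch, not a tested solution for large data: find_growth is a hypothetical helper name, it assumes columns called Timestamp, Category and Value as in the sample CSV, and it measures growth relative to the starting value, reporting the first later row within `days` days that reaches the threshold.

```r
# Hypothetical helper: returns one row per match with Category, Start, End.
find_growth <- function(df, growth, days) {
  df$Timestamp <- as.Date(df$Timestamp)
  per_category <- lapply(split(df, df$Category), function(g) {
    g <- g[order(g$Timestamp), ]
    hits <- lapply(seq_len(nrow(g)), function(i) {
      gap   <- as.numeric(g$Timestamp - g$Timestamp[i])  # days after row i
      ratio <- g$Value / g$Value[i] - 1                  # growth vs. row i
      j <- which(gap > 0 & gap <= days & ratio >= growth)
      if (length(j) == 0) return(NULL)
      data.frame(Category = as.character(g$Category[i]),
                 Start    = g$Timestamp[i],
                 End      = g$Timestamp[j[1]])
    })
    do.call(rbind, Filter(Negate(is.null), hits))
  })
  do.call(rbind, Filter(Negate(is.null), per_category))
}

# Small made-up sample: A jumps on the 4th day, C barely moves.
sample_df <- data.frame(
  Timestamp = rep(c("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"), 2),
  Category  = rep(c("A", "C"), each = 4),
  Value     = c(1, 1.2, 1.3, 8, 100, 101, 102, 103)
)

matches <- find_growth(sample_df, growth = 0.5, days = 7)
print(matches)
```

Categories without any match simply contribute no rows, which is the behaviour asked for in the question. The inner comparison is quadratic in the number of rows per category, so for very large data a rolling or non-equi join would likely be needed instead.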
