Remove rows that match a value - r

I'm trying to filter out some data. Say several columns contain numeric values, and any row in which all of those columns equal zero must go. I've thought about performing multiple matches with which, like so:
match1 <- match(which(storm$FATALITIES==0), which(storm$INJURIES==0))
match2 <- match(which(storm$CROPDMG==0), which(storm$CROPDMGEXP==0))
match3 <- match(which(storm$PROPDMG==0), which(storm$PROPDMGEXP==0))
match4 <- match(match1, match2)
matchF <- match(match4, match3)
but it clearly doesn't work, since it's giving positions relative to the last vector...
the data looks something like this:
BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
1 4/18/1950 0:00:00 AL TORNADO 0 15 25.0 K 3
2 4/18/1950 0:00:00 AL TORNADO 0 0 0.0 K 0
3 2/20/1951 0:00:00 AL TORNADO 0 2 25.0 K 0
4 6/8/1951 0:00:00 AL TORNADO 0 2 0.0 K 0
5 11/15/1951 0:00:00 AL TORNADO 0 0 0.0 K 0
6 11/15/1951 0:00:00 AL TORNADO 1 6 2.5 K 0
7 11/16/1951 0:00:00 AL TORNADO 0 1 2.5 K 0
CROPDMGEXP LATITUDE LONGITUDE REFNUM
1 3040 8812 1
2 3042 8755 2
3 3340 8742 3
4 3458 8626 4
5 3412 8642 5
6 3450 8748 6
7 3405 8631 7
I'm interested in removing all entries that are 0 for INJURIES, FATALITIES, CROPDMG, and PROPDMG simultaneously. I've already filtered out NAs with complete.cases().
Thanks

Here are a couple of ways. One is interactive and very intuitive:
subset(storm, INJURIES != 0 |
              FATALITIES != 0 |
              CROPDMG != 0 |
              PROPDMG != 0)
and one is programmatic, hence more flexible and scalable:
fields <- c('INJURIES', 'FATALITIES', 'CROPDMG', 'PROPDMG')
keep <- rowSums(storm[fields] != 0) > 0
storm[keep, ]
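To see exactly what the programmatic version keeps, here is a minimal sketch on a made-up stand-in for storm (the values below are hypothetical):
toy <- data.frame(INJURIES   = c(15, 0, 2, 0),
                  FATALITIES = c(0, 0, 0, 0),
                  CROPDMG    = c(3, 0, 0, 0),
                  PROPDMG    = c(25, 0, 25, 0))
fields <- c('INJURIES', 'FATALITIES', 'CROPDMG', 'PROPDMG')
# toy[fields] != 0 is a logical matrix; rowSums() counts the TRUEs per row,
# so a row is kept when at least one of the four fields is non-zero
toy[rowSums(toy[fields] != 0) > 0, ]
#   INJURIES FATALITIES CROPDMG PROPDMG
# 1       15          0       3      25
# 3        2          0       0      25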

Related

Linear interpolation by multiple groupings in R

I have the following data set:
District Type DaysBtwn Start_Day End_Day Start_Vol End_Vol
1 A 0 3 0 31 28 23
2 A 1 3 0 31 24 0
3 B 0 3 0 31 17700 10526
4 B 1 3 0 31 44000 35800
5 C 0 3 0 31 5700 0
6 C 1 3 0 31 35000 500
For each of the group combinations District & Type, I want to do a simple linear interpolation: for x = Days (Start_Day and End_Day) and y = Volumes (Start_Vol and End_Vol), I want the estimated volume returned for xout = DaysBtwn.
I have tried so many things. I think I am having issues because of the way my data is set up. Can someone point me in the right direction for how to use the approx function in R to get the desired output? I don't mind moving my data set around to get the correct format for approx.
Example of desired output:
District Type EstimatedVol
1 0 25
2 1 15
3 0 13000
4 1 39000
5 0 2500
6 1 25000
dt <- data.table(input)
interpolation <- dt[, approx(x, y, xout=z), by=list(input$District, input$Type)]
Why not simply calculate it directly?
dt[, EstimatedVol := (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol]
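If you specifically want approx(), you can apply it per group with data.table; below is a sketch, with input reconstructed by hand from the posted rows:
library(data.table)
# hypothetical reconstruction of the posted data
input <- data.table(District  = c('A', 'A', 'B', 'B', 'C', 'C'),
                    Type      = c(0, 1, 0, 1, 0, 1),
                    DaysBtwn  = 3,
                    Start_Day = 0, End_Day = 31,
                    Start_Vol = c(28, 24, 17700, 44000, 5700, 35000),
                    End_Vol   = c(23, 0, 10526, 35800, 0, 500))
# each group supplies two (day, volume) points; approx() evaluates the line
# through them at xout = DaysBtwn
input[, .(EstimatedVol = approx(x = c(Start_Day, End_Day),
                                y = c(Start_Vol, End_Vol),
                                xout = DaysBtwn)$y),
      by = .(District, Type)]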

Average columns based on other column value and number of rows in R

I'm using R and am trying to create a new dataframe of averaged results from another dataframe, based on the values in column A. To demonstrate my goal, here is some data:
set.seed(1981)
df <- data.frame(A = sample(c(0,1), replace=TRUE, size=100),
B=round(runif(100), digits=4),
C=sample(1:1000, 100, replace=TRUE))
head(df, 30)
A B C
0 0.6739 459
1 0.5466 178
0 0.154 193
0 0.41 206
1 0.7526 791
1 0.3104 679
1 0.739 434
1 0.421 171
0 0.3653 577
1 0.4035 739
0 0.8796 147
0 0.9138 37
0 0.7257 350
1 0.2125 779
0 0.1502 495
1 0.2972 504
0 0.2406 245
1 0.0325 613
0 0.8642 539
1 0.1096 630
1 0.2113 363
1 0.277 974
0 0.0485 755
1 0.0553 412
0 0.509 24
0 0.2934 795
0 0.0725 413
0 0.8723 606
0 0.3192 591
1 0.5557 177
I need to reduce the size of the data by calculating the average of columns B and C over consecutive rows that share the same value of column A, up to a maximum of 3 rows. If A stays at either 1 or 0 for more than 3 rows, the run rolls over into the next row of the new dataframe, as you can see below.
The new dataframe requires the following columns:
Value of A B.Av C.Av No. of rows used
0 0.6739 459 1
1 0.5466 178 1
0 0.282 199.5 2
1 0.600666667 634.6666667 3
1 0.421 171 1
0 0.3653 577 1
1 0.4035 739 1
0 0.8397 178 3
1 0.2125 779 1
0 0.1502 495 1
1 0.2972 504 1
0 0.2406 245 1
1 0.0325 613 1
0 0.8642 539 1
1 0.1993 655.6666667 3
0 0.0485 755 1
1 0.0553 412 1
0 0.291633333 410.6666667 3
0 0.59575 598.5 2
1 0.5557 177 1
I haven't managed to find a similar scenario whilst searching Stack Overflow, so any help would be really appreciated.
Here is a base-R solution:
## define a function to split the run-length if greater than 3
split.3 <- function(l, v) {
    o <- c(values = v, lengths = min(l, 3))
    while (l > 3) {
        l <- l - 3
        o <- rbind(o, c(values = v, lengths = min(l, 3)))
    }
    return(o)
}
## compute the run-length encoding of column A
rl <- rle(df$A)
## apply split.3 to the run-length encoding
## the first column of vl are the values of column A
## the second column of vl are the corresponding run-length limited to 3
vl <- do.call(rbind, mapply(split.3, rl$lengths, rl$values, SIMPLIFY = FALSE))
## compute the begin and end row indices of df for each value of A to average
fin <- cumsum(vl[,2])
beg <- fin - vl[,2] + 1
## compute the averages
out <- do.call(rbind, lapply(1:length(beg), function(i)
    data.frame(`Value of A` = vl[i, 1],
               B.Av = mean(df$B[beg[i]:fin[i]]),
               C.Av = mean(df$C[beg[i]:fin[i]]),
               `No. of rows used` = fin[i] - beg[i] + 1)))
## Value.of.A B.Av C.Av No..of.rows.used
##1 0 0.6739000 459.0000 1
##2 1 0.5466000 178.0000 1
##3 0 0.2820000 199.5000 2
##4 1 0.6006667 634.6667 3
##5 1 0.4210000 171.0000 1
##6 0 0.3653000 577.0000 1
##7 1 0.4035000 739.0000 1
##8 0 0.8397000 178.0000 3
##9 1 0.2125000 779.0000 1
##10 0 0.1502000 495.0000 1
##11 1 0.2972000 504.0000 1
##12 0 0.2406000 245.0000 1
##13 1 0.0325000 613.0000 1
##14 0 0.8642000 539.0000 1
##15 1 0.1993000 655.6667 3
##16 0 0.0485000 755.0000 1
##17 1 0.0553000 412.0000 1
##18 0 0.2916333 410.6667 3
##19 0 0.5957500 598.5000 2
##20 1 0.5557000 177.0000 1
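For intuition, split.3 chops a run length into chunks of at most three. For example, a run of seven 1s:
split.3(7, 1)
#   values lengths
# o      1       3
#        1       3
#        1       1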
Here is a data.table solution:
library(data.table)
setDT(df)
# create two group variables, consecutive A and for each consecutive A every three rows
(df[,rleid := rleid(A)][, threeWindow := ((1:.N) - 1) %/% 3, rleid]
# calculate the average of the columns grouped by the above two variables
[, c(.N, lapply(.SD, mean)), .(rleid, threeWindow)]
# drop group variables
[, `:=`(rleid = NULL, threeWindow = NULL)][])
# N A B C
#1: 1 0 0.6739000 459.0000
#2: 1 1 0.5466000 178.0000
#3: 2 0 0.2820000 199.5000
#4: 3 1 0.6006667 634.6667
#5: 1 1 0.4210000 171.0000
#6: 1 0 0.3653000 577.0000
#7: 1 1 0.4035000 739.0000
#8: 3 0 0.8397000 178.0000
#9: 1 1 0.2125000 779.0000
#10: 1 0 0.1502000 495.0000
#11: 1 1 0.2972000 504.0000
#12: 1 0 0.2406000 245.0000
#13: 1 1 0.0325000 613.0000
#14: 1 0 0.8642000 539.0000
#15: 3 1 0.1993000 655.6667
#16: 1 0 0.0485000 755.0000
#17: 1 1 0.0553000 412.0000
#18: 3 0 0.2916333 410.6667
#19: 2 0 0.5957500 598.5000
#20: 1 1 0.5557000 177.0000

R: Roll up by each variable and take total count

I have a data set similar to this, with around 80 variables (flags) and 80,000 rows:
Acc_Nbr flag1 flag2 flag3 flag4 Exposure
ab 1 0 1 0 1000
bc 0 1 1 0 2000
cd 1 1 0 1 3000
ef 1 0 1 1 4000
Expected output:
Variable Count_Acct_Number Sum_Exposure Total_Acct_Number Total_Expo
flag1 3 8000 4 10000
flag2 2 5000 4 10000
flag3 3 7000 4 10000
flag4 2 7000 4 10000
Basically, for each flag variable I want the output to show the count of account numbers and the sum of exposure for rows where the flag is 1, alongside the total count of account numbers and the total exposure.
Please help.
We can convert the 'data.frame' to a 'data.table' (setDT(df1)) and reshape it to 'long' form with melt. Grouped by 'variable', we get the sum of 'value1', the sum of 'Exposure' where 'value1' is 1, the number of rows (.N), and the sum of all values in 'Exposure', giving the expected output.
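For reproducibility, here is a sketch reconstructing df1 from the posted sample:
df1 <- data.frame(Acc_Nbr = c('ab', 'bc', 'cd', 'ef'),
                  flag1 = c(1, 0, 1, 1),
                  flag2 = c(0, 1, 1, 0),
                  flag3 = c(1, 1, 0, 1),
                  flag4 = c(0, 0, 1, 1),
                  Exposure = c(1000, 2000, 3000, 4000))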
library(data.table)
melt(setDT(df1), measure = patterns("^flag"))[,
    list(Count_Acct_Number = sum(value1),
         Sum_Exposure = sum(Exposure[value1 == 1]),
         Total_Acct_Number = .N,
         TotalExposure = sum(Exposure)),
    by = variable]
# variable Count_Acct_Number Sum_Exposure Total_Acct_Number TotalExposure
#1: flag1 3 8000 4 10000
#2: flag2 2 5000 4 10000
#3: flag3 3 7000 4 10000
#4: flag4 2 7000 4 10000
A straightforward way is to use the doBy package:
library(doBy)
df <- data.frame(account=LETTERS[1:10], exposure=1:10*3.14, mark=round(runif(10)))
res <- as.data.frame(summaryBy(exposure~mark+account, df, FUN=sum))
subset(res, mark==0)
Starting with the base data (note: the sample is random, so your values will differ):
> df
account exposure mark
1 A 3.14 1
2 B 6.28 1
3 C 9.42 0
4 D 12.56 0
5 E 15.70 1
6 F 18.84 0
7 G 21.98 1
8 H 25.12 0
9 I 28.26 1
10 J 31.40 0
gives an intermediate result grouped by the marks (in this case there is nothing to actually sum, but summing would work just as well):
> res
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
5 1 B 6.28
6 1 C 9.42
7 1 E 15.70
8 1 G 21.98
9 1 I 28.26
10 1 J 31.40
The final result can be selected with
> subset(res, mark==0)
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12

Compute a mean variable for a specific value in another variable

I would like to compute the mean age for each value from 1-7 of another variable called period.
This is what my data looks like:
work1 <- read.table(header=T, text="ID dead age gender inclusion_year diagnosis surv agrp period
87 0 25 2 2006 1 2174 1 5
396 0 19 2 2003 1 3077 1 3
446 0 23 2 2003 1 3144 1 3
497 0 19 2 2011 1 268 1 7
522 1 57 2 1999 1 3407 2 1
714 0 58 2 2003 1 3041 2 3
741 0 27 2 2004 1 2587 1 4
767 0 18 1 2008 1 1104 1 6
786 0 36 1 2005 1 2887 3 4
810 0 25 1 1998 1 3783 4 2")
This is a subset of a data set with more than 1500 observations.
This is what I'm trying to achieve:
sim <- read.table(header=T, text="Period diagnosis dead surv age
1 1 50 50000 35.5
2 1 80 70000 40.3
3 1 100 80000 32.8
4 1 120 100000 39.8
5 1 140 1200000 28.7
6 1 150 1400000 36.2
7 1 160 1600000 37.1")
In this data set I would like to group by period and diagnosis, with all deaths (dead) and surv (survival time in days) summed within each period. I would also like the mean age for every period.
I have tried everything and still can't create the data set I'm striving for.
All help is appreciated!
You could try data.table
library(data.table)
as.data.table(work1)[, .(dead_sum = sum(dead),
                         surv_sum = sum(surv),
                         age_mean = mean(age)),
                     keyby = .(period, diagnosis)]
Or dplyr
library(dplyr)
work1 %>%
  group_by(period, diagnosis) %>%
  summarise(dead_sum = sum(dead), surv_sum = sum(surv), age_mean = mean(age))
# result
period diagnosis dead_sum surv_sum age_mean
1: 1 1 1 3407 57.00000
2: 2 1 0 3783 25.00000
3: 3 1 0 9262 33.33333
4: 4 1 0 5474 31.50000
5: 5 1 0 2174 25.00000
6: 6 1 0 1104 18.00000
7: 7 1 0 268 19.00000
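If you prefer base R, a sketch with aggregate() and merge() produces the same table in two passes (one for the sums, one for the mean):
sums  <- aggregate(cbind(dead_sum = dead, surv_sum = surv) ~ period + diagnosis,
                   data = work1, FUN = sum)
means <- aggregate(cbind(age_mean = age) ~ period + diagnosis,
                   data = work1, FUN = mean)
merge(sums, means, by = c("period", "diagnosis"))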

Replace values greater than a specific value within a loop in R

I am trying to figure out a way to loop through my data frame and replace any value greater than 200 with a decimal value.
Here is my code:
for (i in data$AGE) if (i > 199) i <- i*.01-2
Here is a head() sample of my data frame:
AGE LOC RACE SEX WORKREL PROD1 ICD10 INJ_ST DTH_YEAR DTH_MONTH DTH_DAY ACC_YEAR ACC_MONTH ACC_DAY
1 26 5 1 1 0 1290 V865 UT 2003 1 1 2002 12 31
2 20 1 7 2 0 1899 X47 HI 2003 1 1 2003 1 1
3 202 1 2 2 0 1598 W75 FL 2003 1 1 2003 1 1
4 86 5 1 2 0 1807 W18 FL 2003 1 1 2002 12 14
5 203 1 2 1 0 1598 W75 GA 2003 1 1 2003 1 1
6 79 0 1 2 2 921 X49 MA 2003 1 1 NA NA NA
So basically, if the value of AGE is greater than 200, then I want to multiply that value by .01 and then subtract 2.
My reason is because any value with 200 and greater is the age in months.
I'm not a Stats or R genius so my humble thanks in advance for all advice.
data$AGE[data$AGE > 200] <- data$AGE[data$AGE > 200] * 0.01 - 2
You can do this reasonably elegantly with within and replace:
data <- within(data, AGE <- replace(AGE, AGE > 200, AGE[AGE > 200] * 0.01 - 2))
Or using data.table for memory efficiency and syntax elegance
library(data.table)
DT <- as.data.table(data)
# make sure that AGE is numeric not integer
DT[,AGE:= as.numeric(AGE)]
DT[AGE > 200, AGE := AGE * 0.01 - 2]
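As a quick sanity check, a sketch applying the same rule to the AGE values from the posted head():
ages <- c(26, 20, 202, 86, 203, 79)
ifelse(ages > 200, ages * 0.01 - 2, ages)
# [1] 26.00 20.00  0.02 86.00  0.03 79.00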
