Make a row NA starting from a cell in a column - r

I need to make a row NA starting from a cell in a given column. Please see the example below:
When I use data <- data[!(data$DES6=="F001"),] it removes the 1st and 3rd rows in the example below, but I need to keep the 1st and 3rd rows as shown in the output below.
How can I achieve this in R? Any help is appreciated.
Thanks in advance.
data:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W F001 25 S
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E F001 26 N
output:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W NA NA NA
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E NA NA NA

This will set the matching rows to NA from the specified column through to the last column:
df1[df1$DES6 %in% "F001", seq(grep("^DES6$", colnames(df1)), ncol(df1))] <- NA
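A minimal reproducible sketch of the same idea, using made-up columns standing in for the real data:

```r
# Toy data: hypothetical subset of the question's columns
df1 <- data.frame(
  ID   = c(53, 54, 55),
  DES5 = c("W", "W", "E"),
  DES6 = c("F001", NA, "F001"),
  DES7 = c(25, NA, 26),
  DES8 = c("S", NA, "N"),
  stringsAsFactors = FALSE
)
# Blank out DES6 through the last column on rows where DES6 is "F001"
df1[df1$DES6 %in% "F001", seq(grep("^DES6$", colnames(df1)), ncol(df1))] <- NA
```

Note the use of `%in%` rather than `==`: `DES6 == "F001"` returns NA for the row where DES6 is NA, which would corrupt the row index, while `%in%` returns FALSE there.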

Related

Having trouble Scraping Information from a text / pdf file into R

I track various information relating to the water in California on a daily basis. The people before me did this by manually entering data sourced from websites. I have begun to automate the process using R. It has gone well so far using SelectorGadget for pages like https://cdec.water.ca.gov/reportapp/javareports?name=RES
However, I am having trouble with this report since it is all text:
https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E
I have tried following several text-mining tutorials step by step but am still really confused by this task.
I have also tried converting it to a PDF and using pdftools, but have not been able to achieve my goal.
Any help would be appreciated.
Thanks,
Ethan James W
library(httr)
library(stringi)
res <- httr::GET("https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E")
l <- stri_split_lines(content(res))[[1]]
page_breaks <- which(stri_detect_fixed(l, "SUMMARY OF SWP"))
# target page 1
page_one <- l[1:(page_breaks[2]-1)]
# find all the records on the page
recs <- paste0(page_one[stri_detect_regex(page_one, "^[[:alpha:]].*[[:digit:]]\\.")], collapse="\n")
# read it in as a fixed-width text file (b/c it really kinda is)
read.fwf(
  textConnection(recs),
  widths = c(10, 7, 8, 7, 7, 8, 8, 5, 7, 6, 7),
  stringsAsFactors = FALSE
) -> xdf
# clean up the columns
xdf[] <- lapply(xdf, stri_trim_both)
xdf[] <- lapply(xdf, function(x) ifelse(grepl("\\.\\.|DCTOT", x), "NA", x)) # replace "....."s and the "DCTOT" string with "NA" so we can do the type conversion
xdf <- type.convert(xdf)
colnames(xdf) <- c("reservoir", "abs_max_elev", "abs_max_stor", "norm_min_elev", "norm_min_stor", "elev", "stor", "evap", "chng", "net_rel", "inflow")
xdf$reservoir <- as.character(xdf$reservoir)
Which gives us:
xdf
## reservoir abs_max_elev abs_max_stor norm_min_elev norm_min_stor elev stor evap chng net_rel inflow
## 1 FRENCHMN 5588.0 55475 5560.00 21472 5578.67 41922 NA -53 NA NA
## 2 ANTELOPE 5002.0 22564 4990.00 12971 4994.64 16306 NA -46 NA NA
## 3 DAVIS 5775.0 84371 5760.00 35675 5770.22 66299 NA -106 NA NA
## 4 OROVILLE 901.0 3553405 640.00 852196 702.69 1275280 249 -4792 6018 1475
## 5 F/B 225.0 11768 221.00 9350 224.52 11467 NA -106 NA NA
## 6 DIV 225.0 13353 221.00 12091 224.58 13217 NA -48 NA NA
## 7 F/B+DIV 225.0 25120 221.00 21441 NA 24684 NA -154 NA NA
## 8 AFTERBAY 136.0 54906 124.00 15156 132.73 41822 NA -263 5372 NA
## 9 CLIF CT 5.0 29082 -2.00 13965 -0.72 16714 NA 194 NA 5943
## 10 BETHANY 243.5 4894 241.50 4545 243.00 4806 NA 0 NA NA
## 11 DYER 806.0 545 785.00 90 795.40 299 NA -21 NA NA
## 12 DEL VALLE 703.0 39914 678.00 24777 690.22 31514 NA -122 97 0
## 13 TEHACHAPI 3101.0 545 3097.00 388 3098.22 434 NA -25 NA NA
## 14 TEHAC EAB 3101.0 1232 3085.00 254 3096.64 941 NA -39 NA NA
## 15 QUAIL+LQC 3324.5 8612 3306.50 3564 3318.18 6551 NA -10 0 NA
## 16 PYRAMID 2578.0 169901 2560.00 147680 2574.72 165701 25 -1056 881 0
## 17 ELDRBERRY 1530.0 27681 1490.00 12228 1510.74 19470 NA 805 0 0
## 18 CASTAIC 1513.0 319247 1310.00 33482 1491.48 273616 36 -1520 1432 0
## 19 SILVRWOOD 3355.0 74970 3312.00 39211 3351.41 71511 10 276 1582 107
## 20 DC AFBY 1 1933.0 50 1922.00 18 1932.64 49 NA 0 NA NA
## 21 DC AFBY 2 1930.0 967 1904.50 198 1922.01 696 NA 37 1690 NA
## 22 CRAFTON H 2925.0 292 2905.00 70 2923.60 274 NA -2 NA NA
## 23 PERRIS 1588.0 126841 1555.30 60633 1577.96 104620 21 85 8 NA
## 24 SAN LUIS 543.0 2027835 326.00 79231 470.16 1178789 238 3273 -4099 0
## 25 O'NEILL 224.5 55076 217.50 36843 222.50 49713 NA 2325 NA NA
## 26 LOS BANOS 353.5 34562 296.00 8315 322.87 18331 NA -5 0 0
## 27 L.PANOCHE 670.4 13233 590.00 308 599.60 664 NA 0 0 0
## 28 TRINITY 2370.0 2447656 2145.00 312631 2301.44 1479281 NA -1192 NA NA
## 29 SHASTA 1067.0 4552095 828.00 502004 974.01 2300953 NA -6238 NA NA
## 30 FOLSOM 466.0 976952 327.80 84649 408.50 438744 NA -2053 NA NA
## 31 MELONES 1088.0 2420000 808.00 300000 1031.66 1779744 NA -2370 NA NA
## 32 PINE FLT 951.5 1000000 712.58 100002 771.51 231361 NA 543 508 NA
## 33 MATHEWS 1390.0 182569 1253.80 3546 1352.17 94266 NA 522 NA NA
## 34 SKINNER 1479.0 44405 1393.00 0 1476.02 38485 NA 242 NA NA
## 35 BULLARDS 1956.0 966103 1730.00 230118 1869.01 604827 NA -1310 NA NA
That was the easy one :-)
Most of page 2 is doable in a pretty straightforward manner:
page_two <- l[page_breaks[2]:length(l)]
do.call(
  rbind.data.frame,
  lapply(
    stri_split_fixed(
      stri_replace_all_regex(
        stri_trim_both(page_two[stri_detect_regex(
          stri_trim_both(page_two), # trim blanks
          "^([^[:digit:]]+)([[:digit:]\\.]+)[[:space:]]+([^[:digit:]]+)([[:digit:]\\.]+)$" # find the release rows
        )]),
        "[[:space:]]{2,}", "\t" # make tab-separated fields wherever there are 2+ space breaks
      ), "\t"),
    function(x) {
      if (length(x) > 2) { # one of the lines will only have one record but most have 2
        data.frame(
          facility = c(x[1], x[3]),
          amt = as.numeric(c(x[2], x[4])),
          stringsAsFactors = FALSE
        )
      } else {
        data.frame(
          facility = x[1],
          amt = as.numeric(x[2]),
          stringsAsFactors = FALSE
        )
      }
    })
) -> ydf
Which gives us (sans the nigh useless TOTAL rows):
ydf[!grepl("TOTAL", ydf$facility),]
## facility amt
## 1 KESWICK RELEASE TO RIVER 15386.0
## 2 SHASTA STORAGE WITHDRAWAL 8067.0
## 3 SPRING CREEK RELEASE 0.0
## 4 WHISKYTOWN STORAGE WITHDRAWAL 46.0
## 6 OROVILLE STORAGE WITHDRAWL 5237.0
## 7 CDWR YUBA RIVER # MARYSVILLE 0.0
## 8 FOLSOM STORAGE WITHDRAWAL 1386.0
## 9 LAKE OROVILLE 20.2
## 10 BYRON BETHANY I.D. 32.0
## 11 POWER CANAL 0.0
## 12 SAN LUIS TO SAN FELIPE 465.0
## 13 SUTTER BUTTE 922.0
## 14 O'NEILL FOREBAY 2.0
## 15 LATERAL 0.0
## 16 CASTAIC LAKE 1432.0
## 17 RICHVALE 589.0
## 18 SILVERWOOD LAKE TO CLAWA 7.0
## 19 WESTERN 787.0
## 20 LAKE PERRIS 0.0
## 23 D/S FEATHER R. DIVERSIONS 0.0
## 24 FISH REQUIREMENT 1230.0
## 25 FLOOD CONTROL RELEASE 0.0
## 26 DELTA REQUIREMENT 3629.0
## 27 FEATHER R. RELEASE # RIVER OUTLET 3074.0
## 28 OTHER RELEASE 0.0
But, if you need the deltas or the plant operations data you're on your own.

Conditional interpolation of time series data in R

I have time series data with N/As. The data will end up in an animated scatterplot.
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200
For a smooth animation, the data will be supplemented with calculated additional values/rows. For the X values this is simple arithmetic; no problem so far.
Week X Y
1 1 105
2
2 3 110
4
3 5 N/A
6
4 7 130
8
9
10
11
12
13
14
8 15 160
16
17
18
19
20
21
22
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
The Y values should be interpolated, with the additional requirement that interpolation should only happen between two consecutive values and not between values that have an N/A between them.
Week X Value
1 1 105
2 interpolated value
2 3 110
4
3 5 N/A
6
4 7 130
8 interpolated value
9 interpolated value
10 interpolated value
11 interpolated value
12 interpolated value
13 interpolated value
14 interpolated value
8 15 160
16 interpolated value
17 interpolated value
18 interpolated value
19 interpolated value
20 interpolated value
21 interpolated value
22 interpolated value
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
I have already experimented with approx, converted the "original" N/As to placeholder values, and tried the zoo package with na.approx etc., but I can't work out how to express a correct condition for this kind of "conditional approximation" or "conditional gap filling". Any hint is welcome and very much appreciated.
Thanks in advance
Replace the NAs with Inf, interpolate and then revert infinite values to NA.
library(zoo)
DF2 <- DF
DF2$Y[is.na(DF2$Y)] <- Inf
w <- merge(DF2, data.frame(Week = min(DF2$Week):max(DF2$Week)), by = 1, all.y = TRUE)
w$Value <- na.approx(w$Y)
w$Value[!is.finite(w$Value)] <- NA
giving the following, where Week has been expanded to all weeks. Y is such that the original NAs are shown as Inf and the inserted NAs as NA; Value is the interpolated Y.
> w
Week X Y Value
1 1 1 105 105.0
2 2 3 110 110.0
3 3 5 Inf NA
4 4 7 130 130.0
5 5 NA NA 137.5
6 6 NA NA 145.0
7 7 NA NA 152.5
8 8 15 160 160.0
9 9 NA NA 165.0
10 10 NA NA 170.0
11 11 NA NA 175.0
12 12 23 180 180.0
13 13 NA NA NA
14 14 NA NA NA
15 15 NA NA NA
16 16 30 Inf NA
17 17 NA NA NA
18 18 NA NA NA
19 19 NA NA NA
20 20 37 200 200.0
Note: Input DF in reproducible form:
Lines <- "
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200"
DF <- read.table(text = Lines, header = TRUE, na.strings = "N/A")
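For reference, the same Inf-sentinel trick also works with base R's approx() in place of zoo::na.approx(); a sketch using the question's data:

```r
# Reproducible input (same as the question, N/A read as NA)
DF <- read.table(text = "Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200", header = TRUE, na.strings = "N/A")

# Mark the original NAs with Inf so they block interpolation across them
DF$Y[is.na(DF$Y)] <- Inf
# Expand to every week; the inserted rows get NA in X and Y
w <- merge(DF, data.frame(Week = min(DF$Week):max(DF$Week)),
           by = "Week", all.y = TRUE)
# Interpolate over the inserted NAs; any span touching an Inf knot comes out non-finite
ok <- !is.na(w$Y)
w$Value <- approx(w$Week[ok], w$Y[ok], xout = w$Week)$y
w$Value[!is.finite(w$Value)] <- NA
```

Spans between two finite observations (e.g. weeks 4 to 8) are filled linearly, while spans touching an original N/A (weeks 13 to 15 and 17 to 19) stay NA.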

creating unique sequence for October 15 to April 30th following year- R

Basically, I'm looking at snowpack data. I want to assign a unique value to each date (column "snowday") over the period October 15 to May 15 of the following year (the winter season, of course), ~215 days. Then add a column "snowmonth" that corresponds to the sequential months of the seasonal data, as well as a "snow year" column that represents the year in which each seasonal record starts.
There are some missing dates, but instead of finding those dates and inserting NAs into the rows, I've opted to skip that step and instead go the sequential route, which can then be plotted with respect to "snowmonth".
Basically, I just need to get the "snowday" sequence of about 1:215 (+1 for leap years) down a column, and the rest I can do myself. It looks like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 NA NA
12 27 1955 24 1955-12-27 361 NA NA
12 28 1955 24 1955-12-28 362 NA NA
12 29 1955 24 1955-12-29 363 NA NA
12 30 1955 26 1955-12-30 364 NA NA
12 31 1955 26 1955-12-31 365 NA NA
1 1 1956 25 1956-01-01 1 NA NA
1 2 1956 25 1956-01-02 2 NA NA
1 3 1956 26 1956-01-03 3 NA NA
library(data.table)
library(lubridate)
man <- data.table(read.delim('mansfieldstake.txt', header = TRUE, check.names = FALSE))
man[is.na(man)]<-0
man$date<-paste(man$yy, man$mm, man$dd,sep="-", collapse=NULL)
man$yearday<-NA #day of the year 1-365
colnames(man)<- c("month","day","year","depth", "date","yearday")
man$date<-as.Date(man$date)
man$yearday<-yday(man$date)
man$snowday<-NA
man$snowmonth<-NA
man[420:500,]
head(man)
output would look something like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 73 3
12 27 1955 24 1955-12-27 361 74 3
12 28 1955 24 1955-12-28 362 75 3
12 29 1955 24 1955-12-29 363 76 3
12 30 1955 26 1955-12-30 364 77 3
12 31 1955 26 1955-12-31 365 78 3
1 1 1956 25 1956-01-01 1 79 4
1 2 1956 25 1956-01-02 2 80 4
1 3 1956 26 1956-01-03 3 81 4
I've thought about loops and all that, but it's inefficient... leap years kinda mess things up as well. This has become more challenging than I thought. Good first project though!
Just looking for a simple sequence here, dropping all non-snow months. Thanks to anybody who's got input!
If I understand correctly that snowday should be the number of days since the beginning of the season, all you need to make this column using data.table is:
day_one <- as.Date("1955-10-01")
man[, snowday := as.integer(date - day_one)]
If all you want is a sequence of unique values, then seq() is your best bet.
Then you can create the snowmonth using:
library(lubridate)
man[, snowmonth := floor(-time_length(interval(date, day_one), unit = "month"))]
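If you'd rather avoid extra packages, the same columns can be built in base R. A sketch, assuming the dates are already Date class and the season is taken to start on October 15 (so Oct 15 is snowday 1 and October is snowmonth 1):

```r
# Toy dates standing in for the question's data
d <- as.Date(c("1955-12-26", "1956-01-01", "1956-04-30"))

# Snow year: the calendar year of the season's Oct 15 start
snow_year <- as.integer(format(d, "%Y")) - (format(d, "%m-%d") < "10-15")
season_start <- as.Date(paste0(snow_year, "-10-15"))

# Days since the season started (Oct 15 = day 1); leap years fall out automatically
snowday <- as.integer(d - season_start) + 1

# Sequential season month: Oct = 1, Nov = 2, ..., Jan = 4, ...
snowmonth <- (as.integer(format(d, "%m")) - 10) %% 12 + 1
```

For 1955-12-26 this gives snowday 73 and snowmonth 3, and for 1956-01-01 snowday 79 and snowmonth 4, matching the desired output in the question.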

Creating a vector with multiple sequences based on number of IDs' repetitions

I've got a data frame with panel data: subjects' characteristics through time. I need to create a column with a sequence from 1 to the maximum number of years for every subject. For example, if subject 1 is in the data frame from 2000 to 2005, I need the following sequence: 1,2,3,4,5,6.
Below is a small fraction of my data. The last column (exp) is what I'm trying to get. Additionally, if you have a look at the first subject (13) you'll see that in 2008 the value of qtty is zero. In this case I need just an NA or a code (0, 1, -9999); it doesn't matter which one.
Below the data is what I did to try to get that vector, but it didn't work.
Any help will be much appreciated.
subject season qtty exp
13 2000 29 1
13 2001 29 2
13 2002 29 3
13 2003 29 4
13 2004 29 5
13 2005 27 6
13 2006 27 7
13 2007 27 8
13 2008 0 NA
28 2000 18 1
28 2001 18 2
28 2002 18 3
28 2003 18 4
28 2004 18 5
28 2005 18 6
28 2006 18 7
28 2007 18 8
28 2008 18 9
28 2009 20 10
28 2010 20 11
28 2011 20 12
28 2012 20 13
35 2000 21 1
35 2001 21 2
35 2002 21 3
35 2003 21 4
35 2004 21 5
35 2005 21 6
35 2006 21 7
35 2007 21 8
35 2008 21 9
35 2009 14 10
35 2010 11 11
35 2011 11 12
35 2012 10 13
My code:
numbY<-aggregate(season ~ subject, data = toCountY,length)
colnames(numbY)<-c("subject","inFish")
toCountY$inFish<-numbY$inFish[match(toCountY$subject,numbY$subject)]
numbYbyFisher<-unique(numbY)
seqY<-aggregate(numbYbyFisher$inFish, by=list(numbYbyFisher$subject), function(x)seq(1,x,1))
I am using ddply and I distinguish 2 cases:
Either you generate a sequence along subject and replace it with NA wherever qtty is zero:
ddply(dat, .(subject), transform, new.exp = ifelse(qtty == 0, NA, seq_along(subject)))
Or you generate a sequence along the non-zero qtty values, with a jump wherever qtty is zero:
ddply(dat, .(subject), transform, new.exp = {
hh <- seq_along(which(qtty != 0))
if (length(which(qtty == 0)) > 0)
hh <- append(hh, NA, which(qtty == 0) - 1)
hh
})
EDITED
ind <- qtty != 0
exp <- numeric(length(subject))
temp <- vector("list", length(unique(subject[ind])))
for (i in seq_along(unique(subject[ind]))) {
temp[[i]] <- seq(from = 1, to = table(subject[ind])[i])
}
exp[ind] <- unlist(temp)
This will provide what you need.
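A compact base-R alternative, for anyone who wants to avoid plyr: ave() with seq_along generates the within-subject counter in one line, and the zero-qtty rows can be blanked afterwards. A sketch on a toy version of the question's data:

```r
# Toy subset of the question's panel data
dat <- data.frame(
  subject = c(13, 13, 13, 28, 28),
  season  = c(2000, 2001, 2002, 2000, 2001),
  qtty    = c(29, 29, 0, 18, 18)
)

# 1..n within each subject, in row order
dat$exp <- ave(dat$subject, dat$subject, FUN = seq_along)
# Blank out the rows where qtty is zero
dat$exp[dat$qtty == 0] <- NA
```

This reproduces the first ddply case above (NA in place, no renumbering of later rows).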

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than one Aug-Oct observation) spill over into the Fall2Fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns for Fl, Er, and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't checked whether the discrepancy lies in your sample desired output or in the transformations I've done, but you should be able to follow the code to see how the transformations were made.
## Requires "lubridate" (month, year) and "reshape2" (dcast)
library(lubridate)
library(reshape2)
## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M")
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Load "reshape2" and use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
