I have a df with states, and I am trying to add lat/long values for each state so I can plot percent values for each state on a map. When I use merge I get either an empty df (if I don't use all=TRUE) or missing data for either my lat/long values or my data itself, depending on which data frame I make x or y.
Code to load my df and add column headers:
fileURL <- c("https://drive.google.com/open?id=0B-jAX5hT2D3hNnVtLVhROENKRGs")
suppressMessages(require(data.table))
ge.planted <- fread(fileURL, na.strings = "NA")
colnames(ge.planted) <- c("region", "type", "crop", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")
Code to get state names with lat/long values for the center of each state:
snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
When I merge the two df using:
snames <- merge(ge.planted, snames, by="region")
I get
[1] region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[17] 2011 2012 2013 2014 2015
Or if I use
snames <- merge( ge.planted, snames, by="region", all=TRUE)
And I get my values, but lat and long are NA:
region type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1: Alabama Insect-resistant (Bt) only Cotton - - - - - 10 10 10 18 13 11 18 17 12
2: Alabama Herbicide-tolerant only Cotton - - - - - 28 25 25 15 18 7 4 11 4
3: Alabama Stacked gene varieties Cotton - - - - - 54 60 60 65 60 76 75 70 82
4: Alabama All GE varieties Cotton - - - - - 92 95 95 98 91 94 97 98 98
5: Arkansas Herbicide-tolerant only Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
6: Arkansas All GE varieties Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
2014 2015 long lat
1: 9 4 NA NA
2: 6 3 NA NA
3: 83 90 NA NA
4: 98 97 NA NA
5: 99 97 NA NA
6: 99 97 NA NA
And finally with
snames <- merge(snames, ge.planted, by="region", all=TRUE)
I get lat, long but no values
region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
1 alabama -87 33 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
2 alaska -127 49 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
3 arizona -112 34 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
4 arkansas -92 35 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
5 california -120 37 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
6 colorado -106 39 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
From what I can tell, instead of merging the data frames on 'region', it is appending the 'y' values onto the end of the data frame.
The problem is that you used tolower(), so the region names in one frame differ from those in the other (ge.planted is capitalized, snames is not), and merge will not recognize the names as equivalent. Delete the tolower() call (or lowercase both columns) and it should work.
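A minimal sketch with toy data (ge here is hypothetical, standing in for ge.planted) showing the failure mode and one fix: lowercase the key on BOTH sides before merging.

```r
# Toy stand-in for ge.planted: capitalized state names
ge <- data.frame(region = c("Alabama", "Arkansas"), crop = c("Cotton", "Soybean"))
# Same construction as the question: lowercased names from base R's state data
sn <- data.frame(region = tolower(state.name),
                 long = state.center$x, lat = state.center$y)

nrow(merge(ge, sn, by = "region"))   # 0: "Alabama" never matches "alabama"

ge$region <- tolower(ge$region)      # normalize case on both sides
merged <- merge(ge, sn, by = "region")
nrow(merged)                         # 2, with long/lat attached
```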
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have a dataframe like this:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA NA
145 4 2005 NA NA NA
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA NA
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
fcmstat: type of first marital status change
secmstat: type of second marital status change
For ID 4 (19), fcmstat changed in 2003 (2002) and secmstat changed in 2006 (2004). So for ID 4, the 2004 and 2005 mstat should be the same as the 2003 fcmstat, and for ID 19, the 2003 mstat should be the same as the 2002 fcmstat.
I want to fill in the last column as follows:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA 2
145 4 2005 NA NA 2
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA 6
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
Also, before any first change, the mstat should be the same as before. Consider the following case.
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA NA
1177 61 1984 NA NA NA
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
The first change was in 1985, so the missing mstat in 1983 and 1984 should be the same as the mstat of 1982. So for this case, my desired output is:
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA 0
1177 61 1984 NA NA 0
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
As suggested by Schilker the code df$mstat_updated<-na.locf(df$mstat) gives the following:
ID year fcmstat secmstat mstat mstat_updated
138 4 1998 NA NA 1 1
139 4 1999 NA NA 1 1
140 4 2000 NA NA 1 1
141 4 2001 NA NA 1 1
142 4 2002 NA NA 1 1
143 4 2003 2 NA 2 2
144 4 2004 NA NA NA 2
145 4 2005 NA NA NA 2
146 4 2006 NA 3 3 3
147 4 2007 NA NA NA 3
148 4 2008 NA NA NA 3
However, I do want to fill in mstat for 2004 and 2005 but not for 2007 and 2008: I want to fill in NAs only between the first change (fcmstat) and the second change (secmstat).
As I mentioned in my comment, this is a duplicate of here:
library(zoo)
df <- data.frame(ID = c('4', '4', '4', '4'),
                 year = c(2003, 2004, 2005, 2006),
                 mstat = c(2, NA, NA, 3))
df$mstat <- na.locf(df$mstat)
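Note that na.locf alone fills every NA, including the trailing ones the question wants left alone. A base-R sketch (no zoo needed, data below is a toy subset) that carries mstat forward within each ID but stops at the row of the second change, on the assumption that NAs after the secmstat row should stay NA:

```r
# Simple last-observation-carried-forward in base R
locf <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}

# Fill mstat only up to the row where secmstat records the second change
fill_between <- function(d) {
  sec <- which(!is.na(d$secmstat))[1]        # row of second change, if any
  stop_at <- if (is.na(sec)) nrow(d) else sec
  d$mstat[1:stop_at] <- locf(d$mstat[1:stop_at])
  d
}

df <- data.frame(ID = c(4, 4, 4, 4, 4), year = 2003:2007,
                 fcmstat = c(2, NA, NA, NA, NA),
                 secmstat = c(NA, NA, NA, 3, NA),
                 mstat = c(2, NA, NA, 3, NA))
out <- do.call(rbind, lapply(split(df, df$ID), fill_between))
out$mstat                                    # 2 2 2 3 NA -- trailing NA preserved
```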
I want to subtract the year of the interview (inwyys) from the variables containing the respondents' years of birth (yrbrn) and save the results as new variables in the data frame.
Head of the data frame:
inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA
Can someone help me with that?
Thank you very much!
This can be done by subsetting your data (x[,-1] takes everything except the first column; x[,1] takes the first column) and doing the subtraction. With cbind you can bind the result onto the original data.
cbind(x, x[,-1] - x[,1])
# inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8 yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
#1 2012 1949 1955 NA NA NA NA NA -63 -57 NA NA NA NA NA
#2 2012 1983 1951 1956 1989 1995 2003 2005 -29 -61 -56 -23 -17 -9 -7
#3 2012 1946 1946 1978 NA NA NA NA -66 -66 -34 NA NA NA NA
#4 2013 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#5 2013 1953 1959 1980 1985 1991 2008 2011 -60 -54 -33 -28 -22 -5 -2
#6 2013 1938 NA NA NA NA NA NA -75 NA NA NA NA NA NA
Data:
x <- read.table(header=TRUE, text=" inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA")
I believe the following is what you are looking for:
data$newvar1 <- data$yrbrn2 - data$inwyys
Replace "data" with the name of your data set. If you want to do it for each yrbrn column, just change "newvar1" to "newvar2", etc., so you do not overwrite your previous calculations.
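If there are many yrbrn columns, a short loop avoids writing each assignment by hand. Sketched on toy data; the column names are assumed to match ^yrbrn, and "data" again stands in for your data set:

```r
# Toy frame with the same shape as the question's head()
data <- data.frame(inwyys = c(2012, 2013),
                   yrbrn2 = c(1949, NA), yrbrn3 = c(1955, 1938))

# Find every year-of-birth column and add one difference column per variable
yr_cols <- grep("^yrbrn", names(data), value = TRUE)
for (v in yr_cols) data[[paste0("diff_", v)]] <- data[[v]] - data$inwyys

data$diff_yrbrn2   # -63  NA  (NA birth years propagate to NA differences)
```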
I track various information relating to the water in California on a daily basis. The people before me did this by manually entering data sourced from websites. I have begun to automate the process using R. It has gone well so far using SelectorGadget for pages like https://cdec.water.ca.gov/reportapp/javareports?name=RES
However, I am having trouble with this report since it is all text:
https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E
I have tried following different text-mining tutorials step by step but am still really confused by this task. I have also tried converting it to a PDF and using pdftools, but have not been able to achieve my goal.
Any help would be appreciated.
Thanks,
Ethan James W
library(httr)
library(stringi)
res <- httr::GET("https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E")
l <- stri_split_lines(content(res))[[1]]
page_breaks <- which(stri_detect_fixed(l, "SUMMARY OF SWP"))
# target page 1
page_one <- l[1:(page_breaks[2]-1)]
# find all the records on the page
recs <- paste0(page_one[stri_detect_regex(page_one, "^[[:alpha:]].*[[:digit:]]\\.")], collapse="\n")
# read it in as a fixed-width text file (b/c it really kinda is)
read.fwf(
  textConnection(recs),
  widths = c(10, 7, 8, 7, 7, 8, 8, 5, 7, 6, 7),
  stringsAsFactors = FALSE
) -> xdf
# clean up the columns
xdf[] <- lapply(xdf, stri_trim_both)
xdf[] <- lapply(xdf, function(x) ifelse(grepl("\\.\\.|DCTOT", x), "NA", x)) # replace "....."s and the "DCTOT" string with "NA" so we can do the type conversion
xdf <- type.convert(xdf)
colnames(xdf) <- c("reservoir", "abs_max_elev", "abs_max_stor", "norm_min_elev", "norm_min_stor", "elev", "stor", "evap", "chng", "net_rel", "inflow")
xdf$reservoir <- as.character(xdf$reservoir)
Which gives us:
xdf
## reservoir abs_max_elev abs_max_stor norm_min_elev norm_min_stor elev stor evap chng net_rel inflow
## 1 FRENCHMN 5588.0 55475 5560.00 21472 5578.67 41922 NA -53 NA NA
## 2 ANTELOPE 5002.0 22564 4990.00 12971 4994.64 16306 NA -46 NA NA
## 3 DAVIS 5775.0 84371 5760.00 35675 5770.22 66299 NA -106 NA NA
## 4 OROVILLE 901.0 3553405 640.00 852196 702.69 1275280 249 -4792 6018 1475
## 5 F/B 225.0 11768 221.00 9350 224.52 11467 NA -106 NA NA
## 6 DIV 225.0 13353 221.00 12091 224.58 13217 NA -48 NA NA
## 7 F/B+DIV 225.0 25120 221.00 21441 NA 24684 NA -154 NA NA
## 8 AFTERBAY 136.0 54906 124.00 15156 132.73 41822 NA -263 5372 NA
## 9 CLIF CT 5.0 29082 -2.00 13965 -0.72 16714 NA 194 NA 5943
## 10 BETHANY 243.5 4894 241.50 4545 243.00 4806 NA 0 NA NA
## 11 DYER 806.0 545 785.00 90 795.40 299 NA -21 NA NA
## 12 DEL VALLE 703.0 39914 678.00 24777 690.22 31514 NA -122 97 0
## 13 TEHACHAPI 3101.0 545 3097.00 388 3098.22 434 NA -25 NA NA
## 14 TEHAC EAB 3101.0 1232 3085.00 254 3096.64 941 NA -39 NA NA
## 15 QUAIL+LQC 3324.5 8612 3306.50 3564 3318.18 6551 NA -10 0 NA
## 16 PYRAMID 2578.0 169901 2560.00 147680 2574.72 165701 25 -1056 881 0
## 17 ELDRBERRY 1530.0 27681 1490.00 12228 1510.74 19470 NA 805 0 0
## 18 CASTAIC 1513.0 319247 1310.00 33482 1491.48 273616 36 -1520 1432 0
## 19 SILVRWOOD 3355.0 74970 3312.00 39211 3351.41 71511 10 276 1582 107
## 20 DC AFBY 1 1933.0 50 1922.00 18 1932.64 49 NA 0 NA NA
## 21 DC AFBY 2 1930.0 967 1904.50 198 1922.01 696 NA 37 1690 NA
## 22 CRAFTON H 2925.0 292 2905.00 70 2923.60 274 NA -2 NA NA
## 23 PERRIS 1588.0 126841 1555.30 60633 1577.96 104620 21 85 8 NA
## 24 SAN LUIS 543.0 2027835 326.00 79231 470.16 1178789 238 3273 -4099 0
## 25 O'NEILL 224.5 55076 217.50 36843 222.50 49713 NA 2325 NA NA
## 26 LOS BANOS 353.5 34562 296.00 8315 322.87 18331 NA -5 0 0
## 27 L.PANOCHE 670.4 13233 590.00 308 599.60 664 NA 0 0 0
## 28 TRINITY 2370.0 2447656 2145.00 312631 2301.44 1479281 NA -1192 NA NA
## 29 SHASTA 1067.0 4552095 828.00 502004 974.01 2300953 NA -6238 NA NA
## 30 FOLSOM 466.0 976952 327.80 84649 408.50 438744 NA -2053 NA NA
## 31 MELONES 1088.0 2420000 808.00 300000 1031.66 1779744 NA -2370 NA NA
## 32 PINE FLT 951.5 1000000 712.58 100002 771.51 231361 NA 543 508 NA
## 33 MATHEWS 1390.0 182569 1253.80 3546 1352.17 94266 NA 522 NA NA
## 34 SKINNER 1479.0 44405 1393.00 0 1476.02 38485 NA 242 NA NA
## 35 BULLARDS 1956.0 966103 1730.00 230118 1869.01 604827 NA -1310 NA NA
That was the easy one :-)
Most of page 2 is doable in a pretty straightforward manner:
page_two <- l[page_breaks[2]:length(l)]
do.call(
  rbind.data.frame,
  lapply(
    stri_split_fixed(
      stri_replace_all_regex(
        stri_trim_both(page_two[stri_detect_regex(
          stri_trim_both(page_two), # trim blanks
          "^([^[:digit:]]+)([[:digit:]\\.]+)[[:space:]]+([^[:digit:]]+)([[:digit:]\\.]+)$" # find the release rows
        )]),
        "[[:space:]]{2,}", "\t" # make tab-separated fields wherever there are 2+ space breaks
      ), "\t"),
    function(x) {
      if (length(x) > 2) { # one of the lines will only have one record but most have 2
        data.frame(
          facility = c(x[1], x[3]),
          amt = as.numeric(c(x[2], x[4])),
          stringsAsFactors = FALSE
        )
      } else {
        data.frame(
          facility = x[1],
          amt = as.numeric(x[2]),
          stringsAsFactors = FALSE
        )
      }
    })
) -> ydf
Which gives us (sans the nigh useless TOTAL rows):
ydf[!grepl("TOTAL", ydf$facility),]
## facility amt
## 1 KESWICK RELEASE TO RIVER 15386.0
## 2 SHASTA STORAGE WITHDRAWAL 8067.0
## 3 SPRING CREEK RELEASE 0.0
## 4 WHISKYTOWN STORAGE WITHDRAWAL 46.0
## 6 OROVILLE STORAGE WITHDRAWL 5237.0
## 7 CDWR YUBA RIVER # MARYSVILLE 0.0
## 8 FOLSOM STORAGE WITHDRAWAL 1386.0
## 9 LAKE OROVILLE 20.2
## 10 BYRON BETHANY I.D. 32.0
## 11 POWER CANAL 0.0
## 12 SAN LUIS TO SAN FELIPE 465.0
## 13 SUTTER BUTTE 922.0
## 14 O'NEILL FOREBAY 2.0
## 15 LATERAL 0.0
## 16 CASTAIC LAKE 1432.0
## 17 RICHVALE 589.0
## 18 SILVERWOOD LAKE TO CLAWA 7.0
## 19 WESTERN 787.0
## 20 LAKE PERRIS 0.0
## 23 D/S FEATHER R. DIVERSIONS 0.0
## 24 FISH REQUIREMENT 1230.0
## 25 FLOOD CONTROL RELEASE 0.0
## 26 DELTA REQUIREMENT 3629.0
## 27 FEATHER R. RELEASE # RIVER OUTLET 3074.0
## 28 OTHER RELEASE 0.0
But, if you need the deltas or the plant operations data you're on your own.
I need to set a row to NA starting at a cell in a given column. Please see the example below. How can I achieve this in R? Any help is appreciated.
When I use data <- data[!(data$DES6=="F001"),] it removes the 1st and 3rd rows in the example below, but I need to keep the 1st and 3rd rows, as shown in the output below.
Thanks in advance.
data:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W F001 25 S
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E F001 26 N
output:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W NA NA NA
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E NA NA NA
This will set each matching row to NA from the specified column through to the end:
df1[df1$DES6 %in% "F001", seq(grep("^DES6$", colnames(df1)), ncol(df1))] <- NA
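A quick check of that one-liner on a toy frame (shortened column set, hypothetical values), showing that %in% also handles the NA in DES6 gracefully:

```r
df1 <- data.frame(ID   = 1:3,
                  DES5 = c("W", NA, "E"),
                  DES6 = c("F001", NA, "F001"),
                  DES7 = c("25", NA, "26"),
                  stringsAsFactors = FALSE)

# Blank out everything from DES6 onward in rows where DES6 == "F001";
# %in% returns FALSE for the NA row, so it is left alone
from <- grep("^DES6$", colnames(df1))
df1[df1$DES6 %in% "F001", from:ncol(df1)] <- NA

df1   # rows 1 and 3 now have NA in DES6/DES7; DES5 is untouched
```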
I'm working with a dataset, "Final.Export", that looks like this:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
56 UNKNOWN NA NA NA
I want to flag all duplicate values as 1. Duplicates are defined as rows that have exactly the same values in EVERY one of the columns 'LakeID', 'Date', 'LagosVariableID', 'SampleDepth', and 'SamplePosition'.
To do this I have created a new data table "data1" using the following code:
library(data.table)
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[,Dup:=duplicated(.SD),.SDcols=c('LakeID','Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition','Value')]
data1$Dup[which(data1$Dup==FALSE)]=NA
data1$Dup[which(data1$Dup==TRUE)]=1
The problem with "data1" is that only the duplicate rows after the first occurrence are flagged as "1"; the first occurrence itself gets NA. I need to flag both the first row and its associated duplicates as "1." Any ideas how to do this?
If this is confusing let me know how I can clarify.
It's difficult to say without a reproducible example, but it seems you want something like this:
data1[,dup:=duplicated(.SD),
by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]
Edit:
After OP's clarification it appears they simply want this:
data1[,dup:=duplicated(.SD),
.SDcols=c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]
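One caveat for the OP's stated goal: duplicated() alone leaves the first occurrence of each group unflagged. Combining duplicated() scanned from both ends flags the whole group. A base-R sketch on toy data (the same idea works inside a data.table j-expression):

```r
d <- data.frame(LakeID      = c(390, 390, 390),
                Date        = c("2003-08-26", "2003-08-26", "2003-08-26"),
                SampleDepth = c(6, 6, 7))

keys <- c("LakeID", "Date", "SampleDepth")

# TRUE for every member of a duplicate group, including the first occurrence
is_dup <- duplicated(d[keys]) | duplicated(d[keys], fromLast = TRUE)
d$Dup <- ifelse(is_dup, 1, NA)

d$Dup   # 1 1 NA -- both rows of the duplicate pair flagged, unique row NA
```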