I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, rows #74807 and #74809 in this case would be selected, but not row #74823, because its elasmo.name is "Skates" while its tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows. Do not omit the which in the first one: without it, comparisons involving NAs will drag unwanted all-NA rows into the result (as demonstrated below).
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's load it in with the strings stored as factors (the default behaviour), just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
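You can confirm this directly, assuming the file above:
> levels(dat$col1)
[1] "a" "b" "c"
> levels(dat$col2)
[1] "a" "b" "d"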
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
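Incidentally, if re-reading the data is inconvenient, a quick workaround (a sketch, using only base R) is to coerce the factors to character on the fly, which sidesteps the level-set mismatch entirely:
> subset(dat, as.character(col1) == as.character(col2))
  col1 col2
2    a    a
3    b    b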
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong! (Note that since R 4.0.0, stringsAsFactors = FALSE has been the default for both read.csv and data.frame, so up-to-date installations are spared this particular headache.)
Hope this helps :)
I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?
Two issues: first, use the text= argument rather than textConnection; second, use as.data.table rather than setDT, since setDT modifies its argument in place by reference, and at that point the read.table result hasn't been assigned to a name yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
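Since your real data has 189 columns, here is a sketch (my addition, not part of the answer above) of selecting the differing columns programmatically instead of naming them by hand - it keeps every column that has more than one distinct value across the duplicated rows:
# keep only the columns whose values are not constant across the group
diffcols <- names(dat1)[sapply(dat1, data.table::uniqueN) > 1]
dat1[, ..diffcols]
#    location observationRep observationVal setSource
# 1:     loc1              1           17.6      XF47
# 2:     loc2              2            1.9      LT12
Note that on this sample it also returns location, which genuinely differs between the two rows, even though your expected output omits it.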
I'm running the code below to retrieve a data set which, unfortunately, uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and get on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. The first four columns are integers setting out the date and the unique ID, so I only need to correct the other columns. Any help would be greatly appreciated. Thanks in advance!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
Use readLines to read a character vector.
Use gsub to change . to NA.
Use fread to read as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file as I downloaded it, this worked in one line - as suggested by @MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
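For completeness, here is a sketch of what I assume your .SD line was aiming for - replacing the dots and converting to numeric in place with data.table, assuming the affected columns came in as character (as they would from your read.csv call):
library(data.table)
cols <- names(google_mobility_data)[-(1:4)]  # skip the four integer date/ID columns
google_mobility_data[, (cols) := lapply(.SD, function(x) as.numeric(replace(x, x == ".", NA))),
                     .SDcols = cols]
The (cols) := lapply(.SD, ...) idiom assigns the transformed columns back by reference, which is what := is for.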
I have two dataframes. The number of observations is very different, and I would like to use some information from one dataframe in the other, conditional on some logical relations, but I can't seem to manage it. A down-scaled example would look something like this:
year <- as.vector(c(rep(1949,5), rep(1950,5), rep(1951,5), rep(1952,5)))
moneyband <- as.vector(c(rep(c(10,20,30,40,50),4)))
rate <-as.vector(c(rep(c(0.1,0.2,0.3,0.4,0.5),2),rep(c(0.15,0.25,0.35,0.45,0.55),2)))
datasmall <- as.data.frame(cbind(year,moneyband,rate))
yearbig <- as.vector(c(rep(1949,10), rep(1950,10), rep(1951,10), rep(1952,11)))
earnings <- as.vector(c(rep(c(9,19,30,39,50),8),60))
databig <- as.data.frame(cbind(yearbig,earnings))
Now I want to create a new variable in the big data frame (let's call it ratebig) that holds the rate associated with that amount of earnings whenever earnings (in the big data frame) equals moneyband (in the small one) for a given year. As you can see, in this example that happens for the values 30 and 50. The rest should be NA.
I tried this:
databig$ratebig <- NA
for (i in 1949:1952) {
databig$ratebig[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])] <- datasmall$rate[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])]
}
But the different sizes of the data frames (or other things) are giving me trouble: I get errors and the results are wrong. The result does not respect the conditions as I would like, and it is influenced by the relative position of rows and the structure of the two datasets.
In principle, I wouldn't want to merge the datasets (we are talking about a high number of observations in the real data) and was hoping for a way to do this.
Thanks!!
For your case, merge works fine:
merge(databig, datasmall, by.x = c("yearbig", "earnings"),
by.y = c("year", "moneyband"), all.x = TRUE)
# yearbig earnings rate
#1 1949 9 NA
#2 1949 9 NA
#3 1949 19 NA
#4 1949 19 NA
#5 1949 30 0.30
#6 1949 30 0.30
#7 1949 39 NA
#8 1949 39 NA
#9 1949 50 0.50
#10 1949 50 0.50
#.....
Regarding why your for loop doesn't work as expected: you need to iterate over every row of databig, not over years:
databig$ratebig <- NA
for (i in 1:nrow(databig)) {
inds <- databig$yearbig[i] == datasmall$year &
databig$earnings[i] == datasmall$moneyband
if (any(inds))
databig$ratebig[i] <- datasmall$rate[inds]
}
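Since you mentioned wanting to avoid merging, here is a sketch of a vectorised alternative (my suggestion, not part of the answer above): build a composite year/amount key on each side and look the rates up with match, which returns NA wherever there is no match:
# paste year and amount into a single lookup key on each side
key_small <- paste(datasmall$year, datasmall$moneyband)
key_big   <- paste(databig$yearbig, databig$earnings)
databig$ratebig <- datasmall$rate[match(key_big, key_small)]
This avoids both the merge and the row-by-row loop, at the cost of assuming each (year, moneyband) pair appears at most once in datasmall.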
My data looks like this:
colnames(dati) <- c("grupa", "regions5", "regions6", "novads.rep", "pilseta.lt", "specialists", "limenis.1", "limenis.2", "cipari.3", "ratio", "gads", "KV", "DS")
and I have manually applied split to it in order to get 24 splits (12 splits including year and 12 without splitting by year). I did them the following way:
k1<-split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k2<-split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
...
k13<-split(dati$ratio,list(dati$grupa),drop=TRUE)
k14<-split(dati$ratio,list(dati$grupa,dati$regions5),drop=TRUE)
...etc
and what I mean to do is to apply these splits to my function (call it myfunction here), along the lines of:
myfunction(k1, k13)
but instead of inserting the values manually I would like to loop over them, similar to this:
for(i in 1:12){myfunction(k[i],k[i+12])}
I just can't seem to find the right way to do it
dati, the data I am splitting, looks like this:
grupa regions5 regions6 novads.rep pilseta.lt specialists
1 1* Zemgales Zemgales Novads lauki Silva
2 1* Kurzemes Kurzemes Novads lauki Sniedze
3 3* Kurzemes Kurzemes REP pilsēta AnitaE
4 1* Vidzemes Vidzemes Novads pilsēta Dainis
limenis.1 limenis.2 cipari.3 ratio gads KV
1 Jelgavas nov. Svētes pag. 1 0.8682626 2011 2162
2 Ventspils nov. Vārves pag. 1 0.3923857 2011 27467
3 _Liepāja _Liepāja 4 0.4069100 2011 30107
4 Alūksnes nov. Alūksne 2 0.5641127 2011 8147
DS
1 2490.03
2 70000.00
3 73989.33
4 14442.15
...
and here is the output I'm looking for:
count mean lowermean uppermean median ...
2011.1*.Kurzemes 119 0.83322820 7.719323e-01 0.8945241 0.79888324
2012.1*.Kurzemes 171 0.82800498 7.836221e-01 0.8723879 0.84424821
2013.1*.Kurzemes 144 0.77551814 7.347631e-01 0.8162731 0.80745150
2014.1*.Kurzemes 180 0.78134649 7.396007e-01 0.8230923 0.81635065
2015.1*.Kurzemes 80 0.78146588 7.135070e-01 0.8494248 0.73659659
2011.10*.Kurzemes 16 1.09552970 6.930780e-01 1.4979814 1.02127841
2012.10*.Kurzemes 22 0.87442906 5.721409e-01 1.1767172 0.74787482
2013.10*.Kurzemes 25 0.84406131 6.947097e-01 0.9934129 0.91786319
2014.10*.Kurzemes 22 0.79385199 5.880507e-01 0.9996533 0.71708060
2015.10*.Kurzemes 12 1.19059850 8.213604e-01 1.5598365 1.25322750
2012.11*.Kurzemes 1 0.09461065 NA NA 0.09461065
2013.11*.Kurzemes 2 0.18134522 -1.823437e+00 2.1861274 0.18134522
2014.11*.Kurzemes 1 0.11097174 NA NA 0.11097174
2013.12*.Kurzemes 1 0.44620780 NA NA 0.44620780
...
You could use a list:
k <- list()
k[[1]] <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k[[2]] <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
# etc
Then the following is valid:
for(i in 1:12){
  myfunction(k[[i]], k[[i+12]])  # myfunction is whatever function you are applying
}
Note that k3 is the name of a variable, which could just as well be x, myvar32, whatever. When you type k[[3]], you state that you want to access the third element of the list k. k and k3 are totally distinct variables: if you want to access your objects as k[[i]], you must first create the list k and store what you need in k[[i]].
The double-bracket notation is used to access the elements of a list (single brackets, as in k[3], would return a one-element sublist rather than the element itself). A list can store anything, which is exactly what you need in your case.
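One more detail (my addition): the loop above throws each call's return value away. If you want to keep the results, a sketch of storing them in a list as well:
results <- vector("list", 12)
for (i in 1:12) {
  results[[i]] <- myfunction(k[[i]], k[[i + 12]])  # keep the output of each pair
}
results[[1]] then holds the output for the (k1, k13) pair, and so on.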
I am doing a rolling regression on a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code, step by step, and it works very well:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
#use the 20 Quarters data to do regression
model<-lm(fit,data=datapool[(which(datapool$Q>=5&datapool$Q<=24)),])
#use the model to forecast the value of next quarter
pre<-predict(model,newdata=datapool[which(datapool$Q==25),])
#get the forecast error
error<-datapool[which(datapool$Q==25),]$EB -pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B (also from Q5 to Q45)
...
Firm C (also from Q5 to Q45)
The errors produced above are all marked with the "X" value from datapool, so I can tell which firm each error comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, and have come up with the following code:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
for (i in 0:20){
model<-lm(fit,data=datapool[(which(datapool$Q>=5+i&datapool$Q<=24+i)),])
pre<-predict(model,newdata=datapool[which(datapool$Q==25+i),])
error<-datapool[which(datapool$Q==25+i),]$EB -pre
}
The code above runs without errors, but I do not know how to collect the errors produced by each regression into one object automatically. Can anyone help me with that?
(I say again: it's a really bad idea to use the name error for a vector - it invites confusion with R's error-handling machinery.) This is how I would have attempted the task, using the subset argument and indexing rather than the tortured which statements:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
pre    <- vector("list", 21)  # predictions, one element per 20-quarter window
errset <- vector("list", 21)  # forecast errors, one element per window
for (i in 0:20){
  model <- lm(fit, data=datapool, subset= Q>=5+i & Q<=24+i)
  newdat <- datapool[datapool$Q == 25+i, ]          # the quarter being forecast
  pre[[i+1]]    <- predict(model, newdata=newdat)   # one prediction per firm
  errset[[i+1]] <- newdat$EB - pre[[i+1]]
}
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have not offered either data or a comprehensive description of the data object.
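To address the original question of compiling all the errors into one object, a sketch (my addition): since errset is a list with one element per window, it can be flattened into a single data frame:
allerrors <- data.frame(
  window       = rep(seq_along(errset), lengths(errset)),  # which 20-quarter window
  forecast_err = unlist(errset)                            # the individual errors
)
head(allerrors)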