Define attribute classes during readOGR - r

Is there any possibility to declare the data type of the attribute columns when importing, for example, an ESRI Shapefile with the readOGR command?
For example, I would like to keep the leading zeros in my key column (id_code):
example <- readOGR("example.shp", "example")
str(example@data)
#'data.frame': 7149 obs. of 22 variables:
# $ id_code: num 101 102 103 104 105 106 107 108 109 110 ...
The result should be something like this:
str(example@data)
#'data.frame': 7149 obs. of 22 variables:
# $ id_code: chr "0101" "0102" "0103" "0104" "0105" "0106"...
I am looking for something similar to colClasses in the read.csv() function.

Yes, you can declare the data type when importing by specifying the encoding, ogrDrivers and use_iconv options in readOGR.
Please see ?readOGR.
From the documentation for the encoding option:
default NULL, if set to a character string, and the driver is “ESRI
Shapefile”, and use_iconv is FALSE, it is passed to the CPL Option
“SHAPE_ENCODING” immediately before reading the DBF of a shapefile. If
use_iconv is TRUE, and encoding is not NULL, it will be used to
convert input strings from the given value to the native encoding for
the system/platform.
You may also want to look into ogrInfo.
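For illustration, a minimal sketch of such a call (the layer name follows the question's example; the dsn and the "UTF-8" value are assumptions, not something the question specifies):
library(rgdal)
# Sketch only: dsn and the encoding value are assumed; with use_iconv = TRUE,
# readOGR converts attribute strings from the given encoding
example <- readOGR(dsn = ".", layer = "example",
                   use_iconv = TRUE, encoding = "UTF-8")
str(example@data)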

Related

classify a large collection of image files

I have a large collection of image files for a book, and the publisher wants a list where files are classified by "type" (greyscale graph, b/w halftone image, color image, line drawing, etc.). This is a hard problem in general, but perhaps I can do some of this automatically using image processing tools, e.g., ImageMagick with the R magick package.
I think ImageMagick is the right tool, but I don't really know how to use it for this purpose.
What I have is just a list of fig numbers & file names:
1.1 ch01-intro/fig/alcohol-risk.jpg
1.2 ch01-intro/fig/measels.png
1.3 ch01-intro/fig/numbers.png
1.4 ch01-intro/fig/Lascaux-bull-chamber.jpg
...
Can someone help get me started?
Edit: This was probably an ill-framed or overly broad question as initially stated. I thought that ImageMagick identify or the R magick::image_info() function could help, so the initial question perhaps should have been: "How to extract image information from a list of files [in R]". I can pose this separately, if not already asked.
An initial attempt at this gave me the following for my first few images:
library(magick)
# `files` holds the character vector of image file paths listed above
# initialize an empty data frame to hold the results of `image_info`
figinfo <- data.frame(
  format     = character(),
  width      = numeric(),
  height     = numeric(),
  colorspace = character(),
  matte      = logical(),
  filesize   = numeric(),
  density    = character(),
  stringsAsFactors = FALSE
)
for (i in seq_along(files)) {
  img  <- image_read(files[i])
  info <- image_info(img)
  figinfo[i, ] <- info
}
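(An aside: since image_info() returns a one-row data frame per image, the same table could be built without growing figinfo row by row; a sketch, assuming `files` holds the paths:)
library(magick)
# Sketch: read each file, collect its metadata, and stack the rows once
figinfo <- do.call(rbind, lapply(files, function(f) image_info(image_read(f))))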
I get:
> figinfo
format width height colorspace matte filesize density
1 JPEG 661 733 sRGB FALSE 41884 72x72
2 PNG 838 591 sRGB TRUE 98276 38x38
3 PNG 990 721 sRGB TRUE 427253 38x38
4 JPEG 798 219 sRGB FALSE 99845 300x300
I conclude that this doesn't help much in answering the question I posed, of how to classify these images.
Edit 2: Before closing this question, the advice to look into direct use of ImageMagick identify was helpful: https://imagemagick.org/script/escape.php
In particular, the %[type] format escape is closer to what I need. It is not exposed in magick::image_info(), so I may have to write a shell script or call system() in a loop.
For the record, here is how I can extract relevant attributes of these image files using identify directly.
# Get image characteristics via ImageMagick identify
# from: https://imagemagick.org/script/escape.php
#
# -format elements:
# %m image file format
# %f filename
# %[type] image type
# %k number of unique colors
# %h image height in pixels
# %r image class and colorspace
identify -format "%m,%f,%[type],%r,%k,%hx%w" imagefile
>identify -format "%m,%f,%[type],%r,%k,%hx%w" Quipu.png
PNG,Quipu.png,GrayscaleAlpha,DirectClass Gray Matte,16,449x299
The %[type] attribute takes me towards what I want.
To close this question:
In an R context, I was successful in using system(..., intern=TRUE) for this task, as follows, with some manual fixups:
# Use identify directly via system()
# function to run identify for one file
get_info <- function(file) {
  cmd  <- 'identify -quiet -format "%f,%m,%[type],%r,%h,%w,%x,%y"'
  info <- system(paste(cmd, file), intern = TRUE)
  unlist(strsplit(info, ","))
}
# assigning the character results into rows doesn't coerce them to numeric,
# hence the manual fixups below
figinfo <- data.frame(
  filename = character(),
  format   = character(),
  type     = character(),
  class    = character(),
  height   = numeric(),
  width    = numeric(),
  xres     = numeric(),
  yres     = numeric(),
  stringsAsFactors = FALSE
)
for (i in seq_along(files)) {
  info <- get_info(files[i])
  info[4] <- sub("DirectClass ", "", info[4])  # strip the redundant class prefix
  figinfo[i, ] <- info
}
# manual fixups: convert the numeric columns back from character
figinfo$height <- as.numeric(figinfo$height)
figinfo$width  <- as.numeric(figinfo$width)
figinfo$xres   <- round(as.numeric(figinfo$xres))
figinfo$yres   <- round(as.numeric(figinfo$yres))
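(An aside, not from the original post: base R's type.convert() can infer the numeric columns in one pass; a sketch that skips the explicit rounding above:)
# Sketch: re-infer each column's type; as.is = TRUE keeps the character
# columns as character rather than converting them to factors
figinfo[] <- lapply(figinfo, type.convert, as.is = TRUE)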
Then I have more or less what I want:
> str(figinfo)
'data.frame': 161 obs. of 8 variables:
$ filename: chr "mileyears4.png" "alcohol-risk.jpg" "measels.png" "numbers.png" ...
$ format : chr "PNG" "JPEG" "PNG" "PNG" ...
$ type : chr "Palette" "TrueColor" "TrueColorAlpha" "TrueColorAlpha" ...
$ class : chr "PseudoClass sRGB " "sRGB " "sRGB Matte" "sRGB Matte" ...
$ height : num 500 733 591 721 219 ...
$ width : num 720 661 838 990 798 ...
$ xres : num 72 72 38 38 300 38 300 38 28 38 ...
$ yres : num 72 72 38 38 300 38 300 38 28 38 ...

R data.table strange search issue (unprintable characters?)

Dear all,
I have a strange issue.
I have a data.table "harvest":
str(harvest)
Classes ‘data.table’ and 'data.frame': 30005 obs. of 19 variables:
$ Date : Date, format: "2014-07-08" ...
$ Client : Factor w/ 68 levels
etc...
Now
harvest[grepl("Belgilux",Client),]
yields 258 results
To find out the exact name of the Client, I do:
harvest[grepl("Belgilux",Client),unique(Client)]
[1] N.V. L'ORÉAL Belgilux S.A.
(out of 68 levels)
So far so good.
But if I do
harvest[Client=="N.V. L'ORÉAL Belgilux S.A."]
Empty data.table (0 rows) of 19 cols: Date,YearMonth,Month,Client,Project,Project.Code...
While I expected the same 258 results.
levels(harvest$Client)[levels(harvest$Client)=="N.V. L'ORÉAL Belgilux S.A."] <- "Replace Me"
Gives no result either.
When I do the same with any other client name, it returns the correct number of results.
I tried to give you a reproducible setup, but the issue does not appear there:
dt=data.table(Client=c("N.V. L'ORÉAL Belgilux S.A.", "Oh MyMedia","Testme","Oh MyMedia","N.V. L'ORÉAL Belgilux S.A."),Value=c(1:5))
dt$Client<-as.factor(dt$Client)
My question is: is it possible that there are some unprintable characters in my "Client" string, and how can I see this?
Any other approaches?
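(Not from the thread, but one hedged way to check for invisible characters or an unexpected Unicode form of the "É" is to inspect the raw bytes of the stored factor level:)
# Sketch: compare the bytes of the stored level with those of the typed
# literal; a mismatch (e.g. a decomposed "E" plus a combining accent) would
# explain why == fails while grepl("Belgilux", ...) still matches
lev    <- levels(harvest$Client)
stored <- lev[grepl("Belgilux", lev)]
charToRaw(stored)                          # bytes actually stored
charToRaw("N.V. L'ORÉAL Belgilux S.A.")    # bytes of the typed literal
utf8ToInt(stored)                          # integer code points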

Data lost in conversion from Factor to Numeric in R

I'm having trouble with a data conversion. I have this data that I get from a .csv file, for instance:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 11,503.68 102,713.20
The field Monto.Pago.Credito is a factor, and I need it to be numeric, the double-precision kind; I need the decimals.
str(comisiones$Monto.Pago.Credito)
Factor w/ 3205 levels "1,000.00","1,000.01",..: 2476 2197 1373 1905 1348 3002 1252 95 2648 667 ...
So I use the generic data conversion function as.numeric():
comisiones$Monto.Pago.Credito <- as.numeric(comisiones$Monto.Pago.Credito)
But then the observation changes to this:
comisiones[2850,28:30]
Periodo.Pago Monto.Pago.Credito Disposicion.En.Efectivo
2850 Mensual 796 102,713.20
str(comisiones$Monto.Pago.Credito)
num [1:5021] 2476 2197 1373 1905 1348 ...
The max of comisiones$Monto.Pago.Credito should be 11,504.68 but now it is 3205.
I don't know if there is a specific data class or type for decimals in R; I've looked for it, but it didn't work.
You need to clean up your column first: remove the commas and convert it to character, then to numeric:
comisiones$Monto.Pago.Credito <- as.numeric(gsub(",", "", comisiones$Monto.Pago.Credito))
The problem shows up when you convert a factor variable directly to numeric: as.numeric() then returns the underlying integer level codes, not the labels, which is why your maximum became 3205, the number of levels.
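(A tiny illustration of that behaviour, as a sketch:)
# Sketch: as.numeric() on a factor yields the level codes, not the values;
# gsub() first coerces the factor to character, so the cleanup works
f <- factor(c("1,000.00", "11,503.68"))
as.numeric(f)                 # 1 2  -- the level codes
as.numeric(gsub(",", "", f))  # 1000.00 11503.68 -- the actual values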
You can use extract_numeric from the tidyr package - it will handle factor inputs and remove commas, dollar signs, etc.
library(tidyr)
comisiones$Monto.Pago.Credito <- extract_numeric(comisiones$Monto.Pago.Credito)
If the resulting numbers are large, they may not print with decimal places when you view them, whether you used as.numeric or extract_numeric (which itself calls as.numeric). But the precision is still being stored. For instance:
> x <- extract_numeric("1,200,000.3444")
> x
[1] 1200000
Verify that precision is still stored:
> format(x, nsmall = 4)
[1] "1200000.3444"
> x > 1200000.3
[1] TRUE

R mistreats a number as a character

I am trying to read a series of text files into R. These files are of the same form, or at least appear to be. Everything is fine except for one file. When I read that file, R treated all the numbers as characters. I used as.numeric to convert them back, but the data values changed. I also tried converting the text file to CSV and then reading it into R, but that did not work either. Has anyone had this problem before? How can I fix it? Thank you!
The data are from the Human Mortality Database. I cannot attach the data here due to copyright issues, but anyone can register with the HMD and download the data (www.mortality.org). As an example, I used the Australian and Belgian 1x1 exposure data.
My codes are as follows:
AUSe <- read.table("AUS.Exposures_1x1.txt", skip=1, header=TRUE)[,-5]
BELe <- read.table("BEL.Exposures_1x1.txt", skip=1, header=TRUE)[,-5]
Then I want to add together some rows of the above data frame or matrix. This is fine for the Australian data (e.g., AUSe[1,3]+AUSe[2,3]), but an error occurs when the same command is applied to the Belgian data: Error in BELe[1, 3] + BELe[2, 3] : non-numeric argument to binary operator. But if you look at the text file, you can see those are two numbers. It is clear that R treated the numbers as characters when reading the text file, which is rather odd.
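(As a quick check, not part of the original post, listing each column's class shows where the parsing went wrong; a sketch:)
# Sketch: any unexpected "character" or "factor" column points at the
# values that stopped read.table from parsing that column as numeric
sapply(BELe, class)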
Try this instead:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=1, colClasses="numeric", header=TRUE)[,-5]
Or you could surely post just a tiny bit of that file and not violate any copyright laws at least in my jurisdiction (which I think is the same one as The Human Mortality Database).
Belgium, Exposure to risk (period 1x1) Last modified: 04-Feb-2011, MPv5 (May07)
Year Age Female Male Total
1841 0 61006.15 62948.23 123954.38
1841 1 55072.53 56064.21 111136.73
1841 2 51480.76 52521.70 104002.46
1841 3 48750.57 49506.71 98257.28
.... . ....
So I might have suggested the even more accurate colClasses:
BELe<-read.table("BEL.Exposures_1x1.txt",skip=2, # really two lines to skip I think
colClasses=c(rep("integer", 2), rep("numeric",3)),
header=TRUE)[,-5]
I suspect the problem occurs because of lines like these:
1842 110+ 0.00 0.00 0.00
So you will need to determine how much interest you have in preserving the 110+ values. With my method they would be coerced to NAs. (Well, I thought they would be, but like you I got an error.) So this multi-step process is needed:
BELe<-read.table("Exposures_1x1.txt",skip=2,
header=TRUE)
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.character)
str(BELe)
#-------------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : chr "0" "1" "2" "3" ...
$ Female: chr "61006.15" "55072.53" "51480.76" "48750.57" ...
$ Male : chr "62948.23" "56064.21" "52521.70" "49506.71" ...
$ Total : chr "123954.38" "111136.73" "104002.46" "98257.28" ...
#-------------
BELe[ , 2:5] <- lapply(BELe[ , 2:5], as.numeric)
#----------
Warning messages:
1: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
2: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
3: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
4: In lapply(BELe[, 2:5], as.numeric) : NAs introduced by coercion
str(BELe)
#-----------
'data.frame': 18759 obs. of 5 variables:
$ Year : int 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 ...
$ Age : num 0 1 2 3 4 5 6 7 8 9 ...
$ Female: num 61006 55073 51481 48751 47014 ...
$ Male : num 62948 56064 52522 49507 47862 ...
$ Total : num 123954 111137 104002 98257 94876 ...
# and just to show that they are not really integers:
BELe$Total[1:5]
#[1] 123954.38 111136.73 104002.46 98257.28 94875.89
The way I typically read those files is:
BELexp <- read.table("BEL.Exposures_1x1.txt", skip = 2, header = TRUE, na.strings = ".", as.is = TRUE)
Note that Belgium lost 3 years of data in WWI that may never be recovered, and hence these three years are all NAs, which in those files are marked with ".", a character string. Hence the argument na.strings = ".". Specifying that argument will take care of all columns except Age, which is character (intentionally), due to the "110+". The reason the HMD does this is so that users have to be intentional about treatment of the open age group. You can convert the age column to integer using:
BELexp$Age <- as.integer(gsub("[+]", "", BELexp$Age))
Since such issues have long been the bane of R HMD users, the HMD has recently posted some R functions in a small but growing package on GitHub called (for now) DemogBerkeley. The function readHMD() removes all of the above headaches:
library(devtools)
install_github("DemogBerkeley", subdir = "DemogBerkeley", username = "UCBdemography")
BELexp <- readHMD("BEL.Exposures_1x1.txt")
Note that a new indicator column, called OpenInterval is added, while Age is converted to integer as above.
Can you try read.csv(..., stringsAsFactors=FALSE)?

Excel misses values from numeric vector with write.table function in R

I am trying to save a data.frame from R so that it can be read by Excel. I have done this with several other data.frames that have the same structure as this one, so far without problems. But for some reason, when I try to save this data.frame and then open it with Excel, many of the numerical values in the columns FreqDev and LengthDev are not displayed; instead, the cells show a string of "#" symbols.
My data.frame looks like this:
head(RegPartV)
LogFreq Word PhonCV WordClass FreqDev LengthDev Irregular
1277 28.395 geweest CV-CVVCC V 5.464336 -1.1518498 FALSE
903 25.647 gemaakt CV-CVVCC V 4.885296 -1.1518498 FALSE
752 23.304 gehad CV-CVC V 4.391595 -2.1100420 FALSE
610 22.765 gebracht CV-CCVCC V 4.278021 -0.6727537 FALSE
1312 22.041 gezegd CV-CVCC V 4.125465 -1.6309459 FALSE
647 21.987 gedaan CV-CVVC V 4.114086 -1.6309459 FALSE
The type of information in the data.frame is:
str(RegPartV)
'data.frame': 2096 obs. of 7 variables:
$ LogFreq : num 28.4 25.6 23.3 22.8 22 ...
$ Word : chr "geweest" "gemaakt" "gehad" "gebracht" ...
$ PhonCV : chr "CV-CVVCC" "CV-CVVCC" "CV-CVC" "CV-CCVCC" ...
$ WordClass: Factor w/ 1 level "V": 1 1 1 1 1 1 1 1 1 1 ...
$ FreqDev : num 5.46 4.89 4.39 4.28 4.13 ...
$ LengthDev: num -1.152 -1.152 -2.11 -0.673 -1.631 ...
$ Irregular: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
What is strange is that if I put my mouse over the numerical cells that now have only # symbols (in the excel file), I see a trace of the numbers that used to be there in the original R data.frame. For example, the values of these columns for the first row in the data.frame are:
>RegPartV[1,c(5,6)]
FreqDev LengthDev
1277 5.464336 -1.15185
And if I put my mouse over the Excel cells (that contain only # symbols) corresponding to the same values I just showed, I see:
54643356148468
and
-115184982188519
So the numbers are still there, but for some reason either R or Excel lost track of where the decimal point was.
The method I am using to save the data.frame (the same one I've used for the structurally equivalent data.frames) is:
write.table(RegPartV, file="RegPartV", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
Then I open the file with Excel. I would expect to see all the info there, but for some reason I'm having this numeric problem with this particular data.frame.
Any suggestions for getting an Excel-readable data.frame are very welcome.
Thanks in advance.
From your problem description I suspect that you have "," as the default decimal separator in Excel. Either change the default in Excel or add dec="," to the write.table command.
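(A sketch of that call with the suggested argument added:)
# Sketch: write with an explicit decimal separator for a comma-decimal
# locale, so Excel parses FreqDev/LengthDev as numbers
write.table(RegPartV, file="RegPartV", quote=FALSE, sep="\t",
            dec=",", row.names=FALSE, col.names=TRUE)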
That isn't actually an error: "#" means that a value is too wide to fit in its column. Widen the column and you'll see the proper contents.
