I am trying to read a Stata dataset into R with the foreign package. When I try to read the file using:
library(foreign)
data <- read.dta("data.dta")
I get the following error:
Error in read.dta("data.dta") : a binary read error occurred
The file works fine in Stata. This site suggests saving the file in Stata without labels and then reading it into R. With this workaround I can load the file into R, but then I lose the labels. Why am I getting this error, and how can I read the file into R with the labels intact? Another person reports getting this error when they have variables with no values. My data do contain at least one or two such variables, but I have no easy way of identifying them in Stata: it is a very large file with thousands of variables.
You should call library(foreign) before reading the Stata data.
library(foreign)
data <- read.dta("data.dta")
Updates: As mentioned here,
"The error message implies that the file was found, and that it started
with the right sequence of bytes to be a Stata .dta file, but that
something (probably the end of the file) prevented R from reading what it
was expecting to read."
But we might just be guessing without further information.
Update regarding the OP's question and answer:
I tested whether that is the case using the auto dataset that ships with Stata, but it's not. So there must be other reasons:
*Claims 1 and 2: if a variable contains missing values, or the dataset carries labels, R's read.dta will generate the error.*
sysuse auto        // this dataset has labels
replace mpg = .    // set every mpg value to missing
br in 1/10
make            price   mpg   rep78   headroom   trunk   weight   length   turn   displacement   gear_ratio    foreign
AMC Concord      4099     .       3        2.5      11     2930      186     40            121         3.58   Domestic
AMC Pacer        4749     .       3        3.0      11     3350      173     40            258         2.53   Domestic
AMC Spirit       3799     .       .        3.0      12     2640      168     35            121         3.08   Domestic
Buick Century    4816     .       3        4.5      16     3250      196     40            196         2.93   Domestic
Buick Electra    7827     .       4        4.0      20     4080      222     43            350         2.41   Domestic
Buick LeSabre    5788     .       3        4.0      21     3670      218     43            231         2.73   Domestic
Buick Opel       4453     .       .        3.0      10     2230      170     34            304         2.87   Domestic
Buick Regal      5189     .       3        2.0      16     3280      200     42            196         2.93   Domestic
Buick Riviera   10372     .       3        3.5      17     3880      207     43            231         2.93   Domestic
Buick Skylark    4082     .       3        3.5      13     3400      200     42            231         3.08   Domestic
save "~myauto"
describe
Contains data from ~\myauto.dta
obs: 74 1978 Automobile Data
vars: 12 25 Aug 2013 11:32
size: 3,478 (99.9% of memory free) (_dta has notes)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: foreign
library(foreign)
myauto <- read.dta("myauto.dta")  # works perfectly
str(myauto)
'data.frame': 74 obs. of 12 variables:
$ make : chr "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
$ price : int 4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
$ mpg : int NA NA NA NA NA NA NA NA NA NA ...
$ rep78 : int 3 3 NA 3 4 3 NA 3 3 3 ...
$ headroom : num 2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
$ trunk : int 11 11 12 16 20 21 10 16 17 13 ...
$ weight : int 2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
$ length : int 186 173 168 196 222 218 170 200 207 200 ...
$ turn : int 40 40 35 40 43 43 34 42 43 42 ...
$ displacement: int 121 258 121 196 350 231 304 196 231 231 ...
$ gear_ratio : num 3.58 2.53 3.08 2.93 2.41 ...
$ foreign : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "datalabel")= chr "1978 Automobile Data"
- attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
- attr(*, "formats")= chr "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
- attr(*, "types")= int 18 252 252 252 254 252 252 252 252 252 ...
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
- attr(*, "expansion.fields")=List of 2
..$ : chr "_dta" "note1" "from Consumer Reports with permission"
..$ : chr "_dta" "note0" "1"
- attr(*, "version")= int 12
- attr(*, "label.table")=List of 1
..$ origin: Named int 0 1
.. ..- attr(*, "names")= chr "Domestic" "Foreign"
Here's a list of things to try; my guess is that the first item has a 75% chance of solving your issue.
In Stata, re-save a fresh copy of your .dta file with saveold, and try again.
If that fails, provide a sample that shows what kind of values kill the read.dta function.
If missing values are to blame, run the loop from the other answer.
A more thorough description of the dataset would be needed to get past that point, but the issue seems fixable: I've never had much trouble reading piles of Stata files with foreign.
You might also give the Stata.file function in the memisc package a try, to see whether it fails too; a sketch follows below.
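For completeness, here is a minimal sketch of that memisc route, assuming the same file name as in the question. memisc works through an importer object, so values are not read until you ask for them, which can at least localize where the read fails:
library(memisc)
imp <- Stata.file("data.dta")   # importer object; values are not loaded yet
ds <- as.data.set(imp)          # now pull everything into memory
df <- as.data.frame(ds)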
I do not know why this occurs and would be interested if anyone could explain, but read.dta indeed cannot handle variables that are all NA. A solution is to delete such variables in Stata with the following code:
foreach varname of varlist * {
    // note: summarize also leaves r(N)==0 for string variables,
    // so any all-string variable would be dropped by this loop too
    quietly summarize `varname'
    if `r(N)' == 0 {
        drop `varname'
        display "dropped `varname' for too much missing data"
    }
}
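If you can load the file at all via the no-labels workaround, you can also identify the all-NA variables from the R side and then drop them in Stata before re-saving with labels. A sketch, assuming the workaround copy is loaded as data:
# data is the no-labels copy loaded via the workaround
all_na <- names(data)[sapply(data, function(x) all(is.na(x)))]
all_na   # the variables to drop in Stata before re-saving with labels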
It's been a long time, but I solved this same problem by exporting the .dta data to .csv. The problem was related to the labels of the factor variables, especially because the labels were in Spanish and the ASCII encoding made a mess of them. I hope this works for someone with the same problem and access to Stata.
In Stata:
export delimited using "/Users/data.csv", nolabel replace
In R:
df <- read.csv("/Users/data.csv")
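If you do need the labels, it may also be worth re-running the export without the nolabel option and giving R an explicit encoding; this is only a guess on my part, since the labels were Spanish:
# fileEncoding is an assumption -- try "latin1", "UTF-8" or "windows-1252"
df <- read.csv("/Users/data.csv", fileEncoding = "latin1")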
Related
I have some metabolomics data I am trying to process (validate the compounds that are actually present).
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I tried to write a loop so that if the Delta.ppm value is larger than (m.z - Molecular.Weight)/Molecular.Weight, the entire row is dropped from the resulting data frame.
for (i in 1:nrow(rawdata)) {
  ppm <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm > rawdata$Delta.ppm[i]) {
    filtered_data <- rbind(filtered_data, rawdata[i,])
  }
}
Instead of giving me a new data frame with the validated compounds, it generates a single number for 'ppm' under the 'Values' section of the environment pane.
Still very new to R, any help is super appreciated!
No need to do this row by row; we can remove all undesired rows in one operation:
## base R
good <- with(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
newdat <- rawdat[good, ]
## dplyr
library(dplyr)
newdat <- filter(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
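One caveat, assuming these columns can contain NA: comparisons involving NA yield NA, base-R subsetting turns NA entries of good into junk all-NA rows, and filter() silently drops them. An NA-safe version of the base-R approach:
good <- with(rawdat, (m.z - Molecular.Weight)/Molecular.Weight < Delta.ppm.)
newdat <- rawdat[!is.na(good) & good, ]   # keep rows where the condition is TRUE, not NA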
Iteratively adding rows to a frame with rbind(old, newrow) works but scales horribly; see "Growing Objects" in The R Inferno. Each call copies every row already in old, so the loop gets slower and slower as the frame grows. It is far better to collect the new rows in a list and rbind them all at once; e.g.,
out <- list()
for (...) {
  # ... construct newrow ...
  out <- c(out, list(newrow))
}
alldat <- do.call(rbind, out)
To keep the loop structure, pre-allocate ppm as a vector and index it, and make sure filtered_data exists before the first rbind:
ppm <- numeric(nrow(rawdata))    # pre-allocate; ppm[i] <- NULL would fail
filtered_data <- rawdata[0, ]    # empty frame with the same columns
for (i in 1:nrow(rawdata)) {
  ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm[i] > rawdata$Delta.ppm.[i]) {
    filtered_data <- rbind(filtered_data, rawdata[i, ])
  }
}
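Applying the list-collection pattern recommended above to the same loop (a sketch, assuming the column names from the str() output), the rbind then happens only once:
ppm <- numeric(nrow(rawdata))
out <- list()
for (i in seq_len(nrow(rawdata))) {
  ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm[i] > rawdata$Delta.ppm.[i]) {
    out <- c(out, list(rawdata[i, ]))   # collect the row, no rbind yet
  }
}
filtered_data <- do.call(rbind, out)    # bind all kept rows at once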
EDIT: The problem was not with the geoMean function but with a wrong use of aggregate(), as explained in the comments.
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the function geoMean from the EnvStats package, like this:
library("EnvStats")
aggregate(df[, 2:3], list(df$species), geoMean, na.rm = TRUE)
it works wonderfully and skips the NAs, giving me the geometric means per species.
  Group.1     phen     hveg
1      Ae 2.449490 13.41641
2      Be 1.414214 59.49790
When I do this with my large dataset, however, the function stumbles over NAs and returns NA even though there are, say, 10 numeric values and only one NA. This happens, for example, with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?
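For what it's worth, a quick check with values like those in the SLA column confirms that geoMean itself skips NAs when asked, which supports the conclusion in the edit above that aggregate(), not geoMean, was at fault:
library(EnvStats)
geoMean(c(19.6, 19.8, 20.3, NA), na.rm = TRUE)  # about 19.9; the NA is skipped
geoMean(c(19.6, 19.8, 20.3, NA))                # NA without na.rm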
I am trying to read the table from the following URL:
url <- 'http://faculty.chicagobooth.edu/ruey.tsay/teaching/introTS/m-ge3dx-4011.txt'
da <- read.table(url, header = TRUE, fill=FALSE, strip.white=TRUE)
I can look at the data using head:
> head(da)
date ge vw ew sp
1 19400131 -0.061920 -0.024020 -0.019978 -0.035228
2 19400229 -0.009901 0.013664 0.029733 0.006639
3 19400330 0.049333 0.018939 0.026168 0.009893
4 19400430 -0.041667 0.001196 0.013115 -0.004898
5 19400531 -0.197324 -0.220314 -0.269754 -0.239541
6 19400629 0.061667 0.066664 0.066550 0.076591
This works fine for the first four columns; for example, I can look at the column ew:
> head(da$ew)
[1] -0.019978 0.029733 0.026168 0.013115 -0.269754 0.066550
but when I try to access the last one, I get some extra output that is not in the txt file.
> head(da$sp)
[1] -0.035228 0.006639 0.009893 -0.004898 -0.239541 0.076591
859 Levels: -0.000060 -0.000143 -0.000180 -0.000320 -0.000659 -0.000815 ... 0.163047
How do I get rid of the extra output? Thanks!
This is the printed representation of a factor.
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : Factor w/ 859 levels "-0.000060","-0.000143",..: 226 411 445 42 353 828 613 585 441 684 ...
Row 58 has a dot instead of a number. That single dot is enough to make R treat the whole column as a factor. Once you change the dot to NA or fix it in the source file, the data will read in fine; one way to handle it at read time is sketched below.
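A minimal sketch, reusing url from the question: na.strings = "." tells read.table that a dot means missing, so sp is parsed as numeric with an NA where the dot was.
da <- read.table(url, header = TRUE, strip.white = TRUE, na.strings = ".")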
Another option is to repair the column after the data have been read in, by coercing the factor to numeric. The following statement turns the offending . into NA:
da$sp <- as.numeric(as.character(da$sp))
> str(da)
'data.frame': 861 obs. of 5 variables:
$ date: int 19400131 19400229 19400330 19400430 19400531 19400629 19400731 19400831 19400930 19401031 ...
$ ge : num -0.0619 -0.0099 0.0493 -0.0417 -0.1973 ...
$ vw : num -0.024 0.0137 0.0189 0.0012 -0.2203 ...
$ ew : num -0.02 0.0297 0.0262 0.0131 -0.2698 ...
$ sp : num -0.03523 0.00664 0.00989 -0.0049 -0.23954 ...
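You can then locate the offending row(s) directly:
which(is.na(da$sp))   # should point at the row that contained the dot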
I'm re-running Kaplan-Meier survival curves from previously published data, using the exact dataset used in the publication (Charpentier et al. 2008 - Inbreeding depression in ring-tailed lemurs (Lemur catta): genetic diversity predicts parasitism, immunocompetence, and survivorship). The publication ran the curves in SAS version 9, using LIFETEST, to analyze the age at death structured by genetic heterozygosity and sex of the animal (n = 64). She reports a chi-square value of 6.31 and a p-value of 0.012; however, when I run the curves in R, I get a chi-square value of 0.9 and a p-value of 0.821. Can anyone explain this?
R code used: age is the time to death, mort1 is the censoring indicator, sex is the stratum of gender, and ho2 is the factor delineating the two groups to be compared.
> survdiff(Surv(age, mort1)~ho2+sex,data=mariekmsurv1)
Call:
survdiff(formula = Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1)
N Observed Expected (O-E)^2/E (O-E)^2/V
ho2=1, sex=F 18 3 3.23 0.0166 0.0215
ho2=1, sex=M 12 3 2.35 0.1776 0.2140
ho2=2, sex=F 17 5 3.92 0.3004 0.4189
ho2=2, sex=M 17 4 5.50 0.4088 0.6621
Chisq= 0.9 on 3 degrees of freedom, p= 0.821
> str(mariekmsurv1)
'data.frame': 64 obs. of 6 variables:
$ id : Factor w/ 65 levels "","aeschylus",..: 14 31 33 30 47 57 51 39 36 3 ...
$ sex : Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ mort1: int 0 0 0 0 0 0 0 0 0 0 ...
$ age : num 0.12 0.192 0.2 0.23 1.024 ...
$ sex.1: Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ ho2 : int 1 1 1 2 1 1 1 1 1 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:141] 65 66 67 68 69 70 71 72 73 74 ...
.. ..- attr(*, "names")= chr [1:141] "65" "66" "67" "68" ...
Some ideas:
Try running it in SAS -- see if you get the same results as the author. Maybe they didn't send you the exact same dataset they used.
Look into the default values of the relevant SAS PROC and compare them to the defaults of the R function you are using; one specific difference to check is sketched below.
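In particular (an assumption on my part, since the SAS code is not shown): if the publication stratified the log-rank test by sex rather than comparing the four ho2 x sex groups, the R equivalent uses strata(), which yields a 1-degree-of-freedom test of ho2:
library(survival)
# compare ho2 groups while stratifying by sex instead of crossing the factors
survdiff(Surv(age, mort1) ~ ho2 + strata(sex), data = mariekmsurv1)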
Given the HUGE difference between the chi-square values (6.31 vs. 0.9) and the p-values (0.012 vs. 0.821) between the SAS and R procedures, I suspect that you have used the wrong variables in one of the two.
Procedural or data-handling differences between SAS and R can cause only very small discrepancies.
This is not a software error; it is highly likely to be a human error.
I am using the ggmcmc package to produce a summary pdf file of rjags package output using the ggmcmc() function. However, I get the following error message:
> ggmcmc(x, file = "Model0-output.pdf")
Plotting histograms
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 160, 164
When I check the structure of the input data frame I created with the ggs() function, everything looks correct.
> str(x)
'data.frame': 240000 obs. of 4 variables:
$ Iteration: int 1 2 3 4 5 6 7 8 9 10 ...
$ Chain : int 1 1 1 1 1 1 1 1 1 1 ...
$ Parameter: Factor w/ 32 levels "N[1]","N[2]",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 96 87 76 79 89 95 85 78 86 89 ...
- attr(*, "nChains")= int 3
- attr(*, "nParameters")= int 32
- attr(*, "nIterations")= int 2500
- attr(*, "nBurnin")= num 2000
- attr(*, "nThin")= num 2
- attr(*, "description")= chr "postout0"
- attr(*, "parallel")= logi FALSE
Can anyone help me identify where the error is being caused and how I can correct it? Am I missing something obvious?
ggmcmc 0.5.1 computes the number of bins in a different manner than previous versions did. Previous versions relied on ggplot2:::bin, whereas 0.5.1 computes the bins and their binwidth by itself.
It is likely that in your case the range of some of the parameters is so extreme that rounding errors give some of them one bin more or one less, thereby producing this error.
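If upgrading does not help, a possible workaround (assuming only the histogram step is failing) is to request the other plots explicitly through the plot argument of ggmcmc(); the exact plot names can vary by version, see ?ggmcmc:
ggmcmc(x, file = "Model0-output.pdf",
       plot = c("density", "traceplot", "running", "autocorrelation"))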