Number of observations and variables are not equal in data frame in R

I am running the abc (Approximate Bayesian Computation) library in R, using the human dataset from abc.data. I ran the lines of code below for the model selection example, and they work fine.
modsel.it <- postpr(stat.voight["italian",], models, stat.3pops.sim, tol=.05, method="mnlogistic")
summary(modsel.it)
I saved the above-mentioned human dataset data frames (stat.voight, models, stat.3pops.sim) as .csv files and read them back in as st, mod and stat3 respectively, then ran the same line of code on those objects. Saving and reading work fine, but I get an error when I run the postpr function as shown below.
t <- postpr(st["italian",], mod, stat3, tol=.05, method="mnlogistic")
It gives me the error: 'Number of summary statistics in 'target' has to be the same as 'sumstat'.
I then compared the structure (str) of the original data frame with the one I read back from the .csv file. The .csv version (st) differs from the original (stat.voight). I want my data frame st to be the same as the data frame stat.voight. Thanks

The write.csv() function has a default argument of row.names = TRUE, which writes the row names as the first column in the CSV. If you set row.names = FALSE, the row names will not be written to the file.
That said, the objects have a number of attributes that aren't written to the output files with write.csv(). As such, you're better off using saveRDS() and readRDS() to serialize these objects and reload them into R.
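As a rough illustration of both points, here is a minimal sketch using a small made-up data frame in place of stat.voight (the column names and values are hypothetical). It also shows read.csv(..., row.names = 1) as an alternative way to keep the row names, which the original call relies on when indexing with "italian".

```r
# Toy stand-in for stat.voight (values are invented for illustration).
orig <- data.frame(pi = c(0.0025, 0.0031), TajD = c(-0.5, 0.2),
                   row.names = c("hausa", "italian"))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Default write.csv() stores the row names as an extra first column,
# so a plain read.csv() returns one more variable than the original.
write.csv(orig, csv_path)
from_csv <- read.csv(csv_path)
ncol(from_csv)

# Telling read.csv() that column 1 holds the row names restores the shape.
from_csv_fixed <- read.csv(csv_path, row.names = 1)
dim(from_csv_fixed)

# saveRDS()/readRDS() round-trip the object with its attributes intact.
saveRDS(orig, rds_path)
from_rds <- readRDS(rds_path)
identical(orig, from_rds)
```

With the row names restored, an index such as from_csv_fixed["italian", ] selects the same row as in the original object.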

Related

R show variable list/header of Stata or SAS file in R without loading the complete dataset

I am given very big (around 10 Gb each) datasets in both SAS and Stata format. I am going to read them into R for analysis.
Is there a way to show what variables (columns) they contain inside without reading the whole data file? I often only need some of the variables. I can view them of course from File Explorer, but it's not reproducible and takes a lot of time.
Both SAS and Stata are available on the system, but just opening a file might take a minute or so.
If you have SAS, run a proc contents or proc datasets to see the details of the dataset without opening it. You may want to do that anyway, so that you can verify variable types, lengths and formats.
libname myFiles 'path to your sas7bdatfiles';
proc contents data=myfiles.datasetName;
run;
See below for the dta solution, which you can update to SAS using read_sas.
library(haven)
# read in first row of dta
dta_head <- read_dta("my_data.dta", n_max = 1)
# get variable names of dta
dta_names <- names(dta_head)
After examining the names and labels of your dta file, you can then remove the n_max = 1 option and read in full while possibly adding the col_select option specifying the subset of variables you wish to read in.
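The two-step approach could look roughly like the sketch below. It writes a tiny hypothetical .dta file first so that it is self-contained; the file, the variable names (id, age, income) and the requireNamespace() guard are assumptions for illustration, not part of the original answer.

```r
# Only run if the haven package is installed.
if (requireNamespace("haven", quietly = TRUE)) {
  # Hypothetical small file standing in for the large .dta file.
  path <- tempfile(fileext = ".dta")
  haven::write_dta(
    data.frame(id = 1:3, age = c(30, 41, 25), income = c(10, 20, 30)),
    path
  )

  # Step 1: read a single row, which is enough to get the variable names.
  dta_names <- names(haven::read_dta(path, n_max = 1))

  # Step 2: full read, restricted to the variables of interest.
  subset_df <- haven::read_dta(path, col_select = c(id, income))
  names(subset_df)
}
```

For a SAS file, the same pattern should apply with haven::read_sas(), which also accepts n_max and col_select.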

Object not found in R when trying to create a table from a csv dataset

I'm using RStudio to create glm's of a csv dataset and I'm really new to R (Using it for a Uni assignment). Quick summary, it's looking at some motor claim data. I've read.csv the dataset into R;
Motor <- read.csv("motor.csv", quote="", header=TRUE)
then am trying to run
(ClaimsTab <- table(Claims))
to create a table to see the frequency of different claim amounts.
'Claims' is a header in my CSV file and there are no spelling mistakes, but I get
Error in table(Claims) : object 'Claims' not found
I've attempted to attach a picture of my dataset.
What am I doing wrong? I imported a different file earlier and the table() function was working fine.
The name of your dataframe object is Motor.
To access a column in a dataframe in base R you do: Motor$Claims
This should work:
ClaimsTab <- table(Motor$Claims)
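A small self-contained sketch of the same idea, using a made-up Motor data frame (the claim values are hypothetical). It also shows with(), which is another base R way to refer to a column by its bare name.

```r
# Hypothetical stand-in for the data read from motor.csv.
Motor <- data.frame(Claims = c(0, 1, 0, 2, 1, 0))

# The column lives inside the data frame, so it has to be reached through it.
ClaimsTab <- table(Motor$Claims)
ClaimsTab

# Equivalent: evaluate table(Claims) with Motor's columns in scope.
ClaimsTab2 <- with(Motor, table(Claims))
```

The earlier file may have worked because a vector with that name happened to exist in the workspace, or because the data frame had been attached; that is only a guess from the description.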

Analyze code in dataframe with Tidycode in R

I am trying to take R code, stored in cells of the content column of a dataframe, and analyze the functions used by applying the Tidycode package. However, I first need to convert the data to a Matahari tibble before applying an unnest_calls() function.
Here is the data:
data <- read.csv("https://github.com/making-data-science-count/TidyTuesday-Analysis/raw/master/db-tmp/cleaned%20database.csv")
I have tried doing this in a number of different ways, including extracting each row (in the content column ) as an Rfile and then reading it back in with Tidycode calls, for example:
tmp<-data$content2[1]
writeLines(tmp, "tmp.R") #I've also used save() and write()
rfile<-tidycode::read_rfiles("tmp.R")
But, I keep getting errors such as: "Error in parse(text = x) : <text>:1:14: unexpected symbol
1: library(here)library"
Ultimately, what I would like to do is analyze the different types of code per file, and keep that linked with the other data in the data dataframe, such as date and username.
Any help would be greatly appreciated!

How do I get my variables to be recognized?

I am creating a data frame from a csv file. However, when I run the code, it does not recognize all of the columns in the file. It recognizes some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It won't recognize marital or numkids, despite the fact that those are the column names in the table in the .csv file.
When you use read.csv the output is already in a dataframe.
You can simply use smallsample <- read.csv("SmallSample.csv")
Result using a dummy csv file
  age income gender marital numkids      risk
1  32  34932 Female  Single       1 0.9611315
2  22  50535   Male  Single       0 0.7257541
3  40  42358   Male  Single       1 0.6879534
4  40  54648   Male  Single       3  0.568068
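A minimal sketch of this, assuming a dummy file with the same column names as in the question (the file is written to a temporary path so the example runs on its own):

```r
# Write a small dummy CSV with the column names from the question.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,income,gender,marital,numkids,risk",
             "32,34932,Female,Single,1,0.9611315",
             "22,50535,Male,Single,0,0.7257541"), csv_path)

# read.csv() already returns a data frame; no data.frame() wrapper is needed.
smallsample <- read.csv(csv_path)
is.data.frame(smallsample)
names(smallsample)

# Columns can be referenced only after the object exists.
smallsample$marital
smallsample$numkids
```

In the original call, smallsample$age and the other columns are evaluated before smallsample has been created, so they refer either to nothing or to an older object of the same name left in the workspace.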

How can I import data into R that is meant for use in SAS, SPSS, or STATA?

I am attempting to read data from the National Health Interview Survey in R: http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm . The data is Sample Adult. The SAScii library actually has a function read.SAScii whose documentation has an example for the same data set I would like to use. The issue is it "doesn't work":
NHIS.11.samadult.SAS.read.in.instructions <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2011/SAMADULT.sas"
NHIS.11.samadult.file.location <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2011/samadult.zip"
#store the NHIS file as an R data frame!
NHIS.11.samadult.df <-
    read.SAScii(
        NHIS.11.samadult.file.location ,
        NHIS.11.samadult.SAS.read.in.instructions ,
        zipped = TRUE )
#or store the NHIS SAS import instructions for use in a
#read.fwf function call outside of the read.SAScii function
NHIS.11.samadult.sas <- parse.SAScii( NHIS.11.samadult.SAS.read.in.instructions )
#save the data frame now for instantaneous loading later
save( NHIS.11.samadult.df , file = "NHIS.11.samadult.data.rda" )
However, when running it I get the error Error in toupper(SASinput) : invalid multibyte string 533.
Others on Stack Overflow with a similar error, but for functions such as read.delim and read.csv, have been advised to set the argument fileEncoding="latin1", for example. The problem is that read.SAScii has no fileEncoding parameter.
See:
R: invalid multibyte string and Invalid multibyte string in read.csv
Just in case anyone has a similar problem: the solution for me was to run options( encoding = "windows-1252" ) right before running the above code for read.SAScii. The ASCII file is meant for use in SAS, and therefore on Windows, whereas I am using Linux.
The author of the SAScii library actually has another Github repository, asdfree, where he has working code for downloading CDC-NHIS datasets for all available years, as well as many other datasets from various surveys such as the American Housing Survey, FDA Drug Surveys, and many more.
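To illustrate why the encoding matters, here is a small sketch using a hand-built string instead of the real SAS file (the bytes for "café" are an invented example). It also shows one way to set the option temporarily and restore it afterwards, which is my own addition rather than part of the original fix.

```r
# Bytes for "café" as a Windows-1252 file would store them (0xE9 is "é").
raw_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
x <- rawToChar(raw_bytes)

# The lone 0xE9 byte is not valid UTF-8, which is the kind of input that
# functions such as toupper() reject in a UTF-8 locale.
validUTF8(x)

# Re-encoding from Windows-1252 yields a valid UTF-8 string.
y <- iconv(x, from = "windows-1252", to = "UTF-8")
validUTF8(y)

# The fix from the answer, scoped so the previous setting is restored.
old <- options(encoding = "windows-1252")
# ... call read.SAScii() here ...
options(old)
```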
The following links to the author's solution to the issue in this question. From there, you can easily find a link to the asdfree repository: https://github.com/ajdamico/SAScii/issues/3 .
As far as this dataset goes, the code in https://github.com/ajdamico/asdfree/blob/master/National%20Health%20Interview%20Survey/download%20all%20microdata.R#L8-L13 does the trick; however, it doesn't encode the columns as factors or numeric properly. The good thing is that for any given dataset in an NHIS year, there are only about ten to twenty numeric columns, so encoding these as numeric one by one is not so painful, and encoding the rest of the columns as factors requires only a loop through the non-numeric columns.
The easiest solution for me, since I only require the Sample Adult dataset for 2011, and I was able to get my hands on a machine with SAS installed, was to run the SAS program included at http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm to encode the columns as necessary. Finally, I used proc export to export the SAS dataset to a CSV file, which I then opened in R easily with no necessary edits to the data except in dealing with missing values.
In case you want to work with NHIS datasets besides Sample Adult, it is worth noting that when I ran the available SAS program for 2010 "Sample Adult Cancer" (http://www.cdc.gov/nchs/nhis/nhis_2010_data_release.htm) and exported the data to a CSV, there were fewer column names than actual columns when I attempted to read the CSV file into R. Skipping the first line resolves this issue, but you lose the descriptive column names. You can, however, import this same data easily without encoding with the R code in the asdfree repository. Please read the documentation there for more info.
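The column conversion described above might look roughly like the following sketch. The data frame, its column names and the list of numeric columns are all made up for illustration, and treating the remaining columns as factors is an assumption based on the earlier remark about factors.

```r
# Hypothetical data as it might arrive: every column read in as character.
nhis <- data.frame(age = c("34", "51", "28"),
                   weight = c("1023.5", "880", "1500.25"),
                   sex = c("1", "2", "2"),
                   region = c("3", "1", "4"),
                   stringsAsFactors = FALSE)

# The handful of truly numeric columns, listed by hand.
numeric_cols <- c("age", "weight")
nhis[numeric_cols] <- lapply(nhis[numeric_cols], as.numeric)

# Loop through the remaining columns and treat them as categorical.
for (col in setdiff(names(nhis), numeric_cols)) {
  nhis[[col]] <- factor(nhis[[col]])
}

str(nhis)
```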
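A small sketch of the header workaround, using an invented CSV whose header has fewer names than the data has columns. The first read is wrapped in tryCatch() because it is expected to fail on such a file.

```r
# Invented file: two header names, four data columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2,3,4", "5,6,7,8"), csv_path)

# With the header kept, read.csv() cannot match names to columns; the
# tryCatch() lets the sketch keep running whatever the outcome.
with_header <- tryCatch(read.csv(csv_path), error = function(e) e)

# Skipping the header line reads every column, at the cost of the
# descriptive names: the columns get default names V1, V2, ...
no_header <- read.csv(csv_path, skip = 1, header = FALSE)
names(no_header)
```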
