Subsetting SPSS data imported into R with package haven? - r

I've used the package haven to read SPSS data into R. All seems ok, except that when I try to subset the data it doesn't seem to behave correctly. Here's the code (I don't have SPSS to create example data and can't post the real stuff):
require(haven)
df <- read_spss("filename1.sav")
tmp <- df[as_factor(df$variable1) == "factor1",]
tmp <- tmp[!is.na(tmp$variable2), ]
The above df has "NA" scattered throughout. I expected the above to subset the data, keeping only rows where variable1 is "factor1" and discarding all rows with NAs in variable2. The first subset works as expected, but the second does not: it removes rows, yet NAs are still present.
I suspect the issue has something to do with the way haven structures the imported data and uses the class labelled instead of an actual factor variable, but it's over my head. Anyone know what could be happening and how to accomplish the same?
Here's the structure of df, variable1 and variable2:
> str(df)
'data.frame': 4573 obs. of 316 variables:
> str(df$variable1)
Class 'labelled' atomic [1:4573] 9 9 9 14 8 8 2 4 8 16 ...
..- attr(*, "labels")= Named num [1:18] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:18] "factor1" "factor2" "factor3" "factor4" ...
> str(df$variable2)
Class 'labelled' atomic [1:4573] 3 NA 3 NA 3 NA 1 1 NA NA ...
..- attr(*, "labels")= Named num [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "Sponsor" "Not a Sponsor" "Don't Know"
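No answer to this question is included here, so the following is only a sketch of two things worth ruling out, using base R and a made-up three-column data frame (the values and the extra column variable3 are assumptions, and the columns are plain vectors rather than haven's labelled class). First, a logical row index that contains NA produces all-NA rows, which which() avoids. Second, filtering on variable2 alone leaves NAs in every other column untouched.

```r
# Toy stand-in for the imported data (hypothetical values; plain vectors
# rather than haven's labelled class).
df <- data.frame(
  variable1 = c("factor1", NA, "factor2", "factor1", "factor1"),
  variable2 = c(3, 1, NA, NA, 1),
  variable3 = c(NA, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# A comparison with NA yields NA, and an NA row index yields an all-NA row.
tmp_naive <- df[df$variable1 == "factor1", ]

# which() keeps only the positions that are TRUE, dropping the NA ones.
tmp <- df[which(df$variable1 == "factor1"), ]
tmp <- tmp[!is.na(tmp$variable2), ]

# The filter only looked at variable2, so NAs in other columns survive.
na_per_column <- colSums(is.na(tmp))
```

With 316 columns in the real data, inspecting na_per_column for the filtered result should show whether the remaining NAs sit in variable2 or elsewhere.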

Related

Rphylopars: "Error in class(tree) <- "phylo" : attempt to set an attribute on NULL"

I'm trying to compute a phenotypic covariance matrix between a fatty acid dataset and a phylogenetic tree using the Rphylopars package.
I'm able to load the data set and phylogeny; however, when I attempt to run the test I get the error message
Error in class(tree) <- "phylo" : attempt to set an attribute on NULL
This is the code for the test
phy <- read.tree("combined_trees.txt")
plot(phy)
phy$tip.label
FA_data <- read.csv("fatty_acid_example_data.csv", header = TRUE, na.strings = ".")
head(FA_data)
str(FA_data)
PPE <- phylopars(trait_data = FA_data$fatty1_continuous, tree = FA_data$phy)
Not sure what other info will help figure out the issue. The data set and phylogeny loaded without an error.
In the tutorial, the tree and trait data are jointly simulated by the simtraits() function, so both end up as elements of a single list. In your case (which will be typical of real-data cases), the tree and the trait data come from different sources, so most likely you want
PPE <- phylopars(trait_data = FA_data, tree = phy)
provided that FA_data contains a first column, species, matching the tip names in phy, and otherwise only the numeric data you want to use (potentially only the single fatty1_continuous column).
For comparison, the data structure returned by simtraits() looks like this (using str()):
List of 4
$ trait_data:'data.frame': 45 obs. of 5 variables:
..$ species: chr [1:45] "t7" "t8" "t2" "t3" ...
..$ V1 : num [1:45] 1.338 0.308 1.739 2.009 2.903 ...
..$ V2 : num [1:45] -2.002 -0.115 -0.349 -4.452 NA ...
..$ V3 : num [1:45] -1.74 NA 1.09 -2.54 -1.19 ...
..$ V4 : num [1:45] 2.496 2.712 1.198 1.675 -0.117 ...
$ tree :List of 4
..$ edge : int [1:28, 1:2] 29 29 28 28 27 27 26 26 25 25 ...
..$ edge.length: num [1:28] 0.0941 0.0941 0.6233 0.7174 0.0527 ...
..$ Nnode : int 14
..$ tip.label : chr [1:15] "t7" "t8" "t2" "t3" ...
..- attr(*, "class")= chr "phylo"
..- attr(*, "order")= chr "postorder"
...
You can see that simtraits() returns a list containing (among other things) (1) a data frame with species as the first column and the other columns numeric and (2) a phylogenetic tree.
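To see where the error message comes from, here is a minimal base R sketch that does not call Rphylopars at all. The trait values and tip labels are made up, and tip_labels stands in for phy$tip.label. It shows that asking a data frame for a column it does not have returns NULL, that setting a class on NULL raises the message from the question, and how one might check that species names and tip labels agree before fitting.

```r
# Hypothetical stand-ins: a small trait table and the tip labels a tree
# object would carry in phy$tip.label.
FA_data <- data.frame(
  species = c("t1", "t2", "t3"),
  fatty1_continuous = c(0.2, 1.4, 0.9)
)
tip_labels <- c("t1", "t2", "t3")

# FA_data has no column named phy, so $ silently returns NULL.
tree_arg <- FA_data$phy

# Setting a class on NULL reproduces the message from the question.
msg <- tryCatch(
  {
    class(tree_arg) <- "phylo"
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Before fitting, list any species without a matching tip, and vice versa.
unmatched <- union(
  setdiff(FA_data$species, tip_labels),
  setdiff(tip_labels, FA_data$species)
)
```

An empty unmatched vector suggests the first column and the tree tips line up; anything listed there would need to be renamed or dropped first.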

dplyr Mutate Creating Matrix Instead of Vector

I am creating a new column that looks at conditions in my data frame and alerts me whether an issue needs to be investigated or monitored. The code to add the column looks like this:
library(dplyr)
df %>%
mutate("Status" =
ifelse(apply(.[2:7], 1, sum) > 0 & .[8] > 0, "Investigate",
"Monitor"
)
)
If I run the command class(df$Status) on this newly generated column, the class is listed as 'matrix'. What? Why isn't it listed as 'character'?
If I look at the structure of my data frame there's some oddity that may be the key, but I don't understand why. Notice that the first columns listed simply look like integers, but then the third column listed, which is the same kind of data, is followed by all this 'attr' phrasing. What is going on?
$ 2017-08 : int NA 1 NA 1 1 2 NA NA NA NA ...
$ 2017-09 : int NA NA 1 NA NA NA NA NA NA NA ...
$ 2017-10 : int NA NA NA NA NA NA 1 NA NA NA ...
- attr(*, "vars")= chr "Material"
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 34
..$ : int 0
..$ : int 1
..$ : int 2
..$ : int 3
..$ : int 4
...continued...
- attr(*, "group_sizes")= int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "biggest_group_size")= int 1
- attr(*, "labels")='data.frame': 34 obs. of 1 variable:
I grouped variables earlier and sometimes ungrouping magically helps. In addition I often have to convert tibbles back to data frames to get other routines to work in my code. This may or may not be related.
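No answer to this question is included here. One likely cause, offered as a sketch rather than a confirmed diagnosis: single-bracket indexing such as .[8] returns a one-column data frame, comparing a data frame with a number returns a matrix, and ifelse() keeps the shape of its test. Double brackets extract the column as a plain vector instead. The example below uses base R only, with a made-up data frame whose column names are assumptions. The attr lines in the str() output look like the bookkeeping of a grouped data frame rather than part of the 2017-10 column, so calling ungroup() before mutate() may also be worth trying.

```r
# Hypothetical data: two count columns and one flag column.
df <- data.frame(
  Material = c("a", "b", "c"),
  m1 = c(1, 1, 0),
  m2 = c(1, 0, 0),
  flag = c(1, 0, 1)
)

# df[4] is a one-column data frame, so the comparison yields a matrix and
# ifelse() returns a character matrix.
status_mat <- ifelse(apply(df[2:3], 1, sum) > 0 & df[4] > 0,
                     "Investigate", "Monitor")

# df[[4]] is a plain numeric vector, so the result is a character vector.
status_vec <- ifelse(apply(df[2:3], 1, sum) > 0 & df[[4]] > 0,
                     "Investigate", "Monitor")
```

Inside the mutate() call from the question, the equivalent change would be writing .[[8]] in place of .[8].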

How to get result of package function into a dataframe in r

I am at the learning stage of R.
I am using library(usdm) in R, calling vifcor(vardata,th=0.4,maxobservations =50000) to find the variables that are not multicollinear. I need to get the result of that call into a structured data frame for further analysis.
Data reading process I am using:
performdata <- read.csv('F:/DGDNDRV_FINAL/OutputTextFiles/data_blk.csv')
vardata <- performdata[, c(names(performdata[5:length(names(performdata))-2]))]
Content of the csv file:
pointid grid_code Blocks_line_dst_CHT GrowthCenter_dst_CHT Roads_nationa_dst_CHT Roads_regiona_dst_CHT Settlements_CHT_line_dst_CHT Small_Hat_Bazar_dst_CHT Upazilla_lin_dst_CHT resp
1 6 150 4549.428711 15361.31836 3521.391846 318.9043884 3927.594727 480 1
2 6 127.2792206 4519.557617 15388.68457 3500.24292 342.0526123 3902.883545 480 1
3 2 161.5549469 4484.473145 15391.6377 3436.539063 335.4101868 3844.216553 540 1
My tries:
r<-vifcor(vardata,th=0.2,maxobservations =50000) returns
2 variables from the 6 input variables have collinearity problem:
Roads_regiona_dst_CHT GrowthCenter_dst_CHT
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( Small_Hat_Bazar_dst_CHT ~ Roads_nationa_dst_CHT ): -0.04119076963
max correlation ( Small_Hat_Bazar_dst_CHT ~ Settlements_CHT_line_dst_CHT ): 0.1384278434
---------- VIFs of the remained variables --------
Variables VIF
1 Blocks_line_dst_CHT 1.026743892
2 Roads_nationa_dst_CHT 1.010556752
3 Settlements_CHT_line_dst_CHT 1.038307666
4 Small_Hat_Bazar_dst_CHT 1.026943711
class(r) returns
[1] "VIF"
attr(,"package")
[1] "usdm"
mode(r) returns "S4"
I need Roads_regiona_dst_CHT GrowthCenter_dst_CHT into a dataframe and VIFs of the remained variables into another dataframe!
But nothing worked!
Basically, the returned result is an S4 object, and you can extract its slots via the @ operator:
library(usdm)
example(vifcor) # creates 'v2'
str(v2)
# Formal class 'VIF' [package "usdm"] with 4 slots
# ..@ variables: chr [1:10] "Bio1" "Bio2" "Bio3" "Bio4" ...
# ..@ excluded : chr [1:5] "Bio5" "Bio10" "Bio7" "Bio6" ...
# ..@ corMatrix: num [1:5, 1:5] 1 0.0384 -0.3011 0.0746 0.7102 ...
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# ..@ results :'data.frame': 5 obs. of 2 variables:
# .. ..$ Variables: Factor w/ 5 levels "Bio1","Bio2",..: 1 2 3 4 5
# .. ..$ VIF : num [1:5] 2.09 1.37 1.25 1.27 2.31
So you can now extract the excluded and results slots via:
v2@excluded
# [1] "Bio5" "Bio10" "Bio7" "Bio6" "Bio4"
v2@results
# Variables VIF
# 1 Bio1 2.086186
# 2 Bio2 1.370264
# 3 Bio3 1.253408
# 4 Bio8 1.267217
# 5 Bio9 2.309479
You should be able to use the command below to get the information in the slot 'results' into a data frame. You can then split the information out into separate data frames using traditional methods.
df <- r@results
Note that r@results[1:2,2] would give you the VIF for the first two rows.
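For the two data frames the question asks for, a sketch along these lines may help. It does not use usdm: the class ToyVIF is a hypothetical stand-in with only the excluded and results slots, filled with a few values from the printed output in the question. Slots are read with the slot operator @ or the equivalent slot() function, and the character vector of excluded variables is wrapped in data.frame() to get a data frame.

```r
library(methods)

# Hypothetical stand-in for the usdm VIF class, with two of its slots.
setClass("ToyVIF", slots = c(excluded = "character", results = "data.frame"))

r <- new(
  "ToyVIF",
  excluded = c("Roads_regiona_dst_CHT", "GrowthCenter_dst_CHT"),
  results = data.frame(
    Variables = c("Blocks_line_dst_CHT", "Roads_nationa_dst_CHT"),
    VIF = c(1.026743892, 1.010556752)
  )
)

# The excluded slot is a character vector; wrap it to get a data frame.
excluded_df <- data.frame(Variables = r@excluded)

# The results slot is already a data frame; slot() is equivalent to @.
vif_df <- slot(r, "results")

# slotNames() lists what else can be extracted.
available <- slotNames(r)
```

With the real object returned by vifcor(), the same two extraction lines should apply unchanged, since the str() output above shows slots with these names.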

How to access composite elements in a data frame

I've created this data frame and want to access the individual elements for plotting, but it seems I can't. What kind of data frame have I created, and how can I access its individual elements?
> print(df)
B.mean B.conf1 B.conf2
1 0.75000000 -0.18826132 1.68826132
2 0.66666667 0.01334534 1.31998799
3 0.33333333 -0.31998799 0.98665466
> names(df)
[1] "B"
> str(df)
'data.frame': 3 obs. of 1 variable:
$ B: num [1:3, 1:3] 0.75 0.6667 0.3333 -0.1883 0.0133 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "conf1" "conf2"
The 'B' column is a matrix as evident from the str of 'df'. By using do.call with data.frame, it gets converted to 3 columns of a data.frame.
do.call(data.frame, df)
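As a small worked sketch of the above, the data frame from the question can be rebuilt in base R (the helper column id is only an assumption used to construct it) and then flattened. The matrix column can also be indexed directly without converting anything.

```r
# Rebuild a data frame whose single column B is a 3 x 3 matrix.
m <- cbind(
  mean  = c(0.75000000, 0.66666667, 0.33333333),
  conf1 = c(-0.18826132, 0.01334534, -0.31998799),
  conf2 = c(1.68826132, 1.31998799, 0.98665466)
)
df <- data.frame(id = 1:3)  # temporary column, only used to fix the row count
df$B <- m
df <- df["B"]

# Option 1: index the matrix column directly.
means <- df$B[, "mean"]

# Option 2: spread the matrix into ordinary columns.
flat <- do.call(data.frame, df)
names(flat)
```

After flattening, flat$B.mean, flat$B.conf1 and flat$B.conf2 are regular numeric columns that plotting functions can use.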

Accessing elements in a list of data frames

I have managed to subset and lapply a list of data frames as follows:
subsetDeathHA <-
na.omit(subset(outcome,select = c("Hospital Name", "mortailityRate", "State"), ))
orderSubsetDeathHA <-
subsetDeathHA[order(subsetDeathHA$"mr" , subsetDeathHA$"Hospital Name", subsetDeathHA$'State' ),]
splitOrderSubsetDeahtHA <-
split(orderSubsetDeathHA, orderSubsetDeathHA$'State')
aa<- lapply(splitOrderSubsetDeahtHA, function(x) { x[num,] })
num is the ranking number on a per State basis.
Using str(aa) shows this object is a list of (54) data.frames, where each data.frame is one object of 3 variables as follows:
List of 54
$ AK:'data.frame': 1 obs. of 3 variables:
..$ Hospital Name : chr NA
..$ mortalityRate : num NA
..$ State : chr NA
..- attr(*, "na.action")=Class 'omit' Named int [1:1986] 4 5 6 10 13 17 19 23 27 28 ...
.. .. ..- attr(*, "names")= chr [1:1986] "4" "5" "6" "10" ...
$ AL:'data.frame': 1 obs. of 3 variables:
..$ Hospital Name : chr "D C H REGIONAL MEDICAL CENTER"
..$ mortalityRate : num 15.8
..$ State : chr "AL"
..- attr(*, "na.action")=Class 'omit' Named int [1:1986] 4 5 6 10 13 17 19 23 27 28 ...
.. .. ..- attr(*, "names")= chr [1:1986] "4" "5" "6" "10" ...
What I can't seem to do is the following
1) Subset out the Hospital Name and the State by removing the mortalityRate variable and return a list of the resulting 54 objects/data frames.
2) Place row.names =F appropriately to suppress the indexing that R provides.
3) Even though I thought I had 'na'd out' the NA values in the first sub-setting operation,
when I print(aa), what follows is a sample of the output.
$AK
Hospital Name mr State
NA NA <NA> NA <NA>
$AL
Hospital Name mr State
56 D C H REGIONAL MEDICAL CENTER 15.8 AL
etc...
Any help/suggestions appreciated
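No answer is included here, so this is only a sketch of one way to approach the three points, using base R and a two-element stand-in for aa built from the str() output above. On point 3, a possible explanation is that na.omit() did its job, but x[num, ] returns a row of NAs whenever num is larger than the number of hospitals in that state, so the NA rows are created after the cleaning step.

```r
# Stand-in for the list in the question (two states instead of 54).
aa <- list(
  AK = data.frame(`Hospital Name` = NA_character_, mortalityRate = NA_real_,
                  State = NA_character_, check.names = FALSE),
  AL = data.frame(`Hospital Name` = "D C H REGIONAL MEDICAL CENTER",
                  mortalityRate = 15.8, State = "AL", check.names = FALSE)
)

# 1) Keep only the hospital name and the state in every data frame.
bb <- lapply(aa, function(x) x[, c("Hospital Name", "State"), drop = FALSE])

# 3) Drop the states whose selected row consists entirely of NA.
bb <- Filter(function(x) !all(is.na(x)), bb)

# 2) row.names = FALSE belongs to print(), not to the subsetting step.
for (state in names(bb)) {
  print(bb[[state]], row.names = FALSE)
}
```

If the NA entries should stay in the list as placeholders, the Filter() step can simply be left out.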
