This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss) what this question is part of is attempting to get the benefits of SPSS style labeling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.
Whilst CSV does a reasonable job of storing data it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard code the meta data in. But for larger projects it makes more sense to store this meta data in a second csv file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
SO the common elements are the column names which are the key to both files
The first column of the metadata file describes the role of the row for the data file
so
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
missing describes how to treat missing data
varlabel describes the label for a factor level starting at one on up to as many labels as there are.
Right! Here's the code that works:
```#Libraries
library(expss)
library(data.table)
library(magrittr)```
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt datatable simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery
this code works just fine on a single columns when I don't use a for loop variable
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't make vector as you expect. Contrary to usual data.frame data.table doesn't drop dimensions when you provide single column in the index. And as.vector do nothing with data.frames/data.tables. So tlabels remains data.table. This line need to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"
I use data from the NHANES periodontal dataset (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm) and after cleaning it to only keep the "pc" variables, I have a df=setPD 168 columns that include 6 measurements (pcd, pcm, pcs, pcp, pcl, pca) around 28 teeth numbered from #02 to #31
#names(setPD)
[1] "ohx02pcd" "ohx02pcm" "ohx02pcs" "ohx02pcp" "ohx02pcl" "ohx02pca" "ohx03pcd" "ohx03pcm" "ohx03pcs" "ohx03pcp" "ohx03pcl" "ohx03pca"
[13] "ohx04pcd" "ohx04pcm" "ohx04pcs" "ohx04pcp" "ohx04pcl" "ohx04pca" "ohx05pcd" "ohx05pcm" "ohx05pcs" "ohx05pcp" "ohx05pcl" "ohx05pca"
[25] "ohx06pcd" "ohx06pcm" "ohx06pcs" "ohx06pcp" "ohx06pcl" "ohx06pca" "ohx07pcd" "ohx07pcm" "ohx07pcs" "ohx07pcp" "ohx07pcl" "ohx07pca"
[37] "ohx08pcd" "ohx08pcm" "ohx08pcs" "ohx08pcp" "ohx08pcl" "ohx08pca" "ohx09pcd" "ohx09pcm" "ohx09pcs" "ohx09pcp" "ohx09pcl" "ohx09pca"
[49] "ohx10pcd" "ohx10pcm" "ohx10pcs" "ohx10pcp" "ohx10pcl" "ohx10pca" "ohx11pcd" "ohx11pcm" "ohx11pcs" "ohx11pcp" "ohx11pcl" "ohx11pca"
[61] "ohx12pcd" "ohx12pcm" "ohx12pcs" "ohx12pcp" "ohx12pcl" "ohx12pca" "ohx13pcd" "ohx13pcm" "ohx13pcs" "ohx13pcp" "ohx13pcl" "ohx13pca"
[73] "ohx14pcd" "ohx14pcm" "ohx14pcs" "ohx14pcp" "ohx14pcl" "ohx14pca" "ohx15pcd" "ohx15pcm" "ohx15pcs" "ohx15pcp" "ohx15pcl" "ohx15pca"
[85] "ohx18pcd" "ohx18pcm" "ohx18pcs" "ohx18pcp" "ohx18pcl" "ohx18pca" "ohx19pcd" "ohx19pcm" "ohx19pcs" "ohx19pcp" "ohx19pcl" "ohx19pca"
[97] "ohx20pcd" "ohx20pcm" "ohx20pcs" "ohx20pcp" "ohx20pcl" "ohx20pca" "ohx21pcd" "ohx21pcm" "ohx21pcs" "ohx21pcp" "ohx21pcl" "ohx21pca"
[109] "ohx22pcd" "ohx22pcm" "ohx22pcs" "ohx22pcp" "ohx22pcl" "ohx22pca" "ohx23pcd" "ohx23pcm" "ohx23pcs" "ohx23pcp" "ohx23pcl" "ohx23pca"
[121] "ohx24pcd" "ohx24pcm" "ohx24pcs" "ohx24pcp" "ohx24pcl" "ohx24pca" "ohx25pcd" "ohx25pcm" "ohx25pcs" "ohx25pcp" "ohx25pcl" "ohx25pca"
[133] "ohx26pcd" "ohx26pcm" "ohx26pcs" "ohx26pcp" "ohx26pcl" "ohx26pca" "ohx27pcd" "ohx27pcm" "ohx27pcs" "ohx27pcp" "ohx27pcl" "ohx27pca"
[145] "ohx28pcd" "ohx28pcm" "ohx28pcs" "ohx28pcp" "ohx28pcl" "ohx28pca" "ohx29pcd" "ohx29pcm" "ohx29pcs" "ohx29pcp" "ohx29pcl" "ohx29pca"
[157] "ohx30pcd" "ohx30pcm" "ohx30pcs" "ohx30pcp" "ohx30pcl" "ohx30pca" "ohx31pcd" "ohx31pcm" "ohx31pcs" "ohx31pcp" "ohx31pcl" "ohx31pca"
I am trying to apply a conditional selection in each group of six columns. This is:
transmute(setPD,PD02 = ifelse(setPD$ohx02pcd >5 |
setPD$ohx02pcm>5 |setPD$ohx02pcs >5|
setPD$ohx02pcp >5 | setPD$ohx02pcl >5 |
setPD$ohx02pca >5, 1, 0))
Then for the next tooth (03) I have to write again:
transmute(setPD,PD03 = ifelse(setPD$ohx03pcd >5 |
setPD$ohx03pcm>5|setPD$ohx03pcs >5|
setPD$ohx03pcp >5|setPD$ohx03pcl >5|
setPD$ohx03pca >5, 1, 0))
I tried to firstly do that conditional selection in a more efficient way, something like:
transmute(setPD,PD02 = ifelse(list(setPD$ohx02pcd:setPD$ohx02pcp) >5, 1, 0))
but it does not work.
Then I am looking for a way to write a loop that does that over each tooth without needing to write this 28 times!!
I thought of applying the select function of dplyr in a for loop but I don't know how to do that.
At the end I want to get all the new columns I made with transmute and say that if at least 2 of the 28 columns are 1, then I have disease, if <2 are 1 then I have health. ANy help would be appreciated.
**Note: If you want to get the dataset, it is open access from CDC.org:
https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm **
First, it is useful to point out that the logical statements of the form is A true OR is B true OR is C true are equivalent to asking is ANY of A,B,C true? We can use this to simplify the statements setPD$ohx02pcd >5 | setPD$ohx02pcm>5 |setPD$ohx02pcs >5| ... to ask if for any of these columns it is true that their value is larger than 5.
For example, let us focus on tooth number 02 first. To get all columns that concern this tooth, we can use grep to get a vector of column names. This can be achieved with
current_tooth <- grep("02", names(setPD), value = T)
Note that if there are any other columns in the data that contain the string 02, these columns will also show up. This does not appear to be the case in your data, but it is worthwhile pointing out here in case someone else uses it and this applies in other datasets.
Now, we can use these names to subset the dataframe. For instance,
setPD[,current_tooth]
will give you the corresponding columns. In each row, we want to check if any of the above mentioned conditions are true. Given a vector of logical statements, we can check if any of them is true with the function any. To go through a dataframe by row and apply a function, we can use apply, such as in
setPD$PD02 <-
apply(setPD[,grep("02", names(setPD), value = T)], 1, function(x) any(x>5))
Now, the above applies to one tooth only, namely 02. One way of doing it for all teeth is to create a vector with all tooth indicators and use this to loop over the above lines, replacing the "02" in the above grep call in each iteration and using assign or something similar to get the variable name right. A more elegant and more efficient way is to use the same principle on long data. Consider the following:
library(reshape2)
library(dplyr)
m <- melt(setPD, id.vars="SEQN")
m$num <- substr(m$variable, 4,5) # be careful here and check output!
m <- m %>% group_by(num) %>% mutate(PS = any(value>5))
m$num <- paste0("PS", m$num)
md <- dcast(m, SEQN ~ num, value.var = "PS")
setPD <- merge(setPD, md, by="SEQN")
This melts your data first and creates a variable num that indicates your tooth. Again, make sure that this works. I have used the fact that in your data, the tooth number all appear in the 4th and 5th place in the character string. Make sure this is true, and adjust the code otherwise. Then I create a variable PS which indicates whether any of the columns that contain the tooth identifer has a value larger than 5. Last but not least I recast the data so that you have the values of PD02, PD03, etc in columns again, before I merge this to the old dataset. The line with paste0 merely creates the variable names that you want to have.
I have created a dataset using WHO ATC/DDD Index a few months before and I want to make sure if the database online remains unchanged today, so I downloaded it again and try to use the digest package in R to do the comparison.
The two dataset (in txt format) can be downloaded here. (I am aware that you may think the files are unsafe and may have virus, but I don't know how to generate a dummy dataset to replicate the issue I have now, so I upload the dataset finally)
And I have written a little script as below:
library(digest)
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest),stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest),stringsAsFactors=FALSE)
ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]
And something really interesting happens when I do the checking:
> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01
TRUE
506
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02
TRUE
506
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE
line01 and line02 shows that every rows in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash column for both dataframe are the same
line04 shows that the two dataframe are different
What happen? Both dataframe with the identical rows (from line01 and line02), same order (from line03), but are different? (from line04)
Or do I have any misunderstanding about digest? Thanks.
Read in data as before.
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
Like Marek said, start by checking for differences with all.equal.
all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches"
[2] "Component 8: 24 string mismatches"
So we just need to look at columns 6 and 8.
different.old <- ddd.old[, c(6, 8)]
different.new <- ddd.new[, c(6, 8)]
Hash these columns.
hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)
And find the rows where they don't match.
different_rows <- which(hash.old != hash.new) #which is optional
Finally, combine the datasets.
cbind(different.old[different_rows, ], different.new[different_rows, ])