I have a data file that represents a contingency table that I need to work with. The problem is I can't figure out how to load it properly.
Data structure:
Rows: individual churches
1st Column: Name of the church
2nd - 12th column: Mean age of followers
Every cell: the number of people who follow the corresponding church and fall in the corresponding age group.
Note: in the original data set only the age range was available (e.g. 60-69), so to make it computable I replaced each range with its midpoint (e.g. 64.5 instead of 60-69).
Data sample:
name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000
...
I tried to simply load the data and turn it into a 'table' so I could expand it, but it didn't work (it only produced something really weird).
dataset <- read.table("C:/.../dataset.csv", sep=";", quote="\"")
dataset_table <- as.table(as.matrix(dataset))
When I tried to use the data as they were to produce a simple graph, it didn't work either.
barplot(dataset[2,2:4])
Error in barplot.default(dataset[2,2:4]) : 'height' must be a vector or a matrix
Checking the class of dataset[2,2:4] showed me that it is a 'list', which I don't understand (I guess it is because dataset is a data.frame and not a table).
If someone could point me into the right direction how to properly load the data as a table and then work with them, I'd be forever grateful :).
If your file is already a contingency table, don't use as.table().
df <- read.table(header=T,sep=";",text="name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000")
colnames(df)[-1] <- substring(colnames(df)[-1],2)
barplot(as.matrix(df[2,2:4]), col="lightblue")
The colnames(...) transformation is needed because R doesn't allow column names that start with a number, so read.table() prepends an X to make them syntactically valid. This code just strips that X off again.
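An alternative, in case you'd rather keep the numeric headers from the start: read.table() has a check.names argument that disables the renaming entirely (a small sketch):

```r
# check.names = FALSE tells read.table() not to run make.names() on the
# header, so the numeric column names survive without the "X" prefix.
df <- read.table(header = TRUE, sep = ";", check.names = FALSE,
                 text = "name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000")
colnames(df)
# [1] "name" "7"    "15"   "25"
```

Note that such names always need backticks or quotes when used directly, e.g. df$`7`.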
EDIT (Response to OP's comment)
If you want to convert the df defined above to a table suitable for use with expand.table(...) you have to set dimnames(...) and names(dimnames(...)) as described in the documentation for expand.table(...).
tab <- as.matrix(df[-1])
dimnames(tab) <- list(df$name,colnames(df)[-1])
names(dimnames(tab)) <- c("name","age")
library(epitools)
x.tab <- expand.table(tab)
str(x.tab)
# 'data.frame': 80000 obs. of 2 variables:
# $ name: Factor w/ 2 levels "catholic","hinduism": 1 1 1 1 1 1 1 1 1 1 ...
# $ age : Factor w/ 3 levels "7","15","25": 1 1 1 1 1 1 1 1 1 1 ...
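If epitools isn't installed, the same expansion can be sketched in base R by repeating each (name, age) combination according to its cell count (the tab matrix below is the one built above):

```r
# rebuild the counts matrix from the question
tab <- matrix(c(25000, 30000, 15000,
                 5000,  2000,  3000),
              nrow = 2, byrow = TRUE,
              dimnames = list(name = c("catholic", "hinduism"),
                              age  = c("7", "15", "25")))

# long format with one row per cell: columns name, age, Freq
long <- as.data.frame(as.table(tab))

# repeat each row Freq times, then drop the Freq column
x.tab <- long[rep(seq_len(nrow(long)), long$Freq), c("name", "age")]
nrow(x.tab)
# [1] 80000
```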
Related
I am trying to enter a date of birth value on a dataset of senators. I rarely touch for loops so I may be doing this wrong but this is what I have so far:
ID <- c("A000055","B001303","M001201")
D.O.B. <- c("1965-07-22",NA,"1951-11-07")
leg_complete <- data.frame("name","D.O.B.")
for(id in leg_complete)
if(ID=="M001201") {
D.O.B. <- "1956-11-14"
} else {
break
}
Whenever I run the code and open the dataset, I get "No data available in the table." Is the for loop even the best move for entering a single value, or should I use a different function?
Welcome to SO, @fdob!
I think you have a couple of issues in the code you posted above.
First, I think you meant to use ID instead of name in constructing your dataframe?
Second, you don't need the quotes around each variable either?
So I think you were trying to do this instead?
leg_complete <- data.frame(ID,D.O.B.)
If that is a correct assumption, then if you wanted to reassign a value in the dataframe, you can do something as simple as the below
leg_complete$D.O.B.[leg_complete$ID == "M001201"] <- "1956-11-14"
which gives the following:
> str(leg_complete)
'data.frame': 3 obs. of 2 variables:
$ ID : Factor w/ 3 levels "A000055","B001303",..: 1 2 3
$ D.O.B.: Factor w/ 2 levels "1951-11-07","1965-07-22": 2 NA 1
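One caveat the str() output above hints at: D.O.B. was read in as a factor (older R versions defaulted to stringsAsFactors = TRUE), and assigning a date string that isn't an existing level to a factor produces NA plus a warning. Building the frame with character columns side-steps that (a sketch using the vectors from the question):

```r
ID     <- c("A000055", "B001303", "M001201")
D.O.B. <- c("1965-07-22", NA, "1951-11-07")

# stringsAsFactors = FALSE keeps D.O.B. as plain character,
# so any new date string can be assigned later without NA surprises
leg_complete <- data.frame(ID, D.O.B., stringsAsFactors = FALSE)

leg_complete$D.O.B.[leg_complete$ID == "M001201"] <- "1956-11-14"
leg_complete$D.O.B.
# [1] "1965-07-22" NA           "1956-11-14"
```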
I'm trying to get the rows grouped according to the values in the "Type of region" column into lists, and put these lists into another data structure (vector or list).
The data looks like this (~700 000 lines):
chr CS CE CloneName score strand # locs per clone # capReg alignments Type of region
chr1 10027684 10028042 clone_11546 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027799 10028157 clone_11547 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027823 10028181 clone_11548 1 - 1 1 chr1_10027880_10028380_DNaseI
chr1 10027841 10028199 clone_11549 1 + 1 1 chr1_10027880_10028380_DNaseI
Here's what I tried to do:
typeReg=dat[!duplicated(dat$`Type of region`),]
for(i in 1:nrow(typeReg)){
  res[[i]] = dat[dat$`Type of region` == typeReg[i,]$`Type of region`,]
}
The for loop took too much time, so I tried using an apply:
res = apply(typeReg, 1, function(x){
  tmp = dat[dat$`Type of region` == x[9],]
})
But that is also slow (there are 300,000 unique values in the Type of region column).
Do you have a solution to my problem, or is it normal that it takes this long?
You can use split():
type <- as.factor(dat$`Type of region`)
split(dat, type)
But, as stated in the comments, using dplyr::group_by() may be a better option depending on what you want to do later.
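For completeness, here is a sketch of the dplyr route mentioned above; whether it beats split() depends on what you do with each group afterwards. The toy data stands in for the ~700,000-line file:

```r
library(dplyr)

# toy stand-in for the real data
dat <- data.frame(
  CloneName = c("clone_11546", "clone_11547", "clone_11548"),
  score     = c(1, 1, 1),
  `Type of region` = c("chr1_10027880_10028380_DNaseI",
                       "chr1_10027880_10028380_DNaseI",
                       "chr1_20000000_20000500_DNaseI"),
  check.names = FALSE
)

# one summary row per region instead of 300,000 separate data frames
dat %>%
  group_by(`Type of region`) %>%
  summarise(n_clones = n())
```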
Ok, so split() works, but the subsetting doesn't drop the levels of the factors I have in my df. For every list element split() created, it carried along all 300,000 levels from the original df, hence the huge size of the result. The possible solutions are: call droplevels() on every list element (not optimal if one element is too big to fit in RAM), use a for loop (really slow), or remove the columns that cause the problem, which is what I did.
res=split(dat[,c(-4,-9)], dat$`Type of region`, drop=TRUE)
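Another option that keeps all columns: convert the factor columns to character before splitting, so no level table gets copied into every piece (a sketch on toy data):

```r
# toy data: two factor columns, one of which drives the split
dat <- data.frame(
  CloneName = factor(c("clone_1", "clone_2", "clone_3")),
  region    = factor(c("regA", "regA", "regB"))
)

# character vectors carry no level table, unlike factors,
# so each split piece stays small
dat[] <- lapply(dat, function(col) if (is.factor(col)) as.character(col) else col)

res <- split(dat, dat$region)
names(res)
# [1] "regA" "regB"
```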
I have searched a couple of options, generally trying out various combinations of cbind to accomplish this. Essentially I would like to combine several different pivot tables into one data frame in order to export to csv/Excel. Is there a better way to accomplish this?
EDIT: Essentially I am trying to learn the basics of creating a function that can wrap around several different pivot tables to produce a data frame ready for export, which will serve as a template for ad hoc reporting. The problem I am having is that the cbind product takes object b, which on its own is a table with the dates as columns, and forces it into a long table where the dates are transposed into rows.
dataframe:
State FacilityName Date
NY Loew June 2014
NY Loew June 2014
CA Sunrise May 2014
CA May 2014
code:
volume <- function() {
  df$missing = ifelse(is.na(df$FacilityName), "Missing", df$FacilityName)
  df = subset(df, df$missing == "Missing")
  x <- function(){
    a <- as.data.frame(table(df$FacilityName))
    b <- table(df$FacilityName, df$date)
    cbind(a, b[,1], b[2])
  }
}
When you give a factor to the table function, it uses the levels of the factor to build the table. So there's a nice way to obtain what you want by adding "Missing" to the levels of "FacilityName".
# loading data
ec <- read.csv(text=
'State,FacilityName,Date
NY,Loew,June 2014
NY,Loew,June 2014
CA,Sunrise,May 2014
CA,NA,May 2014')
# Adding Missing to the possible levels of FacilityName
# note that we add it in front
new.levels <- c("Missing", levels(ec$FacilityName))
ec$FacilityName <- factor(ec$FacilityName, levels=new.levels)
# And replacing NAs by the new level "Missing"
ec$FacilityName[is.na(ec$FacilityName)] <- "Missing"
# the previous line would not have worked
# if we had not added "Missing" explicitly to the levels
# table() uses the levels to generate the table
# the levels are displayed in order
# now there's a level "Missing" in first position
t <- table(ec$FacilityName, ec$Date)
You get:
> t
June 2014 May 2014
Missing 0 1
Loew 2 0
Sunrise 0 1
You can add the total line like this (I don't think your code with nrow does what you say it does):
# adding total line
rbind(t, TOTAL=colSums(as.matrix(t)))
June 2014 May 2014
Missing 0 1
Loew 2 0
Sunrise 0 1
TOTAL 2 2
At this point you have a matrix so you may want to pass it to as.data.frame.
This can be easily implemented into a separate function if you want to. No need to bind several tables after all :)
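A sketch of what that separate function could look like (the function name missing_by_date is made up; it assumes FacilityName is read in as a factor, as in the code above):

```r
missing_by_date <- function(df) {
  # add "Missing" in front of the existing levels
  df$FacilityName <- factor(df$FacilityName,
                            levels = c("Missing", levels(df$FacilityName)))
  df$FacilityName[is.na(df$FacilityName)] <- "Missing"

  # cross-tabulate and append a TOTAL row
  t <- table(df$FacilityName, df$Date)
  as.data.frame.matrix(rbind(t, TOTAL = colSums(as.matrix(t))))
}
```

Then missing_by_date(ec) returns the finished data frame, ready for write.csv().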
Ok, so it seems I was trying to be cool and use a function to wrap everything in the hopes that it would be the beginning of learning to write flexible code. But, I did it the long way and ended up getting the result I wanted. While I will post the code that worked below, I am very interested in someone pointing me towards a better way to approach these kinds of problems, in order to learn better coding.
# Label the empty cells as Missing
ec$missing = ifelse(is.na(ec$FacilityName), "Missing", ec$FacilityName)
# Subset the dataframe to just missing values
df = subset(ec, ec$missing == "Missing")
# Create table that is a row of frequency by month for missing values
a <- table(df$missing, df$date)
# Reload dataframe to exclude Missing values
df = subset(ec, ec$missing != "Missing")
# Create table that shows frequency of observations for each facility by Month
b <- table(df$FacilityName, df$date)
# Create a Total row that can go at the bottom of the final data frame
Total <- nrow(ec)
# Bind all three objects
rbind(a,b,Total)
Here is an example of the final product I was looking for:
May2014 June2014
Missing 2 0
Sunrise 0 0
Loew 1 2
Total 3 2
The dataset named data has both categorical and continuous variables. I would like to delete the categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but the categorical variables are still in data.1. Where is the problem?
The summary of "head(data)" is
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like as I wrote above)
All variables are defined as "Factor".
What are you trying to do with that code? First of all, colnames(data) is not a list, so using [[]] doesn't make sense. Second, the only thing you test is whether the third column name is not equal to zero. As a column name can never start with a number, that's pretty much always true. So your code translates to:
data1 <- data[,TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing this is to define your own function is.binomial(), like this:
is.binomial <- function(x, na.action=c('na.omit','na.fail','na.pass')) {
  FUN <- match.fun(match.arg(na.action))
  length(unique(FUN(x)))==2
}
in case you want to take care of NAs. You can then apply this to your dataframe:
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.
@Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defined in the dataframe as factors?
If so you can use:
data.1 <- data[, !sapply(data, is.factor)]
The test you perform now is whether column name number 3 is not equal to 0.
I don't think that's what you meant.
Another approach is
data.1 <- data[,-3L]
which works only if column 3 is the only column with categorical variables.
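For what it's worth, a compact base-R one-liner with the same effect as the sapply() approach, sketched on toy data:

```r
# toy data: one factor (categorical) column among numeric ones
data <- data.frame(id     = 1:4,
                   age    = c(45, 32, 54, 23),
                   status = factor(c(0, 1, 0, 0)))

# Filter() keeps only the columns for which the predicate is TRUE,
# so Negate(is.factor) drops every factor column
data.1 <- Filter(Negate(is.factor), data)
names(data.1)
# [1] "id"  "age"
```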
I think you're getting there, with your last comment to @Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the "undefined columns" error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define @Joris Meys's function:
is.binomial <- function(x, na.action=c('na.omit','na.fail','na.pass')) {
  FUN <- match.fun(match.arg(na.action))
  length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no", "yes", or "healthy", "sick" rather than 0, 1); on the other hand the data take up slightly more space if stored as a text file, and, more important, it becomes harder to incorporate other meta-data such as units in the file along with the data ...
I am having rather lengthy problems with my data set, and I believe my trouble traces back to importing the data. I have looked at many other questions and answers, as well as as many help sites as I can find, but I can't seem to make anything work. I am attempting to run some t-tests on my data and have thus far been unable to do so. I believe the root cause is that the data is imported as class NULL. I've tried to include as much information here as I can to show what I am working with and the types of issues I am having (in case the issue is in some other area).
An overview of my data and what I've been doing so far:
Example File data (as displayed in R after reading data from .csv file):
Part Q001 Q002 LA003 Q004 SA005 D106
1 5 3 text 99 text 3
2 3 text 2 text 2
3 2 4 3 text 5
4 99 5 text 2 2
5 4 2 1 text 3
So in my data, the "answers" are 1 through 5; 99 represents a question answered N/A; blanks represent unanswered questions; the 'text' entries are long and short answers/comments from a survey. All of them are stored in a large data set of over 150 participants (Part) and over 300 questions (labeled Q, LA, SA, or D for questions with a 1-5 answer, long answer, short answer, or demographic (also numeric answers, 0 through 6 or so)).
When I import the data, I need to have it disregard any blank or 99 answers so they do not interfere with statistics. I also don't care about the comments, so I filter all of them out.
EDIT: data file looks like:
Part,Q001,Q002,LA003,Q004,SA005,D006
1,5,3,text,99,text,3
2,3,,text,2,text,2
etc...
I am using the following lines to read the data:
data.all <- read.table("data.csv", header=TRUE, sep=",", na.strings = c("","99"))
data <- data.all[, !(colnames(data.all) %in% c("LA003", "SA005")
now, when I type
class(data$Q001)
I get NULL
I need these to be numeric. I can use summary(data) to get the means and such, but when I try to run t-tests, I get errors including NULL.
I tried to turn this column into numerics by using
data<-sapply(data,as.numeric)
and I tried
data[,1]<-as.numeric(as.character(data[,1]))
(and with 2 instead of 1; I don't really understand the sapply syntax, I saw it in several other answers and was trying to make it work)
when I then type
class(data$Q001)
I get "Error: $ operator is invalid for atomic vectors"
If I do not try to use sapply, and I try to run a ttest, I've created subsets such as
data.2<-subset(data, D106 == "2")
data.3<-subset(data, D106 == "3")
and I use
t.test(data.2$Q001~data.3$Q001, na.rm=TRUE)
and I get "invalid type (NULL) for variable 'data.2$Q001'
I tried using the different syntax, trying to see if I can get anything to work, and
t.test(data.2$Q001, data.3$Q001, na.rm=TRUE)
gives "In is.na(d) : is.na() applied to non-(list or vector) of type 'NULL'" and "In mean.default(x) : argument is not numeric or logical: returning NA"
So, now that I think I've been clear about what I'm trying to do and some of the things I've tried...
How can I import my data so that numbers (specifically any number in a column with a header starting with Q) are accurately read as numbers and do not get a NULL class applied to them? What do I need to do in order to get my data properly imported to run TTests on it? I've used TTests on plenty of data in the past, but it has always been data I recorded manually in excel (and thus had only one column of numbers with no blanks or NAs) and I've never had an issue, and I just do not understand what it is about this data set that I can't get it to work. Any assistance in the right direction is much appreciated!
This works for me:
> z <- read.table(textConnection("Part,Q001,Q002,LA003,Q004,SA005,D006
+ 1,5,3,text,99,text,3
+ 2,3,,text,2,text,2
+ "),header=TRUE,sep=",",na.strings=c("","99"))
> str(z)
'data.frame': 2 obs. of 7 variables:
$ Part : int 1 2
$ Q001 : int 5 3
$ Q002 : int 3 NA
$ LA003: Factor w/ 1 level "text": 1 1
$ Q004 : int NA 2
$ SA005: Factor w/ 1 level "text": 1 1
$ D006 : int 3 2
> z2 <- z[,!(colnames(z) %in% c("LA003","SA005"))]
> str(z2)
'data.frame': 2 obs. of 5 variables:
$ Part: int 1 2
$ Q001: int 5 3
$ Q002: int 3 NA
$ Q004: int NA 2
$ D006: int 3 2
> z2$Q001
[1] 5 3
> class(z2$Q001)
[1] "integer"
The only explanation I can think of is that your second command (which was missing some terminating parentheses and brackets) didn't work at all, you missed seeing the error message, and you are referring to some previously defined data object that doesn't have the same columns defined. For example, class(z$QQQ) is NULL following the above example.
edit: it appears that the original problem was some weird/garbage characters in the header that messed up the name of the first column. Manually renaming the column (names(data)[1] <- "Q001") seems to have fixed the problem.
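A frequent cause of such garbage characters is a UTF-8 byte-order mark (BOM) at the start of the file, for example when the CSV was saved from Excel. read.csv() can strip it at read time via fileEncoding = "UTF-8-BOM" (a sketch; the demo writes a small BOM-prefixed file to a temp path, and assumes a UTF-8 locale):

```r
# write a tiny CSV whose first character is a UTF-8 BOM
f <- tempfile(fileext = ".csv")
writeLines("\ufeffQ001,Q002\n5,3", f)

# "UTF-8-BOM" tells R to drop the BOM, so the first
# column name comes through clean instead of mangled
z <- read.csv(f, fileEncoding = "UTF-8-BOM")
names(z)[1]
# [1] "Q001"
```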