Why does the table function find a variable that was deleted [duplicate] - r

Why does the table function find a variable that was deleted?
Dog <- c("Rover", "Spot")
Cat <- c("Scratch", "Fluffy")
Pets <- data.frame(Dog, Cat) #create a data frame with two variables
names(Pets)
# [1] "Dog" "Cat"
#rename Dog to a longer name
names(Pets)[names(Pets)=="Dog"] <- "Dog_as_very_long_name"
Pets$Dog <- NULL # delete Dog
names(Pets)
#[1] "Dog_as_very_long_name" "Cat" #the variable dog is not in the data set anymore
table(Pets$Dog) # why does table() still find a variable that was deleted?
# Rover Spot
# 1 1

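This happens because $ partially matches names. No column is named exactly Dog any more, so Pets$Dog falls back to the only column whose name starts with "Dog", which is Dog_as_very_long_name.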
Try this:
> table(Pets$Ca)
Fluffy Scratch
1 1
The [[ notation matches names exactly by default, which gives you more control.
> table(Pets[["Ca"]])
< table of extent 0 >
> table(Pets[["Ca", exact = FALSE]])
Fluffy Scratch
1 1
You can use the options settings to give a warning when partial matches are used. Consider:
> options(warnPartialMatchDollar = TRUE)
> table(Pets$Ca)
Fluffy Scratch
1 1
Warning message:
In Pets$Ca : partial match of 'Ca' to 'Cat'
> table(Pets$Dog)
Rover Spot
1 1
Warning message:
In Pets$Dog : partial match of 'Dog' to 'Dog_as_very_long_name'
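As a small sketch of the behaviour described above (the data frame is rebuilt here so the snippet runs on its own), the three ways of asking for "Dog" can be compared side by side. Only base R is used; the variable names has_dog, exact and partial are illustrative.

```r
# Rebuild the renamed data frame from the question.
Pets <- data.frame(Dog_as_very_long_name = c("Rover", "Spot"),
                   Cat = c("Scratch", "Fluffy"))

# Exact membership test on the column names: no partial matching involved.
has_dog <- "Dog" %in% names(Pets)

# [[ matches exactly by default, so a missing column gives NULL.
exact <- Pets[["Dog"]]

# $ falls back to partial matching and finds Dog_as_very_long_name.
partial <- Pets$Dog

print(has_dog)
print(is.null(exact))
print(as.character(partial))
```

Testing with %in% or [[ before using a column is one way to avoid being surprised by a partial match.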

Related

Extracting single row from data.frame without loss of names [duplicate]

I am simply extracting a single row from a data.frame. Consider for example
d=data.frame(a=1:3,b=1:3)
d[1,] # returns a data.frame
# a b
# 1 1 1
The output matched my expectation. The result was not as I expected though when dealing with a data.frame that contains a single column.
d=data.frame(a=1:3)
d[1,] # returns an integer
# [1] 1
Here the extracted data is no longer a data.frame but an integer! It seems a little strange to me that the same function applied to the same data type returns different data types. One of the issues with this conversion is the loss of the column name.
To solve the issue, I did
extractRow = function(d,index)
{
if (ncol(d) > 1)
{
return(d[index,])
} else
{
d2 = as.data.frame(d[index,])
names(d2) = names(d)
return(d2)
}
}
d=data.frame(a=1:3,b=1:3)
extractRow(d,1)
# a b
# 1 1 1
d=data.frame(a=1:3)
extractRow(d,1)
# a
# 1 1
But it seems unnecessarily cumbersome. Is there a better solution?
Just subset with the drop = FALSE option:
extractRow = function(d, index) {
return(d[index, , drop=FALSE])
}
By default R simplifies a data.frame subset to a vector when the result has a single column. The same thing happens when selecting columns:
d[, "a"]
# [1] 1 2 3
Alternatives are:
d[1, , drop = FALSE]
tibble::tibble which has drop = FALSE by default
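A minimal sketch of the drop = FALSE behaviour discussed above, using the single-column data frame from the question. The variable names are illustrative.

```r
d <- data.frame(a = 1:3)

# Default subsetting simplifies a single-column result to a plain vector.
row_default <- d[1, ]

# drop = FALSE keeps the data.frame structure and the column name.
row_kept <- d[1, , drop = FALSE]
col_kept <- d[, "a", drop = FALSE]

print(class(row_default))
print(class(row_kept))
print(names(row_kept))
print(dim(col_kept))
```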
I can't tell you why that happens; it seems weird. One workaround would be to use slice from dplyr (although using a library seems unnecessary for such a simple task).
library(dplyr)
slice(d, 1)
a
1 1
data.frames will simplify to vectors or scalars with base subsetting [,].
If you want to avoid that, you can use tibbles instead:
> tibble(a=1:2)[1,]
# A tibble: 1 x 1
a
<int>
1 1
tibble(a=1:2)[1,] %>% class
[1] "tbl_df" "tbl" "data.frame"

Count occurrences of words in a string according to a category in R

I need to search through a text string for keywords and then assign a category in an R dataframe. This creates a problem where I have keywords from more than one category. I would like to easily extract rows where more than one category is represented so that I can manually evaluate them and assign the correct category.
To do this, I have tried to add a count column to show how many categories are represented in each string.
Using a combination of the two solutions linked below, I have managed to get part of the way, but I am still not getting the correct output
Partial animal string matching in R
Count occurrences of specific words from a dataframe row in R
I have created an example below. I would like the following rules to be applied:
if string has cat or lion wcount gets 1 - only 1 group represented (feline)
if string has dog or wolf wcount gets 1 - only 1 group represented (canine)
if string has (cat or lion) AND (dog or wolf) wcount gets 2 - two groups represented (feline and canine)
I can then easily pull out rows where wcount > 1
id <- c(1:5)
text <- c('saw a cat',
'found a dog',
'saw a cat by a dog',
'There was a lion',
'Huge wolf'
)
dataset <- data.frame(id,text)
SearchGrp<-list(c("(cat|lion)", "feline"),
c("(dog|wolf)","canine"))
output_vector<- character (nrow(dataset))
for (i in seq_along(SearchGrp)){
output_vector[grepl(x=dataset$text, pattern = SearchGrp[[i]][1],ignore.case = TRUE)]<-SearchGrp[[i]][2]}
dataset$type<-output_vector
keyword_temp <- unlist(lapply(SearchGrp, function(x) new<-{x[1]}))
keyword<-paste(keyword_temp[1],"|",keyword_temp[2])
library(stringr)
getCount <- function(data,keyword)
{
wcount <- str_count(dataset$text, keyword)
return(data.frame(data,wcount))
}
getCount(dataset,keyword)
Here is a base R method to get the count across types.
dataset$wcnt <- rowSums(sapply(c("dog|wolf", "cat|lion"),
function(x) grepl(x, dataset$text)))
Here, sapply runs through the regular expression for each type and feeds each one to grepl. This returns a matrix whose columns are logical vectors indicating whether a particular type (e.g., "dog|wolf") was found. rowSums sums the logicals along the rows to get the number of distinct types in each string.
This returns
dataset
id text wcnt
1 1 saw a cat 1
2 2 found a dog 1
3 3 saw a cat by a dog 2
4 4 There was a lion 1
5 5 Huge wolf 1
If you want the intermediary step, returning logical vectors as variables in your data.frame, you would probably want to set your values up in a named vector and then do cbind with the result.
# construct named vector
myTypes <- c("canine"="dog|wolf", "feline"="cat|lion")
# cbind sapply results of logicals to original data.frame
dataset <- cbind(dataset, sapply(myTypes, function(x) grepl(x, dataset$text)))
This returns
dataset
id text canine feline
1 1 saw a cat FALSE TRUE
2 2 found a dog TRUE FALSE
3 3 saw a cat by a dog TRUE TRUE
4 4 There was a lion FALSE TRUE
5 5 Huge wolf TRUE FALSE
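The counting step can also be wrapped in a small helper, sketched here under the assumption that each category is described by one regular expression. The function name count_categories and the column name wcount are illustrative, not taken from any package. Reduce is used instead of rowSums so that a single string or a single category does not need special handling.

```r
# Hypothetical helper: number of categories whose pattern occurs in each string.
count_categories <- function(text, patterns) {
  hits <- lapply(patterns, function(p) grepl(p, text, ignore.case = TRUE))
  # Add the logical vectors element-wise, starting from integer zero.
  Reduce(`+`, hits, 0L)
}

dataset <- data.frame(
  id = 1:5,
  text = c("saw a cat", "found a dog", "saw a cat by a dog",
           "There was a lion", "Huge wolf"),
  stringsAsFactors = FALSE
)
my_types <- c(feline = "cat|lion", canine = "dog|wolf")

dataset$wcount <- count_categories(dataset$text, my_types)

# Rows that mention more than one category, for manual review.
multi <- dataset[dataset$wcount > 1, ]
print(multi)
```

Note that these patterns match substrings, so a word such as "category" would also count as a feline hit. Adding word boundaries (for example "\\b(cat|lion)\\b") is one way to tighten that.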

create list based on data frame in R

I have a data frame A in the following format
user item
10000000 1 # each user is an 8-digit integer, each item an integer of up to 5 digits
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B whose element names are the users and whose elements are vectors of the items belonging to each user,
e.g.
B = list(c(1,2,3),c(1,4),...)
I also need to attach names to B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user,A$item, c), but the result is not compatible with the association rule package. See my post:
data format error in association rule learning R
But @sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is the output of @sgibb's solution
throws error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here is the solution proposed by @alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
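To see why split helps here, a short sketch without arules: split returns a plain named list, whereas tapply returns a one-dimensional array. The data frame below mirrors the example above; the user ids are written as integers (with the L suffix), which is an assumption made so that the list names come out as full digits.

```r
A <- data.frame(
  user = c(10000000L, 10000000L, 10000000L, 10000001L, 10000001L),
  item = c(1L, 2L, 3L, 1L, 4L)
)

# split() groups the items by user and returns an ordinary named list.
B <- split(as.character(A$item), A$user)

# tapply() produces the same grouping, but wrapped in a 1-d array.
B_array <- tapply(A$item, A$user, FUN = as.character)

print(B)
print(is.array(B))
print(is.array(B_array))
```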

Loop through columns in S4 objects in R

I am trying to perform an association using the snpStats package.
I have a snp matrix called 'plink' which contains my genotype data (as
a list of $genotypes, $map, $fam), and plink$genotype has: SNP names as column names (2 SNPs) and the subject identifiers as the row names:
plink$genotype
SnpMatrix with 6 rows and 2 columns
Row names: 1 ... 6
Col names: 203 204
The plink dataset can be reproduced by copying the following ped and map files and saving them as 'plink.ped' and 'plink.map' respectively:
plink.ped:
1 1 0 0 1 -9 A A G G
2 2 0 0 2 -9 G A G G
3 3 0 0 1 -9 A A G G
4 4 0 0 1 -9 A A G G
5 5 0 0 1 -9 A A G G
6 6 0 0 2 -9 G A G G
plink.map:
1 203 0 792429
2 204 0 819185
And then use plink in this way:
./plink --file plink --make-bed
#----------------------------------------------------------#
| PLINK! | v1.07 | 10/Aug/2009 |
|----------------------------------------------------------|
| (C) 2009 Shaun Purcell, GNU General Public License, v2 |
|----------------------------------------------------------|
| For documentation, citation & bug-report instructions: |
| http://pngu.mgh.harvard.edu/purcell/plink/ |
#----------------------------------------------------------#
Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web
Writing this text to log file [ plink.log ]
Analysis started: Tue Nov 29 18:08:18 2011
Options in effect:
--file /ugi/home/claudiagiambartolomei/Desktop/plink
--make-bed
2 (of 2) markers to be included from [ /ugi/home/claudiagiambartolomei/Desktop/plink.map ]
6 individuals read from [ /ugi/home/claudiagiambartolomei/Desktop/plink.ped ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 6 missing
4 males, 2 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 2 SNPs
6 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 2 SNPs
After filtering, 0 cases, 0 controls and 6 missing
After filtering, 4 males, 2 females, and 0 of unspecified sex
Writing pedigree information to [ plink.fam ]
Writing map (extended format) information to [ plink.bim ]
Writing genotype bitfile to [ plink.bed ]
Using (default) SNP-major mode
Analysis finished: Tue Nov 29 18:08:18 2011
I also have a phenotype data frame which contains the outcomes (outcome1, outcome2,...) I would like to associate with the genotype, which is this:
ID<- 1:6
sex<- rep(1,6)
age<- c(59,60,54,48,46,50)
bmi<- c(26,28,22,20,23, NA)
ldl<- c(5, 3, 5, 4, 2, NA)
pheno<- data.frame(ID,sex,age,bmi,ldl)
The association works for a single outcome when I do this (using the function snp.rhs.tests):
bmi<-snp.rhs.tests(bmi~sex+age,family="gaussian", data=pheno, snp.data=plink$genotype)
My question is, how do I loop through the outcomes? This type of data
seems different from all the others and I am having trouble
manipulating it, so I would also be grateful if you have suggestions
of some tutorials that can help me understand how to do this and other
manipulations such as subsetting the snp.matrix data for example.
This is what I have tried for the loop:
rhs <- function(x) {
x<- snp.rhs.tests(x, family="gaussian", data=pheno,
snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Error in x$terms : $ operator is invalid for atomic vectors
Then I tried this:
for (cov in names(pheno)) {
association<-snp.rhs.tests(cov, family="gaussian",data=pheno, snp.data=plink$genotype)
}
Error in eval(expr, envir, enclos) : object 'bmi' not found
Thank you as usual for your help!
-f
The author of snpStats is David Clayton. Although the website listed in the package description is wrong, he is still at that domain and it's possible to do a search for documentation with the advanced search feature of Google with this specification:
snpStats site:https://www-gene.cimr.cam.ac.uk/staff/clayton/
The likely reason for your difficulty with access is that this is an S4 package and the methods for access are different. Instead of print methods S4 objects typically have show-methods. There is a vignette on the package here: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/practicals/practical6.pdf , and the directory for his entire short course is open for access: https://www-gene.cimr.cam.ac.uk/staff/clayton/courses/florence11/
It becomes clear that the object returned from snp.rhs.tests can be accessed with "[" using sequential numbers or names as illustrated on p 7. You can get the names :
# Using the example on the help(snp.rhs.tests) page:
> names(slt3)
[1] "173760" "173761" "173762" "173767" "173769" "173770" "173772" "173774"
[9] "173775" "173776"
The things you may be calling columns are probably "slots"
> getSlots(class(slt3))
snp.names var.names chisq df N
"ANY" "character" "numeric" "integer" "integer"
> str(getSlots(class(slt3)))
Named chr [1:5] "ANY" "character" "numeric" "integer" "integer"
- attr(*, "names")= chr [1:5] "snp.names" "var.names" "chisq" "df" ...
> names(getSlots(class(slt3)))
[1] "snp.names" "var.names" "chisq" "df" "N"
But there is no [i,j] method for looping over those slot names. You should instead go to the help page ?"GlmTests-class" which lists the methods defined for that S4 class.
The correct way to do what the initial poster required is:
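association <- list()
for (i in seq_len(ncol(pheno))) { # ncol(pheno) alone would visit only the last column
association[[names(pheno)[i]]] <- snp.rhs.tests(pheno[,i], family="gaussian", snp.data=plink$genotype)
}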
The documentation of snp.rhs.tests() says that if data is missing, the phenotype is taken from the parent frame - or maybe it was worded in the opposite sense: if data is specified, the phenotype is evaluated in the specified data.frame.
This is a clearer version:
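association <- list()
for (i in seq_len(ncol(pheno))) { # loop over every column index, not just the last one
cc <- pheno[,i]
# store each result under its column name instead of overwriting it
association[[names(pheno)[i]]] <- snp.rhs.tests(cc, family="gaussian", snp.data=plink$genotype)
}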
The documentation says data=parent.frame() is the default in snp.rhs.tests().
There is a glaring error in the apply() code - Please do not do x <- some.fun(x), as it does very bad things. Try this instead - drop the data=, and use a different variable name.
rhs <- function(x) {
y<- snp.rhs.tests(x, family="gaussian", snp.data=plink$genotype)
}
res_ <- apply(pheno,2,rhs)
Also the initial poster's question is misleading.
plink$genotype is an S4 object, while pheno is a data.frame (an S3 object). You really just want to select columns in an S3 data.frame, but you are thrown off course by how snp.rhs.tests() looks for the phenotype: as a column of the data.frame if data is given, or as a plain vector in the parent frame (your "current" frame, since the function itself is evaluated in a "child" frame) if it is not.
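Another way to loop over outcomes, when a function expects a formula as in the working single-outcome call from the question, is to build one formula per outcome from the column names. The sketch below uses lm() from base R purely as a stand-in for snp.rhs.tests(), which is not called here. Whether snp.rhs.tests() accepts formulas built this way is an assumption to check against its help page.

```r
pheno <- data.frame(
  ID  = 1:6,
  sex = rep(1, 6),
  age = c(59, 60, 54, 48, 46, 50),
  bmi = c(26, 28, 22, 20, 23, NA),
  ldl = c(5, 3, 5, 4, 2, NA)
)

outcomes <- c("bmi", "ldl")   # columns to use as the dependent variable
formulas <- list()
results  <- list()

for (outcome in outcomes) {
  # reformulate() builds e.g. bmi ~ sex + age from character strings.
  formulas[[outcome]] <- reformulate(c("sex", "age"), response = outcome)
  # lm() is only a stand-in for the real association test.
  results[[outcome]] <- lm(formulas[[outcome]], data = pheno)
}

print(formulas)
print(names(results))
```

Storing each fit under its outcome name keeps all results, rather than overwriting a single variable on every iteration.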

Does column exist and how to rearrange columns in R data frame

How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "LastName", "fac")]
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind=function(df){
temp1=names(df)
temp2=seq(1,length(temp1))
temp3=data.frame(temp1,temp2)
names(temp3)=c("VAR","COL")
return(temp3)
}
ni <- namesind
Use ni to see your column numbers (ni is just an alias for namesind; I never use namesind but originally thought it was the better name). Then if you want to insert your column in, say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11], newcolumn, bob[,12:20])
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)
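The "add at the end, then reorder" idea and the existence check from the answers above can be combined into one helper. This is a sketch only; the function name insert_column and its arguments are hypothetical and not part of base R.

```r
# Hypothetical helper: add column `name` after position `after` unless it already exists.
insert_column <- function(df, name, values = NA, after = 2) {
  if (name %in% names(df)) {
    return(df)                       # column already present: leave df untouched
  }
  n <- ncol(df)
  after <- max(0, min(after, n))     # clamp the position to a valid range
  df[[name]] <- values               # append at the end (a length-1 value is recycled)
  # Reorder: first `after` columns, the new column, then the remaining ones.
  df[c(seq_len(after), n + 1, seq_len(n - after) + after)]
}

d <- data.frame(x = 1, y = 1:3, fac = c("a", "b", "c"))
out <- insert_column(d, "LastName", c("Flim", "Flom", "Flam"))
print(names(out))
```

Calling the helper a second time with the same name returns the data frame unchanged, which matches the "only if it does not already exist" requirement in the question.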
