I am having trouble referencing columns in a dataframe by name. The function i have begins with extracting rows where no NA's are present:
prepare <- function(dataframe, attr1,attr2){
subset_na_still_there <- dataframe[!is.na(attr1) & !is.na(attr2),]
subset_na_still_there2 <- subset(dataframe, !is.na(attr1) & !is.na(attr2))
### someother code goes here
}
However, the subsets that are returned still contain NA's. I get no errors.
Here is a related question
edit:
Selecting the columns and then referencing them by number does the trick:
prepare <- function(dataframe, attr1,attr2){
subset_cols <- dataframe[,c(attr1, attr2)]
subset_gone <- subset_cols[!is.na(subset_cols[,1]) & !is.na(subset_cols[,2]),]
}
Why does the first version not work as expected?
How about this:
prepare <- function(x, attr1, attr2){
x[!is.na(x[attr1]) & !is.na(x[attr2]),]
}
Rather than creating your own function, try subset:
subset(mydata, !is.na(attr1) & !is.na(attr2))
If you want to get rid of rows with NAs in any field try
na.omit(mydata)
df <- data.frame(att1=c(1,NA,NA,10),att2=c(NA,1,2,3),val=c("a","z","e","r"))
df
att1 att2 val
1 1 NA a
2 NA 1 z
3 NA 2 e
4 10 3 r
test <- function(df,att1,att2){
df_no_na <- df[!is.na(att1) & !is.na(att2),]
df_no_na
}
test(df,df$att1,df$att2)
att1 att2 val
4 10 3 r
It's work for me. Are you sure about NA's ? Is is.na(df$att1) return TRUE ?
Related
I have managed to do chisq-test using loop in R but it is very slow for a large data and I wonder if you could help me out doing it faster with something like dplyr? I've tried with dplyr but I ended up getting an error all the time which I am not sure about the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is to run chisq-test between each row of the df and cs. Then giving me the statistics and p.values as well as row names.
here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w)) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = rownames(df) )
for (i in 1:col(df)) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columsn to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(!str_detect(., '\\.DERIVED')) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(!str_detect(., '\\.DERIVED')) %>%
select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length, your inequality doesn't return TRUE or FALSE, but rather returns logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column, the next loop iteration (when i = 6) will be handling what was the 7th row! (this will end in an error along the lines of Error in[.data.frame(x, , i) : undefined columns selected)
I prefer using dplyr, but if you need to use base R functions there are ways to to this without if statements.
Notice that you should consider using the regex version of "\\.DERIVED" and not ".DERIVED" which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. This will not
# detects if the column is full of a string of character values of only numbers.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
I want to subset a dataframe by applying two conditions to it. When I attach the dataframe, apply the first condition, detach the dataframe, attach it again, apply the second condition, and detach again, I get the expected result, a dataframe with 9 observations.
Of course, you wouldn't normally detach/attach before applying the second condition. So I attach, apply the two conditions after one another, and then detach. But the result is different now: It's a dataframe with 24 observations. All but 5 of these observations consist exclusively of NA-values.
I know there's lots of advice against using attach, and I understand the point that it's dangerous, because it's easy to loose track of an attach statement still being active. My point here is a different one; I see a behaviour in attach that I just can't understand. I'm using R Studio 0.99.465 with 64-bit-R 3.2.1.
So here's the code, first the version that is clumsy, but produces the correct result (df with 9 observations, all non-NA):
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
detach(df)
attach(df)
df <- df[early_vvl <= late_no_reaction,]
detach(df)
Now the one that produces the dataframe with 24 observations, most of which consist only of NA values:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
df <- df[early_vvl <= late_no_reaction,]
detach(df)
I'm puzzled. Does anybody understand why the second version produces a different result?
Have a look at what happens here:
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
length(early_vvl <= late_no_reaction)
## [1] 32
df <- df[early_vvl <= late_no_reaction,]
detach(df)
So your logical vector early_vvl <= late_no_reaction still uses the original df, the one that you attached. When you subset the data.frame the second time, the logical is longer than the data.frame has rows and it behaves like this:
df <- data.frame(x=1:5, y = letters[1:5])
df[rep(c(TRUE, FALSE), 5), ]
## x y
## 1 1 a
## 3 3 c
## 5 5 e
## NA NA <NA>
## NA.1 NA <NA>
You could just use & to avoid the problem:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction,]
detach(df)
I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates that there was a change to the index
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- true
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thansk for your help, and any additional general advice on loops is appreciated!
It is not clear what you want to do. But I guess :
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1
I am trying to figure out why the rbind function is not working as intended when joining data.frames without names.
Here is my testing:
test <- data.frame(
id=rep(c("a","b"),each=3),
time=rep(1:3,2),
black=1:6,
white=1:6,
stringsAsFactors=FALSE
)
# take some subsets with different names
pt1 <- test[,c(1,2,3)]
pt2 <- test[,c(1,2,4)]
# method 1 - rename to same names - works
names(pt2) <- names(pt1)
rbind(pt1,pt2)
# method 2 - works - even with duplicate names
names(pt1) <- letters[c(1,1,1)]
names(pt2) <- letters[c(1,1,1)]
rbind(pt1,pt2)
# method 3 - works - with a vector of NA's as names
names(pt1) <- rep(NA,ncol(pt1))
names(pt2) <- rep(NA,ncol(pt2))
rbind(pt1,pt2)
# method 4 - but... does not work without names at all?
pt1 <- unname(pt1)
pt2 <- unname(pt2)
rbind(pt1,pt2)
This seems a bit odd to me. Am I missing a good reason why this shouldn't work out of the box?
edit for additional info
Using #JoshO'Brien's suggestion to debug, I can identify the error as occurring during this if statement part of the rbind.data.frame function
if (is.null(pi) || is.na(jj <- pi[[j]]))
(online version of code here: http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R starting at: "### Here are the methods for rbind and cbind.")
From stepping through the program, the value of pi does not appear to have been set at this point, hence the program tries to index the built-in constant pi like pi[[3]] and errors out.
From what I can figure, the internal pi object doesn't appear to be set due to this earlier line where clabs has been initialized as NULL:
if (is.null(clabs)) clabs <- names(xi) else { #pi gets set here
I am in a tangle trying to figure this out, but will update as it comes together.
Because unname() & explicitly assigning NA as column headers are not identical actions. When the column names are all NA, then an rbind() is possible. Since rbind() takes the names/colnames of the data frame, the results do not match & hence rbind() fails.
Here is some code to help see what I mean:
> c1 <- c(1,2,3)
> c2 <- c('A','B','C')
> df1 <- data.frame(c1,c2)
> df1
c1 c2
1 1 A
2 2 B
3 3 C
> df2 <- data.frame(c1,c2) # df1 & df2 are identical
>
> #Let's perform unname on one data frame &
> #replacement with NA on the other
>
> unname(df1)
NA NA
1 1 A
2 2 B
3 3 C
> tem1 <- names(unname(df1))
> tem1
NULL
>
> #Please note above that the column headers though showing as NA are null
>
> names(df2) <- rep(NA,ncol(df2))
> df2
NA NA
1 1 A
2 2 B
3 3 C
> tem2 <- names(df2)
> tem2
[1] NA NA
>
> #Though unname(df1) & df2 look identical, they aren't
> #Also note difference in tem1 & tem2
>
> identical(unname(df1),df2)
[1] FALSE
>
I hope this helps. The names show up as NA each, but the two operations are different.
Hence, two data frames with their column headers replaced to NA can be "rbound" but two data frames without any column headers (achieved using unname()) cannot.