I recently ran into a problem in R that I had not seen before. I have a data set with a dependent variable Accuracy that takes only two values, "0" and "1". Previously, I used data$Accuracy=as.numeric(data$Accuracy) to turn these two levels into numbers, and it worked.
This time, however, when I did the same thing, the "0"s turned into "1"s and the "1"s turned into "2"s. Is this due to recent changes in R? How do I work around this issue?
Thanks!!
The column is probably of class factor. Calling as.numeric on a factor returns the underlying integer codes rather than the labels, and those codes start at 1. In that case, convert to character first and then to numeric:
data$Accuracy <- as.numeric(as.character(data$Accuracy))
If it is a factor, the manual recommends
as.numeric(levels(data$Accuracy))[data$Accuracy]
to transform it to approximately its original numeric values.
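The difference between the three conversions can be seen on a small made-up factor. This is only a sketch; the vector acc is a hypothetical stand-in for data$Accuracy.

```r
# Hypothetical data: a factor whose labels are "0" and "1"
acc <- factor(c("0", "1", "1", "0"))

codes      <- as.numeric(acc)                  # internal codes: 1 2 2 1
via_char   <- as.numeric(as.character(acc))    # labels as numbers: 0 1 1 0
via_levels <- as.numeric(levels(acc))[acc]     # same result: 0 1 1 0
```

The last two lines give the same values; the levels-based form converts each distinct label only once.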
I suspect the problem lies in how the data frame was defined or read from a file. If the original data were only 0 and 1, data$Accuracy should be of class integer (or numeric). But a single non-numeric entry in just one row makes the whole column character, which is converted to a factor when stringsAsFactors is TRUE (the default before R 4.0.0). For example:
> zz<-data.frame(c(0, 0, 1, 1))
> zz
c.0..0..1..1.
1 0
2 0
3 1
4 1
> zz<-data.frame(c(0, 0, 1, 1, "")) # an empty space
> zz
c.0..0..1..1.....
1 0
2 0
3 1
4 1
5
> class(zz$c.0..0..1..1.....)
[1] "factor"
> zz<-data.frame(c(0, 0, 1, 1, NA)) # empty numeric data
> zz
c.0..0..1..1..NA.
1 0
2 0
3 1
4 1
5 NA
> class(zz$c.0..0..1..1..NA.)
[1] "numeric"
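To make the role of stringsAsFactors explicit, here is a minimal sketch that sets it by hand instead of relying on the default of whichever R version is installed. The column name x and the object names are made up for illustration.

```r
# One empty string forces the whole vector to character
raw <- c(0, 0, 1, 1, "")

# With factor conversion on (the default before R 4.0.0)
zz_factor <- data.frame(x = raw, stringsAsFactors = TRUE)
class(zz_factor$x)    # "factor"

# With factor conversion off (the default since R 4.0.0)
zz_char <- data.frame(x = raw, stringsAsFactors = FALSE)
class(zz_char$x)      # "character"

# A character column converts directly; the empty entry becomes NA
as_num <- as.numeric(zz_char$x)
```

Checking class() or str() on the column right after reading the file shows which of the two cases applies.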
Related
Say I have a data frame with a column for summed data. What is the most efficient way to return a binary 0 or 1 in a new column if any value in columns a, b, or c is NOT zero? rowSums is fine for a total, but I also need a simple indicator of whether anything differs from a given value.
tt <- data.frame(a=c(0,-5,0,0), b=c(0,5,10,0), c=c(-5,0,0,0))
tt[, ncol(tt)+1] <- rowSums(tt)
This yields:
> tt
   a  b  c V4
1  0  0 -5 -5
2 -5  5  0  0
3  0 10  0 10
4  0  0  0  0
The fourth column is a simple sum of the data in the first three columns. How can I add a fifth column that returns a binary 1/0 value if any value in the first three columns differs from a given criterion?
For example, is there a simple way to return a 1 if any of a, b, or c are NOT 0?
as.numeric(rowSums(tt != 0) > 0)
# [1] 1 1 1 0
tt != 0 gives us a logical matrix telling us where there are values not equal to zero in tt.
When the sum of each row is greater than zero (rowSums(tt != 0) > 0), we know that at least one value in that row is not zero.
Then we convert the result to numeric (as.numeric(.)) and we've got a binary vector result.
We can use Reduce
+(Reduce(`|`, lapply(tt, `!=`, 0)))
#[1] 1 1 1 0
One could also use the good old apply loop:
+apply(tt != 0, 1, any)
#[1] 1 1 1 0
The argument tt != 0 is a logical matrix with entries stating whether the value is different from zero. Then apply() with margin 1 is used for a row-wise operation to check if any of the entries is true. The prefix + converts the logical output into numeric 0 or 1. It is a shorthand version of as.numeric().
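Putting the pieces together, the indicator can be stored as a new column next to the total. This is a sketch only; the column names total and any_nonzero and the helper vector cols are made up here.

```r
tt <- data.frame(a = c(0, -5, 0, 0), b = c(0, 5, 10, 0), c = c(-5, 0, 0, 0))

# Restrict both computations to the original three columns
cols <- c("a", "b", "c")

tt$total       <- rowSums(tt[cols])
tt$any_nonzero <- as.integer(rowSums(tt[cols] != 0) > 0)
```

Row 2 shows why the total cannot double as the indicator: -5 and 5 sum to 0 even though two entries are non-zero.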
I am trying to count integers in a vector that also contains zeros. However, tabulate doesn't count the zeros. Any ideas what I am doing wrong?
Example:
> tabulate(c(0,4,4,5))
[1] 0 0 0 2 1
but the answer I expect is:
[1] 1 0 0 0 2 1
Use a factor and define its levels
tabulate(factor(c(0,4,4,5), 0:5))
#[1] 1 0 0 0 2 1
The explanation for the behaviour you're seeing is in ?tabulate (bold face mine)
bin: a numeric vector (of positive integers), or a factor. Long
vectors are supported.
In other words, if you give it a numeric vector, it must contain only positive integers (> 0); as the example shows, zeros are not counted. Otherwise, use a factor.
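Another option, not taken from the answers above, is to shift the values by one so that 0 lands in the first bin. The sketch assumes the vector holds only non-negative integers.

```r
x <- c(0, 4, 4, 5)

# Bin 1 now counts the zeros, bin 2 the ones, and so on
counts <- tabulate(x + 1)
counts  # 1 0 0 0 2 1
```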
I got annoyed enough by tabulate to write a short function that can count not only the zeroes but any other integers in a vector:
my.tab <- function(x, levs) {
  sapply(levs, function(n) {
    length(x[x==n])
  })
}
The parameter x is an integer vector that we want to tabulate. levs is another integer vector that contains the "levels" whose occurrences we count. Let's set x to some integer vector:
x <- c(0,0,1,1,1,2,4,5,5)
A) Use my.tab to emulate R's built-in tabulate. 0-s will be ignored:
my.tab(x, 1:max(x))
# [1] 3 1 0 1 2
B) Count the occurrences of integers from 0 to 6:
my.tab(x, 0:6)
# [1] 2 3 1 0 1 2 0
C) If you want to know (for some strange reason) only how many 1-s and 4-s your x vector contains, but ignore everything else:
my.tab(x, c(1,4))
# [1] 3 1
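For comparison, the same counts can be obtained from base R by combining table() with a factor whose levels are fixed in advance. This is a sketch of an alternative, not a replacement for my.tab.

```r
x <- c(0, 0, 1, 1, 1, 2, 4, 5, 5)

# Levels that never occur still get a count of zero
counts_0_to_6 <- as.vector(table(factor(x, levels = 0:6)))   # 2 3 1 0 1 2 0

# Values outside the chosen levels are dropped
counts_1_and_4 <- as.vector(table(factor(x, levels = c(1, 4))))   # 3 1
```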
I have multiple files to read in using R. I iterate through the files in a loop, obtain dataframes and then try to change values of a particular column. Examples of the R dataframes are as follows:
df_A:
ID ZN
1 0
2 1
3 1
4 0
df_B:
ID ZN
1 2
2 1
3 1
4 2
As shown above, the column 'ZN' may have 0's and 1's in some dataframes and 1's and 2's in others. As I iterate through the files, I want to change only the dataframes whose ZN column has 1's and 2's, like this: 1 to 0 and 2 to 1. Dataframes with ZN values of 0's and 1's should be left unchanged.
my attempt did not work:
if(dataframe$ZN > 1){
dataframe$ZN<-recode(dataframe$ZN,"1=0;2=1")
}
else{
dataframe$ZN
}
Any solutions please?
One approach might be to decrement the value of ZN by one if we detect a single value of 2 anywhere in the column:
if (max(df_A$ZN) == 2) {
df_A$ZN = df_A$ZN - 1
}
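The original attempt most likely failed because dataframe$ZN > 1 yields one logical value per row, whereas if() expects a single TRUE or FALSE. Collapsing the column with max() avoids that. The sketch below wraps the check in a hypothetical helper, recode_zn, and applies it to a list of data frames standing in for the files read in the loop.

```r
# Hypothetical helper: shift ZN down by one only when the column uses 1/2 coding
recode_zn <- function(df) {
  if (max(df$ZN, na.rm = TRUE) == 2) {
    df$ZN <- df$ZN - 1
  }
  df
}

df_A <- data.frame(ID = 1:4, ZN = c(0, 1, 1, 0))
df_B <- data.frame(ID = 1:4, ZN = c(2, 1, 1, 2))

# Stand-in for the loop over files
fixed <- lapply(list(A = df_A, B = df_B), recode_zn)
fixed$A$ZN  # 0 1 1 0 (unchanged)
fixed$B$ZN  # 1 0 0 1
```

A file whose ZN column contains only 1's would be left unchanged by this check, so it only works when the value 2 actually appears in every 1/2-coded file.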
If there are only two values, i.e. 0 and 1, then the following maps 0 to 2 and 1 to 1:
df_A$ZN <- (df_A$ZN==0) + 1
df_A$ZN
#[1] 2 1 1 2
Or using case_when for multiple values
library(dplyr)
df_A %>%
mutate(ZN = case_when(ZN==0 ~2, TRUE ~ 1))
I want to create an indicator variable by comparing the current value of a variable with the previous value. The logic is: if the current value equals the previous value, the indicator is 1, otherwise 0. The first indicator value is NA because there is nothing to compare it with.
It needs to be fast because I have lots of groups to compare in my data (I did not include the grouping variable, for simplicity).
> dt<-c('a','a','a','b','a','a','c','c')
> indicator
[1] NA 1 1 0 0 1 0 1
Using base R, you can drop the last element and the first element of the vector with head() and tail(), compare the two results, and then add the NA to the front.
c(NA, as.numeric(head(dt, -1) == tail(dt, -1)))
If dt were a vector of numbers, you could use diff like
dn <- c(1,1,1,2,1,1,3,3)
c(NA, (diff(dn)==0)+0)
(using +0 rather than as.numeric to make the booleans 1's and 0's.)
You can use Lag from the Hmisc package,
dropping the first comparison with [-1] and adding NA at the beginning.
library(Hmisc)
c(NA, as.numeric(dt== Lag(dt))[-1])
#[1] NA 1 1 0 0 1 0 1
You could also use rle in base R:
v <- rle(dt)[[1]]
x <- rep(1:length(v),v)
indicator <- c(NA, (diff(x)==0)*1)
#[1] NA 1 1 0 0 1 0 1
v: gets the number of times each character is repeated
x: contains the respective numeric vector from dt to benefit from diff
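Since the question mentions groups, here is one way the head()/tail() comparison could be applied per group in base R. It is a sketch under assumptions: the grouping vector grp and the helper same_as_prev are invented for illustration, and no claim is made about its speed on large data.

```r
dt  <- c('a', 'a', 'a', 'b', 'a', 'a', 'c', 'c')
grp <- c(1, 1, 1, 1, 2, 2, 2, 2)   # hypothetical grouping variable

# 1 if an element equals its predecessor, 0 otherwise, NA for the first one
same_as_prev <- function(v) c(NA, as.numeric(head(v, -1) == tail(v, -1)))

# Compute within each group, then put the pieces back in the original order
indicator <- unsplit(lapply(split(dt, grp), same_as_prev), grp)
indicator  # NA 1 1 0 NA 1 0 1
```

The comparison restarts at every group boundary, so the first element of each group is NA.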
I am trying to build a dataframe from the output of a mapply.
Here is one example of my output.
> out[1:9,1]
$statistic
X-squared
1311.404
$parameter
df
1
$p.value
[1] 1.879366e-287
$estimate
prop 1 prop 2
0.001680737 0.009517644
$null.value
NULL
$conf.int
[1] -1.000000000 -0.007153045
attr(,"conf.level")
[1] 0.95
$alternative
[1] "less"
$method
[1] "2-sample test for equality of proportions with continuity correction"
$data.name
[1] "members out of enrolled"
I want to put these values into a dataframe. I have 1684 rows in this matrix. I want a dataframe with 1684 rows.
I also have codes from outside of this data that I want to incorporate into the dataframe. These are strings from fwa$proc.
> out[,1]$p.value
[1] 1.879366e-287
> out[,1]$estimate[[1]]
[1] 0.001680737
> out[,1]$estimate[[2]]
[1] 0.009517644
> as.character(fwa$proc[1])
[1] "10022"
I have looked here for support for doing this. I am creating a dataframe first and then attempting to fill my dataframe from another dataframe row by row as such...
n<-1684
new.df <- data.frame(cpt=character(n), FFS_prop=numeric(n), PHN_prop=numeric(n)
, differnce=numeric(n), results=character(n), Null_HO = character(n), Alt_HA=character(n), stringsAsFactors=FALSE)
Here is the head.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Now to fill data row by row...
for (i in 1:n) new.df[i, ] <- data.frame(cpt = toString(fwa$proc[i])
,FFS_prop=round(out[,i]$estimate[[1]],5)
,PHN_prop=round(out[,i]$estimate[[2]],5)
,differnce=round(out[,i]$estimate[[1]]-out[,i]$estimate[[2]],5)
,results=if(out[,i]$p.value <.05) {"Reject NUll"} else {"Fail to Reject Null"}
,Null_HO = toString('FFS = pHN')
,Alt_HA = toString('FFS < PHN')
)
Here is the head after the code runs.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 1 0.00168 0.00952 -0.00784 1 1 1
2 1 0.00033 0.00142 -0.00109 1 1 1
3 1 0.00239 0.01461 -0.01222 1 1 1
4 1 0.00135 0.00919 -0.00783 1 1 1
5 1 0.00008 0.00180 -0.00172 1 1 1
6 1 0.00036 0.00177 -0.00141 1 1 1
Please, friends, why don't my strings make it into the data frame?
I have tried wrapping them in as.character() and in toString(), all for naught.
Wiser ones please advise.
Thanks.
You can either set options(stringsAsFactors=FALSE) or set stringsAsFactors=FALSE in the data.frame() call inside your loop. The problem is that, because you are building a new data.frame in each iteration, it doesn't know about the rules you've set on the data.frame it's going to be added to later. So at the time of creation, it converts its character values to factors, which store each distinct string as an integer code. Since you are only adding one value, each factor has one level, so each is coded as the integer 1.
Then when you go to do the assignment to the master data.frame, that integer 1 is converted to a character "1". So the str(new.df) should show that your character columns are still characters, they just happen to contain the character "1" for each row.
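The effect can be reproduced on a tiny scale. In this sketch stringsAsFactors is set explicitly in every call so that the result does not depend on the installed R version; the object names other than new.df are made up.

```r
# One-row data frame with factor conversion on (the default before R 4.0.0)
row_factor <- data.frame(results = "Fail to Reject Null", stringsAsFactors = TRUE)
code <- as.integer(row_factor$results)   # the only level is stored as code 1

# Same row with conversion off: the text survives the row assignment
new.df   <- data.frame(results = character(2), stringsAsFactors = FALSE)
row_char <- data.frame(results = "Fail to Reject Null", stringsAsFactors = FALSE)
new.df[1, ] <- row_char
new.df$results[1]   # "Fail to Reject Null"
```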
Building data.frames row by row is a messy process that should be avoided if at all possible. It's better to build your data column-wise and then assemble the data.frame at the end. You said that out was the result of using mapply on a prop.test, so I've created a sample:
out<-mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)
That gives something that matches your out with only 10 columns I believe. But then you can extract all the p-values with
apply(out, 2, '[[', "p.value")
and all of your FFS values with
apply(out, 2, function(x) x$estimate[[1]])
so your data.frame construction would look more like
new.df<- data.frame(cpt = fwa$proc
,FFS_prop=apply(out, 2, function(x) x$estimate[[1]])
,PHN_prop=apply(out, 2, function(x) x$estimate[[2]])
,pval = apply(out, 2, '[[', "p.value")
,Null_HO = 'FFS = pHN'
,Alt_HA = 'FFS < PHN'
,stringsAsFactors=F
)
new.df <- transform(new.df,
differnce=FFS_prop-PHN_prop
,results=ifelse(pval<.05, "Reject NUll", "Fail to Reject Null")
)
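A self-contained variant of the column-wise approach, using vapply() so that each extracted value is checked to be a single number. This is a sketch on simulated one-sample tests; the seed, the number of tests, and the names successes, idx, prop, and res are arbitrary choices, not part of the original data.

```r
set.seed(1)  # arbitrary seed so the simulated counts are reproducible
successes <- rbinom(5, size = 100, prob = 0.5)
out <- mapply(prop.test, successes, 100)   # list matrix, one test per column

# Extract one number per test, column by column
idx  <- seq_len(ncol(out))
pval <- vapply(idx, function(i) out[, i]$p.value, numeric(1))
prop <- vapply(idx, function(i) out[, i]$estimate[[1]], numeric(1))

# Assemble the data frame once, at the end
res <- data.frame(
  prop    = prop,
  pval    = pval,
  results = ifelse(pval < .05, "Reject Null", "Fail to Reject Null"),
  stringsAsFactors = FALSE
)
```

Unlike apply(), vapply() stops with an error if any test returns something other than a single numeric value, which can help catch malformed results early.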