How to recode with library(memisc)?

I just created a sample which gives the structure of my data:
a<-c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,2,3,5,4,5,6)
b<-c(1,2,3,4,4,1,2,3,9,7,2,3,6,1,9,3,1,5,7,8)
c<-c(1,1,1,0,0,1,0,1,0,0,0,0,0,1,1,1,1,0,1,0)
d<-c(10,9,7,10,11,2,3,3,1,1,2,2,2,2,2,2,2,2,2,3)
e<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5)
df<-data.frame(a,b,c,d,e)
library(memisc)
df_p1<- within(df,{
e<-recode(e,
c(1,2,3)->"West",
c(4,5)->"East")})
I would just like to recode the values 1, 2 and 3 of e into West and 4 and 5 into East. I know for sure that I ran that recode command a week ago and it worked perfectly. Now I get an error:
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(East = c(4, 5), West = c(1, :
replacement element 2 has 3 rows, need 20
In addition: Warning message:
In if (as.factor.result) { :
the condition has length > 1 and only the first element will be used
I just figured out the problem. I don't know whether this is common knowledge, but I didn't know it. The error occurs only when I add library(car) to my script. I suppose the two packages conflict, since both provide a function called recode. With both loaded you get the error message above.
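A likely explanation for the clash: car and memisc each export a function named recode, and the package attached last masks the other, so the memisc-style arguments end up in car's recode. Below is a minimal sketch of two ways around it. The `lookup` vector is a hypothetical helper that is not in the question, and the second option assumes memisc is installed.

```r
e <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5)

# Option 1: base R lookup, independent of which recode() is on the search path.
# Position i of `lookup` holds the label for the value i.
lookup <- c("West", "West", "West", "East", "East")
region <- lookup[e]
table(region)

# Option 2: spell out the namespace so that memisc's recode() is used
# even when car has been attached afterwards.
if (requireNamespace("memisc", quietly = TRUE)) {
  region_memisc <- memisc::recode(e, c(1, 2, 3) -> "West", c(4, 5) -> "East")
  print(table(region_memisc))
}
```

The same namespace prefix works in the other direction (car::recode) if the car syntax is preferred.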

How about amending it slightly to this?
Use the car library instead of memisc
require(car)
df_p1<- within(df,{
e<-recode(e, "c(1,2,3)='West'; c(4,5)='East'")})


`$<-.data.frame`(`*tmp*`, Numero, value = numeric(0) error [duplicate]

I have a numeric column ("value") in a dataframe ("df"), and I would like to generate a new column ("valueBin") based on "value." I have the following conditional code to define df$valueBin:
df$valueBin[which(df$value<=250)] <- "<=250"
df$valueBin[which(df$value>250 & df$value<=500)] <- "250-500"
df$valueBin[which(df$value>500 & df$value<=1000)] <- "500-1,000"
df$valueBin[which(df$value>1000 & df$value<=2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value>2000)] <- ">2,000"
I'm getting the following error:
"Error in $<-.data.frame(*tmp*, "valueBin", value = c(NA, NA, NA, :
replacement has 6530 rows, data has 6532"
Every element of df$value should fit into one of my which() statements. There are no missing values in df$value. Although even if I run just the first conditional statement (<=250), I get the exact same error, with "...replacement has 6530 rows..." although there are way fewer than 6530 records with value<=250, and value is never NA.
This SO link notes that a similar error when using aggregate() was a bug, but it recommends installing the version of R I already have. Plus, the bug report says it's fixed.
R aggregate error: "replacement has <foo> rows, data has <bar>"
This SO link seems more related to my issue, and the issue here was an issue with his/her conditional logic that caused fewer elements of the replacement array to be generated. I guess that must be my issue as well, and figured at first I must have a "<=" instead of an "<" or vice versa, but after checking I'm pretty sure they're all correct to cover every value of "value" without overlaps.
R error in '[<-.data.frame'... replacement has # items, need #
The answer by @akrun certainly does the trick. For future googlers who want to understand why, here is an explanation...
The new variable needs to be created first.
The variable "valueBin" needs to already exist in the df for the conditional assignment to work. Essentially, the syntax of the code is correct. Just add one line in front of the code chunk to create this name --
df$newVariableName <- NA
Then you continue with whatever conditional assignment rules you have, like
df$newVariableName[which(df$oldVariableName<=250)] <- "<=250"
I blame whoever wrote that error message... It made the debugging especially confusing, because it suggests that two vectors in the df have different lengths, which is beside the point. Simply create the new column first. For more details, consult this post https://www.r-bloggers.com/translating-weird-r-errors/
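A small sketch of why creating the column first helps, using a made-up five-row data frame (the values are hypothetical, chosen so that the last rows do not match the first condition). When the column does not exist, the subassignment starts from NULL, so the replacement is only as long as the largest matching index:

```r
df <- data.frame(value = c(100, 300, 200, 3000, 700))

# valueBin does not exist yet, so the right-hand side is built from NULL and
# is only as long as the largest matching index (3 here), not nrow(df) (5).
attempt <- try(df$valueBin[which(df$value <= 250)] <- "<=250", silent = TRUE)
inherits(attempt, "try-error")

# Creating the column first gives the subassignment a full-length vector.
df$valueBin <- NA_character_
df$valueBin[which(df$value <= 250)] <- "<=250"
df$valueBin[which(df$value > 2000)] <- ">2,000"
df
```

Note that the failure depends on the data: if the shorter replacement happens to divide the number of rows evenly, R recycles it silently instead of raising an error, which is arguably worse.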
You could use cut
df$valueBin <- cut(df$value, c(-Inf, 250, 500, 1000, 2000, Inf),
labels=c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))
data
set.seed(24)
df <- data.frame(value= sample(0:2500, 100, replace=TRUE))
TL;DR ...and late to the party, but that short explanation might help future googlers..
In general that error message means that the replacement doesn't fit into the corresponding column of the dataframe.
A minimal example:
df <- data.frame(a = 1:2); df$a <- 1:3
throws the error
Error in `$<-.data.frame`(`*tmp*`, a, value = 1:3) : replacement has 3 rows, data has 2
which is clear, because the vector a of df has 2 entries (rows) whilst the vector we try to replace it with has 3 entries (rows).

Error warning while using the getSymbols function in R

I am trying to obtain Bitcoin data from yahoo finance using the following code:
getSymbols("BTC-USD",from= "2020-01-01",to="2020-12-31",warnings=FALSE,auto.assign = TRUE)
BTC-USD=BTC-USD[,"BTC-USD.Adjusted"]
However, I get the following warning:
Warning message:
BTC-USD contains missing values. Some functions will not work if objects contain missing values in the middle of the series. Consider using na.omit(), na.approx(), na.fill(), etc to remove or replace them.
How can I fix this?
Thanks.
Your first problem is that you're trying to assign to an invalid symbol: - is the subtraction operator, so use _ in the name instead. If you really want the -, you can put backticks around the symbol.
Then you can use is.na to find the NA values and replace them with 0.
library(quantmod)
getSymbols("BTC-USD",from= "2020-01-01",to="2020-12-31",warnings=FALSE,auto.assign = TRUE)
BTC_USD <- `BTC-USD`[,"BTC-USD.Adjusted"]
BTC_USD[is.na(BTC_USD)] <- 0
BTC_USD[100:110,]
#            BTC-USD.Adjusted
# 2020-04-09         7302.089
# 2020-04-10         6865.493
# 2020-04-11         6859.083
# 2020-04-12         6971.092
# 2020-04-13         6845.038
# 2020-04-14         6842.428
# 2020-04-15         6642.110
# 2020-04-16         7116.804
# 2020-04-17            0.000
# 2020-04-18         7257.665
# 2020-04-19         7189.425
A better plan is probably to just remove the NA rows instead of replacing them with 0:
BTC_USD <- BTC_USD[!is.na(BTC_USD),]
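Both points can be tried without downloading anything. The sketch below uses a plain numeric vector as a stand-in for the xts series (an assumption for illustration: a few prices borrowed from the output above plus one NA), and shows why dropping the NA is usually safer than filling it with 0:

```r
# Hypothetical stand-in for the downloaded series.
`BTC-USD` <- c(7302.089, 6865.493, NA, 7257.665)

# Without backticks, BTC-USD would parse as BTC minus USD.
BTC_USD <- `BTC-USD`

zero_filled <- BTC_USD
zero_filled[is.na(zero_filled)] <- 0    # keeps the length, distorts the series

na_dropped <- BTC_USD[!is.na(BTC_USD)]  # shortens the series, keeps real prices only

mean(zero_filled)
mean(na_dropped)
```

A price of 0 pulls every summary statistic down and would show up as a crash in any return calculation, which is why the zero-filled mean is lower than the mean of the observed prices.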

How to use aggregate( ) to count NA values and using tapply() as an alternative

I am new to R and trying to prepare for an exam in R which will take place in one week.
On one of the homework questions, I am trying to solve a single problem in as many as ways as possible (preparing more tools always comes in handy in a time-constrained coding exam).
The problem is the following: in my dataset, "ckm_nodes.csv"
The variable adoption date records the month in which the
doctor began prescribing tetracycline, counting from November 1953. If the doctor did not begin prescribing it by month 17, i.e. February 1955, when the study ended, this is recorded as Inf. If it's not known when or if the doctor adopted tetracycline, their value is NA. Answer the following. (a) How many doctors began prescribing tetracycline in each month of the study? (b) How many never prescribed it during the study? (c) How many are NAs?
I was trying to use the aggregate( ) function to count the number of doctors starting to prescribe in each month. My base code is:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length)
This works, except that the NA values are not counted.
I wonder if there is a way I can let the aggregate function count the NA values, so I read the R documentation on aggregate( ) function, which says the following:
na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
So I googled how to solve this problem and set "na.action = NULL". However, when I try to run this code, here is what happened:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length, na.action = NULL)
Error in FUN(X[[i]], ...) :
2 arguments passed to 'length' which requires 1
I tried reordering the arguments:
aggregate(nodes$adoption_date, length, by = nodes["adoption_date"], na.action = NULL)
Error in FUN(X[[i]], ...) :
2 arguments passed to 'length' which requires 1
But it doesn't work either.
Any idea how to fix this?
***************** tapply()
Additionally, I was wondering if one can use the "tapply" function to solve Q1 on the homework. I tried
count <- function(data){
return(length(data$adoption_date))
}
count_tetra <- tapply(nodes,nodes$adoption_date,count)
Error in tapply(nodes, nodes$adoption_date, count) : arguments must
have same length
************** loops
I am also wondering how I can use a loop to achieve the same goal.
I can start by sorting the vector:
nodes_sorted <- nodes[order(nodes$adoption_date),]
Then, write a for loop, but how...?
Goal is to get a vector count, and each element of count corresponds to a value for number of prescriptions.
Thanks!
Example data:
nodes <- data.frame(
adoption_date = rep(c(1:17,NA,Inf), times = c(rep(5,17),20,3))
)
Have you looked at data.table? I believe something like this does the trick.
require(data.table)
# convert nodes to data.table
setDT(nodes)
# count occurrences for each value of adoption_date
nodes[, .N, by = adoption_date]
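For a base R route, here is a sketch on the example data from the question. The "2 arguments passed to 'length'" error most likely arises because na.action is only an argument of the formula interface of aggregate(); in the call above it is forwarded through ... to length. table() with useNA can count the NA group directly, and tapply() works once it is given a vector rather than the whole data frame:

```r
nodes <- data.frame(
  adoption_date = rep(c(1:17, NA, Inf), times = c(rep(5, 17), 20, 3))
)

# (a) counts per month; table() keeps an NA entry when asked to,
# and Inf gets its own level.
counts <- table(nodes$adoption_date, useNA = "ifany")
counts

# (b) never prescribed during the study, (c) unknown
n_never   <- sum(is.infinite(nodes$adoption_date))
n_unknown <- sum(is.na(nodes$adoption_date))

# tapply() needs a vector, not the whole data frame, as its first argument.
# Like aggregate(), it drops the NA group.
per_month <- tapply(nodes$adoption_date, nodes$adoption_date, length)
per_month
```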

R: errors in cor() and corrplot()

Another stumbling block. I have a large set of data (called "brightly") with about ~180k rows and 165 columns. I am trying to create a correlation matrix of these columns in R.
Several problems have arisen, none of which I can resolve with the suggestions proposed on this site and others.
First, how I created the data set: I saved it as a CSV file from Excel. My understanding is that CSV should remove any formatting, such that anything that is a number should be read as a number by R. I loaded it with
brightly = read.csv("brightly.csv", header=TRUE)
But I kept getting "'x' must be numeric" error messages every time I ran cor(brightly), so I replaced all the NAs with 0s. (This may be altering my data, but I think it will be all right--anything that's "NA" is effectively 0, either for the continuous or dummy variables.)
Now I am no longer getting the error message about text. But any time I run cor()--either on all of the variables simultaneously or combinations of the variables--I get "Warning message:
In cor(brightly$PPV, brightly, use = "complete") :
the standard deviation is zero"
I am also having some of the correlations of that one variable with others show up as "NA." I have ensured that no cell in the data is "NA," so I do not know why I am getting "NA" values for the correlations.
I also tried both of the following to make REALLY sure I wasn't including any NA values:
cor(brightly$PPV, brightly, use = "pairwise.complete.obs")
and
cor(brightly$PPV,brightly,use="complete")
But I still get warnings about the SD being zero, and I still get the NAs.
Any insights as to why this might be happening?
Finally, when I try to do corrplot to show the results of the correlations, I do the following:
brightly2 <- cor(brightly)
Warning message:
In cor(brightly) : the standard deviation is zero
corrplot(brightly2, method = "number")
Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { :
missing value where TRUE/FALSE needed
And instead of making my nice color-coded correlation matrix, I get this. I have yet to find an explanation of what that means.
Any help would be HUGELY appreciated! Thanks very much!!
Please check whether you replaced your NAs with 0 or '0': the first is a number, the second a character string. You can also try as.numeric(column_name) to convert character 0s into numeric 0s. This error also occurs if your dataset contains factors: they are not numeric values, so corrplot throws this error.
It would be helpful if you put a sample of your data in the question using
str(head(your_dataset))
That would also let you check the data types of the columns.
Let me know if I am wrong.
Cheerio.
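The "standard deviation is zero" warning points at another possible cause: a column that is constant (for example all 0 after replacing NAs) has no variance, so its correlation with anything is undefined and comes back as NA. A minimal sketch on made-up data, assuming that is what happens in brightly (the data frame and column names below are hypothetical):

```r
# Hypothetical miniature of the data: two varying columns and one constant one.
demo <- data.frame(
  x        = c(1, 2, 3, 4),
  y        = c(2, 4, 5, 9),
  constant = c(0, 0, 0, 0)
)

# The constant column produces the warning and NA entries.
cm <- suppressWarnings(cor(demo))
anyNA(cm)

# Find zero-variance columns and leave them out before correlating.
sds  <- vapply(demo, sd, numeric(1))
keep <- sds > 0
cm_ok <- cor(demo[, keep, drop = FALSE])
cm_ok
```

A matrix like cm_ok contains no NA, so the range check that fails inside corrplot with "missing value where TRUE/FALSE needed" has well-defined values to compare.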

Error with knnImputer from the DMwR Package: invalid 'times' argument

I'm trying to run knnImputation from the DMwR package on a genomic dataset. The dataset has two columns: one for location on a chromosome (numeric, an integer) and one for methylation values (also numeric, double), with many of the methylation values missing. The idea is that distance should be based on location in the chromosome. (I also have several other features, but chose not to include those.) When I run the following line, however, I get an error.
reg.knn <- knnImputation(as.matrix(testp), k=2, meth="median")
#ERROR:
#Error in rep(1, ncol(dist)) : invalid 'times' argument
Any thoughts on what could be causing this?
If this doesn't work, does anyone know of any other good KNN imputers in R packages? I've been trying several, but each returns some kind of error.
I got a similar error today:
Error in rep(1, ncol(dist)) : invalid 'times' argument
I could not find a solution online, but after some trial and error I think the issue is the number of columns in the data frame.
Try passing at least 3 columns to knnImputation.
I created a dummy column holding the row number of each observation (as the third column).
It worked for me!
Examples for your reference:
Example 1 -
temp <- data.frame(X = c(1,2,3,4,5,6,7,8,9,10), Y = c(T, T, F, F,F,F,NA,NA,T,T))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Error in rep(1, ncol(dist)) : invalid 'times' argument
Example 2 -
temp <- data.frame(X = 1:10, Y = c(T, T, F, F,F,F,NA,T,T,T), Z = c(NA,NA,7,8,9,5,11,9,9,4))
temp7 <- NULL
temp7 <- knnImputation(temp, scale=T, k=3, meth='median', distData = NULL)
Here number of columns passed is 3. Did NOT get any error!
Today, I encountered the same error. My df had many more than 3 columns, so the column count does not seem to be the (only?) problem.
I found that rows with too many NAs caused the problem (in my case, more than 95% of a given row was NA). Filtering out this row solved the problem.
Take-home message: do not only filter for NAs over the columns (which I did), but also check the rows (it's of course impossible to impute by kNN if you cannot define what exactly a nearest neighbor is).
Would be nice if the package would provide a readable error message!
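The row check can be sketched in base R as follows. The data frame and the 50% cutoff are hypothetical; pick a threshold that suits your data.

```r
# Made-up data: row 2 is entirely NA.
m <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 2, 4),
  c = c(5, NA, 1, 2)
)

# Fraction of missing values per row.
na_frac <- rowMeans(is.na(m))
na_frac

# Keep only rows with at most half of their values missing (assumed cutoff).
m_filtered <- m[na_frac <= 0.5, , drop = FALSE]
m_filtered
```

The filtered data frame, not the original, would then be passed to the imputation function.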
When I read the code, I located the problem: if the data has fewer than 3 columns, an intermediate object gets downgraded to something that is not a data frame, so the operations that rely on data-frame structure all fail. I think the author should handle this case.
And yes, the earlier answer found the same thing by trial: different road, same answer.
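The mechanism described above can be illustrated in base R without the package. This is a sketch of the general drop behaviour, not a copy of the DMwR source: removing one of two columns with the default drop = TRUE leaves a plain vector, whose ncol() is NULL, and rep() rejects that as a times argument.

```r
two_cols <- data.frame(X = 1:10, Y = c(1, 1, 0, 0, 0, 0, NA, NA, 1, 1))

# Removing one of two columns with the default drop = TRUE leaves a plain vector.
as_vector <- two_cols[, -2]
is.null(ncol(as_vector))   # TRUE: a vector has no columns

# rep() then receives NULL as `times`; capture the resulting error message.
msg <- tryCatch(rep(1, ncol(as_vector)), error = function(e) conditionMessage(e))
msg

# With drop = FALSE the result stays a one-column data frame.
kept <- two_cols[, -2, drop = FALSE]
rep(1, ncol(kept))
```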
