R:as.numeric using grep() - r

quick question.
I have a table (Data) with a series of columns having similar colnames
colnames(Data):
[Sum_Flag_x_30,Sum_Flag_y_60,...Sum_Flag_z_n].
Ideally, I want to write some simple code to convert all of these columns to numeric (as.numeric).
Tried with:
Data[,grep("Sum_Flag",colnames(Data),value=T)] <- as.numeric(Data[,grep("Sum_Flag",colnames(Data),value=T)])
but it's not working and I get the following error:
Error in [<-.data.table(*tmp*, , grep("Sum_Flag", colnames(Data),
: Supplied 25 items to be assigned to 55057 items of column
'Sum_Flag_x_30'. If you wish to 'recycle' the RHS please use rep() to
make this intent clear to readers of your code.
Any clue about this?
Thanks guys
Ciao

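One option is apply, which calls as.numeric() on every column. Be aware that apply coerces its input to a matrix, so new_data is a matrix rather than a data.table, and every column is converted, not only the ones matching "Sum_Flag":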
# margin = 2 means that you're applying a function column-wise.
new_data <- apply(Data, MARGIN = 2, as.numeric)
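If you want to keep the data.table and convert only the matching columns, a minimal sketch could look like the one below. The error message suggests the right-hand side produced 25 values (possibly one per matching column name) instead of full columns, so here the columns are passed explicitly through .SDcols. The table contents are made up for illustration; only the "Sum_Flag" naming pattern comes from the question.

```r
library(data.table)

# Hypothetical stand-in for the asker's table; names and values are made up
Data <- data.table(
  id            = c("a", "b", "c"),
  Sum_Flag_x_30 = c("1", "0", "1"),
  Sum_Flag_y_60 = c("2.5", "3", "4")
)

# Names of the columns to convert
cols <- grep("Sum_Flag", colnames(Data), value = TRUE)

# Convert only those columns, by reference; .SD contains just the .SDcols columns
Data[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

str(Data)
```

The parentheses around cols on the left of := tell data.table to use the names stored in cols rather than create a column literally called "cols".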

Related

Error ifelse statement, fread and within function: If you wish to 'recycle' the RHS please use rep()

I'm trying to do the following ifelse statement. My purpose is to create a new column in which I can identify individuals that have the same value in four columns.
example<-data.frame(replicate(4,sample(1:2,20,rep=TRUE)))
within(example,example$X <- ifelse(example$X1!=example$X2,NA,
ifelse(example$X2!=example$X3,NA,
ifelse(example$X3!=example$X4,NA,example$X1))))
In my case, the four columns are years. I want to identify panel individuals. With my data the code is the following:
within(educ_1,educ_1$X <- ifelse(educ_1$N2014!=educ_1$N2015,NA,
ifelse(educ_1$N2015!=educ_1$N2016,NA,
ifelse(educ_1$N2016!=educ_1$N2017,NA,educ_1$N2014))))
However, I'm getting the following error:
Error in [<-.data.table(*tmp*, , nl, value = list(educ_1 = list(PER_ID = c(9.95048326217056e-313, :
Supplied 67 items to be assigned to 14191305 items of column 'educ_1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
One thing to note is that I used the fread function to import the data: it has about 14 million observations, so I thought that would be more efficient.
Any thoughts or suggestions about the syntax or the error? What should I do? Why does it work with the simple example but not with my own data?
[Image: output of example]
Using data.table syntax you could do:
setDT(example) # Shouldn't be necessary if you used fread() to import the data
example[,
        X := uniqueN(unlist(.SD)) == 1,
        by = 1:nrow(example),
        .SDcols = patterns("X[0-9]")]
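On the within() call itself: inside within(), columns are meant to be referred to by their bare names. Writing educ_1$X <- ... there seems to build a modified copy of the whole table inside within()'s environment, which is then assigned back as a single column named educ_1; that would be consistent with the "column 'educ_1'" in the error, though I can't confirm it without the data. A minimal base R sketch of the intended logic, using the question's toy example (set.seed is my addition, for reproducibility):

```r
set.seed(1)  # assumption: added only so the toy data is reproducible
example <- data.frame(replicate(4, sample(1:2, 20, replace = TRUE)))

# Inside within(), use bare column names and assign to a bare name
example <- within(example, {
  X <- ifelse(X1 == X2 & X2 == X3 & X3 == X4, X1, NA)
})

head(example)
```

Unlike the data.table answer above, which yields TRUE/FALSE, this keeps the shared value where all four columns agree and NA elsewhere, as in the original nested ifelse.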

How to assign an edited dataset to a new variable in R?

The title might be misleading but I have the scenario here:
half_paper <- lapply(data_set[,-1], function(x) x[x==0]<-0.5)
This line is supposed to substitute 0 for 0.5 in all of the columns except the first one.
Then I want to pass half_paper to the following line, which should rank all of the columns except the first one:
prestige_paper <-apply(half_paper[,-1],2,rank)
But I get an error and I think that I need to somehow make half_paper into a data set like data_set.
Thanks for all of your help
Your main issue ("This line is supposed to substitute 0 for 0.5 in all of the columns except the first one") can be fixed by adding another line to your anonymous function. An assignment with <- returns the value on its right-hand side, so your lapply was returning 0.5 for each column. Adding a final line that returns the modified vector fixes this.
It's also worth noting that lapply returns a list. I used apply instead of lapply here for consistency with the ranking step; apply returns a matrix, which is why the first column is bound back on with cbind at the end. plyr::ddply may suit this specific need better.
half_mtcars <- apply(mtcars[, -1], 2, function(x) { x[x == 0] <- 0.5; return(x) })
prestige_mtcars_tail <- apply(half_mtcars, 2, rank)
prestige_mtcars <- cbind(mtcars[, 1, drop = FALSE], prestige_mtcars_tail)
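If you would rather stay with lapply and keep a data frame throughout, one possible sketch is to assign the list that lapply returns back into the selected columns. It uses mtcars as a stand-in for data_set, as above; the names half and ranked are made up for illustration.

```r
# Replace 0 with 0.5 in every column except the first, keeping a data frame
half <- mtcars
half[-1] <- lapply(mtcars[-1], function(x) {
  x[x == 0] <- 0.5
  x  # return the modified vector, not the value of the assignment
})

# Rank every column except the first
ranked <- half
ranked[-1] <- lapply(half[-1], rank)

str(ranked)
```

Assigning into half[-1] leaves the first column untouched, so no cbind step is needed.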

Dynamically call dataframe column & conditional replacement in R

First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. The reason is a Unicode difference between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly: naming variables on the fly
as well as how to assign values to dynamic variables
and have explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However I'm unsure as to the execution as well as the significance of the best practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
It checks all columns for the wrong symbol and replaces it. Note that apply returns a character matrix, so newDf is no longer a data frame.
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
sapply(cols, function(col) {
demoDf[,col] <<- gsub(pattern="’", replacement = "'", x = demoDf[,col])
})
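Since the question is about factor columns, another possible approach is to rewrite the factor levels directly for the columns matched by grep, which keeps the data frame and the factor class. This is only a sketch: the column names INN8a, INN8b and other, and the data, are made up, and "\u2019" is the escape for the right single quotation mark.

```r
# Hypothetical demo data; only the "INN8" prefix comes from the question
demoDf <- data.frame(
  INN8a = factor(c("I definitely wouldn\u2019t", "I might")),
  INN8b = factor(c("I might", "I definitely wouldn\u2019t")),
  other = c("left\u2019alone", "x"),
  stringsAsFactors = FALSE
)

# Indices of the columns whose names start with "INN8"
fixCols <- grep("^INN8", colnames(demoDf))

# Replace U+2019 with a plain apostrophe in the levels of those columns only
demoDf[fixCols] <- lapply(demoDf[fixCols], function(col) {
  levels(col) <- gsub("\u2019", "'", levels(col), fixed = TRUE)
  col
})

str(demoDf)
```

Working on levels() touches each distinct label once, and it avoids the global assignment with <<-.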

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. Here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind, but I cannot figure out what it is. I have tried googling this and troubleshooting with similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I will hint at it. Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check the classes of the columns in mdata and ranklist. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values were coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
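To illustrate the preallocation advice without touching the assignment itself, here is a minimal sketch on made-up toy data. The vectors states and best are hypothetical stand-ins for whatever the loop computes.

```r
# Hypothetical toy inputs; not the assignment data
states <- c("AK", "AL", "AZ")
best   <- c("Hospital A", "Hospital B", "Hospital C")

# Preallocate with the final number of rows and character columns.
# stringsAsFactors = FALSE is the default since R 4.0.0; it is set
# explicitly here so the sketch behaves the same on older versions.
ranklist <- data.frame(
  hospital = character(length(states)),
  state    = character(length(states)),
  stringsAsFactors = FALSE
)

# Fill row i in place instead of growing the data frame with rbind
for (i in seq_along(states)) {
  ranklist[i, "hospital"] <- best[i]
  ranklist[i, "state"]    <- states[i]
}

str(ranklist)
```

Because both columns are character rather than factor, assigning new strings cannot trigger the "invalid factor level" warning.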

gsub apply combination in R

I am trying to use gsub on every column of a dataframe to remove some characters. I have tried using apply to do this, without success:
data<-apply(data,2, function(x) gsub("£","",data[x]))
returns error
Error in `[.data.frame`(data, x) : undefined columns selected
I know it works if I do
for(i in 1: length(data)){data[,i]<-gsub("£","",data[,i]) }
But why doesn't the apply call work?
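Your apply call fails because x inside the function is already the column's values, not a column index, so data[x] tries to select columns named after those values. Here's the next best thing to a reproducible example. There might be a better / faster (vectorized) way if I thought a little harder, but since you asked for apply:
# as.character() is only there so that "." can be replaced with ","
# in mtcars, which was just the first dataset that came to mind;
# it should not be necessary for your data
ds <- mtcars
ds[] <- sapply(mtcars, function(x) gsub("\\.", ",", as.character(x)))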
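For the pound-sign case from the question, a minimal sketch that keeps the result a data frame could use lapply, with the function operating directly on x. The toy data frame is made up for illustration, and "\u00a3" is the escape for the pound sign, used so the code does not depend on the file encoding.

```r
# Hypothetical toy data containing pound signs
data <- data.frame(
  price = c("\u00a310", "\u00a32.50"),
  fee   = c("\u00a31", "none"),
  stringsAsFactors = FALSE
)

# x is the column itself, so gsub is applied to x, not to data[x]
data[] <- lapply(data, function(x) gsub("\u00a3", "", x, fixed = TRUE))

str(data)
```

Assigning into data[] keeps the data frame's shape and column names, whereas plain apply would return a character matrix.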
