In Excel you can create a new_column and define it to have the value 0 if old_column is blank and 1 if old_column is not blank:
new_column = IF(ISBLANK([@[old_column]]), 0, 1)
Can anyone think of an efficient way to do this for a data frame in R? Say a column in the data frame is called old_column; I want to add a new_column to the data frame with the behaviour described above.
I tried this:
mydf$old_column[is.na(mydf$old_column)] <- 0
mydf$old_column[!is.na(mydf$old_column)] <- 1
but it gives me this error:
invalid factor level, NA generated
mydf$new_column <- as.integer(!is.na(mydf$old_column))
You may not even need the as.integer(), since !is.na() returns a logical vector and R treats TRUE and FALSE as 1 and 0 respectively.
If you want to follow the same scheme as in Excel, then you are looking for ifelse:
mydf$new_column <- ifelse(is.na(mydf$old_column),0,1)
Also, note that in your original code you assign the values back to old_column itself, so the second command will not find any NAs and will assign 1 everywhere.
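For illustration, here is a minimal sketch on a made-up data frame (mydf and old_column are just the placeholder names from the question):
# toy data with one missing value
mydf <- data.frame(old_column = c(10, NA, 25))
# 0 where old_column is NA, 1 otherwise
mydf$new_column <- ifelse(is.na(mydf$old_column), 0, 1)
# equivalent shorter form
mydf$new_column <- as.integer(!is.na(mydf$old_column))
mydf
#   old_column new_column
# 1         10          1
# 2         NA          0
# 3         25          1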
The factor problem (guessing here) is probably related to how you loaded the data from your csv or xls file into R. Many import functions have a stringsAsFactors parameter that you may want to set to FALSE, otherwise you can run into this type of error. Provide the code you used to read the data and we can help you further. Example:
mydf <- read.csv("myfile.csv", stringsAsFactors = FALSE)
I want to have "Blancas" and "Sultana" under the "Variete" column.
Why, after I use subset, is the filtered data less than it should be?
Figure 1 is the original data,
figure 2 is the expected result,
and figure 3 is the result I obtained with the code below:
df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen=200)
BLANCAS<-subset(df, Variete==c("Blancas","Sultana"))
view(BLANCAS)
It's obvious that some data of BLANCAS are missing.
P.S. If I try it on a sub-sheet, the final result is sometimes 5 times bigger!
path = "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
~ read_xlsx(path, sheet = 4))
I don't understand why sometimes it's more and sometimes less than the expected result. Can anyone help me? Thank you so much!
First of all, while you mention that you need both "Blancas" and "Sultana", your expected result shows only Blancas! So get that straight first.
For such data coming from Excel:
Always clean the data after it is imported. Check the unique values to find whether there are any extra spaces etc.
Trim the character data, and make sure date fields are really dates and numbers are numeric (not characters).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana'))
-> you can modify the c() vector to include the items of interest.
-> if you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
As for your sub-sheet problem: we don't even know what data is in there. If it is similar, apply the same logic.
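To see why == loses rows while %in% keeps them, here is a minimal sketch with invented data (your real column is Variete in df):
# toy data in which the two varieties do not alternate
df <- data.frame(Variete = c("Blancas", "Blancas", "Sultana", "Sultana"))
# == recycles c("Blancas","Sultana") across the rows, so each row is only
# compared against one of the two values; rows 2 and 3 are silently dropped
subset(df, Variete == c("Blancas", "Sultana"))
#   Variete
# 1 Blancas
# 4 Sultana
# %in% tests membership for every row, which is what is wanted here
subset(df, Variete %in% c("Blancas", "Sultana"))
#   Variete
# 1 Blancas
# 2 Blancas
# 3 Sultana
# 4 Sultana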
I've been reading data from an xlsx file. My read code starts off like this:
ecommerce<-read.xlsx("C:\\Users\\Thomas Rhee\\Documents\\GGU\\GGU Fall 2018\\Tools for Business Analytics\\Final Project\\ecommerce.xlsx", sheet = "data", startRow = 1, colNames = TRUE, col = c(1,2,3,4,5,6,7,8));
attach(ecommerce)
names(ecommerce)
One of the columns is "price". It looks like this:
price
<chr>
329.98
324.83999999999997
324.83
350
308
310
I used sapply to find out that my 'price' column's class is character. I used the following code to convert it to numeric:
ecommerce$price <- as.numeric(as.character(ecommerce$price))
I checked again and it worked. Then I typed the following and got this output:
cor(rank, price)
Error in cor(rank, price) : 'y' must be numeric
I'm lost. I'm also a beginner at this, so I'm open to suggestions here. Please dumb it down for me.
This is a good example of why you should not use attach.
d <- data.frame(x = 1:3)
attach(d)
x ## now available because of attach
# [1] 1 2 3
d$x <- LETTERS[1:3]
x ## however this still refers to the original values of d$x
# [1] 1 2 3
d$x
# [1] "A" "B" "C"
That means you changed the data in the data frame, but your cor(.) call still references the original values (the copy that was attached).
So to solve your issue, drop the attach command and specify the columns directly (after you have converted them to numeric):
cor(ecommerce$rank, ecommerce$price)
Technically, you could re-attach ecommerce after changing it, but because of these issues I would strongly discourage you from using attach at all.
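A minimal attach-free version of the whole workflow, assuming the openxlsx package and keeping the sheet and column names from the question (the file path is shortened here, and rank is converted too in case it was also read as text):
library(openxlsx)
ecommerce <- read.xlsx("ecommerce.xlsx", sheet = "data", colNames = TRUE)
# convert the relevant columns on the data frame itself, not on attached copies
ecommerce$price <- as.numeric(ecommerce$price)
ecommerce$rank  <- as.numeric(ecommerce$rank)
# reference the data frame columns directly
cor(ecommerce$rank, ecommerce$price)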
I was on the post "read.csv and skip last column in R" but did not find my answer, and tried to check directly in an answer... but that's not the right way (thanks mjuarez for taking the time to get me back on track).
The original question was:
I have read several other posts about how to import csv files with
read.csv but skipping specific columns. However, all the examples I
have found had very few columns, and so it was easy to do something
like:
columnHeaders <- c("column1", "column2", "column_to_skip")
columnClasses <- c("numeric", "numeric", "NULL")
data <- read.csv(fileCSV, header = FALSE, sep = ",",
                 col.names = columnHeaders, colClasses = columnClasses)
All the answers were good, but do not work for what I intended to do. So I asked myself and others:
Could something like data <- read_csv(fileCSV)[,(ncol(data)-1)] work in a single call?
I've tried, in one line of R, to load into data all of the first 5 of 6 columns, i.e. everything but the last one. To do so, I would like to use "-" with the column number. Do you think that's possible? How can I do that?
Thanks!
In base R it has to be a 2-step operation. Example:
> data <- read.csv("test12.csv")
> data
# 3 columns are returned
a b c
1 1/02/2015 1 3
2 2/03/2015 2 4
# last column is excluded
> data[,-ncol(data)]
a b
1 1/02/2015 1
2 2/03/2015 2
One cannot write data <- read.csv("test12.csv")[,-ncol(data)] in base R.
But if you know the maximum number of columns in your csv (say 3 in my case) then one can write:
df <- read.csv("test12.csv")[,-3]
df
a b
1 1/02/2015 1
2 2/03/2015 2
The right hand side of an assignment is processed first so this line from the question:
data <- read.csv(fileCSV)[,(ncol(data)-1)]
is trying to use data before it is defined. Also note that the above asks for only the second-to-last field. To get all but the last field:
data <- read.csv(fileCSV)
data <- data[-ncol(data)]
If you know the name of the last field, say it is lastField, then this works; unlike the code above it does not read the whole file and then remove the last field, but rather reads only the fields other than the last. It is also only one line of code.
read.csv(fileCSV, colClasses = c(lastField = "NULL"))
If you don't know the name of the last field but you do know how many fields there are, say n, then either of these would work:
read.csv(fileCSV)[-n]
read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))
Another way to do it without first reading in the last field is to read just the header and first line to determine the number of fields (assuming that all records have the same number) and then re-read the file using that,
n <- ncol(read.csv(fileCSV, nrows = 1))
making use of one of the prior two statements involving n.
It's not possible in one line because the data variable is not yet initialized when you reference it, so ncol(data) will trigger an error.
You need two lines of code: first load your data into the data variable, then remove the last column with either data[,-ncol(data)] or data[,1:(ncol(data)-1)], as in the short sketch below.
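A minimal sketch of those two steps (fileCSV stands for your file path, as in the question):
data <- read.csv(fileCSV)       # read all columns first
data <- data[, -ncol(data)]     # then drop the last one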
Not a single function, but at least a single line, using dplyr (disclaimer: I never use dplyr or magrittr, so a more optimized solution may exist using these libraries):
library(dplyr)
dat = read.table(fileCSV) %>% select(., which(names(.) != names(.)[ncol(.)]))
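As a side note, recent dplyr/tidyselect versions ship a last_col() helper; if your version has it, the same idea becomes shorter (a sketch, not tested against the question's file):
library(dplyr)
dat <- read.csv(fileCSV) %>% select(-last_col())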
I am rather new to R and struggling at the moment with a specific issue. I need to iterate over a data frame with 1 variable returned from a SQL database, so that I can ultimately issue additional SQL queries using the information in that variable. I need help understanding how to do this.
Here is what I have
> dt
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When I try to access it using dt[1], it prints the entire list just as above:
> dt[1]
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When I try to access it using dt[1,], it brings back additional unwanted information:
> a<-dt[1,]
> a
[1] 5D2D3F03-286E-4643-8F5B-10565608E5F8
5 Levels: 4EB4D553-C680-4576-A854-54ED817226B0 ... 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
I need to isolate just the '5D2D3F03-286E-4643-8F5B-10565608E5F8' value and not the '5 Levels: ...' part.
I am sure this is simple, I just can't find it. Any help is appreciated!
thanks!
There are two issues you need to address. One is that you want character data, not a factor variable (a factor is essentially a category variable). The other is that you want a simple vector of the values, not a data.frame.
1) To get the first column as a vector, use double-brackets or the $ notation:
a <- dt[[1]]
a <- dt[['Col']]
a <- dt$Col
Your notation dt[1,] does return a vector too (in your case a length-one factor, i.e. the first row), relying on the somewhat obscure fact that the [ method for data.frame objects will silently "drop" its result to a vector when using the two-index form dt[i,j], but not when using the one-index form dt[i]:
When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list. In this usage a drop argument is ignored, with a warning.
Think of "dropping" like unboxing the data - instead of getting a data.frame with a single column, you're just getting the column data itself.
2) To convert to character data, use one of the suggestions in the comments from #akrun or #Vlo:
a <- as.character(dt[[1]])
a <- as.character(dt[['Col']])
a <- as.character(dt$Col)
or use the API of whatever you are using to make the SQL query, or to read in the results of the query, so that it does not convert the strings to factors in the first place.
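If the end goal is to loop over those values and issue further queries, a rough sketch could look like this (the table and column names in the query are invented for illustration; substitute whatever function your database package uses to execute statements):
ids <- as.character(dt$Col)
for (id in ids) {
  query <- sprintf("SELECT * FROM some_table WHERE parent_id = '%s'", id)
  # run 'query' with your database connector here, e.g. DBI::dbGetQuery(con, query)
}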
I have the following block of code. I am a complete beginner in R (a few days in), so I am not sure how much of the code I need to share to explain my problem; here is all of it.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what. I have tried googling this, and also tried troubleshooting using other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution, but I will hint at it: have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is its default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign, e.g., the i-th hospital to the i-th row and first column of the data frame; that way rbind is not necessary. A generic sketch of that pattern follows below.
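Deliberately generic (the names and values are invented so as not to solve the assignment), the preallocation pattern looks like this:
n <- 5
result <- data.frame(hospital = character(n),
                     state    = character(n),
                     stringsAsFactors = FALSE)
for (i in seq_len(n)) {
  # fill the i-th row instead of rbind-ing inside the loop
  result$hospital[i] <- paste0("hospital_", i)
  result$state[i]    <- paste0("state_", i)
}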
By the way, you did not get an error but a warning: R didn't interrupt the loop, it just let you know that some values had been coerced to NA. You can also simplify the seq_len(length(allstate)) statement by using seq_along(allstate) instead.