I use RStudio and imported a CSV file from an online source.
data <- read.csv("http://databank.worldbank.org/data/download/GDP.csv", stringsAsFactors = FALSE)
In the file, column X.3 is of type character.
I want to convert rows 5 to 202 from character to numeric so that I can calculate their mean.
But when I use the line below, the column still remains character:
data[c(5:202),"X.3"] <- as.numeric(gsub(",","",data[c(5:202),"X.3"]))
When I type class(data[10,"X.3"]), the output is character.
I am able to convert the whole column to numeric using
data[,"X.3"] <- as.numeric(gsub(",","",data[,"X.3"]))
but I want to convert only specific rows, i.e. rows 5 to 202, because the other rows of the column become NA. I am not sure how to do it.
The following changes to your code will make the column numeric:
data <- read.csv("http://databank.worldbank.org/data/download/GDP.csv", header = TRUE, stringsAsFactors = FALSE, skip = 3)
# skip the first 3 rows, which are just empty space/junk, and treat the next row as the header
data <- data[-1,]
# remove the first line after the header
data$US.dollars. <- as.numeric(gsub(',','',data$US.dollars.))
# remove the thousands-separator commas so the character values can be converted to numeric
hist(data$US.dollars.) #sample plot
As mentioned in the comments, you cannot keep part of a column as character and part as numeric. Every element of a vector must have the same type, so R coerces the values to the more general type, which in this case means numeric to character. You can read more about this under implicit coercion in R.
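To make the coercion concrete, here is a minimal sketch on a made-up data frame (the values and the four-row layout are assumptions for illustration, not the actual World Bank file). It shows that assigning numbers into part of a character column leaves the column as character, and that the mean can still be computed from a separate numeric vector:

```r
# Toy stand-in for the imported file; the values are invented
data <- data.frame(X.3 = c("header", "1,234", "5,678", "note"),
                   stringsAsFactors = FALSE)

# Assigning numeric values into rows 2:3 coerces them back to character
data[2:3, "X.3"] <- as.numeric(gsub(",", "", data[2:3, "X.3"]))
col_class <- class(data$X.3)   # the column is still character

# Instead, keep the converted values in their own numeric vector
vals <- as.numeric(gsub(",", "", data[2:3, "X.3"]))
row_mean <- mean(vals)
```

Keeping the converted values in a separate vector (or a new column) avoids touching the non-numeric rows at all.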
I imported a huge dataset with a lot of missing values, coded as N/A or NA.
This is how I import the data:
Databsp<-read.csv("C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv", na.strings=TRUE)
Next, I converted all the NAs and N/As to NA using the following code:
a <- Databsp %>% replace_with_na_all(condition = ~.x %in% common_na_strings)
Now my question: why are columns that contain only numbers and NAs of class "character" and not "integer" or "numeric"? I tried several approaches, but nothing seems to help.
You don't change your column classes. When you import your data, the column classes are first set, and you do nothing to change them. If a column in your CSV file has only numeric and NA values when you import it, it will be numeric. But if it has strings (including strings that you haven't yet told R are NA-equivalent, like "N/A") then read.csv must read them as character class because they are not numeric. Later, you replace those NA-equivalent values with actual NAs, but that replaces values only, it does not change the class of the columns.
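The bad solution would be to patch this: add an extra step after you replace the NA values, using the type.convert() function to re-assess the columns and convert them as necessary, a <- type.convert(a, as.is = TRUE). (Setting as.is = TRUE keeps any remaining character columns as character instead of turning them into factors; recent versions of R warn if as.is is not specified.)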
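As a rough sketch of that patch, assuming a small made-up data frame in place of the real dataset (the column names are hypothetical), the conversion step looks like this:

```r
# Toy data: numbers stored as character, with the NA strings already replaced by NA
a <- data.frame(score = c("10", NA, "30"),
                label = c("x", "y", NA),
                stringsAsFactors = FALSE)

# Re-assess every column; as.is = TRUE leaves true character columns as character
a <- type.convert(a, as.is = TRUE)

score_class <- class(a$score)
label_class <- class(a$label)
```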
The better solution is to give read.csv your list of NA-equivalent strings when you read in the data. This is what the na.strings argument is for. From ?read.csv:
na.strings
a character vector of strings which are to be interpreted as NA values.
So change your import line to
Databsp <- read.csv(
"C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv",
na.strings = common_na_strings
)
And then the columns should be classed appropriately when you read them in, and you can skip the replace_with_na_all step as it is already taken care of. Relatedly, your current na.strings = TRUE does nothing because TRUE is not a character vector.
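Here is a small self-contained sketch of the difference, using inline CSV text instead of the real file. The vector na_values is a hypothetical stand-in for common_na_strings, and the data are made up:

```r
# Inline CSV standing in for the real file
csv_text <- "id,score\n1,10\n2,N/A\n3,30"

# Without na.strings, "N/A" forces the whole column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
raw_class <- class(raw$score)

# With na.strings, "N/A" is read as NA and the column stays numeric
na_values <- c("NA", "N/A")   # hypothetical list of NA-equivalent strings
clean <- read.csv(text = csv_text, na.strings = na_values)
clean_class <- class(clean$score)
```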
I have a CSV file with headers in the form:
a,b,c,d
1,6,5,6,8
df <- read_csv("test.csv")
For some reason the value 1 in the example is incorrect. To correct the file, I'd like to drop the 1 and shift all the other values to the left while preserving the column names, ending up with:
a,b,c,d
6,5,6,8
How can I achieve that ?
What about this:
headers <- names(df)
new_df <- df[, 2:length(df)]
names(new_df) <- headers
In one line of code, the structure() function creates an object and assigns attributes:
structure(df[,2:length(df)], names = names(df)[1:(length(df)-1)])
Recognizing that a data.frame is a list of equal-length vectors, where each vector represents a column, the following will also work:
structure(df[2:length(df)], names = names(df)[1:(length(df)-1)])
Note that there is no comma in df[2:length(df)].
Also, I like the trick of removing items from a vector or list using a negative index. So I think an even cleaner bit of code is:
structure(df[-1], names = names(df)[-length(df)])
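For a quick check of the idea, here is a sketch on a toy data frame built by hand rather than read from test.csv. The name "extra" for the fifth column is an assumption, since the real file has no header for it, and setNames() is used as a base R equivalent of the structure() call:

```r
# Toy data frame: five values under four real headers plus a hypothetical fifth name
df <- data.frame(a = 1, b = 6, c = 5, d = 6, extra = 8)

# Drop the first column, then reuse the first four names for what remains
new_df <- setNames(df[-1], names(df)[-ncol(df)])

shifted_names <- names(new_df)
shifted_values <- unname(unlist(new_df[1, ]))
```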
I have a CSV with 2 columns, but it should have 7. The first column is a numerical ID. The second column has the other six numerical values. However, there are several different delimiters between them. They all follow the same pattern: numerical value, a dash ("-") OR a colon (":"), eight spaces, and then the next numerical value, until the final numerical value, with nothing after it. It starts with a dash and alternates with a colon. For example:
28.3- 7.1: 62.3- 1.8: 0.5- 196
Some of these cells have missing values denoted by a single period ("."). Example:
24- .: 58.2- .: .- 174
I'm using R but I can't figure out how to accomplish this. I know it probably requires dplyr or tidyverse but I can't find what to do where there are different delimiters and spaces.
So far, I've only successfully loaded the csv and used "str()" to determine that the column with these six values is a factor.
Here is how the data look in the .csv:
Here is how it looks in RStudio after I read it in using read.csv
Here is how it looks in RStudio if I use tab as the delimiter when using read.csv, as suggested in the comments
If that first column is the only problem, I would try to sort it out as follows:
CBC_delim <- read.table('CBC.csv', sep="\t", header=FALSE)
head(CBC_delim)
Then split that first column into two, keeping both elements:
library(dplyr)
CBC_delim <- CBC_delim %>%
  # your column names will be different, maybe just V1
  mutate(column1 = as.character(column1)) %>%
  mutate(col2 = sapply(strsplit(column1, ","), `[`, 1),
         col3 = sapply(strsplit(column1, ","), `[`, 2))
That should leave you with some basic tidying up, such as deleting the original column1. You can check your column names using colnames(CBC_delim).
But also see:
how-to-read-data-with-different-separators
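For the six values themselves, one possible base R approach (not the method above, just a sketch) is to split on "a dash or colon followed by whitespace" with a regular expression and treat "." as missing. The two strings below are the examples from the question with the spacing assumed; note that this pattern would not cope with negative numbers:

```r
# Example cells from the question (spacing between values is assumed)
x <- c("28.3-        7.1:        62.3-        1.8:        0.5-        196",
       "24-        .:        58.2-        .:        .-        174")

# Split on a dash or colon followed by one or more whitespace characters
parts <- strsplit(x, "[-:][[:space:]]+")

# Turn "." into NA, convert to numeric, and bind the rows into a matrix
values <- t(sapply(parts, function(p) {
  p[p == "."] <- NA
  as.numeric(p)
}))
```

The resulting matrix has one row per cell and six columns, which can then be attached to the ID column with cbind().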
I imported an Excel file into R without headers.
df = read.xlsx('D:/hotel rates/Yoho 5th Lane Signature 2799850.xls', sheetIndex = 1, header = FALSE)
Then I set the column names to the values of the second row.
colnames(df) =df[2,]
But my column names appeared as numbers.
I actually want the second row as my column names. Can anyone fix this?
I think there is some kind of factor level in the data. When I run df[2,2], the console shows the following.
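If the columns are indeed factors, one possible fix is to convert the second row to character before using it as names. The sketch below uses a small made-up data frame in place of the Excel sheet, so the column contents are assumptions:

```r
# Toy stand-in for the imported sheet: every column is a factor
df <- data.frame(X1 = factor(c("Hotel report", "Date", "2020-01-01")),
                 X2 = factor(c("", "Rate", "100")))

# Convert each cell of row 2 to its label rather than its integer code
header_row <- unname(vapply(df[2, ], as.character, character(1)))
colnames(df) <- header_row

# Drop the two rows above the data and re-assess the column types
df <- df[-(1:2), ]
df[] <- lapply(df, function(col) type.convert(as.character(col), as.is = TRUE))
```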
I have a big data frame (22k rows, 400 columns) which is generated using read.csv from a csv file. It appears that every column is a factor and all the row values are the levels of this factor.
I now want to do some analysis (like PCA), but I can't work with the data unless it is a matrix. Even when I convert it with as.matrix, all I get is:
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new to R, so forgive all the (maybe terrible) mistakes.
Thanks
You can do it this way:
df<-data.frame(a=as.factor(c(1,2,3)), b=as.factor(c(2,3,4)))
m<-apply(apply(df, 1, as.character), 1, as.numeric)
apply applies a function over the rows or columns of the given data.frame. It is important not to skip the conversion to character first, because otherwise each factor is converted to its internal integer codes rather than to the values you see.
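The pitfall is easy to see on a tiny factor. The sketch below also shows an alternative to the nested apply call, using sapply over the columns (a swapped-in technique, not the code above), which keeps the column names:

```r
# A factor whose labels are numbers
f <- factor(c("10", "20", "30"))
codes <- as.numeric(f)                  # internal codes, not the labels
values <- as.numeric(as.character(f))   # the labels, converted to numbers

# Column-wise conversion of a toy data.frame of factors to a numeric matrix
df <- data.frame(a = factor(c(10, 20, 30)), b = factor(c(2, 3, 4)))
m <- sapply(df, function(col) as.numeric(as.character(col)))
```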
To add column names, do this:
m<-m[-1,] # only needed if the first row held the header text; it becomes an 'empty' (NA) row after conversion
colnames(m)<-c("a", "b") # replace the right-hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header=TRUE, the first row is used as the column names of the data.frame instead of being read as a data row, so the columns are not forced to become factors by the header text.
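A short sketch of that effect, using inline CSV text with made-up values in place of the real file (stringsAsFactors = TRUE is set explicitly to mimic the older default behavior described in the question):

```r
# Inline CSV standing in for the real file
csv_text <- "a,b\n1,2\n3,4\n5,7"

# header = FALSE: the header text becomes a data row, so every column is a factor
no_header <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)
all_factors <- all(sapply(no_header, is.factor))

# header = TRUE: the first row supplies the names and the columns stay numeric
with_header <- read.csv(text = csv_text, header = TRUE)
m <- as.matrix(with_header)
```

With the header handled at import time, as.matrix() yields a numeric matrix that functions such as prcomp() can accept.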