How to convert a dataframe from character to numeric - R

I know this question may be a duplicate, but I tried all the solutions in:
How to convert entire dataframe to numeric while preserving decimals?
https://statisticsglobe.com/convert-data-frame-column-to-numeric-in-r
and they didn't work.
I imported the Excel data manually from my computer:
File > Import data > Excel, and I set the type of the data as numeric.
I checked my data using
View(Old_data)
and it is indeed of type numeric.
head(Old_data)
QC_G.F9_01_4768 QC_G.F9_01_4765
M95T834 70027.02 69578.19
M97T834 95774.14 81479.30
M105T541 75686.39 68455.65
M109T834 72093.07 70942.65
M111T834_2 77502.98 77527.54
M114T834 68132.06 70296.73
M121T834 52233.05 56074.64
M125T834 44559.99 35831.79
M128T834 59257.48 59574.73
M135T834 105136.55 105274.98
But then I transposed the data (rows into columns and columns into rows) using R:
New_data <- as.data.frame(t(Old_data))
When I checked my new data using:
View(New_data)
I found that my columns are of type character, not numeric.
I tried to convert New_data to numeric:
New_data_B <- as.numeric(New_data)
I checked my data using
dim(New_data_B)
17 1091
Here's an example of my data:
New_data_B
#> Name MT95T843 MT95T756
#> 1 QC_G.F9_01_4768 70027.02132 95774.13597
#> 2 QC_G.F9_01_4765 69578.18634 81479.29575
#> 3 QC_G.F9_01_4762 69578.18634 87021.95427
#> 4 QC_G.F9_01_4759 68231.14338 95558.76738
#> 5 QC_G.F9_01_4756 64874.12936 96780.77245
#> 6 QC_G.F9_01_4753 63866.65780 91854.35304
#> 7 CtrF01R5_G.D1_01_4757 66954.38799 128861.36163
#> 8 CtrF01R4_G.D5_01_4763 97352.55229 101353.25927
#> 9 CtrF01R3_G.C8_01_4754 61311.78576 7603.60896
#> 10 CtrF01R2_G.D3_01_4760 85768.36117 109461.75445
#> 11 CtrF01R1_G.C9_01_4755 85302.81947 104253.84537
#> 12 BtiF01R5_G.D7_01_4766 61252.42545 115683.73755
#> 13 BtiF01R4_G.D6_01_4764 81873.96379 112164.14229
#> 14 BtiF01R3_G.D2_01_4758 84981.21914 0.00000
#> 15 BtiF01R2_G.D4_01_4761 36629.02462 124806.49101
#> 16 BtiF01R1_G.D8_01_4767 0.00000 109927.26425
#> 17 rt 13.90181 13.90586
I also converted my data to a CSV file and imported it:
Old_data <- as.data.frame(read.csv("data.csv" , sep="," , header=TRUE,stringsAsFactors=FALSE))
And also using:
#install.packages("readxl")
library("readxl")
Old_data <- read_excel("data.xlsx")
I tried the solution suggested by sveer:
New_data <- cbind(Name=Old_data[1,],as.data.frame(t(Old_data[-1,])))
It gives this result when I try
head(New_data)
or
View(New_data)
Name.QC_G.F9_01_4768 Name.QC_G.F9_01_4765
70027.02 69578.19
95774.14 81479.30
75686.39 68455.65
72093.07 70942.65
77502.98 77527.54
68132.06 70296.73
52233.05 56074.64
4559.99 35831.79
59257.48 59574.73
105136.55 105274.98
but it deletes the rownames!
I'm just confused by this problem; I think it is because I converted rows into columns and columns into rows.
Please ask if you need any clarification, and I can also send the data to someone so they can try.
Thank you very much

Reason why you get character type and not numeric:
Transposing the data with t() produces a matrix, and a matrix can hold only a single class; with mixed classes everything is coerced to character.
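A tiny illustration of that coercion (hypothetical toy data, just to show the effect):
old <- data.frame(Name = c("a", "b"), x = c(1, 2), stringsAsFactors = FALSE)
m <- t(old)
class(m)   # a matrix, not a data frame
typeof(m)  # "character" -- the numeric column x was coerced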
Solution:
I am still not sure about the structure of your data. It is always a good idea to add a reproducible example; if the data is large you could also use pastebin or just reproduce a small subset as described.
I assume that when you load the data via File > Import data > Excel, the first column is called "Name".
To get your desired output (especially rownames) you could try:
df <- setNames(as.data.frame(t(Old_data[, -1])), Old_data[[1]])
If you want to transform the rownames to a column:
tibble::rownames_to_column(df, "Name")
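If you end up with an all-character data frame anyway (e.g. because you transposed before dropping the Name column), a minimal sketch for coercing every column of df to numeric; this assumes every value really is a number stored as text, anything else becomes NA:
df[] <- lapply(df, as.numeric)   # coerce each column in place, keeping rownames
str(df)                          # all columns should now be numeric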

Related

R: read_csv reads numeric entries as logical - parsing col_logical instead of col_double

I am new to R.
I wrote code for an assignment that reads several CSV files, binds them into a data frame, and then, for a given set of IDs, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
pollutantmean <- function(pollutant, id = 1:332) {
  library(readr)   # for read_csv
  library(dplyr)   # for bind_rows and filter
  # creating a data frame from several files
  file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
  read_file_m <- lapply(file_m, read_csv)
  df_1 <- bind_rows(read_file_m)
  # delete NAs
  df_clean <- df_1[complete.cases(df_1), ]
  # select rows according to id
  df_asid_clean <- filter(df_clean, ID %in% id)
  # compute the mean of the chosen column
  mean_result <- mean(df_asid_clean[[pollutant]])
  mean_result
}
However, when read_csv is applied, the nitrate column of certain files is parsed as col_logical, although the column is numeric and the entries are numeric. It seems that the code "expects" a logical value, although the real value is not logical.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class after binding the rows by writing
df_1[, "nitrate"] <- as.numeric(as.character(df_1[, "nitrate"]))
but NAs are still introduced in the step which calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: I tried to insert read_csv(col_types = list...), but I get an error that the "files" argument is not defined. As I understand it, R evaluates the read_csv call first, then lapply, and because there is no "file" given at that point, it shows an error.
The problem with readr::read_csv() failing to parse the column types can be overcome by passing a col_types= argument through lapply(). We do this as follows:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  require(readr)
  require(dplyr)
  file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
  read_file_m <- lapply(file_m, read_csv,
                        col_types = list(col_date(), col_double(),
                                         col_double(), col_integer()))
  # rest of code goes here. Since I am a Community Mentor in the
  # JHU Data Science Specialization, I am not allowed to post
  # a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {
  read_csv(x, col_types = list(col_date(), col_double(),
                               col_double(), col_integer()))
})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stack Overflow question on pollutantmean().
With the newer readr release, read_csv() accepts a vector of file paths, so you don't need lapply; you can pass the paths directly to read_csv() as you have already defined them.
Regarding the column types, these can then be set manually in the col_types argument:
col_types = cols(Date = col_date(), sulfate = ...)
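For example, a minimal sketch of that approach, assuming a readr version where read_csv() accepts a vector of paths and the four columns named as in the question:
library(readr)
file_m <- list.files("specdata", pattern = "\\.csv$", full.names = TRUE)
df_1 <- read_csv(
  file_m,
  col_types = cols(
    Date    = col_date(format = ""),
    sulfate = col_double(),
    nitrate = col_double(),
    ID      = col_integer()
  )
)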

Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# because that information is important. What I want to do, however, is filter this list of images and return only sets (e.g. r01c02f10p01) which have all four ch values (ch1-ch4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
Applying this list with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove all the images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter files from a random assortment without complete ch values (ie: maybe only ch1 and ch2, or ch3 and ch4, or ch1, ch2, ch3, and ch4), to an assortment which only contains files which have a complete set (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, use regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that all elements of an expected set (I put this in ch_set, but there might be another way you need to do this) occur in each group's set of CH values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff", "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff", "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff", "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff", "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff", "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff", "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff", "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"),
                 regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked even without the mutate line converting ch to numeric (i.e. comparing character versions of those numbers with the numeric ch_set), because under the hood %in% coerces its arguments to matching types. That didn't seem totally safe if you need to scale this, so I converted ch so the types match explicitly.
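If the end goal is to physically delete the files that are not part of a complete set, a short follow-up sketch; the directory path here is hypothetical, adjust img_dir to wherever the images actually live:
img_dir <- "/Desktop/cp/complete"
all_files  <- list.files(img_dir)
incomplete <- setdiff(all_files, files_to_keep$filename)   # files whose group lacks ch1-ch4
file.remove(file.path(img_dir, incomplete))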

How to write rownames into a spreadsheet with the googlesheets package in R?

I would like to write a data frame to a Google spreadsheet with the googlesheets package, but the row names aren't written in the first column.
My data frame looks like this:
> str(stats)
'data.frame': 4 obs. of 2 variables:
$ Offensive: num 194.7 87 62.3 10.6
$ Defensive: num 396.28 51.87 19.55 9.19
> stats
Offensive Defensive
Annualized Return 194.784261 396.278385
Annualized Standard Deviation 87.04125 51.872826
Worst Drawdown 22.26618 9.546208
Annualized Sharpe Ratio (Rf=0%) 1.61126 0.9193734
I load the library as recommended in the documentation, create a spreadsheet and worksheet, then write the data with the gs_edit_cells command:
> install.packages("googlesheets")
> library("googlesheets")
> suppressPackageStartupMessages(library("dplyr"))
> mySpreadsheet <- gs_new("mySpreadsheet")
> mySpreadsheet <- mySpreadsheet %>% gs_ws_new("Stats")
> mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats, trim = TRUE)
Everything goes well but googlesheets doesn't create a column with the rownames. Only two columns are created with their data (Offensive and Defensive).
I have tried converting the data frame into a matrix, but the result is the same.
Any idea how I could achieve this?
Thank you
Doesn't look like there is a row names argument for gs_edit_cells(). If you just want the row names to show up in the first column of the sheet you could try:
stats$Rnames <- rownames(stats)                          ## add a column equal to the row names
stats <- stats[, c("Rnames", "Offensive", "Defensive")]  ## reorder so the names come first
# names(stats) <- c("", "Offensive", "Defensive")        ## optional: blank out the header of the names column
From here, just pass stats to the googlesheets functions as you did before.
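Alternatively, a minimal sketch of the same idea with tibble::rownames_to_column(), assuming the stats and mySpreadsheet objects from the question already exist:
library("googlesheets")
library("tibble")
library("dplyr")                                  # for %>%
stats_out <- rownames_to_column(stats, "Name")    # row names become a real column
mySpreadsheet <- mySpreadsheet %>%
  gs_edit_cells(ws = "Stats", input = stats_out, trim = TRUE)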

Reading large fixed format text file in r

I am trying to read a large (> 70 MB) fixed-format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:
> head(testcsv3)
Varlen Varname Varclass Varsep Varforfmt
1 2 "V1" "character" 2 "A2.0"
2 15 "V2" "character" 17 "A15.0"
3 28 "V3" "character" 45 "A28.0"
4 3 "V4" "character" 48 "F3.0"
5 1 "V5" "character" 49 "A1.0"
6 3 "V6" "character" 52 "A3.0"
At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is format the data as something other than factors when they come into R. I am hoping this will limit the amount of memory I am using and allow me to actually import the file. I would appreciate any suggestions about how I can do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren
Maybe this code works for you. You have to fill varlen with the field widths and supply the corresponding type strings (e.g. "numeric", "character", "integer") in colclasses.
my.readfwf <- function(filename, varlen, colclasses) {
  # start and end positions of each field
  sidx <- cumsum(c(1, varlen[1:(length(varlen) - 1)]))
  eidx <- sidx + varlen - 1
  # read the file as one character string per line
  filecontent <- scan(filename, character(0), sep = "\n")
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  nlines <- length(filecontent)
  res <- list()
  for (i in seq_along(varlen)) {
    # cut each line at the field boundaries and coerce to the requested class
    res[[i]] <- sapply(filecontent, substring, first = sidx[i], last = eidx[i])
    mode(res[[i]]) <- colclasses[i]
  }
  # assemble the list into a data.frame
  attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                          row.names = seq_along(res[[1]]),
                          class = "data.frame")
  return(res)
}
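As an alternative, readr::read_fwf() reads fixed-width files without creating factors and tends to be much lighter on memory than read.fwf(). A hedged sketch using the widths and names from testcsv3; the all-character col_types string is an assumption, adjust it per column:
library(readr)
condodattest1 <- read_fwf(
  impfile,
  col_positions = fwf_widths(testcsv3$Varlen, col_names = testcsv3$Varname),
  col_types     = strrep("c", length(testcsv3$Varlen))  # "c" = character for every column
)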

Subset by function's variable using $variable

I am having trouble subsetting a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function),
but hosp.estado <- sep$state doesn't work. I have also tried as.data.frame.
The call rankhospital("NY"....) returns character(0).
When I replace sep$state with sep$"NY" directly in the code it works perfectly, so I guess the problem is that I can't use a function's variable this way. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"
