Read data set from website in numeric format not character


With the code below I read data from a website.
The problem is that the data is read as character rather than numeric, in particular columns such as "Enlem(N)" and "Boylam(E)".
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)

If you know which specific columns should be numeric, you can convert just those columns. If you do not know, you can write a function that inspects the data and converts a column to numeric when a large enough share of its non-missing values can be parsed as numbers. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric =.95, PreserveDate=TRUE, PreserveColumns){
# find the counts of NA values in input data frame's columns
param_originalNA <- apply(x, 2, function(z){sum(is.na(z))})
# blindly coerce data.frame to numeric
df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
# Percent Non-NA values in each column
PercentNumeric <- (apply(df_JustNumbers, 2, function(x)sum(!is.na(x))))/(nrow(x)-param_originalNA)
rm(param_originalNA)
# identify columns which have a greater than or equal percentage of numeric as specified
param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
# Remove columns from list to convert to numeric that are specified as to preserve
if (!missing(PreserveColumns)){param_numeric <- setdiff(param_numeric, PreserveColumns)}
# Identify columns that are dates initially
IsDateColumns <- lapply(x, function(y)(is(y, "Date")|is(y, "POSIXct")))
param_dates <- names(IsDateColumns)[IsDateColumns==TRUE]
# Remove dates from list if specified to preserve dates
if (PreserveDate){param_numeric <- setdiff(param_numeric, param_dates)}
# returns column position of numeric columns in target data frame
param_numeric <- match(param_numeric, colnames(df_JustNumbers))
# removes NA's from column list
param_numeric <- param_numeric[complete.cases(param_numeric)]
# coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
if(length(param_numeric)==1){
suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
}
if(length(param_numeric)>1){
suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric],2, function(x)as.numeric(x)))
}
return(x)
}
Once the function is created, you can use it on your data, for example:
# Use function to convert to numeric
dat <- NumericColumns(dat)
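
For the first case mentioned in this answer, where the numeric columns are known in advance, a minimal sketch could look like the following. The data frame dat_demo and its values are made up for illustration; only the column names "Enlem(N)" and "Boylam(E)" come from the question. The names in the real scraped table may carry padding from the fixed-width read, so they might need trimws() first.

```r
# Toy stand-in for the scraped table (values are invented)
dat_demo <- data.frame(
  `Enlem(N)`  = c(" 38.1234", " 39.0010"),
  `Boylam(E)` = c(" 27.5000", " 28.2500"),
  Yer         = c("IZMIR", "MANISA"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns that are known to hold numbers
num_cols <- c("Enlem(N)", "Boylam(E)")

# Strip padding and convert only those columns
dat_demo[num_cols] <- lapply(dat_demo[num_cols], function(col) as.numeric(trimws(col)))

str(dat_demo)
```

The remaining columns are left untouched, so text columns such as Yer stay character.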

Related

convert a list of tab delimited strings into dataframe

I have a list of strings with tabs like below:
xx <- list("raw total sequences:\t67166250", "1st fragments:\t33583125")
yy <- list("raw total sequences:\t190999", "1st fragments:\t222")
I want "raw total sequences" and "1st fragments" as column names, the numeric values as column values, and xx and yy as row names. How can I do this efficiently?

You can combine the individual lists into one named list so that they are easier to work with. The return_df_from_list function uses two capture groups, one for the text before the colon (the column name) and one for the text after the tab (the value), and returns a one-row data frame.
We apply the function to each list and combine the results into one data frame using map_df.
library(dplyr)
list_data <- lst(xx, yy)
return_df_from_list <- function(x) {
value <- stringr::str_match(x, '(.*):\t(.*)')
setNames(data.frame(t(value[, 3])), value[, 2])
}
result <- purrr::map_df(list_data, return_df_from_list, .id = "rowname") %>%
tibble::column_to_rownames() %>%
type.convert(as.is = TRUE)
result
# raw total sequences 1st fragments
#xx 67166250 33583125
#yy 190999 222
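
The same idea can also be sketched in base R, which may be useful if the tidyverse packages are not available. This is an alternative to the approach above, not a replacement for it. parse_one and result_base are hypothetical names, and the sketch assumes every string contains exactly one colon-plus-tab separator followed by a numeric value.

```r
xx <- list("raw total sequences:\t67166250", "1st fragments:\t33583125")
yy <- list("raw total sequences:\t190999", "1st fragments:\t222")

# Split each "key:<tab>value" string and return a named numeric vector
parse_one <- function(x) {
  parts <- do.call(rbind, strsplit(unlist(x), ":\t", fixed = TRUE))
  setNames(as.numeric(parts[, 2]), parts[, 1])
}

# One row per input list; the list names become the row names
result_base <- do.call(rbind, lapply(list(xx = xx, yy = yy), parse_one))
result_base <- as.data.frame(result_base)
result_base
```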

calculating average of two column in another column

I am trying to calculate the average of two columns in another column but am getting an error.
I converted all the "N" strings to NA with na.strings = c("N"), but afterwards the class of the columns is still character.
df <- data.frame("T_1_1"= c(68,24,"N",105,58,"N",135,126,24),
"T_1_2"=c(26,105,"N",73,39,97,46,108,"N"),
"T_1_3"=c(93,32,73,103,149,"N",147,113,139),
"S_2_1"=c(69,67,94,"N",77,136,137,92,73),
"S_2_2"=c(87,67,47,120,85,122,"N",96,79),
"S_2_3"= c(150,"N",132,121,29,78,109,40,"N"),
"TS1_av"=c(68.5,45.5,94,105,67.5,136,136,109,48.5),
"TS2_av"=c(56.5,86,47,96.5,62,109.5,46,102,79),
"TS3_av"=c(121.5,32,102.5,112,89,78,128,76.5,139)
)
df$TS1_av <- rowMeans(df[,c(as.numeric(as.character("T_1_1","S_2_1")))], na.rm=TRUE)

You can use:
#Change 'N' to NA
df[df == 'N'] <- NA
#Change the type of columns
df <- type.convert(df, as.is = TRUE)
#Take mean of selected columns and add a new column
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df
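
As a possible shortcut, type.convert() accepts an na.strings argument, so the replacement and the conversion can be done in one call. The sketch below uses a small made-up data frame (df_demo) that borrows two column names from the question.

```r
# Small made-up example with "N" marking missing values
df_demo <- data.frame(
  T_1_1 = c("68", "24", "N"),
  S_2_1 = c("69", "N", "94"),
  stringsAsFactors = FALSE
)

# Treat "N" (and the usual "NA") as missing while converting column types
df_demo <- type.convert(df_demo, na.strings = c("N", "NA"), as.is = TRUE)

# Row-wise mean of the selected columns, ignoring missing values
df_demo$TS1_av <- rowMeans(df_demo[, c("T_1_1", "S_2_1")], na.rm = TRUE)
df_demo
```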

You could use readr::parse_number to extract the numbers and replace any string that can't be converted to numeric with NA.
The na argument lets you specify strings to be interpreted as NA (here 'N'). If you don't supply this argument, you get a warning for every string that couldn't be parsed, but it is still replaced by NA.
library(dplyr)
library(readr)
df <- df %>% mutate(across(where(is.character), ~ readr::parse_number(.x, na = 'N')))
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df

2 Base R solutions:
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 1: calculate the mean row-wise: TS1_av => numeric vector
df$TS1_av <- apply(df[,cols], 1, function(x){
mean(suppressWarnings(as.numeric(x)), na.rm = TRUE)
}
)
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 2: Coerce to numeric and calculate the row mean:
# TS1_av => numeric vector
df$TS1_av <- rowMeans(
suppressWarnings(
vapply(df[,cols], as.numeric, numeric(nrow(df)))
),
na.rm = TRUE
)

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!

You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the string values into factors (with the factor levels being the values), so values in later rows that are not among those levels become NA.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
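
Part of the problem in the original loop is that c(name, m, s) builds a character vector, so the numbers end up stored as strings. One way to keep the per-name structure while preserving types is to build a one-row data frame per element and bind them once at the end. This is only a sketch; rows and df_demo are illustrative names.

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# One single-row data frame per list element, each carrying its name
rows <- lapply(names(x), function(name) {
  tmp <- x[[name]]
  data.frame(name = name, m = mean(tmp), s = sum(tmp), stringsAsFactors = FALSE)
})

# Bind all rows in one step; m and s stay numeric
df_demo <- do.call(rbind, rows)
str(df_demo)
```

Binding once at the end also avoids growing the data frame inside the loop.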

Merging two data.frames by two columns each

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame which would be the combination of the two. As I always need the first two columns of the first data frame followed by the first two columns of the second data frame, I need help.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudoalgorithmically, cbind 2 columns from data frame new1 then cbind 2 columns from data frame new2 etc.
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.

See whether the below code helps
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1,length(dn1),2), # sequence with interval 2
~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>% # a vector of two columns each
unlist %>% # flatten the list
na.omit # remove NAs arising from odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
select(all_of(neworder))

It is not clear without a reproducible example. Based on the description, we can split the dataset columns into a list of datasets and use Map to cbind the columns of corresponding datasets, unlist and use that to order the third dataset
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
3) Map through the corresponding list elements, cbind and unlist and use that to subset the columns of 'df3'
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
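
If the goal is only to interleave the columns in pairs, one option is to compute the column order directly. The following is a sketch under the assumption that both data frames have the same number of columns. The objects left, right and interleaved and the column names a1..a4 and b1..b4 are invented so that the resulting order is easy to read.

```r
# Two toy data frames with the same number of columns
left  <- as.data.frame(matrix(letters[1:8], 2, 4), stringsAsFactors = FALSE)
right <- as.data.frame(matrix(1:8, 2, 4))
names(left)  <- paste0("a", 1:4)
names(right) <- paste0("b", 1:4)

combined <- cbind(left, right)
n <- ncol(left)

# Pair index for each column position: 1, 1, 2, 2, ...
block <- ceiling(seq_len(n) / 2)

# Sort by pair, then by source (left before right), then by original position
ord <- order(c(block, block), rep(1:2, each = n), c(seq_len(n), seq_len(n)))

interleaved <- combined[, ord]
names(interleaved)
```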

Transpose a data frame

I need to transpose a large data frame and so I used:
df.aree <- t(df.aree)
df.aree <- as.data.frame(df.aree)
This is what I obtain:
df.aree[c(1:5),c(1:5)]
10428 10760 12148 11865
name M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
My problem is the new column names (10428, 10760, 12148, 11865), which I need to remove because I need to use the first row as column names.
I tried the col.names() function but did not get what I need.
Do you have any suggestions?
EDIT
Thanks for your suggestion!!! Using it I obtain:
df.aree[c(1:5),c(1:5)]
M231T3 M961T5 M960T6 M231T19
GS04.A 5.847557e+03 0.000000e+00 3.165891e+04 2.119232e+04
GS16.A 5.248690e+04 4.047780e+03 3.763850e+04 1.187454e+04
GS20.A 5.370910e+03 9.518396e+03 3.552036e+04 1.497956e+04
GS40.A 3.640794e+03 1.084391e+04 4.651735e+04 4.120606e+04
GS44.A 1.225938e+04 2.681887e+03 1.154924e+04 4.202394e+04
Now I need to turn the row names (GS..) into a factor column.

You'd better not transpose the data.frame while the name column is in it - all numeric values will then be turned into strings!
Here's a solution that keeps numbers as numbers:
# first remember the names
n <- df.aree$name
# transpose all but the first column (name)
df.aree <- as.data.frame(t(df.aree[,-1]))
colnames(df.aree) <- n
df.aree$myfactor <- factor(row.names(df.aree))
str(df.aree) # Check the column types
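
The effect described above can be checked on a small example. The data frame df_demo below is made up, with values loosely based on the output in the question.

```r
# Made-up data: one character column plus numeric columns
df_demo <- data.frame(
  name   = c("M231T3", "M961T5"),
  GS04.A = c(5847.557, 0),
  GS16.A = c(52486.9, 4047.78),
  stringsAsFactors = FALSE
)

# Transposing everything: the character column forces a character matrix
bad <- as.data.frame(t(df_demo), stringsAsFactors = FALSE)
sapply(bad, class)

# Transposing only the numeric part keeps numbers as numbers
good <- as.data.frame(t(df_demo[, -1]))
colnames(good) <- df_demo$name
sapply(good, class)
```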

You can use the transpose function from the data.table library. Simple and fast solution that keeps numeric values as numeric.
library(data.table)
# get data
data("mtcars")
# transpose
t_mtcars <- transpose(mtcars)
# get row and colnames in order
colnames(t_mtcars) <- rownames(mtcars)
rownames(t_mtcars) <- colnames(mtcars)

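# transpose (every value becomes a string because of the name column)
df.aree <- as.data.frame(t(df.aree), stringsAsFactors = FALSE)
# use the first row as column names, then drop it
colnames(df.aree) <- as.character(unlist(df.aree[1, ]))
df.aree <- df.aree[-1, ]
# convert the remaining strings back to numbers
df.aree <- type.convert(df.aree, as.is = TRUE)
df.aree$myfactor <- factor(row.names(df.aree))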

Take advantage of as.matrix:
# keep the first column
names <- df.aree[,1]
# Transpose everything other than the first column
df.aree.T <- as.data.frame(as.matrix(t(df.aree[,-1])))
# Assign first column as the column names of the transposed dataframe
colnames(df.aree.T) <- names

With tidyr, one can transpose a dataframe with "pivot_longer" and then "pivot_wider".
To transpose the widely used mtcars dataset, you should first transform rownames to a column (the function rownames_to_column creates a new column, named "rowname").
library(tidyverse)
mtcars %>%
rownames_to_column() %>%
pivot_longer(!rowname, names_to = "col1", values_to = "col2") %>%
pivot_wider(names_from = "rowname", values_from = "col2")

You can assign the transposed matrix to a different name:
df.aree1 <- t(df.aree)
df.aree1 <- as.data.frame(df.aree1)
