convert a list of tab delimited strings into dataframe - r

I have a list of strings with tabs like below:
xx <- list("raw total sequences:\t67166250", "1st fragments:\t33583125")
yy <- list("raw total sequences:\t190999", "1st fragments:\t222")
I want to have "raw total sequences" and "1st fragments" as column names, the numeric values as column values, and xx and yy as row names. How can I do it efficiently?

You may create one named list combining the individual lists, which makes them easier to work with. The return_df_from_list function splits each string into two capture groups, the text before the colon (used as the column name) and the text after the tab (used as the value), and returns a one-row dataframe.
We apply the function to each list and combine the results into one dataframe using map_df.
library(dplyr)
library(tibble)  # column_to_rownames() comes from tibble, not dplyr
list_data <- lst(xx, yy)
return_df_from_list <- function(x) {
  # column 2 holds the text before the colon, column 3 the text after the tab
  value <- stringr::str_match(x, '(.*):\t(.*)')
  setNames(data.frame(t(value[, 3])), value[, 2])
}
result <- purrr::map_df(list_data, return_df_from_list, .id = "rowname") %>%
  column_to_rownames() %>%
  type.convert(as.is = TRUE)
result
#    raw total sequences 1st fragments
# xx            67166250      33583125
# yy              190999           222
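A base R alternative, offered only as a sketch: it assumes every string contains exactly one ":\t" separator and that all values are numeric. The names parse_stats and result_base are illustrative and not part of the answer above.

```r
xx <- list("raw total sequences:\t67166250", "1st fragments:\t33583125")
yy <- list("raw total sequences:\t190999", "1st fragments:\t222")

# Hypothetical helper: turn "name:\tvalue" strings into a named numeric vector
parse_stats <- function(x) {
  parts <- strsplit(unlist(x), ":\t", fixed = TRUE)
  values <- as.numeric(vapply(parts, `[`, character(1), 2))
  names(values) <- vapply(parts, `[`, character(1), 1)
  values
}

# rbind() the named vectors into a matrix (one row per list), then convert;
# as.data.frame() on a matrix keeps the column names with spaces as they are
result_base <- as.data.frame(
  do.call(rbind, lapply(list(xx = xx, yy = yy), parse_stats))
)
result_base
```

This avoids the extra packages, at the cost of assuming that every list holds the same fields in the same order.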

Related

I want to find a mode for each group of dataframes within the element of a list, and write the result as a new column

I have a list called "data". It consists of 10 elements (lists), each having different number of elements (lists), such as
lengths(data)
[1] 26 33 3 20 22 21 17 18 12 29
Thus, the first element of our list consists of 26 elements, the second of 33, and so on. Each of these elements is a dataframe ("tibble") with 6 columns (the first four integer, the fifth logical, and the last character), for instance
colnames(data[[1]][[1]])
[1] "width" "height" "x" "y" "space" "text"
Although the structure of dataframes (columns) is consistent in and outside of the groups, the number of rows differs for each dataframe even within the group.
I want to find a mode "height" for the dataframes grouped within the same element.
Thus, there is common mode for 26 dataframes within the first element and so on. In other words, I want to group the data for 26 dataframes within the first element, calculate the mode, and then write result as a new column to each of the dataframes so that I could perform different operations for rows with height above, below, and equal to mode.
This is what I have figured out so far. It is not correct, but it should produce the same result in most cases:
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# find mode height for each paper and each page
mode <- lapply(data, function(x) lapply(lapply(x, '[[', 'height'), getmode))
# find mode for each paper
mode2 <- lapply(mode, function(x) getmode(x))
Here is one option. We loop over the outer list with map, bind the inner list elements into a single data frame with bind_rows (from dplyr) while creating a column 'grp', apply getmode on the combined 'height' column to create a new column, and then split the dataset by the 'grp' column
library(purrr)
library(dplyr)
map(data, ~ bind_rows(.x, .id = 'grp') %>%
      mutate(Mode = getmode(height)) %>%
      group_split(grp, .keep = FALSE))
Or loop over the outer list with lapply and over the inner list with sapply, extract 'height' from each inner data frame, and apply getmode to get a vector of per-data-frame modes, to which getmode is applied again. Then loop over the inner list and add a column holding that value. Note that this computes the mode of the modes, which can differ from the mode of the pooled heights.
lapply(data, function(x) {
  mode <- getmode(sapply(x, function(y) getmode(y$height)))
  lapply(x, function(y) cbind(y, mode))
})
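The two options are not equivalent: the first pools all heights of a group before taking the mode, the second takes the mode of the per-data-frame modes. A small sketch with made-up values (grp, pooled_mode and mode_of_modes are illustrative names) shows a case where they disagree:

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Toy group of three data frames with hypothetical heights
grp <- list(
  data.frame(height = c(10, 10, 12)),
  data.frame(height = c(12, 12, 12, 12)),
  data.frame(height = c(10, 10, 11))
)

# Mode over all heights in the group pooled together
pooled_mode <- getmode(unlist(lapply(grp, `[[`, "height")))

# Mode of the per-data-frame modes
mode_of_modes <- getmode(sapply(grp, function(y) getmode(y$height)))

# Write the pooled mode back to every data frame, keeping the list order
grp_with_mode <- lapply(grp, function(y) cbind(y, Mode = pooled_mode))

c(pooled = pooled_mode, of_modes = mode_of_modes)
```

Here 12 is the most frequent height overall, yet two of the three data frames have 10 as their own mode, so which variant is appropriate depends on what "mode for the group" should mean.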
data
set.seed(24)
data1 <- replicate(26, head(mtcars) %>% mutate(height = rnorm(6)), simplify = FALSE)
data2 <- replicate(33, head(iris) %>% mutate(height = rnorm(6)), simplify = FALSE)
data <- list(data1, data2)

Trying to iterate through certain columns based on column name using R but columns not in list get eliminated

I have a dataframe with 50 columns. I am trying to iterate through the columns in a dataframe that contain the word "apple". The dataframe contains 24 "apple" columns.
If apple_1 = 1 then all the other apple_x columns in the row should equal to 1 else it shouldn't do anything.
This is my code so far:
I am successfully able to create a list of column names containing apple (excluding apple_1)
applelist<- df %>% select(contains("apple"))%>%select(!contains("apple_1"))
applelist<- list(colnames(applelist))
But when I try to loop through the columns in the applelist and update the values for each row it wants to delete the 'non' applelist columns (go from 50 columns to 24). I only want to update the apple columns and leave the rest untouched.
for (i in 1:ncol(applelist){
df[, i] <- ifelse(df$apple_1==1, 1, df[, i])
}
Here, applelist is a list of length 1, i.e. what we are getting is
applelist <- list(c('a', 'b', 'c'))
We just need to extract the names as a vector or if we need a list, use as.list (not really needed here)
library(dplyr)
nm1 <- df %>%
  select(contains("apple")) %>%
  select(!contains("apple_1")) %>%
  names
Then use
for(nm in nm1) {df[[nm]] <- ifelse(df$apple_1 == 1, 1, df[[nm]])}
In addition, a tidyverse option would be
df <- df %>%
  mutate(across(all_of(nm1), ~ ifelse(apple_1 == 1, 1, .x)))
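A small base R sketch of the same idea on a made-up data frame (the column values and the name other_apples are assumptions for illustration). It also uses setdiff for an exact match on "apple_1", because a substring test such as contains("apple_1") would additionally drop columns like apple_10 to apple_19 if they exist:

```r
# Toy data frame with hypothetical values
df <- data.frame(
  id      = 1:3,
  apple_1 = c(1, 0, 1),
  apple_2 = c(0, 0, 5),
  apple_3 = c(7, 8, 9)
)

# All "apple" columns except exactly apple_1, as a character vector
other_apples <- setdiff(grep("apple", names(df), value = TRUE), "apple_1")

length(list(other_apples))  # wrapping in list() gives a single element
length(other_apples)        # one entry per column name

# Set the other apple columns to 1 only in rows where apple_1 is 1
df[df$apple_1 == 1, other_apples] <- 1
df
```

Because only the named columns are assigned to, the remaining columns (here id) are left untouched.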

Using Dataframe to Automatically create a list of values based off Subproduct

df <- data.frame("date" = 1:4, "product" = c("B", "B", "A", "A"),
                 "subproduct" = c("1", "2", "x", "y"), "actuals" = 1:4)
# creates df.1, df.2, df.x, df.y
for (i in unique(df$subproduct)) {
  nam <- paste("df", i, sep = ".")
  assign(nam, df[df$subproduct == i, ])
}
# CREATES LIST OF DATAFRAMES
# How do I make this so I don't have to manually type list(df.,df.,df.)
list_df <- list(df.1, df.2, df.x, df.y) %>%
  lapply(function(x) x[(names(x) %in% c("date", "actuals"))])
# creates df1, df2, df3, df4 with only date and actuals, removes the other columns
for (i in 1:length(list_df)) {
  assign(paste0("df", i), as.data.frame(list_df[[i]]))
}
The first for loop creates one df object per unique subproduct. For the list() call, I don't want to type df.1, df.2, ... by hand: with 100 unique subproducts in my data I would have to type df.1, df.2, df.x, df.y, df.z, df.zzz, ... over and over. How would I best do this? (1st question)
The last for loop creates separate dataframe objects with only date and actuals, which will be used to create a time series for each. How can I put the values of these objects into a single dataframe or a list of dfs? (2nd question)
We can use mget to return the values of the objects whose names are selected from ls. The pattern matches object names that start with 'df' followed by a '.' and one or more alphanumeric characters
mget(ls(pattern = '^df\\.[[:alnum:]]+$'))
If the OP wanted to create those objects in a different env
new_env <- new.env()
list2env(mget(ls(pattern = '^df\\.[[:alnum:]]+$')), envir = new_env)
If we want to create new objects from scratch, do a group_split on the 'subproduct' column, set the names accordingly, and create multiple objects (list2env - not recommended though)
library(dplyr)
library(stringr)
df %>%
  group_split(subproduct) %>%
  setNames(str_c('df.', c(1, 2, 'x', 'y'))) %>%
  list2env(.GlobalEnv)
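If the individual df.* objects are not needed at all, a base R sketch with split keeps everything in one named list from the start, so nothing has to be typed per subproduct. The names list_df and combined are illustrative:

```r
df <- data.frame(
  date       = 1:4,
  product    = c("B", "B", "A", "A"),
  subproduct = c("1", "2", "x", "y"),
  actuals    = 1:4
)

# One data frame per subproduct, keeping only date and actuals;
# the list names are taken from the subproduct values
list_df <- lapply(split(df, df$subproduct), function(d) d[c("date", "actuals")])
names(list_df)

# Elements are accessed by name instead of through separate objects
list_df[["x"]]

# Stack the pieces back into a single data frame when needed
combined <- do.call(rbind, list_df)
```

This scales to any number of subproducts, and lapply or sapply can then build one time series per list element.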

Read data set from website in numeric format not character

With the code below I read data from a website.
The problem is that it reads the data as character rather than numeric, especially columns such as "Enlem(N)" and "Boylam(E)".
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
  read_html %>%
  html_nodes("pre") %>%
  html_text %>%
  textConnection %>%
  read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
  setNames(nm = .[6,]) %>%
  tail(-7) %>%
  head(-2)
If you know what specific columns should be a number, you can convert those columns to be a number. If you do not know what columns should be a number, you can create a function to look at the data and if a large enough percentage of the cases in the column are a number change that column to be a number. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric = .95, PreserveDate = TRUE, PreserveColumns) {
  # find the counts of NA values in input data frame's columns
  param_originalNA <- apply(x, 2, function(z) {sum(is.na(z))})
  # blindly coerce data.frame to numeric
  df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
  # Percent Non-NA values in each column
  PercentNumeric <- (apply(df_JustNumbers, 2, function(x) sum(!is.na(x)))) / (nrow(x) - param_originalNA)
  rm(param_originalNA)
  # identify columns which have a greater than or equal percentage of numeric as specified
  param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
  # Remove columns from list to convert to numeric that are specified as to preserve
  if (!missing(PreserveColumns)) {param_numeric <- setdiff(param_numeric, PreserveColumns)}
  # Identify columns that are dates initially
  IsDateColumns <- lapply(x, function(y) (is(y, "Date") | is(y, "POSIXct")))
  param_dates <- names(IsDateColumns)[IsDateColumns == TRUE]
  # Remove dates from list if specified to preserve dates
  if (PreserveDate) {param_numeric <- setdiff(param_numeric, param_dates)}
  # returns column position of numeric columns in target data frame
  param_numeric <- match(param_numeric, colnames(df_JustNumbers))
  # removes NA's from column list
  param_numeric <- param_numeric[complete.cases(param_numeric)]
  # coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
  if (length(param_numeric) == 1) {
    suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
  }
  if (length(param_numeric) > 1) {
    suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric], 2, function(x) as.numeric(x)))
  }
  return(x)
}
Once the function is created, you can use it on your data such as:
# Use function to convert to numeric
dat <- NumericColumns(dat)
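For the simpler case where the numeric columns are known in advance, a minimal sketch could look like the following. The data frame here is a made-up stand-in for the parsed table (the values are hypothetical), and it assumes the column names are exactly "Enlem(N)" and "Boylam(E)"; names produced by read.fwf may carry padding and would need trimws first.

```r
# Hypothetical stand-in for the scraped table: everything is character
dat <- data.frame(
  `Enlem(N)`  = c(" 38.1230", " 40.5012"),
  `Boylam(E)` = c(" 27.4410", " 29.0003"),
  Yer         = c("EGE DENIZI", "MARMARA"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns known to hold numbers (assumed names)
num_cols <- c("Enlem(N)", "Boylam(E)")

# Strip fixed-width padding and convert only those columns
dat[num_cols] <- lapply(dat[num_cols], function(col) as.numeric(trimws(col)))
str(dat)
```

Columns not listed in num_cols, such as the place name, stay as character.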

combine multiple dataframes based on sequence of names

Say I have 30 dataframes all named with a date from 01/01/2000 to 30/01/2000 in the format of ddmmyy (code below) :
Season <- seq(as.Date("2000-01-01"), as.Date("2000-01-30"), 1)
Season <- format(Season, "%d%m%y")
for (s in Season) {
  df <- data.frame(X = 1:10, Y = 1:10)
  aa <- paste(s, "tests", s, sep = "_")
  assign(aa, df)
}
Each name, as you can see, has the word tests added to it. I want to combine (rbind?) the data.frames based on the date. In this case, combine the data.frames for the dates from 01-01-00 to 10-01-00.
I have the below code to combine all dataframes but what if I only want to select the ones shown above?
All_dfs <- do.call(rbind, eapply(.GlobalEnv,function(x) if(is.data.frame(x)) x))
Is it better to create a list first?
We can use mget to collect the data.frames into a list by name and then rbind that list. Each object name is a 'Season' value followed by "_tests_" and the same 'Season' value again, so we build the names with paste0 and pass them to mget.
res <- do.call(rbind, mget( paste0(Season[1:10], "_tests_", Season[1:10])))
dim(res)
#[1] 100 2
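On the question of whether a list is better: a sketch that builds a named list directly instead of calling assign could look like this. The names dates, dfs and wanted are illustrative, and the data frames are the same dummy ones as in the question.

```r
# Keep the real Date values around so ranges can be selected by date
dates  <- seq(as.Date("2000-01-01"), as.Date("2000-01-30"), by = 1)
Season <- format(dates, "%d%m%y")

# One named list instead of 30 separate objects in the global environment
dfs <- setNames(
  lapply(Season, function(s) data.frame(X = 1:10, Y = 1:10)),
  paste(Season, "tests", Season, sep = "_")
)

# Select by date range rather than by position, then bind the rows
wanted <- dates >= as.Date("2000-01-01") & dates <= as.Date("2000-01-10")
res <- do.call(rbind, dfs[wanted])
dim(res)
```

Selecting through the Date vector avoids relying on the order of the names, and nothing has to be fetched back with mget.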
