Different ways of selecting columns inside a function give different results. Why?

I have written a short function to clean some dataframes that I have in a list. When I select columns using the df[,1] method, my function doesn't work. However, when I select using df$Column, it does. Why is this?
columns_1 <- function(x) {
  x[,1] <- dmy_hm(x[,1])
  x[,2] <- NULL
  x[,3] <- as.numeric(x[,3])
  x[,4] <- NULL
  return(x)
}
MS_ <- lapply(MS_, columns_1)
columns_2 <- function(x) {
  x$DateTime <- dmy_hm(x$DateTime)
  x$LogSeconds <- NULL
  x$Pressure <- as.numeric(x$Pressure)
  x$Temperature <- NULL
  return(x)
}
MS_ <- lapply(MS_, columns_2)
The function columns_2 produces the desired result (all dataframes in the list are cleaned). columns_1 returns this error message:
Error in FUN(X[[i]], ...) :
(list) object cannot be coerced to type 'double'
In addition: Warning message:
All formats failed to parse. No formats found.

The likely issue is that the result was assigned back to MS_ on the first run, so some columns had already been dropped when the function was applied again.
library(lubridate)
MS_ <- lapply(MS_, columns_1)
Instead, assign the result to a different object:
MS2_ <- lapply(MS_, columns_1)
Data:
set.seed(24)
df1 <- data.frame(DateTime = format(Sys.Date() + 1:5, "%d-%m-%Y %H:%M"),
                  LogSeconds = 1:5,
                  Pressure = rnorm(5), Temperature = rnorm(5, 25),
                  stringsAsFactors = FALSE)
MS_ <- list(df1, df1)
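To see why re-running matters, here is a minimal sketch on made-up data. The helper names drop_by_position and drop_by_name are hypothetical and only illustrate the point: deleting a column by position shifts the remaining columns left, so applying the same positional code to an already-cleaned data frame removes a different column, while deleting by name is safe to repeat.

```r
# Toy data frame with the same column layout as the question (values are made up)
toy <- data.frame(DateTime = c("01-02-2020 10:00", "02-02-2020 11:30"),
                  LogSeconds = 1:2,
                  Pressure = c(1.5, 2.5),
                  Temperature = c(25.1, 24.9),
                  stringsAsFactors = FALSE)

# Positional deletion: removes whatever currently sits in column 2
drop_by_position <- function(x) {
  x[, 2] <- NULL
  x
}

# Name-based deletion: removes LogSeconds if present, otherwise does nothing
drop_by_name <- function(x) {
  x$LogSeconds <- NULL
  x
}

once  <- drop_by_position(toy)
twice <- drop_by_position(once)      # second run now removes Pressure

names(once)                          # "DateTime" "Pressure" "Temperature"
names(twice)                         # "DateTime" "Temperature"
names(drop_by_name(drop_by_name(toy)))  # same three columns as `once`
```

This is why overwriting MS_ and then calling the positional version again can fail or silently drop the wrong columns.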

Related

How can I make a loop that calls dataframes

I have written the code below to transform the rows of a dataframe into columns:
RowsToColums <- function(df)
{
  model = list()
  for(i in seq_along(df))
  {
    if(i>4)
    {
      dataf <- data.frame(names = df[1], Year=colnames(df[i]), index = df[,i:i])
      names(dataf)[3]<- toString(df[[3]][2])
      names(dataf)[1]<- "Country"
      model[[i]] <- dataf
    }
  }
  df <- do.call(rbind, model)
  df <- arrange(df, Country)
}
EC_Pop <- RowsToColums(EC_Pop)
EC_GDP <- RowsToColums(EC_GDP)
EC_Inflation <- RowsToColums(EC_Inflation)
ST_Tech_Exp <- RowsToColums(ST_Tech_Exp)
ST_Res_Jour <- RowsToColums(ST_Res_Jour)
ST_Res_Exp <- RowsToColums(ST_Res_Exp)
ST_Res_Pop <- RowsToColums(ST_Res_Pop)
ED_Unempl <- RowsToColums(ED_Unempl)
ED_Edu_Exp <- RowsToColums(ED_Edu_Exp)
But as you can see, I call the same function many times.
I tried to put all these dataframes in a list like this
list_a = list(EC_Pop,EC_GDP,EC_Inflation,ST_Tech_Exp,ST_Res_Exp)
for (i in seq_along(list_a))
{
  list_a[i] <- RowsToColums(list_a[i])
}
and to loop over it, calling the function on each dataframe in turn, but it fails with an error:
UseMethod ("arrange_") error:
Inapplicable method for 'arrange_' applied to object of class "NULL"
Does anybody know how to fix this case?
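A likely cause, assuming the data frames sit in a plain list: list_a[i] returns a one-element list rather than the data frame itself, while list_a[[i]] extracts the data frame. With a one-element list, seq_along(df) inside the function never exceeds 4, so model stays empty, do.call(rbind, model) gives NULL, and arrange() then fails on NULL. A small sketch with made-up data (toy_a, toy_b, and the sorting function are illustrative, not from the question):

```r
# Two made-up data frames standing in for EC_Pop, EC_GDP, ...
toy_a <- data.frame(Country = c("B", "A"), Value = 1:2, stringsAsFactors = FALSE)
toy_b <- data.frame(Country = c("D", "C"), Value = 3:4, stringsAsFactors = FALSE)
list_a <- list(toy_a, toy_b)

class(list_a[1])     # "list": single brackets keep the list wrapper
class(list_a[[1]])   # "data.frame": double brackets return the element

# Apply a function to every data frame and collect the results in a new list
sorted <- lapply(list_a, function(d) d[order(d$Country), ])

sorted[[1]]$Country  # "A" "B"
```

In the loop from the question, the equivalent change would be list_a[[i]] <- RowsToColums(list_a[[i]]), or simply lapply(list_a, RowsToColums).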

R aggregate function unexpected NA

When I use the aggregate function on a data.frame that contains both character and numeric columns, it fails and returns only NAs for every column. How can I solve this? My first idea was to check the class of the values, but that did not work.
name <- rep(LETTERS[1:5],each=2)
feat <- paste0("Feat",name)
valuesA <- runif(10)*10
valuesB <- runif(10)*10
daf <- data.frame(ID=name,feature=feat,valueA=valuesA,valueB=valuesB, stringsAsFactors = FALSE)
aggregate(.~ID, data=daf,FUN=mean)
aggregate(.~ID, data=daf,FUN=function(x){
  if(is.character(x)){
    return(NA)
  }else{ return(mean(x))}
})
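One way to sidestep the problem, sketched under the assumption that only the numeric columns need averaging: name them explicitly on the left-hand side of the formula instead of using the dot, so the character column feature is never passed to mean(). The data below repeats the question's setup with a fixed seed added for reproducibility.

```r
set.seed(1)  # seed added here only to make the sketch reproducible
name <- rep(LETTERS[1:5], each = 2)
daf <- data.frame(ID = name,
                  feature = paste0("Feat", name),
                  valueA = runif(10) * 10,
                  valueB = runif(10) * 10,
                  stringsAsFactors = FALSE)

# Aggregate only the numeric columns; `feature` is left out of the formula
res <- aggregate(cbind(valueA, valueB) ~ ID, data = daf, FUN = mean)
res
```

The result has one row per ID with numeric means in valueA and valueB.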

How do I apply anti_join to a list of data.frames using a master data.frame

I have a list of data.frames that have been randomly stratified by class to obtain 70% of the original dataset. I would like to antijoin the list of data.frames to obtain a separate list of data.frames consisting of the remaining 30% of data.
require(splitstackshape)
n <- 10
heads <- "bc_title4"
train_split <- function(x) {
  listofdfs <- list()
  for(i in 1:n){
    df <- stratified(x, 1, 0.7)
    listofdfs[[i]] <- df
  }
  return(listofdfs)
}
train_list <- train_split(survey_points) #returns data.frame list to environment
test_split <- function(x) {
  listofdfs2 <- list()
  for(i in train_list){
    df <- x[!x$id %in% data.frame(train_list[i])$id,]
    listofdfs2[[i]] <- df
  }
  return(listofdfs2)
}
test_list <- test_split(survey_points)
However, I do not know how to write a function for the antijoin, as I get the following error:
Error in train_list[i] : invalid subscript type 'list'
6. data.frame(train_list[i])
5. x$id %in% data.frame(train_list[i])$id
4. x$id %in% data.frame(train_list[i])$id
3. `[.data.frame`(x, !x$id %in% data.frame(train_list[i])$id, )
2. x[!x$id %in% data.frame(train_list[i])$id, ]
1. test_split(survey_points)
The same error occurs when I try to use the anti_join function:
test_split <- function(x) {
  listofdfs2 <- list()
  for(i in train_list){
    anti_join(x, data.frame(train_list[i]), by = "id")
    listofdfs2[[i]]
  }
  return(listofdfs2)
}
test <- test_split(survey_points)
Error in train_list[i] : invalid subscript type 'list'
9. data.frame(train_list[i])
8. tbl_vars(y)
7. check_valid_names(tbl_vars(y), warn_only = TRUE)
6. anti_join.tbl_df(tbl_df(x), y, by = by, copy = copy, ...)
5. anti_join(tbl_df(x), y, by = by, copy = copy, ...)
4. as.data.frame(anti_join(tbl_df(x), y, by = by, copy = copy, ...))
3. anti_join.data.frame(x, data.frame(train_list[i]), by = "id")
2. anti_join(x, data.frame(train_list[i]), by = "id")
1. test_split(survey_points)
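The error message points at the loop variable: for(i in train_list) makes i each data frame in turn, not its position, so train_list[i] tries to index a list with a list. A minimal sketch of one way around this, using made-up data and hand-picked training rows in place of stratified() (both are assumptions for illustration), is to let lapply hand each training set to the function directly:

```r
# Made-up master data with an id column
survey_points <- data.frame(id = 1:10, class = rep(c("a", "b"), each = 5))

# Stand-in for the stratified samples: two training sets picked by hand
train_list <- list(survey_points[c(1, 2, 3, 6, 7, 8), ],
                   survey_points[c(2, 4, 5, 7, 9, 10), ])

# For each training set, keep the master rows whose id is not in it
test_list <- lapply(train_list, function(train) {
  survey_points[!survey_points$id %in% train$id, ]
})

test_list[[1]]$id   # 4 5 9 10
test_list[[2]]$id   # 1 3 6 8
```

If dplyr is loaded, the body of the anonymous function could instead be anti_join(survey_points, train, by = "id").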

Apply a user defined function to a list of data frames

I have a series of data frames structured similarly to this:
df <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',11:21))
df2 <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',50:60))
In order to clean them I wrote a user defined function with a set of cleaning steps:
clean <- function(df){
  colnames(df) <- df[2,]
  df <- df[grep('^[0-9]{4}', df$year),]
  return(df)
}
I'd now like to put my data frames in a list:
df_list <- list(df,df2)
and clean them all at once. I tried
lapply(df_list, clean)
and
for(df in df_list){
  clean(df)
}
But with both methods I get the error:
Error in df[2, ] : incorrect number of dimensions
What's causing this error and how can I fix it? Is my approach to this problem wrong?
You are close, but there is one problem in the code. Because your dataframe's columns contain text, they are created as factors rather than characters (the default behaviour of data.frame() before R 4.0.0), so the column naming does not give the expected result.
# set stringsAsFactors = FALSE so the text columns stay character
df <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',11:21), stringsAsFactors = FALSE)
df2 <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',50:60), stringsAsFactors = FALSE)
clean <- function(df){
  # use the second row as the column names
  colnames(df) <- df[2,]
  # keep only the rows whose year column starts with a four-digit number
  df <- df[grep('^[0-9]{4}', df$year),]
  # convert the columns to numeric values
  df[, 1:ncol(df)] <- apply(df[, 1:ncol(df)], 2, as.numeric)
  return(df)
}
df_list <- list(df,df2)
lapply(df_list, clean)
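The same steps can also be sketched with unlist() to turn the header row into a plain character vector and lapply() to convert the columns. This is an alternative arrangement rather than the only way to do it; the name promote_header and the shortened data are illustrative. Note that lapply returns a new list, so the result has to be assigned to keep it.

```r
# Shortened version of the example data (three years instead of eleven)
raw <- data.frame(x = c("notes", "year", 1995:1997),
                  y = c(NA, "value", 11:13),
                  stringsAsFactors = FALSE)

promote_header <- function(d) {
  # second row holds the real column names
  names(d) <- unlist(d[2, ], use.names = FALSE)
  # keep only rows where `year` is a four-digit number
  d <- d[grepl("^[0-9]{4}$", d$year), ]
  # convert every column to numeric, keeping the data frame structure
  d[] <- lapply(d, as.numeric)
  rownames(d) <- NULL
  d
}

cleaned <- lapply(list(raw, raw), promote_header)
cleaned[[1]]
```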

How to make R insert a '0' in place of missing values while reading a CSV?

We have a multi-column CSV file of the following format:
id1,id2,id3,id4
1,2,3,4
,,3,4,6
2,,3,4
These missing values should be treated as 0 when reading the CSV column by column. The following is the script we currently have:
data <- read.csv("data.csv")
dfList <- lapply(seq_along(data), function(i) {
  seasonal_per <- msts(data[, i], seasonal.periods=c(24,168))
  best_model <- tbats(seasonal_per)
  fcst <- forecast.tbats(best_model, h=24, level=90)
  dfForec <- print(fcst)
  result <- cbind(0:23, dfForec[, 1])
  result$id <- names(df)[i]
  return(result[c("id", "V1", "V2")])
})
finaldf <- do.call(rbind, dfList)
write.csv(finaldf, file = "out.csv", row.names = FALSE)
This script breaks when the CSV has missing values, giving the error:
Error in tau + 1 + adj.beta + object$p : non-numeric argument to binary operator
How do we tell R to assume a '0' when it encounters a missing value?
I tried the following:
library("forecast")
D <- read.csv("data.csv",na.strings=".")
D[is.na(D)] <- 0
dfList <- lapply(seq_along(data), function(i) {
  seasonal_per <- msts(data[, i], seasonal.periods=c(24,168))
  best_model <- tbats(seasonal_per)
  fcst <- forecast.tbats(best_model, h=24, level=90)
  dfForec <- print(fcst)
  result <- cbind(0:23, dfForec[, 1])
  result$id <- names(df)[i]
  return(result[c("id", "V1", "V2")])
})
finaldf <- do.call(rbind, dfList)
write.csv(finaldf, file = "out.csv", row.names = FALSE)
but it gives the following error:
Error in data[, i] : object of type 'closure' is not subsettable
If you're certain that every NA value should be 0, and that's the only issue, then:
data <- read.csv("data.csv")
data[is.na(data)] <- 0
If you're working in the tidyverse (or just with dplyr), this option works well:
library(tidyverse)
data <- read_csv("data.csv") %>% replace(is.na(.), 0)
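A self-contained sketch of the base R approach, reading a small inline CSV through the text argument of read.csv instead of a file (the inline content and the object name dat are assumptions for illustration). It also sidesteps what appears to be the cause of the 'closure' error in the second attempt: the file was read into D, but the loop still refers to data, which then resolves to R's built-in data() function.

```r
# Inline stand-in for data.csv; empty fields become NA in numeric columns
csv_text <- "id1,id2,id3,id4
1,2,3,4
,,3,4
2,,3,4"

dat <- read.csv(text = csv_text)

# Replace every missing value with 0
dat[is.na(dat)] <- 0
dat

# Later code should refer to the same object, for example:
col_sums <- vapply(seq_along(dat), function(i) sum(dat[, i]), numeric(1))
```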
