Treatment of 'empty' values - r

I am importing a CSV file into R using the sqldf package. I have several missing values in both numeric and string variables. I notice that missing values are left empty in the data frame (as opposed to being filled with NA or something else). I want to replace the missing values with a user-defined value. Obviously, a function like is.na() will not work in this case.
Toy dataframe with three columns:
A B C
3 4
2 4 6
34 23 43
2 5
I want:
A B C
3 4 NA
2 4 6
34 23 43
2 5 NA
Thank you in advance.

Assuming you are using read.csv.sql in sqldf with the default SQLite database, it is producing a factor column for C, so:
(1) just convert the values to numeric using as.numeric(as.character(...)) like this:
> Lines <- "A,B,C
+ 3,4,
+ 2,4,6
+ 34,23,43
+ 2,5,
+ "
> cat(Lines, file = "stest.csv")
> library(sqldf)
> DF <- read.csv.sql("stest.csv")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: Factor w/ 3 levels "","43","6": 1 3 2 1
> DF$C <- as.numeric(as.character(DF$C))
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(2) or if we use read.csv.sql(..., method = "raw"), the column comes back as character, so we can just use as.numeric:
> DF <- read.csv.sql("stest.csv", method = "raw")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: chr "" "6" "43" ""
> DF$C <- as.numeric(DF$C)
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(3) If it's feasible for you to use read.csv, then we do get NA filling right off:
> str(read.csv("stest.csv"))
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: int NA 6 43 NA
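To get from empty strings to a user-defined fill value, as the question asks, one possible sketch in base R follows. It assumes the blanks arrive as "" in character columns; the data frame, the extra string column S, and the fill values -1 and "unknown" are made up for illustration and are not from the original post.

```r
# Hypothetical data frame mimicking an import where blanks arrive as "".
DF <- data.frame(A = c(3L, 2L, 34L, 2L),
                 C = c("", "6", "43", ""),
                 S = c("x", "", "z", ""),
                 stringsAsFactors = FALSE)

# Mark empty strings as missing, column by column.
DF$C[DF$C == ""] <- NA
DF$S[DF$S == ""] <- NA

# C holds numbers stored as text, so convert it once the blanks are NA.
DF$C <- as.numeric(DF$C)

# Replace NA with user-defined fill values (arbitrary choices here).
DF$C[is.na(DF$C)] <- -1
DF$S[is.na(DF$S)] <- "unknown"

str(DF)
```

Going through NA first means is.na() becomes usable again, so the same fill step works for numeric and string columns alike.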

Combine a dataframe and a list of dataframes into a one-dimensional list of dataframes, R

This should be pretty easy, but I don't know how. I have a single dataframe and a list with two dataframes. Now I want to combine them, so that I have a single list with three dataframes. And I do not want to do it "manually".
a = data.frame(xa = 1:10,
ya = 11:20)
b = list(c = data.frame(x = 1:10),
d = data.frame(x = 1:20,
y = 11:30))
Now I thought about something like this:
res = c(a, b)
But this results in this:
> sapply(res, class)
xa ya c d
"integer" "integer" "data.frame" "data.frame"
So it turns the two columns of the single dataframe into a vector.
How could I maintain the dataframe structure for the "single" dataframe and extract the dataframes from the list of 2?
You can use c, but you have to wrap your data.frame a in a list first.
res <- c(b, list(a=a))
str(res)
#List of 3
# $ c:'data.frame': 10 obs. of 1 variable:
# ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
# $ d:'data.frame': 20 obs. of 2 variables:
# ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ y: int [1:20] 11 12 13 14 15 16 17 18 19 20 ...
# $ a:'data.frame': 10 obs. of 2 variables:
# ..$ xa: int [1:10] 1 2 3 4 5 6 7 8 9 10
# ..$ ya: int [1:10] 11 12 13 14 15 16 17 18 19 20
You can always add it as a new element
b[["a"]]=a
The "a" can be used in a loop or something similar.
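The loop idea mentioned above might look like the following sketch. The list extra and the data frame e are hypothetical additions for illustration; a and b are the objects from the question.

```r
a <- data.frame(xa = 1:10, ya = 11:20)
b <- list(c = data.frame(x = 1:10),
          d = data.frame(x = 1:20, y = 11:30))

# Hypothetical: several data frames to append, keyed by the names they
# should have in the combined list.
extra <- list(a = a, e = data.frame(z = 1:3))

# Add each one as a new named element; [[<- keeps it as a data frame.
for (nm in names(extra)) {
  b[[nm]] <- extra[[nm]]
}

sapply(b, class)
```

Unlike c(a, b), assigning with [[ never splits a data frame into its columns, because the whole object is stored as a single list element.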

converting data type stored in list into Date R

I have a list called data, and there are several data frames in it.
[[1]]
ID: int [1:100] ...
Date: Factor w/ ...
days: num [1:100] ...
[[2]]
ID: int [1:100] ...
Date: Factor w/ ...
Like this.
And I want to convert that factor to Date format.
I thought about unlisting the list, changing the format, and making it a list again, but I have no idea how to do that.
sapply(data, function(x) x$Date <- as.Date(x$Date))
This doesn't work. It only returns Date and doesn't change the data type.
Is there any fast way to convert that format?
I can solve this by using a for loop.
for(i in 1:2){
data[[i]]$Date <- as.Date(data[[i]]$Date)}
But I would like to use sapply or lapply.
It is better to convert the factor to character first and then to Date. The easiest way is to use the lubridate package: ymd converts character vectors in a year-month-day format such as 2018-11-22 into Date objects. Note the body of the anonymous function: after the data frame is modified, the last line is x, which is shorthand for return(x). Without it, the function would return only the converted column. See the code below:
library(lubridate)
# simulation of data
df1 <- data.frame(
ID = 1:100,
Date = as.factor(sample(seq(ymd("2018-01-01"), ymd("2018-12-01"), 1), 100)),
days = sample(100))
df2 <- data.frame(
ID = 1:100,
Date = as.factor(sample(seq(ymd("2018-01-01"), ymd("2018-12-01"), 1), 100, replace =TRUE)))
dfs <- list(df1, df2)
str(dfs)
# List of 2
# $ :'data.frame': 100 obs. of 3 variables:
# ..$ ID : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Date: Factor w/ 100 levels "2018-01-06","2018-01-10",..: 17 89 40 2 84 46 58 62 66 43 ...
# ..$ days: int [1:100] 50 4 19 6 33 47 95 25 13 5 ...
# $ :'data.frame': 100 obs. of 2 variables:
# ..$ ID : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Date: Factor w/ 87 levels "2018-01-03","2018-01-04",..: 3 30 61 6 78 34 5 71 49 55 ...
# handling the data
dfs_2 <- lapply(dfs, function(x) {
x$Date <- ymd(as.character(x$Date))
x
})
str(dfs_2)
# List of 2
# $ :'data.frame': 100 obs. of 3 variables:
# ..$ ID : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Date: Date[1:100], format: "2018-03-10" "2018-10-25" "2018-11-25" ...
# ..$ days: int [1:100] 7 99 75 91 30 78 9 82 15 37 ...
# $ :'data.frame': 100 obs. of 2 variables:
# ..$ ID : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
# ..$ Date: Date[1:100], format: "2018-05-30" "2018-05-20" "2018-05-13" ...
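If adding a package is not an option, the same pattern can be sketched in base R with as.Date. The small data frames below are made-up stand-ins for the real list, and the format string assumes the dates are written as year-month-day.

```r
# Hypothetical stand-in for the list of data frames in the question.
dfs <- list(
  data.frame(ID = 1:3,
             Date = factor(c("2018-01-06", "2018-03-10", "2018-11-25")),
             days = c(5, 7, 9)),
  data.frame(ID = 1:2,
             Date = factor(c("2018-05-30", "2018-05-20")))
)

dfs_base <- lapply(dfs, function(x) {
  # Convert factor -> character -> Date, assuming "YYYY-MM-DD" input.
  x$Date <- as.Date(as.character(x$Date), format = "%Y-%m-%d")
  x  # return the whole modified data frame, not just the column
})

str(dfs_base)
```

The original sapply attempt returned only the dates because the value of an assignment is the assigned value; ending the function with x is what makes lapply hand back complete data frames.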

Splitting a dataframe by columns

I want to split my dataframe by columns. Sounds trivial, but I haven't really succeeded so far.
Here is what i have come up with:
SG <- data.frame(num = 1:26, let = letters, LET = LETTERS)
SG <- lapply(SG, function(x) split(x, colnames(SG)))
str(SG)
List of 3
$ num:List of 3
$ let:List of 3
$ LET:List of 3
I have successfully converted my dataframe into a list of lists. But I would like to have a list of dataframes, preserving the row name info from SG, each one containing one column of the initial dataframe. Is that possible?
Thank you!
This should work, row names are preserved. It returns a list of data frames:
SG <- lapply(SG, data.frame)
str(SG)
List of 3
$ num:'data.frame': 26 obs. of 1 variable:
..$ X..i..: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
$ let:'data.frame': 26 obs. of 1 variable:
..$ X..i..: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ LET:'data.frame': 26 obs. of 1 variable:
..$ X..i..: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
You can use
lapply(colnames(SG), function(x) SG[, x, drop = FALSE])
which returns an object with the structure
List of 3
$ :'data.frame': 26 obs. of 1 variable:
..$ num: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 26 obs. of 1 variable:
..$ let: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 26 obs. of 1 variable:
..$ LET: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
In this case we are just subsetting. split() for data.frames is better when you want to separate the rows, not columns, into different groups.
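A variation on the subsetting approach that also names the list elements after the columns is sketched below, followed by a row-wise split() for contrast. The object names SG_list and by_rows and the grouping condition are illustrative choices, not from the original answers.

```r
SG <- data.frame(num = 1:26, let = letters, LET = LETTERS)

# Iterate over a named vector of column names so that lapply carries the
# names over to the result; drop = FALSE keeps each piece a data frame.
SG_list <- lapply(setNames(names(SG), names(SG)),
                  function(nm) SG[, nm, drop = FALSE])
str(SG_list)

# For contrast: split() groups rows. Here the rows are divided by an
# arbitrary condition on num.
by_rows <- split(SG, SG$num <= 13)
sapply(by_rows, nrow)
```

Each element of SG_list keeps the original column name and the row names of SG.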

Why does data.table recycle matrices into a single vector when data.frame does not?

Compare the behavior of data.table and data.frame below:
a.matrix <- matrix(seq_len(25),ncol = 5, nrow = 5)
a.list <- list(seq_len(5),a.matrix)
a.dt <- as.data.table(a.list)
a.df <- as.data.frame(a.list)
a.dt.df <- as.data.table(a.df)
str(a.dt)
str(a.df)
str(a.dt.df)
data.table recycles the columns of the matrix into a vector of appropriate length:
> str(a.dt)
Classes ‘data.table’ and 'data.frame': 25 obs. of 2 variables:
$ V1: int 1 2 3 4 5 1 2 3 4 5 ...
$ V2: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
On the other hand, data.frame breaks each column out:
> str(a.df)
'data.frame': 5 obs. of 6 variables:
$ X1.5: int 1 2 3 4 5
$ X1 : int 1 2 3 4 5
$ X2 : int 6 7 8 9 10
$ X3 : int 11 12 13 14 15
$ X4 : int 16 17 18 19 20
$ X5 : int 21 22 23 24 25
My current workaround to get this behavior quickly with as.data.table is just to feed the list through both coercers, as.data.frame first and then as.data.table:
> str(a.dt.df)
Classes ‘data.table’ and 'data.frame': 5 obs. of 6 variables:
$ X1.5: int 1 2 3 4 5
$ X1 : int 1 2 3 4 5
$ X2 : int 6 7 8 9 10
$ X3 : int 11 12 13 14 15
$ X4 : int 16 17 18 19 20
$ X5 : int 21 22 23 24 25
- attr(*, ".internal.selfref")=<externalptr>
Why is there a difference, and is there a fast way to get the data.frame behavior with data.table?
Just to close this on the SO end: as mentioned in the comments, this is now being handled as a bug/issue on GitHub and, as of this writing, has been added to the data.table v1.9.8 milestone.
Follow-up
This is now resolved as per commit 64f377...

Convert Factors in 2 Data Frames of a List into Numeric

I am having trouble converting the columns of 2 data frames in a list to numeric. Right now both data frames have 2 columns consisting of factors. I want to convert them to numeric so that I can do mathematical operations on them. Below is sample code:
library(XML)
bal <- "http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml"
bos <- "http://www.baseball-reference.com/teams/BOS/2014-schedule-scores.shtml"
mylist <- list(bal, bos)
a <- lapply(mylist, readHTMLTable)
b <- lapply(a, function(x) x[["team_schedule"]][, c("R", "RA")])
c <- as.numeric(as.character(b))
When I run this code I get:
> c
[1] NA NA
> str(c)
num [1:2] NA NA
Here is the structure of b:
> str(b)
List of 2
$ :'data.frame': 165 obs. of 2 variables:
..$ R : Factor w/ 13 levels "","0","10","11",..: 6 6 7 8 10 7 6 5 9 2 ...
..$ RA: Factor w/ 13 levels "","0","1","10",..: 3 9 7 4 10 3 7 8 7 6 ...
$ :'data.frame': 166 obs. of 2 variables:
..$ R : Factor w/ 10 levels "","0","1","2",..: 3 8 6 4 8 2 7 9 6 3 ...
..$ RA: Factor w/ 13 levels "","1","10","14",..: 5 5 6 9 10 7 2 3 5 7 ...
What should I do differently to convert the factors into numeric values?
You need to use lapply. Run str on b:
str(b)
This shows that you have a list of 2 data.frames, so as.numeric cannot be applied to b directly.
You need to use lapply along with sapply to convert each column of each data frame while keeping the list structure:
lapply(b, function(x) sapply(x, function(x) as.numeric(as.character(x))))
Your factors contain values such as D/N, which will be converted to NA, as will the entries that are blank/empty.
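Note that sapply returns a matrix for each list element. If the elements should stay data frames, one possible sketch is to assign back into df[] inside lapply. The small list b below is a made-up stand-in for the scraped tables, using only blank entries as the non-numeric values.

```r
# Hypothetical stand-in for the scraped list of two data frames.
b <- list(
  data.frame(R = factor(c("2", "", "10")), RA = factor(c("1", "14", ""))),
  data.frame(R = factor(c("0", "3")), RA = factor(c("5", "")))
)

b_num <- lapply(b, function(df) {
  # df[] <- ... replaces the columns but keeps the data.frame container.
  df[] <- lapply(df, function(col) as.numeric(as.character(col)))
  df
})

str(b_num)
```

Going through as.character matters: as.numeric on a factor alone would return the internal level codes rather than the values shown.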
