R - Cannot change NAs in Data Frame to numeric

I have a data frame of values called "games" with several columns of numerics. The original csv file had some missing values, which became NAs when I read them in. I'm trying to replace these NAs with the row median (already stored as a column of the data frame). I can't get the original NA to coerce from a character to a numeric.
I first found the indices of the missing values.
ng <- which(is.na(games), arr.ind = TRUE)
Then I tried replacing the NAs with a value from the column "linemedian".
games[ng] <- games[ng[,1], "linemedian"]
games[ng]
[1] " -3.25" " 9.98" " -9.1" " -9.1" " 14.0" " -3.25" " 9.98" " -3.25" " 9.98" " 2.30" " 13.75" "-24.00" " 3.71" " 15.94" " 14.25" " -9.83" " 13.75" " -4.88"
Replacing the NAs with just any number also did not work.
games[is.na(games)] <- 0
[1] " 0.0" " 0.0" " 0" " 0" " 0" " 0.0" " 0.0" " 0.0" " 0.0" " 0.00" " 0.00" " 0.00" " 0" " 0" " 0.00" " 0.00" " 0.00" " 0.00"
I thought that removing the whitespace might change the outcome but it did not.
games[ng] <- as.numeric(trimws(games[ng[,1], "linemedian"]))
[1] "-3.25" "9.98" "-9.1" "-9.1" "14" "-3.25" "9.98" "-3.25" "9.98" "2.3" "13.75" "-24" "3.71" "15.94" "14.25" "-9.83" "13.75" "-4.88"
Other attempts that did not work:
games[ng] <- type.convert(games[ng]) # using type.convert()
games[, -c(1,2)] <- as.numeric(games[, -c(1,2)]) # first two columns are metadata
Error: (list) object cannot be coerced to type 'double'
games[, -c(1,2)] <- as.numeric(unlist(games[, -c(1,2)]))
games[ng] <- as.numeric(as.character(trimws(games[ng[,1], "linemedian"])))
# New Addition from Answer
games[, sapply(games, is.numeric)][ng] <- games[, sapply(games, is.numeric)][ng[,1], "linemedian"]
I know for sure that the value I'm assigning to games[ng] is a numeric.
games[ng[,1], "linemedian"]
[1] -3.25 9.98 -9.10 -9.10 14.00 -3.25 9.98 -3.25 9.98 2.30 13.75 -24.00 3.71 15.94 14.25 -9.83 13.75 -4.88
typeof(games[ng[,1], "linemedian"])
[1] "double"
Everywhere I look on the Stack Overflow boards, the obvious answer should be games[is.na(games)] <- VALUE. But that isn't working. Does anybody have an idea?
Here's the full code if you want to replicate:
## Download Raw Files
download.file("http://www.thepredictiontracker.com/ncaa2016.csv",
"data/ncaa2016.csv")
download.file("http://www.thepredictiontracker.com/ncaapredictions.csv",
"data/ncaapredictions.csv")
## Create Training and Prediction Data Sets
games <- read.csv("data/ncaa2016.csv", header = TRUE, stringsAsFactors = FALSE,
colClasses=c(rep("character",2),rep("numeric",72)))
preds <- read.csv("data/ncaapredictions.csv", header = TRUE, stringsAsFactors = TRUE)
colnames(preds)[colnames(preds) == "linebillings"] <- "linebill"
colnames(preds)[colnames(preds) == "linebillings2"] <- "linebill2"
colnames(preds)[colnames(preds) == "home"] <- "Home"
colnames(preds)[colnames(preds) == "road"] <- "Road"
## Remove Columns with too many missing values
rm <- unique(c(names(games[, sapply(games, function(z) sum(is.na(z))) > 50]), # Games and predictions
names(preds[, sapply(preds, function(z) sum(is.na(z))) > 10]))) # with missing data
games <- games[, !(names(games) %in% rm)] # Remove games with no prediction data
preds <- preds[, !(names(preds) %in% rm)] # Remove predictions with no game data
## Replace NAs with Prediction Median
ng <- which(is.na(games), arr.ind = TRUE)
games[ng] <- games[ng[,1], "linemedian"]
Also, I can't post the entire dput() output, but here's a bit of the data set just to show the structure.
dput(head(games[1:6]))
structure(list(Home = c("Alabama", "Arizona", "Arkansas", "Arkansas St.",
"Auburn", "Boston College"), Road = c("USC", "BYU", "Louisiana Tech",
"Toledo", "Clemson", "Georgia Tech"), line = c("12", "-2", "24.5",
"4", "-8.5", "-3"), linesag = c(12.19, 0.97, 24.26, -2.07, -4.78,
-2.74), linepayne = c(12, -0.81, 12.53, -0.86, -10.72, -3.87),
linemassey = c(19.15, -2.1, 21.07, -8.68, -5.45, -6.76)), .Names = c("Home",
"Road", "line", "linesag", "linepayne", "linemassey"), row.names = c(NA,
6L), class = "data.frame")
Lastly, I'm running R Version 3.2.1 on x86_64-w64-mingw32.

Without a test case this will be untested. It appears you are getting a global replacement, but because some of your columns are character, the replacement values get coerced to character. I might have tried restricting the process to just the numeric columns:
games[ , sapply(games, is.numeric) ][ ng ] <-
games[ , sapply(games, is.numeric)][ng[,1], "linemedian"]
After modifying your almost-reproducible code, I've concluded that your original code was successful, but the output of your checking was the problem area:
str( games[ , sapply(games, is.numeric)][ng[,1], "linemedian"] )
#num [1:23] -3.25 9.98 -9.1 -9.1 14 -3.25 9.98 -3.25 9.98 2.3 ...
games[ ng ] <-
games[ , sapply(games, is.numeric)][ng[,1], "linemedian"]
games[ ng[1:2,] ]
[1] " -3.25" " 9.98"
> ng[1:2,]
row col
[1,] 619 3
[2,] 678 3
> str(games)
'data.frame': 720 obs. of 58 variables:
$ Home : chr "Alabama" "Arizona" "Arkansas" "Arkansas St." ...
$ Road : chr "USC" "BYU" "Louisiana Tech" "Toledo" ...
$ line : num 12 -2 24.5 4 -8.5 -3 8.5 37 -10.5 5 ...
$ linesag : num 12.19 0.97 24.26 -2.07 -4.78 ...
$ linepayne : num 12 -0.81 12.53 -0.86 -10.72 ...
... (remaining columns deleted)
> games[ c(619,678) , 3]
#[1] -3.25 9.98
> games[ matrix(c(619,678,3,3), ncol=2)]
[1] " -3.25" " 9.98"
So the third column remained numeric after the assignment, but the printed output of a matrix-indexed extraction looked like character when it was in fact numeric. (Presumably because extracting from a data frame with a matrix index goes through as.matrix(), which coerces the whole object to character when any column is character.)
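That printing behavior can be reproduced with a tiny hypothetical data frame: matrix-indexed assignment updates the columns in place and keeps their types, while matrix-indexed extraction goes through as.matrix() and so comes back as character whenever any column is character.

```r
# Minimal sketch (hypothetical data): assignment via a matrix index
# keeps the column numeric, but extraction via a matrix index goes
# through as.matrix(), so the printed values look like character.
df  <- data.frame(name = c("a", "b"), x = c(NA, 2.5),
                  stringsAsFactors = FALSE)
idx <- which(is.na(df), arr.ind = TRUE)
df[idx] <- 99          # element-wise assignment; column types survive
is.numeric(df$x)       # TRUE  -- the column is still numeric
is.character(df[idx])  # TRUE  -- extraction coerced via as.matrix(df)
```

Checking the column directly (df$x or str(df)) avoids the misleading coercion.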

Related

Why does this regex not match decimal numbers?

([.[:digit:]]+)
I am thinking this should match decimal numbers like 25.8 or 0.6 ..., but it seems to give up at the "non-digit" part of the match... so I only get 25 or 0
I have tried to escape the "." with \. and .
I am doing this in R, using gregexpr().
Here is a minimal reproducible example:
test
[1] " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname
[1] "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([\\.[:digit:]]+)[:blank:]*?"
> gregexpr( LABregexlabname, test)
[[1]]
[1] 11
attr(,"match.length")
[1] 46
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
substring( test, 11, 11+46)
[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10"
Place the last [:blank:] inside [] as [[:blank:]] and use perl=TRUE.
test <- " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)[[:blank:]]*?"
regmatches(test, regexpr(LABregexlabname, test, perl=TRUE))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99"
It looks like TRE uses minimal matching everywhere when the pattern ends with ?. In this case, removing just the ? makes TRE return the whole number, but with all the trailing spaces as well; so perhaps leave out the [[:blank:]]* altogether:
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)[[:blank:]]*"
regmatches(test, regexpr(LABregexlabname, test))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
LABregexlabname <- "LAB[[:print:][:blank:]]+WBC[[:print:][:blank:]]+([.[:digit:]]+)"
regmatches(test, regexpr(LABregexlabname, test))
#[1] "LAB 6690-2(LOINC) WBC # Bld Auto 10.99"
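If only the number itself is wanted rather than the whole match, one hedged variant is to make the earlier quantifiers lazy (with perl = TRUE, since regexec() gained a perl argument in R 3.3.0) and pull out the capture group via regexec(); with greedy quantifiers the group would only capture the last character of the number.

```r
# Sketch: lazy quantifiers so the capture group gets the whole number;
# regexec() returns the full match plus the capture groups.
test <- " UNITS\n LAB 6690-2(LOINC) WBC # Bld Auto 10.99 "
p <- "LAB[[:print:][:blank:]]+?WBC[[:print:][:blank:]]+?([.[:digit:]]+)"
m <- regmatches(test, regexec(p, test, perl = TRUE))[[1]]
m[2]
# [1] "10.99"
```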

indented unordered list to nested list()

I've got a log file that looks as follows:
Data:
+datadir=/data/2017-11-22
+Nusers=5292
Parameters:
+outdir=/data/2017-11-22/out
+K=20
+IC=179
+ICgroups=3
-group 1: 1-1
ICeffects: 1-5
-group 2: 2-173
ICeffects: 6-10
-group 3: 175-179
ICeffects: 11-15
I would like to parse this logfile into a nested list using R so that the result will look like this:
result <- list(Data = list(datadir = '/data/2017-11-22',
Nusers = 5292),
Parameters = list(outdir = '/data/2017-11-22/out',
K = 20,
IC = 179,
ICgroups = list(list('group 1' = '1-1',
ICeffects = '1-5'),
list('group 2' = '2-173',
ICeffects = '6-10'),
list('group 3' = '175-179',
ICeffects = '11-15'))))
Is there a not-extremely-painful way of doing this?
Disclaimer: This is messy. There is no guarantee that this will work for larger/different files without some tweaking. You will need to do some careful checking.
The key idea here is to reformat the raw data, to make it consistent with the YAML format, and then use yaml::yaml.load to parse the data to produce a nested list.
By the way, this is an excellent example of why one really should use a common markup language for log output/config files (like JSON, YAML, etc.)...
I assume you read in the log file using readLines to produce the vector of strings ss.
# Sample data
ss <- c(
"Data:",
" +datadir=/data/2017-11-22",
" +Nusers=5292",
"Parameters:",
" +outdir=/data/2017-11-22/out",
" +K=20",
" +IC=179",
" +ICgroups=3",
" -group 1: 1-1",
" ICeffects: 1-5",
" -group 2: 2-173",
" ICeffects: 6-10",
" -group 3: 175-179",
" ICeffects: 11-15")
We then reformat the data to adhere to the YAML format.
# Reformat to adhere to YAML formatting
ss <- gsub("\\+", "- ", ss); # Replace "+" with "- "
ss <- gsub("ICgroups=\\d+","ICgroups:", ss); # Replace "ICgroups=3" with "ICgroups:"
ss <- gsub("=", " : ", ss); # Replace "=" with ": "
ss <- gsub("-group", "- group", ss); # Replace "-group" with "- group"
ss <- gsub("ICeffects", " ICeffects", ss); # Replace "ICeffects" with " ICeffects"
Note that – consistent with your expected output – the value 3 from ICgroups doesn't get used, and we need to replace ICgroups=3 with ICgroups: to initiate a nested sub-list. This was the part that threw me off first...
Loading & parsing the YAML string then produces a nested list.
require(yaml);
lst <- yaml.load(paste(ss, collapse = "\n"));
lst;
#$Data
#$Data[[1]]
#$Data[[1]]$datadir
#[1] "/data/2017-11-22"
#
#
#$Data[[2]]
#$Data[[2]]$Nusers
#[1] 5292
#
#
#
#$Parameters
#$Parameters[[1]]
#$Parameters[[1]]$outdir
#[1] "/data/2017-11-22/out"
#
#
#$Parameters[[2]]
#$Parameters[[2]]$K
#[1] 20
#
#
#$Parameters[[3]]
#$Parameters[[3]]$IC
#[1] 179
#
#
#$Parameters[[4]]
#$Parameters[[4]]$ICgroups
#$Parameters[[4]]$ICgroups[[1]]
#$Parameters[[4]]$ICgroups[[1]]$`group 1`
#[1] "1-1"
#
#$Parameters[[4]]$ICgroups[[1]]$ICeffects
#[1] "1-5"
#
#
#$Parameters[[4]]$ICgroups[[2]]
#$Parameters[[4]]$ICgroups[[2]]$`group 2`
#[1] "2-173"
#
#$Parameters[[4]]$ICgroups[[2]]$ICeffects
#[1] "6-10"
#
#
#$Parameters[[4]]$ICgroups[[3]]
#$Parameters[[4]]$ICgroups[[3]]$`group 3`
#[1] "175-179"
#
#$Parameters[[4]]$ICgroups[[3]]$ICeffects
#[1] "11-15"
PS. You will need to test this on larger files, and adjust the substitutions if necessary.
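The substitutions can be bundled into one small helper that turns the raw log lines into a YAML string (the helper name is mine); yaml::yaml.load() then does the parsing:

```r
# Reformat raw log lines into YAML text, mirroring the gsub() steps above.
log_to_yaml <- function(ss) {
  ss <- gsub("\\+", "- ", ss)                   # "+key=value" -> "- key=value"
  ss <- gsub("ICgroups=\\d+", "ICgroups:", ss)  # start a nested sub-list
  ss <- gsub("=", " : ", ss)                    # "key=value" -> "key : value"
  ss <- gsub("-group", "- group", ss)           # make the groups list items
  ss <- gsub("ICeffects", "  ICeffects", ss)    # indent under the group
  paste(ss, collapse = "\n")
}
# lst <- yaml::yaml.load(log_to_yaml(readLines("run.log")))  # "run.log" is hypothetical
```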

R Error when using beside=TRUE parameter

I am plotting a graph with barplot() and any attempt to use the beside=TRUE argument returns the error Error in -0.01 * height : non-numeric argument to binary operator
The following is the code for the graph:
combi <- as.matrix(combine)
barplot(combi, main="Top 5 hospitals in California",
ylab="Mortality/Admission Rates", col = heat.colors(5), las=1)
The output of the graph is that the bars are stacked on each other instead of being beside each other.
The issue is not reproducible when combine is a data.frame:
combine <- data.frame(
HeartAttack = c(13.4,12.3,16,13,15.2),
HeartFailure = c(11.1,7.3,10.7,8.9,10.8),
Pneumonia = c(11.8,6.8,10,9.9,9.5),
HeartAttack2 = c(18.3,19.3,21.8,21.6,17.3),
HeartFailure2 = c(24,23.3,24.2,23.8,24.6),
Pneumonia2 = c(17.4,19,17,18.4,18.2)
)
combi <- as.matrix(combine)
barplot(combi, main="Top 5 hospitals in California",
ylab="Mortality/Admission Rates", col = heat.colors(5), las=1, beside = TRUE)
I had the same issue earlier (different dataset, though) and resolved it by using as.numeric() on my data frame after I converted it to a matrix with as.matrix(). Leaving as.numeric() out leads to Error in -0.01 * height : non-numeric argument to binary operator.
¯\_(ツ)_/¯
My df called tmp:
> tmp
125 1245 1252 1254 1525 1545 12125 12425 12525 12545 125245 125425
Freq.x.2d "14" " 1" " 1" " 1" " 3" " 2" " 1" " 1" " 9" " 4" " 1" " 5"
Freq.x.3d "13" " 0" " 1" " 0" " 4" " 0" " 0" " 0" "14" " 4" " 1" " 2"
> dim(tmp)
[1] 2 28
> is(tmp)
[1] "matrix" "array" "structure" "vector"
> tmp <- as.matrix(tmp)
> dim(tmp)
[1] 2 28
> is(tmp)
[1] "matrix" "array" "structure" "vector"
> tmp <- as.numeric(tmp)
> dim(tmp)
NULL
> is(tmp)
[1] "numeric" "vector"
barplot(tmp, las=2, beside=TRUE, col=c("grey40","grey80"))
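One caveat with that approach: as.numeric() drops the dim attribute (as the dim(tmp) output above shows), so beside = TRUE then draws one bar per element rather than grouped bars. A hedged sketch (with made-up values) that converts to numeric while keeping the matrix structure:

```r
# Convert a character matrix to numeric without losing dim/dimnames,
# so beside = TRUE still groups the bars by column.
tmp <- matrix(c("14", " 1", "13", " 0"), nrow = 2, byrow = TRUE,
              dimnames = list(c("Freq.x.2d", "Freq.x.3d"), c("125", "1245")))
num <- matrix(as.numeric(tmp), nrow = nrow(tmp), dimnames = dimnames(tmp))
barplot(num, beside = TRUE, las = 2, col = c("grey40", "grey80"))
```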

How to write the proper regular expression to extract value from the string?

> str=" 9.48 12.89 13.9 6.79 "
> strsplit(str,split="\\s+")
[[1]]
[1] "" "9.48" "12.89" "13.9" "6.79"
> unlist(strsplit(str,split="\\s+"))->y
> y[y!=""]
[1] "9.48" "12.89" "13.9" "6.79"
How can I get this with a single regular expression in strsplit(), without having to clean up afterwards with y[y != ""]?
I would just trim the string before splitting it:
strsplit(gsub("^\\s+|\\s+$", "", str), "\\s+")[[1]]
# [1] "9.48" "12.89" "13.9" "6.79"
Alternatively, it is pretty direct to use scan in this case:
scan(text=str)
# Read 4 items
# [1] 9.48 12.89 13.90 6.79
If you want to extract just the numbers, perhaps the following regex would do.
regmatches(str, gregexpr("[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "6.79"
To capture negative numbers you can use the following:
str = " 9.48 12.89 13.9 --6.79 "
regmatches(str, gregexpr("\\-{0,1}[0-9.]+", text = str))[[1]]
## [1] "9.48" "12.89" "13.9" "-6.79"
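Since R 3.2.0 there is also trimws(), so the trim-then-split idea above can be written more compactly:

```r
# trimws() removes the leading/trailing whitespace that would otherwise
# produce an empty first element in the split result.
str <- " 9.48 12.89 13.9 6.79 "
strsplit(trimws(str), "\\s+")[[1]]
# [1] "9.48"  "12.89" "13.9"  "6.79"
```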

Why such large differences in performance between R's by() and lapply()?

I have an xts object containing time series for multiple stock symbols. I need to split the xts object into symbol-specific subgroups, process the data for each symbol, then reassemble all the subgroups into the original xts matrix containing the full set of rows. Each symbol is a field of 1 to 4 characters that is used as the factor to split the matrix into subgroups.
These are the times reported to split my matrix when calling by(), lapply() and ddply():
> dim(ets)
[1] 442750 24
> head(ets)
Symbol DaySec ExchTm LclTm Open High Low Close CloseRet
2011-07-22 09:35:00 "AA" "34500" "09:34:54.697.094" "09:34:54.697.052" " 158100" " 158400" " 157900" " 158200" " 6.325111e-04"
2011-07-22 09:35:00 "AAPL" "34500" "09:34:59.681.827" "09:34:59.681.797" "3899200" "3899200" "3892200" "3894400" "-1.231022e-03"
2011-07-22 09:35:00 "ABC" "34500" "09:34:49.805.994" "09:34:49.806.008" " 400100" " 401800" " 400100" " 401600" " 3.749063e-03"
2011-07-22 09:35:00 "ALL" "34500" "09:34:59.009.001" "09:34:59.008.810" " 285500" " 285500" " 285300" " 285300" "-7.005254e-04"
2011-07-22 09:35:00 "AMAT" "34500" "09:34:59.982.447" "09:34:59.982.423" " 130200" " 130500" " 130200" " 130500" " 2.304147e-03"
2011-07-22 09:35:00 "AMZN" "34500" "09:34:48.012.576" "09:34:48.012.565" "2137400" "2139100" "2137400" "2139100" " 7.953588e-04"
... (15 more columns)
> system.time(by(ets, ets$Symbol, function(x) { return(x) }))
user system elapsed
78.725 0.932 79.735
> system.time(ddply(as.data.frame(ets), "Symbol", function(x) { return (x) }))
user system elapsed
100.590 0.416 101.105
> system.time(lapply(split.default(ets, ets$Symbol), function(x) { return(x) }))
user system elapsed
1.572 0.280 1.853
More information on working with data frame and matrix subgroups is available in this excellent blog post.
Why is there such a large difference in performances when using lapply/split.default?
Working in numeric mode greatly reduces the processing time:
> system.time(by(myxts[,c(1,2,3,4,5)], myxts$Symbol, summary))
user system elapsed
57.768 0.688 58.511
> system.time(by(myxts[,c(1,2,3,4,5,6,7,8)], myxts$Symbol, summary))
user system elapsed
62.284 0.620 62.971
> system.time(by(myxts[,c(1,2,3,4,5,6,7,8, 9, 10, 11, 12)], myxts$Symbol, summary))
user system elapsed
76.529 0.632 77.232
> myxts.numeric = myxts
> mode(myxts.numeric) = "numeric"
Warning message:
In as.double.xts(c("AA", "AAPL", "ABC", "ALL", "AMAT", "AMZN", "BAC", :
NAs introduced by coercion
> system.time(by(myxts.numeric[,c(1,2,3,4,5,6,7,8, 9, 10, 11, 12)], myxts$Symbol, summary))
user system elapsed
4.948 0.688 5.642
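A sketch of the fast pattern the timings point to: keep the data numeric, split the row indices by symbol once, and lapply over the index groups (the symbols and column names here are made up):

```r
# Split row indices by symbol, then apply a function per symbol group;
# this avoids the repeated data-frame subsetting that makes by() slow.
set.seed(1)
syms <- c("AA", "AAPL", "AA", "ABC", "AAPL")
x <- matrix(rnorm(10), ncol = 2,
            dimnames = list(NULL, c("Open", "Close")))
idx <- split(seq_len(nrow(x)), syms)          # one pass over the rows
res <- lapply(idx, function(i) colMeans(x[i, , drop = FALSE]))
names(res)
# [1] "AA"   "AAPL" "ABC"
```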
