I am reading well-structured textual data in R, and in the process of converting from character to numeric, the numbers lose their decimal places.
I have tried using round(digits = 2), but it didn't work since I first had to apply as.numeric. At one point I set options(digits = 2) before the conversion, but that didn't work either.
Ultimately, I want a data.frame whose numbers look exactly the same as they did as characters.
I looked for help here and found answers like this, this, and this; however, none really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Drop the first value: the leading space produced an empty string, which became NA
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
This is just how R displays data in a tibble.
The function map_dfc is not rounding your data; the trailing decimals are only dropped in the tibble's printed display.
If you want to print the data with the usual format, use as.data.frame, like this:
head(as.data.frame(my_char), n = 4)
V1
#>1 246.00
#>2 222.22
#>3 197.98
#>4 135.10
This shows that your data has not been rounded.
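If you do need the two-decimal character form back (say, for printing or export), base R's sprintf() can reproduce it; a minimal sketch:

```r
# Base-R sketch: the numeric values keep full precision internally;
# sprintf() reproduces the two-decimal character form for printing/export.
x <- as.numeric(c("246.00", "222.22", "86.45"))
formatted <- sprintf("%.2f", x)
formatted
#> [1] "246.00" "222.22" "86.45"
```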
Hope this helps.
I have a data frame whose column names are numeric:
asd <- data.frame(`2021`=rnorm(3), `2`=head(letters,3), check.names=FALSE)
But when I reference the column names via a variable, it returns an error:
x = 2021
asd[x]
Error in `[.data.frame`(asd, x) : undefined columns selected
Expected output
x = 2021
asd[x]
2021
1 1.5570860
2 -0.8807877
3 -0.7627930
Reference it as a string:
x = "2021"
asd[,x]
[1] -0.2317928 -0.1895905 1.2514369
Use deparse
asd[,deparse(x)]
[1] 1.3445921 -0.3509493 0.5028844
asd[deparse(x)]
2021
1 1.3445921
2 -0.3509493
3 0.5028844
A bit more detail: numbers without quotes are not syntactically valid because they are parsed as numbers, so you will not be able to refer to them as column names without including quotes.
You can force R to interpret a number as a column name by wrapping it in backticks (a bare asd$2021 is a syntax error):
> asd$`2021`
[1] -0.634175 -1.612425 1.164135
Generally, you can protect yourself against syntactically invalid column names by
#(in base R)
names(asd) <- make.names(names(asd))
names(asd)
[1] "X2021" "X2"
#(or in tidyverse)
asd <- as_tibble(asd, .name_repair="universal")
New names:
* `2021` -> ...2021
* `2` -> ...2
# A tibble: 3 x 2
...2021 ...2
<dbl> <chr>
1 -0.634 a
2 -1.61 b
3 1.16 c
If the value is numeric, just convert it to character with as.character, since column and row names are always stored as character values:
asd[as.character(x)]
2021
1 -0.4438473
2 -0.8904154
3 -0.9319593
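To make the distinction explicit: a bare number indexes columns by position, while a character indexes them by name. A minimal sketch with a hypothetical data frame like asd:

```r
# Minimal sketch of position vs. name indexing with numeric column names
df <- data.frame(`2021` = c(10, 20, 30), `2` = c("a", "b", "c"),
                 check.names = FALSE)
x <- 2021
# df[x] would look up column *position* 2021 -> "undefined columns selected"
df[as.character(x)]  # looks up the column *name* "2021"
```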
I'm trying to use a new R package called waldo (see the tidyverse blog too) that is designed to compare data objects to find differences. The waldo::compare() function returns an object that is, according to the documentation:
a character vector with class "waldo_compare"
The main purpose of this function is to be used within the console, leveraging coloring features to highlight outstanding values that are not equal between data objects. However, while just examining in console is useful, I do want to take those values and act on them (filter them out from the data, etc.). Therefore, I want to programmatically extract the outstanding values. I don't know how.
Example
Generate a vector of length 10:
set.seed(2020)
vec_a <- sample(0:20, size = 10)
## [1] 3 15 13 0 16 11 10 12 6 18
Create a duplicate vector, and add an additional value (4) as an 11th vector element.
vec_b <- vec_a
vec_b[11] <- 4
vec_b <- as.integer(vec_b)
## [1] 3 15 13 0 16 11 10 12 6 18 4
Use waldo::compare() to test the differences between the two vectors
waldo::compare(vec_a, vec_b)
## `old[8:10]`: 12 6 18
## `new[8:11]`: 12 6 18 4
The beauty is that it's highlighted in the console.
But now, how do I extract the different value?
I can try to assign waldo::compare() to an object:
waldo_diff <- waldo::compare(vec_a, vec_b)
And then what? When I try waldo_diff[[1]] I get:
[1] "`old[8:10]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \n`new[8:11]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \033[34m4\033[39m"
and for waldo_diff[[2]] it's even worse:
Error in waldo_diff[3] : subscript out of bounds
Any idea how I could programmatically extract the outstanding values that appear in the "new" vector but not in the "old"?
As a disclaimer, I didn't know anything about this package until you posted, so this is far from an authoritative answer. You can't easily extract the differing values from compare(), because it returns an ANSI-formatted string ready for pretty printing. Instead, the workhorses for vectors seem to be the internal functions ses() and ses_context(), which return the indices of the differences between the two objects; the difference is that ses_context() splits the result into a list of non-contiguous differences.
waldo:::ses(vec_a, vec_b)
# A tibble: 1 x 5
x1 x2 t y1 y2
<int> <int> <chr> <int> <int>
1 10 10 a 11 11
The results show that there is an addition in the new vector beginning and ending at position 11.
The following simple function is very limited in scope and assumes that only additions in the new vector are of interest:
new_diff_additions <- function(x, y) {
res <- waldo:::ses(x, y)
res <- res[res$t == "a",] # keep only additions
if (nrow(res) == 0) {
return(NULL)
} else {
Map(function(start, end) {
d <- y[start:end]
`attributes<-`(d, list(start = start, end = end))
},
res[["y1"]], res[["y2"]])
}
}
new_diff_additions(vec_a, vec_b)
[[1]]
[1] 4
attr(,"start")
[1] 11
attr(,"end")
[1] 11
At least for the simple case of comparing two vectors, you'll be better off using diffobj::ses_dat() (which is from the package that waldo uses under the hood) directly:
waldo::compare(1:3, 2:4)
#> `old`: 1 2 3
#> `new`: 2 3 4
diffobj::ses_dat(1:3, 2:4)
#> op val id.a id.b
#> 1 Delete 1 1 NA
#> 2 Match 2 2 NA
#> 3 Match 3 3 NA
#> 4 Insert 4 NA 3
For completeness, to extract additions you could do e.g.:
extract_additions <- function(x, y) {
ses <- diffobj::ses_dat(x, y)
y[ses$id.b[ses$op == "Insert"]]
}
old <- 1:3
new <- 2:4
extract_additions(old, new)
#> [1] 4
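By the same logic, deletions (values present in old but missing from new) can be pulled from the same ses_dat() table; a sketch mirroring the function above, assuming the diffobj package is installed:

```r
# Sketch: extract values deleted from `x` (the old object), mirroring
# extract_additions() above; "Delete" rows index into the old vector via id.a.
extract_deletions <- function(x, y) {
  ses <- diffobj::ses_dat(x, y)
  x[ses$id.a[ses$op == "Delete"]]
}
extract_deletions(1:3, 2:4)
#> [1] 1
```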
I'm completely new to R. I have a data frame which contains the character string below:
[{\"task\":\"T1\",\"task_label\":\"Draw around the infarct area\n\",\"value\":[{\"tool\":0,\"frame\":0,\"points\":[{\"x\":786,\"y\":139.8},{\"x\":712.3,\"y\":245.3},{\"x\":717.7,\"y\":291.7},{\"x\":804.9,\"y\":335.6},{\"x\":866.1,\"y\":352.7},{\"x\":877.5,\"y\":402.4},{\"x\":866,\"y\":492.9},{\"x\":823.2,\"y\":560.1},{\"x\":765.5,\"y\":603.6},{\"x\":791.8,\"y\":631.7},{\"x\":830.3,\"y\":617.8},{\"x\":846.9,\"y\":618.1},{\"x\":937.1,\"y\":538.5},{\"x\":941.1,\"y\":476.4},{\"x\":983.2,\"y\":443},{\"x\":1020.5,\"y\":338.4},{\"x\":997.1,\"y\":232.7},{\"x\":996.9,\"y\":232.7},{\"x\":921.5,\"y\":145},{\"x\":921.2,\"y\":145},{\"x\":850.6,\"y\":121},{\"x\":850.6,\"y\":120.7},{\"x\":786,\"y\":139.8}],\"details\":[],\"tool_label\":\"Tool name\"}]}]"
I am looking to extract the x and y coordinates and index them. For example:
x1 = 786, x2 = 712.3, x3 = 717.7 etc
y1 = 139.8, y2 = 245.3, y3 = 291.7 etc
I have tried using substring and gsub but have got unstuck.
Ideally, I would create a for loop which reads each number and stores it as a variable.
Any suggestions would be really appreciated! Thanks
Your data looks like a JSON structure. The only precondition is to remove the \n character in "Draw around the infarct area\n". Then this worked on my system:
require(jsonlite)
dt <- fromJSON("[{\"task\":\"T1\",\"task_label\":\"Draw around the infarct area\",\"value\":[{\"tool\":0,\"frame\":0,\"points\":[{\"x\":786,\"y\":139.8},{\"x\":712.3,\"y\":245.3},{\"x\":717.7,\"y\":291.7},{\"x\":804.9,\"y\":335.6},{\"x\":866.1,\"y\":352.7},{\"x\":877.5,\"y\":402.4},{\"x\":866,\"y\":492.9},{\"x\":823.2,\"y\":560.1},{\"x\":765.5,\"y\":603.6},{\"x\":791.8,\"y\":631.7},{\"x\":830.3,\"y\":617.8},{\"x\":846.9,\"y\":618.1},{\"x\":937.1,\"y\":538.5},{\"x\":941.1,\"y\":476.4},{\"x\":983.2,\"y\":443},{\"x\":1020.5,\"y\":338.4},{\"x\":997.1,\"y\":232.7},{\"x\":996.9,\"y\":232.7},{\"x\":921.5,\"y\":145},{\"x\":921.2,\"y\":145},{\"x\":850.6,\"y\":121},{\"x\":850.6,\"y\":120.7},{\"x\":786,\"y\":139.8}],\"details\":[],\"tool_label\":\"Tool name\"}]}]")
(dt[[3]][[1]])[[3]][[1]]
If you want to remove the \n characters with code, you could use a function like str_replace from the stringr package.
As #Jan points out, this is JSON data. But I think it's probably easier to get the data out with regular expressions.
library(stringr)
library(dplyr)
str_extract_all(data,'([xy])[\\\":]+([0-9\\.]+)') %>%
str_extract_all(c("[xy]","[0-9\\.]+")) %>%
bind_cols
# A tibble: 46 x 2
V1 V2
<chr> <chr>
1 x 786
2 y 139.8
3 x 712.3
4 y 245.3
5 x 717.7
6 y 291.7
7 x 804.9
8 y 335.6
9 x 866.1
10 y 352.7
# … with 36 more rows
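A base-R alternative, for completeness: lookbehind regexes with regmatches()/gregexpr() pull the coordinates out directly. A minimal sketch on a shortened version of the string (the full string from the question works the same way):

```r
# Shortened sample of the JSON-like string from the question
json <- '{"x":786,"y":139.8},{"x":712.3,"y":245.3},{"x":717.7,"y":291.7}'
# Lookbehind: grab the number immediately following "x": or "y":
xs <- as.numeric(regmatches(json, gregexpr('(?<="x":)[0-9.]+', json, perl = TRUE))[[1]])
ys <- as.numeric(regmatches(json, gregexpr('(?<="y":)[0-9.]+', json, perl = TRUE))[[1]])
xs
#> [1] 786.0 712.3 717.7
```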
I don't know much about R, and I have variables in a data frame for which I am trying to calculate some stats, with the hope of writing them to a CSV. I have been using a basic for loop, like this:
for(i in x) {
  mean(my_dataframe[, c(i)], na.rm = TRUE)
}
where x is colnames(my_dataframe)
Not every variable is numeric - but when I add a print to the loop, this works fine - it just prints means when applicable, and NA when not. However, when I try to assign this loop to a value (means <- for....), it produces an empty list. Similarly, when I try to directly write the results to a csv, I get an empty csv. Does anyone know why this is happening/how to fix it?
This should work for you. You don't need a loop; just use the summary() function.
summary(cars)
The for loop executes the code inside, but it doesn't put any results together. To do that, you need to create an object to hold the results and explicitly assign each one:
my_means = rep(NA, ncol(my_dataframe))
for(i in seq_along(x)) {
  my_means[i] = mean(my_dataframe[, x[i]], na.rm = TRUE)
}
Note that I have also changed your loop to use i = 1, 2, 3, ... instead of each name.
sapply, as shown in another answer, is a nice shortcut that does the loop and combines the results for you, so you don't need to worry about pre-allocating the result object. It's also smart enough to iterate over columns of a data frame by default.
my_means_2 = sapply(my_dataframe, mean, na.rm = T)
Please give a reproducible example the next time you post a question.
Input below is how I imagine your data looks.
Input:
library(nycflights13)
library(tidyverse)
input <- flights %>% select(origin, air_time, carrier, arr_delay)
input
# A tibble: 336,776 x 4
origin air_time carrier arr_delay
<chr> <dbl> <chr> <dbl>
1 EWR 227. UA 11.
2 LGA 227. UA 20.
3 JFK 160. AA 33.
4 JFK 183. B6 -18.
5 LGA 116. DL -25.
6 EWR 150. UA 12.
7 EWR 158. B6 19.
8 LGA 53. EV -14.
9 JFK 140. B6 -8.
10 LGA 138. AA 8.
# ... with 336,766 more rows
The way I see it, there are 2 ways to do it:
Use summarise_all()
summarise_all() will summarise all your columns, including those that are not numeric.
Method:
input %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 4
origin air_time carrier arr_delay
<dbl> <dbl> <dbl> <dbl>
1 NA 151. NA 6.90
Warning messages:
1: In mean.default(origin, na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default(carrier, na.rm = TRUE) :
argument is not numeric or logical: returning NA
You will get a result and a warning if you were to use this method.
Use summarise_if
summarise_if summarises only the numeric columns; you avoid the warnings this way.
Method:
input %>% summarise_if(is.numeric, funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 2
air_time arr_delay
<dbl> <dbl>
1 151. 6.90
You can then create NA columns for the others if needed.
You can use lapply or sapply for this sort of thing. e.g.
sapply(my_dataframe, mean)
will get you all the means. You can also give it your own function e.g.
sapply(my_dataframe, function(x) sum(x^2 + 2)/4 - 9)
If all variables are not numeric you can use summarise_if from dplyr to get the results just for the numeric columns.
require(dplyr)
my_dataframe %>%
summarise_if(is.numeric, mean)
Without dplyr, you could do
sapply(my_dataframe[sapply(my_dataframe, is.numeric)], mean)
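Since the original goal was writing the stats to a CSV, the per-column means can be exported afterwards; a minimal base-R sketch (my_dataframe and the file name here are placeholders):

```r
# Hypothetical data frame standing in for my_dataframe
my_dataframe <- data.frame(a = c(1, 2, NA), b = c("x", "y", "z"), c = 4:6)
# Non-numeric columns get NA instead of triggering warnings
means <- sapply(my_dataframe,
                function(col) if (is.numeric(col)) mean(col, na.rm = TRUE) else NA)
out <- data.frame(variable = names(means), mean = means)
write.csv(out, file.path(tempdir(), "means.csv"), row.names = FALSE)
```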
If a difftime variable is included in a tibble, and the specified number of observations is equal to the other variable(s), then the class of the variable is maintained.
tibble::tibble(a = c(1,2), b = as.difftime(c(1,2), units = "hours"))
# A tibble: 2 x 2
a b
<dbl> <time>
1 1 1 hours
2 2 2 hours
However, if the specified number of observations in the difftime variable is a proper factor of the number of observations in the other variable, so that the difftime variable is recycled, then the class of the variable silently changes to numeric:
tibble::tibble(a = c(1,2), b = as.difftime(1, units = "hours"))
# A tibble: 2 x 2
a b
<dbl> <dbl>
1 1 1
2 2 1
Does this difference in behaviour occur because tidyverse users are encouraged to use the period or duration objects provided by lubridate to specify times, rather than base R's difftime objects? Or is this an unintended bug?
The same issue occurs when using tibble::data_frame, and dplyr::data_frame, although I believe these may be deprecated in the future.
To be clear, the following calls do not silently change the class of the time-type variable:
tibble::tibble(a = c(1,2), b = lubridate::as.period("1H"))
# A tibble: 2 x 2
a b
<dbl> <S4: Period>
1 1 1H 0M 0S
2 2 1H 0M 0S
tibble::tibble(a = c(1,2), b = lubridate::as.duration("1H"))
# A tibble: 2 x 2
a b
<dbl> <S4: Duration>
1 1 3600s (~1 hours)
2 2 3600s (~1 hours)
The behavior you are seeing stems from something quite peculiar in the vector recycling process during data frame creation. As you already know, objects passed to the data.frame function should have the same number of rows. But atomic vectors will be recycled a whole number of times if necessary. This raises the question as to why the following does not work:
dff <- data.frame(a=c(1,2), b=as.difftime(1, units="hours"))
The code above raises the following error:
Error in data.frame(a = c(1, 2), b = as.difftime(1, units = "hours"))
: arguments imply differing number of rows: 2, 1
It turns out, the reason this does not work is because a vector of difftime objects is not recognized as an atomic vector. You can check with the following:
is.vector(as.difftime(1, units="hours"))
This returns:
[1] FALSE
As a result, when the data.frame function tries to recycle the column b, it first checks whether the column is in fact a vector (with is.vector). Since that returns FALSE, the recycling does not proceed, and hence the error is returned.
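You can watch that check fail directly; a minimal sketch:

```r
# A plain atomic vector passes is.vector(); a difftime does not,
# because its class/units attributes disqualify it.
is.vector(1:2)                       # TRUE
d <- as.difftime(1, units = "hours")
is.vector(d)                         # FALSE
attributes(d)                        # shows class "difftime" and units "hours"
```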
So, the ensuing question is: why not just convert column b with as.vector?
This would actually have been a good idea, except that as.vector removes all attributes, including names, from the resulting vector. You can see that with the following:
as.vector(as.difftime(1, units="hours"))
returns:
[1] 1
All the properties of the difftime object got lost during the coercion process. This leads me to think that the tibble::data_frame function actually uses as.vector somewhere along the process of generating the data_frame. As a result, we see the following behavior:
data_frame(a=c(1,2), b=as.difftime(1, units="hours"))
returns
# A tibble: 2 x 2
a b
<dbl> <dbl>
1 1 1
2 2 1
I guess the conclusion is the same as the one reached by #agstudy: to maintain the difftime object, you may have to use list for column b as follows:
tibble::tibble(a = c(1,2), b = list(as.difftime(1, units = "hours")))
I hope this proves, in some way, useful.
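One more workaround worth noting: base R's rep() has a difftime method that preserves the class, so you can recycle explicitly before building the tibble; a sketch (assumes the tibble package is available):

```r
d <- as.difftime(1, units = "hours")
b <- rep(d, 2)   # explicit recycling keeps the difftime class
class(b)
#> [1] "difftime"
# b is now length 2, so tibble() needs no recycling at all
tibble::tibble(a = c(1, 2), b = b)
```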
I don't think that tibble encourages the use of lubridate (even if I encourage you to use it) to deal with date-like types; it is more an issue of how vectors are created internally when you recycle. In fact, you can reproduce the same recycling behavior when you play with c and list. For example, using c you will lose the typing:
c(as.difftime(c(1), units = "hours"),1)
### Time differences in hours
### [1] 1 1
But using list will keep the time-difference type:
list(as.difftime(c(1), units = "hours"),2)
# [[1]]
# Time difference of 1 hours
#
# [[2]]
# [1] 2
Using list with tibble, you "conserve" the class type:
tibble::tibble(a = c(1,2),
b = list(as.difftime(c(1), units = "hours")))
# A tibble: 2 x 2
# a b
# <dbl> <list>
# 1 1 <time [1]>
# 2 2 <time [1]>
But this is hard to manipulate later. Better to use lubridate in this case.