I'm new to R and can't figure this out. I have a character vector, and when I place it into a data.frame it changes to "factor":
> name <- c("Ann","Bob", "Carl", "Dan","Ed")
> class(name)
[1] "character" # Expected this.
> wt <- c(123,234,222,199,201)
> class(wt)
[1] "numeric" # Expected this.
> a <- data.frame(name, wt)
> class(a$wt)
[1] "numeric" # Expected this.
> class(a$name)
[1] "factor" # ???
I am not sure why this is happening.
As mentioned in the comments, use stringsAsFactors = FALSE when creating your data.frame:
str(data.frame(name, wt, stringsAsFactors = FALSE))
# 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
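In R versions before 4.0.0, the default is stringsAsFactors = TRUE; since R 4.0.0 the default is FALSE, so character columns stay as characters. In older versions the default can be changed at startup, but you may not want to do this for compatibility with other people's scripts.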
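As a minimal sketch of the point above (base R only, reusing the name and wt vectors from the question), passing stringsAsFactors explicitly gives the same column class regardless of which default your R version uses, and an existing factor column can be converted back with as.character:

```r
name <- c("Ann", "Bob", "Carl", "Dan", "Ed")
wt <- c(123, 234, 222, 199, 201)

# Being explicit avoids depending on the version-specific default.
a_chr <- data.frame(name, wt, stringsAsFactors = FALSE)
a_fct <- data.frame(name, wt, stringsAsFactors = TRUE)

class(a_chr$name)  # character column
class(a_fct$name)  # factor column

# A factor column can be converted back after the data.frame exists.
a_fct$name <- as.character(a_fct$name)
class(a_fct$name)
```

Converting with as.character returns the labels, so nothing is lost when going from factor back to character here.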
Some other packages that build upon data.frames have different default behavior. For instance, consider data.table from the "data.table" package or data_frame from the "dplyr" package (data_frame has since been deprecated in favor of tibble()):
library(data.table)
str(data.table(name, wt))
# Classes ‘data.table’ and 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
# - attr(*, ".internal.selfref")=<externalptr>
library(dplyr)
str(data_frame(name, wt))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
Basically what the title says! I have columns like name, age, year, average_points, average_steals, average_rebounds etc. But all the average columns (there are a lot) are stored as characters. Thanks!
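First I created some random data. You can mutate across the columns whose names start with "average" (selected with starts_with) and convert them with as.integer. Note that as.integer drops any fractional part, so use as.numeric instead if the averages contain decimals. You can use the following code: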
df <- data.frame(name = c("A", "B"),
age = c(10, 51),
year = c(2001, 1980),
average_points = c("3", "5"),
average_steals = c("4","6"),
average_bounds = c("6","7"))
str(df)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: chr "3" "5"
#> $ average_steals: chr "4" "6"
#> $ average_bounds: chr "6" "7"
library(dplyr)
library(tidyr)
result <- df %>%
mutate(across(starts_with("average"), as.integer))
str(result)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: int 3 5
#> $ average_steals: int 4 6
#> $ average_bounds: int 6 7
Created on 2022-07-20 by the reprex package (v2.0.1)
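For comparison, here is a rough base R sketch of the same idea without dplyr. The small data frame and its decimal values are made up for illustration, and as.numeric is used instead of as.integer on the assumption that real averages may contain decimals:

```r
# Hypothetical data: average_* columns stored as character.
df2 <- data.frame(name = c("A", "B"),
                  average_points = c("3.5", "5"),
                  average_steals = c("4", "6.25"),
                  stringsAsFactors = FALSE)

# Pick the columns whose names start with "average".
avg_cols <- grep("^average", names(df2), value = TRUE)

# Convert each selected column; as.numeric keeps the decimals.
df2[avg_cols] <- lapply(df2[avg_cols], as.numeric)
str(df2)

# as.integer would truncate instead:
as.integer("3.5")
```

Assigning back into df2[avg_cols] keeps the other columns untouched, which is the same effect as mutate(across(...)).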
I keep running into this error and I can't find any apparent problem in the code.
library(tidyverse)
library(ggplot2)
require(data.table)
data <- as.data.frame(fread("MyND_merged.tsv"))
g <- ggplot(data = data, aes(x = study, y = value)) +
geom_boxplot() +
facet_wrap(facets = ~type, scale ='free') +
ggpubr::compare_means()
The error says
Error in inherits(data, "data.frame") :
argument "data" is missing, with no default
data is defined in the code, I believe. Would someone please help me solve this error?
Thank you
> str(data)
'data.frame': 4266 obs. of 9 variables:
$ V1 : int 0 1 2 3 4 5 6 7 8 9 ...
$ sample : chr "AANDS0002-01" "AANDS0002-01" ...
$ type : chr "index1" "index 2" ...
$ value : num 0.0122 0.9729 ...
$ donor_id: chr "AANDS0002" "AANDS0002" ...
$ gender : chr "M" "M" "M" "M" ...
$ age : int 80 80 80 80 80 80 75 75 75 75 ...
$ disease : chr "name1" "name2" ...
$ study : chr "mynd_2" "mynd_2" "mynd_2" "mynd_2" ...
I am trying to figure out how to get data into R so that I can turn it into a table and store it in a database such as SQL.
API <- "https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/{2020-01-01}/{2020-06-30}"
oxford_covid <- GET(API)
I then try to parse this data and make it into a data frame, but when I do so I get these errors:
"Error: Columns 4, 5, 6, 7, 8, and 178 more must be named.
Use .name_repair to specify repair." and "Error: Tibble columns must have compatible sizes. * Size 2: Columns deaths, casesConfirmed, and stringency. * Size 176: Columns ..2020.12.27, ..2020.12.28, ..2020.12.29, and"
I am not sure if there is a better approach or how to parse this. Is there a method or approach? I am not having much luck online.
It looks like you're trying to take the JSON return from that API and call read.table or something on it. Don't do that, JSON should be parsed by JSON tools (such as jsonlite::parse_json).
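To illustrate what parse_json hands back without touching the network, here is a small sketch on a literal JSON string. The string is a made-up miniature that only imitates the shape of the API response (scale, countries, data); it is not real data from the service:

```r
library(jsonlite)

# Hypothetical miniature of the response structure.
txt <- '{
  "scale": {"deaths": {"min": 0, "max": 10}},
  "countries": ["ABW", "AFG"],
  "data": {"2020-01-01": {"ABW": {"confirmed": null, "stringency": 0}}}
}'

js_small <- parse_json(txt)

# Top-level elements and their lengths.
lengths(js_small)

# JSON null arrives in R as NULL inside the nested list.
is.null(js_small$data[["2020-01-01"]]$ABW$confirmed)
```

The result is a plain nested list, which is why it needs reshaping before it can become a data.frame.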
Here is some exploration of what that URL returns.
js <- jsonlite::parse_json(url("https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/2020-01-01/2020-06-30"))
lengths(js)
# scale countries data
# 3 183 182
str(js, max.level = 2, list.len = 3)
# List of 3
# $ scale :List of 3
# ..$ deaths :List of 2
# ..$ casesConfirmed:List of 2
# ..$ stringency :List of 2
# $ countries:List of 183
# ..$ : chr "ABW"
# ..$ : chr "AFG"
# ..$ : chr "AGO"
# .. [list output truncated]
# $ data :List of 182
# ..$ 2020-01-01:List of 183
# ..$ 2020-01-02:List of 183
# ..$ 2020-01-03:List of 183
# .. [list output truncated]
So this is rather large. Since you're hoping for a data.frame, I'm going to look at js$data only; js$countries looks relatively uninteresting,
str(unlist(js$countries))
# chr [1:183] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "AUS" "AUT" "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD" "COG" "COL" "CPV" ...
and does not correlate with the js$data. The js$scale might be interesting, but I'll skip it for now.
My first go-to for joining data like this into a data.frame is one of the following, depending on your preference for R dialects:
do.call(rbind.data.frame, list_of_frames) # base R
dplyr::bind_rows(list_of_frames) # tidyverse
data.table::rbindlist(list_of_frames) # data.table
But we're going to run into problems. Namely, there are entries that are NULL, when R would prefer that they be something (such as NA).
str(js$data[[1]][1])
# List of 2
# $ ABW:List of 8
# ..$ date_value : chr "2020-01-01"
# ..$ country_code : chr "ABW"
# ..$ confirmed : NULL # <--- problem
# ..$ deaths : NULL
# ..$ stringency_actual : int 0
# ..$ stringency : int 0
# ..$ stringency_legacy : int 0
# ..$ stringency_legacy_disp: int 0
So we need to iterate over each of those and replace NULL with NA. Unfortunately, I don't know of an easy tool to recursively go through lists of lists (even rapply doesn't work well in my tests), so we'll be a little brute-force here with a triple-lapply.
Long story short:
str(js$data[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : NULL
# $ deaths : NULL
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
jsdata <-
lapply(js$data, function(z) {
lapply(z, function(y) {
lapply(y, function(x) if (is.null(x)) NA else x)
})
})
str(jsdata[[1]][[1]])
# List of 8
# $ date_value : chr "2020-01-01"
# $ country_code : chr "ABW"
# $ confirmed : logi NA
# $ deaths : logi NA
# $ stringency_actual : int 0
# $ stringency : int 0
# $ stringency_legacy : int 0
# $ stringency_legacy_disp: int 0
(Technically, if we know that it's going to be integers, we should use NA_integer_. Fortunately, R and its dialects are able to work with this shortcut, as we'll see in a second.)
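If the nesting depth is not known in advance, one possible alternative to the triple-lapply above is a small recursive helper. This is only a sketch: null_to_na is a hypothetical name, the nested list is made up, and it assumes plain nested lists (not data frames) all the way down:

```r
# Hypothetical helper: walk a nested list and swap NULL for NA.
null_to_na <- function(x) {
  if (is.null(x)) return(NA)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

# Made-up miniature of js$data: date -> country -> fields.
nested <- list(
  `2020-01-01` = list(
    ABW = list(country_code = "ABW", confirmed = NULL, stringency = 0L)
  )
)

cleaned <- null_to_na(nested)
str(cleaned)
```

Because lapply keeps names and length, the structure is unchanged apart from the replaced entries.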
After that, we can do a double-dive rbinding and get back to the frame-making I discussed a couple of steps ago. Choose one of the following, whichever dialect you prefer:
alldat <- do.call(rbind.data.frame,
lapply(jsdata, function(z) do.call(rbind.data.frame, z)))
alldat <- dplyr::bind_rows(purrr::map(jsdata, dplyr::bind_rows))
alldat <- data.table::rbindlist(lapply(jsdata, data.table::rbindlist))
For simplicity, I'll show the first (base R) version:
tail(alldat)
# date_value country_code confirmed deaths stringency_actual stringency stringency_legacy stringency_legacy_disp
# 2020-06-30.AND 2020-06-30 AND 855 52 42.59 42.59 65.47 65.47
# 2020-06-30.ARE 2020-06-30 ARE 48667 315 72.22 72.22 83.33 83.33
# 2020-06-30.AGO 2020-06-30 AGO 284 13 75.93 75.93 83.33 83.33
# 2020-06-30.ALB 2020-06-30 ALB 2535 62 68.52 68.52 78.57 78.57
# 2020-06-30.ABW 2020-06-30 ABW 103 3 47.22 47.22 63.09 63.09
# 2020-06-30.AFG 2020-06-30 AFG 31507 752 78.70 78.70 76.19 76.19
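To make the double rbind easier to follow, here is the base R version applied to a tiny made-up list with the same date -> country -> fields shape (values are invented, and only four of the fields are included):

```r
# Hypothetical miniature of jsdata after the NULL -> NA step.
jsdata_small <- list(
  `2020-01-01` = list(
    ABW = list(date_value = "2020-01-01", country_code = "ABW", confirmed = NA, stringency = 0),
    AFG = list(date_value = "2020-01-01", country_code = "AFG", confirmed = NA, stringency = 0)
  ),
  `2020-01-02` = list(
    ABW = list(date_value = "2020-01-02", country_code = "ABW", confirmed = 1L, stringency = 11.11),
    AFG = list(date_value = "2020-01-02", country_code = "AFG", confirmed = 2L, stringency = 8.33)
  )
)

# Inner step: one data.frame per date, one row per country.
per_day <- lapply(jsdata_small, function(z) do.call(rbind.data.frame, z))

# Outer step: stack the per-date frames.
alldat_small <- do.call(rbind.data.frame, per_day)
alldat_small
```

The logical NA column from the first date is combined with the integer values from the second date when the frames are stacked, which is why the plain NA shortcut works. Depending on the R version, the character columns may come back as factors.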
And if you're curious about the $scale,
do.call(rbind.data.frame, js$scale)
# min max
# deaths 0 127893
# casesConfirmed 0 2633466
# stringency 0 100
## or
data.table::rbindlist(js$scale, idcol="id")
# id min max
# <char> <int> <int>
# 1: deaths 0 127893
# 2: casesConfirmed 0 2633466
# 3: stringency 0 100
## or
dplyr::bind_rows(js$scale, .id = "id")
I have a dataset with 22 animals. Each animal is named as follows: c(" Shark1", "Shark2", "Shark3", ...) etc.
I am trying to plot two categorical variables against each other to determine the proportion of time each shark spent at separate depths:
Sharks<-table(merge$DepthCat, merge$ID2) #Depth category vs. ID
merge$DepthCat[merge$Depth2>200]<-"4"
Sharks<-table(merge$DepthCat, merge$ID2)
plot(t(Sharks), main="",
col=c("whitesmoke", "slategray3", "slategray", "slategray4"),
ylab="Depth category", xlab="Month")
axis(side=4)
While the plot works, the sharks are plotted in alphabetical rather than numerical order.
Does anyone know how to resolve this for the plot? I have researched the array method but am unsure how it would be implemented here.
You didn't provide your complete data set, so I generated my own random data. Given that the bar headers derived from ID2 are sorting lexicographically, I assumed they are stored as characters in your data.frame merge, so I generated them thusly.
set.seed(2L);
NR <- 300L;
merge <- data.frame(ID2=sample(as.character(1:22),NR,TRUE),Depth2=pmax(0,rnorm(NR,100,50)),stringsAsFactors=FALSE);
merge$DepthCat <- as.character(findInterval(merge$Depth2,c(0,66,133,200)));
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : chr "5" "16" "13" "4" ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
And sure enough, we can reproduce the problem with this test data:
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
The solution is to coerce the ID2 vector to numeric so it sorts numerically.
merge$ID2 <- as.integer(merge$ID2);
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : int 5 16 13 4 21 21 3 19 11 13 ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
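The ordering difference can be seen without plotting, since table() sorts character values lexicographically. The sketch below also covers the case where the IDs carry a text prefix such as "Shark1" (as in the question) and cannot be coerced directly; there, one option is a factor whose levels are ordered by the numeric part. The example vectors are made up:

```r
ids_chr <- as.character(c(1, 2, 10, 11, 3))

# Character IDs: lexicographic order.
names(table(ids_chr))

# Integer IDs: numeric order.
names(table(as.integer(ids_chr)))

# Prefixed labels: order the factor levels by the digits in each label.
shark <- c("Shark1", "Shark2", "Shark10", "Shark11", "Shark3")
shark_num <- as.integer(gsub("[^0-9]", "", shark))
shark_f <- factor(shark, levels = unique(shark[order(shark_num)]))
names(table(shark_f))
```

table() and plot() follow the factor's level order, so the columns then appear as Shark1, Shark2, Shark3, Shark10, Shark11.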
I am having problems creating a boxplot of my data, because one of my variables is in the form of a list.
I am trying to create a boxplot:
boxplot(dist~species, data=out)
and received the following error:
Error in model.frame.default(formula = dist ~ species, data = out) :
invalid type (list) for variable 'species'
I have been unsuccessful in forcing 'species' into the form of a factor:
out[species]<- as.factor(out[[out$species]])
and receive the following error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
How can I convert my 'species' column into a factor which I can then use to create a boxplot? Thanks.
EDIT:
str(out)
'data.frame': 4570 obs. of 6 variables:
$ GridRef : chr "NT73" "NT80" "NT85" "NT86" ...
$ pred : num 154 71 81 85 73 99 113 157 92 85 ...
$ pred_bin : int 0 0 0 0 0 0 0 0 0 0 ...
$ dist : num 20000 10000 9842 14144 22361 ...
$ years_since_1990: chr "21" "16" "21" "20" ...
$ species :List of 4570
..$ : chr "C.splendens"
..$ : chr "C.splendens"
..$ : chr "C.splendens"
.. [list output truncated]
It's hard to imagine how you got the data into this form in the first place, but it looks like
out <- transform(out,species=unlist(species))
should solve your problem.
set.seed(101)
f <- as.list(sample(letters[1:5],replace=TRUE,size=100))
## need I() to make a wonky data frame ...
d <- data.frame(y=runif(100),f=I(f))
str(d)
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 0.125 0.0233 0.3919 0.8596 0.7183 ...
## $ f:List of 100
## ..$ : chr "b"
## ..$ : chr "a"
boxplot(y~f,data=d) ## invalid type (list) ...
d2 <- transform(d,f=unlist(f))
boxplot(y~f,data=d2)
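One caveat worth checking before unlist(): it only lines up with the rows if every list element holds exactly one value. A small sketch of that check, on made-up data, with an explicit factor() since the question asked for a factor:

```r
# Made-up list column, built the same way as above.
f <- as.list(c("b", "a", "b", "c"))
d <- data.frame(y = c(0.1, 0.2, 0.3, 0.4), f = I(f))

# Guard: unlist() would change the length if any element were
# empty or held more than one value.
stopifnot(all(lengths(d$f) == 1L))

d2 <- transform(d, f = factor(unlist(f)))
class(d2$f)
levels(d2$f)
```

If the check fails, the offending rows can be found with which(lengths(d$f) != 1L) and handled before converting.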