I am trying to apply the answer to my prior question about plotting with dates on the x axis to the COVID data from the New York Times, but I get an error message:
require(RCurl)
require(foreign)
require(tidyverse)
counties = read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", sep =",",header = T)
Philadelphia <- counties[counties$county=="Philadelphia",]
Philadelphia <- droplevels(Philadelphia)
rownames(Philadelphia) <- NULL
with(as.data.frame(Philadelphia),plot(date,cases,xaxt="n"))
axis.POSIXct(1,at=Philadelphia$date,
labels=format(Philadelphia$date,"%y-%m-%d"),
las=2, cex.axis=0.8)
# Error in format.default(structure(as.character(x), names = names(x), dim = dim(x), :
# invalid 'trim' argument
The structure of the data already shows a date column:
> str(Philadelphia)
'data.frame': 21 obs. of 6 variables:
$ date : Factor w/ 21 levels "2020-03-10","2020-03-11",..: 1 2 3 4 5 6 7 8 9 10 ...
$ county: Factor w/ 1 level "Philadelphia": 1 1 1 1 1 1 1 1 1 1 ...
$ state : Factor w/ 1 level "Pennsylvania": 1 1 1 1 1 1 1 1 1 1 ...
$ fips : int 42101 42101 42101 42101 42101 42101 42101 42101 42101 42101 ...
$ cases : int 1 1 1 3 4 8 8 10 17 33 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
I tried changing the axis call to
axis.Date(1,Philadelphia$date, at=Philadelphia$date,
labels=format(Philadelphia$date,"%y-%m-%d"),
las=2, cex.axis=0.8)
without success.
I wonder if it has to do with the strange horizontal lines in the plot (as opposed to points):
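The 'invalid trim argument' error comes from format(): date is a factor, so format() falls through to format.default(), whose second argument is trim, and the unnamed "%y-%m-%d" string is matched to it. The horizontal lines have the same cause: with a factor on the x axis, plot() draws a box plot per level rather than points.
I'm not entirely sure what you're doing here, but I would convert date to a date-time object (POSIXct below; Date also works) before plotting the data. You'll also want to use %Y instead of %y, I believe.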
library(dplyr)
counties = read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", sep =",",header = T)
Philadelphia <- counties[counties$county=="Philadelphia",] %>%
mutate(date = as.POSIXct(date, format = '%Y-%m-%d'))
with(Philadelphia, plot(date,cases))
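A minimal sketch of the same idea with Date instead of POSIXct, including the custom axis the question was after. The `toy` data frame and its values are made up for illustration and are not taken from the NYT file.

```r
# Toy stand-in for the Philadelphia subset (values are made up)
toy <- data.frame(
  date  = c("2020-03-10", "2020-03-11", "2020-03-12", "2020-03-13"),
  cases = c(1, 1, 3, 4),
  stringsAsFactors = TRUE   # mimic the factor column from the question
)

# Convert the factor to a real Date; go through character to be explicit
toy$date <- as.Date(as.character(toy$date), format = "%Y-%m-%d")

# Plot without the default x axis, then add one tick per observed date
plot(toy$date, toy$cases, xaxt = "n", xlab = "", ylab = "cases")
axis.Date(1, at = toy$date, format = "%Y-%m-%d", las = 2, cex.axis = 0.8)
```

Once the column is a Date, plot() draws points and axis.Date() can format the tick labels itself, so no separate format() call is needed.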
I'm creating a correlation table using the correlate function in the corrr package. Here is my code and a screenshot of the output.
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The non-numeric variables part of this error message doesn't make sense to me. When I look at the structure I only see integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?
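correlate() returns a tibble whose first column, term, is character, which is why round() fails on the whole table. After getting the tibble from correlate(), apply round() only across the numeric columns: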
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson") %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
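For comparison, a base R sketch of the same approach that needs no extra packages. The `tab` data frame is a made-up stand-in for the correlate() output (one character column plus numeric columns), not the real table.

```r
# Toy stand-in for the output of corrr::correlate() (values are made up)
tab <- data.frame(
  term = c("a", "b"),
  a    = c(NA, 0.12345),
  b    = c(0.12345, NA),
  stringsAsFactors = FALSE
)

# round(tab, 2) would fail here because `term` is not numeric;
# find the numeric columns and round only those
is_num <- vapply(tab, is.numeric, logical(1))
tab[is_num] <- lapply(tab[is_num], round, digits = 2)
tab
```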
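Alternatively, we can lower the number of digits R prints for the whole session (this changes the display only, not the stored values):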
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I am new to R and am writing R code for a personal project/exercise. The data come from a survey on the ethnic identity of people in Hong Kong; I used the 2019 data from http://data.hkupop.hku.hk/v3/hkupop/ethnic_identity/ch.html.
After removing NA values and keeping only the columns I need,
I noticed that the data are highly imbalanced, so I tried under-sampling, ROSE and SMOTE. (The number of observations had dropped sharply, from 1015 to 573.)
I removed the following columns (by index) from the set:
df_f <- df[,-c(1,2,5,6,8,9,11,12,14,15,17,18,20,21,25,26,27,29,32,33,34,35,37)]
However, eth_id is not binary, so I had to merge its levels: 0 = 1 and 3 (Hong Konger and Hong Kong Chinese) and 1 = 2 and 4 (Chinese and Chinese Hong Kong citizen).
How I combined the factor levels:
df_p$eth_id <- recode(df_p$eth_id, "c('1', '3')='1+3';c('2', '4') = '2+4'")
library(plyr)
revalue(df_p$eth_id, c('1+3' = 0)) -> df_p$eth_id
revalue(df_p$eth_id, c('2+4' = 1)) -> df_p$eth_id
0 = Hong Kong Citizen + Hong Kong Chinese Citizen
1 = Chinese Citizen + Chinese Hong Kong Citizen
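For reference, the same merge can be sketched in base R by assigning a named list to levels(). This is an alternative to the recode/revalue calls above, shown on a made-up factor `eth` rather than the survey data.

```r
# Toy factor with the four original answer codes (values are made up)
eth <- factor(c("1", "2", "3", "4", "1", "2"))

# Assigning a named list to levels() merges old levels into new ones:
# "1" and "3" become "0", "2" and "4" become "1"
levels(eth) <- list("0" = c("1", "3"), "1" = c("2", "4"))
table(eth)
```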
How I renamed the columns
df_f <- df_f %>%
rename(
eth_id = Q001,
HongKonger = Q002A,
Chinese = Q003A,
PRC = Q004A,
CH_race = Q005A,
Asian = Q006A,
global = Q007A,
class1 = mid,
housing1 = type,
housing2 = housingv2,
pi = inclin
)
How I processed my NAs and unnecessary outliers
For columns [,2:7], I replaced NAs with 0, for example df_f$HongKonger <- ifelse(is.na(df_f$HongKonger),0,df_f$HongKonger), and so on for the other columns.
For the remaining columns, I removed the rows with NAs like this:
df_p <- na.omit(df_p, cols= c("eth_id","sex","agegp","edugp","occgp","class","class2","housing1","housing2","pi"), invert=FALSE)
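The two NA steps described above can be sketched on a plain data frame as follows. The `toy` data and the split into `score_cols` and `demo_cols` are hypothetical; complete.cases() is used here as one base R way to drop rows based on selected columns only.

```r
# Toy data with NAs in both kinds of columns (values are made up)
toy <- data.frame(
  HongKonger = c(9, NA, 2),
  Chinese    = c(NA, 9, 1),
  sex        = c(1, NA, 2),
  agegp      = c(6, 5, 2)
)

score_cols <- c("HongKonger", "Chinese")  # NAs become 0
demo_cols  <- c("sex", "agegp")           # rows with NAs are dropped

# Step 1: replace NAs with 0 in the score columns
toy[score_cols][is.na(toy[score_cols])] <- 0

# Step 2: keep only rows that are complete in the remaining columns
toy <- toy[complete.cases(toy[demo_cols]), ]

anyNA(toy)  # check that no NA is left
```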
At this point I was left with 14 columns, which I renamed (see above). The final structure of the data I used for ROSE and SMOTE is shown below.
I also removed rows that were outliers, for example:
Remove an unidentifiable ethnic identity (8881, or level = 5)
df_f <- df_f[!df_f$Q001 == "8881",]
table(df_f$Q001)
df_f <- df_f[!df_f$eth_id == "Don't know / hard to say",]
These lines depend on the column name: use Q001 if you run them before the renaming and eth_id if you run them after it.
Now, I keep getting this error when I run ROSE:
Error in [<-.data.frame(*tmp*, , indY, value = c(1L, 1L, 1L, 1L, 1L, : missing values are not allowed in subscripted assignments of data frames.
This is very confusing, because I made sure to remove the NA values completely (all the related questions I found were about NA issues, which do not apply to my data) and I even changed all my factor values to numeric
(because I thought the program might not understand the factor values).
I am also getting this error message for SMOTE: Error in names(dn) <- dnn : attempt to set an attribute on NULL. This makes me even more confused, to the point that I doubt whether the data are suitable for machine learning at all.
Here is the final structure of my data for your reference:
'data.frame': 573 obs. of 14 variables:
$ eth_id : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 1 1 1 ...
$ HongKonger: num 9 0 0 0 0 2 0 2 0 8 ...
$ Chinese : num 9 9 1 3 7 0 7 9 0 0 ...
$ PRC : num 8 9 1 3 7 3 1 0 1 0 ...
$ CH_race : num 12 10 0 3 7 3 0 7 3 4 ...
$ Asian : num 0 7 6 0 0 2 2 0 0 6 ...
$ global : num 0 0 0 0 0 3 7 0 10 0 ...
$ sex : num 1 2 2 1 2 1 1 2 1 2 ...
$ agegp : num 6 5 2 2 6 5 2 4 6 1 ...
$ edugp : num 2 3 2 3 1 2 2 2 3 3 ...
$ class1 : num 3 3 3 5 3 3 4 4 4 3 ...
$ housing1 : num 1 1 2 2 1 2 1 2 1 1 ...
$ housing2 : num 3 3 1 4 3 1 2 1 3 3 ...
$ pi : num 3 2 1 2 1 1 1 4 1 1 ...
- attr(*, "na.action")= 'omit' Named int 14 24 46 52 58 67 77 84 94 129 ...
..- attr(*, "names")= chr "25" "44" "82" "90" ...
#How I divided the data into train and test set
set.seed(123)
index <- createDataPartition(df_p$eth_id, p = 0.7, list = FALSE)
train_data <- df_p[index, ]
test_data <- df_p[-index, ]
head(test_data)
str(train_data)
#How I used ROSE for under-sampling
library(ROSE)
ovun.sample(formula = train_data$eth_id ~ ., data = train_data, method="under", N = 250,seed = 123)$data
How I used ROSE for "both"
ovun.sample(formula = train_data$eth_id ~ . , data = train_data, method="both",
na.action=options("na.omit")$na.action,p=0.5,seed = 123)$data
How I used SMOTE
SMOTE(form = train_data$eth_id ~., data = train_data, perc.over = 100, k = 5, perc.under = 200)
I keep getting:
1) for ROSE: Error in [<-.data.frame(*tmp*, , indY, value = c(1L, 1L, 1L, 1L, 1L, : missing values are not allowed in subscripted assignments of data frames
2) for SMOTE: Error in names(dn) <- dnn : attempt to set an attribute on NULL
I am also unsure whether the analysis is still valid after changing all the factors into numeric values.
Thank you in advance for sharing your knowledge.
Whenever I try to plot across factors, I keep getting an error.
Here is what my data looks like:
str(dataWithNoNa)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ dayType : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
I am trying to plot using the lattice plotting system using Weekday/Weekend as a factor.
Here is what I tried:
plot(dataWithNoNa$steps~ dataWithNoNa$interval | dataWithNoNa$dayType, type="l")
Error in plot.window(...) : need finite 'xlim' values
I even checked to make sure my data had no NAs:
sum(is.na(dataWithNoNa$interval))
## [1] 0
sum(is.na(dataWithNoNa$steps))
## [1] 0
What am I doing wrong?
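Base plot() does not treat | in a formula as a conditioning operator; that syntax belongs to lattice. Load lattice and use xyplot() instead: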
library(lattice)
xyplot(steps ~ interval | factor(dayType), data=df)
Sample data:
df <- data.frame(
steps=c(1.717,0.3396,0.1321,0.1509,0.0755),
interval=c(0,5,10,15,20),
dayType=c(1,1,1,2,2)
)
I am pretty new to R and have a data file that represents a budget. I want to sum up all the amounts for each purpose in the purpose column. That column is automatically converted to a factor when the csv is read in. But how can I assign the right amounts to a purpose that occurs several times in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
The column "Funktion" has 135 levels, each of which corresponds to amounts in "Euro". I want to collect all the "Euro" values for each level of "Funktion" and sum them, so that I get 135 Euro totals and can show what is spent for what purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc.).
I think that 'Euro' should not be a factor but numeric, so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
#   fac    sum_x
# 1   A 7.938613
# 2   B 6.692007
# 3   C 5.645078
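A sketch of the conversion step mentioned above, followed by a base R sum per group with tapply(). It assumes the Euro column of the exported csv uses German number formatting ("." for thousands, "," for decimals), as levels such as "-1.083,0" suggest. The `toy` data frame is made up.

```r
# Toy stand-in for the csv data (values are made up)
toy <- data.frame(
  Funktion = c("(011) A", "(011) A", "(012) B"),
  Euro     = c("-1.083,0", "2.500,5", "10,0"),
  stringsAsFactors = TRUE
)

# Factor -> character -> numeric, assuming German number formatting
euro_chr <- as.character(toy$Euro)
euro_chr <- gsub(".", "", euro_chr, fixed = TRUE)  # drop thousands separators
euro_chr <- sub(",", ".", euro_chr, fixed = TRUE)  # decimal comma -> point
toy$Euro <- as.numeric(euro_chr)

# One sum per level of Funktion
sums <- tapply(toy$Euro, toy$Funktion, sum)
sums
```

Calling as.numeric() directly on the factor would return the internal level codes rather than the amounts, which is why the values go through as.character() first.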
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
# gsub removes every comma, not just the first one in each value
Euro <- as.numeric(gsub(",", "", Euro))
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
I want a graph that looks similar to the example given in the lattice docs:
#EXAMPLE GRAPH, not my data
> barchart(yield ~ variety | site, data = barley,
+ groups = year, layout = c(1,6), stack = TRUE,
+ auto.key = list(points = FALSE, rectangles = TRUE, space = "right"),
+ ylab = "Barley Yield (bushels/acre)",
+ scales = list(x = list(rot = 45)))
I melted my data to obtain this "long" form dataframe:
> str(MDist)
'data.frame': 34560 obs. of 6 variables:
$ fCycle : Factor w/ 2 levels "Dark","Light": 2 2 2 2 2 2 2 2 2 2 ...
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ timepoint: num 1 2 3 4 5 6 7 8 9 10 ...
$ variable : Factor w/ 3 levels "inadist","smldist",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 55.7 75.3 99.2 45.9 73.8 79.3 73.5 69.8 67.6 ...
I want to create a stacked barchart for each groupname and fCycle. I tried this:
barchart(value~timepoint|groupname*fCycle, data=MDist, groups=variable,stack=T)
It doesn't throw any errors, but it's still thinking after 30 minutes. Is this because it doesn't know how to deal with the 36 values that contribute to each bar? How can I make this data easier for barchart to digest?
I don't know lattice well, but could it be because your timepoint variable is numeric, not a factor?
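A small sketch of that suggestion on made-up data: timepoint is turned into a factor, and the values are first averaged over location so that each bar segment is a single number. Averaging is an assumption made for illustration; a sum or another summary may suit the real data better.

```r
library(lattice)

# Toy data in the same long form (values are made up)
set.seed(1)
toy <- expand.grid(
  timepoint = 1:4,
  variable  = c("inadist", "smldist"),
  location  = c("c1", "c2", "c3"),
  groupname = c("rowA", "rowB")
)
toy$value <- runif(nrow(toy), 0, 100)

# Collapse the locations: one value per timepoint, variable and group
agg <- aggregate(value ~ timepoint + variable + groupname,
                 data = toy, FUN = mean)

# Treat timepoint as categorical so barchart() draws one bar per level
agg$timepoint <- factor(agg$timepoint)

p <- barchart(value ~ timepoint | groupname, data = agg,
              groups = variable, stack = TRUE, auto.key = TRUE)
print(p)
```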