convert factor to date in R to create dummy variable - r

I need to create a dummy variable for "before and after 04/11/2020" from the variable "date" in the dataset "counties". There are over a hundred dates in the dataset. I am trying to convert the dates from factor to Date with the as.Date function, but I get NA. Could you please help me find where I am making an error? I have kept the other dummy variable I created, in case it affects the overall outcome.
counties <- read.csv('C:/Users/matpo/Desktop/us-counties.csv')
str(counties)
as.Date(counties$date, format = '%m/%d/%y')
#create dummy variables for New York, New Jersey, California, and Illinois
counties$state = ifelse(counties$state == 'New Jersey' &
counties$state == 'New York'& counties$state == 'California' &
counties$state == 'Illinois', 1, 0)
counties$date = ifelse(counties$date >= "4/11/2020", 1, 0)
str output
$ date : logi NA NA NA NA NA NA ...
$ county: Factor w/ 1774 levels "Abbeville","Acadia",..: 1468 1468 1468 379 1468 1178 379 1468 979 942 ...
$ state : num 0 0 0 0 0 0 0 0 0 0 ...
$ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
$ cases : int 1 1 1 1 1 1 1 1 1 1 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
Thank you!

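The format in as.Date is incorrect: use "%Y" for a four-digit year ("%y" reads only a two-digit year).
The result of as.Date must be assigned back (<-) for the values to change. Otherwise counties$date stays a factor, and comparing a factor with >= returns NA with a warning, which is why the column ends up as logical NA.
"4/11/2020" is just a string; to compare dates, convert it to a Date object as well. You can also avoid ifelse here, because as.integer turns TRUE/FALSE into 1/0.
Try: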
counties$date <- as.Date(counties$date, format = '%m/%d/%Y')
counties$dummy <- as.integer(counties$date >= as.Date('2020-04-11'))
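The steps above can be checked on a small made-up data frame. This is only a sketch: the rows and the column names `after` and `big_state` are invented for illustration, and it assumes the dates are stored as month/day/four-digit-year text. It also builds the state dummy with `%in%`, since a single value can never equal four different states at once, so chaining `==` with `&` always gives 0.

```r
# Toy stand-in for the counties data; values are made up for illustration
counties <- data.frame(
  date  = c("04/10/2020", "04/11/2020", "04/12/2020"),
  state = c("New York", "Texas", "Illinois"),
  stringsAsFactors = TRUE   # mimic read.csv() in R < 4.0, which created factors
)

# as.Date() accepts factors; %Y matches a four-digit year
counties$date <- as.Date(counties$date, format = "%m/%d/%Y")

# 1 on or after 11 April 2020, 0 before
counties$after <- as.integer(counties$date >= as.Date("2020-04-11"))

# 1 if the state is any of the four, 0 otherwise
counties$big_state <- as.integer(
  counties$state %in% c("New Jersey", "New York", "California", "Illinois")
)

counties
```

Writing the dummies to new columns, rather than overwriting `date` and `state`, keeps the original values available for later checks.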

Related

Invalid trim argument in plotting counts over dates in R

I am trying to apply the answer to my prior question on plotting with dates in the x axis to the COVID data in the New York Times but I get an error message:
require(RCurl)
require(foreign)
require(tidyverse)
counties = read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", sep =",",header = T)
Philadelphia <- counties[counties$county=="Philadelphia",]
Philadelphia <- droplevels(Philadelphia)
rownames(Philadelphia) <- NULL
with(as.data.frame(Philadelphia),plot(date,cases,xaxt="n"))
axis.POSIXct(1,at=Philadelphia$date,
labels=format(Philadelphia$date,"%y-%m-%d"),
las=2, cex.axis=0.8)
# Error in format.default(structure(as.character(x), names = names(x), dim = dim(x), :
# invalid 'trim' argument
The data already seem to include a date column:
> str(Philadelphia)
'data.frame': 21 obs. of 6 variables:
$ date : Factor w/ 21 levels "2020-03-10","2020-03-11",..: 1 2 3 4 5 6 7 8 9 10 ...
$ county: Factor w/ 1 level "Philadelphia": 1 1 1 1 1 1 1 1 1 1 ...
$ state : Factor w/ 1 level "Pennsylvania": 1 1 1 1 1 1 1 1 1 1 ...
$ fips : int 42101 42101 42101 42101 42101 42101 42101 42101 42101 42101 ...
$ cases : int 1 1 1 3 4 8 8 10 17 33 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
I tried changing the axis call to
axis.Date(1,Philadelphia$date, at=Philadelphia$date,
labels=format(Philadelphia$date,"%y-%m-%d"),
las=2, cex.axis=0.8)
without success.
I wonder if it has to do with the strange horizontal lines in the plot (as opposed to points):
The 'invalid trim argument' error comes from format. Because date is a factor rather than a date class, your format string "%y-%m-%d" is passed on to format.default, whose second argument is trim, so the string is taken as the trim value.
I'm not entirely sure what you're aiming for here, but I would convert date to a proper date class before plotting the data. I believe you'll also want %Y instead of %y.
library(dplyr)
counties = read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", sep =",",header = T)
Philadelphia <- counties[counties$county=="Philadelphia",] %>%
mutate(date = as.POSIXct(date, format = '%Y-%m-%d'))
with(Philadelphia, plot(date,cases))
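A variant of the same idea that also draws the rotated date labels, sketched on a toy data frame (`phl` is a hypothetical name; the four rows are taken from the str() output in the question). It assumes that the horizontal lines in the original plot came from plotting a factor on the x axis, which makes plot() draw one boxplot per level instead of points.

```r
# Toy stand-in for the Philadelphia subset (first rows of the str() output)
phl <- data.frame(
  date  = c("2020-03-10", "2020-03-11", "2020-03-12", "2020-03-13"),
  cases = c(1, 1, 1, 3),
  stringsAsFactors = TRUE   # mimic read.csv() in R < 4.0
)

# Convert the factor to Date; the text is already in the default %Y-%m-%d form
phl$date <- as.Date(as.character(phl$date))

# With a Date on the x axis, plot() draws points; suppress the default axis
plot(phl$date, phl$cases, xaxt = "n", xlab = "", ylab = "cases")

# Label every date on the x axis, rotated for readability
axis.Date(1, x = phl$date, at = phl$date, format = "%Y-%m-%d",
          las = 2, cex.axis = 0.8)
```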

How to create independent different data.frame in a loop R

Good evening everybody,
I'm stuck on how to construct a for loop. Nothing is broken, but I'd like to understand how to create "independent" data frames (copies with some differences).
I wrote the code step by step (it works), but I think there may be a way to make it more compact with a for loop.
x is my original data.frame
str(x)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
My first goal is to delete, for every column, any NA and "" elements. I do this with these lines of code:
x_b<- x[!(!is.na(x$b) & x$b==""), ]
x_c<- x[!(!is.na(x$c) & x$c==""), ]
x_d<- x[!(!is.na(x$d) & x$d==""), ]
x_e<- x[!(!is.na(x$e) & x$e==""), ]
x_f<- x[!(!is.na(x$f) & x$f==""), ]
After this, the second goal is to create an ID code for each new data.frame, which I build with paste0, for example paste0(x_b$a, x_b$b).
x_b$ID_1<-paste0(x_b$a, x_b$b)
x_c$ID_2<-paste0(x_c$a, x_c$c)
x_d$ID_3<-paste0(x_d$a, x_d$d)
x_e$ID_4<-paste0(x_e$a, x_e$e)
x_f$ID_5<-paste0(x_f$a, x_f$f)
I wrote this for loop to try to reduce the number of lines I use and to make the code easier to read.
z<-data.frame("a", "b","c","d","e","f")
zy<-data.frame("x_b", "x_c", "x_d", "x_e", "x_f")
for(i in z) {
for (j in zy ) {
target <- paste("_",i)
x[[i]]<-(!is.na(x[[i]]) & x[[i]]=="") #with this I am able to create a column in the x data.frame,
#but if I put a new data frame here the for loop doesn't work.
#This changes x itself, but I don't want this. I'd like to create a
#data frame for each transformation.
#At this point of the script I should have new,
#different data frames (x_b, x_c, x_d, x_e, x_f), but I
#don't know
#how to create them.
#If I had these data frames I would apply another function
#in the for loop:
zy[[ID]]<-paste0(x_b$a, "_23X")
}
}
I'd like to have as output this:
str(x_b)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
$ ID: chr "1_23X" "56_23X" "1058_23X" "567_23X" "987_23X" "574_23X" "1001_23X"...
and so on.
I think there is some important concept about data frames that I am missing.
Where am I going wrong?
Thank you so much in advance for the support.
There is a simple way to do this with the tidyverse packages.
First goal (drop_na is in tidyr):
drop_na(df)
You can also use na_if from dplyr first if you want to convert "" to NA.
Second goal: use mutate to create a new variable from column a:
df <- df %>%
mutate(id = paste0(a, "_23X"))
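If one separate data frame per column is really wanted, as in the step-by-step code in the question, a base R sketch is to loop over the column names with lapply and collect the results in a named list. The toy data, the `cols` vector and the list name `frames` are assumptions made for illustration; the sketch assumes a plain data.frame and drops both NA and "" rows, as described in the first goal.

```r
# Toy stand-in for x; values are made up for illustration
x <- data.frame(
  a = c(1, 56, 1058),
  b = c("10", "", "5"),
  c = c(NA, "7", ""),
  stringsAsFactors = FALSE
)

cols <- c("b", "c")  # columns to process, one new data frame per column

frames <- lapply(cols, function(col) {
  # Keep rows where the column is neither NA nor an empty string
  keep <- !is.na(x[[col]]) & x[[col]] != ""
  out <- x[keep, , drop = FALSE]
  # Build the ID from column a and the current column
  out$ID <- paste0(out$a, out[[col]])
  out
})
names(frames) <- paste0("x_", cols)

frames$x_b
```

Keeping the results in a list (`frames$x_b`, `frames$x_c`) avoids creating variables by name inside the loop, and the original `x` is left unchanged.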

R Insert wrong year in class date

I need to insert this row into my dataframe:
new_row<-c("015-06-17","1+-07-24",0,1,">=10")
How can I put these malformed dates into the columns BirthDate and MarriageDate, which are of class Date?
Existing dataframe:
BirthDate MarriageDate Sons Daugther Time
1952-10-05 1980-11-03 1 0 <10
1980-06-14 2002-05-20 0 2 >=10
Expected dataframe:
BirthDate MarriageDate Sons Daugther Time
1952-10-05 1980-11-03 1 0 <10
1980-06-14 2002-05-20 0 2 >=10
015-06-17 1+-07-24 0 1 >=10
I need to put them in the dataframe so that I can correct them after the insert.
The column types in 'df1' are not clear. Assuming that all are 'character', then
rbind(df1, as.list(new_row))
Or, if 'Sons' and 'Daugther' are numeric, then we have to convert the vector elements to their respective classes.
# as.is = TRUE keeps text as character instead of converting it to factor
lst <- lapply(new_row, type.convert, as.is = TRUE)
df2 <- rbind(df1, lst)
df2
# BirthDate MarriageDate Sons Daugther Time
#1 1952-10-05 1980-11-03 1 0 <10
#2 1980-06-14 2002-05-20 0 2 >=10
#3 015-06-17 1+-07-24 0 1 >=10
str(df2)
#'data.frame': 3 obs. of 5 variables:
#$ BirthDate : chr "1952-10-05" "1980-06-14" "015-06-17"
#$ MarriageDate: chr "1980-11-03" "2002-05-20" "1+-07-24"
#$ Sons : int 1 0 0
#$ Daugther : int 0 2 1
#$ Time : chr "<10" ">=10" ">=10"
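If BirthDate and MarriageDate really are of class Date, as the question states, one option is to store them as text first so the malformed values survive unchanged. A minimal sketch under that assumption (the toy `df1` below is rebuilt from the table in the question, and the new row is written as a one-row data frame rather than a vector):

```r
# Toy version of the existing data frame, with real Date columns
df1 <- data.frame(
  BirthDate    = as.Date(c("1952-10-05", "1980-06-14")),
  MarriageDate = as.Date(c("1980-11-03", "2002-05-20")),
  Sons         = c(1L, 0L),
  Daugther     = c(0L, 2L),
  Time         = c("<10", ">=10"),
  stringsAsFactors = FALSE
)

# Store the dates as text so malformed values are kept as typed
df1$BirthDate    <- as.character(df1$BirthDate)
df1$MarriageDate <- as.character(df1$MarriageDate)

# The new row as a one-row data frame with matching column names and types
new_row <- data.frame(
  BirthDate = "015-06-17", MarriageDate = "1+-07-24",
  Sons = 0L, Daugther = 1L, Time = ">=10",
  stringsAsFactors = FALSE
)

df2 <- rbind(df1, new_row)
df2
```

Once the bad entries have been corrected, the two columns can be converted back with as.Date.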

Function to replace values in column

I would like to create new column in data.frame as following:
Data description:
'data.frame': 20 obs. of 3 variables:
$ gvkey : int 1004 1004 1004 1004 1004 1004 1004 1004 1004 1004 ...
$ DEF : int 0 0 0 0 0 0 0 0 0 0 ...
$ FittedRobustRatio: num 0.549 0.532 0.519 0.539 0.531 ...
The function I wrote, which doesn't work:
fun.mark <- function(x,y){
if (x==0) { y[y>0.60] <- "Del"
} else (x==1) {
y[y<0.45] <- "Del2"}}
NewDataFrame <- ddply(ShorterData,~gvkey,transform,Fitcorr=fun.mark(DEF, FittedRobustRatio))
So basically, I want to look into the DEF column: if DEF is 0 and FittedRobustRatio > 0.60, replace the value with "Del"; if DEF is 1 (the column contains only 0 or 1), replace values of FittedRobustRatio below 0.45 with, for example, "Del2". Thanks.
To do this I normally nest ifelse calls. ifelse is laid out like this:
ifelse(condition such as a > b, "x" if the condition is met, "y" if it is not)
So this should work (using your ShorterData and comparing DEF as a number):
ShorterData$new.column <- ifelse(
ShorterData$DEF == 0 & ShorterData$FittedRobustRatio > 0.6, "Del", ifelse(
ShorterData$DEF == 1 & ShorterData$FittedRobustRatio < 0.45, "Del2", "none"))
I could be wrong as you have not provided a reproducible dataframe so I can't test this.
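A quick check of the nested ifelse on a small made-up data frame (the four rows are invented for illustration, and the new column is called `Fitcorr` after the name used in the question):

```r
# Toy stand-in for ShorterData; values are made up for illustration
ShorterData <- data.frame(
  gvkey = c(1004, 1004, 1004, 1004),
  DEF   = c(0, 0, 1, 1),
  FittedRobustRatio = c(0.65, 0.55, 0.40, 0.50)
)

# Outer test handles DEF == 0; the inner ifelse handles DEF == 1;
# rows that match neither rule get "none"
ShorterData$Fitcorr <- ifelse(
  ShorterData$DEF == 0 & ShorterData$FittedRobustRatio > 0.60, "Del",
  ifelse(ShorterData$DEF == 1 & ShorterData$FittedRobustRatio < 0.45,
         "Del2", "none")
)

ShorterData
```

Because ifelse is vectorised over the whole column, no ddply call or per-row function is needed for this step.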

How to sum up numbers in one CSV-column that belong to one factor in another column?

I am pretty new to R and have a data file that represents a budget. I want to sum up all the price tags for one purpose in the purpose column. That purpose gets automatically factored when reading in the csv. But how can I assign the right prices to a purpose with several counts in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
In "Funktion" there are 135 levels, each of which corresponds to several amounts in "Euro". I want to collect all the numbers in "Euro" for each level of "Funktion" and sum them, so that I get 135 Euro values and can show what is spent for which purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc.).
I think that 'Euro' should not be a factor, but numeric - so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
#   fac    sum_x
# 1   A 7.938613
# 2   B 6.692007
# 3   C 5.645078
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
Euro <- as.numeric(gsub(",", "", Euro))  # gsub removes every comma, not only the first
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
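For the CSV route in the question, the Euro levels shown in the str() output (for example "-1.083,0") suggest German number formatting, with "." as the thousands separator and "," as the decimal mark. Under that assumption, a minimal sketch of the conversion and the per-purpose sum looks like this; the toy rows and the shortened Funktion labels are made up for illustration.

```r
# Toy stand-in for the CSV import; values are made up for illustration
ausgaben <- data.frame(
  Funktion = c("(011)", "(011)", "(012)"),
  Euro     = c("-1.083,0", "2.500,5", "10,0"),
  stringsAsFactors = TRUE   # mimic read.csv() in R < 4.0
)

euro_chr <- as.character(ausgaben$Euro)              # factor -> text
euro_chr <- gsub(".", "", euro_chr, fixed = TRUE)    # drop thousands separators
euro_chr <- sub(",", ".", euro_chr, fixed = TRUE)    # decimal comma -> decimal point
ausgaben$Euro <- as.numeric(euro_chr)

# One sum per level of Funktion
sums <- aggregate(Euro ~ Funktion, data = ausgaben, FUN = sum)
sums
```

The order of the two replacements matters: removing the dots first ensures that the only dot left afterwards is the decimal point.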
