I need to insert this row into my dataframe:
new_row <- c("015-06-17", "1+-07-24", 0, 1, ">=10")
How can I put these malformed dates into the columns BirthDate and MarriageDate, which are of class Date?
Existing dataframe:
BirthDate   MarriageDate  Sons  Daugther  Time
1952-10-05  1980-11-03    1     0         <10
1980-06-14  2002-05-20    0     2         >=10
Expected dataframe:
BirthDate   MarriageDate  Sons  Daugther  Time
1952-10-05  1980-11-03    1     0         <10
1980-06-14  2002-05-20    0     2         >=10
015-06-17   1+-07-24      0     1         >=10
I need to put them in the dataframe first so I can correct them after the insert.
It is not clear what the column types in 'df1' are. Assuming they are all 'character', then
rbind(df1, as.list(new_row))
Or, if 'Sons' and 'Daugther' are numeric, then we have to convert the vector elements to their respective classes.
# type.convert() guesses each element's class; factors are turned back into character
lst <- lapply(new_row, function(x) {
  x1 <- type.convert(x)
  if (is.factor(x1)) as.character(x1) else x1
})
df2 <- rbind(df1, lst)
df2
# BirthDate MarriageDate Sons Daugther Time
#1 1952-10-05 1980-11-03 1 0 <10
#2 1980-06-14 2002-05-20 0 2 >=10
#3 015-06-17 1+-07-24 0 1 >=10
str(df2)
#'data.frame': 3 obs. of 5 variables:
#$ BirthDate : chr "1952-10-05" "1980-06-14" "015-06-17"
#$ MarriageDate: chr "1980-11-03" "2002-05-20" "1+-07-24"
#$ Sons : int 1 0 0
#$ Daugther : int 0 2 1
#$ Time : chr "<10" ">=10" ">=10"
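Once the wrong entries have been corrected, the character columns can be turned back into Date. A minimal sketch, assuming the corrected values end up in the standard %Y-%m-%d form:
# convert the corrected character columns back to Date class
df2$BirthDate <- as.Date(df2$BirthDate, format = "%Y-%m-%d")
df2$MarriageDate <- as.Date(df2$MarriageDate, format = "%Y-%m-%d")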
q7 <- dbGetQuery(conn,
"SELECT TailNum AS TailNum, AVG(ontime.DepDelay) AS avg_delay, ontime.Year AS Year, planes.Year AS yearmade
FROM planes JOIN ontime USING(tailnum)
WHERE ontime.Cancelled = 0 AND planes.Year != '' AND planes.Year != 'None' AND ontime.Diverted = 0 AND ontime.DepDelay > 0
GROUP BY TailNum
ORDER BY avg_delay")
Code that I have tried:
q7 <- data.frame(
yearmade = q7.yearmade, stringsAsFactors = FALSE)
^ Dataframe
Hi! Basically I would like to create a new column holding Year minus yearmade, but before I can do that, I found out that the data I pull from the other table into this dataframe (yearmade) comes through as character. Is there any way to change it but retain the original data?
First use as.numeric() to change yearmade into a numeric variable. Then you can simply compute the difference between Year and yearmade.
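A minimal sketch of that, assuming q7 is the data frame returned by dbGetQuery above (yearmade_num is just an illustrative name for the new column):
q7$yearmade_num <- as.numeric(q7$yearmade)   # the original character column is kept untouched
q7$year_diff    <- q7$Year - q7$yearmade_num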
I believe this will work for you.
set.seed(1)
Year <- 2000:2022
yearmade <- sample(c('2000', '1999', '1998'), length(Year), replace = TRUE)
TailNum <- sample(c('N3738B', 'N3737C', 'N37342'), length(Year), replace = TRUE)
avg_delay <- 1:length(Year)
q7 <- data.frame(TailNum, avg_delay, Year, yearmade)
# compute difference and add to data frame
q7$year_diff <- q7$Year - as.numeric(q7$yearmade)
This retains the original data, but introduces a new column year_diff.
> str(q7)
'data.frame': 23 obs. of 5 variables:
$ TailNum : chr "N3738B" "N3738B" "N3737C" "N3738B" ...
$ avg_delay: int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
$ yearmade : chr "2000" "1998" "2000" "1999" ...
$ year_diff: num 0 3 2 4 4 7 8 8 9 11 ...
I need to create a dummy variable for "before and after 04/11/2020" for the variable "date" in the dataset "counties". There are over a hundred dates in the dataset. I am trying to convert the dates from factor to Date with the as.Date function, but I get NA. Could you please help me find where I am making an error? I kept the other dummy variable I created just in case it affects the overall outcome.
counties <- read.csv('C:/Users/matpo/Desktop/us-counties.csv')
str(counties)
as.Date(counties$date, format = '%m/%d/%y')
# create dummy variables for New York, New Jersey, California, and Illinois
counties$state = ifelse(counties$state == 'New Jersey' &
                        counties$state == 'New York' &
                        counties$state == 'California' &
                        counties$state == 'Illinois', 1, 0)
counties$date = ifelse(counties$date >= "4/11/2020", 1, 0)
str output:
$ date : logi NA NA NA NA NA NA ...
$ county: Factor w/ 1774 levels "Abbeville","Acadia",..: 1468 1468 1468 379 1468 1178 379 1468 979 942 ...
$ state : num 0 0 0 0 0 0 0 0 0 0 ...
$ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
$ cases : int 1 1 1 1 1 1 1 1 1 1 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
Thank you!
You have an incorrect format in as.Date; you should use "%Y" for a 4-digit year.
You need to assign the values back (<-) for the values to change.
"4/11/2020" is just a string; if you are comparing dates, you need to convert it to a Date object. Also, you can avoid ifelse here.
Try:
counties$date <- as.Date(counties$date, format = '%m/%d/%Y')
counties$dummy <- as.integer(counties$date >= as.Date('2020-04-11'))
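A side note, not part of the fix above: the state dummy in the question can never be 1, because a single value cannot equal four different states at once; %in% is the usual idiom. A sketch, assuming state still holds the original state names (state_dummy is a made-up column name):
counties$state_dummy <- as.integer(counties$state %in%
  c('New Jersey', 'New York', 'California', 'Illinois'))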
Suppose I have a dataframe:
> str(data)
'data.frame': 2538 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ SessionID: int 13307 21076 27813 8398 23118 12256 28799 11457 7542 19261 ...
$ Timestamp: POSIXct, format: "2014-04-06 18:42:05" "2014-04-03 15:27:48" "2014-04-04 09:10:14" "2014-04-03 23:39:20" ...
$ ItemID : int 214684513 214718203 214716928 214826900 214838180 214717318 214821307 214537967 214835775 214706432 ...
$ Price : int 0 0 0 0 0 0 0 0 0 0 ...
and I want to count the total occurrences of each SessionID and get each session's start and end time. I mean, I want output like this:
> data
  session id  timestamp                price
  1           2014-04-06 18:42:05.822  0
  1           2014-04-06 18:42:06.800  1
  1           2014-04-06 18:42:06.820  0
  2           2014-04-03 15:27:48.118  0
  2           2014-04-03 15:27:49.440  0
> result
  session id  session start and end time                        num of occurrence
  1           2014-04-06 18:42:05.822, 2014-04-06 18:42:06.820  3
  2           2014-04-03 15:27:48.118, 2014-04-03 15:27:49.440  2
The data.table way:
library(data.table)
setDT(data)
data[, .(session_start  = min(Timestamp),
         session_end    = max(Timestamp),
         num_occurrence = .N),
     by = SessionID]   # the column is SessionID, per the str() output above
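For comparison, a dplyr version of the same aggregation (a sketch, assuming the column names shown in the str() output above):
library(dplyr)
result <- data %>%
  group_by(SessionID) %>%
  summarise(session_start  = min(Timestamp),
            session_end    = max(Timestamp),
            num_occurrence = n())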
I am pretty new to R and have a data file that represents a budget. I want to sum up all the prices for each purpose in the purpose column. The purpose column gets automatically converted to a factor when reading in the csv. But how can I assign the right prices to a purpose that appears several times in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
"Funktion" has 135 levels which correspond to amounts in "Euro". I want to take all the numbers in "Euro" for each corresponding level of "Funktion" and sum them, so that I get 135 Euro values and can show what is spent for each purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc.).
I think that 'Euro' should not be a factor, but numeric - so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
#   fac    sum_x
# 1   A 7.938613
# 2   B 6.692007
# 3   C 5.645078
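Applied to the actual data, a sketch that assumes the csv export uses German number formatting ("." as thousands separator, "," as decimal separator, as factor levels like "-1.083,0" suggest):
# convert Euro from factor to numeric first
euro_chr <- as.character(ausgaben$Euro)
ausgaben$Euro <- as.numeric(gsub(",", ".", gsub(".", "", euro_chr, fixed = TRUE)))
# then sum per Funktion
ddply(ausgaben, .(Funktion), summarise, sum_Euro = sum(Euro))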
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
Euro <- as.numeric(sub(",", "", Euro))
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
I'm trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x. But not in the pure sense because I know that x contains more IDs than y but I don't know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What's more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn't. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck, and it looks like I should give up on merge as well; perhaps it's better to write a loop or function to do something along these lines:
for every observation in x
id2 = which(id2) corresponds to id-month-year
flag = 1 if length of above is == 1, 0 otherwise
etc.
Hopefully this all makes sense. I'd be very grateful for any help or guidance.
If you are looking for which things in x$id are in y$id2, then you can use
x$id %in% y$id2
to get a logical vector returning matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame
x$match.y <- x$id %in% y$id2
to see what rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could do something like
y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
to filter out elements that appear more than once in y$id2. You can also add this to x:
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
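A sketch of that mirrored check on y, using the same idiom (match.x and match.x.unique are just illustrative names):
y$match.x        <- y$id2 %in% x$id
y$match.x.unique <- (y$id2 %in% x$id) & !(y$id2 %in% x$id[duplicated(x$id)])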
The reason your merge failed was that you gave it two different structures (one a numeric matrix and the other a character matrix) for x and y. Using cbind when data.frame should be chosen is a common strategy for failure.
> str(x)
chr [1:110, 1:4] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "id" "X1" "month" "year"
> str(y)
num [1:11, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "id2" "Y1"
If you used the data.frame function (since dataframes are what merge is supposed to be working with) it would have succeeded:
> x <- data.frame(id, X1, month, year); y <- data.frame(id2,Y1)
> str( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
'data.frame': 111 obs. of 5 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ X1 : num 1.5063 2.5035 0.7889 -0.4907 -0.0446 ...
$ month: Factor w/ 10 levels "Apr","Aug","Dec",..: 6 6 2 2 10 10 9 9 8 8 ...
$ year : Factor w/ 5 levels "2001","2002",..: 3 3 4 4 5 5 1 1 2 2 ...
$ Y1 : num 1.449 1.449 -0.134 -0.134 -0.828 ...
> tail( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
id X1 month year Y1
106 96 -0.3869157 Dec 2004 NA
107 97 0.6373009 Jan 2005 NA
108 98 -0.7735626 Feb 2006 NA
109 99 -1.3537915 Mar 2001 NA
110 100 0.2626190 Apr 2002 NA
111 200 NA <NA> <NA> -1.509818
If you have duplicates in your 'x' argument, then you should get duplicates in the result. It's then your responsibility to use !duplicated in whatever manner you deem appropriate (either before or after the merge), but you cannot expect merge to be making decisions like that for you.
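For example, one way (among several equally defensible ones) to apply !duplicated before the merge, if you decide the first row per id in x is the one to keep:
x_dedup <- x[!duplicated(x$id), ]   # keep only the first occurrence of each id
result  <- merge(x_dedup, y, by.x = "id", by.y = "id2", all = TRUE)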