Why is the R aggregation dropping data rows? - r

I have a data frame with 2 columns: date & observations. The data consists of multiple observations for each date.
str(observations)
tibble [2,599 × 2] (S3: tbl_df/tbl/data.frame)
$ date : chr [1:2599] "1/22/20" "1/22/20" "1/22/20" "1/22/20" ...
$ observation : num [1:2599] 0 0 0 0 0 0 0 0 0 0 ...
> tail(observations)
# A tibble: 6 x 2
date observation
<chr> <dbl>
1 5/13/20 4127
2 5/13/20 1042
3 5/13/20 14306
4 5/13/20 1066
5 5/13/20 0
6 5/13/20 89
I want to subtotal these observations to produce a single row for each date so I used this function:
subs <- aggregate(cbind(observation) ~ date,data=observations, FUN = sum, na.rm = TRUE)
But the output is missing any rows for the last 4 days of the original:
> tail(subs)
date observation
108 5/4/20 128269
109 5/5/20 130593
110 5/6/20 131890
111 5/7/20 133991
112 5/8/20 135840
113 5/9/20 137397

I apologize. On further investigation, it appears that the aggregate function returned that data out of order. I re-ordered the data frame and confirmed that all dates were accounted for.
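A likely cause of the out-of-order result: `date` is a character column, so `aggregate()` sorts the groups as text, and "5/9/20" ends up after "5/13/20". A minimal sketch, assuming the dates are in month/day/two-digit-year form as in the `str()` output above; the toy data frame `obs` and its values are invented for illustration:

```r
# Toy data in the same shape as `observations` (values are invented).
obs <- data.frame(
  date = c("5/9/20", "5/9/20", "5/13/20", "5/13/20", "1/22/20"),
  observation = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Parse the text into real dates so the groups sort chronologically.
obs$date <- as.Date(obs$date, format = "%m/%d/%y")

# One row per date; the result is ordered by the grouping column.
subs <- aggregate(observation ~ date, data = obs, FUN = sum)
subs
```

With a true Date column, `tail(subs)` should show the most recent dates last.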

Related

Adding two columns in a tibble and saving the sum to third column is making the third column a dataframe

I am working on generating a report. When I tried to write the tibble using the xlsx package's write.xlsx, it gave an error (even after I specified as.data.frame(tibble) in write.xlsx).
Upon checking the tibble, I realized that when I added multiple columns and stored the result in another column of the tibble, the Total column became a data frame.
Example:
> marks <- tibble(math = c(90,90,85,90),
+ physics = c(90,85,95,80),
+ Total = c(rep(NA,4)))
> marks
# A tibble: 4 x 3
math physics Total
<dbl> <dbl> <lgl>
1 90 90 NA
2 90 85 NA
3 85 95 NA
4 90 80 NA
> class(marks)
[1] "tbl_df" "tbl" "data.frame"
> str(marks)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ math : num 90 90 85 90
$ physics: num 90 85 95 80
$ Total : logi NA NA NA NA
> marks$Total <- marks[,1] + marks[,2]
> str(marks)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ math : num 90 90 85 90
$ physics: num 90 85 95 80
$ Total :'data.frame': 4 obs. of 1 variable:
..$ math: num 180 175 180 170
>
As shown above, I expected R's vectorized operations to apply, but the "Total" column became a data frame after I summed the two columns and stored the result in it.
Could someone explain why this happens, and how to perform the operation correctly?
Edit: It seems that because a tibble doesn't drop dimensions, this was not the same as adding two vectors.
I think this is a result of the fact that by default tibbles don't drop the 2nd dimension when you access part of them with [], whereas data frames do. Compare:
> marks[, 1]
# A tibble: 4 x 1
math
<dbl>
1 90
2 90
3 85
4 90
> marks_df = as.data.frame(marks)
> marks_df[ , 1]
[1] 90 90 85 90
So marks[, 1] + marks[, 2] is adding a tibble to a tibble and the result is a tibble.
To avoid this, you can either drop the 2nd dimension explicitly, or just use the column names:
marks$Total <- marks[,1, drop = TRUE] + marks[, 2, drop = TRUE]
marks$Total <- marks$math + marks$physics
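Another option is `[[`, which extracts a single column as a plain vector from both tibbles and data frames. The sketch below uses only base R, with `drop = FALSE` on a data frame standing in for the tibble behaviour described above; `marks_df` mirrors the example data:

```r
marks_df <- data.frame(math = c(90, 90, 85, 90),
                       physics = c(90, 85, 95, 80))

# drop = FALSE keeps the result a one-column data frame,
# which is what a tibble does by default with [ , j].
kept <- marks_df[, 1, drop = FALSE] + marks_df[, 2, drop = FALSE]
is.data.frame(kept)   # TRUE

# [[ returns the column itself, so the sum is a numeric vector.
marks_df$Total <- marks_df[["math"]] + marks_df[["physics"]]
str(marks_df$Total)
```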

Multiply each other value in row with first value in row

I have the following data frame:
Date <- c("04.06.2013","05.06.2013","06.06.2013","07.06.2013","08.06.2013","09.06.2013")
discharge <- c("1000","2000","1100","3000","1700","1600")
concentration_1 <- c("25","20","11","6.4","17","16")
concentration_2 <- c("1.4","1.7","2.7","3.2","4","4.7")
concentration_3 <- c("1.2","1.3","1.9","2.2","2.4","3")
concentration_4 <- c("1","0.92","2.5","3","3.4","4.8")
y <- data.frame(Date, discharge,concentration_1,concentration_2,concentration_3,concentration_4, stringsAsFactors=FALSE)
y$Date <- as.Date(y$Date, format ="%d.%m.%Y")
y[-1] <- sapply(y[-1], as.numeric)
In each row, I need to multiply each concentration with the discharge.
I was looking into the apply function but couldn't figure out how to solve it.
No apply needed, just multiply. But first let's get your data in decent shape.
The way you define your data, with quotes around the numbers, all the columns that should be numeric become factors (in R versions before 4.0, where data.frame() defaults to stringsAsFactors = TRUE). We use lapply to convert them safely to numeric:
y <- data.frame(Date, discharge,concentration_1,concentration_2,concentration_3,concentration_4)
y$Date <- as.Date(y$Date, format ="%d.%m.%Y")
str(y)
# 'data.frame': 6 obs. of 6 variables:
# $ Date : Date, format: "2013-06-04" "2013-06-05" "2013-06-06" "2013-06-07" ...
# $ discharge : Factor w/ 6 levels "1000","1100",..: 1 5 2 6 4 3
# $ concentration_1: Factor w/ 6 levels "11","16","17",..: 5 4 1 6 3 2
# $ concentration_2: Factor w/ 6 levels "1.4","1.7","2.7",..: 1 2 3 4 5 6
# $ concentration_3: Factor w/ 6 levels "1.2","1.3","1.9",..: 1 2 3 4 5 6
# $ concentration_4: Factor w/ 6 levels "0.92","1","2.5",..: 2 1 3 4 5 6
# convert all columns but the first safely to numeric
y[, -1] = lapply(y[, -1], function(x) as.numeric(as.character(x)))
str(y)
# 'data.frame': 6 obs. of 6 variables:
# $ Date : Date, format: "2013-06-04" "2013-06-05" "2013-06-06" "2013-06-07" ...
# $ discharge : num 1000 2000 1100 3000 1700 1600
# $ concentration_1: num 25 20 11 6.4 17 16
# $ concentration_2: num 1.4 1.7 2.7 3.2 4 4.7
# $ concentration_3: num 1.2 1.3 1.9 2.2 2.4 3
# $ concentration_4: num 1 0.92 2.5 3 3.4 4.8
With that done, we can just multiply the concentration columns by the discharge column. R will "recycle" the discharge column to multiply each of the concentration columns appropriately.
concentration_columns = paste0("concentration_", 1:4)
y[, concentration_columns] = y[, concentration_columns] * y[, "discharge"]
y
# Date discharge concentration_1 concentration_2 concentration_3 concentration_4
# 1 2013-06-04 1000 25000 1400 1200 1000
# 2 2013-06-05 2000 40000 3400 2600 1840
# 3 2013-06-06 1100 12100 2970 2090 2750
# 4 2013-06-07 3000 19200 9600 6600 9000
# 5 2013-06-08 1700 28900 6800 4080 5780
# 6 2013-06-09 1600 25600 7520 4800 7680
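The recycling works out because `discharge` has exactly one value per row: when a data frame is multiplied by a vector of that length, each column is multiplied element by element with the whole vector. A small sketch with invented numbers (`conc` and `flow` are hypothetical names):

```r
conc <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
flow <- c(2, 3, 4)  # one value per row

# Each column is multiplied by `flow`, row by row.
scaled <- conc * flow
scaled

# The same result, spelled out column by column.
by_hand <- data.frame(a = conc$a * flow, b = conc$b * flow)
all.equal(scaled, by_hand)
```

If the vector length did not match the number of rows, the values would be recycled across columns in a way that is rarely intended, so it is worth checking `length(flow) == nrow(conc)` first.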
The multiplication is vectorized, just use the columns you want to multiply as operands.
y[, 2] * y[, -(1:2)]
Once your values are no longer character (not in ""), you can use apply like this:
new <- data.frame(y[,1:2],apply(y[,3:6],2,function(x) x*y$discharge))

How to merge two data frames with non overlapping dates?

I have a data set with the following variables:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken (288 intervals per day)
The main data set:
> head(activityData, 3)
steps date interval
1 1.7169811 2012-10-01 0
2 0.3396226 2012-10-01 5
3 0.1320755 2012-10-01 10
> str(activityData)
'data.frame': 17568 obs. of 3 variables:
$ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: num 0 5 10 15 20 25 30 35 40 45 ...
The data set has a range of two months.
I had to divide it into weekdays and weekend days. I did it with the following functions:
> dataAs.xtsWeekday <- dataAs.xts[.indexwday(dataAs.xts) %in% 1:5]
> dataAs.xtsWeekend <- dataAs.xts[.indexwday(dataAs.xts) %in% c(0, 6)]
After doing this I had to make some calculation, at which I failed so I decided to export the files and read them in, again.
After I imported the data again, I made the calculation I wanted, and I tried to merge the 2 datasets, but did not succeed.
First data set:
> head(weekdays, 3)
X steps date interval daytype
1 1 37.3826 2012-10-01 0 weekday
2 2 37.3826 2012-10-01 5 weekday
3 3 37.3826 2012-10-01 10 weekday
> str(weekdays)
'data.frame': 12960 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 37.4 37.4 37.4 37.4 37.4 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekday" "weekday" "weekday" "weekday" ...
Second data set:
> head(weekend, 3)
X steps date interval daytype
1 1 0 2012-10-06 0 weekend
2 2 0 2012-10-06 5 weekend
3 3 0 2012-10-06 10 weekend
> str(weekend)
'data.frame': 4608 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 0 0 0 0 0 0 0 0 0 0 ...
$ date : chr "2012-10-06" "2012-10-06" "2012-10-06" "2012-10-06" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekend" "weekend" "weekend" "weekend" ...
Now I would like to merge the 2 data sets (weekdays, weekends) by date, but the problem is that I don't have any common dates or anything else common.
The final data set should have 4 columns and 17568 observations.
The columns should be:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
daytype: weekend days or normal weekdays.
I tried with:
merge
join(plyr)
union
Everywhere I looked all the data sets had a common ID or a common column in both data sets, not like in my case.
I also looked here, but I did not understand much and at many others, but they had nothing in common with my data set.
The other option I thought about was to add a column called "ID" to the original data set and redo everything I have done so far, which I'll have to do if I don't find a way around this problem.
I would like some advice on how to proceed or what to try next.
Since you mentioned that your final data set should have 17568 (=4608+12960) observations/rows, I assume you want to stack the two data.frames over each other (and possibly order them by date afterwards). This is done by using rbind().
finaldata <- rbind(weekdays, weekend)
If you want to remove column X:
finaldata$X <- NULL
To convert your date column to actual dates:
finaldata$date <- as.Date(finaldata$date, format="%Y-%m-%d")
To order the whole data by date:
finaldata <- finaldata[order(finaldata$date),]
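To see the stacking on a small scale, here is a sketch with two invented data frames that share column names but no dates; `wd` and `we` are stand-ins for `weekdays` and `weekend`:

```r
wd <- data.frame(steps = c(37.4, 37.4),
                 date = c("2012-10-01", "2012-10-08"),
                 interval = c(0, 0), daytype = "weekday",
                 stringsAsFactors = FALSE)
we <- data.frame(steps = c(0, 0),
                 date = c("2012-10-06", "2012-10-07"),
                 interval = c(0, 0), daytype = "weekend",
                 stringsAsFactors = FALSE)

# rbind() matches columns by name and appends the rows; no key is needed.
both <- rbind(wd, we)
both$date <- as.Date(both$date)          # ISO dates parse without a format
both <- both[order(both$date, both$interval), ]
rownames(both) <- NULL                   # tidy row names after reordering

nrow(both) == nrow(wd) + nrow(we)        # TRUE
```

Ordering by `interval` as a second key keeps the 5-minute slots in sequence within each day.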

How to sum up numbers in one CSV-column that belong to one factor in another column?

I am pretty new to R and have a data file that represents a budget. I want to sum up all the price tags for one purpose in the purpose column. That purpose gets automatically factored when reading in the csv. But how can I assign the right prices to a purpose with several counts in the file and sum them up?
I got the file from this link:
http://www.berlin.de/imperia/md/content/senatsverwaltungen/finanzen/haushalt/ansatzn2013.xls?download.html
I opened it in Open Office, exported the .csv-file and called it ausgaben.csv.
> ausgaben <- read.csv("ausgaben.csv")
> str(ausgaben)
'data.frame': 15895 obs. of 8 variables:
$ Bereich : Factor w/ 13 levels "(30) Senatsverwaltungen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Einzelplan : Factor w/ 28 levels "(01) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Kapitel : Factor w/ 270 levels "(0100) Abgeordnetenhaus",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Titelart : Factor w/ 1 level "Ausgaben": 1 1 1 1 1 1 1 1 1 1 ...
$ Titel : int 41101 41103 42201 42701 42801 42811 42821 44100 44304 44379 ...
$ Titelbezeichnung: Factor w/ 1286 levels "Abdeckung von Geldverlusten",..: 57 973 182 67 262 257 95 127 136 797 ...
$ Funktion : Factor w/ 135 levels "(011) Politische Führung",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Euro : Factor w/ 2909 levels "-1.083,0","-1.295,0",..: 539 2226 1052 1167 1983 1111 1575 2749 1188 1167 ...
The "Funktion" column has 135 levels, each of which corresponds to several amounts in "Euro". I want to collect all the numbers in "Euro" for each level of "Funktion" and sum them, so that I get 135 Euro values and can show what is spent for what purpose in this budget.
This could be done with plyr::ddply or many other functions (ave, tapply, etc...).
I think that 'Euro' should not be a factor, but numeric - so please fix this before trying to aggregate.
Since we do not have your data here is a toy example:
set.seed(1234)
df <- data.frame(fac = sample(LETTERS[1:3], 50, replace = TRUE),
x = runif(50))
require(plyr)
ddply(df, .(fac), summarise,
sum_x = sum(x))
# fac sum_x
# 1 A 7.938613
# 2 B 6.692007
# 3 C 5.645078
You can read the xls file with the gdata package:
library(gdata)
ausgaben <- read.xls("ansatzn2013.xls")
Firstly, you need to transform the values in the column Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR from factor to numeric:
Euro <- as.character(ausgaben$Ansatz.2013.inkl..Nachtrag.in.Tsd..EUR)
Euro <- as.numeric(gsub(",", "", Euro))
Then, you can calculate the sums with the aggregate function:
aggregate(Euro ~ ausgaben$Funktion, FUN = sum)
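If you work from the CSV export instead, the `Euro` levels shown in the question (for example "-1.083,0") use a period as thousands separator and a comma as decimal mark, so both need handling before the conversion. A hedged sketch on made-up values; the data frame `budget` and its contents are invented:

```r
budget <- data.frame(
  Funktion = c("(011) A", "(011) A", "(012) B"),
  Euro = c("-1.083,0", "2.000,5", "150,0"),
  stringsAsFactors = TRUE   # mimic read.csv() turning text into factors
)

# factor -> character, drop thousands separators, comma -> decimal point
euro_chr <- as.character(budget$Euro)
euro_chr <- gsub(".", "", euro_chr, fixed = TRUE)
budget$Euro <- as.numeric(sub(",", ".", euro_chr, fixed = TRUE))

# One total per Funktion level
totals <- aggregate(Euro ~ Funktion, data = budget, FUN = sum)
totals
```

Going through `as.character()` first matters: calling `as.numeric()` directly on a factor returns the internal level codes, not the printed values.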

R How to update a column in data.frame using values from another data.frame

New to R.
I have a data.frame
'data.frame': 2070 obs. of 5 variables:
$ id : int 16625062 16711130 16625064 16668358 16625066 16711227 16711290 16668746 16711502 16625494 ...
$ subj : Factor w/ 3 levels "L","M","S": 1 1 1 1 1 1 1 1 1 1 ...
$ grade: int 4 6 4 5 4 6 6 5 6 4 ...
$ score: int 225 225 0 225 225 375 375 125 225 125 ...
$ level: logi NA NA NA NA NA NA ...
and a list of named numbers called lookup
Named num [1:12] 12 19 20 26 31 32 49 67 72 73 ...
- attr(*, "names")= chr [1:12] "0" "50" "100" "125" ...
I'd like to find a way to update the data frame "level" column by looking up values in the lookup list, matching the data frame "score" column with the name of the number in the lookup list. In other words, the score values in the data frame are used to lookup the number (that will go in the level column) in the lookup list.
So... if anyone understands what I mean... please help.
Thanks Robn
You should be able to do this with (assuming your data frame is called d):
d$level = as.numeric(lookup[as.character(d$score)])
For example:
lookup = list(1, 2, 3, 4)
names(lookup) = c("0", "50", "100", "150")
d = data.frame(score=c(50, 150, 0, 0), level=NA)
d$level = as.numeric(lookup[as.character(d$score)])
print(d)
# score level
# 1 50 2
# 2 150 4
# 3 0 1
# 4 0 1
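The question describes `lookup` as a named numeric vector rather than a list; the same name-based indexing works there, and scores with no matching name come back as NA. A small sketch, using the first four values shown in the question and an invented data frame `d`:

```r
lookup <- c("0" = 12, "50" = 19, "100" = 20, "125" = 26)
d <- data.frame(score = c(125, 0, 50, 999), level = NA)

# Index by name: scores must be converted to character first,
# otherwise R would index by position instead.
d$level <- unname(lookup[as.character(d$score)])
d
```

Here the score 999 has no entry in `lookup`, so its level stays NA, which makes unmatched scores easy to spot with `is.na(d$level)`.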
