I have a dataframe df with an ID variable and daily dates (format XYYYYMMDD) as column headers:
ID <- c(101,102,203,207,209)
X20170101 <- c(1,NA,NA,2,1)
X20170102 <- c(NA,1,1,1,NA)
X20170103<-c(NA,NA,NA,2,1)
X20170201<-c(NA,2,NA,NA,1)
X20170202<-c(NA,1,1,NA,NA)
X20170301<-c(NA,1,NA,NA,NA)
df <- data.frame(ID,X20170101,X20170102,X20170103,X20170201,X20170202,X20170301)
ID X20170101 X20170102 X20170103 X20170201 X20170202 X20170301
101 1 NA NA NA NA NA
102 NA 1 NA 2 1 1
203 NA 1 NA NA 1 NA
207 2 1 2 NA NA NA
209 1 NA 1 1 NA NA
For each ID, I would like to sum across all dates/columns belonging to the same month. If yyyymm is the vector of strings for the first three months
yyyymm <- c("X201701","X201702","X201703")
I would like to obtain the dataframe want with strings in yyyymm as headers of the columns. That is:
ID X201701 X201702 X201703
101 1 NA NA
102 1 3 1
203 1 1 NA
207 5 NA NA
209 2 1 NA
My idea was to avoid reshaping the format of my dataset and use functions lapply and grepl to partially match the strings, but I'm missing something.
test = lapply(df, function(x) colSums(df[,grepl(x, names(df))]))
Many thanks.
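Here's one approach that uses the lubridate package to parse the dates and split.default to divide the data.frame's columns into groups by month. Multiplying by NA^(...) turns the sum into NA for rows where every value in that month is missing (NA^TRUE is NA, NA^FALSE is 1):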
library(lubridate)
factors = sapply(ymd(gsub("X", "", names(df)[-1])), function(x)
paste0(year(x), sprintf("%02d", as.integer(month(x)))))
data.frame(df[,1],
lapply(split.default(df[,-1], factors), function(x)
rowSums(x, na.rm = TRUE) * (NA^(rowSums(is.na(x)) == NCOL(x)))))
# ID X201701 X201702 X201703
#1 101 1 NA NA
#2 102 1 3 1
#3 203 1 1 NA
#4 207 5 NA NA
#5 209 2 1 NA
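For comparison, here is a base R sketch that stays close to the original idea of matching column names against yyyymm without reshaping. It assumes df is a plain data.frame built as in the question; sum_or_na is a hypothetical helper name, not part of any package.

```r
# Sample data from the question, as a plain data.frame
df <- data.frame(
  ID        = c(101, 102, 203, 207, 209),
  X20170101 = c(1, NA, NA, 2, 1),
  X20170102 = c(NA, 1, 1, 1, NA),
  X20170103 = c(NA, NA, NA, 2, 1),
  X20170201 = c(NA, 2, NA, NA, 1),
  X20170202 = c(NA, 1, 1, NA, NA),
  X20170301 = c(NA, 1, NA, NA, NA)
)
yyyymm <- c("X201701", "X201702", "X201703")

# Hypothetical helper: row sums that stay NA when a whole row is NA
sum_or_na <- function(m) {
  s <- rowSums(m, na.rm = TRUE)
  s[rowSums(!is.na(m)) == 0] <- NA
  s
}

# For each month prefix, pick the matching columns and sum them row-wise
monthly <- sapply(yyyymm, function(p) {
  sum_or_na(df[, startsWith(names(df), p), drop = FALSE])
})
want <- data.frame(ID = df$ID, monthly)
want
```

The drop = FALSE matters for months with a single column (X201703 here), since otherwise the selection collapses to a vector and rowSums fails.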
Is there a reason you don't want to reshape your data?
library(tidyverse)
want <- df %>%
gather(key, value, -ID) %>%
mutate(key = substr(key, 1, 7)) %>%
group_by(ID, key) %>%
summarise(value = sum(value, na.rm=TRUE)) %>%
spread(key, value)
# A tibble: 5 x 4
# Groups: ID [5]
ID X201701 X201702 X201703
* <dbl> <dbl> <dbl> <dbl>
1 101 1 0 0
2 102 1 3 1
3 203 1 1 0
4 207 5 0 0
5 209 2 1 0
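The zeros in this output differ from the NAs in the desired result because sum(..., na.rm = TRUE) over an all-NA group returns 0. A minimal sketch of a workaround, assuming a hypothetical helper named sum_na that could be used in place of sum inside summarise():

```r
# Hypothetical helper: like sum(x, na.rm = TRUE), but NA when every value is NA
sum_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum(c(NA, NA), na.rm = TRUE)  # 0: the source of the zeros above
sum_na(c(NA, NA))             # NA
sum_na(c(1, NA, 2))           # 3
```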
I've a dataset similar to this (clearly much bigger):
ID <- c(1,2,3,4,5,6)
MASS <- c(324,162,508,675,670,832)
DIFF <- c("2","1","5","0","3&6","5")
d <- data.frame(ID, MASS, DIFF)
ID MASS DIFF
1 1 324 2
2 2 162 1
3 3 508 5
4 4 675 0
5 5 670 3&6
6 6 832 5
Is there any way in R to set up a script that would:
read the values reported in the column DIFF (ignoring & and 0)
find the same values in the column ID
paste the corresponding MASS values into a new column (one value per cell) next to the column DIFF; if DIFF reports more than one ID, create additional columns (MASS1, MASS2, MASS3...)
The aim would be to obtain something like what is reported below; I hope this clarifies my clumsy description of the problem:
ID MASS DIFF MASS1 MASS2
1 1 324 2 162 NA
2 2 162 1 324 NA
3 3 508 5 670 NA
4 4 675 0 NA NA
5 5 670 3&6 508 832
6 6 832 5 670 NA
Many thanks for any advice
This feels pretty hacky and overly complicated, but it works. Maybe someone else has a more efficient method:
library(dplyr)
library(tidyr)
library(purrr)
d |>
separate_rows(DIFF, convert = TRUE) |>
left_join(d, c("DIFF" = "ID")) |>
select(-DIFF.y) |>
group_by(ID) |>
mutate(DIFF = paste(DIFF, collapse = "&")) |>
ungroup() |>
rename(MASS = MASS.x) |>
group_split(ID) |>
map(~ .x |>
mutate(temp = row_number()) |>
pivot_wider(values_from = MASS.y, names_from = temp, names_glue = "MASS{temp}")) |>
bind_rows()
# A tibble: 6 × 5
ID MASS DIFF MASS1 MASS2
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 324 2 162 NA
2 2 162 1 324 NA
3 3 508 5 670 NA
4 4 675 0 NA NA
5 5 670 3&6 508 832
6 6 832 5 670 NA
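A base R sketch of the same lookup, in case a version without joins is easier to follow. It assumes DIFF always separates IDs with "&" and that 0 (or any ID not present) should yield NA; the variable names are only illustrative.

```r
d <- data.frame(
  ID   = c(1, 2, 3, 4, 5, 6),
  MASS = c(324, 162, 508, 675, 670, 832),
  DIFF = c("2", "1", "5", "0", "3&6", "5"),
  stringsAsFactors = FALSE
)

# Split each DIFF entry on "&" and convert the pieces to numbers
ids <- lapply(strsplit(d$DIFF, "&", fixed = TRUE), as.numeric)
n_max <- max(lengths(ids))  # number of MASS columns needed

# Look up MASS for each referenced ID; unmatched IDs (such as 0) give NA
mass <- do.call(rbind, lapply(ids, function(i) {
  m <- d$MASS[match(i, d$ID)]
  length(m) <- n_max  # pad short rows with NA
  m
}))
colnames(mass) <- paste0("MASS", seq_len(n_max))

out <- cbind(d, mass)
out
```

Using do.call(rbind, ...) rather than sapply keeps the result a matrix even when every row references a single ID.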
I am trying to use summarize_all to find the average value for each ID. I can do this successfully for the ID column and the column I am trying to average (Speed). However, when I use the code below, all other columns are returned as NAs.
The second problem is that some IDs have multiple locations or drivers. So I need to keep the speed averaged by driver but have multiple rows that maintain the unique driver and location (date doesn't matter).
I thought this might be because the columns that are returning as NAs are non-numeric. I tried looking through other questions, but couldn't find one that answers why this happens and how to fix it.
I also tried using the aggregate() function but the same happens.
This is the code I am currently using:
library(dplyr)
avg <- bind %>%
group_by(ID) %>%
summarize_all(mean)
This is what my data looks like:
ID Speed Location Driver Date
2 100 a 1 M
2 145 a 1 M
5 155 b 1 M
4 100 a 2 T
3 135 b 2 T
3 156 b 3 T
4 167 b 3 W
This is what I would like the final dataset to look like:
ID Speed Location Driver Date
2 122.5 a 1 M
5 155 b 1 M
4 133.5 a 2 T
4 133.5 b 3 W
3 145.5 b 2 T
3 145.5 b 3 T
So far, this is what my result looks like:
ID Speed Location Driver Date
2 122.5 NA NA NA
5 155 NA NA NA
4 133.5 NA NA NA
4 133.5 NA NA NA
3 145.5 NA NA NA
3 145.5 NA NA NA
with the warning:
There were 50 or more warnings (use warnings() to see the first 50)
We could replace 'Speed' with the mean of 'Speed' within each ID and then keep the distinct rows:
library(dplyr)
bind %>%
group_by(ID) %>%
mutate(Speed = mean(Speed)) %>%
distinct()
# A tibble: 6 x 5
# Groups: ID [4]
# ID Speed Location Driver Date
# <int> <dbl> <chr> <int> <chr>
#1 2 122. a 1 M
#2 5 155 b 1 M
#3 4 134. a 2 T
#4 3 146. b 2 T
#5 3 146. b 3 T
#6 4 134. b 3 W
The NAs in the other columns are the result of calling mean on non-numeric columns, which returns NA with a warning.
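The same idea can be sketched in base R with ave(), which returns the group mean aligned to the original rows. The data frame below is retyped from the question's sample, so treat it as an assumption about how bind is structured.

```r
bind <- data.frame(
  ID       = c(2, 2, 5, 4, 3, 3, 4),
  Speed    = c(100, 145, 155, 100, 135, 156, 167),
  Location = c("a", "a", "b", "a", "b", "b", "b"),
  Driver   = c(1, 1, 1, 2, 2, 3, 3),
  Date     = c("M", "M", "M", "T", "T", "T", "W"),
  stringsAsFactors = FALSE
)

# ave() applies mean within each ID by default and keeps one value per row
bind$Speed <- ave(bind$Speed, bind$ID)

# Drop rows that became exact duplicates after averaging
avg <- unique(bind)
avg
```

Because only Speed is averaged, the character columns are never passed to mean and no NAs are introduced.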
I have a tidy data set which describes attributes of products. Each product has many attributes, and each attribute is described in its own row. My goal is to do some calculations on each product without using loops. The reason for avoiding loops is that there are several hundred thousand products, and thus many millions of attributes.
Toy dataset with only one product:
df <- data.frame(
  productID = 1,
  attributeID = seq(1, 15, 1),
  dataType = c('range', 'range', 'predefined', 'predefined', 'bool', 'bool', 'bool', 'bool',
               'double', 'double', 'double', 'double', 'double', 'double', 'double'),
  double = c(NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 15, 11.4, 6, 0, 0),
  logical = c(NA, NA, NA, NA, TRUE, FALSE, FALSE, FALSE, NA, NA, NA, NA, NA, NA, NA),
  predefined = c(NA, NA, 'Black', 'Round', NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  from.value = c(0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  to.value = c(249, 368, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)
# productID attributeID dataType double logical predefined from.value to.value
# 1 1 1 range NA NA <NA> 0 249
# 2 1 2 range NA NA <NA> 0 368
# 3 1 3 predefined NA NA Black NA NA
# 4 1 4 predefined NA NA Round NA NA
# 5 1 5 bool NA TRUE <NA> NA NA
# 6 1 6 bool NA FALSE <NA> NA NA
# 7 1 7 bool NA FALSE <NA> NA NA
# 8 1 8 bool NA FALSE <NA> NA NA
# 9 1 9 double 0.0 NA <NA> NA NA
# 10 1 10 double 0.0 NA <NA> NA NA
# 11 1 11 double 15.0 NA <NA> NA NA
# 12 1 12 double 11.4 NA <NA> NA NA
# 13 1 13 double 6.0 NA <NA> NA NA
# 14 1 14 double 0.0 NA <NA> NA NA
# 15 1 15 double 0.0 NA <NA> NA NA
For example, how would one go about counting the zeros for each product in the double column?
Since you're only after counting the number of zeros in the double column, the following should help:
library(tidyverse)
df %>%
group_by(productID) %>%
summarise(sum.of.zeros=sum(double==0, na.rm = T))
The above sums the instances where double equals zero: each match counts as 1 (TRUE) and each non-match as 0 (FALSE). The na.rm = T is required because the expression NA==0 returns NA.
Take a look at the tidyverse packages, and dplyr in particular
library(tidyverse)
df %>% group_by( productID, from.value ) %>% summarise( amount = n_distinct( attributeID ))
# # A tibble: 2 x 3
# # Groups: productID [?]
# productID from.value amount
# <dbl> <dbl> <int>
# 1 1 0 2
# 2 1 NA 13
With data.table you can do:
library("data.table")
setDT(df)[, sum(na.omit(double)==0), productID]
or
setDT(df)[, sum(double==0, na.rm=TRUE), productID]
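A base R sketch of the same per-product count, using tapply. The two-product data frame here is hypothetical and only meant to show that the count is computed separately for each productID.

```r
# Hypothetical data with two products (not the toy data from the question)
df <- data.frame(
  productID = c(1, 1, 1, 1, 2, 2, 2),
  double    = c(0, NA, 15, 0, NA, 0, 6)
)

# Comparing with 0 gives TRUE/FALSE/NA; summing with na.rm = TRUE counts the TRUEs
zeros <- tapply(df$double == 0, df$productID, sum, na.rm = TRUE)
zeros
```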
I am trying to spread() a couple of key/value pairs but the common value column does not collapse. I think that it may have to do with some previous processing, or more likely I do not know the right way to spread two or more key/value pairs to get the result I expect.
I'm starting with this data set:
library(tidyverse)
df <- tibble(order = 1:7,
line_1 = c(23,8,21,45,68,31,24),
line_2 = c(63,25,25,24,48,24,63),
line_3 = c(62,12,10,56,67,25,35))
There are 2 pre-spread steps to define order of the "count" values created in the following gather() function. This is the first pre-spread step to define the original order of the "count" variable using the row number:
ntrl <- df %>%
gather(line_1,
line_2,
line_3,
key = "sector",
value = "count") %>%
group_by(order) %>%
mutate(sector_ord = row_number()) %>%
arrange(order,
sector)
This is the second pre-spread step to define the numerical order of the "count" variable:
ord <- ntrl %>%
arrange(order,
count) %>%
group_by(order) %>%
mutate(num_ord = paste0("ord_",
row_number()))
And then finally the spread code that I have been using:
wide <- ord %>%
group_by(order) %>%
spread(key = sector,
value = count) %>%
spread(key = num_ord,
value = sector_ord)
What I'm getting is this:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 NA NA 1 NA NA
2 1 NA 63 NA NA NA 2
3 1 NA NA 62 NA 3 NA
4 2 8 NA NA 1 NA NA
5 2 NA 25 NA NA NA 2
6 2 NA NA 12 NA 3 NA
7 3 21 NA NA NA 1 NA
8 3 NA 25 NA NA NA 2
9 3 NA NA 10 3 NA NA
... and so on thru 21 lines accounting for all 7 "order" lines
The behavior that I am expecting is that the "order" column would collapse in all rows that are the same "order" value to give the following:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
... and so on, I think that paints the picture
I have reviewed the questions and answers about spreading with duplicate identifiers and the use of the index of row numbers but that does not help.
I figure that it has something to do with the double spreading, but I cannot figure out how to do that.
Thanks for your help.
A solution using tidyverse, starting from your df. The key is to use summarise_all(funs(.[which(!is.na(.))])) to select the only non-NA value for each column.
library(tidyverse)
df2 <- df %>%
gather(Lines, Value, -order) %>%
group_by(order) %>%
mutate(Rank = dense_rank(Value),
RankOrder = paste0("ord_", row_number())) %>%
spread(Lines, Value) %>%
spread(RankOrder, Rank) %>%
summarise_all(funs(.[which(!is.na(.))]))
df2
# A tibble: 7 x 7
order line_1 line_2 line_3 ord_1 ord_2 ord_3
<int> <dbl> <dbl> <dbl> <int> <int> <int>
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
5 5 68 48 67 3 1 2
6 6 31 24 25 3 1 2
7 7 24 63 35 1 3 2
Starting from df:
df %>%
gather(headers, line, -order) %>%
separate(headers, into = c('dummy', 'rn')) %>%
select(-dummy) %>%
group_by(order) %>%
mutate(ord = rank(line, ties.method='first')) %>%
{data.table::dcast(data.table::setDT(.), order ~ rn, value.var = c("line", "ord"))}
# order line_1 line_2 line_3 ord_1 ord_2 ord_3
#1: 1 23 63 62 1 3 2
#2: 2 8 25 12 1 3 2
#3: 3 21 25 10 2 3 1
#4: 4 45 24 56 2 1 3
#5: 5 68 48 67 3 1 2
#6: 6 31 24 25 3 1 2
#7: 7 24 63 35 1 3 2
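Since the data start out wide, the rank columns can also be computed row by row without any gather/spread round trip. A base R sketch, assuming ties should be broken by position (ties.method = "first", as in the answer above):

```r
df <- data.frame(
  order  = 1:7,
  line_1 = c(23, 8, 21, 45, 68, 31, 24),
  line_2 = c(63, 25, 25, 24, 48, 24, 63),
  line_3 = c(62, 12, 10, 56, 67, 25, 35)
)

# Rank the three line values within each row; apply() returns one column
# per input row, so transpose to get one row per order
lines <- as.matrix(df[, -1])
ord <- t(apply(lines, 1, rank, ties.method = "first"))
colnames(ord) <- sub("line", "ord", colnames(lines))

wide <- cbind(df, ord)
wide
```

This sidesteps the original problem, where each long-format row carried a different sector_ord/num_ord combination and so spread() could not collapse the rows belonging to one order.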
So let's take the following data:
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what I want in the new data frame is the injuries as separate columns, numbered by their position within each value of A.
So the desired output should look like this:
In this simple example we know that the maximum number of injuries per unique df2$B is 4, so we need 4 new columns.
But my data has an unknown number, so code is needed to generate the correct number of columns, something like
length(unique(df2$injury[df2$B]))
but that is not the correct syntax either, as the output should equal 4
I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")), A + age + height ~
.id, value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.
If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
library(dplyr)
library(tidyr)
df3 %>%
group_by(A) %>%
mutate(ind = row_number()) %>%
spread(ind, injury)
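The index-variable idea also works in base R, and the index answers the question of how many columns are needed. A sketch under the assumption that the injury values below are hypothetical stand-ins (the question's values come from sample() and depend on the R version):

```r
# B as in the question; the injury letters are made up for illustration
df2 <- data.frame(
  B      = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5, 8, 8, 9, 10, 10),
  injury = c("a", "b", "c", "a", "d", "b", "b", "e",
             "a", "c", "d", "a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Running index of each injury within its B group
df2$ind <- ave(seq_along(df2$B), df2$B, FUN = seq_along)

# Number of injury columns needed (same as max(table(df2$B)))
n_cols <- max(df2$ind)

# One row per B, one injury.<ind> column per index
wide <- reshape(df2, idvar = "B", timevar = "ind", direction = "wide")
wide
```

The wide table can then be merged with df1 in the same way as before, using by.x = "A" and by.y = "B".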