I have the following data. It is a dump from a normalized database, but I cannot access the database, and the database maintainer insists that access is not necessary.
The obs variable is the unique observation id, i.e. the one to "pivot" around.
Specifically, I want to go from this olddata to the newdata data frame below:
> olddata
species obs variable value
3 ADFA 1 mean 4
4 ADFA 1 lat 118
5 ADFA 1 lon 49
6 ADFA 1 masl 74
96 HODO 8 mean 18
97 HODO 8 lat 120
98 HODO 8 lon 45
99 HODO 8 masl 36
189 HODO 9 mean 34
190 HODO 9 lat 126
191 HODO 9 lon 12
192 HODO 9 masl 35
I would like to reshape this data frame to look like:
> newdata
species obs mean lat lon masl
1 ADFA 1 4 118 49 74
2 HODO 8 18 120 45 36
3 HODO 9 34 126 12 35
Disclaimer: this has likely been asked before but I am unable to find the question among the many questions related to transforming data frames / matrices
Here are the dataframes for use when reproducing this issue:
olddata <- structure(list(species = c("ADFA", "ADFA", "ADFA", "ADFA", "HODO",
"HODO", "HODO", "HODO", "HODO", "HODO", "HODO", "HODO"), obs = c(1,
1, 1, 1, 8, 8, 8, 8, 9, 9, 9, 9), variable = c("mean", "lat",
"lon", "masl", "mean", "lat", "lon", "masl", "mean", "lat", "lon",
"masl"), value = c(4, 118, 49, 74, 18, 120, 45, 36, 34, 126,
12, 35)), .Names = c("species", "obs", "variable", "value"),
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12"), class = "data.frame")
newdata <- structure(list(species = c("ADFA", "HODO", "HODO"), obs = c(1,
8, 9), mean = c(4, 18, 34), lat = c(118, 120, 126), lon = c(49,
45, 12), masl = c(74, 36, 35)), .Names = c("species", "obs",
"mean", "lat", "lon", "masl"), row.names = c(NA, -3L),
class = "data.frame")
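You can use dcast from reshape2, with species and obs as the id columns and the levels of variable spread into columns: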
> library(reshape2)
> dcast(olddata, species+obs~variable)
species obs lat lon masl mean
1 ADFA 1 118 49 74 4
2 HODO 8 120 45 36 18
3 HODO 9 126 12 35 34
library(reshape2)
dcast(olddata,species+obs~variable)
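If installing reshape2 is not an option, a base R sketch of the same reshape could look like the following. It rebuilds olddata from the question; the name wide and the column-renaming step are illustrative choices, not part of the original answer.

```r
# Rebuild the long-format data from the question
olddata <- data.frame(
  species  = rep(c("ADFA", "HODO", "HODO"), each = 4),
  obs      = rep(c(1, 8, 9), each = 4),
  variable = rep(c("mean", "lat", "lon", "masl"), times = 3),
  value    = c(4, 118, 49, 74, 18, 120, 45, 36, 34, 126, 12, 35)
)

# One row per (species, obs); one column per level of `variable`
wide <- reshape(olddata,
                idvar = c("species", "obs"),
                timevar = "variable",
                direction = "wide")

# reshape() names the new columns "value.mean", "value.lat", ...; strip the prefix
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

Unlike dcast, which sorts the new columns alphabetically, reshape() keeps them in order of first appearance (mean, lat, lon, masl), which matches the column order of newdata.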
I want to remove the NAs in the following data frame so that I'm left with 3 columns instead of 4. The names of the columns don't matter.
structure(list(Practical1 = c("65", "85", NA, "60", NA), Practical2 = c("55",
"75", "100", NA, "35"), Practical3 = c(45.45, 50, 86.36, 40,
72.73), Practical4 = c(NA, NA, "92", "79", "71")), class = "data.frame", row.names = c(NA,
-5L))
I'd get something that looks like this:
pracA, pracB, pracC
65, 55, 45.45,
85, 75, 50,
100, 86.36, 92,
60, 40, 79,
35, 72.73, 71
In base R, with na.omit:
dat <- as.data.frame(t(apply(dat, 1, na.omit)))
colnames(dat) <- paste0("prac", 1:3)
output
prac1 prac2 prac3
1 65 55 45.45
2 85 75 50.00
3 100 86.36 92
4 60 40.00 79
5 35 72.73 71
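The na.omit approach relies on every row having the same number of non-NA values; otherwise apply() returns a list rather than a matrix. A hedged sketch that makes this assumption explicit and also converts the result back to numeric (the name compact is illustrative):

```r
dat <- data.frame(
  Practical1 = c("65", "85", NA, "60", NA),
  Practical2 = c("55", "75", "100", NA, "35"),
  Practical3 = c(45.45, 50, 86.36, 40, 72.73),
  Practical4 = c(NA, NA, "92", "79", "71")
)

# Assumption: each row has exactly three non-missing values
stopifnot(all(rowSums(!is.na(dat)) == 3))

# Drop the NAs row by row; apply() works on a character matrix here
compact <- as.data.frame(t(apply(dat, 1, function(r) r[!is.na(r)])),
                         stringsAsFactors = FALSE)
names(compact) <- paste0("prac", 1:3)

# The values are still strings at this point, so convert them to numbers
compact[] <- lapply(compact, as.numeric)
compact
```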
In tidyr with unite + separate:
library(tidyr)
unite(dat, "a", starts_with("Practical"), na.rm = TRUE) %>%
  separate(a, into = paste0("prac", 1:3), sep = "_")
I am trying to find a function similar to VLOOKUP in Excel, but one that returns the maximum value together with the other values in the same row.
The data frames I am dealing with are given below:
dput(Book3)
structure(list(Item = c("ABA", "ABB", "ABC", "ABD", "ABE", "ABF"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
dput(Book4)
structure(list(Item = c("ABA", "ABB", "ABC", "ABD", "ABE", "ABF",
"ABA", "ABB", "ABC", "ABD", "ABE", "ABF", "ABA", "ABB", "ABC",
"ABD", "ABE", "ABF"), Max1 = c(12, 68, 27, 17, 74, 76, 78, 93,
94, 98, 46, 90, 5, 58, 67, 64, 34, 97), Additional1 = c(40, 66,
100, 33, 66, 19, 8, 70, 21, 93, 48, 34, 44, 89, 74, 20, 0, 47
), Additional2 = c(39, 31, 85, 58, 0, 2, 57, 28, 31, 32, 15,
22, 93, 41, 57, 81, 95, 46)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L))
The Expected output for this is given below:
You are looking for slice_max:
library(dplyr)
Book4 %>%
group_by(Item) %>%
slice_max(Max1)
# Item Max1 Additional1 Additional2
# 1 ABA 78 8 57
# 2 ABB 93 70 28
# 3 ABC 94 21 31
# 4 ABD 98 93 32
# 5 ABE 74 66 0
# 6 ABF 97 47 46
Using base R
subset(Book4, Max1 == ave(Max1, Item, FUN = max))
-output
# A tibble: 6 × 4
Item Max1 Additional1 Additional2
<chr> <dbl> <dbl> <dbl>
1 ABE 74 66 0
2 ABA 78 8 57
3 ABB 93 70 28
4 ABC 94 21 31
5 ABD 98 93 32
6 ABF 97 47 46
Here is an alternative base solution that is more resilient to floating-point precision problems (cf. Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f). It also offers two behaviors when there are duplicate max values:
if you want all of them, use ties.method = "min";
if you want the first (or just one) of them, then ties.method = "first".
Book4[ave(Book4$Max1, Book4$Item, FUN = function(z) rank(-z, ties.method = "first")) == 1,]
# # A tibble: 6 x 4
# Item Max1 Additional1 Additional2
# <chr> <dbl> <dbl> <dbl>
# 1 ABE 74 66 0
# 2 ABA 78 8 57
# 3 ABB 93 70 28
# 4 ABC 94 21 31
# 5 ABD 98 93 32
# 6 ABF 97 47 46
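Book4 has no tied maxima, so the two ties.method options give the same rows there. A small sketch on a made-up data frame (toy and the helper pick are hypothetical names, not from the question) shows where they differ:

```r
# Hypothetical data: item "A" has its maximum (9) twice
toy <- data.frame(Item = c("A", "A", "A", "B"),
                  Max1 = c(5, 9, 9, 3))

# Keep the rows whose within-group rank of -Max1 equals 1
pick <- function(d, ties) {
  d[ave(d$Max1, d$Item,
        FUN = function(z) rank(-z, ties.method = ties)) == 1, ]
}

all_max <- pick(toy, "min")    # both tied rows of "A", plus "B"
one_max <- pick(toy, "first")  # only the first tied row of "A", plus "B"
```

With "min" every tied maximum gets rank 1, so all of them survive the filter; with "first" the ties are ranked in order of appearance, so only the earliest one does.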
Using R base aggregate + max + merge
> merge(Book4, aggregate(Max1~Item, data = Book4, max), by = c("Item", "Max1"))
Item Max1 Additional1 Additional2
1 ABA 78 8 57
2 ABB 93 70 28
3 ABC 94 21 31
4 ABD 98 93 32
5 ABE 74 66 0
6 ABF 97 47 46
I have a raw dataset like the table below.
likes age
1 2295 61
2 740 69
3 210 57
4 207 49
5 1226 51
6 9016 63
Using this dataset, my desired output is the table below:
new_age likes
age <60 1643
age >60 12051
new_age is divided into two ranges: one below 60, the other above 60.
Could you please help me produce this output?
Dput data is below:
structure(list(likes = c(2295L, 740L, 210L, 207L, 1226L, 9016L), age = c(61, 69, 57, 49, 51, 63)), class = "data.frame", row.names = c(NA, -6L))
We can create the group with a logical expression
library(dplyr)
df1 %>%
group_by(new_age = c('age > 60', 'age < 60')[(age < 60) + 1]) %>%
summarise(likes = sum(likes))
# A tibble: 2 x 2
# new_age likes
# <chr> <int>
#1 age < 60 1643
#2 age > 60 12051
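A base R sketch of the same grouping, using cut() and aggregate(). Note one assumption that the question leaves open: an age of exactly 60 is placed in the upper group here, so the label is written as "age >= 60".

```r
df1 <- data.frame(
  likes = c(2295L, 740L, 210L, 207L, 1226L, 9016L),
  age   = c(61, 69, 57, 49, 51, 63)
)

# right = FALSE gives the intervals [-Inf, 60) and [60, Inf)
df1$new_age <- cut(df1$age,
                   breaks = c(-Inf, 60, Inf),
                   labels = c("age < 60", "age >= 60"),
                   right = FALSE)

# Sum the likes within each age group
res <- aggregate(likes ~ new_age, data = df1, FUN = sum)
res
```

cut() extends naturally to more than two ranges by adding break points and labels, which the logical-index trick does not.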
I have a dataframe "data" with the following structure:
structure(list(age = c(45, 4, 32, 45), sex = c(1, 0, 1, 0), height = c(165,
178, 145, 132), weight = c(65, 73, 60, 45)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
And I would like to add to this data.frame two new variables (var1, var2), which should be calculated with the two following formulas:
var1 = age*height + (4 if sex==1 OR 2 if sex==0)
var2 = height*weight + (1 if age>40 or 2 if age=<40)
I have trouble both adding the two variables to the data frame and applying a function (I tried to build one, but it seems it can only be applied to a single value, not to the values in all rows).
Can anyone help me, please?
akrun's suggestion of using Boolean arithmetic is a good one, but you could also simply write a Boolean version of your own expression, substituting multiplication for the if statements (with mild editing of the "=<" to "<=").
data <- structure(list(age = c(45, 4, 32, 45), sex = c(1, 0, 1, 0), height = c(165, 178, 145, 132), weight = c(65, 73, 60, 45)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
data <- within(data, {var1 = age*height + 4*(sex==1) + 2 *(sex==0);
var2 = height*weight + (age>40) + 2 *(age <= 40)})
#----
> data
age sex height weight var2 var1
1 45 1 165 65 10726 7429
2 4 0 178 73 12996 714
3 32 1 145 60 8702 4644
4 45 0 132 45 5941 5942
Since the two sets of conditions are each disjoint, the "non-qualifying" choice terms will each be 0.
The function ifelse() is vectorized, so it applies the condition to each element of the vector.
df <- structure(list(age = c(45, 4, 32, 45), sex = c(1, 0, 1, 0), height = c(165,
178, 145, 132), weight = c(65, 73, 60, 45)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
df$var1 <- ifelse(df$sex == 1,(df$age * df$height) + 4,(df$age * df$height) + 2)
df$var2 <- ifelse(df$age > 40,(df$weight * df$height) + 1,(df$weight * df$height) + 2)
final output
> df
# A tibble: 4 x 6
age sex height weight var1 var2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45 1 165 65 7429 10726
2 4 0 178 73 714 12996
3 32 1 145 60 4644 8702
4 45 0 132 45 5942 5941
I prefer case_when() from the dplyr package.
Your original data is:
data <-
structure(
list(age = c(45, 4, 32, 45),
sex = c(1, 0, 1, 0),
height = c(165, 178, 145, 132),
weight = c(65, 73, 60, 45)),
row.names = c(NA, -4L),
class = c("tbl_df", "tbl", "data.frame"))
The new variables are created by:
library(dplyr)
data <- data %>%
  mutate(var1 = case_when(sex==1 ~ age*height + 4,
                          sex==0 ~ age*height + 2),
         var2 = case_when(age>40 ~ height*weight + 1,
                          age<=40 ~ height*weight + 2)
  )
The outcome is:
# A tibble: 4 x 6
age sex height weight var1 var2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45 1 165 65 7429 10726
2 4 0 178 73 714 12996
3 32 1 145 60 4644 8702
4 45 0 132 45 5942 5941
We convert the binary sex column to a numeric index by adding 1 and use it to pick 2 or 4 for var1. For var2, the logical (age <= 40) plus 1 yields 1 or 2 directly, and that goes into the calculation.
library(dplyr)
data %>%
mutate(var1 = (age * height) + c(2, 4)[sex + 1],
var2 = (height * weight) + (age <= 40)+1)
# A tibble: 4 x 6
# age sex height weight var1 var2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 45 1 165 65 7429 10726
#2 4 0 178 73 714 12996
#3 32 1 145 60 4644 8702
#4 45 0 132 45 5942 5941
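As a quick cross-check, the following base R sketch computes both variables three ways (ifelse, Boolean multiplication, and indexing) and confirms they agree. The vector names are illustrative only.

```r
data <- data.frame(age    = c(45, 4, 32, 45),
                   sex    = c(1, 0, 1, 0),
                   height = c(165, 178, 145, 132),
                   weight = c(65, 73, 60, 45))

# ifelse(): pick the additive term element by element
var1_ifelse <- with(data, age * height + ifelse(sex == 1, 4, 2))
var2_ifelse <- with(data, height * weight + ifelse(age > 40, 1, 2))

# Boolean arithmetic: TRUE counts as 1, FALSE as 0
var1_bool <- with(data, age * height + 4 * (sex == 1) + 2 * (sex == 0))
var2_bool <- with(data, height * weight + (age > 40) + 2 * (age <= 40))

# Indexing: turn the condition into position 1 or 2 of a lookup vector
var1_index <- with(data, age * height + c(2, 4)[sex + 1])
var2_index <- with(data, height * weight + c(2, 1)[(age > 40) + 1])

stopifnot(identical(var1_ifelse, var1_bool), identical(var1_bool, var1_index))
stopifnot(identical(var2_ifelse, var2_bool), identical(var2_bool, var2_index))
```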
I am trying to calculate the number of weeks the inventory on hand will last, given the sales projections, for a dataset with tens of millions of rows. I have listed the expected output in the last column of the data structure given below. I have also attached the implementation of this in Excel.
Logic
Weeksofsupply = Number of weeks the current inventory on hand will last.
Example: in the attached image (SKU_CD 222, STORE_CD 33), the inventory on hand is 19 and the cumulative sales are
Wk1 + Wk2 = 15 and Wk1 + Wk2 + Wk3 = 24, which is greater than 19.
So we pick 2, which is the count of weeks the current inventory will last.
Expected output in the last column
Data = structure(list(
SKU_CD = c(111, 111, 111, 111, 111, 111, 111,111, 111, 111, 111, 111, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222, 222),
STORE_CD = c(22, 22, 22, 22, 22, 22, 22,22, 22, 22, 22, 22, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33),
FWK_CD = c(201627, 201628, 201629, 201630, 201631, 201632,201633, 201634, 201635, 201636, 201637, 201638, 201627, 201628, 201629, 201630, 201631, 201632, 201633, 201634, 201635, 201636, 201637, 201638),
SALES = c(5, 2, 2, 2, 1, 3, 2, 2, 3, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 7, 5),
INVENTORY = c(29, 27, 25, 23, 22, 19, 17, 15, 12, 10, 25, 1, 19, 17, 15, 13, 12,9, 7, 5, 2, 0, 25, 18),
WeeksofSupply = c("11", "10", "9", "8", "8", "6", "5", "4", "3", "2", "Inventory More", "Inventory Less", "2", "2", "1", "1", "1", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory Less", "Inventory More", "Inventory More")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L),
.Names = c("SKU_CD", "STORE_CD", "FWK_CD", "SALES", "INVENTORY", "WeeksofSupply"))
Current Excel Code: (Here the weeks are shown in columns, but it should be rows like shown in the expected output.)
=IF(A2<SUM(B2:K2),
SUMPRODUCT(--(SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-COLUMN(B2)+1))<=A2))
+LOOKUP(0,SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-COLUMN(B2)+1))-B2:K2-A2,
(A2-(SUBTOTAL(9,OFFSET(B2:K2,,,,COLUMN(B2:K2)-COLUMN(B2)+1))-B2:K2))/B2:K2),
IF(A2=SUM(B2:K2),COUNT(B2:K2),"Inventory exceeds forecast"))
I would appreciate any input to implement this efficiently in R. Many Thanks for your time!
For your revised data in long format, you can do the following...
library(dplyr) #for the grouping functionality
#define a function to calculate weeks Supply from Sales and Inventory
weekSup <- function(sales,inv){
  sales <- unlist(sales)
  inv <- unlist(inv)
  n <- length(sales)
  weeksup <- rep(NA,n)
  for(i in seq_len(n)){
    #last week of the group, or inventory already below this week's sales
    if(i==n || inv[i]<sales[i]){
      weeksup[i] <- ifelse(inv[i]>sales[i],NA,inv[i]/sales[i])
    } else {
      #interpolate the week at which cumulative sales reach the inventory
      weeksup[i] <- approxfun(cumsum(sales[i:n]),1:(n-i+1))(inv[i])
    }
  }
  #Your 'inventory more' is coded as -1 (a number) to avoid whole column being forced to a character vector
  weeksup <- replace(weeksup,is.na(weeksup),-1)
  return(weeksup) #for whole weeks, change this to `return(floor(weeksup))`
}
Data2 <- Data %>% group_by(SKU_CD,STORE_CD) %>% mutate(weekSup=weekSup(SALES,INVENTORY))
head(Data2,20)
SKU_CD STORE_CD FWK_CD SALES INVENTORY WeeksofSupply weekSup
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 111 22 201627 5 29 11 11.3333333
2 111 22 201628 2 27 10 10.8333333
3 111 22 201629 2 25 9 9.8333333
4 111 22 201630 2 23 8 8.8333333
5 111 22 201631 1 22 8 8.0000000
6 111 22 201632 3 19 6 6.6666667
7 111 22 201633 2 17 5 5.8333333
8 111 22 201634 2 15 4 4.8333333
9 111 22 201635 3 12 3 3.6666667
10 111 22 201636 2 10 2 2.8333333
11 111 22 201637 3 25 Inventory More -1.0000000
12 111 22 201638 6 1 Inventory Less 0.1666667
13 222 33 201627 7 19 2 2.4444444
14 222 33 201628 8 17 2 2.0000000
15 222 33 201629 9 15 1 1.6000000
16 222 33 201630 10 13 1 1.2727273
17 222 33 201631 11 12 1 1.0833333
18 222 33 201632 12 9 Inventory Less 0.7500000
19 222 33 201633 13 7 Inventory Less 0.5384615
20 222 33 201634 14 5 Inventory Less 0.3571429
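With tens of millions of rows, the per-row loop and repeated approxfun calls may become slow. If only whole weeks are needed, one possible alternative is a loop-free sketch based on prefix sums and findInterval(). It assumes the rows are already sorted by week within each SKU/store group and that sales are non-negative; the function name whole_weeks and the NA coding for "Inventory More" are choices made for this illustration, not part of the answer above.

```r
Data <- data.frame(
  SKU_CD    = rep(c(111, 222), each = 12),
  STORE_CD  = rep(c(22, 33), each = 12),
  FWK_CD    = rep(201627:201638, times = 2),
  SALES     = c(5, 2, 2, 2, 1, 3, 2, 2, 3, 2, 3, 6,
                7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 7, 5),
  INVENTORY = c(29, 27, 25, 23, 22, 19, 17, 15, 12, 10, 25, 1,
                19, 17, 15, 13, 12, 9, 7, 5, 2, 0, 25, 18)
)

# Whole weeks of supply for one group, without an explicit loop
whole_weeks <- function(sales, inv) {
  n <- length(sales)
  cs <- cumsum(sales)                  # sales up to and including each week
  sold_before <- c(0, cs)[seq_len(n)]  # sales strictly before each week
  # Count the weeks j >= i whose running sales from week i stay within inv[i]
  weeks <- findInterval(sold_before + inv, cs) - (seq_len(n) - 1L)
  # Inventory larger than all remaining forecast sales: coded as NA
  remaining <- cs[n] - sold_before
  weeks[inv > remaining] <- NA_integer_
  weeks
}

# Apply per SKU/store group via row indices
key <- paste(Data$SKU_CD, Data$STORE_CD)
Data$whole_weeks <- ave(seq_len(nrow(Data)), key,
                        FUN = function(idx) whole_weeks(Data$SALES[idx],
                                                        Data$INVENTORY[idx]))
Data
```

In this sketch a result of 0 corresponds to "Inventory Less" and NA to "Inventory More" in the expected output column. The same function could be called inside group_by() and mutate() instead of ave().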
Here is one way to do it, using the linear interpolation method approxfun...
data$WeeksSupply <- sapply(1:nrow(data),function(i)
approxfun(cumsum(as.vector(c(data[i,2:11]))),1:10)(data$Inventory[i]))
data$WeeksSupply <- replace(data$WeeksSupply,is.na(data$WeeksSupply),
"Inventory Exceeds Forecast")
data
# A tibble: 2 x 12
Inventory Wk1 Wk2 Wk3 Wk4 Wk5 Wk6 Wk7 Wk8 Wk9 Wk10 WeeksSupply
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 200 20 15 25 40 35 45 30 50 45 55 6.66666666666667
2 2000 20 15 25 40 35 45 30 50 45 55 Inventory Exceeds Forecast