Currently, I have one df and a price table.
df:
Order Number   wgt
01             22
02             5
03             35

Price table:
wgt_intvl   price
0-15        50
15-25       75
25-50       135
What I'd like is to match the weight from the df to an interval in the price table in R. For example, the first order (Order Number 01, wgt 22) falls in the 15-25 interval and therefore corresponds to a price of 75. I want to add a column to the first df, say df$cost, that holds the appropriate price according to wgt_intvl in the price table.
The way I see to do it is with an if-else statement, but this is highly inefficient and I was wondering if there is a better way to do it. In reality these tables are much longer - there is no logical "buildup" in price or weight interval. I have 15 weight intervals in this table. My current solution would look like this:
if (wgt < 15) {
  df$cost <- 50
} else if (wgt >= 15 & wgt < 25) {
  df$cost <- 75
} else if (wgt >= 25 & wgt < 50) {
  df$cost <- 135
}
This times fifteen, using the corresponding prices of the price table. I'd love a more efficient solution. Thanks in advance!
Using the data shown reproducibly in the Note at the end, form the vector of cutpoints (i.e. the first number in each interval) and then use findInterval to find the interval corresponding to the weight.
cutpoints <- as.numeric(sub("-.*", "", dfprice$wgt_intvl))
transform(dfmain, price = dfprice$price[findInterval(wgt, cutpoints)])
giving:
Order wgt price
1 01 22 75
2 02 5 50
3 03 35 135
4 04 25 135
Note
dfmain <- data.frame(Order = c("01", "02", "03", "04"), wgt = c(22, 5, 35, 25),
stringsAsFactors = FALSE)
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
price = c(50, 75, 135), stringsAsFactors = FALSE)
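To see what findInterval is doing here: the cutpoints extracted from dfprice are 0, 15 and 25, and each weight maps to the index of the interval it falls into (the commented lines show what these calls return):
cutpoints
# [1]  0 15 25
findInterval(c(22, 5, 35, 25), cutpoints)
# [1] 2 1 3 3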
Instead of an if-statement you could use a more efficient case_when operation:
library(dplyr)
df %>%
  mutate(cost = case_when(
    wgt < 15 ~ 50,
    wgt >= 15 & wgt < 25 ~ 75,
    TRUE ~ 135
  ))
Alternatively you could use cut() to transform wgt to wgt_intvl and match via left_join().
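A minimal sketch of that route, assuming breaks and labels that mirror the dfprice lookup from the first answer (intervals closed on the left, to match findInterval):
library(dplyr)

df %>%
  mutate(wgt_intvl = as.character(cut(wgt,
                                      breaks = c(0, 15, 25, 50),
                                      labels = c("0-15", "15-25", "25-50"),
                                      right = FALSE))) %>%
  left_join(dfprice, by = "wgt_intvl")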
Related
I have a dataframe where the columns represent patients of various ages, and another dataframe with the values of those ages. I want to subset the data such that only patients below the age of 50 are displayed.
> dat
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at 179.86300 106.495000 265.58600 301.24300 218.50900 224.61000
31308_at 559.07800 411.483000 481.17600 570.73300 333.53900 370.07900
31309_r_at 20.76970 30.641500 50.21530 42.68920 27.10590 21.57620
31310_at 154.19100 224.446000 188.82300 177.86300 233.46300 120.90800
31311_at 956.79700 648.310000 933.65600 1016.41000 762.01300 1040.29000
And the annotation file with the ages of the patients
> ann
Gender Age
GSM27015 M 26
GSM27016 M 26
GSM27018 M 29
GSM27021 M 37
GSM27023 M 40
GSM27024 M 42
GSM27025 M 45
GSM27027 M 52
GSM27028 M 53
Here's something else to consider.
You could transpose your data, so that patients are rows rather than columns. Since it looks like you have age and gender in your column names, you can make these additional columns as well.
dat_new <- cbind(do.call(rbind, strsplit(colnames(dat), '\\.')), as.data.frame(t(dat)))
colnames(dat_new)[1:3] <- c("id", "age", "gender")
rownames(dat_new) <- NULL
dat_new$age <- as.numeric(dat_new$age)  # the split yields character values; convert age for numeric comparisons
This is what it would look like:
id age gender 31307_at 31308_at 31309_r_at 31310_at 31311_at
1 GSM27015 26 M 179.863 559.078 20.7697 154.191 956.797
2 GSM27016 26 M 106.495 411.483 30.6415 224.446 648.310
3 GSM27018 29 M 265.586 481.176 50.2153 188.823 933.656
4 GSM27021 37 M 301.243 570.733 42.6892 177.863 1016.410
5 GSM27023 40 M 218.509 333.539 27.1059 233.463 762.013
6 GSM27024 42 M 224.610 370.079 21.5762 120.908 1040.290
Then, if you wish to subset based on age (e.g., <= 50 years), you can do:
dat_new[dat_new$age <= 50, ]
Perhaps try
# pull the age between the first and second dots in each column name, then keep columns with age < 50
dat[as.numeric(gsub(".*?\\.(\\d+)\\..*", "\\1", names(dat))) < 50]
Does this work:
> library(dplyr)
> data
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M GSM27024.52.M
31307_at 179.8630 106.4950 265.5860 301.2430 218.5090 224.6100 331.230
31308_at 559.0780 411.4830 481.1760 570.7330 333.5390 370.0790 370.079
31309_r_at 20.7697 30.6415 50.2153 42.6892 27.1059 21.5762 98998.000
31310_at 154.1910 224.4460 188.8230 177.8630 233.4630 120.9080 120.908
31311_at 956.7970 648.3100 933.6560 1016.4100 762.0130 1040.2900 1000.290
> data %>% select_if(as.numeric(gsub('GSM\\d{5}\\.(\\d{2})..','\\1',names(data))) < 50)
GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at 179.8630 106.4950 265.5860 301.2430 218.5090 224.6100
31308_at 559.0780 411.4830 481.1760 570.7330 333.5390 370.0790
31309_r_at 20.7697 30.6415 50.2153 42.6892 27.1059 21.5762
31310_at 154.1910 224.4460 188.8230 177.8630 233.4630 120.9080
31311_at 956.7970 648.3100 933.6560 1016.4100 762.0130 1040.2900
So I added one more column, "GSM27024.52.M", to your data, and as you can see it was not selected in the output.
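As a side note, select_if() has since been superseded in dplyr. A sketch of the same idea that computes the ages once and then selects by position (the regex assumes the GSMxxxxx.age.sex naming shown above):
library(dplyr)

# extract the age component of each column name
ages <- as.numeric(sub("^GSM\\d+\\.(\\d+)\\..*$", "\\1", names(data)))

# keep only the columns whose age is below 50
data %>% select(which(ages < 50))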
An option with parse_number
library(stringr)
# remove the leading "GSMxxxxx." part, then parse the remaining number (the age)
dat[readr::parse_number(str_remove(names(dat), "^[^.]+\\.")) < 50]
I have created the following data frame:
age <- c(21,35,829,2)
sex <- c("m","f","m","c")
height <- c(181,173,171,166)
weight <- c(69,58,75,60)
dat <- as.data.frame(cbind(age,sex,height,weight), stringsAsFactors = FALSE)
dat$age <- as.numeric(age)
dat
I now want to choose only the rows of students who are older than 20 or younger than 80.
Why does this work: dat[dat$age < 20 | dat$age > 80,] ; subset(dat, age < 20 | age > 80)
But this does not: dat[dat$age > 20 | dat$age < 80,] ; subset(dat, age > 20 | age < 80)
I can subset the rows that are younger than 20 or older than 80 (i.e., NOT in the interval), but not those that are actually in the interval.
What is the mistake?
Thanks in advance.
Because your condition matches essentially every possible age. Think about it: your conditions are independent (because you are using the | operator), so every row that fits either one of them is selected by your filter. Every age in your data.frame is either higher than 20, or, if it is not, it is certainly lower than 80.
If you want to select every row with an age between 20 and 80, you have to change the logical operator and make the conditions joint, like this:
dat[dat$age>20 & dat$age<80,]
subset(dat, age > 20 & age < 80)
Resulting in this:
age sex height weight
1 21 m 181 69
2 35 f 173 58
Now, if you want to select all the rows that are outside of this interval, you can negate the condition with the ! operator, as suggested by @r2evans in the comments. It would be something like this:
dat[!(dat$age > 20 & dat$age < 80),]
subset(dat, !(age > 20 & age < 80))
Resulting in this:
age sex height weight
3 829 m 171 75
4 2 c 166 60
Why not use dplyr filter?
library(dplyr)
df_age <- dat %>%
  dplyr::filter(age > 20, age < 80)
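If you prefer a single condition, dplyr::between() expresses the interval directly. Note that between() is inclusive at both ends, so for integer ages the strict bounds > 20 and < 80 become 21 and 79 (a sketch):
library(dplyr)

dat %>% filter(between(age, 21, 79))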
I'm trying to subset data based on a conditional statement of a column that has blank values which means the employee logged in multiple times on a work order. An example data set is shown below:
employee_name <- c("Person A","Person A","Person A","Person A","Person A", "Person B","Person B","Person B")
work_order <- c("WO001","WO001","WO001","WO002","WO003","WO001","WO003", "WO003")
num_of_points <- c(40,"","",64,25,20,68,"")
time <- c(10, 30, 15, 20, 25, 5, 15, 30)
final_summary <- data.frame(employee_name,work_order,num_of_points, time)
View(final_summary)
Input
Basically, I want to sum up the points and time by selecting all rows with points > 30, then group by Employee Name and Work Order, which should return this:
Output
I can do the summarize function properly, but when I perform the initial subset, it excludes the blank rows for num_of_points and thus does not compute all the adjacent time (in minutes) values. This makes sense because subset(num_of_points > 30) only finds anything greater than 30. How can I tweak this to include the blank rows so I can successfully filter the data in order to compute the sum of time accurately, grouped by unique work order and employee name?
Convert num_of_points to numeric class; then, grouped by 'employee_name' and 'work_order', get the sum of 'num_of_points' where it is greater than 30 and the sum of 'time'; finally, filter out the rows where 'num_of_points' is 0.
library(dplyr)
final_summary %>%
  mutate(num_of_points = as.numeric(num_of_points)) %>%  # blanks become NA; if the column is a factor, use as.numeric(as.character(num_of_points))
  group_by(employee_name, work_order) %>%
  summarise(num_of_points = sum(num_of_points[num_of_points > 30],
                                na.rm = TRUE),
            time = sum(time)) %>%
  filter(num_of_points > 0)
# A tibble: 3 x 4
# Groups: employee_name [2]
# employee_name work_order num_of_points time
# <chr> <chr> <dbl> <dbl>
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
In base R you can do:
aggregate(.~employee_name + work_order, type.convert(final_summary), sum, subset = num_of_points>30)
employee_name work_order num_of_points time
1 Person A WO001 40 10
2 Person A WO002 64 20
3 Person B WO003 68 15
Note that the subset argument drops whole rows, so time above is summed only over rows with num_of_points > 30. To sum time over all rows, you can aggregate num_of_points and time separately and merge the results.
merge(aggregate(num_of_points~employee_name + work_order, final_summary,
sum, subset = num_of_points>30),
aggregate(time~employee_name + work_order, final_summary, sum))
# employee_name work_order num_of_points time
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
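For completeness, the same grouped sums can be sketched with data.table (the as.character() step guards against num_of_points being a factor):
library(data.table)

dt <- as.data.table(final_summary)
dt[, num_of_points := as.numeric(as.character(num_of_points))]  # blanks become NA

dt[, .(num_of_points = sum(num_of_points[num_of_points > 30], na.rm = TRUE),
       time = sum(time)),
   by = .(employee_name, work_order)][num_of_points > 0]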
I'm using R to create an occupancy-model encounter history. I need to take a list of bird counts for individual leks, separate them by year, then code the count dates into two intervals: within 10 days of the first count (Interval 1) or more than 10 days after the first count (Interval 2). For any year where only one count occurred, I need to add an entry coded as "U" to indicate that no count occurred during the second interval. After that I need to subset out only the max count in each year and interval. A sample dataset:
ComplexId Date Males Year category
57 1941-04-15 97 1941 A
57 1942-04-15 67 1942 A
57 1943-04-15 44 1943 A
57 1944-04-15 32 1944 A
57 1946-04-15 21 1946 A
57 1947-04-15 45 1947 A
57 1948-04-15 67 1948 A
57 1989-03-21 25 1989 A
57 1989-03-30 41 1989 A
57 1989-04-13 2 1989 A
57 1991-03-06 35 1991 A
57 1991-04-04 43 1991 A
57 1991-04-11 37 1991 A
57 1991-04-22 25 1991 A
57 1993-03-23 6 1993 A
57 1994-03-06 17 1994 A
57 1994-03-11 10 1994 A
57 1994-04-06 36 1994 A
57 1994-04-15 29 1994 A
57 1994-04-21 27 1994 A
Now here is the code I wrote to accomplish my task, naming the dataframe above "c1" (you'll need to coerce the date column to date, and the category column to character):
c1_Year<-lapply(unique(c1$Year), function(x) c1[c1$Year == x,]) #splits complex counts into list by year
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2]-c1_Year[[i]][1,2]))
} #adds column with difference between first survey and subsequent surveys
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-if(length(c1_Year[[i]][,1]) == 1)
rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
} # adds U values to years with only 1 count, while coercing the "U" into the appropriate interval
for(i in 1:length(c1_Year)){
c1_Year[[i]]$Interval<- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
} # adds interval code for each survey, 1 = less than ten days after the first count, 2 = ten or more days after the first count
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-ddply(.data=c1_Year[[i]], .(Interval), subset, Males==max(Males))
} # subsets out max count in each interval
The problem arises during the second for-loop, which, when options(error=recover) is enabled, returns:
Error in c1_Year[[i]] : subscript out of bounds
No suitable frames for recover()
Up to that point the code accomplishes what it was supposed to: even though the error message is generated, the extra rows with the "U" code are still appended to the data frames. The issue is that I have 750 leks to do this for, so I tried to build the code above into a function; however, when I run the function on any data, the subscript-out-of-bounds error stops it from running. I could brute-force it and run the code above for each lek manually, but I was hoping for a more elegant solution. What I need to know is why I am getting the subscript-out-of-bounds error, and how I can fix it.
Here's the function I wrote, so that you can see that it doesn't work:
create.OEH<-function(dataset, final_dataframe){
c1_Year<-lapply(unique(dataset$Year), function(x) dataset[dataset$Year == x,]) #splits complex counts into list by year
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-cbind(c1_Year[[i]], daydiff = as.numeric(c1_Year[[i]][,2]-c1_Year[[i]][1,2]))
} #adds column with difference between first survey and subsequent surveys
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-if(length(c1_Year[[i]][,1]) == 1)
rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
} # adds U values to years with only 1 count,
for(i in 1:length(c1_Year)){
c1_Year[[i]]$Interval<- ifelse(c1_Year[[i]][,6] < 10, 1, 2)
} # adds interval code for each survey, 1 = less than ten days after the first count, 2 = ten or more days after the first count
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-ddply(.data=c1_Year[[i]], .(Interval), subset, Males==max(Males))
} #subset out max count for each interval
df<-rbind.fill(c1_Year) #collapse list into single dataframe
final_dataframe<-df[!duplicated(df[,c("Year", "Interval")]),] #remove ties for max count
}
In this bit of code
for(i in 1:length(c1_Year)){
c1_Year[[i]]<-if(length(c1_Year[[i]][,1]) == 1)
rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
}
You are assigning NULL whenever length(c1_Year[[i]][,1]) == 1 is not true, which removes those elements from c1_Year entirely.
You probably want
for(i in 1:length(c1_Year)){
if (length(c1_Year[[i]][,1]) == 1) {
c1_Year[[i]] <- rbind(c1_Year[[i]], c(c1_Year[[i]][1,1], NA, 0, c1_Year[[i]][1,4], "U", 11))
}
}
However, I see you are already using ddply, so you may be able to avoid a lot of this repetition.
The ddply(c1, .(Year), ...) splits up c1 into unique years.
c2 <- ddply(c1,
.(Year),
function (x) {
# create 'Interval'
x$Interval <- ifelse(x$Date - x$Date[1] < 10, 1, 2)
# extract max males per interval
o <- ddply(x, .(Interval), subset, Males==max(Males))
# add the 'U' col if no '2' interval
if (all(o$Interval != 2)) {
o <- rbind(o,
list(o$ComplexId, NA, 0, o$Year, 'U', 2))
}
# return the resulting dataframe
o
})
I converted your rbind(.., c(...)) to rbind(.., list(...)) to avoid converting everything to character (which is what c() does, because it cannot handle mixed types).
Otherwise the code is almost the same as yours.
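A quick illustration of that coercion, with values mimicking one appended row:
c(57, NA, 0, 1941, "U", 11)     # c() coerces everything to character
# [1] "57"   NA     "0"    "1941" "U"    "11"
list(57, NA, 0, 1941, "U", 11)  # list() keeps each element's own type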
I have an R dataframe that describes the evolution of the sales of a product in approx. 2000 shops on a quarterly basis, with 5 columns (i.e. 5 periods of time). I'd like to know how to analyse it with R.
I've already tried some basic analysis: determine the average sales for each period, and then compare the evolution of each shop relative to this general evolution. For instance, there is a total of 55 000 sales for the 1st period and 35 000 for the 5th, so I assume that for each shop the normal sale in the 5th period should be 35/55 = 0.63 times the amount of the 1st period's sale: if shop X has sold 100 items in the first period, I assume it should normally sell 63 items in the 5th period.
Obviously, this is an easy-to-do method, but it is not statistically sound.
I would like a method that fits a trend curve minimizing the residual error. My objective is to analyse the sales of the shops while neutralizing the general trend: I'd like to know precisely which shops are underperforming and which are overperforming, with a statistically correct approach.
My dataframe is structured in this way:
shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45 | 15 | 40 | 25 | 25
000002 | 100 | 20 | 20 | 20 | 20 | 20
000003 | 500 | 200 | 0 | 100 | 100 | 100
... (2200 rows)
I've tried to put my time series into a ts object, which works, using the following code:
reversesales=t(data.frame(sales$qt1,sales$qt2,sales$qt3,sales$qt4,sales$qt5))
# I reverse rows and columns of the frame in order that the time periods be the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters
Still, I am unable to do anything with this variable. I can't plot it (with the plot function), as there are 2200 rows and so R wants to produce 2200 successive plots, which is obviously not what I want.
In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period for each shop...
Thank you for your help! (and merry Christmas)
An implementation of a mixed model:
install.packages("nlme")
library("nlme")
library(dplyr)
# Generating some data with a structure like yours:
start <- round(sample(10:100, 50, replace = TRUE)*runif(50))
df <- data_frame(shopID = 1:50,
                 qt1 = start,
                 qt2 = round(qt1 * runif(50, .5, 2)),
                 qt3 = round(qt2 * runif(50, .5, 2)),
                 qt4 = round(qt3 * runif(50, .5, 2)),
                 qt5 = round(qt4 * runif(50, .5, 2)))
df <- as.data.frame(df)
# Converting it into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1], direction = "long", sep = "")
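Equivalently, the wide-to-long step could be sketched with tidyr's pivot_longer(), applied to the wide df from above:
library(tidyr)

df_long <- pivot_longer(df, cols = starts_with("qt"),
                        names_to = "time", names_prefix = "qt",
                        names_transform = list(time = as.integer),
                        values_to = "qt")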
Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)
# Extract the random effects for comparison:
random.effects(mod)
(Intercept) time
1 74.0790805 3.7034172
2 7.8713699 4.2138001
3 -8.0670810 -5.8754060
4 -16.5114428 16.4920663
5 -16.7098229 6.4685228
6 -11.9630688 -8.0411504
7 -12.9669777 21.3071366
8 -24.1099280 32.9274361
9 8.5107335 -9.7976905
10 -13.2707679 -6.6028927
11 3.6206163 -4.1017784
12 21.2342886 -6.7120725
13 -14.6489512 11.6847109
14 -14.7291647 2.1365768
15 10.6791941 3.2097199
16 -14.1524187 -1.6933291
17 5.2120647 8.0119320
18 -2.5172933 -6.5011416
19 -9.0094366 -5.6031271
20 1.4857512 -5.9913865
21 -16.5973442 3.5164298
22 -26.7724763 27.9264081
23 49.0764631 -12.9800871
24 -0.1512509 2.3589947
25 15.7723150 -7.9295698
26 2.1955489 11.0318875
27 -8.0890346 -5.4145977
28 0.1338790 -8.3551182
29 9.7113758 -9.5799588
30 -6.0257683 42.3140432
31 -15.7655545 -8.6226255
32 -4.1450984 18.7995079
33 4.1510104 -1.6384103
34 2.5107652 -2.0871890
35 -23.8640815 7.6680185
36 -10.8228653 -7.7370976
37 -14.1253093 -8.1738468
38 42.4114024 -9.0436585
39 -10.7453627 2.4590883
40 -12.0947901 -5.2763010
41 -7.6578305 -7.9630013
42 -14.9985612 -0.4848326
43 -13.4081771 -7.2655456
44 -11.5646620 -7.5365387
45 6.9116844 -10.5200339
46 70.7785492 -11.5522014
47 -7.3556367 -8.3946072
48 27.3830419 -6.9049164
49 14.3188079 -9.9334156
50 -15.2077850 -7.9161690
I would interpret the values as follows: consider them deviations from zero, so positive values are positive deviations from the average trend, and negative values are negative deviations from it. The averages of the two columns are zero, as checked below:
round(apply(random.effects(mod), 2, mean))
(Intercept) time
0 0
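To answer the original question of which shops over- or under-perform once the general trend is neutralized, you could sort the shops by their random slope, and use the fitted values as the theoretical sales per shop and quarter (a sketch based on the model above):
re <- random.effects(mod)

# shops whose sales grow more slowly than the common trend come first
re[order(re$time), ]

# the model's expected sales for each shop and quarter
head(fitted(mod))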
library(zoo)
#Reconstructing the data with four quarter columns (instead of five quarters as in your example)
shopID <- c(1, 2, 3, 4, 5)
sum <- c(150, 100, 500, 350, 50)
qt1 <- c(40, 10, 130, 50, 10)
qt2 <- c(40, 40, 110, 100, 15)
qt3 <- c(50, 30, 140, 150, 10)
qt4 <- c(20, 20, 120, 50, 15)
myDF <- data.frame(shopID, sum, qt1, qt2, qt3, qt4)
#The ts() function converts a numeric vector into an R time series object
ts1 <- ts(as.numeric((myDF[1,3:6])), frequency=4)
ts2 <- ts(as.numeric((myDF[2,3:6])), frequency=4)
ts3 <- ts(as.numeric((myDF[3,3:6])), frequency=4)
ts4 <- ts(as.numeric((myDF[4,3:6])), frequency=4)
ts5 <- ts(as.numeric((myDF[5,3:6])), frequency=4)
#Merge time series objects
tsm <- merge(a = as.zoo(ts1), b = as.zoo(ts2), c = as.zoo(ts3), d = as.zoo(ts4), e = as.zoo(ts5))
#Plotting the Time Series
plot.ts(tsm, plot.type = "single", lty = 1:5, xlab = "Time", ylab = "Sales")
The code is not optimized and can be improved. Hope this gives some direction.