In my data frame, the column nr.flights contains only zeros up to row 1500, and I want to turn those into NAs (I have no data on nr.flights for the first 1500 rows). From row 1500 onwards there are other values that are zero, but those need to remain zero.
My dataframe looks like this:
Date AD Runway MTOW nr.flights
2008-01-01 A 18 376 0
2008-01-01 A 18 376 0
2008-01-01 D 36 190 0
2008-01-02 D 09 150 2
2008-01-02 A 36 280 1
2008-01-02 A 36 280 1
And I want it to look like this:
Date AD Runway MTOW nr.flights
2008-01-01 A 18 376 NA
2008-01-01 A 18 376 NA
2008-01-01 D 36 190 NA
2008-01-02 D 09 150 2
2008-01-02 A 36 280 1
2008-01-02 A 36 280 1
So far I've only managed to change the entire column into either NAs or zeros, but I want to have both in there. Any help would be much appreciated!
To reproduce:
df <- data.frame(Date = c("2008-01-01", "2008-01-01", "2008-01-01",
                          "2008-01-02", "2008-01-02", "2008-01-02"),
                 AD = c("A", "A", "D", "D", "A", "A"),
                 Runway = c(18, 18, 36, 9, 36, 36),
                 MTOW = c(376, 376, 190, 150, 280, 280),
                 nr.flights = c(0, 0, 0, 2, 1, 1))
Here's a way:
is.na(df$nr.flights[1:1500])[df$nr.flights[1:1500] == 0] <- TRUE
It works by isolating the values equal to 0 among the first 1500 rows, then assigning their NA status to TRUE. This is typically safer than something like df[mysubset] <- NA. (On the six-row example data, use 1:6, or seq_len(min(1500, nrow(df))), so the index does not run past the end of the column.)
df
Date AD Runway MTOW nr.flights
1 2008-01-01 A 18 376 NA
2 2008-01-01 A 18 376 NA
3 2008-01-01 D 36 190 NA
4 2008-01-02 D 9 150 2
5 2008-01-02 A 36 280 1
6 2008-01-02 A 36 280 1
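For intuition, here is the same mechanism on a plain vector (a minimal sketch, separate from the data above):
x <- c(0, 0, 2, 0, 1)
# turn only the zeros among the first three entries into NA
is.na(x[1:3])[x[1:3] == 0] <- TRUE
x
# [1] NA NA  2  0  1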
Here is an option using data.table
library(data.table)
setDT(df)[1:.N <= 1500 & !nr.flights, nr.flights := NA]
df
# Date AD Runway MTOW nr.flights
#1: 2008-01-01 A 18 376 NA
#2: 2008-01-01 A 18 376 NA
#3: 2008-01-01 D 36 190 NA
#4: 2008-01-02 D 9 150 2
#5: 2008-01-02 A 36 280 1
#6: 2008-01-02 A 36 280 1
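For completeness, the same idea in dplyr (a sketch of my own, not from the original answers; assumes a reasonably recent dplyr):
library(dplyr)
df <- df %>%
  mutate(nr.flights = ifelse(row_number() <= 1500 & nr.flights == 0,
                             NA, nr.flights))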
Thank you all for your answers. I thought I was smarter than I am and hoped I would have understood some of them. I think I also messed up the visualisation of my data, so I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into reshaping from wide to long format, but it doesn't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version (dw here is the simplified wide dataframe shown further below, with columns PID, T1, measurement1, T2, measurement2, T3, measurement3):
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to = ".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from the column names, based on the names_pattern argument. The names_pattern argument takes a regex with capture groups. In this case, here is the breakdown:
(.+)   # match everything - whatever is captured here becomes the ".value"
[0-9]  # one numeric character - tells the pattern that the number
       # at the end is excluded from ".value". If you have multiple-digit
       # numbers, use [0-9]+ instead
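Applied to the question's data (a sketch, assuming the measurement4 header typo is fixed and tidyr >= 1.0), that pattern produces the requested long format:
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(cols = -pid,
               names_to = ".value",               # one output column per name stem
               names_pattern = "(.+?)[0-9]+") %>% # stem = text before the trailing digits
  filter(!is.na(measurement)) %>%                 # drop the empty visit slots
  rename(Time = Tdays, Value = measurement) %>%
  select(pid, Time, Value)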
In the last edit you asked for a solution that is easy to understand. A very simple approach is to stack the measurement columns on top of each other, and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together, but leaves us without pid, which we can add back using rep to replicate the original pid column 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not yet in the order you expected; you can sort the data.frame and keep just the needed columns if that is a concern:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
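If you also want to drop the all-NA rows (like row 37 above), a final filter does it:
result <- result[!is.na(result$Value), ]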
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
  tobind = df[, c(1, j, j+1)]
  names(tobind) = names(finalDf)  # rbind() requires matching column names
  finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
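Because finalDf inherits the column names of the first block (PID, T1, measurement1), you may want to rename the columns at the end:
names(finalDf) = c("PID", "Time", "Value")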
Maybe you can try base R's reshape, as below. The setNames/gsub step renames the columns to measurement.1, Tdays.1, and so on, so that reshape can split each name into a variable part and a time index:
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
I have a data frame with 3 different identifications and sometimes they overlap. I want to create a new column, with only one of those ids, in an order of preference (id1>id2>id3).
Ex.:
id1  id2  id3
 12  145 8763
 45  836 5766
 13  768 9374
 NA  836 5766
 12  145   NA
 NA   NA 9282
 NA  567   NA
 45  836 5766
and I want to have:
id1  id2  id3 id.new
 12  145 8763     12
 45  836 5766     45
 13  768 9374     13
 NA  836 5766    836
 12  145   NA     12
 NA   NA 9282   9282
 NA  567   NA    567
 45  836 5766     45
I have tried the ifelse, which, and grep functions, but I can't make it work.
Ex. of my try:
df$id1 <- ifelse(df$id1 == "", paste(df$2), (ifelse(df$id1)))
I am able to do this in Excel, but I am switching to R because it is more reliable and reproducible :) In Excel I would use:
=if(A1="",B1,(if(B1="",C1,B1)),A1)
Using coalesce from the dplyr package, we can try:
library(dplyr)
df$id.new <- coalesce(df$id1, df$id2, df$id3)
df
id1 id2 id3 id.new
1 12 145 8763 12
2 45 836 5766 45
3 13 768 9374 13
4 NA 836 5766 836
5 12 145 NA 12
6 NA NA 9282 9282
7 NA 567 NA 567
8 45 836 5766 45
Data:
df <- data.frame(id1=c(12,45,13,NA,12,NA,NA,45),
id2=c(145,836,768,836,145,NA,567,836),
id3=c(8763,5766,9374,5766,NA,9282,NA,5766))
In base R you can use apply over the rows of is.na(df) with which.min, which returns the position of the first FALSE, i.e. the first non-NA id; combining that with the row numbers gives a matrix used for subsetting. Thanks to @tim-biegeleisen for the dataset.
df$id.new <- df[cbind(1:nrow(df), apply(is.na(df), 1, which.min))]
df
# id1 id2 id3 id.new
#1 12 145 8763 12
#2 45 836 5766 45
#3 13 768 9374 13
#4 NA 836 5766 836
#5 12 145 NA 12
#6 NA NA 9282 9282
#7 NA 567 NA 567
#8 45 836 5766 45
I have a large data.table (circa 900k rows) which can be represented by the following example:
row.id entity.id event.date result
1: 1 100 2015-01-20 NA
2: 2 101 2015-01-20 NA
3: 3 104 2015-01-20 NA
4: 4 107 2015-01-20 NA
5: 5 103 2015-01-23 NA
6: 6 109 2015-01-23 NA
7: 7 102 2015-01-23 NA
8: 8 101 2015-01-26 NA
9: 9 110 2015-01-26 NA
10: 10 112 2015-01-26 NA
11: 11 109 2015-01-26 NA
12: 12 130 2015-01-29 NA
13: 13 100 2015-01-29 NA
14: 14 127 2015-01-29 NA
15: 15 101 2015-01-29 NA
16: 16 119 2015-01-29 NA
17: 17 104 2015-02-03 NA
18: 18 101 2015-02-03 NA
19: 19 125 2015-02-03 NA
20: 20 130 2015-02-03 NA
Essentially I have columns containing: the ID representing the entity in question (entity.id); the date of an event in which this ID partook (note that many, and differing numbers of, entities will participate in each event). I need to calculate a factor that, for each entity.id on each event date, depends (non-linearly) on the time (in days) that has elapsed since all the previous events in which that entity ID was entered.
To put it in other, more programmatic terms, on each row of the data.table I need to find all instances with matching ID and where the date is older than the event date of the row in question, work out the difference in time (in days) between the ‘current’ and historical events, and sum some non-linear function applied to each of the time periods (I’ll use the square in this example).
In the example above, for entity.id = 101 on 2015-02-03 (row 18), we would need to look back to that ID's prior entries on rows 15, 8 and 2, calculate the differences in days from the 'current' event (5, 8 and 14 days), and then calculate the answer by summing the squares of those periods (5^2 + 8^2 + 14^2) = 25 + 64 + 196 = 285. (The real function is somewhat more complex, but this is sufficiently representative.)
This is trivial to achieve with for-loops, as per below:
# Create sample dt
dt <- data.table(row.id = 1:20,
entity.id = c(100, 101, 104, 107, 103, 109, 102, 101, 110, 112,
109, 130, 100, 127, 101, 119, 104, 101, 125, 130),
event.date = as.Date(c("2015-01-20", "2015-01-20", "2015-01-20", "2015-01-20",
"2015-01-23", "2015-01-23", "2015-01-23",
"2015-01-26", "2015-01-26", "2015-01-26", "2015-01-26",
"2015-01-29", "2015-01-29", "2015-01-29", "2015-01-29", "2015-01-29",
"2015-02-03", "2015-02-03", "2015-02-03", "2015-02-03")),
result = NA)
setkey(dt, row.id)
for (i in 1:nrow(dt)) { #loop through each entry
# get a subset of dt comprised of rows with this row's entity.id, which occur prior to this row
event.history <- dt[row.id < i & entity.id == entity.id[i]]
# calc the sum of the differences between the current row event date and the prior events dates, contained within event.history, squared
dt$result[i] <- sum( (as.numeric(dt$event.date[i]) - as.numeric(event.history$event.date)) ^2 )
}
Unfortunately, on the real dataset it is also extremely slow, no doubt because of the number of subsetting operations required. Is there a way to vectorise, or otherwise speed up, this operation? I've searched and searched and racked my brains, but I can't work out how to subset rows based on criteria that differ per row without looping.
Note that I created a row.id column to allow me to extract all prior rows (rather than prior dates), as the two are broadly equivalent (an entity cannot attend more than one event a day) and this way was much quicker. I think that is because it avoids coercing the dates to numeric before doing the comparison, i.e. dt[as.numeric(event.date) < as.numeric(event.date[i])].
Note also that I’m not wedded to it being a data.table; I’m happy to use dplyr or other mechanisms to achieve this if need be.
I think this can be achieved using a self-join with appropriate non-equi join criteria:
dt[, result2 := dt[
dt,
on=c("entity.id","event.date<event.date"),
sum(as.numeric(x.event.date - i.event.date)^2), by=.EACHI]$V1
]
dt
This gives a result which matches your output from the loop, with the exception of the NA values:
# row.id entity.id event.date result result2
# 1: 1 100 2015-01-20 0 NA
# 2: 2 101 2015-01-20 0 NA
# 3: 3 104 2015-01-20 0 NA
# 4: 4 107 2015-01-20 0 NA
# 5: 5 103 2015-01-23 0 NA
# 6: 6 109 2015-01-23 0 NA
# 7: 7 102 2015-01-23 0 NA
# 8: 8 101 2015-01-26 36 36
# 9: 9 110 2015-01-26 0 NA
#10: 10 112 2015-01-26 0 NA
#11: 11 109 2015-01-26 9 9
#12: 12 130 2015-01-29 0 NA
#13: 13 100 2015-01-29 81 81
#14: 14 127 2015-01-29 0 NA
#15: 15 101 2015-01-29 90 90
#16: 16 119 2015-01-29 0 NA
#17: 17 104 2015-02-03 196 196
#18: 18 101 2015-02-03 285 285
#19: 19 125 2015-02-03 0 NA
#20: 20 130 2015-02-03 25 25
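If you would rather have 0 than NA for rows with no history (to match the loop output exactly), one extra line fixes them up:
dt[is.na(result2), result2 := 0]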
I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold over the last three months for each ID. If it is month 3, it should consider the average sold over the last two months for that ID; if it is month 2, the average sold over one month; and so on for all months.
I have used the ifelse and mean functions for this, but some rows are missing when I try to use it for all months.
The query I used:
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
To add to the logic of the query: it should compare against 70% of the previous months' average sold value, and if the current value is higher than that average it should return 1, else 0.
The expected output is not clear. Based on the description of calculating the average 'sold' for each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'id'. If a group has more than one row, we concatenate the cumulative averages of the first k-1 observations with the roll_mean of window n = 3; otherwise, for a single observation, we take the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if(.N>1) c(cumsum(sold[1:(k-1)])/1:(k-1),
roll_mean(sold,n=k, align='right')) else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
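For reference, the same grouped "average of up to the last 3 values" can be written with data.table's own frollmean and an adaptive window (a sketch; requires data.table >= 1.12.0, and no RcppRoll):
library(data.table)
setDT(df1)[, soldAvg := frollmean(sold, n = pmin(seq_len(.N), 3L),
                                  adaptive = TRUE), by = id]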
The question above can also be solved using library(dplyr), with this query producing the output:
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
See the linked answer for the full solution and output.
I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use the built-in ChickWeight data, where "Chick-Diet" is the equivalent of your "Lat-Long", and "Time" the equivalent of your date:
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need something like Lat + Long ~ Month + Day for your formula.
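Applied to your data, that would look something like this (a sketch, assuming your data.table is named dt):
library(data.table)
# one row per Lat/Long location, one column per Month_Day combination
wide <- dcast.data.table(dt, Lat + Long ~ Month + Day, value.var = "Temperature")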
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[, -c(1:2, 6)], date, Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
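Note that spread() has since been superseded in tidyr; an equivalent call with pivot_wider() would be (a sketch):
library(tidyr)
pivot_wider(df[, c("Lat", "Long", "date", "Temperature")],
            names_from = date, values_from = Temperature)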