Changing specific values in a data frame - r

I have the following data in a data frame; the rows are grouped in blocks of 4.
x y
1 1.495 0.0
2 1.500 30.0
3 2.500 30.0
4 2.505 0.0
5 8.495 0.0
6 8.500 30.0
7 9.500 30.0
8 9.505 0.0
9 10.495 0.0
10 10.500 30.0
11 11.500 30.0
12 11.505 0.0
13 16.495 0.0 ##From here
14 16.500 30.0
15 17.500 30.0
16 17.505 0.0
17 17.495 0.0
18 17.500 30.0
19 18.500 30.0
20 18.505 0.0 ## End here
21 19.495 0.0
22 19.500 30.0
23 20.500 30.0
24 20.505 0.0
25 23.495 0.0
26 23.500 30.0
27 24.500 30.0
28 24.505 0.0
.
.
.
I am trying to change the y-values of the rows whose groups overlap (according to their x-values). For example, rows 13 to 16 overlap with rows 17 to 20.
x-values of row 13-16: 16.495 16.500 -------- 17.500 17.505
x-values of row 17-20: ------------------ 17.495 17.500 ----------18.500 18.505
There is an overlap from 17.495 to 17.505.
I would like to change the "in between" rows into something like:
13 16.495 0.0 ##From here
14 16.500 30.0
15 17.500 30.0
16 17.505 30.0
17 17.495 30.0
18 17.500 30.0
19 18.500 30.0
20 18.505 0.0 ## End here
Any idea how to do this?

Looking at the sample data, it seems that you want to identify rows where the previous value in x is larger than the current value (row 17 here), and rows where the next value in x is smaller than the current value (row 16 here). So, I tried to get the row numbers for these rows in the following way. Note that your data is called mydf here, and that lag() and lead() are the dplyr functions, so dplyr must be loaded.
library(dplyr)
# Rows whose previous x is larger (17), then rows whose next x is smaller (16)
ind <- c(which(lag(mydf$x) > mydf$x), which(lead(mydf$x) < mydf$x))
# Overwrite two specific elements in y
mydf$y[ind] <- 30
Here is the result for the part you specified. I hope this will help you.
#13 16.495 0
#14 16.500 30
#15 17.500 30
#16 17.505 30
#17 17.495 30
#18 17.500 30
#19 18.500 30
#20 18.505 0
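If dplyr is not available, the same rows can probably be found in base R with diff(). The sketch below is only an illustration, assuming that an overlap always shows up as a drop in x from one row to the next; mydf is rebuilt here from rows 13 to 20 of the question, and the name drops is made up for the example.

```r
# Rows 13-20 of the question's data, rebuilt for illustration
mydf <- data.frame(
  x = c(16.495, 16.500, 17.500, 17.505, 17.495, 17.500, 18.500, 18.505),
  y = c(0, 30, 30, 0, 0, 30, 30, 0)
)

# diff(x) < 0 marks positions where x drops from one row to the next
drops <- which(diff(mydf$x) < 0)

# The row before each drop and the row after it both get y = 30
mydf$y[c(drops, drops + 1)] <- 30
mydf
```

With this data the only drop is between the 4th and 5th rows shown, which correspond to rows 16 and 17 of the full data frame.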

Using a for loop, you can do the following (assuming your data frame is called df and its number of rows is a multiple of 4):
# first and last row index of each group of 4
start <- seq(1, nrow(df), by = 4)
end <- seq(4, nrow(df), by = 4)
# compare each group with the next one and set y to 30 where they overlap
for (i in seq_len(length(start) - 1)) {
  if (max(df[start[i]:end[i], "x"]) > min(df[start[i + 1]:end[i + 1], "x"])) {
    df[end[i], "y"] <- 30.0
    df[start[i + 1], "y"] <- 30.0
  }
}
And you get the following dataframe:
> df
x y
1 1.495 0
2 1.500 30
3 2.500 30
4 2.505 0
5 8.495 0
6 8.500 30
7 9.500 30
8 9.505 0
9 10.495 0
10 10.500 30
11 11.500 30
12 11.505 0
13 16.495 0
14 16.500 30
15 17.500 30
16 17.505 30
17 17.495 30
18 17.500 30
19 18.500 30
20 18.505 0
21 19.495 0
22 19.500 30
23 20.500 30
24 20.505 0
25 23.495 0
26 23.500 30
27 24.500 30
28 24.505 0
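The same group-by-group comparison can also be written without an explicit loop. This is a minimal sketch, assuming the number of rows is a multiple of 4 and that only neighbouring groups can overlap; the helper names groups, group_max, group_min and k are made up for the example, and df is rebuilt from rows 9 to 20 of the question.

```r
# Rows 9-20 of the question's data (three groups of 4), rebuilt for illustration
df <- data.frame(
  x = c(10.495, 10.500, 11.500, 11.505,
        16.495, 16.500, 17.500, 17.505,
        17.495, 17.500, 18.500, 18.505),
  y = rep(c(0, 30, 30, 0), times = 3)
)

# One column per group of 4 rows (assumes nrow(df) is a multiple of 4)
groups <- matrix(df$x, nrow = 4)
group_max <- apply(groups, 2, max)
group_min <- apply(groups, 2, min)

# k: groups whose largest x exceeds the smallest x of the following group
n_groups <- ncol(groups)
k <- which(group_max[-n_groups] > group_min[-1])

# Last row of group k and first row of group k + 1
df$y[c(4 * k, 4 * k + 1)] <- 30
df
```

Here only the second group overlaps its successor, so the 8th and 9th rows shown (rows 16 and 17 of the full data) are set to 30.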

Related

How to transfer a column from a dataset sharing the same one with another one

I have two versions of datasets sharing the same columns (more or less). Let's take as an example
db = airquality
db1 = airquality[,-c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24
db
db1
I would like to transfer the two columns 'Ozone' and 'Wind' from the dataset 'db1' to the dataset 'db', using the pipe operator %>% or another method. What code would you suggest?
Thanks
You can do:
library(dplyr)
db1 %>%
select(Ozone, Wind) %>%
bind_cols(db)
Note that in this example, since some column names will be duplicated in the final result, dplyr will automatically rename the duplicates by appending numbers to the end of the column names.
Base R:
cbind(db, db1[,c(1,3)])
Ozone Solar.R Wind Temp Month Day Ozone Wind
1 41 190 7.4 67 5 1 41 7.4
2 36 118 8.0 72 5 2 36 8.0
3 12 149 12.6 74 5 3 24 12.6
4 18 313 11.5 62 5 4 24 11.5
5 NA NA 14.3 56 5 5 NA 14.3
6 28 NA 14.9 66 5 6 24 14.9
7 23 299 8.6 65 5 7 24 8.6
8 19 99 13.8 59 5 8 24 13.8
9 8 19 20.1 61 5 9 24 20.1
10 NA 194 8.6 69 5 10 NA 8.6
11 7 NA 6.9 74 5 11 24 6.9
12 16 256 9.7 69 5 12 24 9.7
.
.
.
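If the goal is for db to carry db1's versions of Ozone and Wind in place of its own, rather than alongside them (one possible reading of the question), the columns can be overwritten by name. This sketch assumes both data frames hold the same rows in the same order; the name cols is made up for the example.

```r
db  <- airquality
db1 <- airquality[, -6]
db1$Ozone[db1$Ozone < 30] <- 24

# Overwrite db's columns with the versions from db1, matched by name
cols <- c("Ozone", "Wind")
db[cols] <- db1[cols]

head(db)
```

Unlike cbind() or bind_cols(), this leaves db with its original six columns and no duplicated names.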

Slide along data frame rows and compare rows with next rows

I guess something similar should have been asked before, however I could only find an answer for python and SQL. So please notify me in the comments when this was also asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value for n positions and subtracting from it the average value of the next n positions; let's say n=5 for now.
What I tried
I currently use the method below, but when I apply it to a bigger data frame it takes a huge amount of time, so I wonder whether there is a faster method.
library(dplyr)
# difference between the mean of rows pos..pos+4 and the mean of rows pos+5..pos+9
calc <- function( pos ) {
  this.five <- df %>% slice(pos:(pos+4))
  next.five <- df %>% slice((pos+5):(pos+9))
  differ = mean(this.five$value)- mean(next.five$value)
  data.frame(dif= differ)
}
df %>%
  group_by(position) %>%
  do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
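For comparison, the same left-aligned rolling difference can be sketched in base R with prefix sums, which avoids slicing the data frame once per row. This is only an illustration, assuming the vector has at least n values; the values are copied from the table in the question, and the names cs, starts and len are made up for the example.

```r
value <- c(27, 37, 57, 89, 20, 86, 97, 62, 58, 6,
           19, 16, 61, 34, 67, 43, 88, 83, 32, 63)
n <- 5
len <- length(value)

# Prefix sums: cs[i + 1] is the sum of the first i values
cs <- c(0, cumsum(value))

# Left-aligned rolling mean; the last n - 1 positions have no full window
starts <- seq_len(len - n + 1)
rollmean <- c((cs[starts + n] - cs[starts]) / n, rep(NA, n - 1))

# Mean of this window minus the mean of the window starting n rows later
result <- rollmean - c(rollmean[-seq_len(n)], rep(NA, n))

data.frame(position = seq_len(len), value, rollmean, result)
```

As in the data.table output, positions without two complete windows come out as NA instead of being computed from partial windows.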

Check for nearest value in a column

Is there a way to check which value in a vector/column is nearest to a given value?
For example, I have a column with a number of days:
days: 50, 49, 59, 180, 170, 199, 200
I want to make a new column in the data frame that marks an X every time the days column has the value 183 or a value close to 183.
It should look like this:
DAYS new column
0
12
12
14
133
140 X
0
12
14
15
178
183 X
0
15
30
72
172 X
Hope you can help me!
You're essentially searching for the local peaks that come closest to your target. Start off by normalizing your data to your target, i.e. take the absolute difference from 183, and then search for the local minima of that distance (the values closest to zero). Those are the rows you want. I added data with values greater than your target to demonstrate.
df <- data.frame(DAYS = c(0,12,12,14,133,140,0,12,14,15,178,183,184,190,0,15,30,72,172,172.5))
df$localmin <- abs(df$DAYS - 183)
df
> df
DAYS localmin
1 0.0 183.0
2 12.0 171.0
3 12.0 171.0
4 14.0 169.0
5 133.0 50.0
6 140.0 43.0
7 0.0 183.0
8 12.0 171.0
9 14.0 169.0
10 15.0 168.0
11 178.0 5.0
12 183.0 0.0
13 184.0 1.0
14 190.0 7.0
15 0.0 183.0
16 15.0 168.0
17 30.0 153.0
18 72.0 111.0
19 172.0 11.0
20 172.5 10.5
targets <- which(diff(sign(diff(c(df$localmin, 183)))) == 2) + 1L
df$targets <- 0
df$targets[targets] <- 1
df
> df
DAYS localmin targets
1 0.0 183.0 0
2 12.0 171.0 0
3 12.0 171.0 0
4 14.0 169.0 0
5 133.0 50.0 0
6 140.0 43.0 1
7 0.0 183.0 0
8 12.0 171.0 0
9 14.0 169.0 0
10 15.0 168.0 0
11 178.0 5.0 0
12 183.0 0.0 1
13 184.0 1.0 0
14 190.0 7.0 0
15 0.0 183.0 0
16 15.0 168.0 0
17 30.0 153.0 0
18 72.0 111.0 0
19 172.0 11.0 0
20 172.5 10.5 1
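If the DAYS counter always restarts at 0, as it does in the question's example, another option is to pick the closest value within each run. This is a sketch under that assumption, not a general nearest-value search; the names run, distance and run_min are made up for the example, and the data are the question's own values.

```r
df <- data.frame(DAYS = c(0, 12, 12, 14, 133, 140,
                          0, 12, 14, 15, 178, 183,
                          0, 15, 30, 72, 172))
target <- 183

# Assumption: every run of the counter restarts at DAYS == 0
run <- cumsum(df$DAYS == 0)

# Distance to the target, and the smallest distance within each run
distance <- abs(df$DAYS - target)
run_min <- ave(distance, run, FUN = min)

# Mark the row(s) of each run that attain that smallest distance
df$mark <- ifelse(distance == run_min, "X", "")
df
```

Ties within a run are all marked, so a run with two equally close values would get two X entries.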

R program - getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10 hour intervals and to get the ID that corresponds to that particular hour. I decided to use this code: data <- data2[seq(0, nrow(data2), by=5), ] but instead of the Time being in 10 hr intervals, the ID number is at intervals of 10, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
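You can use the %% (mod) operator. Note that this only keeps rows whose Time is an exact multiple of 10; none of the Time values in the sample data above is, so it returns no rows there.
data[data$Time %% 10 == 0, ]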
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly... and name the columns
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=TRUE, ncol=2))
names(v) <- c('Id', 'Time')
tens <- seq(from=0, by=10, to=50) # 50 covers the largest Time (45.4)
v$cut <- cut(v$Time, tens, labels=tens[-1])
# cumulative count per interval = last Id at or below each upper bound
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Id'
v2$Time <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Time Id
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25
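One reading that does reproduce the expected output in the question is "for each multiple of 10, take the Id whose Time is nearest to it". The sketch below assumes that reading, which the question does not state explicitly; the column names Id and Time and the names targets, nearest and out are made up for the example.

```r
data2 <- data.frame(
  Id = 1:25,
  Time = c(5, 6.1, 7.2, 8.3, 9.6, 10.9, 13, 15.1, 17.2, 19.3, 21.4, 23.5,
           25.6, 27.1, 28.6, 30.1, 31.8, 33.5, 35.2, 36.9, 38.6, 40.3, 42,
           43.7, 45.4)
)

targets <- seq(10, 40, by = 10)

# For each target time, index of the row whose Time is closest to it
nearest <- vapply(targets,
                  function(t) which.min(abs(data2$Time - t)),
                  integer(1))

out <- data.frame(Time = targets, Id = data2$Id[nearest])
out
```

For the sample data this picks Ids 5, 10, 16 and 22, because 9.6, 19.3, 30.1 and 40.3 are the times closest to 10, 20, 30 and 40.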

Adding information on a graph using R

I would like to add some information on my graph which was plotted from this data set:
EDITED:
#data set:
day <- c(0:28)
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,27,29,23,20,15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,0.8,0.6,0.2,0.1,0.1)
pmort <- data.frame(day,ndied,pdied)
> pmort
day ndied pdied
1 0 342 19.1
2 1 335 18.7
3 2 240 13.4
4 3 122 6.8
5 4 74 4.1
6 5 64 3.6
7 6 49 2.7
8 7 60 3.3
9 8 51 2.8
10 9 44 2.5
11 10 35 2.0
12 11 48 2.7
13 12 41 2.3
14 13 34 1.9
15 14 38 2.1
16 15 27 1.5
17 16 29 1.6
18 17 23 1.3
19 18 20 1.1
20 19 15 0.8
21 20 20 1.1
22 21 16 0.9
23 22 17 0.9
24 23 17 0.9
25 24 14 0.8
26 25 10 0.6
27 26 4 0.2
28 27 1 0.1
29 28 2 0.1
I have put together this script and am still trying to improve it so that the rest of the information can be added:
> barplot(pmort$pdied,xlab="Age(days)",ylab="Percent",xlim=c(0,28),ylim=c(0,20),legend="Mortality")
I am trying to put the numbers 0 to 28 (age in days) on the x-axis but could not, although I suspect it only needs a simple script. Secondly, I would like to add the number died, ndied (342 to 2), below each day (0 to 28) along the x-axis.
Example:
0 1 2 3 4 5 and so on...
(N=342) (N=335) (N=240) (N=122) (N=74) (N=64)
Graph:
Any help would be appreciated.
Baz
I gave you two ways to plot the info: one above the bars and one below. You can tweak it to meet your needs.
# Option 1: counts just above each bar (barplot() returns the bar midpoints)
barX <- barplot(pmort$pdied,xlab="Age(days)",
ylab="Percent", names.arg=pmort$day,
xlim=c(0,28),ylim=c(0,20),legend.text="Mortality")
text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied, xpd=TRUE)
# Option 2: counts just below the baseline of the bars
barX <- barplot(pmort$pdied,xlab="Age(days)",
ylab="Percent", names.arg=pmort$day,
xlim=c(0,28),ylim=c(0,20),legend.text="Mortality")
text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE)
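To get closer to the "(N=342)" layout sketched in the question, the day and the count can also be combined into one two-line axis label per bar. This is a rough sketch, not a polished figure; the name labels is made up for the example, and the xlim and legend arguments from the question are left out here.

```r
day   <- 0:28
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,27,29,23,20,
           15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,2.3,1.9,2.1,
           1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,0.8,0.6,0.2,0.1,0.1)

# Two-line label per bar: the day, then the count in the question's (N=...) form
labels <- paste0(day, "\n(N=", ndied, ")")

# names.arg places one label under each bar; cex.names shrinks the text
barX <- barplot(pdied, names.arg = labels, cex.names = 0.5,
                xlab = "Age(days)", ylab = "Percent", ylim = c(0, 20))
```

On a narrow device R may skip labels that would overlap, so a wider plot window or a smaller cex.names may be needed to see all 29 of them.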

Resources