R: getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time in 10-hour intervals and the ID that corresponds to each of those hours. I decided to use this code: data <- data2[seq(0, nrow(data2), by=5), ], but instead of the Time being in 10-hour intervals, the ID number is at 10-unit intervals, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9

You can use the %% (modulo) operator.
data[data$Time %% 10 == 0, ]
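Note that in the sample data no Time value is an exact multiple of 10, so that filter would return zero rows there. Judging by the expected output, the goal seems to be the Id whose Time is closest to each 10-hour mark; here is a small sketch of that reading (my own assumption, using columns named Id and Time on a frame sorted by Time):
marks <- seq(10, 40, by = 10)
# for each mark, find the row whose Time is nearest to it
nearest <- sapply(marks, function(m) which.min(abs(data$Time - m)))
data.frame(Time = marks, Id = data$Id[nearest])
On the data shown above this reproduces the expected Ids 5, 10, 16, 22.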

I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=TRUE, ncol=2))
names(v) <- c('Id', 'Time')  # the matrix has no column names, so set them here
tens <- seq(from=0, by=10, to=50)  # stop at 50 so table() doesn't add empty bins beyond the data
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Id'        # the cumulative count is the Id of the last row in each interval
v2$Time <- rownames(v2)  # the interval's upper bound in hours
rownames(v2) <- 1:nrow(v2)
v2 <- v2[, c('Time', 'Id')]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
  Time Id
1   10  5
2   20 10
3   30 15
4   40 21
5   50 25

Related

How to transfer a column from a dataset sharing the same one with another one

I have two versions of datasets sharing the same columns (more or less). Let's take as an example
db = airquality
db1 = airquality[,-c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24
db
db1
I would like to transfer the two columns 'Ozone' and 'Wind' from the 'db1' dataset to the 'db' dataset, using the pipe operator %>% or another iterative method. Which code would you suggest?
Thanks
You can do:
library(dplyr)
db1 %>%
  select(Ozone, Wind) %>%
  bind_cols(db)
Note that in this example, since some column names will be duplicated in the final result, dplyr will automatically rename the duplicates by appending numbers to the end of the column names.
Base R:
cbind(db, db1[,c(1,3)])
Ozone Solar.R Wind Temp Month Day Ozone Wind
1 41 190 7.4 67 5 1 41 7.4
2 36 118 8.0 72 5 2 36 8.0
3 12 149 12.6 74 5 3 24 12.6
4 18 313 11.5 62 5 4 24 11.5
5 NA NA 14.3 56 5 5 NA 14.3
6 28 NA 14.9 66 5 6 24 14.9
7 23 299 8.6 65 5 7 24 8.6
8 19 99 13.8 59 5 8 24 13.8
9 8 19 20.1 61 5 9 24 20.1
10 NA 194 8.6 69 5 10 NA 8.6
11 7 NA 6.9 74 5 11 24 6.9
12 16 256 9.7 69 5 12 24 9.7
.
.
.
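If the duplicated Ozone and Wind names are a problem, one small variation (my own sketch, not from the answer) is to rename the transferred columns while selecting them:
library(dplyr)
db1 %>%
  select(Ozone_db1 = Ozone, Wind_db1 = Wind) %>%
  bind_cols(db)
That way the combined data frame keeps all of db's columns under their original names and adds the two db1 columns under distinct names.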

How to sample data non-randomly

I have a weather dataset; my data is date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (which is maybe a total of 10-15 observations); my data size is around 200.
I will be using decision tree/RF, SVM and NN to make my prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% training data and 20% test data, but I want to take the data in its original order, not randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200,]
dataset <- dataset[order(dataset$Date),]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset)*.8)
test<-dataset[sample_data,] # 30%
train<-dataset[-sample_data,] # 80%
output
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I use mtcars as an example. One option for splitting your data non-randomly into train and test sets is to first compute a sample size based on the number of rows in your data. After that you can use split() to cut the data at exactly 80% of the rows, using the following code:
smp_size <- floor(0.80 * nrow(mtcars))
split <- split(mtcars, rep(1:2, each = smp_size))
With the following code you can turn the split into train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split into train and test without losing its order.
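Another way to do the same thing (my own sketch, not part of the original answer) is plain row indexing, which makes the 80/20 cut explicit and also preserves the original order:
smp_size <- floor(0.80 * nrow(mtcars))
train <- mtcars[seq_len(smp_size), ]           # first 80% of the rows, original order kept
test  <- mtcars[(smp_size + 1):nrow(mtcars), ] # remaining 20%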

purrr: multiply index data frame with data frame

Thank you all for reading this problem.
What I would like to do is multiply my test data with my index file while matching columns.
So: multiply Dp_water with Dp_water, iterating over all index variables kcal, fat, prot, carbs.
In my test data I have, for 10 individuals, data on consumption of 4 food groups in grams.
For each individual I would like to calculate the kcal, fat, prot and carb intake.
For each individual I would like to make new variables:
Dp_water_kcal, Dp_coffee_kcal, Dp_soup_kcal, Dp_soda_kcal
Dp_water_fat, Dp_coffee_fat, Dp_soup_fat, Dp_soda_fat
etc.
library(tidyverse)
Sample data
Index file
index <- data.frame(Variable=c("Dp_water","Dp_coffee","Dp_soup","Dp_soda"),
                    kcal=c(0,10,20,40),
                    fat=c(0,5,10,15),
                    prot=c(2,4,6,8),
                    carbs=c(3,6,9,12))
index <- index %>%
  pivot_longer(c(kcal, fat, prot, carbs)) %>%
  pivot_wider(names_from = Variable, values_from = value)
> index
# A tibble: 4 x 5
name Dp_water Dp_coffee Dp_soup Dp_soda
<chr> <dbl> <dbl> <dbl> <dbl>
1 kcal 0 10 20 40
2 fat 0 5 10 15
3 prot 2 4 6 8
4 carbs 3 6 9 12
Below is the subjects' consumption data for the 4 food groups.
test_data <- data.frame(Dp_water=c(11:20),
                        Dp_coffee=c(31:40),
                        Dp_soup=c(21:30),
                        Dp_soda=c(41:50),
                        id=1:10)
Dp_water Dp_coffee Dp_soup Dp_soda id
1 11 31 21 41 1
2 12 32 22 42 2
3 13 33 23 43 3
4 14 34 24 44 4
5 15 35 25 45 5
6 16 36 26 46 6
7 17 37 27 47 7
8 18 38 28 48 8
9 19 39 29 49 9
10 20 40 30 50 10
If I do the following it works, but I would like to do this for all variables and not only kcal. And I would like to be able to keep the id column.
test_data %>%
  select(-id) %>%
  map2_dfr(., test_data[match(names(.), names(test_data))], ~ .x/100 * .y) %>%
  set_names(paste0(names(.), "_kcal"))
# A tibble: 10 x 4
Dp_water_kcal Dp_coffee_kcal Dp_soup_kcal Dp_soda_kcal
<dbl> <dbl> <dbl> <dbl>
1 1.21 9.61 4.41 16.8
2 1.44 10.2 4.84 17.6
3 1.69 10.9 5.29 18.5
4 1.96 11.6 5.76 19.4
5 2.25 12.2 6.25 20.2
6 2.56 13.0 6.76 21.2
7 2.89 13.7 7.29 22.1
8 3.24 14.4 7.84 23.0
9 3.61 15.2 8.41 24.0
10 4 16 9 25
Thank you all for any help!
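No answer is recorded above, but one possible sketch (my own, not from the thread) reshapes both tables to long format, joins the nutrient factors, applies grams / 100 * factor as in your kcal example, and pivots back to wide. This covers all four nutrients and keeps the id column:
library(tidyverse)
index_long <- index %>%
  pivot_longer(-name, names_to = "food", values_to = "per100")
test_data %>%
  pivot_longer(-id, names_to = "food", values_to = "grams") %>%
  left_join(index_long, by = "food") %>%
  mutate(value = grams / 100 * per100,
         nutrient = paste(food, name, sep = "_")) %>%
  select(id, nutrient, value) %>%
  pivot_wider(names_from = nutrient, values_from = value)
This yields one row per individual with columns such as Dp_water_kcal, Dp_water_fat, and so on.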

Slide along data frame rows and compare rows with next rows

I guess something similar has probably been asked before; however, I could only find answers for Python and SQL. So please notify me in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value over n positions and subtracting it from the average value of the next n positions; let's say n = 5 for now.
What I tried
I used the method below; however, when I apply it to a bigger data frame it takes a huge amount of time, so I wonder if there is a faster way.
calc <- function(pos) {
  this.five <- df %>% slice(pos:(pos + 4))
  next.five <- df %>% slice((pos + 5):(pos + 9))
  differ <- mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}
df %>%
  group_by(position) %>%
  do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
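A slightly shorter variant of the same idea (my own sketch, not part of the answer above) uses data.table::shift() instead of indexing with .I:
library(data.table)
setDT(df)
df[, rollmean := frollmean(value, n = 5, align = "left")]
df[, result := rollmean - shift(rollmean, n = 5, type = "lead")]
This gives the same result column: each row's 5-value average minus the average of the following 5 values.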

from ffdf to regular dataframe

Is there a way to transform an ffdf into a normal data frame?
Assuming the object is small enough to fit in RAM.
For example:
library(ff)
library(ffbase)
data(trees)
Girth <- ff(trees$Girth)
Height <- ff(trees$Height)
Volume <- ff(trees$Volume)
aktiv <- ff(as.factor(sample(0:1,31,replace=T)))
#Create data frame with some added parameters.
data <- ffdf(Girth=Girth,Height=Height,Volume=Volume,aktiv=aktiv)
rm(Girth,Height,Volume,trees,aktiv)
aktiv <- subset.ffdf(data, data$aktiv== "1" )
and then convert aktiv to a data frame and save the RData.
(Sadly, the person awaiting the output doesn't want to learn how to work with the ff package, so I have no choice.)
Thanks
Just use as.data.frame:
aktiv <- subset(as.data.frame(data), aktiv == 1)
Girth Height Volume aktiv
2 8.6 65 10.3 1
7 11.0 66 15.6 1
9 11.1 80 22.6 1
12 11.4 76 21.0 1
13 11.4 76 21.4 1
15 12.0 75 19.1 1
17 12.9 85 33.8 1
20 13.8 64 24.9 1
21 14.0 78 34.5 1
23 14.5 74 36.3 1
26 17.3 81 55.4 1
27 17.5 82 55.7 1
28 17.9 80 58.3 1
31 20.6 87 77.0 1
From here you can easily use save or write.csv, e.g.:
save(aktiv, file="aktiv.RData")
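If, as in the question, you have already built the ffdf subset aktiv, you can also convert that object directly and then save it (a small sketch along the same lines):
aktiv_df <- as.data.frame(aktiv)  # ffdf -> ordinary data.frame; must fit in RAM
save(aktiv_df, file = "aktiv.RData")
# or, if a CSV is preferred: write.csv(aktiv_df, "aktiv.csv", row.names = FALSE)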
