I am new to R. Your help here will be appreciated.
I have inputs such as:
columnA <- 14 # USERINPUT
columnB <- 1 # incremented: 1, 2, 3, ...
columnC <- columnA * columnB
columnD <- 25 # remains constant
columnE <- columnC / columnD
columnF <- 8 # remains constant
columnG <- columnE + columnF
mydf <- data.frame(columnA,columnB,columnC,columnD,columnE,columnF,columnG)
Based on the above data frame, I need to create a data frame such that in every subsequent row the value in columnB is incremented (1, 2, 3, ...), and we stop creating rows once the value in columnG would go above 600. I tried to do this in Excel. Below is the kind of output I would need.
+---------+---------+---------+---------+---------+---------+---------+
| columnA | columnB | columnC | columnD | columnE | columnF | columnG |
+---------+---------+---------+---------+---------+---------+---------+
|      14 |       1 |      14 |      25 |    0.56 |       8 |    8.56 |
|      14 |       2 |      28 |      25 |    1.12 |       8 |    9.12 |
|      14 |       3 |      42 |      25 |    1.68 |       8 |    9.68 |
|      14 |       4 |      56 |      25 |    2.24 |       8 |   10.24 |
|      14 |       5 |      70 |      25 |    2.80 |       8 |   10.80 |
|      14 |       6 |      84 |      25 |    3.36 |       8 |   11.36 |
|      14 |       7 |      98 |      25 |    3.92 |       8 |   11.92 |
|      14 |       8 |     112 |      25 |    4.48 |       8 |   12.48 |
+---------+---------+---------+---------+---------+---------+---------+
The end result should be a data frame
First you can compute the length of the data.frame:
userinput <- 14
N <- (600 - 8) * 25 / userinput
Then, using dplyr, you create the data.frame (tibble() replaces the now-deprecated data_frame()):
library(dplyr)
mydf <- tibble(ColA = 14, ColB = 1:floor(N), ColD = 25, ColF = 8) %>%
  mutate(ColC = ColA * ColB, ColE = ColC / ColD, ColG = ColE + ColF)
If you need the columns in the correct order:
> mydf <- mydf %>% select(ColA, ColB, ColC, ColD, ColE, ColF, ColG)
> mydf
      ColA ColB  ColC ColD   ColE ColF   ColG
   1:   14    1    14   25   0.56    8   8.56
   2:   14    2    28   25   1.12    8   9.12
   3:   14    3    42   25   1.68    8   9.68
   4:   14    4    56   25   2.24    8  10.24
   5:   14    5    70   25   2.80    8  10.80
  ---
1053:   14 1053 14742   25 589.68    8 597.68
1054:   14 1054 14756   25 590.24    8 598.24
1055:   14 1055 14770   25 590.80    8 598.80
1056:   14 1056 14784   25 591.36    8 599.36
1057:   14 1057 14798   25 591.92    8 599.92
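For completeness, the same frame can also be built in base R without dplyr; this is a sketch under the same assumed inputs (A = 14, D = 25, F = 8, cap of 600):

```r
# Base-R sketch of the same construction
columnA <- 14; columnD <- 25; columnF <- 8
N <- floor((600 - columnF) * columnD / columnA)  # largest B with G <= 600

mydf <- data.frame(columnA = columnA, columnB = seq_len(N),
                   columnD = columnD, columnF = columnF)
mydf$columnC <- mydf$columnA * mydf$columnB
mydf$columnE <- mydf$columnC / mydf$columnD
mydf$columnG <- mydf$columnE + mydf$columnF

# reorder to match the question's column layout
mydf <- mydf[, paste0("column", LETTERS[1:7])]
```

With these inputs this yields 1057 rows, the last with columnG = 599.92.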
I want to select a certain amount of rows randomly while the first and last samples are always selected.
Suppose I have a row of numbers df as
| A | B |
| -------- | -------------- |
| 1 | 10 |
| 2 | 158 |
| 3 | 106 |
| 4 | 155 |
| 5 | 130 |
| 6 | 154 |
| 7 | 160 |
| 8 | 157 |
| 9 | 140 |
| 10 | 158 |
| 11 | 210 |
| 12 | 157 |
| 13 | 140 |
| 14 | 156 |
| 15 | 160 |
| 16 | 135 |
| 17 | 102 |
| 18 | 150 |
| 19 | 120 |
| 20 | 12 |
From the table, I want to randomly select 5 rows, but rows 1 and 20 must always be selected; the remaining 3 rows can be any of the others.
Right now I'm doing the following, but I don't know if there is a way to do it the way I want.
n <- 5
shuffled <- df[sample(1:nrow(df)), ]  # shuffles the entire data frame
extracted <- shuffled[1:n, ]          # extracts the top 5 rows from the shuffled sample
I need to do this because I will further analyze the results.
library(dplyr)
shuffle <- function(df, size, fixed = integer()) {
  i <- seq_len(nrow(df))
  ii <- if (length(fixed)) i[-fixed] else i  # guard: i[-integer(0)] would select nothing
  df[c(fixed, sample(ii, size)), ]
}
shuffle(starwars, 3, fixed = c(1, 2))
#> # A tibble: 5 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 Luminara Un… 170 56.2 black yellow blue 58 fema… femin… Mirial
#> 4 R4-P17 96 NA none silver… red, b… NA none femin… <NA>
#> 5 Mon Mothma 150 NA auburn fair blue 48 fema… femin… Chandr…
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
#> # ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `colnames()` to see all variable names
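Applied to the asker's 20-row table, the same idea keeps the first and last rows fixed and samples 3 more. `sample_fixed` below is a hypothetical name for the same helper, including the guard for an empty `fixed` vector:

```r
set.seed(42)
df <- data.frame(A = 1:20,
                 B = c(10, 158, 106, 155, 130, 154, 160, 157, 140, 158,
                       210, 157, 140, 156, 160, 135, 102, 150, 120, 12))

# hypothetical variant of the answer's helper, guarded for empty `fixed`
sample_fixed <- function(df, size, fixed = integer()) {
  i <- seq_len(nrow(df))
  pool <- if (length(fixed)) i[-fixed] else i  # i[-integer(0)] would select nothing
  df[c(fixed, sample(pool, size)), ]
}

picked <- sample_fixed(df, 3, fixed = c(1, nrow(df)))
picked  # 5 rows: rows 1 and 20, plus 3 random others
```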
I need to merge two data frames, probably with left_join(), offset the joined observation by a specific amount, and add it to a new column. The purpose is preparing a time-series analysis, hence the different shifts in calendar weeks. I would like to stay in the tidyverse.
I read a few posts combining left_join() and lag(), but that's beyond my current capability.
MWE
library(tidyverse)
set.seed(1234)
df1 <- data.frame(
  Week1 = sample(paste("2015", 20:40, sep = "."), 10, replace = FALSE),
  Qty = as.numeric(sample(1:10)))
df2 <- data.frame(
  Week2 = paste0("2015.", 1:52),
  Value = as.numeric(sample(1:52)))
df1 %>%
  left_join(df2, by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
Current output
+----+---------+-------+-------+
| | Week1 | Qty | Lag_0 |
+====+=========+=======+=======+
| 1 | 2015.35 | 6.00 | 50.00 |
+----+---------+-------+-------+
| 2 | 2015.24 | 10.00 | 26.00 |
+----+---------+-------+-------+
| 3 | 2015.31 | 7.00 | 43.00 |
+----+---------+-------+-------+
| 4 | 2015.34 | 9.00 | 42.00 |
+----+---------+-------+-------+
| 5 | 2015.28 | 4.00 | 10.00 |
+----+---------+-------+-------+
| 6 | 2015.39 | 8.00 | 24.00 |
+----+---------+-------+-------+
| 7 | 2015.25 | 5.00 | 33.00 |
+----+---------+-------+-------+
| 8 | 2015.23 | 1.00 | 39.00 |
+----+---------+-------+-------+
| 9 | 2015.21 | 2.00 | 17.00 |
+----+---------+-------+-------+
| 10 | 2015.26 | 3.00 | 27.00 |
+----+---------+-------+-------+
It might be worthwhile pointing out that the target data frame does not hold the same number of week observations as the joining data frame.
Desired output
+----+---------+-------+-------+-------+-------+--------+
| | Week1 | Qty | Lag_0 | Lag_3 | Lag_6 | Lag_12 |
+====+=========+=======+=======+=======+=======+========+
| 1 | 2015.35 | 6.00 | 50.00 | 9.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 2 | 2015.24 | 10.00 | 26.00 | 17.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 3 | 2015.31 | 7.00 | 43.00 | 10.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 4 | 2015.34 | 9.00 | 42.00 | 43.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 5 | 2015.28 | 4.00 | 10.00 | 33.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 6 | 2015.39 | 8.00 | 24.00 | 13.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 7 | 2015.25 | 5.00 | 33.00 | 25.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 8 | 2015.23 | 1.00 | 39.00 | 38.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 9 | 2015.21 | 2.00 | 17.00 | 6.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 10 | 2015.26 | 3.00 | 27.00 | 39.00 | | |
+----+---------+-------+-------+-------+-------+--------+
Column Lag_3, which I added manually, contains the values from the matching df2 week value but offset by three rows. Lag_6 would be offset by six rows, etc.
I suppose the challenge is that the lag() would have to happen in the joining table after the matching but before returning the value.
Hope this makes sense and thanks for the assistance.
We just need to create the lags in the second data frame first, and then do the join:
library(dplyr)
df2 %>%
  mutate(Lag_3 = lag(Value, 3), Lag_6 = lag(Value, 6)) %>%
  left_join(df1, ., by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
Output:
# Week1 Qty Lag_0 Lag_3 Lag_6
#1 2015.35 6 50 9 46
#2 2015.24 10 26 17 6
#3 2015.31 7 43 10 33
#4 2015.34 9 42 43 10
#5 2015.28 4 10 33 25
#6 2015.39 8 24 13 16
#7 2015.25 5 33 25 49
#8 2015.23 1 39 38 15
#9 2015.21 2 17 6 32
#10 2015.26 3 27 39 38
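If more offsets are needed (the desired output also lists Lag_12), the lag columns can be generated in a loop before the join; a sketch using the MWE data:

```r
library(dplyr)

# MWE data from the question
set.seed(1234)
df1 <- data.frame(
  Week1 = sample(paste("2015", 20:40, sep = "."), 10, replace = FALSE),
  Qty = as.numeric(sample(1:10)))
df2 <- data.frame(
  Week2 = paste0("2015.", 1:52),
  Value = as.numeric(sample(1:52)))

lags <- c(3, 6, 12)  # offsets taken from the desired output
df2_lagged <- df2
for (k in lags)
  df2_lagged[[paste0("Lag_", k)]] <- dplyr::lag(df2$Value, k)

res <- df1 %>%
  left_join(df2_lagged, by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
```

Each Lag_k column holds the df2 value from k rows earlier than the matched week.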
I would like to query a database like this:
table person
id name age weight
1 teste1 18 101
1 teste2 18 102
1 teste3 18 103
1 teste4 18 104
1 teste5 18 105
1 teste6 18 106
2 teste7 18 91
2 teste8 18 92
2 teste9 18 93
2 teste9 18 94
2 teste1 18 95
2 teste2 18 96
3 teste3 18 87
3 teste3 18 88
3 teste3 18 89
3 teste3 18 81
3 teste3 18 82
3 teste3 18 83
3 teste3 18 84
3 teste3 18 85
and the result should be the 3 highest weights of each id, like this:
id name age weight
1 teste6 18 106
1 teste5 18 105
1 teste4 18 104
2 teste2 18 96
2 teste1 18 95
2 teste9 18 94
3 teste3 18 89
3 teste3 18 88
3 teste3 18 87
Can someone help me? Best regards.
With ROW_NUMBER() window function:
select t.id, t.name, t.age, t.weight
from (
select *, row_number() over (partition by id order by weight desc) rn
from tablename
) t
where t.rn <= 3
order by t.id, t.weight desc
Without window functions you can use a correlated subquery in the WHERE clause:
select t.id, t.name, t.age, t.weight
from tablename t
where (select count(*) from tablename where id = t.id and weight >= t.weight) <= 3
order by t.id, t.weight desc;
Results:
| id | name | age | weight |
| --- | ------ | --- | ------ |
| 1 | teste6 | 18 | 106 |
| 1 | teste5 | 18 | 105 |
| 1 | teste4 | 18 | 104 |
| 2 | teste2 | 18 | 96 |
| 2 | teste1 | 18 | 95 |
| 2 | teste9 | 18 | 94 |
| 3 | teste3 | 18 | 89 |
| 3 | teste3 | 18 | 88 |
| 3 | teste3 | 18 | 87 |
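Since the rest of this page is in R: if the table were pulled into a data frame, the same top-3-per-id result could be sketched with dplyr's slice_max() (the data below is reconstructed from the question; this is an R alternative, not the SQL the answer uses):

```r
library(dplyr)

# data reconstructed from the question's person table
person <- data.frame(
  id = rep(1:3, times = c(6, 6, 8)),
  name = c(paste0("teste", 1:6),
           paste0("teste", c(7, 8, 9, 9, 1, 2)),
           rep("teste3", 8)),
  age = 18,
  weight = c(101:106, 91:96, c(87, 88, 89, 81, 82, 83, 84, 85)))

top3 <- person %>%
  group_by(id) %>%
  slice_max(weight, n = 3) %>%      # 3 highest weights per id
  arrange(id, desc(weight)) %>%
  ungroup()
```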
Say I have a raw dataset (already in a data frame, and I can convert it easily to xts.data.table with as.xts.data.table); the DF is like the following:
Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
-------------------------
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30
and so on (many more cities and many more days).
And I would like this to show both the current-day temperature and the day-over-day change from the previous day, together with the other info on the city (state, country). i.e., the new data frame should be something like (from the example above):
Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature| ChangeInDailyMin | ChangeInDailyMax | ChangeInDailyMedian
-------------------------
2018-02-03 | New York City | NY | US | 18 | 22 | 19 | 6 | -8 | 1
2018-02-03 | London | LDN | UK | 10 | 25 | 15 | -2 | 10 | 1
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 | 1 | 1 | -1
2018-02-03 | New York City | NY | US | ...
and so on. i.e., add 3 more columns to show the day over day change.
Note that in the data frame I may not have data every day; the change is defined as the temperature on day t minus the temperature on the most recent earlier date for which I have data.
I tried to use the shift function, but R was complaining about the := sign.
Is there any way in R I could get this to work?
Thanks!
You can use dplyr::mutate_at and the lubridate package to transform the data into the desired format. The data needs to be arranged by Date, and the difference between the current record and the previous one can be taken with dplyr::lag.
library(dplyr)
library(lubridate)
df %>%
  mutate_if(is.character, funs(trimws)) %>%  # trim any blank spaces
  mutate(Date = ymd(Date)) %>%               # convert to Date
  group_by(City, State, Country) %>%
  arrange(City, State, Country, Date) %>%    # order by date within group
  mutate_at(vars(starts_with("Daily")), funs(Change = . - lag(.))) %>%
  filter(!is.na(DailyMinTemperature_Change))
Result:
# # A tibble: 3 x 10
# # Groups: City, State, Country [3]
# Date City State Country DailyMinTemperature DailyMaxTemperature DailyMedianTemperature DailyMinTemperature_Change DailyMaxT~ DailyMed~
# <date> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 2018-02-03 London LDN UK 10.0 25.0 15 -2.00 10.0 1
# 2 2018-02-03 New York City NY US 18.0 22.0 19 6.00 - 8.00 1
# 3 2018-02-03 Singapore SG SG 28.0 32.0 29 1.00 1.00 -1
#
Data:
df <- read.table(text =
"Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30",
header = TRUE, stringsAsFactors = FALSE, sep = "|")
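On newer dplyr (>= 1.0), the superseded mutate_at()/funs() pair can be replaced by across(); a sketch with the same example data entered directly:

```r
library(dplyr)

# same six rows as the question, entered directly
df <- data.frame(
  Date = rep(c("2018-02-03", "2018-02-02"), each = 3),
  City = rep(c("New York City", "London", "Singapore"), 2),
  State = rep(c("NY", "LDN", "SG"), 2),
  Country = rep(c("US", "UK", "SG"), 2),
  DailyMinTemperature = c(18, 10, 28, 12, 12, 27),
  DailyMaxTemperature = c(22, 25, 32, 30, 15, 31),
  DailyMedianTemperature = c(19, 15, 29, 18, 14, 30))

res <- df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(City, State, Country) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(across(starts_with("Daily"),
                ~ .x - lag(.x),               # difference vs previous date
                .names = "{.col}_Change")) %>%
  ungroup()
```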
I have a dataframe like this
ID <- c(101,101,101,102,102,102,103,103,103)
Pt_A <- c(50,100,150,20,30,40,60,80,90)
df <- data.frame(ID,Pt_A)
+-----+------+
| ID | Pt_A |
+-----+------+
| 101 | 50 |
| 101 | 100 |
| 101 | 150 |
| 102 | 20 |
| 102 | 30 |
| 102 | 40 |
| 103 | 60 |
| 103 | 80 |
| 103 | 90 |
+-----+------+
I want to create 2 new columns with values calculated from Pt_A column.
df$Del_Pt_A <- nth row of Pt_A - 1st row of Pt_A, grouped by ID, where n = 1, 2, ..., n
df$Perc_Pt_A <- nth row of Del_Pt_A / 1st row of Pt_A, grouped by ID, where n = 1, 2, ..., n
Here is my desired output
+-----+------+----------+-----------+
| ID  | Pt_A | Del_Pt_A | Perc_Pt_A |
+-----+------+----------+-----------+
| 101 |   50 |        0 |         0 |
| 101 |  100 |       50 |       1.0 |
| 101 |  150 |      100 |       2.0 |
| 102 |   20 |        0 |         0 |
| 102 |   30 |       10 |       0.5 |
| 102 |   40 |       20 |       1.0 |
| 103 |   60 |        0 |         0 |
| 103 |   80 |       20 |       0.3 |
| 103 |   90 |       30 |       0.5 |
+-----+------+----------+-----------+
I currently get the desired result in MS Excel, but I want to learn to do it in R to make my work efficient. I came across packages like dplyr, plyr, and data.table, but I couldn't solve it using those. Could someone please help me figure out how to work through this.
Here's a data.table way:
library(data.table)
setDT(df)[, `:=`(
  del  = Pt_A - Pt_A[1],
  perc = Pt_A / Pt_A[1] - 1
), by = ID]
which gives
ID Pt_A del perc
1: 101 50 0 0.0000000
2: 101 100 50 1.0000000
3: 101 150 100 2.0000000
4: 102 20 0 0.0000000
5: 102 30 10 0.5000000
6: 102 40 20 1.0000000
7: 103 60 0 0.0000000
8: 103 80 20 0.3333333
9: 103 90 30 0.5000000
Here another option in base R:
cbind(df,
      do.call(rbind, by(df, df$ID,
                        function(x)
                          setNames(data.frame(x$Pt_A - x$Pt_A[1],
                                              x$Pt_A / x$Pt_A[1] - 1),
                                   c('Del_Pt_A', 'Perc_Pt_A')))))
# ID Pt_A Del_Pt_A Perc_Pt_A
# 101.1 101 50 0 0.0000000
# 101.2 101 100 50 1.0000000
# 101.3 101 150 100 2.0000000
# 102.1 102 20 0 0.0000000
# 102.2 102 30 10 0.5000000
# 102.3 102 40 20 1.0000000
# 103.1 103 60 0 0.0000000
# 103.2 103 80 20 0.3333333
# 103.3 103 90 30 0.5000000
I am using:
by to apply a function by group; the result is a list
do.call(rbind, list_by) to transform the list into a data.frame
cbind to add the result to the initial data.frame
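Since the asker also mentioned trying dplyr, the same two columns can be sketched with group_by() and first():

```r
library(dplyr)

df <- data.frame(ID = c(101, 101, 101, 102, 102, 102, 103, 103, 103),
                 Pt_A = c(50, 100, 150, 20, 30, 40, 60, 80, 90))

res <- df %>%
  group_by(ID) %>%
  mutate(Del_Pt_A  = Pt_A - first(Pt_A),       # change from the group's first row
         Perc_Pt_A = Pt_A / first(Pt_A) - 1) %>%  # relative change
  ungroup()
```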