I have three databases, and the sample IDs vary across them. I want to merge them into a single database that can contain multiple rows for the same ID, as long as the values differ.
This is what I have:
df1
  ID tstart
   1     12
   2      4

df2
  ID tstart
   2     40
   3     15

df3
  ID tstart
   2     80
   3     80
This is what I want:
  ID tstart
   1     12
   2      4
   2     40
   3     15
   3     80
Now I want to create a new variable. I currently have this:
  ID tstart tstop result1 result2
   1     12    20       5      NA
   2      4    40      10      NA
   2     40    80      NA      52
   3     15    80      68      NA
   3     80   100      NA      56
and I want a new variable so that I end up with this data frame:
  ID tstart tstop result
   1     12    20      5
   2      4    40     10
   2     40    80     52
   3     15    80     68
   3     80   100     56
We can use bind_rows to bind the datasets and then get the distinct rows:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  distinct()
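An equivalent in base R, shown only as a quick sketch (assuming df1, df2 and df3 are plain data frames with the same columns), stacks the frames and keeps the unique rows:
# stack the three data frames, then drop exact duplicate rows
out <- unique(rbind(df1, df2, df3))
out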
For the second case, if the input already has the 'tstart' and 'tstop' columns, we can coalesce the 'result' columns:
dfnew %>%
  mutate(result = coalesce(result1, result2), .keep = 'unused')
Output
ID tstart tstop result
1 1 12 20 5
2 2 4 40 10
3 2 40 80 52
4 3 15 80 68
5 3 80 100 56
data
dfnew <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L), tstart = c(12L, 4L,
40L, 15L, 80L), tstop = c(20L, 40L, 80L, 80L, 100L), result1 = c(5L,
10L, NA, 68L, NA), result2 = c(NA, NA, 52L, NA, 56L)),
class = "data.frame", row.names = c(NA,
-5L))
I am currently working in R, but I would also be able to tackle this problem in Stata, given some help.
I have two very large datasets. One contains households and their locations, and the other contains weather data by date and location. I ultimately need a dataset where each row is a household, and contains weather data matched to that household by location. In this dataset, each column would identify the date of that observation.
For the sake of simplicity, I created three sample data frames in R.
The first emulates my household data:
house.id location.id
1 10001 a
2 10002 b
3 10003 c
4 10004 c
5 10005 a
The second emulates my weather data:
date location.id temperature
1 2020-01-01 a 70
2 2020-01-01 b 71
3 2020-01-01 c 74
4 2020-01-02 a 61
5 2020-01-02 b 63
6 2020-01-02 c 61
7 2020-01-03 a 57
8 2020-01-03 b 50
9 2020-01-03 c 64
And the final one displays what my ultimate goal is:
house.id location.id 2020-01-01 2020-01-02 2020-01-03
1 10001 a 70 61 57
2 10002 b 71 63 50
3 10003 c 74 61 64
4 10004 c 74 61 64
5 10005 a 70 61 57
As you can see, each household pulled weather data from its location id and appended it using additional columns which are named for their date (which was grabbed from the second dataset).
Obviously I created this third dataset manually, otherwise I wouldn't be asking for code here. I need to figure out how to automate the generation of the third dataset from the first two so that I can carry out the process on two much larger datasets.
Any help would be very much appreciated!!
First you need to reshape to wide format. Using data.table, that would look like this:
library(data.table)
dd <- setDT(dd)
dd <- dcast(dd, location.id ~ date, value.var="temperature")
Or, using base R:
dd <- reshape(dd, direction = "wide", idvar = "location.id", timevar = "date")
Then you can merge:
m <- merge(d, dd, by="location.id", all.x = T)
location.id house.id 2020-01-01 2020-01-02 2020-01-03
1 a 10001 70 61 57
2 a 10005 70 61 57
3 b 10002 71 63 50
4 c 10003 74 61 64
5 c 10004 74 61 64
data:
d <- read.table(text = " house.id location.id
1 10001 a
2 10002 b
3 10003 c
4 10004 c
5 10005 a
",header=T)
dd <- read.table(text = " date location.id temperature
1 2020-01-01 a 70
2 2020-01-01 b 71
3 2020-01-01 c 74
4 2020-01-02 a 61
5 2020-01-02 b 63
6 2020-01-02 c 61
7 2020-01-03 a 57
8 2020-01-03 b 50
9 2020-01-03 c 64
",header=T )
Try it this way:
hh <- structure(list(house.id = 10001:10005, location.id = structure(c(1L,
2L, 3L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
temperature <- structure(list(date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L), .Label = c("01.01.2020", "02.01.2020", "03.01.2020"), class = "factor"),
location.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), temperature = c(70L,
71L, 74L, 61L, 63L, 61L, 57L, 50L, 64L)), class = "data.frame", row.names = c(NA,
-9L))
library(tidyverse)
temperature %>%
  left_join(hh) %>%
  pivot_wider(id_cols = c(house.id, location.id),
              names_from = date,
              values_from = temperature) %>%
  arrange(house.id)
Convert your weather data to the wide format and join it to the household data. That should do it:
library(tidyverse)
# set up the household dataset
household_data <- tribble(~"house.id", ~"location.id",
                          10001, "a",
                          10002, "b",
                          10003, "c",
                          10004, "c",
                          10005, "a")

# set up the weather dataset
weather_data <- tribble(~"date", ~"location.id", ~"temperature",
                        "2020-01-01", "a", 70,
                        "2020-01-01", "b", 71,
                        "2020-01-01", "c", 74,
                        "2020-01-02", "a", 61,
                        "2020-01-02", "b", 63,
                        "2020-01-02", "c", 61,
                        "2020-01-03", "a", 57,
                        "2020-01-03", "b", 50,
                        "2020-01-03", "c", 64)

household_data %>%
  full_join(weather_data %>%
              pivot_wider(names_from = "date",
                          values_from = "temperature"), # converts to wide format
            by = "location.id") # joins the two data frames
# A tibble: 5 x 5
house.id location.id `2020-01-01` `2020-01-02` `2020-01-03`
<dbl> <chr> <dbl> <dbl> <dbl>
1 10001 a 70 61 57
2 10002 b 71 63 50
3 10003 c 74 61 64
4 10004 c 74 61 64
5 10005 a 70 61 57
I don't know how to do it in Stata however!
Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data end up within the same columns. Please note that the two datasets do not have identical columns (dataset 1 has 1000 columns of metadata, while dataset 2 just has study_id, timepoint, and concentration columns that match the concentration columns in dataset 1).
Thanks so much in advance.
Using coalesce (from the dplyr package) is one option. The join still adds the two concentration columns from the second data frame; these are removed after the NAs have been filled in.
library(tidyverse)
df1 %>%
  left_join(df2, by = c("study_id", "timepoint")) %>%
  mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
         concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
  select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
  left_join(df2, by = c("study_id", "timepoint")) %>%
  split.default(str_remove(names(.), "\\.x|\\.y")) %>%
  map_df(reduce, coalesce)
Edit: to prevent the resulting column names from being alphabetized by split.default, you can add an intermediate step that reorders the list based on the first data frame's column-name order.
df3 <- df1 %>%
  left_join(df2, by = c("study_id", "timepoint")) %>%
  split.default(str_remove(names(.), "\\.x|\\.y"))

df3[names(df1)] %>%
  map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
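If you are on dplyr 1.0.0 or later, dplyr::rows_patch() is another option worth knowing about: it overwrites only the NA cells of the main data frame with values from a keyed lookup table, so no .x/.y columns appear at all. A sketch, assuming df1 and df2 as above:
library(dplyr)
# fill the NA concentrations in df1 with the matching values from df2
df1 %>%
  rows_patch(df2, by = c("study_id", "timepoint"))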
I have 3 columns a, b, c and I want to combine them into a new column based on the column mode, as follows:
if mode == 1, take the value from a
if mode == 2, take the value from b
if mode == 3, take the value from c
Example:
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index ('mode') are cbinded to create a two-column matrix, which is used to extract the corresponding values from the subset of the dataset:
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
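To see what the indexing matrix looks like, you can print it on its own; a small illustration, assuming the df1 above:
# each row is a (row, column) pair that picks one cell out of df1[2:4]
cbind(seq_len(nrow(df1)), df1$mode)
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    1
# [3,]    3    3
# [4,]    4    2
# [5,]    5    1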
Another solution in base R works by converting "mode" to letters and then extracting the values from the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. A nested if_else():
library(dplyr)
df1 %>%
  mutate(combine = if_else(mode == 1, a,
                           if_else(mode == 2, b, c)))
And case_when():
df1 %>%
  mutate(combine = case_when(mode == 1 ~ a,
                             mode == 2 ~ b,
                             mode == 3 ~ c))
I have a data table that looks like the following:
Item 2018 2019 2020 2021 2022 2023
Apples 10 12 17 18 0 0
Bears 40 50 60 70 80 90
Cats 5 2 1 0 0 0
Dogs 15 17 18 15 11 0
I want a column showing a count of the number of years with non-zero sales. That is:
Item 2018 2019 2020 2021 2022 2023 Count
Apples 10 12 17 18 0 0 4
Bears 40 50 60 70 80 90 6
Cats 5 2 1 0 0 0 3
Dogs 15 17 18 15 11 0 5
NB I'll want to do some analysis on this in the next pass, so I'm looking to just add the count column and not aggregate at this stage. The follow-up will be something like filtering the rows where the count is greater than a threshold.
I looked at the tally() command from tidyverse, but this doesn't seem to do what I want (I think).
NB I haven't tagged this question as tidyverse due to the guidance on that tag. Shout if I need to edit this point.
As this is a rowwise operation, we can use rowSums after converting the subset of the dataset to logical:
library(tidyverse)
df1 %>%
  mutate(Count = rowSums(.[-1] > 0))
Or using reduce
df1 %>%
  mutate(Count = select(., -1) %>%
           mutate_all(funs(. > 0)) %>%
           reduce(`+`))
Or with pmap
df1 %>%
  mutate(Count = pmap_dbl(.[-1], ~ sum(c(...) > 0)))
# Item 2018 2019 2020 2021 2022 2023 Count
#1 Apples 10 12 17 18 0 0 4
#2 Bears 40 50 60 70 80 90 6
#3 Cats 5 2 1 0 0 0 3
#4 Dogs 15 17 18 15 11 0 5
data
df1 <- structure(list(Item = c("Apples", "Bears", "Cats", "Dogs"), `2018` = c(10L,
40L, 5L, 15L), `2019` = c(12L, 50L, 2L, 17L), `2020` = c(17L,
60L, 1L, 18L), `2021` = c(18L, 70L, 0L, 15L), `2022` = c(0L,
80L, 0L, 11L), `2023` = c(0L, 90L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
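On recent dplyr versions (1.0.0 or later), where funs() is deprecated, the same row-wise count can be written with across(); a sketch, assuming df1 as above:
library(dplyr)
# count, for each row, how many of the year columns are non-zero
df1 %>%
  mutate(Count = rowSums(across(-Item) > 0))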
I would like to create a subset of the data that consists of Units that have a higher Score in QTR 4 than in QTR 1 (an upward trend). It doesn't matter whether QTR 2 or 3 is present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the data frame is named "d", then this succeeds on your test set:
keep <- sapply(split(d, d$Unit), function(dd) {
  s4 <- dd$Score[dd$QTR == 4]
  s1 <- dd$Score[dd$QTR == 1]
  length(s4) == 1 && length(s1) == 1 && s4 > s1   # FALSE if a quarter is missing
})
d[d$Unit %in% names(keep)[keep], ]
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps:
result <- unlist(
  by(test, test$Unit,
     function(x) x$Score[x$QTR == 4] > x$Score[x$QTR == 1])
)
test[test$Unit %in% names(result[result == TRUE]), ]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (Probably there are better versions than what I have at the moment).
Note: Assuming a QTR value for a given Unit is unique
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
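A dplyr variant of the same idea, offered only as a sketch (assuming the data frame is named df, as in the dput above): group by Unit and keep the groups where both quarters are present and QTR 4 beats QTR 1.
library(dplyr)
df %>%
  group_by(Unit) %>%
  filter(any(QTR == 1) && any(QTR == 4) &&
           Score[QTR == 4] > Score[QTR == 1]) %>%
  ungroup()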