Suppose I have two datasets:
ds1
NO ID DOB ID2 count
1 4083 2007-10-01 3625 5
2 4408 2008-07-01 3603 2
3 4514 2007-07-01 3077 3
4 4396 2008-05-01 3413 5
5 4222 2003-12-01 3341 1
ds2
loc share
12 445
23 4
10 56
1 1
23 34
I want the "share" column of ds2 to be added to ds1, so that the result looks like this:
dsmerged
NO ID DOB ID2 count share
1 4083 2007-10-01 3625 5 445
2 4408 2008-07-01 3603 2 4
3 4514 2007-07-01 3077 3 56
4 4396 2008-05-01 3413 5 1
5 4222 2003-12-01 3341 1 34
I tried merge:
dsmerged <- merge(ds1[,c(1:5)],ds2[,c(2)])
But this duplicates the dataset (5*5 = 25 rows) while adding the "share" column. I obviously don't want those duplicated rows. Thank you.
If you know that the rows represent the same id then you can just cbind
ds3 <- cbind(ds1, share = ds2$share)
but it would be better if you had an id to join on.
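To illustrate why merge produced 25 rows and what a keyed join would look like, here is a minimal sketch. It uses a shortened copy of the question's data and assumes, purely for illustration, that the row position can serve as the key (the NO column added to ds2 is hypothetical and not part of the original data):

```r
# Shortened copies of the question's data (DOB and ID2 left out for brevity)
ds1 <- data.frame(NO = 1:5,
                  ID = c(4083, 4408, 4514, 4396, 4222),
                  count = c(5, 2, 3, 5, 1))
ds2 <- data.frame(loc = c(12, 23, 10, 1, 23),
                  share = c(445, 4, 56, 1, 34))

# With no column name in common, merge() falls back to a cross join:
# every row of ds1 is paired with every row of ds2
crossed <- merge(ds1, ds2["share"])
nrow(crossed)

# Assumption: rows correspond by position, so give ds2 the same key as ds1
ds2$NO <- seq_len(nrow(ds2))
dsmerged <- merge(ds1, ds2[c("NO", "share")], by = "NO")
dsmerged
```

With a shared key the join returns one row per match instead of every combination, and it no longer depends on both data frames being in the same order.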
Using dplyr
library(dplyr)
bind_cols(ds1, ds2['share'])
Or with data.table
setDT(ds1)[, share := ds2[["share"]]]
Currently, I have 3 datasets, each with 1368 rows of data points.
a <- sample(0:10000,1368, rep=TRUE)
Df <- data.frame(obs=c(1:1368),
var1=a)
df2<-data.frame(col1=Df$var1[1:90],
col2=Df$var1[91:180],
col3=Df$var1[181:270])
Dataset 1
col1 col2 col3
1 7878 8130 3924
2 5781 4375 6232
3 9324 9066 1734
4 9754 8796 2047
5 3462 4930 7381
6 7379 8103 3404
7 7355 5212 4505
dataset 2
col1 col2 col3
1 7878 8130 3924
2 5781 4375 6232
3 9324 9066 1734
4 9754 8796 2047
5 3462 4930 7381
6 7379 8103 3404
7 7355 5212 4505
8 5599 6887 5775
9 2321 7948 3553
10 3717 1248 5818
11 6276 5528 206
12 1328 1158 8681
13 4470 3009 1332
14 6472 9018 606
Above is an example of one of the datasets being used, together with the expected outcome; I left out the excess rows.
My intention is to split each dataset sequentially into subsets, each with 90 observations. I am aware that 1368 is not divisible by 90, but the last subset having more entries isn't a problem. The main concern is splitting the observations into either different datasets or different columns, so that I can perform specific statistical tests, such as a t-test, on each subset of data. The end result should be a data frame with 14 columns.
The end goal is to have all 3 datasets of 1368 observations split into equal subsets.
What would be the best way to split the dataset into these subsets?
This should get you started, but without reproducible data, it is impossible to adapt a general method to your specific data:
n <- 1368 # rows
subsets <- n %/% 90 # 15 subsets
extra <- n %% 90 # 18 extra
grp <- c(rep(1:subsets, each=90), rep(subsets, extra)) # group numbers for each row assuming the extra goes in the last group
table(grp)
# grp
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 90 90 90 90 90 90 90 90 90 90 90 90 90 90 108
Then use grp to split() your data frame into a list of groups.
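As a rough sketch of that last step, using simulated data in the shape of the question's Df: the one-sample t-test against mu = 5000 is only a placeholder assumption, since the question does not say which test or null value is intended.

```r
set.seed(1)  # only so the simulated values are repeatable
Df <- data.frame(obs = 1:1368,
                 var1 = sample(0:10000, 1368, replace = TRUE))

n <- nrow(Df)
subsets <- n %/% 90
extra <- n %% 90
# The leftover rows are assigned to the last group
grp <- c(rep(1:subsets, each = 90), rep(subsets, extra))

# split() returns a named list with one data frame per group
groups <- split(Df, grp)
sapply(groups, nrow)

# Placeholder test: one-sample t-test of var1 in each subset
pvals <- sapply(groups, function(g) t.test(g$var1, mu = 5000)$p.value)
pvals
```

Keeping the subsets in a list avoids having to pad the last, longer group to fit a rectangular data frame.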
Thank you all for your answers. I thought I was smarter than I am and had hoped I would understand some of it. I think I also messed up the way I presented my data, so I have edited my post to show my sample data better. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into converting from wide to long format, but I can't get it to work. I am relatively new to RStudio and Stack Overflow (if you couldn't tell already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
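The names_to = ".value" argument tells pivot_longer() to take the new column names from the part of each original column name that names_pattern captures. The names_pattern argument takes a regular expression with capture groups. In this case, here is the breakdown:
(.+) # capture group: one or more of any character - the captured text becomes the ".value" column name
[0-9] # a single digit - the trailing number sits outside the capture group,
# so it is left out of the new column names. If the suffix can have more than
# one digit, use a pattern such as "([A-Za-z]+)[0-9]+" instead, because a greedy
# (.+) would swallow every digit except the last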
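As a small sketch of how the pattern behaves with multi-digit suffixes, and of the same call applied to column names like those in the edited question, consider the following. The three-row data frame is a shortened, illustrative subset of the question's data, and the final NA filter is one possible way to drop empty pairs, not part of the answer above.

```r
library(tidyr)

# A greedy (.+) keeps all but the last digit of a multi-digit suffix,
# while a letters-only capture group drops the whole suffix
greedy  <- sub("(.+)[0-9]+", "\\1", "measurement12")
letters_only <- sub("([A-Za-z]+)[0-9]+", "\\1", "measurement12")
greedy        # "measurement1"
letters_only  # "measurement"

# Shortened subset of the question's data
data <- read.table(header = TRUE, text = '
pid measurement1 Tdays1 measurement2 Tdays2
1 1356 1435 1483 1405
2 943 1848 1173 1818
3 1590 185 NA NA
')

long <- pivot_longer(data, cols = -pid,
                     names_to = ".value",
                     names_pattern = "([A-Za-z]+)[0-9]+")

# Drop the pairs where both the measurement and the day are missing
long <- long[!(is.na(long$measurement) & is.na(long$Tdays)), ]
long
```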
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
Which keeps measurements and Tdays neatly together but leaves us without pid which we can add using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not the row order you expected, but you can sort the data.frame if that is a concern:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
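The sorted result still contains rows where both Time and Value are NA (row 37 above), which the desired output in the question does not show. A minimal sketch of one way to drop them, using a shortened, illustrative copy of the data with only two measurement pairs:

```r
# Shortened copy of the question's data, for illustration only
data <- read.table(header = TRUE, text = '
pid measurement1 Tdays1 measurement2 Tdays2
1 1356 1435 1483 1405
2 943 1848 1173 1818
3 1590 185 NA NA
')

# Same stacking approach as above, with 2 pairs instead of 4
result <- cbind(pid = rep(data$pid, 2),
                stack(data, c(measurement1, measurement2)),
                stack(data, c(Tdays1, Tdays2)))
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")

# Keep a row unless both Time and Value are missing
result <- result[!(is.na(result$Time) & is.na(result$Value)), ]
result
```

Rows where only one of the two values is missing are kept, so partial records remain visible.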
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Consider a data frame df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
names(tobind) = names(finalDf) # rbind() matches data frame columns by name
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
I have a data frame with 3 different identifications and sometimes they overlap. I want to create a new column, with only one of those ids, in an order of preference (id1>id2>id3).
Ex.:
id1  id2  id3
12   145  8763
45   836  5766
13   768  9374
     836  5766
12   145
          9282
     567
45   836  5766
and I want to have:
id1  id2  id3   id.new
12   145  8763  12
45   836  5766  45
13   768  9374  13
     836  5766  836
          9282  9282
     567        567
I have tried the ifelse, which, and grep functions, but I can't make it work.
Ex. of my try:
df$id1 <- ifelse(df$id1 == "", paste(df$2), (ifelse(df$id1)))
I am able to do this in Excel, but I am switching to R because it is more reliable and reproducible :) In Excel I would use:
=IF(A1<>"",A1,IF(B1<>"",B1,C1))
Using coalesce from the dplyr package, we can try:
library(dplyr)
df$id.new <- coalesce(df$id1, df$id2, df$id3)
df
id1 id2 id3 id.new
1 12 145 8763 12
2 45 836 5766 45
3 13 768 9374 13
4 NA 836 5766 836
5 12 145 NA 12
6 NA NA 9282 9282
7 NA 567 NA 567
8 45 836 5766 45
Data:
df <- data.frame(id1=c(12,45,13,NA,12,NA,NA,45),
id2=c(145,836,768,836,145,NA,567,836),
id3=c(8763,5766,9374,5766,NA,9282,NA,5766))
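The question's columns appear to hold empty strings rather than NA, and coalesce only skips NA. A minimal base R sketch, assuming character columns with "" for missing ids (the three-row data frame is made up for illustration), converts the blanks first and then mirrors the nested Excel IF with ifelse:

```r
# Illustrative data: ids stored as text, with "" where an id is missing
df <- data.frame(id1 = c("12", "", ""),
                 id2 = c("145", "836", ""),
                 id3 = c("8763", "5766", "9282"),
                 stringsAsFactors = FALSE)

# Turn empty strings into NA so that missing-value tools can see them
df[df == ""] <- NA

# Order of preference id1 > id2 > id3, written as a nested ifelse
df$id.new <- ifelse(!is.na(df$id1), df$id1,
                    ifelse(!is.na(df$id2), df$id2, df$id3))
df
```

After the conversion to NA, the coalesce call above would give the same id.new column.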
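In base R you can apply which.min over the rows of is.na(df): because FALSE counts as 0 and TRUE as 1, this returns the position of the first non-NA column in each row, which is then used for matrix subsetting. Thanks to @tim-biegeleisen for the dataset.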
df$id.new <- df[cbind(1:nrow(df), apply(is.na(df), 1, which.min))]
df
# id1 id2 id3 id.new
#1 12 145 8763 12
#2 45 836 5766 45
#3 13 768 9374 13
#4 NA 836 5766 836
#5 12 145 NA 12
#6 NA NA 9282 9282
#7 NA 567 NA 567
#8 45 836 5766 45
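A short sketch of what which.min does on a logical vector, including the edge case of a row with no id at all. The helper name first_non_na and the three-row data frame are made up for illustration:

```r
# Hypothetical helper: position of the first non-NA element
first_non_na <- function(row) unname(which.min(is.na(row)))

first_non_na(c(NA, 836, 5766))  # 2: the first FALSE in is.na()
first_non_na(c(NA, NA, NA))     # 1: all TRUE, so the first position wins

# Selecting only the id columns keeps the code safe to re-run
# after id.new has been added
df <- data.frame(id1 = c(12, NA, NA),
                 id2 = c(145, NA, 567),
                 id3 = c(8763, NA, NA))
ids <- df[c("id1", "id2", "id3")]
df$id.new <- ids[cbind(seq_len(nrow(ids)),
                       apply(is.na(ids), 1, which.min))]
df
```

For a row where every id is NA, position 1 is selected and id.new is NA, which is usually the desired outcome.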
I want to merge the data frames OldData and NewData.
In this case, Nov-2015 and Dec-2015 are present in both data frames.
Since NewData is the most accurate update available, I want to update the values for Nov-2015 and Dec-2015 with the values in NewData and, of course, add the records for Jan-2016 and Feb-2016 as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks to @akrun, the problem is solved, and the following code works smoothly!
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Update: Now, let's extend the problem a little bit.
Suppose OldData and NewData have another column called "Type".
How do we merge/update them this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
JoinData is supposed to pick up all values from NewData, as follows:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from NewData)
13 2015-12 C 45 (value updated from NewData)
14 2015-12 D 22 (value updated from NewData)
15 2016-01 A 32 (newly added from NewData)
16 2016-02 A 11 (newly added from NewData)
Thanks to @akrun, I have a solution for the second question as well.
Thanks to everyone here for the help!
Here is the answer:
# Full outer join on the key columns: Value.x comes from OldData, Value.y from NewData
d1 <- merge(OldData, NewData, by = c("Month","Type"), all = TRUE)
# Prefer the NewData value where there is one, then drop the Value.y column
d2 <- transform(d1, Value.x= ifelse(!is.na(Value.y), Value.y, Value.x))[-4]
d2[!duplicated(d2[1:2], fromLast=TRUE),]
Here is an option using data.table (a similar approach to the one @thelatemail mentioned in the comments):
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Or
rbindlist(list(OldData, NewData))[,if(.N >1) .SD[.N] else .SD, Month]
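The same "append, then keep the last occurrence of each key" idea should also cover the updated question with the Type column. A minimal sketch in base R, rebuilding the data from the question and treating Month plus Type as the key:

```r
OldData <- data.frame(
  Month = c(sprintf("2015-%02d", 1:11), "2015-12", "2015-12"),
  Type  = c(rep("A", 10), "B", "C", "D"),
  Value = c(3, 76, 31, 45, 99, 95, 18, 97, 61, 7, 42, 32, 77)
)
NewData <- data.frame(
  Month = c("2015-11", "2015-12", "2015-12", "2016-01", "2016-02"),
  Type  = c("A", "C", "D", "A", "A"),
  Value = c(88, 45, 22, 32, 11)
)

# Append NewData after OldData, so that for any repeated (Month, Type)
# pair the NewData row is the last occurrence
combined <- rbind(OldData, NewData)

# fromLast = TRUE marks the earlier copies as duplicates,
# which keeps the NewData version of each repeated key
JoinData <- combined[!duplicated(combined[c("Month", "Type")],
                                 fromLast = TRUE), ]
rownames(JoinData) <- NULL
JoinData
```

This relies on NewData being appended last; reversing the order in rbind would keep the old values instead.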
I have 3 data frames and am trying to loop through each category to obtain the total (sum) of transactiondata$amount for each week.
Here is my pseudocode:
for a given productIDCat$Category {
for a given weeklydate$weekid{
calculate sum(transactiondata$amount)
when transactiondata$date is between weeklydate$begindate
and weeklydate$enddate
Append sum to new column in weeklydate
as weeklydate$testamount}}
Repeat this process for each week for that given product.
Below are my dataframes:
productIDCat
itemNo Category
1 Shoes
2 Shirts
3 Panties
4 Shoes
5 Shirts
6 Panties
transactiondata
amount date itemNo
12 2014-07-23 1
33 2014-07-29 1
6 2014-08-05 2
21 2014-08-06 2
23 2014-08-19 3
32 2014-08-27 3
weeklydate
weekid begindate enddate
1 2014-07-21 2014-07-25
2 2014-07-28 2014-08-01
3 2014-08-04 2014-08-08
4 2014-08-11 2014-08-15
5 2014-08-18 2014-08-22
6 2014-08-25 2014-08-29
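One possible way to turn the pseudocode into R without explicit nested loops is sketched below. It assumes that the weeks do not overlap and that the result should be one total per Category and week; the names tx and totals are made up for illustration.

```r
productIDCat <- data.frame(
  itemNo = 1:6,
  Category = c("Shoes", "Shirts", "Panties", "Shoes", "Shirts", "Panties")
)
transactiondata <- data.frame(
  amount = c(12, 33, 6, 21, 23, 32),
  date = as.Date(c("2014-07-23", "2014-07-29", "2014-08-05",
                   "2014-08-06", "2014-08-19", "2014-08-27")),
  itemNo = c(1L, 1L, 2L, 2L, 3L, 3L)
)
weeklydate <- data.frame(
  weekid = 1:6,
  begindate = as.Date(c("2014-07-21", "2014-07-28", "2014-08-04",
                        "2014-08-11", "2014-08-18", "2014-08-25")),
  enddate = as.Date(c("2014-07-25", "2014-08-01", "2014-08-08",
                      "2014-08-15", "2014-08-22", "2014-08-29"))
)

# Attach the category to each transaction
tx <- merge(transactiondata, productIDCat, by = "itemNo")

# Find the week whose [begindate, enddate] range contains each date
tx$weekid <- sapply(seq_len(nrow(tx)), function(i) {
  hit <- which(tx$date[i] >= weeklydate$begindate &
               tx$date[i] <= weeklydate$enddate)
  if (length(hit) == 1) weeklydate$weekid[hit] else NA_integer_
})

# Sum the amounts per category and week
totals <- aggregate(amount ~ Category + weekid, data = tx, FUN = sum)
totals
```

Transactions that fall outside every week get an NA weekid and are left out of the totals by aggregate's default handling of missing values.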