Left join (or equivalent) to number index by group

I have a sequence of numbers (days):
dayNum <- c(1:10)
And I have a dataframe of id, day, and event:
id = c("aa", "aa", "aa", "bb", "bb", "cc")
day = c(1, 2, 3, 1, 6, 2)
event = c("Y", "Y", "Y", "Y", "Y", "Y")
df = data.frame(id, day, event)
Which looks like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
bb 1 Y
bb 6 Y
cc 2 Y
I am trying to put this dataframe into a form that resembles left joining dayNum with df for each id. That is, even if id "aa" had no event on day 5, I should still get a row for "aa" on day 5 with N/A or something under event. Like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
aa 4 N/A
aa 5 N/A
aa 6 N/A
aa 7 N/A
aa 8 N/A
aa 9 N/A
aa 10 N/A
bb 1 Y
bb 2 N/A
bb 3 N/A
bb 4 N/A
bb 5 N/A
bb 6 Y
bb 7 N/A
...etc
I can make this work using dplyr and left_join when my dataframe only contains one unique id, but I am stuck trying to make this work with a dataframe that has many different ids.
A push in the right direction would be greatly appreciated.
Thank you!

We can use expand.grid and merge. First build every combination of the unique 'id' values in 'df' with 'dayNum', then left-join 'df' onto that grid (all.x=TRUE) to get the expected output.
merge(expand.grid(id=unique(df$id), day=dayNum), df, all.x=TRUE)
# id day event
#1 aa 1 Y
#2 aa 2 Y
#3 aa 3 Y
#4 aa 4 <NA>
#5 aa 5 <NA>
#6 aa 6 <NA>
#7 aa 7 <NA>
#8 aa 8 <NA>
#9 aa 9 <NA>
#10 aa 10 <NA>
#11 bb 1 Y
#12 bb 2 <NA>
#13 bb 3 <NA>
#14 bb 4 <NA>
#15 bb 5 <NA>
#16 bb 6 Y
#17 bb 7 <NA>
#18 bb 8 <NA>
#19 bb 9 <NA>
#20 bb 10 <NA>
#21 cc 1 <NA>
#22 cc 2 Y
#23 cc 3 <NA>
#24 cc 4 <NA>
#25 cc 5 <NA>
#26 cc 6 <NA>
#27 cc 7 <NA>
#28 cc 8 <NA>
#29 cc 9 <NA>
#30 cc 10 <NA>
A similar option using data.table is to convert the 'data.frame' to a 'data.table' with setDT(df), set the 'key' columns, and join with the dataset derived from the cross join (CJ) of the unique 'id' values and 'dayNum'.
library(data.table)
setDT(df, key=c('id', 'day'))[CJ(id=unique(id), day=dayNum)]
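The merge call above relies on R detecting the shared columns and on coercing the factor 'id' that expand.grid creates by default. A minimal base R sketch that makes both explicit, assuming character ids; the names 'grid' and 'full', and the optional "N" fill value, are illustrative and not from the original answer:

```r
dayNum <- 1:10
df <- data.frame(id = c("aa", "aa", "aa", "bb", "bb", "cc"),
                 day = c(1, 2, 3, 1, 6, 2),
                 event = "Y",
                 stringsAsFactors = FALSE)

# Every (id, day) combination; keep 'id' as character rather than factor
grid <- expand.grid(id = unique(df$id), day = dayNum,
                    stringsAsFactors = FALSE)

# Left join: keep all rows of 'grid', fill unmatched 'event' with NA
full <- merge(grid, df, by = c("id", "day"), all.x = TRUE)

# Optional (assumption): use an explicit label instead of NA
full$event[is.na(full$event)] <- "N"
```

The result has one row per id and day (30 rows here), sorted by 'id' and then 'day' because merge sorts on the 'by' columns by default.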

Related

Group levels of an existing factor

A variable in my dataset looks like this
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = T),
sample(c("Apr", "May"), 5, replace = T))))
The levels Apr and May were entered later by a different person, so they are stored as month names. How do I get rid of the separate levels and group those values under the already existing levels 4 and 5, respectively? Or, conversely, how do I store all values as month names instead of numbers?

You can match with month.abb, i.e.
i1 <- match(df$Month, month.abb)
df$Month[!is.na(i1)] <- i1[!is.na(i1)]
df
# Month
#1 5
#2 2
#3 7
#4 12
#5 5
#6 12
#7 4
#8 6
#9 7
#10 10
#11 9
#12 4
#13 11
#14 10
#15 3
#16 4
#17 5
#18 4
#19 4
#20 4
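Assigning into the factor as above leaves 'Apr' and 'May' behind as unused levels, and it produces NA with a warning if the target level (say "4") does not already exist. A possible variant, offered only as a sketch, goes through a character vector and rebuilds the factor; the seed and the names 'm' and 'month_names' are assumptions for illustration:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = TRUE),
                                  sample(c("Apr", "May"), 5, replace = TRUE))))

m <- as.character(df$Month)       # work on the labels, not the factor codes
i1 <- match(m, month.abb)         # position in month.abb, NA for numeric labels
m[!is.na(i1)] <- i1[!is.na(i1)]   # "Apr" -> "4", "May" -> "5"

# All values as month numbers, with a clean set of levels
df$Month <- factor(as.integer(m), levels = 1:12)

# Conversely, all values as month names
month_names <- factor(month.abb[as.integer(m)], levels = month.abb)
```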

Compare values of two dataframes and substitute them

I've two data frames with the same number of rows and columns, 113x159 with this structure:
df1:
1 2 3 4
a AT AA AG CT
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
df2:
1 2 3 4
a NA 23 12 NA
b NA 23 44 12
c 11 14 27 55
d NA NA 12 34
I want to compare df1 and df2 value by value: if the value in df2 is NA and the value in df1 isn't, replace the df1 value with NA (and likewise if the df1 value is NA and the df2 value is not).
At the end, my df has to be this:
1 2 3 4
a NA AA AG NA
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
I've written this if loop but it doesn't work:
merge.na<-function(x){
for (i in df2) AND (k in df1){
if (i==NA) AND (k!=NA)
k==NA}
Any idea?

We can use replace, which sets every position where df2 is NA to NA in df1:
replace(df1, is.na(df2), NA)
# X1 X2 X3 X4
#a <NA> AA AG <NA>
#b <NA> AG AT CC
#c AG GG GT AA
#d <NA> <NA> TT TC
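For reference, a self-contained sketch of the same call on the 4x4 excerpt from the question. The column names X1 to X4 are an assumption, chosen to match the printed output. It also shows why the loop in the question cannot work: comparing with == NA never yields TRUE, so is.na() has to be used instead.

```r
rn <- c("a", "b", "c", "d")
df1 <- data.frame(X1 = c("AT", NA, "AG", NA),
                  X2 = c("AA", "AG", "GG", NA),
                  X3 = c("AG", "AT", "GT", "TT"),
                  X4 = c("CT", "CC", "AA", "TC"),
                  row.names = rn, stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c(NA, NA, 11, NA),
                  X2 = c(23, 23, 14, NA),
                  X3 = c(12, 44, 27, 12),
                  X4 = c(NA, 12, 55, 34),
                  row.names = rn)

# NA == NA evaluates to NA, not TRUE, so test missingness with is.na()
na_mask <- is.na(df2)             # logical matrix, same shape as df1

# Blank out df1 wherever df2 is missing; existing NAs in df1 stay NA
res <- replace(df1, na_mask, NA)
```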

Applying tidyr separate only to specific rows

I'm trying to use tidyr to separate one column in my data frame, while applying it only to specific rows. While dplyr::filter does the job, it omits the rest of my data. Is there a clean way to apply tidyr to specific rows while keeping the rest of the data untouched?
Here is an example of my problem:
#creating DF for the example
df<-data.frame(var_a=letters[1:5],
var_b=c(sample(1:100,5)),
text=c("foo_bla","here_do","oh_yes","baa","land"))
gives me this:
var_a var_b text
1 a 10 foo_bla
2 b 58 here_do
3 c 34 oh_yes
4 d 1 baa
5 e 47 land
#separating one col:
clean_df<-df %>% separate(text,into=c("first","sec"),sep="_",remove=F)
clean_df
var_a var_b text first sec
1 a 10 foo_bla foo bla
2 b 58 here_do here do
3 c 34 oh_yes oh yes
4 d 1 baa baa <NA>
5 e 47 land land <NA>
I want to split only the "here_do" row.
Thanks in advance for any kind of help!

Another approach:
cols_to_split = c('here_do')
clean_df <-df %>%
filter(text %in% cols_to_split) %>%
tidyr::separate(text,into=c("first","sec"),sep="_",remove=F) %>%
bind_rows(filter(df, !text %in% cols_to_split))
# var_a var_b text first sec
#1 b 7 here_do here do
#2 a 26 foo_bla <NA> <NA>
#3 c 23 oh_yes <NA> <NA>
#4 d 2 baa <NA> <NA>
#5 e 67 land <NA> <NA>
If the rows that are not split should keep their original text in column 'first', you may use:
clean_df <-df %>%
filter(text %in% cols_to_split) %>%
tidyr::separate(text,into=c("first","sec"),sep="_",remove=F) %>%
bind_rows(filter(df, !text %in% cols_to_split)) %>%
mutate(first = ifelse(is.na(first), as.character(text), first))
# var_a var_b text first sec
#1 b 7 here_do here do
#2 a 26 foo_bla foo_bla <NA>
#3 c 23 oh_yes oh_yes <NA>
#4 d 2 baa baa <NA>
#5 e 67 land land <NA>

We can do this in base R by replacing the delimiter only in 'here_do' in the 'text' column, i.e. changing it to 'here,do' with sub, then reading the result with read.csv and using cbind to attach it to the original dataset.
cbind(df, read.csv(text=sub("(?<=here)_(?=do)", ",", df$text,
perl = TRUE), header=FALSE, col.names = c("first", "sec")))
# var_a var_b text first sec
#1 a 93 foo_bla foo_bla
#2 b 51 here_do here do
#3 c 65 oh_yes oh_yes
#4 d 70 baa baa
#5 e 32 land land
Or, if we need a tidyr solution, use extract:
library(tidyr)
extract(df, text, into = c("first", "sec"), "(here)_(do)", remove = FALSE)
# var_a var_b text first sec
#1 a 93 foo_bla <NA> <NA>
#2 b 51 here_do here do
#3 c 65 oh_yes <NA> <NA>
#4 d 70 baa <NA> <NA>
#5 e 32 land <NA> <NA>
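The filter/bind_rows answers above move the split row to the top, and the regex-based ones hard-code 'here' and 'do'. A minimal base R sketch that keeps the original row order and takes the rows to split from a vector; 'rows_to_split' and 'hit' are illustrative names, and the var_b values are copied from the question's printout rather than sampled:

```r
df <- data.frame(var_a = letters[1:5],
                 var_b = c(10, 58, 34, 1, 47),
                 text = c("foo_bla", "here_do", "oh_yes", "baa", "land"),
                 stringsAsFactors = FALSE)

rows_to_split <- c("here_do")        # values of 'text' that should be split
hit <- df$text %in% rows_to_split    # TRUE only for rows to split

# For matching rows take the part before/after the first "_";
# other rows keep their full text in 'first' and get NA in 'sec'
df$first <- ifelse(hit, sub("_.*$", "", df$text), df$text)
df$sec <- ifelse(hit, sub("^[^_]*_", "", df$text), NA_character_)
```

If the untouched rows should have NA in 'first' as well, replace the last argument of the first ifelse with NA_character_.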

reshape data with non-unique id and varying time frames

I have a dataset with the following format:
name1 year name2 profits2010 profits2009 count
AA 2009 AA 10 15 20
AA 2010 AA 10 15 3
BB 2009 BB 4 NA 34
BB 2010 BB 4 NA 4
I need to reshape the data to the following format. Any ideas on how this can be done?
name1 year name2 profits count
AA 2009 AA 15 20
AA 2010 AA 10 3
BB 2009 BB NA 34
BB 2010 BB 4 4

Try
indx <- grep('profits', names(df1))
indx2 <- cbind(1:nrow(df1), match(df1$year,
as.numeric(sub('\\D+', '', names(df1)[indx]))))
df1$profits <- df1[indx][indx2]
df1[-indx]
# name1 year name2 count profits
#1 AA 2009 AA 20 15
#2 AA 2010 AA 3 10
#3 BB 2009 BB 34 NA
#4 BB 2010 BB 4 4

This isn't really reshaping, just defining a new variable. Try this:
df$profits <- ifelse(df$year==2009,df$profits2009,df$profits2010)

stratified sampling with group size below sample size in R

I have response data by market in the format:
head(df)
ID market q1 q2
470 France 1 3
625 Germany 0 2
155 Italy 1 6
648 Spain 0 5
862 France 1 7
699 Germany 0 8
460 Italy 1 6
333 Spain 1 5
776 Spain 1 4
and the following frequencies:
table(df$market)
France 140
Germany 300
Italy 50
Spain 75
I need to create a data frame with a sample of 100 responses per market, drawn without replacement, keeping all responses for markets that have fewer than 100 of them, so that:
table(df_new$market)
France 100
Germany 100
Italy 50
Spain 75
Thanks in advance!

The following looks valid:
set.seed(10); DF = data.frame(c1 = sample(LETTERS[1:4], 25, T), c2 = runif(25))
freqs = as.data.frame(table(DF$c1))
freqs$ss = ifelse(freqs$Freq >= 5, 5, freqs$Freq)
#> freqs
# Var1 Freq ss
#1 A 4 4
#2 B 11 5
#3 C 7 5
#4 D 3 3
res = mapply(function(x, y) DF[sample(which(DF$c1 %in% x), y), ],
x = freqs$Var1, y = freqs$ss, SIMPLIFY = F)
do.call(rbind, res)
# c1 c2
#5 A 0.3558977
#17 A 0.2289039
#6 A 0.5355970
#13 A 0.9546536
#3 B 0.2395891
#25 B 0.8015470
#10 B 0.4226376
#15 B 0.5005032
#19 B 0.7289646
#11 C 0.7477465
#9 C 0.8998325
#12 C 0.8226526
#1 C 0.7066469
#4 C 0.7707715
#23 D 0.4861003
#20 D 0.2498805
#21 D 0.1611833
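The same idea applied to the question's setup might look like the sketch below. The data are simulated, using only the market frequencies from the question; the ID, q1 and q2 values are made up, and 'n_max', 'rows_by_market' and 'keep' are illustrative names. It samples positions with sample.int because sample(x, n) treats a single number x as 1:x, which would give wrong rows for a market with exactly one response.

```r
set.seed(42)  # arbitrary seed for reproducibility
sizes <- c(France = 140, Germany = 300, Italy = 50, Spain = 75)
n <- sum(sizes)

# Simulated stand-in for the real responses (values are invented)
df <- data.frame(ID = seq_len(n),
                 market = rep(names(sizes), times = sizes),
                 q1 = rbinom(n, 1, 0.5),
                 q2 = sample(1:9, n, replace = TRUE),
                 stringsAsFactors = FALSE)

n_max <- 100  # target sample size per market

# Row numbers grouped by market, then up to n_max rows drawn from each group
rows_by_market <- split(seq_len(nrow(df)), df$market)
keep <- unlist(lapply(rows_by_market, function(rows) {
  rows[sample.int(length(rows), min(n_max, length(rows)))]
}), use.names = FALSE)

df_new <- df[keep, ]
table(df_new$market)
```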
