Only changing a single variable in R

I have a data frame dt:
Group Age Sales
A1234 12 1000
A2312 11 900
B2100 23 2100
...
I want to create a new data frame in which the Group variable is replaced by a substring of itself. At present, I do it in two steps:
dt1<- dt
dt1$Group<- substr(dt$Group,1,2)
Is it possible to do the above in a single command? The two-step approach would get tedious if I had to create and transform many intermediate data frames along the way.

You can try:
dt1<-`$<-`(dt,"Group",substr(dt$Group,1,2))
dt1
# Group Age Sales
#1 A1 12 1000
#2 A2 11 900
#3 B2 23 2100
dt
# Group Age Sales
#1 A1234 12 1000
#2 A2312 11 900
#3 B2100 23 2100
The original table is unchanged and you get the new one with a single line.
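As an alternative to calling `$<-` directly, base R's transform() and within() also return a modified copy in one expression. This is only a sketch: the data frame below is rebuilt from the three rows shown in the question, and dt2 is a name introduced here for illustration.

```r
# Sample data rebuilt from the rows shown in the question
dt <- data.frame(Group = c("A1234", "A2312", "B2100"),
                 Age = c(12, 11, 23),
                 Sales = c(1000, 900, 2100),
                 stringsAsFactors = FALSE)

# transform() evaluates the expression inside dt and returns a modified copy
dt1 <- transform(dt, Group = substr(Group, 1, 2))

# within() does the same using ordinary assignment syntax
dt2 <- within(dt, Group <- substr(Group, 1, 2))

dt1$Group  # shortened codes
dt$Group   # original values, untouched
```

Both forms leave dt unchanged, so they can be chained or nested without creating intermediate copies by hand.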

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to use the plm package for panel data regression, but when I apply pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A,2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then give me unique ID/Time pairs across the whole data set.
I am therefore hoping someone can help me with a loop or a package suggestion that keeps only the row with the later observation within a year whenever this occurs, and that also works on larger data sets. I believe this involves a couple of conditional commands that I am currently having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeping the "larger" date) would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and creating a new column that holds only the year, since you want the latest observation in each year.
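library(dplyr)
# keep the full date for ordering and store the year separately
Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)
# newest first, keep the first row per ID/year, then restore ascending order
Data %>% arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)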
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, distinct() keeps the latest row for each unique combination of ID and year, and the rows it drops are the older ones. Reverse the arrangement to keep the oldest occurrence instead.
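The same idea can be sketched in base R without extra packages: sort by ID and date, then keep the last row of each ID/year pair with duplicated(..., fromLast = TRUE). The data is the example from the question; latest is a name introduced here for illustration.

```r
# Example data from the question
Data <- data.frame(
  ID_column    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Date_column  = c(20040131, 20041231, 20051231, 20061231,
                   20051231, 20061231, 20071231, 20081231),
  Price_column = c(12, 13, 17, 19, 35, 38, 39, 41)
)

# Year as a separate column; the full YYYYMMDD value is kept for ordering
Data$year <- substr(Data$Date_column, 1, 4)

# Sort so that, within each ID, later dates come last
Data <- Data[order(Data$ID_column, Data$Date_column), ]

# fromLast = TRUE marks every row except the last one of each ID/year pair
# as a duplicate, so negating it keeps the latest observation
latest <- Data[!duplicated(Data[c("ID_column", "year")], fromLast = TRUE), ]
latest
```

Because YYYYMMDD numbers sort chronologically, no date conversion is needed for this step.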

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row each for each StoreID and each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group the desired columns before sampling rows. The last line will return one random row for each unique storeID-week pairing.
df <- data.frame(storeid=sample(c(2000:2010),1000,T),
week=sample(c(1:52),1000,T),
value=runif(1000))
# count number of duplicated storeid-week pairs
df %>% count(storeid,week) %>% filter(n>1)
df %>% group_by(storeid,week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
Not sure if I have read the problem correctly. I would have tried the following using sample function.
Assuming your dataframe is called MyDataFrame and is two dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(dim(MyDataFrame)[1],dim(MyDataFrame)[1],replace=FALSE),]
Let me know if this is what you wanted or something else?
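If the requirement is that every StoreID value and every WEEK value appears at least once (rather than every StoreID/WEEK pair), one possible base R sketch is to pick one random row per distinct value of each column first and then top up with random rows from the remainder. The simulated data, the helper one_per_value() and the target size n_target are all assumptions made for illustration.

```r
set.seed(1)
# Simulated stand-in for the large data frame in the question
df <- data.frame(StoreID = sample(2001:2010, 1000, replace = TRUE),
                 WEEK    = sample(1:52, 1000, replace = TRUE),
                 Value   = runif(1000))

# Hypothetical helper: one random row index per distinct value of x
one_per_value <- function(x) {
  vapply(split(seq_along(x), x),
         function(i) i[sample.int(length(i), 1)],
         integer(1))
}

# Rows that guarantee coverage of every StoreID and every WEEK
must_have <- unique(c(one_per_value(df$StoreID), one_per_value(df$WEEK)))

# Fill up to the desired sample size with other random rows
n_target <- 200
rest  <- setdiff(seq_len(nrow(df)), must_have)
extra <- rest[sample.int(length(rest), max(0, n_target - length(must_have)))]

smp <- df[c(must_have, extra), ]
```

The guaranteed rows make the result slightly less uniform than a plain random sample, which may or may not matter for the analysis.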

switching list elements with dataframe rows

Consider my list IDs that has a dataframe of behaviours in each one:
IDs <- list(Dave = data.frame(Behaviour = c("Aggression","Interaction", "Nursing"), number = c(20,10,5), duration = c(60,39,27)),James = data.frame(Behaviour = c("Aggression","Interaction"), number = c(21,30), duration = c(30,49)))
IDs
$Dave
Behaviour number duration
1 Aggression 20 60
2 Interaction 10 39
3 Nursing 5 27
$James
Behaviour number duration
1 Aggression 21 30
2 Interaction 30 49
Note that James does not exhibit any nursing behaviour, so the two list elements have different numbers of rows.
I want to switch the list elements with the dataframe rows so that I have a list of behaviours, each holding a dataframe of IDs, like this:
$Aggression
ID number duration
1 Dave 20 60
2 James 21 30
$Interaction
ID number duration
1 Dave 10 39
2 James 30 49
$Nursing
ID number duration
1 Dave 5 27
I thought that it could be achieved with reshape2::melt, but I wasn't able to get further than melt(IDs, id = "Behaviour").
Any ideas?
Generally you can do it in two steps:
1. turn the list into a single data.frame/data.table
2. split it based on Behaviour
You can do it like this, for example:
dt <- data.table::rbindlist(IDs, idcol = "ID")
# or: dt <- dplyr::bind_rows(IDs, .id = "ID")
split(dt, dt$Behaviour)
Note:
If you don't want the Behaviour column in the result and you used the data.table approach, you can modify the split to:
split(dt[,!"Behaviour"], dt$Behaviour)
Try this:
tmp<-data.frame(ID=rep(names(IDs),vapply(IDs,nrow,1L)),do.call(rbind,IDs),row.names=NULL)
split(tmp[-2],tmp$Behaviour)
#$Aggression
# ID number duration
#1 Dave 20 60
#4 James 21 30
#$Interaction
# ID number duration
#2 Dave 10 39
#5 James 30 49
#$Nursing
# ID number duration
#3 Dave 5 27
Or using base R
d1 <- do.call(rbind, Map(cbind, id = names(IDs), IDs))
split(d1, d1$Behaviour)

How to run a loop in R to find a unique combination of numbers within a range of 7?

I have a dataset which looks something like this:-
Key Days
A 1
A 2
A 3
A 8
A 9
A 36
A 37
B 14
B 15
B 44
B 45
I would like to split the individual keys based on the days in groups of 7. For e.g.:-
Key Days
A 1
A 2
A 3
Key Days
A 8
A 9
Key Days
A 36
A 37
Key Days
B 14
B 15
Key Days
B 44
B 45
I could use ifelse and specify buckets of 1-7, 7-14 etc. up to 63-70 (the maximum possible value of days). However, the issue lies with the days column. There are lots of cases where the days overlap two buckets: take days 14-15 as an example, which would fall into two brackets if split using the ifelse logic (7-14 and 15-21).
The ideal method of splitting this would be to take a day, add 7 to it, and check how many rows of data actually fall within that range. I think we need to use loops for this. I could do it in Excel, but I have 20000 rows of data for 2000 keys, hence I'm using R. I would need a loop which goes through each key and, for each key, checks the value of days and buckets them in groups of 7, starting from the first day value of each range.
We create a grouping variable by applying integer division (%/%) to the 'Days' column and then split the dataset into a list based on that 'grp'.
grp <- df$Days %/% 7
split(df, factor(grp, levels = unique(grp)))
#$`0`
# Key Days
#1 A 1
#2 A 2
#3 A 3
#$`1`
# Key Days
#4 A 8
#5 A 9
#$`5`
# Key Days
#6 A 36
#7 A 37
#$`2`
# Key Days
#8 B 14
#9 B 15
#$`6`
# Key Days
#10 B 44
#11 B 45
Update
If we need to split by 'Key' also
lst <- split(df, list(factor(grp, levels = unique(grp)), df$Key), drop=TRUE)
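The %/% approach uses fixed bins (0-6, 7-13, ...). If the windows should instead start at the first day seen for each key and restart whenever a day falls 7 or more days after the current window's start, as the question describes, a small helper can do that. This is a sketch under that reading of the question; window_id() and its width argument are hypothetical names, and the data is the example from the question.

```r
# Example data from the question
df <- data.frame(Key  = c(rep("A", 7), rep("B", 4)),
                 Days = c(1, 2, 3, 8, 9, 36, 37, 14, 15, 44, 45))

# Hypothetical helper: assign a window number to sorted days, opening a
# new window whenever a day is at least `width` past the window start
window_id <- function(days, width = 7) {
  id <- integer(length(days))
  start <- -Inf
  current <- 0L
  for (i in seq_along(days)) {
    if (days[i] >= start + width) {
      start <- days[i]
      current <- current + 1L
    }
    id[i] <- current
  }
  id
}

# Days must be sorted within each key before numbering the windows
df <- df[order(df$Key, df$Days), ]
df$grp <- ave(df$Days, df$Key, FUN = window_id)

# One data frame per Key/window combination
lst <- split(df, list(df$Key, df$grp), drop = TRUE)
```

For the example data this gives three windows for key A and two for key B, matching the grouping shown in the question.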

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify it, have an ID, a date, and the score on a test. In each data frame the person (ID) have taken the test on multiple dates. When looking between the two data frames, some of the people are listed in df1 but not in df2, and vice versa, but some are listed in both and they can overlap differently.
I want to combine all the data into one frame, but the tricky part is if any of the IDs and scores from df1 and df2 are within 7 days (I can do this with a subtracted dates column), I want to combine that row.
In essence, for every ID there will be one row with both scores written separately if taken within 7 days, and if not it will make two separate rows, one with score from df1 and one from df2 along with all the other scores that might not be listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
Alright. I feel bad about the bogus outer join answer (which may be possible in a library I don't know about, but there are advantages to using RDBMS sometimes...) so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1, Date2=as.Date(NA), Score2=NA_real_) # typed NAs so Date2 stays a Date column
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
# see if there are any rows of test1 you could join test2 to
join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
# if so, join it to the first one (see discussion)
if(length(join_rows)>0){
df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
} # if not, add a new row of just the test2
else df3 <- rbind.fill(df3,df2[j,])
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
Use an outer join with an absolute value limit on the date difference. (A outer join B keeps all rows of A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d") etc.
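The same join logic can also be sketched in base R without a loop or SQL: build all same-ID pairs with merge(), keep the pairs within 7 days, and append the unmatched rows from each side. The data frames are the examples from the question; the column names Date1/Date2 and the intermediate names pairs, only1, only2 and out are chosen here for illustration. Unlike the loop above, this keeps every pair within the limit, so a test with several partners inside 7 days would appear more than once.

```r
# Example data from the question, with the dates already converted
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = as.Date(c("2014-05-12", "2014-05-01", "2014-07-03",
                                    "2014-08-05", "2014-05-22", "2014-05-30")),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = as.Date(c("2014-05-30", "2014-06-22", "2014-07-02",
                                    "2014-08-20", "2014-05-22", "2014-05-30")),
                  Score2 = c(40, 20, 10, 60, 30, 80))

# All same-ID combinations, then only those taken within 7 days of each other
pairs <- merge(df1, df2, by = "ID")
pairs <- pairs[abs(as.numeric(pairs$Date1 - pairs$Date2)) <= 7, ]

# Rows from each input that found no partner
only1 <- df1[!paste(df1$ID, df1$Date1) %in% paste(pairs$ID, pairs$Date1), ]
only2 <- df2[!paste(df2$ID, df2$Date2) %in% paste(pairs$ID, pairs$Date2), ]

# Pad the unmatched rows with typed NAs so all pieces share the same columns
only1$Date2 <- as.Date(NA); only1$Score2 <- NA_real_
only2$Date1 <- as.Date(NA); only2$Score1 <- NA_real_

cols <- c("ID", "Date1", "Score1", "Date2", "Score2")
out <- rbind(pairs[cols], only1[cols], only2[cols])
out <- out[order(out$ID, out$Date1, out$Date2), ]
row.names(out) <- NULL
out
```

Note that merge() materialises every same-ID pair before filtering, so memory use grows with the number of tests per ID.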
