data.frame matching in R

I have a simple R question. I have two data frames. The first contains all of my possible years; I assign NA to its second column. The second data frame has only a subset of the possible years, but an actual value in the second column. I want to combine the two data frames: match them by year and, wherever the second frame has that year, replace the NA in the first with the value from the second.
Here is example code.
one <- as.data.frame(matrix(1880:1890, ncol=2, nrow=11))
one[,2] <- NA
two <- data.frame(ncol=2, nrow=3)
two[1,] <- c(1880, "a")
two[2,] <- c(1887, "b")
two[3,] <- c(1889, "c")
I want row 1, column 2 of one to end up with the value "a", row 8, column 2 to be "b", and row 10, column 2 to be "c".
Feel free to make the above code more elegant.
One thing I tried as a preliminary step, but it gave a slightly weird result, was:
one[,1]==two[,1] -> test
But test only contains values 1880 and 1887...

one[match(two[,1], one[,1]), 2] <- two[,2]
That should give you what you are looking for: match returns the positions of two's years within one's first column, so the assignment writes each of two's values into the corresponding row of one.
> one
     V1   V2
1  1880    a
2  1881 <NA>
3  1882 <NA>
4  1883 <NA>
5  1884 <NA>
6  1885 <NA>
7  1886 <NA>
8  1887    b
9  1888 <NA>
10 1889    c
11 1890 <NA>

I like to use merge for these types of problems; it's pretty straightforward, in my opinion. Check out the help page with ?merge.
three <- merge(one, two, by.x = 'V1', by.y = 'ncol', all = T)
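The odd by.y = 'ncol' is only needed because data.frame(ncol=2, nrow=3) does not set dimensions; it builds a one-row data frame whose columns are literally named ncol and nrow. A sketch of a cleaner construction of two (column names chosen here to match one):
two <- data.frame(V1 = c(1880, 1887, 1889), V2 = c("a", "b", "c"))
three <- merge(one["V1"], two, by = "V1", all.x = TRUE) # drop one's all-NA column first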

Here's one approach (merge is another):
library(qdap)
one[, 2] <- lookup(one[, 1], two)
one
##      V1   V2
## 1  1880    a
## 2  1881 <NA>
## 3  1882 <NA>
## 4  1883 <NA>
## 5  1884 <NA>
## 6  1885 <NA>
## 7  1886 <NA>
## 8  1887    b
## 9  1888 <NA>
## 10 1889    c
## 11 1890 <NA>
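For a similar lookup without qdap, a base R sketch using a named vector (assuming, as in the question, that two's first column holds the years and its second the labels):
key <- setNames(two[, 2], two[, 1])
one[, 2] <- key[as.character(one[, 1])] # years with no match stay NA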

converting an abbreviation into a full word

I am trying to avoid writing a long nested IF statement in Excel.
I am working with two datasets. The first, Abbre, holds the county abbreviation and county name together in a single column:
  COUNTY_NAME
1    AD Adams
2   AS Asotin
3   BE Benton
4   CH Chelan
5  CM Clallam
6    CR Clark
And another data set, df2, contains the county abbreviation and the votes:
  CountyCode Votes
1         WM    97
2         AS    14
3         WM   163
4         WM   144
5         SJ    21
For the second table, how do I convert the CountyCode (abbreviation) into the fully spelled-out county name and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something; any insight would be greatly appreciated.
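(For reference, a minimal reconstruction of the two datasets above, so the answers below can be run as-is:)
Abbre <- data.frame(COUNTY_NAME = c("AD Adams", "AS Asotin", "BE Benton",
                                    "CH Chelan", "CM Clallam", "CR Clark"))
df2 <- data.frame(CountyCode = c("WM", "AS", "WM", "WM", "SJ"),
                  Votes = c(97, 14, 163, 144, 21))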
We can use a join:
library(dplyr)
library(tidyr)
df2 <- df2 %>%
  left_join(Abbre %>%
              separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
            by = "CountyCode")
Or use base R:
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
                  col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match (reusing tmp from above):
df2$COUNTY_NAME <- with(tmp, FullName[match(df2$CountyCode, CountyCode)])
gives
> df2
  CountyCode Votes COUNTY_NAME
1         WM    97        <NA>
2         AS    14      Asotin
3         WM   163        <NA>
4         WM   144        <NA>
5         SJ    21        <NA>
A data.table option, again starting from the split-out tmp:
> setDT(tmp)[setDT(df2), on = .(CountyCode)]
   CountyCode FullName Votes
1:         WM     <NA>    97
2:         AS   Asotin    14
3:         WM     <NA>   163
4:         WM     <NA>   144
5:         SJ     <NA>    21

Merging data with data.table roll="nearest" rolls matches across the entire DF

I have two tables, a sample table and a message table. Messages are recorded outside the sampling rate of the tracker, so I have been using data.table's roll = "nearest" to match each message time to the closest time in the sample report. But instead of attaching each message only to the nearest sample time, with NA everywhere else, it seems to roll every message forward until the next message appears.
library(data.table)
library(dplyr) # for the pipe and group_by below
remotes::install_github("dmirman/gazer") # to get the data
library(gazer)
samp <- system.file("extdata", "TJ_samp.xls", package = "gazer")
samp <- data.table::fread(samp, stringsAsFactors = FALSE) # reads in large datasets
msg <- system.file("extdata", "TJ_msg.xls", package = "gazer")
msg <- data.table::fread(msg, stringsAsFactors = FALSE) # reads in large datasets
setDT(samp)
setDT(msg)
DT_mesg <- msg[samp, on="time", roll="nearest"] # use this to get close to values in sample report
DT_mesg
# SR EDFs are a nightmare. This aligns messages with the closest sample values.
get_msg <- DT_mesg %>%
  group_by(trial, message) %>%
  top_n(n = 1, wt = desc(time)) # many useless messages occupy the same time stamp, so take only the first message in time. This was one way I tried to deal with the multiple-message issue, but it does not return messages close to their actual time.
get_msg
I was able to figure it out...
setDT(samp)
setDT(msg)
DT_mesg <- msg[samp, on="time", roll=4]
This gives me my desired result:
trial time message i.trial x y pup Label
1: 1 3314705 !MODE RECORD CR 250 2 1 L\n 1 958.8 580.8 4043 <NA>
2: NA 3314709 <NA> 1 959.1 576.2 4052 <NA>
3: NA 3314713 <NA> 1 959.8 575.6 4053 <NA>
4: NA 3314717 <NA> 1 960.6 575.2 4056 <NA>
5: NA 3314721 <NA> 1 960.2 579.6 4049 <NA>
Not sure why roll = "nearest" returns this (presumably because "nearest" puts no limit on the match distance, so every sample row gets whichever message is nearest, however far away, while a numeric roll carries a message forward at most that many time units):
trial time message i.trial x y pup Label
1: 1 3314705 !MODE RECORD CR 250 2 1 L\n 1 958.8 580.8 4043 <NA>
2: 1 3314709 !MODE RECORD CR 250 2 1 L\n 1 959.1 576.2 4052 <NA>
3: 1 3314713 !MODE RECORD CR 250 2 1 L\n 1 959.8 575.6 4053 <NA>
4: 1 3314717 !MODE RECORD CR 250 2 1 L\n 1 960.6 575.2 4056 <NA>
5: 1 3314721 !MODE RECORD CR 250 2 1 L\n 1 960.2 579.6 4049 <NA>
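A toy sketch of the difference between the two roll settings (hypothetical data, not the gazer files):
library(data.table)
msg  <- data.table(time = c(100, 200), message = c("start", "stop"))
samp <- data.table(time = seq(100, 120, by = 4))
msg[samp, on = "time", roll = "nearest"] # every sample gets the nearest message
msg[samp, on = "time", roll = 4]         # a message is carried forward at most 4 time units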

data.table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the model with the transformed columns plus some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I plan to run the model with different specifications of the transformation.
library(data.table)
library(DescTools) # for Winsorize
dt <- data.table(id=1:10, Country=sample(c("Germany", "USA"), 10, replace=TRUE), x=rnorm(10,1,10), y=rnorm(10,1,10), factor=factor(sample(LETTERS[1:2], 10, replace=TRUE)))
sel.col <- c("x","y")
dt[, lapply(.SD, Winsorize), .SDcols=sel.col, by=factor]
I would need to call data.table again to merge the original dt with the transformed data, and pay attention to the order:
data.table(dt[,.(id,Country),by=factor],
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor])
I was hoping that I could include the additional columns within the lapply call:
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need this?
dt[, c(lapply(.SD, Winsorize), list(id = id, Country = Country)), .SDcols=sel.col, by=factor]
Unfortunately this method gets slow with big data. Apparently it was optimised in some recent update, but it is still very slow.
There is no need to merge; you can assign the columns right after the lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A
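Note that := modifies dt by reference, which the question wanted to avoid. A small variant (a sketch, not part of the answer above) keeps the original intact by winsorizing a copy:
dt2 <- copy(dt) # deep copy, so := below leaves dt untouched
dt2[, (sel.col) := lapply(.SD, Winsorize), .SDcols = sel.col, by = factor]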

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify things, have an ID, a date, and the score on a test. In each data frame a person (ID) may have taken the test on multiple dates. Between the two data frames, some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and their rows can overlap in various ways.
I want to combine all the data into one frame. The tricky part: whenever rows from df1 and df2 share an ID and their dates are within 7 days of each other (I can get this with a subtracted-dates column), I want to combine them into a single row.
In essence, each ID gets one row carrying both scores when the tests were taken within 7 days of each other; otherwise the scores stay on two separate rows, one with the score from df1 and one from df2, along with all the rows that appear in only one of the frames.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
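(A minimal construction of the example data above, so the code below can be run directly:)
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = c(20140512, 20140501, 20140703, 20140805, 20140522, 20140530),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = c(20140530, 20140622, 20140702, 20140820, 20140522, 20140530),
                  Score2 = c(40, 20, 10, 60, 30, 80))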
Alright. I feel bad about the bogus outer-join answer below (it may be possible in a library I don't know about, and there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1, Date2 = as.Date(NA), Score2 = NA) # as.Date(NA) keeps the new column a Date
library(plyr) #for rbind.fill
for (j in 1:nrow(df2)){
  # see if there are any rows of df1 this df2 row could be joined to
  join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7)
  # if so, join it to the first one (see discussion)
  if(length(join_rows)>0){
    df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
    df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
  } else {
    # if not, add a new row with just the df2 data
    df3 <- rbind.fill(df3, df2[j,])
  }
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
Use an outer join with an absolute-value limit on the date difference (A outer join B keeps all rows of both A and B). Be aware that SQLite, sqldf's default backend, only gained FULL OUTER JOIN support in version 3.39. For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d"), and likewise for df2$Date2.

Searching for greater/less than values with NAs

I have a dataframe for which I've calculated and added a difftime column:
name amount 1st_date 2nd_date days_out
JEAN 318.5 1971-02-16 1972-11-27 650 days
GREGORY 1518.5 <NA> <NA> NA days
JOHN 318.5 <NA> <NA> NA days
EDWARD 318.5 <NA> <NA> NA days
WALTER 518.5 1971-07-06 1975-03-14 1347 days
BARRY 1518.5 1971-11-09 1972-02-09 92 days
LARRY 518.5 1971-09-08 1972-02-09 154 days
HARRY 318.5 1971-09-16 1972-02-09 146 days
GARRY 1018.5 1971-10-26 1972-02-09 106 days
I want to break it out and take subtotals where days_out is 0-60, 61-90, 91-120, 121-180.
For some reason I can't even reliably write bracket notation. I would expect
members[members$days_out<=120, ] to show just Barry and Garry, but I get a whole lot of lines like:
NA.1095 <NA> NA <NA> <NA> NA days
NA.1096 <NA> NA <NA> <NA> NA days
NA.1097 <NA> NA <NA> <NA> NA days
Those don't exist in the original data. There's no one without a name. What am I doing wrong here?
This is standard behavior for < and other relational operators: when asked to evaluate whether NA is less than (or greater than, or equal to, or ...) some other number, they return NA, rather than TRUE or FALSE.
Here's an example that should make clear what is going on and point to a simple fix.
x <- c(1, 2, NA, 4, 5)
x[x < 3]
# [1] 1 2 NA
x[x < 3 & !is.na(x)]
# [1] 1 2
To see why all of those rows indexed by NA's have row.names like NA.1095, NA.1096, and so on, try this:
data.frame(a=1:2, b=1:2)[rep(NA, 5),]
# a b
# NA NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA
# NA.4 NA NA
If you are working at the console, the subset function does not have that annoying 'feature' (which is actually due to the behavior of [ rather than to the relational operators):
subset(members, days_out <= 120)
If you are programming, then you can use which, which does Josh's & !is.na(.) conjunction behind the scenes:
members[which(members$days_out <= 120), ]
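A quick sketch on toy data (hypothetical, mirroring the members frame above) showing the filters side by side:
members <- data.frame(name = c("BARRY", "GARRY", "JOHN"),
                      days_out = c(92, 106, NA))
members[members$days_out <= 120, ]                            # drags in a junk NA row
members[members$days_out <= 120 & !is.na(members$days_out), ] # clean
subset(members, days_out <= 120)                              # clean
members[which(members$days_out <= 120), ]                     # clean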
