How to use lookup in R? [duplicate]

This question already has answers here:
How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning?
(18 answers)
Closed 7 years ago.
I was looking for an R equivalent of Excel's VLOOKUP function and came across lookup in R; however, when I try to use it I keep getting errors.
I have 2 data frames (myresday and Assorted). myresday contains 2 columns: one with codes (column name Res.Code) and the other with corresponding days of the week (column name ContDay). Each code represents a person, and each person is matched with a day of the week they are supposed to be in work. Assorted contains the record of when each person actually came in over the course of a year. It is a data frame similar to myresday, but much bigger. I want to see whether the codes in Assorted are matched with the correct days, or whether the days corresponding to each code are incorrect.
I was trying to use lookup but kept coming across several errors. Here is my code:
Assorted$Cont_Day <- lookup(Assorted$VISIT_PROV_ID, myresday[, 1:2])
# the codes in myresday are in column 1, the days in column 2
R kept saying the function couldn't be found. I looked into it further and someone suggested the qdapTools library, so I put:
library('qdapTools')
before my code, and it said there is no package called 'qdapTools'.
Does anyone know how to do this or know of a better way to solve this?

You need to install qdapTools before you can load it with library(). This works:
install.packages("qdapTools")
library('qdapTools')
Assorted$Cont_Day <- lookup(Assorted$VISIT_PROV_ID, myresday[, 1:2])
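For what it's worth, this particular lookup doesn't need a package at all: base R's match() does the same VLOOKUP-style matching. A minimal sketch, assuming the codes are in myresday$Res.Code and the days in myresday$ContDay:
# find each VISIT_PROV_ID's position in Res.Code, then pull the matching ContDay;
# codes with no match come back as NA
Assorted$Cont_Day <- myresday$ContDay[match(Assorted$VISIT_PROV_ID, myresday$Res.Code)]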

The base R function merge() is likely to do what you need, without involving any extra packages.
Let's make some toy data
set.seed(100)
myresday <- data.frame(
  Res.Code = 1:30,
  ContDay = sample(1:7, 30, replace = TRUE))
Assorted <- data.frame(
  date = sample(seq(as.Date('2010-01-01'), as.Date('2011-01-01'), by = 'day'),
                100, replace = TRUE),
  VISIT_PROV_ID = sample(1:30, 100, replace = TRUE))
head(Assorted)
date VISIT_PROV_ID
1 2010-06-28 8
2 2010-12-06 26
3 2010-05-08 23
4 2010-12-16 15
5 2010-09-12 18
6 2010-11-22 1
And then do the merge
checkDay <- merge(Assorted, myresday, by.x='VISIT_PROV_ID', by.y='Res.Code')
head(checkDay)
VISIT_PROV_ID date ContDay
1 1 2010-06-16 3
2 1 2010-08-07 3
3 1 2010-11-22 3
4 1 2010-03-18 3
5 2 2010-08-19 2
6 2 2010-11-04 2
Edit: Updated column names
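From there, checking whether each visit actually fell on the scheduled day is one comparison. A sketch under the assumption that ContDay uses ISO weekday numbering (1 = Monday through 7 = Sunday):
# flag visits that fall on the scheduled day
checkDay$onSchedule <- as.integer(format(checkDay$date, "%u")) == checkDay$ContDay
head(checkDay)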

Related

How can I speed up this lookup-and-sum function in R data.table? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
EDIT: There were lots of problems in my first example, so I am reworking it here. This is primarily to direct credit towards the original responder, who cut my process time by a factor of about 180 even with my poor example. This question was frozen for being unclear or not general enough, but I think it has value: data.table can do amazing things with the right syntax, but that syntax can be elusive even with the available vignettes. From my own experience, having more examples of how data.table can be used will be helpful, particularly for those of us who got our start in Excel; the VLOOKUP-like behavior here fills a gap that is not always easy to find.
The specific things that happen in this example that may be of general interest are:
looking up values in one data.table in another data.table
passing variables by name and by reference
apply-like behavior in data.table
Original question with modified (limited rows) example:
I am looking for help in the arcane world of data.table, passing functions, and fast use of lookups across multiple tables. I have a larger function that, when I profile it, seems to spend all of its time in this one area doing some fairly straightforward lookup and sum actions. I am not adept enough at profiling to figure out exactly which subareas of the call are causing the problem, but my guess is that I am unintentionally doing something computationally expensive that I don't need to do. data.table syntax is still a complete mystery to me, so I am seeking help here to speed this process up.
Small worked example:
library(data.table)
set.seed(seed = 911)
##Other parts of the analysis generate all of these data.tables
#A data table containing id values (the real version has other things too)
whoamI<-data.table(id=1:5)
#The result of another calculation; it tells me how many neighbors I will be interested in
#the real version has many more columns in it.
howmanyneighbors<-data.table(id=1:5,toCount=round(runif(5,min=1,max=3),0))
#Who the first three neighbors are for each id
#real version has hundreds of neighbors
myneighborsare<-data.table(id=1:5,matrix(1:5,ncol=3,nrow=5,byrow = TRUE))
colnames(myneighborsare)<-c("id","N1","N2","N3")
#How many of each group live at each location?
groupPops<-data.table(id=1:5,matrix(floor(runif(25,min=0,max=10)),ncol=5,nrow=5))
colnames(groupPops)<-c("id","ape","bat","cat","dog","eel")
whoamI
howmanyneighbors
myneighborsare
groupPops
> whoamI
id
1: 1
2: 2
3: 3
4: 4
5: 5
> howmanyneighbors
id toCount
1: 1 2
2: 2 1
3: 3 3
4: 4 3
5: 5 2
> myneighborsare
id N1 N2 N3
1: 1 1 2 3
2: 2 4 5 1
3: 3 2 3 4
4: 4 5 1 2
5: 5 3 4 5
> groupPops
id ape bat cat dog eel
1: 1 9 8 6 8 1
2: 2 9 8 0 9 8
3: 3 6 1 9 1 2
4: 4 6 1 9 0 3
5: 5 6 2 2 2 5
##At any given time I will only want the group populations for some of the groups
#I will always want 'ape' but other groups will vary. Here I have picked two
#I retain this because passing the column names by variable alongside the hard-coded 'ape' was tricky
#and I don't want to lose that syntax in any new answer
animals<-c("bat","eel")
i<-2 #similarly, howmanyneighbors has many more columns in it and I need to pass a reference to one of them which I call i here
##Functions I will call on the above data
#Get the ids of my neighbors from myneighborsare. The number of ids returned varies with the value in howmanyneighbors
getIDs<-function(a){myneighborsare[id==a,2:(as.numeric(howmanyneighbors[id==a,..i])+1)]} #so many coding fails here it pains me to put this in public view
#Sum the populations of my neighbors for groups I am interested in.
sumVals<-function(b){colSums(groupPops[id%in%b,c("ape",..animals)])} #cringe
#Wrap the first two together and put them into a format that works well with being returned as a row in a data.table
doBoth<-function(a){
ro.ws<-getIDs(a)
su.ms<-sumVals(ro.ws)
answer<-lapply(split(su.ms,names(su.ms)),unname) #not too worried about this as it just mimics some things that happen in the original code at little time cost
return(answer)
}
#Run the above function on my data
result<-data.table(whoamI)
result[,doBoth(id),by=id]
id ape bat eel
1: 1 18 16 9
2: 2 6 1 3
3: 3 21 10 13
4: 4 24 18 14
5: 5 12 2 5
This involves a reshape and non-equi join.
library(data.table)
# reshape to long and add a grouping ID for a non-equi join later
molten_neighbors <- melt(myneighborsare, id.vars = 'id')[, grp_id := .GRP, by = variable]
#regular join by id
whoamI[howmanyneighbors,
       on = .(id)
#non-equi join - replaces getIDs(a)
][molten_neighbors,
  on = .(id, toCount >= grp_id),
  nomatch = 0L
#regular join - next steps replace sumVals(ro.ws)
][groupPops[, c('id', 'ape', ..animals)],
  on = .(value = id),
  .(id, ape, bat, eel),
  nomatch = 0L
#sum the selected columns within each id
][,
  lapply(.SD, sum),
  keyby = id
]
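To confirm the chain reproduces the function-based result from the question, a quick sanity check (assuming all the snippets above have been run in the same session; all.equal tolerates the integer/double storage difference between the two):
res_fun  <- result[, doBoth(id), by = id]
res_join <- whoamI[howmanyneighbors, on = .(id)
  ][molten_neighbors, on = .(id, toCount >= grp_id), nomatch = 0L
  ][groupPops[, c('id', 'ape', ..animals)], on = .(value = id),
    .(id, ape, bat, eel), nomatch = 0L
  ][, lapply(.SD, sum), keyby = id]
all.equal(as.data.frame(res_fun), as.data.frame(res_join))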
I highly recommend simplifying future questions. Using 10 rows allows you to post the tables within your question. As is, it was somewhat difficult to follow.

Populate multiple columns by values in one column [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I haven't been able to find a solution to this so far... This one came the closest: 1
Here is a small subset of my dataframe, df:
ANIMAL(chr) MARKER(int) GENOTYPE(int)
"1012828" 1550978 0
"1012828" 1550982 2
"1012828" 1550985 1
"1012830" 1550982 0
"1012830" 1550985 2
"1012830" 1550989 2
And what I want is this...
ANIMAL MARKER_1550978 MARKER_1550982 MARKER_1550985 MARKER_1550989
"1012828" 0 2 1 NA
"1012830" NA 0 2 2
My initial thought was to create columns for each marker, as in the referenced question:
markers <- unique(df$MARKER)
df[,markers] <- NA
Since I can't have integers as column names in R, I added "MARKER_" to each marker so it would work:
df$MARKER <- paste0("MARKER_", df$MARKER)  # paste0, so no space sneaks into the name
markers <- unique(df$MARKER)
df[,markers] <- NA
Now I have all my new columns, but with the same number of rows. I'll have no problem getting rid of unnecessary rows and columns, but how would I correctly populate my new columns with their correct GENOTYPE by MARKER and ANIMAL? Am guessing one-or-more of these: indexing, match, %in%... but don't know where to start. Searching for these in stackoverflow did not yield anything that seemed pertinent to my challenge.
What you're asking is a very common dataframe operation, commonly called "spreading", or "widening". The inverse of this operation is "gathering". Check out this handy cheatsheet, specifically the part on reshaping data.
library(tidyr)
df %>% spread(MARKER, GENOTYPE)
#> ANIMAL 1550978 1550982 1550985 1550989
#> 1 1012828 0 2 1 NA
#> 2 1012830 NA 0 2 2
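A side note: in later tidyr releases spread() was superseded by pivot_wider(), which can also add the MARKER_ prefix the asker wanted directly. A sketch, assuming tidyr >= 1.0:
library(tidyr)
# one column per marker, values taken from GENOTYPE, missing combinations become NA
pivot_wider(df, names_from = MARKER, values_from = GENOTYPE,
            names_prefix = "MARKER_")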

Data.frame merge usage for selective row replacement [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
how to use merge() to update a table in R
What is the proper use of merge for this kind of operation in R? See below.
older <- data.frame(Member = c("first", "second", "third", "fourth"),
                    VAL = c(NA, NA, NA, NA))
newer <- data.frame(Member = c("third", "first"),
                    VAL = c(2125, 4587))
#
merge.data.frame(older, newer, all = TRUE)
Member VAL
1 first 4587
2 first NA
3 fourth NA
4 second NA
5 third 2125
6 third NA
The above is not exactly what I expect: I want to replace the older entries with the newer ones, not add more rows, like below. I can't get there with merge.data.frame.
my.merge.fu(older,newer)
Member VAL
1 first 4587
2 second NA
3 third 2125
4 fourth NA
A kind of selective row replacement, where newer takes precedence and cannot contain Members other than those in older.
Is there a proper English term for such an R operation, and is there a prebuilt function for it?
Thank you.
You have effectively answered your own question.
If you want to deal with Matthew Ploude's point, you could use:
older$VAL[match(newer[newer$Member %in% older$Member, ]$Member, older$Member)] <-
  newer[newer$Member %in% older$Member, ]$VAL
This also has the effect that where newer has multiple new values for the same Member, it is the latest that ends up in older. For example:
older <- data.frame(Member = c("first", "second", "third", "fourth"),
                    VAL = c(1234, NA, NA, 5678))
newer <- data.frame(Member = c("third", "first", "fifth", "first"),
                    VAL = c(2125, 4587, 2233, 9876))
older$VAL[match(newer[newer$Member %in% older$Member, ]$Member, older$Member)] <-
  newer[newer$Member %in% older$Member, ]$VAL
gives
> older
Member VAL
1 first 9876
2 second NA
3 third 2125
4 fourth 5678
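For what it's worth, this pattern is often called an update join, and data.table supports it directly with := inside a join. A minimal sketch of the asker's first example (using NA_real_ so the VAL column starts out numeric):
library(data.table)
older <- data.table(Member = c("first", "second", "third", "fourth"),
                    VAL = NA_real_)
newer <- data.table(Member = c("third", "first"),
                    VAL = c(2125, 4587))
# for each Member present in newer, overwrite VAL in older by reference
older[newer, on = "Member", VAL := i.VAL]
older
Members in newer that don't appear in older are simply ignored, and older is modified in place; with duplicate Members in newer the behavior can differ from the match() approach, so deduplicate first if that matters.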

Excluding values in cross table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
R filtering out a subset
I have an R dataset. In this dataset, I wish to create a crosstable for two categorical variables using the gmodels package, and then run a chisq.test on them.
The two variables are witness and agegroup. witness takes the values 1, 2 and 9; agegroup takes the values 1 and 2.
I wish to exclude rows from the table where witness = 9 and/or a third variable EMS = 2, but I am not sure how to proceed.
library(gmodels)
CrossTable (mydata$witness, mydata$agegroup)
chisq.test (mydata$witness, mydata$agegroup)
...so my question is: how can I do the above with the conditions witness != 9 and EMS != 2?
data:
witness agegroup EMS
1 1 2
2 2 2
1 1 2
2 1 2
9 2 2
2 2 2
1 2 2
9 2 2
2 1 2
#save the data in your current working directory
data <- read.table("data", header=TRUE, sep = " ")
data$witness[data$witness == "9"] <- NA
mydata <- data[!is.na(data$witness),]
library("gmodels")
CrossTable(mydata$witness, mydata$agegroup, chisq=TRUE)
You can leave the variable "EMS" in "mydata". It does no harm to your analysis!
HTH
I expect this question to be closed as it really seems like a duplicate. But as both Chase and I suggested, I think some form of subsetting is the simplest way to go about this, e.g.
mydata[mydata$witness !=9 & mydata$EMS !=2,]
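The same filter written with subset(), feeding straight into CrossTable. Note that in the sample data shown every row has EMS == 2, so applying the EMS condition there would leave nothing; on the real data both conditions make sense:
# assuming mydata holds the unfiltered data from the question
mydata_f <- subset(mydata, witness != 9 & EMS != 2)
library(gmodels)
CrossTable(mydata_f$witness, mydata_f$agegroup, chisq = TRUE)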

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, one per survey participant, and 158 columns, each representing a question number. The answers are on a 1-5 scale. The raw data uses the number 99 to indicate that a question was not answered. I need to exclude any question a participant did not answer, without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
Which works fine when I am working with sets where all my answers are contained in one column. Then this would just delete the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set ends up with blanks where the 99s occurred, so that these 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible to do in R? I've tried to filter it before submitting to R, but it won't read the data file in when there are blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time-consuming if there is better code or a package to use).
Any assistance would be greatly appreciated!
You could replace the 99s with NA and then calculate the column means omitting NAs:
df <- replicate(20, sample(c(1, 2, 3, 99), 4))  # toy matrix with some 99s
colMeans(df)                 # no good: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA         # recode 99 as missing
colMeans(dfc, na.rm = TRUE)  # means over the answered items only
You can also indicate which values are NAs when you read your data in. For your particular case:
mydata <- read.table('dat_base', header = TRUE, na.strings = "99")
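Once the 99s come in as NA, the statistics mentioned in the question mostly just need na.rm = TRUE or rely on R's default NA handling. A short sketch, reusing the Part/Q001/Q002 column names from the question:
# mean per question, ignoring unanswered items (drop the Part id column)
colMeans(mydata[, -1], na.rm = TRUE)
# t.test drops missing values by default, so the NAs are handled for free
t.test(mydata$Q001, mydata$Q002)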
