Match dates and id between data.frames only once - r

I have two example data frames, as follows:
id<-c(1,2,3,1,4,3,5)
date<-c("2011-1-1","2011-1-1","2011-2-2","2012-3-3","2012-4-4","2012-5-5","2012-6-6")
d<-data.frame(cbind(id,date))
colnames(d)<-c("id","date")
d$w<-do.call(paste,c(d[c("id","date")],sep=" "))
id<-c(7,8,9,10,7,10,8,10,11,12)
date<-c("2011-1-1","2011-1-1","2011-2-2","2012-3-3","2012-3-3","2012-4-4","2012-4-4","2012-5-5","2012-6-6","2012-6-6")
contr<-data.frame(cbind(id,date))
colnames(contr)<-c("id","date")
contr$w<-do.call(paste,c(contr[c("id","date")],sep=" "))
Note that ids and dates are repeated in both datasets, that no d$id appears in contr$id, and that every contr$date is %in% d$date.
What I want is a vector y containing ONE contr$w FOR EACH d$id that has a matching date (contr$date %in% d$date).
I have tried the following, which does not work, but I am sure there must be a much easier, simpler (= better) way to do it.
y<-0
for(i in length(levels(factor(d$w)))){
for(j in length(levels(factor(contr$w)))){
z<-ifelse(d$date[i]==contr$date[j],contr$w[j],NA)
y<-c(y,z)
y<-subset(y,!is.na(y))
}
}
Can anyone help?
Many thanks,
Marco

This did what I wanted, maybe I was not clear enough in my explanation. I just wanted a random date per id (then I can create the w column). I have sorted this by using a solution from this other question:
Random row selection in R
Many thanks for the effort anyway!
Marco

Actually, I have now written a loop that does this (the previous answer did not work, as some cases in d did not have a matching date in contr). It is very slow, but it does exactly what I wanted:
for(i in 1:length(d$rownames)){
if(TRUE%in%levels(factor(contr$w%in%d$w[i]))==TRUE){
control.2$rownames[i]<-sample(contr$rownames[contr$w==d$w[i]],1)
contr<-contr[!contr$rownames%in%control.2$rownames[i],]
}else{
z<-contr[contr$practice==d$practice[i],]
z$tempo<-abs(difftime(z$date,d$date[i],units="days"))
z<-z[!is.na(z$tempo),]
z<-z[z$tempo==min(z$tempo),]
control.2$rownames[i]<-sample(z$rownames,1)
contr<-contr[!contr$rownames%in%control.2$rownames[i],]
}
}
Not the best code, I am sure, but it works. The else branch accounts for the few cases where there was no control with a matching date; there I sample() one control from those with the closest date. If you can come up with a faster version, that would be nice. My datasets are about 5K rows (d) and about 2.5 million rows (contr), and it takes roughly 2 hours to run. Painful but worth the wait!
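For anyone looking for a faster starting point, here is a minimal sketch of the exact-date part only. It assumes plain id and date columns and toy data; the names pool and picked are made up for illustration. It indexes the controls by date once with split(), so each case only samples from its own small pool instead of rescanning all of contr. The closest-date fallback from the loop above is not included.

```r
set.seed(1)  # only to make the toy example reproducible

# Toy data (assumed): 3 cases and 4 controls
d <- data.frame(id = c(1, 2, 3),
                date = c("2011-1-1", "2011-1-1", "2011-2-2"),
                stringsAsFactors = FALSE)
contr <- data.frame(id = c(7, 8, 9, 10),
                    date = c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3"),
                    stringsAsFactors = FALSE)

# Index the control rows by date once, instead of scanning contr per case
pool <- split(seq_len(nrow(contr)), contr$date)

# picked[i] = row number in contr sampled for case i (NA if no matching date)
picked <- rep(NA_integer_, nrow(d))
for (i in seq_len(nrow(d))) {
  key <- as.character(d$date[i])
  cand <- pool[[key]]                        # NULL if the date has no controls
  if (length(cand) > 0) {
    j <- cand[sample.int(length(cand), 1)]   # safe even when length(cand) == 1
    picked[i] <- j
    pool[[key]] <- setdiff(cand, j)          # each control is used at most once
  }
}

matched <- contr[picked[!is.na(picked)], ]
```

Cases left as NA in picked are the ones that would still need the closest-date step.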

Related

How do I pull the values from multiple columns, conditionally, into a new column?

I am a relatively novice R user, though familiar with dplyr and the tidyverse. I still can't seem to figure out how to pull the actual data from one column into a new column if it meets a certain condition.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs. If a participant ranked NI as their first choice, I want the value for ni.beliefs to be pulled into the new column first.beliefs. Likewise, if a participant put pmi as their first-choice practice, their value for pmi.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called: first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, last.beliefs and then I need each of these to have the data pulled in conditionally from the practice specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) dependent on the practice specific ranks (rank assigned of 1-5 for each practice, rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far but I am stuck and aware that this is not very close. Any help is appreciated!!!
`
Diss$first.beliefs <-ifelse(rank.ni==1, ni.beliefs,
ifelse(rank.dtt==1, dtt.beliefs,
ifelse(rank.pmi==1, pmi.beliefs,
ifelse(rank.sn, sn.beliefs,
ifelse(rank.script==1, script.beliefs)))))
`
Thank you!!
I'm not sure if I understood correctly (it would help if you showed what your data looks like), but this is what I'm thinking:
Without using additional packages, if the ranking columns are equivalent to the index of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns "first belief, second belief, etc."), then you can use that data as the indices for the second set of columns:
for(j in 1:nrow(people_table)){
people_table[j,]$first.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 1]
people_table[j,]$second.belief[[1]] <- names(beliefs)[(people_table[j,c(A:B)]) %in% 2]
...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple, no need for packages, and it'll be fast too. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors.
This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
# with() lets case_when() see the columns of Diss by name
Diss$first.beliefs <- with(Diss, case_when(
  rank.ni == 1 ~ ni.beliefs,
  rank.dtt == 1 ~ dtt.beliefs,
  rank.pmi == 1 ~ pmi.beliefs,
  rank.sn == 1 ~ sn.beliefs,
  rank.script == 1 ~ script.beliefs
))
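To fill all five columns in one go, here is a minimal base R sketch using matrix indexing. It assumes the column names from the question (rank.ni, ni.beliefs, and so on), toy values for Diss, and that every participant used each rank from 1 to 5 exactly once; if a rank is missing in a row, max.col() silently falls back to the first column, so check that first.

```r
# Toy data (assumed): two participants, five practices
Diss <- data.frame(
  rank.ni = c(1, 2), rank.dtt = c(2, 1), rank.pmi = c(3, 3),
  rank.sn = c(4, 5), rank.script = c(5, 4),
  ni.beliefs = c(10, 20), dtt.beliefs = c(11, 21), pmi.beliefs = c(12, 22),
  sn.beliefs = c(13, 23), script.beliefs = c(14, 24)
)

practices <- c("ni", "dtt", "pmi", "sn", "script")
ranks   <- as.matrix(Diss[paste0("rank.", practices)])     # same column order
beliefs <- as.matrix(Diss[paste0(practices, ".beliefs")])  # as practices

new_cols <- c("first.beliefs", "second.beliefs", "third.beliefs",
              "fourth.beliefs", "last.beliefs")

for (k in seq_along(new_cols)) {
  # For every row, find the column (practice) that was given rank k
  col_idx <- max.col((ranks == k) * 1, ties.method = "first")
  # Pick the belief score of that practice, one (row, column) pair per row
  Diss[[new_cols[k]]] <- beliefs[cbind(seq_len(nrow(Diss)), col_idx)]
}
```

In the toy data the second participant ranked dtt first, so their first.beliefs is the dtt.beliefs value (21).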

Deleting ranges of values based on character string in R

I have a pretty gigantic dataframe that includes the columns nutscode and nutslevel.
I want to delete all NUTS2 values for certain countries (let's say Belgium here) and have no clue how to proceed. So far, the only thing that works has been this:
alldata<-alldata[!(alldata$nutscode=="be21" & alldata$nutslevel=="nuts2"),]
but I would have to keep writing this same line hundreds of times for all possible countries.
I want to exclude all values from the dataset where the nutscode variable has the character string "be" in the values AND the nutslevel equals 2.
I've tried using
alldata[!grepl("be", alldata$nutscode, alldata$nutslevel=="nuts2"),]
or
alldata[!grepl("be", alldata$nutscode) & alldata$nutslevel=="nuts2",]
since I've seen this posted in a similar thread here,
but I am clearly writing something wrong, it doesn't work, it just prints out values. I've also tried many many other alternatives, but nothing worked.
Is there a simpler way of removing the rows containing those specific strings from my dataframe, without writing the same line hundreds of times? Also please please if you reply, do provide a complete answer, I am a total noob at this and if I had known how to write a fancy loop or function to do this for me, I would have done it by now. :/
Thank you very much in advance!
Also for clarification: NUTS codes are used to classify regions and increase in complexity the deeper one goes on a regional level. E.g. AT0 is Austria as a whole, AT2 and AT3 are regions on NUTS1 level and AT21 or AT34 are even smaller regions on NUTS2 level. Each country has their own NUTS code following the same structure (e.g.BE, BE1 and BE34 are examples for NUTS levels 0,1 and 2 regions in Belgium)
I think you're very close with grepl. Why did you abandon the & construct from your first example? This works fine for me...
nutslevel <- c('nuts1', 'nuts1', 'nuts2', 'nuts2')
nutscode <- c('be2', 'o2', 'be2', 'o2')
dat <- data.frame(nutslevel, nutscode)
dat[!(grepl('be', dat$nutscode) & dat$nutslevel=='nuts2'), ]
The last line returns
  nutslevel nutscode
1     nuts1      be2
2     nuts1       o2
4     nuts2       o2
which excludes the third row, as desired.
Also, perhaps subset offers a slightly cleaner way to achieve this
subset(dat, !(grepl('be', nutscode) & nutslevel=='nuts2'))
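If several countries need the same treatment, the pattern can be built from a vector of country prefixes, so the line only has to be written once. This is a sketch under assumptions: the toy alldata below and the countries vector are made up, and the codes are assumed to start with the two-letter country prefix (hence the ^ anchor, which also avoids matching "be" in the middle of another code). Note that the ! has to wrap the whole combined condition.

```r
# Toy data (assumed), using the column names from the question
alldata <- data.frame(
  nutscode  = c("be", "be2", "be21", "at21", "at3", "fr10"),
  nutslevel = c("nuts0", "nuts1", "nuts2", "nuts2", "nuts1", "nuts2"),
  stringsAsFactors = FALSE
)

# Countries whose NUTS2 rows should be removed (assumed list)
countries <- c("be", "at")

# Builds "^(be|at)": code starts with any of the listed prefixes
pattern <- paste0("^(", paste(countries, collapse = "|"), ")")

# TRUE for rows that match a listed country AND are at NUTS2 level
drop <- grepl(pattern, alldata$nutscode, ignore.case = TRUE) &
  alldata$nutslevel == "nuts2"

cleaned <- alldata[!drop, ]
```

Here be21 and at21 are removed, while fr10 stays because "fr" is not in countries.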
Just for clarification: do the different countries have different nutscodes? What is the pattern of the nutscode? As far as I can tell from the explanation above, you did exclude all rows where the nutscode variable contains the character string "be" AND the nutslevel equals 2. Someone can only answer your question properly if it is clear how the nutscode differs from country to country, so one has to see the pattern. If possible, give the nutscode for at least four countries. I assume nutslevel is 2 for all the countries. Thank you.

subset function is missing some values?

I have a dataframe, and I want to confirm that two columns match for each entry. So I tried:
> nrow(subset(df, col.a!=col.b))
[1] 0
That seemed good to me, but then I tried to compare how many matches there were to the total number of entries in the data frame. It seems like these numbers should be equal but they are not:
nrow(subset(df, col.a==col.b))
[1] 3443
nrow(df)
[1] 3453
Any idea what is going on here? Why does it look like the subset dropped 10 entries? Thanks so much for your help.
Also, I'm fairly new to this, so please let me know if there is a better way of checking if the two columns match.
subset automatically drops rows where the criterion is NA. It should always (?) be the case that
nrow(d)
and
nrow(subset(d, col.a!=col.b))+
nrow(subset(d, col.a==col.b))+
nrow(subset(d, is.na(col.a) | is.na(col.b)))
should be equal.
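A small made-up example shows the effect. The data frame below is a toy with two rows containing NA, not the asker's data.

```r
# Toy data (assumed): rows 3 and 4 each contain an NA
df <- data.frame(col.a = c(1, 2, NA, 4),
                 col.b = c(1, 3, 3, NA))

n_equal <- nrow(subset(df, col.a == col.b))                   # row 1
n_diff  <- nrow(subset(df, col.a != col.b))                   # row 2
n_na    <- nrow(subset(df, is.na(col.a) | is.na(col.b)))      # rows 3 and 4

# The comparison itself yields NA for rows 3 and 4, so subset() keeps
# them in neither the "equal" nor the "different" group
n_undecided <- sum(is.na(df$col.a == df$col.b))

# The three groups together account for every row
n_equal + n_diff + n_na == nrow(df)
```

Counting the NA comparisons directly, as in n_undecided, is a quick way to see how many rows a subset() on == or != will drop.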

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home with the example dataset and it worked fine. However, in my real data the problem is not solved.
Here's the output of my actual data (original 37 firms)
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, even though I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine
A simple reproducible example explains:
library(plm)
dad<-cbind(as.data.frame(matrix(seq(1:40),8,5)),factors = c("q","w","e","r"), year = c("1991","1992", "1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
 e  q  r  w
 7 NA  8 NA
Clearly R has not forgotten the dad, and it reports two of the factor levels as NA. In itself this is not much of a problem, but in my real dataset, with many more variables and more subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why what works perfectly in a small data.frame would work differently in a larger dataframe? for p.data (N = 592, T = 16 and n = 37). I find that when I run 2 identical tapply functions, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared, literally every sum has changed in the s.data which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well
Thanks
Simon
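One possible explanation, offered as a guess rather than a confirmed diagnosis: tapply() returns one value per group, not one per row, so using its result directly as a row index recycles a short logical vector over the rows and can select rows from the wrong groups. That would also explain why every sum changes in s.data. Below is a sketch on a plain data.frame (no plm, toy values modelled on the example above) that uses ave() to get one group sum per row before subsetting.

```r
# Toy data (assumed), mirroring V1, V5 and factors from the example
dad <- data.frame(V1 = 1:8,
                  V5 = 33:40,
                  factors = factor(rep(c("q", "w", "e", "r"), 2)))

# tapply(): one sum per group (length 4), in level order e, q, r, w
per_group <- tapply(dad$V5, dad$factors, sum)

# ave(): the same group sums, but repeated for every row (length 8)
per_row <- ave(dad$V5, dad$factors, FUN = sum)

# Row-wise condition, then drop the factor levels that no longer occur
kid <- droplevels(dad[per_row <= 70, ])

tapply(kid$V1, kid$factors, mean)
```

With these values only group q has a V5 sum of at most 70, so kid keeps just the two q rows and the tapply() result has a single entry.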

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame for the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this. It's at least 20M observations that I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT out of the total occurrences, by URI. Hope that's clearer.
time        uri              action
1355683900  /some/uri        TCP_HIT
1355683900  /some/other/uri  TCP_HIT
1355683905  /some/other/uri  TCP_MISS
1355683906  /some/uri        TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
        time             uri   action
1 1355683900       /some/uri  TCP_HIT
2 1355683900 /some/other/uri  TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906       /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
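As a rough sketch of that filter-first variant, assuming the column names from the sample data: keep only the most frequent URIs, bucket time into ten-second intervals, and take the mean of a logical "was this a hit" column per URI and bucket. The names top_n, bucket and hit are made up for illustration, and the comparison against "TCP_HIT" works whether action is stored as character or factor.

```r
# Toy data (assumed), same shape as the sample in the question
u <- data.frame(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS"),
  stringsAsFactors = FALSE
)

# Keep only the top_n most frequent URIs
top_n <- 50
top_uris <- head(names(sort(table(u$uri), decreasing = TRUE)), top_n)
u_top <- u[u$uri %in% top_uris, ]

# Ten-second buckets and a logical hit indicator
u_top$bucket <- u_top$time %/% 10
u_top$hit <- u_top$action == "TCP_HIT"

# Mean of a logical = share of hits per URI and bucket
hit_ratio <- aggregate(hit ~ uri + bucket, data = u_top, FUN = mean)
```

hit_ratio then has one row per URI and bucket, which is a convenient shape for plotting one line per URI.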
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here--data.table is just faster.
I thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for ( i in 1:length(h)) {
name <- unlist(h[[i]][1])
dftemp <- as.data.frame(do.call(rbind,h[[i]][2]))
names(dftemp) <- c("time", "cache")
plot(dftemp$time,dftemp$cache, type="o")
title(main=name)
}
