Calculating quintile-based scores in R

I have a dataframe with year (2006 to 2010), 4 industry sectors, 150 firm names, and the net income of these firms. In total I have 750 observations, one for each firm for each year. I want to give firms scores for their income within each industry-year based on quintiles: firms with income in the top 20% of their industry-year get a score of 5, the next 20% get a score of 4, and so on, with the bottom 20% getting a score of 1.
A sample of the data is:
Year Industry Firm Income
2006 Chemicals ABC 334.50
2007 Chemicals ABC 388.98
.
.
2006 Pharma XYZ 91.45
.
.
How do I do this in R? I have tried aggregate and tapply along with quantile but am not able to arrive at the logic that should be used for this. Please help.
I tried this just to allocate a score of 1 to the lowest 20%, but it returned an error.
db10$score <- ifelse(db10$income < aggregate(income~Year+industry,db10,quantile,c(0.2)),1,0)

Try this method:
First, I'll create sample data on which to test the function below:
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = runif(45,10,100)
df = data.frame(y,ind,val)
head(df,20)
y ind val
1 2001 A 63.32011
2 2001 B 85.67976
3 2001 C 86.77527
4 2001 D 32.18319
5 2001 E 49.86626
6 2001 G 57.73214
7 2001 H 18.08216
8 2001 I 22.31012
9 2001 J 44.11174
10 2001 K 54.76902
11 2001 L 41.82495
12 2001 M 64.84514
13 2001 N 59.16529
14 2001 O 61.28870
15 2001 P 84.76561
16 2002 A 83.68185
17 2002 B 45.01354
18 2002 C 62.22964
19 2002 D 98.41717
20 2002 E 19.91548
There are 3 years and industries from A to P. The data frame is ordered by year and then by industry.
The function below takes a year value y and computes the quintile category for every df$val whose year df$y equals y:
quintile <- function(y) {
  x <- df$val[df$y == y]
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
The only thing left is to apply this function to the unique year values:
df$qn = unlist(lapply(unique(df$y), quintile))
Result:
> head(df,20)
y ind val qn
1 2001 A 63.32011 4
2 2001 B 85.67976 5
3 2001 C 86.77527 5
4 2001 D 32.18319 1
5 2001 E 49.86626 2
6 2001 G 57.73214 3
7 2001 H 18.08216 1
8 2001 I 22.31012 1
9 2001 J 44.11174 2
10 2001 K 54.76902 3
11 2001 L 41.82495 2
12 2001 M 64.84514 4
13 2001 N 59.16529 3
14 2001 O 61.28870 4
15 2001 P 84.76561 5
16 2002 A 83.68185 4
17 2002 B 45.01354 1
18 2002 C 62.22964 3
19 2002 D 98.41717 5
20 2002 E 19.91548 1
Maybe there is a much simpler way to implement this...
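If the dplyr package is an option, ntile() gives a much shorter route. Note that ntile() assigns buckets by rank, so ties can land in different buckets than the quantile()-based cut above; treat this as a sketch rather than a drop-in equivalent:
library(dplyr)
df <- df %>%
  group_by(y) %>%
  mutate(qn = ntile(val, 5)) %>%  # 1 = bottom 20%, 5 = top 20%
  ungroup()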
Grouping by two columns
If you want to calculate quintiles based on grouping by two columns, y and grp:
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
grp = c("G1","G1","G1","G1","G1","G2","G2","G2","G2","G2","G3","G3","G3","G3","G3")
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = round(runif(45,10,100))
df = data.frame(y,grp,ind,val)
> head(df,20)
y grp ind val
1 2001 G1 A 40
2 2001 G1 B 33
3 2001 G1 C 65
4 2001 G1 D 99
5 2001 G1 E 18
6 2001 G2 G 36
7 2001 G2 H 15
8 2001 G2 I 17
9 2001 G2 J 42
10 2001 G2 K 67
11 2001 G3 L 60
12 2001 G3 M 34
13 2001 G3 N 61
14 2001 G3 O 76
15 2001 G3 P 15
16 2002 G1 A 18
17 2002 G1 B 15
18 2002 G1 C 44
19 2002 G1 D 79
20 2002 G1 E 22
Then use:
quintile <- function(z) {
  # z comes from apply(), so both elements arrive as character strings;
  # the == comparisons still work because R coerces for the comparison
  x <- df$val[df$y == z[1] & df$grp == z[2]]
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("y","grp")]),1, quintile))
Result:
> head(df,20)
y grp ind val qn
1 2001 G1 A 40 3
2 2001 G1 B 33 2
3 2001 G1 C 65 4
4 2001 G1 D 99 5
5 2001 G1 E 18 1
6 2001 G2 G 36 3
7 2001 G2 H 15 1
8 2001 G2 I 17 2
9 2001 G2 J 42 4
10 2001 G2 K 67 5
11 2001 G3 L 60 3
12 2001 G3 M 34 2
13 2001 G3 N 61 4
14 2001 G3 O 76 5
15 2001 G3 P 15 1
16 2002 G1 A 18 2
17 2002 G1 B 15 1
18 2002 G1 C 44 4
19 2002 G1 D 79 5
20 2002 G1 E 22 3
In this example, y would be the year, grp the industry group, ind the firms, and val the income.
Pay attention to the order of c("y","grp") inside the apply and to the column names inside the quintile function; you'll have to replace them with the column names you want.
Be warned that if your groups are small (in this example, 5 firms per group), the quantile breakpoints may not be unique, and an error will pop up.
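One possible guard, sketched here, is to deduplicate the breakpoints before cutting; the cost is fewer than 5 score levels when ties collapse breaks:
qn <- unique(quantile(x, probs = (0:5) / 5))  # drop duplicated breaks
as.numeric(cut(x, qn, include.lowest = TRUE))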
Using column names from question
quintile <- function(z) {
  x <- df$Income[df$Year == z[1] & df$Industry == z[2]]
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("Year","Industry")]),1, quintile))
Before applying this, the data frame df must be ordered by Year and Industry.
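Alternatively, ave() returns results in their original row positions, so the ordering requirement disappears. A sketch using the question's column names, assuming the data frame is called db10 as in the attempt above:
db10$score <- ave(db10$Income, db10$Year, db10$Industry, FUN = function(x) {
  as.numeric(cut(x, quantile(x, probs = (0:5) / 5), include.lowest = TRUE))
})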

Related

plotting a heatmap with categorical data

Is a heatmap possible with categorical data like the sample below? I want bins on the y axis, year on the x axis, and Firm as the values. Is this possible? If so, how can it be done in Python?
Firm year bins
0 A 1998 binA
1 A 2000 binB
2 A 1999 binA
3 B 1998 binA
4 B 2000 binE
5 B 1999 binA
6 C 1998 binA
7 C 2000 binE
8 C 1999 binA
9 D 1998 binA
10 D 2000 binA
11 D 1999 binB
12 E 1998 binB
13 E 2000 binA
14 E 1999 binB
15 F 1998 binB
16 F 2000 binC
17 F 1999 binH
18 G 1998 binB
19 G 2000 binE
20 G 1999 binF
21 H 1998 binB
22 H 2000 binA
23 H 1999 binF
24 I 1998 binB
25 I 2000 binF
26 I 1999 binF
27 J 1998 binC
28 J 2000 binA
29 J 1999 binF
30 K 1998 binD
31 K 2000 binE
32 K 1999 binA
33 L 1998 binE
34 L 2000 binH
35 L 1999 binC
36 M 1998 binE
37 M 2000 binH
38 M 1999 binH
One solution I tried with seaborn did not work:
import seaborn as sns
df=pd.pivot(df7['Firm'],df7['year'], df7['bins'])
ax = sns.heatmap(df)
R has the following example:
Heatmap of categorical variable counts
Using R and the following code, I have tentatively been able to construct the heatmap for the above example:
library(magrittr)
library(dplyr)
m <- read.csv("~/df55testR.csv", stringsAsFactors = FALSE, header = TRUE)
m <- m %>% select(2:6)
ml <- reshape2::melt(data = m, id.vars = "Firm", variable.name = "year", value.name = "bin")
ml
ml$Test_Gr <- apply(ml[, 2:3], 1, paste0, collapse = "_")
mw <- reshape2::dcast(ml, Firm ~ bin, fun.aggregate = length)
mwm <- as.matrix(mw[, -1])
mwm
mcm <- t(mwm) %*% mwm  # bin-by-bin co-occurrence counts
colnames(mcm) <- colnames(mw)[-1]
rownames(mcm) <- colnames(mw)[-1]
gplots::heatmap.2(mcm, trace = "none", col = rev(heat.colors(15)))
You can try groupby with nunique:
grouped = df.groupby(['year','bins']).nunique()['Firm'].reset_index([0,1])
piv_grouped = grouped.pivot(index='bins', columns='year', values='Firm')
sns.heatmap(piv_grouped, cmap='RdYlGn_r', linewidths=0.5, annot=True)

reordering database by loop in R - help me

I am trying to reorder a data frame with a loop, but it is not working for me. There is too much data to do it one by one.
fact <- rep(1:2, each = 3)
t1 <- c(2006, 2007, 2008, 2000, 2001, 2002)
t2 <- c(2007, 2008, 2009, 2001, 2002, 2004)
var1 <- c(56, 52, 44, 10, 32, 41)
var2 <- c(52, 44, 50, 32, 41, 23)
db1 <- as.data.frame(cbind(fact, t1, t2, var1, var2))
db1
fact t1 t2 var1 var2
1 1 2006 2007 56 52
2 1 2007 2008 52 44
3 1 2008 2009 44 50
4 2 2000 2001 10 32
5 2 2001 2002 32 41
6 2 2002 2004 41 23
I need it to end up like this:
factor <- rep (1:2 , each = 4)
t <- c(2006,2007,2008,2009,2000,2001,2002,2004)
var <- c(56,52,44,50,10,32,41,23)
db2 <- as.data.frame(cbind(factor, t, var))
db2
factor t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 1 2009 50
5 2 2000 10
6 2 2001 32
7 2 2002 41
8 2 2004 23
Many thanks.
dat1 <- as.data.frame(cbind(fact, t1, var1))
names(dat1) <- c("fact", "t", "var")
dat2 <- as.data.frame(cbind(fact, t2, var2))
names(dat2) <- c("fact", "t", "var")
rbind.data.frame(dat1, dat2)
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(dat[,c(1,2,4)], dat[,c(1,3,5)])
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or, as indicated, have a look at the reshape2 package; melt would certainly be of use, e.g.
library(reshape2)
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(melt(dat[, c(1, 2, 4)], id.vars = c("fact", "t"), value.name = "var"),
      melt(dat[, c(1, 3, 5)], id.vars = c("fact", "t"), value.name = "var"))
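Whichever variant you use, the stacked result is not yet in db2's order and contains duplicate rows for the overlapping years. A possible final step, assuming duplicated year rows always carry identical values (as they do here):
res <- rbind(dat[, c(1, 2, 4)], dat[, c(1, 3, 5)])
res <- unique(res[order(res$fact, res$t), ])  # sort by factor and year, drop exact duplicates
res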

filter a df with NA to get only individuals that appear more than one time in r

I am using a national survey to run a regression: the survey is conducted every two years, and some individuals are interviewed repeatedly while others only once.
Now I want to turn the df into a panel (keep only the individuals that appear more than once). The df looks like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual, and nordp is the code number that the individual had in the previous survey; when a new individual is interviewed, the value in nordp is missing (R automatically inserts NA). For example, individual 3 of family 10 has nordp=NA in 2002 because it is the first time she is interviewed, while in 2004 her nordp is 3 (because 3 was her number in 2002).
I can't use nord to filter the df because the composition of the family may change. For example, in 2002 in family x the mother has nordp=2 (meaning that in 2000 her nord was 2) and nord=2, but in the next survey her nord could be 1 (for example, if she gets divorced) while her nordp is still 2.
I tried to filter using this command:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
but I don't get the right df, because if the same family has more than one newly inserted individual (NA), they are treated as the same person, since nordp is NA the first time.
How can I also keep individuals who appear for the first time in a certain year (nordp=NA)? I tried to create a condition using age (the age in t should equal the age in t-2 plus 2; for example, if age is 20 in 2000, it is 22 in 2002), but it didn't work.
Consider that the df consists of thousands of observations, so I can't check manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, only the individuals that appear more than once are kept. Note the individual with nquest=10 and nordp=3: she should appear twice, but with my command she is missed, because in her first year nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of each unique ID. The main idea is to chain together the nordp and nord values within each family over the years. Here's an idea inspired by "Identify groups of linked episodes which chain together". First, load the igraph package via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
  fields <- names(d)  # store original column names
  # give each first-time individual (nordp is NA) a temporary unique code
  d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
  # encode (year, id) pairs as single numbers so they can serve as graph vertices
  d$nordp_x <- (d$year - 2) * 1000 + d$nordp
  d$nord_x <- d$year * 1000 + d$nord
  dd <- d[, c("nordp_x", "nord_x")]
  # each row links an individual's previous-survey code to the current one;
  # connected components of this graph correspond to individuals
  gr.test <- graph.data.frame(dd)
  links <- data.frame(org_id = unique(unlist(dd)),
                      id = clusters(gr.test)$membership)
  d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
  d$uid <- d$nquest * 100 + d$id
  d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001

Merging two different size datasets, copying the same row conditionally from smaller dataset to several rows in larger dataset

I am completely new to R, and I have tried googling for a solution to my problem for some time, but I haven't found an adequate answer so far, so I hope asking for help might solve it here.
I need to merge two data sets of different sizes (one contains annual data, df_f, and the other monthly data, df_m). The smaller df_f should be merged into the larger df_m in such a way that rows of df_f are matched conditionally to rows of df_m.
Here is a descriptive example of my problem (with some very basic reproducible numbers):
first dataset
a <- c(1990)
b <- c(1980:1981)
c <- c(1994:1995)
aa <- rep("A", 1)
bb <- rep("B", 2)
cc <- rep("C", 2)
df1 <- data.frame(comp=factor(c(aa, bb, cc)))
df2 <- data.frame(year=factor(c(a, b, c)))
other.columns <- rep("other_columns", length(df1))
df_f <- cbind(df1, df2, other.columns ) # first dataset
second dataset
z <- c(10:12)
x <- c(7:12)
xx <- c(1:9)
v <- c(2:9)
w <- rep(1990, length(z))
e <- rep(1980, length(x))
ee <- rep (1981, length(xx))
r <- rep(1995, length(v))
t <- rep("A", length(z))
y <- rep("B", length(x) + length(xx))
u <- rep("C", length(v))
df3 <- data.frame(month=factor(c(z, x, xx, v)))
df4 <- data.frame(year=factor(c(w, e, ee, r)))
df5 <- data.frame(comp=factor(c(t, y, u)))
df_m <- cbind(df5, df4, df3) # second dataset
Output:
> df_m
comp year month
1 A 1990 10
2 A 1990 11
3 A 1990 12
4 B 1980 7
5 B 1980 8
6 B 1980 9
7 B 1980 10
8 B 1980 11
9 B 1980 12
10 B 1981 1
11 B 1981 2
12 B 1981 3
13 B 1981 4
14 B 1981 5
15 B 1981 6
16 B 1981 7
17 B 1981 8
18 B 1981 9
19 C 1995 2
20 C 1995 3
21 C 1995 4
22 C 1995 5
23 C 1995 6
24 C 1995 7
25 C 1995 8
26 C 1995 9
> df_f
comp year other.columns
1 A 1990 other_columns
2 B 1980 other_columns
3 B 1981 other_columns
4 C 1994 other_columns
5 C 1995 other_columns
I want the rows from df_f placed into df_m (that is, the data from df_f stored in new columns of df_m) according to conditions on comp, year, and month. Comp (company) always needs to match, but matching the year depends on the month: if month is >6, year is matched directly between the datasets; if month is <7, year + 1 (in df_m) is matched with year (in df_f). Note that a given row of df_f should be placed into several rows of df_m according to these conditions.
The wanted output clarifies the problem and the goal:
Wanted output:
comp year month comp year other.columns
1 A 1990 10 A 1990 other_columns
2 A 1990 11 A 1990 other_columns
3 A 1990 12 A 1990 other_columns
4 B 1980 7 B 1980 other_columns
5 B 1980 8 B 1980 other_columns
6 B 1980 9 B 1980 other_columns
7 B 1980 10 B 1980 other_columns
8 B 1980 11 B 1980 other_columns
9 B 1980 12 B 1980 other_columns
10 B 1981 1 B 1980 other_columns
11 B 1981 2 B 1980 other_columns
12 B 1981 3 B 1980 other_columns
13 B 1981 4 B 1980 other_columns
14 B 1981 5 B 1980 other_columns
15 B 1981 6 B 1980 other_columns
16 B 1981 7 B 1981 other_columns
17 B 1981 8 B 1981 other_columns
18 B 1981 9 B 1981 other_columns
19 C 1995 2 C 1994 other_columns
20 C 1995 3 C 1994 other_columns
21 C 1995 4 C 1994 other_columns
22 C 1995 5 C 1994 other_columns
23 C 1995 6 C 1994 other_columns
24 C 1995 7 C 1995 other_columns
25 C 1995 8 C 1995 other_columns
26 C 1995 9 C 1995 other_columns
Thank you very much in advance! I hope the question is clear enough; it was somewhat difficult to explain.
The basic idea for solving your problem is to add an extra column containing the year that should be used for matching. I will use the package dplyr for this and the other manipulation steps.
Before the tables can be combined, the factor columns must be converted to numeric:
library(dplyr)
df_m <- mutate(df_m, year = as.numeric(as.character(year)),
               month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))
The reason is that you want to be able to do numerical comparison with the month (month > 6) and subtract one from the year. You cannot do this with a factor.
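A quick sketch illustrating the pitfall (with made-up values):
f <- factor(c(1990, 1980))
as.numeric(f)                 # 2 1 -- the internal level codes, not the years
as.numeric(as.character(f))   # 1990 1980 -- the actual values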
Then I add the column to be used for matching:
df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))
And in the last step, I join the two tables:
df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))
The argument by determines which columns of the two data frames should be matched. The output agrees with your result:
## comp year month match_year other.columns
## 1 A 1990 10 1990 other_columns
## 2 A 1990 11 1990 other_columns
## 3 A 1990 12 1990 other_columns
## 4 B 1980 7 1980 other_columns
## 5 B 1980 8 1980 other_columns
## 6 B 1980 9 1980 other_columns
## 7 B 1980 10 1980 other_columns
## 8 B 1980 11 1980 other_columns
## 9 B 1980 12 1980 other_columns
## 10 B 1981 1 1980 other_columns
## 11 B 1981 2 1980 other_columns
## 12 B 1981 3 1980 other_columns
## 13 B 1981 4 1980 other_columns
## 14 B 1981 5 1980 other_columns
## 15 B 1981 6 1980 other_columns
## 16 B 1981 7 1981 other_columns
## 17 B 1981 8 1981 other_columns
## 18 B 1981 9 1981 other_columns
## 19 C 1995 2 1994 other_columns
## 20 C 1995 3 1994 other_columns
## 21 C 1995 4 1994 other_columns
## 22 C 1995 5 1994 other_columns
## 23 C 1995 6 1994 other_columns
## 24 C 1995 7 1995 other_columns
## 25 C 1995 8 1995 other_columns
## 26 C 1995 9 1995 other_columns
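One design note: left_join keeps every row of df_m and fills other.columns with NA whenever no annual row matches. If unmatched monthly rows should be dropped instead, an inner join is a possible alternative:
df_new <- inner_join(df_m, df_f, by = c("comp", "match_year" = "year"))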

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data): the Ctry column indicates the names of the countries in my data frame. If the number of NAs in any column (for example, Carx) is larger than 3, I want to drop the related country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
If you have more columns to check, you could adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            sum(is.na(x$Carx)) <= 3 &&
              sum(is.na(x$Barx)) <= 3 &&
              sum(is.na(x$Tarx)) <= 3
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
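If the number of columns grows beyond a few, here is a more general sketch; it assumes every column except Ctry and year should be checked (Barx and Tarx are hypothetical columns here):
cols <- setdiff(names(DF), c("Ctry", "year"))
res <- by(DF, DF$Ctry, function(x) all(colSums(is.na(x[cols])) <= 3))
validCtry <- names(res)[unlist(res)]
DF[DF$Ctry %in% validCtry, ]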
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
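A sketch extending the same dplyr idea to several columns at once, using if_all() (available in dplyr >= 1.0; Barx and Tarx are the hypothetical extra columns from the edit above):
newdf <- df %>%
  group_by(Ctry) %>%
  filter(if_all(c(Carx, Barx, Tarx), ~ sum(is.na(.x)) <= 3)) %>%
  ungroup()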
