Merging two different size datasets, copying the same row conditionally from smaller dataset to several rows in larger dataset

I am completely new to R. I have searched for a solution to my problem for some time without finding an adequate answer, so I hope asking here will help.
I need to merge two data sets of different sizes: one contains annual data (df_f), the other monthly data (df_m). The smaller df_f should be merged into the larger df_m so that rows of df_f are matched conditionally to rows of df_m.
Here is a descriptive example of my problem (with some very basic reproducible numbers):
first dataset
a <- c(1990)
b <- c(1980:1981)
c <- c(1994:1995)
aa <- rep("A", 1)
bb <- rep("B", 2)
cc <- rep("C", 2)
df1 <- data.frame(comp=factor(c(aa, bb, cc)))
df2 <- data.frame(year=factor(c(a, b, c)))
other.columns <- rep("other_columns", nrow(df1)) # one value per row; length(df1) would count columns
df_f <- cbind(df1, df2, other.columns ) # first dataset
second dataset
z <- c(10:12)
x <- c(7:12)
xx <- c(1:9)
v <- c(2:9)
w <- rep(1990, length(z))
e <- rep(1980, length(x))
ee <- rep (1981, length(xx))
r <- rep(1995, length(v))
t <- rep("A", length(z))
y <- rep("B", length(x) + length(xx))
u <- rep("C", length(v))
df3 <- data.frame(month=factor(c(z, x, xx, v)))
df4 <- data.frame(year=factor(c(w, e, ee, r)))
df5 <- data.frame(comp=factor(c(t, y, u)))
df_m <- cbind(df5, df4, df3) # second dataset
Output:
> df_m
comp year month
1 A 1990 10
2 A 1990 11
3 A 1990 12
4 B 1980 7
5 B 1980 8
6 B 1980 9
7 B 1980 10
8 B 1980 11
9 B 1980 12
10 B 1981 1
11 B 1981 2
12 B 1981 3
13 B 1981 4
14 B 1981 5
15 B 1981 6
16 B 1981 7
17 B 1981 8
18 B 1981 9
19 C 1995 2
20 C 1995 3
21 C 1995 4
22 C 1995 5
23 C 1995 6
24 C 1995 7
25 C 1995 8
26 C 1995 9
> df_f
comp year other.columns
1 A 1990 other_columns
2 B 1980 other_columns
3 B 1981 other_columns
4 C 1994 other_columns
5 C 1995 other_columns
I want to store the data from df_f in new columns of df_m, matching rows on comp, year, and month. Comp (company) must always match, but how the year is matched depends on the month: if month is > 6, year is matched directly between the data sets; if month is < 7, the row in df_m is matched with the previous year's row in df_f (year in df_m equals year + 1 in df_f). Note that a single row of df_f should be copied to several rows of df_m according to these conditions.
The wanted output clarifies the problem and the goal:
Wanted output:
comp year month comp year other.columns
1 A 1990 10 A 1990 other_columns
2 A 1990 11 A 1990 other_columns
3 A 1990 12 A 1990 other_columns
4 B 1980 7 B 1980 other_columns
5 B 1980 8 B 1980 other_columns
6 B 1980 9 B 1980 other_columns
7 B 1980 10 B 1980 other_columns
8 B 1980 11 B 1980 other_columns
9 B 1980 12 B 1980 other_columns
10 B 1981 1 B 1980 other_columns
11 B 1981 2 B 1980 other_columns
12 B 1981 3 B 1980 other_columns
13 B 1981 4 B 1980 other_columns
14 B 1981 5 B 1980 other_columns
15 B 1981 6 B 1980 other_columns
16 B 1981 7 B 1981 other_columns
17 B 1981 8 B 1981 other_columns
18 B 1981 9 B 1981 other_columns
19 C 1995 2 C 1994 other_columns
20 C 1995 3 C 1994 other_columns
21 C 1995 4 C 1994 other_columns
22 C 1995 5 C 1994 other_columns
23 C 1995 6 C 1994 other_columns
24 C 1995 7 C 1995 other_columns
25 C 1995 8 C 1995 other_columns
26 C 1995 9 C 1995 other_columns
Thank you very much in advance! I hope the question is clear enough, it was somewhat difficult to explain it at least.

The basic idea is to add an extra column holding the year that should be used for matching. I will use the package dplyr for this and the other manipulation steps.
Before the tables can be combined, the factor columns that hold numbers must be converted to numeric:
library(dplyr)
df_m <- mutate(df_m, year = as.numeric(as.character(year)),
month = as.numeric(as.character(month)))
df_f <- mutate(df_f, year = as.numeric(as.character(year)))
The reason is that you want to be able to do numerical comparison with the month (month > 6) and subtract one from the year. You cannot do this with a factor.
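To illustrate why the conversion goes through as.character(), here is a minimal sketch using a made-up factor f (not part of the original data). Calling as.numeric() directly on a factor returns the internal level codes rather than the years.

```r
# Hypothetical factor of years, similar in spirit to df_f$year in the question
f <- factor(c(1990, 1980, 1981))

codes <- as.numeric(f)                # internal level codes, not the years
years <- as.numeric(as.character(f))  # the actual year values

print(codes)
print(years)
```

The levels are sorted as 1980, 1981, 1990, so the codes only reflect position in that ordering.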
Then I add the column to be used for matching:
df_m <- mutate(df_m, match_year = ifelse(month >= 7, year, year - 1))
And in the last step, I join the two tables:
df_new <- left_join(df_m, df_f, by = c("comp", "match_year" = "year"))
The argument by determines which columns of the two data frames should be matched. The output agrees with your result:
## comp year month match_year other.columns
## 1 A 1990 10 1990 other_columns
## 2 A 1990 11 1990 other_columns
## 3 A 1990 12 1990 other_columns
## 4 B 1980 7 1980 other_columns
## 5 B 1980 8 1980 other_columns
## 6 B 1980 9 1980 other_columns
## 7 B 1980 10 1980 other_columns
## 8 B 1980 11 1980 other_columns
## 9 B 1980 12 1980 other_columns
## 10 B 1981 1 1980 other_columns
## 11 B 1981 2 1980 other_columns
## 12 B 1981 3 1980 other_columns
## 13 B 1981 4 1980 other_columns
## 14 B 1981 5 1980 other_columns
## 15 B 1981 6 1980 other_columns
## 16 B 1981 7 1981 other_columns
## 17 B 1981 8 1981 other_columns
## 18 B 1981 9 1981 other_columns
## 19 C 1995 2 1994 other_columns
## 20 C 1995 3 1994 other_columns
## 21 C 1995 4 1994 other_columns
## 22 C 1995 5 1994 other_columns
## 23 C 1995 6 1994 other_columns
## 24 C 1995 7 1995 other_columns
## 25 C 1995 8 1995 other_columns
## 26 C 1995 9 1995 other_columns
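For readers who prefer not to load dplyr, the same idea can be sketched with base R's merge(). This is only an illustration on a small stand-in data set (the values b80, b81 and c94 are invented labels), not the answer's own method.

```r
# Small stand-in data (illustrative values, not the question's full tables)
df_m <- data.frame(comp  = c("B", "B", "C"),
                   year  = c(1981, 1981, 1995),
                   month = c(6, 7, 2),
                   stringsAsFactors = FALSE)
df_f <- data.frame(comp = c("B", "B", "C"),
                   year = c(1980, 1981, 1994),
                   other.columns = c("b80", "b81", "c94"),
                   stringsAsFactors = FALSE)

# Months 7-12 match the same year; months 1-6 match the previous year
df_m$match_year <- ifelse(df_m$month >= 7, df_m$year, df_m$year - 1)

# all.x = TRUE keeps every monthly row, like left_join()
df_new <- merge(df_m, df_f,
                by.x = c("comp", "match_year"),
                by.y = c("comp", "year"),
                all.x = TRUE)

# merge() sorts by the key columns, so restore the monthly ordering
df_new <- df_new[order(df_new$comp, df_new$year, df_new$month), ]
print(df_new)
```

Because merge() reorders rows by the key columns, the final order() call is needed if the original monthly ordering matters.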

Related

Annual moving window over a data frame

I have a data frame of discharge data. Below is a reproducible example:
library(lubridate)
Date <- sample(seq(as.Date('1981/01/01'), as.Date('1982/12/31'), by="day"), 24)
Date <- sort(Date, decreasing = F)
Station <- rep(as.character("A"), 24)
Discharge <- rnorm(n = 24, mean = 1, 1)
df <- cbind.data.frame(Station, Date, Discharge)
df$Year <- year(df$Date)
df$Month <- month(df$Date)
df$Day <- day(df$Date)
The output:
> df
Station Date Discharge Year Month Day
1 A 1981-01-23 0.75514968 1981 1 23
2 A 1981-02-17 -0.08552776 1981 2 17
3 A 1981-03-20 1.47586712 1981 3 20
4 A 1981-04-26 3.64823544 1981 4 26
5 A 1981-05-22 1.21880453 1981 5 22
6 A 1981-05-23 2.19482857 1981 5 23
7 A 1981-07-02 -0.13598754 1981 7 2
8 A 1981-07-23 0.12365626 1981 7 23
9 A 1981-07-24 2.12557882 1981 7 24
10 A 1981-09-02 2.79879494 1981 9 2
11 A 1981-09-04 1.67926948 1981 9 4
12 A 1981-11-06 0.49720784 1981 11 6
13 A 1981-12-21 -0.25272271 1981 12 21
14 A 1982-04-08 1.39706157 1982 4 8
15 A 1982-04-19 -0.13965981 1982 4 19
16 A 1982-05-26 0.55238425 1982 5 26
17 A 1982-06-23 3.94639154 1982 6 23
18 A 1982-06-25 -0.03415929 1982 6 25
19 A 1982-07-15 1.00996167 1982 7 15
20 A 1982-09-11 3.18225186 1982 9 11
21 A 1982-10-17 0.30875497 1982 10 17
22 A 1982-10-30 2.26209011 1982 10 30
23 A 1982-11-06 0.34430489 1982 11 6
24 A 1982-11-19 2.28251458 1982 11 19
What I need is a moving-window function in base R. I tried the runner package, but it was not flexible enough. The window (say, of size 3) should take 3 rows at a time and calculate the mean discharge, continuing until the last date of 1981. A new window should then start in 1982 and do the same. How should I approach this?

Using base R only
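w <- 3
df$DischargeM <- sapply(1:nrow(df), function(x) {
  tmp <- NA
  # Only average when a full window exists and all w rows fall in the same year
  if (x >= w) {
    if (length(unique(df$Year[(x - w + 1):x])) == 1) {
      tmp <- mean(df$Discharge[(x - w + 1):x])
    }
  }
  tmp
})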
Station Date Discharge Year Month Day DischargeM
1 A 1981-01-21 2.0009355 1981 1 21 NA
2 A 1981-02-11 0.5948567 1981 2 11 NA
3 A 1981-04-17 0.2637090 1981 4 17 0.95316705
4 A 1981-04-18 3.9180253 1981 4 18 1.59219699
5 A 1981-05-09 -0.2589129 1981 5 9 1.30760712
6 A 1981-07-05 1.1055913 1981 7 5 1.58823456
7 A 1981-07-11 0.7561600 1981 7 11 0.53427946
8 A 1981-07-22 0.0978999 1981 7 22 0.65321706
9 A 1981-08-04 0.5410163 1981 8 4 0.46502541
10 A 1981-08-13 -0.5044425 1981 8 13 0.04482458
11 A 1981-10-06 1.5954315 1981 10 6 0.54400178
12 A 1981-11-08 -0.5757041 1981 11 8 0.17176164
13 A 1981-12-24 1.3892440 1981 12 24 0.80299047
14 A 1982-01-07 1.9363874 1982 1 7 NA
15 A 1982-02-20 1.4340554 1982 2 20 NA
16 A 1982-05-29 0.4536461 1982 5 29 1.27469632
17 A 1982-06-10 2.9776761 1982 6 10 1.62179253
18 A 1982-06-17 1.6371733 1982 6 17 1.68949847
19 A 1982-06-28 1.7585579 1982 6 28 2.12446908
20 A 1982-08-17 0.8297518 1982 8 17 1.40849432
21 A 1982-09-21 1.6853808 1982 9 21 1.42456348
22 A 1982-11-13 0.6066167 1982 11 13 1.04058309
23 A 1982-11-16 1.4989263 1982 11 16 1.26364126
24 A 1982-11-28 0.2273658 1982 11 28 0.77763625
(make sure your df is ordered).
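A vectorised base R variant of the same idea could combine ave() with stats::filter(). The sketch below is an alternative under assumptions, not the answer's code: roll_mean_right and the toy data frame are hypothetical names introduced for illustration, and the rows are assumed to be sorted by date.

```r
# Trailing mean over the last w values; the first w - 1 positions are NA
roll_mean_right <- function(x, w = 3) {
  if (length(x) < w) return(rep(NA_real_, length(x)))  # group too short for any window
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}

# Illustrative data: two years, already sorted by date within each year
toy <- data.frame(Year = c(1981, 1981, 1981, 1981, 1982, 1982, 1982),
                  Discharge = c(1, 2, 3, 4, 5, 6, 7))

# ave() applies the function within each Year, so no window spans two years
toy$DischargeM <- ave(toy$Discharge, toy$Year, FUN = roll_mean_right)
print(toy)
```

Writing stats::filter() with its namespace avoids picking up dplyr::filter() if dplyr happens to be loaded.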

You can do this by using dplyr and the rollmean or rollmeanr function from zoo.
You group the data by year, and apply the rollmeanr in a mutate function.
library(dplyr)
df %>%
group_by(Year) %>%
mutate(avg = zoo::rollmeanr(Discharge, k = 3, fill = NA))
# A tibble: 24 x 7
# Groups: Year [2]
Station Date Discharge Year Month Day avg
<chr> <date> <dbl> <dbl> <dbl> <int> <dbl>
1 A 1981-01-04 1.00 1981 1 4 NA
2 A 1981-03-26 0.0468 1981 3 26 NA
3 A 1981-03-28 0.431 1981 3 28 0.494
4 A 1981-05-04 1.30 1981 5 4 0.593
5 A 1981-08-26 2.06 1981 8 26 1.26
6 A 1981-10-14 1.09 1981 10 14 1.48
7 A 1981-12-10 1.28 1981 12 10 1.48
8 A 1981-12-23 0.668 1981 12 23 1.01
9 A 1982-01-02 -0.333 1982 1 2 NA
10 A 1982-04-13 0.800 1982 4 13 NA
# ... with 14 more rows

Kindly let me know if this is what you were anticipating.
Base R grouping with ave() (rollapply() itself comes from the zoo package):
library(zoo)
result <- transform(df,
  Discharge_mean = ave(Discharge, Year,
    FUN = function(x) rollapply(x, width = 3, mean, align = 'right', fill = NA))
)
dplyr version:
library(dplyr)
result <- df %>%
  group_by(Year) %>%
  mutate(Discharge_mean = rollapply(Discharge, 3, mean, align = 'right', fill = NA))
Output:
> result
Station Date Discharge Year Month Day Discharge_mean
1 A 1981-01-09 0.560448487 1981 1 9 NA
2 A 1981-01-17 0.006777809 1981 1 17 NA
3 A 1981-02-08 2.008959399 1981 2 8 0.8587286
4 A 1981-02-21 1.166452993 1981 2 21 1.0607301
5 A 1981-04-12 3.120080595 1981 4 12 2.0984977
6 A 1981-04-24 2.647325960 1981 4 24 2.3112865
7 A 1981-05-01 0.764980310 1981 5 1 2.1774623
8 A 1981-05-20 2.203700845 1981 5 20 1.8720024
9 A 1981-06-19 0.519390897 1981 6 19 1.1626907
10 A 1981-07-06 1.704146872 1981 7 6 1.4757462
# 14 more rows

Average of a variable by collapsing two columns in r [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
I would like to find the average temperature per season for each year. Each year is observed 4 times; there are two seasons, each appearing twice per year, as shown below:
year=rep(c(1990:1992),each=4)
season=c("W","D","W","D","W","W","D","D","D","W","W","D")
temp=c(28,25,26,21,28,25,20,20,20,35,28,21)
df=data.frame(year,season,temp)
which gives
year season temp
1 1990 W 28
2 1990 D 25
3 1990 W 26
4 1990 D 21
5 1991 W 28
6 1991 W 25
7 1991 D 20
8 1991 D 20
9 1992 D 20
10 1992 W 35
11 1992 W 28
12 1992 D 21
I want to collapse this data to get the average for each of the two seasons in each year, as below:
year season avgtemp
1 1990 D 23.0
2 1990 W 27.0
3 1991 D 20.0
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5
How can I obtain this?

Try below:
aggregate(df[, 3], df[, 1:2], mean)
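The call above names the result column x. As a sketch of a variant that keeps readable column names, the formula interface of aggregate() can be used on the question's data; the renaming to avgtemp is just to mirror the desired output.

```r
# Data from the question
year   <- rep(1990:1992, each = 4)
season <- c("W", "D", "W", "D", "W", "W", "D", "D", "D", "W", "W", "D")
temp   <- c(28, 25, 26, 21, 28, 25, 20, 20, 20, 35, 28, 21)
df <- data.frame(year, season, temp)

# Mean of temp for every year/season combination
avg <- aggregate(temp ~ year + season, data = df, FUN = mean)

# Rename and sort to match the layout requested in the question
names(avg)[names(avg) == "temp"] <- "avgtemp"
avg <- avg[order(avg$year, avg$season), ]
print(avg)
```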

library(tidyverse)
df %>%
group_by(year,season) %>%
summarise(avgtemp=mean(temp))
# A tibble: 6 x 3
# Groups: year [?]
year season avgtemp
<int> <fct> <dbl>
1 1990 D 23
2 1990 W 27
3 1991 D 20
4 1991 W 26.5
5 1992 D 20.5
6 1992 W 31.5

unlist and merge into a single dataframe in r

I have a list of dataframes that I need to be combined into a single one.
year<-1990:2000
v1<-1:11
v2<-20:30
df1<-data.frame(year,v1)
df2<-data.frame(year,v2)
ldf<-list(df1,df2)
I now want to unlist this list of data frames and get
> head(df)
year v1 v2
1 1990 1 20
2 1991 2 21
3 1992 3 22
4 1993 4 23
Note that my question is different from a similar question, whose solution was: df <- ldply(ldf, data.frame)
What I am essentially looking for is a more automatic way of doing this: df <- merge(df1, df2, by = "year")

With a larger number of list elements, a convenient option is reduce() (from purrr) with one of the dplyr join functions:
library(tidyverse)
ldf %>%
reduce(inner_join, by = "year")
# year v1 v2
#1 1990 1 20
#2 1991 2 21
#3 1992 3 22
#4 1993 4 23
#5 1994 5 24
#6 1995 6 25
#7 1996 7 26
#8 1997 8 27
#9 1998 9 28
#10 1999 10 29
#11 2000 11 30

Is there anything wrong with:
df <- merge(ldf[[1]], ldf[[2]], by="year")
Or for a long list:
df1 <- ldf[[1]]
for (x in seq_along(ldf)[-1]) { # unlike 2:length(ldf), this is empty for a one-element list
df1 <- merge(df1, ldf[[x]])
}
# year v1 v2
# 1 1990 1 20
# 2 1991 2 21
# 3 1992 3 22
# 4 1993 4 23
# 5 1994 5 24
# 6 1995 6 25
# 7 1996 7 26
# 8 1997 8 27
# 9 1998 9 28
# 10 1999 10 29
# 11 2000 11 30
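The loop can also be written without explicit indexing using base R's Reduce(), which folds merge() over the list. The sketch below adds a third data frame with a made-up column v3 purely to show that it scales beyond two elements.

```r
year <- 1990:2000
ldf <- list(data.frame(year, v1 = 1:11),
            data.frame(year, v2 = 20:30),
            data.frame(year, v3 = 101:111))  # third frame added for illustration

# Merge the first two frames, then merge the result with the next, and so on
df <- Reduce(function(a, b) merge(a, b, by = "year"), ldf)
print(head(df))
```

Passing by = "year" explicitly avoids accidental matches on other columns that happen to share a name.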

Calculating quintile based scores on R

I have a dataframe with year (2006 to 2010), 4 industry sectors, 150 firm names and the net income of these firms. In total I have 750 observations, one for each firm for each year. I want to give scores to firms for their income within each industry year based on the quintiles. So, firms with income in the top 20% within each industry-year get a score of 5, the next 20% get a score of 4 and so on. The bottom 20% get a score of 1.
The sample data base is:
Year Industry Firm Income
2006 Chemicals ABC 334.50
2007 Chemicals ABC 388.98
.
.
2006 Pharma XYZ 91.45
.
.
How do I do this in R? I have tried aggregate and tapply along with quantile but am not able to arrive at the logic that should be used for this. Please help.
I tried this just to allocate a score of 1 to the lowest 20%, but it returned an error.
db10$score <- ifelse(db10$income < aggregate(income~Year+industry,db10,quantile,c(0.2)),1,0)

Try this method:
First, I'll create sample data on which to test the function below:
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = runif(45,10,100)
df = data.frame(y,ind,val)
head(df,20)
y ind val
1 2001 A 63.32011
2 2001 B 85.67976
3 2001 C 86.77527
4 2001 D 32.18319
5 2001 E 49.86626
6 2001 G 57.73214
7 2001 H 18.08216
8 2001 I 22.31012
9 2001 J 44.11174
10 2001 K 54.76902
11 2001 L 41.82495
12 2001 M 64.84514
13 2001 N 59.16529
14 2001 O 61.28870
15 2001 P 84.76561
16 2002 A 83.68185
17 2002 B 45.01354
18 2002 C 62.22964
19 2002 D 98.41717
20 2002 E 19.91548
There are 3 years, and industries from A to P. The data frame is ordered by year and later by industry.
This function below takes a year value y and calculates the quintile category for all df$val where the year df$y is y
quintile = function(y) {
  x = df$val[df$y == y]
  qn = quantile(x, probs = (0:5)/5) # quintile break points for this year
  as.numeric(cut(x, qn, include.lowest = TRUE)) # bin index: 1 (lowest) to 5 (highest)
}
The only thing left is to apply this function to the unique year values
df$qn = unlist(lapply(unique(df$y), quintile))
Result:
> head(df,20)
y ind val qn
1 2001 A 63.32011 4
2 2001 B 85.67976 5
3 2001 C 86.77527 5
4 2001 D 32.18319 1
5 2001 E 49.86626 2
6 2001 G 57.73214 3
7 2001 H 18.08216 1
8 2001 I 22.31012 1
9 2001 J 44.11174 2
10 2001 K 54.76902 3
11 2001 L 41.82495 2
12 2001 M 64.84514 4
13 2001 N 59.16529 3
14 2001 O 61.28870 4
15 2001 P 84.76561 5
16 2002 A 83.68185 4
17 2002 B 45.01354 1
18 2002 C 62.22964 3
19 2002 D 98.41717 5
20 2002 E 19.91548 1
Maybe there is a much simpler way to implement this...
Grouping by two columns
If you want to calculate quintiles based on the grouping of two columns: y and grp
y = c(rep(2001,15),rep(2002,15),rep(2003,15))
grp = c("G1","G1","G1","G1","G1","G2","G2","G2","G2","G2","G3","G3","G3","G3","G3")
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = round(runif(45,10,100))
df = data.frame(y,grp,ind,val)
> head(df,20)
y grp ind val
1 2001 G1 A 40
2 2001 G1 B 33
3 2001 G1 C 65
4 2001 G1 D 99
5 2001 G1 E 18
6 2001 G2 G 36
7 2001 G2 H 15
8 2001 G2 I 17
9 2001 G2 J 42
10 2001 G2 K 67
11 2001 G3 L 60
12 2001 G3 M 34
13 2001 G3 N 61
14 2001 G3 O 76
15 2001 G3 P 15
16 2002 G1 A 18
17 2002 G1 B 15
18 2002 G1 C 44
19 2002 G1 D 79
20 2002 G1 E 22
Then use:
quintile = function(z) {
  # z is one row of the unique (y, grp) combinations
  x = df$val[df$y == z[1] & df$grp == z[2]]
  qn = quantile(x, probs = (0:5)/5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("y","grp")]),1, quintile))
Result:
> head(df,20)
y grp ind val qn
1 2001 G1 A 40 3
2 2001 G1 B 33 2
3 2001 G1 C 65 4
4 2001 G1 D 99 5
5 2001 G1 E 18 1
6 2001 G2 G 36 3
7 2001 G2 H 15 1
8 2001 G2 I 17 2
9 2001 G2 J 42 4
10 2001 G2 K 67 5
11 2001 G3 L 60 3
12 2001 G3 M 34 2
13 2001 G3 N 61 4
14 2001 G3 O 76 5
15 2001 G3 P 15 1
16 2002 G1 A 18 2
17 2002 G1 B 15 1
18 2002 G1 C 44 4
19 2002 G1 D 79 5
20 2002 G1 E 22 3
In this example, y would be the year, grp the industry group, ind the firm, and val the income.
Pay attention to the order of c("y","grp") inside the apply and the column names inside the quintile function. You'll have to replace them with the column names you want.
Be warned that if your groups are small (in this example 5 firms per group), the quantile breaks may not be unique, in which case cut() raises an error.
Using column names from question
quintile = function(z) {
  # z is one row of the unique (Year, Industry) combinations
  x = df$Income[df$Year == z[1] & df$Industry == z[2]]
  qn = quantile(x, probs = (0:5)/5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}
df$qn = as.vector(apply(unique(df[,c("Year","Industry")]),1, quintile))
Before applying this, the data frame df must be ordered by Year and Industry.
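If sorting the data first is inconvenient, one possible alternative is to let ave() put each score back in the row it came from. This is a sketch under assumptions rather than the answer's method: quintile_score, firms and key are hypothetical names, and the incomes are invented so that the expected scores are easy to check by hand.

```r
# Score 1 (bottom 20%) to 5 (top 20%) within one group of incomes
quintile_score <- function(x) {
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, breaks = qn, include.lowest = TRUE))
}

# Illustrative data, deliberately interleaved so the groups are not contiguous
firms <- data.frame(
  Year     = rep(c(2006, 2007), times = 10),
  Industry = "Chemicals",
  Income   = as.vector(rbind(1:10, seq(100, 10, by = -10))),
  stringsAsFactors = FALSE
)

# One key per industry-year; ave() applies the function within each key
key <- paste(firms$Year, firms$Industry, sep = "_")
firms$score <- ave(firms$Income, key, FUN = quintile_score)
print(firms)
```

The same caveat applies as above: with very small groups or many tied incomes the breaks may not be unique, and cut() will stop with an error.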

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data) in which the Ctry column gives the country name. If the number of NAs in any column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24

A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24

You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
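When there are many columns to check, spelling out one condition per column becomes tedious. One way to generalise, sketched here with a hypothetical vector cols and made-up Barx values, is to split the data by country and test all listed columns at once.

```r
# Illustrative data: Carx follows the question, Barx is invented for the example
DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,  NA, NA, NA, NA, 18, 16,  NA, NA, 24, 21, NA, 24),
  Barx = c(1, 2, 3, 4, 5, 6,        1, 2, 3, 4, 5, 6,        NA, NA, NA, NA, 5, 6),
  stringsAsFactors = FALSE
)

cols <- c("Carx", "Barx")  # columns to check (an assumption; adapt to your data)

# TRUE for a country if any checked column has more than 3 NAs
too_many <- sapply(split(DF[cols], DF$Ctry),
                   function(d) any(colSums(is.na(d)) > 3))

kept <- DF[!DF$Ctry %in% names(too_many)[too_many], ]
print(kept)
```

In this toy data, B is dropped because of Carx and C because of Barx, so only country A remains.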

Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see whether it is faster than the base R solutions. If the other solutions are fast enough, just ignore this one.
library(dplyr)
# %>% replaces the old %.% operator, which is no longer available in dplyr
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
