Replace NA with condition - R

I am currently running an econometric analysis and have encountered a problem. I am using RStudio.
My dataset is composed of 1408 observations (704 of type 1 and 704 of type 2) and 49 variables.
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
... ... ... ... ... ...
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1
... ... ... ... ... ...
You can see that the NA values coincide with the agent's type (they occur only when the agent is type 1). What I'd like to do is: if a type 1 agent has the same matching group, group and period as a type 2 agent, replace its NA with that type 2 agent's value (for each row).
Expected output
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 1
0 2 73 1 1 0
1 1 77 2 1 0
1 2 77 2 1 1
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1

Here is a solution with data.table:
library("data.table")
dt <- fread(header=TRUE,
'Gender Period Matching.group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1')
# Extract the Type 2 values, keyed by Group and Period
d2 <- dt[Type != 1, .(Group, Period, Overcharging)]
# Update the Type 1 rows by joining on Group and Period, then stack the two types back together
rbind(dt[Type == 1][d2, on = .(Group, Period), Overcharging := i.Overcharging], dt[Type != 1])
# Gender Period Matching.group Group Type Overcharging
# 1: 1 1 73 1 1 1
# 2: 0 2 73 1 1 0
# 3: 1 1 77 2 1 0
# 4: 1 2 77 2 1 1
# 5: 0 1 73 1 2 1
# 6: 0 2 73 1 2 0
# 7: 1 1 77 2 2 0
# 8: 1 2 77 2 2 1
Alternatively, in your special case you can simply do:
dt[Type == 1, Overcharging := dt[Type != 1, Overcharging]]
(this only works if the Type == 1 rows appear in the same Group and Period order as the Type != 1 rows)
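For completeness, here is a base R sketch of the same join logic on a plain data.frame (Overcharging2 is just a temporary helper column; this assumes exactly one Type 2 row per Group/Period pair):
df <- as.data.frame(dt)
# look-up table of the Type 2 values
lookup <- df[df$Type == 2, c("Group", "Period", "Overcharging")]
names(lookup)[3] <- "Overcharging2"
# merge the Type 2 value onto every row, then use it where Overcharging is NA
df <- merge(df, lookup, by = c("Group", "Period"), sort = FALSE)
df$Overcharging <- ifelse(is.na(df$Overcharging), df$Overcharging2, df$Overcharging)
df$Overcharging2 <- NULL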

We can use functions from dplyr and tidyr (both part of the tidyverse) for such a task. The fill function from tidyr imputes missing values based on the previous or the next row. So the idea is to arrange the data frame first and then use fill to impute all the NAs in the Overcharging column.
library(tidyverse)
dt <- read.csv(text = "Gender,Period,Matching.group,Group,Type,Overcharging
1,1,73,1,1,NA
0,2,73,1,1,NA
1,1,77,2,1,NA
1,2,77,2,1,NA
0,1,73,1,2,1
0,2,73,1,2,0
1,1,77,2,2,0
1,2,77,2,2,1",
stringsAsFactors = FALSE)
dt2 <- dt %>%
  mutate(ID = 1:n()) %>%                           # remember the original row order
  arrange(Period, Matching.group, Group, Type) %>% # put each Type 1 row directly above its Type 2 match
  fill(Overcharging, .direction = "up") %>%        # fill NAs upward, i.e. from the Type 2 row below
  arrange(ID) %>%                                  # restore the original row order
  select(-ID)                                      # drop the helper column
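If you would rather not depend on row order at all, a grouped variant is possible (a sketch; fill respects dplyr groups, and .direction = "updown" needs tidyr >= 1.0):
dt2 <- dt %>%
  group_by(Period, Matching.group, Group) %>%   # one Type 1 and one Type 2 row per group
  fill(Overcharging, .direction = "updown") %>% # fill within each group, from whichever row has the value
  ungroup()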


Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont, computed within each ID group, as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
treatment_cont is the same as treatment, but we want to keep the same treatment regime only when day_diff, the difference in days between treatments, is lower than 30.
I have tried many ways on dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
A conditional mutate using case_when and lag might work:
df %>%
  mutate(treatment_cont = case_when(day_diff < 30 ~ treatment,
                                    TRUE ~ lag(treatment)))
You are probably looking for lag (and perhaps its brother, lead). Note that lag must refer to the existing treatment column, since treatment_cont does not exist yet:
library(dplyr)
library(tidyr) # for replace_na()

df %>%
  replace_na(list(day_diff = 0)) %>% # treat each ID's first observation as day_diff 0
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    treatment_cont = ifelse(day_diff < 30,
                            lag(treatment, default = treatment[1]),
                            treatment)
  ) %>%
  ungroup() %>%
  arrange(ID, day)
# A tibble: 10 x 5
#       ID   day day_diff treatment treatment_cont
#    <int> <int>    <dbl>     <int>          <int>
#  1     1     0        0         1              1
#  2     1    14       14         1              1
#  3     1    20        6         2              1
#  4     1    73       53         1              1
#  5     2     0        0         1              1
#  6     2    33       33         1              1
#  7     2    90       57         2              2
#  8     2   112       22         3              2
#  9     2   152       40         1              1
# 10     2   178       26         4              1
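To see what lag and lead do on their own, a quick illustration:
x <- c(10, 20, 30)
dplyr::lag(x)                 # NA 10 20 -- previous value
dplyr::lag(x, default = x[1]) # 10 10 20 -- first value instead of NA
dplyr::lead(x)                # 20 30 NA -- next value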

Add column with numbers based on a second column

Here is my data.frame:
df = read.table(text = 'Day ID Event
100 1 1
100 1 1
99 1 1
97 1 1
87 2 1
86 2 1
85 2 1
965 1 2
964 1 2
960 1 2
959 1 2
709 2 2
708 2 2
12 3 2
9 3 2', header = TRUE)
What I would like to do is create a new column that, for each combination of ID and Event, assigns each observation a number based on its Day value relative to the smallest Day in that group.
My desired output would be:
Day ID Event Count
100 1 1 4
100 1 1 4
99 1 1 3
97 1 1 1
87 2 1 3
86 2 1 2
85 2 1 1
965 1 2 7
964 1 2 6
960 1 2 2
959 1 2 1
709 2 2 2
708 2 2 1
12 3 2 4
9 3 2 1
E.g., if you look at the first 'block' above: Day 97 = 1, Day 98 = 2, Day 99 = 3 and Day 100 = 4. We are missing Day 98, but we still need to include it in the count.
I tried the following but the output is not the one I need:
df$Count <- ave(df$Day, df$Event, df$ID, FUN = seq_along)
Thanks for your help
We can try
library(dplyr)
df %>%
  group_by(ID, Event) %>%
  mutate(Count = 1 + (Day - Day[n()])) # Day[n()] is the last, i.e. smallest, Day of each group
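Since you already tried ave, a base R equivalent is possible too (a sketch; it uses min instead of the last element, so it does not rely on the rows being sorted):
df$Count <- ave(df$Day, df$ID, df$Event, FUN = function(x) 1 + x - min(x))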

How to cast to multi-column in R, pandas-style?

I searched a lot but didn't find anything relevant.
What I want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multi-indexed columns and multi-indexed rows. Multi-indexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
library(reshape2) # for dcast
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts=round(sum(NUMBER/EXPOSURE*1000)))
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var="counts")
Now the problem is that the column index is "melted" into a single level instead of the multi-indexed columns I know from Python/pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example of how it would work in pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd

df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What nearly works:
I tried to do it with ftable, but it only prints 1s in the matrix instead of the values of counts:
ftable(cp_gb, col.vars = c("CAR_TYPE", "JOB"), row.vars = c("SEX", "REGION"))
ftable accepts lists of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first before passing it to ftable; that should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you use round in R but truncation (.astype(int)) in Python.
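A quick illustration of that difference, using the SEX = 1, REGION = 1, CAR_TYPE = 1, JOB = 2 cell:
x <- 8 / 154 * 1000 # 51.948...
round(x)            # 52, as in the R output
as.integer(x)       # 51, truncated, as in the pandas output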

Need to rank a dataset based on 3 columns in R [duplicate]

This question already has an answer here:
generate sequence within group in R [duplicate]
(1 answer)
Closed 6 years ago.
I have a dataset that was ordered using the order() function in R, shown below:
A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92
I have to assign a rank within each value of column A; the expected output is shown below:
A B C Rank
1 1 85 1
1 1 62 2
1 0 92 3
2 1 80 1
2 0 92 2
2 0 84 3
3 1 65 1
3 0 92 2
Any R expertise would be appreciated.
A simple base R solution using ave and seq_along is
df$Rank <- ave(df$B, df$A, FUN=seq_along)
which returns
df
A B C Rank
1 1 1 85 1
2 1 1 62 2
3 1 0 92 3
4 2 1 80 1
5 2 0 92 2
6 2 0 84 3
7 3 1 65 1
8 3 0 92 2
seq_along returns the vector 1, 2, 3, ... with the length of its argument. ave applies a function within groups, which here are determined by the variable A.
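For comparison, a dplyr sketch of the same per-group ranking:
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(Rank = row_number()) %>% # 1, 2, 3, ... within each A
  ungroup()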
data
df <- read.table(header=TRUE, text="A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92")

R Removing duplicate entries in dataframe and keeping rows with fewer NAs and zeroes

I would like to deduplicate a data.frame generated from another part of my codebase, without being able to rely on the order of its columns and rows. The data.frame has some columns I want to compare for duplication, here A and B, but among the duplicated rows I would like to keep the ones that contain fewer NAs and zeros in the other columns, here C, D and E.
tc <- 'Id B A C D E
1 62 12 0 NA NA
2 12 62 1 1 1
3 2 62 1 1 1
4 62 12 1 1 1
5 55 23 0 0 0'
df <- read.table(textConnection(tc), header = TRUE)
I can use duplicated, but since I cannot control the order of the columns and rows of the incoming data, I need a way to get the unique values with fewer NAs and zeros. The following works in the example, but won't if the incoming data.frame has a different order:
df[!duplicated(data.frame(A = df$A, B = df$B), fromLast = TRUE), ]
Id B A C D E
2 2 12 62 1 1 1
3 3 2 62 1 1 1
4 4 62 12 1 1 1
5 5 55 23 0 0 0
Any ideas?
Here's an approach based on counting the NAs and zeros per row and reordering the data frame.
First, count the NAs and 0s in the columns C, D, and E:
rs <- rowSums(is.na(df[c("C", "D", "E")]) | !df[c("C", "D", "E")])
# [1] 3 0 0 0 3
Second, order the data frame by A, B, and the new variable:
df_ordered <- df[order(df$A, df$B, rs), ]
# Id B A C D E
# 4 4 62 12 1 1 1
# 1 1 62 12 0 NA NA
# 5 5 55 23 0 0 0
# 3 3 2 62 1 1 1
# 2 2 12 62 1 1 1
Now you can remove the duplicated rows; since the best row of each A/B pair comes first, the row with the most valid values is kept.
df_ordered[!duplicated(df_ordered[c("A", "B")]), ]
# Id B A C D E
# 2 2 12 62 1 1 1
# 3 3 2 62 1 1 1
# 4 4 62 12 1 1 1
# 5 5 55 23 0 0 0
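For reference, the same idea as a dplyr sketch (pick() needs dplyr >= 1.1; n_bad is just a temporary helper column):
library(dplyr)
df %>%
  mutate(n_bad = rowSums(is.na(pick(C, D, E)) | pick(C, D, E) == 0, na.rm = TRUE)) %>%
  arrange(A, B, n_bad) %>%             # best row of each A/B pair first
  distinct(A, B, .keep_all = TRUE) %>% # keep that first row
  select(-n_bad)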
