Need to rank a dataset based on 3 columns in R [duplicate]

This question already has an answer here:
generate sequence within group in R [duplicate]
(1 answer)
Closed 6 years ago.
I have a dataset that was ordered using the order() function in R; it is shown below:
A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92
I need to print a rank based on column A; the expected output is shown below:
A B C Rank
1 1 85 1
1 1 62 2
1 0 92 3
2 1 80 1
2 0 92 2
2 0 84 3
3 1 65 1
3 0 92 2
Any help from R experts would be appreciated.

A simple base R solution using ave and seq_along is
df$Rank <- ave(df$B, df$A, FUN = seq_along)  # B is only a placeholder; seq_along uses its length, not its values
which returns
df
A B C Rank
1 1 1 85 1
2 1 1 62 2
3 1 0 92 3
4 2 1 80 1
5 2 0 92 2
6 2 0 84 3
7 3 1 65 1
8 3 0 92 2
seq_along returns the vector 1, 2, 3, ..., with the same length as its argument. ave applies a function within groups, which here are determined by the variable A.
data
df <- read.table(header=TRUE, text="A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92")


cumsum by participant and reset on 0 in R [duplicate]

This question already has answers here:
R cumulative sum by condition with reset
(3 answers)
Cumulative sum that resets when 0 is encountered
(4 answers)
Closed 1 year ago.
I have a data frame that looks like the one below. I need to count the number of correct trials by participant, and reset the counter whenever a 0 is encountered.
Participant TrialNumber Correct
118 1 1
118 2 1
118 3 1
118 4 1
118 5 1
118 6 1
118 7 1
118 8 0
118 9 1
118 10 1
120 1 1
120 2 1
120 3 1
120 4 1
120 5 0
120 6 1
120 7 0
120 8 1
120 9 1
120 10 1
I've tried using splitstackshape:
df$Count <- getanID(cbind(df$Participant, cumsum(df$Correct)))[,.id]
But that only bumps the counter on the rows where Correct is 0; it doesn't give a running count by participant:
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 1
118 3 1 1
118 4 1 1
118 5 1 1
118 6 1 1
118 7 1 1
118 8 0 2
118 9 1 1
118 10 1 1
120 1 1 1
120 2 1 1
120 3 1 1
120 4 1 1
120 5 0 2
120 6 1 1
120 7 0 2
120 8 1 1
120 9 1 1
120 10 1 1
I then tried using dplyr:
df %>%
  group_by(Participant) %>%
  mutate(Count = cumsum(Correct)) %>%
  ungroup() %>%
  as.data.frame()
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 2
118 3 1 3
118 4 1 4
118 5 1 5
118 6 1 6
118 7 1 7
118 8 0 7
118 9 1 8
118 10 1 9
120 1 1 1
120 2 1 2
120 3 1 3
120 4 1 4
120 5 0 4
120 6 1 5
120 7 0 5
120 8 1 6
120 9 1 7
120 10 1 8
This gets me closer, but still doesn't reset the counter when it gets to 0. Any suggestions would be greatly appreciated, thank you.
Does this work?
library(dplyr)
library(data.table)
df %>%
  mutate(grp = rleid(Correct)) %>%   # each run of identical Correct values gets its own id
  group_by(Participant, grp) %>%
  mutate(Count = cumsum(Correct)) %>%
  select(-grp)  # grp still appears in the output below because it is a grouping variable
# A tibble: 10 x 4
# Groups: Participant, grp [6]
grp Participant Correct Count
<int> <chr> <dbl> <dbl>
1 1 A 1 1
2 1 A 1 2
3 1 A 1 3
4 2 A 0 0
5 3 A 1 1
6 3 B 1 1
7 3 B 1 2
8 4 B 0 0
9 5 B 1 1
10 5 B 1 2
Toy data:
df <- data.frame(
  Participant = c(rep("A", 5), rep("B", 5)),
  Correct = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
)
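A base R alternative (a sketch; like rleid above, cumsum(df$Correct == 0) starts a new block at every 0, so the count resets there and is 0 on the reset rows, matching the output above):
df$Count <- ave(df$Correct, df$Participant, cumsum(df$Correct == 0), FUN = cumsum)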

How to swap/shuffle values within a column in R?

I want to anonymize data using cell swapping. Therefore I want to conditionally swap values within a column.
My data looks like:
Sex Age Houeshold_size
0 95 2
0 95 3
1 90 1
1 90 5
1 45 1
1 45 1
1 34 1
1 34 1
1 34 1
1 34 1
I want to swap values so that everyone above a certain age, in this case 90 or older, has a household size of 1. So my outcome has to look like:
Sex Age Houeshold_size
0 95 1
0 95 1
1 90 1
1 90 1
1 45 1
1 45 1
1 34 2
1 34 3
1 34 5
1 34 1
What I really want to know is how to conditionally swap data in general rather than just solve this example, since it's only a fraction of my data.
Thanks for helping me out, cheers.
You can use the following:
#Get the index where Age is 90 or higher
inds <- which(df$Age >= 90)
#replace `Houeshold_size` where age is less than 90 with that of inds
df$Houeshold_size[sample(which(df$Age < 90), length(inds))] <- df$Houeshold_size[inds]
#Change household size of inds to 1
df$Houeshold_size[inds] <- 1
# Sex Age Houeshold_size
#1 0 95 1
#2 0 95 1
#3 1 90 1
#4 1 90 1
#5 1 45 1
#6 1 45 3
#7 1 34 2
#8 1 34 1
#9 1 34 1
#10 1 34 5
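Note that sample() picks the receiving rows at random, so the swapped values will differ from run to run (the output above is one possible result). For reproducible output, set the seed before the swap:
set.seed(123)  # any fixed value gives a reproducible swap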

Replace NA based on a condition

I am running an econometric analysis in RStudio and have run into a problem.
My dataset consists of 1408 observations (704 of type 1 and 704 of type 2) and 49 variables.
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
... ... ... ... ... ...
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1
... ... ... ... ... ...
You can see that the NA values are correlated with the type of the agent (they occur when the agent is type 1). What I'd like to do is this: if a type 1 agent has the same matching group, group, and period as a type 2 agent, then replace its NA with that type 2 agent's value (row by row).
Expected output
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 1
0 2 73 1 1 0
1 1 77 2 1 0
1 2 77 2 1 1
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1
Here is a solution with data.table:
library("data.table")
dt <- fread(header=TRUE,
'Gender Period Matching.group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1')
d2 <- dt[Type != 1, Overcharging, .(Group, Period)]
rbind(dt[Type == 1][d2, on = .(Group, Period), Overcharging := i.Overcharging], dt[Type != 1])
# Gender Period Matching.group Group Type Overcharging
# 1: 1 1 73 1 1 1
# 2: 0 2 73 1 1 0
# 3: 1 1 77 2 1 0
# 4: 1 2 77 2 1 1
# 5: 0 1 73 1 2 1
# 6: 0 2 73 1 2 0
# 7: 1 1 77 2 2 0
# 8: 1 2 77 2 2 1
Alternatively, in your special case, you can simply do:
dt[Type==1, Overcharging := dt[Type!=1, Overcharging]]
(provided the rows with Type!=1 appear in the same Group and Period order as the rows with Type==1)
We can use functions from dplyr and tidyr (both part of the tidyverse) for this task. The fill function from tidyr imputes missing values from the previous or the next row, so the idea is to arrange the data frame first and then use fill to impute all the NAs in the Overcharging column.
library(tidyverse)
dt <- read.csv(text = "Gender,Period,Matching.group,Group,Type,Overcharging
1,1,73,1,1,NA
0,2,73,1,1,NA
1,1,77,2,1,NA
1,2,77,2,1,NA
0,1,73,1,2,1
0,2,73,1,2,0
1,1,77,2,2,0
1,2,77,2,2,1",
stringsAsFactors = FALSE)
dt2 <- dt %>%
  mutate(ID = 1:n()) %>%                           # create an ID column starting at 1
  arrange(Period, Matching.group, Group, Type) %>% # sort so each type 2 row directly follows its type 1 partner
  fill(Overcharging, .direction = "up") %>%        # fill NAs upward from the next row
  arrange(ID) %>%                                  # restore the original row order
  select(-ID)                                      # drop the helper column
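If every matching group/group/period combination contains exactly one type 2 row, as in the toy data here, a grouped mutate is another way to express the same idea (a sketch; it overwrites the whole column with the group's type 2 value, which leaves the type 2 rows unchanged):
library(dplyr)
dt %>%
  group_by(Matching.group, Group, Period) %>%
  mutate(Overcharging = Overcharging[Type == 2][1]) %>%  # take the type 2 value within each group
  ungroup()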

How to cast to multi-column in R, pandas-style?

I searched a lot but didn't find anything relevant.
What I want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multi-indexed columns and multi-indexed rows. Multi-indexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
library(reshape2)  # provides dcast, which the original attempt uses without loading it
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = round(sum(NUMBER / EXPOSURE * 1000)))
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var = "counts")
The problem is that the column index is "melted" into a single level instead of the multi-indexed columns I know from Python/pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example of how it would work in pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd

df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried it with ftable, but it only prints ones in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
ftable accepts a list of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first before passing it to ftable; that should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because R uses round while the Python code truncates (.astype(int)).
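If you need the R output to match pandas exactly, truncate instead of rounding in the summarise step (a sketch; only that line changes, and trunc() cuts toward zero, which is what .astype(int) does for positive values):
cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = trunc(sum(NUMBER / EXPOSURE * 1000)))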

R Removing duplicate entries in dataframe and keeping rows with fewer NAs and zeroes

I would like to deduplicate a data.frame that I generate in another part of my codebase, without being able to know the order of its columns and rows in advance. The data.frame has some columns I want to compare for duplication, here A and B, but among the duplicates I would like to keep the rows that contain fewer NAs and zeros in the other columns, here C, D and E.
tc <- 'Id B A C D E
1 62 12 0 NA NA
2 12 62 1 1 1
3 2 62 1 1 1
4 62 12 1 1 1
5 55 23 0 0 0'
df <- read.table(textConnection(tc), header = TRUE)
I can use duplicated, but since I cannot control the order of the columns and rows in the incoming data, I need a way to get the unique rows with fewer NAs and zeros. This works in the example, but won't if the incoming data.frame has a different order:
df[!duplicated(data.frame(A = df$A, B = df$B), fromLast = TRUE), ]
Id B A C D E
2 2 12 62 1 1 1
3 3 2 62 1 1 1
4 4 62 12 1 1 1
5 5 55 23 0 0 0
Any ideas?
Here's an approach based on counting valid values and reordering the data frame.
First, count the NAs and 0s in the columns C, D, and E.
rs <- rowSums(is.na(df[c("C", "D", "E")]) | !df[c("C", "D", "E")])
# [1] 3 0 0 0 3
Second, order the data frame by A, B, and the new variable:
df_ordered <- df[order(df$A, df$B, rs), ]
# Id B A C D E
# 4 4 62 12 1 1 1
# 1 1 62 12 0 NA NA
# 5 5 55 23 0 0 0
# 3 3 2 62 1 1 1
# 2 2 12 62 1 1 1
Now, you can remove duplicated rows and keep the row with the highest number of valid values.
df_ordered[!duplicated(df_ordered[c("A", "B")]), ]
# Id B A C D E
# 2 2 12 62 1 1 1
# 3 3 2 62 1 1 1
# 4 4 62 12 1 1 1
# 5 5 55 23 0 0 0
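For completeness, a dplyr version of the same idea (a sketch, assuming dplyr >= 1.1 for pick(); valid counts the non-NA, non-zero entries in C, D and E):
library(dplyr)
df %>%
  mutate(valid = rowSums(!is.na(pick(C, D, E)) & pick(C, D, E) != 0)) %>%
  arrange(A, B, desc(valid)) %>%        # best row first within each A/B pair
  distinct(A, B, .keep_all = TRUE) %>%  # keep that row, drop the other duplicates
  select(-valid)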
