I want to anonymize data using cell swapping. To do so, I want to conditionally swap values within a column.
My data looks like:
Sex Age Houeshold_size
0 95 2
0 95 3
1 90 1
1 90 5
1 45 1
1 45 1
1 34 1
1 34 1
1 34 1
1 34 1
I want to swap values so that everyone at or above a certain age, in this case 90 or older, has a household size of 1. So my outcome has to look like:
Sex Age Houeshold_size
0 95 1
0 95 1
1 90 1
1 90 1
1 45 1
1 45 1
1 34 2
1 34 3
1 34 5
1 34 1
I'm more interested in knowing how to conditionally swap data in general than in solving this specific example, since it's just a fraction of my data.
Thanks for helping me out, cheers.
You can use the following (`sample` picks random rows, so call `set.seed` first if you need a reproducible result):
# Get the indices of rows where Age is 90 or higher
inds <- which(df$Age >= 90)
# Replace the `Houeshold_size` of an equal number of randomly sampled rows
# with Age below 90 with the values from those older rows
df$Houeshold_size[sample(which(df$Age < 90), length(inds))] <- df$Houeshold_size[inds]
# Set the household size of the older rows to 1
df$Houeshold_size[inds] <- 1
# Sex Age Houeshold_size
#1 0 95 1
#2 0 95 1
#3 1 90 1
#4 1 90 1
#5 1 45 1
#6 1 45 3
#7 1 34 2
#8 1 34 1
#9 1 34 1
#10 1 34 5
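The snippet above moves values one way: the sampled younger rows' original sizes are overwritten, which happens to be harmless here because they are all 1. Since you asked about the general technique, here is a minimal sketch of a true two-way swap; swap_within is a hypothetical helper name, not from any package:

# Hypothetical helper: swap the values of rows matching `cond` with the
# values of randomly chosen rows that don't match it (two-way, lossless)
swap_within <- function(x, cond) {
  from <- which(cond)
  to <- sample(which(!cond), length(from)) # assumes enough non-matching rows
  tmp <- x[to]
  x[to] <- x[from]  # partner rows receive the flagged rows' values
  x[from] <- tmp    # flagged rows receive the partner rows' values
  x
}
# Usage: swap household sizes of everyone aged 90+ with random younger rows
df$Houeshold_size <- swap_within(df$Houeshold_size, df$Age >= 90)

Note that this guarantees household size 1 for the 90+ rows only when the sampled partners already have size 1, as in your example.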
I'm trying to summarize the counts of one variable by grouping on another, so that the total count is attached to each row of the grouped variable.
I want to sum the "emp" column by grouping on fam_id, so that total_employed reflects the number of employed family members for every row with the same fam_id.
acs_5years
fam_id emp ins age
33 1 1 45
33 0 1 23
44 1 1 19
44 1 0 26
44 1 0 54
44 1 0 50
77 1 1 33
77 1 1 38
77 1 1 44
88 1 0 65
88 0 0 90
The result should look like:
fam_id emp ins age total_employed
33 1 1 45 1
33 0 1 23 1
44 1 1 19 4
44 1 0 26 4
44 1 0 54 4
44 1 0 50 4
77 1 1 33 3
77 1 1 38 3
77 1 1 44 3
88 1 0 65 1
88 0 0 90 1
I've tried the following code:
sample_grouping <- acs_5years %>%
  group_by(fam_id) %>%
  summarize(total_count = n(), .groups = 'drop') %>%
  as.data.frame()
sample_grouping
#######
sample_2 <- acs_5years %>%
  group_by(fam_id) %>%
  summarize(total_count = (emp))
sample_2
I'm not sure I'm getting correct results.
Any help or suggestions would be greatly appreciated, thanks in advance!
The emp values for fam_id 44 differ between my copy of the data and yours (hence total_employed is 3 rather than 4 below), and your code doesn't quite match your data, but you may try:
library(dplyr)

df %>%
  group_by(fam_id) %>%
  mutate(total_employed = sum(emp))
fam_id emp ins age total_employed
<int> <int> <int> <int> <int>
1 33 1 1 45 1
2 33 0 1 23 1
3 44 1 1 19 3
4 44 1 0 26 3
5 44 1 0 54 3
6 44 0 0 50 3
7 77 1 1 33 3
8 77 1 1 38 3
9 77 1 1 44 3
10 88 1 0 65 1
11 88 0 0 90 1
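The key difference from your attempt is that summarize() collapses each group to a single row, while mutate() keeps every row and attaches the group total to each of them. If you prefer the summarize() route, a sketch of computing the totals and joining them back on:

library(dplyr)
# One row per family with its total, then joined back onto every row
totals <- df %>%
  group_by(fam_id) %>%
  summarize(total_employed = sum(emp), .groups = "drop")
df %>% left_join(totals, by = "fam_id")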
I am running an econometrics analysis and have encountered a problem. I am using RStudio.
My dataset is composed of 1408 observations (704 of type 1 and 704 of type 2) and 49 variables.
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
... ... ... ... ... ...
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1
... ... ... ... ... ...
You can see that the NA values are correlated with the type of the agent (they occur when the agent is type 1). What I'd like to do is: if an agent of type 1 belongs to the same matching group, group, and period as an agent of type 2, replace the NA with that type 2 agent's value (for each row).
Expected output
Gender Period Matching group Group Type Overcharging
1 1 73 1 1 1
0 2 73 1 1 0
1 1 77 2 1 0
1 2 77 2 1 1
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1
Here is a solution with data.table:
library("data.table")
dt <- fread(header=TRUE,
'Gender Period Matching.group Group Type Overcharging
1 1 73 1 1 NA
0 2 73 1 1 NA
1 1 77 2 1 NA
1 2 77 2 1 NA
0 1 73 1 2 1
0 2 73 1 2 0
1 1 77 2 2 0
1 2 77 2 2 1')
# Extract the Overcharging values of the type 2 rows, keyed by Group and Period
d2 <- dt[Type != 1, Overcharging, by = .(Group, Period)]
# Update the type 1 rows by joining on Group and Period, then re-append the type 2 rows
rbind(dt[Type == 1][d2, on = .(Group, Period), Overcharging := i.Overcharging],
      dt[Type != 1])
# Gender Period Matching.group Group Type Overcharging
# 1: 1 1 73 1 1 1
# 2: 0 2 73 1 1 0
# 3: 1 1 77 2 1 0
# 4: 1 2 77 2 1 1
# 5: 0 1 73 1 2 1
# 6: 0 2 73 1 2 0
# 7: 1 1 77 2 2 0
# 8: 1 2 77 2 2 1
Alternatively, in your special case you can simply do:
dt[Type==1, Overcharging:=dt[Type!=1, Overcharging]]
(if the order of Group and Period for Type!=1 is the same as for Type==1)
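Your question also mentions the matching group; the sample data happens not to need it, but if your full data does, a sketch with Matching.group added to the join key:

# Type 2 values keyed by matching group, group and period
d2 <- dt[Type != 1, Overcharging, by = .(Matching.group, Group, Period)]
# Update the type 1 rows from d2, then re-append the type 2 rows
rbind(dt[Type == 1][d2, on = .(Matching.group, Group, Period),
                    Overcharging := i.Overcharging],
      dt[Type != 1])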
We can use functions from dplyr and tidyr (from the tidyverse) for such a task. The fill function from tidyr can impute missing values based on the previous or the next row. So the idea is to arrange the data frame first and then use fill to impute all the NAs in the Overcharging column.
library(tidyverse)
dt <- read.csv(text = "Gender,Period,Matching.group,Group,Type,Overcharging
1,1,73,1,1,NA
0,2,73,1,1,NA
1,1,77,2,1,NA
1,2,77,2,1,NA
0,1,73,1,2,1
0,2,73,1,2,0
1,1,77,2,2,0
1,2,77,2,2,1",
stringsAsFactors = FALSE)
dt2 <- dt %>%
  mutate(ID = 1:n()) %>%                           # Create an ID column starting from 1
  arrange(Period, Matching.group, Group, Type) %>% # Sort so each type 1 row sits right above its type 2 counterpart
  fill(Overcharging, .direction = "up") %>%        # Fill the missing values upwards
  arrange(ID) %>%                                  # Restore the original row order
  select(-ID)                                      # Drop the helper ID column
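A slightly more defensive variant (a sketch, assuming tidyr >= 1.0 for the "updown" direction) is to group before filling, so a value can never leak across matching groups regardless of the row order:

library(dplyr)
library(tidyr)
dt %>%
  group_by(Matching.group, Group, Period) %>% # fill() respects dplyr groups
  fill(Overcharging, .direction = "updown") %>%
  ungroup()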
I searched a lot but didn't find anything relevant.
What I want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multi-indexed columns and multi-indexed rows. Multi-indexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
library(reshape2)  # needed for dcast below
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts=round(sum(NUMBER/EXPOSURE*1000)))
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var="counts")
The problem is that the column index is "melted" into a single level instead of a multi-indexed column, as I know it from Python/pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example of how it would work in pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd

df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried it with ftable, but it only prints ones in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
ftable accepts lists of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first before passing it to ftable; that should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you use round in R and truncation (.astype(int)) in Python.
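If you want the R counts to match the pandas output exactly, you could truncate instead of rounding when building the summary, e.g. (same pipeline, trunc instead of round):

cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = trunc(sum(NUMBER / EXPOSURE * 1000)), .groups = "drop")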
This question already has an answer here:
generate sequence within group in R [duplicate]
(1 answer)
Closed 6 years ago.
I have a dataset which was ordered using the order() function in R, as shown below:
A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92
I have to print a rank based on column A; the expected output is shown below:
A B C Rank
1 1 85 1
1 1 62 2
1 0 92 3
2 1 80 1
2 0 92 2
2 0 84 3
3 1 65 1
3 0 92 2
Any R expertise would be appreciated.
A simple base R solution using ave and seq_along is
df$Rank <- ave(df$B, df$A, FUN=seq_along)
which returns
df
A B C Rank
1 1 1 85 1
2 1 1 62 2
3 1 0 92 3
4 2 1 80 1
5 2 0 92 2
6 2 0 84 3
7 3 1 65 1
8 3 0 92 2
seq_along returns a vector 1, 2, 3, ... with the same length as its argument. ave applies a function within groups, which here are determined by the variable A.
data
df <- read.table(header=TRUE, text="A B C
1 1 85
1 1 62
1 0 92
2 1 80
2 0 92
2 0 84
3 1 65
3 0 92")
I have the following database (in wide form), "st_all", with two variables I wish to reshape ("P" and "PLC"). The id for the subjects is "g_id".
g_id study condition sample PLC1 PLC2 PLC3 PLC4 PLC5 PLC6 PLC7 PLC8 PLC9 PLC10 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 1 1 1 1 20 20 20 50 50 20 30 20 50 50 1 2 2 1 2 2 1 1 1 1
2 2 1 1 1 60 70 50 70 60 60 60 70 60 50 1 2 1 1 2 2 1 1 1 1
3 3 1 1 1 80 50 55 58 70 50 80 80 60 65 1 2 2 1 2 2 1 1 1 1
4 4 1 1 1 89 51 59 62 72 60 86 80 61 54 1 1 2 1 2 2 1 1 1 1
5 5 1 1 1 90 50 60 70 80 50 90 80 60 50 1 1 1 1 2 2 1 1 1 1
6 6 1 1 1 95 50 60 100 95 60 50 60 60 55 1 2 2 1 2 2 1 1 1 1
To do so I ran the following code:
reshape(st_all,
idvar="g_id",
direction="long",
varying=list(c(5:14),c(15:24)),
v.names=c("PLC","P")
)
and I get the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
invalid 'row.names' length
I have searched for an answer to this, but have not found one.
Thanks in advance.
As noted in the comments, you'll have problems with the reshape function when your data is a tbl.
Use as.data.frame first:
reshape(as.data.frame(st_all),
idvar = "g_id",
direction = "long",
varying = list(c(5:14), c(15:24)),
v.names = c("PLC","P"))
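For reference, a sketch of the same reshape with tidyr's pivot_longer (assuming tidyr >= 1.0, and that the columns are literally named PLC1 to PLC10 and P1 to P10):

library(tidyr)
pivot_longer(st_all,
             cols = matches("^(PLC|P)[0-9]+$"),  # the 20 measure columns
             names_to = c(".value", "time"),     # stub becomes the column name, digits become time
             names_pattern = "^(PLC|P)([0-9]+)$")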