R fill columns n times - r

Hi I want to simulate a dataset like this:
City Person
1 1
1 2
1 3
2 1
2 2
2 3
Where City ID can go from 1-30 and Person ID from 1-40. I know that I can create City by the following code:
data=data.frame(City=rep(1:30,40),Person=0)
However, I cannot figure out how to assign the Person variable for each City ID without using a loop. How do I assign the Person IDs from 1-40 for each City ID? Any help will be appreciated. Thanks.

We can do this with
df1$Person <- with(df1, ave(seq_along(City), City, FUN = seq_along))
Or
df1$Person <- sequence(table(df1$City))
Also, an easier expansion would be
expand.grid(City = 1:30, Person = 1:3)
Or with crossing from tidyr (part of the tidyverse)
library(tidyverse)
crossing(City = 1:30, Person = 1:3)
Or, for an existing data frame, using dplyr
library(tidyverse)
df1 %>%
group_by(City) %>%
mutate(Person = row_number())
Or using data.table
library(data.table)
setDT(df1)[, Person := seq_len(.N), by = City]
data
df1 <- data.frame(City = rep(1:2, each = 3))
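For the dimensions in the question (30 cities, 40 people each), a minimal base R sketch could build both columns directly with rep. The name sim is only illustrative, and the sketch assumes every city has exactly 40 people.

```r
# Each city ID is repeated 40 times in a row; person IDs 1..40 restart for every city.
n_city <- 30
n_person <- 40
sim <- data.frame(
  City   = rep(seq_len(n_city), each = n_person),
  Person = rep(seq_len(n_person), times = n_city)
)
head(sim, 3)
```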

Related

How do I turn non-sequential data in R into sequential data while grouping on an ID

I have the following dataframe.
df <- data.frame(Person = c("Eric","Eric","Eric","Joe","Joe","Joe"), Order = c(2,7,4,2,5,1),
Value = c("A","A","B","C","A","B"))
The Order column is currently in a random order. Every person has 3 Order values, which are random integers between 1 and 8, with no repeats for a person. How do I transform the Order column so that it reflects the rank of each value within its person? The Order column would then always be between 1 and 3. The desired output would look like this.
df <- data.frame(Person = c("Eric","Eric","Eric","Joe","Joe","Joe"), Order = c(1,3,2,2,3,1),
Value = c("A","A","B","C","A","B"))
Perhaps, we need to rank the 'Order' grouped by 'Person'
library(dplyr)
df %>%
group_by(Person) %>%
mutate(Order = rank(Order))
Some base R options
Using rank
transform(
df,
Order = ave(Order, Person, FUN = rank)
)
Using match + sort
transform(
df,
Order = ave(Order, Person, FUN = function(x) match(x,sort(x)))
)
Using data.table :
library(data.table)
setDT(df)[, Order := frank(Order), Person]
df
# Person Order Value
#1: Eric 1 A
#2: Eric 3 A
#3: Eric 2 B
#4: Joe 2 C
#5: Joe 3 A
#6: Joe 1 B
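The question states there are no repeats within a person, so the approaches above return whole-number ranks. If ties could occur, rank averages them by default. The sketch below, on a made-up vector x, shows how the ties.method argument changes that.

```r
x <- c(2, 7, 2)                  # hypothetical values containing a tie
rank(x)                          # default "average": 1.5 3.0 1.5
rank(x, ties.method = "first")   # ties broken by position: 1 3 2
rank(x, ties.method = "min")     # tied values share the lowest rank: 1 3 1
```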

Assign a conditional value to new created column

My Data frame looks like this
Now, I want to add a new column which assigns one (!) specific value to each country. That means, there is only one value for Australia, one for Canada etc. for every year.
It should look like this:
Year Country R Ineq Adv NEW_COL
2018 Australia R1 Ineq1 1 x_Australia
2019 Australia R2 Ineq2 1 x_Australia
1972 Canada R1 Ineq1 1 x_Canada
...
Is there a smart way to do this?
Appreciate any help!
You can use merge.
x = data.frame(country = c("AUS","CAN","AUS","USA"),
val1 = c(1:4))
y = data.frame(country = c("AUS","CAN","USA"),
val2 = c("a","b","c"))
merge(x,y)
country val1 val2
1 AUS 1 a
2 AUS 3 a
3 CAN 2 b
4 USA 4 c
You just manually create the (probably significantly smaller!) reference table that then gets duplicated in the original table in the merge. As you can see, my 3 row table (with a,b,c) is correctly duplicated up to the original (4 row) table such that every AUS gets "a".
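Note that merge sorts the result by the key, so the row order of x changes (see the output above). If the original order matters, one possible alternative is a lookup with match. This is a sketch reusing the same x and y, not part of the original answer.

```r
x <- data.frame(country = c("AUS", "CAN", "AUS", "USA"), val1 = 1:4)
y <- data.frame(country = c("AUS", "CAN", "USA"), val2 = c("a", "b", "c"))

# match() returns, for each row of x, the position of its country in y;
# countries missing from y would get NA.
x$val2 <- y$val2[match(x$country, y$country)]
x
```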
You may use mutate and case_when from the package dplyr:
library(dplyr)
data <- data.frame(country = rep(c("AUS", "CAN"), each = 2))
data <- mutate(data,
newcol = case_when(
country == "CAN" ~ 1,
country == "AUS" ~ 2))
print(data)
You can use mutate and group_indices:
library(dplyr)
Sample data:
sample.df <- data.frame(Year = sample(1971:2019, 10, replace = T),
Country = sample(c("AUS", "Can", "UK", "US"), 10, replace = T))
Create new variable called ID, and assign unique ID to each Country group:
sample.df <- sample.df %>%
mutate(ID = group_indices(., Country))
If you want it to appear as x_Country, you can use paste (as commented):
sample.df <- sample.df %>%
mutate(ID = paste(group_indices(., Country), Country, sep = "_"))
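For comparison, a base R sketch of the same idea: converting Country to a factor and then to integer gives one ID per country, numbered in the sorted order of the country names. The data frame here is a small hand-made stand-in for sample.df.

```r
sample.df <- data.frame(Year = c(1980, 1999, 2005, 2010),
                        Country = c("UK", "AUS", "UK", "Can"))

# Factor levels are the sorted unique countries, so the integer codes act as group IDs.
id <- as.integer(factor(sample.df$Country))
sample.df$ID <- paste(id, sample.df$Country, sep = "_")
sample.df$ID
```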

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. Here i selects the rows to operate on (a subset, or an ordering); it is empty in this case, so all rows are used in their original order. The j says what is to be done: here, counting the number of occurrences using .N, which is then assigned to the new variable count using the assignment operator :=. The by takes a list of variables, and the j operation is performed on each group they define.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
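A rough base R counterpart, for readers without data.table: ave can also add a per-group count while keeping every row. The small data frame jobs is made up for illustration.

```r
jobs <- data.frame(NHS_Trust = c(1, 1, 2, 1),
                   Week      = c(1, 1, 1, 2),
                   Region    = c(5, 5, 6, 5))

# For each NHS_Trust/Week/Region group, length() gives the group size,
# which ave() writes back to every row of that group.
jobs$count <- ave(seq_len(nrow(jobs)), jobs$NHS_Trust, jobs$Week, jobs$Region,
                  FUN = length)
jobs$count
```

Like :=, this keeps one row per job posting; it does not collapse the data to one row per group.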
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The check relies on the new variable combo, which is just a concatenation of the values in those three fields separated by underscores.
Edit: I am pretty sure the people here can come up with something more elegant than this.
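One possibly more compact base R route is table, which already reports a zero for every combination of its factor levels. The sketch below uses a tiny made-up data frame and leaves Region out for brevity. It assumes weeks should run over a fixed range (1 to 3 here, 1 to 10 in the question).

```r
jobs <- data.frame(NHS_Trust = c(1, 1, 2), Week = c(1, 1, 3))

# Fixing the factor levels of Week makes table() count weeks with no postings as 0.
counts <- as.data.frame(
  table(NHS_Trust = jobs$NHS_Trust,
        Week = factor(jobs$Week, levels = 1:3)),
  responseName = "Jobs"
)
counts
```

NHS_Trust and Week come back as factors, so they may need converting before modelling.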

Dplyr: Anonymising values up to a million rows with unique names

I have the following data:
library(dplyr)
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('John', 'Jane', 'Rich', 'Clive'),
surname = c('Smith', 'Jones', 'Smith', 'Jones'))
I would like to anonymise the values within the 'forename' and 'surname' variables so that the data looks like this.
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('forename1', 'forename2', 'forename3', 'forename4'),
surname = c('surname1', 'surname2', 'surname3', 'surname4'))
I could just do this manually but I have a df with millions of rows. What I would like is for the number in each renamed value to match the row number in the df. So the data on row 67 for example would show:
d <- tibble(
region = c('all'),
forename = c('forename67'),
surname = c('surname67'))
Does anyone know how I would achieve this using dplyr if possible?
Thanks
As every row is a unique user, we can paste the row number onto the column names.
library(dplyr)
d %>%
mutate(forename = paste0("forename", row_number()),
surname = paste0("surname", row_number()))
# A tibble: 4 x 3
# region forename surname
# <chr> <chr> <chr>
#1 all forename1 surname1
#2 one forename2 surname2
#3 eleven forename3 surname3
#4 six forename4 surname4
An option with stringr
library(dplyr)
library(stringr)
d %>%
mutate(forename = str_c("forename", row_number()),
surname = str_c("surname", row_number()))
Or with lapply from base R
d[c('forename', 'surname')] <- lapply(c('forename', 'surname'), function(x)
paste0(x, seq_len(nrow(d))))

How to average and count in each group and creating a new table

I have a dataset, and I want to calculate the average of the KPI, CPM and CPC columns and count the times column within each score group (1-10).
How can I create a new table from the results?
A new table looks like:
score avg_KPI avg_CPC avg_CPM count_times
10
9
8
7
6
5
4
3
2
1
I tried to use a for loop, but it doesn't work:
for (i in 1:10) {
aa <- subset(dataset1,score== i )
macroAvgs<-colMeans(aa[,2:4])
df <- rbind(score="i",KPI=macroAvgs[1],CPC=macroAvgs[2],CPM=macroAvgs[3],times=count(aa[5])
}
Assuming your data is in a data.frame called df, do you just want this?
library(data.table)
setDT(df)[, c(lapply(.SD, mean), list(N = .N)), by = score, .SDcols = c("KPI", "CPM", "CPC")]
or do you want this?
library(dplyr)
group_by(df, score) %>%
summarise(Mean_KPI = mean(KPI),
Mean_CPC = mean(CPC),
Mean_CPM = mean(CPM),
Sum_times = sum(times))
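If neither package is available, a base R sketch along the same lines could use aggregate and merge. The data frame below is invented to match the column names in the question, and count_times is taken to mean the number of rows per score, which is an assumption.

```r
dataset1 <- data.frame(score = c(1, 1, 2, 2, 2),
                       KPI   = c(1, 3, 2, 4, 6),
                       CPC   = c(10, 20, 30, 30, 30),
                       CPM   = c(5, 5, 1, 2, 3),
                       times = c(1, 1, 1, 1, 1))

# Group means of the three metric columns, one row per score.
avgs <- aggregate(cbind(KPI, CPC, CPM) ~ score, data = dataset1, FUN = mean)
names(avgs)[-1] <- paste0("avg_", names(avgs)[-1])

# Number of rows in each score group.
counts <- aggregate(times ~ score, data = dataset1, FUN = length)
names(counts)[2] <- "count_times"

result <- merge(avgs, counts, by = "score")
result
```

Note that the formula interface of aggregate drops rows containing NA by default.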
