How to replace the NAs in a data frame with the average of the corresponding group - r

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
I can get the average "nums" for each "id" with:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is replace each NA with the average of the corresponding id. For example, if the average "nums" for ids 1, 2 and 3 were 1000, 2000 and 3000, respectively, the NA where id == 3 would be replaced by 3000 and the last NA, where id == 1, would be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I went wrong. Is there a better way to do this?
Thank you

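You can fix it by indexing id_avg$nums with temp directly. This relies on the ids being 1, 2, 3, so that each id is also its row position in id_avg: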
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1

What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
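If adding a package is not an option, the same per-group fill can be sketched in base R with ave(). This is only an illustrative alternative, not a description of what zoo does internally; the data frame is rebuilt from the question, and fill_mean is a hypothetical helper name.

```r
# Data from the question
dat <- data.frame(
  nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
  id   = c(1, 2, 3, 1, 3, 3, 2, 3, 1)
)

# Hypothetical helper: replace the NAs of one group with that group's mean
fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies fill_mean within each id and returns the values
# in the original row order
dat$nums <- ave(dat$nums, dat$id, FUN = fill_mean)
dat
```

Note that a group containing only NAs would be filled with NaN, because the mean of an empty vector is NaN.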

Here is a dplyr way:
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = TRUE))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
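To see why match() does the lookup that `==` does not, here is a small sketch on the question's data. It assumes the data frame is called dat as in the question; pos is a name introduced only for illustration.

```r
# Data from the question
dat <- data.frame(
  nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
  id   = c(1, 2, 3, 1, 3, 3, 2, 3, 1)
)
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)

# ids of the rows whose nums is NA: 3 and 1
temp <- dat$id[is.na(dat$nums)]

# `==` compares position by position and recycles the shorter vector,
# so it does not look up each id in id_avg
suppressWarnings(id_avg$id == temp)   # FALSE FALSE TRUE

# match() returns, for each element of temp, its row position in id_avg$id
pos <- match(temp, id_avg$id)
id_avg$nums[pos]
```

Because match() works on the id values rather than on positions, it should also work when the ids are not 1, 2, 3 or are stored as character.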

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue this - can someone please show me how to continue?
Thanks!
A base R option using cumsum to label the in-between consecutive rows
subset(
my_data,
ave(
exam_no,
student_id,
FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
A more compact version by data.table with a similar idea as above is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
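The cumsum labelling used in both versions above can be traced by hand on one group. In this sketch the two picks are fixed instead of random so that the intermediate vectors are reproducible; hit and label are names introduced only for illustration.

```r
x <- 1:5                # row positions within one group
picks <- c(2, 4)        # fixed here; the answers draw these with sample.int(length(x), 2)

hit   <- seq_along(x) %in% picks   # FALSE TRUE FALSE TRUE FALSE
label <- cumsum(hit)               # 0 1 1 2 2
x[label == 1]                      # 2 3
```

The rows labelled 1 run from the smaller pick up to the row just before the larger pick, so with this approach the last row of a group is never selected.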
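Using data.table, within each group, sample two values from .I (without replacement) and create the sequence of indices between them. Because the two sampled indices are not sorted, the rows of a group can come back in descending order, as in the output below; use sort(sample(.I, 2)) if the original order should be kept.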
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089
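The same idea can be sketched in base R, closer to the "min"/"max" plan from the question, with the two end points drawn separately for every student and sorted so the rows stay in their original order. pick_run is a hypothetical helper name, and the sketch assumes every student has at least two rows.

```r
set.seed(1)
my_data <- data.frame(
  student_id = rep(1:3, each = 5),
  exam_no    = rep(1:5, times = 3),
  result     = rnorm(15, 60, 10)
)

# Hypothetical helper: given the row numbers of one student, draw two distinct
# end points and return every row number between them (inclusive)
pick_run <- function(rows) {
  ends <- sort(rows[sample.int(length(rows), 2)])
  ends[1]:ends[2]
}

# Row numbers per student, one random run per student
rows_by_student <- split(seq_len(nrow(my_data)), my_data$student_id)
keep <- unlist(lapply(rows_by_student, pick_run), use.names = FALSE)

picked <- my_data[keep, ]
picked
```

Indexing with sample.int(length(rows), 2) rather than sample(rows, 2) avoids the surprise that sample() treats a single number n as 1:n.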

Select value from previous group based on condition

I have the following df
df<-data.frame(value = c(1,1,1,2,1,1,2,2,1,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values, where no_rows < 2, with the value from a previous group. I expect my end df to look like this:
df_end<-data.frame(value = c(1,1,1,1,1,1,2,2,2,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this combination of if...else in a for loop, which gives me the desired output, however it is very slow and I am looking for a way to optimise it.
for (i in 2:length(df$group)){
if (df$no_rows[i] < 2){
df$value[i] <- df$value[i-1]
}
}
I have also tried with dplyr::mutate and lag() but it does not give me the desired output (it only removes the first value per group instead of taking the value of a previous group).
df<-df%>%
group_by(group) %>%
mutate(value = ifelse(no_rows < 2, lag(value), value))
I have been looking for a solution for a few days, but I could not find anything that fits my problem completely. Any ideas?
A data.table approach:
First, take the values of the groups with no_rows >= 2, then fill in the remaining missing values (NA) by last observation carried forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 NA
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 NA
#10: 2 10 1 NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 1
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 2
#10: 2 10 1 2
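The two steps above (blank out the short groups, then carry the last value forward) can also be sketched in base R without extra packages. This is an illustrative equivalent on the question's data, not part of the data.table answer; v and last_seen are names introduced here.

```r
df <- data.frame(value   = c(1, 1, 1, 2, 1, 1, 2, 2, 1, 2),
                 group   = c(5, 5, 5, 6, 7, 7, 8, 8, 9, 10),
                 no_rows = c(3, 3, 3, 1, 2, 2, 2, 2, 1, 1))

# Step 1: keep the value only for groups with at least 2 rows
v <- ifelse(df$no_rows >= 2, df$value, NA)

# Step 2: for every row, the position of the most recent non-NA value
last_seen <- cummax(replace(seq_along(v), is.na(v), 0L))
last_seen[last_seen == 0] <- NA   # leading rows with nothing to carry forward stay NA

df$new_value <- v[last_seen]
df
```

The whole column is computed with vectorised operations, so no explicit loop over the rows is needed.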

How to automate renaming of columns in wide data using R

Consider the following data in the wide format
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
I want to keep the column (or variable) names "id" and "ex" and rename the remaining columns, e.g. "aQL", "bQL" and "cQL" as "QL.1", "QL.2" and "QL.3", respectively. The other columns with names ending with "ST" and "XY" are expected to be renamed in the same manner, also having the order .1, .2 and .3. Of note is "aST" and "bXY" are missing from the data set, but I want them to be included and renamed as ST.1 and XY.2, with each having NAs as their entries. The expected output would look like
df
id ex QL.1 QL.2 QL.3 ST.1 ST.2 ST.3 XY.1 XY.2 XY.3
1 1 1 5 5 5 NA 3 8 1 NA 5
2 2 0 4 7 7 NA 7 7 9 NA 3
3 3 0 NA NA NA NA 8 5 4 NA 1
4 4 1 6 9 9 NA 9 3 4 NA 4
The main data set has many variables, so I would like the renaming to be done in an automated manner. I tried the following code
renameCol <- function(x) {
setNames(x, paste0("QL.", seq_len(ncol(x))))
}
renameCol(df)
but it does not work as expected: it also renames "id" and "ex", which I want to keep, and it is not flexible enough to rename several groups of variables (i.e. QL, ST, XY). Any help is greatly appreciated.
I would suggest a tidyverse approach where there is no need of a function. In this solution you can extract the first letter of each variable name as id and then assign a number with cur_group_id so that the order is kept. Finally, with this new number you transform the variable containing the names and then you format to wide in order to obtain the expected output:
library(tidyverse)
#Data
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
#Reshape
df %>% pivot_longer(cols = -c(1,2)) %>%
#Extract first letter as id
mutate(id2=substring(name,1,1)) %>%
#Create the number id
group_by(id2) %>%
mutate(id3=cur_group_id()) %>%
#Clean name
mutate(name=substring(name,2,nchar(name))) %>%
#Create final var
mutate(name2=paste0(name,'.',id3)) %>% ungroup() %>%
dplyr::select(-c(name,id2,id3)) %>%
#Format to wide
pivot_wider(names_from = name2,values_from=value)
Output:
# A tibble: 4 x 9
id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
In base R you could do:
names(df) <- sub("(\\d)([A-Z]{2})$","\\2.\\1", chartr("abc","123",names(df)))
df
id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
If you need the NA columns:
names(df) <- sub("(\\d)([A-Z]{2})$","\\2.\\1", chartr("abc","123",names(df)))
a <- read.table(text=grep("\\.\\d",names(df),value = TRUE), sep=".")
b <- subset(aggregate(.~V1, a, function(x) setdiff(1:3,x)), V2>0)
df[do.call(paste, c(sep = ".", b))] <- NA
(df1 <- df[c(1, 2, order(names(df)[-(1:2)]) + 2)])
id ex QL.1 QL.2 QL.3 ST.1 ST.2 ST.3 XY.1 XY.2 XY.3
1 1 1 5 5 5 NA 3 8 1 NA 5
2 2 0 4 7 7 NA 7 7 9 NA 3
3 3 0 NA NA NA NA 8 5 4 NA 1
4 4 1 6 9 9 NA 9 3 4 NA 4
Another way you can try, using str_c from stringr (note that the numbers for each suffix are hard-coded):
library(stringr)
colnames(df)[grepl("QL", colnames(df))] <- str_c("QL.", 1:3)
colnames(df)[grepl("ST", colnames(df))] <- str_c("ST.", 2:3)
colnames(df)[grepl("XY", colnames(df))] <- str_c("XY.", c(1,3))
# id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
# 1 1 1 5 5 5 3 8 1 5
# 2 2 0 4 7 7 7 7 9 3
# 3 3 0 NA NA NA 8 5 4 1
# 4 4 1 6 9 9 9 3 4 4
Here is a solution that uses regular expressions via the stringr package:
library(stringr)
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
renameCol <- function(x) {
col_names <- colnames(x)
index_ql <- str_detect(col_names,
"^[a-z]{1}QL")
index_st <- str_detect(col_names,
"^[a-z]{1}ST")
index_xy <- str_detect(col_names,
"^[a-z]{1}XY")
replace_fun <- function(x) {which(letters %in% x)}
col_names[index_ql] <- paste0("QL.", str_replace(substr(col_names[index_ql], 1, 1),
"[a-z]", replace_fun))
col_names[index_st] <- paste0("ST.", str_replace(substr(col_names[index_st], 1, 1),
"[a-z]", replace_fun))
col_names[index_xy] <- paste0("XY.", str_replace(substr(col_names[index_xy], 1, 1),
"[a-z]", replace_fun))
col_names
}
colnames(df) <- renameCol(df)
df
#> id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
#> 1 1 1 5 5 5 3 8 1 5
#> 2 2 0 4 7 7 7 7 9 3
#> 3 3 0 NA NA NA 8 5 4 1
#> 4 4 1 6 9 9 9 3 4 4
Created on 2020-09-07 by the reprex package (v0.3.0)
Edit
The function above was adapted so that it takes the order into account.
Using base pattern matching (the helper function below calls str_extract, so stringr must be loaded):
You need to define a function that does what you want on a single column name:
f = function(x){
beg <- str_extract(x,"[a-z](?=[A-Z]{2})")
num <- which(letters == beg)
output <- paste0(str_extract(x,"(?<=[a-z])[A-Z]{2}"),".",num)
return(output)
}
Here we extract the lower-case letter if it is followed by two upper-case letters, find its position in the alphabet, and paste that number after the upper-case letters.
> f("cQL")
[1] "QL.3"
You can then use regmatches and regular expression directly on the name of your data frame:
m <- gregexpr("[a-z][A-Z]{2}", names(df),perl = T)
regmatches(names(df), m) <- lapply(regmatches(names(df), m), f)
names(df)
> names(df)
[1] "id" "ex" "QL.1" "QL.2" "QL.3" "ST.2" "ST.3" "XY.1" "XY.3"
It solves only the renaming part, not the "including the missing columns" part of your question.
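For the remaining part, here is a base R sketch that renames the columns and then adds the missing ones as NA. It assumes every column other than "id" and "ex" is named as one lower-case letter followed by a suffix, and that the highest letter present determines how many columns each suffix should have; keep, old, suffix, number and wanted are names introduced only for illustration.

```r
df <- data.frame(id  = c(1, 2, 3, 4),
                 ex  = c(1, 0, 0, 1),
                 aQL = c(5, 4, NA, 6),
                 bQL = c(5, 7, NA, 9),
                 cQL = c(5, 7, NA, 9),
                 bST = c(3, 7, 8, 9),
                 cST = c(8, 7, 5, 3),
                 aXY = c(1, 9, 4, 4),
                 cXY = c(5, 3, 1, 4))

keep <- c("id", "ex")                 # columns to leave untouched
old  <- setdiff(names(df), keep)

suffix <- substring(old, 2)                       # "QL", "ST", "XY"
number <- match(substring(old, 1, 1), letters)    # a -> 1, b -> 2, c -> 3
names(df)[match(old, names(df))] <- paste0(suffix, ".", number)

# Every suffix/number combination that should exist, in the desired order
n <- max(number)
wanted <- paste0(rep(unique(suffix), each = n), ".", seq_len(n))

# Add the combinations that are absent as all-NA columns, then reorder
df[setdiff(wanted, names(df))] <- NA
df <- df[c(keep, wanted)]
df
```

Deriving the numbers from the letters rather than hard-coding them means the same code should carry over to a data set with more suffixes or more letters.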

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax and logic are quite similar between these two solutions:
Find the events that are equal to x
Extract their Sequence
Subset the table according to those Sequence values
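The three steps can be written out one at a time. This sketch rebuilds the table from the question as d; is_x and seqs_with_x are names introduced only for illustration.

```r
d <- data.frame(
  Event    = c("a", "a", "x", "a", "a", "a", "a", "x", "a", "a"),
  Sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)

# Step 1: find the events that are equal to "x"
is_x <- d$Event == "x"

# Step 2: extract the Sequence values of those events
seqs_with_x <- unique(d$Sequence[is_x])

# Step 3: keep every row whose Sequence is one of those values
result <- d[d$Sequence %in% seqs_with_x, ]
result
```

The one-line solutions above simply nest these steps inside a single subsetting call.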
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

R - Subset dataframe to include only subjects with more than 1 record

I'd like to subset a dataframe to include all records for subjects that have >1 record, and exclude those subjects with only 1 record.
Let's take the following dataframe;
mydata <- data.frame(subject_id = factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10)),
variable = rnorm(15))
The code below gives me the subjects with >1 record using duplicated();
duplicates <- mydata[duplicated(mydata$subject_id),]$subject_id
But I want to retain in my subset all records for each subject with >1 record, so I tried;
mydata[mydata$subject_id==as.factor(duplicates),]
Which does not return the result I'm expecting.
Any ideas?
A data.table solution
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
library(data.table)
setDT(mydata)[, .SD[.N > 1], by = subject_id] # #Thanks David.
# subject_id variable
# 1: 4 -1.3325937
# 2: 4 -0.4465668
# 3: 5 0.5696061
# 4: 5 -2.8897176
# 5: 6 -0.8690183
# 6: 6 -0.4617027
# 7: 9 -0.1503822
# 8: 9 -0.6281268
# 9: 9 1.3232209
A simple alternative is to use dplyr:
library(dplyr)
dfr <- data.frame(a=sample(1:2,10,rep=T), b=sample(1:5,10, rep=T))
dfr <- group_by(dfr, b)
dfr
# Source: local data frame [10 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 5
# 4 2 1
# 5 1 2
# 6 1 3
# 7 2 1
# 8 2 4
# 9 1 4
# 10 2 4
filter(dfr, n() > 1)
# Source: local data frame [8 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 1
# 4 1 2
# 5 2 1
# 6 2 4
# 7 1 4
# 8 2 4
Here you go (I changed your variable to var <- rnorm(15)):
set.seed(11)
subject_id<-as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
var<-rnorm(15)
mydata<-as.data.frame(cbind(subject_id,var))
x1 <- c(names(table(mydata$subject_id)[table(mydata$subject_id) > 1]))
x2 <- which(mydata$subject_id %in% x1)
mydata[x2,]
subject_id var
4 4 0.3951076
5 4 -2.4129058
6 5 -1.3309979
7 5 -1.7354382
8 6 0.4020871
9 6 0.4628287
12 9 -2.1744466
13 9 0.4857337
14 9 1.0245632
Try:
> mydata[mydata$subject_id %in% mydata[duplicated(mydata$subject_id),]$subject_id,]
subject_id variable
4 4 -1.3325937
5 4 -0.4465668
6 5 0.5696061
7 5 -2.8897176
8 6 -0.8690183
9 6 -0.4617027
12 9 -0.1503822
13 9 -0.6281268
14 9 1.3232209
I had to edit your data frame a little bit:
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
Now to get all the rows for subjects that appear more than once:
mydata[duplicated(mydata$subject_id)
| duplicated(mydata$subject_id, fromLast = TRUE), ]
# subject_id variable
# 4 4 -1.3325937
# 5 4 -0.4465668
# 6 5 0.5696061
# 7 5 -2.8897176
# 8 6 -0.8690183
# 9 6 -0.4617027
# 12 9 -0.1503822
# 13 9 -0.6281268
# 14 9 1.3232209
Edit: this would also work, using your duplicates vector:
mydata[mydata$subject_id %in% duplicates, ]
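Another way to express "subjects with more than one record" is to count the records per subject and filter on that count. The sketch below uses ave() from base R on the question's data; n_records is a name introduced only for illustration.

```r
set.seed(20)
mydata <- data.frame(
  subject_id = factor(c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 9, 9, 10)),
  variable   = rnorm(15)
)

# For every row, the number of rows that share its subject_id
n_records <- ave(seq_along(mydata$subject_id), mydata$subject_id, FUN = length)

# Keep all rows of subjects that occur more than once
result <- mydata[n_records > 1, ]
result
```

Like the %in% solutions, this tests each row against a per-subject property, whereas `==` in the original attempt compares the two vectors position by position.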

Resources