count non-NA values and group by variable - r

I am trying to show how many complete observations there are per variable ID, without using complete.cases or any other package.
If I use na.omit to filter out the NA values, I will lose all of the IDs which might have ZERO complete cases.
In the end, I'd like a frequency table with two columns: ID and Number of Complete Observations.
> length(unique(data$ID))
[1] 332
> head(data)
ID value
1 1 NA
2 1 NA
3 1 NA
4 1 NA
5 1 NA
6 1 NA
> dim(data)
[1] 772087 2
When I try to create my own function z, which counts non-NA values, and apply it via aggregate(), the IDs with zero complete observations are left out. I should be left with 332 rows, not 323. How does one resolve this using base functions?
z <- function(x){
sum(!is.na(x))
}
aggregate(value ~ ID, data = data, FUN = z)
> nrow(aggregate(value ~ ID, data = data, FUN = z))
[1] 323
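Note: the IDs disappear because the formula method of aggregate() applies its default na.action = na.omit before grouping, so IDs whose values are all NA are dropped entirely. A minimal base-only sketch of a fix, assuming that is the only issue, is to disable this default:
aggregate(value ~ ID, data = data, FUN = z, na.action = NULL)  # should now keep all 332 IDs; all-NA IDs get 0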

One of the ways to do this is using table:
df2 <- table(df$Id, !is.na(df$value))[,2]
data.frame(ID = names(df2), value = df2)
Data
structure(list(Id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4), value = c(NA,
1, 1, 2, 2, NA, 3, NA, 3, 3, 4, 4)), .Names = c("Id", "value"
), row.names = c(NA, -12L), class = "data.frame")
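With the Data above this yields one row per Id (an Id with only NA values would get 0):
  ID value
1  1     2
2  2     2
3  3     3
4  4     2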

In base R you can use your utility function like this:
stack(by(data$value, data$ID, FUN=function(x) sum(!is.na(x))))
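stack() returns the counts in a column named values and the grouping variable in ind; rename them if you prefer clearer headers (the names below are arbitrary):
res <- stack(by(data$value, data$ID, FUN = function(x) sum(!is.na(x))))
names(res) <- c("complete_obs", "ID")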

You can use table directly for this purpose. Below is the sample code:
df1 <- structure(list(Id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4), value = c(2,
1, 1, NA, NA, NA, 3, NA, 3, 3, 4, 4)), .Names = c("Id", "value"
), row.names = c(NA, -12L), class = "data.frame")
df2 <- as.data.frame.matrix(with(df1, table(Id, value)))
resultDf <- data.frame(Id=row.names(df2), count=apply(df2, 1, sum))
resultDf
The code makes a table of Id and value. Since table() ignores NA entries by default, summing each row gives the number of non-NA values per Id. Hope this is easy to understand and helps.
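As a side note, the apply(df2, 1, sum) step is equivalent to rowSums(), which is shorter and faster:
resultDf <- data.frame(Id = row.names(df2), count = rowSums(df2))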

Related

Using map() function to apply for each element

I need, with the help of the map() function, to apply the above for each element.
How can I do so?
As dt is of class data.table, you can make a vector of the columns of interest (i.e. your items; below I use grepl() on the names), and then apply your weighting function to each of those columns using .SD and .SDcols, with by:
qs <- names(dt)[grepl("^q", names(dt))]
dt[, (paste0(qs, "wt")) := lapply(.SD, \(q) 1 / (sum(!is.na(q)) / .N)),
   by = .(sex, education_code, age), .SDcols = qs]
As mentioned in the comments, you are missing a dt <- before your dt[, .(ID, education_code, age, sex, item = q1_1)], which makes the column item unavailable in the following line, dt[, no_respond := is.na(item)].
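That is, the line should presumably read:
dt <- dt[, .(ID, education_code, age, sex, item = q1_1)]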
Your weighting scheme is not absolutely clear to me; however, assuming you want to do what is done in your code here, I would go with a dplyr solution to iterate over the columns.
# your data without no_respond column and correcting missing value in q2_3
dt <- data.table::data.table(
ID = c(1,2,3,4, 5, 6, 7, 8, 9, 10),
education_code = c(20,50,20,60, 20, 10,5, 12, 12, 12),
age = c(87,67,56,52, 34, 56, 67, 78, 23, 34),
sex = c("F","M","M","M", "F","M","M","M", "M","M"),
q1_1 = c(NA,1,5,3, 1, NA, 3, 4, 5,1),
q1_2 = c(NA,1,5,3, 1, 2, NA, 4, 5,1),
q1_3 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q1_text = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_1 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_2 = c(NA,1,5,3, 1, 2, 3, 4, 5,1),
q2_3 = c(NA,1,5,3, 1, NA, NA, 4, 5,1),
q2_text = c(NA,1,5,3, 1, NA, 3, 4, 5,1))
library(dplyr)
dt %>%
group_by(sex, education_code, age) %>% #groups the df by sex, education_code, age
add_count() %>% #add a column with number of rows in each group
mutate(across(starts_with("q"), #for each column starting with "q"
~ 1/(sum(!is.na(.))/n), #create a new column following your weight calculation
.names = '{.col}_wgt')) %>% #naming the new column with suffix "_wgt" to original name
ungroup()

Count co-occurrences between elements in one column based on second column, and count only if unequal in third column

I want to count how often each pairwise combination of unique elements in column c of data frame df co-occurs on the elements of column a, with the addition that co-occurrences are only counted if the respective values in column b are unequal, i.e., conditional on a non-match in column b.
a <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
b <- c(1,1,2,2,2,1,1,2,2,3,3,3,3,1,1,1,2,2,2,4)
c <- c(1,2,1,2,3,2,3,1,2,1,1,2,3,1,2,1,1,2,4,1)
df <- as.data.frame(cbind(a,b,c))
Without considering column b, I could do the following to obtain, for each pair of elements of column c, the number of elements of a on which they co-occur:
df <- unique(df[,c(1,3)])
df <- merge(df, df, by = "a")
df$count <- 1
df <- aggregate(count ~ ., df[, c(2:4)], sum)
df <- df[df$c.x != df$c.y,]
With the additional condition of a non-match in b, there is only one difference: elements 2 and 4 of column c both co-occur on element 4 of column a but have the same value in b, and should therefore not be counted. I should end up with:
c.x <- c(2,3,4,1,3,1,2,1)
c.y <- c(1,1,1,2,2,3,3,4)
count <- c(4,3,1,4,3,3,3,1)
result <- as.data.frame(cbind(c.x,c.y,count))
As the original data set is large (> 1,000,000 observations), I welcome fast solutions, i.e., without using loops or merges. Usually, I create co-occurrence matrices from three-column data frames using sparseMatrix().
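For reference, one base sketch that folds the b condition into the merge approach above (so not the fast sparseMatrix route you asked about) keeps b through the merge, requires a non-match, and de-duplicates per a before counting; on the sample data this reproduces result:
u <- unique(df)
m <- merge(u, u, by = "a")
m <- m[m$c.x != m$c.y & m$b.x != m$b.y, ]  # only pairs with a non-match in b
m <- unique(m[, c("a", "c.x", "c.y")])     # count each c-pair at most once per element of a
m$count <- 1
aggregate(count ~ c.x + c.y, m, sum)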
I'm not sure from your description if this is what you had in mind, nor how fast this would turn out to be, but here is an approach with purrr:
library(purrr)
split(df, df$c) %>%
  combn(2, simplify = FALSE) %>%
  set_names(map(., ~ paste(names(.x), collapse = "_"))) %>%
  map_int(~ merge(.x[[1]], .x[[2]], by = NULL) %>%
            dplyr::filter(a.x == a.y & b.x != b.y) %>%  # & (not &&) for a vectorized comparison
            nrow())
Returns:
1_2 1_3 1_4 2_3 2_4 3_4 
  9   4   2   3   0   0
# Data used:
df <- structure(list(a = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4), b = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4), c = c(1, 2, 1, 2, 3, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 4, 1)), class = "data.frame", row.names = c(NA, -20L))

Reshape data to long form based on a pattern and not unique identifier

I have some data that comes from the measurement of an image where essentially the columns signify position (x) and height (z) data. The problem is that this data gets spit out as a .csv file in the wide format. I am trying to find a way to convert this to the long format but I'm unsure how to do this because I can't designate an identifier.
I know there are a lot of questions on reshaping data but I didn't find anything quite like this.
As an example:
df <- data.frame(V1 = c("Profile", "x", "[m]", 0, 2, 4, 6, 8, 10, 12, NA, NA),
                 V2 = c("1", "z", "[m]", 3, 3, 4, 10, 12, 9, 2, NA, NA),
                 V3 = c("Profile", "x", "[m]", 0, 2, 4, 6, NA, NA, NA, NA, NA),
                 V4 = c("2", "z", "[m]", 4, 8, 10, 10, NA, NA, NA, NA, NA),
                 V5 = c("Profile", "x", "[m]", 0, 2, 4, 6, 8, 10, 12, 14, 17),
                 V6 = c("3", "z", "[m]", 0, 1, 1, 10, 14, 11, 6, 2, 0))
Each pair of columns represents x,z data (grouped by Profile 1, Profile 2, Profile 3, etc.). However, the measurements are not of equal length, hence the rows with NAs. Is there a programmatic way to reshape this data into long form? I.e.:
profile x z
Profile 1 0 3
Profile 1 2 3
Profile 1 4 4
... ... ...
Profile 2 0 4
Profile 2 2 8
Profile 2 4 10
... ... ...
Thank you in advance for your help!
You can do the following (it's a bit verbose; feel free to optimize):
dfcols <- NCOL(df)
xColInds <- seq(1, dfcols, by = 2)
zColInds <- seq(2, dfcols, by = 2)
longdata <- do.call("rbind", lapply(seq_along(xColInds), function(i) {
  xValInd <- xColInds[i]
  zValInd <- zColInds[i]
  profileName <- paste0(df[1, xValInd], " ", df[1, zValInd])
  xVals <- as.numeric(df[-(1:3), xValInd])
  zVals <- as.numeric(df[-(1:3), zValInd])
  data.frame(profile = rep(profileName, length(xVals)),
             x = xVals,
             z = zVals)
}))
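If the NA padding rows are not wanted in the long output, they can be dropped afterwards, e.g.:
longdata <- longdata[!is.na(longdata$x), ]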
If you want it more performant, don't cast to data.frame on every single iteration. One cast at the end is enough, like:
xColInds <- seq(1, NCOL(df), by = 2)
longdataList <- lapply(xColInds, function(xci) {
  list(profileName = paste0(df[1, xci], " ", df[1, xci + 1]),
       x = df[-(1:3), xci],
       z = df[-(1:3), xci + 1])
})
longdata <- data.frame(profile = rep(unlist(lapply(longdataList, "[[", "profileName")), each = NROW(df) - 3),
                       x = as.numeric(unlist(lapply(longdataList, "[[", "x"))),
                       z = as.numeric(unlist(lapply(longdataList, "[[", "z"))))

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because the person did not complete a certain time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that automatically adds rows to the dataset depending on whether the participant has had 1, 2 or 3 visits. Ideally it would set the rest of the data to NA while copying P_ID and Site_code, but if that is not possible I would be satisfied with just creating the right number of rows.
We could use fill after doing a complete
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
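Note that fill() fills downward over the whole frame by default; that works here because complete() sorts by P_ID, but grouping first is safer (assuming Site_code is constant within each person):
ExpandedDataset %>%
  complete(P_ID, Event_code) %>%
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>%
  ungroup()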
I came up with quite long code, but you could wrap it in a function to make it easier.
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
values <- table(df$ID)  # table() counts records per ID; summary() only does this for factors
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for (i in 1:length(rowcount)) {
  y <- rep(uncompliant[i], target - rowcount[i])
  IDs <- c(IDs, y)
}
And now, the site codes:
SC <- vector()
for (i in 1:length(rowcount)) {
  y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
  SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs,
                        Event = rep(NA, length(IDs)),
                        Event_code = rep(NA, length(IDs)),
                        Site_code = SC)
Combine the original data frame with the new values to be added, and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]

Merging columns and removing NA

I have a data frame:
A <- c(NA, 1, 2, NA, 3, NA)
R <- c(2, 1, 2, 1, NA, 1)
C <- c(rep("B", 3), rep("D", 3))
data1 <- data.frame(A, R, C)
data1
And I want to merge columns A and R, to have a data frame like data2:
AR <- c(2, 1, 2, 1, 3, 1)
C <- c(rep("B", 3), rep("D", 3))
data2 <- data.frame(AR, C)
data2
Do you know how I can do that?
You might want to consider what happens if "A" and "R" have different values, but this should work:
data2 <- with(data1, data.frame(AR=ifelse(is.na(A), R, A), C=C))
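If packages are an option after all, dplyr::coalesce() expresses the same idea and generalizes to more than two columns:
data2 <- with(data1, data.frame(AR = dplyr::coalesce(A, R), C = C))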
