Removing cases in a long dataframe - r

I am currently experiencing a problem where I have a long dataframe (i.e., multiple rows per subject) and want to remove cases that don't have any measurements (in any of the rows) on one variable. I've tried transforming the data to wide format, but this was a problem as I can't go back anymore (going from long to wide "destroys" my timeline variable). Does anyone have an idea about how to fix this problem?
Below is some code to simulate the head of my data. Specifically, I want to remove cases that don't have a measurement of extraversion on any of the measurement occasions ("time").
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA), satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)), .Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"), row.names = c(NA, 6L), class = "data.frame")
Note: I realise that the missing values in my extraversion variable coincide with those in my self-esteem variable.

To drop an entire id if they don't have any measurements for extraversion you could do:
library(data.table)
setDT(df)[, drop := all(is.na(extraversion)) ,by= id][!df$drop]
# id time extraversion satisfaction self-esteem drop
#1: 1 79 3.2 3 4.9 FALSE
#2: 1 95 NA 3 NA FALSE
#3: 3 28 2.0 5 6.9 FALSE
#4: 3 40 2.4 5 6.7 FALSE
#5: 3 52 NA 9 NA FALSE
Or you could use .I which I believe should be faster:
setDT(df)[df[,.I[!all(is.na(extraversion))], by = id]$V1]
Lastly, a base R solution could use ave (thanks to @thelatemail for the suggestion to make it shorter/more expressive):
df[!ave(is.na(df$extraversion), df$id, FUN = all),]
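To see what the ave call computes, here is a minimal sketch on a cut-down copy of the sample data (only the id, time and extraversion columns are kept; the names all_missing and kept are just for illustration):

```r
df <- data.frame(
  id = c(1L, 1L, 2L, 3L, 3L, 3L),
  time = c(79L, 95L, 79L, 28L, 40L, 52L),
  extraversion = c(3.2, NA, NA, 2, 2.4, NA)
)

# For every row, TRUE if all extraversion values of that row's id are NA
all_missing <- ave(is.na(df$extraversion), df$id, FUN = all)

# Drop whole ids that never have a measurement; rows with NA for other ids stay
kept <- df[!all_missing, ]
unique(kept$id)
```

Only id 2 is dropped here, because it is the only id without a single extraversion value.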

Assuming the data frame is named mydata, use a dplyr filter:
library(dplyr)
mydata %>%
group_by(id) %>%
filter(!all(is.na(extraversion))) %>%
ungroup()

d <-
structure(
list(
id = c(1L, 1L, 2L, 3L, 3L, 3L),
time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA),
satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)
),
.Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"),
row.names = c(NA, 6L),
class = "data.frame"
)
d[complete.cases(d$extraversion), ]
d[is.na(d$extraversion), ]
complete.cases is great if you want to remove any rows with missing data: d[complete.cases(d), ]
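Note that complete.cases works row by row, so on its own it drops individual rows rather than whole ids. A small sketch of the difference, using only the id and extraversion columns (row_wise, ids_with_data and case_wise are illustrative names, not part of the answer above):

```r
d <- data.frame(
  id = c(1L, 1L, 2L, 3L, 3L, 3L),
  extraversion = c(3.2, NA, NA, 2, 2.4, NA)
)

# Row-wise: every row with a missing extraversion value is removed
row_wise <- d[complete.cases(d$extraversion), ]

# Case-wise: keep all rows of any id that has at least one measurement
ids_with_data <- unique(d$id[!is.na(d$extraversion)])
case_wise <- d[d$id %in% ids_with_data, ]

nrow(row_wise)
nrow(case_wise)
```

The case-wise version is the one that matches the question, since it keeps the NA rows of ids 1 and 3.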

Replace all values in dataframe using another dataframe as key in R

I have two dataframes and I want to replace all values (in all the columns) of df1 with the equivalent value in df2 (df2$value).
df1
structure(list(Cell_ID = c(7L, 2L, 3L, 10L), n_1 = c(0L, 0L,
0L, 0L), n_2 = c(9L, 1L, 4L, 1L), n_3 = c(10L, 4L, 5L, 2L), n_4 = c(NA,
5L, NA, 4L), n_5 = c(NA, 7L, NA, 6L), n_6 = c(NA, 9L, NA, 8L),
n_7 = c(NA, 10L, NA, 3L)), class = "data.frame", row.names = c(NA,
-4L))
df2
structure(list(Cell_ID = 0:10, value = c(5L, 100L, 200L, 300L,
400L, 500L, 600L, 700L, 800L, 900L, 1000L)), class = "data.frame", row.names = c(NA,
-11L))
The desired output would look like this:
So far I tried this, as suggested in another similar post, but it's not working correctly (it seems to miss some values at random):
key= df2$Cell_ID
value = df2$value
lapply(1:8,FUN = function(i){df1[df1 == key[i]] <<- value[i]})
Note that in this example the values are mostly just the IDs multiplied by 100 for ease; in the real data the values are all over the place, so simply multiplying the dataframe by a constant won't work.
One option is to match the elements against the 'Cell_ID' column of the second dataset and use the result as an index to return the corresponding 'value' from 'df2':
library(dplyr)
df1 %>%
mutate(across(everything(), ~ df2$value[match(., df2$Cell_ID)]))
Output:
# Cell_ID n_1 n_2 n_3 n_4 n_5 n_6 n_7
#1 700 5 900 1000 NA NA NA NA
#2 200 5 100 400 500 700 900 1000
#3 300 5 400 500 NA NA NA NA
#4 1000 5 100 200 400 600 800 300
Or another option is to use a named vector to do the match
library(tibble)
df1 %>%
mutate(across(everything(), ~ deframe(df2)[as.character(.)]))
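The named-vector idea also works without tibble. A minimal base R sketch, assuming a three-entry lookup (the names lookup and ids are made up for illustration):

```r
# Names are the keys, elements are the replacement values
lookup <- setNames(c(5L, 100L, 200L), c("0", "1", "2"))
ids <- c(2L, 0L, NA, 1L)

# Index by name, not by position: as.character() is what makes this a key lookup
by_name <- unname(lookup[as.character(ids)])

# Indexing with the raw integers would select by position instead
by_position <- unname(lookup[ids[1]])
```

Here by_name is 200, 5, NA, 100, while by_position picks the second element (100) for the key 2, which is why the conversion to character matters.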
The base R equivalent is
df1[] <- lapply(df1, function(x) df2$value[match(x, df2$Cell_ID)])
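One plausible reason the sequential replacement in the question misses values is that each pass sees the results of the earlier passes: once 0 has been replaced by 5, a later pass for key 5 replaces it again. The sketch below illustrates that effect on a small made-up vector and compares it with a single match lookup (lookup, x, seq_x and matched are illustrative names):

```r
lookup <- data.frame(Cell_ID = 0:5,
                     value = c(5L, 100L, 200L, 300L, 400L, 500L))
x <- c(0L, 2L, 5L)

# Sequential replacement: already-replaced values can be hit again
seq_x <- x
for (i in seq_len(nrow(lookup))) {
  seq_x[seq_x == lookup$Cell_ID[i]] <- lookup$value[i]
}

# One-shot lookup: every original value is translated exactly once
matched <- lookup$value[match(x, lookup$Cell_ID)]
```

The loop turns the original 0 into 500 (0 becomes 5, then 5 becomes 500), whereas match returns 5, 200, 500.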

R: populate data.frame within function in mapply

A data.frame df1 is queried (fuzzy match) against another data.frame df2 with agrep. Via iterating over its output (a list called matches holding the row number of the respective matches in df2), df1 is populated with affiliated values from df2.
The goal is a function that is passed to mapply; however, in all my attempts df1 remains unchanged.
In a for-loop, the code works as expected and populates df1 with the affiliated variables from df2. Still, I would be interested how to solve this with a function that is passed to mapply.
First, the two data.frames:
df1 <- structure(list(Species = c("Alisma plantago-aquatica", "Alnus glutinosa",
"Carex davalliana", "Carex echinata",
"Carex elata"),
CheckPoint = c(NA, NA, NA, NA, NA),
L = c(NA, NA, NA, NA, NA),
R = c(NA, NA, NA, NA, NA),
K = c(NA, NA, NA, NA, NA)),
row.names = c(NA, 5L), class = "data.frame")
df2 <- structure(list(Species = c("Alisma gramineum", "Alisma lanceolatum",
"Alisma plantago-aquatica", "Alnus glutinosa",
"Alnus incana", "Alnus viridis",
"Carex davalliana", "Carex depauperata",
"Carex diandra", "Carex digitata",
"Carex dioica", "Carex distans",
"Carex disticha", "Carex echinata",
"Carex elata"),
L = c(7L, 7L, 7L, 5L, 6L, 7L, 9L, 4L, 8L, 3L, 9L, 9L, 8L,
8L, 8L),
R = c(7L, 7L, 5L, 5L, 4L, 3L, 4L, 7L, 6L, NA, 4L, 6L, 6L,
NA, NA),
K = c(6L, 2L, NA, 3L, 5L, 4L, 4L, 2L, 7L, 4L, NA, 3L, NA,
3L, 2L)),
row.names = seq(1:15), class = "data.frame")
Then, fuzzy match by Species:
matches <- lapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
Populating df1 with the values from df2 works as expected:
for (i in 1:dim(df1)[1]){
df1[i, 2:5] <- df2[matches[[i]], ]
}
In contrast, my approach with mapply does return the correct values, but only as a list of disassembled values that are never written into df1. No combination (with or without return(df1), writing the result into another variable, or desperate attempts with the state of SIMPLIFY or USE.NAMES) yielded the desired result.
populatedf1 <- function(matches, index){
df1[index, 2:5] <- df2[matches, ]
#return(df1)
}
mapply(populatedf1, matches, seq_along(matches), SIMPLIFY = FALSE,
USE.NAMES = FALSE)
Would be great if someone knows the solution or could point me into a certain direction, thanks! :)
Actually, you would not need any loop here (for or mapply) if you replace lapply with sapply (so that it returns a vector instead of list) and then do a direct assignment.
matches <- sapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
df1[, 2:5] <- df2[matches,]
df1
# Species CheckPoint L R K
#1 Alisma plantago-aquatica Alisma plantago-aquatica 7 5 NA
#2 Alnus glutinosa Alnus glutinosa 5 5 3
#3 Carex davalliana Carex davalliana 9 4 4
#4 Carex echinata Carex echinata 8 NA 3
#5 Carex elata Carex elata 8 NA 2
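The direct assignment relies on every species having exactly one match; if a pattern matches several rows (or none), sapply cannot simplify and returns a list instead. A hedged sketch of a guard, using hypothetical match results rather than real agrep output (matches_list and first_match are made-up names):

```r
# Hypothetical agrep() results: the third pattern matched two rows
matches_list <- list(3L, 4L, c(7L, 8L))

# sapply() can only simplify to a vector if every element has length 1
simplified <- sapply(matches_list, identity)
is_plain_vector <- is.atomic(simplified)

# Keep the first hit per pattern (NA if there was none); always an integer vector
first_match <- function(m) if (length(m) > 0) m[[1]] else NA_integer_
idx <- vapply(matches_list, first_match, integer(1))
```

Taking the first hit is only one possible policy; checking all(lengths(matches) == 1) and stopping otherwise would be a stricter alternative.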
As far as your approach is concerned you can use Map or mapply with SIMPLIFY = FALSE and bring the list of dataframes into one dataframe using do.call and rbind and then assign.
df1[, 2:5] <- do.call(rbind, Map(populatedf1, matches, seq_along(matches)))
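The reason df1 stays unchanged in the mapply attempt is that an assignment inside a function only modifies the function's local copy. A minimal sketch of that behaviour and of the "return values, assign once" pattern, on a toy data frame (df, fill_local and new_b are illustrative names):

```r
df <- data.frame(a = 1:3, b = NA_real_)

# Assigning inside a function changes only the function's local copy of df
fill_local <- function(i) {
  df[i, "b"] <- i * 10
  df
}
invisible(lapply(1:3, fill_local))
unchanged <- all(is.na(df$b))   # the global df was never touched

# Return just the new values and assign them once, outside the function
new_b <- vapply(1:3, function(i) i * 10, numeric(1))
df$b <- new_b
```

Using <<- inside the function would also write to the global object, but collecting results and assigning once is usually easier to reason about.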

How to create a new column showing if and how many variables share a specific observation

I have a question concerning the analysis of some bioinformatics data in R.
My test data frame consists of a variable "sequence" with different letter codes as observations and three different variables representing individuals/samples (P1, P2, P3) that say how often the particular observation was counted in an individual (so P3 contains the sequence "AB" 23 times for example).
I want to create a new column now (already indicated in my data frame as dummy column X with NA) that shows for each sequence row if the sequence is overall shared between individuals (P1, P2, P3) and more importantly how many of the three individuals share it. The numbers in the new column can therefore range only from 1 to 3. For example: for sequence "ABCDE" the new column would show value 1 because it occurs only in one individual P3, for sequence "ABC" the new column would show value 2 because it occurs in both individuals P2 and P3 and finally for "ABCD" it would show 3 since all individuals contain the sequence.
My test data looks like this after dput():
structure(list(Sequence = structure(1:9, .Label = c("AB", "ABC",
"ABCD", "ABCDE", "ABCDEF", "ABCDEFG", "ABCDEFGH", "ABCDEFGHI",
"ABCDEFGHIJ"), class = "factor"), P1 = c(5L, 0L, 20L, 0L, 3L,
1L, 0L, 0L, 0L), P2 = c(6L, 2L, 3L, 0L, 2L, 0L, 56L, 10L, 3L),
P3 = c(23L, 34L, 8L, 5L, 0L, 6L, 0L, 78L, 5L), X = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Sequence",
"P1", "P2", "P3", "X"), class = "data.frame", row.names = c(NA,
-9L))
Thank you!
You can try to sum the "P." columns with a positive count:
mydf$X <- rowSums(mydf[, grep("^P", names(mydf))]>0)
head(mydf, 4)
# Sequence P1 P2 P3 X
#1 AB 5 6 23 3
#2 ABC 0 2 34 2
#3 ABCD 20 3 8 3
#4 ABCDE 0 0 5 1
We can use Reduce with lapply
df1$X <- Reduce(`+`, lapply(df1[2:4], `>`, 0))
df1$X
#[1] 3 2 3 1 2 2 1 2 2
Reduce can be very efficient, as shown in the benchmarks here
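Both answers count, per row, how many of the P columns are positive. A small sketch on the first four rows of the test data shows that they agree (counts, shared_rowsums and shared_reduce are illustrative names):

```r
counts <- data.frame(
  P1 = c(5L, 0L, 20L, 0L),
  P2 = c(6L, 2L, 3L, 0L),
  P3 = c(23L, 34L, 8L, 5L)
)

# Logical matrix of "sequence present in this individual", summed per row
shared_rowsums <- rowSums(counts > 0)

# Same idea column by column: add up one logical vector per individual
shared_reduce <- Reduce(`+`, lapply(counts, `>`, 0))
```

Both give 3, 2, 3, 1 for these rows; rowSums returns a named double vector, while the Reduce version returns a plain integer vector.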

Replacing loop in dplyr R

So I am trying to write a function with dplyr without a loop, and here is something I do not know how to do.
Say we have TV stations (x, y, z) and months (2, 3). If I group by these, we get
the following output, with a summarised numeric value:
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
Then I will have many brands, which I need to filter to keep only those whose value is >= 0.8 * value of the evaluated brand and <= 1.2 * value of the evaluated brand.
So, for example, from the table below I would only want to keep the first two rows, and this should be done for all month and TV combinations:
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))
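The sample data only covers TV x in month 2. As a sketch of how the same merge-then-subset idea handles several TV/month combinations at once, here is a variant with made-up data (reference, brands, the brand "abc" and the ".brand"/".ref" suffixes are all assumptions for illustration):

```r
reference <- data.frame(TV = c("x", "y"), months = c(2L, 2L),
                        value = c(52, 87))
brands <- data.frame(brand = c("sdg", "sdfg", "shs", "abc"),
                     TV = c("x", "x", "x", "y"),
                     MONTH = 2L,
                     value = c(60, 55, 120, 80))

# Attach the evaluated brand's value to every brand row with the same TV/month
m <- merge(brands, reference,
           by.x = c("TV", "MONTH"), by.y = c("TV", "months"),
           suffixes = c(".brand", ".ref"))

# Keep rows within 80% to 120% of the reference value for their own group
in_band <- m$value.brand >= 0.8 * m$value.ref & m$value.brand <= 1.2 * m$value.ref
result <- m[in_band, c("brand", "TV", "MONTH", "value.brand")]
```

For TV x the band is 41.6 to 62.4 and for TV y it is 69.6 to 104.4, so sdg, sdfg and abc are kept while shs is dropped.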

R: aggregate values on a tree

This question is similar to this, but it's got a C# answer, and I need an R answer.
I have some 50 files of about 650 rows with a format and data very similar to this toy data:
dput(y)
structure(list(level1 = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L), level2 = c(NA, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
42L, 42L, 42L, 42L), level3 = c(NA, NA, 4120L, 4120L, 4120L,
4120L, 4120L, 4120L, NA, 4210L, 4210L, 4210L), level4 = c(NA,
NA, NA, 412030L, 412030L, 412050L, 412050L, 412050L, NA, NA,
421005L, 421005L), pid = c(NA, NA, NA, NA, 123456L, NA, 789012L,
345678L, NA, NA, NA, 901234L), description = c("income", "op.income",
"manuf.industries", "manuf 1", "client 1", "manuf 2", "client 2",
"client 3", "non-op.income", "financial", "interest", "bank 1"
), value = c(NA, NA, NA, NA, 15000L, NA, 272860L, 1150000L, NA,
NA, NA, 378L)), .Names = c("level1", "level2", "level3", "level4",
"pid", "description", "value"), class = c("data.table", "data.frame"
), row.names = c(NA, -12L), .internal.selfref = <pointer: 0x00000000001a0788>)
Each of the rows that has a value in the value column is a "leaf" of a tree, with branches identified in columns level1 to level4. I want to summarize the leaves by branch and put the corresponding sums in the value column.
My expected output looks like this:
dput(res)
structure(list(level1 = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L), level2 = c(NA, 41L, 41L, 41L, 41L, 41L, 41L, 41L,
42L, 42L, 42L, 42L), level3 = c(NA, NA, 4120L, 4120L, 4120L,
4120L, 4120L, 4120L, NA, 4210L, 4210L, 4210L), level4 = c(NA,
NA, NA, 412030L, 412030L, 412050L, 412050L, 412050L, NA, NA,
421005L, 421005L), pid = c(NA, NA, NA, NA, 123456L, NA, 789012L,
345678L, NA, NA, NA, 901234L), description = c("income", "op.income",
"manuf.industries", "manuf 1", "client 1", "manuf 2", "client 2",
"client 3", "non-op.income", "financial", "interest", "bank 1"
), value = c(1438238L, 1437860L, 1437860L, 15000L, 15000L, 1422860L,
272860L, 1150000L, 378L, 378L, 378L, 378L)), .Names = c("level1",
"level2", "level3", "level4", "pid", "description", "value"), class = c("data.table",
"data.frame"), row.names = c(NA, -12L), .internal.selfref = <pointer: 0x00000000001a0788>)
I know this can be done with a for-loop, but I wanted to know if there is any faster, simpler alternative (I prefer data.table or base-solutions, but any other package works ok too). What I've tried so far:
z4<-y[!is.na(pid),sum(value),by=level4]
setkey(y,"level4");setkey(z4,"level4")
y[z4,][is.na(pid)]
This shows me the desired values in V1, so I wanted to see if I could assign them to value:
y[z4,][is.na(pid),value:=i.V1]
Error in eval(expr, envir, enclos) : object 'i.V1' not found
I think this could be caused because the call i.V1 is in the chained [ and not in the initial y[z4 call. But if I only subset on z4, how can I know which of the several matching level4 rows I should assign (that's why I'm thinking of using is.na(pid), because y[z4,value:=i.V1] produces the wrong result, as it updates all values that match level4).
As you can see, I'm badly stuck at this problem, and with "my method" I still would have 3 more levels to go.
Is there any easier way to do this?
Because the computations at each level require those from the previous level, I think a loop or recursion is required. Here is a recursive function to get the values using base R. You could surely do something similar with data.table, which would probably be much more efficient.
## Use y as data.frame
y <- as.data.frame(y)
## Recursive function to get values
f <- function(data, lvl = NULL) {
  if (is.null(lvl)) lvl <- 1                                  # initialize level
  if (lvl == 5) return(data)                                  # we are done
  cname <- paste0("level", lvl)                               # name of current level
  nname <- if (lvl == 4) "pid" else paste0("level", lvl + 1)  # name of next level
  agg <- aggregate(as.formula(paste("value ~", cname)), data = data, sum)  # aggregate data
  ms <- match(data[, cname], agg[, cname], nomatch = 0)       # matching row of agg (0 = no match)
  inds <- ms & is.na(data[, nname])                           # find index of leaves to fill
  data$value[inds] <- agg$value[ms[inds]]                     # add new values
  f(data, lvl + 1)                                            # recurse
}
f(data = y)
# level1 level2 level3 level4 pid description value
# 1 4 NA NA NA NA income 1438238
# 2 4 41 NA NA NA op.income 1437860
# 3 4 41 4120 NA NA manuf.industries 1437860
# 4 4 41 4120 412030 NA manuf 1 15000
# 5 4 41 4120 412030 123456 client 1 15000
# 6 4 41 4120 412050 NA manuf 2 1422860
# 7 4 41 4120 412050 789012 client 2 272860
# 8 4 41 4120 412050 345678 client 3 1150000
# 9 4 42 NA NA NA non-op.income 378
# 10 4 42 4210 NA NA financial 378
# 11 4 42 4210 421005 NA interest 378
# 12 4 42 4210 421005 901234 bank 1 378
I think the aggregation step could be made more efficient by only aggregating a subset of the data if need be. Honestly, this was fun, but a loop is probably the way to go.
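For comparison, a loop version might look like the sketch below. It is not the recursive function above: it sums the leaf rows directly at every level, so no level depends on a previously filled one. The two-level toy tree and the names tree, level_cols, leaves and branch are assumptions for illustration:

```r
tree <- data.frame(
  level1 = c(4L, 4L, 4L, 4L),
  level2 = c(NA, 41L, 41L, 41L),
  pid    = c(NA, NA, 1L, 2L),
  value  = c(NA, NA, 10, 5)
)
level_cols <- c("level1", "level2")
next_cols <- c(level_cols[-1], "pid")   # column one step deeper than each level

# Leaves are the rows that carry a pid; only they hold original values
leaves <- tree[!is.na(tree$pid), ]

for (k in seq_along(level_cols)) {
  col <- level_cols[k]
  # Sum of leaf values for every branch at this level
  sums <- tapply(leaves$value, leaves[[col]], sum)
  # A branch row has this level filled in but nothing deeper
  branch <- !is.na(tree[[col]]) & is.na(tree[[next_cols[k]]])
  tree$value[branch] <- as.vector(sums[as.character(tree[[col]][branch])])
}
tree$value
```

On this toy tree both branch rows get 15, the sum of the two leaves below them.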
