Change variable in a column with a for loop using R - r

In my data frame, I have a column named Localisation. I want to change the data that is stored in it.
Thalamus, External capsule or Lenticulate for 0,
Cerebellum or Brain stem for 2 and
Frontal, Occipital, Parietal or Temporal for 1
I’m trying to do a For loop for this operation. My for statement doesn’t seem to be right, because I received
Error in for (. in i) 15:nrow(Localisation) :
4 arguments passed to 'for' which requires 3
and I don’t know how to compose my if statement.
dfM <- dfM %>%
for(i in dfM$Localisation) if(i = "Thalamus"| "External capsule"| "Lenticulate"){
dfM$Localisation <- "0"
} else if ( i = "Cerebellum"| "Brain stem") {
dfM$Localisation <- "2"
} else {
dfM$Localisation <- "1"
}
I know similar questions have been asked multiple times, but I can’t find a way to work with my data.
dput(head(dfM))
structure(list(new_id = c("5", "9", "10", "16", "30", "31"),
Localisation = c("Frontal", "Thalamus", "Occipital ", "Frontal",
"External capsule", "Cerebellum"), HIV.CT.initial = c(0L,
1L, 1L, 1L, 0L, 0L), Anticoagulant = c("Warfarin", "DOAC",
"Warfarin", "Warfarin", "Warfarin", "Warfarin"), Sex = c(1L,
1L, 1L, 2L, 2L, 1L), HTA = c(1L, 1L, 1L, 1L, 1L, 0L), Systolic_BP = c(116L,
169L, 164L, 109L, 134L, 146L), Diastolic_BP = c(70L, 65L,
80L, 60L, 75L, 85L), ACO = c(1L, 0L, 1L, 1L, 1L, 1L), Type.NACO = c(NA,
2L, NA, NA, NA, NA), Dose.ACO = c(NA, NA, NA, NA, NA, NA),
APT = c(0L, 0L, 1L, 0L, 1L, 0L), INR = c(4.2, 1.1, 1.9, 1.3,
3.6, 2.8), GLW = c(13L, 15L, 14L, 14L, 15L, 15L), GLW.Prog = c(NA,
NA, NA, NA, 14L, NA), mRS = c(4L, 3L, 4L, 6L, 6L, 5L), Time.to.scan = c(NA,
NA, NA, NA, NA, NA), Chx = c(0L, 0L, 0L, 0L, 0L, 0L), Type.Chx = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), Date.Chx = c("",
"", "", "", "", ""), décès.hospit = c(0L, 0L, 0L, 1L, 1L,
0L), décès.HIP = c(NA, NA, NA, 2L, 1L, NA), X3.mois = c(NA,
NA, NA, NA, NA, 1L), X6.mois = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X12.mois = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.mois.1 = c(NA, NA, NA, 1L, 1L, 1L), X6.mois.1 = c(NA,
NA, NA, 1L, 1L, 1L), X12.mois.1 = c(NA, NA, NA, 1L, 1L, 1L
), X.1.y = c(NA, NA, NA, NA, NA, NA), ID = c(5L, 9L, 10L,
16L, 30L, 31L)), row.names = c(NA, 6L), class = "data.frame")

Up front: in general, don't use for inside a %>% pipe; it is almost never what you intend.
Since you have the %>%, I'll infer you're using dplyr and/or other related packages. Here's one way:
library(dplyr)
dfM %>%
mutate(Localisation = case_when(
Localisation %in% c("Thalamus", "External capsule", "Lenticulate") ~ "0",
Localisation %in% c("Cerebellum", "Brain stem") ~ "2",
TRUE ~ "1")
)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
Perhaps relevant: one of your values, "Occipital ", has a trailing space. If stray spaces like this can occur and the recoding still needs to work, replace each Localisation %in% above with trimws(Localisation) %in% to reduce the chance of a misclassification.
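To see why the trailing space matters, here is a minimal base R sketch on a made-up vector (loc and deep are illustrative names, not objects from the question). It compares matching before and after trimws:

```r
# Hypothetical toy vector; "Thalamus " carries a trailing space on purpose
loc  <- c("Frontal", "Thalamus ", "Occipital ", "Cerebellum")
deep <- c("Thalamus", "External capsule", "Lenticulate")

raw_match     <- loc %in% deep          # the padded "Thalamus " is not matched
trimmed_match <- trimws(loc) %in% deep  # matched once whitespace is stripped

print(raw_match)
print(trimmed_match)
```

Only the trimmed comparison picks up the padded value, so without trimws it would silently fall through to the default category.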
Another method is to maintain a frame of translations:
translations <- tribble(
~Localisation, ~NewLocalisation,
"Thalamus", "0",
"External capsule", "0",
"Lenticulate", "0",
"Cerebellum", "2",
"Brain stem", "2"
)
dfM %>%
select(new_id, Localisation) %>% # just to reduce the display here on SO
left_join(., translations, by = "Localisation")
# new_id Localisation NewLocalisation
# 1 5 Frontal <NA>
# 2 9 Thalamus 0
# 3 10 Occipital <NA>
# 4 16 Frontal <NA>
# 5 30 External capsule 0
# 6 31 Cerebellum 2
I intentionally left the "default" step out to highlight that the NA values are natural: those are the ones not specifically identified.
dfM %>%
left_join(., translations, by = "Localisation") %>%
mutate(NewLocalisation = if_else(is.na(NewLocalisation), "1", NewLocalisation)) %>%
mutate(Localisation = NewLocalisation) %>%
select(-NewLocalisation)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
The reason I offer this alternative is that it is sometimes easier to maintain a separate table (in Excel, CSV form, etc.) like this for automated processing of data. It doesn't offer any capability that the case_when or @dcarlson's ifelse methods lack.
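For completeness, here is a base R sketch of two approaches that are not shown above: the loop the question was aiming for, and a nested ifelse. The nested ifelse is only my guess at the general shape of such a solution, not a copy of the other answer. The names loc, deep and posterior are made up, and loc holds the six values from the dput with the trailing space removed.

```r
# Assumed toy input: the six Localisation values from the question, trimmed
loc <- c("Frontal", "Thalamus", "Occipital", "Frontal",
         "External capsule", "Cerebellum")
deep      <- c("Thalamus", "External capsule", "Lenticulate")
posterior <- c("Cerebellum", "Brain stem")

# Loop version: iterate over positions and assign one element at a time.
# %in% tests membership in a set of strings; chaining strings with | does not.
out_loop <- character(length(loc))
for (i in seq_along(loc)) {
  if (loc[i] %in% deep) {
    out_loop[i] <- "0"
  } else if (loc[i] %in% posterior) {
    out_loop[i] <- "2"
  } else {
    out_loop[i] <- "1"
  }
}

# Vectorised version: the same logic without an explicit loop
out_vec <- ifelse(loc %in% deep, "0",
                  ifelse(loc %in% posterior, "2", "1"))

print(out_loop)
print(out_vec)
```

Both give the same recoding; the vectorised form is usually the more idiomatic choice in R.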

Related

How to find rows with same values in two columns?

It's a little hard to explain, but I'm trying to compare the column "cpf" from two different data frames. I want to identify when the value in the two "cpf" columns from (df1) and (df2) is equal (these values can be in different rows). After that, I want to update the NA values if these are available from the other data frame
df1
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 NA NA
5 65 NA NA
df2
cpf x y
1 54 5 10
2 0 NA NA
3 65 3 2
4 0 NA NA
5 0 NA NA
I want the following result
df3
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2
We could do a join on 'cpf' and use fcoalesce
library(data.table)
setDT(df1)[df2, c('x', 'y') := .(fcoalesce(x, i.x),
fcoalesce(y, i.y)), on = .(cpf)]
-output
df1
# cpf x y
#1: 21 NA NA
#2: 32 NA NA
#3: 43 NA NA
#4: 54 5 10
#5: 65 3 2
Or using coalesce from dplyr after a left_join
library(dplyr)
left_join(df1, df2, by = 'cpf') %>%
transmute(cpf, x = coalesce(x.x, x.y), y = coalesce(y.x, y.y))
# cpf x y
#1 21 NA NA
#2 32 NA NA
#3 43 NA NA
#4 54 5 10
#5 65 3 2
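Both fcoalesce and coalesce return, element by element, the first value that is not NA. A rough base R sketch of that idea for two vectors follows; coalesce2 is a hypothetical helper written for this illustration, not part of either package, and the input vectors are made up.

```r
# Hypothetical helper: take a where it is present, otherwise fall back to b
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

x_left  <- c(NA, 5L, NA)   # made-up values
x_right <- c(1L, 9L, NA)

res <- coalesce2(x_left, x_right)
print(res)
```

An existing value on the left is kept, a missing one is filled from the right, and a position that is NA on both sides stays NA.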
In base R, we can use match
i1 <- match(df1$cpf, df2$cpf, nomatch = 0)
i2 <- match(df2$cpf, df1$cpf, nomatch = 0)
df1[i2, -1] <- df2[i1, -1]
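A sketch of an equivalent formulation that filters the unmatched rows explicitly instead of relying on zero indices being dropped. It rebuilds small copies of the two data frames (named a and b here, purely for illustration, to avoid clobbering df1 and df2) and, like the code above, overwrites x and y for every matched cpf:

```r
a <- data.frame(cpf = c(21L, 32L, 43L, 54L, 65L),
                x = NA_integer_, y = NA_integer_)
b <- data.frame(cpf = c(54L, 0L, 65L, 0L, 0L),
                x = c(5L, NA, 3L, NA, NA),
                y = c(10L, NA, 2L, NA, NA))

pos <- match(a$cpf, b$cpf)  # position of each a$cpf in b$cpf, NA if absent
hit <- !is.na(pos)          # rows of a that have a partner in b

# Copy x and y from the matching rows of b into the matched rows of a
a[hit, c("x", "y")] <- b[pos[hit], c("x", "y")]
print(a)
```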
data
df1 <- structure(list(cpf = c(21L, 32L, 43L, 54L, 65L), x = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), y = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
df2 <- structure(list(cpf = c(54L, 0L, 65L, 0L, 0L), x = c(5L, NA, 3L,
NA, NA), y = c(10L, NA, 2L, NA, NA)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
df1 %>%
left_join(df2, by = "cpf") %>%
select(cpf, x = x.y, y = y.y)
Output:
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2
Another base R option using merge
merge(df1,
df2,
by = "cpf",
all.x = TRUE,
suffixes = c(".x", "")
)[names(df1)]
gives
cpf x y
1 21 NA NA
2 32 NA NA
3 43 NA NA
4 54 5 10
5 65 3 2

How can I find the first and last observation within each group in R, for every by-group?

Hi my data set is as follows
dialled Ringing state duration
NA NA NA 0
NA NA NA 0
NA NA NA 0
NA NA NA 0
123 NA NA 0
123 NA NA 0
123 NA NA 0
123 NA NA 60
NA NA active 0
NA NA active 0
NA NA inactive 0
NA NA inactive 0
NA 145 inactive 0
NA 145 inactive 0
NA 145 inactive 56
NA NA active 0
NA NA active 0
NA NA inactive 0
222 NA inactive 0
222 NA inactive 0
222 NA inactive 37
NA NA active 0
NA NA active 0
NA NA inactive 0
123 NA inactive 0
123 NA inactive 0
123 NA active 60
NA NA active 0
I want to get 1st and last obs. for every dialled number (repeated one as well, because every call is different). Answer I am looking for is
dialled Ringing state duration
123 NA NA 0
123 NA NA 60
222 NA inactive 0
222 NA inactive 37
123 NA NA 0
123 NA NA 60
I was using the following
library(plyr)
ddply(DF, .(Dialled_nbr), function(x) x[c(1,nrow(x)), ])
which gave me
dialled Ringing state duration
123 NA NA 0
123 NA NA 60
222 NA inactive 0
222 NA inactive 37
But this answer is not correct. Please help.
New data is
dialled Ringing state duration
123 NA NA 0
123 NA NA 0
123 NA NA 60
123 NA NA 0
123 NA NA 0
123 NA NA 70
222 NA inactive 0
222 NA inactive 0
222 NA inactive 37
123 NA inactive 0
123 NA inactive 0
123 NA active 60
Answer to be
dialled Ringing state duration
123 NA NA 0
123 NA NA 60
123 NA NA 0
123 NA NA 70
222 NA inactive 0
222 NA inactive 37
123 NA inactive 0
123 NA active 60
Here is an option with data.table 1.9.5. Create the data.table from the data.frame using setDT, remove the rows with NA in the "dialled" column (!is.na(dialled)), generate a grouping variable by using rleid on "dialled", get the row index of the first and last rows of each group (.I[c(1L, .N)]), and finally subset "dt1" by that row index.
library(data.table)
dt1 <- setDT(df)[!is.na(dialled)]
dt1[dt1[,.I[c(1L, .N)],rleid(dialled)]$V1]
# dialled Ringing state duration
#1: 123 NA NA 0
#2: 123 NA NA 60
#3: 222 NA inactive 0
#4: 222 NA inactive 37
#5: 123 NA inactive 0
#6: 123 NA active 60
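rleid gives every run of consecutive equal values its own id, which is why the second block of 123 ends up in a different group from the first. A base R sketch of the same idea for a vector without NAs (rleid_base is a hypothetical name, and this is not data.table's implementation):

```r
# Hypothetical helper: number the runs of consecutive equal values
rleid_base <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

calls <- c(123, 123, 123, 222, 222, 123, 123)  # made-up dialled numbers
ids <- rleid_base(calls)
print(ids)  # 1 1 1 2 2 3 3
```

Grouping by these ids rather than by the dialled number itself is what keeps repeated calls to the same number apart.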
Or using base R
df1 <- df[!is.na(df$dialled),]
grp<- inverse.rle(within.list(rle(df1$dialled),
values <- seq_along(values)))
df1[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
# dialled Ringing state duration
#5 123 NA <NA> 0
#8 123 NA <NA> 60
#19 222 NA inactive 0
#21 222 NA inactive 37
#25 123 NA inactive 0
#27 123 NA active 60
Update
Based on the new dataset,
grp <- cumsum(c(TRUE,df$duration[-nrow(df)]!=0))
df[!duplicated(grp)|!duplicated(grp,fromLast=TRUE),]
# dialled Ringing state duration
#1 123 NA <NA> 0
#3 123 NA <NA> 60
#4 123 NA <NA> 0
#6 123 NA <NA> 70
#7 222 NA inactive 0
#9 222 NA inactive 37
#10 123 NA inactive 0
#12 123 NA active 60
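The grouping in this update appears to rely on every call ending with a non-zero duration, so a new group starts right after each non-zero value. A sketch on a made-up duration vector:

```r
duration <- c(0, 0, 60, 0, 0, 70, 0, 37)  # hypothetical values
n <- length(duration)

# A new group begins at row 1 and after every row whose duration is non-zero
grp <- cumsum(c(TRUE, duration[-n] != 0))

# Keep the first and the last row of each group
keep <- !duplicated(grp) | !duplicated(grp, fromLast = TRUE)

print(grp)
print(which(keep))
```

If a call could end with a duration of 0, this assumption would not hold and the groups would merge.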
data
df <- structure(list(dialled = c(NA, NA, NA, NA, 123L, 123L, 123L,
123L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 222L, 222L, 222L,
NA, NA, NA, 123L, 123L, 123L, NA), Ringing = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 145L, 145L, 145L, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA,
NA, NA, NA, NA, NA, "active", "active", "inactive", "inactive",
"inactive", "inactive", "inactive", "active", "active", "inactive",
"inactive", "inactive", "inactive", "active", "active", "inactive",
"inactive", "inactive", "active", "active"), duration = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 0L,
0L, 0L, 0L, 0L, 37L, 0L, 0L, 0L, 0L, 0L, 60L, 0L)), .Names =
c("dialled", "Ringing", "state", "duration"), class = "data.frame",
row.names = c(NA, -28L))
newdata
df <- structure(list(dialled = c(123L, 123L, 123L, 123L, 123L, 123L,
222L, 222L, 222L, 123L, 123L, 123L), Ringing = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, NA,
NA, NA, "inactive", "inactive", "inactive", "inactive", "inactive",
"active"), duration = c(0L, 0L, 60L, 0L, 0L, 70L, 0L, 0L, 37L,
0L, 0L, 60L)), .Names = c("dialled", "Ringing", "state", "duration"
), class = "data.frame", row.names = c(NA, -12L))
Here are two options. First we'll need to set up a couple of things that will be used in both options.
## remove rows where 'dialled' is NA
ndf <- DF[!is.na(DF$dialled),]
## run-length encoding on the 'dialled' column in 'ndf'
le <- rle(ndf$dialled)$lengths
Option 1: Create an integer vector of row numbers to use for a subset.
ndf[cumsum(mapply(c, 1L, le-1L)), ]
# dialled Ringing state duration
# 5 123 NA <NA> 0
# 8 123 NA <NA> 60
# 19 222 NA inactive 0
# 21 222 NA inactive 37
# 25 123 NA inactive 0
# 27 123 NA active 60
If you prefer not to loop, then you can replace the mapply call with vec, defined as
vec <- replace(integer(2*length(le))+1L, c(FALSE, TRUE), le-1L)
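To make the index arithmetic in Option 1 concrete, here is a sketch with made-up run lengths. Interleaving 1 with each run length minus 1 and taking the cumulative sum yields the first and last row number of every run:

```r
le <- c(4L, 3L, 3L)  # hypothetical run lengths: rows 1-4, 5-7, 8-10

# mapply version: columns are (1, le[k] - 1), flattened column by column
idx_loop <- cumsum(mapply(c, 1L, le - 1L))

# vectorised version: a vector of 1s with every second entry set to le - 1
vec <- replace(integer(2 * length(le)) + 1L, c(FALSE, TRUE), le - 1L)
idx_vec <- cumsum(vec)

print(idx_loop)
print(idx_vec)
```

Note that a run of length 1 contributes the same row number twice, so that row would appear twice in the subset.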
Option 2: Add a helper id column. Then use dplyr functions to get the first and last rows based on that new id column.
library(dplyr)
## updated data with new column
DF2 <- cbind(id = rep.int(seq_along(le), le), ndf)
## group by id and filter on the first and last rows
slice(group_by(DF2, id), c(1, n()))
# id dialled Ringing state duration
# 1 1 123 NA NA 0
# 2 1 123 NA NA 60
# 3 2 222 NA inactive 0
# 4 2 222 NA inactive 37
# 5 3 123 NA inactive 0
# 6 3 123 NA active 60
You can remove the helper column if you want, but it might come in handy later too.

How can I find the first and last observation within each group in R, for every by-group, when two groups are repeated one after another?

data is below
dialled Ringing state duration
NA NA NA 0
NA NA NA 0
NA NA NA 0
NA NA NA 0
123 NA NA 0
123 NA NA 0
123 NA NA 0
123 NA NA 60
NA NA active 0
NA NA active 0
NA NA inactive 0
NA NA inactive 0
123 NA inactive 0
123 NA inactive 0
123 NA inactive 0
NA NA inactive 0
NA NA inactive 0
NA NA inactive 0
222 NA inactive 0
222 NA inactive 0
222 NA inactive 37
NA NA active 0
NA NA active 0
NA NA inactive 0
123 NA inactive 0
123 NA inactive 0
123 NA active 60
NA NA active 0
NA NA active 0
NA NA active 0
NA NA active 0
123 NA inactive 0
123 NA inactive 0
123 NA inactive 0
The answer I am looking for is
dialled Ringing state duration
123 NA NA 0
123 NA NA 60
123 NA inactive 0
123 NA inactive 0
222 NA inactive 0
222 NA inactive 37
123 NA inactive 0
123 NA inactive 60
123 NA inactive 0
123 NA inactive 0
Also, could you help me get the row immediately following the last row of each group and rbind them?
In data.table v1.9.5 there is a new function, rleid(), that makes this task fairly straightforward. You can install that version by following the data.table installation instructions.
require(data.table)
setDT(df)[, if (!is.na(dialled[1L])) .SD[c(1L, .N)],
by=.(dialled, rleid(dialled))]
# dialled rleid Ringing state duration
# 1: 123 2 NA NA 0
# 2: 123 2 NA NA 60
# 3: 123 4 NA inactive 0
# 4: 123 4 NA inactive 0
# 5: 222 6 NA inactive 0
# 6: 222 6 NA inactive 37
# 7: 123 8 NA inactive 0
# 8: 123 8 NA active 60
# 9: 123 10 NA inactive 0
# 10: 123 10 NA inactive 0
.SD contains the subset of data for each group specified in by =.
You could create a grouping variable "grp". Keep the rows of "df" where "grp" is not 0, use slice to get the first and last row for each "grp", then ungroup and remove the "grp" variable.
rl <- rle(!is.na(df$dialled))
grp <- inverse.rle(within.list(rl,
values[values] <- cumsum(values)[values]))
df$grp <- grp
library(dplyr)
df %>%
filter(grp!=0) %>%
group_by(grp) %>%
slice(c(1, n()))%>%
ungroup() %>%
select(-grp)
# dialled Ringing state duration
#1 123 NA NA 0
#2 123 NA NA 60
#3 123 NA inactive 0
#4 123 NA inactive 0
#5 222 NA inactive 0
#6 222 NA inactive 37
#7 123 NA inactive 0
#8 123 NA active 60
#9 123 NA inactive 0
#10 123 NA inactive 0
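The construction of "grp" can be followed step by step on a short made-up vector. Runs of NA get 0, and each run of non-NA values gets the next integer:

```r
dialled <- c(NA, 123, 123, NA, NA, 123, 222, 222, NA)  # hypothetical input

rl <- rle(!is.na(dialled))
# rl$values is FALSE TRUE FALSE TRUE FALSE; number only the TRUE runs
rl$values[rl$values] <- cumsum(rl$values)[rl$values]
grp <- inverse.rle(rl)

print(grp)
```

As far as I can tell, this groups on "NA versus not NA" only, so two different numbers that follow each other with no NA row in between (123 then 222 in this toy vector) land in the same group.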
Or a base R option would be to get the row index of first and last rows of subset dataset "df1" based on "grp" and then use it to extract the rows.
df1 <- df[grp!=0,]
df2 <- df1[unlist(tapply(1:nrow(df1), grp[grp!=0],
FUN=function(x) c(head(x,1), tail(x,1)))),]
Update
It is not clear from the comments. Perhaps this helps
df2 %>%
group_by(grp) %>%
filter(any(duration>0)) %>%
slice(1)
# dialled Ringing state duration grp
#1 123 NA NA 0 1
#2 222 NA inactive 0 3
#3 123 NA inactive 0 4
data
df <- structure(list(dialled = c(NA, NA, NA, NA, 123L, 123L, 123L,
123L, NA, NA, NA, NA, 123L, 123L, 123L, NA, NA, NA, 222L, 222L,
222L, NA, NA, NA, 123L, 123L, 123L, NA, NA, NA, NA, 123L, 123L,
123L), Ringing = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, NA, NA, NA,
NA, NA, "active", "active", "inactive", "inactive", "inactive",
"inactive", "inactive", "inactive", "inactive", "inactive", "inactive",
"inactive", "inactive", "active", "active", "inactive", "inactive",
"inactive", "active", "active", "active", "active", "active",
"inactive", "inactive", "inactive"), duration = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 37L, 0L, 0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), .Names = c("dialled", "Ringing", "state", "duration"),
class = "data.frame", row.names = c(NA, -34L))

r- max values of a row

I need to find the max value from a row, excluding the first column (which is a character).
I have a table MDist
> MDist
c.1. V2 V3 V4 V5 V6 V7 V8
1 repeticiones 0 0 1 1 1 2 <NA>
2 dias 0 0 12 15 20 28 sumas
3 0 NA NA NA NA NA NA 0
4 0 NA NA NA NA NA NA 0
5 12 NA NA 0 3 8 30 41
6 15 NA NA 3 0 5 26 34
7 20 NA NA 8 5 0 16 29
8 28 NA NA 15 13 8 0 36
I keep only the last column and transpose it:
> b<-data.frame(t(MDist[2:nrow(MDist), ncol(MDist)]))
> b
X1 X2 X3 X4 X5 X6 X7
1 sumas 0 0 41 34 29 36
sapply(b,class)
X1 X2 X3 X4 X5 X6 X7
"factor" "factor" "factor" "factor" "factor" "factor" "factor"
When I try to convert it to numeric, I get a vector full of 1.
> c<-as.numeric(b[1,2:ncol(b)])
> c
[1] 1 1 1 1 1 1
Also with as.numeric(as.character()) I get the same issue:
> as.numeric(as.character(b[1,2:ncol(b)]))
[1] 1 1 1 1 1 1
I need to get a line with every value of the original table (b) divided by the maximum value of that line. That would be :
0 0 1 34/41 29/41 36/41
Also:
within(MDist, rowMax <- do.call(`pmax`,
c(MDist[sapply(MDist, is.numeric)], na.rm=TRUE)))
# c.1. V2 V3 V4 V5 V6 V7 V8 rowMax
#1 repeticiones 0 0 1 1 1 2 <NA> 2
#2 dias 0 0 12 15 20 28 sumas 28
#3 0 NA NA NA NA NA NA 0 NA
#4 0 NA NA NA NA NA NA 0 NA
#5 12 NA NA 0 3 8 30 41 30
#6 15 NA NA 3 0 5 26 34 26
#7 20 NA NA 8 5 0 16 29 16
#8 28 NA NA 15 13 8 0 36 15
If you want to divide the last column by the maximum of that column:
MDist[,ncol(MDist)] <- as.numeric(as.character(MDist[, ncol(MDist)]))
MDist[,ncol(MDist)]/max(MDist[,ncol(MDist)], na.rm=TRUE)
# [1] NA NA 0.0000000 0.0000000 1.0000000 0.8292683 0.7073171
#[8] 0.8780488
data
MDist <- structure(list(c.1. = structure(c(7L, 6L, 1L, 1L, 2L, 3L, 4L,
5L), .Label = c("0", "12", "15", "20", "28", "dias", "repeticiones"
), class = "factor"), V2 = c(0L, 0L, NA, NA, NA, NA, NA, NA),
V3 = c(0L, 0L, NA, NA, NA, NA, NA, NA), V4 = c(1L, 12L, NA,
NA, 0L, 3L, 8L, 15L), V5 = c(1L, 15L, NA, NA, 3L, 0L, 5L,
13L), V6 = c(1L, 20L, NA, NA, 8L, 5L, 0L, 8L), V7 = c(2L,
28L, NA, NA, 30L, 26L, 16L, 0L), V8 = structure(c(6L, 7L,
1L, 1L, 5L, 3L, 2L, 4L), .Label = c("0", "29", "34", "36",
"41", "<NA>", "sumas"), class = "factor")), .Names = c("c.1.",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
I used lapply. It worked but I would like to understand better why I wasn't able to do it the other way.
> as.numeric(lapply(b[1,2:ncol(b)], as.character))
[1] 0 0 41 34 29 36
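A likely explanation, sketched in base R: as.numeric on a factor returns the internal level codes rather than the labels, and each column of b is a factor with a single level, so every code is 1. The lapply version works because as.character is applied to each factor column separately. The vector f below is made up from the values in the question:

```r
f <- factor(c("41", "34", "29", "36"))

codes  <- as.numeric(f)                # level codes, not the labels
values <- as.numeric(as.character(f))  # the numbers the labels spell out

one_level <- factor("41")              # like a single column of b
code_one  <- as.numeric(one_level)     # the only level has code 1

print(codes)
print(values)
print(values / max(values))            # each value divided by the maximum
```

The levels are sorted as strings ("29", "34", "36", "41"), which is where the codes 4 2 1 3 come from.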

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) containing only the samples that have non-NA values for Shoulders. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
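complete.cases works here: it returns a logical vector that you can use to subset the rows by Shoulders. Your attempt is only missing the comma after the row index; without it, data[...] is interpreted as selecting columns.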
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
There is a subtle difference between using is.na and complete.cases.
is.na removes rows with actual NA values, whereas the objective here is only to control for one variable, not to deal with missing values/NAs in general, which could be legitimate data points.
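A small sketch, on a made-up data frame d, of how the two functions relate. As far as I can tell, for a single column they give the same logical vector; a difference only shows up when complete.cases is given several columns, because it then requires all of them to be non-NA. subset(), which the question asks about, is included as a third route:

```r
d <- data.frame(Shoulders = c(13L, NA, 18L),
                Toes      = c(324L, 5L, NA))  # hypothetical values

keep_cc  <- complete.cases(d$Shoulders)  # one column only
keep_na  <- !is.na(d$Shoulders)          # same rows as keep_cc
keep_all <- complete.cases(d)            # every column must be non-NA

by_index  <- d[keep_na, ]
by_subset <- subset(d, !is.na(Shoulders))

print(keep_cc)
print(keep_all)
print(by_subset)
```

Row 3 survives the Shoulders-only filter even though Toes is NA there, which is the behaviour the question asks for.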
