R: max values of a row

I need to find the max value from a row, excluding the first column (which is a character).
I have a table MDist
> MDist
c.1. V2 V3 V4 V5 V6 V7 V8
1 repeticiones 0 0 1 1 1 2 <NA>
2 dias 0 0 12 15 20 28 sumas
3 0 NA NA NA NA NA NA 0
4 0 NA NA NA NA NA NA 0
5 12 NA NA 0 3 8 30 41
6 15 NA NA 3 0 5 26 34
7 20 NA NA 8 5 0 16 29
8 28 NA NA 15 13 8 0 36
I keep only the last column and transpose it:
> b<-data.frame(t(MDist[2:nrow(MDist), ncol(MDist)]))
> b
X1 X2 X3 X4 X5 X6 X7
1 sumas 0 0 41 34 29 36
sapply(b,class)
X1 X2 X3 X4 X5 X6 X7
"factor" "factor" "factor" "factor" "factor" "factor" "factor"
When I try to convert it to numeric, I get a vector full of 1.
> c<-as.numeric(b[1,2:ncol(b)])
> c
[1] 1 1 1 1 1 1
With as.numeric(as.character(...)) I get the same issue:
> as.numeric(as.character(b[1,2:ncol(b)]))
[1] 1 1 1 1 1 1
I need to get a line with every value of the original table (b) divided by the maximum value of that line. That would be :
0 0 1 34/41 29/41 36/41

Also:
within(MDist, rowMax <- do.call(`pmax`,
c(MDist[sapply(MDist, is.numeric)], na.rm=TRUE)))
# c.1. V2 V3 V4 V5 V6 V7 V8 rowMax
#1 repeticiones 0 0 1 1 1 2 <NA> 2
#2 dias 0 0 12 15 20 28 sumas 28
#3 0 NA NA NA NA NA NA 0 NA
#4 0 NA NA NA NA NA NA 0 NA
#5 12 NA NA 0 3 8 30 41 30
#6 15 NA NA 3 0 5 26 34 26
#7 20 NA NA 8 5 0 16 29 16
#8 28 NA NA 15 13 8 0 36 15
If you want to divide the last column by the maximum of that column:
MDist[,ncol(MDist)] <- as.numeric(as.character(MDist[, ncol(MDist)]))
MDist[,ncol(MDist)]/max(MDist[,ncol(MDist)], na.rm=TRUE)
# [1] NA NA 0.0000000 0.0000000 1.0000000 0.8292683 0.7073171
#[8] 0.8780488
data
MDist <- structure(list(c.1. = structure(c(7L, 6L, 1L, 1L, 2L, 3L, 4L,
5L), .Label = c("0", "12", "15", "20", "28", "dias", "repeticiones"
), class = "factor"), V2 = c(0L, 0L, NA, NA, NA, NA, NA, NA),
V3 = c(0L, 0L, NA, NA, NA, NA, NA, NA), V4 = c(1L, 12L, NA,
NA, 0L, 3L, 8L, 15L), V5 = c(1L, 15L, NA, NA, 3L, 0L, 5L,
13L), V6 = c(1L, 20L, NA, NA, 8L, 5L, 0L, 8L), V7 = c(2L,
28L, NA, NA, 30L, 26L, 16L, 0L), V8 = structure(c(6L, 7L,
1L, 1L, 5L, 3L, 2L, 4L), .Label = c("0", "29", "34", "36",
"41", "<NA>", "sumas"), class = "factor")), .Names = c("c.1.",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

I used lapply. It worked but I would like to understand better why I wasn't able to do it the other way.
> as.numeric(lapply(b[1,2:ncol(b)], as.character))
[1] 0 0 41 34 29 36
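As for why the direct conversion returns 1s, a likely explanation is that every column of b is a factor with a single level, so its internal integer code is 1, and a one-row slice such as b[1, 2:ncol(b)] is still a list of those factors rather than a plain vector. The sketch below uses a small stand-in for b (values copied from the question; the names f, vals and scaled are only for illustration) and converts each column separately before dividing by the maximum.

```r
# A one-level factor stores its label as level 1, so the integer code is 1.
f <- factor("41")
as.numeric(f)                # 1  (the code, not the label)
as.numeric(as.character(f))  # 41 (the label, converted)

# Stand-in for b: one row, every column a one-level factor.
b <- data.frame(X1 = factor("sumas"), X2 = factor("0"), X3 = factor("0"),
                X4 = factor("41"), X5 = factor("34"),
                X6 = factor("29"), X7 = factor("36"))

# Convert column by column (skipping the label column), then scale by the maximum.
vals <- as.numeric(sapply(b[1, -1], as.character))
scaled <- vals / max(vals)
scaled
```

Reading the table with stringsAsFactors = FALSE, or keeping the text labels out of the numeric columns, would avoid the factor conversion in the first place.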

Change variable in a column with a for loop using R

In my data frame, I have a column named Localisation. I want to change the data that is stored in it.
Thalamus, External capsule or Lenticulate for 0,
Cerebellum or Brain stem for 2 and
Frontal, Occipital, Parietal or Temporal for 1
I’m trying to do a For loop for this operation. My for statement doesn’t seem to be right, because I received
Error in for (. in i) 15:nrow(Localisation) :
4 arguments passed to 'for' which requires 3
and I don’t know how to compose my if statement.
dfM <- dfM %>%
for(i in dfM$Localisation) if(i = "Thalamus"| "External capsule"| "Lenticulate"){
dfM$Localisation <- "0"
} else if ( i = "Cerebellum"| "Brain stem") {
dfM$Localisation <- "2"
} else {
dfM$Localisation <- "1"
}
I know similar questions have been asked multiple times, but I can’t find a way to work with my data.
dput(head(dfM))
structure(list(new_id = c("5", "9", "10", "16", "30", "31"),
Localisation = c("Frontal", "Thalamus", "Occipital ", "Frontal",
"External capsule", "Cerebellum"), HIV.CT.initial = c(0L,
1L, 1L, 1L, 0L, 0L), Anticoagulant = c("Warfarin", "DOAC",
"Warfarin", "Warfarin", "Warfarin", "Warfarin"), Sex = c(1L,
1L, 1L, 2L, 2L, 1L), HTA = c(1L, 1L, 1L, 1L, 1L, 0L), Systolic_BP = c(116L,
169L, 164L, 109L, 134L, 146L), Diastolic_BP = c(70L, 65L,
80L, 60L, 75L, 85L), ACO = c(1L, 0L, 1L, 1L, 1L, 1L), Type.NACO = c(NA,
2L, NA, NA, NA, NA), Dose.ACO = c(NA, NA, NA, NA, NA, NA),
APT = c(0L, 0L, 1L, 0L, 1L, 0L), INR = c(4.2, 1.1, 1.9, 1.3,
3.6, 2.8), GLW = c(13L, 15L, 14L, 14L, 15L, 15L), GLW.Prog = c(NA,
NA, NA, NA, 14L, NA), mRS = c(4L, 3L, 4L, 6L, 6L, 5L), Time.to.scan = c(NA,
NA, NA, NA, NA, NA), Chx = c(0L, 0L, 0L, 0L, 0L, 0L), Type.Chx = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), Date.Chx = c("",
"", "", "", "", ""), décès.hospit = c(0L, 0L, 0L, 1L, 1L,
0L), décès.HIP = c(NA, NA, NA, 2L, 1L, NA), X3.mois = c(NA,
NA, NA, NA, NA, 1L), X6.mois = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X12.mois = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.mois.1 = c(NA, NA, NA, 1L, 1L, 1L), X6.mois.1 = c(NA,
NA, NA, 1L, 1L, 1L), X12.mois.1 = c(NA, NA, NA, 1L, 1L, 1L
), X.1.y = c(NA, NA, NA, NA, NA, NA), ID = c(5L, 9L, 10L,
16L, 30L, 31L)), row.names = c(NA, 6L), class = "data.frame")
Up front: in general, don't use for in a %>%-pipe, it almost never is what you intend.
Since you have the %>%, I'll infer you're using dplyr and/or other related packages. Here's one way:
library(dplyr)
dfM %>%
mutate(Localisation = case_when(
Localisation %in% c("Thalamus", "External capsule", "Lenticulate") ~ "0",
Localisation %in% c("Cerebellum", "Brain stem") ~ "2",
TRUE ~ "1")
)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
Perhaps relevant: one of your values, "Occipital ", has a space at the end. If stray spaces can appear in your data and you need the matching to work anyway, replace each Localisation %in% with trimws(Localisation) %in% to reduce the chance of a misclassification.
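To illustrate the effect of a stray space, here is a minimal base R sketch. The vector loc and the helper recode_loc are hypothetical, made up for this example, and "Thalamus " with a trailing space is not a value from the posted data.

```r
# Hypothetical values; "Thalamus " carries a stray trailing space.
loc <- c("Frontal", "Thalamus ", "Occipital ", "Cerebellum")

# Same three-way recoding as above, written with nested ifelse().
recode_loc <- function(x) {
  ifelse(x %in% c("Thalamus", "External capsule", "Lenticulate"), "0",
         ifelse(x %in% c("Cerebellum", "Brain stem"), "2", "1"))
}

recode_loc(loc)          # "Thalamus " falls through to the default "1"
recode_loc(trimws(loc))  # trimmed first, so it is matched as "0"
```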
Another method is to maintain a frame of translations:
translations <- tribble(
~Localisation, ~NewLocalisation,
"Thalamus", "0",
"External capsule", "0",
"Lenticulate", "0",
"Cerebellum", "2",
"Brain stem", "2"
)
dfM %>%
select(new_id, Localisation) %>% # just to reduce the display here on SO
left_join(., translations, by = "Localisation")
# new_id Localisation NewLocalisation
# 1 5 Frontal <NA>
# 2 9 Thalamus 0
# 3 10 Occipital <NA>
# 4 16 Frontal <NA>
# 5 30 External capsule 0
# 6 31 Cerebellum 2
I intentionally left the "default" step out to highlight that the NA values are natural: those are the ones not specifically identified.
dfM %>%
left_join(., translations, by = "Localisation") %>%
mutate(NewLocalisation = if_else(is.na(NewLocalisation), "1", NewLocalisation)) %>%
mutate(Localisation = NewLocalisation) %>%
select(-NewLocalisation)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
The reason I offer this alternative is that sometimes it's easier to maintain a separate table (in Excel, CSV form, etc.) in this fashion for automated processing of data. It doesn't offer any capability that the case_when or @dcarlson's ifelse methods do not.
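If the translation table does live in a CSV file, the lookup could be sketched in base R roughly as follows. The CSV content is inlined with read.csv(text = ...) so the example runs on its own; in practice it would presumably come from a file whose name and layout are up to you. The names csv_text, loc and new_loc are only illustrative.

```r
# Inline stand-in for an external CSV file of translations.
csv_text <- "Localisation,NewLocalisation
Thalamus,0
External capsule,0
Lenticulate,0
Cerebellum,2
Brain stem,2"
translations <- read.csv(text = csv_text, colClasses = "character")

# The six Localisation values from dput(head(dfM)), trailing space included.
loc <- c("Frontal", "Thalamus", "Occipital ", "Frontal",
         "External capsule", "Cerebellum")

# Look each value up; anything not listed gets the default "1".
idx <- match(trimws(loc), translations$Localisation)
new_loc <- translations$NewLocalisation[idx]
new_loc[is.na(new_loc)] <- "1"
new_loc
```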

Replace consecutive NA values with sum of another column

I am trying to replace each run of consecutive NA values with the sum of the values from another column over that run, but I'm a little confused.
What the data looks like:
df
# Distance Distance2
# 1 160 8
# 2 20 NA
# 3 30 15
# 4 100 11
# 5 35 NA
# 6 42 NA
# 7 10 NA
# 8 10 2
# 9 9 NA
# 10 20 NA
And I am looking to get a result like this:
df
# Distance Distance2
# 1 160 8
# 2 20 20
# 3 30 15
# 4 100 11
# 5 35 87
# 6 42 87
# 7 10 87
# 8 10 2
# 9 9 29
# 10 20 29
Thanks in advance for your help
We can use rleid to create groups and replace NA with sum of Distance values.
library(data.table)
setDT(df)[, Distance_new := replace(Distance2, is.na(Distance2),
sum(Distance)), rleid(Distance2)]
df
# Distance Distance2 Distance_new
# 1: 160 8 8
# 2: 20 NA 20
# 3: 30 15 15
# 4: 100 11 11
# 5: 35 NA 87
# 6: 42 NA 87
# 7: 10 NA 87
# 8: 10 2 2
# 9: 9 NA 29
#10: 20 NA 29
We can also use this in dplyr (rleid still comes from data.table):
library(dplyr)
df %>%
group_by(gr = data.table::rleid(Distance2)) %>%
mutate(Distance_new = replace(Distance2, is.na(Distance2), sum(Distance)))
data
df <- structure(list(Distance = c(160L, 20L, 30L, 100L, 35L, 42L, 10L,
10L, 9L, 20L), Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L,
NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))
You can group by consecutive NAs and replace with the sum, i.e.
library(dplyr)
df %>%
group_by(grp = cumsum(c(TRUE, diff(is.na(df$Distance2)) != 0))) %>%
mutate(Distance2 = replace(Distance2, is.na(Distance2), sum(Distance)))
# A tibble: 10 x 3
# Groups: grp [6]
Distance Distance2 grp
<int> <int> <int>
1 160 8 1
2 20 20 2
3 30 15 3
4 100 11 3
5 35 87 4
6 42 87 4
7 10 87 4
8 10 2 5
9 9 29 6
10 20 29 6
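For comparison, the same grouping idea can be sketched without any packages, using rle() to number the runs and ave() to sum Distance within each run. This is only an illustration of the technique, not a replacement for the answers above; grp, run_sum and Distance_new are names chosen for this example.

```r
df <- data.frame(
  Distance  = c(160L, 20L, 30L, 100L, 35L, 42L, 10L, 10L, 9L, 20L),
  Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L, NA, NA)
)

# Number each run of consecutive NA / non-NA values in Distance2.
runs <- rle(is.na(df$Distance2))
grp <- rep(seq_along(runs$lengths), runs$lengths)

# Sum Distance within each run, then use that sum only where Distance2 is NA.
run_sum <- ave(df$Distance, grp, FUN = sum)
df$Distance_new <- ifelse(is.na(df$Distance2), run_sum, df$Distance2)
df
```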
We can use fcoalesce, which fills each NA in Distance2 with the group sum of Distance:
library(data.table)
setDT(df)[, Distance2 := fcoalesce(Distance2, sum(Distance)),
rleid(Distance2)]
data
df <- structure(list(Distance = c(160L, 20L, 30L, 100L, 35L, 42L, 10L,
10L, 9L, 20L), Distance2 = c(8L, NA, 15L, 11L, NA, NA, NA, 2L,
NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))

Finding value in one data.frame and transferring value from other column

I don't know if I will be able to explain it correctly, but what I want to achieve is really simple.
This is the first data.frame. The important value for me is in the first column, "V1".
> dput(Data1)
structure(list(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L
), V2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NA", class = "factor"),
V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L)), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")
Second data.frame:
> dput(Data2)
structure(list(Names = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L,
8L), Herat = c(30L, 29L, 21L, 25L, 24L, 22L, 28L, 27L, 23L, 26L
), Grobpel = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "NA", class = "factor"), Hassynch = c(19L, 12L,
15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L)), .Names = c("Names",
"Herat", "Grobpel", "Hassynch"), row.names = c(NA, -10L), class = "data.frame"
)
The values from the first data.frame (V1) can be found in the first column of the second one (Names). I would like to copy the matching value from the fourth column (Hassynch) and put it in the second column of the first data.frame.
What is the fastest way to do this?
library(dplyr)
left_join(Data1, Data2, by=c("V1"="Names"))
# V1 V2 V3 Herat Grobpel Hassynch
# 1 10 NA 18 29 NA 12
# 2 5 NA 17 28 NA 14
# 3 3 NA 13 27 NA 16
# 4 9 NA 20 30 NA 19
# 5 1 NA 15 23 NA 18
# 6 2 NA 12 24 NA 11
# 7 6 NA 16 21 NA 15
# 8 4 NA 11 25 NA 20
# 9 8 NA 14 26 NA 17
# 10 7 NA 19 22 NA 13
# if you don't want V2 and V3, you could
left_join(Data1, Data2, by=c("V1"="Names")) %>%
select(-V2, -V3)
# V1 Herat Grobpel Hassynch
# 1 10 29 NA 12
# 2 5 28 NA 14
# 3 3 27 NA 16
# 4 9 30 NA 19
# 5 1 23 NA 18
# 6 2 24 NA 11
# 7 6 21 NA 15
# 8 4 25 NA 20
# 9 8 26 NA 17
# 10 7 22 NA 13
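If the goal is literally to write Hassynch into the second column of Data1, a base R lookup with match() is one possible sketch. The frames below keep only the columns needed for the lookup (the factor columns from the question are left out for brevity), so treat them as trimmed stand-ins for the posted data.

```r
Data1 <- data.frame(V1 = c(10L, 5L, 3L, 9L, 1L, 2L, 6L, 4L, 8L, 7L),
                    V2 = NA,
                    V3 = c(18L, 17L, 13L, 20L, 15L, 12L, 16L, 11L, 14L, 19L))
Data2 <- data.frame(Names    = c(9L, 10L, 6L, 4L, 2L, 7L, 5L, 3L, 1L, 8L),
                    Hassynch = c(19L, 12L, 15L, 20L, 11L, 13L, 14L, 16L, 18L, 17L))

# match() gives, for each V1, the row of Data2 whose Names equals it.
Data1$V2 <- Data2$Hassynch[match(Data1$V1, Data2$Names)]
Data1
```

Keys in V1 that do not occur in Names would come back as NA.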
Here's a toy example that I made some time ago to illustrate merge. left_join from dplyr is also good, and data.table almost certainly has another option.
You can subset your reference dataframe so that it contains only the key variable and value variable so that you don't end up with an unmanageable dataframe.
id<-as.numeric((1:5))
m<-c("a","a","a","","")
n<-c("","","b","b","b")
dfm<-data.frame(cbind(id,m))
head(dfm)
id m
1 1 a
2 2 a
3 3 a
4 4
5 5
dfn<-data.frame(cbind(id,n))
head(dfn)
id n
1 1
2 2
3 3 b
4 4 b
5 5 b
dfm$id<-as.numeric(dfm$id)
dfn$id<-as.numeric(dfn$id)
dfm<-subset(dfm,id<4)
head(dfm)
id m
1 1 a
2 2 a
3 3 a
dfn<-subset(dfn,id!=1 & id!=2)
head(dfn)
id n
3 3 b
4 4 b
5 5 b
df.all<-merge(dfm,dfn,by="id",all=TRUE)
head(df.all)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
4 4 <NA> b
5 5 <NA> b
df.all.m<-merge(dfm,dfn,by="id",all.x=TRUE)
head(df.all.m)
id m n
1 1 a <NA>
2 2 a <NA>
3 3 a b
df.all.n<-merge(dfm,dfn,by="id",all.y=TRUE)
head(df.all.n)
id m n
1 3 a b
2 4 <NA> b
3 5 <NA> b

How to sample from categorical variables in R data.frame ?

I am trying to sample from an R data frame, but I have some problems with the categorical variables.
I am not taking a random subsample of rows; instead, I am generating rows such that each new variable individually has the same distribution as the original one.
> head(x0)
Symscore1 Symscore2 exercise3 exerciseduration3 groupchange age3
3 1 0 1 0 Transitional to Menopausal 52
4 0 0 5 2 Transitional to Menopausal 62
6 0 0 2 0 Transitional to Menopausal 54
8 0 0 5 3 Transitional to Menopausal 56
10 0 0 4 3 Transitional to Menopausal 59
13 0 1 4 3 Transitional to Menopausal 55
packyears bmi3 education3
3 2.357143 23.24380 Basic
4 2.000000 16.76574 University
6 1.000000 23.30668 Basic
8 1.428571 22.14533 University
10 1.428571 22.14533 University
13 0.000000 22.03857 University
> xa = as.data.frame(sapply(X = x0, FUN = sample))
> head(xa)
Symscore1 Symscore2 exercise3 exerciseduration3 groupchange age3 packyears
1 1 0 2 3 4 49 53.571430
2 0 0 3 0 3 46 2.142857
3 1 0 3 3 4 49 4.000000
4 0 1 3 3 4 58 0.000000
5 0 0 2 0 1 57 0.000000
6 0 0 3 0 1 47 26.871429
bmi3 education3
1 25.84777 2
2 21.25850 2
3 25.79592 3
4 23.93899 1
5 25.97012 2
6 23.53037 2
> X = rbind(x0,xa)
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = c(4, 3, 4, 4, 1, 1, 2, 4, 4, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = c(2, 2, 3, 1, 2, 2, 3, 2, 2, :
invalid factor level, NA generated
>
You could try:
x2 <- x0
x2[] <- lapply(x0, FUN = sample)
x2
# Symscore1 Symscore2 exercise3 exerciseduration3 groupchange
#3 0 0 1 0 Transitional to Menopausal
#4 0 0 5 3 Transitional to Menopausal
#6 0 0 4 3 Transitional to Menopausal
#8 0 0 2 0 Transitional to Menopausal
#10 1 1 4 3 Transitional to Menopausal
#13 0 0 5 2 Transitional to Menopausal
age3
#3 54
#4 59
#6 52
#8 56
#10 62
#13 5
rbind(x0,x2)
data
x0 <- structure(list(Symscore1 = c(1L, 0L, 0L, 0L, 0L, 0L), Symscore2 = c(0L,
0L, 0L, 0L, 0L, 1L), exercise3 = c(1L, 5L, 2L, 5L, 4L, 4L), exerciseduration3 = c(0L,
2L, 0L, 3L, 3L, 3L), groupchange = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "Transitional to Menopausal", class = "factor"),
age3 = c(52L, 62L, 54L, 56L, 59L, 5L)), .Names = c("Symscore1",
"Symscore2", "exercise3", "exerciseduration3", "groupchange",
"age3"), class = "data.frame", row.names = c("3", "4", "6", "8",
"10", "13"))
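A likely reason the original attempt failed is that sapply() simplifies its result to a matrix, and a matrix can hold only one type, so factor columns end up as their integer codes. Assigning the lapply() result into x2[] keeps each column's class. The toy frame below is hypothetical (the columns score and group are invented) and only meant to show the difference.

```r
# Hypothetical toy data with one integer and one factor column.
x0 <- data.frame(score = c(1L, 0L, 0L, 1L),
                 group = factor(c("a", "b", "a", "c")))

# sapply() collapses the columns into a matrix: the factor is lost.
via_sapply <- as.data.frame(sapply(x0, sample))
is.factor(via_sapply$group)

# lapply() plus x2[] <- keeps the data frame structure and column classes.
x2 <- x0
x2[] <- lapply(x0, sample)
is.factor(x2$group)

# Both frames now have compatible factor columns, so rbind() works cleanly.
X <- rbind(x0, x2)
```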

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) containing only the samples that have non-NA values in the Shoulders column. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
complete.cases works here: it returns a logical vector that you can use to subset the rows of data by Shoulders. Note the comma after the row index; without it, R treats the index as a column selection.
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
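There is a subtle difference between using is.na and complete.cases.
On a single column the two select the same rows: complete.cases(data$Shoulders) is equivalent to !is.na(data$Shoulders). The difference appears when complete.cases is given several columns or the whole data frame, because it then drops every row with an NA in any of them. The objective here is only to filter on one variable, so rows with NA in other columns (which could be legitimate data points) should be kept.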
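Since the question also asks about subset(), here is a short sketch comparing the three routes on a cut-down, hypothetical version of the data (four rows, three columns). The object names by_cc, by_isna, by_subset and all_complete are only for illustration.

```r
data <- data.frame(Sample    = 1:4,
                   Shoulders = c(13L, NA, 18L, NA),
                   Toes      = c(324L, 5L, NA, NA))

# Three ways to keep rows where Shoulders is not NA.
by_cc     <- data[complete.cases(data$Shoulders), ]
by_isna   <- data[!is.na(data$Shoulders), ]
by_subset <- subset(data, !is.na(Shoulders))

# complete.cases() on the whole frame is stricter: it also drops
# the row where only Toes is missing.
all_complete <- data[complete.cases(data), ]
```

subset() reads a little more cleanly in interactive use; the bracket forms are generally the safer choice inside functions.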
