R: Column Heading Conversions

I have the following dataset. My objective is to understand how many mammals and birds there are in the two types of locations.
df1:
Location Type Cat Mouse Dog Chicken Turkey Horse
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I want it to read as
df2:
Location Type M M M B B M
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
with 'M' denoting Mammal and 'B' denoting Bird
I tried to manually enter the data into my .csv file and use it in R, however, the file gets read as
df2:
Location Type M M1 M2 B B1 M3
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I am not sure why each 'M' or 'B' column gets numbered separately. How can I prevent this from happening?
Alternatively, I also have each animal classified as a mammal or a bird in another data frame, as below:
dfanimal:
Name of Animal Mammal/Bird
Cat Mammal
Dog Mammal
Mouse Mammal
Chicken Bird
Turkey Bird
Horse Mammal
Is there a way for me to work directly with the two data frames, df1 and dfanimal?
Would be very thankful for any help.

After you manually change the column names, pass check.names = FALSE when importing the csv. Duplicate column names in a data frame are not advised, so by default R appends suffixes to make them unique.
df1 <- read.csv('location/of/file.csv', check.names = FALSE)
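A minimal sketch of the difference, using an inline CSV string instead of a file (the string csv_text and the variable names are made up for illustration):

```r
# Hypothetical inline CSV with duplicated headers, standing in for the real file
csv_text <- "Location,M,M,B\n1,1,0,1\n2,0,1,0"

# Default (check.names = TRUE): duplicated headers are made unique
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE: headers are kept exactly as written
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(kept_names)
```

Keep in mind that selecting by name, such as df1$M, only returns the first matching column when names are duplicated.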
If you want to use df_animal to change the column names instead, look up each column name with match and keep the first letter of its class
names(df1)[-1] <- substr(df_animal$Mammal.Bird[match(names(df1)[-1],
df_animal$Name_of_Animal)], 1, 1)
df1
# Location M M M B B M
#1 1 1 0 0 1 0 1
#2 1 0 0 1 0 1 0
#3 2 1 1 1 1 1 1
#4 2 0 1 0 0 0 0
#5 1 1 1 0 0 1 0
data
df1 <- structure(list(Location = c(1L, 1L, 2L, 2L, 1L), Cat = c(1L,
0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L), Dog = c(0L, 1L,
1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L), Turkey = c(0L,
1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -5L))
df_animal <- structure(list(Name_of_Animal = structure(c(1L, 3L, 5L, 2L, 6L,
4L), .Label = c("Cat", "Chicken", "Dog", "Horse", "Mouse", "Turkey"
), class = "factor"), Mammal.Bird = structure(c(2L, 2L, 2L, 1L,
1L, 2L), .Label = c("Bird", "Mammal"), class = "factor")), class = "data.frame",
row.names = c(NA, -6L))
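If the real goal is to count mammals and birds per location, renaming the columns may not be needed at all. A possible base R sketch, assuming the same df1 and df_animal as above (rebuilt here with data.frame so the block runs on its own; the helper names class_of_col, per_row and counts are illustrative):

```r
df1 <- data.frame(Location = c(1L, 1L, 2L, 2L, 1L),
                  Cat = c(1L, 0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L),
                  Dog = c(0L, 1L, 1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L),
                  Turkey = c(0L, 1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L))
df_animal <- data.frame(
  Name_of_Animal = c("Cat", "Dog", "Mouse", "Chicken", "Turkey", "Horse"),
  Mammal.Bird = c("Mammal", "Mammal", "Mammal", "Bird", "Bird", "Mammal"))

# Class (Mammal/Bird) of every animal column in df1
animal_cols <- names(df1)[-1]
class_of_col <- as.character(
  df_animal$Mammal.Bird[match(animal_cols, df_animal$Name_of_Animal)])

# Per row, add up the columns that belong to each class
per_row <- sapply(split(animal_cols, class_of_col),
                  function(cols) rowSums(df1[cols]))

# Then total those row sums within each Location
counts <- rowsum(per_row, group = df1$Location)
print(counts)
```

This keeps the original animal names in df1, so there are no duplicated column names to work around.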

Related

Select columns on match with vector and create ifelse condition with their content

I have a dataset with several diseases, 0 indicating not having the disease and 1 having the disease.
To illustrate with an example: I am interested in disease A and whether the people in the dataset have this disease on its own or as a consequence of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but not any of the known diseases that can cause it.
"Secondary" means that they have disease A and a disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this result:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
using data.table
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][DiseaseA == 1, Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"), .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
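For comparison, a base R sketch of the same classification. The attempt in the question fails because sum(names(DF) %in% SecondaryCauses) counts matching column names (always 2 here), not the values in each row; rowSums over the selected columns gives the per-row check instead. The variable has_cause is an illustrative name, and the data frame is rebuilt so the block runs on its own:

```r
df <- data.frame(ID = 1:5,
                 DiseaseA = c(0L, 1L, 1L, 1L, 0L),
                 DiseaseB = c(1L, 0L, 0L, 0L, 0L),
                 DiseaseC = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseD = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseE = c(0L, 1L, 0L, 1L, 0L))
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# TRUE for rows where at least one of the possible causes is present
has_cause <- rowSums(df[SecondaryCauses]) > 0

# Disease A absent -> NotDiseasedWithA; otherwise split on has_cause
df$Type <- ifelse(df$DiseaseA == 0, "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))
print(df)
```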

How to change a binary variable dependent upon another binary variable?

I'm using a larger version of the following dataset:
Name HB>TE-iL TE-iL TE-iL^
John 1 1 0
Eric 0 0 0
Mike 1 1 0
Jim 0 1 0
Joe 1 0 1
...
If column HB>TE-iL for a given observation is equal to one, I want the TE-iL and TE-iL^ columns to become zeros. So the above data frame would become:
Name HB>TE-iL TE-iL TE-iL^
John 1 0 0
Eric 0 0 0
Mike 1 0 0
Jim 0 1 0
Joe 1 0 0
...
Is this possible? Thanks in advance.
# Detect where it is 1
is_1 = your_data[["HB>TE-iL"]] == 1
# set to 0
your_data[is_1, c("TE-iL", "TE-iL^")] = 0
We could use
df[c("TE-iL", "TE-iL^")] <- df[c("TE-iL", "TE-iL^")] * !df[["HB>TE-iL"]]
-output
df
# Name HB>TE-iL TE-iL TE-iL^
#1 John 1 0 0
#2 Eric 0 0 0
#3 Mike 1 0 0
#4 Jim 0 1 0
#5 Joe 1 0 0
data
df <- structure(list(Name = c("John", "Eric", "Mike", "Jim", "Joe"),
`HB>TE-iL` = c(1L, 0L, 1L, 0L, 1L), `TE-iL` = c(1L, 0L, 1L,
1L, 0L), `TE-iL^` = c(0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-5L))
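I think this should work (the TRUE branch keeps the original value wherever HB>TE-iL is not 1; without it those rows would become NA)
library(dplyr)
df <- df %>%
mutate(
`TE-iL` = case_when(`HB>TE-iL` == 1 ~ 0L, TRUE ~ `TE-iL`),
`TE-iL^` = case_when(`HB>TE-iL` == 1 ~ 0L, TRUE ~ `TE-iL^`)
)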

Sorting data with some similar words in R

I have a database with 100 columns, but a minimal reproduction of my data is as follows:
df1 <- read.table(text="PG1S1AW KOM1S1zo PG2S2AW KOM2S2zo PG3S3AW KOM3S3zo PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
4 1 2 4 4 3 0 4 0 5
4 4 3 1 3 1 0 3 0 1
2 3 5 3 3 2 1 4 0 2
1 1 1 1 1 3 0 5 0 1
2 5 3 4 4 5 0 1 3 4", header=TRUE)
I want to get the columns starting with KOM and PG whose number is greater than 3, so we need PG4, KOM4 and above. Put simply, I want the PG and KOM columns that share the same number, where that number is 4 or greater.
The intended output is:
PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
0 4 0 5
0 3 0 1
1 4 0 2
0 5 0 1
0 1 3 4
I have used the following code, but it does not work for me:
df2<- df1%>% select(contains("KO"))
Thanks for your help.
The pattern is not entirely clear from the question. We create a function (f1) that uses str_extract (from stringr) to extract the one or more digits (\\d+) that follow 'KOM' or (|) 'PG' and converts them to numeric ('v1'); similarly, it extracts the number after the 'S' ('v2'). It then checks whether the two values are the same and whether the value is greater than 3. The check is wrapped in which, which returns the matching indices and drops any NAs produced when str_extract finds no match. Use the function in select to pick the columns that follow the pattern
library(dplyr)
library(stringr)
f1 <- function(nm) {
v1 <- as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
v2 <- as.numeric(str_extract(nm, "(?<=S)\\d+"))
nm[which((v1 == v2) & (v1 > 3))]
}
df1 %>%
select(f1(names(.)))
# PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#1 0 4 0 5
#2 0 3 0 1
#3 1 4 0 2
#4 0 5 0 1
#5 0 1 3 4
data
df1 <- structure(list(PG1S1AW = c(4L, 4L, 2L, 1L, 2L), KOM1S1zo = c(1L,
4L, 3L, 1L, 5L), PG2S2AW = c(2L, 3L, 5L, 1L, 3L), KOM2S2zo = c(4L,
1L, 3L, 1L, 4L), PG3S3AW = c(4L, 3L, 3L, 1L, 4L), KOM3S3zo = c(3L,
1L, 2L, 3L, 5L), PG4S4AW = c(0L, 0L, 1L, 0L, 0L), KOM4S4zo = c(4L,
3L, 4L, 5L, 1L), PG5S5AW = c(0L, 0L, 0L, 0L, 3L), KOM5S5zo = c(5L,
1L, 2L, 1L, 4L)), class = "data.frame", row.names = c(NA, -5L
))
Given your example data, you can instead just look for the digits 4 or 5 in the column names.
df1 %>%
select(matches("4|5"))
#> PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#> 1 0 4 0 5
#> 2 0 3 0 1
#> 3 1 4 0 2
#> 4 0 5 0 1
#> 5 0 1 3 4
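The shortcut above depends on the example only going up to 5. A hedged base R sketch that parses the number after the prefix instead, so it also handles two-digit numbers; the vector nms, including the KOM10S10zo name, is made up for illustration, and every name is assumed to follow the PREFIX-number-S pattern:

```r
# Illustrative column names; KOM10S10zo is not in the original data
nms <- c("PG1S1AW", "KOM1S1zo", "PG4S4AW", "KOM4S4zo", "PG5S5AW", "KOM10S10zo")

# Capture the digits that directly follow the KOM or PG prefix
num <- as.numeric(sub("^(KOM|PG)(\\d+)S.*", "\\2", nms))

# Keep the names whose number is greater than 3
keep <- nms[num > 3]
print(keep)
```

With a real data frame the selection would then be df1[keep]. Names that do not fit the pattern would give NA here, so they would need filtering first.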

How to split a data frame by the string name in the column and write respective outputs in a file?

I have a dataframe like this:
> e=read.table("SG.genotypes.txt", header=TRUE)
> head(e)
ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
1 snp_3_47609552 0 1 1 1 1 0 1
2 snp_3_47614413 0 1 1 1 1 0 1
3 snp_3_47616151 0 1 1 1 1 0 1
4 snp_2_47616155 0 1 1 1 1 0 1
5 snp_2_47617504 0 1 1 1 1 0 1
6 snp_5_47617679 0 1 1 1 1 0 1
...
My data frame has many more snp_ names, but how would I split this example into 3 output files named chr_2, chr_3 and chr_5,
where the chr_3 file, for example, will have just these lines:
ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
1 snp_3_47609552 0 1 1 1 1 0 1
2 snp_3_47614413 0 1 1 1 1 0 1
3 snp_3_47616151 0 1 1 1 1 0 1
One way to do this would be to split column ID by string name and create two columns, but I wonder if there is a better way to do this.
We can substring the 'ID' column and use that to split
lst1 <- split(df1, substr(df1$ID, 1, 5))
Note that if the number after the 'snp_' is greater than 9, it may be better to use sub instead of substr
lst1 <- split(df1, sub("^(snp_\\d+)_.*", "\\1", df1$ID))
names(lst1) <- sub("snp", "chr", names(lst1))
lst1
#$chr_2
# ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
#4 snp_2_47616155 0 1 1 1 1 0 1
#5 snp_2_47617504 0 1 1 1 1 0 1
#$chr_3
# ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
#1 snp_3_47609552 0 1 1 1 1 0 1
#2 snp_3_47614413 0 1 1 1 1 0 1
#3 snp_3_47616151 0 1 1 1 1 0 1
#$chr_5
# ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
#6 snp_5_47617679 0 1 1 1 1 0 1
Loop through the names of the list and write each element to a .csv file
lapply(names(lst1), function(nm) write.csv(lst1[[nm]],
file = paste0(nm, ".csv"), quote = FALSE, row.names = FALSE))
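To see why sub is safer than substr once the number after 'snp_' has more than one digit, here is a small sketch; the ids vector, including the snp_12 entry, is invented for illustration:

```r
# Illustrative IDs; snp_12_... is hypothetical and not in the original data
ids <- c("snp_2_47616155", "snp_12_47617504", "snp_1_47617679")

# Fixed-width prefix: snp_12 is cut to snp_1 and merges with the real snp_1
by_substr <- substr(ids, 1, 5)

# Regex prefix: keeps all digits up to the second underscore
by_sub <- sub("^(snp_\\d+)_.*", "\\1", ids)

print(by_substr)
print(by_sub)
```

Splitting on by_substr would put the snp_12 and snp_1 rows in the same group, while by_sub keeps them apart.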
data
df1 <- structure(list(ID = c("snp_3_47609552", "snp_3_47614413", "snp_3_47616151",
"snp_2_47616155", "snp_2_47617504", "snp_5_47617679"), HG00096 = c(0L,
0L, 0L, 0L, 0L, 0L), HG00097 = c(1L, 1L, 1L, 1L, 1L, 1L), HG00099 = c(1L,
1L, 1L, 1L, 1L, 1L), HG00100 = c(1L, 1L, 1L, 1L, 1L, 1L), HG00101 = c(1L,
1L, 1L, 1L, 1L, 1L), HG00102 = c(0L, 0L, 0L, 0L, 0L, 0L), HG00103 = c(1L,
1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Segregating dataset and name each new dataset as per unique column names

I have a dataset(nm) as shown below:
nm
2_V2O 10_Kutti 14_DD 15_TT 16_DD 19_V2O 20_Kutti
0 1 1 0 0 1 0
1 1 1 1 1 0 0
0 1 0 1 0 0 1
0 1 1 0 1 0 0
Now I want to split this into multiple new datasets, grouped by the unique part of the column names. Each dataset must also be named after that part, as shown below:
Kutti
10_Kutti 20_Kutti
1 0
1 0
1 1
1 0
V2O
2_V2O 19_V2O
0 1
1 0
0 0
0 0
DD
14_DD 16_DD
1 0
1 1
0 0
1 1
TT
15_TT
0
1
1
0
I know this can be done using "select" function in dplyr but I need one dynamic programme which builds this automatically for any dataset.
We can split by a substring of the column names of 'nm'. Remove the prefix of each column name up to the _ with sub and use the result to split 'nm'.
lst <- split.default(nm, sub(".*_", "", names(nm)))
lst
#$DD
# 14_DD 16_DD
#1 1 0
#2 1 1
#3 0 0
#4 1 1
#$Kutti
# 10_Kutti 20_Kutti
#1 1 0
#2 1 0
#3 1 1
#4 1 0
#$TT
# 15_TT
#1 0
#2 1
#3 1
#4 0
#$V2O
# 2_V2O 19_V2O
#1 0 1
#2 1 0
#3 0 0
#4 0 0
It is better to keep the data.frames in a list. If we insist that it should be individual data.frame objects in the global environment (not recommended), use list2env
list2env(lst, envir = .GlobalEnv)
Now, just call
DD
data
nm <- structure(list(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L,
1L, 1L, 1L), `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L,
1L, 0L), `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L,
0L), `20_Kutti` = c(0L, 0L, 1L, 0L)), .Names = c("2_V2O", "10_Kutti",
"14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti"), class = "data.frame",
row.names = c(NA, -4L))
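One reason to keep the pieces in a list is that the same operation can then be applied to every group in one call. A small sketch, assuming the same nm as above (rebuilt here so the block runs on its own; group_totals is an illustrative name):

```r
nm <- data.frame(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L, 1L, 1L, 1L),
                 `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L, 1L, 0L),
                 `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L, 0L),
                 `20_Kutti` = c(0L, 0L, 1L, 0L), check.names = FALSE)

# Split the columns by the part of the name after the underscore
lst <- split.default(nm, sub(".*_", "", names(nm)))

# Apply one function to every group, e.g. the total count of 1s per group
group_totals <- sapply(lst, function(d) sum(d))
print(group_totals)
```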
