I have a dataset(nm) as shown below:
nm
2_V2O 10_Kutti 14_DD 15_TT 16_DD 19_V2O 20_Kutti
0 1 1 0 0 1 0
1 1 1 1 1 0 0
0 1 0 1 0 0 1
0 1 1 0 1 0 0
Now I want to split this into multiple new datasets, grouped by the unique suffix of the column names. Each dataset should also be named after that suffix, as shown below:
Kutti
10_Kutti 20_Kutti
1 0
1 0
1 1
1 0
V2O
2_V2O 19_V2O
0 1
1 0
0 0
0 0
DD
14_DD 16_DD
1 0
1 1
0 0
1 1
TT
15_TT
0
1
1
0
I know this can be done using the "select" function in dplyr, but I need a dynamic program that builds these datasets automatically for any dataset.
We can split by a substring of the column names of 'nm'. Remove the prefix of each column name up to and including the _ with sub, and use the result to split 'nm'.
lst <- split.default(nm, sub(".*_", "", names(nm)))
lst
#$DD
# 14_DD 16_DD
#1 1 0
#2 1 1
#3 0 0
#4 1 1
#$Kutti
# 10_Kutti 20_Kutti
#1 1 0
#2 1 0
#3 1 1
#4 1 0
#$TT
# 15_TT
#1 0
#2 1
#3 1
#4 0
#$V2O
# 2_V2O 19_V2O
#1 0 1
#2 1 0
#3 0 0
#4 0 0
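To see why this works, it can help to look at the grouping key on its own. The sketch below rebuilds a cut-down copy of 'nm' (values taken from the question; the name `key` is only illustrative) and shows that sub() leaves one label per column, which split.default() then uses to group the columns.

```r
# Cut-down copy of 'nm'; check.names = FALSE keeps names such as "2_V2O" intact
nm <- data.frame(`2_V2O` = c(0L, 1L, 0L, 0L),
                 `10_Kutti` = c(1L, 1L, 1L, 1L),
                 `15_TT` = c(0L, 1L, 1L, 0L),
                 `19_V2O` = c(1L, 0L, 0L, 0L),
                 check.names = FALSE)

# ".*_" is greedy, so everything up to and including the last underscore is removed
key <- sub(".*_", "", names(nm))
key
# "V2O" "Kutti" "TT" "V2O"

# split.default() treats the data frame as a list of columns and groups them by 'key'
lst <- split.default(nm, key)
names(lst)
# "Kutti" "TT" "V2O"
```

Because the pattern is greedy, a column name with several underscores would be grouped by the part after the last one.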
It is better to keep the data.frames in a list. If they really must be individual data.frame objects in the global environment (not recommended), use list2env:
list2env(lst, envir = .GlobalEnv)
Now, just call
DD
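If the pieces stay in the list, the same operation can be applied to every group without knowing the group names in advance. A minimal sketch, assuming a small stand-in for 'lst' built from two of the groups above:

```r
# Stand-in for the list produced by split.default(); values copied from the question
lst <- list(
  DD = data.frame(`14_DD` = c(1L, 1L, 0L, 1L), `16_DD` = c(0L, 1L, 0L, 1L),
                  check.names = FALSE),
  TT = data.frame(`15_TT` = c(0L, 1L, 1L, 0L), check.names = FALSE)
)

# Number of columns in each group
n_cols <- sapply(lst, ncol)
n_cols

# Row totals per group, collected into one matrix with a column per group
totals <- sapply(lst, rowSums)
totals
```

This is one reason the list is usually the more convenient form: sapply() and lapply() loop over the groups directly.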
data
nm <- structure(list(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L,
1L, 1L, 1L), `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L,
1L, 0L), `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L,
0L), `20_Kutti` = c(0L, 0L, 1L, 0L)), .Names = c("2_V2O", "10_Kutti",
"14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti"), class = "data.frame",
row.names = c(NA, -4L))
I have a dataset covering several diseases, with 0 indicating that a person does not have the disease and 1 that they do.
To illustrate with an example: I am interested in Disease A and whether the people in the dataset have this disease on its own or as a consequence of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but none of the known diseases that can cause it.
"Secondary" means that they have disease A and a disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this result:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
Using data.table:
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][DiseaseA == 1, Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"), .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
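For comparison, the same logic can be sketched in base R. The attempt in the question fails because sum(names(DF) %in% SecondaryCauses) counts matching column names (always 2 here) rather than looking at each row's values; rowSums() over the selected columns gives the per-row check instead. The name `has_cause` is only illustrative.

```r
df <- data.frame(ID = 1:5,
                 DiseaseA = c(0L, 1L, 1L, 1L, 0L),
                 DiseaseB = c(1L, 0L, 0L, 0L, 0L),
                 DiseaseC = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseD = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseE = c(0L, 1L, 0L, 1L, 0L))
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# TRUE where at least one of the listed causes is present in that row
has_cause <- rowSums(df[SecondaryCauses]) > 0

df$Type <- ifelse(df$DiseaseA == 0, "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))
df$Type
```

Selecting with df[SecondaryCauses] works regardless of column order, so the 20+ unordered columns are not a problem.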
I'm using a larger version of the following dataset:
Name HB>TE-iL TE-iL TE-iL^
John 1 1 0
Eric 0 0 0
Mike 1 1 0
Jim 0 1 0
Joe 1 0 1
...
If column HB>TE-iL for a given observation is equal to one, I want the TE-iL and TE-iL^ columns to become zeros. So the above data frame would become:
Name HB>TE-iL TE-iL TE-iL^
John 1 0 0
Eric 0 0 0
Mike 1 0 0
Jim 0 1 0
Joe 1 0 0
...
Is this possible? Thanks in advance.
# Detect where it is 1
is_1 = your_data[["HB>TE-iL"]] == 1
# set to 0
your_data[is_1, c("TE-iL", "TE-iL^")] = 0
We could use
df[c("TE-iL", "TE-iL^")] <- df[c("TE-iL", "TE-iL^")] * !df[["HB>TE-iL"]]
Output:
df
# Name HB>TE-iL TE-iL TE-iL^
#1 John 1 0 0
#2 Eric 0 0 0
#3 Mike 1 0 0
#4 Jim 0 1 0
#5 Joe 1 0 0
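The multiplication works because negating a number gives a logical, and logicals behave as 0 and 1 in arithmetic. A small sketch with two of the columns as plain vectors (the names `hb`, `te` and `zeroed` are only illustrative):

```r
hb <- c(1L, 0L, 1L, 0L, 1L)   # the HB>TE-iL column from the question
te <- c(1L, 0L, 1L, 1L, 0L)   # the TE-iL column

# Any nonzero value negates to FALSE, zero negates to TRUE
keep <- !hb
keep
# FALSE TRUE FALSE TRUE FALSE

# FALSE acts as 0 and TRUE as 1, so flagged rows are multiplied by 0
zeroed <- te * keep
zeroed
# 0 0 0 1 0
```

Note that this relies on the flag column having no NA values; an NA flag would propagate NA into the result.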
data
df <- structure(list(Name = c("John", "Eric", "Mike", "Jim", "Joe"),
`HB>TE-iL` = c(1L, 0L, 1L, 0L, 1L), `TE-iL` = c(1L, 0L, 1L,
1L, 0L), `TE-iL^` = c(0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-5L))
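I think this should work. The TRUE branch keeps the existing value for rows where HB>TE-iL is not 1; without it, case_when() returns NA for those rows.
library(dplyr)
df <- df %>%
  mutate(
    # 0L matches the integer type of the columns in the example data
    `TE-iL` = case_when(`HB>TE-iL` == 1 ~ 0L, TRUE ~ `TE-iL`),
    `TE-iL^` = case_when(`HB>TE-iL` == 1 ~ 0L, TRUE ~ `TE-iL^`)
  )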
I have the following dataset. My objective is to understand how many mammals and birds there are in the two types of locations.
df1:
Location Type Cat Mouse Dog Chicken Turkey Horse
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I want it to read as
df2:
Location Type M M M B B M
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
with 'M' denoting Mammal and 'B' denoting Bird
I tried to manually enter the data into my .csv file and use it in R, however, the file gets read as
df2:
Location Type M M1 M2 B B1 M3
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I am not sure why each 'M' or 'B' column gets numbered separately. How could I prevent this from happening?
Alternatively, I also have the animal type classified as mammal or bird in another dataframe, as below:
dfanimal:
Name of Animal Mammal/Bird
Cat Mammal
Dog Mammal
Mouse Mammal
Chicken Bird
Turkey Bird
Horse Mammal
Is there a way for me to work directly with the two dataframes, df1 and dfanimal?
Would be very thankful for any help.
After you manually change the column names, you can use check.names = FALSE while importing the csv. Duplicate column names in a dataframe are not advised, so by default R adds those suffixes to make the names unique.
df1 <- read.csv('location/of/file.csv', check.names = FALSE)
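The effect of check.names can be seen without a file by reading from a string. This is a small sketch with made-up rows; the `csv_text` variable is only for illustration.

```r
csv_text <- "Location,M,M,M,B,B,M
1,1,0,0,1,0,1
2,0,1,0,0,0,0"

# Default: duplicated headers are made unique by appending suffixes
default_names <- names(read.csv(text = csv_text))
default_names
# "Location" "M" "M.1" "M.2" "B" "B.1" "M.3"

# check.names = FALSE keeps the headers exactly as written
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
raw_names
# "Location" "M" "M" "M" "B" "B" "M"
```

With duplicated names, df1$M or df1[["M"]] returns only the first matching column, so columns are safer to select by position once names repeat.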
If you want to use df_animal to change the column names, we can use match
names(df1)[-1] <- substr(df_animal$Mammal.Bird[match(names(df1)[-1],
df_animal$Name_of_Animal)], 1, 1)
df1
# Location M M M B B M
#1 1 1 0 0 1 0 1
#2 1 0 0 1 0 1 0
#3 2 1 1 1 1 1 1
#4 2 0 1 0 0 0 0
#5 1 1 1 0 0 1 0
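Since the stated goal is counting mammals and birds per location, the lookup from match can also be used as a grouping key without renaming anything. The following is one possible sketch, not the only way; `type`, `per_row` and `by_location` are illustrative names, and the lookup table is written with plain character columns for brevity.

```r
df1 <- data.frame(Location = c(1L, 1L, 2L, 2L, 1L),
                  Cat = c(1L, 0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L),
                  Dog = c(0L, 1L, 1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L),
                  Turkey = c(0L, 1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L))
df_animal <- data.frame(
  Name_of_Animal = c("Cat", "Dog", "Mouse", "Chicken", "Turkey", "Horse"),
  Mammal.Bird = c("Mammal", "Mammal", "Mammal", "Bird", "Bird", "Mammal"))

# Look up the type of every animal column, in the order the columns appear in df1
type <- df_animal$Mammal.Bird[match(names(df1)[-1], df_animal$Name_of_Animal)]

# Group the animal columns by type and add them up within each row
per_row <- sapply(split.default(df1[-1], type), rowSums)

# Total per location; rowsum() sums the rows of a matrix by group
by_location <- rowsum(per_row, group = df1$Location)
by_location
```

Keeping the original animal names avoids duplicated column names altogether, which sidesteps the check.names issue.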
data
df1 <- structure(list(Location = c(1L, 1L, 2L, 2L, 1L), Cat = c(1L,
0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L), Dog = c(0L, 1L,
1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L), Turkey = c(0L,
1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -5L))
df_animal <- structure(list(Name_of_Animal = structure(c(1L, 3L, 5L, 2L, 6L,
4L), .Label = c("Cat", "Chicken", "Dog", "Horse", "Mouse", "Turkey"
), class = "factor"), Mammal.Bird = structure(c(2L, 2L, 2L, 1L,
1L, 2L), .Label = c("Bird", "Mammal"), class = "factor")), class = "data.frame",
row.names = c(NA, -6L))
I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The dataset is big, spanning over 100 columns and 5000 rows, and contains only binary values (0's and 1's). I want to compute the overlap between every pair of columns in R, something like the one given below. This overlap dataframe will be a square matrix whose number of rows and columns both equal the number of columns in the first dataframe.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 1 0 1
c 2 3 3 1 0 1
d 2 1 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of rows in which both the row variable and the column variable are 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
... and iterating over each cell of this matrix using an apply command, reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function(x, y) sum(data[[x]] * data[[y]])  # [[ ]] looks columns up by the names held in x and y
The problem is that I can't find out the row name and column name from within an apply function. Any help will be appreciated.
Any more efficient approach than the one I'm thinking of is more than welcome.
You are looking for crossprod
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
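crossprod(m) computes t(m) %*% m, so cell (i, j) is the sum over rows of column i times column j, which for 0/1 data is exactly the count of rows where both are 1. The sketch below checks this against the name-based approach from the question on three of the columns; `pair_sum`, `fast` and `slow` are illustrative names.

```r
df1 <- data.frame(a = c(0L, 1L, 0L, 1L, 1L),
                  b = c(1L, 1L, 0L, 0L, 1L),
                  d = c(0L, 1L, 1L, 1L, 0L))
m <- as.matrix(df1)

fast <- crossprod(m)   # same result as t(m) %*% m

# The name-based idea from the question, written with outer();
# Vectorize() is needed because outer() passes whole vectors of names at once
pair_sum <- function(x, y) sum(df1[[x]] * df1[[y]])
slow <- outer(names(df1), names(df1), Vectorize(pair_sum))
dimnames(slow) <- list(names(df1), names(df1))

all.equal(fast, slow)
```

Both give the same matrix, but crossprod() does it in a single matrix operation, which matters with 100+ columns and 5000 rows.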
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
I am trying to create a summary table and having a mental hang up. Essentially, what I think I want is a summaryBy statement getting colSums for the subsets for ALL columns except the factor to summarize on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was ddply from plyr:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the return values represent, but they do not represent the colSums
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
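As for why the ddply attempt returned identical rows: the function passed to it ignores its argument x and sums the full data frame for every group. The same split-then-sum idea can be sketched in base R to show the difference; the small `go` data frame is hypothetical and stands in for GO_matrix_MF.

```r
# Hypothetical stand-in for GO_matrix_MF
go <- data.frame(Cluster = c(10L, 3L, 1L, 10L),
                 GO1 = c(11L, 0L, 1L, 5L),
                 GO2 = c(1L, 0L, 0L, 0L))
data_cols <- setdiff(names(go), "Cluster")

# Wrong: every group gets the totals of the full data frame
wrong <- t(sapply(split(go, go$Cluster), function(x) colSums(go[data_cols])))

# Right: sum only the rows that belong to the current group 'x'
right <- t(sapply(split(go, go$Cluster), function(x) colSums(x[data_cols])))
right
```

In the original call, replacing GO_matrix_MF[dataColumns] with x[dataColumns] inside the function should have the same effect.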
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
foo %>%
  group_by(Cluster) %>%
  summarise(across(everything(), sum))  # summarise_each() and funs() are deprecated in current dplyr
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1