I'm using a larger version of the following dataset:
Name HB>TE-iL TE-iL TE-iL^
John 1 1 0
Eric 0 0 0
Mike 1 1 0
Jim 0 1 0
Joe 1 0 1
...
If column HB>TE-iL for a given observation is equal to one, I want the TE-iL and TE-iL^ columns to become zeros. So the above data frame would become:
Name HB>TE-iL TE-iL TE-iL^
John 1 0 0
Eric 0 0 0
Mike 1 0 0
Jim 0 1 0
Joe 1 0 0
...
Is this possible? Thanks in advance.
# Detect where it is 1
is_1 = your_data[["HB>TE-iL"]] == 1
# set to 0
your_data[is_1, c("TE-iL", "TE-iL^")] = 0
We could use
df[c("TE-iL", "TE-iL^")] <- df[c("TE-iL", "TE-iL^")] * !df[["HB>TE-iL"]]
-output
df
# Name HB>TE-iL TE-iL TE-iL^
#1 John 1 0 0
#2 Eric 0 0 0
#3 Mike 1 0 0
#4 Jim 0 1 0
#5 Joe 1 0 0
data
df <- structure(list(Name = c("John", "Eric", "Mike", "Jim", "Joe"),
`HB>TE-iL` = c(1L, 0L, 1L, 0L, 1L), `TE-iL` = c(1L, 0L, 1L,
1L, 0L), `TE-iL^` = c(0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-5L))
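A minimal sketch of why the multiplication approach works, using the same values as the sample data: `!` turns the 0/1 flag into a logical vector (1 becomes FALSE, 0 becomes TRUE), and multiplying coerces it back to 0/1, so every flagged row is multiplied by zero. The names `demo`, `keep` and `cols` are illustrative only.

```r
# Rebuild the sample data (same values as in the question)
demo <- data.frame(
  Name       = c("John", "Eric", "Mike", "Jim", "Joe"),
  `HB>TE-iL` = c(1L, 0L, 1L, 0L, 1L),
  `TE-iL`    = c(1L, 0L, 1L, 1L, 0L),
  `TE-iL^`   = c(0L, 0L, 0L, 0L, 1L),
  check.names = FALSE
)

# TRUE where the flag is 0, FALSE where it is 1
keep <- !demo[["HB>TE-iL"]]

# The vector is recycled down each selected column,
# so flagged rows are multiplied by FALSE (treated as 0)
cols <- c("TE-iL", "TE-iL^")
demo[cols] <- demo[cols] * keep
```

This relies on the flag column containing only 0 and 1; any other non-zero value would also be treated as "flagged" by `!`.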
I think this should work:
library(dplyr)
df <- df %>%
  mutate(
    # zero the value where the flag is 1, otherwise keep the existing value
    `TE-iL` = ifelse(`HB>TE-iL` == 1, 0, `TE-iL`),
    `TE-iL^` = ifelse(`HB>TE-iL` == 1, 0, `TE-iL^`)
  )
Related
Hi I am looking to retain rows in a dataset similar to the below:
ID Value1 Value2
A 1 0
A 0 1
A 1 1
A 0 1
A 0 0
A 0 0
A 1 0
A 1 1
A 0 1
I want to keep each row where 'Value1' = 1 and 'Value2' in the row immediately below = 1. Under these conditions both rows should be retained; any other rows corresponding to ID 'A' should not be retained. Can anyone help with this please? In this example the output below should be returned:
ID Value1 Value2
A 1 0
A 0 1
A 1 1
A 0 1
A 1 0
A 1 1
A 0 1
The logic is to keep each row where Value1=1 and the row immediately after it has Value2=1, together with that following row. I've added a few rows to your data to check different scenarios.
df=structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A"), Value1 = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Value2 = c(0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-9L))
ID Value1 Value2
1 A 1 0
2 A 0 1
3 A 0 0
4 A 1 0
5 A 0 0
6 A 0 1
7 A 1 0
8 A 0 1
9 A 0 1
Edit: your edit requires distinguishing between 1's in the Value1 and Value2 columns. There are probably several options here; one is to say that Value1=1 starts a new sequence, so the next row needs to have Value2=1 and Value1!=1.
tmp=which((df$Value1==1)+c(tail(df$Value1!=1 & df$Value2==1,-1),NA)==2)
df[sort(c(tmp,tmp+1)),]
ID Value1 Value2
1 A 1 0
2 A 0 1
7 A 1 0
8 A 0 1
note the row names/indices.
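The same idea can be written with an explicitly shifted vector, which may be easier to follow. This is only a sketch under the interpretation used in the edit above (a row with Value1 = 1 starts a pair, and the next row must have Value2 = 1 and Value1 != 1); `d`, `next_ok`, `starts` and `res` are illustrative names.

```r
# Same nine rows as the test data above
d <- data.frame(
  ID     = "A",
  Value1 = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L),
  Value2 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)
)

# Does the row directly below qualify? The last row has no row below it.
next_ok <- c((d$Value1 != 1 & d$Value2 == 1)[-1], FALSE)

# Rows that start a pair
starts <- which(d$Value1 == 1 & next_ok)

# Keep each starting row together with the row below it
res <- d[sort(c(starts, starts + 1)), ]
```

Padding with FALSE instead of NA means the last row can never start a pair, so no NA handling is needed in `which()`.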
You can try
library(dplyr)
inds <- df |> summarise(n = which(Value1 == 1 & c(Value2[2:n()] , 0) == 1))
df |> slice(unlist(Map(c , inds$n , inds$n + 1)))
data
ID Value1 Value2
1 A 1 0
2 A 0 1
3 A 1 0
4 A 0 1
I have a dataset covering several diseases, with 0 indicating not having the disease and 1 having the disease.
To illustrate with an example: I am interested in Disease A and whether the people in the dataset have this disease on its own or as a result of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but not any of the known diseases that can cause it.
"Secondary" means that they have disease A and a disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this result:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
using data.table
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][DiseaseA == 1, Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"), .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
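For comparison, here is a base R sketch of the same classification. It also shows why the attempt in the question fails: `sum(names(DF) %in% SecondaryCauses)` counts matching column names (one number for the whole data frame), whereas the condition needs a per-row check of the values in those columns. The sketch assumes every name in `SecondaryCauses` is a column containing only 0/1; `d` and `has_cause` are illustrative names.

```r
d <- data.frame(
  ID       = 1:5,
  DiseaseA = c(0L, 1L, 1L, 1L, 0L),
  DiseaseB = c(1L, 0L, 0L, 0L, 0L),
  DiseaseC = c(0L, 0L, 1L, 1L, 0L),
  DiseaseD = c(0L, 0L, 1L, 1L, 0L),
  DiseaseE = c(0L, 1L, 0L, 1L, 0L)
)
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# TRUE when a row has at least one of the listed causes
has_cause <- rowSums(d[SecondaryCauses]) > 0

# Classify row by row
d$Type <- ifelse(d$DiseaseA == 0, "NotDiseasedWithA",
                 ifelse(has_cause, "Secondary", "Primary"))
```

Selecting with `d[SecondaryCauses]` works regardless of where those columns sit among the others, which addresses the unordered-columns concern.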
I have the following dataset. My objective is to understand how many mammals and birds there are in the two types of locations.
df1:
Location Type Cat Mouse Dog Chicken Turkey Horse
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I want it to read as
df2:
Location Type M M M B B M
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
with 'M' denoting Mammal and 'B' denoting Bird
I tried to manually enter the data into my .csv file and use it in R, however, the file gets read as
df2:
Location Type M M1 M2 B B1 M3
1 1 0 0 1 0 1
1 0 0 1 0 1 0
2 1 1 1 1 1 1
2 0 1 0 0 0 0
1 1 1 0 0 1 0
I am not sure why each 'M' or 'B' column gets numbered separately. How can I prevent this from happening?
or
I also have the animal type classified as mammal and bird in another dataframe as below
dfanimal:
Name of Animal Mammal/Bird
Cat Mammal
Dog Mammal
Mouse Mammal
Chicken Bird
Turkey Bird
Horse Mammal
Is there a way for me to work directly with the dataframes df1 and dfanimal?
Would be very thankful for any help.
After you manually change the column names, you can use check.names = FALSE while importing the csv. R adds those suffixes by default because duplicate column names in a data frame are not advised.
df1 <- read.csv('location/of/file.csv', check.names = FALSE)
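A small self-contained sketch of the difference, reading from an inline string instead of a file. The CSV content and the names `csv_text`, `default_names` and `raw_names` are made up for illustration.

```r
# A tiny CSV with duplicated headers (illustrative data only)
csv_text <- "Location,M,M,B
1,1,0,1
2,0,1,0"

# Default behaviour: duplicated headers are made unique by adding a suffix
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
```

With duplicated names kept, `df1$M` or `df1[["M"]]` only returns the first matching column, so selecting by position is safer afterwards.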
If you want to use df_animal to change the column names, we can use match
names(df1)[-1] <- substr(df_animal$Mammal.Bird[match(names(df1)[-1],
df_animal$Name_of_Animal)], 1, 1)
df1
# Location M M M B B M
#1 1 1 0 0 1 0 1
#2 1 0 0 1 0 1 0
#3 2 1 1 1 1 1 1
#4 2 0 1 0 0 0 0
#5 1 1 1 0 0 1 0
data
df1 <- structure(list(Location = c(1L, 1L, 2L, 2L, 1L), Cat = c(1L,
0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L), Dog = c(0L, 1L,
1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L), Turkey = c(0L,
1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -5L))
df_animal <- structure(list(Name_of_Animal = structure(c(1L, 3L, 5L, 2L, 6L,
4L), .Label = c("Cat", "Chicken", "Dog", "Horse", "Mouse", "Turkey"
), class = "factor"), Mammal.Bird = structure(c(2L, 2L, 2L, 1L,
1L, 2L), .Label = c("Bird", "Mammal"), class = "factor")), class = "data.frame",
row.names = c(NA, -6L))
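If the end goal is the number of mammals and birds per location, renaming may not be needed at all. The following is a sketch, not the answer's method: it uses the lookup table to group the animal columns by class and then sums within each location. It assumes every animal column name appears in `Name_of_Animal`; `animal_cols`, `cls`, `per_row` and `totals` are illustrative names, and the lookup is rebuilt here with plain character columns.

```r
df1 <- data.frame(
  Location = c(1L, 1L, 2L, 2L, 1L),
  Cat = c(1L, 0L, 1L, 0L, 1L), Mouse = c(0L, 0L, 1L, 1L, 1L),
  Dog = c(0L, 1L, 1L, 0L, 0L), Chicken = c(1L, 0L, 1L, 0L, 0L),
  Turkey = c(0L, 1L, 1L, 0L, 1L), Horse = c(1L, 0L, 1L, 0L, 0L)
)
df_animal <- data.frame(
  Name_of_Animal = c("Cat", "Dog", "Mouse", "Chicken", "Turkey", "Horse"),
  Mammal.Bird    = c("Mammal", "Mammal", "Mammal", "Bird", "Bird", "Mammal")
)

# Class of each animal column, looked up by column name
animal_cols <- names(df1)[-1]
cls <- as.character(df_animal$Mammal.Bird)[match(animal_cols, df_animal$Name_of_Animal)]

# Row totals per class, then summed within each location
per_row <- sapply(split.default(df1[animal_cols], cls), rowSums)
totals  <- rowsum(per_row, group = df1$Location)
```

`totals` is a matrix with one row per location and one column per class, which avoids duplicate column names altogether.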
I have a dataset(nm) as shown below:
nm
2_V2O 10_Kutti 14_DD 15_TT 16_DD 19_V2O 20_Kutti
0 1 1 0 0 1 0
1 1 1 1 1 0 0
0 1 0 1 0 0 1
0 1 1 0 1 0 0
Now I want to have multiple new datasets which got segregated as per their unique column names. All dataset names also must be created as per their column names as shown below:
Kutti
10_Kutti 20_Kutti
1 0
1 0
1 1
1 0
V2O
2_V2O 19_V2O
0 1
1 0
0 0
0 0
DD
14_DD 16_DD
1 0
1 1
0 0
1 1
TT
15_TT
0
1
1
0
I know this can be done using "select" function in dplyr but I need one dynamic programme which builds this automatically for any dataset.
We can split by a substring of the column names of 'nm'. Remove the prefix of each column name up to the _ with sub, and use the result to split 'nm'.
lst <- split.default(nm, sub(".*_", "", names(nm)))
lst
#$DD
# 14_DD 16_DD
#1 1 0
#2 1 1
#3 0 0
#4 1 1
#$Kutti
# 10_Kutti 20_Kutti
#1 1 0
#2 1 0
#3 1 1
#4 1 0
#$TT
# 15_TT
#1 0
#2 1
#3 1
#4 0
#$V2O
# 2_V2O 19_V2O
#1 0 1
#2 1 0
#3 0 0
#4 0 0
It is better to keep the data.frames in a list. If we insist that it should be individual data.frame objects in the global environment (not recommended), use list2env
list2env(lst, envir = .GlobalEnv)
Now, just call
DD
data
nm <- structure(list(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L,
1L, 1L, 1L), `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L,
1L, 0L), `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L,
0L), `20_Kutti` = c(0L, 0L, 1L, 0L)), .Names = c("2_V2O", "10_Kutti",
"14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti"), class = "data.frame",
row.names = c(NA, -4L))
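To see what the grouping key passed to split.default looks like, here is a short sketch on the column names alone. `cols`, `keys` and `groups` are illustrative names.

```r
cols <- c("2_V2O", "10_Kutti", "14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti")

# ".*_" greedily matches everything up to the last underscore,
# so only the trailing label is left
keys <- sub(".*_", "", cols)

# Column positions grouped by label; split.default() on the data frame
# selects these positions as columns for each group
groups <- split(seq_along(cols), keys)
```

Because the match is greedy, a name with several underscores would be keyed on the part after the last one.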
I am trying to create a summary table and having a mental hang up. Essentially, what I think I want is a summaryBy statement getting colSums for the subsets for ALL columns except the factor to summarize on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was ddply from plyr:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the return values represent, but they do not represent the colSums
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
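The ddply attempt in the question returns the same totals for every cluster because the anonymous function ignores its argument x and sums the whole GO_matrix_MF each time; summing x[dataColumns] would use the per-cluster subset instead. As a base R cross-check, here is a sketch using rowsum() on the four rows shown in the question, with the GO column names written in syntactic form (GO.0003677 and so on) as in the output above. `go` and `sums` are illustrative names.

```r
go <- data.frame(
  Cluster    = c(10L, 3L, 1L, 10L),
  GO.0003677 = c(0L, 0L, 1L, 0L),
  GO.0003700 = c(0L, 0L, 1L, 0L),
  GO.0046872 = c(0L, 0L, 0L, 0L),
  GO.0008270 = c(0L, 0L, 0L, 0L),
  GO.0043565 = c(0L, 0L, 0L, 0L),
  GO.0005524 = c(1L, 0L, 0L, 0L),
  row.names  = c("comp103680_c0", "comp103947_c0",
                 "comp104660_c0", "comp105255_c0")
)

# Column sums within each Cluster; one row per cluster,
# with the cluster value as the row name
sums <- rowsum(go[-1], group = go$Cluster)
```

Unlike aggregate(), rowsum() puts the grouping value in the row names rather than in a Cluster column.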
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
# summarise_each() and funs() are deprecated in current dplyr; across() replaces them
foo %>%
  group_by(Cluster) %>%
  summarise(across(everything(), sum))
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1