R: How do I use setdiff() and match() to subset a dataframe?

In the map dataframe, the first column is chromosome and the second column is snp.
> dput(map[1:10,])
structure(list(V1 = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
15L, 15L), V2 = c("SNP_A-1850451", "SNP_A-1910129", "SNP_A-1793338",
"SNP_A-1938260", "SNP_A-2269619", "SNP_A-2246474", "SNP_A-2275061",
"SNP_A-1961089", "SNP_A-2131173", "SNP_A-4227303"), V3 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V4 = c(41361016L, 41366515L,
41386408L, 41396801L, 41410208L, 41419430L, 41440229L, 41446338L,
41449654L, 41455192L)), row.names = c(NA, 10L), class = "data.frame")
The mc dataframe is a subset of map based on the snps vector.
> snps <- c("SNP_A-2240938","SNP_A-2249234","SNP_A-2120212","SNP_A-1864785","SNP_A-1944204","SNP_A-2101022","SNP_A-2192181","SNP_A-4296300","SNP_A-1961028","SNP_A-2215249","SNP_A-2172180","SNP_A-1945117","SNP_A-1941491")
>
> # Subset the map file by the SNPs cis
> mc <- map[map[,2] %in% snps,]
Now I want to extract the remaining SNPs that lie on chromosome 14. My plan is to use setdiff() to get the SNP names that are not in mc, and then use those names with match() on the original map data frame.
> # Extract all of the remaining SNPs only on chromosome 14
> for (i in 1:nrow(map)) {
+ if (map$V1==14 && setdiff(map,mc)) {
+ t.snp <- map[i,]
+ }
+ }
>
> t.snp
data frame with 0 columns and 10784 rows
Problem:
My t.snp dataframe has 0 columns.
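
A minimal sketch of the setdiff()/match() approach described above, run on a small made-up map (the SNP names and positions below are hypothetical, not taken from the real file). Two things appear to go wrong in the loop: map$V1==14 compares the whole column rather than row i, and setdiff(map,mc) compares the two data frames as a whole instead of yielding a TRUE/FALSE value for if(). Working on the SNP name column avoids both problems and needs no loop:

```r
# Hypothetical stand-in for `map`: chromosome, SNP name, distance, position
map <- data.frame(
  V1 = c(14L, 14L, 14L, 15L, 15L),
  V2 = c("SNP_A-1", "SNP_A-2", "SNP_A-3", "SNP_A-4", "SNP_A-5"),
  V3 = 0L,
  V4 = c(100L, 200L, 300L, 400L, 500L),
  stringsAsFactors = FALSE
)
snps <- c("SNP_A-2", "SNP_A-4")          # hypothetical cis SNPs

mc <- map[map$V2 %in% snps, ]            # same subsetting as in the question

# SNP names present in map but not in mc
remaining <- setdiff(map$V2, mc$V2)

# match() returns the row positions of those names in map
rest <- map[match(remaining, map$V2), ]

# Keep only the rows on chromosome 14
t.snp <- rest[rest$V1 == 14, ]
t.snp
```

The same result can be had in one step with map[map$V1 == 14 & !(map$V2 %in% snps), ], which may be easier to read if the intermediate names are not needed.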

Related

Case_when R multi conditional logic

I'm trying to utilize case_when from dplyr to create a categorical vector in a dataframe, but I want to see if it can be done in a cleaner way.
Example dataframe
structure(list(Primary.column = c(1L, 0L, 1L, 0L, 1L, 1L), Other_column1 = c(1L,
1L, 0L, 0L, 0L, 0L), Other_column2 = c(0L, 0L, 1L, 1L, 0L, 0L
), Other_column3 = c(0L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
When Primary column is 0, the category should be A. When Primary column is 1, I want to look at the columns whose names contain the keyword "Other"; whichever of them holds a 1 determines the category.
Something like Primary Column == 1 & (select whichever column titled "Other" has 1 in it to decide category)
For this one this is the logic
If Primary Column is 1 and Other_column1 is 1 then the new column should have value B
If Primary Column is 1 and Other_column2 is 1 then the new column should have value C
If Primary Column is 1 and Other_column3 is 1 then the new column should have value D
In the test dataframe this is easy to solve because there are very few columns, and it can be done like so:
test_df <- test_df %>%
mutate(new_column=case_when(Primary.column==0 ~ "A",
Primary.column==1 & Other_column1 ==1 ~ "B",
Primary.column==1 & Other_column2 ==1 ~ "C",
Primary.column==1 & Other_column3 ==1 ~ "D",
))
But the true dataframe has hundreds of "Other" columns, so this is not a clean solution: I'd need hundreds of lines of code for this single variable. Not what I want.
In this example I also have a key that tells me which value each Other column maps to when Primary column is 1.
Key
structure(list(Column = c("Other_column1 ", "Other_column2",
"Other_column3"), Value = c("B", "C", "D")), class = "data.frame", row.names = c(NA,
-3L))
Is there a way to utilize the key to make it so that I don't write out 100 lines of messy code? Or are there alternative solutions to keep things clean?
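
One possible way to use the key, sketched in base R rather than case_when (a swapped-in technique, not the only option). It assumes that each row has at least one Other column equal to 1, that the first such column decides the category, and that the stray trailing space in the key's first Column entry should be trimmed. The name first_hit is made up for this sketch.

```r
test_df <- data.frame(
  Primary.column = c(1L, 0L, 1L, 0L, 1L, 1L),
  Other_column1  = c(1L, 1L, 0L, 0L, 0L, 0L),
  Other_column2  = c(0L, 0L, 1L, 1L, 0L, 0L),
  Other_column3  = c(0L, 0L, 0L, 0L, 1L, 1L)
)
key <- data.frame(
  Column = c("Other_column1 ", "Other_column2", "Other_column3"),
  Value  = c("B", "C", "D"),
  stringsAsFactors = FALSE
)

# The key as posted has a trailing space in one name; trim before matching
key$Column <- trimws(key$Column)

# For each row, position (within the key) of the first Other column equal to 1
other <- as.matrix(test_df[, key$Column])
first_hit <- unname(apply(other == 1, 1, function(r) which(r)[1]))

# Primary 0 gives "A"; otherwise look the value up in the key
test_df$new_column <- ifelse(test_df$Primary.column == 0,
                             "A",
                             key$Value[first_hit])
test_df
```

Because the lookup is driven by the rows of the key, adding more Other columns only means adding rows to the key, not lines of code.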

Xts format of data.frame

I have a dataframe with 2 columns in it. The first column contains POSIXct data, and the second contains some sample value integers. I cannot for the life of me figure out how to convert this dataframe to xts.
My data:
structure(list(SampleDateTime = list(1422835200000, 1423353600000,
1423958400000, 1433030400000, 1433635200000, 1434326400000,
1434844800000, 1444521600000, 1445731200000, 1453593600000,
1.455408e+12, 1420934400000, 1424563200000, 1425772800000,
1426982400000, 1430006400000, 1.431216e+12, 1.440288e+12,
1448150400000, 1460851200000), SampleValue = list(9L, 3L,
2L, 1733L, 19L, 6L, 1L, 17L, 7L, 23L, 147L, 1L, 17L, 1L,
1L, 19L, 1L, 11L, 2L, 91L), Dttm = structure(c(1107216000,
1107734400, 1108339200, 1117411200, 1118016000, 1118707200, 1119225600,
1128902400, 1130112000, 1137974400, 1139788800, 1105315200, 1108944000,
1110153600, 1111363200, 1114387200, 1115596800, 1124668800, 1132531200,
1145232000), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c("feature$attributes",
"feature$attributes1", "feature$attributes2", "feature$attributes3",
"feature$attributes4", "feature$attributes5", "feature$attributes6",
"feature$attributes7", "feature$attributes8", "feature$attributes9",
"feature$attributes10", "feature$attributes11", "feature$attributes12",
"feature$attributes13", "feature$attributes14", "feature$attributes15",
"feature$attributes16", "feature$attributes17", "feature$attributes18",
"feature$attributes19"), class = "data.frame")
I use this to try and make the conversion:
q <- as.xts(t, order.by = t$Dttm, dateFormat="POSIXct", frequency=NULL, RECLASS=FALSE)
I get the following errors:
Error in coredata.xts(x) : currently unsupported data type
Any help appreciated. I can't figure out why it says the data type is unsupported when the documentation says data frames are supported.

As pointed out in the comments, your data is in an unusual format: the first two columns of the data frame are lists rather than numeric vectors. So first unlist each column (this leaves columns that are not lists unchanged), giving dat.u, and then convert. We assume dat is the data shown via dput in the question.
library(xts) # also loads zoo, which provides read.zoo
dat.u <- replace(dat, TRUE, lapply(dat, unlist))
z <- read.zoo(dat.u, index = "Dttm")
x <- as.xts(z)
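
To see what the unlisting step does on its own, here is a small base-R sketch on a made-up three-row data frame (the id column and its values are hypothetical). It only covers the list-column repair, not the conversion to xts.

```r
# Hypothetical data frame with one ordinary column and one list column
dat <- data.frame(id = 1:3)
dat$SampleValue <- list(9L, 3L, 2L)

# Which columns are lists rather than atomic vectors?
sapply(dat, is.list)

# Unlist every column; columns that are already atomic come back unchanged
dat.u <- replace(dat, TRUE, lapply(dat, unlist))

sapply(dat.u, is.list)
str(dat.u)
```

Running sapply(dat, is.list) on the real data is a quick way to confirm that list columns are the cause before converting.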

Why does R strip names of vector extracted from a one-column matrix with named rows?

I want to take one row of matrix M and treat the row as a named vector, with the column names of the original matrix as the names of the vector. Usually M[x, ] does what I want but this fails if:
(a) the rows of the matrix are named, and
(b) the number of columns is 1.
I can work around this, but it seems inelegant. What is the purpose of this behaviour (and in particular why does it make any difference that the rows are named)?
Examples:
M <- structure(c(72L, 92L, 81L, 81L, 87L, 76L, 89L, 70L, 70L, 73L,
74L, 75L), .Dim = 4:3, .Dimnames = list(c("SeptQuiz", "Midterm",
"NovQuiz", "Final"), c("Anne", "Bo", "Cameron"))) # multiple columns, rows are named
(v1 <- M[3, ]) # subsetting one row, as a vector, preserves student names
M <- structure(c(91L, 87L, 83L, 81L), .Dim = c(4L, 1L), .Dimnames = list(
NULL, "Frank")) # one column, rows are unnamed
(v1 <- M[3, ]) # again, subsetting one row as a vector preserves student name
M <- structure(c(91L, 87L, 83L, 81L), .Dim = c(4L, 1L), .Dimnames = list(
c("SeptQuiz", "Midterm", "NovQuiz", "Final"), "Frank")) # one column, rows are named
(v1 <- M[3, ]) # subsetting one row deletes student name
if(ncol(M) == 1 && !is.null(rownames(M))) {names(v1) <- colnames(M)} # kludge to restore student name if it was stripped
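
A possible way to avoid the conditional, sketched under my understanding of the dropping rule: when the result has length one and both dimensions carry names, R cannot tell which name to keep, so it keeps neither. Treat that explanation as an assumption to check against ?Extract. Setting the names explicitly from colnames(M) gives the same result whatever the shape of the matrix:

```r
# One column, rows are named (the problematic case from the question)
M <- matrix(c(91L, 87L, 83L, 81L), ncol = 1,
            dimnames = list(c("SeptQuiz", "Midterm", "NovQuiz", "Final"),
                            "Frank"))

stripped <- M[3, ]        # length-one result, names dropped
names(stripped)

# Always take the names from the column names, no condition needed
v1 <- setNames(M[3, ], colnames(M))
v1
```

For a matrix with several columns M[3, ] already carries the column names, so setNames() simply reapplies them.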

How to separate a dataframe based on specific string in column name [duplicate]

This question already has answers here:
Split string by last two characters in R? (/negative string indices)
(5 answers)
Closed 3 years ago.
I have a large data set that I have not been able to split into two sets.
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the end of each column name. For example, the second column above is 3C_AALI_01A, and I want to generate two data sets based on the trailing _01A part.
Columns whose number is 01 to 09 should go into one data frame, and columns whose number is 10 or higher should go into a second data frame.
In the example data above, the columns with the following names should be in one data frame
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A

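# TRUE for columns whose two-digit code (before the final letter) is 01 to 09
is_low <- grepl('0[1-9].$', colnames(df))
df1 <- df[, is_low]
df2 <- df[, !is_low] # note: this set also keeps the name column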
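
An alternative sketch that extracts the two-digit code as a number and compares it with 10, rather than encoding the range in the regular expression. It uses a cut-down copy of the example data (four of the sample columns) and, as a design choice not stated in the question, keeps the name column in both results. The names sample_cols, code, low and high are made up for this sketch.

```r
df <- data.frame(
  name = c("a", "b", "c"),
  X3C_AALI_01A = c(651L, 2L, 1877L),
  X4H_AAAK_11B = c(2978L, 4L, 1389L),
  A1_A0SK_07C  = c(3003L, 6L, 2984L),
  A1_A0SI_10A  = c(1578L, 1L, 2217L)
)

# Every column except the identifier column
sample_cols <- setdiff(colnames(df), "name")

# Pull out the two digits that sit just before the final letter
code <- as.integer(sub(".*_([0-9]{2})[A-Za-z]$", "\\1", sample_cols))

low  <- df[, c("name", sample_cols[code < 10])]    # codes 01 to 09
high <- df[, c("name", sample_cols[code >= 10])]   # codes 10 and above

colnames(low)
colnames(high)
```

Comparing numbers makes it easy to move the cut-off later without rewriting the pattern.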

You could use tidyr::separate() with a negative sep (for example sep = -1), which splits at a position counted from the end of the string. Negative string indexing is what you really want here.
Also, your data frame is transposed: it would be more usual to have a single column holding the sample names, plus numeric columns a, b, c. Something like t(df), but without the unwanted coercion to string.

How to split/subset a dataframe into multiple dataframes in R

I have looked through the web and Stack Overflow and was not able to find a solution to my problem. I don't know whether dplyr or a loop would be more efficient.
Below is an example data frame (my own datasets have more than 10,000 rows). I would like to split it in three based on column B (<250), either as a list with three objects or as three individual data frames. Then, for each new data frame, I would like to count the number of points (the number of rows) and compute the duration (column Time is in seconds). Any suggestion would be really appreciated.
thank you
Martin
dput(mydata)
structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 0L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), A = c(4L, 5L, 6L, 7L,
3L, 7L, 8L, 10L, 11L, 8L, 10L, 12L, 14L, 6L, 14L, 16L, 20L, 22L
), B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50,
25.2, 102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2
)), .Names = c("Time", "A", "B"), class = "data.frame", row.names = c(NA,
-18L))

Grab IRanges from Bioconductor:
library(IRanges)
runs <- slice(Rle(mydata$B), upper=250)
This is an RleViews object, with a view (range) for every run under 250. You can extract the width of the views (the number of points that would be in each data frame):
width(runs)
You can split the data frame into a list like this:
blocks <- extractList(mydata, ranges(runs))
Note that blocks is now a formal SplitDataFrameList.
To compute the duration, you can extract the Time column as an IntegerList and compute the difference between the last and first element of every list element:
time <- blocks[,"Time"]
ptail(time, 1) - phead(time, 1)
This happens without actually forming separate list elements (the list is lazily managed) and so should be fast.
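
For comparison, the same run-based split can be sketched in base R with rle(), without Bioconductor. This assumes a "block" means a run of consecutive rows with B < 250 and that duration means last Time minus first Time within a block. The names below, run_id, n_points and duration are made up for this sketch.

```r
mydata <- data.frame(
  Time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 0L,
           11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L),
  A = c(4L, 5L, 6L, 7L, 3L, 7L, 8L, 10L, 11L, 8L,
        10L, 12L, 14L, 6L, 14L, 16L, 20L, 22L),
  B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50, 25.2,
        102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2)
)

below <- mydata$B < 250

# Number each run of consecutive TRUE/FALSE values
r <- rle(below)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Keep only the rows under 250, split by the run they belong to
blocks <- split(mydata[below, ], run_id[below])

# Number of points and duration (last Time minus first Time) per block
n_points <- sapply(blocks, nrow)
duration <- sapply(blocks, function(d) d$Time[nrow(d)] - d$Time[1])

n_points
duration
```

On the example data this yields three blocks, matching the three data frames asked for in the question.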

It's not clear how your specifications line up with your expected output. Here are two different methods of splitting:
# Keep only the rows with B < 250
small <- mydata[mydata$B < 250, ]
# Gives three groups (rows are dealt out to the groups in turn)
split(small, seq_len(nrow(small)) %% 3)
# Gives consecutive groups of three rows (the last group may be shorter)
split(small, (seq_len(nrow(small)) - 1) %/% 3)
This shows how to count the numbers of rows from the first method:
> three <- split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%% 3)
> lapply(three, nrow)
$`0`
[1] 4
$`1`
[1] 5
$`2`
[1] 5
