How to combine multiple variable data to a single variable data? - r

After building my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that serves as my data source was filled in by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data was entered differently, the major groups (RedWine, WhiteWine and Water) have been split into subgroups. How do I combine the subgroups into one group, e.g. red + Red + RedWine -> Total wine? I use the phyloseq package for this kind of dataset.

names <- c("red","white","water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))), names)
for(col in names){
  # drop = FALSE keeps a data frame even when only one column matches
  df2[,col] <- rowSums(df[, grep(col, tolower(names(df))), drop = FALSE])
}
Here,
grep(col, tolower(names(df)))
returns the positions of all columns whose lower-cased name contains a string such as "red". Each group of matching columns is then summed row-wise into a new data.frame df2, which is pre-allocated with one column per group and as many rows as df.

I would just create a new data.frame, easiest to do with dplyr but also doable with base R:
with dplyr
newFrame <- oldFrame %>% mutate(Mock = Mock, Neg = Neg + Neg1PCR + Neg2PCR + NegPBS, Red = red + Red + RedWine, Water = water + Water, White = white + White)
with base R (not complete but you get the point)
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)
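The base R idea can also be written without spelling out every sum by hand. The following is only a sketch under assumptions: the list name groups and the choice of which columns belong to which group are illustrative and taken from the column names posted in the question; oldFrame is rebuilt here from the single row shown there.

```r
# Data as posted in the question (one row)
oldFrame <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

# Hypothetical mapping: new group name -> columns that belong to it
groups <- list(
  Mock  = "Mock",
  Neg   = c("Neg", "Neg1PCR", "Neg2PCR", "NegPBS"),
  Red   = c("red", "Red", "RedWine"),
  Water = c("water", "Water"),
  White = c("white", "White")
)

# Sum each group of columns row-wise; drop = FALSE keeps single-column
# groups as data frames so that rowSums() still works
newFrame <- as.data.frame(
  lapply(groups, function(cols) rowSums(oldFrame[, cols, drop = FALSE]))
)
newFrame
```

Keeping the mapping in one list means a newly discovered spelling only has to be added in one place.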

One can use dplyr::select with starts_with() to pick the columns to combine. ignore.case is TRUE by default in starts_with(), which helps with the data.frame the OP has posted, because names such as red, Red and RedWine differ only in case and suffix.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

Related

Renaming coat colors in R goes wrong with str_detect

I have a dataset with horses and want to group them based on coat colors. In my dataset more than 140 colors are used, I would like to go back to only a few coat colors and assign the rest to Other. But for some horses the coat color has not been registered, i.e. those are unknown. Below is what the new colors should be. (To illustrate the problem I have an old coat color and a new one. But I want to simply change the coat colors, not create a new column with colors)
Horse ID  Coatcolor(old)  Coatcolor
1         black           Black
2         bayspotted      Spotted
3         chestnut        Chestnut
4         grey            Grey
5         cream dun       Other
6                         Unknown
7         blue roan       Other
8         chestnutgrey    Grey
9         blackspotted    Spotted
10                        Unknown
Instead, I get the data below (second table), where Unknown and Other are switched.
Horse ID  Coatcolor
1         Black
2         Spotted
3         Chestnut
4         Grey
5         Unknown
6         Other
7         Unknown
8         Grey
9         Spotted
10        Other
I used the following code
mydata <- data %>%
mutate(Coatcolor = case_when(
str_detect(Coatcolor, "spotted") ~ "Spotted",
str_detect(Coatcolor, "grey") ~ "Grey",
str_detect(Coatcolor, "chestnut") ~ "Chestnut",
str_detect(Coatcolor, "black") ~ "Black",
str_detect(Coatcolor, "") ~ "Unknown",
TRUE ~ Coatcolor
))
mydata$Coatcolor[!mydata$Coatcolor %in% c("Spotted", "Grey", "Chestnut", "Black", "Unknown")] <- "Other"
So what am I doing wrong/missing? Thanks in advance.
You can use the recode function of the dplyr package. Assuming the missing spots are NAs, you can first set all NAs to "Unknown" with replace_na from the tidyr package and then recode the remaining values. How you detect the missing spots depends on their format (the example below converts empty strings to NA first).
library(dplyr)
library(tidyr)
mydata <- tibble(
  id = 1:10,
  coatcol = letters[1:10]
)
mydata$coatcol[5] <- NA
mydata$coatcol[4] <- ""
mydata <- mydata %>%
  mutate(coatcol = na_if(coatcol, "")) %>% # convert empty strings to NA
  mutate(Coatcolor_old = replace_na(coatcol, "Unknown")) %>% # set all NA to "Unknown"
  mutate(Coatcolor_new = recode(
    Coatcolor_old,
    'spotted' = 'Spotted',
    'bayspotted' = 'Spotted',
    'old_name' = 'new_name',
    'a' = 'A' # etc.
  ))
mydata
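As an alternative to recode, the case_when call from the question can be adjusted. This is a minimal sketch, assuming that missing coat colors are stored as empty strings or NA (the question does not say which) and that the columns are called HorseID and Coatcolor; both names are illustrative. The missing values are tested explicitly and first, and the fallback branch assigns "Other", so no empty pattern is passed to str_detect.

```r
library(dplyr)
library(stringr)

# Example data modelled on the first table in the question;
# row 6 uses "" and row 10 uses NA to cover both assumed formats
data <- tibble(
  HorseID = 1:10,
  Coatcolor = c("black", "bayspotted", "chestnut", "grey", "cream dun",
                "", "blue roan", "chestnutgrey", "blackspotted", NA)
)

mydata <- data %>%
  mutate(Coatcolor = case_when(
    is.na(Coatcolor) | Coatcolor == "" ~ "Unknown",  # missing values first
    str_detect(Coatcolor, "spotted")   ~ "Spotted",
    str_detect(Coatcolor, "grey")      ~ "Grey",
    str_detect(Coatcolor, "chestnut")  ~ "Chestnut",
    str_detect(Coatcolor, "black")     ~ "Black",
    TRUE                               ~ "Other"     # everything else
  ))
mydata
```

Because case_when uses the first matching branch, the order matters: "chestnutgrey" becomes Grey and "blackspotted" becomes Spotted, as in the desired table.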

Create alphabetically sorted word cloud

I want to create a word cloud for following data.
Red 30
Brown 12
Black 16
Green 33
Yellow 18
Grey 19
White 11
My word cloud should look like this:
The words should be sorted alphabetically, and the font size of each word should follow the value in the second column.
We can separate each word into letters then assign size per each letter and plot using ggplot2::geom_text:
library(ggplot2) # ggplot2_2.2.0
# data
df1 <- read.table(text ="
Red 30
Brown 12
Black 16
Green 33
Yellow 18
Grey 19
White 11", stringsAsFactors = FALSE)
colnames(df1) <- c("col", "size")
# order alphabetically by word
df1 <- df1[order(df1$col), ]
# separate into letters add size
datPlot <-
do.call(rbind,
lapply(seq(nrow(df1)), function(i){
myLetter <- c(".", unlist(strsplit(df1$col[i], split = "")))
data.frame(myLetter = myLetter,
size = c(10, rep(df1$size[i], length(myLetter) - 1)))
}))
# each letter gets a sequential number on x axis, y is fixed to 1
datPlot$x <- seq(nrow(datPlot))
datPlot$y <- 1
# plot text
ggplot(datPlot, aes(x, y, label = myLetter, size = size/3)) +
geom_text(col = "#F89443") +
scale_size_identity() +
theme_void()

Randomly select a certain percentage of rows and create new columns

I have a species column containing 10 species names. I have to distribute the species into four columns randomly such that each column will take a specific percentage of species.
Let's say the first column takes 20%, the second 30%, the third 40% and the last 10%. The four columns will be four different environments i.e.:
Restricted, Tidalflat, beach, estuary
Hence the column intake will be predefined but the selection will be random.
My input data will look like this:
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
'Nassarius','Cardium','Cardium')
Result should look like this:
Some simple setup:
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
'Nassarius','Cardium','Cardium')
rspecies <- sample(species)
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)
nrs <- round(length(species) * probs)
Now, a data.frame with separate columns is not a very good way of expressing your data, as your data is not rectangular, i.e. you don't have the same number of observations in each column.
You can either present the data in long form:
df <- data.frame(species = rspecies, envir = rep(envirs, nrs), stringsAsFactors = FALSE)
species envir
1 Tellina Restricted
2 Natica Restricted
3 Arca Tidalflat
4 Mactra Tidalflat
5 Tellina Tidalflat
6 Arca Beach
7 Nassarius Beach
8 Cardium Beach
9 Cardium Beach
10 Natica Estuary
Or as a list:
split(rspecies, df$envir)
$Beach
[1] "Mactra" "Natica" "Arca" "Arca"
$Estuary
[1] "Tellina"
$Restricted
[1] "Nassarius" "Cardium"
$Tidalflat
[1] "Cardium" "Natica" "Tellina"
Edit:
One way to accommodate a different number of species is to make the assignment probabilistic: each species is assigned to an environment with that environment's probability. The resulting proportions will be closer to the targets the larger the actual dataset is.
species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
'Nassarius','Cardium','Cardium', 'Cardium')
length(species2)
[1] 11
grps <- sample(envirs, size = length(species2), prob = probs, replace = TRUE)
df2 <- data.frame(species = species2, envir = grps, stringsAsFactors = FALSE)
df2 <- df2[order(df2$envir), ]
species envir
5 Arca Beach
10 Cardium Beach
1 Natica Estuary
11 Cardium Estuary
3 Mactra Restricted
7 Tellina Restricted
2 Tellina Tidalflat
4 Natica Tidalflat
6 Arca Tidalflat
8 Nassarius Tidalflat
9 Cardium Tidalflat
Maybe not in one line of code. I did not understand the column part, but you could use the code below to split the data; note that the resulting columns have unequal lengths, so they need padding before they fit into a data frame.
species <- 1:1000
ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]
or to match your example
species <- c('Natica'
,'Tellina'
,'Mactra'
,'Natica'
,'Arca'
,'Arca'
,'Tellina'
,'Nassarius'
,'Cardium'
,'Cardium')
ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]
dflength <- max(length(first20), length(second30), length(third40),length(forth10))
# pad the shorter groups with NA; each column gets its own unique name
data.frame(first20 = c(first20,rep(NA,dflength-length(first20)))
,second30 = c(second30,rep(NA,dflength-length(second30)))
,third40 = c(third40,rep(NA,dflength-length(third40)))
,forth10 = c(forth10,rep(NA,dflength-length(forth10)))
)
Although I feel that some of the steps could be more compact, I'll let you fiddle with it some more.

cbind 1:nrows of same ID variable value to original data.frame

I have a large dataframe, where a variable id (first column) recurs with different values in the second column. My idea is to order the dataframe, to split it into a list and then lapply a function which cbinds the sequence 1:nrows(variable id) to each group. My code so far:
DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF <- splitstackshape::getanID(DF, "cell")[] # thanks to akrun's answer
ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = cell)) +
geom_line(aes(group = cell)) + geom_point(size=0.1)
}
And the plot looks like this:
Most cells increase in area, only some decrease. This is only a first try to visualize my data, so what you can't see very well is that the areas drop down periodically due to cell division.
Additional question:
There is a problem I didn't take into account beforehand, which is that after a cell division a new cell is added to the data.frame and is handed the initial index 1 (you see in the image that all cells start from .id=1, not later), which is not what I want - it needs to inherit the index of its creation time. First thing that comes into my mind is that I could use a parsing mechanism that does this job for a newly added cell variable:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition which may ease the problem: fixed number of cells at the beginning (32). Another solution would be to cut away all data before the last daughter cell is created.
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L # Indexing
title <- getwd()
myplot <- ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = factor(cell))) +
geom_line(aes(group = cell)) + geom_line(size=0.1) + theme(legend.position="none") + ggtitle(title)
#save the plot
image=myplot
ggsave(file="cell_areas_time.svg", plot=image, width=10, height=8)
}
We can use getanID from splitstackshape
library(splitstackshape)
getanID(DF, "cell")[]
There's a much easier method to accomplish that goal. Use ave with seq_along, passing a vector (not the whole data frame) as the first argument:
DF$group_seq <- ave(seq_along(DF[,1]), DF[,1], FUN = seq_along)
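A minimal sketch of a per-group counter in base R, using a few of the cell/area values shown in the question. The column name counter follows the desired output there; the selection of rows is only illustrative.

```r
# A few rows in recording order: cells 1, 2, 3 repeat over time
DF <- data.frame(
  cell = c(1, 2, 3, 1, 2, 3, 1),
  area = c(121.2130, 81.3555, 81.5862, 121.3270, 80.7832, 81.1816, 122.171)
)

# ave() splits the first argument by the grouping variable, applies FUN
# to each piece and puts the results back in the original row order.
# seq_along() therefore numbers the rows within each cell: 1, 2, 3, ...
DF$counter <- ave(seq_len(nrow(DF)), DF$cell, FUN = seq_along)

# Optional: sort by cell so that each cell's time series is contiguous
DF <- DF[order(DF$cell, DF$counter), ]
DF
```

Unlike the split/lapply/cbind route, this never breaks the data frame apart, so there is no list to bind back together and no risk of confusing length() (number of columns of a data frame) with nrow().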

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
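The same logical test extends to the paired case in the question, where a row should be dropped if either value falls in its column's lower quartile. A minimal sketch, assuming the column names Abundance_O and Abundance_S from the question; the helper filter_by_quantile and its prob argument are hypothetical names introduced here, and only the first eight rows of the posted data are used.

```r
# Keep only rows whose value is above the given quantile in EVERY listed column
filter_by_quantile <- function(df, cols, prob = 0.25) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    keep <- keep & df[[col]] > quantile(df[[col]], prob)
  }
  df[keep, , drop = FALSE]
}

# First eight rows of the example data from the question
Values <- data.frame(
  Abundance_O = c(3635900.752, 463299.4622, 359101.0482, 284966.6421,
                  415283.663, 2076456.856, 620286.6206, 3709754.717),
  Abundance_S = c(1390.883073, 1470.92626, 989.1609251, 3248.832403,
                  2492.231265, 10175.48946, 5074.268802, 269.6856808)
)

filtered <- filter_by_quantile(Values, c("Abundance_O", "Abundance_S"))

# Correlation on the remaining pairs only
cor(filtered$Abundance_O, filtered$Abundance_S)
```

Because whole rows are dropped, the O and S values stay paired. If the script is started with Rscript, the threshold could be read from the command line with as.numeric(commandArgs(trailingOnly = TRUE)[1]) and passed as prob.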
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. The parameter n sets the number of groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
> head(dt)
A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
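Lastly, we only include rows for which both quartile groups are above the first quartile. Each group has to be tested on its own, since a sum such as Q0 + Q1 > 2 would also keep rows where only one of the two is in the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]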

Resources