Given a large data frame with a column that has unique values
(ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT)
I want to replace some of the values. For example, every occurrence of 'ONE' should be replaced by '1' and
'FOUR' -> '2SQUARED'
'FIVE' -> '5'
'EIGHT' -> '2CUBED'
Other values should remain as they are.
IF/ELSE will run forever. How do I apply a vectorized solution? Is match() the correct way to go?
Using rnso's data set:
library(plyr)
transform(data, vals = mapvalues(vals,
c('ONE', 'FOUR', 'FIVE', 'EIGHT'),
c('1','2SQUARED', '5', '2CUBED')))
# vals
# 1 1
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
Try the following using base R:
data = structure(list(vals = structure(c(4L, 8L, 7L, 3L, 2L, 6L, 5L,
1L), .Label = c("EIGHT", "FIVE", "FOUR", "ONE", "SEVEN", "SIX",
"THREE", "TWO"), class = "factor")), .Names = "vals", class = "data.frame", row.names = c(NA,
-8L))
initial = c('ONE', 'FOUR', 'FIVE', 'EIGHT')
final = c('1','2SQUARED', '5', '2CUBED')
myfn = function(ddf, init, fin){
refdf = data.frame(init,fin)
ddf$new = refdf[match(ddf$vals, init), 'fin']
ddf$new = as.character(ddf$new)
ndx = which(is.na(ddf$new))
ddf$new[ndx]= as.character(ddf$vals[ndx])
ddf
}
myfn(data, initial, final)
vals new
1 ONE 1
2 TWO TWO
3 THREE THREE
4 FOUR 2SQUARED
5 FIVE 5
6 SIX SIX
7 SEVEN SEVEN
8 EIGHT 2CUBED
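The same match() idea can be written more compactly when only the recoded vector is needed. The following is a minimal sketch rather than the answerer's code: the helper name recode_values is made up for illustration, and it assumes from and to have the same length.

```r
# Hypothetical helper: replace values found in `from` with the corresponding
# entry of `to`; values not listed in `from` are left unchanged.
recode_values <- function(x, from, to) {
  x <- as.character(x)          # also handles factor input
  idx <- match(x, from)         # position in `from`, NA when not listed
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]        # overwrite only the matched positions
  x
}

vals <- c("ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT")
recode_values(vals,
              c("ONE", "FOUR", "FIVE", "EIGHT"),
              c("1", "2SQUARED", "5", "2CUBED"))
```

Because match() is vectorized, this does one lookup pass over the column instead of a row-by-row IF/ELSE.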
Your column is probably a factor. Give this a try. Using rnso's data, I'd recommend you first create two vectors of values to change from and values to change to
from <- c("FOUR", "FIVE", "EIGHT")
to <- c("2SQUARED", "5", "2CUBED")
Then replace the factor levels with
# assign to data$vals directly; an assignment made inside with() would not modify data
levels(data$vals)[match(from, levels(data$vals))] <- to
This gives
data
# vals
# 1 ONE
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
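To see why renaming levels is cheap, here is a small sketch on a made-up factor (the vector vals below is illustrative, not the asker's data). Only the short levels vector is edited, however many rows the column has.

```r
# Illustrative factor with repeated values
vals <- factor(c("ONE", "TWO", "FOUR", "FOUR", "EIGHT"))
from <- c("FOUR", "EIGHT")
to   <- c("2SQUARED", "2CUBED")

# Rename the matching levels; every row that uses them changes at once
levels(vals)[match(from, levels(vals))] <- to

as.character(vals)
# "ONE" "TWO" "2SQUARED" "2SQUARED" "2CUBED"
```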
I have a dataset (test_df) that looks like:
Species  TreatmentA  TreatmentB  X0  L   K
Apple    Hot         Cloudy      1   2   3
Apple    Cold        Cloudy      4   5   6
Orange   Hot         Sunny       7   8   9
Orange   Cold        Sunny       10  11  12
I would like to display the effect of the treatments by using the X0, L, and K values as coefficients in a standard logistic function and plotting the same species across the various treatments on the same plot. I would like a grid of plots with the logistic curves for each species in its own panel, with the treatments distinguished by color within each panel. In the above example, Plot1.Grid1 would have 2 logistic curves corresponding to Apple Hot and Apple Cold, and Plot1.Grid2 would have 2 logistic curves corresponding to Orange Hot and Orange Cold.
The below code will create a single logistic function curve which can then be layered, but manually adding the layers for multiple treatments is tedious.
testx0 <- 1
testL <- 2
testk <- 3
days <- seq(from = -5, to = 5, by = 1)
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
ggplot()+aes(x = days, y = functionmultitest(days,testL,testk,testx0))+geom_line()
The method described at https://statisticsglobe.com/draw-multiple-function-curves-to-same-plot-in-r works for dataframes with few species or treatments, but it becomes very tedious to define the curves individually if you have many treatments/species. Is there a way to programmatically pass the list of coefficients and have ggplot handle the grouping?
Thank you!
Your current code shows how to compute the curve for a single row in your data frame. What you can do is pre-compute the curve for each row and then feed to ggplot.
Setup:
# Packages
library(ggplot2)
# Your days vector
days <- seq(from = -5, to = 5, by = 1)
# Your sample data frame above
df = structure(list(Species = c("Apple", "Apple", "Orange", "Orange"
), TreatmentA = c("Hot", "Cold", "Hot", "Cold"), TreatmentB = c("Cloudy",
"Cloudy", "Sunny", "Sunny"), X0 = c(1L, 4L, 7L, 10L), L = c(2L,
5L, 8L, 11L), K = c(3L, 6L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-4L))
# Your function
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
We'll "expand" each row of your data frame with the days vector:
# Define first a data frame of days:
days_df = data.frame(days = days)
# Perform a cross join
df_all = merge(days_df, df, all = T)
At this point, you will have a data frame in which each original row is repeated once for every value in days.
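As a quick sanity check of the cross join, here is a sketch on a cut-down, made-up version of the data. It passes by = NULL, which merge() documents as producing the Cartesian product, so the result should have one row per combination of day and original row.

```r
days <- seq(from = -5, to = 5, by = 1)            # 11 values

# Cut-down illustrative data frame (two of the rows from the question)
df_small <- data.frame(Species    = c("Apple", "Apple"),
                       TreatmentA = c("Hot", "Cold"),
                       X0 = c(1, 4), L = c(2, 5), K = c(3, 6))

# by = NULL requests the Cartesian product of the two data frames
df_all_small <- merge(data.frame(days = days), df_small, by = NULL)

nrow(df_all_small)                # 11 days * 2 rows = 22
table(df_all_small$TreatmentA)    # each treatment appears 11 times
```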
Now, just as you did for one row, we'll compute the value of the function for each row and store it in df_all as result:
df_all$result = mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
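Since the logistic function uses only vectorized arithmetic, calling it directly on the columns should give the same values as mapply(). The sketch below checks this on a few made-up inputs; the function name logistic is a shortened stand-in for functionmultitest.

```r
# Stand-in for functionmultitest with the same argument order
logistic <- function(x, L, K, X0) {
  L / (1 + exp(-K * (x - X0)))
}

# Made-up parallel vectors, one element per "row"
x  <- c(-1, 0, 1)
L  <- c(2, 5, 8)
K  <- c(3, 6, 9)
X0 <- c(1, 4, 7)

via_mapply <- mapply(logistic, x, L, K, X0)   # element-by-element calls
direct     <- logistic(x, L, K, X0)           # single vectorized call

all.equal(via_mapply, direct)   # TRUE
```

On a large expanded data frame the direct call avoids one R function call per row.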
I'm not sure how you intended to handle treatmentA and treatmentB, so I'll just combine for illustration purposes:
df_all$combined_treatment = paste0(df_all$TreatmentA, "-", df_all$TreatmentB)
We can now feed this data frame to ggplot, set the color to be combined_treatment, and use the facet_grid function to split by species
ggplot(data = df_all, aes(x = days, y = result, color = combined_treatment))+
geom_line() +
facet_grid(Species ~ ., scales = "free")
The result is as follows:
I want to remove rows containing the text "student2". However, I don't want to remove rows like "student22", "student23", etc.
For example:
Student.Code Values
1 canada.student12 2
2 canada.student2 3 # remove
3 canada.student23 5 # keep
4 US.student2 6 # remove
5 US.student32 2
6 Aus.student87 645
7 Turkey.student25 4 #keep
I used grepl("student2", example$Student.Code, fixed = TRUE), but it also matches (and so removes) rows like "student23".
We can use grepl("student2$", example$Student.Code); the $ anchors the match to the end of the string, so "student23" no longer matches.
library(tidyverse)
example <- tibble::tribble(
~Student.Code, ~Values,
"canada.student12", 2L,
"canada.student2", 3L,
"canada.student23", 5L,
"US.student2", 6L,
"US.student32", 2L,
"Aus.student87", 645L,
"Turkey.student25", 4L
)
example$Student.Code
grepl("student2$", example$Student.Code)
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE
example %>%
filter(!grepl("student2$", Student.Code))
# A tibble: 5 x 2
Student.Code Values
<chr> <int>
1 canada.student12 2
2 canada.student23 5
3 US.student32 2
4 Aus.student87 645
5 Turkey.student25 4
Data:
df <- data.frame(
Student = c("canada.student12", "canada.student2", "canada.student23","US.student2", "US.student32", "Aus.student87", "Turkey.student25"),
Value = c(2,3,5,6,2,654,5)
)
Solution: (in base R)
The idea is to use grepl to match those values where the number 2 occurs at the word boundary, that is, in regex, at \\b, and to exclude these strings with the negator !:
df[!grepl("student2\\b", df$Student),]
Student Value
1 canada.student12 2
3 canada.student23 5
5 US.student32 2
6 Aus.student87 654
7 Turkey.student25 5
Alternatively, you can also go the opposite way and match those patterns that you want to keep:
df[grepl("student(?=\\d{2,})", df$Student, perl = T),]
Here, the idea is to use positive lookahead to match values with student iff they are followed immediately by at least two digits (\\d{2,}). (Note that when using lookahead or lookbehind you need to include perl = T.)
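To compare the patterns discussed in this thread side by side, here is a small base R sketch on the student codes from the question. The variable names are illustrative only.

```r
codes <- c("canada.student12", "canada.student2", "canada.student23",
           "US.student2", "US.student32", "Aus.student87", "Turkey.student25")

# Rows to remove: "student2" at the end of the string ...
ends     <- grepl("student2$", codes)
# ... or "student2" followed by a word boundary
boundary <- grepl("student2\\b", codes)
# Rows to keep: "student" followed by at least two digits (needs perl = TRUE)
keep     <- grepl("student(?=\\d{2,})", codes, perl = TRUE)

identical(ends, boundary)   # both patterns flag the same rows here
identical(!ends, keep)      # the lookahead selects exactly the other rows
codes[!ends]
```

On this data all three approaches agree. They could differ on other inputs, for example if a code had text after the number.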
If you have a variable with an exact value you want to remove, don't use grep or grepl.
example <- tibble::tribble(
~Student.Code, ~Values,
"canada.student12", 2L,
"canada.student2", 3L,
"canada.student23", 5L,
"US.student2", 6L,
"US.student32", 2L,
"Aus.student87", 645L,
"Turkey.student25", 4L
)
example <- example[example$Student.Code != "canada.student2",]
# or, in dplyr
example <- filter(example, Student.Code != "canada.student2")
# for multiple values
example <- filter(example, !(Student.Code %in% c("canada.student2", "US.student2")))
fixed = TRUE does not help here, because all it means is 'search for this exact substring in the input strings', not 'match only when this exact string is the whole value'.
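A short sketch of that difference, using the codes from the question: fixed = TRUE turns off regex interpretation but still matches anywhere in the string, while endsWith() (base R) or == tests the literal text at the end or as the whole value.

```r
codes <- c("canada.student12", "canada.student2", "canada.student23",
           "US.student2", "US.student32", "Aus.student87", "Turkey.student25")

# Literal substring search: also hits "student23" and "student25"
grepl("student2", codes, fixed = TRUE)
# FALSE TRUE TRUE TRUE FALSE FALSE TRUE

# Literal suffix test: only codes that end in "student2"
endsWith(codes, "student2")
# FALSE TRUE FALSE TRUE FALSE FALSE FALSE

# Exact whole-value comparison
codes == "canada.student2"
```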
Supposing I have the following dataframes:
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
What I am trying to do is get the location based on the index and what dataframe it is. So I have the following:
data <- data.frame(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
Here src indicates which dataframe the data should come from, and index is the value to look up in that dataframe's index column.
src | index
-------------
one | 1
one | 4
two | 2
one | 3
two | 2
And I would like it to become:
src | index | location
-----------------------
one | 1 | barn
one | 4 | tomb
two | 2 | computer
one | 3 | restaurant
two | 2 | computer
Due to the size of my data I would like to avoid merge or comparable joins (sqldf, etc).
Here's one way to add a new column by reference using data.table:
require(data.table)
setDT(d1); setDT(d2); setDT(data) # convert all data.frames to data.tables
data[src == "one", location := d1[.SD, location, on="index"]]
data[src == "two", location := d2[.SD, location, on="index"]]
.SD stands for Subset of Data; here it contains all columns of data for the rows that match the condition provided in the i argument.
See the vignettes for more.
You can also use match in the expression to the right of := instead of extracting location with a join. However, that would not extend easily if you wanted to match on multiple columns.
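The match() alternative mentioned above might look like the following. It is a sketch written in base R so that it runs without extra packages; in data.table the same right-hand side could be used after :=. Unlike indexing by row position, match() looks up the value in the index column, so it does not rely on index being equal to the row number.

```r
d1 <- data.frame(index = c(1, 2, 3, 4),
                 location = c("barn", "house", "restaurant", "tomb"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(index = c(1, 2, 3, 4),
                 location = c("server", "computer", "home", "dictionary"),
                 stringsAsFactors = FALSE)
data <- data.frame(src = c("one", "one", "two", "one", "two"),
                   index = c(1, 4, 2, 3, 2),
                   stringsAsFactors = FALSE)

data$location <- NA_character_
is_one <- data$src == "one"
is_two <- data$src == "two"

# Look up each index value in the matching source table
data$location[is_one] <- d1$location[match(data$index[is_one], d1$index)]
data$location[is_two] <- d2$location[match(data$index[is_two], d2$index)]

data$location
# "barn" "tomb" "computer" "restaurant" "computer"
```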
library(dplyr)
mutate(data,
location = ifelse(src == "one",
as.character(d1[index, "location"]),
as.character(d2[index, "location"])))
output
src index location
1 one 1 barn
2 one 4 tomb
3 two 2 computer
4 one 3 restaurant
5 two 2 computer
data.table will help you to deal with Big Data much more efficiently.
You could either use match or a special data.table implementation of merge that's much faster than the merge of my original solution, as we discussed in the comments.
Here's an example:
require(data.table)
d1 <- data.frame(index = c(1,2,3,4), location = c('barn', 'house', 'restaurant', 'tomb'), random = c(5,3,2,1), different_col1 = c(66,33,22,11))
d2 <- data.frame(index = c(1,2,3,4), location = c('server', 'computer', 'home', 'dictionary'), random = c(1,7,2,9), differen_col2 = c('hi', 'there', 'different', 'column'))
mydata <- data.table(src = c('one', 'one', 'two', 'one', 'two'), index = c(1,4,2,3,2))
mydata.d1 <- mydata[mydata$src == "one",]
mydata.d2 <- mydata[mydata$src == "two",]
mydata.d1 <- merge(mydata.d1, d1, all.x = T, by = "index")
mydata.d2 <- merge(mydata.d2, d2, all.x = T, by = "index")
# If you want to keep the 'different column' values from d1 and d2:
mydata <- rbind(mydata.d1, mydata.d2, fill = T)
mydata
index src location random different_col1 differen_col2
1: 1 one barn 5 66 NA
2: 3 one restaurant 2 22 NA
3: 4 one tomb 1 11 NA
4: 2 two computer 7 NA there
5: 2 two computer 7 NA there
# If you don't want to keep those 'different column' values:
mydata <- rbind(mydata.d1[,.(index, src, location)], mydata.d2[,.(index, src, location)])
mydata
index src location
1: 1 one barn
2: 3 one restaurant
3: 4 one tomb
4: 2 two computer
5: 2 two computer
Base solution: use a character index to choose the correct dataframe and then use mapply to handle the multiple "parallel" arguments.
dput(dat)
structure(list(src = c("one", "one", "two", "one", "two"), X. = c("|",
"|", "|", "|", "|"), index = c(1L, 4L, 2L, 3L, 2L), location = structure(c(1L,
4L, 5L, 3L, 5L), .Label = c("barn", "house", "restaurant", "tomb",
"computer", "dictionary", "home", "server"), class = "factor")), .Names = c("src",
"X.", "index", "location"), row.names = c(NA, -5L), class = "data.frame")
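You may need stringsAsFactors = FALSE when creating d1 and d2 so that location is character rather than factor.
# assumed definition (not shown in the original post): a list of the two data frames, named to match dat$src
dlist <- list(one = d1, two = d2)
dat$location <- mapply(function(whichd,i) dlist[[whichd]][i,'location'], whichd=dat$src, i=dat$index)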
> dat
src X. index location
1 one | 1 barn
2 one | 4 tomb
3 two | 2 computer
4 one | 3 restaurant
5 two | 2 computer
As part of a project, I am using R to analyze some data. I am stuck on deriving a few values from an existing dataset that I imported from a csv file.
The file looks like:
For my analysis, I want to create another column that holds the difference between the current value of x and its previous value. For the first row of every unique i, the new column should simply equal the current x. I am new to R and have been trying various approaches for some time without success. I would appreciate suggestions on how to achieve this.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note, however, that diff() returns a vector one element shorter than its input, so to add the result as a new column to your existing data frame you need to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That puts an NA value in the first entry of your new column; the remaining values are the differences between consecutive values in your "x" column.
UPDATE:
You can write a simple loop that subsets the data for each unique value of "i" and then calculates the differences between the x values within that subset:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in seq_along(values)){
# use == to compare against the current group value
data1 = MyData[MyData$i == values[i],]
data1$newX = c(NA,diff(data1$x))
# accumulate into newdf, the object initialized above
newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
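The per-group loop can also be written without an explicit loop in base R using ave(), which applies a function within each group and returns the results in the original row order. The sketch below uses made-up data and follows the question's requirement that the first row of each group keeps its own x; replace v[1] with NA to get the NA-padded variant instead.

```r
# Small illustrative data set with two groups in column i
MyData <- data.frame(t = 1:5,
                     x = c(10, 15, 12, 100, 90),
                     i = c(1, 1, 1, 2, 2))

# Within each group: first value as is, then successive differences
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))

MyData$x_diff
# 10 5 -3 100 -10
```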
I have 2 dataframes sharing the same row IDs but with different columns.
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where the x axis shows chrom and coord combined (so 3 points) and, for each x value, there are 2 boxplots side by side corresponding to the two dataframes.
What is the best way of doing this? Should I somehow merge the two dataframes into a single one and loop over the boxplot rendering by 3 columns?
Any idea how this can be done?
The problem is that the two dataframes have the same number of rows but can differ in number of columns:
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframes in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried:
I merged the chrom and coord columns to create a single ID.
I used reshape to melt the dataframes.
I merged the 2 melted dataframes into a single one.
The head looks like this:
I have two variables, A2 and A4, corresponding to the 2 dataframes.
Then I created a boxplot using this:
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem, but the boxplot looks very busy with 99 x values and 2 boxplots each.
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
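Then I would combine and reshape the data to make it easier to plot. Here's what I'd do:
# melt() is assumed to come from reshape2 (the older reshape package also provides one)
library(reshape2)
library(ggplot2)
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# tag each melted table with its source, then stack them into mm
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))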
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chrom/coord pairs sorted in the correct order.
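The factor-ordering step is the least obvious part, so here is a standalone sketch of it on made-up chrom/coord values. Pasted labels are plain strings, so a default factor would order them alphabetically. Ordering the rows numerically first and taking the unique labels in that order gives levels that follow genomic position.

```r
# Made-up positions, deliberately out of order
chrom <- c(10L, 2L, 10L, 2L)
coord <- c(500L, 700L, 90L, 700L)

labels <- paste(chrom, coord, sep = ":")

# Numeric ordering by chrom, then coord
ord <- order(chrom, coord)

# Levels taken from the numerically sorted labels, duplicates dropped
pos <- factor(labels, levels = unique(labels[ord]))

levels(pos)
# "2:700" "10:90" "10:500"
```

Because ggplot2 draws discrete axes in level order, the boxplots then appear sorted by position rather than by label text.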
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like