I have a data frame with samples taken in different seasons. I would like to summarize which sites have samples from spring (March-May) and autumn (September-November) across different years. For example, if Site A had a sample from Spring 2007, the cell reads 'TRUE'. Here is an example dataset:
Dates <- data.frame(c(as.Date("2007-9-1"),
rep(as.Date("2008-3-1"), times = 3) ,
rep(as.Date("2008-9-1"), times = 3)))
Sites <- as.data.frame(as.factor(c("SiteA",rep(c("SiteA","SiteB","SiteC"), 2))))
Values <- data.frame(matrix(sample(0:50, 3.5*2, replace=TRUE), ncol=1))
Dataframe <- cbind(Dates,Sites,Values)
colnames(Dataframe) <- c("date","site","value")
I have managed to create the factor 'Season' within this data frame using the following code.
Dataframe$Months <- as.numeric(format(Dataframe$date, '%m'))
Dataframe$Season <- cut(Dataframe$Months,
breaks = c(1, 3, 6, 9, 12, 13), # Jan-Feb, Mar-May, Jun-Aug, Sep-Nov, Dec
labels = c("Winter", "Spring", "Summer", "Autumn", "Winter"),
right = FALSE)
But I am unsure where to go from here. Here is what the output should look like.
A <- rep("TRUE",times = 3)
B <- c("FALSE",rep("TRUE",times = 2))
C <- c("FALSE",rep("TRUE",times = 2))
Output <- as.data.frame(rbind(A,B,C))
colnames(Output) <- c("Autumn.07","Spring.08","Autumn.08")
Here is a suggestion:
Dataframe$Samplings <- interaction(Dataframe$Season, unlist(lapply(strsplit(as.character(Dataframe$date), '-'), function(x) x[[1]]) ))
u1 <- unique(Dataframe$site)
u2 <- unique(Dataframe$Samplings)
output <- matrix(
matrix(levels(interaction(u1, u2)), nrow=length(unique(Dataframe$site))) %in%
interaction(Dataframe$site,Dataframe$Samplings),
nrow=length(unique(Dataframe$site))
)
colnames(output) <- levels(Dataframe$Samplings)
rownames(output) <- unique(Dataframe$site)
output # with all time interactions
# you can clear it with
output[, apply(output, 2, sum) != 0]
Using reshape2::dcast:
Dataframe$site <- gsub("Site","",Dataframe$site)
Dataframe$year <- format(Dataframe$date, "%y")
temp <- reshape2::dcast(Dataframe, site ~ Season + year, length)
(ans <- apply(data.frame(temp[,2:ncol(temp)], row.names=temp[,1]), 1:2, as.logical))
There is a warning with your Dataframe$Season due to duplicate labels. You might want to fix that.
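One way to sidestep the duplicate-label issue is to skip cut() and index a 12-element lookup vector by month number. This is only a sketch: the names month_to_season, dates and season are made up for illustration and are not part of the question's code.

```r
# Lookup vector: position = month number, value = season name.
# Spring is March-May and autumn is September-November, as in the question.
month_to_season <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Autumn", "Autumn", "Autumn",
                     "Winter")

# A few illustrative dates (not the question's data)
dates <- as.Date(c("2007-09-01", "2008-03-01", "2008-12-15"))
months <- as.numeric(format(dates, "%m"))

# Each season appears once in levels, so there are no duplicate labels
season <- factor(month_to_season[months],
                 levels = c("Winter", "Spring", "Summer", "Autumn"))
season
```

Because December and January-February share one entry in levels, nothing is duplicated and no month falls outside the breaks.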
I think that this is what you're looking for. The time label isn't exactly as in the question, but I hope it's still understandable.
library(reshape2)
# prepare the input, to have a handy label for the columns
Dataframe$Year <- as.numeric(format(Dataframe$date, '%Y'))
Dataframe$TimeLabel <- paste0(Dataframe$Season, '.', Dataframe$Year)
# This is in stages, to make it clear what's happening.
# create a data frame with the right structure, but cells holding NA / numbers
df1 <- dcast(Dataframe, site ~ TimeLabel, value.var = "value")
# turn NA / number into false/true, while ignoring the site column
df2 <- !is.na(df1[, -1])
# add back the site labels for rows
df3 <- cbind(as.data.frame(df1$site), df2)
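For comparison, the same presence/absence table can probably be built in base R by cross-tabulating site against the season-year label and testing for counts above zero. This is a sketch on a rebuilt copy of the example data; the ifelse() season mapping and the name presence are assumptions made for the illustration, not code from the answers above.

```r
# Rebuild the structure of the example data (the value column is not needed here)
Dataframe <- data.frame(
  date = as.Date(c("2007-09-01", rep("2008-03-01", 3), rep("2008-09-01", 3))),
  site = c("SiteA", rep(c("SiteA", "SiteB", "SiteC"), 2))
)

# Season from the month number: spring = Mar-May, autumn = Sep-Nov
month <- as.numeric(format(Dataframe$date, "%m"))
Dataframe$Season <- ifelse(month %in% 3:5, "Spring",
                    ifelse(month %in% 9:11, "Autumn", "Other"))

# Column label such as "Autumn.07"
Dataframe$TimeLabel <- paste0(Dataframe$Season, ".", format(Dataframe$date, "%y"))

# Count samples per site and label, then turn the counts into TRUE/FALSE
presence <- table(Dataframe$site, Dataframe$TimeLabel) > 0
presence
```

The columns come out in alphabetical order of the label rather than chronological order, so they may need reordering for display.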
I'm looking for an easy way to make a table in R that shows each variable as a row and each variable category as a column. Each cell should display the frequency of that category, and the last column should hold the sum. The point is to easily display the distribution of different variables that share the same categories. I have included a picture to show what I'm looking for.
I have managed to produce some code that achieves what I want, but it takes a lot of time to do this for each variable I want to include in the table.
mydata <- as.data.frame((table(mydat$var)))
mydata <- as.data.frame(t(mydata))
mydata <- lapply(mydata, as.numeric)
mydata <- as.data.frame(mydata)
mydata$sum <- (mydata$`category 1` + mydata$`category 2` + mydata$`category 3`)
mydata[-c(1), ]
The result looks like this:
To add more variables I imagine that I could use rbind(), but is there an easier way to achieve something similar?
Here is a reproducible example using the mtcars dataset.
data("mtcars")
tdata <- as.data.frame(table(mtcars$cyl))
tdata1 <- as.data.frame(t(tdata))
tdata2 <- lapply(tdata1, as.numeric)
tdata3 <- as.data.frame(tdata2)
tdata3$sum <- (tdata3$V1 + tdata3$V2 + tdata3$V3)
tdata3 <- tdata3[-c(1),]
tdata3
Assuming you have a data.frame where each variable has the same categories (as in your example):
df <- data.frame(Var1 = c(rep("Cat1", 30),
rep("Cat2", 10),
rep("Cat3", 20) ),
Var2 = c(rep("Cat1", 10),
rep("Cat2", 20),
rep("Cat3", 30) ),
Var3 = c(rep("Cat1", 5),
rep("Cat2", 25),
rep("Cat3", 30) ) )
You could use lapply() to apply the table() function to every column in your data.frame:
tab <- lapply(colnames(df), function(x) table(df[, x]))
As lapply() outputs a list, use do.call to bind them, and rowSums() to create the sum column:
tab <- data.frame(do.call(rbind, t(tab)))
tab$Sum <- rowSums(tab)
# add variable labels as rows
rownames(tab) <- colnames(df)
The output will look like this:
Cat1 Cat2 Cat3 Sum
Var1 30 10 20 60
Var2 10 20 30 60
Var3 5 25 30 60
And, you could throw all this in a function:
my_tab_fun <- function(df) {
tab <- lapply(colnames(df),
function(x) table(df[, x]))
tab <- data.frame(
do.call(rbind, t(tab)))
tab$Sum <- rowSums(tab)
rownames(tab) <- colnames(df)
return(tab)
}
my_tab_fun(df)
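If some variable happens to lack one of the categories, the per-column tables will have different lengths and rbind() will no longer line them up. A possible workaround, sketched here on made-up data, is to fix the factor levels before tabulating. The names df_small, cats and tab_small are hypothetical.

```r
# Toy data: Var1 has no "Cat3", Var2 has no "Cat1"
df_small <- data.frame(
  Var1 = c(rep("Cat1", 3), rep("Cat2", 2)),
  Var2 = c(rep("Cat2", 4), "Cat3")
)

# Full set of categories across all columns
cats <- sort(unique(unlist(lapply(df_small, as.character))))

# Tabulate each column against the same levels, so absent categories count as 0;
# sapply() returns categories in rows, so transpose to get variables in rows
tab_small <- t(sapply(df_small, function(x) table(factor(x, levels = cats))))

# Append the row totals as the last column
tab_small <- cbind(tab_small, Sum = rowSums(tab_small))
tab_small
```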
Suppose I have a data frame (DF) that looks like the following:
test <- c('Test1','Test2','Test3')
col.DF.names <- c('ID', 'year', 'car', 'age', 'year.1', 'car.1', 'age.1', 'year.2', 'car.2', 'age.2')
ID <- c('A','B','C')
year <- c(2001,2002,2003)
car <- c('acura','benz','lexus')
age <- c(55,16,20)
year.1 <- c(2011,2012,2013)
car.1 <- c('honda','gm','bmw')
age.1 <- c(43,21,34)
year.2 <- c(1961,1962,1963)
car.2 <- c('toyota','porsche','jeep')
age.2 <- c(33,56,42)
DF <- data.frame(ID, year, car, age, year.1, car.1, age.1, year.2, car.2, age.2)
I need the columns of data frame to lose the ".#" and instead have the Test# in front of it, so it looks something like this:
ID Test1.year Test1.car Test1.age Test2.year Test2.car Test2.age Test3.year Test3.car Test3.age
.... with all the data
Does anyone have a suggestion? Basically, starting at the second column, I'd like to add the test[1] name for 3 columns, and then move to the next set of three columns and add test[2], and so on.
I know how to hard code it:
colnames(DF)[2:4] <- paste(test[1], colnames(DF)[2:4], sep = ".")
but this is a toy set, and I would like to automate it somewhat, so that I'm not specifically indicating [2:4], for example.
You could try:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[-1], sep = ".")
or perhaps the following would be better:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[2:4], sep = ".")
or:
colnames(DF)[-1] <- paste(rep(test, each=3), colnames(DF)[2:4], sep = ".")
Thanks to @thelatemail.
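To avoid hard-coding both the 2:4 and the block size of 3, the numeric suffix can be stripped with sub() and the block size derived from the lengths. This is a sketch on a one-row version of the example; base_names and block_size are names introduced here, and it assumes the columns after ID split evenly among the entries of test.

```r
test <- c("Test1", "Test2", "Test3")

# One-row version of the example data frame
DF <- data.frame(ID = "A",
                 year = 2001, car = "acura", age = 55,
                 year.1 = 2011, car.1 = "honda", age.1 = 43,
                 year.2 = 1961, car.2 = "toyota", age.2 = 33)

# Drop a trailing ".<digits>" suffix from every column except ID
base_names <- sub("\\.[0-9]+$", "", colnames(DF)[-1])

# Number of columns per test, derived instead of hard-coded
block_size <- length(base_names) / length(test)

colnames(DF)[-1] <- paste(rep(test, each = block_size), base_names, sep = ".")
colnames(DF)
```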
I have three data sets that I need to merge. They contain school data and read/math scores for grades 4 and 5. One of them is a long-form data set that has a lot of missingness in some variables (yes, I do need the data in long form), and the other two hold the complete data, including the values that are missing from the long-form set, in wide form. All of these data frames contain a column with a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # provides melt(), used below
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of filling in the read scores and the math scores where there are NAs. I simply need one column with the read scores and one with the math scores, instead of six separate columns (math4, read4, math5, read5, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
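To make the "first non-NA wins" behaviour concrete, here is a rough base R imitation of that element-wise logic. The helper coalesce_base is hypothetical and written only for this illustration; it is not how dplyr implements coalesce.

```r
# Hypothetical helper: for each position, keep the first non-NA value
# found when scanning the supplied vectors from left to right
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

first  <- c(1, NA, NA)
second <- c(9, 2, NA)
third  <- c(7, 8, 3)

# Position 1 comes from first, position 2 from second, position 3 from third
filled <- coalesce_base(first, second, third)
filled
```

Note that all vectors must have the same length (or length 1), which is why the approach depends on the rows being aligned after the merge.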
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (although it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
library(reshape2) # melt()
library(dplyr) # mutate(), coalesce()
# In the example data the score columns are named math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL
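The code above relies on school_lf and sch_full having their rows in exactly the same order. If that cannot be guaranteed, one option is to look up each (id, grade) pair explicitly with match(). A minimal sketch on toy data, where long, full, key_long, key_full and pos are all names invented for the illustration:

```r
# Long-form data with gaps
long <- data.frame(id = c(1, 2, 1, 2),
                   grade = c(4, 4, 5, 5),
                   mathscore = c(NA, 410, 450, NA))

# Complete data, deliberately in a different row order
full <- data.frame(id = c(2, 1, 2, 1),
                   grade = c(5, 4, 4, 5),
                   mathscore = c(480, 420, 410, 450))

# Build a key per row and find, for every row of long, its row in full
key_long <- paste(long$id, long$grade)
key_full <- paste(full$id, full$grade)
pos <- match(key_long, key_full)

# Fill only the missing scores, using the matched row from full
long$mathscore <- ifelse(is.na(long$mathscore),
                         full$mathscore[pos],
                         long$mathscore)
long
```

The same pattern would apply to readscore. Rows of long with no counterpart in full simply stay NA.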
I have a question on data subset based on dynamic column class. For example:
#Coming from another source. I don't exactly know their names or number of classes.
#But following are two demography, which will help in imagining the problem
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#moredemo.................
# reproducible data
set.seed(1)
col1 <- as.data.frame(rep(gender, 100))
col2 <- as.data.frame(rep(agegroup, 25))
col3 <- runif(200)
datafile <- cbind(col1, col2, col3)
names(datafile)[1] = "gender"
names(datafile)[2] = "agegroup"
datafile <- as.data.frame(datafile)
#Subset is only for gender = 1 and agegroup = 3
#Subset is for every combination of classes in each demography
#No hardcoded name should be required, because the demography names will not be known
dat_gender_1_agegroup_3 <- datafile[datafile$gender == 1 & datafile$agegroup == 3, ]
But there can be more demographic variables than just gender and agegroup: income, education, race and so on. Each of them has a varying number of classes. Kindly help me get the subsets of the dataset datafile over a varying number of columns. Thanks in advance.
Use expand.grid for the combinations, then apply to subset:
#dummy data
set.seed(123)
mydata <- data.frame(gender = sample(1:2, 100, replace = TRUE),
agegroup = sample(1:10, 100, replace = TRUE))
#groups
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#get all combo
myCombo <- expand.grid(gender, agegroup)
#result is a list object
apply(myCombo, 1, function(i){
mydata[ mydata$gender == i[1] &
mydata$agegroup == i[2], ]
})
Edit: Based on the update, I think you just need the split command:
split(datafile, datafile[, 1:2])
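When the grouping columns are not known in advance, the column selection can be computed rather than written as 1:2. A sketch under the assumption that every column except the measurement is a demographic variable; demo_cols, subsets and the column name value are illustrative choices, not taken from the question.

```r
set.seed(1)
datafile <- data.frame(gender = rep(1:2, 8),
                       agegroup = rep(1:4, each = 4),
                       value = runif(16))

# Treat every column other than the measurement as a grouping variable
demo_cols <- setdiff(names(datafile), "value")

# One data frame per observed combination; drop = TRUE omits empty combinations
subsets <- split(datafile, datafile[demo_cols], drop = TRUE)

names(subsets)    # combination labels such as "1.3" (gender 1, agegroup 3)
subsets[["1.3"]]  # the subset for gender == 1 and agegroup == 3
```

Adding another demographic column to datafile would extend the labels automatically, for example to "1.3.2".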
What about (assuming the column names are "gender" and "agegroup"):
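gender_vals <- c(1,2)
agegroup_vals <- c(1,2,3,4,5,6,7,8)
# the value vectors are named differently from the columns; otherwise subset() would compare each column with itself
data_subset <- subset(full_data, gender %in% gender_vals | agegroup %in% agegroup_vals) # | [AND SO ON]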
You can add as many [column_name]%in%[values] as you want.
HTH a little!
EDIT: you can very well use & instead of |, obviously.
I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (that is, dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- as.numeric(range[1])
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
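A middle ground that avoids expanding every position is to parse the chromosome, start and end into their own columns, join on chromosome, and then filter by position. This is only a sketch on invented toy data; ranges, variants, candidates and hits are hypothetical names, and the join can still get large if one chromosome carries many ranges and many variants.

```r
# Toy stand-ins for dat and dat2
ranges <- data.frame(Locus = c("L1", "L2"),
                     Pos = c("chr15:100..100", "chr15:200..210"),
                     stringsAsFactors = FALSE)
variants <- data.frame(VAR = c("chr15:205", "chr1:205", "chr15:100"),
                       stringsAsFactors = FALSE)

# Split "chr:start..end" into chromosome, start and end
ranges$chr   <- sub(":.*$", "", ranges$Pos)
ranges$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", ranges$Pos))
ranges$end   <- as.numeric(sub("^.*\\.\\.", "", ranges$Pos))

# Split "chr:position" into chromosome and position
variants$chr <- sub(":.*$", "", variants$VAR)
variants$pos <- as.numeric(sub("^.*:", "", variants$VAR))

# Pair every range with every variant on the same chromosome,
# then keep the pairs where the position lies inside the range (inclusive)
candidates <- merge(ranges, variants, by = "chr")
hits <- candidates[candidates$pos >= candidates$start &
                   candidates$pos <= candidates$end, ]
hits
```

A variant that falls inside several ranges appears once per range, which matches the "merged each time" requirement in the question.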
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive; note that only positions are compared, not chromosomes
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# collect the (dat row, dat2 row) index pairs that matched
idx <- which(matrix(matches, nrow = nrow(dat)), arr.ind = TRUE)
# bind the matching rows side by side, one output row per match
cbind(dat[idx[, "row"], , drop = FALSE], dat2[idx[, "col"], , drop = FALSE])