Suppose I have a data frame (DF) that looks like the following:
test <- c('Test1','Test2','Test3')
col.DF.names <- c('ID', 'year', 'car', 'age', 'year.1', 'car.1', 'age.1', 'year.2', 'car.2', 'age.2')
ID <- c('A','B','C')
year <- c(2001,2002,2003)
car <- c('acura','benz','lexus')
age <- c(55,16,20)
year.1 <- c(2011,2012,2013)
car.1 <- c('honda','gm','bmw')
age.1 <- c(43,21,34)
year.2 <- c(1961,1962,1963)
car.2 <- c('toyota','porsche','jeep')
age.2 <- c(33,56,42)
DF <- data.frame(ID, year, car, age, year.1, car.1, age.1, year.2, car.2, age.2)
I need the columns of the data frame to lose the ".#" suffix and instead have the Test# name in front, so it looks something like this:
ID Test1.year Test1.car Test1.age Test2.year Test2.car Test2.age Test3.year Test3.car Test3.age
.... with all the data
Does anyone have a suggestion? Basically, starting at the second column, I'd like to add the test[1] name to 3 columns, then move to the next set of three columns and add test[2], and so on.
I know how to hard code it:
colnames(DF)[2:4] <- paste(test[1], colnames(DF)[2:4], sep = ".")
but this is a toy set, and I would like to automate it so that I'm not specifically indicating [2:4], for example.
You could try the following, which adds the prefixes but keeps the ".1" and ".2" suffixes:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[-1], sep = ".")
or perhaps the following would be better, since it drops the suffixes by recycling the names of columns 2:4:
colnames(DF)[-1] <- paste(sapply(test, rep, 3), colnames(DF)[2:4], sep = ".")
or:
colnames(DF)[-1] <- paste(rep(test, each=3), colnames(DF)[2:4], sep = ".")
thanks to @thelatemail
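To avoid hard-coding both the group size (3) and the 2:4 range, the same idea can be sketched as below. This is only a sketch under two assumptions: the first column is the ID, and the remaining columns split evenly across the entries of test. The names n_per_test and base_names are made up for illustration.

```r
test <- c("Test1", "Test2", "Test3")
DF <- data.frame(
  ID = c("A", "B", "C"),
  year = c(2001, 2002, 2003), car = c("acura", "benz", "lexus"), age = c(55, 16, 20),
  year.1 = c(2011, 2012, 2013), car.1 = c("honda", "gm", "bmw"), age.1 = c(43, 21, 34),
  year.2 = c(1961, 1962, 1963), car.2 = c("toyota", "porsche", "jeep"), age.2 = c(33, 56, 42)
)

# number of columns per test, derived from the data instead of typed in
n_per_test <- (ncol(DF) - 1) / length(test)

# strip a trailing ".<digits>" suffix, e.g. "year.1" -> "year"
base_names <- sub("\\.[0-9]+$", "", colnames(DF)[-1])

# prefix each block of n_per_test columns with the matching test name
colnames(DF)[-1] <- paste(rep(test, each = n_per_test), base_names, sep = ".")
colnames(DF)
```

If the columns do not divide evenly among the tests, n_per_test will not be a whole number and rep() will not line up with the columns, so it is worth checking that first.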
I'm looking for an easy way to make a table in R that shows each variable as a row and each variable category as a column. Each cell should display the frequency of that category, and the last column should hold the row sum. The point is to easily display the distribution of different variables that share the same categories. I have included a picture to show what I'm looking for.
I have managed to produce some code that achieves what I want, but it takes a lot of time to do this for each variable I want to include in the table.
mydata <- as.data.frame((table(mydat$var)))
mydata <- as.data.frame(t(mydata))
mydata <- lapply(mydata, as.numeric)
mydata <- as.data.frame(mydata)
mydata$sum <- (mydata$`category 1` + mydata$`category 2` + mydata$`category 3`)
mydata[-c(1), ]
The result looks like this:
To add more variables I imagine that I could use rbind(), but is there an easier way to achieve something similar?
Here is a reproducible example using the mtcars dataset.
data("mtcars")
tdata <- as.data.frame(table(mtcars$cyl))
tdata1 <- as.data.frame(t(tdata))
tdata2 <- lapply(tdata1, as.numeric)
tdata3 <- as.data.frame(tdata2)
tdata3$sum <- (tdata3$V1 + tdata3$V2 + tdata3$V3)
tdata3 <- tdata3[-c(1),]
tdata3
Assuming you have a data.frame where each variable has the same categories (as in your example):
df <- data.frame(Var1 = c(rep("Cat1", 30),
                          rep("Cat2", 10),
                          rep("Cat3", 20)),
                 Var2 = c(rep("Cat1", 10),
                          rep("Cat2", 20),
                          rep("Cat3", 30)),
                 Var3 = c(rep("Cat1", 5),
                          rep("Cat2", 25),
                          rep("Cat3", 30)))
You could use lapply() to apply the table() function to every column in your data.frame:
tab <- lapply(colnames(df), function(x) table(df[, x]))
As lapply() outputs a list, use do.call to bind them, and rowSums() to create the sum column:
tab <- data.frame(do.call(rbind, t(tab)))
tab$Sum <- rowSums(tab)
# add variable labels as rows
rownames(tab) <- colnames(df)
The output will look like this:
     Cat1 Cat2 Cat3 Sum
Var1   30   10   20  60
Var2   10   20   30  60
Var3    5   25   30  60
And, you could throw all this in a function:
my_tab_fun <- function(df) {
  tab <- lapply(colnames(df),
                function(x) table(df[, x]))
  tab <- data.frame(
    do.call(rbind, t(tab)))
  tab$Sum <- rowSums(tab)
  rownames(tab) <- colnames(df)
  return(tab)
}
my_tab_fun(df)
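The approach above relies on every variable containing every category. If one variable happens to lack a category, table() drops it and the rows no longer line up. A minimal sketch of one way around this, assuming the full set of categories can be collected from the data itself, is to count against a shared set of factor levels. The toy data and the names toy and all_levels are made up for illustration.

```r
# toy data: Var2 never takes the value "Cat2"
toy <- data.frame(
  Var1 = c("Cat1", "Cat1", "Cat1", "Cat2", "Cat2", "Cat3"),
  Var2 = c("Cat1", "Cat1", "Cat1", "Cat1", "Cat3", "Cat3"),
  stringsAsFactors = FALSE
)

# every category that occurs in any column
all_levels <- sort(unique(unlist(lapply(toy, as.character))))

# count each column against the same levels so missing categories become 0
tab <- t(sapply(toy, function(x) table(factor(x, levels = all_levels))))
tab <- data.frame(tab, Sum = rowSums(tab))
tab
```

Here Var2 gets an explicit 0 for Cat2 rather than a row with one column fewer.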
I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
All rows without an exact match on both code and gender should go into a separate new dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your two data.frames and use merge():
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join() from dplyr:
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
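If you would rather have the unmatched rows stacked in the original code/gender/eye layout (closer to random4 in the question), a base R sketch is below. It assumes that pasting code and gender together gives a unique key per participant. The run column and the names key1, key2 and unmatched are additions for illustration, not part of the question.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400),
                      stringsAsFactors = FALSE)
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450),
                      stringsAsFactors = FALSE)

# one key per row built from both matching columns
key1 <- paste(random1$code, random1$gender, sep = "|")
key2 <- paste(random2$code, random2$gender, sep = "|")

# keep rows whose key has no partner in the other run, and record the run
unmatched <- rbind(
  data.frame(run = 1, random1[!key1 %in% key2, ]),
  data.frame(run = 2, random2[!key2 %in% key1, ])
)
unmatched
```

Note that with the example data this keeps both DEF rows, because DEF has gender 2 in the first run and gender 1 in the second, so neither of them has a match.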
I am trying to re-order columns in R using a for loop since the column range needs to be dynamic. Does anyone know what is missing from my code?
Group <- c("A","B","C","D")
Attrib1 <- c("x","y","x","z")
Attrib2 <- c("q","w","u","i")
Day1A <- c(5,4,6,3)
Day2A <- c(6,5,7,4)
Day3A <- c(9,8,10,7)
Day1B <- c(4,3,5,2)
Day2B <- c(3,2,4,1)
Day3B <- c(2,1,3,0)
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day2A,Day3A,Day1B,Day2B,Day3B)
day_count <- 3
for(i in 4:ncol(df)) {
if (i == day_count+3) break
df[c(i,day_count+i)]
}
Here is my desired result:
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day1B,Day2A,Day2B,Day3A,Day3B)
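So, in theory you can just do sort(colnames(df)[4:ncol(df)]) to get that, but it gets tricky when you have, say, Day1A..Day10A..Day20A, because sort() compares the names as text and puts Day10A before Day2A.
Below is a quick workaround that extracts the day numbers and letters and orders by them: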
COLS = colnames(df)[4:ncol(df)]
day_no = as.numeric(gsub("[^0-9]","",COLS))
day_letter = gsub("Day[0-9]*","",COLS)
o = order(day_no,day_letter)
To get your final dataframe:
df[,c(colnames(df)[1:3],COLS[o])]
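To see why the numeric ordering matters, here is a small self-contained sketch on made-up column names that include a two-digit day. The vector cols is hypothetical and only stands in for colnames(df)[4:ncol(df)].

```r
# hypothetical column names with a two-digit day
cols <- c("Day1A", "Day2A", "Day10A", "Day1B", "Day2B", "Day10B")

# plain sort() compares text, so the Day10 columns do not end up last
sort(cols)

# extract the number and the letter, then order by number first, letter second
day_no <- as.numeric(gsub("[^0-9]", "", cols))
day_letter <- gsub("Day[0-9]*", "", cols)
ordered_cols <- cols[order(day_no, day_letter)]
ordered_cols
```

With these names, ordered_cols is Day1A, Day1B, Day2A, Day2B, Day10A, Day10B.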
An option with select
library(dplyr)
library(stringr)
df %>%
  select(Group, starts_with('Attrib'),
         # as.numeric() so that day 10 sorts after day 2
         names(.)[-(1:3)][order(as.numeric(str_remove_all(names(.)[-(1:3)], '\\D+')))])
I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops, but none seem to work. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
header=T)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
>View(df)
Sum
10 0.1745092
11-100 0.2926735
101-1000 0.4211533
1000+ 0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several files of the same kind. All of them have the same column names, so I know that all of them will use the freq column for the calculations. How do I make a table after doing the calculations on each file? I want to keep the names of the text files as the sample names, so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help, any explanation would be most welcome. Thank you.
How about something like this, your example is not reproducible so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case adjust this to repeat for the grouping you want, i.e replace the each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
gather(value = value, key = key, - id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
group_by(id, key) %>%
summarise(sigma = sum(value))
df_gathered_sum
You might have some issues with the ID column if your dfs are not equal length so this is only a partial answer. Can do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? May have sorted it with a couple of edits...
I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
sumfunction <- function(x) {
  wow <- read.table(x, header = TRUE)
  # Sum of top 10
  ten <- sum(wow$freq[1:10])
  # Sum of 11-100
  to100 <- sum(wow$freq[11:100])
  # Sum of 101-1000
  to1000 <- sum(wow$freq[101:1000])
  # Sum of 1001+
  rest <- sum(wow$freq[-c(1:1000)])
  # return the four sums as one vector
  c(ten, to100, to1000, rest)
}
library(data.table)
library(tools)
dir <- "C:/temp"
# full.names = TRUE so read.table() receives the full path; pattern is a regular expression, not a glob
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
alltogether <- lapply(filenames, sumfunction)
data <- as.data.frame(data.table::transpose(alltogether),
col.names =c("Top 10 ", "From 11 to 100", "From 101 to 1000", "From 1000 on "),
row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I changed them to column names and made the name of each text file a row name. The file_path_sans_ext(basename(filenames)) call keeps just the file name and removes the extension.
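As a possible variation, the four sums can also be computed by binning the row ranks with cut(), which avoids NA results when a file has fewer than 1000 rows. This is only a sketch: it assumes freq is already sorted in decreasing order, it uses simulated vectors in place of the text files, and rank_bins and samples are made-up names.

```r
# sum a frequency vector within rank bins 1-10, 11-100, 101-1000, 1001+
rank_bins <- function(freq) {
  bins <- cut(seq_along(freq),
              breaks = c(0, 10, 100, 1000, Inf),
              labels = c("Top 10", "11-100", "101-1000", "1001+"))
  # split() keeps empty bins, and sum() of an empty bin is 0
  vapply(split(freq, bins), sum, numeric(1))
}

# simulated stand-ins for the freq column of two files
samples <- list(sampleA = rep(0.001, 1200),
                sampleB = rep(0.01, 50))

# one row per sample, one column per bin
result <- t(sapply(samples, rank_bins))
result
```

With real files, each element of samples would be read.table(path, header = TRUE)$freq, and the list names would come from the file names.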
I hope this helps anyone that reads this! thank you again! I love this platform because just being part of this environment gets me thinking and always striving to better myself at R.
If anyone has any input, that would be great!!! <3
I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell of dat2$VAR falls within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (a dat2$VAR value falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
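exploded.dat <- adply(dat, 1, function(x){
  # split "chr15:35086890..35086919" into chromosome, start and end
  parts <- strsplit(x$Pos, ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  # one row per position in the range, keyed the same way as dat2$VAR
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})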
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
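A middle ground that avoids expanding every range is to parse the chromosome, start and end into columns, join on chromosome, and then filter on position. The sketch below uses base R only. The dat2 rows are made up so that one of them actually falls inside a range, and the column names chr, start, end and pos are assumptions added for illustration.

```r
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
                  NVAR = c("1", "2"),
                  stringsAsFactors = FALSE)
# hypothetical variants: the first lies inside the second range of dat
dat2 <- data.frame(VAR = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   FUNC = c("intergenic", "intergenic"),
                   stringsAsFactors = FALSE)

# split "chr:start..end" into three columns
dat$chr <- sub(":.*", "", dat$Pos)
dat$start <- as.numeric(sub(".*:([0-9]+)\\.\\..*", "\\1", dat$Pos))
dat$end <- as.numeric(sub(".*\\.\\.", "", dat$Pos))

# split "chr:pos" into two columns
dat2$chr <- sub(":.*", "", dat2$VAR)
dat2$pos <- as.numeric(sub(".*:", "", dat2$VAR))

# pair every range with every variant on the same chromosome, then keep the hits
both <- merge(dat, dat2, by = "chr")
hits <- both[both$pos >= both$start & both$pos <= both$end, ]
hits
```

The intermediate join still grows with the number of ranges times the number of variants per chromosome, so for very large inputs a dedicated interval join would be a better fit.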
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive, so a single-position range can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
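MERGE
# each TRUE in matches pairs a row of dat (row index) with a row of dat2 (column index)
idx <- which(matches, arr.ind = TRUE)
# bind the paired rows side by side; a dat2 row that falls in several ranges appears once per range
# note: only the positions are compared here, not the chromosome names
cbind(dat[idx[, "row"], ], dat2[idx[, "col"], ])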