Making Calculations on Several Textfiles and making a Dataframe from it R


I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops, but none seem to work. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
header=T)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
> View(df)
                Sum
10        0.1745092
11-100    0.2926735
101-1000  0.4211533
1000+     0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several files with the same layout. All of them have the same column names, so every file will use its freq column for the calculations. How do I make a table after doing the calculations on each file? I want to keep the names of the text files as the sample names so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help or explanation would be most welcome. Thank you.

How about something like this? Your example is not reproducible, so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case adjust this to repeat for the grouping you want, i.e replace the each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
gather(value = value, key = key, - id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
group_by(id, key) %>%
summarise(sigma = sum(value))
df_gathered_sum
You might have some issues with the id column if your data frames are not of equal length, so this is only a partial answer. I can do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? I may have sorted it with a couple of edits...

I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
sumfunction <- function(x) {
wow <- read.table(x, header = TRUE)
#Sum of top 10
ten <- sum(wow$freq[1:10])
#Sum of 11-100
to100 <- sum(wow$freq[11:100])
#Sum of 101-1000
to1000 <- sum(wow$freq[101:1000])
#sum of 1001+
rest <- sum(wow$freq[-c(1:1000)])
#return the four sums explicitly
c(ten,to100,to1000,rest)
}
library(data.table)
library(tools)
dir <- "C:/temp/"
# full.names = TRUE so read.table() finds the files whatever the working directory is;
# pattern is a regular expression, so match the extension with "\\.txt$"
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
alltogether <- lapply(filenames, sumfunction)
data <- as.data.frame(data.table::transpose(alltogether),
col.names = c("Top 10", "From 11 to 100", "From 101 to 1000", "From 1001 on"),
row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I made them the column names and made the name of each text file a row name. The file_path_sans_ext(basename(filenames)) call keeps just the file name and removes the extension.
I hope this helps anyone who reads this! Thank you again! I love this platform because just being part of this environment gets me thinking and always striving to better myself at R.
If anyone has any input, that would be great!!! <3
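For anyone adapting this, here is a minimal sketch of the same idea that bins rows by rank instead of hard-coding index ranges, so a file with fewer than 1000 rows should give 0 for the empty bins rather than NA. The helper name sum_by_rank, the temporary directory, and the two dummy sample files are assumptions made up for illustration; only the freq column comes from the question.

```r
# Hypothetical helper: sum the freq column within rank bins 1-10, 11-100, 101-1000, 1001+
sum_by_rank <- function(path) {
  freq <- read.table(path, header = TRUE)$freq
  bins <- cut(seq_along(freq),
              breaks = c(0, 10, 100, 1000, Inf),
              labels = c("Top 10", "11-100", "101-1000", "1001+"))
  # split() keeps empty bins, and sum() of an empty vector is 0
  vapply(split(freq, bins), sum, numeric(1))
}

# Dummy input (assumption): two files of 1200 equal frequencies in a temp folder
demo_dir <- tempfile("flare_demo")
dir.create(demo_dir)
for (s in c("sampleA", "sampleB")) {
  write.table(data.frame(freq = rep(1 / 1200, 1200)),
              file.path(demo_dir, paste0(s, ".txt")), row.names = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
# One row per file, one column per bin
result <- t(sapply(files, sum_by_rank))
rownames(result) <- tools::file_path_sans_ext(basename(files))
result
```

With one row per sample, t(result) can be passed to barplot() to get one stacked bar per file.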

Related

How to rename and add "_cntr" to the middle of the columns of a dataframe within a function

I am new to programming in R, and I have had to write a function that takes a dataframe and returns that same dataframe with twice as many columns as the original. In the new columns, each value has to be the original value minus the mean (the mean is row 51 of the dataframe). I have written the function and it works; the only thing left is to rename columns 9:16 of the dataframe, which must have the same names as the original columns with "_cntr" appended.
I had thought of adding the _cntr with the paste function, but it does not work for me, or I am not using it correctly. I tried something like this:
nom = paste("cntr",sep = '_')
colnames(state.df3) = nom
and I put this inside the function that I will share next, but it changes the name of the first column to cntr and leaves the rest of the columns with the name NA.
If I do that:
nom = paste("cntr",9:16,sep = '_')
colnames(state.df3) = nom
It returns cntr1, cntr2, cntr3 ... and I don't want that. I want it to return "Population_cntr", "Income_cntr", "Illiteracy_cntr" ... for columns 9 to 16 (since that is where the duplicates start).
The dataframe that I am using as a test can be accessed here:
state.df = as.data.frame(state.x77)
And this is the function that I have written so far; I only need to modify the names of columns 9:16.
mi_funcion <- function(df) {
row_medias <- tail(df, 1)
row_resto <- head(df, -1)
tmp <- rbind(row_resto - as.list(row_medias), row_medias)
resultado = cbind(df, tmp)
return(resultado)
}
If someone could give me a hand and tell me where I am failing I would be very grateful.

This is just an example, so for yours replace 1:2 with 9:16.
df <- data.frame(Population = c(10),Income = c(20000),Illiteracy = ("Y"))
df
  Population Income Illiteracy
1         10  20000          Y
colnames(df)[1:2] <- paste(colnames(df)[1:2],"cntr", sep = "_")
Output:
df
  Population_cntr Income_cntr Illiteracy
1              10       20000          Y

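We can also use dplyr's rename_with. As written, .cols = everything() renames every column; narrow .cols to the columns you want to rename: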
library(dplyr)
state.df3 %>% rename_with(~paste(.x, "cntr",sep = '_'), .cols = everything())
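The rename can also be done inside the asker's function before the two halves are bound together, which avoids having to address columns 9:16 by position. This is only a sketch: the function body is the one from the question plus one names() line, and appending the column means as row 51 is an assumption about how the test dataframe was prepared.

```r
# Test data: state.x77 plus a final row holding the column means (assumed setup)
state.df <- as.data.frame(state.x77)
state.df <- rbind(state.df, Mean = colMeans(state.df))

mi_funcion <- function(df) {
  row_medias <- tail(df, 1)    # last row holds the means
  row_resto <- head(df, -1)    # all other rows
  tmp <- rbind(row_resto - as.list(row_medias), row_medias)
  # rename the centred copy before binding, so no positions are hard-coded
  names(tmp) <- paste(names(df), "cntr", sep = "_")
  cbind(df, tmp)
}

state.df3 <- mi_funcion(state.df)
names(state.df3)[9:16]
```

Because the new names are built from names(df), the same function should work for a dataframe with any number of columns.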

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,]
Thank you in advance for your help!

I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll need to adapt this to your exact column names and comparison condition; comb2 and comb4 are placeholders for the two-letter and four-letter combinations.
Key ideas:
add a filler column so we get a complete cross-join
mutate a new indicator column for whether the two-letter pair is within the four-letter combination
sum the indicators for each two-letter pair
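The indicator-and-sum idea can also be sketched in base R without a join. The block below is an illustration under assumptions, not the code from this answer: it uses combn() as a stand-in for gtools::combinations(), and the object names four, two and pair_counts are made up.

```r
v1 <- LETTERS[1:10]
# All 4-letter and 2-letter combinations, one combination per row
four <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
four$group <- rep(1:15, each = 14)
two <- as.data.frame(t(combn(v1, 2)), stringsAsFactors = FALSE)

four_mat <- as.matrix(four[, 1:4])

# For each pair: flag the 4-letter rows that contain both letters,
# then count the flagged rows within each group
pair_counts <- t(apply(two, 1, function(p) {
  has_pair <- rowSums(four_mat == p[1]) > 0 & rowSums(four_mat == p[2]) > 0
  tapply(has_pair, four$group, sum)
}))
rownames(pair_counts) <- paste0(two$V1, two$V2)

head(pair_counts)
```

The result has one row per letter pair and one column per group, which is the frequency table the question asks for.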

Create score based on word occurrences

I have two data frames with columns of words and associated scores for these words. I want to run comments through these frames and create an additive score based on if the words appear in the sentences.
I want to do this across many, many comments so it needs to be computationally efficient. So for example, the sentence "hi, he said. why is it okay" will get a score of .98 + .1 + .2 because the words "hi", "why", and "okay" are in data frame a. Any sentence could potentially have words from several data frames as well.
Can anyone help me create the column "add_score" with a procedure that scales well to large data frames? Thank you
a <- data.frame(words = c("hi","no","okay","why"),score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
comment_df$add_score = c(1.28,1.1,.5)

This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate score for one string
string_cal <- function(string, score_df){
temp <- score_df %>%
# Count how often each word occurs in the string
mutate(Number = str_count(string, pattern = fixed(words))) %>%
# Calculate the score
mutate(Total_Score = score * Number)
# Return the sum
return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
Data Preparation
a <- data.frame(words = c("hi","no","okay","why"),
score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
comments = c("hi, he said. why is it okay",
"okay okay okay no",
"yes, here is it"))
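Note that str_count() with fixed() counts substrings, so a word such as "no" would also be counted inside "know". If whole-word matching is wanted, one possible variant is to split each comment into words and look the words up in a named score vector. This is a base R sketch under assumptions: the tokenising rule (lower-case, split on anything that is not a letter) and the names lookup and tokens are illustrative choices, not part of the answer above.

```r
a <- data.frame(words = c("hi", "no", "okay", "why"),
                score = c(.98, .5, .2, .1), stringsAsFactors = FALSE)
b <- data.frame(words = c("bye", "yes", "here"),
                score = c(.5, .3, .2), stringsAsFactors = FALSE)
comment_df <- data.frame(id = c("1", "2", "3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"),
                         stringsAsFactors = FALSE)

# Named vector used as a lookup table: word -> score
score_df <- rbind(a, b)
lookup <- setNames(score_df$score, score_df$words)

# Split each comment into lower-case words (assumed tokenising rule)
tokens <- strsplit(tolower(comment_df$comments), "[^a-z]+")

# Words without a score index to NA and are dropped by na.rm = TRUE
comment_df$add_score <- vapply(tokens,
                               function(w) sum(lookup[w], na.rm = TRUE),
                               numeric(1))
comment_df
```

Repeated words are counted once per occurrence, so "okay okay okay no" scores 3 * 0.2 + 0.5.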

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?

Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
# seq() needs numeric endpoints; strsplit() returns character strings
start <- as.numeric(range[1])
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.

Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# the range is inclusive at both ends
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# locate each (dat row, dat2 row) pair that matched
idx <- which(matches, arr.ind = TRUE)
# bind the matching rows side by side; note that the chromosome is not compared here
cbind(dat[idx[, "row"], ], dat2[idx[, "col"], ])
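A variant that also checks the chromosome can be sketched as follows: parse both position columns into numeric parts, merge on chromosome, and keep only the rows where the variant falls inside the range. This is an illustration under assumptions. The sample rows are made up so that one match exists (the data in the question has none), and the added columns chr, start, end and pos are hypothetical names.

```r
# Made-up sample data with one variant inside one range
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
                  stringsAsFactors = FALSE)
dat2 <- data.frame(VAR = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   stringsAsFactors = FALSE)

# "chr15:35086890..35086919" -> chromosome, start, end
pos <- do.call(rbind, strsplit(dat$Pos, ":|\\.\\."))
dat$chr <- pos[, 1]
dat$start <- as.numeric(pos[, 2])
dat$end <- as.numeric(pos[, 3])

# "chr15:35086900" -> chromosome, position
var <- do.call(rbind, strsplit(dat2$VAR, ":", fixed = TRUE))
dat2$chr <- var[, 1]
dat2$pos <- as.numeric(var[, 2])

# All pairs on the same chromosome, then keep positions inside the range
candidates <- merge(dat, dat2, by = "chr")
merged <- candidates[candidates$pos >= candidates$start &
                       candidates$pos <= candidates$end, ]
merged
```

A variant that falls in several ranges appears once per range, which matches the "merged each time" requirement. The intermediate merge grows with the number of same-chromosome pairs, so it may not suit very large tables.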

R loops: Adding a column to a table if does not already exist

I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. Following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
data <- get(dat[i])
melt.data <- melt(data, id = 1)
assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column into dat2? I would like to get the same column name ("Density_3" in this case) and fill it up with zeros, if it does not already exist. Assume that I have ~100 tables with number of columns (Density_1, 2, 3 etc) varying between 5 and 6.
I tried following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
dat.all$Density_3 <- 0
} else {
dat.all$Density_3 <- dat.all$Density3}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.

After staring at this question for a while I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill
I would have constructed "dat", not as a character vector but as a list of two dataframes, used aggregate( ..., FUN=mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill that is ) and then do.call(rbind.fill, ...) on the resulting list. At any rate this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
  value Density_1 Density_2 Density_3
1 (all)  5.006709  4.088988  2.958971
2 (all)  4.178586  3.812362        NA
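The list-based approach described above can be sketched roughly like this. To keep the example free of package dependencies it uses colMeans() and a manual column alignment as a base R stand-in for aggregate() and plyr::rbind.fill, so treat it as an assumption-laden illustration; the names tables, means and all_cols are made up.

```r
set.seed(1)
dat1 <- data.frame(Specimen = paste0("sp", 1:10),
                   Density_1 = rnorm(10, 4, 2),
                   Density_2 = rnorm(10, 4, 2),
                   Density_3 = rnorm(10, 4, 2))
dat2 <- data.frame(Specimen = paste0("fg", 1:10),
                   Density_1 = rnorm(10, 4, 2),
                   Density_2 = rnorm(10, 4, 2))

# Keep the tables in a named list instead of using get() and assign()
tables <- list(dat1 = dat1, dat2 = dat2)

# Column means of every table, skipping the Specimen column
means <- lapply(tables, function(d) colMeans(d[, -1, drop = FALSE]))

# Align on the union of column names; missing columns become NA
all_cols <- unique(unlist(lapply(means, names)))
result <- do.call(rbind, lapply(means, function(m) m[all_cols]))
colnames(result) <- all_cols
result
```

The missing Density_3 for dat2 is left as NA rather than 0, in line with the point above about not replacing missing values with zeros.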
