Rename columns of a dataframe by adding a prefix in R

I want to rename all columns of my dataframe (except id and t) by adding the prefix "le_".
First I turn the data frame from wide to long format, and after specifying the columns (named 1 - 27) I want to rename them le_1 - le_27. Any suggestions on how to do this?
I tried with rename but I got stuck.
df_long_le <- df_wide_le %>%
  pivot_longer(cols = starts_with("le_"), names_to = c("t", ".value"),
               names_pattern = "le_(.*)_(.*)") %>%
  rename(df_long_le[3:29] = "le_*[1-27]")  # invalid syntax - this is where I got stuck
Thank you!
This is what the dataframe looks like: [screenshot in the original post]

To change all of the columns:
colnames(df_long_le) <- paste("le", colnames(df_long_le), sep = '_')
To change all but 1 and 2:
newcolnames <- paste("le", colnames(df_long_le)[-c(1,2)], sep = '_')
colnames(df_long_le) <- c(colnames(df_long_le)[1:2], newcolnames)
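If you prefer to stay in the pipeline from the question, a minimal sketch with dplyr::rename_with (available since dplyr 1.0.0) does the same thing, assuming id and t are the two columns to leave untouched:
library(dplyr)

# Prefix every column except id and t with "le_"
df_long_le <- df_long_le %>%
  rename_with(~ paste0("le_", .x), .cols = -c(id, t))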

Related

How to merge tables and format appropriately?

So I have the following in cityzone.txt:
"earth/city/somerset/forest/somerset-test.txt#53497",
"earth/city/nottingham/forest/nighthill.txt#53498",
"earth/city/bury/town/bishop-zone1.mp3#53695",
And the following in areasize.txt:
planet\mars\red\crater.txt;56,
pluto\distant\dwarfmoon.txt;181,
mars\hot\red\redmoon.txt;43,
earth\city\somerset\forest\somerset-test.txt;205,
earth\city\bury\town\bishop-zone1.mp3;499,
So what I need is for a new table to be created and written to an output file.
What should happen is - for each row in cityzone.txt, the title for that row should be looked up in areasize.txt. If the title exists, the areasize number from areasize.txt should be appended to the cityzone row like this:
"title#id#areasize",
With quotes and comma accordingly.
So for the cityzone.txt above, the output should be thus:
"earth/city/somerset/forest/somerset-test.txt#53497#205",
"earth/city/bury/town/bishop-zone1.mp3#53695#499",
And then it should be output to a file with quotes and commas as shown.
So only 2 of the 3 cityzone.txt rows are included in the results because only 2 of the 3 rows exist in areasize.txt.
My starter code for this is really a continuation from this question:
How do I merge partial data and format it in R?
So I will add the code for this to the code in that question.
Thank you.
You can do:
library(dplyr)
library(tidyr)

# Read the text files and keep only the 1st column of cityzone
cityzone <- read.table('cityzone.txt')[1]
areasize <- read.table('areasize.txt', sep = ';')

# Separate cityzone's column on "#", drop the trailing commas,
# unify areasize's path separators, and join on the title
cityzone %>%
  separate(V1, c('V1', 'V2'), sep = '#') %>%
  mutate(V2 = sub(',$', '', V2)) %>%
  inner_join(areasize %>%
               mutate(V1 = gsub('\\\\', '/', V1),
                      V2 = sub(',$', '', V2)),
             by = 'V1') -> result

# Combine output in the required format and write it
cat(sprintf('"%s#%s#%s",', result$V1, result$V2.x, result$V2.y),
    file = 'output.lua', sep = '\n')
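For completeness, a base-R sketch of the same idea, assuming the two files look exactly like the excerpts above (merge takes the place of the dplyr join):
# Base R sketch: normalise paths, split titles, then merge
cz <- read.table('cityzone.txt')
cz$V1 <- sub(',$', '', cz$V1)                  # drop trailing comma
parts <- do.call(rbind, strsplit(cz$V1, '#'))  # split into title and id
az <- read.table('areasize.txt', sep = ';')
az$V1 <- gsub('\\\\', '/', az$V1)              # backslashes to slashes
az$V2 <- sub(',$', '', az$V2)                  # drop trailing comma
m <- merge(data.frame(title = parts[, 1], id = parts[, 2]),
           data.frame(title = az$V1, size = az$V2), by = 'title')
writeLines(sprintf('"%s#%s#%s",', m$title, m$id, m$size), 'output.lua')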

Sort vector of strings by characters from behind the strings

I have a dataframe with a number of repeated column names, distinguished by a serial number. It looks something like this:
temp <- c("DTA_1", "DTA_2", "DTA_3", "OCI_1", "OCI_2", "OCI_3", "Time_1", "Time_2", "Time_3")
In the end, it should look like this:
temp <- c("DTA_1", "Time_1", "OCI_1", "DTA_2", "Time_2", "OCI_2", "DTA_3", "Time_3", "OCI_3")
I've started working on it and I came to this:
for(i in 1:length(temp)){
  paste(rev(strsplit(temp[i], "")[[1]]), collapse = "")
}
but then I realized I'd have to sort them after that and turn all the strings around again... It just seemed clumsy.
Is there a better, more elegant way to do it?
You can specify the custom order of the strings by converting them to a factor and specifying the order in the levels:
temp[order(as.numeric(gsub("\\D", "", temp)),
           factor(gsub("_\\d+", "", temp), levels = c("DTA", "Time", "OCI")))]
#[1] "DTA_1" "Time_1" "OCI_1" "DTA_2" "Time_2" "OCI_2" "DTA_3" "Time_3" "OCI_3"
An option is to read it into a two-column data.frame by specifying the delimiter as _, order the columns, and use that index for ordering the vector:
temp[do.call(order, transform(read.table(text = temp, header = FALSE,
        sep = "_"), V1 = factor(V1, levels = c("DTA", "Time", "OCI")))[2:1])]
#[1] "DTA_1" "Time_1" "OCI_1" "DTA_2" "Time_2" "OCI_2" "DTA_3" "Time_3" "OCI_3"
Or, as @d.b mentioned in the comments, instead of converting to factor, use match and order based on that index:
temp[with(read.table(text = temp, sep = "_"),
          order(V2, match(V1, c("DTA", "Time", "OCI"))))]
#[1] "DTA_1" "Time_1" "OCI_1" "DTA_2" "Time_2" "OCI_2" "DTA_3" "Time_3" "OCI_3"
Or an option in the tidyverse:
library(tidyverse)
library(forcats)

tibble(temp) %>%
  separate(temp, into = c('t1', 't2'), convert = TRUE) %>%
  arrange(t2, fct_relevel(t1, c('DTA', 'Time', 'OCI'))) %>%
  unite(temp, t1, t2, sep = "_") %>%
  pull(temp)
#[1] "DTA_1" "Time_1" "OCI_1" "DTA_2" "Time_2" "OCI_2" "DTA_3" "Time_3" "OCI_3"

How do I avoid 'NA' values when coercing a .tsv column into numeric via as.numeric?

I have a dataframe with several columns from a .tsv file and want to transform one of them into the numeric type for analysis. However, I keep getting the "NAs introduced by coercion" warning every time and do not know exactly why. There is some unnecessary info at the beginning of another column, which is pretty much the only formatting I did.
Originally, I thought the file might have added some extra tabs or spaces, which is why I tried to delete those via sub().
I should also mention that I get the NA warnings even when I do not replace the values and run the dataframe as is:
library(tidyverse)

data_2018 <- read_tsv('teina230.tsv')
data_1995 <- read_csv('OECD_1995.csv')

# get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
  select('na_item,sector,unit,geo', '2018Q1', '2018Q2', '2018Q3', '2018Q4') %>%
  rename(country = 'na_item,sector,unit,geo')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]

# remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
  mutate(country = gsub('\\GD,S13,PC_GDP,', '', country))
clean_data_2018 <- clean_data_2018 %>%
  mutate(
    '2018Q1' = as.numeric(sub("", "", '2018Q1', fixed = TRUE)),
    '2018Q2' = as.numeric(sub(" ", "", '2018Q2', fixed = TRUE)),
    '2018Q3' = as.numeric(sub(" ", "", '2018Q3', fixed = TRUE)),
    '2018Q4' = as.numeric(sub(" ", "", '2018Q4', fixed = TRUE))
  )
Is there another way to get around the problem and convert the column without replacing all the values with 'NA'?
Thanks guys :)
Thanks for the hint @divibisan!
Renaming the columns via rename() actually solved the problem. Here is the code which finally worked:
library(tidyverse)

data_2018 <- read_tsv('teina230.tsv')

# get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
  select('na_item,sector,unit,geo', '2018Q1', '2018Q2', '2018Q3', '2018Q4') %>%
  rename(country = 'na_item,sector,unit,geo',
         quarter_1 = '2018Q1',
         quarter_2 = '2018Q2',
         quarter_3 = '2018Q3',
         quarter_4 = '2018Q4')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]

# remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
  mutate(country = gsub('\\GD,S13,PC_GDP,', '', country))
clean_data_2018 <- clean_data_2018 %>%
  mutate(
    quarter_1 = as.numeric(quarter_1),
    quarter_2 = as.numeric(quarter_2),
    quarter_3 = as.numeric(quarter_3),
    quarter_4 = as.numeric(quarter_4)
  )
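For anyone hitting the same warning: inside mutate(), a quoted name such as '2018Q1' on the right-hand side is just a character string, so as.numeric(sub(" ", "", '2018Q1', fixed = TRUE)) coerces the literal text "2018Q1" and returns NA. Column names starting with a digit must be referenced with backticks instead; a minimal sketch:
# Backticks reference the column; quotes would create a string literal
clean_data_2018 <- clean_data_2018 %>%
  mutate(`2018Q1` = as.numeric(`2018Q1`))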

How to add column to multiple data frames based on information from another data frame

Apologies if this question is simple or has been answered elsewhere - I have looked, but as a newbie I can't seem to find what I need.
I have a data frame (Length) which contains a unique value that I need to add to different files.
View(Length)
  File_name                                                Transcript_length
1 sample15.fasta.out_alternative.out_contig.copynumber.csv 89229486
2 sample16.fasta.out_alternative.out_contig.copynumber.csv 70908644
3 sample2.fasta.out_alternative.out_contig.copynumber.csv 56017470
4 sample28.fasta.out_alternative.out_contig.copynumber.csv 94888762
5 sample30.fasta.out_alternative.out_contig.copynumber.csv 106260465
6 sample31.fasta.out_alternative.out_contig.copynumber.csv 91189772
I have then imported and begun to manipulate these copynumber.csv files, but I need to add a new column which contains the value corresponding to each file name.
Attempt 1:
# import copynumber data
import2 <- list.files(pattern = "*copynumber.csv", full.names = TRUE)
list2env(
  lapply(setNames(import2, make.names(gsub("$", "", import2))),
         read.csv, sep = ""),
  envir = .GlobalEnv)
CN_files <- lapply(import2, read.csv, sep = "")
names(CN_files) <- gsub("$", "", import2)

# then manipulate
for (f in 1:length(CN_files)) {
  names(CN_files[[f]]) <- c("Family", "Element", "Length", "Fragments",
                            "Copies", "Solo_LTR", "Total_Bp", "Cover")
}
How do I then add the transcript length values as a new column, matched to the specific copynumber.csv file, using the earlier data frame?
Any help greatly appreciated. Again, I am new to this, so feel free to give more general advice on how to word an R question etc.
I have worked out how to do it outside of the loop, like so:
CN_files[[1]] <- CN_files[[1]] %>% mutate(bp = Length$Transcript_length[1])
CN_files[[2]] <- (CN_files[[2]] %>% mutate(bp = Length$Transcript_length[2]))
CN_files[[3]] <- (CN_files[[3]] %>% mutate(bp = Length$Transcript_length[3]))
CN_files[[4]] <- (CN_files[[4]] %>% mutate(bp = Length$Transcript_length[4]))
CN_files[[5]] <- (CN_files[[5]] %>% mutate(bp = Length$Transcript_length[5]))
CN_files[[6]] <- (CN_files[[6]] %>% mutate(bp = Length$Transcript_length[6]))
CN_files[[7]] <- (CN_files[[7]] %>% mutate(bp = Length$Transcript_length[7]))
CN_files[[8]] <- (CN_files[[8]] %>% mutate(bp = Length$Transcript_length[8]))
CN_files[[9]] <- (CN_files[[9]] %>% mutate(bp = Length$Transcript_length[9]))
Nevertheless, this seems quite awkward and inefficient, so again, if anyone has any tips on how to approach this better, they will be greatly appreciated!
Note: it was known that the order of the files within the list was the same as in the 'Length' data frame.
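One tidier pattern, sketched under the assumption that names(CN_files) holds the file paths as read above (basename() strips the "./" prefix that full.names = TRUE adds), is to look each file name up in Length so the list order no longer matters:
library(dplyr)

# Match each element's file name against Length$File_name,
# then add the corresponding transcript length as a bp column
CN_files <- Map(function(df, nm) {
  i <- match(basename(nm), Length$File_name)
  mutate(df, bp = Length$Transcript_length[i])
}, CN_files, names(CN_files))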

`gather` can't handle rownames

allcsvs = list.files(pattern = "*.csv$", recursive = TRUE)
library(tidyverse)

## LOOP to redact the snow data csvs ##
for(x in 1:length(allcsvs)) {
  df = read.csv(allcsvs[x], check.names = FALSE)
  newdf = df %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      DATE = as.Date(DATE, format = "%m/%d/%Y"),
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE)
  #### TURN DATES UNAMBIGUOUS HERE ####
  df$DATE = lubridate::mdy(df$DATE)
  finaldf = merge(newdf, df, all.y = TRUE)
  write.csv(finaldf, allcsvs[x])
  df = read.csv(allcsvs[x])
  newdf = df[, -grep("X20", colnames(df))]
  write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row by row using values from different existing columns, with date as the selection criterion. If I manually open each .csv in Excel and delete the first column, this code works great. However, if I run it on the .csvs as is, I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames within the parentheses of gather, and I've tried putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I tried reading the csv without the first column [,-1] or deleting the first column in R with df[,1] <- NULL, but for some reason when I do that my code returns an empty table instead of what I want. In other words, I can delete the rownames in Excel and it works great; if I delete them in R, something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
You can consider importing them with readr::read_csv.
An easy solution with the tidyverse:
allcsvs %>%
  map(read_csv) %>%
  reduce(bind_rows) %>%
  gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
  mutate(
    DATE = as.Date(DATE, format = "%m/%d/%Y"),
    COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
  ) %>%
  filter(DATE == COL_DATE) %>%
  select(-COL_DATE)
With utils::read.csv, strings are imported as factors, and as.Date(DATE, format = "%m/%d/%Y") then evaluates to NA.
Update
The solution above returns one single dataframe. To write each data file separately, use the for loop:
for(x in 1:length(allcsvs)) {
  read_csv(allcsvs[x]) %>%
    gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
    mutate(
      COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
    ) %>%
    filter(DATE == COL_DATE) %>%
    select(-COL_DATE) %>%
    write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr::map and purrr::reduce can be used instead of a for loop in some cases. Those functions take other functions as arguments.
readr::read_csv is typically around 10x faster than the base R equivalents (more info: http://r4ds.had.co.nz/data-import.html). It also avoids the string-to-factor conversion that tripped up the original code.
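As a side note, the map(read_csv) %>% reduce(bind_rows) step shown earlier can be written more compactly with purrr::map_dfr, which maps over the files and row-binds the results in one call:
library(tidyverse)

# Equivalent to: allcsvs %>% map(read_csv) %>% reduce(bind_rows)
all_data <- map_dfr(allcsvs, read_csv)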
