EDIT
I am trying to name a column and rename all items within the column of a dataset:
dataSet <- read.csv(url) %>%
rename("newColumn1" = V1) %>%
mutate(newColumn1 = recode(newColumn1, "oldEntryX" = "newEntryX") %>%
select(dataSet, newColumn1)
And I get this error:
Error in recode(newColumn1, oldEntryX = "newEntryX" :
object 'newColumn1' not found
What am I missing?
The code runs correctly up through the rename function and displays the renamed column correctly, but soon as I include mutate it throws an error.
I have no problem sharing the real code but wanted to generalize it for the crowd.
source info was from https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
IN the mutate step, you don't need quotes for column names on the lhs of =. Also, there are couple of case mismatches
Assuming the dataset is read correctly, we can
df1 %>%
rename(newColumn1 = V1, newColumn2 = V2) %>%
mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX"),
newColumn2 = recode(newColumn2, oldEntryY = "newEntryY"))
Based on the OP's code there is no closing quote as well "newColumn1
data
set.seed(24)
df1 <- data.frame(V1 = sample(c("oldEntryX", "x", "y"), 10, replace = TRUE),
V2 = sample(c("oldEntryY", "x", "y"), 10, replace = TRUE), stringsAsFactors= FALSE)
you can do this with some simple codes of R programming:
How to read csv file
Syntax :- `read.csv("filename.csv")
by using this command 1st row will be used as header. To improve this fault one should write
data <- read.csv("datafile.csv", header=FALSE)
How to rename the header/Column name:
names(data) <- c("Column1", "Column2", "Column3")
Now your headers are replaced by Column1, Column2 and Column3
Now to change Column1 data you can follow steps
data$Column1 <- c(write down set of values with which you want to replace)
To see the output type
data
Related
I have a CSV file with 141 rows and several columns. I wanted my data to be ordered in ascending order by the first two columns i.e. 'label' and 'index'. Following is my code:
final_data <- read.csv("./features.csv",
header = FALSE,
col.names = c('label','index', 'nr_pix', 'rows_with_1', 'cols_with_1',
'rows_with_3p', 'cols_with_3p', 'aspect_ratio',
'neigh_1', 'no_neigh_above', 'no_neigh_below',
'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz',
'no_neigh_vert', 'connected_areas', 'eyes', 'custom'))
sorted_data_by_label <- final_data[order(label),]
sorted_data_by_index <- sorted_data_by_label[order(index),]
write.table(sorted_data_by_index, file = "./features.csv",
append = FALSE, sep = ',',
row.names = FALSE)
I chose to read from a CSV and use write.table because that was necessary for my code requirement to override the CSV with column names.
Now even when I added a , after order(label), and order(index), the code sorted data should still read other rows and columns right?
After running this code, I only get the first row out of 141 rows. Is there a way to fix this problem?
As #akrun has mentioned briefly, what you need to do is to change
sorted_data_by_label <- final_data[order(label),]
to
sorted_data_by_label <- final_data[order(final_data$label),]
and to change
sorted_data_by_index <- sorted_data_by_label[order(index),]
to
sorted_data_by_index <- sorted_data_by_label[order(sorted_data_by_label$index),]
This is because when you write label, R will try to find the index object in the global environment, not within the final_data data frame.
If you intended to use index that is a column of final_data, you need to use explicit final_data$index.
Other options
You can use with:
sorted_data_by_label <- with(final_data, final_data[order(label),])
sorted_data_by_index <- with(sorted_data_by_label, sorted_data_by_label[order(index),])
In dplyr you can use
sorted_data_by_label <- final_data %>% arrange(label)
sorted_data_by_index <- sorted_data_by_label %>% arrange(index)
I am trying to use the pipe function in dplyr and left_join to clean some meta data up. Setting up variables....
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
startRow = 3, fillMergedCells = TRUE) %>%
mutate(sample=Accession.Number)
dge$samples$sample=
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in the dge$samples$sample, which is a character class. It needs to become a data frame of one column named sample so I can merge mdat with it by left join in order to remove all the metadata I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15, I want to reduce it to the 19 samples I actually have data for, these are given in the dge$samples$sample list. I am trying to use the following code to first convert dge$samples$sample into a data frame with one column titled sample for joining the two and essentially removing all metadata that is not of interest to me. The code below has been my progress so far but I think I am failing to understand how pipe works.
test = data.frame(dge$samples$sample) %>%
colnames(.) = c("sample") %>%
left_join(
.,
mdat,
by = sample,
copy = FALSE,
suffix = c(".x", ".y"),
keep = FALSE,
na_matches = c("na", "never")
)
Why not just check if theyre in there and filter them:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and controll than a join and performance shouldn't be an issue.
I think your code can be reduced to
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
left_join(mdat, by = 'sample')
Or an inner join should work as well, using base R :
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
Using collapse
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
Hello I have 2 dataframes:
df1 looks like:
and the df2 looks like:
I have noticied that the df1 has point symbol (.), while df2 has "-". It is weird because both of them, if I open with a text editor or excel, they have "-".
What I need is to drop all the columns of df1 that match with a value of df2.
I have used this:
DataGenSample = df1[,!(names(df1) %in% df2)]
#DataGenSample <- df1[ , !(colnames(df1) %in% df2)]
but there is no change.
All the Data can be foun here. Whith the code that I have used.
# Data (df1):
DataGen <- read.table("data_CNA.txt",sep="\t", header=TRUE, check.names = FALSE)
# Samples (df2):
DeleteSample <- read.table("MuestrasEliminar.txt",sep="\t", header=TRUE, check.names = FALSE)
#Delete columns:
#DataGenSample = DataGen[,!(names(DataGen) %in% DeleteSample)]
DataGenSample <- DataGen[ , !(colnames(DataGen) %in% DeleteSample)]
The issue is - vs ..
When you read in the data, your read command probably has an argument like check.names that changes the names to make them "standard" R names - which means no punctuation other than _ and .. If you set check.names = FALSE the original names will be kept, and your code should work just fine.
Ok, I found out that you need to convert your df in vector first:
vecDeleteSample <- DeleteSample$SAMPLE_ID
And then you can drop the mathing columns of your vector/list:
DataGenSample <- DataGen[,!(names(DataGen) %in% vecDeleteSample)]
I am trying to highlight rows of an excel file based on a match from the columns in a separate excel file. Pretty much, I want to highlight a row in file1 if a cell in that row matches a cell in file2.
I saw the R package "conditionalFormatting" has some of this functionality, but I cannot figure out how to use it.
the pseudo-code i think would look something like this:
file1 <- read_excel("file1")
file2 <- read_excel("file2")
conditionalFormatting(file1, sheet = 1, cols = 1:end, rows = 1:22,
rule = "number in file1 is found in a specific column of file 2")
Please let me know if this makes sense or if i need to clarify something.
Thanks!
The conditionalFormatting() function embeds active conditional formatting into the excel document but is likely more complicated than you need for a one-time highlight. I'd suggest loading each file into a dataframe, determining which rows contain a matching cell, creating a highlight style (yellow background), loading the file as a workbook object, setting the appropriate rows to the highlight style, and saving the updated workbook object.
The following function is the used to determine which rows have a match. The magrittr package provides the %>% pipes and the data.table package provides the transpose() function.
find_matched_rows <- function(df1, df2) {
require(magrittr)
require(data.table)
# the dataframe object treats each column as a list making it much easier and
# faster to search via column than row. Transpose the original file1 dataframe
# to treat the rows as columns.
df1_transposed <- data.table::transpose(df1)
# assuming that the location of the match in the second file is irrelevant,
# unlist the file2 dataframe so that each value in file1 can be searched in a
# vector
df2_as_vector <- unlist(df2)
# determine which columns contain a match. If one or more matches are found,
# attribute the row as 'TRUE' in the output vector to be used to subset the
# row numbers
match_map <- lapply(df1_transposed,FUN = `%in%`, df2_as_vector) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
sapply(function(x) sum(x) > 0)
# make a vector of row numbers using the logical match_map vector to subset
matched_rows <- seq(1:nrow(df1))[match_map]
return(matched_rows)
}
The following code loads the data, finds the matched rows, applies the highlight, and saves over the original file1.xlsx. The second tst_df1 and tst_df2 provide for an easy way of testing the find_matched_rows() function. As expected, it finds that the 1st and 3rd rows of the first dataframe contain a cell that matches a cell in second dataframe.
# used to ensure that the correct rows are highlighted. the dataframe does not
# include the header as an independent row unlike excel.
file1_header_row <- 1
file2_header_row <- 1
tst_df1 <- openxlsx::read.xlsx("./file1.xlsx",
startRow = file1_header_row)
tst_df2 <- openxlsx::read.xlsx("./file2.xlsx",
startRow = file2_header_row)
#example data for testing
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
lname = c("Smith", "Johnson", "Samson"),
wage = c(10, 15.23, 137.38),
stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
b = c("Billy", "Bill", "Billy-Bob"),
c = c("Samson", "Johansson", NA),
stringsAsFactors = FALSE)
df_matched_rows <- find_matched_rows(tst_df1, tst_df2)
# any color found in colours() can be used here or hex color beginning with "#"
highlight_style <- openxlsx::createStyle(fgFill = "yellow")
file1_wb <- openxlsx::loadWorkbook(file = "./file1.xlsx")
openxlsx::addStyle(wb = file1_wb,
sheet = 1,
style = highlight_style,
rows = file1_header_row + df_matched_rows,
cols = 1:ncol(tst_df1),
stack = TRUE,
gridExpand = TRUE)
openxlsx::saveWorkbook(wb = file1_wb,
file = "./file1.xlsx",
overwrite = TRUE)
This is the code I have used in R via Spark Cluster, and error also given below
mydata<-spark_read_csv(spark_cluster,name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",header = F,delimiter = "|")
mydata %>% select(customer=V1,device_subscriber_id=V2,user_subscriber_id=V3,user_id=V4,location_id=V5)
Error in .f(.x[[i]], ...) : object 'V1' not found
The renaming convention goes the other way around (new name = old name)
You are looking for the following:
mydata %>%
select(V1 = customer,
V2 = device_subscriber_id,
V3 = user_subscriber_id,
V4 = user_id,
V5 = location_id)
If you want specific names just provide a vector of names on read:
columns <- c("customer", "device_subscriber_id",
"user_subscriber_id", "user_id", "location_id")
spark_read_csv(
spark_cluster, name = "rd_1",path = "IAF_Extracted_Data_Zipped.csv",
header = FALSE, columns = columns, delimiter = "|"
)
The number of columns should match the number of columns in the input.
Of the top of my head you could try customer = mydata$V1 and similar for the other variables (assuming V1,... are column names of mydata).