Data cleaning in Excel sheets using R - r

I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?

The car package has a recode function. See it's help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist){ nfac <- factor(fac);
inlevs <- levels(nfac);
othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
# levslist of the form :::: list(
# animal = c("cow", "pig"),
# bird = c("eagle", "pigeon") )
levels(nfac)<- c(levslist, all_others =othrlevs); nfac}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others

You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.

There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename A through G as just A
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above eliminating or group the unique names as appropriate.

Related

Writing For Loop or Split function to separate data from Master data frame into smaller data frames

I am once again asking for your help and guidance! Super duper novice here so I apologize in advance for not explaining things properly or my general lack of knowledge for something that feels like it should be easy to do.
I have sets of compounds in one "master" list that need to be separated into smaller list. I want to be able to do this with a "for loop" or some iterative function so I am not changing the numbers for each list. I want to separate the compounds based off of the column "Run.Number" (there are 21 Run.Numbers)
Step 1: Load the programs needed and open File containing "Master List"
# tMSMS List separation
#Load library packages
library(ggplot2)
library(reshape)
library(readr) #loading the csv's
library(dplyr) #data manipulation
library(magrittr) #forward pipe
library(openxlsx) #open excel sheets
library(Rcpp) #got this from an error code while trying to open excel sheets
#STEP 1: open file
S1_MasterList<- read.xlsx("/Users/owner/Documents/Research/Yurok/Bioassay/Bioassay Data/220410_tMSMS_neg_R.xlsx")
Step 2: Currently, to go through each list, I have to change the "i" value for each iteration. And I also must change the name manually (Ctrl+F), by replacing "S2_Export_1" with "S2_Export_2" and so on as I move from list to list. Also, when making the smaller list, there are a handful of columns containing data that need to be removed from the “Master List”. The specific format of column names are so it will be compatible with LC-MS software. This list is saved as a .csv file, again for compatibility with LC-MS software
#STEP 2: Iterative
#Replace: S2_Export_1
i=1
(S2_Separate<- S1_MasterList[which(S1_MasterList$Run.Number == i), ])
%>%
(S2_Export_1<-data.frame(S2_Separate$On,
S2_Separate$`Prec..m/z`,
S2_Separate$Z,
S2_Separate$`Ret..Time.(min)`,
S2_Separate$`Delta.Ret..Time.(min)`,
S2_Separate$Iso..Width,
S2_Separate$Collision.Energy))
(colnames(S2_Export_1)<-c("On", "Prec. m/z", "Z","Ret. Time (min)", "Delta Ret. Time (min)", "Iso. Width", "Collision Energy"))
(write.csv(S2_Export_1, "/Users/owner/Documents/Research/Yurok/Bioassay/Bioassay Data/Runs/220425_neg_S2_Export_1.csv", row.names = FALSE))
Results: The output should look like this image provided below, and for this one particular data frame called "Master List", there should be 21 smaller data frames. I also want the data frames to be named S2_Export_1, S2_Export_2, S2_Export_3, S2_Export_4, etc.
First, select only required columns (consider processing/renaming non-syntactic names first to avoid extra work downstream):
s1_sub <- select(S1_MasterList, Sample.Number, On, `Prec..m/z`, Z,
`Ret..Time.(min)`, `Delta.Ret..Time.(min)`,
Iso..Width, Collision.Energy)
Then split s1_sub into a list of dataframes with split()
s1_split <- split(s1_sub, s1_sub$Sample.Number)
Finally, name the resulting list of dataframes with setNames():
s1_split <- setNames(s1_split, paste0("S2_export_", seq_along(s1_split))

Re-structuring an odd data structure, adding repeat values

I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data is from historic newspapers from different locations in China, published between 1921 and 1937. The excel table is structured as follows:
There is a sheet for each location, 2. each sheet has a column for every year and the variables for each newspaper is organised in blocks of 7 rows and separated by a blank row. Here's a sample from one of the sheets:
,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊
Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging.
I want to create a new matrix for every year, then rotate the table to turn the 7 rows for each newspaper into columns so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and add another column that states the year, just in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel but the entire dataset contains data several from thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):
Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin
Is there a way of doing this in R? How will I go about it?
I've outlined a plan in a comment above. This is untested code that makes it more concrete. I'll keep testing till it works
inps <- readLines( ~/Documents/R_code/Tientsin unformatted.txt")
inp2 <- inps[ -(1:2) ]
# 'identify groupings with cumsum(grepl(",,,", inp2) as the second arg to split'
inp.df <- lapply( split(inp2, cumsum(grepl(",,,", inp2) , read.csv)
library(data.table) # only needed if you use rbindlist not needed for do.call(rbind , ..
# make a list of one-line dataframes as below
# finally run rbindlist or do.call(rbind, ...)
in.dt <- do.call( rbind, (inp.df)) # rbind checks for ordering of columns
This is the step that makes a one line dataframe from a set of text lines:
txt <- 'Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,'
temp=read.table(text=txt, sep="," , colClasses=c(rep("character", 2), NA, NA))
in1 <- setNames( data.frame(as.list(temp$V2)), temp$V1)
in1
#-------------------------
Title of Newspaper 1 (Language) (Ideology) (Owner) (Editior) (Publication Frequency) (Circulation)
1 遠東報(Yuan Dong Bao) 漢文 東支鉄道機関紙 (総経理)史秉臣 張福臣 日刊 1,000
(Others)
1 1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す
So it looks like the column names of the individually constructed items would need to further processing to make them capable of being successfully "bindable" by plyr::rbindlist or data.table::rbindlist

Conditional string replacement in r

I have a dataframe that contains a column with brokernames, which are handwritten by customers, that I would like to go through in order to replace the handwritten brokernames with unique brokernames that I have in a list.
A snippet of my data looks like this:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
My list of unique broker names looks like this:
Unique_brokers <- list("REALESTATE", "ESTATE")
Based on some sort of pattern-recognition, I would like to replace the brokernames in my Data dataframe with the unique brokernames in my Unique_brokers list.
I've partially managed to do this somewhat manually using a combination of case_when and str_detect from dplyr and stringr respectively.
Data <- Data %>%
mutate("UniqueBroker" = case_when(str_detect(Broker, regex("realestate", ignore_case=T))~"REALEASTE",
str_detect(Broker, regex("estate", ignore_case=T))~"ESTATE",
TRUE~"OTHER"))
However, this is fairly timeconsuming with >100 unique brokers and more than 12500 combinations of handwritten brokernames in ~80.000 records.
I was wondering whether it would be possible to make this replacement using mapply, I haven't been able to so far, however.
Many thanks in advance!
EDIT
Data$Broker consists of all kinds of combinations in terms of spelling, information included, etc.
E.g.
Data$Broker <- c("Real-estate", "Real estate", "Real estate department 788", "Michael / REAL Estate")
You can use the i flag for case-insensitive regex and use str_replace_all -
library(stringr)
Unique_brokers <- c("REALESTATE", "ESTATE")
Data$Unique_brokers <- str_replace_all(Data$Broker,
setNames(Unique_brokers, str_c('(?i)', Unique_brokers)))
A dplyr and stringrsolution:
Data %>%
mutate(Broker_unique = if_else(str_detect(Broker, "(?i)real(-|\\s)?estate"),
"REALESTATE",
"ESTATE"))
Date Broker Broker_unique
1 01-10-2020 Real-estate REALESTATE
2 01-10-2020 estate ESTATE
3 01-11-2020 Real estate department 788 REALESTATE
4 01-11-2020 Michael / REAL Estate REALESTATE
The pattern works like this:
(?i): make match case-insensitive
real: match literal real
-(-|\\s)?: match - OR (i.e., one whitespace) optionally
estate: match estate literally
Test data:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
Data$Broker <- c("Real-estate", "estate", "Real estate department 788", "Michael / REAL Estate")

Is it possible to analyze items' performance from different exams using the same questions but with a different order in each version?

We produced five exams2nops exams using the same groups of items, with randomized order. All of them were schoice items. As such, five different *.rds files were obtained, each of them will be used with the correspondent scanned exams. I noticed that in those *.rds files to be used in the nops_eval there is the information about the *.rmd which as used to produce that exams' question. E.g.:
However, after producing the the nops_eval.csv that information is lost.
I would like to merge all five nops_eval.csv files using the *.rmd information to match each question. Since the same question (e.g. exercise 22) can be genareted from the different *.rmd files. All the same 22 *.rmd files were used in all exams (all have the same 22 questions but with different orders.
I would like to obtain a data frame with the merged csv to allow me to conduct Item Response Theory and Rasch modeling analysis.
Yes you can merge the information from the CSV files by reordering them based on the file/name information from the RDS files. Below I illustrate how to do this using the check.* columns from the CSV files. These are typically closest to what I need for doing an IRT analysis.
First, you read the CSV and RDS from the first version of the exam:
eval1 <- read.csv2("/path/to/first/nops_eval.csv", dec = ".")
metainfo1 <- readRDS("/path/to/first/exam.rds")
Then, you only extract the check.* columns and use the exercise file names as column names.
eval1 <- eval1[, paste0("check.", 1:length(metainfo1[[1]]))]
names(eval1) <- sapply(metainfo1[[1]], function(x) x$metainfo$file)
I'm using $file here because it is always unique across exercises. If $name is also unique in your case and has the better labels, you can also use that instead.
Then you do the same for the second version of the exam:
eval2 <- read.csv2("/path/to/second/nops_eval.csv", dec = ".")
metainfo2 <- readRDS("/path/to/second/exam.rds")
eval2 <- eval2[, paste0("check.", 1:length(metainfo2[[1]]))]
names(eval2) <- sapply(metainfo2[[1]], function(x) x$metainfo$file)
If the same exercises have been used in the construction of the two version, the column names of eval1 and eval2 are the same, just in a different order. Then you can simply do
eval2 <- eval2[, names(eval1)]
to reorder the columns of eval2 to match those of eval1. Subsequently, you can do:
eval <- rbind(eval1, eval2)
If you have more than two versions of the exam, you just iterate the same code and rbind() everything together in the end.
Similar code can also be used if the exercises are just partially overlapping between the versions of the exercise. In that case I first construct a large enough NA matrix with the merged exercise file names and then insert the results:
n1 <- nrow(eval1)
n2 <- nrow(eval2)
nam <- unique(c(names(eval1), names(eval2)))
eval <- matrix(NA, nrow = n1 + n2, ncol = length(nam))
colnames(eval) <- nam
eval[1:n1, names(eval1)] <- as.matrix(eval1)
eval[(n1 + 1):(n1 + n2), names(eval2)] <- as.matrix(eval2)
Again you would need to iterate suitably to merge more than two versions.
In either case the resulting eval could then be processed further to become the IRT matrix for subsquent analysis.

Transform a matrix txt file in spectra data for ChemoSpec package

I want to use ChemoSpec with a mass spectra of about 60'000 datapoint.
I have them already in one txt file as a matrix (X + 90 samples = 91 columns; 60'000 rows).
How may I adapt this file as spectra data without exporting again each single file in csv format (which is quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the 91 columns - 60,000 rows file described, may be somewhat slow and tedious in R, but could be done with a standalone application, whether existing utility or some ad-hoc script.
An R-only solution would be to create a new method, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straight forward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare with the time it takes to run getManyCsv() and avoid having to create and manage the many files, hence overall be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with >read.csv("file.txt"). So you can use it to manually create a Spectra object. In the R console type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. Then the rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like >class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
many years passed and I am not sure if anybody is still interested in this topic. But I had the same problem and did a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
#Assumption:
# Data is stored as a numeric data.frame with column names presenting samples
# and row names including domain axis
dataframe2Spectra <- function(Spectrum_df,
freq = as.numeric(rownames(Spectrum_df)),
data = as.matrix(t(Spectrum_df)),
names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
colors = rainbow(dim(Spectrum_df)[2]),
sym = 1:dim(Spectrum_df)[2],
alt.sym = letters[1:dim(Spectrum_df)[2]],
unit = c("a.u.", "Domain"),
desc = "Some signal. Describe it with 'desc'"){
features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
Spectrum_chem <- vector("list", length(features))
names(Spectrum_chem) <- features
Spectrum_chem$freq <- freq
Spectrum_chem$data <- data
Spectrum_chem$names <- names
Spectrum_chem$groups <- groups
Spectrum_chem$colors <- colors
Spectrum_chem$sym <- sym
Spectrum_chem$alt.sym <- alt.sym
Spectrum_chem$unit <- unit
Spectrum_chem$desc <- desc
# important step
class(Spectrum_chem) <- "Spectra"
# some warnings
if (length(freq)!=dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
if (length(names)>dim(data)[1]) print("Too many names")
if (length(names)<dim(data)[1]) print("Too less names")
if (length(groups)>dim(data)[1]) print("Too many groups")
if (length(groups)<dim(data)[1]) print("Too less groups")
if (length(colors)>dim(data)[1]) print("Too many colors")
if (length(colors)<dim(data)[1]) print("Too less colors")
if (is.matrix(data)==F) print("'data' is not a matrix or it's not numeric")
return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)

Resources