Searching for target in Excel spreadsheet using R - r

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below, essentially a long sheet of rows with values in three columns. What I need is for a way to scan the sheet looking for particular combinations of values in the first and second column - combinations that are specified in a second spreadsheet of targets (see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of column 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
for(i in targets){
if(SID_IA[[n]] %in% targets){
print(SID_IA[[n]], "in sentence" , data[n,1], "is ", data[n,3])
}else{
print(FALSE)
}
SID_IA[[n]] <- NULL
}
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.

OK.. I'm attempting an answer that should require minimum use of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly same in data and target, then it will be as simple as finding a merge
merge(target,data,all = F)
If, on the other hand , you want to keep the data table with a marking of target rows, then the easiest way will be to make an index column
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW , if yo are on a graphical interface you can also use File dialogue to open data files.. check out file.choose()

(Posted on behalf of the OP).
Following from #R.S.'s suggestion that didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines, and then how to remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets,data,all = F)
write.xlsx(extracted,output_file1)
combined <-rbind(data,extracted)
minus.target <- combined[!duplicated(combined,fromLast = FALSE)&!duplicated(combined,fromLast = TRUE),]
write.xls(minus.target,output_file2)

Related

select a value from several dataframes (file.csv) and merge them into a new dataframe with R

Maybe I'm asking for something too simple, but I can't solve this problem.
I want to create a script that recursively enters the folders present in a base_folder, opens a specific file with a name that is always the same (w3nu) and selects a precise value (I need to select the email of the subject belonging to the Response column, filtering for the corresponding heat in the Question.Key column).
I want my script to repeat itself in the same way for all the folders present in the base folder.
Finally, I want to merge all the emails into a new dataframe.
I have created this script but it does not work.
library(tidyverse)
base_folder <- "data/raw/exp_1_participants/sbj"
files <- list.files(base_folder, recursive = TRUE, full.names = TRUE)
demo_email <- files[str_detect(files, "w3nu")]
email_extraction <- function(demo_email){
demo_email <- read.csv(task,header = T)
demo_email <- demo_email %>%
filter(Question.Key == "respondent-email") %>%
select(Response)
}
email_list_jolly <- vector(mode = "list", length = length(demo_email))
for(i in 1:length(email_list_jolly)){
email_list_jolly[[i]] <- email_extraction(demo_email[i])
}
email_list_stud <- cbind(email_list_jolly)
write.csv(email_list_stud, 'data/cleaned/email_list_stud.csv')
can you help me? thanks
From comments:
Looks like you haven't defined task within the script shown above, but you're telling read.csv to find it. Did you mean to pass demo_email to read.csv instead? task is probably a random vector in your workspace.

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names, and use that one file to run read.capthist and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files of identical rows and columns, the only difference between them are the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like to have my outputs have unique identifiers as well, since I am simulating data and I will have to compare all the results, I don't want each iteration to overwrite the previous data output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!

Select multiple rows in multiple DFs with loop in R

I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, buit with only specific rows in them, via looping over all of them.The loop appears to work fine. However the selection of the rows does not seem to work. When I try selecting with simple squarebrackts, i get the error "incorrect number of dimensions". I tried it with subet(), but i dont seem to be able to set the subset correctly.
Here is what i have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
# Read in a csv file from the files vector
df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
# Add a column telling us the name of the csv file that the data came from
df$SourceFile = file
# Select only the rows we want
df = df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)

R for loop index issue

I am new to R and I am practicing to write R functions. I have 100 cvs separate
data files stored in my directory, and each is labeled by its id, e.g. "1" to "100.
I like to write a function that reads some selected files into R, calculates the
number of complete cases in each data file, and arrange the results into a data frame.
Below is the function that I wrote. First I read all files in "dat". Then, using
rbind function, I read the selected files I want into a data.frame. Lastly, I computed
the number of complete cases using sum(complete.cases()). This seems straightforward but
the function does not work. I suspect there is something wrong with the index but
have not figured out why. Searched through various topics but could not find a useful
answer. Many thanks!
`complete = function(directory,id) {
dat = list.files(directory, full.name=T)
dat.em = data.frame()
for (i in id) {
dat.ful= rbind(dat.em, read.csv(dat[i]))
obs = numeric()
obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i,]))
}
data.frame(ID = id, count = obs)
}
complete("envi",c(1,3,5)) `
get error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs ends up with only one value (the number of complete cases in the last file in dat).
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {
# Read each csv file in turn and store its name and number of complete cases
# in a list
obs.list = lapply(files, function(x) {
dat = read.csv(paste0(directory,"/", x))
data.frame(fileName=x, count=sum(complete.cases(dat)))
})
# Return a data frame with the number of complete cases for each file
return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern=".csv")
complete(getwd(), filesToRead)

concatenate a specific cell from a number of files into one single row in R

I have a folder with several files. From each file I would like to select a cell (third row, 5th column) and bind them into one single column. Here's what I've got so far:
fnames1 <- scan(file.choose(), what = "character", quiet = TRUE)
print(fnames1)
for (i in fnames1)
{
date.time <- read.table(paste("...",i, sep = ""), skip = 2, nrows = 1)
timecol <- paste(date.time[, 5])
time <- cbind(timecol)}
but I'm still just getting several individual cells instead of one row with all the cells in it.
Any help will be very much appreciated!
EDIT: Answers to MrFlick: when prompted, I choose a file that contains the names of all the files in the folder from which I want to extract the cell I need. This is where fnames1 comes from. Time is a variable I'm creating to try to concatenate all the cells together (which is obviously not working). Adding that pasteafter the read table is the only way I've managed to get the loop working... I'm very new to R and I've been working with trial and error.
I'm still not sure I completely understand, But what about
fnames1 <- scan(file.choose(), what = "character", quiet = TRUE)
print(fnames1)
time <- sapply(fnames, function(fn) {
date.time <- read.table(fn, skip = 2, nrows = 1)
date.time[, 5]
})
print(time)
Here we use sapply to loop over the file names rather than a for loop which also will automatically create a vector of the correct size for us using the date.time[3,5] values.

Resources