How to eliminate several rows from a data frame using R

I have two Excel files, A and B. Excel file A has 6 columns (named A-F) and 10,000 rows. Excel file B has one column (A) with 3,500 rows. Here is what I want to do:
I want to eliminate rows based on the cell values (ids) in column A of Excel file A and end up with a data frame without them. To elaborate: I want to check each id in column A of Excel file A against all ids in column A of Excel file B. If an id in file A matches any of the ids listed in file B, I want to eliminate the corresponding row from file A.
I was able to do this in Excel and want to do the same in R as a cross-check. I am new to R and still learning. Could someone suggest a good way to do this?
I know how to subset rows based on a column header and a particular cell value. In this case, however, I have 10,000 observations, of which I want to eliminate at least 3,500 by matching the ids.
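One common way to do this in base R is the `%in%` operator, negated. The following is a minimal sketch, assuming both sheets are already loaded as data frames. The names `file_a`, `file_b`, `kept`, and `n_removed`, and all toy values, are hypothetical; the id column is called `A` as in the question.

```r
# Toy stand-ins for the two Excel files (values are made up for illustration).
file_a <- data.frame(
  A = c(101, 102, 103, 104, 105),
  B = c("u", "v", "w", "x", "y"),
  stringsAsFactors = FALSE
)
file_b <- data.frame(A = c(102, 104, 999))

# Keep only the rows of file_a whose id does NOT appear in file_b.
kept <- file_a[!(file_a$A %in% file_b$A), ]

# Cross-check: count how many rows were dropped.
n_removed <- nrow(file_a) - nrow(kept)
print(kept)
print(n_removed)
```

With the real files, the two data frames could be read with `readxl::read_excel()` first. Comparing `n_removed` with the number of rows removed in Excel gives a quick sanity check.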

Related

Transform multiple entries in cell into new columns in R

I have a messy dataset with multiple entries in some cells. The numbers in parentheses, "(1)", "(2)", and "(3)", refer to specific columns. In this example of a cell with multiple entries, 30 refers to column (2) and 20 refers to column (1). There is no information for column (3).
I would like to split up/extract the values in the cells and create 3 additional columns.
Several hundred cells are affected in several columns.
Dataset
In the end I would like to have 3 new columns for each column affected. Any idea how I do that? I'm still a rookie so help is much appreciated!

Read / Import specific rows from large Excel files in R

I have dozens of very heavy Excel files that I need to import into R (then rebind). Each file has 2 sheets, where the second sheet (name: "Results") consists of 100K rows at least and has about 350 columns.
I would like to read a subset of the sheet "Results" from each file by columns, but most importantly, by specific rows. Each "ID" in the data, has a main row and then multiple rows below which contain data in specific columns. I would like to read the main row only (this leaves each file with 50-400 rows (depending on the file) and 150 variables). The first column that numbers main rows does not have a header.
This is what the data looks like (simplified):
I would like to import only the rows whose first column isn't empty but numbered (i.e., 1., 13., 34., 211.) and particular columns, in this example columns 2,3,5 (i.e., name, ID, status). The desired output would be:
Is there a simple way to do this?
Suppose a is the Excel sheet read into a data frame.
library(readxl)
a <- as.data.frame(read_excel("Pattern/File.xlsx", sheet = "Results"))
For instance, to select columns 1 to 3, use
subset(a[, 1:3], !is.na(a[[1]]))
This keeps only the rows of the input data frame whose first column is not NA.
Output:
...1 name ID
1 1 Dan us1d
4 13 Nev sa2e
6 34 Sam il5a
Note the first column name ("...1"). It is generated automatically by read_excel() because that column has no header, and it should not be a problem.
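The question asked for columns 2, 3, and 5 rather than 1 to 3. The following is a sketch of that variant with plain indexing, using a small made-up data frame in place of the imported sheet. The column names `name`, `ID`, and `status` follow the question; `idx`, `value`, `main_rows`, and the cell contents not shown in the output above are hypothetical.

```r
# Stand-in for the data frame returned by read_excel();
# sub-rows have NA in the first column.
a <- data.frame(
  idx    = c(1, NA, NA, 13, NA, 34),
  name   = c("Dan", NA, NA, "Nev", NA, "Sam"),
  ID     = c("us1d", NA, NA, "sa2e", NA, "il5a"),
  value  = c(10, 11, 12, 20, 21, 30),
  status = c("open", NA, NA, "closed", NA, "open"),
  stringsAsFactors = FALSE
)

# Rows whose first column is not NA, restricted to columns 2, 3 and 5.
main_rows <- a[!is.na(a[[1]]), c(2, 3, 5)]
print(main_rows)
```

Note that this still reads the whole sheet into memory before filtering, so it only narrows the result, not the import itself.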

Join columns from different files

How can I combine a specific column of one file with a column of another file in R?
I want to subtract 50 from the maximum of each column. I tried this, but it didn't work:
a <- 50-max(datafile1$X2018.03.06,datafile1$X2017.07.13)
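As written, `max()` with two arguments returns a single maximum over both columns combined, and `50 - max(...)` subtracts in the opposite direction. The following is a sketch of one reading of the goal (the maximum of each column, minus 50) on a made-up `datafile1`. The values and the helper name `cols` are hypothetical.

```r
# Toy data with the two column names from the question (values are made up).
datafile1 <- data.frame(
  X2018.03.06 = c(120, 95, 140),
  X2017.07.13 = c(80, 200, 60)
)

# Maximum of each column separately, then subtract 50 from each maximum.
cols <- c("X2018.03.06", "X2017.07.13")
a <- sapply(datafile1[cols], max, na.rm = TRUE) - 50
print(a)
```

The result is a named numeric vector with one entry per column.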

Extracting specific columns from different Excel sheets (but same Excel file) in R

I am struggling with merging a number of Excel sheets into one data frame (or tibble). I have an Excel file containing 21 sheets, and each sheet has the same 2 columns, which are identical across all 21 sheets. I don't know how to tackle this in the most efficient way. The number of rows is the same in every sheet.
After merging the sheets, I want to use dplyr::select to keep only specific columns.
I don't have enough experience to handle this myself, how would you tackle it?
Columns are our clients, and each of the observations is their energy demand per 15 min. So all the observations are of the same data type.
I have tried to make a tibble with the following names:
>colnames(test)
[1]"X871691600001976087"
[2]"X871691600001837791" etc
I have a list of column names that I want to extract:
>testnames
[1]"X871685900000003968"
[2]"X871685900000009600" etc
Some of those column names will have to match, but they don't. I get this error:
selecttest <- select(test, one_of(testnames))
"Warning message:
Unknown variables: `871685900000003968`, etc.. (variables from testnames)"
Is this enough information to give a hint?
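A quick way to see which requested names are actually present is to compare the two character vectors before selecting. The following is a base R sketch on a made-up stand-in for `test`. The third column, all values, and the names `present` and `absent` are hypothetical.

```r
# Stand-in for the merged data (values are made up).
test <- data.frame(
  X871691600001976087 = c(1.2, 1.4),
  X871691600001837791 = c(0.7, 0.9),
  X871685900000003968 = c(2.1, 2.0)
)
testnames <- c("X871685900000003968", "X871685900000009600")

# Names that can be selected, and names that have no matching column.
present <- intersect(testnames, colnames(test))
absent  <- setdiff(testnames, colnames(test))

# Select only the columns that exist; drop = FALSE keeps a data frame.
selecttest <- test[, present, drop = FALSE]
print(absent)
```

In the warning quoted above, the names are reported without the leading `X`, so it may be worth checking whether `testnames` really contains that prefix at the point where `select()` is called.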

How to extract specific rows depending on part of the strings in one column in R

In R, I am trying to extract the rows that contain a specific string in one column.
The data structure is as follows:
ERC1 20679 14959 9770 RAB6-interacting protein 2 isoform
I want to extract the rows which have RAB6 in the last column. That column still has some other words besides RAB6 so I can not use column = "RAB6" to get them. It's just like a search function in excel. Does anyone have any ideas?
Assuming that your data frame is df:
df[grep("^RAB6", df$column),]
If not all values start with RAB6, remove the ^.
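`grepl()` returns a logical vector, which can be easier to combine with other conditions than the indices returned by `grep()`. The following is a sketch on a toy data frame. The column names `gene` and `description` and the second and third rows are hypothetical.

```r
# Toy data; only the first row comes from the question.
df <- data.frame(
  gene = c("ERC1", "ABC1", "XYZ2"),
  description = c("RAB6-interacting protein 2 isoform",
                  "hypothetical protein",
                  "putative RAB6 effector"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "RAB6" as a literal string and matches it
# anywhere in the cell, not only at the start.
hits <- df[grepl("RAB6", df$description, fixed = TRUE), ]
print(hits$gene)
```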
