List rows with specific columns with a specific value in R from dataset

I have a dataset called "flights" and I am attempting to list all the rows that have the value of "Escanaba, Michigan" in the column Destination. I would like to show 5 columns and then all the rows that apply to Escanaba.
Currently I have...
flights[,c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
That works perfectly for what I want, except it shows all rows.
How do I call out a specific value from a column in a dataset?

This is a pretty basic indexing question (see e.g. here, which was the first hit when I googled "R indexing"); you need to construct a logical vector that is TRUE for the relevant rows.
flights[flights$Destination=="Escanaba, Michigan",
c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
A prettier alternative for interactive work (not entirely safe for programmatic use):
subset(flights,Destination=="Escanaba, Michigan",
select=c(FlightDate,Carrier,
Destination,DestCityName,AirTime))
If you want to allow for more than one possible value of Destination, try %in%.
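A minimal sketch of the %in% variant, assuming a toy stand-in for flights (the rows below are made up for illustration, and wanted and cols are hypothetical helper names):

```r
# Toy stand-in for the question's `flights` data; all values are invented.
flights <- data.frame(
  FlightDate   = c("2015-01-01", "2015-01-02", "2015-01-03"),
  Carrier      = c("DL", "OO", "DL"),
  Destination  = c("Escanaba, Michigan", "Detroit, Michigan", "Marquette, Michigan"),
  DestCityName = c("Escanaba", "Detroit", "Marquette"),
  AirTime      = c(45, 60, 50),
  stringsAsFactors = FALSE
)

# Destinations to keep and columns to show (hypothetical helper vectors).
wanted <- c("Escanaba, Michigan", "Marquette, Michigan")
cols   <- c("FlightDate", "Carrier", "Destination", "DestCityName", "AirTime")

# %in% yields TRUE/FALSE for every row, so it can index rows directly.
result <- flights[flights$Destination %in% wanted, cols]
print(result)
```

One practical difference from ==: %in% never returns NA, so rows with a missing Destination are simply dropped rather than showing up as all-NA rows.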

Related

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Category's associated with its specific GO_IDs -- bonus points if it removes duplicate hits on the GO_Categorys.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay, however, it does return an error along the lines of
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!
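One possible direction, offered as a sketch rather than a tested solution for the real data: str_detect() pairs each string with one pattern element-wise, which is why two columns of different lengths trigger the length complaint. Collapsing the reference IDs into a single alternation pattern sidesteps that. The toy dfA and dfB below are invented; only the column names GO_IDs and GO_Category come from the question, and GeneID is an assumed name.

```r
# Invented toy data; the real dfA/dfB were only shown as images.
dfA <- data.frame(
  GeneID = c("g1", "g2", "g3"),
  GO_IDs = c("GO:0000001; GO:0000002", "GO:0000003", "GO:0000004"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  GO_IDs      = c("GO:0000001", "GO:0000003"),
  GO_Category = c("CatA", "CatA"),
  stringsAsFactors = FALSE
)

# Build one regular expression that matches any of the reference IDs.
pattern <- paste(dfB$GO_IDs, collapse = "|")

# grepl() checks every row of dfA against that single pattern.
dfA$GO_Cat_1 <- ifelse(grepl(pattern, dfA$GO_IDs), "Yes", "No")
print(dfA)
```

This assumes the IDs contain no regex metacharacters and are fixed-width, so one ID cannot match as a prefix of another. With a very long reference list, a single huge pattern may be slow.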

Unfiltering in R to mutate a variable in a data frame

I have a data set of open-answer responses in a survey. I have a set of topics that I want to assign to these open answer responses according to the content of the people's responses to a given question.
I want to specify keywords and search for a match to these keywords (/strings) for a specific variable (in this case, the variable is a question asked). Importantly, these strings can only be a partial match to the responses, I'm just specifying keywords that will appear at some point in the string. For example, filtering the responses by those that use the word 'money' at some point in their response to a given question. Given the matched rows, I want to assign a "1" to those rows in a new variable, let's call it "Topic1".
I have this code:
df %>%
filter(str_detect(var, "money|cash|revenue|profit")) %>%
mutate(Topic1 = "1")
This code works fine to filter and create the new variable according to the conditions I specify. However, if I assign back to the dataframe, I have now subsetted the entire dataframe according to my filter. I would like to filter the rows by my provided keywords, create the new variable, then 'unfilter' the dataframe, and assign the result back to the df.
I'm wondering if there is an 'unfilter' that I can use to reselect all my data. OR perhaps there is a better solution with ifelse?
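A hedged sketch of the ifelse route the question hints at: no filtering happens, so there is nothing to "unfilter". The toy df is made up, and the choice of "0" for non-matching rows is an assumption (NA would work just as well).

```r
# Invented survey responses; `var` stands in for the question column.
df <- data.frame(
  var = c("we need more money", "the weather was nice", "profit fell", NA),
  stringsAsFactors = FALSE
)

# Flag matching rows in place; every row of df is kept.
# grepl() returns FALSE for NA input, so missing responses get "0".
df$Topic1 <- ifelse(grepl("money|cash|revenue|profit", df$var), "1", "0")
print(df)
```

The same ifelse() condition should also work inside mutate() in place of the filter() step, using str_detect(var, ...) as the test.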

how to view head of as.data.frame in R? [closed]

I have a huge data set with 20 columns and 20,000 rows. According to the manual of a program I use, we have to put the data into a data frame, though I'm not sure I understand what it does, and I can't seem to view the head of the data frame I created.
I wrote in Bold the part that I don't understand, I'm very new with R, can a kind mind explain to me how the following works?
First I read the CSV file
vData = read.csv("my_matrix.csv");
1) Here we create the data frame as per the manual, what does -c(1:8) do exactly??
dataExpr0 = as.data.frame(t(vData[, -c(1:8)]))
2) Here, to understand what the above part does, I tried to view only the head of the data frame with the following line, but it displays the first 2 columns for the 20,000 rows of data. Is there a way to view only the first 2 rows?
head(dataExpr0, n = 2)
Let's dissect what your call is doing, from the inside out.
Basic Indexing
When indexing a data.frame or matrix (assuming 2 dimensions), you access a single element of it with the square bracket notation, as you're seeing. For instance, to see the value in the fourth row, fifth column, you'd use vData[4,5]. This can work with ranges of rows and/or columns as well, such as vData[1:4,5] returning the first 4 rows and the 5th column as a vector.
Note: the range 1:4 can also be an arbitrary vector of numbers, such as vData[c(1,2,5),c(4,8)], which returns a 3 by 2 block (a data.frame here, since vData is one).
BTW: by default, when the resulting slice/submatrix has one of its dimensions reduced to 1 (as in the vData[1:4,5] example), R will drop it to the lower structure (e.g., matrix -> vector -> scalar). In this case, it will drop vData[1:4,5] to a vector. You can prevent this from happening by adding what appears to be a third dimension to the square brackets: vData[1:4,5,drop=FALSE], meaning "do not drop the simplified dimension". Now you should get an object of 4 rows and 1 column in return.
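To make the indexing rules above concrete, here is a small sketch on an invented 5 by 5 data.frame (the real vData comes from a CSV that is not shown, so the values and column names are placeholders):

```r
# Toy stand-in for vData; column names and values are made up.
vData <- data.frame(a = 1:5, b = 6:10, c = 11:15, d = 16:20, e = 21:25)

single     <- vData[4, 5]                   # one element: row 4, column 5
column_vec <- vData[1:4, 5]                 # dropped to a plain vector
block      <- vData[c(1, 2, 5), c(2, 4)]    # arbitrary rows and columns
kept       <- vData[1:4, 5, drop = FALSE]   # stays two-dimensional

print(single)
print(column_vec)
print(dim(block))
print(dim(kept))
```

With drop = FALSE the result keeps its rows-by-columns shape, which matters when later code expects a data.frame or matrix rather than a bare vector.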
You can read a much more thorough explanation of how to subset data.frames by reading (for example) some of the "Hadleyverse". If you do that, I highly encourage you to make it an interactive session: play in R as you read, to help cement the methods.
Negative Indexing
Negative indices mean "everything except what is listed". In your example, you are subsetting the data to extract everything except columns 1:8. So your vData[,-c(1:8)] is returning all rows and columns 9 through 20, a 20K by 12 matrix. Not small.
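A quick sketch of the negative index on a made-up 20-column data.frame (only 2 rows here, to keep the output readable; the name m is hypothetical):

```r
# Invented data with 20 columns named V1..V20.
m <- as.data.frame(matrix(1:40, nrow = 2, ncol = 20))

# Drop columns 1 through 8; everything else is kept.
rest <- m[, -c(1:8)]

print(dim(rest))
print(names(rest))
```

The same columns could be selected positively with m[, 9:20]; the negative form is handy when it is easier to name what to exclude.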
Transposition
You probably already know what t() does: transpose the matrix so that it is now 12 by 20K.
A word of warning: if all of your data.frame columns are of the same class (e.g., 'character', 'logical'), then all is fine. However, the fact that data.frames allow disparate types of data in different columns is not a feature shared by matrices. If one data.frame column is different than the others, they will be converted to the highest common format, e.g., logical < integer < numeric < character.
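The coercion warning can be seen on a tiny invented example (mixed and nums are hypothetical names): one character column is enough to turn the whole transposed result into character.

```r
# One numeric and one character column (values are made up).
mixed <- data.frame(n = c(1.5, 2.5), s = c("a", "b"), stringsAsFactors = FALSE)
tm <- t(mixed)
print(is.matrix(tm))   # t() on a data.frame returns a matrix
print(typeof(tm))      # everything was promoted to character

# Integer plus numeric columns are promoted only as far as double.
nums <- data.frame(x = 1:2, y = c(0.5, 1.5))
print(typeof(t(nums)))
```

So before transposing, it is worth checking that the remaining columns really are all numeric, for instance with sapply(df, class).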
Back to a data.frame
After you transpose it (which converts to a matrix), you convert back to a data.frame, which may or may not be necessary depending on how you intend to deal with the data later. For instance, if the row names are not meaningful, then it may not be that useful to convert into a data.frame. That's relatively immaterial, but I'm a fan of not over-converting things. I'm also a fan of using the simpler data structure, and matrices are typically faster than data.frames.
Head
... merely gives you the top n rows of a data.frame or matrix. In your case, since you transposed it, it is now 20K columns wide, which may be a bit unwieldy on the command line.
Alternatives
Based on what I provided earlier, perhaps you just want to look at the top few rows and first few columns? dataExpr0[1:5,1:5] will work, as will (identically) head(dataExpr0[,1:5], n=5).
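A sketch of the two alternatives on an invented wide data.frame (12 rows by 50 columns instead of 12 by 20K; the name wide is hypothetical):

```r
# Made-up wide data standing in for the transposed dataExpr0.
wide <- as.data.frame(matrix(seq_len(12 * 50), nrow = 12, ncol = 50))

corner <- wide[1:5, 1:5]             # top-left 5 x 5 corner by direct indexing
same   <- head(wide[, 1:5], n = 5)   # same corner via head()

print(corner)
print(isTRUE(all.equal(corner, same)))
```

For a general overview without printing any values, dim(wide) and str(wide[, 1:5]) are also cheap to run on very wide data.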
More Questions?
I strongly encourage you to read more of the Hadleyverse and become a little more familiar with subsetting and basic data management. It is fundamental to using R, and StackOverflow is not always patient enough to answer baseline questions like this. This forum is best suited for those who have already done some research, read documentation and help pages, and tried some code, and only after that cannot figure out why it is not working. You provided some basic code, which is good, but SO is not ideally suited to teach how to start with R.

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200){names(projectFileAllCombined)[i+1] <-variableNames[i]}
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online is examples where people change a few columns with a c() function and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, but I don't know of one.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
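As a small illustration of the vectorised replacement, assuming variableNames is a plain character vector (the toy columns and names below are invented):

```r
# Toy data: an id column followed by three columns to rename.
projectFileAllCombined <- data.frame(
  id = 1:2, V1 = c(0.1, 0.2), V2 = c(0.3, 0.4), V3 = c(0.5, 0.6)
)
variableNames <- c("height", "weight", "age")

# Guard against a length mismatch before assigning.
stopifnot(length(variableNames) == ncol(projectFileAllCombined) - 1)

# Replace every name except the first in a single call.
colnames(projectFileAllCombined)[-1] <- variableNames
print(colnames(projectFileAllCombined))
```

If the new names actually live in a column of a second data frame, that column would need to be pulled out first, for example with as.character() on it, before being assigned.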

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded[match(Proposals$IDs,Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is encapsulated.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge( Proposals, Awards, by="IDs", all.y=TRUE )
But I cannot believe this hasn't been asked on SO before.
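A rough sketch of the difference between match() and merge() on invented data (the Title column and all values are made up; only the IDs and TotalAwarded names come from the question). Note that this sketch uses all.x=TRUE so that proposals without any award are kept, which is an assumption about what the asker wants; all.y=TRUE would instead keep every row of Awards.

```r
# Toy stand-ins for the two tab-delimited files.
Proposals <- data.frame(
  IDs   = c("P1", "P2", "P3"),
  Title = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
Awards <- data.frame(
  IDs          = c("P1", "P1", "P3"),
  TotalAwarded = c(100, 250, 75),
  stringsAsFactors = FALSE
)

# match() returns only the position of the first hit for each ID.
first_only <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
print(first_only)

# merge() produces one output row per matching pair, so P1 appears twice.
all_matches <- merge(Proposals, Awards, by = "IDs", all.x = TRUE)
print(all_matches)
```

The merged result has more rows than Proposals whenever an ID has several awards, so it would be written out as its own table rather than assigned back into a Proposals column.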
