How to sort .csv files in R

I have a .csv file which I have imported into R. It contains a column with locations; some locations are repeated depending on how many times that location has been surveyed. I have another column with the total no. of plastic items.
I would like to add together the number of plastic items for locations that appear more than once, and create a separate column with the total no. of plastic items and another column with the no. of times the location appeared.
I am unsure how to do this; any help will be much appreciated.

Using dplyr:
library(dplyr)

data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
And here's a base solution that does pretty much the same thing:
data[c("TOTloc","TOTitem")]<-t(sapply(data$location, function(x)
c(TOTloc=sum(data$location==x),
TOTitem=sum(data$items[data$location==x]))))
Note that in neither case do you need to sort anything - in dplyr you can use group_by to have each action done only on the part of the data set that belongs to a group determined by the contents of a certain column. In my base solution, I break the locations down with sapply and then compute TOTloc and TOTitem again for each row, which may not be very efficient. A better solution would probably use split, but for some reason I couldn't make it work with my made-up dataset, so maybe someone else can suggest how best to do that.
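Since the answer mentions split() without a working example, here is one possible sketch of a split-based version, assuming the same column names (location and items) as the dplyr code above:

# Compute per-location totals and counts once, then map them back onto each row.
totals <- sapply(split(data$items, data$location), sum)   # named vector: total items per location
counts <- table(data$location)                            # number of rows per location
data$TOTitems    <- totals[as.character(data$location)]
data$TOTlocation <- as.integer(counts[as.character(data$location)])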

Related

How to change a dataframe's column types using tidy selection principles

I'm wondering what the best practices are for changing a dataframe's column types, ideally using tidy selection.
Ideally you would set the column types correctly up front when you import the data, but that isn't always possible for various reasons.
So the next best pattern that I could identify is the one below:
# random dataframe
library(dplyr)
library(lubridate)
library(tibble)

df <- tibble(a_col = 1:10,
             b_col = letters[1:10],
             c_col = seq.Date(ymd("2022-01-01"), by = "day", length.out = 10))
My current favorite pattern involves using across(), because I can use tidy selection verbs to pick the variables I want and then "map" a function onto them.
# current favorite pattern
df <- df %>%
  mutate(across(starts_with("a"), as.character))
Does anyone have any other favorite patterns or useful tricks here? It doesn't have to use mutate. Often I have to change the column types of dataframes with hundreds of columns, so it becomes quite tedious.
Yes, this happens. The pain point is when dates are stored as character: if you modify them once and then try to modify them again (say in a mutate / summarise), there will be an error.
In such cases, change the datatype only once you know what kind of data is in the column.
Select columns by name if the names carry meaning.
Before applying as.*, check with is.* whether the column is already of that type.
Apply it with map / lapply / a for loop, whatever is comfortable (a minimal sketch follows below).
But it would be difficult to have a single approach for "all dataframes", as people name fields however they choose or find convenient.
I've shared mine; hope others help.
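A minimal sketch of that check-then-convert idea using lapply(), with the column names taken from the example tibble above (adjust the names to your own data):

cols_to_convert <- c("a_col", "b_col")   # columns you believe should be character
df[cols_to_convert] <- lapply(df[cols_to_convert], function(x) {
  if (!is.character(x)) as.character(x) else x   # only convert when not already character
})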

Most efficient algorithm to filter dataframe based on two nested conditions in R?

I'm currently working with a really large dataframe (~2M rows) about "landings" and "take-offs", with information like the time the operation happened, at which airport, where it was heading, and so on.
What I want to do is filter the whole DF into a new one that just considers "flights" (about half the entries), matching each take-off with its corresponding landing based on the airport codes of the origin and destination airports.
What I did works, but considering how large the DF is, it takes about 200 hours to complete:
Loop over all rows of DF, looking for df$Operation == "takeoff" {
  Loop over all rows below the row found above, looking for df$Operation == "landing"
  where the origin and destination airport codes match the "takeoff" entry {
    Once found, add the data I need to the new df called Flights
  }
}
(If the second loop does not find a match within the next 100 rows, it discards the entry and searches for the next "takeoff")
Is there a function that performs this operation in a more efficient way? If not, do you know of an algorithm that could be much faster than the one I wrote?
I am really not used to data science, nor to R. Any help will be appreciated.
Thanks in advance!
In R we try to avoid explicit loops. For filtering a dataframe I would use the filter function from dplyr; dplyr is great, easy, and fast for working with dataframes. If it's still not fast enough, you can try data.table, but it's a bit less user-friendly.
I think this does what you want:
library(dplyr)

flights <- df %>%
  arrange(datetime) %>%                                   # make sure the data is in the right order
  group_by(origin, destination) %>%                       # for each flight path
  dplyr::filter(Operation %in% c("takeoff", "landing"))   # keep only these rows
I recommend the online book R For Data Science:
https://r4ds.had.co.nz/
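The filter above keeps the relevant rows but does not yet pair each take-off with its landing. A hedged sketch of one way to do that pairing, assuming the column names from the question (Operation, origin, destination) plus a datetime column:

library(dplyr)

flights <- df %>%
  arrange(origin, destination, datetime) %>%
  group_by(origin, destination) %>%                 # one group per route
  mutate(next_operation = lead(Operation),          # look at the next row on the same route
         landing_time   = lead(datetime)) %>%
  ungroup() %>%
  filter(Operation == "takeoff", next_operation == "landing")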

Reading subset of large data

I have a LARGE dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say column1 == "A". How do I accomplish this in R using read.csv?
Thank you
You can't filter rows using read.csv. You might try sqldf::read.csv.sql as outlined in answers to this question.
But I think most people would process the file using another tool first. For example, csvkit allows filtering by rows.
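A hedged sketch of the sqldf route, which filters rows with SQL as the file is read, so the full file never has to be loaded into R (the file name and value are placeholders):

library(sqldf)

# "file" is how read.csv.sql refers to the CSV being read in the SQL statement
subset_df <- read.csv.sql("bigfile.csv",
                          sql = "select * from file where column1 = 'A'")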

Selecting rows in a long dataframe based on a short list

I'm sure this should be easier to do than the way I know how to do it.
I'd like to apply fields from a short dataframe back into a long one based on matching a common factor.
Example short dataframe, a list of valid cases:
$ptid (factor): values 1, 2, 3, 4, 5 ... 20
$valid: 1/0 (representing true/false; varies across ptid)
The long dataframe has 15k rows, and each level of $ptid will have several thousand rows.
I want to apply $valid onto those rows where it is 1/true in the list above.
The way I know how to do it is to loop through each row of long dataframe, but this is horribly inelegant and also slow.
I have a niggling feeling there is a much better way with dplyr or similar, and I'd really like to learn how.
Worked this out based on the comments, thank you Colonel.
combination_dataset <- merge(short_dataframe, long_dataframe) worked (very quickly).
Thanks to those who commented.
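For reference, a minimal sketch of what that merge-based approach could look like, assuming both dataframes share a ptid column and the short one carries the valid flag:

combination_dataset <- merge(long_dataframe, short_dataframe, by = "ptid")  # join on the common factor
valid_rows <- combination_dataset[combination_dataset$valid == 1, ]         # keep only the valid cases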

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R code to add awards data on proposals to a new data spreadsheet
#read tab-delimited files
Awards    <- read.delim("O:/testing.txt", as.is = TRUE)
Proposals <- read.delim("O:/test.txt", as.is = TRUE)
#match IDs from both spreadsheets
Proposals$TotalAwarded <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
write.table(Proposals, "O:/tested.txt", quote = FALSE, row.names = FALSE, sep = "\t")
This does exactly what I want, except that only the first match is returned.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by = "IDs", all.y = TRUE)
But I cannot believe this hasn't been asked on SO before.
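A hedged sketch of that merge() suggestion, using the column names from the question. Note that all.y = TRUE keeps every award row even when its ID has no matching proposal, and a proposal with several awards comes back as several rows:

merged <- merge(Proposals, Awards, by = "IDs", all.y = TRUE)
write.table(merged, "O:/tested.txt", quote = FALSE, row.names = FALSE, sep = "\t")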
