Question related to table() function in r - r

I have a very large dataset in which each state name in the state column is repeated for every latitude and longitude that the state covers. I want to find the order in which these states appear (the data frame is too big to show all the names) so that I can add another column of values in the correct order corresponding to the state names. inner_join doesn't work and says that it cannot assign a variable of size 122.3Gb. I wanted to use the table() function, but it gives alphabetically sorted values rather than the order in which the state names appear in the data frame. What can I do?
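One possible approach, sketched on made-up data: unique() returns values in order of first appearance, and table() follows factor levels, so setting the levels to that order avoids the alphabetical sort. The data frame `df`, the `values` lookup vector, and the numbers in it are hypothetical stand-ins, not taken from the question.

```r
# Hypothetical toy data standing in for the large data frame
df <- data.frame(
  state = c("Texas", "Texas", "Alabama", "Ohio", "Alabama", "Texas"),
  stringsAsFactors = FALSE
)

# unique() keeps values in the order in which they first appear
state_order <- unique(df$state)

# table() orders its output by factor levels, so use the appearance order as levels
counts <- table(factor(df$state, levels = state_order))

# Hypothetical per-state values, attached by name lookup instead of a join
values <- c(Texas = 10, Alabama = 20, Ohio = 30)
df$value <- unname(values[df$state])
```

The name lookup in the last step creates one value per row without building an intermediate joined table, which may help when a join runs out of memory.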

Related

How do I create a data frame from columns of different data frames?

I am trying to create a dataframe that contains the latitude and longitude from one table, and a sampling event from another table. The table containing latitude and longitude also contains the sampling event column (this table is called "filtered_data"). However, the sampling event is repeated multiple times for other criteria, so I cannot use it as it is.
I used the following code...
unique_sample = data.frame(unique(filtered_data$SAMPLING.EVENT.IDENTIFIER))
... to extract unique values from the sampling event column and put them in a new dataframe.
From here I would like to join the lat/long info from the filtered_data data frame to the unique_sample df.
This last step is where I am stuck. I've tried the merge function, but to no avail.
I discovered an easier answer:
library(dplyr)  # provides %>% and distinct()
SUBSET = filtered_data %>%
  distinct(SAMPLING.EVENT.IDENTIFIER, LATITUDE, LONGITUDE)
This code extracts the distinct/unique records of the sampling event from the original df, including the columns I want (lat/long), without the extra steps I listed above.
Cheers

How to filter rows in R for Eurozone countries easily?

I usually use dplyr to filter data. I now have a huge dataset (62176 entries) of banks operating in different countries. I'd like to subset/filter that dataset for Eurozone banks only.
I haven't found any workaround other than pasting the names of all the Eurozone countries and then creating a new dataset with filter.
Is there a better way to do this?
Thank you!
Without the data we can't give you a definitive answer. However, based on my understanding of the problem, here are some methods.
Assuming your dataset already has a column holding each bank's operating country, you could create a vector of the countries you are interested in by hand and then filter the dataset for rows that match:
#manually assign countries to vector (this must match how the countries are listed in your data)
euro_countries <- c("Germany", "France", "Italy", "Spain")  # extend with the remaining Eurozone members
#Then filter dataset to pull up rows that match, I make up colnames as I don't know your data
dataframe %>% filter(op_country %in% euro_countries)
Alternatively, depending on your data set, you can use the very helpful countrycode package. Its countrycode::codelist data set can be joined to the country column of your data, after which you can filter on the continent column for countries in "Europe". Note that this selects all European countries, which is a broader set than the Eurozone.
#join your data set with the codelist table but depends on country column in your dataset
dataframe <- left_join(x = dataframe, y = countrycode::codelist, by = c("op_country" = "country.name.en"))
#filter your dataset with the new column
dataframe %>% filter(continent=="Europe")

Indexing dataframes in R

Good day,
There is a topic here that I don't understand: the code works, but I can't see why.
I have this code:
# planets_df is pre-loaded in your workspace
# Use order() to create positions
positions <- order(planets_df$diameter)
positions
# Use positions to sort planets_df
planets_df[positions,]
I don't understand why, if you take the diameter column and want to order by it, you put the result in the row position of the data frame. For me it should be [rows, columns], but here something computed from a column goes in the row position and the data frame gets reordered. I really don't get that. Why is it not planets_df[,positions]?
The exercise is solved, I just don't understand it. It is a DataCamp exercise, by the way.
Sorry if my English is wrong; it is not my native language.
I believe that I have created an example that matches your description. For the mtcars data set, which is pre-loaded in any R session, we can sort based on the variable mpg.
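The function order returns the row indices that would sort the data by mpg in this case. In other words, the ordering variable stores row numbers, not mpg values: its first element is the number of the row with the smallest mpg, its second element the row with the next smallest, and so on.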
ordering <- order(mtcars$mpg)
This next step indicates that we want the rows of mtcars in the order specified by ordering. Essentially, ordering is the order of the rows we want, so we pass that object to the row portion of the call to mtcars.
mtcars[ordering,]
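If we instead passed ordering as the columns, R would treat those numbers as column indices and select columns of mtcars instead of rows. Here that would even fail, because ordering contains values up to 32 (the number of rows) while mtcars has only 11 columns.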
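The same idea on a tiny, made-up data frame, where the result of order() is easy to check by eye. The data frame `toy` and its column names are hypothetical and are not the DataCamp planets_df.

```r
# Hypothetical toy data frame with three rows
toy <- data.frame(
  name = c("b", "c", "a"),
  size = c(20, 30, 10),
  stringsAsFactors = FALSE
)

# order() returns row positions: the smallest size is in row 3, then row 1, then row 2
positions <- order(toy$size)

# Putting those positions in the row slot rearranges whole rows
sorted <- toy[positions, ]
```

Because positions holds row numbers, it belongs before the comma; the columns are left untouched by leaving the slot after the comma empty.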

How to reorder a column and include all values in that row with the reordering?

I want to create a map showing the COVID-19 cases per 100,000 in Europe. However, after converting my large SpatialPolygonsDataFrame into a normal data frame with the fortify function and merging my corona information data set with the data frame of polygon data, the plot no longer comes out correctly. I know that the cause is probably the order column, which no longer orders the coordinates correctly and has random jumps, as you can see in the data frame. As a result, ggplot does not plot the coordinates in the right sequence. Does anyone know how I can reorder my entire data frame by the order column, so that the "order" column is ascending and all the other values in each row move along with it?
I fixed it with the left_join function. This function does not reorder the rows after merging, which fixed my problem.
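For the reordering asked about in the question itself, a minimal sketch in base R is to index the rows by order() of the order column. The data frame `poly` and its values are made up for illustration and only mimic the long, lat and order columns that fortify produces.

```r
# Hypothetical polygon data whose rows were shuffled by a merge
poly <- data.frame(
  long  = c(3, 1, 2),
  lat   = c(30, 10, 20),
  order = c(3L, 1L, 2L)
)

# Sort whole rows so that the order column is ascending
poly_sorted <- poly[order(poly$order), ]
```

Every column moves together with its row, so the coordinates end up back in drawing sequence.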

How can I transform full state names to abbreviations?

I have a data frame A with a column called "states". The states are recorded by their full name, ex. "California". There are multiple rows for each state.
I have a data frame B, which has the number of gun deaths for each state. The states are recorded by abbreviations, ex. "CA"
What I would like is: I want each row in A to have the number of gun deaths for the corresponding state. I was planning to use dplyr::inner_join() for this.
But of course, the problem is that the state names are different in the different data frames.
What is the best way to make the names match?
If you have two vectors of the same length and want to construct a translation table, set the input vector of state names as the names attribute of the output vector of state abbreviations, and then pass the names as inputs to the "[" function:
names(state.abb) <- state.name
# both are in-built values in the `state`-item of default `datasets` package
?state # also See: ?Constants and ?data
dfrm$abbrev <- state.abb[dfrm$states]
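A sketch of how the lookup could then be used to bring the counts from B into A, here with match() in base R rather than inner_join. The data frames `A` and `B` below, their column names other than states, and the numbers are hypothetical.

```r
# Hypothetical stand-ins for the two data frames in the question
A <- data.frame(states = c("California", "Texas", "California"),
                stringsAsFactors = FALSE)
B <- data.frame(abbrev = c("TX", "CA"),
                gun_deaths = c(200, 100),  # made-up numbers
                stringsAsFactors = FALSE)

# Named lookup vector: full state name -> abbreviation (built-in datasets)
abb_lookup <- setNames(state.abb, state.name)

# Translate the full names, then pull the matching count for each row of A
A$abbrev <- unname(abb_lookup[A$states])
A$gun_deaths <- B$gun_deaths[match(A$abbrev, B$abbrev)]
```

With the abbrev column in place, a join on that column would work as well; if the states column is a factor, wrapping it in as.character() before the lookup avoids indexing by the factor's integer codes.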
