I'm running into some trouble while attempting a spatial join between a shapefile and a data table in csv.
Here's what my data looks like:
Point Shapefile's attribute data (StudentID):
ID Address Long Lat
123.00 street long lat
456.00 street long lat
789.01 street long lat
223.00 street long lat
412.02 street long lat
Data Table (Table):
ID Name Age School
123.00 name age school
456.00 name age school
789.01 name age school
223.00 name age school
412.02 name age school
Important note: StudentID contains roughly 500 records, while the Table only has 250. Some records in StudentID will NOT be matched.
Problem 1:
I have an Excel file, which I converted to csv for importing into R. While running the join, I noticed that the format of some of my data changed in the ID column (so 123.00 became 123; 456.00 became 456; 789.01 stayed the same). However, when I opened the csv file in Notepad the formatting was correct. I tried reading the table as a .txt file, but no luck. Does anyone know why this happens and what are some ways to overcome it?
Because I couldn't join the data based on an exact match, I decided to try a partial join, since the IDs are unique regardless of the last two digits, which led me to Problem 2...
Problem 2:
Here is what I used to join the two:
StudentID@data <- data.frame(StudentID@data, Table[charmatch(StudentID@data$ID, Table$ID), ])
This joined the data but also, as expected, returned rows with NAs. I used na.omit to remove those rows, and the resulting data contained all the records that matched. However, in the shapefile, ALL of my points are still there. Why do those dots remain when the records have been removed?
Problem 1:
Excel sometimes exports floating-point values using a comma (,) as the decimal separator. This can lead to problems in csv imports. Make sure that Excel uses a point (.) as the decimal separator, or specify the separators when importing, e.g. read.csv('file.csv', sep=';', dec=',').
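A minimal sketch of that import (the file name and the ID column name are placeholders); reading the ID column as character also keeps "123.00" from being shortened to 123:
Table <- read.csv("file.csv", sep = ";", dec = ",", colClasses = c(ID = "character"))
# read.csv2() is equivalent: it assumes ';' separators and ',' decimals
Table <- read.csv2("file.csv", colClasses = c(ID = "character"))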
Problem 2:
If you want to remove points with NA values from a shapefile, you need a logical vector to select the rows you don't want anymore. Here is an example of what this could look like (assuming your shapefile object is named student_points):
student_points <- student_points[!is.na(student_points@data$age), ]
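Tying this back to your charmatch join, a hedged way to drop the unmatched points directly (still assuming the objects are named StudentID and Table) would be:
m <- charmatch(StudentID@data$ID, Table$ID)  # position of each point's ID in Table, NA when there is no match
StudentID <- StudentID[!is.na(m), ]          # subsetting the Spatial object drops the points along with their rows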
I have block-level census data that contains an ID, known as GEOID, which I use to merge it with block-level shapefiles. This ID is built from a series of values whose meaning gets geographically smaller.
So you have country, state, county, tract, block in a number that looks like "360050001000001", with the last 4 digits being the block.
So I have a dataset that looks like this (screenshot omitted), which continues on for dozens more columns with granular race data and 37,000+ rows.
I have figured out a method to remove the last four digits of each GEOID in Excel so that they only represent the tract level, no problem.
The issue I have now is that I need all values of, say, 36005000100 to be totaled up.
So where I had "Total_Pop" with 5 different values for 36005000100, I need them all consolidated and summed into one value. The tract 36005000100 might be the sum of 0 + 0 + 141 + 1344 + 367 + 1890, with this repeated for each shared GEOID value. Each tract can contain anywhere from a few to dozens of blocks.
I have tried the Consolidate function in Excel, but I can't seem to get it working right.
I've looked into the merge and join functions for R, but they primarily seem to be used for merging data frames, whereas I need to consolidate values.
I'm also trying to use the "aggregate" function in R and struggling with the appropriate syntax for my use case.
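For reference, a minimal sketch of that aggregate() call could look like the following (the data frame name blocks is an assumption, and GEOID is assumed to be stored as text):
blocks$GEOID_tract <- substr(blocks$GEOID, 1, 11)                      # drop the last 4 (block) digits, keeping state + county + tract
tracts <- aggregate(Total_Pop ~ GEOID_tract, data = blocks, FUN = sum) # one summed row per tract
# a cbind() of several count columns on the left-hand side of the formula would sum them all at once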
I am a beginner with R and I'm having a hard time finding any info related to the task at hand. Basically, I am trying to calculate 5-year averages for 15 categories of NHL player statistics and add each of the results as a column in the '21-22 player data. So, for example, I'd want the 5-year averages for a player (e.g. Connor McDavid) to be displayed in the same dataset as the 21-22 player data, but each of the numbers needed to calculate the mean lives in its own spreadsheet that has been imported into R. I have an .xlsx worksheet for each year from 17-18 to 21-22, so 5 sheets in total. I have loaded each of the sheets into RStudio, but the next steps are very difficult for me to figure out.
I think I have to use a pipe, locate one specific cell (e.g. Connor McDavid, goals) in 5 different data frames, use a mean function to find the average for that one particular cell, assign that as a vector 5_year_average_goals, then add that vector as a column in the original 21-22 dataset so I can compare each player's production last season to their 5-year averages. Then repeat that step for each column (assists, points, etc.). Would I have to repeat these steps for each player (row)? Is there an easy way to use a placeholder that will calculate these averages for every player in the 21-22 dataset?
This is a guess, and just a start ...
I suggest as you're learning that using dplyr and friends might make some of this simpler.
library(dplyr)
library(readxl)   # for read_xlsx()
files <- list.files(pattern = "xlsx$")
lapply(setNames(nm = files), read_xlsx) %>%
  bind_rows(.id = "filename") %>%
  arrange(filename) %>%
  group_by(`Player Name`) %>%
  mutate(across(c(GPG, APG), ~ cumsum(.) / seq_along(.), .names = "{.col}_5yr_avg"))
Brief walk-through:
list.files(.) should produce a character vector of all of the filenames. If you need it to be a subset, filter it as needed. If they are in a different directory, then files <- list.files("path/to/dir", pattern="xlsx$", full.names=TRUE).
lapply(..) reads the xlsx file for each of the filenames found by list.files(.), returns a list of data.frames.
bind_rows(.) combines all of the nested frames into a single frame, and adds the names of the list as a new column, filename, which will contain the file from which each row was extracted.
arrange(.) sorts by the year, which should put things in chronological order, as required for a running average. I'm assuming the filenames sort correctly; you may need to adjust this if I got it wrong.
group_by(..) makes sure that the next expressions only see one player at a time (across all files).
mutate calculates (I believe) the running average over the years. It's not perfectly resilient to issues (e.g., gaps in years), but it's a good start.
Hope this helps.
For an assignment I have to use fuzzy matching in R to merge two different datasets that both have a "Country" column. The first dataset is from Kaggle (the Countries dataset), while the other is from the ISO 3166 standard. I already used fuzzy matching and it worked well. I added a new column to both datasets that numbers the observations from 1 up to their respective lengths (as far as I understand, this is a must for fuzzy matching); I named it "Observation number". My first dataset has 227 observations and the ISO dataset has 249.
I want to create a new dataset that includes the columns from my first dataset (I have to use this dataset specifically because it has columns like migration, literacy, etc.) and the country codes from the ISO dataset. I couldn't manage to do it. The fuzzy-matching output gave me where each of the first dataset's observation numbers falls in the ISO dataset. (For example, in the first dataset the countries are ordered Afghanistan, Albania, Algeria..., whilst in ISO the order is Albania, Algeria, Afghanistan, so the fuzzy-match output gave me 3, 1, 2... I understand this to mean that the 3rd observation in the ISO dataset is the 1st in the Countries dataset.)
I want to create a new dataset that has all the information of the Countries dataset, ordered with respect to the order of the ISO dataset's Country column.
However, I cannot do it using:
a <- Result1$matches$observationnumber
# gives me vector a: where the i'th observation of the Country dataset can be found in the ISO dataset
countryorderedlikeISO <- countries.of.the.world[match(a, countries.of.the.world$observation), ]
It seems to ignore the countries that are present in ISO but not in the country dataset.
What can I do? I want this new dataset to have ISO's length, with NA values for observations that are present in ISO but not in the Country dataset.
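A minimal sketch of one way to get there (assuming the ISO data frame is called iso and a is the mapping described above): inverting the lookup keeps one row per ISO observation, with NA rows where there is no match.
idx <- match(seq_len(nrow(iso)), a)                     # for each ISO row, its position in the Countries data (NA if absent)
countryorderedlikeISO <- countries.of.the.world[idx, ]  # ISO-length result; unmatched ISO countries become NA rows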
I have an ascii file that contains one week of data. The data is a text file and does not have header names. I have nearly completed a smaller task using R, and have made some attempts with Python as well. Being a pro at neither, it's been a steep learning curve. Below are my data and the R code I created to paste rows together based on a specific sequence of characters, which is not working.
Each column holds different data, but the row data is what matters most. For example:
column 1 column 2 column 3 column 4
Row 1 Name Age YR Birth Date
Row 2 Middle Name School name siblings # of siblings
Row 3 Last Name street number street address
Row 4 Name Age YR Birth Date
Row 5 Middle Name School name siblings # of siblings
Row 6 Last Name street number street address
Row 7 Name Age YR Birth Date
Row 8 Middle Name School name siblings # of siblings
Row 9 Last Name street number street address
I have a folder to iterate or loop over; some files hold hundreds of rows and others hold thousands. I have code written that drops all the rows I don't need and writes to a new .csv; however, any pasting and/or merging isn't producing the desired results.
What I need is code to select only the Name and Last Name rows (and their adjacent data) from the entire file and paste the Last Name row at the end of the Name row. Each file has the same number of columns but a different number of rows.
I have read the file into a data frame, and have tried merging/pasting/binding (r and c) the rows/columns, and the result is still just shy of what I need. rbind works the best thus far, but instead of producing the data with the rows pasted one after another on the same line, they are pasted beside each other in columns, like this:
ie:
Name Last Name Name Last Name Name Last Name
Age Street Num Age Street Num Age Street Num
YR Street address YR Street address YR Street address
Birth NA Birth NA Birth NA
Date NA Date NA Date NA
I have tried to rbind them, or family[c(Name, Age, YR Birth...)], and I am not successful. I have looked at how many columns I have and tried to add more columns to account for the paste, but instead it populates with the data from row 1.
I'm really at a loss here, and if anyone can provide some insight I'd really appreciate it. I'm newer than some, but not as new as others. The results I am achieving look like:
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Code tried:
rowData <- rbind(name$Name, name$Age, name$YRBirth, name$Date)
colData <- cbind(name$V1 == "Name", name$V1 == "Last Name")
merge and paste also do not work. I have tried creating each variable as a new data frame and am still not achieving the results I am looking for. Does anyone have any insight?
OK, so if I understand your situation correctly, you want to first slice your data and pull out every third row starting with the 1st row, and then pull out every third row starting with the 3rd row. I'd do it like this (assuming your data is in df):
df1 <- df[3*(1:(nrow(df)/3)) - 2,]
df2 <- df[3*(1:(nrow(df)/3)),]
Once you have these, you can just slap them together, but instead of using rbind you want to use cbind. Then you can drop the NA columns and rename the rest.
df3 <- cbind(df1,df2)
df3 <- df3[1:7]
colnames(df3) <- c("Name", "Age", "YR", "Birth date", "Last Name", "Street Num", "Street Address")
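And since you mentioned looping over a folder of files, a hedged sketch of wrapping the same steps in a loop (the folder path and the read.table() settings are assumptions about your raw files):
files <- list.files("path/to/folder", full.names = TRUE)
for (f in files) {
  df  <- read.table(f, header = FALSE, stringsAsFactors = FALSE)
  df1 <- df[3 * (1:(nrow(df) / 3)) - 2, ]   # Name rows (1, 4, 7, ...)
  df2 <- df[3 * (1:(nrow(df) / 3)), ]       # Last Name rows (3, 6, 9, ...)
  df3 <- cbind(df1, df2)[1:7]               # drop the trailing NA column
  colnames(df3) <- c("Name", "Age", "YR", "Birth date",
                     "Last Name", "Street Num", "Street Address")
  write.csv(df3, paste0(tools::file_path_sans_ext(f), "_clean.csv"), row.names = FALSE)
}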
I am struggling to understand how to combine two tables in R when the common variables are not exactly the same.
To give some context, I have downloaded two sources of information about politicians, from Twitter and from the administration, and created two different data frames. In the first data frame (dataset 1), I have the names of the politicians present on Twitter. However, I don't know whether these politicians are currently in office or not. To find that out, I could use the second data frame.
The second data frame (dataset 2) contains the names and other information about the politicians currently in office.
The first and last names are the only variables contained in both tables. The two tables do not have the same number of rows.
Problem:
The names in the first dataset were stored as one variable (first name + last name), whereas in the second dataset the names were split into two variables (last name and first name). I used separate() to split the name column in the first table: parliament_twitter_tempdata <- separate(parliament_twitter_tempdata, col = name, into = c("firstname", "lastname"), extra = "merge").
However, I have problems with it, as both datasets have:
compound first names and compound last names
first name and last name in the wrong order
I have included a picture of a part of both datasets (last names from "J" to "M") to illustrate the differences between similar values and the inversion of last name and first name.
How could I improve my code?
The names in both tables are not completely identical. Some people did not write their official name on Twitter. Is there any function which could compare the two tables, find the pairs of names that match to around 80%, and replace the name in data frame 1 (from Twitter) with the official name from data frame 2? E.g. dataset 1: Marie Gabour; dataset 2: Marie Gabour Jolliet → replace "Marie Gabour" from dataset 1 with "Marie Gabour Jolliet".
Could someone help me there? Many thanks !
[Image: part of dataset 1 after separate() (last names "J" to "M")] [Image: part of the names in dataset 2 (last names "J" to "M")]
Fuzzy matching might be a way to move forward:
https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
Also, cleaning functions may help (e.g., using toupper or removing whitespace on the key).
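A rough sketch of what that could look like with fuzzyjoin's string-distance join (the object names twitter_df and officials_df, and the firstname/lastname columns, are assumptions based on the question above):
library(dplyr)
library(fuzzyjoin)

# build a normalised key in both tables: upper case, no stray whitespace
twitter_df   <- twitter_df   %>% mutate(key = toupper(trimws(paste(firstname, lastname))))
officials_df <- officials_df %>% mutate(key = toupper(trimws(paste(firstname, lastname))))

# keep every Twitter row and attach the closest official name within 3 edits
matched <- stringdist_left_join(twitter_df, officials_df,
                                by = "key", max_dist = 3, method = "lv")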