I have this dataframe:
# A tibble: 6 x 4
  Full.Name            Year      freq     Ra
  <chr>                <chr>    <dbl>  <dbl>
1 A. Patrick Beharelle 2019  0.000713  0.110
2 A. Patrick Beharelle 2020 -0.0946   -0.116
3 Aaron P. Graft       2019  0.835     0.276
4 Aaron P. Graft       2020 -0.276     0.376
5 Aaron P. Jagdfeld    2019 -1.20      0.745
6 Aaron P. Jagdfeld    2020 10.7       0.889
It describes a certain topic. Now, I want to visualize the freq column by Full.Name with a plot. That's not that hard; I can do that. But here comes the tricky part, which I am not able to do:
I have another dataframe with exactly the same structure (same columns, but different values) that deals with another topic, and I want to include it in the first dataframe's plot so that I can compare the two.
I tried merging both dataframes, but they have a different number of observations, so it is hard to merge them. I tried inner_join(), but that was not successful because the Full.Name values do not match. Maybe there is another way to join the two dataframes.
Any suggestions on how to include both dataframes in one plot, or even in some kind of merged table that distinguishes between the two topics, would be great. Any help is appreciated. Thanks in advance!
My understanding of your problem is that you have two dataframes and you want to compare the values in those two dataframes in one plot. You can achieve that by appending the two dataframes, keeping a column that records which dataframe each row came from. Here is an example:
## Sample DataFrame1
df1 <- data.frame(Names = c("Alpha", "Alpha", "Rome", "Victor", "Victor"),
                  Year = c(2019, 2020, 2019, 2020, 2019),
                  Freq = c(0.000713, -0.000713, 0.01724, -0.0760713, 0.00213),
                  Dataframe = "df1")

## Sample DataFrame2
df2 <- data.frame(Names = c("Gamma", "Gamma", "Tango", "Pan", "Beta"),
                  Year = c(2019, 2020, 2019, 2020, 2019),
                  Freq = c(0.0713, -0.090713, 0.1724, -0.013, 0.0299),
                  Dataframe = "df2")

## Appending the two DataFrames
rbind(df1, df2)
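If you also want the comparison plot itself, here is a minimal ggplot2 sketch building on the combined data; mapping fill to the Dataframe column is what separates the two topics (the column names follow the sample data above):

library(ggplot2)

combined <- rbind(df1, df2)

## One bar per name, coloured by which dataframe the row came from
ggplot(combined, aes(x = Names, y = Freq, fill = Dataframe)) +
  geom_col(position = "dodge")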
Hope this helps!
I would like to combine two data frames. One holds information for birds banded; the other holds information on recovered banded birds. I would like to add the recovery data to the banding data if the bird was recovered (not all birds were recovered). Unfortunately, the full band number is not included in the banding data, only in the recovery data, so there is no unique column to join them by.
One looks like this:
GISBLong   GISBLat   B Flyway  B Month  B Year  Band Prefix Plus
-85.41667  42.41667  8         5        2001    12456
-85.41655  36.0833   9         6        2003    21548
The other looks like this:
GISBLong   GISBLat   B Flyway  B Month  B Year  Band       R Month  R Year
-85.41667  42.41667  8         5        2001    124565482  12       2002
-85.41655  36.0833   9         6        2003    215486256  1        2004
I have tried merge, ifelse, and the dplyr joins with no luck. Any suggestions? Thanks in advance!
You should look up rbind(); that might do the trick. For it to work, the data frames have to have the same columns, so I'd suggest adding the missing columns to your first data frame with dplyr::mutate() and eliminating the useless rows later on.
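A minimal sketch of what that looks like, assuming the two tables have been read in as banding and recovery (hypothetical names, with syntactic column names such as R_Month standing in for R Month):

library(dplyr)

## Pad each data frame with the columns the other one has...
banding_padded  <- banding  %>% mutate(Band = NA, R_Month = NA, R_Year = NA)
recovery_padded <- recovery %>% mutate(Band_Prefix_Plus = NA)

## ...then stack them; rbind() matches data frame columns by name
combined <- rbind(banding_padded, recovery_padded)

Note that dplyr::bind_rows(banding, recovery) does the same NA-padding in one step, so you may not need mutate() at all.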
I have two databases. The first one has about 70k rows and 3 columns; the second one has 790k rows and 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work, because the grantee_name values do not match perfectly; there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC, but that does not mean it is a good match; there will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df <- as.data.frame(stringdist_inner_join(forfuzzy, filings,
                                          by = "grantee_name", method = "jw",
                                          p = 0.1, max_dist = 0.1,
                                          distance_col = "distance"))
I am totally new to R. This results in an error:
cannot allocate vector of size 375GB (with the big database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over a list of 100 rows at a time.
I have tried the following:
n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))
df <- as.data.frame(lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings,
                        by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1,
                        distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I have tried a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with the option mc.cores = 3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings,
                                       by = "grantee_name", method = "jw",
                                       p = 0.1, max_dist = 0.1,
                                       distance_col = "distance",
                                       nthread = getOption("sd_num_thread")))
This seems to be working, after a novice error in the by = "grantee_name" statement. However, it is taking forever and I am not sure it will finish. A sample of 100 rows from forfuzzy, with n = 10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list (map is pretty much lapply):
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings,
                                               by = "grantee_name", method = "jw",
                                               p = 0.1, max_dist = 0.1,
                                               distance_col = "distance"))
Your result will be a list of data frames, each fuzzy-joined with filings. You can then call bind_rows (or use map_dfr in the first place) to get all the results back into a single data frame.
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
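As a rough sketch of that whole pipeline (untested at scale, reusing the chunk size and join arguments from the question; chunk is just a made-up helper column):

library(dplyr)
library(purrr)
library(fuzzyjoin)

## Split into 100-row chunks, fuzzy-join each chunk, and row-bind the results
result <- forfuzzy %>%
  mutate(chunk = (row_number() - 1) %/% 100) %>%
  group_split(chunk) %>%
  map_dfr(~ stringdist_inner_join(.x, filings,
                                  by = "grantee_name", method = "jw",
                                  p = 0.1, max_dist = 0.1,
                                  distance_col = "distance"))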
I haven't used foreach before, but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
I would like to ask a question regarding the fuzzyjoin package. I am very new to R, and I promise I have read through the readme file and followed the examples on https://cran.r-project.org/web/packages/fuzzyjoin/index.html before asking this question.
I have a list of vernacular names that I want to match with plant species names. A simple version of my list looks like the one below. data1 has a LocalName column with many typos of vernacular names. data2 is the table with the correct local names and species that the matching should be based on.
data1 <- data.frame(Item = 1:5,
                    LocalName = c("BACTERIA F", "BAHIA", "BAIKEA",
                                  "BAIKIA", "BAIKIAEA SP"))
data1
  Item   LocalName
1    1  BACTERIA F
2    2       BAHIA
3    3      BAIKEA
4    4      BAIKIA
5    5 BAIKIAEA SP
data2 <- data.frame(LocalName = c("ENGOKOM", "BAHIA", "BAIKIA",
                                  "BANANIER", "BALANITES"),
                    Species = c("Barteria fistulosa", "Mitragyna spp",
                                "Baikiaea spp", "Musa spp",
                                "Balanites wilsoniana"))
data2
  LocalName              Species
1   ENGOKOM   Barteria fistulosa
2     BAHIA        Mitragyna spp
3    BAIKIA         Baikiaea spp
4  BANANIER             Musa spp
5 BALANITES Balanites wilsoniana
I tried using the stringdist_left_join function, and it managed to match many species correctly. I am being conservative by setting max_dist = 1, because many vernacular names in my list are very similar.
library(fuzzyjoin)
library(dplyr)  # for the pipe
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName"), max_dist = 1)
table
Item LocalName.x LocalName.y Species
1 1 BACTERIA F <NA> <NA>
2 2 BAHIA BAHIA Mitragyna spp
3 3 BAIKEA BAIKIA Baikiaea spp
4 4 BAIKIA BAIKIA Baikiaea spp
5 5 BAIKIAEA SP <NA> <NA>
However, I have one question. As you can see from data1, Item 5, BAIKIAEA SP, actually matches the Species column of data2 instead of LocalName. I have many entries like this, where the LocalName in data1 is a typo of either a vernacular name or a species name, but I am not sure how to make stringdist_left_join match two columns of data2 against one column of data1. I tried modifying the code into something like this:
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName" | "Species"), max_dist = 1)
but it did not work, citing: Error in "LocalName" | "Species" : operations are possible only for numeric, logical or complex types. Does anyone know whether such matching is possible? Thanks in advance!
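One possible approach (a sketch, not a tested answer): stack data2 so that LocalName and Species both feed a single candidate column, then fuzzy-join once against that column. MatchKey and MatchedOn are made-up helper names, and ignore_case (an argument of the stringdist joins) is what lets "BAIKIAEA SP" land within distance 1 of "Baikiaea spp":

library(dplyr)
library(fuzzyjoin)

## Stack data2 so both columns become candidate match keys
data2_long <- bind_rows(
  data2 %>% mutate(MatchKey = LocalName, MatchedOn = "LocalName"),
  data2 %>% mutate(MatchKey = Species,   MatchedOn = "Species")
)

## One fuzzy join against the stacked key column
result <- data1 %>%
  stringdist_left_join(data2_long, by = c(LocalName = "MatchKey"),
                       max_dist = 1, ignore_case = TRUE)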
I want to get correlation values between two variables for each county.
I have subset my data as shown below, and I get the appropriate value for the individual Adams county, but now I want to do the other counties:
CorrData<-read.csv("H://Correlation
Datasets/CorrelationData_Master_Regression.csv")
CorrData2<-subset(CorrData, CountyName=="Adams")
dzCases<-(cor.test(CorrData2$NumVisit, CorrData2$dzdx,
method="kendall"))
dzCases
I want to write a for loop or something similar to make the process more efficient, so that I don't have to write 20 different variable correlations for each of the 93 counties.
When I run the following in R, it doesn't give an error, but it doesn't give me the result I was hoping for either. Rather than the Spearman correlation for each county, it seems to ignore the loop portion and just give me the correlation between the two variables for ALL counties.
CorrData<-read.csv("H:\\CorrelationData_Master_Regression.csv")
for (i in CorrData$CountyName)
{
dzCasesYears<-cor.test(CorrData$NumVisit, CorrData$dzdx,
method="spearman")
}
A very small sample of my data looks similar to this:
CountyName  Year  NumVisits    dzdx
Adams       2010  4.545454545   1.19
Adams       2011  20.83333333   0.20
Elmore      2010  26.92307692   0.24
Elmore      2011  0             0.61
Brown       2010  0            -1.16
Brown       2011  17.14285714  -1.28
Clark       2010  25           -1.02
Clark       2011  0             1.13
Cass        2010  17.85714286   0.50
Cass        2011  27.55102041   0.11
I have tried to find a similar example online but have had no luck!
Thank you in advance for all your help!
You are looping, but you never use your iterator i inside the loop, so every pass computes the same correlation over all counties. Based on the comments, you might also want to make sure you are using numerics. I also noticed that you are not storing each cor.test result into an output list as you iterate. I'm not sure a loop is the most efficient way to do this, but it will work just fine, and since you started with a loop, you should have something of this kind:
dzCasesYears <- list()  # Prep a list to store your cor.test results
counter <- 0            # To index into the list while iterating

for (i in unique(CorrData$CountyName)) {
  counter <- counter + 1
  # Creating new variables makes the code clearer
  x <- as.numeric(CorrData[CorrData$CountyName == i, ]$NumVisit)
  y <- as.numeric(CorrData[CorrData$CountyName == i, ]$dzdx)
  dzCasesYears[[counter]] <- cor.test(x, y, method = "spearman")
}
It's also always good to wrap the vector you iterate over in unique(), so each county is processed exactly once.
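As a small follow-up (not part of the original answer): cor.test() returns an htest object, so you can name the list entries and pull the p values out afterwards:

## Label each result with its county, then extract all p values at once
names(dzCasesYears) <- unique(CorrData$CountyName)
sapply(dzCasesYears, function(res) res$p.value)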
data.table makes operations like this very simple.
library('data.table')
CorrData <- as.data.table(read.csv("H:\\CorrelationData_Master_Regression.csv"))
CorrData[, cor(dzdx, NumVisits), CountyName]
With the sample data, it's all negative ones because there are only two points per county, so the correlation is perfect. The full dataset should be more interesting!
CountyName V1
1: Adams -1
2: Elmore -1
3: Brown -1
4: Clark -1
5: Cass -1
Edit: including p values from cor.test, as the OP asked in the comments.
This is also quite simple!
CorrData[, .(cor = cor(dzdx, NumVisits),
             p = cor.test(dzdx, NumVisits)$p.value),
         CountyName]
...But it won't work with your sample data, as two points per county is not enough for cor.test to compute a p value. Perhaps you could take #smci's advice and dput a larger subset of the data to make your question truly reproducible.
I have a data set in R that involves students and GPAs, for example:
Student GPA
Jim 3.00
Tom 3.29
Ana 3.99
and so on.
I want a column that puts them in a bin, for example:
Student GPASplit
Jim 3.0-3.5
Tom 3.0-3.5
Ana 3.5-4.0
This is because when I try to compute statistics on the GPA, the results are broken out by each individual GPA value. For example, I am trying to find the percentage of students with a GPA above 3.5, between 3.0 and 3.5, and so forth, but I get percentages in terms of the individual GPA values, and when you have 4000 data points, all with different GPAs, it is hard to figure out how many are above 3.5. Does this make sense? Sorry if it doesn't.
You can use the cut() function to split data into bins that you define. You have to be careful about values that fall exactly on the boundaries, though, and make sure they're being treated how you want. With your example data:
> df$GPA_split = cut(df$GPA, breaks = c(3.0, 3.5, 4.0), include.lowest = TRUE)
> df
Student GPA GPA_split
1 Jim 3.00 [3,3.5]
2 Tom 3.29 [3,3.5]
3 Ana 3.99 (3.5,4]
# Count values in each bin
> table(df$GPA_split)
[3,3.5] (3.5,4]
2 1
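Since the question asks for percentages rather than counts, a small addition (not part of the original answer): prop.table() turns the table into proportions:

# Share of students in each bin
> prop.table(table(df$GPA_split))

  [3,3.5]   (3.5,4]
0.6666667 0.3333333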