I'm an R beginner trying to merge two datasets, and I'm losing data in the process. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System and the data looks like this
image of 10 rows of data from this set
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period.
image of 10 rows of the book ordering dataset
I've named this dataset DOA
I'm unsure how to include the data not in an image
(Can also provide the .csvs if needed)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0") # str_pad is from the stringr package
This dataset is just over 1,000 rows. The values are padded because the 000 to 099 Dewey Decimal classifications were dropping their leading 0s.
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets and create a new set called DOA_Call
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head the data the merge seems to be working properly, but about 10,000 rows do not get the DDC columns added; they just stay in their original state. This is about 40% of my total dataset, so it is pretty substantial. My first instinct was that each DDC row was only being matched once, but that would mean I would be missing 23,000 rows, which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best try with the information you provided. You will need to use:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html)
the stringr library to handle some variables (https://stringr.tidyverse.org/)
and some familiarity with the tidyverse.
Please keep in mind that the best way to ask on Stack Overflow is by providing a minimal reproducible example.
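A minimal sketch of that approach, using made-up stand-ins for DDC and DOA_data (the column names and values here are assumptions, not your real data). Note that left_join keeps every row of its first argument, so unmatched call numbers show up as NA rather than disappearing:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-ins for the two datasets
DDC <- data.frame(Call_Category = c("4", "510", "813"),
                  Classification = c("Computer science", "Mathematics", "American fiction"),
                  stringsAsFactors = FALSE)
DOA_data <- data.frame(Call_Category = c("004.16 SMI", "813.54 KIN", "999.9 XYZ"),
                       Title = c("Book A", "Book B", "Book C"),
                       stringsAsFactors = FALSE)

# Normalize both keys the same way: pad to 3 digits, take the first 3 characters
DDC$Call_Category2      <- str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
DOA_data$Call_Category2 <- substr(DOA_data$Call_Category, 1, 3)

# Join on the normalized key only; every ordered book is kept
DOA_Call <- left_join(DOA_data, DDC, by = "Call_Category2")

# Rows with NA in the DDC columns are the ones that failed to match -- inspect these
unmatched <- DOA_Call[is.na(DOA_Call$Classification), ]
```

Joining explicitly by Call_Category2 also avoids a subtle pitfall: with no by argument, the join matches on every column name the two frames share, which can silently break the match.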
I have a dataset named list that looks like this:
df
200000
5666666
...
It continues like this up to 5551 observations.
Another dataset also has 5551 observations. I want to merge the list dataset with the other one, but no variable is shared between them; only the row names are the same.
I tried
merge(list, df, by = "rownames")
but the error message says it should have a valid column name. I also tried merge_all, but that didn't work either. Could someone please help?
It's good practice to be more precise with the naming of your dataframe variables. I wouldn't use list but something like df_description. Either way, merging by rownames can be achieved by using by = "row.names" or by = 0. You can read more on merge() in the documentation (under "Details").
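For example, a minimal sketch of merging by row names (toy data frames and values assumed here):

```r
# Two frames with no shared columns, matched purely by row names
df_description <- data.frame(value = c(200000, 5666666))
df_other       <- data.frame(score = c(1.5, 2.5))
rownames(df_description) <- c("r1", "r2")
rownames(df_other)       <- c("r1", "r2")

# by = "row.names" (or equivalently by = 0) merges on row names;
# the names end up in a new Row.names column of the result
merged <- merge(df_description, df_other, by = "row.names")
```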
I have a very large dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (which, it seems, can't be translated directly to PySpark):
Order the data in ascending order.
Check whether the next row has the same ID as the current one; if yes, n = n + 1; if no, run my analysis and save the results. Rinse and repeat for the whole length of the data frame.
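For reference, the R loop described above collapses to a grouped filter with dplyr; a minimal sketch on made-up data (column names and sizes are assumptions):

```r
library(dplyr)

# Hypothetical data: ID "a" has 35 rows, ID "b" only 5
df <- data.frame(ID = rep(c("a", "b"), times = c(35, 5)),
                 x  = seq_len(40))

# Keep only IDs whose group has at least 30 rows, then analyse per group
result <- df %>%
  group_by(ID) %>%
  filter(n() >= 30) %>%
  summarise(mean_x = mean(x), rows = n())
```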
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If just one column:
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data per ID, you can use a UDF to perform your elaborate analysis
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))
I have a data frame called mydata with multiple columns, one of which is Benefit, which indicates whether a sample is CB (complete responding), ICB (intermediate), or NCB (non-responding at all).
So basically the Benefit column is a vector with three values:
Benefit <- c("CB" , "ICB" , "NCB")
I want to make a histogram/barplot based on the count of each of those values. So basically it's not a numeric column. I tried solving this with the following code:
hist(as.numeric(metadata$Benefit))
tried also
barplot(metadata$Benefit)
didn't work obviously.
The second thing I want to do is find a relation between the Age column of the same data frame and the Benefit column; for example, do younger patients get more benefit? Is there any way to do that?
THANKS!
Hi and welcome to the site :)
One nice way to find issues with code is to run only one command at a time.
# lets create some data
metadata <- data.frame(Benefit = c("ICB", "CB", "CB", "NCB"))
Now, the command 'as.numeric' does not work on character data:
as.numeric(metadata$Benefit) # returns NA's
Instead, what we want is to count the number of instances of each unique value of the Benefit column. We do this with 'table':
tabledata <- table(metadata$Benefit)
and then barplot is the function we want for creating the plot:
barplot(tabledata)
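For the second part of the question (relating Age to Benefit), a boxplot of Age split by Benefit group is a simple first look. A sketch assuming the data frame also has an Age column (the values here are made up):

```r
# Hypothetical data with both columns
metadata <- data.frame(Benefit = c("CB", "CB", "ICB", "NCB", "NCB"),
                       Age     = c(45, 50, 62, 70, 68))

# One box of Age values per Benefit group
boxplot(Age ~ Benefit, data = metadata)

# Group means as a quick numeric summary
group_means <- tapply(metadata$Age, metadata$Benefit, mean)
```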
This is my very first time posting a question here. If anything I'm asking about is vague or unclear, or I forgot to add extra information for context, feel free to let me know. Thank you.
MY QUESTION:
I just made a data frame with multiple columns. How do I code a new data frame that keeps only the rows where two columns have matching values, and excludes all rows where the values don't match (while keeping any other columns I want from the screenshots)?
SCREENSHOTS OF MY CURRENT DATA FRAME: ONE, TWO (this isn't the entire data frame since the list is huge, just parts of it). Notice how each state has multiple 'counties' under it.
THIS IS AN EXAMPLE OF WHAT I WANT MY FINAL DATA FRAME TO LOOK LIKE. In my new data frame, I want to exclude all rows where Location name does not match State name (so I will get rid of all counties and anything that isn't the State name).
e.g. I want to code a new data frame that keeps rows like California = California, while excluding rows with non-matching values such as California = San Juan County.
I want to code all of this using dplyr.
Thank you!
If I understand your somewhat vague question correctly:
library(dplyr)
df %>% filter(column1 == column2)
Assuming you don't have NAs in your numeric data; if you do, turn them into 0 before executing the code below:
library(dplyr)
new_df = df %>% filter(any_drinking.state == any_drinking.location) %>%
mutate(both_sexes_2012 = any_drinking.females_2012+any_drinking.males_2012,
diff = any_drinking.males_2012-any_drinking.females_2012) %>%
rename(females_2012 = any_drinking.females_2012,males_2012 = any_drinking.males_2012,
state = any_drinking.state, location = any_drinking.location)
I've been preparing my data and somehow I have way less data after merging my data sets.
Since I don't have the longitude and latitude in my data I've been using the following code after I downloaded the package zipcode (tel1 is my data containing zipcodes)
merge <- merge(zipcode,tel1,by.x=c('zip'),by.y=c('zip_code'))
Before merging I had 195956 observations, while after merging it dropped down to 180090, but I don't understand why.
In my opinion I just merged them where zip was equal to zip_code, adding the information from the zipcode dataset to my tel1 data.
Afterwards I wanted to remove the rows that contain NA, since the merge couldn't find values for them. I used this code:
final <- result[complete.cases(result),]
Then my number of observations dropped down to 51006 which I just can't believe. There can't be so many mismatches in my data.
Is there any other code that I should use?
Afterwards I tried to delete the duplicates with the code
last <- with(final, final[order(state, latitude, longitude),])
but the number of observations stayed the same (51,006).
What did I do wrong, or is there a way to get my data into an Excel file after merging so I could manually check whether there really are that many mismatches?
Thanks
You can use the all argument to merge():
merge(zipcode, tel1, by.x='zip', by.y='zip_code', all.y=TRUE)
However, for rows where no match is found in the zipcode data, there will be NAs. So if you then drop the NAs (with complete.cases or something to that effect), you will wind up with the same "data loss".
Check the zip codes for the rows where there are NAs in the latitude and longitude columns after the merge (those columns only exist in the merged result, not in the original tel1):
result <- merge(zipcode, tel1, by.x = 'zip', by.y = 'zip_code', all.y = TRUE)
result[is.na(result$latitude) | is.na(result$longitude),]
My guess is they aren't valid zip codes or the list of zipcodes you have is not complete.
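One way to check this directly is to list the zip codes in tel1 that never appear in the zipcode table, and export them for manual inspection. A sketch with made-up stand-ins for both datasets (real column names may differ):

```r
# Hypothetical stand-ins for the two datasets
zipcode <- data.frame(zip = c("10001", "10002"),
                      latitude  = c(40.75, 40.72),
                      longitude = c(-73.99, -73.98),
                      stringsAsFactors = FALSE)
tel1 <- data.frame(zip_code = c("10001", "10001", "99999"),
                   calls    = c(3, 5, 7),
                   stringsAsFactors = FALSE)

# Zip codes in tel1 that the zipcode table does not contain
bad_zips <- setdiff(tel1$zip_code, zipcode$zip)

# Export the affected rows to CSV, which Excel opens directly
write.csv(tel1[tel1$zip_code %in% bad_zips, ], "unmatched_zips.csv", row.names = FALSE)
```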