I have a dataset called list that looks like this:
df
200000
5666666
This dataset continues up to row 5551.
Another dataset also has 5551 observations. I want to merge the list dataset with this other dataset, but they share no variables; only the row names are the same.
I tried:
merge(list,df,by="rownames")
The error message says that by must be a valid column name.
I also tried merge_all, but that did not work either.
Could someone please help?
It's good practice to be more precise with the naming of your dataframe variables. I wouldn't use list but something like df_description. Either way, merging by rownames can be achieved by using by = "row.names" or by = 0. You can read more on merge() in the documentation (under "Details").
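As a minimal sketch of the by = "row.names" approach, the two small data frames below are hypothetical stand-ins for the ones in the question (the names df_values and df_other and the row labels are assumptions). Note that merge() returns the keys in a column called Row.names, which can be turned back into row names if needed.

```r
# Hypothetical stand-ins for the two data sets in the question
df_values <- data.frame(value = c(200000, 5666666, 31000),
                        row.names = c("r1", "r2", "r3"))
df_other  <- data.frame(group = c("x", "y", "z"),
                        row.names = c("r3", "r1", "r2"),
                        stringsAsFactors = FALSE)

# Join on row names; by = 0 is equivalent to by = "row.names"
merged <- merge(df_values, df_other, by = "row.names")

# merge() stores the keys in a column named Row.names;
# move them back into the row names and drop the column
rownames(merged) <- merged$Row.names
merged$Row.names <- NULL
```

The row order of the two inputs does not matter here, because rows are matched by name rather than by position.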
I have two data frames, Name_List and Fruits_List, and I would like to produce the data frame Output, in which each Name is repeated the number of times given in the Rows_to_repeat column. Please let me know if anyone has a solution.
Name=c("rohit","murali","partha")
Rows_to_repeat=c(6,3,1)
Fruits=c("Apple","Orange","Watermelon","Mango","Banana","Kiwi","Pomo","Dates","Muskmelon","Papaya")
Person_Got_Fruit=c("rohit","rohit","rohit","rohit","rohit","rohit","murali","murali","murali","partha")
Name_List=data.frame(Name,Rows_to_repeat)
Fruits_List=data.frame(Fruits)
Output=data.frame(Fruits,Person_Got_Fruit)
This is the solution that works for me:
# Repeat each name Rows_to_repeat times; naming the column keeps the header readable
Person_Got_Fruit=data.frame(Person_Got_Fruit=rep(Name_List$Name, Name_List$Rows_to_repeat))
Output=cbind(Fruits_List,Person_Got_Fruit)
I'm an R beginner and I'm trying to merge two datasets and having some trouble with losing data. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System and the data looks like this
image of 10 rows of data from this set
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period.
image of 10 rows of the book ordering dataset
I've named this dataset DOA
I'm unsure how to include the data not in an image
(Can also provide the .csvs if needed)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
This dataset is just over 1000 rows. It is also padded because the 000 to 099 Dewey Decimal Classifications were dropping their leading 0s
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets and create a new set called DOA_Call
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head the data the merge seems to be working properly but 10,000 rows do not get the DOA_Call data added. They just stay in their original state. This is about 40% of my total dataset so it is pretty substantial. My first instinct was that it was only putting DDC rows in once but that would mean I would be missing 23,000 rows which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best attempt with the information you provided. You will need:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html)
the stringr library to handle some variables (https://stringr.tidyverse.org/)
and some familiarity with the tidyverse.
Please keep in mind that the best way to ask on Stack Overflow is to provide a minimal reproducible example; otherwise the question will probably be closed.
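One thing worth checking in the original merge() call: when by is not given, merge() joins on every column name the two data frames share, so if both contain Call_Category as well as Call_Category2, rows match only when both columns agree. Also, all.x = TRUE keeps all rows of the first argument, so the book orders should come first if every ordered book must be kept. The sketch below shows this in base R on made-up toy data (the Title and Description columns and all values are assumptions); dplyr's left_join(DOA_data, DDC, by = "Call_Category2") would express the same join.

```r
# Toy stand-ins for the two data sets (values are made up for illustration)
DDC <- data.frame(
  Call_Category = c(4, 510, 813),
  Description   = c("Computer science", "Mathematics", "American fiction"),
  stringsAsFactors = FALSE
)
DOA_data <- data.frame(
  Title         = c("Book A", "Book B", "Book C", "Book D"),
  Call_Category = c("004.6 ABC", "510.2 XYZ", "813.54 FOO", "999.1 BAR"),
  stringsAsFactors = FALSE
)

# Build the three-character key on both sides, as character strings
DDC$Call_Category2      <- sprintf("%03d", as.integer(DDC$Call_Category))
DOA_data$Call_Category2 <- substr(DOA_data$Call_Category, 1, 3)

# Left join: keep every ordered book and name the key explicitly,
# so the shared Call_Category column is not used as a second key
DOA_Call <- merge(DOA_data, DDC[, c("Call_Category2", "Description")],
                  by = "Call_Category2", all.x = TRUE)

# Rows that found no match; inspecting their keys helps spot dirty data
unmatched <- DOA_Call[is.na(DOA_Call$Description), ]
```

Looking at the keys in unmatched (stray spaces, letters, missing leading zeros) is usually the quickest way to tell whether the merge or the data is at fault.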
I want to transpose a data frame that has both numeric and character columns. Some ids are repeated two or more times, and I would like a final data frame with all the data for each id on one line.
I thought about using the data.table or reshape2 libraries (they have similar functions), but I can't find the right combination to do what I want and I'm going crazy. Could someone give me some help?
Here is a modified example of my data:
example_data <-data.frame(cod=c(20,20,20,20,20,20,20,40,80,80,80,80,80,240),
id=c(44,68,137,150,186,236,289,236,44,150,155,236,68,289),
textVar=c('aaaa','aaaa','aaaa bbbb','aaaa','cccc','cccc','cccc bbb','dddd','dddd cccc','dddd','ffff','ffff gggg','ffff','hhhh'),
ww=c(4,4,4,4,4,4,4,45,118,118,118,118,118,118))
For example, for the rows with id=44, my desired output looks like this:
exampleRow <-data.frame(cod_1=c(20),id=c(44),textVar_1=c('aaaa'),ww_1=c(4),cod_2=c(80),id=c(44),textVar_2=c('dddd cccc'),ww_2=c(118))
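One possible way to get this shape, sketched here in base R rather than with data.table or reshape2, is to number the occurrences of each id and then call reshape() with direction = "wide". The helper column n is an assumption introduced for illustration; ids that occur fewer times than the maximum get NA in the extra columns.

```r
example_data <- data.frame(
  cod = c(20, 20, 20, 20, 20, 20, 20, 40, 80, 80, 80, 80, 80, 240),
  id  = c(44, 68, 137, 150, 186, 236, 289, 236, 44, 150, 155, 236, 68, 289),
  textVar = c("aaaa", "aaaa", "aaaa bbbb", "aaaa", "cccc", "cccc", "cccc bbb",
              "dddd", "dddd cccc", "dddd", "ffff", "ffff gggg", "ffff", "hhhh"),
  ww  = c(4, 4, 4, 4, 4, 4, 4, 45, 118, 118, 118, 118, 118, 118),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: 1 for the first row of an id, 2 for the second, ...
example_data$n <- ave(example_data$id, example_data$id, FUN = seq_along)

# One row per id; the other columns become cod_1, textVar_1, ww_1, cod_2, ...
wide <- reshape(example_data, idvar = "id", timevar = "n",
                direction = "wide", sep = "_")

# The row for id 44, matching the layout of exampleRow (with id appearing once)
wide[wide$id == 44, ]
```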
I have two data frames. Applying the same dcast() call to both gives me different output. Both datasets have the same structure but different sizes. The first one has more than 950 rows:
The code I apply is:
trans_matrix_complete <- mod_attrib$transition_matrix
trans_matrix_complete[which(trans_matrix_complete$channel_from=="_3RDLIVE"),]
trans_matrix_complete <- rbind(trans_matrix_complete, df_dummy)
trans_matrix_complete$channel_to <- factor(trans_matrix_complete$channel_to,
levels = c(levels(trans_matrix_complete$channel_to)))
trans_matrix_complete <- dcast(trans_matrix_complete,
channel_from ~ channel_to,value.var = 'transition_probability')
And the trans_matrix_complete output I get is the following:
Something is not working as it should, because with the smaller data frame of just a few lines I get the following outcome:
The differences are:
a) the row numbers are displayed differently. I'm not sure why there is a colon after each row number in the first case
b) trying to assign row names to the data frame with
row.names(trans_matrix_complete) <- trans_matrix_complete$channel_from
does not work for the large data frame: despite the row.names call, the data frame shows up exactly as in the first image, with no names assigned to the rows.
Any idea about this weird behavior?
I resolved this by moving from dcast() to spread() from the tidyr package (part of the tidyverse), using the following call:
trans_matrix_complete<-spread(trans_matrix_complete,
channel_to,transition_probability)
After applying spread() to the two data frames, the output has the same format in both cases and accepts row names without any issue.
So I suspect it is all related to the fact that dcast() and the reshape2 package are no longer actively maintained.
Regards
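For comparison, the same long-to-wide step can be sketched in base R with tapply(), which avoids both reshape2 and tidyr. This is not the approach used in the thread; the toy transition table below is made up, and only the column names channel_from, channel_to and transition_probability are taken from the question.

```r
# Made-up long-format transition table using the question's column names
trans <- data.frame(
  channel_from = c("start", "start", "A", "A", "B"),
  channel_to   = c("A", "B", "B", "conv", "conv"),
  transition_probability = c(0.6, 0.4, 0.5, 0.5, 1.0),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per channel_from, one column per channel_to;
# combinations that never occur are left as NA
m <- with(trans, tapply(transition_probability,
                        list(channel_from, channel_to), sum))

# Plain data frame whose row names are the channel_from values
trans_wide <- as.data.frame(m)
trans_wide["A", "B"]
```

Because the result is an ordinary data frame, its row names are shown when it is printed.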
So I have two columns and need to add a third. The third column needs to contain A for a first specified number of rows and B for a second specified number of rows.
I tried adding this: data_exercise_3["newcolumn"] <- (1:6)
but it didn't work. Can someone tell me what I'm doing wrong please?
Looks like you're having a problem with subsetting a data frame correctly. I'd recommend reviewing this concept before you proceed much further, either via a Coursera course or on a website like this UCLA R learning module on subsetting data frames. Subsetting is a crucial component of data wrangling with R, and you'll go much faster with a solid foundation of the basics!
You can assign values to a subset of a data frame by using [row, column] notation. Since your data frame is called data_exercise_3 and the column you'd like to assign values to is called 'newcolumn', then assuming you want the first 6 rows as 'A' and the next 3 as 'B', you could write it like this:
data_exercise_3[1:6,'newcolumn'] <- 'A'
data_exercise_3[7:9,'newcolumn'] <- 'B'
data_exercise_3$category <- c(rep("A",6),rep("B",6))
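The whole column can also be built in one step with rep() and its times argument. The sketch below assumes a nine-row data frame (six rows of "A", then three of "B"); the columns x and y are hypothetical placeholders for the two existing columns. The original attempt assigned the numbers 1 to 6 rather than letters, and six values do not fit a nine-row column.

```r
# Hypothetical nine-row data frame standing in for the real one
data_exercise_3 <- data.frame(x = 1:9, y = letters[1:9],
                              stringsAsFactors = FALSE)

# "A" for the first six rows, "B" for the next three;
# the counts in times must add up to nrow(data_exercise_3)
data_exercise_3$newcolumn <- rep(c("A", "B"), times = c(6, 3))

# Count how many rows received each label
table(data_exercise_3$newcolumn)
```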