I have a table similar to the following, where there is only one column and the cells contain data that are separated by a comma.
1 height,weight
2 180,85
3 165,62
4 170,73
I want to split them into multiple columns by comma in order to get the following result:
height weight
1 180 85
2 165 62
3 170 73
However, the str_split_fixed(x$type, ",", 2) command that was proposed on a similar topic doesn't seem to work in my case.
Thank you so much in advance for your answers.
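One possible approach, as a minimal base R sketch. It assumes the data frame is called x and the single column is called type (both names taken from the str_split_fixed call quoted above), and that the first cell holds the header text; the values are the ones from the example table.

```r
# Rebuild the example: one column whose cells hold comma-separated text
x <- data.frame(type = c("height,weight", "180,85", "165,62", "170,73"),
                stringsAsFactors = FALSE)

# Treat each cell as one line of CSV text; the first cell becomes the header
out <- read.csv(text = x$type)
out
```

If the header should not come from the first cell, header = FALSE can be passed and the column names set afterwards.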
I am writing a function to prepare a data frame in R to be used later in a regression. I want to rename any column which contains the word distance. Specifically, I want to drop the first descriptive word previous to distance. (So this would include both a word and a period before the start of the word distance).
I have:
country.distance.median country.distance.mean population life.q state.distance.mean
210 189 10000 0.6 100
3100 2100 20000 0.7 300
37 36 500 0.3 10
I would like:
distance.median distance.mean population life.q distance.mean
210 189 10000 0.6 100
3100 2100 20000 0.7 300
37 36 500 0.3 10
Because this will be contained in a function, the number and position of columns is variable, so I need a solution which does not rely on column position. Note that it should not change the column name "life.q", so the solution needs to recognize and select columns based on the distance string. Note that the word in front of distance may change as well (for example, the column 'state.distance.mean').
(It should also have the ability to be used as an if statement within a function.)
Thank you for your time and thoughts. :)
You may try using sub here:
names(df) <- sub("^country\\.(?=distance\\.)", "", names(df), perl=TRUE)
df
distance.median distance.mean population life.q
1 210 189 10000 0.6
2 3100 2100 20000 0.7
3 37 36 500 0.3
More generally, to remove the first word together with the dot that follows it, provided that there is another dot later in the name, you may try:
names(df) <- sub("^[^.]+\\.(?=.*\\.)", "", names(df), perl=TRUE)
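To tie the rename to the word distance and wrap it in a function, something along these lines might work. This is only a sketch: the function name drop_distance_prefix is made up for illustration, and the data frame is rebuilt from the values in the question.

```r
# Hypothetical helper: strip the first word and its dot from any column
# name in which that word is immediately followed by "distance"
drop_distance_prefix <- function(df) {
  is_dist <- grepl("distance", names(df), fixed = TRUE)
  names(df)[is_dist] <- sub("^[^.]+\\.(?=distance)", "",
                            names(df)[is_dist], perl = TRUE)
  df
}

df <- data.frame(country.distance.median = c(210, 3100, 37),
                 country.distance.mean   = c(189, 2100, 36),
                 population              = c(10000, 20000, 500),
                 life.q                  = c(0.6, 0.7, 0.3),
                 state.distance.mean     = c(100, 300, 10))

out <- drop_distance_prefix(df)
names(out)
```

Note that this leaves two columns named distance.mean, exactly as in the desired output; duplicated column names can be awkward in later steps such as a regression formula.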
My question is related to R.
I have five answer choices, each of which is a code snippet. When I run them, every choice except one gives an error, and the one that runs does not match the question.
My question is
A B C D E
1 7 4 23 68 15
2 12 53 14 10 20
3 39 88 98 50 84
4 18 38 33 47 72
5 31 6 51 38 27
6 20 15 68 99 50
This dataframe is given. To create this data frame I write the following code block.
A = c(7,12,39,18,31,20)
B = c(4,53,88,38,6,15)
C = c(23,14,98,33,51,68)
D = c(68,10,50,47,38,99)
E = c(15,20,84,72,27,50)
df_x = data.frame(A,B,C,D,E)
Question: Which of the following R code will subset data frame df_x, returning the final three rows?
My answer choices are
df_x[nrow(df_x)-2:nrow(df_x)]
df_x[(nrow(df_x)-2):nrow(df_x)]
df_x[nrow(df-x)-2:,]
df_x[-3:]
df_x[(nrow(df_x)-2):nrow(df_x)
Among them, only the 1st choice, df_x[nrow(df_x)-2:nrow(df_x)], gives some output.
Output:
D C B A
1 68 23 4 7
2 10 14 53 12
3 50 98 88 39
4 47 33 38 18
5 38 51 6 31
6 99 68 15 20
I think this is not the correct one. All other choices give an error. Can anyone tell me which one is the correct choice, or what the actual code is to answer the question? I am new to R, so it is hard for me to find the correct one.
df_x[(nrow(df_x)-2):nrow(df_x),]
Keep in mind, the convention is df[rows, columns]. You need to specify both arguments, which is why I put a comma after the row argument in the solution.
Cheers,
Joe
The answers in those choices will produce errors because they are not creating the indexes properly.
In R, when you are subsetting a data frame, you need to give the row numbers and the column numbers.
For example, df[row,col] will give you the data that is in the given row and the given column. df[row,] will select all columns for the given row number.
If you don't put a comma (,) in the index, you are only selecting columns. For example, df[1:2] is going to select the first and second columns.
If you want to select multiple rows or multiple columns, you can put ranges in as well, e.g. df[1:3,3:9].
When you use -, R removes the given row or column. So for example, df[-1,] removes the first row. df[,-3] removes the third column. df[-1:-5,] removes the first five rows.
Those answers all have errors in them because they don't have commas in the right places. If you want to select up to the last row or column in R, you need to give the last row or column number. You get this number by using nrow(df) or ncol(df). Leaving the end of the range open, as in -3:, is the Python way of doing things and is a syntax error in R.
The closest answer here is: df_x[(nrow(df_x)-2):nrow(df_x)] but you need to add a comma: df_x[(nrow(df_x)-2):nrow(df_x),]
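The indexing rules above can be tried out on the data frame from the question. This is just an illustrative sketch; the variable names on the left are made up.

```r
df_x <- data.frame(A = c(7, 12, 39, 18, 31, 20),
                   B = c(4, 53, 88, 38, 6, 15),
                   C = c(23, 14, 98, 33, 51, 68),
                   D = c(68, 10, 50, 47, 38, 99),
                   E = c(15, 20, 84, 72, 27, 50))

cols_only  <- df_x[1:2]        # no comma: columns A and B, all rows
block      <- df_x[1:3, 3:5]   # rows 1 to 3, columns C to E
no_first   <- df_x[-1, ]       # negative index drops the first row
last_three <- df_x[(nrow(df_x) - 2):nrow(df_x), ]  # rows 4, 5 and 6
last_three
```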
The problem you are expected to recognize (but have not) is operator precedence. The colon operator (for sequencing) has higher precedence than the binary minus operator, so the expression nrow(df_x)-2:nrow(df_x) is evaluated as nrow(df_x) - (2:nrow(df_x)): the single value nrow(df_x) is recycled and the vector 2:nrow(df_x) is subtracted from it. Option 2, which isolates nrow(df_x)-2 from the colon operator with parentheses, gives the correct index. Adding parentheses to make terms obvious is good programming practice. See:
?Syntax
The other problem is that there is a missing comma after those expressions ... I think your course text should have given option 2 as
df_x[(nrow(df_x)-2):nrow(df_x),]
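A short sketch of the precedence difference, using the df_x from the question (the names without_parens, with_parens and last_three are only for illustration):

```r
df_x <- data.frame(A = c(7, 12, 39, 18, 31, 20),
                   B = c(4, 53, 88, 38, 6, 15),
                   C = c(23, 14, 98, 33, 51, 68),
                   D = c(68, 10, 50, 47, 38, 99),
                   E = c(15, 20, 84, 72, 27, 50))

# Parsed as nrow(df_x) - (2:nrow(df_x)), i.e. 6 - c(2, 3, 4, 5, 6)
without_parens <- nrow(df_x) - 2:nrow(df_x)    # 4 3 2 1 0

# Parentheses force the subtraction to happen first
with_parens <- (nrow(df_x) - 2):nrow(df_x)     # 4 5 6

last_three <- df_x[with_parens, ]
last_three
```

Used without a comma, the first vector selects columns 4, 3, 2, 1 (the 0 is ignored), which is why the first answer choice prints the columns D C B A. For this particular task, tail(df_x, 3) is another common way to get the last three rows.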
I am trying to transform my data-frame from wide format to long format. I have seen many questions already posted here regarding this, but it is not quite what I am looking for / I do not see how to apply it to my problem.
The data-frames share some columns like Name, SharedVal etc. but also have columns the other dataset does not have.
What I want to achieve:
Merge these two dataframes based on the UserId, but per UserID have as many rows as there are MeasureNo.
So if there have been two measurements for a user, there will be two rows with the same user id.
And the rows have the same length, but some columns have different entries/no entry at all.
Example:
Dataset1:
UserID Name MeasureNo SharedVal1 SpecificVal1
1 Anna 1 42 8
2 Alex 1 28 50
and
Dataset2:
UserID Name MeasureNo SharedVal1 DifferentVal1
1 Anna 2 15 99
2 Alex 2 33 45
And they should be merged into:
UserID Name MeasureNo SharedVal1 SpecificVal1 DifferentVal1
1 Anna 1 42 8 -
1 Anna 2 15 - 99
2 Alex 1 28 50 -
2 Alex 2 33 - 45
and so on...
The problem is that the dataset is huge, with a lot of rows and columns, so I thought that somehow merging them on the id and then reshaping would be the most generic approach. But I could not achieve the expected behaviour.
What I am trying to say programmatically is:
"Merge the two dataframes based on the UserID and create as many rows per UserID as there are different times of measurement (MeasureNo). Both rows obviously have the same number of columns, so in both rows some values in certain columns cannot be filled."
Sorry, I am new to SO and this was my best approach to visualizing a table with rows starting on a new line and the Key:Val representing a column inside that row.
You can do outer join:
new_df <- merge(df1, df2, all = TRUE)
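As a sketch of what this does with the example data (df1 and df2 are rebuilt here from Dataset1 and Dataset2 in the question): with no by argument, merge joins on all columns the two data frames share, and all = TRUE keeps unmatched rows from both sides, filling the missing cells with NA.

```r
df1 <- data.frame(UserID = c(1, 2), Name = c("Anna", "Alex"),
                  MeasureNo = c(1, 1), SharedVal1 = c(42, 28),
                  SpecificVal1 = c(8, 50), stringsAsFactors = FALSE)
df2 <- data.frame(UserID = c(1, 2), Name = c("Anna", "Alex"),
                  MeasureNo = c(2, 2), SharedVal1 = c(15, 33),
                  DifferentVal1 = c(99, 45), stringsAsFactors = FALSE)

# Full outer join on the shared columns
# (UserID, Name, MeasureNo, SharedVal1)
new_df <- merge(df1, df2, all = TRUE)

# Order explicitly so the rows appear per user and per measurement
new_df <- new_df[order(new_df$UserID, new_df$MeasureNo), ]
new_df
```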
I'm attempting to merge two dataframes. One dataframe contains rownames which appear as values within a column of another dataframe. I would like to append a single column (Top.Viral.TaxID.Name) from the second dataframe based upon these mutual values, to the first dataframe.
The first dataframe looks like this:
ERR1780367 ERR1780369 ERR2013703 xxx...
374840 73 0 0
417290 56 57 20
1923444 57 20 102
349409 40 0 0
265522 353 401 22
322019 175 231 35
The second dataframe looks like this:
Top.Viral.TaxID Top.Viral.TaxID.Name
1 374840 Enterobacteria phage phiX174 sensu lato
2 417290 Saccharopolyspora erythraea prophage pSE211
3 1923444 Shahe picorna-like virus 14
4 417290 Saccharopolyspora erythraea prophage pSE211
5 981323 Gordonia phage GTE2
6 349409 Pandoravirus dulcis
However, I would also like to preserve the rownames of the first dataframe, so the result would look something like this:
ERR1780367 ERR1780369 ERR2013703 xxx... Top.Viral.TaxID.Name
374840 73 0 0 Enterobacteria phage phiX174 sensu lato
417290 56 57 20 Saccharopolyspora erythraea prophage pSE211
1923444 57 20 102 Shahe picorna-like virus 14
349409 40 0 0 Pandoravirus dulcis
265522 353 401 22 Hyposoter fugitivus ichnovirus
322019 175 231 35 Acanthocystis turfacea Chlorella virus 1
Thanks in advance.
I would strongly recommend against relying on rownames. They are embarrassingly often removed, and the functions in dplyr/tidyr always strip them.
Always make the rownames a part of the data, i.e. use "tidy" data sets as in the example below:
data(iris)
# We mix the data a bit, to check if rownames are conserved
iris = iris[sample.int(nrow(iris), 20),]
head(iris)
description =
data.frame(Species = unique(iris$Species))
description$fullname = paste("The wonderful", description$Species)
description
# .... the above are your data
iris = cbind(row = rownames(iris), iris)
# Now it is easy
merge(iris, description, by="Species")
And please, use reproducible data when asking questions on SO to get fast answers. It is a lot of work to reformat the data you presented into a form that can be tested.
Use sapply to loop through rownames of dataframe 1 (df1) and search the id in the dataframe 2 (df2), returning the description in the same row.
Something like this
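df1$Top.Viral.TaxID.Name <- sapply(rownames(df1), function(id) {
  # keep only the first match so duplicated ids in df2 still give one value;
  # ids that are absent from df2 give NA
  as.character(df2$Top.Viral.TaxID.Name)[df2$Top.Viral.TaxID == id][1]
})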
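A vectorised alternative is match(), which returns the position of the first match for each rowname. The following is only a sketch on a cut-down version of the two data frames from the question (fewer rows and columns, IDs stored as integers):

```r
df1 <- data.frame(ERR1780367 = c(73, 56, 57, 40, 353),
                  ERR1780369 = c(0, 57, 20, 0, 401),
                  row.names = c("374840", "417290", "1923444",
                                "349409", "265522"))
df2 <- data.frame(
  Top.Viral.TaxID = c(374840L, 417290L, 1923444L, 417290L, 981323L, 349409L),
  Top.Viral.TaxID.Name = c("Enterobacteria phage phiX174 sensu lato",
                           "Saccharopolyspora erythraea prophage pSE211",
                           "Shahe picorna-like virus 14",
                           "Saccharopolyspora erythraea prophage pSE211",
                           "Gordonia phage GTE2",
                           "Pandoravirus dulcis"),
  stringsAsFactors = FALSE)

# Position in df2 of the first row whose TaxID equals each rowname of df1
idx <- match(rownames(df1), as.character(df2$Top.Viral.TaxID))

# Rownames of df1 are untouched; ids missing from df2 get NA
df1$Top.Viral.TaxID.Name <- df2$Top.Viral.TaxID.Name[idx]
df1
```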
I used as.data.frame(table(something_to_count)) and got a result like:
Var1 Freq
1 20 2970
2 30 1349
3 40 322
4 50 1009
I just want the $Var1 value, but if I write d[1,]$Var1 or d[1,1], I always get this:
[1] 20
305 Levels: 20 30 40 50 60 70 80 90 100 110 120 130 150 160 170 190 200 ... 4120
And when I try to output the value, it is never 20 but 1, and as.numeric() also returns only 1. How can I get the Var1 value literally as it is, instead of the id of the row? Also, why are the outputs level numbers? What is wrong?
The as.data.frame method for objects of class "table" returns the first column (along with any other "marginal labels" columns) as a factor, and only the last column as the numeric counts. See the help page ?table and look at the Value section. Tyler's recommendation to use the R-FAQ conversion strategy as.numeric(as.character(.)) is "standard R".
This is because the function table turns its argument into a factor (type table into your console and you'll see the line a <- factor(a, exclude=exclude)).
The best solution is just to do what Tyler suggested when transforming the results of table into a data.frame.
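A small sketch of the conversion, using made-up counts in place of the asker's data (the column names are set explicitly to match the output shown in the question):

```r
something_to_count <- c(20, 20, 30, 40, 40, 40, 50)

d <- as.data.frame(table(something_to_count))
names(d) <- c("Var1", "Freq")

class(d$Var1)                  # "factor": the values are stored as labels

# Converting the factor directly returns the internal level code
code <- as.numeric(d$Var1[1])                  # 1

# Going through character first recovers the original value
value <- as.numeric(as.character(d$Var1[1]))   # 20

# Convert the whole column once if numeric values are needed throughout
d$Var1 <- as.numeric(as.character(d$Var1))
```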