I used as.data.frame(table(something_to_count)) and got a result like:
Var1 Freq
1 20 2970
2 30 1349
3 40 322
4 50 1009
I just want the $Var1 value, but if I write d[1,]$Var1 or d[1,1], I always get these things:
[1] 20
305 Levels: 20 30 40 50 60 70 80 90 100 110 120 130 150 160 170 190 200 ... 4120
And when I try to output the value, it is never 20, but always 1. as.numeric() also returns only 1. How can I get the Var1 value literally as it is, instead of the id of the row? Also, why does the output show the levels? What is wrong?
The as.data.frame method for objects of class "table" returns the first column (along with any other "marginal labels" columns) as a factor, and only the last column as the numeric counts. See the help page ?table and look at the Value section. Tyler's recommendation to use the R-FAQ recommended as.numeric(as.character(.)) conversion strategy is "standard R".
This is because the function table turns its argument into a factor (type table into your console and you'll see the line a <- factor(a, exclude=exclude)).
The best solution is just to do what Tyler suggested: convert the factor column with as.numeric(as.character(.)) after turning the result of table into a data.frame.
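A minimal sketch of that conversion, using a small made-up vector rather than the asker's data. Note that the first column is named x here only because the table was built from a variable called x; in the question it is Var1.

```r
# Made-up values standing in for something_to_count
x <- c(20, 30, 30, 40, 20, 20)
d <- as.data.frame(table(x))

# The label column is a factor, so as.numeric() returns the level codes
is.factor(d$x)
codes <- as.numeric(d$x)

# Going through character first recovers the original values
values <- as.numeric(as.character(d$x))

# Alternative: ask for character labels up front, then convert
d_chr <- as.data.frame(table(x), stringsAsFactors = FALSE)
values_chr <- as.numeric(d_chr$x)
```

Here codes holds the positions of the levels (1, 2, 3), while values and values_chr hold the numbers that were actually counted.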
I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
1 2 3 4 5 6 7 8 9 10 11 12
85 208 43 171 30 40 95 310 133 29 77 36
Which is technically correct, however - in SPSS, the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!
The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that returns the frequency table for a factor as a tidy data frame:
forcats::fct_count(data$region)
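To show what the labels-to-factor conversion does under the hood, here is a rough base R sketch. It does not read an SPSS file; it fakes a labelled vector by attaching a "labels" attribute by hand, and the region names are hypothetical.

```r
# Hypothetical stand-in for a labelled SPSS variable:
# numeric codes plus a named "labels" attribute
region <- c(1, 2, 2, 3, 1, 2)
attr(region, "labels") <- c(North = 1, South = 2, West = 3)

# Use the label values as factor levels and the label names as level labels
labs <- attr(region, "labels")
region_f <- factor(region, levels = unname(labs), labels = names(labs))

# The table header now shows the names instead of the codes
counts <- table(region_f)
counts
```

haven itself also provides as_factor() for this purpose, so the manual version is mainly useful for understanding what happens.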
My question is related to R.
I have a quiz question with five answer choices, each a code snippet. When I run them, every choice except one gives an error, and the one that runs does not match what the question asks for.
The question starts from this data frame:
A B C D E
1 7 4 23 68 15
2 12 53 14 10 20
3 39 88 98 50 84
4 18 38 33 47 72
5 31 6 51 38 27
6 20 15 68 99 50
This data frame is given. I wrote the following code to create it.
A = c(7,12,39,18,31,20)
B = c(4,53,88,38,6,15)
C = c(23,14,98,33,51,68)
D = c(68,10,50,47,38,99)
E = c(15,20,84,72,27,50)
df_x = data.frame(A,B,C,D,E)
Question: Which of the following R code snippets will subset data frame df_x, returning the final three rows?
The answer choices are
df_x[nrow(df_x)-2:nrow(df_x)]
df_x[(nrow(df_x)-2):nrow(df_x)]
df_x[nrow(df-x)-2:,]
df_x[-3:]
df_x[(nrow(df_x)-2):nrow(df_x)
Among them, only the 1st choice, df_x[nrow(df_x)-2:nrow(df_x)], gives some output.
Output:
D C B A
1 68 23 4 7
2 10 14 53 12
3 50 98 88 39
4 47 33 38 18
5 38 51 6 31
6 99 68 15 20
I think this is not the correct one. All the other choices give errors. Can anyone tell me which one is the correct choice, or what the right code for this question is? I am new to R, so it is hard for me to find the correct one.
df_x[(nrow(df_x)-2):nrow(df_x),]
Keep in mind that the convention is df[rows, columns]. To select rows you need to specify both arguments, which is why I put a comma after the row argument in the solution.
Cheers,
Joe
Those answer choices produce errors because they do not build the indexes properly.
In R, when you are subsetting a data frame, you need to give the row numbers and the column numbers.
For example, df[row,col] will give you the data in the given row and the given column. df[row,] will select all columns for the given row number.
If you don't put a comma (,) in the index, you are only selecting columns. For example, df[1:2] is going to select the first and second columns.
If you want to select multiple rows or multiple columns, you can put ranges in as well, e.g. df[1:3,3:9].
When you use -, R removes the given row or column. So for example, df[-1,] removes the first row. df[,-3] removes the third column. df[-1:-5,] removes the first five rows.
Those answers all have errors in them because they don't have commas in the right places. If you want to select up to the last row or column in R, you need to give the last row or column number. You get this number by using nrow(df) or ncol(df). An open-ended slice such as -3: is the Python way of doing things; in R it is a syntax error.
The closest answer here is: df_x[(nrow(df_x)-2):nrow(df_x)] but you need to add a comma: df_x[(nrow(df_x)-2):nrow(df_x),]
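The indexing rules above can be tried out on the data frame from the question. This is just a small sketch; tail() is added as a base R shortcut that was not among the quiz choices.

```r
# Data frame from the question
df_x <- data.frame(
  A = c(7, 12, 39, 18, 31, 20),
  B = c(4, 53, 88, 38, 6, 15),
  C = c(23, 14, 98, 33, 51, 68),
  D = c(68, 10, 50, 47, 38, 99),
  E = c(15, 20, 84, 72, 27, 50)
)

n <- nrow(df_x)

last3    <- df_x[(n - 2):n, ]  # rows 4 to 6, all columns (note the comma)
cols12   <- df_x[1:2]          # no comma: columns A and B
no_first <- df_x[-1, ]         # negative index drops the first row

# Shortcut that avoids the index arithmetic altogether
last3_tail <- tail(df_x, 3)
```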
The problem you are expected to recognize (but have not) is operator precedence. The colon operator (for sequencing) has higher precedence than the binary minus operator, so the expression nrow(df_x)-2:nrow(df_x) is evaluated as nrow(df_x) - (2:nrow(df_x)): the single value nrow(df_x) is recycled and the vector 2:nrow(df_x) is subtracted from it element by element. Option number 2, which isolates nrow(df_x)-2 from the colon operator with parentheses, gives you the correct index. Adding parentheses to make terms obvious is good programming practice. See:
?Syntax
The other problem is that there is a missing comma after those expressions ... I think your course text should have given option 2 as
df_x[(nrow(df_x)-2):nrow(df_x),]
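A short sketch of the precedence issue, which also explains the D C B A output in the question: without parentheses the index is a descending vector ending in 0, and without a comma it selects columns.

```r
# Data frame from the question
df_x <- data.frame(
  A = c(7, 12, 39, 18, 31, 20),
  B = c(4, 53, 88, 38, 6, 15),
  C = c(23, 14, 98, 33, 51, 68),
  D = c(68, 10, 50, 47, 38, 99),
  E = c(15, 20, 84, 72, 27, 50)
)
n <- nrow(df_x)  # 6

wrong_idx <- n - 2:n    # parsed as n - (2:n), i.e. 6 - c(2, 3, 4, 5, 6)
right_idx <- (n - 2):n  # parentheses force the subtraction first

# Choice 1: columns 4, 3, 2, 1 (the 0 is silently dropped)
choice1 <- df_x[wrong_idx]

# Corrected choice 2: the last three rows
last3 <- df_x[right_idx, ]
```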
I'm attempting to merge two dataframes. One dataframe contains rownames which appear as values within a column of another dataframe. I would like to append a single column (Top.Viral.TaxID.Name) from the second dataframe based upon these mutual values, to the first dataframe.
The first dataframe looks like this:
ERR1780367 ERR1780369 ERR2013703 xxx...
374840 73 0 0
417290 56 57 20
1923444 57 20 102
349409 40 0 0
265522 353 401 22
322019 175 231 35
The second dataframe looks like this:
Top.Viral.TaxID Top.Viral.TaxID.Name
1 374840 Enterobacteria phage phiX174 sensu lato
2 417290 Saccharopolyspora erythraea prophage pSE211
3 1923444 Shahe picorna-like virus 14
4 417290 Saccharopolyspora erythraea prophage pSE211
5 981323 Gordonia phage GTE2
6 349409 Pandoravirus dulcis
However, I would also like to preserve the rownames of the first dataframe, so the result would look something like this:
ERR1780367 ERR1780369 ERR2013703 xxx... Top.Viral.TaxID.Name
374840 73 0 0 Enterobacteria phage phiX174 sensu lato
417290 56 57 20 Saccharopolyspora erythraea prophage pSE211
1923444 57 20 102 Shahe picorna-like virus 14
349409 40 0 0 Pandoravirus dulcis
265522 353 401 22 Hyposoter fugitivus ichnovirus
322019 175 231 35 Acanthocystis turfacea Chlorella virus 1
Thanks in advance.
I would strongly recommend against relying on rownames. They are embarrassingly often removed, and the functions in dplyr/tidyr routinely strip them.
Always make the rownames part of the data, i.e. use "tidy" data sets, as in the example below.
data(iris)
# We mix the data a bit, to check if rownames are conserved
iris = iris[sample.int(nrow(iris), 20),]
head(iris)
description =
data.frame(Species = unique(iris$Species))
description$fullname = paste("The wonderful", description$Species)
description
# .... the above are your data
iris = cbind(row = rownames(iris), iris)
# Now it is easy
merge(iris, description, by="Species")
And please, use reproducible data when asking questions on SO to get fast answers. It is a lot of work to reformat the data you presented into a form that can be tested.
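Use sapply to loop through the rownames of dataframe 1 (df1) and look up each id in dataframe 2 (df2), returning the name from the matching row. Because an id can appear more than once in df2 (417290 does in your example) or not at all, take the first match so that every row gets exactly one value.
Something like this
df1$Top.Viral.TaxID.Name <- sapply(rownames(df1), function(id) {
  # [1] keeps the first match and yields NA when the id is missing from df2
  as.character(df2$Top.Viral.TaxID.Name[df2$Top.Viral.TaxID == id])[1]
})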
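The same lookup can also be written without an explicit loop using match(), which returns the position of the first matching id (or NA). The sketch below rebuilds a cut-down version of the two data frames from the question; only two of the count columns are included.

```r
# Cut-down copies of the data frames from the question
df1 <- data.frame(
  ERR1780367 = c(73, 56, 57, 40),
  ERR1780369 = c(0, 57, 20, 0),
  row.names  = c("374840", "417290", "1923444", "349409")
)
df2 <- data.frame(
  Top.Viral.TaxID = c(374840, 417290, 1923444, 417290, 981323, 349409),
  Top.Viral.TaxID.Name = c(
    "Enterobacteria phage phiX174 sensu lato",
    "Saccharopolyspora erythraea prophage pSE211",
    "Shahe picorna-like virus 14",
    "Saccharopolyspora erythraea prophage pSE211",
    "Gordonia phage GTE2",
    "Pandoravirus dulcis"
  ),
  stringsAsFactors = FALSE
)

# Position in df2 of the first row matching each rowname of df1
idx <- match(rownames(df1), as.character(df2$Top.Viral.TaxID))

# Append the name column; rownames of df1 are untouched
df1$Top.Viral.TaxID.Name <- df2$Top.Viral.TaxID.Name[idx]
```

Unlike merge(), this keeps df1 in its original row order and leaves its rownames alone.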
This is a very large dataset and I'm trying to get away from writing for loops in R. Looking for a way to attack what I would usually use a nested loop to do.
For each unique value in the confidence col., I need to extract the row indices of all rows in the confidence col. that match that value. For example, the first value (50) would return 1, 7, 9. Then, using those indices, I want to average the values in the seqs column. Here, the first value (50) would return 1980, 7357, and 3008 and then average these. The intended output would be a data frame with 2 columns: one with the unique values of confidence and one with the corresponding average # seqs for each unique confidence value.
input
#seqs confidence
1980 50
1088 52
1099 52
2000 42
7009 45
1092 48
7357 50
5909 42
3008 50
output
ave.#seqs confidence
4115 50
1093.5 52
3954.5 42...
Given that it's a "very large dataset", I suggest a data.table solution.
library(data.table)
setDT(data)[, mean(seqs), by=confidence]
confidence V1
1: 50 4115.0
2: 52 1093.5
3: 42 3954.5
4: 45 7009.0
5: 48 1092.0
Solutions using dplyr functions or aggregate would also work, but they are typically slower on very large data.
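For completeness, a base R sketch of the same grouping on the sample rows from the question, with no extra packages. It also shows the row indices per group that the question describes, although they are not needed to get the means.

```r
# Sample rows from the question
data <- data.frame(
  seqs       = c(1980, 1088, 1099, 2000, 7009, 1092, 7357, 5909, 3008),
  confidence = c(50, 52, 52, 42, 45, 48, 50, 42, 50)
)

# Row indices for each unique confidence value
idx <- split(seq_len(nrow(data)), data$confidence)

# Mean of seqs within each confidence group, as a two-column data frame
res <- aggregate(seqs ~ confidence, data = data, FUN = mean)
res
```

aggregate() returns the groups sorted by confidence, whereas the data.table result keeps them in order of first appearance.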
I have a data set that I am working on. I need to run a repeated measures ANOVA on it, but first I have to reshape the data to long format. I followed an example from a website; the reshaping doesn't give any error, but I don't think it works, and the ANOVA then fails. Here is my code and the error.
# reshaping to long format
id=1:length(veri$SIRA)
k.1 <- veri$KOLEST
k.2 <- veri$KOLEST2
k.3 <- veri$KOLEST3
veri2 <- data.frame(id,k.1,k.2,k.3)
longformat <- reshape(veri2,direction="long", varying=list("k.1","k.2","k.3"), idvar="id")
This is output for longformat
id time k.1 k.2 k.3
1 1 1 209 195 181
2 2 1 243 184 172
3 3 1 192 178 162
4 4 1 210 112 93
5 5 1 190 188 172
6 6 1 232 169 156
Time is 1 all along. This seems a little odd to me. I thought it should be 1-2-3, according to the 3 different measures.
And this is error when I run the test:
repmesao <- aov(k~time+Error(id/time), data=longformat)
Error in model.frame.default(formula = k ~ id/time, data = longformat, :
invalid type (list) for variable 'k'
How can I fix this problem? Any suggestions?
For reshaping, use library(tidyr) and a command like so, where the last argument is the range of columns to stack (gather has since been superseded by pivot_longer, but it still works):
data_long <- gather(veri2, key = "time", value = "k", k.1:k.3)
If you have more than one dv, then this procedure is not good. What I usually do is divide the data by dv, like data_dv1 <- data[1:3] and data_dv2 <- cbind(data[1:2], data[4]). I reshape each as shown above, then I just cbind(data_dv1_long, data_dv2_long), minding that not all columns should be combined: you will have, for instance, your subject id in both data frames, so choose the columns for cbind accordingly.
Also, I don't know what you are going to use for the ANOVA, but I recommend library(ez) with a command like so
ezANOVA(data=data, dv=.(dv), wid=.(subject_id), within=.(group1), between=.(group2), detail=TRUE)
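If you would rather stay with base R, the original reshape call can probably be repaired as sketched below. The likely cause of the constant time column is that varying=list("k.1","k.2","k.3") describes three separate variables with one time point each; putting the three names in a single vector (one list element) describes one variable measured three times. The six rows are copied from the printed output in the question, and the column name k is my choice to match the aov formula.

```r
# First six subjects, copied from the output shown in the question
veri2 <- data.frame(
  id  = 1:6,
  k.1 = c(209, 243, 192, 210, 190, 232),
  k.2 = c(195, 184, 178, 112, 188, 169),
  k.3 = c(181, 172, 162, 93, 172, 156)
)

# One measured variable (k) observed at three times
longformat <- reshape(
  veri2,
  direction = "long",
  varying   = list(c("k.1", "k.2", "k.3")),
  v.names   = "k",
  timevar   = "time",
  times     = 1:3,
  idvar     = "id"
)

# aov needs subject and time as factors, not numbers
longformat$id   <- factor(longformat$id)
longformat$time <- factor(longformat$time)

repmesao <- aov(k ~ time + Error(id/time), data = longformat)
summary(repmesao)
```

With this layout longformat has one row per subject and time point, and a single numeric column k for the model to use.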