merge two files by one column - unix

I have two files
$ wc -l new_bacteria.txt
28633861 new_bacteria.txt
$ wc -l allin1_trinity_bacteria_blastx.tsv
4352 allin1_trinity_bacteria_blastx.tsv
$ head new_bacteria.txt
gi|406035365|ref|ZP_11042729.1| Acinetobacter parvus
gi|406035366|ref|ZP_11042730.1| Acinetobacter parvus
gi|406035367|ref|ZP_11042731.1| Acinetobacter parvus
gi|406035368|ref|ZP_11042732.1| Acinetobacter parvus
gi|406035369|ref|ZP_11042733.1| Acinetobacter parvus
gi|406035370|ref|ZP_11042734.1| Acinetobacter parvus
gi|406035371|ref|ZP_11042735.1| Acinetobacter parvus
gi|406035372|ref|ZP_11042736.1| Acinetobacter parvus
gi|406035373|ref|ZP_11042737.1| Acinetobacter parvus
gi|406035374|ref|ZP_11042738.1| Acinetobacter parvus
$ head allin1_trinity_bacteria_blastx.tsv
c91_g1_i1 gi|46447089|ref|YP_008454.1| 39.60 101 59 1 306 4 1676 1774 6e-11 68.2
c146_g1_i1 gi|357399595|ref|YP_004911520.1| 39.53 86 47 2 246 4 49 134 5e-06 52.0
c202_g1_i1 gi|508605652|ref|YP_006991274.2| 62.16 37 14 0 154 44 49 85 3e-06 45.4
c202_g1_i1 gi|508605652|ref|YP_006991274.2| 63.16 19 7 0 201 145 33 51 3e-06 27.7
c202_g1_i1 gi|508605652|ref|YP_006991274.2| 76.92 13 3 0 242 204 20 32 3e-06 21.6
c224_g1_i1 gi|395217261|ref|ZP_10401556.1| 72.62 84 23 0 260 9 274 357 6e-38 144
c230_g1_i1 gi|261381445|ref|ZP_05986018.1| 57.50 40 17 0 248 129 57 96 2e-09 45.8
c230_g1_i1 gi|261381445|ref|ZP_05986018.1| 50.00 42 19 1 120 1 101 142 2e-09 41.2
c294_g1_i1 gi|298242911|ref|ZP_06966718.1| 37.33 75 46 1 14 238 814 887 3e-07 56.2
c304_g1_i1 gi|296393792|ref|YP_003658676.1| 42.86 56 32 0 56 223 17 72 6e-06 51.2
I want to merge these two files on the second column of allin1_trinity_bacteria_blastx.tsv. The output should have the same number of lines as this tsv file, since the other file is really big.
This is an easy job in R, but my annotation file (new_bacteria.txt) is really big, so I am thinking about using unix merge. How can I make the output contain only the columns I want from the tsv file, and not all the lines in the new_bacteria.txt file?
Thank you!

I am thinking about using unix merge. How can I make the output
contain only the columns I want from the tsv file, and not all the
lines in the new_bacteria.txt file?
There is indeed a program named merge, but despite sharing its name with R's merge() function, it does something else (it combines separate sets of changes to an original file), so it is not what you need. Use join instead. Note that both files must be sorted on the join field. The example script sorts both files before joining. If new_bacteria.txt is already sorted, you can use it directly instead of sorted.txt, and if you want to run multiple joins on allin1_trinity_bacteria_blastx.tsv, it may be worth sorting it only once and reusing sorted.tsv. By default join prints only lines whose join field appears in both files, so the unmatched lines of new_bacteria.txt do not end up in the output.
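# join expects both inputs sorted on the join field only, with the same collation
sort -k 2b,2 allin1_trinity_bacteria_blastx.tsv >sorted.tsv
sort -k 1b,1 new_bacteria.txt >sorted.txt
# match field 2 of the first file against field 1 (the default) of the second
join -1 2 sorted.tsv sorted.txt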

Related

Viewing dataset in RStudio shows different number of observations compared to R commands

I am currently studying data science with R. To practice, I am using the Auto data of the ISLR package. However, I am encountering a confusing situation when viewing the data. When I view the dataset Auto.df in RStudio, I get the following:
However, when I use dim(Auto.df), I get the following:
> dim(Auto.df)
[1] 392 9
And when I use nrow(Auto.df), I get the following:
> nrow(Auto.df)
[1] 392
And when I use str(Auto.df), I get the following:
> str(Auto.df)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
And I have the following in my RStudio "Global Environment" tab:
So why does viewing the dataset in RStudio show 397 rows (observations), whilst everything else says that there are 392 observations?
There are 392 observations in the data. What you are seeing in the viewer are the row names of the data. Row names can be set to anything, and they do not represent the row number in the data.
If you check the row names of the Auto dataset, you'll see that they are not sequential: some of them jump by 2. For example, 32 is followed by 34 rather than 33, and 126 is followed by 128. I don't know why the data is like that, but it is why the last row name is 397.
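A minimal sketch of the same effect, using a small made-up data frame rather than Auto itself: dropping rows leaves the original row names in place, so the last row name can be larger than the number of rows.

```r
# Made-up stand-in for Auto (not the real data): ten rows, then drop two
df <- data.frame(x = 1:10)
df <- df[-c(3, 7), , drop = FALSE]

n_rows    <- nrow(df)                # number of observations actually present
last_name <- tail(rownames(df), 1)   # row names are labels kept from before the drop

# Resetting the row names renumbers them sequentially from 1
df_renumbered <- df
rownames(df_renumbered) <- NULL
```

Here n_rows is 8 while last_name is "10"; after the reset, the row names of df_renumbered run from 1 to 8.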

How to define range of values of a time series?

First of all, sorry for any mistakes regarding my post, I'm new to this site.
I'm getting started with R now and I'm trying to do some analysis with time series data.
So, I have a time series at hand and have already loaded it into R.
I can also plot this time series and add labels to the axes and so on. So far so good.
My problem: when I plot the time series, R sets the range of values on the y-axis to approximately [0, 170].
This is strange, since the time series contains the daily EUR/USD exchange rates for this year. That means the values are in a range of about 1.05 to 1.2.
The relative values are correct.
If the plot shows a maximum around day 40, the corresponding value in the data set appears to be a maximum.
But it is around 1.4 and not 170.
I hope one can understand my problem.
I would like to have the y-axis on a scale from 1 to 1.2 for example.
The ylim=c(1, 1.2) command will scale the axis to that range but not the values.
It just ignores them.
Does anyone know how to adjust that?
I'd really appreciate it.
Thank you very much in advance.
Thanks a lot for the input so far.
The "critical code" is the following:
> FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> str(FRB)
'data.frame': 212 obs. of 2 variables:
$ Date: Factor w/ 212 levels "2015-01-01","2015-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Rate: Factor w/ 180 levels "1.0524","1.0575",..: 180 179 177 178 174 173 175 176 171 172 ...
> plot.ts(Rate)
The result of this last plot is the one shown above.
Changing the variable to numeric yields this:
> as.numeric(Rate)
[1] 180 179 177 178 174 173 175 176 171 172 170 166 180 167 169 160 123 128 150 140 132 128 138 165
[25] 161 163 136 134 134 129 159 158 180 156 140 155 151 142 131 148 104 100 96 104 65 53 27 24
[49] 13 3 8 1 2 7 10 9 21 42 36 50 39 33 23 15 19 29 51 54 26 23 11 6
[73] 4 12 5 16 20 18 17 14 22 30 34 49 92 89 98 83 92 141 125 110 81 109 151 149
[97] 162 143 85 69 77 61 180 30 32 38 52 37 78 127 120 73 105 126 131 106 122 119 107 112
[121] 157 137 152 96 93 99 87 94 86 70 71 180 67 43 66 58 84 57 55 47 35 25 26 41
[145] 31 48 48 75 63 59 38 60 46 44 28 40 45 52 62 101 82 74 68 60 64 102 144 168
[169] 159 154 108 91 98 118 111 72 76 180 95 90 117 139 131 116 130 133 145 103 79 88 115 97
[193] 106 113 89 102 121 102 119 114 124 148 180 153 164 161 147 135 146 141 80 56
So, it remains unchanged. This is very strange. The data excerpt shows that "Rate" takes on values between 1.1 and 1.5 approximately, so really not the values that are shown above. :/
The data set can be found under this link:
https://www.dropbox.com/s/ndxstdl1aae5glt/FRB_H10.csv?dl=0
It should be alright. I got it from the data base from the Federal Reserve System, so quite a decent source.
(I had to remove the link to the data excerpt because my reputation only allows 2 links per post. But the entire data set should be even better, I guess.)
@BlankUsername
Thanks very much for the link. I got it working now using this code:
FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> as.numeric(paste(Rate))
[1] NA 1.2015 1.1918 1.1936 1.1820 1.1811 1.1830 1.1832 1.1779 1.1806 1.1598 1.1517 NA
[14] 1.1559 1.1584 1.1414 1.1279 1.1290 1.1370 1.1342 1.1308 1.1290 1.1337 1.1462 1.1418 1.1432
[27] 1.1330 1.1316 1.1316 1.1300 1.1410 1.1408 NA 1.1395 1.1342 1.1392 1.1372 1.1346 1.1307
[40] 1.1363 1.1212 1.1197 1.1190 1.1212 1.1070 1.1006 1.0855 1.0846 1.0707 1.0576 1.0615 1.0524
[53] 1.0575 1.0605 1.0643 1.0621 1.0792 1.0928 1.0908 1.0986 1.0919 1.0891 1.0818 1.0741 1.0768
[66] 1.0874 1.0990 1.1008 1.0850 1.0818 1.0671 1.0598 1.0582 1.0672 1.0596 1.0742 1.0780 1.0763
[79] 1.0758 1.0729 1.0803 1.0876 1.0892 1.0979 1.1174 1.1162 1.1194 1.1145 1.1174 1.1345 1.1283
[92] 1.1241 1.1142 1.1240 1.1372 1.1368 1.1428 1.1354 1.1151 1.1079 1.1126 1.1033 NA 1.0876
[105] 1.0888 1.0914 1.0994 1.0913 1.1130 1.1285 1.1271 1.1108 1.1232 1.1284 1.1307 1.1236 1.1278
[118] 1.1266 1.1238 1.1244 1.1404 1.1335 1.1378 1.1190 1.1178 1.1196 1.1156 1.1180 1.1154 1.1084
[131] 1.1090 NA 1.1076 1.0952 1.1072 1.1025 1.1150 1.1020 1.1015 1.0965 1.0898 1.0848 1.0850
[144] 1.0927 1.0884 1.0976 1.0976 1.1112 1.1055 1.1026 1.0914 1.1028 1.0962 1.0953 1.0868 1.0922
[157] 1.0958 1.0994 1.1042 1.1198 1.1144 1.1110 1.1078 1.1028 1.1061 1.1200 1.1356 1.1580 1.1410
[170] 1.1390 1.1239 1.1172 1.1194 1.1263 1.1242 1.1104 1.1117 NA 1.1182 1.1165 1.1262 1.1338
[183] 1.1307 1.1260 1.1304 1.1312 1.1358 1.1204 1.1133 1.1160 1.1252 1.1192 1.1236 1.1246 1.1162
[196] 1.1200 1.1276 1.1200 1.1266 1.1249 1.1282 1.1363 NA 1.1382 1.1437 1.1418 1.1360 1.1320
[209] 1.1359 1.1345 1.1140 1.1016
Warning message:
NAs introduced by coercion
> Rate <- cbind(paste(Rate))
> plot(Rate)
Warning message:
In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
> plot.ts(Rate, ylab="EUR/USD")
Despite the warning message, I get the following output (shown below), which is how I intended to plot it.
Nevertheless, I do not really understand why it works the way it does: why I have to use the paste() command, and what exactly it does. I get the basic idea of what the classes do, but I am very new to this whole world of R.
One thing I came to realize already is that R is such a powerful program. And yet confusing if you are a beginner. :D
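A minimal sketch of why the detour through paste() helps, assuming Rate was read in as a factor (as the str() output above shows): as.numeric() on a factor returns the internal level codes, not the numbers printed as labels, so the factor has to be turned into text first. The values below are a made-up miniature, and "ND" is only a hypothetical stand-in for whatever non-numeric entries produce the NAs in the real file.

```r
# Made-up miniature of the Rate column; "ND" is a hypothetical non-numeric entry
rate <- factor(c("1.2015", "1.1918", "ND", "1.0524"))

codes <- as.numeric(rate)   # level codes: positions in levels(rate)

# Convert the labels, not the codes; "ND" cannot be parsed and becomes NA
values <- suppressWarnings(as.numeric(as.character(rate)))

# Equivalent idiom that converts each level only once
values2 <- suppressWarnings(as.numeric(levels(rate))[rate])
```

paste(Rate) has the same effect as as.character(Rate) here. Reading the file with read.csv(..., stringsAsFactors = FALSE) would avoid the factor in the first place.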

How to process multi columns data in data.frame with plyr

I am trying to process DSC (differential scanning calorimetry) data with R, but I have run into some trouble. In my lab all of this used to be done tediously in Origin or QtiPlot, and I wonder whether it can be done in batch instead. So far it has not gone well. For example, maybe I have used the wrong column names for my data.frame, because the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
cannot reach my data.
Below is the full description of what I want to do. Thank you in advance!
The DSC data looks like this (the CSV file is stored on my Google Drive):
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
In fact I have merged the data into one data.frame and hope I can adjust it and process it further.
The commands are:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
So dat is a data.frame with 1163 obs. of 24 variables.
T1, T2, T3, ..., T12 are the temperatures at which the samples were measured by DSC. They cover the same interval but differ a little because the machine is not perfectly stable.
The column next to each of T1~T12 is the heat flow recorded by the machine for a different heat treatment duration; ascast1~ascast5 are samples with no treatment, used to check the accuracy of the machine.
Now I need to do the following:
T1~T12 are in degrees Celsius, and I need to convert them to kelvin, which means adding 273.15 to every value.
Two temperatures are chosen to compare the results: Ts=180.25 and Te=240.45 (both in degrees Celsius; I checked them in QtiPlot to make sure). To be clear, I list the rows at these two temperatures for the first few columns.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
The heat flow at Ts should be the same for all curves, and it can be set to 0 for convenience. So for every heat flow column (0.5min, 1min, 2min, 4min, 8min, 10min, 20min and ascast1~ascast5), the heat flow value at Ts should be subtracted from all values in that column.
The heat flow at Te should then be adjusted so that all heat flow curves have the same value at Te. The procedure is as follows: (1) calculate the mean of the 12 heat flow values at Te; call it Hmean, the value that every heat flow curve should have at Te. (2) For the data in column 0.5min, which I denote col("0.5min"), the linear transformation is:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, the subtraction [0.05700000-(-0.01710000)] is done in step 2, but I write it out for your reference. This formula is applied to each pair of temperature and heat flow columns, i.e. (T1, 0.5min), (T2, 1min), (T3, 2min), ..., 12 pairs in total.
Now we can plot the 12 pairs of data on the same plot over the interval 180~240 (also in degrees Celsius) to magnify the differences between the different DSC scans.
I have been stuck on this problem for 2 days, so I am turning to Stack Overflow for help.
Thanks!
I am assuming that your actual question is the one at the beginning, where you got the following error:
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
I could not find a question in the rest of the steps; they just read like a step-by-step description of an experiment.
The problem is that the column name starts with a digit, so it is not a syntactically valid R name. To reference such a column with $, wrap the name in backticks (`):
> dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = FALSE)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
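As a small sketch of the options for such a name (the two rows of values are copied from the question's table; the object names are only illustrative): backticks work with $, [[ ]] and [ , ] take the name as an ordinary string, and make.names() shows what R would turn the name into if it were made syntactically valid.

```r
# check.names = FALSE keeps the non-syntactic name "0.5min" unchanged
dat <- data.frame("0.5min" = c(-0.2904, -0.2810), T1 = c(40.59, 40.81),
                  check.names = FALSE)

a <- dat$`0.5min`       # backticks make $ accept the name
b <- dat[["0.5min"]]    # [[ ]] takes the name as an ordinary string
c_ <- dat[, "0.5min"]   # so does matrix-style indexing

# Alternative: see which syntactically valid names R would generate
valid <- make.names(names(dat))
```

All three accessors return the same vector, and valid is c("X0.5min", "T1"), which is also what data.frame() would have produced with its default check.names = TRUE.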
Based on the additional information in the comments:
You can add a constant to alternate columns in the following manner:
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize: the columns of a data frame can be referenced with a vector of column numbers or column names, and most operations in R are vectorized. You can select columns by name or by number, depending on the pattern you are looking for.
For example, if I rename my first two columns and want to access just those, I do this:
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
# Select all columns whose name contains "Y". You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20
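The two baseline steps described in the question could be sketched along the same lines. This is only one reading of that description, on a made-up table with two temperature/heat-flow pairs and three rows; the helper flow_at and the choice of the row nearest to Ts or Te are assumptions, not something stated in the question.

```r
# Made-up two-pair miniature of the DSC table (temperatures in Celsius)
dat <- data.frame(
  T1 = c(180.25, 200.00, 240.45), `0.5min` = c(-0.0171, 0.0100, 0.0570),
  T2 = c(180.25, 200.00, 240.45), `1min`   = c(-0.0178, 0.0120, 0.0450),
  check.names = FALSE
)
Ts <- 180.25
Te <- 240.45
temp_cols <- seq(1, ncol(dat), by = 2)   # T1, T2, ...
flow_cols <- temp_cols + 1               # the matching heat-flow columns

# Hypothetical helper: heat flow of pair i at the row whose temperature is nearest to t
flow_at <- function(i, t) {
  dat[[flow_cols[i]]][which.min(abs(dat[[temp_cols[i]]] - t))]
}

# Step 1: shift every curve so that its heat flow at Ts is 0
for (i in seq_along(flow_cols)) {
  dat[[flow_cols[i]]] <- dat[[flow_cols[i]]] - flow_at(i, Ts)
}

# Step 2: tilt every curve linearly so that all curves equal Hmean at Te
at_Te <- sapply(seq_along(flow_cols), flow_at, t = Te)
Hmean <- mean(at_Te)
for (i in seq_along(flow_cols)) {
  slope <- (at_Te[i] - Hmean) / (Te - Ts)
  dat[[flow_cols[i]]] <- dat[[flow_cols[i]]] - slope * (dat[[temp_cols[i]]] - Ts)
}
```

After both steps every heat-flow column is 0 at Ts and equal to Hmean at Te. Because the columns are addressed by position, the same loops would cover all 12 pairs of the full table.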

Creating decision tree

I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with library().
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong? The data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
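To check which columns trip the limit instead of guessing, a sketch like the following could be used. The data frame here is made up for illustration (it is not the baseball file), and the threshold 32 is taken from the error message.

```r
# Made-up data standing in for the baseball table (not the real file)
set.seed(1)
data <- data.frame(
  Name   = factor(paste("Player", 1:40)),   # 40 levels: too many
  team   = factor(sample(c("Hou.", "Sea.", "Mon."), 40, replace = TRUE)),
  hit    = sample(50:200, 40),
  salary = runif(40, 90, 800)
)

# Count levels per column; nlevels() is 0 for non-factor columns
n_levels <- sapply(data, nlevels)
too_many <- names(n_levels)[n_levels > 32]

# Keep only the columns that stay within the limit
train <- data[, setdiff(names(data), too_many), drop = FALSE]
```

With this made-up data too_many contains only "Name", so train keeps team, hit and salary.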
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the package's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
     control = tree.control(nobs, ...), method = "recursive.partition",
     split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
     wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comments, the actual error concerns the predictors in the data variable, not the response. If an offending column actually holds numerical data, it can be converted to numeric to solve the issue; if it has to be a factor, you will need another model or fewer levels. You can also encode such factors numerically by hand (for example by defining as many binary features as you have levels), but this can greatly increase the size of your input space.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data seems to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably what causes the 32 levels error). You should remove from the data variable all the columns that should not be used for building a prediction. I do not know the exact aim here (as there is no information about it in the question), so I can only guess that you are trying to predict a person's salary based on his/her stats, in which case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).

R subsetting a data frame based on a factor variable formatted like a range (xx-xx)

I have been facing this problem for many hours now, and I suspect I am missing something obvious.
Here is my problem:
I have a data frame in an .xlsx file that can be downloaded here.
I loaded this data frame into R using RStudio on Mac and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on AgeRange. The format of this variable is xx-xx (00-04 meaning people between 00 and 04 years old). When I try, the result says that no row meets the condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
# copied from the Excel file
str1 <- "00-04 "
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)
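Since the str() output in the question suggests that every column carries the same trailing character, a sketch that cleans the whole data frame in one pass could look like this. The two rows are made up to imitate the imported sheet, and "\u00a0" is simply another way of writing the no-break space (code point 160).

```r
# Made-up rows imitating the imported sheet; "\u00a0" is the no-break space
df <- data.frame(Age     = c("00-04\u00a0", "05-09\u00a0"),
                 Feminin = c("720\u00a0", "745\u00a0"),
                 stringsAsFactors = TRUE)

# Strip the no-break space from every column (factors become character)
df[] <- lapply(df, function(x) gsub("\u00a0", "", as.character(x), fixed = TRUE))

# Once the stray character is gone, the counts can be converted to numbers
df$Feminin <- as.numeric(df$Feminin)

young <- df[df$Age == "00-04", ]
```

After the cleanup the original comparison with "00-04" selects the expected row, and the count columns are usable as numbers instead of factors.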

Resources