I have a large dataframe in R and am trying to do some stats tests on certain columns, but the non-programmers who made the csv file added a bunch of text notes that I need to ignore.
For example a column might have values: 12,20,40,missing,64,32,no input,45,10
How do I only select the numbers using the which statement?
I failed miserably trying:
my_data_frame$Column.Title[which(is.numeric(my_data_frame$Column.Title))]
What do I change in the which function to only select the numbers and ignore the text? Thanks!
You can use the built-in as.numeric() converter to do something like this:
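x <- my_data_frame$Column.Title
# as.character() first: if the column was read in as a factor,
# as.numeric(x) alone would return the internal level codes.
# Text entries become NA (with a coercion warning).
xn <- as.numeric(as.character(x))
which(!is.na(xn))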
This won't distinguish between NAs created by failed coercion and pre-existing (numeric) NA values.
If there are only a few distinct "missing" markers, you could instead read the data in with read.csv(..., na.strings=c("NA","missing","no input")), so that those entries become NA at import time.
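As a minimal sketch of both approaches, using the values from the question (the vector x and the string csv_text below are made-up stand-ins for the real column and file):

```r
# Sample column mirroring the values in the question (assumed data).
x <- c("12", "20", "40", "missing", "64", "32", "no input", "45", "10")

# Approach 1: coerce after import; text entries become NA.
# The coercion warning is expected here, so it is suppressed.
xn <- suppressWarnings(as.numeric(as.character(x)))
numeric_idx <- which(!is.na(xn))   # positions of the numeric entries
numeric_vals <- xn[numeric_idx]    # the numbers themselves

# Approach 2: declare the text markers as NA while reading.
# read.csv(text = ...) stands in for reading the real file.
csv_text <- "Column.Title\n12\n20\nmissing\n64\nno input\n45"
df <- read.csv(text = csv_text, na.strings = c("NA", "missing", "no input"))
is.numeric(df$Column.Title)
```

The second approach only helps when every text marker is known in advance; any unlisted note would turn the column back into text.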
I have a dataset of startups with a column called "Amount", which holds the valuation of each startup. When I tried to plot it, the plot came out ugly, and I found that the values are stored as character. When I inspected the column with table(copy$Amount), it showed numbers, text, and blanks all mixed together.
I'm a beginner in R. I tried several small snippets, but nothing worked. I want to remove the rows that contain text, the blank rows, and the rows that hold only a "$" symbol with no number, then convert the remaining values to numbers.
You can use parse_number from the readr package, which:
...drops any non-numeric characters before or after the first number. The grouping mark specified by the locale is ignored inside the number.
For example:
> x <- c("1,000", "$1,000", "$$1,000", "1,000$")
> readr::parse_number(x)
[1] 1000 1000 1000 1000
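If installing readr is not an option, a rough base R alternative is sketched below. It is a different technique from parse_number (it simply deletes every character that is not a digit or a decimal point), and the amount vector is hypothetical data resembling the column described in the question:

```r
# Hypothetical values resembling the "Amount" column described above.
amount <- c("$1,000", "2,500,000", "undisclosed", "", "$", "750")

# Strip everything except digits and the decimal point.
cleaned <- gsub("[^0-9.]", "", amount)

# Text-only, blank, and bare "$" entries are now empty strings; mark them NA.
cleaned[cleaned == ""] <- NA
amount_num <- as.numeric(cleaned)

# Keep only the entries that held a number.
amount_clean <- amount_num[!is.na(amount_num)]
```

Note that this sketch would also turn an entry such as "round 2" into 2, so it is worth checking the distinct text values before relying on it.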
I have a problem with the selection of column in a dataframe using a for loop. I'm new to R so it's very possible that I missed something obvious, but I did not find anything that works for me.
I have a file with 20 climatic variables measured over 60 years at 399 different places.
I have one row for each day, and my columns are the 20 climatic variables for each place (with a number at the end of the name to identify the place where the measurement was taken).
It looks like this:
Temperature_1 Rain_1 .....Temperature_399 Rain_399
Date 1
Date 2
...
I want to select the 20 columns corresponding to one place, run some calculations on the variables, put the results in an empty 3D array I have created, then do the same for the next place until the last one.
My problem is that I don't know how to select the right columns automatically. I also have trouble writing the results into the array.
I tried to select the columns for one place using the number at the end of the variable names, but I don't see how to change that condition automatically for each place.
I also tried to use the column positions, but I'm not doing it properly.
This is my code:
#creation of an empty array
Indice_clim=array(NA,dim = c(60,8,399),dimnames=list(c(1959:2018),c("Huglin","CNI","HD","VHD","SHS","DoF","FreqLF","SLF"),c(1:399)))
#selection of the columns corresponding to the first place using "end with"
maille=select(donnees_SAFRAN,c(1:4),ends_with(".1",ignore.case = FALSE))
# another try using the columns position which I know is really badly done
for (j in seq(from=5, to=7984,by=20)){
paste0("maille",j-4)=select(donnees_SAFRAN,c(1:4),c(j:j+19))
}
#and the calculation on the selected columns, the "i loop" is working.
for(i in 1959:2018)temp=c(maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(((T_moy.1-10)+(T_max.1-10))/2)*1.03),
maille%>%filter(an==i,mois==9)%>%summarise(mean(T_min.1)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=30)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=35)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1>=28)%>%summarise(sum(T_moy.1-28)),
maille%>%filter(an==i)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1<2)%>%summarise(sum(abs(2-T_moy.1))))
Indice_clim[[i-1958,,]]=as.numeric(temp)}
I would like to create a loop or something to do my calculation on each place and write the result in my array.
If you have any idea, I would very much appreciate it !
You can use the grep() function to look for each of the locations 1, 2, ..., 399 in the column names. If your big dataframe containing all the data is called df, then you could do this:
for (i in 1:399) {
selected_indices <- grep(paste0('_', i, '$'), colnames(df))
# do calculations on the selected columns
df[, selected_indices]
}
The for loop will automatically run through each location i from 1 through 399. The paste0() function concatenates '_' with the variable i and the dollar sign $ to create strings like "_1$", "_2$", ..., "_399$", which are then searched for using the grep() function in the column names of df. The '$' is used to specify that you want the patterns _1, _2, ... to appear at the end of the column names (it is a regular expression special character).
The grep() function uses these regular expressions to return the column indices for each location. You can then extract the relevant portion of df and do whatever calculations you want.
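A small end-to-end sketch of that loop, including writing the results into a pre-allocated container, might look like this. The toy data frame, the two summary statistics, and the names n_places and results are assumptions made for illustration; the real data would have 399 places and a 3D array:

```r
# Toy data: 2 variables at 3 places over 4 days (assumed names and sizes).
set.seed(1)
df <- data.frame(
  Temperature_1 = rnorm(4), Rain_1 = runif(4),
  Temperature_2 = rnorm(4), Rain_2 = runif(4),
  Temperature_3 = rnorm(4), Rain_3 = runif(4)
)
n_places <- 3

# Pre-allocated results: one row per place, one column per statistic.
results <- matrix(NA_real_, nrow = n_places, ncol = 2,
                  dimnames = list(as.character(seq_len(n_places)),
                                  c("mean_temp", "total_rain")))

for (i in seq_len(n_places)) {
  # Columns whose names end in "_<i>".
  idx <- grep(paste0("_", i, "$"), colnames(df))
  place <- df[, idx, drop = FALSE]
  # Drop the suffix so the same calculation code works for every place.
  colnames(place) <- sub(paste0("_", i, "$"), "", colnames(place))
  results[i, ] <- c(mean(place$Temperature), sum(place$Rain))
}
```

Stripping the suffix inside the loop is the design choice that matters here: the calculations can then refer to Temperature and Rain without hard-coding the place number. For a 3D array the assignment would take the form results[, , i] <- ... instead.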
I am trying to add 0s into character strings, but only under certain conditions.
I have a vector of file names like such:
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
I want to sort(my.fl) so that the file names are ordered by the numbers following the P and R, but as it stands sorting results in this:
"res_P1_R1.rds" "res_P1_R19.rds" "res_P10_R1.rds" "res_P10_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds"
To fix this I need to add 0s after P and R, but only when the number that follows is in the range 1-9; if it is greater than 9 I want to do nothing.
The result should be as follows:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P10_R01.rds" "res_P10_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds"
and if I sort it, it is ordered as expected e.g.:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
I can add 0s based on position, but since the required position changes my solution only works on a subset of the file names. I think this would be a common problem but I haven't managed to find an answer on SO (or anywhere), any help much appreciated.
You should be able to just use mixedsort from the gtools package which removes the need to insert zeroes.
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
library(gtools)
mixedsort(my.fl)
[1] "res_P1_R1.rds" "res_P1_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds" "res_P10_R1.rds" "res_P10_R19.rds"
But if you do want to insert the zeroes you could use something like:
sort(gsub("(?<=\\D)(\\d{1})(?=\\D)", "0\\1", my.fl, perl = TRUE))
[1] "res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
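For completeness, a base R sketch that avoids both the extra package and the lookaround regex: extract the two numbers, order by them, and build padded names with sprintf() if they are wanted. It assumes every file name follows the res_P<number>_R<number>.rds pattern shown in the question; the variables p, r, sorted and padded are introduced here for illustration:

```r
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
           "res_P1_R19.rds", "res_P2_R2.rds",
           "res_P10_R1.rds", "res_P10_R19.rds")

# Pull out the numbers after "P" and "R" as integers.
p <- as.integer(sub(".*_P([0-9]+)_R[0-9]+\\.rds$", "\\1", my.fl))
r <- as.integer(sub(".*_P[0-9]+_R([0-9]+)\\.rds$", "\\1", my.fl))

# Order by P first, then by R, without changing the names.
sorted <- my.fl[order(p, r)]

# Zero-padded names (same order as my.fl), if the padded form is needed.
padded <- sprintf("res_P%02d_R%02d.rds", p, r)
```

Unlike the single-digit regex, the sprintf() width can be raised (for example %03d) if the numbers ever reach three digits.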
I am reading this from a CSV file, and I need to write a function that produces a final data frame. For a particular entry, I have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
I can change it to
x2<-as.character(x)
which gives me
x2
[1] "{2,4,5,11,12}"
How do I extract 2,4,5,11,12 from it (as 5 separate elements)?
I have tried various approaches, such as gsub, but to no avail.
Can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values for that row in the table.
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")
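Note that strsplit() returns character vectors. A minimal sketch of the same idea on inline data, with a final conversion to numbers, could look like this (the vector x and the names stripped and values are illustrative, not from the original file):

```r
# Hypothetical values as they would appear after reading the CSV as text.
x <- c("{2,4,5,11,12}", "{1,2,3,4,5,6,7,12,17}")

# Remove the braces, split on commas, then convert each piece to a number.
stripped <- gsub("[{}]", "", x)
values <- lapply(strsplit(stripped, ",", fixed = TRUE), as.numeric)

values[[1]]          # numeric vector for the first row
length(values[[1]])  # number of elements in that row
```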
I am creating a simple data frame like this:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579")
My understanding is that the column names should be "2D6" and "3A4", but they are actually "X2D6" and "X3A4". Why are the X's being added and how do I make that stop?
I do not recommend working with column names starting with numbers, but if you insist, use the check.names=FALSE argument of data.frame:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579",
check.names=FALSE)
qcCtrl
2D6 3A4
1 DNS00012345 DNS000013579
One reason I caution against this is that the $ operator becomes trickier to work with. For example, the following fails with an error:
> qcCtrl$2D6
Error: unexpected numeric constant in "qcCtrl$2"
To get round this, you have to enclose your column name in back-ticks whenever you work with it:
> qcCtrl$`2D6`
[1] DNS00012345
Levels: DNS00012345
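The X is being added because, by default, data.frame() passes column names through make.names() to make them syntactically valid, and a syntactically valid R name cannot start with a digit. as.character() does not change this; to turn it off, pass check.names=FALSE to data.frame(), or assign the names afterwards with names(qcCtrl) <- c("2D6", "3A4").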
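The renaming can be observed directly with make.names(), the base R function that data.frame() applies when check.names is TRUE (its default). The sketch below uses the data from the question; default_df and raw_df are names introduced here for illustration:

```r
# make.names() prepends "X" to names that start with a digit.
fixed_names <- make.names(c("2D6", "3A4"))

# Default behaviour: names are passed through make.names().
default_df <- data.frame("2D6" = "DNS00012345", "3A4" = "DNS000013579")
names(default_df)

# With check.names = FALSE the names are kept exactly as given.
raw_df <- data.frame("2D6" = "DNS00012345", "3A4" = "DNS000013579",
                     check.names = FALSE)
names(raw_df)

# Non-syntactic names can be reached with [[ ]] or with back-ticks.
value <- as.character(raw_df[["2D6"]])
```

Using [[ ]] with a quoted name sidesteps the back-tick requirement of the $ operator.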