Efficient way to clean data in R - r

In the input, rows 3 and 5 have an incorrect format.
The output I want is:
sale_date  produst_model  store_code
20210208   ASUS_DE552     AAE_08072
20210305   ASUS_AC693     AAE_08072
20210107   ASUS_DE551     AAR_7461
20210325   ASUS_DB341     CMHT_654
20210227   ASUS_HG0982    BR_981
If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

From looking at the data posted, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, but the extraction seems to have gone wrong in some places. If this assumption is correct, then I would recommend going back to the original strings and fixing the extraction, for example with the extract() function from tidyr:
library(tidyverse)
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
  sale_date produst_model store_code
1  20210227   ASUS_HG0982     BR_981
Data:
original = "20210227_ASUS_HG0982_BR_981"
Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.
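If going back to the original strings is not possible, the rows can also be checked in place. The following is only a sketch: the question does not state the rule, so the three patterns below (8 digits, ASUS_ plus letters and digits, letters plus underscore plus digits) are assumptions read off the sample rows, and the sales data frame is made up for illustration.

```r
# Hypothetical sample; only the column names come from the question.
sales <- data.frame(
  sale_date     = c("20210208", "2021-03-05", "20210107"),
  produst_model = c("ASUS_DE552", "ASUS_AC693", "DE551"),
  store_code    = c("AAE_08072", "AAE_08072", "AAR_7461"),
  stringsAsFactors = FALSE
)

# Assumed rules, one regular expression per column.
rules <- c(
  sale_date     = "^[0-9]{8}$",
  produst_model = "^ASUS_[A-Z]+[0-9]+$",
  store_code    = "^[A-Z]+_[0-9]+$"
)

# grepl() is vectorised, so each column is checked in a single call
# instead of looping over rows.
ok <- vapply(names(rules),
             function(col) grepl(rules[[col]], sales[[col]]),
             logical(nrow(sales)))

# Rows where at least one column breaks its rule.
bad_rows <- which(rowSums(!ok) > 0)
bad_rows
```

Because the work is done column by column, 20,000 rows should not be a problem; the ok matrix also shows which column failed for each flagged row.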

Related

Read delimited text (without newlines) and tell how many columns

I have several hundred delimited text files. In some rows, a newline appears before the end of the row, in random columns. When I try to read a file, the reader expects the correct number of columns, but fails because the row is split onto the next line.
The argument fill=T does not help because it creates empty, incorrect columns.
If I have:
"Aa|Bb|C\nc\ntwo|three|four"
But really should be two rows by three columns:
"Aa|Bb|Cc\ntwo|three|four"
How can I get there for all rows of the data (the error occurs randomly throughout)?
Note that you have C\nc in the string, which puts c on a new line. I guess you need to ensure the format of your input string as the first step; otherwise it is difficult to fix via post-processing.
I am not sure if the code below is what you are after. Do you mean something like using read.csv?
read.csv(text = sub("\n","",s),sep = "|",header = FALSE)
which gives
V1 V2 V3
1 Aa Bb Cc
2 two three four
If you are using data.table, you can try fread (thanks @akrun)
fread(sub("\n", "", s))
Data
s <- "Aa|Bb|C\nc\ntwo|three|four"
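Note that sub() only removes the first newline. For files where stray newlines occur in many rows, one possible approach is to rebuild the records by counting delimiters. This is a rough sketch, not a general solution: it assumes the number of columns (n_cols) is known, that "|" never appears inside a field, and that a line without any "|" is the tail of the previous record's last field (so a break inside the first field of a row would be misattributed).

```r
s <- "Aa|Bb|C\nc\ntwo|three|four"
n_cols <- 3  # assumed to be known in advance

# Count the "|" separators in a string.
count_sep <- function(x) {
  lengths(regmatches(x, gregexpr("|", x, fixed = TRUE)))
}

lines <- strsplit(s, "\n", fixed = TRUE)[[1]]
records <- character(0)
buffer <- ""

for (line in lines) {
  if (buffer == "" && length(records) > 0 && count_sep(line) == 0) {
    # No separator at all: treat as a continuation of the last field
    # of the previous, already complete record.
    records[length(records)] <- paste0(records[length(records)], line)
    next
  }
  buffer <- paste0(buffer, line)
  if (count_sep(buffer) >= n_cols - 1) {
    # The buffer now holds a full record.
    records <- c(records, buffer)
    buffer <- ""
  }
}

fixed <- read.table(text = records, sep = "|", header = FALSE,
                    stringsAsFactors = FALSE)
fixed
```

For real files, lines would come from readLines() instead of strsplit().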

Select data in R that meet a condition and use a for loop on that condition

I have a problem with the selection of column in a dataframe using a for loop. I'm new to R so it's very possible that I missed something obvious, but I did not find anything that works for me.
I have a file with 20 climatic variables measured over 60 years in 399 different places.
I have a line for each day, and my columns are the 20 climatic variables for each place (with a number at the end of the name to identify the place where the measurement was taken).
It looks like this:
Temperature_1 Rain_1 .....Temperature_399 Rain_399
Date 1
Date 2
...
I want to select the 20 columns corresponding to one place, run some calculations on the variables, put the results in an empty 3D array I have created, then do the same for the next place until the last one.
My problem is that I don't know how to select the right columns automatically. I also have issues with writing the results into the array.
I tried to select the columns corresponding to one place using the numbers at the end of the variable names, but I don't think it is possible to change the condition automatically.
I also tried to use the position of the columns, but I'm not doing it properly.
This is my code:
#creation of an empty array
Indice_clim=array(NA,dim = c(60,8,399),dimnames=list(c(1959:2018),c("Huglin","CNI","HD","VHD","SHS","DoF","FreqLF","SLF"),c(1:399)))
#selection of the columns corresponding to the first place using "end with"
maille=select(donnees_SAFRAN,c(1:4),ends_with(".1",ignore.case = FALSE))
# another try using the columns position which I know is really badly done
for (j in seq(from=5, to=7984,by=20)){
paste0("maille",j-4)=select(donnees_SAFRAN,c(1:4),c(j:j+19))
}
#and the calculation on the selected columns, the "i loop" is working.
for(i in 1959:2018)temp=c(maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(((T_moy.1-10)+(T_max.1-10))/2)*1.03),
maille%>%filter(an==i,mois==9)%>%summarise(mean(T_min.1)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=30)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=35)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1>=28)%>%summarise(sum(T_moy.1-28)),
maille%>%filter(an==i)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1<2)%>%summarise(sum(abs(2-T_moy.1))))
Indice_clim[[i-1958,,]]=as.numeric(temp)}
I would like to create a loop or something to do my calculation on each place and write the result in my array.
If you have any ideas, I would very much appreciate it!
You can use the grep() function to look for each of the locations 1, 2, ..., 399 in the column names. If your big dataframe containing all the data is called df, then you could do this:
for (i in 1:399) {
  selected_indices <- grep(paste0('_', i, '$'), colnames(df))
  # do calculations on the selected columns
  df[, selected_indices]
}
The for loop will automatically run through each location i from 1 through 399. The paste0() function concatenates '_' with the variable i and the dollar sign $ to create strings like "_1$", "_2$", ..., "_399$", which are then searched for using the grep() function in the column names of df. The '$' is used to specify that you want the patterns _1, _2, ... to appear at the end of the column names (it is a regular expression special character).
The grep() function uses the above regular expressions to return the column indices required for each location. You can then extract the relevant portion of df and do whatever calculations you want.
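To also cover writing the results into the 3D array, here is a small sketch on made-up data. The data frame, the two toy indices (mean_temp, total_rain) and the two places are all hypothetical stand-ins for the 20 variables, 8 indices and 399 places of the question; only the grep() selection is taken from the answer above.

```r
# Hypothetical data: 2 years, 2 places, 2 variables per place.
df <- data.frame(
  year          = c(2000, 2000, 2001, 2001),
  Temperature_1 = c(10, 20, 30, 40),
  Rain_1        = c(1, 2, 3, 4),
  Temperature_2 = c(5, 15, 25, 35),
  Rain_2        = c(0, 0, 1, 1)
)

years  <- unique(df$year)
places <- 1:2

# Empty array: year x index x place.
result <- array(NA_real_,
                dim = c(length(years), 2, length(places)),
                dimnames = list(as.character(years),
                                c("mean_temp", "total_rain"),
                                as.character(places)))

for (p in places) {
  suffix <- paste0("_", p, "$")
  place_df <- df[, grep(suffix, colnames(df))]
  # Drop the place suffix so the same code works for every place.
  colnames(place_df) <- sub(suffix, "", colnames(place_df))

  for (y in years) {
    rows <- df$year == y
    result[as.character(y), , p] <- c(mean(place_df$Temperature[rows]),
                                      sum(place_df$Rain[rows]))
  }
}

result[, , "1"]
```

Indexing the array by name (as.character(y)) avoids the i-1958 style offset arithmetic, and single brackets are used because a whole slice is assigned.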

Search large string for multiple instances of smaller string in R

In R, I have taken a JSON format of test results and converted them to a data frame of 14 variables and 1101 entries. In this test, the user must select squares in a particular order for a correct score. Under one variable, "input," the values are long strings with info on which square was selected and the time it took to select the square.
Ex:
"[{\"selectedSquare\":\"1\",\"tapTime\":\"00:00:00:06\"},
{\"selectedSquare\":\"0\",\"tapTime\":\"00:00:01:02\"},
{\"selectedSquare\":\"3\",\"tapTime\":\"00:00:02:00\"},
{\"selectedSquare\":\"2\",\"tapTime\":\"00:00:02:07\"}]"
Some entries have more than others, some have none.
I need to search each entry for the square a student selected, and output the order into a new column. Using the example above:
1,0,3,2
I have tried to access each entry individually to test functions on using df$input[1], but it returns a factor with 219 levels. I cannot find a way to only access the relevant piece of the input entry.
You can do this by using an appropriate regular expression. Try:
library(dplyr)
library(stringr)
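pattern <- "(?<=\")\\d(?=\")" ## regular expression with lookarounds: a single digit between double quotes
## as.character() because df$input is a factor; USE.NAMES = FALSE keeps the result unnamed
df$new.col <- sapply(as.character(df$input), function(x) {str_extract_all(x, pattern)[[1]] %>% paste(collapse = ",")}, USE.NAMES = FALSE)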
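The pattern above only matches single-digit squares. A variant in base R that anchors on the selectedSquare key and allows several digits could look like the sketch below; the input vector is a made-up stand-in for as.character(df$input), with one entry from the question and one empty entry.

```r
# Hypothetical stand-in for as.character(df$input).
input <- c(
  paste0("[{\"selectedSquare\":\"1\",\"tapTime\":\"00:00:00:06\"},",
         "{\"selectedSquare\":\"0\",\"tapTime\":\"00:00:01:02\"},",
         "{\"selectedSquare\":\"3\",\"tapTime\":\"00:00:02:00\"},",
         "{\"selectedSquare\":\"2\",\"tapTime\":\"00:00:02:07\"}]"),
  "[]"
)

# Digits that directly follow "selectedSquare":" (PCRE lookbehind).
square_pattern <- "(?<=\"selectedSquare\":\")\\d+"
matches <- regmatches(input, gregexpr(square_pattern, input, perl = TRUE))

# One comma-separated string per entry; entries without taps give "".
order_selected <- vapply(matches, paste, character(1), collapse = ",")
order_selected
```

Since the column holds JSON, parsing it with a JSON package would be the more robust route; the regular expression is just the lighter option.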

Data Frame containing hyphens using R

I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine. However, when I come to aggregate this in order to get some summary statistics, I can't reference the dataset, as R stops reading after the hyphen (I assume that the hyphen is some special character).
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # @ ! & | ^ ( [ ' " in them. At ?Quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list, and keep it in a list rather than using assign; it will be much easier to loop over or to use with lapply that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
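As a small illustration of why the list is easier to work with, here is a sketch on made-up data. The Value column and the shortened Dimension labels are hypothetical; only the subset-then-split pattern comes from the answer.

```r
# Hypothetical data with hyphenated Dimension labels.
df <- data.frame(
  Dimension = c("Age-Gender", "Age-Gender", "Age-Group", "Other"),
  Value     = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
dim.list <- c("Age-Gender", "Age-Group")

# Subset first, then split into a named list of data frames.
nice_list <- df[df$Dimension %in% dim.list, ]
nice_list <- split(nice_list, nice_list$Dimension)

# Hyphens are fine as list names: access by string, no backticks needed.
nice_list[["Age-Gender"]]

# Summary statistic for every subset in one call.
means <- sapply(nice_list, function(d) mean(d$Value))
means
```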

How to get a data set with row names from a particular column in R language

What I have tried is:
reddit <-read.csv('movie_metadata.csv')
reddit <- na.omit(reddit)
View(reddit)
facebook<-reddit[1:50,c(2,9,23)]
samp2 <- facebook[,-2]
rownames(samp2) <- facebook[,2]
samp2
samp.with.rownames <- data.frame(facebook[,-2], row.names=facebook[,2])
row.names(facebook)<-reddit$director_name[1:50]
d<-dist(as.matrix(samp.with.rownames))
e<-log(d)
hc<-hclust(d)
plot(hc,cex=0.8,las=1)
Even after trying different methods, what I get is numbers instead of the names or text present in column 2.
Welcome to SO.
First of all, I don't quite understand why you would want to change the index numbers to text. The text needs to be unique for it to work, and note that director names won't be unique.
Instead, add a column with the director name to the dataset, and when you save the data frame, use:
write.csv(samp2, row.names = FALSE)
Second, your example is not reproducible, which wouldn't be a problem if you had included your purpose for changing the index to characters.
Here is something that could help you maybe?
Changing index to unique name
try looking into ?hclust
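Maybe what you need is to create a data frame with 2 parts, use the numeric columns for the distance and the name column for the labels. Note that labels is an argument of the plot method for hclust objects, not of hclust itself:
plot(hclust(d), labels = as.character(facebook[, 2]))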
Good luck with your task :)
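A self-contained sketch of that idea follows. The movies data frame and its columns are invented for illustration (the real movie_metadata.csv is not shown in the question); make.unique() is one way to deal with repeated director names if they also have to serve as row names.

```r
# Hypothetical data; note the duplicated director "A".
movies <- data.frame(
  director = c("A", "B", "A", "C"),
  score    = c(7.1, 6.5, 8.0, 5.2),
  budget   = c(10, 20, 15, 5),
  stringsAsFactors = FALSE
)

# Distances are computed from the numeric columns only.
d  <- dist(scale(movies[, c("score", "budget")]))
hc <- hclust(d)

# Labels for the leaves; make.unique() turns the second "A" into "A.1".
leaf_labels <- make.unique(movies$director)

plot(hc, labels = leaf_labels, cex = 0.8, las = 1)
```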
