I have imported a Stata dta file into R using the readstata13 package.
The variables have notes which contain the full text of the questions. I found the attr() function, with which I can do a few things such as extract variable names (attr(df, "names")), variable labels (attr(df, "var.labels")), and value labels (attr(df, "label.table")). However, I have not found a way to extract variable notes.
Is there a way to do so?
Below are a few lines of Stata code that produce a dta file with two variables and variable notes, which can be imported into R.
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: Mileage (mpg)
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
EDIT:
You can actually do this directly in newer versions of readstata13 as follows:
df = read.dta13("~/mpg_weight.dta")
notes = attr(df, "expansion.fields")
This will generate a list providing, for each entry, the variable name, the characteristic name, and the contents of the Stata characteristic field.
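For example, to pull just the notes for mpg out of that list (a sketch: it assumes each element is a character vector of variable name, characteristic name, and contents, in the order described above, and that Stata stores notes as characteristics named note1, note2, ..., with the count in note0):
# keep the elements that belong to mpg and are actual notes (note1, note2, ...)
mpg_notes <- Filter(function(x) x[1] == "mpg" && grepl("^note[1-9]", x[2]), notes)
# the third element of each entry holds the note text
sapply(mpg_notes, `[`, 3)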
Here's a quick workaround using your toy example:
clear
input int(mpg weight)
34 1800
18 3670
21 4060
15 3720
19 3400
41 2040
25 1990
28 3260
30 1980
12 4720
end
note mpg: this is the first note
note mpg: and this is the second
note mpg: here's a third
note weight: Weight (lbs.)
save "~/mpg_weight.dta", replace
ds
local varlist `r(varlist)'
foreach var of local varlist {
    generate notes_`var' = ""
    forvalues i = 1 / ``var'[note0]' {
        replace notes_`var' = "``var'[note`i']'" in `i'
    }
}
export delimited notes_* using notes_mpg_weight.dta.csv, replace
You can then simply import everything in R as strings and go from there.
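For instance (a sketch, assuming the CSV lands in your R working directory):
# each notes_* column holds one variable's notes, one note per row
notes <- read.csv("notes_mpg_weight.dta.csv", stringsAsFactors = FALSE)
notes$notes_mpg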
I've got thousands of text files, each with tens of thousands of lines in varying structure. The format looks like the following 3 lines:
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16
A # normally marks a break between the name of a field and its value. Sometimes there is another, deeper level where # changes to ° and = changes to *. The lines in the original data have about 10,000 characters each. In each line I am searching only for REFERENZ, which can appear multiple times, e.g. in line 1.
The result of the read function for these 3 lines should be a data.frame like this:
> Daten = data.frame(REFERENZ = c(23,24,25))
> str(Daten)
'data.frame': 3 obs. of 1 variable:
$ REFERENZ: num 23 24 25
Does anybody know a function in R that can search for this?
I use the read_lines() function from the readr package for problems like that.
library(readr)
library(data.table)
t1 <- read_lines('textfile.txt')
table <- fread(paste0(t1, collapse = '\n'), sep = '#')
EDIT:
I misunderstood the question, my bad. I think you are looking for a regex.
library(readr)
library(stringr)
t1 <- 'DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ=23#REFERENZ*23°__PATH°16 16#
DATE#2020-10-08#__JOBTYPE#ANFRAGE#__PATH#16 16 16 16 16#REFERENZ*24°__PATH°16 16#
DATE#2020-10-08#TIME#16:00:04#__JOBTYPE#ANFRAGE#REFERENZ=25#__PATH#17 16 16 18 16'
t1 <- read_lines(t1)
Daten = data.frame(REFERENZ = as.numeric(str_extract(str_extract(t1, 'REFERENZ\\W\\d+'), '[0-9]+')))
str(Daten)
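One caveat: str_extract() returns only the first match per line, while REFERENZ can appear several times in one line (as in line 1). If you need every occurrence, str_extract_all() is the multi-match variant; a sketch:
library(stringr)
# one character vector of REFERENZ matches per input line
all_refs <- str_extract_all(t1, 'REFERENZ[=*]\\d+')
# strip the prefix and convert, keeping the per-line grouping
lapply(all_refs, function(x) as.numeric(str_extract(x, '[0-9]+')))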
I have 16 large datasets of landcover variables around routes. Example dataset "Trial1":
RtNo TYPE CA PLAND NP PD LPI TE
2001 cls_11 996.57 6.4297 22 0.1419 6.3055 31080
2010 cls_11 56.34 0.3654 23 0.1492 0.1669 15480
18003 cls_11 141.12 0.9899 37 0.2596 0.1503 38700
18014 cls_11 797.58 5.3499 47 0.3153 1.3969 98310
2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670
2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260
18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780
18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650
2001 cls_22 732.33 4.7249 653 4.2131 0.7212 377430
2010 cls_22 32.31 0.2096 168 1.0896 0.0198 31470
18003 cls_22 275.85 1.9351 781 5.4787 0.0423 237390
18014 cls_22 469.44 3.1488 104 6.7345 0.1014 377580
I want to first select rows that meet a condition, for example all rows where the column "TYPE" is cls_21. I know the following code does this:
Trial21 <- subset(Trial1, TYPE == " cls_21 ")
(Yes, the invisible spaces before and after the categorical value caused me a considerable headache.)
There are several other ways of doing this, as shown in https://stackoverflow.com/questions/5391124/select-rows-of-a-matrix-that-meet-a-condition
I get the following output (sorry, this one has extra columns, but that shouldn't affect my question):
RtNo TYPE CA PLAND NP PD LPI TE ED LSI
2 18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780 38.5668 46.1194
18 18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650 51.6255 56.2522
34 2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670 49.1418 49.3462
50 2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260 30.0457 46.0118
62 2020 cls_21 625.5 4.1165 180 1.1846 0.5064 384840 25.3268 38.6407
85 2021 cls_21 503.55 2.7926 214 1.1868 0.1178 348330 19.3175 38.9267
I want to rename the columns in this subset so they uniquely identify the class, by appending "L21" to the existing column names. I can do this using
library(data.table)
setnames(Trial21, old = c('CA', 'PLAND', 'NP', 'PD', 'LPI', 'TE', 'ED', 'LSI'),
new = c('CAL21', 'PLANDL21', 'NPL21', 'PDL21', 'LPIL21', 'TEL21', 'EDL21', 'LSIL21'))
I want help developing a function or a loop that automates this process, so I don't have to spend days repeating the same code for 15 different classes and 16 datasets (240 times), and to decrease the risk of errors. I may have to do the same for additional datasets. Any help to speed up the process will be greatly appreciated.
You could do:
a <- split(df, df$TYPE)
b <- sapply(names(a), function(x) setNames(a[[x]], paste0(names(a[[x]]), sub(".*_", "L", x))),
            simplify = FALSE)
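Each element of b is then one renamed subset; note that if TYPE carries the stray padding spaces mentioned in the question, those spaces end up in the list names (and in the column-name suffixes) too:
names(b)               # e.g. " cls_11 ", " cls_21 ", ...
head(b[[" cls_21 "]])  # the cls_21 subset with the L21 suffix appended to its columns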
You can use ls() to get the names of the datasets, manipulate them as you wish inside a loop with the get() function, and then create the new datasets with assign().
sets = grep("Trial", ls(), value=TRUE) #Assuming every dataset has "Trial" in the name
for(i in sets){
classes = unique(get(i)$TYPE)
for(j in classes){
number = gsub("(.+)([0-9]{2})( )", "\\2", j)#this might be an overly complicated way of getting just the number, you can look for better options if you want
assign(paste0("Trial", number),
subset(Trial1, TYPE==j) %>% rename_with(function(x){paste0(x, number)}))}}
Here is a start that should work for your example:
library(dplyr)

myfilter <- function(data, number) {
  data %>%
    filter(TYPE == sprintf(" cls_%s ", number)) %>%
    rename_with(\(x) sprintf("%s%s", x, number), !1:2)
}

myfilter(example_data, 21)
Given a list of numbers (here: 21 to 31) you could then automatically use them to filter a single dataframe:
multifilter <- function(data) {
purrr::map(21:31, \(i) myfilter(data, i))
}
multifilter(example_data)
Finally, given a list of dataframes, you can automatically apply the filters to them:
purrr::map(list_of_dataframes, multifilter)
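With the 21:31 mapping in mind, the nested result stays navigable; for example (assuming list_of_dataframes from above):
results <- purrr::map(list_of_dataframes, multifilter)
results[[1]][[1]]  # first dataset, filtered to cls_21 with renamed columns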
I tried to write a dataframe into a txt file without a header, but column names kept appearing when I read it back. When I opened the file directly from the drive, it had 21 rows and no heading, but when I opened it using read.delim(), I saw a header with some symbols in it.
Here is the code
write.table(trans_sequence, file="mytxtout.txt", sep=";", col.names =FALSE, row.names = FALSE,
quote = FALSE)
When I retrieved the data using read.delim, it looked like below. It should have 21 rows, but the top row was turned into a column name, leaving 20 rows. The first row should be
2745;9;2;HbA1c;LDL-C Tests
but it was made into a header instead:
read.delim("mytxtout.txt")
X2745.9.2.HbA1c.LDL.C.Tests
1 10433;9;2;BMI;Blood Pressure
2 13601;0;1;LDL-C Tests
3 13601;6;1;LDL-C Tests
4 36127;2;2;BMI;Blood Pressure
5 36127;5;1;Blood Pressure
6 36127;9;2;BMI;Blood Pressure
7 36127;10;2;BMI;Blood Pressure
8 54881;9;2;HbA1c;LDL-C Tests
9 59650;0;2;BMI;Blood Pressure
10 59650;3;2;BMI;Blood Pressure
11 66741;0;1;LDL-C Tests
12 72772;3;1;LDL-C Tests
13 77618;2;3;BMI;BMI Percentile;Blood Pressure
14 77618;3;2;BMI;BMI Percentile
15 81397;4;1;BMI
16 81397;6;2;BMI;Blood Pressure
17 81397;9;2;BMI;Blood Pressure
18 81397;9;1;BMI
19 83520;6;3;BMI;BMI Percentile;Blood Pressure
20 85178;10;1;LDL-C Tests
Any help will be greatly appreciated
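For what it's worth, a minimal fix on the reading side is to tell read.delim() not to consume the first row as a header, and to pass the same separator the file was written with:
# header = FALSE keeps all 21 rows; sep = ";" matches the write.table call
dat <- read.delim("mytxtout.txt", header = FALSE, sep = ";")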
I am trying to use the code mentioned for RFM modelling in R from the blog here. However, grouping the data frame into “Buy” and “No Buy” has not been explained clearly there. As a result, when I try to execute the function getPercentages, I get an error like:
object "Buy" not found
I am trying to add a Buy column as follows:
df$Buy <- ifelse(df$Frequency > 1, 1, 0)
before executing the function.
I do not know if this is the right way to get the values.
The head of df after getDataframe is:
ID Date Amount Recency Frequency Monetary
1207779 2016-06-22 2112.00 8 20 1576.7725
2455590 2016-06-26 1064.00 4 16 1074.8400
2660337 2016-06-21 1870.00 9 20 1616.1700
257997 2016-06-22 616.00 8 22 684.8968
963883 2016-06-27 703.12 3 16 626.1125
1124489 2016-06-21 594.15 9 18 752.2011
Try this:
Buy <- rep(0, nrow(dftry))
dftry <- cbind(dftry, Buy)
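Equivalently, a one-liner adds the column in place without the intermediate vector:
dftry$Buy <- 0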
DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between values in a column, grouped by EMMI. For example, for EMMI 12345, the difference between the dates 2011/02/19 and 2011/02/12 is 21 - 21 = 0. I want to do that for the entire ACT column, adding a new column diff holding the values. Can anybody tell me how to do it?
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave(DF3$ACT, DF3$EMMI, FUN = function(x) c(NA, diff(x)))
As long as the dates are sorted within each EMMI this will work; if they are not, we would need to sort first. I would probably sort the entire data frame by date (saving the result of order()), run the above, and then, if the original order is needed, apply order() to the saved result to "unorder" the data frame.
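A sketch of that sort-and-restore approach (untested, following the description above):
ord <- order(DF2$Date)   # remember the permutation that sorts by date
DF3 <- DF2[ord, ]
DF3$difACT <- ave(DF3$ACT, DF3$EMMI, FUN = function(x) c(NA, diff(x)))
DF3 <- DF3[order(ord), ] # invert the permutation to restore the original order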
This is based on the plyr package (not tested). Note that diff() returns one value fewer than it is given, so an NA is padded onto the front of each group:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))