How to create separate data.frames in a for loop in R

Good evening everybody,
I'm stuck on how to construct a for loop. My code works, but I'd like to understand how I can create "independent" data frames (copies of the original with some differences).
I wrote the code step by step (it works), but I think there may be a way to compact it with a for loop.
x is my original data.frame
str(x)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
My first goal is to remove, for every column, any rows with NA or "" elements. I do this with these lines of code:
x_b<- x[!(!is.na(x$b) & x$b==""), ]
x_c<- x[!(!is.na(x$c) & x$c==""), ]
x_d<- x[!(!is.na(x$d) & x$d==""), ]
x_e<- x[!(!is.na(x$e) & x$e==""), ]
x_f<- x[!(!is.na(x$f) & x$f==""), ]
After this, the second goal is to create an ID code for each new data.frame, which I build with paste0(), for example paste0(x_b$a, x_b$b).
x_b$ID_1<-paste0(x_b$a, x_b$b)
x_c$ID_2<-paste0(x_c$a, x_c$c)
x_d$ID_3<-paste0(x_d$a, x_d$d)
x_e$ID_4<-paste0(x_e$a, x_e$e)
x_f$ID_5<-paste0(x_f$a, x_f$f)
I wrote this for loop to try to reduce the number of lines I need and to make the code easier to read.
z<-data.frame("a", "b","c","d","e","f")
zy<-data.frame("x_b", "x_c", "x_d", "x_e", "x_f")
for(i in z) {
for (j in zy ) {
target <- paste("_",i)
x[[i]]<-(!is.na(x[[i]]) & x[[i]]=="") #with this I am able to create a column in the x data.frame,
#but if I use a new data frame the for loop doesn't work,
#and I don't want to overwrite x. I'd like to create a
#data frame for each transformation.
#At this point of the script I should have new,
#separate data frames (x_b, x_c, x_d, x_e, x_f), but I
#don't know
#how to create them.
#If I had these data frames I would run another function
#in the for loop:
zy[[ID]]<-paste0(x_b$a, "_23X")
}
}
I'd like to have as output this:
str(x_b)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
$ ID: chr "1_23X" "56_23X" "1058_23X" "567_23X" "987_23X" "574_23X" "1001_23X"...
and so on.
I think there is some important concept about data frames that I am missing.
Where am I going wrong?
Thank you so much in advance for the support.

There is a simple way to do this with the tidyverse packages:
First goal:
drop_na(df)
You can also use na_if() if you want to convert "" to NA.
Second goal: use mutate() to create a new variable:
df <- df %>%
mutate(id = paste0(a, "_23X"))
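For the part of the question about getting one data frame per column, a minimal base R sketch is to collect the filtered copies in a named list with lapply() instead of creating x_b, x_c, ... by hand. The toy x below is made up for illustration, and the filter drops both NA and "" (the stated goal), which is an assumption about what the asker wants.

```r
# Toy stand-in for the asker's `x`; the values are invented for illustration.
x <- data.frame(
  a = c(1L, 56L, 1058L),
  b = c("10", "", "5"),
  c = c(NA, "7", ""),
  stringsAsFactors = FALSE
)

cols <- c("b", "c")

# Build one filtered copy of `x` per column and keep them in a list.
frames <- lapply(cols, function(col) {
  keep <- !is.na(x[[col]]) & x[[col]] != ""  # drop rows that are NA or ""
  out <- x[keep, , drop = FALSE]
  out$ID <- paste0(out$a, "_", out[[col]])   # ID built from `a` and the column
  out
})
names(frames) <- paste0("x_", cols)

frames$x_b  # the copy filtered on column b
```

Keeping the results in a list (frames$x_b, frames$x_c) avoids assign() and makes it easy to loop over them again later.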

Related

How do I change the structure of a r data table

I've merged a handful of data sets, all imported from SPSS, CSV, or Excel files, into one large data table. For the most part I can use all the variables I want to run tests, but every once in a while their structure needs to be changed. As an example, here's my data set:
> str(gadd.us)
'data.frame': 467 obs. of 381 variables:
$ nidaid : Nmnl. item chr "45-D11150341" "45-D11180321" "45-D11220022" "45-D11240432" ...
$ id : Nmnl. item chr "D11150341" "D11180321" "D11220022" "D11240432" ...
$ agew1 : Itvl. item num 17 17 15 18 17 15 15 18 20 18 ...
$ nagew1 : Itvl. item num 17.3 17.2 15.7 18.2 17.2 ...
$ nsex : Nmnl. item w/ 2 labels for 0,1 num 1 1 0 0 0 0 1 1 1 1 ...
and when I focus on just one variable I get something like this
> str(gadd.us$wasiblckw2)
Itvl. item + ms.v. num [1:467] 70 48 40 60 37 46 67 55 45 61 ...
> str(gadd.us$nsex)
Nmnl. item w/ 2 labels for 0,1 num [1:467] 1 1 0 0 0 0 1 1 1 1 ...
So when I try to create a histogram I get an error...
> hist(gadd.us$wasiblckw2)
Error in hist.default(gadd.us$wasiblckw2) :
some 'x' not counted; maybe 'breaks' do not span range of 'x'
If I change this variable using as.numeric() it works just fine. Any idea what's going on here?
If you import your data from SPSS, SAS, or Stata using haven (library(haven)), haven stores variable formats in an attribute: format.spss, format.sas, or format.stata. This can sometimes cause problems for your code. haven has several functions to remove such formats and labels:
gadd.us <- haven::zap_formats(gadd.us)
gadd.us <- haven::zap_labels(gadd.us)
You may also want to try some other zap_ functions.
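The reason as.numeric() helps, as the asker observed, is that it returns a plain vector with the extra attributes removed. A small base R sketch of that effect follows; the attribute names and values here are only illustrative and not taken from the asker's data.

```r
# A numeric vector carrying extra metadata, as an imported variable might.
# The attribute names below are illustrative assumptions.
v <- structure(c(70, 48, 40, 60, 37),
               label = "some label",
               format.spss = "F8.2")

attributes(v)          # shows the attached metadata

plain <- as.numeric(v) # as.numeric() drops the attributes
attributes(plain)      # NULL: a plain numeric vector that hist() can handle
```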

Interval classification in r

I have two dataframes as follows
str(daily)
Classes ‘grouped_df’,‘tbl_df’,‘tbl’ and 'data.frame':15264 obs.of 3 variables:
$ steps : int 0 0 0 0 0 0 0 0 0 0 ...
$ date : Date, format: "2012-10-02" "2012-10-02" "2012-10-02" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
interval<-data.frame(unique(daily$interval))
str(interval)
'data.frame': 288 obs. of 1 variable:
$ unique.daily.interval.:int 0 5 10 15 20 25 30 35 40 45 50 55 100..2350 2355
Using dplyr, what I intended to do was find the mean of daily$steps for each interval across daily$date using the following:
mutate(daily,class=cut(daily$steps,c(0,interval$unique.daily.interval.),
include.lowest = TRUE) %>%
group_by(class) %>%
summarise(Mean = mean(daily$steps)))
The code fails giving the following error
Error: 'breaks' are not unique
which I have isolated to the class = cut(...) call. I have checked the interval data frame for uniqueness; it has only 288 values. Can someone point out what I am doing wrong? Here is a reference I used: "Create class intervals in r and sum values".
Here is a link to the data in question: "Activity monitoring data".
thanks.
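One likely cause of the error above: c(0, interval$unique.daily.interval.) contains 0 twice, because the unique intervals already start at 0, and cut() rejects duplicated breaks. If the aim is simply the mean of steps for each interval across all dates, binning may not be needed at all. A minimal base R sketch, using a made-up miniature of daily:

```r
# Miniature stand-in for `daily`; the values are invented for illustration.
daily <- data.frame(
  steps    = c(0L, 4L, 10L, 2L, 6L, 0L),
  date     = as.Date(c("2012-10-02", "2012-10-02", "2012-10-02",
                       "2012-10-03", "2012-10-03", "2012-10-03")),
  interval = c(0L, 5L, 10L, 0L, 5L, 10L)
)

# Mean number of steps per interval, averaged over all dates.
means <- aggregate(steps ~ interval, data = daily, FUN = mean)
means
```

With dplyr the same idea would be group_by(interval) followed by summarise(Mean = mean(steps)), referring to the column by its bare name rather than daily$steps.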

How to add dummy variables in R

I know there are several questions about this topic, but none of them seem to answer my specific question.
I have a dataset with five independent variables and I want to add two dummy variables to my regression in R. I have my data in Excel and importing the dataset is not a problem (I use read.csv2). Now, when I want to see my dummy variables, D1 and D2, I can't. I can see all the other variables. The two dummy variables both take the values 0 and 1 throughout the dataset.
I can easily see a summary of all my data, including D1 and D2 (with median, mean, etc.), and I can call each of the 5 variables separately without any problems at all, but I can't do that with D1 and D2.
> str(tilskuere)
'data.frame': 180 obs. of 7 variables:
$ ATT : int 3166 4315 7123 6575 7895 7323 3579 9571 5345 6595 ...
$ PRICE : int 80 95 120 100 105 115 80 130 105 100 ...
$ viewers: int 41000 43000 56000 66000 157000 91000 51000 30000 36000 72000 ...
$ CB1 : int 10 10 5 2 7 2 3 1 10 1 ...
$ CB2 : num 1 1 1 0 0.33 ...
$ D1 : int 0 0 0 1 0 0 0 0 0 0 ...
$ D2 : int 1 0 0 0 0 1 1 0 0 0 ...
> summary(tilskuere)
> mean(ATT)
[1] 6856.372
> mean(D1)
Error in mean(D1) : object 'D1' not found
To sum up: I can run regressions in R without D1 and D2, but I can't include these as dummy variables as R can't find these variables, when I run them. R simply says "object D1 not found."
I hope someone can help. Thank you in advance.
Kind regards
Mikkel
I added the material from your comment to the question and added some line breaks. It is now clear that the problem is that columns of a data frame are not standalone objects in R's workspace. Try:
mean(tilskuere$D1)
You can see what objects are in your workspace with:
ls()
You appear to have an object named ATT in your workspace as well as a length-180 column by the same name in the object named tilskuere.
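As a short sketch of the ways to reach a column without relying on a same-named object in the workspace, the following uses the first six rows shown in the str() output above. The regression formula is only an assumed example of how D1 and D2 might be included.

```r
# First six rows of some columns, copied from the str() output in the question.
tilskuere <- data.frame(
  ATT   = c(3166, 4315, 7123, 6575, 7895, 7323),
  PRICE = c(80, 95, 120, 100, 105, 115),
  D1    = c(0, 0, 0, 1, 0, 0),
  D2    = c(1, 0, 0, 0, 0, 1)
)

mean(tilskuere$D1)         # refer to the column through the data frame
with(tilskuere, mean(D1))  # or evaluate the expression inside the data frame

# Modelling functions take a data argument, so the bare column names work here.
fit <- lm(ATT ~ PRICE + D1 + D2, data = tilskuere)
coef(fit)
```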

Conditional input using read.table or readLines

I'm struggling with using readLines() and read.table() to get a well-formatted data frame in R.
I want to read files like this which are Hockey stats. I'd like to get a nicely formatted data frame, however, specifying the concrete amount of lines to read is difficult because in other files like this the number of players is different. Also, non-players, signed as #.AC, #.HC and so on, should not be read in.
I tried something like this
LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
encoding="UTF-8", nrows=LINES)
but as mentioned above, the value for LINES is different each time.
I also tried readLines as in this post, but had no luck with it.
Is there a way to integrate a condition in read.table, like (pseudo code)
if (first character == "AC") {
break read.table
}
Sorry if this looks strange, I don't have that much experience in scripting or coding.
Any help is appreciated, thanks a lot!
Greetz!
Your data show a couple of difficulties which should be handled in a sequence, which means you should not try to read the entire file with one command:
Read plain lines and find start and stop row
Depending on the specification of the files you read in, my suggestion is to first find the first row you actually want to read by some indicator. This can be a line number which is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line containing only whitespace after the start line:
lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) -2
lines <- lines[start:end]
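The start/stop search above can be tried on a small in-memory example. This is only a sketch: the file content below is invented, and the pattern "^\\s*$" is used instead of "^\\s+$" so that completely empty lines also count as the end marker, which is an assumption about the file layout.

```r
# Invented file content standing in for readLines(Datei).
lines <- c(
  "Some report header",
  "TEAM STATS",
  "",
  "#  PLAYER  POS GP",
  "93 Doe, John F 8",
  "91 Roe, Jane F 6",
  "   ",
  "#.AC Coach, Some"
)

start <- which(lines == "TEAM STATS") + 2           # header row of the table
blank <- grep("^\\s*$", lines[start:length(lines)]) # blank rows after start
end   <- start + min(blank) - 2                     # last row before the blank

block <- lines[start:end]
block  # header line plus the two player lines
```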
Read the data to data.frame
In your case you run into a couple of complications:
Your header line starts with a #, which is recognized as a comment character by default, so the line is ignored. But even if you switch this behavior off (comment.char = ""), # is not a valid column name.
If we tell read.table to split the columns on whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in its cells. So the best approach for now is to ignore the header line and let read.table skip it with its default behavior (comment.char = "#"). We also let the PLAYER column be split in two and fix this later.
You won't be able to use the first column as row.names, since its values are not unique.
The rows have unequal lengths, since the POS column is not filled everywhere.
# lines was already cut down to start:end above, so pass it as is
tab <- read.table( text = lines, fill = TRUE, stringsAsFactors=FALSE )
# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]
Fix the header
Just split the header line (now the first element of lines) at whitespace and replace the first entry (#) with a valid column name:
colns <- strsplit( lines[1], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns
Fix cases where "POS" was empty
This is done by finding the rows whose last cell contains NA and shifting their values one cell to the right:
rowsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ rowsToFix, 4:ncol(tab) ] <- tab[ rowsToFix, 3:(ncol(tab)-1) ]
tab[ rowsToFix, 3 ] <- NA
> str(tab)
'data.frame': 25 obs. of 20 variables:
$ code : chr "93" "91" "61" "88" ...
$ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
$ POS : chr "F" "F" "D" "F" ...
$ GP : chr "8" "6" "7" "8" ...
$ G : int 10 1 4 3 4 2 0 2 1 0 ...
$ A : int 5 11 5 5 3 4 6 3 3 4 ...
$ PTS : int 15 12 9 8 7 6 6 5 4 4 ...
$ PIM : int 12 10 12 6 2 36 37 29 6 0 ...
$ PPG : int 3 0 1 1 1 1 0 0 1 0 ...
$ PPA : int 1 5 2 2 1 2 4 2 1 1 ...
$ SHG : int 0 1 0 1 1 0 0 0 0 0 ...
$ SHA : int 0 0 1 0 1 0 0 1 0 0 ...
$ GWG : int 2 0 1 0 0 0 0 0 0 0 ...
$ FG : int 1 0 1 1 1 0 0 0 0 0 ...
$ OTG : int 0 0 0 0 0 0 0 0 0 0 ...
$ UAG : int 1 0 1 0 0 0 0 0 0 0 ...
$ ENG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOA : num 0 0 0 0 0 0 0 0 0 0 ...
$ SHO% : num 0 0 0 0 0 0 0 0 0 0 ...
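To see the right-shift for rows with an empty POS in isolation, here is a minimal sketch on an invented four-column table. The second row lacks a position, so its GP value has landed in the POS column and its last cell is NA.

```r
# Invented miniature of `tab` after the PLAYER fix; row 2 has no POS value.
tab <- data.frame(
  code   = c("93", "30"),
  PLAYER = c("Doe, John", "Roe, Jane"),
  POS    = c("F", "8"),   # "8" is really the GP value of row 2
  GP     = c("8", NA),
  stringsAsFactors = FALSE
)

rowsToFix <- which(is.na(tab[, ncol(tab)]))  # rows whose last cell is NA
tab[rowsToFix, 4:ncol(tab)] <- tab[rowsToFix, 3:(ncol(tab) - 1)]  # shift right
tab[rowsToFix, 3] <- NA                      # POS is genuinely missing

tab
```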

R How to update a column in data.frame using values from another data.frame

New to R.
I have a data.frame
'data.frame': 2070 obs. of 5 variables:
$ id : int 16625062 16711130 16625064 16668358 16625066 16711227 16711290 16668746 16711502 16625494 ...
$ subj : Factor w/ 3 levels "L","M","S": 1 1 1 1 1 1 1 1 1 1 ...
$ grade: int 4 6 4 5 4 6 6 5 6 4 ...
$ score: int 225 225 0 225 225 375 375 125 225 125 ...
$ level: logi NA NA NA NA NA NA ...
and a list of named numbers called lookup
Named num [1:12] 12 19 20 26 31 32 49 67 72 73 ...
- attr(*, "names")= chr [1:12] "0" "50" "100" "125" ...
I'd like to find a way to update the data frame "level" column by looking up values in the lookup list, matching the data frame "score" column with the name of the number in the lookup list. In other words, the score values in the data frame are used to lookup the number (that will go in the level column) in the lookup list.
So... if anyone understands what I mean... please help.
Thanks Robn
You should be able to do this with (assuming your data frame is called d):
d$level = as.numeric(lookup[as.character(d$score)])
For example:
lookup = list(1, 2, 3, 4)
names(lookup) = c("0", "50", "100", "150")
d = data.frame(score=c(50, 150, 0, 0), level=NA)
d$level = as.numeric(lookup[as.character(d$score)])
print(d)
# score level
# 1 50 2
# 2 150 4
# 3 0 1
# 4 0 1
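Since the question describes lookup as a named numeric vector rather than a list, the same indexing can be sketched directly on such a vector. The first four names and values below are taken from the str() output in the question; the score 999 is a made-up value included to show that a score without a matching name yields NA.

```r
# Named numeric vector; names and values follow the question's str() output.
lookup <- c("0" = 12, "50" = 19, "100" = 20, "125" = 26)

# Small example data frame; 999 has no entry in lookup.
d <- data.frame(score = c(125, 0, 50, 999), level = NA)

# Index by name; unname() drops the names carried over from lookup.
d$level <- unname(lookup[as.character(d$score)])
d
```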
