How to get dataframe value? - r

I have a dataframe df. I took the column headers in another dataframe as I want to run a loop on it.
This is the output of the header dataframe
df.header
# [1] ISBAD MA_KH MA_CN Ngay.xep.hang So.ho.so SEQ Primary_Key Nganh Mo.hinh.xep.hang
# [10] Loai.hinh.DN NAM_TAI_CHINH Liq1 Liq4 Liq5 Per1 Per2 Per3 Per4
# [19] Per5 Lev1 Lev2 Lev3 Lev4 Lev5 Prof1 Prof2 Prof3
# [28] Prof4 Prof5 Gro14 Gro15 Gro16 Gro17 Gro18 Gro19 Gro20
# [37] Struc1 Cov1 Liq6 Prof6 Struc2 Lev6 Lev7 Lev8 Struc3
# [46] Struc4 Struc5 Prof8 Struc6 Liq7 Lev9 Cov.24 Cov2 Liq9
# [55] Cov4 Prof9 Struc7 Cov6 Prof10 Prof13 Prof16 Prof18 Prof19
# [64] Prof22 Per6 Per7 Per8 Prof23 Cov7 Prof24 Lev10 Struc8
# [73] Struc9 Lev11 Struc10 Liq10 cov3 Cov9 Cov10 Liq11 Cov11
# [82] Prof29 Prof30 Per9 Per10 Liq12 Cov12 Cov13 Liq13 Cov15
# [91] Per11 Per12 Per13 Per14 Cov16 Cov17 Cov18 Gro21 Gro22
#[100] Cov19 Cov20 Liq14 Liq15 Liq16 Liq17 Liq18 Struc11 Struc15
#[109] Liq19 Prof34 Prof35 Prof38 Prof40 Prof41 Prof42 Prof43 Struc12
#[118] Struc13 Struc14 Cov21 Cov22 Prof44 Gro23 Liq20 Cov23 Liq21
#[127] Liq22 Lev12 Prof31
Now when I put in the following code in the loop
liststring <- toString(df.header[2])
I got the output
liststring
# [1] "integer(0)"
Instead of MA_KH
I also tried toString(df.header[2],) and got the same result.
Not sure where I'm going wrong here

With the command df.header <- head(df, 0) you don't get a dataframe of column headers, but an empty (zero-row) copy of your original dataframe.
To get just the names of a dataframe, use names(df).
Maybe you can post the purpose of this new dataframe. Iterating over variables of a dataframe can be done using lapply without creating a new dataframe.
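A minimal sketch of the difference, assuming the header object was created with head(df, 0) as the answer guesses. The small df below is a hypothetical stand-in for the asker's data; only the column names are taken from the question.

```r
# Hypothetical stand-in for the asker's data frame
df <- data.frame(ISBAD = c(0L, 1L, 0L),
                 MA_KH = c(101L, 102L, 103L),
                 Liq1  = c(1.2, 0.8, 1.5))

# head(df, 0) keeps every column but no rows
df.header <- head(df, 0)
dim(df.header)

# Indexing it gives an empty one-column data frame, so toString()
# describes the empty column instead of returning its name
liststring <- toString(df.header[2])

# names(df) is a plain character vector, which is what the loop needs
col_names <- names(df)
col_names[2]

# Looping over the names directly
for (nm in col_names) {
  cat(nm, ":", class(df[[nm]]), "\n")
}

# Or apply a function to every column without an explicit loop
col_classes <- lapply(df, class)
```

Because a data frame is a list of columns, lapply(df, f) visits each column in turn, so a separate header object is often unnecessary.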

Related

reorder a 1 dimensional dataframe based on the column order of a larger dataframe (R)

relevant_ods_reordered <- relevant_ods[names(cpm)]
The above seeks to reorder the columns of a dataframe relevant_ods:
Plate1_DMSO_A01 Plate1_DMSO_B01 Plate1_DMSO_C01 Plate1_Lopinavir_D01
OD595 0.431 0.4495 0.4993 0.5785
Plate1_DMSO_E01 Plate1_DMSO_F01 Plate1_DMSO_G01 Plate1_DMSO_H01
OD595 0.5336 0.5133 0.527 0.5413
Plate1_DMSO_C12 Plate1_DMSO_D12 Plate1_Lopinavir_E12 Plate1_DMSO_F12
OD595 0.4137 0.4274 0.5241 0.4264
Plate1_DMSO_G12 Plate1_DMSO_H12
OD595 0.4561 0.4767
to match the order of the columns in a significantly larger dataframe:
[1] "Plate1_DMSO_A01" "Plate1_DMSO_A12"
[3] "Plate1_DMSO_B01" "Plate1_DMSO_B12"
[5] "Plate1_DMSO_C01" "Plate1_DMSO_C12"
[7] "Plate1_DMSO_D12" "Plate1_DMSO_E01"
[9] "Plate1_DMSO_F01" "Plate1_DMSO_F12"
[11] "Plate1_DMSO_G01" "Plate1_DMSO_G12"
[13] "Plate1_DMSO_H01" "Plate1_DMSO_H12"
[15] "Plate1_Lopinavir_D01" "Plate1_Lopinavir_E12"
[17] "Plate1_NS1519_22009_A02" "Plate1_NS1519_22009_A04"
[19] "Plate1_NS1519_22009_A05" "Plate1_NS1519_22009_A06"
[21] "Plate1_NS1519_22009_A07" "Plate1_NS1519_22009_A08"
[23] "Plate1_NS1519_22009_A09" "Plate1_NS1519_22009_A10"
[25] "Plate1_NS1519_22009_A11" "Plate1_NS1519_22009_B02"
[27] "Plate1_NS1519_22009_B03" "Plate1_NS1519_22009_B04"
[29] "Plate1_NS1519_22009_B05" "Plate1_NS1519_22009_B06"
etc.
Clearly, this returns an error:
Error in `[.data.frame`(relevant_ods, names(cpm)) :
undefined columns selected
because cpm has columns that do not exist in relevant_ods.
I have tried
relevant_ods_reordered <- relevant_ods[names(cpm),]
relevant_ods_reordered <- select(relevant_ods, names(cpm))
relevant_ods_reordered <- match(relevant_ods, names(cpm))
With base R, you need to find the names in common. intersect is good for this and preserves the order of its first argument:
relevant_ods[intersect(names(cpm), names(relevant_ods))]
Or with dplyr, use the select helper any_of:
select(relevant_ods, any_of(names(cpm)))
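As a small illustration of the base R approach, here is a sketch with made-up data. The shortened column names and values are hypothetical; only the object names cpm and relevant_ods come from the question.

```r
# Hypothetical wide data frame that defines the desired column order
cpm <- data.frame(A01 = 1, A12 = 2, B01 = 3, C01 = 4, D01 = 5)

# Hypothetical one-row data frame with a subset of those columns, out of order
relevant_ods <- data.frame(C01 = 0.4993, A01 = 0.431, B01 = 0.4495,
                           row.names = "OD595")

# Names present in both, in the order they appear in cpm
common <- intersect(names(cpm), names(relevant_ods))

# Selecting by that vector reorders the columns without error
relevant_ods_reordered <- relevant_ods[common]
names(relevant_ods_reordered)
```

Columns of relevant_ods that are missing from cpm are dropped by this selection; if they should be kept, setdiff(names(relevant_ods), common) gives the names to append.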

Get a list of the all the names of the objects in the datasets R package?

How can I get a list of the exact names of the objects in the datasets package?
I found many of them here:
data_package = data(package="datasets")
datasets <- as.data.frame(data_package[[3]])$Item
datasets
# [1] "AirPassengers" "BJsales" "BJsales.lead (BJsales)" "BOD" "CO2" "ChickWeight"
# [7] "DNase" "EuStockMarkets" "Formaldehyde" "HairEyeColor" "Harman23.cor" "Harman74.cor"
# [13] "Indometh" "InsectSprays" "JohnsonJohnson" "LakeHuron" "LifeCycleSavings" "Loblolly"
# [19] "Nile" "Orange" "OrchardSprays" "PlantGrowth" "Puromycin" "Seatbelts"
# [25] "Theoph" "Titanic" "ToothGrowth" "UCBAdmissions" "UKDriverDeaths" "UKgas"
# [31] "USAccDeaths" "USArrests" "USJudgeRatings" "USPersonalExpenditure" "UScitiesD" "VADeaths"
# [37] "WWWusage" "WorldPhones" "ability.cov" "airmiles" "airquality" "anscombe"
# [43] "attenu" "attitude" "austres" "beaver1 (beavers)" "beaver2 (beavers)" "cars"
# [49] "chickwts" "co2" "crimtab" "discoveries" "esoph" "euro"
# [55] "euro.cross (euro)" "eurodist" "faithful" "fdeaths (UKLungDeaths)" "freeny" "freeny.x (freeny)"
# [61] "freeny.y (freeny)" "infert" "iris" "iris3" "islands" "ldeaths (UKLungDeaths)"
# [67] "lh" "longley" "lynx" "mdeaths (UKLungDeaths)" "morley" "mtcars"
# [73] "nhtemp" "nottem" "npk" "occupationalStatus" "precip" "presidents"
# [79] "pressure" "quakes" "randu" "rivers" "rock" "sleep"
# [85] "stack.loss (stackloss)" "stack.x (stackloss)" "stackloss" "state.abb (state)" "state.area (state)" "state.center (state)"
# [91] "state.division (state)" "state.name (state)" "state.region (state)" "state.x77 (state)" "sunspot.month" "sunspot.year"
# [97] "sunspots" "swiss" "treering" "trees" "uspop" "volcano"
# [103] "warpbreaks" "women"
So something like this would iterate through each one
for(i in 1:length(datasets)) {
print(get(datasets[i]))
cat("\n\n")
}
It works for the first two datasets (AirPassengers and BJsales), but it fails on BJsales.lead (BJsales) since it should be referred to as datasets::BJsales.lead.
I guess I could use string split or similar to discard anything from a space onwards, but I wonder whether there is a neater way of obtaining a list of all the objects in the datasets package?
Notes
In addition to the above, I also tried listing everything in the datasets namespace but it gave a weird result:
ls(getNamespace("datasets"), all.names=TRUE)
# [1] ".__NAMESPACE__." ".__S3MethodsTable__." ".packageName"
There is a note on the ?data help page that states
Where the datasets have a different name from the argument that should be used to retrieve them the index will have an entry like beaver1 (beavers) which tells us that dataset beaver1 can be retrieved by the call data(beavers).
So the actual object name is the thing before the parentheses at the end. Since that value is returned as just a string, that's something you'll need to remove yourself unfortunately. But you can do that with a gsub
datanames <- data(package="datasets")$results[,"Item"]
objnames <- gsub("\\s+\\(.*\\)","", datanames)
for(ds in objnames) {
print(get(ds))
cat("\n\n")
}
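Another possibility, offered here as a sketch and not as part of the original answer: when the datasets package is attached (it normally is in a default R session), ls() on its search-path entry should return the object names directly, with no parenthesised suffixes to strip.

```r
library(datasets)  # usually attached already; harmless to repeat

# Object names as they appear on the search path
objnames <- ls("package:datasets")
head(objnames)

# For comparison, the names derived from the data() index via gsub()
datanames <- data(package = "datasets")$results[, "Item"]
from_index <- gsub("\\s+\\(.*\\)", "", datanames)

# Any index entries not found among the objects (expected to be none)
setdiff(from_index, objnames)

# Each name can then be retrieved with get()
str(get("BJsales.lead"))
```

This differs from ls(getNamespace("datasets")) in the question because the data sets are stored in the package's lazy-load data, which is visible through the attached package environment but not through the namespace.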

Partial Variances at each row of a Matrix

I generated a series of 10,000 random numbers through:
rand_x = rf(10000, 3, 5)
Now I want to produce another series that contains the variances at each point i.e. the column look like this:
[variance(first two numbers)]
[variance(first three numbers)]
[variance(first four numbers)]
[variance(first five numbers)]
.
.
.
.
[variance of 10,000 numbers]
I have written the code as:
c ( var(rand_x[1:1]) : var(rand_x[1:10000])
but I am only getting 157 elements in the column rather than 10,000. Can someone explain what I am doing wrong here?
An option is to loop over the index from 2 to 10000 in sapply, extract the elements of 'rand_x' from position 1 to the looped index, apply the var and return a vector of variance output
out <- sapply(2:10000, function(i) var(rand_x[1:i]))
Your code creates a sequence incrementing by one with the variance of the first two elements as start value and the variance of the whole vector as limit.
var(rand_x[1:2]):var(rand_x[1:n])
# [1] 0.9026262 1.9026262 2.9026262
## compare:
.9026262:3.33433
# [1] 0.9026262 1.9026262 2.9026262
What you want is to loop over the vector indices, using seq_along to get the variances of sequences growing by one. To see what needs to be done, I show you first a (rather slow) for loop.
vars <- numeric() ## initialize numeric vector
for (i in seq_along(rand_x)) {
vars[i] <- var(rand_x[1:i])
}
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
The first element has to be NA because the variance of one element is not defined (division by zero).
However, growing vars one element at a time inside a for loop is inefficient. It is more idiomatic to use a function from the *apply family, e.g. vapply, which allocates the result up front and is usually faster. In vapply we pass numeric(1) (or just 0) as the template for the return value because the result of each iteration is a numeric of length one.
vars <- vapply(seq_along(rand_x), function(i) var(rand_x[1:i]), numeric(1))
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
Data:
n <- 50
set.seed(42)
rand_x <- rf(n, 3, 5)
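Both approaches above recompute the variance from scratch for every prefix, so the work grows quadratically with the length of the vector. As an alternative sketch (not from the original answers), the running variance can be obtained in one pass from cumulative sums, using the identity var = (sum(x^2) - sum(x)^2 / n) / (n - 1). The variable names below are illustrative.

```r
n <- 50
set.seed(42)
rand_x <- rf(n, 3, 5)

n_seen        <- seq_along(rand_x)   # 1, 2, ..., n
running_sum   <- cumsum(rand_x)      # sum of the first i values
running_sumsq <- cumsum(rand_x^2)    # sum of squares of the first i values

# Sample variance of the first i values, for every i at once
vars_fast <- (running_sumsq - running_sum^2 / n_seen) / (n_seen - 1)
vars_fast[1] <- NA  # variance of a single value is undefined

# Cross-check against the vapply version
vars_ref <- vapply(n_seen, function(i) var(rand_x[1:i]), numeric(1))
all.equal(vars_fast, vars_ref)
```

The cumulative-sum formula can lose precision when the values are large relative to their spread, so the vapply version remains the safer reference when accuracy matters more than speed.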

Weird conversion from list to dataframe in R

I have a list that I created from a for loop and it looks like this:
I tried to convert it to a dataframe using the code:
dflist<- as.data.frame(mylist)
But my dataframe looks like this now:
I know I probably created my list wrong but I am thinking this is still salvageable if I just need to convert the numbers to a dataframe correctly.
My end goal is to plot the numbers against their index (1-30) and I thought creating a dataframe first to clean it up and then plot would be helpful.
Any help would be really appreciated. Thank you.
The data shown is a list. We can use unlist and create a data.frame. Based on the image in the OP's post, each list element has a length of 1. With unlist we convert the list to a vector and then wrap it in data.frame.
data.frame(ind= seq_along(lst), Col1= as.numeric(unlist(lst)))
Or another option would be stack after naming the list elements
df1 <- transform(stack(setNames(lst, seq_along(lst))),
values = as.numeric(values))
It gives a two column dataset. From this we can do the plotting
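To make this concrete, here is a sketch with a made-up list of 30 length-one elements. The name lst and the values are hypothetical, since the original list was only shown as an image.

```r
# Hypothetical list of 30 single numbers, as a for loop might produce
set.seed(1)
lst <- as.list(round(runif(30), 3))

# Calling as.data.frame() on the list makes one column per element
wide <- as.data.frame(lst)
dim(wide)   # one row, thirty columns

# unlist() first gives one row per element, plus an index column
long <- data.frame(ind = seq_along(lst), Col1 = as.numeric(unlist(lst)))
dim(long)   # thirty rows, two columns
head(long)
```

With the long form, plot(long$ind, long$Col1) draws the values against their index, which is the end goal stated in the question.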
As for the OP's approach of calling as.data.frame directly on the list: it behaves differently because it dispatches to as.data.frame.list. For example, if we call as.data.frame on a vector, it uses as.data.frame.vector
as.data.frame(1:5)
# 1:5
#1 1
#2 2
#3 3
#4 4
#5 5
But, if we call as.data.frame.list
as.data.frame.list(1:5)
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
we get a data.frame with 'n' columns (based on the length of the vector).
Suppose, we do the same on a list
as.data.frame(as.list(1:5))
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
It uses the as.data.frame.list. To get the complete list of methods of as.data.frame,
methods('as.data.frame')
#[1] as.data.frame.aovproj* as.data.frame.array
# [3] as.data.frame.AsIs as.data.frame.character
# [5] as.data.frame.chron* as.data.frame.complex
# [7] as.data.frame.data.frame as.data.frame.data.table*
# [9] as.data.frame.Date as.data.frame.dates*
#[11] as.data.frame.default as.data.frame.difftime
#[13] as.data.frame.factor as.data.frame.ftable*
#[15] as.data.frame.function* as.data.frame.grouped_df*
#[17] as.data.frame.idf* as.data.frame.integer
#[19] as.data.frame.ITime* as.data.frame.list <-------
#[21] as.data.frame.logical as.data.frame.logLik*
#[23] as.data.frame.matrix as.data.frame.model.matrix
#[25] as.data.frame.noquote as.data.frame.numeric
#[27] as.data.frame.numeric_version as.data.frame.ordered
#[29] as.data.frame.POSIXct as.data.frame.POSIXlt
#[31] as.data.frame.raw as.data.frame.rowwise_df*
#[33] as.data.frame.table as.data.frame.tbl_cube*
#[35] as.data.frame.tbl_df* as.data.frame.tbl_dt*
#[37] as.data.frame.tbl_sql* as.data.frame.times*
#[39] as.data.frame.ts as.data.frame.vector

R: Using for loop on data frame

I have a data frame, deflator.
I want to get a new data frame inflation which can be calculated by:
deflator[i] - deflator[i-4]
----------------------------- * 100
deflator [i - 4]
The data frame deflator has 71 numbers:
> deflator
[1] 0.9628929 0.9596746 0.9747274 0.9832532 0.9851884
[6] 0.9797770 0.9913502 1.0100561 1.0176906 1.0092516
[11] 1.0185932 1.0241043 1.0197975 1.0174097 1.0297328
[16] 1.0297071 1.0313232 1.0244618 1.0347808 1.0480411
[21] 1.0322142 1.0351968 1.0403264 1.0447121 1.0504402
[26] 1.0487097 1.0664664 1.0935239 1.0965951 1.1141851
[31] 1.1033155 1.1234482 1.1333870 1.1188136 1.1336276
[36] 1.1096461 1.1226584 1.1287245 1.1529588 1.1582911
[41] 1.1691221 1.1782178 1.1946234 1.1963453 1.1939922
[46] 1.2118189 1.2227960 1.2140535 1.2228828 1.2314258
[51] 1.2570788 1.2572214 1.2607763 1.2744415 1.2982076
[56] 1.3318808 1.3394186 1.3525902 1.3352815 1.3492751
[61] 1.3593859 1.3368135 1.3642940 1.3538567 1.3658135
[66] 1.3710932 1.3888638 1.4262185 1.4309707 1.4328823
[71] 1.4497201
This is a very tricky question for me.
I tried to do this using a for loop:
> d <- data.frame(deflator)
> for (i in 1:71) {d <-rbind(d,c(delfaotr ))}
I think I might be doing it wrong.
Why use data frames? This is a straightforward vector operation.
inflation = 100 * (deflator[-(1:4)] - deflator[1:67])/deflator[1:67]
I agree with @Fhnuzoag that your example suggests calculations on a numeric vector, not a data frame. Here's an additional way to do your calculations taking advantage of the lag argument in the diff function (with indexes that match those in your question):
lagBy <- 4 # The number of indexes by which to lag
laggedDiff <- diff(deflator, lag = lagBy) # The numerator above
theDenom <- deflator[seq_len(length(deflator) - lagBy)] # The denominator above
inflation <- laggedDiff/theDenom
The first few results (as fractions; multiply by 100 to get percentages as in your formula) are:
head(inflation)
# [1] 0.02315470 0.02094710 0.01705379 0.02725941 0.03299085 0.03008297
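As a check on the indexing, here is a short sketch that uses only the first eight deflator values from the question and computes the percentage change over four periods exactly as the formula is written. The names lag_by, current and previous are illustrative.

```r
# First eight values of deflator, copied from the question
deflator <- c(0.9628929, 0.9596746, 0.9747274, 0.9832532,
              0.9851884, 0.9797770, 0.9913502, 1.0100561)

lag_by <- 4

current  <- tail(deflator, -lag_by)  # deflator[i]     for i = 5, ..., 8
previous <- head(deflator, -lag_by)  # deflator[i - 4] for i = 5, ..., 8

# Percentage change relative to the value four periods earlier
inflation <- 100 * (current - previous) / previous
inflation

# The same numerator via diff() with a lag
inflation_diff <- 100 * diff(deflator, lag = lag_by) / previous
all.equal(inflation, inflation_diff)
```

The result has length(deflator) - lag_by elements, because the first four values have no observation four periods earlier to compare against.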
