Microsoft Azure Data Lake Store is a Hadoop file system that’s compatible with Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Data Lake Store is integrated with Azure Data Lake Analytics and Azure HDInsight and will be integrated with Microsoft offerings like Revolution-R Enterprise; industry-standard distributions like Hortonworks, Cloudera, and MapR; and individual Hadoop projects like Spark, Storm, Flume, Sqoop, and Kafka.
Administer ADL
Add Non-Interactive Account
To access the Data Lake Store from a script or through code, you will need a non-interactive account (an Azure Active Directory application, also called a service principal).
Add the account using portal.azure.com:
1. Open the Active Directory Extension and open the Active Directory in use for your subscription. A new window will open to the classic management portal.
2. Go to Applications within the Active Directory management page.
3. Add a new application by clicking Add at the bottom of the Active Directory management page.
4. Define a new web-based application.
5. Fill out the URLs.
6. The new application will be created.
7. Under Configure, expand Access Web APIs in Other Applications, and click configure key.
8. Set USER ASSIGNMENT REQUIRED TO ACCESS APP to Yes.
9. Copy the Client ID (CLIENT ID) to be used later.
10. Under Keys, select a 2-year duration.
11. Click Save, and copy the new secret key to be used later.
Note: you will not be able to retrieve the secret key after you leave the page – you will have to create a new secret key.
12. Go back to the portal, browse to the Data Lake Store, click on the Resource Group…
Next, click Users, and add the newly defined application to the resource group that was set up when the Data Lake Store was created.
Apply ACL to the Data Lake Store
After you create or modify account access to the Data Lake Store, you will have to add the account or group to the folder, which amounts to modifying the folder permissions.
1. Click on the Data Lake Store
2. Click on Data Explorer located at the top of the screen.
3. Within the Data Lake, click Access and modify the folder permissions.
Retrieve the TenantId
You will need to pass the tenant ID with your authentication request. For Web Apps and Web API Apps, you can retrieve the tenant ID by selecting View endpoints at the bottom of the screen; the ID is the GUID that appears in the endpoint URLs. You can also find it from the Data Lake Store:
1. Click on the Data Lake Store
2. Click on All Settings, Users
3. Click on the External user added using the instructions listed above.
4. You’ll find the tenant id under the account.
Reference for Client ID and Tenant ID: https://azure.microsoft.com/en-us/documentation/articles/resource-group-create-service-principal-portal/
Retrieve the SubscriptionId
To retrieve the subscription id, browse to the Data Lake Store, click on the Subscription where the Data Lake Store is housed, and copy the subscription id:
Retrieve the Token Endpoint
The authentication endpoint, also known as the token endpoint (authTokenEndpoint), has the following form:
https://login.microsoftonline.com/<TenantId>/oauth2/token
For example:
https://login.microsoftonline.com/14ea3e2d-a67c-4c86-821b-51e6745fd11d/oauth2/token
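As a minimal sketch of how this endpoint is put together and used, the Python below builds the token endpoint from a tenant ID and prepares (without sending) the form body of an OAuth2 client-credentials request. The function names and placeholder values are illustrative assumptions, not part of any SDK; the form fields follow the Azure AD client-credentials grant for this endpoint.

```python
from urllib.parse import urlencode

AUTHORITY = "https://login.microsoftonline.com"


def token_endpoint(tenant_id: str) -> str:
    """Return the OAuth2 token endpoint for the given tenant."""
    return f"{AUTHORITY}/{tenant_id}/oauth2/token"


def token_request_body(client_id: str, client_secret: str, resource: str) -> str:
    """Form-encode the fields of a client-credentials token request."""
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,
    })


if __name__ == "__main__":
    # Placeholder values; substitute your own tenant ID, client ID, and secret key.
    print(token_endpoint("<TenantId>"))
    print(token_request_body("<ClientId>", "<SecretKey>",
                             "https://management.core.windows.net/"))
```

The body would be sent as an HTTP POST with content type application/x-www-form-urlencoded; the sketch stops short of that so it can be run without credentials.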
How to Access Azure Data Lake Store
Java SDK
See Get started with Azure Data Lake Store using Java
Example code @ TFS, …Shared/Tools/AzureDatalakeStorewithJavaSDK
.NET SDK
See Get started with Azure Data Lake Store using .NET SDK to learn how to use the Azure Data Lake Store .NET SDK to create an Azure Data Lake account and perform basic operations such as creating folders, uploading and downloading data files, and deleting your account. For more information about Data Lake, see Azure Data Lake Store.
Example code @ TFS, …Shared/Tools/AzureDataLakeWithNetSDK
Non-Interactive Authentication
The following snippet shows an AuthenticateApplication method that you can use for a non-interactive login experience.
// Requires the Microsoft.IdentityModel.Clients.ActiveDirectory (ADAL) and
// Microsoft.Rest.ClientRuntime packages. The synchronous AcquireToken call is from ADAL 2.x.
using System;
using System.Security;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

// Authenticate the application with AAD through the application's secret key.
// You need to have an application registered with AAD in order to authenticate.
// For more information and instructions on how to register your application with AAD, see:
// https://azure.microsoft.com/en-us/documentation/articles/resource-group-create-service-principal-portal/
// Note: appRedirectUri is not used in this flow.
public static TokenCredentials AuthenticateApplication(string tenantId, string resource, string appClientId, Uri appRedirectUri, SecureString clientSecret)
{
    var authContext = new AuthenticationContext("https://login.microsoftonline.com/" + tenantId);
    var credential = new ClientCredential(appClientId, clientSecret);
    var tokenAuthResult = authContext.AcquireToken(resource, credential);
    return new TokenCredentials(tokenAuthResult.AccessToken);
}
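The access token returned by this method is what later requests carry in an Authorization header. As a rough Python sketch of that last step (not the .NET SDK's implementation), the helper below takes a token response in the JSON shape returned by the token endpoint and builds the header. The function name and the sample response are made up for illustration.

```python
import json


def authorization_header(token_response_json: str) -> dict:
    """Build an HTTP Authorization header from a token response body."""
    response = json.loads(token_response_json)
    # Fall back to "Bearer" if the response omits token_type.
    token_type = response.get("token_type", "Bearer")
    return {"Authorization": f"{token_type} {response['access_token']}"}


if __name__ == "__main__":
    # Invented sample; a real response carries a long JWT as access_token.
    sample = '{"token_type": "Bearer", "expires_in": "3599", "access_token": "<token>"}'
    print(authorization_header(sample))
```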
TenantId
1. Start Azure PowerShell, then run Login-AzureRmAccount to log in.
2. If you receive an error that the Login-AzureRmAccount cmdlet cannot be found:
- Install Azure PowerShell: http://aka.ms/webpi-azps
- Next, add the modules:
# Install the Azure Resource Manager modules from the PowerShell Gallery
Install-Module AzureRM
# Install the Azure Service Management module from the PowerShell Gallery
Install-Module Azure
- Run Login-AzureRmAccount again. The TenantId is listed in the command output.
Resource
- https://management.core.windows.net/
- Install/configure Azure PowerShell: https://azure.microsoft.com/en-us/documentation/articles/powershell-install-configure/
ClientId
appClientId and appRedirectUri
- Using portal.azure.com, go to App Registrations within the Active Directory management page. Select the application and Configure. Sign-On URL is appRedirectUri, and Client Id is appClientId.
Secret Key
clientSecret
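- Using portal.azure.com, go to Applications within the Active Directory management page. Select the application and Configure. Go to the Keys section, select a duration of 1 or 2 years, then click Save. Copy the key value once it is displayed, because it cannot be retrieved after you leave the page.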
Azure Data Lake Store cmdlets
Below are commands to move files within, into, or out of a Data Lake Store using PowerShell (the placeholder values in quotes must be replaced with your own).
First log in to Azure (the Azure Resource Manager module must be installed):
Login-AzureRmAccount
View data lake subscriptions:
Get-AzureRmSubscription
Select the subscription of the data lake:
Select-AzureRmSubscription -SubscriptionId "SubscriptionId"
(Where a cmdlet below supports it, add -Recurse to the end to make it act recursively.)
Move a file or folder within a Data Lake Store:
Move-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "/Original/Path/File.txt" -Destination "/New/Path/RenamedFile.txt"
Download a file from Data Lake Store into a local directory:
Export-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "/myFiles/TestSource.csv" -Destination "C:\Test.csv"
Upload a local file or directory to a Data Lake Store:
Import-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Path "C:\SrcFile.csv" -Destination "/MyFiles/File.csv"
Delete a file or folder in Data Lake Store:
Remove-AzureRmDataLakeStoreItem -AccountName "data lake store name" -Paths "/File01.txt","/MyFiles/File.csv"
Script to move multiple user-specified folders from one data lake to another (uses the local machine as an intermediary). I recommend running this in Windows PowerShell ISE.
$datalakefolder = 'azure data lake store folder'
$subfolders = 'X','Y','Z'
$datalake1 = 'azure data lake store name of datalake1'
$datalake2 = 'azure data lake store name of datalake2'
$localpath = 'local machine folder'
foreach ($subfolder in $subfolders)
{
    # Download the subfolder from the first store to the local machine
    Export-AzureRmDataLakeStoreItem -AccountName "$datalake1" -Path "/$datalakefolder/$subfolder/" -Destination "$localpath\$datalakefolder\$subfolder\" -PerFileThreadCount 40 -ConcurrentFileCount 20
    # Upload the local copy to the same path in the second store
    Import-AzureRmDataLakeStoreItem -AccountName "$datalake2" -Path "$localpath\$datalakefolder\$subfolder\" -Destination "/$datalakefolder/$subfolder/" -PerFileThreadCount 40 -ConcurrentFileCount 20
}
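Before running a transfer like this, it can help to list which paths each iteration will touch. The Python sketch below only computes the data lake path and the local staging path for each subfolder, following the same naming scheme as the script. It does not call Azure, and the function name and sample values are assumptions made for illustration.

```python
from typing import List, Tuple


def transfer_plan(datalake_folder: str, subfolders: List[str],
                  local_path: str) -> List[Tuple[str, str]]:
    """Return (data lake path, local staging path) pairs, one per subfolder."""
    plan = []
    for subfolder in subfolders:
        lake_path = f"/{datalake_folder}/{subfolder}/"
        # Windows-style local path, matching the PowerShell script above.
        staging = f"{local_path}\\{datalake_folder}\\{subfolder}\\"
        plan.append((lake_path, staging))
    return plan


if __name__ == "__main__":
    # Hypothetical folder names; the same lake path is used in both stores.
    for lake_path, staging in transfer_plan("data", ["X", "Y", "Z"], r"C:\staging"):
        print(f"export {lake_path} -> {staging} -> import {lake_path}")
```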
Get file count and folder size in bytes:
Example:
# login
Login-AzureRmAccount
Get-AzureRmDataLakeStoreChildItem -AccountName "resource-name" -Path "/dir/2017/01/02/" | Measure-Object -Property Length -Sum
Count : 96 ← (file count)
Average :
Sum : 41780702 ← (folder size in bytes)
Maximum :
Minimum :
Property : Length
Get file count and folder size recursively in bytes:
function Get-DataLakeStoreChildItemRecursive ([hashtable] $Params)
{
    $AllFiles = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataLakeStore.Models.DataLakeStoreItem]
    recurseDataLakeStoreChildItem -AllFiles $AllFiles -Params $Params
    $AllFiles
}

function recurseDataLakeStoreChildItem ([System.Collections.ICollection] $AllFiles, [hashtable] $Params)
{
    $ChildItems = Get-AzureRmDataLakeStoreChildItem @Params
    $Path = $Params["Path"]
    foreach ($ChildItem in $ChildItems)
    {
        switch ($ChildItem.Type)
        {
            "FILE" { $AllFiles.Add($ChildItem) }
            "DIRECTORY"
            {
                # Point the shared parameters at the subfolder and descend into it
                $Params.Remove("Path")
                $Params.Add("Path", $Path + "/" + $ChildItem.Name)
                recurseDataLakeStoreChildItem -AllFiles $AllFiles -Params $Params
            }
        }
    }
}

# Pipe the collected files to Measure-Object, as in the previous example, to get the count and total size
Get-DataLakeStoreChildItemRecursive -Params @{ 'Path' = '/dir'; 'Account' = 'resource-name' } | Measure-Object -Property Length -Sum
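The recursion above can be sketched independently of Azure. In the Python below, list_children is a hypothetical stand-in for Get-AzureRmDataLakeStoreChildItem that reads from an in-memory dictionary, and the sample tree is invented. The sketch only shows the traversal and how the file count and total size in bytes fall out of it.

```python
from typing import Callable, Dict, List, Tuple

# Each child is (name, type, length); type is "FILE" or "DIRECTORY".
Item = Tuple[str, str, int]


def list_files_recursive(path: str,
                         list_children: Callable[[str], List[Item]]) -> List[Item]:
    """Collect every file under path, descending into directories."""
    files: List[Item] = []
    for name, item_type, length in list_children(path):
        if item_type == "FILE":
            files.append((name, item_type, length))
        elif item_type == "DIRECTORY":
            files.extend(list_files_recursive(path + "/" + name, list_children))
    return files


def count_and_size(files: List[Item]) -> Tuple[int, int]:
    """Return (file count, total size in bytes)."""
    return len(files), sum(length for _, _, length in files)


# Invented sample tree standing in for a Data Lake Store folder.
SAMPLE_TREE: Dict[str, List[Item]] = {
    "/dir": [("a.csv", "FILE", 100), ("2017", "DIRECTORY", 0)],
    "/dir/2017": [("b.csv", "FILE", 250), ("01", "DIRECTORY", 0)],
    "/dir/2017/01": [("c.csv", "FILE", 50)],
}


def sample_children(path: str) -> List[Item]:
    """Hypothetical listing function backed by SAMPLE_TREE."""
    return SAMPLE_TREE.get(path, [])


if __name__ == "__main__":
    # prints (3, 400)
    print(count_and_size(list_files_recursive("/dir", sample_children)))
```

Unlike the PowerShell version, this sketch passes the path as an argument rather than mutating a shared parameter table, which keeps each recursive call independent.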
Troubleshooting
Authentication Failed
If authentication fails, the configuration might be the issue. To fix it, start Azure PowerShell, then run Login-AzureRmAccount to log in.
If creating a directory or uploading a file fails, it might be a permission issue; check the folder permissions described under Apply ACL to the Data Lake Store.